
Question: on SM8750/sun, why do v79 HVX FP8 vcvt/vmpy work in hexagon-sim but return all zeros in FastRPC user PD? #4

@happyyzy

This is not a llama.cpp bug report.

I am asking here because this Qualcomm fork appears to be maintained by the people with the most relevant Hexagon / ggml-hexagon expertise, and I would like to confirm whether the behavior below is expected on current production devices.

Research Stage

Background Research

Previous existing literature and research

I am testing raw Hexagon v79 HVX FP8 instructions (vcvt(...f8) and vmpy(...f8, ...f8)) with a minimal standalone probe, outside of llama.cpp model evaluation.

Environment:

  • Device: SM8750 / sun
  • FastRPC capability query reports DSP arch 0x8c79
  • Hexagon SDK: 6.5.0.0
  • Hexagon tools: 19.0.07
  • Android NDK: r25c
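
For reference, the reported value 0x8c79 can be read as v79 by taking its low byte. The sketch below shows that decoding in C. The helper names are hypothetical, and treating the low byte as the architecture version is an assumption based on how the value is commonly masked, not something confirmed in this issue.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: extract the architecture version from the value
 * returned by the FastRPC capability query.
 * Assumption: the low byte carries the version (0x79 for v79). */
static unsigned hexagon_arch_from_capability(uint32_t capability)
{
    return capability & 0xffu;
}

/* Format the version as "vNN" using hex digits, e.g. 0x79 -> "v79". */
static void hexagon_arch_name(uint32_t capability, char *out, size_t out_len)
{
    snprintf(out, out_len, "v%x", hexagon_arch_from_capability(capability));
}
```

With the value reported on this device, `hexagon_arch_name(0x8c79, buf, sizeof buf)` would write `v79` into `buf`.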

What I have already verified:

  1. The generated v79 object code does contain FP8 instructions such as:

    • v?.f8 = vcvt(...)
    • v?:?.hf = vcvt(v?.f8)
    • v?:?.hf = vmpy(v?.f8, v?.f8)
  2. The same source works correctly in hexagon-sim -mv79.
    In the simulator, the FP8 conversion / multiply path produces correct non-zero results.

  3. On device, inside FastRPC user PD, the same probe returns all zeros:

    • hf -> f8 gives all-zero bytes
    • f8 -> hf gives all-zero half values
    • f8 * f8 -> hf gives all-zero half values
  4. This remains true for:

    • intrinsic-generated code
    • inline asm
    • direct raw FP8 byte input followed by vcvt / vmpy
  5. I also tried:

    • qurt_hvx_lock(QURT_HVX_MODE_128B)
    • multiple qfloat codegen modes:
      • strict-ieee
      • ieee
      • lossy
      • legacy
    None of these changed the on-device result.
  6. I also traced existing vendor HTP/QNN paths on the same device.
    What I observed is:

    • the working vendor QNN HTP runtime goes through the CDSP unsigned PD
    • if I force the same path to signed PD, remote_handle_open / remote_handle64_open fails before execution
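
One host-side check that could separate "the DSP wrote zeros" from "the output buffer was never written" is to pre-fill the output buffer with a sentinel byte before the FastRPC call and classify it afterwards. The sketch below is a minimal illustration of that idea. The names `probe_prefill`, `probe_classify`, and `PROBE_SENTINEL` are hypothetical and not part of the original probe, and the check assumes a real result would not consist entirely of the sentinel byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel; any byte unlikely to fill a whole real result works. */
#define PROBE_SENTINEL 0xA5u

enum probe_result {
    PROBE_UNTOUCHED, /* every byte still holds the sentinel */
    PROBE_ALL_ZERO,  /* every byte was overwritten with zero */
    PROBE_HAS_DATA   /* anything else: mixed or non-zero content */
};

/* Call on the output buffer before invoking the DSP-side probe. */
static void probe_prefill(uint8_t *buf, size_t len)
{
    memset(buf, PROBE_SENTINEL, len);
}

/* Call on the output buffer after the DSP-side probe returns. */
static enum probe_result probe_classify(const uint8_t *buf, size_t len)
{
    size_t zeros = 0, sentinels = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0)
            zeros++;
        else if (buf[i] == PROBE_SENTINEL)
            sentinels++;
    }
    if (sentinels == len)
        return PROBE_UNTOUCHED;
    if (zeros == len)
        return PROBE_ALL_ZERO;
    return PROBE_HAS_DATA;
}
```

A `PROBE_ALL_ZERO` outcome would indicate that the result was stored and contained zeros, while `PROBE_UNTOUCHED` would point at the call or buffer path rather than the FP8 instructions.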

So at the moment I cannot tell whether this is:

  • an expected production-device limitation
  • an unsigned-PD limitation
  • a DSP image / firmware limitation
  • or something else specific to the runtime environment

Artifact bundle (logs and minimal sources):

Hypothesis

My current working hypothesis is:

  • code generation is correct
  • simulator behavior is correct
  • the issue is in the actual device execution environment, not in the C/intrinsic source itself

More specifically, it looks like on this SM8750/sun production image, the HVX FP8 datapath is not actually usable from the FastRPC user-PD path that is otherwise available to normal workloads.

However, I do not know whether that is:

  • expected platform behavior
  • a policy restriction
  • or something that should work on supported Qualcomm Hexagon runtimes

Implementation

Minimal standalone repro only.

This is not tied to a llama.cpp model or to llama.cpp FP8 code, and I am not claiming that this repository itself introduced the issue.

The reason for posting here is purely to ask the maintainers with Qualcomm Hexagon expertise:

Is it expected that v79 HVX FP8 vcvt / vmpy work in simulator but return all zeros on-device in the normal FastRPC user-PD path on SM8750/sun?

And if this is expected:

What execution environment is actually required for correct HVX FP8 execution on such a device?
For example:

  • signed PD only?
  • a different vendor runtime path?
  • a specific DSP image capability?
  • not supported at all on production user-accessible paths?

Analysis

Observed facts:

  • simulator: correct non-zero FP8 results
  • on-device FastRPC user PD: all-zero FP8 results
  • forcing signed PD on the vendor QNN path causes open failure, so I could not validate FP8 there

This strongly suggests that the problem is not in the source-level implementation of the FP8 instructions, but in the platform/runtime environment available on the device.

Relevant log output

On-device probe:
- hf -> f8 : all zeros
- f8 -> hf : all zeros
- f8 * f8 -> hf : all zeros

Simulator (`hexagon-sim -mv79`) output:
- same source produces correct non-zero FP8 conversion/multiply results
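
When logging the hf outputs, printing the raw 16-bit patterns next to decoded values can make "all zeros" unambiguous, since negative zero or small subnormals may also print as 0.0 at low precision. The following is a minimal host-side sketch of an IEEE 754 binary16 decoder. The function name `half_bits_to_float` is hypothetical, and it assumes the hf lanes are copied back as plain IEEE half-precision bit patterns.

```c
#include <stdint.h>
#include <string.h>

/* Decode an IEEE 754 binary16 bit pattern into a float without libm. */
static float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1fu;
    uint32_t man  = h & 0x3ffu;
    uint32_t bits;
    float f;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                         /* signed zero */
        } else {
            int shift = 0;                       /* normalize a subnormal */
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3ffu;
            bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7f800000u | (man << 13); /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); /* rebias 15 -> 127 */
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Logging both `h` in hex and `half_bits_to_float(h)` for each lane would show whether the device output is exactly 0x0000 in every position.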

I can provide:

  • full on-device probe log
  • simulator output
  • disassembly snippet showing emitted FP8 instructions
  • signed/unsigned PD tracing logs
