Remove redundant CUDA copies after gated_delta_net. #23940
Currently, GDN writes recurrent state snapshots into its output tail, then the graph immediately copies those snapshots into ssm_states_all. With MTP draft length 3, target decode uses K=4, so that becomes 4 extra ggml_cuda_cpy calls. The change detects that gated_delta_net -> view -> cpy pattern and makes the CUDA GDN kernel write the state snapshot(s) directly into the recurrent cache, skipping the intermediate tail writes and copy kernels when safe.
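The difference between the two data flows can be sketched on the CPU as follows. This is only a toy model under stated assumptions: `ToyGdn`, `run_with_tail`, `run_direct`, `copies_baseline`, and `copies_fused` are hypothetical names for illustration, not the ggml or CUDA API, and the snapshot values are arbitrary.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Toy stand-in for the GDN op. All names are hypothetical, not ggml's API.
// The op produces n_out output floats plus K state snapshots of
// state_size floats each.
struct ToyGdn {
    std::size_t n_out;
    std::size_t state_size;
    std::size_t K;

    // Arbitrary but deterministic value for element i of snapshot k.
    float snapshot_value(std::size_t k, std::size_t i) const {
        return static_cast<float>(k * 1000 + i);
    }

    // Baseline: the snapshots are appended to the tail of the output buffer.
    void run_with_tail(std::vector<float> & out) const {
        out.assign(n_out + K * state_size, 0.0f);
        for (std::size_t k = 0; k < K; ++k) {
            for (std::size_t i = 0; i < state_size; ++i) {
                out[n_out + k * state_size + i] = snapshot_value(k, i);
            }
        }
    }

    // Fused: each snapshot is written straight to its destination pointer.
    void run_direct(std::vector<float> & out, const std::vector<float *> & dst) const {
        out.assign(n_out, 0.0f);
        for (std::size_t k = 0; k < K; ++k) {
            for (std::size_t i = 0; i < state_size; ++i) {
                dst[k][i] = snapshot_value(k, i);
            }
        }
    }
};

// Baseline path: run the op, then copy every snapshot from the output tail
// into its cache slot. Returns the number of copy steps performed.
int copies_baseline(const ToyGdn & op, std::vector<float> & cache,
                    const std::vector<std::size_t> & slots) {
    std::vector<float> out;
    op.run_with_tail(out);
    int n_copies = 0;
    for (std::size_t k = 0; k < op.K; ++k) {
        std::memcpy(cache.data() + slots[k] * op.state_size,
                    out.data() + op.n_out + k * op.state_size,
                    op.state_size * sizeof(float));
        ++n_copies;
    }
    return n_copies;
}

// Fused path: hand the cache slot addresses to the op, so no copy step remains.
int copies_fused(const ToyGdn & op, std::vector<float> & cache,
                 const std::vector<std::size_t> & slots) {
    std::vector<float *> dst;
    for (std::size_t k = 0; k < op.K; ++k) {
        dst.push_back(cache.data() + slots[k] * op.state_size);
    }
    std::vector<float> out;
    op.run_direct(out, dst);
    return 0;
}
```

With `K = 4`, the baseline path performs four copy steps and the fused path none, while both leave the same contents in the cache. The sketch ignores everything that makes the real change conditional ("when safe"), such as whether the tail is read by any other node.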
This seems like it should be solved at the graph level instead of fusion at the CUDA level. Is there a reason not to do that?
Yes, solving this at the graph level makes sense to me as it will help all the backends. But it will require modifying the ggml API for gated_delta_net and updating all the backends. I can create a POC if that looks like a reasonable approach.
Yes, that would be better I think, since the extra copy is not specific to CUDA itself, but I'm not sure if it exists for a reason or can be safely removed.
@ggerganov I would love to know your thoughts on this, as it will require updating the GGML API and changing the op to directly write into the persistent cache.
I don't think we can avoid the copy at the graph level? @am17an Do you have something in mind? In general, the ggml pattern for caching results of partial tensors is:
    result = ggml_some_op(...);
    ggml_cpy(result, ggml_view(cache));
Can we not pass a view in the op itself? |
If at all possible, adding an explicit cache via fusion should be avoided. I definitely think we should not do it for a gain of a few %, but as Aman is the primary maintainer for the fusion code I will leave this decision at his discretion. In my opinion, the correct way to optimize out copies (for any op) is this:
- Identify that the output of the preceding tensor (which in almost all cases is contiguous) is used exactly once.
- Identify that the use is a GGML_OP_CPY to another tensor.
- Substitute the destination pointer and strides in the kernel. Most kernels only support contiguous output, so they may need to be extended.
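The three steps above can be sketched as a small pass over a toy graph. This is a minimal illustration under stated assumptions, not ggml code: `ToyOp`, `ToyNode`, `count_uses`, and `elide_copies` are hypothetical names, and the sketch assumes the producing kernel can already honor an arbitrary destination pointer and stride.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical op kinds for a toy graph (not ggml's op enum).
enum class ToyOp { Compute, Cpy };

// Hypothetical node: reads from node `src` (or -1 for none) and writes
// to `dst` with `dst_stride` elements between consecutive outputs.
struct ToyNode {
    ToyOp       op;
    int         src;
    float *     dst;
    std::size_t dst_stride;
    bool        skipped = false;
};

// Number of non-skipped nodes that read the output of node `idx`.
int count_uses(const std::vector<ToyNode> & g, int idx) {
    int uses = 0;
    for (const ToyNode & n : g) {
        if (!n.skipped && n.src == idx) {
            ++uses;
        }
    }
    return uses;
}

// Redirects producers into their copy destinations where that is the only
// use of the producer's output. Returns the number of copies eliminated.
int elide_copies(std::vector<ToyNode> & g) {
    int elided = 0;
    for (ToyNode & c : g) {
        // Step 2: the consumer must be a copy with a known producer.
        if (c.op != ToyOp::Cpy || c.skipped || c.src < 0) {
            continue;
        }
        ToyNode & p = g[static_cast<std::size_t>(c.src)];
        if (p.op != ToyOp::Compute) {
            continue;
        }
        // Step 1: the producer's output must be used exactly once.
        if (count_uses(g, c.src) != 1) {
            continue;
        }
        // Step 3: substitute the destination pointer and stride in the
        // producer, then mark the copy as no longer needed.
        p.dst        = c.dst;
        p.dst_stride = c.dst_stride;
        c.skipped    = true;
        ++elided;
    }
    return elided;
}
```

A producer with a second consumer keeps its original destination and its copy stays in the graph, which is the conservative behavior the first step asks for. A real pass would also need to confirm that the kernel supports the substituted strides before redirecting it.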
Currently, GDN writes recurrent state snapshots into its output tail, then the graph immediately copies those snapshots into ssm_states_all. With MTP draft length 3, target decode uses K=4, so that becomes 4 extra ggml_cuda_cpy calls.
The change detects that gated_delta_net -> view -> cpy pattern and makes the CUDA GDN kernel write the state snapshots directly into the recurrent cache, skipping the intermediate tail writes and copy kernels when safe.
Performance on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf:
MTP OFF: shows a gain of 3% in the decode phase.
MTP ON: shows an average gain of 4%.
More perf details on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
MTP OFF:
MTP ON:
Command: llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --spec-type draft-mtp
Master:
PR:
Requirements