ggml : add GGML_OP_GATHER for DeepSeek Sparse Attention (DSA) #21149 #21458
LilySu wants to merge 4 commits into ggml-org:master
Conversation
- …gather_f32 in ggml-cpu/ops.cpp
- …ard dispatch switch statement in ggml-cpu.c
Hi @LilySu, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
I don't know whether it would be a good idea, but could this operation be accomplished with set_rows and a row length of 1?
I made the modifications manually, following the existing patterns from the Scatter and Gated Delta Net implementations that preceded this one. AI assisted by answering my questions about the repo, proofreading my implementation, and making suggestions.
set_rows writes values to indexed positions, whereas this operation needs to read them. get_rows is the equivalent read operation, but it currently applies the same indices across all batches. Calling get_rows on a per-batch basis would mean one operation per batch in the compute graph, so the graph topology would change with batch size, which reduces graph-reuse opportunities. In addition, get_rows requires the indices tensor to have size 1 in its outermost dimension; modifying get_rows would change the output tensor shape, and changing the runtime assertion GGML_ASSERT(b->ne[3] == 1) would require an update to every backend. Looking at other frameworks, PyTorch has torch.gather for per-position indices, distinct from torch.index_select, and TensorRT-LLM likewise distinguishes gather from index_select.
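The distinction above can be sketched in plain C. This is a minimal reference sketch of the two indexing semantics, not the ggml implementation; the function names gather_f32 and get_rows_like_f32 are illustrative, and the layout assumes the innermost (ne[0]) dimension is contiguous per batch, as in ggml.

```c
#include <assert.h>

/* Per-batch indices (gather semantics):
   dst[i, b] = src[idx[i, b], b] -- each batch b has its own index column. */
static void gather_f32(const float *src, const int *idx, float *dst,
                       int n_idx, int n_src, int n_batch) {
    for (int b = 0; b < n_batch; ++b) {
        for (int i = 0; i < n_idx; ++i) {
            int j = idx[b * n_idx + i];
            assert(j >= 0 && j < n_src);          /* bounds check */
            dst[b * n_idx + i] = src[b * n_src + j];
        }
    }
}

/* Shared indices (get_rows-style semantics):
   dst[i, b] = src[idx[i], b] -- the same index list is applied to every batch. */
static void get_rows_like_f32(const float *src, const int *idx, float *dst,
                              int n_idx, int n_src, int n_batch) {
    for (int b = 0; b < n_batch; ++b) {
        for (int i = 0; i < n_idx; ++i) {
            int j = idx[i];
            assert(j >= 0 && j < n_src);
            dst[b * n_idx + i] = src[b * n_src + j];
        }
    }
}
```

With a shared index list, every batch selects the same rows; gather lets batch 0 and batch 1 pick different positions, which is what DSA's per-batch top-k requires.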
Overview
This PR implements a CPU-only (f32/f16) GGML_OP_GATHER, which enables extracting mask[idx, b], where idx = top_k[i, b], in a single op.
Why DSA needs GGML_OP_GATHER:
scatter cannot read values from indexed positions; it can only write constants.
GGML_OP_FILL would not work either: ggml_fill writes into positions, which is the wrong direction (DSA needs to read).
What It Does:
ggml_gather(mask, top_k) extracts the mask entries at top_k positions, producing a smaller [n_top_k, n_batch, 1, n_stream] tensor instead of operating on the full [n_kv, n_batch, 1, n_stream] mask.
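The shape behavior described above can be sketched over all four ggml dimensions. This is a hypothetical reference loop, not the PR's kernel; the name gather_4d_f32 is illustrative, and contiguous layout with ne[0] innermost is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* dst[i, b, 0, s] = mask[top_k[i, b, 0, s], b, 0, s]
   mask:  [n_kv,    n_batch, 1, n_stream]
   top_k: [n_top_k, n_batch, 1, n_stream]
   dst:   [n_top_k, n_batch, 1, n_stream]  -- same shape as top_k */
static void gather_4d_f32(const float *mask, const int32_t *top_k, float *dst,
                          int64_t n_top_k, int64_t n_kv,
                          int64_t n_batch, int64_t n_stream) {
    for (int64_t s = 0; s < n_stream; ++s) {
        for (int64_t b = 0; b < n_batch; ++b) {
            for (int64_t i = 0; i < n_top_k; ++i) {
                int64_t j = top_k[(s * n_batch + b) * n_top_k + i];
                assert(j >= 0 && j < n_kv);  /* index must land in the mask row */
                dst [(s * n_batch + b) * n_top_k + i] =
                mask[(s * n_batch + b) * n_kv    + j];
            }
        }
    }
}
```

Note the output inherits its shape from the index tensor, so with n_top_k much smaller than n_kv (e.g. 2048 vs ~100k), downstream attention touches far less data.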
CPU-only for now, following the contribution guideline of CPU first, with backend support in follow-ups.
Verification:
Additional information
This op was identified as an optimization path in PR #21149 (DeepSeek V3.2 DSA support). The current implementation uses simple masking — functional but inefficient, as it still computes attention over all ~100k positions.
GGML_OP_GATHER would enable the extraction approach ggerganov suggested: selecting only the top-k K/V rows before attention, which fairydreaming noted requires per-batch indexing that ggml_get_rows does not support. Beyond DeepSeek V3.2, this op will benefit other sparse attention architectures such as GLM5; as models increasingly adopt top-k selection for long-context efficiency, ggml_gather becomes foundational infrastructure.
Requirements