
ggml : add GGML_OP_GATHER for DeepSeek Sparse Attention (DSA) #21149 #21458

Open

LilySu wants to merge 4 commits into ggml-org:master from LilySu:gg/gather-op

Conversation


@LilySu LilySu commented Apr 5, 2026

Overview

Implementation of a CPU-only (f32/f16) GGML_OP_GATHER, which enables extracting mask[idx, b], where idx = top_k[i, b], in a single op.

Why DSA needs GGML_OP_GATHER:

  1. DSA needs per-batch indices: each batch has a different set of top-k indices.
  2. Currently, ggml_get_rows applies the same indices to all rows, but DSA needs different indices per batch.
  3. The scatter operation currently accepts only a single scalar value (float c), not tensor values. Its signature: ggml_scatter(ctx, a, ids, float c)

scatter cannot read values from indexed positions; it can only write constants.

Why GGML_OP_FILL wouldn't work: ggml_fill writes into positions, which is the wrong direction (DSA needs to read).

What It Does:

ggml_gather(src, indices, axis=0)

// dst[i, b] = src[indices[i, b], b]

src:     [n_kv,    n_batch, 1, n_stream]  →  e.g. [100000, 512, 1, 1]
indices: [n_top_k, n_batch, 1, n_stream]  →  e.g. [2048,   512, 1, 1]
dst:     [n_top_k, n_batch, 1, n_stream]  →  e.g. [2048,   512, 1, 1]

ggml_gather(mask, top_k) extracts the mask entries at top_k positions, producing a smaller [n_top_k, n_batch, 1, n_stream] tensor instead of operating on the full [n_kv, n_batch, 1, n_stream] mask.

CPU-only for now, following the contribution guideline of CPU first, with backend support in follow-ups.

Verification:

  • Build compiles cleanly: cmake --build succeeded
  • All 4 gather tests pass on CPU backend
  • No trailing whitespace issues

Additional information

This op was identified as an optimization path in PR #21149 (DeepSeek V3.2 DSA support). The current implementation uses simple masking — functional but inefficient, as it still computes attention over all ~100k positions.

GGML_OP_GATHER would enable the extraction approach ggerganov suggested: selecting only top-k K/V rows before attention, which fairydreaming noted requires per-batch indexing that ggml_get_rows doesn't support. Beyond DeepSeek V3.2, this op will benefit other sparse attention architectures like GLM5 — as models increasingly adopt top-k selection for long-context efficiency, ggml_gather becomes foundational infrastructure.


@LilySu LilySu requested a review from ggerganov as a code owner April 5, 2026 03:55
@github-actions github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 5, 2026

ggml-gh-bot bot commented Apr 5, 2026

Hi @LilySu, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@jeffbolznv
Contributor

I don't know whether it would be a good idea, but could this operation be accomplished with set_rows and a row length of 1?

Author

LilySu commented Apr 5, 2026

I made the modifications manually, following the existing patterns in the Scatter and Gated Delta Net implementations that preceded this. AI assisted by answering my questions about the repo, proofreading my implementation, and making suggestions.

Author

LilySu commented Apr 5, 2026

I don't know whether it would be a good idea, but could this operation be accomplished with set_rows and a row length of 1?

set_rows writes values to indexed positions; in this operation we need to read.

get_rows is the equivalent read operation, but it currently applies the same indices across all batches.

Calling get_rows on a per-batch basis would mean one operation per batch in the compute graph. The graph topology would then change with batch size, which reduces graph-reuse opportunities.

In get_rows, the indices tensor's ne must have size 1 in the outermost dimension. Modifying get_rows would change the output tensor shape. GGML tensor dimensions for reference:

ne[0] = innermost, embedding dim
ne[1] = second dim, sequence length / n_kv
ne[2] = third dim,  heads
ne[3] = outermost, batch / stream

Changing the runtime assertion GGML_ASSERT(b->ne[3] == 1) would require an update to every backend.
Also, modifying get_rows to handle both same-indices-across-batches and per-batch indices could make testing each of the two cases more complicated.

Looking at other frameworks: PyTorch has torch.gather for per-position indices, distinct from torch.index_select; TensorRT-LLM likewise distinguishes gather from index_select.

