[codex] Fix HIP FA route fallback for D=256 MTP decode by nycdubliner · Pull Request #68 · Anbeeld/beellama.cpp

nycdubliner · 2026-06-12T21:34:25Z

Summary

This fixes the surviving HIP FlashAttention routing hole on the v0.3.2 two-GPU tensor-split + draft-mtp serving path for Gemma 4.

The original version of this PR fixed one D=256 route hole, but repeated-request serving was still not stable. With prompt reuse enabled, the server could still abort on later turns after slot reuse and checkpoint restore.

The updated change keeps the failing path serving-safe by widening the HIP D=256 fallback and by logging the full FA route inputs at the abort site.

Reproduction

Command under test:

GGML_CUDA_ALLREDUCE=nccl HIP_VISIBLE_DEVICES=0,1 build-rocm-rccl/bin/llama-server \
  -hf "unsloth/gemma-4-31b-it-GGUF:UD-Q4_K_XL" \
  --spec-type draft-mtp \
  --spec-draft-n-max 10 \
  --host 0.0.0.0 \
  --port 8080 \
  -fa on \
  --reasoning on \
  --reasoning-loop-min-tokens 16384 \
  -ngl 999 \
  -fit off \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --ctx-size 32768 \
  -np 1 \
  --threads 8 \
  --mmap \
  --no-mmproj \
  --cache-ram 0 \
  -sm tensor \
  -ts 1,1 \
  -ctk f16 \
  -ctv f16 \
  -b 2048 -ub 512 \
  --metrics \
  --log-timestamps

RCCL is active in the tested build:

librccl.so.1 is linked into libggml-hip.so
GGML_CUDA_NCCL:BOOL=ON
GGML_HIP_RCCL:BOOL=ON

Observed failing pattern before the update:

first request succeeds
second request may succeed
later request after web-UI style chat reuse can abort with:

No CUDA FA kernel selected: K=f16 V=f16 D=256

The important part is not the browser UI specifically, but the request shape it drives:

same slot reused by high LCP similarity
restored context checkpoint
MTP draft decode over reused conversation state

Root Cause

The crash is in the MTP draft decode path:

common_speculative_impl_draft_mtp::draft
llama_decode
ggml_cuda_flash_attn_ext

This was not an RCCL failure and not a generic tensor-split failure. It was a HIP FlashAttention route-selection hole that only shows up on reused-context draft graphs.

For the failing path, the effective FA inputs are still a legal f16/f16 D=256 attention shape, but the planner could still return BEST_FATTN_KERNEL_NONE after prompt reuse / checkpoint restore. That fed the unconditional abort in ggml_cuda_flash_attn_ext().
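The failure mode described above can be sketched in isolation. This is a minimal illustration, not the code in fattn.cu: the `toy_*` names and the `reused` flag are assumptions made for the example, and the sketch throws where the real code aborts, so the behaviour can be observed without killing the process.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration; these are not the ggml definitions.
enum class toy_kernel { none, tile };

struct toy_fa_inputs {
    std::string k_type;   // effective K type, e.g. "f16"
    std::string v_type;   // effective V type, e.g. "f16"
    int64_t     head_dim; // D
    bool        reused;   // assumed flag: graph built after slot reuse / checkpoint restore
};

// A selector with a hole: it has a rule for the fresh-context shape but none
// for the reused-context variant of the same legal f16/f16 D=256 shape.
inline toy_kernel toy_select(const toy_fa_inputs & in) {
    const bool legal = in.k_type == "f16" && in.v_type == "f16" && in.head_dim == 256;
    if (legal && !in.reused) {
        return toy_kernel::tile;
    }
    return toy_kernel::none; // falls through even though the shape is legal
}

// The caller treats "none" as fatal and reports only K, V, and D.
inline toy_kernel toy_dispatch(const toy_fa_inputs & in) {
    const toy_kernel k = toy_select(in);
    if (k == toy_kernel::none) {
        throw std::runtime_error("No CUDA FA kernel selected: K=" + in.k_type +
                                 " V=" + in.v_type + " D=" + std::to_string(in.head_dim));
    }
    return k;
}
```

In this sketch the first request (fresh context) dispatches normally, while the same shape with `reused` set reaches the fatal branch. That mirrors the "first request succeeds, later request aborts" pattern.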

Change

Broaden the HIP tile fallback in ggml_cuda_fattn_make_route_plan().

Previously the fallback was too narrow. It now covers HIP f16/f16 D=256 reused-context shapes as long as the K sequence length is stride-compatible, instead of rejecting them because of an over-strict mask gate.
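As a rough sketch of the difference between the narrow and the widened gate, assuming a simplified shape struct: the field names, the `mask_padded` condition, and the stride value of 256 are all placeholders for illustration and are not taken from the diff.

```cpp
#include <cstdint>

// Hypothetical, simplified inputs; field names are illustrative only.
struct sketch_shape {
    bool    is_hip;      // building for the HIP backend
    bool    kv_f16;      // effective K and V types are both f16
    int64_t head_dim;    // D
    int64_t n_kv;        // K sequence length
    bool    mask_padded; // assumed stand-in for the stricter mask condition
};

// Assumed stride; the actual constant is not stated in this PR description.
constexpr int64_t sketch_kv_stride = 256;

// Narrow gate: additionally insists on the mask condition, so reused-context
// shapes that fail it are rejected even when everything else is fine.
inline bool tile_fallback_narrow(const sketch_shape & s) {
    return s.is_hip && s.kv_f16 && s.head_dim == 256 &&
           s.n_kv % sketch_kv_stride == 0 && s.mask_padded;
}

// Widened gate: stride compatibility of the K sequence length is sufficient.
inline bool tile_fallback_wide(const sketch_shape & s) {
    return s.is_hip && s.kv_f16 && s.head_dim == 256 &&
           s.n_kv % sketch_kv_stride == 0;
}
```

A shape that is HIP, f16/f16, D=256, with a stride-aligned K length but failing the mask condition is rejected by the narrow gate and accepted by the widened one. A K length that is not stride-aligned is still rejected by both.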

Add targeted route diagnostics.

When route debug is enabled, the planner now logs:

Q/K/V shapes
mask shape and strides
raw and effective K/V types
selected kernel
none_reason when selection falls through

If the selector ever still reaches BEST_FATTN_KERNEL_NONE, the abort log now prints the full failing node shape instead of only K, V, and D.
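One way such a self-describing record could be assembled is sketched below. The struct, the helper names, and the log line layout are hypothetical; only the set of fields follows the list above, and mask shape and strides are left out to keep the example short.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical diagnostic record; not the structure used in fattn.cu.
struct diag_route {
    int64_t     q_ne[4], k_ne[4], v_ne[4]; // Q/K/V shapes
    std::string k_raw, k_eff;              // raw and effective K type
    std::string v_raw, v_eff;              // raw and effective V type
    std::string kernel;                    // selected kernel, "none" if nothing matched
    std::string none_reason;               // only meaningful when kernel == "none"
};

// Render a 4-element shape as [a,b,c,d].
inline std::string diag_dims(const int64_t (&ne)[4]) {
    std::ostringstream os;
    os << '[' << ne[0] << ',' << ne[1] << ',' << ne[2] << ',' << ne[3] << ']';
    return os.str();
}

// Build one log line carrying the full route inputs, so a selection failure
// can be diagnosed from the log alone.
inline std::string diag_format(const diag_route & r) {
    std::ostringstream os;
    os << "fattn route: Q=" << diag_dims(r.q_ne)
       << " K=" << diag_dims(r.k_ne)
       << " V=" << diag_dims(r.v_ne)
       << " K_type=" << r.k_raw << "->" << r.k_eff
       << " V_type=" << r.v_raw << "->" << r.v_eff
       << " kernel=" << r.kernel;
    if (r.kernel == "none") {
        os << " none_reason=" << r.none_reason;
    }
    return os.str();
}
```

Keeping the reason as a field of the record, rather than a separate log call, means the same line can be emitted both under route debug and at the abort site.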

Validation

Built the HIP backend with:

cmake --build build-rocm-rccl --config RelWithDebInfo --target ggml-hip -j 8

Validated with a persistent single-slot chat conversation against the same server config, including:

LCP-based slot reuse
restored context checkpoints
repeated short turns after a long code-generation turn
repeated long + short mixed turns in one server process

Observed post-fix behavior on the test host:

restored checkpoint at pos_min = 271, pos_max = 1806, size = 800.013 MiB
repeated high-LCP reuse (sim_best up to 0.995)
no FA abort through 8 sequential same-thread requests
the "graphs reused" count kept increasing (324 -> 1001 in the captured run)

Representative timings from the reused-context run:

initial long prompt: 175.87 tok/s prompt, 77.91 tok/s eval, acceptance 0.56006
later reused short turns: prompt around 81-85 tok/s, eval around 40-46 tok/s
later reused long turn: 517.23 tok/s prompt, 41.90 tok/s eval after restored checkpoint

Scope

This PR changes only ggml/src/ggml-cuda/fattn.cu.

It does not change:

tensor split policy
MTP scheduling
RCCL setup
server checkpoint logic
sampler placement logic

There is a separate local sampler fallback fix outside this PR; it is intentionally not included here.

Impact

This makes the HIP two-GPU tensor-split + draft-mtp serving path more robust under real repeated-request chat reuse. It routes the remaining f16/f16 D=256 hole to a safe tile fallback and makes any future selector failure self-describing.

nycdubliner added 2 commits June 12, 2026 22:33

Fix HIP FA route fallback for D=256 MTP decode

0315aef

Broaden HIP FA D=256 fallback for reused contexts

c0a003f

nycdubliner wants to merge 2 commits into Anbeeld:v0.3.2 from nycdubliner:codex/hip-fattn-tile-fallback-v032
