[codex] Fix HIP FA route fallback for D=256 MTP decode #68

Draft: nycdubliner wants to merge 2 commits into Anbeeld:v0.3.2 from nycdubliner:codex/hip-fattn-tile-fallback-v032

nycdubliner commented Jun 12, 2026

Summary

This fixes the surviving HIP FlashAttention routing hole on the v0.3.2 two-GPU tensor-split + draft-mtp serving path for Gemma 4.

The original version of this PR fixed one D=256 route hole, but repeated-request serving was still not stable. With prompt reuse enabled, the server could still abort on later turns after slot reuse and checkpoint restore.

The updated change keeps the failing path serving-safe by widening the HIP D=256 fallback and by logging the full FA route inputs at the abort site.

Reproduction

Command under test:

GGML_CUDA_ALLREDUCE=nccl HIP_VISIBLE_DEVICES=0,1 build-rocm-rccl/bin/llama-server \
  -hf "unsloth/gemma-4-31b-it-GGUF:UD-Q4_K_XL" \
  --spec-type draft-mtp \
  --spec-draft-n-max 10 \
  --host 0.0.0.0 \
  --port 8080 \
  -fa on \
  --reasoning on \
  --reasoning-loop-min-tokens 16384 \
  -ngl 999 \
  -fit off \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --ctx-size 32768 \
  -np 1 \
  --threads 8 \
  --mmap \
  --no-mmproj \
  --cache-ram 0 \
  -sm tensor \
  -ts 1,1 \
  -ctk f16 \
  -ctv f16 \
  -b 2048 -ub 512 \
  --metrics \
  --log-timestamps

RCCL is active in the tested build:

  • librccl.so.1 is linked into libggml-hip.so
  • GGML_CUDA_NCCL:BOOL=ON
  • GGML_HIP_RCCL:BOOL=ON

Observed failing pattern before the update:

  • first request succeeds
  • second request may succeed
  • later request after web-UI style chat reuse can abort with:
No CUDA FA kernel selected: K=f16 V=f16 D=256

The trigger is not the browser UI itself but the request shape it drives:

  • same slot reused by high LCP similarity
  • restored context checkpoint
  • MTP draft decode over reused conversation state
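To make "high LCP similarity" concrete, here is a minimal sketch of one way such a score can be computed. It assumes the score is the longest-common-prefix length divided by the new prompt's token count; the server's actual formula is not shown in this PR, and the function names `common_prefix_len` and `lcp_similarity` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Length of the longest common prefix of two token sequences.
std::size_t common_prefix_len(const std::vector<int> & a, const std::vector<int> & b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        ++n;
    }
    return n;
}

// Hypothetical similarity score (an assumption, not the server's code):
// the fraction of the new prompt that is already present in the slot's cache.
double lcp_similarity(const std::vector<int> & cached, const std::vector<int> & prompt) {
    if (prompt.empty()) {
        return 0.0;
    }
    return static_cast<double>(common_prefix_len(cached, prompt)) /
           static_cast<double>(prompt.size());
}
```

Under this assumption, a follow-up chat turn that repeats the whole earlier conversation and appends a short message scores close to 1, which is the kind of request that reuses the slot and restores a checkpoint.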

Root Cause

The crash is in the MTP draft decode path:

  • common_speculative_impl_draft_mtp::draft
  • llama_decode
  • ggml_cuda_flash_attn_ext

This was not an RCCL failure and not a generic tensor-split failure. It was a HIP FlashAttention route-selection hole that only shows up on reused-context draft graphs.

For the failing path, the effective FA inputs are still a legal f16/f16 D=256 attention shape, but the planner could still return BEST_FATTN_KERNEL_NONE after prompt reuse / checkpoint restore. That fed the unconditional abort in ggml_cuda_flash_attn_ext().
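The failure mode described above can be sketched as a selector that falls through to a "none" result, followed by a dispatch step that treats that result as fatal. This is a simplified illustration, not the code in fattn.cu: the types `FattnKernel` and `FattnShape`, the `route_found` flag, and both functions are hypothetical, and an exception stands in for the real abort.

```cpp
#include <stdexcept>
#include <string>

// Simplified stand-ins; only the "none" outcome is named in the PR text.
enum class FattnKernel { None, Tile };

struct FattnShape {
    std::string k_type;
    std::string v_type;
    int         head_dim;     // D
    bool        route_found;  // hypothetical: whether any selector rule matched
};

// Hypothetical selector: returns None when no rule matches the shape.
FattnKernel select_kernel(const FattnShape & s) {
    return s.route_found ? FattnKernel::Tile : FattnKernel::None;
}

// Stand-in for the dispatch step: throws where the real code aborts.
std::string dispatch(const FattnShape & s) {
    if (select_kernel(s) == FattnKernel::None) {
        throw std::runtime_error("No CUDA FA kernel selected: K=" + s.k_type +
                                 " V=" + s.v_type + " D=" + std::to_string(s.head_dim));
    }
    return "tile";
}
```

The point of the sketch is that a legal shape with no matching rule is indistinguishable, at the dispatch step, from an unsupported shape, so the only safe fix is in the selector.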

Change

  1. Broaden the HIP tile fallback in ggml_cuda_fattn_make_route_plan().

Previously the fallback was too narrow. It now covers HIP f16/f16 D=256 reused-context shapes as long as the K sequence length is stride-compatible, instead of rejecting them because of an over-strict mask gate.

  2. Add targeted route diagnostics.

When route debug is enabled, the planner now logs:

  • Q/K/V shapes
  • mask shape and strides
  • raw and effective K/V types
  • selected kernel
  • none_reason when selection falls through

If the selector still reaches BEST_FATTN_KERNEL_NONE, the abort log now prints the full failing node shape instead of only K, V, and D.
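The two changes can be pictured together as a fallback predicate that also records why it declined. This is a sketch under stated assumptions, not the patch itself: `RouteInputs`, `RoutePlan`, `plan_hip_tile_fallback`, and the stride value of 256 are hypothetical, and the reason strings are illustrative rather than the actual log text.

```cpp
#include <string>

// Hypothetical summary of the inputs the planner looks at.
struct RouteInputs {
    bool is_hip;
    bool k_is_f16;
    bool v_is_f16;
    int  head_dim;  // D
    long k_len;     // K sequence length
    bool has_mask;  // kept for diagnostics only; not used as a gate below
};

// Hypothetical result: either the tile fallback, or a reason for declining.
struct RoutePlan {
    bool        use_tile;
    std::string none_reason;
};

// Assumed stride; the real constant is not stated in the PR text.
constexpr long kAssumedKStride = 256;

RoutePlan plan_hip_tile_fallback(const RouteInputs & in) {
    if (!in.is_hip) {
        return {false, "not a HIP build"};
    }
    if (!in.k_is_f16 || !in.v_is_f16) {
        return {false, "K/V types are not f16/f16"};
    }
    if (in.head_dim != 256) {
        return {false, "head dimension is not 256"};
    }
    if (in.k_len % kAssumedKStride != 0) {
        return {false, "K length not stride-compatible"};
    }
    // The mask is deliberately not consulted, mirroring the removed mask gate.
    return {true, ""};
}
```

Returning the reason alongside the decision is what lets a later abort message explain which check rejected the shape.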

Validation

Built the HIP backend with:

cmake --build build-rocm-rccl --config RelWithDebInfo --target ggml-hip -j 8

Validated with a persistent single-slot chat conversation against the same server config, including:

  • LCP-based slot reuse
  • restored context checkpoints
  • repeated short turns after a long code-generation turn
  • repeated long + short mixed turns in one server process

Observed post-fix behavior on the test host:

  • restored checkpoint at pos_min = 271, pos_max = 1806, size = 800.013 MiB
  • repeated high-LCP reuse (sim_best up to 0.995)
  • no FA abort through 8 sequential same-thread requests
  • the "graphs reused" count kept increasing (324 -> 1001 in the captured run)

Representative timings from the reused-context run:

  • initial long prompt: 175.87 tok/s prompt, 77.91 tok/s eval, acceptance 0.56006
  • later reused short turns: prompt around 81-85 tok/s, eval around 40-46 tok/s
  • later reused long turn: 517.23 tok/s prompt, 41.90 tok/s eval after restored checkpoint

Scope

This PR changes only ggml/src/ggml-cuda/fattn.cu.

It does not change:

  • tensor split policy
  • MTP scheduling
  • RCCL setup
  • server checkpoint logic
  • sampler placement logic

There is a separate local sampler fallback fix outside this PR; it is intentionally not included here.

Impact

This makes the HIP two-GPU tensor-split + draft-mtp serving path more robust under real repeated-request chat reuse. It routes the remaining f16/f16 D=256 hole to a safe tile fallback and makes any future selector failure self-describing.
