[codex] Fix HIP FA route fallback for D=256 MTP decode#68
Draft
nycdubliner wants to merge 2 commits into
Draft
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This fixes the surviving HIP FlashAttention routing hole on the
v0.3.2two-GPU tensor-split +draft-mtpserving path for Gemma 4.The original version of this PR fixed one
D=256route hole, but repeated-request serving was still not stable. With prompt reuse enabled, the server could still abort on later turns after slot reuse and checkpoint restore.The updated change keeps the failing path serving-safe by widening the HIP
D=256fallback and by logging the full FA route inputs at the abort site.Reproduction
Command under test:
GGML_CUDA_ALLREDUCE=nccl HIP_VISIBLE_DEVICES=0,1 build-rocm-rccl/bin/llama-server \ -hf "unsloth/gemma-4-31b-it-GGUF:UD-Q4_K_XL" \ --spec-type draft-mtp \ --spec-draft-n-max 10 \ --host 0.0.0.0 \ --port 8080 \ -fa on \ --reasoning on \ --reasoning-loop-min-tokens 16384 \ -ngl 999 \ -fit off \ --temp 1.0 --top-p 0.95 --top-k 64 \ --ctx-size 32768 \ -np 1 \ --threads 8 \ --mmap \ --no-mmproj \ --cache-ram 0 \ -sm tensor \ -ts 1,1 \ -ctk f16 \ -ctv f16 \ -b 2048 -ub 512 \ --metrics \ --log-timestampsRCCL is active in the tested build:
librccl.so.1is linked intolibggml-hip.soGGML_CUDA_NCCL:BOOL=ONGGML_HIP_RCCL:BOOL=ONObserved failing pattern before the update:
The important part is not browser UI specifically, but the request shape it drives:
Root Cause
The crash is in the MTP draft decode path:
common_speculative_impl_draft_mtp::draftllama_decodeggml_cuda_flash_attn_extThis was not an RCCL failure and not a generic tensor-split failure. It was a HIP FlashAttention route-selection hole that only shows up on reused-context draft graphs.
For the failing path, the effective FA inputs are still a legal
f16/f16 D=256attention shape, but the planner could still returnBEST_FATTN_KERNEL_NONEafter prompt reuse / checkpoint restore. That fed the unconditional abort inggml_cuda_flash_attn_ext().Change
ggml_cuda_fattn_make_route_plan().Previously the fallback was too narrow. It now covers HIP
f16/f16 D=256reused-context shapes as long as the K sequence length is stride-compatible, instead of rejecting them because of an over-strict mask gate.When route debug is enabled, the planner now logs:
none_reasonwhen selection falls throughIf the selector ever still reaches
BEST_FATTN_KERNEL_NONE, the abort log now prints the full failing node shape instead of onlyK,V, andD.Validation
Built the HIP backend with:
Validated with a persistent single-slot chat conversation against the same server config, including:
Observed post-fix behavior on the test host:
pos_min = 271, pos_max = 1806, size = 800.013 MiBsim_bestup to0.995)324 -> 1001in the captured run)Representative timings from the reused-context run:
175.87 tok/sprompt,77.91 tok/seval, acceptance0.5600681-85 tok/s, eval around40-46 tok/s517.23 tok/sprompt,41.90 tok/seval after restored checkpointScope
This PR changes only
ggml/src/ggml-cuda/fattn.cu.It does not change:
There is a separate local sampler fallback fix outside this PR; it is intentionally not included here.
Impact
This makes the HIP two-GPU tensor-split +
draft-mtpserving path more robust under real repeated-request chat reuse by using a safe tile fallback for the remainingf16/f16 D=256route hole and by making any future selector failure self-describing.