eagle3: add qwen3.5 4B 9B 35B-A3B support #21437
36330 wants to merge 18 commits into ggml-org:master from
Conversation
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from the target model at specific layers
- Uses a feature fusion layer to compress target features
- Generates draft tokens with a single-layer decoder
- Maps the draft vocabulary to the target vocabulary via the d2t tensor

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for the speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
Hi @36330, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
how does this compare to MTP?
"extends the EAGLE3 implementation" - what does it mean? Has EAGLE3 ever been implemented in llama.cpp? |
EAGLE3 has already been implemented in llama.cpp. However, because of the particulars of the qwen3.5 linear attention architecture, some adaptations were needed.
You're pushing to the wrong branch; I suppose this is not ready yet.
I successfully merged the latest version. |
this looks like a draft |
qwen3.5 is not supported
maybe you wanted to push into ichbinhandsome:eagle3-adapt-new-arch instead of into master?
Summary
This PR extends the EAGLE3 implementation with recurrent verification-state support.
Compared with the earlier EAGLE3 work, this version keeps target-side verification in a single batched decode path and adds the recurrent state handling needed to make that flow work correctly for hybrid / recurrent models.
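The recurrent-state handling mentioned above can be sketched as follows. With a regular KV cache, rejected draft positions can be dropped with a ranged removal (the seq_rm() path); a recurrent model instead carries a single rolling state, so rejection requires restoring a checkpoint taken before the draft batch was decoded. This is an illustrative sketch under those assumptions, not the actual llama.cpp API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a recurrent state with checkpointing: the state is a
// rolling vector, so rolling back rejected draft tokens means restoring a
// copy saved before verification, not removing cache entries.
struct recurrent_state {
    std::vector<float> s;           // current rolling state
    std::vector<float> checkpoint;  // copy saved before verifying drafts

    void save()    { checkpoint = s; }
    void restore() { s = checkpoint; }
};

// Returns the length of the accepted prefix of the draft batch; on the first
// rejection the state is rolled back so the rejected tail leaves no trace.
static size_t verify_drafts(recurrent_state & st, const std::vector<bool> & accepted) {
    st.save();
    size_t n_accept = 0;
    for (bool ok : accepted) {
        if (!ok) {
            st.restore();  // discard state advanced past the rejected token
            break;
        }
        n_accept++;
    }
    return n_accept;
}
```

The key design point is that save/restore replaces the ranged cache removal a transformer-only model would use, which is why the hybrid and recurrent memory files are touched in this PR.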
Main changes:
- speculative-simple
- seq_rm() path

Details
Core files changed:
- examples/speculative-simple/speculative-simple.cpp
- include/llama.h
- src/llama-context.cpp
- src/llama-memory.h
- src/llama-memory-hybrid.h
- src/llama-memory-hybrid.cpp
- src/llama-memory-recurrent.h
- src/llama-memory-recurrent.cpp
- src/models/qwen35.cpp
- src/models/qwen35moe.cpp

Additional related updates:
- convert_hf_to_gguf.py
- examples/speculative/speculative.cpp
- src/models/qwen3vl-moe.cpp

Test examples:
Qwen3.5-9B-BF16.gguf + eagle3-qwen3.5-9b-eagle.gguf:
- draft1 accept = 61.463%
- encoded 26 tokens in 0.390 seconds, speed: 66.656 t/s
- decoded 259 tokens in 15.686 seconds, speed: 16.512 t/s

Without EAGLE3: [ Prompt: 66.2 t/s | Generation: 9.8 t/s ]

Speedup: 1.68x
Qwen3.5-4B-Q4_K_M.gguf + eagle3-qwen35-4b-draft-Q4_K_M.gguf:
- draft2 accept = 53.140%
- encoded 26 tokens in 0.080 seconds, speed: 326.052 t/s
- decoded 257 tokens in 3.047 seconds, speed: 84.339 t/s

Without EAGLE3: [ Prompt: 437.6 t/s | Generation: 64.2 t/s ]

Speedup: 1.31x