eagle3: add qwen3.5 4B 9B 35B-A3B support #21437

Draft
36330 wants to merge 18 commits into ggml-org:master from 36330:pr/eagle3-more

Conversation


@36330 36330 commented Apr 4, 2026

Summary

This PR extends the EAGLE3 implementation with recurrent verification-state support.

Compared with the earlier EAGLE3 work, this version keeps target-side verification in a single batched decode path and adds the recurrent state handling needed to make that flow work correctly for hybrid / recurrent models.

Main changes:

  • add EAGLE3 recurrent round-state APIs
  • preserve single-batch target verification in speculative-simple
  • promote the accepted recurrent depth state directly after verification
  • keep KV cleanup on the existing seq_rm() path
  • add related Qwen3.5 / Qwen3.5-MoE integration
  • include small converter / model wiring updates needed by this flow

Details

Core files changed:

  • examples/speculative-simple/speculative-simple.cpp
  • include/llama.h
  • src/llama-context.cpp
  • src/llama-memory.h
  • src/llama-memory-hybrid.h
  • src/llama-memory-hybrid.cpp
  • src/llama-memory-recurrent.h
  • src/llama-memory-recurrent.cpp
  • src/models/qwen35.cpp
  • src/models/qwen35moe.cpp

Additional related updates:

  • convert_hf_to_gguf.py
  • examples/speculative/speculative.cpp
  • src/models/qwen3vl-moe.cpp

Test examples:
Qwen3.5-9B-BF16.gguf
eagle3-qwen3.5-9b-eagle.gguf

draft1 accept = 61.463%
encoded 26 tokens in 0.390 seconds, speed: 66.656 t/s
decoded 259 tokens in 15.686 seconds, speed: 16.512 t/s
Without EAGLE3:
[ Prompt: 66.2 t/s | Generation: 9.8 t/s ]
speedup: 1.68x

Qwen3.5-4B-Q4_K_M.gguf
eagle3-qwen35-4b-draft-Q4_K_M.gguf

draft2 accept = 53.140%
encoded 26 tokens in 0.080 seconds, speed: 326.052 t/s
decoded 257 tokens in 3.047 seconds, speed: 84.339 t/s
Without EAGLE3:
[ Prompt: 437.6 t/s | Generation: 64.2 t/s ]
speedup: 1.31x

ruixiang63 and others added 18 commits December 14, 2025 18:12
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from target model at specific layers
- Uses feature fusion layer to compress target features
- Generates draft tokens with single-layer decoder
- Maps draft vocabulary to target vocabulary via d2t tensor

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
@36330 36330 requested review from a team, CISC and ggerganov as code owners April 4, 2026 14:35
@github-actions github-actions bot added model Model specific examples python python script changes server labels Apr 4, 2026
@ggml-gh-bot

ggml-gh-bot bot commented Apr 4, 2026

Hi @36330, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@sorasoras

how does this compare to MTP?

@jacekpoplawski
Contributor

"extends the EAGLE3 implementation" - what does it mean? Has EAGLE3 ever been implemented in llama.cpp?

@36330
Author

36330 commented Apr 4, 2026

EAGLE3 has been implemented in llama.cpp. However, the Qwen3.5 linear-attention architecture required some additional adaptations.

@ngxson
Contributor

ngxson commented Apr 4, 2026

You're pushing to the wrong branch; I suppose this is not yet ready.

@ngxson ngxson marked this pull request as draft April 4, 2026 15:03
@36330
Author

36330 commented Apr 4, 2026

I successfully merged the latest version.

@jacekpoplawski
Contributor

EAGLE3 has been implemented in llama.cpp

this looks like a draft
#18039
(please correct me if I am wrong)

@36330
Author

36330 commented Apr 4, 2026

EAGLE3 has been implemented in llama.cpp

this looks like a draft #18039 (please correct me if I am wrong)

qwen3.5 is not supported there

@jacekpoplawski
Contributor

EAGLE3 has been implemented in llama.cpp

this looks like a draft #18039 (please correct me if I am wrong)

qwen3.5 is not supported there

maybe you wanted to push into ichbinhandsome:eagle3-adapt-new-arch instead into master?
