
[ATOM SGL] MTP Spec decode #1361

Open
ZhiweiYan-96 wants to merge 4 commits into main from zhiwei/v4_mtp


ZhiweiYan-96 commented on Jun 26, 2026

Contributor

Motivation

Enable DeepSeek-V4 MTP speculative decoding in the ATOM SGLang plugin path.

Technical Details

  • Export DeepSeek-V4 target hidden states through the SGLang wrapper so the MTP draft model can consume the pre-hc_head hidden states.
  • Add a DeepseekV4ForCausalLMNextN SGLang wrapper that creates the DeepSeek-V4 MTP draft model through ATOM.
  • Share the target model’s embedding and LM head weights with the MTP draft model.
  • Extend the DeepSeek-V4 SGLang KV-cache bridge to support both target layers and MTP draft layers.
  • Route SGLang’s DeepSeek-V4 speculative draft attention setup through the ATOM plugin path.
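As a rough illustration of the weight-sharing item above, the following minimal sketch ties a draft model's embedding and LM head to the target's by reference. The classes `ToyTarget` and `ToyDraft` and the helper `share_embed_and_head` are hypothetical stand-ins, not the PR's wrappers, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTarget:
    # Hypothetical stand-ins for the target model's embedding and LM head weights.
    embed_tokens: list = field(default_factory=lambda: [[0.1, 0.2], [0.3, 0.4]])
    lm_head: list = field(default_factory=lambda: [[1.0, 0.0], [0.0, 1.0]])


@dataclass
class ToyDraft:
    # The draft starts without its own copies of these weights.
    embed_tokens: Optional[list] = None
    lm_head: Optional[list] = None


def share_embed_and_head(target: ToyTarget, draft: ToyDraft) -> None:
    """Point the draft at the target's weights instead of copying them."""
    draft.embed_tokens = target.embed_tokens
    draft.lm_head = target.lm_head


target = ToyTarget()
draft = ToyDraft()
share_embed_and_head(target, draft)

# Because the objects are shared, an update on the target is visible to the draft.
target.embed_tokens[0][0] = 9.0
```

Sharing by reference avoids holding a second copy of the weights; whether the actual wrappers share parameters this way or through another mechanism is not stated here.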

Test Result

DeepSeek-V4-Pro

image

Acceptance rate (MTP=3):

image
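For readers unfamiliar with the metric, here is a minimal sketch of one common way to compute an acceptance rate under greedy verification: the fraction of drafted tokens that match the target's tokens before the first mismatch. The function names and the sample token IDs are made up for illustration, and the exact definition behind the figure above is an assumption, since it is not spelled out here.

```python
def accepted_prefix_len(draft_tokens, target_tokens):
    """Count drafted tokens accepted before the first mismatch with the target."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n


def acceptance_rate(steps):
    """Accepted drafted tokens divided by proposed drafted tokens over all steps."""
    proposed = sum(len(d) for d, _ in steps)
    accepted = sum(accepted_prefix_len(d, t) for d, t in steps)
    return accepted / proposed if proposed else 0.0


# Each step proposes 3 draft tokens (as with MTP=3); the token IDs are invented.
steps = [
    ([5, 7, 9], [5, 7, 9]),  # all three drafted tokens accepted
    ([4, 2, 8], [4, 3, 8]),  # mismatch at the second token: one accepted
    ([1, 1, 1], [2, 1, 1]),  # mismatch at the first token: none accepted
]
rate = acceptance_rate(steps)  # 4 accepted out of 9 proposed
```

Some reports instead quote the mean accepted length per step, so the two conventions should not be compared directly.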

Submission Checklist

ZhiweiYan-96 marked this pull request as ready for review on June 26, 2026, 07:02
Copilot AI review requested due to automatic review settings on June 26, 2026, 07:02

Copilot AI left a comment

Contributor


Pull request overview

Enable DeepSeek-V4 MTP speculative decoding in the ATOM SGLang plugin path by adding an SGLang-compatible NextN draft wrapper and extending the DeepSeek-V4 proxy KV-cache/metadata bridge to support speculative-decoding and CUDA-graph workflows.

Changes:

  • Add DeepseekV4ForCausalLMNextN SGLang wrapper backed by ATOM’s DeepseekV4MTP, including shared embedding/LM-head weight tying for draft runs.
  • Extend DeepSeek-V4 SGLang bridge/backends to publish and reuse ATOM V4 graph/attention metadata across decode/prefill/graph capture paths, including MTP layer support.
  • Patch SGLang speculative factories / CUDA-graph eligibility to route DeepSeek-V4 draft backends through ATOM plugin mode.
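The factory-patching item above can be pictured with a minimal sketch of an idempotent monkeypatch that routes selected architectures to a plugin backend and leaves the rest on the default path. The `host` namespace, `create_draft_backend`, and the returned strings are hypothetical placeholders, not SGLang or ATOM APIs.

```python
import types

# Hypothetical stand-in for a host-framework module that exposes a backend factory.
host = types.SimpleNamespace()


def default_draft_backend(model_arch):
    return f"default-backend:{model_arch}"


host.create_draft_backend = default_draft_backend


def install_plugin_routing(module, plugin_archs):
    """Wrap the module's factory so listed architectures use the plugin backend."""
    original = module.create_draft_backend
    if getattr(original, "_plugin_patched", False):
        return  # already patched; avoid stacking wrappers on repeated calls

    def patched(model_arch):
        if model_arch in plugin_archs:
            return f"plugin-backend:{model_arch}"
        return original(model_arch)  # everything else keeps the default path

    patched._plugin_patched = True
    module.create_draft_backend = patched


install_plugin_routing(host, {"DeepseekV4ForCausalLMNextN"})
install_plugin_routing(host, {"DeepseekV4ForCausalLMNextN"})  # second call is a no-op
```

Keeping a reference to the original factory and guarding against double patching keeps the default behavior intact for architectures the plugin does not handle.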

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| atom/plugin/sglang/runtime/forward_context.py | Adds V4 graph-metadata capture slicing and more robust metadata discovery paths. |
| atom/plugin/sglang/models/deepseek_v4_nextn_wrapper.py | New SGLang entry wrapper that instantiates ATOM DeepseekV4MTP and shares embed/head weights for NextN draft. |
| atom/plugin/sglang/models/base_model_wrapper.py | Routes DeepSeek-V4 logits through LogitsProcessor consistently; exposes embed/head for weight sharing. |
| atom/plugin/sglang/deepseek_v4_bridge.py | Extends proxy KV binding to MTP blocks and updates decode-graph metadata construction/indexer metadata. |
| atom/plugin/sglang/attention_backend/deepseek_v4_backend.py | Publishes ATOM metadata into ForwardBatch, supports multi-step draft backend expectations, and supports old/new graph hooks. |
| atom/plugin/register.py | Monkeypatches SGLang speculative backend factories and CUDA-graph behavior to force DeepSeek-V4 through ATOM shims. |


Comment on lines +692 to +695
pos_np = positions[:total].detach().cpu().numpy().astype(np.int32)
repeats = max(1, total // max(1, bs))
batch_np = np.repeat(np.arange(bs, dtype=np.int64), repeats)[:total]
else:
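To make the quoted lines easier to follow, here is a pure-Python stand-in for the expression `np.repeat(np.arange(bs), repeats)[:total]`. The function name `uniform_batch_index` is hypothetical and not part of the PR; the sketch assumes every request contributes the same number of tokens, which is what dividing `total` by `bs` implies.

```python
def uniform_batch_index(bs, total):
    """Per-token request index, assuming each request owns total // bs tokens."""
    repeats = max(1, total // max(1, bs))
    # Repeat each request index `repeats` times, then truncate to `total` entries.
    flat = [b for b in range(bs) for _ in range(repeats)]
    return flat[:total]


even = uniform_batch_index(2, 6)    # [0, 0, 0, 1, 1, 1]
uneven = uniform_batch_index(2, 5)  # [0, 0, 1, 1]: only 4 entries for 5 tokens
```

When `total` is not a multiple of `bs`, the result is shorter than `total`, so the mapping only lines up with the positions array when token counts are uniform across requests.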
Comment on lines 14 to +16
from atom.plugin.sglang.runtime.context import bind_current_forward_batch

logger = logging.getLogger("atom.plugin.sglang.runtime.forward_context")
zufayu requested a review from valarLip on June 26, 2026, 13:59
