fix(server): batch stream-chunk dispatch#1367

Merged
valarLip merged 1 commit into main from server_conv on Jun 26, 2026
@ZhangLirong-amd ZhangLirong-amd commented Jun 26, 2026

Collaborator

Motivation

In DP at high concurrency (e.g. 1024), when some of the requests finish inference, the scheduler cannot pick up new requests immediately; it waits for approximately 10 seconds.

Buffer chunks per sequence (decode only) and flush a whole decode step with one scheduled call. Mean TTFT drops by 41% and median TTFT by 69%; throughput is unchanged.
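The buffering idea above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `StreamBatcher`, `add`, `flush`, and `deliver` are hypothetical names, and the sketch does not distinguish prefill chunks from decode chunks. It only shows the core trade: one `call_soon_threadsafe` per step instead of one per sequence.

```python
import asyncio
import threading
from collections import defaultdict


class StreamBatcher:
    """Hypothetical helper: buffer chunks per sequence on the engine thread,
    then hand a whole step to the event loop with one scheduled call."""

    def __init__(self, loop, deliver):
        self._loop = loop
        self._deliver = deliver          # runs on the loop: deliver(seq_id, chunks)
        self._local = threading.local()  # one buffer per producing thread
        self.scheduled_calls = 0

    def _buffer(self):
        buf = getattr(self._local, "buf", None)
        if buf is None:
            buf = self._local.buf = defaultdict(list)
        return buf

    def add(self, seq_id, chunk):
        # Cheap: no cross-thread scheduling happens here.
        self._buffer()[seq_id].append(chunk)

    def flush(self):
        buf = self._buffer()
        if not buf:
            return
        batch = dict(buf)  # keep the per-sequence lists, then reset the buffer
        buf.clear()
        self.scheduled_calls += 1
        self._loop.call_soon_threadsafe(self._dispatch, batch)

    def _dispatch(self, batch):
        # Runs on the event loop thread.
        for seq_id, chunks in batch.items():
            self._deliver(seq_id, chunks)


async def demo(num_seqs=4, steps=3):
    loop = asyncio.get_running_loop()
    received = defaultdict(list)
    batcher = StreamBatcher(loop, lambda sid, chunks: received[sid].extend(chunks))

    def engine_thread():
        for step in range(steps):
            for seq_id in range(num_seqs):
                batcher.add(seq_id, f"tok{step}")
            batcher.flush()  # one scheduled call per step

    await asyncio.to_thread(engine_thread)
    await asyncio.sleep(0)  # give any still-pending callbacks a chance to run
    return dict(received), batcher.scheduled_calls


if __name__ == "__main__":
    received, calls = asyncio.run(demo())
    print(calls, received)
```

With 4 sequences and 3 steps, the sketch schedules 3 loop callbacks rather than 12. At a concurrency of 1024 the same ratio would apply per decode step, which is presumably where the event-loop relief comes from.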

before

============ Serving Benchmark Result ============
Successful requests:                     2048      
Benchmark duration (s):                  148.22    
Total input tokens:                      1887965   
Total generated tokens:                  1885783   
Request throughput (req/s):              13.82     
Output token throughput (tok/s):         12722.80  
Total Token throughput (tok/s):          25460.32  
---------------Time to First Token----------------
Mean TTFT (ms):                          11006.58  
Median TTFT (ms):                        11180.13  
P99 TTFT (ms):                           20287.05  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.89     
Median TPOT (ms):                        67.23     
P99 TPOT (ms):                           83.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.56     
Median ITL (ms):                         0.01      
P99 ITL (ms):                            155.80    

after

============ Serving Benchmark Result ============
Successful requests:                     2048      
Benchmark duration (s):                  147.81    
Total input tokens:                      1887965   
Total generated tokens:                  1885783   
Request throughput (req/s):              13.86     
Output token throughput (tok/s):         12758.25  
Total Token throughput (tok/s):          25531.27  
---------------Time to First Token----------------
Mean TTFT (ms):                          6535.80   
Median TTFT (ms):                        3462.88   
P99 TTFT (ms):                           18682.07  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.76     
Median TPOT (ms):                        68.93     
P99 TPOT (ms):                           88.36     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.86     
Median ITL (ms):                         41.27     
P99 ITL (ms):                            969.44    
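The percentages quoted in the description can be rechecked directly from the two result blocks above. A small sketch, where `pct_change` is an illustrative helper name and the numbers are copied from the benchmark output:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the benchmark results above.
metrics = {
    "Mean TTFT (ms)": (11006.58, 6535.80),
    "Median TTFT (ms)": (11180.13, 3462.88),
    "P99 TTFT (ms)": (20287.05, 18682.07),
    "Output token throughput (tok/s)": (12722.80, 12758.25),
    "P99 ITL (ms)": (155.80, 969.44),
}

if __name__ == "__main__":
    for name, (before, after) in metrics.items():
        print(f"{name:34s} {pct_change(before, after):+8.1f}%")
```

The mean and median TTFT changes round to -41% and -69%, matching the description. The same calculation also shows P99 ITL rising from 155.80 ms to 969.44 ms, a change the description does not discuss.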

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings June 26, 2026 07:33

Copilot AI left a comment

Contributor

Pull request overview

This PR changes the OpenAI API server streaming path to batch-dispatch stream chunks per engine “step”, reducing call_soon_threadsafe scheduling overhead on the API event loop at high batch sizes.

Changes:

  • Buffer per-sequence stream chunks in the OpenAI server callback and flush them in a single batched dispatch (flush_stream_batch).
  • Update EngineCoreMgr STREAM handling to run per-sequence callbacks first, then flush the batched stream chunks once per step.
  • Add a lazy-resolved flush hook in EngineCoreMgr to avoid an import cycle.
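The last two bullets can be illustrated with a rough sketch. It assumes a hypothetical `StepDispatcher` class and a stand-in module named `fake_api_server`; the real `EngineCoreMgr` and `api_server` code is not reproduced here. The sketch shows the ordering (per-sequence callbacks first, one flush per step) and a flush hook that is imported on first use and falls back to a no-op if the import fails.

```python
import importlib
import sys
import types


class StepDispatcher:
    """Hypothetical stand-in for the manager side of the change."""

    def __init__(self, hook_module: str, hook_name: str = "flush_stream_batch"):
        self._hook_module = hook_module
        self._hook_name = hook_name
        self._flush_fn = None  # resolved lazily to avoid an import cycle

    def _resolve_flush(self):
        if self._flush_fn is None:
            try:
                module = importlib.import_module(self._hook_module)
                self._flush_fn = getattr(module, self._hook_name)
            except (ImportError, AttributeError):
                # Server module not available: degrade to a no-op.
                self._flush_fn = lambda: None
        return self._flush_fn

    def handle_stream_step(self, outputs, callbacks):
        # 1) Per-sequence callbacks only buffer their chunk.
        for seq_id, chunk in outputs:
            callbacks[seq_id](chunk)
        # 2) A single flush hands the whole step to the API loop.
        self._resolve_flush()()


# Demo wiring: register a fake module so the lazy import has a target.
events = []
fake = types.ModuleType("fake_api_server")
fake.flush_stream_batch = lambda: events.append("flush")
sys.modules["fake_api_server"] = fake

callbacks = {
    sid: (lambda chunk, sid=sid: events.append(f"cb:{sid}:{chunk}"))
    for sid in (0, 1)
}

dispatcher = StepDispatcher("fake_api_server")
dispatcher.handle_stream_step([(0, "a"), (1, "b")], callbacks)

# A missing module resolves to a no-op instead of raising.
StepDispatcher("no_such_module_xyz").handle_stream_step([], {})

if __name__ == "__main__":
    print(events)
```

The sketch catches only `ImportError` and `AttributeError`. That is a choice made for the illustration, not a statement about what the merged code should do.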

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| atom/model_engine/engine_core_mgr.py | Runs callbacks for STREAM outputs and then triggers a single batched flush per step; adds lazy flush resolver. |
| atom/entrypoints/openai/api_server.py | Implements thread-local buffering for stream chunks and a flush_stream_batch() function to batch-dispatch them onto the API loop. |

Comment on lines +425 to +431
try:
    from atom.entrypoints.openai.api_server import flush_stream_batch

    fn = self._flush_stream_batch_fn = flush_stream_batch
except Exception:
    self._flush_stream_batch_fn = lambda: None  # resolve to no-op
    return
@zufayu zufayu requested a review from valarLip June 26, 2026 13:58
@valarLip valarLip merged commit 4b40ede into main Jun 26, 2026
47 of 53 checks passed
@valarLip valarLip deleted the server_conv branch June 26, 2026 16:12
3 participants