fix(server): batch stream-chunk dispatch by ZhangLirong-amd · Pull Request #1367 · ROCm/ATOM

ZhangLirong-amd · 2026-06-26T07:33:10Z

Motivation

With DP at high concurrency (for example, 1024), when some requests finish inference, the scheduler cannot pick up new requests immediately; it waits for approximately 10 seconds.

Buffer chunks per sequence (decode only) and flush a whole decode step with one scheduled call. Mean TTFT drops by 41% and median TTFT by 69%; throughput is unchanged.

before

============ Serving Benchmark Result ============
Successful requests:                     2048      
Benchmark duration (s):                  148.22    
Total input tokens:                      1887965   
Total generated tokens:                  1885783   
Request throughput (req/s):              13.82     
Output token throughput (tok/s):         12722.80  
Total Token throughput (tok/s):          25460.32  
---------------Time to First Token----------------
Mean TTFT (ms):                          11006.58  
Median TTFT (ms):                        11180.13  
P99 TTFT (ms):                           20287.05  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          67.89     
Median TPOT (ms):                        67.23     
P99 TPOT (ms):                           83.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.56     
Median ITL (ms):                         0.01      
P99 ITL (ms):                            155.80

after

============ Serving Benchmark Result ============
Successful requests:                     2048      
Benchmark duration (s):                  147.81    
Total input tokens:                      1887965   
Total generated tokens:                  1885783   
Request throughput (req/s):              13.86     
Output token throughput (tok/s):         12758.25  
Total Token throughput (tok/s):          25531.27  
---------------Time to First Token----------------
Mean TTFT (ms):                          6535.80   
Median TTFT (ms):                        3462.88   
P99 TTFT (ms):                           18682.07  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.76     
Median TPOT (ms):                        68.93     
P99 TPOT (ms):                           88.36     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.86     
Median ITL (ms):                         41.27     
P99 ITL (ms):                            969.44
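
The headline percentages can be re-derived from the two benchmark tables above. The snippet below is only an arithmetic check; the `pct_change` helper and the dictionary layout are illustrative names, not part of the PR.

```python
# Arithmetic check of the reported deltas, using values copied from the
# "before" and "after" benchmark tables. Helper names are illustrative.

def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# (before, after) pairs in milliseconds
ttft_ms = {
    "mean": (11006.58, 6535.80),
    "median": (11180.13, 3462.88),
    "p99": (20287.05, 18682.07),
}
# (before, after) output token throughput in tok/s
output_tok_s = (12722.80, 12758.25)

ttft_change = {name: pct_change(b, a) for name, (b, a) in ttft_ms.items()}
throughput_change = pct_change(*output_tok_s)

for name, delta in ttft_change.items():
    print(f"{name} TTFT: {delta:+.1f}%")
print(f"output throughput: {throughput_change:+.2f}%")
```

Rounded to whole percentages, the mean and median TTFT changes match the -41% and -69% quoted in the motivation, and the throughput change stays well under 1%.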

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR changes the OpenAI API server streaming path to batch-dispatch stream chunks per engine “step”, reducing call_soon_threadsafe scheduling overhead on the API event loop at high batch sizes.

Changes:

- Buffer per-sequence stream chunks in the OpenAI server callback and flush them in a single batched dispatch (`flush_stream_batch`).
- Update EngineCoreMgr STREAM handling to run per-sequence callbacks first, then flush the batched stream chunks once per step.
- Add a lazy-resolved flush hook in EngineCoreMgr to avoid an import cycle.
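
The buffering-and-flush pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: it assumes each sequence has its own `asyncio.Queue`, and the names `buffer_chunk`, `flush_stream_batch_sketch`, `_deliver`, and `run_demo` are hypothetical. The only point it demonstrates is that a whole step's chunks cross from the engine thread to the event loop with one `call_soon_threadsafe` call rather than one call per chunk.

```python
import asyncio
import threading

# Per-thread buffer, so the engine thread accumulates chunks without locking.
_local = threading.local()

def buffer_chunk(queue: asyncio.Queue, chunk) -> None:
    """Record a chunk for later delivery instead of scheduling it right away."""
    pending = getattr(_local, "pending", None)
    if pending is None:
        pending = _local.pending = []
    pending.append((queue, chunk))

def _deliver(batch) -> None:
    """Runs on the event loop thread: hand every buffered chunk to its queue."""
    for queue, chunk in batch:
        queue.put_nowait(chunk)

def flush_stream_batch_sketch(loop: asyncio.AbstractEventLoop) -> int:
    """Schedule one loop callback for the whole step; return the chunk count."""
    batch = getattr(_local, "pending", None)
    if not batch:
        return 0
    _local.pending = []
    loop.call_soon_threadsafe(_deliver, batch)  # one scheduled call per step
    return len(batch)

def run_demo(num_seqs: int = 3, num_steps: int = 2):
    async def main():
        loop = asyncio.get_running_loop()
        queues = [asyncio.Queue() for _ in range(num_seqs)]
        flushed = []

        def engine():
            # Each simulated decode step emits one chunk per sequence,
            # then flushes all of them together.
            for step in range(num_steps):
                for seq_id, queue in enumerate(queues):
                    buffer_chunk(queue, (seq_id, step))
                flushed.append(flush_stream_batch_sketch(loop))

        worker = threading.Thread(target=engine)
        worker.start()
        received = []
        for queue in queues:
            chunks = []
            for _ in range(num_steps):
                chunks.append(await queue.get())
            received.append(chunks)
        worker.join()
        return flushed, received

    return asyncio.run(main())

if __name__ == "__main__":
    print(run_demo())
```

In this sketch the number of loop callbacks scales with the number of steps rather than with steps times sequences, which is the overhead the overview says the PR targets.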

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `atom/model_engine/engine_core_mgr.py` | Runs callbacks for STREAM outputs and then triggers a single batched flush per step; adds lazy flush resolver. |
| `atom/entrypoints/openai/api_server.py` | Implements thread-local buffering for stream chunks and a `flush_stream_batch()` function to batch-dispatch them onto the API loop. |


+            try:
+                from atom.entrypoints.openai.api_server import flush_stream_batch
+
+                fn = self._flush_stream_batch_fn = flush_stream_batch
+            except Exception:
+                self._flush_stream_batch_fn = lambda: None  # resolve to no-op
+                return
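
The lazy resolution shown in the diff above can be illustrated in isolation. The sketch below is a standalone approximation, not the `EngineCoreMgr` code: the class name `LazyHook` is hypothetical, it resolves the target through `importlib` instead of a `from ... import` statement, and it catches only `ImportError` and `AttributeError`, whereas the diff catches `Exception`.

```python
import importlib

class LazyHook:
    """Resolve a zero-argument callable on first use, then cache it.

    Deferring the import to call time avoids importing the target module
    while the current module is still being loaded, which is the usual
    way an import cycle is sidestepped.
    """

    def __init__(self, module_name: str, attr_name: str):
        self._module_name = module_name
        self._attr_name = attr_name
        self._fn = None  # not resolved yet

    def __call__(self):
        fn = self._fn
        if fn is None:
            try:
                module = importlib.import_module(self._module_name)
                fn = getattr(module, self._attr_name)
            except (ImportError, AttributeError):
                # Target unavailable: fall back to a no-op, as the diff does.
                fn = lambda: None
            self._fn = fn  # cache so later calls skip the import machinery
        return fn()

# Usage with standard-library targets, chosen only for the demonstration.
get_cwd = LazyHook("os", "getcwd")
missing = LazyHook("module_that_does_not_exist_xyz", "flush")

if __name__ == "__main__":
    print(get_cwd())   # resolved on first call
    print(missing())   # no-op fallback returns None
```

One trade-off worth noting in a sketch like this: a silent no-op fallback keeps the engine running when the server module is absent, but it also hides a failed resolution, so logging the exception once may be worth considering.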


fix(server): batch stream-chunk dispatch

b01ff80

Copilot AI review requested due to automatic review settings June 26, 2026 07:33

Copilot started reviewing on behalf of ZhangLirong-amd June 26, 2026 07:34

Copilot AI reviewed Jun 26, 2026


Comment thread atom/model_engine/engine_core_mgr.py

Comment on lines +425 to +431

            try:
                from atom.entrypoints.openai.api_server import flush_stream_batch

                fn = self._flush_stream_batch_fn = flush_stream_batch
            except Exception:
                self._flush_stream_batch_fn = lambda: None  # resolve to no-op
                return

zufayu requested a review from valarLip June 26, 2026 13:58

valarLip approved these changes Jun 26, 2026


valarLip merged commit 4b40ede into main Jun 26, 2026
47 of 53 checks passed

valarLip deleted the server_conv branch June 26, 2026 16:12
