@sufubao (Collaborator) commented on Jan 20, 2026

```bash
python -m lightllm.server.api_server --model_dir /dev/shm/GLM-4.7/ --tp 8 --max_req_total_len 202752

# accuracy check
python ./test/test_api/test_gsmk.py --num-questions 1000 --port 8000
100%|██████████████████████████| 1000/1000 [01:46<00:00,  9.38it/s]
Accuracy: 0.959
Invalid: 0.000
Latency: 106.773 s

# benchmark
============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  75.95
Total input tokens:                      297150
Total input text tokens:                 297150
Total generated tokens:                  194059
Total generated tokens (retokenized):    194043
Request throughput (req/s):              13.17
Input token throughput (tok/s):          3912.50
Output token throughput (tok/s):         2555.12
Peak output token throughput (tok/s):    8134.00
Peak concurrent requests:                1000
Total token throughput (tok/s):          6467.62
Concurrency:                             375.36
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   28508.58
Median E2E Latency (ms):                 28429.38
P90 E2E Latency (ms):                    45289.75
P99 E2E Latency (ms):                    55608.76
---------------Time to First Token----------------
Mean TTFT (ms):                          7648.08
Median TTFT (ms):                        7433.17
P99 TTFT (ms):                           13111.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          349.80
Median TPOT (ms):                        142.88
P99 TPOT (ms):                           3805.27
---------------Inter-Token Latency----------------
Mean ITL (ms):                           108.50
Median ITL (ms):                         55.81
P95 ITL (ms):                            154.64
P99 ITL (ms):                            280.97
Max ITL (ms):                            11435.30
==================================================

# benchmark vs SGLang
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  83.18
Total input tokens:                      297150
Total input text tokens:                 297150
Total generated tokens:                  194059
Total generated tokens (retokenized):    193830
Request throughput (req/s):              12.02
Input token throughput (tok/s):          3572.38
Output token throughput (tok/s):         2333.01
Peak output token throughput (tok/s):    7166.00
Peak concurrent requests:                1000
Total token throughput (tok/s):          5905.39
Concurrency:                             388.59
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   32322.99
Median E2E Latency (ms):                 31198.81
P90 E2E Latency (ms):                    50799.95
P99 E2E Latency (ms):                    62104.49
---------------Time to First Token----------------
Mean TTFT (ms):                          9145.00
Median TTFT (ms):                        9131.92
P99 TTFT (ms):                           16137.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          444.81
Median TPOT (ms):                        159.32
P99 TPOT (ms):                           5199.22
---------------Inter-Token Latency----------------
Mean ITL (ms):                           120.11
Median ITL (ms):                         65.75
P95 ITL (ms):                            129.58
P99 ITL (ms):                            497.77
Max ITL (ms):                            14352.57
==================================================
```
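
For reference, the headline throughput figures above can be re-derived from the raw counts in the two reports; small deltas against the printed values come only from rounding of the reported duration. A minimal recomputation:

```python
# Recompute request and token throughput from the raw counts reported above.
results = {
    "lightllm": {"duration_s": 75.95, "requests": 1000,
                 "input_tokens": 297150, "output_tokens": 194059},
    "sglang":   {"duration_s": 83.18, "requests": 1000,
                 "input_tokens": 297150, "output_tokens": 194059},
}

for name, r in results.items():
    req_tput = r["requests"] / r["duration_s"]
    total_tput = (r["input_tokens"] + r["output_tokens"]) / r["duration_s"]
    # prints ~13.17 req/s / ~6467.5 tok/s for lightllm and
    # ~12.02 req/s / ~5905.4 tok/s for sglang, matching the reports above
    print(f"{name}: {req_tput:.2f} req/s, {total_tput:.2f} total tok/s")
```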

yeahdongcn and others added 9 commits on January 6, 2026, 19:29
This PR adds support for the Moore Threads (MUSA) GPU platform, expanding
LightLLM's hardware compatibility (an availability-probe sketch follows the notes below).

*NOTE:*

1. `_fwd_kernel_token_att1` has been slightly updated to ensure
compatibility with the Triton version.
2. `has_mtlink` will be used in upcoming enhancements to enable
multi-GPU support.
3. `torch` / `torch_musa` need to be upgraded to the latest versions.
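
A hedged availability-probe sketch for context on the platform support above: it assumes the `torch_musa` extension mirrors the `torch.cuda` API, as its documentation describes; the helper name is illustrative and not taken from this PR.

```python
# Hedged sketch of a MUSA availability probe. `torch_musa` is the Moore
# Threads PyTorch extension referenced in this PR; the helper name and
# fallback behaviour here are illustrative, not the LightLLM implementation.
import importlib.util


def is_musa_available() -> bool:
    """Return True if torch_musa is importable and reports a usable device."""
    if importlib.util.find_spec("torch_musa") is None:
        return False
    import torch
    import torch_musa  # noqa: F401  (registers the 'musa' device with torch)
    return torch.musa.is_available() and torch.musa.device_count() > 0


if __name__ == "__main__":
    print("MUSA available:", is_musa_available())
```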

### Testing Done

```bash
root@worker3218:/ws# python -m lightllm.server.api_server --model_dir /home/dist/Qwen3-0.6B/ --disable_cudagraph --host 0.0.0.0
WARNING 01-02 12:22:47 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-02 12:22:47 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-02 12:22:48 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-02 12:22:48 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-02 12:22:48 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-02 12:22:48 [__init__.py:232] Platform plugin musa is activated
WARNING 01-02 12:22:48 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-02 12:22:48 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-02 12:22:48 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-02 12:22:48 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
WARNING 01-02 12:22:48 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
INFO 01-02 12:22:48 [shm_size_check.py:21] SHM check: Available=500.00 GB,Recommended=2.32 GB.Sufficient: True
INFO 01-02 12:22:48 [api_start.py:94] zmq mode head: ipc:///tmp/_28765_0_
INFO 01-02 12:22:48 [api_start.py:96] use tgi api: False
INFO 01-02 12:22:48 [api_start.py:233] alloced ports: [10105, 10128, 10009, 10002, 10268, 10173, 10255, 10190, 10225, 10305]
INFO 01-02 12:22:48 [api_start.py:284] all start args:Namespace(run_mode='normal', host='0.0.0.0', port=8000, httpserver_workers=1, zmq_mode='ipc:///tmp/_28765_0_', pd_master_ip='0.0.0.0', pd_master_port=1212, pd_decode_rpyc_port=42000, select_p_d_node_strategy='round_robin', config_server_host=None, config_server_port=None, nixl_pd_kv_page_num=16, nixl_pd_kv_page_size=1024, model_name='default_model_name', model_dir='/home/dist/Qwen3-0.6B/', tokenizer_mode='fast', load_way='HF', max_total_token_num=None, mem_fraction=0.9, batch_max_tokens=8448, eos_id=[151645], tool_call_parser=None, reasoning_parser=None, chat_template=None, running_max_req_size=1000, nnodes=1, node_rank=0, multinode_httpmanager_port=12345, multinode_router_gloo_port=20001, tp=1, dp=1, dp_balancer='bs_balancer', max_req_total_len=16384, nccl_host='127.0.0.1', nccl_port=28765, use_config_server_to_init_nccl=False, mode=[], trust_remote_code=False, disable_log_stats=False, log_stats_interval=10, disable_shm_warning=False, router_token_ratio=0.0, router_max_new_token_len=1024, router_max_wait_tokens=1, disable_aggressive_schedule=False, use_dynamic_prompt_cache=False, disable_dynamic_prompt_cache=False, chunked_prefill_size=4096, disable_chunked_prefill=False, diverse_mode=False, token_healing_mode=False, output_constraint_mode='none', first_token_constraint_mode=False, enable_multimodal=False, enable_multimodal_audio=False, enable_mps=False, disable_custom_allreduce=False, enable_custom_allgather=False, enable_tpsp_mix_mode=False, enable_dp_prefill_balance=False, enable_prefill_microbatch_overlap=False, enable_decode_microbatch_overlap=False, enable_flashinfer_prefill=False, enable_flashinfer_decode=False, enable_fa3=False, cache_capacity=200, embed_cache_storage_size=4, data_type='bfloat16', return_all_prompt_logprobs=False, use_reward_model=False, long_truncation_mode=None, use_tgi_api=False, health_monitor=False, metric_gateway=None, job_name='lightllm', grouping_key=[], push_interval=10, visual_infer_batch_size=1, visual_send_batch_size=1, visual_gpu_ids=[0], visual_tp=1, visual_dp=1, visual_nccl_ports=[29500], enable_monitor_auth=False, disable_cudagraph=True, enable_prefill_cudagraph=False, prefll_cudagraph_max_handle_token=512, graph_max_batch_size=256, graph_split_batch_size=32, graph_grow_step_size=16, graph_max_len_in_batch=16384, quant_type='none', quant_cfg=None, vit_quant_type='none', vit_quant_cfg=None, sampling_backend='triton', penalty_counter_mode='gpu_counter', ep_redundancy_expert_config_path=None, auto_update_redundancy_expert=False, enable_fused_shared_experts=False, mtp_mode=None, mtp_draft_model_dir=None, mtp_step=0, kv_quant_calibration_config_path=None, schedule_time_interval=0.03, enable_cpu_cache=False, cpu_cache_storage_size=2, cpu_cache_token_page_size=256, enable_disk_cache=False, disk_cache_storage_size=10, disk_cache_dir=None, enable_dp_prompt_cache_fetch=False, router_port=10105, detokenization_port=10128, http_server_port=10009, visual_port=10002, audio_port=10268, cache_port=10173, metric_port=10255, multi_level_kv_cache_port=10190, pd_node_infer_rpyc_ports=[10305], pd_node_id=294623010895931863621527973304373176200, pd_p_allowed_port_min=20000, pd_p_allowed_port_max=30000)
WARNING 01-02 12:22:55 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-02 12:22:55 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-02 12:22:55 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-02 12:22:55 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-02 12:22:55 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-02 12:22:55 [__init__.py:232] Platform plugin musa is activated
WARNING 01-02 12:22:55 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-02 12:22:55 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
2026-01-02 12:22:55 | server | 140684395422848 | INFO : server started on [0.0.0.0]:10255
INFO 01-02 12:22:55 [start_utils.py:37] init func start_metric_manager : init ok
WARNING 01-02 12:23:02 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-02 12:23:02 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-02 12:23:02 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-02 12:23:02 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-02 12:23:02 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-02 12:23:02 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-02 12:23:02 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-02 12:23:02 [__init__.py:232] Platform plugin musa is activated
WARNING 01-02 12:23:02 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-02 12:23:02 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-02 12:23:02 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
INFO 01-02 12:23:02 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-02 12:23:02 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-02 12:23:02 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-02 12:23:02 [__init__.py:232] Platform plugin musa is activated
WARNING 01-02 12:23:02 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-02 12:23:02 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
WARNING 01-02 12:23:02 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
INFO 01-02 12:23:02 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-02 12:23:03 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
INFO 01-02 12:23:03 [manager.py:36] pub_to_httpserver sendhwm 1000
WARNING 01-02 12:23:03 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
2026-01-02 12:23:03 | server | 140684395422848 | INFO : accepted ('127.0.0.1', 36414) with fd 25
2026-01-02 12:23:03 | server | 140653235951168 | INFO : welcome ('127.0.0.1', 36414)
INFO 01-02 12:23:08 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-02 12:23:09 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
INFO 01-02 12:23:10 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-02 12:23:10 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-02 12:23:10 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-02 12:23:10 [__init__.py:232] Platform plugin musa is activated
WARNING 01-02 12:23:10 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
WARNING 01-02 12:23:10 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-02 12:23:10 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
INFO 01-02 12:23:10 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
WARNING 01-02 12:23:10 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
INFO 01-02 12:23:10 [model_rpc.py:67] Initialized RPC server for rank 0.
INFO 01-02 12:23:10 [model_rpc.py:168] use ChunkedPrefillBackend
INFO 01-02 12:23:11 [basemodel.py:157] Initial quantization. The default quantization method is none
pid 39235 Loading model weights with 1 workers: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.01it/s]
INFO 01-02 12:23:12 [mem_utils.py:37] mode setting params: []
INFO 01-02 12:23:12 [mem_utils.py:57] Model kv cache using mode normal
INFO 01-02 12:23:12 [mem_manager.py:84] 69.38735313415528 GB space is available after load the model weight
INFO 01-02 12:23:12 [mem_manager.py:84] 0.109375 MB is the size of one token kv cache
INFO 01-02 12:23:12 [mem_manager.py:84] 649624 is the profiled max_total_token_num with the mem_fraction 0.9
INFO 01-02 12:23:12 [mem_manager.py:84] 
warming up:   0%|                                                                                                                                                                  | 0/12 [00:00<?, ?it/s]WARNING 01-02 12:23:23 [autotuner.py:169] No kernel config for silu_and_mul_fwd:v1 in {N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json,the performance may be suboptimal!You can use LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 to enable autotune.
WARNING 01-02 12:23:23 [kernel_config.py:40] can not find config_path /ws/lightllm/common/all_kernel_configs/moe_silu_and_mul_kernel/{N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json kernel name moe_silu_and_mul_kernel use default kernel setting
warming up: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:15<00:00,  1.29s/it]
INFO 01-02 12:23:30 [basemodel.py:812] begin check max_len infer
INFO 01-02 12:23:30 [basemodel.py:849] check max_len 8448 infer ok
INFO 01-02 12:23:45 [base_backend.py:185] loaded model class <class 'lightllm.models.qwen3.model.Qwen3TpPartModel'>
INFO 01-02 12:23:45 [manager.py:196] use req queue ChunkedPrefillQueue
INFO 01-02 12:23:45 [start_utils.py:37] init func start_router_process : init ok
INFO 01-02 12:23:45 [start_utils.py:37] init func start_detokenization_process : init ok
INFO 01-02 12:23:45 [api_start.py:58] start process pid 30307
INFO 01-02 12:23:45 [api_start.py:59] http server pid 54746
[2026-01-02 12:23:45 +0800] [54746] [INFO] Starting gunicorn 23.0.0
[2026-01-02 12:23:45 +0800] [54746] [INFO] Listening at: http://0.0.0.0:8000 (54746)
[2026-01-02 12:23:45 +0800] [54746] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2026-01-02 12:23:45 +0800] [54966] [INFO] Booting worker with pid: 54966
WARNING 01-02 12:23:51 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-02 12:23:51 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-02 12:23:52 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 01-02 12:23:52 [__init__.py:38] - musa -> vllm_musa:register
INFO 01-02 12:23:52 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-02 12:23:52 [__init__.py:232] Platform plugin musa is activated
WARNING 01-02 12:23:52 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-02 12:23:52 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-02 12:23:52 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-02 12:23:52 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
[2026-01-02 12:23:52 +0800] [54966] [INFO] Started server process [54966]
[2026-01-02 12:23:52 +0800] [54966] [INFO] Waiting for application startup.
INFO 01-02 12:23:52 [api_http.py:359] server start up
2026-01-02 12:23:53 | server | 140684395422848 | INFO : accepted ('127.0.0.1', 55128) with fd 26
2026-01-02 12:23:53 | server | 140653227558464 | INFO : welcome ('127.0.0.1', 55128)
2026-01-02 12:23:53 | server | 140684395422848 | INFO : accepted ('127.0.0.1', 55144) with fd 27
2026-01-02 12:23:53 | server | 140653219165760 | INFO : welcome ('127.0.0.1', 55144)
INFO 01-02 12:23:54 [req_id_generator.py:34] ReqIDGenerator init finished
INFO 01-02 12:23:54 [api_http.py:363] server start up ok, loop use is <uvloop.Loop running=True closed=False debug=False>
[2026-01-02 12:23:54 +0800] [54966] [INFO] Application startup complete.
INFO 01-02 12:23:58 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-02 12:23:58 lightllm_req_id:8 
INFO 01-02 12:23:58 [manager.py:424] router recive req id 8 cost time 0.05271601676940918 s
DEBUG 01-02 12:23:58 [manager.py:322] Prefill Batch: batch_id=-1, time:1767327838.6764812s req_ids:[8] 
DEBUG 01-02 12:23:58 [manager.py:322] 
INFO 01-02 12:23:58 [manager.py:55] detokenization recv req id 8 cost time 0.0744318962097168 s
INFO 01-02 12:23:59 [manager.py:163] detoken release req id 8
INFO 01-02 12:23:59 [manager.py:611] X-Request-Id: X-Session-Id: start_time:2026-01-02 12:23:58 lightllm_req_id:8 first_token_cost:409.63053703308105ms total_cost_time:907.1474075317383ms,out_token_counter:17 mean_per_token_cost_time: 29.265698264626895ms prompt_token_num:4 gpu cache hit: False gpu_prompt_cache_len:0 gpu_prompt_cache_ratio:0.0 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 
127.0.0.1:38158 - "POST /generate HTTP/1.1" 200
DEBUG 01-02 12:23:59 [req_manager.py:78] freed all request size 1008
DEBUG 01-02 12:23:59 [infer_batch.py:172] free a batch state:
DEBUG 01-02 12:23:59 [infer_batch.py:172] radix refed token num 0
DEBUG 01-02 12:23:59 [infer_batch.py:172] radix hold token num 21
DEBUG 01-02 12:23:59 [infer_batch.py:172] mem manager can alloc token num 649603
DEBUG 01-02 12:23:59 [infer_batch.py:172] mem manager total size 649624
INFO 01-02 12:23:59 [batch.py:56] router release req id 8
INFO 01-02 12:23:59 [shm_req_manager.py:111] all shm req has been release ok
```
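
The tail of the log shows a single `/generate` round trip returning HTTP 200 with 17 output tokens. A request shaped like the one below reproduces that traffic; the field names follow LightLLM's TGI-style generate API, while the prompt text and `max_new_tokens` value are stand-ins rather than the exact inputs used in the log.

```python
# Hedged example: send one request to the /generate endpoint started above.
# Field names follow LightLLM's TGI-style API; the prompt is a stand-in.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "inputs": "What is AI?",
        "parameters": {"max_new_tokens": 17},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```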

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: wangzaijun <wangzaijun@sensetime.com>
Co-authored-by: root <root@DESKTOP-5FJJCPK.localdomain>
…doc (#1175)

### Testing Done

Tested in a clean Docker container without vllm installed.
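
The repeated `vllm is not installed` / `sgl_kernel is not installed` warnings in the log below come from optional-dependency guards. A minimal sketch of that pattern, with illustrative names only (the real guards live in modules such as `vllm_utils.py` and `sgl_utils.py`):

```python
# Minimal sketch of an optional-dependency guard, matching the style of the
# "xxx is not installed" warnings in the log below. Names are illustrative.
import logging

logger = logging.getLogger(__name__)

try:
    import vllm  # optional acceleration backend
    HAS_VLLM = True
except ImportError:
    vllm = None
    HAS_VLLM = False
    logger.warning(
        "vllm is not installed, you can't use the api of it. "
        "You can solve it by running `pip install vllm`."
    )


def require_vllm() -> None:
    """Fail fast if a caller reaches a vllm-only code path."""
    if not HAS_VLLM:
        raise RuntimeError("this code path requires vllm; install it first")
```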

```bash
root@worker3218:/ws# python -m lightllm.server.api_server --model_dir /home/dist/Qwen3-0.6B/ --disable_cudagraph --host 0.0.0.0
WARNING 01-12 13:45:20 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it.                    You can solve it by running `pip install sgl_kernel`.
WARNING 01-12 13:45:20 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-12 13:45:20 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-12 13:45:20 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-12 13:45:20 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-12 13:45:20 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-12 13:45:20 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
WARNING 01-12 13:45:20 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
INFO 01-12 13:45:21 [shm_size_check.py:21] SHM check: Available=500.00 GB,Recommended=2.32 GB.Sufficient: True
INFO 01-12 13:45:21 [api_start.py:94] zmq mode head: ipc:///tmp/_28765_0_
INFO 01-12 13:45:21 [api_start.py:96] use tgi api: False
INFO 01-12 13:45:21 [api_start.py:219] alloced ports: [10017, 10004, 10209, 10223, 10297, 10257, 10068, 10179, 10206, 10285]
INFO 01-12 13:45:21 [api_start.py:270] all start args:Namespace(run_mode='normal', host='0.0.0.0', port=8000, httpserver_workers=1, zmq_mode='ipc:///tmp/_28765_0_', pd_master_ip='0.0.0.0', pd_master_port=1212, pd_decode_rpyc_port=42000, select_p_d_node_strategy='round_robin', config_server_host=None, config_server_port=None, nixl_pd_kv_page_num=16, nixl_pd_kv_page_size=1024, model_name='default_model_name', model_dir='/home/dist/Qwen3-0.6B/', tokenizer_mode='fast', load_way='HF', max_total_token_num=None, mem_fraction=0.9, batch_max_tokens=8448, eos_id=[151645], tool_call_parser=None, reasoning_parser=None, chat_template=None, running_max_req_size=1000, nnodes=1, node_rank=0, multinode_httpmanager_port=12345, multinode_router_gloo_port=20001, tp=1, dp=1, dp_balancer='bs_balancer', max_req_total_len=16384, nccl_host='127.0.0.1', nccl_port=28765, use_config_server_to_init_nccl=False, trust_remote_code=False, disable_log_stats=False, log_stats_interval=10, disable_shm_warning=False, router_token_ratio=0.0, router_max_new_token_len=1024, router_max_wait_tokens=1, disable_aggressive_schedule=False, use_dynamic_prompt_cache=False, disable_dynamic_prompt_cache=False, chunked_prefill_size=4096, disable_chunked_prefill=False, diverse_mode=False, token_healing_mode=False, output_constraint_mode='none', first_token_constraint_mode=False, enable_multimodal=False, enable_multimodal_audio=False, enable_mps=False, disable_custom_allreduce=False, enable_custom_allgather=False, enable_tpsp_mix_mode=False, enable_dp_prefill_balance=False, enable_prefill_microbatch_overlap=False, enable_decode_microbatch_overlap=False, llm_prefill_att_backend=['triton'], llm_decode_att_backend=['triton'], llm_kv_type='None', llm_kv_quant_group_size=8, cache_capacity=200, embed_cache_storage_size=4, data_type='bfloat16', return_all_prompt_logprobs=False, use_reward_model=False, long_truncation_mode=None, use_tgi_api=False, health_monitor=False, metric_gateway=None, job_name='lightllm', grouping_key=[], push_interval=10, visual_infer_batch_size=1, visual_send_batch_size=1, visual_gpu_ids=[0], visual_tp=1, visual_dp=1, visual_nccl_ports=[29500], enable_monitor_auth=False, disable_cudagraph=True, enable_prefill_cudagraph=False, prefll_cudagraph_max_handle_token=512, graph_max_batch_size=256, graph_split_batch_size=32, graph_grow_step_size=16, graph_max_len_in_batch=16384, quant_type='none', quant_cfg=None, vit_quant_type='none', vit_quant_cfg=None, sampling_backend='triton', penalty_counter_mode='gpu_counter', ep_redundancy_expert_config_path=None, auto_update_redundancy_expert=False, enable_fused_shared_experts=False, mtp_mode=None, mtp_draft_model_dir=None, mtp_step=0, kv_quant_calibration_config_path=None, schedule_time_interval=0.03, enable_cpu_cache=False, cpu_cache_storage_size=2, cpu_cache_token_page_size=256, enable_disk_cache=False, disk_cache_storage_size=10, disk_cache_dir=None, enable_dp_prompt_cache_fetch=False, router_port=10017, detokenization_port=10004, http_server_port=10209, visual_port=10223, audio_port=10297, cache_port=10257, metric_port=10068, multi_level_kv_cache_port=10179, pd_node_infer_rpyc_ports=[10285], pd_node_id=288479957063433772586255832729030629155, pd_p_allowed_port_min=20000, pd_p_allowed_port_max=30000)
WARNING 01-12 13:45:27 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it.                    You can solve it by running `pip install sgl_kernel`.
WARNING 01-12 13:45:27 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-12 13:45:27 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-12 13:45:27 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-12 13:45:27 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
2026-01-12 13:45:27 | server | 140078322902144 | INFO : server started on [0.0.0.0]:10068
INFO 01-12 13:45:27 [start_utils.py:37] init func start_metric_manager : init ok
WARNING 01-12 13:45:33 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it.                    You can solve it by running `pip install sgl_kernel`.
WARNING 01-12 13:45:33 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-12 13:45:33 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-12 13:45:33 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-12 13:45:33 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-12 13:45:33 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-12 13:45:33 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it.                    You can solve it by running `pip install sgl_kernel`.
WARNING 01-12 13:45:33 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-12 13:45:33 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-12 13:45:33 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-12 13:45:33 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-12 13:45:33 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-12 13:45:33 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
WARNING 01-12 13:45:33 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
INFO 01-12 13:45:33 [manager.py:36] pub_to_httpserver sendhwm 1000
WARNING 01-12 13:45:33 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
2026-01-12 13:45:33 | server | 140078322902144 | INFO : accepted ('127.0.0.1', 47548) with fd 25
2026-01-12 13:45:33 | server | 140046992746048 | INFO : welcome ('127.0.0.1', 47548)
INFO 01-12 13:45:38 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-12 13:45:38 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it.                    You can solve it by running `pip install sgl_kernel`.
WARNING 01-12 13:45:38 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-12 13:45:38 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
WARNING 01-12 13:45:38 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-12 13:45:38 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
INFO 01-12 13:45:38 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
WARNING 01-12 13:45:40 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
INFO 01-12 13:45:40 [model_rpc.py:67] Initialized RPC server for rank 0.
INFO 01-12 13:45:40 [model_rpc.py:168] use ChunkedPrefillBackend
INFO 01-12 13:45:43 [basemodel.py:169] Initial quantization. The default quantization method is none
pid 45988 Loading model weights with 1 workers:   0%|                                                                      | 0/1 [00:00<?, ?it/s]INFO 01-12 13:45:43 [embedding_weight.py:30] loaded weight vocab_size: 151936
pid 45988 Loading model weights with 1 workers: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.19it/s]
INFO 01-12 13:45:43 [mem_utils.py:30] mode setting params: None
INFO 01-12 13:45:43 [mem_utils.py:40] Model kv cache using mem_manager class: <class 'lightllm.common.kv_cache_mem_manager.mem_manager.MemoryManager'>
INFO 01-12 13:45:43 [mem_manager.py:99] 69.76169700622559 GB space is available after load the model weight
INFO 01-12 13:45:43 [mem_manager.py:99] 0.109375 MB is the size of one token kv cache
INFO 01-12 13:45:43 [mem_manager.py:99] 653128 is the profiled max_total_token_num with the mem_fraction 0.9
INFO 01-12 13:45:43 [mem_manager.py:99] 
INFO 01-12 13:45:44 [basemodel.py:126] use prefill att backend: TritonAttBackend
INFO 01-12 13:45:44 [basemodel.py:127] use decode att backend: TritonAttBackend
warming up:   0%|                                                                                                         | 0/12 [00:00<?, ?it/s]WARNING 01-12 13:46:16 [autotuner.py:169] No kernel config for silu_and_mul_fwd:v1 in {N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json,the performance may be suboptimal!You can use LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 to enable autotune.
WARNING 01-12 13:46:16 [kernel_config.py:40] can not find config_path /ws/lightllm/common/all_kernel_configs/moe_silu_and_mul_kernel/{N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json kernel name moe_silu_and_mul_kernel use default kernel setting
warming up: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:40<00:00,  3.41s/it]
INFO 01-12 13:46:25 [basemodel.py:846] begin check max_len infer
INFO 01-12 13:46:25 [basemodel.py:882] check max_len 8448 infer ok
INFO 01-12 13:46:40 [base_backend.py:184] loaded model class <class 'lightllm.models.qwen3.model.Qwen3TpPartModel'>
INFO 01-12 13:46:40 [manager.py:194] use req queue ChunkedPrefillQueue
INFO 01-12 13:46:40 [start_utils.py:37] init func start_router_process : init ok
INFO 01-12 13:46:40 [start_utils.py:37] init func start_detokenization_process : init ok
INFO 01-12 13:46:40 [api_start.py:58] start process pid 38328
INFO 01-12 13:46:40 [api_start.py:59] http server pid 5689
[2026-01-12 13:46:40 +0800] [5689] [INFO] Starting gunicorn 23.0.0
[2026-01-12 13:46:40 +0800] [5689] [INFO] Listening at: http://0.0.0.0:8000 (5689)
[2026-01-12 13:46:40 +0800] [5689] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2026-01-12 13:46:40 +0800] [5690] [INFO] Booting worker with pid: 5690
WARNING 01-12 13:46:46 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it.                    You can solve it by running `pip install sgl_kernel`.
WARNING 01-12 13:46:46 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-12 13:46:46 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 01-12 13:46:46 [vllm_utils.py:18] vllm is not installed, you can't use the api of it.                    You can solve it by running `pip install vllm`.
INFO 01-12 13:46:46 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-12 13:46:46 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
WARNING 01-12 13:46:46 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
[2026-01-12 13:46:47 +0800] [5690] [INFO] Started server process [5690]
[2026-01-12 13:46:47 +0800] [5690] [INFO] Waiting for application startup.
INFO 01-12 13:46:47 [api_http.py:359] server start up
2026-01-12 13:46:47 | server | 140078322902144 | INFO : accepted ('127.0.0.1', 35962) with fd 26
2026-01-12 13:46:47 | server | 140046984353344 | INFO : welcome ('127.0.0.1', 35962)
2026-01-12 13:46:47 | server | 140078322902144 | INFO : accepted ('127.0.0.1', 35966) with fd 27
2026-01-12 13:46:47 | server | 140046975960640 | INFO : welcome ('127.0.0.1', 35966)
INFO 01-12 13:46:48 [req_id_generator.py:34] ReqIDGenerator init finished
INFO 01-12 13:46:48 [api_http.py:363] server start up ok, loop use is <uvloop.Loop running=True closed=False debug=False>
[2026-01-12 13:46:48 +0800] [5690] [INFO] Application startup complete.
DEBUG 01-12 13:47:52 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:47:52 [manager.py:283] 
DEBUG 01-12 13:47:52 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:47:52 [manager.py:284] 
[2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch
[2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch
[2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch
[2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch
DEBUG 01-12 13:48:55 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:48:55 [manager.py:283] 
DEBUG 01-12 13:48:55 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:48:55 [manager.py:284] 
DEBUG 01-12 13:49:58 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:49:58 [manager.py:283] 
DEBUG 01-12 13:49:58 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:49:58 [manager.py:284] 
DEBUG 01-12 13:51:02 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:51:02 [manager.py:283] 
DEBUG 01-12 13:51:02 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:51:02 [manager.py:284] 
INFO 01-12 13:51:09 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:09 lightllm_req_id:8 
INFO 01-12 13:51:09 [manager.py:422] router recive req id 8 cost time 0.05662369728088379 s
DEBUG 01-12 13:51:09 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197069.7485027s req_ids:[8] 
DEBUG 01-12 13:51:09 [manager.py:320] 
INFO 01-12 13:51:09 [manager.py:55] detokenization recv req id 8 cost time 0.07959198951721191 s
DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 estimated_peak_token_count: 39 
DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 token used ratio: 6.12437378278071e-06 not contain prompt cache tree unrefed token
DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 token used ratio: 6.12437378278071e-06 contain prompt cache tree unrefed token
DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 estimated_peak_token_count: 39 
DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 token used ratio: 7.655467228475888e-06 not contain prompt cache tree unrefed token
DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 token used ratio: 7.655467228475888e-06 contain prompt cache tree unrefed token
INFO 01-12 13:51:16 [manager.py:163] detoken release req id 8
INFO 01-12 13:51:16 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:09 lightllm_req_id:8 first_token_cost:6353.325128555298ms total_cost_time:6671.096563339233ms,out_token_counter:17 mean_per_token_cost_time: 18.692437340231503ms prompt_token_num:4 gpu cache hit: False gpu_prompt_cache_len:0 gpu_prompt_cache_ratio:0.0 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 
127.0.0.1:55472 - "POST /generate HTTP/1.1" 200
DEBUG 01-12 13:51:16 [req_manager.py:78] freed all request size 1008
DEBUG 01-12 13:51:16 [infer_batch.py:172] free a batch state:
DEBUG 01-12 13:51:16 [infer_batch.py:172] radix refed token num 0
DEBUG 01-12 13:51:16 [infer_batch.py:172] radix hold token num 21
DEBUG 01-12 13:51:16 [infer_batch.py:172] mem manager can alloc token num 653107
DEBUG 01-12 13:51:16 [infer_batch.py:172] mem manager total size 653128
INFO 01-12 13:51:16 [batch.py:56] router release req id 8
INFO 01-12 13:51:16 [shm_req_manager.py:111] all shm req has been release ok
INFO 01-12 13:51:19 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:19 lightllm_req_id:16 
INFO 01-12 13:51:19 [manager.py:422] router recive req id 16 cost time 0.019651412963867188 s
DEBUG 01-12 13:51:19 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197079.421846s req_ids:[16] 
DEBUG 01-12 13:51:19 [manager.py:320] 
INFO 01-12 13:51:19 [manager.py:55] detokenization recv req id 16 cost time 0.021979331970214844 s
INFO 01-12 13:51:19 [manager.py:163] detoken release req id 16
INFO 01-12 13:51:19 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:19 lightllm_req_id:16 first_token_cost:102.96440124511719ms total_cost_time:407.08088874816895ms,out_token_counter:17 mean_per_token_cost_time: 17.88920514723834ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 
127.0.0.1:47146 - "POST /generate HTTP/1.1" 200
DEBUG 01-12 13:51:19 [req_manager.py:78] freed all request size 1008
DEBUG 01-12 13:51:19 [infer_batch.py:172] free a batch state:
DEBUG 01-12 13:51:19 [infer_batch.py:172] radix refed token num 0
DEBUG 01-12 13:51:19 [infer_batch.py:172] radix hold token num 35
DEBUG 01-12 13:51:19 [infer_batch.py:172] mem manager can alloc token num 653093
DEBUG 01-12 13:51:19 [infer_batch.py:172] mem manager total size 653128
INFO 01-12 13:51:19 [batch.py:56] router release req id 16
INFO 01-12 13:51:19 [shm_req_manager.py:111] all shm req has been release ok
INFO 01-12 13:51:22 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:22 lightllm_req_id:24 
INFO 01-12 13:51:22 [manager.py:422] router recive req id 24 cost time 0.015377998352050781 s
DEBUG 01-12 13:51:22 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197082.1040523s req_ids:[24] 
DEBUG 01-12 13:51:22 [manager.py:320] 
INFO 01-12 13:51:22 [manager.py:55] detokenization recv req id 24 cost time 0.016767501831054688 s
INFO 01-12 13:51:22 [manager.py:163] detoken release req id 24
INFO 01-12 13:51:22 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:22 lightllm_req_id:24 first_token_cost:86.02452278137207ms total_cost_time:432.842493057251ms,out_token_counter:17 mean_per_token_cost_time: 20.4010570750517ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 
127.0.0.1:47156 - "POST /generate HTTP/1.1" 200
DEBUG 01-12 13:51:22 [req_manager.py:78] freed all request size 1008
DEBUG 01-12 13:51:22 [infer_batch.py:172] free a batch state:
DEBUG 01-12 13:51:22 [infer_batch.py:172] radix refed token num 0
DEBUG 01-12 13:51:22 [infer_batch.py:172] radix hold token num 51
DEBUG 01-12 13:51:22 [infer_batch.py:172] mem manager can alloc token num 653077
DEBUG 01-12 13:51:22 [infer_batch.py:172] mem manager total size 653128
INFO 01-12 13:51:22 [batch.py:56] router release req id 24
INFO 01-12 13:51:22 [shm_req_manager.py:111] all shm req has been release ok
INFO 01-12 13:51:26 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:26 lightllm_req_id:32 
INFO 01-12 13:51:26 [manager.py:422] router recive req id 32 cost time 0.008630990982055664 s
DEBUG 01-12 13:51:26 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197086.9206343s req_ids:[32] 
DEBUG 01-12 13:51:26 [manager.py:320] 
INFO 01-12 13:51:26 [manager.py:55] detokenization recv req id 32 cost time 0.011269092559814453 s
INFO 01-12 13:51:27 [manager.py:163] detoken release req id 32
INFO 01-12 13:51:27 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:26 lightllm_req_id:32 first_token_cost:74.12481307983398ms total_cost_time:378.31759452819824ms,out_token_counter:17 mean_per_token_cost_time: 17.89369302637437ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 
127.0.0.1:47160 - "POST /generate HTTP/1.1" 200
DEBUG 01-12 13:51:27 [req_manager.py:78] freed all request size 1008
DEBUG 01-12 13:51:27 [infer_batch.py:172] free a batch state:
DEBUG 01-12 13:51:27 [infer_batch.py:172] radix refed token num 0
DEBUG 01-12 13:51:27 [infer_batch.py:172] radix hold token num 68
DEBUG 01-12 13:51:27 [infer_batch.py:172] mem manager can alloc token num 653060
DEBUG 01-12 13:51:27 [infer_batch.py:172] mem manager total size 653128
INFO 01-12 13:51:27 [batch.py:56] router release req id 32
INFO 01-12 13:51:27 [shm_req_manager.py:111] all shm req has been release ok
INFO 01-12 13:51:44 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:44 lightllm_req_id:40 
INFO 01-12 13:51:44 [manager.py:422] router recive req id 40 cost time 0.009232759475708008 s
DEBUG 01-12 13:51:44 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197104.2886696s req_ids:[40] 
DEBUG 01-12 13:51:44 [manager.py:320] 
INFO 01-12 13:51:44 [manager.py:55] detokenization recv req id 40 cost time 0.010197639465332031 s
DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 estimated_peak_token_count: 2022 
DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 token used ratio: 0.00019597996104898273 not contain prompt cache tree unrefed token
DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 token used ratio: 0.0002955010350191693 contain prompt cache tree unrefed token
DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 estimated_peak_token_count: 2022 
DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 token used ratio: 0.0002618169792138754 not contain prompt cache tree unrefed token
DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 token used ratio: 0.0003613380531840619 contain prompt cache tree unrefed token
DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 token used ratio: 0.0005052608370794086 not contain prompt cache tree unrefed token
DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 token used ratio: 0.0006047819110495952 contain prompt cache tree unrefed token
DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 token used ratio: 0.0007456425080535515 not contain prompt cache tree unrefed token
DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 token used ratio: 0.000845163582023738 contain prompt cache tree unrefed token
DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 token used ratio: 0.0009875552724733895 not contain prompt cache tree unrefed token
DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 token used ratio: 0.001087076346443576 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 token used ratio: 0.0012264058500018372 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 token used ratio: 0.001325926923972024 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 token used ratio: 0.0014086059700395635 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 token used ratio: 0.00150812704400975 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 token used ratio: 0.0015724329687289474 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 token used ratio: 0.001671954042699134 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 token used ratio: 0.0017331977805269412 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 token used ratio: 0.0018327188544971277 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 token used ratio: 0.0018939625923249349 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 token used ratio: 0.0019934836662951214 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 token used ratio: 0.0020531963106772333 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 token used ratio: 0.00215271738464742 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 token used ratio: 0.002213961122475227 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 token used ratio: 0.0023134821964454133 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 token used ratio: 0.0023731948408275256 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 token used ratio: 0.002472715914797712 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 token used ratio: 0.002509462157494396 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 token used ratio: 0.002608983231464583 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 token used ratio: 0.0026288874462586202 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 token used ratio: 0.0027284085202288065 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 token used ratio: 0.002746781641577149 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 token used ratio: 0.002846302715547335 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 token used ratio: 0.002861613650004287 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 token used ratio: 0.0029611347239744735 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 token used ratio: 0.002939699415734741 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 token used ratio: 0.0030392204897049277 contain prompt cache tree unrefed token
DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 
DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 token used ratio: 0.0030116608076824146 not contain prompt cache tree unrefed token
DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 token used ratio: 0.003111181881652601 contain prompt cache tree unrefed token
INFO 01-12 13:52:42 [manager.py:163] detoken release req id 40
INFO 01-12 13:52:42 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:44 lightllm_req_id:40 first_token_cost:91.23969078063965ms total_cost_time:58654.03771400452ms,out_token_counter:2000 mean_per_token_cost_time: 29.28139901161194ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 
127.0.0.1:50156 - "POST /generate HTTP/1.1" 200
DEBUG 01-12 13:52:42 [req_manager.py:78] freed all request size 1008
DEBUG 01-12 13:52:42 [infer_batch.py:172] free a batch state:
DEBUG 01-12 13:52:42 [infer_batch.py:172] radix refed token num 0
DEBUG 01-12 13:52:42 [infer_batch.py:172] radix hold token num 2068
DEBUG 01-12 13:52:42 [infer_batch.py:172] mem manager can alloc token num 651060
DEBUG 01-12 13:52:42 [infer_batch.py:172] mem manager total size 653128
INFO 01-12 13:52:42 [batch.py:56] router release req id 40
INFO 01-12 13:52:42 [shm_req_manager.py:111] all shm req has been release ok
DEBUG 01-12 13:52:50 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:52:50 [manager.py:283] 
DEBUG 01-12 13:52:50 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:52:50 [manager.py:284] 
DEBUG 01-12 13:53:53 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:53:53 [manager.py:283] 
DEBUG 01-12 13:53:53 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:53:53 [manager.py:284] 
DEBUG 01-12 13:54:56 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:54:56 [manager.py:283] 
DEBUG 01-12 13:54:56 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:54:56 [manager.py:284] 
DEBUG 01-12 13:56:00 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:56:00 [manager.py:283] 
DEBUG 01-12 13:56:00 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:56:00 [manager.py:284] 
DEBUG 01-12 13:57:03 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:57:03 [manager.py:283] 
DEBUG 01-12 13:57:03 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:57:03 [manager.py:284] 
DEBUG 01-12 13:58:06 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:58:06 [manager.py:283] 
DEBUG 01-12 13:58:06 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:58:06 [manager.py:284] 
DEBUG 01-12 13:59:09 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 13:59:09 [manager.py:283] 
DEBUG 01-12 13:59:09 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 13:59:09 [manager.py:284] 
INFO 01-12 14:00:06 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 14:00:06 lightllm_req_id:48 
INFO 01-12 14:00:06 [manager.py:422] router recive req id 48 cost time 0.00828862190246582 s
DEBUG 01-12 14:00:06 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197606.2045314s req_ids:[48] 
DEBUG 01-12 14:00:06 [manager.py:320] 
INFO 01-12 14:00:06 [manager.py:55] detokenization recv req id 48 cost time 0.010654926300048828 s
DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 estimated_peak_token_count: 222 
DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 token used ratio: 4.746389681655051e-05 not contain prompt cache tree unrefed token
DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 token used ratio: 0.0032091718621770926 contain prompt cache tree unrefed token
DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 current batch size: 1 
DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 paused req num: 0 
DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 estimated_peak_token_count: 222 
DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 token used ratio: 0.0002878455677906934 not contain prompt cache tree unrefed token
DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 token used ratio: 0.003449553533151235 contain prompt cache tree unrefed token
INFO 01-12 14:00:10 [manager.py:163] detoken release req id 48
INFO 01-12 14:00:10 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 14:00:06 lightllm_req_id:48 first_token_cost:94.14434432983398ms total_cost_time:3917.818784713745ms,out_token_counter:200 mean_per_token_cost_time: 19.118372201919556ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 
127.0.0.1:53836 - "POST /generate HTTP/1.1" 200
DEBUG 01-12 14:00:10 [req_manager.py:78] freed all request size 1008
DEBUG 01-12 14:00:10 [infer_batch.py:172] free a batch state:
DEBUG 01-12 14:00:10 [infer_batch.py:172] radix refed token num 0
DEBUG 01-12 14:00:10 [infer_batch.py:172] radix hold token num 2266
DEBUG 01-12 14:00:10 [infer_batch.py:172] mem manager can alloc token num 650862
DEBUG 01-12 14:00:10 [infer_batch.py:172] mem manager total size 653128
INFO 01-12 14:00:10 [batch.py:56] router release req id 48
INFO 01-12 14:00:10 [shm_req_manager.py:111] all shm req has been release ok
DEBUG 01-12 14:00:12 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:00:12 [manager.py:283] 
DEBUG 01-12 14:00:12 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:00:12 [manager.py:284] 
DEBUG 01-12 14:01:16 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:01:16 [manager.py:283] 
DEBUG 01-12 14:01:16 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:01:16 [manager.py:284] 
DEBUG 01-12 14:02:19 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:02:19 [manager.py:283] 
DEBUG 01-12 14:02:19 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:02:19 [manager.py:284] 
[2026-01-12 14:03:16 +0800] [5689] [INFO] Handling signal: winch
DEBUG 01-12 14:03:22 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:03:22 [manager.py:283] 
DEBUG 01-12 14:03:22 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:03:22 [manager.py:284] 
DEBUG 01-12 14:04:25 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:04:25 [manager.py:283] 
DEBUG 01-12 14:04:25 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:04:25 [manager.py:284] 
DEBUG 01-12 14:05:28 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:05:28 [manager.py:283] 
DEBUG 01-12 14:05:28 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:05:28 [manager.py:284] 
[2026-01-12 14:06:28 +0800] [5689] [INFO] Handling signal: winch
[2026-01-12 14:06:28 +0800] [5689] [INFO] Handling signal: winch
[2026-01-12 14:06:28 +0800] [5689] [INFO] Handling signal: winch
DEBUG 01-12 14:06:31 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:06:31 [manager.py:283] 
DEBUG 01-12 14:06:31 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:06:31 [manager.py:284] 
DEBUG 01-12 14:07:35 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:07:35 [manager.py:283] 
DEBUG 01-12 14:07:35 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:07:35 [manager.py:284] 
DEBUG 01-12 14:08:38 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:08:38 [manager.py:283] 
DEBUG 01-12 14:08:38 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:08:38 [manager.py:284] 
DEBUG 01-12 14:09:41 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:09:41 [manager.py:283] 
DEBUG 01-12 14:09:41 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:09:41 [manager.py:284] 
DEBUG 01-12 14:10:44 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:10:44 [manager.py:283] 
DEBUG 01-12 14:10:44 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:10:44 [manager.py:284] 
DEBUG 01-12 14:11:47 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:11:47 [manager.py:283] 
DEBUG 01-12 14:11:47 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:11:47 [manager.py:284] 
[2026-01-12 14:11:57 +0800] [5689] [INFO] Handling signal: winch
DEBUG 01-12 14:12:51 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:12:51 [manager.py:283] 
DEBUG 01-12 14:12:51 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:12:51 [manager.py:284] 
DEBUG 01-12 14:13:54 [manager.py:283] dp_i 0 frozen token num: 0 
DEBUG 01-12 14:13:54 [manager.py:283] 
DEBUG 01-12 14:13:54 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 
DEBUG 01-12 14:13:54 [manager.py:284] 
```

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: shihaobai <42648726+shihaobai@users.noreply.github.com>
Co-authored-by: wangzaijun <wangzaijun@sensetime.com>
Co-authored-by: sangchengmeng <sangchengmeng@sensetime.com>
Add support for the GLM-4.7 MoE (glm4_moe) model with:
- 160 routed experts + 1 shared expert with top-8 routing
- Sigmoid gating with e_score_correction_bias
- QK normalization (RMSNorm on Q and K projections)
- Partial rotary embeddings (factor 0.5)
- routed_scaling_factor = 2.5
- First 3 dense layers, rest MoE (first_k_dense_replace=3)
- GQA with 8 KV heads

Also includes MTP (Multi-Token Prediction) variant support.
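
For readers unfamiliar with this routing scheme, the sketch below shows roughly how sigmoid gating with a selection-only correction bias, top-8 routing, and the routed scaling factor fit together. It is an illustration written for this review, not code from the PR: the function name `glm4_moe_route` and the standalone `e_score_correction_bias` tensor are assumptions, and whether the top-k weights are renormalized depends on the model config.

```python
# Illustrative-only router sketch (not the PR's implementation).
import torch

def glm4_moe_route(router_logits: torch.Tensor,
                   e_score_correction_bias: torch.Tensor,
                   top_k: int = 8,
                   routed_scaling_factor: float = 2.5):
    # router_logits: [num_tokens, num_experts], e.g. num_experts = 160
    scores = torch.sigmoid(router_logits)
    # The correction bias only influences which experts get picked,
    # not the weights that are applied afterwards.
    _, topk_idx = torch.topk(scores + e_score_correction_bias, top_k, dim=-1)
    topk_weight = scores.gather(-1, topk_idx)
    # Renormalize the selected weights (config-dependent), then scale.
    topk_weight = topk_weight / topk_weight.sum(dim=-1, keepdim=True)
    return topk_weight * routed_scaling_factor, topk_idx

if __name__ == "__main__":
    tokens, num_experts = 4, 160
    w, idx = glm4_moe_route(torch.randn(tokens, num_experts),
                            torch.zeros(num_experts))
    print(w.shape, idx.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```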
@gemini-code-assist
Contributor

Summary of Changes

Hello @sufubao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the infrastructure and implementation needed to support the GLM-4.7 Mixture-of-Experts (MoE) model. It integrates the model's distinctive architectural features, such as its attention modifications and hybrid dense/MoE FFN structure, and adds a dedicated Multi-Token Prediction (MTP) variant for speculative decoding. The changes cover model weight loading, inference logic, and parallelism strategies for this model.

Highlights

  • GLM-4.7 MoE Model Support: Full integration of the GLM-4.7 Mixture-of-Experts (MoE) model architecture, including its unique features.
  • Advanced MoE Configuration: Implements 160 routed experts and 1 shared expert with top-8 routing, sigmoid gating, e_score_correction_bias, and a routed_scaling_factor of 2.5.
  • Attention Enhancements: Incorporates QK normalization (RMSNorm on Q and K projections) and partial rotary embeddings (0.5 factor) for improved attention mechanisms.
  • Hybrid FFN Structure: The first 3 layers utilize dense Feed-Forward Networks (FFN), while subsequent layers employ the MoE FFN structure.
  • Multi-Token Prediction (MTP) Variant: Adds support for the MTP variant, designed for speculative decoding, which focuses solely on FFN computations without attention.
  • Parallelism Support: Includes logic for both Tensor Parallelism (TP) and Expert Parallelism (EP) modes for MoE layers, and handles overlapped TP/SP (tensor-parallel plus sequence-parallel) execution for MoE.

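To make the attention-side items above concrete, here is a small illustration of RMSNorm applied to the per-head Q/K vectors followed by rotary embeddings applied to only half of each head dimension (partial rotary factor 0.5). Shapes, head counts, and helper names are assumptions for the example and do not come from the PR; the actual lightllm kernels are fused Triton/CUDA implementations.

```python
# Illustrative-only sketch of QK normalization + partial rotary embeddings.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def rotate_half(x: torch.Tensor):
    # Half-split rotation convention; the interleaved variant is equally common.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_partial_rope(q, k, cos, sin, rotary_dim):
    # Only the first `rotary_dim` channels of each head are rotated;
    # the remaining channels pass through unchanged.
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    k_rot, k_pass = k[..., :rotary_dim], k[..., rotary_dim:]
    q_rot = q_rot * cos + rotate_half(q_rot) * sin
    k_rot = k_rot * cos + rotate_half(k_rot) * sin
    return torch.cat([q_rot, q_pass], -1), torch.cat([k_rot, k_pass], -1)

head_dim, rotary_dim, seq = 128, 64, 16   # partial rotary factor 0.5
q = torch.randn(1, 32, seq, head_dim)     # query heads (count is illustrative)
k = torch.randn(1, 8, seq, head_dim)      # 8 KV heads (GQA)
q = rms_norm(q, torch.ones(head_dim))     # QK normalization per head
k = rms_norm(k, torch.ones(head_dim))
cos, sin = torch.ones(seq, rotary_dim), torch.zeros(seq, rotary_dim)
q, k = apply_partial_rope(q, k, cos, sin, rotary_dim)
```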

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the GLM-4.7 MoE model architecture, including its MTP (Multi-Token Prediction) variant for speculative decoding. The implementation covers the specific features of this model, such as QK normalization, partial rotary embeddings, and the unique MoE structure with shared experts. The code is well-structured, adding new modules for the model, its layers, and weights. My review focuses on ensuring robustness, maintainability, and correctness, especially in the context of distributed execution and quantization. I've identified a few areas for improvement, including a potential issue with quantization configuration loading and a risky class reuse that could affect future maintainability. Overall, this is a solid contribution that extends the framework's model support.

Comment on lines +206 to +207
kv_b_quant_method = self.quant_cfg.get_quant_method(self.layer_num_, "kv_b_proj")
weight_scale_suffix = kv_b_quant_method.weight_scale_suffix


high

The weight name kv_b_proj seems incorrect for getting the quantization method for shared FFN experts. This could be a copy-paste error. It should probably be a name related to an FFN projection, like gate_up_proj, to ensure the correct quantization configuration is used. Using the wrong quantization method could lead to incorrect behavior or errors when quantization is enabled.

Suggested change
- kv_b_quant_method = self.quant_cfg.get_quant_method(self.layer_num_, "kv_b_proj")
- weight_scale_suffix = kv_b_quant_method.weight_scale_suffix
+ ffn_quant_method = self.quant_cfg.get_quant_method(self.layer_num_, "gate_up_proj")
+ weight_scale_suffix = ffn_quant_method.weight_scale_suffix

_0_topk_weight, _0_topk_idx, _0_qinput_tensor = layer_weight.experts.select_experts_and_quant_input(
_0_input1, _0_router_logits
)
from deep_ep import Buffer


medium

The local import of deep_ep can cause an ImportError if the package is not installed, as this execution path might be taken. It's better to handle this possibility, for example by wrapping it in a try...except block, or by using a global flag set during a guarded import at the top of the file, similar to how it's done in other parts of the codebase (e.g., lightllm/common/fused_moe/grouped_fused_moe_ep.py).
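
One way to realize this suggestion is the usual guarded-import pattern sketched below. It is only an illustration of the reviewer's point; the helper name and error message are made up and this is not the PR's code.

```python
# Sketch of a guarded optional import for deep_ep (illustration only).
try:
    from deep_ep import Buffer  # optional dependency for the EP path
    HAS_DEEP_EP = True
except ImportError:
    Buffer = None
    HAS_DEEP_EP = False

def ep_dispatch_or_fail(*args, **kwargs):
    # Hypothetical helper: fail with a clear message instead of a late
    # ImportError when the expert-parallel path is requested without deep_ep.
    if not HAS_DEEP_EP:
        raise RuntimeError(
            "deep_ep is not installed; install it or run the MoE layers in TP mode."
        )
    # ... build and use Buffer here ...
```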

"""

pre_and_post_weight_class = Glm4MoeMTPPreAndPostLayerWeight
pre_layer_infer_class = Deepseek3MTPPreLayerInfer


medium

Reusing Deepseek3MTPPreLayerInfer directly is a bit risky because its methods are type-hinted to expect Deepseek3MTPPreAndPostLayerWeight, but this model uses Glm4MoeMTPPreAndPostLayerWeight. While it may work now due to duck typing (if the attributes match), it's fragile and can break with future changes. It would be more robust to create a Glm4MoeMTPPreLayerInfer class, possibly inheriting from Deepseek3MTPPreLayerInfer or just copying its logic, to ensure type consistency and better maintainability.
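
A minimal version of the suggested dedicated class could look like the sketch below. The import path is a guess at where Deepseek3MTPPreLayerInfer lives, and the empty class body assumes only the type pairing needs to change; this is not code from the PR.

```python
# Sketch only: the module path below is assumed; adjust to the actual location.
from lightllm.models.deepseek_mtp.layer_infer.pre_layer_infer import Deepseek3MTPPreLayerInfer

class Glm4MoeMTPPreLayerInfer(Deepseek3MTPPreLayerInfer):
    """GLM-4.7 MTP pre-layer inference, typed against Glm4MoeMTPPreAndPostLayerWeight."""
    pass

# The model definition would then point at the new class, e.g.:
#   pre_layer_infer_class = Glm4MoeMTPPreLayerInfer
```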

- graph_max_batch_size: 256 -> 512
- httpserver_workers: 1 -> 4
- LIGHTLLM_TRITON_AUTOTUNE_LEVEL: 0 -> 1
@shihaobai shihaobai force-pushed the main branch 2 times, most recently from 0777e28 to c307fdd Compare January 21, 2026 07:50
@sufubao sufubao closed this Jan 21, 2026
