vllm metrics docs#1662
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces documentation for vLLM engine metrics, including a new MDX file and an update to the documentation metadata. The documentation covers enabling metrics, support across different inference paths, and specific metrics logged to external backends like wandb. Feedback was provided regarding several internal file links that use repository-root relative paths, which may result in broken links in some Markdown renderers; it is recommended to use relative paths from the documentation file instead.
| | Old inference + `generator.async_engine=true` | Yes | | ||
| | Old inference + `generator.async_engine=false` | **No** | | ||
|
|
||
| The new inference path ([vllm_server_actor.py:329-339](skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py#L329-L339)) |
There was a problem hiding this comment.
The link path skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py is relative to the repository root. In many Markdown renderers (including GitHub), this will be interpreted as relative to the current directory (docs/content/docs/checkpointing-logging/), which will result in a broken link. Consider using a relative path from this file (e.g., ../../../../skyrl/...).
The new inference path ([vllm_server_actor.py:329-339](../../../../skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py#L329-L339))
| always uses `AsyncLLMEngine` and wires the stat logger unconditionally. | ||
|
|
||
| The legacy path supports it only when `async_engine=true` | ||
| ([vllm_engine.py:359-370](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)). |
| The legacy path supports it only when `async_engine=true` | ||
| ([vllm_engine.py:359-370](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)). | ||
| The synchronous `VLLMInferenceEngine` pops the flag and emits a warning | ||
| ([vllm_engine.py:240-247](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L240-L247)): |
| ## Metrics logged to wandb | ||
|
|
||
| When the flag is on, the trainer constructs a `VLLMMetricsScraper` | ||
| ([trainer.py:122-124](skyrl/train/trainer.py#L122-L124)) that scrapes every |
|
|
||
| The full set of vLLM metrics is still available via the Prometheus endpoints | ||
| themselves — only this curated subset is forwarded to wandb. The selection | ||
| lives in [vllm_metrics_scraper.py:27-51](skyrl/train/utils/vllm_metrics_scraper.py#L27-L51). |
SumanthRH
left a comment
There was a problem hiding this comment.
Looking good, some minor comments
| in Prometheus text format. On Anyscale this feeds the hosted Prometheus + | ||
| Grafana stack with no extra setup. |
There was a problem hiding this comment.
| in Prometheus text format. On Anyscale this feeds the hosted Prometheus + | |
| Grafana stack with no extra setup. | |
| in Prometheus text format. |
| The full set of vLLM metrics is still available via the Prometheus endpoints | ||
| themselves — only this curated subset is forwarded to wandb. The selection | ||
| lives in [vllm_metrics_scraper.py:27-51](skyrl/train/utils/vllm_metrics_scraper.py#L27-L51). |
There was a problem hiding this comment.
Can you provide an example for querying KV Cache Residency metrics like Lifetime here?
| @@ -2,6 +2,7 @@ | |||
| "title": "Checkpointing and Logging", | |||
There was a problem hiding this comment.
| "title": "Checkpointing and Logging", | |
| "title": "Checkpointing and Observability", |
add docs for vllm metrics. Also, ran /examples/train/megatron/run_megatron_dapo_qwen3.5_35b_a3b.sh to ensure that mp backend also logs metrics correctly:
https://wandb.ai/sky-posttraining-uc-berkeley/qwen3_5_dapo/runs/mhq1q8aa?nw=nwuserahao