vllm metrics docs by hao-aaron · Pull Request #1662 · NovaSky-AI/SkyRL

hao-aaron · 2026-05-13T19:35:09Z

Add docs for vLLM metrics. Also ran /examples/train/megatron/run_megatron_dapo_qwen3.5_35b_a3b.sh to confirm that the mp backend logs metrics correctly:
https://wandb.ai/sky-posttraining-uc-berkeley/qwen3_5_dapo/runs/mhq1q8aa?nw=nwuserahao

Signed-off-by: ahao-anyscale <ahao@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces documentation for vLLM engine metrics, including a new MDX file and an update to the documentation metadata. The documentation covers enabling metrics, support across different inference paths, and specific metrics logged to external backends like wandb. Feedback was provided regarding several internal file links that use repository-root relative paths, which may result in broken links in some Markdown renderers; it is recommended to use relative paths from the documentation file instead.

gemini-code-assist · 2026-05-13T19:37:02Z

+| Old inference + `generator.async_engine=true`   | Yes       |
+| Old inference + `generator.async_engine=false`  | **No**    |
+
+The new inference path ([vllm_server_actor.py:329-339](skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py#L329-L339))


The link path skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py is relative to the repository root. In many Markdown renderers (including GitHub), this will be interpreted as relative to the current directory (docs/content/docs/checkpointing-logging/), which will result in a broken link. Consider using a relative path from this file (e.g., ../../../../skyrl/...).

The new inference path ([vllm_server_actor.py:329-339](../../../../skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py#L329-L339))
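The `../../../../` prefix in the suggestion can be checked mechanically. The sketch below only does path arithmetic on the two paths quoted in this comment; whether the docs site resolves links the same way as GitHub's Markdown renderer is an assumption it does not test.

```python
import posixpath

# Directory containing the new MDX file, as quoted in the review comment.
DOC_DIR = "docs/content/docs/checkpointing-logging"

# Link target as written in the PR (relative to the repository root).
TARGET = "skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py"


def link_from_doc(target: str, doc_dir: str = DOC_DIR) -> str:
    """Return `target` rewritten as a path relative to `doc_dir`."""
    return posixpath.relpath(target, start=doc_dir)


relative_link = link_from_doc(TARGET)
print(relative_link)
```

The same helper applies to the other four links flagged in this review, since they all originate from the same documentation directory.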

gemini-code-assist · 2026-05-13T19:37:02Z

+always uses `AsyncLLMEngine` and wires the stat logger unconditionally.
+
+The legacy path supports it only when `async_engine=true`
+([vllm_engine.py:359-370](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)).


The link path is relative to the repository root, which may result in a broken link. Consider using a relative path from this file.

([vllm_engine.py:359-370](../../../../skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)).

gemini-code-assist · 2026-05-13T19:37:02Z

+The legacy path supports it only when `async_engine=true`
+([vllm_engine.py:359-370](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)).
+The synchronous `VLLMInferenceEngine` pops the flag and emits a warning
+([vllm_engine.py:240-247](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L240-L247)):


The link path is relative to the repository root, which may result in a broken link. Consider using a relative path from this file.

([vllm_engine.py:240-247](../../../../skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L240-L247)):

gemini-code-assist · 2026-05-13T19:37:03Z

+## Metrics logged to wandb
+
+When the flag is on, the trainer constructs a `VLLMMetricsScraper`
+([trainer.py:122-124](skyrl/train/trainer.py#L122-L124)) that scrapes every


The link path is relative to the repository root, which may result in a broken link. Consider using a relative path from this file.

([trainer.py:122-124](../../../../skyrl/train/trainer.py#L122-L124)) that scrapes every

gemini-code-assist · 2026-05-13T19:37:03Z

+
+The full set of vLLM metrics is still available via the Prometheus endpoints
+themselves — only this curated subset is forwarded to wandb. The selection
+lives in [vllm_metrics_scraper.py:27-51](skyrl/train/utils/vllm_metrics_scraper.py#L27-L51).


The link path is relative to the repository root, which may result in a broken link. Consider using a relative path from this file.

lives in [vllm_metrics_scraper.py:27-51](../../../../skyrl/train/utils/vllm_metrics_scraper.py#L27-L51).

SumanthRH

Looking good, some minor comments

SumanthRH · 2026-05-14T00:05:27Z

+in Prometheus text format. On Anyscale this feeds the hosted Prometheus +
+Grafana stack with no extra setup.


Suggested change

-in Prometheus text format. On Anyscale this feeds the hosted Prometheus +
-Grafana stack with no extra setup.
+in Prometheus text format.

SumanthRH · 2026-05-14T00:06:17Z

+The full set of vLLM metrics is still available via the Prometheus endpoints
+themselves — only this curated subset is forwarded to wandb. The selection
+lives in [vllm_metrics_scraper.py:27-51](skyrl/train/utils/vllm_metrics_scraper.py#L27-L51).


Can you provide an example for querying KV Cache Residency metrics like Lifetime here?
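A minimal sketch of what such an example might look like, assuming the metric is exposed as a Prometheus histogram in text format. The metric name `example:kv_block_lifetime_seconds`, the sample payload, and the URL in the comment are hypothetical placeholders, not names taken from vLLM or SkyRL; the real names should be read from the endpoint's own output.

```python
from typing import Dict, Optional
from urllib.request import urlopen

# Hypothetical payload in Prometheus text format (not real vLLM output).
SAMPLE = """\
# HELP example:kv_block_lifetime_seconds Hypothetical KV block lifetime histogram.
# TYPE example:kv_block_lifetime_seconds histogram
example:kv_block_lifetime_seconds_bucket{engine="0",le="1.0"} 2
example:kv_block_lifetime_seconds_bucket{engine="0",le="+Inf"} 4
example:kv_block_lifetime_seconds_sum{engine="0"} 10.0
example:kv_block_lifetime_seconds_count{engine="0"} 4
example:num_requests_running{engine="0"} 3
"""


def fetch_metrics(url: str) -> str:
    """Download the raw text from a metrics endpoint (URL supplied by the caller)."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text: str) -> Dict[str, float]:
    """Map each 'name{labels}' series to its value, skipping comments."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            head, _, tail = line.rpartition("}")
            key = head + "}"
        else:
            key, _, tail = line.partition(" ")
        # The first token after the series name is the value.
        samples[key] = float(tail.split()[0])
    return samples


def histogram_mean(samples: Dict[str, float], name: str) -> Optional[float]:
    """Mean of a histogram across all label sets: sum(_sum) / sum(_count)."""
    total = sum(v for k, v in samples.items() if k.split("{")[0] == name + "_sum")
    count = sum(v for k, v in samples.items() if k.split("{")[0] == name + "_count")
    return total / count if count else None


samples = parse_samples(SAMPLE)
mean_lifetime = histogram_mean(samples, "example:kv_block_lifetime_seconds")
print(mean_lifetime)
# Against a live server, something like:
#   samples = parse_samples(fetch_metrics("http://<host>:<port>/metrics"))
```

Dividing `_sum` by `_count` gives the mean since the process started; a windowed value would need two scrapes and the difference between them.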

SumanthRH · 2026-05-14T00:06:47Z

@@ -2,6 +2,7 @@
  "title": "Checkpointing and Logging",


Suggested change

-"title": "Checkpointing and Logging",
+"title": "Checkpointing and Observability",

x

c4074f7

Signed-off-by: ahao-anyscale <ahao@anyscale.com>

gemini-code-assist Bot reviewed May 13, 2026


SumanthRH requested changes May 14, 2026

hao-aaron wants to merge 1 commit into NovaSky-AI:main from hao-aaron:vllm-metrics-docs
