Skip to content

vllm metrics docs#1662

Open
hao-aaron wants to merge 1 commit into
NovaSky-AI:mainfrom
hao-aaron:vllm-metrics-docs
Open

vllm metrics docs#1662
hao-aaron wants to merge 1 commit into
NovaSky-AI:mainfrom
hao-aaron:vllm-metrics-docs

Conversation

@hao-aaron
Copy link
Copy Markdown
Collaborator

add docs for vllm metrics. Also, ran /examples/train/megatron/run_megatron_dapo_qwen3.5_35b_a3b.sh to ensure that mp backend also logs metrics correctly:
https://wandb.ai/sky-posttraining-uc-berkeley/qwen3_5_dapo/runs/mhq1q8aa?nw=nwuserahao

x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces documentation for vLLM engine metrics, including a new MDX file and an update to the documentation metadata. The documentation covers enabling metrics, support across different inference paths, and specific metrics logged to external backends like wandb. Feedback was provided regarding several internal file links that use repository-root relative paths, which may result in broken links in some Markdown renderers; it is recommended to use relative paths from the documentation file instead.

| Old inference + `generator.async_engine=true` | Yes |
| Old inference + `generator.async_engine=false` | **No** |

The new inference path ([vllm_server_actor.py:329-339](skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py#L329-L339))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The link path skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py is relative to the repository root. In many Markdown renderers (including GitHub), this will be interpreted as relative to the current directory (docs/content/docs/checkpointing-logging/), which will result in a broken link. Consider using a relative path from this file (e.g., ../../../../skyrl/...).

The new inference path ([vllm_server_actor.py:329-339](../../../../skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py#L329-L339))

always uses `AsyncLLMEngine` and wires the stat logger unconditionally.

The legacy path supports it only when `async_engine=true`
([vllm_engine.py:359-370](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The link path is relative to the repository root, which may result in a broken link. Consider using a relative path from this file.

([vllm_engine.py:359-370](../../../../skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)).

The legacy path supports it only when `async_engine=true`
([vllm_engine.py:359-370](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)).
The synchronous `VLLMInferenceEngine` pops the flag and emits a warning
([vllm_engine.py:240-247](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L240-L247)):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The link path is relative to the repository root, which may result in a broken link. Consider using a relative path from this file.

([vllm_engine.py:240-247](../../../../skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L240-L247)):

## Metrics logged to wandb

When the flag is on, the trainer constructs a `VLLMMetricsScraper`
([trainer.py:122-124](skyrl/train/trainer.py#L122-L124)) that scrapes every
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The link path is relative to the repository root, which may result in a broken link. Consider using a relative path from this file.

([trainer.py:122-124](../../../../skyrl/train/trainer.py#L122-L124)) that scrapes every


The full set of vLLM metrics is still available via the Prometheus endpoints
themselves — only this curated subset is forwarded to wandb. The selection
lives in [vllm_metrics_scraper.py:27-51](skyrl/train/utils/vllm_metrics_scraper.py#L27-L51).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The link path is relative to the repository root, which may result in a broken link. Consider using a relative path from this file.

lives in [vllm_metrics_scraper.py:27-51](../../../../skyrl/train/utils/vllm_metrics_scraper.py#L27-L51).

Copy link
Copy Markdown
Member

@SumanthRH SumanthRH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, some minor comments

Comment on lines +23 to +24
in Prometheus text format. On Anyscale this feeds the hosted Prometheus +
Grafana stack with no extra setup.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
in Prometheus text format. On Anyscale this feeds the hosted Prometheus +
Grafana stack with no extra setup.
in Prometheus text format.

Comment on lines +70 to +72
The full set of vLLM metrics is still available via the Prometheus endpoints
themselves — only this curated subset is forwarded to wandb. The selection
lives in [vllm_metrics_scraper.py:27-51](skyrl/train/utils/vllm_metrics_scraper.py#L27-L51).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide an example for querying KV Cache Residency metrics like Lifetime here?

@@ -2,6 +2,7 @@
"title": "Checkpointing and Logging",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"title": "Checkpointing and Logging",
"title": "Checkpointing and Observability",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants