route collectives through torchcomms by tushar00jain · Pull Request #5385 · NVIDIA/Megatron-LM

tushar00jain · 2026-06-16T23:58:15Z

Summary:

Route Megatron-LM collectives through PyTorch TorchComms

Summary

This PR makes Megatron-LM's process-group setup compatible with torchcomms (https://meta-pytorch.org/torchcomms/main/index.html), so every torch.distributed collective, on both the NCCL device path and the Gloo CPU path, can be routed through TorchComms by enabling PyTorch's torch.distributed.config.use_torchcomms (env TORCH_DISTRIBUTED_USE_TORCHCOMMS).

No call site switches to a new API. Existing new_group / init_process_group calls route through TorchComms' split_group path automatically when the flag is on, so the change set is small and the default (NCCL/Gloo ProcessGroup) path is untouched when the flag is off.

Motivation

1. Migration to torchcomms

torchcomms is the modern PyTorch communications library designed to replace the legacy ProcessGroup + Backend abstraction. We want Megatron-LM to be able to run its entire distributed stack over torchcomms with only an environment-variable flip, as a step toward adopting it as the default collective backend.

2. Minimal, reversible change

Keeping new_group (rather than calling split_group directly) means the diff is small, the non-torchcomms path is byte-for-byte unchanged, and the whole behavior is gated behind a single env var.

3. No silent config loss

Where split_group would drop ProcessGroupNCCL.Options on the floor, we translate the relevant knobs (is_high_priority_stream, cga_cluster_size, max_ctas, min_ctas) into TorchComms' CommOptions.hints and build a standalone comm so they're actually honored.
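To illustrate the translation described above, here is a minimal sketch that flattens the four knobs into a string-to-string hints mapping. The stand-in dataclasses, the hint key names, and the assumption that hints are string-valued are all hypothetical and are not taken from the PR diff or the TorchComms API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NcclConfigStandIn:
    """Hypothetical stand-in for the NCCL config knobs named in this PR."""
    cga_cluster_size: Optional[int] = None
    max_ctas: Optional[int] = None
    min_ctas: Optional[int] = None


@dataclass
class NcclOptionsStandIn:
    """Hypothetical stand-in for ProcessGroupNCCL.Options."""
    is_high_priority_stream: bool = False
    config: NcclConfigStandIn = field(default_factory=NcclConfigStandIn)


def options_to_hints(opts: NcclOptionsStandIn) -> Dict[str, str]:
    """Flatten the knobs that are set into a str -> str mapping.

    The key names are illustrative only; the real keys are whatever
    the TorchComms backend expects.
    """
    hints: Dict[str, str] = {}
    if opts.is_high_priority_stream:
        hints["high_priority_stream"] = "1"  # hypothetical key name
    for knob in ("cga_cluster_size", "max_ctas", "min_ctas"):
        value = getattr(opts.config, knob)
        if value is not None:  # unset knobs are left out entirely
            hints[knob] = str(value)
    return hints
```

Unset knobs are omitted rather than forwarded as defaults, so the backend's own defaults still apply to anything the caller did not configure.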

What changed

`megatron/core/parallel_state.py` — torchcomms-compatible group creation

Torchcomms routes new_group through split_group, which requires (a) the parent PG to be eagerly device-bound (bound_device_id) and (b) the backend filter handed to subgroups to be device-qualified and to include the parent's default device backend.
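One way to read requirement (b) is sketched below: a bare backend name is rewritten into device:backend form, and the parent's default device backend is appended if it is missing. The helper name, the device table, and the cuda:nccl default are assumptions for illustration and are not the code in parallel_state.py.

```python
# Hypothetical mapping from a bare backend name to the device it serves.
DEFAULT_DEVICE_FOR_BACKEND = {"gloo": "cpu", "nccl": "cuda"}


def qualify_backend(requested: str, parent_default: str = "cuda:nccl") -> str:
    """Return a device-qualified backend string that includes the parent default."""
    parts = []
    for item in requested.split(","):
        item = item.strip()
        if ":" not in item:
            # Bare name such as "gloo": prefix it with its device.
            item = f"{DEFAULT_DEVICE_FOR_BACKEND[item]}:{item}"
        parts.append(item)
    if parent_default not in parts:
        # Keep the parent's default device backend in the filter.
        parts.append(parent_default)
    return ",".join(parts)
```

Under these assumptions, a request for "gloo" becomes "cpu:gloo,cuda:nccl", while an already qualified string passes through unchanged.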

`megatron/training/initialize.py` — eager device-bound world PG

_initialize_distributed now, when torchcomms is enabled and a CUDA device_id exists:

- Seeds TORCHCOMM_RANK / TORCHCOMM_SIZE for the TorchComms bootstrap.
- Inits the world PG with backend='cpu:gloo,cuda:nccl' and device_id=… so the parent is eagerly device-bound.
- Issues dist.barrier(device_ids=[device_id.index]) immediately after init as a defensive eager-init flush. device_id alone sets bound_device_id (which split_group checks) but the underlying NCCL comm is still created lazily on first collective; the no-op device barrier forces that creation, sidestepping the intermittent init-time hang documented in pytorch/pytorch#153960. This costs one collective at boot, which is essentially free.
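The three steps above can be sketched roughly as follows. This is an illustration and not the code in _initialize_distributed: the function names are hypothetical, setdefault is an assumption about how the variables are seeded, and the rendezvous is assumed to come from a torchrun-style environment (MASTER_ADDR / MASTER_PORT already set).

```python
import os

TORCHCOMMS_BACKEND = "cpu:gloo,cuda:nccl"


def seed_torchcomms_env(rank: int, world_size: int, env=os.environ) -> None:
    """Seed the bootstrap variables without overwriting existing values."""
    env.setdefault("TORCHCOMM_RANK", str(rank))
    env.setdefault("TORCHCOMM_SIZE", str(world_size))


def init_world_pg(rank: int, world_size: int, local_rank: int) -> None:
    """Create an eagerly device-bound world process group (needs CUDA)."""
    # Imported lazily so the env helper above is usable without torch.
    import torch
    import torch.distributed as dist

    device_id = torch.device("cuda", local_rank)
    torch.cuda.set_device(device_id)
    seed_torchcomms_env(rank, world_size)
    dist.init_process_group(
        backend=TORCHCOMMS_BACKEND,
        rank=rank,
        world_size=world_size,
        device_id=device_id,  # binds the parent PG to this device
    )
    # Force creation of the underlying communicator right away.
    dist.barrier(device_ids=[device_id.index])
```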

`megatron/core/process_groups_config.py` — singleton group inheritance

The singleton expt_dp_group now routes through parallel_state.create_group(...) so it picks up the same backend-qualification and torchcomms routing as every other group.

`tests/unit_tests/test_utilities.py` — Utils.initialize_distributed mirror

Utils.initialize_distributed is the test-side analogue of _initialize_distributed. Under torchcomms it now inits with backend='cpu:gloo,cuda:nccl', passes device_id, seeds TORCHCOMM_RANK/SIZE, and barriers — so unit tests that subsequently ask for a backend='gloo' subgroup don't trip split_group's "Requested backend for device 'cpu' is not present in the parent" error. With torchcomms off it keeps the original backend='nccl' path.
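To show why a parent created with backend='nccl' rejects a later Gloo subgroup, here is a simplified model of the parent check. The parsing rules and helper names are assumptions and deliberately cruder than PyTorch's own backend handling; only the error text is taken from the description above.

```python
from typing import Dict

# Simplified assumption: each bare backend name serves exactly one device.
SINGLE_BACKEND_DEVICES = {"nccl": {"cuda": "nccl"}, "gloo": {"cpu": "gloo"}}


def parse_backend_config(backend: str) -> Dict[str, str]:
    """Turn 'cpu:gloo,cuda:nccl' or a bare name into a device -> backend map."""
    if ":" not in backend:
        return dict(SINGLE_BACKEND_DEVICES[backend])
    return dict(pair.strip().split(":") for pair in backend.split(","))


def check_subgroup_backend(parent_backend: str, device: str, name: str) -> None:
    """Raise if the parent has no matching backend for the requested device."""
    parent = parse_backend_config(parent_backend)
    if parent.get(device) != name:
        raise ValueError(
            f"Requested backend for device '{device}' is not present in the parent"
        )
```

In this model, a parent initialized with 'cpu:gloo,cuda:nccl' accepts a ('cpu', 'gloo') subgroup request, while a parent initialized with plain 'nccl' raises.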

Tests

Validated on a 4 × H100 (Hopper) host against PyTorch + torchcomms nightlies.

1. End-to-end smokes (smoke_*.py)
2. CI unit-test subset (CI_LIGHT)
3. No regression with the flag off

All of the above pass with TORCH_DISTRIBUTED_USE_TORCHCOMMS=0, using the standard ProcessGroupNCCL / ProcessGroupGloo backends.

Rollback / gating

The whole change is gated behind TORCH_DISTRIBUTED_USE_TORCHCOMMS. It is a no-op unless the torchcomms package is installed (torch's _use_torchcomms_enabled() also checks availability), and it can be disabled at any time with TORCH_DISTRIBUTED_USE_TORCHCOMMS=0 without touching code — the default NCCL/Gloo ProcessGroup path is unchanged.
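The gating logic can be pictured as the conjunction of two checks, sketched below. This is a stand-in and not torch's _use_torchcomms_enabled(): the function names are hypothetical, and treating only the value "1" as enabled is an assumption about how the variable is parsed.

```python
import importlib.util
import os
from typing import Optional


def torchcomms_requested(env=os.environ) -> bool:
    """True when the env flag is set to "1" (assumed parsing rule)."""
    return env.get("TORCH_DISTRIBUTED_USE_TORCHCOMMS", "0").strip() == "1"


def torchcomms_active(env=os.environ, is_installed: Optional[bool] = None) -> bool:
    """The flag only takes effect when the torchcomms package is importable."""
    if is_installed is None:
        is_installed = importlib.util.find_spec("torchcomms") is not None
    return torchcomms_requested(env) and is_installed
```

With the flag unset or set to 0, or with the package missing, the sketch returns False and the caller would stay on the default NCCL/Gloo ProcessGroup path.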

Signed-off-by: Tushar Jain <tushar00jain@users.noreply.github.com>

copy-pr-bot · 2026-06-16T23:58:18Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

guihong-nv · 2026-06-17T01:59:39Z

Thanks for opening this PR.

Could you please update the PR description to follow the Megatron-LM pull request template? The current PR body is empty, so reviewers do not have the required summary, issue tracking, testing, documentation, and pre-check information.

A few specific items to address:

- Fill out the PR template, including the overview and pre-checks: https://github.com/NVIDIA/Megatron-LM/blob/main/.github/pull_request_template.md#L1-L25
- If this is a new feature, link the required feature request issue: https://github.com/NVIDIA/Megatron-LM/blob/main/.github/pull_request_template.md#L8-L15
- Please make sure the contribution follows the code submission policy: https://github.com/NVIDIA/Megatron-LM/blob/main/docs/developer/contribute.md#L31-L49
- Please also sign off your commit as required by the DCO policy: https://github.com/NVIDIA/Megatron-LM/blob/main/docs/developer/contribute.md#L50-L56

Once those are updated, we can continue review. If the PR remains without the required template/policy information, we may need to close it until it is resubmitted or updated accordingly.


github-actions Bot added the community-request label Jun 16, 2026

tushar00jain changed the title ~~enable torchcomms~~ route collectives through torchcomms Jun 17, 2026

tushar00jain force-pushed the pr5385 branch from ea78542 to 7c8e4a0 · June 17, 2026 17:47

tushar00jain wants to merge 1 commit into NVIDIA:main from tushar00jain:pr5385
