Add hetero grid args and MoE process groups for MIMO example by yashaswikarnati · Pull Request #5375 · NVIDIA/Megatron-LM

yashaswikarnati · 2026-06-16T20:10:59Z

What

Add the heterogeneous-grid CLI args module (add_hetero_grid_args, validate_hetero_grid_args, build_module_grid_specs) for MIMO training, and extend the example topology helper:

topology.py: create additional dense process groups (tp+dp, tp+dp+cp, tp+cp+dp+pp) and set pgc.tp_dp / pgc.tp_dp_cp explicitly. MoE layers/router and finalize_model_grads read tp_dp_cp; cuda-graph capture reads tp_dp; the ProcessGroupCollection leaves them init=False by default.
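To make the combined groups concrete, here is a minimal sketch of how ranks on a grid partition into groups spanning several dims at once (for example tp+dp). It is not the grid implementation behind `grid.get_pg`; the helper name `groups_over`, the dim sizes, and the rank ordering (first dim varies fastest) are all assumptions made for illustration.

```python
from itertools import product


def groups_over(shape, dims):
    """Partition ranks into groups spanning `dims`.

    `shape` maps dim name -> size; the first key varies fastest in the linear
    rank (an assumption of this sketch, not necessarily the real layout).
    """
    strides, stride = {}, 1
    for name, size in shape.items():
        strides[name] = stride
        stride *= size
    fixed = [name for name in shape if name not in dims]
    groups = []
    # One group per setting of the dims that are NOT being combined.
    for fixed_coords in product(*(range(shape[n]) for n in fixed)):
        base = sum(c * strides[n] for n, c in zip(fixed, fixed_coords))
        members = [
            base + sum(c * strides[n] for n, c in zip(dims, coords))
            for coords in product(*(range(shape[n]) for n in dims))
        ]
        groups.append(sorted(members))
    return groups


shape = {"tp": 2, "cp": 1, "pp": 2, "dp": 2}  # 8 ranks, illustrative sizes
tp_dp = groups_over(shape, ["tp", "dp"])
tp_dp_cp = groups_over(shape, ["tp", "dp", "cp"])
everything = groups_over(shape, ["tp", "cp", "dp", "pp"])
```

With cp=1 the tp+dp and tp+dp+cp groups contain the same ranks, and the group over all four dims is the whole 8-rank grid.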

Why

topology.py is the file that landed with MM1 (PR #5331-series); these edits are purely additive. args.py is a new leaf module that imports ModuleGridSpec from topology.py. The grid-args unit test travels with args.py; the existing topology unit test on main is unchanged.
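As a rough illustration of the arg-group pattern that args.py is described as providing, the sketch below registers a group on an existing parser and returns the parser. Only the group title and `--llm-dp` (default 2) appear in this PR's diff context; `--llm-tp`, its default, and the function name `add_grid_args_sketch` are hypothetical stand-ins, not the PR's actual code.

```python
import argparse


def add_grid_args_sketch(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register a small grid arg group and return the parser for chaining."""
    grid = parser.add_argument_group("hetero module grids")
    # Hypothetical flag, included only to make the example non-trivial.
    grid.add_argument("--llm-tp", type=int, default=1,
                      help="Language tensor-parallel size (hypothetical flag).")
    # Flag name, default, and help text as shown in the PR's diff context.
    grid.add_argument("--llm-dp", type=int, default=2,
                      help="Language data-parallel size. Global batch is keyed on this.")
    return parser


parser = add_grid_args_sketch(argparse.ArgumentParser())
args = parser.parse_args(["--llm-tp", "2"])
print(args.llm_tp, args.llm_dp)
```

Returning the parser lets the function be composed with other registrars that follow the same convention.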

Validation

Validated in the 8-GPU 20L Nemotron VLM e2e (trains + checkpoint save/resume, lm loss 12.18->11.54 across resume).

CODEOWNERS

examples/mimo/... + tests/unit_tests/... -> repo default owners.

🤖 Generated with Claude Code

Add the heterogeneous-grid CLI args module (add_hetero_grid_args, validate_hetero_grid_args, build_module_grid_specs) used to configure per-module grids for MIMO training, and extend the example topology helper:

- topology.py: create the additional dense process groups (tp+dp, tp+dp+cp, tp+cp+dp+pp) and set pgc.tp_dp / pgc.tp_dp_cp explicitly. MoE layers/router and finalize_model_grads read tp_dp_cp; cuda-graph capture reads tp_dp; the ProcessGroupCollection leaves them init=False by default.

These edits extend the topology helper that landed with MM1; they are purely additive. The grid-args unit test travels with args.py. The existing topology unit test on main is unchanged.

Validated in the 8-GPU 20L Nemotron VLM e2e (trains + checkpoint save/resume, lm loss 12.18->11.54 across resume).

Signed-off-by: ykarnati <ykarnati@nvidia.com>

copy-pr-bot · 2026-06-16T20:11:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

yashaswikarnati · 2026-06-18T17:27:02Z

+    grid = parser.add_argument_group("hetero module grids")
+
+    # Encoder grid placement + factorization.
+    grid.add_argument("--encoder-offset", type=int, default=0,


We don't need the encoder offset.

yashaswikarnati · 2026-06-18T17:28:35Z

+def validate_hetero_grid_args(args: argparse.Namespace, world_size: int) -> tuple[int, int]:
+    """Validate the disjoint-grid hetero layout. Returns ``(encoder_size, llm_size)``.
+
+    Call AFTER stock ``validate_args`` (so ``micro_batch_size`` / ``num_experts``


The docstrings are verbose and can be written more concisely. Don't explain code in docstrings here; validate is a simple check function.

yashaswikarnati · 2026-06-18T17:31:19Z

+    return parser.parse_args(argv)
+
+
+def _layout_8gpu_20l(**overrides):


Tests that add no value (pure smoke tests) should be avoided. The tests here can be written more concisely.

yashaswikarnati · 2026-06-18T17:32:13Z

    pgc.dp_cp = grid.get_pg(["dp", "cp"])
    pgc.intra_dp_cp = pgc.dp_cp
    pgc.tp_cp = grid.get_pg(["tp", "cp"])
+    # MoE layers/router and finalize_model_grads read tp_dp_cp (tensor+data+context


Remove this comment:

# MoE layers/router and finalize_model_grads read tp_dp_cp (tensor+data+context
# parallel group); cuda-graph capture reads tp_dp. Set them explicitly since the
# ProcessGroupCollection leaves them init=False.

It is verbose and not necessary here.

yashaswikarnati · 2026-06-18T17:33:03Z

+# ``args.vision_encoder_key = "radio_encoder"``; we mirror that default here so
+# the encoder ModuleGridSpec carries the same module name the provider/runtime
+# look up. Resolved at spec-build time via ``getattr(args, "vision_encoder_key")``.
+DEFAULT_ENCODER_MODULE_NAME = "radio_encoder"


Why do we need DEFAULT_ENCODER_MODULE_NAME?

yashaswikarnati · 2026-06-18T17:33:52Z

+def add_hetero_grid_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
+    """Register the hetero parallelism/topology arg group.
+
+    Stock-hook compatible: returns the parser so it can be passed straight to


Don't use the wording "Stock-hook"; "stock" has no meaning here. Also, the current docstring looks very vague and verbose.

yashaswikarnati · 2026-06-18T17:34:26Z

+                      help="Language pipeline-model-parallel size.")
+    grid.add_argument("--llm-dp", type=int, default=2,
+                      help="Language data-parallel size. Global batch is keyed on this.")
+    # MoE expert parallelism for the language grid. Relocated here from the E1


Remove this verbose comment.

yashaswikarnati · 2026-06-18T17:57:35Z

+    return [encoder_spec, language_spec]
+
+
+def _resolve_expt_tp(expt_tp, tp: int) -> int:


Why do we need a special function for this?

yashaswikarnati · 2026-06-18T17:57:57Z

+
+
+def _num_experts(args: argparse.Namespace) -> int:
+    """Resolve MoE expert count from stock (--num-experts) or prototype args."""


Why do we need two args? What do you mean by prototype args?

yashaswikarnati · 2026-06-18T17:58:26Z

+    return 0
+
+
+def _num_microbatches(args: argparse.Namespace) -> int:


Why do we need num microbatches? Why do the args need to pass it?

yashaswikarnati · 2026-06-18T21:42:41Z

@@ -0,0 +1,253 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+
+"""Hetero grid/topology CLI args + validation for the stock-args MIMO examples.


The file-level docstring and comments are verbose.

yashaswikarnati · 2026-06-18T21:45:26Z

+
+    # Sample-based scheduler resolution: derive --train-iters from --train-samples
+    # using the llm_dp-keyed global batch size.
+    if getattr(args, "train_samples", None) is not None:


Why do we need to check this here? Isn't this part of the megatron/training training loop?

yashaswikarnati · 2026-06-18T21:46:35Z

+    """
+    encoder_size, llm_size = validate_hetero_grid_args(args, world_size)
+
+    language_spec = ModuleGridSpec(


Should we call this language_grid_spec to avoid confusion with the module spec we use for model init?

Trim verbose module/function docstrings and comments, remove the unused --encoder-offset arg (the encoder span always starts at rank 0), inline the trivial expert-TP default, read MoE expert count from stock --num-experts only, and rename the local grid specs to language_grid_spec / encoder_grid_spec. Slim the pure-args tests to the value-asserting cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
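The layout check and spec construction discussed in this thread can be pictured with the sketch below. It assumes each module's rank count is the product of its grid dims and that the encoder and language grids tile the world exactly; both are assumptions of this sketch, not confirmed behavior of validate_hetero_grid_args. `GridSpecSketch`, `rank_offset`, `split_world`, `build_specs`, the dim sizes, and the "language" name are hypothetical; only the `name` and `num_ranks` fields, the "radio_encoder" name, and the encoder starting at rank 0 come from the PR.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class GridSpecSketch:
    """Hypothetical stand-in for ModuleGridSpec."""
    name: str
    num_ranks: int
    rank_offset: int  # hypothetical field: first global rank of the module's span


def split_world(encoder_dims, llm_dims, world_size):
    """Return (encoder_size, llm_size) for two disjoint grids, or raise."""
    encoder_size = prod(encoder_dims.values())
    llm_size = prod(llm_dims.values())
    # Assumption of this sketch: the two grids must cover the world exactly.
    if encoder_size + llm_size != world_size:
        raise ValueError(
            f"encoder ({encoder_size}) + llm ({llm_size}) ranks != world size {world_size}"
        )
    return encoder_size, llm_size


def build_specs(encoder_name, llm_name, encoder_dims, llm_dims, world_size):
    """Build [encoder_grid_spec, language_grid_spec]; encoder starts at rank 0."""
    encoder_size, llm_size = split_world(encoder_dims, llm_dims, world_size)
    return [
        GridSpecSketch(encoder_name, encoder_size, 0),
        GridSpecSketch(llm_name, llm_size, encoder_size),
    ]


specs = build_specs(
    "radio_encoder", "language",
    {"tp": 2, "dp": 2}, {"tp": 2, "pp": 1, "dp": 2},
    world_size=8,
)
```

Because the encoder span is fixed at rank 0, the language offset follows directly from the encoder size, which is one way to see why a separate offset argument would be redundant.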

Drop the _num_microbatches helper and key the --train-samples to --train-iters conversion off the explicit --global-batch-size. num_microbatches is derived by the stock calculator (gbs / (mbs * llm_dp)), so the redundant --num-microbatches read is removed here. The conversion stays: stock validate_args does not derive train_iters from train_samples and the MIMO loop reads args.train_iters.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
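The two formulas named in this commit message can be worked through with a small sketch. The function names are hypothetical, and the divisibility check is an assumption of this sketch rather than a statement about the stock calculator's exact behavior.

```python
import math


def num_microbatches(global_batch_size, micro_batch_size, llm_dp):
    """gbs / (mbs * llm_dp), as described in the commit message."""
    samples_per_pass = micro_batch_size * llm_dp
    # Assumption: reject batch sizes that do not divide evenly.
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global batch size must be divisible by mbs * llm_dp")
    return global_batch_size // samples_per_pass


def train_iters_from_samples(train_samples, global_batch_size):
    """ceil(train_samples / gbs), matching the diff hunk quoted in review."""
    if not global_batch_size or global_batch_size <= 0:
        raise ValueError("--train-samples requires a positive --global-batch-size")
    return math.ceil(train_samples / global_batch_size)


# Illustrative numbers: gbs=16, mbs=2, llm_dp=2, 1000 training samples.
microbatches = num_microbatches(16, 2, 2)
iters = train_iters_from_samples(1000, 16)
```

Here 16 / (2 * 2) gives 4 microbatches per step, and 1000 / 16 = 62.5 rounds up to 63 iterations.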

yashaswikarnati · 2026-06-19T00:42:23Z

+        validate_hetero_grid_args(args, WORLD_SIZE_8)
+
+
+def test_cp_must_be_one():


yashaswikarnati · 2026-06-19T00:42:28Z

+        validate_hetero_grid_args(args, WORLD_SIZE_8)
+
+
+def test_llm_only_requires_offset_zero():


yashaswikarnati · 2026-06-19T00:42:38Z

+    assert specs[0].name == MIMO_LANGUAGE_MODULE_KEY
+
+
+def test_train_samples_resolves_iters():


…aram

The caller (model provider) owns the encoder module name. Drop the duplicate DEFAULT_ENCODER_MODULE_NAME constant and the dead vision_encoder_key getattr; RADIO_ENCODER_MODULE_NAME (radio_encoder.py) is now the single source.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>

yashaswikarnati · 2026-06-19T01:12:45Z

+        gbs = getattr(args, "global_batch_size", None)
+        if not gbs or gbs <= 0:
+            raise ValueError("--train-samples requires a positive --global-batch-size")
+        args.train_iters = math.ceil(args.train_samples / gbs)


Why are we updating train iters ourselves, in args? How is the Megatron train loop supposed to handle this?

yashaswikarnati · 2026-06-19T01:13:45Z

+
+    Returns ``[encoder_grid_spec, language_grid_spec]`` (or just the language spec
+    when ``--llm-only``). The caller supplies ``encoder_module_name`` (the model
+    provider owns it). ``num_ranks`` is the ground truth ModuleGridSpec field;


Verbose commentary/docstring; not required.

The conversion belongs to the training loop, not grid-layout validation. Stock update_train_iters owns it; the MIMO entry invokes it. validate_hetero_grid_args is now purely about grid layout. Also trim the build_module_grid_specs docstring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>

yashaswikarnati mentioned this pull request Jun 16, 2026

Add hetero MIMO entrypoint, bootstrap, and mock data (integration) #5377

Draft
