Add Nemotron6-MoE VLM model provider for MIMO example by yashaswikarnati · Pull Request #5374 · NVIDIA/Megatron-LM

yashaswikarnati · 2026-06-16T20:10:41Z

What

Add the Nemotron6-MoE vision-language model provider for the MIMO heterogeneous training example, built on stock Megatron args. The RADIO encoder (wrapper + pixel shuffle + vision config + encoder-spec builder + RADIO args) now lives in its own leaf module examples/mimo/model_providers/radio_encoder.py and is imported here. This PR adds the language MambaModel spec, the affine projector spec (core ColumnParallelLinear for fc1 — the affine path instantiates fc1 with gather_output=True, which TE column-parallel linears reject), and the per-PP-rank model assembly. A unit test travels with it.

Stacking

Stacked on top of #5397 (RADIO encoder leaf) — merge #5397 first. GitHub can't set this PR's base to a fork-only branch, so the base stays main and the #5397 commit rides along until it merges; rebasing onto main afterward leaves only the provider diff.

Why

language_model_spec / vision_submodules_spec are the provider-definition leaf with no in-PR caller; their consumers land in the integration PR. Kept here so the provider can be reviewed/merged independently.

CODEOWNERS

examples/mimo/... + tests/unit_tests/... -> repo default owners (no dedicated group).

🤖 Generated with Claude Code

copy-pr-bot · 2026-06-16T20:10:45Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yashaswikarnati · 2026-06-16T21:40:22Z

+
+"""Nemotron6-MoE VLM model provider for hetero MIMO examples (stock-args edition).
+
+This is PR-E1 of the NMFW-516 hetero-MIMO upstreaming effort. It ports the


remove - This is PR-E1 of the NMFW-516 hetero-MIMO upstreaming effort. It ports the
prototype provider (sasatheesh/pre-vlm-07) onto Megatron's stock argument
system (megatron.training.arguments). and all verbose comments

no mentions of prototypes

yashaswikarnati · 2026-06-16T21:41:42Z

+prototype provider (sasatheesh/pre-vlm-07) onto Megatron's *stock* argument
+system (``megatron.training.arguments``).
+
+The crucial difference from the prototype: stock ``arguments.py`` already


remove differences commentary

yashaswikarnati · 2026-06-16T21:43:24Z

+    TERowParallelLinear = None
+
+MOCK_MODEL_PROVIDER = "mock"
+NEMOTRON_20L_MODEL_PROVIDER = "nemotron-moe-vlm-20l"


do we need seperate 20l vs 54l providers ? we can control from the layer pattern alone ? also why do we need all the max num tiles/default stage etc global variables ?

yashaswikarnati · 2026-06-16T21:44:04Z

+
+
+# ---------------------------------------------------------------------------
+# Small process-group / grid helpers.


remove verbose comments

yashaswikarnati · 2026-06-16T21:45:16Z

+                          help="Total image tokens emitted by the encoder per sample.")
+    provider.add_argument("--image-token-id", type=int, default=511,
+                          help="Vocab id of the placeholder image token.")
+    provider.add_argument("--pad-token-id", type=int, default=0)


how would this match what the tokenizer knows ? also how does megatron train loop uses/derives pad token id ?

yashaswikarnati · 2026-06-16T21:49:45Z

+        ),
+    )
+    provider.add_argument(
+        "--dynamic-resolution-min-patches",


lets check what all of these args are we using/useful. like which downstream code paths use these args ?

yashaswikarnati · 2026-06-16T21:50:52Z

+        help="Projection module from frozen vision features to language hidden size.",
+    )
+    provider.add_argument(
+        "--training-stage", choices=["stage1", "stage2", "stage3"], default=None


isnt training stages and then again freeze vit etc redundant ? which one does the train loop and model init respect ?

yashaswikarnati · 2026-06-16T21:53:29Z

+        raise ValueError("--pad-token-id must be within the vocabulary")
+
+
+def _pixel_shuffle_dynamic_res(x, imgs_sizes, patch_dim, scale_factor=0.5, version=2):


doesn't the main llava model path already have pixel shiffle and radio encoder already implemented ? why are we redefinign them here again >

yashaswikarnati · 2026-06-17T04:42:34Z

+    args.use_thumbnail = not args.dynamic_resolution
+
+
+def apply_training_stage(args: argparse.Namespace) -> None:


we dont need both stages and freeze vit, freeze lm args. we can have freeze ones only as part of this particular model provider args ?

yashaswikarnati · 2026-06-17T04:44:37Z

+        """Expose the underlying RADIO config for DDP wrapping."""
+        return self.radio_model.config
+
+    def forward(


is this not present on main already ?

yashaswikarnati · 2026-06-17T22:45:41Z

+except ImportError:  # pragma: no cover - TE always present in the CI container
+    TERowParallelLinear = None
+
+MOCK_MODEL_PROVIDER = "mock"


why do we have MOCK_MODEL_PROVIDER ? whats the use ?

yashaswikarnati · 2026-06-17T22:46:17Z

+
+MOCK_MODEL_PROVIDER = "mock"
+NEMOTRON_MODEL_PROVIDER = "nemotron-moe-vlm"
+NEMOTRON_IMAGE_SEQ_PER_TILE = 256


why do we have these as part of model provider, what/ehy exactly are they used ? NEMOTRON_IMAGE_SEQ_PER_TILE = 256
NEMOTRON_MAX_NUM_TILES = 12
NEMOTRON_DEFAULT_STAGE = "stage2"
NEMOTRON_SEQ_LENGTH = 8192
NEMOTRON_VISION_ENCODER_KEY = "radio_encoder"

yashaswikarnati · 2026-06-17T22:46:39Z

+
+
+# Process-group / grid helpers.
+def get_grid_dim_size(grid: HyperCommGrid, dim: str) -> int:


does hypercomm grid already have helper function for this ?

yashaswikarnati · 2026-06-17T22:47:14Z

+    return rank if rank >= 0 else fallback
+
+
+def is_process_group_member(pg) -> bool:


we might have defined these helper functions at multiple palces so far i guess, can we check

yashaswikarnati · 2026-06-17T22:47:21Z

+    return pg is not None and dist.get_rank(group=pg) >= 0
+
+
+def is_nemotron_moe_vlm(args: argparse.Namespace) -> bool:


why do we need this ?

the entire provider we have here is for nemotron vlm ? why we need is_nemotron_vlm ?

yashaswikarnati · 2026-06-17T22:48:29Z

+    provider.add_argument("--pad-token-id", type=int, default=0)
+    provider.add_argument("--image-token", type=str, default="<image>")
+    provider.add_argument("--tokenizer-prompt-format", type=str, default="nemotron6-moe")
+    provider.add_argument("--image-tag-type", type=str, default="")


we have so many args, lets souble check what each one of this is actually used for and which one is dead code

yashaswikarnati · 2026-06-17T22:55:46Z

+    return args
+
+
+def test_preset_sets_vision_and_datapath_knobs():


lot of useless and dead tests. also looks like we dont have a test actully initializing a mimo model based on nemotron provider/spec ?

Add examples/mimo/model_providers/radio_encoder.py: the RADIO encoder wrapper (class-token drop + pixel-shuffle adapter over core RADIOViTModel), its vision TransformerConfig, the encoder ModuleSpec builder, and the RADIO-specific CLI args (--class-token-len, --pixel-shuffle, --disable-vision-class-token). The spec builder reads --pixel-shuffle and --disable-vision-class-token off args so they stop being write-only. This is a leaf, merge-first dependency of the Nemotron6-MoE VLM provider. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

yashaswikarnati · 2026-06-18T00:17:06Z

+    # Vision / MIMO-specific knobs.
+    provider.add_argument("--image-seq-length", type=int, default=None,
+                          help="Total image tokens emitted by the encoder per sample.")
+    provider.add_argument("--image-token-id", type=int, default=511,


why do we need arg for image token id and pad token id ? who uses that ?

yashaswikarnati · 2026-06-18T00:18:19Z

+    provider.add_argument("--pad-token-id", type=int, default=0)
+    provider.add_argument("--image-token", type=str, default="<image>")
+    provider.add_argument("--tokenizer-prompt-format", type=str, default="nemotron6-moe")
+    provider.add_argument("--image-tag-type", type=str, default="")


who uses image tag type/tokenizer-prompt-format/force-system-message etc. )? even dynamic resolution/num image tiles ?

yashaswikarnati · 2026-06-18T00:20:25Z

+    """
+    apply_model_provider_defaults(args)
+    resolve_image_token_id(args)
+    args.vision_encoder_key = NEMOTRON_VISION_ENCODER_KEY


whats the use of vision encoder key ?

yashaswikarnati · 2026-06-18T00:21:24Z

+    args.vision_input_mode = "pixels" if is_nemotron_moe_vlm(args) else "hidden_states"
+
+
+def apply_model_provider_defaults(args: argparse.Namespace) -> None:


why do we have this defaults thing ?

yashaswikarnati · 2026-06-18T00:22:30Z

+    return vision_submodule.encoders[NEMOTRON_VISION_ENCODER_KEY]
+
+
+def iter_vision_projection_modules(vision_submodule):


why do we have this? who uses this iter ?

yashaswikarnati · 2026-06-18T01:13:19Z

+    config.attention_backend = AttnBackend.flash
+    config.use_cpu_initialization = True
+    config.variable_seq_lengths = True
+    config.cross_entropy_fusion_impl = "te"


use local or native cross entropy fusion impl

Add a single-source RADIO_ENCODER_MODULE_NAME constant (the encoders-dict / topology module name). Replace the CPU arg-registration tests with a real 1-GPU forward+backward test that builds RADIOEncoderWrapper via radio_vision_encoder_spec (real RADIOViTModel + TE), asserts the output-shape effects of class-token-drop and pixel-shuffle, and exercises the dynamic-resolution packed-tile path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

The packed (thd) attention path requires bf16 and a flash/fused backend; the fixed sbhd cases tolerate fp32 but the dynamic packed path hit a TE illegal-memory-access under the fp32/auto-backend config. Build that case with params_dtype=bf16 + attention_backend=flash (matching the e2e-validated path) and feed bf16 packed input. The cu_seqlens/imgs_sizes contract was already correct (RADIOViTModel adds class tokens to cu_seqlens internally). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

TE fused attention requires the PackedSeqParams cu_seqlens on the device (mirrors training/step.py::move_batch_to_cuda). Build cu_seqlens_q/kv as int32 CUDA tensors and pass max_seqlen_q/kv as plain ints; imgs_sizes stays on CPU (RADIOViTModel reads it via .tolist()/Python iteration). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

yashaswikarnati · 2026-06-18T18:08:47Z

+
+"""Nemotron6-MoE VLM model provider for hetero MIMO examples.
+
+Declares the MIMO/vision provider args and builds the language ``MambaModel``


remove this doc string, verbose and not adding value

yashaswikarnati · 2026-06-18T18:13:45Z

+        default=NEMOTRON_MODEL_PROVIDER,
+        help="Which MIMO model provider/preset to build.",
+    )
+    provider.add_argument(


isnt dynamic resolution part of radio encoder args ? also --no-dynamic-resolution seems weird, we dont have to make this default and always pass from the args

yashaswikarnati · 2026-06-18T18:14:33Z

+    """Nemotron6-MoE language config: stock from-args base + model-specific overrides."""
+    config = deepcopy(_base_config(args))
+    bf16, dtype = _dtype(args)
+    # Code-only fields (not cleanly arg-expressible) + hetero parallelism pins.


remove these verbose comments

Move the --dynamic-resolution flag into add_radio_encoder_args as a plain store_true (default False); the encoder owns its vision knobs. Trim the module docstring and the constant comment to one line each. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

yashaswikarnati · 2026-06-18T20:24:56Z

+    grid). ``llm_grid`` is the language ``HyperCommGrid`` used only for fallback
+    dim sizes when a group is missing.
+    """
+    pp_pg = getattr(pg_collection, "pp", None) if pg_collection is not None else None


none fall back is dangerous - instead we can assert that groups are required.

yashaswikarnati · 2026-06-18T20:25:38Z

+    ep_pg = getattr(pg_collection, "ep", None) if pg_collection is not None else None
+    expt_tp_pg = getattr(pg_collection, "expt_tp", None) if pg_collection is not None else None
+
+    fallback_tp_size = get_grid_dim_size(llm_grid, "tp")


why do we need fallback tp ?

…lization) Stop forcing CPU init in radio_vision_config; let it flow from args (default False = init weights directly on GPU, matching stock). The GPU test now builds with the default (GPU init); its setup already seeds the CUDA RNG tracker (model_parallel_cuda_manual_seed) that _initialize_affine_weight_gpu needs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

Add the Nemotron6-MoE vision-language model provider for the MIMO heterogeneous training example, built on stock Megatron args. Includes the RADIO encoder wrapper (class-token drop + pixel shuffle), the affine projector spec using the core ColumnParallelLinear for fc1 (the affine path instantiates fc1 with gather_output=True, which TE column-parallel linears reject), and the per-PP-rank model assembly. A new leaf module under examples/mimo with no dependency on the other example training modules. The unit test exercises provider construction and travels with it. Validated in the 8-GPU 20L Nemotron VLM e2e (trains + checkpoint save/resume, lm loss 12.18->11.54 across resume). Signed-off-by: ykarnati <ykarnati@nvidia.com>

…ovider Remove the verbose module docstring, prototype/pre-vlm-07 references, and differences commentary; collapse verbose inline comments to terse one-liners. No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

Replace the three hardcoded TransformerConfigs with the pretrain_vlm pattern: core_transformer_config_from_args(args) deep-copied per tower, then thin override helpers that set only the code-only fields (attention_backend, use_cpu_initialization, variable_seq_lengths, cross_entropy_fusion_impl, env MoE dispatcher/backend), the per-module parallelism pins, and the vision / projection model-specifics. Architecture (num_layers, hidden_size, moe_*, mamba_*, hybrid pattern, ...) now comes from CLI flags, so the provider is arg-compatible like pretrain_gpt; the run script passes the values (PR NVIDIA#5377). Collapse the 20L/54L providers into one nemotron-moe-vlm provider driven by --num-layers/--hybrid-layer-pattern, removing the layer/pattern constants and the dual source of truth. The preset now only sets the vision/data-path knobs the dataloaders read. Pass force_eval_mode into RADIOViTModel (which already pins eval) instead of the wrapper's own eval()/train() override. Strip inherited MoE/Mamba/hybrid settings from the dense vision and projection configs. Parity (Option A, verified statically; live diff + parity test run in CI since torch is unavailable in the dev shell): the from-args language config equals the old hardcoded config on every field except deallocate_pipeline_outputs (False->True; stock-correct, inert at PP=1) and inference_sampling_seed (42->--seed; inference-only), both supplied by core_transformer_config_from_args. test_language_config_parity asserts the reference arch and these two exceptions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

The parity test hand-rolled a Namespace from parse_args alone and called core_transformer_config_from_args, but params_dtype (and other derived fields) are set by validate_args, which the test skipped -> AttributeError on args.params_dtype. Build args the same way examples/mimo/pretrain_mimo.py::_parse_and_validate does: parse_args(add_model_provider_args) -> prepare_model_provider_args (preset, before validate) -> validate_args. At world_size=1, tp=pp=cp=1 the divisibility checks pass with no distributed/mpu init, and global_batch_size defaults from micro_batch_size * data_parallel_size. This resolves params_dtype and every other validate_args-derived field exactly as in a real run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

core_transformer_config_from_args runs TransformerConfig.__post_init__ at construction. The provider set moe_token_dispatcher_type from an env var AFTER that, but --moe-token-dispatcher-type defaults to 'allgather', so the base config built with allgather + --moe-shared-expert-overlap failed validation ('moe_shared_expert_overlap only works with alltoall or flex token dispatcher') before the post-hoc override applied. Stop post-hoc-setting moe_token_dispatcher_type / moe_flex_dispatcher_backend; they are args-expressible (--moe-token-dispatcher-type, auto-generated --moe-flex-dispatcher-backend), so let them come from CLI flags and the base config validates up front. The run script defaults them from env via the flags. Audited every post-hoc override against __post_init__: the dispatcher type was the only field whose override made the base config invalid at construction. cross_entropy_fusion_impl='te' stays a code-only override (it deliberately bypasses the validate_args ban on loss-fusion + te and is not validated in __post_init__); the parallelism/dtype/seq-length overrides are not post_init-validated for our values. Add --moe-token-dispatcher-type alltoall + --moe-flex-dispatcher-backend deepep to the parity-test argv and reference fixture, kept in sync with the run script. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

The larger Nemotron variant was parametrized as num_layers=52, but its hybrid_layer_pattern encodes 54 layer-tokens (23 M + 23 E + 8 *), and get_hybrid_total_layer_count derives num_layers from the pattern, so config.num_layers is 54. The '52L' label was a misnomer; the canonical config is 54L. Set the parametrization to 54 to match the pattern. Provider unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

Import the extracted RADIO encoder (wrapper, vision config, encoder-spec builder, args, shared config helpers) from radio_encoder instead of carrying local copies. Address review: - Drop --training-stage / apply_training_stage / NEMOTRON_DEFAULT_STAGE; the --freeze-{vit,lm,projection} flags are the freeze interface. - Remove the mock provider, the dead non-Nemotron GPTModel fallback, and the now-unused GPTModel / gpt-layer-spec imports; --model-provider is nemotron-only. - Inline the num-image-tiles default; source the LM seq length from --seq-length instead of hardcoding 8192 and silently overriding it. - Move the grid/process-group helpers into examples/mimo/utils/hetero.py. - Drop unconsumed args (--vision-model-type, --use-tiling, --use-thumbnail, --dynamic-resolution-min/max-patches); RADIO knobs now live in radio_encoder. - Clarify that projector_type=affine uses only linear_fc1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

seq_length / max_position_embeddings are not TransformerConfig dataclass fields; the provider no longer monkey-sets them on the config (they flow from --seq-length into the MambaModel constructor). The seq-length contract is covered by test_language_model_spec_builds_mamba (max_sequence_length == args.seq_length). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

Keep only args whose consumer is main, the radio encoder, or an opened stack PR; image-token / tokenizer / data-path knobs move with the data pipeline. - Remove apply_model_provider_defaults (the post-parse preset). --dynamic- resolution (read by the provider's own radio encoder spec) keeps its on default via argparse default=True; the other preset-set knobs were data-only. - Remove data/tokenizer args: --image-seq-length (+ NEMOTRON_IMAGE_SEQ_PER_TILE), --image-token-id, --image-token, --pad-token-id, --tokenizer-prompt-format, --image-tag-type, --force-system-message, --num-image-tiles. With their args gone, resolve_image_token_id and validate_model_provider_args go too. - Remove the dead get_vision_encoder_module / iter_vision_projection_modules. - prepare_model_provider_args now only sets the derived vision_encoder_key. - Trim the provider tests to match; parity argv passes the radio flags. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

The encoder module name resolves via NVIDIA#5375 args.py DEFAULT_ENCODER_MODULE_NAME; the provider no longer needs to publish args.vision_encoder_key. That emptied prepare_model_provider_args, so remove it. NEMOTRON_VISION_ENCODER_KEY stays as the encoders-dict key in vision_submodules_spec. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

- Remove is_nemotron_moe_vlm and the always-true vision_submodules_spec guard (provider is nemotron-only). - Drop the attention_backend=flash hardcode; it flows from --attention-backend (parity argv now passes --attention-backend flash). - Drop the cross_entropy_fusion_impl="te" post-construction override; the native default flows from args and is allowed alongside --cross-entropy-loss- fusion, removing the silent bypass of the te-CE stability guard. - Replace the local NEMOTRON_VISION_ENCODER_KEY with the single-source RADIO_ENCODER_MODULE_NAME imported from radio_encoder. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

…w a radio arg) - Module docstring -> one line. - --dynamic-resolution moves to radio_encoder's add_radio_encoder_args (store_true, default off); remove the provider's duplicate definition. - Trim the stale attention/CE comment and the verbose moe-dispatcher / seq-length notes in nemotron_language_config to terse one-liners. - Add a terse affine-vs-mlp note at the projection spec (no behavior change; core MultimodalProjector already supports both via --vision-projection-type). - Test: dynamic-resolution now defaults off and is set by --dynamic-resolution. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

…ack) The spec builders fell back to grid dim sizes via per-attribute getattr(...,None), which silently masked a malformed collection. Keep the pg_collection=None path (bootstrap passes None on ranks outside a module's grid) using grid sizes, but when a collection IS provided, assert its pp/tp(/ep/expt_tp) groups are present and read sizes from them. Replace get_group_size_or/get_group_rank_or with assert-based get_pg_size/get_pg_rank. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

Drop the use_cpu_initialization=True override in nemotron_language_config and nemotron_projection_config; it flows from args (default False = init on GPU, matching stock). Better at scale (no full host-side master weight + H2D copy). bootstrap's .to(cuda) becomes a harmless no-op. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

yashaswikarnati · 2026-06-18T22:03:55Z

+    """
+    # None on ranks outside the language grid -> sizes come from the grid; when a
+    # collection is provided its pp/tp/ep/expt_tp groups must all be present.
+    if pg_collection is None:


why do we need to handle this ? pg collection None ? we dont call language model spec on vision ranks ??

yashaswikarnati mentioned this pull request Jun 16, 2026

Add hetero MIMO entrypoint, bootstrap, and mock data (integration) #5377

Draft

yashaswikarnati commented Jun 16, 2026

View reviewed changes

yashaswikarnati commented Jun 17, 2026

View reviewed changes

Comment thread examples/mimo/model_providers/nemotron_moe_vlm.py

yashaswikarnati commented Jun 17, 2026

View reviewed changes

Comment thread examples/mimo/model_providers/nemotron_moe_vlm.py

yashaswikarnati commented Jun 17, 2026

View reviewed changes

Comment thread examples/mimo/model_providers/nemotron_moe_vlm.py

yashaswikarnati commented Jun 17, 2026

View reviewed changes

yashaswikarnati force-pushed the x1-nemotron-provider branch from 33229c0 to bdef29f Compare June 17, 2026 23:57