Add Nemotron6-MoE VLM model provider for MIMO example#5374
Add Nemotron6-MoE VLM model provider for MIMO example#5374yashaswikarnati wants to merge 20 commits into
Conversation
|
|
||
| """Nemotron6-MoE VLM model provider for hetero MIMO examples (stock-args edition). | ||
|
|
||
| This is PR-E1 of the NMFW-516 hetero-MIMO upstreaming effort. It ports the |
There was a problem hiding this comment.
remove - This is PR-E1 of the NMFW-516 hetero-MIMO upstreaming effort. It ports the
prototype provider (sasatheesh/pre-vlm-07) onto Megatron's stock argument
system (megatron.training.arguments). and all verbose comments
There was a problem hiding this comment.
no mentions of prototypes
| prototype provider (sasatheesh/pre-vlm-07) onto Megatron's *stock* argument | ||
| system (``megatron.training.arguments``). | ||
|
|
||
| The crucial difference from the prototype: stock ``arguments.py`` already |
There was a problem hiding this comment.
remove differences commentary
| TERowParallelLinear = None | ||
|
|
||
| MOCK_MODEL_PROVIDER = "mock" | ||
| NEMOTRON_20L_MODEL_PROVIDER = "nemotron-moe-vlm-20l" |
There was a problem hiding this comment.
do we need seperate 20l vs 54l providers ? we can control from the layer pattern alone ? also why do we need all the max num tiles/default stage etc global variables ?
|
|
||
|
|
||
| # --------------------------------------------------------------------------- | ||
| # Small process-group / grid helpers. |
There was a problem hiding this comment.
remove verbose comments
| help="Total image tokens emitted by the encoder per sample.") | ||
| provider.add_argument("--image-token-id", type=int, default=511, | ||
| help="Vocab id of the placeholder image token.") | ||
| provider.add_argument("--pad-token-id", type=int, default=0) |
There was a problem hiding this comment.
how would this match what the tokenizer knows ? also how does megatron train loop uses/derives pad token id ?
| ), | ||
| ) | ||
| provider.add_argument( | ||
| "--dynamic-resolution-min-patches", |
There was a problem hiding this comment.
lets check what all of these args are we using/useful. like which downstream code paths use these args ?
| help="Projection module from frozen vision features to language hidden size.", | ||
| ) | ||
| provider.add_argument( | ||
| "--training-stage", choices=["stage1", "stage2", "stage3"], default=None |
There was a problem hiding this comment.
isnt training stages and then again freeze vit etc redundant ? which one does the train loop and model init respect ?
| raise ValueError("--pad-token-id must be within the vocabulary") | ||
|
|
||
|
|
||
| def _pixel_shuffle_dynamic_res(x, imgs_sizes, patch_dim, scale_factor=0.5, version=2): |
There was a problem hiding this comment.
doesn't the main llava model path already have pixel shiffle and radio encoder already implemented ? why are we redefinign them here again >
| args.use_thumbnail = not args.dynamic_resolution | ||
|
|
||
|
|
||
| def apply_training_stage(args: argparse.Namespace) -> None: |
There was a problem hiding this comment.
we dont need both stages and freeze vit, freeze lm args. we can have freeze ones only as part of this particular model provider args ?
| """Expose the underlying RADIO config for DDP wrapping.""" | ||
| return self.radio_model.config | ||
|
|
||
| def forward( |
There was a problem hiding this comment.
is this not present on main already ?
| except ImportError: # pragma: no cover - TE always present in the CI container | ||
| TERowParallelLinear = None | ||
|
|
||
| MOCK_MODEL_PROVIDER = "mock" |
There was a problem hiding this comment.
why do we have MOCK_MODEL_PROVIDER ? whats the use ?
|
|
||
| MOCK_MODEL_PROVIDER = "mock" | ||
| NEMOTRON_MODEL_PROVIDER = "nemotron-moe-vlm" | ||
| NEMOTRON_IMAGE_SEQ_PER_TILE = 256 |
There was a problem hiding this comment.
why do we have these as part of model provider, what/ehy exactly are they used ? NEMOTRON_IMAGE_SEQ_PER_TILE = 256
NEMOTRON_MAX_NUM_TILES = 12
NEMOTRON_DEFAULT_STAGE = "stage2"
NEMOTRON_SEQ_LENGTH = 8192
NEMOTRON_VISION_ENCODER_KEY = "radio_encoder"
|
|
||
|
|
||
| # Process-group / grid helpers. | ||
| def get_grid_dim_size(grid: HyperCommGrid, dim: str) -> int: |
There was a problem hiding this comment.
does hypercomm grid already have helper function for this ?
| return rank if rank >= 0 else fallback | ||
|
|
||
|
|
||
| def is_process_group_member(pg) -> bool: |
There was a problem hiding this comment.
we might have defined these helper functions at multiple palces so far i guess, can we check
| return pg is not None and dist.get_rank(group=pg) >= 0 | ||
|
|
||
|
|
||
| def is_nemotron_moe_vlm(args: argparse.Namespace) -> bool: |
There was a problem hiding this comment.
why do we need this ?
There was a problem hiding this comment.
the entire provider we have here is for nemotron vlm ? why we need is_nemotron_vlm ?
| provider.add_argument("--pad-token-id", type=int, default=0) | ||
| provider.add_argument("--image-token", type=str, default="<image>") | ||
| provider.add_argument("--tokenizer-prompt-format", type=str, default="nemotron6-moe") | ||
| provider.add_argument("--image-tag-type", type=str, default="") |
There was a problem hiding this comment.
we have so many args, lets souble check what each one of this is actually used for and which one is dead code
| return args | ||
|
|
||
|
|
||
| def test_preset_sets_vision_and_datapath_knobs(): |
There was a problem hiding this comment.
lot of useless and dead tests. also looks like we dont have a test actully initializing a mimo model based on nemotron provider/spec ?
Add examples/mimo/model_providers/radio_encoder.py: the RADIO encoder wrapper (class-token drop + pixel-shuffle adapter over core RADIOViTModel), its vision TransformerConfig, the encoder ModuleSpec builder, and the RADIO-specific CLI args (--class-token-len, --pixel-shuffle, --disable-vision-class-token). The spec builder reads --pixel-shuffle and --disable-vision-class-token off args so they stop being write-only. This is a leaf, merge-first dependency of the Nemotron6-MoE VLM provider. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
33229c0 to
bdef29f
Compare
| # Vision / MIMO-specific knobs. | ||
| provider.add_argument("--image-seq-length", type=int, default=None, | ||
| help="Total image tokens emitted by the encoder per sample.") | ||
| provider.add_argument("--image-token-id", type=int, default=511, |
There was a problem hiding this comment.
why do we need arg for image token id and pad token id ? who uses that ?
| provider.add_argument("--pad-token-id", type=int, default=0) | ||
| provider.add_argument("--image-token", type=str, default="<image>") | ||
| provider.add_argument("--tokenizer-prompt-format", type=str, default="nemotron6-moe") | ||
| provider.add_argument("--image-tag-type", type=str, default="") |
There was a problem hiding this comment.
who uses image tag type/tokenizer-prompt-format/force-system-message etc. )? even dynamic resolution/num image tiles ?
| """ | ||
| apply_model_provider_defaults(args) | ||
| resolve_image_token_id(args) | ||
| args.vision_encoder_key = NEMOTRON_VISION_ENCODER_KEY |
There was a problem hiding this comment.
whats the use of vision encoder key ?
| args.vision_input_mode = "pixels" if is_nemotron_moe_vlm(args) else "hidden_states" | ||
|
|
||
|
|
||
| def apply_model_provider_defaults(args: argparse.Namespace) -> None: |
There was a problem hiding this comment.
why do we have this defaults thing ?
| return vision_submodule.encoders[NEMOTRON_VISION_ENCODER_KEY] | ||
|
|
||
|
|
||
| def iter_vision_projection_modules(vision_submodule): |
There was a problem hiding this comment.
why do we have this? who uses this iter ?
| config.attention_backend = AttnBackend.flash | ||
| config.use_cpu_initialization = True | ||
| config.variable_seq_lengths = True | ||
| config.cross_entropy_fusion_impl = "te" |
There was a problem hiding this comment.
use local or native cross entropy fusion impl
Add a single-source RADIO_ENCODER_MODULE_NAME constant (the encoders-dict / topology module name). Replace the CPU arg-registration tests with a real 1-GPU forward+backward test that builds RADIOEncoderWrapper via radio_vision_encoder_spec (real RADIOViTModel + TE), asserts the output-shape effects of class-token-drop and pixel-shuffle, and exercises the dynamic-resolution packed-tile path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
462a32d to
3aa2760
Compare
The packed (thd) attention path requires bf16 and a flash/fused backend; the fixed sbhd cases tolerate fp32 but the dynamic packed path hit a TE illegal-memory-access under the fp32/auto-backend config. Build that case with params_dtype=bf16 + attention_backend=flash (matching the e2e-validated path) and feed bf16 packed input. The cu_seqlens/imgs_sizes contract was already correct (RADIOViTModel adds class tokens to cu_seqlens internally). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
TE fused attention requires the PackedSeqParams cu_seqlens on the device (mirrors training/step.py::move_batch_to_cuda). Build cu_seqlens_q/kv as int32 CUDA tensors and pass max_seqlen_q/kv as plain ints; imgs_sizes stays on CPU (RADIOViTModel reads it via .tolist()/Python iteration). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
|
|
||
| """Nemotron6-MoE VLM model provider for hetero MIMO examples. | ||
|
|
||
| Declares the MIMO/vision provider args and builds the language ``MambaModel`` |
There was a problem hiding this comment.
remove this doc string, verbose and not adding value
| default=NEMOTRON_MODEL_PROVIDER, | ||
| help="Which MIMO model provider/preset to build.", | ||
| ) | ||
| provider.add_argument( |
There was a problem hiding this comment.
isnt dynamic resolution part of radio encoder args ? also --no-dynamic-resolution seems weird, we dont have to make this default and always pass from the args
| """Nemotron6-MoE language config: stock from-args base + model-specific overrides.""" | ||
| config = deepcopy(_base_config(args)) | ||
| bf16, dtype = _dtype(args) | ||
| # Code-only fields (not cleanly arg-expressible) + hetero parallelism pins. |
There was a problem hiding this comment.
remove these verbose comments
Move the --dynamic-resolution flag into add_radio_encoder_args as a plain store_true (default False); the encoder owns its vision knobs. Trim the module docstring and the constant comment to one line each. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
3aa2760 to
02af260
Compare
| grid). ``llm_grid`` is the language ``HyperCommGrid`` used only for fallback | ||
| dim sizes when a group is missing. | ||
| """ | ||
| pp_pg = getattr(pg_collection, "pp", None) if pg_collection is not None else None |
There was a problem hiding this comment.
none fall back is dangerous - instead we can assert that groups are required.
| ep_pg = getattr(pg_collection, "ep", None) if pg_collection is not None else None | ||
| expt_tp_pg = getattr(pg_collection, "expt_tp", None) if pg_collection is not None else None | ||
|
|
||
| fallback_tp_size = get_grid_dim_size(llm_grid, "tp") |
There was a problem hiding this comment.
why do we need fallback tp ?
…lization) Stop forcing CPU init in radio_vision_config; let it flow from args (default False = init weights directly on GPU, matching stock). The GPU test now builds with the default (GPU init); its setup already seeds the CUDA RNG tracker (model_parallel_cuda_manual_seed) that _initialize_affine_weight_gpu needs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
Add the Nemotron6-MoE vision-language model provider for the MIMO heterogeneous training example, built on stock Megatron args. Includes the RADIO encoder wrapper (class-token drop + pixel shuffle), the affine projector spec using the core ColumnParallelLinear for fc1 (the affine path instantiates fc1 with gather_output=True, which TE column-parallel linears reject), and the per-PP-rank model assembly. A new leaf module under examples/mimo with no dependency on the other example training modules. The unit test exercises provider construction and travels with it. Validated in the 8-GPU 20L Nemotron VLM e2e (trains + checkpoint save/resume, lm loss 12.18->11.54 across resume). Signed-off-by: ykarnati <ykarnati@nvidia.com>
…ovider Remove the verbose module docstring, prototype/pre-vlm-07 references, and differences commentary; collapse verbose inline comments to terse one-liners. No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
Replace the three hardcoded TransformerConfigs with the pretrain_vlm pattern: core_transformer_config_from_args(args) deep-copied per tower, then thin override helpers that set only the code-only fields (attention_backend, use_cpu_initialization, variable_seq_lengths, cross_entropy_fusion_impl, env MoE dispatcher/backend), the per-module parallelism pins, and the vision / projection model-specifics. Architecture (num_layers, hidden_size, moe_*, mamba_*, hybrid pattern, ...) now comes from CLI flags, so the provider is arg-compatible like pretrain_gpt; the run script passes the values (PR NVIDIA#5377). Collapse the 20L/54L providers into one nemotron-moe-vlm provider driven by --num-layers/--hybrid-layer-pattern, removing the layer/pattern constants and the dual source of truth. The preset now only sets the vision/data-path knobs the dataloaders read. Pass force_eval_mode into RADIOViTModel (which already pins eval) instead of the wrapper's own eval()/train() override. Strip inherited MoE/Mamba/hybrid settings from the dense vision and projection configs. Parity (Option A, verified statically; live diff + parity test run in CI since torch is unavailable in the dev shell): the from-args language config equals the old hardcoded config on every field except deallocate_pipeline_outputs (False->True; stock-correct, inert at PP=1) and inference_sampling_seed (42->--seed; inference-only), both supplied by core_transformer_config_from_args. test_language_config_parity asserts the reference arch and these two exceptions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
The parity test hand-rolled a Namespace from parse_args alone and called core_transformer_config_from_args, but params_dtype (and other derived fields) are set by validate_args, which the test skipped -> AttributeError on args.params_dtype. Build args the same way examples/mimo/pretrain_mimo.py::_parse_and_validate does: parse_args(add_model_provider_args) -> prepare_model_provider_args (preset, before validate) -> validate_args. At world_size=1, tp=pp=cp=1 the divisibility checks pass with no distributed/mpu init, and global_batch_size defaults from micro_batch_size * data_parallel_size. This resolves params_dtype and every other validate_args-derived field exactly as in a real run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
core_transformer_config_from_args runs TransformerConfig.__post_init__ at
construction. The provider set moe_token_dispatcher_type from an env var AFTER
that, but --moe-token-dispatcher-type defaults to 'allgather', so the base
config built with allgather + --moe-shared-expert-overlap failed validation
('moe_shared_expert_overlap only works with alltoall or flex token dispatcher')
before the post-hoc override applied.
Stop post-hoc-setting moe_token_dispatcher_type / moe_flex_dispatcher_backend;
they are args-expressible (--moe-token-dispatcher-type, auto-generated
--moe-flex-dispatcher-backend), so let them come from CLI flags and the base
config validates up front. The run script defaults them from env via the flags.
Audited every post-hoc override against __post_init__: the dispatcher type was
the only field whose override made the base config invalid at construction.
cross_entropy_fusion_impl='te' stays a code-only override (it deliberately
bypasses the validate_args ban on loss-fusion + te and is not validated in
__post_init__); the parallelism/dtype/seq-length overrides are not
post_init-validated for our values.
Add --moe-token-dispatcher-type alltoall + --moe-flex-dispatcher-backend deepep
to the parity-test argv and reference fixture, kept in sync with the run script.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
The larger Nemotron variant was parametrized as num_layers=52, but its hybrid_layer_pattern encodes 54 layer-tokens (23 M + 23 E + 8 *), and get_hybrid_total_layer_count derives num_layers from the pattern, so config.num_layers is 54. The '52L' label was a misnomer; the canonical config is 54L. Set the parametrization to 54 to match the pattern. Provider unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
Import the extracted RADIO encoder (wrapper, vision config, encoder-spec
builder, args, shared config helpers) from radio_encoder instead of carrying
local copies. Address review:
- Drop --training-stage / apply_training_stage / NEMOTRON_DEFAULT_STAGE; the
--freeze-{vit,lm,projection} flags are the freeze interface.
- Remove the mock provider, the dead non-Nemotron GPTModel fallback, and the
now-unused GPTModel / gpt-layer-spec imports; --model-provider is
nemotron-only.
- Inline the num-image-tiles default; source the LM seq length from
--seq-length instead of hardcoding 8192 and silently overriding it.
- Move the grid/process-group helpers into examples/mimo/utils/hetero.py.
- Drop unconsumed args (--vision-model-type, --use-tiling, --use-thumbnail,
--dynamic-resolution-min/max-patches); RADIO knobs now live in radio_encoder.
- Clarify that projector_type=affine uses only linear_fc1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
seq_length / max_position_embeddings are not TransformerConfig dataclass fields; the provider no longer monkey-sets them on the config (they flow from --seq-length into the MambaModel constructor). The seq-length contract is covered by test_language_model_spec_builds_mamba (max_sequence_length == args.seq_length). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
Keep only args whose consumer is main, the radio encoder, or an opened stack PR; image-token / tokenizer / data-path knobs move with the data pipeline. - Remove apply_model_provider_defaults (the post-parse preset). --dynamic- resolution (read by the provider's own radio encoder spec) keeps its on default via argparse default=True; the other preset-set knobs were data-only. - Remove data/tokenizer args: --image-seq-length (+ NEMOTRON_IMAGE_SEQ_PER_TILE), --image-token-id, --image-token, --pad-token-id, --tokenizer-prompt-format, --image-tag-type, --force-system-message, --num-image-tiles. With their args gone, resolve_image_token_id and validate_model_provider_args go too. - Remove the dead get_vision_encoder_module / iter_vision_projection_modules. - prepare_model_provider_args now only sets the derived vision_encoder_key. - Trim the provider tests to match; parity argv passes the radio flags. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
The encoder module name resolves via NVIDIA#5375 args.py DEFAULT_ENCODER_MODULE_NAME; the provider no longer needs to publish args.vision_encoder_key. That emptied prepare_model_provider_args, so remove it. NEMOTRON_VISION_ENCODER_KEY stays as the encoders-dict key in vision_submodules_spec. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
- Remove is_nemotron_moe_vlm and the always-true vision_submodules_spec guard (provider is nemotron-only). - Drop the attention_backend=flash hardcode; it flows from --attention-backend (parity argv now passes --attention-backend flash). - Drop the cross_entropy_fusion_impl="te" post-construction override; the native default flows from args and is allowed alongside --cross-entropy-loss- fusion, removing the silent bypass of the te-CE stability guard. - Replace the local NEMOTRON_VISION_ENCODER_KEY with the single-source RADIO_ENCODER_MODULE_NAME imported from radio_encoder. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
…w a radio arg) - Module docstring -> one line. - --dynamic-resolution moves to radio_encoder's add_radio_encoder_args (store_true, default off); remove the provider's duplicate definition. - Trim the stale attention/CE comment and the verbose moe-dispatcher / seq-length notes in nemotron_language_config to terse one-liners. - Add a terse affine-vs-mlp note at the projection spec (no behavior change; core MultimodalProjector already supports both via --vision-projection-type). - Test: dynamic-resolution now defaults off and is set by --dynamic-resolution. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
…ack) The spec builders fell back to grid dim sizes via per-attribute getattr(...,None), which silently masked a malformed collection. Keep the pg_collection=None path (bootstrap passes None on ranks outside a module's grid) using grid sizes, but when a collection IS provided, assert its pp/tp(/ep/expt_tp) groups are present and read sizes from them. Replace get_group_size_or/get_group_rank_or with assert-based get_pg_size/get_pg_rank. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
Drop the use_cpu_initialization=True override in nemotron_language_config and nemotron_projection_config; it flows from args (default False = init on GPU, matching stock). Better at scale (no full host-side master weight + H2D copy). bootstrap's .to(cuda) becomes a harmless no-op. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
b7f05d1 to
b397a89
Compare
| """ | ||
| # None on ranks outside the language grid -> sizes come from the grid; when a | ||
| # collection is provided its pp/tp/ep/expt_tp groups must all be present. | ||
| if pg_collection is None: |
There was a problem hiding this comment.
why do we need to handle this ? pg collection None ? we dont call language model spec on vision ranks ??
What
Add the Nemotron6-MoE vision-language model provider for the MIMO heterogeneous training example, built on stock Megatron args. The RADIO encoder (wrapper + pixel shuffle + vision config + encoder-spec builder + RADIO args) now lives in its own leaf module
examples/mimo/model_providers/radio_encoder.pyand is imported here. This PR adds the languageMambaModelspec, the affine projector spec (coreColumnParallelLinearforfc1— the affine path instantiatesfc1withgather_output=True, which TE column-parallel linears reject), and the per-PP-rank model assembly. A unit test travels with it.Stacking
Stacked on top of #5397 (RADIO encoder leaf) — merge #5397 first. GitHub can't set this PR's base to a fork-only branch, so the base stays
mainand the #5397 commit rides along until it merges; rebasing ontomainafterward leaves only the provider diff.Why
language_model_spec/vision_submodules_specare the provider-definition leaf with no in-PR caller; their consumers land in the integration PR. Kept here so the provider can be reviewed/merged independently.CODEOWNERS
examples/mimo/...+tests/unit_tests/...-> repo default owners (no dedicated group).🤖 Generated with Claude Code