Skip to content

Add Nemotron6-MoE VLM model provider for MIMO example#5374

Draft
yashaswikarnati wants to merge 20 commits into
NVIDIA:mainfrom
yashaswikarnati:x1-nemotron-provider
Draft

Add Nemotron6-MoE VLM model provider for MIMO example#5374
yashaswikarnati wants to merge 20 commits into
NVIDIA:mainfrom
yashaswikarnati:x1-nemotron-provider

Conversation

@yashaswikarnati

@yashaswikarnati yashaswikarnati commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

What

Add the Nemotron6-MoE vision-language model provider for the MIMO heterogeneous training example, built on stock Megatron args. The RADIO encoder (wrapper + pixel shuffle + vision config + encoder-spec builder + RADIO args) now lives in its own leaf module examples/mimo/model_providers/radio_encoder.py and is imported here. This PR adds the language MambaModel spec, the affine projector spec (core ColumnParallelLinear for fc1 — the affine path instantiates fc1 with gather_output=True, which TE column-parallel linears reject), and the per-PP-rank model assembly. A unit test travels with it.

Stacking

Stacked on top of #5397 (RADIO encoder leaf) — merge #5397 first. GitHub can't set this PR's base to a fork-only branch, so the base stays main and the #5397 commit rides along until it merges; rebasing onto main afterward leaves only the provider diff.

Why

language_model_spec / vision_submodules_spec are the provider-definition leaf with no in-PR caller; their consumers land in the integration PR. Kept here so the provider can be reviewed/merged independently.

CODEOWNERS

examples/mimo/... + tests/unit_tests/... -> repo default owners (no dedicated group).

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


"""Nemotron6-MoE VLM model provider for hetero MIMO examples (stock-args edition).

This is PR-E1 of the NMFW-516 hetero-MIMO upstreaming effort. It ports the

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove - This is PR-E1 of the NMFW-516 hetero-MIMO upstreaming effort. It ports the
prototype provider (sasatheesh/pre-vlm-07) onto Megatron's stock argument
system (megatron.training.arguments). and all verbose comments

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no mentions of prototypes

prototype provider (sasatheesh/pre-vlm-07) onto Megatron's *stock* argument
system (``megatron.training.arguments``).

The crucial difference from the prototype: stock ``arguments.py`` already

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove differences commentary

TERowParallelLinear = None

MOCK_MODEL_PROVIDER = "mock"
NEMOTRON_20L_MODEL_PROVIDER = "nemotron-moe-vlm-20l"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need seperate 20l vs 54l providers ? we can control from the layer pattern alone ? also why do we need all the max num tiles/default stage etc global variables ?



# ---------------------------------------------------------------------------
# Small process-group / grid helpers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove verbose comments

help="Total image tokens emitted by the encoder per sample.")
provider.add_argument("--image-token-id", type=int, default=511,
help="Vocab id of the placeholder image token.")
provider.add_argument("--pad-token-id", type=int, default=0)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would this match what the tokenizer knows ? also how does megatron train loop uses/derives pad token id ?

),
)
provider.add_argument(
"--dynamic-resolution-min-patches",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets check what all of these args are we using/useful. like which downstream code paths use these args ?

help="Projection module from frozen vision features to language hidden size.",
)
provider.add_argument(
"--training-stage", choices=["stage1", "stage2", "stage3"], default=None

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnt training stages and then again freeze vit etc redundant ? which one does the train loop and model init respect ?

raise ValueError("--pad-token-id must be within the vocabulary")


def _pixel_shuffle_dynamic_res(x, imgs_sizes, patch_dim, scale_factor=0.5, version=2):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't the main llava model path already have pixel shiffle and radio encoder already implemented ? why are we redefinign them here again >

args.use_thumbnail = not args.dynamic_resolution


def apply_training_stage(args: argparse.Namespace) -> None:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need both stages and freeze vit, freeze lm args. we can have freeze ones only as part of this particular model provider args ?

"""Expose the underlying RADIO config for DDP wrapping."""
return self.radio_model.config

def forward(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not present on main already ?

except ImportError: # pragma: no cover - TE always present in the CI container
TERowParallelLinear = None

MOCK_MODEL_PROVIDER = "mock"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have MOCK_MODEL_PROVIDER ? whats the use ?


MOCK_MODEL_PROVIDER = "mock"
NEMOTRON_MODEL_PROVIDER = "nemotron-moe-vlm"
NEMOTRON_IMAGE_SEQ_PER_TILE = 256

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have these as part of model provider, what/ehy exactly are they used ? NEMOTRON_IMAGE_SEQ_PER_TILE = 256
NEMOTRON_MAX_NUM_TILES = 12
NEMOTRON_DEFAULT_STAGE = "stage2"
NEMOTRON_SEQ_LENGTH = 8192
NEMOTRON_VISION_ENCODER_KEY = "radio_encoder"



# Process-group / grid helpers.
def get_grid_dim_size(grid: HyperCommGrid, dim: str) -> int:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does hypercomm grid already have helper function for this ?

return rank if rank >= 0 else fallback


def is_process_group_member(pg) -> bool:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might have defined these helper functions at multiple palces so far i guess, can we check

return pg is not None and dist.get_rank(group=pg) >= 0


def is_nemotron_moe_vlm(args: argparse.Namespace) -> bool:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the entire provider we have here is for nemotron vlm ? why we need is_nemotron_vlm ?

provider.add_argument("--pad-token-id", type=int, default=0)
provider.add_argument("--image-token", type=str, default="<image>")
provider.add_argument("--tokenizer-prompt-format", type=str, default="nemotron6-moe")
provider.add_argument("--image-tag-type", type=str, default="")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have so many args, lets souble check what each one of this is actually used for and which one is dead code

Comment thread examples/mimo/model_providers/nemotron_moe_vlm.py
Comment thread examples/mimo/model_providers/nemotron_moe_vlm.py
Comment thread examples/mimo/model_providers/nemotron_moe_vlm.py
return args


def test_preset_sets_vision_and_datapath_knobs():

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lot of useless and dead tests. also looks like we dont have a test actully initializing a mimo model based on nemotron provider/spec ?

Add examples/mimo/model_providers/radio_encoder.py: the RADIO encoder
wrapper (class-token drop + pixel-shuffle adapter over core RADIOViTModel),
its vision TransformerConfig, the encoder ModuleSpec builder, and the
RADIO-specific CLI args (--class-token-len, --pixel-shuffle,
--disable-vision-class-token). The spec builder reads --pixel-shuffle and
--disable-vision-class-token off args so they stop being write-only.

This is a leaf, merge-first dependency of the Nemotron6-MoE VLM provider.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
# Vision / MIMO-specific knobs.
provider.add_argument("--image-seq-length", type=int, default=None,
help="Total image tokens emitted by the encoder per sample.")
provider.add_argument("--image-token-id", type=int, default=511,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need arg for image token id and pad token id ? who uses that ?

provider.add_argument("--pad-token-id", type=int, default=0)
provider.add_argument("--image-token", type=str, default="<image>")
provider.add_argument("--tokenizer-prompt-format", type=str, default="nemotron6-moe")
provider.add_argument("--image-tag-type", type=str, default="")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who uses image tag type/tokenizer-prompt-format/force-system-message etc. )? even dynamic resolution/num image tiles ?

"""
apply_model_provider_defaults(args)
resolve_image_token_id(args)
args.vision_encoder_key = NEMOTRON_VISION_ENCODER_KEY

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the use of vision encoder key ?

args.vision_input_mode = "pixels" if is_nemotron_moe_vlm(args) else "hidden_states"


def apply_model_provider_defaults(args: argparse.Namespace) -> None:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have this defaults thing ?

return vision_submodule.encoders[NEMOTRON_VISION_ENCODER_KEY]


def iter_vision_projection_modules(vision_submodule):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have this? who uses this iter ?

config.attention_backend = AttnBackend.flash
config.use_cpu_initialization = True
config.variable_seq_lengths = True
config.cross_entropy_fusion_impl = "te"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use local or native cross entropy fusion impl

Add a single-source RADIO_ENCODER_MODULE_NAME constant (the encoders-dict /
topology module name). Replace the CPU arg-registration tests with a real
1-GPU forward+backward test that builds RADIOEncoderWrapper via
radio_vision_encoder_spec (real RADIOViTModel + TE), asserts the output-shape
effects of class-token-drop and pixel-shuffle, and exercises the
dynamic-resolution packed-tile path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
yashaswikarnati and others added 2 commits June 17, 2026 18:42
The packed (thd) attention path requires bf16 and a flash/fused backend; the
fixed sbhd cases tolerate fp32 but the dynamic packed path hit a TE
illegal-memory-access under the fp32/auto-backend config. Build that case with
params_dtype=bf16 + attention_backend=flash (matching the e2e-validated path)
and feed bf16 packed input. The cu_seqlens/imgs_sizes contract was already
correct (RADIOViTModel adds class tokens to cu_seqlens internally).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
TE fused attention requires the PackedSeqParams cu_seqlens on the device
(mirrors training/step.py::move_batch_to_cuda). Build cu_seqlens_q/kv as int32
CUDA tensors and pass max_seqlen_q/kv as plain ints; imgs_sizes stays on CPU
(RADIOViTModel reads it via .tolist()/Python iteration).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>

"""Nemotron6-MoE VLM model provider for hetero MIMO examples.

Declares the MIMO/vision provider args and builds the language ``MambaModel``

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this doc string, verbose and not adding value

default=NEMOTRON_MODEL_PROVIDER,
help="Which MIMO model provider/preset to build.",
)
provider.add_argument(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnt dynamic resolution part of radio encoder args ? also --no-dynamic-resolution seems weird, we dont have to make this default and always pass from the args

"""Nemotron6-MoE language config: stock from-args base + model-specific overrides."""
config = deepcopy(_base_config(args))
bf16, dtype = _dtype(args)
# Code-only fields (not cleanly arg-expressible) + hetero parallelism pins.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove these verbose comments

Comment thread examples/mimo/model_providers/nemotron_moe_vlm.py
Comment thread examples/mimo/model_providers/nemotron_moe_vlm.py
Move the --dynamic-resolution flag into add_radio_encoder_args as a plain
store_true (default False); the encoder owns its vision knobs. Trim the module
docstring and the constant comment to one line each.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
grid). ``llm_grid`` is the language ``HyperCommGrid`` used only for fallback
dim sizes when a group is missing.
"""
pp_pg = getattr(pg_collection, "pp", None) if pg_collection is not None else None

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none fall back is dangerous - instead we can assert that groups are required.

ep_pg = getattr(pg_collection, "ep", None) if pg_collection is not None else None
expt_tp_pg = getattr(pg_collection, "expt_tp", None) if pg_collection is not None else None

fallback_tp_size = get_grid_dim_size(llm_grid, "tp")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need fallback tp ?

yashaswikarnati and others added 15 commits June 18, 2026 13:51
…lization)

Stop forcing CPU init in radio_vision_config; let it flow from args (default
False = init weights directly on GPU, matching stock). The GPU test now builds
with the default (GPU init); its setup already seeds the CUDA RNG tracker
(model_parallel_cuda_manual_seed) that _initialize_affine_weight_gpu needs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Add the Nemotron6-MoE vision-language model provider for the MIMO
heterogeneous training example, built on stock Megatron args. Includes the
RADIO encoder wrapper (class-token drop + pixel shuffle), the affine
projector spec using the core ColumnParallelLinear for fc1 (the affine path
instantiates fc1 with gather_output=True, which TE column-parallel linears
reject), and the per-PP-rank model assembly.

A new leaf module under examples/mimo with no dependency on the other
example training modules. The unit test exercises provider construction and
travels with it.

Validated in the 8-GPU 20L Nemotron VLM e2e (trains + checkpoint
save/resume, lm loss 12.18->11.54 across resume).

Signed-off-by: ykarnati <ykarnati@nvidia.com>
…ovider

Remove the verbose module docstring, prototype/pre-vlm-07 references, and
differences commentary; collapse verbose inline comments to terse one-liners.
No behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Replace the three hardcoded TransformerConfigs with the pretrain_vlm pattern:
core_transformer_config_from_args(args) deep-copied per tower, then thin
override helpers that set only the code-only fields (attention_backend,
use_cpu_initialization, variable_seq_lengths, cross_entropy_fusion_impl, env
MoE dispatcher/backend), the per-module parallelism pins, and the vision /
projection model-specifics. Architecture (num_layers, hidden_size, moe_*,
mamba_*, hybrid pattern, ...) now comes from CLI flags, so the provider is
arg-compatible like pretrain_gpt; the run script passes the values (PR NVIDIA#5377).

Collapse the 20L/54L providers into one nemotron-moe-vlm provider driven by
--num-layers/--hybrid-layer-pattern, removing the layer/pattern constants and
the dual source of truth. The preset now only sets the vision/data-path knobs
the dataloaders read.

Pass force_eval_mode into RADIOViTModel (which already pins eval) instead of the
wrapper's own eval()/train() override. Strip inherited MoE/Mamba/hybrid settings
from the dense vision and projection configs.

Parity (Option A, verified statically; live diff + parity test run in CI since
torch is unavailable in the dev shell): the from-args language config equals the
old hardcoded config on every field except deallocate_pipeline_outputs
(False->True; stock-correct, inert at PP=1) and inference_sampling_seed
(42->--seed; inference-only), both supplied by core_transformer_config_from_args.
test_language_config_parity asserts the reference arch and these two exceptions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
The parity test hand-rolled a Namespace from parse_args alone and called
core_transformer_config_from_args, but params_dtype (and other derived fields)
are set by validate_args, which the test skipped -> AttributeError on
args.params_dtype.

Build args the same way examples/mimo/pretrain_mimo.py::_parse_and_validate does:
parse_args(add_model_provider_args) -> prepare_model_provider_args (preset,
before validate) -> validate_args. At world_size=1, tp=pp=cp=1 the divisibility
checks pass with no distributed/mpu init, and global_batch_size defaults from
micro_batch_size * data_parallel_size. This resolves params_dtype and every
other validate_args-derived field exactly as in a real run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
core_transformer_config_from_args runs TransformerConfig.__post_init__ at
construction. The provider set moe_token_dispatcher_type from an env var AFTER
that, but --moe-token-dispatcher-type defaults to 'allgather', so the base
config built with allgather + --moe-shared-expert-overlap failed validation
('moe_shared_expert_overlap only works with alltoall or flex token dispatcher')
before the post-hoc override applied.

Stop post-hoc-setting moe_token_dispatcher_type / moe_flex_dispatcher_backend;
they are args-expressible (--moe-token-dispatcher-type, auto-generated
--moe-flex-dispatcher-backend), so let them come from CLI flags and the base
config validates up front. The run script defaults them from env via the flags.

Audited every post-hoc override against __post_init__: the dispatcher type was
the only field whose override made the base config invalid at construction.
cross_entropy_fusion_impl='te' stays a code-only override (it deliberately
bypasses the validate_args ban on loss-fusion + te and is not validated in
__post_init__); the parallelism/dtype/seq-length overrides are not
post_init-validated for our values.

Add --moe-token-dispatcher-type alltoall + --moe-flex-dispatcher-backend deepep
to the parity-test argv and reference fixture, kept in sync with the run script.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
The larger Nemotron variant was parametrized as num_layers=52, but its
hybrid_layer_pattern encodes 54 layer-tokens (23 M + 23 E + 8 *), and
get_hybrid_total_layer_count derives num_layers from the pattern, so
config.num_layers is 54. The '52L' label was a misnomer; the canonical config
is 54L. Set the parametrization to 54 to match the pattern. Provider unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Import the extracted RADIO encoder (wrapper, vision config, encoder-spec
builder, args, shared config helpers) from radio_encoder instead of carrying
local copies. Address review:

- Drop --training-stage / apply_training_stage / NEMOTRON_DEFAULT_STAGE; the
  --freeze-{vit,lm,projection} flags are the freeze interface.
- Remove the mock provider, the dead non-Nemotron GPTModel fallback, and the
  now-unused GPTModel / gpt-layer-spec imports; --model-provider is
  nemotron-only.
- Inline the num-image-tiles default; source the LM seq length from
  --seq-length instead of hardcoding 8192 and silently overriding it.
- Move the grid/process-group helpers into examples/mimo/utils/hetero.py.
- Drop unconsumed args (--vision-model-type, --use-tiling, --use-thumbnail,
  --dynamic-resolution-min/max-patches); RADIO knobs now live in radio_encoder.
- Clarify that projector_type=affine uses only linear_fc1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
seq_length / max_position_embeddings are not TransformerConfig dataclass
fields; the provider no longer monkey-sets them on the config (they flow from
--seq-length into the MambaModel constructor). The seq-length contract is
covered by test_language_model_spec_builds_mamba (max_sequence_length ==
args.seq_length).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Keep only args whose consumer is main, the radio encoder, or an opened stack
PR; image-token / tokenizer / data-path knobs move with the data pipeline.

- Remove apply_model_provider_defaults (the post-parse preset). --dynamic-
  resolution (read by the provider's own radio encoder spec) keeps its on
  default via argparse default=True; the other preset-set knobs were data-only.
- Remove data/tokenizer args: --image-seq-length (+ NEMOTRON_IMAGE_SEQ_PER_TILE),
  --image-token-id, --image-token, --pad-token-id, --tokenizer-prompt-format,
  --image-tag-type, --force-system-message, --num-image-tiles. With their args
  gone, resolve_image_token_id and validate_model_provider_args go too.
- Remove the dead get_vision_encoder_module / iter_vision_projection_modules.
- prepare_model_provider_args now only sets the derived vision_encoder_key.
- Trim the provider tests to match; parity argv passes the radio flags.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
The encoder module name resolves via NVIDIA#5375 args.py DEFAULT_ENCODER_MODULE_NAME;
the provider no longer needs to publish args.vision_encoder_key. That emptied
prepare_model_provider_args, so remove it. NEMOTRON_VISION_ENCODER_KEY stays as
the encoders-dict key in vision_submodules_spec.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
- Remove is_nemotron_moe_vlm and the always-true vision_submodules_spec guard
  (provider is nemotron-only).
- Drop the attention_backend=flash hardcode; it flows from --attention-backend
  (parity argv now passes --attention-backend flash).
- Drop the cross_entropy_fusion_impl="te" post-construction override; the
  native default flows from args and is allowed alongside --cross-entropy-loss-
  fusion, removing the silent bypass of the te-CE stability guard.
- Replace the local NEMOTRON_VISION_ENCODER_KEY with the single-source
  RADIO_ENCODER_MODULE_NAME imported from radio_encoder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
…w a radio arg)

- Module docstring -> one line.
- --dynamic-resolution moves to radio_encoder's add_radio_encoder_args
  (store_true, default off); remove the provider's duplicate definition.
- Trim the stale attention/CE comment and the verbose moe-dispatcher / seq-length
  notes in nemotron_language_config to terse one-liners.
- Add a terse affine-vs-mlp note at the projection spec (no behavior change;
  core MultimodalProjector already supports both via --vision-projection-type).
- Test: dynamic-resolution now defaults off and is set by --dynamic-resolution.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
…ack)

The spec builders fell back to grid dim sizes via per-attribute getattr(...,None),
which silently masked a malformed collection. Keep the pg_collection=None path
(bootstrap passes None on ranks outside a module's grid) using grid sizes, but
when a collection IS provided, assert its pp/tp(/ep/expt_tp) groups are present
and read sizes from them. Replace get_group_size_or/get_group_rank_or with
assert-based get_pg_size/get_pg_rank.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Drop the use_cpu_initialization=True override in nemotron_language_config and
nemotron_projection_config; it flows from args (default False = init on GPU,
matching stock). Better at scale (no full host-side master weight + H2D copy).
bootstrap's .to(cuda) becomes a harmless no-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
"""
# None on ranks outside the language grid -> sizes come from the grid; when a
# collection is provided its pp/tp/ep/expt_tp groups must all be present.
if pg_collection is None:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to handle this ? pg collection None ? we dont call language model spec on vision ranks ??

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant