[ci] H100 CI #1679
Conversation
Code Review
This pull request adds H100 GPU support, new CI configurations, and optimizations for large MoE models, specifically refactoring FSDP2 initialization to use meta-device swapping and PyTorch's set_model_state_dict. The weight update protocol was also updated to support NCCL transfers alongside existing IPC methods. Review feedback identified opportunities to further optimize memory usage by restricting state_dict calls to rank 0 and removing unused arguments in buffer synchronization.
```diff
@@ -219,6 +219,33 @@ def _fsdp_init_model(self, model, is_train=True, is_wrapped=False):
         }
         module = model.model if is_wrapped else model
         full_state = module.state_dict()
```
Calling state_dict() on all ranks materializes the full model state in memory on every process. Since fsdp2_load_full_state_dict uses set_model_state_dict with broadcast_from_rank0=True, only rank 0 needs the actual data. On non-zero ranks, this leads to significant and unnecessary CPU memory usage, which can cause OOMs on nodes with many GPUs when training large models.
```diff
-        full_state = module.state_dict()
+        full_state = module.state_dict() if dist.get_rank() == 0 else {}
```
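For context, a minimal sketch of the load path this suggestion relies on, assuming a helper named `fsdp2_load_full_state_dict` as described in the review summary (not the exact SkyRL code):

```python
import torch.distributed as dist
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    set_model_state_dict,
)

def fsdp2_load_full_state_dict(model, full_state):
    """Load a full (unsharded) state dict into an FSDP2-sharded model.

    With broadcast_from_rank0=True, only rank 0's full_state is read; every
    other rank receives its shards via broadcast and may pass an empty dict.
    """
    set_model_state_dict(
        model,
        full_state,
        options=StateDictOptions(full_state_dict=True, broadcast_from_rank0=True),
    )
```

Given that, the suggested change materializes the full state only on the one rank that actually consumes it.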
```python
# Broadcast non-persistent buffers (e.g. inv_freq from RotaryEmbedding)
# that are excluded from state_dict. On non-rank-0 meta-init these are
# still on the meta device with no data; rank 0 has the correct values.
_sync_non_persistent_buffers(model, model.state_dict())
```
The second argument to _sync_non_persistent_buffers is not utilized by the function implementation. Calling model.state_dict() on an FSDP2 model, while relatively cheap as it returns sharded DTensors, still incurs unnecessary overhead and creates a large dictionary of tensors for no reason.
```diff
-_sync_non_persistent_buffers(model, model.state_dict())
+_sync_non_persistent_buffers(model, {})
```
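If the second argument is dropped entirely, a hypothetical implementation could locate non-persistent buffers via `nn.Module`'s private `_non_persistent_buffers_set` instead of taking a state dict at all. A sketch under that assumption, not the actual SkyRL code:

```python
import torch
import torch.distributed as dist

def _sync_non_persistent_buffers(model: torch.nn.Module) -> None:
    """Broadcast buffers excluded from state_dict() from rank 0 to all ranks."""
    for submodule in model.modules():
        # _non_persistent_buffers_set is the private nn.Module attribute that
        # records buffers registered with persistent=False (e.g. inv_freq).
        for name in sorted(submodule._non_persistent_buffers_set):
            buf = submodule._buffers.get(name)
            if buf is None:
                continue
            if buf.is_meta:
                # Non-rank-0 meta-init: allocate real storage before receiving.
                buf = torch.empty_like(buf, device="cuda")
                submodule._buffers[name] = buf
            dist.broadcast(buf, src=0)
```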
```python
wrapped_model = HFModelWrapper(
    model_path,
    use_flash_attention_2=self.cfg.flash_attn,
    bf16=False,
    lora_rank=self.cfg.policy.model.lora.rank,
    lora_alpha=self.cfg.policy.model.lora.alpha,
    lora_dropout=self.cfg.policy.model.lora.dropout,
```
This change (using bf16=True) is going to initialize the model weights in BF16 and will lead to pure BF16 training.
You want to initialize the model weights in FP32 and let FSDP handle casting to BF16 during the forward pass.
https://docs.pytorch.org/docs/2.12/fsdp.html#torch.distributed.fsdp.MixedPrecision
"Outside forward and backward, the sharded parameters are kept in full precision (e.g. for the optimizer step), and for model checkpointing, the parameters are always saved in full precision."
```python
# Large MoE models: AdamW's fp32 optimizer state (~6x model) is the
# dominant per-rank GPU cost during Megatron init.
# use_precision_aware_optimizer keeps it in bf16 (halves it), and
# optimizer_cpu_offload moves it off GPU entirely. Without these the
# optimizer state alone OOMs the H100 on 4 GPUs.
```
It would be good to just initialize the policy without any optimizer state. We might need a fix in the SkyRL FSDP code for this, which is a better approach than using optimizer offload. We can explicitly pass optimizer_config as None to the FSDPStrategy (SkyRL/skyrl/backends/skyrl_train/distributed/fsdp_strategy.py, lines 231 to 234 at 33f18ba). Currently an assertion will prevent us from having the optimizer as None for PolicyWorker (SkyRL/skyrl/backends/skyrl_train/workers/fsdp/fsdp_worker.py, lines 189 to 191 at 33f18ba), but we can just modify it to require a non-null optimizer only if optimizer_config is non-null.
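A sketch of the relaxed check (hypothetical shape; the actual field and attribute names in fsdp_worker.py may differ):

```python
# Instead of unconditionally asserting that the optimizer exists:
if self.optimizer_config is not None:
    assert self.optimizer is not None, (
        "optimizer must be initialized when optimizer_config is provided"
    )
# With optimizer_config=None, the PolicyWorker skips optimizer state entirely.
```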
SumanthRH left a comment:
Can you break up this PR into:
- A minimal PR to initialize H100 CI with Qwen 3.5 35B
- A PR to add support for Nemotron 30B

Also, it is hard to see which of the changes in FSDP worker initialization were essential, and which model they apply to. When you break up the PR, please add only the relevant changes needed.

This PR also switches to initializing the model weights directly in BF16. While this saves memory, it introduces a correctness bug: training will no longer be in mixed precision but in pure BF16.
H100 CI tests for large MoE models
Introduces an opt-in H100 CI lane that exercises two ~30B-class MoE models end-to-end:
- FSDP e2e training (`test_policy_local_engines_e2e`), both colocated and non-colocated with vLLM.
- Megatron logprob roundtrip (`test_megatron_models`), re-enabled after the original 8-GPU param was skipped.

Tests are gated by `pytest.mark.h100` (auto-skipped unless `-m h100` is passed). The new GitHub workflow submits them as an Anyscale staging job on the `llm-team-h100-4x:1` compute config.

Test infrastructure
- Register an `h100` marker in `tests/backends/skyrl_train/gpu/conftest.py` and auto-skip those tests unless `-m h100` is explicitly passed (see the sketch below).
- The lane covers `test_policy_local_engines_e2e` (colocated + non-colocated) and `test_megatron_models` (TP=4 EP=4), all tagged with the `h100` marker.
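A sketch of how such marker gating is commonly done in a conftest.py hook (the `h100` marker name comes from this PR; the hook body is illustrative, not the actual file):

```python
import pytest

def pytest_collection_modifyitems(config, items):
    # Run h100-marked tests only when `-m h100` was requested explicitly.
    markexpr = config.getoption("markexpr", default="") or ""
    if "h100" in markexpr:
        return
    skip_h100 = pytest.mark.skip(reason="H100-only test; pass -m h100 to run")
    for item in items:
        if "h100" in item.keywords:
            item.add_marker(skip_h100)
```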
Core fixes uncovered along the way

- Meta-device FSDP2 init (`fsdp_strategy.py`): swap the module to meta before `apply_fsdp2`, so sharded DTensors allocate directly at their final shard size instead of materializing the full model on rank 0 first. Snapshot/restore non-persistent buffers (e.g. `RotaryEmbedding.inv_freq`) around the meta swap so they survive the broadcast.
- `set_model_state_dict` (`fsdp_utils.py`): the ~80-line per-parameter broadcast + `distribute_tensor` loop is now PyTorch's `set_model_state_dict(..., StateDictOptions(full_state_dict=True, broadcast_from_rank0=True))`.
- Removed the offload/reload round-trip (`fsdp_utils.py`): the trick (`offload_fsdp2_model_to_cpu` → `empty_cache` → `load_fsdp2_model_to_gpu`) was meant to clear reserved-but-unallocated PyTorch memory. For FSDP2 it's a no-op (`model.to("cpu")` doesn't move FSDPParam-managed storage during init), and the reload then allocates a second copy, doubling memory. Removed; `set_model_state_dict` already leaves us with exactly the shard on GPU.
- `set_current_vllm_config` (`new_inference_worker_wrap.py`, `remote_inference_client.py`, `broadcast_strategy.py`): MoE models (FlashInfer CUTLASS kernel) read `get_current_vllm_config()` during `load_weights`, which is only set around `init_device`/`load_model`. Added a new `update_weights_nccl` worker method that wraps `weight_transfer_engine.receive_weights` in `set_current_vllm_config`, and routed the broadcast sender through `start_weight_update` + `update_weights_nccl` + `finish_weight_update` instead of vLLM's native `/update_weights` endpoint (see the sketch after this list). Same pattern CUDA IPC already used. Tracked against upstream vllm-project/vllm#42577.
- Renamed `update_weights_chunk` → `update_weights_ipc` across `new_inference_worker_wrap.py`, `remote_inference_client.py`, and `cuda_ipc_strategy.py` so the IPC and NCCL paths have parallel names.
- Removed `bf16=False` in policy FSDP init (`fsdp_worker.py:162`): this was forcing fp32 master weights even though the wrapper default is `bf16=True` (and the critic path at line 408 already respects the config). Halves per-rank shard size; this was the difference between fitting and OOMing on non-colocated 2-GPU FSDP.
- Optimizer offload for Megatron (`test_megatron_models.py`): for Qwen3.5-35B and Nemotron-3-Nano, set `use_precision_aware_optimizer=True` + `optimizer_cpu_offload=True` + `optimizer_offload_fraction=1.0`. Megatron eagerly materializes the fp32 AdamW state on GPU at init (unlike PyTorch's lazy AdamW), so the optimizer state alone OOMs without offload.
- Lowered `gpu_memory_utilization` to 0.5 for these large MoE models in the Megatron test (`_engine_overrides_for_model`) so the policy shard + vLLM pool both fit on each H100.
- `sleep_level=2` in `test_megatron_models`: the test explicitly syncs weights, so full sleep (matching the `InferenceEngineState.create` default and the FSDP e2e test) is the right level. Previously it was hardcoded to 1.
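For the `set_current_vllm_config` fix above, a minimal sketch of the receive-side wrapper (names like `update_weights_nccl` and `weight_transfer_engine` follow the PR description; the class shape is illustrative, not the actual worker code):

```python
from vllm.config import set_current_vllm_config

class InferenceWorkerWrapSketch:
    """Illustrative receive-side wrapper, not the actual worker class."""

    def __init__(self, vllm_config, weight_transfer_engine):
        self.vllm_config = vllm_config
        self.weight_transfer_engine = weight_transfer_engine

    def update_weights_nccl(self, *args, **kwargs):
        # MoE load_weights paths (e.g. the FlashInfer CUTLASS kernel) call
        # get_current_vllm_config(), which vLLM only sets around
        # init_device/load_model, so re-enter the config context here.
        with set_current_vllm_config(self.vllm_config):
            return self.weight_transfer_engine.receive_weights(*args, **kwargs)
```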