-
Notifications
You must be signed in to change notification settings - Fork 3
Add RCCL and training warmup for HYBRID_SHARD stability #83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
13a4b57
bc511a6
8623e15
6d36362
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -1,42 +1,133 @@ | ||||||
| #!/bin/bash | ||||||
| # ============================================================================= | ||||||
| # Global NCCL/RCCL environment variables for multi-node training | ||||||
| # Based on DLRM_set_env_variables.sh | ||||||
| # Configured for MI350X cluster | ||||||
| # | ||||||
| # This file is the SINGLE SOURCE OF TRUTH for all NCCL/RCCL configuration. | ||||||
| # Edit variables here - local_launch.sh will automatically pick them up. | ||||||
| # | ||||||
| # NOTE: When adding a new environment variable, you MUST also add its name | ||||||
| # to the DOCKER_ENV_VARS array below, otherwise it won't be passed | ||||||
| # to the Docker container. | ||||||
| # ============================================================================= | ||||||
|
|
||||||
| # NCCL Debug Settings (use INFO for debugging network issues) | ||||||
| export NCCL_DEBUG=INFO | ||||||
| export NCCL_DEBUG_SUBSYS=INIT,NET | ||||||
| # Try disabling IB if InfiniBand is not properly configured | ||||||
| export NCCL_IB_DISABLE=1 | ||||||
| # ----------------------------------------------------------------------------- | ||||||
| # NCCL Debug Settings | ||||||
| # ----------------------------------------------------------------------------- | ||||||
| export NCCL_DEBUG=WARN | ||||||
| export NCCL_DEBUG_SUBSYS= # Options: COLL,INIT,NET (empty = none) | ||||||
|
||||||
| export NCCL_DEBUG_SUBSYS= # Options: COLL,INIT,NET (empty = none) | |
| unset NCCL_DEBUG_SUBSYS # Options: COLL,INIT,NET (empty/unset = none) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timeout NCCL_TIMEOUT and TORCH_DIST_INIT_TIMEOUT are pointing to different values. Any reason why do we have different values and different environment variables for same thing? Which one is effective in our code?
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DOCKER_ENV_VARS array includes NCCL_MIN_NCHANNELS (line 91), but the actual export statement for NCCL_MIN_NCHANNELS is commented out (line 44). This means the build_docker_env_flags function will pass an undefined or empty value for NCCL_MIN_NCHANNELS to the Docker container. Either uncomment the export on line 44 or remove NCCL_MIN_NCHANNELS from the DOCKER_ENV_VARS array.
| NCCL_MIN_NCHANNELS |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The build_docker_env_flags function does not quote the environment variable values when building the flags string. If any environment variable contains spaces or special characters (e.g., NCCL_SOCKET_IFNAME=enp49s0f0np0,fenic0), this could cause issues with shell word splitting when the flags are used in the docker exec command. Consider quoting the values: flags+=" -e ${var}="${value}""
| flags+=" -e ${var}=${value}" | |
| flags+=" -e ${var}=$(printf '%q' "$value")" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expansion of
DOCKER_EXEChere is vulnerable to shell command injection because it is built from untrusted inputs (theDOCKER_CONTAINERCLI argument and environment-derived values inDOCKER_ENV_FLAGS) and then expanded unquoted in a command context. An attacker who can influenceDOCKER_CONTAINERor one of the environment variables (e.g., by including characters like;,&&or$(...)) can break out of the intendeddocker execinvocation and execute arbitrary additional commands on the host. To mitigate this, avoid constructing a full command string inDOCKER_EXECand instead passdocker execand its arguments as separate, properly quoted words (or an array), and ensureDOCKER_CONTAINERis strictly validated/whitelisted to be a simple container name.