Add Per-GPU-Pair Bandwidth Diagnostics for NCCL_TESTS_SPLIT_MASK #376

Open
kzlxd wants to merge 1 commit into NVIDIA:master from kzlxd:multi_dp_nccl_test
kzlxd commented Apr 2, 2026

Add Per-GPU-Pair Bandwidth Diagnostics for NCCL_TESTS_SPLIT_MASK

Description

Background

When NCCL_TESTS_SPLIT_MASK is set in a multi-node run (e.g., 2 nodes × 8 GPUs), the MPI ranks are divided into subgroups. With a suitable mask, each subgroup is a cross-node GPU pair whose two members share the same local GPU index.

This mechanism is particularly useful for analyzing communication performance between specific GPU pairs.
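
To make the grouping concrete, here is a minimal sketch. It assumes the subgroup "color" of a rank is computed as `rank & mask` and that ranks are numbered node by node (8 per node); both are assumptions for illustration, and `split_groups` is a hypothetical helper, not part of nccl-tests.

```python
from collections import defaultdict


def split_groups(num_ranks: int, mask: int) -> dict[int, list[int]]:
    """Group ranks by color, ASSUMING color = rank & mask."""
    groups: dict[int, list[int]] = defaultdict(list)
    for rank in range(num_ranks):
        groups[rank & mask].append(rank)
    return dict(groups)


if __name__ == "__main__":
    # 2 nodes x 8 GPUs, mask 0x7: each color collects the two ranks
    # that share the same local index on different nodes.
    for color, ranks in sorted(split_groups(16, 0x7).items()):
        print(f"pair {color}: ranks {ranks}")
```

Under these assumptions, a mask of 0x7 keeps the low three bits of the rank, so ranks 0 and 8 land in pair 0, ranks 1 and 9 in pair 1, and so on.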

Motivation

Currently, nccl-tests only reports aggregated performance across all ranks, which makes it difficult to:

  • Inspect bandwidth for individual GPU pairs
  • Identify performance imbalance across pairs
  • Diagnose slow GPUs, NICs, or network paths

Changes

This PR introduces an optional diagnostic feature that enables per-GPU-pair performance visibility when using NCCL_TESTS_SPLIT_MASK.

When enabled via:

NCCL_TESTS_SPLIT_VERBOSE=1

the following enhancements are provided:

  • Output full nccl-tests-formatted results per GPU pair
  • Preserve the original aggregated output for backward compatibility
  • Add a summary table based on the largest message size
  • Automatically mark GPU pairs whose bandwidth is low relative to the fastest pair (the vs_max column) with <<< SLOW

Example Output

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    16777216       4194304     float     sum      -1  3280.85    35.11    35.11    N/A  3284.68    35.11    35.11    N/A
    ...
  1073741824     268435456     float     sum      -1   190235    35.64    35.64    N/A   189238    35.67    35.67    N/A
#
# ============= Per-GPU-pair Bandwidth Breakdown =============
#
# --- GPU Pair 0 --- <<< SLOW
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
    ...
#
# --- GPU Pair 1 ---
#       ...
#
# --- Summary (based on largest msg size: 1073741824 bytes) ---
# GPU_Pair    OOP(GB/s)   vs_max     IP(GB/s)   vs_max  Status
#        0         5.64    12.1%         5.67    12.1%  <<< SLOW
#        1        46.21    98.7%        46.34    99.0%  OK
#        ...
#
# =============================================================
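
The vs_max and Status columns in the summary above can be approximated as follows. This is only a sketch of the idea, not the code in this PR: the function name `summarize` is hypothetical, and the 80% cutoff is an assumed threshold, since the description does not state which value is used.

```python
def summarize(bw_by_pair: dict[int, float], slow_ratio: float = 0.8):
    """Return (pair, bandwidth, percent_of_max, status) rows.

    bw_by_pair holds each pair's bandwidth (GB/s) at the largest message size.
    slow_ratio is a HYPOTHETICAL threshold, not taken from the PR.
    """
    if not bw_by_pair:
        return []
    best = max(bw_by_pair.values())
    rows = []
    for pair, bw in sorted(bw_by_pair.items()):
        pct = 100.0 * bw / best if best > 0 else 0.0
        status = "<<< SLOW" if bw < slow_ratio * best else "OK"
        rows.append((pair, bw, pct, status))
    return rows


if __name__ == "__main__":
    # Made-up bandwidths purely for illustration.
    for pair, bw, pct, status in summarize({0: 5.0, 1: 50.0}):
        print(f"# {pair:8d} {bw:12.2f} {pct:7.1f}%  {status}")
```

Comparing each pair against the fastest pair, rather than against a fixed bandwidth figure, keeps the check independent of the interconnect generation.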

Usage

mpirun --allow-run-as-root \
  --map-by ppr:8:node --np 16 --hostfile two_hostfile \
  ... \
  -x NCCL_TESTS_SPLIT_MASK=0x7 \
  -x NCCL_TESTS_SPLIT_VERBOSE=1 \
  /path/to/all_reduce_perf_mpi \
  -b 16M -e 1G -f 2 -g 1 -t 1 -c 0 -n 20 -w 20 -o sum
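
When the run is captured to a log, the slow pairs can be pulled out of the summary with a small script. The sketch below assumes the summary rows look exactly like the Example Output above; the regular expression and the helper name `slow_pairs` are illustrative and would need adjusting if the final format differs.

```python
import re

# Matches rows such as:
# "#        0         5.64    12.1%         5.67    12.1%  <<< SLOW"
ROW = re.compile(
    r"^#\s+(\d+)\s+([\d.]+)\s+([\d.]+)%\s+([\d.]+)\s+([\d.]+)%\s+(\S.*?)\s*$"
)


def slow_pairs(output: str) -> list[int]:
    """Return the GPU pair indices flagged SLOW in the summary table."""
    in_summary = False
    slow = []
    for line in output.splitlines():
        if line.startswith("# --- Summary"):
            in_summary = True  # only parse rows after the summary header
            continue
        if not in_summary:
            continue
        match = ROW.match(line)
        if match and "SLOW" in match.group(6):
            slow.append(int(match.group(1)))
    return slow
```

For example, `slow_pairs(open("run.log").read())` would return the flagged pair indices, where `run.log` is a hypothetical file holding the captured output.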

Compatibility

  • Fully backward compatible
  • No behavior change unless NCCL_TESTS_SPLIT_VERBOSE=1 is set
  • Original aggregated output format remains unchanged

Benefits

  • Provides fine-grained per-GPU-pair performance visibility
  • Helps identify imbalanced communication paths
  • Improves debugging efficiency in multi-node environments
  • Zero overhead when the feature is disabled

Notes

  • The summary is based on the largest message size to reflect steady-state bandwidth
  • NCCL_TESTS_SPLIT_VERBOSE is intended to be used together with NCCL_TESTS_SPLIT_MASK
