Add Per-GPU-Pair Bandwidth Diagnostics for NCCL_TESTS_SPLIT_MASK by kzlxd · Pull Request #376 · NVIDIA/nccl-tests

kzlxd · 2026-04-02T03:21:00Z

Add Per-GPU-Pair Bandwidth Diagnostics for `NCCL_TESTS_SPLIT_MASK`

Description

Background

Setting NCCL_TESTS_SPLIT_MASK in a multi-node environment (e.g., 2 nodes × 8 GPUs) divides the MPI ranks into subgroups. With a mask such as 0x7, each subgroup is a cross-node GPU pair: the two GPUs that share the same local index on each node.

This mechanism is particularly useful for analyzing communication performance between specific GPU pairs.
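
As a rough illustration of the grouping, the sketch below assumes the subgroup color is computed as `rank & mask` and that ranks are placed node by node (ranks 0-7 on the first node, 8-15 on the second). Both are assumptions made for this example; `split_groups` is a hypothetical helper, not part of nccl-tests.

```python
from collections import defaultdict


def split_groups(num_ranks: int, mask: int) -> dict[int, list[int]]:
    """Group ranks by color = rank & mask (assumed split rule)."""
    groups = defaultdict(list)
    for rank in range(num_ranks):
        groups[rank & mask].append(rank)
    return dict(groups)


# 2 nodes x 8 GPUs; ranks 0-7 assumed on node 0, ranks 8-15 on node 1.
groups = split_groups(num_ranks=16, mask=0x7)
for color, ranks in sorted(groups.items()):
    print(f"GPU pair {color}: ranks {ranks}")
```

Under these assumptions, each of the 8 subgroups contains rank i and rank i + 8, i.e. the GPUs with local index i on the two nodes.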

Motivation

Currently, nccl-tests only reports aggregated performance across all ranks, which makes it difficult to:

- Inspect bandwidth for individual GPU pairs
- Identify performance imbalance across pairs
- Diagnose slow GPUs, NICs, or network paths

Changes

This PR introduces an optional diagnostic feature that enables per-GPU-pair performance visibility when using NCCL_TESTS_SPLIT_MASK.

When enabled via:

NCCL_TESTS_SPLIT_VERBOSE=1

the following enhancements are provided:

- Output full nccl-tests-formatted results per GPU pair
- Preserve the original aggregated output for backward compatibility
- Add a summary table based on the largest message size
- Automatically highlight relatively slow GPU pairs

Example Output

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    16777216       4194304     float     sum      -1  3280.85     5.11     5.11    N/A  3284.68     5.11     5.11    N/A
    ...
  1073741824     268435456     float     sum      -1   190235     5.64     5.64    N/A   189238     5.67     5.67    N/A
#
# ============= Per-GPU-pair Bandwidth Breakdown =============
#
# --- GPU Pair 0 --- <<< SLOW
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
    ...
#
# --- GPU Pair 1 ---
#       ...
#
# --- Summary (based on largest msg size: 1073741824 bytes) ---
# GPU_Pair    OOP(GB/s)   vs_max     IP(GB/s)   vs_max  Status
#        0         5.64    12.1%         5.67    12.1%  <<< SLOW
#        1        46.21    98.7%        46.34    99.0%  OK
#        ...
#
# =============================================================
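
The `vs_max` and `Status` columns can be read as a comparison against the fastest pair. The sketch below shows one way such a summary could be derived. The `slow_ratio` threshold of 0.8 is a hypothetical value chosen for illustration; this description does not state the rule the PR actually uses. Only the two rows shown above are used as input, so the maximum (and therefore the percentages) differ from the table, where other pairs are elided.

```python
def summarize(bw_by_pair: dict[int, float], slow_ratio: float = 0.8):
    """Return (pair, bandwidth, fraction_of_max, status) rows.

    slow_ratio is a hypothetical threshold, not taken from the PR.
    """
    best = max(bw_by_pair.values())
    rows = []
    for pair, bw in sorted(bw_by_pair.items()):
        frac = bw / best
        status = "<<< SLOW" if frac < slow_ratio else "OK"
        rows.append((pair, bw, frac, status))
    return rows


# Out-of-place bandwidth (GB/s) at the largest message size, from the
# two summary rows shown above; the remaining pairs are elided there.
oop = {0: 5.64, 1: 46.21}
rows = summarize(oop)
for pair, bw, frac, status in rows:
    print(f"# {pair:8d} {bw:12.2f} {frac:8.1%}  {status}")
```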

Usage

mpirun --allow-run-as-root \
  --map-by ppr:8:node --np 16 --hostfile two_hostfile \
  ... \
  -x NCCL_TESTS_SPLIT_MASK=0x7 \
  -x NCCL_TESTS_SPLIT_VERBOSE=1 \
  /path/to/all_reduce_perf_mpi \
  -b 16M -e 1G -f 2 -g 1 -t 1 -c 0 -n 20 -w 20 -o sum
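
For automated checks, the summary rows from a run like this could be extracted from the captured output. The sketch below is a hypothetical post-processing helper, not part of this PR; it assumes the summary row layout matches the example output above (pair index, out-of-place bandwidth, percentage, in-place bandwidth, percentage, status).

```python
import re

# One summary row: "#  <pair>  <oop>  <pct>%  <ip>  <pct>%  <status>"
ROW = re.compile(
    r"^#\s+(\d+)\s+([\d.]+)\s+([\d.]+)%\s+([\d.]+)\s+([\d.]+)%\s+(.*\S)\s*$"
)


def parse_summary(text: str) -> list[dict]:
    """Collect summary rows; header and elided lines do not match ROW."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "pair": int(m.group(1)),
                "oop_gbps": float(m.group(2)),
                "ip_gbps": float(m.group(4)),
                "slow": "SLOW" in m.group(6),
            })
    return rows


sample = """\
# GPU_Pair    OOP(GB/s)   vs_max     IP(GB/s)   vs_max  Status
#        0         5.64    12.1%         5.67    12.1%  <<< SLOW
#        1        46.21    98.7%        46.34    99.0%  OK
#        ...
"""
parsed = parse_summary(sample)
slow_pairs = [r["pair"] for r in parsed if r["slow"]]
print(slow_pairs)
```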

Compatibility

- Fully backward compatible
- No behavior change unless NCCL_TESTS_SPLIT_VERBOSE=1 is set
- Original aggregated output format remains unchanged

Benefits

- Provides fine-grained per-GPU-pair performance visibility
- Helps identify imbalanced communication paths
- Improves debugging efficiency in multi-node environments
- Zero overhead when the feature is disabled

Notes

- The summary is based on the largest message size to reflect steady-state bandwidth
- NCCL_TESTS_SPLIT_VERBOSE is intended to be used together with NCCL_TESTS_SPLIT_MASK

Add multi-channel DP test results

56d0a64

kzlxd wants to merge 1 commit into NVIDIA:master from kzlxd:multi_dp_nccl_test
