running multi-node with Exclusive Process #369

@angainor

When running all_reduce_perf across 2 nodes with 4 GH200 GPUs per node, I run into trouble: nccl-tests reports that the GPUs are busy, although they are not. I have the following SLURM job:

#SBATCH --ntasks-per-node=4 --gpus-per-node=4 --nodes=2
srun ./build/all_reduce_perf -b 8 -e 128M -f 2

The output looks fine at first, but then the program fails:

# nccl-tests version 2.17.8 nccl-headers=22606 nccl-library=22606
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3617715 on    gpu-1-1 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  1 Group  0 Pid 3617716 on    gpu-1-1 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  2 Group  0 Pid 3617717 on    gpu-1-1 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  3 Group  0 Pid 3617718 on    gpu-1-1 device  3 [0039:01:00] NVIDIA GH200 120GB
#  Rank  4 Group  0 Pid 3230358 on    gpu-1-7 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  5 Group  0 Pid 3230359 on    gpu-1-7 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  6 Group  0 Pid 3230360 on    gpu-1-7 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  7 Group  0 Pid 3230361 on    gpu-1-7 device  3 [0039:01:00] NVIDIA GH200 120GB
gpu-1-1: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-1 pid 3617715: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-7 pid 3230359: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-7 pid 3230361: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-7 pid 3230358: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-7 pid 3230360: Test failure common.cu:1189
gpu-1-1: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-1 pid 3617718: Test failure common.cu:1189

My GPUs are configured in Exclusive Process compute mode, so multiple processes cannot use the same GPU. I guess that is what nccl-tests is trying to do, because when I switch to Default mode:

nvidia-smi -i 0,1,2,3 -c 0

the test runs to completion. Inspecting with nvidia-smi shows that several processes are indeed using GPU 0:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3203007      C   ...ests-2.17.8/./build/all_reduce_perf        788MiB |
|    0   N/A  N/A   3203008      C   ...ests-2.17.8/./build/all_reduce_perf        556MiB |
|    0   N/A  N/A   3203009      C   ...ests-2.17.8/./build/all_reduce_perf        556MiB |
|    0   N/A  N/A   3203010      C   ...ests-2.17.8/./build/all_reduce_perf        556MiB |
|    1   N/A  N/A   3203008      C   ...ests-2.17.8/./build/all_reduce_perf        790MiB |
|    2   N/A  N/A   3203009      C   ...ests-2.17.8/./build/all_reduce_perf        786MiB |
|    3   N/A  N/A   3203010      C   ...ests-2.17.8/./build/all_reduce_perf        784MiB |
+-----------------------------------------------------------------------------------------+

More precisely, the 3 processes that should use only GPUs 1, 2, and 3 each also show up as a process on GPU 0.
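The same check can be scripted instead of read off the table by eye. Below is a minimal sketch, not something taken from nccl-tests: the function name `pids_on_multiple_gpus` is made up for illustration, and it assumes input lines of the form `pid, gpu_id`, which is the shape `nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader` prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "pid, gpu_id" lines on stdin and print every PID
# that appears on more than one distinct GPU, followed by the GPU count.
pids_on_multiple_gpus() {
  awk -F', *' '
    # Count each (pid, gpu) pair only once, then tally GPUs per PID.
    NF >= 2 && !seen[$1, $2]++ { gpus[$1]++ }
    END { for (pid in gpus) if (gpus[pid] > 1) print pid, gpus[pid] }
  ' | sort -n
}

# Usage on a compute node while the test is running (needs the NVIDIA driver):
#   nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader | pids_on_multiple_gpus
```

Fed the (GPU, PID) pairs from the table above, this would list PIDs 3203008, 3203009, and 3203010 with 2 GPUs each. Once done testing in Default mode, `nvidia-smi -i 0,1,2,3 -c 3` switches the GPUs back to Exclusive Process.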

Is it possible to run nccl-tests with Exclusive Process? Or do I have to consider changing that setting on the system? Also, do you know if this is only an nccl-tests limitation, or a general NCCL issue?
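To help narrow this down, here is a small diagnostic sketch that prints what each SLURM task sees before the test starts. It relies on `SLURM_PROCID` and `SLURM_LOCALID`, which SLURM sets for every task. Whether `CUDA_VISIBLE_DEVICES` is set depends on the site's GPU configuration, so that part is an assumption. The function names and the script name `rank_info.sh` are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic: report the rank, local rank, and visible GPUs of
# the current SLURM task. Unset variables are printed as "unset".
describe_rank() {
  printf 'host=%s rank=%s local_rank=%s visible=%s\n' \
    "$(uname -n)" "${SLURM_PROCID:-unset}" "${SLURM_LOCALID:-unset}" \
    "${CUDA_VISIBLE_DEVICES:-unset}"
}

# Print the report on stderr, then run the given command unchanged.
run_with_rank_info() {
  describe_rank >&2
  "$@"
}

# When saved as a script (rank_info.sh here), end the file with:
#   run_with_rank_info "$@"
# and launch it as:
#   srun ./rank_info.sh ./build/all_reduce_perf -b 8 -e 128M -f 2
```

If every task on a node reports the same full device list, each process can open any of the 4 GPUs, which would be consistent with the extra entries on GPU 0 shown above.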
