Cray MPI autodetection fails on AMD MI300A node (missing mpi_gtl_hsa preload) #919

@LasNikas

On a node with 4× AMD MI300A GPUs, automatic detection of the Cray MPI setup via
MPIPreferences.use_system_binary(vendor="cray") fails with the following error:

julia> MPIPreferences.use_system_binary(vendor="cray")
ERROR: ArgumentError: Collection has multiple elements, must contain exactly 1 element
Stacktrace:
 [1] _only
   @ ./iterators.jl:1554 [inlined]
 [2] only
   @ ./iterators.jl:1545 [inlined]
 [3] only_or_nothing
   @ ~/.julia/packages/MPIPreferences/PLH7x/src/parse_cray_cc.jl:42 [inlined]
 [4] cray_gtl
   @ ~/.julia/packages/MPIPreferences/PLH7x/src/parse_cray_cc.jl:56 [inlined]
 [5] analyze_cray_cc()
   @ MPIPreferences.CrayParser ~/.julia/packages/MPIPreferences/PLH7x/src/parse_cray_cc.jl:84
 [6] use_system_binary(...)
   @ MPIPreferences ~/.julia/packages/MPIPreferences/PLH7x/src/MPIPreferences.jl:180
 [7] top-level scope
   @ REPL[2]:1
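
The message in frame [1] comes from Base's `only`, which throws as soon as a collection holds more than one element. The following is a minimal sketch in plain Julia of how that error arises. The candidate list is hypothetical and only illustrates the failure mode; what `cray_gtl` actually inspects is not shown in this report.

```julia
# Hypothetical candidate list, for illustration only.
# The real input that cray_gtl processes is not shown in this report.
candidates = ["libmpi_gtl_cuda", "libmpi_gtl_hsa", "libmpi_gtl_ze"]

# `only` returns the single element of a collection and
# throws an ArgumentError if there are zero or several elements.
err = try
    only(candidates)
catch e
    e
end

println(err isa ArgumentError)  # true

# With exactly one candidate, `only` simply returns it.
println(only(["libmpi_gtl_hsa"]))  # libmpi_gtl_hsa
```

If the parser does see several GTL candidates on this node, a single-match assumption would fail in exactly this way.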

Manually editing LocalPreferences.toml to include:

[MPIPreferences]
_format = "1.1"
abi = "MPICH"
binary = "system"
libmpi = "libmpi_cray"
preloads = ["libmpi_gtl_hsa.so"]
cclibs = []

works around the issue: MPI.jl then runs correctly across the GPUs.
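
When editing the file by hand, it can help to confirm that the preload entry is actually parsed as intended. The sketch below uses only Julia's standard-library TOML parser. The helper name `has_preload` is hypothetical and is not part of MPIPreferences.

```julia
using TOML  # standard library

# Hypothetical helper: report whether a LocalPreferences.toml text
# lists `lib` under [MPIPreferences].preloads.
function has_preload(toml_text::AbstractString, lib::AbstractString)
    prefs = TOML.parse(toml_text)
    section = get(prefs, "MPIPreferences", Dict{String,Any}())
    return lib in get(section, "preloads", String[])
end

# The manually edited preferences from this report.
example = """
[MPIPreferences]
_format = "1.1"
abi = "MPICH"
binary = "system"
libmpi = "libmpi_cray"
preloads = ["libmpi_gtl_hsa.so"]
cclibs = []
"""

println(has_preload(example, "libmpi_gtl_hsa.so"))  # true
```

To check the file on disk instead, pass `read("LocalPreferences.toml", String)` as the first argument, assuming the file sits in the current directory.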

The node uses:

cray-mpich/8.1.30
rocm/6.2.2

$CRAY_MPICH_ROOTDIR/gtl/lib contains:

libmpi_gtl_cuda.a  libmpi_gtl_cuda.so  libmpi_gtl_cuda.so.0  libmpi_gtl_cuda.so.0.1.0  libmpi_gtl_hsa.a  libmpi_gtl_hsa.so  libmpi_gtl_hsa.so.0  libmpi_gtl_hsa.so.0.1.0  libmpi_gtl_ze.a  libmpi_gtl_ze.so  libmpi_gtl_ze.so.0  libmpi_gtl_ze.so.0.1.0  pkgconfig

cc @vchuravy
