[ROCm] Release fixes for rocm-jaxlib-v0.10.2 (combined cherry-picks)#993

Merged
i-chaochen merged 10 commits into rocm-jaxlib-v0.10.2 from combined-cherrypicks-v0.10.2 on Jun 24, 2026

@magaonka-amd

Motivation

Combines the four ROCm cherry-pick PRs for the JAX 0.10.2 release into a single PR against rocm-jaxlib-v0.10.2 (cut from pinned XLA 5a9e73cb). Supersedes #985, #987, #988, #989.

Commits (all cherry-picked with -x, oldest first)

From #983 backport (#985):

  • PR #40385: [ROCm] Streamline bazel targets for rocm libraries
  • PR #41591: [ROCm] Unblock CI after PR #40385
  • [ROCm] Add librocm_kpack and libclang-cpp to runfiles globs
  • Fix formatting error in rocm_blas.cc

Additional fixes requested in #985 review:

Notes

Test Plan

  • Pinned XLA for the JAX release/0.10.2 release-validation workflow.

draganmladjenovic and others added 7 commits June 23, 2026 14:14
Imported from GitHub PR openxla#40385

📝 Summary of Changes
Move from linking to rocm libraries through DsoLoader to direct linking. Make sure we link only to the specified rocm library, not to any other library that it may use. This change leaves in place enough of the wrapper (namespace wrap) headers needed for TF.

🎯 Justification
Keeps our build targets free of the transitive dependencies of rocm libraries. DsoLoader has little benefit for rocm, since the jax plugin loads the libraries ahead of time anyway.

🚀 Kind of Contribution
♻️ Cleanup

Copybara import of the project:

--
38470a3 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] Streamline bazel targets for rocm libraries

--
7f2ec34 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] Remove rocm_solver_context

--
1e34f26 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] Remove hipblaslt_wrapper.h

--
2a35a53 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] Remove hipsparse_wrapper.h

--
37338f2 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Remove rocsolver_wrapper.h

--
545039f by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Remove rocblas_wrapper.h

--
c7e959e by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Remove rocm_driver_wrapper.h

--
e41363d by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Remove wrap namespace usage from rocm_fft

--
94dc13e by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Remove wrap namespace usage from rocm_dnn

--
0885e40 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Remove TryDlopenROCmLibraries

--
983cc3f by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Remove :all_files

--
aa5b6c3 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Partialy restore *wrapper.h files needed by ROCm TF

--
d2e89c2 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Use more filegroups

--
3abd237 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Remove DsoLoader ROCm support

--
474acd2 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Unblock X86 build CI

--
57e1100 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Fix RBE build

--
8a55bfb by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

Move to rocm_lib_import macro

Merging this change closes openxla#40385

COPYBARA_INTEGRATE_REVIEW=openxla#40385 from ROCm:draganm/rocm_libs 8a55bfb
PiperOrigin-RevId: 904252712

(cherry picked from commit 4f993de)
(cherry picked from commit 16f1d8d)
Imported from GitHub PR openxla#41591

📝 Summary of Changes
Default for hermetic paths in non-hermetic build too

🎯 Justification
Makes CI pick up rocm libs when doing local build + remote testing

🚀 Kind of Contribution
🐛 Bug Fix
Copybara import of the project:

--
2ab9ead by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] Unblock CI after PR openxla#40385

Merging this change closes openxla#41591

COPYBARA_INTEGRATE_REVIEW=openxla#41591 from ROCm:draganm/rocm_libs_followup 2ab9ead
PiperOrigin-RevId: 909965137

(cherry picked from commit d9922e7)
(cherry picked from commit 83380a5)
Two new DT_NEEDED entries appeared in ROCm 7.13 that BUILD.tpl's data
globs weren't mirroring into runfiles, causing dlopen failures at
runtime:

  - libamdhip64.so.7   -> librocm_kpack.so.0
  - libamd_comgr.so.3  -> libclang-cpp.so

(cherry picked from commit 9fb3e7d)
(cherry picked from commit 802fe8a)
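
The kind of gap this commit closes can be checked mechanically. The following is a minimal sketch, not part of the PR: it parses text in the shape of `readelf -d` output and reports which DT_NEEDED sonames are not covered by a list of glob patterns. The function names, the sample text, and the pattern list are all hypothetical; the actual globs in BUILD.tpl are not shown in this PR.

```python
import fnmatch
import re

# Matches the "(NEEDED)" rows that `readelf -d` prints for a shared object.
_NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[(?P<name>[^\]]+)\]")


def needed_libraries(readelf_output):
    """Return the DT_NEEDED sonames found in `readelf -d` output text."""
    return [m.group("name") for m in _NEEDED_RE.finditer(readelf_output)]


def missing_from_globs(needed, glob_patterns):
    """Return the needed sonames that no glob pattern covers."""
    return [
        lib
        for lib in needed
        if not any(fnmatch.fnmatchcase(lib, pat) for pat in glob_patterns)
    ]


# Hand-written excerpt for illustration only; not captured from a real library.
SAMPLE = """
 0x0000000000000001 (NEEDED)             Shared library: [librocm_kpack.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
"""

# Hypothetical glob list standing in for the runfiles data globs.
PATTERNS = ["libamdhip64.so*", "libc.so*"]

if __name__ == "__main__":
    print(missing_from_globs(needed_libraries(SAMPLE), PATTERNS))
```

Any soname the check reports would need a matching glob (or an explicit entry) so that the library is mirrored into runfiles.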
…ent_sum by matchin…

Imported from GitHub PR openxla#44428

…g FloatNormalization conversions.

📝 Summary of Changes
Make the atomic-RMW matcher (GetAtomicModifierParameters) look through the extf → addf(f32) → truncf body that FloatNormalization emits for bf16, recovering the narrow bf16 modifier so that scatter-add lowers to a packed atomicrmw fadd <2 x bf16> (global_atomic_pk_add_bf16) instead of a CAS loop. GpuFloatSupport and FloatNormalization are unchanged; targets without a native bf16 atomic still fall back to CAS.

🎯 Justification
bf16 segment_sum/scatter-add lowers to a slow CAS loop on MI300/MI350 even though the hardware has a packed bf16 atomic, making bf16 ~7x slower than f16.
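
To illustrate the "look through the conversions" idea, here is a toy sketch in Python. It is not XLA's GetAtomicModifierParameters and does not use MLIR; the `Arg` and `Op` classes and the `match_fadd` function are hypothetical stand-ins for an atomic-RMW body. The matcher accepts a plain `addf` as before, and additionally accepts `truncf(addf(extf(a), extf(b)))` when both inputs already have the narrow type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arg:
    """A block argument of the toy atomic-RMW body."""
    name: str
    dtype: str


@dataclass(frozen=True)
class Op:
    """A toy operation: op name, result element type, operand values."""
    name: str
    dtype: str
    operands: tuple


def match_fadd(result):
    """Return the element type a direct fadd could use, or None.

    None stands for "no direct atomic matched", i.e. the CAS-loop fallback.
    """
    # Direct form: the body already is a single addf.
    if isinstance(result, Op) and result.name == "addf":
        return result.dtype
    # Widened form: truncf(addf(extf(a), extf(b))).
    if not (isinstance(result, Op) and result.name == "truncf"):
        return None
    add = result.operands[0]
    if not (isinstance(add, Op) and add.name == "addf"):
        return None
    for operand in add.operands:
        if not (isinstance(operand, Op) and operand.name == "extf"):
            return None
        # The value being widened must already have the narrow type.
        if operand.operands[0].dtype != result.dtype:
            return None
    return result.dtype


current = Arg("current", "bf16")
update = Arg("update", "bf16")
widened_body = Op(
    "truncf",
    "bf16",
    (Op("addf", "f32", (Op("extf", "f32", (current,)),
                        Op("extf", "f32", (update,)))),),
)

if __name__ == "__main__":
    print(match_fadd(widened_body))
```

In this sketch the matcher only recovers the narrow type. Whether the target has a native atomic for that type is a separate decision, which is why targets without a native bf16 atomic still fall back to CAS.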

🚀 Kind of Contribution
⚡️ Performance Improvement, 🧪 Tests

🧪 Unit Tests:
Added direct_atomic_rmw_fadd_bf16_widened + a gfx942 CHECK-GFX942-MI300 RUN line to lower_tensors.mlir, asserting the packed atomicrmw fadd <2 x bf16> with no CAS. All 9 RUN-line prefixes pass.

Copybara import of the project:

--
edcb06b by Zoran Jovanovic <zjovanov@amd.com>:

[ROCm] Emit packed bf16 atomic add for scatter/segment_sum by matching FloatNormalization conversions.

Merging this change closes openxla#44428

COPYBARA_INTEGRATE_REVIEW=openxla#44428 from ROCm:rocm-bf16-atomic-scatter edcb06b
PiperOrigin-RevId: 933630040

(cherry picked from commit 97544f7)
(cherry picked from commit 06b0034)
* Add cpu test job

* Delete .github/workflows/rocm_xla_ci.yml

* Trigger CI/CD pipeline

* Force compile locally

* Ignore failing mgpu tests

(cherry picked from commit 13dc711)
(cherry picked from commit 70e3548)
(cherry picked from commit 9b7baa6)
(cherry picked from commit 01cb09e)
magaonka-amd and others added 3 commits June 24, 2026 09:59
The ROCm CI image's /usertools/rocm.bazelrc excludes
-//xla/backends/gpu/codegen:dynamic_slice_fusion_test_amdgpu_any, but at this
pinned XLA revision (5a9e73cb) that test is an xla_cc_test in
//xla/backends/gpu/transforms, so the label doesn't exist and bazel aborts
pattern evaluation under --nokeep_going (single_gpu/multi_gpu/rocm_cpu all
fail before any test runs).

Add an alias under the expected codegen label so the exclude resolves. The
real test still runs via //xla/... and passes locally on gfx950. This is a
temporary workaround for the CI-image/pin mismatch; revert once the image
bazelrc is corrected.
(cherry picked from commit 8e36d62)
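
The failure mode and the workaround can be modeled with a toy resolver. This is a sketch under stated assumptions, not Bazel's implementation: `resolve_test_patterns` is a hypothetical function, the `keep_going` behavior is simplified, and the transforms test label below is a placeholder because the commit message does not give the real target name.

```python
def resolve_test_patterns(patterns, known_labels, keep_going=False):
    """Toy model of target-pattern evaluation with negative patterns."""
    selected = set()
    for pattern in patterns:
        negative = pattern.startswith("-")
        body = pattern[1:] if negative else pattern
        if body.endswith("/..."):
            prefix = body[: -len("/...")]
            matches = {
                label
                for label in known_labels
                if label.startswith(prefix + "/") or label.startswith(prefix + ":")
            }
        elif body in known_labels:
            matches = {body}
        elif keep_going:
            continue  # Skip the unresolved label instead of aborting.
        else:
            raise LookupError(f"no such target: {body}")
        if negative:
            selected -= matches
        else:
            selected |= matches
    return sorted(selected)


# Placeholder name for the real xla_cc_test under //xla/backends/gpu/transforms.
REAL_TEST = "//xla/backends/gpu/transforms:dynamic_slice_fusion_test"
# Label that the CI image's bazelrc excludes.
ALIAS = "//xla/backends/gpu/codegen:dynamic_slice_fusion_test_amdgpu_any"
PATTERNS = ["//xla/...", "-" + ALIAS]

if __name__ == "__main__":
    try:
        resolve_test_patterns(PATTERNS, {REAL_TEST})
    except LookupError as err:
        print("without alias:", err)
    print("with alias:", resolve_test_patterns(PATTERNS, {REAL_TEST, ALIAS}))
```

With the alias present, the exclude resolves and removes only the alias label, so the real test is still selected through `//xla/...`.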
The release-branch EXCLUDED_TESTS only filtered two F8E4M3FN cases, while
upstream openxla/xla ROCm CI filters a larger set of known
ROCm-unsupported / hipBLASLt-gap cases. This left the f64 cublasLt cases
(DotOperationTestWithCublasLt.../GeneralMatMulActivation and
MatmulTestWithCublas.GemmRewriter_RegressionTestF64) running and failing
with "Failed to get a MatmulPlan: no valid algorithm found" (hipBLASLt has
no f64+activation algorithm), plus the F8E5M2 / triton dot-algorithm /
scaled-dot cases.

Take the union: keep our two F8E4M3FN entries and add upstream's skips.
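
"Taking the union" of the two skip lists can be sketched as an ordered, de-duplicated merge. The helper names below are hypothetical, the two F8E4M3FN entries are placeholders (the commit message does not name them), and rendering the list as a negative GoogleTest filter is an assumption about how EXCLUDED_TESTS is consumed.

```python
def merge_exclusions(ours, upstream):
    """Ordered union: keep our entries first, then upstream's new ones."""
    return list(dict.fromkeys([*ours, *upstream]))


def as_gtest_filter(excluded):
    """Render a negative GoogleTest filter such as `-A.B:C.D`."""
    return "-" + ":".join(excluded) if excluded else "*"


# Placeholder names for the two existing F8E4M3FN entries.
OURS = ["Placeholder.F8E4M3FN_CaseA", "Placeholder.F8E4M3FN_CaseB"]
# One name from the commit message plus a duplicate of an existing entry.
UPSTREAM = [
    "MatmulTestWithCublas.GemmRewriter_RegressionTestF64",
    "Placeholder.F8E4M3FN_CaseA",
]

if __name__ == "__main__":
    print(as_gtest_filter(merge_exclusions(OURS, UPSTREAM)))
```

Using `dict.fromkeys` keeps the first occurrence of each name, so an entry present in both lists appears once.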
@magaonka-amd

Author

Thanks, JD.

[image: local test results]

These are my local test results. As discussed, this CI runner uses MI250, and there are failures related to it.

I can't merge this; maybe I don't have permission.

@i-chaochen i-chaochen merged commit 55f5563 into rocm-jaxlib-v0.10.2 Jun 24, 2026
4 of 14 checks passed
@i-chaochen

Collaborator

We might need openxla#44047 for this UT:

[ RUN      ] FloatSupportTestWithCublas.MixedTypeDotIsNotUpcasted
I0000 00:00:1782327262.048556   24308 se_gpu_pjrt_client.cc:1358] Using BFC allocator.
I0000 00:00:1782327262.048604   24308 gpu_helpers.cc:161] XLA backend will use up to 47.99GiB (51527024640 bytes) on device 0 for BFCAllocator.
I0000 00:00:1782327262.048619   24308 gpu_helpers.cc:201] XLA backend will use up to 17175674880 bytes on device 0 for CollectiveBFCAllocator.
./xla/tests/hlo_runner_agnostic_test_base.h:203: Failure
Value of: _status_or_value41.status().ok()
  Actual: false
Expected: true
INTERNAL: GEMM is not supported by cublasLt and legacy cublas fallback is removed.
xla/service/gpu/float_support_test.cc:60: Failure
Value of: RunAndCompare(kHloText, ErrorSpec{1e-6, 1e-6})
  Actual: false (INTERNAL: GEMM is not supported by cublasLt and legacy cublas fallback is removed.)
Expected: true
[  FAILED  ] FloatSupportTestWithCublas.MixedTypeDotIsNotUpcasted (96 ms)

The rest of the GEMM-related test failures are due to MI250 on CI; they are not a problem on MI350, as the test results above show.

@magaonka-amd

Author

94e0cff
c64/c128 support PR is in 10.2

magaonka-amd added a commit to ROCm/jax that referenced this pull request Jun 24, 2026
Update XLA_COMMIT to 55f5563f, the rocm-jaxlib-v0.10.2 HEAD after
ROCm/xla#993 merged (combined ROCm 0.10.2 release cherry-picks:
PR #40385/#41591 bazel target streamlining, librocm_kpack/libclang-cpp
runfiles, rocm_blas formatting, openxla #44428/#44389, fix-failing-tests
jax-ml#939, dynamic_slice_fusion codegen alias, missing Tensile libs #44567,
and EXCLUDED_TESTS alignment). Recompute XLA_SHA256 from the ROCm/xla
archive. The four third_party/xla build patches were verified to still
apply cleanly against this revision.
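
Recomputing a checksum such as XLA_SHA256 amounts to hashing the downloaded archive. A minimal sketch follows, assuming the archive is already on disk; `sha256_of_file` and the path in the usage comment are hypothetical and not taken from the jax repository.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (hypothetical local path to the downloaded archive):
# print(sha256_of_file("xla-archive.tar.gz"))
```

The resulting hex string is the value that would be written next to the updated commit hash.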

8 participants