[WIP][XLA:GPU][ROCm] Fix device memory bandwidth with per-gfx peak table by nurmukhametov · Pull Request #970 · ROCm/xla

nurmukhametov · 2026-06-19T14:45:55Z

The legacy `2 * bus_width * mem_clock` formula undercounts memory bandwidth on HBM3/HBM3e (MI300X, MI350X) and GDDR6 (RX 9070 XT) because HIP reports the memory controller clock (UCLK) rather than the data-rate clock. On MI350X the result is about 1.7x too low, which skews the GPU cost model's fusion decisions.

`GetRocmMemoryBandwidth` returns a per-gfx spec peak for those arches and falls back to the formula otherwise (the formula is still correct on HBM2/HBM2e).
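For readers who want the shape of the change without opening the diff, here is a minimal sketch of the table-then-fallback idea. It is not the code in this PR: the function names, the string-keyed lookup, and the `kGBPerSec` constant are hypothetical, and the units (bus width in bits, clock in kHz, result in bytes per second) are assumptions. Only the gfx942 entry is filled in, using the 5300 figure visible in the reviewed hunk and assuming it means GB/s; the other arches are left out rather than guessed.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Assumed unit: one gigabyte per second, expressed in bytes per second.
constexpr int64_t kGBPerSec = 1'000'000'000;

// Legacy estimate: double data rate * clock (Hz) * bus width (bytes).
// Assumes the clock is given in kHz and the bus width in bits.
constexpr int64_t LegacyBandwidthBytesPerSec(int64_t bus_width_bits,
                                             int64_t mem_clock_khz) {
  return 2 * mem_clock_khz * 1000 * bus_width_bits / 8;
}

// Hypothetical per-gfx spec-peak lookup. Returns nullopt for arches where
// the legacy formula is trusted (for example HBM2/HBM2e parts).
constexpr std::optional<int64_t> SpecPeakBytesPerSec(std::string_view gfx) {
  if (gfx == "gfx942") return 5300 * kGBPerSec;  // MI300X, HBM3
  // Entries for gfx950 and gfx1201 would go here; values omitted.
  return std::nullopt;
}

// Prefer the spec peak when one is listed, otherwise use the formula.
constexpr int64_t MemoryBandwidthBytesPerSec(std::string_view gfx,
                                             int64_t bus_width_bits,
                                             int64_t mem_clock_khz) {
  if (auto peak = SpecPeakBytesPerSec(gfx)) return *peak;
  return LegacyBandwidthBytesPerSec(bus_width_bits, mem_clock_khz);
}
```

With this layout the reported clock and bus width are ignored for any arch that has a table entry, so a low UCLK reading cannot pull the modeled bandwidth below spec peak on those parts.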

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

i-chaochen · 2026-06-19T15:09:59Z

+  // clock` lands at spec peak, so those arches fall through to it. On HBM3/HBM3e
+  // (gfx942 MI300X, gfx950 MI350X) and GDDR6 (gfx1201) the formula falls short
+  // of spec peak, so an explicit per-gfx value is used instead.
+  if (cc.gfx9_mi300()) return 5300 * kGbps;  // MI300X, HBM3


Why don't we just put this hardware-related info in rocm_compute_capability.h?

nurmukhametov · 2026-06-22

I just followed the precedent set by rocm_pcie_bandwidth, which uses the same separate-translation-unit approach.

The legacy `2 * bus_width * mem_clock` formula undercounts memory bandwidth on HBM3/HBM3e (MI300X, MI350X) and GDDR6 (RX 9070 XT) because HIP reports the controller clock (UCLK), not the data-rate clock (~1.7x low on MI350X), skewing the GPU cost model's fusion decisions.

GetRocmMemoryBandwidth returns a per-gfx spec peak for those arches and falls back to the formula otherwise (still correct on HBM2/HBM2e).

Add test coverage for the fix:

- New MI350X (gfx950) test device: AMDMI350DeviceInfo() and the mi350.txtpb target config.
- priority_fusion_test.cc: PriorityFusionRocmMemoryBandwidthTest verifies that the modeled bandwidth tips PriorityFusion's reduce-into-consumers decision.
- memory_bandwidth_fusion.hlo: lit test checking the final fused HLO per arch (mi350 collapses to one multi-output fusion; mi200 stays split).
- triton_tdm_elemwise.hlo: add CHECK-mi350 prefixes (same non-TDM form as mi200).
- BUILD: add the mi350 target to the GPU lit test suite.

nurmukhametov requested review from draganmladjenovic and i-chaochen June 19, 2026 14:46

i-chaochen reviewed Jun 19, 2026

nurmukhametov requested a review from i-chaochen June 22, 2026 15:14

nurmukhametov force-pushed the anurmukh/fix-memory-bandwidth branch 4 times, most recently from a18cc9b to 10524e5 Compare June 24, 2026 11:08

nurmukhametov changed the title ~~[XLA:GPU][ROCm] Fix device memory bandwidth with per-gfx peak table~~ [WIP][XLA:GPU][ROCm] Fix device memory bandwidth with per-gfx peak table Jun 24, 2026

nurmukhametov force-pushed the anurmukh/fix-memory-bandwidth branch 4 times, most recently from 098b50e to f0d3897 Compare June 25, 2026 09:13

nurmukhametov force-pushed the anurmukh/fix-memory-bandwidth branch from f0d3897 to 20782eb Compare June 25, 2026 10:33

nurmukhametov wants to merge 1 commit into main from anurmukh/fix-memory-bandwidth

2 participants
