[WIP][XLA:GPU][ROCm] Fix device memory bandwidth with per-gfx peak table#970

Open: nurmukhametov wants to merge 1 commit into main from anurmukh/fix-memory-bandwidth

@nurmukhametov

Member

The legacy `2 * bus_width * mem_clock` formula undercounts memory bandwidth on HBM3/HBM3e (MI300X, MI350X) and GDDR6 (RX 9070 XT) because HIP reports the controller clock (UCLK), not the data-rate clock. The estimate is about 1.7x too low on MI350X, and the error skews the GPU cost model's fusion decisions.

GetRocmMemoryBandwidth returns a per-gfx spec peak for those arches and falls back to the formula otherwise (still correct on HBM2/HBM2e).
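
The lookup-then-fallback behavior described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function names, the string-keyed lookup, and the `kBytesPerGB` unit are all hypothetical, and it assumes the bus width is given in bits and the clock in kHz. The only table entry shown is the MI300X value from the diff, read here as 5300 GB/s; other arches are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

// Assumed unit for this sketch: 1 GB/s = 1e9 bytes per second.
constexpr int64_t kBytesPerGB = 1'000'000'000;

// Legacy estimate: 2 * bus_width * mem_clock, returned in bytes per second.
// Assumes bus_width_bits is in bits and mem_clock_khz is in kHz.
int64_t LegacyFormulaBandwidth(int64_t bus_width_bits, int64_t mem_clock_khz) {
  assert(bus_width_bits > 0 && mem_clock_khz > 0);
  return 2 * (bus_width_bits / 8) * (mem_clock_khz * 1000);
}

// Hypothetical per-gfx table of spec peaks. Only the entry visible in the
// diff is listed; unknown arches return std::nullopt.
std::optional<int64_t> SpecPeakBandwidth(std::string_view gfx) {
  if (gfx == "gfx942") return 5300 * kBytesPerGB;  // MI300X, HBM3
  return std::nullopt;
}

// Prefer the spec peak when the arch is in the table; otherwise fall back
// to the legacy formula.
int64_t MemoryBandwidth(std::string_view gfx, int64_t bus_width_bits,
                        int64_t mem_clock_khz) {
  if (auto peak = SpecPeakBandwidth(gfx)) return *peak;
  return LegacyFormulaBandwidth(bus_width_bits, mem_clock_khz);
}
```

With made-up inputs, `MemoryBandwidth("gfx942", 8192, 1300000)` ignores the reported clock and returns the table value, while an arch missing from the table, such as `"gfx90a"`, goes through the formula. Keeping the table behind an optional-returning lookup makes the fallback explicit and lets new arches be added without touching the formula path.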

// clock` lands at spec peak, so those arches fall through to it. On HBM3/HBM3e
// (gfx942 MI300X, gfx950 MI350X) and GDDR6 (gfx1201) the formula falls short
// of spec peak, so an explicit per-gfx value is used instead.
if (cc.gfx9_mi300()) return 5300 * kGbps; // MI300X, HBM3

Collaborator

Why don't we just put this hardware-related info in rocm_compute_capability.h?

Member Author

I just followed the precedent set by rocm_pcie_bandwidth, which uses the same separate-translation-unit approach.

Comment thread xla/stream_executor/rocm/rocm_executor.cc Outdated
@nurmukhametov nurmukhametov requested a review from i-chaochen June 22, 2026 15:14
@nurmukhametov nurmukhametov force-pushed the anurmukh/fix-memory-bandwidth branch 4 times, most recently from a18cc9b to 10524e5 Compare June 24, 2026 11:08
@nurmukhametov nurmukhametov changed the title [XLA:GPU][ROCm] Fix device memory bandwidth with per-gfx peak table [WIP][XLA:GPU][ROCm] Fix device memory bandwidth with per-gfx peak table Jun 24, 2026
@nurmukhametov nurmukhametov force-pushed the anurmukh/fix-memory-bandwidth branch 4 times, most recently from 098b50e to f0d3897 Compare June 25, 2026 09:13
The legacy `2 * bus_width * mem_clock` formula undercounts memory
bandwidth on HBM3/HBM3e (MI300X, MI350X) and GDDR6 (RX 9070 XT) because
HIP reports the controller clock (UCLK), not the data-rate clock. The
estimate is about 1.7x too low on MI350X, and the error skews the GPU
cost model's fusion decisions.

GetRocmMemoryBandwidth returns a per-gfx spec peak for those arches and
falls back to the formula otherwise (still correct on HBM2/HBM2e).

Add test coverage for the fix:
- New MI350X (gfx950) test device: AMDMI350DeviceInfo() and the
  mi350.txtpb target config.
- priority_fusion_test.cc: PriorityFusionRocmMemoryBandwidthTest verifies
  that the modeled bandwidth tips PriorityFusion's reduce-into-consumers
  decision.
- memory_bandwidth_fusion.hlo: lit test checking the final fused HLO per
  arch (mi350 collapses to one multi-output fusion; mi200 stays split).
- triton_tdm_elemwise.hlo: add CHECK-mi350 prefixes (same non-TDM form as
  mi200).
- BUILD: add the mi350 target to the GPU lit test suite.
@nurmukhametov nurmukhametov force-pushed the anurmukh/fix-memory-bandwidth branch from f0d3897 to 20782eb Compare June 25, 2026 10:33
2 participants