[WIP][XLA:GPU][ROCm] Fix device memory bandwidth with per-gfx peak table #970
Open
nurmukhametov wants to merge 1 commit into
i-chaochen
reviewed
Jun 19, 2026
  // clock` lands at spec peak, so those arches fall through to it. On HBM3/HBM3e
  // (gfx942 MI300X, gfx950 MI350X) and GDDR6 (gfx1201) the formula falls short
  // of spec peak, so an explicit per-gfx value is used instead.
  if (cc.gfx9_mi300()) return 5300 * kGbps;  // MI300X, HBM3
Collaborator
Why don't we just put this hardware-related info in rocm_compute_capability.h?
Member
Author
I just followed the precedent set by rocm_pcie_bandwidth, which uses the same separate-translation-unit approach.
a18cc9b to
10524e5
098b50e to
f0d3897
The legacy `2 * bus_width * mem_clock` formula undercounts memory bandwidth on HBM3/HBM3e (MI300X, MI350X) and GDDR6 (RX 9070 XT) because HIP reports the controller clock (UCLK), not the data-rate clock (~1.7x low on MI350X), skewing the GPU cost model's fusion decisions. GetRocmMemoryBandwidth returns a per-gfx spec peak for those arches and falls back to the formula otherwise (still correct on HBM2/HBM2e).

Add test coverage for the fix:
- New MI350X (gfx950) test device: AMDMI350DeviceInfo() and the mi350.txtpb target config.
- priority_fusion_test.cc: PriorityFusionRocmMemoryBandwidthTest verifies that the modeled bandwidth tips PriorityFusion's reduce-into-consumers decision.
- memory_bandwidth_fusion.hlo: lit test checking the final fused HLO per arch (mi350 collapses to one multi-output fusion; mi200 stays split).
- triton_tdm_elemwise.hlo: add CHECK-mi350 prefixes (same non-TDM form as mi200).
- BUILD: add the mi350 target to the GPU lit test suite.
f0d3897 to
20782eb
The legacy `2 * bus_width * mem_clock` formula undercounts memory bandwidth on HBM3/HBM3e (MI300X, MI350X) and GDDR6 (RX 9070 XT) because HIP reports the controller clock (UCLK), not the data-rate clock (~1.7x low on MI350X), skewing the GPU cost model's fusion decisions.
GetRocmMemoryBandwidth returns a per-gfx spec peak for those arches and falls back to the formula otherwise (still correct on HBM2/HBM2e).
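For readers skimming the description, the lookup-then-fallback shape can be sketched roughly as follows. This is not the code in this PR: `kGBPerSec`, `FormulaBandwidth`, `SpecPeakBandwidth`, and `MemoryBandwidth` are hypothetical names, the gfx arch is passed as a plain string rather than a compute-capability object, and the units (bus width in bits, clock in Hz, result in bytes per second) are assumptions. Only the MI300X value of 5300 comes from the diff above; the other per-gfx entries are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical unit constant: 1 GB/s in bytes per second (assumed units).
constexpr int64_t kGBPerSec = 1000000000LL;

// Legacy estimate: 2 transfers per clock * bus width in bytes * clock in Hz.
// Assumes the bus width is given in bits and the clock in Hz.
inline int64_t FormulaBandwidth(int64_t bus_width_bits, int64_t mem_clock_hz) {
  return 2 * (bus_width_bits / 8) * mem_clock_hz;
}

// Hypothetical per-gfx table. Only the gfx942 (MI300X) value is taken from
// the diff under review; entries for other arches are intentionally omitted.
inline std::optional<int64_t> SpecPeakBandwidth(std::string_view gfx) {
  if (gfx == "gfx942") return 5300 * kGBPerSec;  // MI300X, HBM3
  return std::nullopt;
}

// Prefer the table entry when one exists, otherwise fall back to the formula.
inline int64_t MemoryBandwidth(std::string_view gfx, int64_t bus_width_bits,
                               int64_t mem_clock_hz) {
  const std::optional<int64_t> peak = SpecPeakBandwidth(gfx);
  return peak.has_value() ? *peak
                          : FormulaBandwidth(bus_width_bits, mem_clock_hz);
}
```

With this shape, an arch that has a table entry ignores the reported clock entirely, while any arch without an entry keeps its existing formula-based value, so the change is limited to the arches where the formula is known to fall short.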
Submission Checklist