feat(ck_tile_unification): Swizzle support + gfx950 mixPrecScale + misc by krithalith · Pull Request #8315 · ROCm/rocm-libraries

krithalith · 2026-06-11T15:12:27Z

Motivation

This PR adds Swizzle support to the Tile Distribution Encoding Calculator and the Mma Pipelines in the Unification framework. Swizzle is a modifier for Tile Distribution Encodings that effectively performs a permutation in the M dimension, so it affects the Tile Distribution Encodings of A and C. When combined with CTranspose, it affects the Encodings of B and C instead. In principle, for a regular gemm, the Swizzle factor does not affect the correctness of the kernel, because permuting the rows of A (the M dimension) only permutes the rows of C in the same way. However, this holds only if the same Encodings are used for loading and storing the data. For consecutive matrix multiplications, we may use Swizzle to account for the effective layout of an intermediate result, so that it can immediately be used in another matrix operation without additional shuffling. In these cases, the Swizzle factor is crucial for correctness. As far as I know, this is most likely to occur in attention kernels.
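The permutation argument above can be illustrated with a small host-side sketch. It is not CK Tile code: `Mat`, `gemm`, `load_swizzled`, `store_swizzled`, and the permutation `kPerm` are hypothetical names chosen for illustration, and an arbitrary row permutation stands in for a real Swizzle layout.

```cpp
#include <cstddef>

// Minimal fixed-size matrix type (hypothetical, for illustration only).
template <std::size_t R, std::size_t C>
struct Mat
{
    int v[R][C];
};

constexpr std::size_t M = 4, N = 2, K = 3;

// Hypothetical row permutation standing in for a swizzled M layout.
constexpr std::size_t kPerm[M] = {0, 2, 1, 3};

// Reference gemm: C = A * B.
constexpr Mat<M, N> gemm(const Mat<M, K>& a, const Mat<K, N>& b)
{
    Mat<M, N> c{};
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            for(std::size_t k = 0; k < K; ++k)
                c.v[m][n] += a.v[m][k] * b.v[k][n];
    return c;
}

// "Load" A through the permuted layout: row m of the result is row kPerm[m] of a.
constexpr Mat<M, K> load_swizzled(const Mat<M, K>& a)
{
    Mat<M, K> out{};
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t k = 0; k < K; ++k)
            out.v[m][k] = a.v[kPerm[m]][k];
    return out;
}

// "Store" C through the same layout: row m of the input goes back to row kPerm[m].
constexpr Mat<M, N> store_swizzled(const Mat<M, N>& c)
{
    Mat<M, N> out{};
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            out.v[kPerm[m]][n] = c.v[m][n];
    return out;
}

// Element-wise comparison of two C matrices.
constexpr bool same(const Mat<M, N>& x, const Mat<M, N>& y)
{
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            if(x.v[m][n] != y.v[m][n])
                return false;
    return true;
}

constexpr Mat<M, K> A = {{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}}};
constexpr Mat<K, N> B = {{{1, 0}, {0, 1}, {2, 3}}};

// Same layout on load and store: the permutation cancels out.
static_assert(same(store_swizzled(gemm(load_swizzled(A), B)), gemm(A, B)),
              "consistent swizzle must not change the result");
// Reading the intermediate result with the wrong layout gives misplaced rows.
static_assert(!same(gemm(load_swizzled(A), B), gemm(A, B)),
              "mismatched layouts must be visible");
```

The second assertion corresponds to the chained-gemm case: a consumer of the intermediate result has to know which layout it was produced in.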

Changes

I adapted the Tile Distribution Encoding Calculator to accept any Swizzle modifier and to use it to modify the layouts just like in CK Tile. Note that Swizzle is only compatible with certain intrinsics, due to the restriction that the Swizzle factor must divide kCMNumAccess. This is possible for 32x32 MFMA instructions with SFactor 2 or 4, and for gfx11 WMMA instructions with SFactor 2, 4, or 8, although the WMMA case is not used in CK Tile.
I adapted the layout test to check the correctness of the layout with Swizzle modification, for all possible Swizzle factors for each intrinsic.
I adapted the Unification Dispatcher to take a Swizzle factor and pass it on to the MmaPipelines. Note that the original dispatcher takes a boolean instead, which I convert to an SFactor of 2 when true. I believe this is correct, since in all cases where CK Tile previously used the old dispatcher, an SFactor of 2 ended up being used. However, there are two named WarpGemms (WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution and WarpGemmMfmaI8I8I32M32N32K32SwizzleBTransposedCDistribution) which can support any Swizzle factor and are actually used with Swizzle factors up to 4. These were not used in the old dispatcher but were instead always used directly in CK Tile pipelines.
I added custom named WarpGemms, used when the Unification flag is ON, for the named WarpGemms with Swizzle that are used directly in CK Tile pipelines. There are only two of them, and they are the ones mentioned in the previous point.
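As a rough sketch of the divisibility restriction and the bool-to-factor conversion described above: the helper names below are hypothetical, and the two `kCMNumAccess` values are assumptions inferred from the compatible factors listed in this description, not values read from the code. Treating a factor of 1 as "no swizzle" for a `false` flag is also an assumption.

```cpp
#include <cstdint>

using index_t = std::int32_t; // assumption: stand-in for ck_tile::index_t

// Hypothetical helper: a Swizzle factor is usable when it evenly divides the
// number of C accesses along M (kCMNumAccess). A factor of 1 means no swizzle.
constexpr bool is_valid_swizzle_factor(index_t c_m_num_access, index_t s_factor)
{
    return s_factor >= 1 && c_m_num_access % s_factor == 0;
}

// The old dispatcher took a bool; `true` maps to an SFactor of 2.
// Mapping `false` to 1 is an assumption made for this sketch.
constexpr index_t swizzle_factor_from_flag(bool use_swizzle)
{
    return use_swizzle ? 2 : 1;
}

// Assumed values, chosen so that the valid factors match the ones listed
// above (2 and 4 for 32x32 MFMA; 2, 4 and 8 for gfx11 WMMA).
constexpr index_t kMfma32x32CMNumAccess = 4;
constexpr index_t kWmmaGfx11CMNumAccess = 8;

static_assert(is_valid_swizzle_factor(kMfma32x32CMNumAccess, 2), "MFMA 32x32, SFactor 2");
static_assert(is_valid_swizzle_factor(kMfma32x32CMNumAccess, 4), "MFMA 32x32, SFactor 4");
static_assert(!is_valid_swizzle_factor(kMfma32x32CMNumAccess, 8), "8 does not divide 4");
static_assert(is_valid_swizzle_factor(kWmmaGfx11CMNumAccess, 8), "gfx11 WMMA, SFactor 8");
static_assert(swizzle_factor_from_flag(true) == 2, "legacy bool maps to 2");
```

A scalar factor is needed on the dispatcher side because the two named WarpGemms are used with factors up to 4, which a boolean cannot express.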

Changes part 2

While trying to get a swizzle example to work, I ended up having to add a number of other changes that would normally have been separate issues. These are:

Adding all mixed precision gfx950 scale intrinsics (50 in total)
Adding these intrinsics to the layout test
Tile distribution encoding tweak: Allow for simplified C layouts in blockless cases
MmaPipelines tweaks: Make pretty much all old-style layout params available

Note on AttrNumAccess

For the scale gfx950 intrinsics, the "canonical" layouts for A and B have NumAccess 1 or 2, depending on the A and B types. The 8-bit types have a canonical NumAccess of 2, and the others 1. So overall we may have (1, 1), (2, 1), (1, 2), or (2, 2). This is reflected in the intrinsic definitions. However, for the fully 8-bit intrinsics I still define them with (1, 1). The reason for this is that it is in principle possible to use these intrinsics with (1, 1) as long as you don't use scale. This may actually happen in CK Tile. Furthermore, there are some pipelines that instantiate a WarpGemm with (1, 1) just to peek at some parameters. Note that the (1, 2) and (2, 1) cases MUST have these NumAccess values or the base MMA does not work (regardless of scale). This is because you can't just permute K for A without doing the same for B and vice versa.
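The rule in this note can be written down as a small sketch. The names are hypothetical and the operand type is reduced to its bit width, which is an assumption made only to keep the example short; the real intrinsic definitions carry these values as attributes.

```cpp
// Pair of NumAccess values for the A and B operands (hypothetical type).
struct NumAccessPair
{
    int a;
    int b;
};

// Canonical NumAccess per operand: 2 for 8-bit types, 1 for the others.
constexpr int canonical_num_access(int element_bits)
{
    return element_bits == 8 ? 2 : 1;
}

constexpr NumAccessPair canonical_pair(int a_bits, int b_bits)
{
    return {canonical_num_access(a_bits), canonical_num_access(b_bits)};
}

// Values as defined in this PR: fully 8-bit intrinsics are still declared
// with (1, 1); the mixed cases keep their canonical (2, 1) or (1, 2).
constexpr NumAccessPair defined_pair(int a_bits, int b_bits)
{
    return (a_bits == 8 && b_bits == 8) ? NumAccessPair{1, 1}
                                        : canonical_pair(a_bits, b_bits);
}

static_assert(canonical_pair(8, 8).a == 2 && canonical_pair(8, 8).b == 2, "canonical 8-bit x 8-bit");
static_assert(defined_pair(8, 8).a == 1 && defined_pair(8, 8).b == 1, "declared with (1, 1)");
static_assert(defined_pair(8, 4).a == 2 && defined_pair(8, 4).b == 1, "mixed case keeps (2, 1)");
static_assert(defined_pair(4, 8).a == 1 && defined_pair(4, 8).b == 2, "mixed case keeps (1, 2)");
```

The mixed cases cannot be relaxed in the same way because, as noted above, a K permutation applied to A must also be applied to B.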

Tests

Layout tests with swizzle work. tile_example_fmha_fwd and tile_example_fmha_bwd now compile and run, with correct verification for default settings. With fp8bf16 and init=3, about 5% of the results are wrong on both this branch and develop. This case is definitely sensitive to swizzle, because without swizzle 50% of the results are wrong. I am still looking for better tests, but any remaining issues should surface in our overall unification coverage checking scripts.

JP-Fernando

This pull request adds swizzle support to the MMA-unified model. The author is looking for test cases for validation. So far my only concern is the computation of kABKLane in the three pipelines, exposed via WarpGemmAttribute::Impl.

chris-tsiaousis-hpc

LGTM, added some comments you might want to address!

chris-tsiaousis-hpc · 2026-06-24T07:24:47Z

+            static constexpr index_t kM = MmaOp::kM;     // Tentative
+
+            // Seems to be entire M size excluding blocks. Dubious for gfx1250, needs attention.
+            static constexpr index_t kAMLane =


Can we tighten this before it lands? kAMLane now feeds the old-style WarpGemmAttribute::Impl compatibility surface, so a hardcoded gfx1250 exception with Dubious in the comment feels a bit shaky. If this is only preserving existing behavior, maybe link it to the exact WMMA/gfx1250 follow-up?

krithalith · 2026-06-30

I defined the value like this because I think it will match the value for all the existing WarpGemms. The gfx1250 value is dubious because, although it matches what is currently used in CK Tile, it might not actually make sense for all gfx1250 intrinsics. I will add a note to whatever gfx1250 issue(s) we have to look into this and the other old-style layout params.

chris-tsiaousis-hpc · 2026-06-24T07:25:34Z

 * @tparam MmaOp Intrinsic (amdgcn_mma) to be tested
 */
-template <typename MmaOp> // TODO: C++20 concept for MmaOp
+template <typename MmaOp, bool CTranspose, int SFactor> // TODO: C++20 concept for MmaOp


NIT: SFactor is int here, but run_mma_layout_test_single and TileDistrEncCalc use index_t. Could we keep it index_t all the way through? It avoids a weird NTTP mismatch if index_t ever differs from int.
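One way to read this NIT, sketched with hypothetical stand-in types rather than the real test and calculator: when every layer declares the factor as `index_t`, it is forwarded with no conversion between non-type template parameter types. As far as I understand the deduction rules, a helper that deduces the factor from a template-id needs its parameter type to match the one declared by that template, so a helper written with `int` could stop deducing if `index_t` ever became a different type.

```cpp
#include <cstdint>
#include <type_traits>

using index_t = std::int32_t; // assumption: stand-in for ck_tile::index_t

// Hypothetical stand-in for a calculator templated on the Swizzle factor.
template <index_t SFactor>
struct EncCalcSketch
{
    static constexpr index_t kSFactor = SFactor;
};

// Hypothetical stand-in for the test layer. It declares SFactor with the
// same type, so the value is forwarded without an implicit conversion.
template <bool CTranspose, index_t SFactor>
struct LayoutTestSketch
{
    using Calc = EncCalcSketch<SFactor>;
};

// Deduces the factor from the calculator type; the parameter type here
// matches the one declared by EncCalcSketch.
template <index_t S>
constexpr index_t swizzle_factor_of(EncCalcSketch<S>)
{
    return S;
}

static_assert(std::is_same<decltype(LayoutTestSketch<false, 4>::Calc::kSFactor),
                           const index_t>::value,
              "factor keeps the index_t type end to end");
static_assert(swizzle_factor_of(LayoutTestSketch<false, 2>::Calc{}) == 2,
              "factor is deduced back unchanged");
```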

chris-tsiaousis-hpc · 2026-06-24T07:27:11Z

-        EXPECT_EQ(h_errors[case_idx], 0u) << "Mismatch for m=" << m << " k=" << k << " n=" << n;
-    }
+    // Try all possible Swizzle factors. Incompatible intrinsics are skipped.
+    // CTranspose does not work with current test kernel.


The new WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution and i8 alias instantiate CTranspose=true with SFactor > 1, and those are the FMHA/SageAttention users. Can we add a CTranspose-aware check here, or at least a small compile-time test for that exact encoding?

krithalith · 2026-06-30

The layout test is simply not compatible with CTranspose at the moment. Luckily, CTranspose is a pretty trivial modifier as far as the Tile Distribution Enc Calc is concerned. I will create an issue about CTranspose testing in the layout test. For now, we already know CTranspose works in real pipelines because of the original tests and examples.

therock-pr-bot · 2026-06-29T10:53:17Z

❌ PR Check — Action Required

Check	Status	Details
🌿 Branch Name	✅ Pass	—
📝 PR Title/Description	❌ Fail	Error: Title does not follow Conventional Commits style. Expected: start with a valid type (feat, fix, docs, …). Desired format: `type(optional-scope): short description`
⛔ Forbidden Files	✅ Pass	—
🧪 Unit Test	✅ Pass	—
🔎 pre-commit	✅ Pass	—
🚫 Draft PR	🔜 To Be Enabled	—
🚩 Feature Flag	🔜 To Be Enabled	—
📊 Code Coverage	🔜 To Be Enabled	—

⚠️ 1 policy check(s) failed. Please address the issues above before this PR can be Reviewed.

🚫 Please fix the failed policies

❌ PR Title/Description

The Not ready to Review label was added to this PR. Once all policies pass, the label is removed automatically.

📖 Need help? See the Policy FAQ for details on every check and how to fix failures.

therock-pr-bot · 2026-06-29T10:53:19Z

🚫 Please fix the failed policies before requesting reviews.

The following policy checks failed:

❌ PR Title/Description

The Not ready to Review label has been added to this PR.
Once all policies pass, the label will be removed automatically.

…tor, with minimal restrictions. Seems to work within the minimal restrictions for gfx908 in layout test for Swizzle 2 and 4. Activated only those intrinsics I expect to pass the layout test on other platforms. Needs more testing. Not all "working" layout configurations may actually make sense.

…the Unification Dispatcher take a scalar SwizzleFactor instead of a bool because we may need a SwizzleFactor of 4 for some named WarpGemms. Added unification version of the two named WarpGemms with swizzle that are actually used in higher level code.

github-actions Bot added the project: composablekernel label Jun 11, 2026

assistant-librarian Bot added the organization: ROCm label Jun 11, 2026

krithalith force-pushed the users/krithalith/ck/unification_flag_removal branch 5 times, most recently from 5138319 to fd99e01 Compare June 18, 2026 08:30

krithalith force-pushed the users/krithalith/ck/unification_swizzle branch from 4479f7f to 337070c Compare June 18, 2026 09:00

krithalith added the organization: streamhpc contributors from streamhpc label Jun 18, 2026

krithalith requested review from JP-Fernando and chris-tsiaousis-hpc June 19, 2026 11:41

krithalith force-pushed the users/krithalith/ck/unification_flag_removal branch from fd99e01 to bf4ab7d Compare June 22, 2026 10:20

krithalith force-pushed the users/krithalith/ck/unification_swizzle branch from 6171b73 to 8dd984e Compare June 22, 2026 10:22

JP-Fernando requested changes Jun 23, 2026

chris-tsiaousis-hpc approved these changes Jun 24, 2026

krithalith force-pushed the users/krithalith/ck/unification_flag_removal branch from bf4ab7d to a2816c8 Compare June 24, 2026 13:43

krithalith force-pushed the users/krithalith/ck/unification_swizzle branch 2 times, most recently from f8be90f to eeac4d5 Compare June 25, 2026 13:24

Base automatically changed from users/krithalith/ck/unification_flag_removal to develop June 26, 2026 12:00

krithalith force-pushed the users/krithalith/ck/unification_swizzle branch 2 times, most recently from 51ef57e to b362555 Compare June 29, 2026 10:49

therock-pr-bot Bot added the Not ready to Review label Jun 29, 2026

krithalith changed the title ~~Implement Swizzle in Mma Unification framework~~ [CK TILE] Unification Work – Swizzle support + gfx950 mixed prec scale + misc compile fixes Jun 29, 2026

krithalith force-pushed the users/krithalith/ck/unification_swizzle branch 4 times, most recently from 3f6ccf4 to d933449 Compare June 29, 2026 15:51

krithalith added 16 commits June 30, 2026 07:40

Test all possible swizzle factors in layout test, as long as they are…

8bc13d1

… compatible with the intrinsic.

Cleanup

c945295

Add some more Impl params that are apparently used by higher level code.

0a2586d

Fixup comments

82898c3

Add kAMLane to old-style layouts params for new MmaPipelines

c38e50f

Add kABKPerLane to old-style layout params for new MmaPipelines

b03f249

Add some more old-style layout params and allow fp6 microscaling pipe…

f75f502

…lines.

Dubious but go back to allowing AttrNumAccess 1 for scale intrinsics …

08177c9

…for now.

Add some mixed precision gfx9 scale intrinsics (f8bf8 bf8f8)

25b558d

Fix up old style layout params. Added kBNLane. Fixed definitions of k…

af8342c

…AMLane and kABKLane. Re-ordered to match original order.

Revert test distributions from scale mma pipeline

ea9bd25

Improve CTranspose implementation, add conditional simplified C distr…

d1a809a

…ibution to the enc calc for blockless intrinsics to placate (fragile) higher level code, and add all mixed precision gfx950 intrinsics!

Don't get too clever with the gfx950 macros

dd7878e

Add all mixed precision scale gfx950 intrinsics to layout test.

292bf9c

Tentative fix for mixed precision scale layout tests. Needs better fix.

1263ea4

krithalith force-pushed the users/krithalith/ck/unification_swizzle branch from d933449 to 1263ea4 Compare June 30, 2026 07:55

Cleaner approach to dealing with AttrNumAccess for mixed precision gf…

f35f779

…x950 scale intrinsics. We add a canonical (1,2) or (2,1) AttrNumAccess to these intrinsics and make the tile distr enc calc use these as minima again.

krithalith force-pushed the users/krithalith/ck/unification_swizzle branch from f131075 to f35f779 Compare June 30, 2026 11:12

krithalith changed the title ~~[CK TILE] Unification Work – Swizzle support + gfx950 mixed prec scale + misc compile fixes~~ feat(ck_tile_unification): Swizzle support + gfx950 mixPrecScale + misc Jun 30, 2026

krithalith mentioned this pull request Jun 30, 2026

[Feature]: [CK Tile] [MFMA/WMMA unification] Swizzle support + gfx950 mixPrecScale + misc #8960

Open

krithalith linked an issue Jun 30, 2026 that may be closed by this pull request

[Feature]: [CK Tile] [MFMA/WMMA unification] Swizzle support + gfx950 mixPrecScale + misc #8960

Open

therock-pr-bot Bot removed the Not ready to Review label Jun 30, 2026

JP-Fernando approved these changes Jun 30, 2026

Address review comments + small fixups

531b429

feat(ck_tile_unification): Swizzle support + gfx950 mixPrecScale + misc #8315
krithalith wants to merge 19 commits into develop from users/krithalith/ck/unification_swizzle

3 participants
