gfx1250 moe by XingerZhu · Pull Request #336 · ROCm/FlyDSL

XingerZhu · 2026-04-02T04:11:50Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

- Fix Python version compatibility in meta.py: add support for Python < 3.11 by checking for positions attribute availability - Replace hardcoded MLIR library paths in executor.py with environment variable MLIR_PATH, with clear error message when not set - Update LLVM commit hash and enable ROCM runner in build script

* [FLYDSL]:add copy_atom right_inverse * [FLYDSL]: right_inverse dynamic process bugfix * [FLYDSL]:Python refactoring and adaptation * [FLYDSL]:rm example 05

* Migrate Python bindings to PyConcreteType<> and fix TypeID ODR violation - FlyExtension.cpp / FlyROCDLExtension.cpp: migrate from legacy mlir_type_subclass() to PyConcreteType<> CRTP pattern (required by new MLIR Python binding API). Types are defined inside namespace mlir::python::MLIR_BINDINGS_PYTHON_DOMAIN::fly, using ::mlir:: global qualifiers to avoid the mlir::python::mlir namespace collision when NB_DOMAIN=mlir. - CMakeLists.txt: remove MLIRFlyDialect / MLIRFlyROCDLDialect from _fly.so / _fly_rocdl.so PRIVATE_LINK_LIBS. These static archives were being linked into both the extension modules AND FlyPythonCAPI.so (via EMBED_CAPI_LINK_LIBS → MLIRCPIFly), creating duplicate TypeID static variables. The dialect registered under FlyPythonCAPI.so's TypeIDs but _fly.so looked up types with its own copy, causing "storage uniquer isn't initialized" at runtime. Now all symbols are resolved from FlyPythonCAPI.so. - FlyToROCDL.cpp: use string-based type matching for MmaAtomCDNA3_MFMA to work around the same TypeID ODR issue in the conversion pass, and fix ROCDL MFMA intrinsic call to use I32Attr attributes instead of Value operands for cbsz/abid/blgp control parameters. * Fix pass registry ODR violation: register Fly passes via CAPI - PRIVATE_LINK_LIBS MLIRFlyToROCDL in _mlirRegisterEverything pulled in a local copy of MLIRPass, causing registerFlyPasses() to register into a LOCAL pass registry inside _mlirRegisterEverything.so while PassManager.parse() queried the GLOBAL registry in FlyPythonCAPI.so. - Fix by introducing CAPI functions (mlirRegisterFlyPasses, mlirRegisterFlyToROCDLConversionPass) in the CAPI libraries so pass registration happens inside FlyPythonCAPI.so's single global registry. - update cmake/llvm-hash.txt to keep same with triton llvm hash. * Sync build_llvm.sh with pre_bumpupllvm and add ROCM runner Align script with pre_bumpupllvm branch: full clone, buildmlir dir, NVPTX target, NB_DOMAIN=mlir, package install by default. Keep reading LLVM commit from cmake/llvm-hash.txt. Add MLIR_ENABLE_ROCM_RUNNER=ON for GPU kernel execution support. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

- Rename C++ binding structs with Py prefix (e.g. IntTupleType -> PyIntTupleType) for consistency - Add __all__ exports to typing, primitive, and gpu modules - Add Int4 numeric type - Fix frameInfo.positions compatibility for older Python versions - Fix dialect import order to ensure _Dialect is properly exported - Add fly_rocdl ops/enum gen copy rules in CMake - Improve build_llvm.sh with configurable parallel jobs and --no-install flag - Clean up redundant comments and formatting Co-authored-by: Cursor <cursoragent@cursor.com>

gemm test ready

* [FLYDSL]: add recast_layout op * [FLYDSL]: refactor * [FLYDSL]: add detail namespace * [FLYDSL]: add upcast assert * [FLYDSL]: rm bits number * [FLYDSL]: rm redundant code * [FLYDSL]: bits number only support static value * [FLYDSL]: change APIntAttr to I32Attr * [FLYDSL]: rm notes

* fix run error * port all gemm from main * fuix cudagraph hack * add int4 version * change flymemref convert * test ok * add build script * fix graph2 * add files * fix flops * fix path * fix local test * fix * clean * update readme

* add compile only and dumpir

- Add fly-opt tool (tools/fly-opt/) for MLIR IR transformations, registering Fly/FlyROCDL dialects and all custom passes - Add lit.cfg.py with fly-opt/FileCheck configuration - Test using 'lit -v tests/' to test basic lowering tests - Add LayoutAlgebra tests: construction, size/cosize, coordinate, composition, product, divide, int_tuple operations - Add Transforms tests: canonicalize, layout_lowering - Add Conversion tests for convert-fly-to-rocdl pass, split by category: type_conversion, memref_alloca, memref_ops, pointer_ops, mma_atom, gpu_ops

…gration - Enable LLVM_BUILD_TOOLS so fly-opt is built with the default ninja target - Add MLIR lit test section to scripts/run_tests.sh - Update test/lit.cfg.py to use FLY_BUILD_DIR env var (default: build-fly)

Introduce the new gfx1250 MoE kernel implementation and its dedicated test harness so mxfp4 bring-up can validate the path in-tree and track regressions on the new architecture. Made-with: Cursor

Port the preshuffled FP8/FP4 WMMA_SCALE flow into the gfx1250 MoE stage1/stage2 kernels and extend the MoE harness so A8W4 can be validated through standalone and end-to-end bring-up paths. Made-with: Cursor

- fp16 kernel: replace TDM copy_b_to_lds with software transpose to correctly produce [K, N+pad] LDS layout for lds_transpose_load; add max_total_warps=8 constraint to prevent VGPR overuse on M/L sizes - fp8/fp4 tests: use per-1x32 block-scale quantization instead of per-token quant to match gfx1250 WMMA_SCALE requirements - fp4 tests: generate quantized data from controlled FP32 inputs via per_1x32_f4_quant to avoid numerical overflow - fp16 tests: skip shuffle_weight for native gfx1250 kernel path - compiler: add compile progress logging and include opt_level in cache key; update native lib glob patterns Made-with: Cursor

- Fix pack_as_to_lds A-scale LDS index collision when m_warp > 1: use warp-local wm_idx instead of global row/WMMA_M to avoid different warps writing to the same LDS positions. - Replace TDM-based B-scale loading with software copy in fp8 stage1/stage2 kernels to fix incorrect stride when wmma_n_rep > 1. - Add max_total_warps=8 for fp8/a8w4 kernel shape selection to prevent VGPR overflow on larger tile configurations. - Skip out_f32 tests for gfx1250 native-dtype kernels (fp4/fp8/a8w4/fp16). - Use torch.zeros instead of torch.empty for stage1 output buffer. Made-with: Cursor

…ersion bf16 inputs were falling through to the generic MFMA compilation path, causing SIGABRT on gfx1250 (which only supports WMMA). Route bf16 through the existing fp16 WMMA kernel with a Python-level wrapper that converts bf16 tensors to fp16 before launch. Also fix weight shuffle mismatch in tests where bf16 was incorrectly using MFMA-style shuffle_weight. Made-with: Cursor

Sync the branch with latest main changes and resolve merge conflicts while preserving gfx1250 MOE GEMM updates. Made-with: Cursor

Copilot

Pull request overview

Adds gfx1250-specific MoE 2-stage GEMM kernels and plumbing for an “expert scheduling mode” compile hint, along with new gfx1250-focused kernel tests and a unit test intended to validate ISA emission.

Changes:

Introduce kernels/moe_gemm_2stage_gfx1250.py with gfx1250 MoE stage1/stage2 kernel compilation wrappers and single-kernel paths.
Add a new compile-hint (expert_scheduling_mode) that injects an AMDGPU passthrough attribute on emitted gpu.func.
Add gfx1250 MoE test harness + a unit test for expert scheduling ISA, and tweak existing gfx1250 GEMM scale preshuffle API.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`kernels/moe_gemm_2stage_gfx1250.py`	New gfx1250 MoE 2-stage kernel implementations/wrappers (missing SPDX header; contains redundant duplicate assignments).
`python/flydsl/compiler/kernel_function.py`	Adds `expert_scheduling_mode` handling by writing a `passthrough` attribute on `gpu.func` (currently overwrites any existing passthrough).
`tests/unit/test_jit_compile_hints.py`	New unit test intended to validate expert scheduling ISA emission (currently calls `flyc.compile()` with no args, so compilation won’t run).
`tests/kernels/test_moe_gemm_gfx1250.py`	Large gfx1250 MoE correctness/perf harness + pytest cases (contains an accidental `parametrize` decorator on a helper).
`python/flydsl/compiler/backends/rocm.py`	Updates toolchain fingerprint globs for native libs (likely drops hashing of `_mlirDialectsFly*.so` per current CMake module names).
`python/flydsl/__init__.py`	Adds unused imports (`ctypes`, `os`).
`tests/kernels/test_gemm_fp8fp4_gfx1250.py`	Extends `preshuffle_e8m0_scale` with optional `byte_swap`.
`kernels/gemm_fp8fp4_gfx1250.py`	Adds `LDS_PAD_B_BYTES = 0` constant.

Comments suppressed due to low confidence (1)

python/flydsl/compiler/backends/rocm.py:103

native_lib_patterns() no longer includes the _mlirDialectsFly*.so / _mlirDialectsFlyROCDL*.so extension module filenames, but the build configuration still declares those MODULE_NAMEs (see python/mlir_flydsl/CMakeLists.txt). If the output .so files are still named _mlirDialectsFly*.so, the toolchain fingerprint _flydsl_key will stop hashing them, risking stale disk-cache hits after native code changes. Consider keeping the old globs (or including both old and new patterns) unless the extension module filenames have been renamed accordingly.

    def native_lib_patterns(self) -> List[str]:
        return [
            "_fly*.so",
            "_fly_rocdl*.so",
            "libFly*.so",
            "libfly_jit_runtime.so",
            "libmlir_rocm_runtime.so",
            "_mlirRegisterEverything*.so",
        ]

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tests/unit/test_jit_compile_hints.py

Copilot · 2026-04-02T04:20:48Z

python/flydsl/compiler/kernel_function.py

+                    gpu_func.attributes["passthrough"] = ir.ArrayAttr.get([
+                        ir.ArrayAttr.get([
+                            ir.StringAttr.get("amdgpu-expert-scheduling-mode"),
+                            ir.StringAttr.get("true")
+                        ])
+                    ])


gpu_func.attributes["passthrough"] = ... unconditionally overwrites any existing passthrough entries on the same gpu.func. Other code paths (e.g. kernels that set fast-math passthrough attributes) also populate this attribute, so this can silently drop previously-added passthrough settings when both are used together. Consider appending to the existing passthrough ArrayAttr (if present) instead of replacing it.

Suggested change

gpu_func.attributes["passthrough"] = ir.ArrayAttr.get([

ir.ArrayAttr.get([

ir.StringAttr.get("amdgpu-expert-scheduling-mode"),

ir.StringAttr.get("true")

])

])

# Preserve any existing passthrough attributes by appending

existing_passthrough = gpu_func.attributes.get("passthrough")

new_entry = ir.ArrayAttr.get([

ir.StringAttr.get("amdgpu-expert-scheduling-mode"),

ir.StringAttr.get("true"),

])

if isinstance(existing_passthrough, ir.ArrayAttr):

elements = list(existing_passthrough)

elements.append(new_entry)

gpu_func.attributes["passthrough"] = ir.ArrayAttr.get(elements)

else:

gpu_func.attributes["passthrough"] = ir.ArrayAttr.get([new_entry])

python/flydsl/__init__.py

kernels/moe_gemm_2stage_gfx1250.py

tests/kernels/test_moe_gemm_gfx1250.py

kernels/gemm_fp8fp4_gfx1250.py

aoli26 · 2026-04-02T04:18:05Z

python/flydsl/compiler/kernel_function.py

+                if CompilationContext.get_compile_hints().get("expert_scheduling_mode"):
+                    gpu_func.attributes["passthrough"] = ir.ArrayAttr.get([
+                        ir.ArrayAttr.get([
+                            ir.StringAttr.get("amdgpu-expert-scheduling-mode"),
+                            ir.StringAttr.get("true")
+                        ])
+                    ])


pls remove this block because main already has more comprehensive support

python/flydsl/__init__.py

aoli26 · 2026-04-02T04:22:30Z

tests/kernels/test_gemm_fp8fp4_gfx1250.py

+                          WMMA_DIM: int = 16,
+                          byte_swap: bool = False) -> torch.Tensor:
+    """Preshuffle E8M0 scale: optional byte swap + interleave for ds_load_b128.
+
+    Args:
+        byte_swap: True for FP4 (reorder [0,2,1,3]), False for FP8 (identity).
+    """
    _, K_scale = scale.shape
    assert K_scale % 4 == 0, f"K_scale must be divisible by 4, got {K_scale}"
+
+    if byte_swap:
+        grouped = scale.view(-1, K_scale // 4, 4)
+        scale = grouped[:, :, [0, 2, 1, 3]].contiguous().view(-1, K_scale)
+


byte_swap is deprecated; align directly with main here.

aoli26 · 2026-04-02T04:24:33Z

tests/unit/test_jit_compile_hints.py

no longer needed now, it's fine to remove this file.

Reapply the edits from the locally amended merge commit that was dropped during branch rebase alignment. Made-with: Cursor

coderfeli · 2026-04-02T10:28:37Z

python/flydsl/compiler/backends/rocm.py

        return [
-            "_mlirDialectsFly*.so",
+            "_fly*.so",
+            "_fly_rocdl*.so",


why change this?

coderfeli · 2026-04-02T10:30:12Z

kernels/moe_gemm_2stage_gfx1250.py

+            den = 1.0 + emu
+            sig = rocdl.rcp(T.f32, den)
+            return x * sig
+


kernel style much different from gemm 1250?

…ough Align ROCm backend native library pattern matching with current Fly dialect shared object names and avoid setting expert scheduling passthrough twice in kernel function lowering. Made-with: Cursor

coderfeli · 2026-04-10T04:48:43Z

Is it old codes? shall we close here? Try to reuse more codes. Currently 3000 lines of test and 3000 kernels @XingerZhu

XingerZhu · 2026-04-10T05:29:09Z

It is old codes, just close it

sjfeng1999 and others added 30 commits March 3, 2026 08:49

sync pre_v0.1

b2b94a2

update header macro

dd8c6fe

add separate target-specific rocdl dialect

eafa2d6

Add utility nbmodules

3bd4150

Add universalMma Atom

51e45a5

fix example02

866e2ed

Add DLTensorAdaptor for torch Tensor support

8b29d66

Add logger and EnvManager

ce61b2b

Refact Python module

bba5422

Fix missing module

6ca89f6

Add right inverse

8c784fb

* [FLYDSL]:add copy_atom right_inverse * [FLYDSL]: right_inverse dynamic process bugfix * [FLYDSL]:Python refactoring and adaptation * [FLYDSL]:rm example 05

Add numeric typing

65f6a31

Add ASTRewriter and improve jit_function cache mechanism

a3e25cb

unwrap dsl_type before calling ir Op

9b56be5

fix missing expore in primitive

26d45a9

Add tiled_copy partition

a250f90

Pre v0.1 gemm (#145)

8085ea0

gemm test ready

Pre v0.1 gemm fix (#153)

2679f23

* fix run error * port all gemm from main * fuix cudagraph hack * add int4 version * change flymemref convert * test ok * add build script * fix graph2 * add files * fix flops * fix path * fix local test * fix * clean * update readme

add compile only and dumpir (#154)

890c860

* add compile only and dumpir

add version and wheel build

6cee3f1

port docs

98fad64

build whl and dist version ok, upload pypi ok

04e7c46

add aot example

0a8f4fe

Apply clang-format to fly-opt.cpp

1570126

[Tests][Lit] Add lit tests to run_tests.sh and fix fly-opt build inte…

45e31d1

…gration - Enable LLVM_BUILD_TOOLS so fly-opt is built with the default ninja target - Add MLIR lit test section to scripts/run_tests.sh - Update test/lit.cfg.py to use FLY_BUILD_DIR env var (default: build-fly)

aoli26 and others added 17 commits March 27, 2026 07:44

add sched mode2 compile hints

16b97d5

refactor DISABLE_VALU_STALL & fp8 use scale opsel

0df32b7

optimize stream a wmma interleave

e51e53b

add b preshuffle optimization

c4ecc10

remove non-preshuffle branch

ce78428

Merge branch 'main' into pr/v0.1_gfx12

cc8322e

add expert scheduling compile hint

d110fc5

fix result position issue

3149567

Merge branch 'pr/v0.1_gfx12' into opt/mxfp4_gfx1250

37ab3b7

unified gfx1250 mxfp4/mxfp8 gemm

007bad8

support a8w4 and fix register layout

fcbe0a0

Add gfx1250 MoE kernel and test coverage.

6a28022

Introduce the new gfx1250 MoE kernel implementation and its dedicated test harness so mxfp4 bring-up can validate the path in-tree and track regressions on the new architecture. Made-with: Cursor

Add gfx1250 A8W4 MoE kernel path and tests.

3d2b4c8

Port the preshuffled FP8/FP4 WMMA_SCALE flow into the gfx1250 MoE stage1/stage2 kernels and extend the MoE harness so A8W4 can be validated through standalone and end-to-end bring-up paths. Made-with: Cursor

merge main into mxfp4_gfx1250_moe

7894c8a

Sync the branch with latest main changes and resolve merge conflicts while preserving gfx1250 MOE GEMM updates. Made-with: Cursor

XingerZhu requested review from coderfeli and Copilot April 2, 2026 04:11

Copilot started reviewing on behalf of XingerZhu April 2, 2026 04:13 View session

Copilot AI reviewed Apr 2, 2026

View reviewed changes

aoli26 reviewed Apr 2, 2026

View reviewed changes

XingerZhu added 2 commits April 2, 2026 07:17

restore changes from lost amend commit

2954399

Reapply the edits from the locally amended merge commit that was dropped during branch rebase alignment. Made-with: Cursor

restore changes from lost amend commit

927d4d4

Reapply the edits from the locally amended merge commit that was dropped during branch rebase alignment. Made-with: Cursor

coderfeli reviewed Apr 2, 2026

View reviewed changes

fix rocm native lib discovery and remove duplicate scheduling passthr…

e76e34b

…ough Align ROCm backend native library pattern matching with current Fly dialect shared object names and avoid setting expert scheduling passthrough twice in kernel function lowering. Made-with: Cursor

coderfeli closed this Apr 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gfx1250 moe#336

gfx1250 moe#336
XingerZhu wants to merge 130 commits intomainfrom
mxfp4_gfx1250_moe

XingerZhu commented Apr 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aoli26 Apr 2, 2026

Uh oh!

Uh oh!

aoli26 Apr 2, 2026

Uh oh!

aoli26 Apr 2, 2026

Uh oh!

coderfeli Apr 2, 2026

Uh oh!

coderfeli Apr 2, 2026

Uh oh!

coderfeli commented Apr 10, 2026 •

edited

Loading

Uh oh!

XingerZhu commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

-                    gpu_func.attributes["passthrough"] = ir.ArrayAttr.get([
-                        ir.ArrayAttr.get([
-                            ir.StringAttr.get("amdgpu-expert-scheduling-mode"),
-                            ir.StringAttr.get("true")
-                        ])
-                    ])
+                    # Preserve any existing passthrough attributes by appending
+                    existing_passthrough = gpu_func.attributes.get("passthrough")
+                    new_entry = ir.ArrayAttr.get([
+                        ir.StringAttr.get("amdgpu-expert-scheduling-mode"),
+                        ir.StringAttr.get("true"),
+                    ])
+                    if isinstance(existing_passthrough, ir.ArrayAttr):
+                        elements = list(existing_passthrough)
+                        elements.append(new_entry)
+                        gpu_func.attributes["passthrough"] = ir.ArrayAttr.get(elements)
+                    else:
+                        gpu_func.attributes["passthrough"] = ir.ArrayAttr.get([new_entry])

Conversation

XingerZhu commented Apr 2, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aoli26 Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

aoli26 Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

aoli26 Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderfeli Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderfeli Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderfeli commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

XingerZhu commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

coderfeli commented Apr 10, 2026 •

edited

Loading