
Multi-DSL/device refactoring#19

Merged
sandlbn merged 15 commits intomainfrom
sandlbn/refactor
May 7, 2026

Conversation

Contributor

@sandlbn sandlbn commented Apr 20, 2026

  • Generalize pipeline from Triton/XPU-only to support multiple DSLs (Triton, Gluon, SYCL, CUDA) and devices (XPU, CUDA, CPU)
  • Add DSL and DeviceType enums, abstract DeviceConfig with XPU/CUDA subclasses
  • Add DSL-stage compatibility registry (dsl_registry.py)
  • Abstract device query module (device_query.py) dispatching XPU/CUDA
  • Parameterize LLM prompts by DSL/device (prompts/device_prompts.py)
  • Restructure knowledge base: common/, triton/{xpu,cuda}/, gluon/xpu/, sycl/xpu/
  • Add --device and --dsl CLI flags
  • Add .gitignore
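The enum and device-config pieces above could look roughly like the following sketch. Class and module names mirror the PR description (DSL, DeviceType, DeviceConfig, dsl_registry), but the exact fields and registry contents are assumptions, not the merged code.

```python
# Hypothetical sketch of the DSL/DeviceType enums, the DeviceConfig
# abstraction, and a DSL-stage compatibility registry in the spirit of
# dsl_registry.py. Registry contents are illustrative only.
from abc import ABC, abstractmethod
from enum import Enum


class DSL(Enum):
    TRITON = "triton"
    GLUON = "gluon"
    SYCL = "sycl"
    CUDA = "cuda"


class DeviceType(Enum):
    XPU = "xpu"
    CUDA = "cuda"
    CPU = "cpu"


class DeviceConfig(ABC):
    """Base class for per-device settings (assumed interface)."""

    device_type: DeviceType

    @abstractmethod
    def torch_device(self) -> str:
        """Return the torch device string for this backend."""


class XPUConfig(DeviceConfig):
    device_type = DeviceType.XPU

    def torch_device(self) -> str:
        return "xpu"


class CUDAConfig(DeviceConfig):
    device_type = DeviceType.CUDA

    def torch_device(self) -> str:
        return "cuda"


# Which DSLs each device can run (illustrative values, not the PR's).
DSL_REGISTRY: dict[DeviceType, set[DSL]] = {
    DeviceType.XPU: {DSL.TRITON, DSL.GLUON, DSL.SYCL},
    DeviceType.CUDA: {DSL.TRITON, DSL.CUDA},
    DeviceType.CPU: {DSL.TRITON},
}


def is_supported(device: DeviceType, dsl: DSL) -> bool:
    """Check whether a (device, DSL) pair is allowed by the registry."""
    return dsl in DSL_REGISTRY.get(device, set())
```

The `--device` and `--dsl` CLI flags would then map directly onto these enums, e.g. `DSL(args.dsl)`, with `is_supported` rejecting invalid combinations early.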

Contributor Author

sandlbn commented Apr 20, 2026

This is more of a proposal to add other DSLs; we can discuss it in this thread. @danielfleischer @gbenms

@sandlbn sandlbn marked this pull request as ready for review April 29, 2026 17:26
Member

@danielfleischer danielfleischer left a comment


The HW and DSL generalization is fine.

Where did the 18 new examples in KB/triton/XPU/ come from? Are they synthetic, or are they based on KernelBench?

Member


Where did this come from?

Contributor Author


Extracted from SYCL TLA kernels. I will update this PR with a couple of new ones once you've gone through the existing ones.

Contributor Author


seems to be working:

     Speedup: 23.09x
     - I changed the strategy from "pick one supposedly better tile" to a runtime micro-autotuned kernel family, while preserving the proven RowMajor/ColumnMajor layout fix and the BF16->FP32 math path. This directly addresses the stage issue and avoids locking into a regressing shape. The sweep prioritizes 256x256x32 and 128x128x64, also tests 128x256x32 and 256x128x32, and retains the current 256x128x16 as a fallback/reference. I also hardened the reference GEMM stride products to int64. If this still doesn't…

Total Speedup: 23.10x
Performance: 3.71 → 85.68 TFLOPS
Execution Time: 37.047 ms → 1.604 ms
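The figures above are internally consistent; a quick sanity check of the reported numbers (a calculation for this thread, not part of the PR):

```python
# Verify that the reported speedup matches both the execution-time
# ratio and the TFLOPS ratio from the log above.
time_before_ms, time_after_ms = 37.047, 1.604
tflops_before, tflops_after = 3.71, 85.68

speedup_from_time = time_before_ms / time_after_ms    # about 23.10
speedup_from_tflops = tflops_after / tflops_before    # about 23.09

print(f"{speedup_from_time:.2f}x vs {speedup_from_tflops:.2f}x")
```

Both ratios land at roughly 23.1x, matching the per-stage 23.09x and total 23.10x figures.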

Member


Is this from KernelBench?

Contributor Author


It is; it serves as an example of a fully fused kernel.

Member

@danielfleischer danielfleischer left a comment


More comments.

)


class SyclOptimizationSignature(dspy.Signature):
Member


Maybe it's a good opportunity to have backend-specific agents in dedicated modules. So all the SYCL is there, TRITON, Gluon, etc.

Comment threads on src/xe_forge/core/sycl_executor.py (three marked outdated)
Member

@danielfleischer danielfleischer left a comment


Good

@sandlbn sandlbn merged commit 5cb0498 into main May 7, 2026
2 checks passed
@sandlbn sandlbn deleted the sandlbn/refactor branch May 7, 2026 21:19