Conversation
This is more like a proposal to add other DSLs; we can discuss it in this thread. @danielfleischer @gbenms
Force-pushed from dea6744 to a917f31.
danielfleischer left a comment:
The HW and DSL generalization is fine.
Where did the 18 new examples in KB/triton/XPU/ come from? Are they synthetic, or are they based on KernelBench?
Where did this come from?
They were extracted from SYCL TLA kernels. I will update this PR with a couple of new ones once you approve the existing ones.
Seems to be working:
Speedup: 23.09x
- I changed the strategy from "pick one supposedly better tile" to a runtime micro-autotuned kernel family, while preserving the proven RowMajor/ColumnMajor layout fix and the BF16->FP32 math path. This directly addresses the stage issue and avoids locking into a regressing shape. The sweep prioritizes 256x256x32 and 128x128x64, also testing 128x256x32 and 256x128x32, and retains the current 256x128x16 as a fallback/reference. I also hardened the reference GEMM stride products to int64. If this still doesn't be…
Total Speedup: 23.10x
Performance: 3.71 → 85.68 TFLOPS
Execution Time: 37.047 ms → 1.604 ms
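For context, the tile sweep described above amounts to a small runtime autotuner: time each candidate tile shape once and keep the fastest. A minimal sketch in Python, assuming a hypothetical `launch_gemm(tile_m, tile_n, tile_k)` helper that runs one configuration and returns its elapsed time (the helper and timing parameters are assumptions, not part of this PR):

```python
import math

# Candidate (tile_m, tile_n, tile_k) shapes from the comment above;
# 256x128x16 is the current configuration, retained as fallback/reference.
CANDIDATE_TILES = [
    (256, 256, 32),
    (128, 128, 64),
    (128, 256, 32),
    (256, 128, 32),
    (256, 128, 16),
]

def autotune(launch_gemm, warmup=2, iters=5):
    """Return the fastest tile shape among the candidates."""
    best_tile, best_time = CANDIDATE_TILES[-1], math.inf
    for tile in CANDIDATE_TILES:
        for _ in range(warmup):        # warm caches / trigger JIT
            launch_gemm(*tile)
        elapsed = min(launch_gemm(*tile) for _ in range(iters))
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile
```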
Is this from KernelBench?
It is an example of a fully fused kernel.
```python
class SyclOptimizationSignature(dspy.Signature):
```
Maybe it's a good opportunity to have backend-specific agents in dedicated modules, so all the SYCL code lives in one place, likewise Triton, Gluon, etc.
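For reference, the signature under discussion could look something like this (a minimal sketch; only the class name comes from the diff, the field names and descriptions are assumptions):

```python
import dspy

class SyclOptimizationSignature(dspy.Signature):
    """Optimize a SYCL kernel for the target device."""

    kernel_source = dspy.InputField(desc="original SYCL kernel source")
    device_info = dspy.InputField(desc="target device characteristics, e.g. XPU")
    optimized_kernel = dspy.OutputField(desc="optimized SYCL kernel source")
```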
- `DSL` and `DeviceType` enums, abstract `DeviceConfig` with XPU/CUDA subclasses (`dsl_registry.py`); see the sketch after this list
- Device detection (`device_query.py`) dispatching XPU/CUDA
- Device-specific prompts (`prompts/device_prompts.py`)
- Directory layout: `common/`, `triton/{xpu,cuda}/`, `gluon/xpu/`, `sycl/xpu/`
- `--device` and `--dsl` CLI flags
- `.gitignore`
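A minimal sketch of how the pieces in the list above could fit together (names beyond `DSL`, `DeviceType`, and `DeviceConfig` are hypothetical):

```python
from abc import ABC, abstractmethod
from enum import Enum

class DSL(Enum):
    TRITON = "triton"
    GLUON = "gluon"
    SYCL = "sycl"

class DeviceType(Enum):
    XPU = "xpu"
    CUDA = "cuda"

class DeviceConfig(ABC):
    """Abstract per-device configuration; one concrete subclass per backend."""
    device_type: DeviceType

    @abstractmethod
    def torch_device(self) -> str:
        """Device string understood by the runtime."""

class XPUConfig(DeviceConfig):
    device_type = DeviceType.XPU
    def torch_device(self) -> str:
        return "xpu"

class CUDAConfig(DeviceConfig):
    device_type = DeviceType.CUDA
    def torch_device(self) -> str:
        return "cuda"

def get_device_config(device: DeviceType) -> DeviceConfig:
    # Dispatch on the --device flag, as a device_query module might.
    return {DeviceType.XPU: XPUConfig, DeviceType.CUDA: CUDAConfig}[device]()
```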