
Broadcast Kernel Fusion#119

Draft
ejmeitz wants to merge 12 commits into develop from fusion-again

ejmeitz commented Apr 24, 2026

This PR implements automatic fusion of broadcast expressions built from unary and binary operations, which in principle covers any intrinsic you could use in a CUDA kernel.

y .= sin.(x)
z = x .+ y .* z
w = x .+ y .+ 2.0f0

# Would be nice to support:
f(x, y) = x + y
z .= 2.0f0 .* f.(x, y)
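
For context, here is a minimal CPU-only sketch of how Julia represents a dotted expression before it runs. It uses only `Base.Broadcast` on plain `Array`s; whether this PR hooks into fusion at exactly this level for NDArrays is an assumption, not something stated above.

```julia
using Base.Broadcast: Broadcasted, broadcasted, materialize

x = Float32[1, 2, 3]
y = Float32[4, 5, 6]
z = Float32[7, 8, 9]

# `x .+ y .* z` lowers to nested lazy `broadcasted` calls; nothing is computed yet.
bc = broadcasted(+, x, broadcasted(*, y, z))

# The lazy object stores the operation and its arguments as a tree.
@assert bc isa Broadcasted
@assert bc.f === +
@assert bc.args[2] isa Broadcasted && bc.args[2].f === *

# Materializing walks the whole tree in a single loop (plain Arrays here, no GPU).
result = materialize(bc)
@assert result == x .+ y .* z
```

The nested `Broadcasted` tree is the kind of structure a fused kernel can be generated from, since it carries every op and argument of the full expression.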

Goals

  • Support broadcast expressions with arbitrarily many input/output NDArrays, using any binary or unary op, optionally combined with any number of scalar arguments.
  • Add fallback logic: if an expression cannot be fused, use the un-fused kernels and emit a warning.
  • Support broadcasting of arbitrary user-defined functions. There is currently no code path for this at all, not even an un-fused one. This might "just work", as CUDA.jl should handle it for us.
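
A rough sketch of the fallback goal, in plain Julia. Every name here (`FusionError`, `FUSABLE_OPS`, `fused_eval`, `unfused_eval`, `eval_with_fallback`) is hypothetical and stands in for whatever the real fused and un-fused paths end up being; no GPU code is involved.

```julia
# Hypothetical error type thrown when an expression cannot be fused.
struct FusionError <: Exception
    msg::String
end

# Hypothetical stand-in for the fused path: it only accepts a small set of ops.
const FUSABLE_OPS = (+, *, sin)

function fused_eval(f, args...)
    f in FUSABLE_OPS || throw(FusionError("unsupported op: $(f)"))
    return broadcast(f, args...)
end

# Hypothetical stand-in for the existing un-fused kernels.
unfused_eval(f, args...) = map(f, args...)

# Try the fused path first; on a fusion failure, warn and fall back.
function eval_with_fallback(f, args...)
    try
        return fused_eval(f, args...)
    catch err
        err isa FusionError || rethrow()
        @warn "Broadcast fusion failed; using un-fused kernel" reason = err.msg
        return unfused_eval(f, args...)
    end
end
```

Only `FusionError` is caught, so unrelated bugs still surface instead of being silently hidden behind the fallback.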

High Level Implementation

  • Generate a key from the broadcast expression and check the cache.
  • If the key is not in the cache, generate PTX for the broadcast kernel using CUDA.jl by spoofing the NDArrays as CuDeviceArrays.
  • Launch the LoadPTXTask and RunPTXTask Legate tasks defined in cuda.cpp with the registered kernel.
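
The key-then-cache step could look roughly like the sketch below. This is an illustration under assumptions: `expr_key`, `KERNEL_CACHE`, `COMPILE_COUNT`, `compile_kernel`, and `get_kernel` are hypothetical names, plain `Array`s stand in for NDArrays, and the "compile" step just returns a placeholder string rather than calling CUDA.jl or any Legate task.

```julia
using Base.Broadcast: Broadcasted, broadcasted

# Structural key: function types plus argument element types and ranks.
# Array values are excluded so same-shaped expressions share one kernel.
expr_key(bc::Broadcasted) = (typeof(bc.f), map(expr_key, bc.args)...)
expr_key(a::AbstractArray) = (eltype(a), ndims(a))
expr_key(s::Number) = typeof(s)

const KERNEL_CACHE = Dict{Any,String}()
const COMPILE_COUNT = Ref(0)   # counts cache misses, for illustration only

# Hypothetical placeholder for PTX generation; returns a dummy string.
function compile_kernel(key)
    COMPILE_COUNT[] += 1
    return "ptx for $(key)"
end

# Look up the kernel for an expression, compiling only on a cache miss.
function get_kernel(bc::Broadcasted)
    key = expr_key(bc)
    return get!(() -> compile_kernel(key), KERNEL_CACHE, key)
end
```

Two expressions with the same ops and argument types, such as `x .+ y .* 2.0f0` and `u .+ v .* 3.0f0` on `Float32` vectors, map to the same key here, so the second lookup reuses the cached entry. Whether scalar values should be part of the key or passed as kernel arguments is a design choice this sketch does not settle.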

This PR does not aim to support fusion of functions. That will be a future PR.

ejmeitz marked this pull request as draft April 24, 2026 18:14
ejmeitz changed the title from "Basic Kernel Fusion" to "Broadcast Kernel Fusion" May 12, 2026
krasow changed the base branch from main to develop May 14, 2026 12:54

2 participants