Implement ROCm delay kernel by Eetusjo · Pull Request #882 · ROCm/xla

Eetusjo · 2026-05-28T10:52:18Z

Port the CUDA delay kernel to ROCm to reduce jitter in autotuning measurements. The kernel is gated to MI300+ and disabled when HIP_LAUNCH_BLOCKING=1, AMD_SERIALIZE_KERNEL, or AMD_SERIALIZE_COPY is set, since those settings cause the kernel to always hit its timeout.
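
The environment-variable gate described above could be sketched roughly as follows. This is an illustration only, not the code from this PR: the helper names `EnvFlagEnabled` and `HipSerializationRequested` are hypothetical, and treating an empty value or `0` as "not set" is an assumption of the sketch.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true when `name` is present with a value other than
// "" or "0". Treating "0" as disabled is an assumption of this sketch.
bool EnvFlagEnabled(const char* name) {
  const char* value = std::getenv(name);
  return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Hypothetical predicate: true when any of the variables named in the PR
// description requests serialized execution. In that case the delay kernel
// would only ever exit through its timeout, so it should not be used.
bool HipSerializationRequested() {
  return EnvFlagEnabled("HIP_LAUNCH_BLOCKING") ||
         EnvFlagEnabled("AMD_SERIALIZE_KERNEL") ||
         EnvFlagEnabled("AMD_SERIALIZE_COPY");
}
```

A caller would combine this with the architecture check, enabling the delay kernel only when the device is MI300+ and `HipSerializationRequested()` returns false.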

Small bug fix: Fix GetEventElapsedTime silently returning false/0 on failures.

claude · 2026-05-28T11:00:39Z

Review Summary

Solid implementation that mirrors the CUDA delay kernel for ROCm. The architecture gating (MI300+) and HIP serialization env-var checks are well thought out. The RocmTimer lifecycle management (destructor cleanup, timeout detection) looks correct.

Key findings (see inline comments):

perf: delay_kernel_supported is recomputed on every timer creation — should be cached during init like the CUDA side
correctness: delay kernel test runs unconditionally regardless of GPU arch — may need a skip guard for pre-MI300 hardware
minor: missing #include <cstdlib>, copyright year mismatch, include ordering, and clock rate comment accuracy
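
The caching suggestion in the first finding can be illustrated with a small sketch. The class below is a hypothetical stand-in, not the executor from this PR: `ExecutorSketch`, `Init`, and `UseDelayKernelForTimer` are invented names, and the `probe` callback stands in for whatever architecture and environment checks the real code performs.

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for an executor; not the class from this PR.
class ExecutorSketch {
 public:
  // `probe` stands in for the (assumed) architecture and env-var checks.
  explicit ExecutorSketch(std::function<bool()> probe)
      : probe_(std::move(probe)) {}

  // Run the probe once during initialization and remember the answer.
  void Init() { delay_kernel_supported_ = probe_(); }

  // Called for every timer; reads the cached flag instead of re-probing.
  bool UseDelayKernelForTimer() const { return delay_kernel_supported_; }

 private:
  std::function<bool()> probe_;
  bool delay_kernel_supported_ = false;
};
```

With this shape, creating many timers costs one boolean read each, and the probe runs exactly once no matter how many timers are created.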

claude · 2026-05-29T08:16:17Z

Re-review Summary

Good progress: 5 of 6 previous findings have been addressed (copyright year, capability caching, timeout comment, test skip guard, <cstdlib> include). The include-order nit in delay_kernel_rocm.cu.cc remains from last round.

New findings (3 inline comments):

rocm_executor.cc: <unistd.h> displaced from system headers block
delay_kernel_rocm.cu.cc: Missing explanatory comment on timeout path (parity with CUDA)
rocm_timer.cc: Timeout warning could use ERROR severity and actionable guidance like the CUDA version

All new findings are minor nits/parity suggestions — no correctness issues found.

i-chaochen · 2026-06-10T12:07:33Z

+__global__ void DelayKernel(volatile GpuSemaphoreState* semaphore,
+                            GpuSemaphoreState target) {
+  constexpr int64_t WAIT_CYCLES{1024};
+  constexpr int64_t TIMEOUT_CYCLES{200000000};


May I ask why we use the magic number 200000000 for TIMEOUT_CYCLES?

The timeout is taken from the ported CUDA kernel. It corresponds to roughly 100 ms at a 2 GHz clock rate. I believe the top rates on our GPUs are a bit above that, but they are also variable, so the timeout differs depending on the environment. As you say, though, it's a magic number, pulled out of a hat to give a reasonable timeout. I'll add a comment documenting this.
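
The arithmetic behind that estimate can be checked with a short sketch. The constant is the value from the diff above; the function name `CyclesToMilliseconds` and the 2.1 GHz figure are illustrative assumptions, not measured clock rates.

```cpp
#include <cstdint>

// Value of TIMEOUT_CYCLES quoted in the diff above.
constexpr int64_t kTimeoutCycles = 200000000;

// Convert a cycle budget to milliseconds for an assumed clock rate in Hz.
// Multiplying before dividing keeps the 2 GHz case exact in double precision.
constexpr double CyclesToMilliseconds(int64_t cycles, double clock_hz) {
  return static_cast<double>(cycles) * 1000.0 / clock_hz;
}

// 200,000,000 cycles at 2 GHz is 0.1 s, i.e. 100 ms.
static_assert(CyclesToMilliseconds(kTimeoutCycles, 2.0e9) == 100.0,
              "expected a 100 ms timeout at 2 GHz");
```

Because the budget is fixed in cycles, a faster clock shortens the wall-clock timeout: at a hypothetical 2.1 GHz the same budget works out to a little over 95 ms.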

Thanks for the explanation. Would it be OK to have an XLA flag to change it?

Eetusjo added the claude-review label on May 28, 2026