Skip to content

Implement ROCm delay kernel#882

Open
Eetusjo wants to merge 1 commit into
mainfrom
ci_rocm_delay_kernel
Open

Implement ROCm delay kernel#882
Eetusjo wants to merge 1 commit into
mainfrom
ci_rocm_delay_kernel

Conversation

@Eetusjo

@Eetusjo Eetusjo commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Port the CUDA delay-kernel to ROCm to reduce jitter in autotuning measurements. Gated to MI300+ and disabled when HIP_LAUNCH_BLOCKING=1, AMD_SERIALIZE_KERNEL or AMD_SERIALIZE_COPY are set, since those result in the kernel always hitting timeout.

Small bug fix: Fix GetEventElapsedTime silently returning false/0 on failures.

@Eetusjo Eetusjo added the claude-review Request a Claude AI code review for this PR label May 28, 2026
Comment thread xla/stream_executor/rocm/delay_kernel.h Outdated
Comment thread xla/stream_executor/rocm/rocm_executor.cc Outdated
Comment thread xla/stream_executor/rocm/delay_kernel_rocm.cu.cc Outdated
Comment thread xla/stream_executor/rocm/delay_kernel_rocm.cu.cc Outdated
Comment thread xla/stream_executor/rocm/rocm_timer_test.cc
Comment thread xla/stream_executor/rocm/rocm_executor.cc
@claude

claude Bot commented May 28, 2026

Copy link
Copy Markdown

Review Summary

Solid implementation that mirrors the CUDA delay kernel for ROCm. The architecture gating (MI300+) and HIP serialization env-var checks are well thought out. The RocmTimer lifecycle management (destructor cleanup, timeout detection) looks correct.

Key findings (see inline comments):

  • perf: delay_kernel_supported is recomputed on every timer creation — should be cached during init like the CUDA side
  • correctness: delay kernel test runs unconditionally regardless of GPU arch — may need a skip guard for pre-MI300 hardware
  • minor: missing #include <cstdlib>, copyright year mismatch, include ordering, and clock rate comment accuracy

@github-actions github-actions Bot removed the claude-review Request a Claude AI code review for this PR label May 28, 2026
@Eetusjo Eetusjo force-pushed the ci_rocm_delay_kernel branch from 46f30de to a6995a8 Compare May 29, 2026 08:06
@Eetusjo Eetusjo added the claude-review Request a Claude AI code review for this PR label May 29, 2026
Comment thread xla/stream_executor/rocm/rocm_executor.cc
Comment thread xla/stream_executor/rocm/delay_kernel_rocm.cu.cc
Comment thread xla/stream_executor/rocm/rocm_timer.cc
@claude

claude Bot commented May 29, 2026

Copy link
Copy Markdown

Re-review Summary

Good progress — 4 of 6 previous findings have been addressed (copyright year, capability caching, timeout comment, test skip guard, <cstdlib> include). The include-order nit in delay_kernel_rocm.cu.cc remains from last round.

New findings (3 inline comments):

  • rocm_executor.cc: <unistd.h> displaced from system headers block
  • delay_kernel_rocm.cu.cc: Missing explanatory comment on timeout path (parity with CUDA)
  • rocm_timer.cc: Timeout warning could use ERROR severity and actionable guidance like the CUDA version

All new findings are minor nits/parity suggestions — no correctness issues found.

@github-actions github-actions Bot removed the claude-review Request a Claude AI code review for this PR label May 29, 2026
@Eetusjo Eetusjo requested a review from draganmladjenovic June 1, 2026 05:32
__global__ void DelayKernel(volatile GpuSemaphoreState* semaphore,
GpuSemaphoreState target) {
constexpr int64_t WAIT_CYCLES{1024};
constexpr int64_t TIMEOUT_CYCLES{200000000};

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may I ask why we use this magic 200000000 as TIMEOUT CYCLES?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout is taken from the ported CUDA kernel. It corresponds to a roughly ~100ms timeout on a 2GHz clock rate. I believe the top rates on our GPUs are a bit above that but of course they are also variable, so the timeout differs depending on the environment. But as you say it's a magic number and just pulled out of a hat to have a reasonable timeout. I'll add a comment documenting this.

@i-chaochen i-chaochen Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explain, would be ok to have a XLA flag to change it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants