Implement ROCm delay kernel#882
Conversation
Review SummarySolid implementation that mirrors the CUDA delay kernel for ROCm. The architecture gating (MI300+) and HIP serialization env-var checks are well thought out. The Key findings (see inline comments):
|
46f30de to
a6995a8
Compare
Re-review SummaryGood progress — 4 of 6 previous findings have been addressed (copyright year, capability caching, timeout comment, test skip guard, New findings (3 inline comments):
All new findings are minor nits/parity suggestions — no correctness issues found. |
| __global__ void DelayKernel(volatile GpuSemaphoreState* semaphore, | ||
| GpuSemaphoreState target) { | ||
| constexpr int64_t WAIT_CYCLES{1024}; | ||
| constexpr int64_t TIMEOUT_CYCLES{200000000}; |
There was a problem hiding this comment.
may I ask why we use this magic 200000000 as TIMEOUT CYCLES?
There was a problem hiding this comment.
The timeout is taken from the ported CUDA kernel. It corresponds to a roughly ~100ms timeout on a 2GHz clock rate. I believe the top rates on our GPUs are a bit above that but of course they are also variable, so the timeout differs depending on the environment. But as you say it's a magic number and just pulled out of a hat to have a reasonable timeout. I'll add a comment documenting this.
There was a problem hiding this comment.
thanks for the explain, would be ok to have a XLA flag to change it?
Port the CUDA delay-kernel to ROCm to reduce jitter in autotuning measurements. Gated to MI300+ and disabled when
HIP_LAUNCH_BLOCKING=1,AMD_SERIALIZE_KERNELorAMD_SERIALIZE_COPYare set, since those result in the kernel always hitting timeout.Small bug fix: Fix
GetEventElapsedTimesilently returning false/0 on failures.