[ROCm] Run MIOpen conv autotuning under an exclusive GPU lock to fix all-zero gradient UAF#980
[ROCm] Run MIOpen conv autotuning under an exclusive GPU lock to fix all-zero gradient UAF#980magaonka-amd wants to merge 1 commit into
Conversation
…all-zero gradient UAF The MIOpen convolution autotuner (GetConvolutionCustomCallConfigs) selects an algorithm by running candidate convolutions via MIOpen find: it allocates scratch from the shared compute allocator and benchmarks the candidates -- including backward-weights, which zero-initializes the result via SetTensor(dW, 0) -- on a separate stream. Under the kComputeSynchronized allocation model the host runs ahead of the device, so a buffer can be returned to the allocator while its producing kernel is still in flight. That chunk can then be handed to the autotuner's scratch allocator, and the zero-init clobbers a result a live kernel still owns, producing an intermittent all-zero convolution gradient -- observed on gfx950 as the flaky tests/batching_test.py::testConvGeneralDilated failure. Run the autotuner's find under the exclusive GPU lock and drain outstanding device work before allocating scratch, so every chunk it receives is no longer owned by an in-flight kernel. The lock spans scratch allocation, zero-init and the MIOpen find (it does not change MIOpen's own per-solver loop). This runs only during autotuning (a compile-time event), so there is no steady-state cost.
| absl::WriterMutexLock serialize_lock(GetGpuMutex(stream_executor)); | ||
| if (!stream_executor->SynchronizeAllActivity()) { | ||
| return absl::InternalError( | ||
| "Failed to synchronize device before convolution autotuning."); | ||
| } |
There was a problem hiding this comment.
Observation (non-blocking): The exclusive GPU mutex is held for the entire duration of GetConvolveRunners → PopulateMIOpenFindDb → miopenFindConvolution*Algorithm, which means all other XLA GPU work on this device ordinal will block during autotuning. This is by design (and the PR description acknowledges the serialization), but worth calling out: the first compilation that triggers MIOpen conv autotuning will see elevated latency on that device. Since autotuning is a one-time cost and the alternative (a dedicated allocator for scratch memory) adds significant complexity, this tradeoff seems reasonable.
The lock-then-sync ordering is correct: acquiring the exclusive lock first prevents new GPU work from launching (other threads block on the reader lock in gpu_executable.cc), then SynchronizeAllActivity() drains any work already in flight.
| // The MIOpen conv find draws+zero-inits shared scratch that can clobber a | ||
| // still-live kernel's buffer; run under the exclusive GPU lock, drain first. | ||
| absl::WriterMutexLock serialize_lock(GetGpuMutex(stream_executor)); | ||
| if (!stream_executor->SynchronizeAllActivity()) { |
There was a problem hiding this comment.
Looks good: SynchronizeAllActivity() returns bool (not absl::Status), and the error check here correctly handles the false return by producing an absl::InternalError. This matches the interface declared in stream_executor.h.
Claude Review SummaryVerdict: Approve — clean, minimal, well-targeted fix. This PR correctly addresses a use-after-free race in MIOpen conv autotuning by acquiring an exclusive GPU mutex and draining in-flight device work before scratch memory allocation. The lock-then-sync ordering is correct, error handling is proper, and the scope is appropriately limited to the non-fused conv path (the fused path doesn't execute on-device during config enumeration). The serialization cost during first-compilation autotuning is an acceptable tradeoff vs. a dedicated allocator. No issues found. See inline comments for detailed observations. |
|
|
||
| // The MIOpen conv find draws+zero-inits shared scratch that can clobber a | ||
| // still-live kernel's buffer; run under the exclusive GPU lock, drain first. | ||
| absl::WriterMutexLock serialize_lock(GetGpuMutex(stream_executor)); |
There was a problem hiding this comment.
is it similar to this one you had before? openxla#40062 what's happening to that PR?
📝 Summary of Changes
Run the MIOpen convolution autotuner's find under the exclusive GPU lock, and drain outstanding device work before it allocates and zero-initializes scratch from the shared compute allocator. This isolates the autotuner from concurrent
live execution and fixes a use-after-free that produced intermittent all-zero convolution gradients on ROCm.
🎯 Justification
On ROCm the eager backward-filter conv gradient in
tests/batching_test.py::testConvGeneralDilated(JAX) intermittently came back all-zero (42/450 elements). This one of the example failing test , any conv related jax UT can show similar fail and I have been seeing many. but among those dialated conv is most likely to occur in CI ( I think it is because how slow dilation conv generally is in MIOpen ).Root cause: the MIOpen conv autotuner (
GetConvolutionCustomCallConfigs) selects an algorithm by running candidate convolutions via MIOpen find. It allocates scratch from the shared compute allocator and benchmarks the candidates ,including backward-weights, which zero-initializes the result via
SetTensor(dW, 0), on a separate stream, with no synchronization to the compute stream. Under thekComputeSynchronizedallocation model the host runsahead of the device, so a buffer can be returned to the allocator while its producing kernel is still in flight; that chunk is then handed to the autotuner's scratch allocator and the zero-init clobbers a result a live kernel still owns.
This is specific to MIOpen: it is the only autotuner backend that executes convolutions during config enumeration (cuDNN enumerates engine configs with null
buffers; hipBLASLt/cuBLASLt/Triton enumerate without device execution).
🚀 Kind of Contribution
🐛 Bug Fix
🧪 Unit Tests
I basically have no ways to reproduce this bug consistantly in JAX or XLA or HIP level, so below example is sort of exaggerating the situation to show how bug manifests. this is not a real UT. just for demo.
A standalone, deterministic reproducer uses the
MultiDeviceAdapter/tsl::BFCAllocatorto modelthe autotuner path: a gradient is produced and freed while its compute work isstill in flight, the autotuner re-serves that chunk from the shared pool and zero-inits it on its own stream, and a downstream consumer then reads the
gradient.
producer event is still
kPendingand zeros it -> the consumer reads0.real gradient before the chunk is reused -> the consumer reads the sentinel.
Sample output:
Standalone reproducer (conv_autotuner_uaf_test.cc)
🧪 Execution Tests
I'm not really sure how to write UT for this race condition.
but
testConvGeneralDilatedon a 4-GPU gfx950 runner (ROCm 7.2.0): without the fix test fails (all-zero gradient, 42/450); with this fix it passes across py3.12–3.14.Alternatives considered
Dedicated autotuner pool (give the conv autotuner its own raw allocator):
also fixes the aliasing and keeps autotune/execute overlap, but introduces a
second device allocator and increases peak memory under preallocate. I did not like the idea of having multiple allocator silently
Deferred device-buffer free (defer returning a buffer to the allocator
until the compute stream drains): fixes the early-free prerequisite generally,
but I feel this is wrong level to attack the issue, and it is overly aggressive
fix is something like this: