[ROCm] Run MIOpen conv autotuning under an exclusive GPU lock to fix all-zero gradient UAF by magaonka-amd · Pull Request #980 · ROCm/xla

magaonka-amd · 2026-06-23T02:19:44Z

📝 Summary of Changes

Run the MIOpen convolution autotuner's find under the exclusive GPU lock, and drain outstanding device work before it allocates and zero-initializes scratch from the shared compute allocator. This isolates the autotuner from concurrent
live execution and fixes a use-after-free that produced intermittent all-zero convolution gradients on ROCm.

🎯 Justification

On ROCm the eager backward-filter conv gradient in tests/batching_test.py::testConvGeneralDilated (JAX) intermittently came back all-zero (42/450 elements). This one of the example failing test , any conv related jax UT can show similar fail and I have been seeing many. but among those dialated conv is most likely to occur in CI ( I think it is because how slow dilation conv generally is in MIOpen ).

Root cause: the MIOpen conv autotuner (GetConvolutionCustomCallConfigs) selects an algorithm by running candidate convolutions via MIOpen find. It allocates scratch from the shared compute allocator and benchmarks the candidates ,
including backward-weights, which zero-initializes the result via SetTensor(dW, 0) , on a separate stream, with no synchronization to the compute stream. Under the kComputeSynchronized allocation model the host runs
ahead of the device, so a buffer can be returned to the allocator while its producing kernel is still in flight; that chunk is then handed to the autotuner's scratch allocator and the zero-init clobbers a result a live kernel still owns.

This is specific to MIOpen: it is the only autotuner backend that executes convolutions during config enumeration (cuDNN enumerates engine configs with null
buffers; hipBLASLt/cuBLASLt/Triton enumerate without device execution).

🚀 Kind of Contribution

🐛 Bug Fix

🧪 Unit Tests

I basically have no ways to reproduce this bug consistantly in JAX or XLA or HIP level, so below example is sort of exaggerating the situation to show how bug manifests. this is not a real UT. just for demo.

A standalone, deterministic reproducer uses the MultiDeviceAdapter/tsl::BFCAllocator to model
the autotuner path: a gradient is produced and freed while its compute work isstill in flight, the autotuner re-serves that chunk from the shared pool and zero-inits it on its own stream, and a downstream consumer then reads the
gradient.

without fix: it re-serves the gradient's chunk while the
producer event is still kPending and zeros it -> the consumer reads 0.
with fix: draining first lets the consumer read the
real gradient before the chunk is reused -> the consumer reads the sentinel.

Sample output:

===== conv-autotuner UAF [UNSERIALIZED autotuner (the bug)] =====
  autotuner scratch=0x...000 (reuses dW chunk: YES); dW event: PENDING (producer kernel still in flight)
  consumer read dW = 0x00000000  -> ZEROED (autotuner clobbered the live gradient!)

===== conv-autotuner UAF [SERIALIZED autotuner (the fix)] =====
  autotuner scratch=0x...000 (reuses dW chunk: YES); dW event: complete (producer kernel finished)
  consumer read dW = 0x3f800000  -> intact gradient (fix works)

Standalone reproducer (conv_autotuner_uaf_test.cc)

#include <cstdint>
#include <cstdio>
#include <memory>
#include <utility>
#include <vector>

#include <gtest/gtest.h>
#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "absl/strings/string_view.h"
#include "xla/stream_executor/device_address.h"
#include "xla/stream_executor/device_address_allocator.h"
#include "xla/stream_executor/event.h"
#include "xla/stream_executor/integrations/device_mem_allocator.h"
#include "xla/stream_executor/integrations/tf_allocator_adapter.h"
#include "xla/stream_executor/platform.h"
#include "xla/stream_executor/platform_manager.h"
#include "xla/stream_executor/stream.h"
#include "xla/stream_executor/stream_executor.h"
#include "xla/tsl/framework/allocator.h"
#include "xla/tsl/framework/bfc_allocator.h"
#include "xla/tsl/framework/device_id.h"
#include "xla/tsl/platform/errors.h"
#include "xla/tsl/platform/statusor.h"
#include "xla/tsl/platform/test.h"

namespace stream_executor {
namespace {

constexpr uint32_t kSentinel = 0x3f800000u;          // 1.0f
constexpr uint64_t kBytes = 45 * sizeof(uint32_t);   // 3*3*1*5 dW = 180 bytes
constexpr uint64_t kScratchBytes = uint64_t{256} << 20;
constexpr int kDelayIters = 256;  // keep dW's compute in flight during reuse

absl::StatusOr<std::pair<Platform*, StreamExecutor*>> GpuExecutor() {
  for (absl::string_view name : {"CUDA", "ROCM"}) {
    absl::StatusOr<Platform*> p = PlatformManager::PlatformWithName(name);
    if (!p.ok()) continue;
    absl::StatusOr<StreamExecutor*> e = (*p)->ExecutorForDevice(0);
    if (e.ok()) return std::make_pair(*p, *e);
  }
  return absl::NotFoundError("No GPU platform (CUDA/ROCM) available");
}

const char* StatusName(Event::Status s) {
  switch (s) {
    case Event::Status::kPending:
      return "PENDING (producer kernel still in flight)";
    case Event::Status::kComplete:
      return "complete (producer kernel finished)";
    default:
      return "error/unknown";
  }
}

struct ConvOutcome {
  bool reuse_same_addr;
  Event::Status status_at_alloc;
  uint32_t consumer_readback;
};

absl::StatusOr<ConvOutcome> RunAutotunerScenario(Platform* platform,
                                                 StreamExecutor* exec,
                                                 bool serialize) {
  const int ord = exec->device_ordinal();
  std::fprintf(stderr, "\n===== conv-autotuner UAF [%s] =====\n",
               serialize ? "SERIALIZED autotuner (the fix)"
                         : "UNSERIALIZED autotuner (the bug)");

  auto sub =
      std::make_unique<DeviceMemAllocator>(exec, tsl::PlatformDeviceId(ord));
  tsl::BFCAllocator::Options opts;
  opts.allow_growth = true;
  auto bfc = std::make_shared<tsl::BFCAllocator>(
      std::move(sub), uint64_t{4} << 30, "conv_uaf_bfc", opts);

  TF_ASSIGN_OR_RETURN(std::unique_ptr<Stream> compute, exec->CreateStream());
  TF_ASSIGN_OR_RETURN(std::unique_ptr<Stream> autotune, exec->CreateStream());
  TF_ASSIGN_OR_RETURN(std::unique_ptr<Event> event, exec->CreateEvent());

  std::vector<MultiDeviceAdapter::AllocatorInfo> infos;
  infos.push_back({bfc, compute.get(), /*memory_space=*/int64_t{0},
                   /*device_ordinal=*/ord});
  MultiDeviceAdapter adapter(platform, std::move(infos));

  // Scratch first so dW is the lowest small chunk and BFC re-serves it.
  TF_ASSIGN_OR_RETURN(ScopedDeviceAddress<uint8_t> scratch_s,
                      adapter.Allocate(ord, kScratchBytes, true, 0));
  DeviceAddress<uint8_t> scratch = scratch_s.Release();
  TF_ASSIGN_OR_RETURN(ScopedDeviceAddress<uint8_t> dw_s,
                      adapter.Allocate(ord, kBytes, true, 0));
  DeviceAddress<uint8_t> dw = dw_s.Release();
  TF_ASSIGN_OR_RETURN(ScopedDeviceAddress<uint8_t> out_s,
                      adapter.Allocate(ord, kBytes, true, 0));
  DeviceAddress<uint8_t> out = out_s.Release();

  // Produce dW, stay busy, then a downstream consumer reads dW into `out`.
  TF_RETURN_IF_ERROR(compute->Memset32(&dw, kSentinel, kBytes));
  for (int i = 0; i < kDelayIters; ++i)
    TF_RETURN_IF_ERROR(compute->Memset32(&scratch, 0xccccccccu, kScratchBytes));
  TF_RETURN_IF_ERROR(compute->Memcpy(&out, dw, kBytes));
  TF_RETURN_IF_ERROR(compute->RecordEvent(event.get()));

  // dW is freed while its compute work is still in flight.
  TF_RETURN_IF_ERROR(adapter.Deallocate(ord, dw));

  // The fix drains the device before the autotuner draws scratch.
  if (serialize) {
    if (!exec->SynchronizeAllActivity())
      return absl::InternalError("Failed to drain device.");
  }
  TF_ASSIGN_OR_RETURN(ScopedDeviceAddress<uint8_t> scr_s,
                      adapter.Allocate(ord, kBytes, true, 0));
  DeviceAddress<uint8_t> scr = scr_s.Release();

  ConvOutcome o;
  o.reuse_same_addr = (scr.opaque() == dw.opaque());
  o.status_at_alloc = event->PollForStatus();
  std::fprintf(stderr,
               "  autotuner scratch=%p (reuses dW chunk: %s); dW event: %s\n",
               scr.opaque(), o.reuse_same_addr ? "YES" : "no",
               StatusName(o.status_at_alloc));

  // Autotuner zero-inits its scratch on its own stream (models SetTensor(dW,0)).
  TF_RETURN_IF_ERROR(autotune->MemZero(&scr, kBytes));
  TF_RETURN_IF_ERROR(autotune->BlockHostUntilDone());
  TF_RETURN_IF_ERROR(compute->BlockHostUntilDone());

  TF_RETURN_IF_ERROR(
      compute->Memcpy(&o.consumer_readback, out, sizeof(uint32_t)));
  TF_RETURN_IF_ERROR(compute->BlockHostUntilDone());
  std::fprintf(stderr, "  consumer read dW = 0x%08x  -> %s\n",
               o.consumer_readback,
               o.consumer_readback == 0u ? "ZEROED (clobbered live gradient!)"
               : o.consumer_readback == kSentinel ? "intact gradient (fix works)"
                                                  : "unexpected");

  TF_RETURN_IF_ERROR(adapter.Deallocate(ord, scr));
  TF_RETURN_IF_ERROR(adapter.Deallocate(ord, out));
  TF_RETURN_IF_ERROR(adapter.Deallocate(ord, scratch));
  return o;
}

TEST(ConvAutotunerUaf, UnserializedAutotunerClobbersLiveGradient) {
  absl::StatusOr<std::pair<Platform*, StreamExecutor*>> gpu = GpuExecutor();
  if (!gpu.ok()) GTEST_SKIP() << gpu.status();
  TF_ASSERT_OK_AND_ASSIGN(
      ConvOutcome o,
      RunAutotunerScenario(gpu->first, gpu->second, /*serialize=*/false));
  EXPECT_TRUE(o.reuse_same_addr);
  EXPECT_EQ(o.status_at_alloc, Event::Status::kPending);
  EXPECT_EQ(o.consumer_readback, 0u);  // gradient clobbered
}

TEST(ConvAutotunerUaf, SerializedAutotunerPreservesLiveGradient) {
  absl::StatusOr<std::pair<Platform*, StreamExecutor*>> gpu = GpuExecutor();
  if (!gpu.ok()) GTEST_SKIP() << gpu.status();
  TF_ASSERT_OK_AND_ASSIGN(
      ConvOutcome o,
      RunAutotunerScenario(gpu->first, gpu->second, /*serialize=*/true));
  EXPECT_EQ(o.consumer_readback, kSentinel);  // gradient preserved
}

}  // namespace
}  // namespace stream_executor

🧪 Execution Tests

I'm not really sure how to write UT for this race condition.
but testConvGeneralDilated on a 4-GPU gfx950 runner (ROCm 7.2.0): without the fix test fails (all-zero gradient, 42/450); with this fix it passes across py3.12–3.14.

Alternatives considered

Dedicated autotuner pool (give the conv autotuner its own raw allocator):
also fixes the aliasing and keeps autotune/execute overlap, but introduces a
second device allocator and increases peak memory under preallocate. I did not like the idea of having multiple allocator silently
Deferred device-buffer free (defer returning a buffer to the allocator
until the compute stream drains): fixes the early-free prerequisite generally,
but I feel this is wrong level to attack the issue, and it is overly aggressive
fix is something like this:

  // ~AllocatedRawSEDeviceMemory()   (xla/pjrt/tracked_device_buffer.cc)
  se::Stream* stream = local_device_->compute_stream();
  absl::Status cb = stream->DoHostCallback(
      [allocator, device_ordinal, memory]() {
        allocator->Deallocate(device_ordinal, memory).IgnoreError();
      });
  if (!cb.ok())  // fallback: free now if the callback could not be enqueued
    allocator->Deallocate(device_ordinal, memory).IgnoreError();

…all-zero gradient UAF The MIOpen convolution autotuner (GetConvolutionCustomCallConfigs) selects an algorithm by running candidate convolutions via MIOpen find: it allocates scratch from the shared compute allocator and benchmarks the candidates -- including backward-weights, which zero-initializes the result via SetTensor(dW, 0) -- on a separate stream. Under the kComputeSynchronized allocation model the host runs ahead of the device, so a buffer can be returned to the allocator while its producing kernel is still in flight. That chunk can then be handed to the autotuner's scratch allocator, and the zero-init clobbers a result a live kernel still owns, producing an intermittent all-zero convolution gradient -- observed on gfx950 as the flaky tests/batching_test.py::testConvGeneralDilated failure. Run the autotuner's find under the exclusive GPU lock and drain outstanding device work before allocating scratch, so every chunk it receives is no longer owned by an in-flight kernel. The lock spans scratch allocation, zero-init and the MIOpen find (it does not change MIOpen's own per-solver loop). This runs only during autotuning (a compile-time event), so there is no steady-state cost.

claude · 2026-06-23T02:40:59Z

+  absl::WriterMutexLock serialize_lock(GetGpuMutex(stream_executor));
+  if (!stream_executor->SynchronizeAllActivity()) {
+    return absl::InternalError(
+        "Failed to synchronize device before convolution autotuning.");
+  }


Observation (non-blocking): The exclusive GPU mutex is held for the entire duration of GetConvolveRunners → PopulateMIOpenFindDb → miopenFindConvolution*Algorithm, which means all other XLA GPU work on this device ordinal will block during autotuning. This is by design (and the PR description acknowledges the serialization), but worth calling out: the first compilation that triggers MIOpen conv autotuning will see elevated latency on that device. Since autotuning is a one-time cost and the alternative (a dedicated allocator for scratch memory) adds significant complexity, this tradeoff seems reasonable.

The lock-then-sync ordering is correct: acquiring the exclusive lock first prevents new GPU work from launching (other threads block on the reader lock in gpu_executable.cc), then SynchronizeAllActivity() drains any work already in flight.

claude · 2026-06-23T02:41:05Z

+  // The MIOpen conv find draws+zero-inits shared scratch that can clobber a
+  // still-live kernel's buffer; run under the exclusive GPU lock, drain first.
+  absl::WriterMutexLock serialize_lock(GetGpuMutex(stream_executor));
+  if (!stream_executor->SynchronizeAllActivity()) {


Looks good: SynchronizeAllActivity() returns bool (not absl::Status), and the error check here correctly handles the false return by producing an absl::InternalError. This matches the interface declared in stream_executor.h.

claude · 2026-06-23T02:41:13Z

Claude Review Summary

Verdict: Approve — clean, minimal, well-targeted fix.

This PR correctly addresses a use-after-free race in MIOpen conv autotuning by acquiring an exclusive GPU mutex and draining in-flight device work before scratch memory allocation. The lock-then-sync ordering is correct, error handling is proper, and the scope is appropriately limited to the non-fused conv path (the fused path doesn't execute on-device during config enumeration). The serialization cost during first-compilation autotuning is an acceptable tradeoff vs. a dedicated allocator.

No issues found. See inline comments for detailed observations.

i-chaochen · 2026-06-23T10:52:42Z


+  // The MIOpen conv find draws+zero-inits shared scratch that can clobber a
+  // still-live kernel's buffer; run under the exclusive GPU lock, drain first.
+  absl::WriterMutexLock serialize_lock(GetGpuMutex(stream_executor));


is it similar to this one you had before? openxla#40062 what's happening to that PR?

magaonka-amd added the claude-review Request a Claude AI code review for this PR label Jun 23, 2026

claude Bot reviewed Jun 23, 2026

View reviewed changes

github-actions Bot removed the claude-review Request a Claude AI code review for this PR label Jun 23, 2026

magaonka-amd requested a review from i-chaochen June 23, 2026 03:09

i-chaochen requested a review from draganmladjenovic June 23, 2026 10:44

This comment was marked as duplicate.

Sign in to view

i-chaochen reviewed Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ROCm] Run MIOpen conv autotuning under an exclusive GPU lock to fix all-zero gradient UAF#980

[ROCm] Run MIOpen conv autotuning under an exclusive GPU lock to fix all-zero gradient UAF#980
magaonka-amd wants to merge 1 commit into
mainfrom
fix/rocm-conv-autotune-serialize-uaf

magaonka-amd commented Jun 23, 2026 •

edited

Loading

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot commented Jun 23, 2026

Uh oh!

This comment was marked as duplicate.

Uh oh!

i-chaochen Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

magaonka-amd commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📝 Summary of Changes

🎯 Justification

🚀 Kind of Contribution

🧪 Unit Tests

🧪 Execution Tests

Alternatives considered

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Jun 23, 2026

Claude Review Summary

Uh oh!

This comment was marked as duplicate.

Uh oh!

i-chaochen Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

magaonka-amd commented Jun 23, 2026 •

edited

Loading