Skip to content

[ROCm] Run MIOpen conv autotuning under an exclusive GPU lock to fix all-zero gradient UAF#980

Open
magaonka-amd wants to merge 1 commit into
mainfrom
fix/rocm-conv-autotune-serialize-uaf
Open

[ROCm] Run MIOpen conv autotuning under an exclusive GPU lock to fix all-zero gradient UAF#980
magaonka-amd wants to merge 1 commit into
mainfrom
fix/rocm-conv-autotune-serialize-uaf

Conversation

@magaonka-amd

@magaonka-amd magaonka-amd commented Jun 23, 2026

Copy link
Copy Markdown

📝 Summary of Changes

Run the MIOpen convolution autotuner's find under the exclusive GPU lock, and drain outstanding device work before it allocates and zero-initializes scratch from the shared compute allocator. This isolates the autotuner from concurrent
live execution and fixes a use-after-free that produced intermittent all-zero convolution gradients on ROCm.

🎯 Justification

On ROCm the eager backward-filter conv gradient in tests/batching_test.py::testConvGeneralDilated (JAX) intermittently came back all-zero (42/450 elements). This one of the example failing test , any conv related jax UT can show similar fail and I have been seeing many. but among those dialated conv is most likely to occur in CI ( I think it is because how slow dilation conv generally is in MIOpen ).

Root cause: the MIOpen conv autotuner (GetConvolutionCustomCallConfigs) selects an algorithm by running candidate convolutions via MIOpen find. It allocates scratch from the shared compute allocator and benchmarks the candidates ,
including backward-weights, which zero-initializes the result via SetTensor(dW, 0) , on a separate stream, with no synchronization to the compute stream. Under the kComputeSynchronized allocation model the host runs
ahead of the device, so a buffer can be returned to the allocator while its producing kernel is still in flight; that chunk is then handed to the autotuner's scratch allocator and the zero-init clobbers a result a live kernel still owns.

This is specific to MIOpen: it is the only autotuner backend that executes convolutions during config enumeration (cuDNN enumerates engine configs with null
buffers; hipBLASLt/cuBLASLt/Triton enumerate without device execution).

🚀 Kind of Contribution

🐛 Bug Fix

🧪 Unit Tests

I basically have no ways to reproduce this bug consistantly in JAX or XLA or HIP level, so below example is sort of exaggerating the situation to show how bug manifests. this is not a real UT. just for demo.

A standalone, deterministic reproducer uses the MultiDeviceAdapter/tsl::BFCAllocator to model
the autotuner path: a gradient is produced and freed while its compute work isstill in flight, the autotuner re-serves that chunk from the shared pool and zero-inits it on its own stream, and a downstream consumer then reads the
gradient.

  • without fix: it re-serves the gradient's chunk while the
    producer event is still kPending and zeros it -> the consumer reads 0.
  • with fix: draining first lets the consumer read the
    real gradient before the chunk is reused -> the consumer reads the sentinel.

Sample output:

===== conv-autotuner UAF [UNSERIALIZED autotuner (the bug)] =====
  autotuner scratch=0x...000 (reuses dW chunk: YES); dW event: PENDING (producer kernel still in flight)
  consumer read dW = 0x00000000  -> ZEROED (autotuner clobbered the live gradient!)

===== conv-autotuner UAF [SERIALIZED autotuner (the fix)] =====
  autotuner scratch=0x...000 (reuses dW chunk: YES); dW event: complete (producer kernel finished)
  consumer read dW = 0x3f800000  -> intact gradient (fix works)
Standalone reproducer (conv_autotuner_uaf_test.cc)
#include <cstdint>
#include <cstdio>
#include <memory>
#include <utility>
#include <vector>

#include <gtest/gtest.h>
#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "absl/strings/string_view.h"
#include "xla/stream_executor/device_address.h"
#include "xla/stream_executor/device_address_allocator.h"
#include "xla/stream_executor/event.h"
#include "xla/stream_executor/integrations/device_mem_allocator.h"
#include "xla/stream_executor/integrations/tf_allocator_adapter.h"
#include "xla/stream_executor/platform.h"
#include "xla/stream_executor/platform_manager.h"
#include "xla/stream_executor/stream.h"
#include "xla/stream_executor/stream_executor.h"
#include "xla/tsl/framework/allocator.h"
#include "xla/tsl/framework/bfc_allocator.h"
#include "xla/tsl/framework/device_id.h"
#include "xla/tsl/platform/errors.h"
#include "xla/tsl/platform/statusor.h"
#include "xla/tsl/platform/test.h"

namespace stream_executor {
namespace {

constexpr uint32_t kSentinel = 0x3f800000u;          // 1.0f
constexpr uint64_t kBytes = 45 * sizeof(uint32_t);   // 3*3*1*5 dW = 180 bytes
constexpr uint64_t kScratchBytes = uint64_t{256} << 20;
constexpr int kDelayIters = 256;  // keep dW's compute in flight during reuse

absl::StatusOr<std::pair<Platform*, StreamExecutor*>> GpuExecutor() {
  for (absl::string_view name : {"CUDA", "ROCM"}) {
    absl::StatusOr<Platform*> p = PlatformManager::PlatformWithName(name);
    if (!p.ok()) continue;
    absl::StatusOr<StreamExecutor*> e = (*p)->ExecutorForDevice(0);
    if (e.ok()) return std::make_pair(*p, *e);
  }
  return absl::NotFoundError("No GPU platform (CUDA/ROCM) available");
}

const char* StatusName(Event::Status s) {
  switch (s) {
    case Event::Status::kPending:
      return "PENDING (producer kernel still in flight)";
    case Event::Status::kComplete:
      return "complete (producer kernel finished)";
    default:
      return "error/unknown";
  }
}

struct ConvOutcome {
  bool reuse_same_addr;
  Event::Status status_at_alloc;
  uint32_t consumer_readback;
};

absl::StatusOr<ConvOutcome> RunAutotunerScenario(Platform* platform,
                                                 StreamExecutor* exec,
                                                 bool serialize) {
  const int ord = exec->device_ordinal();
  std::fprintf(stderr, "\n===== conv-autotuner UAF [%s] =====\n",
               serialize ? "SERIALIZED autotuner (the fix)"
                         : "UNSERIALIZED autotuner (the bug)");

  auto sub =
      std::make_unique<DeviceMemAllocator>(exec, tsl::PlatformDeviceId(ord));
  tsl::BFCAllocator::Options opts;
  opts.allow_growth = true;
  auto bfc = std::make_shared<tsl::BFCAllocator>(
      std::move(sub), uint64_t{4} << 30, "conv_uaf_bfc", opts);

  TF_ASSIGN_OR_RETURN(std::unique_ptr<Stream> compute, exec->CreateStream());
  TF_ASSIGN_OR_RETURN(std::unique_ptr<Stream> autotune, exec->CreateStream());
  TF_ASSIGN_OR_RETURN(std::unique_ptr<Event> event, exec->CreateEvent());

  std::vector<MultiDeviceAdapter::AllocatorInfo> infos;
  infos.push_back({bfc, compute.get(), /*memory_space=*/int64_t{0},
                   /*device_ordinal=*/ord});
  MultiDeviceAdapter adapter(platform, std::move(infos));

  // Scratch first so dW is the lowest small chunk and BFC re-serves it.
  TF_ASSIGN_OR_RETURN(ScopedDeviceAddress<uint8_t> scratch_s,
                      adapter.Allocate(ord, kScratchBytes, true, 0));
  DeviceAddress<uint8_t> scratch = scratch_s.Release();
  TF_ASSIGN_OR_RETURN(ScopedDeviceAddress<uint8_t> dw_s,
                      adapter.Allocate(ord, kBytes, true, 0));
  DeviceAddress<uint8_t> dw = dw_s.Release();
  TF_ASSIGN_OR_RETURN(ScopedDeviceAddress<uint8_t> out_s,
                      adapter.Allocate(ord, kBytes, true, 0));
  DeviceAddress<uint8_t> out = out_s.Release();

  // Produce dW, stay busy, then a downstream consumer reads dW into `out`.
  TF_RETURN_IF_ERROR(compute->Memset32(&dw, kSentinel, kBytes));
  for (int i = 0; i < kDelayIters; ++i)
    TF_RETURN_IF_ERROR(compute->Memset32(&scratch, 0xccccccccu, kScratchBytes));
  TF_RETURN_IF_ERROR(compute->Memcpy(&out, dw, kBytes));
  TF_RETURN_IF_ERROR(compute->RecordEvent(event.get()));

  // dW is freed while its compute work is still in flight.
  TF_RETURN_IF_ERROR(adapter.Deallocate(ord, dw));

  // The fix drains the device before the autotuner draws scratch.
  if (serialize) {
    if (!exec->SynchronizeAllActivity())
      return absl::InternalError("Failed to drain device.");
  }
  TF_ASSIGN_OR_RETURN(ScopedDeviceAddress<uint8_t> scr_s,
                      adapter.Allocate(ord, kBytes, true, 0));
  DeviceAddress<uint8_t> scr = scr_s.Release();

  ConvOutcome o;
  o.reuse_same_addr = (scr.opaque() == dw.opaque());
  o.status_at_alloc = event->PollForStatus();
  std::fprintf(stderr,
               "  autotuner scratch=%p (reuses dW chunk: %s); dW event: %s\n",
               scr.opaque(), o.reuse_same_addr ? "YES" : "no",
               StatusName(o.status_at_alloc));

  // Autotuner zero-inits its scratch on its own stream (models SetTensor(dW,0)).
  TF_RETURN_IF_ERROR(autotune->MemZero(&scr, kBytes));
  TF_RETURN_IF_ERROR(autotune->BlockHostUntilDone());
  TF_RETURN_IF_ERROR(compute->BlockHostUntilDone());

  TF_RETURN_IF_ERROR(
      compute->Memcpy(&o.consumer_readback, out, sizeof(uint32_t)));
  TF_RETURN_IF_ERROR(compute->BlockHostUntilDone());
  std::fprintf(stderr, "  consumer read dW = 0x%08x  -> %s\n",
               o.consumer_readback,
               o.consumer_readback == 0u ? "ZEROED (clobbered live gradient!)"
               : o.consumer_readback == kSentinel ? "intact gradient (fix works)"
                                                  : "unexpected");

  TF_RETURN_IF_ERROR(adapter.Deallocate(ord, scr));
  TF_RETURN_IF_ERROR(adapter.Deallocate(ord, out));
  TF_RETURN_IF_ERROR(adapter.Deallocate(ord, scratch));
  return o;
}

TEST(ConvAutotunerUaf, UnserializedAutotunerClobbersLiveGradient) {
  absl::StatusOr<std::pair<Platform*, StreamExecutor*>> gpu = GpuExecutor();
  if (!gpu.ok()) GTEST_SKIP() << gpu.status();
  TF_ASSERT_OK_AND_ASSIGN(
      ConvOutcome o,
      RunAutotunerScenario(gpu->first, gpu->second, /*serialize=*/false));
  EXPECT_TRUE(o.reuse_same_addr);
  EXPECT_EQ(o.status_at_alloc, Event::Status::kPending);
  EXPECT_EQ(o.consumer_readback, 0u);  // gradient clobbered
}

TEST(ConvAutotunerUaf, SerializedAutotunerPreservesLiveGradient) {
  absl::StatusOr<std::pair<Platform*, StreamExecutor*>> gpu = GpuExecutor();
  if (!gpu.ok()) GTEST_SKIP() << gpu.status();
  TF_ASSERT_OK_AND_ASSIGN(
      ConvOutcome o,
      RunAutotunerScenario(gpu->first, gpu->second, /*serialize=*/true));
  EXPECT_EQ(o.consumer_readback, kSentinel);  // gradient preserved
}

}  // namespace
}  // namespace stream_executor

🧪 Execution Tests

I'm not really sure how to write UT for this race condition.
but testConvGeneralDilated on a 4-GPU gfx950 runner (ROCm 7.2.0): without the fix test fails (all-zero gradient, 42/450); with this fix it passes across py3.12–3.14.

Alternatives considered

  • Dedicated autotuner pool (give the conv autotuner its own raw allocator):
    also fixes the aliasing and keeps autotune/execute overlap, but introduces a
    second device allocator and increases peak memory under preallocate. I did not like the idea of having multiple allocator silently

  • Deferred device-buffer free (defer returning a buffer to the allocator
    until the compute stream drains): fixes the early-free prerequisite generally,
    but I feel this is wrong level to attack the issue, and it is overly aggressive
    fix is something like this:

  // ~AllocatedRawSEDeviceMemory()   (xla/pjrt/tracked_device_buffer.cc)
  se::Stream* stream = local_device_->compute_stream();
  absl::Status cb = stream->DoHostCallback(
      [allocator, device_ordinal, memory]() {
        allocator->Deallocate(device_ordinal, memory).IgnoreError();
      });
  if (!cb.ok())  // fallback: free now if the callback could not be enqueued
    allocator->Deallocate(device_ordinal, memory).IgnoreError();

…all-zero gradient UAF

The MIOpen convolution autotuner (GetConvolutionCustomCallConfigs) selects an
algorithm by running candidate convolutions via MIOpen find: it allocates
scratch from the shared compute allocator and benchmarks the candidates --
including backward-weights, which zero-initializes the result via
SetTensor(dW, 0) -- on a separate stream.

Under the kComputeSynchronized allocation model the host runs ahead of the
device, so a buffer can be returned to the allocator while its producing kernel
is still in flight. That chunk can then be handed to the autotuner's scratch
allocator, and the zero-init clobbers a result a live kernel still owns,
producing an intermittent all-zero convolution gradient -- observed on gfx950 as
the flaky tests/batching_test.py::testConvGeneralDilated failure.

Run the autotuner's find under the exclusive GPU lock and drain outstanding
device work before allocating scratch, so every chunk it receives is no longer
owned by an in-flight kernel. The lock spans scratch allocation, zero-init and
the MIOpen find (it does not change MIOpen's own per-solver loop). This runs only
during autotuning (a compile-time event), so there is no steady-state cost.
@magaonka-amd magaonka-amd added the claude-review Request a Claude AI code review for this PR label Jun 23, 2026
Comment on lines +319 to +323
absl::WriterMutexLock serialize_lock(GetGpuMutex(stream_executor));
if (!stream_executor->SynchronizeAllActivity()) {
return absl::InternalError(
"Failed to synchronize device before convolution autotuning.");
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation (non-blocking): The exclusive GPU mutex is held for the entire duration of GetConvolveRunnersPopulateMIOpenFindDbmiopenFindConvolution*Algorithm, which means all other XLA GPU work on this device ordinal will block during autotuning. This is by design (and the PR description acknowledges the serialization), but worth calling out: the first compilation that triggers MIOpen conv autotuning will see elevated latency on that device. Since autotuning is a one-time cost and the alternative (a dedicated allocator for scratch memory) adds significant complexity, this tradeoff seems reasonable.

The lock-then-sync ordering is correct: acquiring the exclusive lock first prevents new GPU work from launching (other threads block on the reader lock in gpu_executable.cc), then SynchronizeAllActivity() drains any work already in flight.

// The MIOpen conv find draws+zero-inits shared scratch that can clobber a
// still-live kernel's buffer; run under the exclusive GPU lock, drain first.
absl::WriterMutexLock serialize_lock(GetGpuMutex(stream_executor));
if (!stream_executor->SynchronizeAllActivity()) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good: SynchronizeAllActivity() returns bool (not absl::Status), and the error check here correctly handles the false return by producing an absl::InternalError. This matches the interface declared in stream_executor.h.

@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown

Claude Review Summary

Verdict: Approve — clean, minimal, well-targeted fix.

This PR correctly addresses a use-after-free race in MIOpen conv autotuning by acquiring an exclusive GPU mutex and draining in-flight device work before scratch memory allocation. The lock-then-sync ordering is correct, error handling is proper, and the scope is appropriately limited to the non-fused conv path (the fused path doesn't execute on-device during config enumeration). The serialization cost during first-compilation autotuning is an acceptable tradeoff vs. a dedicated allocator.

No issues found. See inline comments for detailed observations.

@github-actions github-actions Bot removed the claude-review Request a Claude AI code review for this PR label Jun 23, 2026
@magaonka-amd magaonka-amd requested a review from i-chaochen June 23, 2026 03:09
i-chaochen

This comment was marked as duplicate.


// The MIOpen conv find draws+zero-inits shared scratch that can clobber a
// still-live kernel's buffer; run under the exclusive GPU lock, drain first.
absl::WriterMutexLock serialize_lock(GetGpuMutex(stream_executor));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it similar to this one you had before? openxla#40062 what's happening to that PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants