Implement cache for hipStream in ROCm executor by mfrancepillois · Pull Request #869 · ROCm/xla

mfrancepillois · 2026-05-20T13:39:20Z

Motivation

The iota_test was very slow on AMD targets (compared to NVIDIA) because the pjrt client was destroyed and recreated for each of the 4500 tests that make up the iota_test. This task is ~40× slower on ROCm than with CUDA (see table below).

Phase	H100	AMD MI300X
Previous client teardown + new client init (pre-BFC log)	~35ms total	~963ms total
BFC allocator re-setup (8 GPUs)	~0.3ms	~0.1ms
Per-test GPU lifecycle cost	35ms	1009ms

The main cause of slowdowns when creating and destroying a pjrt client lies in the creation and destruction of streams.

Per-test overhead (8 GPU ROCm, iota_test):

CREATION (~406ms):
  Phase1 GetGpuXlaClient:      0.2ms    (negligible, singleton)
  Phase2 hipStreamCreate ×112: 385ms    ← dominant creation cost
  Phase3 EnablePeerAccess:     0.4ms    (negligible, cached)
  Phase4 BFC Allocator:        0.2ms    (negligible, no prealloc)
  Phase5 BuildDistributed:      20ms    (RCCL topology)

DESTRUCTION (~513ms):
  dtor body:                   0.05ms
  thread pool shutdown:       138ms 
  hipStreamDestroy ×112:      375ms    ← dominant destruction cost
  SyncAllActivity:              1.5ms   (device 0 only)

TOTAL OVERHEAD:                ~919ms per test
ACTUAL COMPUTATION:             ~90ms  (IotaReshapeExtraDims = 1012ms total)

This PR implements a process-level HipStreamHandleCache singleton directly in rocm_stream.cc. Cache key: (device_ordinal, creation_flags, creation_priority_int).

On destruction (RocmStream::~RocmStream):

1. BlockHostUntilDone() already ran -- stream is idle.
2. hipStreamQuery() confirms idleness; on error the handle is destroyed rather than cached (no poisoning).
3. hipStreamGetFlags / hipStreamGetPriority are called to build the exact cache key, ensuring a retrieved handle always matches the flags and priority the new stream would have used -- even if XLA later switches to hipStreamNonBlocking.
4. Idle handle is stored; hipStreamDestroy is skipped.

On creation (RocmStream::Create via CreateStream):
The cache is checked first; on hit the cached handle is returned
directly and hipStreamCreate is skipped. On miss the cold path
calls hipStreamCreate as before.
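
The store-on-destroy and lookup-on-create paths described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `FakeStreamHandle` stands in for `hipStream_t` so the sketch compiles without the HIP runtime, and `StreamHandleCacheSketch`, `Put`, and `Take` are hypothetical names. The `query_ok` argument represents the result of the idleness check that the real code would get from hipStreamQuery().

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <tuple>
#include <vector>

// Stand-in for hipStream_t so the sketch builds without the HIP runtime.
using FakeStreamHandle = std::uintptr_t;

// Hypothetical cache; the name and interface are illustrative only.
class StreamHandleCacheSketch {
 public:
  // (device_ordinal, creation_flags, creation_priority)
  using Key = std::tuple<int, unsigned int, int>;

  // Destruction path: keep an idle handle for later reuse. A handle whose
  // idleness query failed is rejected, so the caller destroys it instead.
  bool Put(const Key& key, FakeStreamHandle handle, bool query_ok) {
    if (!query_ok) return false;
    std::lock_guard<std::mutex> lock(mu_);
    idle_[key].push_back(handle);
    return true;
  }

  // Creation path: on a hit, hand back a cached handle with a matching key.
  // On a miss, return false so the caller creates a new stream.
  bool Take(const Key& key, FakeStreamHandle* handle) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = idle_.find(key);
    if (it == idle_.end() || it->second.empty()) return false;
    *handle = it->second.back();
    it->second.pop_back();
    return true;
  }

 private:
  std::mutex mu_;
  std::map<Key, std::vector<FakeStreamHandle>> idle_;
};
```

Keying on the flags and priority as well as the device means a lookup can only return a handle that was created with the same parameters the caller asked for.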

The LocalDeviceState and RocmStream wrapper objects are still created and destroyed normally on every client instantiation. DNN state is cleaned up via DeallocateStream as usual. Only the underlying HIP queue (hipStream_t) is reused.

Also fix a latent use-after-free in LocalDeviceState::~LocalDeviceState:

C++ destroys members in reverse declaration order. compute_events_
(line 352 in local_device_state.h) is declared after callback_thread_
(line 342), so its destructor runs before callback_thread_'s
destructor joins the worker thread. If callback_thread_ still has
pending pop_front(compute_events_) closures when compute_events_ is
destroyed, those closures access freed memory.

The fix adds callback_thread_->Drain() between SynchronizeAllActivity()
and the explicit stream/event clears. After Drain() the callback thread
is idle and compute_events_ can be safely cleared.
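
The ordering problem and the Drain() fix can be illustrated with a small standalone sketch. It is an assumption-laden model, not the XLA code: `WorkerSketch` is a hypothetical single-thread worker (not the real WorkerThread), and `DeviceStateSketch` only mirrors the declaration order described above, with the worker declared before the events.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-thread worker; not the actual XLA WorkerThread.
class WorkerSketch {
 public:
  WorkerSketch() : thread_([this] { Run(); }) {}

  // Runs any remaining tasks, then joins the thread.
  ~WorkerSketch() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    thread_.join();
  }

  void Schedule(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push_back(std::move(task));
    }
    cv_.notify_all();
  }

  // Blocks until the queue is empty and no task is running.
  void Drain() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return tasks_.empty() && !busy_; });
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mu_);
    for (;;) {
      cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
      if (tasks_.empty()) return;  // stop requested and nothing left to run
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      busy_ = true;
      lock.unlock();
      task();
      lock.lock();
      busy_ = false;
      cv_.notify_all();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> tasks_;
  bool busy_ = false;
  bool stop_ = false;
  std::thread thread_;  // declared last so it starts after the other members
};

// The worker is declared before the events, so without the explicit Drain()
// the events would be destroyed while closures may still be pending.
struct DeviceStateSketch {
  std::unique_ptr<WorkerSketch> callback_thread =
      std::make_unique<WorkerSketch>();
  std::deque<int> compute_events;

  ~DeviceStateSketch() {
    callback_thread->Drain();  // pending pop_front closures finish here
    compute_events.clear();    // safe: no closure can touch the deque now
  }
};
```

Reordering the member declarations would also avoid the problem in this toy model; the sketch uses the explicit drain because that is the approach the description names.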

claude · 2026-05-20T13:48:16Z

Review Summary

This PR adds a process-level hipStream_t handle cache to avoid expensive hipStreamCreate/hipStreamDestroy calls (~100ms each) on ROCm, and fixes a latent use-after-free in LocalDeviceState::~LocalDeviceState where compute_events_ could be destroyed before the callback thread finishes draining.

Key finding: hipStreamQuery in DestroyStream and the cache-hit path in CreateStream both lack executor->Activate() calls, which could cause incorrect behavior on multi-GPU systems. The CUDA counterpart activates context before the equivalent cuStreamQuery.

The use-after-free fix via WorkerThread::Drain() is clean and correct — it's also platform-agnostic and benefits CUDA builds equally.

See inline comments for details.

hipStreamCreate on ROCm is expensive (~100 ms per stream). When a PjRtClient is destroyed and a new one is immediately created (common in tests and interactive use), all ~18 streams per device are destroyed and recreated, blocking for several seconds. This commit implements a process-level HipStreamHandleCache singleton directly in rocm_stream.cc (ROCm-only, touches no CUDA/SYCL code).

i-chaochen

Just for reference: the previous PR is #861

I assume this cache for hipStream will be "beneficial" not only to the iota unit test, but also to all HIP stream related operations in XLA? It might improve general e2e workloads as well?

mfrancepillois · 2026-06-09T14:19:28Z

Yes, this cache should reduce the execution time of all tests comprising more than one subtest.

ScXfjiang

LGTM

pemeliya · 2026-06-09T15:04:12Z

@mfrancepillois, I am just trying to understand: the current issue is that local_device_state creates about 18 streams per device?

1 compute
1 host2device
4 device2host
4 device2device
4 fixed size pool usage
4 external ready event streams

or there are other places where streams get created/destroyed?

ah I see, this is because each subtest creates/destroys a LocalDeviceState as part of the PJRT client

ScXfjiang · 2026-06-09T15:10:29Z

It caches the streams across multiple PJRT clients, not the streams within a single PJRT client.

pemeliya · 2026-06-09T15:17:59Z

ok I see.. each subtest is now a full-fledged PJRT client, which makes it quite heavy. So, basically, this will only help us with test execution, it won't affect real workload performance..

Possibly, we could also make stream creation in local_device_state lazy? I mean, we probably use just a few streams per subtest out of those 18?

mfrancepillois · 2026-06-09T15:27:24Z

Yes, this is only beneficial for tests.
At some point, we discussed lazy stream creation and assessed its performance (https://github.com/ROCm/frameworks-internal/issues/6589#issuecomment-4452198140). But @draganmladjenovic pointed out that the order in which streams are created is important, as streams are subsequently multiplexed into a single physical queue. The order has therefore been defined to achieve the most efficient stream-to-queue mapping, which would no longer be the case with lazy creation.

pemeliya · 2026-06-09T15:54:58Z

ah I see.. yes, the optimal performance is normally achieved with 4 hardware queues on ROCm. So, we can reduce the number of streams for ROCm: instead of creating 18 streams, we can just create 1 of each kind, e.g.:
1 compute, 1 host2device, 1 device2host and 1 device2device, and so on.

If I am not mistaken, currently we create them in the following order:
1 compute hw0
1 host2device hw1
4 dev2host: hw2 hw3 hw0 hw1
4 dev2dev: hw2 hw3 hw0 hw1
4 fixed size ...
4 external ready ...

but this needs to be tested of course
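
To make the ordering above concrete, here is a small sketch that reproduces it under one explicit assumption: the i-th stream created on a device lands on hardware queue `i % num_hw_queues`. That round-robin rule is only inferred from the list in this comment; the real mapping is decided by the runtime, and `MapStreamsToQueues` is a hypothetical helper.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Assumption for illustration: streams are assigned to hardware queues
// round-robin in creation order. Returns (stream kind, queue index) pairs.
std::vector<std::pair<std::string, int>> MapStreamsToQueues(
    const std::vector<std::pair<std::string, int>>& kinds, int num_hw_queues) {
  std::vector<std::pair<std::string, int>> mapping;
  int created = 0;
  for (std::size_t k = 0; k < kinds.size(); ++k) {
    for (int i = 0; i < kinds[k].second; ++i) {
      mapping.emplace_back(kinds[k].first, created % num_hw_queues);
      ++created;
    }
  }
  return mapping;
}
```

With the counts listed earlier in the thread (1, 1, 4, 4, 4, 4) and 4 queues, this yields hw0 for compute, hw1 for host2device, and hw2 hw3 hw0 hw1 for each group of four, which matches the order given above.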

i-chaochen

Thanks! Please upstream it to speed up our UTs and we can investigate the lazy stream/stream-queue mapping later on.

draganmladjenovic · 2026-06-11T11:13:56Z

> ah I see.. yes, the optimal performance is normally achieved with 4 hardware queues on ROCM. So, we can reduce the number of threads for ROCM: instead of creating 18 streams, we can just create 1 of each kind, e.g.: 1 compute, 1 host2device, 1 device2host and 1 device2device, and so on.
>
> If I am not mistaken, currently we create them in the following order: 1 compute hw0 1 host2device hw1 4 dev2host: hw2 hw3 hw0 hw1 4 dev2dev: hw2 hw3 hw0 hw1 4 fixed size ... 4 external ready ...
>
> but this needs to be tested of course

It is worse. We have 8 of them, 4 by priority. We need to see if we should disable stream priority. Sorry, I haven't gotten around to checking this. I understand that queue creation is somewhat expensive, but once you create all 8 of them they just get reused, so the stream should be lightweight.

mfrancepillois force-pushed the ci_maxime_hip_stream_cache_rocm branch from b11b315 to 14eee1d · May 20, 2026 13:42

mfrancepillois added the claude-review label (Request a Claude AI code review for this PR) · May 20, 2026