Extending Pre-allocated Outputs to support reusing output buffers for intermediate TensorRT Engines #3985
narendasan started this conversation in RFCs
TL;DR
We already have a feature that pre-allocates the output buffers for execution N+1 while the TRT engine is still running execution N. The next step is to reuse those buffers outright, skipping reallocation entirely whenever the TRT engine's outputs will only be consumed by downstream blocks.
Goal(s)
We are trying to limit the amount of work done in `execute_engine`. One of the main sources of latency is the allocation of new output tensors.
Use cases
For graphs with many subgraphs, a lot of time is spent reallocating output buffers that merely shuttle data between TRT and PyTorch blocks with no user interaction. We want to eliminate those reallocations.
Proposed APIs / UX
Extend the behavior of `use_pre_allocated_outputs` to reuse buffers when they will remain owned by the engine (i.e., never handed back to the user).
Example Workflow
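A minimal sketch of the intended workflow, assuming the `torch_tensorrt.runtime.use_pre_allocated_outputs` context manager named in this RFC (the exact entry point and signature may differ); `torch_executed_ops` is used here only to force graph breaks so the compiled module contains intermediate TRT engines:

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, 3, padding=1),
).eval().cuda()
inputs = [torch.randn(1, 3, 224, 224, device="cuda")]

# Force a graph break at ReLU so the module is split into multiple
# TRT engines with a Torch-executed op between them.
trt_module = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    min_block_size=1,
    torch_executed_ops={"torch.ops.aten.relu.default"},
)

# Proposed behavior: with pre-allocated outputs enabled, intermediate
# engines reuse their output buffers across iterations instead of
# reallocating them on every call to execute_engine.
with torch_tensorrt.runtime.use_pre_allocated_outputs(trt_module):
    for _ in range(100):
        out = trt_module(*inputs)
```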
Limitations
Internal Implementation
Design
Building on `use_pre_allocated_outputs`, each engine carries an `is_output_tensors_unowned` flag. For blocks whose output tensors are unowned by the user (they stay inside the Torch-TensorRT graph), we do not reallocate after the first iteration and only reallocate when the output shape changes. Blocks whose outputs are returned to the user follow case A, the existing allocate-on-every-iteration path.
The `pre_allocated_outputs` system already manages the state this requires: it determines whether existing buffers (e.g., for CUDAGraphs or `pre_allocated_outputs`) would be invalidated by a shape change.
The main change would be to condition the reallocation at the bottom of `execute_engine` on `are_output_tensors_unowned` (see f261435#diff-12ec24175cd347c79cd109427bde289f503e2dec9fc9d1a27871e5523a218638R370-R374).
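A hedged sketch of that condition, using illustrative names (`pre_allocated_outputs`, `create_output_tensors`, `shape_changed`) as stand-ins for the runtime's internal state rather than the exact implementation:

```python
import torch

def maybe_allocate_outputs(module, shape_changed: bool) -> list[torch.Tensor]:
    # Sketch of the proposed condition only, not the actual runtime code.
    first_iteration = module.pre_allocated_outputs is None

    # Case A (existing behavior): outputs are returned to the user, or the
    # cached buffers are missing/stale, so fresh tensors are allocated.
    if (not module.are_output_tensors_unowned
            or first_iteration or shape_changed):
        module.pre_allocated_outputs = module.create_output_tensors()

    # Otherwise the outputs stay inside the Torch-TensorRT graph and the
    # same buffers are handed to the engine again, skipping reallocation.
    return module.pre_allocated_outputs
```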
Extensions Required to Core API implementations
Two new APIs for the `TRTEngine` class: a setter and a query for this flag. These APIs indicate that the TRT engine's outputs will only be consumed by other Torch-TensorRT subgraphs and will not be returned to the user; see the sketch below.
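A hypothetical Python-level sketch of the pair: `are_output_tensors_unowned` is the query referenced above, while `set_unowned_output_tensors` is an assumed name for the setter.

```python
class TRTEngine:  # sketch: proposed additions only, not the real class
    def __init__(self) -> None:
        self._output_tensors_unowned = False

    def set_unowned_output_tensors(self, unowned: bool) -> None:
        # Hypothetical setter: marks an engine whose outputs feed only
        # other Torch-TensorRT subgraphs, never the user.
        self._output_tensors_unowned = unowned

    def are_output_tensors_unowned(self) -> bool:
        # Queried in execute_engine to decide whether cached output
        # buffers may be reused across iterations.
        return self._output_tensors_unowned
```

Presumably the compiler or partitioner would call the setter at build time, since it already knows which engine outputs are also graph outputs; the RFC leaves the exact call site open.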
Implementation Phases
Prototype -
MVP (2.10) - https://github.com/pytorch/TensorRT/pull/3946/changes - Extend `torch_tensorrt.runtime.use_pre_allocated_outputs` to try to reuse output buffers for intermediate TensorRT blocks.
Extension Phase 1 (2.11)
Extension Phase 2 (<TARGET RELEASE VERSION>)