Extending Pre-allocated Outputs to support reusing output buffers for intermediate TensorRT Engines #3985
narendasan started this conversation in RFCs
TL;DR
We already have a feature that pre-allocates the output buffers for execution N+1 while the TRT engine is still running execution N. The next step is to reuse those buffers outright, skipping reallocation entirely whenever the TRT engine's outputs will only be consumed by downstream blocks.
Goal(s)
We are trying to limit the amount of work done in `execute_engine`. One of the main sources of latency is the allocation of new output tensors.
Use cases
For graphs with many subgraphs, a lot of time is spent reallocating output buffers that merely shuttle data between TRT and PyTorch blocks with no user interaction. We want to eliminate those reallocations.
Proposed APIs / UX
Extend the behavior of `use_pre_allocated_outputs` to reuse buffers when they will remain owned by the engine (i.e., never handed back to the user).
Example Workflow
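A minimal sketch of the intended workflow, assuming the `torch_tensorrt.runtime.use_pre_allocated_outputs` context manager named in this RFC (the exact entry point and signature may differ); `torch_executed_ops` is used here only to force graph breaks so the compiled module contains intermediate TRT engines:

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, 3, padding=1),
).eval().cuda()
inputs = [torch.randn(1, 3, 224, 224, device="cuda")]

# Force a graph break at ReLU so the module is split into multiple
# TRT engines with a Torch-executed op between them.
trt_module = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    min_block_size=1,
    torch_executed_ops={"torch.ops.aten.relu.default"},
)

# Proposed behavior: with pre-allocated outputs enabled, intermediate
# engines reuse their output buffers across iterations instead of
# reallocating them on every call to execute_engine.
with torch_tensorrt.runtime.use_pre_allocated_outputs(trt_module):
    for _ in range(100):
        out = trt_module(*inputs)
```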
Limitations
Internal Implementation
Design
Building on `use_pre_allocated_outputs`, each engine carries an `is_output_tensors_unowned` flag. For blocks whose output tensors are unowned by the user (they stay inside the Torch-TensorRT graph), we do not reallocate after the first iteration and only reallocate when the output shape changes. Blocks whose outputs are returned to the user follow case A, the existing allocate-on-every-iteration path.
The `pre_allocated_outputs` system already manages the state this requires: it determines whether existing buffers (e.g., for CUDAGraphs or `pre_allocated_outputs`) would be invalidated by a shape change.
The main change would be to condition the reallocation at the bottom of `execute_engine` on `are_output_tensors_unowned` (see f261435#diff-12ec24175cd347c79cd109427bde289f503e2dec9fc9d1a27871e5523a218638R370-R374).
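A hedged sketch of that condition, using illustrative names (`pre_allocated_outputs`, `create_output_tensors`, `shape_changed`) as stand-ins for the runtime's internal state rather than the exact implementation:

```python
import torch

def maybe_allocate_outputs(module, shape_changed: bool) -> list[torch.Tensor]:
    # Sketch of the proposed condition only, not the actual runtime code.
    first_iteration = module.pre_allocated_outputs is None

    # Case A (existing behavior): outputs are returned to the user, or the
    # cached buffers are missing/stale, so fresh tensors are allocated.
    if (not module.are_output_tensors_unowned
            or first_iteration or shape_changed):
        module.pre_allocated_outputs = module.create_output_tensors()

    # Otherwise the outputs stay inside the Torch-TensorRT graph and the
    # same buffers are handed to the engine again, skipping reallocation.
    return module.pre_allocated_outputs
```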
Extensions Required to Core API implementations
Two new APIs for the `TRTEngine` class: a setter and a query for this flag. These APIs indicate that the TRT engine's outputs will only be consumed by other Torch-TensorRT subgraphs and will not be returned to the user; see the sketch below.
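A hypothetical Python-level sketch of the pair: `are_output_tensors_unowned` is the query referenced above, while `set_unowned_output_tensors` is an assumed name for the setter.

```python
class TRTEngine:  # sketch: proposed additions only, not the real class
    def __init__(self) -> None:
        self._output_tensors_unowned = False

    def set_unowned_output_tensors(self, unowned: bool) -> None:
        # Hypothetical setter: marks an engine whose outputs feed only
        # other Torch-TensorRT subgraphs, never the user.
        self._output_tensors_unowned = unowned

    def are_output_tensors_unowned(self) -> bool:
        # Queried in execute_engine to decide whether cached output
        # buffers may be reused across iterations.
        return self._output_tensors_unowned
```

Presumably the compiler or partitioner would call the setter at build time, since it already knows which engine outputs are also graph outputs; the RFC leaves the exact call site open.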
Implementation Phases
Prototype -
MVP (2.10) - https://github.com/pytorch/TensorRT/pull/3946/changes - Extend `torch_tensorrt.runtime.use_pre_allocated_outputs` to try to reuse output buffers for intermediate TensorRT blocks.
Extension Phase 1 (2.11)
Extension Phase 2 (<TARGET RELEASE VERSION>)