perf: improve prefetch logic with pipeline optimizations #192
EverythingSuckz merged 1 commit into dev
Conversation
What Changed
This change updates the stream prefetch pipeline in `internal/stream/pipe.go` to remove the batch barrier created by `sync.WaitGroup`. Before this change, a batch of concurrent block downloads had to fully complete before any block from that batch was exposed to the reader. That meant the fastest block in the batch could be ready several seconds earlier, but playback still waited for the slowest block.
After this change, each block download writes into its own buffered result channel, and the prefetch loop drains those results in order. This preserves sequential reads while allowing the first completed in-order block to be forwarded immediately.
Old Flow
The previous implementation spawned a batch of download goroutines tracked by a `sync.WaitGroup` and called `wg.Wait()` before forwarding any block. Effectively, the flow was: start all downloads in the batch, wait for the slowest one to finish, then forward the blocks in order. That is functionally correct, but it hurts time-to-first-byte and startup latency.
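The old batch-barrier flow can be sketched roughly as follows. This is an illustrative reconstruction, not the actual `pipe.go` code; `fetchBatchBarrier` and `fetch` are hypothetical names standing in for the real download logic.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchBatchBarrier downloads n blocks concurrently, but nothing is
// returned to the caller until every block in the batch has finished.
func fetchBatchBarrier(n int, fetch func(i int) []byte) [][]byte {
	out := make([][]byte, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out[i] = fetch(i) // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait() // batch barrier: the reader is blocked on the slowest download
	return out
}

func main() {
	blocks := fetchBatchBarrier(4, func(i int) []byte { return []byte{byte(i)} })
	fmt.Println(len(blocks)) // prints 4, but only after ALL blocks are done
}
```

Even if block 0 finishes first, `wg.Wait()` holds it hostage until the whole batch completes.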
New Flow
The new implementation introduces a small `blockResult` type and one buffered channel per block. Each download goroutine sends its result into `results[idx]`, and the main prefetch loop drains those channels in index order. Effectively, the new flow is: start all downloads, forward block 0 the moment it completes, then block 1, and so on, without ever waiting on the rest of the batch.
This keeps ordering intact while eliminating the unnecessary full-batch wait.
Why This Is Better
- No shared mutable state that needs a `sync.Mutex`.
- No `sync.WaitGroup` used as a barrier when the real requirement is ordered delivery, not group completion.
- Cancellation is still honored via `p.ctx.Done()`.

Benchmarks From Real Runs
The numbers below come from your local before/after runs against the same re-encoded MP4 with `faststart`, using `StreamConcurrency=4`.

Playback Start Time
Measured from the first stream request log line to the moment the browser started playback.
First Batch Behavior
From the first request batch in your logs, the key numbers are these: in the old version, even though one block completed at 6.065s, the reader still could not consume anything until the batch barrier cleared at 7.278s. In the new version, the reader can start consuming as soon as the first in-order block completes.
Warmed-Up Batch Comparison
Later batches showed the same pattern.
These later batches are useful because they show the change is not dependent on one unusually slow batch. The benefit comes from removing the barrier, not from changing network behavior.
Notes About File Format
The latest benchmark was run on a video re-encoded with the `faststart` flag. That matters because it moves MP4 metadata to the beginning of the file, which reduces extra browser probing and makes startup measurements cleaner. Without `faststart`, browsers often make additional seek-heavy requests before actual playback begins, which makes server-side streaming improvements harder to isolate.

Scalability and Safety
This approach is safe to scale for the current use case.
- The number of in-flight downloads is bounded by `StreamConcurrency`.
- Errors travel with their block result (via `fetchError`) rather than through a shared output slice that requires lock coordination.

This is a reasonable Go concurrency pattern for ordered fan-out and ordered fan-in.
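As a standalone illustration of that pattern with bounded concurrency (all names here are hypothetical; a buffered-channel semaphore plays the role of `StreamConcurrency`):

```go
package main

import "fmt"

// fetchBounded fans work out to goroutines, with at most `limit`
// fetches running concurrently, and fans results back in, in index order.
func fetchBounded(n, limit int, fetch func(i int) int) []int {
	sem := make(chan struct{}, limit) // semaphore: bounds concurrent fetches
	results := make([]chan int, n)
	for i := range results {
		results[i] = make(chan int, 1) // per-index buffered result channel
	}
	go func() {
		for i := 0; i < n; i++ {
			sem <- struct{}{} // blocks once `limit` fetches are in flight
			go func(i int) {
				defer func() { <-sem }()
				results[i] <- fetch(i)
			}(i)
		}
	}()
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, <-results[i]) // ordered fan-in
	}
	return out
}

func main() {
	fmt.Println(fetchBounded(8, 4, func(i int) int { return i * i }))
	// prints [0 1 4 9 16 25 36 49]
}
```

Because each goroutine writes only to its own buffered channel and the consumer reads by index, no mutex is needed and ordering falls out of the drain loop rather than a barrier.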