
perf: improve prefetch logic with pipeline optimizations#192

Merged
EverythingSuckz merged 1 commit into dev from perf/streaming-overhaul
Mar 7, 2026

Conversation

@EverythingSuckz
Owner

What Changed

This change updates the stream prefetch pipeline in internal/stream/pipe.go to remove the batch barrier created by sync.WaitGroup.

Before this change, a batch of concurrent block downloads had to fully complete before any block from that batch was exposed to the reader. That meant the fastest block in the batch could be ready several seconds earlier, but playback still waited for the slowest block.

After this change, each block download writes into its own buffered result channel, and the prefetch loop drains those results in order. This preserves sequential reads while allowing the first completed in-order block to be forwarded immediately.

Old Flow

The previous implementation used:

blocks := make([][]byte, batchSize)

var wg sync.WaitGroup
var fetchErr error
var errMu sync.Mutex

for i := range batchSize {
    wg.Add(1)
    go func(idx int) {
        defer wg.Done()
        // download into blocks[idx]; on failure, record the first error:
        //   errMu.Lock(); if fetchErr == nil { fetchErr = err }; errMu.Unlock()
    }(i)
}

wg.Wait()

for _, block := range blocks {
    p.blockQueue <- block
}

Effectively, the flow was:

block 0 ready at 6.0s -> wait
block 1 ready at 6.5s -> wait
block 2 ready at 6.9s -> wait
block 3 ready at 7.3s -> release all 4 together

That is correct functionally, but it hurts time-to-first-byte and startup latency.

New Flow

The new implementation introduces a small blockResult type and one buffered channel per block:

type blockResult struct {
    data []byte
    err  error
}

Each download goroutine sends its result into results[idx], and the main prefetch loop drains those channels in index order:

results := make([]chan blockResult, batchSize)

for i := range batchSize {
    results[i] = make(chan blockResult, 1)
    go func(idx int) {
        // download and trim into data, recording any failure in err
        results[idx] <- blockResult{data: data, err: err}
    }(i)
}

for i := range batchSize {
    res := <-results[i]
    if res.err != nil {
        // propagate the first failure instead of forwarding nil data
        return res.err
    }
    p.blockQueue <- res.data
}

Effectively, the new flow is:

block 0 ready at 4.4s -> forwarded immediately
block 1 ready at 5.0s -> forwarded immediately
block 2 ready at 5.3s -> forwarded immediately
block 3 ready at 5.6s -> forwarded immediately

This keeps ordering intact while eliminating the unnecessary full-batch wait.

Why This Is Better

  1. It reduces startup latency because the reader can consume the first in-order block as soon as it is ready.
  2. It keeps ordering guarantees, so downstream readers still receive bytes in the correct sequence.
  3. It removes shared mutable batch state guarded by sync.Mutex.
  4. It avoids using sync.WaitGroup as a barrier when the real requirement is ordered delivery, not group completion.
  5. It still respects cancellation because both the worker calls and the ordered drain path check p.ctx.Done().

Benchmarks From Real Runs

The numbers below come from local before/after runs against the same re-encoded MP4 with faststart, using StreamConcurrency=4.

Playback Start Time

Measured from the first stream request log line to the moment the browser started playback.

| Metric | Without optimization | With optimization | Improvement |
| --- | --- | --- | --- |
| Playback start latency | 8.955s | 5.694s | 3.261s faster |
| Relative improvement | baseline | baseline - 36.4% | 36.4% faster |

First Batch Behavior

From the first request batch in the logs:

| Metric | Without optimization | With optimization |
| --- | --- | --- |
| First completed block | 6.065s | 4.422s |
| Slowest block in first batch | 7.278s | 5.571s |
| Earliest point bytes could reach the reader | 7.278s | 4.422s |

The important number is the last row. In the old version, even though one block completed at 6.065s, the reader still could not consume anything until the batch barrier cleared at 7.278s. In the new version, the reader can start consuming as soon as the first in-order block completes.

Warmed-Up Batch Comparison

Later batches showed the same pattern:

| Metric | Without optimization | With optimization |
| --- | --- | --- |
| Batch 2 first completed block | 0.497s | 0.393s |
| Batch 3 first completed block | 0.434s | 0.429s |

These later batches are useful because they show the change is not dependent on one unusually slow batch. The benefit comes from removing the barrier, not from changing network behavior.

Notes About File Format

The latest benchmark was run on a video re-encoded with the faststart flag. That matters because it moves MP4 metadata to the beginning of the file, which reduces extra browser probing and makes startup measurements cleaner.

Without faststart, browsers often make additional seek-heavy requests before actual playback begins, which makes server-side streaming improvements harder to isolate.

Scalability and Safety

This approach is safe to scale for the current use case.

  1. Concurrency still stays bounded by StreamConcurrency.
  2. Each goroutine has exactly one buffered result channel, so it does not block on send.
  3. There is no shared fetchErr or shared output slice that requires lock coordination.
  4. Cancellation semantics remain intact because the download path already checks context cancellation.
  5. Memory overhead is small: one channel per in-flight block.

This is a reasonable Go concurrency pattern for ordered fan-out and ordered fan-in.
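If the batch size ever needs to grow beyond the concurrency limit, the bound can be kept with a semaphore channel. This variant is a hypothetical sketch, not the repository's implementation, which bounds concurrency at the batch level via StreamConcurrency; fetchBounded and its parameters are illustrative names:

```go
package main

import "fmt"

type blockResult struct {
	data []byte
	err  error
}

// fetchBounded keeps at most `limit` downloads in flight via a semaphore
// channel, while still delivering blocks strictly in index order.
func fetchBounded(n, limit int, fetch func(int) ([]byte, error)) ([][]byte, error) {
	sem := make(chan struct{}, limit)
	results := make([]chan blockResult, n)
	for i := range results {
		results[i] = make(chan blockResult, 1)
		go func(idx int) {
			sem <- struct{}{}        // acquire a slot; blocks when `limit` are in flight
			defer func() { <-sem }() // release the slot when the download finishes
			data, err := fetch(idx)
			results[idx] <- blockResult{data: data, err: err}
		}(i)
	}
	out := make([][]byte, 0, n)
	for i := range results {
		res := <-results[i]
		if res.err != nil {
			return nil, res.err
		}
		out = append(out, res.data)
	}
	return out, nil
}

func main() {
	// 8 blocks, at most 4 concurrent "downloads"
	blocks, err := fetchBounded(8, 4, func(idx int) ([]byte, error) {
		return []byte{byte(idx)}, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(blocks)) // 8
}
```

The one-slot result buffers still guarantee that no worker blocks on send, so releasing the semaphore in a deferred call cannot deadlock the pool.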

@coderabbitai

coderabbitai bot commented Mar 7, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@EverythingSuckz EverythingSuckz merged commit 065f1c4 into dev Mar 7, 2026
3 checks passed
@EverythingSuckz EverythingSuckz deleted the perf/streaming-overhaul branch March 7, 2026 04:50
