
feat(pack): support turbopack bundle analyzer#2761

Draft
fireairforce wants to merge 1 commit into `next` from `support-turbopack-analyze`

Conversation

@fireairforce (Member)
Summary

Test Plan


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a native analysis capability to Turbopack, including a new turbopack-analyze crate and logic to generate detailed data on output assets and module graphs. It integrates this feature into the CLI via an --analyze flag and provides the necessary NAPI bindings. The review feedback highlights a potential logic error where root modules might be missed during graph traversal if they lack incoming edges. Additionally, there is an opportunity to improve performance by parallelizing the processing of output assets and their compressed sizes.

Comment on lines +426 to +439
```rust
if let Some((parent_node, reference)) = parent {
    all_modules.insert(parent_node);
    all_modules.insert(node);
    match reference.chunking_type {
        ChunkingType::Async => {
            all_async_edges.insert((parent_node, node));
        }
        _ => {
            all_edges.insert((parent_node, node));
        }
    }
}
Ok(())
})?;
```

Severity: high

In analyze_module_graphs, a module is only inserted into all_modules when its visit carries a parent, i.e. when it is the target of an edge. Root modules (entry points) with no incoming edges can therefore be missing from the all_modules set. The node should be inserted into all_modules regardless of whether parent is Some, so that every node in the graph is captured.

References
  1. Ensure all relevant nodes in a graph traversal are captured, including roots.
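The suggested fix can be sketched in a minimal, self-contained form. The names below (`collect_modules`, the `(node, parent)` visit tuples) are illustrative stand-ins for the actual traversal callback, not the turbopack-analyze API: the point is only that the insertion of `node` happens unconditionally, before the `if let Some(parent)` branch.

```rust
use std::collections::HashSet;

// Sketch of the fix: every visited node goes into `all_modules`
// unconditionally, so root modules with no incoming edge are kept.
fn collect_modules(roots: &[u32], edges: &[(u32, u32)]) -> HashSet<u32> {
    let mut all_modules = HashSet::new();
    // The traversal visits roots with `parent == None` and edge targets
    // with `parent == Some(p)`, mirroring the callback above.
    let visits = roots
        .iter()
        .map(|&r| (r, None))
        .chain(edges.iter().map(|&(p, c)| (c, Some(p))));
    for (node, parent) in visits {
        all_modules.insert(node); // unconditional: captures roots too
        if let Some(p) = parent {
            all_modules.insert(p);
        }
    }
    all_modules
}

fn main() {
    // An entry module (42) with no edges at all would be lost if the
    // insertion only happened inside the `if let Some(parent)` branch.
    let modules = collect_modules(&[42], &[]);
    println!("{}", modules.contains(&42)); // prints "true"
}
```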

Comment on lines +362 to +390
```rust
for &asset in output_assets.await? {
    let filename = asset.path().to_string().owned().await?;
    if filename.ends_with(".map") || filename.ends_with(".nft.json") {
        continue;
    }

    let output_file_index = builder.add_output_file(AnalyzeOutputFile { filename });
    let chunk_parts = split_output_asset_into_parts(*asset).await?;
    for chunk_part in chunk_parts {
        let decoded_source = urlencoding::decode(&chunk_part.source)?;
        let source = if let Some(stripped) = decoded_source.strip_prefix(&prefix) {
            Cow::Borrowed(stripped)
        } else {
            Cow::Owned(format!(
                "[project]/{}",
                decoded_source.trim_start_matches("../")
            ))
        };
        let source_index = builder.ensure_source(&source).1;
        let chunk_part_index = builder.add_chunk_part(AnalyzeChunkPart {
            source_index,
            output_file_index,
            size: chunk_part.real_size + chunk_part.unaccounted_size,
            compressed_size: chunk_part.get_compressed_size().await?,
        });
        builder.add_chunk_part_to_output_file(output_file_index, chunk_part_index);
        builder.add_chunk_part_to_source(source_index, chunk_part_index);
    }
}
```

Severity: medium

The processing of output assets and their chunk parts is currently sequential. Since get_compressed_size() can be a computationally expensive operation (involving compression), this can become a significant performance bottleneck for large applications. Consider using try_join or try_flat_join to process assets and their chunk parts in parallel before populating the builder.
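The shape of the suggested change, computing the expensive sizes in parallel first and then populating the single-threaded builder sequentially, can be sketched with plain threads. This is a hedged illustration only: `compressed_size` is a hypothetical stand-in for `get_compressed_size()` (it counts non-zero bytes rather than compressing), and in Turbopack itself this would use `try_join`/`try_flat_join` over the async calls rather than `std::thread`.

```rust
use std::thread;

// Hypothetical stand-in for the expensive `get_compressed_size()` call:
// counts non-zero bytes instead of actually compressing.
fn compressed_size(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b != 0).count()
}

fn main() {
    let chunk_parts: Vec<Vec<u8>> = vec![vec![1, 0, 2, 3], vec![0, 0, 7]];

    // Phase 1: compute the expensive sizes in parallel.
    let handles: Vec<_> = chunk_parts
        .into_iter()
        .map(|part| thread::spawn(move || compressed_size(&part)))
        .collect();
    let sizes: Vec<usize> = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .collect();

    // Phase 2: populate the (single-threaded) builder sequentially,
    // preserving the original asset order.
    for (i, size) in sizes.iter().enumerate() {
        println!("chunk part {i}: compressed_size = {size}");
    }
}
```

Splitting the work this way keeps the builder free of synchronization while moving the compression cost off the sequential path.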


github-actions bot commented Apr 7, 2026

📊 Performance Benchmark Report (with-antd)

Utoopack Performance Report

Report ID: utoopack_performance_report_20260407_070940
Generated: 2026-04-07 07:09:40
Trace File: trace_antd.json (0.6GB, 1.61M spans)
Test Project: examples/with-antd


Executive Summary

| Metric | Value | Assessment |
| --- | --- | --- |
| Total Wall Time | 14,209.1 ms | Baseline |
| Total Thread Work (de-duped) | 30,339.0 ms | Non-overlapping busy time |
| Effective Parallelism | 2.1x | thread_work / wall_time |
| Working Threads | 5 | Threads with actual spans |
| Thread Utilization | 42.7% | ⚠️ Suboptimal |
| Total Spans | 1,609,826 | All B/E + X events |
| Meaningful Spans (>= 10us) | 530,826 | 33.0% of total |
| Tracing Noise (< 10us) | 1,079,000 | 67.0% of total |

Build Phase Timeline

Shows when each build phase is active and how much CPU it consumes.
Self-Time is the time spent exclusively in that phase (excluding children).

| Phase | Spans | Inclusive (ms) | Self-Time (ms) | Wall Range (ms) |
| --- | --- | --- | --- | --- |
| Resolve | 126,269 | 3,871.3 | 3,047.4 | 7,378.1 |
| Parse | 12,265 | 1,787.3 | 1,506.6 | 14,057.1 |
| Analyze | 307,322 | 18,327.1 | 13,485.7 | 13,616.5 |
| Chunk | 27,940 | 2,662.9 | 2,468.8 | 9,681.5 |
| Codegen | 43,712 | 5,077.4 | 3,819.8 | 9,740.5 |
| Emit | 75 | 55.1 | 27.5 | 9,112.3 |
| Other | 13,243 | 2,000.3 | 1,806.6 | 14,209.1 |

Workload Distribution by Diagnostic Tier

| Category | Spans | Inclusive (ms) | % Work | Self-Time (ms) | % Self |
| --- | --- | --- | --- | --- | --- |
| P0: Scheduling & Resolution | 443,070 | 23,620.7 | 77.9% | 17,780.0 | 58.6% |
| P1: I/O & Heavy Tasks | 3,295 | 126.0 | 0.4% | 98.4 | 0.3% |
| P2: Architecture (Locks/Memory) | 0 | 0.0 | 0.0% | 0.0 | 0.0% |
| P3: Asset Pipeline | 82,767 | 9,572.5 | 31.6% | 7,840.2 | 25.8% |
| P4: Bridge/Interop | 0 | 0.0 | 0.0% | 0.0 | 0.0% |
| Other | 1,694 | 462.2 | 1.5% | 443.8 | 1.5% |

Top 20 Tasks by Self-Time

Self-time is the exclusive duration: time spent in the task itself, not in sub-tasks.
This is the most accurate indicator of where CPU cycles are actually spent.

| Self (ms) | Inclusive (ms) | Count | Avg Self (us) | P95 Self (ms) | Max Self (ms) | % Work | Task Name | Top Caller |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7,601.7 | 8,825.0 | 171,818 | 44.2 | 0.1 | 30.7 | 25.1% | module | write all entrypoints to disk (1%) |
| 2,835.6 | 3,828.4 | 39,803 | 71.2 | 0.1 | 170.0 | 9.3% | analyze ecmascript module | process module (76%) |
| 1,909.8 | 3,167.4 | 24,171 | 79.0 | 0.3 | 92.8 | 6.3% | code generation | chunking (8%) |
| 1,759.6 | 1,909.0 | 64,080 | 27.5 | 0.0 | 18.0 | 5.8% | internal resolving | resolving (30%) |
| 1,669.5 | 1,833.4 | 16,429 | 101.6 | 0.2 | 103.7 | 5.5% | chunking | write all entrypoints to disk (0%) |
| 1,585.4 | 4,177.4 | 75,502 | 21.0 | 0.0 | 15.0 | 5.2% | process module | module (15%) |
| 1,571.5 | 1,571.5 | 17,459 | 90.0 | 0.3 | 6.2 | 5.2% | precompute code generation | code generation (46%) |
| 1,333.9 | 1,505.8 | 10,623 | 125.6 | 0.0 | 267.5 | 4.4% | write all entrypoints to disk | None (0%) |
| 1,276.9 | 1,951.4 | 61,487 | 20.8 | 0.0 | 10.7 | 4.2% | resolving | module (34%) |
| 1,274.0 | 1,274.0 | 16,777 | 75.9 | 0.3 | 143.0 | 4.2% | compute async module info | chunking (0%) |
| 1,150.8 | 1,207.7 | 8,942 | 128.7 | 0.5 | 74.5 | 3.8% | parse ecmascript | analyze ecmascript module (28%) |
| 765.4 | 766.1 | 11,345 | 67.5 | 0.1 | 85.6 | 2.5% | compute async chunks | compute async chunks (0%) |
| 397.9 | 408.9 | 1,489 | 267.3 | 1.0 | 26.0 | 1.3% | webpack loader | parse css (11%) |
| 338.5 | 338.5 | 2,082 | 162.6 | 0.4 | 37.3 | 1.1% | generate source map | code generation (96%) |
| 295.8 | 519.5 | 805 | 367.5 | 1.5 | 36.4 | 1.0% | parse css | module (6%) |
| 105.0 | 105.0 | 1,368 | 76.8 | 0.0 | 26.5 | 0.3% | compute binding usage info | write all entrypoints to disk (0%) |
| 63.4 | 63.4 | 2,014 | 31.5 | 0.0 | 12.7 | 0.2% | collect mergeable modules | compute merged modules (0%) |
| 59.8 | 59.8 | 2,507 | 23.9 | 0.0 | 2.0 | 0.2% | read file | parse ecmascript (85%) |
| 33.9 | 63.3 | 166 | 204.0 | 0.3 | 21.5 | 0.1% | make production chunks | chunking (4%) |
| 28.8 | 32.3 | 926 | 31.2 | 0.1 | 2.4 | 0.1% | async reference | write all entrypoints to disk (0%) |

Critical Path Analysis

The longest sequential dependency chains that determine wall-clock time.
Focus on reducing the depth of these chains to improve parallelism.

| Rank | Self-Time (ms) | Depth | Path |
| --- | --- | --- | --- |
| 1 | 170.1 | 2 | process module → analyze ecmascript module |
| 2 | 130.1 | 2 | code generation → generate source map |
| 3 | 74.5 | 2 | analyze ecmascript module → parse ecmascript |
| 4 | 58.6 | 2 | code generation → generate source map |
| 5 | 45.0 | 2 | process module → analyze ecmascript module |

Batching Candidates

High-volume tasks dominated by a single parent. If the parent can batch them,
it drastically reduces scheduler overhead.

| Task Name | Count | Top Caller (Attribution) | Avg Self | P95 Self | Total Self |
| --- | --- | --- | --- | --- | --- |
| analyze ecmascript module | 39,803 | process module (76%) | 71.2 us | 0.15 ms | 2,835.6 ms |

Duration Distribution

| Range | Count | Percentage |
| --- | --- | --- |
| <10us | 1,079,000 | 67.0% |
| 10us-100us | 502,665 | 31.2% |
| 100us-1ms | 23,254 | 1.4% |
| 1ms-10ms | 4,748 | 0.3% |
| 10ms-100ms | 150 | 0.0% |
| >100ms | 9 | 0.0% |

Action Items

  1. [P0] Focus on tasks with the highest Self-Time — these are where CPU cycles are actually spent.
  2. [P0] Use Batching Candidates to identify callers that should use try_join or reduce #[turbo_tasks::function] granularity.
  3. [P1] Check Build Phase Timeline for phases with disproportionate wall range vs. self-time (= serialization).
  4. [P1] Inspect P95 Self (ms) for heavy monolith tasks. Focus on long-tail outliers, not averages.
  5. [P1] Review Critical Paths — reducing the longest chain depth directly improves wall-clock time.
  6. [P2] If Thread Utilization < 60%, investigate scheduling gaps (lock contention or deep dependency chains).

Report generated by Utoopack Performance Analysis Agent
