Specialized optimization for WebAssembly modules produced by component fusion (e.g., meld).
When multiple P2/P3 WebAssembly components are fused into a single core module, the result contains characteristic patterns that standard optimization passes miss. The fused component optimizer targets these patterns specifically, achieving additional size and performance improvements beyond what loom's standard 12-phase pipeline provides.
Component fusion (performed by tools like meld) combines multiple components into one module. This process introduces:
- Same-memory adapters that redundantly allocate+copy within a single linear memory
- Adapter trampolines for cross-component calls
- Duplicate function types from each source component
- Duplicate imports where multiple components imported the same external function
- Dead functions (adapters that can be bypassed)
The fused optimizer detects and eliminates these artifacts.
flowchart LR
subgraph Input
A[Component A.wasm]
B[Component B.wasm]
end
subgraph meld["meld fuse"]
Parse --> Resolve --> Merge --> Adapt --> Encode
end
subgraph loom["loom optimize"]
Fused["Fused Optimizer"] --> Pipeline["12-Phase Pipeline"]
end
A --> Parse
B --> Parse
Encode --> Fused
Pipeline --> Out[Optimized Module]
meld handles the structural transformation: resolving dependencies, merging index spaces, generating adapter trampolines, and producing a valid single-module output.
loom handles the semantic optimization: devirtualizing adapters, eliminating dead code, folding constants, reducing strength, and verifying correctness.
Together they enable whole-program optimization across former component boundaries - something neither tool can achieve alone.
When component A calls a function exported by component B, meld generates an adapter trampoline:
flowchart LR
subgraph before["Before (fused, unoptimized)"]
Caller -->|call| Adapter
Adapter -->|"local.get 0..N; call"| Target
end
subgraph after["After (devirtualized)"]
Caller2[Caller] -->|"call (direct)"| Target2[Target]
end
before --> after
For direct adapters (shared memory), the adapter is a trivial forwarding function:
;; Generated by meld: adapter from component 0 to component 1
(func $adapter (param i32) (result i32)
local.get 0 ;; Forward parameter
call $target_func ;; Call the actual target
)
For memory-crossing adapters (multi-memory mode), the adapter is more complex:
flowchart TD
Caller -->|"ptr, len"| Adapter
Adapter -->|cabi_realloc| CalleeMemory[Callee Memory]
Adapter -->|memory.copy| CalleeMemory
Adapter -->|"new_ptr, len"| Target
Target -->|results| Adapter
Adapter -->|memory.copy back| CallerMemory[Caller Memory]
Adapter -->|results| Caller
Each source component contributes its own type section. After fusion, the merged type section contains duplicates:
flowchart LR
subgraph before["Before Dedup"]
T0["Type 0: (i32) → i32 ← Comp A"]
T1["Type 1: (i32, i32) → () ← Comp A"]
T2["Type 2: (i32) → i32 ← Comp B"]
T3["Type 3: (i32) → i32 ← Comp C"]
end
subgraph after["After Dedup"]
C0["Type 0: (i32) → i32 (canonical)"]
C1["Type 1: (i32, i32) → ()"]
end
T0 --> C0
T2 -->|"remapped"| C0
T3 -->|"remapped"| C0
T1 --> C1
Multiple components may import the same external function. These are merged and all references remapped.
flowchart TD
Input[Fused Module from meld] --> P0a
P0a["Pass 0a: Memory Import Dedup"] --> P0b
P0b["Pass 0b: Same-Memory Adapter Collapse"] --> P1
P1["Pass 1: Adapter Devirtualization"] --> P2
P2["Pass 2: Trivial Call Elimination"] --> P3
P3["Pass 3: Function Type Deduplication"] --> P4
P4["Pass 4: Dead Function Elimination"] --> P5
P5["Pass 5: Import Deduplication"] --> Output
Output[Cleaned Module] --> Standard["loom 12-Phase Pipeline"]
What: Merge identical memory imports and remap all memory references to index 0.
When: Meld fuses components that share the same host memory, producing a module with multiple memory imports that all reference the same underlying linear memory (identical (module, name) pair). Per WASM spec §2.5.10, the same import key resolves to the same binding.
Transformation: Remove duplicate memory imports, remap all memory-indexed instructions (memory.copy, memory.size, memory.grow, memory.fill, memory.init), data segment memory indices, and memory exports to index 0.
Safety (all-or-nothing): This pass only fires when ALL memory imports are identical and there are no local memories. Partial dedup could leave references to removed memory indices, so all-or-nothing is the only safe strategy.
Synergy with Pass 0b: After dedup, a previously multi-memory module becomes single-memory, enabling Pass 0b (same-memory adapter collapse) to fire on adapters that were previously skipped.
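The all-or-nothing rule can be sketched as follows. This is a minimal model, not loom's implementation: the `MemoryImport` struct, the `dedup_memory_imports` function, and the flat `memory_refs` slice (standing in for every memory index found in instructions, data segments, and exports) are hypothetical names chosen for illustration.

```rust
/// Hypothetical stand-in for a memory import, keyed by its (module, name) pair.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MemoryImport {
    pub module: String,
    pub name: String,
}

/// Deduplicates memory imports only when ALL of them are identical and the
/// module defines no local memories. Returns the number of imports removed.
/// `memory_refs` models every memory index that appears in the module.
pub fn dedup_memory_imports(
    imports: &mut Vec<MemoryImport>,
    local_memories: usize,
    memory_refs: &mut [u32],
) -> usize {
    // Nothing to do with local memories present or fewer than two imports.
    if local_memories != 0 || imports.len() < 2 {
        return 0;
    }
    let first = imports[0].clone();
    // A single differing import disables the pass; partial dedup is never attempted.
    if imports.iter().any(|m| *m != first) {
        return 0;
    }
    let removed = imports.len() - 1;
    imports.truncate(1);
    // Every surviving reference now points at the one remaining memory.
    for r in memory_refs.iter_mut() {
        *r = 0;
    }
    removed
}
```

Because the pass either rewrites every reference to index 0 or touches nothing, no reference can be left pointing at a removed memory index.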
What: Collapse same-memory adapters (realloc + memory.copy within one linear memory) into trivial forwarding trampolines.
Pattern detected: Functions where all memory operations target a single consistent memory index:
- Have locals (temporary pointers for the copy)
- Contain memory.copy {N, N} (same-memory copy, any consistent index N) with no cross-memory copies
- All load/store instructions target the same memory index N
- Call cabi_realloc at least once
- Call exactly one non-realloc target function
- Have no unsafe control flow (null-guard If blocks with empty/nop else bodies and safe-to-discard then bodies are allowed; Block, Loop, Br, BrIf, BrTable are rejected; balanced single-global save/restore is allowed; memory stores are allowed because they target the discarded realloc'd buffer)
- Have the same signature as the target
Transformation: Replace the function body with a forwarding trampoline:
;; Before: allocate + copy + call target
(func $adapter (param i32 i32) (result i32)
... cabi_realloc ... memory.copy 0 0 ... call $target ...)
;; After: trivial forwarding (Pass 1 then devirtualizes)
(func $adapter (param i32 i32) (result i32)
local.get 0
local.get 1
call $target)
Why this is correct: When all memory operations target the same linear memory (index N), memory.copy {N, N} copies within a single address space. The adapter allocates a buffer, copies argument data to it, then calls the target with the new pointer. Since both pointers reference the same memory, the target can read the data at the original pointer directly. The allocation and copy are semantically redundant. This holds for any memory index N, not just 0; the key invariant is consistency within the adapter.
Stack pointer save/restore: Meld-generated adapters for non-trivial types often include a stack pointer prologue/epilogue (global.get $sp; sub; global.set $sp ... global.set $sp to restore). These balanced single-global writes are safe to collapse because the entire body is replaced by a forwarding trampoline — the global is never modified in the collapsed function (net-zero effect). The predicate has_unsafe_global_writes allows this pattern while still rejecting writes to multiple globals or write-only globals.
Memory stores: Meld-generated adapters for complex types (records, lists) perform field-by-field copying via load+store pairs alongside memory.copy. These stores target the freshly realloc'd buffer, which is discarded when the adapter body is replaced by a forwarding trampoline. The target reads from the original pointer in the same address space, so the stores are semantically redundant.
Null-guard If blocks: Meld-generated adapters for optional/nullable types wrap the realloc+copy in a null-guard if block. These are safe to collapse because: if the condition is true (Some), the copy is same-memory redundant — skipping it is equivalent; if false (None), the copy never ran anyway. The then_body must contain only safe-to-discard instructions (locals, constants, arithmetic, loads, stores, realloc calls, memory.copy {0, 0}). Nested If inside If, non-empty else bodies, Block, Loop, Br, BrIf, and BrTable remain rejected.
Cross-memory adapters: Adapters where memory operations target different indices (e.g., memory.copy {0, 1} or loads from memory 0 mixed with stores to memory 1) are NOT collapsed. These represent genuine cross-component data transfer between distinct linear memories, where the copy is semantically necessary. Such adapters are detected and counted as cross_memory_adapters_detected for diagnostic visibility, but no optimization is applied.
Synergy with Pass 1: After collapse, the adapter is a trivial forwarding trampoline. Pass 1 detects this and rewrites all callers to call the target directly, eliminating both the adapter overhead AND the unnecessary allocation/copy.
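The core of the detection predicate can be sketched over a simplified instruction set. This is an illustrative model under stated assumptions: the `Instr` and `AdapterKind` types and the `classify_adapter` function are hypothetical, and the sketch deliberately omits the locals check, the signature check, null-guard If handling, and the global save/restore analysis described above.

```rust
/// Hypothetical, heavily reduced instruction set for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Instr {
    LocalGet(u32),
    LocalSet(u32),
    Load { mem: u32 },
    Store { mem: u32 },
    MemoryCopy { dst: u32, src: u32 },
    Call(u32),
    Block,
    Loop,
    Br,
    BrIf,
    BrTable,
}

#[derive(Debug, Clone, PartialEq)]
pub enum AdapterKind {
    /// Collapsible: all memory operations use one index; forwards to `target`.
    SameMemory { target: u32 },
    /// Memory operations touch more than one index; counted but never collapsed.
    CrossMemory,
    /// Does not match the adapter pattern.
    NotCollapsible,
}

pub fn classify_adapter(body: &[Instr], realloc: u32) -> AdapterKind {
    let mut mems: Vec<u32> = Vec::new();
    let mut same_copy = false;
    let mut realloc_calls = 0usize;
    let mut targets: Vec<u32> = Vec::new();

    for instr in body {
        match instr {
            // Unsafe control flow rejects the function outright.
            Instr::Block | Instr::Loop | Instr::Br | Instr::BrIf | Instr::BrTable => {
                return AdapterKind::NotCollapsible;
            }
            Instr::Load { mem } | Instr::Store { mem } => mems.push(*mem),
            Instr::MemoryCopy { dst, src } => {
                mems.push(*dst);
                mems.push(*src);
                if dst == src {
                    same_copy = true;
                }
            }
            Instr::Call(f) if *f == realloc => realloc_calls += 1,
            Instr::Call(f) => targets.push(*f),
            _ => {}
        }
    }

    // More than one distinct memory index means genuine cross-memory transfer.
    mems.sort_unstable();
    mems.dedup();
    if mems.len() > 1 {
        return AdapterKind::CrossMemory;
    }
    // Here "exactly one target" is modeled as exactly one non-realloc call.
    if same_copy && realloc_calls >= 1 && targets.len() == 1 {
        AdapterKind::SameMemory { target: targets[0] }
    } else {
        AdapterKind::NotCollapsible
    }
}
```

Note that the check never compares the index against 0; it only requires that a single index is used consistently, which mirrors the generalized invariant.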
What: Detect trivial adapter trampolines and rewrite callers to bypass them.
Pattern detected:
(func $adapter (param p0 p1 ... pN) (result r0 r1 ... rM)
local.get 0
local.get 1
...
local.get N
call $target
)
Transformation: Replace call $adapter with call $target at every call site.
Why this is correct: The adapter pushes exactly the same arguments onto the stack in the same order and calls the target. It is semantically identical to calling the target directly. No locals are modified, no control flow exists, no side effects occur.
Handles adapter chains: If adapter A calls adapter B which calls target T, we resolve transitively: A → T and B → T.
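The pattern match and the transitive chain resolution can be sketched as below. This is a simplified model, not loom's code: `Instr`, `Func`, `trivial_target`, `resolve_adapters`, and `devirtualize` are hypothetical names, functions are indexed by their position in a slice, and the signature comparison between adapter and target is omitted.

```rust
use std::collections::HashMap;

/// Hypothetical reduced instruction set; `Other` stands for anything else.
#[derive(Debug, Clone, PartialEq)]
pub enum Instr {
    LocalGet(u32),
    Call(u32),
    Other,
}

pub struct Func {
    pub param_count: u32,
    pub body: Vec<Instr>,
}

/// Returns the callee if the body is exactly `local.get 0 .. local.get N; call T`.
pub fn trivial_target(f: &Func) -> Option<u32> {
    let n = f.param_count as usize;
    if f.body.len() != n + 1 {
        return None;
    }
    for (i, instr) in f.body[..n].iter().enumerate() {
        if *instr != Instr::LocalGet(i as u32) {
            return None;
        }
    }
    match &f.body[n] {
        Instr::Call(t) => Some(*t),
        _ => None,
    }
}

/// Maps each trivial adapter to its final, non-adapter target.
pub fn resolve_adapters(funcs: &[Func]) -> HashMap<u32, u32> {
    let direct: HashMap<u32, u32> = funcs
        .iter()
        .enumerate()
        .filter_map(|(i, f)| trivial_target(f).map(|t| (i as u32, t)))
        .collect();
    let mut resolved = HashMap::new();
    for (&adapter, &first) in &direct {
        let mut target = first;
        let mut hops = 0usize;
        let mut cyclic = false;
        // Follow A -> B -> T; the hop bound skips adapters caught in a cycle.
        while let Some(&next) = direct.get(&target) {
            hops += 1;
            if hops > direct.len() {
                cyclic = true;
                break;
            }
            target = next;
        }
        if !cyclic {
            resolved.insert(adapter, target);
        }
    }
    resolved
}

/// Rewrites every call to an adapter into a call to its resolved target.
pub fn devirtualize(funcs: &mut [Func]) -> usize {
    let resolved = resolve_adapters(funcs);
    let mut rewritten = 0;
    for f in funcs.iter_mut() {
        for instr in f.body.iter_mut() {
            if let Instr::Call(callee) = instr {
                if let Some(&t) = resolved.get(&*callee) {
                    *callee = t;
                    rewritten += 1;
                }
            }
        }
    }
    rewritten
}
```

After the rewrite, the adapters themselves have no remaining callers, which is what lets the later dead function elimination pass remove them.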
What: Detect () -> () functions with empty bodies and remove all calls to them.
Pattern detected: Functions generated by meld as cabi_post_return stubs that take no parameters, return no results, and contain only End/Nop instructions.
Transformation: Remove every call $trivial instruction from all function bodies.
Why this is correct: A function with () -> () signature and an empty body is a no-op per the WASM spec. Calling it and not calling it produce identical execution states.
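A minimal sketch of this pass, assuming a toy module where functions are indexed by their position in a slice and imports are ignored. The `Instr` and `Func` types and both function names are hypothetical, introduced only for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical reduced instruction set for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Instr {
    Nop,
    End,
    Call(u32),
    I32Const(i32),
    Drop,
}

pub struct Func {
    pub params: usize,
    pub results: usize,
    pub body: Vec<Instr>,
}

/// A function is trivial when it is `() -> ()` and contains only End/Nop.
pub fn is_trivial(f: &Func) -> bool {
    f.params == 0
        && f.results == 0
        && f.body.iter().all(|i| matches!(i, Instr::Nop | Instr::End))
}

/// Removes every call to a trivial function; returns how many calls were removed.
pub fn eliminate_trivial_calls(funcs: &mut [Func]) -> usize {
    let trivial: HashSet<u32> = funcs
        .iter()
        .enumerate()
        .filter(|(_, f)| is_trivial(f))
        .map(|(i, _)| i as u32)
        .collect();
    let mut removed = 0;
    for f in funcs.iter_mut() {
        let before = f.body.len();
        // A call to a `() -> ()` function consumes and produces no stack values,
        // so dropping the instruction leaves the surrounding code well-typed.
        f.body
            .retain(|i| !matches!(i, Instr::Call(c) if trivial.contains(c)));
        removed += before - f.body.len();
    }
    removed
}
```

The trivial function itself is left in place here; once nothing calls it, dead function elimination can remove it.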
What: Merge structurally identical function types and remap all references.
How: Hash each type by its parameter and result lists. Build a canonical mapping. Update all type references (imports, call_indirect).
Safety: Skipped when raw type section bytes are present (GC/reference types) since those require complex re-encoding.
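The hashing and canonical mapping can be sketched as follows. The `ValType` and `FuncType` definitions and the `dedup_types` function are illustrative assumptions rather than loom's actual types; the sketch returns a remap table and leaves applying it to imports and call_indirect sites to the caller.

```rust
use std::collections::HashMap;

/// Hypothetical value types; GC and reference types are out of scope here.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum ValType {
    I32,
    I64,
    F32,
    F64,
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct FuncType {
    pub params: Vec<ValType>,
    pub results: Vec<ValType>,
}

/// Returns the deduplicated type list and a table where `remap[old] = new`.
/// The first occurrence of each structural type becomes the canonical entry.
pub fn dedup_types(types: &[FuncType]) -> (Vec<FuncType>, Vec<u32>) {
    let mut canonical: HashMap<&FuncType, u32> = HashMap::new();
    let mut unique = Vec::new();
    let mut remap = Vec::with_capacity(types.len());
    for ty in types {
        let idx = *canonical.entry(ty).or_insert_with(|| {
            unique.push(ty.clone());
            (unique.len() - 1) as u32
        });
        remap.push(idx);
    }
    (unique, remap)
}
```

Running the function on its own output yields an identity remap, which is the behavior the idempotence theorem describes.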
What: Remove functions unreachable from any root (exports, the start function, and functions referenced by element segments).
How: Build a liveness set from export roots and start function. Walk the call graph transitively. Remove unreachable functions and remap indices. Parse element sections with wasmparser to extract indirect call targets rather than conservatively marking all functions as live. Rebuild element sections with remapped indices after removal.
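The liveness walk and index remapping can be sketched on a toy call graph. The `Module` struct below is a hypothetical simplification: `calls[f]` lists the direct callees of function `f`, and `element_funcs` stands in for the function references that would be extracted from element sections.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified module view for illustration.
pub struct Module {
    pub calls: Vec<Vec<u32>>,
    pub exports: Vec<u32>,
    pub start: Option<u32>,
    pub element_funcs: Vec<u32>,
}

/// Collects every function reachable from exports, the start function,
/// and element-segment entries by walking the call graph.
pub fn live_functions(m: &Module) -> HashSet<u32> {
    let mut live = HashSet::new();
    let mut work: Vec<u32> = m
        .exports
        .iter()
        .copied()
        .chain(m.start)
        .chain(m.element_funcs.iter().copied())
        .collect();
    while let Some(f) = work.pop() {
        // `insert` returns false for already-visited functions, so cycles terminate.
        if live.insert(f) {
            work.extend(m.calls[f as usize].iter().copied());
        }
    }
    live
}

/// Builds `remap[old] = Some(new)` for kept functions and `None` for removed ones.
pub fn build_remap(m: &Module) -> Vec<Option<u32>> {
    let live = live_functions(m);
    let mut next = 0u32;
    (0..m.calls.len() as u32)
        .map(|f| {
            if live.contains(&f) {
                let n = next;
                next += 1;
                Some(n)
            } else {
                None
            }
        })
        .collect()
}
```

The remap table is what a rebuild step would apply to call instructions, exports, and element segments after the dead functions are dropped.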
What: Merge identical imports (same module + name + type).
How: Hash each import by module name, field name, and type index. Build canonical mapping. Remap all function references (imported function indices shift, local function indices shift by the reduction count).
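The index shifting described above can be illustrated with a small sketch. The `FuncImport` struct and `dedup_imports` function are hypothetical; the sketch assumes a module containing only function imports followed by `local_funcs` locally defined functions, and returns a remap over the whole function index space.

```rust
use std::collections::HashMap;

/// Hypothetical function import keyed by module name, field name, and type index.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct FuncImport {
    pub module: String,
    pub name: String,
    pub type_idx: u32,
}

/// Returns the kept imports and a table where `remap[old_func_idx] = new_func_idx`.
/// Duplicate imports map to the first identical import; local functions shift
/// down by the number of imports removed.
pub fn dedup_imports(imports: &[FuncImport], local_funcs: u32) -> (Vec<FuncImport>, Vec<u32>) {
    let mut seen: HashMap<&FuncImport, u32> = HashMap::new();
    let mut kept = Vec::new();
    let mut remap = Vec::new();
    for imp in imports {
        let idx = *seen.entry(imp).or_insert_with(|| {
            kept.push(imp.clone());
            (kept.len() - 1) as u32
        });
        remap.push(idx);
    }
    let removed = (imports.len() - kept.len()) as u32;
    let import_count = imports.len() as u32;
    for i in 0..local_funcs {
        // Local functions are numbered after imports, so they shift by `removed`.
        remap.push(import_count + i - removed);
    }
    (kept, remap)
}
```

Comparing by type index is only meaningful if structurally identical types already share one index, which is presumably why this pass sits after type deduplication in the pass order.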
Each transformation has a corresponding formal proof in Rocq (proofs/simplify/FusedOptimization.v):
| Theorem | Status | Pass |
|---|---|---|
| memory_import_dedup_preserves_semantics | Proven | Pass 0a |
| same_memory_collapse_correct | Proven | Pass 0b |
| sp_save_restore_collapse_correct | Proven | Pass 0b |
| null_guard_collapse_correct | Proven | Pass 0b |
| store_discard_collapse_correct | Proven | Pass 0b |
| generalized_same_memory_collapse_correct | Proven | Pass 0b |
| adapter_devirtualization_correct | Proven | Pass 1 |
| devirtualization_preserves_module_semantics | Proven | Pass 1 |
| trivial_call_is_nop | Proven | Pass 2 |
| trivial_call_elimination_preserves_semantics | Proven | Pass 2 |
| type_dedup_preserves_semantics | Proven | Pass 3 |
| type_dedup_idempotent | Proven | Pass 3 |
| dead_function_elim_correct | Proven | Pass 4 |
| dead_function_removal_preserves_semantics | Proven | Pass 4 |
| import_dedup_preserves_semantics | Proven | Pass 5 |
| fused_optimization_correct | Proven | Combined |
| fused_then_standard_correct | Proven | Composition |
All 17 theorems are proven (Qed). Zero Admitted proofs remain.
The proofs rely on six semantic axioms: five justified directly by the WASM spec, plus a generalization of the same-memory axiom to any consistent memory index N:
flowchart TD
subgraph axioms["Semantic Axioms (WASM Spec)"]
A0["same_memory_adapter_equiv\n(Spec §5.4.7: memory.copy same-mem)"]
A1["trivial_adapter_equiv\n(Spec §4.4.8: call semantics)"]
A2["identical_import_equiv\n(Spec §2.5.10: import resolution)"]
A3["trivial_call_nop\n(Spec §4.4.8: void call = no-op)"]
A4["identical_memory_import_equiv\n(Spec §2.5.10: import resolution)"]
A4b["same_memory_adapter_general_equiv\n(Generalized: any consistent memory N)"]
end
subgraph proven["Proven Theorems (Qed)"]
MD["memory_import_dedup_preserves"]
SM["same_memory_collapse_correct"]
SP["sp_save_restore_collapse_correct"]
NG["null_guard_collapse_correct"]
SD["store_discard_collapse_correct"]
GS["generalized_same_memory_collapse_correct"]
AD["adapter_devirtualization_correct"]
TC["trivial_call_is_nop"]
TP["type_dedup_preserves_semantics"]
TI["type_dedup_idempotent"]
DE["dead_function_elim_correct"]
DR["dead_function_removal_preserves"]
IP["import_dedup_preserves_semantics"]
FC["fused_optimization_correct"]
FS["fused_then_standard_correct"]
end
A4 --> MD
A0 --> SM
A0 --> SP
A0 --> NG
A0 --> SD
A4b --> GS
A1 --> AD
A2 --> IP
A3 --> TC
MD --> FC
SM --> FC
SP --> FC
NG --> FC
SD --> FC
GS --> FC
AD --> FC
TC --> FC
TP --> FC
DE --> FC
IP --> FC
FC --> FS
Adapter Devirtualization uses a direct equivalence proof:
- The adapter body local.get 0; ...; local.get N; call T; end reconstructs the parameter stack and delegates to the target
- Operationally equivalent to calling the target directly
Type Deduplication uses structural equality:
- If params(T_i) = params(T_j) and results(T_i) = results(T_j), any instruction referencing T_i behaves identically when referencing T_j
Dead Function Elimination: Unreachable code cannot affect observable behavior.
Import Deduplication: Same module + name + type resolve to the same external binding per the WebAssembly specification.
flowchart TD
Input["Input WASM Module (from meld)"]
Input --> Fused
subgraph Fused["Fused Component Optimization"]
F0a["0a. Memory import dedup"]
F0b["0b. Same-memory adapter collapse"]
F1["1. Adapter devirtualization"]
F2["2. Trivial call elimination"]
F3["3. Type deduplication"]
F4["4. Dead function elimination"]
F5["5. Import deduplication"]
F0a --> F0b --> F1 --> F2 --> F3 --> F4 --> F5
end
Fused --> Standard
subgraph Standard["loom 12-Phase Pipeline"]
S1["1. Constant folding (ISLE)"]
S2["2. Advanced instructions"]
S3["3. Local simplification"]
S4["4. Dead code elimination"]
S5["5. Code folding"]
S6["6. Loop invariant motion"]
S7["7. Branch simplification"]
S8["8. Added constant optim"]
S9["9. Z3 verification"]
S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7 --> S8 --> S9
end
Standard --> Output["Optimized WASM Module"]
Why before? Adapter devirtualization creates dead code (unused adapters) and simplifies call graphs, which benefits all subsequent passes. Type and import deduplication reduce index space sizes, improving analysis precision.
# Step 1: Fuse components with meld
meld fuse component_a.wasm component_b.wasm -o fused.wasm
# Step 2: Optimize fused module with loom
loom optimize fused.wasm -o optimized.wasm
loom automatically detects fusion artifacts and applies the fused optimizer.
use loom_core::fused_optimizer::{optimize_fused_module, FusedOptimizationStats};
use loom_core::Module;
let mut module: Module = parse_wasm(&bytes)?;
// Apply fused-specific optimizations first
let stats: FusedOptimizationStats = optimize_fused_module(&mut module)?;
println!("Memory imports deduplicated: {}", stats.memory_imports_deduplicated);
println!("Same-memory adapters collapsed: {}", stats.same_memory_adapters_collapsed);
println!("Adapters devirtualized: {}", stats.calls_devirtualized);
println!("Trivial calls eliminated: {}", stats.trivial_calls_eliminated);
println!("Types deduplicated: {}", stats.types_deduplicated);
println!("Dead functions eliminated: {}", stats.dead_functions_eliminated);
println!("Imports deduplicated: {}", stats.imports_deduplicated);
// Then apply standard loom optimizations
optimize_module(&mut module)?;
| Feature | Status | Coverage |
|---|---|---|
| Memory import deduplication | Done | Identical memory imports (aliased host memory) |
| Same-memory adapter collapse | Done | Any module with same-memory realloc+copy adapters (any consistent memory index) |
| Trivial adapter devirtualization | Done | All direct adapters |
| Trivial call elimination | Done | () -> () no-op functions |
| Function type deduplication | Done | Basic types (skips GC) |
| Dead function elimination | Done | With element segment parsing |
| Function import deduplication | Done | Function imports only |
| Cross-memory adapter diagnostics | Done | Detection and counting (not collapsed) |
| Correctness proofs | Done | All 17 theorems proven (Qed) |
flowchart TD
subgraph tier1["Tier 1: Next Optimizations"]
T1["Scalar return elision for cross-memory adapters"]
end
subgraph tier2["Tier 2: Advanced"]
A1["String transcoding detection"]
A2["Multi-memory adapter inlining"]
end
tier1 --> tier2
| Feature | Priority | Impact | Effort |
|---|---|---|---|
| Scalar return elision (cross-memory) | Medium | Avoids copy-back for scalar returns in cross-memory adapters | Medium |
| String transcoding detection | Low | Rare but high savings when hit | Very High |
| Multi-memory adapter inlining | Low | Reduces trampoline overhead in multi-memory mode | High |
Decision: Detect and eliminate adapter patterns in loom rather than in meld.
Rationale: Meld must generate adapters for correctness (they handle Canonical ABI boundary crossing). Only after fusion is complete can we determine which adapters are trivial and can be bypassed. loom, as the optimizer, is the natural place for this analysis.
Decision: Parse element sections to extract indirect call targets instead of conservatively marking all functions as live.
Rationale: Element segments populate indirect call tables (call_indirect). By parsing element sections with wasmparser, we extract the exact function references (both ElementItems::Functions and ref.func in ElementItems::Expressions). Only those functions are marked as potentially callable via indirect calls. After DCE removes dead functions, element sections are rebuilt with remapped indices using wasm_encoder::ElementSection. Falls back to conservative behavior for unsupported element kinds.
Decision: Run import deduplication inside the fused optimizer, before the standard pipeline, to normalize the index space.
Rationale: Every import affects all function indices (local functions are numbered after imports). Removing duplicate imports shifts indices, which is cleaner to do once before the standard pipeline runs multiple analysis passes.
Decision: Run memory import deduplication (Pass 0a) before same-memory adapter collapse (Pass 0b).
Rationale: When meld fuses components sharing the same host memory, it emits multiple identical memory imports. While Pass 0b now handles multi-memory modules (collapsing adapters that use a consistent memory index), running Pass 0a first normalizes memory indices and may expose additional same-memory adapters. For example, adapters using memory.copy {1, 1} that are aliased to memory 0 become memory.copy {0, 0} after dedup, simplifying the index space for all subsequent passes.
Decision: In Pass 5, only deduplicate function imports, not table or global imports. Memory imports are handled separately by Pass 0a under its all-or-nothing rule.
Rationale: Memory, table, and global imports have richer semantics (mutability, limits, element types). Deduplicating them requires deeper analysis to verify they are truly identical, which is why Pass 0a fires only when every memory import is identical and no local memories exist. Function imports are the most common duplicates after fusion and are safe to merge when module+name+type match.