
perf(layout): paragraph wrapping speedups (F1 running-width, F2 binary-search) + benchmark regression gate#142

Merged
DemchaAV merged 8 commits into develop from perf/engine-pipeline on Jun 8, 2026

@DemchaAV DemchaAV commented Jun 8, 2026

Summary

Consolidates the private performance cycle: two verified canonical-layout
speedups plus the benchmark regression system that proves them. Rendering is
byte-identical (1144 tests, no snapshot/baseline rewritten); ./mvnw verify -pl .
green.

Engine changes (TextFlowSupport)

  • F1: wrapParagraph running-width. The greedy wrapper keeps a running line
    width and measures each token once, instead of re-measuring the whole accumulated
    line on every token. Deterministic counter probe: long-text measured characters
    −89% (291,324 → 32,457). This removes the O(line-length × tokens) measured-character
    work, and the per-glyph sanitize/encode it triggered, from paragraph layout.
    Byte-identical.
  • F2: fitCharacters binary search. Long unbreakable tokens (URLs, IDs, no-space
    runs, narrow columns) now break via binary search instead of re-measuring every
    growing prefix one character at a time. Worst-case −44% wall-clock (same-session A/B
    on a new long-token scenario). On the 600-char probe, measured chars drop −80%
    (36,317 → 7,114) and width calls fall from 652 to 97. The fit predicate is
    monotonic, so the search returns the same break index and output stays byte-identical.
  • F1b: StringBuilder line assembly. Byte-identical cleanup; assembles each
    wrapped line in a reused StringBuilder rather than currentLine + token. Removes
    a latent O(line²) char-copy on pathologically wide/unwrapped lines. No measurable
    steady-state perf change (warm A/B: 719.8 → 719.8 KB), so it is kept as a defensive
    cleanup, not a perf claim.

Benchmark regression system (benchmarks module — not part of the published library)

  • BenchmarkVerdictTool (+test): compares a current-speed run to the committed
    baseline (baselines/current-speed-full.json); classifies each scenario
    improved/neutral/regressed. Hard gate = average latency only; peak heap is
    advisory (GC-timing noisy). A single run is advisory; the hard gate needs a median.
  • CountingTextMeasurementSystem + MeasurementCountBenchmark (+test):
    deterministic measurement-call counts and per-compile allocation bytes
    (ThreadMXBean) — proof independent of wall-clock/GC noise. The probe warms up
    before its allocation window so Alloc KB reflects steady state.
  • CurrentSpeedBenchmark: new long-token worst-case scenario.
  • scripts/run-benchmarks.ps1: 11-verdict-current-speed gate step
    (skippable via -SkipVerdict).
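
As an illustration of the counting probe described above, here is a minimal sketch of a measurement wrapper that tallies width calls and measured characters. It is not the actual CountingTextMeasurementSystem: the class name CountingMeasurerSketch and the use of ToDoubleFunction<String> as the measurement interface are assumptions made for this example.

```java
import java.util.function.ToDoubleFunction;

/** Hypothetical counting decorator around a text-width function. */
final class CountingMeasurerSketch implements ToDoubleFunction<String> {
    private final ToDoubleFunction<String> delegate;
    private long calls;
    private long measuredChars;

    CountingMeasurerSketch(ToDoubleFunction<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public double applyAsDouble(String text) {
        calls++;                        // one width call
        measuredChars += text.length(); // characters handed to the measurer
        return delegate.applyAsDouble(text);
    }

    long calls() { return calls; }

    long measuredChars() { return measuredChars; }

    /** Clears both counters, e.g. between scenarios. */
    void reset() {
        calls = 0;
        measuredChars = 0;
    }
}
```

Because the counters depend only on the inputs passed to the measurer, two runs of the same layout code should report the same numbers regardless of GC or JIT timing.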

Heap note (honest)

The audit flagged a "~40 MB to lay out one paragraph" headline. Investigation proved
this was a JVM cold-start artifact (class-load / JIT / static-init on the probe's
first compile), not a layout cost: warm steady-state for the same document is
~0.65 MB (≈56× less), and allocation scales sub-linearly. There is no heap bug; the
probe was fixed to warm up so its numbers are trustworthy.
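
The warm-up fix can be sketched roughly as follows, assuming a HotSpot JVM where com.sun.management.ThreadMXBean is available. The class name WarmAllocationProbeSketch and the Runnable-based shape are assumptions for illustration; the real probe lives in MeasurementCountBenchmark and may be structured differently.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical warmed per-thread allocation probe. */
final class WarmAllocationProbeSketch {

    /** Runs warm-up passes, then returns bytes allocated by one measured pass. */
    static long allocatedBytes(Runnable work, int warmupIterations) {
        // HotSpot-specific extension of java.lang.management.ThreadMXBean.
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        if (!bean.isThreadAllocatedMemorySupported()) {
            throw new UnsupportedOperationException("thread allocation counters unavailable");
        }
        // Warm-up: absorbs class-load, JIT and static-init allocation.
        for (int i = 0; i < warmupIterations; i++) {
            work.run();
        }
        long threadId = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(threadId);
        work.run(); // measured pass only
        return bean.getThreadAllocatedBytes(threadId) - before;
    }
}
```

Calling allocatedBytes(compileOnce, 5) would mirror the five warm-up iterations mentioned in the commit notes, where compileOnce is a placeholder for a single document compile.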

Net

| Change | Verified result |
| --- | --- |
| F1 | measured chars −89% (exact) → CPU on text-heavy docs |
| F2 | long-token −44% wall-clock, measured work −80% |
| F1b | byte-identical cleanup (no measurable perf) |
| Tooling | regression gate + deterministic probe (now warm/honest) |

One optimization family per commit; each delta is attributable. Base: develop.

DemchaAV added 8 commits June 8, 2026 12:24
…probe

BenchmarkVerdictTool classifies a current-speed run vs the committed baseline (improved/neutral/regressed) and exits non-zero on a regression beyond the noise band. MeasurementCountBenchmark + CountingTextMeasurementSystem capture deterministic textWidth call counts and per-compile allocation bytes (ThreadMXBean) for proving algorithmic/allocation changes. run-benchmarks.ps1 gains the 11-verdict-current-speed gate step (skippable via -SkipVerdict). Adds baselines/current-speed-full.json (full-profile median). Benchmark-module only; not part of the published library.
…line prefix

The greedy line wrapper measured textWidth(currentLine + nextToken) on every token, re-measuring the whole accumulated line - O(line-length x tokens) measured characters plus the per-glyph sanitize/encode it triggers. Keep a running line width and measure each token once instead; line starts re-measure to pin FP drift. Glyph advances are additive (no kerning) and EPS=1e-6 absorbs FP, so break points are unchanged - rendering is byte-identical (1144 tests + all layout/visual snapshots pass).

Probe: long-text measured characters 291,324 -> 32,457 (~9x fewer); same-session A/B (full, Repeat 7): proposal -57% time / +131% throughput. No API or behaviour change. Refs audit finding F1.
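
The difference between the two strategies can be sketched as below. This is a simplified illustration, not the TextFlowSupport code: the class name, the pre-split token list, the single-space joiner, and the ToDoubleFunction<String> measurer are all assumptions, and it relies on glyph advances being additive as the commit message states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical greedy wrapper in two variants; tokens are assumed non-empty. */
final class RunningWidthWrapSketch {
    static final double EPS = 1e-6;

    /** Baseline: re-measures the whole candidate line for every token. */
    static List<String> wrapRemeasure(List<String> tokens, double maxWidth,
                                      ToDoubleFunction<String> width) {
        List<String> lines = new ArrayList<>();
        String current = "";
        for (String token : tokens) {
            String candidate = current.isEmpty() ? token : current + " " + token;
            if (!current.isEmpty() && width.applyAsDouble(candidate) > maxWidth + EPS) {
                lines.add(current);
                current = token;
            } else {
                current = candidate;
            }
        }
        if (!current.isEmpty()) lines.add(current);
        return lines;
    }

    /** Running width: each token is measured once and widths are summed. */
    static List<String> wrapRunning(List<String> tokens, double maxWidth,
                                    ToDoubleFunction<String> width) {
        List<String> lines = new ArrayList<>();
        double spaceWidth = width.applyAsDouble(" ");
        String current = "";
        double running = 0.0;
        for (String token : tokens) {
            double tokenWidth = width.applyAsDouble(token);
            if (!current.isEmpty() && running + spaceWidth + tokenWidth > maxWidth + EPS) {
                lines.add(current);
                current = "";
            }
            if (current.isEmpty()) {
                current = token;
                running = tokenWidth; // line start: reset from a fresh measurement
            } else {
                current = current + " " + token;
                running += spaceWidth + tokenWidth;
            }
        }
        if (!current.isEmpty()) lines.add(current);
        return lines;
    }
}
```

With an additive width function both variants should produce the same lines, while the second hands far fewer characters to the measurer.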
…advisory

peakHeapMb is a Runtime used-heap delta: GC-timing dependent and very noisy (observed 48-170 MB across repeats of identical code), so it false-failed the gate on invoice-template (heap +18.7%) even though that run was 15% faster on time. BenchmarkVerdictTool now hard-gates on average latency only; peakHeapMb is reported as advisory (still shown, never fails the build). The deterministic heap signal stays in MeasurementCountBenchmark (per-compile allocation bytes).

run-benchmarks.ps1: step 11 runs the verdict as advisory for single runs (Repeat 1) and hard-gates only for medians (-Repeat >= 2), since one run is too noisy to gate against a median baseline. Unit test + CHANGELOG updated.
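
The gating rule described here might look roughly like this. It is a sketch only: the class and method names are hypothetical, and the noise band is passed in as a parameter because the PR does not state the value BenchmarkVerdictTool uses.

```java
/** Hypothetical latency verdict with an advisory mode for single runs. */
final class VerdictSketch {
    enum Verdict { IMPROVED, NEUTRAL, REGRESSED }

    /** Classifies average latency against the baseline using a relative noise band. */
    static Verdict classify(double baselineAvgMs, double currentAvgMs, double noiseBand) {
        double delta = (currentAvgMs - baselineAvgMs) / baselineAvgMs;
        if (delta > noiseBand) return Verdict.REGRESSED;
        if (delta < -noiseBand) return Verdict.IMPROVED;
        return Verdict.NEUTRAL;
    }

    /** Non-zero only for a latency regression on a median run (repeat >= 2). */
    static int exitCode(Verdict latencyVerdict, int repeat) {
        boolean hardGate = repeat >= 2; // a single run is advisory
        return (hardGate && latencyVerdict == Verdict.REGRESSED) ? 1 : 0;
    }
}
```

Peak heap is deliberately absent from exitCode, matching the decision to report it without letting it fail the build.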
…uadratic re-measure)

fitCharacters re-measured text.substring(0,index) for every index when breaking a long unbreakable token - O(n) width calls and O(n^2) measured characters. The fit predicate width(prefix) <= maxWidth is monotonic in prefix length, so binary-search the break index instead: it returns the same lastFitting (byte-identical wrapping) in O(log n) width calls.

Probe on a 600-char token: width calls 652 -> 97, measured chars 36,317 -> 7,114, alloc ~1.5MB -> ~0.8MB. long-text (F1 path) and tables untouched; 1144 tests pass with no snapshot drift. Refs audit finding F2.
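
A minimal sketch of the monotonic-predicate search follows, shown next to the linear scan it replaces. The names and the ToDoubleFunction<String> measurer are assumptions for illustration; the real fitCharacters may differ, for example in how it treats surrogate pairs or a token whose first character already overflows.

```java
import java.util.function.ToDoubleFunction;

/** Hypothetical prefix-fit helpers; both return the longest fitting prefix length. */
final class FitCharactersSketch {
    static final double EPS = 1e-6;

    /** Linear baseline: measures every growing prefix until one overflows. */
    static int fitLinear(String text, double maxWidth, ToDoubleFunction<String> width) {
        int lastFitting = 0;
        for (int i = 1; i <= text.length(); i++) {
            if (width.applyAsDouble(text.substring(0, i)) > maxWidth + EPS) break;
            lastFitting = i;
        }
        return lastFitting;
    }

    /** Binary search; valid because "prefix fits" is monotonic in prefix length. */
    static int fitBinary(String text, double maxWidth, ToDoubleFunction<String> width) {
        int lo = 0;             // the empty prefix always fits
        int hi = text.length(); // largest candidate length
        while (lo < hi) {
            int mid = lo + (hi - lo + 1) / 2; // upper middle so the range always shrinks
            if (width.applyAsDouble(text.substring(0, mid)) <= maxWidth + EPS) {
                lo = mid;       // mid fits: answer is mid or larger
            } else {
                hi = mid - 1;   // mid overflows: answer is smaller
            }
        }
        return lo;
    }
}
```

Both methods return the same index whenever width never decreases as the prefix grows, which is the property the commit relies on for byte-identical output.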
40 paragraphs with ~520-char unbreakable URL/ID tokens that overflow the line and force splitLongToken/fitCharacters. Makes the F2 worst case visible (same-session A/B: -44% avg, 14.47 -> 8.06 ms on this scenario) and guards against re-introducing quadratic long-token wrapping.
…n string copy)

wrapParagraph concatenated Strings token-by-token (currentLine + token), re-copying the whole growing line each token and producing a throwaway String per step. Accumulate in a reused StringBuilder instead; the character sequence is identical so wrapping stays byte-for-byte the same (1144 tests, snapshots clean). Measured effect is small on typical text (~1% less compile allocation on long-text, lines bounded by column width) but it removes a latent O(line-length^2) copy on very wide/unwrapped lines.
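
To make the latent quadratic copy concrete, the sketch below tallies how many characters repeated concatenation copies and shows the reused-builder alternative. The names are hypothetical and the tally is a simple model of String concatenation (each step copies the whole new line), not a measurement of the library.

```java
import java.util.List;

/** Hypothetical comparison of line assembly by concatenation and by reused builder. */
final class LineAssemblySketch {

    /** Models currentLine + token: each step copies the entire line built so far. */
    static long copiedCharsWithConcat(List<String> tokens) {
        String line = "";
        long copied = 0;
        for (String token : tokens) {
            line = line + token;     // new String holding all previous chars plus token
            copied += line.length(); // chars written into the throwaway String
        }
        return copied;
    }

    /** Reused builder: each token's characters are appended once. */
    static String assemble(StringBuilder reused, List<String> tokens) {
        reused.setLength(0);         // reset contents, keep the backing array
        for (String token : tokens) {
            reused.append(token);
        }
        return reused.toString();
    }
}
```

For lines bounded by the column width the difference is small, which is consistent with the warm A/B showing no measurable change.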
The probe measured each scenario once, and the first scenario (long-text) in a fresh JVM carried ~36 MB of one-time class-load/JIT/static-init allocation -- a JVM artifact, not a layout cost (verified: cold first compile 36.6 MB vs warm 0.65 MB for the same document; layout alloc scales sub-linearly). Warm up 5 iterations before the measured pass so Alloc KB reflects steady-state per-document allocation; measurement-count columns are exact regardless. Also drops the F1b CHANGELOG perf claim -- a warm A/B shows no measurable steady-state allocation change (719.8 = 719.8 KB), so F1b stays as a byte-identical latent-O(n^2) cleanup, not a perf win.
@DemchaAV DemchaAV merged commit 646a4ac into develop Jun 8, 2026
11 checks passed
@DemchaAV DemchaAV deleted the perf/engine-pipeline branch June 8, 2026 15:04