perf(layout): paragraph wrapping speedups (F1 running-width, F2 binary-search) + benchmark regression gate#142
Merged
Merged
Conversation
…probe BenchmarkVerdictTool classifies a current-speed run vs the committed baseline (improved/neutral/regressed) and exits non-zero on a regression beyond the noise band. MeasurementCountBenchmark + CountingTextMeasurementSystem capture deterministic textWidth call counts and per-compile allocation bytes (ThreadMXBean) for proving algorithmic/allocation changes. run-benchmarks.ps1 gains the 11-verdict-current-speed gate step (skippable via -SkipVerdict). Adds baselines/current-speed-full.json (full-profile median). Benchmark-module only; not part of the published library.
…line prefix The greedy line wrapper measured textWidth(currentLine + nextToken) on every token, re-measuring the whole accumulated line - O(line-length x tokens) measured characters plus the per-glyph sanitize/encode it triggers. Keep a running line width and measure each token once instead; line starts re-measure to pin FP drift. Glyph advances are additive (no kerning) and EPS=1e-6 absorbs FP, so break points are unchanged - rendering is byte-identical (1144 tests + all layout/visual snapshots pass). Probe: long-text measured characters 291,324 -> 32,457 (~9x fewer); same-session A/B (full, Repeat 7): proposal -57% time / +131% throughput. No API or behaviour change. Refs audit finding F1.
…advisory peakHeapMb is a Runtime used-heap delta - GC-timing dependent and very noisy (observed 48-170 MB across repeats of identical code), so it false-failed the gate on invoice-template (heap +18.7%) even though that run was -15% faster on time. BenchmarkVerdictTool now hard-gates on average latency only; peakHeapMb is reported as advisory (still shown, never fails the build). The deterministic heap signal stays in MeasurementCountBenchmark (per-compile allocation bytes). run-benchmarks.ps1: step 11 runs the verdict as advisory for single runs (Repeat 1) and hard-gates only for medians (-Repeat >= 2), since one run is too noisy to gate against a median baseline. Unit test + CHANGELOG updated.
…uadratic re-measure) fitCharacters re-measured text.substring(0,index) for every index when breaking a long unbreakable token - O(n) width calls and O(n^2) measured characters. The fit predicate width(prefix) <= maxWidth is monotonic in prefix length, so binary-search the break index instead: it returns the same lastFitting (byte-identical wrapping) in O(log n) width calls. Probe on a 600-char token: width calls 652 -> 97, measured chars 36,317 -> 7,114, alloc ~1.5MB -> ~0.8MB. long-text (F1 path) and tables untouched; 1144 tests pass with no snapshot drift. Refs audit finding F2.
40 paragraphs with ~520-char unbreakable URL/ID tokens that overflow the line and force splitLongToken/fitCharacters. Makes the F2 worst case visible (same-session A/B: -44% avg, 14.47 -> 8.06 ms on this scenario) and guards against re-introducing quadratic long-token wrapping.
…n string copy) wrapParagraph concatenated Strings token-by-token (currentLine + token), re-copying the whole growing line each token and producing a throwaway String per step. Accumulate in a reused StringBuilder instead; the character sequence is identical so wrapping stays byte-for-byte the same (1144 tests, snapshots clean). Measured effect is small on typical text (~1% less compile allocation on long-text, lines bounded by column width) but it removes a latent O(line-length^2) copy on very wide/unwrapped lines.
The probe measured each scenario once, and the first scenario (long-text) in a fresh JVM carried ~36 MB of one-time class-load/JIT/static-init allocation -- a JVM artifact, not a layout cost (verified: cold first compile 36.6 MB vs warm 0.65 MB for the same document; layout alloc scales sub-linearly). Warm up 5 iterations before the measured pass so Alloc KB reflects steady-state per-document allocation; measurement-count columns are exact regardless. Also drops the F1b CHANGELOG perf claim -- a warm A/B shows no measurable steady-state allocation change (719.8 = 719.8 KB), so F1b stays as a byte-identical latent-O(n^2) cleanup, not a perf win.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Consolidates the private performance cycle: two verified canonical-layout
speedups plus the benchmark regression system that proves them. Rendering is
byte-identical (1144 tests, no snapshot/baseline rewritten);
./mvnw verify -pl .green.
Engine changes (
TextFlowSupport)wrapParagraphrunning-width. The greedy wrapper keeps a running linewidth and measures each token once, instead of re-measuring the whole accumulated
line on every token. Deterministic counter probe: long-text measured characters
−89% (291,324 → 32,457). Removes O(line-length × tokens) measured-char work — and
the per-glyph sanitize/encode it triggered — from paragraph layout. Byte-identical.
fitCharactersbinary search. Long unbreakable tokens (URLs/IDs/no-spaceruns, narrow columns) break via binary search instead of re-measuring every growing
prefix one char at a time. Worst-case −44% wall-clock (same-session A/B on a new
long-tokenscenario) and −80% measured chars (652 → 97 width calls). The fitpredicate is monotonic, so the search returns the same break index → byte-identical.
StringBuilderline assembly. Byte-identical cleanup; assembles eachwrapped line in a reused
StringBuilderrather thancurrentLine + token. Removesa latent O(line²) char-copy on pathologically wide/unwrapped lines. No measurable
steady-state perf (warm A/B: 719.8 → 719.8 KB) — kept as a defensive cleanup, not
a perf claim.
Benchmark regression system (benchmarks module — not part of the published library)
BenchmarkVerdictTool(+test): compares a current-speed run to the committedbaseline (
baselines/current-speed-full.json); classifies each scenarioimproved/neutral/regressed. Hard gate = average latency only; peak heap is
advisory (GC-timing noisy). A single run is advisory; the hard gate needs a median.
CountingTextMeasurementSystem+MeasurementCountBenchmark(+test):deterministic measurement-call counts and per-compile allocation bytes
(
ThreadMXBean) — proof independent of wall-clock/GC noise. The probe warms upbefore its allocation window so
Alloc KBreflects steady state.CurrentSpeedBenchmark: newlong-tokenworst-case scenario.scripts/run-benchmarks.ps1:11-verdict-current-speedgate step(skippable via
-SkipVerdict).Heap note (honest)
The audit flagged a "~40 MB to lay out one paragraph" headline. Investigation proved
this was a JVM cold-start artifact (class-load / JIT / static-init on the probe's
first compile), not a layout cost: warm steady-state for the same document is
~0.65 MB (≈56× less), and allocation scales sub-linearly. There is no heap bug; the
probe was fixed to warm up so its numbers are trustworthy.
Net
One optimization family per commit; each delta is attributable. Base:
develop.