Benchmark Workflow

This document explains the local and CI benchmark flow used in GraphCompose.

The short version is:

  • scripts/run-benchmarks.ps1 is the normal local entry point
  • CurrentSpeedBenchmark has two profiles: smoke and full
  • current-speed diffs are only valid between reports from the same profile
  • repeated local runs should be compared via median aggregation, not by eyeballing one lucky run

If you are changing layout, pagination, render ordering, PDF session lifetime, or benchmark tooling, read this file together with README.md, architecture.md, and CONTRIBUTING.md.

Core terms

  • suite: one benchmark family such as current-speed or comparative
  • profile: a current-speed mode. Today that means smoke or full
  • run: one timestamped JSON/CSV result written as run-<timestamp>.json
  • aggregate: a median report built from several repeated local runs
  • compatible pair: two reports that can be diffed safely. For current-speed, compatibility means the same profile

The local benchmark entry point

The default local workflow is:

powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1

That wrapper is intentionally opinionated. It does more than just invoke one Java main class.

Pipeline stages

The script prints numbered sections so you can map console output to the pipeline:

  1. 01-build-classpath: builds the test classpath once and writes target/benchmark.classpath.
  2. 02-current-speed: runs CurrentSpeedBenchmark in the selected profile.
  3. 03-comparative: runs the GraphCompose canonical vs iText 5 vs JasperReports comparison.
  4. 04-core-engine: runs GraphComposeBenchmark.
  5. 05-full-cv: runs FullCvBenchmark.
  6. 06-scalability: runs the thread-scaling throughput benchmark.
  7. 07-stress: runs the concurrent stability stress test.
  8. 08-endurance: optional; runs only when -IncludeEndurance is provided.
  9. 09-diff-current-speed: diffs the newest compatible current-speed reports.
  10. 10-diff-comparative: diffs the two newest comparative reports.
  11. 11-verdict-current-speed: judges the newest current-speed median against the committed baseline (baselines/current-speed-full.json). Hard gate on average latency; peak heap is advisory. See Refreshing the committed baseline.

Each step writes a dedicated log file under target/benchmark-runs/<timestamp>/logs/, and the wrapper mirrors that log back to the console after the step finishes.

Current-speed profiles

CurrentSpeedBenchmark supports two intended usage modes:

  • smoke: bounded latency-oriented checks for pull requests and quick local spot checks. Defaults: 30 warmup + 100 measurement iterations per scenario, no throughput pass. Smoke is now sized so that the JIT reaches a steady C1/C2 state and the p95 calculation has enough samples to interpolate between order statistics rather than collapsing to the maximum observed sample.
  • full: wider warmup and measurement windows (12 warmup + 40 measurement) plus throughput coverage for local investigation and scheduled runs.

Use the same profile when comparing results. A smoke report and a full report are different experiments, not two samples of the same one.

Methodology notes (v1.3)

  • Every scenario triggers System.gc() and a 50 ms sleep between warmup and measurement so the first measured iteration does not pay for warmup-era garbage. Variance dropped from 10–25 % to 2–5 % between runs on a developer laptop.
  • Percentiles use linear interpolation between order statistics (rank = (n-1) * p). Earlier versions returned sorted[floor], which made p95 == max for small sample counts.
  • A "stage breakdown" table prints alongside the latency table for every template scenario (compose / layout / render / total median ms). Use it when attributing regressions to engine layout vs PDFBox serialization — PDFBox typically takes 35–68 % of the end-to-end timing on these scenarios.
  • The performance gate (-Dgraphcompose.benchmark.enforceGate=true) now uses thresholds calibrated at ~3× the observed avg. That leaves room for CI machine variance, so this gate only trips on gross regressions, roughly when a scenario's average latency triples.
  • peakHeapMb reports the heap delta over the post-warmup baseline rather than absolute used heap. The metric is closer to per-iteration allocation pressure than to total live data.
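The interpolation rule from the percentile note can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the class name PercentileSketch, the method name, and the argument validation are all assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical helper; names and validation are assumptions, not GraphCompose API.
final class PercentileSketch {

    /** Returns the p-th percentile (p in [0, 1]) using rank = (n - 1) * p. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be within [0, 1]");
        }
        double[] sorted = samples.clone();   // do not mutate the caller's array
        Arrays.sort(sorted);

        double rank = (sorted.length - 1) * p;
        int lower = (int) Math.floor(rank);
        int upper = (int) Math.ceil(rank);
        double fraction = rank - lower;

        // Linear interpolation between the two neighbouring order statistics.
        return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
    }
}
```

With 12 samples and p = 0.95 the rank is 10.45, so the result falls between the 11th and 12th sorted samples instead of equalling the maximum.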

Examples:

powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -CurrentSpeedProfile smoke
powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -CurrentSpeedProfile full

Diff selection rules

Current-speed diffs

For current-speed reports, the wrapper now selects the newest pair that matches the profile of the latest run.

That means:

  • if the newest run is full, the script looks for the newest previous full run
  • if the newest run is smoke, the script looks for the newest previous smoke run
  • if there is no second run with that profile yet, the diff step is skipped instead of failing the whole benchmark run

This mirrors the rule enforced by BenchmarkDiffTool: current-speed reports with different profiles must not be diffed.
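The selection rule above can be pictured with a short sketch. It assumes each report carries a lexicographically sortable timestamp and a profile string; the Report class and the method name are hypothetical and do not reflect how BenchmarkDiffTool or the wrapper are actually implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical model of the pairing rule; not the real BenchmarkDiffTool code.
final class CompatiblePairSketch {

    static final class Report {
        final String timestamp; // assumed to sort lexicographically, oldest first
        final String profile;   // "smoke" or "full"

        Report(String timestamp, String profile) {
            this.timestamp = timestamp;
            this.profile = profile;
        }
    }

    /** Returns {previous, latest} with matching profiles, or empty when the diff should be skipped. */
    static Optional<Report[]> newestCompatiblePair(List<Report> reports) {
        List<Report> newestFirst = new ArrayList<>(reports);
        newestFirst.sort(Comparator.comparing((Report r) -> r.timestamp).reversed());
        if (newestFirst.isEmpty()) {
            return Optional.empty();
        }
        Report latest = newestFirst.get(0);
        // Walk back in time until a report with the same profile appears.
        for (int i = 1; i < newestFirst.size(); i++) {
            Report candidate = newestFirst.get(i);
            if (candidate.profile.equals(latest.profile)) {
                return Optional.of(new Report[] {candidate, latest});
            }
        }
        return Optional.empty(); // no compatible pair yet: skip instead of failing
    }
}
```

Returning an empty Optional rather than throwing matches the documented behaviour of skipping the diff step when no second run with that profile exists.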

Comparative diffs

Comparative reports do not have the same profile split, so the wrapper simply diffs the two newest comparative runs.

Repeated local runs

When you pass -Repeat N, the wrapper reruns:

  • current-speed
  • comparative

After that, it writes median aggregate reports and diffs median-vs-median on later runs. This is the preferred mode for local decision-making because it reduces noise from GC, background processes, JIT warmup differences, and filesystem activity.

Example:

powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -CurrentSpeedProfile full -Repeat 5
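To illustrate why a median aggregate is more robust than a single run, here is a minimal sketch that reduces several runs to one per-scenario median. The map-based report shape, the class name, and the scenario names in the usage note are assumptions for illustration; the real aggregate reports contain more fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical aggregation sketch; each run maps a scenario name to its average latency in ms.
final class MedianAggregateSketch {

    static double median(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("at least one value is required");
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        // Odd count: middle element. Even count: mean of the two middle elements.
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    static Map<String, Double> aggregate(List<Map<String, Double>> runs) {
        Map<String, List<Double>> byScenario = new TreeMap<>();
        for (Map<String, Double> run : runs) {
            for (Map.Entry<String, Double> entry : run.entrySet()) {
                byScenario.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                          .add(entry.getValue());
            }
        }
        Map<String, Double> medians = new TreeMap<>();
        byScenario.forEach((scenario, values) -> medians.put(scenario, median(values)));
        return medians;
    }
}
```

If three runs of a hypothetical "invoice" scenario report 10 ms, 50 ms, and 12 ms, the aggregate is 12 ms: the 50 ms outlier, perhaps caused by a GC pause or background load, does not move the result the way it would move a mean.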

Recommended local workflows

Quick spot check before a small change

powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -CurrentSpeedProfile smoke

Normal local investigation

powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -CurrentSpeedProfile full

Safer local comparison after a performance-sensitive change

powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -CurrentSpeedProfile full -Repeat 5

When comparing two branches, run a clean compile on both worktrees before the benchmark wrapper. This prevents stale target/classes from making one branch look faster or slower than the code that is actually checked out.

.\mvnw.cmd -B -ntp clean test-compile

Run benchmarks but skip diffs

powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -SkipDiff

Open the generated summary and benchmark folder after the run

powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -OpenResults

Measuring the impact of an engine change

If you are changing the engine (layout, pagination, render ordering, PDF session, text measurement, fonts) and want to see how that moves performance, pick the view that fits, cheapest first:

  • "Did I regress?" — gate against the committed baseline. Run a median and let the 11-verdict-current-speed step score each scenario IMPROVED / NEUTRAL / REGRESSED against baselines/current-speed-full.json (hard gate: average latency ±10%, non-zero exit on a regression):

    powershell -ExecutionPolicy Bypass -File .\scripts\run-benchmarks.ps1 -CurrentSpeedProfile full -Repeat 5
  • "What exactly moved?" — A/B your branch against its base (any OS). Commit your change, then compare it to develop with the A/B scripts (see A/B comparison between two branches). Both sides are rebuilt and benchmarked, with a per-scenario delta:

    ./scripts/ab-bench.sh -a develop -b my/engine-change -r 5

If a change is meant to improve performance and the gate confirms it, refresh the baseline so the gate ratchets down (see Refreshing the committed baseline). Treat laptop deltas below roughly 5-10% as inconclusive, and re-run on the final checkout before citing a number.
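The IMPROVED / NEUTRAL / REGRESSED scoring against the baseline can be pictured with the sketch below. It assumes the verdict is derived from the relative change in average latency and that a delta exactly on the tolerance counts as NEUTRAL; both points, along with the class and method names, are assumptions rather than a description of the verdict tool's internals.

```java
// Hypothetical scoring sketch; the real verdict tool may treat boundaries differently.
final class VerdictSketch {

    enum Verdict { IMPROVED, NEUTRAL, REGRESSED }

    /**
     * @param baselineAvgMs average latency recorded in the committed baseline
     * @param currentAvgMs  average latency of the newest median report
     * @param tolerance     relative band treated as noise, e.g. 0.10 for ±10%
     */
    static Verdict judge(double baselineAvgMs, double currentAvgMs, double tolerance) {
        if (baselineAvgMs <= 0.0) {
            throw new IllegalArgumentException("baseline latency must be positive");
        }
        double relativeDelta = (currentAvgMs - baselineAvgMs) / baselineAvgMs;
        if (relativeDelta > tolerance) {
            return Verdict.REGRESSED;   // slower than the band allows
        }
        if (relativeDelta < -tolerance) {
            return Verdict.IMPROVED;    // faster than the band allows
        }
        return Verdict.NEUTRAL;         // inside the noise band
    }
}
```

With a 100 ms baseline and a ±10% band, 115 ms scores REGRESSED, 105 ms scores NEUTRAL, and 85 ms scores IMPROVED.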

A/B comparison between two branches

The wrappers above benchmark whatever is currently checked out. To answer "is branch B faster or slower than branch A?" fairly on a noisy laptop, use the dedicated A/B scripts. They interleave the two branches (A,B,A,B,…) so thermal drift averages out, repeat each branch and compare medians, and cool down between runs. Each branch is rebuilt (install -pl .) before its runs so the benchmark measures that branch's engine, and untracked benchmark probes are moved aside around the branch switch so they cannot break the other branch's compile.

  • Windows (PowerShell): scripts/ab-bench.ps1, full suite (latency, throughput, scalability, stress, comparative):

    ./scripts/ab-bench.ps1 -BranchA main -BranchB develop -Repeat 3
    ./scripts/ab-bench.ps1 -BranchA develop -BranchB feature/x -Repeat 5
  • Linux / macOS / Windows Git Bash: scripts/ab-bench.sh, current-speed suite (per-scenario latency + parallel throughput, the primary engine-speed signal):

    ./scripts/ab-bench.sh -a main -b develop -r 3
    ./scripts/ab-bench.sh --branch-a develop --branch-b feature/x --repeat 5 --cooldown 45
    ./scripts/ab-bench.sh -a main -b origin/anothertree -r 3   # remote-only ref (detached)

Both accept any pair of refs that can be checked out (local branches or origin/<name>). Deltas are reported as B relative to A (negative latency % and positive docs-per-sec % mean B is faster). The working tree must have no uncommitted tracked changes, because the scripts switch branches. Output lands under target/ab-compare/ and target/benchmarks/diffs/. Treat laptop deltas below roughly 5-10% as inconclusive; close other JVMs/IDEs and stay on AC power for the cleanest numbers.
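The interleaved schedule and the sign convention for deltas can be sketched as below. This only models the ordering and the arithmetic; the class and method names are hypothetical, and the real scripts also rebuild each branch and cool down between runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the A/B ordering and delta convention; not the scripts' actual logic.
final class AbCompareSketch {

    /** Builds the run order A, B, A, B, ... so thermal drift affects both branches equally. */
    static List<String> interleave(String branchA, String branchB, int repeat) {
        if (repeat < 1) {
            throw new IllegalArgumentException("repeat must be at least 1");
        }
        List<String> schedule = new ArrayList<>();
        for (int i = 0; i < repeat; i++) {
            schedule.add(branchA);
            schedule.add(branchB);
        }
        return schedule;
    }

    /** Delta of B relative to A in percent; for latency, a negative value means B is faster. */
    static double deltaPercent(double medianA, double medianB) {
        if (medianA <= 0.0) {
            throw new IllegalArgumentException("median for A must be positive");
        }
        return (medianB - medianA) / medianA * 100.0;
    }
}
```

For example, a median latency of 200 ms on A and 180 ms on B yields a delta of about -10%, which reads as "B is faster"; for docs-per-sec the same formula applies but a positive value is the improvement.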

Refreshing the committed baseline (perf gate)

baselines/current-speed-full.json is a committed median current-speed report that 11-verdict-current-speed judges new runs against (hard gate: average latency ±10%; peak heap is advisory because it is noisy with GC timing). Refresh it only for an intended, verified improvement so the gate ratchets down, never to turn a red gate green. Capture a median of ≥5 runs on the branch that defines the new reference, with the IDE closed:

Windows (PowerShell):

.\mvnw.cmd -B -ntp -f benchmarks\pom.xml test-compile dependency:build-classpath -DincludeScope=test -Dmdep.outputFile=target/benchmark.classpath
$cp = 'benchmarks\target\test-classes;benchmarks\target\classes;' + (Get-Content benchmarks\target\benchmark.classpath -Raw).Trim()
1..5 | ForEach-Object { & java "-Dgraphcompose.benchmark.profile=full" -cp "$cp" com.demcha.compose.CurrentSpeedBenchmark }
$runs = Get-ChildItem target\benchmarks\current-speed\run-*.json | Sort-Object Name | Select-Object -Last 5 | ForEach-Object { $_.FullName }
& java -cp "$cp" com.demcha.compose.BenchmarkMedianTool current-speed @runs
Copy-Item target\benchmarks\aggregates\current-speed\full\latest.json baselines\current-speed-full.json -Force

Linux / macOS / Git Bash:

./mvnw -B -ntp -f benchmarks/pom.xml test-compile dependency:build-classpath -DincludeScope=test -Dmdep.outputFile=target/benchmark.classpath
sep=':'; case "$(uname -s)" in MINGW*|MSYS*|CYGWIN*) sep=';';; esac
cp="benchmarks/target/test-classes${sep}benchmarks/target/classes${sep}$(cat benchmarks/target/benchmark.classpath)"
for i in 1 2 3 4 5; do java -Dgraphcompose.benchmark.profile=full -cp "$cp" com.demcha.compose.CurrentSpeedBenchmark; done
runs=$(ls -t target/benchmarks/current-speed/run-*.json | head -5)
java -cp "$cp" com.demcha.compose.BenchmarkMedianTool current-speed $runs
cp -f target/benchmarks/aggregates/current-speed/full/latest.json baselines/current-speed-full.json

The baseline is machine-class-specific; the JSON records provenance (timestamp, profile, sourceRuns). Validate the refresh against a fresh run — not one of the five that built the median — on that branch; it should score NEUTRAL and exit 0:

java -Dgraphcompose.benchmark.profile=full -cp "$cp" com.demcha.compose.CurrentSpeedBenchmark
java -cp "$cp" com.demcha.compose.BenchmarkVerdictTool baselines/current-speed-full.json target/benchmarks/current-speed/latest.json

Artifact layout

The wrapper writes two groups of artifacts.

Run-level logs and summaries

  • target/benchmark-runs/<timestamp>/SUMMARY.md
  • target/benchmark-runs/<timestamp>/logs/*.log

These are the best place to look when one numbered step fails.

Persistent benchmark reports

  • target/benchmarks/current-speed/
  • target/benchmarks/comparative/
  • target/benchmarks/diffs/
  • target/benchmarks/aggregates/

Typical contents:

  • run-<timestamp>.json
  • suite-specific CSV exports
  • latest.json convenience copies
  • median aggregate reports under aggregates/...

Running the Java entry points directly

The PowerShell wrapper is preferred, but direct runs are still useful when debugging one suite in isolation.

Build the classpath first:

mvn --% -B -ntp -DskipTests test-compile dependency:build-classpath -DincludeScope=test -Dmdep.outputFile=target/benchmark.classpath
$cp = (Get-Content 'target/benchmark.classpath' -Raw).Trim()

Then run the suite you care about:

java -cp "target\test-classes;target\classes;$cp" com.demcha.compose.CurrentSpeedBenchmark
java -Dgraphcompose.benchmark.profile=smoke -cp "target\test-classes;target\classes;$cp" com.demcha.compose.CurrentSpeedBenchmark
java -cp "target\test-classes;target\classes;$cp" com.demcha.compose.ComparativeBenchmark
java -cp "target\test-classes;target\classes;$cp" com.demcha.compose.BenchmarkDiffTool current-speed
java -cp "target\test-classes;target\classes;$cp" com.demcha.compose.BenchmarkDiffTool comparative

Use the suite shortcut when possible. BenchmarkDiffTool current-speed already knows how to select the newest compatible pair for the current-speed suite.

Troubleshooting

Why did 09-diff-current-speed skip?

Because there were not yet two current-speed reports with the same profile as the latest run.

Example:

  • latest run is full
  • historical reports contain only one full run and several smoke runs
  • result: the diff is skipped because there is no compatible pair yet

Why do I see a scary sun.misc.Unsafe warning during 01-build-classpath?

Today that warning comes from Lombok on newer JDKs. If the Maven section still ends with BUILD SUCCESS, treat it as noisy stderr, not as a benchmark failure.

Why did one local run get much slower even though the code did not obviously change?

Local benchmark numbers are sensitive to machine conditions:

  • background CPU load
  • OneDrive or antivirus activity
  • thermal throttling
  • JVM warmup differences
  • GC timing

Do not call a one-off slowdown a code regression until repeated runs show the same direction.

Which numbers should I cite in docs or release notes?

Prefer rerunning the relevant suite on the current checkout. For local claims, median-based repeated runs are safer than one-off results.

Maintenance rules

When changing the benchmark pipeline:

  • keep README.md aligned with the supported command line
  • update this file when the wrapper flow, artifact layout, or diff rules change
  • keep current-speed profile semantics explicit in user-facing docs
  • preserve the rule that incompatible current-speed profiles must never be diffed