perf: rewrite compareUTF8 with surrogate-regex fast path by arv · Pull Request #3 · rocicorp/compare-utf8

arv · 2026-04-01T13:47:07Z

Rewrite `compareUTF8` for significantly faster performance

For BMP-only strings (no surrogate pairs), UTF-16 code-unit order is
identical to UTF-8 byte order, so native JS string comparison (`<`) gives
the correct result. Use a surrogate regex to detect this case.
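To see why the surrogate check is needed, compare U+FFFF with U+1F338 (🌸). In UTF-16 the emoji is the pair D83C DF38, whose first unit sorts below FFFF, while in UTF-8 it sorts above. The guard might look roughly like the sketch below. The names `SURROGATE_RE`, `isSurrogateFree`, and `compareFastPath` are hypothetical and this is an illustration, not the code merged in this PR.

```typescript
// Hypothetical names; illustrative only, not the code merged in this PR.
const SURROGATE_RE = /[\uD800-\uDFFF]/;

// True when the string contains no UTF-16 surrogate code units.
function isSurrogateFree(s: string): boolean {
  return !SURROGATE_RE.test(s);
}

// Returns -1, 0, or 1 in UTF-8 byte order when both inputs are surrogate-free,
// or undefined when the caller has to use a code-point-aware comparison.
function compareFastPath(a: string, b: string): number | undefined {
  if (a === b) return 0;
  if (isSurrogateFree(a) && isSurrogateFree(b)) {
    // Without surrogates, UTF-16 code-unit order matches UTF-8 byte order.
    return a < b ? -1 : 1;
  }
  return undefined;
}

const emoji: string = "\uD83C\uDF38"; // U+1F338 written as a surrogate pair
const lastBmp: string = "\uFFFF";

// Native comparison puts the emoji first because 0xD83C < 0xFFFF,
// which is the opposite of UTF-8 byte order ...
const nativeSaysEmojiFirst = emoji < lastBmp;
// ... so the fast path declines this pair.
const declined = compareFastPath(emoji, lastBmp) === undefined;
```

Both strings have to be checked: a surrogate in either one can be compared against a code unit in the U+E000 to U+FFFF range of the other.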

Key insights driving the implementation:

1. V8 stores ASCII/Latin-1 strings in a one-byte backing store. The
   surrogate regex (`/[\uD800-\uDFFF]/`) compiles a byte-range check that
   is always false for one-byte strings in O(1): no characters are
   scanned. This makes the BMP fast path essentially free for ASCII.
2. UTF-8 byte order equals Unicode code point order for all valid scalar
   values. Within each byte-length class encoding is monotone; across
   classes the first-byte prefix increases with the code point range.
   This means the inner loop fallback can use simple `charCodeAt()`
   subtraction for any non-surrogate BMP code unit, and only falls back
   to `codePointAt()` for the rare case of a surrogate pair.
3. The length threshold of 16 is empirically derived: for ASCII the O(1)
   regex check amortises at ~13 chars; for two-byte (CJK) strings where
   the regex must scan, the crossover is ~32 chars, but 16 is a net win
   across typical real-world workloads.
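The point above about UTF-8 byte order matching code point order can be sketched as a loop like the one below. It assumes well-formed UTF-16 input (no lone surrogates). The names are hypothetical and the length threshold of 16 is not modelled, so treat it as an illustration rather than the merged implementation.

```typescript
// Hypothetical sketch of a code-point-order comparison; not the PR's code.
function isSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdfff;
}

// Returns a negative number, zero, or a positive number.
function compareCodePointOrder(a: string, b: string): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    const ua = a.charCodeAt(i);
    const ub = b.charCodeAt(i);
    if (ua === ub) continue; // Identical code units: keep scanning.
    if (isSurrogate(ua) || isSurrogate(ub)) {
      // Rare case: read whole code points so that a surrogate pair
      // (>= U+10000) sorts after every BMP character, as it does in UTF-8.
      return (a.codePointAt(i) as number) - (b.codePointAt(i) as number);
    }
    // Common case: two different non-surrogate BMP code units.
    return ua - ub;
  }
  // One string is a prefix of the other: the shorter one sorts first.
  return a.length - b.length;
}
```

If the first difference falls on the low surrogate of two pairs that share a high surrogate, `codePointAt()` returns the low surrogate itself on both sides, which still orders the two code points correctly.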

These changes also remove ~60 lines of now-unused code: the `utf8Bytes()`
and `compareArrays()` helpers and their pre-allocated `Uint8Array` buffers
are no longer needed.
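A plain byte-wise comparator is still handy as a test oracle for the faster code. The sketch below uses the standard `TextEncoder`. The name `referenceCompareUTF8` is hypothetical, and this is not a reconstruction of the removed `utf8Bytes()` / `compareArrays()` helpers, whose source is not shown here.

```typescript
// Hypothetical reference comparator for cross-checking; not the removed helpers.
const encoder = new TextEncoder();

// Compares two strings by their UTF-8 bytes.
// Returns a negative number, zero, or a positive number.
function referenceCompareUTF8(a: string, b: string): number {
  // Note: TextEncoder replaces lone surrogates with U+FFFD, so this oracle
  // is only meaningful for well-formed strings.
  const bytesA = encoder.encode(a);
  const bytesB = encoder.encode(b);
  const shared = Math.min(bytesA.length, bytesB.length);
  for (let i = 0; i < shared; i++) {
    if (bytesA[i] !== bytesB[i]) return bytesA[i] - bytesB[i];
  }
  // One byte sequence is a prefix of the other: the shorter one sorts first.
  return bytesA.length - bytesB.length;
}
```

It allocates two byte arrays per call, so it suits tests rather than hot paths.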

Added a full benchmark suite covering short / medium / large strings
across ASCII, mixed (non-ASCII in middle and at end), and CJK workloads.
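A timing harness for scenarios like these could be sketched as follows. The helper names (`makePair`, `nsPerCall`), the choice to make the strings differ only in their last code unit, and the iteration count are all assumptions; the suite added in this PR may be structured differently.

```typescript
// Hypothetical micro-benchmark helpers; not the suite added in this PR.
type Compare = (a: string, b: string) => number;

// Builds two strings of `length` code units that share a prefix made of
// `filler` (a single code unit) and differ only in the final code unit,
// so a comparison has to look at the whole string.
function makePair(filler: string, length: number): [string, string] {
  const prefix = filler.repeat(length - 1);
  return [prefix + "a", prefix + "b"];
}

// Returns the average wall-clock time per call in nanoseconds.
function nsPerCall(
  compare: Compare,
  a: string,
  b: string,
  iterations: number = 100000,
): number {
  let sink = 0; // Accumulate results so the calls are not optimised away.
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    sink += compare(a, b);
  }
  const elapsedMs = performance.now() - start;
  if (Number.isNaN(sink)) throw new Error("comparator returned NaN");
  return (elapsedMs * 1e6) / iterations;
}
```

For example, `nsPerCall(compareUTF8, ...makePair("a", 1024))` would time a large ASCII pair and `makePair("\u4E2D", 64)` would give a medium CJK pair. Numbers from a loop like this vary between runs and machines.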

Performance

Benchmarked on Apple M5, Node 24.14.1. "original" = implementation on main before this PR.

| Scenario | original | new | speedup |
| --- | ---: | ---: | --- |
| short ASCII (8 chars) | 10.30 ns | 6.96 ns | 1.5× faster |
| short mixed – non-ASCII mid (8) | 9.55 ns | 6.12 ns | 1.6× faster |
| short mixed – non-ASCII end (8) | 7.14 ns | 5.43 ns | 1.3× faster |
| short CJK (8 chars) | 17.08 ns | 6.95 ns | 2.5× faster |
| medium ASCII (64 chars) | 150.76 ns | 25.35 ns | 5.9× faster |
| medium mixed – non-ASCII mid (64) | 236.14 ns | 127.03 ns | 1.9× faster |
| medium mixed – non-ASCII end (64) | 144.99 ns | 98.79 ns | 1.5× faster |
| medium CJK (64 chars) | 113.48 ns | 132.50 ns | 1.2× slower¹ |
| large ASCII (1024 chars) | 1780 ns | 62.05 ns | 28.7× faster |
| large mixed – non-ASCII mid (1024) | 2380 ns | 1550 ns | 1.5× faster |
| large mixed – non-ASCII end (1024) | 2410 ns | 4640 ns | 1.9× slower² |
| large CJK (1024 chars) | 1690 ns | 1540 ns | 1.1× faster |

¹ For medium CJK the surrogate regex must scan all 64 two-byte characters; the
`charCodeAt` loop in the original is slightly cheaper at this length. The
large CJK case recovers as the fixed per-call regex overhead amortises over
the longer string.

² Worst case for this implementation: a 1024-char ASCII string with a
supplementary character (surrogate pair, e.g. 🌸) at the very end. The
regex scans the entire string before finding the surrogate, then the loop
iterates to the end too. In practice supplementary characters at the end of
otherwise-ASCII strings are rare; every other large-string scenario in the
table is faster (1.1× to 28.7×).


arv · 2026-04-01T13:49:14Z

@Karavil I also had the fast path in a different PR when experimenting with Intl.Collator.

aboodman · 2026-04-02T06:08:58Z

Holy shit this is a potentially massive win. This comparator is at the root of many, many profiles.

I remember we tried fast-pathing for ascii before but because we didn't know about this regex trick it didn't turn out better.

Do we know how this compares on interesting real world benchmarks of queries?

Merge branch 'main' into arv/optimize

c0f1422

arv merged commit bddc16b into main Apr 1, 2026
3 checks passed

arv deleted the arv/optimize branch April 1, 2026 13:55

arv mentioned this pull request Apr 2, 2026

chore: update compare-utf8 to v0.2.0 rocicorp/mono#5730

Merged

arv merged 2 commits into main from arv/optimize
