perf: rewrite compareUTF8 with surrogate-regex fast path #3

Merged
arv merged 2 commits into main from arv/optimize on Apr 1, 2026

arv (Contributor) commented Apr 1, 2026

Rewrite compareUTF8 for significantly faster performance

For BMP-only strings (no surrogate pairs), UTF-16 code-unit order is
identical to UTF-8 byte order, so native JS string comparison (<) gives
the correct result. Use a surrogate regex to detect this case.
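
To illustrate why the surrogate check matters, here is a small sketch that compares native (UTF-16 code-unit) ordering against a byte-wise UTF-8 reference. The names `compareUTF8Bytes`, `compareNative`, and `SURROGATE` are hypothetical and are not taken from this PR; only the regex literal itself comes from the description.

```javascript
// Illustrative only: these helper names are hypothetical, not the PR's code.
const SURROGATE = /[\uD800-\uDFFF]/;

// Reference ordering: encode to UTF-8 and compare byte by byte.
function compareUTF8Bytes(a, b) {
  const enc = new TextEncoder();
  const x = enc.encode(a);
  const y = enc.encode(b);
  const n = Math.min(x.length, y.length);
  for (let i = 0; i < n; i++) {
    if (x[i] !== y[i]) return x[i] < y[i] ? -1 : 1;
  }
  return Math.sign(x.length - y.length);
}

// Native comparison orders strings by UTF-16 code units.
function compareNative(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

const bmp = '\uFF5E';       // U+FF5E: one code unit, UTF-8 bytes EF BD 9E
const astral = '\u{1F338}'; // U+1F338: surrogate pair D83C DF38, UTF-8 bytes F0 9F 8C B8

// BMP-only inputs: the two orderings agree.
console.log(compareNative('abc', bmp), compareUTF8Bytes('abc', bmp)); // -1 -1
// With a surrogate pair they disagree, because 0xFF5E > 0xD83C but EF < F0.
console.log(compareNative(bmp, astral), compareUTF8Bytes(bmp, astral)); // 1 -1
// The regex separates the two cases.
console.log(SURROGATE.test(bmp), SURROGATE.test(astral)); // false true
```

The disagreement only arises when a code unit in U+E000..U+FFFF meets a surrogate, which is why detecting surrogates is sufficient to decide whether `<` is safe.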

Key insights driving the implementation:

  1. V8 stores ASCII/Latin-1 strings in a one-byte backing store. The
    surrogate regex (/[\uD800-\uDFFF]/) compiles a byte-range check that
    is always false for one-byte strings in O(1) — no characters are
    scanned. This makes the BMP fast path essentially free for ASCII.

  2. UTF-8 byte order equals Unicode code point order for all valid scalar
    values. Within each byte-length class the encoding is monotone; across
    classes the first-byte prefix increases with the code point range.
    This means the fallback loop can use simple charCodeAt() subtraction
    whenever neither of the differing code units is a surrogate, and only
    needs codePointAt() in the rare case where a surrogate is involved.

  3. The length threshold of 16 is empirically derived: for ASCII the O(1)
    regex check amortises at ~13 chars; for two-byte (CJK) strings where
    the regex must scan, the crossover is ~32 chars, but 16 is a net win
    across typical real-world workloads.
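
The three points above can be put together in a minimal sketch. This is not the merged implementation: the names `FAST_PATH_MIN_LENGTH`, `isSurrogate`, and `compareLoop` are hypothetical, and the way the threshold of 16 is applied (to the shorter of the two strings) is an assumption. The sketch also assumes well-formed strings with no lone surrogates, and returns only the sign of the comparison.

```javascript
// Sketch under stated assumptions; not the code merged in this PR.
const SURROGATE = /[\uD800-\uDFFF]/;
const FAST_PATH_MIN_LENGTH = 16; // value from the description; how it is applied is assumed

// True for any UTF-16 surrogate code unit (0xD800..0xDFFF).
function isSurrogate(codeUnit) {
  return (codeUnit & 0xf800) === 0xd800;
}

// Fallback: walk code units, switching to code points only when needed.
function compareLoop(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const ca = a.charCodeAt(i);
    const cb = b.charCodeAt(i);
    if (ca === cb) continue;
    // Neither unit is a surrogate: code-unit order equals code point order.
    if (!isSurrogate(ca) && !isSurrogate(cb)) return ca - cb;
    // A surrogate is involved: compare full code points instead.
    return a.codePointAt(i) - b.codePointAt(i);
  }
  // One string is a prefix of the other (or they are equal).
  return a.length - b.length;
}

function compareUTF8(a, b) {
  // Assumed threshold rule: only pay for the regex on longer inputs.
  if (Math.min(a.length, b.length) >= FAST_PATH_MIN_LENGTH) {
    if (!SURROGATE.test(a) && !SURROGATE.test(b)) {
      // BMP-only: native comparison already matches UTF-8 byte order.
      return a < b ? -1 : a > b ? 1 : 0;
    }
  }
  return compareLoop(a, b);
}

const words = ['\u{1F338}', '\uFF5E', 'b', 'a', '中'];
console.log(words.sort(compareUTF8)); // [ 'a', 'b', '中', '～', '🌸' ]
```

Sorting with `compareUTF8` places U+FF5E before U+1F338, whereas the default `Array.prototype.sort` (code-unit order) would place the surrogate pair first.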

These changes also remove ~60 lines of now-dead code: the utf8Bytes() and
compareArrays() helpers and their pre-allocated Uint8Array buffers are
no longer needed.

  • Added a full benchmark suite covering short / medium / large strings
    across ASCII, mixed (non-ASCII in middle and at end), and CJK workloads.
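
For readers who want to try similar measurements, a rough harness along these lines might look as follows. It is not the benchmark suite added by this PR: `makeScenario`, `bench`, and `nativeCompare` are hypothetical names, the choice of `é` as the non-ASCII character is an assumption, and the input pairs (identical except for the last character) are a guess at a reasonable workload. The comparator under test is a native-comparison stand-in, so the output is not comparable with the table below.

```javascript
// Hypothetical harness; names and input shapes are assumptions.
function makeScenario(kind, length, nonAscii = 'é') {
  const chars = new Array(length).fill(kind === 'cjk' ? '中' : 'a');
  if (kind === 'mixed-mid') chars[length >> 1] = nonAscii;
  if (kind === 'mixed-end') chars[length - 1] = nonAscii;
  const body = chars.join('');
  // The pair differs only in the final character, so a comparator
  // has to walk the whole shared prefix before it can decide.
  return [body + 'a', body + 'b'];
}

function bench(compare, a, b, iterations) {
  let sink = 0; // accumulate results so the calls cannot be optimised away
  for (let i = 0; i < 1000; i++) sink += compare(a, b); // warm-up
  const start = performance.now();
  for (let i = 0; i < iterations; i++) sink += compare(a, b);
  const elapsedMs = performance.now() - start;
  return { nsPerOp: (elapsedMs * 1e6) / iterations, sink };
}

// Stand-in comparator; swap in the function you actually want to measure.
const nativeCompare = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

for (const length of [8, 64, 1024]) {
  for (const kind of ['ascii', 'mixed-mid', 'mixed-end', 'cjk']) {
    const [a, b] = makeScenario(kind, length);
    const { nsPerOp } = bench(nativeCompare, a, b, 20000);
    console.log(`${kind} (${length}): ${nsPerOp.toFixed(2)} ns/op`);
  }
}
```

A hand-rolled timing loop like this is noisy at the nanosecond scale; a dedicated benchmarking library with statistical sampling would give more trustworthy numbers.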

Performance

Benchmarked on Apple M5, Node 24.14.1. "original" = implementation on main before this PR.

| Scenario | original | new | speedup |
| --- | ---: | ---: | --- |
| short ASCII (8 chars) | 10.30 ns | 6.96 ns | 1.5× faster |
| short mixed – non-ASCII mid (8) | 9.55 ns | 6.12 ns | 1.6× faster |
| short mixed – non-ASCII end (8) | 7.14 ns | 5.43 ns | 1.3× faster |
| short CJK (8 chars) | 17.08 ns | 6.95 ns | 2.5× faster |
| medium ASCII (64 chars) | 150.76 ns | 25.35 ns | 5.9× faster |
| medium mixed – non-ASCII mid (64) | 236.14 ns | 127.03 ns | 1.9× faster |
| medium mixed – non-ASCII end (64) | 144.99 ns | 98.79 ns | 1.5× faster |
| medium CJK (64 chars) | 113.48 ns | 132.50 ns | 1.2× slower¹ |
| large ASCII (1024 chars) | 1780 ns | 62.05 ns | 28.7× faster |
| large mixed – non-ASCII mid (1024) | 2380 ns | 1550 ns | 1.5× faster |
| large mixed – non-ASCII end (1024) | 2410 ns | 4640 ns | 1.9× slower² |
| large CJK (1024 chars) | 1690 ns | 1540 ns | 1.1× faster |

¹ For medium CJK the surrogate regex must scan 64 two-byte characters; the
charCodeAt loop in the original just happens to be slightly cheaper at this
length. The large CJK case recovers (1.1× faster) as the fixed cost of the
regex call amortises over the longer string.

² Worst case for this implementation: a 1024-char ASCII string with a
supplementary character (surrogate pair, e.g. 🌸) at the very end. The
regex scans the entire string before finding the surrogate, then the loop
iterates to the end too. In practice supplementary characters at the end of
otherwise-ASCII strings are rare; all other large-string scenarios are
faster, from 1.1× (CJK) to 28.7× (ASCII).

arv (Contributor, Author) commented Apr 1, 2026

@Karavil I also had the fast path in a different PR when experimenting with Intl.Collator.

arv merged commit bddc16b into main on Apr 1, 2026
3 checks passed
arv deleted the arv/optimize branch on April 1, 2026 13:55

aboodman commented Apr 2, 2026

Holy shit this is a potentially massive win. This comparator is at the root of many, many profiles.

I remember we tried fast-pathing for ascii before but because we didn't know about this regex trick it didn't turn out better.

Do we know how this compares on interesting real world benchmarks of queries?
