perf: rewrite compareUTF8 with surrogate-regex fast path #3
Merged
For BMP-only strings (no surrogate pairs), UTF-16 code-unit order is identical to UTF-8 byte order, so native JS string comparison (`<`) gives the correct result. Use a surrogate regex to detect this case.

Key insights driving the implementation:

1. V8 stores ASCII/Latin-1 strings in a one-byte backing store. The surrogate regex (`/[\uD800-\uDFFF]/`) compiles a byte-range check that is always false for one-byte strings in O(1); no characters are scanned. This makes the BMP fast path essentially free for ASCII.
2. UTF-8 byte order equals Unicode code point order for all valid scalar values. Within each byte-length class, encoding is monotone; across classes, the first-byte prefix increases with the code point range. This means the inner loop fallback can use simple `charCodeAt()` subtraction for any non-surrogate BMP code unit, and only falls back to `codePointAt()` for the rare case of a surrogate pair.
3. The length threshold of 16 is empirically derived: for ASCII the O(1) regex check amortises at ~13 chars; for two-byte (CJK) strings, where the regex must scan, the crossover is ~32 chars, but 16 is a net win across typical real-world workloads.

These changes also remove ~60 lines of dead code: the `utf8Bytes()` and `compareArrays()` helpers and their pre-allocated `Uint8Array` buffers are no longer needed.
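To make the description concrete, here is a minimal sketch of the approach, not the code from this PR. The names `compareUTF8Sketch`, `SURROGATE`, `FAST_PATH_MIN_LENGTH`, and `isSurrogate` are hypothetical, and requiring *both* strings to reach the threshold is an assumption about how the length check of 16 is applied. It also assumes well-formed input (no unpaired surrogates).

```typescript
// Hypothetical names throughout; this is an illustration, not the merged code.
const SURROGATE = /[\uD800-\uDFFF]/;
// Assumption: the threshold from the description is applied to both strings.
const FAST_PATH_MIN_LENGTH = 16;

function isSurrogate(unit: number): boolean {
  // True for any code unit in 0xD800..0xDFFF.
  return (unit & 0xf800) === 0xd800;
}

// Returns a negative number, zero, or a positive number, ordering the
// strings by their UTF-8 byte sequences (assumes well-formed strings).
function compareUTF8Sketch(a: string, b: string): number {
  if (a === b) return 0;

  // Fast path: with no surrogates present, UTF-16 code-unit order matches
  // UTF-8 byte order, so the native comparison is already correct.
  if (
    a.length >= FAST_PATH_MIN_LENGTH &&
    b.length >= FAST_PATH_MIN_LENGTH &&
    !SURROGATE.test(a) &&
    !SURROGATE.test(b)
  ) {
    return a < b ? -1 : 1;
  }

  // Fallback: find the first differing code unit.
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const ca = a.charCodeAt(i);
    const cb = b.charCodeAt(i);
    if (ca === cb) continue;
    // Two non-surrogate BMP units: plain subtraction gives code point order.
    if (!isSurrogate(ca) && !isSurrogate(cb)) return ca - cb;
    // A surrogate is involved: compare full code points instead.
    return a.codePointAt(i)! - b.codePointAt(i)!;
  }
  // One string is a prefix of the other; the shorter one sorts first.
  return a.length - b.length;
}

// U+FFFF vs U+1F600 (written as a surrogate pair). Native `<` compares
// 0xFFFF with 0xD83D and puts the emoji first; UTF-8 order puts U+FFFF first.
const nativeSaysLess = "\uFFFF" < "\uD83D\uDE00"; // false
const sketchResult = compareUTF8Sketch("\uFFFF", "\uD83D\uDE00"); // negative
```

The last two lines show the one case where code-unit order and UTF-8 byte order disagree, which is why the fast path has to rule out surrogates before trusting `<`.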
@Karavil I also had the fast path in a different PR when experimenting with
Holy shit this is a potentially massive win. This comparator is at the root of many, many profiles. I remember we tried fast-pathing for ASCII before, but because we didn't know about this regex trick it didn't turn out better. Do we know how this compares on interesting real-world benchmarks of queries?
Rewrite `compareUTF8` for significantly faster performance
across ASCII, mixed (non-ASCII in middle and at end), and CJK workloads.
Performance
Benchmarked on Apple M5, Node 24.14.1. "original" = implementation on `main` before this PR.
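For readers who want to reproduce this kind of comparison, the following is a rough micro-benchmark sketch, not the harness used for the numbers in this PR. Everything in it is hypothetical: `compareByBytes` is a stand-in baseline that encodes with `TextEncoder` (it is not the implementation on `main`), `compareNative` is only valid for surrogate-free input, and the workloads are synthetic approximations of the ASCII, mixed, and CJK categories (the mixed one only places the non-ASCII character at the end).

```typescript
type Comparator = (a: string, b: string) => number;

// Hypothetical baseline: encode both strings and compare the bytes.
const encoder = new TextEncoder();
function compareByBytes(a: string, b: string): number {
  const x = encoder.encode(a);
  const y = encoder.encode(b);
  const n = Math.min(x.length, y.length);
  for (let i = 0; i < n; i++) {
    if (x[i] !== y[i]) return x[i] - y[i];
  }
  return x.length - y.length;
}

// Candidate: native code-unit comparison (correct only without surrogates).
function compareNative(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Builds `count` strings of `length` characters that share a common prefix
// and differ only in their last four characters.
function makeWorkload(alphabet: string, length: number, count: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < count; i++) {
    let s = "";
    for (let j = 0; j < length; j++) {
      const k = j < length - 4 ? j : i * 31 + j * 7;
      s += alphabet.charAt(k % alphabet.length);
    }
    out.push(s);
  }
  return out;
}

let sink = 0; // accumulates results so the loop has an observable effect

// Returns the average time per comparison in nanoseconds.
function bench(compare: Comparator, items: string[], rounds: number): number {
  const start = performance.now();
  for (let r = 0; r < rounds; r++) {
    for (let i = 1; i < items.length; i++) {
      sink += compare(items[i - 1], items[i]);
    }
  }
  const elapsedMs = performance.now() - start;
  return (elapsedMs * 1e6) / (rounds * (items.length - 1));
}

const workloads: Record<string, string[]> = {
  ascii: makeWorkload("abcdefghijklmnopqrstuvwxyz", 32, 1000),
  mixed: makeWorkload("abcdefghijklmnopqrstuvwxyz", 31, 1000).map((s) => s + "€"),
  cjk: makeWorkload("日本語中文漢字", 32, 1000),
};

for (const name of Object.keys(workloads)) {
  const items = workloads[name];
  const bytesNs = bench(compareByBytes, items, 5);
  const nativeNs = bench(compareNative, items, 5);
  console.log(name, "bytes:", bytesNs.toFixed(1), "ns/op", "native:", nativeNs.toFixed(1), "ns/op");
}
```

Timings from a loop this small are noisy and sensitive to JIT warm-up, so treat the output as a sanity check of relative order rather than a replacement for the figures reported here.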