-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: improve performance of lpad/rpad by reusing buffers #19558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3c3717a to
aeb2911
Compare
Optimized lpad and rpad functions to eliminate per-row allocations by reusing buffers for graphemes and fill characters. The previous implementation allocated new Vec<&str> for graphemes and Vec<char> for fill characters on every row, which was inefficient. This optimization introduces reusable buffers that are allocated once and cleared/refilled for each row. Changes: - lpad: Added graphemes_buf and fill_chars_buf outside loops, clear and refill per row instead of allocating new Vec each time - rpad: Added graphemes_buf outside loops to reuse across iterations - Both functions now allocate buffers once and reuse them for all rows - Buffers are cleared and reused for each row via .clear() and .extend() Optimization impact: - For lpad with fill parameter: Eliminates 2 Vec allocations per row (graphemes + fill_chars) - For lpad without fill: Eliminates 1 Vec allocation per row (graphemes) - For rpad: Eliminates 1 Vec allocation per row (graphemes) This optimization is particularly effective for: - Large arrays with many rows - Strings with multiple graphemes (unicode characters) - Workloads with custom fill patterns Benchmark results comparing main vs optimized branch: lpad benchmarks (size=1024): - utf8, str_len=5, target=20: 119.39 µs -> 56.297 µs (52.8% faster, 2.1x speedup) - stringview, str_len=5, target=20: 118.73 µs -> 56.254 µs (52.6% faster, 2.1x speedup) - utf8, str_len=20, target=50: 329.59 µs -> 169.95 µs (48.4% faster, 1.9x speedup) - stringview, str_len=20, target=50: 328.09 µs -> 175.37 µs (46.5% faster, 1.9x speedup) lpad benchmarks (size=4096): - utf8, str_len=5, target=20: 480.81 µs -> 225.45 µs (53.1% faster, 2.1x speedup) - stringview, str_len=5, target=20: 477.46 µs -> 227.22 µs (52.4% faster, 2.1x speedup) - utf8, str_len=20, target=50: 1.3199 ms -> 687.14 µs (47.9% faster, 1.9x speedup) - stringview, str_len=20, target=50: 1.3209 ms -> 688.33 µs (47.9% faster, 1.9x speedup) rpad benchmarks (size=1024): - utf8, str_len=5, target=20: 99.587 µs -> 61.490 µs (38.3% faster, 1.6x speedup) - stringview, str_len=5, target=20: 100.06 µs -> 61.470 µs (38.6% faster, 1.6x speedup) - utf8, str_len=20, target=50: 274.94 µs -> 170.35 µs (38.0% faster, 1.6x speedup) - stringview, str_len=20, target=50: 277.86 µs -> 171.87 µs (38.1% faster, 1.6x speedup) rpad benchmarks (size=4096): - utf8, str_len=5, target=20: 400.68 µs -> 244.85 µs (38.9% faster, 1.6x speedup) - stringview, str_len=5, target=20: 404.15 µs -> 245.60 µs (39.2% faster, 1.6x speedup) - utf8, str_len=20, target=50: 1.1168 ms -> 686.78 µs (38.5% faster, 1.6x speedup) - stringview, str_len=20, target=50: 1.1219 ms -> 695.01 µs (38.1% faster, 1.6x speedup) Overall improvements: 38-53% faster across all workloads (1.6-2.1x speedup)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the performance of the lpad and rpad functions by eliminating per-row Vec allocations through buffer reuse. The optimization introduces reusable buffers for graphemes and fill characters that are allocated once outside the loops and cleared/refilled for each row, rather than allocating new vectors on every iteration.
Key changes:
- Added buffer reuse optimization for graphemes in both lpad and rpad functions
- Added buffer reuse optimization for fill characters in lpad function
- Enhanced benchmarks with more comprehensive test cases covering different array sizes, string lengths, and data types
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| datafusion/functions/src/unicode/lpad.rs | Implements buffer reuse for graphemes and fill characters by declaring Vec buffers outside loops and using clear()/extend() pattern instead of per-row allocations |
| datafusion/functions/src/unicode/rpad.rs | Implements buffer reuse for graphemes using the same clear()/extend() pattern; however, missing the fill_chars_buf optimization that was implemented in lpad |
| datafusion/functions/benches/pad.rs | Restructures benchmarks with more comprehensive test coverage including different array sizes (1024, 4096), string lengths (5, 20), target lengths (20, 50), and data types (Utf8, StringView) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| fill.chars() | ||
| .cycle() | ||
| .take(length - graphemes.len()) | ||
| .take(length - graphemes_buf.len()) | ||
| .for_each(|ch| builder.write_char(ch).unwrap()); |
Copilot
AI
Dec 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fill characters should be pre-collected into a reusable buffer similar to how it's done in lpad.rs. Currently, fill.chars() is being called on every row which allocates a new iterator each time. This misses an optimization opportunity mentioned in the PR description: "For lpad with fill parameter: Eliminates 2 Vec allocations per row (graphemes + fill_chars)". The rpad function should also benefit from the same fill_chars_buf optimization by declaring a fill_chars_buf at line 219 and reusing it by clearing and extending with fill.chars() before this usage.
Optimized lpad and rpad functions to eliminate per-row allocations by reusing buffers for graphemes and fill characters.
The previous implementation allocated new Vec<&str> for graphemes and Vec for fill characters on every row, which was inefficient. This optimization introduces reusable buffers that are allocated once and cleared/refilled for each row.
Changes:
Optimization impact:
This optimization is particularly effective for:
Benchmark results comparing main vs optimized branch:
lpad benchmarks:
rpad benchmarks:
Overall improvements: 35-46% faster across all workloads
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?