Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Dec 30, 2025

Optimized lpad and rpad functions to eliminate per-row allocations by reusing buffers for graphemes and fill characters.

The previous implementation allocated new Vec<&str> for graphemes and Vec for fill characters on every row, which was inefficient. This optimization introduces reusable buffers that are allocated once and cleared/refilled for each row.

Changes:

  • lpad: Added graphemes_buf and fill_chars_buf outside loops, clear and refill per row instead of allocating new Vec each time
  • rpad: Added graphemes_buf outside loops to reuse across iterations
  • Both functions now allocate buffers once and reuse them for all rows
  • Buffers are cleared and reused for each row via .clear() and .extend()

Optimization impact:

  • For lpad with fill parameter: Eliminates 2 Vec allocations per row (graphemes + fill_chars)
  • For lpad without fill: Eliminates 1 Vec allocation per row (graphemes)
  • For rpad: Eliminates 1 Vec allocation per row (graphemes)

This optimization is particularly effective for:

  • Large arrays with many rows
  • Strings with multiple graphemes (unicode characters)
  • Workloads with custom fill patterns

Benchmark results comparing main vs optimized branch:

lpad benchmarks:

  • size=1024, str_len=5, target=20: 116.53 µs -> 63.226 µs (45.7% faster)
  • size=1024, str_len=20, target=50: 314.07 µs -> 190.30 µs (39.4% faster)
  • size=4096, str_len=5, target=20: 467.35 µs -> 261.29 µs (44.1% faster)
  • size=4096, str_len=20, target=50: 1.2286 ms -> 754.24 µs (38.6% faster)

rpad benchmarks:

  • size=1024, str_len=5, target=20: 113.89 µs -> 72.645 µs (36.2% faster)
  • size=1024, str_len=20, target=50: 313.68 µs -> 202.98 µs (35.3% faster)
  • size=4096, str_len=5, target=20: 456.08 µs -> 295.57 µs (35.2% faster)
  • size=4096, str_len=20, target=50: 1.2523 ms -> 818.47 µs (34.6% faster)

Overall improvements: 35-46% faster across all workloads

Which issue does this PR close?

  • Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label Dec 30, 2025
@viirya viirya force-pushed the pad_optimize branch 2 times, most recently from 3c3717a to aeb2911 Compare December 30, 2025 07:34
Optimized lpad and rpad functions to eliminate per-row allocations by
reusing buffers for graphemes and fill characters.

The previous implementation allocated new Vec<&str> for graphemes and
Vec<char> for fill characters on every row, which was inefficient. This
optimization introduces reusable buffers that are allocated once and
cleared/refilled for each row.

Changes:
- lpad: Added graphemes_buf and fill_chars_buf outside loops, clear and
  refill per row instead of allocating new Vec each time
- rpad: Added graphemes_buf outside loops to reuse across iterations
- Both functions now allocate buffers once and reuse them for all rows
- Buffers are cleared and reused for each row via .clear() and .extend()

Optimization impact:
- For lpad with fill parameter: Eliminates 2 Vec allocations per row
  (graphemes + fill_chars)
- For lpad without fill: Eliminates 1 Vec allocation per row (graphemes)
- For rpad: Eliminates 1 Vec allocation per row (graphemes)

This optimization is particularly effective for:
- Large arrays with many rows
- Strings with multiple graphemes (unicode characters)
- Workloads with custom fill patterns

Benchmark results comparing main vs optimized branch:

lpad benchmarks (size=1024):
- utf8, str_len=5, target=20:       119.39 µs -> 56.297 µs (52.8% faster, 2.1x speedup)
- stringview, str_len=5, target=20: 118.73 µs -> 56.254 µs (52.6% faster, 2.1x speedup)
- utf8, str_len=20, target=50:      329.59 µs -> 169.95 µs (48.4% faster, 1.9x speedup)
- stringview, str_len=20, target=50: 328.09 µs -> 175.37 µs (46.5% faster, 1.9x speedup)

lpad benchmarks (size=4096):
- utf8, str_len=5, target=20:       480.81 µs -> 225.45 µs (53.1% faster, 2.1x speedup)
- stringview, str_len=5, target=20: 477.46 µs -> 227.22 µs (52.4% faster, 2.1x speedup)
- utf8, str_len=20, target=50:      1.3199 ms -> 687.14 µs (47.9% faster, 1.9x speedup)
- stringview, str_len=20, target=50: 1.3209 ms -> 688.33 µs (47.9% faster, 1.9x speedup)

rpad benchmarks (size=1024):
- utf8, str_len=5, target=20:       99.587 µs -> 61.490 µs (38.3% faster, 1.6x speedup)
- stringview, str_len=5, target=20: 100.06 µs -> 61.470 µs (38.6% faster, 1.6x speedup)
- utf8, str_len=20, target=50:      274.94 µs -> 170.35 µs (38.0% faster, 1.6x speedup)
- stringview, str_len=20, target=50: 277.86 µs -> 171.87 µs (38.1% faster, 1.6x speedup)

rpad benchmarks (size=4096):
- utf8, str_len=5, target=20:       400.68 µs -> 244.85 µs (38.9% faster, 1.6x speedup)
- stringview, str_len=5, target=20: 404.15 µs -> 245.60 µs (39.2% faster, 1.6x speedup)
- utf8, str_len=20, target=50:      1.1168 ms -> 686.78 µs (38.5% faster, 1.6x speedup)
- stringview, str_len=20, target=50: 1.1219 ms -> 695.01 µs (38.1% faster, 1.6x speedup)

Overall improvements: 38-53% faster across all workloads (1.6-2.1x speedup)
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the performance of the lpad and rpad functions by eliminating per-row Vec allocations through buffer reuse. The optimization introduces reusable buffers for graphemes and fill characters that are allocated once outside the loops and cleared/refilled for each row, rather than allocating new vectors on every iteration.

Key changes:

  • Added buffer reuse optimization for graphemes in both lpad and rpad functions
  • Added buffer reuse optimization for fill characters in lpad function
  • Enhanced benchmarks with more comprehensive test cases covering different array sizes, string lengths, and data types

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
datafusion/functions/src/unicode/lpad.rs Implements buffer reuse for graphemes and fill characters by declaring Vec buffers outside loops and using clear()/extend() pattern instead of per-row allocations
datafusion/functions/src/unicode/rpad.rs Implements buffer reuse for graphemes using the same clear()/extend() pattern; however, missing the fill_chars_buf optimization that was implemented in lpad
datafusion/functions/benches/pad.rs Restructures benchmarks with more comprehensive test coverage including different array sizes (1024, 4096), string lengths (5, 20), target lengths (20, 50), and data types (Utf8, StringView)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 286 to 289
fill.chars()
.cycle()
.take(length - graphemes.len())
.take(length - graphemes_buf.len())
.for_each(|ch| builder.write_char(ch).unwrap());
Copy link

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fill characters should be pre-collected into a reusable buffer similar to how it's done in lpad.rs. Currently, fill.chars() is being called on every row which allocates a new iterator each time. This misses an optimization opportunity mentioned in the PR description: "For lpad with fill parameter: Eliminates 2 Vec allocations per row (graphemes + fill_chars)". The rpad function should also benefit from the same fill_chars_buf optimization by declaring a fill_chars_buf at line 219 and reusing it by clearing and extending with fill.chars() before this usage.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant