Skip to content

Conversation

@Xuanwo
Copy link
Collaborator

@Xuanwo Xuanwo commented Dec 29, 2025

This PR will optimize rle implementation


Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

@github-actions github-actions bot added the enhancement New feature or request label Dec 29, 2025
@github-actions
Copy link
Contributor

Code Review

Summary

This PR optimizes RLE encoding with three main changes:

  1. Smarter compression selection using size estimation instead of fixed run ratio
  2. Removal of the MAX_MINIBLOCK_VALUES workaround (2048 → 4096)
  3. Performance improvements in the decoder using typed vectors instead of byte manipulation

P0 Issues

Removing the issue #4429 workaround without addressing the root cause

The PR removes the 2048 limit workaround for issue #4429 (nested structs causing chunk_bytes <= 16 * 1024 assertion failures). The original comment explains:

"Since while rep/def takes 4B, 4Ki values will lead to the generated chunk buffer too large"

Simply changing max_by_count from 2048 to MAX_MINIBLOCK_VALUES (4096) may re-introduce panics for the specific nested struct cases described in #4429. Before merging:

P1 Issues

Size estimation may be too conservative

let estimated_pairs = (run_count + (num_values / 255)).min(num_values);

The num_values / 255 term assumes worst-case run length splitting (all runs are exactly 255 values). For data with good RLE characteristics (many long runs), this significantly overestimates encoded size, potentially rejecting valid RLE candidates. Consider using a less conservative estimate or documenting this trade-off.

Zero run length validation only in decoder

The decoder now validates length == 0 and returns an error:

if length == 0 {
    return Err(Error::InvalidInput { ... });
}

While defensive, this suggests zero-length runs could exist in the encoded data. If the encoder guarantees this never happens, consider an assertion/debug check instead. If it can happen with malformed data, this is appropriate.

Minor Notes

  • Extra blank line added in general.rs:247 - trivial but could be cleaned up

@codecov
Copy link

codecov bot commented Dec 29, 2025

Codecov Report

❌ Patch coverage is 84.78261% with 7 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-encoding/src/encodings/physical/rle.rs 79.41% 7 Missing ⚠️

📢 Thoughts on this report? Let us know!

@Xuanwo Xuanwo merged commit c20415b into main Dec 30, 2025
29 checks passed
@Xuanwo Xuanwo deleted the xuanwo/rle-opt branch December 30, 2025 05:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants