[core][flink] Support incremental clustering for append bucketed table #6835
base: master
1835c68 to aedeea9
aedeea9 to 692346a
JingsongLi
left a comment
My idea is that clustering this bucketed table is like merging small files in an Append + Bucket-1 table: directly generate tasks so that different concurrent writers can read and write separately.
            runsInfo);
        });
    }
    partitionLevels.forEach(
Too deeply nested. Use a for loop.
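For illustration only, the kind of flattening requested here might look like the sketch below. The reviewed code is not shown in full, so the class name `LevelSummary`, the method `describe`, and the map-of-maps shape given to `partitionLevels` are all assumptions; only the names `partitionLevels` and `runsInfo` come from the diff context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the reviewed code: the real types behind
// partitionLevels and runsInfo are not visible in this excerpt.
class LevelSummary {

    // Assumed shape: partition -> level -> file names in that level.
    static List<String> describe(Map<String, Map<Integer, List<String>>> partitionLevels) {
        List<String> runsInfo = new ArrayList<>();
        // Plain for loops keep the nesting shallow and make early exits
        // (break/continue/return) possible, unlike nested forEach lambdas.
        for (Map.Entry<String, Map<Integer, List<String>>> partition : partitionLevels.entrySet()) {
            for (Map.Entry<Integer, List<String>> level : partition.getValue().entrySet()) {
                runsInfo.add(
                        partition.getKey()
                                + "/L" + level.getKey()
                                + ": " + level.getValue().size() + " file(s)");
            }
        }
        return runsInfo;
    }
}
```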
Good idea! I will encapsulate the local-sort operation within a single bucket into a Task and move it to paimon-core, which will also facilitate integration with multiple other engines.
Purpose
Linked issue: close #xxx
By default, Paimon's append bucketed tables maintain data ordering. However, this ordering requirement can be relaxed to enable additional optimizations.
This PR introduces the ability to disable the ordering requirement for append bucketed tables. When ordering is not strictly required, data can be incrementally clustered within each bucket, which can improve query performance for queries on bucket-key + clustering-key combinations.
Unlike append-unaware tables, which require range partitioning, bucketed tables only need to shuffle by partition + bucket and perform local clustering within each bucket, making this approach cheaper in both shuffle cost and resources.
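As a rough sketch of the "shuffle by partition + bucket, then cluster locally" idea, the following groups rows by their (partition, bucket) pair and sorts each group independently. It is not the PR's implementation: `BucketLocalClustering`, `Row`, and `clusterWithinBuckets` are hypothetical names, the grouping is done in memory rather than by an engine shuffle, and a plain sort on a single numeric key stands in for whatever clustering strategy the table is configured with.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; none of these names are Paimon APIs.
class BucketLocalClustering {

    // Minimal row model: partition value, bucket id, one clustering key.
    static final class Row {
        final String partition;
        final int bucket;
        final long clusteringKey;

        Row(String partition, int bucket, long clusteringKey) {
            this.partition = partition;
            this.bucket = bucket;
            this.clusteringKey = clusteringKey;
        }
    }

    static Map<String, List<Row>> clusterWithinBuckets(List<Row> rows) {
        // Step 1: "shuffle" by partition + bucket (here: an in-memory group-by).
        Map<String, List<Row>> groups = new TreeMap<>();
        for (Row row : rows) {
            String key = row.partition + "/bucket-" + row.bucket;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        // Step 2: cluster locally. Each group is sorted on its own, so no
        // global range partitioning across groups is needed.
        for (List<Row> group : groups.values()) {
            group.sort(Comparator.comparingLong((Row r) -> r.clusteringKey));
        }
        return groups;
    }
}
```

Because each (partition, bucket) group is processed independently, the groups could in principle be handed to separate parallel tasks, which matches the per-bucket task idea discussed in the review.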
Tests
API and Format
Documentation