iceberg improvements #34911
Draft
DAlperin wants to merge 8 commits into MaterializeInc:main from DAlperin:dov/iceberg-improvements
+657
−160
Snapshot batches can contain millions of rows, causing the DeltaWriter's seen_rows HashMap to grow unbounded and consume excessive memory. For snapshots, disable position delete tracking by setting max_seen_rows=0. All deletes will use equality deletes instead, eliminating the memory overhead at the cost of slightly slower reads (acceptable for snapshots). Normal post-snapshot batches continue using position deletes as usual. Requires iceberg-rust 1b01c099 which adds the disable feature.
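As a rough illustration of the trade-off described above, here is a minimal sketch of a writer that tracks row positions only up to a cap. `SketchDeltaWriter`, `RowPosition`, and `Delete` are hypothetical names for this example and are not the iceberg-rust types; the exact semantics of `max_seen_rows` beyond "0 disables tracking" are an assumption.

```rust
use std::collections::HashMap;

/// Where a previously written row lives: data file plus row position.
#[derive(Debug, Clone, PartialEq)]
pub struct RowPosition {
    pub file: String,
    pub pos: u64,
}

/// The kind of delete emitted for a key.
#[derive(Debug, PartialEq)]
pub enum Delete {
    /// Delete by exact file position; requires the row to be in `seen_rows`.
    Position(RowPosition),
    /// Delete by key equality; requires no per-row state in the writer.
    Equality(String),
}

/// Hypothetical, simplified delta writer (NOT the iceberg-rust DeltaWriter).
pub struct SketchDeltaWriter {
    seen_rows: HashMap<String, RowPosition>,
    max_seen_rows: usize,
}

impl SketchDeltaWriter {
    pub fn new(max_seen_rows: usize) -> Self {
        Self { seen_rows: HashMap::new(), max_seen_rows }
    }

    /// Record an insert. The position is tracked only while the map is under
    /// the cap, so a cap of 0 means nothing is ever tracked.
    pub fn insert(&mut self, key: &str, file: &str, pos: u64) {
        if self.seen_rows.len() < self.max_seen_rows {
            let position = RowPosition { file: file.to_string(), pos };
            self.seen_rows.insert(key.to_string(), position);
        }
    }

    /// Emit a position delete if the row was tracked, else an equality delete.
    pub fn delete(&mut self, key: &str) -> Delete {
        match self.seen_rows.remove(key) {
            Some(position) => Delete::Position(position),
            None => Delete::Equality(key.to_string()),
        }
    }

    /// Number of rows currently held in memory for position deletes.
    pub fn tracked_rows(&self) -> usize {
        self.seen_rows.len()
    }
}
```

With a cap of 0 the map stays empty no matter how many rows the snapshot contains, and every delete falls through to the equality path.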
For fresh sinks, the catch-up batch was incorrectly starting from Timestamp::minimum() instead of as_of, causing it to cover a range where no data exists. Use max(resume_upper, as_of) as the batch lower bound to handle both:
- Fresh sinks: start from as_of (where data actually begins)
- Resuming sinks: start from resume_upper (where we left off)
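The lower-bound rule can be sketched as follows. This is a minimal illustration: `Timestamp` is modeled as a plain `u64`, `catch_up_lower` is a hypothetical helper name, and it assumes a fresh sink reports `resume_upper` as the minimum timestamp (0 here).

```rust
/// Hypothetical stand-in for the sink's timestamp type.
pub type Timestamp = u64;

/// Lower bound of the catch-up batch.
///
/// Assumption: a fresh sink has `resume_upper == 0` (the minimum timestamp),
/// so the max picks `as_of`; a resuming sink that has already written past
/// `as_of` has the larger `resume_upper`, so the max picks that instead.
pub fn catch_up_lower(resume_upper: Timestamp, as_of: Timestamp) -> Timestamp {
    std::cmp::max(resume_upper, as_of)
}
```

For example, `catch_up_lower(0, 100)` yields 100 for a fresh sink, while `catch_up_lower(250, 100)` yields 250 for a sink resuming past its original as_of.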
Add debug! and trace! logging at key points to help diagnose issues:
- Batch description minting (catch-up and future batches)
- Waiting for first batch description before processing data
- Batch descriptions received by write operator
- Stashed rows (trace level) and periodic stash size warnings
- Batch closing with frontier positions
- Files written per batch
This will help debug snapshot processing issues and frontier advancement.
Track max observed timestamps before init to synthesize an upper when a bounded input closes, and exit cleanly once the frontier is empty after init. Start minting once the frontier reaches as_of/resume_upper instead of waiting past them. Close write batches when the input frontier reaches the batch upper and only rescan when batch/frontier advances.
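The frontier conditions above can be sketched with a simplified model. Everything here is an assumption for illustration: timestamps are `u64`, a totally ordered frontier is an `Option<u64>` (`None` meaning the empty frontier of a closed input), the synthesized upper is taken to be one past the maximum observed timestamp, and all names are hypothetical rather than taken from the PR.

```rust
/// Simplified frontier: `Some(t)` means times >= t may still arrive,
/// `None` is the empty frontier (the input has closed).
pub type Frontier = Option<u64>;

/// Tracks the maximum timestamp observed before initialization.
#[derive(Default)]
pub struct MaxObserved {
    max_ts: Option<u64>,
}

impl MaxObserved {
    pub fn observe(&mut self, ts: u64) {
        self.max_ts = Some(self.max_ts.map_or(ts, |m| m.max(ts)));
    }

    /// Upper to use when a bounded input closes: one past the largest
    /// timestamp seen (an assumption of this sketch), or `None` if no
    /// data was observed at all.
    pub fn synthesized_upper(&self) -> Option<u64> {
        self.max_ts.map(|m| m.saturating_add(1))
    }
}

/// Minting may start once the frontier has reached `start` (as_of or
/// resume_upper); `>=` rather than `>` models "reaches" instead of "passes".
pub fn can_start_minting(frontier: Frontier, start: u64) -> bool {
    match frontier {
        Some(f) => f >= start,
        None => true,
    }
}

/// A write batch can close once the input frontier reaches the batch upper,
/// or once the input has closed entirely.
pub fn batch_can_close(frontier: Frontier, batch_upper: u64) -> bool {
    match frontier {
        Some(f) => f >= batch_upper,
        None => true,
    }
}
```

The two predicates are deliberately inclusive at the boundary: a frontier exactly equal to the target already guarantees that no earlier data can arrive.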
Ensure inactive mint workers drop the table-ready capability so downstream operators do not block waiting for a ready signal.
afecddf to fa9d85f
836f684 to b01656a
e3d2ace to edcbf1e
Switch from using REST catalog for S3 Tables connections to the native
S3TablesCatalog implementation from iceberg-rust.
Changes:
- Add iceberg-catalog-s3tables and aws-sdk-s3tables dependencies
- Update connect_s3tables() to use S3TablesCatalogBuilder with pre-configured
  aws-sdk-s3tables client for S3 Tables API calls
- For static credentials: pass access key/secret as FileIO properties
- For AssumeRole: use CustomAwsCredentialLoader to provide the full
  credential chain (ambient → jump role → user role with external ID)
- Update load_or_create_table() to recognize S3 Tables NotFoundException
  error format ("The specified table does not exist")
- Update workspace to use iceberg-rust rev 1b3541c6 which includes:
  - with_file_io_extension() method for S3TablesCatalog
  - debug tracing for update_table performance tracking
This ensures S3 Tables connections properly propagate auth configuration
for both the control plane (S3 Tables API) and data plane (S3 object access).
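The NotFoundException handling mentioned in the list above could look roughly like this. It is a sketch only: `is_table_not_found`, `TableAction`, and `on_load_error` are hypothetical names, the error is modeled as a plain string rather than the catalog's real error type, and only the quoted message text comes from the commit description.

```rust
/// Message text that S3 Tables returns for a missing table, as quoted in
/// the commit description.
const S3_TABLES_NOT_FOUND: &str = "The specified table does not exist";

/// Returns true if a load error looks like "table not found".
pub fn is_table_not_found(error_message: &str) -> bool {
    error_message.contains(S3_TABLES_NOT_FOUND)
}

/// What a load-or-create routine should do after a failed load.
#[derive(Debug, PartialEq)]
pub enum TableAction {
    /// The table is missing, so create it.
    Create,
    /// Any other error is surfaced unchanged.
    Fail(String),
}

/// Hypothetical decision step: create on not-found, fail otherwise.
pub fn on_load_error(error_message: &str) -> TableAction {
    if is_table_not_found(error_message) {
        TableAction::Create
    } else {
        TableAction::Fail(error_message.to_string())
    }
}
```

Matching on message text is brittle if the service changes its wording, so a structured error kind would be preferable wherever the catalog exposes one.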
edcbf1e to 72d6254
Motivation
Tips for reviewer
Checklist
$T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.