feat: add lance_dataset_merge_insert for SQL-MERGE-style upsert #39
Merged
jja725 merged 4 commits into lance-format:main on May 8, 2026
Wraps the upstream `MergeInsertBuilder` so callers can merge an Arrow record-batch stream into an existing dataset keyed on `on_columns`, covering find-or-create, upsert, replace-region-of-data, and bulk-key deletion scenarios. Behavior is configured via a single `LanceMergeInsertParams` struct mirroring upstream's three orthogonal mode enums (`when_matched`, `when_not_matched`, `when_not_matched_by_source`). Pass `params=NULL` for the find-or-create defaults. Validation is centralized in the Rust layer: stream is consumed first (matching `lance_dataset_write` so it cannot leak on later error paths), then the dataset / key / params are checked at the FFI boundary, and finally upstream's schema-compatibility, parser, and commit-conflict errors flow through unchanged. Out-of-range mode discriminants and extraneous expression strings on non-`*_IF` modes are rejected so the contract is unambiguous. The C++ wrapper exposes both a full `Dataset::merge_insert(...)` and a convenience `Dataset::upsert(...)` overload.
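To make the interaction of the three modes concrete, here is a toy in-memory model of the merge semantics described above. It is only a sketch and does not call Lance: `Table`, `MergeModes`, and `merge_insert_model` are hypothetical names, the table is reduced to one integer key column and one string payload, the `*_IF` variants are left out, and the `Delete` and second `DoNothing` variants are assumptions about what the upstream enums offer.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative stand-ins for the three mode enums (not the real definitions).
enum class WhenMatched { DoNothing, UpdateAll };
enum class WhenNotMatched { InsertAll, DoNothing };
enum class WhenNotMatchedBySource { Keep, Delete };

// Default member values model the find-or-create defaults.
struct MergeModes {
    WhenMatched when_matched = WhenMatched::DoNothing;
    WhenNotMatched when_not_matched = WhenNotMatched::InsertAll;
    WhenNotMatchedBySource when_not_matched_by_source = WhenNotMatchedBySource::Keep;
};

// Key column -> payload column. Keys are unique, as in a keyed merge.
using Table = std::map<int, std::string>;

Table merge_insert_model(const Table& target, const Table& source,
                         const MergeModes& modes) {
    Table result;
    for (const auto& row : target) {
        auto hit = source.find(row.first);
        if (hit == source.end()) {
            // Target row with no counterpart in the source.
            if (modes.when_not_matched_by_source == WhenNotMatchedBySource::Keep) {
                result.emplace(row.first, row.second);
            }
        } else {
            // Key present on both sides: keep the old row or take the source row.
            const bool update = modes.when_matched == WhenMatched::UpdateAll;
            result.emplace(row.first, update ? hit->second : row.second);
        }
    }
    for (const auto& row : source) {
        // Source row whose key is absent from the target.
        if (target.count(row.first) == 0 &&
            modes.when_not_matched == WhenNotMatched::InsertAll) {
            result.emplace(row.first, row.second);
        }
    }
    return result;
}

// Usage: the same inputs under find-or-create, upsert, and replace-style modes.
inline void merge_model_usage() {
    const Table target{{1, "a"}, {2, "b"}};
    const Table source{{2, "B"}, {3, "c"}};

    MergeModes find_or_create;  // defaults
    assert(merge_insert_model(target, source, find_or_create).at(2) == "b");

    MergeModes upsert;
    upsert.when_matched = WhenMatched::UpdateAll;
    assert(merge_insert_model(target, source, upsert).at(2) == "B");

    MergeModes replace = upsert;
    replace.when_not_matched_by_source = WhenNotMatchedBySource::Delete;
    assert(merge_insert_model(target, source, replace).count(1) == 0);
}
```

Because the three modes are orthogonal, each one is decided by a separate branch, which is why a single params struct can describe all of the scenarios listed above.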
jja725 approved these changes on May 8, 2026
Thank you @jja725
Summary
Adds `lance_dataset_merge_insert`, the next predicate-driven mutation primitive after `_delete` (#31) and `_update` (#33). It exposes upstream's `MergeInsertBuilder` to C/C++ callers, covering the four common shapes of a SQL-style MERGE in one entry point: find-or-create, upsert, replace-region-of-data, and bulk delete by key.

Behavior is controlled via a single `LanceMergeInsertParams` struct that maps to upstream's three orthogonal mode enums (`when_matched`, `when_not_matched`, `when_not_matched_by_source`). `params=NULL` selects the find-or-create defaults (`DoNothing` / `InsertAll` / `Keep`), and a zero-initialized struct picks the same defaults thanks to discriminant pinning. The two `*_IF` modes consume an SQL filter; expressions on the wrong mode (or empty strings) are rejected at the FFI boundary so the contract is unambiguous.

The validation order matches `lance_dataset_write` for stream lifetime: the source stream is the only thing checked before `from_raw` consumes it, which keeps the documented "stream is consumed on every return path" guarantee. Dataset / key / params errors fire afterwards, followed by upstream's own schema-compatibility, parser, and commit-conflict diagnostics. The mutation itself follows the same snapshot-and-republish pattern that `_update` already uses, so existing scanners holding a clone of the inner `Arc` keep their pre-merge view.

C++ surface
`Dataset::merge_insert(on_columns, source, params=nullptr)` is the full surface. A `Dataset::upsert(on_columns, source)` convenience overload covers the most common case (`UpdateAll` + `InsertAll` + `Keep`) without needing to fill out a params struct.

Test plan
- [...] `num_on_columns=0`, unknown key column, out-of-range mode discriminants for all three enums, missing/empty/extraneous expression strings, no-op configuration, schema mismatch, version bump, optional `out_result`, untouched-on-error semantics): 28 tests.
- [...] `compile_and_run_test`) self-merge the dataset under defaults to validate the FFI plumbing without hand-building Arrow batches in C, plus the `num_on_columns=0` rejection path.
- `cargo fmt`, `cargo clippy --all-targets -- -D warnings`, and `cargo test` all clean locally.

Out of scope
The `MergeInsertBuilder` knobs not exposed yet (`conflict_retries`, `retry_timeout`, `skip_auto_cleanup`, `use_index`, `source_dedupe_behavior`, `commit_retries`, `mark_generations_as_merged`) and the `execute_uncommitted` / `explain_plan` / `analyze_plan` paths. The params struct can grow without breaking the ABI when there's a concrete need.
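
The params contract from the Summary (NULL or an all-zero struct means find-or-create, out-of-range discriminants are rejected, and an expression is required exactly when the mode is an `*_IF` mode) can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `ParamsSketch`, its field names, the constant names, and the numeric values other than zero for the defaults are all hypothetical, and the real checks live in the Rust layer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical discriminants. Each default is pinned to 0 so that a
// zero-initialized struct selects DoNothing / InsertAll / Keep.
enum : int32_t { MATCHED_DO_NOTHING = 0, MATCHED_UPDATE_ALL = 1, MATCHED_UPDATE_IF = 2 };
enum : int32_t { NOT_MATCHED_INSERT_ALL = 0, NOT_MATCHED_DO_NOTHING = 1 };
enum : int32_t { BY_SOURCE_KEEP = 0, BY_SOURCE_DELETE = 1, BY_SOURCE_DELETE_IF = 2 };

// Hypothetical C-compatible layout; the real struct may differ.
struct ParamsSketch {
    int32_t when_matched;
    int32_t when_not_matched;
    int32_t when_not_matched_by_source;
    const char* update_if_expr;  // only meaningful for MATCHED_UPDATE_IF
    const char* delete_if_expr;  // only meaningful for BY_SOURCE_DELETE_IF
};

// An expression must be present and non-empty for an *_IF mode, and absent otherwise.
static std::string check_expr(bool mode_takes_expr, const char* expr, const char* field) {
    if (mode_takes_expr) {
        if (expr == nullptr || expr[0] == '\0') {
            return std::string(field) + ": missing or empty expression";
        }
    } else if (expr != nullptr) {
        return std::string(field) + ": expression not allowed for this mode";
    }
    return "";
}

// Returns an empty string when the params are acceptable, else a reason.
std::string validate_params(const ParamsSketch* p) {
    if (p == nullptr) return "";  // NULL selects the find-or-create defaults
    if (p->when_matched < 0 || p->when_matched > MATCHED_UPDATE_IF)
        return "when_matched: discriminant out of range";
    if (p->when_not_matched < 0 || p->when_not_matched > NOT_MATCHED_DO_NOTHING)
        return "when_not_matched: discriminant out of range";
    if (p->when_not_matched_by_source < 0 ||
        p->when_not_matched_by_source > BY_SOURCE_DELETE_IF)
        return "when_not_matched_by_source: discriminant out of range";

    std::string err = check_expr(p->when_matched == MATCHED_UPDATE_IF,
                                 p->update_if_expr, "when_matched");
    if (!err.empty()) return err;
    return check_expr(p->when_not_matched_by_source == BY_SOURCE_DELETE_IF,
                      p->delete_if_expr, "when_not_matched_by_source");
}

// Usage: zero-initialization and NULL are both accepted as the defaults.
inline void validation_usage() {
    ParamsSketch zeroed{};
    assert(validate_params(nullptr).empty());
    assert(validate_params(&zeroed).empty());

    ParamsSketch stray{};
    stray.update_if_expr = "x > 1";  // expression on a non-*_IF mode
    assert(!validate_params(&stray).empty());
}
```

Pinning every default to zero is also what lets a struct like this grow: a caller that zero-initializes it gets the default for any field added later.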