feat: add lance_dataset_merge_insert for SQL-MERGE-style upsert #39

Merged
jja725 merged 4 commits into lance-format:main from LuciferYang:feat/dataset-merge-insert
May 8, 2026
@LuciferYang
Contributor

Summary

Adds lance_dataset_merge_insert, the next predicate-driven mutation primitive after _delete (#31) and _update (#33). It exposes upstream's MergeInsertBuilder to C/C++ callers, covering the four common shapes of a SQL-style MERGE in one entry point: find-or-create, upsert, replace-region-of-data, and bulk delete by key.
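To make the mode combinations concrete, here is a toy in-memory model of a keyed merge. It is only a sketch: the enums below are simplified stand-ins that borrow the variant names mentioned in this description, the `*_IF` variants are left out, and `toy_merge` and `rows` are hypothetical helpers, not part of this PR or of upstream.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the three mode enums (no *_IF variants).
#[derive(Clone, Copy)]
enum WhenMatched { DoNothing, UpdateAll }
#[derive(Clone, Copy)]
enum WhenNotMatched { InsertAll, DoNothing }
#[derive(Clone, Copy)]
enum WhenNotMatchedBySource { Keep, Delete }

/// Merge `source` into `target`; both are keyed by an integer id.
fn toy_merge(
    target: &BTreeMap<i64, String>,
    source: &BTreeMap<i64, String>,
    matched: WhenMatched,
    not_matched: WhenNotMatched,
    by_source: WhenNotMatchedBySource,
) -> BTreeMap<i64, String> {
    let mut out = BTreeMap::new();
    // Walk the target rows: each one is either matched by a source key or not.
    for (key, old) in target {
        match (source.get(key), matched, by_source) {
            (Some(new), WhenMatched::UpdateAll, _) => { out.insert(*key, new.clone()); }
            (Some(_), WhenMatched::DoNothing, _) => { out.insert(*key, old.clone()); }
            (None, _, WhenNotMatchedBySource::Keep) => { out.insert(*key, old.clone()); }
            (None, _, WhenNotMatchedBySource::Delete) => {}
        }
    }
    // Source rows whose key is absent from the target.
    if let WhenNotMatched::InsertAll = not_matched {
        for (key, new) in source {
            if !target.contains_key(key) {
                out.insert(*key, new.clone());
            }
        }
    }
    out
}

/// Build a small keyed table from literal pairs.
fn rows(pairs: &[(i64, &str)]) -> BTreeMap<i64, String> {
    pairs.iter().map(|(k, v)| (*k, v.to_string())).collect()
}

fn main() {
    let target = rows(&[(1, "a"), (2, "b")]);
    let source = rows(&[(2, "B"), (3, "C")]);
    let find_or_create = toy_merge(&target, &source, WhenMatched::DoNothing,
        WhenNotMatched::InsertAll, WhenNotMatchedBySource::Keep);
    let upsert = toy_merge(&target, &source, WhenMatched::UpdateAll,
        WhenNotMatched::InsertAll, WhenNotMatchedBySource::Keep);
    let update_only = toy_merge(&target, &source, WhenMatched::UpdateAll,
        WhenNotMatched::DoNothing, WhenNotMatchedBySource::Keep);
    let replace_all = toy_merge(&target, &source, WhenMatched::UpdateAll,
        WhenNotMatched::InsertAll, WhenNotMatchedBySource::Delete);
    println!("{find_or_create:?}\n{upsert:?}\n{update_only:?}\n{replace_all:?}");
}
```

Because the three modes are orthogonal, each combination is just a different triple passed to the same function, which is the idea behind exposing a single entry point.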

Behavior is controlled via a single LanceMergeInsertParams struct that maps to upstream's three orthogonal mode enums (when_matched, when_not_matched, when_not_matched_by_source). params=NULL selects the find-or-create defaults (DoNothing / InsertAll / Keep), and a zero-initialized struct picks the same defaults thanks to discriminant pinning. The two *_IF modes consume an SQL filter; expressions on the wrong mode (or empty strings) are rejected at the FFI boundary so the contract is unambiguous.
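The discriminant pinning and boundary checks described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the numeric values, the `UpdateIf` variant name, the field names, and the error type are hypothetical, and a plain `Option<&str>` stands in for the nullable C string a real FFI struct would carry. Only one of the three modes is modelled.

```rust
// Hypothetical discriminants: 0 is pinned to the default variant so that a
// zero-initialized struct selects it without any extra handling.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(i32)]
enum WhenMatched {
    DoNothing = 0,
    UpdateAll = 1,
    UpdateIf = 2,
}

#[derive(Debug, PartialEq)]
enum ParamError {
    BadDiscriminant(i32),
    MissingExpr,
    EmptyExpr,
    ExtraneousExpr,
}

// Plain-Rust stand-in for the C params struct: a raw integer mode plus a
// nullable filter expression. `Default` plays the role of zero-initialization.
#[derive(Default)]
struct RawParams<'a> {
    when_matched: i32,
    when_matched_expr: Option<&'a str>,
}

/// Reject integers that do not name a known variant.
fn decode_mode(raw: i32) -> Result<WhenMatched, ParamError> {
    match raw {
        0 => Ok(WhenMatched::DoNothing),
        1 => Ok(WhenMatched::UpdateAll),
        2 => Ok(WhenMatched::UpdateIf),
        other => Err(ParamError::BadDiscriminant(other)),
    }
}

/// An expression is required for the *_IF mode and forbidden everywhere else.
fn validate<'a>(p: &RawParams<'a>) -> Result<(WhenMatched, Option<&'a str>), ParamError> {
    let mode = decode_mode(p.when_matched)?;
    match (mode, p.when_matched_expr) {
        (WhenMatched::UpdateIf, None) => Err(ParamError::MissingExpr),
        (WhenMatched::UpdateIf, Some(e)) if e.is_empty() => Err(ParamError::EmptyExpr),
        (WhenMatched::UpdateIf, Some(e)) => Ok((mode, Some(e))),
        (_, Some(_)) => Err(ParamError::ExtraneousExpr),
        (_, None) => Ok((mode, None)),
    }
}

fn main() {
    // A defaulted (all-zero) struct decodes to the default mode.
    println!("{:?}", validate(&RawParams::default()));
    // An expression attached to a non-*_IF mode is rejected.
    let bad = RawParams { when_matched: 1, when_matched_expr: Some("x > 1") };
    println!("{:?}", validate(&bad));
}
```

Rejecting an expression on a mode that would ignore it means a caller who sets the wrong mode gets an error rather than a filter that is silently dropped.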

The validation order matches lance_dataset_write for stream lifetime: the source stream is the only thing checked before from_raw consumes it, which keeps the documented "stream is consumed on every return path" guarantee. Dataset / key / params errors fire afterwards, followed by upstream's own schema-compatibility, parser, and commit-conflict diagnostics. The mutation itself follows the same snapshot-and-republish pattern that _update already uses, so existing scanners holding a clone of the inner Arc keep their pre-merge view.
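The snapshot-and-republish idea can be sketched in a few lines. This is a generic illustration of the pattern, not the code in this PR: `Snapshot`, `Handle`, and the use of an `RwLock` around the `Arc` are all assumptions made for the example.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for dataset state; a real dataset holds far more.
#[derive(Clone, Debug, PartialEq)]
struct Snapshot {
    version: u64,
    rows: Vec<i64>,
}

// The handle owns a swappable pointer to the current immutable snapshot.
struct Handle {
    inner: RwLock<Arc<Snapshot>>,
}

impl Handle {
    fn new(rows: Vec<i64>) -> Self {
        Handle { inner: RwLock::new(Arc::new(Snapshot { version: 1, rows })) }
    }

    /// What a scanner would hold: a cheap clone of the current Arc.
    fn snapshot(&self) -> Arc<Snapshot> {
        Arc::clone(&self.inner.read().unwrap())
    }

    /// Build the next version off to the side, then republish it by
    /// swapping the Arc. Existing clones keep pointing at the old value.
    fn mutate(&self, f: impl FnOnce(&mut Vec<i64>)) {
        let mut guard = self.inner.write().unwrap();
        let mut next: Snapshot = (**guard).clone();
        f(&mut next.rows);
        next.version += 1;
        *guard = Arc::new(next);
    }
}

fn main() {
    let handle = Handle::new(vec![1, 2]);
    let scanner_view = handle.snapshot(); // taken before the mutation
    handle.mutate(|rows| rows.push(3));
    println!("scanner still sees {:?}", scanner_view);
    println!("new readers see    {:?}", handle.snapshot());
}
```

The old snapshot is never modified in place, so a reader that cloned the `Arc` before the mutation needs no coordination with the writer.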

C++ surface

Dataset::merge_insert(on_columns, source, params=nullptr) is the full surface. A Dataset::upsert(on_columns, source) convenience overload covers the most common case (UpdateAll + InsertAll + Keep) without needing to fill out a params struct.

Test plan

  • Rust integration tests cover all 4 SQL-MERGE shapes plus boundary rejections (NULL/empty args, num_on_columns=0, unknown key column, out-of-range mode discriminants for all three enums, missing/empty/extraneous expression strings, no-op configuration, schema mismatch, version bump, optional out_result, untouched-on-error semantics) — 28 tests.
  • C and C++ smoke tests (compile_and_run_test) self-merge the dataset under defaults to validate the FFI plumbing without hand-building Arrow batches in C, plus the num_on_columns=0 rejection path.
  • cargo fmt, cargo clippy --all-targets -- -D warnings, and cargo test all clean locally.

Out of scope

Not exposed yet: the remaining MergeInsertBuilder knobs (conflict_retries, retry_timeout, skip_auto_cleanup, use_index, source_dedupe_behavior, commit_retries, mark_generations_as_merged) and the execute_uncommitted / explain_plan / analyze_plan paths. The params struct can grow without breaking the ABI when there is a concrete need.

Wraps the upstream `MergeInsertBuilder` so callers can merge an Arrow
record-batch stream into an existing dataset keyed on `on_columns`,
covering find-or-create, upsert, replace-region-of-data, and bulk-key
deletion scenarios. Behavior is configured via a single
`LanceMergeInsertParams` struct mirroring upstream's three orthogonal
mode enums (`when_matched`, `when_not_matched`, `when_not_matched_by_source`).
Pass `params=NULL` for the find-or-create defaults.

Validation is centralized in the Rust layer: stream is consumed first
(matching `lance_dataset_write` so it cannot leak on later error paths),
then the dataset / key / params are checked at the FFI boundary, and
finally upstream's schema-compatibility, parser, and commit-conflict
errors flow through unchanged. Out-of-range mode discriminants and
extraneous expression strings on non-`*_IF` modes are rejected so the
contract is unambiguous. The C++ wrapper exposes both a full
`Dataset::merge_insert(...)` and a convenience `Dataset::upsert(...)`
overload.
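The "consume the stream first" ordering can be illustrated with ordinary Rust ownership. The sketch below is an assumption-laden model: `Stream`, `MergeError`, and this `merge_insert` signature are hypothetical, and `Option<Stream>` stands in for a nullable C pointer that the real code would convert with `from_raw`.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Hypothetical stand-in for an owned Arrow stream. Dropping it flips a
// shared flag, which models calling the stream's release callback.
struct Stream {
    released: Rc<Cell<bool>>,
}

impl Drop for Stream {
    fn drop(&mut self) {
        self.released.set(true);
    }
}

#[derive(Debug, PartialEq)]
enum MergeError {
    NullStream,
    NoKeyColumns,
}

fn merge_insert(stream: Option<Stream>, on_columns: &[&str]) -> Result<(), MergeError> {
    // Step 1: the stream is the only argument checked before ownership is taken.
    let _owned = stream.ok_or(MergeError::NullStream)?;
    // Step 2: every early return from here on drops `_owned`, so the stream
    // is released on error paths as well as on success.
    if on_columns.is_empty() {
        return Err(MergeError::NoKeyColumns);
    }
    // ... the actual merge would run here ...
    Ok(())
}

fn main() {
    let flag = Rc::new(Cell::new(false));
    let stream = Stream { released: Rc::clone(&flag) };
    let result = merge_insert(Some(stream), &[]);
    println!("result = {:?}, stream released = {}", result, flag.get());
}
```

If the key-column check ran before ownership was taken, the early return would leave the stream with the caller, who has been told it is always consumed.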
@LuciferYang LuciferYang marked this pull request as draft May 7, 2026 04:13
@LuciferYang LuciferYang marked this pull request as ready for review May 7, 2026 06:01
@jja725 jja725 self-requested a review May 8, 2026 04:33
@jja725 jja725 merged commit 3f30e80 into lance-format:main May 8, 2026
9 checks passed
@LuciferYang
Contributor Author

Thank you @jja725

@LuciferYang LuciferYang deleted the feat/dataset-merge-insert branch May 8, 2026 05:22