When lance-c is embedded in a host query engine that enforces per-query memory budgets (Velox, Spark+Comet, DuckDB, etc.), the buffers it returns over the Arrow C Data Interface are allocated by Rust's global allocator and are invisible to the host's memory accounting. Inside lance-c, the scanner also holds significant uncounted working memory (read-ahead, decoded pages, take buffers, vector-index scratch, filter evaluation). Together this creates a real OOM risk — the host can be told its budget is fine while the process RSS climbs unchecked.
This issue surfaces the problem, summarizes how comparable systems handle it, and proposes three directions for discussion before anyone designs an API.
Motivation
A Velox Lance connector landed at facebookincubator/velox#16556. It uses lance_scanner_to_arrow_stream and bridges into Velox via velox/vector/arrow/Bridge.h. Velox's ConnectorQueryCtx::memoryPool() is plumbed through but it can only account for whatever the bridge wraps — not for buffers the Rust side holds, and not at all for scanner-internal scratch. Gluten has hit exactly this class of bug (apache/gluten#1124). DataFusion-Comet's design notes call out the same accounting/spill split (apache/datafusion-comet#3873).
What's allocated where today
| Allocation site | Owner | Visible to host pool? |
| --- | --- | --- |
| `ArrowArray` buffers returned by `lance_scanner_*` | Rust global allocator (via `arrow-buffer`) | No — host wraps with release callback only |
| Scanner read-ahead / decoded pages | `arrow-rs` + `lance-encoding` internal `Bytes` | No |
| Batch coalescing / take | `arrow-rs` `MutableBuffer` | No |
| Vector-index scratch (IVF probe, HNSW visit set) | `lance-index` internal | No |
| Substrait/SQL filter evaluation | `datafusion-physical-expr` | No |
The C API has zero allocator hooks today — see include/lance/lance.h and src/scanner.rs.
Prior art
DataFusion-Comet bridges Spark's TaskMemoryManager to DataFusion via a CometUnifiedMemoryPool that implements DataFusion's MemoryPool trait and JNI-upcalls acquireMemory / releaseMemory. It works for accounting; it doesn't work for spill — NativeMemoryConsumer.spill() returns 0, so Spark cannot reclaim memory from native operators (apache/datafusion-comet#3873). This is the closest existing precedent for what a Velox/Lance integration would look like.
arrow-rs already has the right abstraction. arrow_buffer::pool::MemoryPool (reserve / available / used / capacity) plus MemoryReservation and a built-in TrackingMemoryPool exist today. It's a tracking layer over Arc<Bytes>, not a pluggable backing allocator — it does not redirect allocations away from Rust's global allocator. Buffer::from_custom_allocation exists for foreign-owned buffers but has correctness gaps (apache/arrow-rs#6362). C++ Arrow's per-instance swappable memory pool (jemalloc/system/mimalloc) has no arrow-rs equivalent.
Velox flows a per-operator MemoryPool through ConnectorQueryCtx; the published guidance is that all allocations route through it. The native Parquet reader exists in part to keep allocations inside accounting. There is no documented official pattern for connectors whose underlying reader allocates outside the pool — they typically copy-into-pool or accept the leak (Gluten #1124).
Arrow C Data Interface ownership is producer-owned by spec. No accepted RFC adds a "transfer to host pool" hook on ArrowArray; the closest related work (apache/arrow-js#82) addresses pool-aware import only, not cross-domain ownership transfer.
lance-duckdb (lance-format/lance-duckdb) ingests each batch via DuckDB's standard ArrowTableFunction::ArrowToDuckDB (src/lance_scan.cpp:1681 and similar). That helper copies Arrow buffers into DuckDB's DataChunk, which is owned by DuckDB's Allocator / BufferManager; the producer release callback then frees the Rust-side buffer. So lance-duckdb effectively gets output-only accounting for free, because DuckDB's default Arrow ingest path is copy-based, not zero-copy. There are no allocator hooks anywhere in lance-duckdb. Caveats:
Peak RSS briefly holds both copies (rust_buffer + duckdb_chunk) for each batch.
Only output is accounted; Lance-side internal scratch (read-ahead, decoded pages, vector index, take buffers) is still invisible to DuckDB's BufferManager.
Velox is not copy-based by default — importFromArrow is zero-copy with a release callback — which is why the accounting gap is much more visible in the Velox integration.
Proposed approaches
A. Reservation-callback hook on the scanner (accounting-only)
Implement arrow_buffer::pool::MemoryPool inside lance-c as a thin wrapper around user-supplied C callbacks, register it on the scanner, and thread it through to arrow-rs / lance-encoding allocation sites that accept a pool.
Accuracy: good for sites that already take a MemoryPool; gradually improvable as more lance-encoding paths thread it through.
Cost: small C-API surface; bulk of the work is plumbing reservation calls inside lance-rs.
Limit: accounting-only — same model as Comet. Host can throttle / fail-fast; cannot trigger Lance to spill.
B. Memcpy at the FFI bridge into host-pool buffers (output-only — descriptive, not recommended)
The host copies each ArrowArray into pool-allocated buffers and immediately invokes the producer release callback. No lance-c change required.
This is what lance-duckdb gets for free today via DuckDB's standard Arrow ingest. Listed for completeness because it's a real pattern in the wild, but not proposed as the path for lance-c because:
It only accounts for output memory — internal scanner scratch (read-ahead, decode, vector index, coalescing) is the larger source and stays invisible.
It's a permanent per-batch memcpy + transient ~2× memory tax that engines like Velox specifically avoid by importing zero-copy.
It gives lance-c no signal about host budget pressure, so it can never evolve toward cooperative spill.
C. File-level reader API alongside the dataset scanner
Expose lance_file_reader_open(path) -> reader and lance_file_reader_read_columns(reader, columns, range) -> ArrowArray so engines like Velox can drive the scan loop themselves (batch sizing, async scheduling, filter eval, projection — all on the engine side). Pair with reservation hooks at the chunk-decode site.
Accuracy: shrinks the unaccounted surface to "decode one column chunk" — small, predictable.
Cost: meaningful new public API surface; ongoing maintenance.
Limit: additive — doesn't replace the scanner API, which non-engine consumers (Python, Go, generic C++) still want for filter pushdown / vector search / fragment selection. Strictly better fit for engines like Velox; harder justification if no other engine integrations are expected.
Recommendation
A is the right primitive. It uses arrow-rs's existing MemoryPool trait, keeps the C surface tiny, and gives engines like Velox the visibility they need to throttle and fail-fast cleanly. It also captures Lance-side internal scratch — which dominates unaccounted memory in long scans, vector search, and large coalesced batches — that B can never see. It mirrors the Comet pattern and is the only path that can later evolve toward cooperative spill (the reserve callback can return -1 under pressure).
Acknowledge up front (as Comet does) that A is accounting-only: cooperative spill from inside lance-c is out of scope and would need separate design.
C is worth opening as a separate discussion if there is interest in a "file-format" engine integration story beyond Velox; it is strictly bigger than this issue.
Out of scope
Cooperative spill from inside lance-c (Lance reclaiming memory on host signal).
Replacing Rust's global allocator with a per-stream allocator (would require upstream arrow-rs work — there is no per-instance pluggable allocator today, only the tracking MemoryPool).
GPU/CUDA memory; the C Device Data Interface uses the same producer-owned model.
References
arrow-rs MemoryPool trait: arrow_buffer::pool::MemoryPool; foreign-buffer gap: Buffer::from_custom_allocation (apache/arrow-rs#6362)
lance-duckdb ingest path: src/lance_scan.cpp (ArrowTableFunction::ArrowToDuckDB calls)