Memory pool integration for embedding lance-c in strict-budget query engines (Velox-like) #34

Summary

When lance-c is embedded in a host query engine that enforces per-query memory budgets (Velox, Spark+Comet, DuckDB, etc.), the buffers it returns over the Arrow C Data Interface are allocated by Rust's global allocator and are invisible to the host's memory accounting. Inside lance-c, the scanner also holds significant uncounted working memory (read-ahead, decoded pages, take buffers, vector-index scratch, filter evaluation). Together this creates a real OOM risk — the host can be told its budget is fine while the process RSS climbs unchecked.

This issue surfaces the problem, summarizes how comparable systems handle it, and proposes three directions for discussion before anyone designs an API.

Motivation

A Velox Lance connector landed at facebookincubator/velox#16556. It uses lance_scanner_to_arrow_stream and bridges into Velox via velox/vector/arrow/Bridge.h. Velox's ConnectorQueryCtx::memoryPool() is plumbed through but it can only account for whatever the bridge wraps — not for buffers the Rust side holds, and not at all for scanner-internal scratch. Gluten has hit exactly this class of bug (apache/gluten#1124). DataFusion-Comet's design notes call out the same accounting/spill split (apache/datafusion-comet#3873).

What's allocated where today

| Allocation site | Owner | Visible to host pool |
|---|---|---|
| ArrowArray buffers returned by lance_scanner_* | Rust global allocator (via arrow-buffer) | No — host wraps with release callback only |
| Scanner read-ahead / decoded pages | arrow-rs + lance-encoding internal Bytes | No |
| Batch coalescing / take | arrow-rs MutableBuffer | No |
| Vector-index scratch (IVF probe, HNSW visit set) | lance-index internal | No |
| Substrait/SQL filter evaluation | datafusion-physical-expr | No |

The C API has zero allocator hooks today — see include/lance/lance.h and src/scanner.rs.

Prior art

  1. DataFusion-Comet bridges Spark's TaskMemoryManager to DataFusion via a CometUnifiedMemoryPool that implements DataFusion's MemoryPool trait and upcalls over JNI to acquireMemory / releaseMemory. It works for accounting; it doesn't work for spill — NativeMemoryConsumer.spill() returns 0, so Spark cannot reclaim memory from native operators (apache/datafusion-comet#3873). This is the closest existing precedent for what a Velox/Lance integration would look like.

  2. arrow-rs already has the right abstraction. arrow_buffer::pool::MemoryPool (reserve / available / used / capacity) plus MemoryReservation and a built-in TrackingMemoryPool exist today. It's a tracking layer over Arc<Bytes>, not a pluggable backing allocator — it does not redirect allocations away from Rust's global allocator. Buffer::from_custom_allocation exists for foreign-owned buffers but has correctness gaps (apache/arrow-rs#6362). C++'s per-instance swappable pool (jemalloc/system/mimalloc) has no arrow-rs equivalent.

  3. Velox flows a per-operator MemoryPool through ConnectorQueryCtx; the published guidance is that all allocations route through it. The native Parquet reader exists in part to keep allocations inside accounting. There is no documented official pattern for connectors whose underlying reader allocates outside the pool — they typically copy-into-pool or accept the leak (Gluten #1124).

  4. Arrow C Data Interface ownership is producer-owned by spec. No accepted RFC adds a "transfer to host pool" hook on ArrowArray; the closest related work (apache/arrow-js#82) addresses pool-aware import only, not cross-domain ownership transfer.

  5. lance-duckdb (lance-format/lance-duckdb) ingests each batch via DuckDB's standard ArrowTableFunction::ArrowToDuckDB (src/lance_scan.cpp:1681 and similar). That helper copies Arrow buffers into DuckDB's DataChunk, which is owned by DuckDB's Allocator / BufferManager; the producer release callback then frees the Rust-side buffer. So lance-duckdb effectively gets output-only accounting for free, because DuckDB's default Arrow ingest path is copy-based, not zero-copy. There are no allocator hooks anywhere in lance-duckdb. Caveats:

    • Peak RSS per batch briefly holds both copies (the Rust-side buffer plus the DuckDB chunk) until the producer release callback frees the former.
    • Only output is accounted; Lance-side internal scratch (read-ahead, decoded pages, vector index, take buffers) is still invisible to DuckDB's BufferManager.
    • Velox is not copy-based by default — importFromArrow is zero-copy with a release callback — which is why the accounting gap is much more visible in the Velox integration.

Proposed approaches

A. Reservation-callback hook on the scanner (accounting-only)

Implement arrow_buffer::pool::MemoryPool inside lance-c as a thin wrapper around user-supplied C callbacks, register it on the scanner, and thread it through to arrow-rs / lance-encoding allocation sites that accept a pool.

typedef int32_t (*LanceReserveFn)(void* ctx, int64_t bytes);  /* 0 ok, -1 over budget */
typedef void    (*LanceReleaseFn)(void* ctx, int64_t bytes);

int32_t lance_scanner_set_memory_reservation(
    LanceScanner*,
    LanceReserveFn reserve,
    LanceReleaseFn release,
    void* ctx);
  • Accuracy: good for sites that already take a MemoryPool; gradually improvable as more lance-encoding paths thread it through.
  • Cost: small C-API surface; bulk of the work is plumbing reservation calls inside lance-rs.
  • Limit: accounting-only — same model as Comet. Host can throttle / fail-fast; cannot trigger Lance to spill.
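
As a sanity check on the shape of this API, the two callbacks can be backed by a plain atomic budget counter on the host side. The sketch below is illustrative only: HostBudget and per_query_budget_bytes are hypothetical names, and the lance_* symbols are the ones proposed above, not an existing API.

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_llong reserved;   /* bytes currently reserved by lance-c */
    long long    limit;      /* per-query budget handed down by the host engine */
} HostBudget;

static int32_t host_reserve(void* ctx, int64_t bytes) {
    HostBudget* b = (HostBudget*)ctx;
    long long now = atomic_fetch_add(&b->reserved, (long long)bytes) + (long long)bytes;
    if (now > b->limit) {                          /* over budget: undo and refuse */
        atomic_fetch_sub(&b->reserved, (long long)bytes);
        return -1;
    }
    return 0;
}

static void host_release(void* ctx, int64_t bytes) {
    HostBudget* b = (HostBudget*)ctx;
    atomic_fetch_sub(&b->reserved, (long long)bytes);
}

/* Registration (error handling elided):
 *   HostBudget budget = { .limit = per_query_budget_bytes };
 *   lance_scanner_set_memory_reservation(scanner, host_reserve, host_release, &budget);
 */

In a real Velox connector the reserve path would charge the query's MemoryPool (via ConnectorQueryCtx::memoryPool()) rather than a local counter, so Lance's internal scratch lands in the same accounting as every other operator.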

B. Memcpy at the FFI bridge into host-pool buffers (output-only — descriptive, not recommended)

The host copies each ArrowArray into pool-allocated buffers and immediately invokes the producer release callback. No lance-c change required.

This is what lance-duckdb gets for free today via DuckDB's standard Arrow ingest. Listed for completeness because it's a real pattern in the wild, but not proposed as the path for lance-c because:

  • It only accounts for output memory — internal scanner scratch (read-ahead, decode, vector index, coalescing) is the larger source and stays invisible.
  • It's a permanent per-batch memcpy + transient ~2× memory tax that engines like Velox specifically avoid by importing zero-copy.
  • It gives lance-c no signal about host budget pressure, so it can never evolve toward cooperative spill.
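
For reference only, since B is not the recommended path: the bridge-side copy looks roughly like the sketch below, simplified to one non-nullable int64 column. host_pool_alloc is a hypothetical stand-in for whatever accounted allocation the engine exposes; a real bridge walks the ArrowSchema and copies every buffer of every child.

#include <stdint.h>
#include <string.h>
#include "arrow/c/abi.h"   /* struct ArrowArray, as defined by the Arrow C Data Interface */

extern void* host_pool_alloc(int64_t bytes);   /* hypothetical: charged to the query's pool */

static int64_t* copy_int64_values(const struct ArrowArray* col) {
    int64_t nbytes = col->length * (int64_t)sizeof(int64_t);
    int64_t* dst = (int64_t*)host_pool_alloc(nbytes);
    if (dst == NULL) return NULL;   /* over budget */
    /* buffers[0] is the validity bitmap (absent here); buffers[1] holds the
     * values; offset is expressed in elements, not bytes. */
    memcpy(dst, (const int64_t*)col->buffers[1] + col->offset, (size_t)nbytes);
    return dst;
}

/* After every column of the batch has been copied, the consumer calls the
 * producer's release callback (batch->release(batch)), freeing the Rust-side
 * buffers; the window between copy and release is the transient ~2x tax above. */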

C. File-level reader API alongside the dataset scanner

Expose lance_file_reader_open(path) -> reader and lance_file_reader_read_columns(reader, columns, range) -> ArrowArray so engines like Velox can drive the scan loop themselves (batch sizing, async scheduling, filter eval, projection — all on the engine side). Pair with reservation hooks at the chunk-decode site.

  • Accuracy: shrinks the unaccounted surface to "decode one column chunk" — small, predictable.
  • Cost: meaningful new public API surface; ongoing maintenance.
  • Limit: additive — doesn't replace the scanner API, which non-engine consumers (Python, Go, generic C++) still want for filter pushdown / vector search / fragment selection. Strictly better fit for engines like Velox; harder justification if no other engine integrations are expected.
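
To make the shape concrete: only the two function names below appear in the proposal; every parameter, the close function, and the return conventions are illustrative assumptions, not a designed interface.

#include <stddef.h>
#include <stdint.h>

typedef struct LanceFileReader LanceFileReader;   /* opaque handle */
struct ArrowArray;                                 /* Arrow C Data Interface */

/* Open a single Lance file (not a dataset); returns NULL on failure. */
LanceFileReader* lance_file_reader_open(const char* path);

/* Decode the requested columns for rows [row_start, row_start + row_count) into
 * a producer-owned ArrowArray. The reservation callbacks from approach A would
 * be charged here, at the chunk-decode site. Returns 0 on success. */
int32_t lance_file_reader_read_columns(
    LanceFileReader*   reader,
    const uint32_t*    column_indices,
    size_t             num_columns,
    int64_t            row_start,
    int64_t            row_count,
    struct ArrowArray* out);

void lance_file_reader_close(LanceFileReader* reader);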

Recommendation

A is the right primitive. It uses arrow-rs's existing MemoryPool trait, keeps the C surface tiny, and gives engines like Velox the visibility they need to throttle and fail-fast cleanly. It also captures Lance-side internal scratch — which dominates unaccounted memory in long scans, vector search, and large coalesced batches — that B can never see. It mirrors the Comet pattern and is the only path that can later evolve toward cooperative spill (the reserve callback can return -1 under pressure).

Acknowledge up front (as Comet does) that A is accounting-only: cooperative spill from inside lance-c is out of scope and would need a separate design.

C is worth opening as a separate discussion if there is interest in a "file-format" engine integration story beyond Velox; it is strictly bigger than this issue.

Out of scope

  • Cooperative spill from inside lance-c (Lance reclaiming memory on host signal).
  • Replacing Rust's global allocator with a per-stream allocator (would require upstream arrow-rs work — there is no per-instance pluggable allocator today, only the tracking MemoryPool).
  • GPU/CUDA memory; the C Device Data Interface uses the same producer-owned model.
