Memory pool integration for embedding lance-c in strict-budget query engines (Velox-like) #34

Summary

When lance-c is embedded in a host query engine that enforces per-query memory budgets (Velox, Spark+Comet, DuckDB, etc.), the buffers it returns over the Arrow C Data Interface are allocated by Rust's global allocator and are invisible to the host's memory accounting. Inside lance-c, the scanner also holds significant uncounted working memory (read-ahead, decoded pages, take buffers, vector-index scratch, filter evaluation). Together this creates a real OOM risk — the host can be told its budget is fine while the process RSS climbs unchecked.

This issue surfaces the problem, summarizes how comparable systems handle it, and proposes three directions for discussion before anyone designs an API.

Motivation

A Velox Lance connector landed at facebookincubator/velox#16556. It uses lance_scanner_to_arrow_stream and bridges into Velox via velox/vector/arrow/Bridge.h. Velox's ConnectorQueryCtx::memoryPool() is plumbed through but it can only account for whatever the bridge wraps — not for buffers the Rust side holds, and not at all for scanner-internal scratch. Gluten has hit exactly this class of bug (apache/gluten#1124). DataFusion-Comet's design notes call out the same accounting/spill split (apache/datafusion-comet#3873).

What's allocated where today

| Allocation site | Owner | Visible to host pool |
|---|---|---|
| ArrowArray buffers returned by lance_scanner_* | Rust global allocator (via arrow-buffer) | No — host wraps with release callback only |
| Scanner read-ahead / decoded pages | arrow-rs + lance-encoding internal Bytes | No |
| Batch coalescing / take | arrow-rs MutableBuffer | No |
| Vector-index scratch (IVF probe, HNSW visit set) | lance-index internal | No |
| Substrait/SQL filter evaluation | datafusion-physical-expr | No |

The C API has zero allocator hooks today — see include/lance/lance.h and src/scanner.rs.

Prior art

  1. DataFusion-Comet bridges Spark's TaskMemoryManager to DataFusion via a CometUnifiedMemoryPool that implements DataFusion's MemoryPool trait and upcalls over JNI to acquireMemory / releaseMemory. It works for accounting; it doesn't work for spill — NativeMemoryConsumer.spill() returns 0, so Spark cannot reclaim memory from native operators (apache/datafusion-comet#3873). This is the closest existing precedent for what a Velox/Lance integration would look like.

  2. arrow-rs already has the right abstraction. arrow_buffer::pool::MemoryPool (reserve / available / used / capacity) plus MemoryReservation and a built-in TrackingMemoryPool exist today. It's a tracking layer over Arc<Bytes>, not a pluggable backing allocator — it does not redirect allocations away from Rust's global allocator. Buffer::from_custom_allocation exists for foreign-owned buffers but has correctness gaps (apache/arrow-rs#6362). C++'s per-instance swappable pool (jemalloc/system/mimalloc) has no arrow-rs equivalent.

  3. Velox flows a per-operator MemoryPool through ConnectorQueryCtx; the published guidance is that all allocations route through it. The native Parquet reader exists in part to keep allocations inside accounting. There is no documented official pattern for connectors whose underlying reader allocates outside the pool — they typically copy-into-pool or accept the leak (Gluten #1124).

  4. Arrow C Data Interface ownership is producer-owned by spec. No accepted RFC adds a "transfer to host pool" hook on ArrowArray; the closest related work (apache/arrow-js#82) addresses pool-aware import only, not cross-domain ownership transfer.

  5. lance-duckdb (lance-format/lance-duckdb) ingests each batch via DuckDB's standard ArrowTableFunction::ArrowToDuckDB (src/lance_scan.cpp:1681 and similar). That helper copies Arrow buffers into DuckDB's DataChunk, which is owned by DuckDB's Allocator / BufferManager; the producer release callback then frees the Rust-side buffer. So lance-duckdb effectively gets output-only accounting for free, because DuckDB's default Arrow ingest path is copy-based, not zero-copy. There are no allocator hooks anywhere in lance-duckdb. Caveats:

    • Peak RSS per batch briefly holds both copies (the Rust-side buffer plus the DuckDB chunk) until the producer release callback frees the former.
    • Only output is accounted; Lance-side internal scratch (read-ahead, decoded pages, vector index, take buffers) is still invisible to DuckDB's BufferManager.
    • Velox is not copy-based by default — importFromArrow is zero-copy with a release callback — which is why the accounting gap is much more visible in the Velox integration.

Proposed approaches

A. Reservation-callback hook on the scanner (accounting-only)

Implement arrow_buffer::pool::MemoryPool inside lance-c as a thin wrapper around user-supplied C callbacks, register it on the scanner, and thread it through to arrow-rs / lance-encoding allocation sites that accept a pool.

typedef int32_t (*LanceReserveFn)(void* ctx, int64_t bytes);  /* 0 ok, -1 over budget */
typedef void    (*LanceReleaseFn)(void* ctx, int64_t bytes);

int32_t lance_scanner_set_memory_reservation(
    LanceScanner*,
    LanceReserveFn reserve,
    LanceReleaseFn release,
    void* ctx);
  • Accuracy: good for sites that already take a MemoryPool; gradually improvable as more lance-encoding paths thread it through.
  • Cost: small C-API surface; bulk of the work is plumbing reservation calls inside lance-rs.
  • Limit: accounting-only — same model as Comet. Host can throttle / fail-fast; cannot trigger Lance to spill.
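
As a sanity check on the shape of this API, the two callbacks can be backed by a plain atomic budget counter on the host side. The sketch below is illustrative only: HostBudget and per_query_budget_bytes are hypothetical names, and the lance_* symbols are the ones proposed above, not an existing API.

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_llong reserved;   /* bytes currently reserved by lance-c */
    long long    limit;      /* per-query budget handed down by the host engine */
} HostBudget;

static int32_t host_reserve(void* ctx, int64_t bytes) {
    HostBudget* b = (HostBudget*)ctx;
    long long now = atomic_fetch_add(&b->reserved, (long long)bytes) + (long long)bytes;
    if (now > b->limit) {                          /* over budget: undo and refuse */
        atomic_fetch_sub(&b->reserved, (long long)bytes);
        return -1;
    }
    return 0;
}

static void host_release(void* ctx, int64_t bytes) {
    HostBudget* b = (HostBudget*)ctx;
    atomic_fetch_sub(&b->reserved, (long long)bytes);
}

/* Registration (error handling elided):
 *   HostBudget budget = { .limit = per_query_budget_bytes };
 *   lance_scanner_set_memory_reservation(scanner, host_reserve, host_release, &budget);
 */

In a real Velox connector the reserve path would charge the query's MemoryPool (via ConnectorQueryCtx::memoryPool()) rather than a local counter, so Lance's internal scratch lands in the same accounting as every other operator.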

B. Memcpy at the FFI bridge into host-pool buffers (output-only — descriptive, not recommended)

The host copies each ArrowArray into pool-allocated buffers and immediately invokes the producer release callback. No lance-c change required.

This is what lance-duckdb gets for free today via DuckDB's standard Arrow ingest. Listed for completeness because it's a real pattern in the wild, but not proposed as the path for lance-c because:

  • It only accounts for output memory — internal scanner scratch (read-ahead, decode, vector index, coalescing) is the larger source and stays invisible.
  • It's a permanent per-batch memcpy + transient ~2× memory tax that engines like Velox specifically avoid by importing zero-copy.
  • It gives lance-c no signal about host budget pressure, so it can never evolve toward cooperative spill.
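
For reference only, since B is not the recommended path: the bridge-side copy looks roughly like the sketch below, simplified to one non-nullable int64 column. host_pool_alloc is a hypothetical stand-in for whatever accounted allocation the engine exposes; a real bridge walks the ArrowSchema and copies every buffer of every child.

#include <stdint.h>
#include <string.h>
#include "arrow/c/abi.h"   /* struct ArrowArray, as defined by the Arrow C Data Interface */

extern void* host_pool_alloc(int64_t bytes);   /* hypothetical: charged to the query's pool */

static int64_t* copy_int64_values(const struct ArrowArray* col) {
    int64_t nbytes = col->length * (int64_t)sizeof(int64_t);
    int64_t* dst = (int64_t*)host_pool_alloc(nbytes);
    if (dst == NULL) return NULL;   /* over budget */
    /* buffers[0] is the validity bitmap (absent here); buffers[1] holds the
     * values; offset is expressed in elements, not bytes. */
    memcpy(dst, (const int64_t*)col->buffers[1] + col->offset, (size_t)nbytes);
    return dst;
}

/* After every column of the batch has been copied, the consumer calls the
 * producer's release callback (batch->release(batch)), freeing the Rust-side
 * buffers; the window between copy and release is the transient ~2x tax above. */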

C. File-level reader API alongside the dataset scanner

Expose lance_file_reader_open(path) -> reader and lance_file_reader_read_columns(reader, columns, range) -> ArrowArray so engines like Velox can drive the scan loop themselves (batch sizing, async scheduling, filter eval, projection — all on the engine side). Pair with reservation hooks at the chunk-decode site.

  • Accuracy: shrinks the unaccounted surface to "decode one column chunk" — small, predictable.
  • Cost: meaningful new public API surface; ongoing maintenance.
  • Limit: additive — doesn't replace the scanner API, which non-engine consumers (Python, Go, generic C++) still want for filter pushdown / vector search / fragment selection. Strictly better fit for engines like Velox; harder justification if no other engine integrations are expected.
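
To make the shape concrete: only the two function names below appear in the proposal; every parameter, the close function, and the return conventions are illustrative assumptions, not a designed interface.

#include <stddef.h>
#include <stdint.h>

typedef struct LanceFileReader LanceFileReader;   /* opaque handle */
struct ArrowArray;                                 /* Arrow C Data Interface */

/* Open a single Lance file (not a dataset); returns NULL on failure. */
LanceFileReader* lance_file_reader_open(const char* path);

/* Decode the requested columns for rows [row_start, row_start + row_count) into
 * a producer-owned ArrowArray. The reservation callbacks from approach A would
 * be charged here, at the chunk-decode site. Returns 0 on success. */
int32_t lance_file_reader_read_columns(
    LanceFileReader*   reader,
    const uint32_t*    column_indices,
    size_t             num_columns,
    int64_t            row_start,
    int64_t            row_count,
    struct ArrowArray* out);

void lance_file_reader_close(LanceFileReader* reader);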

Recommendation

A is the right primitive. It uses arrow-rs's existing MemoryPool trait, keeps the C surface tiny, and gives engines like Velox the visibility they need to throttle and fail-fast cleanly. It also captures Lance-side internal scratch — which dominates unaccounted memory in long scans, vector search, and large coalesced batches — that B can never see. It mirrors the Comet pattern and is the only path that can later evolve toward cooperative spill (the reserve callback can return -1 under pressure).

Acknowledge up front (as Comet does) that A is accounting-only: cooperative spill from inside lance-c is out of scope and would need a separate design.

C is worth opening as a separate discussion if there is interest in a "file-format" engine integration story beyond Velox; it is strictly bigger than this issue.

Out of scope

  • Cooperative spill from inside lance-c (Lance reclaiming memory on host signal).
  • Replacing Rust's global allocator with a per-stream allocator (would require upstream arrow-rs work — there is no per-instance pluggable allocator today, only the tracking MemoryPool).
  • GPU/CUDA memory; the C Device Data Interface uses the same producer-owned model.
