Compound FtsQuery C ABI surface — design preference (JSON vs typed handles)? #35

@wanglun

Hi! We're integrating lance-c as the items / FTS storage backend for an internal C++ service and would like to understand, early on, what the C ABI surface for compound FTS queries will look like.

The Phase 2 design doc already names this as future work (docs/superpowers/specs/2026-04-23-phase2-vector-search-indexing-design.md:22):

Compound boolean FTS queries (Boost / Boolean / Phrase composition). MVP
exposes match + fuzzy; the composer can be added later without breaking
changes.

We'd like to know the rough timeline for the implementation, and we're happy to help land it. What's your preferred shape for the C ABI?

What we need

Concretely, we need to express each FtsQuery variant from lance_index::scalar::inverted::query over the C ABI:

| Need (downstream consumer) | Maps to |
| --- | --- |
| AND across query terms | MatchQuery::with_operator(And) |
| Type-ahead (last token as prefix) + expansion cap | prefix expansion → BooleanQuery(Should) of MatchQuery::with_max_expansions(...) |
| Per-attribute boost (title^1.2 body) | MultiMatchQuery or BooleanQuery(Should) of MatchQuery::with_boost |
| Sub-queries with per-clause match_all_terms / last_as_prefix / boost | BooleanQuery(Should) of distinct MatchQuery |
| Phrase queries with slop | PhraseQuery::with_slop |
| Negative-boost re-rank | BoostQuery |

The current C function:

int32_t lance_scanner_full_text_search(
    LanceScanner* scanner,
    const char* query,
    const char* const* columns,
    uint32_t max_fuzzy_distance);

covers MatchQuery::new + MatchQuery::with_fuzziness(Some(d)) only.

Two surface shapes we're considering

Option A — JSON entrypoint. Single new function:

/// Set a serialized FtsQuery (lance_index::scalar::inverted::query::FtsQuery)
/// as JSON. The Rust types already derive Serialize/Deserialize.
int32_t lance_scanner_fts_query_json(
    LanceScanner* scanner,
    const char* fts_query_json,
    size_t json_len);

Pros: one C symbol; forward-compatible with new variants for free; implementation is serde_json::from_slice + set_fts_query.
Cons: callers must serialize JSON; errors surface late (at deserialization) rather than at call sites; less idiomatic than the typed nearest/nprobes/refine_factor chain pattern.
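
To make the "callers serialize JSON" cost concrete, here is a minimal sketch of what the caller side of Option A might look like in C. The helper names (json_escape, build_match_json) are hypothetical, and the payload shape is a placeholder only: the real layout would be whatever serde produces for FtsQuery, which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Escape `in` as the body of a JSON string into `out` (capacity `cap`).
 * Quotes and backslashes get a backslash prefix; control characters become
 * \u00XX. Returns bytes written (excluding the NUL), or -1 if `out` is too
 * small. Bytes >= 0x80 are passed through and assumed to be valid UTF-8. */
int json_escape(const char *in, char *out, size_t cap) {
    assert(in != NULL && out != NULL);
    if (cap == 0) return -1;
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)in; *p; ++p) {
        char tmp[7]; /* longest escape is \u00XX: 6 chars + NUL */
        int len;
        if (*p == '"' || *p == '\\') {
            len = snprintf(tmp, sizeof tmp, "\\%c", (int)*p);
        } else if (*p < 0x20) {
            len = snprintf(tmp, sizeof tmp, "\\u%04x", (unsigned)*p);
        } else {
            tmp[0] = (char)*p;
            tmp[1] = '\0';
            len = 1;
        }
        if (n + (size_t)len >= cap) return -1; /* keep room for the NUL */
        memcpy(out + n, tmp, (size_t)len);
        n += (size_t)len;
    }
    out[n] = '\0';
    return (int)n;
}

/* Build a match-query payload. HYPOTHETICAL shape: the keys below are
 * placeholders, not the actual serde layout of FtsQuery. Returns the payload
 * length, or -1 if any buffer is too small. */
int build_match_json(const char *column, const char *terms,
                     char *out, size_t cap) {
    char col[256], trm[1024];
    if (json_escape(column, col, sizeof col) < 0) return -1;
    if (json_escape(terms, trm, sizeof trm) < 0) return -1;
    int n = snprintf(out, cap,
        "{\"match\":{\"column\":\"%s\",\"terms\":\"%s\",\"operator\":\"And\"}}",
        col, trm);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

A caller would pass the resulting buffer and length to the proposed lance_scanner_fts_query_json. Note that a malformed or outdated payload is only rejected inside that call, which is the "errors surface late" cost listed above.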

Option B — Typed handles. Opaque LanceFtsQuery* plus builders:

LanceFtsQuery* lance_fts_match_new(const char* terms);
int32_t lance_fts_match_set_column(LanceFtsQuery*, const char*);
int32_t lance_fts_match_set_operator(LanceFtsQuery*, LanceFtsOperator);
int32_t lance_fts_match_set_fuzziness(LanceFtsQuery*, int32_t);
int32_t lance_fts_match_set_max_expansions(LanceFtsQuery*, uint32_t);
int32_t lance_fts_match_set_boost(LanceFtsQuery*, float);
int32_t lance_fts_match_set_prefix_length(LanceFtsQuery*, uint32_t);

LanceFtsQuery* lance_fts_phrase_new(const char* terms);
int32_t lance_fts_phrase_set_slop(LanceFtsQuery*, uint32_t);

LanceFtsQuery* lance_fts_boost_new(LanceFtsQuery* positive, LanceFtsQuery* negative, float boost);
LanceFtsQuery* lance_fts_multi_match_new(const LanceFtsQuery* const* matches, size_t n);
LanceFtsQuery* lance_fts_boolean_new(void);
int32_t lance_fts_boolean_add(LanceFtsQuery*, LanceFtsOccur, LanceFtsQuery*);

void lance_fts_query_close(LanceFtsQuery*);

int32_t lance_scanner_fts_query(LanceScanner*, LanceFtsQuery*);

Pros: matches the existing Scanner::nearest + nprobes/refine_factor/ef pattern; errors surface at construction time; idiomatic in C++ via RAII wrappers in lance.hpp.
Cons: ~15 new C functions (counting the declarations above) vs 1; new variants in the future need new symbols (still ABI-compatible, just additive).
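
One thing Option B would need to pin down is ownership of child handles. The sketch below is an in-memory mock, not lance-c code: it reuses four of the proposed names so the call sites look realistic, and it assumes that lance_fts_boolean_add takes ownership of the child on success and leaves it with the caller on failure. The LanceFtsOccur enumerator names, the struct layout, and build_should_pair are all hypothetical, and lance_fts_match_set_column calls are omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed enumerator names; the issue does not define LanceFtsOccur. */
typedef enum { LANCE_FTS_OCCUR_SHOULD, LANCE_FTS_OCCUR_MUST,
               LANCE_FTS_OCCUR_MUST_NOT } LanceFtsOccur;
typedef enum { FTS_KIND_MATCH, FTS_KIND_BOOLEAN } FtsKind;

/* Mock handle: a real implementation would wrap the Rust FtsQuery instead. */
typedef struct LanceFtsQuery {
    FtsKind kind;
    char *terms;                     /* match only */
    float boost;                     /* match only */
    struct LanceFtsQuery **children; /* boolean only */
    LanceFtsOccur *occurs;           /* boolean only, parallel to children */
    size_t n_children;
} LanceFtsQuery;

LanceFtsQuery *lance_fts_match_new(const char *terms) {
    if (!terms) return NULL;
    LanceFtsQuery *q = calloc(1, sizeof *q);
    if (!q) return NULL;
    q->kind = FTS_KIND_MATCH;
    q->boost = 1.0f;
    q->terms = malloc(strlen(terms) + 1);
    if (!q->terms) { free(q); return NULL; }
    strcpy(q->terms, terms);
    return q;
}

int32_t lance_fts_match_set_boost(LanceFtsQuery *q, float boost) {
    if (!q || q->kind != FTS_KIND_MATCH) return -1; /* wrong handle kind */
    q->boost = boost;
    return 0;
}

LanceFtsQuery *lance_fts_boolean_new(void) {
    LanceFtsQuery *q = calloc(1, sizeof *q);
    if (q) q->kind = FTS_KIND_BOOLEAN;
    return q;
}

/* Frees the handle and, recursively, every child it owns. NULL is a no-op. */
void lance_fts_query_close(LanceFtsQuery *q) {
    if (!q) return;
    assert(q->kind == FTS_KIND_BOOLEAN || q->n_children == 0);
    for (size_t i = 0; i < q->n_children; ++i)
        lance_fts_query_close(q->children[i]);
    free(q->children);
    free(q->occurs);
    free(q->terms);
    free(q);
}

/* ASSUMPTION: on success the parent owns `child`; on failure the caller does. */
int32_t lance_fts_boolean_add(LanceFtsQuery *parent, LanceFtsOccur occur,
                              LanceFtsQuery *child) {
    if (!parent || !child || parent->kind != FTS_KIND_BOOLEAN) return -1;
    size_t n = parent->n_children + 1;
    LanceFtsQuery **c = realloc(parent->children, n * sizeof *c);
    if (!c) return -1;
    parent->children = c;
    LanceFtsOccur *o = realloc(parent->occurs, n * sizeof *o);
    if (!o) return -1;
    parent->occurs = o;
    c[n - 1] = child;
    o[n - 1] = occur;
    parent->n_children = n;
    return 0;
}

/* Two Should clauses over the same terms, the first boosted by 1.2. */
LanceFtsQuery *build_should_pair(const char *terms) {
    LanceFtsQuery *root = lance_fts_boolean_new();
    LanceFtsQuery *a = lance_fts_match_new(terms);
    LanceFtsQuery *b = lance_fts_match_new(terms);
    if (!root || !a || !b) goto fail;
    if (lance_fts_match_set_boost(a, 1.2f) != 0) goto fail;
    if (lance_fts_boolean_add(root, LANCE_FTS_OCCUR_SHOULD, a) != 0) goto fail;
    a = NULL; /* ownership moved to root */
    if (lance_fts_boolean_add(root, LANCE_FTS_OCCUR_SHOULD, b) != 0) goto fail;
    b = NULL;
    return root;
fail:
    lance_fts_query_close(a);
    lance_fts_query_close(b);
    lance_fts_query_close(root);
    return NULL;
}
```

Under this assumed ownership rule a caller closes only the root handle. Calling a setter on the wrong kind of handle fails immediately at the call site, which is the "errors surface at construction time" property listed above.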
