
feat(semantic): [alpha build] provider-aware typed embeddings, reranking, diagnostics, and eval harness #87

Open
Zireael wants to merge 43 commits into cortexkit:main from Zireael:semantic-search-enhancement

Zireael (Contributor) commented Jun 2, 2026

Summary

Semantic search in AFT moves from a minimal embedding-and-cosine prototype to a provider-capability-aware retrieval subsystem with typed vectors, optional reranking, background lifecycle management, diagnostics, and evaluation tooling. This is a public preview: the feature is functional and tested (~93 new tests), but it is expected to change based on real-world feedback.

What changed

The upgrade touches the full semantic pipeline — config, indexing, retrieval, diagnostics, and observability — without breaking the default fastembed experience.

Typed vector representations

Vectors are no longer opaque f32 blobs. Every stored vector carries explicit type metadata (DenseF32, Int8SourceDecoded, BinaryPacked) and is paired with its source kind so the correct distance metric is selected automatically. Binary packed vectors use Hamming search (native bitwise XOR + popcount) instead of cosine, which is both faster and semantically correct for quantized embeddings. This unlocks Perplexity's base64_binary and base64_int8 output modes alongside standard dense providers.
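The XOR + popcount search described above can be sketched in a few lines. This is a minimal illustration, assuming vectors are packed one bit per dimension into bytes; `hamming_distance` and `rank_by_hamming` are hypothetical names and not the PR's VectorStore API.

```rust
/// Hamming distance between two bit-packed vectors.
/// Returns `None` when the byte lengths differ, since the comparison is undefined.
pub fn hamming_distance(a: &[u8], b: &[u8]) -> Option<u32> {
    if a.len() != b.len() {
        return None;
    }
    // XOR leaves a 1 bit wherever the inputs disagree; count_ones is the popcount.
    Some(a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum::<u32>())
}

/// Rank candidate ids by ascending Hamming distance to the query (closest first).
/// Candidates whose length does not match the query are skipped.
pub fn rank_by_hamming(query: &[u8], candidates: &[(u32, Vec<u8>)]) -> Vec<(u32, u32)> {
    let mut scored: Vec<(u32, u32)> = candidates
        .iter()
        .filter_map(|(id, bits)| hamming_distance(query, bits).map(|d| (*id, d)))
        .collect();
    // Sort by distance, then by id so ties are deterministic.
    scored.sort_by_key(|&(id, d)| (d, id));
    scored
}
```

Unlike cosine similarity, a smaller value means a closer match here, so callers sort ascending.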

Provider capability profiles

Each embedding backend (fastembed, OpenAI-compatible, Ollama, Perplexity) declares what it supports: output encoding, distance metric, dimension range, max batch size. The config layer validates combinations at configure time — you cannot accidentally request binary vectors through a cosine-only provider. Profiles also carry fingerprint fields so switching providers triggers a clean index rebuild rather than silent corruption.
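As a rough illustration of configure-time validation against a capability profile, the sketch below rejects a request whose encoding, metric, or batch size the provider does not declare. All type and field names (`ProviderProfile`, `Request`, `validate`) are assumptions made for this example; the PR's own type is EmbeddingModelProfile and its fields may differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding { Float, Base64Int8, Base64Binary }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Metric { Cosine, Hamming }

/// What a backend declares it can do (hypothetical shape).
pub struct ProviderProfile {
    pub name: &'static str,
    pub encodings: &'static [Encoding],
    pub metrics: &'static [Metric],
    pub max_batch_size: usize,
}

/// What the user configuration asks for.
pub struct Request {
    pub encoding: Encoding,
    pub metric: Metric,
    pub batch_size: usize,
}

/// Reject unsupported combinations before any embedding call is made.
pub fn validate(profile: &ProviderProfile, req: &Request) -> Result<(), String> {
    if !profile.encodings.contains(&req.encoding) {
        return Err(format!("{} does not support {:?} output", profile.name, req.encoding));
    }
    if !profile.metrics.contains(&req.metric) {
        return Err(format!("{} does not support the {:?} metric", profile.name, req.metric));
    }
    if req.batch_size == 0 || req.batch_size > profile.max_batch_size {
        return Err(format!("batch size {} is outside 1..={}", req.batch_size, profile.max_batch_size));
    }
    Ok(())
}
```

Returning an error string at configure time keeps the failure close to the config edit instead of surfacing later as a corrupt index.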

Fingerprint-driven index lifecycle

A SemanticIndexFingerprint captures every dimension that affects index correctness: backend, model, base_url, dimension, chunking_version, output_encoding, storage_strategy, vector kinds, normalization, and prompt hashes. diff() classifies changes as Rebuild (structural — re-embed everything), ClearQueryCache (query prompts changed — invalidate cached results only), or None. This replaces the previous "delete and hope" invalidation with precise, explainable rebuild decisions.
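The three-way classification can be sketched as follows. This is a simplified stand-in, assuming only a handful of the fields listed above; the struct and enum names are hypothetical and the real SemanticIndexFingerprint carries more dimensions.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Fingerprint {
    pub backend: String,
    pub model: String,
    pub dimension: usize,
    pub chunking_version: u32,
    pub query_prompt_hash: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FingerprintChange {
    /// Nothing that affects stored vectors or cached queries changed.
    None,
    /// Only the query prompt changed: stored vectors stay valid.
    ClearQueryCache,
    /// A structural field changed: every chunk must be re-embedded.
    Rebuild,
}

impl Fingerprint {
    pub fn diff(&self, other: &Fingerprint) -> FingerprintChange {
        let structural = self.backend != other.backend
            || self.model != other.model
            || self.dimension != other.dimension
            || self.chunking_version != other.chunking_version;
        if structural {
            // Rebuild wins even if the query prompt changed as well.
            FingerprintChange::Rebuild
        } else if self.query_prompt_hash != other.query_prompt_hash {
            FingerprintChange::ClearQueryCache
        } else {
            FingerprintChange::None
        }
    }
}
```

Checking the structural fields first gives the "rebuild wins" behavior when several fields change at once.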

Non-blocking cold start

Index builds run in a background thread with cooperative cancellation (SemanticCancellationToken via AtomicU64 generation counter). The build checks the generation before each embedding batch and exits early when a reconfigure arrives. Priority ordering ensures high-value files (recently edited, high PageRank) get embedded first. Exponential backoff handles transient provider failures without blocking the session.
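A generation counter is one common way to get cooperative cancellation without locks. The sketch below assumes a build remembers the generation it started under and stops once the counter has moved; `GenerationToken` and `run_build` are illustrative names, not the PR's SemanticCancellationToken.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Shared generation counter; cloning shares the same underlying atomic.
#[derive(Clone, Default)]
pub struct GenerationToken {
    generation: Arc<AtomicU64>,
}

impl GenerationToken {
    /// Snapshot the generation a build starts under.
    pub fn current(&self) -> u64 {
        self.generation.load(Ordering::Acquire)
    }

    /// Invalidate every in-flight build, for example on reconfigure.
    pub fn cancel_all(&self) -> u64 {
        self.generation.fetch_add(1, Ordering::AcqRel) + 1
    }

    pub fn is_cancelled(&self, started_at: u64) -> bool {
        self.current() != started_at
    }
}

/// Embed batches in order, checking the generation before each one.
/// Returns how many batches were processed before completion or cancellation.
pub fn run_build<T>(
    token: &GenerationToken,
    batches: &[Vec<T>],
    mut embed: impl FnMut(&[T]),
) -> usize {
    let started_at = token.current();
    let mut done = 0;
    for batch in batches {
        if token.is_cancelled(started_at) {
            break; // a newer configuration superseded this build
        }
        embed(batch.as_slice());
        done += 1;
    }
    done
}
```

Because the check happens only between batches, a cancelled build still finishes the batch it is currently embedding.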

Stale-vector pruning

When files are edited, deleted, moved, excluded, or re-included, the index tracks which vectors are stale and prunes them during the next refresh cycle. Every vector record carries file/chunk ownership metadata (file path, version, chunk hash, index fingerprint) so pruning is traceable and deterministic.
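A minimal sketch of ownership-based pruning, assuming each vector record stores the owning file path and the content hash it was embedded from. `VectorRecord`, `prune_stale`, and the `u64` hash are hypothetical simplifications and do not mirror the PR's record layout.

```rust
use std::collections::HashMap;

/// Ownership metadata kept next to each stored vector (simplified).
pub struct VectorRecord {
    pub file: String,
    pub file_hash: u64,
    pub chunk_id: u32,
}

/// Drop records whose file is no longer indexed or whose content hash changed.
/// `current_files` maps each currently indexed path to its latest content hash.
/// Returns the number of records removed.
pub fn prune_stale(records: &mut Vec<VectorRecord>, current_files: &HashMap<String, u64>) -> usize {
    let before = records.len();
    records.retain(|r| current_files.get(&r.file) == Some(&r.file_hash));
    before - records.len()
}
```

Deleted, moved, and excluded files all look the same to this check: the path is simply absent from `current_files`.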

File policy and docs chunking

A configurable file policy controls which files enter the index (include globs, exclude globs, max file size, max chunk count). The docs chunker splits Markdown and documentation files into semantic sections before embedding, improving recall for documentation-shaped queries.
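Heading-based splitting of Markdown can be sketched as below. This is an illustration only: `DocChunk` and `split_markdown_by_heading` are hypothetical names, and the sketch treats every line starting with `#` as a heading, so it does not handle `#` lines inside fenced code blocks.

```rust
/// One section of a document, with 1-based inclusive line bounds.
pub struct DocChunk {
    pub heading: String,
    pub start_line: usize,
    pub end_line: usize,
    pub text: String,
}

/// Split Markdown into sections, starting a new chunk at every heading line.
/// Text before the first heading becomes a chunk with an empty heading.
pub fn split_markdown_by_heading(source: &str) -> Vec<DocChunk> {
    let mut chunks = Vec::new();
    let mut current: Option<DocChunk> = None;
    for (idx, line) in source.lines().enumerate() {
        let line_no = idx + 1;
        let is_heading = line.starts_with('#');
        if is_heading || current.is_none() {
            if let Some(done) = current.take() {
                chunks.push(done);
            }
            let heading = if is_heading {
                line.trim_start_matches('#').trim().to_string()
            } else {
                String::new()
            };
            current = Some(DocChunk { heading, start_line: line_no, end_line: line_no, text: String::new() });
        }
        if let Some(chunk) = current.as_mut() {
            chunk.end_line = line_no; // keeps the final chunk's end line accurate
            chunk.text.push_str(line);
            chunk.text.push('\n');
        }
    }
    if let Some(done) = current {
        chunks.push(done);
    }
    chunks
}
```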

Reranking pipeline

Optional reranking via any OpenAI-compatible /v1/rerank or chat-completion endpoint. The pipeline sends initial retrieval candidates to a reranker, parses the response (supporting multiple JSON shapes), and reorders results with safe fallback — if the reranker fails, the original cosine-similarity order is returned unchanged. Config fields: rerank.enabled, rerank.model, rerank.base_url, rerank.api_key_env, rerank.max_candidates.
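The safe-fallback reordering step can be sketched independently of any HTTP client. The function below assumes the reranker returns candidate indices in preferred order, or nothing at all on failure; `apply_rerank` is a hypothetical helper, not the function in semantic_rerank.rs.

```rust
/// Reorder `candidates` using reranker-provided indices.
/// `ranking` is `None` when the reranker call failed or could not be parsed.
pub fn apply_rerank<T: Clone>(candidates: &[T], ranking: Option<&[usize]>) -> Vec<T> {
    let order = match ranking {
        Some(order) => order,
        // Fallback: keep the original retrieval order unchanged.
        None => return candidates.to_vec(),
    };
    let mut seen = vec![false; candidates.len()];
    let mut out = Vec::with_capacity(candidates.len());
    for &i in order {
        // Ignore out-of-bounds and duplicate indices instead of failing.
        if i < candidates.len() && !seen[i] {
            seen[i] = true;
            out.push(candidates[i].clone());
        }
    }
    // Candidates the reranker did not mention keep their original relative order.
    for (i, candidate) in candidates.iter().enumerate() {
        if !seen[i] {
            out.push(candidate.clone());
        }
    }
    out
}
```

The output always has the same length as the input, so a misbehaving reranker can change the order but cannot drop results.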

Search pipeline metrics and diagnostics

Every aft_search call records timing, cache hits/misses, result counts, and reranker fallback events. Metrics are exposed through the status command and through JSONL diagnostic logs for offline analysis. The DiagnosticsOutputMode config controls verbosity in tool output (off | minimal | verbose).

Semantic doctor

semantic_doctor is a health-check command that reports config summary, index summary, metrics summary, provider summary, and actionable suggestions. Use it to verify that the index is healthy, the provider is reachable, and the configuration is consistent.

Semantic eval harness

semantic_eval runs a JSONL-defined evaluation suite against the semantic index. Each case specifies a query, expected paths, expected symbols, and top-k. The harness computes recall@k and MRR (Mean Reciprocal Rank) for quantifying retrieval quality across config changes.
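For reference, the two metrics can be computed per case as shown below. This sketch assumes exact string equality between result paths and expected paths, which is a simplification for the example; the function names are hypothetical.

```rust
/// Recall@k for one case: fraction of expected paths found in the top `k` results.
pub fn recall_at_k(results: &[&str], expected: &[&str], k: usize) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    let top = &results[..k.min(results.len())];
    let hits = expected.iter().filter(|e| top.contains(*e)).count();
    hits as f64 / expected.len() as f64
}

/// Reciprocal rank for one case: 1 / rank of the first relevant result
/// within the top `k`, or 0.0 when none is relevant.
pub fn reciprocal_rank(results: &[&str], expected: &[&str], k: usize) -> f64 {
    results
        .iter()
        .take(k)
        .position(|r| expected.contains(r))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Mean of per-case scores; MRR is the mean of the reciprocal ranks.
pub fn mean(values: &[f64]) -> f64 {
    if values.is_empty() {
        0.0
    } else {
        values.iter().sum::<f64>() / values.len() as f64
    }
}
```

Recall@k rewards finding every expected path, while MRR only looks at how high the first relevant result appears, so the two can move in different directions after a config change.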

Status integration

The status command now includes semantic health metrics: lifecycle state, entry count, dimension, total queries, cache hit ratio, average query time, and provider info. The OpenCode TUI sidebar surfaces these alongside the existing index state.

Config trust boundary

backend, base_url, and api_key_env are user-only fields — project-level aft.jsonc cannot inject these. A hostile repository cannot redirect embeddings at an attacker-controlled endpoint or exfiltrate API keys. The plugin logs a warning when it strips a project-level setting.
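The stripping rule can be sketched as a merge that refuses user-only keys from the project layer. This is an assumption-laden illustration: config values are flattened to strings, project values are assumed to override user values for all other keys, and `merge_semantic_layers` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Fields only the user-level config may set.
const USER_ONLY_FIELDS: [&str; 3] = ["backend", "base_url", "api_key_env"];

/// Merge project-level settings over user-level settings, dropping any
/// user-only field the project tries to set. Returns the merged config
/// and one warning per stripped field.
pub fn merge_semantic_layers(
    user: &BTreeMap<String, String>,
    project: &BTreeMap<String, String>,
) -> (BTreeMap<String, String>, Vec<String>) {
    let mut merged = user.clone();
    let mut warnings = Vec::new();
    for (key, value) in project {
        if USER_ONLY_FIELDS.contains(&key.as_str()) {
            warnings.push(format!("ignoring project-level semantic.{key}: user-only field"));
        } else {
            merged.insert(key.clone(), value.clone());
        }
    }
    (merged, warnings)
}
```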

Contextualized document-chunk embedding (partial)

Initial support for Perplexity-style document/chunk grouped embedding — chunks from the same source document are batched together rather than flattened. Oversized document handling and retry logic are still in progress (see roadmap).

How to test

Default fastembed (zero-config)

// Enable semantic search in your AFT config
// (~/.config/opencode/aft.jsonc or ~/.pi/agent/aft.jsonc):
{ "semantic_search": true }

// Start a session; the index builds in the background.
// Then run aft_search with a concept query:
aft_search({ "query": "authentication middleware" })

Verify: results appear with source: semantic or source: hybrid tags. Status shows [index: ready] after build completes.

Provider switching

// Switch to OpenAI-compatible
{
  "semantic_search": true,
  "semantic": {
    "backend": "openai_compatible",
    "model": "text-embedding-3-small",
    "base_url": "https://api.openai.com/v1",
    "api_key_env": "OPENAI_API_KEY"
  }
}

Verify: index rebuilds automatically on next session start. Status shows new provider/model.

Reranking

{
  "semantic_search": true,
  "semantic": {
    "backend": "openai_compatible",
    "model": "text-embedding-3-small",
    "base_url": "https://api.openai.com/v1",
    "api_key_env": "OPENAI_API_KEY"
  },
  "rerank": {
    "enabled": true,
    "model": "rerank-english-v3.0",
    "base_url": "https://api.cohere.com",
    "api_key_env": "COHERE_API_KEY"
  }
}

Verify: search results show reranker-sorted order. Disable reranker — results fall back to cosine order.

Semantic doctor

aft_search({ "query": "test" })  // trigger index build if cold
// Then check health via the status command or semantic_doctor

Verify: health report shows ConfigSummary, IndexSummary, MetricsSummary, ProviderSummary.

Eval harness

// Create .aft/semantic-eval.jsonl (the file the semantic_eval handler reads):
{"query": "authentication handler", "expected_paths": ["src/auth/middleware.ts"], "expected_symbols": ["authMiddleware"], "top_k": 10}
{"query": "database connection", "expected_paths": ["src/db/pool.ts"], "expected_symbols": ["createPool"], "top_k": 10}

Verify: returns recall@k and MRR scores.

Test coverage

~93 tests across 8 test sub-tasks covering:

  • Config parsing and backward compatibility
  • Fingerprint diff matrix (all field combinations → Rebuild/ClearQueryCache/None)
  • File policy, docs chunking, and manifest handling
  • VectorStore trait with DenseF32 and BinaryPacked implementations
  • Binary packed-vector storage and Hamming search
  • Lifecycle states, snapshots, and stale-vector pruning
  • Search pipeline metrics, diagnostics, and DiagnosticsOutputMode
  • Concurrency, race conditions, and cancellation token behavior
  • Security trust boundary enforcement (project config stripping)
  • Semantic doctor health report
  • Semantic eval harness (JSONL parsing, scoring, recall/MRR)
  • Reranking pipeline (parse multiple JSON shapes, fallback on failure)

Roadmap

Still in progress or planned for follow-up:

  • aft-t6p.23: Complete contextualized document-chunk embedding (oversized docs, retry logic) — partially implemented
  • aft-t6p.2.2: Configurable snippet truncation in reranking (currently hardcoded at 200 chars)
  • aft-t6p.18: End-to-end verification across all backends
  • aft-t6p.5: Configuration and operations documentation
  • Performance benchmarking suite
  • Migration tooling for index format upgrades

Architecture notes

Key new modules:

  • crates/aft/src/semantic_rerank.rs — reranking pipeline with safe fallback
  • crates/aft/src/semantic_diagnostics.rs — JSONL diagnostic logging
  • crates/aft/src/semantic_doctor.rs — health-check report generation
  • crates/aft/src/semantic_eval.rs — evaluation harness (JSONL parser, scoring)
  • crates/aft/src/vector_store.rs — VectorStore trait with DenseF32 and BinaryPacked implementations
  • crates/aft/src/commands/semantic_doctor.rs — doctor command handler
  • crates/aft/src/commands/semantic_eval.rs — eval command handler

Modified significantly:

  • crates/aft/src/semantic_index.rs — lifecycle management, fingerprint-driven invalidation, non-blocking build, stale pruning, typed vectors
  • crates/aft/src/config.rs — provider profiles, rerank config, trust boundary fields
  • crates/aft/src/commands/status.rs — semantic health metrics
  • crates/aft/src/commands/semantic_search.rs — reranking integration, diagnostics output mode



Summary by cubic

Upgrades semantic search to a provider-aware pipeline with typed vectors, reranking, contextualized document-chunk embedding, partial-ready querying, and built-in diagnostics/eval. Adds Perplexity support and Hamming search for binary/int8, and hardens lifecycle, metrics, and config.

  • New Features

    • Provider profiles and typed vectors (f32, int8, binary packed) with auto metric selection; enables Perplexity base64_binary/base64_int8.
    • Contextualized document-chunk embedding with oversized-document splitting, retry/backoff, and build diagnostics.
    • Fingerprint-driven lifecycle with background builds, partial-ready state, and precise stale‑vector pruning.
    • Optional reranking via OpenAI-compatible endpoints with safe fallback.
    • Metrics/diagnostics: JSONL logs with retention and configurable verbosity; status, semantic_doctor, and semantic_eval surface semantic health.
  • Bug Fixes

    • Cosine similarity guards zero-norm vectors and clamps scores to [-1, 1].
    • Reranker: robust fence stripping, out-of-bounds index warning, duplicate index prevention, and max_candidate_chars support.
    • Search no longer overwrites lifecycle to Ready; validates non-empty queries; cancellation token uses acquire/release ordering.
    • Chunking fixes: correct end_line for final splits and accurate oversized-document counters.
    • Config parsing accepts all semantic fields (dimensions, encoding, storage strategy, input mode, metric, prompts, diagnostics, rerank limits) and now also reads jsonl_logging, jsonl_path, include_raw_queries, include_snippets, retention_days, and metrics_window_size; TypeScript enums aligned with Rust.
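The zero-norm guard and score clamping listed under the bug fixes can be sketched as follows. Returning 0.0 for zero-norm or mismatched-length inputs is an assumption made for this example; the PR may handle those cases differently.

```rust
/// Cosine similarity that never returns NaN and never leaves [-1, 1].
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0; // assumption: mismatched dimensions score as unrelated
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    let denom = norm_a.sqrt() * norm_b.sqrt();
    if denom == 0.0 || !denom.is_finite() {
        return 0.0; // guard: a zero-norm vector would otherwise divide by zero
    }
    // Rounding error can push the ratio slightly past 1.0 in magnitude.
    (dot / denom).clamp(-1.0, 1.0)
}
```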

Written for commit d204e2d. Summary will update on new commits.


Greptile Summary

This PR upgrades the semantic search subsystem from a minimal prototype to a full provider-aware retrieval pipeline, introducing typed vectors, fingerprint-driven index lifecycle management, optional reranking, background build with cooperative cancellation, and diagnostic tooling.

  • Core pipeline (semantic_index.rs, vector_store.rs): adds VectorStore abstraction, EmbeddingModelProfile for provider capability validation, SemanticIndexFingerprint with diff() for precise rebuild decisions, and stale-vector pruning. Cancellation token uses correct Acquire/Release ordering; cosine_similarity guards zero-norm vectors and clamps output.
  • Reranking (semantic_rerank.rs, semantic_search.rs): optional OpenAI-compatible reranker with safe fallback and markdown-fence stripping. A field/method naming collision on diagnostics_enabled silently disables JSONL logging when jsonl_logging: true is set without diagnostics_enabled: true.
  • TypeScript config (config.ts): new enum schemas and trust-boundary mergeSemanticConfig; rerank and diagnostics fields are absent from SemanticConfigSchema and stripped by Zod before reaching Rust.

Confidence Score: 3/5

Safe to merge with the diagnostics_enabled field/method fix applied — without it, any user who enables JSONL logging alone gets silence.

The search handler reads the raw bool field instead of the diagnostics_enabled() method that unifies diagnostics_enabled || jsonl_logging, silently breaking JSONL-only logging configs.

crates/aft/src/commands/semantic_search.rs (line 51 field vs. method) and packages/opencode-plugin/src/config.ts (rerank/diagnostics fields absent from SemanticConfigSchema).
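The field-versus-method mismatch is easy to reproduce because Rust allows a struct field and a method to share a name. The sketch below uses a hypothetical `DiagnosticsConfig` struct that keeps only the two flags named in the review; it is not the PR's SemanticBackendConfig.

```rust
pub struct DiagnosticsConfig {
    pub diagnostics_enabled: bool,
    pub jsonl_logging: bool,
}

impl DiagnosticsConfig {
    /// Unified check: diagnostics run when either flag is set.
    pub fn diagnostics_enabled(&self) -> bool {
        self.diagnostics_enabled || self.jsonl_logging
    }
}

/// Reads the raw field: JSONL-only configs are treated as disabled.
pub fn should_log_via_field(cfg: &DiagnosticsConfig) -> bool {
    cfg.diagnostics_enabled
}

/// Calls the method: JSONL-only configs are honored.
pub fn should_log_via_method(cfg: &DiagnosticsConfig) -> bool {
    cfg.diagnostics_enabled()
}
```

The only difference between the two call sites is a pair of parentheses, which is why the compiler does not flag it.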

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/aft/src/commands/semantic_search.rs | Rewrites search handler to add reranking, diagnostics, and partial-index handling. Field access instead of method call on diagnostics_enabled silently disables JSONL logging when only jsonl_logging: true is set. |
| crates/aft/src/semantic_rerank.rs | New reranking pipeline with safe fallback. Markdown-fence stripper has an edge case for single-line format that returns empty string, causing a graceful but silent degradation. |
| crates/aft/src/config.rs | Adds provider enums, SemanticBackendConfig with rerank/diagnostics fields, and a diagnostics_enabled() method that ORs the field with jsonl_logging, but callers bypass this method and read the field directly. |
| packages/opencode-plugin/src/config.ts | Adds SemanticConfigSchema with new enum types and project-level trust boundary enforcement. Missing rerank/diagnostics fields from the schema (silently stripped by Zod) and minor warning message formatting inconsistency. |
| crates/aft/src/vector_store.rs | New VectorStore trait + FlatF32 and FlatBinaryHamming implementations. Search, upsert, prune, and orphan-removal logic is correct and well-tested. |
| crates/aft/src/semantic_index.rs | Major expansion adding typed vectors, fingerprint-driven lifecycle, background build with cooperative cancellation, stale-vector pruning, and provider profiles. Cosine similarity correctly guards zero-norm vectors and clamps output. |
| crates/aft/src/semantic_diagnostics.rs | JSONL diagnostics logger and in-memory metrics. Correct use of retention policy and configurable verbosity. No issues found. |
| crates/aft/src/context.rs | Adds SemanticCancellationToken (AtomicU64 with correct Acquire/Release ordering) and lazy JSONL logger initialization. Logic in init_diagnostics_logger is correct, gated on jsonl_logging field. |

Comments Outside Diff (1)

  1. packages/opencode-plugin/src/config.ts, lines 37-54

    P1 TypeScript enum values don't match the Rust serde strings — config will fail to deserialize

    Several new enum schemas use values that don't align with the Rust serde representation:

    • SemanticOutputEncodingEnum allows "binary", "ubinary", "int8", "uint8" but Rust OutputEncoding deserializes from "base64_binary" and "base64_int8".
    • SemanticStorageStrategyEnum allows "flat" and "binary_pack" but Rust StorageStrategy expects "native_f32" and "binary_packed".
    • SemanticInputModeEnum includes "chunk_extracts" and "contextualized" but Rust InputMode only has "flat_texts" and "document_chunks".
    • SemanticDistanceMetricEnum uses "dot" but Rust DistanceMetric expects "dot_product".
    • SemanticBackendEnum is missing the new "perplexity" variant added to Rust.

    A user who follows the TypeScript autocomplete and picks output_encoding: "int8" will pass TypeScript validation but receive a deserialization error (or silent fallback to default) from the Rust binary at runtime.

Reviews (5): Last reviewed commit: "fix(configure): add missing JSONL/metric..."

Zireael and others added 30 commits May 24, 2026 11:10
Add scripts, docs, Dockerfile, and package.json scripts for Docker-based
Rust validation (fmt/check/clippy/test) so Windows users without MSVC
Build Tools can still validate Rust code.

- scripts/docker-rust.ps1: PowerShell script supporting fmt/check/clippy/
  test/validate/shell tasks with persistent Docker volumes
- Dockerfile.rust: minimal Rust image with rustfmt + clippy pre-installed
- docs/docker-rust-validation.md: full usage and design documentation
- package.json: 6 new docker:rust:* convenience scripts

Design: Linux-target validation via rust:1-bookworm, persistent cargo
volumes for caching, fail-fast sequential validation.

- SemanticFilePolicy config struct with include_code/include_docs/
  include_configs/binary_detection/generated_file_detection/globs
- parse_semantic_files_config handler in configure.rs
- File policy evaluation: should_index_file(), is_generated_file(),
  is_config_file(), is_docs_file()
- Docs chunker: collect_docs_chunks() with heading-based splitting
  for markdown, splitting by file for other doc types
- collect_chunks routes doc files through docs chunker, skips
  binary/generated/config files per policy
- SemanticIndexFingerprint extended with file_policy_hash and
  docs_chunker_version; diff() triggers rebuild on policy change
- build_with_progress/refresh_stale_files accept &SemanticFilePolicy
- compute_file_policy_hash() deterministic hash of policy fields
- Re-export SemanticFilePolicy from semantic_index module
- All test callers updated with &SemanticFilePolicy::default()
…iority ordering, backoff

- CancellationToken (Arc<AtomicU64> generation counter) for cooperative build cancellation on reconfigure
- Cancel old semantic index builds instead of detaching when config changes
- Priority file ordering: README/docs first, then core source, then tests, then rest
- Embedding backoff: exponential retry with jitter for remote provider rate limits
- SemanticIndexStatus::Partial variant with completeness percentage for partial builds
- Search reports partial index state during cold start
- Phase-boundary cancellation checks between model init, disk read, incremental refresh, and full rebuild
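The "exponential retry with jitter" item above can be sketched with integer arithmetic only. This is an illustration, not the PR's implementation: the function names are hypothetical, and the random draw is left to the caller as a value in 0..=1000 so the example needs no external crate.

```rust
use std::time::Duration;

/// Upper bound for the wait before retry number `attempt` (0-based):
/// base * 2^attempt, saturating on overflow and capped at `cap`.
pub fn backoff_ceiling(attempt: u32, base: Duration, cap: Duration) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt)).min(cap)
}

/// Scale the ceiling by a caller-supplied random draw in 0..=1000 (per mille).
/// Values above 1000 are treated as 1000. Assumes `ceiling` is small enough
/// that multiplying by 1000 cannot overflow a Duration.
pub fn apply_jitter(ceiling: Duration, jitter_permille: u32) -> Duration {
    ceiling * jitter_permille.min(1000) / 1000
}
```

A caller would compute `apply_jitter(backoff_ceiling(attempt, base, cap), draw)` with a fresh random `draw` for each retry, which spreads out clients that were rate-limited at the same moment.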
Add Perplexity backend with InputMode::DocumentChunks support for
contextualized embedding where chunks carry document-level context.

- SemanticBackend::Perplexity variant with config, profile, engine
- DocumentChunks/PerDocumentChunks/DocumentEmbeddings structs
- embed_document_chunks() routes Perplexity to grouped embedding API
- build_with_progress_contextualized() groups chunks by document
- Wire configure.rs to branch on input_mode: DocumentChunks
- SemanticEmbeddingModel::input_mode() public accessor
- EmbeddingModelProfile with contextualized_supported guard
- Response validation: index continuity, missing documents, dimension
…to trait-backed module

Bead: aft-t6p.12

Extracts Vec<EmbeddingEntry> storage and search from SemanticIndexSnapshot
into a VectorStore trait with FlatF32VectorStore implementation. This
decouples the storage layer from the lifecycle logic and prepares for
alternative backends (binary Hamming, approximate ANN).

Key changes:
- vector_store.rs: VectorStore trait + ScoredChunk/PruneStats types
- FlatF32VectorStore: flat scan with cosine similarity (preserves existing
  behaviour exactly)
- FlatBinaryHammingVectorStore: forward-looking Hamming-search impl
- SemanticIndexSnapshot delegates search/len/prune/entries to store
- Fixed dimension-sync bug where set_dimension updated the snapshot
  dimension but not the store dimension, causing search to return 0
- EmbeddingEntry and IndexedFileMetadata made pub for trait compatibility

On Windows, use copyFileSync for the binary replacement (which overwrites
the target — renameSync fails with EEXIST). If it fails, the original
binary at binaryPath is preserved.

The temp file cleanup is now wrapped in its own try/catch so a cleanup
failure does NOT propagate as a download failure — the binary was already
successfully placed at binaryPath.

Addresses PR cortexkit#69 cubic review finding P2.

Implement bead aft-t6p.24: file identity manifest + vector ownership records.

Changes:
- **FileRecord struct**: identity record with content_hash, size_bytes, mtime,
  language, document_kind, inclusion_policy_hash, indexed_at
- **file_manifest on SemanticIndexSnapshot**: HashMap<PathBuf, FileRecord>
  tracking which files produced which vectors, enabling precise stale-vector
  pruning when files are edited, deleted, or excluded
- **V8 serialization format**: extends V7 with per-entry chunk_hash (after
  each vector) and file manifest block (after all entry vectors). Full
  backward compatibility with V1-V7 reads.
- **chunk_hash on EmbeddingEntry**: deterministic hash of chunk content fields
  for tracing which version of a chunk produced a stored vector
- **compute_chunk_hash**: blake3-based deterministic hash
- **build_manifest_from_store helper**: populates file_manifest from store's
  file_metadata, called in all builder functions (build_from_chunks,
  build_with_progress_contextualized, refresh_stale_files) and from_bytes
  for V1-V7 cache migration
- **next_chunk_id, fingerprint_string**: forward-looking fields on snapshot
  for future unique ID assignment and fingerprint tracking
…rmalization, and model profiles

Adds aft-t6p.20 (Typed embedding vector representation +
storage-strategy resolution):

- TypedVector (source-side) and StoredVector (persisted) enums
  with DenseF32, DenseInt8, BinaryPacked, and Quantized variants
- StorageStrategy (NativeF32, DecodeNormalizeF32, BinaryPacked)
- VectorKind enum for runtime type tagging
- DistanceMetric (Cosine, DotProduct, Euclidean, Hamming)
- NormalizationPolicy (AlreadyNormalized, NormalizeOnInsertQuery,
  NotApplicable)
- EmbeddingModelProfile fields: source_vector_kind, stored_vector_kind,
  metric, normalization
- convert_vector() / validate_compatible() on EmbeddingModelProfile
- blake3 dependency for chunk hashing
… + dummy base_url for Perplexity profile test

Two fixes for `fingerprint_invalidation_tests`:
- Mock HTTP server now lowercases header names before matching
  Content-Length (reqwest/hyper sends lowercase `content-length:`).
- `base64_int8_profile_from_config_selects_correctly` test provides a
  dummy `base_url` for the Perplexity backend (required by `from_config`).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add StorageStrategy::BinaryPacked variant for packed-bit vector storage
- Add EmbeddingModelProfile::perplexity_binary() with BinaryPacked → Hamming path
- Wire from_config to select perplexity_binary profile when Base64Binary encoding
- Implement parse_embedding_value for Base64Binary (decode → 0.0/1.0 f32 vec)
- Implement into_stored for TypedVector::BinaryPacked (requires BinaryPacked strategy)
- Update validate_config and validate_compatible to accept Base64Binary+BinaryPacked
- Replace old "not yet supported" test with parse_embedding_value_base64_binary_succeeds
- 886/893 tests pass (7 pre-existing Docker failures)

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add semantic_diagnostics module with SearchDiagnostics, SearchPipelineType,
SearchWarning, SearchMetricsCollector, PhaseTimer, score_statistics,
top1_margin. Instrument handle_semantic_search with per-phase timing
and warning collection. Wire SearchMetricsCollector into AppContext.
17 new tests, 902/910 lib tests pass (8 pre-existing Docker failures).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add SemanticDiagnosticsLogger with file append, rotation (50 MB), and
  retention cleanup (file-deletion based on mtime)
- Add SearchDiagnosticsEvent struct for JSONL serialization with
  raw_query redaction (opt-in via include_raw_queries) and snippet
  placeholder (include_snippets)
- Add config fields: jsonl_logging, jsonl_path, include_raw_queries,
  include_snippets, retention_days to SemanticBackendConfig
- Add lazy-init diagnostics_logger on AppContext with
  resolve_diagnostics_log_path helper (env var → project root → ~/.cache)
- Wire JSONL record into handle_semantic_search diagnostics block
- 4 new tests: raw query redaction, raw query inclusion, disk write
  verification, missing-file recovery
- 907/914 lib tests pass (7 pre-existing Docker failures)

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
…rch output

Add DiagnosticsOutputMode enum (Off/Minimal/Verbose) and output_mode field
to SemanticBackendConfig. Implement format_diagnostics_prefix() for
Minimal (warnings only) and Verbose (scores + latency + warnings)
output modes. Wire into handle_semantic_search response text.
4 new tests, 25 diagnostics tests total. 910/918 lib tests pass
(8 pre-existing Docker failures).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add optional reranking via OpenAI-compatible chat endpoint. When
enabled, aft_search overfetches candidates, sends them to a reranker
model, and re-sorts by relevance. Falls back gracefully on any error.

- Add RerankConfig fields to SemanticBackendConfig (rerank_enabled,
  rerank_model, rerank_base_url, rerank_api_key_env, rerank_timeout_ms,
  rerank_max_candidates)
- Create semantic_rerank.rs with RerankerClient, RerankOutcome enum,
  and rerank_candidates function
- Add RerankerFailure warning variant to SearchWarning
- Wire reranking into handle_semantic_search (overfetch → rerank → re-sort)
- Add rerank_latency_ms to SearchDiagnostics and SearchDiagnosticsEvent
- Include rerank latency in verbose diagnostics output
- 6 unit tests for reranker parsing, skip conditions, and failure handling

All 25 diagnostics + 6 reranker tests pass. 917/924 total tests pass
(7 pre-existing Docker infrastructure failures).

Add 40+ unit tests to fingerprint_invalidation_tests covering:
- SemanticBackendConfig deserialization (minimal, all-fields, defaults)
- EmbeddingModelProfile validation for all encoding types
- TypedVector conversion and StoredVector roundtrip
- convert_vector and validate_compatible rejection paths
- Distance metric auto-resolution for f32/int8/binary
- base64_int8 signed int8 decode correctness
- Template hashing, enum roundtrips, resolve helpers

Minor: add #[derive(Debug)] to StoredVector for test ergonomics.

Closes aft-t6p.6.1

Add 6 new tests to fingerprint_invalidation_tests covering:
- file_policy_hash mismatch triggers rebuild
- docs_chunker_version mismatch triggers rebuild
- multi-field changes still trigger rebuild
- rebuild+query_prompt: rebuild wins
- only query_prompt change: ClearQueryCache
- non-fingerprint field changes: NoChange

Total: 22 fingerprint tests. Closes aft-t6p.6.2

Add 29 tests covering:
- is_generated_file: protobuf, minified, dist, build, generated, dart
- is_doc_extension and is_config_extension validation
- classify_semantic_file for code/doc/config
- collect_docs_chunks markdown heading splitting
- SemanticFilePolicy defaults and builtin globs
- FileRecord field population
- build_manifest_from_store construction and cleanup

Closes aft-t6p.6.3
… tests

Add 23 tests covering:
- FlatF32VectorStore: search, empty, dimension mismatch, CRUD, prune, stats
- FlatBinaryHammingVectorStore: search, ranking, prune, delete, stats
- hamming_distance and popcount64 correctness
- Binary decode: byte-aligned, non-byte-aligned, padding, error

Closes aft-t6p.6.4

Add 8 tests covering:
- SemanticIndexLifecycle: cold start, set/get, failed+error, all variants
- SemanticIndexSnapshot: search ranking, immutability after clone
- VectorStore: prune_stale_vectors, prune_orphans

Closes aft-t6p.6.5

Add 10 tests covering:
- HybridRerank pipeline type display
- Metrics collector: window size 1, cache hit rate, zero result rate,
  low confidence rate, latency percentiles
- Diagnostics output mode defaults
- Warning formatting: minimal (all variants, verifies suppressed),
  verbose (all 9 variants)
- SearchWarning serde roundtrip for all 8 variants

Closes aft-t6p.6.6

Add 4 tests covering:
- Concurrent snapshot clones produce independent results
- Concurrent read threads see identical data via Arc
- Mutex contention across 10 threads does not deadlock
- Arc strong_count tracks clone/drop correctly

Closes aft-t6p.6.7

Add 6 tests covering:
- Trust file atomic write (no tmp files left behind)
- Multiple projects trusted independently
- Untrust is idempotent
- Trust state survives reload (serde roundtrip)
- Nonexistent project path is untrusted (fail-closed)

Closes aft-t6p.6.8

The validate_compatible_rejects_binary_stored_with_cosine_metric test
was missing source_vector_kind: BinaryPacked, causing the first match
block to fail with 'unsupported source→stored vector conversion' instead
of reaching the metric compatibility check.

Zireael added 4 commits June 1, 2026 09:24
Add local retrieval evaluation harness for measuring semantic search quality.

New files:
- crates/aft/src/semantic_eval.rs — pure-logic module with:
  - EvalCase, EvalResult, EvalSummary structs
  - JSONL parser (tolerates blank lines and comments)
  - path_matches() — cross-platform suffix matching
  - symbol_matches() — Rust/other-language symbol normalization
  - score_case() — per-case recall@k and MRR scoring
  - score_suite() — aggregate metrics across a suite
- crates/aft/src/commands/semantic_eval.rs — handler wiring:
  - Reads .aft/semantic-eval.jsonl, returns EvalSummary as JSON
  - Supports top_k override and include_per_case toggle
  - Returns tri-state response per AFT honest reporting convention

Wiring:
- crates/aft/src/lib.rs: pub mod semantic_eval
- crates/aft/src/commands/mod.rs: pub mod semantic_eval
- crates/aft/src/main.rs: dispatch semantic_eval command

Tests: 44 tests passing (parser, matcher, scorer, handler)
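Cross-platform suffix matching, as named in the path_matches() item above, might look like the sketch below. The component-wise comparison and the `path_suffix_matches` name are assumptions for illustration; the harness's actual matching rules are not shown in this PR text.

```rust
/// True when `expected` equals the trailing path components of `result_path`.
/// Backslashes are normalized to forward slashes so Windows and Unix paths compare equal.
pub fn path_suffix_matches(result_path: &str, expected: &str) -> bool {
    let normalize = |p: &str| p.replace('\\', "/");
    let (result, expected) = (normalize(result_path), normalize(expected));
    let result_parts: Vec<&str> = result.split('/').filter(|s| !s.is_empty()).collect();
    let expected_parts: Vec<&str> = expected.split('/').filter(|s| !s.is_empty()).collect();
    // Compare whole components so "pool.ts" does not match "mypool.ts".
    !expected_parts.is_empty() && result_parts.ends_with(&expected_parts)
}
```

Matching on whole components lets eval cases use repo-relative paths even when results carry absolute paths.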
Add semantic_doctor command that produces a SemanticHealthReport gathering:
- Config summary (backend, model, dimensions, metric, prompts, rerank)
- Index state (lifecycle, entry count, dimension, fingerprint freshness)
- Search quality metrics (p50/p95 latency, zero-result/low-confidence rates)
- Provider connectivity (optional probe)
- Active warnings and actionable suggestions

New files:
- crates/aft/src/semantic_doctor.rs — HealthStatus, ConfigSummary,
  IndexSummary, MetricsSummary, ProviderSummary, Suggestion,
  SemanticHealthReport structs with Serialize and Display impls
- crates/aft/src/commands/semantic_doctor.rs — command handler with
  optional probe_provider param, suggestion generation for disabled/
  building/failed/ready states, 7 handler tests + 6 model tests

Wiring:
- crates/aft/src/lib.rs: pub mod semantic_doctor
- crates/aft/src/commands/mod.rs: pub mod semantic_doctor
- crates/aft/src/main.rs: dispatch "semantic_doctor" command

Also: fix semantic_eval temp directory race condition (atomic counter).

Tests: 14 semantic_doctor + 44 semantic_eval passing, check+clippy+fmt clean.

Extend the semantic_index_info section of the status command to include:
- Search quality metrics (total_queries, p50/p95 latency, zero_result_rate,
  low_confidence_rate, embedding_failure_rate, lexical_failure_rate)
- Rerank status (rerank_enabled, rerank_model)
- Diagnostics state (diagnostics_enabled, prompt_active)

The TUI/status surfaces can now show pipeline health without a separate
semantic_doctor call. Metrics are zero when no queries have been recorded.

Tests: status + semantic_doctor tests passing, check+clippy+fmt clean.

- Add 3 new tests: markdown-fence parsing, snippet truncation, max_candidates limit
- Fix missing-ID append: semantic_search now appends missing indices in original order
- Add rerank_max_candidate_chars config field (default 2500) to SemanticBackendConfig
- Use config.rerank_max_candidate_chars instead of hardcoded 200 in reranker
- Update all test configs with new field

Bead: aft-t6p.2.1
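
The missing-ID append and snippet truncation described in this commit can be sketched as follows. The function names `apply_rerank_order` and `truncate_snippet` are hypothetical, and the choice to drop out-of-range and repeated indices is an assumption about how the handler treats malformed reranker output.

```rust
/// Hypothetical helper: turn the indices returned by the reranker into a
/// final ordering over `candidate_count` candidates.
fn apply_rerank_order(candidate_count: usize, reranked: &[usize]) -> Vec<usize> {
    let mut used = vec![false; candidate_count];
    let mut order = Vec::with_capacity(candidate_count);
    for &i in reranked {
        // Ignore indices that are out of range or already emitted.
        if i < candidate_count && !used[i] {
            used[i] = true;
            order.push(i);
        }
    }
    // Append candidates the reranker left out, in their original order.
    for (i, seen) in used.iter().enumerate() {
        if !seen {
            order.push(i);
        }
    }
    order
}

/// Hypothetical helper: cut a snippet to at most `max_chars` characters
/// without splitting a multi-byte UTF-8 character.
fn truncate_snippet(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text,
    }
}
```

Counting characters rather than bytes keeps the limit safe for non-ASCII source text, at the cost of a linear scan per snippet.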

@cubic-dev-ai cubic-dev-ai Bot left a comment


4 issues found across 107 files

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.

Comment thread .beads/README.md Outdated
Comment thread .beads/config.yaml Outdated
Comment thread .claude/settings.json Outdated
Comment thread .qartez/acks/5813b13fa433d553 Outdated
@Zireael Zireael changed the title from "feat(semantic): provider-aware typed embeddings, reranking, diagnostics, and eval harness" to "feat(semantic): [alpha build] provider-aware typed embeddings, reranking, diagnostics, and eval harness" on Jun 2, 2026
Remove .beads/, .qartez/, .claude/, .omo/, .kiro/, .lean-ctx/ from
the branch. These are local agent working directories that should not
be distributed. Add them to .gitignore to prevent future accidents.

Addresses cubic review comments on PR cortexkit#87.

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 69 files (changes from recent commits).

<file name=".gitignore">

<violation number="1" location=".gitignore:95">
P2: Inconsistent .gitignore pattern: `omo/` should likely be `.omo/` to match the hidden tooling directory convention used by all other entries in this block.</violation>
</file>


Comment thread .gitignore
.beads/
.qartez/
.claude/
omo/
<file context>
@@ -87,3 +87,11 @@ benchmarks/aft-search/.bench/
+.beads/
+.qartez/
+.claude/
+omo/
+.kiro/
+.lean-ctx/
</file context>

Comment thread crates/aft/src/commands/semantic_search.rs
Comment thread crates/aft/src/semantic_rerank.rs
Zireael added 2 commits June 2, 2026 20:43
Remove .alfonso/, agents.md, beads-data-*.jsonl, magic-context-*.md,
biome.json_ from the branch. Add them to .gitignore to prevent future
inclusion in PRs.

Restore .alfonso/ from main (it exists upstream). Keep agents.md,
beads-data-*.jsonl, magic-context-*.md, biome.json_ removed and
gitignored since they don't exist on main.
Comment thread packages/opencode-plugin/src/config.ts
Comment thread crates/aft/src/commands/semantic_search.rs Outdated
@Zireael
Contributor Author

Zireael commented Jun 2, 2026

This is the source code for the semantic search functionality, offered as a public preview.
The feature skeleton is in place; it still needs finishing up, polishing, static tests, and functional testing.
One more addition worth making is model2vec 'Potion Code 16M' support. If it performs well in tests, I think it could become a fast, cheap, and performant default semantic model.

Here are the implementation plans for the sprints under this epic (in gastown beads format):
aft-semantic-search-upgrade.json

1. Fix duplicate entries in reranked output (greptile P1)
   - Add !used[i] check in filter_map to prevent duplicate indices
   - File: crates/aft/src/commands/semantic_search.rs

2. Strip markdown fences from LLM reranker responses (greptile P1)
   - Many chat models wrap JSON in code fences
   - Add strip_markdown_fences() helper applied before parsing
   - File: crates/aft/src/semantic_rerank.rs

3. Align TypeScript enum values with Rust serde (cubic P1)
   - SemanticBackendEnum: add perplexity variant
   - SemanticOutputEncodingEnum: float, base64_int8, base64_binary
   - SemanticStorageStrategyEnum: native_f32, decode_normalize_f32, binary_packed
   - SemanticInputModeEnum: flat_texts, document_chunks
   - SemanticDistanceMetricEnum: auto, cosine, dot_product, euclidean, hamming
   - File: packages/opencode-plugin/src/config.ts
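
Item 2 in the list above can be illustrated with a small sketch. The PR names the helper `strip_markdown_fences()`; the body below is an assumption about its behavior, not the actual implementation. The fence marker is spelled with `\u{60}` escapes only so that this example does not contain a literal fence.

```rust
/// Three backticks, written with escapes to keep this example fence-safe.
const FENCE: &str = "\u{60}\u{60}\u{60}";

/// Sketch: return the payload of a fenced block, or the trimmed input when
/// the text is not wrapped in a complete opening and closing fence.
fn strip_markdown_fences(raw: &str) -> &str {
    let trimmed = raw.trim();
    let rest = match trimmed.strip_prefix(FENCE) {
        Some(rest) => rest,
        None => return trimmed,
    };
    // Everything up to the first newline is an optional language tag (e.g. "json").
    let body = match rest.find('\n') {
        Some(newline) => &rest[newline + 1..],
        None => return trimmed,
    };
    // Only strip when a closing fence is present; otherwise leave the text alone.
    match body.trim_end().strip_suffix(FENCE) {
        Some(inner) => inner.trim(),
        None => trimmed,
    }
}
```

Applying the helper before JSON parsing means a plain, unfenced response passes through unchanged, so the parser sees the same input either way.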

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 4 files (changes from recent commits).

<file name="packages/opencode-plugin/src/config.ts">

<violation number="1" location="packages/opencode-plugin/src/config.ts:40">
P2: Semantic enum literals were renamed without backward-compatibility aliases or migration, breaking existing configs that use old values.</violation>
</file>


const SemanticBackendEnum = z.enum(["fastembed", "openai_compatible", "ollama", "perplexity"]);

/** Output encoding mode for embeddings. */
const SemanticOutputEncodingEnum = z.enum(["float", "base64_int8", "base64_binary"]);
<file context>
@@ -34,19 +34,19 @@ const CheckerEnum = z.enum([
 
 /** Output encoding mode for embeddings. */
-const SemanticOutputEncodingEnum = z.enum(["float", "binary", "ubinary", "int8", "uint8"]);
+const SemanticOutputEncodingEnum = z.enum(["float", "base64_int8", "base64_binary"]);
 
 /** Storage strategy for embedding vectors. */
</file context>

Zireael added 4 commits June 2, 2026 23:16
…s, retry, diagnostics

Add three features to build_with_progress_contextualized:

1. Oversized document handling: split_oversized_document() partitions
   documents exceeding DEFAULT_MAX_CHUNKS_PER_DOCUMENT (100) into
   sub-groups, preserving chunk order with synthetic '(part N)' titles.

2. Retry logic: embed_document_group_with_retry() wraps each document
   group with exponential backoff (3 retries, 1s base, 8s cap), only
   retrying transient errors (rate limits, timeouts, server errors).
   Failed groups are skipped with a warning instead of aborting the
   entire build.

3. Diagnostics: ContextualizedBuildDiagnostics struct tracks
   documents_processed, chunks_embedded, rejected_oversized,
   retried_groups, failed_groups, and max_chunks_in_document.
   Summary logged via slog_info! at build completion.
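
The retry policy in item 2 (3 retries, 1s base, 8s cap, transient errors only) might look like the sketch below. `EmbedError`, `backoff_delay`, and `retry_with_backoff` are hypothetical names, and the delay doubling from the base on each attempt is an assumption; the commit message gives only the bounds. The sleep function is passed in so the policy can be tested without waiting.

```rust
use std::time::Duration;

/// Hypothetical error type standing in for the provider's failures.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum EmbedError {
    RateLimited,
    Timeout,
    Server(u16),
    InvalidRequest(String),
}

impl EmbedError {
    /// Only rate limits, timeouts, and server errors are worth retrying.
    fn is_transient(&self) -> bool {
        matches!(self, EmbedError::RateLimited | EmbedError::Timeout | EmbedError::Server(_))
    }
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt)).min(cap)
}

/// Run `op`, retrying transient failures up to `max_retries` times.
fn retry_with_backoff<T>(
    max_retries: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, EmbedError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, EmbedError> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if err.is_transient() && attempt < max_retries => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
            // Non-transient errors, or retries exhausted: give up.
            Err(err) => return Err(err),
        }
    }
}
```

A caller that skips failed groups instead of aborting would log the returned `Err` as a warning and continue with the next document group.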
Coverage:
- chunks grouped by source document (multi-file)
- chunk order preserved within each document
- wrong chunk count in response fails loudly
- unknown file path in response fails
- dimension mismatch fails with specific error
- stale-vector pruning after contextualized index + refresh
- Perplexity backend defaults to DocumentChunks input mode
- Fastembed backend verifies FlatTexts for contrast
- oversized document is split into sub-groups (>100 chunks)
- empty file set produces empty index
- retry on transient errors (429 rate limit)
- non-transient errors are NOT retried
- progress callback reports correct done/total counts
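
The oversized-document case in the coverage list can be sketched as below. `DocumentGroup` is a hypothetical stand-in for whatever type the build uses, and the exact title format is an assumption based on the "(part N)" wording in the commit message.

```rust
/// Limit named in the commit message.
const DEFAULT_MAX_CHUNKS_PER_DOCUMENT: usize = 100;

/// Hypothetical stand-in for a document and its ordered chunks.
#[derive(Debug, Clone, PartialEq)]
struct DocumentGroup {
    title: String,
    chunks: Vec<String>,
}

/// Split a document into sub-groups of at most `max_chunks` chunks,
/// keeping chunk order and adding a synthetic "(part N)" title suffix.
fn split_oversized_document(doc: DocumentGroup, max_chunks: usize) -> Vec<DocumentGroup> {
    // A zero limit is treated as "no splitting" to avoid a panic in `chunks`.
    if max_chunks == 0 || doc.chunks.len() <= max_chunks {
        return vec![doc];
    }
    doc.chunks
        .chunks(max_chunks)
        .enumerate()
        .map(|(i, part)| DocumentGroup {
            title: format!("{} (part {})", doc.title, i + 1),
            chunks: part.to_vec(),
        })
        .collect()
}
```

A 250-chunk document splits into parts of 100, 100, and 50 chunks under the default limit, and concatenating the parts reproduces the original chunk order.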
CRITICAL fixes:
- cosine_similarity: guard NaN from zero-norm vectors + clamp to [-1,1]
- semantic_search: remove unconditional Ready status overwrite (search
  must not change lifecycle state)
- reranker: add out-of-bounds index warning when LLM returns indices
  exceeding candidate count

HIGH fixes:
- build_embed_text: remove duplicate name: field in embed text format
- split_large_chunk: fix end_line for final sub-chunk (was using
  chunk.start_line + total_lines instead of chunk_start + current_lines)
- strip_markdown_fences: robust fence stripping with language tag
  handling and proper closing-fence detection
- rejected_oversized: actually increment counter when documents are split

MEDIUM fixes:
- SemanticCancellationToken: use Acquire/Release ordering instead of
  Relaxed for cross-thread generation counter
- semantic_search: validate non-empty query before processing
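
The cosine_similarity fix listed under CRITICAL can be illustrated with a short sketch. This is not the PR's code: returning 0.0 for zero-norm, empty, or mismatched-length inputs is an assumption about the chosen fallback.

```rust
/// Sketch of a cosine similarity that never returns NaN and stays in [-1, 1].
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    // Assumption: mismatched or empty inputs score as "no similarity".
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    let denom = norm_a.sqrt() * norm_b.sqrt();
    // A zero-norm vector would make this 0/0, which is NaN.
    if denom == 0.0 || !denom.is_finite() {
        return 0.0;
    }
    let score = dot / denom;
    if score.is_nan() {
        return 0.0;
    }
    // Rounding error can push the ratio slightly past the valid range.
    score.clamp(-1.0, 1.0)
}
```

The clamp matters for downstream code that feeds the score into `acos` or a threshold comparison, where a value such as 1.0000001 would otherwise misbehave.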
parse_semantic_config previously only handled 6 fields (backend, model,
base_url, api_key_env, timeout_ms, max_batch_size). Now it also parses:
dimensions, output_encoding, input_mode, storage_strategy, distance_metric,
query_prompt_template, document_prompt_template, diagnostics_enabled,
low_confidence_threshold, output_mode, rerank_enabled, rerank_model,
rerank_base_url, rerank_api_key_env, rerank_timeout_ms,
rerank_max_candidates, rerank_max_candidate_chars.

Note: the TS plugin's getStrippedSemanticKeys() intentionally strips these
fields from PROJECT config (untrusted) as a security boundary. They can
still be set from USER config (trusted). The Rust side now correctly
accepts all fields when the plugin sends them.

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 5 files (changes from recent commits).

<file name="crates/aft/src/commands/configure.rs">

<violation number="1" location="crates/aft/src/commands/configure.rs:342">
P1: `rerank_base_url` is parsed without SSRF validation, unlike `base_url`</violation>
</file>


.to_string()
.into();
}
if let Some(raw) = obj.get("rerank_base_url") {
<file context>
@@ -230,6 +230,150 @@ fn parse_semantic_config(
+            .to_string()
+            .into();
+    }
+    if let Some(raw) = obj.get("rerank_base_url") {
+        semantic.rerank_base_url = raw
+            .as_str()
</file context>

Also parse: jsonl_logging, jsonl_path, include_raw_queries,
include_snippets, retention_days, metrics_window_size.

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 1 file (changes from recent commits).

<file name="crates/aft/src/commands/configure.rs">

<violation number="1" location="crates/aft/src/commands/configure.rs:382">
P2: `semantic.jsonl_path` lacks path validation/normalization, unlike other path configs in this file (`validate_storage_dir`, `parse_lsp_paths_extra`) which enforce absolute paths and reject `..` traversal. This creates path-injection risk for downstream JSONL diagnostics writes.</violation>

<violation number="2" location="crates/aft/src/commands/configure.rs:402">
P2: `semantic.retention_days` uses lossy `u64 -> u32` cast with silent overflow instead of explicit validation.</violation>
</file>


"configure: semantic.jsonl_logging must be a boolean".to_string()
})?;
}
if let Some(raw) = obj.get("jsonl_path") {
<file context>
@@ -374,6 +374,42 @@ fn parse_semantic_config(
+            "configure: semantic.jsonl_logging must be a boolean".to_string()
+        })?;
+    }
+    if let Some(raw) = obj.get("jsonl_path") {
+        semantic.jsonl_path = if raw.is_null() {
+            None
</file context>
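
One way to address the jsonl_path finding is sketched below, using the two rules the review attributes to the file's other path configs: absolute paths only, and no `..` traversal. `validate_jsonl_path` and its error strings are hypothetical; the existing `validate_storage_dir` helper may already cover this and would be the better choice if it does.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical validator for semantic.jsonl_path.
fn validate_jsonl_path(raw: &str) -> Result<PathBuf, String> {
    let path = Path::new(raw);
    if !path.is_absolute() {
        return Err("configure: semantic.jsonl_path must be an absolute path".to_string());
    }
    // Reject any `..` component rather than trying to resolve it.
    if path.components().any(|c| matches!(c, Component::ParentDir)) {
        return Err("configure: semantic.jsonl_path must not contain '..'".to_string());
    }
    Ok(path.to_path_buf())
}
```

Checking components instead of searching the raw string for ".." avoids rejecting legitimate file names that merely contain two dots.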

Comment on lines +402 to +404
semantic.retention_days = raw.as_u64().ok_or_else(|| {
"configure: semantic.retention_days must be an unsigned integer".to_string()
})? as u32;
<file context>
@@ -374,6 +374,42 @@ fn parse_semantic_config(
+        })?;
+    }
+    if let Some(raw) = obj.get("retention_days") {
+        semantic.retention_days = raw.as_u64().ok_or_else(|| {
+            "configure: semantic.retention_days must be an unsigned integer".to_string()
+        })? as u32;
</file context>
Suggested change
-semantic.retention_days = raw.as_u64().ok_or_else(|| {
-    "configure: semantic.retention_days must be an unsigned integer".to_string()
-})? as u32;
+let v = raw.as_u64().ok_or_else(|| {
+    "configure: semantic.retention_days must be an unsigned integer".to_string()
+})?;
+semantic.retention_days = u32::try_from(v)
+    .map_err(|_| "configure: semantic.retention_days is too large".to_string())?;
