Skip to content

feat(search): expose semantic index size diagnostics#82

Open
Suknna wants to merge 1 commit into
cortexkit:mainfrom
Suknna:semantic-index-observability
Open

feat(search): expose semantic index size diagnostics#82
Suknna wants to merge 1 commit into
cortexkit:mainfrom
Suknna:semantic-index-observability

Conversation

@Suknna
Copy link
Copy Markdown
Contributor

@Suknna Suknna commented Jun 2, 2026

Summary

  • Rebase the PR onto upstream main at v0.35.2.
  • Keep the existing v0.35.x phase timing logs intact and avoid duplicating them.
  • Add semantic index payload/cache/clone diagnostics that are still missing after v0.35.2:
    • embed_text_bytes and snippet_bytes in the semantic collect log.
    • Estimated in-memory payload bytes for semantic indexes.
    • Estimated serialized cache bytes in semantic build progress/status stats.
    • Cache read/write byte and timing diagnostics.
    • Clone-cost logs for the refresh-worker and corpus-refresh index clones.

Context

Related issue: #81

The issue discussion notes that v0.35.0 already added phase timing logs for semantic collect/embed and search-index cold build. This revision narrows the PR to the remaining observability gap: byte estimates plus cache/clone cost visibility, while reusing the existing progress/status/logging infrastructure.

Verification

  • cargo fmt -p agent-file-tools -- --check
  • cargo check -p agent-file-tools
  • cargo test -p agent-file-tools status_exposes_semantic_progress_stats
  • cargo test -p agent-file-tools memory_estimate_counts_scaling_payload_bytes
  • cargo test -p agent-file-tools test_serialization_roundtrip
  • cargo test -p agent-file-tools natural_language_auto_falls_back_to_grep_while_semantic_builds
  • git diff --check

Greptile Summary

This PR wires up the remaining semantic-index observability gap after v0.35.2: byte-level payload estimates (memory_estimate, serialized_size_estimate), cache read/write timing diagnostics in the persist and load paths, and clone-cost logs for both the refresh-worker clone and the corpus-refresh clone. The new SemanticProgressStats struct is propagated through the event/status system so consumers can read all counters from the status response.

  • semantic_index.rs: Adds SemanticMemoryEstimate and SemanticProgressStats structs, memory_estimate() and serialized_size_estimate() methods, enriched log messages for collect/persist/load, and serialize/write/read/decode timing.
  • configure.rs: Adds three helper functions for building SemanticProgressStats from estimates, emits clone-cost logs and a starting_refresh_worker progress event, and propagates stats to every SemanticIndexEvent::Progress send site.
  • status.rs / context.rs: Exposes stats in the JSON status payload; new test status_exposes_semantic_progress_stats covers the end-to-end field mapping.

Confidence Score: 4/5

Safe to merge; all changes are additive diagnostics with no mutations to search correctness or index persistence logic.

The core indexing, embedding, and cache paths are unchanged. The new byte-estimate methods are read-only and verified by unit tests; the timing instrumentation is fire-and-forget. The only non-trivial concern is the starting_refresh_worker progress event, which could momentarily expose a Building status just before Ready arrives, but the FIFO channel ordering limits this to a very short window and the index itself is already fully built at that point.

The starting_refresh_worker event in configure.rs and the cache_bytes source discrepancy in semantic_index.rs load path are worth a second look before merge.

Important Files Changed

Filename Overview
crates/aft/src/semantic_index.rs Adds SemanticMemoryEstimate struct, memory_estimate() and serialized_size_estimate() methods, and timing/byte diagnostics in the persist and load paths. The serialized_size_estimate() is verified to match to_bytes() by a new unit test.
crates/aft/src/commands/configure.rs Adds three helper functions for SemanticProgressStats construction, clone-cost logs and a starting_refresh_worker progress event, and propagates stats=None/Some to all existing Progress send sites.
crates/aft/src/context.rs Adds SemanticProgressStats struct and extends SemanticIndexStatus::Building and SemanticIndexEvent::Progress with an optional stats field. Straightforward structural change.
crates/aft/src/commands/status.rs Adds semantic_progress_stats_json() helper and exposes stats in the Building status JSON. New test status_exposes_semantic_progress_stats covers the full field mapping.
crates/aft/src/cli/warmup.rs Adds .. wildcard to ignore new stats field in semantic_index_state pattern match; propagates stats through drain_semantic_index_events. Minimal mechanical change.
crates/aft/src/main.rs Propagates stats field in two Progress send sites and in drain_semantic_index_events. Mechanical pattern-match update with no logic changes.
crates/aft/tests/integration/aft_search_contract_test.rs Adds stats: None to two SemanticIndexStatus::Building literals in integration tests to satisfy the new struct field. No logic changes.

Sequence Diagram

sequenceDiagram
    participant BC as Build Closure
    participant TX as tx_progress / tx
    participant Drain as drain_semantic_index_events
    participant Status as SemanticIndexStatus

    BC->>BC: SemanticIndex::build_with_progress()
    BC->>TX: "Progress{stage=collecting, stats=None}"
    TX->>Drain: process Progress
    Drain->>Status: "Building{stage=collecting}"

    BC->>BC: Load/build index complete
    BC->>TX: "Progress{stage=persisting_index, stats=Some(cache+payload bytes)}"
    TX->>Drain: process Progress
    Drain->>Status: "Building{stage=persisting_index, stats=SemanticProgressStats}"

    BC->>BC: index.clone() for refresh worker
    BC->>BC: log clone timing
    BC->>TX: "Progress{stage=starting_refresh_worker, stats=Some(clone_estimated_bytes)}"
    TX->>Drain: process Progress
    Drain->>Status: "Building{stage=starting_refresh_worker, stats=SemanticProgressStats}"

    BC->>TX: Ready(index)
    TX->>Drain: process Ready
    Drain->>Status: "Ready{...}"

    Note over Status: status JSON now includes stats field
Loading

Comments Outside Diff (1)

  1. crates/aft/src/commands/configure.rs, line 213-219 (link)

    P2 starting_refresh_worker stage may briefly surface as Building after the index is ready

    A Progress { stage: "starting_refresh_worker" } event is enqueued just before the Ready event. Both go through the same channel so the drain processes them in FIFO order, but any status consumer that reads between these two drain iterations will see SemanticIndexStatus::Building with fully-populated stats even though the index is done. Stage values outside the expected build stages could confuse tooling that inspects stage to decide whether to wait. Consider gating this event behind a debug/verbose flag or removing it if the clone timing in the slog_info! above is sufficient.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "feat(search): expose semantic index size..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 7 files

Re-trigger cubic

@Suknna Suknna force-pushed the semantic-index-observability branch from 638b4c1 to f5bf774 Compare June 4, 2026 02:15
@Suknna Suknna changed the title feat(search): expose semantic index build timings feat(search): expose semantic index size diagnostics Jun 4, 2026
Comment on lines +2231 to 2240
let estimate = index.memory_estimate();
slog_info!(
"loaded semantic index from disk: {} entries",
index.entries.len()
"loaded semantic index from disk: {} entries, {} cache bytes, {} payload bytes (read_ms={}, decode_ms={})",
index.entries.len(),
file_len,
estimate.estimated_payload_bytes,
read_ms,
decode_ms,
);
Some(index)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The progress-stats sent in the cache-hit path use serialized_size_estimate() for cache_bytes, but the load-path log message uses the actual file_len from fs::metadata. For a freshly deserialized index these values should agree (and the round-trip test confirms it for the happy path), but any skipped-path edge case (entry whose file is outside project_root) would cause serialized_size_estimate() to under-count while file_len remains the ground truth. Passing the already-available file_len keeps the two sources consistent without extra I/O.

Suggested change
let estimate = index.memory_estimate();
slog_info!(
"loaded semantic index from disk: {} entries",
index.entries.len()
"loaded semantic index from disk: {} entries, {} cache bytes, {} payload bytes (read_ms={}, decode_ms={})",
index.entries.len(),
file_len,
estimate.estimated_payload_bytes,
read_ms,
decode_ms,
);
Some(index)
let estimate = index.memory_estimate();
slog_info!(
"loaded semantic index from disk: {} entries, {} cache bytes, {} payload bytes (read_ms={}, decode_ms={})",
index.entries.len(),
file_len,
estimate.estimated_payload_bytes,
read_ms,
decode_ms,
);
Some((index, file_len as u64))

Comment on lines +1204 to +1235
pub fn serialized_size_estimate(&self) -> u64 {
let fingerprint_len = self
.fingerprint
.as_ref()
.map(|fingerprint| fingerprint.as_string().len() as u64)
.unwrap_or(0);
let mut total = 1 + 4 + 4 + 4 + fingerprint_len;

total += 4;
for (path, _mtime) in &self.file_mtimes {
let Some(relative) = cache_relative_path(&self.project_root, path) else {
continue;
};
let path_len = relative.to_string_lossy().len() as u64;
total += 4 + path_len + 8 + 4 + 8 + 32;
}

for entry in &self.entries {
let Some(relative) = cache_relative_path(&self.project_root, &entry.chunk.file) else {
continue;
};
total += 4 + relative.to_string_lossy().len() as u64;
total += 4 + entry.chunk.name.len() as u64;
total += 1;
total += 4 + 4 + 1;
total += 4 + entry.chunk.snippet.len() as u64;
total += 4 + entry.chunk.embed_text.len() as u64;
total += (entry.vector.len() * std::mem::size_of::<f32>()) as u64;
}

total
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 serialized_size_estimate header literal silently wrong if format gains a new fixed field

The magic constant 1 + 4 + 4 + 4 + fingerprint_len is correct for the current V6 layout, and the round-trip test catches any drift today. However, if a future version bumps the format and adds a fixed header field without updating this literal, the estimate will silently diverge. A named constant (e.g. HEADER_BYTES_V6) mirroring the existing HEADER_BYTES_V1 pattern would make the coupling explicit and keep the test meaningful as a drift detector.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@Suknna
Copy link
Copy Markdown
Contributor Author

Suknna commented Jun 4, 2026

CI update: the new Linux failure is in @cortexkit/aft-pi test bash tool adapter > foreground running command promotes to background after timeout.

I checked the failing log: the assertion expected two bash_status calls before bash_promote, but CI produced one bash_status followed by bash_promote. With timeout: 1, that is a timing-sensitive test path and is unrelated to this PR's Semantic Index diagnostics changes. Windows and macOS unit jobs passed, and the Rust/Semantic-focused local checks listed in the PR body pass.

I am leaving this PR scoped to Semantic Index observability rather than mixing in an unrelated Pi bash test adjustment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant