feat(search): expose semantic index size diagnostics by Suknna · Pull Request #82 · cortexkit/aft

Suknna · 2026-06-02T02:34:20Z

Summary

Rebase the PR onto upstream main at v0.35.2.
Keep the existing v0.35.x phase timing logs intact and avoid duplicating them.
Add semantic index payload/cache/clone diagnostics that are still missing after v0.35.2:
- embed_text_bytes and snippet_bytes in the semantic collect log.
- Estimated in-memory payload bytes for semantic indexes.
- Estimated serialized cache bytes in semantic build progress/status stats.
- Cache read/write byte and timing diagnostics.
- Clone-cost logs for the refresh-worker and corpus-refresh index clones.

Context

Related issue: #81

The issue discussion notes that v0.35.0 already added phase timing logs for semantic collect/embed and search-index cold build. This revision narrows the PR to the remaining observability gap: byte estimates plus cache/clone cost visibility, while reusing the existing progress/status/logging infrastructure.

Verification

cargo fmt -p agent-file-tools -- --check
cargo check -p agent-file-tools
cargo test -p agent-file-tools status_exposes_semantic_progress_stats
cargo test -p agent-file-tools memory_estimate_counts_scaling_payload_bytes
cargo test -p agent-file-tools test_serialization_roundtrip
cargo test -p agent-file-tools natural_language_auto_falls_back_to_grep_while_semantic_builds
git diff --check

Greptile Summary

This PR wires up the remaining semantic-index observability gap after v0.35.2: byte-level payload estimates (memory_estimate, serialized_size_estimate), cache read/write timing diagnostics in the persist and load paths, and clone-cost logs for both the refresh-worker clone and the corpus-refresh clone. The new SemanticProgressStats struct is propagated through the event/status system so consumers can read all counters from the status response.

semantic_index.rs: Adds SemanticMemoryEstimate and SemanticProgressStats structs, memory_estimate() and serialized_size_estimate() methods, enriched log messages for collect/persist/load, and serialize/write/read/decode timing.
configure.rs: Adds three helper functions for building SemanticProgressStats from estimates, emits clone-cost logs and a starting_refresh_worker progress event, and propagates stats to every SemanticIndexEvent::Progress send site.
status.rs / context.rs: Exposes stats in the JSON status payload; new test status_exposes_semantic_progress_stats covers the end-to-end field mapping.

Confidence Score: 4/5

Safe to merge; all changes are additive diagnostics with no mutations to search correctness or index persistence logic.

The core indexing, embedding, and cache paths are unchanged. The new byte-estimate methods are read-only and verified by unit tests; the timing instrumentation is fire-and-forget. The only non-trivial concern is the starting_refresh_worker progress event, which could momentarily expose a Building status just before Ready arrives, but the FIFO channel ordering limits this to a very short window and the index itself is already fully built at that point.

The starting_refresh_worker event in configure.rs and the cache_bytes source discrepancy in semantic_index.rs load path are worth a second look before merge.

Important Files Changed

Filename	Overview
crates/aft/src/semantic_index.rs	Adds SemanticMemoryEstimate struct, memory_estimate() and serialized_size_estimate() methods, and timing/byte diagnostics in the persist and load paths. The serialized_size_estimate() is verified to match to_bytes() by a new unit test.
crates/aft/src/commands/configure.rs	Adds three helper functions for SemanticProgressStats construction, clone-cost logs and a starting_refresh_worker progress event, and propagates stats=None/Some to all existing Progress send sites.
crates/aft/src/context.rs	Adds SemanticProgressStats struct and extends SemanticIndexStatus::Building and SemanticIndexEvent::Progress with an optional stats field. Straightforward structural change.
crates/aft/src/commands/status.rs	Adds semantic_progress_stats_json() helper and exposes stats in the Building status JSON. New test status_exposes_semantic_progress_stats covers the full field mapping.
crates/aft/src/cli/warmup.rs	Adds .. wildcard to ignore new stats field in semantic_index_state pattern match; propagates stats through drain_semantic_index_events. Minimal mechanical change.
crates/aft/src/main.rs	Propagates stats field in two Progress send sites and in drain_semantic_index_events. Mechanical pattern-match update with no logic changes.
crates/aft/tests/integration/aft_search_contract_test.rs	Adds stats: None to two SemanticIndexStatus::Building literals in integration tests to satisfy the new struct field. No logic changes.

Sequence Diagram

sequenceDiagram
    participant BC as Build Closure
    participant TX as tx_progress / tx
    participant Drain as drain_semantic_index_events
    participant Status as SemanticIndexStatus

    BC->>BC: SemanticIndex::build_with_progress()
    BC->>TX: "Progress{stage=collecting, stats=None}"
    TX->>Drain: process Progress
    Drain->>Status: "Building{stage=collecting}"

    BC->>BC: Load/build index complete
    BC->>TX: "Progress{stage=persisting_index, stats=Some(cache+payload bytes)}"
    TX->>Drain: process Progress
    Drain->>Status: "Building{stage=persisting_index, stats=SemanticProgressStats}"

    BC->>BC: index.clone() for refresh worker
    BC->>BC: log clone timing
    BC->>TX: "Progress{stage=starting_refresh_worker, stats=Some(clone_estimated_bytes)}"
    TX->>Drain: process Progress
    Drain->>Status: "Building{stage=starting_refresh_worker, stats=SemanticProgressStats}"

    BC->>TX: Ready(index)
    TX->>Drain: process Ready
    Drain->>Status: "Ready{...}"

    Note over Status: status JSON now includes stats field

Comments Outside Diff (1)

crates/aft/src/commands/configure.rs, line 213-219 (link)

starting_refresh_worker stage may briefly surface as Building after the index is ready

A Progress { stage: "starting_refresh_worker" } event is enqueued just before the Ready event. Both go through the same channel so the drain processes them in FIFO order, but any status consumer that reads between these two drain iterations will see SemanticIndexStatus::Building with fully-populated stats even though the index is done. Stage values outside the expected build stages could confuse tooling that inspects stage to decide whether to wait. Consider gating this event behind a debug/verbose flag or removing it if the clone timing in the slog_info! above is sufficient.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

_{Reviews (1): Last reviewed commit: "feat(search): expose semantic index size..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

cubic-dev-ai

No issues found across 7 files

_{Re-trigger cubic}

greptile-apps · 2026-06-04T02:23:04Z

+                let estimate = index.memory_estimate();
                slog_info!(
-                    "loaded semantic index from disk: {} entries",
-                    index.entries.len()
+                    "loaded semantic index from disk: {} entries, {} cache bytes, {} payload bytes (read_ms={}, decode_ms={})",
+                    index.entries.len(),
+                    file_len,
+                    estimate.estimated_payload_bytes,
+                    read_ms,
+                    decode_ms,
                );
                Some(index)


The progress-stats sent in the cache-hit path use serialized_size_estimate() for cache_bytes, but the load-path log message uses the actual file_len from fs::metadata. For a freshly deserialized index these values should agree (and the round-trip test confirms it for the happy path), but any skipped-path edge case (entry whose file is outside project_root) would cause serialized_size_estimate() to under-count while file_len remains the ground truth. Passing the already-available file_len keeps the two sources consistent without extra I/O.

Suggested change

let estimate = index.memory_estimate();

slog_info!(

"loaded semantic index from disk: {} entries",

index.entries.len()

"loaded semantic index from disk: {} entries, {} cache bytes, {} payload bytes (read_ms={}, decode_ms={})",

index.entries.len(),

file_len,

estimate.estimated_payload_bytes,

read_ms,

decode_ms,

);

Some(index)

let estimate = index.memory_estimate();

slog_info!(

"loaded semantic index from disk: {} entries, {} cache bytes, {} payload bytes (read_ms={}, decode_ms={})",

index.entries.len(),

file_len,

estimate.estimated_payload_bytes,

read_ms,

decode_ms,

);

Some((index, file_len as u64))

greptile-apps · 2026-06-04T02:23:06Z

+    pub fn serialized_size_estimate(&self) -> u64 {
+        let fingerprint_len = self
+            .fingerprint
+            .as_ref()
+            .map(|fingerprint| fingerprint.as_string().len() as u64)
+            .unwrap_or(0);
+        let mut total = 1 + 4 + 4 + 4 + fingerprint_len;
+
+        total += 4;
+        for (path, _mtime) in &self.file_mtimes {
+            let Some(relative) = cache_relative_path(&self.project_root, path) else {
+                continue;
+            };
+            let path_len = relative.to_string_lossy().len() as u64;
+            total += 4 + path_len + 8 + 4 + 8 + 32;
+        }
+
+        for entry in &self.entries {
+            let Some(relative) = cache_relative_path(&self.project_root, &entry.chunk.file) else {
+                continue;
+            };
+            total += 4 + relative.to_string_lossy().len() as u64;
+            total += 4 + entry.chunk.name.len() as u64;
+            total += 1;
+            total += 4 + 4 + 1;
+            total += 4 + entry.chunk.snippet.len() as u64;
+            total += 4 + entry.chunk.embed_text.len() as u64;
+            total += (entry.vector.len() * std::mem::size_of::<f32>()) as u64;
+        }
+
+        total
+    }


serialized_size_estimate header literal silently wrong if format gains a new fixed field

The magic constant 1 + 4 + 4 + 4 + fingerprint_len is correct for the current V6 layout, and the round-trip test catches any drift today. However, if a future version bumps the format and adds a fixed header field without updating this literal, the estimate will silently diverge. A named constant (e.g. HEADER_BYTES_V6) mirroring the existing HEADER_BYTES_V1 pattern would make the coupling explicit and keep the test meaningful as a drift detector.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Suknna · 2026-06-04T03:30:37Z

CI update: the new Linux failure is in @cortexkit/aft-pi test bash tool adapter > foreground running command promotes to background after timeout.

I checked the failing log: the assertion expected two bash_status calls before bash_promote, but CI produced one bash_status followed by bash_promote. With timeout: 1, that is a timing-sensitive test path and is unrelated to this PR's Semantic Index diagnostics changes. Windows and macOS unit jobs passed, and the Rust/Semantic-focused local checks listed in the PR body pass.

I am leaving this PR scoped to Semantic Index observability rather than mixing in an unrelated Pi bash test adjustment.

cubic-dev-ai Bot reviewed Jun 2, 2026

View reviewed changes

feat(search): expose semantic index size diagnostics

f5bf774

Suknna force-pushed the semantic-index-observability branch from 638b4c1 to f5bf774 Compare June 4, 2026 02:15

Suknna changed the title ~~feat(search): expose semantic index build timings~~ feat(search): expose semantic index size diagnostics Jun 4, 2026

greptile-apps Bot reviewed Jun 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(search): expose semantic index size diagnostics#82

feat(search): expose semantic index size diagnostics#82
Suknna wants to merge 1 commit into
cortexkit:mainfrom
Suknna:semantic-index-observability

Suknna commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

greptile-apps Bot Jun 4, 2026

Uh oh!

greptile-apps Bot Jun 4, 2026

Uh oh!

Suknna commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Suknna commented Jun 2, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Verification

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Suknna commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Suknna commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading