feat(search): expose semantic index size diagnostics#82
Conversation
638b4c1 to
f5bf774
Compare
| let estimate = index.memory_estimate(); | ||
| slog_info!( | ||
| "loaded semantic index from disk: {} entries", | ||
| index.entries.len() | ||
| "loaded semantic index from disk: {} entries, {} cache bytes, {} payload bytes (read_ms={}, decode_ms={})", | ||
| index.entries.len(), | ||
| file_len, | ||
| estimate.estimated_payload_bytes, | ||
| read_ms, | ||
| decode_ms, | ||
| ); | ||
| Some(index) |
There was a problem hiding this comment.
The progress-stats sent in the cache-hit path use
serialized_size_estimate() for cache_bytes, but the load-path log message uses the actual file_len from fs::metadata. For a freshly deserialized index these values should agree (and the round-trip test confirms it for the happy path), but any skipped-path edge case (entry whose file is outside project_root) would cause serialized_size_estimate() to under-count while file_len remains the ground truth. Passing the already-available file_len keeps the two sources consistent without extra I/O.
| let estimate = index.memory_estimate(); | |
| slog_info!( | |
| "loaded semantic index from disk: {} entries", | |
| index.entries.len() | |
| "loaded semantic index from disk: {} entries, {} cache bytes, {} payload bytes (read_ms={}, decode_ms={})", | |
| index.entries.len(), | |
| file_len, | |
| estimate.estimated_payload_bytes, | |
| read_ms, | |
| decode_ms, | |
| ); | |
| Some(index) | |
| let estimate = index.memory_estimate(); | |
| slog_info!( | |
| "loaded semantic index from disk: {} entries, {} cache bytes, {} payload bytes (read_ms={}, decode_ms={})", | |
| index.entries.len(), | |
| file_len, | |
| estimate.estimated_payload_bytes, | |
| read_ms, | |
| decode_ms, | |
| ); | |
| Some((index, file_len as u64)) |
| pub fn serialized_size_estimate(&self) -> u64 { | ||
| let fingerprint_len = self | ||
| .fingerprint | ||
| .as_ref() | ||
| .map(|fingerprint| fingerprint.as_string().len() as u64) | ||
| .unwrap_or(0); | ||
| let mut total = 1 + 4 + 4 + 4 + fingerprint_len; | ||
|
|
||
| total += 4; | ||
| for (path, _mtime) in &self.file_mtimes { | ||
| let Some(relative) = cache_relative_path(&self.project_root, path) else { | ||
| continue; | ||
| }; | ||
| let path_len = relative.to_string_lossy().len() as u64; | ||
| total += 4 + path_len + 8 + 4 + 8 + 32; | ||
| } | ||
|
|
||
| for entry in &self.entries { | ||
| let Some(relative) = cache_relative_path(&self.project_root, &entry.chunk.file) else { | ||
| continue; | ||
| }; | ||
| total += 4 + relative.to_string_lossy().len() as u64; | ||
| total += 4 + entry.chunk.name.len() as u64; | ||
| total += 1; | ||
| total += 4 + 4 + 1; | ||
| total += 4 + entry.chunk.snippet.len() as u64; | ||
| total += 4 + entry.chunk.embed_text.len() as u64; | ||
| total += (entry.vector.len() * std::mem::size_of::<f32>()) as u64; | ||
| } | ||
|
|
||
| total | ||
| } |
There was a problem hiding this comment.
serialized_size_estimate header literal silently wrong if format gains a new fixed field
The magic constant 1 + 4 + 4 + 4 + fingerprint_len is correct for the current V6 layout, and the round-trip test catches any drift today. However, if a future version bumps the format and adds a fixed header field without updating this literal, the estimate will silently diverge. A named constant (e.g. HEADER_BYTES_V6) mirroring the existing HEADER_BYTES_V1 pattern would make the coupling explicit and keep the test meaningful as a drift detector.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
|
CI update: the new Linux failure is in I checked the failing log: the assertion expected two I am leaving this PR scoped to Semantic Index observability rather than mixing in an unrelated Pi bash test adjustment. |
Summary
embed_text_bytesandsnippet_bytesin the semantic collect log.Context
Related issue: #81
The issue discussion notes that v0.35.0 already added phase timing logs for semantic collect/embed and search-index cold build. This revision narrows the PR to the remaining observability gap: byte estimates plus cache/clone cost visibility, while reusing the existing progress/status/logging infrastructure.
Verification
cargo fmt -p agent-file-tools -- --checkcargo check -p agent-file-toolscargo test -p agent-file-tools status_exposes_semantic_progress_statscargo test -p agent-file-tools memory_estimate_counts_scaling_payload_bytescargo test -p agent-file-tools test_serialization_roundtripcargo test -p agent-file-tools natural_language_auto_falls_back_to_grep_while_semantic_buildsgit diff --checkGreptile Summary
This PR wires up the remaining semantic-index observability gap after v0.35.2: byte-level payload estimates (
memory_estimate,serialized_size_estimate), cache read/write timing diagnostics in the persist and load paths, and clone-cost logs for both the refresh-worker clone and the corpus-refresh clone. The newSemanticProgressStatsstruct is propagated through the event/status system so consumers can read all counters from the status response.semantic_index.rs: AddsSemanticMemoryEstimateandSemanticProgressStatsstructs,memory_estimate()andserialized_size_estimate()methods, enriched log messages for collect/persist/load, and serialize/write/read/decode timing.configure.rs: Adds three helper functions for buildingSemanticProgressStatsfrom estimates, emits clone-cost logs and astarting_refresh_workerprogress event, and propagatesstatsto everySemanticIndexEvent::Progresssend site.status.rs/context.rs: Exposesstatsin the JSON status payload; new teststatus_exposes_semantic_progress_statscovers the end-to-end field mapping.Confidence Score: 4/5
Safe to merge; all changes are additive diagnostics with no mutations to search correctness or index persistence logic.
The core indexing, embedding, and cache paths are unchanged. The new byte-estimate methods are read-only and verified by unit tests; the timing instrumentation is fire-and-forget. The only non-trivial concern is the
starting_refresh_workerprogress event, which could momentarily expose aBuildingstatus just beforeReadyarrives, but the FIFO channel ordering limits this to a very short window and the index itself is already fully built at that point.The
starting_refresh_workerevent inconfigure.rsand thecache_bytessource discrepancy insemantic_index.rsload path are worth a second look before merge.Important Files Changed
Sequence Diagram
sequenceDiagram participant BC as Build Closure participant TX as tx_progress / tx participant Drain as drain_semantic_index_events participant Status as SemanticIndexStatus BC->>BC: SemanticIndex::build_with_progress() BC->>TX: "Progress{stage=collecting, stats=None}" TX->>Drain: process Progress Drain->>Status: "Building{stage=collecting}" BC->>BC: Load/build index complete BC->>TX: "Progress{stage=persisting_index, stats=Some(cache+payload bytes)}" TX->>Drain: process Progress Drain->>Status: "Building{stage=persisting_index, stats=SemanticProgressStats}" BC->>BC: index.clone() for refresh worker BC->>BC: log clone timing BC->>TX: "Progress{stage=starting_refresh_worker, stats=Some(clone_estimated_bytes)}" TX->>Drain: process Progress Drain->>Status: "Building{stage=starting_refresh_worker, stats=SemanticProgressStats}" BC->>TX: Ready(index) TX->>Drain: process Ready Drain->>Status: "Ready{...}" Note over Status: status JSON now includes stats fieldComments Outside Diff (1)
crates/aft/src/commands/configure.rs, line 213-219 (link)starting_refresh_workerstage may briefly surface asBuildingafter the index is readyA
Progress { stage: "starting_refresh_worker" }event is enqueued just before theReadyevent. Both go through the same channel so the drain processes them in FIFO order, but any status consumer that reads between these two drain iterations will seeSemanticIndexStatus::Buildingwith fully-populated stats even though the index is done. Stage values outside the expected build stages could confuse tooling that inspectsstageto decide whether to wait. Consider gating this event behind a debug/verbose flag or removing it if the clone timing in theslog_info!above is sufficient.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (1): Last reviewed commit: "feat(search): expose semantic index size..." | Re-trigger Greptile