damacy rewrite: metadata prefetch#120
Conversation
Review notesRead through the cache primitive, the prefetcher, and the three fetchers against Blockers1. Handle stability is broken by eviction — The handle stores The straightforward fix is to key the handle on the slot itself, not its index in a compact array — e.g. allocate from a fixed slot table (capacity-sized) and let This bug isn't currently caught because no unit test triggers eviction with a still-live handle to a non-evicted sibling slot. Worth adding before further consumers depend on handle stability. 2. Gate aliasing across batch reuse —
Significant3. Worker polls instead of using the blocking pop the design added —
4.
5. Per-prefetcher errors don't set the batch gate's error bit —
6.
Notes7. 8. Slot/batch tables are linear scans — 9. Test coverage gaps for the contracts that bite later — eviction with live sibling handle (#1), gate reuse race (#2), zero-shard sample (#4), saturated cache returning 10. Bottom line#1 and #2 are correctness issues that should land in this PR or as immediate follow-ups before the planner/scheduler integration goes in on top. The rest can sit until the integration PR exercises them. |
closes #116 ## Approach `store_fs_gds.c`'s `gds_event_query` previously returned 1 unconditionally for any non-sentinel `seq` — completely ignoring whether the `cuFileReadAsync` submitted earlier had actually retired on `stream_h2d`. Callers (the wave-pool scheduler) treat a 1 return as "destination bytes are safe to consume" and transition the slot to `SLOT_READY`. On the GDS path this can hand a wave a `dev_buf` that the device has not yet written to, producing illegal memory accesses downstream in decode kernels gated on cross-stream events that fire ahead of the read draining. This PR makes the query honor what the caller would reasonably expect: returns 1 only after the cuFile read has actually completed on the stream. The mechanism reuses infrastructure already in `gds_submit_dev`: `cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx)` is enqueued after every `cuFileReadAsync`, so the callback runs in stream order *after* the reads drain. A new small `fs_gds_done { flag, claimed, rc }` struct is allocated per submit; the callback sets `flag=1` and drops one ref, `gds_event_query` checks the flag (acquire) and CAS-claims the owner-side ref the first time it observes the flag set. Repeated queries are safe; an unqueried event is reclaimed by the callback alone (no leak). `store_event` gains an opaque `void* impl` — backend-private, NULL for non-GDS stores. `gds_event_wait`, previously a no-op, now actually `cuStreamSynchronize`s and reclaims. ## Key files - `src/store/store_fs_gds.c:222-260` — the `fs_gds_done` refcount protocol. - `src/store/store_fs_gds.c:367-407` — the new `gds_event_query` / `gds_event_wait`. - `tests/test_store_fs_gds.c::test_event_query_reflects_completion` — the contract test. Uses `cuLaunchHostFunc` to park `stream_h2d` behind a host-side barrier, submits a read so cuFile is queued but not retired, asserts the query reports not-ready, then unblocks and asserts it reports ready. Deterministic, not race-dependent. Runs under cuFile compat mode (`CUFILE_FORCE_COMPAT_MODE=true` set in `main` before any cuFile init) so no nvidia-fs is required. ## Test plan - [x] New test fails before the fix (verified during development — query returns 1 while the read is provably queued behind the barrier). - [x] New test passes after the fix. - [x] Existing `test_submit_fail_releases_pins` still passes when it doesn't trip the separate bug below. - [x] Full test suite (25 tests, GDS build) passes. ## Related, not addressed here `test_submit_fail_releases_pins` SEGVs at ~4% on this hardware (filed as #118). Reproduces on this branch's parent commit, so it is not introduced by this PR, but it is a real damacy bug to chase, not upstream noise.
Fixes #118. ## Approach In compat mode, libcufile lazily allocates per-stream state on the first `cuFileReadAsync`, and that lazy init races against itself. The passing test in the same file happens to enqueue a `cuLaunchHostFunc` barrier before the first read, which serializes the stream enough to mask the race. The failing test goes straight into `cuFileReadAsync` on an empty stream and SEGVs ~4% of the time deep inside libcufile. cuFile already exposes the hook for this: `cuFileStreamRegister` "allocates resources needed to support cuFile operations asynchronously for the cuda stream" — i.e. exactly the lazy state that was racing. Calling it eagerly when damacy adopts a stream removes the race window. The matching `cuFileStreamDeregister` is required before `cuStreamDestroy` per the cuFile contract. ## Change In `src/store/store_fs_gds.c`: - dlsym-bind `cuFileStreamRegister` / `cuFileStreamDeregister` as optional symbols (graceful no-op on older libcufile that doesn't ship them). - `store_fs_gds_set_stream`: deregister any previously-set stream, then register the new one. First `cuFileReadAsync` now finds per-stream state already allocated. - `gds_destroy`: deregister after the existing `cuStreamSynchronize`, before the caller's `cuStreamDestroy`. ## Verification - `cmake --build build` clean. - 100× loop of `CUFILE_FORCE_COMPAT_MODE=true ./build/tests/test_store_fs_gds`: 0 failures (was ~4/100). - Full `ctest`: 26/26 pass.
## Approach Close #101 by removing the dead static fallback and replacing the ad-hoc structural constants it derived from with explicit `damacy_tuning` knobs. `chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`) in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32` — the adversarial worst case. But `wave_chunks_eligible` (per-chunk gate, runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the fallback is structurally unreachable. The "perf" framing of the original issue was moot. This PR: - **Turns the implicit gate-vs-sizer contract into an explicit check.** `chunk_substreams_upper_bound` now returns `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL` with a `log_error("gate-vs-sizer contract violated")` at the caller. A future gate regression now fails loudly instead of silently undersizing the fanout SOA. - **Replaces the two compile-time constants** (`DAMACY_MAX_CHUNKS_PER_WAVE`, `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with `damacy_tuning.max_chunks_per_wave` and `damacy_tuning.max_substreams_per_chunk`. The parser, planner, coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread the effective values through their existing param chains. New `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes a runtime field `wave_pool.max_substreams_per_wave` derived once at init. - **Drops the dead substream rename target.** `zsubs` was a contraction that read as zstd-specific; renames to `substreams` everywhere (the noun that matches both BLOSC1 spec language and the nvcomp batched-decode input it actually counts). - **Strips machinery wired only to the unreachable branch:** the `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`), the meta-cache observer setter, the bump sites in `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"` shape (matches main) — the C-only `static_assert` is no longer needed. ## API Two new optional fields on `damacy_tuning` (Python `Config`): - `max_chunks_per_wave: int = 0` — `0` → 512 (current behavior). Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in `d_block_chunk_map`). - `max_substreams_per_chunk: int = 0` — `0` → 32 (current behavior). Parser rejects blosc1 layouts above this with `DAMACY_DECODE`. ## Key file `src/wave/wave_pool.c:355` — `chunk_substreams_upper_bound` (the contract check) and `prepare_decode_caps` (caller). Closes #101.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #120 +/- ##
==========================================
+ Coverage 56.36% 59.77% +3.41%
==========================================
Files 50 60 +10
Lines 6953 8371 +1418
Branches 1238 1435 +197
==========================================
+ Hits 3919 5004 +1085
- Misses 2547 2772 +225
- Partials 487 595 +108
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Review responsesThanks for the careful pass. Walking through each item with the fix or rationale. Blockers — fixed#1 Handle stability ( #2 Gate aliasing across batch reuse ( Significant — addressed#3 Worker polling vs blocking pop ( The deeper "burns the cache mutex on every tick" complaint is structural — #4 Zero-shard sample (already addressed in #5 Prefetcher-origin errors not setting gate.error (already in #6 shard_index_fetch peek pin Notes — deferred#7 advance_watermark walks active under the lock — perf tuning, will profile under the integration PR. |
Summary
prefetch_cache, array metadata fetcher, shard-index fetcher, chunk-layout fetcher, and the prefetcher worker.damacy.corchestrator into lifecycle, push, plan, pop, and scheduler modules.zarr_meta_cache/zarr_shard_cachepath.array_meta,shard_index, andchunk_layoutcaches.Reviewer Notes
src/prefetch/prefetcher.c,src/prefetch/prefetch_cache.c,src/damacy_plan.c,src/damacy_scheduler.c, andsrc/planner/planner.c.pop(), notpush().admit_seq, even if metadata requests complete out of order.store_statnow distinguishesNOT_FOUNDfrom other stat failures so sparse data does not mask IO/permission errors.Tests
cmake --build build --target damacy test_chunk_layout_cache test_prefetcher test_plannerenv UV_CACHE_DIR=/tmp/uv-cache timeout 90s ctest --test-dir build -R 'test_(planner|chunk_layout_cache|prefetcher)$' --output-on-failureAdditional coverage in this branch includes unit/integration tests for
prefetch_cache, array metadata fetch, shard index fetch, chunk layout fetch, prefetcher ordering/readiness/error paths, sample-shard iteration, and planner handle consumption.