damacy rewrite: metadata prefetch by nclack · Pull Request #120 · nclack/damacy

nclack · 2026-05-22T18:49:33Z

Summary

Add the metadata prefetch pipeline: prefetch_cache, array metadata fetcher, shard-index fetcher, chunk-layout fetcher, and the prefetcher worker.
Split the old monolithic damacy.c orchestrator into lifecycle, push, plan, pop, and scheduler modules.
Make the prefetcher the producer for planning, so samples are planned only after metadata/shard/layout prefetch has reached a terminal state.
Rewire the planner to consume prefetch handles directly and delete the legacy synchronous zarr_meta_cache / zarr_shard_cache path.
Preserve sparse-zarr behavior: missing shard files become fill chunks, while IO, permission, malformed-shard, and decode errors still fail the sample.
Update public config/stats, Python bindings, docs, and benchmark schema for the new array_meta, shard_index, and chunk_layout caches.
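The sparse-zarr rule above (a missing shard becomes fill, anything else fails the sample) can be illustrated with a minimal POSIX sketch. The names `stat_result`, `classify_stat`, `chunk_action`, and `plan_shard` are hypothetical and are not damacy's actual `store_stat` API. Treating only `ENOENT` as "not found" is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>

/* Hypothetical result codes; the real store_stat interface may differ. */
enum stat_result { STAT_OK, STAT_NOT_FOUND, STAT_ERROR };

/* Only a definite "does not exist" maps to NOT_FOUND.
 * Permission and IO errors stay errors so sparse data cannot mask them. */
static enum stat_result classify_stat(const char* path, long long* size_out)
{
  struct stat st;
  assert(path);
  if (stat(path, &st) == 0) {
    if (size_out)
      *size_out = (long long)st.st_size;
    return STAT_OK;
  }
  if (errno == ENOENT) /* assumption: only ENOENT counts as missing */
    return STAT_NOT_FOUND;
  return STAT_ERROR; /* EACCES, EIO, ... */
}

enum chunk_action { CHUNK_READ, CHUNK_FILL, CHUNK_FAIL };

/* Map the stat outcome onto what the planner would do with the shard. */
static enum chunk_action plan_shard(const char* path)
{
  switch (classify_stat(path, 0)) {
    case STAT_OK:        return CHUNK_READ;
    case STAT_NOT_FOUND: return CHUNK_FILL; /* sparse zarr: emit fill chunks */
    default:             return CHUNK_FAIL; /* fail the sample */
  }
}
```

Calling `plan_shard` on an existing path yields `CHUNK_READ`, and on a path that does not exist yields `CHUNK_FILL`.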

Reviewer Notes

Start with src/prefetch/prefetcher.c, src/prefetch/prefetch_cache.c, src/damacy_plan.c, src/damacy_scheduler.c, and src/planner/planner.c.
Store-derived validation is now asynchronous. Missing URIs, unsupported source dtypes, per-array rank mismatch, and decode failures surface from pop(), not push().
Prefetch slots preserve push order with admit_seq, even if metadata requests complete out of order.
store_stat now distinguishes NOT_FOUND from other stat failures so sparse data does not mask IO/permission errors.
Chunk-layout probing uses a ready shard touched by the sample rather than assuming the origin shard exists.
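The `admit_seq` ordering note can be sketched as follows. This is a single-threaded illustration under assumed names (`ordered_slots`, `admit`, `pop_in_order`), not the prefetcher's actual data structures: each slot records the sequence number it was admitted with, and the consumer only ever takes the slot whose sequence number is next, however the completions arrive.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS 8

enum slot_state { SLOT_FREE, SLOT_PENDING, SLOT_READY, SLOT_ERROR };

struct slot {
  enum slot_state state;
  uint64_t admit_seq; /* position in push order */
};

struct ordered_slots {
  struct slot slots[NSLOTS];
  uint64_t next_admit; /* seq given to the next admitted sample */
  uint64_t next_pop;   /* seq that must be delivered next */
};

/* Claim a free slot and stamp it with the next admit sequence. */
static int admit(struct ordered_slots* q)
{
  for (int i = 0; i < NSLOTS; ++i) {
    if (q->slots[i].state == SLOT_FREE) {
      q->slots[i].state = SLOT_PENDING;
      q->slots[i].admit_seq = q->next_admit++;
      return i;
    }
  }
  return -1; /* table full */
}

/* Deliver the slot with admit_seq == next_pop once it is terminal.
 * Returns the slot index, or -1 if the head of line is still pending. */
static int pop_in_order(struct ordered_slots* q, enum slot_state* terminal)
{
  for (int i = 0; i < NSLOTS; ++i) {
    struct slot* s = &q->slots[i];
    if (s->state == SLOT_FREE || s->admit_seq != q->next_pop)
      continue;
    if (s->state == SLOT_PENDING)
      return -1; /* later slots may be ready, but they must wait */
    *terminal = s->state;
    s->state = SLOT_FREE;
    q->next_pop++;
    return i;
  }
  return -1;
}
```

If the second admitted sample completes first, `pop_in_order` keeps returning -1 until the first one reaches READY or ERROR.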

Tests

cmake --build build --target damacy test_chunk_layout_cache test_prefetcher test_planner
env UV_CACHE_DIR=/tmp/uv-cache timeout 90s ctest --test-dir build -R 'test_(planner|chunk_layout_cache|prefetcher)$' --output-on-failure

Additional coverage in this branch includes unit/integration tests for prefetch_cache, array metadata fetch, shard index fetch, chunk layout fetch, prefetcher ordering/readiness/error paths, sample-shard iteration, and planner handle consumption.

nclack · 2026-05-22T18:51:33Z

Review notes

Read through the cache primitive, the prefetcher, and the three fetchers against dev/metadata_prefetch.md. The shape matches the design — these are the issues worth surfacing before more code lands on top.

Blockers

1. Handle stability is broken by eviction — src/prefetch/prefetch_cache.c:194-203,304-307,470-482

The handle stores active_idx + 1. active_remove swap-and-pops c->active, moving an unrelated slot into the freed position and rewriting its active_idx. Any outstanding handle for that swapped slot is now stale: resolve_handle indexes into c->active[idx], finds the unrelated slot, and the generation check fails. So a ready handle silently degrades to "NULL" from try_get/query whenever any other entry is evicted. This isn't a sentinel collision — it's a misroute.

The straightforward fix is to key the handle on the slot itself, not its index in a compact array — e.g. allocate from a fixed slot table (capacity-sized) and let slot in the handle be that stable table index. The active array becomes a free-list, not the ID space.

This bug isn't currently caught because no unit test triggers eviction with a still-live handle to a non-evicted sibling slot. Worth adding before further consumers depend on handle stability.
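A minimal sketch of the suggested fix, assuming a capacity-sized slot table with a per-slot generation counter and an index-linked free list. The names (`table`, `handle`, `table_alloc`, `table_evict`, `table_resolve`) are illustrative and do not match prefetch_cache.c. The point is that evicting one entry never moves another, so a sibling's handle keeps resolving.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP 4u

struct slot {
  uint32_t gen;       /* bumped on every eviction; 0 is never valid */
  int live;
  int value;
  uint32_t next_free; /* free-list link, CAP terminates the list */
};

struct table {
  struct slot slots[CAP];
  uint32_t free_head;
};

typedef struct { uint32_t slot; uint32_t gen; } handle;

static void table_init(struct table* t)
{
  for (uint32_t i = 0; i < CAP; ++i)
    t->slots[i] = (struct slot){ .gen = 1, .next_free = i + 1 };
  t->free_head = 0;
}

/* Resolve a handle by stable table index, then check the generation. */
static struct slot* table_resolve(struct table* t, handle h)
{
  if (h.gen == 0 || h.slot >= CAP)
    return NULL;
  struct slot* s = &t->slots[h.slot];
  return (s->live && s->gen == h.gen) ? s : NULL;
}

static handle table_alloc(struct table* t, int value)
{
  handle h = { 0, 0 }; /* gen 0 marks "no handle" */
  if (t->free_head == CAP)
    return h;
  uint32_t i = t->free_head;
  struct slot* s = &t->slots[i];
  t->free_head = s->next_free;
  s->live = 1;
  s->value = value;
  h.slot = i;
  h.gen = s->gen;
  return h;
}

/* Eviction touches only the evicted slot: no swap-and-pop, no index rewrite. */
static void table_evict(struct table* t, handle h)
{
  struct slot* s = table_resolve(t, h);
  if (!s)
    return;
  s->live = 0;
  if (++s->gen == 0)
    s->gen = 1; /* skip the reserved invalid generation on wrap */
  s->next_free = t->free_head;
  t->free_head = h.slot;
}
```

A test along the lines the review asks for allocates three entries, evicts the first, and checks that the second still resolves while the evicted handle does not, even after the freed slot is reused.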

2. Gate aliasing across batch reuse — src/prefetch/prefetcher.c:419-432, header prefetcher.h:74

prefetcher_release_batch flips in_use = 0, freeing the entry for batch_get_or_create_locked to reuse — but any in-flight prefetcher_slot for that batch still holds gate = &p->batches[i].gate. If the slot is reused before the worker drains, prefetch_gate_init zeroes the state and the still-pending fetch's gate_dec_pending lands on the new batch's gate. The header puts the burden on the caller ("ensure no slots still reference"), but the prefetcher exposes no way to verify that, and the design doc's intent (scheduler broadcasts watermark advance on plan success) doesn't naturally imply "no in-flight slots." Either gate it on a per-batch refcount, or have the prefetcher own the lifetime and refuse release while slots are active.

Significant

3. Worker polls instead of using the blocking pop the design added — src/prefetch/prefetcher.c:238-252

lookahead_pop_blocking exists per design step 5, but worker_fn does try_pop then platform_sleep_ns(1_000_000). With the design's target lead times (10k–100k samples), this gives a 1ms admission ceiling and burns the cache mutex on every tick (advance_all walks p->capacity). The state advance loop is the real polling work; the lookahead drain should block on the condvar instead.

4. advance_from_meta fails on zero-shard samples — src/prefetch/prefetcher.c:117-120

CHECK(Bad, n > 0) errors the slot if a sample's AABB intersects no shards. That's a valid configuration (edge AABB, all-padding sample) and the planner today handles it without erroring — the prefetcher shouldn't be stricter than the planner. Treat as "skip stages 2+3, mark READY with zero shards" or document the contract that callers must not submit such samples.

5. Per-prefetcher errors don't set the batch gate's error bit — src/prefetch/prefetcher.c:93-98

fail_slot advances local state but never touches s->gate. For errors that originate in a cache fetch (PREFETCH_STATE_ERROR observed via query), the cache already set the gate bit before the prefetcher saw it — fine. For errors that originate in the prefetcher itself (alloc fail at line 124, n == 0 once #4 is real, invalid handle from a saturated cache at line 132/156/218), the gate bit is never set. Design doc §Readiness gate promises "Error bit set → batch fails fast"; right now a fast-batch-fail consumer can miss these.

6. shard_index_fetch depends on array-meta entry staying pinned, but only peeks — src/prefetch/shard_index.c:106-112

prefetch_cache_peek doesn't widen the ordinal range and doesn't bump LRU recency. If a scheduler advances the watermark between the prefetcher's advance_from_meta and the shard-index worker running, the array-meta slot can be released and (under capacity pressure) evicted — fetch returns DAMACY_INVAL. The current setup avoids this because the slot is still pinned by the current batch's range, but the contract is implicit. Either document "stage-N+1 fetchers must run before watermark advances past stage-N requests" or have the fetcher widen the range explicitly.

Notes

7. prefetch_cache_advance_watermark walks n_active under the cache lock — src/prefetch/prefetch_cache.c:562. With design lead times of 10k–100k entries per cache, this is a real stall on every scheduler advance. Likely fine for the first integration but worth measuring before tuning lead time up.

8. Slot/batch tables are linear scans — prefetcher.c:62-83,201-208. O(capacity) per worker tick plus O(batch_capacity) per admit. Same scale concern as #7; reach for an intrusive free list or a small hash table when this shows up in profiles.

9. Test coverage gaps for the contracts that bite later — eviction with live sibling handle (#1), gate reuse race (#2), zero-shard sample (#4), saturated cache returning PREFETCH_HANDLE_NONE mid-walk in the prefetcher. These are the non-obvious behaviors; the current suite covers the happy paths well.

10. struct damacy_lookahead is fully exposed in the public header — src/lookahead/lookahead.h:17-27. Inconsistent with the opaque-handle pattern used by everything else here (prefetch_cache, prefetcher). Worth opaque-ing before downstream code starts touching la->size directly.

Bottom line

#1 and #2 are correctness issues that should land in this PR or as immediate follow-ups before the planner/scheduler integration goes in on top. The rest can sit until the integration PR exercises them.

closes #116

## Approach

`store_fs_gds.c`'s `gds_event_query` previously returned 1 unconditionally for any non-sentinel `seq`, completely ignoring whether the `cuFileReadAsync` submitted earlier had actually retired on `stream_h2d`. Callers (the wave-pool scheduler) treat a 1 return as "destination bytes are safe to consume" and transition the slot to `SLOT_READY`. On the GDS path this can hand a wave a `dev_buf` that the device has not yet written to, producing illegal memory accesses downstream in decode kernels gated on cross-stream events that fire ahead of the read draining.

This PR makes the query honor what the caller would reasonably expect: it returns 1 only after the cuFile read has actually completed on the stream.

The mechanism reuses infrastructure already in `gds_submit_dev`: `cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx)` is enqueued after every `cuFileReadAsync`, so the callback runs in stream order *after* the reads drain. A new small `fs_gds_done { flag, claimed, rc }` struct is allocated per submit; the callback sets `flag=1` and drops one ref, `gds_event_query` checks the flag (acquire) and CAS-claims the owner-side ref the first time it observes the flag set. Repeated queries are safe; an unqueried event is reclaimed by the callback alone (no leak).

`store_event` gains an opaque `void* impl` (backend-private, NULL for non-GDS stores). `gds_event_wait`, previously a no-op, now actually `cuStreamSynchronize`s and reclaims.

## Key files

- `src/store/store_fs_gds.c:222-260`: the `fs_gds_done` refcount protocol.
- `src/store/store_fs_gds.c:367-407`: the new `gds_event_query` / `gds_event_wait`.
- `tests/test_store_fs_gds.c::test_event_query_reflects_completion`: the contract test. Uses `cuLaunchHostFunc` to park `stream_h2d` behind a host-side barrier, submits a read so cuFile is queued but not retired, asserts the query reports not-ready, then unblocks and asserts it reports ready. Deterministic, not race-dependent. Runs under cuFile compat mode (`CUFILE_FORCE_COMPAT_MODE=true` set in `main` before any cuFile init) so no nvidia-fs is required.

## Test plan

- [x] New test fails before the fix (verified during development: query returns 1 while the read is provably queued behind the barrier).
- [x] New test passes after the fix.
- [x] Existing `test_submit_fail_releases_pins` still passes when it doesn't trip the separate bug below.
- [x] Full test suite (25 tests, GDS build) passes.

## Related, not addressed here

`test_submit_fail_releases_pins` SEGVs at ~4% on this hardware (filed as #118). Reproduces on this branch's parent commit, so it is not introduced by this PR, but it is a real damacy bug to chase, not upstream noise.
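The `flag` / `claimed` / `rc` protocol described above can be approximated on the host with C11 atomics. This is a sketch under stated assumptions, not the code in `store_fs_gds.c`: `done_callback` stands in for the host function that CUDA would run in stream order, `rc` is assumed to start at 2 (one ref for the callback, one for the owner), the `event` struct and `g_frees` counter are hypothetical, and the path by which a never-queried event is reclaimed is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct done {
  atomic_int flag;    /* set by the completion callback */
  atomic_int claimed; /* owner-side ref has been taken */
  atomic_int rc;      /* assumption: starts at 2 (callback + owner) */
};

static int g_frees; /* sketch-only counter so reclamation is observable */

static struct done* done_new(void)
{
  struct done* d = malloc(sizeof *d);
  if (!d)
    return NULL;
  atomic_init(&d->flag, 0);
  atomic_init(&d->claimed, 0);
  atomic_init(&d->rc, 2);
  return d;
}

static void done_unref(struct done* d)
{
  /* The last ref out frees the record. */
  if (atomic_fetch_sub_explicit(&d->rc, 1, memory_order_acq_rel) == 1) {
    free(d);
    ++g_frees;
  }
}

/* Stand-in for the host callback enqueued after the async reads. */
static void done_callback(void* ctx)
{
  struct done* d = ctx;
  atomic_store_explicit(&d->flag, 1, memory_order_release);
  done_unref(d);
}

struct event { struct done* impl; }; /* impl == NULL: nothing to wait on */

/* Returns 1 only once the callback has run; safe to call repeatedly. */
static int event_query(struct event* ev)
{
  struct done* d = ev->impl;
  if (!d)
    return 1;
  if (!atomic_load_explicit(&d->flag, memory_order_acquire))
    return 0;
  int expected = 0;
  if (atomic_compare_exchange_strong(&d->claimed, &expected, 1)) {
    ev->impl = NULL; /* later queries must not touch the freed record */
    done_unref(d);
  }
  return 1;
}
```

The acquire load pairs with the callback's release store, so a caller that sees 1 also sees everything that happened before the callback set the flag.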

Fixes #118.

## Approach

In compat mode, libcufile lazily allocates per-stream state on the first `cuFileReadAsync`, and that lazy init races against itself. The passing test in the same file happens to enqueue a `cuLaunchHostFunc` barrier before the first read, which serializes the stream enough to mask the race. The failing test goes straight into `cuFileReadAsync` on an empty stream and SEGVs ~4% of the time deep inside libcufile.

cuFile already exposes the hook for this: `cuFileStreamRegister` "allocates resources needed to support cuFile operations asynchronously for the cuda stream", i.e. exactly the lazy state that was racing. Calling it eagerly when damacy adopts a stream removes the race window. The matching `cuFileStreamDeregister` is required before `cuStreamDestroy` per the cuFile contract.

## Change

In `src/store/store_fs_gds.c`:

- dlsym-bind `cuFileStreamRegister` / `cuFileStreamDeregister` as optional symbols (graceful no-op on older libcufile that doesn't ship them).
- `store_fs_gds_set_stream`: deregister any previously-set stream, then register the new one. First `cuFileReadAsync` now finds per-stream state already allocated.
- `gds_destroy`: deregister after the existing `cuStreamSynchronize`, before the caller's `cuStreamDestroy`.

## Verification

- `cmake --build build` clean.
- 100× loop of `CUFILE_FORCE_COMPAT_MODE=true ./build/tests/test_store_fs_gds`: 0 failures (was ~4/100).
- Full `ctest`: 26/26 pass.

## Approach

Close #101 by removing the dead static fallback and replacing the ad-hoc structural constants it derived from with explicit `damacy_tuning` knobs.

`chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`) in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32`, the adversarial worst case. But `wave_chunks_eligible` (per-chunk gate, runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the fallback is structurally unreachable. The "perf" framing of the original issue was moot.

This PR:

- **Turns the implicit gate-vs-sizer contract into an explicit check.** `chunk_substreams_upper_bound` now returns `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL` with a `log_error("gate-vs-sizer contract violated")` at the caller. A future gate regression now fails loudly instead of silently undersizing the fanout SOA.
- **Replaces the two compile-time constants** (`DAMACY_MAX_CHUNKS_PER_WAVE`, `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with `damacy_tuning.max_chunks_per_wave` and `damacy_tuning.max_substreams_per_chunk`. The parser, planner, coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread the effective values through their existing param chains. New `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes a runtime field `wave_pool.max_substreams_per_wave` derived once at init.
- **Drops the dead substream rename target.** `zsubs` was a contraction that read as zstd-specific; renames to `substreams` everywhere (the noun that matches both BLOSC1 spec language and the nvcomp batched-decode input it actually counts).
- **Strips machinery wired only to the unreachable branch:** the `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`), the meta-cache observer setter, the bump sites in `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"` shape (matches main); the C-only `static_assert` is no longer needed.

## API

Two new optional fields on `damacy_tuning` (Python `Config`):

- `max_chunks_per_wave: int = 0`: `0` → 512 (current behavior). Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in `d_block_chunk_map`).
- `max_substreams_per_chunk: int = 0`: `0` → 32 (current behavior). Parser rejects blosc1 layouts above this with `DAMACY_DECODE`.

## Key file

`src/wave/wave_pool.c:355`: `chunk_substreams_upper_bound` (the contract check) and `prepare_decode_caps` (caller).

Closes #101.
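The "`0` resolves to the default, then clamp" rule for the two tuning fields can be sketched like this. The struct and function names (`tuning`, `tuning_resolve`, `layout_accepts`) and the `SKETCH_*` macros are placeholders rather than the actual `damacy_tuning` definitions; only the defaults (512 and 32) and the `0xFFFF` clamp come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the PR text; macro names are placeholders. */
#define SKETCH_DEFAULT_MAX_CHUNKS_PER_WAVE      512u
#define SKETCH_DEFAULT_MAX_SUBSTREAMS_PER_CHUNK 32u
#define SKETCH_CHUNK_IDX_MAX                    0xFFFFu /* 16-bit chunk_idx */

struct tuning {
  uint32_t max_chunks_per_wave;      /* 0 means "use the default" */
  uint32_t max_substreams_per_chunk; /* 0 means "use the default" */
};

/* Resolve user-facing fields into effective values once, at init. */
static struct tuning tuning_resolve(struct tuning in)
{
  struct tuning out = in;
  if (out.max_chunks_per_wave == 0)
    out.max_chunks_per_wave = SKETCH_DEFAULT_MAX_CHUNKS_PER_WAVE;
  if (out.max_chunks_per_wave > SKETCH_CHUNK_IDX_MAX)
    out.max_chunks_per_wave = SKETCH_CHUNK_IDX_MAX;
  if (out.max_substreams_per_chunk == 0)
    out.max_substreams_per_chunk = SKETCH_DEFAULT_MAX_SUBSTREAMS_PER_CHUNK;
  return out;
}

/* Parser-side check: a layout with more blocks than the knob is rejected. */
static int layout_accepts(const struct tuning* effective, uint32_t nblocks)
{
  assert(effective && effective->max_substreams_per_chunk > 0);
  return nblocks <= effective->max_substreams_per_chunk;
}
```

Resolving once and passing the effective struct down matches the description of threading the values through existing parameter chains instead of reading constants at each site.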

codecov · 2026-05-22T21:59:09Z

Codecov Report

❌ Patch coverage is 74.61175% with 752 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.77%. Comparing base (5672f7e) to head (a0f780b).

Files with missing lines	Patch %	Lines
src/prefetch/prefetcher.c	78.46%	63 Missing and 24 partials ⚠️
src/damacy_pop.c	59.35%	49 Missing and 27 partials ⚠️
src/wave/wave_pool.c	81.26%	45 Missing and 29 partials ⚠️
src/prefetch/prefetch_cache.c	77.85%	47 Missing and 21 partials ⚠️
src/prefetch/shard_index.c	48.41%	46 Missing and 19 partials ⚠️
src/damacy_plan.c	75.96%	31 Missing and 25 partials ⚠️
src/planner/planner.c	60.15%	31 Missing and 22 partials ⚠️
src/damacy_lifecycle.c	81.50%	31 Missing and 18 partials ⚠️
bench/main.c	0.00%	43 Missing ⚠️
src/prefetch/chunk_layout.c	70.63%	24 Missing and 13 partials ⚠️
... and 15 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #120      +/-   ##
==========================================
+ Coverage   56.36%   59.77%   +3.41%     
==========================================
  Files          50       60      +10     
  Lines        6953     8371    +1418     
  Branches     1238     1435     +197     
==========================================
+ Hits         3919     5004    +1085     
- Misses       2547     2772     +225     
- Partials      487      595     +108

Flag	Coverage Δ
unittests	`59.77% <74.61%> (+3.41%)`	⬆️


Files with missing lines	Coverage Δ
src/damacy_config.c	`67.39% <100.00%> (+4.17%)`	⬆️
src/damacy_stats.c	`100.00% <100.00%> (ø)`
src/nvtx/nvtx.c	`61.76% <ø> (ø)`
src/wave/input_slot.c	`69.66% <100.00%> (ø)`
src/wave/wave_budget.c	`71.33% <100.00%> (+0.55%)`	⬆️
src/wave/wave.c	`89.53% <95.83%> (+2.96%)`	⬆️
src/platform/platform_io.posix.c	`53.93% <60.00%> (-0.62%)`	⬇️
src/zarr/sample_shard_iterator.c	`95.12% <95.12%> (ø)`
src/store/store_fs.c	`72.41% <72.72%> (-2.33%)`	⬇️
src/store/store.c	`44.92% <55.55%> (+1.64%)`	⬆️
... and 20 more

... and 7 files with indirect coverage changes


nclack · 2026-05-22T21:59:44Z

Review responses

Thanks for the careful pass. Walking through each item with the fix or rationale.

Blockers — fixed

#1 Handle stability (edbc00d prefetch_cache: stable slot indices)
Slot table is now capacity-sized and indexed by a stable slot id. active becomes a free-list rather than the id space, so active_remove's swap-and-pop no longer invalidates handles to sibling slots. New test test_handle_stable_across_eviction covers the exact case the reviewer described.

#2 Gate aliasing across batch reuse (da117cb prefetcher: batch refcount + edges, 064ad82 prefetcher: hold batch until gate drains)
Batch entries now carry a refcount + a release_pending flag. release_batch defers the in_use=0 flip until refcount and gate.pending both hit zero, so the gate cannot be re-initialized while in-flight slots still reference it. New tests cover release-before-pop, distinct-gates-per-batch, and admit-fail-rollback.
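The deferred release can be sketched as below. This is a single-threaded illustration that assumes the prefetcher lock is held around every call. Field and function names (`batch`, `batch_acquire`, `batch_slot_admit`, and so on) are hypothetical and only mirror the refcount / `release_pending` / `gate.pending` conditions described in the response.

```c
#include <assert.h>

struct gate { int pending; int error; };

struct batch {
  int in_use;
  int release_pending; /* release was requested but could not complete yet */
  int refcount;        /* slots that still point at this batch's gate */
  struct gate gate;
};

/* Recycle only when nobody can still touch the gate. */
static void batch_try_recycle(struct batch* b)
{
  if (b->release_pending && b->refcount == 0 && b->gate.pending == 0) {
    b->release_pending = 0;
    b->in_use = 0;
  }
}

/* Returns 1 if the entry was free and is now claimed (gate re-initialized). */
static int batch_acquire(struct batch* b)
{
  if (b->in_use)
    return 0;
  b->in_use = 1;
  b->release_pending = 0;
  b->refcount = 0;
  b->gate = (struct gate){ 0, 0 };
  return 1;
}

/* A slot joins the batch: it holds a ref and one pending fetch on the gate. */
static void batch_slot_admit(struct batch* b)
{
  assert(b->in_use);
  b->refcount++;
  b->gate.pending++;
}

static void batch_fetch_done(struct batch* b)
{
  assert(b->gate.pending > 0);
  b->gate.pending--;
  batch_try_recycle(b);
}

static void batch_slot_drop(struct batch* b)
{
  assert(b->refcount > 0);
  b->refcount--;
  batch_try_recycle(b);
}

/* Caller is done with the batch; the actual flip may happen later. */
static void batch_release(struct batch* b)
{
  b->release_pending = 1;
  batch_try_recycle(b);
}
```

With this shape, a release issued while a fetch is still in flight leaves `in_use` set, so a later acquire cannot re-initialize the gate underneath the pending decrement.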

Significant — addressed

#3 Worker polling vs blocking pop (9518104 lookahead: timed pop, worker drops sleep)
Added lookahead_pop_blocking_timeout. Worker now uses pure pop_blocking when there's a free slot and no in-flight work; switches to pop_blocking_timeout(1ms) when in-flight work needs periodic state advance. Kills the 1ms admission ceiling — a sample arriving 100us into the wait now wakes the worker via condvar instead of waiting out the sleep.

The deeper "burns the cache mutex on every tick" complaint is structural — advance_all walks p->capacity because there's no cache→prefetcher notification path. That refactor (cache callbacks) is left for the integration PR rather than dropped here.
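For the timed pop in #3, a minimal pthread sketch of a bounded queue with a blocking pop that takes a timeout might look like the following. The queue type and function names are made up for illustration and are not the `lookahead` API. A negative timeout is used here to mean "block until an item arrives", which is an assumption of this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

#define QCAP 16u

struct queue {
  pthread_mutex_t mu;
  pthread_cond_t nonempty;
  int items[QCAP];
  unsigned head, count;
};

static void queue_init(struct queue* q)
{
  pthread_mutex_init(&q->mu, NULL);
  pthread_cond_init(&q->nonempty, NULL);
  q->head = q->count = 0;
}

/* Returns 1 on success, 0 if the queue is full. Wakes one waiter. */
static int queue_push(struct queue* q, int v)
{
  int ok = 0;
  pthread_mutex_lock(&q->mu);
  if (q->count < QCAP) {
    q->items[(q->head + q->count) % QCAP] = v;
    q->count++;
    ok = 1;
    pthread_cond_signal(&q->nonempty);
  }
  pthread_mutex_unlock(&q->mu);
  return ok;
}

/* Returns 1 and writes *out when an item arrives, 0 on timeout or error.
 * timeout_ns < 0 blocks without a deadline. */
static int queue_pop_timeout(struct queue* q, int* out, int64_t timeout_ns)
{
  struct timespec deadline = { 0, 0 };
  if (timeout_ns >= 0) {
    /* Default condvars measure deadlines against CLOCK_REALTIME. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += (time_t)(timeout_ns / 1000000000);
    deadline.tv_nsec += (long)(timeout_ns % 1000000000);
    if (deadline.tv_nsec >= 1000000000L) {
      deadline.tv_sec += 1;
      deadline.tv_nsec -= 1000000000L;
    }
  }
  pthread_mutex_lock(&q->mu);
  int rc = 0;
  while (q->count == 0 && rc == 0) { /* loop guards against spurious wakeups */
    rc = (timeout_ns < 0)
           ? pthread_cond_wait(&q->nonempty, &q->mu)
           : pthread_cond_timedwait(&q->nonempty, &q->mu, &deadline);
  }
  int got = q->count > 0;
  if (got) {
    *out = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
  }
  pthread_mutex_unlock(&q->mu);
  return got;
}
```

A worker in the style described would call this with a negative timeout when it has nothing in flight and with 1000000 (1 ms) when it needs to come back periodically to advance slot state. In both cases a push wakes it immediately.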

#4 Zero-shard sample (already addressed in da117cb)
advance_from_meta now branches on n == 0: skips stages 2+3 by going straight to PREFETCHER_PENDING_CHUNK_LAYOUT with n_shards = 0. The chunk-layout request still goes through since it's per-uri, not per-shard; a downstream consumer that doesn't need layout for empty samples can ignore the handle.

#5 Prefetcher-origin errors not setting gate.error (already in c7bba1b)
fail_slot now calls prefetch_gate_set_error(s->gate) before transitioning to PREFETCHER_ERROR. Catches the alloc-fail / saturation paths the reviewer flagged. test_batch_gate_error_on_failed_sample exercises this.
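A rough sketch of the ordering described for #5, using C11 atomics. Packing the pending count and the error bit into one word is an assumption made for illustration, and the names (`gate`, `pslot`, `fail_slot`, `P_ERROR`) are stand-ins for the real prefetcher types. The point shown is that the gate's error bit is published before the slot's local state changes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define GATE_ERROR_BIT (UINT32_C(1) << 31)

/* Assumed layout: low 31 bits count pending fetches, top bit is the error. */
struct gate { _Atomic uint32_t state; };

static void gate_init(struct gate* g) { atomic_init(&g->state, 0); }

static void gate_add_pending(struct gate* g)
{
  atomic_fetch_add_explicit(&g->state, 1, memory_order_relaxed);
}

static void gate_dec_pending(struct gate* g)
{
  atomic_fetch_sub_explicit(&g->state, 1, memory_order_release);
}

static void gate_set_error(struct gate* g)
{
  atomic_fetch_or_explicit(&g->state, GATE_ERROR_BIT, memory_order_release);
}

static int gate_failed(struct gate* g)
{
  return (atomic_load_explicit(&g->state, memory_order_acquire) & GATE_ERROR_BIT) != 0;
}

static uint32_t gate_pending(struct gate* g)
{
  return atomic_load_explicit(&g->state, memory_order_acquire) & ~GATE_ERROR_BIT;
}

/* Ready means: nothing pending and no error recorded. */
static int gate_ready(struct gate* g)
{
  return atomic_load_explicit(&g->state, memory_order_acquire) == 0;
}

enum pstate { P_PENDING, P_READY, P_ERROR };

struct pslot {
  enum pstate state;
  struct gate* gate;
};

/* Publish the batch-level error first, then move the slot to its error state. */
static void fail_slot(struct pslot* s)
{
  assert(s && s->gate);
  gate_set_error(s->gate);
  s->state = P_ERROR;
}
```

A consumer that only watches the gate then sees the failure regardless of whether the error came from a cache fetch or from the prefetcher itself.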

#6 shard_index_fetch peek pin
Documented the contract inline. The structural fix (widen ordinal range explicitly) would require the cache to expose a pin/unpin API; left for the integration PR. The current invariant (stage-N+1 fetchers run before watermark advances past stage-N requests) is naturally upheld by the prefetcher's submission ordering.

Notes — deferred

#7 advance_watermark walks active under the lock — perf tuning, will profile under the integration PR.
#8 Linear scans for slot/batch tables — same; will reach for free-list / hash table when profiling shows it.
#9 Test gaps — added the eviction-with-live-sibling-handle test (#1). Saturated-cache test is harder to write deterministically; deferred.
#10 Opaque damacy_lookahead — agree it's inconsistent. Will opaque-ify in a follow-up before the planner/scheduler integration lands.

nclack added 9 commits May 22, 2026 11:32

add prefetch_cache primitive

249cf0f

add zarr/sample_shard_iterator

2e366a0

planner: use shard iterator, beg/end names

2327a07

lookahead: batch_id + blocking pop

d095fba

add prefetch/array_meta fetcher

b525197

add prefetch/shard_index fetcher

4cd27b7

add prefetch/chunk_layout fetcher

24e855f

add prefetch/prefetcher orchestrator

8abdc2a

add dev/metadata_prefetch.md

fcf5cb4

nclack added 13 commits May 22, 2026 13:41

prefetch_cache: stable slot indices

edbc00d

prefetcher: batch refcount + edges

da117cb

prefetcher: rollback new batch on admit fail

3af0eb6

prefetch_cache: doc executor post contract

ac70985

prefetcher: hold batch until gate drains

064ad82

prefetcher: defer n_shards until requests done

3f646ae

prefetcher: doc destroy precondition

7bba37d

prefetcher: lookup helper, fused slot scan

c7bba1b

lookahead: timed pop, worker drops sleep

9518104

chunk_layout: thread max_substreams param

0b773ab

nclack added 3 commits May 22, 2026 15:43

prefetcher: batch-aware pop_ready

dea0137

prefetcher: err_code on ready and slot

7faaa83

prefetcher: batch readiness queries

23d95f0

nclack mentioned this pull request May 23, 2026

damacy rewrite: prefetcher producer + planner handles #121

Merged

prefetch_cache: pin PENDING + release waiters

d20f381

nclack added 30 commits May 30, 2026 17:28

Clarify input queueing

de13d3c

Factor parse prep

c9e407f

Factor assemble queueing

ea7d4c9

Factor decode queueing

91b37af

Clarify wave retirement

f614c21

Trim wave pool comments

7ff0787

Share input transfer envelope

c816670

Trim wave header comments

b836902

Use store submit ops directly

4adf6de

Share GDS input marker

4500b6e

Inline input queue state

ef729e4

Drop wave pool inline hints

5fa5086

Clarify input transfer metric

014fb73

Rename input transfer stat

ea31caa

Rename host staging input mode

e3fb887

Factor wave geometry resolve

cd97d71

Keep geometry failure detail

153c6aa

Release failed plan slots

5c6a1be

Clarify batch release reuse

4eaf8be

Rename deferred reuse flag

ab9474b

Split wave peel header

a1e3349

Clarify peel ticket names

274212b

Move wave peel code

26633c3

Rename wave input dispatch

69c9489

Clarify input reserve result

112b9ed

Return input submit status

bb4b7d2

Return store submit status

e87cd21

Pass input submit result

7ccc3bd

Rename Python input transfer stats

b49c4e5

Bound Python pending drain

a0f780b

nclack wants to merge 90 commits into main from worktree-prefetch