Description
What happened?
Manifest preloading iteration stops prematurely when a Zarr store has many nested sibling groups. Only the first few groups are processed, leaving the majority of matching arrays without preloaded manifests.
Environment
- icechunk version: 1.1.10
- Python: 3.12
- OS: Linux
Reproduction
A public S3 bucket is available for testing:
import icechunk as ic

# Public bucket with anonymous access
storage = ic.s3_storage(
    bucket='nexrad-arco',
    prefix='KLOT-RT',
    endpoint_url='https://umn1.osn.mghpcc.org',
    anonymous=True,
    force_path_style=True,
    region='us-east-1',
)

# Config to preload coordinate arrays (vcp_time, azimuth, range)
var_condition = ic.ManifestPreloadCondition.name_matches(r'^(vcp_time|azimuth|range)$')
preload_config = ic.ManifestPreloadConfig(
    max_total_refs=1000,  # Very high limit
    preload_if=var_condition,
)
config = ic.RepositoryConfig(manifest=ic.ManifestConfig(preload=preload_config))
repo = ic.Repository.open(storage, config=config)
session = repo.readonly_session('main')
Run with trace logging:
ICECHUNK_LOG=icechunk=debug python script.py
Store Structure
The store contains radar data with the following structure:
/VCP-34/
├── georeferencing_correction/
│   └── vcp_time                      # coordinate array
├── radar_parameters/
│   └── vcp_time                      # coordinate array
├── sweep_0/
│   ├── azimuth                       # coordinate array
│   ├── range                         # coordinate array
│   ├── vcp_time                      # coordinate array
│   └── [data variables...]
├── sweep_1/
│   ├── azimuth, range, vcp_time      # coordinate arrays
│   └── [data variables...]
├── sweep_2/
│   ├── azimuth, range, vcp_time      # coordinate arrays
│   └── ...
... (continues through sweep_9)
Total arrays matching the preload filter: 33
Expected Behavior
All 33 coordinate arrays (vcp_time, azimuth, range) across all groups should be preloaded because:
- They match the regex filter ^(vcp_time|azimuth|range)$
- The max_total_refs limit is well above the 33 refs required
- Each manifest contains only 1 ref (small 1D coordinate arrays)
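As a sanity check that the filter itself is not what excludes the later groups, the regex can be tested against array paths outside icechunk. This is only a sketch using Python's re module. It assumes name_matches is applied to the array's own name (the last path component), which is not confirmed here, and the data variable name DBZH is a hypothetical placeholder.

```python
import re

# Same pattern as passed to ManifestPreloadCondition.name_matches above
PATTERN = re.compile(r"^(vcp_time|azimuth|range)$")


def leaf_name(path: str) -> str:
    """Return the last component of a Zarr node path."""
    return path.rstrip("/").rsplit("/", 1)[-1]


def matches_preload_filter(path: str) -> bool:
    """True if the array's own name matches the preload regex.

    Assumption: the condition is evaluated against the leaf name,
    not the full path.
    """
    return PATTERN.search(leaf_name(path)) is not None


# Paths taken from the logs in this report, plus one hypothetical data variable
for p in ["/VCP-34/sweep_1/range", "/VCP-34/sweep_5/azimuth", "/VCP-34/sweep_0/DBZH"]:
    print(p, matches_preload_filter(p))
```

Under that assumption, arrays in sweep_2 through sweep_9 satisfy the filter exactly as those in sweep_0 and sweep_1 do, so the regex alone would not explain the early stop.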
Actual Behavior
Only 8 manifests are preloaded, then iteration stops:
Preloading manifest ... for array /VCP-34/georeferencing_correction/vcp_time
Preloading manifest ... for array /VCP-34/radar_parameters/vcp_time
Preloading manifest ... for array /VCP-34/sweep_0/azimuth
Preloading manifest ... for array /VCP-34/sweep_0/range
Preloading manifest ... for array /VCP-34/sweep_0/vcp_time
Preloading manifest ... for array /VCP-34/sweep_1/azimuth
Preloading manifest ... for array /VCP-34/sweep_1/range
Preloading manifest ... for array /VCP-34/sweep_1/vcp_time
[STOPS HERE - no more preloading]
When later accessing arrays in sweep_2 through sweep_9, manifests are fetched on-demand:
# Accessing sweep_5/azimuth triggers manifest download
Downloading manifest ZM0S0EPCPRFWPXZS85SG for /VCP-34/sweep_5/azimuth
# Accessing sweep_9/azimuth triggers manifest download
Downloading manifest FCZW14HH8TR90QMX7KPG for /VCP-34/sweep_9/azimuth
This may slow down loading of datasets with many nested groups, because manifests that were not preloaded are fetched on demand at first access.
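To compare preloaded and on-demand manifests across a longer run, the debug output can be summarized with a small script. The sketch below assumes the log lines have the shape shown above. The "..." in the preload lines is an elision in this report, so the real format may differ and the patterns may need adjusting. The summarize helper is hypothetical and not part of icechunk.

```python
import re

# Patterns modeled on the abbreviated log lines quoted in this report
PRELOAD_RE = re.compile(r"Preloading manifest .*? for array (\S+)")
DOWNLOAD_RE = re.compile(r"Downloading manifest \S+ for (\S+)")


def summarize(log_lines):
    """Split array paths into those preloaded and those fetched on demand."""
    preloaded, on_demand = [], []
    for line in log_lines:
        m = PRELOAD_RE.search(line)
        if m:
            preloaded.append(m.group(1))
            continue
        m = DOWNLOAD_RE.search(line)
        if m:
            on_demand.append(m.group(1))
    return preloaded, on_demand


sample = [
    "Preloading manifest ... for array /VCP-34/sweep_1/range",
    "Preloading manifest ... for array /VCP-34/sweep_1/vcp_time",
    "Downloading manifest ZM0S0EPCPRFWPXZS85SG for /VCP-34/sweep_5/azimuth",
]
preloaded, on_demand = summarize(sample)
print(len(preloaded), "preloaded;", len(on_demand), "fetched on demand")
```

Feeding it the captured output of the reproduction script would give the list of arrays that matched the filter but were only fetched on access.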
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in icechunk.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.