Manifest preloading might stop early with nested zarr groups #1464

@aladinor

What happened?

Manifest preloading iteration stops prematurely when a Zarr store has many nested sibling groups. Only the first few groups are processed, leaving the majority of matching arrays without preloaded manifests.

Environment

  • icechunk version: 1.1.10
  • Python: 3.12
  • OS: Linux

Reproduction

A public S3 bucket is available for testing:

import icechunk as ic

# Public bucket with anonymous access
storage = ic.s3_storage(
    bucket='nexrad-arco',
    prefix='KLOT-RT',
    endpoint_url='https://umn1.osn.mghpcc.org',
    anonymous=True,
    force_path_style=True,
    region='us-east-1',
)

# Config to preload coordinate arrays (vcp_time, azimuth, range)
var_condition = ic.ManifestPreloadCondition.name_matches(r'^(vcp_time|azimuth|range)$')
preload_config = ic.ManifestPreloadConfig(
    max_total_refs=1000,  # Very high limit
    preload_if=var_condition,
)
config = ic.RepositoryConfig(manifest=ic.ManifestConfig(preload=preload_config))

repo = ic.Repository.open(storage, config=config)
session = repo.readonly_session('main')

Run with debug logging:

ICECHUNK_LOG=icechunk=debug python script.py

Store Structure

The store contains radar data with the following structure:

/VCP-34/
├── georeferencing_correction/
│   └── vcp_time                       # coordinate array
├── radar_parameters/
│   └── vcp_time                       # coordinate array
├── sweep_0/
│   ├── azimuth                        # coordinate array
│   ├── range                          # coordinate array
│   ├── vcp_time                       # coordinate array
│   └── [data variables...]
├── sweep_1/
│   ├── azimuth, range, vcp_time       # coordinate arrays
│   └── [data variables...]
├── sweep_2/
│   ├── azimuth, range, vcp_time       # coordinate arrays
│   └── ...
... (continues through sweep_9)

Total arrays matching the preload filter: 33

Expected Behavior

All 33 coordinate arrays (vcp_time, azimuth, range) across all groups should be preloaded since:

  1. They match the regex filter ^(vcp_time|azimuth|range)$
  2. max_total_refs=1000 (as set in the reproduction script) is far above the 33 refs required
  3. Each manifest contains only 1 ref (small 1D coordinate arrays)
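The reasoning in the list above can be checked with a few lines of plain Python. This is only a sketch under stated assumptions, not icechunk's implementation: it assumes name_matches tests the regex against the last path component only, takes the 33-array total and the one-ref-per-manifest figure from this report, and uses the max_total_refs value from the reproduction script. The helper names array_name and matches_filter, and the path ending in some_data_variable, are hypothetical.

```python
import re

# Pattern copied from the reproduction script.
PRELOAD_PATTERN = re.compile(r"^(vcp_time|azimuth|range)$")


def array_name(path: str) -> str:
    """Return the last component of a Zarr array path (hypothetical helper)."""
    return path.rstrip("/").rsplit("/", 1)[-1]


def matches_filter(path: str) -> bool:
    # Assumption: the preload condition is tested against the array name only,
    # not against the full path.
    return PRELOAD_PATTERN.search(array_name(path)) is not None


matching_arrays = 33    # total stated in this report
refs_per_manifest = 1   # stated in this report
max_total_refs = 1000   # value from the reproduction script

refs_needed = matching_arrays * refs_per_manifest

print(matches_filter("/VCP-34/sweep_5/azimuth"))             # coordinate array
print(matches_filter("/VCP-34/sweep_5/some_data_variable"))  # hypothetical name
print(refs_needed, "refs needed, budget is", max_total_refs)
```

If these assumptions hold, the ref budget alone should not be what stops preloading after 8 manifests.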

Actual Behavior

Only 8 manifests are preloaded, then iteration stops:

Preloading manifest ... for array /VCP-34/georeferencing_correction/vcp_time
Preloading manifest ... for array /VCP-34/radar_parameters/vcp_time
Preloading manifest ... for array /VCP-34/sweep_0/azimuth
Preloading manifest ... for array /VCP-34/sweep_0/range
Preloading manifest ... for array /VCP-34/sweep_0/vcp_time
Preloading manifest ... for array /VCP-34/sweep_1/azimuth
Preloading manifest ... for array /VCP-34/sweep_1/range
Preloading manifest ... for array /VCP-34/sweep_1/vcp_time
[STOPS HERE - no more preloading]
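To make the gap easier to see, the log excerpt above can be tallied per group. The sketch below assumes the log lines look exactly like the abridged excerpt in this report (the manifest IDs are elided as "..." there and stay elided here); real icechunk log lines carry extra fields, so the regex would need adjusting. The names LINE_RE and preloaded_paths are hypothetical.

```python
import re
from collections import Counter

# Log lines copied from this report; manifest IDs remain elided as "...".
LOG = """\
Preloading manifest ... for array /VCP-34/georeferencing_correction/vcp_time
Preloading manifest ... for array /VCP-34/radar_parameters/vcp_time
Preloading manifest ... for array /VCP-34/sweep_0/azimuth
Preloading manifest ... for array /VCP-34/sweep_0/range
Preloading manifest ... for array /VCP-34/sweep_0/vcp_time
Preloading manifest ... for array /VCP-34/sweep_1/azimuth
Preloading manifest ... for array /VCP-34/sweep_1/range
Preloading manifest ... for array /VCP-34/sweep_1/vcp_time
"""

# Assumed line shape, based only on the excerpt above.
LINE_RE = re.compile(r"^Preloading manifest .* for array (?P<path>/\S+)$")


def preloaded_paths(log_text: str) -> list[str]:
    """Extract the array paths that appear in 'Preloading manifest' lines."""
    paths = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            paths.append(match.group("path"))
    return paths


paths = preloaded_paths(LOG)
# Group each array path by its parent group.
per_group = Counter(path.rsplit("/", 1)[0] for path in paths)

print(len(paths), "manifests preloaded")
for group, count in sorted(per_group.items()):
    print(group, count)

# Sweep groups that never show up in the preload log.
missing = [f"/VCP-34/sweep_{i}" for i in range(10)
           if f"/VCP-34/sweep_{i}" not in per_group]
print("groups with no preloaded manifests:", missing)
```

Run against a full debug log, the same tally would show whether the cut-off always falls after sweep_1 or varies between runs.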

When later accessing arrays in sweep_2 through sweep_9, manifests are fetched on-demand:

# Accessing sweep_5/azimuth triggers manifest download
Downloading manifest ZM0S0EPCPRFWPXZS85SG for /VCP-34/sweep_5/azimuth

# Accessing sweep_9/azimuth triggers manifest download
Downloading manifest FCZW14HH8TR90QMX7KPG for /VCP-34/sweep_9/azimuth

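As a result, the first access to each array beyond the preloaded ones triggers its own manifest fetch, which may slow loading of datasets with many nested groups.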

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in icechunk.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
