Description
Environment
- OS: Ubuntu 22.04.2 LTS
- Hardware (GPU, or instance type): Not relevant
- mosaicml-streaming version: 0.13.0
- Dataset: ~6 TB, generated by hundreds of workers (followed the recommended parallel data conversion strategy).
To reproduce
Steps to reproduce the behavior:
- Prepare a streaming dataset with many workers such that shards and index.json files are nested multiple levels deep, e.g.

      bucket-name/subset1/worker1/index.json
      bucket-name/subset1/worker200/index.json
      bucket-name/subset2/worker1/index.json
      bucket-name/subset2/worker200/index.json
      ...

- Run merge_index on the dataset root directory (in this case a GCS URI like gs://bucket-name/).
- Inspect the generated top-level index.json: shard basenames do not contain the full relative hierarchical path, e.g.

      { "raw_data": {"basename": "worker1/shard.00000.mds"}, ... }

  instead of the correct

      { "raw_data": {"basename": "subset1/worker1/shard.00000.mds"}, ... }

- Attempt to stream the dataset using the merged index.json. Shard download attempts fail with 404s because intermediate directories are missing in referenced paths.
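To confirm the symptom without streaming the whole dataset, something like the following could be used. It is only a sketch: `missing_shards` is a hypothetical helper (not part of mosaicml-streaming), it assumes the merged index.json has a top-level `shards` list whose entries carry `raw_data` / `zip_data` objects with a `basename` field, and it assumes you already have the set of object paths relative to the dataset root (for example from a bucket listing).

```python
def missing_shards(merged_index: dict, object_paths: set) -> list:
    """Return basenames referenced by the merged index that do not exist.

    `merged_index` is the parsed top-level index.json (assumed layout).
    `object_paths` holds object paths relative to the dataset root.
    """
    missing = []
    for shard in merged_index.get("shards", []):
        for key in ("raw_data", "zip_data"):
            info = shard.get(key)  # zip_data may be absent or null
            if info and info["basename"] not in object_paths:
                missing.append(info["basename"])
    return missing


# Illustrative data mirroring the layout in this report.
merged_index = {
    "shards": [
        {"raw_data": {"basename": "worker1/shard.00000.mds"}, "zip_data": None},
    ]
}
object_paths = {
    "subset1/worker1/shard.00000.mds",
    "subset2/worker1/shard.00000.mds",
}
broken = missing_shards(merged_index, object_paths)
print(broken)
```

Any basename reported here is one that a streaming client would request at the wrong path.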
Expected behavior
All shard basenames in the merged top-level index.json should preserve the full path relative to the dataset root, so that streaming clients can successfully locate each shard with its correct GCS path.
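The expected rewriting can be sketched as follows. This is not the library's implementation; `rebase_shards` is a hypothetical helper, the shard entry layout (`raw_data` / `zip_data` with a `basename`) is assumed from the snippets above, and paths are written without the `gs://` scheme for simplicity.

```python
import posixpath


def rebase_shards(root: str, index_path: str, shards: list) -> list:
    """Prefix each shard basename with the directory of its index.json,
    expressed relative to the dataset root (all intermediate levels kept)."""
    rel_dir = posixpath.relpath(posixpath.dirname(index_path), root)
    if rel_dir == ".":  # index.json sits directly in the root
        rel_dir = ""
    rebased = []
    for shard in shards:
        new_shard = dict(shard)  # shallow copy; input is left untouched
        for key in ("raw_data", "zip_data"):
            info = shard.get(key)
            if info:
                new_shard[key] = {
                    **info,
                    "basename": posixpath.join(rel_dir, info["basename"]),
                }
        rebased.append(new_shard)
    return rebased


# A per-worker index entry, as written by one worker (assumed layout).
worker_shards = [{"raw_data": {"basename": "shard.00000.mds"}, "zip_data": None}]
merged = rebase_shards(
    "bucket-name",
    "bucket-name/subset1/worker1/index.json",
    worker_shards,
)
print(merged[0]["raw_data"]["basename"])  # subset1/worker1/shard.00000.mds
```

The key point is that the prefix is computed relative to the dataset root rather than taken from the immediate parent directory only, so `subset1/` is not dropped.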
Additional context
This affects any layout where the shard directories are nested more than one level deep.