merge_index drops full directory path in merged index.json, causing broken shard references and 404s #959

## Environment
- OS: Ubuntu 22.04.2 LTS
- Hardware (GPU, or instance type): Not relevant
- mosaicml-streaming version: 0.13.0
- Dataset: ~6 TB, generated by hundreds of workers (followed the recommended [parallel data conversion](https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/parallel_dataset_conversion.html#Merge-meta-data) strategy).

## To reproduce
Steps to reproduce the behavior:
1. Prepare a streaming dataset with many workers such that shards and index.json files are nested multiple levels deep, e.g.

   ```
   bucket-name/subset1/worker1/index.json
   bucket-name/subset1/worker200/index.json
   bucket-name/subset2/worker1/index.json
   bucket-name/subset2/worker200/index.json
   ...
   ```
2. Run `merge_index` on the dataset root directory (in this case a GCS URI like `gs://bucket-name/`).
3. Inspect the generated top-level index.json. Shard basenames keep only the immediate parent directory instead of the full path relative to the dataset root, e.g.

   ```json
   {
     "raw_data": {"basename": "worker1/shard.00000.mds"}, ...
   }
   ```
   instead of the expected
   ```json
   {
     "raw_data": {"basename": "subset1/worker1/shard.00000.mds"}, ...
   }
   ```
4. Attempt to stream the dataset using the merged index.json. Shard downloads fail with 404s because the referenced paths omit the intermediate directories (here, `subset1/`).
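
A quick way to confirm the symptom from steps 3 and 4 is to check every basename in the merged index against the files that actually exist. The sketch below is only an illustration: it assumes a local copy of the dataset root and the usual index layout (a top-level `shards` list whose entries carry `raw_data` and optionally `zip_data`, each with a `basename`). `find_missing_shards` is a hypothetical helper, not part of mosaicml-streaming.

```python
import json
from pathlib import Path


def find_missing_shards(root: str) -> list[str]:
    """Return basenames listed in <root>/index.json that do not exist under root."""
    root_path = Path(root)
    index = json.loads((root_path / "index.json").read_text())
    missing = []
    for shard in index.get("shards", []):
        # Assumed layout: a shard entry may reference a raw file, a compressed file, or both.
        for key in ("raw_data", "zip_data"):
            info = shard.get(key)
            if info and not (root_path / info["basename"]).is_file():
                missing.append(info["basename"])
    return missing
```

With the layout above, a merged index containing `worker1/shard.00000.mds` would be reported as missing, while `subset1/worker1/shard.00000.mds` would resolve. For a GCS root, the same check would need an object-existence lookup through a storage client instead of `is_file()`.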

## Expected behavior
All shard basenames in the merged top-level index.json should preserve the full path relative to the dataset root, so that streaming clients can successfully locate each shard with its correct GCS path.
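
To make the expected result concrete, here is a minimal sketch of a merge that keeps the full relative path. It is not the library's implementation. It assumes a local directory tree, that only the leaf worker directories contain an index.json, and that the merged file uses `"version": 2` with a `shards` list. `merge_nested_indexes` is a hypothetical name.

```python
import json
from pathlib import Path


def merge_nested_indexes(root: str) -> dict:
    """Merge every nested index.json under root, keeping basenames relative to root."""
    root_path = Path(root)
    merged = []
    for index_file in sorted(root_path.rglob("index.json")):
        if index_file.parent == root_path:
            continue  # skip a previously written top-level index
        # Directory of this partial index relative to the root, e.g. "subset1/worker1".
        prefix = index_file.parent.relative_to(root_path).as_posix()
        for shard in json.loads(index_file.read_text())["shards"]:
            for key in ("raw_data", "zip_data"):
                if shard.get(key):
                    # Prepend the whole relative directory, not just its last component.
                    shard[key]["basename"] = f"{prefix}/{shard[key]['basename']}"
            merged.append(shard)
    result = {"version": 2, "shards": merged}
    (root_path / "index.json").write_text(json.dumps(result))
    return result
```

The key point is that the prefix is computed relative to the dataset root, so a shard under `subset1/worker1/` ends up as `subset1/worker1/shard.00000.mds` in the merged index.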

## Additional context
This affects any layout where the shard directories are nested more than one level below the dataset root passed to `merge_index`.
