merge_index drops full directory path in merged index.json, causing broken shard references and 404s #959

@agdhruv

Environment

  • OS: Ubuntu 22.04.2 LTS
  • Hardware (GPU, or instance type): Not relevant
  • mosaicml-streaming version: 0.13.0
  • Dataset: ~6 TB, generated by hundreds of workers (followed the recommended parallel data conversion strategy).

To reproduce

Steps to reproduce the behavior:

  1. Prepare a streaming dataset with many workers such that shards and index.json files are nested multiple levels deep, e.g.

    bucket-name/subset1/worker1/index.json
    bucket-name/subset1/worker200/index.json
    bucket-name/subset2/worker1/index.json
    bucket-name/subset2/worker200/index.json
    ...
    
  2. Run merge_index on the dataset root directory (in this case a GCS URI like gs://bucket-name/).

  3. Inspect the generated top-level index.json: each shard basename keeps only its immediate parent directory and drops the directories above it, e.g.

    {
      "raw_data": {"basename": "worker1/shard.00000.mds"}, ...
    }

    instead of the correct

    {
      "raw_data": {"basename": "subset1/worker1/shard.00000.mds"}, ...
    }

  4. Attempt to stream the dataset using the merged index.json. Shard downloads fail with 404s because the referenced paths omit the intermediate directories (here, subset1/).
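
For anyone who wants to confirm the problem before streaming, here is a minimal sketch that lists the basenames in a merged index.json that do not resolve to a file. It assumes the dataset (or a small sample of it) has been copied to a local directory, and that the index has a top-level "shards" list whose entries carry "raw_data" and optionally "zip_data" objects with a "basename" field. The helper name `find_missing_shards` is hypothetical and not part of mosaicml-streaming.

```python
import json
from pathlib import Path


def find_missing_shards(root: str) -> list[str]:
    """Return basenames listed in <root>/index.json that are not files under root.

    Hypothetical helper for local inspection; not a mosaicml-streaming API.
    """
    root_path = Path(root)
    index = json.loads((root_path / "index.json").read_text())
    missing = []
    for shard in index.get("shards", []):
        # Assumed layout: each shard entry may have raw_data and/or zip_data,
        # and either of them may be null.
        for key in ("raw_data", "zip_data"):
            info = shard.get(key)
            if info and not (root_path / info["basename"]).is_file():
                missing.append(info["basename"])
    return missing
```

With the layout from step 1, a basename such as worker1/shard.00000.mds would be reported as missing, since the file actually lives under subset1/worker1/.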

Expected behavior

All shard basenames in the merged top-level index.json should preserve the full path relative to the dataset root, so that streaming clients can successfully locate each shard with its correct GCS path.
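
To make the expected behavior concrete, here is a rough sketch of a merge that prefixes every basename with its directory relative to the dataset root. This is not how merge_index is implemented; it only works on a local directory tree, assumes index.json files exist only in the leaf (worker) directories, and assumes the same "shards" / "raw_data" / "zip_data" layout and a "version" value of 2. The function name `merge_indexes_preserving_paths` is made up for illustration.

```python
import json
from pathlib import Path


def merge_indexes_preserving_paths(root: str) -> dict:
    """Merge all nested index.json files under root into <root>/index.json.

    Illustrative local-filesystem sketch; not the mosaicml-streaming implementation.
    """
    root_path = Path(root)
    merged_shards = []
    # sorted() keeps the shard order deterministic across runs.
    for index_file in sorted(root_path.rglob("index.json")):
        if index_file.parent == root_path:
            continue  # skip a previously written top-level index
        # Directory of this partial index relative to the dataset root,
        # e.g. "subset1/worker1" rather than just "worker1".
        rel_dir = index_file.parent.relative_to(root_path).as_posix()
        partial = json.loads(index_file.read_text())
        for shard in partial.get("shards", []):
            for key in ("raw_data", "zip_data"):
                info = shard.get(key)
                if info:
                    info["basename"] = f"{rel_dir}/{info['basename']}"
            merged_shards.append(shard)
    merged = {"version": 2, "shards": merged_shards}  # version value is an assumption
    (root_path / "index.json").write_text(json.dumps(merged))
    return merged
```

The key point is that the prefix comes from the path relative to the root, not from the name of the immediate parent directory.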

Additional context

This affects any layout where the shard directories are nested more than one level deep.
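
A quick way to check whether a given dataset falls into this category is to look for index.json files that sit more than one directory below the root. The sketch below does this for a local copy of the tree; `deeply_nested_index_dirs` is a hypothetical name, and for a GCS bucket the same check would have to run over an object listing instead.

```python
from pathlib import Path


def deeply_nested_index_dirs(root: str) -> list[str]:
    """Return directories (relative to root) that hold an index.json
    and are nested more than one level below root."""
    root_path = Path(root)
    nested = []
    for index_file in sorted(root_path.rglob("index.json")):
        rel_dir = index_file.parent.relative_to(root_path)
        # len(parts) is 0 for the root itself and 1 for a direct child.
        if len(rel_dir.parts) > 1:
            nested.append(rel_dir.as_posix())
    return nested
```

An empty result means every partial index is at most one level deep, which is the layout that does not trigger the problem described here.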

Labels: bug (Something isn't working)
