fix(data): resolve data_type aliases and 'auto' to dispatchable dataset ids#335

Merged
irisliu10 merged 1 commit into Tencent:main from SuperMarioYL:fix/dataloader-data-type-resolution on Jun 9, 2026

@SuperMarioYL

Contributor

Problem

DataLoaderFactory.create_data_loader documents data_type as accepting
"text", "multimodal" or "auto" (and "auto" is the parameter default),
but none of those values can be dispatched:

  • Auto-detection assigns data_type = "text" or "multimodal", yet the
    dataset dispatch only matches the class-name keys ("TextDataset",
    "MultiModalDataset", "Text2ImageDataset", "OmniDataset",
    "AudioDataset").
  • So data_type="text", data_type="multimodal", and data_type="auto"
    all fall through to the final else and raise a ValueError, for
    example ValueError: Unsupported data type: text.

Any caller that uses the documented short names — or simply relies on the
default data_type="auto" — hits this. Only the full class-name strings
supplied by the shipped YAML configs (name: TextDataset) happen to work.
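
The failure mode can be reproduced with a simplified stand-in for the dispatch. This is a sketch, not the AngelSlim code: dispatch_before_fix, CLASS_NAME_KEYS, the detection rule, and the file name calib.json are all assumptions made for illustration. Only the class-name keys and the error message come from the description above.

```python
# Simplified stand-in for the pre-fix dispatch; NOT the actual AngelSlim code.
CLASS_NAME_KEYS = (
    "TextDataset",
    "MultiModalDataset",
    "Text2ImageDataset",
    "OmniDataset",
    "AudioDataset",
)


def dispatch_before_fix(data_type="auto", data_source=None):
    if data_type == "auto":
        # Hypothetical detection rule. The only point is that detection
        # yields a short name, never a class-name key.
        data_type = "text" if str(data_source).endswith(".json") else "multimodal"
    if data_type in CLASS_NAME_KEYS:
        return data_type
    # Short names and the result of "auto" always end up here.
    raise ValueError(f"Unsupported data type: {data_type}")


for value in ("text", "multimodal", "auto"):
    try:
        dispatch_before_fix(value, "calib.json")
    except ValueError as exc:
        print(f"{value!r} -> {exc}")

# Only a class-name string gets through.
print(dispatch_before_fix("TextDataset", "calib.json"))
```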

Fix

Extract a small, pure _resolve_data_type(data_type, data_source) helper that:

  1. performs the existing "auto" detection, then
  2. normalizes the documented short aliases (text, multimodal, text2image,
    omni, audio) to their canonical dataset id via a DATA_TYPE_ALIASES
    map, and
  3. validates the result against SUPPORTED_DATA_TYPES, raising a descriptive
    ValueError for anything unrecognized.

Class-name values supplied by existing configs pass through unchanged, so this
is fully backward compatible. The docstring is updated to list the values that
are actually accepted.

Tests

tests/test_dataloader.py is CPU-only: it needs no GPU and no model weights.
The heavy torch/transformers and sibling-dataset imports are stubbed so the
pure resolution logic can be exercised in isolation, matching the style of the
existing tests/test_config_parser.py. 16 cases cover:

  • "auto" resolving to a dispatchable id for .json/.parquet/dir/dict sources,
  • every documented short alias,
  • every canonical class name passing through unchanged (regression guard),
  • the unknown-type ValueError.

Red on main (the documented values are not dispatchable), green on this branch.
black --line-length=99, isort --profile=black, and flake8 all pass.
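
One common way to stub heavy imports, as described above, is to place MagicMock objects in sys.modules before the module under test is imported. The sketch below shows that general technique under stated assumptions. It is not necessarily how tests/test_dataloader.py does it, and stub_heavy_imports and HEAVY_MODULES are hypothetical names.

```python
import sys
from unittest import mock

# Illustrative list; the real test may stub more (e.g. sibling dataset modules).
HEAVY_MODULES = ("torch", "transformers")


def stub_heavy_imports(names=HEAVY_MODULES):
    """Return a context manager that swaps the named modules for MagicMock stubs.

    mock.patch.dict restores sys.modules to its previous state on exit.
    """
    stubs = {name: mock.MagicMock(name=name) for name in names}
    return mock.patch.dict(sys.modules, stubs)


with stub_heavy_imports():
    # Inside the block, "import torch" finds the stub in sys.modules,
    # so this works even on a machine without torch installed.
    import torch

    stubbed = isinstance(torch, mock.MagicMock)

print(stubbed)
```

Note that this simple form only covers top-level imports. A statement such as import torch.nn would need its own "torch.nn" entry in the stub list.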

Scope

+54 / -9 in angelslim/data/dataloader.py plus the new test file. No change to
the success path of configs that already pass class names.

…et ids

DataLoaderFactory.create_data_loader documented 'text', 'multimodal' and
'auto' as valid data_type values (with 'auto' as the parameter default), but
auto-detection produced 'text'/'multimodal' while the dataset dispatch only
matched the class-name keys ('TextDataset', 'MultiModalDataset', ...). As a
result every create_data_loader(data_type='auto') call — and the documented
short aliases — fell through to the final else branch and raised
ValueError: Unsupported data type.

Extract _resolve_data_type() to perform auto-detection and normalize the
documented short aliases to canonical dataset ids via a DATA_TYPE_ALIASES
map, validated against SUPPORTED_DATA_TYPES with a descriptive error.
Config-supplied class names pass through unchanged. Add CPU-only regression
tests covering auto detection, alias resolution, class-name pass-through and
the unknown-type error.
@irisliu10 irisliu10 merged commit e821500 into Tencent:main Jun 9, 2026
5 checks passed