fix(data): resolve data_type aliases and 'auto' to dispatchable dataset ids by SuperMarioYL · Pull Request #335 · Tencent/AngelSlim

SuperMarioYL · 2026-06-08T19:15:09Z

Problem

DataLoaderFactory.create_data_loader documents data_type as accepting
"text", "multimodal" or "auto" (and "auto" is the parameter default),
but those values never work:

- Auto-detection assigns data_type = "text" or "multimodal", yet the
  dataset dispatch only matches the class-name keys ("TextDataset",
  "MultiModalDataset", "Text2ImageDataset", "OmniDataset",
  "AudioDataset").
- As a result, data_type="text", data_type="multimodal", and
  data_type="auto" all fall through to the final else and raise a
  ValueError such as "Unsupported data type: text".

Any caller that uses the documented short names — or simply relies on the
default data_type="auto" — hits this. Only the full class-name strings
supplied by the shipped YAML configs (name: TextDataset) happen to work.
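
To make the failure mode concrete, here is a minimal stand-in for the pre-fix behavior. It is not the AngelSlim source: `detect_auto`, `dispatch_before_fix`, and the `.parquet` rule are hypothetical names and assumptions chosen only to mirror the description above, where detection yields a short name and dispatch knows only class names.

```python
# Hypothetical reproduction of the bug described above; not the real code.
CLASS_NAME_KEYS = (
    "TextDataset",
    "MultiModalDataset",
    "Text2ImageDataset",
    "OmniDataset",
    "AudioDataset",
)


def detect_auto(data_source):
    """Assumed detection rule: always returns a short name, never a class name."""
    if isinstance(data_source, str) and data_source.endswith(".parquet"):
        return "multimodal"
    return "text"


def dispatch_before_fix(data_type="auto", data_source=None):
    """Return the dataset key that would be dispatched, or raise."""
    if data_type == "auto":
        data_type = detect_auto(data_source)
    if data_type in CLASS_NAME_KEYS:
        return data_type
    # The short names produced by detection always land here.
    raise ValueError(f"Unsupported data type: {data_type}")
```

In this sketch only calls such as `dispatch_before_fix("TextDataset")` succeed; the default `"auto"` and the short names all raise.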

Fix

Extract a small, pure _resolve_data_type(data_type, data_source) helper that:

- performs the existing "auto" detection, then
- normalizes the documented short aliases (text, multimodal, text2image,
  omni, audio) to their canonical dataset id via a DATA_TYPE_ALIASES
  map, and
- validates the result against SUPPORTED_DATA_TYPES, raising a descriptive
  ValueError for anything unrecognized.

Class-name values supplied by existing configs pass through unchanged, so this
is fully backward compatible. The docstring is updated to list the values that
are actually accepted.

Tests

tests/test_dataloader.py is CPU-only: no GPU and no model weights. The heavy
torch/transformers and sibling-dataset imports are stubbed so the pure
resolution logic can be exercised in isolation, matching the style of the
existing tests/test_config_parser.py. 16 cases cover:

- "auto" resolving to a dispatchable id for .json/.parquet/dir/dict sources,
- every documented short alias,
- every canonical class name passing through unchanged (regression guard),
- the unknown-type ValueError.
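
For readers unfamiliar with stubbing imports, one common way to do it is to register placeholder modules in `sys.modules` before the module under test is imported. The sketch below shows that general technique only; `install_stub` and the module name `heavy_backend` are made up for illustration and are not taken from tests/test_dataloader.py.

```python
import sys
import types


def install_stub(dotted_name, **attrs):
    """Register a placeholder module, and any missing parents, in sys.modules."""
    parts = dotted_name.split(".")
    parent = None
    for depth in range(1, len(parts) + 1):
        name = ".".join(parts[:depth])
        module = sys.modules.get(name)
        if module is None:
            module = types.ModuleType(name)
            sys.modules[name] = module
        if parent is not None:
            # Make `parent.child` attribute access work as for a real package.
            setattr(parent, parts[depth - 1], module)
        parent = module
    for key, value in attrs.items():
        setattr(parent, key, value)
    return parent


# "heavy_backend" is a hypothetical stand-in for a heavy dependency.
install_stub("heavy_backend.utils.data", Dataset=type("Dataset", (), {}))

# The import system finds the stub in sys.modules and never touches disk.
from heavy_backend.utils.data import Dataset
```

With the stubs in place, the module that contains the resolution logic can be imported on a machine without the real dependency installed, as long as the code paths under test never call into it.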

Red on main (the documented values are not dispatchable), green on this branch.
black --line-length=99, isort --profile=black, and flake8 all pass.

Scope

+54 / -9 in angelslim/data/dataloader.py plus the new test file. No change to
the success path of configs that already pass class names.

…et ids

DataLoaderFactory.create_data_loader documented 'text', 'multimodal' and
'auto' as valid data_type values (with 'auto' as the parameter default), but
auto-detection produced 'text'/'multimodal' while the dataset dispatch only
matched the class-name keys ('TextDataset', 'MultiModalDataset', ...). As a
result every create_data_loader(data_type='auto') call, and every call using
the documented short aliases, fell through to the final else branch and raised
ValueError: Unsupported data type.

Extract _resolve_data_type() to perform auto-detection and normalize the
documented short aliases to canonical dataset ids via a DATA_TYPE_ALIASES map,
validated against SUPPORTED_DATA_TYPES with a descriptive error.
Config-supplied class names pass through unchanged.

Add CPU-only regression tests covering auto detection, alias resolution,
class-name pass-through and the unknown-type error.

yghstill approved these changes Jun 9, 2026

irisliu10 merged commit e821500 into Tencent:main Jun 9, 2026
5 checks passed

irisliu10 merged 1 commit into Tencent:main from SuperMarioYL:fix/dataloader-data-type-resolution
