fix(data): resolve data_type aliases and 'auto' to dispatchable dataset ids #335
Merged
irisliu10 merged 1 commit on Jun 9, 2026
DataLoaderFactory.create_data_loader documented 'text', 'multimodal' and
'auto' as valid data_type values (with 'auto' as the parameter default), but
auto-detection produced 'text'/'multimodal' while the dataset dispatch only
matched the class-name keys ('TextDataset', 'MultiModalDataset', ...). As a
result every create_data_loader(data_type='auto') call — and the documented
short aliases — fell through to the final else branch and raised
ValueError: Unsupported data type.
Extract _resolve_data_type() to perform auto-detection and normalize the
documented short aliases to canonical dataset ids via a DATA_TYPE_ALIASES
map, validated against SUPPORTED_DATA_TYPES with a descriptive error.
Config-supplied class names pass through unchanged. Add CPU-only regression
tests covering auto detection, alias resolution, class-name pass-through and
the unknown-type error.
yghstill approved these changes on Jun 9, 2026
Problem
DataLoaderFactory.create_data_loader documents data_type as accepting "text", "multimodal" or "auto" (and "auto" is the parameter default), but those values never work:

- Auto-detection sets data_type = "text" or "multimodal", yet the dataset dispatch only matches the class-name keys ("TextDataset", "MultiModalDataset", "Text2ImageDataset", "OmniDataset", "AudioDataset").
- data_type="text", data_type="multimodal", and data_type="auto" all fall through to the final else and raise ValueError: Unsupported data type: text.

Any caller that uses the documented short names, or simply relies on the default data_type="auto", hits this. Only the full class-name strings supplied by the shipped YAML configs (name: TextDataset) happen to work.

Fix
Extract a small, pure _resolve_data_type(data_type, data_source) helper that:

- performs "auto" detection, then
- normalizes the documented short aliases (text, multimodal, text2image, omni, audio) to their canonical dataset id via a DATA_TYPE_ALIASES map, and
- validates the result against SUPPORTED_DATA_TYPES, raising a descriptive ValueError for anything unrecognized.

Class-name values supplied by existing configs pass through unchanged, so this is fully backward compatible. The docstring is updated to list the values that are actually accepted.
Tests
tests/test_dataloader.py: CPU-only (no GPU, no model weights; the heavy torch/transformers and sibling-dataset imports are stubbed so the pure resolution logic can be exercised in isolation, matching the style of the existing tests/test_config_parser.py). 16 cases cover:

- "auto" resolving to a dispatchable id for .json/.parquet/dir/dict sources,
- the documented short aliases resolving to their canonical dataset ids,
- class-name values passing through unchanged,
- unknown types raising ValueError.

Red on main (the documented values are not dispatchable), green on this branch.

black --line-length=99, isort --profile=black, and flake8 all pass.

Scope
+54 / -9 in angelslim/data/dataloader.py plus the new test file. No change to the success path of configs that already pass class names.
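As a note on the stubbing mentioned under Tests, one common way to keep such a test file CPU-only is to place empty placeholder modules in sys.modules before importing the code under test. The sketch below shows that general technique with the standard library only. It is an assumption about the approach, not a copy of tests/test_dataloader.py, and the real file may stub more names.

```python
import sys
import types
from unittest import mock

# Names taken from the description above; treat the list as illustrative.
HEAVY_MODULES = ["torch", "transformers"]


def make_stubs(names):
    """Build empty placeholder modules keyed by import name."""
    return {name: types.ModuleType(name) for name in names}


stubs = make_stubs(HEAVY_MODULES)

# patch.dict restores sys.modules on exit, so the stubs do not leak
# into other tests that may need the real packages.
with mock.patch.dict(sys.modules, stubs):
    import torch
    import transformers

    # Inside the block, imports resolve to the placeholders.
    seen_inside = (
        torch is stubs["torch"],
        transformers is stubs["transformers"],
    )

# After the block, no placeholder remains registered.
leaked = any(sys.modules.get(name) is stubs[name] for name in HEAVY_MODULES)
```

In a real test, the import of the module under test would sit inside the with block (or a fixture wrapping it), so that its top-level imports see the placeholders.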