Introducing eager loading of dataframe(s) in RBatchGenerator #21035

Merged
martinfoell merged 3 commits into root-project:master from martinfoell:rbatchgenerator-multiple-dataframes
Jan 30, 2026

Conversation

@martinfoell
Contributor

This pull request:

  • Introduces the RDatasetLoader class for eager loading of one or several dataframes into memory
  • Introduces the RSampler class, which implements sampling strategies for the dataframe(s) loaded into memory
  • Adds slice and concatenate methods for Flat2DMatrix
  • Removes numEntries and rdf_entries as input parameters to the RChunkLoader class
  • Replaces numColumns with cols and vecSizes as input parameters to the RBatchLoader class
  • Adjusts the RBatchGenerator class to enable eager loading from dataframe(s) with the new classes and to account for the changes to the existing classes (see above)
  • Adjusts the pythonization to enable eager loading from dataframe(s)
  • Adds tests for eager loading of dataframe(s)
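
To illustrate what row slicing and concatenation mean for a 2D matrix kept in one flat buffer, here is a minimal pure-Python sketch. It is not the Flat2DMatrix implementation from this PR: the `FlatMatrix` class, the `slice_rows` and `concatenate_rows` helpers, and the row-major layout are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlatMatrix:
    """Hypothetical stand-in: a rows x cols matrix stored row-major in one list."""
    rows: int
    cols: int
    data: List[float] = field(default_factory=list)

    def __post_init__(self):
        # An empty buffer means "zero-initialise to the requested shape".
        if not self.data:
            self.data = [0.0] * (self.rows * self.cols)
        if len(self.data) != self.rows * self.cols:
            raise ValueError("buffer size does not match rows * cols")


def slice_rows(m: FlatMatrix, start: int, stop: int) -> FlatMatrix:
    """Copy rows [start, stop) into a new matrix."""
    if not 0 <= start <= stop <= m.rows:
        raise IndexError("row range out of bounds")
    # In a row-major layout, a row range is one contiguous span of the buffer.
    return FlatMatrix(stop - start, m.cols, m.data[start * m.cols:stop * m.cols])


def concatenate_rows(parts: List[FlatMatrix]) -> FlatMatrix:
    """Stack matrices with equal column counts on top of each other."""
    if not parts:
        raise ValueError("nothing to concatenate")
    cols = parts[0].cols
    if any(p.cols != cols for p in parts):
        raise ValueError("column counts differ")
    data = [x for p in parts for x in p.data]
    return FlatMatrix(sum(p.rows for p in parts), cols, data)
```

With this layout, slicing a dataset into training and validation parts and joining parts from several datasets both reduce to copying contiguous spans of the buffer.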

@github-actions

github-actions bot commented Jan 26, 2026

Test Results

    22 files      22 suites   3d 12h 55m 43s ⏱️
 3 774 tests  3 774 ✅ 0 💤 0 ❌
75 952 runs  75 952 ✅ 0 💤 0 ❌

Results for commit 2d3b47a.

♻️ This comment has been updated with latest results.

Member

@vepadulano vepadulano left a comment

Nice work! First round of review

@martinfoell
Contributor Author

> Nice work! First round of review

Thanks for the review, @vepadulano! I addressed your comments and left replies where you had questions.

Member

@vepadulano vepadulano left a comment

The changes look good to me! Before merging, I believe the commit history should be cleaned. Ideally, there should be 3 commits in total:

  • One introducing all the changes in the C++ code
  • One with the changes in _batchgenerator.py that rely on the additions from the previous commit
  • One with the new tests

This commit introduces the RDatasetLoader class, which takes a vector of dataframes as input, loads each of them into memory, and splits each one into training and validation datasets. The resulting datasets are collected in a vector with one entry per dataframe.
The RSampler class is introduced to concatenate the training and validation datasets from the vector produced by RDatasetLoader and to shuffle them before the dataset is passed to RBatchLoader.
Some changes are made to the existing classes to help integrate eager loading alongside the existing chunk loading:
- Remove numEntries and rdf_entries as input parameters to the RChunkLoader class
- Replace numColumns with cols and vecSizes as input parameters to the RBatchLoader class
- Add slice and concatenate methods for Flat2DMatrix in Flat2DMatrixOperator

In the RBatchGenerator class, the changes mentioned above are integrated to enable eager loading from dataframe(s).
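
The load, split, concatenate and shuffle flow described in this commit message can be sketched in plain Python as follows. This is only an illustration of the data flow, not the C++ code from the commit: the function names, the `validation_split` fraction, the `seed` argument and the representation of a dataset as a list of row tuples are all assumptions.

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]
Split = Tuple[List[Row], List[Row]]


def split_dataset(rows: Sequence[Row], validation_split: float) -> Split:
    """Split one in-memory dataset into (training, validation) parts."""
    n_val = int(len(rows) * validation_split)
    n_train = len(rows) - n_val
    return list(rows[:n_train]), list(rows[n_train:])


def load_and_split(datasets: Sequence[Sequence[Row]], validation_split: float) -> List[Split]:
    """Split every dataset on its own; one (train, val) pair per dataset."""
    return [split_dataset(rows, validation_split) for rows in datasets]


def sample(pairs: Sequence[Split], seed: int) -> Split:
    """Concatenate the per-dataset parts, then shuffle each combined set."""
    rng = random.Random(seed)
    train = [row for t, _ in pairs for row in t]
    val = [row for _, v in pairs for row in v]
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val


# Usage: two small "dataframes", half of each held out for validation.
datasets = [[(0.0,), (1.0,), (2.0,), (3.0,)], [(10.0,), (11.0,)]]
train, val = sample(load_and_split(datasets, validation_split=0.5), seed=0)
```

Splitting each dataset before concatenation, as the commit message describes, means every input dataframe contributes to both the training and the validation set.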
…g from dataframe(s)

This commit adjusts the Python bindings of RBatchGenerator so that eager loading is enabled in batch loading for NumPy, PyTorch and TensorFlow.
The load_eager (bool) parameter is added to choose between eager loading (True) and chunk loading (False). The sampling_type (str) parameter is added to select the sampling strategy used for eager loading. Further, the rdataframes input parameter is changed so that it can now be either a single dataframe or a list of dataframes.
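
As a rough illustration of how a "single dataframe or list of dataframes" argument and a boolean loading switch can be handled, consider the sketch below. The helper names `as_dataframe_list` and `describe_loading` are hypothetical and do not come from _batchgenerator.py; plain strings stand in for dataframes, and sampling_type is left out because its accepted values are not listed here.

```python
from typing import Any, List


def as_dataframe_list(rdataframes: Any) -> List[Any]:
    """Accept a single dataframe-like object or a list/tuple of them."""
    if isinstance(rdataframes, (list, tuple)):
        return list(rdataframes)
    return [rdataframes]


def describe_loading(rdataframes: Any, load_eager: bool) -> str:
    """Report which loading mode a given combination of arguments selects."""
    frames = as_dataframe_list(rdataframes)
    mode = "eager" if load_eager else "chunk"
    return f"{mode} loading from {len(frames)} dataframe(s)"


# Usage with placeholder objects instead of real dataframes.
print(describe_loading("df_signal", load_eager=False))
print(describe_loading(["df_signal", "df_background"], load_eager=True))
```

Normalising the argument to a list once at the entry point keeps the rest of the code free of single-versus-many special cases.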
@martinfoell martinfoell force-pushed the rbatchgenerator-multiple-dataframes branch from 523de65 to 2d3b47a Compare January 29, 2026 15:27
Member

@vepadulano vepadulano left a comment

LGTM!

@martinfoell martinfoell merged commit c6f6d22 into root-project:master Jan 30, 2026
28 of 30 checks passed
@siliataider siliataider added the in:ML Everything under ROOT/ML label Feb 9, 2026
Labels

in:ML Everything under ROOT/ML

3 participants