Adding an issue here to track Medical-Event-Data-Standard/MEDS-DEV#134
Details:
- The "direct" disk usage of CEHR-BERT on larger datasets (across all training stages) is pretty significant; on the denser dataset we're working with, it is around 150 GB / 100k patients. Note that the raw data here is ~ 0.9GB / 100k patients, so the space cost is extreme. This space breaks down as:
  - ~8 GB for packages in the pip virtual env.
  - Approximately twice as much disk is used for FT as for PT on the task I've tested (~100 GB vs. ~50 GB / 100k patients).
  - Almost all of this space (95+%) is in the prepared dataset directory; within it, space is split ~equally between two `meds_reader_*` files, one with a suffix that appears to be a hash and the other with the suffix "extension". These files are confirmed not to be the raw `meds_reader` outputs, but rather post-processed files that CEHR-BERT produces (see the disk-audit sketch after this list).
- There is also a huge amount of "indirect" space used via Hugging Face caching for CEHR-BERT. I think this is because cache files are created for every Hugging Face dataset used, never reused, and never cleaned up when runs fail or hit other issues (see the cache-audit sketch below).
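To make the direct-usage breakdown above easier to reproduce, here is a minimal sketch that sizes the top-level entries of the prepared dataset directory and groups the `meds_reader_*` artifacts onto one line. `PREPARED_DIR` is a hypothetical path; substitute the output directory from your own run.

```python
import os
from collections import defaultdict
from pathlib import Path

# Hypothetical path; substitute the prepared dataset directory from your run.
PREPARED_DIR = Path("~/cehr-bert/prepared_dataset").expanduser()

def dir_size_bytes(path: Path) -> int:
    """Total size of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

# Group top-level entries so the meds_reader_* artifacts show up as one line.
sizes = defaultdict(int)
for entry in PREPARED_DIR.iterdir():
    key = "meds_reader_*" if entry.name.startswith("meds_reader_") else entry.name
    sizes[key] += dir_size_bytes(entry) if entry.is_dir() else entry.stat().st_size

for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{size / 1e9:8.2f} GB  {name}")
```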
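The "indirect" Hugging Face cache cost can be audited similarly. This is a minimal sketch assuming the default `datasets` cache location (`~/.cache/huggingface/datasets`; adjust if `HF_DATASETS_CACHE` or `HF_HOME` is set) and a hypothetical retention window; for datasets that are still loaded, `datasets` also provides `Dataset.cleanup_cache_files()`.

```python
import shutil  # only needed if you enable the deletion below
import time
from pathlib import Path

# Default `datasets` cache location; adjust for HF_DATASETS_CACHE / HF_HOME.
CACHE_DIR = Path("~/.cache/huggingface/datasets").expanduser()
MAX_AGE_DAYS = 7  # hypothetical retention window

now = time.time()
for entry in sorted(CACHE_DIR.iterdir()):
    if not entry.is_dir():
        continue
    age_days = (now - entry.stat().st_mtime) / 86400
    size_gb = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file()) / 1e9
    print(f"{size_gb:8.2f} GB  {age_days:6.1f} d  {entry.name}")
    if age_days > MAX_AGE_DAYS:
        # Uncomment to delete stale entries left behind by failed runs.
        # shutil.rmtree(entry)
        pass
```

The deletion is left commented out so nothing is removed by accident.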