CEHR-BERT in MEDS format is very disk intensive #93

@mmcdermott

Description

Adding an issue here to track Medical-Event-Data-Standard/MEDS-DEV#134

Details:

  1. The "direct" disk usage of CEHR-BERT on larger datasets (across all training stages) is substantial: on the denser dataset we're working with, it is around 150 GB per 100k patients, versus ~0.9 GB per 100k patients for the raw data, so the space cost is extreme. This space breaks down as:
    • ~8 GB for packages in the pip virtual environment.
    • Roughly twice as much disk is used for fine-tuning (FT) as for pre-training (PT) on the task I've tested (~100 GB vs. ~50 GB per 100k patients).
    • Almost all of this space (95+%) is in the prepared dataset directory; within it, space is split roughly equally between two meds_reader_* files, one with a suffix that appears to be a hash and the other with the suffix "extension". These files are confirmed not to be the raw meds_reader outputs, but rather post-processed files that CEHR-BERT writes.
  2. There is also a huge amount of "indirect" space being used via Hugging Face caching for CEHR-BERT. I believe this is because these cache files are created for every Hugging Face dataset used, never re-used, and never cleaned up when runs fail or other issues occur.
