CEHR-BERT in MEDS format is very disk intensive #93

@mmcdermott

Description

Adding an issue here to track Medical-Event-Data-Standard/MEDS-DEV#134

Details:

  1. The "direct" disk usage of CEHR-BERT on larger datasets (across all training stages) is substantial: on the denser dataset we're working with, it is around 150 GB per 100k patients, versus ~0.9 GB per 100k patients for the raw data, so the space cost is extreme. This space breaks down as:
    • ~8 GB for packages in the pip virtual environment.
    • Roughly twice as much disk is used for fine-tuning (FT) as for pre-training (PT) on the task I've tested (~100 GB vs. ~50 GB per 100k patients).
    • Almost all of this space (95+%) is in the prepared dataset directory; within it, space is split roughly equally between two meds_reader_* files, one with a suffix that appears to be a hash and the other with the suffix "extension". These files are confirmed not to be the raw meds_reader outputs, but rather post-processed files that CEHR-BERT writes.
  2. There is also a huge amount of "indirect" space being used via Hugging Face caching for CEHR-BERT. I believe this is because these cache files are created for every Hugging Face dataset used, never re-used, and never cleaned up when runs fail or other issues occur.
