Conversation
Pull request overview
This PR updates the data download/preprocessing utilities and associated notebooks to support downloading and annotating CPJUMP experimental + MOA metadata, while removing unused helper code in preparation for splitting notebooks into a separate analysis repository.
Changes:
- Expanded `utils/io_utils.py` with improved module docs and a new helper to load + concatenate profile parquet files.
- Simplified/cleaned `utils/data_utils.py` (removing unused signature-grouping helpers) and added feature-modality utilities (`split_data`, `remove_feature_prefixes`).
- Updated `notebooks/0.download-data/*` notebooks (and nbconverted scripts) to use a new `dl-configs.yaml` and to generate/consume compound + MOA metadata.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| utils/validator.py | Removes an unused clustering param-grid validator module. |
| utils/io_utils.py | Adds module docs + load_and_concat_profiles; adjusts formatting/error messaging. |
| utils/data_utils.py | Cleans up unused functions; improves docs; adds modality/prefix helpers. |
| notebooks/0.download-data/nbconverted/1.download-data.py | Uses dl-configs.yaml; adds CPJUMP compound+MOA merge/export steps; doc edits. |
| notebooks/0.download-data/nbconverted/2.preprocessing.py | Switches MOA annotation to use generated compound metadata TSV; doc edits. |
| notebooks/0.download-data/nbconverted/3.subset-jump-controls.py | Updates paths/filenames for control subsets. |
| notebooks/0.download-data/dl-configs.yaml | Adds dedicated download configuration for the download notebook(s). |
| notebooks/0.download-data/1.download-data.ipynb | Notebook equivalent of the nbconverted updates + new compound/MOA section. |
| notebooks/0.download-data/2.preprocessing.ipynb | Notebook equivalent of preprocessing updates (compound metadata TSV usage). |
| notebooks/0.download-data/3.subset-jump-controls.ipynb | Notebook equivalent of control-subsetting path/filename updates. |
| .pre-commit-config.yaml | Bumps ruff-pre-commit revision. |
Comments suppressed due to low confidence (2)
notebooks/0.download-data/nbconverted/3.subset-jump-controls.py:122
- This notebook header says it subsets controls from the CPJUMP1 CRISPR dataset, but the code now loads `cpjump1_compound_concat_profiles.parquet` and writes `cpjump1_compound_negcon_...` outputs. Update the top-level notebook description (and any related variable names/text) to match the compound dataset being processed to avoid confusion for readers.
cpjump1_data_path = (
profiles_dir / "cpjump1" / "cpjump1_compound_concat_profiles.parquet"
).resolve(strict=True)
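As context for the path check above: `pathlib.Path.resolve(strict=True)` returns the absolute path only when the target exists and raises `FileNotFoundError` otherwise, so a dataset path that no longer matches the notebook's narrative still fails fast at load time. A minimal standalone sketch (temporary directory only; no CPJUMP data assumed):

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    data_dir = pathlib.Path(tmp) / "data"
    data_dir.mkdir()

    # existing path: strict resolution succeeds and yields an absolute path
    resolved = data_dir.resolve(strict=True)
    assert resolved.is_absolute()

    # missing path: strict resolution raises instead of returning a dangling path
    try:
        (data_dir / "missing.parquet").resolve(strict=True)
    except FileNotFoundError:
        print("missing path rejected")
```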
notebooks/0.download-data/3.subset-jump-controls.ipynb:151
- The notebook introduction describes subsetting controls from the CPJUMP1 CRISPR dataset, but this code cell is now pointing at `cpjump1_compound_concat_profiles.parquet`. Please update the introductory text to reflect the compound dataset (or adjust the code back to CRISPR) so the narrative matches the executed workflow.
"# setting directory where all the single-cell profiles are stored\n",
"data_dir = pathlib.Path.cwd() / \"data\"\n",
"profiles_dir = (data_dir / \"sc-profiles\").resolve(strict=True)\n",
"\n",
"cpjump1_data_path = (\n",
" profiles_dir / \"cpjump1\" / \"cpjump1_compound_concat_profiles.parquet\"\n",
").resolve(strict=True)\n",
def split_data(
    pycytominer_output: pl.DataFrame, dataset: str = "CP_and_DP"
) -> pl.DataFrame:
    """
    Filter a pycytominer output DataFrame to retain only metadata and the
    selected feature modality columns.

    Parameters
    ----------
    pycytominer_output : pl.DataFrame
        Polars DataFrame from pycytominer containing both metadata and feature columns.
    dataset : str, optional
        Feature modality to retain. One of:
        - ``"CP"`` — CellProfiler features only (columns containing ``"CP__"``)
        - ``"DP"`` — DeepProfiler features only (columns containing ``"DP__"``)
        - ``"CP_and_DP"`` — both modalities (default)

    Returns
    -------
    pl.DataFrame
        Polars DataFrame with metadata and selected features
    """
    all_cols = pycytominer_output.columns

    # Get DP, CP, or both features from all columns depending on desired dataset
    if dataset == "CP":
        feature_cols = [col for col in all_cols if "CP__" in col]
    elif dataset == "DP":
        feature_cols = [col for col in all_cols if "DP__" in col]
    elif dataset == "CP_and_DP":
        feature_cols = [col for col in all_cols if "P__" in col]
    else:
        raise ValueError(
            f"Invalid dataset '{dataset}'. Choose from 'CP', 'DP', or 'CP_and_DP'."
        )

    # Metadata columns are all columns except feature columns
    metadata_cols = [col for col in all_cols if "P__" not in col]

    # Select metadata and feature columns
    selected_cols = metadata_cols + feature_cols

    return pycytominer_output.select(selected_cols)


def remove_feature_prefixes(df: pl.DataFrame, prefix: str = "CP__") -> pl.DataFrame:
    """
    Strip a feature-modality prefix from all matching column names.

    For example, ``"CP__Cells_AreaShape_Area"`` becomes ``"Cells_AreaShape_Area"``
    when ``prefix="CP__"``.

    Parameters
    ----------
    df : pl.DataFrame
        Input DataFrame whose column names may contain the prefix.
    prefix : str, default ``"CP__"``
        Prefix string to strip from matching column names.

    Returns
    -------
    pl.DataFrame
        DataFrame with the prefix removed from all matching column names.
    """
    return df.rename(lambda x: x.replace(prefix, "") if prefix in x else x)
New public helpers split_data and remove_feature_prefixes were added, but the existing test suite for utils.data_utils only covers add_cell_id_hash. Add unit tests covering the new behaviors (at least: CP vs DP vs CP_and_DP selection, invalid dataset value, and prefix stripping) to prevent regressions.
    profiles_dir / "cpjump1" / "cpjump1_compound_experimental-metadata.csv"
).resolve(strict=True)

# cpjump1 compound metadta
Typo in comment: "metadta" → "metadata".

- # cpjump1 compound metadta
+ # cpjump1 compound metadata
"    profiles_dir / \"cpjump1\" / \"cpjump1_compound_experimental-metadata.csv\"\n",
").resolve(strict=True)\n",
"\n",
"# cpjump1 compound metadta\n",
Typo in notebook cell text: "metadta" → "metadata".

- "# cpjump1 compound metadta\n",
+ "# cpjump1 compound metadata\n",
# # Downloading Single-Cell Profiles
#
# This notebook focuses on downloading metadata and single-cell profiles from three key datasets:
# This notebook downloading metadata and single-cell profiles from three key datasets:
Grammar issue in notebook intro: "This notebook downloading ..." is missing a verb (e.g., "This notebook downloads ..." / "This notebook focuses on downloading ...").

- # This notebook downloading metadata and single-cell profiles from three key datasets:
+ # This notebook downloads metadata and single-cell profiles from three key datasets:
    {col: f"Metadata_{col}" for col in broad_compound_moa_metadata.columns}
)

# replace null values in the boroad compound moa to "unknown"
Typo in comment: "boroad" → "broad".

- # replace null values in the boroad compound moa to "unknown"
+ # replace null values in the broad compound moa to "unknown"
"# Downloading Single-Cell Profiles\n",
"\n",
"This notebook focuses on downloading metadata and single-cell profiles from three key datasets:\n",
"This notebook downloading metadata and single-cell profiles from three key datasets:\n",
Grammar issue in notebook intro: "This notebook downloading ..." is missing a verb (e.g., "This notebook downloads ..." / "This notebook focuses on downloading ...").

- "This notebook downloading metadata and single-cell profiles from three key datasets:\n",
+ "This notebook downloads metadata and single-cell profiles from three key datasets:\n",
"    {col: f\"Metadata_{col}\" for col in broad_compound_moa_metadata.columns}\n",
")\n",
"\n",
"# replace null values in the boroad compound moa to \"unknown\"\n",
Typo in notebook cell text: "boroad" → "broad".

- "# replace null values in the boroad compound moa to \"unknown\"\n",
+ "# replace null values in the broad compound moa to \"unknown\"\n",
Given the number of changes made throughout the analysis notebooks, this PR updates the downloads module to include functions for downloading the CPJUMP experimental and MOA data, along with several improvements to the module documentation.
We have also removed functions that are no longer used in the notebooks and updated the documentation accordingly.
These changes are part of the preparation for separating the notebooks into a dedicated analysis repository while transitioning this repository into a focused software package.