
Update downloads module #80

Open
axiomcura wants to merge 5 commits into WayScience:main from axiomcura:update-download-module

Conversation

@axiomcura
Member

Given the number of changes made throughout the analysis notebooks, this PR updates the downloads module to include functions for downloading the CPJUMP experimental and MOA data, along with several improvements to the module documentation.

We have also removed functions that are no longer used in the notebooks and updated the documentation accordingly.

These changes are part of the preparation for separating the notebooks into a dedicated analysis repository while transitioning this repository into a focused software package.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



Contributor

Copilot AI left a comment

Pull request overview

This PR updates the data download/preprocessing utilities and associated notebooks to support downloading and annotating CPJUMP experimental + MOA metadata, while removing unused helper code in preparation for splitting notebooks into a separate analysis repository.

Changes:

  • Expanded utils/io_utils.py with improved module docs and a new helper to load + concatenate profile parquet files.
  • Simplified/cleaned utils/data_utils.py (removing unused signature-grouping helpers) and added feature-modality utilities (split_data, remove_feature_prefixes).
  • Updated notebooks/0.download-data/* notebooks (and nbconverted scripts) to use a new dl-configs.yaml and to generate/consume compound+MOA metadata.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| utils/validator.py | Removes an unused clustering param-grid validator module. |
| utils/io_utils.py | Adds module docs + load_and_concat_profiles; adjusts formatting/error messaging. |
| utils/data_utils.py | Cleans up unused functions; improves docs; adds modality/prefix helpers. |
| notebooks/0.download-data/nbconverted/1.download-data.py | Uses dl-configs.yaml; adds CPJUMP compound+MOA merge/export steps; doc edits. |
| notebooks/0.download-data/nbconverted/2.preprocessing.py | Switches MOA annotation to use generated compound metadata TSV; doc edits. |
| notebooks/0.download-data/nbconverted/3.subset-jump-controls.py | Updates paths/filenames for control subsets. |
| notebooks/0.download-data/dl-configs.yaml | Adds dedicated download configuration for the download notebook(s). |
| notebooks/0.download-data/1.download-data.ipynb | Notebook equivalent of the nbconverted updates + new compound/MOA section. |
| notebooks/0.download-data/2.preprocessing.ipynb | Notebook equivalent of preprocessing updates (compound metadata TSV usage). |
| notebooks/0.download-data/3.subset-jump-controls.ipynb | Notebook equivalent of control-subsetting path/filename updates. |
| .pre-commit-config.yaml | Bumps ruff-pre-commit revision. |

Comments suppressed due to low confidence (2)

notebooks/0.download-data/nbconverted/3.subset-jump-controls.py:122

  • This notebook header says it subsets controls from the CPJUMP1 CRISPR dataset, but the code now loads cpjump1_compound_concat_profiles.parquet and writes cpjump1_compound_negcon_... outputs. Update the top-level notebook description (and any related variable names/text) to match the compound dataset being processed to avoid confusion for readers.
cpjump1_data_path = (
    profiles_dir / "cpjump1" / "cpjump1_compound_concat_profiles.parquet"
).resolve(strict=True)

notebooks/0.download-data/3.subset-jump-controls.ipynb:151

  • The notebook introduction describes subsetting controls from the CPJUMP1 CRISPR dataset, but this code cell is now pointing at cpjump1_compound_concat_profiles.parquet. Please update the introductory text to reflect the compound dataset (or adjust the code back to CRISPR) so the narrative matches the executed workflow.
    "# setting directory where all the single-cell profiles are stored\n",
    "data_dir = pathlib.Path.cwd() / \"data\"\n",
    "profiles_dir = (data_dir / \"sc-profiles\").resolve(strict=True)\n",
    "\n",
    "cpjump1_data_path = (\n",
    "    profiles_dir / \"cpjump1\" / \"cpjump1_compound_concat_profiles.parquet\"\n",
    ").resolve(strict=True)\n",


Comment on lines +335 to +399
def split_data(
    pycytominer_output: pl.DataFrame, dataset: str = "CP_and_DP"
) -> pl.DataFrame:
    """
    Filter a pycytominer output DataFrame to retain only metadata and the
    selected feature modality columns.

    Parameters
    ----------
    pycytominer_output : pl.DataFrame
        Polars DataFrame from pycytominer containing both metadata and feature columns.
    dataset : str, optional
        Feature modality to retain. One of:
        - ``"CP"`` — CellProfiler features only (columns containing ``"CP__"``)
        - ``"DP"`` — DeepProfiler features only (columns containing ``"DP__"``)
        - ``"CP_and_DP"`` — both modalities (default)

    Returns
    -------
    pl.DataFrame
        Polars DataFrame with metadata and selected features
    """
    all_cols = pycytominer_output.columns

    # Get DP, CP, or both features from all columns depending on desired dataset
    if dataset == "CP":
        feature_cols = [col for col in all_cols if "CP__" in col]
    elif dataset == "DP":
        feature_cols = [col for col in all_cols if "DP__" in col]
    elif dataset == "CP_and_DP":
        feature_cols = [col for col in all_cols if "P__" in col]
    else:
        raise ValueError(
            f"Invalid dataset '{dataset}'. Choose from 'CP', 'DP', or 'CP_and_DP'."
        )

    # Metadata columns is all columns except feature columns
    metadata_cols = [col for col in all_cols if "P__" not in col]

    # Select metadata and feature columns
    selected_cols = metadata_cols + feature_cols

    return pycytominer_output.select(selected_cols)


def remove_feature_prefixes(df: pl.DataFrame, prefix: str = "CP__") -> pl.DataFrame:
    """
    Strip a feature-modality prefix from all matching column names.

    For example, ``"CP__Cells_AreaShape_Area"`` becomes ``"Cells_AreaShape_Area"``
    when ``prefix="CP__"``.

    Parameters
    ----------
    df : pl.DataFrame
        Input DataFrame whose column names may contain the prefix.
    prefix : str, default ``"CP__"``
        Prefix string to strip from matching column names.

    Returns
    -------
    pl.DataFrame
        DataFrame with the prefix removed from all matching column names.
    """
    return df.rename(lambda x: x.replace(prefix, "") if prefix in x else x)

Copilot AI Mar 8, 2026

New public helpers split_data and remove_feature_prefixes were added, but the existing test suite for utils.data_utils only covers add_cell_id_hash. Add unit tests covering the new behaviors (at least: CP vs DP vs CP_and_DP selection, invalid dataset value, and prefix stripping) to prevent regressions.
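
The cases listed above could be covered along these lines. This is only a minimal sketch, not the project's test suite: `select_columns` and `strip_prefix` are hypothetical stand-ins that mirror the column logic of `split_data` and `remove_feature_prefixes` on plain lists of column names, so the example runs without Polars installed. The column names `Metadata_Plate` and `DP__Cells_Feature_0` are invented for illustration. Real tests would build a small `pl.DataFrame` and call the functions in `utils.data_utils` directly.

```python
# Hypothetical stand-ins mirroring the column logic of the quoted helpers.
# They operate on lists of column names instead of Polars DataFrames.


def select_columns(all_cols: list[str], dataset: str = "CP_and_DP") -> list[str]:
    """Mirror split_data: metadata columns first, then the chosen features."""
    if dataset == "CP":
        feature_cols = [col for col in all_cols if "CP__" in col]
    elif dataset == "DP":
        feature_cols = [col for col in all_cols if "DP__" in col]
    elif dataset == "CP_and_DP":
        feature_cols = [col for col in all_cols if "P__" in col]
    else:
        raise ValueError(
            f"Invalid dataset '{dataset}'. Choose from 'CP', 'DP', or 'CP_and_DP'."
        )
    metadata_cols = [col for col in all_cols if "P__" not in col]
    return metadata_cols + feature_cols


def strip_prefix(all_cols: list[str], prefix: str = "CP__") -> list[str]:
    """Mirror remove_feature_prefixes on a list of column names."""
    return [col.replace(prefix, "") if prefix in col else col for col in all_cols]


# Invented example columns: one metadata, one CP feature, one DP feature.
COLS = ["Metadata_Plate", "CP__Cells_AreaShape_Area", "DP__Cells_Feature_0"]


def test_cp_only():
    assert select_columns(COLS, "CP") == ["Metadata_Plate", "CP__Cells_AreaShape_Area"]


def test_dp_only():
    assert select_columns(COLS, "DP") == ["Metadata_Plate", "DP__Cells_Feature_0"]


def test_cp_and_dp_is_default():
    assert select_columns(COLS) == COLS


def test_invalid_dataset_raises():
    try:
        select_columns(COLS, "XX")
    except ValueError as err:
        assert "Invalid dataset" in str(err)
    else:
        raise AssertionError("expected ValueError for an unknown dataset")


def test_prefix_stripping_leaves_other_columns_alone():
    assert strip_prefix(COLS) == [
        "Metadata_Plate",
        "Cells_AreaShape_Area",
        "DP__Cells_Feature_0",
    ]
```

In a pytest-based suite, the `try`/`except` in the invalid-dataset case would normally be written with `pytest.raises(ValueError)`.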

profiles_dir / "cpjump1" / "cpjump1_compound_experimental-metadata.csv"
).resolve(strict=True)

# cpjump1 compound metadta

Copilot AI Mar 8, 2026

Typo in comment: "metadta" → "metadata".

Suggested change
- # cpjump1 compound metadta
+ # cpjump1 compound metadata

" profiles_dir / \"cpjump1\" / \"cpjump1_compound_experimental-metadata.csv\"\n",
").resolve(strict=True)\n",
"\n",
"# cpjump1 compound metadta\n",

Copilot AI Mar 8, 2026

Typo in notebook cell text: "metadta" → "metadata".

Suggested change
- "# cpjump1 compound metadta\n",
+ "# cpjump1 compound metadata\n",

# # Downloading Single-Cell Profiles
#
# This notebook focuses on downloading metadata and single-cell profiles from three key datasets:
# This notebook downloading metadata and single-cell profiles from three key datasets:

Copilot AI Mar 8, 2026

Grammar issue in notebook intro: "This notebook downloading ..." is missing a verb (e.g., "This notebook downloads ..." / "This notebook focuses on downloading ...").

Suggested change
- # This notebook downloading metadata and single-cell profiles from three key datasets:
+ # This notebook downloads metadata and single-cell profiles from three key datasets:

{col: f"Metadata_{col}" for col in broad_compound_moa_metadata.columns}
)

# replace null values in the boroad compound moa to "unknown"

Copilot AI Mar 8, 2026

Typo in comment: "boroad" → "broad".

Suggested change
- # replace null values in the boroad compound moa to "unknown"
+ # replace null values in the broad compound moa to "unknown"

"# Downloading Single-Cell Profiles\n",
"\n",
"This notebook focuses on downloading metadata and single-cell profiles from three key datasets:\n",
"This notebook downloading metadata and single-cell profiles from three key datasets:\n",

Copilot AI Mar 8, 2026

Grammar issue in notebook intro: "This notebook downloading ..." is missing a verb (e.g., "This notebook downloads ..." / "This notebook focuses on downloading ...").

Suggested change
- "This notebook downloading metadata and single-cell profiles from three key datasets:\n",
+ "This notebook downloads metadata and single-cell profiles from three key datasets:\n",

" {col: f\"Metadata_{col}\" for col in broad_compound_moa_metadata.columns}\n",
")\n",
"\n",
"# replace null values in the boroad compound moa to \"unknown\"\n",

Copilot AI Mar 8, 2026

Typo in notebook cell text: "boroad" → "broad".

Suggested change
- "# replace null values in the boroad compound moa to \"unknown\"\n",
+ "# replace null values in the broad compound moa to \"unknown\"\n",


2 participants