Conversation
Pull request overview
This PR updates the data download/preprocessing utilities and associated notebooks to support downloading and annotating CPJUMP experimental + MOA metadata, while removing unused helper code in preparation for splitting notebooks into a separate analysis repository.
Changes:
- Expanded `utils/io_utils.py` with improved module docs and a new helper to load + concatenate profile parquet files.
- Simplified/cleaned `utils/data_utils.py` (removing unused signature-grouping helpers) and added feature-modality utilities (`split_data`, `remove_feature_prefixes`).
- Updated `notebooks/0.download-data/*` notebooks (and nbconverted scripts) to use a new `dl-configs.yaml` and to generate/consume compound + MOA metadata.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| utils/validator.py | Removes an unused clustering param-grid validator module. |
| utils/io_utils.py | Adds module docs + load_and_concat_profiles; adjusts formatting/error messaging. |
| utils/data_utils.py | Cleans up unused functions; improves docs; adds modality/prefix helpers. |
| notebooks/0.download-data/nbconverted/1.download-data.py | Uses dl-configs.yaml; adds CPJUMP compound+MOA merge/export steps; doc edits. |
| notebooks/0.download-data/nbconverted/2.preprocessing.py | Switches MOA annotation to use generated compound metadata TSV; doc edits. |
| notebooks/0.download-data/nbconverted/3.subset-jump-controls.py | Updates paths/filenames for control subsets. |
| notebooks/0.download-data/dl-configs.yaml | Adds dedicated download configuration for the download notebook(s). |
| notebooks/0.download-data/1.download-data.ipynb | Notebook equivalent of the nbconverted updates + new compound/MOA section. |
| notebooks/0.download-data/2.preprocessing.ipynb | Notebook equivalent of preprocessing updates (compound metadata TSV usage). |
| notebooks/0.download-data/3.subset-jump-controls.ipynb | Notebook equivalent of control-subsetting path/filename updates. |
| .pre-commit-config.yaml | Bumps ruff-pre-commit revision. |
Comments suppressed due to low confidence (2)
notebooks/0.download-data/nbconverted/3.subset-jump-controls.py:122
- This notebook header says it subsets controls from the CPJUMP1 CRISPR dataset, but the code now loads `cpjump1_compound_concat_profiles.parquet` and writes `cpjump1_compound_negcon_...` outputs. Update the top-level notebook description (and any related variable names/text) to match the compound dataset being processed to avoid confusion for readers.
cpjump1_data_path = (
profiles_dir / "cpjump1" / "cpjump1_compound_concat_profiles.parquet"
).resolve(strict=True)
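As context for the path check above: `pathlib.Path.resolve(strict=True)` returns the absolute path only when the target exists and raises `FileNotFoundError` otherwise, so a dataset path that no longer matches the notebook's narrative still fails fast at load time. A minimal standalone sketch (temporary directory only; no CPJUMP data assumed):

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    data_dir = pathlib.Path(tmp) / "data"
    data_dir.mkdir()

    # existing path: strict resolution succeeds and yields an absolute path
    resolved = data_dir.resolve(strict=True)
    assert resolved.is_absolute()

    # missing path: strict resolution raises instead of returning a dangling path
    try:
        (data_dir / "missing.parquet").resolve(strict=True)
    except FileNotFoundError:
        print("missing path rejected")
```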
notebooks/0.download-data/3.subset-jump-controls.ipynb:151
- The notebook introduction describes subsetting controls from the CPJUMP1 CRISPR dataset, but this code cell is now pointing at `cpjump1_compound_concat_profiles.parquet`. Please update the introductory text to reflect the compound dataset (or adjust the code back to CRISPR) so the narrative matches the executed workflow.
"# setting directory where all the single-cell profiles are stored\n",
"data_dir = pathlib.Path.cwd() / \"data\"\n",
"profiles_dir = (data_dir / \"sc-profiles\").resolve(strict=True)\n",
"\n",
"cpjump1_data_path = (\n",
" profiles_dir / \"cpjump1\" / \"cpjump1_compound_concat_profiles.parquet\"\n",
").resolve(strict=True)\n",
def split_data(
    pycytominer_output: pl.DataFrame, dataset: str = "CP_and_DP"
) -> pl.DataFrame:
    """
    Filter a pycytominer output DataFrame to retain only metadata and the
    selected feature modality columns.

    Parameters
    ----------
    pycytominer_output : pl.DataFrame
        Polars DataFrame from pycytominer containing both metadata and feature columns.
    dataset : str, optional
        Feature modality to retain. One of:
        - ``"CP"`` — CellProfiler features only (columns containing ``"CP__"``)
        - ``"DP"`` — DeepProfiler features only (columns containing ``"DP__"``)
        - ``"CP_and_DP"`` — both modalities (default)

    Returns
    -------
    pl.DataFrame
        Polars DataFrame with metadata and selected features
    """
    all_cols = pycytominer_output.columns

    # Get DP, CP, or both features from all columns depending on desired dataset
    if dataset == "CP":
        feature_cols = [col for col in all_cols if "CP__" in col]
    elif dataset == "DP":
        feature_cols = [col for col in all_cols if "DP__" in col]
    elif dataset == "CP_and_DP":
        feature_cols = [col for col in all_cols if "P__" in col]
    else:
        raise ValueError(
            f"Invalid dataset '{dataset}'. Choose from 'CP', 'DP', or 'CP_and_DP'."
        )

    # Metadata columns are all columns except feature columns
    metadata_cols = [col for col in all_cols if "P__" not in col]

    # Select metadata and feature columns
    selected_cols = metadata_cols + feature_cols

    return pycytominer_output.select(selected_cols)


def remove_feature_prefixes(df: pl.DataFrame, prefix: str = "CP__") -> pl.DataFrame:
    """
    Strip a feature-modality prefix from all matching column names.

    For example, ``"CP__Cells_AreaShape_Area"`` becomes ``"Cells_AreaShape_Area"``
    when ``prefix="CP__"``.

    Parameters
    ----------
    df : pl.DataFrame
        Input DataFrame whose column names may contain the prefix.
    prefix : str, default ``"CP__"``
        Prefix string to strip from matching column names.

    Returns
    -------
    pl.DataFrame
        DataFrame with the prefix removed from all matching column names.
    """
    return df.rename(lambda x: x.replace(prefix, "") if prefix in x else x)
New public helpers split_data and remove_feature_prefixes were added, but the existing test suite for utils.data_utils only covers add_cell_id_hash. Add unit tests covering the new behaviors (at least: CP vs DP vs CP_and_DP selection, invalid dataset value, and prefix stripping) to prevent regressions.
    profiles_dir / "cpjump1" / "cpjump1_compound_experimental-metadata.csv"
).resolve(strict=True)

# cpjump1 compound metadta
Typo in comment: "metadta" → "metadata".

- # cpjump1 compound metadta
+ # cpjump1 compound metadata
"    profiles_dir / \"cpjump1\" / \"cpjump1_compound_experimental-metadata.csv\"\n",
").resolve(strict=True)\n",
"\n",
"# cpjump1 compound metadta\n",
Typo in notebook cell text: "metadta" → "metadata".

- "# cpjump1 compound metadta\n",
+ "# cpjump1 compound metadata\n",
# # Downloading Single-Cell Profiles
#
# This notebook focuses on downloading metadata and single-cell profiles from three key datasets:
# This notebook downloading metadata and single-cell profiles from three key datasets:
Grammar issue in notebook intro: "This notebook downloading ..." is missing a verb (e.g., "This notebook downloads ..." / "This notebook focuses on downloading ...").

- # This notebook downloading metadata and single-cell profiles from three key datasets:
+ # This notebook downloads metadata and single-cell profiles from three key datasets:
    {col: f"Metadata_{col}" for col in broad_compound_moa_metadata.columns}
)

# replace null values in the boroad compound moa to "unknown"
Typo in comment: "boroad" → "broad".

- # replace null values in the boroad compound moa to "unknown"
+ # replace null values in the broad compound moa to "unknown"
"# Downloading Single-Cell Profiles\n",
"\n",
"This notebook focuses on downloading metadata and single-cell profiles from three key datasets:\n",
"This notebook downloading metadata and single-cell profiles from three key datasets:\n",
Grammar issue in notebook intro: "This notebook downloading ..." is missing a verb (e.g., "This notebook downloads ..." / "This notebook focuses on downloading ...").

- "This notebook downloading metadata and single-cell profiles from three key datasets:\n",
+ "This notebook downloads metadata and single-cell profiles from three key datasets:\n",
"    {col: f\"Metadata_{col}\" for col in broad_compound_moa_metadata.columns}\n",
")\n",
"\n",
"# replace null values in the boroad compound moa to \"unknown\"\n",
Typo in notebook cell text: "boroad" → "broad".

- "# replace null values in the boroad compound moa to \"unknown\"\n",
+ "# replace null values in the broad compound moa to \"unknown\"\n",
Given the number of changes made throughout the analysis notebooks, this PR updates the downloads module to include functions for downloading the CPJUMP experimental and MOA data, along with several improvements to the module documentation.
We have also removed functions that are no longer used in the notebooks and updated the documentation accordingly.
These changes are part of the preparation for separating the notebooks into a dedicated analysis repository while transitioning this repository into a focused software package.