GSoC 2025
Aleksandra Galitsyna, Anton Goloborodko, Geoff Fudenberg, Ilya Flyamer, Nezar Abdennur, Thomas Reimonn
Below are the potential projects available to GSoC 2025 contributors. Interested applicants are welcome to contact us on our GSoC2025 Discord channel or by email to brainstorm ideas before submitting an application.
We develop software for the analysis of the spatial organization of genomes, mostly leveraging the family of molecular technologies known as chromosome conformation capture (3C), primarily its high-throughput derivative called Hi-C and its many closely related techniques, which we’ll collectively refer to below as 3C+. We also develop tools for genomic and multi-omic data analysis more broadly within the Python data science ecosystem. We like our tools to be easy to use, flexible enough to facilitate active development of novel analytical approaches, and scalable enough to make use of the latest and largest datasets. We welcome Google Summer of Code contributors with proposals focusing on one of the topics below.
Bioframe provides a framework for genomic data analysis using Pandas DataFrames, including genomic interval arithmetic.
GSoC applicants can contribute to Bioframe by:
- extending operations to 2D genomic intervals (https://github.com/open2c/bioframe/issues/25);
- implementing operations on binned genomes and their intervals (https://github.com/open2c/bioframe/issues/116);
- enabling out-of-core genomic interval arithmetic (difficulty: hard);
- improving access to genome assembly metadata in Python (difficulty: easy).
- Skills and Requirements: Python, data science with Python (numpy/pandas), background in math, 350 hours, Medium.
- Mentors: Geoff Fudenberg, Anton Goloborodko, Nezar Abdennur
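To make the idea of 2D genomic interval operations more concrete, here is a minimal, dependency-free sketch. It is not Bioframe's API: the names `Interval2D`, `overlaps_1d`, `overlaps_2d` and `overlap_2d` are hypothetical, and the half-open coordinate convention (end excluded) is an assumption made for this illustration. Two 2D intervals are treated as overlapping only when both of their anchors overlap.

```python
from typing import List, NamedTuple, Tuple


class Interval2D(NamedTuple):
    """A pair of genomic intervals (two anchors). Ends are exclusive (assumption)."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int


def overlaps_1d(chrom_a: str, start_a: int, end_a: int,
                chrom_b: str, start_b: int, end_b: int) -> bool:
    """Half-open 1D intervals overlap if they share a chromosome and at least one base."""
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def overlaps_2d(a: Interval2D, b: Interval2D) -> bool:
    """2D intervals overlap only when the first anchors AND the second anchors overlap."""
    return (
        overlaps_1d(a.chrom1, a.start1, a.end1, b.chrom1, b.start1, b.end1)
        and overlaps_1d(a.chrom2, a.start2, a.end2, b.chrom2, b.start2, b.end2)
    )


def overlap_2d(queries: List[Interval2D],
               targets: List[Interval2D]) -> List[Tuple[int, int]]:
    """Naive all-against-all search; returns (query_index, target_index) pairs."""
    return [
        (i, j)
        for i, query in enumerate(queries)
        for j, target in enumerate(targets)
        if overlaps_2d(query, target)
    ]


loop = Interval2D("chr1", 100, 200, "chr1", 1000, 1100)
hit = Interval2D("chr1", 150, 250, "chr1", 1050, 1200)
miss = Interval2D("chr1", 150, 250, "chr1", 2000, 2100)
pairs = overlap_2d([loop], [hit, miss])
```

The quadratic search is only for clarity. A contribution would more likely build on sorted or indexed interval structures so that the operation scales to realistic datasets.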
Cooler is a standard storage format and Python package for Hi-C and 3C+ data, based on the HDF5 format. It is designed for the storage and manipulation of extremely large Hi-C datasets at any resolution, but is not limited to Hi-C data in any way. These massive heatmaps can be explored using a multiscale genome browser such as HiGlass and analyzed with a growing array of downstream analysis software, including cooltools.
GSoC applicants can contribute to cooler by:
- Implementing the powerful and flexible Zarr storage system as an alternative and cloud-friendly backend for cooler.
- Providing solutions for an Xarray-based API for genomic heatmaps via cooler.
- Challenges: Harmonizing differences between Zarr and HDF5 APIs.
- Skills and Requirements: Python programming, numpy, minimal familiarity with at least one of HDF5, Zarr or Xarray, 350 hours commitment, Medium
- Mentors: Nezar Abdennur, Thomas Reimonn
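One way to think about harmonizing storage backends is to write analysis code against a small interface rather than against a specific library. The sketch below is purely illustrative: `ColumnStore`, `InMemoryStore` and `total_count` are hypothetical names, the `"count"` column name is an assumption, and no real HDF5 or Zarr calls are used. An actual backend would wrap the corresponding library behind the same two methods.

```python
from typing import Dict, Iterable, List, Protocol


class ColumnStore(Protocol):
    """Hypothetical minimal interface that any storage backend would implement."""

    def length(self, column: str) -> int: ...

    def read(self, column: str, start: int, stop: int) -> List[int]: ...


class InMemoryStore:
    """Stand-in backend for the sketch; keeps each column as a Python list."""

    def __init__(self, columns: Dict[str, Iterable[int]]) -> None:
        self._columns = {name: list(values) for name, values in columns.items()}

    def length(self, column: str) -> int:
        return len(self._columns[column])

    def read(self, column: str, start: int, stop: int) -> List[int]:
        return self._columns[column][start:stop]


def total_count(store: ColumnStore, chunk_size: int = 2) -> int:
    """Sum the 'count' column chunk by chunk, never reading it all at once."""
    total = 0
    for start in range(0, store.length("count"), chunk_size):
        total += sum(store.read("count", start, start + chunk_size))
    return total


store = InMemoryStore({"count": [1, 2, 3, 4, 5]})
result = total_count(store)
```

Because `total_count` depends only on the interface, swapping the backend would not require touching the analysis code, which is the property an alternative cooler backend would need to preserve.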
Cooltools provides a suite of computational tools to perform various downstream analytical workflows on genomic contact maps in cooler files. The individual datasets are typically much larger than what can fit in memory at once, demanding an out-of-core data processing approach. The unified CLI + Python API design facilitates creating workflows on high-performance computing clusters as well as in custom data analysis notebooks or simple scripts. To support interpreting and extracting biological insights from Hi-C and 3C-based datasets, Open2C also maintains a collection of detailed educational tutorials on key concepts in 3C+ data analysis using interactive notebooks based largely on cooltools; see open2c_examples.
GSoC applicants can contribute to cooltools by:
- Migrating log-smoothing code to a mini repository (https://github.com/open2c/cooltools/issues/505).
- Implementing and optimizing scalable, sparse eigendecomposition and other matrix factorization methods for Hi-C data.
- Challenges: parallelization and optimization of the process.
- Skills and Requirements: Python data science stack (numpy/pandas), math background can be useful (for linear algebra), 350 hours, Hard
- Mentors: Geoff Fudenberg, Ilya Flyamer
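As a rough illustration of what sparse eigendecomposition involves, the sketch below runs power iteration on a symmetric matrix stored as upper-triangle `(row, column, value)` triples, so the dense matrix is never built. This is not the cooltools implementation: the function names, the triple-based storage and the choice of power iteration are all assumptions made for the example. Power iteration finds only the eigenvalue of largest absolute value, whereas a real project would likely use more capable sparse solvers.

```python
import math
import random
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, float]  # (row, column, value) with row <= column


def sym_matvec(pixels: Sequence[Pixel], vec: Sequence[float]) -> List[float]:
    """Multiply a symmetric matrix, given by its upper triangle, by a vector."""
    out = [0.0] * len(vec)
    for i, j, value in pixels:
        out[i] += value * vec[j]
        if i != j:  # mirror off-diagonal entries into the lower triangle
            out[j] += value * vec[i]
    return out


def leading_eigenpair(pixels: Sequence[Pixel], n: int, n_iter: int = 500,
                      tol: float = 1e-12, seed: int = 0) -> Tuple[float, List[float]]:
    """Power iteration: returns (eigenvalue, unit eigenvector) of largest magnitude."""
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in range(n)]  # strictly positive start vector
    eigval = 0.0
    for _ in range(n_iter):
        product = sym_matvec(pixels, vec)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            raise ValueError("matrix maps the current vector to zero")
        vec = [x / norm for x in product]
        # Rayleigh quotient of the normalized vector estimates the eigenvalue.
        new_eigval = sum(v * w for v, w in zip(vec, sym_matvec(pixels, vec)))
        converged = abs(new_eigval - eigval) < tol
        eigval = new_eigval
        if converged:
            break
    return eigval, vec


# Upper triangle of [[2, 1, 0], [1, 2, 0], [0, 0, 1]]; its eigenvalues are 3, 1 and 1.
pixels = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 2.0), (2, 2, 1.0)]
eigval, vec = leading_eigenpair(pixels, n=3)
```

Each iteration touches every stored entry exactly twice in this sketch, which hints at where the parallelization and optimization challenges mentioned above would come from on genome-scale matrices.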
Pairtools is a simple and fast command-line framework for low-level stream-based processing of sequencing data from a 3C+ experiment. It fulfills the fundamental step of 3C+ data processing: detecting genomic contacts from experimental sequencing data. It also provides tools to sort, manipulate, filter, and classify these pairs, to design feature-rich pipelines for specialized experimental protocols or studies, and to perform quality assessment of the billions of contacts detected in a given experiment.
GSoC applicants can contribute to pairtools by:
- Turning the pairtools CLI into a domain-specific language (DSL) allowing on-demand pipeline construction.
- Developing a binary pairs format using Apache Parquet.
- Implementing cheaper I/O using technologies like Apache Arrow or Dask.
- Challenges: parallel implementation of parsing and other crucial steps of Hi-C data processing.
- Skills and Requirements: Python programming, numpy and pandas, design of CLI tools following Unix style guidelines, 350 hours, Medium
- Mentors: Anton Goloborodko
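The stream-based, composable style described above can be sketched with plain Python generators. This is not the pairtools API: `select` and `pipeline` are hypothetical helpers, and the tab-separated column order used here (readID, chrom1, pos1, chrom2, pos2, strand1, strand2) is a simplified layout assumed for illustration. Header lines starting with `#` are passed through unchanged.

```python
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[Iterable[str]], Iterator[str]]


def select(predicate: Callable[[List[str]], bool]) -> Stage:
    """Build a stage that keeps header lines and the pairs matching `predicate`."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        for line in lines:
            if line.startswith("#"):
                yield line  # header lines pass through untouched
                continue
            fields = line.rstrip("\n").split("\t")
            if predicate(fields):
                yield line
    return stage


def pipeline(*stages: Stage) -> Stage:
    """Chain stages lazily; each line flows through all stages one at a time."""
    def run(lines: Iterable[str]) -> Iterator[str]:
        stream: Iterable[str] = lines
        for stage in stages:
            stream = stage(stream)
        return iter(stream)
    return run


# Assumed columns: readID, chrom1, pos1, chrom2, pos2, strand1, strand2
keep_cis = select(lambda f: f[1] == f[3])
keep_distant = select(lambda f: abs(int(f[4]) - int(f[2])) >= 1000)

lines = [
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2\n",
    "r1\tchr1\t100\tchr1\t5000\t+\t-\n",
    "r2\tchr1\t100\tchr2\t300\t+\t+\n",
    "r3\tchr1\t100\tchr1\t150\t+\t-\n",
]
filtered = list(pipeline(keep_cis, keep_distant)(lines))
```

Because every stage is a generator, the whole pipeline holds only one line in memory at a time, which is the property that makes this style suitable for files with billions of contacts.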
Liftover_2d is a new tool for fast and effective conversion of chromatin interactions between genome assemblies and bin schemes (including coolers and pairs). It builds on the interaction storage and efficient I/O of the Open2C framework to enable new functionality for evolutionary genomics research.
GSoC applicants can contribute to liftover_2d by:
- Developing an efficient and precise model for the conversion of contact pairs between assemblies
- Implementing an API and CLI for file conversion
- Benchmarking performance time and quality against alternatives.
- Challenges: an implementation that works well with bioframe, cooler and polars; parallelization of the critical steps; development of a clean and easy-to-use API/CLI
- Skills and Requirements: Python programming, numpy, pandas, polars, design of CLI tools following Unix and Scientific Python style guidelines, 350 hours, Medium
- Mentors: Aleksandra Galitsyna
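A very simplified picture of converting contact pairs between assemblies is sketched below. It is not the liftover_2d design: `BlockMap` and `lift_pair` are hypothetical names, and the model assumes non-overlapping, ungapped, same-strand alignment blocks given as `(src_chrom, src_start, src_end, dst_chrom, dst_start)` with exclusive ends. A pair is kept only if both of its sides can be mapped.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Optional, Tuple

Block = Tuple[str, int, int, str, int]  # src_chrom, src_start, src_end, dst_chrom, dst_start


class BlockMap:
    """Position lookup over non-overlapping, same-strand blocks (assumption)."""

    def __init__(self, blocks: Iterable[Block]) -> None:
        self._blocks: Dict[str, List[Tuple[int, int, str, int]]] = {}
        for src_chrom, src_start, src_end, dst_chrom, dst_start in sorted(blocks):
            self._blocks.setdefault(src_chrom, []).append(
                (src_start, src_end, dst_chrom, dst_start)
            )
        # Sorted block starts per chromosome enable binary search.
        self._starts = {
            chrom: [block[0] for block in chrom_blocks]
            for chrom, chrom_blocks in self._blocks.items()
        }

    def lift(self, chrom: str, pos: int) -> Optional[Tuple[str, int]]:
        """Map one position; returns None if it falls outside every block."""
        starts = self._starts.get(chrom)
        if not starts:
            return None
        idx = bisect_right(starts, pos) - 1  # last block starting at or before pos
        if idx < 0:
            return None
        src_start, src_end, dst_chrom, dst_start = self._blocks[chrom][idx]
        if pos >= src_end:
            return None
        return dst_chrom, dst_start + (pos - src_start)


def lift_pair(block_map: BlockMap,
              pair: Tuple[str, int, str, int]) -> Optional[Tuple[str, int, str, int]]:
    """Convert (chrom1, pos1, chrom2, pos2); drop the pair if either side is unmapped."""
    chrom1, pos1, chrom2, pos2 = pair
    side1 = block_map.lift(chrom1, pos1)
    side2 = block_map.lift(chrom2, pos2)
    if side1 is None or side2 is None:
        return None
    return (side1[0], side1[1], side2[0], side2[1])


block_map = BlockMap([
    ("chr1", 0, 1000, "chrA", 500),
    ("chr1", 2000, 3000, "chrA", 1500),
])
lifted = lift_pair(block_map, ("chr1", 10, "chr1", 2100))
```

Strand flips, gapped alignments and ambiguous mappings are deliberately left out. Handling them precisely, and deciding what to do with partially mappable pairs, is where the modeling work listed above would come in.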