perf(delfi): ~80x faster via blacklist preloading and binary search by DucoG · Pull Request #172 · epifluidlab/FinaleToolkit

DucoG · 2026-05-01T21:57:17Z

Summary

delfi is roughly 80x faster on whole-genome inputs while producing bit-identical output. On a chr1+chr2 hg38 benchmark (4912 100kb bins, 16 worker processes), wall-clock time drops from 643s → 8s and total CPU time from 9845s → 59s.

Motivation

While running DELFI on 211 WGS samples I profiled the worker function and noticed almost all of the per-window time was spent re-reading the blacklist BED file from disk and doing per-fragment linear scans against it. Per sample the existing implementation does ~26K full re-parses of the blacklist file (one per 100kb window) and reopens the BAM and 2bit reference inside every worker call. With many samples processed in parallel on shared/NFS storage this also produces a heavy and mostly-redundant I/O load.

Changes

src/finaletoolkit/frag/_delfi.py:

Preload the blacklist once and index regions by contig as sorted numpy arrays. Eliminates ~26K redundant disk reads per sample.
Binary search the blacklist per window via np.searchsorted instead of a linear scan over every fragment.
Share BAM and 2bit handles per worker via the multiprocessing.Pool initializer (one open per worker, reused for every window). The handle is reused by passing the open pysam.AlignmentFile to frag_generator, which already supports it.
Pre-load ContigGaps into worker globals instead of pickling one into every task argument.
Vectorise GC counting using str.count('G') + str.count('C') in place of a Python sum over 'G' or 'C' comparisons.

tests/test_delfi.py:

Adds test_workers_equivalence which asserts delfi(workers=1) and delfi(workers=4) produce bit-identical output on every column. This guards against any future regression that introduces parallel-only state in the worker pool.

CHANGELOG.md: documents the speedup and the new test under [Unreleased].

The public API (delfi(...) signature, output schema, output values) is unchanged.

Benchmarks

Hardware: 56-core x86_64, NFS-backed BAM, hg38, 16 workers, identical inputs across runs. Wall and CPU times are means over 5 runs each (σ < 1% in every cell).

Implementation	Wall (s)	CPU (s)	Speedup (wall)
Upstream (current `main`)	643	9845	1×
This PR	8.0	59.3	80×

Ablation (which optimisation contributes what)

Cumulative speedup as each optimisation is added on top of the previous:

Config	Wall (s)	CPU (s)	Speedup
baseline	643	9845	1×
+ preload blacklist	96	1378	6.7×
+ sorted-array blacklist	9.6	85	67×
+ reuse BAM/ref handle	8.5	65	76×
+ preload contig gaps	8.0	60	80×
+ vectorised GC count	8.0	59	80×

The two dominant wins are preloading the blacklist (eliminates ~26K file reopens) and switching to a sorted-array blacklist filter (eliminates O(blacklist) work per fragment). The remaining three optimisations together are worth ~1.5s wall / ~25s CPU on this input.

Correctness

Existing tests/test_delfi.py::test_overall passes unchanged.
New tests/test_delfi.py::test_workers_equivalence verifies workers=1 and workers=4 produce identical output on every column.
Independently verified on a real 211-sample WGS dataset by running the upstream implementation and this PR side-by-side and diffing the output TSVs: every value of every column (contig, start, stop, arm, short, long, gc, num_frags) matches exactly.

Notes

I considered an alternative restructuring (one task per contig with a single pysam.fetch per contig instead of per window) hoping to reduce I/O. It was actually slower in every regime (cold and warm cache) because the existing per-window fetches use the BAI index and skip gap regions, while a per-contig fetch reads the full contig. Not pursued.
The same patterns (per-task BAM/refseq reopens, etc.) exist in _end_motifs, _frag_length, _coverage, _wps, and _multi_wps. Happy to send follow-up PRs that extract a small shared _pool helper and apply the same pattern there if there's appetite for it.

The previous implementation reopened and re-parsed the blacklist BED file, opened the BAM via pysam, and opened the 2bit reference for every single 100kb window (~26K windows per whole-genome run). It also did a linear blacklist scan on every fragment. This patch: * Parses the blacklist once and indexes regions by contig as sorted numpy arrays. * Uses binary search (np.searchsorted) to filter blacklist regions per window, replacing the per-fragment linear scan. * Opens one pysam.AlignmentFile and py2bit reference handle per worker process via a multiprocessing.Pool initializer; workers reuse those handles for every window they receive. * Pre-loads per-contig ContigGaps into a worker-global dict so they are not pickled into every task argument. * Replaces the per-base Python sum used to count GC bases with the C-level str.count. Output is bit-identical to the previous implementation (verified by ablation on chr1+chr2 with hg38; see PR description for benchmarks). A new regression test (test_workers_equivalence) asserts that delfi(workers=1) and delfi(workers=N) produce bit-identical results, so future changes to the parallel path are caught immediately.

jamesli124

Thank you for this PR. The performance improvement is very welcome and clever. I've left a couple of notes for you in my review, the main one being that support for fragment input in tabix-indexed bed.gz was inadvertently dropped. Otherwise, we appreciate your contribution!

jamesli124 · 2026-05-18T19:48:50Z

-    pandas DataFrame
-        Results of delfi analysis, with column names corresponding to
-        those generated by the original author's scripts.
+    See original docstring for full parameter documentation. This


The new docstring says "See original docstring for full parameter
documentation", but the original docstring was deleted in this same commit. Users calling help(delfi) or reading rendered
docs will get no parameter descriptions at all. Please either restore the
parameter docs or link to the documentation site.

jamesli124 · 2026-05-18T19:58:44Z

+def _delfi_pool_initializer(input_file, reference_file, blacklist_by_contig,
+                            contig_gaps_by_contig):
+    global _WORKER_BAM, _WORKER_REF, _WORKER_BLACKLIST, _WORKER_CONTIG_GAPS
+    _WORKER_BAM = pysam.AlignmentFile(str(input_file), 'r')


This unconditionally opens the input as a BAM/SAM file, but frag_generator
(which the rest of the codebase uses) also supports tabix-indexed .frag.gz/bed.gz
files by opening them as pysam.TabixFile. When given a .frag.gz/bed.gz input,
AlignmentFile raises "file does not contain alignment data" in every worker
process, causing the pool to hang silently with an empty output.

The fix is to mirror the detection already in frag_generator:

s = str(input_file) if s.endswith('.sam') or s.endswith('.bam') or s.endswith('.cram'): _WORKER_BAM = pysam.AlignmentFile(s, 'r') else: _WORKER_BAM = pysam.TabixFile(s, 'r')

The existing tests pass because they exclusively use .bam input — this code
path is never exercised for .frag.gz.

jamesli124 · 2026-05-18T20:37:59Z

+
    results = delfi(frag_file, autosomes, bins_file, twobit, blacklist, gaps)
-
+


Small note: none of the tests test opening fragment information in .frag.gz/bed.gz files.

- waiting for PR #172 to clear before addressing DELFI

jamesli124 requested changes May 18, 2026

View reviewed changes

jamesli124 added a commit that referenced this pull request May 26, 2026

added support for reading CRAM except to delfi

65035f0

- waiting for PR #172 to clear before addressing DELFI

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(delfi): ~80x faster via blacklist preloading and binary search#172

perf(delfi): ~80x faster via blacklist preloading and binary search#172
DucoG wants to merge 1 commit into
epifluidlab:mainfrom
DucoG:perf/delfi-blacklist-and-gc

DucoG commented May 1, 2026

Uh oh!

jamesli124 left a comment

Uh oh!

jamesli124 May 18, 2026

Uh oh!

jamesli124 May 18, 2026

Uh oh!

jamesli124 May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		results = delfi(frag_file, autosomes, bins_file, twobit, blacklist, gaps)
		No newline at end of file

Conversation

DucoG commented May 1, 2026

Summary

Motivation

Changes

Benchmarks

Ablation (which optimisation contributes what)

Correctness

Notes

Uh oh!

jamesli124 left a comment

Choose a reason for hiding this comment

Uh oh!

jamesli124 May 18, 2026

Choose a reason for hiding this comment

Uh oh!

jamesli124 May 18, 2026

Choose a reason for hiding this comment

Uh oh!

jamesli124 May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants