perf(delfi): ~80x faster via blacklist preloading and binary search#172
perf(delfi): ~80x faster via blacklist preloading and binary search#172DucoG wants to merge 1 commit into
Conversation
The previous implementation reopened and re-parsed the blacklist BED file, opened the BAM via pysam, and opened the 2bit reference for every single 100kb window (~26K windows per whole-genome run). It also did a linear blacklist scan on every fragment. This patch: * Parses the blacklist once and indexes regions by contig as sorted numpy arrays. * Uses binary search (np.searchsorted) to filter blacklist regions per window, replacing the per-fragment linear scan. * Opens one pysam.AlignmentFile and py2bit reference handle per worker process via a multiprocessing.Pool initializer; workers reuse those handles for every window they receive. * Pre-loads per-contig ContigGaps into a worker-global dict so they are not pickled into every task argument. * Replaces the per-base Python sum used to count GC bases with the C-level str.count. Output is bit-identical to the previous implementation (verified by ablation on chr1+chr2 with hg38; see PR description for benchmarks). A new regression test (test_workers_equivalence) asserts that delfi(workers=1) and delfi(workers=N) produce bit-identical results, so future changes to the parallel path are caught immediately.
jamesli124
left a comment
There was a problem hiding this comment.
Thank you for this PR. The performance improvement is very welcome and clever. I've left a couple of notes for you in my review, the main one being that support for fragment input in tabix-indexed bed.gz was inadvertently dropped. Otherwise, we appreciate your contribution!
| pandas DataFrame | ||
| Results of delfi analysis, with column names corresponding to | ||
| those generated by the original author's scripts. | ||
| See original docstring for full parameter documentation. This |
There was a problem hiding this comment.
The new docstring says "See original docstring for full parameter
documentation", but the original docstring was deleted in this same commit. Users calling help(delfi) or reading rendered
docs will get no parameter descriptions at all. Please either restore the
parameter docs or link to the documentation site.
| def _delfi_pool_initializer(input_file, reference_file, blacklist_by_contig, | ||
| contig_gaps_by_contig): | ||
| global _WORKER_BAM, _WORKER_REF, _WORKER_BLACKLIST, _WORKER_CONTIG_GAPS | ||
| _WORKER_BAM = pysam.AlignmentFile(str(input_file), 'r') |
There was a problem hiding this comment.
This unconditionally opens the input as a BAM/SAM file, but frag_generator
(which the rest of the codebase uses) also supports tabix-indexed .frag.gz/bed.gz
files by opening them as pysam.TabixFile. When given a .frag.gz/bed.gz input,
AlignmentFile raises "file does not contain alignment data" in every worker
process, causing the pool to hang silently with an empty output.
The fix is to mirror the detection already in frag_generator:
s = str(input_file)
if s.endswith('.sam') or s.endswith('.bam') or s.endswith('.cram'):
_WORKER_BAM = pysam.AlignmentFile(s, 'r')
else:
_WORKER_BAM = pysam.TabixFile(s, 'r')
The existing tests pass because they exclusively use .bam input — this code
path is never exercised for .frag.gz.
|
|
||
| results = delfi(frag_file, autosomes, bins_file, twobit, blacklist, gaps) | ||
|
No newline at end of file |
||
|
|
There was a problem hiding this comment.
Small note: none of the tests test opening fragment information in .frag.gz/bed.gz files.
- waiting for PR #172 to clear before addressing DELFI
Summary
delfiis roughly 80x faster on whole-genome inputs while producing bit-identical output. On a chr1+chr2 hg38 benchmark (4912 100kb bins, 16 worker processes), wall-clock time drops from 643s → 8s and total CPU time from 9845s → 59s.Motivation
While running DELFI on 211 WGS samples I profiled the worker function and noticed almost all of the per-window time was spent re-reading the blacklist BED file from disk and doing per-fragment linear scans against it. Per sample the existing implementation does ~26K full re-parses of the blacklist file (one per 100kb window) and reopens the BAM and 2bit reference inside every worker call. With many samples processed in parallel on shared/NFS storage this also produces a heavy and mostly-redundant I/O load.
Changes
src/finaletoolkit/frag/_delfi.py:np.searchsortedinstead of a linear scan over every fragment.multiprocessing.Poolinitializer (one open per worker, reused for every window). The handle is reused by passing the openpysam.AlignmentFiletofrag_generator, which already supports it.ContigGapsinto worker globals instead of pickling one into every task argument.str.count('G') + str.count('C')in place of a Pythonsumover'G' or 'C'comparisons.tests/test_delfi.py:test_workers_equivalencewhich assertsdelfi(workers=1)anddelfi(workers=4)produce bit-identical output on every column. This guards against any future regression that introduces parallel-only state in the worker pool.CHANGELOG.md: documents the speedup and the new test under[Unreleased].The public API (
delfi(...)signature, output schema, output values) is unchanged.Benchmarks
Hardware: 56-core x86_64, NFS-backed BAM, hg38, 16 workers, identical inputs across runs. Wall and CPU times are means over 5 runs each (σ < 1% in every cell).
main)Ablation (which optimisation contributes what)
Cumulative speedup as each optimisation is added on top of the previous:
The two dominant wins are preloading the blacklist (eliminates ~26K file reopens) and switching to a sorted-array blacklist filter (eliminates O(blacklist) work per fragment). The remaining three optimisations together are worth ~1.5s wall / ~25s CPU on this input.
Correctness
tests/test_delfi.py::test_overallpasses unchanged.tests/test_delfi.py::test_workers_equivalenceverifiesworkers=1andworkers=4produce identical output on every column.contig,start,stop,arm,short,long,gc,num_frags) matches exactly.Notes
pysam.fetchper contig instead of per window) hoping to reduce I/O. It was actually slower in every regime (cold and warm cache) because the existing per-window fetches use the BAI index and skip gap regions, while a per-contig fetch reads the full contig. Not pursued._end_motifs,_frag_length,_coverage,_wps, and_multi_wps. Happy to send follow-up PRs that extract a small shared_poolhelper and apply the same pattern there if there's appetite for it.