Fix paralog cross-hits silently dropping syntenic blocks (--best_hmm_wins) #103
Merged
When the HMM panel contains paralogous models, the same peptide can score above threshold under more than one HMM. Pynteny kept every hit, so the duplicate rows interleaved (by contig/gene_pos sort) between true syntenic neighbours and broke the rolling-window matcher's position-diff check, silently dropping a genuine operon from synteny_matched.tsv.

- Add an opt-in `--best_hmm_wins` flag (default off, preserving current behaviour). When enabled, SyntenyHMMfilter.get_all_HMM_hits deduplicates by peptide (full_label), keeping the highest-scoring HMM, before the rolling-window scan. bitscore is carried only on this path and dropped right after dedup, so the default code path is unchanged.
- Thread the flag CLI (`-b/--best_hmm_wins`) -> synteny_search -> filter_FASTA_by_synteny_structure -> SyntenyHMMfilter, and through the api.Search class.
- Upgrade the duplicate-hits warning to report the count and the top offending HMM combinations, and to point at --best_hmm_wins.
- Docs: search.md gains a "note on paralogous HMMs" section and the new flag in the options table.
- Tests: new tests/test_filter.py reproduces the silent drop on cross-hits and verifies --best_hmm_wins recovers the operon.

Closes #96

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closes #96.
Problem

When the HMM panel passed to `pynteny search` contains paralogous models (e.g. `nifD`/`nifK`, `nifE`/`nifN`), the same peptide commonly scores above threshold under more than one HMM. Pynteny kept every hit, and the duplicate rows, interleaved by the `(contig, gene_pos)` sort between the rows of true syntenic neighbours, broke the position-diff check inside the rolling window. The result was a silent false negative: a genuine syntenic operon was dropped from `synteny_matched.tsv` with only an informational warning.

I confirmed the mechanism against the code: with cross-hits, the canonical triple (e.g. `nifH@33, nifD@34, nifK@35`) is never visited as a contiguous window because extra cross-hit rows sit between its members, so the `(distance == 1,1)` check never passes.

Fix

Add an opt-in `--best_hmm_wins` flag (default off, so existing behaviour and curated `|`-group panels are unchanged):

- When enabled, `SyntenyHMMfilter.get_all_HMM_hits()` deduplicates by peptide (`full_label`), keeping the highest-scoring HMM, before the rolling-window scan. The HMMER `bitscore` is carried only on this path and dropped immediately after dedup, so the default code path keeps exactly the same columns and behaviour (this avoided a regression in `_merge_hits_by_HMM_group`'s `drop_duplicates`).
- Thread the flag from the CLI (`-b/--best_hmm_wins`) → `synteny_search` → `filter_FASTA_by_synteny_structure` → `SyntenyHMMfilter`, and through the `api.Search` class.
- Upgrade the duplicate-hits warning to report the count and the top offending HMM combinations, and to point at `--best_hmm_wins`.
- Docs: `search.md` gains a "note on paralogous HMMs" section and the new flag in the options table; `pynteny search --help` documents it.

Tests

New `tests/test_filter.py` builds a minimal synthetic cross-hit scenario:

- `test_default_drops_operon_on_crosshit`: reproduces the silent drop (default behaviour).
- `test_best_hmm_wins_recovers_operon`: `--best_hmm_wins` recovers the full `nifH→nifD→nifK` operon.

Full suite: 26 passed (24 existing + 2 new); the existing `|`-group integration test still passes, confirming the default path is unchanged.

Scope notes

Following the issue's guidance, I did not change the rolling-window matcher itself (it is correct on deduplicated input), the degenerate-case `sys.exit(1)`, or the default behaviour. The 2.0 default-flip is left for a future major release.

🤖 Generated with Claude Code
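For reviewers who want to see the mechanism in isolation, here is a minimal, self-contained sketch of the failure and of the best-HMM-wins deduplication. It is not Pynteny's code: the helper names `find_windows` and `best_hmm_wins`, the dict-based rows, the `hmm` key, and the bitscore values are all assumptions made for illustration; only `full_label`, `bitscore`, `contig`, and `gene_pos` echo names used in this PR.

```python
# Illustrative sketch only; NOT Pynteny's implementation.
# Rows are plain dicts; the "hmm" key and all scores are hypothetical.

HMM_ORDER = ("nifH", "nifD", "nifK")

# pep34 and pep35 each cross-hit the paralogous model with a lower bitscore.
HITS = [
    {"full_label": "pep33", "contig": "c1", "gene_pos": 33, "hmm": "nifH", "bitscore": 410.0},
    {"full_label": "pep34", "contig": "c1", "gene_pos": 34, "hmm": "nifD", "bitscore": 520.0},
    {"full_label": "pep34", "contig": "c1", "gene_pos": 34, "hmm": "nifK", "bitscore": 95.0},
    {"full_label": "pep35", "contig": "c1", "gene_pos": 35, "hmm": "nifD", "bitscore": 88.0},
    {"full_label": "pep35", "contig": "c1", "gene_pos": 35, "hmm": "nifK", "bitscore": 505.0},
]


def find_windows(hits, hmm_order):
    """Scan consecutive rows (sorted by contig, gene_pos) for hmm_order
    on a single contig with position differences of exactly 1."""
    rows = sorted(hits, key=lambda h: (h["contig"], h["gene_pos"]))
    n = len(hmm_order)
    matches = []
    for i in range(len(rows) - n + 1):
        window = rows[i:i + n]
        same_contig = len({h["contig"] for h in window}) == 1
        right_hmms = [h["hmm"] for h in window] == list(hmm_order)
        unit_steps = all(
            b["gene_pos"] - a["gene_pos"] == 1
            for a, b in zip(window, window[1:])
        )
        if same_contig and right_hmms and unit_steps:
            matches.append([h["full_label"] for h in window])
    return matches


def best_hmm_wins(hits):
    """Keep one row per peptide (full_label): the highest-bitscore HMM."""
    best = {}
    for hit in hits:
        current = best.get(hit["full_label"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["full_label"]] = hit
    return sorted(best.values(), key=lambda h: (h["contig"], h["gene_pos"]))


if __name__ == "__main__":
    print("default:      ", find_windows(HITS, HMM_ORDER))
    print("best HMM wins:", find_windows(best_hmm_wins(HITS), HMM_ORDER))
```

In this toy input the cross-hit rows sit between the true neighbours, so no three consecutive rows carry `nifH`, `nifD`, `nifK` with unit position steps; after deduplication each peptide contributes a single row and the window matches. Ties on bitscore keep the first row seen, which is a choice made for this sketch rather than something the PR specifies.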