Fix paralog cross-hits silently dropping syntenic blocks (--best_hmm_wins) #103

Merged
Robaina merged 1 commit into main from fix-issue96-paralog-crosshits
Jun 19, 2026

@Robaina Robaina commented Jun 19, 2026
Owner

Closes #96.

Problem

When the HMM panel passed to pynteny search contains paralogous models (e.g. nifD/nifK, nifE/nifN), the same peptide commonly scores above threshold under more than one HMM. Pynteny kept every hit, and the duplicate rows — interleaved by the (contig, gene_pos) sort between the rows of true syntenic neighbours — broke the position-diff check inside the rolling window. The result was a silent false negative: a genuine syntenic operon was dropped from synteny_matched.tsv with only an informational warning.

I confirmed the mechanism against the code: with cross-hits, the canonical triple (e.g. nifH@33, nifD@34, nifK@35) is never visited as a contiguous window because extra cross-hit rows sit between its members, so the (distance == 1,1) check never passes.
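
The failure can be reproduced with a small standard-library sketch. This is not Pynteny's actual matcher: `rolling_window_matches` and the tuple layout are hypothetical stand-ins, assuming a fixed-size window over rows sorted by (contig, gene_pos), an exact HMM-order match, and a required gap of 1 between neighbours.

```python
from typing import List, Sequence, Tuple

# Hypothetical row layout for illustration: (contig, gene_pos, hmm_name)
Hit = Tuple[str, int, str]


def rolling_window_matches(hits: Sequence[Hit], structure: Sequence[str]) -> List[int]:
    """Return the start gene_pos of every window that matches `structure`.

    Simplified stand-in for a rolling-window synteny scan: rows are sorted,
    then each run of len(structure) consecutive ROWS is tested.
    """
    rows = sorted(hits)  # sorts by (contig, gene_pos, hmm_name)
    n = len(structure)
    starts = []
    for i in range(len(rows) - n + 1):
        window = rows[i:i + n]
        same_contig = len({contig for contig, _, _ in window}) == 1
        hmm_order_ok = [hmm for _, _, hmm in window] == list(structure)
        # Position-diff check: every neighbour must be exactly one gene away.
        diffs = [b[1] - a[1] for a, b in zip(window, window[1:])]
        if same_contig and hmm_order_ok and all(d == 1 for d in diffs):
            starts.append(window[0][1])
    return starts


structure = ["nifH", "nifD", "nifK"]

# One hit per peptide: the operon at positions 33-35 is found.
clean = [("c1", 33, "nifH"), ("c1", 34, "nifD"), ("c1", 35, "nifK")]

# Paralog cross-hits: peptide 34 also hits nifK, peptide 35 also hits nifD.
crosshit = clean + [("c1", 34, "nifK"), ("c1", 35, "nifD")]

found_clean = rolling_window_matches(clean, structure)        # [33]
found_crosshit = rolling_window_matches(crosshit, structure)  # []
```

In the cross-hit case the first window is nifH@33, nifD@34, nifK@34: the HMM order matches, but the position diffs are (1, 0) rather than (1, 1), and no later window starts with nifH.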

Fix

Add an opt-in --best_hmm_wins flag (default off, so existing behaviour and curated |-group panels are unchanged):

  • SyntenyHMMfilter.get_all_HMM_hits() deduplicates by peptide (full_label), keeping the highest-scoring HMM, before the rolling-window scan. The HMMER bitscore is carried only on this path and dropped immediately after dedup, so the default code path keeps exactly the same columns/behaviour (this avoided a regression in _merge_hits_by_HMM_group's drop_duplicates).
  • Flag threaded: CLI (-b/--best_hmm_wins) → synteny_search → filter_FASTA_by_synteny_structure → SyntenyHMMfilter, and through the api.Search class.
  • The duplicate-hits warning is upgraded to report the count and the top offending HMM combinations, and to point users at --best_hmm_wins.
  • Docs: search.md gains a "note on paralogous HMMs" section and the new flag in the options table; pynteny search --help documents it.
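
The deduplication and the upgraded warning described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code in SyntenyHMMfilter: the function names `best_hmm_wins` and `crosshit_summary`, the dict-based rows, and the tie rule (first hit seen wins) are all hypothetical; only the `full_label` key and the bitscore-based choice come from the description.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical row layout: one dict per HMM hit.
hits = [
    {"full_label": "pep_33", "hmm": "nifH", "bitscore": 300.0},
    {"full_label": "pep_34", "hmm": "nifD", "bitscore": 450.0},
    {"full_label": "pep_34", "hmm": "nifK", "bitscore": 120.0},  # cross-hit
    {"full_label": "pep_35", "hmm": "nifK", "bitscore": 430.0},
    {"full_label": "pep_35", "hmm": "nifD", "bitscore": 110.0},  # cross-hit
]


def best_hmm_wins(rows: List[Dict]) -> List[Dict]:
    """Keep one row per peptide: the hit with the highest bitscore."""
    best: Dict[str, Dict] = {}
    for row in rows:
        current = best.get(row["full_label"])
        # Assumption: on an exact tie, the first hit seen is kept.
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["full_label"]] = row
    # Drop the bitscore again so downstream columns match the default path.
    return [{k: v for k, v in row.items() if k != "bitscore"} for row in best.values()]


def crosshit_summary(rows: List[Dict], top: int = 3) -> Tuple[int, List[Tuple[str, int]]]:
    """Count peptides hit by more than one HMM and rank the HMM combinations."""
    hmms_by_peptide: Dict[str, set] = {}
    for row in rows:
        hmms_by_peptide.setdefault(row["full_label"], set()).add(row["hmm"])
    combos = Counter(
        "|".join(sorted(hmms)) for hmms in hmms_by_peptide.values() if len(hmms) > 1
    )
    return sum(combos.values()), combos.most_common(top)


deduplicated = best_hmm_wins(hits)
n_duplicated, top_combos = crosshit_summary(hits)
```

With the sample rows, each peptide keeps a single HMM (pep_33 nifH, pep_34 nifD, pep_35 nifK), and the summary reports two cross-hit peptides, both under the nifD/nifK combination.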

Tests

New tests/test_filter.py builds a minimal synthetic cross-hit scenario:

  • test_default_drops_operon_on_crosshit: reproduces the silent drop (default behaviour).
  • test_best_hmm_wins_recovers_operon: --best_hmm_wins recovers the full nifH→nifD→nifK operon.

Full suite: 26 passed (24 existing + 2 new); the existing |-group integration test still passes, confirming the default path is unchanged.

Scope notes

Following the issue's guidance, I did not change the rolling-window matcher itself (correct on deduplicated input), the degenerate-case sys.exit(1), or the default behaviour. The 2.0 default-flip is left for a future major release.

🤖 Generated with Claude Code

#96)

When the HMM panel contains paralogous models, the same peptide can score
above threshold under more than one HMM. Pynteny kept every hit, so the
duplicate rows interleaved (by contig/gene_pos sort) between true syntenic
neighbours and broke the rolling-window matcher's position-diff check, silently
dropping a genuine operon from synteny_matched.tsv.

- Add an opt-in `--best_hmm_wins` flag (default off, preserving current
  behaviour). When enabled, SyntenyHMMfilter.get_all_HMM_hits deduplicates by
  peptide (full_label), keeping the highest-scoring HMM, before the
  rolling-window scan. bitscore is carried only on this path and dropped right
  after dedup, so the default code path is unchanged.
- Thread the flag CLI (`-b/--best_hmm_wins`) -> synteny_search ->
  filter_FASTA_by_synteny_structure -> SyntenyHMMfilter, and through the
  api.Search class.
- Upgrade the duplicate-hits warning to report the count and the top offending
  HMM combinations, and to point at --best_hmm_wins.
- Docs: search.md gains a "note on paralogous HMMs" section and the new flag in
  the options table.
- Tests: new tests/test_filter.py reproduces the silent drop on cross-hits and
  verifies --best_hmm_wins recovers the operon.

Closes #96

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Robaina Robaina self-assigned this Jun 19, 2026
@Robaina Robaina merged commit 2f88af4 into main Jun 19, 2026
6 checks passed
@Robaina Robaina mentioned this pull request Jun 19, 2026
@Robaina Robaina deleted the fix-issue96-paralog-crosshits branch June 19, 2026 13:15

Development

Successfully merging this pull request may close these issues.

Synteny matcher silently drops true positives when paralogous HMMs cross-hit the same peptide