feat: use multiple reference sequences for minimizer index generation#404

Open
ivan-aksamentov wants to merge 5 commits into master from feat/multiref

@ivan-aksamentov
Member

Background:

Some pathogen datasets have significant genetic diversity that a single reference sequence cannot fully represent. This limits the accuracy of dataset auto-detection when query sequences are distant from the chosen reference. By allowing multiple reference sequences per dataset, the minimizer index can capture broader sequence diversity and improve detection rates.

Implementation:

  • Add optional files.minimizerReferences field in pathogen.json (array of FASTA file paths)
  • New get_minimizer_refs() function reads sequences from all listed files and falls back to the main reference if the field is absent
  • make_ref_search_index() combines minimizers from all references using set union; uses average length for normalization
  • Backward compatible: existing datasets work unchanged
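The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `resolve_minimizer_ref_paths` and the `dataset_dir` parameter are hypothetical, and the sketch only resolves file paths rather than reading sequences.

```python
from pathlib import Path


def resolve_minimizer_ref_paths(pathogen_json: dict, dataset_dir: Path) -> list[Path]:
    """Hypothetical helper: pick the FASTA files that feed the minimizer index."""
    files = pathogen_json.get("files", {})
    extra = files.get("minimizerReferences")
    if extra:
        # Optional field present: use every listed FASTA file.
        return [dataset_dir / path for path in extra]
    # Field absent (existing datasets): fall back to the main reference.
    return [dataset_dir / files["reference"]]
```

Because the field is optional and the fallback returns the single main reference, a dataset without `minimizerReferences` would take the same path as before.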

Usage:

In pathogen.json, add an array of FASTA paths containing representative sequences for the dataset:

{
  "files": {
    "reference": "reference.fasta",
    "minimizerReferences": [
      "clade_a.fasta",
      "clade_b.fasta"
    ]
  }
}

Each FASTA file can contain one or more sequences. All sequences across all files contribute minimizers to the dataset's index entry.
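To illustrate how several multi-record FASTA files could feed one index entry, here is a small sketch of the set-union and average-length idea. Everything in it is an assumption for illustration: `parse_fasta`, `toy_minimizers`, and `combine_refs` are hypothetical names, and the CRC32-with-cutoff hashing is a stand-in, not the minimizer hash Nextclade actually uses.

```python
import zlib
from statistics import mean


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (name, sequence) pairs; supports multiple records."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def toy_minimizers(seq: str, k: int = 5, cutoff: int = 1 << 30) -> set[int]:
    """Stand-in sketch: keep k-mer hashes below a cutoff (NOT Nextclade's hash)."""
    hashes = (zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < cutoff}


def combine_refs(fasta_texts: list[str]) -> tuple[set[int], float]:
    """Union the minimizers of every sequence in every file; average the lengths."""
    seqs = [seq for text in fasta_texts for _, seq in parse_fasta(text)]
    combined = set().union(*(toy_minimizers(s) for s in seqs))
    # Raises statistics.StatisticsError if no sequences were found.
    return combined, mean(len(s) for s in seqs)
```

A set union means a minimizer shared by several references is stored once, so adding closely related sequences grows the entry only by their novel minimizers.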

Checklist

  • Check if changes affect downstream workflows which depend on this dataset. For instance, Nextstrain ingest workflows may break if clade nomenclature changes. Consider fixing those workflows or making an issue at least. Not applicable

Co-Authored-By: Claude <noreply@anthropic.com>
@nneune
Collaborator

nneune commented Apr 1, 2026

I tested the multi-ref minimizer functionality in this branch with CVA16: https://github.com/nextstrain/nextclade_data/tree/cva16-testing-sample

Current Nextclade versions reject array values in the `files` section of `pathogen.json` with `invalid type: sequence, expected a string`. The `DatasetFiles` struct catches unknown keys with `rest_files: BTreeMap<String, String>` (https://github.com/nextstrain/nextclade/blob/f7db57f31/packages/nextclade/src/io/dataset.rs#L567-L569), which only accepts string values. An array like `minimizerReferences` fails deserialization before reaching the `other: serde_json::Value` catch-all.

Move the lookup to `.buildFiles`, a top-level key absorbed by `VirusProperties`'s `other: serde_json::Value` (https://github.com/nextstrain/nextclade/blob/1400012f7/packages/nextclade/src/analyze/virus_properties.rs#L137-L138), which accepts any JSON type. This makes datasets with multiple minimizer reference files loadable by both old and new Nextclade versions.
ivan-aksamentov added a commit that referenced this pull request Apr 1, 2026
- Depends on: #404

Current Nextclade versions reject array values in the `files` section of `pathogen.json` with `invalid type: sequence, expected a string`. Datasets declaring `minimizerReferences` as an array inside `files` fail to load.

Move `minimizerReferences` from `.files` to a new top-level `.buildFiles` key:

```json
{
  "files": {
    "reference": "reference.fasta",
    "genomeAnnotation": "genome_annotation.gff3",
    "changelog": "CHANGELOG.md"
  },
  "buildFiles": {
    "minimizerReferences": [
      "minimizer_refs/additional_refs_B.fasta",
      "minimizer_refs/additional_refs_B1a.fasta"
    ]
  }
}
```

Unknown top-level keys are silently ignored by Nextclade, so the dataset remains loadable by both old and new versions. The companion script change reading from `buildFiles` is in #404 (9fc7f3e).
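As a rough sketch of the script-side consequence, the lookup could read from `buildFiles` and a check could flag `files` entries that would break older Nextclade versions. Both function names are hypothetical and this is not the code from the referenced commit; the string-only rule for `files` is taken from the deserialization error quoted above.

```python
def non_string_file_entries(pathogen_json: dict) -> list[str]:
    """Hypothetical check: keys under `files` whose values are not plain strings.

    Per the error quoted above, such entries make current Nextclade
    versions fail to load the dataset.
    """
    files = pathogen_json.get("files", {})
    return [key for key, value in files.items() if not isinstance(value, str)]


def minimizer_refs_from_build_files(pathogen_json: dict) -> list[str]:
    """Hypothetical lookup: prefer `buildFiles.minimizerReferences`, else the main reference."""
    refs = pathogen_json.get("buildFiles", {}).get("minimizerReferences")
    return list(refs) if refs else [pathogen_json["files"]["reference"]]
```

Run against the layout shown above, the check would report nothing, while the earlier layout with the array inside `files` would be flagged under the `minimizerReferences` key.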

### Work items

- [x] Move `minimizerReferences` from `files` to `buildFiles` in `pathogen.json`
# Conflicts:
#	data_output/minimizer_index.json
#	scripts/rebuild
@ivan-aksamentov ivan-aksamentov had a problem deploying to refs/heads/feat/multiref April 1, 2026 13:43 — with GitHub Actions Error
@ivan-aksamentov ivan-aksamentov temporarily deployed to refs/pull/404/merge April 1, 2026 13:43 — with GitHub Actions Inactive
nextstrain-bot and others added 2 commits April 1, 2026 13:44
…port

Rewrite scripts/minimizer as a subcommand CLI (build, search) with support
for multi-ref indexes, pre-built index loading, and sorted FASTA output.

Move search algorithm (search_one_query, filter_matches,
deserialize_ref_search_index) from the script into lib/minimizer.py to
share with other consumers (rebuild, suggest_datasets).
@ivan-aksamentov ivan-aksamentov had a problem deploying to refs/heads/feat/multiref April 1, 2026 14:12 — with GitHub Actions Error
@ivan-aksamentov ivan-aksamentov deployed to refs/pull/404/merge April 1, 2026 14:12 — with GitHub Actions Active
3 participants