feat: use multiple reference sequences for minimizer index generation #404
Open
ivan-aksamentov wants to merge 5 commits into master
Background:
Some pathogen datasets have significant genetic diversity that a single reference sequence cannot fully represent. This limits the accuracy of dataset auto-detection when query sequences are distant from the chosen reference. By allowing multiple reference sequences per dataset, the minimizer index can capture broader sequence diversity and improve detection rates.
Implementation:
- Add optional `files.minimizerReferences` field in pathogen.json (array of FASTA file paths)
- New `get_minimizer_refs()` function reads sequences from all listed files and falls back to the main reference if the field is absent
- `make_ref_search_index()` combines minimizers from all references using set union; uses average length for normalization
- Backward compatible: existing datasets work unchanged
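The set-union and average-length steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the sampler `get_ref_minimizers()`, the constants `K` and `CUTOFF`, the CRC32 hash, the name `make_ref_search_index_entry()`, and the shape of the returned dict are all assumptions made for the example.

```python
import zlib

K = 5             # hypothetical k-mer length, chosen only for this example
CUTOFF = 1 << 30  # hypothetical cutoff: keep k-mers whose hash falls below it


def get_ref_minimizers(seq: str, k: int = K, cutoff: int = CUTOFF) -> set[int]:
    """Stand-in sampler: hash every k-mer and keep the hashes below the cutoff."""
    seq = seq.upper()
    hashes = (zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < cutoff}


def make_ref_search_index_entry(ref_seqs: list[str]) -> dict:
    """Combine minimizers of all references into one entry for a dataset."""
    if not ref_seqs:
        raise ValueError("at least one reference sequence is required")
    minimizers: set[int] = set()
    for seq in ref_seqs:
        # Set union: a minimizer shared by several references is stored once.
        minimizers |= get_ref_minimizers(seq)
    # Average reference length, used for normalization as described above.
    avg_length = sum(len(s) for s in ref_seqs) / len(ref_seqs)
    return {"minimizers": minimizers, "length": avg_length, "nRefs": len(ref_seqs)}
```

With a single reference the entry reduces to that reference's own minimizers and length, which is consistent with the backward-compatibility claim above.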
Usage:
In `pathogen.json`, add an array of FASTA paths containing representative sequences for the dataset:
```json
{
  "files": {
    "reference": "reference.fasta",
    "minimizerReferences": [
      "clade_a.fasta",
      "clade_b.fasta"
    ]
  }
}
```
Each FASTA file can contain one or more sequences. All sequences across all files contribute minimizers to the dataset's index entry.
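As a rough illustration of how every record from every listed file ends up in one pool of sequences, here is a small self-contained sketch. The helpers `read_fasta()` and `read_all_sequences()` are hypothetical names for this example and are not taken from the repository.

```python
from typing import Iterable, Iterator


def read_fasta(path: str) -> Iterator[tuple[str, str]]:
    """Yield (name, sequence) for each record in a FASTA file."""
    name, chunks = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].strip(), []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_all_sequences(paths: Iterable[str]) -> list[tuple[str, str]]:
    """Collect records from all files, in file order then record order."""
    return [record for path in paths for record in read_fasta(path)]
```

Under this sketch, a file with three records and a file with one record contribute four sequences in total to the dataset's index entry.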
Co-Authored-By: Claude <noreply@anthropic.com>
Collaborator
I tested the multi-ref minimizer functionality in this branch with CVA16: https://github.com/nextstrain/nextclade_data/tree/cva16-testing-sample
Current Nextclade versions reject array values in the `files` section of `pathogen.json` with `invalid type: sequence, expected a string`. The `DatasetFiles` struct catches unknown keys with `rest_files: BTreeMap<String, String>` (https://github.com/nextstrain/nextclade/blob/f7db57f31/packages/nextclade/src/io/dataset.rs#L567-L569), which only accepts string values. An array like `minimizerReferences` fails deserialization before reaching the `other: serde_json::Value` catch-all.

Move the lookup to `.buildFiles`, a top-level key absorbed by `VirusProperties`'s `other: serde_json::Value` (https://github.com/nextstrain/nextclade/blob/1400012f7/packages/nextclade/src/analyze/virus_properties.rs#L137-L138), which accepts any JSON type. This makes datasets with multiple minimizer reference files loadable by both old and new Nextclade versions.
ivan-aksamentov added a commit that referenced this pull request on Apr 1, 2026
- Depends on: #404

Current Nextclade versions reject array values in the `files` section of `pathogen.json` with `invalid type: sequence, expected a string`. Datasets declaring `minimizerReferences` as an array inside `files` fail to load.

Move `minimizerReferences` from `.files` to a new top-level `.buildFiles` key:

```json
{
  "files": {
    "reference": "reference.fasta",
    "genomeAnnotation": "genome_annotation.gff3",
    "changelog": "CHANGELOG.md"
  },
  "buildFiles": {
    "minimizerReferences": [
      "minimizer_refs/additional_refs_B.fasta",
      "minimizer_refs/additional_refs_B1a.fasta"
    ]
  }
}
```

Unknown top-level keys are silently ignored by Nextclade, so the dataset remains loadable by both old and new versions. The companion script change reading from `buildFiles` is in #404 (9fc7f3e).

### Work items

- [x] Move `minimizerReferences` from `files` to `buildFiles` in `pathogen.json`
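On the script side, the lookup from `buildFiles` with a fallback to the main reference might look roughly like the sketch below. The function name `get_minimizer_ref_paths()` is hypothetical and this is not the code from the companion commit. The sketch also assumes that when `minimizerReferences` is present only the listed files are used; whether the real script additionally includes the main reference is not stated here.

```python
import json
from pathlib import Path


def get_minimizer_ref_paths(dataset_dir, pathogen_json: dict) -> list[Path]:
    """Resolve the FASTA files that feed the minimizer index for one dataset."""
    dataset_dir = Path(dataset_dir)
    # `buildFiles` is optional, so older pathogen.json files keep working.
    listed = pathogen_json.get("buildFiles", {}).get("minimizerReferences")
    if listed:
        return [dataset_dir / rel_path for rel_path in listed]
    # Fallback: only the main reference, as before multi-ref support.
    return [dataset_dir / pathogen_json["files"]["reference"]]


if __name__ == "__main__":
    example = json.loads("""
    {
      "files": {"reference": "reference.fasta"},
      "buildFiles": {"minimizerReferences": ["minimizer_refs/additional_refs_B.fasta"]}
    }
    """)
    for path in get_minimizer_ref_paths("some_dataset", example):
        print(path)
```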
# Conflicts:
#	data_output/minimizer_index.json
#	scripts/rebuild
…port

Rewrite `scripts/minimizer` as a subcommand CLI (`build`, `search`) with support for multi-ref indexes, pre-built index loading, and sorted FASTA output. Move the search algorithm (`search_one_query`, `filter_matches`, `deserialize_ref_search_index`) from the script into `lib/minimizer.py` to share it with other consumers (`rebuild`, `suggest_datasets`).
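For readers unfamiliar with the subcommand layout, a CLI with `build` and `search` subcommands could be wired up with `argparse` along these lines. This is only an illustrative sketch: the flag names (`--refs`, `--output`, `--index`) and the positional `queries` argument are assumptions for the example and are not taken from the actual `scripts/minimizer`.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="minimizer")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # `build`: create an index from one or more reference FASTA files.
    build = subparsers.add_parser("build", help="build a minimizer index")
    build.add_argument("--refs", nargs="+", required=True, help="reference FASTA files (hypothetical flag)")
    build.add_argument("--output", required=True, help="where to write the index (hypothetical flag)")

    # `search`: run queries against a previously built index.
    search = subparsers.add_parser("search", help="search queries against an index")
    search.add_argument("--index", required=True, help="pre-built index file (hypothetical flag)")
    search.add_argument("queries", help="FASTA file with query sequences")

    return parser


if __name__ == "__main__":
    args = make_parser().parse_args()
    print(args)
```

Keeping the search functions in a shared library module, as the commit describes, means a parser like this only has to dispatch to them rather than carry its own copy of the algorithm.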
Checklist

- Check if changes affect downstream workflows which depend on this dataset. For instance, Nextstrain ingest workflows may break if clade nomenclature changes. Consider fixing those workflows or at least opening an issue.

Not applicable