feat: use multiple reference sequences for minimizer index generation by ivan-aksamentov · Pull Request #404 · nextstrain/nextclade_data

ivan-aksamentov · 2026-01-13T15:41:10Z

Background:

Some pathogen datasets have significant genetic diversity that a single reference sequence cannot fully represent. This limits the accuracy of dataset auto-detection when query sequences are distant from the chosen reference. By allowing multiple reference sequences per dataset, the minimizer index can capture broader sequence diversity and improve detection rates.

Implementation:

- Add optional `files.minimizerReferences` field in `pathogen.json` (array of FASTA file paths)
- New `get_minimizer_refs()` function reads sequences from all listed files, falls back to main reference if field is absent
- `make_ref_search_index()` combines minimizers from all references using set union; uses average length for normalization
- Backward compatible: existing datasets work unchanged
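
The set-union step and average-length normalization described in this list can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `toy_minimizers()` is a hypothetical stand-in that keeps k-mer hashes below a cutoff, and `combine_refs()`, `k`, and `cutoff` are assumed names and parameters chosen only for the example.

```python
import zlib
from statistics import mean


def toy_minimizers(seq: str, k: int = 5, cutoff: int = 1 << 30) -> set[int]:
    """Hypothetical stand-in: hash every k-mer and keep hashes below `cutoff`."""
    seq = seq.upper()
    hashes = (zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < cutoff}


def combine_refs(ref_seqs: list[str], k: int = 5, cutoff: int = 1 << 30) -> dict:
    """Union the minimizers of all references into one index entry."""
    minimizers: set[int] = set()
    for seq in ref_seqs:
        # Set union: a minimizer shared by several references is stored once.
        minimizers |= toy_minimizers(seq, k, cutoff)
    # One length per entry: the average over all contributing references.
    return {
        "minimizers": sorted(minimizers),
        "length": mean(len(s) for s in ref_seqs),
    }
```

Because the combination is a union, adding a reference can only grow the entry's minimizer set, which is why a query far from the main reference may still find hits.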

Usage:

In pathogen.json, add an array of FASTA paths containing representative sequences for the dataset:

{
  "files": {
    "reference": "reference.fasta",
    "minimizerReferences": [
      "clade_a.fasta",
      "clade_b.fasta"
    ]
  }
}

Each FASTA file can contain one or more sequences. All sequences across all files contribute minimizers to the dataset's index entry.
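
As a rough sketch of how sequences from several multi-record FASTA files might be gathered, assuming a plain-text parser is enough for illustration: `parse_fasta()` and `collect_sequences()` are hypothetical helpers, not functions from this repository.

```python
from pathlib import Path


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (name, sequence) pairs; supports multiple records."""
    records: list[tuple[str, str]] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def collect_sequences(dataset_dir: Path, fasta_paths: list[str]) -> list[tuple[str, str]]:
    """Concatenate the records of every listed file, in listing order."""
    records: list[tuple[str, str]] = []
    for rel_path in fasta_paths:
        records.extend(parse_fasta((dataset_dir / rel_path).read_text()))
    return records
```

Every record returned by `collect_sequences()` would then contribute minimizers to the same index entry.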

Checklist

Check if changes affect downstream workflows which depend on this dataset. For instance, Nextstrain ingest workflows may break if clade nomenclature changes. Consider fixing those workflows, or at least opening an issue. Not applicable.

Co-Authored-By: Claude <noreply@anthropic.com>

nneune · 2026-04-01T11:17:33Z

I tested the multi-ref minimizer functionality in this branch with CVA16: https://github.com/nextstrain/nextclade_data/tree/cva16-testing-sample

Current Nextclade versions reject array values in the `files` section of `pathogen.json` with `invalid type: sequence, expected a string`. The `DatasetFiles` struct catches unknown keys with `rest_files: BTreeMap<String, String>` (https://github.com/nextstrain/nextclade/blob/f7db57f31/packages/nextclade/src/io/dataset.rs#L567-L569), which only accepts string values. An array like `minimizerReferences` fails deserialization before reaching the `other: serde_json::Value` catch-all. Move the lookup to `.buildFiles`, a top-level key absorbed by `VirusProperties`'s `other: serde_json::Value` (https://github.com/nextstrain/nextclade/blob/1400012f7/packages/nextclade/src/analyze/virus_properties.rs#L137-L138), which accepts any JSON type. This makes datasets with multiple minimizer reference files loadable by both old and new Nextclade versions.
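
A minimal sketch of the lookup on the script side, assuming the `pathogen.json` content is already loaded into a dict. `minimizer_ref_paths()` and `files_values_are_strings()` are hypothetical names. Whether the main reference is also included when `minimizerReferences` is present is not stated here, so this sketch uses the list as given.

```python
def minimizer_ref_paths(pathogen_json: dict) -> list[str]:
    """Return FASTA paths for minimizer generation.

    Assumption: the list under `buildFiles.minimizerReferences` is used as-is
    when present; otherwise fall back to the main reference.
    """
    extra = pathogen_json.get("buildFiles", {}).get("minimizerReferences")
    if extra:
        return list(extra)
    return [pathogen_json["files"]["reference"]]


def files_values_are_strings(pathogen_json: dict) -> bool:
    """Check the constraint described above: every value in `files` is a string."""
    return all(isinstance(v, str) for v in pathogen_json.get("files", {}).values())
```

With this layout `files` stays a flat string-to-string map, which is the shape that the `rest_files` field described above accepts.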

- Depends on: #404

Current Nextclade versions reject array values in the `files` section of `pathogen.json` with `invalid type: sequence, expected a string`. Datasets declaring `minimizerReferences` as an array inside `files` fail to load.

Move `minimizerReferences` from `.files` to a new top-level `.buildFiles` key:

```json
{
  "files": {
    "reference": "reference.fasta",
    "genomeAnnotation": "genome_annotation.gff3",
    "changelog": "CHANGELOG.md"
  },
  "buildFiles": {
    "minimizerReferences": [
      "minimizer_refs/additional_refs_B.fasta",
      "minimizer_refs/additional_refs_B1a.fasta"
    ]
  }
}
```

Unknown top-level keys are silently ignored by Nextclade, so the dataset remains loadable by both old and new versions. The companion script change reading from `buildFiles` is in #404 (9fc7f3e, 9fc7f3e).

### Work items

- [x] Move `minimizerReferences` from `files` to `buildFiles` in `pathogen.json`

…port

Rewrite scripts/minimizer as a subcommand CLI (build, search) with support for multi-ref indexes, pre-built index loading, and sorted FASTA output. Move search algorithm (search_one_query, filter_matches, deserialize_ref_search_index) from the script into lib/minimizer.py to share with other consumers (rebuild, suggest_datasets).
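
For readers unfamiliar with the subcommand layout, here is a generic `argparse` sketch of a CLI with `build` and `search` subcommands. Only the two subcommand names come from the commit message. Every argument and option shown (`refs`, `--output`, `queries`, `--index`) is an assumption made for illustration and may not match the real script.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Build a parser with `build` and `search` subcommands (hypothetical options)."""
    parser = argparse.ArgumentParser(prog="minimizer")
    sub = parser.add_subparsers(dest="command", required=True)

    build = sub.add_parser("build", help="build a minimizer index from reference FASTA files")
    build.add_argument("refs", nargs="+", help="one or more reference FASTA files")
    build.add_argument("--output", default="minimizer_index.json", help="where to write the index")

    search = sub.add_parser("search", help="search query sequences against an index")
    search.add_argument("queries", help="FASTA file with query sequences")
    search.add_argument("--index", help="path to a pre-built index to load")

    return parser


if __name__ == "__main__":
    args = make_parser().parse_args()
    print(args)
```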

ivan-aksamentov mentioned this pull request Apr 1, 2026

refactor: move minimizerReferences from files to buildFiles #442

Merged

1 task

Merge remote-tracking branch 'origin/master' into feat/multiref

1e95c89

# Conflicts:
#	data_output/minimizer_index.json
#	scripts/rebuild

nextstrain-bot and others added 2 commits April 1, 2026 13:44

chore: rebuild [skip ci]

f40ac06

ivan-aksamentov wants to merge 5 commits into master from feat/multiref