refactor: move minimizerReferences from files to buildFiles #442
Merged
ivan-aksamentov merged 135 commits into cva16-testing-sample on Apr 1, 2026
flu: add trees to internal segments, add old sequences to ha/na builds, add h2n2, h1n1, and all-b builds
…skip pseudo genes
Updated datasets:
- Influenza A H1N1pdm HA (nextstrain/flu/h1n1pdm/ha/MW626062)
- Influenza A H1N1pdm NA (nextstrain/flu/h1n1pdm/na/MW626056)
- Influenza A H3N2 HA (nextstrain/flu/h3n2/ha/EPI1857216)
- Influenza A H3N2 NA (nextstrain/flu/h3n2/na/EPI1857215)
- Influenza B (all) HA (nextstrain/flu/b/ha/KX058884)
- Influenza B (all) NA (nextstrain/flu/b/na/CY073894)
- Influenza A H3N2 PB2 (nextstrain/flu/h3n2/pb2)
- Influenza A H3N2 PB1 (nextstrain/flu/h3n2/pb1)
- Influenza A H3N2 PA (nextstrain/flu/h3n2/pa)
- Influenza A H3N2 HA (nextstrain/flu/h3n2/ha/CY163680)
- Influenza A H3N2 NP (nextstrain/flu/h3n2/np)
- Influenza A H3N2 MP (nextstrain/flu/h3n2/mp)
- Influenza A H3N2 NS (nextstrain/flu/h3n2/ns)
- Influenza A H1N1pdm PB2 (nextstrain/flu/h1n1pdm/pb2)
- Influenza A H1N1pdm PB1 (nextstrain/flu/h1n1pdm/pb1)
- Influenza A H1N1pdm PA (nextstrain/flu/h1n1pdm/pa)
- Influenza A H1N1pdm HA (nextstrain/flu/h1n1pdm/ha/CY121680)
- Influenza A H1N1pdm MP (nextstrain/flu/h1n1pdm/mp)
- Influenza A H1N1pdm NP (nextstrain/flu/h1n1pdm/np)
- Influenza A H1N1pdm NS (nextstrain/flu/h1n1pdm/ns)
- Influenza B (all) PB1 (nextstrain/flu/b/pb1)
- Influenza B (all) PB2 (nextstrain/flu/b/pb2)
- Influenza B (all) PA (nextstrain/flu/b/pa)
- Influenza B (all) NP (nextstrain/flu/b/np)
- Influenza B (all) MP (nextstrain/flu/b/mp)
- Influenza B (all) NS (nextstrain/flu/b/ns)
- Influenza A H1N1 PB2 (nextstrain/flu/h1n1/pb2)
- Influenza A H1N1 PB1 (nextstrain/flu/h1n1/pb1)
- Influenza A H1N1 PA (nextstrain/flu/h1n1/pa)
- Influenza A H1N1 HA (nextstrain/flu/h1n1/ha)
- Influenza A H1N1 NP (nextstrain/flu/h1n1/np)
- Influenza A H1N1 NA (nextstrain/flu/h1n1/na)
- Influenza A H1N1 MP (nextstrain/flu/h1n1/mp)
- Influenza A H1N1 NS (nextstrain/flu/h1n1/ns)
- Influenza A H2N2 PB2 (nextstrain/flu/h2n2/pb2)
- Influenza A H2N2 PB1 (nextstrain/flu/h2n2/pb1)
- Influenza A H2N2 PA (nextstrain/flu/h2n2/pa)
- Influenza A H2N2 NP (nextstrain/flu/h2n2/np)
- Influenza A H2N2 HA (nextstrain/flu/h2n2/ha)
- Influenza A H2N2 NA (nextstrain/flu/h2n2/na)
- Influenza A H2N2 MP (nextstrain/flu/h2n2/mp)
- Influenza A H2N2 NS (nextstrain/flu/h2n2/ns)
- Remove deprecated/enabled from root (obsolete)
- Move experimental: true to attributes
- Event-based reporter with severity levels, stage tracking, dataset lifecycle, and defect findings
- Terminal renderer with severity-routed output (warnings/errors to stderr)
- GitHub Actions renderer with annotation commands and step summary markdown
- JSONL renderer for machine-readable build reports
- Replace logging-based logger with thin reporter adapter
- Replace ad-hoc CI annotation emission with DefectFinding-based reporting
- Remove Defect, DefectReport, Severity, print_defect_summary, write_defect_summary_markdown from schema.py (moved to reporting modules)
- Add dataset start/finish lifecycle events and stage grouping to rebuild
- Add --report-jsonl flag for machine-readable build output
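As an illustration of the JSONL renderer idea in the commits above, here is a minimal sketch that writes one JSON object per finding, one per line. The `Finding` dataclass, its field names, and `write_jsonl` are hypothetical; the real `DefectFinding` type and renderer in the reporting modules may look quite different.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class Finding:
    """Hypothetical event shape; the real DefectFinding fields may differ."""
    severity: str  # e.g. "warning" or "error"
    stage: str
    dataset: str
    message: str


def write_jsonl(findings, stream):
    """Write each finding as a single JSON object on its own line (JSONL)."""
    for finding in findings:
        stream.write(json.dumps(asdict(finding), sort_keys=True) + "\n")


# Usage: render two findings into an in-memory buffer.
buffer = io.StringIO()
write_jsonl(
    [
        Finding("warning", "validate", "nextstrain/flu/b/ha/KX058884", "example warning"),
        Finding("error", "index", "nextstrain/flu/h2n2/na", "example error"),
    ],
    buffer,
)
report_lines = buffer.getvalue().splitlines()
```

Each line parses independently with `json.loads`, so a consumer can stream the report without loading it whole.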
`_build_schema_index()` uses `defaultdict` but `collections.defaultdict` was not imported, causing `NameError` on every rebuild.
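The fix is a one-line import. A minimal sketch of the pattern, using a hypothetical `build_index` helper as a stand-in (the actual body of `_build_schema_index()` is not shown in this PR):

```python
from collections import defaultdict  # the missing import; without it the name is undefined


def build_index(entries):
    """Group (key, value) pairs by key. Hypothetical stand-in for _build_schema_index()."""
    index = defaultdict(list)
    for key, value in entries:
        index[key].append(value)
    return dict(index)
```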
Current Nextclade versions reject array values in the `files` section of `pathogen.json` with `invalid type: sequence, expected a string`. The `DatasetFiles` struct catches unknown keys with `rest_files: BTreeMap<String, String>` (https://github.com/nextstrain/nextclade/blob/f7db57f31/packages/nextclade/src/io/dataset.rs#L567-L569), which only accepts string values. An array like `minimizerReferences` fails deserialization before reaching the `other: serde_json::Value` catch-all. Move the lookup to `.buildFiles`, a top-level key absorbed by `VirusProperties`'s `other: serde_json::Value` (https://github.com/nextstrain/nextclade/blob/1400012f7/packages/nextclade/src/analyze/virus_properties.rs#L137-L138), which accepts any JSON type. This makes datasets with multiple minimizer reference files loadable by both old and new Nextclade versions.
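To catch the problem on the dataset side before Nextclade does, a build script could check that every value under `files` is a string. The sketch below only mirrors the string-only constraint described above; it is not Nextclade's deserialization code, and `non_string_file_entries` is a hypothetical helper name.

```python
import json


def non_string_file_entries(pathogen_json_text):
    """Return the sorted keys under `files` whose values are not strings."""
    pathogen = json.loads(pathogen_json_text)
    files = pathogen.get("files", {})
    return sorted(key for key, value in files.items() if not isinstance(value, str))


# An array under `files` is flagged; the same array under `buildFiles` is not.
before = '{"files": {"reference": "reference.fasta", "minimizerReferences": ["a.fasta"]}}'
after = (
    '{"files": {"reference": "reference.fasta"},'
    ' "buildFiles": {"minimizerReferences": ["a.fasta"]}}'
)

print(non_string_file_entries(before))  # ['minimizerReferences']
print(non_string_file_entries(after))   # []
```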
- Depends on: #404

Current Nextclade versions reject array values in the `files` section of `pathogen.json` with `invalid type: sequence, expected a string`. Datasets declaring `minimizerReferences` as an array inside `files` fail to load.

Move `minimizerReferences` from `.files` to a new top-level `.buildFiles` key:

```json
{
  "files": {
    "reference": "reference.fasta",
    "genomeAnnotation": "genome_annotation.gff3",
    "changelog": "CHANGELOG.md"
  },
  "buildFiles": {
    "minimizerReferences": [
      "minimizer_refs/additional_refs_B.fasta",
      "minimizer_refs/additional_refs_B1a.fasta"
    ]
  }
}
```

Unknown top-level keys are silently ignored by Nextclade, so the dataset remains loadable by both old and new versions. The companion script change reading from `buildFiles` is in #404 (9fc7f3e).

### Work items

- [x] Move `minimizerReferences` from `files` to `buildFiles` in `pathogen.json`
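On the script side, reading the new key might look roughly like the sketch below. The function name `minimizer_reference_paths` and the validation are assumptions for illustration; the actual change in #404 is not reproduced here.

```python
def minimizer_reference_paths(pathogen):
    """Return extra minimizer reference paths from `buildFiles`, or [] if none are declared."""
    refs = pathogen.get("buildFiles", {}).get("minimizerReferences", [])
    if not isinstance(refs, list) or not all(isinstance(ref, str) for ref in refs):
        raise ValueError("buildFiles.minimizerReferences must be a list of strings")
    return refs


# Usage with a parsed pathogen.json (a plain dict here).
pathogen = {
    "files": {"reference": "reference.fasta"},
    "buildFiles": {
        "minimizerReferences": [
            "minimizer_refs/additional_refs_B.fasta",
            "minimizer_refs/additional_refs_B1a.fasta",
        ]
    },
}
paths = minimizer_reference_paths(pathogen)
```

Datasets without a `buildFiles` key simply yield an empty list, so single-reference datasets need no changes.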
# Conflicts:
#	data_output/minimizer_index.json
#	scripts/rebuild
Current Nextclade versions reject array values in the `files` section of `pathogen.json` with `invalid type: sequence, expected a string`. Datasets declaring `minimizerReferences` as an array inside `files` [before] fail to load.

This PR (757b824) moves `minimizerReferences` from `.files` to a new top-level `.buildFiles` key [after]:

    {
      "files": {
        "reference": "reference.fasta",
        "genomeAnnotation": "genome_annotation.gff3",
        "changelog": "CHANGELOG.md"
      },
      "buildFiles": {
        "minimizerReferences": [
          "minimizer_refs/additional_refs_B.fasta",
          "minimizer_refs/additional_refs_B1a.fasta"
        ]
      }
    }

Unknown top-level keys are silently ignored by Nextclade, so the dataset remains loadable by both old and new versions. The companion script change reading from `buildFiles` is in nextstrain/nextclade_data#404 (9fc7f3e4).

### Work items

- [x] Move `minimizerReferences` from `files` to `buildFiles` in `pathogen.json`
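For datasets that still carry the old layout, the move described above can be expressed as a small transformation on the parsed `pathogen.json`. This is only a sketch: `move_minimizer_references` is a hypothetical helper and is not part of this PR, which edits the JSON directly.

```python
def move_minimizer_references(pathogen):
    """Return a copy with `minimizerReferences` moved from `files` to `buildFiles`."""
    result = dict(pathogen)
    files = dict(result.get("files", {}))
    if "minimizerReferences" not in files:
        return result  # nothing to migrate
    build_files = dict(result.get("buildFiles", {}))
    build_files["minimizerReferences"] = files.pop("minimizerReferences")
    result["files"] = files
    result["buildFiles"] = build_files
    return result


# Usage: the input dict is left untouched; a migrated copy is returned.
old_layout = {
    "files": {
        "reference": "reference.fasta",
        "minimizerReferences": ["minimizer_refs/additional_refs_B.fasta"],
    }
}
new_layout = move_minimizer_references(old_layout)
```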