Add orov datasets #411

Open
itpsgit wants to merge 3 commits into nextstrain:master from InstitutoTodosPelaSaude:master

@itpsgit
Contributor

@itpsgit itpsgit commented Mar 2, 2026

This pull request introduces Oropouche virus Nextclade datasets for quality control. The workflow used to generate these datasets is available in the following repository, as referenced in the dataset README.

We created datasets for all three genome segments and used two references: the NCBI RefSeq reference and a reference from the Tefé outbreak (described here: https://pubmed.ncbi.nlm.nih.gov/29623245), which has been used for assembling Oropouche virus genomes from recent outbreaks.

The nucleotide divergence between the two references for each segment is as follows:

  • L: 10.26%
  • M: 3.59%
  • S: 4.14%

@ivan-aksamentov
Member

@itpsgit

Hi and thanks for submitting these datasets! Excellent, detailed work, with many advanced features used!

I am reviewing as a software engineer, so mostly technical aspects. I will let our scientists review the scientific parts and decide.

I reviewed both manually and with current AI models. I am not familiar with this virus, so the scientific statements here should be verified by domain experts.

Testing

Links to try datasets in Nextclade Web:

Science

Oropouche virus biology

Oropouche virus (OROV) is a segmented negative-sense RNA virus in the family Peribunyaviridae, order Bunyavirales, genus Orthobunyavirus. It causes Oropouche fever, an acute febrile illness transmitted primarily by the biting midge Culicoides paraensis. OROV has caused recurrent outbreaks in the Amazon basin since its discovery in 1955.

Three genome segments:

  • L segment (~6.8 kb): Encodes the RNA-dependent RNA polymerase (RdRp)
  • M segment (~4.4 kb): Encodes the glycoprotein precursor (polyprotein cleaved into Gn, NSm, and Gc)
  • S segment (~0.75-0.95 kb): Encodes the nucleocapsid protein (N) and the non-structural protein NSs in overlapping reading frames (+1 frame offset)

2023-2025 epidemic

The 2023-2025 epidemic represents a major expansion driven by a novel reassortant lineage (Naveca et al., Nat Med 2024, PMID 39293488):

  • Reassortment origin: M segment from eastern Amazon viruses (2009-2018), L and S segments from Peru/Colombia/Ecuador strains (2008-2021). Reassortment likely occurred in Amazonas state between 2010-2014.
  • Phenotype: ~100x higher replication in mammalian cells and 32-fold reduced cross-neutralization vs historical strains (PMID 39423838).
  • Spread: By 2024, Brazil confirmed 8,639 cases (58.8x the annual median), with spread to all 27 federal units, Bolivia, Colombia, Peru, Ecuador, Cuba, and travel-associated cases in the US and Europe.
  • Severity: Two confirmed deaths and suspected vertical transmission with microcephaly were reported.

Lineage classification

Historical genotype classification (I-IV, based on the S/N gene with ~5% mean nucleotide divergence) is considered insufficient for current diversity. No standardized lineage nomenclature analogous to SARS-CoV-2 Pango lineages exists.

At least 21 reassortment events were identified among 2024 Brazilian genomes alone (PMID 40037296), making per-segment phylogenetics essential since classification by one segment does not predict the others. At least 6 sub-lineages circulate within the OROVBR-2015-2024/2025 clade across different Brazilian states.

Reference strains

  • RefSeq (BeAn19991): Prototype strain isolated in 1960 from Belem, Brazil. Phylogenetically distant from current epidemic strains but appropriate as a canonical root reference. Accessions: NC_005776.1 (L), NC_005775.1 (M), NC_005777.1 (S).

  • Tefe (ILMD_TF29): Collected 2015-04-13 from Tefe, Amazonas, Brazil during a local outbreak (Naveca et al., PLoS Curr Outbreaks, 2018, PMID 29623245). The study screened dengue-negative febrile patients, identifying 9/30 as OROV-positive by RT-qPCR. Closer to contemporary circulating lineages but predates the novel reassortant era. Accessions: PP154172.1 (L), PP154171.1 (M), PP154170.1 (S).

Nucleotide divergence between references: L=10.26%, M=3.59%, S=4.14%.
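
The thread does not say how these percentages were computed. One common definition is the p-distance: the share of mismatching columns in a pairwise alignment of the two references, skipping columns where either sequence has a gap or an ambiguous base. The sketch below assumes that definition and assumes the two references have already been aligned to equal length; the function name `p_distance` is illustrative only.

```python
def p_distance(aligned_a: str, aligned_b: str) -> float:
    """Fraction of mismatching columns between two aligned sequences.

    Columns where either sequence has a gap ('-') or 'N' are skipped.
    Assumption: the inputs come from a pairwise alignment and have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to equal length")

    compared = 0
    mismatches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "-N" or b in "-N":
            continue  # not informative for a nucleotide comparison
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no comparable columns")
    return mismatches / compared
```

Other definitions (for example, counting gaps as differences) would give slightly different numbers, so the values quoted above should be reproduced with whatever method the dataset authors actually used.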

Blocking issues affecting science

These issues are blocking adoption of the dataset. Please address.

H1. placementMaskRanges off-by-one in all 6 datasets

placementMaskRanges uses 0-based, end-exclusive intervals (documented in Nextclade CHANGELOG.old.md). Current values use GFF3 1-based coordinates directly without conversion:

  • 5' mask end values are 1 too high (excludes first CDS nucleotide from placement)
  • 3' mask begin values are 1 too high (includes one extra UTR nucleotide)

For a GFF3 gene at 1-based positions [start, end], the 0-based half-open UTR masks should be [0, start-1) and [end, seq_length). Current values use [0, start) and [end+1, seq_length).

| Dataset | 5' mask (current) | 5' mask (correct) | 3' mask (current) | 3' mask (correct) |
|----------|---------|---------|--------------|--------------|
| L/refseq | [0, 44) | [0, 43) | [6797, 6846) | [6796, 6846) |
| L/tefe | [0, 36) | [0, 35) | [6795, 6814) | [6794, 6814) |
| M/refseq | [0, 32) | [0, 31) | [4295, 4385) | [4294, 4385) |
| M/tefe | [0, 21) | [0, 20) | [4284, 4371) | [4283, 4371) |
| S/refseq | [0, 45) | [0, 44) | [741, 754) | [740, 754) |
| S/tefe | [0, 44) | [0, 43) | [740, 944) | [739, 944) |

Effect: First nucleotide of each gene (start codon 'A' of ATG) is excluded from placement distance calculation; one 3' UTR nucleotide (immediately after stop codon) is included. Practical impact is minor (1 nt each direction) but the error is systematic.

Fix: Subtract 1 from each end value in 5' masks and each begin value in 3' masks.
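
The conversion rule stated above can be sketched in a few lines of Python. The helper name `utr_masks` is hypothetical, and the `begin`/`end` dictionary keys are an assumption about how the ranges are written in pathogen.json. The CDS coordinates in `DATASETS` are read back from the "correct" columns of the table above rather than taken from the GFF3 files directly.

```python
def utr_masks(cds_start: int, cds_end: int, seq_length: int) -> list[dict[str, int]]:
    """Convert a 1-based, inclusive GFF3 CDS span into two 0-based,
    end-exclusive UTR mask ranges: [0, start-1) and [end, seq_length)."""
    if not 1 <= cds_start <= cds_end <= seq_length:
        raise ValueError("expected 1 <= cds_start <= cds_end <= seq_length")
    return [
        {"begin": 0, "end": cds_start - 1},      # 5' UTR, stops before the first CDS base
        {"begin": cds_end, "end": seq_length},   # 3' UTR, starts right after the stop codon
    ]


# (cds_start, cds_end, seq_length), 1-based inclusive, inferred from the review table
DATASETS = {
    "L/refseq": (44, 6796, 6846),
    "L/tefe": (36, 6794, 6814),
    "M/refseq": (32, 4294, 4385),
    "M/tefe": (21, 4283, 4371),
    "S/refseq": (45, 740, 754),
    "S/tefe": (44, 739, 944),
}

if __name__ == "__main__":
    for name, (start, end, length) in DATASETS.items():
        print(name, utr_masks(start, end, length))
```

Deriving the masks from the annotation coordinates in this way, instead of typing them by hand, keeps the 1-based to 0-based conversion in one place.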

Non-blocking issues

Cosmetic issues and small inconsistencies, low-priority. Fix if time allows.

M1. Repository URLs mismatch between pathogen.json and README

All 6 pathogen.json files have:

  • meta.bugs pointing to https://github.com/dezordi/nextclade_data_workflows/issues
  • meta["source code"] pointing to https://github.com/dezordi/nextclade_data_workflows/tree/main/oroV

All 6 README files point to https://github.com/InstitutoTodosPelaSaude/nextclade-datasets-workflows/.

Three discrepancies:

  1. GitHub org: dezordi (personal) vs InstitutoTodosPelaSaude (institutional)
  2. Repo name: nextclade_data_workflows (underscores) vs nextclade-datasets-workflows (hyphens)
  3. Path casing: oroV (capital V) vs orov (lowercase)

Users following the pathogen.json bug tracker URL reach a personal fork instead of the institutional repository. The pathogen.json URLs are what Nextclade surfaces to users in the UI.

Fix: Unify all URLs to the institutional repository InstitutoTodosPelaSaude/nextclade-datasets-workflows with path orov.
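
A check like the following could catch this kind of drift before submission. It is a minimal sketch: the function name `mismatched_meta_urls` is hypothetical, and it only inspects the two `meta` keys named above (`bugs` and `source code`) on an already-parsed pathogen.json dictionary.

```python
EXPECTED_PREFIX = "https://github.com/InstitutoTodosPelaSaude/nextclade-datasets-workflows"


def mismatched_meta_urls(pathogen: dict, expected_prefix: str = EXPECTED_PREFIX) -> dict[str, str]:
    """Return the meta URLs that do not point at the expected repository."""
    meta = pathogen.get("meta", {})
    mismatched = {}
    for key in ("bugs", "source code"):
        url = meta.get(key)
        if isinstance(url, str) and not url.startswith(expected_prefix):
            mismatched[key] = url
    return mismatched
```

An empty result means both URLs agree with the repository named in the README.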

M2. Typo "Oropouch" in all 6 CHANGELOG.md files

All changelogs say "Oropouch Virus" instead of "Oropouche Virus" (missing 'e').

Files affected:

  • data/community/itps/orov/L/refseq/CHANGELOG.md
  • data/community/itps/orov/L/tefe/CHANGELOG.md
  • data/community/itps/orov/M/refseq/CHANGELOG.md
  • data/community/itps/orov/M/tefe/CHANGELOG.md
  • data/community/itps/orov/S/refseq/CHANGELOG.md
  • data/community/itps/orov/S/tefe/CHANGELOG.md

Fix: Replace "Oropouch" with "Oropouche" in all 6 files.

M3. Inconsistent "oroV" casing in all 6 README.md titles

All 6 README.md files use "oroV" with a capital V in the title (e.g., # Nextclade Dataset for "oroV" L segment), while:

  • Directory structure uses orov (lowercase)
  • attributes.name in pathogen.json uses orov
  • Standard abbreviation is "OROV" (all caps)

Files affected:

  • data/community/itps/orov/L/refseq/README.md
  • data/community/itps/orov/L/tefe/README.md
  • data/community/itps/orov/M/refseq/README.md
  • data/community/itps/orov/M/tefe/README.md
  • data/community/itps/orov/S/refseq/README.md
  • data/community/itps/orov/S/tefe/README.md

Fix: Change "oroV" to "OROV" (standard abbreviation) in all README titles.

M4. Missing trailing newlines on all 24 text files

Every .json, .md, and .gff3 file in the PR lacks a final newline (POSIX text file convention). Git diff shows \ No newline at end of file markers.

Files affected: all pathogen.json, tree.json, README.md, CHANGELOG.md, and genome_annotation.gff3 files (24 total).

Fix: Add trailing newline to all text files.
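
One way to apply this fix in bulk is sketched below. The helper names are hypothetical; the suffix list mirrors the file types named above, and the directory passed to `fix_tree` is whatever dataset root applies (for this PR, presumably data/community/itps/orov).

```python
from pathlib import Path

TEXT_SUFFIXES = {".json", ".md", ".gff3"}


def ensure_trailing_newline(path: Path) -> bool:
    """Append a final newline if the file is non-empty and lacks one.

    Returns True when the file was modified.
    """
    data = path.read_bytes()
    if not data or data.endswith(b"\n"):
        return False
    path.write_bytes(data + b"\n")
    return True


def fix_tree(root: Path) -> list[Path]:
    """Fix every text file under root and return the paths that changed."""
    changed = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in TEXT_SUFFIXES:
            if ensure_trailing_newline(path):
                changed.append(path)
    return changed
```

Most editors can also be configured to add the final newline on save, which prevents the issue from recurring.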

L1. Tefe CHANGELOG says "NCBI tefe" - misleading phrasing

All 3 tefe CHANGELOG files say "based on NCBI tefe (ILMD_TF29) genome". The Tefe reference is a GenBank submission (PP154172.1, PP154171.1, PP154170.1), not an NCBI RefSeq. Phrasing "NCBI tefe" is misleading.

Files affected:

  • data/community/itps/orov/L/tefe/CHANGELOG.md
  • data/community/itps/orov/M/tefe/CHANGELOG.md
  • data/community/itps/orov/S/tefe/CHANGELOG.md

Fix: Rephrase to "based on Tefe outbreak reference genome (ILMD_TF29)" in all tefe CHANGELOG files.

L2. S/refseq reference is truncated (754 vs 944 nt)

The OROV S segment genomic RNA is typically ~940-960 nt. NC_005777.1 at 754 nt is missing ~190 nt of 3' UTR compared to the Tefe reference PP154170.1 (944 nt). Coding regions are identical in length.

| Property | S/refseq | S/tefe |
|----------------|-----------------|-----------------|
| Total length | 754 nt | 944 nt |
| N gene | 45-740 (696 nt) | 44-739 (696 nt) |
| 3' UTR after N | 14 nt | 205 nt |

Full-length S segment sequences analyzed with S/refseq will have ~190 unaligned nucleotides at the 3' end. This does not affect amino acid analysis but inflates missing/unaligned data metrics. The Tefe reference is more representative for complete genomes.

Fix: Add a note to S/refseq README documenting the truncated 3' UTR and recommending S/tefe for full-length sequences.
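
The lengths in the table follow directly from the 1-based, inclusive gene coordinates. As a quick check (a sketch, with the hypothetical helper name `segment_summary`):

```python
def segment_summary(total_length: int, cds_start: int, cds_end: int) -> dict[str, int]:
    """Derive region lengths from 1-based, inclusive CDS coordinates."""
    return {
        "utr5_length": cds_start - 1,            # bases before the first CDS base
        "cds_length": cds_end - cds_start + 1,   # inclusive span
        "utr3_length": total_length - cds_end,   # bases after the last CDS base
    }


refseq = segment_summary(total_length=754, cds_start=45, cds_end=740)
tefe = segment_summary(total_length=944, cds_start=44, cds_end=739)

# Difference in total length between the two S references
length_gap = 944 - 754
```

Here `refseq["utr3_length"]` is 14 and `tefe["utr3_length"]` is 205, and the 190 nt gap in total length is almost entirely this 3' UTR difference (191 nt), offset by a 5' UTR that is 1 nt longer in RefSeq.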

Notes

Non-issues. Curious observations and positive patterns that do not require action.

  • Complete segment coverage (L, M, S) with dual-reference strategy - thoughtful design for segmented virus with significant reference divergence
  • All CDS lengths divisible by 3, all start codons ATG, all stop codons TAG/TAA
  • Overlapping ORFs on S segment correctly annotated: NSs (67-342) in +1 reading frame within N (45-740)
  • Gene coordinates consistent between GFF3 and tree.json
  • All GFF3 sequence IDs match reference FASTA headers
  • All tree root sequences match reference FASTA lengths
  • compatibility.cli: "3.0.0-alpha.0" restricts to Nextclade v3+ - appropriate as v3 is current major version
  • retryReverseComplement: true - appropriate for RNA virus sequences that may be in either orientation
  • experimental: true correctly set for new community dataset
  • snpClusters disabled - appropriate for a virus with limited within-outbreak diversity
  • QC thresholds scaled proportionally to segment size (~20% for missingDataThreshold)
  • Auspice v2 tree format with proper structure (root_sequence, meta, colorings)
  • Example sequences (28 for L, 58 for M, 100 for S) provide good segment-proportional coverage for testing
  • Dual references per segment is good practice: RefSeq provides stable canonical reference, Tefe is closer to recent outbreak sequences
  • Per-segment approach is well-suited for OROV given extensive reassortment - avoids pitfalls of concatenated-genome analyses
  • Tefe reference (2015) predates the novel reassortant driving the 2023-2025 epidemic; future iterations may benefit from including a reference from the reassortant lineage
  • L/refseq defines 3 ignoredFrameShifts at codons 788-792, 797-800, and 846-855; L/tefe has none. This asymmetry is expected given 10.26% nucleotide divergence - alignments may produce frameshifts relative to one reference but not the other
  • L segment RdRp differs by 6 nt between references: L/refseq = 6753 nt (2250 aa), L/tefe = 6759 nt (2252 aa) - real biological difference
  • Trees lack clade/lineage annotations (only country and div attributes) - acceptable for initial release given no standardized OROV nomenclature exists
  • Tree dates show 2025-12-05 - data cutoff ~3 months before PR submission
  • GFF3 source field has minor casing inconsistency (Genbank vs GenBank in tefe datasets) - from NCBI's annotation pipeline, does not affect Nextclade parsing
  • Nextstrain maintains separate public OROV trees (nextstrain.org/oropouche); these ITpS datasets complement that effort with Nextclade-specific QC (alignment, mutation calling, frameshift detection, missing data scoring)
  • Nextstrain oropouche workflow uses augur + IQ-TREE per segment with fixed clock rate 0.0014 subst/site/year; ITpS datasets provide QC rather than time-resolved phylogenetics
  • UK Culex pipiens molestus showed zero infection competence for 2024 Cuban OROV strain (bioRxiv 10.1101/2025.05.07.652619), reducing temperate transmission risk; primary vector remains Culicoides paraensis in tropical/subtropical regions
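
Several of the annotation checks listed above (CDS length divisible by 3, ATG start, stop codon at the end, NSs frame offset, protein lengths) reduce to simple coordinate arithmetic. The sketch below shows one way to express them. All function names are hypothetical, coordinates are 1-based and inclusive as in GFF3, and the sequence is assumed to be given in the same orientation as the annotated CDS.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(seq: str, start: int, end: int) -> list[str]:
    """Return a list of basic problems for a CDS at 1-based inclusive [start, end]."""
    cds = seq[start - 1:end].upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not divisible by 3")
    if not cds.startswith("ATG"):
        problems.append("does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems


def frame_offset(outer_start: int, inner_start: int) -> int:
    """Reading-frame offset (0, 1 or 2) of an inner ORF relative to an outer one."""
    return (inner_start - outer_start) % 3


def protein_length(cds_nt: int) -> int:
    """Amino acid count for a CDS whose length includes the stop codon."""
    if cds_nt % 3 != 0:
        raise ValueError("CDS length must be divisible by 3")
    return cds_nt // 3 - 1


# Values quoted in the notes above
nss_offset = frame_offset(outer_start=45, inner_start=67)   # NSs within N on S/refseq
rdrp_refseq_aa = protein_length(6753)
rdrp_tefe_aa = protein_length(6759)
```

With the coordinates from the notes, `nss_offset` is 1, and the RdRp lengths come out as 2250 and 2252 aa, matching the figures quoted for L/refseq and L/tefe.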

@rneher
Member

rneher commented Mar 2, 2026

Thanks so much! This looks generally really good. I assume there is no clade annotation?

A few observations from my side:

  • in the refseq datasets, the reference sequence isn't actually in the tree (I checked the L build). When analyzing the reference sequence, it is flagged as having many private mutations. This suggests to me that a) it would be good to include it, and b) increase the private mutation threshold.
  • most example sequences have compensated frameshifts (short indels around base 2400) when aligned against the reference, and you have added these as 'Known frameshifts', which makes sense. I guess these frameshifts are real? They could be suppressed by increasing the gap open penalties a LOT, at the expense of many more substitutions. One way or the other, you should probably use "alignmentPreset":"high-diversity" for the L segment (it doesn't change the main indels, but generally penalizes indels more compared to mutations). If you do, rerun the alignment for the tree with the same parameters.
  • The S datasets flag a lot of sequences as having many private mutations. Either more sequences should be included in the tree, or the threshold should be adjusted.

- Fix typo: "Oropouch" → "Oropouche" in CHANGELOG.md files
- Fix casing: "oroV" → "OROV" in README.md titles
- Update meta URLs from dezordi/nextclade_data_workflows to
  InstitutoTodosPelaSaude/nextclade-datasets-workflows
- Adjust placementMaskRanges to use 0-based coordinates (off-by-one fix)
- Increase privateMutations cutoff thresholds:
  - L segments: 20 → 30
  - M segments: 15 → 20
  - S segments: 15 → 20
- Add "alignmentPreset": "high-diversity" to L/refseq alignment params
- Add note on truncated 3' UTR to S/refseq README with recommendation
  to use S/tefe for full-length sequences
- Improve tefe CHANGELOG descriptions (e.g. "NCBI tefe" → "Tefe outbreak reference genome")
- Add missing newlines at end of files
- Update tree.json files with refreshed phylogenetic trees
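
For illustration, the two parameter changes in this commit message (the privateMutations cutoff and the alignment preset) could be applied to a parsed pathogen.json like this. This is a sketch only: the helper name is hypothetical, and the key paths `qc.privateMutations.cutoff` and `alignmentParams.alignmentPreset` are assumptions based on the wording above, so they should be checked against the Nextclade dataset documentation.

```python
import copy
from typing import Optional


def apply_review_changes(
    pathogen: dict,
    private_cutoff: int,
    alignment_preset: Optional[str] = None,
) -> dict:
    """Return a copy of a parsed pathogen.json with the reviewed settings applied.

    Assumed key paths: qc.privateMutations.cutoff and
    alignmentParams.alignmentPreset.
    """
    updated = copy.deepcopy(pathogen)  # leave the caller's dict untouched
    qc = updated.setdefault("qc", {})
    qc.setdefault("privateMutations", {})["cutoff"] = private_cutoff
    if alignment_preset is not None:
        updated.setdefault("alignmentParams", {})["alignmentPreset"] = alignment_preset
    return updated


# Example: L/refseq, cutoff raised from 20 to 30 and the preset added
before = {"qc": {"privateMutations": {"enabled": True, "cutoff": 20}}}
after = apply_review_changes(before, private_cutoff=30, alignment_preset="high-diversity")
```

As noted in the review, if the alignment preset changes, the reference tree should be rebuilt from an alignment made with the same parameters.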
@dezordi
Contributor

dezordi commented Mar 3, 2026

Hi @ivan-aksamentov, thank you for the thorough and detailed review! All issues have been addressed in the latest push (977e4398).

@dezordi
Contributor

dezordi commented Mar 3, 2026

Hi @rneher, thank you very much for your feedback. In this first version, we chose not to include clade annotation, as Oropouche virus does not yet have an official lineage proposal. Given the complexity introduced by the recent reassortment events, we felt it would be more appropriate to first seek broader community input before defining segment-specific clades. In the future, we would be very happy to collaborate with the community to establish and include clade annotations for each segment in a way that is biologically meaningful and broadly supported.

In the latest push 977e4398, I addressed your comments regarding the L and S segments.

If any further adjustments are necessary, please let me know.


4 participants