Hi and thanks for submitting these datasets! Excellent, detailed work, with many advanced features used!
I am reviewing as a software engineer, so mostly technical aspects. I will let our scientists review the scientific parts and decide. I reviewed both manually, and using current AI models. I am not familiar with this virus, so scientific statements here are to be verified by domain experts.
Testing
Links to try datasets in Nextclade Web:
Science
Oropouche virus biology
Oropouche virus (OROV) is a segmented negative-sense RNA virus in the family Peribunyaviridae, order Bunyavirales, genus Orthobunyavirus. It causes Oropouche fever, an acute febrile illness transmitted primarily by the biting midge Culicoides paraensis. OROV has caused recurrent outbreaks in the Amazon basin since its discovery in 1955. Three genome segments:
2023-2025 epidemic
The 2023-2025 epidemic represents a major expansion driven by a novel reassortant lineage (Naveca et al., Nat Med 2024, PMID 39293488):
Lineage classification
Historical genotype classification (I-IV, based on the S/N gene with ~5% mean nucleotide divergence) is considered insufficient for current diversity. No standardized lineage nomenclature analogous to SARS-CoV-2 Pango lineages exists. At least 21 reassortment events were identified among 2024 Brazilian genomes alone (PMID 40037296), making per-segment phylogenetics essential since classification by one segment does not predict the others. At least 6 sub-lineages circulate within the OROVBR-2015-2024/2025 clade across different Brazilian states.
Reference strains
Nucleotide divergence between references: L=10.26%, M=3.59%, S=4.14%.
Blocking issues affecting science
These issues are blocking adoption of the dataset. Please address.
H1. placementMaskRanges off-by-one in all 6 datasets
For a GFF3 gene at 1-based inclusive positions [start, end], the 0-based half-open UTR masks should be [0, start-1) for the 5' UTR and [end, sequence length) for the 3' UTR.
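To illustrate the coordinate conversion in this issue, here is a minimal Python sketch. The function name `utr_mask_ranges`, the `begin`/`end` key names, and the example coordinates are assumptions made for illustration; they are not taken from the submitted datasets.

```python
def utr_mask_ranges(gene_start: int, gene_end: int, seq_len: int) -> list[dict]:
    """Return 0-based half-open masks for the UTRs flanking a single gene.

    gene_start and gene_end are 1-based inclusive, as in GFF3.
    """
    if not 1 <= gene_start <= gene_end <= seq_len:
        raise ValueError("expected 1 <= gene_start <= gene_end <= seq_len")

    masks = []
    if gene_start > 1:
        # 5' UTR: 1-based positions 1..gene_start-1 -> 0-based [0, gene_start-1)
        masks.append({"begin": 0, "end": gene_start - 1})
    if gene_end < seq_len:
        # 3' UTR: 1-based positions gene_end+1..seq_len -> 0-based [gene_end, seq_len)
        masks.append({"begin": gene_end, "end": seq_len})
    return masks


# Hypothetical coordinates, for illustration only.
masks = utr_mask_ranges(gene_start=45, gene_end=746, seq_len=944)
```

With these bounds the masked UTR lengths and the gene length add up to the sequence length, which is a quick sanity check for off-by-one errors.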
Effect: First nucleotide of each gene (start codon 'A' of ATG) is excluded from placement distance calculation; one 3' UTR nucleotide (immediately after stop codon) is included. Practical impact is minor (1 nt each direction) but the error is systematic.
Fix: Subtract 1 from each
Non-blocking issues
Cosmetic issues and small inconsistencies, low-priority. Fix if time allows.
M1. Repository URLs mismatch between pathogen.json and README
All 6
All 6 README files point to
Three discrepancies:
Users following the
Fix: Unify all URLs to the institutional repository
M2. Typo "Oropouch" in all 6 CHANGELOG.md files
All changelogs say "Oropouch Virus" instead of "Oropouche Virus" (missing 'e'). Files affected:
Fix: Replace "Oropouch" with "Oropouche" in all 6 files.
M3. Inconsistent "oroV" casing in all 6 README.md titles
All 6 README.md files use "oroV" with a capital V in the title (e.g.,
Files affected:
Fix: Change "oroV" to "OROV" (standard abbreviation) in all README titles.
M4. Missing trailing newlines on all 24 text files
Every
Files affected: all
Fix: Add trailing newline to all text files.
L1. Tefe CHANGELOG says "NCBI tefe" - misleading phrasing
All 3 tefe CHANGELOG files say "based on NCBI tefe (ILMD_TF29) genome". The Tefe reference is a GenBank submission (PP154172.1, PP154171.1, PP154170.1), not an NCBI RefSeq. Phrasing "NCBI tefe" is misleading. Files affected:
Fix: Rephrase to "based on Tefe outbreak reference genome (ILMD_TF29)" in all tefe CHANGELOG files.
L2. S/refseq reference is truncated (754 vs 944 nt)
The OROV S segment genomic RNA is typically ~940-960 nt. NC_005777.1 at 754 nt is missing ~190 nt of 3' UTR compared to the Tefe reference PP154170.1 (944 nt). Coding regions are identical in length.
Full-length S segment sequences analyzed with S/refseq will have ~190 unaligned nucleotides at the 3' end. This does not affect amino acid analysis but inflates missing/unaligned data metrics. The Tefe reference is more representative for complete genomes.
Fix: Add a note to S/refseq README documenting the truncated 3' UTR and recommending S/tefe for full-length sequences.
Notes
Non-issues. Curious observations and positive patterns that do not require action.
Thanks so much! This looks generally really good. I assume there is no clade annotation? A few observations from my side:
- Fix typo: "Oropouch" → "Oropouche" in CHANGELOG.md files
- Fix casing: "oroV" → "OROV" in README.md titles
- Update meta URLs from dezordi/nextclade_data_workflows to InstitutoTodosPelaSaude/nextclade-datasets-workflows
- Adjust placementMaskRanges to use 0-based coordinates (off-by-one fix)
- Increase privateMutations cutoff thresholds:
  - L segments: 20 → 30
  - M segments: 15 → 20
  - S segments: 15 → 20
- Add "alignmentPreset": "high-diversity" to L/refseq alignment params
- Add note on truncated 3' UTR to S/refseq README with recommendation to use S/tefe for full-length sequences
- Improve tefe CHANGELOG descriptions (e.g. "NCBI tefe" → "Tefe outbreak reference genome")
- Add missing newlines at end of files
- Update tree.json files with refreshed phylogenetic trees
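For the missing-newline item in the list above, a check along these lines could catch regressions before a push. This is a minimal sketch: the function names and the suffix list are assumptions for illustration, not part of the dataset workflow.

```python
from pathlib import Path


def files_missing_trailing_newline(
    root: str, suffixes: tuple = (".md", ".json", ".gff3", ".fasta")
) -> list[Path]:
    """List non-empty files under root whose last byte is not a newline."""
    missing = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            data = path.read_bytes()
            if data and not data.endswith(b"\n"):
                missing.append(path)
    return missing


def add_trailing_newline(path: Path) -> None:
    """Append a single newline byte to the file."""
    with path.open("ab") as handle:
        handle.write(b"\n")
```

Running `files_missing_trailing_newline` on the dataset directory and calling `add_trailing_newline` on each returned path should leave a second run with an empty result.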
Hi @ivan-aksamentov, thank you for the thorough and detailed review! All issues have been addressed in the latest push (977e4398).
Hi @rneher, thank you very much for your feedback. In this first version, we chose not to include clade annotation, as Oropouche virus does not yet have an official lineage proposal. Given the complexity introduced by the recent reassortment events, we felt it would be more appropriate to first seek broader community input before defining segment-specific clades. In the future, we would be very happy to collaborate with the community to establish and include clade annotations for each segment in a way that is biologically meaningful and broadly supported. In the latest push 977e4398, I addressed your comments regarding the L and S segments. If any further adjustments are necessary, please let me know.
This pull request introduces Oropouche virus Nextclade datasets for quality control. The workflow used to generate these datasets is available in the following repository, as referenced in the dataset README.
We created datasets for all three genome segments and used two references: the NCBI RefSeq reference and a reference from the Tefé outbreak (described here: https://pubmed.ncbi.nlm.nih.gov/29623245), which has been used for assembling Oropouche virus genomes from recent outbreaks.
The nucleotide divergence between the two references for each segment is as follows: