You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Can you please release a version with this off-by-1 error fixed?
Can you please generate VCF format with REF and ALT sequences like below to allow us to use the sequence in downstream analyses?
As you know, we've been benchmarking vamos alongside other TR callers. We were getting startlingly low exact length concordance between vamos on reads and the assembly allele (20%). However, we did a little digging and were able to tweak the vamos output to produce ~90 % length concordance. To do so, I had to make some assumptions and account for an off-by-one error. Obviously, I would much prefer not to be messing with the output like this.
Here is what we did: vamos --read -S -b ${aln} -r ${vamos_tr_regions} -s ${sample} -o ${sample}.vamos.vcf -t 16
vamos_tr_regions is a random subset of ~40k loci using the efficient motif set.
The -S flag to add alleles to the output VCF was critical to allow us to identify alleles that are identical to the reference. However, it produces a non-standard VCF format where the REF sequence is N and the ALTs are in the sample field.
I wrote a script to fix the VCF:
Add REF alleles by extracting them from the ref fasta using the VCF POS - 1 and END (actually END - 1, see below)
Move ALT alleles from the sample column into the ALT column
Compare REF and ALT, and if they match, change the corresponding GT to 0 and remove from ALT.
As soon as I did this, I realized that most loci differed by 1 bp at the end position between the REF and ALT. From other genotyping methods, we expect most loci to match the reference (i.e., be 0/0). Once I removed the last bp of every REF allele, most loci were 0/0.
So, since the original END coordinates match the input bed file I'm guessing there is 1 bp dropped from the ALT allele at the end coordinate. Given you mentioned an off-by-1 error previously, you may have already fixed this in the code base. But the fix doesn't appear to have been released yet.
Hi @mchaisso,
Sorry, this is a long one.
tldr;
As you know, we've been benchmarking vamos alongside other TR callers. We were getting startlingly low exact length concordance between vamos on reads and the assembly allele (20%). However, we did a little digging and were able to tweak the vamos output to produce ~90 % length concordance. To do so, I had to make some assumptions and account for an off-by-one error. Obviously, I would much prefer not to be messing with the output like this.
Here is what we did:
vamos --read -S -b ${aln} -r ${vamos_tr_regions} -s ${sample} -o ${sample}.vamos.vcf -t 16vamos_tr_regions is a random subset of ~40k loci using the efficient motif set.
The
-Sflag to add alleles to the output VCF was critical to allow us to identify alleles that are identical to the reference. However, it produces a non-standard VCF format where the REF sequence isNand the ALTs are in the sample field.I wrote a script to fix the VCF:
Fix vamos VCF dashnowlab/TR-Benchmarking#12
As soon as I did this, I realized that most loci differed by 1 bp at the end position between the REF and ALT. From other genotyping methods, we expect most loci to match the reference (i.e., be 0/0). Once I removed the last bp of every REF allele, most loci were 0/0.
So, since the original END coordinates match the input bed file I'm guessing there is 1 bp dropped from the ALT allele at the end coordinate. Given you mentioned an off-by-1 error previously, you may have already fixed this in the code base. But the fix doesn't appear to have been released yet.
Before:
After:
This now looks much more like I'd expect.
Here is an example locus from HG007
You can see that 1/2 becomes 0/1 as after checking the sequences, it is an exact match for the reference.
cc'ing @avvaruakshay in case I missed anything
Thanks!
Warm regards,
Harriet