Notes on off-target coverage calculation filtered by dbSNP loci #18
alkaZeltser
started this conversation in
General
Replies: 1 comment 2 replies
-
Just trying to understand this comment but
-
I'm working on adding a new feature that will calculate read depth in regions outside of a target file.
Why calculate off-target coverage in a targeted sequencing experiment?
Targeted sequencing is not 100% efficient; typically there will be reads that map outside of the specifically targeted regions. These reads provide extra genomic information that we can take advantage of. When performing variant calling in a targeted sequencing experiment, it is typical to restrict calling to just the targeted regions. However, it can be useful to expand variant-calling regions to include those with substantial off-target coverage, in order to increase the call set of high-quality variants. These extra "bonus" variants can then be used to substantially improve the imputation of additional common variants, allowing for the investigation of genetic loci beyond just those included in the targeted experiment. Analyzing off-target coverage produces an extra set of coordinates to guide variant calling beyond just the target file.
Efficiency of off-target coverage evaluation
"Off-target" technically refers to the entirety of the genome not encompassed by a targeted panel, which is typically a very large area. Calculating the read-depth at every single genomic locus takes a long time. But not all loci are useful. The read depth at one locus is a good estimate of the read-depth at the surrounding loci within the length of a typical read. Also, the most useful loci are those of common variants, which can then be used to improve imputation. To increase the efficiency of off-target coverage calculation, we can restrict calculations to just common variants that are a typical read-length (150bp) distant from each other.
The technical details
`samtools depth` has an option to provide a coordinate file, which it then uses to report depth only at the given coordinates. The documentation says it needs to be a BED file, but VCFs work as well. Thus the following command can be used to calculate read depth at a specific set of variant coordinates:

samtools depth -b ${VCF}

where `VCF` is a file of common variants from e.g. dbSNP. The dbSNP reference provides coordinates of known common variants. We can further thin these variants down to just one within every 150bp window using `vcftools --thin 150`. A final curated, thinned reference file can be obtained by chaining these steps together (substitute paths accordingly). See below for details on various technical difficulties in handling our dbSNP reference files.

Troubleshooting dbSNP reference file
We have dbSNP reference files on the cluster in VCF format under /hot/ref/. In their original form, the chromosomes are encoded using accession numbers, e.g. NC_00000001, instead of the `chr1` standard. This is not compatible with genomic data output by our alignment pipelines. There are chromosome-recoded versions of the dbSNP reference here: /hot/ref/database/dbSNP-155/original/GRCh38/human_9606_b151_GRCh38p7/00-common_all_chr.vcf.gz and here: /hot/ref/tool-specific-input/RecSNV/GRCh38/dbsnp_b150_grch38.vcf.gz
`samtools` seems to struggle with reading the default compression used on dbSNP reference files. They need to be re-compressed using bcftools:

bcftools view -O z

A recompressed version of dbSNP-155 for testing purposes is currently stored in /hot/user/nzeltser/pipeline-targeted-coverage/test/input/dbSNP-155-full-hg38.vcf.gz
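Putting these pieces together, one hedged sketch of a full curation-and-depth workflow might look like the following. Paths, variable names, and output filenames are placeholders, to be substituted for your own:

```shell
# 1. Re-compress the dbSNP VCF with bcftools so samtools can read it.
bcftools view -O z -o dbSNP-recompressed.vcf.gz ${DBSNP_VCF}

# 2. Thin to at most one variant per 150bp window
#    (vcftools writes the result to dbSNP-thinned.recode.vcf).
vcftools --gzvcf dbSNP-recompressed.vcf.gz --thin 150 --recode --out dbSNP-thinned

# 3. Report read depth only at the thinned common-variant loci.
samtools depth -b dbSNP-thinned.recode.vcf ${BAM} > off-target-depth.txt
```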