Notes on off-target coverage calculation filtered by dbSNP loci #18
alkaZeltser
started this conversation in
General
Replies: 1 comment 2 replies
-
Just trying to understand this comment but
-
I'm working on adding a new feature that will calculate read depth in regions outside of a target file.
Why calculate off-target coverage in a targeted sequencing experiment?
Targeted sequencing is not 100% efficient; typically there will be reads that map outside of the specifically targeted regions. These reads provide extra genomic information that we can take advantage of. When performing variant calling in a targeted sequencing experiment, it is typical to restrict calling to just the targeted regions. However, it can be useful to expand variant-calling regions to include those with substantial off-target coverage, in order to increase the call set of high-quality variants. These extra "bonus" variants can then be used to substantially improve the imputation of additional common variants, allowing for the investigation of genetic loci beyond just those included in the targeted experiment. Analyzing off-target coverage produces an extra set of coordinates to guide variant calling beyond just the target file.
Efficiency of off-target coverage evaluation
"Off-target" technically refers to the entirety of the genome not encompassed by a targeted panel, which is typically a very large area. Calculating the read-depth at every single genomic locus takes a long time. But not all loci are useful. The read depth at one locus is a good estimate of the read-depth at the surrounding loci within the length of a typical read. Also, the most useful loci are those of common variants, which can then be used to improve imputation. To increase the efficiency of off-target coverage calculation, we can restrict calculations to just common variants that are a typical read-length (150bp) distant from each other.
The technical details
`samtools depth` has an option to provide a coordinate file, which it then uses to report depth only at the given coordinates. The documentation says it needs to be a BED file, but VCFs work as well. Thus the following command can be used to calculate read depth at a specific set of variant coordinates:

samtools depth -b ${VCF}

where `VCF` is a file of common variants from e.g. dbSNP. The dbSNP reference provides coordinates of known common variants. We can further thin these variants down to just one within every 150bp window using `vcftools --thin 150`. A final curated, thinned reference file can be obtained by chaining these steps together (substitute paths accordingly). See below for details on various technical difficulties in handling our dbSNP reference files.

Troubleshooting dbSNP reference file
We have dbSNP reference files on the cluster in VCF format under /hot/ref/. In their original form, the chromosomes are encoded using accession numbers, e.g. NC_00000001, instead of the `chr1` standard. This is not compatible with genomic data output by our alignment pipelines. There are chromosome-recoded versions of the dbSNP reference here: /hot/ref/database/dbSNP-155/original/GRCh38/human_9606_b151_GRCh38p7/00-common_all_chr.vcf.gz and here: /hot/ref/tool-specific-input/RecSNV/GRCh38/dbsnp_b150_grch38.vcf.gz
`samtools` seems to struggle with reading the default compression used on dbSNP reference files. They need to be re-compressed using bcftools:

bcftools view -O z

A recompressed version of dbSNP-155 for testing purposes is currently stored in /hot/user/nzeltser/pipeline-targeted-coverage/test/input/dbSNP-155-full-hg38.vcf.gz
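Putting these pieces together, one hedged sketch of a full curation-and-depth workflow might look like the following. Paths, variable names, and output filenames are placeholders, to be substituted for your own:

```shell
# 1. Re-compress the dbSNP VCF with bcftools so samtools can read it.
bcftools view -O z -o dbSNP-recompressed.vcf.gz ${DBSNP_VCF}

# 2. Thin to at most one variant per 150bp window
#    (vcftools writes the result to dbSNP-thinned.recode.vcf).
vcftools --gzvcf dbSNP-recompressed.vcf.gz --thin 150 --recode --out dbSNP-thinned

# 3. Report read depth only at the thinned common-variant loci.
samtools depth -b dbSNP-thinned.recode.vcf ${BAM} > off-target-depth.txt
```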