diff --git a/CHANGELOG.md b/CHANGELOG.md index f43ef3b..752d81b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,20 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm --- +## [1.1.0] - 2024-10-17 + +### Added + +- Add Delly2-sSV support +- Add preprocessing script for Strelka2 VCFs to add GT field for Funcotator +- Add process to split large VCFs to parse in chunks + +### Changed + +- Standardize annotation workflow to always annotate variants after LiftOver +- Rename Delly2 to Delly2-gSV +- Upgrade to score version 1.20 + ## [1.0.0] - 2024-09-10 ### Added diff --git a/README.md b/README.md index d314d62..006d7cd 100644 --- a/README.md +++ b/README.md @@ -3,10 +3,9 @@ - [Overview](#overview) - [How To Run](#how-to-run) - [Pipeline Steps](#pipeline-steps) - - [1. LiftOver variant coordinates](#1-liftover-variant-coordinates) - - [2. Variant annotation](#2-variant-annotation) - - [3. Predict variant stability](#3-predict-variant-stability) - - [Flow Diagram](#flow-diagram) + - [1. LiftOver Variant Coordinates](#1-liftover-variant-coordinates) + - [2. Annotate Variants](#2-annotate-variants) + - [3. Predict Variant Stability](#3-predict-variant-stability) - [Inputs](#inputs) - [Outputs](#outputs) - [Testing and Validation](#testing-and-validation) @@ -19,17 +18,35 @@ ## Overview -StableLift is a machine learning approach designed to predict variant stability across reference genome builds. It addresses challenges in cross-build variant comparison, supplementing LiftOver coordinate conversion with a quantitative "Stability Score" for each variant, indicating the likelihood of consistent representation between the two most commonly used human reference genome builds (GRCh37 and GRCh38). StableLift is provided as a Nextflow pipeline, accepting either GRCh37 or GRCh38 input VCFs from six variant callers (HaplotypeCaller, MuTect2, Strelka2, SomaticSniper, MuSE2, Delly2) spanning three variant types (germline SNPs, somatic SNVs, germline structural variants). Pre-trained models are provided along with performance in a whole genome validation set to define the default F1-maximizing operating point and allow for custom filtering based on pre-calibrated specificity estimates. +StableLift is a machine learning approach designed to predict variant stability across reference genome builds. It addresses challenges in cross-build variant comparison, supplementing LiftOver coordinate conversion with a quantitative "Stability Score" for each variant, indicating the probability of consistent representation across the two most commonly used human reference builds (GRCh37 and GRCh38). -GRCh37 → GRCh38 workflow: - +StableLift is implemented as a Nextflow pipeline featuring: + - Robust LiftOver of SNVs, indels, and structural variants + - Variant annotation with external databases + - Variant filtering based on predicted cross-build stability + +Pre-trained models are provided along with performance in a whole genome validation set to define the default F1-maximizing operating point and allow for custom filtering based on pre-calibrated specificity estimates. + + + +Supported conversions: + - GRCh37->GRCh38 + - GRCh38->GRCh37 + +Supported variant callers: + - HaplotypeCaller (gSNP) + - MuTect2 (sSNV) + - Strelka2 (sSNV) + - SomaticSniper (sSNV) + - MuSE2 (sSNV) + - DELLY2 (gSV, sSV) --- ## How To Run 1. Download and extract [resource bundle](https://github.com/uclahs-cds/pipeline-StableLift/releases/download/v1.0.0/resource-bundle.zip) and [source code](https://github.com/uclahs-cds/pipeline-StableLift/releases/download/v1.0.0/source_code_with_submodules.tar.gz). -2. Download [pre-trained model](https://github.com/uclahs-cds/pipeline-StableLift/releases/tag/v1.0.0) depending on variant caller and conversion direction. +2. Download [pre-trained model](https://github.com/uclahs-cds/pipeline-StableLift/releases/tag/v1.0.0) corresponding to variant caller and conversion direction. 3. Copy [`./config/template.config`](./config/template.config) (e.g. project.config) and fill in all required parameters. 4. Copy [`./input/template.yaml`](./input/template.yaml) (e.g. project.yaml) and update with input VCF ID and path. 5. Run the pipeline using [Nextflow](https://www.nextflow.io/docs/latest/install.html#install-nextflow) `nextflow run -c project.config -params-file project.yaml main.nf`. @@ -38,30 +55,23 @@ GRCh37 → GRCh38 workflow: ## Pipeline Steps -### 1. LiftOver variant coordinates +### 1. LiftOver Variant Coordinates - For SNVs, convert variant coordinates using the `BCFtools` LiftOver plugin with UCSC chain files. - For SVs, convert variant breakpoint coordinates using custom R script with UCSC chain files and `rtracklayer` and `GenomicRanges` R packages. -### 2. Variant annotation* +### 2. Annotate Variants - For SNVs, add dbSNP, GENCODE, and HGNC annotations using GATK's Funcotator. Add trinucleotide context and RepeatMasker intervals with `bedtools`. - For SVs, annotate variants with population allele frequency from the gnomAD-SV v4 database. -- *Variant annotation occurs prior to LiftOver when converting from GRCh38 -> GRCh37. -### 3. Predict variant stability +### 3. Predict Variant Stability - Predict variant stability with pre-trained random forest model and the `ranger` R package. - Annotate VCF with Stability Score and filter unstable variants. --- -## Flow Diagram - - - ---- - ## Inputs UCLA pipelines have a hierarchical configuration structure to reduce code repetition: @@ -98,15 +108,16 @@ input: | Optional Parameter | Type | Default | Description | | --------------------------- | ----------------------------------------------------------------------------------------- | ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `target_threshold` | numeric | `""` | Target Stability Score threshold for variant filtering: [0, 1]. | -| `target_specificity` | numeric | `""` | Target specificity based on whole genome validation set for variant filtering: [0, 1]. | +| `target_threshold` | numeric | `""` | Target Stability Score threshold for variant filtering: [0, 1]. | +| `target_specificity` | numeric | `""` | Target specificity based on whole genome validation set for variant filtering: [0, 1]. | +| `extract_features_cpus` | int | `4` | Number of cpus to use for parallel parsing of large VCFs (>1GB). | | `work_dir` | path | `/scratch/$SLURM_JOB_ID` | Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With `ucla_cds`, the default is `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively. | | `save_intermediate_files` | boolean | false | If set, save output files from intermediate pipeline processes. | | `min_cpus` | int | 1 | Minimum number of CPUs that can be assigned to each process. | | `max_cpus` | int | `SysHelper.getAvailCpus()` | Maximum number of CPUs that can be assigned to each process. | | `min_memory` | [MemoryUnit](https://www.nextflow.io/docs/latest/script.html#implicit-classes-memoryunit) | `1.MB` | Minimum amount of memory that can be assigned to each process. | | `max_memory` | [MemoryUnit](https://www.nextflow.io/docs/latest/script.html#implicit-classes-memoryunit) | `SysHelper.getAvailMemory()` | Maximum amount of memory that can be assigned to each process. | -| `dataset_id` | string | `""` | Dataset ID to be used as output filename prefix. | +| `dataset_id` | string | `""` | Dataset ID to be used as output filename prefix. | | `blcds_registered_dataset` | boolean | false | Set to true when using BLCDS folder structure; use false for now. | | `ucla_cds` | boolean | true | If set, overwrite default memory and CPU values by UCLA cluster-specific configs. | @@ -116,10 +127,10 @@ input: | Output | Description | | ------------ | ------------------------ | -| `*_stability.vcf.gz` | Output VCF in target build coordinates with variant annotations and predicted Stability Scores. | -| `*_stability.vcf.gz.tbi` | Output VCF tabix index. | -| `*_filtered.vcf.gz` | Filtered output VCF with predicted "Unstable" variants removed. | -| `*_filtered.vcf.gz.tbi` | Filtered output VCF tabix index. | +| `*_StableLift-${target_build}.vcf.gz` | Output VCF in target build coordinates with variant annotations and predicted Stability Scores. | +| `*_StableLift-${target_build}.vcf.gz.tbi` | Output VCF tabix index. | +| `*_StableLift-${target_build}_filtered.vcf.gz` | Filtered output VCF with predicted "Unstable" variants removed. | +| `*_StableLift-${target_build}_filtered.vcf.gz.tbi` | Filtered output VCF tabix index. | --- @@ -127,7 +138,7 @@ input: ### Test Dataset -10 whole genomes from [The Cancer Genome Atlas (TCGA-SARC)](https://portal.gdc.cancer.gov/projects/TCGA-SARC) were used to test pipeline outputs and validate model performance. All data was processed using [standardized Nextflow pipelines](https://github.com/uclahs-cds/metapipeline-DNA). Somatic VCFs from GRCh37 and GRCh38 alignments are available for the four supported sSNV callers as [release attachments](https://github.com/uclahs-cds/pipeline-StableLift/releases). +10 whole genomes from [The Cancer Genome Atlas (TCGA-SARC)](https://portal.gdc.cancer.gov/projects/TCGA-SARC) were used to test pipeline outputs and validate model performance. All data was processed using [standardized Nextflow pipelines](https://github.com/uclahs-cds/metapipeline-DNA). Somatic VCFs from GRCh37 and GRCh38 alignments are available for the four supported sSNV callers and DELLY2 sSV as [release attachments](https://github.com/uclahs-cds/pipeline-StableLift/releases). | Donor ID | Normal Sample ID | Tumour Sample ID | |----------------|---------------------------|---------------------------| diff --git a/config/default.config b/config/default.config index 30743ba..25b85f2 100644 --- a/config/default.config +++ b/config/default.config @@ -17,9 +17,9 @@ params { docker_container_registry = "ghcr.io/uclahs-cds" // Docker images - bcftools_version = '1.20_score-1.16-20221221' + bcftools_version = '1.20_score-1.20-20240505' bedtools_version = '2.31.0' - gatk_version = '4.4.0.0' + gatk_version = '4.6.0.0' pipeval_version = '5.0.0-rc.3' samtools_version = '1.20' stablelift_version = '1.0.0' diff --git a/config/methods.config b/config/methods.config index 093b48f..7a8ea71 100644 --- a/config/methods.config +++ b/config/methods.config @@ -50,7 +50,12 @@ methods { 'dest_fasta_id', 'dest_fasta_ref', 'dest_fasta_fai', - 'dest_fasta_dict' + 'dest_fasta_dict', + 'chain_file', + 'repeat_bed', + 'header_contigs', + 'funcotator_data_source', + 'gnomad_rds' ] for (key in advanced_parameters) { @@ -66,17 +71,25 @@ methods { if (liftover_direction in [forward, backward]) { if (liftover_direction == forward) { - params.src_fasta_id = 'hg19' + params.src_fasta_id = 'GRCh37' params.src_fasta_ref = params.fasta_ref_37 - params.dest_fasta_id = 'hg38' + params.dest_fasta_id = 'GRCh38' params.dest_fasta_ref = params.fasta_ref_38 + + params.chain_file = params.resource_bundle_path + "/hg19ToHg38.over.chain" + params.repeat_bed = params.resource_bundle_path + "/GRCh38_RepeatMasker-intervals.bed" + params.header_contigs = params.resource_bundle_path + "/GRCh38_VCF-header-contigs.txt" } else { - params.src_fasta_id = 'hg38' + params.src_fasta_id = 'GRCh38' params.src_fasta_ref = params.fasta_ref_38 - params.dest_fasta_id = 'hg19' + params.dest_fasta_id = 'GRCh37' params.dest_fasta_ref = params.fasta_ref_37 + + params.chain_file = params.resource_bundle_path + "/hg38ToHg19.over.chain" + params.repeat_bed = params.resource_bundle_path + "/GRCh37_RepeatMasker-intervals.bed" + params.header_contigs = params.resource_bundle_path + "/GRCh37_VCF-header-contigs.txt" } params.src_fasta_fai = params.src_fasta_ref + ".fai" @@ -84,6 +97,9 @@ methods { params.src_fasta_dict = Nextflow.file(params.src_fasta_ref).resolveSibling(Nextflow.file(params.src_fasta_ref).getBaseName() + '.dict').toString() params.dest_fasta_dict = Nextflow.file(params.dest_fasta_ref).resolveSibling(Nextflow.file(params.dest_fasta_ref).getBaseName() + '.dict').toString() + + params.funcotator_data_source = params.resource_bundle_path + "/funcotator_dataSources.v1.7.20200521s_StableLift" + params.gnomad_rds = params.resource_bundle_path + "/gnomad.v4.0.sv.Rds" } } @@ -95,7 +111,7 @@ methods { schema.validate() // Validate the branch-specific parameters - if (params.variant_caller == "Delly2") { + if (params.variant_caller == "Delly2-gSV" || params.variant_caller == "Delly2-sSV") { schema.validate("${projectDir}/config/schema-sv.yaml") } else { schema.validate("${projectDir}/config/schema-snv.yaml") diff --git a/config/schema.yaml b/config/schema.yaml index 37ec19f..0e6aea5 100644 --- a/config/schema.yaml +++ b/config/schema.yaml @@ -9,8 +9,8 @@ liftover_direction: required: true help: 'Direction of LiftOver to perform' choices: - - GRCh37toGRCh38 - - GRCh38toGRCh37 + - GRCh37ToGRCh38 + - GRCh38ToGRCh37 variant_caller: type: 'String' @@ -22,7 +22,8 @@ variant_caller: - Strelka2 - Muse2 - SomaticSniper - - Delly2 + - Delly2-gSV + - Delly2-sSV save_intermediate_files: type: 'Bool' diff --git a/config/template.config b/config/template.config index ffb0358..ee5778d 100644 --- a/config/template.config +++ b/config/template.config @@ -7,47 +7,38 @@ includeConfig "${projectDir}/nextflow.config" // Inputs/parameters of the pipeline params { - // Output locations - output_dir = "where/to/save/outputs/" + // Output location + output_dir = "" - // Choices: ["HaplotypeCaller", "Mutect2", "Strelka2", "SomaticSniper", "Muse2", "Delly2"] - variant_caller = "Mutect2" + // Choices: ["HaplotypeCaller", "Mutect2", "Strelka2", "SomaticSniper", "Muse2", "Delly2-gSV", "Delly2-sSV"] + variant_caller = "" + + // Choices: ["GRCh37ToGRCh38", "GRCh38ToGRCh37"] + liftover_direction = "" // Path to pre-trained random forest model rf_model = "" - // Optional parameter specifying target Stability Score threshold for - // variant filtering Default behavior without `target_threshold` or - // `target_specificity` specified uses threshold maximizing F1-score in - // whole genome validation set. Must be in the range [0.0, 1.0]. - // target_threshold = 0.5 - - // Optional parameter specifying target specificity for variant filtering - // based on whole genome validation set Overrides `target_threshold`. Must - // be in the range [0.0, 1.0], - // target_specificity = 0.5 + // Path to reference fasta files + fasta_ref_37 = "" // GRCh37-EBI-hs37d5/hs37d5.fa + fasta_ref_38 = "" // GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta - liftover_direction = "" // Choices: GRCh37ToGRCh38, GRCh38ToGRCh37 + // Path to unpacked resource-bundle.zip + resource_bundle_path = "" - // Path to the GRCh37 reference sequence (FASTA). - fasta_ref_37 = "/hot/ref/reference/GRCh37-EBI-hs37d5/hs37d5.fa" - - // Path to the GRCh38 reference sequence (FASTA). - fasta_ref_38 = "/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta" - - // The liftover chain file between the source and destination FASTA references - // Included in resource-bundle.zip - chain_file = "/hot/ref/tool-specific-input/liftOver/hg19ToHg38.over.chain" + // Optional parameter specifying target Stability Score threshold for variant filtering + // Default behavior without `target_threshold` or `target_specificity` uses threshold maximizing F1-score in whole genome validation set + // Must be in the range [0.0, 1.0] + // target_threshold = 0.5 - // SNV required references - funcotator_data_source = "/hot/ref/tool-specific-input/Funcotator/somatic/funcotator_dataSources.v1.7.20200521s" - // Included in resource-bundle.zip - repeat_bed = "/hot/ref/database/RepeatMasker-3.0.1/processed/GRCh38/GRCh38_RepeatMasker_intervals.bed" + // Optional parameter specifying target specificity for variant filtering based on whole genome validation set + // Overrides `target_threshold` + // Must be in the range [0.0, 1.0] + // target_specificity = 0.95 - // SV required references - // Included in resource-bundle.zip, choose header_contigs based on target build - header_contigs = "/hot/code/nkwang/GitHub/uclahs-cds/project-method-AlgorithmEvaluation-BNCH-000142-GRCh37v38/report/manuscript/publish/GRCh38-vcf-header-contigs.txt" - gnomad_rds = "/hot/code/nkwang/GitHub/uclahs-cds/project-method-AlgorithmEvaluation-BNCH-000142-GRCh37v38/report/manuscript/publish/data/gnomad.v4.0.sv.Rds" + // Optional parameter specifying number of cpus to use for parsing large VCFs + // Defaults to 4 if unset + // extract_features_cpus = 4 } // Setup the pipeline config. DO NOT REMOVE THIS LINE! diff --git a/docs/pipeline.mmd b/docs/pipeline.mmd index 64e8423..8b215f4 100644 --- a/docs/pipeline.mmd +++ b/docs/pipeline.mmd @@ -60,11 +60,9 @@ flowchart TD gatk_func[gatk Funcotator]:::gatk --> bcftools_annotate["`bcftools annotate*RepeatMasker*`"]:::bcftools - --> bcftools_annotate2["`bcftools annotate*Trinucleotide*`"]:::bcftools + --> bcftools_annotate2["`bcftools annotate*TrinucleotideContext*`"]:::bcftools end - blocknote["`**Note:** Annotation is performed prior to LiftOver when converting from GRCh38 to GRCh37`"] - bcftools_liftover ---> gatk_func bcftools_annotate2 --> r_extract_snv[extract-VCF-features.R]:::R end @@ -79,7 +77,7 @@ flowchart TD subgraph Predict Stability ["`        **Predict Stability**`"] r_predict_stability[predict-variant-stability.R]:::R - --> bcftools_annotate3["`bcftools annotate*Stability*`"]:::bcftools + --> bcftools_annotate3["`bcftools annotate*StabilityScore*`"]:::bcftools rf_model([rf_model]):::input .-> r_predict_stability end diff --git a/docs/pipeline.mmd.svg b/docs/pipeline.mmd.svg index 7a19610..7ef4bf2 100644 --- a/docs/pipeline.mmd.svg +++ b/docs/pipeline.mmd.svg @@ -1 +1 @@ -         Predict StabilitySNVSVLiftoverAnnotationbcftools annotateStabilitypredict-variant-stability.Rrf_modelNote: Annotation isperformed prior toLiftOver whenconverting from GRCh38to GRCh37extract-VCF-features.Rfuncotator_sourcesrepeat_bedbcftools annotateTrinucleotidebcftools annotateRepeatMaskergatk Funcotatorchain_filebcftools +liftoverheader_contigschain_filegnomad_rdsextract-VCF-features-SV.RLegendNodesInput FileParameterized InputOutput fileProcessesGATKbcftoolsRscriptGeneric LinuxInput VCFpipevalVariant Type?Output VCFs \ No newline at end of file +         Predict StabilitySNVSVLiftoverAnnotationbcftools annotateStabilityScorepredict-variant-stability.Rrf_modelextract-VCF-features.Rfuncotator_sourcesrepeat_bedbcftools annotateTrinucleotideContextbcftools annotateRepeatMaskergatk Funcotatorchain_filebcftools +liftoverheader_contigschain_filegnomad_rdsextract-VCF-features-SV.RLegendNodesInput FileParameterized InputOutput fileProcessesGATKbcftoolsRscriptGeneric LinuxInput VCFpipevalVariant Type?Output VCFs \ No newline at end of file diff --git a/docs/stablelift-overview.png b/docs/stablelift-overview.png index 2df4554..4274d74 100644 Binary files a/docs/stablelift-overview.png and b/docs/stablelift-overview.png differ diff --git a/main.nf b/main.nf index 4f304b4..b25ee48 100644 --- a/main.nf +++ b/main.nf @@ -150,14 +150,15 @@ workflow { // The values of validated_vcf_with_index are maps with keys vcf, index, and sample_id. // The values of validated_vcf_tuple are tuples of (sample_id, vcf, index). - if (params.variant_caller == "Delly2") { + if (params.variant_caller == "Delly2-gSV" || params.variant_caller == "Delly2-sSV") { // Take the SV branch workflow_extract_sv_annotations( validated_vcf_tuple, input_ch_src_sequence, Channel.value(params.header_contigs), Channel.value(params.gnomad_rds), - Channel.value(params.chain_file) + Channel.value(params.chain_file), + Channel.value(params.variant_caller) ) workflow_extract_sv_annotations.out.liftover_vcf.set { liftover_vcf } @@ -182,6 +183,7 @@ workflow { liftover_vcf, r_annotations, Channel.value(params.rf_model), - Channel.value(params.variant_caller) + Channel.value(params.variant_caller), + Channel.value(params.dest_fasta_id) ) } diff --git a/metadata.yaml b/metadata.yaml index 6fc0035..6fe159a 100644 --- a/metadata.yaml +++ b/metadata.yaml @@ -1,7 +1,7 @@ --- category: "pipeline" description: "Nextflow pipeline for predicting variant stability across reference genome builds." -maintainers: "Nicholas Wiltsie , Nicholas K. Wang " +maintainers: "Nicholas K. Wang , Nicholas Wiltsie " languages: ["Nextflow", "R"] dependencies: ["Nextflow", "R", "Docker"] image_name: "stablelift" diff --git a/module/predict_stability.nf b/module/predict_stability.nf index 9554a18..aa89f59 100644 --- a/module/predict_stability.nf +++ b/module/predict_stability.nf @@ -1,14 +1,9 @@ -include { compress_and_index_HTSlib } from './utils.nf' +include { compress_and_index_tsv } from './utils.nf' process predict_stability_StableLift { container params.docker_image_stablelift containerOptions "-v ${moduleDir}:${moduleDir}" - publishDir path: "${params.output_dir_base}/output", - pattern: "stability.tsv", - mode: "copy", - saveAs: { "StableLift-${sample_id}-${variant_caller}.tsv" } - input: tuple val(sample_id), path(features_rds) path(rf_model) @@ -41,9 +36,9 @@ process run_apply_stability_annotations { container params.docker_image_bcftools publishDir path: "${params.output_dir_base}/output", - pattern: "{stability,filtered}.vcf.gz{,.tbi}", + pattern: "StableLift-${dest_fasta_id}{,_filtered}.vcf.gz{,.tbi}", mode: "copy", - saveAs: { "${sample_id}-${variant_caller}-${it}" } + saveAs: { "${sample_id}_${variant_caller}_${it}" } input: tuple val(sample_id), @@ -51,22 +46,23 @@ process run_apply_stability_annotations { path(stability_tsv, stageAs: 'inputs/*'), path(stability_tsv_tbi, stageAs: 'inputs/*') val(variant_caller) + val(dest_fasta_id) output: tuple val(sample_id), - path("stability.vcf.gz"), - path("stability.vcf.gz.tbi"), + path("StableLift-${dest_fasta_id}.vcf.gz"), + path("StableLift-${dest_fasta_id}.vcf.gz.tbi"), emit: stability_vcf_with_index tuple val(sample_id), - path("filtered.vcf.gz"), - path("filtered.vcf.gz.tbi"), + path("StableLift-${dest_fasta_id}_filtered.vcf.gz"), + path("StableLift-${dest_fasta_id}_filtered.vcf.gz.tbi"), emit: filtered_vcf_with_index script: - stability_vcf = "stability.vcf.gz" + stability_vcf = "StableLift-${dest_fasta_id}.vcf.gz" stability_vcf_tbi = "${stability_vcf}.tbi" - filtered_vcf = "filtered.vcf.gz" + filtered_vcf = "StableLift-${dest_fasta_id}_filtered.vcf.gz" filtered_vcf_tbi = "${filtered_vcf}.tbi" """ @@ -90,10 +86,10 @@ process run_apply_stability_annotations { stub: """ - touch "stability.vcf.gz" - touch "stability.vcf.gz.tbi" - touch "filtered.vcf.gz" - touch "filtered.vcf.gz.tbi" + touch "StableLift-${dest_fasta_id}.vcf.gz" + touch "StableLift-${dest_fasta_id}.vcf.gz.tbi" + touch "StableLift-${dest_fasta_id}_filtered.vcf.gz" + touch "StableLift-${dest_fasta_id}_filtered.vcf.gz.tbi" """ } @@ -103,6 +99,7 @@ workflow workflow_predict_stability { r_annotations rf_model variant_caller + dest_fasta_id main: @@ -112,17 +109,18 @@ workflow workflow_predict_stability { variant_caller ) - compress_and_index_HTSlib( + compress_and_index_tsv( predict_stability_StableLift.out.stability_tsv ) run_apply_stability_annotations( vcf_with_sample_id.join( - compress_and_index_HTSlib.out.compressed_tsv_with_index, + compress_and_index_tsv.out.compressed_tsv_with_index, failOnDuplicate: true, failOnMismatch: true ), - variant_caller + variant_caller, + dest_fasta_id ) emit: diff --git a/module/scripts/add-GT-Strelka2.R b/module/scripts/add-GT-Strelka2.R new file mode 100644 index 0000000..1efb6d3 --- /dev/null +++ b/module/scripts/add-GT-Strelka2.R @@ -0,0 +1,45 @@ +#!/usr/bin/env Rscript +# add-GT-Strelka2.R +#################################################################################################### +# +# Adds GT to FORMAT fields +# 0/1 for called variants and ./. for missing data +# +#################################################################################################### + +suppressPackageStartupMessages({ + library(vcfR); + library(data.table); + library(argparse); + }); + +################################################################################################### +# Input +################################################################################################### +# Define command line arguments +parser <- ArgumentParser(); +parser$add_argument('--input-vcf', type = 'character', help = 'Strelka2 VCF to add FORMAT/GT field to'); +parser$add_argument('--output-vcf', type = 'character', help = 'Formatted output VCF path'); +args <- parser$parse_args(); + +# Save command line arguments +for (arg in names(args)) { + assign(gsub('_', '.', arg), args[[arg]]); + } + +################################################################################################### +# Main +################################################################################################### +vcf <- read.vcfR(input.vcf); +format.fields <- unlist(strsplit(vcf@gt[1, 'FORMAT'], ':')); + +if ('GT' %in% format.fields) { + message('FORMAT/GT field already exists'); + write.vcf(vcf, output.vcf); +} else { + vcf@meta <- append(vcf@meta, '##FORMAT=', after = grep('##FORMAT', vcf@meta)[1] - 1); + vcf@gt[, 'FORMAT'] <- paste0('GT:', vcf@gt[, 'FORMAT']); + gt <- vcf@gt[, colnames(vcf@gt) != 'FORMAT']; + vcf@gt[, colnames(vcf@gt) != 'FORMAT'] <- ifelse(startsWith(gt, '.'), paste0('./.:', gt), paste0('0/1:', gt)); + write.vcf(vcf, output.vcf); + } diff --git a/module/scripts/extract-VCF-features-SV.R b/module/scripts/extract-VCF-features-SV.R index d991989..d950bf6 100644 --- a/module/scripts/extract-VCF-features-SV.R +++ b/module/scripts/extract-VCF-features-SV.R @@ -20,6 +20,7 @@ suppressPackageStartupMessages({ ################################################################################################### # Define command line arguments parser <- ArgumentParser(); +parser$add_argument('--variant-caller', type = 'character', help = 'One of {Delly2-gSV, Delly2-sSV}'); parser$add_argument('--input-vcf', type = 'character', help = 'Input Delly2 VCF'); parser$add_argument('--source-build', type = 'character', help = 'One of {GRCh37, GRCh38}'); parser$add_argument('--chain-file', type = 'character', help = 'Chain file for coordinate conversion'); @@ -39,10 +40,15 @@ for (arg in names(args)) { ################################################################################################### # vcfR::getINFO() to data.table vcf.info.to.dt <- function(vcf.info) { - vcf.info <- lapply(vcf.info, function(x) vcf.info.string.to.list(x)); - feature.names <- unique(unlist(lapply(vcf.info, names))); - vcf.info <- do.call(mapply, c(FUN = list, lapply(vcf.info, `[`, feature.names))); - setNames(as.data.table(vcf.info), feature.names); + # Split each string by semicolon and convert to a list of key-value pairs + vcf.info <- strsplit(vcf.info, ';'); + vcf.info <- lapply(vcf.info, function(x) { + x <- strsplit(x, '='); + as.list(stats::setNames(sapply(x, `[`, 2), sapply(x, `[`, 1))); + }) + + # Combine the list of key-value pairs into a data table + rbindlist(vcf.info, fill = TRUE); } # Split VCF info field to list @@ -57,12 +63,6 @@ vcf.info.string.to.list <- function(vcf.info, keep.columns = NULL) { return(values); } -calculate.VAF <- function(GT.row) { - total <- sum(GT.row %in% c('0/0', '0/1', '1/1'), na.rm = TRUE) * 2; - alt <- sum(GT.row == '0/1', na.rm = TRUE) + sum(GT.row == '1/1', na.rm = TRUE) * 2; - return(alt / total); - } - get.overlap <- function(start1, end1, start2, end2) { max.length <- pmax((end1 - start1), (end2 - start2)); overlap.length <- pmin(end1, end2) - pmax(start1, start2); @@ -87,7 +87,7 @@ find.SV.match <- function(this.ID, input, reference, overlap, offset) { } annotate.gnomad.features <- function(features.dt, features.dt.gnomad) { - gnomad.features <- c('ID', 'AF', 'POPMAX_AF'); + gnomad.features <- c('ID', 'AF'); features.dt[, c('gnomad.match.ID', 'gnomad.matches') := rbindlist(lapply(ID, find.SV.match, input = features.dt, reference = features.dt.gnomad, overlap = 0.8, offset = 500))]; features.dt <- merge(features.dt, features.dt.gnomad[, ..gnomad.features], all.x = TRUE, by.x = 'gnomad.match.ID', by.y = 'ID'); } @@ -111,18 +111,17 @@ features.dt <- cbind(input.fix[, -c('INFO')], input.info); # Format columns features.dt[, CONSENSUS := NULL]; -numeric.columns <- c('POS', 'QUAL', 'END', 'PE', 'MAPQ', 'SRMAPQ', 'INSLEN', 'HOMLEN', 'SR', 'SRQ', 'CE', 'RDRATIO', 'SVLEN', 'POS2'); -character.columns <- names(features.dt)[!names(features.dt) %in% numeric.columns]; -features.dt[, (numeric.columns) := lapply(.SD, as.numeric), .SDcols = numeric.columns]; -features.dt[, (character.columns) := lapply(.SD, as.character), .SDcols = character.columns]; +features.dt[, c('POS', 'END') := lapply(.SD, as.numeric), .SDcols = c('POS', 'END')]; +if (!'SVLEN' %in% names(features.dt)) features.dt[, SVLEN := 0]; # Extract and aggregate per sample GT fields -gt.fields <- c('GQ', 'RC', 'RDCN', 'DR', 'DV', 'RR', 'RV'); +gt.fields <- c('GQ', 'RC', 'DR', 'DV', 'RR', 'RV'); for (field in gt.fields) { features.dt[, (field) := apply(extract.gt(input.vcf, element = ..field, as.numeric = TRUE), 1, mean, na.rm = TRUE)]; } -features.dt[, COHORT_AF := apply(extract.gt(input.vcf, element = 'GT'), 1, calculate.VAF)]; features.dt[, CIPOS := as.numeric(sapply(CIPOS, function(x) unlist(strsplit(x, ','))[2]))]; +features.dt[, DP := DR + DV + RR + RV] +if (variant.caller == 'Delly2-sSV') features.dt[, VAF := (DV + RV) / DP]; if (source.build == 'GRCh37') { features.dt[, CHROM := paste0('chr', CHROM)]; @@ -170,6 +169,16 @@ if (source.build == 'GRCh38') { grange.target.dt[, CHR2 := ifelse(is.na(CHR2), CHR2, sub('chr', '', CHR2))]; } +features.dt <- features.dt[ID %in% grange.target.dt$ID]; +features.dt <- features.dt[match(input.fix$ID[input.fix$ID %in% features.dt$ID], features.dt$ID)]; +features.dt[, c('CHROM', 'POS', 'END', 'CHR2', 'POS2') := grange.target.dt[, .(seqnames, start, end, CHR2, POS2)]]; +features.dt[!SVTYPE %in% c('BND', 'INS'), SVLEN := END - POS + 1]; +features.dt[SVTYPE == 'INS', SVLEN := INSLEN]; + +if (source.build == 'GRCh37') { + features.dt <- annotate.gnomad.features(features.dt, features.dt.gnomad); + } + ################################################################################################### # Write output VCF ################################################################################################### @@ -186,6 +195,10 @@ for (i in seq_len(nrow(input.fix))) { this.INFO[['CHR2']] <- grange.target.dt[i, CHR2]; this.INFO[['POS2']] <- grange.target.dt[i, POS2]; } + # Add gnomAD AF to INFO field + if (!is.na(features.dt[ID == this.ID, AF])) { + this.INFO[['gnomAD_AF']] <- format(features.dt[ID == this.ID, AF], scientific = FALSE, digits = 6); + } this.INFO <- lapply(names(this.INFO), function(x) paste(x, this.INFO[[x]], sep = '=')); this.INFO <- paste(this.INFO, collapse = ';'); this.INFO <- gsub('IMPRECISE=IMPRECISE', 'IMPRECISE', this.INFO); @@ -200,20 +213,15 @@ lifted.vcf@gt <- as.matrix(input.gt); lifted.vcf@meta <- lifted.vcf@meta[!grepl('^##(contig|reference)', lifted.vcf@meta)]; lifted.vcf@meta <- c(lifted.vcf@meta, header.contigs); +# Add gnomAD_AF to the VCF header +gnomad.header <- '##INFO='; +lifted.vcf@meta <- c(lifted.vcf@meta, gnomad.header); + write.vcf(lifted.vcf, output.vcf); ################################################################################################### # Format features for RF ################################################################################################### -features.dt <- features.dt[ID %in% grange.target.dt$ID]; -features.dt <- features.dt[match(input.fix$ID, features.dt$ID)]; -features.dt[, c('CHROM', 'POS', 'END', 'CHR2', 'POS2') := grange.target.dt[, .(seqnames, start, end, CHR2, POS2)]]; -features.dt[!SVTYPE %in% c('BND', 'INS'), SVLEN := END - POS + 1]; - -if (source.build == 'GRCh37') { - features.dt <- annotate.gnomad.features(features.dt, features.dt.gnomad); - } - continuous.features <- c( 'POS', 'QUAL', @@ -221,23 +229,18 @@ continuous.features <- c( 'PE', 'MAPQ', 'CIPOS', + 'RDRATIO', 'SRMAPQ', 'HOMLEN', 'SR', 'SRQ', 'CE', - 'RDRATIO', 'SVLEN', 'GQ', 'RC', - 'RDCN', - 'DR', - 'DV', - 'RR', - 'RV', - 'AF', - 'gnomad.matches', - 'POPMAX_AF' + 'DP', + 'VAF', + 'AF' ); categorical.features <- c( diff --git a/module/scripts/extract-VCF-features.R b/module/scripts/extract-VCF-features.R index 641d653..4d62c03 100644 --- a/module/scripts/extract-VCF-features.R +++ b/module/scripts/extract-VCF-features.R @@ -22,7 +22,7 @@ parser <- ArgumentParser(); parser$add_argument('--input-vcf', type = 'character', help = 'Input VCF for feature extraction, mutually exclusive with --input-dir'); parser$add_argument('--input-dir', type = 'character', help = 'Directory with VCF subsets for parallelization, mutually exclusive with --input-vcf'); parser$add_argument('--output-rds', type = 'character', help = 'Rds output for input to RF model'); -parser$add_argument('--variant-caller', type = 'character', help = 'One of {HaplotypeCaller, Mutect2, Strelka2, SomaticSniper, Muse2, Delly2}'); +parser$add_argument('--variant-caller', type = 'character', help = 'One of {HaplotypeCaller, Mutect2, Strelka2, SomaticSniper, Muse2}'); parser$add_argument('--ncore', type = 'integer', help = 'Number of cores to use for processing VCF subsets in --input-dir', default = 1); args <- parser$parse_args(); @@ -49,7 +49,10 @@ vcf.info.to.dt <- function(vcf.info) { vcf.info <- strsplit(vcf.info, ';'); vcf.info <- lapply(vcf.info, function(x) { x <- strsplit(x, '='); - as.list(stats::setNames(sapply(x, `[`, 2), sapply(x, `[`, 1))); + as.list(stats::setNames( + sapply(x, function(pair) if (length(pair) == 1) pair[1] else pair[2]), + sapply(x, `[`, 1) + )); }) # Combine the list of key-value pairs into a data table @@ -177,11 +180,6 @@ features.dt.subsets <- foreach(vcf.subset = vcf.subsets) %dopar% { x; } })]; - features.dt[Gencode_34_variantType == 'SNP', TRINUCLEOTIDE_SEQ := TRINUCLEOTIDE]; - # Add substitution base - features.dt[Gencode_34_variantType == 'SNP', ALT := sapply(ALT, function(x) unlist(strsplit(x, ','))[1])]; - features.dt[Gencode_34_variantType == 'SNP' & REF %in% c('A', 'G'), ALT := chartr('ATGC', 'TACG', ALT)]; - features.dt[Gencode_34_variantType == 'SNP' & TRINUCLEOTIDE != '', TRINUCLEOTIDE := paste0(TRINUCLEOTIDE, '->', substr(TRINUCLEOTIDE, 1, 1), ALT, substr(TRINUCLEOTIDE, 3, 3))]; cat(format(Sys.time() - start.subset, nsmall = 2), '\n'); return(features.dt); @@ -197,9 +195,9 @@ cat(format(Sys.time() - start.extract, nsmall = 2), '\n'); ################################################################################################### continuous.features <- c( 'Chromosome Position (POS)' = 'POS', - 'Local GC Content' = 'Gencode_34_gcContent', # - '1000 Genomes Allele Frequency' = 'dbSNP_CAF', # - 'TOPMED Allele Frequency' = 'dbSNP_TOPMED', # + 'Local GC Content' = 'Gencode_34_gcContent', + '1k Genomes Pop AF' = 'dbSNP_CAF', + 'TOPMED Pop AF' = 'dbSNP_TOPMED', 'Sequencing Depth (DP)' = 'DP', 'Variant Frequency (AF)' = 'AF' ); @@ -247,14 +245,15 @@ if (variant.caller == 'SomaticSniper') continuous.features <- c(continuous.featu categorical.features <- c( 'Chromosome (CHR)' = 'CHROM', - 'GENCODE Variant Type' = 'Gencode_34_variantType', # - 'GENCODE Variant Type' = 'Gencode_34_variantClassification', # - 'HGNC Locus Group' = 'HGNC_Locus_Group', # - 'Repeat Class' = 'REPEAT', ## - 'Trinucleotide Substitution' = 'TRINUCLEOTIDE', ## - 'Trinucleotide Context' = 'TRINUCLEOTIDE_SEQ', ## - 'VQSR Culprit' = 'culprit', # HaplotypeCaller - 'Somatic Genotype (SGT)' = 'SGT' # Strelka2 + 'GENCODE Variant Type' = 'Gencode_34_variantType', + 'GENCODE Variant Type' = 'Gencode_34_variantClassification', + 'HGNC Locus Group' = 'HGNC_Locus_Group', + 'Repeat Class' = 'REPEAT', + 'Trinucleotide Context' = 'TRINUCLEOTIDE', + 'VQSR Culprit' = 'culprit', + 'Somatic Genotype (SGT)' = 'SGT', + 'LiftOver Allele Flip' = 'FLIP', + 'LiftOver Allele Swap' = 'SWAP' ); # Extract features and format diff --git a/module/scripts/predict-variant-stability.R b/module/scripts/predict-variant-stability.R index c8b271d..771f41e 100644 --- a/module/scripts/predict-variant-stability.R +++ b/module/scripts/predict-variant-stability.R @@ -18,7 +18,7 @@ suppressPackageStartupMessages({ ################################################################################################### # Define command line arguments parser <- ArgumentParser(); -parser$add_argument('--variant-caller', type = 'character', help = 'One of {HaplotypeCaller, Mutect2, Strelka2, SomaticSniper, Muse2, Delly2}'); +parser$add_argument('--variant-caller', type = 'character', help = 'One of {HaplotypeCaller, Mutect2, Strelka2, SomaticSniper, Muse2, Delly2-gSV, Delly2-sSV}'); parser$add_argument('--features-dt', type = 'character', help = 'Processed Rds file with variant info and annotations'); parser$add_argument('--rf-model', type = 'character', help = 'Pre-trained random forest model Rds file'); parser$add_argument('--specificity', type = 'numeric', help = 'Target specificity based on whole genome validation set, overrides `--threshold`'); @@ -43,20 +43,9 @@ rf.model <- readRDS(rf.model); # Feature engineering #################################################################################################### if (variant.caller == 'HaplotypeCaller') { - normalize.features <- c('FS', 'VQSLOD', 'QD', 'SOR'); - features.dt[, (normalize.features) := lapply(.SD, scale), .SDcols = normalize.features]; - features.dt[, c('QUAL', 'GQ', 'DP') := NULL]; -} else if (variant.caller == 'Mutect2') { - features.dt[, c('TRINUCLEOTIDE') := NULL]; -} else if (variant.caller == 'Muse2') { - features.dt[, c('TRINUCLEOTIDE', 'Gencode_34_variantType') := NULL]; -} else if (variant.caller == 'Strelka2') { - features.dt[, c('TRINUCLEOTIDE_SEQ', 'DP', 'Gencode_34_variantType') := NULL]; -} else if (variant.caller == 'SomaticSniper') { - features.dt[, c('DP', 'BQ', 'GQ', 'MQ', 'SSC', 'Gencode_34_variantType', 'TRINUCLEOTIDE', 'TRINUCLEOTIDE_SEQ', 'Gencode_34_variantClassification', 'Gencode_34_gcContent', 'dbSNP_CAF') := NULL]; -} else if (variant.caller == 'Delly2') { - normalize.features <- c('SR', 'SRQ', 'DV'); + normalize.features <- c('DP', 'VQSLOD'); features.dt[, (normalize.features) := lapply(.SD, scale), .SDcols = normalize.features]; + features.dt[, c('QUAL', 'GQ') := NULL]; } cat('Input data dimensions:\n'); diff --git a/module/snv_annotations.nf b/module/snv_annotations.nf index 362523d..7cca624 100644 --- a/module/snv_annotations.nf +++ b/module/snv_annotations.nf @@ -1,4 +1,34 @@ -include { compress_and_index_HTSlib } from './utils.nf' +include { compress_and_index_vcf; compress_and_index_tsv } from './utils.nf' + +process add_genotype_field { + container params.docker_image_stablelift + containerOptions "-v ${moduleDir}:${moduleDir}" + + publishDir path: "${params.output_dir_base}/intermediate/${task.process.replace(':', '/')}", + pattern: "output.vcf", + mode: "copy", + enabled: params.save_intermediate_files, + saveAs: { "${sample_id}_add-GT.vcf" } + + input: + tuple val(sample_id), path(vcf), path(index) + + output: + tuple val(sample_id), path('output.vcf'), emit: vcf_with_gt + + script: + """ + Rscript "${moduleDir}/scripts/add-GT-Strelka2.R" \ + --input-vcf "${vcf}" \ + --output-vcf output.vcf.gz + gunzip output.vcf.gz + """ + + stub: + """ + touch output.vcf + """ +} process run_Funcotator_GATK { container params.docker_image_gatk @@ -20,11 +50,13 @@ process run_Funcotator_GATK { tuple val(sample_id), path('output.vcf.gz'), emit: funcotator_vcf script: + def ref_version_map = [GRCh37: 'hg19', GRCh38: 'hg38'] + """ gatk Funcotator \ --variant "${vcf}" \ --reference "${dest_fasta_ref}" \ - --ref-version "${dest_fasta_id}" \ + --ref-version "${ref_version_map[dest_fasta_id]}" \ --data-sources-path "${funcotator_sources}" \ --output-file-format VCF \ --output "output.vcf.gz" @@ -117,7 +149,7 @@ process annotate_trinucleotide_BCFtools { container params.docker_image_bcftools publishDir path: "${params.output_dir_base}/intermediate/${task.process.replace(':', '/')}", - pattern: "output.vcf.gz{,.tbi}", + pattern: "output.vcf.gz", mode: "copy", enabled: params.save_intermediate_files, saveAs: { "Trinucleotide-annotated-${sample_id}.vcf.gz" } @@ -149,15 +181,27 @@ process annotate_trinucleotide_BCFtools { """ } -workflow workflow_apply_snv_annotations { +workflow workflow_annotate_snvs { take: vcf_with_sample_id dest_fasta_data main: + if (params.variant_caller == "Strelka2") { + add_genotype_field( + vcf_with_sample_id + ) + compress_and_index_vcf( + add_genotype_field.out.vcf_with_gt + ) + vcf_for_annotation = compress_and_index_vcf.out.compressed_vcf_with_index + } else { + vcf_for_annotation = vcf_with_sample_id + } + run_Funcotator_GATK( - vcf_with_sample_id, + vcf_for_annotation, dest_fasta_data, Channel.value(params.funcotator_data_source) ) @@ -172,13 +216,13 @@ workflow workflow_apply_snv_annotations { dest_fasta_data ) - compress_and_index_HTSlib( + compress_and_index_tsv( extract_TrinucleotideContext_BEDTools.out.trinucleotide_tsv ) annotate_trinucleotide_BCFtools( annotate_RepeatMasker_BCFtools.out.repeatmasker_vcf.join( - compress_and_index_HTSlib.out.compressed_tsv_with_index, + compress_and_index_tsv.out.compressed_tsv_with_index, failOnDuplicate: true, failOnMismatch: true ) diff --git a/module/snv_workflow.nf b/module/snv_workflow.nf index fcc44e0..3dadcfa 100644 --- a/module/snv_workflow.nf +++ b/module/snv_workflow.nf @@ -1,4 +1,4 @@ -include { workflow_apply_snv_annotations } from './snv_annotations.nf' +include { workflow_annotate_snvs } from './snv_annotations.nf' process run_liftover_BCFtools { container params.docker_image_bcftools @@ -7,7 +7,7 @@ process run_liftover_BCFtools { pattern: "{reject,liftover}.vcf.gz{,.tbi}", mode: "copy", enabled: params.save_intermediate_files, - saveAs: { filename -> "LiftOver-${sample_id}-${src_fasta_id}-to-${dest_fasta_id}-${filename}" } + saveAs: { filename -> "${sample_id}_LiftOver-${dest_fasta_id}_${filename}" } input: tuple val(sample_id), path(vcf), path(index) @@ -27,11 +27,12 @@ process run_liftover_BCFtools { --src-fasta-ref "${src_fasta_ref}" \ --fasta-ref "${dest_fasta_ref}" \ --chain "${chain_file}" \ + --af-tags "" \ --reject-type z \ --reject "reject.vcf.gz" | \ bcftools view \ --output-type u \ - -e 'REF="." | ALT="."' | \ + -e 'REF="." | ALT="." | POS<10' | \ bcftools sort \ --output-type z \ --write-index=tbi \ @@ -46,10 +47,41 @@ process run_liftover_BCFtools { """ } +process split_vcf_for_feature_extraction { + container params.docker_image_bcftools + + input: + tuple val(sample_id), path(vcf), path(index) + + output: + tuple val(sample_id), path('split_vcf'), emit: split_vcf + + script: + """ + mkdir -p split_vcf + vcf=\$(readlink -f "${vcf}") + if [ \$(du -m "\${vcf}" | cut -f1) -gt 1024 ]; then + bcftools view -H "\${vcf}" | + split -C 1G -d -a2 \ + --additional-suffix=".vcf" \ + - "split_vcf/${sample_id}_split_" + for file in split_vcf/*; do + bcftools view -h "\${vcf}" > "\${file}~" + cat "\${file}" >> "\${file}~" + mv "\${file}~" "\${file}" + done + else + ln -s \${vcf} "split_vcf/${sample_id}.vcf" + fi + """ +} + process extract_VCF_features_StableLift { container params.docker_image_stablelift containerOptions "-v ${moduleDir}:${moduleDir}" + cpus { params.getOrDefault('extract_features_cpus', 4) } + publishDir path: "${params.output_dir_base}/intermediate/${task.process.replace(':', '/')}", pattern: "features.Rds", mode: "copy", @@ -57,7 +89,8 @@ process extract_VCF_features_StableLift { saveAs: { "StableLift-${sample_id}.Rds" } input: - tuple val(sample_id), path(vcf), path(index) + tuple val(sample_id), path(vcf_dir) + val variant_caller output: tuple val(sample_id), path('features.Rds'), emit: r_annotations @@ -65,9 +98,10 @@ process extract_VCF_features_StableLift { script: """ Rscript "${moduleDir}/scripts/extract-VCF-features.R" \ - --input-vcf "${vcf}" \ - --variant-caller ${params.variant_caller} \ - --output-rds "features.Rds" + --input-dir "${vcf_dir}" \ + --output-rds "features.Rds" \ + --ncore ${task.cpus} \ + --variant-caller ${variant_caller} """ stub: @@ -86,52 +120,33 @@ workflow workflow_extract_snv_annotations { main: - // We want to do all of the annotating with the GRCh38 / hg38 reference. If - // the liftover is going from h38 to hg19, defer until after annotations - if (params.liftover_direction == "GRCh37ToGRCh38") { - // Step 1: Liftover - run_liftover_BCFtools( - vcf_with_sample_id, - src_sequence, - dest_sequence, - chain_file - ) - - // Step 2: Annotate with GRCh38 - workflow_apply_snv_annotations( - run_liftover_BCFtools.out.liftover_vcf_with_index, - dest_sequence - ) - - workflow_apply_snv_annotations.out.annotated_vcf.set { annotated_vcf_with_index } - - } else { - // Step 1: Annotate with GRCh38 - workflow_apply_snv_annotations( - vcf_with_sample_id, - src_sequence - ) - - // Step 2: Liftover - run_liftover_BCFtools( - workflow_apply_snv_annotations.out.annotated_vcf, - src_sequence, - dest_sequence, - chain_file - ) - - run_liftover_BCFtools.out.liftover_vcf_with_index.set { annotated_vcf_with_index } - } - - // Step 3: Extract features - // FIXME Parallelize HaplotypeCaller + // Step 1: LiftOver + run_liftover_BCFtools( + vcf_with_sample_id, + src_sequence, + dest_sequence, + chain_file + ) + + // Step 2: Annotate + workflow_annotate_snvs( + run_liftover_BCFtools.out.liftover_vcf_with_index, + dest_sequence + ) + + // Step 3: Split large VCFs + split_vcf_for_feature_extraction( + workflow_annotate_snvs.out.annotated_vcf + ) + + // Step 4: Extract features extract_VCF_features_StableLift( - annotated_vcf_with_index + split_vcf_for_feature_extraction.out.split_vcf, + variant_caller ) - // For consistency with the SV branch, remove the index file from the - // output VCF channel - annotated_vcf_with_index + // For consistency with the SV branch, remove the index file from the output VCF channel + workflow_annotate_snvs.out.annotated_vcf .map { sample_id, vcf, index -> [sample_id, vcf] } .set { annotated_vcf } @@ -139,4 +154,3 @@ workflow workflow_extract_snv_annotations { liftover_vcf = annotated_vcf r_annotations = extract_VCF_features_StableLift.out.r_annotations } - diff --git a/module/sv_workflow.nf b/module/sv_workflow.nf index 12c5a38..cf22f4d 100644 --- a/module/sv_workflow.nf +++ b/module/sv_workflow.nf @@ -11,9 +11,10 @@ process liftover_annotate_SV_StableLift { input: tuple val(sample_id), path(vcf, stageAs: 'inputs/*') val(source_grch_label) - path (header_contigs) - path (chain_file) - path (gnomad_rds) + path(header_contigs) + path(chain_file) + path(gnomad_rds) + val(variant_caller) output: tuple val(sample_id), path('annotations.vcf.gz'), emit: liftover_vcf @@ -22,6 +23,7 @@ process liftover_annotate_SV_StableLift { script: """ Rscript "${moduleDir}/scripts/extract-VCF-features-SV.R" \ + --variant-caller "${variant_caller}" \ --input-vcf "${vcf}" \ --source-build "${source_grch_label}" \ --chain-file "${chain_file}" \ @@ -74,6 +76,7 @@ workflow workflow_extract_sv_annotations { header_contigs gnomad_rds chain_file + variant_caller main: @@ -82,11 +85,12 @@ workflow workflow_extract_sv_annotations { vcf_with_sample_id.map{ [it[0], it[1]] }, // We only need the sample ID - src_sequence.map{ ["hg19": "GRCh37", "hg38": "GRCh38"][it[0]] }, + src_sequence.map{ it[0] }, header_contigs, chain_file, - gnomad_rds + gnomad_rds, + variant_caller ) run_sort_BCFtools( diff --git a/module/utils.nf b/module/utils.nf index 1c8efdf..17e71eb 100644 --- a/module/utils.nf +++ b/module/utils.nf @@ -1,5 +1,35 @@ +process compress_and_index_vcf { + container params.docker_image_samtools + + publishDir path: "${params.output_dir_base}/intermediate/${task.process.replace(':', '/')}", + pattern: "output.vcf.gz{,.tbi}", + mode: "copy", + enabled: params.save_intermediate_files, + saveAs: { "${sample_id}${file(it).getName().replace(file(it).getSimpleName(), "")}" } + + input: + tuple val(sample_id), path(input_vcf) + + output: + tuple val(sample_id), + path('output.vcf.gz'), + path('output.vcf.gz.tbi'), + emit: compressed_vcf_with_index -process compress_and_index_HTSlib { + script: + """ + bgzip ${input_vcf} --output output.vcf.gz + tabix -p vcf output.vcf.gz + """ + + stub: + """ + touch output.vcf.gz + touch output.vcf.gz.tbi + """ +} + +process compress_and_index_tsv { container params.docker_image_samtools publishDir path: "${params.output_dir_base}/intermediate/${task.process.replace(':', '/')}", @@ -9,7 +39,7 @@ process compress_and_index_HTSlib { saveAs: { "${sample_id}${file(it).getName().replace(file(it).getSimpleName(), "")}" } input: - tuple val(sample_id), path(tsv) + tuple val(sample_id), path(input_tsv) output: tuple val(sample_id), @@ -19,18 +49,14 @@ process compress_and_index_HTSlib { script: """ - bgzip ${tsv} --output output.tsv.gz - - tabix \ - --sequence 1 \ - --begin 2 \ - --end 2 \ - output.tsv.gz + sort -k1,1 -k2,2n ${input_tsv} > sorted_input.tsv + bgzip sorted_input.tsv --output output.tsv.gz + tabix --sequence 1 --begin 2 --end 2 output.tsv.gz """ stub: """ - touch "output.tsv.gz" - touch "output.tsv.gz.tbi" + touch output.tsv.gz + touch output.tsv.gz.tbi """ } diff --git a/nextflow.config b/nextflow.config index 38c2ebc..3f08ce1 100644 --- a/nextflow.config +++ b/nextflow.config @@ -1,7 +1,7 @@ // Metadata manifest { name = 'pipeline-StableLift' - author = 'Nicholas Wiltsie, Nicholas K. Wang' + author = 'Nicholas K. Wang, Nicholas Wiltsie' description = 'Nextflow pipeline for predicting variant stability across reference genome builds.' version = '1.0.0' } diff --git a/nftest.yml b/nftest.yml index 6db653a..e1d9f57 100644 --- a/nftest.yml +++ b/nftest.yml @@ -1,7 +1,7 @@ --- global: - temp_dir: test/work - remove_temp: true + temp_dir: temp + remove_temp: false clean_logs: true nf_config: nextflow.config @@ -14,25 +14,11 @@ cases: verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_gSNP_HaplotypeCaller_StableLift-GRCh38.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-HaplotypeCaller-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_gSNP_HaplotypeCaller_StableLift-GRCh38.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_HaplotypeCaller_StableLift-GRCh38.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_gSNP_HaplotypeCaller_StableLift-GRCh38_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-HaplotypeCaller-filtered.vcf.gz' - - - name: Delly2-37 - nf_script: ./main.nf - nf_config: test/Delly2-37.config - params_file: test/yamls/Delly2-37.yaml - skip: false - verbose: true - asserts: - - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_gSV_Delly2_StableLift-GRCh38.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-Delly2-stability.vcf.gz' - - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_gSV_Delly2_StableLift-GRCh38_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-Delly2-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_gSNP_HaplotypeCaller_StableLift-GRCh38_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_HaplotypeCaller_StableLift-GRCh38_filtered.vcf.gz' - name: Muse2-37 nf_script: ./main.nf @@ -42,11 +28,11 @@ cases: verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Muse2_StableLift-GRCh38.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-Muse2-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Muse2_StableLift-GRCh38.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Muse2_StableLift-GRCh38.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Muse2_StableLift-GRCh38_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-Muse2-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Muse2_StableLift-GRCh38_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Muse2_StableLift-GRCh38_filtered.vcf.gz' - name: Mutect2-37 nf_script: ./main.nf @@ -56,11 +42,11 @@ cases: verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Mutect2_StableLift-GRCh38.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-Mutect2-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Mutect2_StableLift-GRCh38.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Mutect2_StableLift-GRCh38.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Mutect2_StableLift-GRCh38_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-Mutect2-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Mutect2_StableLift-GRCh38_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Mutect2_StableLift-GRCh38_filtered.vcf.gz' - name: SomaticSniper-37 nf_script: ./main.nf @@ -70,11 +56,11 @@ cases: verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_SomaticSniper_StableLift-GRCh38.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-SomaticSniper-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_SomaticSniper_StableLift-GRCh38.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_SomaticSniper_StableLift-GRCh38.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_SomaticSniper_StableLift-GRCh38_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-SomaticSniper-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_SomaticSniper_StableLift-GRCh38_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_SomaticSniper_StableLift-GRCh38_filtered.vcf.gz' - name: Strelka2-37 nf_script: ./main.nf @@ -84,39 +70,53 @@ cases: verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Strelka2_StableLift-GRCh38.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-Strelka2-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Strelka2_StableLift-GRCh38.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Strelka2_StableLift-GRCh38.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Strelka2_StableLift-GRCh38_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37-Strelka2-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSNV_Strelka2_StableLift-GRCh38_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Strelka2_StableLift-GRCh38_filtered.vcf.gz' - - name: HaplotypeCaller-38 + - name: Delly2-gSV-37 nf_script: ./main.nf - nf_config: test/HaplotypeCaller-38.config - params_file: test/yamls/HaplotypeCaller-38.yaml + nf_config: test/Delly2-gSV-37.config + params_file: test/yamls/Delly2-gSV-37.yaml + skip: false + verbose: true + asserts: + - script: test/compare-VCFs.sh + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_gSV_Delly2-gSV_StableLift-GRCh38.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Delly2-gSV_StableLift-GRCh38.vcf.gz' + - script: test/compare-VCFs.sh + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_gSV_Delly2-gSV_StableLift-GRCh38_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Delly2-gSV_StableLift-GRCh38_filtered.vcf.gz' + + - name: Delly2-sSV-37 + nf_script: ./main.nf + nf_config: test/Delly2-sSV-37.config + params_file: test/yamls/Delly2-sSV-37.yaml skip: false verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_gSNP_HaplotypeCaller_StableLift-GRCh37.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-HaplotypeCaller-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSV_Delly2-sSV_StableLift-GRCh38.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Delly2-sSV_StableLift-GRCh38.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_gSNP_HaplotypeCaller_StableLift-GRCh37_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-HaplotypeCaller-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh37-to-GRCh38/TCGA-SARC_10TN-WGS_GRCh37_sSV_Delly2-sSV_StableLift-GRCh38_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh37/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh37_Delly2-sSV_StableLift-GRCh38_filtered.vcf.gz' - - name: Delly2-38 + - name: HaplotypeCaller-38 nf_script: ./main.nf - nf_config: test/Delly2-38.config - params_file: test/yamls/Delly2-38.yaml + nf_config: test/HaplotypeCaller-38.config + params_file: test/yamls/HaplotypeCaller-38.yaml skip: false verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_gSV_Delly2_StableLift-GRCh37.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-Delly2-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_gSNP_HaplotypeCaller_StableLift-GRCh37.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_HaplotypeCaller_StableLift-GRCh37.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_gSV_Delly2_StableLift-GRCh37_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-Delly2-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_gSNP_HaplotypeCaller_StableLift-GRCh37_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_HaplotypeCaller_StableLift-GRCh37_filtered.vcf.gz' - name: Muse2-38 nf_script: ./main.nf @@ -126,11 +126,11 @@ cases: verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Muse2_StableLift-GRCh37.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-Muse2-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Muse2_StableLift-GRCh37.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Muse2_StableLift-GRCh37.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Muse2_StableLift-GRCh37_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-Muse2-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Muse2_StableLift-GRCh37_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Muse2_StableLift-GRCh37_filtered.vcf.gz' - name: Mutect2-38 nf_script: ./main.nf @@ -140,11 +140,11 @@ cases: verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Mutect2_StableLift-GRCh37.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-Mutect2-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Mutect2_StableLift-GRCh37.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Mutect2_StableLift-GRCh37.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Mutect2_StableLift-GRCh37_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-Mutect2-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Mutect2_StableLift-GRCh37_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Mutect2_StableLift-GRCh37_filtered.vcf.gz' - name: SomaticSniper-38 nf_script: ./main.nf @@ -154,11 +154,11 @@ cases: verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_SomaticSniper_StableLift-GRCh37.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-SomaticSniper-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_SomaticSniper_StableLift-GRCh37.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_SomaticSniper_StableLift-GRCh37.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_SomaticSniper_StableLift-GRCh37_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-SomaticSniper-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_SomaticSniper_StableLift-GRCh37_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_SomaticSniper_StableLift-GRCh37_filtered.vcf.gz' - name: Strelka2-38 nf_script: ./main.nf @@ -168,8 +168,36 @@ cases: verbose: true asserts: - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Strelka2_StableLift-GRCh37.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-Strelka2-stability.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Strelka2_StableLift-GRCh37.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Strelka2_StableLift-GRCh37.vcf.gz' + - script: test/compare-VCFs.sh + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Strelka2_StableLift-GRCh37_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Strelka2_StableLift-GRCh37_filtered.vcf.gz' + + - name: Delly2-gSV-38 + nf_script: ./main.nf + nf_config: test/Delly2-gSV-38.config + params_file: test/yamls/Delly2-gSV-38.yaml + skip: false + verbose: true + asserts: + - script: test/compare-VCFs.sh + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_gSV_Delly2-gSV_StableLift-GRCh37.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Delly2-gSV_StableLift-GRCh37.vcf.gz' + - script: test/compare-VCFs.sh + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_gSV_Delly2-gSV_StableLift-GRCh37_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Delly2-gSV_StableLift-GRCh37_filtered.vcf.gz' + + - name: Delly2-sSV-38 + nf_script: ./main.nf + nf_config: test/Delly2-sSV-38.config + params_file: test/yamls/Delly2-sSV-38.yaml + skip: false + verbose: true + asserts: + - script: test/compare-VCFs.sh + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSV_Delly2-sSV_StableLift-GRCh37.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Delly2-sSV_StableLift-GRCh37.vcf.gz' - script: test/compare-VCFs.sh - expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/output/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSNV_Strelka2_StableLift-GRCh37_filtered.vcf.gz - actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38-Strelka2-filtered.vcf.gz' + expect: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/expect/TCGA-SARC_10TN-WGS_GRCh38-to-GRCh37/TCGA-SARC_10TN-WGS_GRCh38_sSV_Delly2-sSV_StableLift-GRCh37_filtered.vcf.gz + actual: 'pipeline-StableLift-*/TCGA-SARC_10TN-WGS_GRCh38/StableLift-*/output/TCGA-SARC_10TN-WGS_GRCh38_Delly2-sSV_StableLift-GRCh37_filtered.vcf.gz' diff --git a/test/Delly2-37.config b/test/Delly2-gSV-37.config similarity index 90% rename from test/Delly2-37.config rename to test/Delly2-gSV-37.config index 5c29521..085871a 100644 --- a/test/Delly2-37.config +++ b/test/Delly2-gSV-37.config @@ -3,7 +3,7 @@ includeConfig "${projectDir}/config/methods.config" includeConfig "${projectDir}/nextflow.config" params { - variant_caller = "Delly2" + variant_caller = "Delly2-gSV" liftover_direction = "GRCh37ToGRCh38" } diff --git a/test/Delly2-38.config b/test/Delly2-gSV-38.config similarity index 90% rename from test/Delly2-38.config rename to test/Delly2-gSV-38.config index 3558e50..1896b6c 100644 --- a/test/Delly2-38.config +++ b/test/Delly2-gSV-38.config @@ -3,7 +3,7 @@ includeConfig "${projectDir}/config/methods.config" includeConfig "${projectDir}/nextflow.config" params { - variant_caller = "Delly2" + variant_caller = "Delly2-gSV" liftover_direction = "GRCh38ToGRCh37" } diff --git a/test/Delly2-sSV-37.config b/test/Delly2-sSV-37.config new file mode 100644 index 0000000..960826b --- /dev/null +++ b/test/Delly2-sSV-37.config @@ -0,0 +1,13 @@ +includeConfig "${projectDir}/config/default.config" +includeConfig "${projectDir}/config/methods.config" +includeConfig "${projectDir}/nextflow.config" + +params { + variant_caller = "Delly2-sSV" + liftover_direction = "GRCh37ToGRCh38" +} + +includeConfig "${projectDir}/test/common.config" + +// Setup the pipeline config. DO NOT REMOVE THIS LINE! +methods.setup() diff --git a/test/Delly2-sSV-38.config b/test/Delly2-sSV-38.config new file mode 100644 index 0000000..faa3c1e --- /dev/null +++ b/test/Delly2-sSV-38.config @@ -0,0 +1,13 @@ +includeConfig "${projectDir}/config/default.config" +includeConfig "${projectDir}/config/methods.config" +includeConfig "${projectDir}/nextflow.config" + +params { + variant_caller = "Delly2-sSV" + liftover_direction = "GRCh38ToGRCh37" +} + +includeConfig "${projectDir}/test/common.config" + +// Setup the pipeline config. DO NOT REMOVE THIS LINE! +methods.setup() diff --git a/test/common.config b/test/common.config index db8596d..012e867 100644 --- a/test/common.config +++ b/test/common.config @@ -3,47 +3,30 @@ params { // Parameterized values that tests can use test_base = "/hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/publish" - - test_chains = [ - GRCh37ToGRCh38: "${test_base}/resource/hg19ToHg38.over.chain", - GRCh38ToGRCh37: "${test_base}/resource/hg38ToHg19.over.chain" - ] - model_37_38 = "${test_base}/model/GRCh37-to-GRCh38/RF-model_GRCh37-to-GRCh38" model_38_37 = "${test_base}/model/GRCh38-to-GRCh37/RF-model_GRCh38-to-GRCh37" test_models = [ GRCh37ToGRCh38: [ HaplotypeCaller: "${model_37_38}_gSNP_HaplotypeCaller.Rds", - Delly2: "${model_37_38}_gSV_Delly2.Rds", Muse2: "${model_37_38}_sSNV_Muse2.Rds", Mutect2: "${model_37_38}_sSNV_Mutect2.Rds", SomaticSniper: "${model_37_38}_sSNV_SomaticSniper.Rds", - Strelka2: "${model_37_38}_sSNV_Strelka2.Rds" + Strelka2: "${model_37_38}_sSNV_Strelka2.Rds", + "Delly2-gSV": "${model_37_38}_gSV_Delly2-gSV.Rds", + "Delly2-sSV": "${model_37_38}_sSV_Delly2-sSV.Rds" ], GRCh38ToGRCh37: [ HaplotypeCaller: "${model_38_37}_gSNP_HaplotypeCaller.Rds", - Delly2: "${model_38_37}_gSV_Delly2.Rds", Muse2: "${model_38_37}_sSNV_Muse2.Rds", Mutect2: "${model_38_37}_sSNV_Mutect2.Rds", SomaticSniper: "${model_38_37}_sSNV_SomaticSniper.Rds", - Strelka2: "${model_38_37}_sSNV_Strelka2.Rds" + Strelka2: "${model_38_37}_sSNV_Strelka2.Rds", + "Delly2-gSV": "${model_38_37}_gSV_Delly2-gSV.Rds", + "Delly2-sSV": "${model_38_37}_sSV_Delly2-sSV.Rds" ] ] - test_contigs = [ - GRCh37ToGRCh38: "${test_base}/resource/GRCh38_VCF-header-contigs.txt", - GRCh38ToGRCh37: "${test_base}/resource/GRCh37_VCF-header-contigs.txt" - ] - - // Concrete values for tests - header_contigs = test_contigs[liftover_direction] - gnomad_rds = "${test_base}/resource/gnomad.v4.0.sv.Rds" - chain_file = test_chains[liftover_direction] - - funcotator_data_source = "/hot/ref/tool-specific-input/Funcotator/somatic/funcotator_dataSources.v1.7.20200521s" - repeat_bed = "${test_base}/resource/GRCh38_RepeatMasker-intervals.bed" - rf_model = test_models[liftover_direction][variant_caller] save_intermediate_files = true @@ -51,4 +34,5 @@ params { fasta_ref_37 = "/hot/ref/reference/GRCh37-EBI-hs37d5/hs37d5.fa" fasta_ref_38 = "/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta" + resource_bundle_path = "/hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/publish/resource" } diff --git a/test/yamls/Delly2-37.yaml b/test/yamls/Delly2-gSV-37.yaml similarity index 100% rename from test/yamls/Delly2-37.yaml rename to test/yamls/Delly2-gSV-37.yaml diff --git a/test/yamls/Delly2-38.yaml b/test/yamls/Delly2-gSV-38.yaml similarity index 100% rename from test/yamls/Delly2-38.yaml rename to test/yamls/Delly2-gSV-38.yaml diff --git a/test/yamls/Delly2-sSV-37.yaml b/test/yamls/Delly2-sSV-37.yaml new file mode 100644 index 0000000..10778ae --- /dev/null +++ b/test/yamls/Delly2-sSV-37.yaml @@ -0,0 +1,4 @@ +--- +input: + vcf: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/input/TCGA-SARC_10TN-WGS_GRCh37/TCGA-SARC_10TN-WGS_GRCh37_sSV_Delly2.vcf.gz +sample_id: TCGA-SARC_10TN-WGS_GRCh37 diff --git a/test/yamls/Delly2-sSV-38.yaml b/test/yamls/Delly2-sSV-38.yaml new file mode 100644 index 0000000..2da07e4 --- /dev/null +++ b/test/yamls/Delly2-sSV-38.yaml @@ -0,0 +1,4 @@ +--- +input: + vcf: /hot/project/method/AlgorithmEvaluation/BNCH-000142-GRCh37v38/test/input/TCGA-SARC_10TN-WGS_GRCh38/TCGA-SARC_10TN-WGS_GRCh38_sSV_Delly2.vcf.gz +sample_id: TCGA-SARC_10TN-WGS_GRCh38