Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GRIDSS preprocessing #175

Merged
merged 22 commits into from
Oct 1, 2024
Merged

Add GRIDSS preprocessing #175

merged 22 commits into from
Oct 1, 2024

Conversation

Faizal-Eeman
Copy link
Contributor

@Faizal-Eeman Faizal-Eeman commented Sep 19, 2024

Description

Add GRIDSS preprocessing step.

Closes #176

Testing Results

  • DNA A-mini
    • sample: TWGSAMIN000001-N003-S03-F, TWGSAMIN000001-T003-S03-F
    • input yaml: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-add-GRIDSS/TWGSAMIN000001-T003-S03-F.yaml
    • config: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-add-GRIDSS/TWGSAMIN000001-T003-S03-F.config
    • output: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-add-GRIDSS/call-sSV-6.1.0/S2_v1.1.5/GRIDSS-2.13.2/intermediate/

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample with algorithm = ['delly', 'manta']. The paths to the test config files and output directories are attached above.

@Faizal-Eeman Faizal-Eeman marked this pull request as ready for review September 19, 2024 15:56
@Faizal-Eeman Faizal-Eeman requested a review from a team as a code owner September 19, 2024 15:56
@@ -4,6 +4,17 @@ process {
memory = 1.GB
}

withName: preprocess_BAM_GRIDSS {
cpus = 1
memory = 26.GB
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like only 6GB is utilized. Testing a run with 4 CPUs and 10GB of memory. Will update this thread once done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test successful with the recommended 4CPUs by GRIDSS.
/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-add-GRIDSS/preprocess-4cpu-10GBMem.log

@@ -189,4 +200,12 @@ workflow {
call_sSV_Manta.out.manta_vcfs.flatten()
)
}
if ('gridss2' in params.algorithm) {
preprocess_BAM_GRIDSS(
gridss_ch,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Question | non-blocking): I'm trying to confirm that the preprocess_BAM_GRIDSS process is functioning as you want it to, but I need more context about the intended behavior.

How many times is preprocess_BAM_GRIDSS expected to run? It seems gridss is sample level tuple of [id, path, index] is this pipeline going to be run with multiple samples? I am not familiar with the tools being used here and I'm trying to understand how the sample(s) relate to thegridss_reference_fasta andgridss_reference_files channels. The preprocess_BAM_GRIDSS treats the corresponded expected input from these channels as paths. However, from what I understand these channels will be lists due to the .collect() function. Are they singleton lists which act as channels consumed once? If there will be multiple samples are these singletons meant to be used for each sample? Or do the lists have many elements meant to be consumed one at a time as a channel? Does their order matter/do they need to be synced with the sample passed?

input:
tuple(val(sample_id), path(sample_bam), path(sample_index))
path(gridss_reference_fasta)
path(gridss_reference_files)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Question | non-blocking): Is gridss_reference_files used?

@Faizal-Eeman Faizal-Eeman requested a review from kiarod September 26, 2024 21:34
main.nf Outdated Show resolved Hide resolved
module/gridss.nf Outdated Show resolved Hide resolved
module/gridss.nf Outdated Show resolved Hide resolved
Copy link
Contributor

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor question and can you also address @kiarod's questions?

config/F16.config Show resolved Hide resolved
@Faizal-Eeman
Copy link
Contributor Author

One minor question and can you also address @kiarod's questions?

It seems gridss is sample level tuple of [id, path, index] is this pipeline going to be run with multiple samples?

GRIDSS takes a single tumor/normal pair for variant calling. However, preprocess_BAM_GRIDSS is just step 1 of 3 in the GRIDSS workflow which produces intermediate files per given BAM. Meaning, it produces intermediate BAMs for Normal and Tumor BAM due to which this process takes one BAM at a time as input.

How many times is preprocess_BAM_GRIDSS expected to run?

Two times, one for normal and one for tumor.

I am not familiar with the tools being used here and I'm trying to understand how the sample(s) relate to the gridss_reference_fasta and gridss_reference_files channels

The process requires one BAM input to be passed with a FASTA reference and additional reference files (.fasta.idx, .fasta.gridsscache, .fasta.dict, etc) which are expected to be in the same dir as the FASTA reference.

However, from what I understand these channels will be lists due to the .collect() function. Are they singleton lists which act as channels consumed once? If there will be multiple samples are these singletons meant to be used for each sample? Or do the lists have many elements meant to be consumed one at a time as a channel? Does their order matter/do they need to be synced with the sample passed?

gridss_reference_files is a list because of .collect() which allows all the files in the list to be softlinked in the working dir. The ordering of files in the list doesn't matter as all those files are required to be present in the working dir when the process is run.

@Faizal-Eeman
Copy link
Contributor Author

F32 test - /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-add-GRIDSS/ILHNLNEV000002-T001-P01-F.F32/F32.test.log
F72 test - /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-add-GRIDSS/ILHNLNEV000009-T002-L01-F.F72/F72.test.log

Copy link
Contributor

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Anything else to add @kiarod ?

Copy link

@kiarod kiarod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Faizal-Eeman Faizal-Eeman merged commit 253dd16 into main Oct 1, 2024
5 checks passed
@Faizal-Eeman Faizal-Eeman deleted the mmootor-add-GRIDSS branch October 2, 2024 19:15
@nwiltsie nwiltsie mentioned this pull request Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add GRIDSS preprocessing
3 participants