Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add XY filtration workflow #191

Merged
merged 103 commits into from
Mar 25, 2025
Merged

Add XY filtration workflow #191

merged 103 commits into from
Mar 25, 2025

Conversation

Faizal-Eeman
Copy link
Contributor

@Faizal-Eeman Faizal-Eeman commented Dec 4, 2024

Description

ADD XY filter workflow

Closes #190

Testing Results

  • N-T paired WGS (sample_sex = XY)

    • samples: NA24149, NA24143
    • input YAML: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/NA24143.yaml
    • config: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/filter-xy-gSNP.config
    • output: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/call-gSNP-10.1.0-rc.1/NA24143/GATK-4.5.0.0/output/
  • N-T paired WGS (sample_sex = XX)

    • samples: NA24149, NA24143
    • input YAML: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/NA24143.yaml
    • config: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/test-XX/filter-xx-gSNP.config
    • output: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/test-XX/call-gSNP-10.1.0-rc.1/NA24143/GATK-4.5.0.0/output/

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample.

@Faizal-Eeman
Copy link
Contributor Author

#191 (comment)

This still needs to be done, including for the mouse PAR BED

@yashpatel6 I've created issues to move the files to a standard location.
https://github.com/uclahs-cds/group-dataset-standardization/issues/174
https://github.com/uclahs-cds/group-dataset-standardization/issues/173

@Faizal-Eeman Faizal-Eeman requested a review from yashpatel6 March 20, 2025 17:38
@Faizal-Eeman
Copy link
Contributor Author

Faizal-Eeman commented Mar 20, 2025

XY Filter test on latest commit.

Output - /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/human-sample/test-XY/call-gSNP-10.1.0-rc.1/NA24143/GATK-4.5.0.0/output/Hail-0.2.133_GATK-4.5.0.0_TEST_NA24143.vcf.bgz

@yashpatel6 this PR is ready again now.

Copy link
Collaborator

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good! One small comment on the README

README.md Outdated
@@ -115,6 +118,8 @@ For normal-only or tumor-only samples, exclude the fields for the other state.
|:----------------|:---------|:-----|:------------|
| `dataset_id` | Yes | string | Dataset ID |
| `blcds_registered_dataset` | Yes | boolean | Set to true when using BLCDS folder structure; use false for now |
| `xy_filter` | Yes | boolean | Set to true to produce an XY Filtered VCF |
| `sample_sex` | Yes | string | Sample Sex, `XY` or `XX` if `xy_filter = true` |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `sample_sex` | Yes | string | Sample Sex, `XY` or `XX` if `xy_filter = true` |
| `sample_sex` | Yes | string | Sample Sex, `XY` or `XX` |

The parameter is technically required regardless of other parameters

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment:

It might be a bit picky, but I think by definition, if we say sample_sex, it usually refers to 'male' or 'female' like in this example by Illumina. Alternatively, we could replace sample_sex with genetic_sex or chromosomal_sex.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter is technically required regardless of other parameters

I just asked a similar question but how do we process samples with unknown sex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tyamaguchi-ucla
When genetic sex of a sample is not known, users have the option to skip the XY Filter entirely. Future iterations of this workflow would involve determining genetic sex of a sample which may help for unknown cases.

And I think genetic_sex is a better alternative to sample_sex.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For unknown sex, it may be worth adding an explicit unknown option for genetic_sex and then automatically disabling filtering if that's set. Currently, the way it would have to be done is that the filtering parameter would have to be set to false but genetic_sex would still have to be defined to a valid value to avoid parameter validation failing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added unknown as a check to run XY Filter instead of using the xy_filter parameter.

README.md Outdated
@@ -115,6 +118,8 @@ For normal-only or tumor-only samples, exclude the fields for the other state.
|:----------------|:---------|:-----|:------------|
| `dataset_id` | Yes | string | Dataset ID |
| `blcds_registered_dataset` | Yes | boolean | Set to true when using BLCDS folder structure; use false for now |
| `xy_filter` | Yes | boolean | Set to true to produce an XY Filtered VCF |
| `sample_sex` | Yes | string | Sample Sex, `XY` or `XX` if `xy_filter = true` |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment:

It might be a bit picky, but I think by definition, if we say sample_sex, it usually refers to 'male' or 'female' like in this example by Illumina. Alternatively, we could replace sample_sex with genetic_sex or chromosomal_sex.

@Faizal-Eeman
Copy link
Contributor Author

Test log after switching from --sample_sex to --genetic_sex

/hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/human-sample/test-XY/call-gSNP-10.1.0-rc.1/NA24143/log-call-gSNP-10.1.0-rc.1-20250321T030153Z/process-log/filter_XY_Hail-NA24143/log.command.log

@Faizal-Eeman
Copy link
Contributor Author

XY Filter is skipped when genetic_sex = unknown.
Test - /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/human-sample/test-unknown/filter-xy-unknown-human.log

Copy link

@tyamaguchi-ucla tyamaguchi-ucla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Anything else to add? @yashpatel6 @alkaZeltser


output:
path(".command.*")
tuple path("${output_filename}.vcf.bgz"), path("${output_filename}.vcf.bgz.tbi"), emit: xy_filtered_vqsr

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (non-blocking):

I haven't used vcf.bgz and vcf.bgz.tbi before but does this file type work like the gz version? (gzip vs bzip2)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vcf.bgz is a block compressed VCF where the data is stored in blocks which enables faster reading of the VCF. Whereas gzip requires decompression of the whole compressed VCF.

Size wise, gzip and bzip2 should offer better compression but in this case bgzipped VCFs are slightly smaller (perhaps due to filtration).

BCFtools can be used to work with vcf.bgz output here.

Copy link
Collaborator

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@Faizal-Eeman Faizal-Eeman merged commit 1fbb473 into main Mar 25, 2025
5 checks passed
@Faizal-Eeman Faizal-Eeman deleted the mmootor-add-xy-filtration branch March 25, 2025 18:09
Copy link

@alkaZeltser alkaZeltser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry didn't get my review in before the merge, I made two more comments!

@@ -115,6 +118,7 @@ For normal-only or tumor-only samples, exclude the fields for the other state.
|:----------------|:---------|:-----|:------------|
| `dataset_id` | Yes | string | Dataset ID |
| `blcds_registered_dataset` | Yes | boolean | Set to true when using BLCDS folder structure; use false for now |
| `genetic_sex` | Yes | string | Sample Sex, `XY`, `XX` or `unknown` |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the plan is with metapipeline integration, but if I were to run call-gSNP separately in a bunch of jobs, it would be a massive pain to create per-sample configs varying this parameter in a multi-sex cohort. I think every other config param can be global to a cohort, except for this one?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of options that can be taken for this:

  • genetic_sex is a simple parameter that could be passed through the sample input YAMLs for multi-sex cohorts as those would need to be generated separately anyways
  • In any other single-sex cohort, it can be in either the YAMLs or config since you'd only need the one config anyways


if genetic_sex == 'XY':
#If MALE (XY), remove heterozygous non-PAR chrX calls
non_par_filtered_variants = non_par_variants.filter_rows(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are PAR handled from the side of the Y chromosome? Does Y also experience heterozygous calls? Is it possible to have a case in males where a genotype in a PAR is double-called? Once on X and once on Y? This would be a scenario where reads from Y are aligning to X and vice versa.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add XY filtration to correct chrX and chrY variant calls
4 participants