-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add XY filtration workflow #191
Conversation
…122-GIABGermlineVariant
@yashpatel6 I've created issues to move the files to a standard location. |
XY Filter test on latest commit. Output - @yashpatel6 this PR is ready again now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good! One small comment on the README
README.md
Outdated
@@ -115,6 +118,8 @@ For normal-only or tumor-only samples, exclude the fields for the other state. | |||
|:----------------|:---------|:-----|:------------| | |||
| `dataset_id` | Yes | string | Dataset ID | | |||
| `blcds_registered_dataset` | Yes | boolean | Set to true when using BLCDS folder structure; use false for now | | |||
| `xy_filter` | Yes | boolean | Set to true to produce an XY Filtered VCF | | |||
| `sample_sex` | Yes | string | Sample Sex, `XY` or `XX` if `xy_filter = true` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| `sample_sex` | Yes | string | Sample Sex, `XY` or `XX` if `xy_filter = true` | | |
| `sample_sex` | Yes | string | Sample Sex, `XY` or `XX` | |
The parameter is technically required regardless of other parameters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment:
It might be a bit picky, but I think by definition, if we say sample_sex
, it usually refers to 'male' or 'female' like in this example by Illumina. Alternatively, we could replace sample_sex
with genetic_sex
or chromosomal_sex
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter is technically required regardless of other parameters
I just asked a similar question but how do we process samples with unknown sex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tyamaguchi-ucla
When genetic sex of a sample is not known, users have the option to skip the XY Filter entirely. Future iterations of this workflow would involve determining genetic sex of a sample which may help for unknown
cases.
And I think genetic_sex
is a better alternative to sample_sex
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For unknown sex, it may be worth adding an explicit unknown
option for genetic_sex
and then automatically disabling filtering if that's set. Currently, the way it would have to be done is that the filtering parameter would have to be set to false but genetic_sex
would still have to be defined to a valid value to avoid parameter validation failing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added unknown
as a check to run XY Filter instead of using the xy_filter
parameter.
README.md
Outdated
@@ -115,6 +118,8 @@ For normal-only or tumor-only samples, exclude the fields for the other state. | |||
|:----------------|:---------|:-----|:------------| | |||
| `dataset_id` | Yes | string | Dataset ID | | |||
| `blcds_registered_dataset` | Yes | boolean | Set to true when using BLCDS folder structure; use false for now | | |||
| `xy_filter` | Yes | boolean | Set to true to produce an XY Filtered VCF | | |||
| `sample_sex` | Yes | string | Sample Sex, `XY` or `XX` if `xy_filter = true` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment:
It might be a bit picky, but I think by definition, if we say sample_sex
, it usually refers to 'male' or 'female' like in this example by Illumina. Alternatively, we could replace sample_sex
with genetic_sex
or chromosomal_sex
.
Test log after switching from
|
XY Filter is skipped when genetic_sex = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Anything else to add? @yashpatel6 @alkaZeltser
|
||
output: | ||
path(".command.*") | ||
tuple path("${output_filename}.vcf.bgz"), path("${output_filename}.vcf.bgz.tbi"), emit: xy_filtered_vqsr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (non-blocking):
I haven't used vcf.bgz
and vcf.bgz.tbi
before but does this file type work like the gz
version? (gzip vs bzip2)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vcf.bgz
is a block compressed VCF where the data is stored in blocks which enables faster reading of the VCF. Whereas gzip requires decompression of the whole compressed VCF.
Size wise, gzip and bzip2 should offer better compression but in this case bgzipped VCFs are slightly smaller (perhaps due to filtration).
BCFtools can be used to work with vcf.bgz
output here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry didn't get my review in before the merge, I made two more comments!
@@ -115,6 +118,7 @@ For normal-only or tumor-only samples, exclude the fields for the other state. | |||
|:----------------|:---------|:-----|:------------| | |||
| `dataset_id` | Yes | string | Dataset ID | | |||
| `blcds_registered_dataset` | Yes | boolean | Set to true when using BLCDS folder structure; use false for now | | |||
| `genetic_sex` | Yes | string | Sample Sex, `XY`, `XX` or `unknown` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the plan is with metapipeline integration, but if I were to run call-gSNP separately in a bunch of jobs, it would be a massive pain to create per-sample configs varying this parameter in a multi-sex cohort. I think every other config param can be global to a cohort, except for this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of options that can be taken for this:
genetic_sex
is a simple parameter that could be passed through the sample input YAMLs for multi-sex cohorts as those would need to be generated separately anyways- In any other single-sex cohort, it can be in either the YAMLs or config since you'd only need the one config anyways
|
||
if genetic_sex == 'XY': | ||
#If MALE (XY), remove heterozygous non-PAR chrX calls | ||
non_par_filtered_variants = non_par_variants.filter_rows( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are PAR handled from the side of the Y chromosome? Does Y also experience heterozygous calls? Is it possible to have a case in males where a genotype in a PAR is double-called? Once on X and once on Y? This would be a scenario where reads from Y are aligning to X and vice versa.
Description
ADD XY filter workflow
Closes #190
Testing Results
N-T paired WGS (
sample_sex = XY
)N-T paired WGS (
sample_sex = XX
)Checklist
I have read the code review guidelines and the code review best practice on GitHub check-list.
I have reviewed the Nextflow pipeline standards.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up or verified the branch protection rule following the github standards before opening this pull request.
I have added my name to the contributors listings in the
manifest
block in thenextflow.config
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
I have added the changes included in this pull request to the
CHANGELOG.md
under the next release version or unreleased, and updated the date.I have updated the version number in the
metadata.yaml
andmanifest
block of thenextflow.config
file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)I have tested the pipeline on at least one A-mini sample.