Run MarkDuplicatesSpark by library #234
@nkwang24 @yashpatel6 For multi-library samples, this approach would help, although it will take some time to implement. Also, it would be helpful to understand the usage of scratch.
@tyamaguchi-ucla agreed. I wrote a script to periodically log the scratch use over the course of a metapipeline run, but as I commented in #229, I can't access the files generated by Spark. The best I've been able to do is correlate sample size with where in metapipeline the failures occur. Based on what I've gathered so far using the latest metapipeline PR, it looks like align-DNA lets samples of up to ~450Gb through; of these, call-gSNP lets ~400Gb through. It's really hard to get a good idea of what's going on, as my tests have been somewhat inconsistent and confounded by stochastic node-level errors, among other possible sources of inconsistency.
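For reference, a minimal sketch of the kind of periodic scratch-usage logger described above. It assumes `/scratch` is the filesystem the pipelines write to and that five-minute sampling is frequent enough to catch peak usage; both are assumptions, and this is not the actual script used:

```bash
#!/usr/bin/env bash
# Hypothetical scratch-usage logger: samples filesystem and per-directory
# usage at a fixed interval. Paths and interval are assumptions.

LOGFILE="scratch_usage.log"
INTERVAL=300  # seconds between samples

while true; do
    # Record a timestamp plus overall filesystem usage for /scratch.
    {
        date '+%Y-%m-%dT%H:%M:%S'
        df -h /scratch | tail -n 1
    } >> "$LOGFILE"
    # Per-directory totals help attribute usage to pipeline steps. Spark's
    # temp files may be unreadable to other users, so du can fail on them
    # (hence stderr is discarded).
    du -sh /scratch/* >> "$LOGFILE" 2>/dev/null
    sleep "$INTERVAL"
done
```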
We discussed this briefly in the metapipeline-DNA meeting, but just to log the discussion here: the intermediate-file deletion within individual pipelines doesn't depend on the write to scratch. From an inter-pipeline deletion perspective, there's only one case where this happens, which is when align-DNA output is deleted from scratch.
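To illustrate that one inter-pipeline deletion case, here is a hypothetical guard; the paths and the completion check are assumptions for illustration, not metapipeline-DNA's actual logic:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: delete align-DNA output from scratch only once the
# downstream pipeline (call-gSNP) has produced non-empty output, so a
# failed downstream run can still be retried. All paths are assumptions.

ALIGN_DNA_OUT="/scratch/align-DNA/output"              # assumed location
CALL_GSNP_BAM="/scratch/call-gSNP/output/sample.bam"   # assumed location

if [ -s "$CALL_GSNP_BAM" ]; then
    rm -rf "$ALIGN_DNA_OUT"
fi
```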
broadinstitute/gatk#8134 This is another good reason to consider using
This is not too urgent, but we probably want to implement the following processes so that we can process large samples with multiple libraries (e.g. CPCG0196-F1) with 2TB scratch:

1. Run MarkDuplicatesSpark at the library level
2. Remove the intermediate files
3. Merge the library-level BAMs with samtools merge

We could parallelize #1 for intermediate-size samples with multiple libraries (e.g. CPCG0196-B1), but I'm not sure this would always be faster because the library-level BAMs need to be merged.
It looks like both of the logs indicated that 2TB of scratch wasn't enough, and we know MarkDuplicatesSpark generates quite a lot of intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove the intermediate files, and then run samtools merge.
Originally posted by @tyamaguchi-ucla in #229 (comment)
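As a concrete sketch of the proposed flow: gatk MarkDuplicatesSpark and samtools merge are the tools discussed in this thread, but the file names, library list, and `--tmp-dir` location below are assumptions for illustration, assuming per-library aligned BAMs already exist:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Run MarkDuplicatesSpark one library at a time so each run's
#    intermediate files fit within the 2TB scratch allocation.
for LIB in libraryA libraryB; do
    gatk MarkDuplicatesSpark \
        -I "${LIB}.aligned.bam" \
        -O "${LIB}.dedup.bam" \
        --tmp-dir /scratch/tmp

    # 2. Remove this library's intermediates before starting the next run.
    rm -rf /scratch/tmp/*
done

# 3. Merge the deduplicated library-level BAMs into a sample-level BAM.
samtools merge -@ 4 sample.dedup.bam libraryA.dedup.bam libraryB.dedup.bam
```

Marking duplicates per library should be sound in principle, since PCR and optical duplicates arise within a library; running the per-library jobs sequentially trades wall-clock time for lower peak scratch usage, which is the same trade-off raised above for parallelizing #1.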