Scaling IsoQuant #280
Dear @omarelgarwany

Sorry for the delayed response, I was out of the office for a while. In fact, IsoQuant implements the very same strategy when assigning reads to isoforms: this way, it can quickly assign reads to a small subset of genes within a limited region. However, I suspect the memory peak occurs in the second part, where isoforms are generated and counted. Could you share the log file? Also, do you have any RAM consumption plots for the runs you made?

In general, I don't see any problems with this approach. However, I am not entirely sure whether it will help a lot with that number of cells. I will also think about whether it's possible to implement it inside IsoQuant, since read chunks are already defined.

Best
Hi @andrewprzh

Thanks for responding. Good that you don't see issues. The approach did work to an extent. I didn't create memory trace plots unfortunately, but I have plotted memory consumption and runtime per chunk (for 16 chunks x 24 chromosomes = 384 chunks): As you can see, it doesn't finish for all the chunks. Among the 7 remaining chunks, some were left running for 6 days with 200-600 GB of memory. These chunks were in chromosomes 2, 14 and 22, which contain a lot of immunoglobulin genes where V(D)J recombination takes place, potentially producing a huge number of "novel" transcripts.

Anyway, I have since tested another approach that I'm less sure about. Briefly, it's a two-pass approach where:

That two-pass approach runs much, much faster and consumes much less memory. Even pass two, which is done jointly, consumes far less memory and time. But I recognise this approach might be more problematic. Let me know what you think. I can send you an email if you'd like to see further details. I'm having a hard time extracting the relevant info from the log file because I'd enabled the

Kind regards
Dear @omarelgarwany

Again, sorry for the slow replies. Thank you so much for sharing this information, this is valuable. Of course, I am keen to improve IsoQuant's performance, so I'm mostly interested in those unfinished chunks. Would it be possible to share some of the data so I can investigate this closely and see whether I can come up with something?

Best
Hi @andrewprzh
Thanks a lot for developing IsoQuant. It's been a pleasure using it for isoform quantification/discovery.
My apologies for the long question, but this is a performance issue I've been facing for a while. I'm following up from a previous issue I posted in July (#209), and I suspect others have posted similar issues about memory usage. That being said, I've been very keen on using a combination of isoseq+IsoQuant because I find IsoQuant's approach to identifying read assignments and declaring novel transcripts quite reasonable.
However, since last July, we have expanded our cohort from 15 samples to 61 samples with hundreds of thousands of cells and tens of millions of HiFi reads per sample. Already with 15 samples, I was struggling with memory requirements, so at the time I decided to split the BAM files by chromosome. The obvious disadvantage is that this approach cannot handle supplementary alignments and could potentially affect fusion gene/transcript discovery. But that seemed like a fair price to pay.
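For illustration, splitting by chromosome can be done along these lines (a minimal sketch only; variable names are placeholders, and it assumes coordinate-sorted, indexed BAMs):

# Split one sample's BAM into per-chromosome BAMs (illustrative sketch).
# samtools idxstats lists reference names; the trailing "*" line is unplaced reads.
for chrom in $(samtools idxstats "$sampleBam" | cut -f1 | grep -v '^\*$'); do
    samtools view -b "$sampleBam" "$chrom" > "${sampleBam%.bam}.${chrom}.bam"
    samtools index "${sampleBam%.bam}.${chrom}.bam"
done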
Now with 61 samples, even that approach didn't work, as some chromosomes took > 5 days, > 400 GB and 20 CPUs to run using IsoQuant v3.6.2 (I also tried with as few as 2 CPUs via the command line argument -t). So I had to think about another approach. I'm currently implementing a further sharding approach where I find valid "splitting points" across all samples. I do this by:
1-Finding unmapped regions per sample:
samtools view -F 4 -b ${sampleBam} ${chrom} | bedtools bamtobed -i - | bedtools merge > $mappedRegionsBed
bedtools complement -i $mappedRegionsBed -g chrom.sizes.txt > $unMappedRegionsBed
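To apply step 1 across all samples, one would wrap the two commands above in a loop roughly like this (a sketch; array and file names are placeholders):

# Build one unmapped-regions BED per sample for a given chromosome and collect
# the paths for step 2 (names are illustrative).
unMappedRegionsBed=()
for sampleBam in sample_*.bam; do
    mappedRegionsBed="${sampleBam%.bam}.${chrom}.mapped.bed"
    unmappedBed="${sampleBam%.bam}.${chrom}.unmapped.bed"
    samtools view -F 4 -b "$sampleBam" "$chrom" | bedtools bamtobed -i - | bedtools merge > "$mappedRegionsBed"
    bedtools complement -i "$mappedRegionsBed" -g chrom.sizes.txt > "$unmappedBed"
    unMappedRegionsBed+=("$unmappedBed")
done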
2-Intersection of all unmapped regions across samples:
bedtools multiinter -i ${unMappedRegionsBed[@]} | awk -v count_all_files="${#unMappedRegionsBed[@]}" '$4==count_all_files' | cut -f1,2,3 > unmapped_regions_across_all_samples.bed
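For clarity: the awk filter works because the fourth column of bedtools multiinter output is the number of input files covering that interval. With, say, three samples the output would look roughly like this (coordinates made up for illustration):

chr1    100000    105000    3    1,2,3    1    1    1     <- unmapped in all 3 samples: kept
chr1    105000    107500    2    1,3      1    0    1     <- not shared by all samples: removed by the awk filter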
Now these represent regions that are uncovered by any reads (including reads with gaps, i.e. N operations, in their CIGAR string). Again, this doesn't take care of supplementary alignments, so I just decided to filter them out (see the snippet below) and accept that this is the cost of parallelisation.
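Concretely, supplementary alignments can be dropped up front with the standard SAM flag (2048 = 0x800 marks supplementary alignments; adding 256 = 0x100 would also drop secondary ones), e.g.:

# Remove supplementary alignments before sharding (flag 2048 = 0x800).
samtools view -b -F 2048 "$sampleBam" > "${sampleBam%.bam}.nosupp.bam"
samtools index "${sampleBam%.bam}.nosupp.bam"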
3-Chunking strategy
I'm trying multiple strategies to choose the best set of "splitting points" to make the sharded BAMs as equal as possible in terms of read counts, but this is another story (can provide code if interested; a rough sketch of one such greedy strategy is below).
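As a minimal sketch only (not the exact code I run; the pooled BAM and file names are placeholders), one greedy way to pick splitting points so each shard carries roughly the same number of reads:

# Greedy sketch: walk candidate splitting points (midpoints of the shared
# unmapped regions) and close a shard whenever it reaches ~total/n_chunks reads.
# Region counts are approximate: reads overlapping a boundary are counted in both windows.
bam=pooled.bam        # placeholder: a merged/representative BAM for this chromosome
chrom=chr2
n_chunks=16

total=$(samtools view -c "$bam" "$chrom")
target=$(( total / n_chunks ))

shard_start=1
prev=1
acc=0
awk -v c="$chrom" '$1==c {print int(($2+$3)/2)}' unmapped_regions_across_all_samples.bed |
while read -r point; do
    n=$(samtools view -c "$bam" "${chrom}:${prev}-${point}")
    acc=$(( acc + n ))
    prev=$(( point + 1 ))
    if [ "$acc" -ge "$target" ]; then
        printf '%s\t%s\t%s\n' "$chrom" "$shard_start" "$point"   # one shard interval (1-based coordinates here)
        shard_start=$(( point + 1 ))
        acc=0
    fi
done
# whatever remains after the last emitted point forms the final shard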
My question is:
Can you foresee any serious problems with this approach? The obvious advantage is that it is now possible to run much smaller chunks that don't require hundreds of GBs of memory. The disadvantage is that there will have to be a bit of playing around to ensure things can be collected appropriately afterwards. For example, I will have to modify novel transcript names (e.g. there will be multiple transcript123.chr1 entries across chunks).
Other than supplementary alignments and transcript naming, is there any other reason not to continue with this approach? (A rough sketch of the renaming I have in mind is below.)
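For the renaming, I was thinking of something along these lines (a sketch only; the chunk directory layout, the output file name and the transcript_id pattern are my assumptions, not a documented IsoQuant naming scheme):

# Prefix novel transcript IDs with the chunk name so that e.g. transcript123.chr1
# from two different chunks no longer collide; known (e.g. ENST) IDs are left untouched.
for gtf in chunk_*/OUT.extended_annotation.gtf; do
    chunk=${gtf%%/*}
    sed -E "s/transcript_id \"(transcript[^\"]*)\"/transcript_id \"${chunk}.\1\"/g" "$gtf" \
        > "${chunk}.renamed.gtf"
done
# The same renaming would have to be applied to novel gene_ids and to the
# transcript count/TPM tables so they stay consistent with the merged GTF.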
My command