Scaling Isoquant #280

Open
omarelgarwany opened this issue Jan 24, 2025 · 3 comments

@omarelgarwany

omarelgarwany commented Jan 24, 2025

Hi @andrewprzh

Thanks a lot for developing IsoQuant. It's been a pleasure using it for isoform quantification/discovery.

My apologies for the long question, but this is a performance issue I've been facing for a while. I'm following up on a previous issue I posted in July (#209), and I suspect others have posted similar issues about memory usage. That said, I've been very keen on using a combination of isoseq + IsoQuant because I find IsoQuant's approach to identifying read assignments and declaring novel transcripts quite reasonable.

However, since last July we have expanded our cohort from 15 samples to 61 samples, with hundreds of thousands of cells and tens of millions of HiFi reads per sample. Already with 15 samples I was struggling with memory requirements, so at the time I opted to split the BAM files by chromosome. The obvious disadvantage is that this approach cannot handle supplementary alignments and could affect fusion gene/transcript discovery. But that seemed like a fair price to pay.

Now with 61 samples, even that approach didn't work, as some chromosomes took > 5 days, > 400 GB of memory and 20 CPUs to run using IsoQuant v3.6.2 (I also tried as few as 2 CPUs via the command-line argument -t). So I had to think of another approach. I'm currently implementing a further sharding approach in which I find valid "splitting points" across all samples. I do this by:

1-Finding unmapped regions per sample:
# Keep mapped reads (-F 4 drops unmapped) for one chromosome, convert to BED, merge overlaps
samtools view -F 4 -b ${sampleBam} ${chrom} | bedtools bamtobed -i - | bedtools merge > $mappedRegionsBed
# Complement against chromosome sizes to get regions with no coverage
bedtools complement -i $mappedRegionsBed -g chrom.sizes.txt > $unMappedRegionsBed

2-Intersection of all unmapped regions across samples:
# multiinter's column 4 counts how many input files cover each interval, so keeping
# rows where it equals the number of files yields regions unmapped in every sample
bedtools multiinter -i ${unMappedRegionsBed[@]} | awk -v count_all_files="${#unMappedRegionsBed[@]}" '$4==count_all_files' | cut -f1,2,3 > unmapped_regions_across_all_samples.bed

Now these represent regions that are uncovered by any read, where a spliced read (one with N operations in its CIGAR string) counts as covering its gaps too. Again, this doesn't take care of supplementary alignments, so I just decided to filter them out and accept that this is the cost of parallelisation.
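For reference, a minimal way to drop supplementary (and secondary) alignments before sharding, assuming a standard samtools install:

# 0x900 = secondary (0x100) + supplementary (0x800); keep only primary alignments
samtools view -b -F 0x900 ${sampleBam} > ${sampleBam%.bam}.primary.bam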

3-Chunking strategy
I'm trying multiple strategies to choose the best set of "splitting points" so that the sharded BAMs hold as close to equal numbers of reads as possible, but this is another story (a rough sketch follows; I can provide the full code if interested).
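As an illustration only, a greedy pass over the candidate regions could look like this (the region BED, the per-chunk read target, and counting a single sample are all assumptions):

# Greedily accumulate regions between splitting points until a chunk reaches
# roughly the target read count, then start a new chunk (target is an assumption)
target=5000000
acc=0; chunk=1
while read -r chrom start end; do
    n=$(samtools view -c "${sampleBam}" "${chrom}:$((start + 1))-${end}")  # BED starts are 0-based
    printf '%s\t%s\t%s\tchunk%03d\n' "$chrom" "$start" "$end" "$chunk"
    acc=$((acc + n))
    if [ "$acc" -ge "$target" ]; then chunk=$((chunk + 1)); acc=0; fi
done < candidate_regions.bed > chunk_assignments.bed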

My question is:

Can you foresee any serious problems with this approach? The obvious advantage is that it becomes possible to run much smaller chunks that don't require hundreds of GBs of memory. The disadvantage is that some bookkeeping is needed to collect everything appropriately afterwards. For example, I will have to modify novel transcript names (e.g. there will be multiple transcript123.chr1 identifiers across multiple chunks; see the renaming sketch below).

Other than supplementary alignments and transcript naming, is there any other reason not to continue with this approach?
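For the naming issue, something like this could make novel IDs unique per chunk (the GTF file name and the transcript ID pattern are assumptions based on IsoQuant's usual output, so adjust to the actual files):

# Append the chunk name to every novel transcript ID in one chunk's GTF
awk -v chunk="chunk007" '{ gsub(/transcript[0-9]+\.[A-Za-z0-9_.]+/, "&." chunk); print }' \
    chunk007/OUT.transcript_models.gtf > chunk007.renamed.gtf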

My command

isoquant.py --reference ${fasta} --genedb gencodev46.db --complete_genedb --sqanti_output --bam ${bams[@]} --labels ${sample_ids[@]} --data_type pacbio_ccs -o ${region} -p ${region} --count_exons --check_canonical --read_group tag:CB -t 3 --counts_format linear --bam_tags CB --no_secondary --clean_start

@andrewprzh
Collaborator

Dear @omarelgarwany

Sorry for the delayed response, I was out of the office for a while.

In fact, IsoQuant implements the very same strategy when assigning reads to isoforms: this way it can quickly assign reads to a small subset of genes within a limited region.

However, I suspect the memory peak occurs in the second part, where isoforms are generated and counted. Could you share the log file? Also, do you have any RAM consumption plots for the runs you made?

In general, I don't see any problems with this approach. However, I am not entirely sure whether it will help a lot with that number of cells. I will also think about whether it's possible to implement it inside IsoQuant, since read chunks are already defined.

Best
Andrey

andrewprzh added the performance (Issues related to computational performance) label Feb 10, 2025
@omarelgarwany
Author

Hi @andrewprzh

Thanks for responding. Good to hear that you don't see issues. The approach did work to an extent. Unfortunately I didn't create memory trace plots, but I have plotted memory consumption and runtime per chunk (16 chunks x 24 chromosomes = 384 chunks):

[Image: memory consumption and runtime per chunk]

As you can see, it doesn't finish for all the chunks. Among the 7 remaining chunks, some were left running for 6 days with 200-600 GB of memory. These chunks were in chromosomes 2, 14 and 22, where there are a lot of immunoglobulin genes and a lot of V(D)J recombination takes place, potentially producing a huge number of "novel" transcripts.

Anyway, I have since tested another approach that I'm less sure about. Briefly, it's a two-pass approach where:
1-I split BAMs by chromosome and do a sample-by-sample run with the flag --no_model_construction
2-I take reads that were flagged as "inconsistent", "inconsistent_ambiguous", or "inconsistent_non_intronic" and do a second run, this time jointly (not sample by sample) and with model construction turned on.

That two-pass approach runs much faster and consumes far less memory; even pass two, which is done jointly, takes much less memory and time. But I recognise this approach might be more problematic. Let me know what you think (a rough sketch of the procedure follows).
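In case it helps, this is roughly what pass 1 plus read extraction looks like for one sample (output paths and the assignment-type column are assumptions; check the read_assignments.tsv layout of your IsoQuant version):

# Pass 1: per-sample run, quantification only
isoquant.py --reference ${fasta} --genedb gencodev46.db --complete_genedb \
    --bam sample1.chr1.bam --data_type pacbio_ccs --no_model_construction -o pass1_sample1

# Collect names of reads with inconsistent assignments (type assumed to be in column 6)
zcat pass1_sample1/OUT/OUT.read_assignments.tsv.gz \
    | awk -F'\t' '$6 ~ /^inconsistent/' | cut -f1 | sort -u > sample1.chr1.inconsistent_reads.txt

# Subset the BAM to those reads for the joint pass 2 (samtools >= 1.12 for -N)
samtools view -b -N sample1.chr1.inconsistent_reads.txt sample1.chr1.bam > sample1.chr1.inconsistent.bam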

[Image: memory consumption and runtime for the two-pass approach]

I can send you an email if you'd like to see further details. I'm having a hard time extracting the relevant info from the log file because I'd enabled the --debug flag and there are hundreds of chunks. But I can send you info on specific chunks that you might want to look at.

Kind regards
Omar

@andrewprzh
Collaborator

Dear @omarelgarwany

Again, sorry for the slow replies.

Thank you so much for sharing this information; it is valuable. Of course, I am keen to improve IsoQuant's performance, so I'm mostly interested in those unfinished chunks. Would it be possible to share some of the data so I can investigate this closely and see whether I can come up with something?
And yes, we can continue the communication via email (address in my profile).

Best
Andrey
