Empty results when Using taxator-tk for binning contigs against RefSeq bacterial genomes #68

Open
FarzanehRah opened this issue Dec 6, 2024 · 4 comments

Comments


FarzanehRah commented Dec 6, 2024

Hi,

I want to use taxator-tk to bin my contigs against RefSeq bacterial genomes. To do so, I created a refpack from all bacterial genomes (25,124,056 sequences, 346 GB). Because indexing took very long, after several attempts I used the following options to create the index: -P64 -uRY16 -c -i 100. I then added the -m 100 option to the lastal command in binning-last.bash, but the results are empty:

bash.err:

Loading '~/resistome/AMR_bacteria_association/results/_taxator_results/for_taxator_100_contigs_with_mapped_AMR_id_edit.fa' (total=9763)
Analyzing sample composition: 1 nested taxa with total support of 0 positions
Noise removal: 0 taxa removed
Consensus taxonomy assignment:  done

bash.out:

Aligning sample against sequences in '~/resistome/taxator/refpack_bacteria/refdata.fna' and assigning segments to taxa using 30 threads.
Assigning whole sequences.
Generating summary files.
Results are in ~/resistome/AMR_bacteria_association/results/_taxator_results/taxator_100/'.

Do you have any suggestions or advice for this issue?

Thank you in advance

Owner

fungs commented Dec 7, 2024

Hi @FarzanehRah, this is hard to debug with sparse information, but let's give it a try!

Observations

  1. Some parts of the refpack building are still tricky, but since I don't see error messages, you seem to have done a good job so far.
  2. The output of the binner (the consensus algorithm that is run on the results of taxator) comes first, which is strange.
  3. The contigs file is quite small compared to the refpack, which is very large. Did you consider the BLAST workflow in such a case?

Questions

Please answer those questions so that we can find the problem together:

  1. What is the FASTA header format of the database and the query files? The headers need to be unique and must not contain whitespace characters. Can you paste examples?
  2. There should be an .alignments file in the output folder. Can you count the lines or paste an example (like head /path/to/file)?
  3. Can you post the content of the taxator output gff3 file? That could be very helpful.
  4. Can you list the contents of the refpack using ls -lh /path/to/refpack and du -hls /path/to/refpack/*? I suspect there might be an issue with the alignment index.
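
For question 1, the header requirements can be checked locally before pasting anything. The following is a minimal sketch, not part of taxator-tk: the function name check_fasta_headers and the paths in the usage comments are hypothetical. It assumes uncompressed FASTA input and treats the whole header line after ">" as the identifier. It keeps every header in memory, so for a reference file with millions of sequences it may need several GB of RAM.

```shell
# Report FASTA header problems: headers containing whitespace and duplicate headers.
# Prints one line per problem; prints nothing if the headers look fine.
# (check_fasta_headers is a hypothetical helper, not a taxator-tk command.)
check_fasta_headers() {
    # $1: path to a FASTA file (plain text, not compressed)
    awk '
        /^>/ {
            header = substr($0, 2)                        # drop the leading ">"
            if (header ~ /[ \t]/) print "whitespace: " header
            if (seen[header]++)   print "duplicate: " header
        }
    ' "$1"
}

# Usage (placeholder paths):
# check_fasta_headers /path/to/refpack/refdata.fna
# check_fasta_headers /path/to/contigs.fa
```

Empty output for both files would suggest that the headers are not the cause of the empty results.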

The most reasonable explanation for me at the moment is that the aligner does not generate any hits, so the alignments file is empty, and so is the classification. I haven't used these specific alignment parameters with last, but they might just be too stringent. Did you try the blast workflow, which works similarly but with much faster index building and much lower memory usage?
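
Whether the aligner produced any hits can be checked by counting the non-comment lines of the alignments file. This is a small sketch under two assumptions: the file is gzip-compressed, and header or comment lines start with "#". The function name count_alignment_records and the path in the usage comment are hypothetical.

```shell
# Count alignment records in a gzip-compressed alignments file,
# skipping comment/header lines (starting with "#") and blank lines.
# (count_alignment_records is a hypothetical helper, not a taxator-tk command.)
count_alignment_records() {
    # $1: path to a gzip-compressed alignments file
    gzip -dc < "$1" | awk '!/^#/ && NF { n++ } END { print n + 0 }'
}

# Usage (placeholder path):
# count_alignment_records /path/to/output/sample.alignments.gz
```

A count of 0 means that only the header line was written, that is, the aligner reported no hits.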

Author

FarzanehRah commented Dec 8, 2024

Hi Johannes, thank you so much, I really appreciate your prompt responses and valuable suggestions.
Here's the current situation and the steps I've taken so far:


Work Done and Challenges Encountered:

  • LAST Indexing: I started by using LAST to create an index from the RefSeq bacterial genomes. After 4 days and 4.5 TB of data, the job was canceled due to the time limit.

  • Switching to BLAST: I switched to BLAST, which allowed for rapid index creation and a much smaller index size. However, after 2 days of running BLAST against the query files, the cluster admin asked me to stop it (possibly memory leaks with multithreading in the older version of BLAST?).

  • Back to LAST: I consulted the LAST developer and followed their advice to use the options -P64 -uRY16 -c. I also added the -i 100 option, so I needed to add -m 100 to the lastal command on line 93 of binning-last.bash:
    compression_cmd='lz4' decompression_cmd='lz4 -d' $time_cmd -p -o lastal-parallel.time lastal-parallel -f 1 -X 3 -e 40 -m 100 -P "${cores:-$cores_max}" ${last_options:-$last_options_default} "$aligner_index" "$input" |
    The indexing completed in ~8 hours, and the index size is 591 GB.


Responses to Your Questions:

  1. FASTA Header Format (Database and Query Files):
    Here are examples of the FASTA headers:

    • RefSeq Database:
      >NZ_JADJ01000011.1
      >NZ_JADJ01000012.1
      >NZ_JADJ01000017.1
      >NZ_JADJ01000018.1
      >NZ_JADJ01000019.1
      >NZ_JADJ01000020.1
      >NZ_JADJ01000021.1
      
    • Query File:
      >k141_275600
      >k141_352771
      >k141_264579
      >k141_44103
      >k141_154347
      >k141_286636
      >k141_418920
      
  2. Alignments File:
    The sample.alignments.gz file is empty apart from the header line. Here's the output when I checked:

    zcat sample.alignments.gz
    #query ID       query begin     query end       query length    reference ID    reference begin reference end   score  E-value  identities      alignment length        alignment code
    
  3. Taxator GFF3 File:
    The sample.gff3 file is also empty:

    head sample.gff3
    ##gff-version 3
    
  4. Contents of Refpack:
    The contents of the Refpack, as listed by ls -lh and du -hls:

    • ls -lh refpack_bacteria/:

      total 349G
      drwxr-x---. 3 farfar farfar  25K Dec  6 16:19 aligner-index
      drwxr-x---. 3 farfar farfar  25K Dec  1 01:47 aligner-index_blast
      -rwxr-xr-x. 1 farfar farfar  11G Nov 25 01:59 mapping.tax
      drwxr-xr-x. 2 farfar farfar  25K Nov 25 19:43 ncbi-taxonomy
      drwxr-xr-x. 2 farfar farfar  25K Nov 25 19:43 original_files
      -rwxr-xr-x. 1 farfar farfar 1.1T Nov 25 17:16 refdata.fna
      -rw-r-----. 1 farfar farfar 1.1G Dec  4 14:03 refdata.fna.fai
      
    • du -hls refpack_bacteria/*:

      591G    refpack_bacteria/aligner-index
      262G    refpack_bacteria/aligner-index_blast
      2.3G    refpack_bacteria/mapping.tax
      61M     refpack_bacteria/ncbi-taxonomy
      849G    refpack_bacteria/original_files
      346G    refpack_bacteria/refdata.fna
      283M    refpack_bacteria/refdata.fna.fai
      

For your second observation, I should mention that the first part is from the .err file, and the second part is from the .out file of my bash job.

Thanks again for your time.

Owner

fungs commented Dec 9, 2024

As you can see, the aligner doesn't generate any hits for taxator to work with, which is why the results are empty. Possibly there is an issue with the parallel wrapper around lastal, or the sensitivity is simply not high enough with the parameters used. I would take a few query sequences for a test run with both last and blast to find the right alignment parameters. Start in single-thread mode for testing. You can also run plain lastal with your parameters against the constructed refpack to see whether that also produces no alignments. I would also give NCBI blast another try, to verify.
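
One way to set up such a test run is to extract the first few records of the query file and align them with plain lastal in a single thread. The sketch below makes several assumptions: fasta_head is a hypothetical helper, the paths are placeholders, and INDEX stands for the index prefix inside the refpack's aligner-index folder, which is not shown in this thread. The lastal options are the ones quoted from binning-last.bash above, with -P set to 1.

```shell
# Write the first N records of a FASTA file to standard output.
# (fasta_head is a hypothetical helper, not a taxator-tk or LAST command.)
fasta_head() {
    # $1: path to a FASTA file, $2: number of records to keep
    awk -v n="$2" '
        /^>/  { if (++count > n) exit }   # stop once record n+1 starts
        count { print }                   # print header and sequence lines
    ' "$1"
}

# Usage (placeholder paths; INDEX is an assumed variable for the index prefix):
# fasta_head /path/to/contigs.fa 10 > test_queries.fa
# lastal -f 1 -X 3 -e 40 -m 100 -P 1 "$INDEX" test_queries.fa | head
```

If plain lastal also reports no alignments for the subset, the parallel wrapper is unlikely to be the cause, and the alignment parameters or the index become the more likely suspects.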

To work with more recent aligners, you can always go to the binary folder in the taxator-tk installation and replace the binaries with more recent versions. This has always worked fine for last, but blast has proven to be less backward compatible and might need some tweaking. In any case, I update the binaries from time to time, so any feedback would be valuable for me to provide an updated taxator-tk runtime.

Finally, you could also use alignment in protein space, which should be quite sensitive over larger phylogenetic distances. taxator-tk includes a sample blastp pipeline with built-in ORF detection etc. That mode requires a protein database, if I remember correctly.

@FarzanehRah
Author

Hi,
I tested my query files with your prebuilt refpack database for viruses (without adding the -m 100 option) and got very nice results. I will try running plain lastal and see how to use taxator-tk on its results.
Here is an example of my query file.

for_taxator_contigs.txt
Thanks again!
