From 74e05b8daed82e203c25141c95974cf0f8512d5b Mon Sep 17 00:00:00 2001 From: Merritt-Brian <50592701+Merritt-Brian@users.noreply.github.com> Date: Fri, 9 Aug 2024 15:57:58 -0400 Subject: [PATCH] Create v1.md --- v1.md | 167 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 167 insertions(+) create mode 100644 v1.md diff --git a/v1.md b/v1.md new file mode 100644 index 0000000..24c3c31 --- /dev/null +++ b/v1.md @@ -0,0 +1,167 @@ + +## Original Mytax v1.0 + + +This is the repository for mytax, a tool for building custom taxonomies, which can aid nucleotide sequence classification. + +# Installation + +Clone this repo with: + +`git clone https://github.com/jhuapl-bio/mytax` + +Symbolically link all shell scripts into your path, for example with: + +`find v1 -name "*.sh" | while read fn; do sudo ln -s $PWD/$fn /usr/local/bin; done` + +# Dependencies + + - jellyfish (version 1) - https://www.cbcb.umd.edu/software/jellyfish/ + - kraken (version 1) - https://ccb.jhu.edu/software/kraken/ + - gawk - https://www.gnu.org/software/gawk/manual/html_node/Installation.html + - perl + - GNU CoreUtils + + - 16 GB of RAM is needed to build the provided influenza kraken database + +# Usage + +## Building example + +This pipeline is built from a central set of scripts located in the `v1` directory + +Build flu-kraken example with: + +`build_flukraken.sh -k flukraken-$(date +"%F")` + +The single script `build_flukraken.sh` functions as an outer wrapper for the influenza classification example using the Kraken classifier published in the mytax paper. + + +`build_flukraken.sh` can also be used as a model to build modified pipelines as desired. It is built from four main sub-modules: +``` + download_IVR.sh -> download references and taxonomy from IVR + + build_IVR_metadata.sh -> build tab-delimited metadata table in format for mytax + + build_taxonomy.sh -> build custom taxonomy from tab-delimited table + + build_krakendb.sh -> add new taxonomic IDs to reference FASTA, build kraken database, post-process database for visualization pipeline +``` + +`build_krakendb.sh` currently references three helper scripts, which also need to be in the PATH: +``` + fix_references.sh -> adds new taxonomic IDs to reference FASTA + + kraken-build -> builds kraken database + + process_krakendb.sh -> post-processes database for visualization pipeline (not included in this repo yet) +``` + + + +## Running process script on kraken/kraken2 report and outfiles + +### If running from Docker + +docker build . -t jhuaplbio/mytax + +Unix + +`docker container run -it --rm -v $PWD:/data jhuaplbio/mytax bash` + +Windows Powershell + +`docker container run -it --rm -v $pwd:/data jhuaplbio/mytax bash` + + + +## Run the installation script + + +# Activate the env, this will contain kraken2 and centrifuge scripts to build the database if needed as well as kraken2 and centrifuge dependencies + +`conda activate mytax` + +## Lets make a sample.fastq from test-data + +### First, download ncbi taxdump + +``` +python3 src/generate_hierarchy.py -o $PWD/taxdump --report test-data/sample.report -download +rm taxdump.tar.gz +``` + + +### DEPRECATED Kraken1 + +``` +mkdir -p databases/minikraken1 +wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_4GB.tgz -O databases/minikraken1.tgz +tar -xvzf databases/minikraken1.tgz --directory databases/ + +export kraken1db=databases/minikraken_20171013_4GB && \ +kraken --db $kraken1db --output test-data/sample.out test-data/sample.fastq && \ +kraken-report --db $kraken1db test-data/sample.out | tee test-data/sample.report +``` + + +### Kraken2 + +### IF you've made flukraken2 in tmp or.... + +`export KRAKEN2_DEFAULT_DB="tmp/flukraken2` + +### IF you have a pre-made minikraken/other kraken db ready + +``` +kraken2 --report output/sample_metagenome.first.report --output output/sample_metagenome.first.out --memory-mapping --db ~/Desktop/mytax/minikraken2 example-data/sample_metagenome.first.fastq +``` + +### Download minikraken2 + +``` +mkdir -p databases/ +wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904.tgz -O databases/minikraken2.tgz +tar -xvzf databases/minikraken2.tgz --directory databases/ +``` + +### Centrifuge + +#### Install + +`bash install.sh` + +#### Set up centrifuge env + +``` +mkdir -p databases/centrifuge +wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed%2Bh%2Bv.tar.gz -O databases/centrifuge.tgz +tar -xvzf databases/centrifuge.tgz --directory databases/centrifuge/ +``` + + + +#### Run Centrifuge classify + +``` +## If you need to make a new database, see here: $CONDA_PREFIX/lib/centrifuge/centrifuge-build --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp sample.fastq sample + +$CONDA_PREFIX/lib/centrifuge/centrifuge -f -x databases/centrifuge/p_compressed+h+v -q test-data/sample.fastq --report test-data/sample.centrifuge.report > test-data/sample.out +$CONDA_PREFIX/lib/centrifuge/centrifuge-kreport -x databases/centrifuge/p_compressed+h+v test-data/sample.centrifuge.report > test-data/sample.report +``` + +#### Next, generate the hierarchy json file + +``` +python3 server/src/generate_hierarchy.py \ +-o output/sample_metagenome.first.fullstring \ +--report output/sample_metagenome.first.report \ +-taxdump taxonomy/nodes.dmp +``` + +#### Get the json for mytax sunburst plot +``` +bash server/src/krakenreport2json.sh -i output/sample_metagenome.first.fullstring -o output/sample_metagenome.first.json +``` + +The resulting file can then imported into the sunburst plot at `server/src/sunburst/index.html` rendered with a simple `http.server` protocol like `python3 -m http.server 8080`