Commit 74e05b8

Create v1.md
1 parent df42d93 commit 74e05b8

1 file changed

+167
-0
lines changed

v1.md

@@ -0,0 +1,167 @@
## Original Mytax v1.0

This is the repository for mytax, a tool for building custom taxonomies, which can aid nucleotide sequence classification.

# Installation

Clone this repo with:

`git clone https://github.com/jhuapl-bio/mytax`

From the directory that contains `v1`, symbolically link all shell scripts into your path, for example with:

`find v1 -name "*.sh" | while read -r fn; do sudo ln -s "$PWD/$fn" /usr/local/bin; done`

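Where `sudo` is not available, the same linking step could target a per-user directory instead. The sketch below is one possible way to do that, not something shipped with mytax: the function name `link_scripts` and the default destination `$HOME/.local/bin` are assumptions, and the destination still has to be on your `PATH`.

```shell
# Hypothetical helper (not part of mytax): symlink every *.sh file found under
# a source directory into a destination directory, without needing sudo.
link_scripts() {
  src_dir="$1"                        # e.g. the repo's v1 directory
  dest_dir="${2:-$HOME/.local/bin}"   # assumed default; must be on PATH
  # Resolve the source to an absolute path so the links work from anywhere.
  abs_src=$(cd "$src_dir" && pwd) || return 1
  mkdir -p "$dest_dir" || return 1
  find "$abs_src" -name "*.sh" | while IFS= read -r fn; do
    ln -sf "$fn" "$dest_dir/"
  done
}
```

For example, `link_scripts v1` would link the scripts into `$HOME/.local/bin`, and `link_scripts v1 /some/other/bin` into a directory of your choice.
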
# Dependencies

- jellyfish (version 1) - https://www.cbcb.umd.edu/software/jellyfish/
- kraken (version 1) - https://ccb.jhu.edu/software/kraken/
- gawk - https://www.gnu.org/software/gawk/manual/html_node/Installation.html
- perl
- GNU CoreUtils

- 16 GB of RAM is needed to build the provided influenza kraken database

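Before starting a build, it can help to confirm that the tools listed above resolve on your `PATH`. The following is a minimal sketch of such a check; `missing_commands` is a hypothetical name and not part of mytax.

```shell
# Hypothetical helper (not part of mytax): report which of the named commands
# cannot be found on PATH. Prints one missing name per line and returns 1 if
# any are missing, 0 otherwise.
missing_commands() {
  status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      status=1
    fi
  done
  return "$status"
}
```

A call such as `missing_commands jellyfish kraken kraken-build gawk perl` lists whatever is absent. The same function could be pointed at the mytax scripts themselves, for example `missing_commands build_flukraken.sh build_krakendb.sh`.
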
# Usage

## Building example

This pipeline is built from a central set of scripts located in the `v1` directory.

Build the flu-kraken example with the command below, where `$(date +"%F")` expands to today's date as `YYYY-MM-DD`:

`build_flukraken.sh -k flukraken-$(date +"%F")`

The single script `build_flukraken.sh` functions as an outer wrapper for the influenza classification example using the Kraken classifier published in the mytax paper.

`build_flukraken.sh` can also be used as a model to build modified pipelines as desired. It is built from four main sub-modules:
```
download_IVR.sh -> download references and taxonomy from IVR

build_IVR_metadata.sh -> build tab-delimited metadata table in format for mytax

build_taxonomy.sh -> build custom taxonomy from tab-delimited table

build_krakendb.sh -> add new taxonomic IDs to reference FASTA, build kraken database, post-process database for visualization pipeline
```

`build_krakendb.sh` currently references three helper scripts, which also need to be in the PATH:
```
fix_references.sh -> adds new taxonomic IDs to reference FASTA

kraken-build -> builds kraken database

process_krakendb.sh -> post-processes database for visualization pipeline (not included in this repo yet)
```
## Running process script on kraken/kraken2 report and outfiles

### If running from Docker

`docker build . -t jhuaplbio/mytax`

Unix

`docker container run -it --rm -v "$PWD:/data" jhuaplbio/mytax bash`

Windows PowerShell

`docker container run -it --rm -v ${PWD}:/data jhuaplbio/mytax bash`

## Run the installation script

Activate the env. It contains the kraken2 and centrifuge scripts needed to build the database, if required, as well as the kraken2 and centrifuge dependencies.

`conda activate mytax`

## Let's make a sample.fastq from test-data

### First, download the NCBI taxdump

```
python3 src/generate_hierarchy.py -o $PWD/taxdump --report test-data/sample.report -download
rm taxdump.tar.gz
```
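
Later steps read `nodes.dmp` and `names.dmp`, so it may be worth confirming that both exist before moving on. The sketch below assumes the download leaves the standard NCBI taxdump files directly inside the directory you pass in. The source does not state the exact layout, and `has_taxdump` is a hypothetical name.

```shell
# Hypothetical helper (not part of mytax): succeed only if the given directory
# holds non-empty nodes.dmp and names.dmp files.
# Assumption: the taxdump files sit directly inside that directory.
has_taxdump() {
  dir="$1"
  for f in nodes.dmp names.dmp; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      return 1
    fi
  done
}
```

For example, `has_taxdump "$PWD/taxdump" || echo "taxdump incomplete" >&2`.
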

### DEPRECATED Kraken1

```
mkdir -p databases/minikraken1
wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_4GB.tgz -O databases/minikraken1.tgz
tar -xvzf databases/minikraken1.tgz --directory databases/

export kraken1db=databases/minikraken_20171013_4GB && \
kraken --db $kraken1db --output test-data/sample.out test-data/sample.fastq && \
kraken-report --db $kraken1db test-data/sample.out | tee test-data/sample.report
```
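
The archive name and the directory name used in the commands above carry different dates (`20171019` and `20171013`), so a script may prefer to locate the extracted database by its contents rather than by a hard-coded name. The following sketch looks for `database.kdb`, the k-mer file of a Kraken 1 database. `find_kraken1_db` is a hypothetical helper, not part of mytax or kraken.

```shell
# Hypothetical helper (not part of mytax or kraken): print the first directory
# under a root that contains database.kdb. Returns 1 if none is found.
find_kraken1_db() {
  root="${1:-databases}"
  kdb=$(find "$root" -name database.kdb -type f 2>/dev/null | sort | head -n 1)
  [ -n "$kdb" ] || return 1
  dirname "$kdb"
}
```

It could be used as `kraken1db=$(find_kraken1_db databases)` in place of the `export` line.
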

### Kraken2

### If you've made flukraken2 in tmp, or...

`export KRAKEN2_DEFAULT_DB="tmp/flukraken2"`

### If you have a pre-made minikraken/other kraken db ready

```
kraken2 --report output/sample_metagenome.first.report --output output/sample_metagenome.first.out --memory-mapping --db ~/Desktop/mytax/minikraken2 example-data/sample_metagenome.first.fastq
```
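
A small wrapper can make sure the output directory exists and derive both output file names from the FASTQ name. This is a sketch only: `classify_kraken2` and the `DRY_RUN` switch are hypothetical names, and the only kraken2 flags it uses are the ones in the command above.

```shell
# Hypothetical wrapper (not part of mytax or kraken2). It reuses only the
# kraken2 flags shown above. Set DRY_RUN=1 to print the command instead of
# running it.
classify_kraken2() {
  db="$1"                 # path to a Kraken 2 database directory
  fastq="$2"              # input reads
  outdir="${3:-output}"   # assumed default, matching the paths above
  stem=$(basename "$fastq" .fastq)
  mkdir -p "$outdir" || return 1
  ${DRY_RUN:+echo} kraken2 \
    --report "$outdir/$stem.report" \
    --output "$outdir/$stem.out" \
    --memory-mapping \
    --db "$db" \
    "$fastq"
}
```

For example, `DRY_RUN=1 classify_kraken2 databases/minikraken2 example-data/sample_metagenome.first.fastq` prints the command that would be run; the database path in that call is a placeholder.
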

### Download minikraken2

```
mkdir -p databases/
wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904.tgz -O databases/minikraken2.tgz
tar -xvzf databases/minikraken2.tgz --directory databases/
```

### Centrifuge

#### Install

`bash install.sh`

#### Set up centrifuge env

```
mkdir -p databases/centrifuge
wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed%2Bh%2Bv.tar.gz -O databases/centrifuge.tgz
tar -xvzf databases/centrifuge.tgz --directory databases/centrifuge/
```

#### Run Centrifuge classify

```
# If you need to make a new database instead, see centrifuge-build, for example:
# $CONDA_PREFIX/lib/centrifuge/centrifuge-build --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp sample.fastq sample

$CONDA_PREFIX/lib/centrifuge/centrifuge -f -x databases/centrifuge/p_compressed+h+v -q test-data/sample.fastq --report test-data/sample.centrifuge.report > test-data/sample.out
$CONDA_PREFIX/lib/centrifuge/centrifuge-kreport -x databases/centrifuge/p_compressed+h+v test-data/sample.centrifuge.report > test-data/sample.report
```

#### Next, generate the hierarchy json file

```
python3 server/src/generate_hierarchy.py \
-o output/sample_metagenome.first.fullstring \
--report output/sample_metagenome.first.report \
-taxdump taxonomy/nodes.dmp
```

#### Get the json for mytax sunburst plot
```
bash server/src/krakenreport2json.sh -i output/sample_metagenome.first.fullstring -o output/sample_metagenome.first.json
```
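
The last two steps share one naming pattern: the `.report` file yields a `.fullstring` file, which in turn yields the `.json` file. The sketch below chains them under that assumption. `report_to_json` and `DRY_RUN` are hypothetical names, while the script paths and flags are copied from the two commands above.

```shell
# Hypothetical wrapper (not part of mytax): derive the intermediate and final
# file names from a report path, then run the two documented commands.
# Set DRY_RUN=1 to print the commands instead of running them.
report_to_json() {
  report="$1"
  nodes="${2:-taxonomy/nodes.dmp}"   # default taken from the command above
  stem="${report%.report}"           # strip the .report suffix
  ${DRY_RUN:+echo} python3 server/src/generate_hierarchy.py \
    -o "$stem.fullstring" --report "$report" -taxdump "$nodes" || return 1
  ${DRY_RUN:+echo} bash server/src/krakenreport2json.sh \
    -i "$stem.fullstring" -o "$stem.json"
}
```

Running `DRY_RUN=1 report_to_json output/sample_metagenome.first.report` prints both commands without executing them, which is a cheap way to check the derived paths.
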

The resulting file can then be imported into the sunburst plot at `server/src/sunburst/index.html`, served with a simple HTTP server such as `python3 -m http.server 8080`.
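
Before loading the file into the plot, a quick parse check can catch a truncated or malformed JSON file. This minimal sketch relies on the `json.tool` module from the Python standard library; `is_valid_json` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of mytax): succeed only if the file parses as
# JSON, using json.tool from the Python standard library.
is_valid_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1
}
```

For example, `is_valid_json output/sample_metagenome.first.json || echo "JSON did not parse" >&2`.
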
