
Workflow information and usage instructions

General workflow info

The current processing protocol is implemented as a Snakemake workflow and utilizes conda environments. The workflow can be used even if you are unfamiliar with Snakemake and conda, but if you want to learn more, the Snakemake tutorial within Snakemake's documentation is a good place to start, and an introduction to conda with installation help and links to other resources can be found at Happy Belly Bioinformatics.

Utilizing the workflow

  1. Install conda and genelab-utils package
  2. Download the workflow template files
  3. Modify the variables in the config.yaml file
  4. Run the workflow

1. Install conda and genelab-utils package

We recommend installing Miniconda, with a Python3 version appropriate for your system, as exemplified in the conda introduction linked above.

Once conda is installed on your system, you can install the genelab-utils conda package in a new environment with the following:

conda create -n genelab-utils -c conda-forge -c bioconda -c defaults -c astrobiomike genelab-utils

The environment then needs to be activated:

conda activate genelab-utils

2. Download the workflow template files

All files required for utilizing the GeneLab workflow for processing 454 and IonTorrent amplicon sequencing data are in the workflow-template directory. To get a copy of that directory onto your system, run the following command:

GL-get-454-IonTorrent-amplicon-wf

This will download the workflow into a directory called 454-and-IonTorrent-workflow/.

3. Modify the variables in the config.yaml file

Once you've downloaded the workflow template, you can modify the variables in the config.yaml file as needed. For example, you will have to provide a text file containing a single-column list of unique sample identifiers (see an example of how to set this up below). You will also need to indicate the paths to your input data (raw reads) and, if necessary, modify each variable to be consistent with the study you want to process.

Note: If you are unfamiliar with how to specify paths, one place you can learn more is here.
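To make this concrete, here is a minimal, hypothetical excerpt of what the relevant config.yaml entries might look like. The variable names below are illustrative assumptions, not necessarily the actual ones; the config.yaml included in the downloaded template is the authority on which variables exist and what they are called.

```yaml
## hypothetical config.yaml excerpt; variable names are illustrative only,
## check the downloaded config.yaml for the actual ones

## single-column file holding the unique sample IDs
sample_info_file:
    "unique-sample-IDs.txt"

## directory holding the input raw reads (path relative to the workflow directory)
raw_reads_dir:
    "../Raw_Data/"
```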

Example for how to create a single-column list of unique sample identifiers from your raw data file names

For example, if you have paired-end read data for 2 samples located in ../Raw_Data/ relative to your workflow directory, that look like this:

ls ../Raw_Data/
Sample-1_R1_raw.fastq.gz
Sample-1_R2_raw.fastq.gz
Sample-2_R1_raw.fastq.gz
Sample-2_R2_raw.fastq.gz

You would set up your unique-sample-IDs.txt file as follows:

cat unique-sample-IDs.txt
Sample-1
Sample-2
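If you have many samples, this file can be generated from the read-file names rather than typed by hand. Here is a minimal sketch, assuming the naming pattern shown above (the demo_raw/ directory and file names below are stand-ins created just for this demonstration; substitute your own raw-data directory and suffix):

```shell
# set up a scratch directory with the example file names from above
mkdir -p demo_raw
touch demo_raw/Sample-1_R1_raw.fastq.gz demo_raw/Sample-1_R2_raw.fastq.gz \
      demo_raw/Sample-2_R1_raw.fastq.gz demo_raw/Sample-2_R2_raw.fastq.gz

# list only the R1 files (so each sample appears once), then strip the
# directory portion and the "_R1_raw.fastq.gz" suffix to leave the sample IDs
ls demo_raw/*_R1_raw.fastq.gz | sed 's|.*/||; s|_R1_raw.fastq.gz||' > unique-sample-IDs.txt

cat unique-sample-IDs.txt
# Sample-1
# Sample-2
```

Listing only the R1 files is what keeps each sample to a single line; if your files use a different suffix, adjust both the glob and the sed substitution accordingly.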

4. Run the workflow

While in the directory holding the Snakefile, config.yaml, and other workflow files that you downloaded in step 2, here is one example of how to run the workflow:

snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p
  • --use-conda – specifies to use the conda environments included in the workflow (these are specified in the envs directory)
  • --conda-prefix – indicates where the needed conda environments will be stored. Adding this option will also allow the same conda environments to be re-used when processing additional datasets, rather than making new environments each time you run the workflow. The value listed for this option, ${CONDA_PREFIX}/envs, points to the default location for conda environments (note: the variable ${CONDA_PREFIX} will be expanded to the appropriate location on whichever system it is run on).
  • -j – assigns the number of jobs Snakemake should run concurrently
  • -p – specifies to print out each command being run to the screen
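To illustrate the `${CONDA_PREFIX}` expansion mentioned above: conda sets that variable automatically whenever an environment is active, and the shell substitutes its value before snakemake ever sees the argument. A quick sketch (the path below is a hypothetical install location, set manually here only for demonstration):

```shell
# conda sets CONDA_PREFIX for you when an environment is active; we assign a
# hypothetical value here purely to show how the shell expands the argument
CONDA_PREFIX=/home/user/miniconda3

# this is the literal value snakemake would receive for --conda-prefix
echo "${CONDA_PREFIX}/envs"
# prints: /home/user/miniconda3/envs
```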

See snakemake -h and Snakemake's documentation for more options and details.