This de novo assembly pipeline is designed to assemble DNA sequences generated by Oxford Nanopore technology.
- Create an environment with all required tools:
conda create -n denovo -c conda-forge -c bioconda python fastx_toolkit fastqc samtools porechop flye quast prokka minimap2 racon
- Basecalling is performed using Dorado. Installation instructions are available here: Dorado GitHub Repository.
The raw sequencing data is generated in POD5 or FASTQ format and is demultiplexed according to the sequencing setup. Basecalling can be performed in real time during sequencing, which makes this step optional. Depending on the experiment, however, it may be worth re-running basecalling with a more accurate model. If basecalling has not been done or needs to be repeated, follow these steps:
Before running Dorado, download the basecalling models:
dorado download --model <model name>
dorado download --model all # To download all available models
Dorado outputs data in BAM format by default. To generate FASTQ files, use the --emit-fastq option or convert BAM files to FASTQ using Samtools.
- Basecalling
dorado basecaller <model> pod5s/ > calls.bam # Recommended model: [email protected]
To convert BAM files to FASTQ format:
samtools fastq <BAM> > <FASTQ>
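The two routes to FASTQ described above (the --emit-fastq option, or BAM followed by samtools fastq) can be sketched in Python as below. This is a minimal sketch, not part of the pipeline scripts: the function names basecall_command, bam_to_fastq_command and run_to_file are hypothetical, and the model name is left as a placeholder.

```python
import subprocess


def basecall_command(model, pod5_dir, emit_fastq=False):
    """Build the `dorado basecaller` command line as a list of arguments."""
    cmd = ["dorado", "basecaller", model, pod5_dir]
    if emit_fastq:
        # Ask Dorado for FASTQ on stdout instead of the default BAM.
        cmd.append("--emit-fastq")
    return cmd


def bam_to_fastq_command(bam):
    """Build the `samtools fastq` command that converts a BAM file to FASTQ."""
    return ["samtools", "fastq", bam]


def run_to_file(cmd, out_path):
    """Run `cmd` and write its stdout to `out_path` (equivalent to `cmd > out_path`)."""
    with open(out_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)


# Example (requires dorado and samtools on PATH, so left commented out):
# run_to_file(basecall_command("<model>", "pods5/"), "calls.bam")
# run_to_file(bam_to_fastq_command("calls.bam"), "calls.fastq")
```

Building the commands as argument lists avoids shell quoting problems with unusual paths, and check=True stops the run if either tool exits with an error.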
Scripts for running the programs on a dataset are named as follows:
Run_concatenate.py
Run_QC_first.py
Run_trimming_check.py
Run_assembly_QC.py
Run_polishing.py
Before executing, modify the input_folder variable in each script:
input_folder = "/PATH/TO/FASTQ/"
Execute the scripts using:
python <Run.py> </PATH/TO/FASTQ/>
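The internals of the Run_*.py scripts are not shown here, so the following is only a guess at how a script could reconcile the input_folder variable with the path given on the command line. The helper name resolve_input_folder is hypothetical.

```python
from pathlib import Path

# Default location, edited per dataset as described above.
input_folder = "/PATH/TO/FASTQ/"


def resolve_input_folder(argv, default=input_folder):
    """Return the FASTQ folder: the first command-line argument if given, else the default."""
    folder = Path(argv[1] if len(argv) > 1 else default)
    if not folder.is_dir():
        raise FileNotFoundError(f"Input folder not found: {folder}")
    return folder


# Typical use inside a script:
# import sys
# fastq_dir = resolve_input_folder(sys.argv)
```

Failing early on a missing folder gives a clearer message than an error raised later by one of the wrapped tools.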
The steps in the pipeline are detailed below with examples for each script:
I. Merge all FASTQ files obtained from basecalling:
cat *.fastq.gz > merge.fastq.gz
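The same merge can be done from Python, which is convenient inside a wrapper script. This is a minimal sketch assuming all inputs are gzip-compressed and sit in one folder; merge_fastq_gz is a hypothetical name. It relies on the fact that concatenated gzip files form a valid gzip file, which is also why the cat command works.

```python
import shutil
from pathlib import Path


def merge_fastq_gz(input_folder, output_name="merge.fastq.gz"):
    """Concatenate every *.fastq.gz in `input_folder` into one file.

    Returns the output path and the number of files merged.
    """
    folder = Path(input_folder)
    output = folder / output_name
    # Skip a merge.fastq.gz left over from a previous run so it is not merged into itself.
    parts = sorted(p for p in folder.glob("*.fastq.gz") if p.name != output_name)
    with open(output, "wb") as merged:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)  # byte-level copy, no decompression
    return output, len(parts)
```

Unlike the shell glob, this version excludes the output file explicitly, so re-running it in the same folder does not pick up the previous merge.fastq.gz as input.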
II. Quality control:
fastqc [-t <threads>] merge.fastq.gz # Used 14 threads for the -t option
FastQC documentation: FastQC Help
III. Trimming:
porechop -i merge.fastq.gz -o trimm.fastq # Basic trimming mode
Porechop documentation: Porechop GitHub Repository
IV. Quality control:
fastqc [-t <threads>] trimm.fastq # Used 14 threads for the -t option
This step ensures the trimming was successful and determines if adjustments are needed.
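Alongside the FastQC reports, a quick numeric comparison of the reads before and after trimming can help judge whether Porechop behaved as expected. The sketch below is not part of the pipeline scripts: the function names are hypothetical, and it assumes standard four-line FASTQ records (sequence not wrapped across lines).

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_lengths(path):
    """Return the length of every read, assuming four lines per record."""
    lengths = []
    with open_fastq(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of each record is the sequence
                lengths.append(len(line.strip()))
    return lengths


def summarize(path):
    """Return read count, total bases and mean read length for one FASTQ file."""
    lengths = read_lengths(path)
    total = sum(lengths)
    mean = total / len(lengths) if lengths else 0.0
    return {"reads": len(lengths), "bases": total, "mean_length": mean}


# Example:
# before = summarize("merge.fastq.gz")
# after = summarize("trimm.fastq")
# print(before, after)
```

A small drop in total bases with a similar read count is what adapter trimming usually looks like; a large drop in either number suggests the trimming settings deserve a second look.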
I. Assembly:
flye --nano-raw trimm.fastq --out-dir flye --threads 4
Flye documentation: Flye Usage Guide
II. Assembly quality:
quast.py flye/assembly.fasta -o quast_assembly
Quast documentation: Quast GitHub Repository
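QUAST reports contig counts and N50 among many other metrics. As a quick cross-check of those two numbers, the sketch below computes them directly from the assembly FASTA. The function names are hypothetical, and the values can differ from the QUAST report because QUAST ignores contigs shorter than 500 bp by default.

```python
def contig_lengths(fasta_path):
    """Return the length of each sequence in a FASTA file (wrapped lines supported)."""
    lengths = []
    current = 0
    seen_header = False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seen_header:
                    lengths.append(current)  # close the previous record
                seen_header = True
                current = 0
            elif seen_header:
                current += len(line)
    if seen_header:
        lengths.append(current)  # close the last record
    return lengths


def n50(lengths):
    """Return the N50: the contig length at which the running total, taken from
    the longest contig downwards, reaches half of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Example:
# lengths = contig_lengths("flye/assembly.fasta")
# print(len(lengths), "contigs, N50 =", n50(lengths))
```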
III. Polishing:
minimap2 -a -t 14 flye/assembly.fasta trimm.fastq > assembly.sam
racon -t 8 -m 8 -x -6 -g -8 -w 500 -u trimm.fastq assembly.sam flye/assembly.fasta > polishing.fasta
Minimap2 documentation: Minimap2 GitHub Repository
Racon documentation: Racon GitHub Repository
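The polishing step chains two tools, each writing to stdout, so a wrapper script has to handle the redirections itself. The following is a minimal sketch of one way to do that, reusing the options shown above. The names polish_commands and run_steps are hypothetical, and the default paths assume Flye was run with --out-dir flye in the current directory.

```python
import shlex
import subprocess


def polish_commands(assembly="flye/assembly.fasta", reads="trimm.fastq",
                    sam="assembly.sam", polished="polishing.fasta",
                    map_threads=14, racon_threads=8):
    """Return (command, output file) pairs for one round of Racon polishing."""
    minimap2 = ["minimap2", "-a", "-t", str(map_threads), assembly, reads]
    racon = ["racon", "-t", str(racon_threads), "-m", "8", "-x", "-6",
             "-g", "-8", "-w", "500", "-u", reads, sam, assembly]
    return [(minimap2, sam), (racon, polished)]


def run_steps(steps):
    """Run each command in order, writing its stdout to the paired output file."""
    for cmd, out_path in steps:
        print(shlex.join(cmd), ">", out_path)  # echo the equivalent shell line
        with open(out_path, "wb") as out:
            subprocess.run(cmd, stdout=out, check=True)


# Example (requires minimap2 and racon on PATH):
# run_steps(polish_commands())
```

Because check=True raises on a non-zero exit code, Racon is never started on a truncated SAM file if minimap2 fails.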
Sort and index the alignment so that it can be inspected in IGV:
samtools sort -o sorted.bam assembly.sam
samtools index sorted.bam
In IGV:
- Click Genomes -> Load Genome from File and load the assembly.fasta file generated by Flye or Racon.
- Click File -> Load From File and select the sorted and indexed BAM file.