De novo Assembly Pipeline for Long Reads

This de novo assembly pipeline is designed to assemble DNA sequences generated by Oxford Nanopore technology.

Basic Setup

Create an environment with all required tools:

conda create -n denovo -c conda-forge -c bioconda python fastqc fastx_toolkit samtools porechop flye quast prokka minimap2 racon

Basecalling is performed using Dorado. Installation instructions are available in the Dorado GitHub Repository.

NOTE: If you already have FASTQ files, skip directly to Section 2.

Section 1:

The raw sequencing data is generated in POD5 or FASTQ format and is demultiplexed according to the sequencing setup. Basecalling can be performed in real time during sequencing, which makes this step optional. However, depending on the experiment, it may be worth re-running basecalling with a newer or more accurate model. If basecalling hasn’t been done or needs to be repeated, follow these steps:

Before running Dorado, download the basecalling models:

dorado download --model <model name>
dorado download --model all # To download all available models

Dorado outputs data in BAM format by default. To generate FASTQ files, use the --emit-fastq option or convert BAM files to FASTQ using Samtools.

Basecalling

dorado basecaller <model> pod5s/ > calls.bam # Recommended model: dna_r10.4.1_e8.2_400bps_sup@v4.3.0

To convert BAM files to FASTQ format:

samtools fastq <BAM> > <FASTQ>
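
The two routes to FASTQ described above (direct output with --emit-fastq, or BAM followed by Samtools) can be sketched in Python as follows. This is only an illustration: the function name basecall_steps and the output names calls.bam and calls.fastq are assumptions, and the sketch only builds and prints the commands without running them.

```python
def basecall_steps(model, pod5_dir, emit_fastq=False):
    """Return (argument list, output file) pairs, each equivalent to `args > output`.

    Hypothetical helper: the output names calls.bam / calls.fastq are assumptions.
    """
    if emit_fastq:
        # Route 1: Dorado writes FASTQ directly.
        return [(["dorado", "basecaller", model, pod5_dir, "--emit-fastq"], "calls.fastq")]
    # Route 2: default BAM output, then conversion with Samtools.
    return [
        (["dorado", "basecaller", model, pod5_dir], "calls.bam"),
        (["samtools", "fastq", "calls.bam"], "calls.fastq"),
    ]


# Print the commands for the default (BAM, then FASTQ) route.
for args, output in basecall_steps("dna_r10.4.1_e8.2_400bps_sup@v4.3.0", "pod5s/"):
    print(" ".join(args), ">", output)
```

Passing emit_fastq=True returns a single step instead of two, which may be convenient when the BAM file is not needed afterwards.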

Section 2:

Scripts for running the programs on a dataset are named as follows:

Run_concatenate.py
Run_QC_first.py
Run_trimming_check.py
Run_assembly_QC.py
Run_polishing.py

Before executing, modify the input_folder variable in each script:

input_folder = "/PATH/TO/FASTQ/"

Execute the scripts using:

python <Run.py> </PATH/TO/FASTQ/>

The steps in the pipeline are detailed below with examples for each script:

Sequence Preprocessing

I. Merge all FASTQ files obtained from basecalling:

cat *.fastq.gz > merge.fastq.gz
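
A minimal Python sketch of the same merge is shown here. It is not the content of Run_concatenate.py; the function name concatenate_fastq_gz is hypothetical. It relies on the fact that gzip files can be concatenated byte for byte, and it skips the output file so that a re-run does not merge the previous result into itself.

```python
import shutil
from pathlib import Path


def concatenate_fastq_gz(input_folder, output_name="merge.fastq.gz"):
    """Concatenate every *.fastq.gz in input_folder into one gzip file.

    Returns the output path and the number of files that were merged.
    """
    folder = Path(input_folder)
    output = folder / output_name
    # Exclude the output itself so a re-run does not append the merged file to itself.
    parts = sorted(p for p in folder.glob("*.fastq.gz") if p.name != output_name)
    with open(output, "wb") as merged:
        for part in parts:
            with open(part, "rb") as source:
                # Raw byte copy: concatenated gzip members form a valid gzip file.
                shutil.copyfileobj(source, merged)
    return output, len(parts)
```

For example, concatenate_fastq_gz("/PATH/TO/FASTQ/") would write merge.fastq.gz into that folder.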

II. Quality control:

fastqc [-t <threads>] merge.fastq.gz  # Used 14 threads for the -t option

FastQC documentation: FastQC Help

III. Trimming:

porechop -i merge.fastq.gz -o trimm.fastq  # Basic trimming mode

Porechop documentation: Porechop GitHub Repository

IV. Quality control:

fastqc [-t <threads>] trimm.fastq  # Used 14 threads for the -t option

This step ensures the trimming was successful and determines if adjustments are needed.
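
Alongside the FastQC report, a quick numeric comparison of the reads before and after trimming can help spot problems. The sketch below is an assumption-laden helper, not part of the pipeline scripts: fastq_summary is a hypothetical name, and it assumes standard four-line FASTQ records (no wrapped sequence lines).

```python
import gzip


def fastq_summary(path):
    """Return (read count, total bases) for a plain or gzipped FASTQ file.

    Assumes four lines per record, with the sequence on the second line.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = 0
    bases = 0
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # sequence line of each four-line record
                reads += 1
                bases += len(line.strip())
    return reads, bases
```

Calling fastq_summary("merge.fastq.gz") and fastq_summary("trimm.fastq") gives two pairs to compare. The total number of bases should drop slightly after adapter removal, and the read count may change somewhat because Porechop can split or discard reads.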

Sequence Assembly

I. Assembly:

flye --nano-raw trimm.fastq --out-dir flye --threads 4

Flye documentation: Flye Usage Guide

II. Assembly quality:

quast.py flye/assembly.fasta -o quast_assembly

Quast documentation: Quast GitHub Repository
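
As a quick sanity check on the assembly, contig lengths and N50 can also be computed directly from the FASTA file. This is a minimal sketch and not a replacement for the Quast report; the function names contig_lengths and n50 are hypothetical, and the path flye/assembly.fasta assumes Flye was run with --out-dir flye in the current directory.

```python
def contig_lengths(fasta_path):
    """Return the length of every non-empty sequence in a FASTA file."""
    lengths = []
    current = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new header closes the previous record.
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the length L such that contigs of length >= L cover at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For example, n50(contig_lengths("flye/assembly.fasta")) returns a single number that can be compared with the N50 value in the Quast output.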

III. Polishing:

minimap2 -a -t 14 flye/assembly.fasta trimm.fastq > assembly.sam
racon -t 8 -m 8 -x -6 -g -8 -w 500 -u trimm.fastq assembly.sam flye/assembly.fasta > polishing.fasta

Minimap2 documentation: Minimap2 GitHub Repository

Racon documentation: Racon GitHub Repository
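
The polishing step chains two commands whose output is redirected to a file. One way to drive that from Python is sketched below. It is not the content of Run_polishing.py: the names run_to_file, polishing_steps, and polish are hypothetical, a single threads value is used for both tools for simplicity, and the assembly path assumes Flye wrote to a flye/ directory relative to the working directory.

```python
import subprocess


def run_to_file(args, out_path):
    """Run a command and write its standard output to out_path, like `cmd > file`."""
    with open(out_path, "wb") as handle:
        # check=True raises CalledProcessError if the tool exits with an error.
        subprocess.run(args, stdout=handle, check=True)


def polishing_steps(reads="trimm.fastq", assembly="flye/assembly.fasta",
                    sam="assembly.sam", polished="polishing.fasta", threads=8):
    """Return (argument list, output file) pairs for the Minimap2 and Racon steps."""
    minimap2 = ["minimap2", "-a", "-t", str(threads), assembly, reads]
    racon = ["racon", "-t", str(threads), "-m", "8", "-x", "-6", "-g", "-8",
             "-w", "500", "-u", reads, sam, assembly]
    return [(minimap2, sam), (racon, polished)]


def polish(**kwargs):
    """Run both steps in order; Racon needs the SAM file that Minimap2 produces."""
    for args, output in polishing_steps(**kwargs):
        run_to_file(args, output)
```

Calling polish() with no arguments would reproduce the two commands above with 8 threads each. Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names.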

Optional: Visualization of the Assembly in IGV

samtools sort -o sorted.bam assembly.sam
samtools index sorted.bam

In IGV:

Click Genomes -> Load Genome from File and load the assembly.fasta file generated by Flye, which is the reference the reads in assembly.sam were aligned to.
Click File -> Load From File and select the sorted and indexed BAM file.
