De novo Assembly Pipeline for Long Reads

This pipeline performs de novo assembly of DNA reads generated by Oxford Nanopore sequencing. It covers optional basecalling, read preprocessing, assembly with Flye, and polishing with Racon.

Basic Setup

  • Create an environment with all required tools (FastQC is included because the quality-control steps call fastqc):
conda create -n denovo -c conda-forge -c bioconda python fastqc fastx_toolkit samtools porechop flye quast prokka minimap2 racon

NOTE: If you already have FASTQ files, skip directly to Section 2.

Section 1:

The raw sequencing data is generated in POD5 or FASTQ format and is demultiplexed according to the sequencing setup. Because basecalling can run in real time during sequencing, this section is optional. Depending on the experiment, however, it may be worth re-running basecalling with a newer or more accurate model. If basecalling has not been done or needs to be repeated, follow these steps:

Dorado is not part of the conda environment above and must be installed separately. Before running it, download the basecalling models:

dorado download --model <model name>
dorado download --model all # To download all available models

Dorado outputs data in BAM format by default. To generate FASTQ files, use the --emit-fastq option or convert BAM files to FASTQ using Samtools.

  1. Basecalling
dorado basecaller <model> pod5s/ > calls.bam # Recommended model: [email protected]

To convert BAM files to FASTQ format:

samtools fastq <BAM> > <FASTQ>

Section 2:

Scripts for running the programs on a dataset are named as follows:

  • Run_concatenate.py
  • Run_QC_first.py
  • Run_trimming_check.py
  • Run_assembly_QC.py
  • Run_polishing.py

Before executing, modify the input_folder variable in each script:

input_folder = "/PATH/TO/FASTQ/"

Execute the scripts using:

python <Run.py> </PATH/TO/FASTQ/>
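The Run_*.py scripts themselves are not shown here, so the following is only a sketch of how such a script might reconcile the two ways of supplying the folder described above: a command-line argument takes precedence, and the in-script variable serves as a fallback. The names `DEFAULT_INPUT_FOLDER`, `resolve_input_folder`, and `list_fastq_files` are hypothetical and not taken from the repository.

```python
from pathlib import Path

# Hypothetical default, standing in for the input_folder variable in each script.
DEFAULT_INPUT_FOLDER = "/PATH/TO/FASTQ/"


def resolve_input_folder(argv, default=DEFAULT_INPUT_FOLDER):
    """Return the folder passed on the command line, or the default if none was given."""
    # argv[0] is the script name; argv[1], when present, is the FASTQ folder.
    return Path(argv[1]) if len(argv) > 1 else Path(default)


def list_fastq_files(folder):
    """Return the FASTQ files (plain or gzipped) found directly inside folder, sorted."""
    patterns = ("*.fastq", "*.fastq.gz")
    return sorted(p for pattern in patterns for p in Path(folder).glob(pattern))
```

A script built this way would call `resolve_input_folder(sys.argv)` at startup and stop with an error message if the returned path is not a directory.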

The steps in the pipeline are detailed below with examples for each script:

Sequence Preprocessing

I. Merge all FASTQ files obtained from basecalling:

cat *.fastq.gz > merge.fastq.gz
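The same merge can be done from Python, which may be how a script such as Run_concatenate.py approaches it (an assumption; the script is not shown). Gzip files can be concatenated byte for byte without decompressing them, which is exactly what `cat` does. The function name `merge_fastq_gz` is illustrative.

```python
import shutil
from pathlib import Path


def merge_fastq_gz(input_folder, output_name="merge.fastq.gz"):
    """Concatenate every *.fastq.gz in input_folder into a single gzip file.

    Returns the output path and the number of files merged.
    """
    folder = Path(input_folder)
    output_path = folder / output_name
    # Skip the output file itself so that a re-run does not merge it into itself.
    parts = sorted(p for p in folder.glob("*.fastq.gz") if p != output_path)
    with open(output_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as handle:
                # Raw byte copy: the result is a valid multi-member gzip file.
                shutil.copyfileobj(handle, merged)
    return output_path, len(parts)
```

Unlike the shell glob, this version excludes an existing merge.fastq.gz from its inputs, so running it twice in the same folder gives the same result.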

II. Quality control:

fastqc [-t <threads>] merge.fastq.gz  # Used 14 threads for the -t option

FastQC documentation: FastQC Help

III. Trimming:

porechop -i merge.fastq.gz -o trimm.fastq  # Basic trimming mode

Porechop documentation: Porechop GitHub Repository

IV. Quality control after trimming:

fastqc [-t <threads>] trimm.fastq  # Porechop wrote an uncompressed trimm.fastq; used 14 threads for the -t option

This step ensures the trimming was successful and determines if adjustments are needed.
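As a quick numeric complement to the FastQC reports, read counts and lengths before and after trimming can be compared directly. This is a minimal sketch, not part of the pipeline's scripts; it assumes standard four-line FASTQ records, and the helper names are hypothetical.

```python
import gzip
from pathlib import Path


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_lengths(path):
    """Return the length of every read, assuming four lines per FASTQ record."""
    lengths = []
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the sequence is the second line of each record
                lengths.append(len(line.strip()))
    return lengths


def summarize(path):
    """Return read count, total bases, and mean read length for a FASTQ file."""
    lengths = read_lengths(path)
    total = sum(lengths)
    mean = total / len(lengths) if lengths else 0.0
    return {"reads": len(lengths), "bases": total, "mean_length": mean}
```

Calling `summarize("merge.fastq.gz")` and `summarize("trimm.fastq")` side by side shows how many bases the trimming removed. Changes in read count can also occur, depending on the Porechop options used.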

Sequence Assembly

I. Assembly:

flye --nano-raw trimm.fastq --out-dir flye --threads 4

Flye documentation: Flye Usage Guide

II. Assembly quality:

quast.py flye/assembly.fasta -o quast_assembly  # flye/ is the --out-dir from the assembly step (relative path)

Quast documentation: Quast GitHub Repository
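One of the headline numbers in a QUAST report is N50. To make clear what it measures, here is a small sketch that computes it from an assembly FASTA. It is illustrative only and not a replacement for QUAST; the function names are hypothetical.

```python
def contig_lengths(fasta_path):
    """Return the length of each sequence in a FASTA file."""
    lengths = []
    current = None  # length of the contig being read; None before the first header
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

Note that QUAST by default ignores contigs shorter than 500 bp, so its N50 can differ from the value computed here over all contigs.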

III. Polishing:

minimap2 -a -t 14 flye/assembly.fasta trimm.fastq > assembly.sam  # Align the trimmed reads to the Flye assembly
racon -t 8 -m 8 -x -6 -g -8 -w 500 -u trimm.fastq assembly.sam flye/assembly.fasta > polishing.fasta  # One round of polishing

Minimap2 documentation: Minimap2 GitHub Repository

Racon documentation: Racon GitHub Repository
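The two polishing commands could be driven from Python roughly as follows. This is a sketch of what a script like Run_polishing.py might do, not its actual contents; the function names are hypothetical, while the flags and default thread counts are copied from the commands above.

```python
import subprocess


def polishing_commands(assembly, reads, sam="assembly.sam",
                       minimap2_threads=14, racon_threads=8):
    """Build the minimap2 and Racon argument lists used for one polishing round."""
    minimap2_cmd = ["minimap2", "-a", "-t", str(minimap2_threads), assembly, reads]
    # Racon expects: <reads> <overlaps> <target sequences>.
    racon_cmd = ["racon", "-t", str(racon_threads),
                 "-m", "8", "-x", "-6", "-g", "-8", "-w", "500", "-u",
                 reads, sam, assembly]
    return minimap2_cmd, racon_cmd


def run_polishing(assembly, reads, sam="assembly.sam", polished="polishing.fasta"):
    """Align reads to the assembly, then polish it; both tools must be on PATH."""
    minimap2_cmd, racon_cmd = polishing_commands(assembly, reads, sam)
    with open(sam, "w") as out:
        subprocess.run(minimap2_cmd, stdout=out, check=True)
    with open(polished, "w") as out:
        subprocess.run(racon_cmd, stdout=out, check=True)
    return polished
```

Passing the commands as lists avoids shell quoting issues with paths, and `check=True` stops the run if either tool exits with an error instead of leaving a truncated polishing.fasta unnoticed.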

Optional: Visualization of the Assembly in IGV

samtools sort -o sorted.bam assembly.sam  # Sort by coordinate and write BAM
samtools index sorted.bam

In IGV:

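  • Click Genomes -> Load Genome from File and load the assembly the reads were aligned to, which is the assembly.fasta file generated by Flye. To inspect the Racon output (polishing.fasta) instead, first re-align the reads to it so that the BAM coordinates match.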
  • Click File -> Load From File and select the sorted and indexed BAM file.