YanelElinaBernardi/De-novo-assembly

De novo Assembly Pipeline for Long Reads

This pipeline performs de novo assembly of long DNA reads generated with Oxford Nanopore sequencing.

Basic Setup

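  • Create an environment with all required tools (FastQC is included because the quality-control steps use it):
conda create -n denovo -c conda-forge -c bioconda python fastx_toolkit fastqc samtools porechop flye quast prokka minimap2 racon
  • Activate the environment before running any step:
conda activate denovo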

NOTE: If you already have FASTQ files, skip directly to Section 2.

Section 1: Basecalling (optional)

The raw sequencing data is generated in POD5 or FASTQ format and is demultiplexed according to the sequencing setup. Basecalling can be performed in real time during sequencing, which makes this section optional. Depending on the experiment, however, it may be worth re-running basecalling with a more accurate model. If basecalling has not been done or needs to be repeated, follow these steps:

Before running Dorado, download the basecalling models:

dorado download --model <model name>
dorado download --model all # To download all available models

Dorado outputs data in BAM format by default. To generate FASTQ files, use the --emit-fastq option or convert BAM files to FASTQ using Samtools.

  1. Basecalling
dorado basecaller <model> pod5s/ > calls.bam # Recommended model: [email protected]

To convert BAM files to FASTQ format:

samtools fastq <BAM> > <FASTQ>
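
The two commands above can also be driven from Python, in keeping with the Run_*.py scripts. The following is a minimal sketch and not part of the repository: run_to_file and basecall_commands are hypothetical helpers, and the model name, POD5 folder and output names are whatever you pass in.

```python
import subprocess
from pathlib import Path


def run_to_file(cmd, out_path):
    """Run cmd (a list of arguments) and write its standard output to out_path.

    Equivalent to `cmd > out_path` in the shell. Raises
    subprocess.CalledProcessError if the command exits with a non-zero status.
    """
    out_path = Path(out_path)
    with out_path.open("wb") as handle:
        subprocess.run(cmd, stdout=handle, check=True)
    return out_path


def basecall_commands(model, pod5_dir, bam):
    """Return the basecalling and BAM-to-FASTQ commands as argument lists."""
    return [
        ["dorado", "basecaller", model, str(pod5_dir)],
        ["samtools", "fastq", str(bam)],
    ]
```

For example, unpack the two commands with basecall, to_fastq = basecall_commands("<model>", "pod5s/", "calls.bam"), then call run_to_file(basecall, "calls.bam") followed by run_to_file(to_fastq, "calls.fastq"). Passing argument lists instead of a shell string avoids quoting problems with paths that contain spaces.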

Section 2: Assembly pipeline

Scripts for running the programs on a dataset are named as follows:

  • Run_concatenate.py
  • Run_QC_first.py
  • Run_trimming_check.py
  • Run_assembly_QC.py
  • Run_polishing.py

Before executing, modify the input_folder variable in each script:

input_folder = "/PATH/TO/FASTQ/"

Execute the scripts using:

python <Run.py> </PATH/TO/FASTQ/>
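
The scripts themselves are not reproduced here, so the following is only a sketch of how a Run_*.py script could honour both conventions mentioned above: a path given on the command line takes precedence, and the input_folder variable serves as the fallback. The function names resolve_input_folder and list_fastq are hypothetical.

```python
import sys
from pathlib import Path

# Fallback used when no folder is given on the command line (placeholder path).
input_folder = "/PATH/TO/FASTQ/"


def resolve_input_folder(argv, default=input_folder):
    """Return the FASTQ folder: the first CLI argument if present, else default."""
    return Path(argv[1] if len(argv) > 1 else default)


def list_fastq(folder):
    """List plain and gzipped FASTQ files in folder, sorted by path."""
    patterns = ("*.fastq", "*.fastq.gz")
    files = {p for pattern in patterns for p in Path(folder).glob(pattern)}
    return sorted(files)
```

Inside a script, resolve_input_folder(sys.argv) would give the folder to process, and list_fastq can be used to confirm that it actually contains reads before any tool is launched.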

The steps in the pipeline are detailed below with examples for each script:

Sequence Preprocessing

I. Merge all FASTQ files obtained from basecalling:

cat *.fastq.gz > merge.fastq.gz
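
A Python equivalent of the cat command is sketched below; it is an illustration, not the content of Run_concatenate.py. It relies on the fact that gzip files can be concatenated byte for byte into a valid multi-member gzip file. The function name merge_fastq_gz is hypothetical, and unlike the shell glob it skips an existing merge.fastq.gz so that a re-run does not feed the output back into itself.

```python
import shutil
from pathlib import Path


def merge_fastq_gz(folder, output_name="merge.fastq.gz"):
    """Concatenate every *.fastq.gz in folder into folder/output_name.

    Returns the output path and the list of input files that were merged.
    """
    folder = Path(folder)
    output = folder / output_name
    # Collect inputs before opening the output, and exclude the output itself.
    inputs = sorted(p for p in folder.glob("*.fastq.gz") if p.name != output_name)
    with output.open("wb") as out:
        for path in inputs:
            with path.open("rb") as src:
                shutil.copyfileobj(src, out)  # raw byte copy, no recompression
    return output, inputs
```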

II. Quality control:

fastqc [-t <threads>] merge.fastq.gz  # Used 14 threads for the -t option

FastQC documentation: FastQC Help

III. Trimming:

porechop -i merge.fastq.gz -o trimm.fastq  # Basic trimming mode

Porechop documentation: Porechop GitHub Repository

IV. Quality control after trimming:

fastqc [-t <threads>] trimm.fastq  # Used 14 threads for the -t option

This step checks that trimming was successful and shows whether the trimming settings need adjusting.
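
As a quick numeric complement to the FastQC reports, read counts and mean read length can be compared before and after trimming. The sketch below is an addition, not part of the repository; fastq_stats is a hypothetical helper and it assumes standard four-line FASTQ records.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return path.open("r")


def fastq_stats(path):
    """Return (read_count, total_bases, mean_length) for a 4-line-per-record FASTQ."""
    reads = 0
    bases = 0
    with open_text(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                reads += 1
                bases += len(line.strip())
    mean = bases / reads if reads else 0.0
    return reads, bases, mean
```

Comparing fastq_stats("merge.fastq.gz") with fastq_stats("trimm.fastq") gives a rough idea of how many reads and bases the trimming step removed.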

Sequence Assembly

I. Assembly:

flye --nano-raw trimm.fastq --out-dir flye --threads 4

Flye documentation: Flye Usage Guide

II. Assembly quality:

quast.py flye/assembly.fasta -o quast_assembly

Quast documentation: Quast GitHub Repository
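
One of the contiguity metrics QUAST reports is N50: the length of the shortest contig among the longest contigs that together cover at least half of the assembly. The sketch below shows how that number can be computed directly from the Flye output for a quick sanity check. It is an illustration, not QUAST's implementation, and the function names are hypothetical.

```python
def contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file, in file order."""
    lengths = []
    current = 0
    started = False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if started:
                    lengths.append(current)  # close the previous record
                current = 0
                started = True
            elif started:
                current += len(line)
    if started:
        lengths.append(current)  # close the last record
    return lengths


def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For instance, n50(contig_lengths("flye/assembly.fasta")) should agree with the N50 value in the QUAST report for the same file.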

III. Polishing:

minimap2 -a -t 14 flye/assembly.fasta trimm.fastq > assembly.sam
racon -t 8 -m 8 -x -6 -g -8 -w 500 -u trimm.fastq assembly.sam flye/assembly.fasta > polishing.fasta

Minimap2 documentation: Minimap2 GitHub Repository

Racon documentation: Racon GitHub Repository

Optional: Visualization of the Assembly in IGV

samtools sort assembly.sam > sorted.bam
samtools index sorted.bam

In IGV:

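  • Click Genomes -> Load Genome from File and load the assembly.fasta file generated by Flye, which is the reference the reads in assembly.sam were aligned to. To view the Racon-polished assembly (polishing.fasta) instead, first re-run minimap2 against polishing.fasta so that the alignment coordinates match.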
  • Click File -> Load From File and select the sorted and indexed BAM file.
