CLASV

Overview

CLASV predicts Lassa virus (LASV) lineages using a random forest classifier.

This is one of three repositories holding the supporting data and code for the manuscript. All three are listed below:

Project Repositories

Data and Processing: LASV_ML_Manuscript_Data
Lassa Virus Phylogenetics: LASV_Phylogenetics_Pipeline
Lassa Virus Lineage Prediction: LASV_Lineage_Prediction

Jupyter Notebooks on Google Colab

General Preprocessing: Notebook Link
Motif Search Using RF MD Pcorr: Notebook Link
Lassa Virus Lineage Prediction Training: Notebook Link

Prediction Pipeline Overview

Running the Pipeline

This pipeline relies on Nextstrain for gene extraction and alignment. Install Nextstrain first by following its installation guide, and make sure the nextstrain command is available in your terminal.
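
As a quick pre-flight check before running anything, the sketch below looks for the nextstrain executable on your PATH using Python's standard library. The helper name find_command is made up for this illustration and is not part of the pipeline.

```python
import shutil


def find_command(name: str = "nextstrain"):
    """Return the full path of `name` if it is on PATH, otherwise None.

    `find_command` is a hypothetical helper for this example only.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_command()
    if path is None:
        print("nextstrain not found on PATH; install it before running the pipeline")
    else:
        print(f"nextstrain found at {path}")
```

If the command is not found, revisit the Nextstrain installation guide before continuing.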

Clone this repository (or download it as a ZIP file and unzip it):

git clone https://github.com/JoiRichi/CLASV.git

Enter the Nextstrain shell in the root directory of the pipeline. Note: you must enter the Nextstrain shell each time you want to use the pipeline.

nextstrain shell .

When the shell is active, run the pipeline using:

snakemake -s predict_lineage.smk --cores 5  # change the number of cores as needed
# To re-run the pipeline from scratch, add -F:
# snakemake -s predict_lineage.smk --cores 5 -F
# Refer to the Snakemake documentation for further options.

When the run completes, go to the pipeline's visuals folder and open the HTML files in a browser.

Model training

The data preprocessing is described in LASV_ML_Manuscript_Data. The training process is documented here: Notebook Link.

Customization

The pipeline can process multiple FASTA files, each containing multiple sequences. Concatenating multiple FASTA files into one is recommended but not required, especially if the files belong to different projects. By default, the pipeline finds all files with the extension .fasta in the raw_data folder and searches them for LASV GPC sequences. You can either move your FASTA files into this folder (recommended) or set raw_seq_folder in the config.yaml file to the path of the folder containing your sequences.
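
If you want to merge several FASTA files into one before a run, a minimal sketch along these lines may help. The function name concatenate_fasta and the folder my_sequences in the usage note are hypothetical; only the raw_data folder and the .fasta extension come from the pipeline defaults described above.

```python
from pathlib import Path


def concatenate_fasta(src_dir, out_file, pattern="*.fasta"):
    """Concatenate every file matching `pattern` in `src_dir` into `out_file`.

    Hypothetical helper, not part of the pipeline. The output file itself is
    skipped if it happens to live in `src_dir`. Returns the number of files merged.
    """
    src = Path(src_dir)
    out = Path(out_file).resolve()
    # Collect inputs before opening the output so it is never read back in.
    inputs = sorted(p for p in src.glob(pattern) if p.resolve() != out)
    with out.open("w") as handle:
        for path in inputs:
            text = path.read_text()
            handle.write(text)
            # Make sure the next file's header starts on its own line.
            if text and not text.endswith("\n"):
                handle.write("\n")
    return len(inputs)
```

For example, concatenate_fasta("my_sequences", "raw_data/combined.fasta") would merge the files from a separate folder into a single input for the pipeline. Keeping the originals outside raw_data avoids the same sequences being picked up twice.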

So that Snakemake can track which files have already been checked, an intermediate file is created for every input file, even one that contains no GPC sequences. In that case the intermediate file is empty.
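
To see which inputs yielded no GPC sequences, you could list the empty intermediate files. The sketch below assumes you know which folder holds the intermediate files; this section does not say where they are written, so the folder is passed in as an argument. The helper name empty_files is hypothetical.

```python
from pathlib import Path


def empty_files(folder):
    """Return the sorted names of zero-byte regular files directly inside `folder`.

    Hypothetical helper: `folder` must be whichever directory holds the
    pipeline's intermediate files (an assumption left to the user).
    """
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )
```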

Important Outputs

At the end of the run, the predictions folder contains CSV files with the predictions per sample. Visualizations of the predictions are in the visuals folder; open the HTML files in a browser. The plots are high quality and interactive, so you can hover over them to see more information.
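
For a quick sanity check of a finished run, the following sketch counts the data rows in each CSV file in the predictions folder. It assumes each CSV starts with a single header row, and it makes no assumption about column names, which are not documented here. The function name count_prediction_rows is hypothetical.

```python
import csv
from pathlib import Path


def count_prediction_rows(pred_dir="predictions"):
    """Map each CSV file name in `pred_dir` to its number of data rows.

    Assumes the first row of each file is a header (an assumption, not
    confirmed by the pipeline documentation). Empty files count as 0.
    """
    counts = {}
    for path in sorted(Path(pred_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            rows = list(csv.reader(handle))
        counts[path.name] = max(len(rows) - 1, 0)
    return counts


if __name__ == "__main__":
    for name, n in count_prediction_rows().items():
        print(f"{name}: {n} predicted samples")
```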

For further details, please refer to the respective notebooks and repositories linked above. You can also leave a comment for help regarding the pipeline.
