This repository contains the data and code supporting the LASV manuscript. It is one of three repositories associated with the project. For complete project details, see the links below:
- Data and Processing: LASV_ML_Manuscript_Data
- Lassa Virus Phylogenetics: LASV_Phylogenetics_Pipeline
- Lassa Virus Lineage Prediction: LASV_Lineage_Prediction
- General Preprocessing: Notebook Link
- Motif Search Using RF MD Pcorr: Notebook Link
- Lassa Virus Lineage Prediction Training: Notebook Link
This repository stores all data and intermediate files related to the project.
- raw_ncbi: Contains nucleotide sequences and accompanying metadata of Lassa virus available on NCBI Virus until 01/12/2023.
- last_mafft_alignment: Contains the GPC sequences extracted and aligned using LAST and MAFFT, available here. Reference gene. The alignment file and the raw metadata file (stored in raw_ncbi) serve as inputs for the General Preprocessing notebook.
- alignment_preprocessing: Contains outputs from the General Preprocessing notebook. The alignment from MAFFT was filtered to produce passed files, which were manually curated and translated to amino acids using AliView. These files are inputs for the Lassa Virus Phylogenetics pipeline and for the Motif Search Using RF MD Pcorr and Lassa Virus Lineage Prediction Training notebooks.
- result_motiff_search: Contains results from the Motif Search Using RF MD Pcorr notebook.
- phylo_tree: Contains the result of the LASV_Phylogenetics_Pipeline. This is the tree from which the conclusions in the manuscript were drawn. Please open the tree file at https://auspice.us/ and use its features to explore the tree.
- lineage_annotation: Contains sequence IDs grouped into lineages based on annotated clades from the Lassa Virus Phylogenetics analysis. These files contain target variables for the Lassa Virus Lineage Prediction Training notebook.
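
As a rough illustration of how the folders above relate, the following minimal Python sketch pairs sequences from an aligned FASTA file with lineage labels by sequence ID. It is not the code used in the notebooks: the function names `read_fasta` and `label_sequences`, the toy sequences, and the lineage names are hypothetical, and the actual file names and formats in `alignment_preprocessing` and `lineage_annotation` may differ.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into a dict of {sequence ID: sequence}."""
    records = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            seq_id = line[1:].split()[0]
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line)
    return {sid: "".join(parts) for sid, parts in records.items()}


def label_sequences(records, lineage_to_ids):
    """Pair each sequence with its lineage label, skipping unannotated IDs."""
    id_to_lineage = {
        sid: lineage for lineage, ids in lineage_to_ids.items() for sid in ids
    }
    return [
        (sid, id_to_lineage[sid], seq)
        for sid, seq in records.items()
        if sid in id_to_lineage
    ]


# Toy stand-ins (hypothetical) for an alignment file and lineage annotation files.
example_alignment = StringIO(
    ">seq1 demo\nATG-CC\nGTA\n>seq2\nATGACC\nGTA\n>seq3\nATGTCCGTA\n"
)
example_lineages = {"lineage_A": ["seq1", "seq3"], "lineage_B": ["seq2"]}

records = read_fasta(example_alignment)
labeled = label_sequences(records, example_lineages)
for sid, lineage, seq in labeled:
    print(sid, lineage, len(seq))
```

To try this on real data, replace the `StringIO` object with an opened alignment file and build the lineage dictionary from the annotation files, after checking their actual layout.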
For further details, please refer to the respective notebooks and repositories linked above.