Merge pull request #23 from matt-sd-watson/pytest
Pytest
matt-sd-watson authored Apr 28, 2022
2 parents a7a2d78 + fb95943 commit 917dd96
Showing 11 changed files with 145 additions and 41 deletions.
15 changes: 11 additions & 4 deletions .github/workflows/main.yml
@@ -12,17 +12,21 @@ on:

jobs:
build:
runs-on: ubuntu-latest
name: Outbreaker test on ${{ matrix.os }}
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: ["ubuntu-latest"]
steps:
- uses: actions/checkout@v2
- uses: conda-incubator/setup-miniconda@v2
with:
environment-file: environments/environment.yml
activate-environment: ncov_outbreaker
channels: conda-forge,bioconda,defaults
channels: conda-forge,bioconda,defaults,r
mamba-version: "*"
- name: Install outbreaker
shell: bash -l {0}
run: pip install -e .
run: pip install .
- name: Check outbreaker version
shell: bash -l {0}
run: outbreaker -v
@@ -32,3 +36,6 @@ jobs:
- name: Run outbreaker test via config
shell: bash -l {0}
run: outbreaker -c data/test_config.yaml
- name: Run pytest for outbreaker
shell: bash -l {0}
run: pytest tests/
2 changes: 2 additions & 0 deletions .gitignore
@@ -130,3 +130,5 @@ dmypy.json


.DS_Store

.snakemake/
8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -58,5 +58,11 @@

## Minor Version 0.6.4, 17-02-22

-outbreaker now retains all sequences if ```--names-csv``` is used for renaming and not all sequences are contained in the CSV
- outbreaker now retains all sequences if ```--names-csv``` is used for renaming and not all sequences are contained in the CSV
- updates to the renaming behavior to be compatible with fastafurious v1.2.0 (additional warning messages)

## Minor Version 0.6.5, 28-04-22 (Patch)
- Change behaviour of renaming when no CSV is supplied. Will now use the prefix for the run to generate new names with alphanumerical sequential order (i.e. prefix_1, prefix_2) and will output the name matches as a CSV file
- The above change fixes the error in the SNP distance plot in the HTML report when rename is used but no names CSV is supplied
- Addition of pytests in the CI/CD workflow

2 changes: 1 addition & 1 deletion README.md
@@ -71,6 +71,6 @@ More detailed documentation for outbreaker usage and functionality can be found

## Acknowledgments

Inspiration for code structure and design for outbreaker was inspired by [pangolin](https://github.com/cov-lineages/pangolin) and [civet](https://github.com/artic-network/civet), and minor code blocks were adopted from these software. \
The code structure and design of outbreaker were inspired by [pangolin](https://github.com/cov-lineages/pangolin) and [civet](https://github.com/artic-network/civet), and minor code blocks were adopted from this software.

The **Background** section in the documentation describing outbreak definitions was written by Mark Horsman.
26 changes: 12 additions & 14 deletions docs/2-INPUTS.md
@@ -30,21 +30,19 @@ The following inputs are purely optional, but may augment the types of analysis

## Sample head renaming

For PHO outbreak analysis, it is common to rename a sample COVID-19 sequence with a different alias for privacy purposes, especially if the outbreak analysis is to be shared with external collaborators. A typical renaming scheme for PHO COVID-19 samples would follow the following pattern: \
Original sample name: PHLON20-SARS##### or PHLON22-SARS#####
New sample name: ON-PHL-20-##### or ON-PHL-21-##### \
where ##### denotes the specific WGS Id that is used to track the genomic sequence within the PHO laboratory.
It is common to rename a sample COVID-19 sequence with a different alias for privacy purposes, especially if the outbreak analysis is to be shared with external collaborators. \
outbreaker is designed to facilitate the renaming of FASTA headers to accommodate privacy guidelines and/or to use different label aliases for the outbreak. This feature can be toggled on using ```--rename```. There are two renaming options for the user when ```--rename``` is enabled: \
**Option 1**: The workflow will auto-detect any FASTA headers that have the format PHLON{20,21}-SARS##### and change them to ON-PHL-{20,21}-#####. If the FASTA header does not follow this format, it will be left as is (i.e. Gisaid sample headers that follow a different format, or external samples) \
**Option 2**: A CSV file of FASTA labels can be supplied using --names_csv. This requires that ALL focal and background samples be included in the table. The contents of the table should have the following scheme as an example:
original_name
new_name
PHLON21-SARS29115
sequence_1
PHLON21-SARS15665
sequence_2
This table will allow outbreaker to rename the above PHLON sequences with sequence_# headers in all downstream input files generated by the workflow.
If ```--names_csv```, the CSV headers must have original_name for the current/original header name, and new_name for the target/output name to run properly.
**Option 1**: outbreaker uses the run prefix supplied at runtime to create a new alias for each sample. For example, a run of 10 samples with the run prefix "apartment_can" produces sample names from apartment_can_1 to apartment_can_10. A CSV matching the original names to the newly generated names is written to the output directory. \
**Option 2**: A CSV file of FASTA labels can be supplied using ```--names-csv```, which allows custom labels for specific samples. Not every sample needs a new name in this CSV: as of outbreaker v0.6.4, a sample without a corresponding new name is left unchanged.
The format of this CSV should be as follows:
```
original_name new_name
PHLON21-SARS29115 sequence_1
PHLON21-SARS15665 sequence_2
```

This table will allow outbreaker to use fastafurious to rename the above PHLON sequences with sequence_# headers in all downstream input files generated by the workflow. \
If ```--names-csv``` is supplied, the CSV must use the column headers original_name for the current/original header name and new_name for the target/output name in order to run properly.
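
The two renaming options described above can be illustrated with a minimal Python sketch. This is not the workflow's own code: the function names `rename_headers` and `matches_to_csv` are hypothetical, a plain dictionary stands in for the names CSV (the workflow itself uses fastafurious for that case), and headers are assumed to be unique.

```python
import csv
import io


def rename_headers(headers, prefix, names_map=None):
    """Return a dict mapping each original header to its new name.

    Hypothetical helper for illustration only.
    - No names_map: every header becomes <prefix>_<n>, numbered from 1
      in input order (Option 1).
    - With names_map: only headers listed in it are renamed; all other
      headers are kept unchanged (Option 2).
    """
    if names_map is None:
        return {h: f"{prefix}_{i}" for i, h in enumerate(headers, start=1)}
    return {h: names_map.get(h, h) for h in headers}


def matches_to_csv(matches):
    """Serialise the mapping using the original_name/new_name columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["original_name", "new_name"])
    writer.writerows(matches.items())
    return buffer.getvalue()


headers = ["PHLON21-SARS29115", "PHLON21-SARS15665", "external_sample"]

# Option 1: prefix-based aliases for every sample.
auto = rename_headers(headers, prefix="apartment_can")

# Option 2: custom names for some samples; the rest are left as is.
custom = rename_headers(
    headers,
    prefix="apartment_can",
    names_map={"PHLON21-SARS29115": "sequence_1"},
)

print(matches_to_csv(auto))
```

Returning the mapping rather than writing files directly keeps the sketch easy to check; the same mapping can then be written out as the name-matches CSV.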


## Optional argument descriptions
2 changes: 1 addition & 1 deletion environments/environment.yml
@@ -30,7 +30,7 @@ dependencies:
- r-essentials
- r-traminer
- scipy=1.6.3
- snakemake-minimal
- snakemake
- snp-dists=0.8.2
- snp-sites=2.5.1
- vcftools=0.1.16
2 changes: 1 addition & 1 deletion outbreaker/__init__.py
@@ -1,2 +1,2 @@
_program = "outbreaker"
__version__ = "0.6.4"
__version__ = "0.6.5"
23 changes: 13 additions & 10 deletions outbreaker/workflows/outbreaker.smk
@@ -1,6 +1,7 @@
import os
import sys
import click
import pandas as pd

if not config["outdir"]:
config["outdir"] = os.getcwd() + "/outbreaker/"
@@ -21,6 +22,7 @@ rule all:
os.path.join(config["outdir"], config["prefix"] + ".fa"),
os.path.join(config["outdir"], config["prefix"] + "_filtered.fa") if config["filter"] else [],
os.path.join(config["outdir"], config["prefix"] + "_renamed.fa") if config["rename"] else [],
os.path.join(config["outdir"], config["prefix"] + "_rename_matches.csv") if config["rename"] and not config["names_csv"] else [],
os.path.join(config["outdir"], config["prefix"] + "_aln.fasta"),
os.path.join(config["outdir"], config["prefix"] + "_snipit.jpg"),
os.path.join(config["outdir"], config["prefix"]+ "_tree.nwk"),
@@ -134,7 +136,8 @@ rule rename_headers:
fasta = rules.create_subset.output.sub_fasta,
names_csv = config["names_csv"] if config["names_csv"] else []
output:
renamed = os.path.join(config["outdir"], config["prefix"] + "_renamed.fa")
renamed = os.path.join(config["outdir"], config["prefix"] + "_renamed.fa"),
names_matches = os.path.join(config["outdir"], config["prefix"] + "_rename_matches.csv") if not config["names_csv"] else []
run:
if config["rename"]:
if config["names_csv"]:
@@ -146,19 +149,21 @@ rule rename_headers:
else:
fasta_to_open = open(input.fasta)
newfasta = open(output.renamed, 'w')
names_matches = {}
name_counter = 1
for line in fasta_to_open:
if line.startswith('>'):
line_cleaned = line.strip('>').strip()
try:
replacement_name = "ON-PHL-" + line_cleaned.split("PHLON")[1].split("-SARS")[0] + "-" + line_cleaned.split("PHLON")[1].split("-SARS")[1]
except IndexError:
replacement_name = line_cleaned
replacement_name = config["prefix"] + "_" + str(name_counter)
newfasta.write(">" + replacement_name + "\n")
names_matches[line_cleaned] = replacement_name
name_counter += 1
else:
newfasta.write(line)

fasta_to_open.close()
newfasta.close()
pd.DataFrame(names_matches.items(), columns=['original_name', 'new_name']).to_csv(output.names_matches, index = False)
sys.stderr.write(f'\nrenamed multi-FASTA headers into: {output.renamed}\n')


@@ -321,15 +326,13 @@ rule summary_report:
renamed = convertPythonBooleanToR(config["rename"]),
names_sheet_read = absol_path(config["names_csv"]) if config["names_csv"] else [],
prefix_input = str(config["prefix"]),
report_output = absol_path(os.path.join(config["outdir"])) + "/"
report_output = absol_path(os.path.join(config["outdir"])) + "/",
name_matches = absol_path(os.path.join(config["outdir"], config["prefix"] + "_rename_matches.csv")) if config["rename"] and not config["names_csv"] else []
run:
if config["report"]:
shell(
"""
Rscript -e \"rmarkdown::render(input = '{params.script}', params = list(focal_list = '{params.focal_read}', background_list = '{params.background_read}', snp_dists = '{params.snp_read}', snp_tree = '{params.snp_tree_read}', full_tree = '{params.full_tree_read}', snipit = '{params.snipit_read}', renamed = '{params.renamed}', names_csv = '{params.names_sheet_read}', outbreak_prefix = '{params.prefix_input}', outbreak_directory = '{params.report_output}'), output_file = '{params.output}')\"
Rscript -e \"rmarkdown::render(input = '{params.script}', params = list(focal_list = '{params.focal_read}', background_list = '{params.background_read}', snp_dists = '{params.snp_read}', snp_tree = '{params.snp_tree_read}', full_tree = '{params.full_tree_read}', snipit = '{params.snipit_read}', renamed = '{params.renamed}', names_csv = '{params.names_sheet_read}', outbreak_prefix = '{params.prefix_input}', outbreak_directory = '{params.report_output}', name_matches = '{params.name_matches}'), output_file = '{params.output}')\"
""")





28 changes: 20 additions & 8 deletions outbreaker/workflows/outbreaker_summary_report.Rmd
@@ -29,6 +29,9 @@ params:
value: ""
outbreak_directory:
value: ""
name_matches:
input: file
value: ""
output:
html_document:
toc: yes
@@ -108,9 +111,10 @@ if (file_ext(params$focal_list) %in% fasta_extensions) {
```{r, echo=F, warning=F, message=F}
if (params$renamed == "TRUE" & params$names_csv == "") {
rename_matches <- read.csv(params$name_matches)
new_focal_names <- as.vector(subset(rename_matches, original_name %in% focal_input$Sequence)$new_name)
new_focal_names <- as.vector(paste("ON-PHL", str_split_fixed(focal_input$Sequence, "PHLON|-SARS", 4)[,2],
str_split_fixed(focal_input$Sequence, "PHLON|-SARS", 4)[,3], sep = "-"))
} else if (params$renamed == "TRUE" & params$names_csv != "") {
renaming_sheet <- read.csv(params$names_csv, header = T,
na.strings=c("","NA"),
@@ -249,16 +253,24 @@ distances <- read.csv(params$snp_dists, header = FALSE,
na.strings=c("","NA"),
stringsAsFactors=FALSE,
sep=",") %>% filter(! grepl("MN908947", V1) &
! grepl("MN908947", V2))
! grepl("MN908947", V2)) %>%
filter(V1 != V2)
filtered_w_background <- subset(distances, V1 %in% subset(tr.df.labs, category == "Focal_Sequence")$label &
! V2 %in% subset(tr.df.labs, category == "Focal_Sequence")$label)
filtered_w_background <- subset(distances, V1 %in% as.vector(subset(tr.df.labs, category == "Focal_Sequence")$label) &
! V2 %in% as.vector(subset(tr.df.labs, category == "Focal_Sequence")$label))
filtered_only_focal <- subset(distances, V1 %in% subset(tr.df.labs, category == "Focal_Sequence")$label &
V2 %in% subset(tr.df.labs, category == "Focal_Sequence")$label)
filtered_only_focal <- subset(distances, V1 %in% as.vector(subset(tr.df.labs, category == "Focal_Sequence")$label) &
V2 %in% as.vector(subset(tr.df.labs, category == "Focal_Sequence")$label))
distance_frame_only_focal <- as.data.frame(table(filtered_only_focal$V3)) %>% mutate(Var1 = as.numeric(as.character(Var1)))
distance_frame_only_focal <- as.data.frame(table(filtered_only_focal$V3))
if (nrow(distance_frame_only_focal) != 0) {
distance_frame_only_focal <- distance_frame_only_focal %>% mutate(Var1 = as.numeric(as.character(Var1)))
colnames(distance_frame_only_focal) <- c("SNP_Distance", "Frequency")
} else {
distance_frame_only_focal <- data.frame(SNP_Distance = numeric(),
Frequency = numeric())
}
distance_frame_w_background <-as.data.frame(table(filtered_w_background$V3))
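
The focal-only filtering and the empty-result guard in this hunk can be restated as a rough Python sketch. The report itself does this in R; the function name `focal_distance_frequencies` and the tuple layout `(V1, V2, distance)` are assumptions made for illustration only.

```python
from collections import Counter


def focal_distance_frequencies(rows, focal_labels):
    """Count SNP distances between pairs of distinct focal sequences.

    rows: iterable of (seq_a, seq_b, distance) tuples, standing in for
    the V1, V2, V3 columns of the SNP distance table.
    Returns a sorted list of (distance, frequency) pairs, which is
    simply empty when no focal-to-focal pair exists.
    """
    focal = set(focal_labels)
    counts = Counter(
        dist
        for a, b, dist in rows
        # Skip self-comparisons, mirroring filter(V1 != V2),
        # and keep only pairs where both sequences are focal.
        if a != b and a in focal and b in focal
    )
    return sorted(counts.items())


rows = [("A", "A", 0), ("A", "B", 2), ("B", "A", 2), ("A", "C", 5)]

both_focal = focal_distance_frequencies(rows, ["A", "B"])
single_focal = focal_distance_frequencies(rows, ["A"])
print(both_focal, single_focal)
```

With only one focal sequence there are no focal pairs, so the result is an empty list with no special-casing; that is the situation the `nrow(distance_frame_only_focal) != 0` check handles on the R side.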
2 changes: 1 addition & 1 deletion setup.py
@@ -13,7 +13,7 @@
author='Matthew Watson',
author_email='[email protected]',
description='snakemake and Python integrated workflow for intermediate file generation for COVID outbreak analysis',
install_requires = ["pandas>=1.1.5", "numpy>=1.19", "biopython>=1.79"],
install_requires = ["pandas>=1.1.5", "numpy>=1.19", "biopython>=1.79", "snakemake>=7.0.0"],
entry_points="""
[console_scripts]
{program} = outbreaker.main:main
76 changes: 76 additions & 0 deletions tests/test_outbreaker.py
@@ -0,0 +1,76 @@
import os
from outbreaker import main
import sys
from Bio import SeqIO

DATA_DIR = os.path.abspath(os.path.join(os.path.dirname( __file__ ), '..', 'data/'))
print(DATA_DIR)

test_reference = os.path.join(DATA_DIR, 'reference', 'ncov_reference.gb')


class TestOutbreaker:
def test_read_test_focal_fasta(self):
query_file = os.path.join(DATA_DIR, 'tests/', 'focal_seqs.fa')
assert len(list(SeqIO.parse(query_file, "fasta"))) == 4
def test_read_test_background_fasta(self):
query_file = os.path.join(DATA_DIR, 'tests/', 'background_seqs.fa')
assert len(list(SeqIO.parse(query_file, "fasta"))) == 6

def test_run_outputs(self, tmp_path):
focal_seqs = os.path.join(DATA_DIR, 'tests/', 'focal_seqs.fa')
background_seqs = os.path.join(DATA_DIR, 'tests/', 'background_seqs.fa')

args = ['-f', str(focal_seqs), '-b', str(background_seqs), '--rename', '-p', 'pytest',
'-r', str(test_reference), '-o', str(tmp_path)]

main.main(sysargs = args)
output_merged_fasta = os.path.join(tmp_path, 'pytest_renamed.fa')
assert len(list(SeqIO.parse(output_merged_fasta, "fasta"))) == 10

new_names = ["pytest_" + str(i) for i in range(1, 11, 1)]
names_in_fasta = []
for record in SeqIO.parse(output_merged_fasta, "fasta"):
names_in_fasta.append(record.id)
assert names_in_fasta == new_names


def test_run_with_missing_names_csv(self, tmp_path):
focal_seqs = os.path.join(DATA_DIR, 'tests/', 'focal_seqs.fa')
background_seqs = os.path.join(DATA_DIR, 'tests/', 'background_seqs.fa')
names_csv = os.path.join(DATA_DIR, 'tests/', 'names.csv')

args = ['-f', str(focal_seqs), '-b', str(background_seqs), '--rename', '-p', 'pytest',
'-r', str(test_reference), '-o', str(tmp_path), '--names-csv', str(names_csv)]

main.main(sysargs=args)

output_merged_fasta = os.path.join(tmp_path, 'pytest_renamed.fa')
names_in_fasta = []
for record in SeqIO.parse(output_merged_fasta, "fasta"):
names_in_fasta.append(record.id)
names_not_all = ['Renamed_1', 'Renamed_2', 'Renamed_3',
'Focal_4', 'Renamed_4', 'Renamed_5', 'Background_3',
'Renamed_6', 'Renamed_7', 'Renamed_8']
assert names_in_fasta == names_not_all

output_snp_dists = os.path.join(tmp_path, "pytest_snp_dists.csv")

with open(output_snp_dists) as f:
lines = f.readlines()
assert str('Renamed_8,Background_3,5\n') in lines














