Commit d7b8fb9

author
mnika
committed
init
0 parents  commit d7b8fb9

File tree

1,789 files changed

+343880
-0
lines changed

LICENSE

+674
Large diffs are not rendered by default.

README.md

+127
@@ -0,0 +1,127 @@
1+
# MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing
2+
3+
## What is MegIS?
4+
5+
MegIS is the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines.
6+
7+
8+
<p align="center">
9+
<img src="megis-overview.png" alt="drawing" width="400"/>
10+
</p>
11+
12+
13+
## Citation
14+
If you find this repo useful, please cite the following paper:
15+
16+
Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu,
17+
["MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing"](https://arxiv.org/pdf/2406.19113)
18+
ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024.
19+
20+
```bibtex
@inproceedings{ghiasi2024megis,
  title={MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing},
  author={Ghiasi, Nika Mansouri and Sadrosadati, Mohammad and Mustafa, Harun and Gollwitzer, Arvid and Firtina, Can and Eudine, Julien and Mao, Haiyu and Lindegger, Jo{\"e}l and Cavlak, Meryem Banu and Alser, Mohammed and Park, Jisung and Mutlu, Onur},
  booktitle={2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)},
  pages={660--677},
  year={2024},
  organization={IEEE}
}
```
30+
31+
## Table of Contents
32+
33+
* [What is MegIS?](#what-is-megis)
* [Citation](#citation)
* [Prerequisites](#prerequisites)
* [Input Data](#input-data)
  + [Query Read Sets](#query-read-sets)
  + [Database](#database)
* [Preparing the Input Queries](#preparing-the-input-queries)
* [Finding Candidate Species](#finding-candidate-species)
* [End-to-End Throughput](#end-to-end-throughput)
* [Contact](#contact)
43+
44+
45+
## Prerequisites
46+
47+
The infrastructure has been tested with the following system configuration:
48+
* g++ v11.1.0
49+
* Python v3.9.12
50+
51+
Prerequisites specific to each experiment are listed in their respective subsections.
52+
53+
54+
## Input Data
55+
56+
### Query Read Sets
57+
58+
The read sets used in the paper are from the commonly used [CAMI benchmark](https://www.nature.com/articles/nmeth.4458). They can be obtained from [this link](http://gigadb.org/dataset/100344).
59+
60+
### Database
61+
62+
In our paper, we generate a database based on microbial genomes drawn from NCBI’s databases, including 155,442 genomes for 52,961 microbial species. You can download the genomes by running the script `input-data/download_genomes.sh`. This also creates a list of the files in `input-data/database_genomes.txt`.
63+
64+
After compiling KMC, build the database from the input files by running the following commands:
65+
```bash
# create a local scratch directory for KMC's temporary files
mkdir -p kmc_tmp

# create an initial database of 60-mers from the FASTA genomes listed in the file
kmc -k60 -fa -ci0 -cs3 -t"$NUM_THREADS" @input-data/database_genomes.txt "${OUT_FILE}_pre" kmc_tmp

# sort k-mers in the database
kmc_tools transform "${OUT_FILE}_pre" sort "$OUT_FILE"

# remove the initial database and the scratch directory
# (a KMC database is stored as a .kmc_pre/.kmc_suf file pair)
rmdir kmc_tmp
rm "${OUT_FILE}_pre.kmc_pre" "${OUT_FILE}_pre.kmc_suf"
```
79+
Replace `$OUT_FILE` with the desired output name and `$NUM_THREADS` with the desired number of CPU threads.
80+
81+
## Preparing the Input Queries
82+
83+
To extract k-mers from input queries, MegIS introduces a new input processing scheme that improves upon the one in [KMC](https://github.com/refresh-bio/KMC). MegIS overlaps the k-mer sorting and SSD transfer of one bucket with the in-storage processing operations of the [next step](#finding-candidate-species) on the previously transferred buckets. This overlap between pipeline stages is modeled in `pipeline/pipeline_throughput.py`.
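
The effect of this overlap on total time can be illustrated with a small two-stage pipeline model. This is a minimal sketch under stated assumptions, not the model in `pipeline/pipeline_throughput.py`: the function names `overlapped_time` and `serial_time` and the per-bucket timings are hypothetical, and the sketch assumes buckets are processed in order with one bucket in flight per stage.

```python
from typing import Sequence


def overlapped_time(host_times: Sequence[float], isp_times: Sequence[float]) -> float:
    """Completion time when in-storage processing of bucket i overlaps
    with host-side sorting and transfer of bucket i + 1.

    host_times[i]: time to sort bucket i and transfer it to the SSD.
    isp_times[i]:  time for the SSD to process bucket i.
    """
    if len(host_times) != len(isp_times):
        raise ValueError("need one host time and one in-storage time per bucket")
    host_done = 0.0  # when the host finishes sending the current bucket
    isp_done = 0.0   # when the SSD finishes processing the current bucket
    for host_t, isp_t in zip(host_times, isp_times):
        host_done += host_t
        # The SSD starts a bucket once it has arrived and the previous one is done.
        isp_done = max(isp_done, host_done) + isp_t
    return isp_done


def serial_time(host_times: Sequence[float], isp_times: Sequence[float]) -> float:
    """Completion time without any overlap between the two stages."""
    return float(sum(host_times) + sum(isp_times))


if __name__ == "__main__":
    host = [2.0, 2.0, 2.0]  # hypothetical per-bucket host times
    isp = [3.0, 3.0, 3.0]   # hypothetical per-bucket in-storage times
    print("serial:    ", serial_time(host, isp))
    print("overlapped:", overlapped_time(host, isp))
```

In this toy model, the overlapped time approaches the total time of the slower stage as the number of buckets grows, which is why hiding the sorting and transfer behind in-storage processing pays off.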
84+
85+
In `preparing-input-queries`, we include an optimized version of KMC as a software baseline. This version improves execution time by utilizing a fixed prefix length for query and database k-mers. For a fair comparison, we apply the same optimization (excluding the in-storage processing overlap used in MegIS) when evaluating the baseline software tool, A-Opt, in our experiments.
86+
87+
To prepare a query, run the following commands:
88+
```bash
# create a local scratch directory for KMC's temporary files
mkdir -p kmc_tmp

# create an initial k-mer counter of 60-mers from the FASTQ query,
# dropping k-mers that occur fewer than 2 times
kmc -k60 -fq -ci2 -cs3 -t"$NUM_THREADS" "$IN_FILE" "${OUT_FILE}_pre" kmc_tmp

# sort k-mers in the k-mer counter
kmc_tools transform "${OUT_FILE}_pre" sort "$OUT_FILE"

# remove the initial k-mer counter and the scratch directory
# (a KMC database is stored as a .kmc_pre/.kmc_suf file pair)
rmdir kmc_tmp
rm "${OUT_FILE}_pre.kmc_pre" "${OUT_FILE}_pre.kmc_suf"
```
102+
Replace `$IN_FILE` with the path to the input query file, `$OUT_FILE` with the desired output name, and `$NUM_THREADS` with the desired number of CPU threads. If the input file is in FASTA format, use the `-fa` flag instead of `-fq` when creating the initial k-mer counter.
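
To make the counting options above concrete, the following sketch mimics what `-ci2` (minimum count) and `-cs3` (counter cap) mean on a toy input. It is an illustration only, not how KMC is implemented: `canonical` and `count_kmers` are hypothetical helper names, and the sketch assumes KMC's default behavior of counting canonical k-mers and skipping k-mers that contain symbols other than A, C, G, and T. The lexicographic sort at the end stands in for the sorted output and may not match KMC's on-disk order.

```python
from collections import Counter
from typing import Iterable, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(
    reads: Iterable[str], k: int, min_count: int = 2, max_counter: int = 3
) -> List[Tuple[str, int]]:
    """Count canonical k-mers, drop rare ones, and cap the stored counts.

    min_count plays the role of -ci, max_counter the role of -cs.
    """
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i : i + k]
            if set(kmer) <= _VALID:  # skip k-mers with N or other symbols
                counts[canonical(kmer)] += 1
    return sorted(
        (kmer, min(count, max_counter))
        for kmer, count in counts.items()
        if count >= min_count
    )


if __name__ == "__main__":
    # Toy reads with k = 3; the commands above use k = 60.
    print(count_kmers(["ACGTA", "ACGTT"], k=3))
```

With `min_count=2`, k-mers seen only once (often sequencing errors in a query read set) are discarded, while the database build above uses `-ci0` so that no database k-mer is dropped.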
103+
104+
## Finding Candidate Species
105+
106+
107+
MegIS runs this step with in-storage processing accelerators. The Verilog implementations of MegIS's lightweight hardware units are in `hdl/`.
108+
109+
To ensure a fair comparison with software baselines, we incorporate optimizations that improve the utilization of the SSD's I/O bandwidth when identifying intersecting k-mers in software. We include the optimized intersection-finding implementation in `finding-candidate-species/` and use it when evaluating the A-Opt baseline. Given a database `$DB_FILE` and a query `$QUERY_FILE` (both excluding the `.kmc_suf` extension), their intersection can be computed using the `intersection` executable in `finding-candidate-species`. Compile it by running `make` in that directory. Afterwards, compute an intersection as follows:
110+
```bash
intersection $DB_FILE $QUERY_FILE $NUM_THREADS $INTERSECTION_OUT
```
113+
where `$NUM_THREADS` is the desired number of threads and `$INTERSECTION_OUT` is the name of the output.
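
Because both the database and the query are sorted, their intersection can be found in a single linear pass over the two inputs. The sketch below shows that idea on in-memory lists. It is a minimal single-threaded illustration, not the code in `intersection.cpp` (which the Makefile builds with OpenMP); the function name `intersect_sorted` is hypothetical, and the sketch assumes both inputs are sorted in the same order and contain no duplicates.

```python
from typing import List, Sequence


def intersect_sorted(db_kmers: Sequence[str], query_kmers: Sequence[str]) -> List[str]:
    """Return the k-mers present in both sorted, duplicate-free inputs."""
    shared: List[str] = []
    i = j = 0
    while i < len(db_kmers) and j < len(query_kmers):
        if db_kmers[i] == query_kmers[j]:
            shared.append(db_kmers[i])
            i += 1
            j += 1
        elif db_kmers[i] < query_kmers[j]:
            i += 1  # database k-mer is smaller, so it cannot match later query k-mers
        else:
            j += 1  # query k-mer is smaller, so it cannot match later database k-mers
    return shared


if __name__ == "__main__":
    database = ["AAC", "ACG", "GTA", "TTT"]
    query = ["ACG", "CCC", "TTT"]
    print(intersect_sorted(database, query))
```

Each input is read once and strictly in order, which is the access pattern that makes streaming the sorted database from the SSD practical.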
114+
115+
## End-to-End Throughput
116+
117+
To find the end-to-end throughput of the `pipeline`, we incorporate the latency and throughput of all of MegIS's components, including host operations, flash chip accesses, the SSD's internal DRAM, the in-storage accelerators, and the host-SSD interfaces.
118+
119+
For the components in the hardware-based steps (e.g., finding candidate species), we implement MegIS’s logic components in Verilog. We use two state-of-the-art simulators:
120+
[Ramulator](https://github.com/CMU-SAFARI/ramulator) to model the SSD’s internal DRAM, and [MQSim](https://github.com/CMU-SAFARI/MQSim) to model the SSD’s internal operations.
121+
122+
For the components in the software-based steps (e.g., host operations for preparing the input queries), we measure performance on a real system with an AMD® EPYC® 7742 CPU, 128 physical cores, and 1 TB of DRAM. For the software baselines, we measure performance on the same system using the best-performing thread counts.
123+
124+
125+
## Contact
126+
127+
Nika Mansouri Ghiasi - [email protected]

finding-candidate-species/Makefile

+5
@@ -0,0 +1,5 @@
1+
all:
2+
$(CXX) --std=c++17 -O3 -Wall -Werror -DNDEBUG -fopenmp -march=native -o intersection intersection.cpp progress_bar.cpp
3+
4+
profile:
5+
$(CXX) --std=c++17 -O2 -Wall -Werror -DNDEBUG -fopenmp -march=native -pg -g -o intersection_profile intersection.cpp progress_bar.cpp
