Benchmarks of various genomic ranges operations
Pre-requisites
- pyenv
$ pyenv --version
pyenv 2.5.0
- poetry
$ poetry --version
Poetry (version 2.0.0)
pyenv install 3.12.8
pyenv local 3.12.8
poetry env use 3.12
poetry update
Please note that you need at least 64 GB of RAM to run the full benchmarks. For the default configuration, 16-32 GB should be enough.
All benchmarking scenarios are defined in the conf/benchmark_*.yaml files. By default, conf/benchmark_small.yaml is used.
To run the benchmarks with a different configuration file, pass it with the --bench-config option.
export BENCH_DATA_ROOT=/tmp/polars-bio-bench/
poetry run python src/run-benchmarks.py --help
INFO:polars_bio:Creating BioSessionContext
Usage: run-benchmarks.py [OPTIONS]

Options:
  --bench-config TEXT  Benchmark config file (default:
                       conf/benchmark_small.yaml)
  --help               Show this message and exit.
For the e2e test suite (benchmark-e2e-overlap), additionally set:
export POLARS_MAX_THREADS=1
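The environment variables and the --bench-config option above can also be assembled from Python, which may be convenient when scripting several runs. The following is a minimal sketch, not part of the repository: the helper names (build_command, build_env, run_benchmark) are hypothetical, and detecting the e2e suite by looking for "benchmark-e2e-overlap" in the config path is an assumption made for illustration.

```python
import os
import subprocess


def build_command(bench_config=None):
    """Return the argv list for one benchmark run (hypothetical helper)."""
    cmd = ["poetry", "run", "python", "src/run-benchmarks.py"]
    if bench_config is not None:
        # Without this option the script falls back to its default config.
        cmd += ["--bench-config", bench_config]
    return cmd


def build_env(data_root, single_thread=False, base_env=None):
    """Return a copy of the environment with the benchmark variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["BENCH_DATA_ROOT"] = data_root
    if single_thread:
        # Required for the e2e test suite according to the text above.
        env["POLARS_MAX_THREADS"] = "1"
    return env


def run_benchmark(bench_config=None, data_root="/tmp/polars-bio-bench/"):
    """Run one benchmark; raises CalledProcessError on a non-zero exit."""
    # Assumption: the e2e suite is recognised by its file name.
    single_thread = (
        bench_config is not None and "benchmark-e2e-overlap" in bench_config
    )
    env = build_env(data_root, single_thread=single_thread)
    return subprocess.run(build_command(bench_config), env=env, check=True)
```

Calling run_benchmark("conf/benchmark_large.yaml") from the repository root would then correspond to exporting BENCH_DATA_ROOT and passing --bench-config on the command line.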
- conf/benchmark_small.yaml - small dataset, small number of operations for nearest and overlap, native DataFusion input
- conf/benchmark_dataframes.yaml - as above but with DataFrames (Polars/Pandas) as input
- conf/benchmark_large.yaml - large dataset, large number of operations for nearest and overlap, native DataFusion input
- conf/benchmark_parallel.yaml - comparison of parallel operations for pyranges0 and polars_bio with bioframe as a baseline
- conf/benchmark_count_overlaps.yaml - comparison of count overlaps operation for pyranges{0,1} and polars_bio with bioframe as a baseline
- conf/benchmark_merge.yaml - comparison of merge operation for pyranges{0,1} and polars_bio with bioframe as a baseline
- conf/benchmark_coverage.yaml - comparison of coverage operation for pyranges{0,1} and polars_bio with bioframe as a baseline
- conf/paper/benchmark-e2e-overlap.yaml - end-to-end benchmark for overlap operation with writing results to a CSV file (1-2 and 8-7 datasets)
- conf/paper/benchmark-4ops-1-2.yaml - overlap, nearest, count_overlaps and coverage operations for 1-2 datasets
- conf/paper/benchmark-4ops-8-7.yaml - as above but for 8-7 datasets
- conf/paper/benchmark-4ops-8-7-polars-bio-parallel.yaml - as above but polars_bio only and with parallel operations (1, 2, 4, 6, 8 threads)
- conf/paper/benchmark-read_vcf.yaml - read VCF file with polars_bio and 1, 2, 4, 6, 8 threads
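Since the scenario files follow the naming patterns listed above, they can be enumerated with a glob instead of being typed by hand. This is a small sketch under that assumption; list_bench_configs is a hypothetical helper, and the two patterns (benchmark_*.yaml at the top level, benchmark-*.yaml under paper/) are inferred from the file names in this README rather than from the benchmark code.

```python
from pathlib import Path


def list_bench_configs(conf_dir="conf"):
    """Return benchmark config paths found under conf_dir, sorted by name."""
    root = Path(conf_dir)
    # Top-level scenarios use an underscore: conf/benchmark_*.yaml
    top_level = sorted(root.glob("benchmark_*.yaml"))
    # Paper scenarios use a hyphen: conf/paper/benchmark-*.yaml
    paper = sorted((root / "paper").glob("benchmark-*.yaml"))
    return [p.as_posix() for p in top_level + paper]
```

Each returned path can be passed to the --bench-config option of src/run-benchmarks.py.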
Example of running the memory profiler for polars_bio with the 1-2 dataset:
PROF_FILE="polars_bio_1-2.dat"
mprof run --output "$PROF_FILE" python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-overlap.yaml --tool polars_bio --test-case 1-2
mprof plot "$PROF_FILE"
To profile every tool on the 8-7 dataset:
for tool in "polars_bio" "polars_bio_streaming" "bioframe" "pyranges0" "pyranges1"; do
  for test_case in "8-7"; do
    PROF_FILE="${tool}_${test_case}.dat"
    mprof run --output "$PROF_FILE" python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-overlap.yaml --tool "$tool" --test-case "$test_case"
  done
done
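The same tool and test-case matrix can be generated from Python, for example to log or review the commands before launching them. The sketch below only builds the argument lists; mprof_commands is a hypothetical helper, and the flags mirror the shell commands in this README.

```python
from itertools import product

PROFILER_SCRIPT = "src/run-memory-profiler.py"
E2E_CONFIG = "conf/paper/benchmark-e2e-overlap.yaml"


def mprof_commands(tools, test_cases, bench_config=E2E_CONFIG):
    """Build one `mprof run` argv list per (tool, test case) pair."""
    commands = []
    for tool, test_case in product(tools, test_cases):
        # Same output naming as the shell loop: <tool>_<test_case>.dat
        output = f"{tool}_{test_case}.dat"
        commands.append([
            "mprof", "run", "--output", output,
            "python", PROFILER_SCRIPT,
            "--bench-config", bench_config,
            "--tool", tool,
            "--test-case", test_case,
        ])
    return commands
```

Each list can be handed to subprocess.run(cmd, check=True) from the repository root, after which mprof plot can be run on the resulting .dat file.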