Skip to content

AnacletoLAB/parSMURF-NG

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

parSMURF-NG

This package contains parSMURF-NG, a High Performance Computing imbalance-aware machine learning tool for the genome-wide detection of pathogenic variants.


Table of Contents

Overview
Requirements
Downloading and compiling
General architecture
Running parSMURF-NG
	Command line options
    Running parSMURF-NG on a single machine
    Running parSMURF-NG on a cluster
	Running the Bayesian optimizer
	Configuration file
		name
		exec
		data
		simulate
		folds
		params
		autogp_params
Data Format
	Data file format
	Label file format
	Fold file format
	Output file format
Random dataset generation
Examples
License

Overview

parSMURF-NG is a fast and scalable C++ implementation of the HyperSMURF algorithm - hyper-ensemble of SMOTE Undersampled Random Forests - an ensemble approach explicitly designed to deal with the huge imbalance between deleterious and neutral variants.

The algorithm is outlined in the following papers:
A. Petrini, M. Mesiti, M. Schubach, M. Frasca, D. Danis, M. Re, G.Grossi, L. Cappelletti, T. Castrignanò, P. N. Robinson, and G. Valentini, "parSMURF, a high-performance computing tool for the genome-wide detection of pathogenic variants", GigaScience, vol. 9, 05 2020. giaa052. https://doi.org/10.1093/gigascience/giaa052

Schubach, Matteo Re, Peter N. Robinson & Giorgio Valentini, "Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants", Scientific Reports, 2017/06/07
https://www.nature.com/articles/s41598-017-03011-5

parSMURF-NG is a more stable, faster, more scalable and up-to-date version of parSMURF https://github.com/AnacletoLAB/parSMURF . Also, usability has been improved.


Requirements

parSMURF-NG is designed for x86-64 and Intel Xeon Phi architectures running Linux OSes.
This software is distributed as source code.

A compilier which supports the C++11 language specification is required. It has been tested with GCC (vers. >= 5) and Intel CC (2015, 2017 and 2019).
Code is also optimized for Intel XeonPhi architectures, and it has been successfully tested on Knights Landing family processors.

parSMURF-NG is suitable for running on several architechtures, from a single workstation to an High Performance Computing system. parSMURF-NG requires an implementation of the MPI standard, as most of the I/O functions are managed through that. It has been tested with OpenMPI 1.10.3, OpenMPI 2.0, IntelMPI 2016, IntelMPI 2017 and IntelMPI 2019.

On Ubuntu, it is possible to install the OpenMPI library via apt package manager:

sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev

Makefiles are generated by the cmake (vers. >= 3.0.2) utility. On Ubuntu it is possible to install this package via apt:

sudo apt-get install cmake

Bayesian Optimization is done by the scikit-optimize package. This package require python3 and it depends on several Python packages. The best way to use this feature is by creating and configuring a Python virtual environment and installing the required Python packages there. On Ubuntu:

sudo apt-get install virtualenv
<move to an appropriate folder>
virtualenv parSMURFvenv -p /usr/bin/python3     # This command creates a parSMURFvenv directory
source parSMURFvenv/bin/activate		        # This command activates the virtual environment
pip install scikit-optimize    			        # The following commands install the required packages in the virtual environment
deactivate					                    # Deactivate the virtual environment

parSMURF uses several external libraries that are included ad source code in this repository or are automatically downloaded and compiled. In particular, the following libraries are included:

  • ANN: A Library for Approximate Nearest Neighbor Searching, by David M. Mount and Sunil Arya, Version 1.1.2. The modified version is supplied in the src/ann_1.1.2 directory. This version has been adapted for multi-thread execution, since the original package available at https://www.cs.umd.edu/~mount/ANN/ is not thread safe and is not compatible with this package.
  • Ranger: A Fast Implementation of Random Forests, by Marvin N. Wright, Version 0.11.2. The modified version, stripped from the R code, is supplied in the src/ranger directory. The main codebase is located at https://github.com/imbs-hl/ranger

The following libraries are not included in this code repository, but are automatically downloaded during the compilation process:

  • easylogging++: A single header C++ logging library, by Zuhd Web Services. Automatically cloned in src/easyloggingpp and compiled from https://github.com/zuhd-org/easyloggingpp
  • jsoncons: A C++, header-only library for constructing JSON and JSON-like text and binary data formats, by Daniel Parker. Automatically cloned in src/jsoncons and compiled from https://github.com/danielaparker/jsoncons
  • zlib: A massively spiffy yet delicately unobtrusive compression library, by Jean-loup Gailly and Mark Adler. Autmatically cloned from in src/zlib and compiled from https://github.com/madler/zlib

The following library is required, but it must manually installed:

All the libraries have been modified and redistributed according to their own licenses. For each included library, a copy of the associated license is contained in each library folder.


Downloading and compiling

Download the latest version from this page or clone the git repository altogether:

git clone https://github.com/Topopiccione/parSMURF-NG

Once the package has been downloaded, move to the main directory, create a build dir, invoke cmake and build the software ("-j n" make option enables multithread compilation over n threads):

cd parSMURF-NG
mkdir build
cd build
cmake ../src
make -j 4

This should generate the following executables:

  • parSMURFng
  • data2bin
  • datasetGen

General architecture

parSMURF-NG is a complete rewrite of the original parSMURF software, although sharing the same core concepts and algorithm. The novelties of this package resides in a different execution model: I decided to abandon the master-worker design of parSMURF in favor of a more scalable masterless design, where each MPI process is almost independent and processes a subset of the input dataset. The only major coordination task occurs at the end of the computation, as results are gathered on MPI rank 0.

Thanks to this new design, task parallelization occurs at a finer level, always in a hierarchical way.

The other major improvements resides in a total rewrite of all the tasks involving data I/O, as now everything is managed by the MPI library. Data throughput when reading the dataset is notably increased.

Finally, now it is possible to specify both cross-validation OR hold-out, when designing and running an experiment.

parSMURF-NG features two subsystems for the automatic fine tuning of the learning parameters, aimed to maximize the prediction performances of the algorithm. The first strategy is by performing an exhaustive grid search: given a set of values for each hyper-parameter, the resulting set of all the possible combinations of hyper-parameters is calculated, and each combination evaluated through internal cross validation. The other strategy is by Bayesian optimization: given a range for each hyper-parameter, the Bayesian optimizer generate a sequence of possible candidates whose sequence tends to a probable global maximum. An high level of the execution is given by this pseudo-code snippet:

iter = 0
- while (iter < maxIter) and (error > tolerance):
-- BO generates a new possible candidate of hyper-parameters h
-- evaluation of h in a context of internal cross validation
-- submit (h, AUPRC(h)) to the BO
-- iter <- iter + 1

Both strategies are performed in a context of internal cross validation / hold-out.

Running parSMURF-NG

parSMURF-NG is a command line executable.
All the options are submitted to the main executable through configuration file written in json format.

Command line options

Only two command line options are available, since every other parameter or option is defined by json configuration files.
--cfg <filename> specifies the configuration file for the run
--help prints a brief help screen

Running parSMURF-NG on a single machine

parSMURF-NG does not require anything special to run, besides a proper configuration file. Hence, it can be launched as following:

./parSMURF-NG --cfg <configFile.json>

Running parSMURF-NG on a cluster

parSMURF-NG requires MPI to be installed on the target system or in all the nodes of a cluster. It must be invoked with mpirun or, depending on the scheduling system installed on the cluster, with a proper mpirun wrapper.
The -n option of mpirun also specifies how many processes have to be launched. As an example:

mpirun -n 5 ./parSMURF-NG --cfg <configFile.json>

launches an instance of parSMURFn over 5 processes.\

Running the Bayesian optimizer

Running the Bayesian optimizer is notably improved w.r.t. parSMURF1 and parSMURFn. If scikit-optimize python package has been globally installed, no further operations are required and parSMURF-NG can be run as above.
If it has been installed in a virtual environment or conda, just activate the virtual env before running parSMURF-NG


Configuration file

parSMURF-NG uses configuration files in json format for setting the parameters of each run.
Examples of configuration files are available in the cfgEx folder of the repository.

A configuration file is composed by seven dictionaries:

{
	"name": ...,
	"exec": {...},
	"data": {...},
	"folds": {...},
	"simulate": {...},
	"params": {...}
}

Depending on the configuration itself, some dictionaries are not mandatory and can be left out.

"name"
	"name": string

Mandatory: no
A string for labeling the name of the experiment

"exec"
	"exec": {
		"name": string,
		"nProcs": int,
		"ensThrd": int,
		"rfThrd": int,
		"noMtSender": bool,
		"seed": int,
		"verboseLevel": int,
		"verboseMPI": bool,
		"saveTime": bool,
		"timeFile": string,
		"printCfg": bool,
		"mode": string
    },

Mandatory: yes
General configuration of the run.

	"name": string

Mandatory: No
Label used for marking the name of the executable (legacy from parSMURF1 and parSMURFn). It does not affect the computation itself, since this field is ignored by the json parser

	"nProcs": int

Mandatory: No
Label used for marking the number of processes for a run of parSMURFn. It does not affect the computation itself, since the total number of processes is detected at runtime by the MPI APIs.

	"ensThrd": int

Mandatory: Yes
Number of threads assigned to perform the partition processing.

    "oversmpThrd"

Mandatory: Yes
Number of threads assigned to perform data oversampling.

	"rfThrd": int

Mandatory: Yes
Number of threads assigned to perform the random forest train and test.

	"noMtSender": bool

Mandatory: No
Legacy, ignored in parSMURF-NG

	"seed": int

Mandatory: No
Optional seed for the random number generators. If unspecified, a random seed is generated.

	"verboseLevel": int

Mandatory: No
Level of verbosity on stdout and on the logfile of the computational task. Range is 0-3 (default: 0).

	"verboseMPI": bool

Mandatory: No
Verbose on stdout and logfile the calls to MPI APIs. (Default: false)

	"saveTime": bool

Mandatory: No
Option for saving a report of the computation time of the run (Default: false)

	"timeFile": string

Mandatory: Yes, if "saveTime" is set to true
File name for saving the execution time report

	"cacheSize": string

Mandatory: Yes
This parameter specifies the cache size for data storage. Reserved for a future extension of the software to limit the data required for storing the dataset and allow parSMURF-NG to dinamically load the required data from disk. As this feature is not present, cache size must be equal or larger than the size of the input dataset (excluding the label and fold files).
Example: "cacheSize": "4.2G"

	"printCfg": bool

Mandatory: No
Option for printing a detailed description of the run before it starts (Default: false)

	"mode": string

Mandatory: Yes
Execution mode. Allowed strings are:
"cv": Dataset is splitted in folds, and evaluated in a process of k-fold cross validation. The run returns a set of predictions (default).
"train": The whole dataset is treated as training set. The run returns a folder of trained models for later usage.
"test": The whole dataset is treated as test set. It is mandatory to submit a directory of trained models to perform the evaluation. The run returns a set of predictions.
Note that the autotuning of the learning parameters is available only for "cv" mode

	"optimizer": string

Mandatory: Yes
Execution mode. Allowed strings are:
"no": external cross-validation only (default)
"grid_cv": automatic tuning of the learning parameters by grid search in the internal cross validation loop
"autogp_cv": automatic tuning of the learning parameters by Bayesian optimization (Gaussian process) in the internal cross validation loop
"grid_ho": automatic tuning of the learning parameters by grid search in the internal hold-out
"autogp_ho": automatic tuning of the learning parameters by Bayesian optimization (Gaussian process) in the internal hold-out

"data"
"data": {
	"dataFile": string
	"foldFile": string
	"labelFile": string
	"outFile": string
	"forestDir": string
}

Mandatory: yes
This field contains all the required information for accessing data from and to the system.

	"dataFile": string

Mandatory: Yes (No if simulation mode is enabled)
Input data file. Must be either a tsv/csv file or it binarized version generated with the data2bin utility.

	"foldFile": string

Mandatory: No
Optional input file containing the fold division of the dataset

	"labelFile": string

Mandatory: Yes (No, if simulation mode is enabled)
Input file containing the labels of the examples of the dataset

	"outFile": string

Mandatory: Yes (No, if in train mode)
Output file containing the output predictions

	"forestDir": string

Mandatory: No (Yes, if in train mode)
Output directory for saving the trained models. Must be a valid directory on the filesystem.

"simulate"
"simulate": {
	"simulation": bool,
	"prob": float,
	"n": int,
	"m": int
},

Mandatory: no
This field contains all the required information for enabling the internal dataset generator

	"simulation": bool

Mandatory: No
On true, it enables the internal dataset generator. The fields "dataFile", "foldFile" and "labelFile" are ignored and a random dataset is generated.

	"prob": float

Mandatory: Yes if simulation mode is enabled
This field represent the probability of generating a positive example. Must be a float in the [0,1] range, possibly very small for simulating highly unbalanced datasets

	"n": int

Mandatory: Yes if simulation mode is enabled
Number of examples to be generated

	"m": int

Mandatory: Yes if simulation mode is enabled
Number of features to be generated

"folds"
"folds": {
	"nFolds": int,
	"startingFold": int,
	"endingFold": int
}

Mandatory: Yes (No, if "foldFile" specified)
This section specified the fold subdivision and to which fold execute the run.

	"nFolds": int

Mandatory: Yes (No, if "foldFile" specified)
This field specifies in how many folds the dataset should be subdivided into. Ignored if "foldFile" has been declared.

	"startingFold": int,
	"endingFold": int

Mandatory: No
These fields specify the starting and ending fold that parSMURF have to evaluate. This is useful for parallelizing runs across different folds. If unspecified, parSMURF performs the evaluation of the predictions on all folds.

"params"
"params": {
	"nParts": array of int,
	"fp": array of int,
	"ratio": array of int,
	"k": array of int,
	"nTrees": array of int,
	"mtry": array of int
},

Mandatory: Yes
This field contains the learning parameters for the run. All values must be passed as arrays.
When "optimizer" is set to "no", only one combination is used for the run.
When "optimizer" is set to "grid", parSMURF-NG generates all the possible hyper-parameter combinations and evaluate them in the internla CV / HO loop.
For a deeper explanation of each parameter, please refer to the article

	"nParts": array of int

Mandatory: Yes
Number of partitions (ensembles)
Default: 10

	"fp": array of int

Mandatory: Yes
Over-sampling factor (0 disables over-sampling)
Default: 1

	"ratio": array of int

Mandatory: Yes
Under-sampling factor (0 disables under-sampling)
Defaul: 1

	"k": array of int

Mandatory: Yes
Number of the nearest neighbors for SMOTE oversampling of the minority class
Default: 5

	"nTrees": array of int

Mandatory: Yes
Number of trees in each ensemble
Default: 10

	"mtry": array of int

Mandatory: Yes
mtry random forest parameter
Default: sqrt(m)


Data format

As previously stated, data is provided to the application in two or three files.

Data file

This file should contain the main data needed for computing the predictions. It consists in an n x m matrix of double, where n is the number of examples and m the features. The matrix is read row-wise, i.e. :

   | m1   m2   m3   m4 ...
---------------------------
n1 | -------->
n2 |
n3 |
n4 |                          ( 1 )
.  |
.  |
.  |

Each value must be separated by space. Each sample n should be contained in a single new-line separated line.
Empty fields are not allowed.
The file must not contain an header.
The number of features is automatically detected.\

For faster load time, it is possible to convert the input file to a binarized format with the data2bin utility (MORE INFO SOON)

Label file

This file should contain the labelling of the examples. It consists in n space or new-line separated values, where n is the number of examples.
It is a plain text file where each positive example is marked with "1" and negative examples with "0".

Fold file

This optional file should contain the fold sub division. If specified, examples will be divided in folds as specified in this file. If not, a random stratified division will be performed. This file consists in n space or new-line separated integer values, where n is the number of examples.
It is a plain text file where each number represents the fold to which each example is assigned. Fold numbering starts from "0" (zero). Note that specifying the fold file name overrides the "nFolds" option in the ocnfiguration file.

Output file

Predictions will be saved as plain text file.
The output file consists of two lines of n space separated double values, where n is the number of examples.
Each value in the first line represents the probability of the associated example to be in the minority class, while each value in the second line, the probability to be in the majority class.

Note about dimensionality:
When reading data from file, parSMURF1 and parSMURFn automatically detect the number of examples and features, following these rules:

  • at first, the number of examples is detected from the label file.
  • then, the number of features is detected from the data file and calculated as number of (elements read in the data file) / (number of labels). Hence, the sizes of these files should be consistent, otherwise a warning message is printed to the console.
    Also, the number of folds is detected from the fold file if specified. In this case, the option "nFolds" in the configuration file is ignored, and the total number of folds will be equal to the number of the total unique elements of the fold file.

Random dataset generation

parSMURF-NG is provided with a random dataset generator for testing purposes.
When enabled, a random dataset will be created according to two normal distribution having the same variance but different average value, depending if an example falls in the positive or negative class.
The user enables this mode by using the "simulate: true" option in the configuration file.
The user is also forced to specify the the probability that an example belongs to the minority class ("prob": float) and dimensionality of the dataset with the "n": float and "m": int options.
An additional line will be added to the output file, containing the labelling that has been randomly generated according to the "prob" value.


Examples

Will be added...


License

This package is distributed under the GNU GPLv3 license. Please see the http://github.com/topopiccione/parSMURF-NG/COPYING file for the complete version of the license.

parSMURF-NG includes several third-party libraries which are distributed with their own license. In particular, source code of the following libraries is included in this package:

ANN: Approximate Nearest Neighbor Searching
David M. Mount and Sunil Arya
Version 1.1.2
(https://www.cs.umd.edu/~mount/ANN/)\ Modified and redistributed under the GNU Lesser Public License v2.1
Copy of the license is available in the src/ann_1.1.2 directory

Ranger: A Fast Implementation of Random Forests
Marvin N. Wright
Version 0.11.1
(https://github.com/imbs-hl/ranger)\ Modified and redistributed under the MIT license
Copy of the license is available in the src/ranger folder

Also, this software relies on several libraries whose source code is not included in the package, but it is automatically downloaded at compile time. These libraries are:

Easylogging++
Zuhd Web Services
(https://github.com/zuhd-org/easyloggingpp)\ Distributed under the MIT license
Copy of the license is available at the project homepage

Jsoncons
Daniel Parker
(https://github.com/danielaparker/jsoncons)\ Distributed under the Boost license
Copy of the license is available at the project homepage

zlib
Jean-loup Gailly and Mark Adler
(https://github.com/madler/zlib)\ Distributed under the zlib license
Copy of the license is available at the project homepage

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 94.8%
  • C 2.3%
  • Makefile 1.8%
  • Other 1.1%