This repository contains the code to reproduce the results of the paper "Analysis of Uncertainty of Neural Fingerprint-based Models" (under review).
Estimating the uncertainty of model predictions is crucial in a wide range of cheminformatics applications, not only to better understand machine learning models but also to establish trust in deployed models. Uncertainty estimates for many standard machine learning models, such as Random Forest, are well studied. However, their predictive performance can be inferior to that of deep learning models, such as graph neural networks (GNNs). We investigated whether the neural fingerprint extracted from a GNN can be combined with classical machine learning models to achieve both good prediction performance and reliable uncertainty estimates.
The experiments are managed using DVC, where each step is specified in the dvc.yaml file.
Running the pipeline will create a dvc.lock file, which contains the hashes of the scripts, input files, and output files, ensuring that the results originate from the provided code and data.
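As a rough illustration of how a recorded hash can be checked by hand, the sketch below computes the MD5 digest of a file and compares it with a value copied from dvc.lock. It assumes that the recorded value is a plain MD5 digest of the file bytes, which may not hold for every DVC version or for tracked directories. The helper names `file_md5` and `matches_recorded_hash` are hypothetical and not part of this repository or of DVC.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path: Path, recorded_md5: str) -> bool:
    """Compare a file's digest with a hash copied by hand from dvc.lock.

    Assumption: the recorded value is the MD5 of the raw file bytes.
    """
    return file_md5(path) == recorded_md5.strip().lower()
```

In practice, `dvc status` performs this kind of comparison for every stage, so a manual check like this is only useful for spot-checking a single file.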
The cache is available from this repository as a tarball.
The following sections describe how to set up the project and reproduce the results.
- Clone the repository
git clone https://github.com/basf/neural-fingerprint-uncertainty.git
cd neural-fingerprint-uncertainty
- Install the requirements
pip install -r requirements.txt
- Extract the DVC cache
tar -xf dvc_cache.tar.gz .dvc/
- Pull the data
dvc pull
- Reproduce the results
dvc repro
The generated results, figures, and other outputs are saved in the data folder.
The molecular standardization is performed using molpipeline. Details of the standardization are provided in the 01_preprocess_data.py script.
The data is split into 5 folds using the StratifiedKFold method and the GroupKFold method, where the group is determined by Agglomerative Clustering.
The details of the fold creation are provided in the 02_assign_groups.py script.
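The two splitting strategies can be sketched with scikit-learn as follows. This is a minimal illustration on random placeholder data, not the code from 02_assign_groups.py: the feature matrix, the number of clusters, the clustering settings, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GroupKFold, StratifiedKFold

rng = np.random.default_rng(0)
# Placeholder data: 200 "molecules" with 64 binary features and binary labels.
features = rng.integers(0, 2, size=(200, 64)).astype(float)
labels = rng.integers(0, 2, size=200)

# Random split that preserves the class ratio in every fold.
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
random_folds = np.empty(len(labels), dtype=int)
for fold, (_, test_idx) in enumerate(stratified.split(features, labels)):
    random_folds[test_idx] = fold

# Cluster-based split: similar samples share a group label, and GroupKFold
# never places one group in more than one test fold.
# Assumption: 20 clusters with scikit-learn's default linkage settings.
groups = AgglomerativeClustering(n_clusters=20).fit_predict(features)
group_folds = np.empty(len(labels), dtype=int)
splitter = GroupKFold(n_splits=5)
for fold, (_, test_idx) in enumerate(splitter.split(features, labels, groups=groups)):
    group_folds[test_idx] = fold
```

Because whole clusters are held out together, the grouped split tests a model on samples that are less similar to the training data than in the stratified random split.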
The Morgan fingerprints are used to train the classical machine learning models. The details of the experiments are provided in the 03_ml_experiments.py script.
The neural fingerprints are extracted from a pre-trained Chemprop model. In addition to the neural fingerprints, the GNN is also used to predict the target values. The details of the experiments are provided in the 04_neural_fingerprint_predictions.py script.
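The idea of feeding neural fingerprints into a classical model can be sketched as below. The fingerprint matrix here is random placeholder data standing in for the vectors that would be extracted from the GNN, and the labels are synthetic. The choice of a Random Forest and all hyperparameters are assumptions for illustration; the actual setup is defined in 04_neural_fingerprint_predictions.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder for a neural fingerprint matrix (one row per molecule).
# Real fingerprints would be read out of the trained GNN instead.
neural_fp = rng.normal(size=(300, 32))
# Synthetic labels that depend on the first two fingerprint dimensions.
labels = (neural_fp[:, 0] + neural_fp[:, 1] > 0).astype(int)

fp_train, fp_test, y_train, y_test = train_test_split(
    neural_fp, labels, test_size=0.25, random_state=0, stratify=labels
)

# Classical model trained on the fingerprint vectors.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(fp_train, y_train)

# Column 1 holds the predicted probability of the positive class, which is
# the quantity whose reliability an uncertainty analysis would examine.
proba_positive = model.predict_proba(fp_test)[:, 1]
```

Any classifier exposing `predict_proba` could be substituted for the Random Forest in this sketch.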
The results of the experiments are visualized using matplotlib, and the plots are saved in the figures folder. The figures are only available after executing the commands provided [above](#commands-to-reproduce-the-results). The code for the plots is provided in the 05_create_plots.py script.
The plots used in the paper are saved in the final_figures folder. The code for the plots is provided in the 06_create_final_figures.py script.
The tables used in the paper were logged to the console and extracted directly from its output. A copy of the console output is provided in the file 07_create_final_tables.log (again, only available after executing the commands [above](#commands-to-reproduce-the-results)).
This software is licensed under the MIT license. See the LICENSE file for details.