A simulation framework for Bayesian models based on brms.
The main function is `full_simulation`, with the main arguments being `data_gen_confs`, `data_gen_fun`, `fit_confs` and `metrics`.
Bayesim will generate datasets by passing `data_gen_confs` rows to `data_gen_fun` and fit each model defined by `fit_confs` on each generated dataset. It then calculates all of the defined `metrics` for each model. This is, as of now, done in a fully crossed fashion.
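To make the fully crossed design concrete, the following sketch counts how many model fits a run would need. Both dataframes are toy placeholders reduced to the columns needed for the arithmetic. The family and link values are illustrative assumptions, not complete or verified configurations.

```r
# Toy configurations (hypothetical values, reduced to the relevant columns).
data_gen_confs <- data.frame(
  id = c("conf_a", "conf_b"),
  dataset_N = c(10, 20)
)
fit_confs <- data.frame(
  fit_family = c("gaussian", "student"),
  fit_link = c("identity", "identity")
)

# Fully crossed: every fit configuration is applied to every dataset.
n_datasets <- sum(data_gen_confs$dataset_N)
n_fits <- n_datasets * nrow(fit_confs) # (10 + 20) * 2 = 60

# The crossing of configuration rows can be written as a cross join.
crossed <- merge(data_gen_confs, fit_confs, by = NULL)
```

The number of fits grows multiplicatively, so it is worth computing this before starting a long run.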
Data simulation consists of two parts: a `data_gen_fun` function and a `data_gen_confs` dataframe. Bayesim will feed each row of `data_gen_confs` into `data_gen_fun` to generate each individual dataset.
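The row-to-function mechanism can be illustrated in base R. This is a minimal sketch of the idea, not Bayesim's internal code: `toy_dgp` and `toy_confs` are hypothetical names, and the use of `do.call` is an assumption about how such a dispatch could be written.

```r
# Hypothetical data generating function with the documented return structure.
toy_dgp <- function(data_N, mean = 0, ...) {
  list(
    dataset = data.frame(y = rnorm(n = data_N, mean = mean, sd = 1)),
    testing_data = NULL,
    data_gen_output = list(data_N = data_N, mean = mean)
  )
}

# Each row holds the arguments for one call of the function.
toy_confs <- data.frame(data_N = c(5, 10), mean = c(0, 1))

# Turn each row into a named argument list and call the function with it.
datasets <- lapply(seq_len(nrow(toy_confs)), function(i) {
  do.call(toy_dgp, as.list(toy_confs[i, , drop = FALSE]))
})
```

Because the column names become argument names, every column in the configuration dataframe needs a matching argument (or `...`) in the function.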
The only strictly necessary columns in `data_gen_confs` are:
- `dataset_N`, the number of datasets that should be simulated per configuration. This is what Bayesim parallelizes over.
- `id`, a unique identifier string that is used to save the results per configuration row.
- `vars_of_interest`, if you want metrics calculated for individual parameters. It should be a list of variable names used in `data_gen_fun`.
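A `data_gen_confs` dataframe with these columns could be built as follows. This is a sketch under assumptions: the extra columns (`data_N`, `data_link`, `data_family`, `mean`) mirror the arguments of the `constant_linpred_dgp` example in this document, and the family and link values are placeholders that should be checked against the supported lookups.

```r
# Cross the settings that should vary (here: two sample sizes).
data_gen_confs <- expand.grid(
  data_N = c(50, 100),
  data_link = "identity",   # placeholder value
  data_family = "gaussian", # placeholder value
  mean = 0,
  dataset_N = 10,           # datasets simulated per configuration row
  stringsAsFactors = FALSE
)

# One unique identifier string per configuration row.
data_gen_confs$id <- paste0("conf_", seq_len(nrow(data_gen_confs)))

# vars_of_interest is stored as a list column: one list of names per row.
data_gen_confs$vars_of_interest <- rep(
  list(list("mu")),
  nrow(data_gen_confs)
)
```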
`data_gen_fun` should output a named list that contains the following parts:
- `dataset`, a dataframe that is fed into `brms::brm`.
- `testing_data`, a dataframe that is used as the `newdata` argument for certain out-of-sample metrics.
- `data_gen_output`, a named list that contains all other information that you want to export from the data generating function. This should usually include all the input arguments (see the example for how to get those easily) and a `references` list that contains the reference values for all `vars_of_interest` variables (again, see the example for how to easily get those).
An example is presented below:
    constant_linpred_dgp <- function(data_N,
                                     data_link,
                                     data_family,
                                     seed = NULL,
                                     testing_data = TRUE,
                                     vars_of_interest = list("mu"),
                                     mean = 0,
                                     ...) {
      # Collect all input arguments so they can be exported with the results.
      arguments <- as.list(c(as.list(environment()), list(...)))
      arguments$seed <- NULL
      if (!is.null(seed)) {
        set.seed(seed)
      }
      # Simulate twice as many rows if a held-out test set is requested.
      if (testing_data) {
        data_gen_size <- data_N * 2
      } else {
        data_gen_size <- data_N
      }
      mu <- rnorm(n = 1, mean = mean, sd = 1)
      y <- rnorm(n = data_gen_size, mean = mu, sd = 1)
      dataset <- data.frame(y = y)
      # This creates a list of values for each of the vars_of_interest.
      arguments$references <- lapply(
        unlist(vars_of_interest),
        function(x) get(x)
      )
      data_gen_output <- list(
        # Anything in addition to the function arguments you want to save about
        # the data generation process, e.g. if you are resampling, the number
        # of invalid samples.
      )
      data_gen_output <- c(data_gen_output, arguments)
      if (testing_data) {
        # drop = FALSE keeps the single-column subsets as dataframes.
        return(
          list(
            dataset = dataset[1:data_N, , drop = FALSE],
            testing_data = dataset[(data_N + 1):data_gen_size, , drop = FALSE],
            data_gen_output = data_gen_output
          )
        )
      } else {
        return(
          list(
            dataset = dataset,
            testing_data = NULL,
            data_gen_output = data_gen_output
          )
        )
      }
    }
Fit configurations currently are dataframes with the following columns:
- `fit_family`, see brms_family_lookup for supported families.
- `fit_link`, see link_lookup for supported links.
- `formula`, a string that allows conversion via `brms::brmsformula`.
- `prior`, gets passed to `brms::brm` directly.
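A fit configuration dataframe with these four columns might look like the sketch below. The family and link strings are placeholders to be checked against brms_family_lookup and link_lookup. Storing `NULL` priors in a list column is an assumption about how the `prior` column can be filled: `brms::brm` uses its default priors when `prior = NULL`, but whether Bayesim accepts this exact layout is not confirmed here.

```r
# Cross two candidate families with one link and one formula.
fit_confs <- expand.grid(
  fit_family = c("gaussian", "student"), # placeholder values
  fit_link = "identity",                 # placeholder value
  formula = "y ~ 1",                     # string, to be converted later
  stringsAsFactors = FALSE
)

# Assumption: a list column holding one prior specification per row.
# NULL means "use the brms defaults" when passed to brms::brm.
fit_confs$prior <- rep(list(NULL), nrow(fit_confs))

# Sanity check in base R that every formula string parses as a formula.
formulas_parse <- vapply(
  fit_confs$formula,
  function(f) inherits(stats::as.formula(f), "formula"),
  logical(1)
)
```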
Metrics are defined via a list of string identifiers. The supported metrics are:
"v_mean"
"v_sd"
"v_median"
"v_mad"
"v_pos_prob"
"v_quantiles"
"v_bias"
"v_rmse"
"v_mae"
"v_mse"
"v_true_percentile"
"divergent_transitions_rel"
"divergent_transitions_abs"
"rstar"
"bad_pareto_ks"
"pareto_k_values"
"time_per_sample"
"rhat"
"ess_bulk"
"ess_tail"
"elpd_loo"
"elpd_loo_pointwise"
"elpd_loo_pointwise_summary"
"elpd_test"
"elpd_test_pointwise_summary"
"rmse_loo"
"rmse_loo_pointwise"
"rmse_loo_pointwise_summary"
"rmse_test"
"rmse_test_pointwise_summary"
"r2_loo"
"r2_loo_pointwise"
"r2_loo_pointwise_summary"
"r2_test"
"r2_test_pointwise_summary"
"log_lik_pointwise"
"log_lik_summary"
"ppred_summary_y_scaled"
"ppred_pointwise"
"residuals"
"posterior_linpred"
"posterior_linpred_transformed"
"y_pointwise"
"y_pointwise_z_scaled"
"y_summaries"
"data_gen"
"fit_gen"
Or see metric_lookup for all currently implemented metrics.
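As a small illustration, a metrics selection is just a list of the string identifiers above. Which combination is sensible depends on the study. The grouping comments below are an informal reading of the identifier names, not documented categories.

```r
# A hypothetical selection of metrics, using identifiers from the list above.
metrics <- list(
  "v_mean", "v_sd", "v_rmse",           # summaries per variable of interest
  "rhat", "ess_bulk", "ess_tail",       # sampling diagnostics
  "divergent_transitions_rel",
  "elpd_loo", "rmse_loo"                # predictive performance
)

# Guard against accidental duplicates before starting a long simulation.
has_duplicates <- anyDuplicated(unlist(metrics)) > 0
```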
- `seed`, sets a seed that will result in the rest of the simulation happening deterministically, conditional on the seed. Allows for reproduction of individual results or the entire simulation run later on.
- `result_path = "./"`, the path where the result .RDS files should be saved.
- `debug = FALSE`, `TRUE` will save all intermediate results as .RDS files in the `result_path` directory to support debugging.
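The idea of a run being deterministic conditional on one seed can be sketched in base R. This illustrates the general technique of deriving per-dataset seeds from a master seed. It is an assumption for illustration and not a description of how Bayesim handles seeds internally. `derive_seeds` is a hypothetical helper.

```r
# Derive n per-dataset seeds from a single master seed.
derive_seeds <- function(seed, n) {
  set.seed(seed)
  sample.int(.Machine$integer.max, size = n)
}

# The same master seed always yields the same per-dataset seeds,
# so any single dataset can be regenerated later from its own seed.
seeds_run_1 <- derive_seeds(seed = 1234, n = 5)
seeds_run_2 <- derive_seeds(seed = 1234, n = 5)
same_seeds <- identical(seeds_run_1, seeds_run_2)
```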
`stan_pars` should be a named list that contains the following arguments:
- `warmup`, directly passed to `brms::brm`.
- `iter`, directly passed to `brms::brm`.
- `chains`, directly passed to `brms::brm`.
- `init`, directly passed to `brms::brm`.
- `backend = "rstan"`, directly passed to `brms::brm`. We recommend rstan due to instabilities of cmdstanr on clusters.
- `cmdstan_path`, useful when working with cmdstan on a computing cluster where cmdstan might not be installed in the default location. Use the path to the main directory, e.g. `"~/.cmdstan/cmdstan-2.29.2"`.
- `cmdstan_write_path`, directory for cmdstan to write compiled model files to. This should not be a temporary directory, as those might get cleaned up during the simulation run.
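A `stan_pars` list covering these entries could look like the following sketch. The numeric values are arbitrary examples. Setting the two cmdstan entries to `NULL` when using the rstan backend is an assumption, not documented behaviour.

```r
stan_pars <- list(
  warmup = 500,              # warmup iterations per chain
  iter = 1500,               # total iterations per chain, including warmup
  chains = 2,
  init = "random",           # let Stan draw the initial values
  backend = "rstan",
  cmdstan_path = NULL,       # assumption: unused with the rstan backend
  cmdstan_write_path = NULL  # assumption: unused with the rstan backend
)

# Post-warmup draws available across all chains.
n_draws <- (stan_pars$iter - stan_pars$warmup) * stan_pars$chains
```

Since `iter` includes warmup, it has to be larger than `warmup` for any posterior draws to remain.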
- `ncores_simulation = 1`, if set to more than `1`, Bayesim will parallelize across datasets within each row of `data_gen_confs` using the specified number of processes.
- `cluster_type = "PSOCK"`, defines the type of cluster used by the `parallel` package. Windows requires `PSOCK`; however, `FORK` can save quite some time because of the repeated cluster setup.
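The two cluster types come from the base R `parallel` package. The sketch below shows how such a cluster is created and used in general, with a stand-in task in place of fitting models. It is not Bayesim's internal code, and picking the type from the operating system is just one possible convention.

```r
library(parallel)

# FORK is not available on Windows, so fall back to PSOCK there.
cluster_type <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"

# Start two worker processes (analogous to ncores_simulation = 2).
cl <- makeCluster(2, type = cluster_type)

# Give the workers reproducible, independent random number streams.
clusterSetRNGStream(cl, iseed = 1234)

# Stand-in for "simulate and fit one dataset": four independent tasks.
dataset_means <- parLapply(cl, 1:4, function(i) mean(rnorm(20)))

# Always release the worker processes when done.
stopCluster(cl)
```

A PSOCK cluster starts fresh R sessions, so packages and objects have to be loaded or exported to the workers explicitly, while FORK workers inherit the state of the parent process.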
Bayesim has been used in the following projects: