bips-hb · mnwright · Feb 21, 2025 · Feb 18, 2025 · Feb 18, 2025 · Feb 18, 2025
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -54,6 +54,7 @@ Suggests:
     knitr,
     rmarkdown,
     tibble,
+    palmerpenguins,
     testthat (>= 3.0.0)
 Config/testthat/edition: 3
 VignetteBuilder: knitr
diff --git a/R/adversarial_rf.R b/R/adversarial_rf.R
@@ -14,7 +14,8 @@
 #' @param early_stop Terminate loop if performance fails to improve from one 
 #'   round to the next? 
 #' @param prune Impose \code{min_node_size} by pruning? 
-#' @param verbose Print discriminator accuracy after each round? Will also show additional warnings.
+#' @param verbose Print discriminator accuracy after each round? Will also show 
+#'   additional warnings.
 #' @param parallel Compute in parallel? Must register backend beforehand, e.g. 
 #'   via \code{doParallel} or \code{doFuture}; see examples.
 #' @param ... Extra parameters to be passed to \code{ranger}.
@@ -37,10 +38,11 @@
 #' trees for improved performance (typically on the order of 100-1000 depending 
 #' on sample size).
 #' 
-#' Integer variables are recoded with a warning. Default behavior is to convert
-#' those with six or more unique values to numeric, while those with up to five
-#' unique values are treated as ordered factors. To override this behavior, 
-#' explicitly recode integer variables to the target type prior to training.
+#' Integer variables are recoded with a warning (set \code{verbose = FALSE} to 
+#' silence these). Default behavior is to convert integer variables with six or
+#' more unique values to numeric, while those with up to five unique values are 
+#' treated as ordered factors. To override this behavior, explicitly recode 
+#' integer variables to the target type prior to training.
 #' 
 #' Note: convergence is not guaranteed in finite samples. The \code{max_iters} 
 #' argument sets an upper bound on the number of training rounds. Similar 
@@ -93,7 +95,8 @@
 #' }
 #' 
 #' @seealso
-#' \code{\link{arf}}, \code{\link{forde}}, \code{\link{forge}}, \code{\link{expct}}, \code{\link{lik}}
+#' \code{\link{arf}}, \code{\link{forde}}, \code{\link{forge}}, 
+#' \code{\link{expct}}, \code{\link{lik}}
 #' 
 #' @export
 #' @import ranger 
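
The integer recoding rule documented in this hunk can be sketched in base R. This is only an illustration of the stated default (six or more unique values become numeric, five or fewer become ordered factors). The helper name `recode_integers` and its `max_levels` argument are hypothetical and are not part of `arf`.

```r
# Hypothetical helper (not part of arf): mimic the documented default
# treatment of integer columns.
recode_integers <- function(df, max_levels = 5L) {
  for (j in names(df)) {
    if (is.integer(df[[j]])) {
      n_unique <- length(unique(df[[j]]))
      if (n_unique > max_levels) {
        # Six or more unique values: treat as numeric
        df[[j]] <- as.numeric(df[[j]])
      } else {
        # Up to five unique values: treat as ordered factor
        df[[j]] <- factor(df[[j]], ordered = TRUE)
      }
    }
  }
  df
}

toy <- data.frame(
  few  = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L),  # 3 unique values
  many = 1:8,                                # 8 unique values
  dbl  = seq(0.5, 4, by = 0.5)               # already numeric, left untouched
)
out <- recode_integers(toy)
sapply(out, function(col) class(col)[1])
```

To override the default, as the documentation suggests, recode the column yourself before training, e.g. `toy$few <- as.numeric(toy$few)`.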

diff --git a/R/arf-package.R b/R/arf-package.R
@@ -1,5 +1,6 @@
 #' @seealso
-#' \code{\link{adversarial_rf}}, \code{\link{forde}}, \code{\link{forge}}, \code{\link{expct}}, \code{\link{lik}}
+#' \code{\link{adversarial_rf}}, \code{\link{forde}}, \code{\link{forge}}, 
+#' \code{\link{expct}}, \code{\link{lik}}
 #' 
 #' Useful links:
 #' \itemize{

diff --git a/R/expct.R b/R/expct.R
@@ -6,40 +6,44 @@
 #' @param params Circuit parameters learned via \code{\link{forde}}. 
 #' @param query Optional character vector of variable names. Estimates will be
 #'   computed for each. If \code{NULL}, all variables other than those in 
-#'   \code{evidence} will be estimated. If evidence contains \code{NA}s, those
-#'   variables will be estimated and a full dataset is returned.
+#'   \code{evidence} will be estimated. If \code{evidence} contains \code{NA}s, 
+#'   those values will be imputed and a full dataset is returned.
 #' @param evidence Optional set of conditioning events. This can take one of 
 #'   three forms: (1) a partial sample, i.e. a single row of data with
 #'   some but not all columns; (2) a data frame of conditioning events, 
-#'   which allows for inequalities and intervals; or (3) a posterior distribution over leaves;
-#'   see Details and Examples.
-#' @param evidence_row_mode Interpretation of rows in multi-row evidence. If \code{'separate'},
-#'   each row in \code{evidence} is a separate conditioning event for which \code{n_synth} synthetic samples
-#'   are generated. If \code{'or'}, the rows are combined with a logical or; see Examples.
-#' @param round Round continuous variables to their respective maximum precision in the real data set?
+#'   which allows for inequalities and intervals; or (3) a posterior 
+#'   distribution over leaves. See Details and Examples.
+#' @param evidence_row_mode Interpretation of rows in multi-row evidence. If 
+#'   \code{"separate"}, each row in \code{evidence} is a unique conditioning 
+#'   event for which \code{n_synth} synthetic samples are generated. If 
+#'   \code{"or"}, the rows are combined with a logical OR. See Examples.
+#' @param round Round continuous variables to their respective maximum precision 
+#'   in the real data set?
 #' @param nomatch What to do if no leaf matches a condition in \code{evidence}?
-#'   Options are to force sampling from a random leaf, either with a warning (\code{"force_warning"})
-#'   or without a warning (\code{"force"}), or to return \code{NA}, also with a warning 
-#'   (\code{"na_warning"}) or without a warning (\code{"na"}). The default is \code{"force_warning"}.
-#' @param stepsize Stepsize defining number of evidence rows handled in one for each step.
-#'   Defaults to nrow(evidence)/num_registered_workers for \code{parallel == TRUE}.
+#'   Options are to force sampling from a random leaf (\code{"force_warning"},
+#'   \code{"force"}) or return \code{NA} (\code{"na_warning"}, \code{"na"}),
+#'   with or without a warning, respectively. Default: \code{"force_warning"}.
+#' @param stepsize How many rows of evidence should be handled at each step? 
+#'   Defaults to \code{nrow(evidence) / num_registered_workers} for 
+#'   \code{parallel == TRUE}.
 #' @param parallel Compute in parallel? Must register backend beforehand, e.g. 
-#'   via \code{doParallel} or \code{doFuture}; see examples.
+#'   via \code{doParallel} or \code{doFuture}; see Examples.
 #'   
 #' @details 
 #' This function computes expected values for any subset of features, optionally 
 #' conditioned on some event(s). 
 #' 
 #' There are three methods for (optionally) encoding conditioning events via the 
 #' \code{evidence} argument. The first is to provide a partial sample, where
-#' some columns from the training data are missing or set to \code{NA}. The second is to 
-#' provide a data frame with condition events. This supports inequalities and intervals. 
-#' Alternatively, users may directly input a pre-calculated posterior 
-#' distribution over leaves, with columns \code{f_idx} and \code{wt}. This may 
-#' be preferable for complex constraints. See Examples.
+#' some columns from the training data are missing or set to \code{NA}. The 
+#' second is to provide a data frame with conditioning events. This supports 
+#' inequalities and intervals. Alternatively, users may directly input a 
+#' pre-calculated posterior distribution over leaves, with columns \code{f_idx} 
+#' and \code{wt}. This may be preferable for complex constraints. See Examples.
 #' 
-#' Please note that results for continuous features which are both included in \code{query} and in
-#' \code{evidence} with an interval condition are currently inconsistent.
+#' Please note that results for continuous features which are both included in 
+#' \code{query} and in \code{evidence} with an interval condition are currently 
+#' inconsistent.
 #' 
 #' @return 
 #' A one row data frame with values for all query variables.
@@ -67,7 +71,12 @@
 #' # Compute expectations for all features other than Species
 #' expct(psi, evidence = evi)
 #' 
-#' # Condition on first two data rows with some missing values
+#' # Condition on Species = "setosa" and Petal.Width > 0.3
+#' evi <- data.frame(Species = "setosa", 
+#'                   Petal.Width = ">0.3")
+#' expct(psi, evidence = evi)
+#' 
+#' # Condition on first two rows with some missing values
 #' evi <- iris[1:2,]
 #' evi[1, 1] <- NA_real_
 #' evi[1, 5] <- NA_character_
@@ -84,7 +93,8 @@
 #' }
 #' 
 #' @seealso
-#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forde}}, \code{\link{forge}}, \code{\link{lik}}
+#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forde}}, 
+#' \code{\link{forge}}, \code{\link{lik}}
 #' 
 #'
 #' @export
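
As a rough sanity check for the new example (`Species = "setosa"` and `Petal.Width > 0.3`), the same conditioning event can be applied directly to the training data in base R. This is only the empirical counterpart, not what `expct` computes: `expct` returns model-based expectations from the fitted circuit, so its output should not be expected to match these sample means exactly.

```r
# Rows of iris that satisfy the conditioning event from the example
matches <- iris$Species == "setosa" & iris$Petal.Width > 0.3
iris_sub <- iris[matches, ]

# Empirical conditional means of the remaining numeric features
emp_means <- colMeans(
  iris_sub[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]
)
emp_means
```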

diff --git a/R/forde.R b/R/forde.R
@@ -7,25 +7,24 @@
 #' @param x Training data for estimating parameters.
 #' @param oob Only use out-of-bag samples for parameter estimation? If 
 #'   \code{TRUE}, \code{x} must be the same dataset used to train \code{arf}. 
-#'   Can also be "inbag" to only use in-bag samples. Default is \code{FALSE}, i.e.,
-#'   to use all observations
+#'   Set to \code{"inbag"} to only use in-bag samples. Default is \code{FALSE}, 
+#'   i.e. use all observations.
 #' @param family Distribution to use for density estimation of continuous 
 #'   features. Current options include truncated normal (the default
 #'   \code{family = "truncnorm"}) and uniform (\code{family = "unif"}). See 
 #'   Details.
 #' @param finite_bounds Impose finite bounds on all continuous variables? If
-#'   \code{'local'}, infinite bounds are shrinked to empirical extrema within leaves.
-#'   If \code{'global'}, infinite bounds are shrinked to global empirical extrema. 
-#'   if \code{'no'} (the default), do not impose finite bounds.
+#'   \code{"local"}, infinite bounds are set to empirical extrema within leaves.
+#'   If \code{"global"}, infinite bounds are set to global empirical extrema. 
+#'   If \code{"no"} (the default), infinite bounds are left unchanged.
 #' @param alpha Optional pseudocount for Laplace smoothing of categorical 
 #'   features. This avoids zero-mass points when test data fall outside the 
-#'   support of training data. Effectively parametrizes a flat Dirichlet prior
+#'   support of training data. Effectively parameterizes a flat Dirichlet prior
 #'   on multinomial likelihoods.
 #' @param epsilon Optional slack parameter on empirical bounds when 
-#'   \code{finite_bounds != 'no'}. This avoids 
-#'   zero-density points when test data fall outside the support of training 
-#'   data. The gap between lower and upper bounds is expanded by a factor of 
-#'   \code{1 + epsilon}. 
+#'   \code{finite_bounds != "no"}. This avoids zero-density points when test 
+#'   data fall outside the support of training data. The gap between lower and 
+#'   upper bounds is expanded by a factor of \code{1 + epsilon}. 
 #' @param parallel Compute in parallel? Must register backend beforehand, e.g. 
 #'   via \code{doParallel} or \code{doFuture}; see examples.
 #'   
@@ -40,8 +39,7 @@
 #' 
 #' Currently, \code{forde} only provides support for a limited number of 
 #' distributional families: truncated normal or uniform for continuous data,
-#' and multinomial for discrete data. Future releases will accommodate a larger 
-#' set of options.
+#' and multinomial for discrete data. 
 #' 
 #' Though \code{forde} was designed to take an adversarial random forest as 
 #' input, the function's first argument can in principle be any object of class 
@@ -100,8 +98,10 @@
 #' future::plan("multisession", workers = 4)
 #' }
 #' 
+#' 
 #' @seealso
-#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forge}}, \code{\link{expct}}, \code{\link{lik}}
+#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forge}}, 
+#' \code{\link{expct}}, \code{\link{lik}}
 #' 
 #'
 #' @export
@@ -125,7 +125,7 @@ forde <- function(
   tree <- n_oob <- cvg <- leaf <- variable <- count <- sd <- value <- psi_cnt <- 
     psi_cat <- f_idx <- sigma <- new_min <- new_max <- mid <- sigma0 <- prob <- 
     val <- val_count <- level <- all_na <- i <- k <- cnt <- . <- NA_share <-
-    mu <- length_emp <- max_emp <- min_emp <- NULL
+    mu <- length_emp <- max_emp <- min_emp <- inbag <- n_inbag <- NULL
 
   # Prelimz
   if (isTRUE(oob) & !nrow(x) %in% c(arf$num.samples, arf$num.samples/2)) {
@@ -140,7 +140,8 @@ forde <- function(
   # Uniform distribution requires finite bounds
   if (family == 'unif' & finite_bounds == 'no') {
     finite_bounds <- 'local'
-    warning("Denisity estimation with uniform distribution requires finite bounds. finite_bounds has been set to 'local'.")
+    warning('Density estimation with uniform distribution requires finite bounds. ',
+            'Resetting finite_bounds to "local".')
   }
 
   if (alpha < 0) {
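
The `alpha` and `epsilon` parameters documented above can be illustrated with a short base R sketch. The helper names `laplace_smooth` and `expand_bounds` are hypothetical. The smoothing formula is the textbook additive (Laplace) estimator, and the symmetric expansion around the midpoint is an assumption; neither is taken from the `forde` source.

```r
# Hypothetical helper: additive smoothing of category counts with
# pseudocount alpha (textbook formula, assumed here for illustration).
laplace_smooth <- function(counts, alpha = 0) {
  (counts + alpha) / (sum(counts) + alpha * length(counts))
}

# Hypothetical helper: widen the gap between bounds by a factor of
# 1 + epsilon, assuming the expansion is symmetric about the midpoint.
expand_bounds <- function(lower, upper, epsilon = 0) {
  mid <- (lower + upper) / 2
  half_width <- (upper - lower) / 2 * (1 + epsilon)
  c(lower = mid - half_width, upper = mid + half_width)
}

counts <- c(a = 8, b = 2, c = 0)
p <- laplace_smooth(counts, alpha = 1)  # level "c" no longer has zero mass
b <- expand_bounds(2, 4, epsilon = 0.1) # gap grows from 2 to 2.2
p
b
```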

diff --git a/R/forge.R b/R/forge.R
@@ -4,21 +4,25 @@
 #' 
 #' @param params Circuit parameters learned via \code{\link{forde}}. 
 #' @param evidence Optional set of conditioning events. This can take one of 
-#'   three forms: (1) a partial sample, i.e. a single row of data with
-#'   some but not all columns; (2) a data frame of conditioning events, 
-#'   which allows for inequalities and intervals; or (3) a posterior distribution over leaves;
-#'   see Details and Examples.
-#' @param evidence_row_mode Interpretation of rows in multi-row evidence. If \code{'separate'},
-#'   each row in \code{evidence} is a separate conditioning event for which \code{n_synth} synthetic samples
-#'   are generated. If \code{'or'}, the rows are combined with a logical or; see Examples.
-#' @param round Round continuous variables to their respective maximum precision in the real data set?
-#' @param sample_NAs Sample NAs respecting the probability for missing values in the original data.
+#'   three forms: (1) a partial sample, i.e. a single row of data with some but
+#'   not all columns; (2) a data frame of conditioning events, which allows for 
+#'   inequalities; or (3) a posterior distribution over leaves. See Details.
+#' @param evidence_row_mode Interpretation of rows in multi-row evidence. If 
+#'   \code{"separate"}, each row in \code{evidence} is a unique conditioning 
+#'   event for which \code{n_synth} synthetic samples are generated. If 
+#'   \code{"or"}, the rows are combined with a logical OR. See Examples.
+#' @param round Round continuous variables to their respective maximum precision 
+#'   in the real data set?
+#' @param sample_NAs Sample \code{NA}s respecting the probability for missing 
+#'   values in the original data?
 #' @param nomatch What to do if no leaf matches a condition in \code{evidence}?
-#'   Options are to force sampling from a random leaf, either with a warning (\code{"force_warning"})
-#'   or without a warning (\code{"force"}), or to return \code{NA}, also with a warning 
-#'   (\code{"na_warning"}) or without a warning (\code{"na"}). The default is \code{"force_warning"}.
-#' @param stepsize Stepsize defining number of evidence rows handled in one for each step.
-#'   Defaults to nrow(evidence)/num_registered_workers for \code{parallel == TRUE}.
+#'   Options are to force sampling from a random leaf, either with a warning 
+#'   (\code{"force_warning"}) or without (\code{"force"}); or to return 
+#'   \code{NA}, also with or without a warning (\code{"na_warning"} and 
+#'   \code{"na"}, respectively). The default is \code{"force_warning"}.
+#' @param stepsize How many rows of evidence should be handled at each step? 
+#'   Defaults to \code{nrow(evidence) / num_registered_workers} for 
+#'   \code{parallel == TRUE}.
 #' @param parallel Compute in parallel? Must register backend beforehand, e.g. 
 #'   via \code{doParallel} or \code{doFuture}; see examples.
 #' @param n_synth Number of synthetic samples to generate.
@@ -34,11 +38,11 @@
 #' 
 #' There are three methods for (optionally) encoding conditioning events via the 
 #' \code{evidence} argument. The first is to provide a partial sample, where
-#' some columns from the training data are missing or set to \code{NA}. The second is to 
-#' provide a data frame with condition events. This supports inequalities and intervals. 
-#' Alternatively, users may directly input a pre-calculated posterior 
-#' distribution over leaves, with columns \code{f_idx} and \code{wt}. This may 
-#' be preferable for complex constraints. See Examples.
+#' some columns from the training data are missing or set to \code{NA}. The 
+#' second is to provide a data frame with conditioning events. This supports 
+#' inequalities and intervals. Alternatively, users may directly input a 
+#' pre-calculated posterior distribution over leaves, with columns \code{f_idx} 
+#' and \code{wt}. This may be preferable for complex constraints. See Examples.
 #' 
 #' @return  
 #' A dataset of \code{n_synth} synthetic samples. 
@@ -99,7 +103,8 @@
 #' }
 #'
 #' @seealso
-#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forde}}, \code{\link{expct}}, \code{\link{lik}}
+#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forde}}, 
+#' \code{\link{expct}}, \code{\link{lik}}
 #' 
 #' @export
 #' @import data.table
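
The `stepsize` default described above (`nrow(evidence) / num_registered_workers`) amounts to splitting the evidence rows into roughly one chunk per worker. A minimal base R sketch, assuming at least one evidence row and assuming the quotient is rounded up, could look like this. The helper `chunk_rows` is hypothetical and not part of `arf`.

```r
# Hypothetical helper (not part of arf): assign evidence rows to steps.
# If stepsize is NULL, use one chunk per worker, rounding up (an assumption).
chunk_rows <- function(n_rows, n_workers = 1L, stepsize = NULL) {
  if (is.null(stepsize)) {
    stepsize <- ceiling(n_rows / n_workers)
  }
  row_idx <- seq_len(n_rows)
  split(row_idx, ceiling(row_idx / stepsize))
}

# 10 evidence rows on 4 workers: three steps of 3 rows and one of 1 row
chunks <- chunk_rows(n_rows = 10, n_workers = 4)
lengths(chunks)
```

A smaller `stepsize` reduces the memory needed per step at the cost of more iterations.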

diff --git a/R/impute.R b/R/impute.R
@@ -1,23 +1,27 @@
-
 #' Missing value imputation with ARF
 #' 
-#' Imputed a dataset with missing values using adversarial random forests (ARF).
-#' Calls \code{adversarial_rf}, \code{forde} and \code{expct}/\code{forge}.
+#' Perform single or multiple imputation with ARFs. Calls \code{adversarial_rf}, 
+#' \code{forde} and \code{expct}/\code{forge}.
 #'
 #' @param x Input data.
-#' @param m Number of multiple imputations. The default is single imputation (\code{m=1}).
-#' @param expectation Return expected value instead of multiple imputations. By default, for single imputation (\code{m=1}), the expected value is returned.
-#' @param num_trees Number of trees in ARF.
-#' @param min_node_size Minimum node size in ARF.
-#' @param round Round imputed values to their respective maximum precision in the original data set?
-#' @param finite_bounds Impose finite bounds on all continuous variables? See \code{\link{forde}}.
+#' @param m Number of imputed datasets to generate. The default is single 
+#'   imputation (\code{m = 1}).
+#' @param expectation Return expected value instead of multiple imputations. By 
+#'   default, for single imputation (\code{m = 1}), the expected value is 
+#'   returned.
+#' @param num_trees Number of trees to grow in the ARF. 
+#' @param min_node_size Minimal number of real data samples in leaf nodes.
+#' @param round Round continuous variables to their respective maximum precision 
+#'   in the real data set?
+#' @param finite_bounds Impose finite bounds on all continuous variables? See 
+#'   \code{\link{forde}}.
 #' @param epsilon Slack parameter on empirical bounds; see \code{\link{forde}}.
 #' @param verbose Print progress for \code{adversarial_rf}?
-#' @param ... Extra parameters to be passed to \code{adversarial_rf}, \code{forde}
-#'   and \code{expct}/\code{forge}.
+#' @param ... Extra parameters to be passed to \code{adversarial_rf}, 
+#'   \code{forde} and \code{expct}/\code{forge}.
 #'
-#' @return Imputed data. A single data table is returned for \code{m=1} and a list of data table for \code{m > 1}.
-#' @export
+#' @return Imputed data. A single dataset is returned for \code{m = 1}, a list
+#'   of datasets for \code{m > 1}.
 #'
 #' @examples
 #' # Generate some missings
@@ -30,7 +34,7 @@
 #' iris_imputed <- arf::impute(iris_na, m = 1)
 #' 
 #' # Multiple imputation
-#' iris_imputed <- arf::impute(iris_na, m = 20)
+#' iris_imputed <- arf::impute(iris_na, m = 10)
 #' 
 #' \dontrun{
 #' # Parallelization with doParallel
@@ -40,9 +44,18 @@
 #' doFuture::registerDoFuture()
 #' future::plan("multisession", workers = 4)
 #' }
+#' 
+#' @seealso
+#' \code{\link{arf}}, \code{\link{forde}}, \code{\link{forge}}, 
+#' \code{\link{expct}}, \code{\link{lik}}
+#' 
+#' @export
+#' @import data.table
+#' 
+
 impute <- function(x, 
                    m = 1, 
-                   expectation = ifelse(m==1, TRUE, FALSE), 
+                   expectation = ifelse(m == 1, TRUE, FALSE), 
                    num_trees = 100L, 
                    min_node_size = 10L, 
                    round = TRUE, 
@@ -107,4 +120,4 @@ impute <- function(x,
     x_imputed <- lapply(x_imputed, function(x) x[, idx := NULL])
   }
   x_imputed
-}
+}
diff --git a/R/shortcut_functions.R b/R/shortcut_functions.R
@@ -12,7 +12,8 @@
 #'   and \code{lik}.
 #'   
 #' @return 
-#' A vector of likelihoods, optionally on the log scale. A dataset of \code{n_synth} synthetic samples or of \code{nrow(x)} synthetic
+#' A vector of likelihoods, optionally on the log scale. A dataset of 
+#' \code{n_synth} synthetic samples or of \code{nrow(x)} synthetic
 #' samples if \code{n_synth} is undefined. 
 #' 
 #' 

diff --git a/README.md b/README.md
@@ -55,3 +55,6 @@ A Python implementation of ARF, `arfpy`, is available on [PyPI](https://pypi.org
 
 ## References
 * Watson, D. S., Blesch, K., Kapar, J. & Wright, M. N. (2023). Adversarial random forests for density estimation and generative modeling. In *Proceedings of the 26th International Conference on Artificial Intelligence and Statistics*. Link [here](https://proceedings.mlr.press/v206/watson23a.html).
+* Blesch, K., Koenen, N., Kapar, J., Golchian, P., Burk, L., Loecher, M. & Wright, M. N. (2025). Conditional feature importance with generative modeling using adversarial random forests. In *Proceedings of the 39th AAAI Conference on Artificial Intelligence*. Link [here](https://arxiv.org/abs/2501.11178).
+
+