Improve documentation #44

Merged
merged 8 commits into from
Feb 21, 2025
1 change: 1 addition & 0 deletions DESCRIPTION
@@ -54,6 +54,7 @@ Suggests:
knitr,
rmarkdown,
tibble,
palmerpenguins,
testthat (>= 3.0.0)
Config/testthat/edition: 3
VignetteBuilder: knitr
15 changes: 9 additions & 6 deletions R/adversarial_rf.R
@@ -14,7 +14,8 @@
#' @param early_stop Terminate loop if performance fails to improve from one
#' round to the next?
#' @param prune Impose \code{min_node_size} by pruning?
#' @param verbose Print discriminator accuracy after each round? Will also show additional warnings.
#' @param verbose Print discriminator accuracy after each round? Will also show
#' additional warnings.
#' @param parallel Compute in parallel? Must register backend beforehand, e.g.
#' via \code{doParallel} or \code{doFuture}; see examples.
#' @param ... Extra parameters to be passed to \code{ranger}.
@@ -37,10 +38,11 @@
#' trees for improved performance (typically on the order of 100-1000 depending
#' on sample size).
#'
#' Integer variables are recoded with a warning. Default behavior is to convert
#' those with six or more unique values to numeric, while those with up to five
#' unique values are treated as ordered factors. To override this behavior,
#' explicitly recode integer variables to the target type prior to training.
#' Integer variables are recoded with a warning (set \code{verbose = FALSE} to
#' silence these). Default behavior is to convert integer variables with six or
#' more unique values to numeric, while those with up to five unique values are
#' treated as ordered factors. To override this behavior, explicitly recode
#' integer variables to the target type prior to training.
#'
#' Note: convergence is not guaranteed in finite samples. The \code{max_iters}
#' argument sets an upper bound on the number of training rounds. Similar
@@ -93,7 +95,8 @@
#' }
#'
#' @seealso
#' \code{\link{arf}}, \code{\link{forde}}, \code{\link{forge}}, \code{\link{expct}}, \code{\link{lik}}
#' \code{\link{arf}}, \code{\link{forde}}, \code{\link{forge}},
#' \code{\link{expct}}, \code{\link{lik}}
#'
#' @export
#' @import ranger
Expand Down
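Note on the integer recoding rule documented in this file: the following is a minimal base-R sketch of the behavior as the docs describe it (six or more unique values become numeric, up to five become ordered factors). `recode_integers()` and its `max_levels` argument are hypothetical names for illustration only; this is not the package's internal code.

```r
# Hypothetical helper mirroring the documented default; not part of arf.
recode_integers <- function(df, max_levels = 5L) {
  for (j in names(df)) {
    if (is.integer(df[[j]])) {
      if (length(unique(df[[j]])) > max_levels) {
        # Six or more unique values: treat as numeric
        df[[j]] <- as.numeric(df[[j]])
      } else {
        # Up to five unique values: treat as ordered factor
        df[[j]] <- factor(df[[j]], ordered = TRUE)
      }
    }
  }
  df
}

df <- data.frame(
  rating = c(1L, 2L, 3L, 2L, 1L, 3L, 2L, 1L),   # 3 unique values
  count  = c(0L, 4L, 9L, 12L, 7L, 3L, 15L, 21L) # 8 unique values
)
out <- recode_integers(df)

# To override the default, recode explicitly before training, e.g. keep
# `rating` numeric so it is no longer an integer column:
df$rating <- as.numeric(df$rating)
```

Recoding up front, as in the last line, is the route the documentation recommends when the default split at five unique values does not suit the data.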
3 changes: 2 additions & 1 deletion R/arf-package.R
@@ -1,5 +1,6 @@
#' @seealso
#' \code{\link{adversarial_rf}}, \code{\link{forde}}, \code{\link{forge}}, \code{\link{expct}}, \code{\link{lik}}
#' \code{\link{adversarial_rf}}, \code{\link{forde}}, \code{\link{forge}},
#' \code{\link{expct}}, \code{\link{lik}}
#'
#' Useful links:
#' \itemize{
Expand Down
56 changes: 33 additions & 23 deletions R/expct.R
@@ -6,40 +6,44 @@
#' @param params Circuit parameters learned via \code{\link{forde}}.
#' @param query Optional character vector of variable names. Estimates will be
#' computed for each. If \code{NULL}, all variables other than those in
#' \code{evidence} will be estimated. If evidence contains \code{NA}s, those
#' variables will be estimated and a full dataset is returned.
#' \code{evidence} will be estimated. If \code{evidence} contains \code{NA}s,
#' those values will be imputed and a full dataset is returned.
#' @param evidence Optional set of conditioning events. This can take one of
#' three forms: (1) a partial sample, i.e. a single row of data with
#' some but not all columns; (2) a data frame of conditioning events,
#' which allows for inequalities and intervals; or (3) a posterior distribution over leaves;
#' see Details and Examples.
#' @param evidence_row_mode Interpretation of rows in multi-row evidence. If \code{'separate'},
#' each row in \code{evidence} is a separate conditioning event for which \code{n_synth} synthetic samples
#' are generated. If \code{'or'}, the rows are combined with a logical or; see Examples.
#' @param round Round continuous variables to their respective maximum precision in the real data set?
#' which allows for inequalities and intervals; or (3) a posterior
#' distribution over leaves. See Details and Examples.
#' @param evidence_row_mode Interpretation of rows in multi-row evidence. If
#' \code{"separate"}, each row in \code{evidence} is a unique conditioning
#' event for which \code{n_synth} synthetic samples are generated. If
#' \code{"or"}, the rows are combined with a logical OR. See Examples.
#' @param round Round continuous variables to their respective maximum precision
#' in the real data set?
#' @param nomatch What to do if no leaf matches a condition in \code{evidence}?
#' Options are to force sampling from a random leaf, either with a warning (\code{"force_warning"})
#' or without a warning (\code{"force"}), or to return \code{NA}, also with a warning
#' (\code{"na_warning"}) or without a warning (\code{"na"}). The default is \code{"force_warning"}.
#' @param stepsize Stepsize defining number of evidence rows handled in one for each step.
#' Defaults to nrow(evidence)/num_registered_workers for \code{parallel == TRUE}.
#' Options are to force sampling from a random leaf, either with a warning
#' (\code{"force_warning"}) or without (\code{"force"}); or to return
#' \code{NA}, also with or without a warning (\code{"na_warning"} and
#' \code{"na"}, respectively). The default is \code{"force_warning"}.
#' @param stepsize How many rows of evidence should be handled at each step?
#' Defaults to \code{nrow(evidence) / num_registered_workers} for
#' \code{parallel == TRUE}.
#' @param parallel Compute in parallel? Must register backend beforehand, e.g.
#' via \code{doParallel} or \code{doFuture}; see examples.
#' via \code{doParallel} or \code{doFuture}; see Examples.
#'
#' @details
#' This function computes expected values for any subset of features, optionally
#' conditioned on some event(s).
#'
#' There are three methods for (optionally) encoding conditioning events via the
#' \code{evidence} argument. The first is to provide a partial sample, where
#' some columns from the training data are missing or set to \code{NA}. The second is to
#' provide a data frame with condition events. This supports inequalities and intervals.
#' Alternatively, users may directly input a pre-calculated posterior
#' distribution over leaves, with columns \code{f_idx} and \code{wt}. This may
#' be preferable for complex constraints. See Examples.
#' some columns from the training data are missing or set to \code{NA}. The
#' second is to provide a data frame with conditioning events. This supports
#' inequalities and intervals. Alternatively, users may directly input a
#' pre-calculated posterior distribution over leaves, with columns \code{f_idx}
#' and \code{wt}. This may be preferable for complex constraints. See Examples.
#'
#' Please note that results for continuous features which are both included in \code{query} and in
#' \code{evidence} with an interval condition are currently inconsistent.
#' Please note that results for continuous features which are both included in
#' \code{query} and in \code{evidence} with an interval condition are currently
#' inconsistent.
#'
#' @return
#' A one row data frame with values for all query variables.
@@ -67,7 +71,12 @@
#' # Compute expectations for all features other than Species
#' expct(psi, evidence = evi)
#'
#' # Condition on first two data rows with some missing values
#' # Condition on Species = "setosa" and Petal.Width > 0.3
#' evi <- data.frame(Species = "setosa",
#' Petal.Width = ">0.3")
#' expct(psi, evidence = evi)
#'
#' # Condition on first two rows with some missing values
#' evi <- iris[1:2,]
#' evi[1, 1] <- NA_real_
#' evi[1, 5] <- NA_character_
@@ -84,7 +93,8 @@
#' }
#'
#' @seealso
#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forde}}, \code{\link{forge}}, \code{\link{lik}}
#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forde}},
#' \code{\link{forge}}, \code{\link{lik}}
#'
#'
#' @export
Expand Down
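Note on the `stepsize` default documented in this file: the sketch below shows, in base R, how evidence rows could be split into steps of size `nrow(evidence) / num_registered_workers`. `chunk_evidence_rows()` and `n_workers` are hypothetical names, and rounding the step size up with `ceiling()` is an assumption; the package may handle the remainder differently. In a `foreach` setup the worker count would typically come from `foreach::getDoParWorkers()`.

```r
# Hypothetical helper: split row indices into steps of (roughly) equal size.
chunk_evidence_rows <- function(n_rows, n_workers) {
  # Assumed default: rows per step = rows / workers, rounded up
  stepsize <- ceiling(n_rows / n_workers)
  starts <- seq(1L, n_rows, by = stepsize)
  lapply(starts, function(s) s:min(s + stepsize - 1L, n_rows))
}

# Ten evidence rows spread over four workers
chunks <- chunk_evidence_rows(n_rows = 10L, n_workers = 4L)
lengths(chunks)
```

A smaller `stepsize` means more steps with less evidence held in memory at once, while a larger one means fewer, bigger steps.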
31 changes: 16 additions & 15 deletions R/forde.R
@@ -7,25 +7,24 @@
#' @param x Training data for estimating parameters.
#' @param oob Only use out-of-bag samples for parameter estimation? If
#' \code{TRUE}, \code{x} must be the same dataset used to train \code{arf}.
#' Can also be "inbag" to only use in-bag samples. Default is \code{FALSE}, i.e.,
#' to use all observations
#' Set to \code{"inbag"} to only use in-bag samples. Default is \code{FALSE},
#' i.e. use all observations.
#' @param family Distribution to use for density estimation of continuous
#' features. Current options include truncated normal (the default
#' \code{family = "truncnorm"}) and uniform (\code{family = "unif"}). See
#' Details.
#' @param finite_bounds Impose finite bounds on all continuous variables? If
#' \code{'local'}, infinite bounds are shrinked to empirical extrema within leaves.
#' If \code{'global'}, infinite bounds are shrinked to global empirical extrema.
#' if \code{'no'} (the default), do not impose finite bounds.
#' \code{"local"}, infinite bounds are set to empirical extrema within leaves.
#' If \code{"global"}, infinite bounds are set to global empirical extrema.
#' If \code{"no"} (the default), infinite bounds are left unchanged.
#' @param alpha Optional pseudocount for Laplace smoothing of categorical
#' features. This avoids zero-mass points when test data fall outside the
#' support of training data. Effectively parametrizes a flat Dirichlet prior
#' support of training data. Effectively parameterizes a flat Dirichlet prior
#' on multinomial likelihoods.
#' @param epsilon Optional slack parameter on empirical bounds when
#' \code{finite_bounds != 'no'}. This avoids
#' zero-density points when test data fall outside the support of training
#' data. The gap between lower and upper bounds is expanded by a factor of
#' \code{1 + epsilon}.
#' \code{finite_bounds != "no"}. This avoids zero-density points when test
#' data fall outside the support of training data. The gap between lower and
#' upper bounds is expanded by a factor of \code{1 + epsilon}.
#' @param parallel Compute in parallel? Must register backend beforehand, e.g.
#' via \code{doParallel} or \code{doFuture}; see examples.
#'
@@ -40,8 +39,7 @@
#'
#' Currently, \code{forde} only provides support for a limited number of
#' distributional families: truncated normal or uniform for continuous data,
#' and multinomial for discrete data. Future releases will accommodate a larger
#' set of options.
#' and multinomial for discrete data.
#'
#' Though \code{forde} was designed to take an adversarial random forest as
#' input, the function's first argument can in principle be any object of class
@@ -100,8 +98,10 @@
#' future::plan("multisession", workers = 4)
#' }
#'
#'
#' @seealso
#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forge}}, \code{\link{expct}}, \code{\link{lik}}
#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forge}},
#' \code{\link{expct}}, \code{\link{lik}}
#'
#'
#' @export
@@ -125,7 +125,7 @@ forde <- function(
tree <- n_oob <- cvg <- leaf <- variable <- count <- sd <- value <- psi_cnt <-
psi_cat <- f_idx <- sigma <- new_min <- new_max <- mid <- sigma0 <- prob <-
val <- val_count <- level <- all_na <- i <- k <- cnt <- . <- NA_share <-
mu <- length_emp <- max_emp <- min_emp <- NULL
mu <- length_emp <- max_emp <- min_emp <- inbag <- n_inbag <- NULL

# Prelimz
if (isTRUE(oob) & !nrow(x) %in% c(arf$num.samples, arf$num.samples/2)) {
@@ -140,7 +140,8 @@ forde <- function(
# Uniform distribution requires finite bounds
if (family == 'unif' & finite_bounds == 'no') {
finite_bounds <- 'local'
warning("Denisity estimation with uniform distribution requires finite bounds. finite_bounds has been set to 'local'.")
warning('Density estimation with uniform distribution requires finite bounds. ',
'Resetting finite_bounds to "local".')
}

if (alpha < 0) {
Expand Down
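Note on the `alpha` and `epsilon` parameters documented in this file: the sketch below works through both adjustments in base R. `smooth_probs()` and `expand_bounds()` are hypothetical helpers, not functions exported by the package. The smoothing formula is standard additive (Laplace) smoothing; expanding the bounds symmetrically around the midpoint is an assumption, since the docs only state that the gap grows by a factor of `1 + epsilon`.

```r
# Additive smoothing: add a pseudocount alpha to every category count.
smooth_probs <- function(counts, alpha) {
  (counts + alpha) / (sum(counts) + alpha * length(counts))
}

# Level "c" was never observed in this leaf, yet gets nonzero mass
p <- smooth_probs(c(a = 3, b = 1, c = 0), alpha = 1)

# Widen [lower, upper] so that its length grows by a factor of 1 + epsilon.
# Assumption: the expansion is symmetric around the midpoint.
expand_bounds <- function(lower, upper, epsilon) {
  mid <- (lower + upper) / 2
  half <- (upper - lower) * (1 + epsilon) / 2
  c(lower = mid - half, upper = mid + half)
}

b <- expand_bounds(lower = 2, upper = 4, epsilon = 0.1)
```

With `alpha = 0` and `epsilon = 0` both helpers return the empirical values unchanged, which matches the documented role of these arguments as optional slack.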
45 changes: 25 additions & 20 deletions R/forge.R
@@ -4,21 +4,25 @@
#'
#' @param params Circuit parameters learned via \code{\link{forde}}.
#' @param evidence Optional set of conditioning events. This can take one of
#' three forms: (1) a partial sample, i.e. a single row of data with
#' some but not all columns; (2) a data frame of conditioning events,
#' which allows for inequalities and intervals; or (3) a posterior distribution over leaves;
#' see Details and Examples.
#' @param evidence_row_mode Interpretation of rows in multi-row evidence. If \code{'separate'},
#' each row in \code{evidence} is a separate conditioning event for which \code{n_synth} synthetic samples
#' are generated. If \code{'or'}, the rows are combined with a logical or; see Examples.
#' @param round Round continuous variables to their respective maximum precision in the real data set?
#' @param sample_NAs Sample NAs respecting the probability for missing values in the original data.
#' three forms: (1) a partial sample, i.e. a single row of data with some but
#' not all columns; (2) a data frame of conditioning events, which allows for
#' inequalities; or (3) a posterior distribution over leaves. See Details.
#' @param evidence_row_mode Interpretation of rows in multi-row evidence. If
#' \code{"separate"}, each row in \code{evidence} is a unique conditioning
#' event for which \code{n_synth} synthetic samples are generated. If
#' \code{"or"}, the rows are combined with a logical OR. See Examples.
#' @param round Round continuous variables to their respective maximum precision
#' in the real data set?
#' @param sample_NAs Sample \code{NA}s respecting the probability for missing
#' values in the original data?
#' @param nomatch What to do if no leaf matches a condition in \code{evidence}?
#' Options are to force sampling from a random leaf, either with a warning (\code{"force_warning"})
#' or without a warning (\code{"force"}), or to return \code{NA}, also with a warning
#' (\code{"na_warning"}) or without a warning (\code{"na"}). The default is \code{"force_warning"}.
#' @param stepsize Stepsize defining number of evidence rows handled in one for each step.
#' Defaults to nrow(evidence)/num_registered_workers for \code{parallel == TRUE}.
#' Options are to force sampling from a random leaf, either with a warning
#' (\code{"force_warning"}) or without (\code{"force"}); or to return
#' \code{NA}, also with or without a warning (\code{"na_warning"} and
#' \code{"na"}, respectively). The default is \code{"force_warning"}.
#' @param stepsize How many rows of evidence should be handled at each step?
#' Defaults to \code{nrow(evidence) / num_registered_workers} for
#' \code{parallel == TRUE}.
#' @param parallel Compute in parallel? Must register backend beforehand, e.g.
#' via \code{doParallel} or \code{doFuture}; see examples.
#' @param n_synth Number of synthetic samples to generate.
@@ -34,11 +38,11 @@
#'
#' There are three methods for (optionally) encoding conditioning events via the
#' \code{evidence} argument. The first is to provide a partial sample, where
#' some columns from the training data are missing or set to \code{NA}. The second is to
#' provide a data frame with condition events. This supports inequalities and intervals.
#' Alternatively, users may directly input a pre-calculated posterior
#' distribution over leaves, with columns \code{f_idx} and \code{wt}. This may
#' be preferable for complex constraints. See Examples.
#' some columns from the training data are missing or set to \code{NA}. The
#' second is to provide a data frame with conditioning events. This supports
#' inequalities and intervals. Alternatively, users may directly input a
#' pre-calculated posterior distribution over leaves, with columns \code{f_idx}
#' and \code{wt}. This may be preferable for complex constraints. See Examples.
#'
#' @return
#' A dataset of \code{n_synth} synthetic samples.
@@ -99,7 +103,8 @@
#' }
#'
#' @seealso
#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forde}}, \code{\link{expct}}, \code{\link{lik}}
#' \code{\link{arf}}, \code{\link{adversarial_rf}}, \code{\link{forde}},
#' \code{\link{expct}}, \code{\link{lik}}
#'
#' @export
#' @import data.table
Expand Down
45 changes: 29 additions & 16 deletions R/impute.R
@@ -1,23 +1,27 @@

#' Missing value imputation with ARF
#'
#' Imputed a dataset with missing values using adversarial random forests (ARF).
#' Calls \code{adversarial_rf}, \code{forde} and \code{expct}/\code{forge}.
#' Perform single or multiple imputation with ARFs. Calls \code{adversarial_rf},
#' \code{forde} and \code{expct}/\code{forge}.
#'
#' @param x Input data.
#' @param m Number of multiple imputations. The default is single imputation (\code{m=1}).
#' @param expectation Return expected value instead of multiple imputations. By default, for single imputation (\code{m=1}), the expected value is returned.
#' @param num_trees Number of trees in ARF.
#' @param min_node_size Minimum node size in ARF.
#' @param round Round imputed values to their respective maximum precision in the original data set?
#' @param finite_bounds Impose finite bounds on all continuous variables? See \code{\link{forde}}.
#' @param m Number of imputed datasets to generate. The default is single
#' imputation (\code{m = 1}).
#' @param expectation Return expected value instead of multiple imputations. By
#' default, for single imputation (\code{m = 1}), the expected value is
#' returned.
#' @param num_trees Number of trees to grow in the ARF.
#' @param min_node_size Minimal number of real data samples in leaf nodes.
#' @param round Round continuous variables to their respective maximum precision
#' in the real data set?
#' @param finite_bounds Impose finite bounds on all continuous variables? See
#' \code{\link{forde}}.
#' @param epsilon Slack parameter on empirical bounds; see \code{\link{forde}}.
#' @param verbose Print progress for \code{adversarial_rf}?
#' @param ... Extra parameters to be passed to \code{adversarial_rf}, \code{forde}
#' and \code{expct}/\code{forge}.
#' @param ... Extra parameters to be passed to \code{adversarial_rf},
#' \code{forde} and \code{expct}/\code{forge}.
#'
#' @return Imputed data. A single data table is returned for \code{m=1} and a list of data table for \code{m > 1}.
#' @export
#' @return Imputed data. A single dataset is returned for \code{m = 1}, a list
#' of datasets for \code{m > 1}.
#'
#' @examples
#' # Generate some missings
@@ -30,7 +34,7 @@
#' iris_imputed <- arf::impute(iris_na, m = 1)
#'
#' # Multiple imputation
#' iris_imputed <- arf::impute(iris_na, m = 20)
#' iris_imputed <- arf::impute(iris_na, m = 10)
#'
#' \dontrun{
#' # Parallelization with doParallel
@@ -40,9 +44,18 @@
#' doFuture::registerDoFuture()
#' future::plan("multisession", workers = 4)
#' }
#'
#' @seealso
#' \code{\link{arf}}, \code{\link{forde}}, \code{\link{forge}},
#' \code{\link{expct}}, \code{\link{lik}}
#'
#' @export
#' @import data.table
#'

impute <- function(x,
m = 1,
expectation = ifelse(m==1, TRUE, FALSE),
expectation = ifelse(m == 1, TRUE, FALSE),
num_trees = 100L,
min_node_size = 10L,
round = TRUE,
@@ -107,4 +120,4 @@ impute <- function(x,
x_imputed <- lapply(x_imputed, function(x) x[, idx := NULL])
}
x_imputed
}
}
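Note on the `m` argument and return value documented in this file: a short usage sketch, assuming the `arf` package is installed. The way missing values are generated here (15 per column, fixed seed) is made up for illustration, and the `requireNamespace()` guard is only there so the snippet runs without the package; the `impute()` calls follow the ones in the roxygen examples.

```r
set.seed(42)

# Illustrative data: knock out some values in one numeric and one factor column
iris_na <- iris
iris_na[sample(nrow(iris_na), 15), "Sepal.Length"] <- NA
iris_na[sample(nrow(iris_na), 15), "Species"] <- NA

if (requireNamespace("arf", quietly = TRUE)) {
  # Single imputation: per the docs, one dataset filled with expected values
  single <- arf::impute(iris_na, m = 1)

  # Multiple imputation: per the docs, a list with one dataset per imputation
  multi <- arf::impute(iris_na, m = 3)
}
```

Per the signature in this diff, `expectation` defaults to `TRUE` only when `m == 1`, so the two calls differ in how values are filled in as well as in the shape of what they return.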
3 changes: 2 additions & 1 deletion R/shortcut_functions.R
@@ -12,7 +12,8 @@
#' and \code{lik}.
#'
#' @return
#' A vector of likelihoods, optionally on the log scale. A dataset of \code{n_synth} synthetic samples or of \code{nrow(x)} synthetic
#' A vector of likelihoods, optionally on the log scale. A dataset of
#' \code{n_synth} synthetic samples or of \code{nrow(x)} synthetic
#' samples if \code{n_synth} is undefined.
#'
#'
Expand Down
3 changes: 3 additions & 0 deletions README.md
@@ -55,3 +55,6 @@ A Python implementation of ARF, `arfpy`, is available on [PyPI](https://pypi.org

## References
* Watson, D. S., Blesch, K., Kapar, J. & Wright, M. N. (2023). Adversarial random forests for density estimation and generative modeling. In *Proceedings of the 26th International Conference on Artificial Intelligence and Statistics*. Link [here](https://proceedings.mlr.press/v206/watson23a.html).
* Blesch, K., Koenen, N., Kapar, J., Golchian, P., Burk, L., Loecher, M. & Wright, M. N. (2025). Conditional feature importance with generative modeling using adversarial random forests. In *Proceedings of the 39th AAAI Conference on Artificial Intelligence*. Link [here](https://arxiv.org/abs/2501.11178).

