[R] Replace vignettes and examples #11123

Merged
merged 22 commits on Jan 15, 2025
Changes from 1 commit
replace old vignettes and examples
david-cortes committed Dec 19, 2024
commit 89a0bd4ca73f3b4afce845482b5cc49df46cfe59
3 changes: 2 additions & 1 deletion R-package/DESCRIPTION
@@ -45,10 +45,11 @@ License: Apache License (== 2.0) | file LICENSE
URL: https://github.com/dmlc/xgboost
BugReports: https://github.com/dmlc/xgboost/issues
NeedsCompilation: yes
VignetteBuilder: knitr
VignetteBuilder: knitr, quarto
Suggests:
knitr,
rmarkdown,
quarto,
ggplot2 (>= 1.0.1),
DiagrammeR (>= 0.9.0),
DiagrammeRsvg,
111 changes: 45 additions & 66 deletions R-package/R/xgb.importance.R
@@ -38,85 +38,64 @@
#' (based on C++ code), it starts at 0 (as in C/C++ or Python) instead of 1 (usual in R).
#'
#' @examples
#'
#' # binomial classification using "gbtree":
#' data(agaricus.train, package = "xgboost")
#'
#' bst <- xgb.train(
#' data = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label),
#' nrounds = 2,
#' params = xgb.params(
#' max_depth = 2,
#' nthread = 2,
#' objective = "binary:logistic"
#' )
#' # binary classification using "gbtree":
#' data("ToothGrowth")
#' x <- ToothGrowth[, c("len", "dose")]
#' y <- ToothGrowth$supp
#' model_tree_binary <- xgboost(
#' x, y,
#' nrounds = 5L,
#' nthreads = 1L,
#' booster = "gbtree",
#' max_depth = 2L
#' )
#'
#' xgb.importance(model = bst)
#'
#' # binomial classification using "gblinear":
#' bst <- xgb.train(
#' data = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label),
#' nrounds = 20,
#' params = xgb.params(
#' booster = "gblinear",
#' learning_rate = 0.3,
#' nthread = 1,
#' objective = "binary:logistic"
#' )
#' xgb.importance(model_tree_binary)
#'
#' # binary classification using "gblinear":
#' model_tree_linear <- xgboost(
#' x, y,
#' nrounds = 5L,
#' nthreads = 1L,
#' booster = "gblinear",
#' learning_rate = 0.3
#' )
#'
#' xgb.importance(model = bst)
#'
#' # multiclass classification using "gbtree":
#' nclass <- 3
#' nrounds <- 10
#' mbst <- xgb.train(
#' data = xgb.DMatrix(
#' as.matrix(iris[, -5]),
#' label = as.numeric(iris$Species) - 1
#' ),
#' nrounds = nrounds,
#' params = xgb.params(
#' max_depth = 3,
#' nthread = 2,
#' objective = "multi:softprob",
#' num_class = nclass
#' )
#' xgb.importance(model_tree_linear)
#'
#' # multi-class classification using "gbtree":
#' data("iris")
#' x <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
#' y <- iris$Species
#' model_tree_multi <- xgboost(
#' x, y,
#' nrounds = 5L,
#' nthreads = 1L,
#' booster = "gbtree",
#' max_depth = 3
#' )
#'
#' # all classes clumped together:
#' xgb.importance(model = mbst)
#'
#' xgb.importance(model_tree_multi)
#' # inspect importances separately for each class:
#' num_classes <- 3L
#' nrounds <- 5L
#' xgb.importance(
#' model = mbst, trees = seq(from = 1, by = nclass, length.out = nrounds)
#' model_tree_multi, trees = seq(from = 1, by = num_classes, length.out = nrounds)
#' )
#' xgb.importance(
#' model = mbst, trees = seq(from = 2, by = nclass, length.out = nrounds)
#' model_tree_multi, trees = seq(from = 2, by = num_classes, length.out = nrounds)
#' )
#' xgb.importance(
#' model = mbst, trees = seq(from = 3, by = nclass, length.out = nrounds)
#' model_tree_multi, trees = seq(from = 3, by = num_classes, length.out = nrounds)
#' )
#'
#' # multiclass classification using "gblinear":
#' mbst <- xgb.train(
#' data = xgb.DMatrix(
#' scale(as.matrix(iris[, -5])),
#' label = as.numeric(iris$Species) - 1
#' ),
#' nrounds = 15,
#' params = xgb.params(
#' booster = "gblinear",
#' learning_rate = 0.2,
#' nthread = 1,
#' objective = "multi:softprob",
#' num_class = nclass
#' )
#' # multi-class classification using "gblinear":
#' model_linear_multi <- xgboost(
#' x, y,
#' nrounds = 5L,
#' nthreads = 1L,
#' booster = "gblinear",
#' learning_rate = 0.2
#' )
#'
#' xgb.importance(model = mbst)
#'
#' xgb.importance(model_linear_multi)
#' @export
xgb.importance <- function(model = NULL, feature_names = getinfo(model, "feature_name"), trees = NULL) {

23 changes: 10 additions & 13 deletions R-package/R/xgb.plot.deepness.R
@@ -49,27 +49,24 @@
#' data.table::setDTthreads(nthread)
#'
#' ## Change max_depth to a higher number to get a more significant result
#' bst <- xgb.train(
#' data = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label),
#' model <- xgboost(
#' agaricus.train$data, factor(agaricus.train$label),
#' nrounds = 50,
#' params = xgb.params(
#' max_depth = 6,
#' nthread = nthread,
#' objective = "binary:logistic",
#' subsample = 0.5,
#' min_child_weight = 2
#' )
#' max_depth = 6,
#' nthreads = nthread,
#' subsample = 0.5,
#' min_child_weight = 2
#' )
#'
#' xgb.plot.deepness(bst)
#' xgb.ggplot.deepness(bst)
#' xgb.plot.deepness(model)
#' xgb.ggplot.deepness(model)
#'
#' xgb.plot.deepness(
#' bst, which = "max.depth", pch = 16, col = rgb(0, 0, 1, 0.3), cex = 2
#' model, which = "max.depth", pch = 16, col = rgb(0, 0, 1, 0.3), cex = 2
#' )
#'
#' xgb.plot.deepness(
#' bst, which = "med.weight", pch = 16, col = rgb(0, 0, 1, 0.3), cex = 2
#' model, which = "med.weight", pch = 16, col = rgb(0, 0, 1, 0.3), cex = 2
#' )
#'
#' @rdname xgb.plot.deepness
13 changes: 5 additions & 8 deletions R-package/R/xgb.plot.importance.R
@@ -50,17 +50,14 @@
#' nthread <- 2
#' data.table::setDTthreads(nthread)
#'
#' bst <- xgb.train(
#' data = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label),
#' model <- xgboost(
#' agaricus.train$data, factor(agaricus.train$label),
#' nrounds = 2,
#' params = xgb.params(
#' max_depth = 3,
#' nthread = nthread,
#' objective = "binary:logistic"
#' )
#' max_depth = 3,
#' nthreads = nthread
#' )
#'
#' importance_matrix <- xgb.importance(colnames(agaricus.train$data), model = bst)
#' importance_matrix <- xgb.importance(model)
#' xgb.plot.importance(
#' importance_matrix, rel_to_first = TRUE, xlab = "Relative importance"
#' )
21 changes: 9 additions & 12 deletions R-package/R/xgb.plot.multi.trees.R
@@ -36,26 +36,23 @@
#' nthread <- 2
#' data.table::setDTthreads(nthread)
#'
#' bst <- xgb.train(
#' data = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label),
#' model <- xgboost(
#' agaricus.train$data, factor(agaricus.train$label),
#' nrounds = 30,
#' verbose = 0,
#' params = xgb.params(
#' max_depth = 15,
#' learning_rate = 1,
#' nthread = nthread,
#' objective = "binary:logistic",
#' min_child_weight = 50
#' )
#' verbosity = 0L,
#' nthreads = nthread,
#' max_depth = 15,
#' learning_rate = 1,
#' min_child_weight = 50
#' )
#'
#' p <- xgb.plot.multi.trees(model = bst, features_keep = 3)
#' p <- xgb.plot.multi.trees(model, features_keep = 3)
#' print(p)
#'
#' # Below is an example of how to save this plot to a file.
#' if (require("DiagrammeR") && require("DiagrammeRsvg") && require("rsvg")) {
#' fname <- file.path(tempdir(), "tree.pdf")
#' gr <- xgb.plot.multi.trees(bst, features_keep = 3, render = FALSE)
#' gr <- xgb.plot.multi.trees(model, features_keep = 3, render = FALSE)
#' export_graph(gr, fname, width = 1500, height = 600)
#' }
#' @export
51 changes: 22 additions & 29 deletions R-package/R/xgb.plot.shap.R
@@ -81,51 +81,44 @@
#' data.table::setDTthreads(nthread)
#' nrounds <- 20
#'
#' bst <- xgb.train(
#' data = xgb.DMatrix(agaricus.train$data, agaricus.train$label),
#' model_binary <- xgboost(
#' agaricus.train$data, factor(agaricus.train$label),
#' nrounds = nrounds,
#' verbose = 0,
#' params = xgb.params(
#' learning_rate = 0.1,
#' max_depth = 3,
#' subsample = 0.5,
#' objective = "binary:logistic",
#' nthread = nthread
#' )
#' verbosity = 0L,
#' learning_rate = 0.1,
#' max_depth = 3L,
#' subsample = 0.5,
#' nthreads = nthread
#' )
#'
#' xgb.plot.shap(agaricus.test$data, model = bst, features = "odor=none")
#' xgb.plot.shap(agaricus.test$data, model = model_binary, features = "odor=none")
#'
#' contr <- predict(bst, agaricus.test$data, predcontrib = TRUE)
#' xgb.plot.shap(agaricus.test$data, contr, model = bst, top_n = 12, n_col = 3)
#' contr <- predict(model_binary, agaricus.test$data, type = "contrib")
#' xgb.plot.shap(agaricus.test$data, contr, model = model_binary, top_n = 12, n_col = 3)
#'
#' # Summary plot
#' xgb.ggplot.shap.summary(agaricus.test$data, contr, model = bst, top_n = 12)
#' xgb.ggplot.shap.summary(agaricus.test$data, contr, model = model_binary, top_n = 12)
#'
#' # Multiclass example - plots for each class separately:
#' nclass <- 3
#' x <- as.matrix(iris[, -5])
#' set.seed(123)
#' is.na(x[sample(nrow(x) * 4, 30)]) <- TRUE # introduce some missing values
#'
#' mbst <- xgb.train(
#' data = xgb.DMatrix(x, label = as.numeric(iris$Species) - 1),
#' model_multiclass <- xgboost(
#' x, iris$Species,
#' nrounds = nrounds,
#' verbose = 0,
#' params = xgb.params(
#' max_depth = 2,
#' subsample = 0.5,
#' nthread = nthread,
#' objective = "multi:softprob",
#' num_class = nclass
#' )
#' verbosity = 0,
#' max_depth = 2,
#' subsample = 0.5,
#' nthreads = nthread
#' )
#' nclass <- 3
#' trees0 <- seq(from = 1, by = nclass, length.out = nrounds)
#' col <- rgb(0, 0, 1, 0.5)
#'
#' xgb.plot.shap(
#' x,
#' model = mbst,
#' model = model_multiclass,
#' trees = trees0,
#' target_class = 0,
#' top_n = 4,
@@ -137,7 +130,7 @@
#'
#' xgb.plot.shap(
#' x,
#' model = mbst,
#' model = model_multiclass,
#' trees = trees0 + 1,
#' target_class = 1,
#' top_n = 4,
@@ -149,7 +142,7 @@
#'
#' xgb.plot.shap(
#' x,
#' model = mbst,
#' model = model_multiclass,
#' trees = trees0 + 2,
#' target_class = 2,
#' top_n = 4,
@@ -160,7 +153,7 @@
#' )
#'
#' # Summary plot
#' xgb.ggplot.shap.summary(x, model = mbst, target_class = 0, top_n = 4)
#' xgb.ggplot.shap.summary(x, model = model_multiclass, target_class = 0, top_n = 4)
#'
#' @rdname xgb.plot.shap
#' @export
22 changes: 10 additions & 12 deletions R-package/R/xgb.plot.tree.R
@@ -37,25 +37,23 @@
#' line.
#'
#' @examples
#' data(agaricus.train, package = "xgboost")
#'
#' bst <- xgb.train(
#' data = xgb.DMatrix(agaricus.train$data, agaricus.train$label),
#' nrounds = 2,
#' params = xgb.params(
#' max_depth = 3,
#' nthread = 2,
#' objective = "binary:logistic"
#' )
#' data("ToothGrowth")
#' x <- ToothGrowth[, c("len", "dose")]
#' y <- ToothGrowth$supp
#' model <- xgboost(
#' x, y,
#' nthreads = 1L,
#' nrounds = 3L,
#' max_depth = 3L
#' )
#'
#' # plot the first tree
#' xgb.plot.tree(model = bst, tree_idx = 1)
#' xgb.plot.tree(model, tree_idx = 1)
#'
#' # Below is an example of how to save this plot to a file.
#' if (require("DiagrammeR") && require("htmlwidgets")) {
#' fname <- file.path(tempdir(), "plot.html")
#' gr <- xgb.plot.tree(bst, tree_idx = 1)
#' gr <- xgb.plot.tree(model, tree_idx = 1)
#' htmlwidgets::saveWidget(gr, fname)
#' }
#' @export
29 changes: 26 additions & 3 deletions R-package/R/xgboost.R
@@ -856,12 +856,13 @@ check.early.stopping.rounds <- function(early_stopping_rounds, eval_set) {
#' See the tutorial [Introduction to Boosted Trees](https://xgboost.readthedocs.io/en/stable/tutorials/model.html)
#' for a longer explanation of what XGBoost does.
#'
#' This function is intended to provide a more user-friendly interface for XGBoost that follows
#' This function is intended to provide a user-friendly interface for XGBoost that follows
#' R's conventions for model fitting and predictions, but which doesn't expose all of the
#' possible functionalities of the core XGBoost library.
#'
#' See [xgb.train()] for a more flexible low-level alternative which is similar across different
#' language bindings of XGBoost and which exposes the full library's functionalities.
#' language bindings of XGBoost and which exposes additional functionalities such as training on
#' external memory data and learning-to-rank objectives.
#'
#' @details
#' For package authors using 'xgboost' as a dependency, it is highly recommended to use
@@ -1045,7 +1046,29 @@ check.early.stopping.rounds <- function(early_stopping_rounds, eval_set) {
#' # Task objective is determined automatically according to the type of 'y'
#' data(iris)
#' model_classif <- xgboost(iris[, -5], iris$Species, nthreads = 1, nrounds = 5)
#' predict(model_classif, iris, validate_features = TRUE)
#' predict(model_classif, iris[1:10,])
#' predict(model_classif, iris[1:10,], type = "class")
#'
#' # Can nevertheless choose a non-default objective if needed
#' model_poisson <- xgboost(
#' mtcars[, -1], mtcars$mpg,
#' objective = "count:poisson",
#' nthreads = 1,
#' nrounds = 3
#' )
#'
#' # Can calculate evaluation metrics during boosting rounds
#' data(ToothGrowth)
#' xgboost(
#' ToothGrowth[, c("len", "dose")],
#' ToothGrowth$supp,
#' eval_metric = c("auc", "logloss"),
#' eval_set = 0.2,
#' monitor_training = TRUE,
#' verbosity = 1,
#' nthreads = 1,
#' nrounds = 3
#' )
xgboost <- function(
x,
y,
111 changes: 45 additions & 66 deletions R-package/man/xgb.importance.Rd
23 changes: 10 additions & 13 deletions R-package/man/xgb.plot.deepness.Rd
13 changes: 5 additions & 8 deletions R-package/man/xgb.plot.importance.Rd
21 changes: 9 additions & 12 deletions R-package/man/xgb.plot.multi.trees.Rd
51 changes: 22 additions & 29 deletions R-package/man/xgb.plot.shap.Rd
22 changes: 10 additions & 12 deletions R-package/man/xgb.plot.tree.Rd
26 changes: 24 additions & 2 deletions R-package/man/xgboost.Rd
225 changes: 0 additions & 225 deletions R-package/vignettes/vignette.css

This file was deleted.

28 changes: 0 additions & 28 deletions R-package/vignettes/xgboost.bib

This file was deleted.

196 changes: 196 additions & 0 deletions R-package/vignettes/xgboost_introduction.qmd
@@ -0,0 +1,196 @@
---
title: "XGBoost for R introduction"
vignette: >
%\VignetteEncoding{UTF-8}
%\VignetteIndexEntry{XGBoost for R introduction}
%\VignetteEngine{quarto::html}
format:
html:
embed-resources: true
theme: yeti
highlight-style: pygments
jupyter: ir
---

# Introduction

**XGBoost** is an optimized distributed gradient boosting library designed to be highly **efficient**, **flexible** and **portable**. It implements machine learning algorithms under the [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

For an introduction to the concept of gradient boosting, see the tutorial [Introduction to Boosted Trees](https://xgboost.readthedocs.io/en/stable/tutorials/model.html) in XGBoost's online docs.

For more details about XGBoost's features and usage, see the [online documentation](https://xgboost.readthedocs.io/en/stable/) which contains more tutorials, examples, and details.

This short vignette outlines the basic usage of the R interface for XGBoost, assuming the reader has some familiarity with the underlying concepts behind statistical modeling with gradient-boosted decision trees.

# Building a predictive model

At its core, XGBoost consists of a C++ library which offers bindings for different programming languages, including R. The R package for XGBoost provides an idiomatic interface similar to those of other statistical modeling packages that use an x/y design, as well as a lower-level interface that interacts more directly with the underlying core library and resembles XGBoost's other language bindings (e.g. Python). It also offers various helpers for interacting with model objects, such as plotting their feature importances or converting them to other formats.

The main function of interest is `xgboost(x, y, ...)`, which calls the XGBoost model building procedure on observed data of covariates/features/predictors "x", and a response variable "y" - it should feel familiar to users of packages like `glmnet` or `ncvreg`:

```{r}
library(xgboost)
data(ToothGrowth)
y <- ToothGrowth$supp # the response which we want to model/predict
x <- ToothGrowth[, c("len", "dose")] # the features from which we want to predict it
model <- xgboost(x, y, nthreads = 1, nrounds = 2)
model
```

In this case, the "y" response variable that was supplied is a "factor" type with two classes ("OJ" and "VC") - hence, XGBoost builds a binary classification model for it based on the features "x", by finding a maximum likelihood estimate (similar to the `family = "binomial"` model from R's `glm` function) through rule buckets obtained from the sum of two decision trees (from `nrounds = 2`), from which we can then predict probabilities, log-odds, the class with the highest likelihood, among others:

```{r}
predict(model, x[1:6, ], type = "response") # probabilities for y's last level ("VC")
predict(model, x[1:6, ], type = "raw") # log-odds
predict(model, x[1:6, ], type = "class") # class with highest probability
```

Compared to R's `glm` function, which follows the concepts of "families" and "links" from GLM theory to fit models for different kinds of response distributions, XGBoost follows the simpler concept of "objectives", which mixes both into one and, just like `glm`, allows modeling very different kinds of response distributions (e.g. discrete choices, real-valued numbers, counts, censored measurements, etc.) through a common framework.

XGBoost will automatically determine a suitable objective for the response given its object class (one can pass factors for classification, numeric vectors for regression, `Surv` objects from the `survival` package for survival, etc. - see `?xgboost` for more details), but this can be controlled manually through an `objective` parameter based on the kind of model that is desired:

```{r}
data(mtcars)
y <- mtcars$mpg
x <- mtcars[, -1]
model_gaussian <- xgboost(x, y, nthreads = 1, nrounds = 2) # default is squared loss (Gaussian)
model_poisson <- xgboost(x, y, objective = "count:poisson", nthreads = 1, nrounds = 2)
model_abserr <- xgboost(x, y, objective = "reg:absoluteerror", nthreads = 1, nrounds = 2)
```

_Note: the objective must match with the type of the "y" response variable - for example, classification objectives for discrete choices require "factor" types, while regression models for real-valued data require "numeric" types._

# Model parameters

XGBoost models allow a large degree of control over how they are built. By their nature, gradient-boosted decision tree ensembles are able to capture very complex patterns between features in the data and a response variable, which also means they can suffer from overfitting if not controlled appropriately.

For best results, one needs to find suitable parameters for the data being modeled. Note that XGBoost does not adjust its default hyperparameters based on the data, and different datasets will require vastly different hyperparameters for optimal predictive performance.

For example, for a small dataset like "ToothGrowth", which has only two features and 60 observations, the defaults from XGBoost are overkill and lead to severe overfitting - for such data, one might want smaller trees (i.e. more conservative decision rules, capturing simpler patterns) and fewer of them.

Parameters can be controlled by passing additional arguments to `xgboost()`. See `?xgb.params` for details about what parameters are available to control.

```{r}
y <- ToothGrowth$supp
x <- ToothGrowth[, c("len", "dose")]
model_conservative <- xgboost(
x, y, nthreads = 1,
nrounds = 5,
max_depth = 2,
reg_lambda = 0.5,
learning_rate = 0.15
)
pred_conservative <- predict(
model_conservative,
x
)
pred_conservative[1:6] # probabilities are all closer to 0.5 now
```

XGBoost also allows calculating evaluation metrics for model quality over boosting rounds, with a wide variety of built-in metrics available to use. It's possible to automatically set aside a fraction of the data to use as an evaluation set, from which one can then visually monitor progress and overfitting:

```{r}
xgboost(
x, y, nthreads = 1,
eval_set = 0.2,
monitor_training = TRUE,
verbosity = 1,
eval_metric = c("auc", "logloss"),
nrounds = 5,
max_depth = 2,
reg_lambda = 0.5,
learning_rate = 0.15
)
```

# Examining model objects

XGBoost model objects for the most part consist of a pointer to a C++ object where most of the information is held and which is interfaced through the utility functions and methods in the package, but they also contain some R attributes that can be retrieved (and new ones added) through `attributes()`:

```{r}
attributes(model)
```
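
Since these are plain R attributes, new ones can be attached in the usual way. A minimal sketch (the attribute name below is made up for illustration, not anything the package itself uses):

```{r}
# attach an arbitrary R attribute to the model object and read it back
attr(model, "fitted_on") <- Sys.Date()
attributes(model)$fitted_on
```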

In addition to R attributes (which can be arbitrary R objects), models may also keep some standardized C-level attributes that one can access and modify (but which can only be in JSON format):

```{r}
xgb.attributes(model)
```

(they are empty for this model)

... but usually, when it comes to getting something out of a model object, one would want to do so through the built-in utility functions. Some examples:

```{r}
xgb.importance(model)
```

```{r}
xgb.model.dt.tree(model)
```

# Other features

XGBoost supports many additional features on top of its traditional gradient-boosting framework, including, among others:

* Building decision tree models with characteristics such as per-feature monotonicity constraints or interaction constraints.
* Calculating feature contributions in individual predictions (see the short sketch after this list).
* Using custom objectives and custom evaluation metrics.
* Fitting linear models.
* Fitting models on GPUs and/or on data that doesn't fit in RAM ("external memory").
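
As a quick sketch of the feature-contribution item above, per-observation contributions (SHAP values) can be requested through `predict()` on the high-level interface - here reusing the `model` fitted on `ToothGrowth` earlier in this vignette:

```{r}
# per-observation feature contributions for the binary ToothGrowth model;
# the result has one column per feature plus an intercept/bias term
head(predict(model, x, type = "contrib"))
```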

See the [online documentation](https://xgboost.readthedocs.io/en/stable/index.html) - particularly the [tutorials section](https://xgboost.readthedocs.io/en/stable/tutorials/index.html) - for a glimpse of further functionalities that XGBoost offers.

# The low-level interface

In addition to the `xgboost(x, y, ...)` function, XGBoost also provides a lower-level interface for creating model objects through the function `xgb.train()`, which resembles the `xgb.train()` functions in other language bindings of XGBoost.

This `xgb.train()` interface exposes additional functionalities (such as user-supplied callbacks or external-memory data support) and performs fewer data validations and castings compared to the `xgboost()` function interface.

Some key differences between the two interfaces:

* Unlike `xgboost()` which takes R objects such as `matrix` or `data.frame` as inputs, the function `xgb.train()` uses XGBoost's own data container called "DMatrix", which can be created from R objects through the function `xgb.DMatrix()`. Note that there are other "DMatrix" constructors too, such as "xgb.QuantileDMatrix()", which might be more beneficial for some use-cases.
* A "DMatrix" object may contain a mixture of features/covariates, the response variable, observation weights, base margins, among others; and unlike `xgboost()`, requires its inputs to have already been encoded into the representation that XGBoost uses behind the scenes - for example, while `xgboost()` may take a `factor` object as "y", `xgb.DMatrix()` requires instead a binary response variable to be passed as a vector of zeros and ones.
* Hyperparameters are passed as function arguments in `xgboost()`, while they are passed as a named list to `xgb.train()`.
* The `xgb.train()` interface keeps less metadata about its inputs - for example, it will not add levels of factors as column names to estimated probabilities when calling `predict`.

Example usage of `xgb.train()`:

```{r}
data("agaricus.train")
dmatrix <- xgb.DMatrix(
data = agaricus.train$data, # a sparse CSC matrix ('dgCMatrix')
label = agaricus.train$label # zeros and ones
)
booster <- xgb.train(
data = dmatrix,
nrounds = 10,
params = list(
objective = "binary:logistic",
nthread = 1,
max_depth = 3
)
)
data("agaricus.test")
dmatrix_test <- xgb.DMatrix(agaricus.test$data)
pred_prob <- predict(booster, dmatrix_test)
pred_raw <- predict(booster, dmatrix_test, outputmargin = TRUE)
```

Model objects produced by `xgb.train()` have class `xgb.Booster`, while model objects produced by `xgboost()` have class `xgboost`, which is a subclass of `xgb.Booster`. Their `predict` methods also take different arguments - for example, `predict.xgboost` has a `type` parameter, while `predict.xgb.Booster` controls this through binary arguments - but as `xgboost` is a subclass of `xgb.Booster`, methods for `xgb.Booster` can be called on `xgboost` objects if needed.
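
A quick way to see this relationship, reusing the objects created above (just a sanity check, not required in normal usage):

```{r}
# 'model' was created through xgboost(), 'booster' through xgb.train()
class(model)
class(booster)
inherits(model, "xgb.Booster")  # the 'xgboost' class inherits from 'xgb.Booster'
```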

Utility functions in the XGBoost R package will work with both model classes - for example:

```{r}
xgb.importance(model)
xgb.importance(booster)
```

While `xgboost()` aims to provide a user-friendly interface, there are still many situations where one should prefer the `xgb.train()` interface - for example:

* For latency-sensitive applications (e.g. when serving models in real time), `xgb.train()` will have a speed advantage, as it performs fewer validations, conversions, and post-processing steps on metadata.
* If you are developing an R package that depends on XGBoost, `xgb.train()` will provide a more stable interface (less subject to changes) and will have lower time/memory overhead.
* If you need functionalities that are not exposed by the `xgboost()` interface - for example, if your dataset does not fit into the computer's RAM, it's still possible to construct a DMatrix from it if the data is loaded in batches through `xgb.ExtMemDMatrix()`.
475 changes: 0 additions & 475 deletions doc/R-package/discoverYourData.md

This file was deleted.

3 changes: 1 addition & 2 deletions doc/R-package/index.rst
@@ -23,8 +23,7 @@ Tutorials
:maxdepth: 2
:titlesonly:

Introduction to XGBoost in R <xgboostPresentation>
Understanding your dataset with XGBoost <discoverYourData>
R Introductory Vignette <xgboost_introduction>

************
Other topics
589 changes: 0 additions & 589 deletions doc/R-package/xgboostPresentation.md

This file was deleted.

1,012 changes: 1,012 additions & 0 deletions doc/R-package/xgboost_introduction.md
