Distributional regression trees and random forests
COBIPE 2024
COBIPE is the 1st Binational Brazil-Colombia Colloquium on Probability and Statistics. This event represents an incredible opportunity for the exchange of knowledge and experiences among professionals, researchers, and students dedicated to the field of probability and statistics. For more information about the colloquium, you can visit the following link:
https://www.cobipe.com/o-evento
QR to find this presentation
Scan the QR code to find this presentation.
Trees
We all know a traditional tree with or without leaves, but the concepts of regression/classification trees are based on inverted trees without leaves, as shown in the next figure.
Decision trees
Below we can find an example of a decision tree with the main components.
In the next figure we can observe an example of a decision tree that classifies a person into one of two categories: fit or unfit.
Homework: use the next tree to classify yourself as fit or unfit.
Types of trees
There are two basic types of trees; an illustration can be found in the next figure.
The construction of a regression tree involves the following steps:
- Feature Selection: At each node of the tree, a feature and a split point are selected to best divide the data into two subsets.
- Recursive Partitioning: The selected feature and split point are used to partition the data into two subsets. This process is repeated recursively for each subset until a stopping criterion is met (e.g., maximum tree depth or minimum number of samples in each leaf).
- Leaf Prediction: Once the tree structure is determined, a prediction value is assigned to each leaf node. This is typically done by computing the mean or median of a numeric target variable within the leaf, or the most frequent category of a qualitative target variable.
The above steps were taken from a lecture available at this URL.
Regression tree
A regression tree consists of asking questions like “x_k < c?” for each of the covariates. In this way, the covariate space is divided into hyper-rectangles, and all observations within a hyper-rectangle will have the same estimated value \hat{y}.
In the following figure, the regression tree is illustrated on the left side, and the partitioning of the space is shown on the right side. The partitioning of the space is done recursively to find the variables and the cutoff values c in such a way that the cost function \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 is minimized.
The steps to perform the partitioning of the space are:
- Given a set of covariates (features), find the covariate that best predicts the response variable.
- Find the cutoff point c on that covariate that best predicts the response variable (a minimal sketch of this search is shown after this list).
- Repeat the above steps until the stopping criterion is reached.
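To make the cutoff search more concrete, below is a minimal sketch; it is not the internal algorithm of rpart or any other package, and the function best_split and the simulated x and y are purely illustrative. For a single numeric covariate, it evaluates every midpoint between consecutive observed values and keeps the cutoff c that minimizes the cost function \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, where \hat{y}_i is the mean of the observations on each side of the split.

# Illustrative search of the best cutoff c for one numeric covariate,
# predicting each side of the split by its own mean.
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (cuts[-1] + cuts[-length(cuts)]) / 2  # midpoints between observed values
  sse <- sapply(cuts, function(ct) {
    left  <- y[x <  ct]
    right <- y[x >= ct]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutoff=cuts[which.min(sse)], sse=min(sse))
}

# Small simulated example: the true change point of the mean is at x = 0.5
set.seed(123)
x <- runif(100)
y <- ifelse(x < 0.5, 10, 20) + rnorm(100)
best_split(x, y)

Repeating this search over all covariates at every node, and keeping the covariate/cutoff pair with the smallest cost, gives the recursive partitioning described above.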
Some advantages of regression trees are:
- Easy to understand and interpret.
- Requires little data preparation.
- Covariates can be qualitative or quantitative.
- Does not require distributional assumptions.
For more detailed explanations on tree-based techniques, we recommend consulting Chapter 8 of James et al. (2013), and also watching the following video with a simple explanation of trees.
R packages
The best-known R packages for creating regression trees are:
There are other R packages that the reader can consult in the Recursive Partitioning section of the CRAN Task View: Machine Learning & Statistical Learning.
rpart package
The rpart package of Therneau and Atkinson (2023) is one of the packages for creating regression trees. The main function is rpart, and below we can find its usual arguments.
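rpart(formula, data, weights, subset, na.action = na.rpart, method,
      model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)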
Example
In this example we want to explain the response variable Y using the covariates X_1, X_2, \ldots, X_{11}. The dataset is available in exercise 9.5 of Montgomery, Peck and Vining (2003). The MPV package of Braun and MacQueen (2023) contains the book's datasets.
Below we can see some rows of the dataset, which has 32 rows and 12 columns.
Remember: x11 is the type of transmission (1=automatic, 0=manual).
library(dplyr)
library(rpart)
library(rpart.plot)
library(MPV) # This package contains the dataset
table.b3[22:26, ] # Can you see the missing values?
y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11
22 21.47 360.0 180 290 8.4 2.45 2 3 214.2 76.3 4250 1
23 16.59 400.0 185 NA 7.6 3.08 4 3 196.0 73.0 3850 1
24 31.90 96.9 75 83 9.0 4.30 2 5 165.2 61.8 2275 0
25 29.40 140.0 86 NA 8.0 2.92 2 4 176.4 65.4 2150 0
26 13.27 460.0 223 366 8.0 3.00 4 3 228.0 79.8 5430 1
Before fitting the tree, we need to remove the rows with NA's.
table.b3 |> na.omit() -> dados
dados |> dim()
[1] 30 12
To create the regression tree we can use the next code.
mod1 <- rpart(y ~ ., data=dados)
We can plot the tree using the prp function from the rpart.plot package of Milborrow (2024).
prp(mod1)
Since the tree has only two covariates, we can represent it in two dimensions as follows:
with(dados, plot(x=x2, y=x9))
abline(h=66, lty="dashed", col="blue")
segments(x0=144, y0=66, x1=144, y1=82, lty="dashed", col="blue")
text(x=120, y=63, labels="y^=28", col=4)
text(x=90, y=73, labels="y^=20", col=4)
text(x=190, y=73, labels="y^=16", col=4)
Since the previous tree only includes the variables x_2 and x_9, it is recommended to reconstruct the tree using only these variables.
mod1 <- rpart(y ~ x2 + x9, data=dados)
Using the information above we can predict Y, for example:
- If a new car has x_9=70 and x_2=100, then \hat{y}=20.
- If a new car has x_9=60 and x_2=150, then \hat{y}=28.
We can obtain predictions using the S3 function predict.
novos_dados <- data.frame(x2=c(100, 150),
                          x9=c(70, 60))
predict(object=mod1, newdata=novos_dados)
1 2
19.66875 28.06625
In this example, the original data was used as both the training and testing set because there are only 30 observations.
The closer the \hat{y} values are to the observed y values, the better the model can be considered. Below is the correlation between \hat{y} and y.
y_hat <- predict(object=mod1, newdata=dados)
cor(y_hat, dados$y)
[1] 0.8300304
Below is a scatter plot between \hat{y} and y.
plot(x=dados$y, y=y_hat, pch=20,
las=1, xlab="y", ylab=expression(hat(y)))
abline(a=0, b=1, lty="dashed", col="blue")
Classification tree
We are not going to cover this topic here; if you are interested, you can find a short explanation (in Spanish) at this link.
Random forests
Random Forest was proposed by Ho (1995) and involves creating many trees that are then used to predict the variable of interest. Below is an illustration of the technique.
Simple Explanation of Random Forests
We start with a training set that has n observations, the variable of interest Y, and the predictor variables X_1, X_2, \ldots, X_p. Then the following steps are applied.
- A new training set of the same size as the original is constructed using the Bootstrap technique. This is done by generating a sample with replacement, allowing some observations to appear multiple times while others may not appear at all.
- A tree (either regression or classification) is built using, at each split, a random subset of k predictor variables from the available X_1, X_2, \ldots, X_p.
- The previous steps are repeated B times, typically with B=500 or B=1000. In this way, many trees are created, which can then be used to make predictions of Y.
If we want to predict the variable Y for a case where the information (x_1, x_2, \ldots, x_p)^\top is available, each of the B trees created is used to predict the variable Y, thus yielding the predictions \hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_B. Then, using these B predictions, a unified prediction can be obtained depending on whether the problem is classification or regression. Below is an illustrative figure of how the B predictions are unified.
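As an illustration of these steps, below is a minimal hand-made sketch of the bagging idea: B = 500 rpart trees are fitted to bootstrap samples of the dados data frame from the rpart example above (assumed to be still available), and the B predictions are averaged. Note that, unlike a true random forest, this sketch does not draw a random subset of k covariates at each split; the names arboles and pred_forest are purely illustrative.

library(rpart)

# Fit B trees, each one on a bootstrap sample of the data
B <- 500
set.seed(123)
arboles <- lapply(1:B, function(b) {
  idx <- sample(1:nrow(dados), size=nrow(dados), replace=TRUE)
  rpart(y ~ ., data=dados[idx, ])
})

# Unified prediction: average of the B individual predictions
pred_forest <- function(newdata) {
  rowMeans(sapply(arboles, predict, newdata=newdata))
}

# Correlation between the averaged predictions and the observed y
cor(pred_forest(dados), dados$y)

In practice one would use a dedicated implementation such as the randomForest package instead of this hand-made version.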
For more detailed explanations on random forests, we recommend consulting Chapter 8 of James et al. (2021), and also watching the following video with a simple explanation of the technique.
Distributional regression trees
Distributional regression trees were proposed in Schlosser et al. (2019) and Schlosser (2020). This new approach unifies two modeling approaches:
- Data modeling.
- Algorithmic modeling.
Below is a comparison between a regression tree and a distributional regression tree.
Below you can find a video in which Achim Zeileis explains distributional trees and forests at the ViennaR Meetup.
disttree package
disttree proposed by Schlosser et al. (2021) is an R package for fitting distributional regression trees and forests based on maximum likelihood estimation of parameters for specific distribution families. To install the package, you can use the following code:
install.packages("disttree", repos="http://R-Forge.R-project.org")
To create a distributional regression tree, you use the disttree
function, which has the following structure:
disttree(formula, data, subset, na.action = na.pass, weights, offset,
family = NO(), control = disttree_control(...),
cluster, converged = NULL, scores = NULL, doFit = TRUE, ...)
The family argument is used to indicate the distribution of the response variable Y. The user can select any distribution from the gamlss package (Stasinopoulos and Rigby 2024).
There are other packages that contain additional distributions that can be selected through the family argument; those packages are:
- RelDists (Hernandez et al. 2022).
- DiscreteDists (Hernandez-Barajas et al. 2024).
- RealDists (Hernandez-Barajas and Marmolejo-Ramos 2024).
Example
This example is based on Figure 1.2 from Lisa Schlosser's thesis (Schlosser 2020). The response variable Y follows a normal distribution whose parameters depend on the auxiliary variable X.
\begin{equation*} Y \sim \begin{cases} N(\mu=5, \sigma=1) \, \text{if $x < 0.4$,} \\ N(\mu=12, \sigma=2) \, \text{if $0.4 \leq x \leq 0.8$,} \\ N(\mu=0, \sigma=0.5) \, \text{if $x > 0.8$}. \end{cases} \end{equation*}
The next code generates 500 observations from the model.
n <- 500
set.seed(12345)
{
  x <- runif(n=n)
  y <- numeric(n)

  cond1 <- x < 0.4
  cond2 <- x >= 0.4 & x <= 0.8
  cond3 <- x > 0.8

  y[cond1] <- rnorm(n=sum(cond1), mean=5, sd=1)
  y[cond2] <- rnorm(n=sum(cond2), mean=12, sd=2)
  y[cond3] <- rnorm(n=sum(cond3), mean=0, sd=0.5)
}
data <- data.frame(y=y, x=x)
plot(x=x, y=y, ylim=c(-5, 20))
We can fit the distributional tree with family=NO.
library(disttree)
mod <- disttree(y ~ x, data=data, family=NO)
plot(mod)
What are the estimated parameters of Y when x=0.35, x=0.47 and x=0.89?
new_data <- data.frame(x=c(0.35, 0.47, 0.89))
predictions <- predict(mod, newdata=new_data)
predictions
mu sigma
1 5.06589704 0.9170597
2 12.03236756 1.9684436
3 0.02094238 0.5654369
Example
This example is similar to the previous one, except that here the response variable Y has a Flexible Weibull Extension (FWE) distribution, and we will construct three distributional trees with the NO, FWE, and WEI families. The objective is to see whether disttree identifies which distributional tree is the most appropriate.
The model from which we will generate the data is the following:
\begin{equation*} Y \sim \begin{cases} FWE(\mu=0.7, \sigma=1) \, \text{if $x < 0.4$,} \\ FWE(\mu=0.4, \sigma=2) \, \text{if $0.4 \leq x \leq 0.8$,} \\ FWE(\mu=0.5, \sigma=0.5) \, \text{if $x > 0.8$}. \end{cases} \end{equation*}
The code to simulate the data is as follows:
library(RelDists) # To use the FWE distribution

n <- 1500
set.seed(12378)
{
  x <- runif(n=n)
  y <- numeric(n)

  cond1 <- x < 0.4
  cond2 <- x >= 0.4 & x <= 0.8
  cond3 <- x > 0.8

  y[cond1] <- rFWE(n=sum(cond1), mu=0.7, sigma=1)
  y[cond2] <- rFWE(n=sum(cond2), mu=0.4, sigma=2)
  y[cond3] <- rFWE(n=sum(cond3), mu=0.5, sigma=0.5)
}
datos <- data.frame(y=y, x=x)

plot(x=x, y=y)
Now let’s train the three models:
library(disttree)
mod1 <- disttree(y ~ x, data=datos, family=NO)
mod1
Distributional regression tree (Normal Distribution)
Model formula:
y ~ x
Fitted party:
[1] root
| [2] x <= 0.796
| | [3] x <= 0.39609: n = 580
| | mu sigma
| | 1.1054738 0.5975482
| | [4] x > 0.39609: n = 597
| | mu sigma
| | 1.992311 1.070787
| [5] x > 0.796: n = 323
| mu sigma
| 0.9873984 0.7654727
Number of inner nodes: 2
Number of terminal nodes: 3
Number of parameters per node: 2
Objective function (negative log-likelihood): 1784.259
mod2 <- disttree(y ~ x, data=datos, family=FWE)
mod2
Distributional regression tree (Flexible Weibull Extension Distribution)
Model formula:
y ~ x
Fitted party:
[1] root
| [2] x <= 0.7967
| | [3] x <= 0.3891: n = 570
| | mu sigma
| | 0.7022919 1.0537549
| | [4] x > 0.3891: n = 608
| | mu sigma
| | 0.3946878 1.9770320
| [5] x > 0.7967: n = 322
| mu sigma
| 0.5131012 0.4805061
Number of inner nodes: 2
Number of terminal nodes: 3
Number of parameters per node: 2
Objective function (negative log-likelihood): 1569.568
mod3 <- disttree(y ~ x, data=datos, family=WEI)
mod3
Distributional regression tree (Weibull Distribution)
Model formula:
y ~ x
Fitted party:
[1] root
| [2] x <= 0.796
| | [3] x <= 0.3891: n = 570
| | mu sigma
| | 1.245303 1.956514
| | [4] x > 0.3891: n = 607
| | mu sigma
| | 2.247237 1.983420
| [5] x > 0.796: n = 323
| mu sigma
| 1.075082 1.316364
Number of inner nodes: 2
Number of terminal nodes: 3
Number of parameters per node: 2
Objective function (negative log-likelihood): 1614.687
In the summary of each model, we can see at the end the value of the negative log-likelihood; the most appropriate model is the one with the smallest value. Comparing these values, we see that mod2, the model that assumes FWE as the distribution for Y, has the best metric.
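One way to collect the three criteria in a single table is the following sketch; it assumes that the usual logLik() generic can be applied to the fitted disttree objects, which is not shown in the output above.

# Negative log-likelihood of each fitted tree (smaller is better);
# applying logLik() to disttree objects is assumed here.
data.frame(family = c("NO", "FWE", "WEI"),
           neg_loglik = -c(logLik(mod1), logLik(mod2), logLik(mod3)))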
plot(mod2)
The results of the previous tree are consistent with the model for the simulated data.
\begin{equation*} Y \sim \begin{cases} FWE(\mu=0.7, \sigma=1) \, \text{if $x < 0.4$,} \\ FWE(\mu=0.4, \sigma=2) \, \text{if $0.4 \leq x \leq 0.8$,} \\ FWE(\mu=0.5, \sigma=0.5) \, \text{if $x > 0.8$}. \end{cases} \end{equation*}
Distributional random forest
Example
This example is based on the lecture of Zeileis (2018) and uses simulated data with the following structure:
\begin{align*} y &\sim N(\mu(x), \sigma(x)), \\ x &\sim U(-0.4, 1), \\ \mu(x) &= 10 \exp(-(4x-2)^{2k}), \\ \sigma(x) &= 0.5 + 2|x|, \\ k &= 2. \end{align*}
The objective is to simulate n observations from the model, fit a regression tree, a regression random forest, and a gamlss model, and then compare the models using the MSE.
# Training data
n <- 300
k <- 2
x <- runif(n=n, min=-0.4, max=1)
media <- 10 * exp(-(4*x-2)^(2*k))
desvi <- 0.5 + 2 * abs(x)
y <- rnorm(n=n, mean=media, sd=desvi)
dados <- data.frame(y, x)

# Testing data
x <- seq(from=-0.4, to=1, length.out=50)
y <- 10 * exp(-(4*x-2)^(2*k))
testingdata <- data.frame(y, x)
Creating the MSE function to compare the models.
# MSE function
mse <- function(y_true, y_hat) mean((y_true-y_hat)^2)
Plotting the training data set.
with(dados, plot(x=x, y=y, pch=19, cex=0.5))
Fitting the models.
library(disttree)
# Regression tree
mod1 <- disttree(y ~ x, data=dados, family=NO)

# Regression random forest with 50 trees
mod2 <- distforest(y ~ x, data=dados, family=NO, ntree=50)

# gamlss with cubic splines
library(gamlss)
mod3 <- gamlss(y ~ cs(x), data=dados, family=NO,
               control=gamlss.control(trace=FALSE))
Obtaining predictions using the testing data set.
y_hat_1 <- predict(mod1, newdata=testingdata)$mu
y_hat_2 <- predict(mod2, newdata=testingdata)$mu
y_hat_3 <- predict(mod3, newdata=testingdata, what="mu")
Plotting the training data set with the prediction curves.
with(dados, plot(x=x, y=y, pch=19, cex=0.5))
lines(x=testingdata$x, y=y_hat_1, col="tomato", lwd=3)
lines(x=testingdata$x, y=y_hat_2, col="chartreuse3", lwd=3)
lines(x=testingdata$x, y=y_hat_3, col="blue3", lwd=3)
legend("topleft",
legend=c("Reg. tree", "Reg. random forest", "gamlss"),
lwd=3, col=c("tomato", "chartreuse3", "blue3"))
Comparing the fitted models using the MSE.
mse(testingdata$y, y_hat_1)
[1] 1.795599
mse(testingdata$y, y_hat_2)
[1] 0.2872399
mse(testingdata$y, y_hat_3)
[1] 1.252395