Distributional regression trees and random forests

COBIPE 2024

COBIPE is the 1st Binational Brazil-Colombia Colloquium on Probability and Statistics. This event represents an incredible opportunity for the exchange of knowledge and experiences among professionals, researchers, and students dedicated to the field of probability and statistics. For more information about the colloquium, you can visit the following link:

https://www.cobipe.com/o-evento

QR to find this presentation

Scan the QR code to find this presentation.

Trees

We all know a traditional tree with or without leaves, but the concepts of regression/classification trees are based on inverted trees without leaves, as shown in the next figure.

Decision trees

Below we can find an example of a decision tree with the main components.

In the next figure we can observe an example of a decision tree that classifies a person into one of two categories: fit or unfit.

Homework: use the next tree to classify yourself as fit or unfit.

Types of trees

There are two basic types of trees; the next figure shows an illustration.

The construction of a regression tree involves the following steps:

  1. Feature Selection: At each node of the tree, a feature and a split point are selected to best divide the data into two subsets.
  2. Recursive Partitioning: The selected feature and split point are used to partition the data into two subsets. This process is repeated recursively for each subset until a stopping criterion is met (e.g., maximum tree depth or minimum number of samples in each leaf).
  3. Leaf Prediction: Once the tree structure is determined, a prediction value is assigned to each leaf node. This is typically done by computing the mean or median of a numeric target variable within the leaf, or the most frequent category of a qualitative target variable.

The above steps were taken from a lecture available at this URL.

Regression tree

A regression tree consists of asking questions like “x_k < c?” for each of the covariates. In this way, the covariate space is divided into hyper-rectangles, and all observations within a hyper-rectangle will have the same estimated value \hat{y}.

In the following figure, the regression tree is illustrated on the left side, and the partitioning of the space is shown on the right side. The partitioning of the space is done repetitively to find the variables and the cutoff values c in such a way that the cost function \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 is minimized.

The steps to perform the partitioning of the space are:

  1. Given a set of covariates (features), find the covariate that best predicts the response variable.
  2. Find the cutoff point c on that covariate that best predicts the response variable.
  3. Repeat the above steps until the stopping criterion is reached.
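As a rough illustration of steps 1 and 2, the following code (a minimal sketch of my own, not taken from the presentation) searches a single covariate for the cutoff c that minimizes the cost \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 when each side of the split is predicted by its own mean.

# Minimal sketch (not from the presentation): exhaustive search for the
# cutoff c on one covariate that minimizes the sum of squared errors,
# predicting each side of the split with its own mean.
best_split <- function(x, y) {
  cutoffs <- head(sort(unique(x)), -1)  # drop the largest value so both sides are non-empty
  sse <- sapply(cutoffs, function(c) {
    left  <- y[x <= c]
    right <- y[x >  c]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutoff = cutoffs[which.min(sse)], sse = min(sse))
}

# Small check with simulated data: the estimated cutoff should be close to 0.5
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.5, 10, 20) + rnorm(100)
best_split(x, y)

In practice, this search is repeated over all covariates, and then recursively within each resulting subset until the stopping criterion is reached.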

Some advantages of regression trees are:

  • Easy to understand and interpret.
  • Requires little data preparation.
  • Covariates can be qualitative or quantitative.
  • Does not require distributional assumptions.

For more detailed explanations on tree-based techniques, we recommend consulting Chapter 8 of James et al. (2013), and also watching the following video with a simple explanation of trees.

R packages

The best-known R packages for creating regression trees are:

There are other R packages that the reader can consult in the Recursive Partitioning section of the CRAN Task View: Machine Learning & Statistical Learning.

rpart package

The rpart package of Therneau and Atkinson (2023) is one of the packages for creating regression trees. Its main function is rpart, and below we can find its usual parameters.

rpart(formula, data, weights, subset, na.action = na.rpart, method,
      model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
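
As an illustration of the control argument (the values below are chosen only as an example and are not from the presentation), an rpart.control() object can be used to limit the growth of the tree:

library(rpart)

# Illustrative values only: minsplit is the minimum number of observations
# needed to attempt a split, maxdepth limits the depth of the tree, and cp
# is the complexity parameter that a split must improve on to be kept.
ctrl <- rpart.control(minsplit = 5, maxdepth = 3, cp = 0.01)
mod  <- rpart(mpg ~ ., data = mtcars, control = ctrl)
mod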

Example

In this example we want to explain the response variable Y using the covariates X_1, X_2, \ldots, X_{11}. The dataset is available in exercise 9.5 of Montgomery, Peck and Vining (2003). The package MPV of Braun and MacQueen (2023) contains the book's datasets.

Below we can see some rows of the dataset, which has 32 rows and 12 columns.

Remember: x11 is the type of transmission (1=automatic, 0=manual).

library(dplyr)
library(rpart)
library(rpart.plot)
library(MPV)      # This package contains the dataset

table.b3[22:26, ] # Can you see the missing values?
       y    x1  x2  x3  x4   x5 x6 x7    x8   x9  x10 x11
22 21.47 360.0 180 290 8.4 2.45  2  3 214.2 76.3 4250   1
23 16.59 400.0 185  NA 7.6 3.08  4  3 196.0 73.0 3850   1
24 31.90  96.9  75  83 9.0 4.30  2  5 165.2 61.8 2275   0
25 29.40 140.0  86  NA 8.0 2.92  2  4 176.4 65.4 2150   0
26 13.27 460.0 223 366 8.0 3.00  4  3 228.0 79.8 5430   1

Before fitting, we need to remove the rows with NAs.

table.b3 |> na.omit() -> dados
dados |> dim()
[1] 30 12

To create the regression tree we can use the following code.

mod1 <- rpart(y ~ ., data=dados)

We can plot the tree using the function prp from the rpart.plot package of Milborrow (2024).

prp(mod1)
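
prp() accepts several arguments to customize the plot; for example (values chosen only for illustration):

# type = 2 places the split labels below the node labels,
# extra = 1 displays the number of observations in each node,
# digits controls the number of significant digits shown.
prp(mod1, type = 2, extra = 1, digits = 3)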

Since the fitted tree uses only two covariates, we can represent it in two dimensions as follows:

with(dados, plot(x=x2, y=x9))
abline(h=66, lty="dashed", col="blue")
segments(x0=144, y0=66, x1=144, y1=82, lty="dashed", col="blue")
text(x=120, y=63, labels="y^=28", col=4)
text(x=90, y=73, labels="y^=20", col=4)
text(x=190, y=73, labels="y^=16", col=4)

Since the previous tree only includes the variables x_2 and x_9, it is recommended to reconstruct the tree using only these variables.

mod1 <- rpart(y ~ x2 + x9, data=dados)

Using the information above we can predict Y, for example:

  1. If a new car has x_9=70 and x_2=100, then \hat{y}=20.
  2. If a new car has x_9=60 and x_2=150, then \hat{y}=28.

We can obtain predictions using the S3 function predict.

novos_dados <- data.frame(x2=c(100, 150), 
                          x9=c(70, 60))
predict(object=mod1, newdata=novos_dados)
       1        2 
19.66875 28.06625 

In this example, the original data was used as both the training and testing set because there are only 30 observations.
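
With a larger dataset, one would normally split the data into training and testing sets; a minimal sketch of such a split (the 70/30 proportion is just an assumption) is shown below.

# Sketch of a train/test split (70/30 chosen arbitrarily); with only 30 rows
# this is not done in the example above.
set.seed(123)
idx      <- sample(nrow(dados), size = round(0.7 * nrow(dados)))
training <- dados[idx, ]
testing  <- dados[-idx, ]

mod_tt <- rpart(y ~ x2 + x9, data = training)
y_hat  <- predict(mod_tt, newdata = testing)
cor(y_hat, testing$y)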

The closer the \hat{y} values are to the observed y values, the better the model can be considered. Below is the correlation between \hat{y} and y.

y_hat <- predict(object=mod1, newdata=dados)
cor(y_hat, dados$y)
[1] 0.8300304

Below is a scatter plot between \hat{y} and y.

plot(x=dados$y, y=y_hat, pch=20, 
     las=1, xlab="y", ylab=expression(hat(y)))
abline(a=0, b=1, lty="dashed", col="blue")

Classification tree

Here we are not going to cover this topic; if you are interested, you can find a short explanation (in Spanish) at this link.

Random forests

Random Forest was proposed by Ho (1995) and involves creating many trees that are then combined to predict the variable of interest. Below is an illustration of the technique.

Simple Explanation of Random Forests

We start with a training set that has n observations, the variable of interest Y, and the predictor variables X_1, X_2, \ldots, X_p. Then the following steps are applied.

  1. A new training set of the same size as the original is constructed using the Bootstrap technique. This is done by generating a sample with replacement, allowing some observations to appear multiple times while others may not appear at all.
  2. A tree (either regression or classification) is built using at each split a subset of k predictor variables from the available X_1, X_2, \ldots, X_p.
  3. The previous steps are repeated B times, typically with B=500 or B=1000. In this way, many trees are created, which can then be used to make predictions of Y.

If we want to predict the variable Y for a case where the information (x_1, x_2, \ldots, x_p)^\top is available, each of the B trees created is used to predict the variable Y, thus yielding the predictions \hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_B. Then, using these B predictions, a unified prediction can be obtained depending on whether the problem is classification or regression. Below is an illustrative figure of how the B predictions are unified.
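
A minimal sketch (my own illustration, not the presentation's code) of this idea for a regression problem is shown below: B trees are grown on bootstrap samples and their B predictions are averaged. For simplicity, the random selection of k covariates at each split is omitted; dedicated packages such as randomForest handle that part.

library(rpart)

# Bagging sketch: B bootstrap samples, one rpart tree per sample,
# and the B predictions averaged to obtain the unified prediction.
set.seed(2024)
B <- 200
trees <- lapply(seq_len(B), function(b) {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]  # bootstrap sample
  rpart(mpg ~ ., data = boot)
})

# Predict a new case with each of the B trees and average the results
new_car <- mtcars[1, -1]               # covariates of the first car, as an example
preds   <- sapply(trees, predict, newdata = new_car)
mean(preds)                            # unified prediction of Y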

For more detailed explanations on random forests, we recommend consulting Chapter 8 of James et al. (2021), and also watching the following video with a simple explanation of random forests.

Distributional regression trees

Distributional regression trees were proposed in Schlosser et al. (2019) and Schlosser (2020). This approach unifies two modeling strategies:

  • Data modeling.
  • Algorithmic modeling.

Below is a comparison between a regression tree and a distributional regression tree.

Below you can find a video in which Achim Zeileis explains distributional trees and random forests at the ViennaR Meetup.

disttree package

disttree proposed by Schlosser et al. (2021) is an R package for fitting distributional regression trees and forests based on maximum likelihood estimation of parameters for specific distribution families. To install the package, you can use the following code:

install.packages("disttree", repos="http://R-Forge.R-project.org")

To create a distributional regression tree, you use the disttree function, which has the following structure:

disttree(formula, data, subset, na.action = na.pass, weights, offset,
         cluster, family = NO(), control = disttree_control(...), 
         converged = NULL, scores = NULL, doFit = TRUE, ...)

The argument family is used to indicate the distribution for the response variable Y. The user can select any distribution from the R package gamlss (Stasinopoulos and Rigby 2024).
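
As an illustration (not from the presentation), any gamlss family object can be passed to family; for example, the gamma family GA from the gamlss.dist package applied to simulated data:

library(disttree)
library(gamlss.dist)   # provides GA (gamma), among many other families

# Illustrative only: a distributional tree assuming a gamma response.
set.seed(1)
d   <- data.frame(x = runif(200))
d$y <- rgamma(n = 200, shape = 2, rate = ifelse(d$x < 0.5, 1, 4))
mod_ga <- disttree(y ~ x, data = d, family = GA())
mod_ga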

There are other packages that contain additional distributions that can be selected in the family argument; those packages are:

  • RelDists of Hernandez et al. (2022).
  • RealDists of Hernandez-Barajas and Marmolejo-Ramos (2024).
  • DiscreteDists of Hernandez-Barajas et al. (2024).

Example

This example is based on Figure 1.2 of Lisa Schlosser's thesis (Schlosser 2020). The response variable Y follows a normal distribution whose parameters depend on the auxiliary variable X.

\begin{equation*} Y \sim \begin{cases} N(\mu=5, \sigma=1) \, \text{if $x < 0.4$,} \\ N(\mu=12, \sigma=2) \, \text{if $0.4 \leq x \leq 0.8$,} \\ N(\mu=0, \sigma=0.5) \, \text{if $x > 0.8$}. \end{cases} \end{equation*}

The next code generates 500 observations from the model.

n <- 500
set.seed(12345) 
{
  x <- runif(n=n)
  y <- numeric(n)
  
  cond1 <- x < 0.4
  cond2 <- x >= 0.4 & x <= 0.8
  cond3 <- x > 0.8
  
  y[cond1] <- rnorm(n=sum(cond1), mean=5, sd=1)
  y[cond2] <- rnorm(n=sum(cond2), mean=12, sd=2)
  y[cond3] <- rnorm(n=sum(cond3), mean=0, sd=0.5)
}

data <- data.frame(y=y, x=x)
plot(x=x, y=y, ylim=c(-5, 20))

We can fit the distributional tree with family=NO.

library(disttree)
mod <- disttree(y ~ x, data=data, family=NO)
plot(mod)

What are the estimated parameters for Y when x=0.35, x=0.47 and x=0.89?

new_data <- data.frame(x=c(0.35, 0.47, 0.89))
predictions <- predict(mod, newdata=new_data)
predictions
           mu     sigma
1  5.06589704 0.9170597
2 12.03236756 1.9684436
3  0.02094238 0.5654369
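
Because the tree estimates the whole distribution in each leaf, the predicted mu and sigma can be used for more than point prediction; for instance, a rough sketch of 95% intervals for Y under the fitted normal distribution (using qNO from gamlss.dist):

library(gamlss.dist)

# Sketch: 95% intervals for Y in each leaf, built from the predicted mu and sigma.
lower <- qNO(0.025, mu = predictions$mu, sigma = predictions$sigma)
upper <- qNO(0.975, mu = predictions$mu, sigma = predictions$sigma)
cbind(new_data, lower, upper)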

Example

This example is similar to the previous one, except that here the response variable Y will have a Flexible Weibull Extension (FWE) distribution, and we will construct three distributional trees with the NO, FWE, and WEI families. The objective is to see whether disttree identifies which distributional tree is the most appropriate.

The model from which we will generate the data is the following:

\begin{equation*} Y \sim \begin{cases} FWE(\mu=0.7, \sigma=1) \, \text{if $x < 0.4$,} \\ FWE(\mu=0.4, \sigma=2) \, \text{if $0.4 \leq x \leq 0.8$,} \\ FWE(\mu=0.5, \sigma=0.5) \, \text{if $x > 0.8$}. \end{cases} \end{equation*}

The code to simulate the data is as follows:

library(RelDists) # To use the FWE distribution

n <- 1500
set.seed(12378)
{
  x <- runif(n=n)
  y <- numeric(n)
  
  cond1 <- x < 0.4
  cond2 <- x >= 0.4 & x <= 0.8
  cond3 <- x > 0.8
  
  y[cond1] <- rFWE(n=sum(cond1), mu=0.7, sigma=1)
  y[cond2] <- rFWE(n=sum(cond2), mu=0.4, sigma=2)
  y[cond3] <- rFWE(n=sum(cond3), mu=0.5, sigma=0.5)
}

datos <- data.frame(y=y, x=x)

plot(x=x, y=y)

Now let’s train the three models:

library(disttree)

mod1 <- disttree(y ~ x, data=datos, family=NO)
mod1
Distributional regression tree (Normal Distribution)

Model formula:
y ~ x

Fitted party:
[1] root
|   [2] x <= 0.796
|   |   [3] x <= 0.39609: n = 580
|   |              mu     sigma 
|   |       1.1054738 0.5975482 
|   |   [4] x > 0.39609: n = 597
|   |             mu    sigma 
|   |       1.992311 1.070787 
|   [5] x > 0.796: n = 323
|              mu     sigma 
|       0.9873984 0.7654727 

Number of inner nodes:    2
Number of terminal nodes: 3
Number of parameters per node: 2
Objective function (negative log-likelihood): 1784.259
mod2 <- disttree(y ~ x, data=datos, family=FWE)
mod2
Distributional regression tree (Flexible Weibull Extension Distribution)

Model formula:
y ~ x

Fitted party:
[1] root
|   [2] x <= 0.7967
|   |   [3] x <= 0.3891: n = 570
|   |              mu     sigma 
|   |       0.7022919 1.0537549 
|   |   [4] x > 0.3891: n = 608
|   |              mu     sigma 
|   |       0.3946878 1.9770320 
|   [5] x > 0.7967: n = 322
|              mu     sigma 
|       0.5131012 0.4805061 

Number of inner nodes:    2
Number of terminal nodes: 3
Number of parameters per node: 2
Objective function (negative log-likelihood): 1569.568
mod3 <- disttree(y ~ x, data=datos, family=WEI)
mod3
Distributional regression tree (Weibull Distribution)

Model formula:
y ~ x

Fitted party:
[1] root
|   [2] x <= 0.796
|   |   [3] x <= 0.3891: n = 570
|   |             mu    sigma 
|   |       1.245303 1.956514 
|   |   [4] x > 0.3891: n = 607
|   |             mu    sigma 
|   |       2.247237 1.983420 
|   [5] x > 0.796: n = 323
|             mu    sigma 
|       1.075082 1.316364 

Number of inner nodes:    2
Number of terminal nodes: 3
Number of parameters per node: 2
Objective function (negative log-likelihood): 1614.687

In the summary of each model, the last line shows the value of the negative log-likelihood; the most appropriate model is the one with the smallest value. Comparing these values, we see that mod2, which assumes the FWE distribution for Y, has the best metric.
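
Since the three trees have the same structure (3 terminal nodes with 2 parameters each, that is, 6 estimated parameters), comparing the negative log-likelihoods directly is reasonable. As a quick check, an AIC-like quantity can also be computed by hand from the values printed above:

# Negative log-likelihoods copied from the printed output above;
# each tree has 3 terminal nodes with 2 parameters each (6 parameters in total).
nll <- c(NO = 1784.259, FWE = 1569.568, WEI = 1614.687)
aic <- 2 * nll + 2 * 6
sort(aic)   # the FWE tree has the smallest value, in agreement with -logLik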

plot(mod2)

The results of the previous tree are consistent with the model for the simulated data.

\begin{equation*} Y \sim \begin{cases} FWE(\mu=0.7, \sigma=1) \, \text{if $x < 0.4$,} \\ FWE(\mu=0.4, \sigma=2) \, \text{if $0.4 \leq x \leq 0.8$,} \\ FWE(\mu=0.5, \sigma=0.5) \, \text{if $x > 0.8$}. \end{cases} \end{equation*}

Distributional random forest

Example

This example is based on the lecture of Zeileis (2018). It uses simulated data with the following structure:

\begin{align*} y &\sim N(\mu(x), \sigma(x)), \\ x &\sim U(-0.4, 1), \\ \mu(x) &= 10 \exp(-(4x-2)^{2k}), \\ \sigma(x) &= 0.5 + 2|x|, \\ k &= 2. \end{align*}

The objective is to simulate n observations from the model and then fit a regression tree, a regression random forest, and a gamlss model. The models are compared using the MSE.

# Training data
n <- 300
k <- 2
x <- runif(n=n, min=-0.4, max=1)
media <- 10 * exp(-(4*x-2)^(2*k))
desvi <- 0.5 + 2 * abs(x)
y <- rnorm(n=n, mean=media, sd=desvi)
dados <- data.frame(y, x)

# Testing data
x <- seq(from=-0.4, to=1, length.out=50)
y <- 10 * exp(-(4*x-2)^(2*k))
testingdata <- data.frame(y, x)

Creating the MSE function used for the comparison.

# MSE function
mse <- function(y_true, y_hat) mean((y_true-y_hat)^2)

Plotting the training data set.

with(dados, plot(x=x, y=y, pch=19, cex=0.5))

Fitting the models.

library(disttree)

# Regression tree
mod1 <- disttree(y ~ x, data=dados, family=NO)

# Regression random forest with 50 trees
mod2 <- distforest(y ~ x, data=dados, family=NO, ntree=50)

# gamlss with cubic splines
library(gamlss)
mod3 <- gamlss(y ~ cs(x), data=dados, family=NO,
               control=gamlss.control(trace=FALSE))

Obtaining predictions using the testing data set.

y_hat_1 <- predict(mod1, newdata=testingdata)$mu
y_hat_2 <- predict(mod2, newdata=testingdata)$mu
y_hat_3 <- predict(mod3, newdata=testingdata, what="mu")

Plotting the training data set with the prediction curves.

with(dados, plot(x=x, y=y, pch=19, cex=0.5))
lines(x=testingdata$x, y=y_hat_1, col="tomato", lwd=3)
lines(x=testingdata$x, y=y_hat_2, col="chartreuse3", lwd=3)
lines(x=testingdata$x, y=y_hat_3, col="blue3", lwd=3)
legend("topleft", 
       legend=c("Reg. tree", "Reg. random forest", "gamlss"),
       lwd=3, col=c("tomato", "chartreuse3", "blue3"))

Comparing the fitted models using the MSE.

mse(testingdata$y, y_hat_1)
[1] 1.795599
mse(testingdata$y, y_hat_2)
[1] 0.2872399
mse(testingdata$y, y_hat_3)
[1] 1.252395

Final comments

References

Braun, W. J., and S. MacQueen. 2023. MPV: Data Sets from Montgomery, Peck and Vining. https://CRAN.R-project.org/package=MPV.
Hernandez, Freddy, Olga Usuga, Carmen Patino, Jaime Mosquera, and Amylkar Urrea. 2022. RelDists: Estimation for Some Reliability Distributions. https://ousuga.github.io/RelDists/.
Hernandez-Barajas, Freddy, and Fernando Marmolejo-Ramos. 2024. RealDists: Real Statistical Distributions. https://github.com/fhernanb/RealDists.
Hernandez-Barajas, Freddy, Fernando Marmolejo-Ramos, Jamiu Olumoh, and Osho Ajayi. 2024. DiscreteDists: Discrete Statistical Distributions.
Ho, Tin Kam. 1995. “Random Decision Forests.” In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. Vol. 1. https://doi.org/10.1109/ICDAR.1995.598994.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning – with Applications in R. Springer Texts in Statistics. New York: Springer.
———. 2021. An Introduction to Statistical Learning – with Applications in R. Second edition. Springer Texts in Statistics. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.
Milborrow, Stephen. 2024. Rpart.plot: Plot Rpart Models: An Enhanced Version of Plot.rpart. http://www.milbo.org/rpart-plot/index.html.
Schlosser, Lisa. 2020. “Distributional Trees and Forests.” PhD thesis, University of Innsbruck.
Schlosser, Lisa, Torsten Hothorn, Reto Stauffer, and Achim Zeileis. 2019. “Distributional Regression Forests for Probabilistic Precipitation Forecasting in Complex Terrain.” The Annals of Applied Statistics 13 (3): 1564–89. https://doi.org/10.1214/19-AOAS1247.
Schlosser, Lisa, Moritz N. Lang, Torsten Hothorn, and Achim Zeileis. 2021. Disttree: Trees and Forests for Distributional Regression. https://R-Forge.R-project.org/projects/partykit/.
Stasinopoulos, Mikis, and Robert Rigby. 2024. Gamlss: Generalized Additive Models for Location Scale and Shape. https://www.gamlss.com/.
Therneau, Terry, and Beth Atkinson. 2023. Rpart: Recursive Partitioning and Regression Trees. https://github.com/bethatkinson/rpart.
Zeileis, Achim. 2018. “Distributional Regression Forests for Probabilistic Modeling and Forecasting.” 2018. https://www.zeileis.org/papers/Psychoco-2018.pdf.