Bootstrap and Jackknife

Simulation

Freddy Hernández-Barajas

Bootstrap

Origin

The bootstrap was introduced by Efron (1979).

http://statweb.stanford.edu/~ckirby/brad/

Origin

Bootstrap methods are a class of nonparametric Monte Carlo methods that estimate the distribution of a population by resampling.

Key ideas

Treat the sample as if it were the population

What it is good for:

  • Calculating standard errors
  • Forming confidence intervals
  • Performing hypothesis tests
  • Improving predictors
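For instance, the generic resampling loop behind a bootstrap standard error takes only a few lines. A minimal sketch, here applied to the sample median (the seed, sample and variable names are illustrative and not part of the examples that follow):

set.seed(2024)                      # illustrative seed
x <- rnorm(n=30, mean=0, sd=1)      # an observed sample
B <- 1000                           # number of bootstrap replicates
bootMedian <- rep(x=NA, times=B)
for (b in 1:B) {
  # resample the observed data with replacement and recompute the statistic
  bootMedian[b] <- median(sample(x, replace=TRUE))
}
sd(bootMedian)                      # bootstrap estimate of the standard error of the median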

The “Central Dogma” of statistics

The bootstrap

Interesting paper

Read the post available at this URL: https://garstats.wordpress.com/2016/05/27/the-percentile-bootstrap/

Example 1

Comparing the distribution of \(\bar{X}\):

  1. using one sample with \(n=30\) observations from \(N(0,1)\).
  2. from the population \(N(0,1)\).

Example 1

set.seed(333)
x <- rnorm(n=30, mean=0, sd=1)  # The sample
x[1:4]
[1] -0.08281164  1.93468099 -2.05128979  0.27773897
N <- 1000  # number of replicates
sampledMean <- rep(x=NA, times=N) # Central dogma: means of fresh N(0,1) samples
bootMean    <- rep(x=NA, times=N) # Bootstrap: means of resamples of x

for(i in 1:N) {
  sampledMean[i] <- mean(rnorm(n=30, mean=0, sd=1))
  bootMean[i]    <- mean(sample(x, replace=TRUE))
}
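To compare the two distributions visually we can overlay their estimated densities; a minimal sketch (labels and colours are illustrative):

plot(density(sampledMean), lwd=2, xlab=expression(bar(X)),
     main="Sampling vs. bootstrap distribution of the mean")
lines(density(bootMean), lwd=2, col="blue")
legend("topright", legend=c("Sampling", "Bootstrap"),
       col=c("black", "blue"), lwd=2)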


Example 1

mean(bootMean)
[1] -0.01692824
mean(sampledMean)
[1] -0.005497356
sd(bootMean)
[1] 0.1878562
sd(sampledMean)
[1] 0.1814172

The theoretical mean of \(\bar{X}\) is 0 and its theoretical standard deviation is \(1/\sqrt{30} \approx 0.1825742\).

Example 2

Looking at the mpg variable, we want to estimate the proportion of cars with fuel efficiency between 14 and 21 mpg.

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
dim(mtcars)
[1] 32 11

Example 2

Looking at the mpg variable, we want to estimate the proportion of cars with fuel efficiency between 14 and 21 mpg.

We define the binary operator %entre% (Spanish for "between") to check whether a value x lies strictly between y[1] and y[2].

'%entre%' <- function(x, y) x > y[1] & x < y[2] 

# Some tests
8 %entre% c(5, 9)
[1] TRUE
8 %entre% c(3, 7)
[1] FALSE
8 %entre% c(1, 5)
[1] FALSE
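Note that both comparisons in %entre% are strict, so values equal to an endpoint are excluded; this matters here because two cars in mtcars have mpg exactly equal to 21.0. A quick check:

21 %entre% c(14, 21)  # FALSE: the upper endpoint itself is excluded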

Example 2

Looking at the mpg variable, we want to estimate the proportion of cars with fuel efficiency between 14 and 21 mpg.

x <- mtcars$mpg
N <- 1000
bootProp <- rep(x=NA, times=N)
for(i in 1:N) {
  bootProp[i] <- mean(sample(x, replace=TRUE) %entre% c(14, 21))
}
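The same replicates can be obtained without an explicit loop by using replicate(); a minimal equivalent sketch:

bootProp <- replicate(N, mean(sample(x, replace=TRUE) %entre% c(14, 21)))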

Example 2

What is the proportion of cars with fuel efficiency between 14 and 21 mpg?

Using the observed sample.

mean(x %entre% c(14, 21))
[1] 0.46875

Using the bootstrap.

mean(bootProp)
[1] 0.4708125
sd(bootProp)
[1] 0.08597563
mean(bootProp) + c(-1,1) * 1.96 * sd(bootProp) / sqrt(length(x))
[1] 0.4410235 0.5006015
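An alternative is the percentile bootstrap interval discussed in the post linked earlier, which takes the empirical quantiles of the replicates directly; a minimal sketch:

quantile(bootProp, probs=c(0.025, 0.975))  # 95% percentile bootstrap interval

This avoids the normal approximation and uses only the bootstrap distribution itself.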

Jackknife

Origin

The jackknife technique was developed by Maurice Quenouille (1924-1973).

Key ideas

  • The jackknife is a resampling technique especially useful for variance and bias estimation.
  • The jackknife predates other common resampling methods such as the bootstrap.
  • The jackknife estimate of a parameter is found by systematically leaving out each observation in turn, recomputing the estimate on the remaining data, and then averaging these calculations.
  • Given a sample of size \(n\), the jackknife estimate is found by aggregating the estimates from the \(n\) sub-samples of size \(n-1\).

Key ideas

The jackknife is like a “leave-one-out” type of cross-validation.

Let \(x=(x_1, \ldots, x_n)\) be an observed random sample, and define the \(i^{th}\) jackknife sample \(x_{(i)}\) to be the subset of \(x\) that leaves out the \(i^{th}\) observation \(x_i\). That is,

\[x_{(i)}=(x_1, \ldots, x_{i-1},x_{i+1}, \ldots, x_n)\]

Key ideas

If \(\hat{\theta}=T_n(x)\), define the \(i^{th}\) jackknife replicate \(\hat{\theta}_{(i)}=T_n(x_{(i)})\), \(i=1,2, \ldots, n\).

  • Suppose the parameter \(\theta = t(F)\) is a function of the distribution \(F\).
  • Let \(F_n\) be the ecdf of a random sample from the distribution \(F\).
  • The estimate of \(\theta\) is \(\hat{\theta} = t(F_n)\).

The Jackknife estimate of bias

If \(\hat{\theta}\) is a smooth statistic, then \(\hat{\theta}_{(i)}=t(F_{n-1}(x_{(i)}))\) and the jackknife estimate of bias is

\[\widehat{bias}_{jack}=(n-1)(\overline{\hat{\theta}_{(\cdot)}} - \hat{\theta}),\]

where \(\overline{\hat{\theta}_{(\cdot)}}=\frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}\) is the mean of the estimates from the leave-one-out samples, and \(\hat{\theta}\) is the estimate computed from the original observed sample.

The Jackknife estimate of standard error

A jackknife estimate of standard error is

\[\widehat{se}_{jack}=\sqrt{\frac{n-1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(i)} - \overline{\hat{\theta}_{(\cdot)}} \right)^2 }\] for a smooth statistic \(\hat{\theta}\).
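A minimal R sketch of both formulas, using the sample mean as the (smooth) statistic; the data and the variable names are illustrative:

x <- mtcars$mpg                                     # any observed sample
n <- length(x)
theta_hat <- mean(x)                                # estimate from the full sample
theta_i   <- sapply(1:n, function(i) mean(x[-i]))   # leave-one-out replicates
bias_jack <- (n - 1) * (mean(theta_i) - theta_hat)  # jackknife bias estimate
se_jack   <- sqrt((n - 1) / n * sum((theta_i - mean(theta_i))^2))  # jackknife SE

For the mean the jackknife bias estimate is exactly zero, and se_jack reduces to the usual \(s/\sqrt{n}\).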

Example for the population mean

If the parameter to be estimated is the population mean of \(X\) by using the observed random sample \(x=(x_1, \ldots, x_n)\), we compute the mean \(\bar{x}_{(i)}\) without the \(i\)-th data point:

\[ \bar{x}_{(i)}=\frac{1}{n-1}\sum_{j=1, j\neq i}^{n}x_{j},\quad \quad i=1,\dots ,n.\]

Example for the population mean

These \(n\) estimates serve as a proxy for the sampling distribution of the statistic. In particular, the mean of this distribution is estimated by the average of the \(n\) leave-one-out estimates:

\[ \bar{x}_{(\cdot)}=\frac{1}{n} \sum_{i=1}^{n} \bar{x}_{(i)} \]
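For the sample mean this average coincides with the ordinary sample mean \(\bar{x}\): since \(\bar{x}_{(i)}=\frac{n\bar{x}-x_i}{n-1}\),

\[\bar{x}_{(\cdot)}=\frac{1}{n}\sum_{i=1}^{n}\frac{n\bar{x}-x_i}{n-1}=\frac{n\bar{x}-\bar{x}}{n-1}=\bar{x}.\]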

Example for the population mean

Using the data from example 1.

set.seed(333)
x <- rnorm(30)  # The sample
jackMean <- numeric(length(x))                      # one leave-one-out mean per observation
for (i in 1:length(x)) jackMean[i] <- mean(x[-i])   # i-th jackknife replicate

mean(jackMean)
[1] -0.01942028

Example for the population mean