Assessing the assumption of normality

Multivariate Analysis

Freddy Hernández-Barajas

Assessing the assumption of normality

  1. Do the marginal distributions of the variables of \(X\) appear to be normal?
  2. Do the scatter plots of pairs of variables give the elliptical appearance expected from normal populations?
  3. What is the pattern observed in the \(D^2\) values?
  4. What are the results from several normality tests?

Example 1, bivariate normal

Generating one sample

We are going to generate a random sample of \(n=100\) observations from a bivariate normal distribution.
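Because mvrnorm() draws random numbers, the exact values shown below will differ from run to run. Fixing a seed (a suggestion only; the original output was generated without one, and the seed value here is arbitrary) makes the example reproducible:

```r
set.seed(1234)  # illustrative seed, not the one behind the output shown below
```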

require(MASS)

n <- 100
Media <- c(-1, 2)
Sigma <- matrix(c(10, 4, 4, 20), byrow=TRUE, ncol=2)

x <- mvrnorm(n=n, mu=Media, Sigma=Sigma)

The simulated data

head(x, n=13)
            [,1]       [,2]
 [1,] -1.0741271 -6.2308487
 [2,] -2.4661300  0.8106840
 [3,]  0.2183284  6.6744679
 [4,] -5.5522441 -2.0566592
 [5,] -0.4462097  3.9457286
 [6,]  1.7441782 -2.3031246
 [7,]  1.2351988  9.9574407
 [8,] -2.7554396  2.0700468
 [9,] -2.1873608  6.4698140
[10,] -2.7969106  1.7510170
[11,]  6.3789467 16.1355967
[12,]  0.7323506  3.0172191
[13,] -7.8417874  0.9103693

Scatterplots

pairs(x, pch=20)

Empirical bivariate density

f1 <- kde2d(x=x[, 1], y=x[, 2], n=30)
persp(x=f1$x, y=f1$y, z=f1$z, theta=60, phi=20, col='lightblue',
      xlab='x1', ylab='x2', zlab='Density')

Marginal densities

par(mfrow=c(1, 2))
for (i in 1:2) plot(density(x[, i]), lwd=3, col='blue3', main=paste("X", i))

Mahalanobis distance

Next we compute the squared Mahalanobis distance for each observation and explore the first 50 values.
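For reference, each \(D_i^2\) is the squared distance of observation \(\mathbf{x}_i\) from the sample mean vector \(\bar{\mathbf{x}}\), standardized by the sample covariance matrix \(\mathbf{S}\):

\[ D_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}})^\top \mathbf{S}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) \]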

d2 <- mahalanobis(x=x, center=colMeans(x), cov=var(x))
d2[1:50]  # To explore some distances
 [1] 3.39129380 0.37876413 0.83267824 2.84059203 0.13332892 2.76882922
 [7] 2.46643474 0.62195238 1.77851072 0.60374427 9.67284803 0.24288097
[13] 7.01739717 0.74761170 1.22781669 6.42601375 0.71188709 2.57175868
[19] 1.98476521 7.03676032 3.09725662 0.27669797 0.78213571 3.67510427
[25] 1.28436604 2.07657456 0.06721147 0.22965271 3.13841683 0.23892611
[31] 2.09274202 3.31698715 0.72283374 0.20967035 4.21639792 0.08257476
[37] 1.91980070 3.08709476 0.85390245 0.28108069 3.03656235 2.88413520
[43] 1.67783764 1.32440914 0.45278643 1.42526294 1.64212940 3.28646247
[49] 0.93468125 5.50202719

Chi-square plot

p <- ncol(x)
require(car)
qqPlot(x=d2, dist="chisq", df=p)
[1] 11 94
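The rationale for this plot: if \(\mathbf{X}\) follows a \(N_p\) distribution, then approximately

\[ D^2 \sim \chi^2_p, \]

so the ordered \(D_i^2\) should fall close to the chi-square quantile line. The printed indices (here 11 and 94) are the observations that qqPlot labels as most extreme on the plot.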

MVN package

The MVN package provides several multivariate normality tests together with graphical diagnostics.

Multivariate normality tests

  1. Mardia’s test,
  2. Henze-Zirkler’s test,
  3. Royston’s test,
  4. Doornik-Hansen’s test and
  5. E-statistic test
  • \(H_0:\) the sample comes from a \(N_p\).
  • \(H_1:\) the sample does not come from a \(N_p\).
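All five tests share the same mvn() interface used in the slides that follow. As a sketch, assuming x holds the simulated sample from above, they can be run in one pass (the loop is my own convenience, not part of the MVN package):

```r
# Run all five multivariate normality tests on the same data (a sketch).
library(MVN)

tests <- c("mardia", "hz", "royston", "dh", "energy")
results <- lapply(tests, function(tst) mvn(data=x, mvnTest=tst)$multivariateNormality)
names(results) <- tests
results  # one $multivariateNormality table per test
```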

Mardia test

require(MVN)
mvn(data=x, mvnTest="mardia")
$multivariateNormality
             Test          Statistic           p value Result
1 Mardia Skewness   2.08922939632991 0.719351377140726    YES
2 Mardia Kurtosis -0.271717735704995 0.785839063078596    YES
3             MVN               <NA>              <NA>    YES

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling  Column1     0.1693    0.9324    YES   
2 Anderson-Darling  Column2     0.2976    0.5828    YES   

$Descriptives
    n       Mean  Std.Dev     Median        Min       Max       25th     75th
1 100 -0.6843622 2.895154 -0.6736072  -7.841787  6.378947 -2.7658074 1.201993
2 100  2.1899036 4.948071  1.9433470 -10.704631 16.135597 -0.8697526 5.832291
        Skew    Kurtosis
1 0.11563077 -0.36866463
2 0.09604895 -0.07602018

Henze-Zirkler’s test

mvn(data=x, mvnTest="hz")
$multivariateNormality
           Test        HZ   p value MVN
1 Henze-Zirkler 0.3561086 0.8891539 YES

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling  Column1     0.1693    0.9324    YES   
2 Anderson-Darling  Column2     0.2976    0.5828    YES   

$Descriptives
    n       Mean  Std.Dev     Median        Min       Max       25th     75th
1 100 -0.6843622 2.895154 -0.6736072  -7.841787  6.378947 -2.7658074 1.201993
2 100  2.1899036 4.948071  1.9433470 -10.704631 16.135597 -0.8697526 5.832291
        Skew    Kurtosis
1 0.11563077 -0.36866463
2 0.09604895 -0.07602018

Royston’s test

mvn(data=x, mvnTest="royston")
$multivariateNormality
     Test          H   p value MVN
1 Royston 0.01088171 0.9947412 YES

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling  Column1     0.1693    0.9324    YES   
2 Anderson-Darling  Column2     0.2976    0.5828    YES   

$Descriptives
    n       Mean  Std.Dev     Median        Min       Max       25th     75th
1 100 -0.6843622 2.895154 -0.6736072  -7.841787  6.378947 -2.7658074 1.201993
2 100  2.1899036 4.948071  1.9433470 -10.704631 16.135597 -0.8697526 5.832291
        Skew    Kurtosis
1 0.11563077 -0.36866463
2 0.09604895 -0.07602018

Doornik-Hansen’s test

mvn(data=x, mvnTest="dh")
$multivariateNormality
            Test         E df   p value MVN
1 Doornik-Hansen 0.3881136  4 0.9834385 YES

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling  Column1     0.1693    0.9324    YES   
2 Anderson-Darling  Column2     0.2976    0.5828    YES   

$Descriptives
    n       Mean  Std.Dev     Median        Min       Max       25th     75th
1 100 -0.6843622 2.895154 -0.6736072  -7.841787  6.378947 -2.7658074 1.201993
2 100  2.1899036 4.948071  1.9433470 -10.704631 16.135597 -0.8697526 5.832291
        Skew    Kurtosis
1 0.11563077 -0.36866463
2 0.09604895 -0.07602018

Energy test

mvn(data=x, mvnTest="energy")
$multivariateNormality
         Test Statistic p value MVN
1 E-statistic 0.4748378   0.912 YES

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling  Column1     0.1693    0.9324    YES   
2 Anderson-Darling  Column2     0.2976    0.5828    YES   

$Descriptives
    n       Mean  Std.Dev     Median        Min       Max       25th     75th
1 100 -0.6843622 2.895154 -0.6736072  -7.841787  6.378947 -2.7658074 1.201993
2 100  2.1899036 4.948071  1.9433470 -10.704631 16.135597 -0.8697526 5.832291
        Skew    Kurtosis
1 0.11563077 -0.36866463
2 0.09604895 -0.07602018

Conclusion

What can we conclude?

Example 2, mixture of two bivariate normals

The model for a mixture

\[ \mathbf{X} \sim p \, N_2(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + (1-p) \, N_2(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2) \]

Generating one sample from a mixture with \(p=0.6\)

n <- 1000
p <- 0.6
x <- matrix(NA, ncol=2, nrow=n)

Media1 <- c(0, 0)
Sigma1 <- matrix(c(2, 2, 2, 2), byrow=TRUE, ncol=2)

Media2 <- c(0, 0)
Sigma2 <- matrix(c(2, -2, -2, 2), byrow=TRUE, ncol=2)

for (i in 1:n) {
  u <- runif(n=1)
  if (u <= p)
    x[i, ] <- mvrnorm(n=1, mu=Media1, Sigma=Sigma1)
  else 
    x[i, ] <- mvrnorm(n=1, mu=Media2, Sigma=Sigma2)
}
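As an aside, the loop above can be replaced by a vectorized sketch of the same model, drawing the component indicator with rbinom():

```r
# Vectorized alternative to the loop above (a sketch).
# k[i] = 1 selects component 1 for observation i, k[i] = 0 selects component 2.
k <- rbinom(n, size=1, prob=p)
x <- k * mvrnorm(n=n, mu=Media1, Sigma=Sigma1) +
     (1 - k) * mvrnorm(n=n, mu=Media2, Sigma=Sigma2)
```

The element-wise product recycles k down each column, so row i of the result is taken entirely from one component, as in the loop.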

The simulated data

head(x, n=13)
             [,1]        [,2]
 [1,] -0.04786903  0.04786903
 [2,] -0.95473008 -0.95473008
 [3,] -0.15476335  0.15476335
 [4,]  0.40931732  0.40931732
 [5,]  1.06269829  1.06269829
 [6,] -0.27170708 -0.27170708
 [7,]  0.04105247  0.04105247
 [8,]  1.38018770 -1.38018770
 [9,]  1.00784519  1.00784519
[10,] -1.92402870 -1.92402870
[11,]  0.64420849 -0.64420849
[12,] -0.94884616 -0.94884616
[13,]  0.14179830  0.14179830

Scatterplots

pairs(x, pch=21)

Empirical bivariate density

f2 <- kde2d(x=x[, 1], y=x[, 2], n=30)
persp(x=f2$x, y=f2$y, z=f2$z, theta=60, phi=30, col='lightblue',
      xlab='x1', ylab='x2', zlab='Density')

Marginal densities

par(mfrow=c(1, 2))
for (i in 1:2) plot(density(x[, i]), lwd=3, col='blue3', main=paste("X", i))

Mahalanobis distance

d2 <- mahalanobis(x=x, center=colMeans(x), cov=var(x))
d2[1:50]
 [1]  0.002791700  0.811411475  0.030652276  0.145283948  0.990438826
 [6]  0.067019837  0.001233372  2.501040194  0.890488322  3.282377722
[11]  0.546507654  0.801479896  0.016836787  0.493156761  2.998014105
[16]  0.163655750  0.080242837  0.274717446  4.746104832  2.637748291
[21]  0.051762792  0.128223199  0.170947242  0.553148894  2.376350060
[26]  0.191273424  3.841552031  0.001844047  1.440441092  1.622249550
[31]  0.930085738 16.988913719  2.415464228  0.043853709  0.063663213
[36]  0.005958051  0.013315536  2.913530122  0.582271576  0.114249901
[41]  0.135181469  2.173391253  1.067986473  9.598103789  0.822119524
[46]  1.463277002  0.433409414  0.381100495  3.880129768  0.271759184

Chi-square plot

p <- ncol(x)
qqPlot(x=d2, dist="chisq", df=p)
[1] 213 698

Mardia test

mvn(data=x, mvnTest="mardia")
$multivariateNormality
             Test        Statistic           p value Result
1 Mardia Skewness 5.01324758394094 0.285940909391574    YES
2 Mardia Kurtosis 23.4587751604463                 0     NO
3             MVN             <NA>              <NA>     NO

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling  Column1     0.3425    0.4910    YES   
2 Anderson-Darling  Column2     0.2940    0.5996    YES   

$Descriptives
     n        Mean  Std.Dev       Median       Min      Max       25th
1 1000 0.001948238 1.376903 -0.004748540 -4.293856 5.007414 -0.9027623
2 1000 0.005547561 1.376893  0.007787424 -4.922250 5.007414 -0.9069348
       75th          Skew  Kurtosis
1 0.9411775  0.0791648370 0.2450577
2 0.9351713 -0.0002342378 0.2455167

Henze-Zirkler’s test

mvn(data=x, mvnTest="hz")
$multivariateNormality
           Test       HZ p value MVN
1 Henze-Zirkler 46.09091       0  NO

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling  Column1     0.3425    0.4910    YES   
2 Anderson-Darling  Column2     0.2940    0.5996    YES   

$Descriptives
     n        Mean  Std.Dev       Median       Min      Max       25th
1 1000 0.001948238 1.376903 -0.004748540 -4.293856 5.007414 -0.9027623
2 1000 0.005547561 1.376893  0.007787424 -4.922250 5.007414 -0.9069348
       75th          Skew  Kurtosis
1 0.9411775  0.0791648370 0.2450577
2 0.9351713 -0.0002342378 0.2455167

Conclusion

What can we conclude?

Homework

Consider the following random vector \(\mathbf{X}\):

\[ \mathbf{X} \sim p \, N_2(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + (1-p) \, N_2(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2) \] assume that \(\boldsymbol{\mu}_1=(0, 0)^\top\) and \(\boldsymbol{\mu}_2=(d, d)^\top\).

Homework

The covariance matrices are

\[ \boldsymbol{\Sigma}_1= \begin{pmatrix} \sigma_{11} & \sqrt{\sigma_{11} \sigma_{22}} \rho_1 \\ \sqrt{\sigma_{11} \sigma_{22}} \rho_1 & \sigma_{22} \end{pmatrix} \]

\[ \boldsymbol{\Sigma}_2= \begin{pmatrix} \sigma_{11} & \sqrt{\sigma_{11} \sigma_{22}} \rho_2 \\ \sqrt{\sigma_{11} \sigma_{22}} \rho_2 & \sigma_{22} \end{pmatrix} \]

Some parameters are fixed: \(\sigma_{11}=1\), \(\sigma_{22}=1.5\) and \(\rho_1=0.3\).

Homework

The goal is to explore the effects of \(n\), \(d\), \(\rho_2\) and \(p\) on the rejection rate of \(H_0\) with the Mardia Skewness test.

  • \(n = 10, 20\).
  • \(d = 0, 1\).
  • \(\rho_2 = 0.3, 0.5\).
  • \(p = 0.5, 0.7\).

Homework

  • You need to simulate 50 random samples for each combination of \(n\), \(d\), \(\rho_2\) and \(p\) given above.

  • In total, we have \(2\times2\times2\times2=16\) treatments or combinations. In only two of these treatments is the null hypothesis true.

  • Could you identify those treatments?
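As a starting point, the simulation for a single treatment can be sketched as below. The function name, its defaults, and the p-value extraction are illustrative assumptions, not part of the course code; the model and fixed parameters are taken from the statement above.

```r
# Sketch: rejection rate of H0 (Mardia skewness) for ONE treatment.
library(MASS)
library(MVN)

reject_rate <- function(n, d, rho2, p, nsim=50, alpha=0.05) {
  # Fixed parameters from the statement
  s11 <- 1; s22 <- 1.5; rho1 <- 0.3
  Sigma1 <- matrix(c(s11, sqrt(s11*s22)*rho1,
                     sqrt(s11*s22)*rho1, s22), byrow=TRUE, ncol=2)
  Sigma2 <- matrix(c(s11, sqrt(s11*s22)*rho2,
                     sqrt(s11*s22)*rho2, s22), byrow=TRUE, ncol=2)
  mu1 <- c(0, 0); mu2 <- c(d, d)
  rejections <- replicate(nsim, {
    k <- rbinom(n, size=1, prob=p)  # component indicator per observation
    x <- k * mvrnorm(n, mu1, Sigma1) + (1 - k) * mvrnorm(n, mu2, Sigma2)
    res <- mvn(data=x, mvnTest="mardia")$multivariateNormality
    pval <- as.numeric(as.character(res[res$Test == "Mardia Skewness", "p value"]))
    pval < alpha
  })
  mean(rejections)  # proportion of the nsim samples rejecting H0
}

# First cell of the table:
reject_rate(n=10, d=0, rho2=0.3, p=0.5)
```

Calling reject_rate() for each of the 16 combinations fills the table on the next slide.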

Homework

Complete this table.

\(n\) \(d\) \(\rho_2\) \(p\) Rejection rate of \(H_0\)
10 0 0.3 0.5 ?
10 0 0.3 0.7 ?
10 0 0.5 0.5 ?
10 0 0.5 0.7 ?
10 1 0.3 0.5 ?
10 1 0.3 0.7 ?
10 1 0.5 0.5 ?
10 1 0.5 0.7 ?
20 0 0.3 0.5 ?
20 0 0.3 0.7 ?
20 0 0.5 0.5 ?
20 0 0.5 0.7 ?
20 1 0.3 0.5 ?
20 1 0.3 0.7 ?
20 1 0.5 0.5 ?
20 1 0.5 0.7 ?