Análisis de Clusters

1 Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

2 Methods

Partitioning methods

\(k\)-means
Others

Hierarchical methods

The nearest neighbour algorithm (simple)
The farest neighbour algorithm (complete)
Average
Median
Ward
Centroid
Others

3 Distances

Euclidean distance

\[ d(p,q) = \sqrt{ \sum_{i=1}^{n} (p_{i} - q_{i})^2}\]

Maximum distance

\[ d(p,q) = max \{ p_{i} - q_{i} \}\]

Manhattan distance

\[ d(p,q) = \sum_{i=1}^{n} |p_{i} - q_{i}|\]

Canberra distance

\[ d(p,q) = \sum_{i=1}^{n} \frac{| p_{i} - q_{i}|} {|p_{i}| + |q_{i}| } \]

Minkowski distance

\[ d(p,q) = \left( \sum_{i=1}^{n} |p_i-q_i|^p \right) ^ {1/p} \]

where

\[ p = (p_{1}, p_{2},\ldots, p_{n}) \ \text{and} \ q = (q_{1}, q_{2},\ldots,q_{n}) \]

The function dist() in R computes and returns the distance matrix.

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

where

x: a numeric matrix, data frame or “dist” object.
method: the distance measure to be used: “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”.
diag: logical value indicating whether the diagonal of the distance matrix should be printed.
upper: logical value indicating whether the upper triangle of the distance matrix should be printed.
p: The power of the Minkowski distance.

3.1 Example

We are going to simulate some data to calculate several distances.

#set seed to make example reproducible
set.seed(12345)
test <- data.frame(x=sample(1:10000, 7), 
                   y=sample(1:10000, 7), 
                   z=sample(1:10000, 7))
test

     x    y    z
1 8243   75 2888
2  720 9254 2819
3 8922 9054  393
4 4922 9994 7872
5  605 5031 7316
6 2264 9473 7786
7 9986 9164 7696

Plotting the data.

require(scatterplot3d)
s3d <- scatterplot3d(test, pch=19, type="h",
                     highlight.3d=TRUE)
s3d.coords <- s3d$xyz.convert(test)
text(s3d.coords$x, s3d.coords$y, 
     labels=row.names(test), cex=1, pos=4)

To obtain all Euclidean distances we can use the dist function.

dist(test, method = "euclidean")

          1         2         3         4         5         6
2 11868.207                                                  
3  9343.902  8555.599                                        
4 11586.883  6613.412  8533.407                              
5 10124.632  6170.086 11544.910  6601.287                    
6 12168.042  5206.053  9957.960  2709.945  4764.929          
7 10429.038 10471.484  7380.922  5134.586 10258.131  7728.704

To obtain all Canberra distances we can use the dist function.

dist(test, method="canberra")

          1         2         3         4         5         6
2 1.8353510                                                  
3 1.7835650 1.6168702                                        
4 1.7005596 1.2558574 1.2431830                              
5 2.2678162 0.8261272 2.0566568 1.1479987                    
6 2.0122111 0.9974846 1.5217242 0.4021415 0.9156323          
7 1.5336519 1.3341961 0.9651415 0.3943126 1.2022240 0.6527605

4 Mahalanobis distance

The Mahalanobis distance was introduced by P. C. Mahalanobis in 1936 and it is a measure of the distance between one point \(\boldsymbol{x}\) and the mean vector \(\boldsymbol{\mu}\).

\[ D^2=(\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}) \]

where \(\Sigma\) is the covariance matrix.

The mahalanobis() function returns the squared Mahalanobis distance of all rows in x and the vector \(\mu\) = center with respect to \(\Sigma\) = cov. The result is not a matrix, is a vector.

mahalanobis(x, center, cov, inverted = FALSE, ...)

where

x: vector or matrix of data with, \(p\) columns.
center: mean vector of the distribution or second data vector of length p or recyclable to that length. If set to FALSE, the centering step is skipped.
cov: covariance matrix (p x p) of the distribution.
inverted: logical. If TRUE, cov is supposed to contain the inverse of the covariance matrix.

The D2.dist() function from biotools package calculates the squared generalized Mahalanobis distance between all pairs of rows in a data frame with respect to a covariance matrix. The result is a matrix with distances.

\[ D_{ij}^2=(\boldsymbol{x}_i-\boldsymbol{x}_j)^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}_i-\boldsymbol{x}_j) \]

D2.dist(data, cov, inverted = FALSE)

data: a data frame or matrix of data (n x p).
cov: a variance-covariance matrix (p x p).
inverted: logical. If FALSE (default), cov is supposed to be a variance-covariance matrix.

4.1 Example

Using the simulated dataset calculate the Mahalanobis distance.

mahalanobis(x=test, cov=var(test), center=colMeans(test))

[1] 4.378182 2.809841 3.507509 1.024344 2.243094 1.029701 3.007329

Now using D2.dist.

require(biotools)
D2.dist(data=test, cov=var(test))

           1          2          3          4          5          6
2  9.8874347                                                       
3  8.1366021  4.3431985                                            
4  8.4215753  5.0134955  5.9790348                                 
5  5.3984321  4.5257245 10.5195799  3.5196836                      
6  8.5089952  3.3145820  7.0212777  0.5269665  1.8321150           
7  8.2942319 10.5844486  6.5528696  1.7096543  7.9061246  4.0039738

Important

Note that mahalanobis() gives a vector and D2.dist() gives a matrix.

5 Similarity coefficients for binary variables

In statistics and related fields, a similarity measure or similarity coefficient is a real-valued function that quantifies the similarity between two objects.

How are defined quantities a, b, c and d?

a: represents the total number of attributes where A and B both have a value of 1.
b: represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
c: represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
d: represents the total number of attributes where A and B both have a value of 0.

5.1 Example

There are two subjects A and B who were measured 10 binary variables (1 presence, 0 absence). Information below.

Calculate the coefficients of similarity Simple matching, Jaccard, Rogers and Tanimoto, and Sorensen-Dice.

To know more about distances consult this paper: http://users.uom.gr/~kouiruki/sung.pdf

6 Gower distance

The Gower distance is a measure of dissimilarity between two objects. It is commonly used in multivariate analysis and is particularly suitable for datasets that include a combination of

quantitative,
nominal,
ordinal.

6.1 FD package

The gowdis() function form FD package measures the Gower distance for mixed variables.

require(FD)
gowdis(x, w, ord)

where

x: matrix or data frame containing the variables. Variables can be numeric, ordered, or factor. Symmetric or asymmetric binary variables should be numeric and only contain 0 and 1. character variables will be converted to factor. NAs are tolerated.
w: vector listing the weights for the variables in x. Can be missing, in which case all variables have equal weights.
ord: character string specifying the method to be used for ordinal variables.

Consult the details with help(gowdis).

6.2 cluster package

The daisy() function form cluster package computes all the pairwise dissimilarities (distances) between observations in the data set.

require(cluster)
daisy(x, metric = c("euclidean", "manhattan", "gower"),
      stand = FALSE, type = list(), weights = rep.int(1, p),
      warnBin = warnType, warnAsym = warnType, warnConst = warnType,
      warnType = TRUE)

where

x: matrix or data frame containing the variables. NAs are tolerated.
metric: character string specifying the metric to be used..
stand: logical flag: if TRUE, then the measurements in x are standardized before calculating the dissimilarities.

Consult the details with help(daisy).

6.3 StatMatch package

The gower.dist() function form cluster package computes the Gower’s distance (dissimilarity) between units in a dataset or between observations in two distinct datasets.

require(StatMatch)
gower.dist(data.x, data.y=data.x, 
           rngs=NULL, KR.corr=TRUE, 
           var.weights = NULL, robcb=NULL)

where

data.x: A matrix or a data frame containing variables that should be used in the computation of the distance.
KR.corr: When TRUE (default) the extension of the Gower’s dissimilarity measure proposed by Kaufman and Rousseeuw (1990) is used. Otherwise, when KR.corr=FALSE, the Gower’s (1971) formula is considered.

Consult the details with help(gower.dist).

6.4 Example

Consider the next dt dataframe with mixed variables. Obtain the Gower’s distances using the functions shown above.

sex  <- c('female', 'male', 'female', 'male', 'female')
sex <- factor(sex)
smoke <- c(1, 0, 0, 1, 0)
age  <- c(25, 26, 35, 25, 25)
degree <- c('Bachelor', 'PhD', 'Master', 'Bachelor', 'PhD')
degree <- factor(degree, levels=c('Bachelor', 'Master', 'PhD'))

dt   <- data.frame(sex, smoke, age, degree)
dt

     sex smoke age   degree
1 female     1  25 Bachelor
2   male     0  26      PhD
3 female     0  35   Master
4   male     1  25 Bachelor
5 female     0  25      PhD

To obtain the Gower distances we can use:

FD::gowdis(x=dt)

      1     2     3     4
2 0.775                  
3 0.750 0.725            
4 0.250 0.525 1.000      
5 0.500 0.275 0.500 0.750

cluster::daisy(x=dt, metric="gower")

Dissimilarities :
      1     2     3     4
2 0.775                  
3 0.750 0.725            
4 0.250 0.525 1.000      
5 0.500 0.275 0.500 0.750

Metric :  mixed ;  Types = N, I, I, N 
Number of objects : 5

StatMatch::gower.dist(data.x=dt)

      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] 0.000 0.775 0.750 0.250 0.500
[2,] 0.775 0.000 0.725 0.525 0.275
[3,] 0.750 0.725 0.000 1.000 0.500
[4,] 0.250 0.525 1.000 0.000 0.750
[5,] 0.500 0.275 0.500 0.750 0.000

What happens if we assign a low importance for sex variable? Use the next weights 1, 6, 5, 4 for sex, smoke, age and degree, respectively.

FD::gowdis(x=dt, w=c(1, 6, 5, 4))

        1       2       3       4
2 0.71875                        
3 0.93750 0.59375                
4 0.06250 0.65625 1.00000        
5 0.62500 0.09375 0.56250 0.68750

7 \(k\)-means

What is the objective?

Given a set of observations \((x_1, x_2, ., x_n)\), where each observation is a \(d\)-dimensional real vector, \(k\)-means clustering aims to partition the \(n\) observations into (\(k \leq n\)) \(S = \{S_1, S_2, ., S_k\}\) so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

\[ \underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x_j \in S_i} \left\| \mathbf x_j - \boldsymbol\mu_i \right\|^2 \]

where \(\mu_i\) is the mean of points in \(S_i\). The clustering optimization problem is solved with the function kmeans in R.

Next we have the structure of the kmeans function.

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", 
                     "Lloyd", 
                     "Forgy",
                     "MacQueen"), 
       trace = FALSE)

7.1 Example 1

The wine dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. The Type variable has been transformed into a categoric variable.

data(wine, package='rattle')
head(wine)

  Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids
1    1   14.23  1.71 2.43       15.6       127    2.80       3.06          0.28
2    1   13.20  1.78 2.14       11.2       100    2.65       2.76          0.26
3    1   13.16  2.36 2.67       18.6       101    2.80       3.24          0.30
4    1   14.37  1.95 2.50       16.8       113    3.85       3.49          0.24
5    1   13.24  2.59 2.87       21.0       118    2.80       2.69          0.39
6    1   14.20  1.76 2.45       15.2       112    3.27       3.39          0.34
  Proanthocyanins Color  Hue Dilution Proline
1            2.29  5.64 1.04     3.92    1065
2            1.28  4.38 1.05     3.40    1050
3            2.81  5.68 1.03     3.17    1185
4            2.18  7.80 0.86     3.45    1480
5            1.82  4.32 1.04     2.93     735
6            1.97  6.75 1.05     2.85    1450

To standarize the numeric variables.

wine_stand <- scale(wine[-1])  # Without the group variable
head(wine_stand)

       Alcohol       Malic        Ash Alcalinity  Magnesium   Phenols
[1,] 1.5143408 -0.56066822  0.2313998 -1.1663032 1.90852151 0.8067217
[2,] 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481
[3,] 0.1963252  0.02117152  1.1062139 -0.2679823 0.08810981 0.8067217
[4,] 1.6867914 -0.34583508  0.4865539 -0.8069748 0.92829983 2.4844372
[5,] 0.2948684  0.22705328  1.8352256  0.4506745 1.27837900 0.8067217
[6,] 1.4773871 -0.51591132  0.3043010 -1.2860793 0.85828399 1.5576991
     Flavanoids Nonflavanoids Proanthocyanins      Color        Hue  Dilution
[1,]  1.0319081    -0.6577078       1.2214385  0.2510088  0.3611585 1.8427215
[2,]  0.7315653    -0.8184106      -0.5431887 -0.2924962  0.4049085 1.1103172
[3,]  1.2121137    -0.4970050       2.1299594  0.2682629  0.3174085 0.7863692
[4,]  1.4623994    -0.9791134       1.0292513  1.1827317 -0.4263410 1.1807407
[5,]  0.6614853     0.2261576       0.4002753 -0.3183774  0.3611585 0.4483365
[6,]  1.3622851    -0.1755994       0.6623487  0.7298108  0.4049085 0.3356589
         Proline
[1,]  1.01015939
[2,]  0.96252635
[3,]  1.39122370
[4,]  2.32800680
[5,] -0.03776747
[6,]  2.23274072

Clustering with \(k=3\).

k.means.fit <- kmeans(x=wine_stand, centers=3)

Exploring the k.means.fit object.

k.means.fit

K-means clustering with 3 clusters of sizes 62, 51, 65

Cluster means:
     Alcohol      Malic        Ash Alcalinity   Magnesium     Phenols
1  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724
2  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548
3 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891
   Flavanoids Nonflavanoids Proanthocyanins      Color        Hue   Dilution
1  0.97506900   -0.56050853      0.57865427  0.1705823  0.4726504  0.7770551
2 -1.21182921    0.72402116     -0.77751312  0.9388902 -1.1615122 -1.2887761
3  0.02075402   -0.03343924      0.05810161 -0.8993770  0.4605046  0.2700025
     Proline
1  1.1220202
2 -0.4059428
3 -0.7517257

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1
 [75] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Within cluster sum of squares by cluster:
[1] 385.6983 326.3537 558.6971
 (between_SS / total_SS =  44.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

To explore the 3 centroids with only two digits.

centroids <- round(k.means.fit$centers, digits=2)
t(centroids)

                    1     2     3
Alcohol          0.83  0.16 -0.92
Malic           -0.30  0.87 -0.39
Ash              0.36  0.19 -0.49
Alcalinity      -0.61  0.52  0.17
Magnesium        0.58 -0.08 -0.49
Phenols          0.88 -0.98 -0.08
Flavanoids       0.98 -1.21  0.02
Nonflavanoids   -0.56  0.72 -0.03
Proanthocyanins  0.58 -0.78  0.06
Color            0.17  0.94 -0.90
Hue              0.47 -1.16  0.46
Dilution         0.78 -1.29  0.27
Proline          1.12 -0.41 -0.75

To explore the clusters.

k.means.fit$cluster

  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1
 [75] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

To explore the sum of squares.

k.means.fit$totss # The total sum of squares.

[1] 2301

k.means.fit$withinss # Vector of within-cluster sum of squares

[1] 385.6983 326.3537 558.6971

k.means.fit$tot.withinss # Total within-cluster sum of squares

[1] 1270.749

k.means.fit$betweenss # The between-cluster sum of squares, totss-tot.withinss.

[1] 1030.251

1030.251 /  2301

[1] 0.4477405

A fundamental question is how to determine the value of the parameter \(k\). If we look at the percentage of variance explained as a function of the number of clusters, one should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”.

wssplot <- function(data, nc=15, seed=1234, ...){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares", ...)
  }

wssplot(wine_stand, nc=6, lwd=3, col="tomato", las=1)

Library clusters allow us to represent (with the aid of PCA) the cluster solution into 2 dimensions:

library(cluster)
clusplot(x=wine_stand, k.means.fit$cluster, main='', 
         color=TRUE, shade=TRUE, labels=2, lines=0)

In this example, we have an additional variable, Type, that was not used in the analysis. To evaluate the clustering performance, we constructed a confusion matrix to compare the clusters with the prior classification based on the Type variable.

table(Actual=wine[,1], Prediction=k.means.fit$cluster)

      Prediction
Actual  1  2  3
     1 59  0  0
     2  3  3 65
     3  0 48  0

8 Hierarchical methods

8.1 The nearest neighbour algorithm (simple)

The distance between two clusters is measured by

\[ d_{AB} = \mathop{min}_{i \in A, j \in B} \left \{ d_{ij} \right \} \]

Each observation is a cluster.
Join the two closest clusters.
Compute the new distance matrix.
Repeat steps 2 and 3 until all observations are in one cluster.

8.2 The farest neighbour algorithm (complete)

The distance between two clusters is measured by

\[ d_{AB} = \mathop{max}_{i \in A, j \in B} \left \{ d_{ij} \right \} \]

Each observation is a cluster.
Join the two closest clusters.
Compute the new distance matrix.
Repeat steps 2 and 3 until all observations are in one cluster.

8.3 Other algortihms

Ward’s: minimum variance criterion minimizes the total within-cluster variance.
Average
Median
Centroid

8.4 dendrogram

A dendrogram (from Greek dendro “tree” and gramma “drawing”) is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.

To customize a dendrogram we recomend the url https://rpubs.com/gaston/dendrograms

8.5 Example

Apply the nearest neighbour algorithm and draw the dendrogram for the next distance matrix.

### hclust function

hclust function computes cluster analysis on a set of dissimilarities and methods for analyzing it.

hclust(d, method = "complete", members = NULL)

d: a dissimilarity structure as produced by dist.
method: the agglomeration method to be used. This should be one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).

8.6 Example

Apply the nearest neighbour algorithm and draw the dendrogram for the next distance matrix.

Dist <- matrix(c(0, 3, 7, 11, 10,
                 3, 0, 6, 10, 9,
                 7, 6, 0, 5, 6, 
                 11, 10, 5, 0, 4,
                 10, 9, 6, 4, 0), ncol=5, byrow=5)
rownames(Dist) <- colnames(Dist) <- paste('Obs', 1:5)
Dist <- as.dist(Dist) # to convert into a dist object

hc <- hclust(d=Dist, method='single')
hc


Call:
hclust(d = Dist, method = "single")

Cluster method   : single 
Number of objects: 5

plot(hc)

8.7 Example

Use the Ward’s algorithm with Euclidean distance to create 3 cluster for wines.

d <- dist(wine_stand, method = "euclidean")
H.fit <- hclust(d, method="ward.D")

plot(H.fit, hang=-1) # display dendrogram
groups <- cutree(H.fit, k=3) # cut tree into 5 clusters
rect.hclust(H.fit, k=3, border="red")

The clustering performance can be evaluated with the aid of a confusion matrix as follows:

table(wine[,1],groups)

8.8 Example

Use the iris data set for clustering into 3 clusters.

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

summary(iris$Species)

    setosa versicolor  virginica 
        50         50         50

Exploring the data.

Use the Ward’s algorithm with Euclidean distance to create 3 clusters.

d <- dist(dist(iris[, 3:4], method = "euclidean"))
iris.fit <- hclust(d, method="ward.D")

plot(iris.fit, hang=-1)
rect.hclust(iris.fit, k = 3, border = "red")

Clusters.

groups <- cutree(iris.fit, k = 3)
groups

  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
  1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2   2   2   2 
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
  2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
 81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
  2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
  3   2   3   3   3   3   2   3   3   3   2   3   3   2   3   3   3   3   3   2 
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
  3   2   3   2   3   3   2   2   3   3   3   3   3   2   3   3   3   3   2   3 
141 142 143 144 145 146 147 148 149 150 
  3   3   2   3   3   3   2   3   3   2

The clustering performance can be evaluated with the aid of a confusion matrix as follows:

table(groups, iris$Species)

      
groups setosa versicolor virginica
     1     50          0         0
     2      0         50        14
     3      0          0        36

Análisis de Clusters

1 Clustering

2 Methods

3 Distances

3.1 Example

4 Mahalanobis distance

4.1 Example

5 Similarity coefficients for binary variables

5.1 Example

6 Gower distance

6.1 FD package

6.2 cluster package

6.3 StatMatch package

6.4 Example

7 \(k\)-means

7.1 Example 1

8 Hierarchical methods

8.1 The nearest neighbour algorithm (simple)

8.2 The farest neighbour algorithm (complete)

8.3 Other algortihms

8.4 dendrogram

8.5 Example

8.6 Example

8.7 Example

8.8 Example

9 Other examples

10 More about clustering