1 Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
2 Methods
- Partitioning methods
- \(k\)-means
- Others
- Hierarchical methods
- The nearest neighbour algorithm (single)
- The farthest neighbour algorithm (complete)
- Average
- Median
- Ward
- Centroid
- Others
3 Distances
Euclidean distance
\[ d(p,q) = \sqrt{ \sum_{i=1}^{n} (p_{i} - q_{i})^2}\]
Maximum distance
\[ d(p,q) = \max_{i} | p_{i} - q_{i} | \]
Manhattan distance
\[ d(p,q) = \sum_{i=1}^{n} |p_{i} - q_{i}|\]
Canberra distance
\[ d(p,q) = \sum_{i=1}^{n} \frac{| p_{i} - q_{i}|} {|p_{i}| + |q_{i}| } \]
Minkowski distance
\[ d(p,q) = \left( \sum_{i=1}^{n} |p_i-q_i|^{m} \right) ^ {1/m} \]
where \(m \geq 1\) is the power (the argument p of the dist() function), and
\[ p = (p_{1}, p_{2},\ldots, p_{n}) \ \text{and} \ q = (q_{1}, q_{2},\ldots,q_{n}). \]
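Before using R's built-in function, the formulas above can be checked by hand. Below is a minimal sketch with two short hypothetical vectors (values invented for illustration); the Minkowski power is written m to avoid clashing with the vector \(p\).
p <- c(1, 5, 3)  # hypothetical vector
q <- c(4, 1, 3)  # hypothetical vector
sqrt(sum((p - q)^2))                 # Euclidean: 5
max(abs(p - q))                      # maximum: 4
sum(abs(p - q))                      # Manhattan: 7
sum(abs(p - q) / (abs(p) + abs(q)))  # Canberra
m <- 3  # Minkowski power (the argument p of dist())
sum(abs(p - q)^m)^(1/m)              # Minkowski
dist(rbind(p, q), method = "minkowski", p = m)  # agrees with the line above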
The function dist() in R computes and returns the distance matrix.
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
where
x: a numeric matrix, data frame or “dist” object.
method: the distance measure to be used: “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”.
diag: logical value indicating whether the diagonal of the distance matrix should be printed.
upper: logical value indicating whether the upper triangle of the distance matrix should be printed.
p: the power of the Minkowski distance.
3.1 Example
We are going to simulate some data to calculate several distances.
# set seed to make example reproducible
set.seed(12345)
test <- data.frame(x=sample(1:10000, 7),
                   y=sample(1:10000, 7),
                   z=sample(1:10000, 7))
test
x y z
1 8243 75 2888
2 720 9254 2819
3 8922 9054 393
4 4922 9994 7872
5 605 5031 7316
6 2264 9473 7786
7 9986 9164 7696
Plotting the data.
require(scatterplot3d)
s3d <- scatterplot3d(test, pch=19, type="h",
                     highlight.3d=TRUE)
s3d.coords <- s3d$xyz.convert(test)
text(s3d.coords$x, s3d.coords$y,
     labels=row.names(test), cex=1, pos=4)
To obtain all Euclidean distances we can use the dist() function.
dist(test, method = "euclidean")
1 2 3 4 5 6
2 11868.207
3 9343.902 8555.599
4 11586.883 6613.412 8533.407
5 10124.632 6170.086 11544.910 6601.287
6 12168.042 5206.053 9957.960 2709.945 4764.929
7 10429.038 10471.484 7380.922 5134.586 10258.131 7728.704
To obtain all Canberra distances we can use the dist() function.
dist(test, method="canberra")
1 2 3 4 5 6
2 1.8353510
3 1.7835650 1.6168702
4 1.7005596 1.2558574 1.2431830
5 2.2678162 0.8261272 2.0566568 1.1479987
6 2.0122111 0.9974846 1.5217242 0.4021415 0.9156323
7 1.5336519 1.3341961 0.9651415 0.3943126 1.2022240 0.6527605
4 Mahalanobis distance
The Mahalanobis distance was introduced by P. C. Mahalanobis in 1936 and is a measure of the distance between a point \(\boldsymbol{x}\) and the mean vector \(\boldsymbol{\mu}\) of a distribution.
\[ D^2=(\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}) \]
where \(\boldsymbol{\Sigma}\) is the covariance matrix.
The mahalanobis() function returns the squared Mahalanobis distance of all rows in x with respect to the vector \(\mu\) = center and the covariance matrix \(\Sigma\) = cov. The result is a vector, not a matrix.
mahalanobis(x, center, cov, inverted = FALSE, ...)
where
x: vector or matrix of data with \(p\) columns.
center: mean vector of the distribution or second data vector of length p or recyclable to that length. If set to FALSE, the centering step is skipped.
cov: covariance matrix (p x p) of the distribution.
inverted: logical. If TRUE, cov is supposed to contain the inverse of the covariance matrix.
The D2.dist() function from the biotools package calculates the squared generalized Mahalanobis distance between all pairs of rows in a data frame with respect to a covariance matrix. The result is a matrix of distances.
\[ D_{ij}^2=(\boldsymbol{x}_i-\boldsymbol{x}_j)^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}_i-\boldsymbol{x}_j) \]
D2.dist(data, cov, inverted = FALSE)
data: a data frame or matrix of data (n x p).
cov: a variance-covariance matrix (p x p).
inverted: logical. If FALSE (default), cov is supposed to be a variance-covariance matrix.
4.1 Example
Using the simulated dataset, calculate the Mahalanobis distance.
mahalanobis(x=test, cov=var(test), center=colMeans(test))
[1] 4.378182 2.809841 3.507509 1.024344 2.243094 1.029701 3.007329
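We can verify the first value by computing \(D^2\) directly from the definition; a minimal hand check using the test data, var(test) and colMeans(test) from above:
x1 <- as.numeric(test[1, ])
mu <- colMeans(test)
Sinv <- solve(var(test))  # inverse of the covariance matrix
drop(t(x1 - mu) %*% Sinv %*% (x1 - mu))  # should reproduce 4.378182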
Now using D2.dist.
require(biotools)
D2.dist(data=test, cov=var(test))
1 2 3 4 5 6
2 9.8874347
3 8.1366021 4.3431985
4 8.4215753 5.0134955 5.9790348
5 5.3984321 4.5257245 10.5195799 3.5196836
6 8.5089952 3.3145820 7.0212777 0.5269665 1.8321150
7 8.2942319 10.5844486 6.5528696 1.7096543 7.9061246 4.0039738
5 Similarity coefficients for binary variables
In statistics and related fields, a similarity measure or similarity coefficient is a real-valued function that quantifies the similarity between two objects.
How are the quantities a, b, c and d defined?
- a: represents the total number of attributes where A and B both have a value of 1.
- b: represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
- c: represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
- d: represents the total number of attributes where A and B both have a value of 0.
5.1 Example
There are two subjects, A and B, measured on 10 binary variables (1 = presence, 0 = absence).
Calculate the Simple matching, Jaccard, Rogers and Tanimoto, and Sorensen-Dice similarity coefficients.
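The data table for this example was not reproduced in these notes, so below is a minimal sketch with two hypothetical binary vectors; the coefficient formulas are the standard ones built from a, b, c and d.
# hypothetical binary profiles for subjects A and B
A <- c(1, 1, 0, 1, 0, 0, 1, 1, 0, 1)
B <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 1)
a <- sum(A == 1 & B == 1)  # both 1
b <- sum(A == 1 & B == 0)  # A is 1, B is 0
c <- sum(A == 0 & B == 1)  # A is 0, B is 1
d <- sum(A == 0 & B == 0)  # both 0
(a + d) / (a + b + c + d)      # Simple matching
a / (a + b + c)                # Jaccard
(a + d) / (a + 2*(b + c) + d)  # Rogers and Tanimoto
2*a / (2*a + b + c)            # Sorensen-Dice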
To learn more about distances, consult this paper: http://users.uom.gr/~kouiruki/sung.pdf
6 Gower distance
The Gower distance is a measure of dissimilarity between two objects. It is commonly used in multivariate analysis and is particularly suitable for datasets that include a combination of
- quantitative,
- nominal, and
- ordinal variables.
6.1 FD package
The gowdis() function from the FD package measures the Gower distance for mixed variables.
require(FD)
gowdis(x, w, ord)
where
x: matrix or data frame containing the variables. Variables can be numeric, ordered, or factor. Symmetric or asymmetric binary variables should be numeric and only contain 0 and 1. Character variables will be converted to factor. NAs are tolerated.
w: vector listing the weights for the variables in x. Can be missing, in which case all variables have equal weights.
ord: character string specifying the method to be used for ordinal variables.
Consult the details with help(gowdis).
6.2 cluster package
The daisy() function from the cluster package computes all the pairwise dissimilarities (distances) between observations in the data set.
require(cluster)
daisy(x, metric = c("euclidean", "manhattan", "gower"),
stand = FALSE, type = list(), weights = rep.int(1, p),
warnBin = warnType, warnAsym = warnType, warnConst = warnType,
warnType = TRUE)
where
x: matrix or data frame containing the variables. NAs are tolerated.
metric: character string specifying the metric to be used.
stand: logical flag: if TRUE, then the measurements in x are standardized before calculating the dissimilarities.
Consult the details with help(daisy).
6.3 StatMatch package
The gower.dist() function from the StatMatch package computes Gower’s distance (dissimilarity) between units in a dataset or between observations in two distinct datasets.
require(StatMatch)
gower.dist(data.x, data.y=data.x,
rngs=NULL, KR.corr=TRUE,
var.weights = NULL, robcb=NULL)
where
data.x: a matrix or a data frame containing variables that should be used in the computation of the distance.
KR.corr: when TRUE (default) the extension of the Gower’s dissimilarity measure proposed by Kaufman and Rousseeuw (1990) is used. Otherwise, when KR.corr=FALSE, Gower’s (1971) formula is considered.
Consult the details with help(gower.dist).
6.4 Example
Consider the following dt data frame with mixed variables. Obtain the Gower distances using the functions shown above.
sex <- c('female', 'male', 'female', 'male', 'female')
sex <- factor(sex)
smoke <- c(1, 0, 0, 1, 0)
age <- c(25, 26, 35, 25, 25)
degree <- c('Bachelor', 'PhD', 'Master', 'Bachelor', 'PhD')
degree <- factor(degree, levels=c('Bachelor', 'Master', 'PhD'))

dt <- data.frame(sex, smoke, age, degree)
dt
sex smoke age degree
1 female 1 25 Bachelor
2 male 0 26 PhD
3 female 0 35 Master
4 male 1 25 Bachelor
5 female 0 25 PhD
To obtain the Gower distances we can use:
FD::gowdis(x=dt)
1 2 3 4
2 0.775
3 0.750 0.725
4 0.250 0.525 1.000
5 0.500 0.275 0.500 0.750
cluster::daisy(x=dt, metric="gower")
Dissimilarities :
1 2 3 4
2 0.775
3 0.750 0.725
4 0.250 0.525 1.000
5 0.500 0.275 0.500 0.750
Metric : mixed ; Types = N, I, I, N
Number of objects : 5
StatMatch::gower.dist(data.x=dt)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.000 0.775 0.750 0.250 0.500
[2,] 0.775 0.000 0.725 0.525 0.275
[3,] 0.750 0.725 0.000 1.000 0.500
[4,] 0.250 0.525 1.000 0.000 0.750
[5,] 0.500 0.275 0.500 0.750 0.000
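As a sanity check, the entry between rows 1 and 4 can be reproduced from Gower’s formula: each variable contributes a dissimilarity in [0, 1] (a 0/1 mismatch for factors, the absolute difference divided by the range for numeric variables), and the contributions are averaged.
# rows 1 and 4: sex differs (1), smoke equal (0),
# age |25 - 25| / (35 - 25) = 0, degree equal (0)
(1 + 0 + abs(25 - 25)/(35 - 25) + 0) / 4  # 0.25, matching the tables above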
What happens if we assign a low importance to the sex variable? Use the weights 1, 6, 5 and 4 for sex, smoke, age and degree, respectively.
FD::gowdis(x=dt, w=c(1, 6, 5, 4))
1 2 3 4
2 0.71875
3 0.93750 0.59375
4 0.06250 0.65625 1.00000
5 0.62500 0.09375 0.56250 0.68750
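The same hand check with the weights: only sex differs between rows 1 and 4, so the weighted average is
(1*1 + 6*0 + 5*0 + 4*0) / (1 + 6 + 5 + 4)  # 0.0625, matching the output above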
7 \(k\)-means
What is the objective?
Given a set of observations \((x_1, x_2, \ldots, x_n)\), where each observation is a \(d\)-dimensional real vector, \(k\)-means clustering aims to partition the \(n\) observations into \(k\) (\(k \leq n\)) sets \(S = \{S_1, S_2, \ldots, S_k\}\) so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:
\[ \underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x_j \in S_i} \left\| \mathbf x_j - \boldsymbol\mu_i \right\|^2 \]
where \(\mu_i\) is the mean of points in \(S_i\). The clustering optimization problem is solved with the kmeans() function in R.
Next we have the structure of the kmeans() function.
kmeans(x, centers, iter.max = 10, nstart = 1,
algorithm = c("Hartigan-Wong",
"Lloyd",
"Forgy",
"MacQueen"),
trace = FALSE)
7.1 Example 1
The wine dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. The Type variable has been transformed into a categorical variable.
data(wine, package='rattle')
head(wine)
Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids
1 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28
2 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26
3 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30
4 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24
5 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39
6 1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34
Proanthocyanins Color Hue Dilution Proline
1 2.29 5.64 1.04 3.92 1065
2 1.28 4.38 1.05 3.40 1050
3 2.81 5.68 1.03 3.17 1185
4 2.18 7.80 0.86 3.45 1480
5 1.82 4.32 1.04 2.93 735
6 1.97 6.75 1.05 2.85 1450
To standardize the numeric variables:
wine_stand <- scale(wine[-1]) # Without the group variable
head(wine_stand)
Alcohol Malic Ash Alcalinity Magnesium Phenols
[1,] 1.5143408 -0.56066822 0.2313998 -1.1663032 1.90852151 0.8067217
[2,] 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481
[3,] 0.1963252 0.02117152 1.1062139 -0.2679823 0.08810981 0.8067217
[4,] 1.6867914 -0.34583508 0.4865539 -0.8069748 0.92829983 2.4844372
[5,] 0.2948684 0.22705328 1.8352256 0.4506745 1.27837900 0.8067217
[6,] 1.4773871 -0.51591132 0.3043010 -1.2860793 0.85828399 1.5576991
Flavanoids Nonflavanoids Proanthocyanins Color Hue Dilution
[1,] 1.0319081 -0.6577078 1.2214385 0.2510088 0.3611585 1.8427215
[2,] 0.7315653 -0.8184106 -0.5431887 -0.2924962 0.4049085 1.1103172
[3,] 1.2121137 -0.4970050 2.1299594 0.2682629 0.3174085 0.7863692
[4,] 1.4623994 -0.9791134 1.0292513 1.1827317 -0.4263410 1.1807407
[5,] 0.6614853 0.2261576 0.4002753 -0.3183774 0.3611585 0.4483365
[6,] 1.3622851 -0.1755994 0.6623487 0.7298108 0.4049085 0.3356589
Proline
[1,] 1.01015939
[2,] 0.96252635
[3,] 1.39122370
[4,] 2.32800680
[5,] -0.03776747
[6,] 2.23274072
Clustering with \(k=3\).
k.means.fit <- kmeans(x=wine_stand, centers=3)
Exploring the k.means.fit object.
k.means.fit
K-means clustering with 3 clusters of sizes 62, 51, 65
Cluster means:
Alcohol Malic Ash Alcalinity Magnesium Phenols
1 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
2 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
3 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
Flavanoids Nonflavanoids Proanthocyanins Color Hue Dilution
1 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504 0.7770551
2 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122 -1.2887761
3 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046 0.2700025
Proline
1 1.1220202
2 -0.4059428
3 -0.7517257
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1
[75] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 385.6983 326.3537 558.6971
(between_SS / total_SS = 44.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
To explore the 3 centroids rounded to two decimal places:
centroids <- round(k.means.fit$centers, digits=2)
t(centroids)
1 2 3
Alcohol 0.83 0.16 -0.92
Malic -0.30 0.87 -0.39
Ash 0.36 0.19 -0.49
Alcalinity -0.61 0.52 0.17
Magnesium 0.58 -0.08 -0.49
Phenols 0.88 -0.98 -0.08
Flavanoids 0.98 -1.21 0.02
Nonflavanoids -0.56 0.72 -0.03
Proanthocyanins 0.58 -0.78 0.06
Color 0.17 0.94 -0.90
Hue 0.47 -1.16 0.46
Dilution 0.78 -1.29 0.27
Proline 1.12 -0.41 -0.75
To explore the clusters.
k.means.fit$cluster
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1
[75] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
To explore the sum of squares.
k.means.fit$totss # The total sum of squares
[1] 2301
k.means.fit$withinss # Vector of within-cluster sums of squares
[1] 385.6983 326.3537 558.6971
k.means.fit$tot.withinss # Total within-cluster sum of squares
[1] 1270.749
k.means.fit$betweenss # The between-cluster sum of squares, totss - tot.withinss
[1] 1030.251
1030.251 / 2301
[1] 0.4477405
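To tie these components back to the WCSS objective defined at the beginning of this section, tot.withinss can be recomputed by hand from the cluster assignments; a minimal check using the k.means.fit object above:
# sum over clusters of the squared deviations from each cluster centroid
sum(sapply(1:3, function(i){
  Xi <- wine_stand[k.means.fit$cluster == i, , drop=FALSE]
  sum(scale(Xi, center=TRUE, scale=FALSE)^2)  # within-cluster sum of squares
}))  # should match tot.withinss = 1270.749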
A fundamental question is how to determine the value of the parameter \(k\). If we look at the percentage of variance explained as a function of the number of clusters, one should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”.
wssplot <- function(data, nc=15, seed=1234, ...){
  wss <- (nrow(data)-1)*sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares", ...)
}
wssplot(wine_stand, nc=6, lwd=3, col="tomato", las=1)
The cluster library allows us to represent (with the aid of PCA) the cluster solution in 2 dimensions:
library(cluster)
clusplot(x=wine_stand, k.means.fit$cluster, main='',
color=TRUE, shade=TRUE, labels=2, lines=0)
In this example, we have an additional variable, Type, that was not used in the analysis. To evaluate the clustering performance, we constructed a confusion matrix to compare the clusters with the prior classification based on the Type variable.
table(Actual=wine[,1], Prediction=k.means.fit$cluster)
Prediction
Actual 1 2 3
1 59 0 0
2 3 3 65
3 0 48 0
8 Hierarchical methods
8.1 The nearest neighbour algorithm (single)
The distance between two clusters is measured by
\[ d_{AB} = \min_{i \in A,\, j \in B} \left \{ d_{ij} \right \} \]
1. Each observation is a cluster.
2. Join the two closest clusters.
3. Compute the new distance matrix.
4. Repeat steps 2 and 3 until all observations are in one cluster.
8.2 The farthest neighbour algorithm (complete)
The distance between two clusters is measured by
\[ d_{AB} = \max_{i \in A,\, j \in B} \left \{ d_{ij} \right \} \]
1. Each observation is a cluster.
2. Join the two closest clusters.
3. Compute the new distance matrix.
4. Repeat steps 2 and 3 until all observations are in one cluster.
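The two criteria differ only in how they summarize the pairwise distances between clusters; a minimal sketch with hypothetical values:
# pairwise distances between members of clusters A = {a1, a2} and B = {b1, b2}
dAB <- matrix(c(2, 7,
                4, 6),
              nrow=2, byrow=TRUE,
              dimnames=list(c('a1', 'a2'), c('b1', 'b2')))
min(dAB)  # single linkage distance between A and B: 2
max(dAB)  # complete linkage distance between A and B: 7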
8.3 Other algorithms
- Ward’s: minimum variance criterion minimizes the total within-cluster variance.
- Average
- Median
- Centroid
8.4 Dendrogram
A dendrogram (from Greek dendro “tree” and gramma “drawing”) is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.
To customize a dendrogram we recommend this URL: https://rpubs.com/gaston/dendrograms
8.5 The hclust() function
The hclust() function computes cluster analysis on a set of dissimilarities and methods for analyzing it.
hclust(d, method = "complete", members = NULL)
d: a dissimilarity structure as produced by dist.
method: the agglomeration method to be used. This should be one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).
8.6 Example
Apply the nearest neighbour algorithm and draw the dendrogram for the next distance matrix.
Dist <- matrix(c(0, 3, 7, 11, 10,
                 3, 0, 6, 10, 9,
                 7, 6, 0, 5, 6,
                 11, 10, 5, 0, 4,
                 10, 9, 6, 4, 0), ncol=5, byrow=TRUE)
rownames(Dist) <- colnames(Dist) <- paste('Obs', 1:5)
Dist <- as.dist(Dist) # to convert into a dist object
hc <- hclust(d=Dist, method='single')
hc
Call:
hclust(d = Dist, method = "single")
Cluster method : single
Number of objects: 5
plot(hc)
8.7 Example
Use the Ward’s algorithm with Euclidean distance to create 3 cluster for wines.
d <- dist(wine_stand, method = "euclidean")
H.fit <- hclust(d, method="ward.D")
plot(H.fit, hang=-1) # display dendrogram
groups <- cutree(H.fit, k=3) # cut tree into 3 clusters
rect.hclust(H.fit, k=3, border="red")
The clustering performance can be evaluated with the aid of a confusion matrix as follows:
table(wine[,1],groups)
groups
1 2 3
1 58 1 0
2 7 58 6
3 0 0 48
8.8 Example
Use the iris data set for clustering into 3 clusters.
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
summary(iris$Species)
setosa versicolor virginica
50 50 50
Exploring the data.
Use the Ward’s algorithm with Euclidean distance to create 3 clusters.
d <- dist(iris[, 3:4], method = "euclidean")
iris.fit <- hclust(d, method="ward.D")
plot(iris.fit, hang=-1)
rect.hclust(iris.fit, k = 3, border = "red")
Clusters.
groups <- cutree(iris.fit, k = 3)
groups
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
3 2 3 3 3 3 2 3 3 3 2 3 3 2 3 3 3 3 3 2
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3
141 142 143 144 145 146 147 148 149 150
3 3 2 3 3 3 2 3 3 2
The clustering performance can be evaluated with the aid of a confusion matrix as follows:
table(groups, iris$Species)
groups setosa versicolor virginica
1 50 0 0
2 0 50 14
3 0 0 36
9 Other examples
To learn more about clustering consult:
- https://rpubs.com/dnchari/kmeans
- https://www.r-bloggers.com/k-means-clustering-in-r/
- https://www.r-bloggers.com/hierarchical-clustering-in-r-2/
- https://www.stat.berkeley.edu/~s133/Cluster2a.html
10 More about clustering
https://lazappi.github.io/clustree/articles/clustree.html