clustering {clues} | R Documentation |
Data clustering (after data shrinking).
clustering(y, disMethod = "Euclidean")
y |
data matrix, given as an R matrix object (for dimension > 1) or a vector object (for dimension = 1), with rows as observations and columns as variables. |
disMethod |
specification of the dissimilarity measure. The available measures are “Euclidean” and “1-corr”. |
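As a rough illustration of the two measures, the following sketch computes a dissimilarity between two observations. It assumes that “1-corr” means one minus the Pearson correlation of the two observation vectors; this page does not spell that out. The helper name dissim_sketch is hypothetical and is not part of the clues package.

```r
# Hypothetical helper, not part of clues: dissimilarity between two observations.
# Assumption: "1-corr" is 1 minus the Pearson correlation of the two vectors.
dissim_sketch <- function(a, b, method = c("Euclidean", "1-corr")) {
  method <- match.arg(method)
  switch(method,
    "Euclidean" = sqrt(sum((a - b)^2)),  # straight-line distance
    "1-corr"    = 1 - cor(a, b)          # 0 when perfectly positively correlated
  )
}

a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 9)
dissim_sketch(a, b, "Euclidean")  # sqrt(39)
dissim_sketch(a, b, "1-corr")     # small, since a and b are nearly collinear
```

Under this assumption, “1-corr” needs at least two variables per observation, because a correlation cannot be computed from a single value.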
We first store the first observation (data point) in point[1]. We then find the nearest neighbor of point[1], store it in point[2], and store the dissimilarity between point[1] and point[2] in db[1]. We next remove point[1]. We then find the nearest neighbor of point[2], store it in point[3], and store the dissimilarity between point[2] and point[3] in db[2]. We then remove point[2] and find the nearest neighbor of point[3]. We repeat this procedure until we find point[n] and db[n-1], where n is the total number of data points.
Next, we calculate the interquartile range (IQR) of the vector db. We then check which elements of db are larger than avg + 1.5 * IQR, where avg is the average of the vector db. The minimum value of these outlier dissimilarities is stored in omin.
An estimate of the number of clusters is g, where g - 1 is the number of outlier dissimilarities. The position of an outlier dissimilarity indicates the end of one cluster and the start of a new cluster.
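The procedure described above can be sketched in a few lines of R. This is a minimal illustration, not the implementation used by clues: the function name nn_chain_sketch is hypothetical, the Euclidean measure is assumed, point is assumed to hold row indices of the data, and at least two data points are assumed.

```r
# Illustrative sketch only; nn_chain_sketch() is a hypothetical helper.
nn_chain_sketch <- function(y) {
  y <- as.matrix(y)
  n <- nrow(y)
  d <- as.matrix(dist(y))            # pairwise Euclidean dissimilarities
  point <- integer(n)                # visiting order (row indices)
  db <- numeric(n - 1)               # dissimilarities between consecutive points
  remaining <- rep(TRUE, n)          # points that have not been removed yet
  point[1] <- 1L                     # start from the first observation
  for (i in seq_len(n - 1)) {
    remaining[point[i]] <- FALSE     # remove the current point
    cand <- which(remaining)
    j <- cand[which.min(d[point[i], cand])]  # nearest remaining neighbor
    point[i + 1] <- j
    db[i] <- d[point[i], j]
  }
  # Outlier rule from the text: db > avg + 1.5 * IQR
  is_out <- db > mean(db) + 1.5 * IQR(db)
  omin <- if (any(is_out)) min(db[is_out]) else NA_real_
  g <- sum(is_out) + 1L
  # An outlier at position i ends a cluster at point[i];
  # point[i + 1] starts the next cluster.
  mem <- integer(n)
  mem[point] <- cumsum(c(FALSE, is_out)) + 1L
  list(mem = mem, size = as.vector(table(mem)), g = g,
       db = db, point = point, omin = omin)
}

# Two well-separated groups on the real line
res <- nn_chain_sketch(c(0, 1, 2, 3, 4, 100, 101, 102, 103, 104))
res$g     # 2
res$mem   # 1 1 1 1 1 2 2 2 2 2
res$omin  # 96
```

In this toy example the only dissimilarity above the cutoff is the jump from 4 to 100, so a single outlier splits the ordered points into two clusters.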
To get a reasonable clustering result, data sharpening (shrinking) is recommended before data clustering.
mem |
vector of the cluster membership of data points. The cluster membership takes values: 1, 2, …, g, where g is the estimated number of clusters. |
size |
vector of the number of data points in each cluster. |
g |
an estimate of the number of clusters. |
db |
vector of dissimilarities between sorted consecutive data points (see Details). |
point |
vector of sorted consecutive data points (see Details). |
omin |
the minimum value of the outlier dissimilarities (see Details). |
Wang, S., Qiu, W., and Zamar, R. H. (2007). CLUES: A non-parametric clustering method based on local shrinking. Computational Statistics & Data Analysis, Vol. 52, issue 1, pages 286-298.
# Maronna data set
data(Maronna)

# data matrix
maronna <- Maronna$maronna

# shrink the data first, then cluster the shrunken data
tt <- shrinking(maronna, K = 50, itmax = 20)
tt2 <- clustering(tt)

# Plot of dissimilarities between the sorted consecutive data points
# versus the sorted consecutive data points.
# This plot can be used to assess the estimated number of clusters.
db <- tt2$db
point <- tt2$point
n <- length(point)
plot(1:(n - 1), db, type = "l",
     xlab = "sorted consecutive data points",
     ylab = "dissimilarities between the sorted consecutive data points",
     xlim = c(0, n), axes = FALSE)
box()
axis(side = 2)
axis(side = 1, at = c(0, 1:(n - 1)), labels = point)