Hartigan's K-Means Clustering

DESCRIPTION:
Returns a list representing a clustering of the data with as many groups as there are rows in centers.

USAGE:
kmeans(x, centers, iter.max=10)

REQUIRED ARGUMENTS:
x:
matrix of multivariate data. Each row corresponds to an observation, and each column corresponds to a variable. Missing values are not accepted.
centers:
matrix of initial guesses for the cluster centers. Each row represents a cluster center, and thus centers must have the same number of columns as x. The number of rows in centers, (there must be at least two), is the number of clusters that will be formed. Missing values are not accepted.

OPTIONAL ARGUMENTS:
iter.max:
maximum number of iterations.

VALUE:
a list with the following components:
cluster:
vector of integers, ranging from 1 to nrow(centers), with length the same as the number of rows of x. The ith value indicates the cluster in which the ith data point belongs.
centers:
matrix like the input centers containing the locations of the final cluster centers. Each row is a cluster center location.
withinss:
vector of length nrow(centers). The ith value gives the within cluster sum of squares for the ith cluster.
size:
vector of length nrow(centers). The ith value gives the number of data points in cluster i.

DETAILS:
The object is to find a partition of the observations with nrow(centers) groups that minimizes sum(withinss). To actually guarantee the minimum would be computationally infeasible in many settings; this function finds a local minimum, that is, a solution such that there is no single switch of an observation from one group to another group that will decrease the objective. The procedure used to achieve the local minimum is rather complex - see Hartigan and Wong (1979) for details.

It may be necessary to scale the columns of x in order for the clustering to be sensible. The larger a variable's variance, the more important it will be to the clustering.

When deciding on the number of clusters, Hartigan (1975, pp 90-91) suggests the following rough rule of thumb. If k is the result of kmeans with k groups and kplus1 is the result with k+1 groups, then it is justifiable to add the extra group when

(sum(k$withinss)/sum(kplus1$withinss)-1)*(nrow(x)-k-1)

is greater than 10.


REFERENCES:
Hartigan, J. A. (1975). Clustering Algorithms. New York: Wiley.

Hartigan, J. A. and Wong, M. A. (1979). A k-means clustering algorithm. Applied Statistics 28, 100-108.


SEE ALSO:
hclust , mclust .

EXAMPLES:
irismean <- t(apply(iris, c(2, 3), 'mean'))
x <- rbind(iris[,,1], iris[,,2], iris[,,3])
km <- kmeans(x, irismean)
wrong <- km$cluster!=rep(1:3, c(50, 50, 50))

spin(x, highlight=wrong)

plot(x[,2], x[,3], type="n") text(x[!wrong, 2], x[!wrong, 3], km$cluster) # identify cluster membership that is correct points(x[wrong, 2], x[wrong, 3], pch=15) # boxes for points in error title(main="K-Means Clustering of the Iris Data")