Hierarchical Clustering

DESCRIPTION:
Performs hierarchical clustering on a distance or similarity structure. Choices of method are "compact" (also called "complete linkage"), "average", and "connected" (also called "single linkage").

USAGE:
hclust(dist, method = "compact", sim =)

REQUIRED ARGUMENTS:
Exactly one of dist or sim must be specified.

OPTIONAL ARGUMENTS:
dist:
a distance structure or distance matrix. Normally this will be the result of the function dist, but it can be any data of the form returned by dist, or a full, symmetric matrix. Missing values are not allowed.
method:
character string giving the clustering method. The three methods currently implemented are "average", "connected" (single linkage) and "compact" (complete linkage). (The first three characters of the method are sufficient.)
sim=:
structure giving similarities rather than distances. This can either be a symmetric matrix or a vector with a "Size" attribute. Missing values are not allowed.

VALUE:
a "tree" representing the clustering, i.e., a list consisting of the following components:
merge:
an (n-1) by 2 matrix, if there were n objects in the original data. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then object -j was merged at this stage. If j is positive, then the merge was with the cluster formed at the (earlier) stage j of the algorithm.
height:
a vector of the clustering "height"; that is, the distance between clusters merged at the successive stages.
order:
a vector giving a permutation of the original objects suitable for plotting, in the sense that a cluster plot using this ordering will not have crossings of the branches.
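
For illustration, these components can be inspected directly. A minimal sketch using arbitrary made-up data:

y <- c(1, 2, 4.5, 8, 8.5)   # five arbitrary one-dimensional points
h <- hclust(dist(y))
h$merge    # a 4 by 2 matrix describing the merge steps
h$height   # distances at which the successive merges occurred
h$order    # permutation of the objects giving a crossing-free plot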

DETAILS:
At each stage the two "nearest" clusters are combined to form one bigger cluster (initially each cluster contains a single point). For method="connected" the distance between two clusters is the minimum distance between a point in the first cluster and a point in the second cluster. In method="average" the distance between clusters is the average of the distances between the points in one cluster and the points in the other cluster. The largest distance between a point in one cluster and a point in the other cluster is used in method="compact". Long thin clusters are typically created with "connected", and more spherical clusters are formed with "compact".
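
To make the distinction concrete, the following sketch (again with arbitrary made-up data) clusters the same distances under each method and compares the resulting merge heights:

d <- dist(c(1, 2, 4, 7, 11))            # arbitrary one-dimensional data
hclust(d, method = "connected")$height  # merges at minimum inter-cluster distance
hclust(d, method = "average")$height    # merges at average inter-cluster distance
hclust(d, method = "compact")$height    # merges at maximum inter-cluster distance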

In hierarchical cluster displays, a decision is needed at each merge to specify which subtree should go on the left and which on the right. Since, for n individuals, there are n-1 merges, there are 2^(n-1) possible orderings for the leaves in a cluster tree. The default algorithm in hclust is to order the subtrees so that the tighter cluster is on the left (the last merge of the left subtree is at a lower value than the last merge of the right subtree). Individuals are the tightest clusters possible, and merges involving two individuals place them in order by their observation number.


BACKGROUND:
Cluster analysis divides data points into groups of points that are "close" to each other. The hclust function continues to aggregate groups until only one large group remains. If the number of groups must be chosen, this can be decided subsequently by cutting the tree. Other methods (see kmeans) require that the number of groups be fixed from the start.
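
For example, a finished tree can be cut into a chosen number of groups with cutree; a brief sketch using the built-in votes.repub data (see also the EXAMPLES section):

votes.clust <- hclust(dist(votes.repub))
cutree(votes.clust, k = 5)   # membership vector for a 5-group solution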

By changing the distance metric and the clustering method, several different cluster trees can be created from a single dataset. No one method seems to be useful in all situations. Single linkage ("connected") can work poorly if two distinct groups have a few "stragglers" between them.
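
A small sketch of this chaining effect, using artificial data in which a single straggler sits midway between two tight groups:

x <- c(0, 0.5, 1, 5, 9, 9.5, 10)        # two tight groups plus one straggler
plclust(hclust(dist(x), "connected"))   # the straggler chains the groups together
plclust(hclust(dist(x), "compact"))     # groups stay distinct until the final merge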


REFERENCES:
Everitt, B. (1980). Cluster Analysis (second edition). Halsted, New York.

Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.

Many multivariate statistics books also include a discussion on clustering.


NOTE:
The functions plclust and labclust are used for plotting the result of a hierarchical clustering. dist computes distance matrices. Functions cutree, clorder, and subtree can be used to manipulate the tree data structure. kmeans performs a non-hierarchical clustering.

SEE ALSO:
clorder, cutree, dist, kmeans, labclust, plclust, subtree.

EXAMPLES:
# cluster the built-in longley.y data and plot the resulting tree
x <- longley.y
h <- hclust(dist(x))
plclust(h)

# average linkage on distances computed with the maximum metric
hclust(dist(x, metric="maximum"), "ave")

# average-linkage clustering of the votes.repub data, labeled by state
votes.clust <- hclust(dist(votes.repub), "ave")
plclust(votes.clust, label=state.abb)