Model-based Hierarchical Clustering

DESCRIPTION:
Performs hierarchical clustering using a choice of model-based and heuristic criteria, calculates a Bayesian criterion (the approximate weight of evidence) for choosing the number of clusters, and optionally allows for noise or "outliers".

USAGE:
mclust(x, method = "S*", signif = rep(0, dim(x)[2]), noise = F,
       scale = rep(1, dim(x)[2]), shape = c(1, rep(0.2, (dim(x)[2]-1))),
       workspace = <<see below>>)

REQUIRED ARGUMENTS:
x:
n by p matrix containing n p-dimensional data points (NAs not allowed).

OPTIONAL ARGUMENTS:
method:
a character string selecting the clustering criterion. Possible values are: "S*", "S", "spherical" (with varying sizes), "sum of squares" or "trace" (Ward's method), "unconstrained", "determinant", "centroid", "weighted average link", "group average link", "complete link" or "farthest neighbor", "single link" or "nearest neighbor". Only enough of the string to determine a unique match is required.
signif:
vector giving the number of significant decimal places in each column of x. Nonpositive components are allowed. Used in initializing clustering in some methods.
noise:
logical flag indicating whether Poisson noise should be assumed.
scale:
vector for scaling the observations. The ith column of x is multiplied by scale[i] before cluster analysis begins.
shape:
vector determining the shape of clusters for methods "S*" and "S".
workspace:
size of the workspace provided to the underlying Fortran program. The default is (dim(x)[1]*(dim(x)[1]-1)) + 10*dim(x)[1].
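
The default workspace can be computed directly from the formula above. The following is a minimal sketch; the data matrix is hypothetical, and doubling the default is an arbitrary illustration, not a recommended setting.

```r
# Hypothetical data: 50 points in 3 dimensions (values are arbitrary).
x <- matrix(rnorm(150), nrow = 50, ncol = 3)
n <- dim(x)[1]

# Default workspace size as documented: n*(n-1) + 10*n
default.workspace <- n * (n - 1) + 10 * n

# A larger workspace could then be requested explicitly, for example:
# fit <- mclust(x, workspace = 2 * default.workspace)
```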

VALUE:
tree:
list with components merge, height, and order conforming to the output of the function hclust, but here height is just the stage of the merge. This output can be used with several functions such as plclust and subtree.
lr:
list of objects merged at each stage, in which a new cluster inherits the number of the lowest-numbered object or cluster from which it is formed (used for classification by function mclass).
awe:
a vector in which the kth element is the approximate weight of evidence for k clusters. This component is present only for the model-based methods: "S*", "S", "spherical" (with varying sizes), "sum of squares" or "trace" (Ward's method), "unconstrained", and "determinant".
call:
a copy of the call to mclust.
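
The awe component can be used to suggest a number of clusters. The sketch below assumes, following Banfield and Raftery (1993), that larger values indicate stronger evidence; the awe values are invented for illustration and do not come from any real fit.

```r
# Hypothetical awe values for k = 1, ..., 5 clusters (made up for illustration).
awe <- c(0, 41.2, 57.9, 52.3, 30.1)

# Position of the largest element; the kth element corresponds to k clusters.
k.best <- order(-awe)[1]
```

With an actual fit, awe would be taken from the awe component of the value returned by mclust.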

NOTE:
The amount of storage needed depends on the ordering of the data points. If the workspace limit is exceeded, it may be possible to rerun successfully without increasing workspace by reordering the data points.
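
One simple way to reorder the data points is a random permutation of the rows. This is a sketch only: the data matrix is hypothetical, and a random permutation is not guaranteed to reduce the storage required.

```r
# Hypothetical data: 50 points in 3 dimensions (values are arbitrary).
x <- matrix(rnorm(150), nrow = 50, ncol = 3)

# Random permutation of the row indices.
perm <- sample(dim(x)[1])
x.perm <- x[perm, ]

# Rerun on the reordered data with the same workspace:
# fit <- mclust(x.perm)
# Object numbers in the result then refer to rows of x.perm;
# perm[i] is the original row number of row i of x.perm.
```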

METHOD:
Hierarchical merging is used in all cases; the criteria for the merge are defined by the kind of clusters expected. They include the standard sum-of-squares method for hyperspherical clusters, and the determinant criterion for ellipsoidal clusters pointing in the same direction. There are also several criteria which give importance to the shape of the clusters, such as S*, which is optimal for clusters that are long and point in different directions, perhaps even overlapping. Some standard heuristic criteria are included along with the model-based methods. For the heuristic methods (centroid, weighted average link, group average link, complete link, and single link), the initial criterion is the Euclidean distance between observations.

The function hclust allows more general initialization for the group average link, complete link, and single link methods. Separate functions are available for classification (mclass) and iterative relocation (mreloc).
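
As a rough illustration of the hclust route, the sketch below runs group average link on Euclidean distances. The data are hypothetical, and the method name "average" is assumed to be accepted by the installed version of hclust; any other dissimilarity object could be supplied in place of dist(x).

```r
# Hypothetical data: 20 points in 2 dimensions (values are arbitrary).
x <- matrix(rnorm(40), nrow = 20, ncol = 2)

# Euclidean distances between observations, then group average link.
d <- dist(x)
h <- hclust(d, method = "average")

# h has components merge, height, and order, like the tree component of mclust.
```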


REFERENCES:
Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3), 803-822.

Gordon, A. D. (1981). Classification: Methods for the Exploratory Analysis of Multivariate Data. London: Chapman and Hall.


SEE ALSO:
hclust, mclass, mreloc, order, plclust, subtree.

EXAMPLES:
years <- c("1960", "1964", "1968", "1972", "1976")
votes.clust <- mclust(votes.repub[,years], method = "S", noise = T)

# display dendrogram on current graphics device
plclust(votes.clust$tree, label = state.abb)
plot(x = 1:length(votes.clust$awe), y = votes.clust$awe) # plot the awe