Clustering Large Applications

DESCRIPTION:
Returns a list representing a clustering of the data into k clusters.

USAGE:
clara(x, k, metric = "euclidean", stand = F, samples = 5,
      sampsize = 40 + 2 * k)


REQUIRED ARGUMENTS:
x:
data matrix or data frame. Each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed.

k:
integer, the number of clusters.


OPTIONAL ARGUMENTS:
metric:
character string specifying the metric to be used for calculating dissimilarities between objects. The currently available options are "euclidean" and "manhattan". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences.
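As a small illustration of the two metrics, the following sketch computes both by hand for two made-up observations and compares them with the base R function dist. The vectors a and b are illustrative values only, and the use of dist here is an assumption about a convenient cross-check, not a statement about how clara computes dissimilarities internally.

```r
# Two hypothetical observations measured on three variables.
a <- c(1, 2, 3)
b <- c(4, 0, 3)

# Euclidean: root sum-of-squares of the differences.
euclid <- sqrt(sum((a - b)^2))   # sqrt(9 + 4 + 0)

# Manhattan: sum of the absolute differences.
manhat <- sum(abs(a - b))        # 3 + 2 + 0

# Cross-check against base R's dist().
euclid_dist <- as.numeric(dist(rbind(a, b), method = "euclidean"))
manhat_dist <- as.numeric(dist(rbind(a, b), method = "manhattan"))
```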

stand:
logical flag: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation.
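The standardization described above can be sketched in a few lines of base R. The helper name standardize_mad is hypothetical and is not part of the library; it only mirrors the description (subtract the column mean, divide by the column's mean absolute deviation) and may differ from the internal code in details such as the handling of constant columns.

```r
# Hypothetical helper: standardize each column by its mean and
# mean absolute deviation, ignoring missing values.
standardize_mad <- function(x) {
  x <- as.matrix(x)
  col_mean <- colMeans(x, na.rm = TRUE)
  centered <- sweep(x, 2, col_mean)                  # subtract column means
  mean_abs_dev <- colMeans(abs(centered), na.rm = TRUE)
  sweep(centered, 2, mean_abs_dev, "/")              # divide by mean abs. deviation
}

# Two variables on very different scales end up comparable.
x_raw <- cbind(c(1, 2, 3, 4), c(10, 20, 30, 40))
z <- standardize_mad(x_raw)
```

After this transformation every column has mean 0 and mean absolute deviation 1, so no single variable dominates the dissimilarities merely because of its units.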

samples:
integer, number of samples to be drawn from the dataset.

sampsize:
integer, number of objects in each sample. sampsize should be higher than the number of clusters (k) and at most the number of objects (nrow(x)).


VALUE:
an object of class "clara" representing the clustering. See clara.object for details.


DETAILS:
clara is fully described in chapter 3 of Kaufman and Rousseeuw (1990). Compared to other partitioning methods such as pam, it can deal with much larger datasets. Internally, this is achieved by considering sub-datasets of fixed size, so that the time and storage requirements become linear in nrow(x) rather than quadratic.
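To make the linear-versus-quadratic claim concrete, the following arithmetic sketch counts dissimilarities for a hypothetical dataset of 100000 observations. The dataset size is an illustrative assumption; the sample size is the default from the USAGE section.

```r
n <- 100000               # hypothetical number of observations
k <- 2
sampsize <- 40 + 2 * k    # default sample size: 44

# A full dissimilarity matrix needs one entry per pair of objects.
full_pairs <- n * (n - 1) / 2

# Each sub-dataset needs only the pairs within the sample ...
sample_pairs <- sampsize * (sampsize - 1) / 2

# ... plus one distance from every object to each of the k medoids.
assign_dists <- n * k
```

The per-sample work grows with n only through assign_dists, which is proportional to nrow(x), whereas full_pairs grows with the square of nrow(x).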

Each sub-dataset is partitioned into k clusters using the same algorithm as in the pam function. Once k representative objects (medoids) have been selected from the sub-dataset, each object of the entire dataset is assigned to the nearest medoid. The sum of the dissimilarities of the objects to their closest medoid is used as a measure of the quality of the clustering. The sub-dataset for which this sum is minimal is retained, and a further analysis is carried out on the final partition. Each sub-dataset is forced to contain the medoids obtained from the best sub-dataset found so far. Randomly drawn objects are added to this set until sampsize has been reached.
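The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function name clara_sketch is hypothetical, it assumes R with the cluster package (for pam), it uses Euclidean distance only, and it omits standardization, missing-value handling, and the further analysis of the final partition.

```r
library(cluster)  # assumed available; provides pam()

# Hypothetical sketch of the sampling scheme described above.
clara_sketch <- function(x, k, samples = 5, sampsize = 40 + 2 * k) {
  x <- as.matrix(x)
  n <- nrow(x)
  best <- list(cost = Inf, medoids = NULL, clustering = NULL)
  for (s in seq_len(samples)) {
    # Start from the best medoids so far, then add random objects
    # until the sub-dataset holds sampsize objects.
    keep <- best$medoids
    pool <- setdiff(seq_len(n), keep)
    idx <- c(keep, sample(pool, sampsize - length(keep)))

    # Partition the sub-dataset with pam and map its medoids
    # back to row indices of the full dataset.
    med <- idx[pam(x[idx, , drop = FALSE], k)$id.med]

    # Distance from every object of the full dataset to each medoid.
    d <- sapply(med, function(m) sqrt(rowSums(sweep(x, 2, x[m, ])^2)))

    # Quality measure: sum of dissimilarities to the closest medoid.
    cost <- sum(apply(d, 1, min))
    if (cost < best$cost) {
      best <- list(cost = cost, medoids = med,
                   clustering = apply(d, 1, which.min))
    }
  }
  best
}
```

Calling clara_sketch(x, 2) on a numeric matrix x returns the row indices of the retained medoids, the cluster number of every object, and the value of the quality measure for the retained sub-dataset.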


BACKGROUND:
Cluster analysis divides a dataset into groups (clusters) of objects that are similar to each other. Partitioning methods like pam, clara, and fanny require that the number of clusters be given by the user. Hierarchical methods like agnes, diana, and mona construct a hierarchy of clusterings, with the number of clusters ranging from one to the number of objects.


NOTE:
For small datasets (say with fewer than 200 observations), the function pam can be used directly.


REFERENCES:
Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

SEE ALSO:
clara.object, partition.object, pam, plot.partition.

EXAMPLES:
# generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))

clarax <- clara(x, 2)
clarax
clarax$clusinfo
plot(clarax)