Partitioning Around Medoids

DESCRIPTION:
Returns a list representing a clustering of the data into k clusters.

USAGE:
pam(x, k, diss = F, metric = "euclidean", stand = F)

REQUIRED ARGUMENTS:
x:
data matrix or data frame, or dissimilarity matrix, depending on the value of the diss argument.

In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed.

In case of a dissimilarity matrix, x is typically the output of daisy or dist. Also a vector with length n*(n-1)/2 is allowed (where n is the number of objects), and will be interpreted in the same way as the output of the above-mentioned functions. Missing values (NAs) are not allowed.
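As a small illustration of the vector form (not part of the original help page), the output of dist for n objects is exactly such a vector of length n*(n-1)/2:

```r
# A "dist" object for n objects stores the lower triangle of the
# dissimilarity matrix as a vector of length n*(n-1)/2.
n <- 5
d <- dist(matrix(rnorm(n * 2), ncol = 2))
length(d)        # n*(n-1)/2 = 10
```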

k:
integer, the number of clusters.


OPTIONAL ARGUMENTS:
diss:
logical flag: if TRUE, then x will be considered as a dissimilarity matrix. If FALSE, then x will be considered as a matrix of observations by variables.

metric:
character string specifying the metric to be used for calculating dissimilarities between objects. The currently available options are "euclidean" and "manhattan". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. If x is already a dissimilarity matrix, then this argument will be ignored.
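The two metrics can be illustrated directly (a sketch, not taken from the help page):

```r
# Euclidean vs. manhattan distance between two points.
p <- c(0, 0)
q <- c(3, 4)
sqrt(sum((p - q)^2))   # euclidean: root sum of squared differences = 5
sum(abs(p - q))        # manhattan: sum of absolute differences   = 7
```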

stand:
logical flag: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. If x is already a dissimilarity matrix, then this argument will be ignored.
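The standardization described above can be sketched by hand as follows (an illustration only; the scaling uses the mean absolute deviation, not the median-based mad function):

```r
# Illustrative column-wise standardization as used when stand = TRUE:
# subtract each variable's mean, divide by its mean absolute deviation.
standardize <- function(x) {
  apply(x, 2, function(col) {
    (col - mean(col)) / mean(abs(col - mean(col)))
  })
}
```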


VALUE:
an object of class "pam" representing the clustering. See pam.object for details.

DETAILS:
pam is fully described in chapter 2 of Kaufman and Rousseeuw (1990). Compared to the k-means approach in kmeans, the function pam has the following features: (a) it also accepts a dissimilarity matrix; (b) it is more robust because it minimizes a sum of dissimilarities instead of a sum of squared euclidean distances; (c) it provides a novel graphical display, the silhouette plot (see plot.partition), which also helps in selecting the number of clusters.

The pam-algorithm is based on the search for k representative objects or medoids among the objects of the dataset. These objects should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each object to the nearest medoid. The goal is to find k representative objects which minimize the sum of the dissimilarities of the objects to their closest representative object. The algorithm first looks for a good initial set of medoids (this is called the BUILD phase). Then it finds a local minimum for the objective function, that is, a solution such that there is no single switch of an object with a medoid that will decrease the objective (this is called the SWAP phase).
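The objective function minimized by both phases can be sketched as follows (an illustration of the idea, not the actual pam implementation):

```r
# Total cost of a candidate medoid set: the sum, over all objects,
# of the dissimilarity to the nearest medoid.
medoid.cost <- function(d, medoids) {
  d <- as.matrix(d)                      # d: a "dist" object or matrix
  sum(apply(d[, medoids, drop = FALSE], 1, min))
}
# The SWAP phase accepts a swap of a non-medoid object for a medoid
# only if it lowers this cost; the algorithm stops when no single
# swap yields a further decrease.
```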


BACKGROUND:
Cluster analysis divides a dataset into groups (clusters) of objects that are similar to each other. Partitioning methods like pam, clara, and fanny require that the number of clusters be given by the user. Hierarchical methods like agnes, diana, and mona construct a hierarchy of clusterings, with the number of clusters ranging from one to the number of objects.


NOTE:
For datasets larger than (say) 200 objects, pam may require considerable computation time. For such datasets the function clara is preferable.


REFERENCES:
Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

SEE ALSO:
clara, daisy, dist, pam.object, partition.object, plot.partition.

EXAMPLES:
# generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
           cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))

pamx <- pam(x, 2)
pamx
summary(pamx)
plot(pamx)

pam(daisy(x, metric = "manhattan"), 2, diss = T)