Classical Metric Multi-Dimensional Scaling

DESCRIPTION:
Represents data in a low-dimensional Euclidean space. The dimension of the space can be chosen, and a constant can be estimated so that "dissimilarities" better approximate Euclidean distances.

USAGE:
cmdscale(d, k=2, eig=F, add=F)

REQUIRED ARGUMENTS:
d:
distance structure of the form returned by dist, or a full, symmetric matrix. Data is assumed to be dissimilarities or relative distances.

OPTIONAL ARGUMENTS:
k:
desired dimensionality of the output space.
eig:
if TRUE, return the eigenvalues computed by the algorithm. They can be used as an aid in determining the appropriate dimensionality of the solution.
add:
if TRUE, compute the additive constant (see component ac below).

VALUE:
when eig and add are both FALSE, a matrix like the points component described below.

Otherwise, a list with two or three components named points, plus eig and/or ac.

points:
a matrix with k columns and as many rows as there were objects whose distances were given in d. Row i gives the coordinates in k-space of the i-th object.
eig:
vector of k eigenvalues, returned only when the eig argument is TRUE.
ac:
constant added to all data values in d to transform dissimilarities (or relative distances) into absolute distances. The Unidimensional Subspace procedure (Torgerson, 1958, p. 276) is used to determine the additive constant. This component is returned only if add=TRUE. If add=FALSE, no constant is added.

DETAILS:
The cmdscale function implements classical metric multidimensional scaling: the distances between points in the result are as close as possible (in a certain sense) to the original distances, subject to being Euclidean distances in a k-dimensional space. The solution for k+1 dimensions has the same first k columns in points (up to numerical error) as the solution for k dimensions.
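
As a rough illustration of the paragraph above, the sketch below carries out the classical scaling computation by hand on a small made-up data set. It is not the code used inside cmdscale; it assumes the textbook recipe (double-centre -0.5 times the squared distances, then take the leading eigenvectors scaled by the square roots of their eigenvalues), and the names pts, full.dist and mds.coords are hypothetical helpers introduced only for this example.

```r
# Five points in the plane: the corners of a 2 x 1 rectangle and its centre.
pts <- rbind(c(0, 0), c(2, 0), c(2, 1), c(0, 1), c(1, 0.5))

# Hypothetical helper: full symmetric matrix of Euclidean distances
# between the rows of a coordinate matrix.
full.dist <- function(x) {
    n <- nrow(x)
    out <- matrix(0, n, n)
    for (i in 1:n)
        for (j in 1:n)
            out[i, j] <- sqrt(sum((x[i, ] - x[j, ])^2))
    out
}
dmat <- full.dist(pts)

# Double-centre -0.5 * d^2: remove row means, then column means.
amat <- -0.5 * dmat^2
bmat <- sweep(amat, 1, apply(amat, 1, mean))
bmat <- sweep(bmat, 2, apply(bmat, 2, mean))

# For a symmetric matrix the eigenvalues come back in decreasing order.
e <- eigen(bmat, symmetric = TRUE)

# Hypothetical helper: coordinates in k dimensions, i.e. the k leading
# eigenvectors scaled by the square roots of their eigenvalues.
mds.coords <- function(e, k) {
    vec <- e$vectors[, 1:k, drop = FALSE]
    sweep(vec, 2, sqrt(e$values[1:k]), "*")
}
x1 <- mds.coords(e, 1)
x2 <- mds.coords(e, 2)

# Largest discrepancy between the fitted and the original distances.
max.err <- max(abs(full.dist(x2) - dmat))
```

For these points the two nonzero eigenvalues are 4 and 1 and the rest vanish up to rounding, so two dimensions reproduce the distances and max.err is at the level of rounding error. The single column of x1 is the first column of x2, which is the nesting property described above.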

The additive constant is typically used when the "distances" in d are subjective dissimilarities. The ac constant attempts to make the distances conform to a Euclidean space of as small a dimension as possible. The estimation of ac is done under the assumption that the Euclidean space has only one dimension, an assumption that simplifies computation. A more technical explanation is that the constant attempts to eliminate negative eigenvalues of the doubly centered matrix of the squared distances.
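
The effect on the eigenvalues can be seen on a tiny made-up example. The sketch below does not implement the Unidimensional Subspace procedure: the constant 1 is simply chosen by hand, and centred.eigen and add.const are hypothetical helper names. It only shows that dissimilarities which violate the triangle inequality give a negative eigenvalue, and that adding a constant to the off-diagonal entries can remove it.

```r
# Dissimilarities among three objects; 1 + 1 < 3 breaks the triangle
# inequality, so no Euclidean configuration has these distances.
dmat <- matrix(c(0, 1, 3,
                 1, 0, 1,
                 3, 1, 0), 3, 3)

# Hypothetical helper: eigenvalues of the doubly centered matrix
# of -0.5 times the squared dissimilarities.
centred.eigen <- function(dmat) {
    amat <- -0.5 * dmat^2
    bmat <- sweep(amat, 1, apply(amat, 1, mean))
    bmat <- sweep(bmat, 2, apply(bmat, 2, mean))
    eigen(bmat, symmetric = TRUE)$values
}

# Hypothetical helper: add a constant to every off-diagonal entry.
add.const <- function(dmat, const) {
    out <- dmat + const
    diag(out) <- 0
    out
}

ev.raw <- centred.eigen(dmat)                # before adding a constant
ev.adj <- centred.eigen(add.const(dmat, 1))  # constant 1, chosen by hand
```

Here ev.raw is 4.5, 0 and -5/6 (up to rounding), while ev.adj is 8, 0 and 0: after adding 1 the dissimilarities become 2, 2 and 4, which are the distances among three equally spaced points on a line.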

There are various measures of the goodness of fit of a solution in the literature. Two of them are computed by the function cmdscale.gof in the EXAMPLES section below; see Mardia, Kent and Bibby (1979, p. 408).
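
To make the two measures concrete, here is a small worked sketch on made-up data, using the same formulas as the example function but starting from a full symmetric matrix rather than a dist structure. The three points and the variable names are assumptions chosen so that the arithmetic can be checked by hand.

```r
# Full symmetric distance matrix for the planar points (0,0), (4,0), (2,3).
s13 <- sqrt(13)
dmat <- matrix(c(0,   4,   s13,
                 4,   0,   s13,
                 s13, s13, 0), 3, 3)

# Double-centre -0.5 * d^2.
amat <- -0.5 * dmat^2
bmat <- sweep(amat, 1, apply(amat, 1, mean))
bmat <- sweep(bmat, 2, apply(bmat, 2, mean))

# Singular values of a symmetric matrix are its absolute eigenvalues.
sv <- svd(bmat)$d

# Proportion of the (absolute, or squared) eigenvalues NOT accounted for
# by the first 1, ..., k dimensions; smaller values mean a better fit.
k <- 2
gof1 <- 1 - cumsum(sv[1:k]) / sum(sv)
gof2 <- 1 - cumsum(sv[1:k]^2) / sum(sv^2)
```

The nonzero eigenvalues here are 8 and 6, so gof1 is 3/7 for one dimension and gof2 is 0.36; both drop to zero (up to rounding) for two dimensions, since the points lie in a plane.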

Results are currently computed to single-precision accuracy only.


BACKGROUND:
Multidimensional scaling is the process of representing, in a low-dimensional space, the distances (or dissimilarities) among a group of objects. It is somewhat similar to cluster analysis but returns points in space rather than distinct groupings.

Some examples of its use are: anthropologists studying cultural differences based on language, art, etc.; and marketing researchers assessing product similarity. The technique can be used to "serialize" data if the result is close to a curve in two dimensions or a string in three. For example, archeologists might try to place several cultures into a time order.


REFERENCES:
Many multivariate statistics books include a discussion of multidimensional scaling. Below are some examples.

Johnson, R. A. and Wichern, D. W. (1982). Applied Multivariate Statistical Analysis. Prentice-Hall, Englewood Cliffs, New Jersey.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, London.

Torgerson, W. S. (1958). Theory and Methods of Scaling. Wiley, New York.


SEE ALSO:
mstree, hclust.

EXAMPLES:
x <- cmdscale(dist.x)  #default 2-space
coord1 <- x[,1]; coord2 <- x[,2]
par( pty="s" )  #set up square plot
r <- range(x)   #get overall max, min
plot(coord1, coord2, type="n", xlim=r, ylim=r) #set up plot
   # note units per inch same on x and y axes
text(coord1, coord2, seq(coord1))  #plot integers
# use brush to explore a 4-dimensional scaling
dis.vote <- dist(votes.repub)
vote.scale <- cmdscale(dis.vote, 4)
brush(vote.scale, rowlab=state.abb)

# below is a function that calculates two measures of stress
# it is fairly slow for datasets of more than 50 or so.
cmdscale.gof <- function(dis, k = 4)
{
        amat <- -0.5 * (dist2full(dis))^2  # see dist help file
        bmat <- sweep(amat, 1, apply(amat, 1, mean))
        bmat <- sweep(bmat, 2, apply(bmat, 2, mean))
        eigs <- svd(bmat, 0, 0)
        gof1 <- 1 - (cumsum(abs(eigs$d[1:k]))/sum(abs(eigs$d)))
        gof2 <- 1 - (cumsum(eigs$d[1:k]^2)/sum(eigs$d^2))
        list(gof1 = gof1, gof2 = gof2, eig = eigs$d)
}

vote.scale <- cmdscale(dist(votes.repub))
plot(vote.scale, type="n")
text(vote.scale, state.abb)