Distance Matrix Calculation

DESCRIPTION:
Returns a distance structure that represents all of the pairwise distances between objects in the data. The choices for the metric are "euclidean", "maximum", "manhattan", and "binary".

USAGE:
dist(x, metric = "euclidean")

REQUIRED ARGUMENTS:
x:
matrix (typically a data matrix). The distances computed will be among the rows of x. Missing values (NAs) are allowed.

OPTIONAL ARGUMENTS:
metric:
character string specifying the distance metric to be used. The currently available options are "euclidean", "maximum", "manhattan", and "binary". Euclidean distances are root sum-of-squares of differences, "maximum" is the maximum difference, "manhattan" is the sum of absolute differences, and "binary" is the proportion of non-zeros that two vectors do not have in common (the number of occurrences of a zero and a one, or a one and a zero divided by the number of times at least one vector has a one).
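
The four definitions above can be sketched for a single pair of rows as follows. This is an illustrative Python sketch, not the code behind dist: the function name pair_distance is hypothetical, missing values are not handled here, and returning NaN when neither vector has a non-zero entry is an assumption, since the text above does not say what "binary" does in that case.

```python
import math


def pair_distance(a, b, metric="euclidean"):
    """Distance between two equal-length numeric sequences (no missing values)."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if metric == "euclidean":
        # root sum-of-squares of differences
        return math.sqrt(sum(d * d for d in diffs))
    if metric == "maximum":
        # largest absolute difference
        return max(diffs)
    if metric == "manhattan":
        # sum of absolute differences
        return sum(diffs)
    if metric == "binary":
        # columns where at least one vector is non-zero
        either = sum(1 for u, v in zip(a, b) if u != 0 or v != 0)
        # columns where exactly one vector is non-zero
        mismatch = sum(1 for u, v in zip(a, b) if (u != 0) != (v != 0))
        # assumption: NaN when both vectors are entirely zero
        return mismatch / either if either else float("nan")
    raise ValueError("unknown metric: " + metric)


print(pair_distance([0, 0], [3, 4]))                       # euclidean
print(pair_distance([1, 0, 1, 0], [1, 1, 0, 0], "binary"))  # 2 mismatches out of 3
```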

VALUE:
the distances among the rows of x. Since there are many distances and since the result of dist is typically an argument to hclust or cmdscale, a vector is returned, rather than a symmetric matrix. For i less than j, the distance between row i and row j is element nrow(x)*(i-1) - i*(i-1)/2 + j-i of the result. The returned object has an attribute, "Size", giving the number of objects, that is, nrow(x). The length of the vector that is returned is nrow(x)*(nrow(x)-1)/2, that is, it is of order nrow(x) squared.
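
The indexing rule above can be checked with a short Python sketch. The names dist_index and to_full are hypothetical helpers introduced for illustration; positions are 1-based as in the text, and the input is assumed to be a plain list of length n*(n-1)/2 in the order dist returns it.

```python
def dist_index(i, j, n):
    """1-based position of the (i, j) distance, i < j, among n objects."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    # i*(i-1) is always even, so integer division is exact
    return n * (i - 1) - i * (i - 1) // 2 + j - i


def to_full(dis, n):
    """Expand a distance vector into a symmetric n-by-n list of lists."""
    full = [[0.0] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            d = dis[dist_index(i, j, n) - 1]  # shift to Python's 0-based indexing
            full[i - 1][j - 1] = d
            full[j - 1][i - 1] = d
    return full


# With 4 objects the pairs are stored as (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
n = 4
positions = [dist_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)]
print(positions)
```

Enumerating every pair with i less than j visits positions 1 through n*(n-1)/2 exactly once, which agrees with the stated length of the result.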

DETAILS:
Missing values in a row of x are not included in any distances involving that row. If the metric is "euclidean" and ng is the number of columns in which no missing values occur for the given rows, then the distance returned is sqrt(ncol(x)/ng) times the Euclidean distance between the two vectors of length ng shortened to exclude NAs. The rule is similar for the "manhattan" metric, except that the coefficient is ncol(x)/ng. The "binary" metric excludes columns in which either row has an NA. If all values for a particular distance are excluded, the distance is NA.
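
A minimal Python sketch of the rescaling rule described above, for one pair of rows. It is an illustration only: the function names are hypothetical, None stands in for NA, and len(a) plays the role of ncol(x).

```python
import math


def _complete_pairs(a, b):
    """Keep only the columns in which neither row is missing (None)."""
    return [(u, v) for u, v in zip(a, b) if u is not None and v is not None]


def euclidean_with_missing(a, b):
    pairs = _complete_pairs(a, b)
    ng = len(pairs)
    if ng == 0:
        return None  # every column excluded, so the distance is NA
    ss = sum((u - v) ** 2 for u, v in pairs)
    # sqrt(ncol/ng) times the Euclidean distance over the complete columns
    return math.sqrt(len(a) / ng) * math.sqrt(ss)


def manhattan_with_missing(a, b):
    pairs = _complete_pairs(a, b)
    ng = len(pairs)
    if ng == 0:
        return None
    # the coefficient is ncol/ng rather than its square root
    return (len(a) / ng) * sum(abs(u - v) for u, v in pairs)


a = [0, 0, None, 1]
b = [3, 4, 5, None]
# Two of the four columns are complete, so ng = 2.
print(euclidean_with_missing(a, b))
print(manhattan_with_missing(a, b))
```

The coefficient scales the distance up as if the excluded columns had contributed, on average, as much as the columns that were observed.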

NOTE:
If the columns of a matrix are in different units, it is usually advisable to scale the matrix before using dist. A column that is much more variable than the others will dominate the distance measure.
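
One common way to scale is to center each column and divide it by its standard deviation. The Python sketch below illustrates that idea only. The name standardize_columns is hypothetical, the sample standard deviation (divisor n-1) is an assumed choice, and the data are assumed to have at least two rows and no constant columns.

```python
import math


def standardize_columns(rows):
    """Center each column at 0 and scale it to unit sample standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # assumption: no constant columns, otherwise the division below fails
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


# The second column is on a scale 100 times larger than the first and
# would dominate an unscaled distance.
rows = [[1, 100], [2, 200], [3, 300]]
scaled = standardize_columns(rows)
print(scaled)
```

After scaling, both columns contribute equally to the distance between any two rows.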

BACKGROUND:
Distance measures are used in cluster analysis and in multidimensional scaling. The choice of metric may have a large impact on the results of these analyses.

REFERENCES:
Everitt, B. (1980). Cluster Analysis (second edition). Halsted, New York.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, London.


SEE ALSO:
cmdscale, hclust, scale.

EXAMPLES:
# create a sample object
x <- votes.repub
dist(x,"max") # distances among rows by maximum
dist(t(x)) # distances among cols in Euclidean metric

# below is a function that converts a distance structure to a matrix
dist2full <- function(dis) {
        n <- attr(dis, "Size")
        full <- matrix(0, n, n)
        full[lower.tri(full)] <- dis
        full + t(full)
}