Variance, Covariance, and Correlation

DESCRIPTION:
Returns the variance of a vector, the variance-covariance (or correlation) matrix of a data matrix, or covariances between matrices or vectors. A trimming fraction may be specified for correlations.

USAGE:
var(x, y=x)
cor(x, y=x, trim=0)

REQUIRED ARGUMENTS:
x:
numeric matrix or vector. (May be complex for var.) If a matrix, columns represent variables and rows represent observations. Missing values are not accepted.

OPTIONAL ARGUMENTS:
y:
numeric matrix or vector. (May be complex for var.) If a matrix, columns represent variables and rows represent observations. This must have the same number of observations as x. Missing values are not accepted.
trim:
a number less than .5 giving the proportion trimmed in the internal calculations for cor. This should be larger than the suspected fraction of outliers; see the example below.
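For instance, a small usage sketch (the value .2 and the use of the longley data are illustrative choices, not part of the documented examples):

cor(longley.x[, 1], longley.y, trim = .2)   # correlation with the fraction .2 of points trimmed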

VALUE:
cor returns correlations, and var returns variances (and covariances).

If y is missing, the result is the variance of x when x is a vector, or the covariance (correlation) matrix of the columns of x when x is a matrix. When y is present and x or y is a matrix, the result is a covariance or correlation matrix in which the [i,j] element corresponds to the ith column of x and the jth column of y.
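As a small illustrative check of this indexing, assuming the longley data used in the EXAMPLES section (the element-by-element equality below is implied by the description, not stated elsewhere in this file):

V <- var(longley.x, longley.y)   # 6 by 1 matrix of covariances
V[2, 1]                          # covariance of column 2 of longley.x with longley.y
var(longley.x[, 2], longley.y)   # should give the same value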


DETAILS:
Variances are sample variances:
sum((x - mean(x))^2) / (length(x) - 1)
or, if x is complex,
sum(abs(x - mean(x))^2) / (length(x) - 1)
Untrimmed correlations are computed from these variances. Correlations are computed to single-precision accuracy.
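As a quick check of the real-valued formula (the vector z below is made up for illustration):

z <- c(1, 4, 6, 8, 11)
sum((z - mean(z))^2) / (length(z) - 1)   # explicit sample variance
var(z)                                   # should agree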

S-PLUS defines the sample variance so as to provide an unbiased estimate of the population variance; many texts use the maximum likelihood estimate, which for the one-sample case is
sum((x - mean(x))^2) / length(x)
This alternative definition yields a biased estimate of the population variance.
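Continuing the illustration above (again using the made-up vector z), the maximum likelihood estimate can be obtained by rescaling:

n <- length(z)
sum((z - mean(z))^2) / n   # maximum likelihood (biased) estimate
var(z) * (n - 1) / n       # the same value, obtained by rescaling var()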

There is much discussion in the statistical literature concerning methods for missing values; an example is chapter 10 of Weisberg (1985). Covariances are particularly sensitive to how missing values are treated.

Trimmed correlations are computed by the standardized sums and differences method: each variable is divided by a trimmed standard deviation, and for each pair of variables v(s) is the trimmed variance of the sum of the standardized variables and v(d) is the trimmed variance of their difference. The correlation is then (v(s) - v(d)) / (v(s) + v(d)). Trimming, whether for means, variances or standard deviations, always rejects the fraction trim/2 of the smallest points and the fraction trim/2 of the largest points. See Gnanadesikan and Kettenring (1972), Huber (1981, pp. 202-203), or Gnanadesikan (1977, p. 132).
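The following is a rough sketch of that method for a pair of vectors. The helper functions trim.var and trim.cor are illustrative names, not functions supplied with S-PLUS, and details such as the exact trimming rule and single-precision arithmetic may differ from the internal calculation:

trim.var <- function(x, trim = 0)
{
	# trimmed sample variance: reject the fraction trim/2 of the smallest
	# and trim/2 of the largest points, then take the ordinary variance
	x <- sort(x)
	n <- length(x)
	drop <- floor(n * trim / 2)
	if(drop > 0) x <- x[(drop + 1):(n - drop)]
	var(x)
}
trim.cor <- function(x, y, trim = 0)
{
	xs <- x / sqrt(trim.var(x, trim))   # standardize by trimmed standard deviations
	ys <- y / sqrt(trim.var(y, trim))
	vs <- trim.var(xs + ys, trim)       # trimmed variance of the sum
	vd <- trim.var(xs - ys, trim)       # trimmed variance of the difference
	(vs - vd) / (vs + vd)
}
trim.cor(longley.x[, 1], longley.y, trim = .2)   # compare with cor(..., trim = .2)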


REFERENCES:
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. Wiley, New York.

Gnanadesikan, R. and Kettenring, J. R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28, 81-124.

Huber, P. J. (1981). Robust Statistics. Wiley, New York.

Weisberg, S. (1985). Applied Linear Regression (2nd Edition). Wiley, New York.


SEE ALSO:
cov.mve, mad, mean.

EXAMPLES:
cor(cbind(longley.x, longley.y))   # 7 by 7 correlation matrix for longley data
var(longley.x, longley.y) # 6 by 1 matrix of covariances

sd.x <- sqrt(var(x)) # standard deviation of a vector

stan.dev <- function(x) sqrt(var(as.vector(x)))

stan.dev(freeny.y)