Principal Components Analysis

DESCRIPTION:
Finds a new coordinate system for multivariate data such that the first coordinate has maximal variance, the second coordinate has maximal variance subject to being orthogonal to the first, etc.

USAGE:
prcomp(x, retx=T)

REQUIRED ARGUMENTS:
x:
data matrix to be decomposed. The rows represent observations and the columns represent variables. Missing values are not accepted.

OPTIONAL ARGUMENTS:
retx:
logical flag: if TRUE, the rotated version of the data matrix is returned. Using retx=FALSE saves space in the returned data structure.

VALUE:
list describing the principal component analysis:
sdev:
standard deviations of the derived variables.
rotation:
orthogonal matrix describing the rotation. The first column is the linear combination of columns of x defining the first principal component, etc. This may have fewer columns than x. This is commonly called the "loadings"; it is not a rotation in the sense often used in factor analysis.
x:
the rotated version of x; i.e., the first column is the nrow(x) values for the first derived variable, etc. This may have fewer columns than x. This is returned only when retx=TRUE.

DETAILS:
The analysis will work even if nrow(x)<ncol(x), but in this case only nrow(x) variables will be derived, and the returned x will have only nrow(x) columns. In general, if any of the derived variables has zero standard deviation, that variable is dropped from the returned result.

The estimates are computed via the singular value decomposition of the input x. The standard deviations are the singular values divided by the square root of one less than the number of observations.
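
The relationship between singular values and standard deviations can be checked by hand. The following is a minimal sketch, assuming R, whose prcomp centers the columns by default (this may differ from the implementation documented here). The matrix dat is made-up illustration data, not part of the original text.

```r
# Hypothetical data: 20 observations on 3 variables.
set.seed(1)
dat <- matrix(rnorm(60), nrow = 20, ncol = 3)

# Center the columns, then take the singular value decomposition.
# nu = 0 skips the left singular vectors, which are not needed here.
centered <- scale(dat, center = TRUE, scale = FALSE)
s <- svd(centered, nu = 0)

# Standard deviations: singular values over sqrt(n - 1).
sdev.svd <- s$d / sqrt(nrow(dat) - 1)

# Compare with the values reported by R's prcomp (assumed to center).
sdev.pr <- prcomp(dat)$sdev
max.diff <- max(abs(sdev.svd - sdev.pr))
```

If the two computations agree, max.diff should be at the level of rounding error.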

If ret <- prcomp(dat), then ret$x == dat %*% ret$rotation up to numerical precision.
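
A short check of this identity, sketched under the assumption that R's prcomp is used. R's version centers the columns by default, so there the identity holds for the centered data; with center = FALSE it holds for dat itself. The argument center and the component ret.c$center belong to R's prcomp and are not described in this page.

```r
# Hypothetical data: 10 observations on 4 variables.
set.seed(2)
dat <- matrix(rnorm(40), nrow = 10, ncol = 4)

# Without centering, the rotated data equal dat %*% rotation.
ret <- prcomp(dat, center = FALSE)
err.raw <- max(abs(ret$x - dat %*% ret$rotation))

# With R's default centering, subtract the column means first.
ret.c <- prcomp(dat)
centered <- sweep(dat, 2, ret.c$center)
err.centered <- max(abs(ret.c$x - centered %*% ret.c$rotation))
```

Both err.raw and err.centered should be zero up to numerical precision.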


BACKGROUND:
Principal component analysis defines a rotation of the variables (columns) of x. The first derived direction is chosen to maximize the standard deviation of the derived variable, the second to maximize the standard deviation among directions uncorrelated with the first, etc.
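
The two properties above (maximal standard deviation along the first direction, and uncorrelated derived variables) can be illustrated numerically. This is only a sketch, assuming R's prcomp and simulated data; the random directions in dirs are a spot check, not a proof.

```r
# Hypothetical data: 50 observations, 4 variables with unequal spread.
set.seed(4)
dat <- matrix(rnorm(200), nrow = 50, ncol = 4) %*% diag(c(3, 2, 1, 0.5))
pr <- prcomp(dat)

# Project the centered data onto 1000 random unit directions.
centered <- scale(dat, center = TRUE, scale = FALSE)
dirs <- matrix(rnorm(4 * 1000), nrow = 4)
dirs <- sweep(dirs, 2, sqrt(colSums(dirs^2)), "/")
sd.random <- apply(centered %*% dirs, 2, sd)

# No random direction should beat the first principal component.
sd.first <- pr$sdev[1]
none.larger <- all(sd.random <= sd.first + 1e-10)

# The derived variables should be mutually uncorrelated.
cor.offdiag <- max(abs(cor(pr$x)[upper.tri(diag(4))]))
```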

Principal component analysis is often used as a data reduction technique, sometimes in conjunction with regression. Typically it is advisable to scale the columns of the input before performing the principal component analysis since a variable with large variance relative to the others will dominate the first principal component.
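
The effect of scaling can be seen in a small simulated example. This sketch assumes R (prcomp and scale as provided there); the variable names a, b and c and the data are invented for illustration.

```r
# Hypothetical data: variable a has a far larger variance than b and c.
set.seed(3)
n <- 50
dat <- cbind(a = rnorm(n, sd = 100), b = rnorm(n), c = rnorm(n))

# Unscaled: the first component is essentially variable a alone.
raw.pr <- prcomp(dat)
loading.a <- abs(raw.pr$rotation["a", 1])
share.raw <- raw.pr$sdev[1]^2 / sum(raw.pr$sdev^2)

# Scaled to unit variance: no single variable dominates by construction.
scaled.pr <- prcomp(scale(dat))
share.scaled <- scaled.pr$sdev[1]^2 / sum(scaled.pr$sdev^2)
```

Comparing share.raw with share.scaled shows how much of the apparent structure in the unscaled analysis is due only to the units of a.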


REFERENCES:
Many multivariate statistics books (and some regression texts) include a discussion of principal components. Below are a few examples:

Dillon, W. R. and Goldstein, M. (1984). Multivariate Analysis, Methods and Applications. Wiley, New York.

Johnson, R. A. and Wichern, D. W. (1982). Applied Multivariate Statistical Analysis. Prentice-Hall, Englewood Cliffs, New Jersey.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, London.


SEE ALSO:
svd, lsfit, cancor.

EXAMPLES:
  # principal components of the prim4 data
prim.pr <- prcomp(prim4)
  # plot of first and second principal components
plot(prim.pr$x[,1], prim.pr$x[,2])
  # variance explained by first k principal components
cumsum(prim.pr$sdev^2/sum(prim.pr$sdev^2))

  # scree plot
barplot(prim.pr$sdev^2/sum(prim.pr$sdev^2), density=20, ylim=c(0, .8),
	ylab="fraction of variance explained", xlab="principal component",
	names=as.character(1:4))