Minimum Volume Ellipsoid Covariance Estimation

DESCRIPTION:
Returns a list containing estimates of the covariance matrix and of the mean vector for the data, and optionally of the correlation matrix. The cov.mve function returns weighted estimates with weights based on an approximate minimum volume ellipsoid estimate.

USAGE:
cov.mve(x, cor=F, print=T, popsize=<<see below>>,
  mutate.prob=c(0.15,0.2,0.2,0.2), random.n=<<see below>>,
  births.n=<<see below>>, stock=list(), maxslen=<<see below>>,
  stockprob=<<see below>>, nkeep=1)

REQUIRED ARGUMENTS:
x:
data matrix. Columns represent variables and rows represent observations. Missing values (NAs) are not accepted.

OPTIONAL ARGUMENTS:
cor:
logical flag: if TRUE, then the estimated correlation matrix will be returned as a component of the answer.
print:
logical flag: if TRUE, a message about the number of samples taken and the number of those samples that were singular will be printed.
popsize:
the population size of the genetic stock. The default is 10 times the number of variables.
mutate.prob:
length 4 vector of mutation probabilities for offspring. The first element is the probability of a mutation to one observation in the offspring. The second through fourth elements give the probability that the length of the offspring will be one shorter than the mother, one longer than the mother or a random length, respectively.
random.n:
the number of random samples taken after the stock is filled. The default is 50 times the number of variables.
births.n:
the number of genetic births. The default is 100 times p plus 20 times p squared, where p is the number of variables.
stock:
a list of vectors of observation numbers to be included in the stock. This is typically the stock component of the output of a previous call to the function.
maxslen:
the maximum number of observations (including duplicates) in a member of the stock. The default is p+1 if (n-p)/2 is less than p+1, where n is the number of observations, and it is the minimum of trunc((n-p)/2) and 5*p otherwise.
stockprob:
vector of cumulative probabilities that a member of the stock will be chosen as a parent. The ith element corresponds to the individual with the ith lowest objective. The default is cumsum((2 * (popsize:1))/popsize/(popsize + 1)).
nkeep:
the number of individuals in the stock to keep in the output.

VALUE:
a list giving the solution. Note that this is an approximation to the true solution based on a random algorithm, hence you will get (slightly) different answers each time you make the same call. The components of the returned list are:
cov:
the estimated covariance matrix.
center:
an estimate for the center (mean) of the data.
cor:
the estimated correlation matrix for the data. This is only returned if the input cor is TRUE.
method:
a character string that contains the number of samples taken and the number that were singular.
objective:
vector of the objective for each component of the output stock. These are in increasing order.
stock:
list of length nkeep containing vectors of observation numbers.
births.n:
the number of genetic births that were performed.

SIDE EFFECTS:
causes creation of the dataset .Random.seed if it does not already exist, otherwise its value is updated.

if print is TRUE, then a message is printed.


DETAILS:
Let n be the number of observations and p be the number of variables. The minimum volume ellipsoid covariance estimate is the covariance matrix that is defined by the ellipsoid with minimum volume of those ellipsoids that contain floor((n+p+1)/2) of the datapoints. It is too computationally intense to find the actual estimate, so an approximation is found.

A genetic algorithm, described in Burns (1992), is used. Individual solutions are defined by a set of observation numbers, which corresponds to a least squares solution with the specified observations. A stock of popsize individuals is produced by random sampling, then a number of random samples are taken and the best solutions are saved in the stock. During the genetic phase, two parents are picked which produce an offspring that contains a sample of the observations from the parents. The best two out of the three are retained in the stock. The best of all of the solutions found is used to compute the coefficients and the residuals. The standard random sampling algorithm can be used by setting popsize to one, maxslen to p+1 and births.n to zero.

The mutate.prob argument controls the mutation of the offspring. The length of the offspring is initially set to be the length of the first parent. This length is reduced by one, increased by one, or given a length uniformly distributed between p+1 and maxslen according to the last three probabilities in mutate.prob. The other type of mutation that can occur is for one of the observations of the offspring to be changed to an observation picked at random from among all of the observations; the probability of this mutation is specified by the first element of mutate.prob.

It is suggested that the number of observations be at least five times the number of variables. When there are fewer observations than this, there is not enough information to accurately determine if outliers exist.

The minimum volume ellipsoid is not allowed to have zero volume - singular covariance matrices from subsamples are ignored (except for being counted). If your data has a covariance matrix that is singular, cov.mve will fail because all of the covariance matrices of the subsamples will be singular. In this case, you will need to create some new data to give to cov.mve that does not have a singular covariance matrix; perhaps by using prcomp and deleting columns with zero variance.

Although the minimum volume ellipsoid covariance estimate has a very high breakdown point, it is inefficient. More efficiency can be attained while retaining the high breakdown point by performing a weighted covariance estimate with weights based on the minimum volume ellipsoid estimate. Such an estimate is what cov.mve returns. The Mahalanobis distance (computed using a scaling of the minimum volume ellipsoid covariance estimate) of each observation is compared to the Chisquare .975 quantile; those observations with smaller distances than this are given weight 1 and the others are given weight 0. The cov.wt function is then used with these weights. This was proposed in Rousseeuw and van Zomeren (1990).


BACKGROUND:
The minimum volume ellipsoid covariance estimator has a breakdown point that is almost one-half. That is, the estimate can not be made arbitrarily bad without changing about half of the data. A covariance matrix is considered to be arbitrarily bad if either a component goes to infinity (just as in the breakdown of a location or regression estimate), or if the matrix becomes deficient in rank. This is analogous to a scale estimate breaking down if the estimate is going either to infinity or to zero.

REFERENCES:
Burns, P. J. (1992). A genetic algorithm for robust regression estimation. (submitted).

Lopuhaa, H. P. and Rousseeuw, P. J. (1991). Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Annals of Statistics 19, 229-248.

Rousseeuw, P. J. (1991). A diagnostic plot for regression outliers and leverage points. Computational Statistics and Data Analysis 11, 127-129.

Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points (with discussion). Journal of the American Statistical Association 85, 633-651.

Woodruff, D. L. and Rocke, D. M. (1991). Computation of minimum volume ellipsoid estimates using heuristic search. manuscript.


SEE ALSO:
cov.wt , var , mahalanobis , prcomp .

EXAMPLES:
fr.cov1 <- cov.mve(freeny.x)

cov.mve(freeny.x, stock=fr.cov1$stock, births=1000)