Least Median of Squares Regression

DESCRIPTION:
Performs least median of squares regression of y on x. The number of samples to be taken can be specified. Both x and y may be either a vector or a matrix. This function is deprecated; use ltsreg instead.

USAGE:
lmsreg(x, y, samples=<<see below>>, intercept=T, wt=T,
        diagnostic=F, yname=NULL)

REQUIRED ARGUMENTS:
x:
vector or matrix of explanatory variables. Rows of the matrix represent observations and columns represent variables. A constant (intercept) term should not be included as a column of x; a better answer is usually achieved by removing such a column and setting intercept=TRUE instead.
Missing values (NA) are allowed.
y:
vector or matrix whose columns represent response variables.
Missing values (NA) are allowed. Observations (rows) with missing values in either x or y are excluded from the computations.

OPTIONAL ARGUMENTS:
samples:
either a positive integer or the character string "all". If numeric, this specifies the number of non-singular random subsamples of the observations to be used. If samples="all", then all subsamples are found. Note that this can be a very large number even for quite small datasets: there are n-choose-p subsamples, where n is the number of observations and p is the number of explanatory variables (an example of this growth is given after this list). The default is to take all of the subsamples if there are fewer than 3000, and to find 3000 non-singular random samples otherwise.
intercept:
logical flag: should a constant (intercept) term be included?
wt:
logical flag: should weights computed by lmsreg be returned? These weights can be used in lsfit to obtain a weighted least squares solution.
diagnostic:
logical flag: if TRUE, resistant diagnostics are returned.
yname:
vector of character strings of the names of variables in y.
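
The count of possible subsamples mentioned under samples grows very quickly with n and p. Assuming a choose function is available (as in R), the following lines give a feel for this growth:

choose(21, 4)       # 5985 subsamples for 21 observations and 4 explanatory variables
choose(50, 4)       # 230300 subsamples
choose(100, 5)      # over 75 million subsamples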

VALUE:
a list containing some or all of the following components:
coef:
vector or matrix of coefficient estimates. This is a matrix when there is more than one regression; columns represent the regressions and rows represent the explanatory variables.
scale:
residual scale estimate for each column of y: the square root of the estimated variance of the residuals that have weight 1.
rsquared:
robust version of R squared for each column of y. This is 1 minus the square of the ratio of the minquan-th order statistic of the absolute residuals to the minquan-th order statistic of the absolute value of y - median(y). Here minquan is the order statistic being minimized (see DETAILS), and median(y) is not subtracted from y if intercept=FALSE. A sketch of this calculation appears after this list.
objective:
the minimum value found of the minquan-th order statistic of the absolute value of the residuals for each column of y, where minquan is the order statistic that is being minimized.
resid:
object like y containing the residuals from the regression. Rows corresponding to missing values in x or y have NAs.
wt:
object like y of weights for use in weighted least squares if wt is TRUE. These weights are 1 for points with reasonably small residuals and are 0 for points with large residuals.
intercept:
same as the input or default intercept.
diagnostic:
object like y of resistant diagnostics, returned only if diagnostic is TRUE. For each trial, the absolute values of the residuals are divided by the objective for that trial; the maximum of these numbers is retained for each observation. The diagnostic component consists of these maxima divided by their median, computed separately for each response variable. If diagnostic=FALSE, then computation as well as space is saved.
method:
character string giving the method (Least Median of Squares), the number of subsamples, the size of the subsamples, and the number of subsamples leading to a singular system of equations.
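
As a rough illustration of how the rsquared, objective, and diagnostic components fit together, the following sketch is hypothetical code (not taken from lmsreg) for a single response y. It assumes r holds the final residuals, restrial is a matrix of the absolute residuals with one column per trial, obj is the vector of per-trial objectives, and minquan is the order statistic defined under DETAILS:

objective <- sort(abs(r))[minquan]        # minquan-th smallest absolute residual
rsquared <- 1 - (objective / sort(abs(y - median(y)))[minquan])^2

ratio <- sweep(restrial, 2, obj, "/")     # scale each trial's |residuals| by its objective
mx <- apply(ratio, 1, max)                # largest scaled residual for each observation
dgn <- mx / median(mx)                    # the diagnostic component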

DETAILS:
If there are p explanatory variables (including the intercept term, if present), then a large number of subsamples of p observations is taken. Each of these subsamples is used to get a trial set of coefficients, and the best set of coefficients for each regression is retained. "Best" in this context means the coefficients such that the minquan-th order statistic of the absolute value of the residuals is smallest, where minquan = floor(n/2) + floor((p+1)/2), and n is the number of observations. If intercept=TRUE, then the least median of squares location estimate is performed on the final residuals. (Because the location estimate is not performed on each trial, it is possible, though very unlikely, that taking more samples can result in a worse answer; generally, the more samples the better the answer becomes.)
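
The search just described can be sketched in a few lines of S-style code. This is illustrative only, not the actual implementation: x is assumed to already contain a column of ones when intercept=TRUE, nsamp plays the role of the samples argument, and the final location adjustment of the residuals is omitted.

n <- nrow(x); p <- ncol(x)
minquan <- floor(n/2) + floor((p+1)/2)
best <- Inf
for(i in 1:nsamp) {
        sub <- sample(1:n, p)                   # one random subsample of p observations
        if(qr(x[sub, , drop=F])$rank < p) next  # skip singular subsamples
        cf <- solve(x[sub, ], y[sub])           # exact fit to the p points
        crit <- sort(abs(y - x %*% cf))[minquan]
        if(crit < best) { best <- crit; coef <- cf }
}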

Since specifying samples="all" can easily be a request for millions of samples, the maximum number of non-singular samples is limited to 30,000. This limit can be changed by editing the function.

The lmsreg function has a built-in random number generator that starts with the same seed on each call to lmsreg. Thus the same subsamples and hence the same answer will be found by similar calls. The default value of 3000 random samples will give greater than 99% probability of a 50% breakdown point for problems with 9 or fewer explanatory variables. The probability of a high breakdown drops sharply as the number of explanatory variables grows beyond 10.
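
The probability quoted above is consistent with the standard subsampling argument: if a fraction f of the data is contaminated, a random subsample of p observations is outlier-free with probability (1 - f)^p, so the chance that at least one of m subsamples is outlier-free is 1 - (1 - (1 - f)^p)^m. For example:

1 - (1 - (1 - 0.5)^9)^3000      # about 0.997 for p = 9, f = 0.5, m = 3000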


BACKGROUND:
Rather than minimizing the sum of the squared residuals as least squares regression does, least median of squares minimizes the median of the squared residuals. Actually, it is not precisely the median that is minimized but rather a certain order statistic of the squared (or absolute) residuals.

Least median of squares regression has a very high breakdown point of almost 50%. That is, almost half of the data can be corrupted in an arbitrary fashion and the least median of squares estimates continue to follow the majority of the data. At the present time this property is virtually unique among the robust regression methods that are publicly available, including the methods in rreg.

There are, however, two disadvantages to least median of squares. First, there is no known feasible algorithm to compute the exact least median of squares estimate in most problems; thus lmsreg can only offer a high probability of a 50% breakdown point for all but the smallest problems. Second, least median of squares is statistically very inefficient; one remedy is to use lsfit with the weights returned by lmsreg in a weighted least squares regression (see the EXAMPLES section). This procedure gives high breakdown estimates that are also quite efficient. The test statistics derived from the least squares regression will not be strictly correct, but can be used informally.


REFERENCES:
Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association 79, 871-880.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. New York: Wiley.


NOTE:
This function is deprecated in favor of ltsreg. The least trimmed squares objective is statistically more efficient than the least median of squares objective, and ltsreg approximates its minimum much more efficiently. However, ltsreg does not allow missing values or multiple responses.

SEE ALSO:
ltsreg, rreg, l1fit, lsfit.

EXAMPLES:
stacklms <- lmsreg(stack.x, stack.loss, samples="all")
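
The weights can then be passed to lsfit for the reweighted least squares follow-up described in the BACKGROUND section (the object name stackwls is arbitrary):

stackwls <- lsfit(stack.x, stack.loss, wt=stacklms$wt)
stackwls$coef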