All-Subset Regressions by Leaps and Bounds

DESCRIPTION:
Attempts to find the best regressions using a subset of the given explanatory variables.

USAGE:
leaps(x, y, wt=<<see below>>, int=T, method="Cp", keep.int=T,
      keep=<<see below>>, nbest=10, names=<<see below>>, df=nrow(x),
      dropint=T)

REQUIRED ARGUMENTS:
x:
matrix of explanatory variables. Each column of x is a variable and each row an observation. There must be at least 3 and no more than 31 columns. The matrix must be of full rank, and there must be fewer columns than rows. Missing values are not accepted.
y:
vector of the response variable with the same number of observations as the number of rows in x. Missing values are not accepted.

OPTIONAL ARGUMENTS:
wt:
vector of weights for the observations. Missing or negative values are not accepted. By default, unweighted regressions are performed. The weights are the same as in lsfit, that is, they should be inversely proportional to the variance.
int:
logical flag: should an intercept term be used in the regressions?
method:
character string describing the method used to evaluate a subset. The possible values are "Cp", "r2", and "adjr2", corresponding to Mallows' Cp statistic, R-squared, and adjusted R-squared, respectively. Only the first character need be supplied.
keep.int:
logical flag: should the intercept always be kept in the regression?
keep:
a vector of the names (or column numbers) of the variables that should always be kept in the regression. By default, any variable may be omitted from the regression. The intercept is handled separately, by the keep.int argument; see the sketch following these arguments.
nbest:
integer giving the number of "best" subsets to be found for each subset size. For the "r2" and "Cp" methods, the nbest subsets overall (of any size) are guaranteed to be included in the output (note that additional subsets will also be included).
names:
vector of character strings giving names for the independent variables. This must have length ncol(x). By default, the names are 1, 2, ... 9, A, B, ...
df:
degrees of freedom for y. Useful if, for example, x and y have already been adjusted for previous independent variables. The degrees of freedom used are decreased by 1 if int is TRUE.
dropint:
logical flag: should the column corresponding to the intercept be dropped from the returned component which?
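
As a sketch of how these arguments combine (the argument values here are illustrative, not defaults):

r <- leaps(x, y, method="adjr2", keep=2, nbest=5)
# the 5 best subsets of each size by adjusted R-squared,
# with column 2 of x kept in every subset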

VALUE:
list with five components giving information on the regression subsets. Components Cp (or adjr2 or r2), size, and label all have the same length, with one element per subset; this common length is the number of rows of which.
Cp,adjr2,r2:
the first returned component is named Cp, adjr2, or r2, depending on the method used to evaluate the subsets. It gives the value of the chosen statistic for each subset. If method is "r2" or "adjr2", the values are in percent.
size:
the number of explanatory variables (including the constant term if int is TRUE) in each subset.
label:
a vector of character strings, each element giving the names of the variables in the subset.
which:
logical matrix with as many rows as there are returned subsets. Each row is a logical vector that can be used to select the columns of x in the subset.
int:
logical value telling whether the which matrix contains, as its first column, the status of the intercept variable.
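
As a sketch of how these components fit together (assuming x and y as in USAGE; with the default dropint=T, the which matrix has one column per column of x):

r <- leaps(x, y)                  # method="Cp" by default
best <- order(r$Cp)[1]            # index of the subset with smallest Cp
r$label[best]                     # names of the variables in that subset
lsfit(x[, r$which[best, ]], y)    # refit the chosen subset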

DETAILS:
If you want the intercept term to be subject to selection as well, include a column of 1s in x and set int to FALSE.
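
For example (a sketch, with x and y as in USAGE):

r <- leaps(cbind(1, x), y, int=F)    # the column of 1s now competes
                                     # with the other variables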

BACKGROUND:
The leaps function provides a way to select a few promising regressions (sets of explanatory variables) for further study. By using robustness weights found by a robust regression on the full set of variables, the search can find suitable regressions even when outliers are present (see the Longley example in EXAMPLES below).

The best-known criterion for regression is the coefficient of determination (R-squared). This has definite limitations in the context of the leaps function, since the largest R-squared always belongs to the full set of explanatory variables. To take account of the number of parameters being fit, an adjusted R-squared can be used. The higher the adjusted R-squared (which, by the way, can be negative), the better. It has been noted, however, that the adjusted R-squared tends to favor large regressions over smaller ones.
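
In the standard definition (not specific to leaps), with r2 the ordinary R-squared as a proportion, n the number of observations, and p the number of parameters (including any intercept):

adjr2 <- 1 - (1 - r2) * (n - 1)/(n - p)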

Another method of selecting regressions is Mallows' Cp. Small values of Cp, close to or less than p (the number of parameters in the subset, including any intercept), are good.
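
A common form of the statistic, sketched here with illustrative names (rss.p is the residual sum of squares of a subset model with p parameters, s2.full the residual mean square of the regression on all the variables, and n the number of observations):

Cp <- rss.p/s2.full - (n - 2*p)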


REFERENCES:
Furnival, G. M. and Wilson, R. W. Jr. (1974). Regressions by Leaps and Bounds. Technometrics 16, 499-511.

Seber, G. A. F. (1977). Linear Regression Analysis. Wiley, New York.

Weisberg, S. (1985). Applied Linear Regression (second edition). Wiley, New York.


SEE ALSO:
stepwise, step, lsfit.

EXAMPLES:
r <- leaps(x, y)

lsfit(x[, r$which[3,]], y)    # regression corresponding to the third subset

longley.wt <- lmsreg(longley.x, longley.y)$wt    # robustness weights
longley.leap <- leaps(longley.x, longley.y, longley.wt,
      names=c("D", "G", "U", "A", "P", "Y"))
plot(longley.leap$size, longley.leap$Cp, type="n", ylim=c(0, 15))
text(longley.leap$size, longley.leap$Cp, longley.leap$label)
abline(0, 1)    # the line Cp = p
legend(2, 15, pch="DGUAPY", legend=dimnames(longley.x)[[2]])
title(main="Cp Plot for Longley Data")