Stepwise Subset Selection for Multiple Regression

DESCRIPTION:
Uses stepwise procedures or an exhaustive search to assist in finding a parsimonious subset of explanatory variables to include in a least squares multiple regression.

USAGE:
stepwise(x, y, wt=<<see below>>, intercept=T, tolerance=1.e-07,
       method="efroymson", size.max=ncol(x), nbest=3, f.crit=2,
       xinclude, plot=T, time=0.02)

REQUIRED ARGUMENTS:
x:
matrix of explanatory variables. It can also be a data frame. Each column represents a variable and each row represents an observation (or case). This should not contain a column of ones unless the argument intercept is FALSE. The number of rows of x should equal the length of y, and there should be fewer columns than rows. Missing values are allowed. If x is a data frame, it is coerced into a numeric matrix, so factors are converted to numeric values using the function codes.
y:
a vector response variable. Missing values are allowed.

OPTIONAL ARGUMENTS:
wt:
vector of weights with length equal to the number of observations. If the different observations have non-equal variances, wt should be inversely proportional to the variance. By default, an unweighted regression is carried out. Missing values are allowed.
intercept:
if TRUE, a constant (intercept) term is included in each regression.
tolerance:
numerical value used to test for singularity in the regression.
method:
character string specifying which method to use. Possible values are "forward", "backward", "efroymson", or "exhaustive" for forward selection, backward elimination, Efroymson's forward stepwise method, and exhaustive search, respectively. Only enough of the string for a unique match needs to be given.
size.max:
integer specifying the maximum subset size for method="exhaustive" or method="forward". If ncol(x) is large (say > 35), then size.max should be set to a smaller value to make an exhaustive search feasible.
nbest:
integer specifying the number of "best" subsets to be found for each subset size. This is only relevant for method="exhaustive".
f.crit:
numerical value (or vector of 2 values) that specifies the F value(s) to be used as criteria for adding variables to, or deleting variables from, the subset when using Efroymson's method. If two values are provided, the first is the F-to-enter and the second is the F-to-delete.
xinclude:
logical vector of length ncol(x), with values set TRUE for each column of x that is to be forced into the subsets. This argument only works with method="exhaustive".
plot:
logical flag: if TRUE, and a graphics device is available, the residual sum of squares for each model is plotted against subset size.
time:
numerical value. For exhaustive searches, if ncol(x) > 12 then an estimate of time required for the computations will be made. If the estimated time for the search in hours is greater than this value, a message will be printed giving the estimated time. The estimates are very approximate.

VALUE:
a list representing the result of the search, with the following components:
rss:
vector with a residual sum of squares for each subset reported.
size:
vector showing the number of independent variables in each subset reported.
which:
logical matrix with as many rows as there are returned subsets. Each row is a logical vector that can be used to select the columns of x in the subset. For the forward method there are ncol(x) rows with subsets of size 1, ..., ncol(x). For the backward method there are ncol(x) rows with subsets of size ncol(x), ..., 1. For Efroymson's method there is a row for each step of the stepwise procedure. For the exhaustive search, there are nbest subsets for each size (if available). The row labels consist of the subset size with some additional information in parentheses. For the stepwise methods the extra information is +n or -n to indicate that the n-th variable has been added or dropped. For the exhaustive method, the extra information is #i where i is the subset number.
f.stat:
vector with the F statistics for testing the change (adding or dropping) made at this step. This component is not returned for the exhaustive method.
method:
string showing which method was used.
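
A row of the which component can be used directly as a column index into x. The following is a minimal sketch of that idea. It does not call stepwise: the matrix which.demo is a hand-made stand-in for the which component, and the data are simulated, so both are assumptions made for illustration only.

```r
# Simulated data: 20 observations, 3 explanatory variables (illustrative only).
set.seed(1)
x <- matrix(rnorm(60), ncol = 3)
y <- 2 * x[, 1] - x[, 3] + rnorm(20, sd = 0.1)

# Hand-made stand-in for the "which" component of a stepwise result:
# one row per subset, one column per explanatory variable.
which.demo <- rbind(c(TRUE, FALSE, FALSE), c(TRUE, FALSE, TRUE))

# Select the columns of x belonging to the second subset and refit with lsfit.
xsub <- x[, which.demo[2, ], drop = FALSE]
fit <- lsfit(xsub, y)
rss.sub <- sum(fit$residuals^2)
```

With a real result z from stepwise, z$which would take the place of which.demo.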

SIDE EFFECTS:

if plot=TRUE, a plot of residual sum of squares versus model size is created on the current graphics device.


DETAILS:
The forward selection procedure starts with an empty subset, and at each step adds the independent variable that gives the largest reduction in the residual sum of squares. The backward elimination procedure starts with the complete set of variables, and at each step drops the independent variable that gives the smallest increase in the residual sum of squares.
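
The forward selection rule described above can be sketched with lsfit. This is an illustrative sketch only, not the algorithm used by the underlying Fortran code; the names rss.of and forward.sketch and the simulated data are assumptions introduced here.

```r
# RSS of a least squares fit (with intercept) of y on the columns of x
# flagged TRUE in the logical vector keep.
rss.of <- function(x, y, keep) {
    sum(lsfit(x[, keep, drop = FALSE], y)$residuals^2)
}

# Forward selection: start with no variables and, at each step, add the
# variable whose inclusion gives the smallest residual sum of squares.
forward.sketch <- function(x, y) {
    p <- ncol(x)
    in.model <- rep(FALSE, p)
    added <- numeric(0)
    rss <- numeric(0)
    for (k in 1:p) {
        best.j <- 0
        best.rss <- Inf
        for (j in (1:p)[!in.model]) {
            trial <- in.model
            trial[j] <- TRUE
            r <- rss.of(x, y, trial)
            if (r < best.rss) {
                best.rss <- r
                best.j <- j
            }
        }
        in.model[best.j] <- TRUE
        added <- c(added, best.j)
        rss <- c(rss, best.rss)
    }
    list(added = added, rss = rss)
}

# Simulated data in which column 2 has the strongest effect.
set.seed(2)
x <- matrix(rnorm(200), ncol = 4)
y <- 3 * x[, 2] + x[, 4] + rnorm(50, sd = 0.5)
fs <- forward.sketch(x, y)
```

Here fs$added gives the order in which variables enter and fs$rss the residual sum of squares after each step. Backward elimination follows the same pattern in reverse, starting from all columns and dropping the variable whose removal increases the residual sum of squares least.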

Efroymson's stepwise method is like forward selection, except that when each new variable is added to the subset, partial correlations are considered to see if any of the variables in the subset should now be dropped.
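
The decisions to add or drop a variable in Efroymson's method are governed by F statistics compared with f.crit. The sketch below computes the usual partial F statistic for a single variable. The function name partial.f, the simulated data, and the choice of n - k - 1 residual degrees of freedom (which assumes an intercept) are assumptions made for illustration; stepwise may organise the computation differently.

```r
# Partial F statistic for one variable added to (or dropped from) a model.
# rss.small: RSS without the variable; rss.big: RSS with the variable;
# n: number of observations; k: number of explanatory variables in the
# larger model (an intercept is assumed, hence n - k - 1 residual df).
partial.f <- function(rss.small, rss.big, n, k) {
    (rss.small - rss.big) / (rss.big / (n - k - 1))
}

set.seed(3)
n <- 40
x <- matrix(rnorm(n * 2), ncol = 2)
y <- x[, 1] + rnorm(n)

rss0 <- sum((y - mean(y))^2)                             # intercept only
rss1 <- sum(lsfit(x[, 1, drop = FALSE], y)$residuals^2)  # column 1 added
rss2 <- sum(lsfit(x, y)$residuals^2)                     # columns 1 and 2

f.enter.1 <- partial.f(rss0, rss1, n, 1)  # F for adding column 1
f.enter.2 <- partial.f(rss1, rss2, n, 2)  # F for then adding column 2

# A variable enters only if its F statistic exceeds the F-to-enter value.
f.crit <- 2
enter.x1 <- f.enter.1 > f.crit
```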

The exhaustive search considers all possible subsets of each size up to size.max, and reports the nbest subsets of each size with the smallest residual sums of squares.
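
The idea of an exhaustive search can be sketched by brute force for a small number of variables. The sketch below keeps only the single best subset of each size (the analogue of nbest=1) and is for illustration only; it makes no claim about how the underlying Fortran code organises the search. The names subset.rss and exhaustive.sketch and the simulated data are assumptions.

```r
# RSS for a subset given as a logical vector over the columns of x.
subset.rss <- function(x, y, keep) {
    sum(lsfit(x[, keep, drop = FALSE], y)$residuals^2)
}

# Enumerate every non-empty subset through the binary digits of
# 1, ..., 2^p - 1 and keep the subset with the smallest RSS for each size.
exhaustive.sketch <- function(x, y) {
    p <- ncol(x)
    best.rss <- rep(Inf, p)
    best.which <- matrix(FALSE, p, p)
    for (i in 1:(2^p - 1)) {
        keep <- ((i %/% 2^(0:(p - 1))) %% 2) == 1
        size <- sum(keep)
        r <- subset.rss(x, y, keep)
        if (r < best.rss[size]) {
            best.rss[size] <- r
            best.which[size, ] <- keep
        }
    }
    list(rss = best.rss, which = best.which)
}

# Simulated data in which columns 1 and 3 carry the signal.
set.seed(4)
x <- matrix(rnorm(120), ncol = 4)
y <- x[, 1] - 2 * x[, 3] + rnorm(30, sd = 0.3)
ex <- exhaustive.sketch(x, y)
```

Row k of ex$which marks the best subset of size k. The number of subsets doubles with each extra column, which is why size.max and xinclude matter when ncol(x) is large.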

An observation is considered missing if there is a nonfinite value in the response variable, any explanatory variable or the weight (if present) for that observation. Such observations are dropped from the computations.
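
The rule for dropping observations can be sketched as follows. The function name complete.rows and the small data set are assumptions for illustration, and is.finite is assumed to be available (as it is in R).

```r
# Flag observations for which the response, every explanatory variable
# and the weight are all finite.
complete.rows <- function(x, y, wt = rep(1, length(y))) {
    ok <- is.finite(y) & is.finite(wt)
    for (j in 1:ncol(x)) ok <- ok & is.finite(x[, j])
    ok
}

x <- cbind(c(1, 2, NA, 4, 5), c(2, 1, 4, 3, Inf))
y <- c(1.1, NA, 2.9, 4.2, 5.0)

ok <- complete.rows(x, y)
x.used <- x[ok, , drop = FALSE]  # only observations 1 and 4 remain
y.used <- y[ok]
```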

This function is based on Fortran code written by Alan Miller, CSIRO Division of Mathematics and Statistics. His monograph (Miller, 1990) provides details of the methods used and advice on how to use these procedures.


BACKGROUND:
The stepwise function provides several methods for selecting regressions (sets of explanatory variables) for further study. As a first step one or more of the stepwise methods should be used, as these quickly indicate how many explanatory variables may be needed in the regression. Next, the exhaustive search may be useful to select the "best" set of explanatory variables. The stepwise function has an advantage over the leaps function in that it can search all subsets of size size.max, where size.max is less than ncol(x). Also it does not have the restriction that ncol(x) must be less than 32. Depending on the speed of the computer it is possible to search for subsets with size.max up to 30 or 35. Subsets larger than this may be handled if it is clear that some explanatory variables must be included, and the xinclude argument can specify this.

REFERENCES:

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis (second edition). New York: Wiley.

Gentleman, W. M. (1974). Basic procedures for large sparse or weighted least-squares. Applied Statistics 23, 448-454.

Miller, A. J. (1990). Subset Selection in Regression. Monographs on Statistics and Applied Probability 40, London: Chapman and Hall.

Miller, A. J. (1984). Selection of subsets of regression variables (with discussion). Journal of the Royal Statistical Society, Series A 147, 389-425.

Osborne, M. R. (1976). On the computation of stepwise regressions. Australian Computer Journal 8, 61-68.


SEE ALSO:
leaps, step, lsfit, ls.print, ls.diag.

EXAMPLES:
z1 <- stepwise(evap.x, evap.y)       # Efroymson's method (the default)
z2 <- stepwise(evap.x, evap.y, method="ex")  # exhaustive search