Fit Linear Regression Model

DESCRIPTION:
Returns an object of class "lm" or "mlm" that represents a fit of a linear model.

USAGE:
lm(formula, data=<<see below>>, weights=<<see below>>,
    subset=<<see below>>, na.action=na.fail, method="qr", model=F,
    x=F, y=F, contrasts=NULL, ...)

REQUIRED ARGUMENTS:
formula:
a formula object, with the response on the left of a ~ operator, and the terms, separated by + operators, on the right.

OPTIONAL ARGUMENTS:
data:
a data.frame in which to interpret the variables named in the formula and in the subset and weights arguments. If this is missing, the variables in the formula should be on the search list. This may also be a single number to handle some special cases; see DETAILS below.
weights:
vector of observation weights; if supplied, the algorithm minimizes the weighted sum of squared residuals, that is, the sum of each weight multiplied by the corresponding squared residual. The length of weights must equal the number of observations. The weights must be nonnegative, and it is strongly recommended that they be strictly positive, since a zero weight is ambiguous compared with excluding the observation through the subset argument.
subset:
expression saying which subset of the rows of the data should be used in the fit. This can be a logical vector (which is replicated to have length equal to the number of observations), or a numeric vector indicating which observation numbers are to be included, or a character vector of the row names to be included. All observations are included by default.
na.action:
a function to filter missing data. This is applied to the model.frame after any subset argument has been used. The default (with na.fail) is to create an error if any missing values are found. A possible alternative is na.omit, which deletes observations that contain one or more missing values.
method:
the least squares fitting method to be used; the default is "qr". The method "model.frame" simply returns the model frame.
model:
logical flag: if TRUE, the model frame is returned in component model.
x:
logical flag: if TRUE, the model matrix is returned in component x.
y:
logical flag: if TRUE, the response is returned in component y.
qr:
logical flag: if TRUE, the QR decomposition of the model matrix is returned in component qr.
contrasts:
a list giving contrasts for some or all of the factors appearing in the model formula. The elements of the list should have the same name as the variable and should be either a contrast matrix (specifically, any full-rank matrix with as many rows as there are levels in the factor), or else a function to compute such a matrix given the number of levels.
...:
additional arguments for the fitting routines. The most likely one is singular.ok=T, which instructs the fitting to continue in the presence of singular (rank-deficient, over-parameterized) models. The default method recognizes this argument, but new fitting methods are not required to do so.
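
As a rough illustration of the weights, subset, and na.action arguments described above, the sketch below fits three small models. The data frame and its column names (dose, resp, n) are made up for this example, and the code is restricted to syntax that S-PLUS and R are assumed to share.

```r
# Hypothetical data: dose, resp and n are illustrative names only.
dat <- data.frame(dose = c(1, 2, 3, 4, 5, 6),
                  resp = c(2.1, 3.9, 6.2, 7.8, 10.1, 30.0),
                  n    = c(10, 12, 8, 15, 9, 1))

# Weighted fit: minimizes sum(n * residual^2).
fit.w <- lm(resp ~ dose, data = dat, weights = n)

# Exclude the last observation with subset instead of giving it weight zero.
fit.s <- lm(resp ~ dose, data = dat, weights = n, subset = dose < 6)

# With a missing response, na.omit drops the affected row
# rather than stopping with an error as na.fail would.
dat2 <- dat
dat2$resp[2] <- NA
fit.na <- lm(resp ~ dose, data = dat2, na.action = na.omit)
```

Using subset makes the exclusion explicit, which avoids the ambiguity of zero weights noted under the weights argument.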

VALUE:
an object of class "lm" or "mlm" representing the fit. See lm.object for details.

DETAILS:
The formula argument is passed around unevaluated; that is, the variables mentioned in the formula will be defined when the model frame is computed, not when lm is initially called. In particular, if data is given, all these names should generally be defined as variables in that data frame.

The subset argument, like the terms in formula, is evaluated in the context of the data frame, if present. The specific action of the argument is as follows: the model frame, including weights and subset, is computed on all the rows, and then the appropriate subset is extracted. A variety of special cases make such an interpretation desirable (e.g., the use of lag or other functions that may need more than the data used in the fit to be fully defined). On the other hand, if you meant the subset to avoid computing undefined values or to escape warning messages, you may be surprised. For example, lm(y ~ log(x), mydata, subset = x > 0) will still generate warnings from log. If this is a problem, do the subsetting on the data frame directly: lm(y ~ log(x), mydata[mydata$x > 0, ])
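
The two calls in the paragraph above can be tried on a small data set. This is only a sketch: mydata here is a made-up data frame with one negative x value. The first call evaluates log(x) on every row before subsetting, so a warning about the negative value is expected. The second call never sees that row.

```r
# Hypothetical data with one non-positive x value.
mydata <- data.frame(x = c(-1, 1, 2, 3, 4),
                     y = c(0, 0.1, 0.8, 1.0, 1.5))

# log(x) is computed on all rows first, then the subset is extracted;
# the row with x = -1 still reaches log and triggers a warning.
fit1 <- lm(y ~ log(x), mydata, subset = x > 0)

# Subsetting the data frame directly keeps the bad row away from log.
fit2 <- lm(y ~ log(x), mydata[mydata$x > 0, ])
```

Both fits use the same four observations, so their coefficients should agree; only the warning behavior differs.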

Generic functions such as print and summary have methods to show the results of the fit. See lm.object for the components of the fit, but use the functions residuals, coefficients, and effects rather than extracting the components directly, since these functions take correct account of special circumstances, such as singular (rank-deficient) models.
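
A minimal sketch of the extractor functions named above, using made-up data (the names d, x, and y are illustrative only):

```r
# Hypothetical data for a simple straight-line fit.
d <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                y = c(1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1))

fit <- lm(y ~ x, data = d)

b <- coefficients(fit)   # intercept and slope
r <- residuals(fit)      # one residual per observation
e <- effects(fit)        # effects from the orthogonal decomposition
```

Going through these functions, rather than fit$coefficients and the like, keeps calling code independent of how the components happen to be stored.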

The response may be a single numeric variable or a matrix. In the latter case, coefficients, residuals, and effects will also be matrices, with columns corresponding to the response variables. In either case, the object inherits from class "lm". For multivariate response, the first element of the class is "mlm".
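
For a matrix response, a sketch along the following lines shows the shape of the result. The variables x and Y are made up for this example and live at the top level rather than in a data frame.

```r
# Hypothetical predictor and two-column response matrix.
x <- c(1, 2, 3, 4, 5, 6)
Y <- cbind(y1 = c(1.1, 2.0, 2.9, 4.2, 5.0, 6.1),
           y2 = c(0.5, 0.4, 0.7, 0.6, 0.9, 0.8))

fit.m <- lm(Y ~ x)

class(fit.m)                # first element should be "mlm"
dim(coefficients(fit.m))    # one column per response variable
```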

NAMES. Variables occurring in a formula are evaluated differently from arguments to S-PLUS functions, because the formula is an object that is passed around unevaluated from one function to another. The functions such as lm that finally arrange to evaluate the variables in the formula try to establish a context based on the data argument. (More precisely, the function model.frame.default does the actual evaluation, assuming that its caller behaves in the way described here.) If the data argument to lm is missing or is an object (typically, a data frame), then the local context for variable names is the frame of the function that called lm, or the top-level expression frame if the user called lm directly. Names in the formula can refer to variables in the local context as well as global variables or variables in the data object.

The data argument can also be a number, in which case that number defines the local context. This can arise, for example, if a function is written to call lm, perhaps in a loop, but the local context is definitely not that function. In this case, the function can set data to sys.parent(), and the local context will be the next function up the calling stack. See the last example below. A numeric value for data can also be supplied if a local context is being explicitly created by a call to new.frame. Notice that supplying data as a number implies that this is the only local context; local variables in any other function will not be available when the model frame is evaluated. This is potentially subtle. Fortunately, it is not something the ordinary user of lm needs to worry about. It is relevant for those writing functions that call lm or other such model-fitting functions.


REFERENCES:
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. Wiley, New York.

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis. Second Edition. Wiley, New York.

Myers, R. H. (1986). Classical and Modern Regression with Applications. Duxbury, Boston.

Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression and Outlier Detection. Wiley, New York.

Seber, G. A. F. (1977). Linear Regression Analysis. Wiley, New York.

Weisberg, S. (1985). Applied Linear Regression. Second Edition. Wiley, New York.

There is a vast literature on regression; the references above are just a small sample of what is available. The book by Myers is an introductory text that discusses many of the recent advances in regression technology. The Seber book is at a higher mathematical level and covers much of the classical theory of least squares.


SEE ALSO:
lm.object, model.matrix, glm, gam, loess, tree.

EXAMPLES:
lm(Fuel ~ . , fuel.frame)

lm(cost ~ age + type + car.age, claims, weights = number, na.action = na.omit)

lm(freeny.y ~ freeny.x)

# myfit calls lm, using the caller to myfit
# as the local context for variables in the formula
# (see aov for an actual example)
myfit <- function(formula, data = sys.parent(), ...) {
    .. ..
    fit <- lm(formula, data, ...)
    .. ..
}