Split a Dataset by Factors and Apply a Function to the Parts

DESCRIPTION:
by.data.frame takes a data frame and one or more index factors, each of which should have one entry for each row (observation) in the data frame. For each unique combination of factor values, it extracts the rows of the data frame whose indices have that combination and calls the function of your choice with those rows as its first argument.
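The split-and-apply behavior described above can be sketched as follows. This is a minimal illustration written against R's by(), which is assumed here to follow the same interface; the data frame df and its columns group and value are hypothetical names made up for the example.

```r
# Hypothetical toy data; the names are illustrative only.
df <- data.frame(
  group = factor(c("a", "a", "b", "b", "b")),
  value = c(1, 2, 3, 4, 5)
)

# FUN receives a data frame holding only the rows of one group.
res <- by(df, df$group, function(d) sum(d$value))

# Roughly the same computation done by hand: split the rows by the
# factor, then apply the function to each piece.
manual <- lapply(split(df, df$group), function(d) sum(d$value))

print(res)
```

The hand-written version returns a plain list, whereas by() attaches the class "by" and the index names used for printing.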

USAGE:
by(x, INDICES, FUN, ...)

REQUIRED ARGUMENTS:
x:
A data frame. Currently any x will be converted to a data.frame, but in the future there may be special methods for various classes of data.
INDICES:
A factor or list of several factors. The length of each factor should be the same as the number of rows of x. The elements of the categories define the position in a multi-way array corresponding to each x observation. Missing values (NAs) are allowed. The names of INDICES are used as the names of the dimnames of the result. If a vector is given, it will be treated as a list with one unnamed component.
FUN:
A function whose first argument is a data frame. FUN will be called once for each row subset of x determined by INDICES.

OPTIONAL ARGUMENTS:
...:
All other arguments will be passed to FUN each time it is called.
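As a hedged illustration of how extra arguments reach FUN, the sketch below passes two named arguments through "...". It assumes R's by(); the helper col_mean and the data frame df are hypothetical and exist only for this example.

```r
# Hypothetical data with one missing value in group "a".
df <- data.frame(
  g = factor(c("a", "a", "b", "b")),
  x = c(1, NA, 3, 5)
)

# Hypothetical helper: mean of one named column of a data frame.
col_mean <- function(d, column, na.rm = FALSE) {
  mean(d[[column]], na.rm = na.rm)
}

# column and na.rm are not arguments of by(); they are forwarded
# unchanged to col_mean on every call.
res <- by(df, df$g, col_mean, column = "x", na.rm = TRUE)
print(res)
```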

VALUE:
An object of class "by" is returned. This consists of an array of mode "list" with one dimension for each index in INDICES, the dimension being the number of levels in that index. The dimnames of the object give the levels of the indices and the names of the dimnames give the names of the indices. If the list given as INDICES has no names then by() will try to make up some reasonable names. If there are no observations corresponding to some elements of the array, those elements will have the value NULL (FUN will not be called for those empty cells).

This object is intended to be printed by print.by, the print method for objects of class "by". For each cell in the array it prints the value of each index and then the value of the cell. It prints a separator line, a series of dashes by default, between the cells.
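The structure of the returned object can be inspected directly. The sketch below assumes R's by() and passes simplify = FALSE, an argument of the R version that is not listed in the usage above, so that the result stays an array of mode "list". The data frame df and the index names First and Second are hypothetical.

```r
# Hypothetical data: the combination (b, y) has no observations.
df <- data.frame(
  f1 = factor(c("a", "a", "b")),
  f2 = factor(c("x", "y", "x")),
  v  = c(10, 20, 30)
)

# Two named index factors give a two-dimensional result.
res <- by(df, list(First = df$f1, Second = df$f2),
          function(d) d$v, simplify = FALSE)

dim(res)              # one dimension per index factor
names(dimnames(res))  # names taken from the INDICES list
res[["a", "x"]]       # value of FUN for the cell (a, x)
is.null(res[["b", "y"]])  # empty cell: FUN was never called
```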


DETAILS:
by() is a convenient, object-oriented version of tapply().
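To make the relationship concrete, the sketch below computes the same group means both ways. It assumes R's by() and tapply(); the data frame df is hypothetical. The practical difference is that tapply() hands FUN a vector, while by() hands it the whole row subset as a data frame.

```r
# Hypothetical toy data.
df <- data.frame(
  g = factor(c("a", "a", "b")),
  x = c(2, 4, 10)
)

# tapply() splits a single vector by the factor.
tap_res <- tapply(df$x, df$g, mean)

# by() splits the data frame itself, so FUN can use any column.
by_res <- by(df, df$g, function(d) mean(d$x))

# Same numbers; only the class and printing of the result differ.
all.equal(as.numeric(tap_res), as.numeric(by_res))
```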

SEE ALSO:
lapply, sapply, tapply.

EXAMPLES:
by(kyphosis, kyphosis$Kyphosis, summary)

# Gives the following output:
kyphosis$Kyphosis:absent
    Kyphosis       Age              Number         Start
 absent :64   Min.   :  1.00   Min.   :2.00   Min.   : 1.00
 present: 0   1st Qu.: 18.00   1st Qu.:3.00   1st Qu.:11.00
              Median : 79.00   Median :4.00   Median :14.00
              Mean   : 79.89   Mean   :3.75   Mean   :12.61
              3rd Qu.:131.00   3rd Qu.:5.00   3rd Qu.:16.00
              Max.   :206.00   Max.   :9.00   Max.   :18.00
------------------------------------------------------------
kyphosis$Kyphosis:present
    Kyphosis       Age              Number            Start
 absent : 0   Min.   : 15.00   Min.   : 3.000   Min.   : 1.000
 present:17   1st Qu.: 73.00   1st Qu.: 4.000   1st Qu.: 5.000
              Median :105.00   Median : 5.000   Median : 6.000
              Mean   : 97.82   Mean   : 5.176   Mean   : 7.294
              3rd Qu.:128.00   3rd Qu.: 6.000   3rd Qu.:12.000
              Max.   :157.00   Max.   :10.000   Max.   :14.000

by(kyphosis, list(Kyphosis = kyphosis$Kyphosis, Older = kyphosis$Age > 105), function(data) lm(Number ~ Start, data = data))

# Gives the following output:
Kyphosis:absent
Older:FALSE
Call:
lm(formula = Number ~ Start, data = data)

Coefficients:
 (Intercept)       Start
    4.885736 -0.08764492

Degrees of freedom: 39 total; 37 residual
Residual standard error: 1.261852
------------------------------------------------------------
Kyphosis:present
Older:FALSE
Call:
lm(formula = Number ~ Start, data = data)

Coefficients:
 (Intercept)      Start
    6.371257 -0.1191617

Degrees of freedom: 9 total; 7 residual
Residual standard error: 1.170313
------------------------------------------------------------
Kyphosis:absent
Older:TRUE
...