Contingency Table Analysis

DESCRIPTION:
Estimates test statistics and parameter values for a log-linear analysis of a multidimensional contingency table.

USAGE:
loglin(table, margin, start=<<see below>>, fit=F, eps=0.1,
       iter=20, param=F, print=T)

REQUIRED ARGUMENTS:
table:
contingency table (array) to be fit by a log-linear model. This is typically the output of the table function. Neither negative nor missing values are allowed. The number of dimensions of table must be less than or equal to 15.
margin:
list of vectors describing the marginal totals to be fit. A margin is described by the factors not summed over. Thus list(1:2, 3:4) would indicate fitting the 1,2 margin (summing over variables 3 and 4) and the 3,4 margin in a four-way table. The names of factors (i.e., names(dimnames(table))) may be used rather than indices.

OPTIONAL ARGUMENTS:
start:
starting estimate for the fitted table. If start is omitted, a start is used that will assure convergence. If structural zeros appear in table, start should contain zeros in the corresponding entries and ones elsewhere. This assures that the fit will contain those zeros.
fit:
logical flag: should the estimated fit be returned?
eps:
maximum permissible deviation between an observed and a fitted margin.
iter:
maximum number of iterations.
param:
logical flag: should the parameter values be returned? Setting this to FALSE saves computation as well as space.
print:
logical flag: if TRUE, the final deviation and number of iterations will be printed.

VALUE:
a list with components:
lrt:
the likelihood ratio test statistic. This is often called L squared or G squared in the literature, and is twice the discrimination information. It is defined as 2 * sum(observed * log(observed/expected)).
pearson:
the Pearson test statistic (X squared). It is defined as sum((observed - expected)^2/expected).
df:
the degrees of freedom for the model fit. There is no adjustment for zeros; the user must adjust for them.
margin:
list of the margins that were fit. This is basically the input margin except that the names of the factors are used if present.
fit:
array like table, but containing fitted values. This is returned only when the argument fit is TRUE.
param:
the estimated parameters of the model. They are parametrized so that the constant component describes the overall mean, each single-factor parameter sums to zero, each two-factor parameter sums to zero both by rows and by columns, etc. This is returned only when the argument param is TRUE.
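As a concrete illustration of the lrt and pearson components, the two statistics can be computed directly from the definitions above. The sketch below is Python rather than S, and loglin_stats is a hypothetical helper, not part of loglin; cells with an observed count of zero contribute nothing to the LRT sum, following the convention 0 * log(0) = 0.

```python
import math

def loglin_stats(observed, expected):
    """Goodness-of-fit statistics as defined in the VALUE section.

    observed, expected: flat sequences of cell counts of equal length.
    """
    # lrt: 2 * sum(observed * log(observed/expected)), skipping zero cells
    lrt = 2.0 * sum(o * math.log(o / e)
                    for o, e in zip(observed, expected) if o > 0)
    # pearson: sum((observed - expected)^2 / expected)
    pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return lrt, pearson

# a 2x2 table against its independence fit (row sums 30/70, col sums 40/60)
obs = [10, 20, 30, 40]
fit = [12.0, 18.0, 28.0, 42.0]
lrt, pearson = loglin_stats(obs, fit)
```

Note that the two statistics agree closely here, as they should when the fit is adequate and the counts are not too small.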

DETAILS:
The fit is produced by the Iterative Proportional Fitting algorithm as presented in Haberman (1972). Convergence is considered to be achieved if the maximum deviation between an observed and a fitted margin is less than eps. At most iter iterations will be performed. The fitting is currently done in single precision; other computations are in double precision.
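The algorithm itself is simple enough to sketch. The following Python sketch (hypothetical names, a dict keyed by index tuples in place of an S array) repeats one proportional adjustment per requested margin until the maximum margin deviation falls below eps, mirroring the eps and iter arguments:

```python
def margin_sums(tab, axes):
    """Sum a table (dict: index tuple -> count) over all axes NOT in
    `axes`, i.e. compute the margin described by the kept factors."""
    sums = {}
    for cell, value in tab.items():
        key = tuple(cell[a] for a in axes)
        sums[key] = sums.get(key, 0.0) + value
    return sums

def ipf(table, margins, eps=0.1, max_iter=20):
    """Iterative Proportional Fitting sketch (cf. Haberman, 1972)."""
    fit = {cell: 1.0 for cell in table}   # flat starting table of ones
    for _ in range(max_iter):
        max_dev = 0.0
        for axes in margins:
            observed = margin_sums(table, axes)
            current = margin_sums(fit, axes)
            # track the worst observed-vs-fitted margin deviation
            max_dev = max(max_dev,
                          max(abs(observed[k] - current[k]) for k in observed))
            for cell in fit:              # rescale so this margin matches
                key = tuple(cell[a] for a in axes)
                if current[key] > 0:
                    fit[cell] *= observed[key] / current[key]
        if max_dev < eps:
            break
    return fit

# 2x2 table; fitting margins (0,) and (1,) yields the independence fit
counts = {(0, 0): 10, (0, 1): 20, (1, 0): 30, (1, 1): 40}
fitted = ipf(counts, [(0,), (1,)])
```

For two one-way margins this converges to the familiar row-total * column-total / grand-total independence fit.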

The margins to be fit describe the model, much as the terms of an ANOVA model do. A high-order term automatically includes all the lower-order terms within it; e.g., the term c(1,3) includes the one-factor terms 1 and 3. A factor that had constraints in the sampling plan should always be included. For example, if the sampling plan was such that precisely x females and y males would be sampled, then gender should be in all models.
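The hierarchical-inclusion rule can be made explicit: a high-order term stands in for every non-empty subset of its factors. A small Python sketch (implied_terms is a hypothetical helper; margins are written as tuples rather than S vectors):

```python
from itertools import chain, combinations

def implied_terms(margin):
    """All terms implied by one high-order term of a hierarchical
    log-linear model: every non-empty subset of its factors."""
    factors = tuple(margin)
    subsets = chain.from_iterable(combinations(factors, r)
                                  for r in range(1, len(factors) + 1))
    return sorted(subsets)

# the margin c(1, 3) carries the one-factor terms 1 and 3 with it
terms = implied_terms((1, 3))
```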

Both the LRT and the Pearson test statistics are asymptotically distributed as chi-square with df degrees of freedom (assuming there are no zeros). A general rule of thumb is that the asymptotic distribution is trustworthy when the number of observations is at least 10 times the number of cells. If the two test statistics differ considerably, not much faith can be put in the test. Using the test statistics to select a model is a rather backwards use of hypothesis testing: a model can be "proved" wrong, but passing the test does not mean that the model is right. Bayesian techniques have been developed to select a good model (or models).
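Referring either statistic to the chi-square distribution on df degrees of freedom gives a p-value. The Python sketch below uses the closed-form upper-tail sum, which holds only for even df (a convenience for this sketch; odd df requires the incomplete gamma function or a table):

```python
import math

def chisq_sf(x, df):
    """Chi-square upper-tail probability P(X > x) for EVEN df:
    P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)**i / i!"""
    if df % 2 != 0:
        raise ValueError("this sketch handles even df only")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))

# a statistic of 9.49 on 4 degrees of freedom sits near the 5% point
p = chisq_sf(9.49, 4)
```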

The start argument can be used to produce analyses when the cells are assigned different weights; see Clogg and Eliason (1988). The start should be one over the weights.

A suggested analysis strategy is to use the default settings to narrow down the number of models, and then to set the fit and param options to TRUE in order to investigate the more promising models further.


BACKGROUND:
Log-linear analysis studies the relationship between a number of categorical variables, extending the idea of simply testing for independence of the factors. Typically the number of observations falling into each combination of the levels of the variables (factors) is modeled. The model, as the name suggests, is that the logarithm of the counts follows a linear model depending on the levels of the factors.

REFERENCES:
Clogg, C. C. and Eliason, S. R. (1988). Some Common Problems in Log-Linear Analysis. In Common Problems/Proper Solutions J. Scott Long, ed. Newbury Park, Calif.: SAGE.

Fienberg, S. E. (1980). The Analysis of Cross-Classified Categorical Data (2nd edition). Cambridge, Mass.: MIT Press.

Haberman, S. J. (1972). Log-linear fit for contingency tables-Algorithm AS51. Applied Statistics 21, 218-225.

Lunneborg, C. E. and Abbott, R. D. (1983). Elementary Multivariate Analysis for the Behavioral Sciences. New York: North-Holland.


SEE ALSO:
table, glim, Chisquare.

EXAMPLES:
loglin(barley.exposed, list("cultivar", "time", "cluster"))
 # model of independence

loglin(barley.exposed, list(1:2, c(1, 3))) # factors 2 and 3 are independent conditional on factor 1

bar.ci1 <- loglin(barley.exposed, list(1:2, c(1, 3)), param=T, fit=T)
 # return parameter values and the fit

(barley.exposed - bar.ci1$fit)/sqrt(bar.ci1$fit)
 # scaled residuals