Merge Two Datasets and Match Columns

DESCRIPTION:
Takes two data frames and the names or numbers of a set of columns in each of them to match (the by columns). It returns a new data frame which has a row for each pair of rows in x and y whose by columns have the same values. This row will contain all the columns in both x and y, except that only one copy of the by columns will appear. (In database language this is called the "join" of two relations.) You may have one to one, many to one, or many to many matching. Note that x and y need not have similar dimensions, but the columns to match by should contain similar data.

In some cases you may want to include all the rows in one or both data sets in the output, even if there is not a matching row in the other. The all.x and all.y arguments let you do this. In the rows without matches, merge puts NAs in the columns with no matching data.

This is a generic function but currently the only substantive method for it works on data frames. The default method converts x and y to data frames and calls the method for data frames.


USAGE:
merge(x, y,
      by = intersect(names(x), names(y)),
      by.x = by, by.y = by,
      all = F, all.x = all, all.y = all,
      suffixes = c(".x", ".y"))

REQUIRED ARGUMENTS:
x:
a data frame, or something to be converted into a data frame.
y:
a data frame, or something to be converted into a data frame.

OPTIONAL ARGUMENTS:
by:
a vector of columns to match by. This can be a vector of column names, column numbers, or a logical vector with a T or F for each column, telling which columns to match by. The special name row.names means to match by the row names of the data frames. In that case a new column will be formed called Row.names. If you supply by it will be used for both x and y. If the by columns have different names or locations in x and y then use by.x and by.y. The default value is the vector of column names that are common to x and y.
by.x:
See by.
by.y:
See by.
all:
Shorthand for all.x=T and all.y=T.
all.x:
a logical value. If TRUE then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NA's in the columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
all.y:
a logical value. This is analogous to all.x, controlling when the output contains rows for y rows with no matching x row.
suffixes:
A character vector containing two distinct strings. If x and y have some column names in common and those columns are not used for matching, then the output would have two columns with the same name, which is not allowed for data frames. Merge will paste suffixes onto these repeated column names to make them unique. The default is c(".x",".y").

VALUE:
a data frame with the by columns first then the remaining columns of x and y.

SEE ALSO:
cbind , data.frame , match , rbind .

EXAMPLES:
# create 2 data frames, one with information on authors
# and one concerning books.  Use merge to relate the
# names of the books written to attributes of the authors.
# Note that some authors have no books listed and some have
# several books.

authors <- data.frame( FirstName = c("Lorne","Loren","Robin", "Robin","Billy"), LastName = c("Green","Jaye","Green", "Howe","Jaye"), Age = c(82,40,45,2,40), Income=c(1200000,40000,25000,0,27500), Home=c("California","Washington","Washington", "Alberta", "Washington")) books <- data.frame( AuthorFirstName=c("Lorne","Loren","Loren", "Loren","Robin","Rich"), AuthorLastName=c("Green","Jaye","Jaye","Jaye", "Green","Calaway"), Book=c("Bonanza","Midwifery", "Gardening", "Perennials", "Who_dun_it?","Splus"))

# Look at all cases in which the author is in both the # authors and books datasets. Match author by both first # and last names -- these have different labels in the 2 # datasets but are in the first 2 columns of both.

merge(authors, books, by = 1:2)

# Produces the following output: FirstName LastName Age Income Home 1 Lorne Green 82 1200000 California 2 Loren Jaye 40 40000 Washington 3 Loren Jaye 40 40000 Washington 4 Loren Jaye 40 40000 Washington 5 Robin Green 45 25000 Washington Book 1 Bonanza 2 Midwifery 3 Gardening 4 Perennials 5 Who_dun_it?

# Next, make sure all authors in the authors dataset are # listed, even if there is no book listed for them. Using # by.x and by.y may be a more reliable way to handle # cases in which the datasets have different # names for the columns to match by.

merge(authors, books, by.x = c("FirstName","LastName"), by.y = c("AuthorFirstName","AuthorLastName"), all.x = T)

# Produces the following: FirstName LastName Age Income Home 1 Billy Jaye 40 27500 Washington 2 Lorne Green 82 1200000 California 3 Loren Jaye 40 40000 Washington 4 Loren Jaye 40 40000 Washington 5 Loren Jaye 40 40000 Washington 6 Robin Green 45 25000 Washington 7 Robin Howe 2 0 Alberta Book 1 NA 2 Bonanza 3 Midwifery 4 Gardening 5 Perennials 6 Who_dun_it? 7 NA

# Use the state.x77 dataset to relate the income of the author # to the median income of his or her home state (we have no # information on Alberta, a Canadian province). Note the use # of "row.names" where the "column" to match on is not a variable # in the dataset but is the names of the rows. Both datasets have # a column called "Income" which is not a key variable, so supply # the suffixes argument to distinguish between them in the # output (without suffixes they would be labeled "Income.x" # and "Income.y").

state.data <- data.frame(state.x77) merge(authors, state.data[,"Income", drop = F], by.x = "Home", by.y = "row.names", all.x = T, suffixes = c("Author","State"))

# Produces the following: FirstName LastName Age IncomeAuthor Home 1 Robin Howe 2 0 Alberta 2 Lorne Green 82 1200000 California 3 Loren Jaye 40 40000 Washington 4 Robin Green 45 25000 Washington 5 Billy Jaye 40 27500 Washington IncomeState 1 NA 2 5114 3 4864 4 4864 5 4864