Input Data from a File

DESCRIPTION:
Reads data from a text file or interactively from standard input. Options are available to control how the file is read and the structure of the data in S-PLUS.

USAGE:
scan(file="", what=numeric(), n=<<see below>>, sep=<<see below>>,
     multi.line=F, flush=F, append=F, skip=0, widths=NULL,
     strip.white=<<see below>>)

OPTIONAL ARGUMENTS:
file:
a character string giving the name of the file to be scanned. It must be a valid name for a file that exists in the directory from which S-PLUS was invoked. If file is missing or empty (""), data will be read from standard input; scan will prompt with the index of the next data item, and data input can be terminated by a blank line.
what:
a vector of mode numeric, character, or complex, or a list of vectors of these modes. Objects of mode logical are not allowed. If what is a numeric, character, or complex vector, scan will interpret all fields on the file as data of the same mode as that object. So, what=character() or what="" causes scan to read data as character fields. If what is missing, scan will interpret all fields as numeric.

If what is a list, then each record is considered to have length(what) fields and the mode of each field is the mode of the corresponding component in what. When widths is given as a vector of length greater than one, what must be a list of the same length as widths.

n:
maximum number of items (number of records times fields per record) to read from the file. If omitted, the function reads to the end of file (or to an empty line, if reading from standard input).
sep:
separator (single character), often "\t" for tab or "\n" for newline. If omitted, any amount of white space (blanks, tabs, and possibly newlines) can separate fields. If widths is specified, then sep tells what separator to insert into fixed-format records.
multi.line:
if FALSE, all the fields must appear on one line: if scan reaches the end of the line without reading all the fields, an error occurs. Thus the number of fields on each line must be a multiple of the length of what unless flush=TRUE. This is useful for checking that no fields have been omitted. If this argument is TRUE, reading will continue, disregarding where new lines occur.
flush:
if TRUE, scan will flush to the end of the line after reading the last of the fields requested. This allows putting comments after the last field that are not read by scan, but also prevents putting multiple sets of items on one line.
append:
if TRUE, the returned object will include all the elements in the what argument, with the input data for the respective fields appended to each component. If FALSE (the default), the data in what is ignored, and only the modes matter.
skip:
the number of initial lines of the file that should be skipped prior to reading. If skip is positive, a temporary copy of the file, less the skipped lines, is written to the temporary directory and then processed by scan. The temporary directory must therefore have enough space to hold the copy.
widths:
vector of integer field widths corresponding to items in the what argument. The widths argument provides for common fixed-format input. If widths is not NULL, then as scan reads the characters in a record, it automatically inserts a sep character after it reads widths[1] characters (widths[1] represents the width of the first field), then another sep after widths[2] characters, and so on, allowing the record to be read as if your input were delimited by the sep character to begin with. The default sep inserted when using widths is "\001" (binary 1); if your input contains this character, you will need to set the sep argument to a character that you know is not contained anywhere in the input. One caveat: the widths you specify must correspond exactly to field widths in your input; if they do not, you may get "field undecipherable" errors in (seemingly) odd places, or the input may be silently but incorrectly digested. The default for widths is NULL. Note that if widths has a length greater than one, what must be a list of the same length.
strip.white:
vector of logical values corresponding to items in the what argument. The strip.white argument allows you to strip leading and trailing white space from character fields (scan always strips numeric fields in this way). If strip.white is not NULL, it must be either of length 1, in which case the single logical value tells whether to strip all fields read, or the same length as what, in which case the logical vector tells which fields to strip (strip the leading and trailing white space from field 1 if strip.white[1] is TRUE and field 1 is a character field, strip from field 2 if strip.white[2] is TRUE and field 2 is a character field, and so on). If widths is specified, the default for strip.white is TRUE (strip all fields), otherwise the default is NULL (do not strip any fields). Note: if you are reading free format input by leaving sep unspecified, then strip.white has no effect.
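
As an illustration of the flush argument, here is a minimal sketch in S syntax. The temporary file and its contents are made up for the example (created with tempfile and cat); only arguments documented above are used.

```r
# Hypothetical sample file: two numeric fields per line, then free text.
f <- tempfile()
cat("1 2 first record\n", "3 4 second record\n", file = f, sep = "")

# what is a list of two numeric fields; flush=T discards the rest of each line,
# so the trailing text is never interpreted as data.
z <- scan(f, what = list(0, 0), flush = T)
z[[1]]   # first field of every record
z[[2]]   # second field of every record
unlink(f)
```

Without flush=T, the trailing text would be read as further fields and could not be interpreted as numeric.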

VALUE:
a list or vector like the what argument if it is present, and a numeric vector if what is omitted.

DETAILS:
It is possible to read files that contain more than one mode by giving a what argument of mode "list". For example, if fields were alternately numeric and character (e.g., two columns of data on the file), scan(myfile,list(0,"")) would read them and return an object of mode list, with a numeric vector and a character vector as its two elements. The elements of what can be anything, so long as you have numbers where you want numeric fields, character data where you want character fields and complex numbers where you want complex fields. A NULL component in what causes the corresponding field to be skipped during input. The elements are used only to decide the kind of field, unless append is TRUE. Notice that scan retains the names attribute, if any, of the list, so that z <- scan(myfile,list(pop=0,city="")) would let you refer to z$pop and z$city.
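
The following sketch shows a named list for what together with a NULL component. The file contents are made-up sample values, and the temporary file is an assumption of the example rather than part of scan.

```r
# Hypothetical sample file: a number, a name, and a third field to be skipped.
f <- tempfile()
cat("101 Oslo x\n", "202 Lima y\n", file = f, sep = "")

# pop is read as numeric, city as character; the NULL component skips field 3.
z <- scan(f, what = list(pop = 0, city = "", NULL))
z$pop    # numeric vector, accessed by the name given in what
z$city   # character vector
unlink(f)
```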

Any numeric field containing the characters NA will be returned as a missing value. If the field separator (the sep argument) is given and the field is empty, the returned value will be an NA for a numeric or complex field, and a "" for a character field.
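
A small sketch of how empty fields are returned when sep is given. The comma-separated file below is invented for the example.

```r
# Hypothetical sample file: record 2 has an empty numeric field,
# record 3 has an empty character field.
f <- tempfile()
cat("1,a,10\n", ",b,20\n", "3,,30\n", file = f, sep = "")

z <- scan(f, what = list(0, "", 0), sep = ",")
is.na(z[[1]])   # marks the empty numeric field as missing
z[[2]] == ""    # marks the empty character field as the empty string
unlink(f)
```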

The main use for separators is to allow white space inside character fields. For example, suppose that in the example above the numeric field is followed by a tab, with text filling out the rest of the line. Then z <- scan(myfile,list(pop=0,city=""),sep="\t") would allow blanks in the city name. With no separator, arbitrary white space can be included by quoting the whole string. With a separator, quotes are not used; if the separator character is to be included in a string, it must be escaped by a preceding backslash.
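
The tab-separated case can be sketched as follows. The numbers are placeholders, not real data, and the temporary file is created only for the illustration.

```r
# Hypothetical sample file: a number, a tab, then a name containing blanks.
f <- tempfile()
cat("10\tNew York\n", "20\tSan Francisco\n", file = f, sep = "")

# With sep="\t" the blanks inside the names are kept as part of the field.
z <- scan(f, what = list(pop = 0, city = ""), sep = "\t")
z$city
unlink(f)
```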

Fields of mode "logical" cannot be read directly: read them as character and convert them by expressions like x=="T". Any field that cannot be interpreted according to the mode(s) supplied to scan will cause an error.
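
A minimal sketch of the conversion described above, assuming the logical values are stored in the file as the letters T and F (the file itself is invented for the example):

```r
# Hypothetical sample file holding logical values coded as T and F.
f <- tempfile()
cat("T F T T\n", file = f, sep = "")

x <- scan(f, what = "")   # read the fields as character data
flags <- x == "T"         # convert to a logical vector
flags
unlink(f)
```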

The reading of numeric data in scan is done by means of C scan formats, rather than by the rules of the S-PLUS parser (the function parse). Exponential notation must use "e"; numbers that use "d" or other letters will be read incorrectly. You will need to change your data from the "d" notation to the "e" notation with, for instance, the sed utility in UNIX.

As it reads more and more records, scan keeps allocating more space to accommodate the growing vectors. If you can manage to pass in a what argument that is identical in size to the result you expect, S-PLUS will use that space and not have to perform memory allocations. This may produce significant memory savings when dealing with large files of data.
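
As a sketch of the idea, assuming the number of items is known in advance (n.expected is a name introduced here for illustration), a what argument of the expected size can be passed like this. Whether the allocations are actually avoided is up to S-PLUS, as described above; the example only shows the call.

```r
# Hypothetical sample file with six numeric items.
f <- tempfile()
cat("1 2 3\n", "4 5 6\n", file = f, sep = "")

n.expected <- 6                           # assumed known before reading
x <- scan(f, what = numeric(n.expected))  # what has the size of the result
length(x)
unlink(f)
```

Because append is FALSE by default, the zeros in what are ignored and only its mode and size matter.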

The make.fields function preprocesses files that have fixed-format fields, inserting a separator after each field; it can be used as a separate step instead of using the widths argument. The advantage of using widths is that you do not need to create any temporary files.

The read.table function reads data from a file and returns a data frame. It is often a better choice than scan if the data are in a regular table format with rows of equal length.

count.fields tells how many fields are in each line of a file. This is useful for determining whether read.table is appropriate or, when using scan to return a list, whether the number of fields in each line is a proper multiple of the length of what.
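
A short sketch of such a check, assuming count.fields is called with its default arguments on a white-space-separated file (the file is made up for the example):

```r
# Hypothetical sample file in which the second line is one field short.
f <- tempfile()
cat("1 2 3\n", "4 5\n", "6 7 8\n", file = f, sep = "")

nf <- count.fields(f)        # number of fields on each line
nf
ragged <- any(nf != nf[1])   # TRUE when the lines do not all have the same length
ragged
unlink(f)
```

A ragged file like this one suggests checking the data before calling read.table or scan with a list for what.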

readline is another function that accepts data interactively.


SEE ALSO:
read.table, make.fields, count.fields, parse, write, readline.

EXAMPLES:
num <- scan() # read numeric values from standard input
# read a label & two numeric fields, make a matrix
z <- scan("myfile",list(name="",0,0))
mat <- cbind(z[[2]],z[[3]])
dimnames(mat)  <- list(z$name,c("X","Y"))
# read in a vector of character data
personnel <- scan("person", what="")
ff <- scan("myfile", what=list(NULL,name="",data=0,NULL),
           multi.line=T, sep="\t")
# creates a list with two NULL components, a character component
# and a numeric component.  Fields are separated by tabs.
ff <- ff[sapply(ff, length) > 0] # delete NULL components
scan("myfile", single(0), skip=5)
# save in single precision, skip the first five lines of the file
# example of reading fixed format file using the widths argument
# and of using the strip.white argument
# blanks are read as NA for numeric fields
# assignment can be suppressed for a field using NULL in the what argument
# for this example, the file 'dfile' contains the following lines:
01giraffe.9346H01-04
88donkey .1220M00-15
77ant         L04-04
20gerbil .1220L01-12
22swallow.2333L01-03
12lemming     L01-23
mydf.what <- list(code=0, name="", x=0, s="", n1=0, NULL, n2=0)
mydf.widths <- c(2, 7, 5, 1, 2, 1, 2)
# note: strip.white defaults to TRUE if widths specified
# could also use strip.white = c(F, T, F, F, F, F, F)
mydf <- scan("dfile", what=mydf.what, widths=mydf.widths)
mydf
# this produces the following output:
$code:
[1]  1 88 77 20 22 12
$name:
[1] "giraffe" "donkey"  "ant"     "gerbil"  "swallow" "lemming"
$x:
[1] 0.9346 0.1220     NA 0.1220 0.2333     NA
$s:
[1] "H" "M" "L" "L" "L" "L"
$n1:
[1] 1 0 4 1 1 1
[[6]]:
NULL
$n2:
[1]  4 15  4 12  3 23
# now with strip.white argument:
mydf <- scan("dfile", what=mydf.what, widths=mydf.widths, strip.white=F)
mydf$name
# this displays the name component; the list is just like the one above,
# except that the character fields are not stripped:
[1] "giraffe" "donkey " "ant    " "gerbil " "swallow" "lemming"