Generate Abbreviations

DESCRIPTION:
Returns abbreviations of a vector of character strings.

USAGE:
abbreviate(names, minlength = 4, use.classes = TRUE, dot = FALSE)

REQUIRED ARGUMENTS:
names:
vector of character strings, whose elements are to be abbreviated.

OPTIONAL ARGUMENTS:
minlength:
the minimum length of the abbreviations produced (not counting the trailing dot that is added if dot=TRUE). The abbreviations are not guaranteed to be of length minlength; the algorithm will increase minlength until it successfully produces unique abbreviations.
use.classes:
if TRUE, some special character classes will be used to keep what are thought to be more meaningful characters in the abbreviation. See the discussion of the algorithm in the DETAILS section. To see the effect, try the abbreviation of state.name as in the example below, but with use.classes=FALSE.
dot:
if TRUE, each abbreviation is terminated with ".".

VALUE:
a character vector containing the abbreviations. The vector will have a "names" attribute containing the original names argument. This attribute can make subscripting the result convenient (see the second example).

DETAILS:
The abbreviations do not depend on the order of the names argument, except when the algorithm produces, and has to resolve, duplicate abbreviations.

THE ALGORITHM. The abbreviation algorithm does not simply truncate. It has a threshold, according to which it will drop: 1) non-printing characters and white space, 2) lower case vowels, 3) lower case consonants and punctuation and finally 4) upper case letters and special characters.

If use.classes is FALSE, there is only the distinction between white space and other characters. Each string is broken up into words, separated by white space. For a given value of the threshold, eligible letters are dropped from the end of each word, one more letter from each word on each iteration, until the desired minimum length is reached. At least one letter is kept from each word. If the abbreviation is still too long once every eligible letter has been dropped, the threshold is raised and the process is repeated.
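
The dropping order described above can be illustrated with a minimal sketch. It is not the actual implementation: the function name drop.sketch is hypothetical, it handles single-word strings only, and it models just two of the classes (lower case vowels, then all lower case letters), ignoring white space, punctuation and upper case letters.

```r
# Hypothetical helper, for illustration only: abbreviates ONE word by
# dropping eligible characters from the end, one class at a time.
drop.sketch <- function(word, minlength = 4) {
  chars <- strsplit(word, "")[[1]]
  # Classes in the order they become eligible as the threshold is raised:
  # lower case vowels first, then any lower case letter.
  classes <- list(c("a", "e", "i", "o", "u"), letters)
  for (cls in classes) {
    i <- length(chars)
    # Walk from the end towards the front; position 1 is never dropped.
    while (i > 1 && length(chars) > minlength) {
      if (chars[i] %in% cls) chars <- chars[-i]
      i <- i - 1
    }
  }
  paste(chars, collapse = "")
}

drop.sketch("Alabama")   # "Albm"
drop.sketch("Arkansas")  # "Arkn"
drop.sketch("Georgia")   # "Gerg"
```

For these single-word names the sketch happens to reproduce the abbreviations shown in the EXAMPLES section; multi-word names, where one letter is dropped from each word in turn, are not covered.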

This algorithm may still not produce unique abbreviations. If it does not, then minlength will be increased and the algorithm will be applied again, but only to those names not distinguished by the previous round. The end result may be that some of the abbreviations will be longer than the requested length, but as few of these as possible given the algorithm. (See the third example below.)
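
The retry step can be sketched as follows. This is only an illustration of the idea of lengthening the names that are still not distinguished: the function name retry.sketch is hypothetical, and plain truncation with substring is used as a stand-in for the real abbreviation step, which does not simply truncate.

```r
# Hypothetical helper, for illustration only. Truncation stands in for
# the real abbreviation step.
retry.sketch <- function(x, minlength = 4) {
  x <- unique(x)                      # identical names are handled once
  out <- substring(x, 1, minlength)
  len <- minlength
  # Positions whose abbreviation is shared with some other name.
  clash <- which(duplicated(out) | duplicated(out, fromLast = TRUE))
  while (length(clash) > 0 && len < max(nchar(x))) {
    len <- len + 1                    # increase the length ...
    out[clash] <- substring(x[clash], 1, len)  # ... only for the clashes
    clash <- which(duplicated(out) | duplicated(out, fromLast = TRUE))
  }
  names(out) <- x
  out
}

retry.sketch(c("Massachusetts", "Mississippi", "Missouri"), 2)
#  Massachusetts Mississippi Missouri
#           "Ma"     "Missi"  "Misso"
```

Only the two names that collide are lengthened; the name that was already distinguished keeps the requested length.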

The method assumes you want identical names to produce identical abbreviations. The result of all this tends to be abbreviations not quite like anything you've ever seen before, but usually fairly intuitive when the input names are English text.
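
A quick way to check the behavior for identical names, assuming an abbreviate function with the interface documented here is available (the input vector nms is made up for the illustration):

```r
# "Alabama" is repeated on purpose.
nms <- c("Alabama", "Alaska", "Alabama")
ab <- abbreviate(nms)

# Both copies of "Alabama" should receive the same abbreviation ...
same.for.identical <- ab[[1]] == ab[[3]]
# ... while distinct names should remain distinguishable.
distinct.for.different <- ab[[1]] != ab[[2]]

same.for.identical
distinct.for.different
```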


SEE ALSO:
make.names, nchar, paste, substring, table.

EXAMPLES:
abbreviate(state.name[1:10])
#  Alabama Alaska Arizona Arkansas California Colorado
#  "Albm"  "Alsk" "Arzn"  "Arkn"   "Clfr"     "Clrd"

#  Connecticut Delaware Florida Georgia
#  "Cnnc"      "Dlwr"   "Flrd"  "Gerg"

abbreviate(state.name, 2)["New Jersey"]
#  New Jersey
#  "NJ"

ab2 <- abbreviate(state.name, 2)
table(nchar(ab2))
#   2  3 4
#  32 15 3

ab2[nchar(ab2)==4]
#  Massachusetts Mississippi Missouri
#  "Mssc"        "Msss"      "Mssr"