P. Bryan Heidorn1, Wensheng Wu1 & Reed Beaman2
1 Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 East Daniel St. MC-493, Champaign, Illinois 61820-6212, U.S.A.
2 Peabody Museum of Natural History, Yale University, 170 Whitney Avenue, New Haven, Connecticut 06520-8118, U.S.A.
The objective of the Herbis project is to speed up herbarium label digitization through the use of Optical Character Recognition (OCR), machine learning (ML), information extraction (IE) technology, and other tools. The usefulness of label data is well known to the TDWG community. One of the major challenges for the natural history community is the transcription of this information into highly structured computer databases. Users would like to extract the taxonomic name, collector, collection event date, location string, etc. Once in such databases, the information can be made available to the global community through DiGIR, ABCD, or any other distribution system. While much the same data is found on all labels regardless of where or when they were created, there is a great deal of variability in label format. This variability makes it difficult to write computer programs (e.g. using regular expressions) that can successfully extract useful information from OCR records. ML and IE tools can be more robust in this variable environment than other computational techniques. However, standard ML techniques such as Naive Bayes classifiers and order-sensitive Scalar Vector Models are inadequate, so we are developing modifications to standard ML techniques as well as taking advantage of domain knowledge found in gazetteers, taxonomic name authority lists, and collector lists. In this project we do not address georeferencing of the locality strings (see BioGeomancer).
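The contrast between fixed patterns and lexicon-informed features can be illustrated with a minimal Python sketch. It is not the Herbis implementation: the two label strings, the date pattern, the tiny `GAZETTEER` and `GENERA` sets, and the function names `regex_date` and `lexicon_features` are all hypothetical, chosen only to show how a single format change defeats a regular expression while dictionary lookups still yield usable evidence for a learner.

```python
import re

# Hypothetical OCR output for two labels carrying the same kinds of data
# in different layouts (illustrative only, not Herbis data).
LABELS = [
    "Quercus alba L. Coll. J. Smith 12 May 1903 New Haven, Connecticut",
    "Collected by J. Smith, May 12, 1903. Quercus alba L. Champaign, Illinois",
]

# A rigid pattern that assumes "day Month year" order.
DATE_RE = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]+) (\d{4})\b")

# Tiny stand-ins for a gazetteer and a taxonomic name authority list.
GAZETTEER = {"connecticut", "illinois", "champaign"}
GENERA = {"quercus"}


def regex_date(text):
    """Return the first date matching the rigid pattern, or None."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


def lexicon_features(text):
    """Build per-token features that a classifier could consume."""
    features = []
    for token in re.findall(r"[A-Za-z]+", text):
        lowered = token.lower()
        features.append({
            "token": token,
            "in_gazetteer": lowered in GAZETTEER,
            "is_genus": lowered in GENERA,
            "is_capitalized": token[0].isupper(),
        })
    return features


for label in LABELS:
    places = [f["token"] for f in lexicon_features(label) if f["in_gazetteer"]]
    print(regex_date(label), places)
```

The rigid pattern finds the date on the first label but misses it on the second, where the month precedes the day, whereas the lexicon lookups flag place names on both labels regardless of layout. In a real system such features would feed a trained model rather than be used directly.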
Herbis is funded in part by the National Science Foundation.