Reed Beaman1, Nico Cellinese1, P. Bryan Heidorn2, Youjun Guo1 & Ashley Green1
1 Peabody Museum of Natural History, Yale University, 170 Whitney Avenue, New Haven, Connecticut 06520-8118, USA
2 Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 East Daniel St. MC-493, Champaign, Illinois 61820-6212, USA
Digital imaging projects often do nothing more with images than make them available for web display. An image, however, can also serve as the basis for label data capture: the source of a thousand words. Our goals include developing workflows and technology that combine image and label data capture, ultimately reducing the total cost of herbarium digitization by significantly reducing the human labour required and the total project duration. Ideally, clicking the shutter on a digital camera initiates a sequence that culminates in online access to specimen label data and images through full-text and structured queries. Doing so requires implementing a number of available technologies, including high-speed, high-resolution digital cameras, image compression, Optical Character Recognition (OCR), Natural Handwriting Recognition (NHR), and Natural Language Processing (NLP). We are implementing open source and commercial software solutions, and are developing our own where necessary. HERBIS progress so far has comprised developing workflow and image management protocols that allow us to automatically queue images through several image processing algorithms (e.g. JPEG2000 compression), run OCR, and generate a full-text index of label data that is linked to the image, barcodes, and scientific names in an online environment. Each of these components has also been embedded into web services, providing benefits such as cross-platform interoperability, scalability, and availability to other institutions. Work on implementing NHR and NLP technologies is ongoing.
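The last step of the workflow described above, a full-text index of label data linked back to each specimen's barcode and image, can be illustrated with a minimal sketch. This is not the HERBIS implementation: the record fields, the barcodes, the image paths, and the canned label text are all hypothetical, and the OCR stage is a stub standing in for whatever engine a real pipeline would call.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpecimenRecord:
    barcode: str          # hypothetical specimen identifier
    image_path: str       # hypothetical path to the captured image
    label_text: str = ""  # filled in by the OCR stage


def ocr_stub(record, canned_text):
    """Stand-in for a real OCR engine: looks up canned text by barcode."""
    record.label_text = canned_text.get(record.barcode, "")
    return record


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each token to the set of barcodes whose label contains it."""
    index = defaultdict(set)
    for rec in records:
        for token in tokenize(rec.label_text):
            index[token].add(rec.barcode)
    return index


def search(index, query):
    """Return barcodes whose label text contains every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Illustrative data only; no real specimens are represented here.
CANNED_OCR = {
    "DEMO-0001": "Quercus alba L. New Haven, Connecticut",
    "DEMO-0002": "Acer rubrum L. Champaign, Illinois",
}
records = [
    ocr_stub(SpecimenRecord(barcode, f"images/{barcode}.jp2"), CANNED_OCR)
    for barcode in CANNED_OCR
]
index = build_index(records)
```

With this sketch, `search(index, "quercus alba")` returns the single barcode whose label contains both terms, and that barcode in turn identifies the record holding the image path. A production system would add ranking, persistence, and structured fields such as scientific names, none of which are modelled here.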