About the VIADOCS project and the construction of LepIndex

The VIADOCS project

LepIndex is a product of the VIADOCS (Versatile, Interactive, Archive Document Conversion System) project: a collaborative project between the NHM and the University of Essex, which was funded by the BBSRC/EPSRC Bioinformatics Initiative. The primary aim of the VIADOCS project was to develop efficient new techniques to computerise information contained in hand-written and printed document archives, in order that access to such data can be improved (eg computerised archives could be made available over the Internet). Part of the NHM's card archive to world Lepidoptera names (ie the 27,577 cards for pyraloid moths) was chosen as an exemplar on which to test the methods to be developed during the project, and LepIndex was created so that the electronic version of this archive could be accessed over the Web. Funding to develop LepIndex further has since been secured from a number of sources: please see Acknowledgements and funding sources.

The text below explains (in some detail) how the NHM's Lepidoptera card archive was computerised and LepIndex created (note that much of the text is from Beccaloni et al., 2003). For more information about the VIADOCS project see the VIADOCS Home Page and New approaches to creating global species databases in entomology by Malcolm Scoble, plus the following publications: Beccaloni et al. (2003); Downton et al. (2001, 2003); He & Downton (2003); Ishidera, Lucas & Downton (2002a, 2002b); Ishidera, Lucas, Downton & Patoulas (Submitted); Lucas et al. (2001); Lucas, Patoulas & Downton (2003); Scoble (2002); and Yin, Fleury & Downton (2003).

Card scanning

The first step in computerising the NHM's card index to the scientific names of world Lepidoptera was to produce electronic images of the cards. We used a modified bank cheque scanner (SEAC Banche RDS-6000) attached to an IBM computer with an Intel Pentium III 1 GHz processor and, importantly, a DVD writer to enable backup copies to be made of the card image files. Cards to be scanned are placed into a pocket in the scanner, about 40 at a time (see the animation below). When activated, the machine draws the cards (at a rate of about one card per second) through a scanning head and ejects them into another pocket, maintaining their original order. It produces colour JPEG images (1,000 × 600 pixels) of the front and back of each card, the storage size of which varies between about 30 KB and 50 KB, depending on the amount of text on the card. The scanner also prints a unique code (see below) onto the back of each card as it travels through the machine.

SEAC Banche RDS-6000 Scanner.

Since the software supplied with the scanner did not have all the functionality we required, a customised software interface (operating under Microsoft Windows 2000) was produced. At the beginning of each scanning session, the user enters the following information into this interface: the higher classification of the batch of cards to be scanned; the number of the index drawer from which the cards were taken; and the unique reference number of the first card in the batch (unless this has already been stored by the software - see below). Using this information, the software creates a nested set of folders on the PC hard disk that mirrors the higher classification of the cards. It also uses the information to create names for the JPEG images and to create the unique code which the scanner prints on to the back of each card. For example, if card number 38378 from the index was taken from drawer '40A', and the name of the taxon on the card was placed in the family Apoprogonidae of the superfamily Geometroidea, then the front image of the card would be named 'FC-Geometroidea-Apoprogonidae-40A-038378.jpg'. The back image would have the same name but prefixed 'BC' rather than 'FC', and the code printed onto the card back would read 'Geometroidea-Apoprogonidae-40A-038378'. The two images of this card would be saved in a folder named '40A' placed within the folder 'Apoprogonidae', which is in turn placed within a folder named 'Geometroidea'.
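The naming scheme described above can be illustrated with a minimal Python sketch. It is not the original scanner interface (whose implementation is not described here); the function names and the root folder 'cards' are hypothetical, and the six-digit zero padding is inferred from the '038378' example.

```python
from pathlib import PurePosixPath
from typing import Sequence, Tuple


def card_code(classification: Sequence[str], drawer: str, card_number: int) -> str:
    """Build the code printed on the card back from its classification, drawer and number."""
    # Assumption: card numbers are zero-padded to six digits, as in '038378'.
    return "-".join([*classification, drawer, f"{card_number:06d}"])


def image_paths(root: str, classification: Sequence[str], drawer: str,
                card_number: int) -> Tuple[PurePosixPath, PurePosixPath]:
    """Return the (front, back) image paths inside the nested classification folders."""
    code = card_code(classification, drawer, card_number)
    # Folders mirror the higher classification, with the drawer folder innermost.
    folder = PurePosixPath(root, *classification, drawer)
    return folder / f"FC-{code}.jpg", folder / f"BC-{code}.jpg"


# The example from the text: card 38378, drawer 40A, Geometroidea > Apoprogonidae.
front, back = image_paths("cards", ["Geometroidea", "Apoprogonidae"], "40A", 38378)
print(front)
print(back)
```

Passing the classification as a sequence lets the same sketch cover batches that also have a subfamily or tribe level.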

When the scanner interface programme is exited, it stores the number of the last card scanned. On restarting, this card number is retrieved and incremented by one, thus ensuring that all card images receive a unique number. This number can optionally be set during scanning, if, for example, a previously scanned card needs to be rescanned.
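A minimal Python sketch of this numbering behaviour follows. It assumes the last card number is kept in a small text file; the file name and function names are hypothetical, and the actual storage mechanism used by the interface is not stated in the text.

```python
import tempfile
from pathlib import Path


def next_card_number(state_file: Path, first_number: int = 1) -> int:
    """Return the number to give the next card, one more than the stored last number."""
    if state_file.exists():
        return int(state_file.read_text().strip()) + 1
    # No stored number yet: the user supplies the first card number.
    return first_number


def save_last_card_number(state_file: Path, last_number: int) -> None:
    """Store the number of the last card scanned when the programme exits."""
    state_file.write_text(str(last_number))


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "last_card.txt"  # hypothetical file name
    first = next_card_number(state, first_number=38378)
    save_last_card_number(state, first)
    after_restart = next_card_number(state)
    print(first, after_restart)
```

Rescanning an earlier card corresponds to calling the save function with a chosen number before scanning resumes.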

The index was scanned sequentially from beginning to end, and tab cards dividing genera were ignored. Scanning all 290,099 cards in the index took a total of 61 person days, an average of 4,769 cards per 7-hour person day (breaks excluded).

VIADOCS Nomenclatural Database System

A relational Microsoft Access 97 database (named the VIADOCS Nomenclatural Database System) was designed and programmed by George Beccaloni over a period of about 18 months to manage the card images and associated taxonomic data. It contains 7 linked tables, 11 lookup tables and 18 additional tables, plus 32 queries and 27 forms, and it operates using over 10,000 lines of Visual Basic code. The 7 linked tables form the main part of the database and contain a total of 135 fields (fields are included for all data that might be present on the index cards). These tables are linked by the unique reference number (the 'card number') assigned to each card image when the images were created. The tables include one for the names of the card image files plus their paths, one for bibliographic references, one for type specimen information, one for details about the type species of genus-group names, and one for published name combinations other than the original and the currently valid combinations of the name. The structure of the database and the layout of the front-end were constructed to meet specific taxonomic requirements. It incorporates, therefore, specialist knowledge of taxonomic protocols and demanded a thorough assessment of the structure and function of existing taxonomic databases. The main purpose of the database is to enable quick visual comparison of the type- or hand-written data on the card images with data generated by Optical Character Recognition (OCR) analysis of these images (see below) and to allow these data to be edited. The database was designed in such a way that it provides an electronic substitute for the card index it replaces, and it is available to scientists in the NHM Entomology Department via the local intranet.

The main data form of the VIADOCS Nomenclatural Database System, displaying a record for a species name.

Once the entire index had been scanned, a 'freeware' programme 'rjhextensions' was used to produce a text file containing a list of all the card image files, plus their directory paths. This list was then manipulated using Microsoft Word for Windows to produce two delimited text files: one listing card number/directory path/name of front image/name of back image for each card image, and the other listing card number/superfamily name/family name/subfamily name/tribe name for each image. These data were then imported into the appropriate tables in the VIADOCS database. Records in the database are sorted by a number which gives the current relative position of each record in the electronic index. This number is set initially to be the same as the card number, thus ensuring that the records are in the same order as the cards in the index at the time when the database is first populated with records.
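The two delimited files described above were produced by hand in a word processor. Purely as an illustration, the same transformation could be sketched in Python as follows. The sketch assumes that paths are relative to the superfamily folder, that the innermost folder is the drawer, and that any folders above it are the taxon ranks in order; all function names are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath

RANKS = ("superfamily", "family", "subfamily", "tribe")


def build_rows(paths):
    """Turn a list of image paths into rows for the two delimited files."""
    image_rows, class_rows = [], []
    for path in map(PurePosixPath, paths):
        if not path.name.startswith("FC-"):
            continue  # the back image name is derived from the front image name
        card_number = int(path.stem.rsplit("-", 1)[1])
        back_name = "BC-" + path.name[len("FC-"):]
        # Assumption: folders above the drawer folder are the taxon ranks, in order.
        taxa = list(path.parent.parts[:-1])[: len(RANKS)]
        taxa += [""] * (len(RANKS) - len(taxa))  # pad missing subfamily/tribe
        image_rows.append([card_number, str(path.parent), path.name, back_name])
        class_rows.append([card_number, *taxa])
    image_rows.sort(key=lambda row: row[0])
    class_rows.sort(key=lambda row: row[0])
    return image_rows, class_rows


def to_delimited(rows):
    """Render rows as tab-delimited text, one record per line."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


paths = [
    "Geometroidea/Apoprogonidae/40A/FC-Geometroidea-Apoprogonidae-40A-038378.jpg",
    "Geometroidea/Apoprogonidae/40A/BC-Geometroidea-Apoprogonidae-40A-038378.jpg",
]
images, classes = build_rows(paths)
print(to_delimited(images), end="")
print(to_delimited(classes), end="")
```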

The main database form allows users to find quickly a card image, and the associated data, using a variety of search options (eg a drill-down search by higher classification and a 'simple search', with or without wildcards, for any taxon name). Fields on this form are grouped onto different labelled tabs (similar to those displayed when a record is viewed in LepIndex), each of which contains a group of related fields. A slightly different set of tabs (and hence fields) is displayed according to whether the record is for a genus-group, or for a species-group/infrasubspecific name.

Authorised users are able to edit, delete and create new records. They can also 'move' records, singly or in batches, to new relative positions within the record-set (eg in cases where the user wishes to transfer a species name from one genus to another). All changes made to data in the database are recorded in a set of archive tables. These tables store the old and the new field values, the name of the user, and the date and time of the change. Deleted records are also archived, and the user name, date and time are recorded. Users can validate information in all except memo fields by placing the cursor in the appropriate field and double-clicking the left-hand mouse button. The value currently stored in the field, plus the user name, date and time, are then recorded. If a field containing validated data is double-clicked subsequently, the validated data are displayed on a pop-up form and the user is given the option of deleting the stored validation information or overwriting it with a new validation record.
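The archiving of changes can be sketched in Python as below. This is only an illustration of the idea of storing old and new values with the user name and time; the real system uses Microsoft Access tables and Visual Basic, and the class, function and field names here are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ChangeRecord:
    """One row of a (hypothetical) archive table."""
    card_number: int
    field_name: str
    old_value: Optional[str]
    new_value: Optional[str]
    user: str
    changed_at: datetime


def update_field(record: dict, field_name: str, new_value: str,
                 user: str, archive: List[ChangeRecord]) -> None:
    """Change one field and archive the old and new values with user and time."""
    old_value = record.get(field_name)
    if old_value == new_value:
        return  # nothing changed, so nothing to archive
    archive.append(ChangeRecord(record["card_number"], field_name, old_value,
                                new_value, user, datetime.now(timezone.utc)))
    record[field_name] = new_value


# Placeholder values, not real data from the index.
archive: List[ChangeRecord] = []
record = {"card_number": 38378, "author": "Author A"}
update_field(record, "author", "Author B", user="editor1", archive=archive)
print(archive[0].old_value, "->", archive[0].new_value)
```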

Card image processing

Two initial preprocessing stages need to be performed on the card images before the text on them can be read by an optical character recognition (OCR) algorithm:

  • First, the text fields of the images must be parsed to identify which text field should be associated with each database field (scientific name, author(s) name(s), date of publication, etc). Image analysis and processing techniques must also be used to extract the corresponding text image from the overall card image without distorting it by cutting off part of the required text image or including noise artefacts.
  • Second, once the full card image has been broken down to a set of text sub-images labelled with appropriate database fields, these sub-images must be presented to an OCR engine which produces a ranked set of candidate words matching each image. The OCR engine will normally operate using a dictionary of allowed words for each field to optimize its performance (eg a list of the possible scientific names which may be contained in the 'scientific name' field), though these will not always provide complete coverage of the possible text words.
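The idea of a dictionary-constrained, ranked candidate list in the second stage can be sketched in Python as follows. This is not one of the project's OCR algorithms, which match word images: here a plain string-similarity score from the standard library stands in for image matching, and the function name and the four-name dictionary are chosen only for illustration.

```python
from difflib import SequenceMatcher


def rank_candidates(raw_text, dictionary, top_n=3):
    """Rank dictionary words by similarity to a noisy reading of a text field."""
    scored = [
        (SequenceMatcher(None, raw_text.lower(), word.lower()).ratio(), word)
        for word in dictionary
    ]
    # Highest score first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(word, round(score, 3)) for score, word in scored[:top_n]]


# A tiny illustrative dictionary for the 'scientific name' field.
names = ["Pyralis", "Crambus", "Galleria", "Ostrinia"]
for word, score in rank_candidates("Pyra1is", names):
    print(word, score)
```

A reading that matches no dictionary word well would still receive a ranked list, which is why incomplete dictionary coverage limits the recognition rate.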

Our iterative approach to implementing a practically useful system has been to focus first on the scientific names, which are found in the top left section of each card, because they represent the primary index term for searching the database. A simple image analysis process was therefore used to extract the scientific name as a sub-image, and this was used as the first dataset for testing the OCR component of the project (see below).

A more sophisticated image analysis tool, which attempts to extract all fields from the card images (rather than only the scientific name), is currently being developed. In an initial, fairly simplistic implementation, overall success rates for correct text image extraction and labelling range from 92% for the field giving the location of the taxon in the NHM collection to 97% for the scientific name field, based on a sample of 2,000 card images. Further improvements are expected from more sophisticated image analysis algorithms, which are under development.

Optical Character Recognition

Experimental evaluation of the performance of commercial off-the-shelf OCR packages on sample card images showed that the error rate is unacceptable, due to touching characters, poor quality printing and the use of a specialist vocabulary. The following two algorithms were therefore investigated and tested as candidates for their ability to read the text from the card images:

  1. An algorithm developed by Simon Lucas (Lucas et al., 2001).
  2. An algorithm developed by Eiki Ishidera (with support from Gregory Patoulas) at the Department of Electronic Systems Engineering, University of Essex.

A commercial algorithm designed by Parascript in the USA, generally considered to represent the state of the art for commercial offline handwriting recognition, was used for comparative evaluation against these algorithms.

Our initial system used the Lucas algorithm, and it was tested on the scientific name field of a subset of 27,577 card images (ie all the index cards relating to pyraloid moths). The system employed a dictionary covering about 60% of the scientific names on these cards, and it achieved a correct recognition rate of 37% of the names (compared with <10% for standard PC OCR packages). Of greatest interest at present is the algorithm developed for the project by Ishidera. It employs a novel approach to OCR, which is based upon constructing a probabilistic word image model of each possible word in a dictionary, and which incorporates image character templates, character segmentation information and linguistic knowledge. Each word image, once constructed, is used as a template against which archive word images are matched. The method works particularly well for the cards we scanned because these contain very poor quality text, but with a restricted range of typed character fonts. Exceptionally good recognition performance has been obtained with this algorithm on test sets consisting of 4,498 scientific names (over 99% recognition rate) and 1,977 author names (over 97% recognition rate), compared with about 90% recognition rate for the algorithm developed by Lucas, and under 80% for the Parascript algorithm (which, however, is optimised for handwriting rather than print).

The main limitation of the Ishidera algorithm is that it is computationally extremely intensive, currently taking more than one minute to match each word image on a typical PC. However, research associated with the project has been investigating web-based deployment of OCR engines as a standardised mechanism by which more computing resources can be brought to bear on this problem.

Once OCR results have been obtained, they are imported into the appropriate fields of the VIADOCS database described above. The data then have to be visually checked against these same data on the card images and corrections made where appropriate. This procedure saves much time compared with the alternative approach of simply typing all the values into the database, especially considering that data entered manually will usually also need to be verified. In some cases the data obtained from OCR can be validated against existing databases, leaving only a residue of data which needs to be visually checked. For example, in the case of our card archive, OCR results for the fields 'scientific name' + 'author(s) name(s)' + 'date of publication' for all genus-group names were validated electronically against corresponding data in the NHM's comprehensive and accurate Butterflies & Moths of the World: Generic Names & their Type-species online database.
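Validation against an existing database, as described above, amounts to separating records whose key fields match a trusted reference from the residue that still needs visual checking. A minimal Python sketch is given below; the record layout, function name and values are placeholders invented for illustration, and the actual matching rules used in the project are not described in the text.

```python
def split_by_validation(ocr_records, reference):
    """Separate OCR records that match the reference from those needing a visual check."""
    validated, to_check = [], []
    for rec in ocr_records:
        # Assumption: a record is valid only if name, author and date all match exactly.
        key = (rec["name"], rec["author"], rec["year"])
        (validated if key in reference else to_check).append(rec)
    return validated, to_check


# Placeholder reference data and OCR output (not real names or dates).
reference = {("GENUSA", "Author A", 1900), ("GENUSB", "Author B", 1910)}
ocr_records = [
    {"card": 1, "name": "GENUSA", "author": "Author A", "year": 1900},
    {"card": 2, "name": "GENUSB", "author": "Author B", "year": 1901},
]
validated, to_check = split_by_validation(ocr_records, reference)
print(len(validated), "validated;", len(to_check), "to check visually")
```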

As of 1 February 2003, OCR results for the scientific name, author(s) name(s) and date of publication fields of the 27,577 pyraloid moth cards had been checked visually against the card images and corrected where necessary. Once these data had been validated, Visual Basic algorithms were used to compute the values of certain other fields in the VIADOCS database and fill them in. For example, one algorithm identifies currently valid generic names and copies these to the 'current genus' field for all appropriate records. Currently valid generic names can be identified from the arrangement of the records in the database, together with the fact that genus-group names are in capital letters whereas other scientific names are in lower case. Please see Current coverage of LepIndex for details about what information LepIndex currently contains.
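The 'current genus' example can be sketched in Python under a deliberately simplified assumption: records are in index order, every name in capital letters starts a new genus, and the records that follow belong to it. The sketch makes no attempt to distinguish valid generic names from generic synonyms, which the real Visual Basic algorithm infers from the arrangement of records in ways not spelled out here. The function name and placeholder names are hypothetical.

```python
def fill_current_genus(records):
    """Copy the most recent capitalised (genus-group) name to each following record."""
    current = None
    filled = []
    for rec in records:  # records are assumed to be in index order
        if rec["name"].isupper():
            # Simplification: every capitalised name is treated as a valid genus.
            current = rec["name"]
        filled.append({**rec, "current_genus": current})
    return filled


# Placeholder names, not real taxa.
index = [{"name": n} for n in ["GENUSA", "alpha", "beta", "GENUSB", "gamma"]]
for row in fill_current_genus(index):
    print(row["name"], "->", row["current_genus"])
```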

The diagram below shows the overall structure of the system developed during the VIADOCS project to computerise the NHM's Lepidoptera card archive (arrows represent the flow of information).

A flowchart of the overall structure of the system developed during the VIADOCS project to computerise the NHM's Lepidoptera card archive.
