13.   Detecting spelling errors in taxonomic databases

Eduardo C. Dalcin¹, Richard J. White² & W. Alex Gray²

¹ Instituto Nacional de Pesquisas da Amazônia, Av. André Araújo, 2936, Petrópolis - CEP 69083-000 Manaus, Amazonas, Brazil

² School of Computer Science, Cardiff University, 5 The Parade, Cardiff CF24 3AA, UK

Many problems can affect the quality of data in taxonomic databases, and some of the ways to resolve these issues are discussed in other talks at this meeting. As the most common values linking related information in different tables and databases, species names play a key role in biodiversity information systems. However, because they are unfamiliar to many database contributors and editors, they are particularly prone to errors. The causes and frequencies of the different types of error in scientific names are usually unknown.

In this talk, we will address some techniques for detecting and correcting spelling errors in scientific names. We will classify error detection and correction procedures in several ways, including by whether they use vocabularies of scientific name components akin to those used by conventional spelling checkers. Suitable dictionaries of complete scientific names, however, are frequently unavailable. We will therefore describe algorithms for detecting possible errors without the use of vocabularies, together with procedures for assessing their effectiveness. The results of some experiments will be discussed and summarised, particularly in terms of database error rates, cost effectiveness and the balance between “recall” and “precision”. We will suggest procedures for effective error limiting, and discuss the potential for error monitoring, as opposed to error correction, giving users more control and flexibility in their information retrieval.
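
As an illustration only, the idea of detecting possible errors without a vocabulary can be sketched by comparing the names in a database with each other: when two names differ by a very small edit distance and one is much rarer than the other, the rarer one may be a misspelling. The sketch below is not the algorithm presented in the talk; the function names, the distance threshold and the sample names and counts are all assumptions made for this example. It also shows how "precision" (the fraction of flagged names that are real errors) and "recall" (the fraction of real errors that are flagged) might be computed when a manually checked set of errors is available.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suspect_pairs(name_counts: dict[str, int], max_distance: int = 1):
    """Return (suspect, likely_intended) pairs of near-identical names.

    Hypothetical heuristic: no external dictionary is consulted; the
    rarer of two very similar names is treated as the possible error.
    """
    pairs = []
    for a, b in combinations(sorted(name_counts), 2):
        if edit_distance(a, b) <= max_distance:
            rare, common = sorted((a, b), key=lambda n: name_counts[n])
            if name_counts[rare] < name_counts[common]:
                pairs.append((rare, common))
    return pairs


def precision_recall(flagged: set[str], true_errors: set[str]):
    """Precision and recall of the flagged names against checked errors."""
    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_errors) if true_errors else 0.0
    return precision, recall


# Made-up occurrence counts for demonstration; not real database figures.
counts = {
    "Quercus robur": 12,
    "Quercus rubur": 1,
    "Quercus rubra": 9,
    "Acer rubrum": 7,
}
pairs = suspect_pairs(counts, max_distance=1)
flagged = {suspect for suspect, _ in pairs}
print(pairs)
print(precision_recall(flagged, {"Quercus rubur"}))
```

Raising `max_distance` would be expected to flag more names, which tends to improve recall at the cost of precision, because valid but similar names (such as two genuine epithets in the same genus) begin to be flagged as well.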