Walter G. Berendsohn and the CODATA/TDWG Task Group for ABCD
Department of Biodiversity Informatics and Laboratories, Botanic Garden and Botanical Museum Berlin-Dahlem, Koenigin-Luise-Str. 6-8, 14191 Berlin, Germany
ABCD Schema - Access to Biological Collections Data - is a common data specification for biological collection units, including living and preserved specimens, along with field observations that did not produce voucher specimens. It is intended to support the exchange and integration of detailed primary collection and observation data.
All of the world's biological collections hold a number of data items, including specimen-specific elements (e.g. taxon, date, altitude, sex) and collection-specific elements (e.g. holding institution). ABCD provides a reconciled set of element names and definitions for scientists and curators to use. No collection is expected to use more than a fraction of the elements defined in the standard, nor would that even be possible.
A design goal of the data specification was to be both comprehensive and general: to include a broad array of concepts that might be available in a collection database, but to mandate only the bare minimum of elements required to make the specification functional. ABCD deliberately does not cover taxonomic data, such as synonymy, beyond the use of names in identifications. Taxon-related information, such as distribution range or indicator values, is likewise excluded. The elements and concepts that are used provide as much compatibility as possible with other standards in the field of biological collection data, such as HISPID and DarwinCore. The data specification is cast as an XML schema. ABCD was designed with the following principles in mind:
(1) Full coverage approach: ABCD is comprehensive and therefore complex. It explicitly aims to define the semantics of all elements, in order to provide a unified approach for the natural history collection community, to accept detailed information where available, and to develop a proto-ontology as a first step towards a collection ontology.
(2) Polymorphism: Variable atomisation allows data to be provided in different degrees of detail and standardisation, so that data from a wide variety of sources can be accepted.
(3) No internal referencing: The document has a single root and no ID-based relational structures, which makes processing easier and faster.
(4) Extensible slots: Extensions are not meant for individualised adaptations of the schema. Instead, they allow fast community support for missing elements before these are integrated into a subsequent version, and they allow the inclusion of third-party schemas (or parts thereof) so that developments in other communities (e.g. geographical data) are not duplicated.
(5) Flexible containers: Element-element or element-attribute couples for category-value pairs allow freely defined and repeatable data fields (e.g. higher taxa, measurements, morphological features). In addition, free-text data can often be provided where atomised data would be impractical.
(6) No recursive structures.
(7) Language support: Language can be made explicit for most text elements.
(8) Typing: The use of complex types and their deposition in a common type library allows type-sharing with other communities (e.g. Structure of Descriptive Data, SDD).
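Principles (2) and (5) can be illustrated with a minimal Python sketch. It is not based on the actual ABCD schema: the element names (`Units`, `Unit`, `UnitID`, `GatheringDate`, `ISODate`, `DateText`, `Measurement`, `MeasurementText`) and the sample values are hypothetical and chosen only for illustration. The sketch shows a consumer that accepts a date either in atomised form or as free text, and that reads repeatable category-value pairs for measurements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document. Element names and values are
# illustrative assumptions and are NOT taken from the ABCD schema.
SAMPLE = """<Units>
  <Unit>
    <UnitID>example-001</UnitID>
    <GatheringDate><ISODate>1998-05-17</ISODate></GatheringDate>
    <Measurement>
      <Parameter>altitude</Parameter>
      <Value>1250</Value>
      <UnitOfMeasurement>m</UnitOfMeasurement>
    </Measurement>
    <Measurement>
      <Parameter>leaf length</Parameter>
      <Value>42</Value>
      <UnitOfMeasurement>mm</UnitOfMeasurement>
    </Measurement>
  </Unit>
  <Unit>
    <UnitID>example-002</UnitID>
    <GatheringDate><DateText>spring 1902</DateText></GatheringDate>
    <MeasurementText>about 300 m above the village</MeasurementText>
  </Unit>
</Units>"""


def read_date(unit):
    """Variable atomisation: prefer the atomised date, fall back to free text."""
    date = unit.find("GatheringDate")
    if date is None:
        return (None, None)
    iso = date.findtext("ISODate")
    if iso is not None:
        return ("atomised", iso.strip())
    text = date.findtext("DateText")
    if text is not None:
        return ("text", text.strip())
    return (None, None)


def read_measurements(unit):
    """Flexible containers: repeatable category-value pairs plus free text."""
    found = []
    for m in unit.findall("Measurement"):
        found.append({
            "parameter": m.findtext("Parameter"),
            "value": m.findtext("Value"),
            "unit": m.findtext("UnitOfMeasurement"),
        })
    for t in unit.findall("MeasurementText"):
        found.append({"text": (t.text or "").strip()})
    return found


def read_units(xml_text):
    """Parse the single-root document into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": u.findtext("UnitID"),
            "date": read_date(u),
            "measurements": read_measurements(u),
        }
        for u in root.findall("Unit")
    ]


if __name__ == "__main__":
    for record in read_units(SAMPLE):
        print(record)
```

Because the sample has a single root and no ID references, each unit can be read in one pass without resolving links between elements, which is the processing benefit that principle (3) describes.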
Development of the ABCD content definition started after the 2000 meeting of TDWG in Frankfurt/Main, where the decision was made to specify both a protocol and a data structure to enable interoperability of the numerous heterogeneous biological collection databases. As a consequence, the TDWG/CODATA subgroup on Access to Biological Collection Data (ABCD) was established, with one sub-section working on the ABCD data standard. The subgroup was accepted as a CODATA working group in 2002, and in 2003 it became a CODATA task group. Several workshops and numerous individual contributions accompanied further development of ABCD. In 2003, the BioCASE project provided a reference implementation using ABCD v. 1.2. Today, more than 70 providers use ABCD to serve unit-level data online from numerous databases. In October 2003, GBIF decided to integrate the BioCASE network into the nascent GBIF network along with the DiGIR protocol and Darwin Core.
ABCD version 2.0 is a proposed TDWG standard, which will be voted on at the annual meeting in September 2005. If accepted, this will be the version that GBIF will promote for use globally. If further changes become necessary, they will also be proposed through TDWG and result in a version increment. The main changes are likely to be extensions for use in new domains, refinement of domain-specific elements and support for the modularisation of TDWG standards.