Integrating Descriptive Data from Floras
Wood, M. M.1, Lydon, S. J.2, Wang, S.1, Huxley, R.3 & Sutton, D. A.3
1 Department of Computer Science, University of Manchester
2 Earth Science Education Unit, Keele University
3 Department of Botany, Natural History Museum

"Interoperability" usually refers to the format of databases: we address the compatibility of legacy data. The MultiFlora system automatically extracts information from the texts of multiple Floras describing the same plant taxa, and integrates it into a single, formally structured data resource.
In earlier work, we showed that the distribution of information across Floras is highly "sparse": different authors describe different characters for the same taxon. Here we analyse the range of ways in which the same information can be differently presented. These include:
Units of measure: 80cm / .8m
Number ranges: 4 / 3-5 / 3-(4)-5
Exact synonyms: two-parted / bi-partite
Near-synonyms: e.g. words for leaf shape and for hairiness
Degrees of specificity: yellow / rich golden yellow
Temporal reference: after flowering / in fruit. This is a particularly challenging problem for automatic analysis, as it requires active reasoning, not just arithmetical calculation or look-up in a synonym list.

We aim to identify automatically which descriptive elements are directly equivalent, and which are genuine disagreements between authors, or species diagnostics. We aim also to enable automated translation between legacy standards, and integration of existing descriptive data with any newly agreed standard to emerge from the Systematics community.
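
The two simplest kinds of variation listed above, units of measure and number ranges, need only arithmetical calculation, and can be illustrated with a minimal Python sketch. This is not the MultiFlora implementation. The function names, the conversion table, the reading of the bracketed value in "3-(4)-5" as the typical count, and the use of range overlap as a test for "not a disagreement" are all assumptions made for illustration.

```python
import math
import re
from typing import NamedTuple, Optional

# Hypothetical conversion table: factor from each unit to metres.
_TO_METRES = {"mm": 0.001, "cm": 0.01, "m": 1.0}

_LENGTH = re.compile(r"\s*(\d*\.?\d+)\s*(mm|cm|m)\s*")
# Matches "4", "3-5" and "3-(4)-5".
_RANGE = re.compile(r"\s*(\d+)\s*(?:-\s*(?:\((\d+)\)\s*-\s*)?(\d+)\s*)?")


def to_metres(text: str) -> float:
    """Parse a length such as '80cm' or '.8m' and return it in metres."""
    match = _LENGTH.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised length: {text!r}")
    value, unit = match.groups()
    return float(value) * _TO_METRES[unit]


def same_length(a: str, b: str) -> bool:
    """True if two length expressions denote the same measurement."""
    # isclose absorbs floating-point rounding in the unit conversion.
    return math.isclose(to_metres(a), to_metres(b))


class NumberRange(NamedTuple):
    low: int
    high: int
    typical: Optional[int] = None  # assumed meaning of the bracketed value


def parse_range(text: str) -> NumberRange:
    """Parse '4', '3-5' or '3-(4)-5' into a NumberRange."""
    match = _RANGE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised number range: {text!r}")
    low, typical, high = match.groups()
    return NumberRange(
        low=int(low),
        high=int(high) if high is not None else int(low),
        typical=int(typical) if typical is not None else None,
    )


def overlaps(a: NumberRange, b: NumberRange) -> bool:
    """Assumed compatibility test: the two ranges share at least one value."""
    return a.low <= b.high and b.low <= a.high
```

Under these assumptions, `same_length("80cm", ".8m")` treats the two measurements as equivalent, and `overlaps(parse_range("4"), parse_range("3-5"))` treats the two counts as compatible rather than as a disagreement. Near-synonyms, degrees of specificity and temporal reference are not covered by a sketch of this kind.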