TDWG subgroup: Structure of Descriptive Data, subgroup session at the TDWG
meeting in Frankfurt, 12. Nov. 2000 ## Version 1 ##
== Participants ==
Stan Blum, Jim Croft, Gregor Hagedorn, Nicholas Lander, Bob Morris, Jörg
Ochsmann, Richard Pankhurst, Jean-Marc Vanel, Mark Watson, Greg Whitbread,
and others
== Discussion topic: What is this group about? ==
Bob Morris: “Group is not interested in resource development.” Nicholas
Lander: Nor is it about standard character lists. There was a TDWG subgroup
on Standard Characters for plants (Richard Pankhurst was convener), which
did not succeed in drawing up a standard character list. Richard Pankhurst
thinks we should put it off for a further 5 years until Delta-like
technology is widely used and interest is sufficient.
== Discussion topic: SDD future and presentation ==
Convener: Gregor Hagedorn apologized for his lack of initiative during the
year, being overwhelmed with work in the starting phase of the GLOPP
project. He is willing to continue as convener, but also willing to step
aside if somebody else wants to fill the position and bring more
initiative into the group.
It was agreed that there should be an SDD website pointing to information
resources concerning the activities of the group. Nicholas Lander will ask
Ben Richardson at Perth; alternatively, if Gregor Hagedorn continues as
convener, he can host the site on the www.DiversityCampus.net server of
the GLOPP project.
Most participants agreed that a true SDD workshop should be organized,
lasting at least two full days, to clarify the issues brought up during
the email discussions and reach real understanding. It was agreed that the
email discussions are extremely good and useful, but that it is difficult
to fully assess where a consensus was reached and where the discussion
simply tapered off. Also, certain concepts are easier to discuss in
person, using graphical presentations.
Such a workshop was considered desirable in the near future, at the latest
immediately before or after the next TDWG meeting. It should be organized
in conjunction with some other event, to minimize the cost of traveling in
terms of time and money. Another possible time would be before or after a
workshop planned by the TDWG subgroup on Accession Data for the spring of
2001.
The more technical discussions regarding implementation choices in XML
should be clearly separated on the discussion list by the string “XML”.
Perhaps a sub-subgroup might be formed to resolve questions in this area?
== Discussion topic: Resource discovery ==
Bob Morris: We need to distinguish between questions that hinge on
resource cost and questions that hinge on biological problems. Resource
cost: does it pay to query a given information source to find out whether
it holds the information sought? It should be easy to know in advance
whether a server may hold the information I am looking for. New important
development reported by Bob Morris: Universal Description, Discovery and
Integration (UDDI). Nothing about resource cost has been discussed so far,
but the group may address this issue. Some information about this should
be in the header of a new XML file standard. It was agreed that this
question is secondary to the interest
Information that should be present in the header of the files:
Taxonomic scope: What are the taxa I hold data on, either as a list or as
a query definition (query -> sorry I hold no data)?
Do I have descriptions (e.g. unparsed natural language)?
Do I have structured descriptions? Which kind, e.g. DELTA/NEXUS/XML? Or:
if so, point me to where the structure of description is defined…
Do I have media (images)?
Do I have keys or other means of identification?
Do I have interactive keys?
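The header questions above could be sketched as a small record that a
client inspects before deciding whether a server is worth querying. This
is only an illustration: the class name `ResourceHeader`, its field names,
and the function `worth_querying` are assumptions made here, not part of
any proposal discussed in the session, and the taxonomic scope is
simplified to an explicit list of taxon names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceHeader:
    """Hypothetical header record; all field names are illustrative."""
    taxa: List[str] = field(default_factory=list)   # taxonomic scope as a list
    has_natural_language: bool = False               # unparsed descriptions
    structured_formats: List[str] = field(default_factory=list)  # e.g. "DELTA"
    schema_location: Optional[str] = None            # where the structure is defined
    has_media: bool = False
    has_keys: bool = False
    has_interactive_keys: bool = False


def worth_querying(header: ResourceHeader, taxon: str,
                   need_structured: bool = False) -> bool:
    """Decide from the header alone whether the source may hold what we want."""
    if taxon not in header.taxa:
        return False          # "sorry, I hold no data"
    if need_structured and not header.structured_formats:
        return False
    return True


# Example header for an imaginary server (values are made up).
example = ResourceHeader(
    taxa=["Hedera", "Quercus"],
    has_natural_language=True,
    structured_formats=["DELTA"],
    has_media=True,
)
```

With this sketch, `worth_querying(example, "Hedera", need_structured=True)`
is true, while a request for a taxon outside the declared scope is
rejected without contacting the data itself.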
== Discussion topic: Who is interested in a new standard for descriptive data? ==
We distinguished between 3 user/provider interests:
1) The pure user of electronic field guides, applying the descriptions
e.g. to identification work.
2) The pure data provider, who holds large amounts of descriptive
information that is historical and has the status of an unchanging
publication.
3) The data-developing scientist, who creates and analyses/uses his or her
data. In contrast to case 2, any data here must be referenced to
individual sources, and any information may be contested as false.
Case 2 occurs e.g. in projects where legacy information, such as huge
conventional flora or fauna works, is digitized and is to be accessed in a
structured way. In cases 1 and 2 the data can be transformed in ways that
may lead to some loss of information (e.g. concerning the source of
individual assertions: "leaves are glandulous when young: observed by
Author1, 1980"). Case 3, however, requires a rigid structure to allow
knowledge management and data validation. Case 3 is the case assumed in
programs like the CSIRO Delta programs, Pankey/Pandora, or DeltaAccess,
which see themselves as working tools for scientists.
== Discussion topic: XML markup as proposed by Kevin Thiele ==
Bob Morris raised the question whether Kevin Thiele’s proposal would be
enough to start with, as a base to which additional structural information
can be added. Greg Whitbread thinks yes, but character types should be
added. Gregor Hagedorn remarked that he has no objections to Kevin’s
principal argument about the advantages of providing a very simple system
that could be used to mark up existing information. It is difficult to
say, however, whether that system would be compatible with a more
structured approach as long as the structured approach is still very
vague. Nevertheless, it may be wise to move ahead with Kevin’s approach,
at the risk of having to redefine it later.
It was agreed that it is desirable to have a common standard which defines
levels of structure. Gregor Hagedorn remarked that software may require at
least a certain structural level, e.g. a structured database may require
coded markup of character and feature, referring to a full character
schema. Levels should be clearly labeled, so that an application can
easily detect whether there is sufficient structure for its needs.
However, levels should be known not only to the software but also to
users, so that scientists can communicate “I can give you this, is this ok
for you?” Problems like those with versions of the DELTA standard (the
format changes, but the version is not recognizable to importing software)
or, in the graphics area, the “TIF” problem (the standard defines an
envelope which may contain any kind of information, including proprietary
formats unreadable by other software) should be avoided. Possible levels
could be:
>> Level 0: The description is marked up as a block referring to a certain
>> taxon. No markup of structures, methodology, or features.
>> Level 1: Level 0 plus markup (not necessarily complete) of structures
>> (leaf, flower, …) or method (naked eye, hand lens, light microscope,
>> scanning electron microscope)
>> Level 2: Level 1 plus markup of characters (i.e.
>> structure/methodology/feature), but not character states
>> Level 3: Level 2 plus markup of states
>> Level 4: Level 3, fully coded markup referring to separate character
>> definition/schema
Are more levels needed? More orthogonal scheme, with complete/incomplete
markup noted separately?
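How an application might detect “sufficient structure for its needs” can
be sketched as follows, assuming the levels remain a simple linear scale
(the open question about a more orthogonal scheme is not addressed here).
The names `MarkupLevel`, `is_sufficient`, and `parse_level`, and the
“Level N” label convention, are illustrative assumptions only.

```python
from enum import IntEnum


class MarkupLevel(IntEnum):
    """The five levels proposed in the minutes, as an ordered scale."""
    TAXON_BLOCK = 0          # description marked up as a block for a taxon
    STRUCTURE_OR_METHOD = 1  # plus structures or method
    CHARACTER = 2            # plus characters, but not states
    STATE = 3                # plus character states
    CODED_SCHEMA = 4         # fully coded, referring to a separate schema


def is_sufficient(provided: MarkupLevel, required: MarkupLevel) -> bool:
    """True if the data set offers at least the level the software needs."""
    return provided >= required


def parse_level(label: str) -> MarkupLevel:
    """Parse a label such as 'Level 3' (a hypothetical labelling convention)."""
    return MarkupLevel(int(label.strip().split()[-1]))
```

A structured database that needs coded markup would, under these
assumptions, call `is_sufficient(parse_level("Level 3"),
MarkupLevel.CODED_SCHEMA)` and learn before importing that the data set
falls short.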
Gregor Hagedorn brought up the question whether the simpler, character
schema-free forms of XML markup are able to cope with queries and reports
in multiple languages. It seems to him that the words used in the markup
carry only their English meaning, without any definition being available
elsewhere. This is in contrast to the DELTA method of defining characters
and states, and using codes that can easily be output in multiple
languages concurrently.
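The point about codes can be illustrated with a minimal sketch in which a
description stores only character and state codes, and the wording is
looked up per language at output time. The codes (`c1`, `c1.s1`, …), the
table `LABELS`, and the function `render` are invented for this
illustration and do not reproduce the DELTA format.

```python
# Hypothetical label table: one entry per code, one wording per language.
LABELS = {
    "c1": {"en": "leaf shape", "de": "Blattform"},
    "c1.s1": {"en": "lobed", "de": "gelappt"},
    "c1.s2": {"en": "almost round", "de": "fast rund"},
}


def render(coded_description, lang):
    """Turn (character code, state code) pairs into text in one language."""
    return "; ".join(
        f"{LABELS[char][lang]}: {LABELS[state][lang]}"
        for char, state in coded_description
    )


# The stored description contains no natural-language words at all.
description = [("c1", "c1.s1")]
english = render(description, "en")
german = render(description, "de")
```

Here `english` is "leaf shape: lobed" and `german` is "Blattform:
gelappt", both produced from the same coded record. Markup that embeds the
English words directly has no such table to fall back on.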
Nicholas Lander: In chemistry there is a file format standard, the "Star
file" format (a 20-year-old but dynamic CODATA standard, now also in XML),
which supplies means to define core character lists with supplements.
== Discussion topic: Discussion of future of DELTA format ==
Nicholas Lander: “We need more rigorous system.” It was agreed to put
effort into the original idea of developing a system that goes beyond
DELTA and NEXUS, but encompasses the functionality of NEXUS and DELTA/New
DELTA, and adds the additional requirements identified by LucID or
DeltaAccess. Richard Pankhurst warned about the generalized system
fallacy: any system that tries to fulfill too many requirements will
become very complex and inherently difficult to analyze and maintain.
There is a danger of creating a monster structure nobody will actually
use.
@@ Question to participants: something was discussed about the Free Delta
system, but I missed that in my notes. Can anybody fill in here?
== Discussion topic: Standard character lists ==
Several members stressed the need for standard character lists in the
future. It was (who said that??) proposed to start with core schemata that
can be expanded as time goes on. Gregor Hagedorn proposed that standards
should not necessarily be seen as concentric rings (e.g. a single standard
with successive versions or levels), but perhaps rather as modular blocks
that can stand side by side. For example, it may be wise to have standards
for different methodologies (field observation, light microscope, SEM
characters, chemical compounds) developed and maintained by different
standards bodies. A given description could choose from the standard
modules as necessary for the observations or studies made.
Gregor Hagedorn: Standard character schemata should be developed like
scientific publications, so that they can be developed and improved by
scientists in the course of several years. Only after the contours of
competing schemata have become clear should standardization efforts begin.
== Discussion topic: Character vs. Structure / feature ==
The following discussion took place after a break made necessary by the
clash of Bob Morris’s presentation with the discussion. Many participants
were absent during this part. Jean-Marc Vanel presented his views about
structural analysis of character data. It was first agreed that it is
preferable to use the general term “structure” rather than the term
“organ” proposed by Jean-Marc Vanel. Vanel proposed that structures can be
hierarchical or primary/secondary. We found, however, that structures do
not necessarily have a clear hierarchy. If the same type of hairs exists
on both the stem and the leaf, it is not sufficient to place “hairs”
outside the group containing both stem and leaf. It is possible that
characters and structures have overlapping hierarchies that cannot be
resolved into a simple tree. For example, the same hairs may occur on many
different structures, which cannot be grouped hierarchically. A Ref-ID
mechanism is necessary to document the relations between structures,
substructures, and properties. Vanel: “XML is like a Christmas tree: basic
tree with decoration connecting branches”.
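The hairs example can be sketched with identifiers and references instead
of nesting. The data layout, the identifiers (`s1`, `s2`, `s3`), and the
function names below are assumptions made for illustration; they are not
the Ref-ID mechanism itself, which the session did not specify.

```python
# Each structure has an id; "refs" lists the ids of structures it occurs on.
# "hairs" refers to two parents, so the data is a graph, not a simple tree.
structures = {
    "s1": {"name": "stem", "refs": []},
    "s2": {"name": "leaf", "refs": []},
    "s3": {"name": "hairs", "refs": ["s1", "s2"]},
}


def occurs_on(struct_id):
    """Names of the structures on which the given structure occurs."""
    return sorted(structures[ref]["name"] for ref in structures[struct_id]["refs"])


def borne_by(target_id):
    """Names of the structures that refer to the given structure."""
    return sorted(
        entry["name"]
        for entry in structures.values()
        if target_id in entry["refs"]
    )
```

`occurs_on("s3")` yields both "leaf" and "stem" from a single definition
of the hairs, which a strictly nested layout could only achieve by
duplicating that definition under each parent.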
Richard Pankhurst uses relational adjectives (= linguistic term): part-of
relationship and kind-of relationship. Examples: ‘Leaflet’ is a part of a
leaf. ‘Basal leaf’ is a kind of leaf; ‘glume’ is a kind of leaf, but in a
more special sense.
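The two relationship types can be kept apart by labelling each link, as in
the following sketch. The triple layout and the function `related` are
illustrative assumptions; only the three example statements come from the
discussion.

```python
# (subject, relation, target) triples taken from the examples above.
RELATIONS = [
    ("leaflet", "part_of", "leaf"),
    ("basal leaf", "kind_of", "leaf"),
    ("glume", "kind_of", "leaf"),
]


def related(subject, relation, target):
    """True if `subject` reaches `target` by following only `relation` links."""
    seen = set()
    frontier = [subject]
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        for s, r, t in RELATIONS:
            if s == current and r == relation:
                if t == target:
                    return True
                frontier.append(t)
    return False
```

Under this sketch `related("leaflet", "part_of", "leaf")` holds, while
`related("leaflet", "kind_of", "leaf")` does not, so a query for “all
kinds of leaf” would return basal leaves and glumes but not leaflets.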
Features have qualifiers. Richard Pankhurst further discussed restrictions
of context, e.g. whether a plant is young or old: ivy leaves are lobed
when young and almost round when old. Certain conditions are called
“epitopic”: child birth is possible only in females, and only in females
that are pregnant and of appropriate age.
Gregor Hagedorn stressed the importance of methods or methodology in
addition to the structure/basic property analysis performed by Diederich
et al. The same character may have different states (or values/results)
for different methodologies (e.g. surface rough in SEM, but smooth with
hand lens). Methodology can further be split into “observation method”
(type of apparatus used) and “condition” (environmental or experimental
method: soil, climate, culture media, substrate, etc.). In some cases a
character can, implicitly, only be observed using a certain methodology.
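A record that keeps the methodology alongside structure, property, and
state could look like the sketch below. The class `Observation`, its field
names, and the choice of “texture” as the property name are assumptions
made for illustration; the rough/smooth example is the one given above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical record; field names are illustrative only."""
    structure: str                    # e.g. "surface"
    prop: str                         # basic property, e.g. "texture"
    state: str                        # observed state or value
    method: str                       # observation method (apparatus used)
    condition: Optional[str] = None   # environmental or experimental condition


observations = [
    Observation("surface", "texture", "rough", "SEM"),
    Observation("surface", "texture", "smooth", "hand lens"),
]


def states_by_method(obs_list, structure, prop):
    """Group the recorded states of one character by observation method."""
    result = {}
    for obs in obs_list:
        if obs.structure == structure and obs.prop == prop:
            result.setdefault(obs.method, []).append(obs.state)
    return result
```

Grouping by method keeps “rough” and “smooth” from appearing as a
contradiction: each state is valid for the methodology under which it was
recorded.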
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19, 14195 Berlin, Germany
Tel: +49-30-8304-2220
Fax: +49-30-8304-2203