TDWG subgroup: Structure of Descriptive Data, subgroup session at the TDWG

meeting in Frankfurt, 12. Nov. 2000 ## Version 1 ##

 

== Participants ==

Stan Blum, Jim Croft, Gregor Hagedorn, Nicholas Lander, Bob Morris, Jörg

Ochsmann, Richard Pankhurst, Jean-Marc Vanel, Mark Watson, Greg Whitbread,

and others

 

== Discussion topic: What is this group about? ==

Bob Morris: “Group is not interested in resource development.” Nicholas
Lander: Not on Standard Characters lists. There was a TDWG subgroup on
Standard Characters for plants (Richard Pankhurst was convener), which did
not succeed in drawing up a standard character list. Richard Pankhurst
thinks we should put it off for a further 5 years until Delta-like
technology is widely used and interest is sufficient.

 

== Discussion topic: SDD future and presentation ==

Convener: Gregor Hagedorn apologized for his lack of initiatives during

the year, being overwhelmed with work in the starting phase of the GLOPP

project. He is willing to continue as convener, but also willing to step

aside if somebody else wants to fill the position and bring more

initiatives into the group.

 

It was agreed that there should be an SDD website to point to information
resources concerning the activities of the group. Nicholas Lander will ask
Ben Richardson at Perth; alternatively, if Gregor Hagedorn continues as
convener, he can host the site on the www.DiversityCampus.net server of
the GLOPP project.

 

Most participants agreed that a true SDD workshop should be organized,
lasting at least two full days, to clarify the issues brought up during
the email discussions and reach real understanding. It was agreed that the
email discussions are extremely good and useful, but that it is difficult
to fully assess where a consensus was reached and where the discussion
simply tapered off. Also, certain concepts are more easily discussed with
physical presence, using graphical presentations.

 

Such a workshop was considered to be desirable in the near future, at the
latest immediately before or after the next TDWG meeting. It should be
organized in conjunction with some other event, to minimize traveling
expenses in terms of time and money. Another possible time would be before
or after a workshop which is planned by the TDWG subgroup on Accession
Data in the spring of 2001.

 

The more technical discussions regarding implementation choices in XML

should be clearly separated on the discussion list by the string “XML”.

Perhaps a sub-subgroup might be formed to resolve questions in this area?

 

== Discussion topic: Resource discovery ==

Bob Morris: We need to distinguish between questions that hinge on
resource cost and questions that hinge on biological problems. Resource
cost: does it pay to query a given information source to find out whether
it holds the information sought? It should be easy to know in advance
whether a server may hold the information I am looking for. An important
new development reported by Bob Morris: Universal Description, Discovery
and Integration (UDDI). Nothing about resource cost has been discussed so
far, but the group may address this issue. Some information about this
should be in the header of a new XML file standard. It was agreed that
this question is secondary to the interest

 

Information that should be present in the header of the files:

Taxonomic scope: What are the taxa I hold data on, either list, or query
definition (query -> sorry I hold no data)
Do I have descriptions? (e.g. unparsed natural language)
  Do I have structured descriptions? Which kind? e.g. DELTA/NEXUS/XML? or:
  If so, point me to where the structure of description is defined…
Do I have media (images)?
Do I have keys or other means of identification?
  Do I have interactive keys?
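
As an illustration only, the header questions above could be answered by a
small machine-readable block that a client reads before querying a server.
The sketch below is in Python; the element and attribute names
(ResourceHeader, TaxonomicScope, Descriptions, Media, Identification) and
the function names are assumptions made for this example and are not part
of any agreed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical header: all element and attribute names are assumptions.
HEADER_XML = """
<ResourceHeader>
  <TaxonomicScope>
    <Taxon>Asteraceae</Taxon>
    <Taxon>Poaceae</Taxon>
  </TaxonomicScope>
  <Descriptions natural_language="yes" structured="yes" format="DELTA"/>
  <Media images="no"/>
  <Identification keys="yes" interactive_keys="no"/>
</ResourceHeader>
"""


def parse_header(xml_text):
    """Read the hypothetical header into a plain dictionary."""
    root = ET.fromstring(xml_text)
    desc = root.find("Descriptions")
    ident = root.find("Identification")
    structured = desc.get("structured") == "yes"
    return {
        "taxa": {t.text.strip() for t in root.iter("Taxon")},
        "structured_format": desc.get("format") if structured else None,
        "has_images": root.find("Media").get("images") == "yes",
        "has_keys": ident.get("keys") == "yes",
        "has_interactive_keys": ident.get("interactive_keys") == "yes",
    }


def worth_querying(header, taxon, need_structured=False):
    """Decide from the header alone whether a full query is worthwhile."""
    if taxon not in header["taxa"]:
        return False  # "sorry I hold no data"
    if need_structured and header["structured_format"] is None:
        return False
    return True


header = parse_header(HEADER_XML)
print(worth_querying(header, "Poaceae", need_structured=True))
print(worth_querying(header, "Orchidaceae"))
```

The point of the sketch is that the resource-cost question is answered
from the header without retrieving any descriptions.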

 

== Discussion topic: Who is interested in a new standard for descriptive data? ==

We distinguished between 3 user/provider interests:

1) The pure user of electronic field guides, applying the descriptions
e.g. to identification work.
2) The pure data provider, who holds large amounts of descriptive
information that is historical and has the status of an unchanging
publication.
3) The data-developing scientist, who creates and analyses/uses his or her
data. In contrast to case 2, any data here must be referenced to
individual sources, and any information is open to being contested as
false.

 

Case 2 occurs e.g. in projects where legacy information, such as huge
conventional flora or fauna works, is digitized and is to be accessed in a
structured way. In cases 1 and 2 the data can be transformed in ways that
may lead to some loss of information (e.g. concerning the source of
individual assertions: "leaves are glandulous when young: observed by
Author1, 1980"). Case 3, however, requires a rigid structure to allow
knowledge management and data validation. Case 3 is the case assumed in
programs like the CSIRO Delta programs, Pankey/Pandora, or DeltaAccess,
which see themselves as working tools for scientists.

 

== Discussion topic: XML markup as proposed by Kevin Thiele ==

Bob Morris raised the question whether Kevin Thiele’s proposal would be
enough to start with, as a base to which additional structural information
can be added. Greg Whitbread thinks yes, but character types should be
added.

 

Gregor Hagedorn remarked that, while he has no objection to Kevin’s
principal argument about the advantages of providing a very simple system
that could be used to mark up existing information, it is difficult to say
whether the system would be compatible with a more structured approach as
long as the structured approach is still very vague. However, it may be
wise to move ahead with Kevin’s approach, at the risk of having to
redefine it later.

 

It was agreed that it is desirable to have a common standard which defines

levels of structure. Gregor Hagedorn remarked that software may require

at least a certain structural level, e.g. a structured database may

require coded markup of character and feature, referring to a full

character schema. Levels should be clearly labeled, so that an application

can easily detect whether there is sufficient structure for its needs.

However, levels should be known not only to the software, but also to

users so that scientists can communicate “I can give you this, is this ok

for you?” Problems like those with versions of the DELTA standard (format

changes, but version not recognizable for importing software), or in the

graphics area the “TIF”-problem (the standard defines an envelope, which

may contain any kind of information, including proprietary formats

unreadable for other software) should be avoided. Possible levels could

be:

 

>> Level 0: The description is marked up as a block referring to a certain
>> taxon. No markup of structures, methodology, or features.
>> Level 1: Level 0 plus markup (not necessarily complete) of structures
>> (leaf, flower, …) or method (naked eye, hand lens, light microscope,
>> scanning electron microscope)
>> Level 2: Level 1 plus markup of characters (i.e.
>> structure/methodology/feature), but not character states
>> Level 3: Level 2 plus markup of states
>> Level 4: Level 3, fully coded markup referring to separate character
>> definition/schema

Are more levels needed? More orthogonal scheme, with complete/incomplete
markup noted separately?
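
As an illustration of how an application could detect whether a data set
carries enough structure, the cumulative levels above can be treated as
ordered values and compared. This is only a sketch in Python; the names
MarkupLevel and has_sufficient_structure are assumptions made for this
example, and the numbering simply follows the possible levels listed
above.

```python
from enum import IntEnum


class MarkupLevel(IntEnum):
    """Cumulative markup levels; each level includes all lower ones."""
    TAXON_BLOCK = 0   # description marked up as a block per taxon
    STRUCTURES = 1    # plus structures or methods (possibly incomplete)
    CHARACTERS = 2    # plus characters, but not states
    STATES = 3        # plus character states
    CODED = 4         # fully coded, refers to a separate character schema


def has_sufficient_structure(declared, required):
    """True if the declared level of a data set meets the required level."""
    return declared >= required


# A hypothetical structured database that needs fully coded markup:
database_needs = MarkupLevel.CODED
offered = MarkupLevel.STATES
print(has_sufficient_structure(offered, database_needs))
print(has_sufficient_structure(MarkupLevel.CODED, database_needs))
```

A simple comparison works only because the levels are cumulative. A more
orthogonal scheme, as asked above, would need separate flags (for example
for complete versus incomplete markup) instead of a single number.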

 

Gregor Hagedorn brought up the question whether the simpler, character
schema-free forms of XML markup are able to cope with queries and reports
in multiple languages. It seems to him that the words used as markup stand
only for their English meaning, without any definition being available
elsewhere. This is in contrast to the DELTA method of defining characters
and states, and using codes that can easily be output in multiple
languages concurrently.
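
The difference can be sketched as follows: when a description stores codes
rather than words, the same record can be rendered in any language for
which labels exist. The Python sketch below is illustrative only; the
codes (C1, S1, S2), the label tables, and the function name render are
assumptions made for this example and do not reproduce DELTA syntax.

```python
# Hypothetical character and state definitions with labels per language.
CHARACTER_LABELS = {
    "C1": {"en": "leaf shape", "de": "Blattform"},
}
STATE_LABELS = {
    ("C1", "S1"): {"en": "lobed", "de": "gelappt"},
    ("C1", "S2"): {"en": "almost round", "de": "fast rund"},
}


def render(coded_description, lang):
    """Render a list of (character code, state code) pairs in one language."""
    parts = []
    for char_code, state_code in coded_description:
        char = CHARACTER_LABELS[char_code][lang]
        state = STATE_LABELS[(char_code, state_code)][lang]
        parts.append(f"{char}: {state}")
    return "; ".join(parts)


# The description itself contains only codes, no words of any language.
description = [("C1", "S1")]
print(render(description, "en"))
print(render(description, "de"))
```

Schema-free markup, by contrast, would store the English word itself, so
there would be no code to look up when a report in another language is
requested.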

 

Nicholas Lander: There is a file format standard in chemistry, the "Star
file" format (a 20-year-old but dynamic CODATA standard, now also in XML),
which supplies means to define core character lists with supplements.

 

== Discussion topic: Discussion of future of DELTA format ==

Nicholas Lander: “We need more rigorous system.” It was agreed to put

efforts into the original idea of developing a system that goes beyond

DELTA and NEXUS, but encompasses the functionality of NEXUS, DELTA/New

DELTA, and adds the additional requirements identified by LucID or

DeltaAccess. Richard Pankhurst warned about the generalized system

fallacy: Any system that tries to fulfill too many requirements will

become very complex and inherently difficult to analyze and maintain.

There is a danger of creating a monster-structure nobody will actually

use.

 

@@ Question to participants: something was discussed about the Free Delta

system, but I missed that in my notes. Can anybody fill in here?

 

== Discussion topic: Standard character lists ==

Several members stressed the need for Standard character lists in the

future. It was (who said that??) proposed to start with core schemata that

can be expanded as time goes on. Gregor Hagedorn proposed that standards

should not necessarily be seen as concentric rings (e.g. a single standard

with successive versions or levels), but perhaps rather modular blocks

that can stand side-by-side. For example, it may be wise to develop and

maintain standards for different methodologies (field observation, light

microscope, SEM characters, chemical compounds) by different standard

bodies. A given description could choose from the standard modules as

necessary for the observations or studies made.

 

Gregor Hagedorn: Standard character schemata should be developed like
scientific publications, so that they can be developed and improved by
scientists over the course of several years. Standardization efforts
should begin only after the contours of competing schemata have become
clear.

 

== Discussion topic: Character vs. Structure / feature  ==

The following discussion happened after a break necessary due to the
collision of Bob Morris's presentation with the discussion. Many
participants were absent during this part. Jean-Marc Vanel presented his
views about structural analysis of character data. It was first agreed
that it is preferable to use the general term “structure” rather than the
term “organ” proposed by Jean-Marc Vanel. Vanel proposed that structures
can be hierarchical or primary/secondary. We found, however, that
structures do not necessarily have a clear hierarchy. If the same type of
hairs exists on both the stem and the leaf, it is not sufficient to place
“hairs” outside the group containing both stem and leaf. It is possible
that characters and structures have overlapping hierarchies that cannot be
resolved into a simple tree. For example, the same hairs may occur on many
different structures, which cannot be grouped hierarchically. A Ref-ID
mechanism is necessary to document the relations between structures,
substructures, and properties. Vanel: “XML is like a Christmas tree: basic
tree with decoration connecting branches”.
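
A minimal sketch of such a Ref-ID mechanism, in Python, is given below.
The hair type is defined once and referenced by ID from both stem and
leaf, so that it does not have to be forced into a single place in the
tree. The class name Structure, the field names, and the IDs (s1, h1, …)
are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Structure:
    ref_id: str
    name: str
    parent: Optional[str] = None  # the "basic tree": at most one parent
    bears: List[str] = field(default_factory=list)  # cross-references by ID


# Hypothetical structures: the hairs are defined once and referenced twice.
structures = {
    "s1": Structure("s1", "plant"),
    "s2": Structure("s2", "stem", parent="s1", bears=["h1"]),
    "s3": Structure("s3", "leaf", parent="s1", bears=["h1"]),
    "h1": Structure("h1", "hairs"),
}


def bearers(all_structures, ref_id):
    """Names of all structures that reference the given ID."""
    return sorted(s.name for s in all_structures.values() if ref_id in s.bears)


print(bearers(structures, "h1"))
```

In terms of the Christmas tree image, the parent field gives the basic
tree and the bears references are the decoration connecting branches.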

 

Richard Pankhurst uses relational adjectives (= linguistic term): part-of

relationship and kind-of relationship. Examples: ‘Leaflet’ is a part of a

leaf. ‘Basal leaf’ is a kind of leaf; ‘glume’ is a kind of leaf, but in

a more special sense.
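
For illustration, the two relationship types can be recorded as simple
triples and queried separately. The Python sketch below uses only the
examples given above; the relation names part_of and kind_of and the
function name related are assumptions made for this example.

```python
# (subject, relation, object) triples taken from the examples above.
RELATIONS = [
    ("leaflet", "part_of", "leaf"),
    ("basal leaf", "kind_of", "leaf"),
    ("glume", "kind_of", "leaf"),
]


def related(term, relation, relations=RELATIONS):
    """All subjects that stand in the given relation to the term."""
    return sorted(s for s, r, o in relations if r == relation and o == term)


print(related("leaf", "kind_of"))
print(related("leaf", "part_of"))
```

Keeping the two relations apart matters because a query for all kinds of
leaf should return the glume but not the leaflet.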

 

Features have qualifiers. Richard Pankhurst further discussed restrictions
of context: plant is young/old; ivy leaves: lobed when young, almost round
when old. Certain conditions are called “epitopic”: childbirth is possible
only in females, and only in females that are pregnant and of appropriate
age.

 

Gregor Hagedorn stressed the importance of methods or methodology in
addition to the structure/basic property analysis performed by Diederich
et al. The same character may have different states (or values/results)
for different methodologies (e.g. surface rough in SEM, but smooth with
hand lens). Methodology can further be split into “observation method”
(type of apparatus used) and “condition” (environmental or experimental
method: soil, climate, culture media, substrate, etc.). In some cases a
character can implicitly be observed only with a certain methodology.
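
A minimal sketch of this point in Python: if the observation method (and
optionally the condition) is stored with each recorded state, the two
results for the surface do not contradict each other. The class name
Observation, its field names, and the structure “seed” are assumptions
made for this illustration; only the rough/smooth example comes from the
discussion.

```python
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    structure: str
    prop: str
    state: str
    method: str                      # "observation method": apparatus used
    condition: Optional[str] = None  # environmental or experimental condition


# Hypothetical records; the structure "seed" is invented for the example.
observations = [
    Observation("seed", "surface", "rough", method="SEM"),
    Observation("seed", "surface", "smooth", method="hand lens"),
]


def states_by_method(records, structure, prop):
    """Map each observation method to the state recorded with it."""
    return {
        r.method: r.state
        for r in records
        if r.structure == structure and r.prop == prop
    }


print(states_by_method(observations, "seed", "surface"))
```

Without the method field, the two records would look like conflicting
states of the same character.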

 

----------------------------------------------------------

 

 

Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203