Computers, Quantification, and Databases in the 21st Century
Norman MacLeod, Patrick Diver, Robert Guralnick, David Lazarus, and Bjorn Malmgren
Introduction
While the use of computers and digital communications technology pervades some areas of paleontology in these final years of the 20th century, many of paleontology's subdisciplines have the appearance (to both internal and external observers) of being ambivalent about the power of computers to increase paleontologists' effectiveness at their research, applications, and educational tasks. This ambivalence has arisen as a result of paleontological needs in the areas of data management / storage / access, quality control, imaging, and publication, coupled with the limitations and expense of available computer hardware/software solutions. Throughout the 1960's, 1970's and 1980's the match between needs and tools for many aspects of paleontology has simply not been right. However, this situation is changing. We are rapidly approaching the time when computer-driven information technologies will transform paleontology in the same way that they have transformed other scientific disciplines (e.g., physics, mathematics, geophysics, chemistry, and molecular biology).
The immediate challenge for paleontologists is to come to grips with the expanding information technologies that will overwhelm our field within the next 10-20 years. Already there is a substantial divide between more qualitatively-oriented paleontologists and stratigraphers who primarily use computers for word processing and e-mail, and their colleagues, many of whom are conversant with such topics as statistical analysis, morphometrics, databases, numerical modeling & simulation, computer networks, and the World Wide Web. That this transition will happen is no longer an issue. The only questions that remain are when, how, and at what social cost (for paleontology is very much a community of individuals). If this transition is allowed to continue being made piecemeal, with individuals and companies/research groups proceeding more-or-less in isolation from one another, the transition will be protracted and the associated costs (both monetary and personal) will be high. On the other hand, if this transition is organized and managed by the paleontological community, these technologies can be integrated into the mainstream of our science on a much more time- and cost-effective basis. The following reviews and recommendations are submitted to encourage consideration of these topics by all paleontologists and to serve as a focal point from which definite plans for action can be developed.
Quantification
Paleontologists have been using quantitative methods for more than 100 years, but it was not until the advent of computers in the late 1950's and early 1960's that more advanced multivariate techniques were first applied to paleontological data. Paleontological data are by their nature most often of a multivariate character. Nevertheless, due to the inaccessibility of computers and a general unawareness of the utility of multivariate techniques, the first decades of the computer era saw relatively few applications of such techniques, written by a limited group of enthusiasts. When personal computers became an everyday tool among paleontologists at the beginning of the 1980's, the use of multivariate methods began to increase, but the application of such techniques has yet to become a standard procedure in many areas of paleontological research where such techniques could be profitably employed.
The use of quantitative methods in paleontology can be divided into the analyses of morphological characteristics and variation in the quantitative distribution (abundances and relative abundances) of faunal elements. For example, the morphology of a fossil organism cannot be quantitatively described by a single variable. Multivariate descriptors are required. Likewise, phylogenetic relationships must be analyzed using a multidimensional approach. The study of relations between the morphology of fossil organisms, the variation in fossil faunas or floras, and the surrounding environment in which they live represents another example of the multivariate nature of paleontological data.
Also, during the last decade a wide variety of new quantitative techniques have appeared in the literature. Artificial neural networks (e.g., for calibrations of relative abundance data of fossil species to oceanic temperature data, taxonomy, and modeling of paleontological time-series), fuzzy set theory (fuzzy logic), machine learning, and soft modeling, all hold much promise for improving the quality of our data analyses, but have yet to make the kind of impact that is becoming routine in other areas of scientific inquiry. The problem is that use of such quantitative approaches to paleontological data analysis is essentially confined to certain countries or universities, where different traditions and/or approaches have evolved. Where this tradition is lacking, the incentive and ability to apply advanced quantitative techniques have not developed.
Even though the availability of "canned" programs for statistical analysis has meant an upsurge in the application of quantitative techniques to paleontological data through the last decade, there is a true danger in using such programs without proper knowledge of the basic assumptions behind various techniques. For example, one major requirement in parametric statistics is the conformity to normality or multivariate normality of the data being analyzed. Comparatively few paleontological data are so distributed. Another concern is the lack of appreciation for the precision of the data being submitted to analysis. We emphasize the need to compute measures of precision of the data, such as confidence intervals for relative abundance data or mean values of morphological measurements, in paleontological studies. Without inclusion of measures of precision in graphs showing changes with time in some character, the pattern cannot be interpreted in terms of trends or cycles.
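The point about precision can be made concrete with a short sketch. The function below computes an approximate 95% confidence interval for a relative abundance (the count of one species out of the total count in a sample). The choice of the Wilson score interval, the function name, and the example counts are assumptions made for illustration only; other interval methods may suit a given data set better.

```python
import math

def wilson_interval(count, total, z=1.96):
    """Approximate confidence interval for the proportion count/total.

    z=1.96 corresponds to a two-sided 95% interval under the normal
    approximation used by the Wilson score method.
    """
    if total <= 0 or not 0 <= count <= total:
        raise ValueError("need total > 0 and 0 <= count <= total")
    p = count / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half_width = z * math.sqrt(
        p * (1.0 - p) / total + z * z / (4.0 * total * total)
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical census: 30 specimens of one species in a sample of 300.
low, high = wilson_interval(30, 300)
print(f"relative abundance {30 / 300:.3f}, 95% CI {low:.3f} to {high:.3f}")
```

Plotting such intervals as error bars on an abundance curve shows at a glance whether an apparent shift between adjacent samples exceeds the counting uncertainty.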
Databases: Overview
Databases are structured collections of data, together with special software for entering and retrieving data. As such, databases are at the heart of many modern enterprises. Most large businesses and government organizations could not exist without the central coordinating, memory, and communications role of databases. Databases will certainly play a vital role in the future of paleontology. Although all aspects of paleontologic information can be stored in computer databases, the few "all in one" database systems that have been developed are typically specific to the needs of one individual. More general-purpose databases normally contain a more restricted range of paleontological information, as follows.
Character Databases. - Two-dimensional matrices of character state values for a set of taxa or individuals. With appropriate software, these databases can be used as taxonomic "keys," to identify new material, or they can be analyzed for various types of structure (e.g., geographic pattern, and phylogenetic relationships). Key type character databases have been rare in paleontology. Analytic character databases are common, as an essential part of quantitative systematics (phenetics, cladistics, etc.).
Taxonomic Databases. - Reproductions, in electronic form, of the classic taxonomic paper or monograph; a single set of taxonomic concepts is held in the database, together with a variety of supporting documentation. The taxon's name, diagnosis, text list of synonyms, and (usually) some images of the taxon are typical. Additional data may include text or graphic summaries of space/time distribution, a list of bibliographic references, and limited character state data. Such databases will eventually replace their paper equivalents, and, via links, feed data into other database systems.
Distributional Databases. - Repositories of biostratigraphic or paleobiogeographical data. The two basic data types in these databases are taxa and samples. Taxa are frequently represented by little more than a single name. Samples are represented in much greater detail, and the occurrences of named taxa in each sample are held in the database as well. Distributional databases are not common, but have been employed in the oil industry, in marine micropaleontology/paleoceanography, and in studies of macroevolution. Globally accessible distributional databases (e.g., that envisioned for micropaleontology by the Ocean Drilling Stratigraphic Network) may come to play a central role in paleontology in the next century.
Collections Databases. - Essentially, inventory systems for managing biological (and paleontological) materials. These databases are mostly employed in museums, botanical gardens, and institutions housing biologic culture collections. Efforts are currently underway by biologists to improve and standardize the design of such databases, with the goal of making the contents of all collections more accessible to researchers as part of global biodiversity science initiatives (e.g., Species 2000).
Bibliographical Databases. - Common in paleontology (and most other fields of science), in part because many good, affordable off-the-shelf bibliographic database programs have long been available.
Nomenclature Databases. - Collections of information on the formal relationships among taxonomic names. They are listed separately here, despite their obvious relevance to other database categories, because in reality, they do not yet exist, except in rather limited form. Modeling and programming databases to reflect the complexities of the nomenclatural codes have proven to be very difficult. However, progress is expected to continue in the area.
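To illustrate the first category in this list, a character database can be sketched as a taxon-by-character matrix that doubles as a simple identification key. The taxa, characters, and state codings below are invented placeholders, and the matching rule (exact agreement on every observed character) is an assumption chosen for brevity; real keys usually need to handle polymorphism and uncertain observations.

```python
# Hypothetical character matrix: rows are taxa, columns are characters,
# cells are coded character states. None of this is real data.
CHARACTERS = ["coiling", "aperture", "keel"]
MATRIX = {
    "Taxon A": {"coiling": "trochospiral", "aperture": "umbilical", "keel": "absent"},
    "Taxon B": {"coiling": "trochospiral", "aperture": "extraumbilical", "keel": "present"},
    "Taxon C": {"coiling": "planispiral", "aperture": "equatorial", "keel": "absent"},
}

def identify(observed):
    """Return the taxa whose coded states match every observed character.

    Characters absent from `observed` are treated as unknown and ignored,
    so fewer observations give a longer candidate list.
    """
    unknown = set(observed) - set(CHARACTERS)
    if unknown:
        raise KeyError(f"characters not in the matrix: {sorted(unknown)}")
    return sorted(
        taxon
        for taxon, states in MATRIX.items()
        if all(states[char] == state for char, state in observed.items())
    )

# One observation narrows the field; a second resolves the identification.
print(identify({"coiling": "trochospiral"}))
print(identify({"coiling": "trochospiral", "keel": "present"}))
```

The same matrix, exported as a table of coded states, is the usual input to phenetic or cladistic analysis programs.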
It is important to realize that in the future databases will not only be widespread, they will continue to be based on a diverse range of technologies, and hold quite different types of data. Melding this disparity into a coherent global information storage and retrieval system will constitute one of the main technical/organizational challenges to our field in the next century.
Developing Paleontological Databases
Because of the importance of databases to the modern world, their design and programming is a recognized professional specialty within the field of information technology. Consequently, several general rules have been developed to guide database developers. The general sequence of steps in developing a database is:
Databases can be programmed using many techniques, but three technologies are the most important: hypertext systems (active text-based links, like most World Wide Web pages), "flat file" databases (a single data table, like a spreadsheet), and relational databases (sets of data tables linked together by cross-referencing fields). These database types will probably continue their dominance into the early decades of the next century. Most paleontological databases - particularly distributional and collection databases - are best implemented with relational technology, although to date comparatively few such programs exist.
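A minimal sketch of the relational approach, applied to a distributional database, is given below using Python's built-in sqlite3 module. The table names, column names, and sample values are hypothetical; the point is only to show how separate taxa and samples tables are linked through cross-referencing fields in an occurrences table.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxa (
        taxon_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL UNIQUE
    );
    CREATE TABLE samples (
        sample_id INTEGER PRIMARY KEY,
        locality  TEXT NOT NULL,
        depth_m   REAL
    );
    -- Each occurrence row cross-references one taxon and one sample.
    CREATE TABLE occurrences (
        taxon_id  INTEGER NOT NULL REFERENCES taxa(taxon_id),
        sample_id INTEGER NOT NULL REFERENCES samples(sample_id),
        specimens INTEGER NOT NULL,
        PRIMARY KEY (taxon_id, sample_id)
    );
""")
conn.executemany("INSERT INTO taxa VALUES (?, ?)",
                 [(1, "Taxon A"), (2, "Taxon B")])
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)",
                 [(10, "Section X", 1.5), (11, "Section X", 3.0)])
conn.executemany("INSERT INTO occurrences VALUES (?, ?, ?)",
                 [(1, 10, 42), (2, 10, 7), (1, 11, 15)])

def occurrences_of(taxon_name):
    """List (locality, depth_m, specimens) for one taxon, shallowest first."""
    cursor = conn.execute(
        """
        SELECT s.locality, s.depth_m, o.specimens
        FROM occurrences AS o
        JOIN taxa AS t ON t.taxon_id = o.taxon_id
        JOIN samples AS s ON s.sample_id = o.sample_id
        WHERE t.name = ?
        ORDER BY s.depth_m
        """,
        (taxon_name,),
    )
    return cursor.fetchall()

print(occurrences_of("Taxon A"))
```

Because a taxon name is stored once and referenced by identifier everywhere else, a later taxonomic revision means changing a single row rather than every occurrence record, which is much harder to guarantee in a flat file.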
The most general problem in developing database technology is the pervasive lack of data format standardization. As discussed below, standards are needed to ensure that different databases can be used effectively. Standard data formats also help ensure the long-term survivability of the data in databases as older systems are replaced, and substantially reduce the time needed to develop new database programs.
Problems more specific to paleontology include data quality, data entry, hardware and software costs, and a severe shortage of technical skills. The costs of computer hardware and software for paleontologic databases are no longer very high. The one (important) current exception is the cost for image databases, which require a great deal more capacity from the individual computer and (for networked systems) the cabling technology. It is expected that the hardware and software needed for even large, networked image databases will become less expensive.
By far the biggest problem facing the development of paleontologic databases is a shortage of skills and manpower. In contrast to a field like physics, where most professionals possess the necessary skills to use computers effectively (including having learned the difficult skill of how to program in a computer language), many paleontologists do not currently have the skills needed to effectively use existing database systems, and very few are trained in designing or programming new databases. Substantially improved training for all paleontologists and the allocation of funds for specialists in this technology will both be needed to prevent this skills shortage from stunting the development of paleontological databases in the next century.
Uses Of Paleontological Databases
The uses of paleontological databases are at least as diverse as paleontology itself. From recording of specimen morphology and related information, to chronostratigraphic interpretations based on paleontological data for reservoir modeling, the information in these databases is of use to many scientific disciplines, not just paleontologists. While uses of paleontological databases might naturally be divided into user categories such as "educational," "industrial," and "research," the information contained in paleontological databases is useful in all of these contexts.
One trend in large private-sector consumers of paleontological data is to create "data warehouses" that contain "business objects." This concept places a premium on centrally storing information and tying data directly to applications. Complete sharing of resources across an entire industry is still a long way off, but efforts such as the Integrated Paleontological System (IPS) are examples of shared application development that eventually will need to address these issues.
Another trend in database technology that may apply to paleontology is web-enabled links to databases, and the ability to generate graphical output that allows the user to "drill-down" further into a database. An example may be to plot an outline of a chosen country, showing locations of paleontologic collections. A user can select from the map or from a list of the outcrops or sub-surface samples, and then display a graph of the sampling. Further drill-down would list species and counts within samples, and also display a range chart of the complete section. This type of information retrieval will become more prevalent in the next several years.
A check for uses of paleontological databases finds that a relatively small percentage of available databases are currently on-line. As noted above, even if a database is available, it may be incompatible with a user's computer software. Accordingly, one of the more important problems to overcome in making paleontological databases maximally useful is to create a common format for understanding what kind of data are available from which database, and how to access them. The lack of a metadatabase (= a database that stores data about data) leaves users on their own in searching for information, often without success.
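The idea of a metadatabase can be sketched as a small catalogue whose records describe other databases rather than fossils. Everything in the example below (the record fields, the catalogue entries, and the search function) is a hypothetical illustration, not a description of any existing catalogue.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatabaseRecord:
    """One metadatabase entry: data about a database, not the data itself."""
    name: str
    data_types: tuple   # e.g. ("distributional", "taxonomic")
    file_format: str    # e.g. "relational", "flat file"
    access: str         # e.g. "on-line", "on request"

# Hypothetical catalogue entries, invented for illustration.
CATALOGUE = [
    DatabaseRecord("Example Museum Collections", ("collections",),
                   "relational", "on-line"),
    DatabaseRecord("Example Range Charts", ("distributional", "taxonomic"),
                   "flat file", "on request"),
    DatabaseRecord("Example Reference List", ("bibliographical",),
                   "flat file", "on-line"),
]

def find_databases(data_type, online_only=False):
    """Return names of catalogued databases that hold the given data type."""
    return [
        record.name
        for record in CATALOGUE
        if data_type in record.data_types
        and (not online_only or record.access == "on-line")
    ]

print(find_databases("distributional"))
print(find_databases("collections", online_only=True))
```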
Another hurdle just beginning to be addressed is a common data exchange format or protocol. By making such a protocol for paleontological data available, more time can be spent analyzing and synthesizing data for interpretation, and less time spent re-formatting data. Another, and perhaps more effective, approach would be to create a common data model for storing paleontological information. Such a data model would allow the broad range of databases expected in the next century (taxonomic, distributional, etc.) to work effectively together. This, in turn, leads to what is probably the biggest problem of all: standardization of database content via standardization of systematic terms and concepts. Historically it has been very difficult for paleontologists to agree upon taxonomic concepts, semi-quantitative definitions, and methods to record various types of "uncertainty." Discussion of this issue is beyond the remit of this document. Yet, without improvement in these areas as well, the quality of data in databases will be so variable as to significantly limit their scientific usefulness.
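What a common exchange format buys can be shown with a small round-trip sketch. The field names, the version number, and the use of JSON as the carrier are all assumptions made for this example; no such standard is defined in this report. The sketch checks that every record carries the agreed fields before export, so that a receiving program can read the file without any re-formatting.

```python
import json

# Hypothetical agreed field names for an occurrence record.
REQUIRED_FIELDS = ("taxon", "sample", "specimens")
FORMAT_VERSION = 1

def export_records(records):
    """Serialize occurrence records, refusing any that lack agreed fields."""
    for record in records:
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {record!r} lacks fields: {missing}")
    payload = {"format_version": FORMAT_VERSION, "records": records}
    return json.dumps(payload, sort_keys=True)

def import_records(text):
    """Parse exported text back into a list of occurrence records."""
    payload = json.loads(text)
    if payload.get("format_version") != FORMAT_VERSION:
        raise ValueError("unsupported format version")
    return payload["records"]

records = [
    {"taxon": "Taxon A", "sample": "Section X, 1.5 m", "specimens": 42},
    {"taxon": "Taxon B", "sample": "Section X, 1.5 m", "specimens": 7},
]
assert import_records(export_records(records)) == records
```

Note that the format check guarantees only the presence of a taxon field, not that two databases mean the same thing by a given name, which is the content problem raised at the end of the paragraph above.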
A worst case scenario for the near future is that no effort is made at standardization of either data formats or data content, and no resources are made available from the various funding agencies (public and private) to create metadatabases. Under such a scenario we expect that the lack of coherency of paleontology's primary data will continue to be a major stumbling block for the field by limiting educational, research and funding opportunities. The best case is that, as new databases become available to the scientific community, they are self-describing, and can be immediately useful through expert, object technologies. The data content would be sufficiently standardized that meaningful synthesis of the data can be readily accomplished. Somewhere in between these positions, enterprising groups (e.g., the Ocean Drilling Stratigraphic Network [ODSN]) might build metadatabases that would warehouse information about many scientific projects without any standardization on the part of the individual databases. Similarly, partial solutions for exchange protocols will appear via adoption of externally created standards, such as those being developed for collections databases by the ASC and IUBS/TDWG.
Accessing Paleontological Databases
Several advances in computer technology have had direct impact on the possibility for data sharing for paleontologists. The first is the drop in price of all electronic storage media. Hard disk space is as low as twenty-five U.S. cents per megabyte. Optical disk drives, Jaz drives and other media reading and writing devices are becoming more affordable as well, allowing large datasets to be saved on a single disk. The second major advance has been and continues to be the rise and proliferation of computer networking. The network itself becomes a huge, sprawling megadatabase storing different kinds of data.
Although the advent of read/write optical media and CD-ROMs has been important for distributing large, graphics-rich programs, we do not foresee a long life for these items as a means to distribute or contextualize paleontological data. Part of the problem, particularly acute for CD-ROMs, is that once made, the medium is dead: it is difficult to edit and update effectively. Another problem is that information stored on magnetic or optical media still needs to reach an audience, which requires physically distributing the media.
The potential for using client-server networking software as a means of exchanging data is much more promising. Indeed, in the life sciences, databases such as GenBank are mostly accessed through the World Wide Web. Web browsers (e.g., Netscape, Internet Explorer) allow easy access to information, and the potential for exploring and using data and especially databases on the Web is just beginning to be realized. Database access to full collections catalogs, mediated either by the native database engine or by building block programs (e.g., the Delphi system), will allow customizable on-the-fly return of queries in HTML format.
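The on-the-fly return of query results as HTML can be sketched in a few lines. The function below is an illustrative assumption rather than part of any system named above: it takes column headers and result rows, such as those returned by a database query, and builds an HTML table, escaping each value so that characters like "&" or "<" in the data cannot corrupt the page.

```python
from html import escape

def rows_to_html(headers, rows):
    """Render query results as a minimal HTML table.

    Every header and cell is converted to text and HTML-escaped.
    """
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

# Hypothetical query result: specimen counts per taxon in one sample.
page = rows_to_html(["Taxon", "Specimens"], [("Taxon A", 42), ("Taxon B", 7)])
print(page)
```

A Web server would call such a function once per request, so the page always reflects the current contents of the database rather than a static export.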
The Internet, and the World Wide Web in particular, afford paleontologists the opportunity to meld together research and education by starting at a very general explanatory level and moving towards more specific, more research-oriented levels. If paleontology is to have an agenda in the 21st century, workers in the field must continue to bridge the gap between research and education as well as continue to share information with other workers and the public. We believe the greatest potential for achieving those goals lies in the use of network client-server programs.
Summary
One of the fundamental goals of paleontology, like any scientific discipline, is to present its data to members of the community and also to the larger scientific community and general public. Data sharing has traditionally been accomplished by peer-reviewed journals, and this will continue in both print and electronic formats. The flip side to presenting qualitative or quantitative data summaries and interpretations, however, is accessing the data themselves. Computers can help us provide access to our data and guide/facilitate their interpretation, as well as serving as a medium for the discussion of those data. All of these tasks are important.
At present, professional paleontologists in North America and Western Europe are able to make use of a wide variety of computer technologies to access and manage their data. Although most paleontologists are familiar with word processors and e-mail, and many have begun to explore the World Wide Web, the routine use of computers to solve the majority of day-to-day data access/management/quality control tasks remains the exception rather than the rule.
A primary opportunity, as well as a primary concern, for paleontologists with respect to the technical aspects of information technology (IT) is that such a large proportion of our data is encoded in the form of images. The assembly, transportation, reconstruction, and dissemination of image-based data are much more complex and computer-intensive than those of text-based data. This, in turn, will require that paleontologists become familiar with digital imaging technology.
Regardless of these considerations, databases and communications technologies are central to paleontology's future. Paleontologists will thus have to learn more about the design and use of relational database systems, and some of the more arcane aspects of communications technology (e.g., e-mail attachments, ftp, file translations) in order to be able to use and contribute to the electronic resources that will dominate the field in the coming century. The infrastructure problems to be overcome in making the transition to quantitative data analysis and electronic database storage/manipulation (especially in terms of images) are not formidable and there are well-established precedents that should speed the transition along. Many graduate-training programs in paleontology already offer a "methods" course in which topics such as print photography and computer-based data analysis techniques are taught. If such courses are not available, almost all graduate-level paleontologists obtain training in the various skills they will need to pursue their careers informally, either from their advisors or from fellow graduate students. It is to this system that the paleontological community will inevitably turn for training in electronic communications and imaging skills.
The most fundamental difficulty in the entire paleontological IT area arises, however, when one considers who is going to train the trainers. At the present time there is a marked disparity in computing skills that breaks down roughly along generational lines. While many younger paleontologists are more comfortable with and skilled in computer-related technologies than their older colleagues, it is not an exaggeration to say that comparatively few individuals currently possess the necessary skills to handle the wide spectrum of technical problems that accompany many routine electronic data management and information transfer tasks. This situation will improve as the demand for these skills increases. Nevertheless, the professional paleontological community must become more proactive in this area if a marked disparity between skill levels within different segments of the community is to be avoided.
We advocate the identification of a pool of researchers with experience in quantitative techniques who would be willing to serve as a teaching resource for quantitative paleontology. This group would be universally available to the paleontological community as teachers of quantitative paleontology. We recommend that a list of teachers/resource persons interested in being a part of an international paleontological IT teaching network be established and made available or distributed to paleontological associations in different countries or institutions. Through lectures, seminars, and workshops organized or advised by this group, the paleontological community can begin the process of systematically raising its own level of IT awareness and skills.
In many other scientific fields, recognition of the value of using computers to store, organize, provide access to, and retrieve data has led to the formation of a discipline-specific subfield of informatics (e.g., bioinformatics), which deals with these issues. Incorporation of a similar subfield (paleoinformatics?) into the corpus of paleontology is crucial if the advantages of computer-oriented data analysis and data management are to be realized in the coming century.
Computers, Quantification & Databases Delegates
Dr. Norman MacLeod--Topic Coordinator
Department of Palaeontology
The Natural History Museum
Cromwell Road, London, SW7 5BD
United Kingdom
N.MacLeod@nhm.ac.uk
44-171-938-9277 (FAX)
44-171-938-9006 (PHONE)
Mr. Patrick Diver
Amoco Corporation
P. O. Box 3092
Houston, TX 77253-3092
pldiver@amoco.com
281-366-7416 (FAX)
281-366-2291 (PHONE)
Mr. Robert Guralnick
Museum of Paleontology
1101 Valley Life Sciences Bldg.
University of California, Berkeley, CA 94720
robg@ucmp1.berkeley.edu
510-642-1822 (FAX)
510-642-1821 (PHONE)
Dr. David Lazarus
Institut fuer Palaeontologie
MUSEUM FUER NATURKUNDE
Zentralinstitut der Humboldt-Universitaet zu Berlin
Invalidenstrasse 43
D-10115 Berlin, Germany
lazarus@fub46.zedat.fu-berlin.de
david.lazarus@rz.hu-berlin.de
49 - 30 - 2093 - 8868 (FAX)
49 - 30 - 2093 - 8579 (o)
49 - 30 - 2093 - 8862 (dept.)
49 - 30 - 859 - 3884 (h)
Prof. Bjorn Malmgren
Department of Marine Geology
Earth Sciences Centre
University of Goeteborg, S-413 81
Goeteborg, Sweden
bjornm@gvc.gu.se
This page is maintained for the Paleo21 Organizing Committee by Norman MacLeod and H. Richard Lane. Corrections, inquiries about, and updates to any of the information shown above should be directed to Norm and/or Rich.