These pages have been created to accompany an article entitled "Mining the Schistosome DNA sequence database" written by Guilherme Oliveira and David Johnston and published in Trends in Parasitology 17 (10), p 501-503 (October 2001)
These pages accompany an article entitled "Mining the Schistosome DNA sequence database" written by Guilherme Oliveira and David Johnston and published in Trends in Parasitology 17 (10), p 501-503 (October 2001 issue). They aim to provide examples of, and summary results from, data mining techniques being employed by the Schistosoma Genome Network.
Genome analysis, and functional genomics derived from basic genome data, provide fundamental information on organism biology in a rapid and cost-effective manner. Consequently, numerous research communities have initiated projects to sequence either whole genomes or specific components such as expressed genes (the transcriptome). GOLD, the Genomes Online Database, provides summary information on over 400 ongoing projects.
For many organisms, especially those with larger genomes and/or large amounts of repetitive DNA, gene discovery through expressed sequence tagging has been the initial priority. Expressed sequence tags (ESTs) are short, single-pass sequences produced from randomly selected cDNA clones, and they reflect the transcriptional activity of organisms or tissues 1. For parasitologists, EST analysis promises the discovery of new drug targets and new candidate vaccine antigens, as well as revealing the molecular mechanisms underlying parasite biochemistry, development, pathogenicity, diversity etc. 2.
Since Schistosoma possesses a large, highly repetitive genome, EST analysis provides the Schistosoma Genome Network (SGN) with a rapid and cost-effective way to produce "gene catalogues" for both S. mansoni (the main focus) and S. japonicum 3. With the support of various agencies, the SGN has generated, annotated and deposited more than 16,000 sequences in dbEST (the EST division of GenBank) 4,5.
Schistosoma EST analysis continues at various centres:
Thus far, public EST data have chiefly been exploited by keyword search of sequence annotations, or by homology search. However, the size of the Schistosoma EST dataset now permits its exploration in additional creative and informative ways. Some of these analyses require access to high-end computing and programming support, and thus are performed as a service for the community, but there are also increasing possibilities for individual analysis. These WWW pages provide examples of, and summary results from, some of the data mining techniques that are being employed by the Schistosoma Genome Network. Such techniques can, of course, be equally applied to any organism for which public genomic data exist.
Text searches and sequence comparison are probably the most common ways to query a database and allow the putative assignment of sequence-function relationships to query sequences. Many such database queries can be carried out via the WWW.
| Database annotation searches | Database homology searches |
In addition to the general databases, several schistosome-specific databases are available. These include:
Martin Aslett's Parasite Genome WWW site at the EBI
against
(the primary sequence database is updated monthly and, for anyone wanting to set up their own server, all databases are available for download from the EBI's FTP site).
The Institute for Genomic Research's S. mansoni Gene Index
Computational technology also allows more imaginative and complicated database searches that analyse gene expression in a wider context and generate testable hypotheses.
Microsatellite polymorphisms provide essential markers for genome sequencing, positional cloning, physical mapping and population analysis 6,7. The University of Washington's RepeatMasker WWW server allows large numbers of sequences to be scanned rapidly for microsatellite-like simple repeats.
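To give a feel for what scanning for microsatellite-like simple repeats involves, here is a minimal Python sketch. It is not how RepeatMasker works; it simply assumes that a simple repeat is a 1-6 bp motif repeated in tandem at least five times, and the function name and thresholds are hypothetical.

```python
import re

def find_simple_repeats(seq, min_unit=1, max_unit=6, min_repeats=5):
    """Return (start, motif, copies) for each perfect tandem repeat found.

    Assumption: only perfect repeats of A/C/G/T motifs are reported;
    dedicated tools also score imperfect and interrupted repeats.
    """
    seq = seq.upper()
    # Lazy quantifier: the shortest motif that satisfies the repeat wins,
    # so "CACACACACACA" is reported as motif "CA", not "CACA".
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits
```

Each hit gives the 0-based start position, the repeated motif and the number of tandem copies, which is enough to shortlist ESTs for closer inspection as candidate markers.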
Information on parasite biochemistry and genomics can be integrated through the use of in silico "metabolomics" 8, that is, the mapping of identified genes onto metabolic pathways.
This can facilitate the search for new drug targets by identifying:
ESTs related to metabolic function can also be classified by the life cycle stages in which they have been detected, and rough statistical comparisons of expression level can be made by comparing the frequency of their detection in different cDNA libraries (assuming that the libraries have been neither normalised nor pre-screened).
We have used a comprehensive list of 2546 enzymatic, metabolic and biochemical keywords to search the annotations of Schistosoma database sequences and classified the matches by the cDNA libraries from which the sequences were derived. The frequency of detection of a particular gene in a particular life cycle stage is then statistically compared to the proportion of the overall sequence dataset that is derived from that stage, to examine whether that gene is expressed at a higher than expected level in that stage. This is only a rough analysis, subject to many caveats, but it does suggest, for example, that fructose bisphosphate aldolase is expressed at higher than expected levels in cercariae, whilst ubiquinol, cytochrome-c-oxidase and glycogen synthase appear biased towards adult worms. These observations can then be correlated with known information on parasite metabolism and used to create testable hypotheses.
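The comparison described above can be sketched as a one-sided binomial test. The text does not say which statistic was actually used, so the choice of test, the function names and the argument names below are all assumptions made for illustration.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def stage_bias(gene_hits_in_stage, gene_hits_total, stage_ests, total_ests):
    """Compare a gene's detections in one stage with the stage's share of all ESTs.

    Returns (expected_hits, p_value). Assumes non-normalised libraries and
    independent clone picking, as the surrounding text cautions.
    """
    stage_fraction = stage_ests / total_ests
    expected_hits = stage_fraction * gene_hits_total
    p_value = binomial_upper_tail(gene_hits_in_stage, gene_hits_total, stage_fraction)
    return expected_hits, p_value
```

A small p-value indicates that the gene was detected in that stage more often than the stage's share of the dataset would predict. With thousands of keywords tested, some correction for multiple comparisons would also be needed before drawing conclusions.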
Since EST generation relies on random selection of cDNA clones and since libraries are rarely normalised, genes that are highly expressed are present many times in the library and so will be selected for EST sequencing over and over again. Cluster analysis groups together such homologous sequences and identifies the non-redundant sequence set 9.
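As a rough illustration of the idea, the following Python sketch groups sequences that share several identical k-mers, using single-linkage clustering with a union-find structure. This is a toy stand-in and not the method used by any of the clustering services mentioned here, which rely on alignment-based overlap detection; the function names and the `k` and `min_shared` thresholds are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def cluster_ests(ests, k=8, min_shared=3):
    """Group ESTs sharing at least min_shared k-mers (single linkage).

    ests: dict mapping EST name -> sequence.
    Assumption: all sequences are on the same strand; reverse
    complements are not considered in this sketch.
    """
    names = sorted(ests)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    profiles = {name: kmers(ests[name].upper(), k) for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if len(profiles[a] & profiles[b]) >= min_shared:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())
```

The number of clusters returned (including singletons) is an estimate of the size of the non-redundant set, and the size of each cluster gives a crude indication of how often that transcript was sampled.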
Several different cluster analyses of Schistosoma sequence data are available on the WWW, each with a different interface, data format and set of search options:
Once such data is available, a wide variety of secondary analyses can be performed including:
For the average user performing text searches, the value of the primary databases (GenBank / EMBL) is determined by the accuracy of sequence annotation. This can be assessed by comparing the annotation of sequences that belong to individual clusters 10.
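One simple way to make this comparison is to measure, for each cluster, how many member sequences carry the most common annotation. The Python sketch below assumes annotations are available as plain strings keyed by a cluster identifier; the function name and input layout are hypothetical.

```python
from collections import Counter

def annotation_agreement(cluster_annotations):
    """Report the most common annotation per cluster and its share.

    cluster_annotations: dict mapping cluster id -> list of annotation
    strings, one per member sequence. Comparison is case-insensitive
    and ignores surrounding whitespace; no synonym handling is attempted.
    """
    report = {}
    for cluster_id, annotations in cluster_annotations.items():
        if not annotations:
            continue  # nothing to compare in an empty cluster
        normalised = [a.strip().lower() for a in annotations]
        top, count = Counter(normalised).most_common(1)[0]
        report[cluster_id] = (top, count / len(normalised))
    return report
```

Clusters with a low agreement fraction are the ones where database annotations conflict and are worth checking by hand.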
The consensus sequence of a cluster is frequently longer and more accurate than the individual sequences within it. This facilitates identification through homology searching.
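The reason a consensus tends to be both longer and more accurate can be seen in a minimal majority-vote sketch. It assumes the cluster members have already been aligned and padded with gap characters; real assemblers also use base quality and handle insertions, which this hypothetical function does not.

```python
from collections import Counter

def majority_consensus(aligned, gap="-"):
    """Build a consensus by majority vote over pre-aligned sequences.

    aligned: list of equal-frame strings, padded with `gap` where a
    sequence does not cover a column. Columns with no coverage are skipped.
    """
    length = max(len(s) for s in aligned)
    consensus = []
    for i in range(length):
        # Collect the bases that actually cover this column.
        column = [s[i] for s in aligned if i < len(s) and s[i] != gap]
        if column:
            consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)
```

Overlapping reads extend the consensus beyond any single read, and a single-pass sequencing error in one read is outvoted wherever at least two other reads cover the same position.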
By comparing different clusters that return similar database homology results, it is possible to identify potential transcript families and alternative splicing events (this may depend on the clustering algorithm used and the stringency of assembly selected).
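A first pass at this comparison is simply to group clusters by their best database hit. The sketch below assumes each cluster has already been searched against a database and reduced to one top-hit identifier; the function name and input format are hypothetical.

```python
from collections import defaultdict

def candidate_families(top_hits):
    """Group clusters that share the same top database hit.

    top_hits: dict mapping cluster id -> identifier of the best hit,
    or None if the cluster had no significant hit.
    Returns only hits shared by two or more clusters.
    """
    by_hit = defaultdict(list)
    for cluster_id, hit in top_hits.items():
        if hit is not None:
            by_hit[hit].append(cluster_id)
    return {hit: sorted(ids) for hit, ids in by_hit.items() if len(ids) > 1}
```

Each group returned is only a candidate: the member clusters still need to be aligned against one another to distinguish gene family members or splice variants from clusters that were split by an over-stringent assembly.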