Biological Database Integration - Data Integration Issues
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - Data Integration Issues Relating to Bioinformatics Databases.
7. Data Integration Issues Relating to Bioinformatics Databases
Biological experimentation and analysis often generate large quantities of data. Biological data are frequently most useful in conjunction with other biological data, so managing and integrating the data collected by scientists and researchers around the world can be very valuable. Doing so is exceedingly challenging for a variety of reasons: the quantity of biological data, the large number of biological databases, the rapid growth of biological data, the overabundance of data types and formats, the variety of data access techniques, database heterogeneity, errors in biological data, and even the interdisciplinary nature of the field of bioinformatics. These issues are explained in detail below.
7.1. Quantity of Biological Data
The first and most obvious source of difficulty in biological data management and integration is the vast quantity of biological data involved. The first draft of the human genome alone runs to 3.12 billion base pairs. “This data corresponds to 400 books which consist of 1000 pages with the four letters A (adenine), T (thymedine), G (guanine) and C (cytosine)” [HOF, 1995]. The primary DNA sequence database in the United States is GenBank. GenBank is growing exponentially and holds DNA sequence information from thousands of species in addition to humans. As of December 2000, GenBank contained over 13 billion nucleotides of sequence data from over 86,000 species [SHE, 2000]. A graph of GenBank’s growth can be seen at [GEN, 2001].
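As a rough check of the scale in the quotation above, the following sketch works out how many letters each page would have to hold and how much storage the draft sequence would need. The book and page counts come from the quotation; the two storage encodings (one character per base, and two bits per base) are illustrative assumptions, not figures from the cited sources.

```python
# Back-of-the-envelope sizing for the 3.12 billion base pair draft genome.
# Book and page counts are taken from the quotation; the encodings are assumptions.
BASE_PAIRS = 3_120_000_000
BOOKS = 400
PAGES_PER_BOOK = 1000

# Letters that each page must carry for the analogy to hold.
letters_per_page = BASE_PAIRS // (BOOKS * PAGES_PER_BOOK)

# Assumed plain-text storage: one byte (one ASCII character) per base.
bytes_as_text = BASE_PAIRS

# Assumed packed storage: a four-letter alphabet needs only two bits per base.
bytes_packed = BASE_PAIRS * 2 // 8

print(f"letters per page: {letters_per_page}")
print(f"plain text:       {bytes_as_text / 1e9:.2f} GB")
print(f"2-bit packed:     {bytes_packed / 1e6:.0f} MB")
```

Under these assumptions the raw sequence itself is modest by storage standards; the difficulty lies more in the surrounding annotation, trace data, and the many species and databases involved.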
To complicate matters, errors can creep into DNA sequences and other biological data from a variety of sources. For instance, in automated DNA sequencing, noise in the data can cause bases to be misidentified. Base identification in sequencing rests on probabilities, not absolutes, so some fraction of bases will be called incorrectly. One way to reduce such errors is to sequence the same region multiple times and compare the results. In general, biological data are noisy, and thus statistics often comes into play in their analysis.
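The idea of repeating a sequencing run to suppress miscalled bases can be sketched as a simple majority vote over several reads of the same region. This is only an illustration: it assumes the reads are already aligned and of equal length, the function names are hypothetical, and real pipelines weight each base by its quality. The `phred_quality` helper shows the widely used Phred convention for expressing an error probability as a score, which the text above does not itself mention.

```python
import math
from collections import Counter


def phred_quality(error_probability: float) -> float:
    """Phred-style quality score: Q = -10 * log10(P_error)."""
    return -10.0 * math.log10(error_probability)


def consensus(reads: list[str]) -> str:
    """Majority-vote consensus over equal-length, already aligned reads.

    Ties are broken in favour of the base seen first in that column.
    """
    if len({len(read) for read in reads}) != 1:
        raise ValueError("reads must be non-empty and of equal length")
    # zip(*reads) walks the reads column by column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


# Three hypothetical reads of the same region; the third has one miscalled base.
reads = ["ACGTAC", "ACGTAC", "ACCTAC"]
print(consensus(reads))
print(phred_quality(0.001))
```

With three reads, a single miscalled base at any position is outvoted by the other two, which is why repeated sequencing reduces the error rate without eliminating it.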
Butte [BUT, 2001] gives two examples of the immense quantities of data that can be generated through bioinformatics. “For example, ~15% of the US population will undergo medical imaging this year; data from computed tomography (CT), magnetic resonance imaging (MRI) and X-rays represent 200 million multigigabyte images. The raw trace files containing nucleotide sequence data for a single human being would occupy 300 terabytes, or 1800 billion terabytes for all humans.”
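The two trace-file figures in the quotation can be cross-checked with a line of arithmetic. The sketch below derives the world population implied by the quoted totals and converts the total to bytes; the decimal definition of a terabyte (10^12 bytes) is an assumption, since the quotation does not say which convention it uses.

```python
# Figures taken from the quotation above.
TB_PER_PERSON = 300
TOTAL_TB = 1_800 * 10**9  # "1800 billion terabytes"

# Population implied by dividing the total by the per-person figure.
implied_population = TOTAL_TB // TB_PER_PERSON

# Assumed decimal units: 1 TB = 10**12 bytes.
total_bytes = TOTAL_TB * 10**12

print(f"implied population: {implied_population:,}")
print(f"total bytes:        {total_bytes:.1e}")
```

The quoted total is therefore consistent with a world population of about six billion, which matches the period in which the estimate was made.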
7.2. Number of Biological Databases
In addition to databases growing in size to handle this influx of data, the number of biological databases is also increasing. The 2001 Molecular Biology Database Collection of the journal Nucleic Acids Research added 55 databases, bringing the collection's total of Internet-accessible biological databases to 281 [NAR, 2001]. As of June 2001, the DBcat database catalog contained 511 biology-related Internet databases, as seen in the following chart modified slightly from [DBC, 2001]:
Sansom and Smith [SAN, 2000] describe the three primary DNA sequence databases as follows: “The main databases containing general DNA sequences are EMBL, GenBank and DDBJ. The only real difference between these is their locations. The EMBL database is compiled and maintained at the European Bioinformatics Institute in Hinxton; GenBank is based in the USA, and DDBJ in Japan. Sequences sent to one database are indexed and distributed automatically to the others. Typically, a gene sequence sent to EMBL by a researcher in Europe will be released by GenBank and DDBJ the day after it appears in EMBL.”
Sansom and Smith [SAN, 2000] provide an excellent description of the primary protein databases: “Protein sequence databases usually contain many fewer sequences than gene sequence databases but tend to be better annotated with cross-links to many other databases. The best known of these are undoubtedly SwissProt, based at the University of Geneva, TrEMBL and GenPept. SwissProt contains a relatively small number (80,681 in early September 1999) of well-annotated protein sequences. A typical SwissProt entry contains, in addition to the sequence data, a description of the protein, notes of important residues, post-translational modifications and polymorphisms, and links to the original reference in Medline and to the appropriate entries in the main gene sequence databases. Where appropriate, the entry will also be linked to the three-dimensional structure data of the protein held in the Protein Data Bank or PDB; to information about associated diseases via on-line mendelian inheritance in man (OMIM) and, if it is an enzyme, to information about its catalytic action. The TrEMBL database is in SwissProt format and is prepared automatically from the protein coding regions of the EMBL DNA sequence database. A similar database, GenPept is produced by translating all coding regions of the DNA sequences in GenBank.”
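The automatic derivation of protein entries from DNA coding regions, as described for TrEMBL and GenPept, rests on translating codons into amino acids. The following is a minimal sketch of that step only, assuming the standard genetic code, an in-frame coding sequence, and unambiguous bases; it is not the actual pipeline used by either database, and the function name is hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence into one-letter amino acid codes.

    Translation stops at the first stop codon; a trailing partial codon is
    ignored. Ambiguous bases such as N raise KeyError in this sketch.
    """
    cds = cds.upper()
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE[cds[pos:pos + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


# Hypothetical coding region: start codon, one alanine codon, stop codon.
print(translate("ATGGCCTGA"))
```

Because the translation is purely mechanical, entries produced this way carry none of the manual annotation that distinguishes SwissProt, which is the contrast the quotation draws.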
The Medline database mentioned in [SAN, 2000] is a vital medical research resource. For example, it contains important gene information such as descriptions of protein functionality and diseases related to the expression of particular genes. “In the broadest sense, Medline (http://www.ncbi.nlm.nih.gov) represents a systematic storage of all discovered knowledge in biomedicine but that information is not easily accessible to automated processes. Dietrich Schuhmann (Lion Bioscience AG, Heidelberg, Germany) [BUT, 2001] estimated that there are 11 million abstracts in Medline, with between 70 and 100 million sentences and a vocabulary of 2.5 million different terms (more than the English language).”
Many other types of biological databases exist in addition to those described above. Some databases focus on a particular organism, such as the original AceDB database, which concentrated on the nematode Caenorhabditis elegans, while others focus on a particular type of research data, such as radiation hybrid mapping or protein-protein interactions. “As well as these sequence databases, Medline and other bibliographic and PDB structural data databases, which are also primary information sources, there are many, many other databases available. These include databases of pathways, enzymes, transcription factors, individual organisms protein motifs and many others” [SAL, 1998]. Sansom and Smith list many of the primary biological databases in [SAN, 2000].