Biological Database Integration - Data Integration Issues
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - Data Integration Issues Relating to Bioinformatics Databases.
(Continued from page 1)
7.3. Rapid Rate of Growth of Data
In addition to the sheer quantity of data, the rate at which biological data is produced presents problems for data management. As George Lake pointed out [BUT, 2001], “growth in bioinformatics data exceeded Moore’s Law, the well-known observation that the number of transistors on a chip doubles every 18 months.” Because migrating data to a new schema in a new DBMS is an arduous process that may be deemed too time-consuming or expensive, outdated data management systems are often patched or adapted to deal with unforeseen quantities of data. “Biological databases are now a central part of the research environment, but many have evolved simply as a by-product of a particular individual's research project, with no thought that they might one day become valuable international treasures. Consequently, some have not stood the test of time (most do not survive beyond the first five years). Others are creaking under the strain of information overload, their underlying technologies never having been designed to cope with such volumes of data” [ATT, 2000]. Bains [BAI, 1996] offers another example of a legacy software solution poorly suited to current bioinformatics needs: “The IntelliGenetics software is still based on the LISP-based AI language MainSail; this is a legacy of its past in the Stanford AI community, and a significant barrier to its development as an integrated part of a larger bioinformatics environment.”
The effects of the rapid rate of growth in the quantity of biological data can be seen in the data formats of the sequence databases. “The original format of the released sequence databases, simple linear text files, has been retained, and the format itself has not changed much over the years since their inception, mostly because of the phenomenal exponential size increase. The result of this is that the databases usually have to be processed into a more useful form, usually into some form of compressed and rapidly accessible index, at the analysis site. Historically, this requirement for sophisticated intervention at the user end has complicated the analysis task, and together with the physical size and rate of growth of the data, has meant that personal computers cannot cope with the databases directly” [SAL, 1998]. Lijnzaad et al. [LIJ, 1998] describe several problems with storing data as flat files: “(i) parsing can be non-trivial; (ii) formats are often ad hoc; (iii) data is not ‘live’; (iv) data can be redundant; (v) data model is often poor, and does not have ‘behaviour’; (vi) access and querying are often difficult.”
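The non-trivial parsing that Lijnzaad et al. complain about can be illustrated with a minimal sketch. The record below is a simplified, hypothetical EMBL-style entry invented for illustration, not the full EMBL format: each line begins with a two-letter code, sequence lines are indented and end with a position count, and `//` terminates the entry.

```python
def parse_embl_like(text):
    """Parse one simplified EMBL-style flat-file entry into a dict.

    Assumes a simplified layout: a two-letter line code in columns 1-2,
    content from column 6 on, indented sequence lines, and '//' at the end.
    """
    entry = {"id": None, "description": "", "sequence": ""}
    desc, seq = [], []
    for line in text.splitlines():
        code, rest = line[:2], line[5:]
        if code == "ID":
            entry["id"] = rest.split(";")[0].strip()
        elif code == "DE":
            desc.append(rest.strip())
        elif code == "  ":
            # Sequence lines: groups of bases followed by a numeric
            # position count, which isalpha() filters out.
            seq.extend(tok for tok in rest.split() if tok.isalpha())
        elif code == "//":
            break
    entry["description"] = " ".join(desc)
    entry["sequence"] = "".join(seq)
    return entry


# A hypothetical entry (not a real accession) in the simplified layout.
record = """ID   HSEXAMPLE; SV 1; linear; DNA; 24 BP.
DE   Hypothetical example entry.
SQ   Sequence 24 BP;
     gatcctcgag gatccgaatt cgat                                    24
//"""

parsed = parse_embl_like(record)
```

Even this toy parser must know about column positions, line codes, and embedded counts; real entries add feature tables, continuation lines, and cross-references, which is why flat-file parsing is usually delegated to purpose-built libraries.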
The quantity of data and rapid rate of growth can lead to database schema changes in an attempt to deal with the large quantities of data and projections for increased data in the future. Spiridou [SPI, 2000] states, “Molecular biology data is inherently complex and usually stored in heterogeneous and autonomous databases or other data sources with frequently evolving schemas.” Schema changes alter how data is stored and retrieved from databases, thus having wide-ranging effects on data integration efforts.
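The ripple effect of an evolving schema can be sketched concretely. The table and column names below are invented for illustration (using SQLite as a stand-in DBMS): once a column is renamed, every consumer that issued queries against the old name must be updated, which is exactly what makes frequently evolving schemas painful for integration code.

```python
import sqlite3

def fetch_sequences(conn, column):
    # The column name comes from trusted code in this sketch; never
    # interpolate untrusted input into SQL like this.
    return [row[0] for row in conn.execute(f"SELECT {column} FROM sequences")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (seq_text TEXT)")   # original schema
conn.execute("INSERT INTO sequences VALUES ('gattaca')")

before = fetch_sequences(conn, "seq_text")               # works today

# A schema evolution renames the column (requires SQLite 3.25+).
conn.execute("ALTER TABLE sequences RENAME COLUMN seq_text TO sequence")

# fetch_sequences(conn, "seq_text") would now raise
# sqlite3.OperationalError; every consumer must change with the schema.
after = fetch_sequences(conn, "sequence")
```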
Proteins, the primary building blocks and functional units within cells, will play an ever-larger role in bioinformatics, and protein data will far outpace nucleotide data in the coming years. “As it becomes apparent that the 30,000 human genes can't fully explain biological diversity, researchers are turning to proteomics for answers… There are many more proteins than genes, in part due to alternative splicing or post-translational modifications. The human proteome may contain millions of different proteins, most of them yet remain to be identified and characterized” [BOG2, 2001]. Because proteins perform functions, functional data often needs to be associated with them. Proteins obtain their functionality through their chemical properties, their three-dimensional structures, and various factors in their cellular environments, including their interactions with other proteins. As these examples illustrate, data associated with proteins can be copious and complex. Advanced protein-related analysis will require databases containing petabytes of data (thousands of terabytes) [LIC, 2001].
Enormous processing power is required for large-scale protein-related analysis. Celera Genomics recently joined forces with the Department of Energy and Compaq to create a 100-trillion-operations-per-second supercomputer for protein identification [LIC, 2001]. Boulnois [BOU, 2000] describes an example of the usefulness of the rapidly expanding protein databases: “In ‘structural genomics’, high-throughput approaches are applied to elucidate protein structure. We can expect to see rapid growth in the associated databases, allowing the comparison of the structure of novel proteins with an ever-increasing array of known protein structures from a range of species. This will be important information for probing structure–function relationships and, significantly, this information will be available at the initiation of drug-discovery projects.”
Different proteins sometimes have similar structure and function even though their sequences vary so significantly that sequence comparisons fail to identify the similarity. Kasif [KAS, 1999] describes two techniques – protein threading and alignments using structure prediction – that can be used to predict protein structure, thus allowing structural comparisons (rather than sequence comparisons) that may more accurately identify similar protein functionalities. Benton [BEN, 1996] offers an example of this: “For example, the human ob gene (mutations in which are associated with obesity and diabetes in mice) showed no sequence similarity to any gene or protein of known function. However, when its sequence was searched against a database of three-dimensional structures of core protein motifs [from the Brookhaven Protein DataBank (PDB)] to determine whether the ob protein could adopt a folding pattern similar to a known protein (a technique known as 'threading'), a significant similarity to structures from the helical cytokine family was found, leading to the suggestion that the ob gene product (leptin) utilized the Jak-STAT pathway (as do other helical cytokines) to regulate nuclear transcription. This hypothesis was subsequently verified by experiment.”
7.4. Overabundance of Data Types and Formats
A plethora of different types of biological analysis exist, and as a result, innumerable types of biological data are currently being generated. Even for data that is essentially the same, a widespread lack of standardization in biology has resulted in the creation of a wide variety of data types and data formats. Chung and Wong [CHU, 1999] have described these problems: “Biological data are inherently complex, ranging from plain-text nucleic acid and protein sequences, through the three-dimensional structures of therapeutic drugs and macromolecules, and high-resolution images of cells and tissues, to microarray-chip outputs. The data in various autonomous databases are organized in extremely heterogeneous formats, ranging from flatfile format (plain text files, binary files) through relational data models to highly nested data models such as ASN.1 [adopted by the US National Center for Biotechnology Information (NCBI)]. Moreover, the data structures are constantly evolving to reflect new research and technology development.”
Benton [BEN, 1996] argues that biological data complexity is also a result of the intrinsic nature of biologists and the field of biology. “The complexity of biological data is due both to the inherent diversity and complexity of the subject matter, and to the sociology of biology. Biological research is largely a 'cottage industry', with data generation being carried out in an intellectually idiosyncratic and geographically distributed mode. Where there are standard methods (some of which have been distilled into commercially available kits), these methods are often applied in novel ways and in novel combinations such that the overall experimental protocol is unique. From an information-science viewpoint, the result of this creativity is that, with few exceptions, biological experimental data are produced with neither standard semantics nor syntax. Thus, every biological specialty is overwhelmed by a vast quantity of complex data, or will be in the very near future.”
Data is represented in databases in different ways, and some of these differences result from the database technology used to store the data. EMBL data is essentially a linear list of entries, whereas the AceDB object-oriented database stores data as objects. Other databases, such as the Genome Database (GDB), store data in relational tables, and the RHdb (Radiation Hybrid database) uses an Oracle RDBMS [LIJ, 1998]. Commercial database packages from vendors such as Oracle, Sybase, Informix, and IBM can also influence data representation, since vendors often extend or modify standard SQL, leading to idiosyncratic methods for data access and manipulation. Differences in database package versions can also be significant. For instance, Oracle 7 is a relational DBMS whereas Oracle 8 is an object-relational DBMS, so data access based on objects in an Oracle 8 system might have to be significantly modified to view similar relational data in Oracle 7. Integrating different types of data, in different formats, in different schemas, from different databases is obviously a difficult problem. “One obstacle is the fact that biological DBs are built using a wide range of technologies, so interoperation is required across relational, object-oriented, flat-file and various ‘home brewed’ DB systems. A second obstacle is the fact that DBs use different ontologies of biological entities” [KAR, 1996]. An ontology is similar to a schema, but at a more abstract level.
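The relational-versus-object heterogeneity described above can be sketched in miniature. The two records below show the "same" gene as a flat relational-style row and as a nested object-style record; all field names, nesting, and values are invented for illustration and do not reflect the real GDB or AceDB schemas. Integration code must translate each representation into a common shape before the data can be combined.

```python
# Hypothetical flat, relational-style record (one row, named columns).
relational_row = {
    "gene_symbol": "OB",
    "chrom": "7",
    "map_pos": "7q31.3",
}

# The same gene as a hypothetical nested, object-style record.
object_record = {
    "Gene": {
        "Name": "OB",
        "Map": {"Chromosome": "7", "Band": "7q31.3"},
    },
}

def to_common_form(record, style):
    """Translate either representation into one integrated shape."""
    if style == "relational":
        return {"symbol": record["gene_symbol"],
                "chromosome": record["chrom"],
                "band": record["map_pos"]}
    if style == "object":
        gene = record["Gene"]
        return {"symbol": gene["Name"],
                "chromosome": gene["Map"]["Chromosome"],
                "band": gene["Map"]["Band"]}
    raise ValueError("unknown style: " + style)
```

A per-source translator like `to_common_form` is essentially a hand-written schema mapping; the ontology problem noted by [KAR, 1996] is that even after such structural translation, the two sources may mean subtly different things by "gene".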
7.5. Variety of Bioinformatics Data Access Techniques
There are several standard ways of accessing data in databases. In the case of private databases, ad hoc queries in a language such as SQL can be issued through a tool such as SQL*Plus; this technique is generally not used with public Internet-accessible databases. At a programming level, access to such databases might instead be supplied by a server-based database access object or module that mediates communication between a client and the database. This can help control database access and provide a simple interface to clients that lack detailed knowledge of the database. It also allows database changes, even an upgrade to a completely new DBMS and schema, while the same interface is still presented to clients, so that client code is not broken when changes occur.
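The server-based access object just described can be sketched as follows. The class and method names (`SequenceStore`, `find_sequence`, and the backend) are hypothetical; the point is only that clients code against a stable interface, so the backing store can be swapped without breaking client code.

```python
class SequenceStore:
    """Stable, client-facing interface to some sequence database."""

    def __init__(self, backend):
        # The backend could wrap a relational DBMS, an object database,
        # or an indexed flat file; clients never see which.
        self._backend = backend

    def find_sequence(self, accession):
        # All backend-specific query logic is hidden behind this call.
        return self._backend.fetch(accession)


class FlatFileBackend:
    """One possible backend: an in-memory stand-in for a flat-file index."""

    def __init__(self, records):
        self._records = records

    def fetch(self, accession):
        return self._records.get(accession)  # None if not found


# Clients depend only on SequenceStore.find_sequence(); replacing
# FlatFileBackend with, say, a relational backend leaves them untouched.
store = SequenceStore(FlatFileBackend({"X01234": "gattaca"}))
```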
With public databases, data submissions and queries can be performed via e-mail, with various parameters in the e-mail describing the nature of the submission or query. Manual entry of parameters in this fashion usually requires fairly significant knowledge of the database being queried and its query parameters, and it is error-prone and time-consuming. Command-prompt query tools, such as query programs and scripts (especially Perl scripts), are also available, and graphical query tools are becoming more and more popular. The most common way to access data in public biological databases is through web-based forms in HTML browsers. Such forms present query capabilities that are easier to comprehend than command-prompt tools: many default search parameters are entered automatically, and the available options are presented in pull-down menus. A form-based query is typically sent to a web server, which forwards the information to a server-based CGI program that actually invokes the query.
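The path from form fields to CGI program can be sketched by assembling the query string a form submission (via GET) would send. The URL and the parameter names below (`PROGRAM`, `DATALIB`, `SEQUENCE`) are illustrative, echoing the style of early BLAST servers rather than any exact current API.

```python
from urllib.parse import urlencode

def build_blast_query(base_url, program, database, sequence):
    """Assemble the URL a web form's GET submission would send to a CGI program."""
    params = {
        "PROGRAM": program,    # e.g. blastn, blastp
        "DATALIB": database,   # target sequence database
        "SEQUENCE": sequence,  # the query sequence itself
    }
    return base_url + "?" + urlencode(params)

# Hypothetical server URL, for illustration only.
url = build_blast_query("http://example.org/cgi-bin/blast",
                        "blastn", "embl", "GATTACA")
```

A browser form does exactly this encoding behind the scenes, which is why forms are friendlier than e-mail submission: the user never sees the parameter names or the encoding rules.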
However, even with web-based form queries, database access is generally not easy. Rather esoteric terms are used to describe the various search parameters, and it can be very tricky to figure out exactly what information is expected in the query. Interpretation of results is often challenging, since it can be hard to understand what exactly a query returned. Documentation is often poor, so working through these issues can be extremely time-consuming and frustrating. Brusic [BRU, 2000] states, “Lack of standards, ad hoc nomenclature, variable quality of source data, incomplete information, and biases in data repositories are all major obstacles in data interpretation. Furthermore, most database search tools require careful selection of parameters for search optimisation as the default settings are usually not optimal for a specific query. The selection of search parameters is currently more art than science and requires a good understanding of the domain as well as of specific issues relating to a particular query tool.” These problems are compounded when integration of information from multiple databases is desired, since each individual database presents its own version of these issues.
Examples of three common methods for performing sequence similarity searches using the BLAST algorithm can be found in [SAL, 1998]: queries via a web-based form, e-mail, and the UNIX command line.
(Continued on page 3)