Biological Database Integration - Current Approaches
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - Current Approaches to Biological Data Integration.
8. Current Approaches to Biological Data Integration
Biological data are often most valuable when they can be integrated with other types of related biological data. For example, a DNA sequence is usually stored as a string of four repeating characters. In itself, this sequence is exactly that – a string of repeating characters. However, if this sequence is very similar to other sequences and it is known that these other sequences code for proteins of a particular function, then inferences as to the functionality of the given sequence can be made. It may then be possible to use this integrated information to apply a particular drug to a particular condition or to develop a new drug to treat a condition. Salter [SAL, 1998] reaffirms the value of integrated biological data: “Biological information is best approached from an integrated viewpoint. Thus, when looking for a gene, it is desirable to know what associated bibliographical, structural and functional information is also available. From a computing point of view, the essential task is to define and make search engines that collate and cross-reference the primary databases to make derivative indices that can then be searched rapidly.” Karp [KAR, 1996] gives three reasons why integrated data from different biological databases are more valuable than solitary database data. “The value of an integrated collection of molecular biology DBs is greater than the sum of the component DBs for various reasons:
One might ask, if biological data are so valuable when integrated, then why hasn’t one enormous database for all biological data been built? From a practical standpoint, it would be incredibly difficult to accomplish this. Benton [BEN, 1996] states: “… The expertise to build and populate biological databases does not reside in any single institution. Therefore, biological databases are built by different teams, in different locations, for different purposes, and using different data models and supporting database-management systems.” The resources required for a single integrated biological database would be tremendous, both in terms of labor and money. All the previously mentioned types of heterogeneity would need to be dealt with, and some of these issues, particularly semantic and data model heterogeneity, are very difficult to resolve. The other various data integration issues discussed would also have to be contended with on a truly massive scale. For instance, the DNA sequence data in GenBank are growing at an exponential rate, and DNA sequence data are only a small part of the big picture in bioinformatics. Protein analysis could involve petabytes of data. Analysis of biological data of this scope could require totally redesigned data management systems. A single uberbiodatabase could limit creativity in the biological sciences, since standardization could limit people to using certain data types and collecting certain types of data, when another new solution may be ideal for that particular type of research. There could be a natural tendency to form new databases that allow for creativity in research and data collection, which would once again act to prevent the formation of a single database holding all biological data. Currently, data warehouses and database federations are the two main methods of automated data integration from heterogeneous biological databases. 
Both approaches need to address the same integration issues, as Li and Clifton [LI, 2000] point out: “In either approach to database integration, the steps of database integration include extracting semantics, transforming formats, identifying attribute correspondence, resolving and modeling heterogeneity, multidatabase query processing, and data integration.” It should be mentioned that hypertext links are a popular method used to connect data in different biological databases. Hypertext links are used primarily for manual perusal of biological data. Hypertext link data from a web server could be captured and processed via a module or application so that data integration could be automated. However, hypertext links do present some problems when wide scale data integration across multiple databases is desired. Chung and Wong [CHU, 1999] state: “One of the simplest and most successful approaches to integrating biological databases is by connecting heterogeneous databases via hypertext links on the World Wide Web (WWW) and providing comprehensive indexing systems for query. Such a hypertext-navigation approach is adopted by the sequence-retrieval system SRS and by DBGET–LinkDB. These systems are convenient to use for simple operations but offer no facilities for the flexible transformation and integration of data derived from heterogeneous sources. Indeed navigating around the hypertext links is like flipping from one entry in an encyclopedia to another, with only one flip of the pages at a time. Complex queries that require selection and transformation of structural data across large numbers of heterogeneous databases are almost prohibitive for this type of hypertext navigation approach. It requires a significant amount of manual work for data integration.” Links can connect different types of data across different databases. Karp gives biological database examples of unification links and relationship links in [KAR, 1996]. 
“Unification links are defined as links that connect distributed slices of an object… For example, although EcoCyc describes in detail the catalytic activities of E. coli enzymes, as well as their cellular location, their amino acid sequences are held in SWISS-PROT, information about their expression is held in Eco2DBase, and their three-dimensional structures are held in PDB. Similarly, each EcoCyc gene entry is linked to the E. coli Genetic Stock Center (CGSC) DB entry for the same gene, which provides information about related E. coli strains and detailed citations concerning the gene.”
“Relationship links represent an inter-DB relationship between two objects. For example, EcoCyc enzymes contain relationship links to Medline, which holds literature relevant to the enzyme, and to PROSITE entries, which describe amino acid sequence patterns that are present in the enzyme. Relationship links can encode an extremely large variety of relationships, so this classification of unification versus relationship links may seem unequal. However, because unification links are extremely prevalent in biological DBs, it is important to highlight them.”
Insertions of data into biological databases may involve the manual or automated creation of hypertext links to related biological data. Hypertext links may be produced during the creation of data warehouses. Hypertext links should not really be seen as a competitor of data warehouses or database federations in automated data integration, since they usually are simply a method for manual inspection of related biological data.
8.1. Data Warehouses
In the data warehouse approach to data integration, data from individual databases are usually transformed into a common format and then stored in a large single database called a data warehouse.
Karp [KAR, 1996] defines a data warehouse in the following manner: “A data warehouse is a single DB that is constructed by physically consolidating a collection of data sources into an integrated whole… Thus, a warehouse transforms a set of heterogeneous DBs into a single, homogeneous DB, and data stored in this way can be queried using the standard query tools of the DBMS. The transformation is performed by a set of translators, each of which must convert a data source to a form that is compatible with the warehouse. This conversion process can be complex, because DB heterogeneity can manifest itself in a number of different respects.” The various types of heterogeneity and other issues must be confronted during this integration phase that occurs when the data warehouse is constructed. Data warehouses are typically read-only, since they draw their data from existing biological databases and may do various data transformations and integration steps that make it impossible to write changes to the base databases via the data warehouse. Writing data to the data warehouse itself would not make sense since the data warehouse is periodically updated from the base databases and any data written to the data warehouse would be lost when the data warehouse was updated. Lim and Chiang [LIM, 2000] describe monitoring systems and how data updates are transmitted to the data warehouse. “To construct a data warehouse, monitoring systems must be implemented on the local databases to actively detect changes to them, and to propagate the changes to the data warehouse. These changes are transformed into updates to the data warehouse by a data integrator subsystem.” Database features such as indexing can be added during data warehouse construction so that faster querying is possible. Construction may even involve no data format conversions, thus simply keeping the original data formats of the base databases. 
This greatly simplifies the warehouse construction process, but it maintains the heterogeneous nature of data formats from the constituent databases. This format heterogeneity would still need to be dealt with when performing data integration. An earlier example of semantic heterogeneity showed how the Entrez data warehouse contains two different types of DNA sequences from two different databases. This data has been physically placed into one data warehouse but hasn’t been integrated, possibly due to the difficult nature of the semantic heterogeneity involved. Although many authors refer to data warehouse creation as physical integration, some refer to it as a virtual integration. For instance, [LIM, 2000] states: “Database integration is performed whenever two or more databases have to be combined together either physically or virtually. Physical database integration requires the original databases to be discarded after the integrated database has been constructed and all existing application software to be migrated to the database systems operating the integrated database. Virtual database integration, on the other [hand], deploys a multidatabase or data warehousing system to support queries on an integrated view constructed upon the original databases. It retains both the original databases and its application software.” In this description, a physical integration refers to the creation of a new database that permanently integrates all information from base databases so that the base databases are no longer used. The new database would be read/write since data changes would be made directly to this database. Here, a virtual integration refers to the creation of a standard read-only data warehouse in which the base databases are kept and data updates are made to the base databases. This use of the terms ‘physical’ and ‘virtual’ is largely a question of semantics, but it can be a source of confusion for the unwary reader. 
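As a rough illustration of the translator-based construction that Karp describes, the following minimal sketch loads two hypothetical base databases into one SQLite table. The source layouts, the `sequence` table, and every function name are assumptions made for this example and do not come from any system discussed here. The sketch also shows why a warehouse is normally treated as read-only: a row written directly to the warehouse disappears at the next rebuild from the base databases.

```python
import sqlite3

# Two hypothetical base databases with different native layouts.
SOURCE_A = [{"acc": "A001", "seq": "ATGGCC", "organism": "E. coli"}]
SOURCE_B = [("B042", "Escherichia coli", "atggcctga")]


def translate_a(rec):
    """Translator for source A: dict layout -> common warehouse row."""
    return ("source_a", rec["acc"], rec["organism"], rec["seq"].upper())


def translate_b(rec):
    """Translator for source B: tuple layout -> common warehouse row."""
    acc, organism, seq = rec
    return ("source_b", acc, organism, seq.upper())


def build_warehouse(conn):
    """Rebuild the warehouse from the base databases, discarding old contents."""
    conn.execute("DROP TABLE IF EXISTS sequence")
    conn.execute(
        "CREATE TABLE sequence "
        "(source TEXT, accession TEXT, organism TEXT, residues TEXT)"
    )
    rows = [translate_a(r) for r in SOURCE_A] + [translate_b(r) for r in SOURCE_B]
    conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?)", rows)
    # An index added at construction time to speed up later queries.
    conn.execute("CREATE INDEX idx_sequence_accession ON sequence (accession)")
    conn.commit()


def warehouse_accessions(conn):
    """Return all accessions currently held in the warehouse, sorted."""
    cursor = conn.execute("SELECT accession FROM sequence ORDER BY accession")
    return [row[0] for row in cursor]


conn = sqlite3.connect(":memory:")
build_warehouse(conn)
print(warehouse_accessions(conn))
```

Note that the translators only unify the format. The organism names ("E. coli" versus "Escherichia coli") remain unreconciled, which is a small instance of data that has been physically placed in one warehouse without being semantically integrated.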
There are several advantages and disadvantages to data warehouses, and Karp [KAR, 1996] discusses many of these. “The advantages of physical integration are that queries can be executed rapidly because all the data are located in one place, and the end user sees a homogeneous, integrated data source. Its disadvantages are that the integration of updates and revisions from the data sources into the warehouse can be difficult and time consuming; by contrast, the distributed approach makes updates available instantly. In addition, as the number and size of biological DBs increase, the warehousing approach may not be able to keep pace because it becomes more and more time consuming to build large warehouses and to keep them up to date; in addition, the storage and computing resources required to maintain the warehouse become increasingly expensive.” An additional disadvantage of data warehouses is that they are expensive to develop. Li and Clifton [LI, 2000] state that the creation of a data warehouse’s global schema presents many difficulties. “Early work in heterogeneous database integration focused on procedures to merge individual schemas into a single global conceptual schema... The amount of knowledge required about local schemas, how to identify and resolve heterogeneity among the local schemas, and how changes to local schemas can be reflected by corresponding changes in the global schema are major problems with this approach because of the complexity of a global schema.” Data warehouses also require that mechanisms be put in place for updating and synchronization. Benton [BEN, 1996] discusses three difficulties with data integration that apply to data warehouses. These issues are also pertinent to database federations.
“… Integration efforts that depend on making local copies of data (possibly modified in a variety of ways) from other databases must contend with a number of factors that will make their task increasingly difficult, including: (1) a proliferation of independently administered biological databases containing relevant data; (2) the absence of a clear domain boundary for data of interest to the potential users of such integrated data products (implying that there is no obvious limit to the set of source databases that should be integrated); and (3) the accelerating rate of data production that will preclude manual intervention in the integration process (implying that all the information required to cross-reference data from two or more source databases must be present in, or derivable from, those databases).”
Several useful biological data warehouses exist. Among the major biological data warehouses are Entrez and the Sequence Retrieval System (SRS). Entrez [ENT, 2001] was developed at the National Center for Biotechnology Information (NCBI) and is “… a valuable resource that comprises a set of interlinked databases containing nucleotide and protein sequence information, 3D protein structure data, population study data sets, and assemblies of complete genomes that are linked to the scientific literature via PubMed” [CUR, 2000]. The neighboring concept in Entrez allows linking between entries in different databases even if they don’t explicitly cross-reference each other [SAL, 1998]. GHOST is an example of a bioinformatics search tool that uses Entrez to search for interspecies protein homologues [CUR, 2000]. A high-level diagram of Entrez can be found at [ENT2, 2001]. SRS [SRS, 2001] was developed at the European Molecular Biology Laboratory’s (EMBL) European Bioinformatics Institute (EBI). It allows linear databases to be indexed to other linear databases [SAL, 1998]. The complex interrelationships of the SRS databases can be found in [BEN, 1996].
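The contrast between linking through explicit cross-references and linking through a neighboring concept can be pictured with a toy sketch. The source does not describe how Entrez actually computes neighbors, so the k-mer Jaccard similarity used below is only a stand-in chosen for illustration. The two small databases, their entries, and all function names are likewise hypothetical.

```python
# Two hypothetical sequence databases. Entries in DB_ONE may carry explicit
# cross-references ("xrefs") to entries in DB_TWO.
DB_ONE = {
    "N1": {"seq": "ATGGCCATTG", "xrefs": ["P1"]},
    "N2": {"seq": "ATGGCCATTA", "xrefs": []},
}
DB_TWO = {
    "P1": {"seq": "ATGGCCATTG"},
    "P2": {"seq": "TTTTTTTTTT"},
}


def kmers(seq, k=3):
    """Return the set of overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def explicit_links(entry_id):
    """Links available only where a curator recorded a cross-reference."""
    return list(DB_ONE[entry_id]["xrefs"])


def neighbor_links(entry_id, threshold=0.5):
    """Links computed from content similarity; no cross-reference is needed."""
    query = kmers(DB_ONE[entry_id]["seq"])
    return sorted(
        other_id
        for other_id, other in DB_TWO.items()
        if jaccard(query, kmers(other["seq"])) >= threshold
    )


# N2 has no recorded cross-reference, yet similarity still relates it to P1.
print(explicit_links("N2"), neighbor_links("N2"))
```

In this sketch, an entry without any recorded cross-reference can still be related to an entry in the other database, at the cost of choosing a similarity measure and a threshold.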
In [SAL, 1998], Salter compares SRS to Entrez and mentions advantages and disadvantages of each system. SRS allows virtually any linear database to be included in the data warehouse, but linking requires explicit cross-referencing. Hyperlinking is extensive in SRS and is more limited in Entrez. However, Entrez’s neighboring concept doesn’t require cross-referencing of links, and Entrez can display results in different ways such as graphical formats. Entrez is limited in its ability to incorporate local site databases. Although no longer in service, the Integrated Genome Database (IGD) is another example of a biological data warehouse. A description of IGD is provided in [CHU, 1999]: “IGD essentially adopts the data-warehousing approach to integrate the 20 or so major data sources in the domain of genome projects and molecular biology. It has a global schema that is very different from those of the underlying data sources. For each data source to be integrated, IGD has to convert the source data format into its global data model and store the converted data in a centralized local database. IGD provides a global schema, a popular GUI and the ACEDB data management and query facilities.” Chung and Wong [CHU, 1999] mention two disadvantages associated with IGD, which are common data integration issues. “First, the need to store the integrated data locally limits the number of data sources that can be integrated. As the number of integrated data sources increase, it is likely to push ACEDB beyond its design limits of size and performance. Second, the cost of maintaining the system is high; it is extremely difficult to adjust the global schema when new data sources are added or old data sources are removed or evolved.”