Biological Database Integration - Introduction
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - Introduction.

2. Introduction

The emergence of bioinformatics is a phenomenon that promises to revolutionize people’s lives through breakthroughs such as facilitating the creation of new drugs to treat diseases. The field of bioinformatics represents the convergence of biological data and computer technology that is necessary for the acquisition, management, and analysis of large-scale biological data. Most definitions of bioinformatics are similar to that offered by [PAT, 1998]: “Bioinformatics is the use of computational techniques for the consolidation and analysis of experimental data in biology.”

The assembly of the human genome represents a significant achievement in bioinformatics. On June 26, 2000, the National Human Genome Research Institute and Celera Genomics announced that the 3.12 billion base pair first draft of the human genome had been completed [BOG1, 2000]. Dunham [DUN, 2000] states that, on the date of this announcement, 21.1% of the genome was in a high-quality finished form. This DNA sequencing and assembly effort required enormous computing power. Celera’s approach to genome assembly required five hundred million trillion comparisons, twenty thousand supercomputer CPU hours, management of 80 terabytes of data, and 64 gigabytes of shared memory [BOG1, 2000]. In addition to DNA sequence data from humans and many other species, other types of biological data are being generated at an exponential rate.

Many people have recognized the potential of bioinformatics, both in terms of financial rewards and making a significant difference in healthcare. “(In 2000,) venture capitalists invested more than $700 million in the field…, according to the tracking firm VentureOne. Dozens of schools, including UCLA, have started bioinformatics centers in the past few years, and tech giants are targeting it as one of the few growth sectors in today's economy” [STON, 2001]. As of October 2000, the number of biotech companies in the U.S. totaled more than 1,300 [MIL, 2000]. Projections for bioinformatics-related markets can be staggering. “An internal study commissioned by IBM, for instance, predicts that when the markets for high-performance computing, storage, and e-commerce combine with that of data management, the worldwide market for IT products and services in the life-sciences sector will swell to $43 billion by 2004” [LIC, 2001].

Drug discovery and development is an example of a process that can benefit significantly from the application of bioinformatics tools and methodologies. Sequencing of the human genome is important to this process: “The availability of the complete genome sequence reveals all potential targets for drug intervention and simultaneously provides the comprehensive reagent set (cells, proteins and genes) for drug and diagnostic discovery” [BOU, 2000]. The steps involved in drug discovery and development, described in [BOU, 2000], include target selection, lead discovery, and lead optimization, followed by preclinical and clinical testing.

According to [LIC, 2001], “On average, it takes $500 million and 14 years to go from discovery to government approval on a new drug. The process is highly inefficient: Nine out of ten compounds fail in human tests.” Applying knowledge gained through bioinformatics can significantly reduce the time and cost of drug targeting and development. “Biotech pioneer Human Genome Sciences in Rockville, Md., for example, credits its computer system for the rapid development of repifermin, a much-heralded protein that helps wounds heal and is now in clinical trials. William A. Haseltine, HGS's chief executive, says the IT system has shortened the 14-year drug-development process by four or five years, allowing HGS to get drugs into human clinical trials for one-tenth of the costs shouldered by large pharmaceutical companies” [LIC, 2001]. In another example, “Bioinformatics played a key role in one of Exelixis's main triumphs to date: the identification of that tumor suppressor gene in the fruit fly. The loss or deactivation of the human version of that gene, known as p53, is the single most common mutation in human cancer. Identifying the fly version would speed drug development by allowing experimentation on p53's function in cells” [SHE, 2000]. Genomics is expected to increase the number of drug targets from 500 to between 5,000 and 10,000 [DRU, 2001].

Today, hundreds of public biological databases are accessible via the Internet. However, taking advantage of data stored in heterogeneous biological databases can be difficult for many reasons, which are discussed later. A scientist may query a biological database, manually peruse the results of this query, find the data of interest, and use this data in a query of a second biological database. Performing manual multidatabase queries of this sort is time-consuming, especially when large amounts of data are involved, as is often the case in biological research. Therefore, a need exists for automated data integration from multiple databases.

Data warehouses and database federations are two approaches that have been developed to address this data integration problem. In a data warehouse, data from multiple data sources are combined into one large database that houses the data of the individual databases. Data integration, which includes data format conversions, is performed when the data warehouse is created, and queries are then performed on this integrated data source. In a database federation, software modules are created to interact with the databases included in the federation. Data integration is performed at run time, when a query is issued, since the data must be obtained on the fly from all the separate databases involved in the query. Both approaches have advantages and drawbacks that will be discussed later.
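The difference in when integration happens can be illustrated with a minimal Java sketch. This is not code from the system described in this thesis: the class name, the method names, and the use of in-memory maps as stand-ins for biological databases are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the thesis.
// Each "database" is an in-memory map from a gene identifier to one annotation.
final class IntegrationTiming {

    // Warehouse: copy and merge all sources once, when the warehouse is built.
    static Map<String, List<String>> buildWarehouse(List<Map<String, String>> sources) {
        Map<String, List<String>> warehouse = new HashMap<>();
        for (Map<String, String> source : sources) {
            for (Map.Entry<String, String> e : source.entrySet()) {
                warehouse.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                         .add(e.getValue());
            }
        }
        return warehouse;
    }

    // Federation: consult every source when the query arrives and merge on the fly.
    static List<String> federatedQuery(List<Map<String, String>> sources, String geneId) {
        List<String> merged = new ArrayList<>();
        for (Map<String, String> source : sources) {
            String value = source.get(geneId);
            if (value != null) {
                merged.add(value);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> sequenceDb = new HashMap<>();
        Map<String, String> functionDb = new HashMap<>();
        sequenceDb.put("geneA", "sequence: ATGGCC");
        functionDb.put("geneA", "function: kinase");
        List<Map<String, String>> sources = Arrays.asList(sequenceDb, functionDb);

        Map<String, List<String>> warehouse = buildWarehouse(sources);

        // One source is updated after the warehouse was built.
        functionDb.put("geneA", "function: tumor suppressor");

        System.out.println("warehouse:  " + warehouse.get("geneA"));
        System.out.println("federation: " + federatedQuery(sources, "geneA"));
    }
}
```

In this toy setting the warehouse answers from the copy made at build time, while the federated query reads each source at the moment of the query and therefore reflects the later update.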

Integrating data from biological databases can be extremely valuable in terms of scientific knowledge, human health advancement, and financial benefit, since processes such as drug discovery require the synthesis of data from many sources, often in combination with laboratory research. Automating this integration can speed up the discovery of new drugs and their introduction to market, which can decrease the sizable costs incurred during these processes and bring drugs to market years earlier. Earlier market entry can mean substantially greater sales, since a drug company holds exclusive rights to the sale of a drug only until the drug’s patent expires. The sick also benefit, since new drugs that may save their lives can be developed faster.

This thesis addresses the integration of data from heterogeneous biological databases and presents a solution that uses several recently emerged technologies. The solution involves the creation of a database federation that allows multidatabase queries to be performed via a client application. Client applications access the database federation system through a simple mechanism similar to submitting data via an HTML form. A Java servlet provides this system access. The use of servlet technology provides a layer of abstraction, so that clients do not need to use the distributed technology employed within the system.
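As a rough illustration of what form-style access means for a client, the sketch below encodes query parameters the way a browser encodes an HTML form submission (application/x-www-form-urlencoded). The class name and the parameter names `queryType` and `term` are hypothetical; the actual client/system interface is defined by the system's own specifications, not by this example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration; the parameter names below are assumptions,
// not the interface defined by the system described in the thesis.
final class FormStyleRequest {

    // Encode name/value pairs as application/x-www-form-urlencoded text,
    // the encoding a browser uses when it submits an HTML form.
    static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            body.add(URLEncoder.encode(p.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("queryType", "gene");
        params.put("term", "tumor suppressor p53");
        // Prints: queryType=gene&term=tumor+suppressor+p53
        System.out.println(encode(params));
    }
}
```

A client would send such a string to the servlet as the query string or body of an ordinary HTTP request, so it needs nothing beyond standard HTTP support.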

The data integration system uses CORBA objects implemented in the Java programming language. CORBA is a distributed computing architecture that allows objects written in different programming languages on different platforms to intercommunicate across a network. Objects in the data integration system can thus be located on a single computer or distributed across different computers. The use of CORBA allows for flexibility in object distribution and object programming languages. For example, an object could be written in a fast language such as C++ and placed on a powerful dedicated workstation if such an arrangement were desired.

Objects in the data integration federation system include database access objects, a query processing object, and a system access object. Clients access the system via the system access object. The query processing object takes query requests from the system access object, decides which databases need to be contacted to perform the queries, performs the queries and the data integration necessary to fulfill the client’s request, and returns the result. Each biological database has a database access object in the system that handles the specific translations necessary to query that database. If the techniques required to access a biological database change, only that database’s access object needs to be changed; the rest of the data integration system requires no modifications.
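The division of responsibilities among these three object roles can be sketched in plain Java. In the actual system these are CORBA objects; the sketch below deliberately omits CORBA, and every type name, method signature, and database name in it is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-ins for the three object roles. In the thesis these are
// CORBA objects; all names and signatures here are hypothetical.
final class FederationSketch {

    // One implementation per biological database; hides database-specific details.
    interface DatabaseAccessObject {
        boolean canAnswer(String queryType);
        String query(String term);
    }

    // Decides which databases to contact and gathers their results.
    static final class QueryProcessor {
        private final List<DatabaseAccessObject> databases;

        QueryProcessor(List<DatabaseAccessObject> databases) {
            this.databases = databases;
        }

        List<String> process(String queryType, String term) {
            List<String> results = new ArrayList<>();
            for (DatabaseAccessObject db : databases) {
                if (db.canAnswer(queryType)) {
                    results.add(db.query(term));
                }
            }
            return results;
        }
    }

    // The single entry point that clients talk to.
    static final class SystemAccess {
        private final QueryProcessor processor;

        SystemAccess(QueryProcessor processor) {
            this.processor = processor;
        }

        List<String> submit(String queryType, String term) {
            return processor.process(queryType, term);
        }
    }

    // Toy access object that answers exactly one kind of query.
    static DatabaseAccessObject toyDatabase(String name, String supportedType) {
        return new DatabaseAccessObject() {
            @Override
            public boolean canAnswer(String queryType) {
                return supportedType.equals(queryType);
            }

            @Override
            public String query(String term) {
                return name + " result for " + term;
            }
        };
    }

    public static void main(String[] args) {
        SystemAccess system = new SystemAccess(new QueryProcessor(Arrays.asList(
                toyDatabase("SequenceDB", "sequence"),
                toyDatabase("FunctionDB", "function"))));
        System.out.println(system.submit("sequence", "p53"));
    }
}
```

Because the query processor depends only on the interface, replacing the implementation behind one database leaves the other classes untouched, which mirrors the isolation property described above.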

Within the data integration system, communication takes place using Extensible Markup Language (XML). This provides standardization in the data representations within the system.
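To give a sense of what a standardized XML representation offers, the following sketch parses a small XML message with the standard Java XML APIs and extracts one field. The message layout (the `result`, `entry`, and `description` elements and the `database` and `id` attributes) is invented for this example and is not the format used by the system.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; the element and attribute names are invented
// and do not reproduce the message format used in the thesis.
final class XmlMessageSketch {

    static final String RESULT =
            "<result database=\"ExampleDB\">"
          + "<entry id=\"geneA\"><description>tumor suppressor</description></entry>"
          + "<entry id=\"geneB\"><description>kinase</description></entry>"
          + "</result>";

    // Return the description of the entry with the given id, or null if absent.
    static String descriptionOf(String xml, String entryId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList entries = doc.getElementsByTagName("entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                if (entryId.equals(entry.getAttribute("id"))) {
                    return entry.getElementsByTagName("description")
                            .item(0).getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            // Malformed XML or an entry without a description element.
            throw new IllegalArgumentException("could not read message", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(descriptionOf(RESULT, "geneA"));
    }
}
```

Any object that receives such a message can read it with a generic XML parser, without knowing which database produced the data.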

The data integration system is diagrammed in gray in Figure 2.1. A client is shown communicating with the system via the system access object. Query processing is handled by the query processing object, and database access objects are used to communicate with each biological database in the federation. Databases can be added to the federation; each added database requires a new database access object in the system.

Figure 2.1: The Database Federation Data Integration System

It is hoped that this work presents a useful method of biological data integration using a database federation. The system interface to clients is flexible in that it allows a client to access the system in a simple manner that insulates the client from the distributed computing technology within the system. Any client application can access the services provided by the data integration system as long as the application follows the client/system interface specifications. The system internals are also flexible in that objects within the system can be written in any programming language, and placed on any platform, for which a CORBA implementation is available. Because each biological database has its own database access object, a change in a database’s access techniques or in the format of its query results requires code modifications only in that database’s access object, not system-wide. The data in the system are formatted in XML, which provides a standard representation for data within the system. These technologies combine to create a novel biological data integration system that is powerful yet flexible.