Biological Database Integration - New Approach
Author: Deron Eriksson
Description: Integration of Data from Heterogeneous Biological Databases using CORBA and XML - New Approach to Data Integration.
9.3. System Configuration
The data integration system and accompanying client application were primarily developed on personal computers running either the Windows NT or the Windows 2000 operating system. All system objects were coded in the Java programming language using JDK 1.3 (Java 2 SDK, Standard Edition Version 1.3.0), which was installed at C:\jdk1.3. The Java API documentation (Java 2 Platform, Standard Edition, v 1.3 API Specification) was installed at C:\jdk1.3\docs. If needed, C:\jdk1.3\bin was added to the System “Path” variable, which allows the javac and java commands to be executed without specifying their full paths.
The JSWDK 1.0.1 (JavaServer Web Development Kit 1.0.1) was installed at C:\jswdk-1.0.1. Directions for setting up a system to use the JSWDK could be found at C:\jswdk-1.0.1\README.html. A “CLASSPATH” System variable was created and then C:\jdk1.3\lib\tools.jar and C:\jswdk-1.0.1\lib\servlet.jar were added to the “CLASSPATH” variable.
The web server that is included with the JSWDK 1.0.1 can be started through the execution of the startserver.bat batch file located in C:\jswdk-1.0.1\. The default port of the web server is 8080. This port number can be changed if so desired.
For development work, servlets and all other Java objects were written and compiled in the C:\jswdk-1.0.1\examples\WEB-INF\servlets directory, which is the default servlet directory used by the web server. The source code for a servlet and its class file were kept together during development.
Servlets can be run from other directories. This requires modifications to the webserver.xml file located in C:\jswdk-1.0.1\. Such modifications are described in the JSWDK FAQ located in C:\jswdk-1.0.1\ and within the webserver.xml file itself.
After compiling a servlet to obtain its class file and starting the web server, a servlet could be executed by the server through a request from a web browser. For example, the RequestParamExample.class servlet located in C:\jswdk-1.0.1\examples\WEB-INF\servlets\ could be executed by providing the following URL in a web browser:
This request was performed locally, so localhost was specified (127.0.0.1 could be used in place of localhost). Port 8080 was specified since the web server was configured for this port number. During development and testing, servlets were called with both the Microsoft Internet Explorer 5 and Netscape Communicator 4.7 web browsers.
The client application was coded using Visual Basic 6. Microsoft Visual Studio 6 was installed on the development computers, and Visual Basic 6 was part of this installation. The client application was developed alongside the other project directories, in C:\jswdk-1.0.1\examples\WEB-INF\servlets\VBClient.
9.4. Experiences with the Project
The current design of the data integration system differs slightly from the original design. The original design of the federated data integration system is shown in Figure 9.8.
Figure 9.8: Original Design of the Federated Data Integration System
In the original system design, the client was a Java applet that communicated directly with the query processing object using CORBA. Since the client communicated with the query processing object via CORBA and only one system client was envisioned, the client was originally seen as being part of the data integration system itself. An applet client would allow the system to be remotely accessed by users with applet-compatible browsers. An attractive feature of using an applet is that applets are platform independent so that the client could run on different Java-compatible platforms. An additional feature of the original system design removed from the current version of the system is the integration system database that was to be part of the data integration system.
The first step in prototyping was to create a simple Java program that could perform a query of a biological database on the Internet and display the response. In effect, this Java program was a rudimentary database access object. The EMBL sequence database was the target database. This program took an accession number as a command-line argument and retrieved the corresponding EMBL flatfile sequence entry. The Java URL class was used to query the biological database via the EMBL web server in a CGI fashion, as in this Java code example:
URL u = new URL("http://www.ebi.ac.uk/cgi-bin/emblfetch?" + args[0]);
This prototype provided familiarity with the EMBL sequence database access methods and demonstrated a method of web database access using Java that would fit well into the envisioned data integration system. This initial prototype is shown in Figure 9.9.
Figure 9.9: Initial Prototype
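The following is a minimal sketch of what such a rudimentary access program could look like. It is not the original prototype code: the class name EmblFetch and the helper methods buildUrl and readAll are hypothetical, the sketch is written in current Java rather than JDK 1.3, and the emblfetch endpoint is taken from the line above and may no longer be served.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a command-line EMBL fetcher; names are illustrative.
final class EmblFetch {
    // CGI endpoint quoted in the text; assumed, and possibly no longer available.
    static final String BASE = "http://www.ebi.ac.uk/cgi-bin/emblfetch?";

    // Builds the CGI-style query URL for a single accession number.
    static String buildUrl(String accession) {
        return BASE + URLEncoder.encode(accession.trim(), StandardCharsets.UTF_8);
    }

    // Reads the whole response (the flatfile text) into one string.
    static String readAll(Reader in) {
        StringBuilder text = new StringBuilder();
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EmblFetch <accession-number>");
            return;
        }
        URL u = new URL(buildUrl(args[0]));
        try (Reader in = new InputStreamReader(u.openStream(), StandardCharsets.UTF_8)) {
            System.out.print(readAll(in));
        }
    }
}
```

Separating URL construction and response reading from the network call keeps both pieces easy to check without contacting the remote server.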
To add CORBA functionality to the prototype, a Java applet client was constructed that used CORBA to communicate with the database access object. The client could send an EMBL accession number query to the database access object using CORBA. The database access object could send this accession number query to the EMBL web server and retrieve the results as in the initial prototype. The database access object could then return the response to the applet. Different accession numbers could be entered in the applet, and results were displayed in a text area. This prototype is shown in Figure 9.10.
Figure 9.10: Prototype System
The JDK 1.3 ORB was used for CORBA communication between the client applet and the database access object. This communication utilized the tnameserv name server located in C:\jdk1.3\bin\, which needed to be started before communication could occur.
This prototype system using the client applet could be run locally on the development computer without significant difficulties. However, when attempting to run the system on other computers, various problems were experienced with the applet communication. For example, computers that had a version of the Java Development Kit prior to 1.3 required the downloading and installation of a Java Plug-in for communication to proceed correctly. Additional problems were experienced with the applet.
As a result of these various tests, it was decided that a Java servlet would be used to access the data integration system, providing a layer of abstraction that buffers clients from any direct interaction with CORBA. This servlet mechanism provides a straightforward way for client applications to communicate with the system, since they can communicate in the same fashion as talking to a web server by performing a GET or POST method in conjunction with a form. As a result, many different types of clients could access the system. For example, if the returned results were formatted and processed in a useful fashion, the system could be queried via a web form. In addition, Java applets could still access the system, and clients could also be written in other languages on other platforms; in order to communicate, a client application would just need to be able to talk to a web server. Clients don’t need to run within web browsers, since they can be standalone applications. This can allow clients to be significantly more powerful than applets running in web browsers, since a standalone client can perform file system tasks, such as saving files, that a browser’s security settings may not allow an applet to perform.
In order to demonstrate the possibility of creating client applications in different languages, a prototype client application was created using Visual Basic 6.0. Visual Basic is an excellent language for rapidly developing small applications with advanced functionality through the use of components. This rapid development capability was a key factor in Visual Basic’s selection as the development language for the client application.
The Java applet client was removed from the prototype system, and a Java servlet system access object was added to the system. Additionally, a rudimentary Visual Basic client application was created that would take an accession number from a user and display the corresponding EMBL flatfile as a result of a query of the system. The client application would take an accession number from the user and would send this data to the system access object via the GET method when a submit button was pressed. The system access object would in turn send this accession number to the data access object using CORBA. The data access object would submit the accession number to the EMBL web server as described above and retrieve the EMBL flatfile result, which it would pass back to the system access object, which would pass the flatfile back to the client application.
This prototype system is shown in Figure 9.11. In this diagram, the client is considered to be a client of the data integration system and is thus not considered part of the data integration system itself, since the servlet isolates the client application from the system, which internally uses CORBA for communication. Although the system uses the EMBL web server and the EMBL sequence database, these are not considered part of the system since the system will be a database federation, and a federation allows the underlying databases to remain autonomous from the federated system. They are outside entities crucial to the functionality of the federation but considered distinct from the integration system itself.
Figure 9.11: Prototype System with Servlet and Visual Basic Client Application
Through the use of this prototype, experience was gained with Java Servlets and Visual Basic. The web server that came with the JSWDK 1.0.1 was used, and it could be started via the startserver.bat batch file in C:\jswdk-1.0.1\. The system access servlet source code and class files were originally placed in the default servlet location for the web server (C:\jswdk-1.0.1\examples\WEB-INF\servlets\) while the data access object was placed in a separate location. However, for ease of use during development, all development files were placed in the default servlet directory with the servlet files.
For the Visual Basic client, a WebBrowser component was used for communication with the system access object. The WebBrowser component’s Navigate procedure allowed for easy use of the POST method to send the query string from the client to the servlet. The POST method is preferable to the GET method since the length of a GET query string can be limited by the environment variables of some platforms, so sending a long query string using the GET method could be unpredictable. This could be an issue if the system were expanded in the future to allow for many name/value pairs or long names or values.
Posting data to the web server and thus the system access object could be accomplished via a Navigate call such as:
Browser.Navigate txtURL.Text, , , PostValues, Header
Use of this procedure requires the PostValues string to be converted from Unicode to ASCII prior to the call to the procedure. The URL used for testing was http://localhost:8080/examples/servlet/BioServlet (the prototype system access servlet in this case is called BioServlet). An example of a PostValues query string used was:
The servlet allowed both the GET and POST methods to be used to submit query strings to the system. As a result, a query could also be performed in a web browser using a URL such as:
This example helps show the straightforward yet flexible system access provided by the servlet. Client applications can access the system, yet the system can also be accessed via the same mechanism by web forms or even by entering the servlet URL and query string into a web browser navigation box. Use of the servlet makes the system more flexible than it would be if clients were required to use CORBA to access it. It also improves the likelihood that the system can be accessed successfully on different platforms, since access relies only on the familiar GET and POST methods.
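To illustrate how little a client needs in order to talk to the servlet, the sketch below encodes name/value pairs once and uses the result either as a GET query string or as an ASCII POST body. The servlet URL is the test URL given above; the parameter names queryType and accession, the value emblAccession, and the class FormQuery are hypothetical, since the actual query strings used by the prototype are not reproduced here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative sketch only; parameter names used in main are hypothetical.
final class FormQuery {
    // Encodes name/value pairs as application/x-www-form-urlencoded text.
    static String encode(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> pair : params.entrySet()) {
            joined.add(URLEncoder.encode(pair.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(pair.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    // GET: the encoded pairs are appended to the servlet URL.
    static String getUrl(String servletUrl, Map<String, String> params) {
        return servletUrl + "?" + encode(params);
    }

    // POST: the same encoded pairs form the request body, sent as ASCII bytes.
    static byte[] postBody(Map<String, String> params) {
        return encode(params).getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("queryType", "emblAccession"); // hypothetical name and value
        params.put("accession", "X56734");        // hypothetical parameter name
        System.out.println(getUrl("http://localhost:8080/examples/servlet/BioServlet", params));
        System.out.println(postBody(params).length + " bytes in POST body");
    }
}
```

Because URL encoding leaves only ASCII characters in the output, the conversion to ASCII bytes for the POST body is lossless.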
In the next phase of system development, a query processing object was added to the prototype system. At this time, the query processing object simply forwarded accession number queries from the system access object to the EMBL database access object. CORBA was used for communication between the three system objects (the system access object, the query processing object, and the database access object). At this time, XML data representation was introduced into the system. In order to accomplish this, an EMBL parser was developed for the database access object that parsed EMBL flatfiles to an XML format. EMBL currently offers sequence files in two XML formats, AGAVE and BSML, but the parser constructed for this project creates an XML representation closer in structure and content to the original EMBL flatfile format.
Adding the query processing object to the system was an uncomplicated task. Creating an EMBL flatfile-to-XML parser in Java for the database access object required significantly more work but was not technically difficult. Adapting the communication between system objects to use XML was straightforward, since XML is a text-based standard and communication between objects already used strings. The XML text used for communication between objects was well-formed, but no DTD was used to standardize the communication strings. The addition of the query processing object to the prototype system is shown in Figure 9.12.
Figure 9.12: Query Processing Object Added to Prototype System
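The idea of a flatfile-to-XML parser that stays close to the flatfile structure can be sketched as follows. This is not the parser developed for the project: the element names entry and field, the code attribute, and the class EmblToXml are hypothetical, and the sketch simply maps each line's two-letter line code to one element, skips "XX" spacer lines and sequence data lines, and stops at the "//" terminator.

```java
// Hypothetical sketch of a line-oriented EMBL flatfile-to-XML conversion.
final class EmblToXml {
    // Escapes the characters that are special in XML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Converts one flatfile entry to a flat XML representation.
    static String toXml(String flatfile) {
        StringBuilder xml = new StringBuilder("<entry>\n");
        for (String line : flatfile.split("\r?\n")) {
            if (line.startsWith("//")) {
                break;                       // "//" terminates the entry
            }
            if (line.length() < 2 || line.startsWith("XX")) {
                continue;                    // blank or spacer line
            }
            String code = line.substring(0, 2);
            if (code.trim().isEmpty()) {
                continue;                    // sequence data lines start with blanks; skipped here
            }
            // The content of a line begins after the line code and padding.
            String value = line.length() > 5 ? line.substring(5).trim() : "";
            xml.append("  <field code=\"").append(escape(code)).append("\">")
               .append(escape(value)).append("</field>\n");
        }
        return xml.append("</entry>\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(toXml("ID   EXAMPLE\nXX\nDE   a <short> description\n//\n"));
    }
}
```

A fuller parser would group continuation lines and give reference blocks their own nested elements; the flat form above is only meant to show the line-code-driven approach.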
At this point in the project, it was time to attempt the primary goal, which was to add the capability to perform automated queries involving multiple databases. The EMBL parser provided the system with EMBL flatfile data parsed into a convenient XML format. It was decided to use reference numbers to medical abstracts in the PubMed database contained within the results of accession number EMBL queries as the basis for PubMed queries. As a result, submitting an accession number to the system could retrieve all PubMed abstracts referenced in the EMBL flatfile entry corresponding to that accession number.
The next step in this multidatabase data integration goal was to investigate PubMed database access techniques. The PubMed database could be accessed using the same CGI-based web server techniques used by the EMBL sequence database. Thus, the database could be queried at a URL with a query string such as the following:
A new database access object specific to the PubMed database was created with the capability to retrieve abstracts based on submission of citation Unique Identifiers (UIDs). This object was added to the system, and the system and client application were upgraded to allow for this new type of query in which a UID could be submitted and the corresponding PubMed abstract would be retrieved. At this point, single database queries could be performed on both the EMBL sequence database and the PubMed medical abstract database. The client would make a request of the system via the system access object, and this request would be passed on to the query processing object. Based on the query type submitted by the client, the query processing object would determine which database access object to contact with the appropriate query data. It would then obtain a result from that database access object and pass it on to the system access object, which would return the result to the client application.
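The routing step described above can be sketched as a lookup from query type to database access object. In this sketch, plain Java functions stand in for the CORBA database access objects, and the class QueryDispatcher and its method names are hypothetical; the prototype hard-coded these decisions rather than registering them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: functions stand in for CORBA database access objects.
final class QueryDispatcher {
    private final Map<String, Function<String, String>> accessors = new HashMap<>();

    // Associates a query type with the accessor that can answer it.
    void register(String queryType, Function<String, String> accessor) {
        accessors.put(queryType, accessor);
    }

    // Forwards the query data to the accessor chosen by query type.
    String process(String queryType, String queryData) {
        Function<String, String> accessor = accessors.get(queryType);
        if (accessor == null) {
            throw new IllegalArgumentException("unknown query type: " + queryType);
        }
        return accessor.apply(queryData);
    }

    public static void main(String[] args) {
        QueryDispatcher dispatcher = new QueryDispatcher();
        dispatcher.register("embl", accession -> "EMBL entry for " + accession);
        dispatcher.register("pubmed", uid -> "PubMed abstract for " + uid);
        System.out.println(dispatcher.process("pubmed", "12345"));
    }
}
```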
Following this, multidatabase query capability was added to the system. The client application could submit an EMBL accession number and retrieve all PubMed abstracts that corresponded to the EMBL flatfile specified by the accession number. This would be accomplished in the following manner. The client would submit a specific query type and an accession number to the system access object, which would convert the query data to an XML format which it would pass on to the query processing object. The query processing object would look at the query type and would ask the EMBL database access object to perform an accession number query and respond with the XML version of the corresponding EMBL flatfile. The query processor would extract the PubMed UID references from this XML string from the EMBL database access object and then would ask the PubMed database access object to perform queries to retrieve the abstracts corresponding to these UIDs. These abstracts would be sent back to the query processor which would return the abstracts to the system access object. The system access object would return the abstracts to the client application.
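The chained query described above can be sketched in a few lines. The sketch makes several assumptions: functions again stand in for the two CORBA database access objects, the tag name pubmedUid is hypothetical (the tag names of the project's XML format are not given here), and a regular expression is used for extraction purely for brevity where a real implementation would more likely use an XML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the EMBL-to-PubMed multidatabase query.
final class MultiDbQuery {
    // Assumed tag name; matches <pubmedUid>digits</pubmedUid>.
    private static final Pattern UID =
            Pattern.compile("<pubmedUid>\\s*(\\d+)\\s*</pubmedUid>");

    // Collects each distinct UID in order of first appearance.
    static List<String> extractUids(String emblXml) {
        List<String> uids = new ArrayList<>();
        Matcher m = UID.matcher(emblXml);
        while (m.find()) {
            if (!uids.contains(m.group(1))) {
                uids.add(m.group(1));
            }
        }
        return uids;
    }

    // Queries EMBL first, then PubMed once per referenced UID.
    static List<String> abstractsFor(String accession,
                                     Function<String, String> emblAccessor,
                                     Function<String, String> pubmedAccessor) {
        String emblXml = emblAccessor.apply(accession);
        List<String> abstracts = new ArrayList<>();
        for (String uid : extractUids(emblXml)) {
            abstracts.add(pubmedAccessor.apply(uid));
        }
        return abstracts;
    }

    public static void main(String[] args) {
        List<String> result = abstractsFor("EXAMPLE",
                acc -> "<entry><pubmedUid>111</pubmedUid><pubmedUid>222</pubmedUid></entry>",
                uid -> "abstract " + uid);
        System.out.println(result);
    }
}
```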
This data integration system was tested, and it successfully performed automated multidatabase queries in which data from EMBL accession number queries were used automatically to retrieve PubMed abstracts. This system demonstrates the primary goal set forth in the thesis: it allows clients to perform queries involving the integration of data from multiple biological databases. It does this in a manner that uses CORBA to communicate between system objects, and it uses XML to represent data in the system. This system is diagrammed in Figure 9.12.
This federated data integration system is similar to the original design of the system, although the Java servlet system access object was added as a way of adding flexibility for clients and of sparing clients any direct interaction with CORBA. As a result of this addition, the Java applet client was removed from the system, and a client application was written in Visual Basic to allow users to query the system. This client is represented outside of the system, since it does not use CORBA to communicate with the system, whereas system objects use CORBA to communicate with one another.
A notable absence from this data integration system is the internal integration system database in the original system design. This database originally was seen as being a storehouse for recording all queries and all results of these queries. It could store statistical data about the system that could be analyzed to optimize the system. Prototype work demonstrated the use of a JDBC driver to store data in an Oracle 8.1.7 database. However, it was decided that the use of an internal system database was beyond the scope of the current work. Statistical analysis of different configurations and usage patterns would be interesting, but the primary goal of this project was to demonstrate the multidatabase querying possibilities of the federated data integration system. This main goal required a significant amount of work combining several different technologies, and an internal system database was not necessary for the accomplishment of this goal.
9.5. Future Expansion and Work
The present federated data integration system could be enhanced in a number of ways. To begin with, the system has been designed so that additional databases can be added to the federation. This would enable a wider variety of queries to be performed involving multiple databases. Efforts could also be made to better handle error conditions. For example, if communication to a database failed, the system could be programmed to retry the query a certain number of times at some time intervals. Additionally, if atypical results are returned from a query, this should be able to be handled more elegantly by the system. As an example, if an error occurs parsing an EMBL sequence entry into an XML representation, an accurate description of the problem experienced should be returned to the client.
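The retry behavior suggested above could be sketched as follows. The class Retry and the method withRetries are hypothetical, and the sketch assumes that a failed database query surfaces as a runtime exception and that a fixed delay between attempts is acceptable.

```java
import java.util.function.Supplier;

// Hypothetical sketch of retrying a failed database query at fixed intervals.
final class Retry {
    // Runs the query up to maxAttempts times, sleeping delayMillis between failures.
    static <T> T withRetries(Supplier<T> query, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure in case every attempt fails
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException interrupted) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting to retry", interrupted);
                    }
                }
            }
        }
        throw last; // all attempts failed; report the most recent error
    }

    public static void main(String[] args) {
        String result = withRetries(() -> "flatfile text", 3, 1000L);
        System.out.println(result);
    }
}
```

Rethrowing the last failure, rather than returning an empty result, gives the system access object something concrete to report back to the client.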
Many improvements to the system could be brought about through the use of an internal system database as a cache. For example, queries and query results could be placed in this database so that later queries involving the same data could be quickly fetched from the internal database rather than requerying a biological database across the Internet. This would allow queries to proceed even if a biological database is down. However, the storage of query results needs to be handled carefully. One of the benefits of a data federation is its ability to perform data integration from constituent databases at runtime. Performing integration using old, out-of-date data would undermine this benefit.
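One way to balance caching against freshness is to give cached results a maximum age and to fall back on an expired result only when the remote database cannot be reached. The sketch below keeps the cache in memory for brevity, whereas the text envisions an internal database; the class ResultCache and its method names are hypothetical, and the current time is passed in explicitly so that the behavior is easy to verify.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory sketch of a query result cache with a maximum age.
final class ResultCache {
    private static final class Cached {
        final String result;
        final long storedAtMillis;

        Cached(String result, long storedAtMillis) {
            this.result = result;
            this.storedAtMillis = storedAtMillis;
        }
    }

    private final Map<String, Cached> entries = new HashMap<>();
    private final long maxAgeMillis;

    ResultCache(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    // Returns a fresh cached result if one exists; otherwise queries the remote
    // database, and uses a stale result only if that remote query fails.
    synchronized String fetch(String query, long nowMillis, Function<String, String> remote) {
        Cached cached = entries.get(query);
        if (cached != null && nowMillis - cached.storedAtMillis <= maxAgeMillis) {
            return cached.result;
        }
        try {
            String fresh = remote.apply(query);
            entries.put(query, new Cached(fresh, nowMillis));
            return fresh;
        } catch (RuntimeException databaseDown) {
            if (cached != null) {
                return cached.result; // stale, but better than no answer
            }
            throw databaseDown;
        }
    }

    public static void main(String[] args) {
        ResultCache cache = new ResultCache(60_000L);
        long now = System.currentTimeMillis();
        System.out.println(cache.fetch("example query", now, q -> "result for " + q));
    }
}
```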
An internal system database can also be employed to keep track of system statistics for usage analysis and performance tuning. It could hold data such as the length of time required to perform particular types of queries involving particular databases and the dates and times at which queries were performed. This data can be used to analyze and optimize the system. For instance, if the data indicates that particular types of queries require a large amount of time to perform, efforts can be made to optimize these queries. The system objects can be distributed across computers in different manners, and performance in different distributed configurations could be compared. The data could indicate factors such as usage patterns.
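The kind of timing data described above could be gathered with a small recorder such as the one sketched here. The class QueryStats is hypothetical and keeps its totals in memory; in the envisioned system these figures would be written to the internal database instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-query-type timing statistics.
final class QueryStats {
    // For each query type: index 0 holds the count, index 1 the total milliseconds.
    private final Map<String, long[]> totals = new HashMap<>();

    // Records the elapsed time of one completed query.
    synchronized void record(String queryType, long elapsedMillis) {
        long[] t = totals.computeIfAbsent(queryType, k -> new long[2]);
        t[0]++;
        t[1] += elapsedMillis;
    }

    // Average duration for a query type, or 0.0 if none has been recorded.
    synchronized double averageMillis(String queryType) {
        long[] t = totals.get(queryType);
        return t == null ? 0.0 : (double) t[1] / t[0];
    }

    public static void main(String[] args) {
        QueryStats stats = new QueryStats();
        long start = System.currentTimeMillis();
        // ... a query would be performed here ...
        stats.record("embl", System.currentTimeMillis() - start);
        System.out.println(stats.averageMillis("embl"));
    }
}
```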
The ability to handle multiple concurrent users would be an extremely important addition to this project. This could be readily accomplished in an efficient manner using the multithreading capabilities of Java, involving the maintenance of sessions and state information.
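As a rough illustration of the multithreading involved, the sketch below processes several queries concurrently on a fixed-size thread pool and returns the results in submission order. It uses java.util.concurrent, which postdates JDK 1.3; the class ConcurrentQueries is hypothetical, and session and state handling are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of handling several queries concurrently.
final class ConcurrentQueries {
    // Runs every query on a pool of worker threads; results keep submission order.
    static List<String> runAll(List<String> queries, Function<String, String> handler, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String query : queries) {
                pending.add(pool.submit(() -> handler.apply(query)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // waits for that query to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for queries", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> queries = new ArrayList<>();
        queries.add("first");
        queries.add("second");
        System.out.println(runAll(queries, q -> "result for " + q, 2));
    }
}
```

The handler passed in must itself be safe to call from several threads at once, which is the main design burden that concurrent users would place on the system objects.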
As mentioned, the system can be expanded by incorporating additional databases into the federation. In the current prototype, adding a new database to the system not only requires the addition of a database access object to interface with the database, but it also requires making changes to the implementations of the query processor, the system access servlet, and the client application. Adding a new database adds new queries to the system. Decisions and processing regarding queries are hard-coded into the system, so an object such as the query processor would require additional processing capabilities for the new queries. For a particular new query type, the query processor would have to know details such as what tags need to be examined in the strings it receives and what databases need to be contacted in order to perform the query. The system access object would also need to be modified to recognize the additional query type, and it would require the functionality to format a client’s query data into a tagged XML representation of that data and also to be able to translate the query responses of the system from XML into the representation desired by the client. If a client application is to take advantage of the new system functionality, it would also need to be modified to add the new querying capabilities into the application.
A mechanism could be devised to minimize the programming necessary to update the system to take into account queries involving new databases. As an alternative to hard-coding query-related functionality into the system, objects such as the query processor and the system access object could be updated via configuration files or even by configuration data stored in an internal system database. This updating could even be accomplished dynamically so that the system doesn’t need to be brought down in order to perform the update. For fault tolerance, mechanisms could be put in place so that redundancy is added to the system, since in the current prototype, failure of the system access object or query processing object prevents any queries from entering or results from leaving the system, and failure of a database accessor prevents queries from being issued to the accessor’s database.
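A configuration-driven approach might look like the sketch below, which reads per-query-type settings in the standard java.util.Properties format. The key layout (query.NAME.databases and query.NAME.tag), the query type name emblToPubmed, the tag name pubmedUid, and the class QueryConfig are all hypothetical and serve only to show how the databases to contact and the tag to examine could be moved out of the code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of query definitions loaded from configuration text.
final class QueryConfig {
    private final Properties props = new Properties();

    // Loads configuration written in java.util.Properties format.
    QueryConfig(String configText) {
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Databases to contact for a query type, in the order they should be queried.
    List<String> databasesFor(String queryType) {
        String value = props.getProperty("query." + queryType + ".databases");
        if (value == null) {
            throw new IllegalArgumentException("no databases configured for " + queryType);
        }
        return Arrays.asList(value.trim().split("\\s*,\\s*"));
    }

    // Tag to examine in the intermediate XML, or null if none is configured.
    String tagFor(String queryType) {
        return props.getProperty("query." + queryType + ".tag");
    }

    public static void main(String[] args) {
        QueryConfig config = new QueryConfig(
                "query.emblToPubmed.databases = embl, pubmed\n"
              + "query.emblToPubmed.tag = pubmedUid\n");
        System.out.println(config.databasesFor("emblToPubmed"));
        System.out.println(config.tagFor("emblToPubmed"));
    }
}
```

Reloading such a configuration at runtime would be one route to the dynamic updating mentioned above, since no system object would need to be recompiled.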
Expanding the system to involve many different types of queries involving numerous heterogeneous databases might require the addition of more data integration system objects. It may be beneficial to employ multiple query processing objects using a multi-tiered approach so that query processing subtasks can be divided among different objects.
As this chapter suggests, this work represents a starting point for the biological data integration system envisioned by the author. The prototype developed is a biological database federation that performs multidatabase queries using CORBA, XML, and Java Servlet technology. CORBA allows system objects to be written in different languages and allows the objects to be readily distributed on different hosts. XML provides a standard data representation that is hierarchical and readily parseable. Java Servlet technology provides clients with straightforward access to the data integration system in a manner that avoids any direct client interaction with CORBA. Continuing work to expand and improve the system will, in my opinion, provide a powerful tool that will make a significant contribution to the bioinformatics community.