Integrated Information Retrieval

Cold Spring Harbor Laboratory Press:
Selected Chapter

Genome Analysis: A Laboratory Manual Series Volume 1, Analyzing DNA
CHAPTER 7, Computational Analysis of DNA and Protein Sequences

NOTE: This is an old document. Links may no longer function correctly.
Some tables are available, but figures are not.

The ability to traverse different information spaces is well demonstrated by the World Wide Web and its hyperlinks (Figure 5), but tracking entries across different databases in a stable, rigorous way requires a different type of infrastructure. One example of integrated information retrieval within the field of molecular biology is Entrez. Using Entrez, users can search nucleotide, protein, structure, and genome databases, as well as the Genetics subset of the MEDLINE bibliographic database, all by issuing a single query. A program that allows such travel between databases is quite logical: all of these data are interrelated in the sense that protein sequences are derived from nucleotide sequences, structures are derived from isolated proteins of known sequence, papers are written on protein purification and gene mapping, and so on. Without becoming overly technical, what makes this possible in Entrez is that all data elements from each of the constituent databases are converted into a single format specified in Abstract Syntax Notation One (ASN.1), in which equivalent elements (e.g., a bibliographic reference) are structured in the same way. Connections are then made between the different databases to allow users to traverse this information space.
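The idea of giving equivalent elements one shared structure can be illustrated with a small Python sketch. This is not ASN.1 and not the Entrez data model: the Citation fields, the two source layouts, and the converter function names are all assumptions made up for illustration. It shows only that once two differently laid-out references are converted to one structure, they can be compared and connected directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Common structure for a bibliographic reference (illustrative fields only)."""
    authors: tuple
    title: str
    journal: str
    year: int


def from_sequence_record(raw):
    """Convert a reference as it might appear inside a sequence entry (hypothetical layout)."""
    return Citation(
        authors=tuple(name.strip() for name in raw["AUTHORS"].split(";")),
        title=raw["TITLE"].strip(),
        journal=raw["JOURNAL"].strip(),
        year=int(raw["YEAR"]),
    )


def from_bibliographic_record(raw):
    """Convert a reference as it might appear in a literature database (hypothetical layout)."""
    return Citation(
        authors=tuple(raw["au"]),
        title=raw["ti"].strip(),
        journal=raw["jt"].strip(),
        year=int(raw["dp"]),
    )


# The same made-up paper, described in two different source layouts.
SEQUENCE_STYLE = {
    "AUTHORS": "Author A.; Author B.",
    "TITLE": "An example title",
    "JOURNAL": "Example Journal",
    "YEAR": "1994",
}
BIBLIOGRAPHIC_STYLE = {
    "au": ["Author A.", "Author B."],
    "ti": "An example title",
    "jt": "Example Journal",
    "dp": 1994,
}

CITATION_A = from_sequence_record(SEQUENCE_STYLE)
CITATION_B = from_bibliographic_record(BIBLIOGRAPHIC_STYLE)
print(CITATION_A == CITATION_B)
```

Because both converters produce the same Citation structure, the two records compare equal even though their source layouts differ, which is the property that makes cross-database connections straightforward to establish.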

Another integrated information retrieval system that has stood the test of time is ACeDB, initially developed for Unix systems but now available on the Macintosh platform and through Web browsers acting as a graphical front end. The power of ACeDB lies in its simplicity and ease of adaptation to many database models, making it the tool of choice for many popular organism- or chromosome-specific databases. Unfortunately, this strength is also a weakness: each database using ACeDB is structured differently, which makes it more difficult to integrate information from various sources. Nonetheless, the popularity and power of ACeDB warrant further discussion.

Entrez

Provided with a "significant" match identified by a database similarity search, what does one do next? The information in the BLAST output (or the output of any other database search program) is necessary but seldom sufficient to evaluate the full significance and implications of a sequence match. One then needs to study the full database record(s) as well as references cited therein and other relevant literature. One might also want to retrieve some of the matching sequences and perform additional database searches to confirm the results.

As already alluded to, Entrez integrates these tasks so that a user can search on a single database, find all of the relevant information for that query within that database, and then move on to related information in all of the remaining databases without having to start another search. The ease with which users can jump from one database to the next allows for a tremendous amount of information to be found in a fraction of the time that it would have taken to search each constituent database separately. Entrez can also be used to link information to documents outside the databases; for example, in the electronic version of this chapter, each reference is linked to its corresponding MEDLINE entry.

How are the interconnections made within databases? Entrez uses a procedure called neighboring to retrieve sets of related entries. Neighboring allows the user to ask, "What papers are similar to a given paper?" or "What sequences are similar to a given sequence?" Within the sequence databases, neighbors are determined by comparing each sequence against all others using BLAST. For the MEDLINE subset, each entry is compared against all others through the use of weighted key terms, looking for the occurrence of words or phrases in the titles, keywords, and abstracts of other papers (Wilbur & Coffee, 1994). All neighbors are pre-computed, thereby substantially improving the speed with which related entries can be located and returned to the user.
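To make the idea of pre-computed literature neighbors concrete, here is a minimal Python sketch. It is not the method of Wilbur & Coffee (1994) and not the Entrez implementation: it swaps in a plain TF-IDF weighting with cosine similarity as a stand-in for "weighted key terms", and the four abstracts, the function names, and the top_k parameter are all assumptions made up for illustration. What it does show is the pattern described above: every entry is compared against all others once, ahead of time, and the results are stored in a table.

```python
import math
from collections import Counter


def term_weights(docs):
    """Return one {term: weight} dict per document, using a simple TF-IDF weighting."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def precompute_neighbors(docs, top_k=2):
    """Compare every document against all others once and keep the best matches."""
    weights = term_weights(docs)
    table = {}
    for i, weights_i in enumerate(weights):
        scored = [(cosine(weights_i, weights_j), j)
                  for j, weights_j in enumerate(weights) if j != i]
        # Keep only documents that share at least one weighted term.
        scored = [(score, j) for score, j in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        table[i] = [j for _, j in scored[:top_k]]
    return table


# Made-up abstracts standing in for MEDLINE entries.
ABSTRACTS = [
    "brca1 zinc finger protein regulates transcription",
    "estrogen responsive zinc finger protein transcription",
    "superoxide dismutase mutations in amyotrophic lateral sclerosis",
    "crystal structure of superoxide dismutase",
]

NEIGHBORS = precompute_neighbors(ABSTRACTS)
print(NEIGHBORS[0])  # indices of the abstracts most similar to abstract 0
```

The all-against-all comparison is done once, when the table is built. Answering "what is similar to this entry?" afterwards is a single dictionary lookup, which is why pre-computation makes retrieval fast.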

Connections between databases are made through specific links called hard links. For example, a paper on BRCA1 found in MEDLINE may contain the nucleotide sequence for the BRCA1 gene. If so, a hard link is established between the MEDLINE entry and the related entry in the nucleotide database. All hard links are reciprocal, meaning that the user can move between databases in either direction. As already alluded to in the introduction to this section, hard links are established wherever there is a logical connection between entries in different databases. Two examples of how both neighboring and hard links can be used to traverse a "biological discovery space" are illustrated below.
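The reciprocal nature of hard links can be sketched as a small Python data structure. This is an illustration under stated assumptions, not how Entrez stores its links: the HardLinks class, its method names, and the identifiers MED-EXAMPLE-1 and NUC-EXAMPLE-1 are hypothetical. The point is only that recording each link in both directions lets a user start from either database.

```python
from collections import defaultdict


class HardLinks:
    """Store cross-database links so that each one can be followed in both directions."""

    def __init__(self):
        # Maps (database, identifier) to the set of entries linked to it.
        self._links = defaultdict(set)

    def add(self, entry_a, entry_b):
        """Link two entries, each given as a (database, identifier) pair."""
        if entry_a[0] == entry_b[0]:
            raise ValueError("hard links connect entries in different databases")
        # Record the link twice so that it is reciprocal.
        self._links[entry_a].add(entry_b)
        self._links[entry_b].add(entry_a)

    def linked(self, entry, target_db):
        """Return identifiers in target_db that are linked to the given entry."""
        return sorted(ident for db, ident in self._links.get(entry, ())
                      if db == target_db)


# Hypothetical identifiers: a paper that reports a nucleotide sequence.
PAPER = ("medline", "MED-EXAMPLE-1")
SEQUENCE = ("nucleotide", "NUC-EXAMPLE-1")

LINKS = HardLinks()
LINKS.add(PAPER, SEQUENCE)

print(LINKS.linked(PAPER, "nucleotide"))   # from the paper to its sequence
print(LINKS.linked(SEQUENCE, "medline"))   # and from the sequence back to the paper
```

A single call to add is enough to make the connection usable from both sides, which mirrors the statement that all hard links are reciprocal.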

Example 5. Recall the database search result obtained with the BRCA1 exon in Example 1. The BLAST output (Figure 1b) shows a significant similarity to a sequence called rpt-1. Entrez can now be used to determine the function of this protein. The accession number (P15533) is entered into the Term: query box; the buttons labeled Accept and Retrieve 1 Document are then pressed in sequence. This leads the user to the database record for this protein (Figure 6a). The database record indicates that rpt-1 is a nuclear protein that contains a C3HC4-class zinc finger domain and regulates gene expression.

Entrez can now be used to determine whether this protein is homologous to other proteins in the databases. To do this, the user backtracks to the Document window and checks off the small box to the left of the aa icon. This instructs Entrez to find all sequences related to the one just selected. In this case, 74 homologues, or sequence neighbors, were found, some of which are shown in Figure 6b. The protein named "estrogen-responsive finger protein" appears to be of interest in the context of the original search on rpt-1. To obtain the MEDLINE record for this estrogen-responsive finger protein, the user would select MEDLINE from the pop-up menu at the bottom of the Document window and click the box to the left of the aa icon for this entry. Pressing the Lookup 1 button at the bottom of the Document window produces a new window with a single MEDLINE entry listed (Figure 6c). The associated abstract for this entry indicates that the estrogen-responsive finger protein may mediate estrogen effects at the transcriptional level in certain cells isolated from mammary glands. This example illustrates how, through the use of Entrez, putative relationships between different proteins and biological functions can be established.

In addition to pre-computing sequence homologies, Entrez performs a similar operation on MEDLINE records, creating literature neighbors, i.e., articles that are related to each other based on their frequency of use of significant terms. From the retrieved MEDLINE record in Figure 6c, one could continue to branch out in the database and find other relevant publications. By taking advantage of these pre-computed neighbors, users can readily assemble an entire bibliography on a new subject through just a few clicks and keystrokes. Other examples of Entrez neighboring may be found in Cockerill (1994) and Harper (1994).

Example 6. In addition to moving between sequence and MEDLINE entries, Entrez can also be used to find related three-dimensional structural information. As an example, consider the case of disease genes that have been isolated through positional cloning (Collins, 1995), whereby a previously identified sequence is known to be linked to a disease locus; this sequence is then tested for mutations that segregate with the disease phenotype. Such was the case for the Cu/Zn superoxide dismutase gene (SOD1) and the autosomal dominant form of amyotrophic lateral sclerosis (Table 7; Rosen et al., 1993). A variant strategy is one where there is a strong indication as to the biochemical basis for a given phenotype, where the sequence of a relevant enzyme is known but the location of its gene is not. In this case, identification of the gene occurs through a two-step process in which linkage must first be demonstrated, followed by the search for mutations. One of the genes for hereditary non-polyposis colon cancer (HNPCC), hMLH1 (Table 7), was cloned in this way (Bronner et al., 1994; Papadopoulos et al., 1994). Experiments had suggested that the pathophysiology of HNPCC might be due to a defect in DNA mismatch repair, and genes encoding mismatch repair enzymes in bacteria (mutL) and yeast (MLH1) were already known. Isolation of the human homolog, mapping, and mutation detection ensued.

It should be apparent how Entrez can be useful in assessing candidate genes and functions via sequence data and MEDLINE literature. A user can start with a biochemical function or an E.C. number and quickly find all relevant sequences and published articles. Returning to the original Query window, a query on human sod1 against the nucleotide databases returns six entries (Figure 7a). To determine whether any structural information is available, the user would click the check-box next to the record of interest and then change the target database to Structure. In addition to atomic coordinates and textual descriptions of structures (Figure 7b), Entrez can interface with both the RasMol (Sayle, 1994; Figure 7c) and Kinemage (Richardson and Richardson, 1992; Figure 7d) programs to view and manipulate the structures graphically. The easy availability of three-dimensional structural information is of great value in the design of experiments intended to elucidate structure-function relationships (Rosen et al., 1993). By the time of this printing, the Entrez client will have its own built-in graphical viewer, named CN3D.

Of worms and flies...

In addition to Entrez, there are specialized datasets (often organism-specific) which also represent integrated information systems. These are often available on the World Wide Web, although they are also sometimes distributed on CD-ROM or are available by anonymous FTP. Two examples of these specialized databases are presented below (Caenorhabditis elegans and Drosophila melanogaster). Additional databases, such as the SWISS-PROT annotated sequence database overseen by Amos Bairoch at the University of Geneva, are presented in Table 4.

Caenorhabditis elegans

The C. elegans community was one of the first groups to integrate and make available its genetic and physical map information through a graphical user interface. This was possible because of the pioneering work of Jean Thierry-Mieg and Richard Durbin and the great cooperation which exists among the members of the worm community. The software package which was built to disseminate this information was ACeDB, "a C. elegans database". ACeDB is widely used by worm biologists and has been "cloned" as a database management tool for many non-worm genome projects. Hence, a reference to "ACeDB" the tool does not automatically refer to C. elegans data. The ACeDB model has been used for yeast (SacchDB), Arabidopsis (AtDB), rice, and many others.

In its native form, ACeDB runs as a UNIX application, presenting the user with a graphical interface through which they can move between genetic maps, physical maps, DNA sequences, clone grids, bibliographic information, Southern blot GIF files, genetic cross data, and the like. A version of ACeDB is now available for the Macintosh, and a similar text-based form (tace) is available for use on Gopher servers. The ACeDB introduction page can be obtained from the Sanger Centre at http://www.sanger.ac.uk/Software/Acedb/.

Drosophila melanogaster

The Drosophila community started to share its information in a slightly different way. Don Gilbert was instrumental in making much of the information from the Red Book available on Gopher servers, prior to the advent of the World Wide Web. Flybase, as it is now called, is a "comprehensive database for information on the genetics and biology of Drosophila" and is available via FTP, Gopher, and the Web. Access information can be obtained at http://morgan.harvard.edu/About-flybase.html.

As with ACeDB, Flybase represents a substantial body of integrated information on and about D. melanogaster. Compared with ACeDB, Flybase offers more genetic information, while ACeDB offers more molecular sequence data. Flybase is also represented within GenBank as a cross-referenced database, meaning that related GenBank records will contain a cross-reference qualifier (/db_xref) as well as a hyperlink over to Flybase.
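As a rough illustration of how such cross-references can be used programmatically, the Python sketch below pulls /db_xref qualifiers out of the feature table of a flat-file record. It assumes the qualifier is written as /db_xref="DATABASE:IDENTIFIER". The sample text, the identifier FBgn0000000, the EXAMPLEDB entry, and the function name are placeholders made up for illustration, not real records.

```python
import re

# Matches qualifiers of the assumed form /db_xref="DATABASE:IDENTIFIER".
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def extract_db_xrefs(flatfile_text):
    """Return (database, identifier) pairs for every /db_xref qualifier found."""
    return [(db, ident) for db, ident in DB_XREF.findall(flatfile_text)]


# Made-up feature-table fragment; identifiers are placeholders.
SAMPLE = '''     gene            1..1000
                     /gene="example"
                     /db_xref="FLYBASE:FBgn0000000"
     CDS             100..900
                     /db_xref="FLYBASE:FBgn0000000"
                     /db_xref="EXAMPLEDB:12345"
'''

XREFS = extract_db_xrefs(SAMPLE)
# Collect the distinct Flybase identifiers referenced by this record.
FLYBASE_IDS = sorted({ident for db, ident in XREFS if db.upper() == "FLYBASE"})
print(FLYBASE_IDS)
```

Collecting the identifiers into a set removes the duplicates that arise when several features of one record point to the same Flybase entry.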
