Searching for homology is the process of comparing a new sequence with all other known sequences and then attempting to infer the function of the new sequence by assessing the matches and their biological annotations as described in the database itself and in the literature. In this section, sequence analysis is described from the point of view of positional cloners whose primary goal is to find the exons rapidly in a megabase expanse of genomic DNA; this usually involves a combination of exon trapping, cDNA selection, and genomic sequencing. The goals here are considerably different from those of investigators whose primary aim is to sequence a genome or large contigs thereof. In the former case, every single-pass sequence must be carefully examined for homologies and coding potential at the earliest possible time both to find genes and to prioritize mutation detection. In the latter case, detailed analysis is usually preceded by contig assembly and "finishing," with the goal of analysis being to construct complete gene models.
There are basically two types of sequence analysis: analysis by similarity (homology) and analysis by intrinsic sequence properties (Fickett, 1994) . These methodologies are often used in concert (Claverie, 1994) . Similarity analysis includes database searching and alignment (including multiple alignment), whereas intrinsic analysis is broader in scope and ranges from predicting exons based on the statistical properties of sequence composition to ab initio predictions of protein structure. An important type of intrinsic analysis is the segmentation of sequences by local compositional complexity (Wootton, 1994a; Wootton and Federhen, 1996) which is of great utility in increasing the signal-to-noise ratio in database searching and is essential for high-throughput analysis of sequence data. Another application of intrinsic sequence analysis is prediction of gene structure by programs such as GRAIL. These programs have recently been reviewed (Burset and Guigo, 1996; Fickett, 1996).
Searching Databases for Similarities
Sequence databases
There are a number of important issues in searching DNA and protein sequence databases (Altschul et al., 1994), but the most important is access to a comprehensive and up-to-date data repository. GenBank, the EMBL nucleotide sequence database, and the DNA Database of Japan (DDBJ) are three partners in a long-standing collaboration to collect and distribute all publicly-available sequence data. Sites in Bethesda, Maryland (USA), Hinxton (UK), and Mishima (Japan) exchange new sequence data and updates over the Internet on a daily basis and make this information immediately available to the public by a variety of means including E-mail, anonymous ftp, and the World Wide Web (Harper, 1994). Details on how to submit new or updated sequence information and annotation to the databases are provided later in this chapter. The major providers of protein sequence data are GenBank, EMBL (translations of coding sequences), PIR International, and Swiss-Prot. More detailed information on all of these data sources may be found in the annual database issue of Nucleic Acids Research.
Operationally, there are two forms of sequence data. The first is the complete database record that contains the names of authors/submittors, literature citations, and biological annotations, as well as the sequence itself and a table of sequence features such as locations of introns, exons, start and stop codons, etc. The second form of sequence data consists simply of an accession number and short descriptive "header" followed by the sequence itself; this is the form that is usually used for rapid similarity searching. Examples of such sequences, in FASTA format, are shown below in the examples. Accession numbers are unique identifiers for a particular sequence. They are assigned to data when it is first submitted to a database and should always be referred to in any description or publication concerning sequence data.
Since sequence data comes from a variety of sources (including the US and European Patent Offices), the NCBI provides comprehensive, quasi-nonredundant datasets (designated nr) for both nucleotide and protein sequences. These datasets are updated daily and made available for homology searching. Further details on the rationale and construction of nr are provided have been described by Altschul et al. (1994). The component data sources making up nr are always reported in the search output (see below) and further details are available through the NCBI Home Page (http://www.ncbi.nlm.nih.gov). Finally, it must be noted that sequence data entering the public databases is growing exponentially with a doubling time of approximately 12 months. GenBank (release 104.0, 15 December 1997) contains more than 551 megabases in 835,000 sequences. NCBI's protein nr database currently contains 312,250 protein sequences.
The BLAST family of programs: uses and examples
Next in importance to the sequence database itself is the computer program used to search it. A number of different search algorithms have been developed over the years, and further information about them may be found in Altschul et al. (1994), Schuler et al. (1994), and references therein. Here, we restrict ourselves to the BLAST programs, which offer a good combination of speed, sensitivity, flexibility, and statistical rigor. BLAST, an acronym for Basic Local Alignment Search Tool, can be used via an E-mail service or as a network client-server application. In addition, two World Wide Web interfaces (basic and advanced) are available and are very easy to use. The detailed examples below illustrate use of the E-mail server, but the same analyses could also be performed using any of the other BLAST implementations available. As the databases are constantly being updated, the search results obtained by repeating these exercises may differ slightly from the output shown below.
The first step in setting up a database search is to select the most appropriate BLAST program (Table 5). There are five implementations of BLAST, three designed for nucleotide sequence queries (BLASTN, BLASTX, TBLASTX) and two for protein sequence queries (BLASTP, TBLASTN). The former are used for the analysis of genomic sequence (including putative exons) and "single-pass" cDNA data; the latter usually come into play when one has identified discrete gene products from finished sequence. Once a BLAST program has been selected, an appropriate sequence database must also be selected. A list of databases available for use with the BLAST programs is presented in Table 6.
BLASTN takes a nucleotide sequence (the query sequence), and its reverse complement, and searches them against a nucleotide sequence database. BLASTN was designed for speed, not maximum sensitivity, and is not intended for finding distantly-related, coding sequences. BLASTX takes a nucleotide sequence, translates it in three forward and three reverse complement reading frames, and then compares the six putative protein translations against a protein sequence database. BLASTX is extremely useful for sensitive analysis of preliminary ("single pass") sequence data and is quite tolerant of sequencing errors (Gish and States, 1993) . BLASTN and BLASTX are used in concert for analyzing EST data, the products of exon trapping and genomic sequence sampling.
Example 1a. For our first example, let us turn back the clock and suppose we are trying to find the breast cancer susceptibility gene by exon trapping from clones spanning the BRCA1 region on chromosome 17. Using the E-mail service, the BLAST search request would be structured as follows:
mail blast@ncbi.nlm.nih.gov Subject: PROGRAM blastn DATALIB nr BEGIN >putative exon from BRCA1 region TCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACC ACATATTTTGCAA |
This message specifies that the BLAST program to be used is BLASTN and that the database to be searched is nr. The word BEGIN signals that the query sequence follows, and that query sequence is in FASTA format. While the BLAST programs have a number of optional parameters, some of which are illustrated below, this example constitutes the minimum information needed to carry out a basic BLAST search.
The results of this search ( Figure 1a ) include a number of matches to human and mouse BRCA1 sequences. Ignore these for now, since we are pretending this gene has not already been cloned. The fifth match is to a sequence described as "Mouse mRNA for estrogen-responsive" and shows a statistically significant P-value (5.2e-05 = 5.2 x 10-5). The next two items in our "hit list" represent an alignments to "Sequence 1 from Patent WO 8907652" and to "Mouse down regulatory protein", both with P-values of only 0.1. These P-values are unimpressive, and it is difficult to assess the meanings of the actual sequence alignments. It now becomes important to repeat the search using BLASTX to translate the query sequence in all six reading frames to detect potential protein similarities.
Example 1b.
mail blast@ncbi.nlm.nih.gov Subject: PROGRAM blastx DATALIB nr MATRIX pam40 BEGIN >putative exon from BRCA1 region TCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACC ACATATTTTGCAA |
In this example, an optional parameter specifying a difference scoring matrix has been included. The MATRIX statement indicates that the PAM40 matrix should be used, rather than the default matrix BLOSUM62, Henikoff and Henikoff, 1993). In this case, the PAM40 matrix was chosen because it is more suited to the short open reading frames (ORFs) that will be generated from our 54 nucleotide query sequence. More details and advice on choosing appropriate scoring matrices will be given below. Note that the database name, nr, is the same for both nucleotide (BLASTN) and protein (BLASTX) database searches, as there are both nucleotide and protein versions of nr; the appropriate version of nrwill automatically be used.
Compared with the nucleotide sequence search results from Example 1a, our hit list is much longer and includes some new and more significant matches ( Figure 1b ), underscoring the greater sensitivity of protein sequence searches. Among the statistically significant matches, both BLAST programs produced a match to rpt-1, described as a "mouse down regulatory protein" (score of 82 and P-value=0.00020). Consistency (or lack thereof) is often a useful clue to the significance of database search results. In the BLASTN search (Example 1a), it is difficult to infer much from the nucleotide alignments. However, the alignments produced by the BLASTX search show an interesting pattern of conserved cysteine and histidine residues, possibly indicating some type of common metal-binding motif. Inspecting the SwissProt database record for this sequence that indicates it to be a DNA-binding, zinc finger protein. The "putative exon" in this example indeed turns out to be exon 3 of the BRCA1 gene, encoding a portion of the zinc finger domain as described by Miki et al. (1994). In a later section, we will see how to use the Entrez system to rapidly assess a database search result by examining related sequences, map locations, all relevant literature, and even three-dimensional structures.
Example 2. To illustrate another important aspect of database searching, consider the analysis of another piece of genomic DNA that may have been isolated by exon trapping.
mail blast@ncbi.nlm.nih.gov Subject: PROGRAM blastn DATALIB nr BEGIN >another putative exon from BRCA1 region GGTCTTACTCTGTTGTCCCAGCTGGAGTACAGTGGTGCGATCA TGAGGCTTACTGTTGCCTTGACCTCCTAGGCTCAAGCGATCCT ATCACCTCAGTCTCCCAAGTAGCTGGGACT |
The results of this search ( Figure 2a ) include many very interesting and statistically significant matches. Such a result seems to good to be true, especially since there are so very many apparently significant hits, yet there is no consistency at all among the matching genes which include a serotonin transporter, a retinoblastoma susceptibility gene, and antithrombin III. Nevertheless, we are excited about and intrigued by the results, so an appropriate follow-up would be to perform a BLASTX search.
mail blast@ncbi.nlm.nih.gov Subject: PROGRAM blastx DATALIB swissprot MATRIX pam40 BEGIN >another putative exon from BRCA1 region GGTCTTACTCTGTTGTCCCAGCTGGAGTACAGTGGTGCGATC ATGAGGCTTACTGTTGCCTTGACCTCCTAGGCTCAAGCGATC CTATCACCTCAGTCTCCCAAGTAGCTGGGACT |
This time we specify swissprot instead of nr as the protein database to be searched, and the results are in Figure 2b. It is obvious why we obtained so many matches in the previous search: the query sequence is a fragment of an Alu repetitive element, and what the BLASTN search produced was simply matches to other Alu elements that are present in thousands of sequences in GenBank. While most genome researchers are aware of this pitfall in sequence analysis, many other biologists who search databases are not, and many mistaken or misleading results are still being published (and deposited in the databases) as a consequence (e.g., Tugendreich et al., 1994). Therefore, the curator of Swiss-Prot, Amos Bairoch, has wisely included "dummy" entries (representing the translation of Alu family consensus sequences) in the database as warning flags.
The lesson here is this: when analyzing any DNA sequence, check it first for the presence of repetitive elements. This is conveniently done by specifying DATALIB alu in your BLASTN search request. It is also imperative to check all sequences for residual vector-derived sequences that are often present on the ends of individual sequence reads and sometimes go unrecognized with DATALIB vector. Such "filtering" procedures are built into the new PowerBLAST program which is described later . Query masking is an important general tool for sequence analysis and will be discussed in greater detail below.
Example 3. TBLASTX is the newest member of the BLAST family of programs and performs six-frame translations of both the query and all database nucleotide sequences. It then looks for matches at the protein level. In terms of computational resources, this is very expensive, so TBLASTX searches are restricted to queries on the EST division of GenBank. Suppose that the preceding BLASTN and BLASTX searches against nr and found nothing, i.e., there are no statistically significant matches against sequences in the nucleotide or protein databases. One more procedure remains to be performed. Whereas most coding sequences in GenBank have their cognate proteins represented in the GenPept database (and thus in nr), there is one important exception: because single-pass, cDNA (EST) sequences seldom have coding sequence annotation and are subject to frameshift errors, conceptual translations (ORFs) from these sequences are not included in GenPept or nr; therefore, ESTs must be translated "on the fly" by programs like TBLASTX or TBLASTN. It is always best to use the latter if one has the protein sequence for use as a query. But in the case of putative exons, single-pass genomic or cDNA, one wants to translate the query as well as the cDNA/EST database to maximize one chances of finding any significant but subtle (particularly cross-phylum) sequence homologies where there may be no similarity between nucleotide sequences at all.
Suppose that some cDNA clones that hybridize to a YAC believed to contain a gene of interest have been cloned, and that some partial cDNA sequence data from one of the clones reveals no matches based on BLASTN or BLASTX searches. A TBLASTX search of the EST division of GenBank would be formulated as follows:
mail blast@ncbi.nlm.nih.gov Subject: PROGRAM tblastx DATALIB dbest BEGIN >OCRL-selected mRNA, partial sequence TTGAACATCATGAAACATGAGGTTGTCATTTGGTTGGGAGATTTGAATTATAGACTTTGC ATGCCTGATGCCAATGAGGTGAAAAGTCTTATTAATAAGAAAGACCTTCAGAGACTCTTG AAATTCGACCAGCTAAATATTCAGCGCACACAGAAAAAAGCTTTTGTTGACTTCAATGAA GGGGAAATCAAGTTCATCCCCACTTATAAGTATGACTCTAA |
The results are shown in Figure 3. Actually, the mRNA fragment in our example corresponds to nucleotides 1465-1685 of the GenBank entry (M88162) for Lowe's Oculocerebrorenal (OCRL-1) syndrome (Attree et al., 1992) and displays a highly significant similarity (P=3.2 x 10-18) to an ORF encoded by an Arabidopsis thaliana cDNA clone. The homologies between OCRL-1 and plant sequences are not even detectable, let alone significant, at the nucleotide level (data not shown). If the query were human genomic DNA, such a match to plant sequences would be strong evidence of an exon because similarities in non-coding sequences would not be expected to be conserved across such a vast evolutionary distance.
Given a coding nucleotide sequence and the protein it encodes, it is almost always preferable to use the protein as the query sequence to search a database because of greatly increased sensitivity to detect more subtle relationships. This is due to the larger alphabet of proteins (20 amino acids) compared with the alphabet of nucleic acid sequences (4 bases), where it is far easier to obtain a match by chance. In addition, with nucleotide alignments, only a match (positive score) or a mismatch (negative score) is obtained, but with proteins, the presence of conservative amino acid substitutions can be taken into account. Here a mismatch may yield a positive score if the nonidentical residue has physical/chemical properties similar to the one it replaced. Various scoring matrices are used to supply the substitution scores of all possible amino acid pairs. The best scoring system for general purposes is the BLOSUM62 matrix (Henikoff and Henikoff, 1993), and this is currently the default choice for the BLAST programs.It is important to recognize that BLOSUM62 is tailored for alignments of moderately diverged sequences (in practice, the most frequent type of query) and thus may not yield the best results under all conditions. Indeed, the PAM40 matrix used in Example 1b above yielded more convincing results than BLOSUM62 (not shown). Altschul (1993) recommends using a combination of three matrices to cover all contingencies. This may improve sensitivity, but at the expense of slower searches. In practice, a single BLOSUM62 matrix is routinely used but others (PAM40 and PAM250) are tried when an absolutely exhaustive search is necessary. Low PAM matrices are best for detecting very strong but localized sequence similarities, whereas high PAM matrices are best for detecting long but weak alignments between very distantly related sequences.
Confounding subsequences and query masking
One of the most important advances in database similarity searching during the past several years has been the introduction of methods for the automatic masking of low-complexity sequences. Problematic query sequences result in hundreds (or thousands) of spurious matches to nebulous entities with names such as "proline-rich protein" that may obscure more subtle but biologically interesting matches. The term "low-complexity subsequences" has a rigorous definition (Wootton and Federhen, 1993) but may be thought of as synonymous with regions of locally biased amino acid composition. "Biased" means that the sequence deviates from the random model which underlies the calculation of the statistical significance (p-value) of an alignment. Such alignments among low-complexity sequences are statistically but not biologically significant, i.e., one cannot infer homology (common ancestry) or functional similarity. Thus, it is best to simply note their presence but exclude them from the database search process. Several investigators have worked on the definition, origin, and detection of "simple" (low-complexity) sequences in protein and DNA (for review, see Wootton, 1994b). The SEG (Wootton and Federhen, 1996) and XNU (Claverie and States, 1993) algorithms have been implemented as options for the BLAST family of programs, and they may be used individually or in combination.
Just how common is the problem of confounding, low-complexity segments among proteins? Wootton (1994b) has shown that they are surprisingly abundant, accounting for 25% or more of the residues in protein sequence databases. In the context of positionally cloned human disease genes, the data are even more striking: Of the 42 gene products identified at the time of this writing, 83% have an average of five low-complexity segments, each accounting for between 1% and 48% of the residues in an individual protein. Clearly, masking low-complexity sequences is not an optionit should be used in every search, particularly those that involve conceptual translations (BLASTX, TBLASTN, TBLASTX) where, for example, translation of a poly(A) yields a polylysine ORF. Indeed, the BLAST Web servers now perform masking of low-complexity sequences by default and one has to explicitly turn this function off if masking is not desired.
Example 4. Recently, a gene responsible for maturity-onset obesity, tub, was cloned in the mouse (Noben-Trauth et al., 1996). The following is a BLASTP search in which the SEG algorithm is not used for filtering.
mail blast@ncbi.nlm.nih.gov Subject: PROGRAM blastp DATALIB nr BEGIN >gi|1280437 candidate tub gene MVQANADGRPRSRRARQSEEQAPLVESYLSSSGSTSYQVQEADSIASVQLGA TRPPAPASAKKSKGAAASGGQGGAPRKEKKGKHKGTSGPATLAEDKSEAQGP VQILTVGQSDHDKDAGETAAGGGAQPSGQDLRATMQRKGISSSMSFDEDEDE DENSSSSSQLNSNTRPSSATSRKSIREAASAPSPAAPEPPVDIEVQDLEEFA LRPAPQGITIKCRITRDKKGMDRGMYPTYFLHLDREDGKKVFLLAGRKRKKS KTSNYLISVDPTDLSRGGDSYIGKLRSNLMGTKFTVYDNGVNPQKASSSTLE SGTLRQELAAVCYETNVLGFKGPRKMSVIVPGMNMVHERVCIRPRNEHETLL ARWQNKNTESIIELQNKTPVWNDDTQSYVLNFHGRVTQASVKNFQIIHGNDP DYIVMQFGRVAEDVFTMDYNYPLCALQAFAIALSSFDSKLACE |
The results of this search (Figure 4a) are voluminous and apparently significant. However, look at the alignments for vitellogenin or SRP40: How does one interpret these results?
The SEG algorithm may be applied to a protein sequence to divide it into low-complexity and high-complexity subsequences. This is illustrated for the tub gene product in the Figure 4b. Note the four internal low-complexity regions that are enriched for serine, glycine, proline and other amino acid residues. Such subsequences can be masked prior to a BLASTP search by including the directive FILTER seg between the DATALIB nr and BEGIN statements in the previous E-mail example. As stated above, filtering is performed by the SEG algorithm by default if the BLAST Web pages are used for the search. Performing a masked instead of an unmasked search reduces the hit list from nearly 300 matches (almost all false positives) to only seven meaningful matches, with the topffive being statistically significant. All spurious matches disappear from the output. (In the filtered output alignments, masked residues appear as Xs, a character that the BLAST programs ignore.)
In addition to low-complexity segments, many proteins also have other regions that may lead to problems in interpreting database search results. Nonglobular domains of proteins, such as the collagen helix and myosin rod, also deviate from the random model of amino acid composition and represent an intermediate level of compositional complexity in comparison with globular structures (Wootton, 1994b). Users who have spent time searching databases will have undoubtedly encountered a query sequence that yields a long and monotonous hit list consisting of little else but myosins and other alpha-helical, coiled-coil proteins. Again, this phenomenon is not irrelevant to the positional cloner: 81% of human gene products isolated in this manner to date have at least one nonglobular domain. Indeed, in this sample, only six proteins (14%) have neither low-complexity segments nor nonglobular domains and would thus not benefit from masking during database searches.
In the case of low-complexity sequences, little, if any, biological importance can be inferred from their presence in a protein, except perhaps the presence of signal sequences or a membrane-spanning domain from hydrophobic stretches. In contrast, however, there is a more positive aspect to the identification of some nonglobular sequences. Instead of evolutionary homology, these sequences may represent structural analogy between proteins. For example, the presence of a coiled-coil domain may signify homo- or hetero-oligomeric protein-protein interactions and thus suggest an experiment. This was that case for polyposis-associated familial colon cancer (Kinzler et al., 1991; Su et al., 1993) . The default parameters of the SEG program are set to detect and mask low-complexity subsequences, but they can be adjusted to partition a protein into globular and nonglobular domains (Wootton, 1994a).
In this section, some of the most important ways and means to utilize BLAST and associated software tools have been discussed. These programs can be further customized by using a large number of options that are more fully described in documents available by sending an E-mail "help" message to blast@ncbi.nlm.nih.gov or on the BLAST WWW page at http://www.ncbi.nlm.nih.gov/.
Recently, a new network client-server program called PowerBLAST has become available by ftp from NCBI (ncbi.nlm.nih.gov, directory /pub/sim2/PowerBlast). This application runs on UNIX, Macintosh, and Microsoft Windows platforms. PowerBLAST has a number of new and useful features, including automatic query masking and filtering, search restriction by organism, graphical output of gapped multiple alignments, the ability to analyze 100 kb per hour, and compatibility with NCBI's new database annotation and submission tool called Sequin.