Multiple Alignment, Sequence Motifs and Structure Inference

Cold Spring Harbor Laboratory Press:
Selected Chapter

Genome Analysis: A Laboratory Manual Series Volume 1, Analyzing DNA
CHAPTER 7, Computational Analysis of DNA and Protein Sequences

Multiple Alignment, Sequence Motifs, and Structure Inference

Multiple alignment
Sequence motifs
Structure prediction and protein modeling

NOTE: This is an old document. Links may no longer function correctly.
Some tables are available, but figures are not.

As in the case of amyotrophic lateral sclerosis and the SOD1 gene product, one is not always lucky enough to have a crystal structure upon which to evaluate the possible effects of mutations. More often, though, homologues in other organisms are available and can be used for comparative sequence analysis. Multiple alignments are performed to study similarities and differences in a group of related sequences. The purpose here is to assess whether mutational changes are expected to lead to loss or diminution of function. If a change in a sequence is one that results in a nonconservative amino acid substitution, particularly in a residue that multiple alignment has shown to be conserved in a group of evolutionarily distant sequences, then that change is more likely to represent a deleterious mutation rather than a silent polymorphism.
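
The reasoning in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the physicochemical groups below are a coarse, hypothetical classification (a real analysis would more likely use a substitution matrix), the toy alignment is invented, and the first sequence is arbitrarily treated as the reference.

```python
# Coarse, hypothetical residue classes; an assumption made for illustration only.
GROUPS = {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "polar": "STNQ",
    "positive": "KRH",
    "negative": "DE",
    "special": "CGP",
}
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def column(alignment, pos):
    """Return the residues found at one column of a multiple alignment."""
    return [seq[pos] for seq in alignment]


def is_conserved(alignment, pos):
    """True if every sequence has the same residue (and no gap) at pos."""
    residues = set(column(alignment, pos))
    return len(residues) == 1 and "-" not in residues


def is_conservative(ref, alt):
    """True if both residues fall in the same coarse class."""
    return GROUP_OF[ref] == GROUP_OF[alt]


def flag_substitution(alignment, pos, alt):
    """Flag a nonconservative change at a fully conserved column."""
    ref = alignment[0][pos]  # first sequence is treated as the reference
    return is_conserved(alignment, pos) and not is_conservative(ref, alt)


# Invented toy alignment of three homologues.
toy_alignment = ["MKDLV", "MRDLI", "MKDIV"]
print(flag_substitution(toy_alignment, 2, "A"))  # D -> A at a conserved column
print(flag_substitution(toy_alignment, 2, "E"))  # D -> E stays in the same class
```

A flagged position is only a hint that a change deserves closer study; it does not by itself establish that a mutation is deleterious.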

Multiple alignment is a large and active research area in computational biology, and we do not attempt to do it justice here. Rather, we continue the approach followed so far and simply point out some useful programs that are freely available in the public domain. CLUSTAL W is among the most powerful multiple sequence alignment packages available and performs progressive multiple sequence alignments based on the method of Feng and Doolittle (1987). Each pair of sequences is aligned and the distance between each pair is calculated; based on this distance matrix, a guide tree is calculated, and all of the sequences are progressively aligned following this tree. A major feature of the program is its sensitivity to the effect of gaps on the alignment; gap penalties are varied to encourage the insertion of gaps in probable loop regions rather than in the middle of structured regions. Users can specify gap penalties, choose among a number of scoring matrices, or supply their own scoring matrix for both the pairwise and multiple alignments. The output can be obtained in a number of file formats, allowing users to proceed directly to phylogenetic packages such as PHYLIP (Felsenstein, 1993) or publication-quality alignment formatters such as ALSCRIPT (Barton, 1993). CLUSTAL W for UNIX and VMS systems is freely available by anonymous FTP at ftp.ebi.ac.uk, or by E-mail at netserv@ebi.ac.uk. CLUSTAL W can also be accessed as an external module through SeqApp, a Macintosh sequence editor and analysis program (Gilbert, 1992) available by anonymous FTP at ftp.bio.indiana.edu.
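
The first two steps of a progressive alignment (pairwise distances, then a guide tree) can be sketched as follows. This is a simplified illustration, not the program's own implementation: the sequences are assumed to be pairwise aligned already (equal length), distance is taken as the fraction of mismatched positions, and plain average-linkage clustering is swapped in to build the tree. The function names and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatches between two pre-aligned sequences (gaps skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def guide_tree(seqs):
    """Build a guide tree as nested tuples using average-linkage clustering.

    seqs maps unique names to pre-aligned sequences. Each merge joins the two
    clusters with the smallest mean pairwise distance between their members.
    """
    clusters = {name: [name] for name in seqs}  # tree node -> member names
    dist = {
        frozenset((m, n)): p_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }

    def cluster_distance(c1, c2):
        members1, members2 = clusters[c1], clusters[c2]
        total = sum(dist[frozenset((m, n))] for m in members1 for n in members2)
        return total / (len(members1) * len(members2))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = merged
    return next(iter(clusters))


# Invented toy sequences; A and B are the closest pair and are joined first.
toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTCCGA"}
print(guide_tree(toy))
```

The nested tuples give the order in which a progressive method would align the sequences: the most similar pair first, with more distant sequences added later.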

Another useful program is the Multiple Alignment Construction and Analysis Workbench, or MACAW (Schuler et al., 1991), for which both Macintosh and Microsoft Windows versions are available. MACAW uses a graphical interface, provides a choice of several alignment algorithms, and is available by anonymous FTP at ncbi.nlm.nih.gov (directory /pub/macaw). MACAW has been used, for example, to analyze the product of the diastrophic dysplasia gene (Table 7; Hastbacka et al., 1994), which encodes a novel sulfate transporter related to several previously described sequences. MACAW was also used to align the human HNPCC/MLH1 gene product with its various yeast and bacterial homologs (Papadopoulos et al., 1994). As concerted genomic and cDNA sequencing progresses, it will become more and more likely that a positionally cloned gene will already have homologs in the database at the time of its isolation. Thus multiple sequence alignment will become an increasingly important method of data analysis.

Sequence motifs

The cornerstone of sequence analysis is database searching for pairwise similarity between sequences. As described in the preceding section, multiple alignment adds another useful dimension to sequence analysis. Sequence "motifs" are derived from multiple alignments and can be used to examine individual sequences or an entire database for subtle patterns. With motifs, it is sometimes possible to detect distant relationships that may not be demonstrable based on comparisons of primary sequences alone. As with multiple alignment, the derivation and use of sequence motifs is a very active research area with a sizable literature. As before, we shall only point out a few examples from this field. For access to some of the latest developments, see Tatusov et al. (1994) and references cited therein.

Currently, the largest collection of sequence motifs in the world is PROSITE (Bairoch and Bucher, 1994). PROSITE can be accessed via either the ExPASy WWW server or anonymous FTP site. A free software package named MacPattern (Fuchs, 1991) is available for searching PROSITE motifs; MacPattern is available through anonymous FTP from EBI and other sites (Table 2). Many commercial sequence analysis packages also provide search programs that utilize PROSITE data.
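
To give a feel for what searching with a motif of this kind involves, the sketch below converts a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. It assumes the usual pattern conventions (elements separated by "-", "x" for any residue, square brackets for allowed residues, curly braces for excluded residues, parentheses for repeat counts, "<" and ">" for the sequence ends). The example pattern and sequence are invented, and the converter is a hypothetical helper, not part of MacPattern or any other package.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count for the preceding element.
        element = element.replace("(", "{").replace(")", "}")
        # Lower-case x stands for any residue.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- or C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


# Invented pattern and sequence, for illustration only.
regex = prosite_to_regex("C-x(2)-C-{P}-[ST].")
print(regex)
match = re.search(regex, "MACAACGSK")
print(match.start(), match.group())
```

Because the converted pattern is an ordinary regular expression, the same call can be repeated over every entry of a sequence file to scan a whole database.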

One of the most useful resources for searching for protein motifs is the BLOCKS E-mail server (Table 1) developed by Steve Henikoff (Henikoff and Henikoff, 1991; Henikoff, 1993). BLOCKS searches a protein or nucleotide query sequence against a database of protein motifs, or "blocks". Blocks are defined as short, ungapped multiple alignments that represent highly conserved protein patterns. The blocks themselves are derived from entries in PROSITE as well as other sources. If a nucleotide sequence is submitted, it is translated in all six reading frames and motifs are sought in these conceptual translations. Once the search is completed, the server returns a ranked list of significant hits, along with an alignment of the query sequence to the matched BLOCKS entries.
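
The six-frame conceptual translation mentioned above can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and an unambiguous DNA sequence; the function names and frame labels are hypothetical and do not reflect how the BLOCKS server itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six strings can then be searched for motifs exactly as a protein query would be; stop codons appear as "*".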

As PROSITE and BLOCKS represent collected families of protein motifs, searching against these databases entails submitting a single sequence to determine whether that sequence is similar to the members of an established family. Working in the opposite direction, programs are available for comparing a collection of sequences to individual entries in the protein databases. An example of such a program is the Motif Search Tool, or MoST (Tatusov et al., 1994). Based on an aligned set of input sequences, a weight matrix is calculated using one of four methods (selected by the user); a weight matrix is simply a representation, position-by-position in an alignment, of how likely each amino acid is to appear. The calculated weight matrix is then used to search the databases. To increase sensitivity, newly found sequences are added to the original data set, the weight matrix is recalculated, and the search is performed again. This procedure continues until no new sequences are found.
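
The iterative weight-matrix search described above can be sketched as follows. The weighting scheme here (log-odds against a uniform background, with pseudocounts) is an assumption chosen for simplicity and is not claimed to be one of the four methods offered by MoST; the score threshold, function names, and toy data are likewise hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # sequences must use only these letters


def weight_matrix(block, pseudocount=1.0):
    """One dict per alignment column: log-odds of each residue at that position."""
    background = 1.0 / len(ALPHABET)  # uniform background (an assumption)
    matrix = []
    for col in zip(*block):
        counts = Counter(col)
        total = len(col) + pseudocount * len(ALPHABET)
        matrix.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return matrix


def best_window(matrix, seq):
    """Return (score, start) of the highest-scoring ungapped window in seq."""
    width = len(matrix)
    return max(
        (sum(matrix[i][seq[start + i]] for i in range(width)), start)
        for start in range(len(seq) - width + 1)
    )


def iterative_search(block, database, threshold):
    """Search, add new hits to the block, rebuild the matrix, and repeat."""
    block = list(block)
    found = set()
    while True:
        matrix = weight_matrix(block)
        new_hits = []
        for name, seq in database.items():
            if name in found or len(seq) < len(matrix):
                continue
            score, start = best_window(matrix, seq)
            if score >= threshold:
                new_hits.append((name, seq[start:start + len(matrix)]))
        if not new_hits:  # converged: no new sequences were found
            return found
        for name, segment in new_hits:
            found.add(name)
            block.append(segment)


# Invented toy block and database.
block = ["GDSGG", "GDSGG", "GDSGA"]
database = {"hit": "MKTGDSGGPLV", "miss": "MKTAYIAKQRQ"}
print(iterative_search(block, database, threshold=3.0))
```

The choice of threshold matters in practice: if it is set too low, a single false hit added to the block can pull further unrelated sequences into later rounds.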

An increasingly important use of motifs in the future will be to "preprocess" query sequences for the presence of obvious known domains and then mask these regions (see Sequence Analysis) prior to a full-scale BLAST search. This should simultaneously increase the speed of the search while improving the ability to detect subtle matches that would otherwise be swamped out by abundant, strong matches to other sequence regions, such as kinase catalytic domains (Altschul et al., 1994).
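
Masking a known domain before a search amounts to overwriting its residues with a placeholder character. A minimal sketch, assuming the domain boundaries have already been found by a motif search and are given as 0-based, end-exclusive intervals (the function name and the choice of "X" as the mask character are assumptions):

```python
def mask_regions(seq, regions, mask_char="X"):
    """Replace each (start, end) region of seq with mask_char, keeping length."""
    masked = list(seq)
    for start, end in regions:
        if not 0 <= start <= end <= len(seq):
            raise ValueError(f"region {(start, end)} lies outside the sequence")
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


# Invented example: hide residues 3..7 before submitting the query.
print(mask_regions("MKTGDSGGPLV", [(3, 8)]))
```

Because the masked sequence keeps its original length, coordinates of any hits reported for the unmasked regions still refer to positions in the original query.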

Structure prediction and protein modeling

In the course of studies aimed at rational drug design or determining the biochemical function of a protein, knowledge of the structure of the protein is of critical importance. There are two main experimental methods by which the tertiary structure of a protein can be determined: X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Both of these methods, however, are technically demanding, time-consuming and not amenable to automation. These methods will therefore not be able to keep pace with the discovery of new sequences. For this reason, predictive methods are necessary to address the need for structural insights in the absence of direct experimental data. As with all such methods, it must be kept in mind that, no matter how good the method, the results are still predictions, and different methods will give different predictions. With that precaution in mind, and in conjunction with supporting biochemical data, these methods can provide valuable insights into protein structure.

As the structures of more and more proteins are being solved, it is becoming increasingly apparent that there is a relatively small set of three-dimensional motifs into which proteins are observed to fold (Chothia, 1992). This observation, coupled with the concept that protein structure is conserved to a greater extent than sequence (Chothia and Lesk, 1986), has led to the development of sophisticated methods by which the three-dimensional structure of a protein of interest can be predicted. One of the most promising methods is known as homology model building, or threading (reviewed by Fetrow and Bryant, 1993). In this method, a query sequence of unknown structure is threaded through the coordinates of a homologous protein of known structure (Bryant and Lawrence, 1993). All possible placements of the query sequence, subject to a number of set physical constraints, are attempted; for example, core regions (alpha-helices or beta-sheets) are kept at a fixed length, while loop regions may be allowed to have variable length within limits. By evaluating pairwise and hydrophobic interactions between nonlocal residues, this method is able to identify the most energetically favorable and conformationally stable placements of the query sequence with respect to the known structure.

The threading technique was recently applied to a DNA-binding motif found within high-mobility group proteins HMG-1 and HMG-2, a motif called the HMG-1 box. In this case, a number of non-HMG DNA-binding proteins were shown to form the HMG-1 box motif despite the absence of statistically significant sequence similarity. Another recent example involves the ob gene product which, when mutated, is associated with obesity and type II diabetes in mice (Madej et al., 1995). Here, the ob gene product was found to be similar to a family of helical cytokines that includes interleukin-2 and growth hormone. Although threading programs are not yet widely available, their development will provide the molecular biologist with a very powerful tool with which to deduce structural similarities that are not necessarily obvious through traditional sequence alignment techniques.

For a newly discovered gene product, it is possible to determine if there is a homologous protein in the structure database that might make tertiary structure modeling feasible via threading (see Integrated Information Retrieval). If there is no such structure available, methods exist for predicting possible secondary structure of a new sequence. Since the original method of Chou and Fasman (1974) was introduced, a number of different algorithms have been developed to predict the secondary structure of proteins. The end product of all of these methods is the same: a prediction of how likely protein sequences are to assume an alpha-helical, beta-sheet, or random-coil conformation. Most of these methods rely on patterns that can be deduced from sets of proteins for which three-dimensional structure information is already available.

Several E-mail servers are currently available for such secondary structure prediction, each based on a slightly different method (see Table 1). The nnpredict algorithm (Kneller et al., 1990) uses a FASTA-formatted sequence as its input and allows the user to select the tertiary structure class of the protein; the server then returns the most likely structure for each individual residue in the query. PredictProtein (Rost and Sander, 1993) first takes the query sequence and performs a database search, from which a multiple sequence alignment is derived. The information in this alignment (a sequence profile) is then used to improve the accuracy of the prediction. A detailed report of the probability of each individual amino acid assuming an alpha-helical, beta-sheet, or random-coil conformation is then returned. The accuracy rate for the best-case prediction using nnpredict is reported to be 79%; for PredictProtein, the average accuracy rate is reported to be 71.6% for all residues and 92% for the most reliably scored residues in the query. One important caveat to all of these methods is that they are based upon extrapolations from the existing structure database, which is biased toward globular sequences. As work by Wootton (Wootton, 1994a,b) and others has shown, however, a large fraction of sequences, or sequence domains, have nonglobular structures.
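
Accuracy figures of this kind can be read as the fraction of residues whose predicted state matches the experimentally observed one. The sketch below assumes that definition, a three-letter state code (H for helix, E for sheet, C for coil), and a per-residue reliability score on an arbitrary scale. The function names and toy data are hypothetical and are not taken from either server.

```python
def per_residue_accuracy(predicted, observed):
    """Fraction of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def accuracy_above_cutoff(predicted, observed, reliability, cutoff):
    """Accuracy restricted to residues scored at or above a reliability cutoff.

    Returns (accuracy, coverage), where coverage is the fraction of residues kept.
    """
    kept = [
        (p, o)
        for p, o, r in zip(predicted, observed, reliability)
        if r >= cutoff
    ]
    if not kept:
        raise ValueError("no residues meet the reliability cutoff")
    accuracy = sum(p == o for p, o in kept) / len(kept)
    return accuracy, len(kept) / len(observed)


# Invented toy prediction: H = helix, E = sheet, C = coil.
predicted = "HHHHEEEECC"
observed = "HHHCEEEHCC"
reliability = [9, 9, 8, 2, 9, 9, 8, 1, 7, 7]
print(per_residue_accuracy(predicted, observed))
print(accuracy_above_cutoff(predicted, observed, reliability, cutoff=5))
```

Restricting the calculation to reliably scored residues raises the accuracy at the cost of coverage, which is why the two kinds of figure should not be compared directly.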

A relevant example of where motif analysis and tertiary structure modeling have advanced our understanding of a positionally cloned human disease gene is the case of Norrie disease, an X-linked disorder characterized by progressive blindness, deafness, and mental retardation. When the Norrie disease gene was first cloned, no significant homologies were found upon database searching and no motifs from PROSITE were detected (Berger et al., 1992; Chen et al., 1992). However, based upon a particular arrangement of conserved cysteine residues and database searching with consensus patterns, Meitinger et al. (1993) constructed a three-dimensional model showing that the Norrie disease protein most likely has a structure similar to that of transforming growth factor beta.
