Cloning Strategy Sequence Analysis
Week 2 Sequence Analysis
Using the genetic code handout, translate the DNA sequence on your handout, and determine the reading frame which contains an ORF (open reading frame). You will turn in your manual translation with the rest of the week 1 materials when they are due.
The default format for the database entry is a "GenBank Report"; you will see this indicated under 'Display'.

Now, change the selection next to Display from GenBank to FASTA. This will retrieve the sequence alone in FASTA format, a minimal, simplified format recognized by most sequence analysis programs. This format is used for both nucleotide and amino acid sequences.
1. Paste your nucleotide sequence into the window.
2. Select an output format, and hit the TRANSLATE button.
3. Examine each of the 3 different output forms, and print the FIRST PAGE ONLY from the final ("with nucleotide seq") format.
4. Check your earlier manual translation using your DNA sequence.
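The check in step 4 can also be sketched in a few lines of Python. This is a minimal illustration, not the translation tool used above: it assumes the standard genetic code, looks only at the three forward reading frames, and treats an ORF as the longest stretch that starts with M and runs to the next stop codon (or to the end of the sequence). The function names and the short test sequence are made up for the example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate dna starting at offset frame (0, 1 or 2); '*' marks a stop."""
    dna = dna.upper()
    # Codons containing anything other than A, C, G, T are reported as 'X'.
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


def longest_orf(protein):
    """Longest stretch that starts with M and ends before the next stop."""
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


def best_frame(dna):
    """Return (frame offset, ORF) for the forward frame with the longest ORF."""
    candidates = [(frame, longest_orf(translate(dna, frame))) for frame in range(3)]
    return max(candidates, key=lambda item: len(item[1]))


# Hypothetical 14-base sequence, used only to show the three frames.
example = "GATGAAACGTTAGC"
for frame in range(3):
    print("frame", frame + 1, translate(example, frame))
print("best frame offset and ORF:", best_frame(example))
```

Comparing the three printed frames with your manual translation should show quickly which frame you worked in and whether any codon was misread.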
Use translations of genes retrieved in exercise 2. [Note that you are provided with accession numbers for the corresponding amino acid sequences.]
1. Select the button: User-entered sequence for each sequence window.
2. Give each sequence a short name.
3. Paste each of your two protein sequences into the two windows.
4. Change the number of alignments to be computed to 1. Leave the other default settings unchanged.
5. Click SUBMIT button.
Making alignments of related protein sequences is an important technique for a variety of purposes. Alignments show what portions of a molecule are shared between two or more molecules. Alignments are the starting point for determining the relatedness of molecules, and by extension the organisms from which they come.
Parts of a given protein that are highly conserved (that is, unchanged over evolutionary time) in many distantly related organisms are likely to be important functional regions. In other words, that portion of the protein cannot be altered by mutations in the gene encoding it without destroying or reducing the protein's function. Such mutants are selected against and do not survive. Regions of the protein that can change without disrupting protein function will evolve over time and be different in distantly related organisms.
For a molecular biologist, those highly conserved regions of a gene/protein can also be a key to isolating a gene from an organism for which we have little or no DNA sequence information, as we will discuss later when we examine a technique known as 'degenerate PCR.'
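To make the idea of a conserved region concrete, here is a small Python sketch that scores each column of an alignment and reports runs of fully conserved columns. It is only an illustration under simple assumptions: the sequences are already aligned to equal length with "-" for gaps, the score is the fraction of sequences sharing the most common residue, and the function names, the threshold, and the three toy sequences are invented for the example.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences that share the most common residue in each column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # gaps never count as conserved
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_blocks(scores, threshold=1.0, min_len=3):
    """Return (start, end) pairs, 0-based and end-exclusive, of conserved runs."""
    blocks = []
    start = None
    for i, score in enumerate(scores + [-1.0]):  # sentinel closes a final run
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks


# Toy alignment, not real protein data.
alignment = ["MKTAGWDE", "MRTAGWDQ", "M-SAGWDE"]
for start, end in conserved_blocks(column_conservation(alignment)):
    print(start, end, alignment[0][start:end])
```

Alignment programs use more refined scoring that also credits similar amino acids, so treat this only as a way to see what "unchanged in every sequence" means column by column.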
1. Download sequences to align
Using your newly acquired sequence retrieval skills, find and save the sequences of at least 5 proteins in FASTA format. Save each of these sequences in a new folder, and name them using the accession number.
2. Paste all sequences in FASTA format into a single file
In one method of performing a multiple sequence alignment, we must first create a plain text file with all the sequences we wish to align in the FASTA format. Open a word processor window on your computer (and stagger the two windows on the screen so you can easily go back and forth between them). Then paste each sequence into a single text document, with a blank line between each.
3. Rename the sequences
For convenience (given the program we are using), in your sequence collection file, rename each sequence with the species name initials, underscore, and short name. Make sure there are no spaces in the name.
Example: rename
>gi|6110604|gb|AF193842.1|AF193842 Staphylococcus aureus DNA polymerase I (polA) gene, complete cds
As:
>Sa_DNAPol1
4. Run the alignment with ClustalW2.
The program we will use for alignment is ClustalW2, which is found at the EBI website.
Either paste the collected sequences into the large window, or upload the saved file (note that for uploading the file MUST be in a plain text/text only format).
We will use the program default settings, although we may wish to alter two output parameters (in the lower left):
Output format and Output order
Note also that the order in which the sequences are input can significantly affect the alignment.
Here is a file for demonstration purposes: GCHs
This is the list of species in that file: Species included
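Steps 2 and 3 above (collecting the sequences in one file and renaming them) can also be done with a short script instead of by hand. The sketch below is one possible way, using only plain Python; the function names are hypothetical, the short label for each record is something you supply yourself, and the sequence letters in the example are placeholders rather than the real polA sequence.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def short_name(genus, species, label):
    """Build a name such as Sa_DNAPol1: species initials, underscore, short name."""
    name = f"{genus[0].upper()}{species[0].lower()}_{label}"
    if any(ch.isspace() for ch in name):
        raise ValueError("sequence names must not contain spaces")
    return name


def write_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, one blank line per record."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
        lines.append("")
    return "\n".join(lines)


# Header copied from the example above; the sequence is a placeholder.
raw = """>gi|6110604|gb|AF193842.1|AF193842 Staphylococcus aureus DNA polymerase I (polA) gene, complete cds
ATGAAAGGTC
TTGACCTAGA
"""
records = read_fasta(raw)
renamed = [(short_name("Staphylococcus", "aureus", "DNAPol1"), records[0][1])]
print(write_fasta(renamed))
```

Writing the returned text to a file with a plain-text extension gives a file that can be pasted or uploaded in step 4.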
1. Go to ENTREZ - Structure
2. Search with one of the following numbers:
Or, try the protein of your choice, such as the protein you chose for a multiple alignment previously. Note: not all of these proteins may have crystal structures.
3. Click on the entry, which will take you to a MMDB Structure Summary
4. View structures using FirstGlance in Jmol
Examine the structure of at least two different proteins.
5. Enter a PDB file number and wait for the image to load. Then you can view it in a variety of ways.
Extras
6. Compare structures of Lambda Cro and Lambda Repressor proteins bound to DNA
7. Compare structures of Ubx homeodomain and Lambda Cro helix-turn-helix proteins bound to DNA
8. Compare structures of a bHLH protein (e.g., GCN) and a bZIP protein (e.g., Fos-Jun heterodimer) bound to DNA
PDB files of noteworthy molecules
Directory of C. remanei genomic sequences - find your assigned genomic sequences.
Exercise 7
FGENESH - a gene prediction program
1. Paste your assigned genomic sequence into the sequence window (or browse the appropriate file if you have downloaded it).
2. Select "C. elegans" for Organism.
3. Select the following "advanced options" from the list below:
- print mRNA sequences for predicted genes.
- print exon sequences for predicted genes.
4. Click the "Search" button.
5. Save the resulting file to use in next week's computer lab as well. Besides the basic information on the predicted genes found in your genomic sequence, you will use the predicted mRNA sequences in exercise 7A below, and the predicted protein sequences in exercise 7B below, for each of your predicted genes.
A. Determine whether each of your predicted genes has an associated cDNA or EST. To qualify, the sequence should have 100% identity (or nearly) over an extended region and a very low E value.
1. Go to the NCBI BLAST Server
2. Under Basic BLAST, click on nucleotide blast.
3. Paste your nucleotide sequence into the large window under "Enter Query Sequence"
4. Alter the settings under "Choose Search Set" in the following manner:
a. For "Database", click "Others"; then on the pull-down menu select "Expressed sequence tags (est)"
b. For "Organism" type "Caenorhabditis" (no quotes) into the text window.
5. Select the "Show results in a new window" checkbox next to the BLAST button.
6. Click "Algorithm parameters"
For "Max target sequences" select "10"
7. Click the BLAST button.
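If you have many hits to go through, the criterion in part A (near-100% identity over an extended region, with a very low E value) can be applied with a short script. The sketch below assumes the hits have been saved in BLAST's 12-column tab-separated tabular format (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E value, bit score). The cutoffs of 98% identity, 100 aligned bases, and an E value of 1e-50 are illustrative assumptions, not values set by this exercise, and the hit names are invented.

```python
def supported_by_est(hit_table, min_identity=98.0, min_length=100, max_evalue=1e-50):
    """Return (subject, % identity, length, E value) for hits passing all cutoffs."""
    supported = []
    for line in hit_table.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a 12-column tabular hit line
        identity = float(fields[2])
        length = int(fields[3])
        evalue = float(fields[10])
        if identity >= min_identity and length >= min_length and evalue <= max_evalue:
            supported.append((fields[1], identity, length, evalue))
    return supported


# Two invented hit lines, built with explicit tabs so the columns are unambiguous.
rows = [
    ["gene5", "EST_A", "99.50", "400", "2", "0", "1", "400", "1", "400", "0.0", "720"],
    ["gene5", "EST_B", "82.00", "60", "11", "0", "10", "69", "5", "64", "1e-05", "50"],
]
table = "\n".join("\t".join(row) for row in rows)
for hit in supported_by_est(table):
    print(hit)
```

A hit that fails these cutoffs may still be a related gene, so use the filter only to shortlist candidates before looking at the alignments themselves.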
B. Determine whether your predicted proteins have matches in the protein database.
1. Go to the NCBI BLAST Server
2. Under Basic BLAST, click on protein blast (blastp).
3. Paste each predicted amino acid sequence from your FGENESH analysis into the large window under "Enter Query Sequence"
4. All the other parameters can be left at their default settings.
5. Select the "Show results in a new window" checkbox, then click the BLAST button.
Note also if there are any Conserved Domains found in your search. (This information will come up immediately after initiating the search in the 'Format' window.)
C. Search for possible genes NOT predicted by your genefinder (FGENESH) using Blastx:
1. From the NCBI BLAST Server (1. above), under Basic BLAST, click on blastx
2. Paste your assigned genomic sequence into the sequence window, and click the BLAST button (we will use the default parameters).
See the printed handout for additional information about interpreting your results from B. & C. and doing further analyses.
3. Among the analyses, select at least one predicted protein and prepare a multiple sequence alignment with a selection of homologous proteins from close to distant relatives.
Example of further analysis of Cr0062a.seq predicted genes
| Gene | EST? | C. elegans homolog | Conserved Domains | Type of protein encoded |
| Gene 1 | no | cat-1 gene/AAG00026 | KOG6734 | Vesicular monoamine transporter |
| Gene 2 | no | none | none | no significant matches |
| Gene 3 | no | none | KOG0293/WD40 | WD40 repeat protein |
| Gene 4 | no | ZK550.3 | KOG2089 | Oligopeptidase |
| Gene 5 | yes | msh-1/H26D21.2 | KOG0219 | Mismatch repair protein MutS family |