You should complete Exercises 1 - 3 and 5 in lab today and transfer your 'work products' into ELN. It may be possible to complete the entire exercise during lab - this is recommended. If necessary, Exercise 4 may be completed outside of class; however, you must get it started so that you can ask questions. You must at a minimum complete your multiple sequence alignment (MSA) and add to your ELN. This includes locating the AAs of your 'variants' within the MSA.
'Work Products' for this lab - what should be completed and turned in (put in your ELN).
Ideally, print out your assigned DNA sequence. (Or, if a printer is not available, 'manually' modify the file provided. Using the genetic code (found at the end of this handout: DNA+AA info), translate the DNA sequence on your handout in 3 frames, and mark the reading frame which contains an ORF (open reading frame.
Photograph your manual translation and post the image to the appropriate placce in your ELN.
NCBI National Center for Biotechnology Information - access integrated biomedical/molecular databases, including Genbank.
Two other major sequence databases:
2. The European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EBI), also associated with ExPASy (Expert Protein Analysis System) based in Switzerland.
These nucleotide databases are together part of the International Nucleotide Sequence Database Collaboration.
Now, change the selection next to Format: from GenBank to FASTA, by clicking on the link (red arrow & box in illustration below). This will retrieve the sequence in FASTA format (description), which is a minimal, simplified format recognized by most or all sequence analysis programs. This format is used both for nucleotide and amino acid sequences. Save this file.
1. Paste your nucleotide sequence (from Exercise 2) into the window. Be sure to paste in only DNA sequence.
2. Select an output format, and hit the TRANSLATE button. Starting with the default "Verbose" is specifically designed to help spot ORFs, highlighted in Red . This format also allows you to go on and select a specific ORF. Note that you can go back to the GenBank report for your nucleotide sequence and confirm you've identified the correct protein sequence.
3. Examine each of the 3 different ouput forms. Determine which reading frame is correct, and save that correct translation (e.g., copy & paste into a word document), in the "with nucleotide seq" format. Make sure to format neatly with a mono-spaced font and appropriate margins so the sequence is easily read.
Highlight the start and stop codon.
4. Using this tool, you can also go back and check your manual translation (from Exercise 1) using your DNA sequence, and the "with nucleotide seq" format. How well did you do the translation?
Making alignments of related protein sequences is an important technique for a variety of purposes. Alignments show what portions of a molecule are shared between two or more molecules. Alignments are the starting point for determining the relatedness of molecules, and by extension, the organisms from which they come.
Parts of a given protein that are highly conserved (that is, unchanged over evolutionary time) in many distantly-related organisms are likely to be important functional regions. In other words, that portion of the protein cannot be altered by mutations in the gene encoding it without destroying or reducing the protein's function. Such mutants are selected against and do not survive. Regions of the protein that can change without disrupting protein function will evolve over time and be different in distantly-related organisms.
In this lab, you may imagine either of two possible future scenarios. 1)You work as a genetic counselor, and are analyzing information from a client or patient's DNA sequencing, or 2) You have received information about your genes, or those of a friend or relative who has shared the information with you, and would like to evaluate the nature of some variants found in the sequence of specific genes.
All of us have some parts of our genome that are different from the most common sequence. (Actually the original / first human genome sequence of the Human Genome Sequencing Project was a composite made from a mix of many individuals' DNA.)
The basic question: For any given variant, should I be concerned that the variant alters the function of the gene involved?
For this exercise, we will focus on small nucleotide changes that alter single specific amino acids in the protein coding region ('missense' alterations) of genes encoding mostly metabolic enzymes. Select one protein-of-interest from the list in this document: Human proteins associated with single amino acid change causing diseases or syndromes. Each member of a breakout group should select a different protein.
For each, there is a list of 'variants' associated with 'single nucleotide polymorphisms (SNPs)' found among human genomes. Some of these are known to cause dysfunction in the protein. Others are neutral, or how they alter protein function is unknown.
You will analyze at least 6 missense variants found in your human protein of interest and evaluate the likely/possible effect of the AA variant based on:
A. Change in the type of AA, based purely on AA properties and empirical substitution frequencies. You will need this this handout on amino acid properties and other information: DNA+AA info. We will discuss and demonstrate this evaluation further in class.
i. What is the type of AA (original) vs. the variant? (There are a variety of ways amino acids can be classified - see the handout.)
ii. What is the score (using the BLOSUM62 substitution matrix*) for the change?
[* Where did the BLOSUM62 alignment score matrix come from? ]
iii. From this information, do you think this AA change is a minimal, moderate, or radical change?
Before you do part 4A (which comes first logically), first complete your MSA for part 4B (below) - this is the more difficult task. You must show you have a good MSA, and you have located the AAs in your human protein on the MSA, before leaving lab today.
B. AA location in the protein - how conserved the AA is at this location? For this, you will perform a multiple sequence alignment (MSA) - see below. But first, you will identify the precise location of each specific altered AA in your protein of interest.
Once you've selected your gene / protein of interest, acquire the protein sequence using the accession number provided. Go to the NCBI Protein database and enter the accession number.
Get the protein sequence in FASTA format. You will see a link for that on the page. Copy this into a text file.
Now copy and paste your human protein-of-interest sequence (don't use the accession number) into ExPASy protein parameters tool to get a nicely numbered version that will make it easier to find specific AAs (i.e., your sequence variants). It will appear at the top of the result. You can ignore all the other useful info below. Copy the top part into a file and mark the locations of all your variants in this numbered version (e.g., highlight or bold). Example of a numbered sequence shown below:
You will need this information - the coordinates of your protein - so that you can locate your AA variants within your multiple sequence alignment (MSA).
B1. Download seven more protein sequences to align (total 8, including human)
In order to find conservation related to function, it will be best to acquire orthologs from a broad range of phylogenetic distances, such as those in the list of different species below: (Example FASTA file + species list). Choose sequences all from different animal phyla. You already have one verbrate (human), so include no more than one other chordate (ideally a non-vertebrate chordate). See here for an animal phylogeny showing some major phyla, plus names of some animals with fully sequenced genomes. (If you're having a problem finding sequences from 7 different phyla, then use more than one in a given phylum, but no more than 2 per phylum.)
a. Go to your gene/protein page by entering 'Human Protein-Name*' into the search line at the NCBI Protein database (and searching).
* the abbreviation for the protein in parenthesis in the list of proteins and variants, e.g., 'Human AMT'
This should result in a screen that looks like this (for a protein called WNT1):
b. Then, under Species in the lefthand column, click Animals.
c. Then, in the righthand column below Results by taxon, click on Tree. The example shown below indicates there are sequences available from 6 different phyla (chordates, arthropods, flatworms, molluscs, nematodes and cnidarians). [So for this example, you would have to take more than one sequence from a given phylum to get up to 8 sequences total.]
d. Click on the number next to the group name to get a list of sequences from that group (or, if there's only one, it will take you directly to that sequence page). Be sure to collect protein sequences of a very similar length to that of the human protein. True orthologs are typically within 5-20 AAs of the same length. Also avoid sequences with a description like 'Low quality' or 'partial' or 'Protein-name-like.'
On the other hand, 'hypothetical' is OK - just still look for a similar length of protein.
d. Acquire all the protein sequences in FASTA format.
B2. Paste all sequences in FASTA format into a single file
In one method of performing a multiple sequence alignment, we must first create a plain text file with all the sequences we wish to align in the FASTA format. Open a word processor window on your computer (and stagger the two windows on the screen so you can easily go back and forth between them). Then paste each sequence into a single text document, with a blank line in between each. [Again, see an example FASTA format file.]
Put the human sequence at the top, since this is the one you'll be comparing to the others.
B3. Create a new file and rename the sequences
For convenience (given the program we are using), in your sequence collection file, rename each sequence with the genus & species name initials, underscore, and short name. (You may observe in GenBank files for your protein that there is a commonly used short name, for example.) Make sure there are no spaces in the name. Keep track of the identity of your all your proteins - don't discard the accession number information and full description, for example - you may need it later.
Example: rename the full FASTA description
>gi|132814447|ref|NM_001082971.1| Homo sapiens dopa decarboxylase (aromatic L-amino acid decarboxylase) (DDC), transcript variant 1, mRNA
as:
>Hsa_DDC
- In this example, a single capital letter is used for the genus, followed by the first two letters of the species name: Homo sapiens becomes 'Hsa'.
B4. Run the alignment with Clustal Omega.
The program we will use for alignment is Clustal Omega and is found at the European Bioinformatics Institute (EBI) website. There are other multiple sequence aligmnent programs available, including on the NCBI website.
Either paste the collected sequences into the large window, or upload the saved file (note that for uploading the file MUST be in a plain text/text only format).
Although we will mostly use the program default settings, change these two output parameters:
OUTPUT FORMAT - you can try a couple different formats to see the possibilities - but this one will be most useful:
'Clustal w/numbers' - shows the amount of conservation below the alignment, with AA coordinates at the ends.
[Note, once the alignment is displayed there is a lovely option to make the AAs color coded - 'Show Colors.']
ORDER (Click the 'More Options...' button): change this selection (lower right) from the default 'aligned' to 'input' - this will retain the order of sequences in your input file - i.e., keep the human sequence at the top. [Note that the order in which the sequences are input (that is, the order you have them in your FASTA file) can affect the alignment.]
Part of an example alignment with this format (using the GCH1 proteins file) is shown below:
If you get a poor alignment, consider removing (and replacing if below 8 sequences) a protein sequence that may be disrupting the alignment. Occasionally a protein you've chosen is quite different from all the others.
Files you can use for demonstration purposes: Example FASTA files
What should be in your completed protein sequence variant analysis with MSA (exercise 4B):
i. MSA with at least 8 protein sequences from 8 different phyla (if possible), with the human protein-of-interest at the top of alignment. Highlight in the human protein at the top of the MSA the AAs for the variants you analyzed. Each protein should have a 3 letter organism identifier and short name. Include a list identifying the organisms below the MSA, with species name and common descriptor.
ii. Evaluate the likely/possible effect of each of your six AA variants based on AA location in the protein - how conserved the AA is at this location, etc. As noted above, mark the location of the variant AAs in the human protein in the MSA.
1. Find a crystal structure for your protein of interest. Go to NCBI - Structure. (You may also find a direct link to your protein's crystal structure from its main 'gene page.')
2. Search for a structure for your protein of interest in the NCBI Structure database with the name of your protein. There may be several different version of structures. Look for an indication of a larger, more complete sequence. Get the PDB ID for the structure. That's what you will use to access the structure below. These identifiers are made of 4 capital letters and numbers. Some examples: 2GK1, 1HD.
[Note: some of these proteins may not have crystal structures for the human protein - there may be a closely related protein (e.g., another mammalian protein like rat instead of human) that you can use for this purpose. Check with the instructor if you need help.]
AA's at the surface or in extended structures are somewhat less likely to disrupt function when altered. AA's deep within the protein or tightly packed are more likely to affect function if changed. If the protein is multimeric, AA's at the interface between subunits may be critical for (e.g.) dimerization or tetramerization. There may also be crystal structures of your protein interacting with a binding partner.
3. Click on the entry, which will take you to a MMDB Structure Summary
4. View structures using iCn3D - "I see in 3-D" -NCBI WebGL-based Structure Viewer page
When you're ready to use it, click on "OPEN iCn3D" (see image below).
Then enter a PDB file number by clicking FILE, and 'Retrieve by ID' - PBD ID, and wait for the image to load. Then you can view it in a variety of ways.
a. Under the Analysis menu, select 'View Sequences & Annotations,' and then select the 'Details' tab to see the AA sequence alongside the structure.
b. Select an amino acid or range of amino acids in the primary sequence to highlight the location in the structure. Use this to find the location of mutations in your protein of interest from the multiple sequence alignments. Does the location of the mutation provide any insight into why it might disrupt the function of the protein?
c. Take one screenshot of the 3D protein structure you viewed in iCn3D for each amino acid variant location, with the AA selected to turn in. Include the part of the right side window with the AA sequence showing which AA was selected. An example is shown below. In this case, the AA (R) was selected in two copies in the sequence window to show up on 2 of the 3D protein chains of the tetrameric crystal structure. The program highlights the location in yellow.