USD - Biology 382 - Interpreting FGENESH results

Interpreting FGENESH results

FGENESH - Find GENES HMM (Hidden Markov Model)

Header to list of genes:

G Str Feature Start End Score ORF Len

G: gene number

Str: Strand, + (direct or positive) or - (negative - on the strand complementary to the uploaded sequence)

Feature:
numbers indicate exon number for the predicted gene

TSS: transcription start site (TATA box; note - many (perhaps most) eukaryotic genes do not have TATA boxes)
CDSf: (protein) coding sequence, first
CDSi: coding sequence, internal
CDSl: coding sequence, last
PolA: polyadenylation signal sequence (AATAAA)
CDSo: coding sequence, solo - predicted gene with a single exon [There is a good chance for our analyses that these predictions will be incorrect - either the predicted sequence is not a real gene, or the predicted exon actually belongs to one of the adjacent genes instead of being a separate single exon gene.]

If a predicted gene has all three CDS types (CDSf, CDSi, CDSl), we are calling it 'complete' with respect to protein coding sequence. At either end of the submitted sequence a predicted gene may be incomplete, indicating that a portion of the gene is missing.
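The completeness rule above can be written as a short check. The Python sketch below is only an illustration: the function name `is_complete` and the two example feature lists are hypothetical and are not part of the FGENESH output.

```python
# Feature labels this page requires for a 'complete' predicted gene.
REQUIRED_CDS_TYPES = {"CDSf", "CDSi", "CDSl"}

def is_complete(features):
    """Return True if the gene's feature labels include CDSf, CDSi and CDSl."""
    return REQUIRED_CDS_TYPES.issubset(features)

# Hypothetical feature lists for two predicted genes (made-up examples).
gene_1 = ["TSS", "CDSf", "CDSi", "CDSi", "CDSl", "PolA"]
gene_2 = ["CDSi", "CDSl", "PolA"]  # begins at the edge of the sequence, so no CDSf

print(is_complete(gene_1))  # True
print(is_complete(gene_2))  # False
```

A gene like `gene_2` is the kind expected at either end of the submitted sequence, where part of the gene lies outside the region that was uploaded.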

A predicted gene on the opposite strand (indicated by '-' in the Str column) will have the features in the opposite order.

Start - End: base number at which the indicated feature begins and ends, using the first base submitted as '1'.

Score: calculated probability value for the feature, based on the program's algorithm

ORF: open reading frame location [for our analyses, these numbers will be identical to the CDS numbers]

Len: length in bases (equal to End minus Start plus one)
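As a small worked example of the length calculation, here is a Python sketch. The function name `feature_length` and the coordinates are hypothetical and were chosen only to show the arithmetic.

```python
def feature_length(start, end):
    """Length in bases of a feature spanning start..end, counting both end bases."""
    return end - start + 1

# Hypothetical exon coordinates (not taken from a real FGENESH run).
print(feature_length(1201, 1340))  # 140
```

The "plus one" is needed because both the Start base and the End base belong to the feature.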

After this, individual portions of the sequence are shown in order (partly depending on your selections in 'Advanced options'). In our case, the predicted complete coding cDNA sequence is shown, followed by the predicted protein sequence (the translation of the predicted cDNA sequence). [For later analyses, we may want to add the individual exon sequences.]
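To illustrate how a predicted protein sequence relates to the predicted cDNA sequence, the Python sketch below translates a coding sequence codon by codon. It assumes the standard genetic code. The function name `translate` and the short example sequence are hypothetical, and this is not how FGENESH itself is implemented.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(cdna):
    """Translate a coding sequence, stopping at the first stop codon."""
    cdna = cdna.upper()
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE.get(cdna[i:i + 3], "X")  # X marks codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Made-up coding sequence: ATG GCC AAG TGA
print(translate("ATGGCCAAGTGA"))  # MAK
```

Reading starts at the first base, so the input is assumed to begin with the first base of the start codon, as the predicted coding cDNA sequence does.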