gene identification[1]

Upload: nalinee-dua

Post on 07-Apr-2018


  • 8/4/2019 Gene Identification[1]

    1/35

    Gene Identification - I

    Shivani Chandra

    Birla Institute of Scientific Research


    Gene Identification

    Goals :

    Find genes

    Map their position

    Identify function(s)


    Gene Identification

    Approaches :

    Classical

    Computational


    Classical Approaches

    The three big Ms :

    Mendel (1822-1884)

    Morgan (1866-1945)

    McClintock (1902-1992)


    Mendel's Genetics

    Dominant / Recessive

    Genotype / Phenotype

    Monohybrid / Dihybrid Crosses

    Laws :

    Segregation

    Independent assortment
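The two laws can be checked by brute-force enumeration of a cross. The sketch below is only an illustration: the function names and the two-letter-per-gene genotype notation (uppercase for the dominant allele) are assumptions made for this example, not part of the original slides.

```python
from collections import Counter
from itertools import product


def gametes(genotype):
    """All gametes from a genotype such as 'AaBb' (two letters per gene).

    Segregation: each gamete gets one allele of every gene.
    Independent assortment: every combination across genes is equally likely.
    """
    pairs = [genotype[i:i + 2] for i in range(0, len(genotype), 2)]
    return ["".join(alleles) for alleles in product(*pairs)]


def phenotype(gamete1, gamete2):
    """Per gene, show the dominant (uppercase) letter if either allele is dominant."""
    return "".join(
        a.upper() if (a.isupper() or b.isupper()) else a.lower()
        for a, b in zip(gamete1, gamete2)
    )


def cross(parent1, parent2):
    """Count offspring phenotypes over all equally likely gamete pairings."""
    return Counter(
        phenotype(g1, g2) for g1 in gametes(parent1) for g2 in gametes(parent2)
    )


print(dict(cross("Aa", "Aa")))      # monohybrid cross
print(dict(cross("AaBb", "AaBb")))  # dihybrid cross
```

The monohybrid cross gives the 3:1 phenotype ratio and the dihybrid cross gives 9:3:3:1, assuming complete dominance and unlinked genes.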


    Morgan's Genetics

    Won the Nobel Prize in 1933 for showing the role of chromosomes in heredity, work that began with the white-eyed fruit-fly mutation.

    Linkage and crossing-over (recombination).

    Genetic and chromosome mapping.


    McClintock's Genetics

    Won the Nobel Prize in 1983 for the discovery of transposons.

    Ability of genes to change position on a chromosome: genetic transposition.

    Transposons :

    Cause mutations.

    Increase/decrease amount of DNA.


    Classical Approaches

    Study F1 and F2 to Fn generations.

    Test cross, back cross.

    Complementation tests.

    Chromosome mapping.

    Transgenics. Gene Knock-outs.


    The Genomics Era


    What is Computational Gene Identification?

    Given an uncharacterized DNA sequence, find out:

    Where does the gene start and end?

    Which DNA strand is used to encode the gene?

    Which reading frame is used in that strand?

    Which region codes for a protein?
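The four questions above (start and end, strand, reading frame, coding region) can be made concrete with a naive six-frame open reading frame scan. This is a teaching sketch only, not the algorithm of any tool named in these slides. It assumes ATG as the only start codon and the three standard stop codons, and it reports complement-strand coordinates on the reverse-complemented sequence. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """List ATG..stop ORFs as (strand, frame, start, end) tuples.

    start and end are 1-based, inclusive positions on the strand that was
    scanned (for "complement" that is the reverse-complemented sequence).
    frame is 1, 2 or 3. min_codons counts the start and stop codons too.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("direct", seq), ("complement", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first in-frame start codon
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start + 1, i + 3))
                    start = None                   # look for the next ORF
    return orfs


for orf in find_orfs("CCATGAAATTTTAGGG"):
    print(orf)
```

On this toy sequence the only complete ORF lies on the direct strand in frame 3, from base 3 to base 14. Real gene finders add statistical models on top of this kind of scan, because most ORFs found this way are not genes.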


    Computational Approaches

    Computational methods to identify genes have been an active field of research for the past 15-20 years.

    Fast and accurate.


    Computational Gene Identification

    Classes : Intrinsic and Extrinsic

    Intrinsic, or ab initio, gene finders make no explicit use of information about DNAs or proteins outside the sequence being studied.

    Extrinsic gene finders utilize sequence similarity search methods to identify the locations of protein-coding regions.


    Software Tools

    GeneMark

    Glimmer

    Grail

    GenScan

    Combined


    GeneMark

    First tool for finding prokaryotic genes (1993).

    Assesses the protein-coding potential of a DNA sequence by using Markov models of coding and non-coding regions.

    Relies on organism-specific recognition parameters to separate coding and non-coding regions.
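The idea of scoring coding potential with two Markov models can be sketched as a log-odds comparison. The example below is a deliberate simplification: it uses plain first-order chains with add-one smoothing, whereas the models in the real tool are more elaborate, and the training sequences are invented toy data. All names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | previous)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in seqs:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {nxt: c / total for nxt, c in row.items()}
    return model


def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over all transitions in seq.

    Positive values mean the coding model explains the window better.
    """
    return sum(
        math.log(coding[p][n] / noncoding[p][n]) for p, n in zip(seq, seq[1:])
    )


# Toy "organism-specific" training sets, invented purely for illustration.
coding_model = train_markov(["ATGGCCGCGGCCGGC", "ATGGGCGCCGCGGCC"])
noncoding_model = train_markov(["ATATTATAATTTATA", "TTATATAAATATTTA"])

print(log_odds("GCCGCGGCC", coding_model, noncoding_model))  # GC-rich window
print(log_odds("TATATTATA", coding_model, noncoding_model))  # AT-rich window
```

With these toy models the GC-rich window scores above zero and the AT-rich window below zero. The dependence on training data is why the parameters have to be organism specific.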


    GeneMark

    Exists in separate variants for gene prediction in prokaryotic, eukaryotic, and viral DNA sequences.

    Requires a sufficiently large training set of

    known genes.


    GeneMark

    Input File :

    DNA sequence in FASTA format.

    The file can contain multiple FASTA records.

    Each FASTA record should be less than 5 Mbp.

    The total sequence size should be in the range of 10 Mbp to 50 Mbp.
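The input limits listed above can be checked before uploading a file. The sketch below takes the three size limits directly from this slide. The function names and the toy record are assumptions made for illustration, and the parser only handles plain FASTA text.

```python
MAX_RECORD_BP = 5_000_000    # each record should be less than 5 Mbp
MIN_TOTAL_BP = 10_000_000    # lower end of the stated total range
MAX_TOTAL_BP = 50_000_000    # upper end of the stated total range


def fasta_lengths(text):
    """Return {record_name: sequence_length} for FASTA-formatted text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header, or a placeholder if it is empty.
            name = header[0] if header else f"record{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def check_limits(lengths):
    """Return a list of human-readable problems (empty if all limits hold)."""
    problems = []
    for name, n in lengths.items():
        if n >= MAX_RECORD_BP:
            problems.append(f"{name}: {n} bp is not below {MAX_RECORD_BP} bp")
    total = sum(lengths.values())
    if not MIN_TOTAL_BP <= total <= MAX_TOTAL_BP:
        problems.append(
            f"total {total} bp is outside {MIN_TOTAL_BP}..{MAX_TOTAL_BP} bp"
        )
    return problems


demo = ">phage1 toy record\nACGTACGT\nACGT\n>phage2\nGGCC\n"
lengths = fasta_lengths(demo)
print(lengths)
print(check_limits(lengths))
```

The toy input is far below the stated total range, so `check_limits` reports one problem for it.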


    GeneMark

    Open the GeneMark server: http://opal.biology.gatech.edu/GeneMark/genemark24.cgi

    Hit Browse to select your input file.

    Select the closest species of organism or host as the model (M. tuberculosis for the Mycobacteriophages).

    Under graphics export options, select everything except "generate postscript" and "mark putative exons".


    GeneMark

    In the second column, choose only "list open reading frames" and "list regions of interest".

    Run GeneMark (Start).

    You should see a text output of your GeneMark results.


    GeneMark Output

    GeneMark can be instructed to generate reports on open reading frames (ORFs), regions of interest, and estimated exon boundaries.


    GeneMark Output

    GENEMARK PREDICTIONS

    Sequence file: cya

    Sequence length: 2100 GC Content: 51.65%

    Threshold value: 0.500


    GeneMark Output

    Open Reading Frames List

    Left  Right  DNA         Coding  Avg   Start  RBS   RBS   RBS
    end   end    Strand      Frame   Prob  Prob   Prob  Site  Seq
    ----- -----  ----------  ------  ----  -----  ----  ----  ------
    3     308    direct      fr 3    0.82  ....   0     0     ....
    195   308    direct      fr 3    0.6   0.04   0.74  177   CCGCAG
    348   668    complement  fr 2    0.9   0.96   0.98  680   CAGGAT
    1368  2102   direct      fr 3    0.9   0.98   0.96  1359  TTGGAG
    1371  2102   direct      fr 3    0.91  0.96   0.96  1359  TTGGAG
    1386  2102   direct      fr 3    0.93  0.63   0.91  1367  AATGAT
    1410  2102   direct      fr 3    0.96  0.9    0.76  1401  AACGAT
    1509  2102   direct      fr 3    0.98  0.27   0.51  1490  AGGGTT
    1578  2102   direct      fr 3    0.97  0.11   0.73  1567  ATGGCA
    1620  2102   direct      fr 3    0.97  0.11   0.16  1601  GCGCTG
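Several rows in the ORF list share the same right end (2102) and differ only in the candidate start, so a reader has to pick one start per ORF. One possible way to do that is sketched below: parse the rows and keep, for each stop position, the candidate with the highest Start Prob. The 0.5 cutoff, the function names, and the rule that the stop of a complement-strand ORF sits at its left end are assumptions of this sketch, not documented GeneMark behaviour.

```python
ORF_ROWS = """\
3 308 direct fr 3 0.82 .... 0 0 ....
195 308 direct fr 3 0.6 0.04 0.74 177 CCGCAG
348 668 complement fr 2 0.9 0.96 0.98 680 CAGGAT
1368 2102 direct fr 3 0.9 0.98 0.96 1359 TTGGAG
1371 2102 direct fr 3 0.91 0.96 0.96 1359 TTGGAG
1386 2102 direct fr 3 0.93 0.63 0.91 1367 AATGAT
1410 2102 direct fr 3 0.96 0.9 0.76 1401 AACGAT
1509 2102 direct fr 3 0.98 0.27 0.51 1490 AGGGTT
1578 2102 direct fr 3 0.97 0.11 0.73 1567 ATGGCA
1620 2102 direct fr 3 0.97 0.11 0.16 1601 GCGCTG
"""


def parse_orf_rows(text):
    """Turn whitespace-separated ORF rows into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        f = line.split()
        rows.append({
            "left": int(f[0]),
            "right": int(f[1]),
            "strand": f[2],
            "frame": int(f[4]),            # f[3] is the literal "fr"
            "avg_prob": float(f[5]),
            # "...." marks an ORF without a scored start codon.
            "start_prob": None if f[6].startswith(".") else float(f[6]),
        })
    return rows


def best_start_per_stop(rows, min_start_prob=0.5):
    """Keep the highest-scoring start for every (strand, stop) pair."""
    best = {}
    for row in rows:
        if row["start_prob"] is None or row["start_prob"] < min_start_prob:
            continue
        stop = row["right"] if row["strand"] == "direct" else row["left"]
        key = (row["strand"], stop)
        if key not in best or row["start_prob"] > best[key]["start_prob"]:
            best[key] = row
    return sorted(best.values(), key=lambda r: r["left"])


for row in best_start_per_stop(parse_orf_rows(ORF_ROWS)):
    print(row["left"], row["right"], row["strand"], row["start_prob"])
```

With these settings two ORFs remain: 348..668 on the complement strand and 1368..2102 on the direct strand.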


    GeneMark Output

    Regions of Interest

    LEnd  REnd  Strand      Frame
    3     308   direct      fr 3
    348   686   complement  fr 2
    1092  1334  direct      fr 3
    1365  2102  direct      fr 3


    GeneMark Output

    Possible Frameshifts

    Frame  Frame  At base         Strand
    2      1      31152 +/-11bp   complement
    2      1      63372 +/-11bp   direct
    3      2      75528 +/-11bp   complement


    GeneMark Output

    Approx. Exon Location

    Left  Right
    End   End    Strand      Frame  Prob
    50    300    direct      fr 3   0.8566
    63    247                       0.9998
    365   666    complement  fr 2   0.9415
    378   657                       0.978
    1201  1277   direct      fr 3   0.8722
    1225  1254                      0.9986
    1377  2042   direct      fr 3   0.9085
    1434  2042                      0.978


    GenScan

    Analyzes DNA sequences using a complex probabilistic model of gene structure, built from:

    Transcriptional, translational, and splicing signals.

    Statistical properties of coding and non-coding regions.

    GC content.
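GC content is the simplest of these statistics to compute. A minimal sketch follows, with hypothetical function names. It counts only A, C, G and T, so ambiguous bases such as N are ignored, and it also reports GC content in sliding windows along a sequence.

```python
def gc_content(seq):
    """Percentage of G and C among the A/C/G/T bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_windows(seq, window, step):
    """GC content per window as (1-based window start, percent) pairs."""
    return [
        (i + 1, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("ATGC"))
print(gc_windows("GGGGAAAA", window=4, step=4))
```

Windowed values like these are what let a gene finder switch parameter sets between GC-rich and GC-poor regions of a genome.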


    GenScan

    The model treats the most general case, in which the sequence may contain no genes, one gene, or multiple genes on either or both DNA strands. Partial genes as well as complete genes are considered.


    GenScan

    Important Restrictions:

    Only protein coding regions are considered

    (no tRNA or rRNA genes).

    Transcription units are assumed to be non-overlapping.


    GenScan

    Input file :

    The sequence file may be in either FASTA or minimal GenBank format.

    For minimal GenBank format, the LOCUS and ORIGIN lines must be present.
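The requirement that LOCUS and ORIGIN be present is easy to test before submitting a file. The following sketch uses hypothetical helper names and a made-up toy record. It checks for the two keywords at the start of a line and extracts the bases that follow ORIGIN. It does not validate the rest of the GenBank format.

```python
REQUIRED_KEYWORDS = ("LOCUS", "ORIGIN")


def missing_genbank_lines(text):
    """Return the required keywords that do not start any line of text."""
    found = {
        line.split()[0]
        for line in text.splitlines()
        if line.strip() and not line[0].isspace()   # keywords start in column 1
    }
    return [kw for kw in REQUIRED_KEYWORDS if kw not in found]


def sequence_from_origin(text):
    """Concatenate the bases listed between the ORIGIN line and '//'."""
    bases = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the position numbers and the spaces between 10-base groups.
            bases.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(bases)


toy_record = "LOCUS       TOY 12 bp\nORIGIN\n        1 ggatcccagc ct\n//\n"
print(missing_genbank_lines(toy_record))
print(sequence_from_origin(toy_record))
print(missing_genbank_lines(">seq1\nACGT\n"))
```

The last call shows what happens when a FASTA file is checked by mistake: both keywords are reported as missing.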


    Minimal GenBank Format

    LOCUS       HUMRASH      6453 bp    ds-DNA    PRI    15-MAR-1988
    DEFINITION  Human c-Ha-ras1 proto-oncogene, complete coding sequence.
    ACCESSION   J00277 J00206 J00276 K00954
    FEATURES             Location/Qualifiers
         prim_transcript <1664..3744
                         /note="c-Ha-ras1 mRNA"
         CDS             join(1664..1774,2042..2220,2374..2533,3231..3350)
                         /note="c-Ha-ras1 p21 protein; NCBI gi: 190891."
                         /codon_start=1
                         /translation="MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV
                         VIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKR
                         VKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLV
                         REIRQHKLRKLNPPDESGPGCMSCKCVLS"
         source          1..6453
                         /organism="Homo sapiens"
    BASE COUNT      946 a   2287 c   2113 g   1107 t
    ORIGIN      1 bp upstream of BamHI site.
            1 ggatcccagc ctttccccag cccgtagccc cgggacctcc gcggtgggcg gcgccgcgct
           61 gccggcgcag ggagggcctc tggtgcaccg gcaccgctga gtcgggttct ctcgccggcc
          121 tgttcccggg agagcccggg gccctgctcg gagatgccgc cccgggcccc cagacaccgg
          ...


    GenScan Output

    Predicted Genes/Exons

    Gn.Ex  Type  S  Begin  End   Len  Fr  Ph  I/Ac  Do/T  CodRg  P      Tscr
    1.01   Intr  +  739    851   113  0   2   49    66    74     0.287  0.98
    1.02   Intr  +  1748   1860  113  2   2   53    110   80     0.866  7.23
    1.03   Intr  +  1976   2055  80   0   2   97    94    10     0.999  2.27
    1.04   Intr  +  2132   2194  63   1   0   84    80    87     0.99   6.91
    1.05   Intr  +  2434   2631  198  0   0   88    -9    263    0.895  16.67
    1.06   Intr  +  2749   2910  162  0   0   107   109   97     0.965  14.39
    1.07   Intr  +  3279   3416  138  2   0   52    77    126    0.812  9.07
    1.08   Intr  +  3576   3676  101  2   2   87    119   113    0.996  13.71
    1.09   Intr  +  3780   3846  67   0   1   63    77    46     0.998  0.4
    1.10   Term  +  4179   4340  162  2   0   75    47    276    0.979  20.45
    1.11   PlyA  +  4397   4402  6                                      1.05
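A quick consistency check on a table like this is to confirm that each Len equals End - Begin + 1 and to add up the exon lengths. The sketch below does that for the rows shown, with the trailing zero of exon 1.10 restored. The function name and dictionary keys are hypothetical, and only the first six columns are parsed because the PlyA row has no frame or score columns.

```python
GENSCAN_ROWS = """\
1.01 Intr + 739 851 113 0 2 49 66 74 0.287 0.98
1.02 Intr + 1748 1860 113 2 2 53 110 80 0.866 7.23
1.03 Intr + 1976 2055 80 0 2 97 94 10 0.999 2.27
1.04 Intr + 2132 2194 63 1 0 84 80 87 0.99 6.91
1.05 Intr + 2434 2631 198 0 0 88 -9 263 0.895 16.67
1.06 Intr + 2749 2910 162 0 0 107 109 97 0.965 14.39
1.07 Intr + 3279 3416 138 2 0 52 77 126 0.812 9.07
1.08 Intr + 3576 3676 101 2 2 87 119 113 0.996 13.71
1.09 Intr + 3780 3846 67 0 1 63 77 46 0.998 0.4
1.10 Term + 4179 4340 162 2 0 75 47 276 0.979 20.45
1.11 PlyA + 4397 4402 6 1.05
"""

# GENSCAN exon types: initial, internal, terminal and single-exon gene.
EXON_TYPES = {"Init", "Intr", "Term", "Sngl"}


def parse_genscan(text):
    """Parse the leading columns shared by exon and signal rows."""
    features = []
    for line in text.strip().splitlines():
        f = line.split()
        features.append({
            "id": f[0],
            "type": f[1],
            "strand": f[2],
            "begin": int(f[3]),
            "end": int(f[4]),
            "length": int(f[5]),
        })
    return features


features = parse_genscan(GENSCAN_ROWS)
exons = [x for x in features if x["type"] in EXON_TYPES]

# Coordinates are inclusive, so Len must equal End - Begin + 1.
lengths_ok = all(x["end"] - x["begin"] + 1 == x["length"] for x in features)
coding_bp = sum(x["length"] for x in exons)

print(lengths_ok)
print(len(exons), "exons,", coding_bp, "bp,", coding_bp // 3, "codons")
```

The ten exons add up to 1197 bp, a multiple of three, which is 399 codons including the stop codon in the terminal exon.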


    GenScan Output

    Predicted peptide sequence(s)

    HS307871|GENSCAN_predicted_peptide_1|398_aa

    VQAIVWTWLDKTVGIIVGTCAKLRIPRLSDENKFLMSPPQGFPELKNDTFLRAAWGEETDYTPVWCMRQAGRYLPEFRETRAAQDFFSTCRSPEACCELTLQPLRRFLLDAAIIFSDILVVPQALGMEVTMVPGKGPSFPEPLREEQDLERLRDPEVVASELGYVFQAITLTRQRLAGRVPLIGFAGAPWTLM

    TYMVEGGGSSTMAQAKRWLYQRPQASHQLLRILTDALVPYLVGQVVAGAQALQLFESHAGHLGPQLFNKFALPYIRDVAKQVKARLREAGLAPVPMIIFAKDGHFALEELAQAGYEVVGLDWTVAPKKARECVGKTVTLQGNLDPCALYASEEEIGQLVKQMLDDFGPHRYIANLGHGLYPDMDPEHVGAFVDAVHKHSRLLRQN


    GRAIL

    Gene Recognition and Assembly Internet

    Link.

    Identifies exons, polyA sites, promoters, repeats, and frameshift errors in DNA sequences by comparing them to a database of known mouse and human sequence elements.


    GRAIL

    Incorporates BLAST searches and

    Glimmer.

    It supports the protocols and file formats commonly found on the World-Wide Web, such as HTTP, FTP, and HTML.


    GRAIL

    GrailExp is a software package developed

    specifically for gene finding using pattern

    recognition and expressed sequence tags.

    Grail is an algorithm for inferring gene

    structures from predicted exon candidates,

    based on Expressed Sequence Tags (ESTs)

    and biological intuition/rules.


    To be Continued