cl662 pw 02 gene finding

Upload: kanupriya-tiwari

Post on 01-Jun-2018


  • 8/9/2019 CL662 PW 02 Gene Finding

    1/39

    Gene Finding

    What is the problem in gene finding?

    biologicalspellingismuchmoresloppythanenglishspellingproteinswiththesamefunctionfromtwodifferentorganismsarealmostalwaysspeltdifferentlysimilarlyindnamanyinterestingsignalsvarygreatlywithineventhesamegenome

    Biological spelling is much more sloppy than English spelling. Proteins with the same function from two different organisms are almost always spelt differently. Similarly, in DNA, many interesting signals vary greatly even within the same genome.

    Sequences are observer dependent

    Observer A: DNA sequencer

    ATGATTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTACT

    Observer B: RNA polymerase

    ATGA | TTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTA | CT
    Transcription start is at the first bar, transcription stop at the second.

    Observer C: Ribosome

    ATGATTCT | AGGAG | AATCGTCTAATCGA | ATGGCA-------TAA | AGTCTACT
    AGGAG: ribosomal binding site. ATG: start codon. TAA: stop codon.

    We observe this:

    ATGATTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTACT

    We need to infer this:

    ATGA | TTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTA | CT
    Transcription start is at the first bar, transcription stop at the second.

    or this:

    ATGATTCT | AGGAG | AATCGTCTAATCGA | ATGGCA-------TAA | AGTCTACT
    AGGAG: ribosomal binding site. ATG: start codon. TAA: stop codon.

    To our benefit

    There is a great deal of order in biological sequences.

    Conserved stretches of sequence are recognized by various biomolecules that are part of the information decoding and processing machinery.

    Our goal: Find the subtle similarities & patterns.

    What are genes?

    AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

    TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA

    TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC

    ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG

    CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA

    GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC

    AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATT

    Genes are individual stretches of DNA that encode the sequence of amino acids comprising a particular protein.

    The 64 possible nucleotide triplets (codons) represent the 20 amino acids using a degenerate code.

    Start codon: ATG

    Stop codons: TGA, TAA, TAG

    In eukaryotes, coding regions are separated by non-coding regions (introns).
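
    As a minimal sketch of how the start and stop codons listed above delimit candidate coding regions, the following Python function scans the three reading frames of one strand for stretches that begin with ATG and end at the first in-frame stop codon. The function name `find_orfs` and the `min_codons` threshold are illustrative assumptions and do not come from the lecture; real gene finders add the further signals discussed on later slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) index pairs of ATG...stop stretches.

    Only the given strand is scanned. `end` is exclusive and includes the
    stop codon. Within a frame, the first ATG after the previous stop is
    taken as the start; later in-frame ATGs are ignored.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length in codons, counting both start and stop codon.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATAGGG"))
```

    An open reading frame found this way is only a candidate: the next slide explains why an ATG alone is not sufficient evidence of a gene.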

    Finding Genes (or regions of interest in DNA sequence)

    Important Signals

    Prokaryotes
      The start codon
      The stop codon

    Eukaryotes
      The start codon
      All donor sites (the beginning of each intron)
      All acceptor sites (the end of each intron)
      The stop codon

    Problem: Not every ATG is a valid start codon.
      Verify the start codon (analyze the region around the start codon).
      Identify additional subtle signals.
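
    One way to "analyze the region around the start codon" in a prokaryote is to ask whether a ribosomal binding site lies a short distance upstream of the ATG. The sketch below is a hedged illustration: the motif AGGAG is taken from the example sequence on the observer slides, while the function name `atg_with_rbs` and the 20-base window are assumptions chosen for the example, not values given in the lecture.

```python
def atg_with_rbs(seq, motif="AGGAG", max_upstream=20):
    """Return positions of ATG triplets preceded by `motif`.

    The motif must lie entirely within the `max_upstream` bases
    immediately before the ATG. Exact matching is used here; real
    binding sites vary, so this is a deliberately crude filter.
    """
    seq = seq.upper()
    hits = []
    pos = seq.find("ATG")
    while pos != -1:
        window = seq[max(0, pos - max_upstream):pos]
        if motif in window:
            hits.append(pos)
        pos = seq.find("ATG", pos + 1)
    return hits

# Example sequence from the observer slides, with the gap dashes removed.
example = ("ATGATTCT" + "AGGAG" + "AATCGTCTAATCGA"
           + "ATGGCA" + "TAA" + "AGTCTACT")
print(atg_with_rbs(example))
```

    On this sequence the ATG at position 0 is rejected because nothing precedes it, while the ATG that follows the AGGAG site is kept.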

    Some useful signals for Gene Finding

    [Figure: flow from DNA to mRNA to protein, annotated with upstream activating sequences (UAS), the TATA box, mRNA expression start and end, the ribosomal binding site, and the points where protein synthesis starts and stops.]

    A logo of RBS and start codon in E. coli genes.

    Key concepts used in Gene Detection.

    Gene Finding Methods

    Content-based methods: Overall, bulk properties of the sequence, e.g. codon bias, hexamer frequency.

    Site-based or signal-sensing methods: e.g. donor and acceptor splice sites, binding sites for transcription factors, polyA tracts, RBS, start and stop codons.

    Comparative methods: Translated sequences are subjected to database searches against protein sequences.
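
    The two content-based measures named above, codon bias and hexamer frequency, are both simple counting statistics. The following sketch shows one way to compute them in Python; the function names `codon_usage` and `hexamer_counts` are illustrative, and how such counts are turned into a coding score is not specified on the slide.

```python
from collections import Counter

def codon_usage(coding_seq):
    """Relative frequency of each codon in an in-frame coding sequence.

    Trailing bases that do not fill a complete codon are ignored.
    """
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def hexamer_counts(seq):
    """Counts of all overlapping 6-mers in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))

print(codon_usage("ATGAAAAAATAA"))
print(hexamer_counts("AAAAAAA"))
```

    Comparing such frequencies between known genes and intergenic DNA is what lets a content-based method tell the two apart.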

    Using the subtle signals in automated gene finding

    [Figure: flow from DNA to mRNA to protein, annotated with upstream activating sequences (UAS), the TATA box, mRNA expression start and end, the RBS, and the start and stop of protein synthesis.]

    Homology to known genes: DNA sequences are likely to be protein-coding regions if they are homologous to known protein-coding regions in other genomes.

    Codon bias: From the 64-codon degenerate code, certain codons are preferentially used by each species.

    Amino acid bias.

    Start and stop codons. Length of a region between a start and stop codon.

    Prokaryotes: Ribosomal binding site in the vicinity of the start codon.

    The cloverleaf structure of a tRNA molecule showing those features that are used by tRNAscan for detection.

    The structure of a tRNA molecule: (A) Phe-tRNA molecule

    showing the arrangement of the base pairing and loops

    and (B) the 3-D structure.

    The start and stop signals for prokaryotic transcription. Start signal: short nucleotide sequences that bind transcription enzymes. Stop signal: a short loop structure that prevents the transcription apparatus from continuing.

    Frequency of occurrence of different amino acid codons in genes and intergenic DNA.

    Gene Search by Homology

    Similarity of a portion of DNA to a known sequence can be used as either positive or negative evidence that the portion is a coding region.

    Positive evidence: Similarity to known genes in other organisms (comparative evidence).

    Negative evidence: Similarity to repeat sequences (e.g. as flagged by RepeatMasker).

    Can provide clues about gene location and function.

    Can locate only about half of all human genes currently.

    Important methods used in Gene Detection.

    The start and stop signals for eukaryotic transcription

    A schematic of the splicing of an intron

    A segment of E. coli genome that has been fully

    annotated, illustrated using the Artemis program.

    A detailed view of a tRNA coding region and

    the secondary structure of the tRNA molecule.

    Eukaryotic DNA to protein.

    Schematic representation of the ALDH10

    gene with exons colored blue.

    A flowchart of steps involved in the

    identification and annotation of gene sequences.

    Problems in gene finding

    Length of a gene is variable.

    All signals are probabilistic and have inherent (sometimes unknown) variability.

    It is difficult to quantify the acceptable level of variability for each signal.

    Example: An ATG does not always mean a valid start codon.

    There are exceptions to (almost) every rule.

    Example: Protein-coding regions without an RBS in the vicinity of their start codon may still be expressed.

    Partial sequence classification (Tagging)

    The tagging problem:

    Given: a set of tags L, and training examples of sequences showing the breakup of each sequence into the set of tags.

    Learn to break up a sequence into tags (classification of parts of sequences).

    Examples:

    Text segmentation: Break the sequence of words forming an address string into subparts such as road and city name.

    Continuous speech recognition: Identify words in continuous speech.

    Gene finding: Identify boundaries of the protein-coding regions in a DNA sequence, identify exons / introns, etc.

    Discrete Markov Processes

    A system is described at any time as being in one of a set of N distinct states, $S_1, S_2, \ldots, S_N$.

    The system undergoes a change of state (possibly back to the same state).

    Full description: specification of the current state as well as all predecessor states.

    First-order Markov chain: the description is truncated to just the current state and the predecessor state.

    $P[q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots] = P[q_t = S_j \mid q_{t-1} = S_i]$

    Transition probabilities:

    $a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i], \quad 1 \le i, j \le N$

    with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$.

    [Figure: state diagram with four states S1, S2, S3 and S4 and labeled transitions including a21, a34 and a41.]

    Example of an Observable Sequence: Weather Prediction

    States: S1 = rain/snow, S2 = cloudy, S3 = sunny

    The weather on day t is characterized by one of the three states.

    Transition probabilities:

    $A = \{a_{ij}\} = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$

    Initial state probabilities:

    $\pi_i = P[q_1 = S_i], \quad 1 \le i \le N$

    Given that the weather on day 1 is sunny, what is the probability that the weather for the next 5 days will be sun-sun-rain-rain-sun?

    Find P(O | Model) = product of all the concerned transition probabilities.

    Each state is directly observable in a Markov chain.

    [Figure: state diagram of the three weather states with labeled transitions including a23 and a32.]
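
    The question on this slide can be worked through numerically. The sketch below assumes the transition matrix reads, row by row for S1 (rain/snow), S2 (cloudy) and S3 (sunny), as 0.4 0.3 0.3, 0.2 0.6 0.2 and 0.1 0.1 0.8; that reading of the slide's matrix, the dictionary `STATE` and the function name `sequence_probability` are assumptions made for the example. Since day 1 is given as sunny, the initial probability is taken to be 1.

```python
# Transition matrix as read from the slide (an assumption, see above).
# Rows and columns are ordered S1 = rain/snow, S2 = cloudy, S3 = sunny.
A = [
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]
STATE = {"rain": 0, "cloudy": 1, "sun": 2}

def sequence_probability(states, A, initial_prob=1.0):
    """Multiply the transition probabilities along a fully observed path."""
    p = initial_prob
    for prev, cur in zip(states, states[1:]):
        p *= A[STATE[prev]][STATE[cur]]
    return p

# Day 1 is sunny (given), followed by sun-sun-rain-rain-sun.
observed = ["sun", "sun", "sun", "rain", "rain", "sun"]
p = sequence_probability(observed, A)
print(p)  # a33 * a33 * a31 * a11 * a13 = 0.8 * 0.8 * 0.1 * 0.4 * 0.3
```

    With these values the product comes to about 0.00768, up to floating-point rounding.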

    Example of Hidden Markov Chain: State is not directly observable

    Players A and B

    A has a set of coins with different biases.

    A repeatedly:
      picks an arbitrary coin
      tosses it an arbitrary number of times

    B observes H/T (symbols) and guesses the transition points and biases.

    The actual event is hidden from B.

    HMMs are doubly stochastic models: occurrence of a state and the observed symbol in that state.

    Elements of an HMM

    Observed sequence (represented as symbols): $O = O_1 O_2 \ldots O_T$ (T = duration of the sequence)

    Sequence of states (typically hidden): $Q = q_1 q_2 \ldots q_T$

    N, the number of states in the model. Although the states are hidden, there is often some physical significance attached to the states. $S = \{S_1, S_2, \ldots, S_N\}$

    M, the number of distinct observation symbols per state. $V = \{v_1, v_2, \ldots, v_M\}$

    The state transition probability distribution $A = \{a_{ij}\}$

    The observation symbol probability distribution in state j, $B = \{b_j(k)\}$, or the emission frequency matrix.

    The initial state distribution $\pi = \{\pi_i\}$

    Thus, the model parameters: $\lambda = (A, B, \pi)$
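
    To make these elements concrete, the sketch below writes a small model as plain Python lists and draws a hidden state path together with the symbols it emits, in the spirit of the coin-tossing example. All numbers are hypothetical (two coins, one fair and one favouring heads), and the function name `sample_hmm` is an illustrative choice, not something defined in the lecture.

```python
import random

def sample_hmm(A, B, pi, symbols, T, rng):
    """Draw a hidden state path and an observation sequence of length T.

    A[i][j] is the transition probability from state i to state j,
    B[i][k] the probability of emitting symbols[k] in state i, and
    pi[i] the probability of starting in state i.
    """
    states, observations = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):
        states.append(state)
        observations.append(rng.choices(symbols, weights=B[state])[0])
        state = rng.choices(range(len(A)), weights=A[state])[0]
    return states, observations

# Hypothetical two-coin model: state 0 is a fair coin,
# state 1 is a coin that lands heads 90% of the time.
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.5, 0.5], [0.9, 0.1]]
pi = [0.5, 0.5]

states, tosses = sample_hmm(A, B, pi, ["H", "T"], T=20, rng=random.Random(0))
print("".join(tosses))
```

    An observer who sees only `tosses` is in the position of player B: the list `states` is exactly the information that stays hidden.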

    Three basic problems for HMM

    Problem 1: Given the observation sequence $O = O_1 O_2 \ldots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O \mid \lambda)$, the probability of the observation sequence given the model? A correct solution is feasible.

    Example: profile HMM, i.e. classification of a protein sequence based on competing HMMs for different protein families.

    Problem 2: Given the observation sequence $O = O_1 O_2 \ldots O_T$ and a model $\lambda$, how do we choose a corresponding state sequence $Q = q_1 q_2 \ldots q_T$? No single correct solution exists; some optimality criterion must be applied.

    Example: Finding the protein-coding regions and exon / intron boundaries in an anonymous sequence of DNA.

    Problem 3: How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$, where O is a training sequence? This is the toughest problem.
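
    The first two problems above have standard dynamic-programming solutions, usually called the forward algorithm and the Viterbi algorithm; the slides do not name them, so the sketch below is offered as an illustration under that assumption. Observations are given as symbol indices into the columns of B, and the function names are illustrative. The third problem (training, commonly handled with the Baum-Welch procedure) is not shown.

```python
def forward(obs, A, B, pi):
    """P(O | lambda) summed over all state paths (forward algorithm)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)

def viterbi(obs, A, B, pi):
    """Most probable state path and its joint probability with obs."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    backptrs = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(N):
            # Best predecessor of state j at this time step.
            best = max(range(N), key=lambda i: delta[i] * A[i][j])
            step.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][o])
        backptrs.append(step)
        delta = new_delta
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backptrs):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical two-state model; symbols 0 and 1 could stand for H and T.
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.5, 0.5], [0.9, 0.1]]
pi = [0.5, 0.5]
obs = [0, 0, 1, 0]
print(forward(obs, A, B, pi))
print(viterbi(obs, A, B, pi))
```

    Using the single most probable path is one possible optimality criterion for Problem 2; others, such as choosing the individually most likely state at each position, can give a different answer. For long sequences both functions would normally work with log probabilities or scaling to avoid numerical underflow.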

    Two sequence mining problems in biology

    1. Finding genes in DNA sequences (2nd problem in HMM)

    2. Classifying proteins according to family (1st problem in HMM)

    The 3rd problem in HMM needs to be tackled in both 1 and 2 above.

    HMM for genes with introns (spliced genes)

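    [Figure: HMM architecture for spliced genes. A start model leads into a coding model of three codon-position states (c c c) and ends in a stop model. Three intron models branch off the coding states, apparently one for each position at which an intron can interrupt a codon; each consists of a donor model (GTxxxxx), an interior intron state and an acceptor model (xxxxxxxAG).]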

    Hidden Markov model of a prokaryotic nucleotide

    sequence used in the GeneMark.hmm algorithm.

    Mathematical Problem Statement for Gene Finding

    For an anonymous DNA sequence

    $S = \{b_1, b_2, \ldots, b_L\}$, where $b_i \in \{A, T, G, C\}$

    determine the functional role of each nucleotide

    $A = \{a_1, a_2, \ldots, a_L\}$, where

    $a_i = 0$ if non-coding
    $a_i = 1$ if coding on the direct strand
    $a_i = 2$ if coding on the complementary strand

    Variable Duration HMM for gene finding

    The trajectory A is represented as a sequence of M hidden states $a_i$ having durations $d_i$:

    $A = \{(a_1 d_1)(a_2 d_2) \ldots (a_M d_M)\}$, where $\sum_{i=1}^{M} d_i = L$

    Objective in gene finding: to find the trajectory

    $A^* = \{(a_1^* d_1^*)(a_2^* d_2^*) \ldots (a_M^* d_M^*)\}$

    which has the largest probability of occurring simultaneously with sequence S compared to all other possible trajectories.
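
    The (state, duration) form of a trajectory is a run-length encoding of the per-nucleotide labels 0, 1 and 2 from the problem statement. The sketch below shows the conversion on a made-up label vector; the function name `to_trajectory` and the example labels are hypothetical.

```python
from itertools import groupby

def to_trajectory(labels):
    """Collapse per-nucleotide labels into (state, duration) pairs."""
    return [(state, len(list(run))) for state, run in groupby(labels)]

# 0 = non-coding, 1 = coding on the direct strand,
# 2 = coding on the complementary strand.
labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 2, 2, 2]
trajectory = to_trajectory(labels)
print(trajectory)

# The durations must add up to the sequence length L.
assert sum(duration for _, duration in trajectory) == len(labels)
```

    Working with durations rather than single positions is what allows a variable duration HMM to model the length distribution of each region explicitly.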

    Gene Finding & Training Sets

    The majority of current gene finding algorithms and programs utilize a species-specific training set to develop their statistical models.

    The training set consists of experimentally determined genes.

    For organisms such as E. coli, we have a large training set (about 325 known genes from experimental / biochemical verification).

    Question: How can we develop models for new organisms for which very few genes are experimentally characterized?