genes recognition - École polytechnique fédérale de ...lsir recognition.pdf · 2. score of...

Genes Recognition

Julien Favre

Agenda

• PART 1 : Gene Structure
  – Gene Definition
  – Transcription Process
  – Gene Details

• PART 2 : Problem Definition
  – Gene Recognition
  – Why?
  – Complexity

• PART 3 : Problem Approach
  – Approaches
  – Solutions description
  – Method improvements
  – Conclusion

Gene Definition

What’s a Gene?


DNA

Transcription Process I


Transcription Process II


STEP 1

STEP 2

STEP 3

STEP 1 Transcription


STEP 2 Processing


Capping and Poly-A

Splicing

STEP 3 Translation


More details on Genes


[Figure: structure of a gene from 5’ to 3’, marking the TATA box, the beginning of the gene, the start codon, the splice sites, the coding and non-coding regions, and the end codon.]

It differs from gene to gene!


Situation

• Over 3,500 million nucleotides
• 35,000 to 50,000 genes

2 Important Questions:

1) Where are the genes?
2) What are the coding parts?

Why?

• Annotate and correct the DNA databases
• Link genes with the known proteins
• Understand gene functions
• Understand the mechanisms of gene expression

We can read the DNA alphabet, but we do not know where the meaningful words are or what they mean.

Complexity I

[Slide filled with raw sequence; the same block of bases is repeated over the whole slide:]

ATTCGATGCATTGGCATGATGACGTAGCGTATTAGACGACTAGCAGAGTGCACGACGATAGGCACGATAGACGCATTAGCAGCGCGACGCGCGAATGCGTGCGTCGTGTTGCGTCGCGCATTGGCATGATGACGTAGCGTATTAGACGACTAGCAGAGTGCACGACGATAGGCACGATAGACGCATTAGCAGCGCGACGCGCGA

3,500 million bases

Complexity II

[Figure: a DNA sequence with candidate acceptor sites and donor sites, and the exons they can delimit.]

Number of parses = Fibonacci(n+m+1)
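To get a feeling for how quickly this number grows, the formula can simply be evaluated. This is a minimal sketch: the function names are illustrative, n and m are passed in as the two counts that appear in the formula, and the indexing convention F(0) = 0, F(1) = 1 is an assumption, since the slide does not state one.

```python
def fibonacci(i: int) -> int:
    """Return the i-th Fibonacci number, assuming F(0) = 0 and F(1) = 1."""
    a, b = 0, 1
    for _ in range(i):
        a, b = b, a + b
    return a


def number_of_parses(n: int, m: int) -> int:
    """Evaluate Fibonacci(n + m + 1), the count given on the slide."""
    return fibonacci(n + m + 1)


# Print the count for a few equal values of n and m.
for sites in (5, 10, 20, 40):
    print(sites, sites, number_of_parses(sites, sites))
```

The count grows exponentially with n + m, so enumerating every parse quickly becomes impractical.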


Approaches

3 Types of Approaches:

1. Single Gene Recognition
   Functional signals detection:
   – Splice sites
   – Promoter, Poly-A, …
2. Multiple Genes Recognition
3. Similarities

Single Gene Recognition: Principle

Functional Signals Detection
Main goal is to detect the beginning and the end of the exons or genes

[Figure: gene from 5’ to 3’ with the TATA box, start codon, splice sites and end codon.]

Splicing Mechanism


• The donor and acceptor sites follow the GU-AG consensus (98% of cases)

• Extremely reliable technique to detect exons
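A minimal sketch of how the GU-AG rule can be used to list candidate introns. It works on the DNA strand, where GU is written GT. The function names, the minimum length parameter and the test sequence are illustrative assumptions, not part of the presentation.

```python
def find_dinucleotide(sequence: str, dinucleotide: str) -> list:
    """Return every 0-based index where `dinucleotide` starts in `sequence`."""
    return [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == dinucleotide]


def candidate_introns(sequence: str, min_length: int = 4) -> list:
    """Pair each GT (donor) with every later AG (acceptor).

    Returns (start, end) spans with `end` exclusive, so that
    sequence[start:end] begins with GT and ends with AG.
    """
    donors = find_dinucleotide(sequence, "GT")
    acceptors = find_dinucleotide(sequence, "AG")
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        # The AG must start after the GT ends, and the span must be long enough.
        if a >= d + 2 and a + 2 - d >= min_length
    ]


dna = "ATGGTAAGTCCTTAGGCA"  # made-up sequence
for start, end in candidate_introns(dna):
    print(start, end, dna[start:end])
```

Every GT is paired with every later AG here, so the output is only a list of candidates that the scoring methods on the following slides would have to rank.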

Single Gene Recognition: Methods

1. Combinatorial methods
   – Single block
2. Probabilistic methods
   – Simple
   – Markov based
3. Linear Discriminant methods

Consensus Sequence


Obtained by choosing the most frequent base at each position of the multiple alignment of subsequences of interest

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

Consensus Sequence: TATAAT

MELON
MANGO
HONEY
SWEET
COOKY

Consensus: MONEY

A consensus leads to loss of information and can produce many false positive or false negative predictions.
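The "most frequent base at each position" rule can be sketched as follows, reading the slide's 36-letter string as six aligned 6-base sequences. The function name is illustrative, and breaking ties in favour of the earliest symbol of the alphabet is an assumption (position 4 of the DNA example has three A and three G).

```python
from string import ascii_uppercase


def consensus(aligned: list, alphabet: str) -> str:
    """Return the most frequent symbol of each column of an ungapped alignment."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned subsequences must have the same length")
    result = []
    for pos in range(length):
        column = [s[pos] for s in aligned]
        # max() keeps the first maximum, so ties go to the earliest symbol in `alphabet`.
        result.append(max(alphabet, key=column.count))
    return "".join(result)


sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
words = ["MELON", "MANGO", "HONEY", "SWEET", "COOKY"]
print(consensus(sites, "ACGT"))
print(consensus(words, ascii_uppercase))
```

The word example makes the loss of information visible: MONEY is not one of the five input words.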

Combinatorial Methods

Consensus Sequence (ex: TATA box)
For a consensus sequence of size L and for a position in the considered sequence, we compute

1. P(L,k) = P(Detect the consensus seq. with k mismatches)
2. The probability of detecting T patterns:

$$P(T) = \sum_{z=0}^{k} C_{Fl}^{T}\, P^{T}(L,z)\,\bigl(1 - P(L,z)\bigr)^{Fl-T}$$

   where Fl = # possible positions in the considered sequence and T is the number of patterns detected in the given sequence
3. For a given T0, define a threshold value for the detection.
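Detection with at most k mismatches can be sketched as a sliding-window scan. The function names and the test sequence are illustrative assumptions. The number of windows examined corresponds to Fl, and the number of hits to T.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_matches(sequence: str, consensus: str, k: int) -> list:
    """Return (position, mismatches) for every window within k mismatches of the consensus."""
    size = len(consensus)
    hits = []
    # There are len(sequence) - size + 1 possible positions (Fl on the slide).
    for start in range(len(sequence) - size + 1):
        mismatches = hamming(sequence[start:start + size], consensus)
        if mismatches <= k:
            hits.append((start, mismatches))
    return hits


sequence = "GGCTATAATGC"  # made-up sequence
for k in (0, 2, 4):
    print(k, find_matches(sequence, "TATAAT", k))
```

Raising k finds more degenerate sites but also more chance matches, which is why a threshold has to be chosen.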

Probabilistic Methods I

For a given consensus sequence, a weight matrix is computed by measuring, in a training set, the frequency of each base at each position of the site.

Matrix entries can be considered as probabilities.

Disadvantages:
– assumes independence between adjacent bases

[Figure: example weight matrix; slide labels: GU, Acceptor site]
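A minimal sketch of the frequency computation, using the six 6-base sequences of the Consensus Sequence slide as a stand-in training set. The function name is illustrative, and real training sets would be much larger.

```python
def weight_matrix(sites: list, alphabet: str = "ACGT") -> list:
    """Return one {base: frequency} dictionary per position of the aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all training sites must have the same length")
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        # Relative frequency of each base at this position.
        matrix.append({base: column.count(base) / len(sites) for base in alphabet})
    return matrix


training = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
for pos, column in enumerate(weight_matrix(training), start=1):
    print(pos, column)
```

Each column sums to 1, which is what allows the entries to be read as probabilities. Each column is estimated separately, so the matrix cannot capture dependence between adjacent bases.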

Probabilistic Methods II

• Under the weight matrix model, the probability of having a sequence X = (x1, x2, .., xk) that matches a site is:

$$P(X/S) = \prod_{i=1}^{k} p^{i}_{x_i}$$

where p^i_{x_i} is the matrix entry for base x_i at position i.

If we introduce a measure of the form:

$$LLR(X) = \log\left(\frac{P(X/S)}{P(X/N)}\right)$$

then the more LLR exceeds 0, the better the chances that this sequence is a functional signal.
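The log-likelihood ratio can be sketched as below. Several choices are assumptions made for the example: P(X/N) is taken to be a uniform background model (0.25 per base), a pseudocount of 1 is added so that no matrix entry is zero, and the training set is the six sequences of the Consensus Sequence slide.

```python
import math


def weight_matrix(sites: list, alphabet: str = "ACGT", pseudocount: float = 1.0) -> list:
    """Position-specific base probabilities with a pseudocount (assumed, to avoid log(0))."""
    total = len(sites) + pseudocount * len(alphabet)
    return [
        {b: ([s[pos] for s in sites].count(b) + pseudocount) / total for b in alphabet}
        for pos in range(len(sites[0]))
    ]


def llr(x: str, matrix: list, background: dict) -> float:
    """LLR(X) = log(P(X/S) / P(X/N)), computed as a difference of log sums."""
    log_site = sum(math.log(matrix[i][base]) for i, base in enumerate(x))
    log_background = sum(math.log(background[base]) for base in x)
    return log_site - log_background


training = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
matrix = weight_matrix(training)
uniform = {base: 0.25 for base in "ACGT"}  # assumed non-site model
for candidate in ("TATAAT", "TATGAT", "GCGCGC"):
    print(candidate, round(llr(candidate, matrix, uniform), 3))
```

Working with sums of logarithms instead of products avoids numerical underflow on longer sites.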

Methods improvements

• 2 blocks approach: P(L1,k1,L2,k2) and distance D1
• Multiple nucleotides probabilities
• Neural network approach
• Reading frame
• Markov Models

Markov Models

• Probabilistic (weight matrix) methods are 0-order Markov models
• Markov models introduce dependencies between the bases
• The probability of observing a sequence now becomes:

$$P(X/S) = p_0 \prod_{i=1}^{k} p^{i-1,i}_{x_i}$$
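A minimal sketch of a first-order Markov model, reading p_0 as the probability of the first base and each further factor as the probability of a base given the one before it. For brevity the transition probabilities are shared by all positions, whereas the slide's notation suggests position-specific ones. This simplification, the pseudocount and the training set are assumptions.

```python
import math


def train_markov(sequences: list, alphabet: str = "ACGT", pseudocount: float = 1.0):
    """Estimate initial and transition probabilities from training sequences."""
    initial = {b: pseudocount for b in alphabet}
    transitions = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for s in sequences:
        initial[s[0]] += 1
        for prev, cur in zip(s, s[1:]):
            transitions[prev][cur] += 1
    total = sum(initial.values())
    initial = {b: count / total for b, count in initial.items()}
    for a in alphabet:
        row_total = sum(transitions[a].values())
        transitions[a] = {b: count / row_total for b, count in transitions[a].items()}
    return initial, transitions


def log_probability(x: str, initial: dict, transitions: dict) -> float:
    """log P(X/S): first-base term plus one transition term per adjacent pair."""
    logp = math.log(initial[x[0]])
    for prev, cur in zip(x, x[1:]):
        logp += math.log(transitions[prev][cur])
    return logp


training = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
initial, transitions = train_markov(training)
for candidate in ("TATAAT", "GCGCGC"):
    print(candidate, round(log_probability(candidate, initial, transitions), 3))
```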

Linear Discriminant methods I

Many functional signals are very short => Exploit related characteristics

1. We build a sequence characteristics vector (x1, …, xp)
2. We define

$$Z = \sum_{i=1}^{p} a_i x_i$$

   and if Z > c then the sequence corresponds to a site
3. We use a training set to define {ai} and c
4. The training set of « site sequences » defines a vector m1 and the « non site sequences » a vector m2:

$$a = s^{-1}(m_1 - m_2), \qquad c = a\,(m_1 + m_2)/2$$
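A minimal sketch of steps 2 to 4 for p = 2 characteristics. It assumes that m1 and m2 are the mean vectors of the two training sets and that s is their pooled covariance matrix. The slide does not define s, so this reading is an assumption, and the training vectors are invented.

```python
def mean_vector(rows: list) -> list:
    """Component-wise mean of a list of 2-feature vectors."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(2)]


def pooled_covariance(group1: list, group2: list) -> list:
    """Pooled within-group 2x2 covariance matrix (assumed meaning of s)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (group1, group2):
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(group1) + len(group2) - 2
    return [[value / dof for value in row] for row in s]


def train_discriminant(sites: list, non_sites: list):
    """Return (a, c) with a = s^-1 (m1 - m2) and c = a . (m1 + m2) / 2."""
    m1, m2 = mean_vector(sites), mean_vector(non_sites)
    s = pooled_covariance(sites, non_sites)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    # Explicit inverse of the 2x2 matrix applied to (m1 - m2).
    a = [(s[1][1] * d[0] - s[0][1] * d[1]) / det,
         (-s[1][0] * d[0] + s[0][0] * d[1]) / det]
    c = sum(a[i] * (m1[i] + m2[i]) / 2 for i in range(2))
    return a, c


def is_site(x, a, c) -> bool:
    """Classify x as a site when Z = sum(a_i * x_i) exceeds c."""
    return sum(ai * xi for ai, xi in zip(a, x)) > c


# Invented training data: (weight matrix score, distance-like feature).
sites = [(4.0, 1.0), (5.0, 2.0), (6.0, 1.5), (5.0, 0.5)]
non_sites = [(1.0, 3.0), (2.0, 4.0), (1.5, 3.5), (2.5, 2.5)]
a, c = train_discriminant(sites, non_sites)
print(a, c, is_site((5.0, 1.0), a, c), is_site((1.5, 3.5), a, c))
```

The threshold c places the decision boundary halfway between the two group means along the direction a.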

Linear Discriminant methods II

1. Choose a set of p characteristics
   – Score of the weight matrix
   – Distance to a predicted site
   – Base composition in distant sequence
   – …
2. Test the characteristics with the Mahalanobis distance:

$$D^2 = (m_1 - m_2)\, s^{-1}\, (m_1 - m_2)$$

3. Choose the set of q characteristics that maximizes D^2
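The distance itself can be sketched for two characteristics as follows, taking the two vectors and the 2x2 matrix s as given inputs. The function name and the numeric values are illustrative assumptions.

```python
def mahalanobis_d2(m1, m2, s) -> float:
    """D^2 = (m1 - m2)^T s^-1 (m1 - m2) for two characteristics (2x2 matrix s)."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0:
        raise ValueError("s must be invertible")
    d0, d1 = m1[0] - m2[0], m1[1] - m2[1]
    # y = s^-1 (m1 - m2), using the explicit inverse of a 2x2 matrix.
    y0 = (s[1][1] * d0 - s[0][1] * d1) / det
    y1 = (-s[1][0] * d0 + s[0][0] * d1) / det
    return d0 * y0 + d1 * y1


# With the identity matrix, D^2 reduces to the squared Euclidean distance.
print(mahalanobis_d2((3.0, 0.0), (0.0, 4.0), [[1.0, 0.0], [0.0, 1.0]]))
# A larger variance along a characteristic shrinks its contribution.
print(mahalanobis_d2((2.0, 0.0), (0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]]))
```

A characteristic with a large D^2 separates site and non-site sequences well relative to its spread, which is the basis for ranking characteristics in step 3.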

Linear Discriminant methods III: Example

Poly-A site

Chosen characteristics:
1. Score of the weight matrix of Poly-A
2. Score of the weight matrix of the GT el.
3. Distance between Poly-A and GT
4. Nucleotide composition of Downstream Region (6,100)
5. Nucleotide composition of Upstream Region (-100,-1)

[Figure: 3’ end of a gene. Slide labels: 5’, Last Exon, Stop codon, T-rich, Poly-A site, CAATAAA(T/C), GT-Rich. Annotations: Score of Poly-A, Score of GT, Distance between poly-A and GT.]

Linear Discriminant methods IV: Example

Poly-A site

Mahalanobis distance for the chosen characteristics (numbered as in the list of the previous slide):

Characteristic     1       4       2       5       3
Individual D2      7.61    3.46    0.01    2.27    0.44
Composed D2        7.61    10.78   11.67   12.36   12.68

Multiple Genes approach

2 Approaches:

1. Discriminant Analysis, pattern based
   – FGENES
2. HMM, probabilistic approach
   – FGENEH

Discriminant Analysis


Goal: Detect first and last Exons in a big sequence

1. Find internal exons

2. Find last exons based on 3’ sites

3. Find first exons based on 5’ sites

4. Combine results

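[Figure: the three exon types, with A and D marking the acceptor and donor splice sites:
  internal exon: Intron, A, Internal exon, D, Intron
  last exon: Intron, A, Last exon, Stop, 3’ site
  first exon: 5’ site, ATG, First exon, D, Intron]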

HMM method I

We want to use a Markov model to represent and recognize genes



HMM method II

Real model:

[Figure: HMM state diagram. State labels on the slide: N, P, 5’ UTR, Einit, E0, E1, E2, I0, I1, I2, Eterm, Esngl, 3’ UTR, polyA. Every label except N appears twice.]

HMM method III

1. The model must be trained to compute:
   • State transition probabilities
   • Initial distribution
2. For a given sequence, we look for the best path using the Viterbi algorithm
3. We analyze the best path to determine if it could be a gene.
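Step 2 can be illustrated on a deliberately tiny model. The sketch below is not the gene model of the previous slide: the two states and every probability are invented for illustration, including the per-state base emission probabilities, which the list above does not mention but which an HMM over bases needs.

```python
import math


def viterbi(observations: str, states, start_p, trans_p, emit_p) -> list:
    """Return the most probable state path for `observations` (log-space Viterbi)."""
    # best[t][s]: log-probability of the best path that ends in state s at position t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model: GC-rich "exon" versus AT-rich "intergenic".
states = ("intergenic", "exon")
start_p = {"intergenic": 0.5, "exon": 0.5}
trans_p = {"intergenic": {"intergenic": 0.9, "exon": 0.1},
           "exon": {"intergenic": 0.1, "exon": 0.9}}
emit_p = {"intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
          "exon": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
print(viterbi("ATATGCGCGC", states, start_p, trans_p, emit_p))
```

The best path labels every base with a state. Step 3 then amounts to reading off the labelled segments and checking whether they form a plausible gene.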

Similarity methods

2 Goals:
1. Find out the genes’ functions
2. Improve algorithms

2 Main Methods:
1. EST based
2. BLAST with other species

Remarks

• The real challenge is gene recognition in long and complex sequences

• It is very difficult to measure the accuracy of the methods

• The databases are full of errors

Conclusion

• Best results are obtained by combining methods
  – HMM + EST + Dynamic programming
• This problem will be solved within a few years
• But huge challenges remain
  – Gene regulation
  – Alternative splicing
  – Gene expression

Questions And Remarks


Thanks for your attention
