comparing two protein sequences

Post on 22-Jan-2016

Cédric Notredame (21/04/23)

Comparing Two Protein Sequences

Cédric Notredame

Our Scope

Pairwise Alignment methods are POWERFUL

Pairwise Alignment methods are LIMITED

If You Understand the LIMITS they Become VERY POWERFUL

Look once Under the Hood

Outline

-WHY Does It Make Sense To Compare Sequences

-HOW Can we Align Two Sequences ?

-HOW can I Search a Database ?

-HOW Can we Compare Two Sequences ?

Why Does It Make Sense To Compare Sequences ?

Sequence Evolution

Why Do We Want To Compare Sequences

wheat --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
        | | |||||||| || | |||    ||| | ||||| ||||
????? KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA

Homology? If so, EXTRAPOLATE from the known SwissProt sequence to the unknown one (?????).

Why Does It Make Sense To Align Sequences ?

-Evolution is our Real Tool.

-Nature is LAZY and Keeps re-using Stuff.

-Evolution is mostly DIVERGENT

Same Sequence Same Ancestor

Why Does It Make Sense To Align Sequences ?

Same Sequence

Same Function

Same 3D Fold

Same Origin

Comparing Is Reconstructing Evolution

An Alignment is a STORY

ADKPKRPLSAYMLWLN

ADKPKRPLSAYMLWLN    ADKPKRPLSAYMLWLN

Mutations + Selection

ADKPKRPKPRLSAYMLWLN    ADKPRRPLS-YMLWLN

An Alignment is a STORY

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Mutation, Insertion, Deletion

ADKPKRPLSAYMLWLN

ADKPKRPLSAYMLWLN    ADKPKRPLSAYMLWLN

Mutations + Selection

ADKPKRPKPRLSAYMLWLN    ADKPRRPLS-YMLWLN

Evolution is NOT Always Divergent…

S: AFGP with (ThrAlaAla)n, Similar to Trypsinogen

N: AFGP with (ThrAlaAla)n, NOT Similar to Trypsinogen

Chen et al., 1997, PNAS, 94, 3811-16

Evolution is NOT Always Divergent

S: AFGP with (ThrAlaAla)n, Similar to Trypsinogen

N: AFGP with (ThrAlaAla)n, NOT Similar to Trypsinogen

SIMILAR Sequences BUT DIFFERENT origin

Evolution is NOT always Divergent…

But in MOST cases, you may assume it is…

Same Sequence

Same Function

Same 3D Fold

Same Origin

Similar Function DOES NOT REQUIRE Similar Sequence

Similar Sequence

Historical Legacy

How Do Sequences Evolve

Each Portion of a Genome has its own Agenda.

How Do Sequences Evolve ?

CONSTRAINED Genome Positions Evolve SLOWLY

EVERY Protein Family Has its Own Level Of Constraint

Family            Ks    Ka
Histone3          6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
Globin            5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8

Rates in Substitutions/site/Billion Years, as measured on Mouse vs Human (80 Million years).

Ks: Synonymous Mutations. Ka: Non-Synonymous (Non-Neutral) Mutations.

Different molecular clocks for different proteins--another prediction

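How Do Sequences Evolve ?

The Amino Acids Venn Diagram

To Make Things Worse, Every Residue has its Own Personality

[Figure: Venn diagram of the amino acids grouped by property. Property labels: Hydrophobic, Aliphatic, Aromatic, Polar, Small. Residue labels as extracted: G, C, L I V, A, F, S T, W Y, Q H K, R, E D, N, P G, C.]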

How Do Sequences Evolve ?

In a structure, each Amino Acid plays a Special Role

OmpR, C-terminal Domain

In the core, SIZE MATTERS

On the surface, CHARGE MATTERS

How Do Sequences Evolve ?

Accepted Mutations Depend on the Structure

In the core: Big -> Big, Small -> Small, NO DELETION

On the surface: Charged -> Charged, Small <-> Big or Small, DELETIONS

How Can We Compare Sequences ?

Substitution Matrices

How Can We Compare Sequences ?

To Compare Two Sequences, We need:

Their Function

Their Structure

We Do Not Have Them !!!

How Can We Compare Sequences ?

We will Need To Replace Structural Information With Sequence Information.

Same Sequence

Same Function

Same 3D Fold

Same Origin

It CANNOT Work ALL THE TIME !!!

How Can We Compare Sequences ?

To Compare Sequences, We Need to Compare Residues.

We Need to Know How Much it COSTS to SUBSTITUTE:

-an Alanine into an Isoleucine

-a Tryptophan into a Glycine…

The table that contains the costs for all the possible substitutions is called the SUBSTITUTION MATRIX.

How to derive that matrix?

How Can We Compare Sequences ?

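[Figure: Venn diagram of the amino acids grouped by property. Property labels: Hydrophobic, Aliphatic, Aromatic, Polar, Small. Residue labels as extracted: G, C, L I V, A, F, S T, W, Y Q H, K, R, E D, N, P G, C.]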

Using Knowledge Could Work

But we do not know enough about Evolution and Structure.

Using Data works better.

How Can We Compare Sequences ?

Making a Substitution Matrix

-Take 100 nice pairs of Protein Sequences, easy to align (80% identical).

-Align them…

-Count each mutation in the alignments:

    -25 Tryptophans into Phenylalanine

    -30 Isoleucines into Leucine…

-For each mutation, set the substitution score to the log-odds ratio:

    Score = Log( Observed / Expected by chance )
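The counting recipe above can be sketched in a few lines of Python. This is only an illustration: the toy alignments, the function name `log_odds_scores`, and the choice of log base 2 are assumptions, not the procedure actually used to build PAM or BLOSUM. Pairs that never occur in the data get no score at all, which is one reason real matrices need many alignments.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs):
    """Log-odds substitution scores from gap-free aligned sequence pairs.

    aligned_pairs: list of (seq_a, seq_b) strings of equal length.
    Returns {(x, y): log2(observed / expected)} with x <= y.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        for x, y in zip(seq_a, seq_b):
            pair_counts[tuple(sorted((x, y)))] += 1  # unordered pair
            residue_counts[x] += 1
            residue_counts[y] += 1

    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())
    scores = {}
    for (x, y), count in pair_counts.items():
        observed = count / n_pairs
        px = residue_counts[x] / n_residues
        py = residue_counts[y] / n_residues
        # Chance of seeing x opposite y if residues were paired at random.
        expected = px * py if x == y else 2 * px * py
        scores[(x, y)] = math.log2(observed / expected)
    return scores

# Hypothetical toy data: two short, easy-to-align pairs.
toy = [("WWIL", "WFLL"), ("WIIL", "WILL")]
scores = log_odds_scores(toy)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 2))
```

A positive score means the pair is seen more often than chance predicts, a negative score less often.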

How Can We Compare Sequences ?

Making a Substitution Matrix

The Diagonal Indicates How Conserved a residue tends to be.

W is VERY Conserved

Some Residues are Easier To mutate into other, similar residues

Cysteines that make disulfide bridges and those that do not get averaged

How Can We Compare Sequences ?

Using a Substitution Matrix

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Mutation, Insertion, Deletion

Given two Sequences and a substitution Matrix, We must Compute the CHEAPEST Alignment

Scoring an Alignment

Most popular Substitution Matrices

• PAM250

• Blosum62 (Most widely used)

Raw Score

TPEA
¦| |
APGA

Score = 1 + 6 + 0 + 2 = 9

• Question: Is it possible to get such a good alignment by chance only?
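The raw score of the TPEA / APGA example is simply one substitution score per column, added up. A minimal sketch, assuming the four PAM250 entries below (they are chosen to match the 1 + 6 + 0 + 2 on the slide; the names `PAM250_SUBSET`, `pair_score` and `raw_score` are made up for this illustration):

```python
# Only the four matrix entries this example needs (assumed PAM250 values,
# consistent with the 1 + 6 + 0 + 2 = 9 shown on the slide).
PAM250_SUBSET = {
    ("A", "T"): 1,
    ("P", "P"): 6,
    ("E", "G"): 0,
    ("A", "A"): 2,
}

def pair_score(x, y, matrix):
    """Look up a symmetric substitution score (keys are sorted pairs)."""
    return matrix[tuple(sorted((x, y)))]

def raw_score(seq_a, seq_b, matrix):
    """Sum the substitution scores of a gap-free alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(seq_a, seq_b))

print(raw_score("TPEA", "APGA", PAM250_SUBSET))  # 9
```

With a full matrix the same function scores any gap-free alignment; gaps need the penalties discussed next.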

Insertions and Deletions

Gap Penalties

• Opening a gap is more expensive than extending it

Seq A GARFIELDTHE----CAT
      |||||||||||    |||
Seq B GARFIELDTHELASTCAT

gap

Gap Opening Penalty

Gap Extension Penalty
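Why opening costs more than extending can be seen with a small cost function. This is a sketch under stated assumptions: the penalty values (10 and 1) are hypothetical, and the convention used here charges the opening penalty once per gap plus one extension penalty per gapped position; some programs charge the extension only from the second position on.

```python
def affine_gap_cost(length, gop=10.0, gep=1.0):
    """Cost of one gap of `length` positions under an affine penalty.

    Assumed convention: `gop` is paid once per gap, `gep` once per
    gapped position. The default values are illustrative only.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gop + gep * length

# One gap of four positions (GARFIELDTHE----CAT) versus four separate gaps.
one_long_gap = affine_gap_cost(4)         # 10 + 4 * 1
four_short_gaps = 4 * affine_gap_cost(1)  # 4 * (10 + 1)
print(one_long_gap, four_short_gaps)
```

Under this scheme a single long insertion or deletion is much cheaper than several scattered ones, which is the behaviour the slide asks for.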

How Can We Compare Sequences ?

Limits of the substitution Matrices

They ignore non-local interactions and Assume that identical residues are equal

They assume evolution rate to be constant

ADKPKRPLSAYMLWLN

ADKPKRPLSAYMLWLN    ADKPKRPLSAYMLWLN

Mutations + Selection

ADKPKRPKPRLSAYMLWLN    ADKPRRPLS-YMLWLN

How Can We Compare Sequences ?

Limits of the substitution Matrices

Substitution Matrices Cannot Work !!!

How Can We Compare Sequences ?

Limits of the substitution Matrices

I know… But at least, could I get some idea of when they are likely to do all right?

How Can We Compare Sequences ?

The Twilight Zone

[Figure: %Sequence Identity plotted against alignment Length. Regions: Similar Sequence, Similar Structure (Same 3D Fold); Twilight Zone; Different Sequence, Structure ????. Axis ticks as extracted: 100, 30%, 30.]

How Can We Compare Sequences ?

The Twilight Zone

Substitution Matrices Work Reasonably Well on Sequences that have more than 30 % identity over more than 100 residues
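The rule of thumb above can be turned into a quick check on an existing alignment. A minimal sketch: the function names are made up, and percent identity is computed here over the columns where both sequences carry a residue, which is only one of several definitions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, number of compared columns).

    Assumption: gapped columns are ignored, so identity is measured
    over the columns where both sequences have a residue.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        identical += (x == y)
    identity = 100.0 * identical / compared if compared else 0.0
    return identity, compared

def above_twilight_zone(aligned_a, aligned_b, min_identity=30.0, min_length=100):
    """True when the alignment has more than 30% identity over more than 100 residues."""
    identity, compared = percent_identity(aligned_a, aligned_b)
    return identity > min_identity and compared > min_length

identity, compared = percent_identity("ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN")
print(round(identity, 1), compared)
print(above_twilight_zone("ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN"))
```

The short example alignment is highly identical but far too short to pass the length criterion, so the check returns False.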

How Can We Compare Sequences ?

Which Matrix Shall I Use

The Initial PAM matrix was computed on 80% similar Proteins

It has been extrapolated to more distantly related sequences:

PAM 250, PAM 350

Other Matrices Exist:

BLOSUM 42, BLOSUM 62

How Can We Compare Sequences ?

Which Matrix Shall I Use

PAM: Distant Proteins, High Index (PAM 350)

BLOSUM: Distant Proteins, Low Index (Blosum30)

Choosing The Right Matrix may be Tricky…

•GONNET 250 > BLOSUM62 > PAM 250.

•But This will depend on:

    •The Family.

    •The Program Used and Its Tuning.

    •Insertions, Deletions?

HOW Can we Align Two Sequences ?

Dot Matrices

Global Alignments

Local Alignment

Dot Matrices

QUESTION

What are the elements shared by two sequences ?

Dot Matrices

>Seq1
THEFATCAT
>Seq2
THELASTCAT

[Figure: dot matrix with T H E F A T C A T on one axis and T H E F A S T C A T on the other. Parameters: Window, Stringency.]

Dot Matrices

Sequences Window size

Stringency

Dot Matrices: Stringency

Window=1, Stringency=1

Window=11, Stringency=7

Window=25, Stringency=15
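How the window and stringency settings filter a dot matrix can be sketched as follows. This is an illustration, not the implementation of any particular dot-plot program: identical residues are the only matches counted (no substitution matrix), and the function names are made up.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) start positions that earn a dot.

    A dot is placed when the windows of length `window` starting at
    seq_a[i] and seq_b[j] share at least `stringency` identical positions.
    """
    if not 1 <= stringency <= window:
        raise ValueError("need 1 <= stringency <= window")
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots

def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + " ".join(seq_b)]
    for i, residue in enumerate(seq_a):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)

noisy = dot_matrix("THEFATCAT", "THEFASTCAT", window=1, stringency=1)
clean = dot_matrix("THEFATCAT", "THEFASTCAT", window=3, stringency=3)
print(render("THEFATCAT", "THEFASTCAT", noisy))
print()
print(render("THEFATCAT", "THEFASTCAT", clean))
```

With Window=1 every chance identity produces a dot; raising the window and stringency leaves only the two diagonals on either side of the extra S.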

Dot Matrices: Limits

-Visual aid

-Best Way to EXPLORE the Sequence Organisation

-Does NOT provide us with an ALIGNMENT

wheat --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
        | | |||||||| || | |||    ||| | ||||| ||||
????? KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA

Global Alignments

-Take 2 Nice Protein Sequences

-A good Substitution Matrix (blosum)

-A Gap opening Penalty (GOP)

-A Gap extension Penalty (GEP)

[Figure: Affine Gap Penalty. Cost of a gap plotted against its length L: a GOP is charged when a gap opens, a GEP for each extension.]

Parsimony: Evolution takes the simplest path (So We Think…)

Global Alignments

-Take 2 Nice Protein Sequences

-A good Substitution Matrix (blosum)

-A Gap opening Penalty (GOP)

-A Gap extension Penalty (GEP)

-DYNAMIC PROGRAMMING

>Seq1
THEFATCAT
>Seq2
THEFASTCAT

DYNAMIC PROGRAMMING

THEFA-TCAT
THEFASTCAT

Global Alignments

Brute Force Enumeration

F A S T
F A T

----FAT    ---FAT-    --F-AT-
FAST---    FAST---    FAST---

Number of possible alignments: (L1+L2)! / ((L1)! * (L2)!)

DYNAMIC PROGRAMMING: on the order of (L)^2 operations
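The gap between brute-force enumeration and dynamic programming is easy to put into numbers. A small sketch, assuming the (L1+L2)! / (L1! * L2!) count quoted on the slide and counting one dynamic programming cell per pair of prefixes (the function names are illustrative):

```python
import math

def brute_force_alignments(len_a, len_b):
    """(L1+L2)! / (L1! * L2!), the count quoted for brute-force enumeration."""
    return math.comb(len_a + len_b, len_a)

def dp_cells(len_a, len_b):
    """Cells filled by dynamic programming, border row and column included."""
    return (len_a + 1) * (len_b + 1)

for length in (3, 10, 100):
    print(length, brute_force_alignments(length, length), dp_cells(length, length))

# FAT (3 residues) against FAST (4 residues)
print(brute_force_alignments(3, 4), dp_cells(3, 4))
```

For two sequences of 100 residues the enumeration count already exceeds 10^58, while the dynamic programming table has about ten thousand cells.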

Global Alignments

DYNAMIC PROGRAMMING

Dynamic Programming (Needleman and Wunsch)

Match=1, MisMatch=-1, Gap=-1

          F    A    S    T
     0   -1   -2   -3   -4
F   -1    1    0   -1   -2
A   -2    0    2    1    0
T   -3   -1    1    1    2

F A S T
F A - T
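The FAT / FAST example can be reproduced with a short Needleman and Wunsch implementation. This is a minimal sketch using the slide's scores (Match=1, MisMatch=-1, Gap=-1) with a single linear gap penalty rather than separate GOP and GEP values; the function name and the tie-breaking order in the traceback are choices made for this illustration.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: seq_a against gaps only
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: seq_b against gaps only
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a

    # Trace back from the bottom-right corner, preferring the diagonal.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("FAST", "FAT"))  # (2, 'FAST', 'FA-T')
```

The bottom-right cell holds the score of the best global alignment; the traceback recovers one alignment that achieves it.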

Global Alignments

DYNAMIC PROGRAMMING

Global Alignments are very sensitive to gap Penalties

Global Alignments do not take into account the MODULAR nature of Proteins

C: K vitamin dep. Ca Binding

K: Kringle Domain

G: Growth Factor module

F: Finger Module

Local Alignments

GLOBAL Alignment

LOCAL Alignment

Smith And Waterman (SW)=LOCAL Alignment
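The difference from a global alignment is small in code: scores are never allowed to drop below zero, and the traceback starts at the best cell instead of the corner. A minimal Smith and Waterman sketch, assuming simple scores (match 1, mismatch -1, linear gap -1) and made-up test sequences:

```python
def smith_waterman(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Local alignment with a linear gap penalty.

    Returns (best score, aligned segment of seq_a, aligned segment of seq_b).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(0,                          # restart the alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Only the shared module is reported; the unrelated flanks are ignored.
print(smith_waterman("THEFASTCAT", "XXFASTYY"))  # (4, 'FAST', 'FAST')
```

This is why local alignment suits modular proteins: a shared domain is found even when the rest of the two sequences has nothing in common.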

Local Alignments

We now have a PairWise Comparison Algorithm,

We are ready to search Databases

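Database Search

[Figure: a QUERY (Q) is passed to a Comparison Engine (SW) that scans the Database; each database sequence receives a score and an E-value. Values shown on the figure, as extracted: 1.10e-20, 10, 1.10e-100, 1.10e-2, 1.10e-1, 10, 3, 1, 3, 6, 1.10e-2, 1, 20, 15, 13.]

E-values: How many times do we expect such an Alignment by chance?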
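One common statistical model for that question is the Karlin-Altschul formula used by BLAST-like programs, E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The sketch below is only an illustration: the slides do not say which model the comparison engine uses, and the `lam` and `k` defaults are placeholder values (real programs estimate them for the matrix and gap penalties in use).

```python
import math

def expected_hits(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance alignments scoring at least `raw_score`.

    Karlin-Altschul form: E = K * m * n * exp(-lambda * S).
    `lam` and `k` are assumed, illustrative parameters.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Hypothetical search: a 300-residue query against 100 million residues.
for s in (50, 100, 200):
    print(s, expected_hits(s, 300, 10 ** 8))
```

The E-value grows with the size of the search space and shrinks exponentially with the score, so the same alignment is less convincing in a larger database.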

CONCLUSION

Sequence Comparison

-Thanks to evolution, We CAN compare Sequences

-There is a relation between Sequence and Structure.

-The Easiest way to Compare Two Sequences is a dotplot.

-Substitution matrices only work well with similar Sequences (More than 30% id).

A few Addresses
