comparing two protein sequences

Cédric Notredame (24/06/22) Comparing Two Protein Sequences Cédric Notredame


TRANSCRIPT

Page 1: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Comparing Two Protein Sequences

Cédric Notredame

Page 2: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Our Scope

Pairwise Alignment methods are POWERFUL

Pairwise Alignment methods are LIMITED

If You Understand the LIMITS they Become VERY POWERFUL

Look once Under the Hood

Page 3: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Outline

- WHY Does It Make Sense To Compare Sequences ?

- HOW Can we Compare Two Sequences ?

- HOW Can we Align Two Sequences ?

- HOW can I Search a Database ?

Page 4: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Why Does It Make Sense To Compare Sequences ?

Sequence Evolution

Page 5: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Why Do We Want To Compare Sequences

wheat --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
        | | |||||||| || | |||    ||| | ||||| ||||
????? KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA

EXTRAPOLATE

??????

Homology?

SwissProt

Page 6: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Why Do We Want To Compare Sequences

Page 7: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Why Does It Make Sense To Align Sequences ?

-Evolution is our Real Tool.

-Nature is LAZY and Keeps re-using Stuff.

-Evolution is mostly DIVERGENT

Same Sequence Same Ancestor

Page 8: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Why Does It Make Sense To Align Sequences ?

Same Sequence

Same Function

Same 3D Fold

Same Origin

Page 9: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Comparing Is Reconstructing Evolution

Page 10: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

An Alignment is a STORY

ADKPKRPLSAYMLWLN

ADKPKRPLSAYMLWLN ADKPKRPLSAYMLWLN

Mutations + Selection

ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN

Page 11: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

An Alignment is a STORY

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Mutation

Insertion

Deletion

ADKPKRPLSAYMLWLN

ADKPKRPLSAYMLWLN ADKPKRPLSAYMLWLN

Mutations + Selection

ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN

Page 12: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Evolution is NOT Always Divergent…

AFGP with (ThrAlaAla)n, Similar to Trypsinogen

AFGP with (ThrAlaAla)n, NOT Similar to Trypsinogen

N

S

Chen et al., 1997, PNAS, 94, 3811-16

Page 13: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Evolution is NOT Always Divergent

AFGP with (ThrAlaAla)n, Similar to Trypsinogen

AFGP with (ThrAlaAla)n, NOT Similar to Trypsinogen

N

S

SIMILAR Sequences BUT DIFFERENT origin

Page 14: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Evolution is NOT always Divergent…

But in MOST cases, you may assume it is…

Same Sequence

Same Function

Same 3D Fold

Same Origin

Similar Function DOES NOT REQUIRE Similar Sequence

Similar Sequence

Historical Legacy

Page 15: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Do Sequences Evolve

Each Portion of a Genome has its own Agenda.

Page 16: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Do Sequences Evolve ?

CONSTRAINED Genome Positions Evolve SLOWLY

EVERY Protein Family Has its Own Level Of Constraint

Family            Ks    Ka
Histone3          6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
Globin            5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8

Rates in Substitutions/site/Billion Years, as measured on Mouse vs Human (80 Million years). Ks: Synonymous Mutations. Ka: Non-Synonymous (Non-Neutral) Mutations.

Page 17: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Different molecular clocks for different proteins--another prediction

Page 18: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

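How Do Sequences Evolve ? The Amino Acids Venn Diagram

To Make Things Worse, Every Residue has its Own Personality

[Figure: Venn diagram of the amino acids. Set labels: Aliphatic, Aromatic, Hydrophobic, Polar, Small. Residue groups as extracted: GC, LIV, A, F, C, ST, WY, QHK, R, ED, N, PG.]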

Page 19: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Do Sequences Evolve ?

In a structure, each Amino Acid plays a Special Role

OmpR, Cter Domain

In the core, SIZE MATTERS

On the surface, CHARGE MATTERS

--+

Page 20: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Do Sequences Evolve ?

Accepted Mutations Depend on the Structure

In the core: Big -> Big, Small -> Small, NO DELETION

On the surface: Charged -> Charged, Small <-> Big or Small, DELETIONS

Page 21: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ?

Substitution Matrices

Page 22: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ?

To Compare Two Sequences, We need:

Their Function

Their Structure

We Do Not Have Them !!!

Page 23: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ?

We will Need To Replace Structural Information With Sequence Information.

Same Sequence

Same Function

Same 3D Fold

Same Origin

It CANNOT Work ALL THE TIME !!!

Page 24: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ?

To Compare Sequences, We need to Compare Residues

We Need to Know How Much it COSTS to SUBSTITUTE

- an Alanine into an Isoleucine

- a Tryptophan into a Glycine

- …

The table that contains the costs for all the possible substitutions is called the SUBSTITUTION MATRIX

How to derive that matrix?

Page 25: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ?

[Figure: Venn diagram of the amino acids, repeated from the earlier slide. Set labels: Aliphatic, Aromatic, Hydrophobic, Polar, Small.]

Using Knowledge Could Work

But we do not know enough about Evolution and Structure.

Using Data works better.

Page 26: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? Making a Substitution Matrix

- Take 100 nice pairs of Protein Sequences, easy to align (80% identical).

- Align them…

- Count each mutation in the alignments

  - 25 Tryptophans into Phenylalanine

  - 30 Isoleucines into Leucine

  - …

- For each mutation, set the substitution score to the log odds ratio:

$\text{Score} = \log \dfrac{\text{Observed}}{\text{Expected by chance}}$
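The recipe above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used to build PAM or BLOSUM: the function name `log_odds_matrix` is hypothetical, the logarithm base (2) is an assumption since the slide only says "Log", and the "expected by chance" term is estimated from the residue frequencies of the same alignments.

```python
import math
from collections import Counter

def log_odds_matrix(aligned_pairs):
    """Log-odds substitution scores from pairs of aligned sequences.

    aligned_pairs: iterable of (row_a, row_b) strings of equal length.
    Returns {(a, b): score} with the residue pair stored in sorted order.
    """
    pair_counts = Counter()      # how often residues a and b face each other
    residue_counts = Counter()   # background composition of the data set
    for row_a, row_b in aligned_pairs:
        for a, b in zip(row_a, row_b):
            if a == "-" or b == "-":
                continue  # gapped columns say nothing about substitutions
            pair_counts[tuple(sorted((a, b)))] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1

    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total_pairs
        freq_a = residue_counts[a] / total_residues
        freq_b = residue_counts[b] / total_residues
        # An unordered pair a/b can arise by chance in two ways (a,b) and (b,a).
        expected = freq_a * freq_b if a == b else 2 * freq_a * freq_b
        scores[(a, b)] = math.log2(observed / expected)  # base 2 is an assumption
    return scores

if __name__ == "__main__":
    toy = [("AAAL", "AAAI")]  # made-up data, for illustration only
    for pair, score in sorted(log_odds_matrix(toy).items()):
        print(pair, round(score, 2))
```

A positive score means the pair is seen more often than chance predicts, a negative score means it is seen less often.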

Page 27: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? Making a Substitution Matrix

The Diagonal Indicates How Conserved a residue tends to be.

W is VERY Conserved

Some Residues are Easier to Mutate into Other, Similar Residues

Cysteines that make disulfide bridges and those that do not get averaged

Page 28: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? Making a Substitution Matrix

Page 29: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Page 30: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? Using a Substitution Matrix

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Mutation

Insertion

Deletion

Given two Sequences and a Substitution Matrix, We must Compute the CHEAPEST Alignment

Page 31: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Scoring an Alignment

Most popular Substitution Matrices

• PAM250

• Blosum62 (Most widely used)

Raw Score

TPEA
¦| |
APGA

Score = 1 + 6 + 0 + 2 = 9

• Question: Is it possible to get such a good alignment by chance only?
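The raw score on this slide is just the sum of one substitution score per column. A minimal sketch follows; the table holds only the four pair scores that appear in the slide's sum, read column by column (T/A = 1, P/P = 6, E/G = 0, A/A = 2), which is an assumption about how the sum lines up with the alignment. The names `PAIR_SCORES` and `raw_score` are hypothetical.

```python
# Pair scores taken from the slide's worked example, read column by column.
# A real matrix (PAM250, BLOSUM62, ...) would have an entry for every pair.
PAIR_SCORES = {
    ("A", "T"): 1,
    ("P", "P"): 6,
    ("E", "G"): 0,
    ("A", "A"): 2,
}

def pair_score(a, b, table=PAIR_SCORES):
    """Look up a symmetric substitution score for residues a and b."""
    return table[tuple(sorted((a, b)))]

def raw_score(row_a, row_b, table=PAIR_SCORES):
    """Sum the substitution scores over the columns of a gap-free alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b, table) for a, b in zip(row_a, row_b))

if __name__ == "__main__":
    print(raw_score("TPEA", "APGA"))  # 1 + 6 + 0 + 2 = 9
```

The raw score alone does not answer the question on the slide: whether 9 is a good score depends on how often such a score turns up by chance.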

Page 32: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Insertions and Deletions

Gap Penalties

• Opening a gap is more expensive than extending it

Seq A GARFIELDTHE----CAT
      |||||||||||    |||
Seq B GARFIELDTHELASTCAT

gap

Gap Opening Penalty

Gap Extension Penalty
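The idea that opening a gap costs more than extending it can be sketched as follows. The numbers (GOP = 10, GEP = 1) are made-up illustrative values, and the convention cost = GOP + GEP x length is an assumption: some programs charge the extension only from the second gapped position onward. The function names are hypothetical.

```python
from itertools import groupby

def affine_gap_cost(length, gop=10, gep=1):
    """Cost of one gap: pay the opening penalty once, then the extension
    penalty for every gapped position (one common convention among several)."""
    if length <= 0:
        return 0
    return gop + gep * length

def total_gap_cost(aligned_row, gop=10, gep=1):
    """Sum the affine cost of every run of '-' in one row of an alignment."""
    return sum(
        affine_gap_cost(len(list(run)), gop, gep)
        for char, run in groupby(aligned_row)
        if char == "-"
    )

if __name__ == "__main__":
    print(total_gap_cost("GARFIELDTHE----CAT"))  # one gap of length 4 -> 14
    print(total_gap_cost("GARFIELD-THE-C-A-T"))  # four gaps of length 1 -> 44
```

With these values a single four-residue gap is much cheaper than four scattered one-residue gaps, which is the behaviour the slide describes.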

Page 33: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? Limits of the Substitution Matrices

They ignore non-local interactions and assume that identical residues are equal

They assume the evolution rate to be constant

ADKPKRPLSAYMLWLN

ADKPKRPLSAYMLWLN ADKPKRPLSAYMLWLN

Mutations + Selection

ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN

Page 34: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? Limits of the Substitution Matrices

Substitution Matrices Cannot Work !!!

Page 35: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? Limits of the Substitution Matrices

I know… But at least, could I get some idea of when they are likely to do all right ?

Page 36: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

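How Can We Compare Sequences ? The Twilight Zone

[Figure: % Sequence Identity plotted against alignment Length. Upper region: Similar Sequence, Similar Structure, Same 3D Fold. Lower region, below about 30% identity: the Twilight Zone, Different Sequence, Structure ???? Axis ticks as extracted: 100, 30%, 30.]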

Page 37: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? The Twilight Zone

Substitution Matrices Work Reasonably Well on Sequences that have more than 30% identity over more than 100 residues
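The rule of thumb above can be checked on an existing alignment with a short sketch like the one below. Percent identity has several definitions; here it is computed over the columns where neither sequence has a gap, which is an assumption. The function names and the default thresholds (taken from the slide: 30% and 100 residues) are illustrative.

```python
def aligned_columns(row_a, row_b):
    """Columns of the alignment where neither row has a gap."""
    return [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]

def percent_identity(row_a, row_b):
    """Identical columns as a percentage of the gap-free columns."""
    columns = aligned_columns(row_a, row_b)
    if not columns:
        return 0.0
    identical = sum(a == b for a, b in columns)
    return 100.0 * identical / len(columns)

def outside_twilight_zone(row_a, row_b, min_identity=30.0, min_length=100):
    """True when the alignment clears both thresholds quoted on the slide."""
    long_enough = len(aligned_columns(row_a, row_b)) > min_length
    return long_enough and percent_identity(row_a, row_b) > min_identity

if __name__ == "__main__":
    a = "ADKPRRP---LS-YMLWLN"
    b = "ADKPKRPKPRLSAYMLWLN"
    print(round(percent_identity(a, b), 1))
    print(outside_twilight_zone(a, b))  # too short to trust, despite high identity
```

The example shows why both thresholds matter: a short alignment can be highly identical and still fall short of the length criterion.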


Page 42: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? Which Matrix Shall I Use

The Initial PAM matrix was computed on 80% similar Proteins

It has been extrapolated to more distantly related sequences.

Pam 250

Pam 350

Other Matrices Exist:

BLOSUM 42

BLOSUM 62

Page 43: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

How Can We Compare Sequences ? Which Matrix Shall I Use

Choosing The Right Matrix may be Tricky…

PAM: Distant Proteins, High Index (PAM 350)

BLOSUM: Distant Proteins, Low Index (Blosum30)

• GONNET 250 > BLOSUM62 > PAM 250.

• But This will depend on:

  • The Family.

  • The Program Used and Its Tuning.

  • Insertions, Deletions?

Page 44: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

HOW Can we Align Two Sequences ?

Dot Matrices

Global Alignments

Local Alignment

Page 45: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Page 46: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices

QUESTION

What are the elements shared by two sequences ?

Page 47: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices

>Seq1
THEFATCAT
>Seq2
THELASTCAT

[Figure: dot matrix with T H E F A T C A T along one axis and T H E F A S T C A T along the other.]

Window

Stringency

Page 48: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices

Sequences Window size

Stringency

Page 49: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices: Stringency

Window=1, Stringency=1

Window=11, Stringency=7

Window=25, Stringency=15
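The window and stringency settings shown above can be read as: slide a window of the given size along both sequences, and draw a dot wherever at least "stringency" positions in the two windows are identical. The sketch below follows that reading, which is an assumption about how the plotted program works; the function names are hypothetical, and real dot-plot tools often score windows with a substitution matrix rather than counting identities.

```python
def dot_matrix(seq1, seq2, window=1, stringency=1):
    """Return a grid of booleans: dots[i][j] is True when the windows starting
    at seq1[i] and seq2[j] share at least `stringency` identical positions."""
    rows = max(len(seq1) - window + 1, 0)
    cols = max(len(seq2) - window + 1, 0)
    dots = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            row.append(matches >= stringency)
        dots.append(row)
    return dots

def render(dots):
    """Draw the grid as text, one '*' per dot."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)

if __name__ == "__main__":
    print(render(dot_matrix("THEFATCAT", "THEFASTCAT", window=1, stringency=1)))
    print()
    print(render(dot_matrix("THEFATCAT", "THEFASTCAT", window=3, stringency=3)))
```

A larger window with a high stringency removes isolated dots and leaves mostly the diagonals that correspond to shared segments.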

Page 50: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices


Page 51: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices

Page 52: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices

Page 53: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices

Page 54: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices

Page 55: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Dot Matrices: Limits

- Visual aid

- Best Way to EXPLORE the Sequence Organisation

- Does NOT provide us with an ALIGNMENT

wheat --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
        | | |||||||| || | |||    ||| | ||||| ||||
????? KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA

Page 56: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Global Alignments

- Take 2 Nice Protein Sequences

- A good Substitution Matrix (blosum)

- A Gap opening Penalty (GOP)

- A Gap extension Penalty (GEP)

Affine Gap Penalty

[Figure: gap Cost plotted against gap length L, with the GOP and GEP contributions labelled.]

Parsimony: Evolution takes the simplest path

(So We Think…)

Page 57: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Insertions and Deletions

Gap Penalties

• Opening a gap is more expensive than extending it

Seq A GARFIELDTHE----CAT
      |||||||||||    |||
Seq B GARFIELDTHELASTCAT

gap

Gap Opening Penalty

Gap Extension Penalty

Page 58: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Global Alignments

-Take 2 Nice Protein Sequences

-A good Substitution Matrix (blosum)

-A Gap opening Penalty (GOP)

-A Gap extension Penalty (GEP)

>Seq1
THEFATCAT
>Seq2
THEFASTCAT

- DYNAMIC PROGRAMMING

THEFA-TCAT
THEFASTCAT

Page 59: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Global Alignments

Brute Force Enumeration

F A S T

F A T

----FAT
FAST---

---FAT-
FAST---

--F-AT-
FAST---

Number of alignments to enumerate:

$\dfrac{(L_1 + L_2)!}{L_1! \, L_2!}$

DYNAMIC PROGRAMMING
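The count on this slide grows so fast that enumeration is hopeless for real proteins, which is why dynamic programming is used instead. A small sketch that evaluates the slide's formula (the function name `count_alignments` is hypothetical):

```python
import math

def count_alignments(len1, len2):
    """Evaluate (L1 + L2)! / (L1! * L2!), the count quoted on the slide."""
    return math.factorial(len1 + len2) // (
        math.factorial(len1) * math.factorial(len2)
    )

if __name__ == "__main__":
    for len1, len2 in [(3, 4), (10, 10), (100, 100)]:
        print(len1, len2, count_alignments(len1, len2))
```

For FAT against FAST the count is small, but for two sequences of 100 residues it already has dozens of digits.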

Page 60: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Global Alignments: DYNAMIC PROGRAMMING

Dynamic Programming (Needleman and Wunsch)

Match=1 MisMatch=-1 Gap=-1

[Figure: three successive states of the dynamic programming matrix for F A T against F A S T. The first row and column hold the cumulative gap penalties (-1 -2 -3 -4), the remaining cells are filled in step by step, and the final cell holds the score 2.]

F A S T
F A - T
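The matrix filling and traceback illustrated on this slide can be sketched as follows, using the slide's toy scoring scheme (Match=1, MisMatch=-1, Gap=-1). This is a minimal version with a linear gap penalty rather than the affine GOP/GEP scheme, and it returns just one of the possibly several optimal alignments; the function name is hypothetical.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Each cell is the best of: substitution, gap in seq_b, gap in seq_a.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    total, row_a, row_b = needleman_wunsch("FAST", "FAT")
    print(total)
    print(row_a)
    print(row_b)
```

On FAST against FAT this reproduces the slide's result: a score of 2 with the alignment FAST over FA-T.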

Page 61: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Global Alignments: DYNAMIC PROGRAMMING

Global Alignments are very sensitive to gap Penalties

GOP

GEP

Page 62: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Global Alignments: DYNAMIC PROGRAMMING

Global Alignments are very sensitive to gap Penalties

Global Alignments do not take into account the MODULAR nature of Proteins

C: K vitamin dep. Ca Binding

K: Kringle Domain

G: Growth Factor module

F: Finger Module

Page 63: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Local Alignments

GLOBAL Alignment

LOCAL Alignment

Smith And Waterman (SW)=LOCAL Alignment
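The difference from the global method is small in code: a cell may never drop below zero, the alignment ends at the highest-scoring cell, and the traceback stops at the first zero. The sketch below is a simplified illustration that reuses the toy scores from the global alignment example (Match=1, MisMatch=-1, Gap=-1), an assumption made for continuity; real searches use a substitution matrix and affine gap penalties. The function name is hypothetical.

```python
def smith_waterman(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Local alignment by dynamic programming.
    Returns (score, aligned_a, aligned_b) for one best-scoring local match."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere: this is what
            # makes the method local rather than global.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    print(smith_waterman("XXXCATYYY", "ZZCATZZ"))
```

Only the shared segment is reported; the unrelated flanking residues do not drag the score down as they would in a global alignment.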

Page 64: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Local Alignments

We now have a PairWise Comparison Algorithm,

We are ready to search Databases

Page 65: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Database Search

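[Figure: a QUERY (Q) is compared by the Comparison Engine (SW) with every sequence in the Database; each database sequence receives a score and an E-value. Example values as extracted from the figure: scores 10, 3, 1, 6, 20, 15, 13; E-values 1.10e-100, 1.10e-20, 1.10e-2, 1.10e-1.]

E-values: How many times do we expect such an Alignment by chance?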

Page 66: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Page 67: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

CONCLUSION

Page 68: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

Sequence Comparison

- Thanks to evolution, We CAN compare Sequences

- There is a relation between Sequence and Structure.

- Substitution matrices only work well with similar Sequences (More than 30% id).

- The Easiest way to Compare Two Sequences is a dotplot.

Page 69: Comparing Two Protein Sequences

Cédric Notredame (21/04/23)

A few Addresses
