identification of specificity-determining positions in protein alignments

Identification of specificity-determining positions in protein alignments Mikhail Gelfand Research and Training Center “Bioinformatics” Institute for Information Transmission Problems, RAS ECCB2005, Madrid


Page 1: Identification  of specificity-determining positions in protein alignments

Identification of specificity-determining positions

in protein alignments

Mikhail Gelfand

Research and Training Center “Bioinformatics”

Institute for Information Transmission Problems, RAS

ECCB2005, Madrid

Page 2: Identification  of specificity-determining positions in protein alignments

Motivation

• Large protein families with general function assigned by homology, not much functional information

• Much less structural data. Not many structures with substrates, cofactors etc.

• Some specificity assignments from comparative genomics

=> Search for specificity-determining positions in alignments

– identification of functional sites

– prediction of specificity

– understanding and eventually re-design of function

Page 3: Identification  of specificity-determining positions in protein alignments

Specificity (of transporters) from comparative genomics – three examples

1. New specificities in a little-studied family

[Figure: phylogenetic tree of the transporter family. Legend of regulatory elements: S-box (rectangle frame), MetJ (circle frame), LYS-element (circles), Tyr-T-box (rectangles). Labelled groups: LysT, MetT, TyrT, MleN, malate/lactate. Taxon labels: Archaea, clostridia, Pasteurellaceae. Individual leaf labels (locus tags) are not reproduced.]

Page 4: Identification  of specificity-determining positions in protein alignments

2. Misleading homology: The PnuC family of transporters

The RFN elements

The THI elements

Page 5: Identification  of specificity-determining positions in protein alignments

3. A nightmare. The NiCoT family of nickel-cobalt transporters

Page 6: Identification  of specificity-determining positions in protein alignments

SDP (Specificity-Determining Position)

Alignment position that is conserved within groups of proteins having the same specificity (specificity groups) but differs between them

SDP is not equivalent to a functionally important position

Page 7: Identification  of specificity-determining positions in protein alignments

Measure of specificity: mutual information

$$I_p = \sum_{i \,\in\, \text{all specificity groups}} \;\; \sum_{\alpha \,\in\, \text{all amino acids}} f_p(\alpha, i) \, \log \frac{f_p(\alpha, i)}{f_p(\alpha)\, f(i)}$$

$f_p(\alpha, i)$ = count of amino acid α in group i at position p divided by the total number of sequences

$f_p(\alpha)$ = frequency of amino acid α in position p

$f(i)$ = fraction of proteins in group i
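The mutual information of one alignment column can be computed directly from the three frequencies defined on this slide. The following is a minimal sketch, not the SDPpred implementation: the function name `mutual_information` is hypothetical, the natural logarithm is an assumption (the slide does not state the base), and gap characters are treated like any other symbol.

```python
import math
from collections import Counter


def mutual_information(column, groups):
    """Mutual information between residue identity and specificity group
    for a single alignment column.

    column: sequence of residues, one per protein.
    groups: sequence of group labels, in the same order as `column`.
    """
    if len(column) != len(groups):
        raise ValueError("column and groups must have the same length")
    n = len(column)
    joint = Counter(zip(column, groups))   # counts of (amino acid, group)
    residue_counts = Counter(column)       # counts of each amino acid
    group_counts = Counter(groups)         # sizes of the groups
    info = 0.0
    for (aa, group), count in joint.items():
        f_joint = count / n                # f_p(alpha, i)
        f_aa = residue_counts[aa] / n      # f_p(alpha)
        f_group = group_counts[group] / n  # f(i)
        info += f_joint * math.log(f_joint / (f_aa * f_group))
    return info


labels = ["g1"] * 3 + ["g2"] * 3
perfect = mutual_information("AAAGGG", labels)    # conserved within groups, differs between
conserved = mutual_information("AAAAAA", labels)  # fully conserved column
```

A column that is conserved within each group but differs between the groups reaches the maximum for two equal groups (log 2), while a fully conserved column scores zero, which illustrates why an SDP is not the same thing as a conserved position.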

Page 8: Identification  of specificity-determining positions in protein alignments

Taking into account the structure of the phylogenetic tree:

random shuffling and linear regression

Z-score:

$$Z_p = \frac{I_p - I_p^{\exp}}{\sigma\!\left(I_p^{\exp}\right)}$$

[Plot labels: min, linear regression]

=> positions that are more specific than expected given the tree
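The shuffling part of this procedure can be sketched as follows. This is an illustration under stated assumptions: the residues of a column are permuted across sequences to estimate the expected mutual information and its standard deviation, and the linear-regression step mentioned on the slide is not reproduced. The names `shuffle_z_score`, `n_shuffles` and `seed` are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(column, groups):
    """Mutual information between residues and group labels (natural log)."""
    n = len(column)
    joint = Counter(zip(column, groups))
    aa_counts = Counter(column)
    group_counts = Counter(groups)
    return sum(
        (c / n) * math.log((c / n) / ((aa_counts[a] / n) * (group_counts[g] / n)))
        for (a, g), c in joint.items()
    )


def shuffle_z_score(column, groups, n_shuffles=1000, seed=0):
    """Z-score of the observed mutual information against shuffled columns.

    Assumption: the null model permutes residues within the column, which
    keeps the amino acid composition and the group sizes fixed.
    """
    rng = random.Random(seed)
    observed = mutual_information(column, groups)
    shuffled = list(column)
    samples = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        samples.append(mutual_information(shuffled, groups))
    mean = sum(samples) / n_shuffles
    variance = sum((s - mean) ** 2 for s in samples) / (n_shuffles - 1)
    sd = math.sqrt(variance)
    if sd == 0.0:
        return 0.0  # e.g. a fully conserved column: shuffling changes nothing
    return (observed - mean) / sd


labels = ["g1"] * 5 + ["g2"] * 5
z_sdp = shuffle_z_score("AAAAAGGGGG", labels)
z_conserved = shuffle_z_score("AAAAAAAAAA", labels)
```

Permuting within a column is only one possible null model; it ignores the phylogenetic tree, which is exactly what the regression step on the slide is meant to account for.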

Page 9: Identification  of specificity-determining positions in protein alignments

Smoothing: pseudocounts and similarity between amino acid residues

• m(ab) = amino acid substitution matrix

• n(a,i) = count of amino acid a at position i

Page 10: Identification  of specificity-determining positions in protein alignments

Automated threshold setting: the Bernoulli estimator

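Are 5 SDPs with Z-score > 12 better than 10 SDPs with Z-score > 9?

Sort the Z-scores: $Z_1 \ge Z_2 \ge \dots \ge Z_k \ge \dots$

$$k^* = \arg\min_k P(\text{there are at least } k \text{ observed Z-scores} \ge Z_k) = \arg\min_k \sum_{i=k}^{n} C_n^i \, p^i q^{\,n-i}$$

$$p = P(Z \ge Z_k) = \frac{1}{\sqrt{2\pi}} \int_{Z_k}^{\infty} \exp\!\left(-\frac{Z^2}{2}\right) dZ, \qquad q = 1 - p$$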
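The cutoff selection can be sketched in a few lines. This is a minimal illustration, assuming that the Z-scores are ranked in decreasing order, that n is the total number of scored positions, and that the null distribution of a Z-score is standard normal as in the integral on this slide. The function names `upper_tail`, `binom_tail` and `bernoulli_cutoff` are hypothetical.

```python
import math


def upper_tail(z):
    """p = P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def binom_tail(n, k, p):
    """P(at least k successes in n Bernoulli trials with success probability p)."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * q^(n - i), computed in log space to avoid overflow
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def bernoulli_cutoff(z_scores):
    """Return (k_star, probabilities) for Z-scores ranked in decreasing order."""
    ranked = sorted(z_scores, reverse=True)
    n = len(ranked)
    probs = [binom_tail(n, k, upper_tail(ranked[k - 1])) for k in range(1, n + 1)]
    k_star = min(range(1, n + 1), key=lambda k: probs[k - 1])
    return k_star, probs


scores = [12.0, 11.5, 11.0, 0.5, 0.1, -0.3, -1.0, -1.2]
k_star, probs = bernoulli_cutoff(scores)
```

With three clearly outlying scores the minimum falls at k = 3, so the three top positions would be reported as SDPs. The list `probs` corresponds to the curve that a probability plot of the estimator would show.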

Page 11: Identification  of specificity-determining positions in protein alignments

Other similar techniques

• Evolutionary trace (Lichtarge et al. 1996, 1997) – need structure; gradual construction of group-specific consensus

• Evolutionary rate shifts (DIVERGE, Gu et al. 2002) – positions with group-specific evolutionary rate

• Surface patches of slowly evolving residues (Rate4Site, Pupko et al. 2002) – need structure

• PCA in the sequence space (Casari et al., 1995)

• Correlated mutations (Pazos and Valencia, 2002)

• Prediction of functional sub-types (Hannenhalli and Russell, 2000) – relative entropy of HMM profiles for groups

Page 12: Identification  of specificity-determining positions in protein alignments

SDPpred: Web interface

Input: multiple alignment of proteins divided into specificity groups

=== AQP ===
%sp|Q9L772|AQPZ_BRUME
-------------------------------------mlnklsaeffgtfwlvfggcgsailaa--afp-------elgigflgvalafgltvltmayavggisg--ghfnpavslgltviiilgsts------------------------------slap------------------qlwlfwvaplvgavigaiiwkgllgrd---------------------------------------
%sp|P48838|AQPZ_ECOLI
-------------------------------------mfrklaaecfgtfwlvfggcgsavlaa--gfp-------elgigfagvalafgltvltmafavghisg--ghfnpavtiglwalvihgatd------------------------------kfap------------------qlwffwvvpivggiiggliyrtllekrd--------------------------------------
%tr|Q92ZW9
-------------------------------------mfkklcaeflgtcwlvlggcgsavlas--afp-------qvgigllgvsfafgltvltmaytvggisg--ghfnpavslglaviiilgsth------------------------------rrvp------------------qlwlfwiaplfgaaiagivwksvgeefrpvd-----------------------------------
=== GLP ===
%sp|P11244|GLPF_ECOLI
----------------------------msqt---stlkgqciaeflgtglliffgvgcvaalkvag---------a-sfgqweisviwglgvamaiyltagvsg--ahlnpavtialwlglilaltd------------------------------dgn--------------g-vpr-flvplfgpivgaivgafayrkligrhlpcdicvveek--etttpseqkasl--------------
%sp|P44826|GLPF_HAEIN
----------------------------mdks-----lkancigeflgtalliffgvgcv
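The input shown on this slide appears to consist of `=== NAME ===` lines that open a specificity group, `%identifier` lines that open a sequence, and the aligned sequence itself. The parser below is a sketch of how such a file could be read; the layout is inferred from the slide example only (in particular, that each identifier sits on its own line), and the function name `parse_grouped_alignment` is hypothetical.

```python
def parse_grouped_alignment(text):
    """Parse a grouped alignment into {group: {sequence_id: aligned_sequence}}.

    Assumed layout (inferred from the slide, not from a format specification):
      === GROUP ===   starts a specificity group
      %identifier     starts a sequence within the current group
      other lines     aligned residues, possibly wrapped over several lines
    """
    groups = {}
    group = None
    seq_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("===") and line.endswith("==="):
            group = line.strip("= ")
            groups.setdefault(group, {})
            seq_id = None
        elif line.startswith("%"):
            if group is None:
                raise ValueError("sequence header before any group header")
            seq_id = line[1:].strip()
            groups[group][seq_id] = ""
        else:
            if seq_id is None:
                raise ValueError("sequence data before any sequence header")
            groups[group][seq_id] += line
    return groups


example = """=== AQP ===
%seqA
--mlnk
lsae--
%seqB
--mfrk
laae--
=== GLP ===
%seqC
msqtst
lkgq--
"""
parsed = parse_grouped_alignment(example)
```

The toy sequences in `example` are placeholders; in real use every aligned sequence should have the same length, which is worth checking before scoring columns.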

Page 13: Identification  of specificity-determining positions in protein alignments

SDPpred: Output

Alignment of the family with the SDPs highlighted (Alignment view)

Detailed description of each SDP (List of SDPs)

Plot of probabilities used by the Bernoulli estimator to set the cutoff (Probability plot view)

Page 14: Identification  of specificity-determining positions in protein alignments

Transcription factors from the LacI family

• Training set: 459 sequences, average length: 338 amino acids, 85 specificity groups

– 44 SDPs

LacI from E. coli:

10 residues contact NPF (analog of the effector)

6 residues in the intersubunit contacts

7 residues contact the operator sequence

7 residues in the effector contact zone (5 Å < dmin < 10 Å)

5 residues in the intersubunit contact zone (5 Å < dmin < 10 Å)

6 residues in the operator contact zone (5 Å < dmin < 10 Å)

Page 15: Identification  of specificity-determining positions in protein alignments

SDP clusters at the subunit contact region

LacI (lactose repressor) from E. coli (1jwl)

[Figure labels: Effector, DNA operator, Cluster I, Cluster II]

Page 16: Identification  of specificity-determining positions in protein alignments

Overall statistics (LacI of E. coli)

• Total 348 amino acids

• 44 SDPs

Non-contacting residues (distance to the DNA, effector, or the other subunit > 10 Å)

Contact zone (may be functional)

Contacting residues (distance to the DNA, effector, or the other subunit < 5 Å)
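The three categories used in these statistics follow from the minimal distance between a residue and its partner (DNA, effector, or the other subunit). A small sketch, assuming atom coordinates in Å are already available as tuples; the function names are hypothetical, and assigning a distance of exactly 5 Å or 10 Å to the outer category is an assumption, since the slide leaves the boundaries open.

```python
import math


def min_distance(residue_atoms, partner_atoms):
    """Smallest distance (in Angstrom) between any residue atom and any partner atom."""
    return min(math.dist(a, b) for a in residue_atoms for b in partner_atoms)


def classify_by_distance(d_min, contact_cutoff=5.0, zone_cutoff=10.0):
    """Classify a residue by its minimal distance to the partner.

    < 5 A          -> contacting
    5 A to < 10 A  -> contact zone (may be functional)
    >= 10 A        -> non-contacting
    """
    if d_min < contact_cutoff:
        return "contacting"
    if d_min < zone_cutoff:
        return "contact zone"
    return "non-contacting"


# Hypothetical coordinates, for illustration only
residue = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
ligand = [(4.5, 4.0, 0.0)]
category = classify_by_distance(min_distance(residue, ligand))
```

Counting SDPs per category and comparing the counts with those for all residues of the protein gives the kind of overall statistics summarized on this slide.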

Page 17: Identification  of specificity-determining positions in protein alignments

Membrane channels of the MIP family

• Training set: 17 sequences, average length 280 amino acids, 2 specificity groups: aquaporins and glyceroaquaporins

– 21 SDPs

GlpF from E. coli:

8 residues contact glycerol (substrate) (dmin < 5 Å)

8 residues oriented to the channel

5 residues in the contacts with other subunits

Page 18: Identification  of specificity-determining positions in protein alignments

GlpF (glycerol facilitator) from E. coli (1fx8)

[Figure labels: Cluster I, Cluster II, Subunit I, Substrate (glycerol), Glu43]

Two SDP clusters at the contact of subunits forming the tetramer

20Leu, 24Ile, 108Tyr of one subunit, 193Ser of another subunit

Page 19: Identification  of specificity-determining positions in protein alignments

Overall statistics (GlpF from E.coli)

• Total 281 amino acids

• 21 SDPs

Contacting residues (distance to the substrate, or another subunit < 5 Å)

Non-contacting residues (distance to the substrate, or another subunit > 10 Å)

Contact zone (may be functional)

Page 20: Identification  of specificity-determining positions in protein alignments

Isocitrate/isopropylmalate dehydrogenases: combinations of specificities towards substrate and cofactor

• IDH: catalyzes the oxidation of isocitrate to α-ketoglutarate and CO2 (TCA) using either NAD or NADP as a cofactor in organisms from prokaryotes to higher eukaryotes

• IMDH: catalyzes oxidative decarboxylation of 3-isopropylmalate into 2-oxo-4-methylvalerate (leucine biosynthesis) in prokaryotes and fungi; the cofactor is NAD

[Figure labels: Mitochondria; Archaea, Bacteria, Eukaryota]

Page 21: Identification  of specificity-determining positions in protein alignments

Selecting specificity groups

1. By substrate: all IDHs vs. all IMDHs

2. By cofactor: all NAD-dependent vs. all NADP-dependent

3. Four groups

[Figure: the tree is shown for each of the three groupings, with branches labelled IDH (NAD), IDH (NADP) type I, IDH (NADP) type II and IMDH (NAD)]

Page 22: Identification  of specificity-determining positions in protein alignments

Predicted SDPs

most SDPs near the substrate

SDPs near the substrate and the cofactor

SDPs near the substrate, the cofactor and the other subunit

Page 23: Identification  of specificity-determining positions in protein alignments

SDPs, the cofactor and the substrate

[Figure labels: Substrate (isocitrate); Cofactor (NADP); Nicotinamide nucleotide; Adenine nucleotide]

344Lys, 345Tyr, 351Val: cofactor-specific SDPs, known determinants of specificity to cofactor

100Lys, 104Thr, 105Thr, 107Val, 337Ala, 341Thr: substrate-specific and four-group SDPs, functionally not characterized

NADP-dependent IDH from E. coli (1ai2)

Page 24: Identification  of specificity-determining positions in protein alignments

SDPs predicted for different groupings

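[Figure: diagram of the SDP sets predicted for the three groupings (cofactor-specific SDPs, substrate-specific SDPs, four groups). Set membership of individual residues is not recoverable from the transcript.]

Residues shown: 154Glu, 158Asp, 208Arg, 229His, 231Gly, 233Ile, 287Gln, 300Ala, 305Asn, 308Tyr, 327Asn, 344Lys, 345Tyr, 351Val, 38Gly, 40Asp, 100Lys, 103Leu, 105Thr, 115Asn, 155Asn, 164Glu, 241Phe, 337Ala, 341Thr, 97Val, 98Ala, 104Thr, 107Val, 152Phe, 161Ala, 162Gly, 232Asn, 245Gly, 31Tyr, 323Ala, 36Gly, 45Met

Color code:

• Contacts cofactor

• Contacts substrate AND cofactor

• Contacts substrate

• Contacts substrate AND the other subunit

• Contacts the other subunit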

Page 25: Identification  of specificity-determining positions in protein alignments

Overview

• Transcription factors: contacts with the cofactor and the DNA

• Transporters: contacts with the substrate

• Enzymes: contacts with the substrate and the cofactor

And all:

• contacts between subunits

Page 26: Identification  of specificity-determining positions in protein alignments

Protein-DNA interactions

CRP PurR

IHF TrpR

Entropy at aligned sites (blue plots) and the number of contacts (red: heavy atoms in a base pair at a distance <cutoff from a protein atom)
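The entropy profile of a set of aligned binding sites can be computed column by column. A minimal sketch, assuming Shannon entropy in bits over the nucleotides observed in each column; the slide does not state the exact definition or the logarithm base, and the function names and example sites are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(sites):
    """Entropy of every column of equal-length aligned sites."""
    lengths = {len(site) for site in sites}
    if len(lengths) != 1:
        raise ValueError("all sites must have the same length")
    return [column_entropy(column) for column in zip(*sites)]


# Hypothetical aligned sites, for illustration only
sites = ["TTGTGA", "TTGTGA", "TTGAGA", "TTGCGA"]
profile = entropy_profile(sites)
```

Low entropy marks conserved positions of the site; the slide compares such a profile with the number of protein contacts per base pair.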

Page 27: Identification  of specificity-determining positions in protein alignments

The observed correlation does not depend on the distance cutoff

Page 28: Identification  of specificity-determining positions in protein alignments

CRP/FNR family of regulators

[Figure labels: FNR, HcpR, CooA; Gamma, Desulfovibrio, Desulfovibrio]

Binding motifs:

TGTCGGCnnGCCGACA

TTGTgAnnnnnnTcACAA

TTGTGAnnnnnnTCACAA

TTGATnnnnATCAA

Page 29: Identification  of specificity-determining positions in protein alignments

Correlation between contacting nucleotides and amino acid residues

• CooA in Desulfovibrio spp.

• CRP in Gamma-proteobacteria

• HcpR in Desulfovibrio spp.

• FNR in Gamma-proteobacteria

DD COOA ALTTEQLSLHMGATRQTVSTLLNNLVR
DV COOA ELTMEQLAGLVGTTRQTASTLLNDMIR
EC CRP  KITRQEIGQIVGCSRETVGRILKMLED
YP CRP  KXTRQEIGQIVGCSRETVGRILKMLED
VC CRP  KITRQEIGQIVGCSRETVGRILKMLEE
DD HCPR DVSKSLLAGVLGTARETLSRALAKLVE
DV HCPR DVTKGLLAGLLGTARETLSRCLSRMVE
EC FNR  TMTRGDIGNYLGLTVETISRLLGRFQK
YP FNR  TMTRGDIGNYLGLTVETISRLLGRFQK
VC FNR  TMTRGDIGNYLGLTVETISRLLGRFQK

TGTCGGCnnGCCGACA

TTGTgAnnnnnnTcACAA

TTGTGAnnnnnnTCACAA

TTGATnnnnATCAA

Contacting residues: REnnnR

TG: 1st arginine

GA: glutamate and 2nd arginine
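The three contacting residues can be read off the alignment fragment by column. The sketch below uses four sequences copied from the fragment on this slide (one per factor); the column numbers 15, 16 and 20 (1-based within the fragment) are an assumption obtained by locating the REnnnR pattern in the CRP rows, and the function name `contact_signature` is hypothetical.

```python
# Rows copied from the alignment fragment on the slide (one per factor)
FRAGMENT = {
    "DD COOA": "ALTTEQLSLHMGATRQTVSTLLNNLVR",
    "EC CRP": "KITRQEIGQIVGCSRETVGRILKMLED",
    "DD HCPR": "DVSKSLLAGVLGTARETLSRALAKLVE",
    "EC FNR": "TMTRGDIGNYLGLTVETISRLLGRFQK",
}

# Assumption: 1-based fragment columns of the REnnnR positions
CONTACT_COLUMNS = (15, 16, 20)


def contact_signature(sequence, columns=CONTACT_COLUMNS):
    """Residues found at the given 1-based alignment columns."""
    return "".join(sequence[c - 1] for c in columns)


signatures = {name: contact_signature(seq) for name, seq in FRAGMENT.items()}
```

Under this column assignment CRP and HcpR share the triplet R, E, R, whereas CooA has R, Q, T and FNR has V, E, R, which is the kind of residue-to-motif correspondence the slide points at.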

Page 30: Identification  of specificity-determining positions in protein alignments

The correlation holds for other factors in the family

Factor | Organisms | Consensus | Specific aa | Metabolic system | Inducer
CRP | Enterobacteria & Vibrio & Pasteurellaceae | TTGTGAnnnnnnTCACAA | R E R | catabolic repression | cAMP
VFR | Pseudomonas sp. | TTGTGAnnnnnnTCACAA | R E R | virulence | cAMP
CLP | Xanthomonas & Xylella sp. | nTGTGAnnnnnnTCACAn | R E R | phytopathogenicity | ? (not cAMP)
FNR & ANR | Gamma-proteobacteria | nnTTGATnnnnATCAAnn | V E R | response to anaerobiosis | O2, NO
FNR | Beta-proteobacteria | nnTTGATnnnnATCAAnn | L E R | response to anaerobiosis | O2
FNR & FixK | Alpha-proteobacteria | nnTTGATnnnnATCAAnn | I/L E R | nitrogen fixation | O2
DNR & Nnr | Pseudomonas & Paracoccus | nnTTGATnnnnATCAAnn | P E R | denitrification | NO, NO2
FNR | Bacillus sp. | nTGTGAnnTAnnTCACAn | R E R | response to anaerobiosis | O2-low conditions
PrfA | Listeria | nnTTAACAnnTGTTAAnn | S S R | virulence | ?
NtcA | Cyanobacteria | ntGTAnCnnnnGnTACan | R V R | nitrogen metabolism | 2-oxoglutarate
CysR | Cyanobacteria | ? | R V R | sulfate utilization | sulfate?
CooA | Desulfovibrio sp. and R. rubrum | nTGTCGGCnnGCCGACAn | R Q T | CO utilization | CO
HcpR* | Desulfovibrio sp. | TTGTgAnnnnnnTcACAA | R E R | prismane & sulfate reduction | ?
HcpR* | Desulfuromonas acetoxidans, Desulfotalea psychrophila | atTTGAccnnggTCAAat | S/P E R | prismane | ?
HcpR* | Clostridia, Bacteroides, Thermotogales, Fusobacteria, Treponema | ctGTAACawwtCTTACag | R P R | prismane | ?
HcpR* | ~P. gingivalis | nTGTCGCnnnnGCGACAn | R A R | prismane | ?
HcpR* | ~C. difficile | nnGGATnnnnnnATCCnn | R S R | prismane | ?
HcpR* | ~T. tengcongensis, D. hafniense | nTGTGAnnnnnnTCACAn | R E R | prismane | ?
HcpR* | ~Acidithiobacillus ferrooxidans | nCTTGATTnnAATCAAGn | P E R | prismane | ?
ArcR | Bacillus, Enterococcus sp. | nTGTGAnATATnTCACAn | R E A/S | arginine catabolism | O2
CprK | Desulfitobacterium dehalogenans | nnTTAnTGnnCAnTAAnn | H V R/K | halorespiration | aromatics
FlpA&B | Lactococcus lactis | nnTTGATnnnnATCAAnn | P E R | ? | Eh, O2

Page 31: Identification  of specificity-determining positions in protein alignments

Plans and perspectives. Protein-DNA interactions

[Figure: LacI family of transcriptional regulators (each branch represents a subfamily), shown as a tree with branches numbered 1 to 43 and the corresponding binding-signal logos. Each orthologous group is reduced to a single representative. The branch colour denotes the feeder pathway regulated: D-galactose & galactosides, maltose & trehalose, sucrose, D-fructose, D-ribose, D-xylose. The experimental data available for at least one regulator of an orthologous group is shown by the typeface of species designations (regulon predicted de novo; signal proposed de novo, new regulon members proposed; new sites predicted; experimentally confirmed sites) and by the branch outline thickness (experimentally confirmed regulation; the thicker line indicates an experimentally confirmed pathway). The logo numbering corresponds to the branch numbers of the tree. Individual leaf labels are not reproduced.]

Page 32: Identification  of specificity-determining positions in protein alignments

… and their signals

1605 regulators from 189 genomes, forming 302 groups of orthologs and binding 2518 sites

Page 33: Identification  of specificity-determining positions in protein alignments

Plans and perspectives. Experimental verification

• A new family of Ni/Co transporters

• No structural data

• Specificity predicted by comparative genomics

• Predicted SDPs form several clusters in the alignment, are located on the same sides of alpha-helices

• Mutational analysis

Page 34: Identification  of specificity-determining positions in protein alignments

Terminators of translation in prokaryotes / decoding of stop-codons. Specificity of RF1 (UAG, UAA) and RF2 (UGA, UAA)

Fragment of the alignment (117 pairs). SDPs are shown by black boxes above the alignment.

Page 35: Identification  of specificity-determining positions in protein alignments

“Interesting” positions: invariant, SDPs, variable rate.

Page 36: Identification  of specificity-determining positions in protein alignments

SDPs and invariant positions: two decoding sites?

Page 37: Identification  of specificity-determining positions in protein alignments

Plans and perspectives

• Use of 3D structures, when available. Identification of functional sites as spatial clusters of SDPs and conserved positions

• Automated identification of specificity groups based on the analysis of the phylogenetic tree

• Protein-DNA interactions

• Identification of protein-protein contact surfaces

Page 38: Identification  of specificity-determining positions in protein alignments

Publications

• N.J.Oparina, O.V.Kalinina, M.S.Gelfand, L.L.Kisselev (2005) Common and specific amino acid residues in the prokaryotic polypeptide release factors RF1 and RF2: possible functional implications. Nucleic Acids Research 33 (in press).

• O.V.Kalinina, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Science 13: 443-456.

• O.V.Kalinina, P.S.Novichkov, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova (2004) SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Research 32: W424-W428.

• O.V.Kalinina, M.S.Gelfand, A.A.Mironov, A.B.Rakhmaninova (2003) Amino acid residues forming specific contacts between subunits in tetramers of the membrane channel GlpF. Biophysics (Moscow) 48: S141-S145.

• L.A.Mirny, M.S.Gelfand (2002) Using orthologous and paralogous proteins to identify specificity determining residues in bacterial transcription factors. Journal of Molecular Biology 321: 7-20.

• L.Mirny, M.S.Gelfand (2002) Structural analysis of conserved base-pairs in protein-DNA complexes. Nucleic Acids Research 30: 1704-1711.

• http://math.belozersky.msu.ru/~psn/

Page 39: Identification  of specificity-determining positions in protein alignments

Acknowledgements

• Leonid Mirny (Harvard, MIT)

• Olga Kalinina

• Andrei A. Mironov

• Alexandra B. Rakhmaninova

• Dmitry Rodionov

• Olga Laikova

• Howard Hughes Medical Institute

• Ludwig Institute of Cancer Research

• Russian Fund of Basic Research

• Russian Academy of Sciences, programs “Molecular and Cellular Biology” and “Origin and Evolution of the Biosphere”
