Mining plant pathogen genomes for effectors
DESCRIPTION
Presentation given as part of the EMBO Workshop on Plant-Microbe Interactions, at The Sainsbury Laboratory, Norwich, 20th June 2012. This presentation describes bioinformatic and statistical considerations for the prediction of plant pathogen effectors from genome sequences and annotation, with several literature examples.
TRANSCRIPT
Mining pathogen genomes for effectors
Leighton Pritchard
The overall goal
l Starting from a genome sequence, identify genes that code for candidate effectors (or, starting from the gene product complement, identify candidate effectors)
What is an effector?
l Molecule produced by a pathogen that (directly?) modifies host molecular/biochemical ‘behaviour’, e.g.
l Inhibits enzyme action (Cladosporium fulvum AVR2, AVR4; Phytophthora infestans EPIC1, EPIC2B; P. sojae glucanase inhibitors)
l Cleaves protein target (Pseudomonas syringae AvrRpt2)
l (De-)phosphorylates protein target (Pseudomonas syringae AvrRPM1, AvrB)
l Additional component in/retargeting host system, e.g. E3 ligase activity (P. syringae AvrPtoB; P. infestans Avr3a)
l Regulatory control (Xanthomonas campestris AvrBs3, TAL effectors)
What is an effector?
l No unifying biochemical mechanism; may act inside or outwith the host cell
l No formal, agreed definition (direct/indirect action; structural damage – PCWDEs, etc.)
l No single ‘test for candidate effectors’
l Really testing for protein family membership and/or evidence of ‘effector-like behaviour’
l A general sequence classification problem (functional annotation)
l Many possible bioinformatic/computational approaches
l No big red button
Surgery without knife skills?
Before we start…
A F 4 7 “If a card has a vowel on one side, it has an even number on the other side.” Which card(s) are useful to turn over to test this proposition?
Before we start…
Cards: A F 4 7
Candidate answers: A 7 / F 4 / A 4 / F 7
(Correct answer: A and 7.)
Wason Selection Task: confirmation bias, context
Why is this relevant?
(2x2 grid: effector / not effector versus RxLR / not RxLR)
“If a protein has an RxLR motif, it is an effector.” Which experiments are useful to perform to test this proposition?
Effector Club
The first rule of finding effectors is:
You are not finding effectors
Effector Club
l Classification of sequences is modelling
l simplified representation of reality
l criteria based on known effectors
l Identifies candidate effectors
l experimental verification required
l General bioinformatic problem
l specifics vary for each classifier (model)
Sequence space
An abstract concept
Sequence space
Each point is a sequence
Sequence space
d1 < d2: distance reflects sequence similarity
Sequence space
Known exemplar: red
Sequence space
Define distance from the example ≈ ‘similar’
Sequence space
‘similar’ sequences are the same class (e.g. function)
Sequence space
Known exemplars: red
Sequence space
Define a centre, and a distance that includes the examples
Sequence space
Classify ‘similar’ sequences
Finding effectors
l Simple:
1. Have one or more examples of your effector (class)
2. Define some kind of appropriate threshold of similarity
3. Check all the gene/gene product sequences in the genome against that threshold
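The three-step recipe can be sketched directly; this is a toy illustration, with `difflib`'s `SequenceMatcher` standing in for a real alignment-based similarity measure (BLAST, HMMer, etc.), and invented sequences and threshold:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude stand-in for a real alignment-based score (BLAST, HMMer, ...)."""
    return SequenceMatcher(None, a, b).ratio()

def candidate_effectors(exemplars, proteome, threshold=0.6):
    """Steps 1-3: flag any sequence within 'threshold' similarity
    of at least one known exemplar."""
    hits = {}
    for name, seq in proteome.items():
        best = max(similarity(seq, ex) for ex in exemplars)
        if best >= threshold:
            hits[name] = best
    return hits

exemplars = ["MRSLVATLLCFAALAS"]        # step 1: known effector(s) (toy data)
proteome = {
    "g1": "MRSLVATLLCFAVLAS",          # near-identical to the exemplar
    "g2": "WWWWGGGGPPPPHHHH",          # unrelated sequence
}
print(candidate_effectors(exemplars, proteome))  # → {'g1': 0.9375}
```

The rest of the talk is about why each of these three steps hides real difficulty.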
There are 50 slides to go… it’s not that simple
It’s not that simple
l How do we define ‘distance’?
l How large a ‘distance’ do we take?
l How do we know we’ve chosen a sensible ‘distance’?
Characteristics of known effectors
l Modularity
l Delivery: localisation/translocation domain(s)
l Activity: functional/interaction domain(s)
l Sequence motifs
l Localisation/translocation domain(s) often common to effector class (e.g. RxLR, T3E)
l Functional domain(s) may be common to effector class (e.g. TAL), or divergent (e.g. RxLR, T3E)
Characteristics of known effectors
Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Collmer A, Lindeberg M, Petnicki-Ocwieja T, Schneider DJ, Alfano JR (2002) Genomic mining type III secretion system effectors in Pseudomonas syringae yields new picks for all TTSS prospectors. Trends in Microbiology 10: 462–469.
Characteristics of known effectors
Dong S, Yu D, Cui L, Qutob D, Tedman-Jones J, et al. (2011) Sequence Variants of the Phytophthora sojae RXLR Effector Avr3a/5 Are Differentially Recognized by Rps3a and Rps5 in Soybean. PLoS ONE 6: e20172. doi:10.1371/journal.pone.0020172.t004.
Bouwmeester K, Meijer HJG, Govers F (2011) At the frontier; RXLR effectors crossing the Phytophthora-host interface. Frontiers in Plant-Microbe Interactions 10.3389
Characteristics of known effectors
l Modularity
l Delivery: localisation/translocation domain(s)
l Activity: functional/interaction domain(s)
l Sequence motifs
l Localisation/translocation domain(s) typically common to effector class (e.g. RxLR, T3E, CHxC)
l Functional domain(s) may be common to effector class (e.g. TAL), or divergent (e.g. RxLR, T3E in general)
Boch J, Scholze H, Schornack S, Landgraf A, Hahn S, et al. (2009) Breaking the code of DNA binding specificity of TAL-type III effectors. Science 326: 1509–1512. doi:10.1126/science.1178811.
Characteristics of known effectors
l “Arms races” occur:
l Host defences track effector evolution
l Effectors evade host defences
l Divergence of effectors under selection pressure
l Diversifying selection; divergence may result from evasion of detection, rather than change of biochemical ‘function’
l Effectors may be found preferentially in characteristic locations
l P. infestans ‘gene sparse’ regions
Raffaele S, Win J, Cano LM, Kamoun S (2010) Analyses of genome architecture and gene expression reveal novel candidate virulence factors in the secretome of Phytophthora infestans. BMC Genomics 11: 637. doi:10.1186/1471-2164-11-637.
Characteristics of known effectors
l Application of ‘filters’: reduce the number of sequences to check
l Presence/absence filters:
- SignalP (export signal)
- RxLR/T3SS (translocation signal)
- Expression (used by pathogen)
- Positive selection (suggests arms race)
- etc…
l Workflows (e.g. Galaxy, Taverna) useful here
Fabro G, Steinbrenner J, Coates M, Ishaque N, Baxter L, et al. (2011) Multiple candidate effectors from the oomycete pathogen Hyaloperonospora arabidopsidis suppress host plant immunity. PLoS Pathog 7: e1002348. doi:10.1371/journal.ppat.1002348.
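Presence/absence filtering can be sketched as a series of boolean tests; everything here is illustrative, not a published pipeline: the length/start test is a crude stand-in for SignalP, the `R.LR` pattern and the residue window it is searched in are invented stand-ins for a real RxLR motif search, and the proteome is toy data:

```python
import re

RXLR = re.compile(r"R.LR")   # hypothetical motif pattern, for illustration only

def passes_filters(seq: str) -> bool:
    """Apply presence/absence filters in series; a candidate must pass all."""
    has_export_signal = seq.startswith("M") and len(seq) > 70  # SignalP stand-in
    has_rxlr = RXLR.search(seq, 20, 70) is not None            # N-terminal window
    return has_export_signal and has_rxlr

proteome = {
    "cand1": "M" + "A" * 30 + "RSLR" + "G" * 60,   # has 'signal' and motif
    "cand2": "M" + "A" * 90,                       # no motif: filtered out
}
kept = [name for name, seq in proteome.items() if passes_filters(seq)]
print(kept)  # → ['cand1']
```

Each filter discards candidates, so the order of filters does not change the final set, only the amount of work; workflow systems like Galaxy or Taverna chain such steps.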
Redefining sequence space
l Effectors may share a common module, but otherwise be dissimilar.
l We can emphasise sequence similarity by focusing on the common region
l this is essentially ‘redefining’ sequence space
l brings known effectors ‘together’
l may bring non-effectors with similar sequence closer, too
SSMMMAAAAAAAA
SSMMMBBBBBBBB
Sequence space
Comparing whole sequences: AAAAAAAA vs BBBBBBB
Redefining sequence space
l We can emphasise similarity by focusing on regions common to an effector class, e.g. T3SS, L-FLAK
l this is essentially redefining sequence space
l brings known effectors ‘closer together’
l may bring non-effectors with similar sequence closer, too
Sequence space
Pull domains together, push non-domains away
Building a classifier
l How do we define ‘distance’?
l How large a ‘distance’ do we take?
l How do we know we’ve chosen a sensible ‘distance’?
Defining a distance
l Sequence identity (optimal alignment)
l Derived score (based on sequence identity/alignment)
l Bit score in BLAST
l E-value in BLAST
l Derived score (based on other measures, not alignment)
l Bit score in HMMer
l Clustering (not strictly a distance)
l Sequence identity (e.g. CD-HIT)
l MCL
(we’re really assessing criteria for class membership)
Defining a distance: sequence identity
l Distance between sequences ≈ difference between sequences
l sequence identity: proportion of identical symbols
l e.g. BLAST output
l Gotchas: not always symmetrical; dependent on alignment parameters!
[BLASTN alignment of a 205 nt query: Score = 95.3 bits (51), Expect = 3e-24, Identities = 161/212 (76%), Gaps = 15/212 (7%), Strand = Plus/Plus]
Defining a distance: sequence identity
l Gotchas: not always symmetrical; dependent on alignment parameters!

Score = 4970
Length of alignment = 533
Sequence Solyc11g008000.1.1 : 1 - 529 (Sequence length = 529)
Sequence Solyc11g005920.1.1 : 1 - 688 (Sequence length = 688)
Percentage ID = 32.83

Score = 5040
Length of alignment = 533
Sequence Solyc11g005920.1.1 : 1 - 688 (Sequence length = 688)
Sequence Solyc11g008000.1.1 : 1 - 529 (Sequence length = 529)
Percentage ID = 32.46

(pairwise alignment in Jalview: swapping the order of the same two sequences changes the percentage identity)
Defining a distance: sequence identity
[BLASTP alignment, BLOSUM80 matrix: Query PGSC0003DMP400054265 (length 660) vs Subject PGSC0003DMP400054263 (length 182); Score = 31.4 bits (64), Expect = 4e-05, Identities = 37/134 (28%), Positives = 53/134 (40%), Gaps = 12/134 (8%)]
Defining a distance: sequence identity
[BLASTP alignment, BLOSUM45 matrix, same sequence pair: Score = 36.8 bits (113), Expect = 7e-07, Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%). A different substitution matrix gives a different alignment, and a different identity.]
Defining a distance: beyond identity
[the same BLASTN alignment again: Score = 95.3 bits (51), Expect = 3e-24, Identities = 161/212 (76%)]
Identity ≈ yes/no. We can quantify similarity in ‘bits’.
Defining a distance: bit score and E-value
l Bit score and E-value can be used as distance measures.
l I prefer (normalised) bit scores
l Small changes in score → large changes in E
l E varies linearly with database size and query length; λS is independent of database size
E = kmn·e^(−λS)
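Both behaviours follow directly from the equation: in terms of the normalised bit score S′ = (λS − ln k)/ln 2, Karlin-Altschul statistics give E = mn·2^(−S′), so E scales linearly with the search space m·n while the bit score itself does not change. A small numerical sketch (the 36.8-bit score is from the example above; the query/database lengths are invented):

```python
def evalue(bit_score: float, m: int, n: int) -> float:
    """Karlin-Altschul E-value from a normalised bit score:
    E = m * n * 2**(-S'), where m is query length and n is
    total database length."""
    return m * n * 2.0 ** (-bit_score)

# E grows linearly with database size; the bit score does not change.
print(evalue(36.8, 660, 10_000))
print(evalue(36.8, 660, 1_000_000))   # 100x larger database: 100x larger E

# A small change in score gives a large change in E (exponential in S').
print(evalue(30.0, 660, 10_000))      # ~111x larger E than at 36.8 bits
```

This is why a (normalised) bit score is a more portable distance measure than an E-value: it can be compared across searches against databases of different sizes.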
(Query PGSC0003DMP400054265, length 660; Subject PGSC0003DMP400054263, length 182)
BLOSUM80: Score = 31.4 bits (64), Expect = 4e-05, Identities = 37/134 (28%), Positives = 53/134 (40%), Gaps = 12/134 (8%)
BLOSUM45: Score = 36.8 bits (113), Expect = 7e-07, Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)
The same query/subject pair (PGSC0003DMP400054265 vs PGSC0003DMP400054263) searched against databases of different sizes:
db size 5 sequences: Score = 36.8 bits (113), Expect = 7e-07; Score = 30.0 bits (89), Expect = 8e-05
db size 483 sequences: Score = 36.8 bits (113), Expect = 1e-06; Score = 30.0 bits (89), Expect = 1e-04
db size 644 sequences: ***** No hits found ***** (the E-values now exceed the reporting threshold; the bit scores are unchanged throughout)
Defining a distance: alignment v profile
Sequences:
ACATAT
TCAACT
ACACGC
AGAATC
ACAGAA
consensus: ACAAAT
regular expression: [AT][CG]A[ACGT][ACGT][TCA] or [AT]-[CG]-A-X(2)-{G}
PSSM (position counts):
  123456
A 405221
C 040112
G 010110
T 100112
hidden Markov model (HMM)
Alignments compare two sequences; profiles capture information from several sequences.
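The PSSM count matrix above can be reproduced directly from the five example sequences (the consensus ACAAAT is derived from them, not counted):

```python
from collections import Counter

seqs = ["ACATAT", "TCAACT", "ACACGC", "AGAATC", "ACAGAA"]

def pssm_counts(seqs):
    """Profile as a count matrix: one symbol-frequency Counter
    per alignment column."""
    length = len(seqs[0])
    return [Counter(s[i] for s in seqs) for i in range(length)]

counts = pssm_counts(seqs)
row_A = [counts[i]["A"] for i in range(6)]
print(row_A)  # → [4, 0, 5, 2, 2, 1], the slide's PSSM row for A
```

Regular expressions, PSSMs and profile HMMs are increasingly expressive summaries of the same alignment: a regex only records which symbols are allowed, a PSSM records how often each occurs, and an HMM adds position-specific insertion/deletion probabilities.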
Defining a distance: bit scores in HMMer
l HMMer works differently to BLAST: profile HMMs
l Statistical model of a multiple sequence alignment (not a pairwise sequence alignment)
l phmmer and jackhmmer are the equivalents of BLASTP and PSI-BLAST
l Explicit statistical representation of alignment uncertainty
l Sequence scores, not alignment scores
l Bit score is a ‘log-odds’ bit score:
log-odds = log( P(sequence matches alignment) / P(sequence matches null model) )
Goritschnig S, Krasileva KV, Dahlbeck D, Staskawicz BJ (2012) Computational prediction and molecular characterization of an oomycete effector and the cognate Arabidopsis resistance gene. PLoS Genetics 8: e1002502. doi:10.1371/journal.pgen.1002502.
Haas BJ, Kamoun S, Zody MC, Jiang RHY, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393–398. doi:10.1038/nature08358.
Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.
The null model is a control; the choice of null model can be important.
Sequence matches alignment better than control (null) → log-odds > 0
Sequence matches control (null) better than alignment → log-odds < 0
Sequence matches alignment and control (null) equally → log-odds ≈ 0
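The three cases follow directly from the definition; a toy calculation (the probabilities are invented for illustration, and base 2 gives the score in bits):

```python
import math

def log_odds(p_model: float, p_null: float) -> float:
    """Log-odds bit score: log2 of how much better the profile
    explains the sequence than the null (background) model does."""
    return math.log2(p_model / p_null)

print(log_odds(1e-10, 1e-14))  # model fits better: positive (~13.3 bits)
print(log_odds(1e-14, 1e-10))  # null fits better: negative
print(log_odds(1e-12, 1e-12))  # equally good fit: 0.0
```

Because the score is a ratio against the null model, changing the null (e.g. from uniform background to average database composition) changes every score, which is why the choice of null model matters.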
Defining a distance: bit scores in HMMer
[HMMer search output for the NAM domain (Query: NAM [M=129], Accession PF02365.10, ‘No apical meristem (NAM) protein’): full-sequence bit scores of 171.0 (StNac1_5), 170.2 (NbNac1_1), 167.4 (NbNac2_1) and 165.6 (StNac2_5), followed by per-domain score tables and alignments]
l Easy to read bit scores from HMMer output
Defining a distance: composition
l Sometimes, sequence comparison doesn’t tell you much (e.g. T3 effector signals)
l Can use ‘bulk properties’ of sequence composition
l Many ways to derive a ‘distance’
Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E, et al. (2009) Sequence-based prediction of type III secreted proteins. PLoS Pathog 5: e1000376. doi:10.1371/journal.ppat.1000376.
Defining a ‘distance’: clustering
Not really a distance, more a bound: sequences that cluster with your known examples.
Defining a ‘distance’: CD-HIT clusters
l Clustering tool, online at http://weizhong-lab.ucsd.edu/cd-hit/
l Sequences sorted by decreasing length
l First sequence is representative of the first cluster: ‘seen’
l Consider each remaining sequence in turn: compare with the ‘seen’ set
- Similarity of sequence with a ‘seen’ sequence > threshold? Merge into that cluster
- Otherwise start a new cluster: ‘seen’
l Fast, but can be sensitive to sequence set composition (use multi-step).
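The greedy incremental algorithm described above can be sketched as follows; `difflib` is only a stand-in for CD-HIT's word-filtered identity computation, and the sequences and threshold are illustrative:

```python
from difflib import SequenceMatcher

def greedy_cluster(seqs, threshold=0.9):
    """CD-HIT-style greedy incremental clustering (sketch):
    sort by decreasing length; each sequence joins the first cluster
    whose representative it matches above the threshold, otherwise
    it founds a new cluster and becomes its representative."""
    clusters = []                                  # list of (representative, members)
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if SequenceMatcher(None, seq, rep).ratio() >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

seqs = ["ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT"]
clusters = greedy_cluster(seqs, threshold=0.9)
print([rep for rep, members in clusters])  # → ['ACGTACGTAC', 'TTTTTTTTTT']
```

Because each sequence is only compared against cluster representatives, the result depends on input order and composition; this is the sensitivity the slide warns about, and why multi-step clustering (and testing clusters for robustness) is recommended.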
Clusters need to be tested for robustness.
Defining a ‘distance’: MCL clustering
l Clustering algorithm (used in TribeMCL, OrthoMCL)
l Markov Clustering algorithm
l Finds clusters in networks
l Use BLAST to generate all-vs-all pairwise comparisons
l Results are a network (similarity graph)
l Given such a network:
l Expansion (raise to power) – ‘spreads links’
l Inflation (scaling) – ‘thickens strong links’
Repeated application of the expansion/inflation cycle results in the formation of clusters.
Defining a ‘distance’: MCL clustering
[Figure: expansion and inflation steps applied to an input network, yielding a clustering]
l One key parameter: inflation value
l Need to cluster over several inflation values to confirm robustness (consistency of clustering)
Inflation value   Clusters
1.4               3
2.0               6
4.0               18
6.0               33
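The expansion/inflation cycle can be sketched with NumPy. This is a bare-bones illustration only; TribeMCL and OrthoMCL add edge weighting, pruning, and other refinements, and the function name `mcl` is mine.

```python
import numpy as np

def mcl(adj, inflation=2.0, max_iter=100, tol=1e-6):
    # Add self-loops and column-normalise to get a stochastic flow matrix
    M = adj + np.eye(adj.shape[0])
    M = M / M.sum(axis=0)
    for _ in range(max_iter):
        # Expansion: matrix squaring 'spreads links' (flow along paths)
        expanded = M @ M
        # Inflation: element-wise power, then re-normalise,
        # 'thickens strong links' and weakens faint ones
        inflated = expanded ** inflation
        inflated = inflated / inflated.sum(axis=0)
        if np.abs(inflated - M).max() < tol:
            M = inflated
            break
        M = inflated
    # At convergence, rows with mass on the diagonal are 'attractors';
    # the non-zero columns of an attractor row form one cluster
    clusters = []
    for i in range(M.shape[0]):
        if M[i, i] > tol:
            members = frozenset(np.flatnonzero(M[i] > tol))
            if members not in clusters:
                clusters.append(members)
    return clusters

# Demo: two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
clusters = mcl(adj, inflation=2.0)
```

The bridge edge is weakened by inflation on each cycle, so the two triangles separate into two clusters; raising the inflation value gives finer-grained clusterings, as in the table above.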
Defining a distance l Sequence identity – scores alignment (symmetry?)
l Derived score (based on sequence identity/alignment)
l Bit score in BLAST – scores alignment (substitution matrix)
l E-value in BLAST – scores alignment (sensitive to query/db size, substitution matrix)
l Derived score (based on other measures)
l Bit score in HMMer – scores sequence relative to model (null model?)
l Clustering l Sequence identity (e.g. CD-HIT) – can be sensitive to sequence order (multi-step? test for robustness? CD-HIT uses sequence identity)
l MCL – needs all-vs-all pairwise (test for robustness; uses BLAST E-value by default)
Many definitions of distance
l How do we define ‘distance’?
l How large a ‘distance’ (or what clustering resolution) do we take?
l How do we know we’ve chosen a sensible ‘distance’?
How large a distance do we allow?
l How do we define ‘distance’?
l How large a ‘distance’ (or what clustering resolution) do we take?
l How do we know we’ve chosen a sensible ‘distance’?
Confusion Matrix
l Our distance/boundary classifies sequences as ‘in’ or ‘out’: ‘red’ or ‘blue’
l Changing distance/bound results in various degrees of success…
Confusion matrix:
       IN   OUT
Red    1    5
Blue   1    36
In this matrix: Red/IN = true positive; Red/OUT = false negative; Blue/IN = false positive; Blue/OUT = true negative.
False positive rate    FP/(FP+TN)
False negative rate    FN/(TP+FN)
Sensitivity            TP/(TP+FN)
Specificity            TN/(FP+TN)
False discovery rate   FP/(FP+TP)
False positive rate    1/37 = 0.03
False negative rate    5/6 = 0.83
Sensitivity            1/6 = 0.17
Specificity            36/37 = 0.97
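These definitions translate directly into code. A small helper (the function name is mine) reproduces the numbers on the slide:

```python
def confusion_metrics(tp, fn, fp, tn):
    # Standard statistics derived from a 2x2 confusion matrix
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false_discovery_rate": fp / (fp + tp),
    }

# Slide example: Red/IN = 1 (TP), Red/OUT = 5 (FN),
#                Blue/IN = 1 (FP), Blue/OUT = 36 (TN)
m = confusion_metrics(tp=1, fn=5, fp=1, tn=36)
# sensitivity 1/6 ≈ 0.17, specificity 36/37 ≈ 0.97
```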
Confusion Matrix
l Our distance/boundary classifies sequences as ‘in’ or ‘out’: ‘red’ or ‘blue’
l Changing distance/bound results in various degrees of success…
       IN   OUT
Red    5    2
Blue   4    33
False positive rate    0.11
False negative rate    0.29
Sensitivity            0.71
Specificity            0.89
Confusion Matrix
l Our distance/boundary classifies sequences as ‘in’ or ‘out’: ‘red’ or ‘blue’
l Changing distance/bound results in various degrees of success…
       IN   OUT
Red    7    0
Blue   14   23
False positive rate    0.38
False negative rate    0
Sensitivity            1
Specificity            0.62
ROC Curve
l To assess how well a method performs, can use ROC (Receiver Operating Characteristic) curve
l Typically, we use area under the curve (AUC) to choose between methods
[ROC plot: Sensitivity vs False Positive Rate; classifier curve against the random-guess diagonal. Curves nearer the top-left corner indicate better performance.]
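One way to trace a ROC curve and compute AUC, sketched below. Function names are mine; for a real scored classifier you would sweep the decision threshold over the classifier's scores exactly as here.

```python
def roc_points(scores_pos, scores_neg):
    # Sweep a threshold over all observed scores; at each threshold,
    # 'predict positive if score >= t' gives one (FPR, sensitivity) point
    thresholds = sorted(set(scores_pos + scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    return points

def auc(points):
    # Trapezoidal area under the ROC curve
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

perfect = auc(roc_points([0.9, 0.8], [0.2, 0.1]))  # perfectly separated scores
useless = auc(roc_points([0.6], [0.6]))            # indistinguishable scores
```

A perfect classifier hugs the top-left corner (AUC = 1.0); a classifier no better than random guessing sits on the diagonal (AUC = 0.5).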
ROC Curve
l To assess how well a method performs, can use ROC (Receiver Operating Characteristic) curve
l The ‘best’ parameter setting for a method is typically near the apex.
F-measure
l We can ‘game’ ROC statistics by increasing irrelevant ‘negative’ examples
l Increasing TN ‘improves’ false positive rate and specificity
l Can use precision and recall instead
Precision (PPV)        TP/(TP+FP)
Recall = sensitivity   TP/(TP+FN)
FDR = 1-PPV            FP/(TP+FP)
F-measure
l Precision: Proportion of accurate positive predictions
l Recall: Proportion of positive examples recovered (sensitivity)
l F1 = 2 (precision x recall)/(precision + recall)
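The F1 formula above is a one-liner to implement; here it is applied to the middle confusion matrix from the earlier slides (function name is mine):

```python
def f1_score(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall
    precision = tp / (tp + fp)   # proportion of positive calls that are right
    recall = tp / (tp + fn)      # proportion of real positives recovered
    return 2 * precision * recall / (precision + recall)

# Boundary with TP=5, FP=4, FN=2:
# precision = 5/9 ≈ 0.56, recall = 5/7 ≈ 0.71, F1 = 0.625
value = f1_score(tp=5, fp=4, fn=2)
```

Note that TN does not appear anywhere, which is the point: unlike specificity or FPR, F1 cannot be inflated by padding the dataset with irrelevant negatives.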
F-measure
l The F-measure indicates which set of parameters (which distance) is ‘best’
l Several F-measures available that weight precision and recall differently
[Plot: F-measure across three parameter settings]
How large a distance do we allow?
l Assign known ‘positive’ and ‘negative’ examples
l Vary distances and take F-measure
l Choose distance that gives the best performance
Confusion Matrix
l BUT: how do we know that we’ve chosen a suitable distance? Training set choice is critical
Training set choice
Train classifier on known examples: looks good…
Unrepresentative examples
…but training set is a biased/unrepresentative sample…
Overfitting
…or ‘fits’ known positives unfeasibly tightly
How do we know we’ve chosen a suitable distance?
l How do we define ‘distance’?
l How large a ‘distance’ (or what clustering resolution) do we take?
l How do we know we’ve chosen a sensible ‘distance’?
A trip to the doctor, part I l Routine medical checkup
l Test for disease X (horrible, unpleasant, potentially suppurating)
l Test has sensitivity (i.e. predicts disease where there is disease) of 95%
l Test has false positive rate (i.e. predicts disease where there is no disease) of 1%
l Your test is positive
l What is the probability that you have disease X?
Options: 0.01, 0.05, 0.50, 0.95, 0.99
Cross-validation l Estimation of classifier performance depends on
l distance measure
l composition of training set (‘positives’ and ‘negatives’)
l Cross-validation gives objective measure of performance
l Many strategies available, including:
l leave-one-out (LOO)
l k-fold cross-validation
l repeated (random) subsampling
l Essentially: always keep a hold-out set (not used to train)
k-fold cross-validation l No cross-validation:
l One training set
l No test (hold-out/validation) set
l Risks overfitting
Training Set
Test Set
k-fold cross-validation l Validation:
l One training set, one test (hold-out/validation) set
l Test performance of classifier on unseen data
Training Set
Test Set
k-fold cross-validation l 2-fold cross-validation:
l Two runs, each with one training set, one test set
l Swap training and test sets, collate results
Training Set
Test Set
run1
run2
k-fold cross-validation l 3-fold cross-validation:
l Three runs, each with one training set, one test set
Training Set
Test Set
run1
run2
run3
k-fold cross-validation l k-fold cross-validation:
l k runs, each with one training set, one test set (n items in dataset, k>1)
[Diagram: runs 1…k; each run holds out n/k items as the test set and trains on the remaining n-(n/k)]
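The splitting scheme in these slides can be sketched as follows (an illustrative helper, names mine; real pipelines usually shuffle or stratify the items first):

```python
def kfold_splits(items, k):
    # Partition the n items into k interleaved folds of roughly n/k items;
    # run i holds fold i out as the test set and trains on the rest
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(kfold_splits(list(range(6)), k=3))
# 3 runs; each test set has 2 items, each training set has 4
```

Setting k = n reduces this to leave-one-out cross-validation: n runs, each testing on a single held-out item.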
After cross-validation
False positive rate    0.11
False negative rate    0.29
Sensitivity            0.71
Specificity            0.89
Precision              0.56
• Use cross-validation to find ‘best’ method & parameters
• Cross-validation gives you estimated performance metrics on unseen data
• Apply ‘best’ method to complete dataset for prediction
A trip to the doctor, part II l Test for disease X (horrible, unpleasant, potentially suppurating)
l Test has sensitivity (i.e. predicts disease where there is disease) of 95%
l Test has false positive rate (i.e. predicts disease where there is no disease) of 1%
l Your test is positive
l To calculate the probability that the test correctly determines whether you have the disease, you need to know the baseline occurrence.
Baseline occurrence: 1% ⇒ P(disease|+ve) = 0.490
Baseline occurrence: 80% ⇒ P(disease|+ve) = 0.997
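The arithmetic behind these two figures is Bayes' theorem; a quick check in code (the function name is mine):

```python
def p_disease_given_positive(prevalence, sensitivity, fpr):
    # Bayes: P(D|+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|not D)P(not D)]
    true_pos = sensitivity * prevalence          # P(+|D)P(D)
    false_pos = fpr * (1 - prevalence)           # P(+|not D)P(not D)
    return true_pos / (true_pos + false_pos)

rare = p_disease_given_positive(prevalence=0.01, sensitivity=0.95, fpr=0.01)
common = p_disease_given_positive(prevalence=0.80, sensitivity=0.95, fpr=0.01)
# rare ≈ 0.490, common ≈ 0.997
```

With a 1% baseline, a positive result from this apparently excellent test is still roughly a coin flip, because false positives from the 99% healthy population swamp the true positives.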
What is the baseline occurrence for effectors?
l Usually rely on predictions for expected baseline
l Bacterial genomes: ≈4500 genes
l Type III effectors: 1-10% (Arnold et al. 2009); 1-2% (Collmer et al. 2002); 1% (Boch and Bonas, 2010)
l Oomycete/fungal genomes: ≈20000 genes
l RxLRs: 120-460 (1-2%; Whisson et al. 2007); ≤563 (≲2%; Haas et al. 2009)
l CRNs: 19-196 (≲1%; Haas et al. 2009)
l CHxC: ≈30 (<1%; Kemen et al. 2011)
l We need to take care over result interpretation:
l Prediction method with 5% false negative rate and 1% false positive rate, with 1% baseline, predicting 500 effectors:
- P(effector|positive test) ≈ 0.5
A lesson from the literature?
l “The resulting computational model revealed a strong type III secretion signal in the N-terminus that can be used to detect effectors with sensitivity of 71% and [specificity] of 85%.”
l Sensitivity [P(+ve|T3E)] = 0.71; FPR [1-Specificity; P(+ve|not T3E)] = 0.15
l Base rate [P(T3E)] ≈ 3%; Genes = 4500
l We expect P(T3E|+ve) ≈ 0.13
l (and a significant number, up to 15% of the genome, of false positives…)
P(T3E|+ve) = P(+ve|T3E)·P(T3E) / [P(+ve|T3E)·P(T3E) + P(+ve|not T3E)·P(not T3E)]
A lesson from the literature? l “The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate (Table 1)”
0.038 x 5169 x 0.13 ≈ 26 [No. +ve x P(T3E|+ve)]
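Plugging the quoted performance figures into Bayes' theorem reproduces the ≈0.13 posterior (the ~3% base rate is the assumption quoted on the slide, not a measured value):

```python
sensitivity = 0.71    # P(+ve | T3E), as quoted
fpr = 0.15            # P(+ve | not T3E) = 1 - specificity
base_rate = 0.03      # P(T3E), assumed ~3% of genes

posterior = sensitivity * base_rate / (
    sensitivity * base_rate + fpr * (1 - base_rate))
# posterior ≈ 0.13: only ~13% of positive calls are expected to be genuine T3Es
```

Scaling by the positive-call fraction and genome size, 0.038 × 5169 × posterior ≈ 25-26 genuinely secreted effectors among the positive calls, which is the back-of-envelope figure on the slide.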
Director’s Commentary: Finding RxLRs
l Supplementary from Whisson et al. (2007) l Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.
l Not perfect
l Detail of one way to construct a classifier
Building a training set l Starting point: 49 candidate sequences (reference set)
l Known: l Contain (putatively) RxLR-EER motif
l All but one transcribed (i.e. not bad gene calls)
l Assumed:
l Presence of signal peptide and RxLR-EER categorises effectors
Building a training set l SignalP 3.0 (Bendtsen et al. 2004) to predict locations of signal peptides.
l SignalP also has statistical performance estimates:
l Settings:
l HMM cutoff probability = 0.9
l Cleavage site between positions 10 and 40 inclusive
l Justification: use in previous studies by others
Building a training set l Of 49, four sequences failed
l One carried forward on experimental grounds (highly-expressed)
l Training set now has 46 sequences
l But seven of these actually have no recognisable RxLR-EER motif, so are discarded
l Training set now has 39 sequences
Building a classifier l We have a recognisable motif, with substantial local variation and indels
l Therefore chose profile HMM
l Use HMMer software
l Profile HMMs sensitive to quality of alignment
l Therefore treat alignment as a parameter of the HMM (much difference between alignments!)
Building a classifier l Anchored at RxLR and EER
Building a classifier l ClustalW
Building a classifier l T-Coffee
Building a classifier l Parameters modified for HMM
l Alignment package (no alignment, anchored, Clustal, DiAlign, T-Coffee) on default settings
l Full-length and truncated (no signal peptide) alignments to test for influence of signal peptide region on classifier
- Plus one alignment of RxLR-EER plus flanking region only (‘cropped’)
l HMM built for each of eleven alignments
l Default parameters
l Once built, the HMM is the classifier.
Building a classifier
Truncating sequences reshapes sequence space
hmmbuild --amino <output> <alignment>
Testing the classifiers
Only positive examples: How well does a classifier cover them?
Testing the classifiers l Eleven classifiers to test
l Step 1: Consistency test l Does the classifier correctly call as positive the sequences used to train it?
l Estimates recovery of the information in the training set
l Step 2: Recovery of full sequences l Estimates performance of classifier on complete sequence data
SigP-RxLR-Cterm
RxLR-Cterm
RxLR
Testing the classifiers
Only positive examples: How well does a classifier recover unseen sequence?
Testing the classifiers l Step 3: Leave-One-Out Cross-validation
l But only have positive examples!
l Removes possibility that classifier matches on basis of having ‘seen’ a sequence before
Testing the classifiers l Leave-one-out (LOO) cross-validation:
l k runs, each with one training set, one test set (n items in dataset, k=n)
[Diagram: runs 1…k, each holding out a single sequence as the test set]
SigP-RxLR-Cterm
RxLR-Cterm
RxLR
Better match to classifier than to control
Testing the classifiers l Step 4: Tests on negative samples
l Completely shuffled sequences
l Shuffled downstream of the signal peptide only
l Replace RxLR-EER with AAAA-AAA
No classifier identifies a false positive (no classifier matches on sequence composition alone)
(some recognition on basis of signal peptide)
SigP-RxLR-Cterm
RxLR-Cterm
RxLR
(some recognition on sequence other than motif)
SigP-RxLR-Cterm
RxLR-Cterm
RxLR
Choosing a classifier l The ‘cropped’ classifier has:
l 100% recovery of positive training sequences
l 0% recovery of negative test sequences
l Some variation in classifier performance on whole genome:
Oranges are not the only fruit l Other classifiers had been proposed, e.g. Bhattacharjee et al. (2006):
l Presence of signal peptide, with cleavage site in first 40aa
l Regular expression test:
- R.LR.{,40}[ED][ED][KR] in first 100aa after cleavage site
l Can choose between methods, or report range of predictions
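The regular expression test is straightforward to apply in code. A sketch, assuming the sequence passed in starts at the signal peptide cleavage site; the function name, `window` parameter, and toy sequences are mine:

```python
import re

# Pattern as quoted from Bhattacharjee et al. (2006): R.LR, up to 40 residues,
# then [ED][ED][KR], searched within the first 100 aa after the cleavage site
RXLR_EER = re.compile(r"R.LR.{,40}[ED][ED][KR]")

def has_rxlr_eer(mature_seq, window=100):
    # mature_seq: protein sequence downstream of the signal peptide
    return bool(RXLR_EER.search(mature_seq[:window]))

hit = has_rxlr_eer("AARFLRQQQEDK")   # RFLR ... EDK matches the pattern
miss = has_rxlr_eer("A" * 100)       # no motif present
```

Note that in Python's `re` syntax `{,40}` means 'between 0 and 40 repetitions', so the EER-like triplet may directly follow the RxLR.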
So how did it work out…? l Refined all RxLR predictions to ‘priority set’ of ≈200 for cloning
l First set of 46 candidate effectors (07/11): l 25 host interactors detected by Y2H
l Localisation data for 41 candidates
l Silencing phenotypes for 19 candidates
l 22 putative orthologues with P. capsici
l Currently: l 44 silencing phenotypes
Transient expression in leaf of GFP-fused RxLR candidate, showing plasma membrane localisation
Acknowledgements l Phytophthora groups at JHI
l (Paul Birch, Steve Whisson, Dave Cooke)
l Bacteriology groups at JHI l (Ian Toth, Nicola Holden)
l Imaging at JHI
l (Petra Boevink)
l Numerous statisticians
l (David Broadhurst, Andy Woodward, BioSS)
Sequence space
CD-HIT sequence ordering l “Algorithm limitations: […]
Let say, there are two clusters: cluster #1 has A, X and Y where A is the representative, and cluster #2 has B and Z where B is the representative. The problem is that even if Y is more similar to B than to A, it can still be in cluster #1 because Y first hits A during the clustering process.”
l http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide