Mining plant pathogen genomes for effectors
DESCRIPTION
Presentation given as part of the EMBO Workshop on Plant-Microbe Interactions, at The Sainsbury Laboratory, Norwich, 20th June 2012. This presentation describes bioinformatic and statistical considerations for the prediction of plant pathogen effectors from genome sequences and annotation, with several literature examples.
TRANSCRIPT
Mining pathogen genomes for effectors
Leighton Pritchard
The overall goal
l Starting from a genome sequence, identify genes that code for candidate effectors (or, starting from the gene product complement, identify candidate effectors)
What is an effector?
l Molecule produced by a pathogen that (directly?) modifies host molecular/biochemical ‘behaviour’, e.g.
l Inhibits enzyme action (Cladosporium fulvum AVR2, AVR4; Phytophthora infestans EPIC1, EPIC2B; P. sojae glucanase inhibitors)
l Cleaves protein target (Pseudomonas syringae AvrRpt2)
l (De-)phosphorylates protein target (Pseudomonas syringae AvrRPM1, AvrB)
l Additional component in/retargeting host system, e.g. E3 ligase activity (P. syringae AvrPtoB; P. infestans Avr3a)
l Regulatory control (Xanthomonas campestris AvrBs3, TAL effectors)
What is an effector?
l No unifying biochemical mechanism; may act inside or outwith the host cell
l No formal, agreed definition (direct/indirect action; structural damage – PCWDEs, etc.)
l No single ‘test for candidate effectors’
l Really testing for protein family membership and/or evidence of ‘effector-like behaviour’
l A general sequence classification problem (functional annotation)
l Many possible bioinformatic/computational approaches
l No big red button
Surgery without knife skills?
Before we start…
A F 4 7 “If a card has a vowel on one side, it has an even number on the other side.” Which card(s) are useful to turn over to test this proposition?
Before we start…
Cards: A F 4 7
Candidate answers: A 7 / F 4 / A 4 / F 7
(Correct answer: A and 7.)
Wason Selection Task: confirmation bias, context
Why is this relevant?
(2x2 grid: effector / not effector versus RxLR / not RxLR)
“If a protein has an RxLR motif, it is an effector.” Which experiments are useful to perform to test this proposition?
Effector Club
The first rule of finding effectors is:
You are not finding effectors
Effector Club
l Classification of sequences is modelling
l simplified representation of reality
l criteria based on known effectors
l Identifies candidate effectors
l experimental verification required
l General bioinformatic problem
l specifics vary for each classifier (model)
Sequence space
An abstract concept
Sequence space
Each point is a sequence
Sequence space
d1 < d2: distance reflects sequence similarity
Sequence space
Known exemplar: red
Sequence space
Define distance from the example ≈ ‘similar’
Sequence space
‘similar’ sequences are the same class (e.g. function)
Sequence space
Known exemplars: red
Sequence space
Define a centre, and a distance that includes the examples
Sequence space
Classify ‘similar’ sequences
Finding effectors
l Simple:
1. Have one or more examples of your effector (class)
2. Define some kind of appropriate threshold of similarity
3. Check all the gene/gene product sequences in the genome against that threshold
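The three-step recipe can be sketched directly; this is a toy illustration, with `difflib`'s `SequenceMatcher` standing in for a real alignment-based similarity measure (BLAST, HMMer, etc.), and invented sequences and threshold:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude stand-in for a real alignment-based score (BLAST, HMMer, ...)."""
    return SequenceMatcher(None, a, b).ratio()

def candidate_effectors(exemplars, proteome, threshold=0.6):
    """Steps 1-3: flag any sequence within 'threshold' similarity
    of at least one known exemplar."""
    hits = {}
    for name, seq in proteome.items():
        best = max(similarity(seq, ex) for ex in exemplars)
        if best >= threshold:
            hits[name] = best
    return hits

exemplars = ["MRSLVATLLCFAALAS"]        # step 1: known effector(s) (toy data)
proteome = {
    "g1": "MRSLVATLLCFAVLAS",          # near-identical to the exemplar
    "g2": "WWWWGGGGPPPPHHHH",          # unrelated sequence
}
print(candidate_effectors(exemplars, proteome))  # → {'g1': 0.9375}
```

The rest of the talk is about why each of these three steps hides real difficulty.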
There are 50 slides to go… it’s not that simple
It’s not that simple
l How do we define ‘distance’?
l How large a ‘distance’ do we take?
l How do we know we’ve chosen a sensible ‘distance’?
Characteristics of known effectors
l Modularity
l Delivery: localisation/translocation domain(s)
l Activity: functional/interaction domain(s)
l Sequence motifs
l Localisation/translocation domain(s) often common to effector class (e.g. RxLR, T3E)
l Functional domain(s) may be common to effector class (e.g. TAL), or divergent (e.g. RxLR, T3E)
Characteristics of known effectors
Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Collmer A, Lindeberg M, Petnicki-Ocwieja T, Schneider DJ, Alfano JR (2002) Genomic mining type III secretion system effectors in Pseudomonas syringae yields new picks for all TTSS prospectors. Trends in Microbiology 10: 462–469.
Characteristics of known effectors
Dong S, Yu D, Cui L, Qutob D, Tedman-Jones J, et al. (2011) Sequence Variants of the Phytophthora sojae RXLR Effector Avr3a/5 Are Differentially Recognized by Rps3a and Rps5 in Soybean. PLoS ONE 6: e20172. doi:10.1371/journal.pone.0020172.t004.
Bouwmeester K, Meijer HJG, Govers F (2011) At the frontier; RXLR effectors crossing the Phytophthora-host interface. Frontiers in Plant-Microbe Interactions 10.3389
Characteristics of known effectors
l Modularity
l Delivery: localisation/translocation domain(s)
l Activity: functional/interaction domain(s)
l Sequence motifs
l Localisation/translocation domain(s) typically common to effector class (e.g. RxLR, T3E, CHxC)
l Functional domain(s) may be common to effector class (e.g. TAL), or divergent (e.g. RxLR, T3E in general)
Boch J, Scholze H, Schornack S, Landgraf A, Hahn S, et al. (2009) Breaking the code of DNA binding specificity of TAL-type III effectors. Science 326: 1509–1512. doi:10.1126/science.1178811.
Characteristics of known effectors
l “Arms races” occur:
l Host defences track effector evolution
l Effectors evade host defences
l Divergence of effectors under selection pressure
l Diversifying selection; divergence may result from evasion of detection, rather than change of biochemical ‘function’
l Effectors may be found preferentially in characteristic locations
l P. infestans ‘gene sparse’ regions
Raffaele S, Win J, Cano LM, Kamoun S (2010) Analyses of genome architecture and gene expression reveal novel candidate virulence factors in the secretome of Phytophthora infestans. BMC Genomics 11: 637. doi:10.1186/1471-2164-11-637.
Characteristics of known effectors
l Application of ‘filters’: reduce the number of sequences to check
l Presence/absence filters:
- SignalP (export signal)
- RxLR/T3SS (translocation signal)
- Expression (used by pathogen)
- Positive selection (suggests arms race)
- etc…
l Workflows (e.g. Galaxy, Taverna) useful here
Fabro G, Steinbrenner J, Coates M, Ishaque N, Baxter L, et al. (2011) Multiple candidate effectors from the oomycete pathogen Hyaloperonospora arabidopsidis suppress host plant immunity. PLoS Pathog 7: e1002348. doi:10.1371/journal.ppat.1002348.
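Presence/absence filtering can be sketched as a series of boolean tests; everything here is illustrative, not a published pipeline: the length/start test is a crude stand-in for SignalP, the `R.LR` pattern and the residue window it is searched in are invented stand-ins for a real RxLR motif search, and the proteome is toy data:

```python
import re

RXLR = re.compile(r"R.LR")   # hypothetical motif pattern, for illustration only

def passes_filters(seq: str) -> bool:
    """Apply presence/absence filters in series; a candidate must pass all."""
    has_export_signal = seq.startswith("M") and len(seq) > 70  # SignalP stand-in
    has_rxlr = RXLR.search(seq, 20, 70) is not None            # N-terminal window
    return has_export_signal and has_rxlr

proteome = {
    "cand1": "M" + "A" * 30 + "RSLR" + "G" * 60,   # has 'signal' and motif
    "cand2": "M" + "A" * 90,                       # no motif: filtered out
}
kept = [name for name, seq in proteome.items() if passes_filters(seq)]
print(kept)  # → ['cand1']
```

Each filter discards candidates, so the order of filters does not change the final set, only the amount of work; workflow systems like Galaxy or Taverna chain such steps.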
Redefining sequence space
l Effectors may share a common module, but otherwise be dissimilar.
l We can emphasise sequence similarity by focusing on the common region
l this is essentially ‘redefining’ sequence space
l brings known effectors ‘together’
l may bring non-effectors with similar sequence closer, too
SSMMMAAAAAAAA
SSMMMBBBBBBBB
Sequence space
Comparing whole sequences: AAAAAAAA vs BBBBBBB
Redefining sequence space
l We can emphasise similarity by focusing on regions common to an effector class, e.g. T3SS, L-FLAK
l this is essentially redefining sequence space
l brings known effectors ‘closer together’
l may bring non-effectors with similar sequence closer, too
Sequence space
Pull domains together, push non-domains away
Building a classifier
l How do we define ‘distance’?
l How large a ‘distance’ do we take?
l How do we know we’ve chosen a sensible ‘distance’?
Defining a distance
l Sequence identity (optimal alignment)
l Derived score (based on sequence identity/alignment)
l Bit score in BLAST
l E-value in BLAST
l Derived score (based on other measures, not alignment)
l Bit score in HMMer
l Clustering (not strictly a distance)
l Sequence identity (e.g. CD-HIT)
l MCL
(we’re really assessing criteria for class membership)
Defining a distance: sequence identity
l Distance between sequences ≈ difference between sequences
l sequence identity: proportion of identical symbols
l e.g. BLAST output
l Gotchas: not always symmetrical; dependent on alignment parameters!
[BLASTN alignment of a 205 nt query: Score = 95.3 bits (51), Expect = 3e-24, Identities = 161/212 (76%), Gaps = 15/212 (7%), Strand = Plus/Plus]
Defining a distance: sequence identity
l Gotchas: not always symmetrical; dependent on alignment parameters!

Score = 4970
Length of alignment = 533
Sequence Solyc11g008000.1.1 : 1 - 529 (Sequence length = 529)
Sequence Solyc11g005920.1.1 : 1 - 688 (Sequence length = 688)
Percentage ID = 32.83

Score = 5040
Length of alignment = 533
Sequence Solyc11g005920.1.1 : 1 - 688 (Sequence length = 688)
Sequence Solyc11g008000.1.1 : 1 - 529 (Sequence length = 529)
Percentage ID = 32.46

(pairwise alignment in Jalview: swapping the order of the same two sequences changes the percentage identity)
Defining a distance: sequence identity
[BLASTP alignment, BLOSUM80 matrix: Query PGSC0003DMP400054265 (length 660) vs Subject PGSC0003DMP400054263 (length 182); Score = 31.4 bits (64), Expect = 4e-05, Identities = 37/134 (28%), Positives = 53/134 (40%), Gaps = 12/134 (8%)]
Defining a distance: sequence identity
[BLASTP alignment, BLOSUM45 matrix, same sequence pair: Score = 36.8 bits (113), Expect = 7e-07, Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%). A different substitution matrix gives a different alignment, and a different identity.]
Defining a distance: beyond identity
[the same BLASTN alignment again: Score = 95.3 bits (51), Expect = 3e-24, Identities = 161/212 (76%)]
Identity ≈ yes/no. We can quantify similarity in ‘bits’.
Defining a distance: bit score and E-value
l Bit score and E-value can be used as distance measures.
l I prefer (normalised) bit scores
l Small changes in score → large changes in E
l E varies linearly with database size and query length; λS is independent of database size
E = kmn·e^(−λS)
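Both behaviours follow directly from the equation: in terms of the normalised bit score S′ = (λS − ln k)/ln 2, Karlin-Altschul statistics give E = mn·2^(−S′), so E scales linearly with the search space m·n while the bit score itself does not change. A small numerical sketch (the 36.8-bit score is from the example above; the query/database lengths are invented):

```python
def evalue(bit_score: float, m: int, n: int) -> float:
    """Karlin-Altschul E-value from a normalised bit score:
    E = m * n * 2**(-S'), where m is query length and n is
    total database length."""
    return m * n * 2.0 ** (-bit_score)

# E grows linearly with database size; the bit score does not change.
print(evalue(36.8, 660, 10_000))
print(evalue(36.8, 660, 1_000_000))   # 100x larger database: 100x larger E

# A small change in score gives a large change in E (exponential in S').
print(evalue(30.0, 660, 10_000))      # ~111x larger E than at 36.8 bits
```

This is why a (normalised) bit score is a more portable distance measure than an E-value: it can be compared across searches against databases of different sizes.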
(Query PGSC0003DMP400054265, length 660; Subject PGSC0003DMP400054263, length 182)
BLOSUM80: Score = 31.4 bits (64), Expect = 4e-05, Identities = 37/134 (28%), Positives = 53/134 (40%), Gaps = 12/134 (8%)
BLOSUM45: Score = 36.8 bits (113), Expect = 7e-07, Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)
The same query/subject pair (PGSC0003DMP400054265 vs PGSC0003DMP400054263) searched against databases of different sizes:
db size 5 sequences: Score = 36.8 bits (113), Expect = 7e-07; Score = 30.0 bits (89), Expect = 8e-05
db size 483 sequences: Score = 36.8 bits (113), Expect = 1e-06; Score = 30.0 bits (89), Expect = 1e-04
db size 644 sequences: ***** No hits found ***** (the E-values now exceed the reporting threshold; the bit scores are unchanged throughout)
Defining a distance: alignment v profile
Sequences:
ACATAT
TCAACT
ACACGC
AGAATC
ACAGAA
consensus: ACAAAT
regular expression: [AT][CG]A[ACGT][ACGT][TCA] or [AT]-[CG]-A-X(2)-{G}
PSSM (position counts):
  123456
A 405221
C 040112
G 010110
T 100112
hidden Markov model (HMM)
Alignments compare two sequences; profiles capture information from several sequences.
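The PSSM count matrix above can be reproduced directly from the five example sequences (the consensus ACAAAT is derived from them, not counted):

```python
from collections import Counter

seqs = ["ACATAT", "TCAACT", "ACACGC", "AGAATC", "ACAGAA"]

def pssm_counts(seqs):
    """Profile as a count matrix: one symbol-frequency Counter
    per alignment column."""
    length = len(seqs[0])
    return [Counter(s[i] for s in seqs) for i in range(length)]

counts = pssm_counts(seqs)
row_A = [counts[i]["A"] for i in range(6)]
print(row_A)  # → [4, 0, 5, 2, 2, 1], the slide's PSSM row for A
```

Regular expressions, PSSMs and profile HMMs are increasingly expressive summaries of the same alignment: a regex only records which symbols are allowed, a PSSM records how often each occurs, and an HMM adds position-specific insertion/deletion probabilities.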
Defining a distance: bit scores in HMMer
l HMMer works differently to BLAST: profile HMMs
l Statistical model of a multiple sequence alignment (not a pairwise sequence alignment)
l phmmer and jackhmmer are the equivalents of BLASTP and PSI-BLAST
l Explicit statistical representation of alignment uncertainty
l Sequence scores, not alignment scores
l Bit score is a ‘log-odds’ bit score:
log-odds = log( P(sequence matches alignment) / P(sequence matches null model) )
Goritschnig S, Krasileva KV, Dahlbeck D, Staskawicz BJ (2012) Computational prediction and molecular characterization of an oomycete effector and the cognate Arabidopsis resistance gene. PLoS Genetics 8: e1002502. doi:10.1371/journal.pgen.1002502.
Haas BJ, Kamoun S, Zody MC, Jiang RHY, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393–398. doi:10.1038/nature08358.
Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.
The null model is a control; the choice of null model can be important.
Sequence matches alignment better than control (null) → log-odds > 0
Sequence matches control (null) better than alignment → log-odds < 0
Sequence matches alignment and control (null) equally → log-odds ≈ 0
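The three cases follow directly from the definition; a toy calculation (the probabilities are invented for illustration, and base 2 gives the score in bits):

```python
import math

def log_odds(p_model: float, p_null: float) -> float:
    """Log-odds bit score: log2 of how much better the profile
    explains the sequence than the null (background) model does."""
    return math.log2(p_model / p_null)

print(log_odds(1e-10, 1e-14))  # model fits better: positive (~13.3 bits)
print(log_odds(1e-14, 1e-10))  # null fits better: negative
print(log_odds(1e-12, 1e-12))  # equally good fit: 0.0
```

Because the score is a ratio against the null model, changing the null (e.g. from uniform background to average database composition) changes every score, which is why the choice of null model matters.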
Defining a distance: bit scores in HMMer
[HMMer search output for the NAM domain (Query: NAM [M=129], Accession PF02365.10, ‘No apical meristem (NAM) protein’): full-sequence bit scores of 171.0 (StNac1_5), 170.2 (NbNac1_1), 167.4 (NbNac2_1) and 165.6 (StNac2_5), followed by per-domain score tables and alignments]
l Easy to read bit scores from HMMer output
Defining a distance: composition
l Sometimes, sequence comparison doesn’t tell you much (e.g. T3 effector signals)
l Can use ‘bulk properties’ of sequence composition
l Many ways to derive a ‘distance’
Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E, et al. (2009) Sequence-based prediction of type III secreted proteins. PLoS Pathog 5: e1000376. doi:10.1371/journal.ppat.1000376.
Defining a ‘distance’: clustering
Not really a distance, more a bound: sequences that cluster with your known examples.
Defining a ‘distance’: CD-HIT clusters
l Clustering tool, online at http://weizhong-lab.ucsd.edu/cd-hit/
l Sequences sorted by decreasing length
l First sequence is representative of the first cluster: ‘seen’
l Consider each remaining sequence in turn: compare with the ‘seen’ set
- Similarity of sequence with a ‘seen’ sequence > threshold? Merge into that cluster
- Otherwise start a new cluster: ‘seen’
l Fast, but can be sensitive to sequence set composition (use multi-step).
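The greedy incremental algorithm described above can be sketched as follows; `difflib` is only a stand-in for CD-HIT's word-filtered identity computation, and the sequences and threshold are illustrative:

```python
from difflib import SequenceMatcher

def greedy_cluster(seqs, threshold=0.9):
    """CD-HIT-style greedy incremental clustering (sketch):
    sort by decreasing length; each sequence joins the first cluster
    whose representative it matches above the threshold, otherwise
    it founds a new cluster and becomes its representative."""
    clusters = []                                  # list of (representative, members)
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if SequenceMatcher(None, seq, rep).ratio() >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

seqs = ["ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT"]
clusters = greedy_cluster(seqs, threshold=0.9)
print([rep for rep, members in clusters])  # → ['ACGTACGTAC', 'TTTTTTTTTT']
```

Because each sequence is only compared against cluster representatives, the result depends on input order and composition; this is the sensitivity the slide warns about, and why multi-step clustering (and testing clusters for robustness) is recommended.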
Clusters need to be tested for robustness.
Defining a ‘distance’: MCL clustering
l Clustering algorithm (used in TribeMCL, OrthoMCL)
l Markov Clustering algorithm
l Finds clusters in networks
l Use BLAST to generate all-vs-all pairwise comparisons
l Results are a network (similarity graph)
l Given such a network:
l Expansion (raise to power) – ‘spreads links’
l Inflation (scaling) – ‘thickens strong links’
Repeated application of the expansion/inflation cycle results in the formation of clusters.
Defining a ‘distance’: MCL clustering
[Figure: expansion and inflation steps applied to an input network, yielding a clustering]
l One key parameter: inflation value
l Need to cluster over several inflation values to confirm robustness (consistency of clustering)
Inflation value   Clusters
1.4               3
2.0               6
4.0               18
6.0               33
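The expansion/inflation cycle can be sketched with NumPy. This is a bare-bones illustration only; TribeMCL and OrthoMCL add edge weighting, pruning, and other refinements, and the function name `mcl` is mine.

```python
import numpy as np

def mcl(adj, inflation=2.0, max_iter=100, tol=1e-6):
    # Add self-loops and column-normalise to get a stochastic flow matrix
    M = adj + np.eye(adj.shape[0])
    M = M / M.sum(axis=0)
    for _ in range(max_iter):
        # Expansion: matrix squaring 'spreads links' (flow along paths)
        expanded = M @ M
        # Inflation: element-wise power, then re-normalise,
        # 'thickens strong links' and weakens faint ones
        inflated = expanded ** inflation
        inflated = inflated / inflated.sum(axis=0)
        if np.abs(inflated - M).max() < tol:
            M = inflated
            break
        M = inflated
    # At convergence, rows with mass on the diagonal are 'attractors';
    # the non-zero columns of an attractor row form one cluster
    clusters = []
    for i in range(M.shape[0]):
        if M[i, i] > tol:
            members = frozenset(np.flatnonzero(M[i] > tol))
            if members not in clusters:
                clusters.append(members)
    return clusters

# Demo: two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
clusters = mcl(adj, inflation=2.0)
```

The bridge edge is weakened by inflation on each cycle, so the two triangles separate into two clusters; raising the inflation value gives finer-grained clusterings, as in the table above.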
Defining a distance l Sequence identity – scores alignment (symmetry?)
l Derived score (based on sequence identity/alignment)
l Bit score in BLAST – scores alignment (substitution matrix)
l E-value in BLAST – scores alignment (sensitive to query/db size, substitution matrix)
l Derived score (based on other measures)
l Bit score in HMMer – scores sequence relative to model (null model?)
l Clustering l Sequence identity (e.g. CD-HIT) – can be sensitive to sequence order (multi-step? test for robustness? CD-HIT uses sequence identity)
l MCL – needs all-vs-all pairwise (test for robustness; uses BLAST E-value by default)
Many definitions of distance
l How do we define ‘distance’?
l How large a ‘distance’ (or what clustering resolution) do we take?
l How do we know we’ve chosen a sensible ‘distance’?
How large a distance do we allow?
l How do we define ‘distance’?
l How large a ‘distance’ (or what clustering resolution) do we take?
l How do we know we’ve chosen a sensible ‘distance’?
Confusion Matrix
l Our distance/boundary classifies sequences as ‘in’ or ‘out’: ‘red’ or ‘blue’
l Changing distance/bound results in various degrees of success…
Confusion matrix:
       IN   OUT
Red    1    5
Blue   1    36
In this matrix: Red/IN = true positive; Red/OUT = false negative; Blue/IN = false positive; Blue/OUT = true negative.
False positive rate    FP/(FP+TN)
False negative rate    FN/(TP+FN)
Sensitivity            TP/(TP+FN)
Specificity            TN/(FP+TN)
False discovery rate   FP/(FP+TP)
False positive rate    1/37 = 0.03
False negative rate    5/6 = 0.83
Sensitivity            1/6 = 0.17
Specificity            36/37 = 0.97
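These definitions translate directly into code. A small helper (the function name is mine) reproduces the numbers on the slide:

```python
def confusion_metrics(tp, fn, fp, tn):
    # Standard statistics derived from a 2x2 confusion matrix
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false_discovery_rate": fp / (fp + tp),
    }

# Slide example: Red/IN = 1 (TP), Red/OUT = 5 (FN),
#                Blue/IN = 1 (FP), Blue/OUT = 36 (TN)
m = confusion_metrics(tp=1, fn=5, fp=1, tn=36)
# sensitivity 1/6 ≈ 0.17, specificity 36/37 ≈ 0.97
```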
Confusion Matrix
l Our distance/boundary classifies sequences as ‘in’ or ‘out’: ‘red’ or ‘blue’
l Changing distance/bound results in various degrees of success…
       IN   OUT
Red    5    2
Blue   4    33
False positive rate    0.11
False negative rate    0.29
Sensitivity            0.71
Specificity            0.89
Confusion Matrix
l Our distance/boundary classifies sequences as ‘in’ or ‘out’: ‘red’ or ‘blue’
l Changing distance/bound results in various degrees of success…
       IN   OUT
Red    7    0
Blue   14   23
False positive rate    0.38
False negative rate    0
Sensitivity            1
Specificity            0.62
ROC Curve
l To assess how well a method performs, can use ROC (Receiver Operating Characteristic) curve
l Typically, we use area under the curve (AUC) to choose between methods
[ROC plot: Sensitivity vs False Positive Rate; classifier curve against the random-guess diagonal. Curves nearer the top-left corner indicate better performance.]
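One way to trace a ROC curve and compute AUC, sketched below. Function names are mine; for a real scored classifier you would sweep the decision threshold over the classifier's scores exactly as here.

```python
def roc_points(scores_pos, scores_neg):
    # Sweep a threshold over all observed scores; at each threshold,
    # 'predict positive if score >= t' gives one (FPR, sensitivity) point
    thresholds = sorted(set(scores_pos + scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    return points

def auc(points):
    # Trapezoidal area under the ROC curve
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

perfect = auc(roc_points([0.9, 0.8], [0.2, 0.1]))  # perfectly separated scores
useless = auc(roc_points([0.6], [0.6]))            # indistinguishable scores
```

A perfect classifier hugs the top-left corner (AUC = 1.0); a classifier no better than random guessing sits on the diagonal (AUC = 0.5).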
ROC Curve
l To assess how well a method performs, can use ROC (Receiver Operating Characteristic) curve
l The ‘best’ parameter setting for a method is typically near the apex.
F-measure
l We can ‘game’ ROC statistics by increasing irrelevant ‘negative’ examples
l Increasing TN ‘improves’ false positive rate and specificity
l Can use precision and recall instead
Precision (PPV)        TP/(TP+FP)
Recall = sensitivity   TP/(TP+FN)
FDR = 1-PPV            FP/(TP+FP)
F-measure
l Precision: Proportion of accurate positive predictions
l Recall: Proportion of positive examples recovered (sensitivity)
l F1 = 2 (precision x recall)/(precision + recall)
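The F1 formula above is a one-liner to implement; here it is applied to the middle confusion matrix from the earlier slides (function name is mine):

```python
def f1_score(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall
    precision = tp / (tp + fp)   # proportion of positive calls that are right
    recall = tp / (tp + fn)      # proportion of real positives recovered
    return 2 * precision * recall / (precision + recall)

# Boundary with TP=5, FP=4, FN=2:
# precision = 5/9 ≈ 0.56, recall = 5/7 ≈ 0.71, F1 = 0.625
value = f1_score(tp=5, fp=4, fn=2)
```

Note that TN does not appear anywhere, which is the point: unlike specificity or FPR, F1 cannot be inflated by padding the dataset with irrelevant negatives.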
F-measure
l The F-measure indicates which set of parameters (which distance) is ‘best’
l Several F-measures available that weight precision and recall differently
[Plot: F-measure across three parameter settings]
How large a distance do we allow?
l Assign known ‘positive’ and ‘negative’ examples
l Vary distances and take F-measure
l Choose distance that gives the best performance
Confusion Matrix
l BUT: how do we know that we’ve chosen a suitable distance? Training set choice is critical
Training set choice
Train classifier on known examples: looks good…
Unrepresentative examples
…but training set is a biased/unrepresentative sample…
Overfitting
…or ‘fits’ known positives unfeasibly tightly
How do we know we’ve chosen a suitable distance?
l How do we define ‘distance’?
l How large a ‘distance’ (or what clustering resolution) do we take?
l How do we know we’ve chosen a sensible ‘distance’?
A trip to the doctor, part I l Routine medical checkup
l Test for disease X (horrible, unpleasant, potentially suppurating)
l Test has sensitivity (i.e. predicts disease where there is disease) of 95%
l Test has false positive rate (i.e. predicts disease where there is no disease) of 1%
l Your test is positive
l What is the probability that you have disease X?
Options: 0.01, 0.05, 0.50, 0.95, 0.99
Cross-validation l Estimation of classifier performance depends on
l distance measure
l composition of training set (‘positives’ and ‘negatives’)
l Cross-validation gives objective measure of performance
l Many strategies available, including:
l leave-one-out (LOO)
l k-fold cross-validation
l repeated (random) subsampling
l Essentially: always keep a hold-out set (not used to train)
k-fold cross-validation l No cross-validation:
l One training set
l No test (hold-out/validation) set
l Risks overfitting
Training Set
Test Set
k-fold cross-validation l Validation:
l One training set, one test (hold-out/validation) set
l Test performance of classifier on unseen data
Training Set
Test Set
k-fold cross-validation l 2-fold cross-validation:
l Two runs, each with one training set, one test set
l Swap training and test sets, collate results
Training Set
Test Set
run1
run2
k-fold cross-validation l 3-fold cross-validation:
l Three runs, each with one training set, one test set
Training Set
Test Set
run1
run2
run3
k-fold cross-validation l k-fold cross-validation:
l k runs, each with one training set, one test set (n items in dataset, k>1)
[Diagram: runs 1…k; each run holds out n/k items as the test set and trains on the remaining n-(n/k)]
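The splitting scheme in these slides can be sketched as follows (an illustrative helper, names mine; real pipelines usually shuffle or stratify the items first):

```python
def kfold_splits(items, k):
    # Partition the n items into k interleaved folds of roughly n/k items;
    # run i holds fold i out as the test set and trains on the rest
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(kfold_splits(list(range(6)), k=3))
# 3 runs; each test set has 2 items, each training set has 4
```

Setting k = n reduces this to leave-one-out cross-validation: n runs, each testing on a single held-out item.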
After cross-validation
False positive rate    0.11
False negative rate    0.29
Sensitivity            0.71
Specificity            0.89
Precision              0.56
• Use cross-validation to find ‘best’ method & parameters
• Cross-validation gives you estimated performance metrics on unseen data
• Apply ‘best’ method to complete dataset for prediction
A trip to the doctor, part II l Test for disease X (horrible, unpleasant, potentially suppurating)
l Test has sensitivity (i.e. predicts disease where there is disease) of 95%
l Test has false positive rate (i.e. predicts disease where there is no disease) of 1%
l Your test is positive
l To calculate the probability that the test correctly determines whether you have the disease, you need to know the baseline occurrence.
Baseline occurrence: 1% ⇒ P(disease|+ve) = 0.490
Baseline occurrence: 80% ⇒ P(disease|+ve) = 0.997
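The arithmetic behind these two figures is Bayes' theorem; a quick check in code (the function name is mine):

```python
def p_disease_given_positive(prevalence, sensitivity, fpr):
    # Bayes: P(D|+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|not D)P(not D)]
    true_pos = sensitivity * prevalence          # P(+|D)P(D)
    false_pos = fpr * (1 - prevalence)           # P(+|not D)P(not D)
    return true_pos / (true_pos + false_pos)

rare = p_disease_given_positive(prevalence=0.01, sensitivity=0.95, fpr=0.01)
common = p_disease_given_positive(prevalence=0.80, sensitivity=0.95, fpr=0.01)
# rare ≈ 0.490, common ≈ 0.997
```

With a 1% baseline, a positive result from this apparently excellent test is still roughly a coin flip, because false positives from the 99% healthy population swamp the true positives.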
What is the baseline occurrence for effectors?
l Usually rely on predictions for expected baseline
l Bacterial genomes: ≈4500 genes
l Type III effectors: 1-10% (Arnold et al. 2009); 1-2% (Collmer et al. 2002); 1% (Boch and Bonas, 2010)
l Oomycete/fungal genomes: ≈20000 genes
l RxLRs: 120-460 (1-2%; Whisson et al. 2007); ≤563 (≲2%; Haas et al. 2009)
l CRNs: 19-196 (≲1%; Haas et al. 2009)
l CHxC: ≈30 (<1%; Kemen et al. 2011)
l We need to take care over result interpretation:
l Prediction method with 5% false negative rate and 1% false positive rate, with 1% baseline, predicting 500 effectors:
- P(effector|positive test) ≈ 0.5
A lesson from the literature?
l “The resulting computational model revealed a strong type III secretion signal in the N-terminus that can be used to detect effectors with sensitivity of 71% and [specificity] of 85%.”
l Sensitivity [P(+ve|T3E)] = 0.71; FPR [1-Specificity; P(+ve|not T3E)] = 0.15
l Base rate [P(T3E)] ≈ 3%; Genes = 4500
l We expect P(T3E|+ve) ≈ 0.13
l (and a significant number, up to 15% of the genome, of false positives…)
P(T3E|+ve) = P(+ve|T3E)·P(T3E) / [P(+ve|T3E)·P(T3E) + P(+ve|not T3E)·P(not T3E)]
A lesson from the literature? l “The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate (Table 1)”
0.038 x 5169 x 0.13 ≈ 26 [No. +ve x P(T3E|+ve)]
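Plugging the quoted performance figures into Bayes' theorem reproduces the ≈0.13 posterior (the ~3% base rate is the assumption quoted on the slide, not a measured value):

```python
sensitivity = 0.71    # P(+ve | T3E), as quoted
fpr = 0.15            # P(+ve | not T3E) = 1 - specificity
base_rate = 0.03      # P(T3E), assumed ~3% of genes

posterior = sensitivity * base_rate / (
    sensitivity * base_rate + fpr * (1 - base_rate))
# posterior ≈ 0.13: only ~13% of positive calls are expected to be genuine T3Es
```

Scaling by the positive-call fraction and genome size, 0.038 × 5169 × posterior ≈ 25-26 genuinely secreted effectors among the positive calls, which is the back-of-envelope figure on the slide.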
Director’s Commentary: Finding RxLRs
l Supplementary from Whisson et al. (2007) l Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.
l Not perfect
l Detail of one way to construct a classifier
Building a training set l Starting point: 49 candidate sequences (reference set)
l Known: l Contain (putatively) RxLR-EER motif
l All but one transcribed (i.e. not bad gene calls)
l Assumed:
l Presence of signal peptide and RxLR-EER categorises effectors
Building a training set l SignalP 3.0 (Bendtsen et al. 2004) to predict locations of signal peptides.
l SignalP also has statistical performance estimates:
l Settings:
l HMM cutoff probability = 0.9
l Cleavage site between positions 10 and 40 inclusive
l Justification: use in previous studies by others
Building a training set l Of 49, four sequences failed
l One carried forward on experimental grounds (highly-expressed)
l Training set now has 46 sequences
l But seven of these actually have no recognisable RxLR-EER motif, so are discarded
l Training set now has 39 sequences
Building a classifier l We have a recognisable motif, with substantial local variation and indels
l Therefore chose profile HMM
l Use HMMer software
l Profile HMMs sensitive to quality of alignment
l Therefore treat alignment as a parameter of the HMM (much difference between alignments!)
Building a classifier l Anchored at RxLR and EER
Building a classifier l ClustalW
Building a classifier l T-Coffee
Building a classifier l Parameters modified for HMM
l Alignment package (no alignment, anchored, Clustal, DiAlign, T-Coffee) on default settings
l Full-length and truncated (no signal peptide) alignments to test for influence of signal peptide region on classifier
- Plus one alignment of RxLR-EER plus flanking region only (‘cropped’)
l HMM built for each of eleven alignments
l Default parameters
l Once built, the HMM is the classifier.
Building a classifier
Truncating sequences reshapes sequence space
hmmbuild --amino <output> <alignment>
Testing the classifiers
Only positive examples: How well does a classifier cover them?
Testing the classifiers l Eleven classifiers to test
l Step 1: Consistency test l Does the classifier correctly call as positive the sequences used to train it?
l Estimates recovery of the information in the training set
l Step 2: Recovery of full sequences l Estimates performance of classifier on complete sequence data
SigP-RxLR-Cterm
RxLR-Cterm
RxLR
Testing the classifiers
Only positive examples: How well does a classifier recover unseen sequence?
Testing the classifiers l Step 3: Leave-One-Out Cross-validation
l But only have positive examples!
l Removes possibility that classifier matches on basis of having ‘seen’ a sequence before
Testing the classifiers l Leave-one-out (LOO) cross-validation:
l k runs, each with one training set, one test set (n items in dataset, k=n)
[Diagram: runs 1…k, each holding out a single sequence as the test set]
SigP-RxLR-Cterm
RxLR-Cterm
RxLR
Better match to classifier than to control
Testing the classifiers l Step 4: Tests on negative samples
l Completely shuffled sequences
l Shuffled downstream of the signal peptide only
l Replace RxLR-EER with AAAA-AAA
No classifier identifies a false positive (no classifier matches on sequence composition alone)
(some recognition on basis of signal peptide)
SigP-RxLR-Cterm
RxLR-Cterm
RxLR
(some recognition on sequence other than motif)
SigP-RxLR-Cterm
RxLR-Cterm
RxLR
Choosing a classifier l The ‘cropped’ classifier has:
l 100% recovery of positive training sequences
l 0% recovery of negative test sequences
l Some variation in classifier performance on whole genome:
Oranges are not the only fruit l Other classifiers had been proposed, e.g. Bhattacharjee et al. (2006):
l Presence of signal peptide, with cleavage site in first 40aa
l Regular expression test:
- R.LR.{,40}[ED][ED][KR] in first 100aa after cleavage site
l Can choose between methods, or report range of predictions
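The regular expression test is straightforward to apply in code. A sketch, assuming the sequence passed in starts at the signal peptide cleavage site; the function name, `window` parameter, and toy sequences are mine:

```python
import re

# Pattern as quoted from Bhattacharjee et al. (2006): R.LR, up to 40 residues,
# then [ED][ED][KR], searched within the first 100 aa after the cleavage site
RXLR_EER = re.compile(r"R.LR.{,40}[ED][ED][KR]")

def has_rxlr_eer(mature_seq, window=100):
    # mature_seq: protein sequence downstream of the signal peptide
    return bool(RXLR_EER.search(mature_seq[:window]))

hit = has_rxlr_eer("AARFLRQQQEDK")   # RFLR ... EDK matches the pattern
miss = has_rxlr_eer("A" * 100)       # no motif present
```

Note that in Python's `re` syntax `{,40}` means 'between 0 and 40 repetitions', so the EER-like triplet may directly follow the RxLR.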
So how did it work out…? l Refined all RxLR predictions to ‘priority set’ of ≈200 for cloning
l First set of 46 candidate effectors (07/11): l 25 host interactors detected by Y2H
l Localisation data for 41 candidates
l Silencing phenotypes for 19 candidates
l 22 putative orthologues with P. capsici
l Currently: l 44 silencing phenotypes
Transient expression in leaf of GFP-fused RxLR candidate, showing plasma membrane localisation
Acknowledgements l Phytophthora groups at JHI
l (Paul Birch, Steve Whisson, Dave Cooke)
l Bacteriology groups at JHI l (Ian Toth, Nicola Holden)
l Imaging at JHI
l (Petra Boevink)
l Numerous statisticians
l (David Broadhurst, Andy Woodward, BioSS)
Sequence space
CD-HIT sequence ordering l “Algorithm limitations: […]
Let say, there are two clusters: cluster #1 has A, X and Y where A is the representative, and cluster #2 has B and Z where B is the representative. The problem is that even if Y is more similar to B than to A, it can still be in cluster #1 because Y first hits A during the clustering process.”
l http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide