Queensland University of Technology
CRICOS No. 00213J
Using a Beagle to sniff for Bacterial Promoters
Stefan R. Maetschke, Michael Towsey and James M. Hogan
Queensland University of Technology
An Agenda
• Bacterial Promoters
– The domain and the motifs
– Earlier approaches, including ours
• Why dumber is better
– Not quite, but flexibility before sophistication
– Exploiting new features as they are identified
• Results
Upstream from a Bacterial Gene
[Figure: RNA polymerase at the promoter, with the TSS, the direction of transcription, the GSS and the gene]
• Search for ‘conserved’ -10 and -35 hexamers
– Except they’re not really conserved
– Plagued by massive false positive rates
• But this is the Reader’s Digest version
Previous Work
• Mainly in the E. coli system
• PWMs – simple, but poor discrimination
– Good performance if compound structure used
– (Collado-Vides et al.: state of the art pre-2006)
• HMMs – less successful than in eukaryotes
• TDNNs – boosted by GSS offset distribution
• SVMs – spectrum kernel ensemble
– (Gordon et al. (us): state of the art, but at a price)
Beagle
• Principled and rapid inclusion of motifs as they are discovered or hypothesised
– Prior to the Gordon et al. paper, a TP:FP ratio of 1:300 was considered good
– But this was based solely on -10 and -35 motifs
• A model description language and parser
– Less sophisticated than it sounds, but sufficient
• Iterative refinement of the model
Upstream from a Bacterial Gene
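[Figure: upstream region with the -35 element (TTGACA), the -10 element (TATAAT), the TSS and the GSS (ATG), bound by the RNA polymerase core enzyme and sigma factor]
Specific sigma controls binding at -10, -35 elements
But binding probability varies enormously
Compensate when hexamers are weak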
“It has long been known that domains 2 and 4 … bind to the strongly conserved -10 and -35 boxes”. Except when they don’t because they aren’t…
Upstream from a Bacterial Gene
[Figure: upstream region with the -35 element (TTGACA), the TRTG motif, the extended -10 element (TATAAT), the TSS and the GSS (ATG)]
Simple extended -10: TG. Discovered in B. subtilis, found in 20% of promoters in E. coli
-16 hypothesised to be important in E. coli, with TRTG or T(AG)TG consensus
But even the alpha subunits aren’t what they seem…
Upstream from a Bacterial Gene
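[Figure: full upstream layout, from distal to proximal: distal UP element (AWWWWWTTTTT), proximal UP element (AAAAAARNR), -35 element (TTGACA), -16 motif (TRTG), extended -10 element (TGTATAAT), TSS and GSS (ATG); polymerase domains CTD1, CTD2, NTD1 and NTD2 are marked]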
CTDs are carboxy terminal domains, binding to UP elements
AT-rich region, proximal element more important
The Data
• E. coli and B. subtilis
• Confirmed TSS locations within 250 bp of the nearest gene start
– No overlapping reading frames
• N = 492 (E. coli), 205 (B. subtilis)
• 250 bp USRs available
Beagle algorithm
• Define a consensus promoter
– e.g. <TTGACA (15, 21) TATAAT (4, 13) TSS>
– Ordered pairs specify gap ranges
• Parse the description and define PWMs and weighted gaps
– Initially trivial
• Refine using the confirmed TSS locations
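
The parsing step can be illustrated with a minimal sketch. This is not the Beagle parser itself: the function names (`parse_pattern`, `initial_pwm`, `initial_gap_weights`) are hypothetical, and reading "initially trivial" as a one-hot PWM plus uniform gap weights is an assumption made for the example.

```python
import re

def parse_pattern(description):
    """Parse e.g. '<TTGACA (15, 21) TATAAT (4, 13) TSS>' into a list of elements."""
    body = description.strip()
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError("pattern must be enclosed in < >")
    # A token is either a gap range "(lo, hi)" or a run of letters.
    token_re = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)|([A-Za-z]+)")
    elements = []
    for m in token_re.finditer(body[1:-1]):
        word = m.group(3)
        if word is None:
            lo, hi = int(m.group(1)), int(m.group(2))
            if lo > hi:
                raise ValueError("gap range must satisfy lo <= hi")
            elements.append(("gap", lo, hi))
        elif word == "TSS":
            elements.append(("anchor", "TSS"))
        else:
            elements.append(("motif", word.upper()))
    return elements

def initial_pwm(motif, alphabet="ACGT"):
    """Assumed trivial start: probability 1 for the consensus base in each column."""
    return [{b: 1.0 if b == c else 0.0 for b in alphabet} for c in motif]

def initial_gap_weights(lo, hi):
    """Assumed trivial start: uniform weight over every allowed gap length."""
    n = hi - lo + 1
    return {length: 1.0 / n for length in range(lo, hi + 1)}

elements = parse_pattern("<TTGACA (15, 21) TATAAT (4, 13) TSS>")
```

Here `elements` holds two motifs, two gap ranges and the TSS anchor, in the order written in the description, which is all the later refinement needs.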
Beagle algorithm
• For each USR in the training set:
– Anchor the pattern to the known TSS location
– Determine the best match based on the current model
• Find the MLE of the model parameters based on the best matches from the training data.
• Test the refined definition on unseen data
– 10 repeats x 10 fold cross validation
– Essentially TSS prediction
• Iterate until improvement ceases.
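
One pass of the loop above might look like the following sketch for the two-box pattern. It is an illustration under stated assumptions, not the Beagle implementation: all function names are hypothetical, the model is a plain tuple `(pwm35, gaps1, pwm10, gaps2)`, sequences contain only A, C, G and T, the starting PWM is smoothed (0.7 for the consensus base) so that log scores are finite, and the re-estimation adds a pseudocount, so it is a smoothed rather than a strict MLE.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def consensus_pwm(motif, p=0.7):
    """Smoothed starting PWM (assumed): p for the consensus base, rest shared."""
    q = (1.0 - p) / (len(ALPHABET) - 1)
    return [{b: p if b == c else q for b in ALPHABET} for c in motif]

def uniform_gaps(lo, hi):
    return {g: 1.0 / (hi - lo + 1) for g in range(lo, hi + 1)}

def log_score(pwm, site):
    """Sum of log-probabilities of `site` under a PWM."""
    return sum(math.log(col[base]) for col, base in zip(pwm, site))

def best_match(seq, tss, model):
    """Best (score, gap1, gap2, site35, site10) with the pattern anchored at tss."""
    pwm35, gaps1, pwm10, gaps2 = model
    best = None
    for g2, w2 in gaps2.items():
        start10 = tss - g2 - len(pwm10)
        for g1, w1 in gaps1.items():
            start35 = start10 - g1 - len(pwm35)
            if start35 < 0:
                continue  # pattern would run off the upstream region
            site35 = seq[start35:start35 + len(pwm35)]
            site10 = seq[start10:start10 + len(pwm10)]
            score = (log_score(pwm35, site35) + math.log(w1)
                     + log_score(pwm10, site10) + math.log(w2))
            if best is None or score > best[0]:
                best = (score, g1, g2, site35, site10)
    return best

def estimate_pwm(sites, pseudocount=1.0):
    """Column-wise relative frequencies of the aligned sites, with a pseudocount."""
    pwm = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm

def estimate_gaps(observed, allowed, pseudocount=1.0):
    """Relative frequencies of the chosen gap lengths, with a pseudocount."""
    counts = Counter(observed)
    total = len(observed) + pseudocount * len(allowed)
    return {g: (counts[g] + pseudocount) / total for g in allowed}

def refine(training, model):
    """One refinement pass over (sequence, tss_index) pairs."""
    matches = [m for m in (best_match(s, t, model) for s, t in training) if m]
    return (estimate_pwm([m[3] for m in matches]),
            estimate_gaps([m[1] for m in matches], list(model[1])),
            estimate_pwm([m[4] for m in matches]),
            estimate_gaps([m[2] for m in matches], list(model[3])))

# Toy upstream region: -35 box at index 10, 17 bp gap, -10 box, 7 bp gap, TSS at 46.
seq = "G" * 10 + "TTGACA" + "C" * 17 + "TATAAT" + "G" * 7 + "ACGT"
model = (consensus_pwm("TTGACA"), uniform_gaps(15, 21),
         consensus_pwm("TATAAT"), uniform_gaps(4, 13))
score, gap1, gap2, site35, site10 = best_match(seq, 46, model)
refined = refine([(seq, 46)], model)
```

In practice `refine` would be called repeatedly on the training folds, with accuracy measured on the held-out fold after each pass, stopping once the held-out accuracy no longer improves.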
TSS recognition (% accuracy)
Pattern                        E. coli         B. subtilis
Canonical -35, -10 boxes       37.5 ± 1.4 %    61.6 ± 1.8 %
Canonical + distance to GSS    43.3 ± 1.2 %    61.2 ± 1.7 %
Guess which promoter boxes are more strongly conserved…
Including UP elements
• NNW15NN – AT rich region
• NNAAAWWTWTTNNAAANNN – Estrem et al 1998
• NNAAAWWTWTTN – A6RNR
– Gourse et al 2000
– distal - proximal motif
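
The motifs above use IUPAC nucleotide codes (R = A or G, W = A or T, N = any base). A minimal sketch of turning such a pattern into a regular expression follows. The helper name `iupac_to_regex` is hypothetical, only a subset of the IUPAC codes is covered, and reading a trailing number as a repeat count (W15 as fifteen W positions, A6 as six As) is an assumption about the slide notation.

```python
import re

# Subset of the IUPAC nucleotide codes, mapped to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
         "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}

def iupac_to_regex(pattern):
    """Translate e.g. 'NNW15NN' into a regular expression string."""
    parts = []
    # Each token is one code letter, optionally followed by a repeat count.
    for symbol, count in re.findall(r"([A-Z])(\d*)", pattern.upper()):
        parts.append(IUPAC[symbol] + ("{%s}" % count if count else ""))
    return "".join(parts)

at_rich = re.compile(iupac_to_regex("NNW15NN"))
proximal = iupac_to_regex("A6RNR")
```

A compiled pattern such as `at_rich` can then be scanned across an upstream region with `at_rich.search(sequence)`.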
TSS recognition (% accuracy)
Pattern                                         E. coli         B. subtilis
Canonical boxes + distance to GSS               43.3 ± 1.2 %    61.2 ± 1.7 %
Canonical + distance to GSS + Estrem UP         41.4 ± 1.2 %    62.0 ± 1.7 %
Canonical + distance to GSS + AT rich region    47.3 ± 1.2 %    64.8 ± 1.8 %
Comparing E. coli and B. subtilis promoters
[Figure: four motif panels: B. subtilis -35 element, B. subtilis -10 element, E. coli -35 element, E. coli -10 element]
E. coli has 7 known sigmas; B. subtilis has 18…
Motifs ‘in the Gap’
• Extended -10 element
– Consensus TGTATAAT
– Strongly implicated in B. subtilis
– Hypothesised as significant in 20% of E. coli promoters
• Extended -16 element
– Consensus TRTG
TSS recognition (% accuracy)
Pattern                                            E. coli         B. subtilis
Canonical boxes + distance to GSS                  43.3 ± 1.2 %    61.2 ± 1.7 %
Canonical + distance to GSS + TG extended -10      41.6 ± 1.3 %    62.5 ± 1.8 %
Canonical + distance to GSS + TRTG extended -10    37.6 ± 1.3 %    62.6 ± 1.8 %
The Complete Picture
[Figure: the complete promoter model. Labels: -35 and -10 elements; NTD, CTD and CTDII domains; positions -40.5, -52, -62 and -72; UP element, AT rich, variable location]
TSS recognition (% accuracy)
Pattern                                                             E. coli         B. subtilis
Canonical boxes + distance to GSS                                   43.3 ± 1.2 %    61.2 ± 1.7 %
Canonical + distance to GSS + TG extended -10 + AT rich region      48.3 ± 1.5 %    68.8 ± 1.6 %
Canonical + distance to GSS + TRTG extended -10 + AT rich region    40.5 ± 1.4 %    71.2 ± 1.7 %
TSS recognition (% accuracy)
E. coli: 43.3% baseline; +AT rich 47.3%; +TG 41.6%; +TG +AT rich 48.3%
B. subtilis: 61.2% baseline; +AT rich 64.8%; +TRTG 62.6%; +TRTG +AT rich 71.2%
Conclusions
• Beagle provides a simple bridge between experiment and computational discovery
– Is the extended -16 motif really important in E. coli?
– (Well, not in any general sense)
• Fast, robust and flexible
• Extensions
– Combination of model organisms
– Comparative genomics & regulation