Post on 11-Jan-2016
Mastering Microbes with Microchips
Fiona Brinkman
Department of Molecular Biology and Biochemistry
Simon Fraser University, Greater Vancouver, British Columbia, Canada
What I won’t talk about!
1. Pseudomonas Genome Database: Model for continually-updated genome annotation and analysis
2. Microarray analysis software development for the Pathogenomics (FPMI) Project
How can we best combat infectious disease-causing bacteria?
[Figure: scoreboard. Rank / Name / Kills: 1. Fiona, 54; 2. Ryan, 0]
Pathogens and The Art of War
“What is of supreme importance in war is to attack the enemy's strategy. Next best is to disrupt his alliances by diplomacy. The next best is to attack his army. And the worst policy is to attack cities.”
Pathogens and The Art of War
“And the worst policy is to attack cities.”
Infectious Diseases – There must be a better way…
Leading cause of productivity loss
Responsible for two thirds of deaths of persons under age 40
[Figure: Prevalence of Superbugs. % of isolates (0 to 40) for MRSA and VRE, 1980 to 1994. Source: Clinical Infectious Diseases 24:S133 (1997)]
Pathogens and The Art of War
“What is of supreme importance in war is to attack the enemy's strategy.”
strategy = virulence factors
Pathogens and The Art of War
“Attack your enemy where he is unprepared”
Boost innate immune system
How can we best combat pathogens?
A. Identify pathogen proteins more likely to be…
1. …virulence factors
- VGS Database and IslandPath
2. …quickly accessible to drugs/immune system (cell surface)
- PSORT-B
B. Identify human genes involved in boosting our innate immune system
Summary of insights and lessons learned…
Virulence Gene Subset (VGS) Database
• Based on literature analysis
• Experimentally determined virulence factors
• Extensive information in separate fields
– Species information
– Gene/Protein information
– Gene knockout information relevant to virulence studies
– Infection assay information
– References
Horizontal Gene Transfer and Virulence Factors
Transposons:
ST enterotoxin genes in E. coli
Prophages:
Shiga-like toxins in EHEC
Diphtheria toxin gene, Cholera toxin
Botulinum toxins
Plasmids:
Shigella, Salmonella, Yersinia
Pathogenicity Islands:
Uro/Entero-pathogenic E. coli
Salmonella typhimurium
Yersinia spp.
Helicobacter pylori
Vibrio cholerae
Pathogenicity Islands
Associated with
– Atypical %G+C
– tRNA sequences
– Transposases, Integrases and other mobility genes
– Flanking repeats
IslandPath: Aiding identification of Pathogenicity Islands and other Genomic Islands
Yellow circle = high %G+C
Pink circle = low %G+C
Region of unusual dinucleotide bias
tRNA gene lies between the two dots
rRNA gene lies between the two dots
Both tRNA and rRNA lie between the two dots
Dot is named a transposase
Dot is named an integrase
Hsiao et al. (2003) Bioinformatics 19: 418-420
Dinucleotide bias analysis
Genome divided into “ORF-clusters” of 6 consecutive ORFs
Dinucleotide relative abundance is calculated for the region as
$\rho^*_{XY} = f^*_{XY} / (f^*_X f^*_Y)$
where $f^*_X$ denotes the frequency of the mononucleotide X and $f^*_{XY}$ the frequency of the dinucleotide XY
For each ORF cluster, the average absolute dinucleotide relative abundance difference is
$\delta^*(f,g) = \frac{1}{16} \sum_{XY} |\rho^*_{XY}(f) - \rho^*_{XY}(g)|$
where f (fragment) is derived from sequences in an ORF-cluster and g (genome) is derived from all predicted ORFs in the genome
Hsiao et al. (2003) Bioinformatics 19: 418-420
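The dinucleotide relative abundance and the average absolute difference defined on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the IslandPath implementation: the function names are hypothetical, and frequencies are counted on a single strand, whereas the starred quantities on the slide may refer to strand-symmetrized frequencies.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]  # 16 entries


def relative_abundance(seq):
    """Return rho[XY] = f_XY / (f_X * f_Y) for all 16 dinucleotides.

    Assumption: single-strand counts; non-ACGT characters are skipped.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence needs at least one A/C/G/T dinucleotide")
    rho = {}
    for xy in DINUCLEOTIDES:
        expected = (mono[xy[0]] / n_mono) * (mono[xy[1]] / n_mono)
        observed = di[xy] / n_di
        # A dinucleotide whose bases never occur gets abundance 0 (a choice
        # made for this sketch only).
        rho[xy] = observed / expected if expected > 0 else 0.0
    return rho


def delta_star(fragment, genome):
    """Average absolute relative-abundance difference over 16 dinucleotides."""
    rho_f = relative_abundance(fragment)
    rho_g = relative_abundance(genome)
    return sum(abs(rho_f[xy] - rho_g[xy]) for xy in DINUCLEOTIDES) / 16
```

In use, `fragment` would be the concatenated sequence of one ORF-cluster and `genome` the concatenation of all predicted ORFs; a fragment compared with itself gives 0.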
Dinucleotide bias analysis
“ORF-clusters” sampled in an overlapping manner (shift by one ORF at a time)
The mean $\delta^*(f,g)$ is calculated by averaging the results from all ORF-clusters in the genome
Regions with $\delta^*(f,g)$ more than 1 standard deviation away from the mean are marked on the IslandPath graphical display with strikethrough lines
Why did we use 6 ORFs per cluster?
- Not enough bp in a single ORF to get a good estimate
- 4.5 kb (corresponding to approximately 6-8 ORFs) is required for “reliable estimation of nucleotide composition” (Lawrence and Ochman, J Mol Evolution 1997 44:383-97)
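The overlapping sampling and the one-standard-deviation cutoff described on this slide could look like the sketch below. The function names are hypothetical, the per-cluster scores are assumed to have been computed already, and the use of the population standard deviation and of deviations in either direction are assumptions of this sketch rather than details given in the slides.

```python
from statistics import mean, pstdev


def orf_clusters(orfs, size=6):
    """Overlapping windows of `size` consecutive ORFs, shifted by one ORF."""
    return [orfs[i:i + size] for i in range(len(orfs) - size + 1)]


def flag_biased(scores, n_sd=1.0):
    """Mark each cluster whose score lies more than n_sd SDs from the mean."""
    mu = mean(scores)
    sd = pstdev(scores)  # assumption: population SD over all clusters
    return [abs(s - mu) > n_sd * sd for s in scores]
```

For example, eight ORFs yield three clusters of six, and in a score list such as `[1, 1, 1, 1, 5]` only the last cluster is flagged.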
[Figure: Boxes: Known islands in the Salmonella typhi genome (islands labeled I to X)]
What features best predict Islands?
Examined prevalence of features in over 200 known islands
• 94% of islands contain >25% dinucleotide bias (majority have >75% dinucleotide bias coverage)
• Mobility genes identified in >75% (but ID recently improved)
• Atypical %G+C (above cutoff used in Brinkman et al., 2002) not over 50% coverage on average, and tRNA genes not observed with >50% of known islands
[Figure: Boxes: “Insertions” in the Salmonella typhi genome versus Salmonella typhimurium]
Properties of genes in these islands?
Defined a “putative island” as
– 8 or more genes in a row with dinucleotide bias
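The “putative island” definition above amounts to finding runs of consecutive biased genes. A minimal sketch, assuming each gene already carries a True/False bias flag (the function name and the input representation are hypothetical):

```python
def putative_islands(biased, min_genes=8):
    """Return (start, end) index pairs, inclusive, for every run of at least
    `min_genes` consecutive genes flagged as dinucleotide-biased."""
    islands = []
    start = None
    for i, flag in enumerate(biased):
        if flag and start is None:
            start = i                      # a run begins
        elif not flag and start is not None:
            if i - start >= min_genes:     # the run just ended; keep if long enough
                islands.append((start, i - 1))
            start = None
    # Handle a run that extends to the last gene.
    if start is not None and len(biased) - start >= min_genes:
        islands.append((start, len(biased) - 1))
    return islands
```

A run of 7 biased genes is discarded, while runs of 8 or more are reported by their first and last gene index.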
Functional category analysis
Any difference for genes in islands versus genome?
P value of Paired T test (66 organisms): 4e-19
Hypothetical genes are more common in putative islands vs the rest of the genome
Why are hypothetical genes more common within putative islands/dinucleotide biased regions?
1. Genes being horizontally acquired in bacteria come from a large pool of as yet unstudied genes?
2. Genes are being mispredicted within these regions because of the region’s different genomic composition?
Testing hypothesis 2:
- Genes <300 bp in size are more likely to be false positives
- Therefore, remove genes less than 300 bp and reanalyze
P value of Paired T test (55 organisms): 0.027
P value of Paired T test (66 organisms): 3e-17
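For readers unfamiliar with the paired t-test used here, the statistic can be sketched as below: each organism contributes one pair (fraction of hypothetical genes inside putative islands, fraction in the rest of the genome). The function name is hypothetical and the sketch stops at the t statistic and degrees of freedom; converting these to a p value needs a t-distribution, which a statistics library would supply.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(in_islands, rest_of_genome):
    """Paired t statistic and degrees of freedom for per-organism pairs."""
    if len(in_islands) != len(rest_of_genome):
        raise ValueError("need one pair of values per organism")
    diffs = [a - b for a, b in zip(in_islands, rest_of_genome)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two organisms")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1
```

With made-up illustrative fractions for four organisms, `paired_t([0.5, 0.6, 0.7, 0.6], [0.4, 0.4, 0.5, 0.5])` gives a t of about 5.2 on 3 degrees of freedom; these numbers are not from the study.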
Other categories more common in islands
COG functional category | Paired T test p value | Hypothesis to test
Translation, ribosomal structure and biogenesis | 4.6e-8 | Ribosome operons highly expressed and so have unusual bp composition and falsely ID’d as islands
Cell motility | 6e-3 | Mix of above and below hypotheses
Secretion | 0.02 | Reflects nature of acquired subnetworks and how they must interact with the environment?
Acquiring genes = Acquiring subnetworks
Most functional categories involve cytoplasmic proteins
Secretion category more associated with subcellular localization and possible subnetworks that would be easy to add to an existing cell network
[Figure: bacterial cell]
What does all this mean?
1. Acquired genes may come from a large pool of genes of which many are still uncharacterized?
2. Acquired genes = acquired subnetworks …that involve interactions that cross cell membranes?
3. What predicted gene dataset you use can have a significant effect on downstream analyses.
4. Analyzing correlations is difficult! Keep testing those hypotheses!
Future studies
1. Vary the analysis approach
- Same result with other functional category classification systems
- More precise criteria for identifying islands
- Different dinucleotide bias calculation?
2. Examine in the context of gene expression data
3. Statistical modeling of the data (Dana Aeschliman and Jenny Bryan)
How can we best combat pathogens?
A. Identify pathogen proteins more likely to be…
1. …virulence factors
- VGS Database and IslandPath
2. …quickly accessible to drugs/immune system (cell surface)
- PSORT-B
B. Identify human genes involved in boosting our innate immune system
Summary of insights and lessons learned…
Subcellular Localization Prediction
Annotation
Experimental design
Functions
Drug/vaccine targets
www.psort.org/psortb
• Web-based subcellular localization prediction tool
• Score for each of 5 primary Gram -ve localization sites
– PSORT I does not predict extracellular proteins
– Also returns “unknown” (PSORT I forces a prediction)
• Trained and tested using a dataset of proteins of experimentally-verified subcellular localization
– Constructed manually through literature review
– Largest dataset of its kind
• Analyzes 6 biological features using 6 modules
– More comprehensive than existing tools
PSORT-B Modules
Signal peptides: Non-cytoplasmic
Amino acid composition/patterns: Cytoplasmic All localizations
- Support Vector Machines trained with aa composition subsequences
Transmembrane helices: Inner membrane
- HMMTOP
PROSITE motifs: All localizations
Outer membrane motifs: Outer membrane
- Association-rule mining to identify
Homology to proteins of experimentally known localization: All localizations
- “SCL-BLAST” against database of proteins of known localizations
- E=10e-10 and Length restriction of 80-120% vs both subject and query
Integration with a Bayesian Network
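The SCL-BLAST acceptance rule listed above (an E-value cutoff plus a length restriction) can be illustrated with a small filter. This is a sketch under an assumed reading of the slide, namely that the query length must be 80-120% of the subject length and the subject length 80-120% of the query length; the function name and parameters are hypothetical and are not taken from PSORT-B.

```python
def passes_scl_blast_filter(evalue, query_len, subject_len,
                            max_evalue=10e-10, low=0.8, high=1.2):
    """Accept a BLAST hit only if it is significant and of comparable length.

    Assumption: the 80-120% restriction is applied in both directions,
    query relative to subject and subject relative to query.
    """
    if evalue > max_evalue:
        return False
    q_vs_s = query_len / subject_len
    s_vs_q = subject_len / query_len
    return low <= q_vs_s <= high and low <= s_vs_q <= high
```

Under this reading a 100-residue query matching a 110-residue subject is accepted, while the same query matching a 130-residue subject is rejected regardless of E-value.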
Of Precision, Recall and Accuracy…
• PSORT-B designed for high precision (97% specificity, $\frac{TP}{TP+FP}$)
– PSORT I’s specificity measured at 59%
• However, recall lower (75% sensitivity, $\frac{TP}{TP+FN}$) which affects overall measure of accuracy
– PSORT I recall 60%
• New version to be released this year
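The two ratios on this slide, TP/(TP+FP) for precision and TP/(TP+FN) for recall, are computed as below. The function names and the counts in the usage note are illustrative only and do not come from the PSORT-B evaluation.

```python
def precision(tp, fp):
    """Fraction of predictions that are correct: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no predictions were made")
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of true cases that were found: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no true cases in the dataset")
    return tp / (tp + fn)
```

With hypothetical counts of 90 true positives, 10 false positives and 30 false negatives, precision is 0.9 and recall is 0.75, which shows how a tool that declines to predict uncertain cases can keep precision high while recall falls.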
Insights Gained During Development
• Localization is a highly evolutionarily conserved trait
– Conserved between Gram-positives and Gram-negatives (for localizations present in both classes)
– Reflection of the:
Need for cell to conserve subcellular networks?
Different environments of each localization?
Insights Gained During Development
• Identified motifs characteristic of outer membrane proteins through a data mining approach (Martin Ester, Ke Wang, and others)
– Motifs (~6 aa long) map primarily to periplasmic turn regions of known 3D structures
– May reflect importance of periplasmic turns in a transmembrane beta-barrel structure vs. other similar non-membrane barrel structures
[Figure: Periplasmic turns]
Analysis of bacterial proteomes
• What proportion of proteins are of a particular subcellular localization?
• Investigating the hypothesis:
– The proportion of membrane proteins increases in those organisms inhabiting a greater variety of environments
• Analysis of the deduced proteomes from 77 bacterial genome projects.
[Figure: No. of pred. proteins vs. Proteome Size, four panels with linear fits]
Cytoplasmic: y = 0.0781x + 15.018, R2 = 0.8992
Cytoplasmic Membrane: y = 0.1601x - 4.0179, R2 = 0.9704
Outer Membrane: y = 0.0132x - 6.8619, R2 = 0.6356
Extracellular: y = 0.0041x - 1.1381, R2 = 0.6703
PSORT-B prediction | Proportion of total predicted proteins (%) | St. dev.
Cytoplasmic | 30 % | 5.9 %
Cytoplasmic Membrane | 57 % | 5.8 %
Periplasmic | 7.6 % | 3.1 %
Outer Membrane | 3.8 % | 1.9 %
Extracellular | 1.3 % | 0.8 %
What does this mean?
1. Protein localization is very conserved
2. Increased genome size = increase in networks
Therefore, conservation in localization proportions indicates that new networks being added tend to traverse localizations
3. Note: Can’t discount biases in unpredicted proteins, but new PSORT-B version will help confirm results
Summary
• Converting pathogens and boosting rapid defenses may be the way to win the war against pathogens
• Identifying virulence factors is critical
• Acquired genes, including virulence factors, may come from a large pool of genes that are predominantly uncharacterized.
• Acquired genes = acquired subnetworks that involve interactions that tend to traverse subcellular boundaries.
www.pathogenomics.sfu.ca/brinkman
The Brinkman Lab
Genome Prairie, Genome BC, Inimex, NSERC
Ray Karsten Geoff Sébastien Matt
Jenn Will Mike Fiona Anastasia “The other Alison Fiona”
Dana Aeschliman, Jenny Bryan
Martin Ester, Ke Wang, Rong She, Christopher Walsh
All Software freely available and open source
FPMI: Functional Pathogenomics of Mucosal Immunity (www.pathogenomics.ca)
INDUSTRY: Inimex Pharma Inc
ACADEMIA: VIDO, U Sask, UBC, SFU, BC GSC
GOVERNMENT: Genome Canada, Genome Prairie, Genome BC, Govt of Saskatchewan