informatik anwendungen in der genom- und proteomforschung
TRANSCRIPT
funded by the German Federal Ministry for Education and Research
InformatikAnwendungen in der Genom-
und ProteomforschungKnut Reinert
FU Berlin
Mai 2006HU Berlin, Adlershof
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Bioinformatics in the media
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Topics in Bioinformatics
Gene finding:•Improve prediction of coding and regulatory regions•Comparing multiple genomes is promising
Some pictures with the courtesy of the MPI für Informatik, Saarbrücken
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Topics in Bioinformatics
Find drugs that alter or inhibit the function of the target molecule.Searching data bases helps to find suitable candidates and reduce side effects.
Molcular therapieof deseases:protein-protein docking
Methotrexate bindet an dihydrofolate reductase
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Topics in Bioinformatics
Identifying SNPs allows:• the association of patterns of genetic diversity to diseases • the association of genetic patterns to drug tolerance
Identifying SNPs(or other polymorphisms)
TTCGGGTTGTTCGGGTTGGGACGTGCACTAAACGTGCACTAACCTCGTCG
GACGTGCACTAAATCGCGCAACTGACGTGCACTAAATCGCGCAACTG
GGGTTGGGGTTGCCACGTGCACTAAATCGCGCAACTGACGTGCACTAAATCGCGCAACTG
TTCGGGTTGTTCGGGTTGGGACGTGACGTGTTACTAAATCGCGCAACTGACTAAATCGCGCAACTG
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Inhalte der Bioinformatik
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Topics in Bioinformatics
Problem (Project together with Charité):Within the first year 8-10% of the patients lose thetransplant.After 10 years about 50% have lost the transplant.Diagnosis is invasiv and leads sometimes to loss of craft.
Goal:Analyse urin samples of patients and detect as early as possible diagnostic markers to counteract craft loss.
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Topics in Bioinformatics
Automated measurement methods leadto terabytes of data (Prof. Schlüter)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Mass spectrometry
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Mass spectrometry
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Mass spectrometry
0 250 500 750 1000m/z
EVAAFAQFGSDLDASTK
3x increased expression
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Bioinformatics in Berlin
Max-Planck Institutefor Molecular Genetics
HU BerlinUniversity of
Applied Sciences
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Overview
• Motivation for OpenMS• Bioinformatics Issues in quantitative
Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Genomics
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Genomics vs. Proteomics
Same genome...
...different proteomes
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Genomics vs. Proteomics
Genomics ProteomicsGenome is rather
static
~ 30 k genes
Established, fully automated technology
(capillary sequencer)
Proteome is dynamic
(age, tissue, what youhad for lunch)
Up to 2000 k Proteins
Emerging technology
(MS, HPLC/MS, protein chips)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
• Transcription and translation are heavily regulated
• Protein expression levels are not static
• mRNA levels and protein levels often not correlated
• Contradictory results from seemingly similar methods
• RNA chips
• DNA chips
• gene disruption
• knock out
ProteinDNA mRNATranscription Translation
From Genes to Proteins
Anderson et al., Electrophoresis (1998), 19, 1853-61
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
TransportPosttranslational modification
Regulation
Exogeneous parameters(stress, temperature,
drugs, metabolites, ...)
Degradation
Folding
Proteins = End Products?
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Definition of proteomics
• Proteomics [n]: the branch of genetics that
studies the full set of proteins encoded by a
genome(www.hyperdictionary.com)
• Proteomics can be defined as the qualitative
and quantitative comparison of proteomes under
different conditions to further unravel
biological processes.(www.expasy.org)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Target ID and biologicalnetworks
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
TrialBiol.Data
Drug Discovery Pipeline
Accelerated by
Bioinformatics and
Computational Chemistry
Target ID Lead ID Optimization Approval
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
TrialBiol.Data
Drug Discovery Pipeline
Accelerated by
Bioinformatics and
Computational Chemistry
Target ID Lead ID Optimization Approval
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Application fields
• Diagnostics: Find relevant patterns in one-or two-dimensional LC measurements
• Time series: Analyze the temporal behaviour in a time series experiment
• Quantitative Measurements: Determineabsolute content of peptides usingadditive method (Myoglobin, Gliadin)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Differential Analysis in Proteomics
All techniques share the following steps:
1. Separation of proteins/peptides.
Typical techniques: 2D PAGE, HPLC, DC, etc.
2. Assignment of proteins/peptides for relative
quantitation
3. (Relative) Quantitation
4. Identification of peptides/proteins
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Separation (2D-PAGE)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Separation (ShotgunProteomics)
• Key idea– Separation of whole proteins possible but
difficult, hence digestion preferred– Separate peptides– Identify proteins through peptides
MG
M
K
N
VQWEDSLG
G L
L
VWGM
GE G
A
I
H
R
V ED V A G G Q
E VL
FL K
TPH EGEL
K
F D K F K
H L K
ES D ME K KHA S E D L K
A
T HN GVL
T
LGGILK
KFGE L GQ
P VI K
Q SAHG LH E A E L T P H A T K I
Q
VL
Q
S
Y E
AE
LF
K
II
S
RFA L E L
GD
F
P G
A
H
M
Q
S
G
DA
A
K
N MD AA K
Y K
Peptid-digest
digestion
Proteins
Separation
M G L S D G E W Q L V L N V W G K
H P G D F G A D A Q G A M S K
Y L E F I S E A I I Q V L Q S K
G H H E A E L T P A Q S H A T K
V E A D V A G H G Q E V L RI
S E D E M K
A S E D L K
A L E L F R
E L G F Q G
N D M A A K
I P V K
H L K
F D K
L F K
F K
Y K
H K
K
G H P E T L KE
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Separation 1Different peptides have different retention time
IonizationPeptide receivesz charge units
Separation 2Detector measures m/z
HPLC ESI TOF Spectrum(scan)
HPLC-MS Analysis
RT
I
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Single stage MS scan
Massspectrometry
Measures mass/charge ratio of
ionized peptidesfor a short period of
time
m/z
Inte
nsit
y
Peak intensity in scan corresponds to amount present, but intensities
are not comparable!
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Maps
0 250 500 750 1000m/z
EVAAFAQFGSDLDASTK
15 nmol/µl, 3x overexpressed, …
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Differential Analysis
Two common basic approaches:• Isotope labeling (e.g. ICAT, MeCAT, SILAC,…)• Direct Differential Quantification (DDQ)
ICAT pairs atknown distance
diseased
normal
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Isotope Labeling (ICAT)
Heavy and light ICAT reagents are 8 Dalton apart
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Isotope Labeling (ICAT)
“diseased”
Cell state 1
Cell state 2
“Normal”
Label proteins with heavy ICAT
Label proteins with light ICAT
Combine
Fractionate protein prep- membrane - cytosolic Proteolysis
Isolate ICAT-labeled peptides
Nat. Biotechnol. 17: 994-999,1999
Heavy and light ICAT reagent 8 Dalton apart
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Map 1 (normal) Map 2 (diseased)
Differential Analysis
Two common basic approaches:• Isotope tagging (e.g. ICAT, MeCAT)• Direct Differential Quantitation (DDQ)
diseased normal
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
What’s in a Map?
• Retention time (RT) for each scan
• Peptide mass/charge ratio (m/z) (usually within ~20 ppm)
• Intensity (I)
⇒ use m/z, RT to identify peptides
⇒ use I to quantify peptides
(relative quantitation only!)
Maps become HUGE (108 Peaks)!
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Raw Data
FilteredRaw Data
Map
AnnotatedMaps
Proteomics Data Flow
HPLC/MSSample
Sig.Proc.
Data Reduction
Diff.Anal.
IdentificationDifferentially
ExpressedProteins
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Raw Data
FilteredRaw Data
Map
AnnotatedMaps
Proteomics Data Flow
HPLC/MSSample
Sig.Proc.
Data Reduction
Diff.Anal.
IdentificationDifferentially
ExpressedProteins
10 GB
1 GB50 MB
50 MB 1 kB
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
HPLC/MS – Becoming the Standard?
Advantages of HPLC/MS over 2D PAGE1. Easier automation
– no robots– no gel handling
2. Better separation power– Monolithic columns– Multi-dimensional chromatography
3. Simpler coupling to MS
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
HPLC/MS – Becoming the Standard?
Disadvantages of HPLC/MS• Proteins are chopped up• Quantitation difficult• Huge amount of data (10 GiB/run)• Data hard to manage/interpret
⇒ Need for Informatics!
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Overview
• Motivation for OpenMS• Bioinformatics Issues in quantitative
Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Signal Processing
Issues• Baseline reduction• Noise filtering• Calibration⇒ Increase reliability of data
m/z
inte
nsity
m/zin
tens
itym/z
inte
nsity
m/z
inte
nsity
+ + =
signal baseline noise raw datam/z
inte
nsity
m/z
inte
nsity
Eva
Lang
e
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Signal Processing
m/z
inte
nsity
m/zin
tens
itym/z
inte
nsity
m/z
inte
nsity
+ + =
signal baseline noise raw data
Eva
Lang
e
Feature MapRaw Map Stick Map
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Baseline Filtering
TopHat filtering determines baseline
200 400 600 800 1000 1200 1400 1600 1800 2000
-2000
0
2000
4000
6000
8000
10000
12000
14000Shots
200 400 600 800 1000 1200 1400 1600 1800 2000
1
1.5
2
2.5
3
3.5x 104 Shots
Eva Lange
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Stick conversion aka Peakpicking
Issues
• „peak picking“ – Identify peaks
• „stick conversion“ – Integrate peaks to sticks
⇒ Reduce amount of data by factor of 10 – 100
m/z
inte
nsity
raw datam/z
inte
nsity
sticks
Eva
Lang
e
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
m/z
inte
nsity
raw datam/z
inte
nsity
sticks
Eva
Lang
e
Feature MapRaw Map Stick Map
Stick conversion aka Peakpicking
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peak picking in m/z dimensionMain objectives of a peak picking algorithm:•precise peak positions •run in real time
m/z
inte
nsity
m/z
inte
nsity
Main difficulties of a peak picking algorithm:•considerable asymmetry of peaks
•convolution of isotopic peaks
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Three step algorithm
1. Peak detection2. Extract important peak parameter:
• centroid• area• height• fwhm• asymmetric peak shape• signal/noise ratio
3. Optimize the peak parameter (optional)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
2230 2234 2238 2242 2246 2250
2230 2234 2238 2242 2246 2250
2230 2234 2238 2242 2246 2250
2230 2234 2238 2242 2246 2250
A
B
C
D
Idea: Split the signal into different frequency ranges
(length scales)
Method: Continuous Wavelet Transformation (CWT)
1. Peak detection
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
CWT is the sum over all positionsb of the signal multiplied byscaled, shifted versions of the wavelet
CWT breaks up the signal intoshifted and scaled versions of the original (mother) wavelet
Continuous wavelet transformation
m/zb
,)(')(1),(W dxa
bxxsa
abs ∫+∞
∞−
−= ϕϕ
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
1. Peak detection
Workflow 1. Compute the wavelettransform
2. Search for a peak‘smaximum
3. Search for peak-endpoints
4. Estimate the centroid
5. Determine the height
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
1. Compute the peak positions
Compute the CWT:
•use the marr wavelet
•predefined scale a (should correspond to the typical peakwidth)
maximum position in the CWT is a good first estimate of the maximum in the data even for asymmetric peaks
convolution can be computed very efficiently with pre-tabulated values for the wavelet
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
2. Extract important peak parameter
20
,,2
))(cosh()(0 xxw
hxSech xwh−
=
Fit functions similar to real raw mass peaks
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Workflow
2. Extract important peakparameter
1. Estimate the peak‘sleft area
2. Fit a sech2 to the left
3. Estimate the peak‘sright area
4. Fit a sech2 to the right
Asymmetric peak shape
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
2. Extract important peak parameter
Fitting asymmetric lorentzian and sech2 functions to a rawpeak using the peak‘s :
•maximum position
•height
•area
introduces a smoothing effect
compute the peak‘s width from its area in constant time
yield very good approximations to the original peak shape
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
for all mass spectra s in experiment dobox_list = splitSpectrum(s)for all boxes b in box_list do
peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do
(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)
(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)
push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)
end whileoptimizePeakParameter(peak_list,b)
end forend for
Peak picking algorithm
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peak picking algorithmfor all mass spectra s in experiment do
box_list = splitSpectrum(s)for all boxes b in box_list do
peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do
(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)
(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)
push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)
end whileoptimizePeakParameter(peak_list,b)
end forend for
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
for all mass spectra s in experiment dobox_list = splitSpectrum(s)for all boxes b in box_list do
peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do
(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)
(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)
push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)
end whileoptimizePeakParameter(peak_list,b)
end forend for
Peak picking algorithm
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
for all mass spectra s in experiment dobox_list = splitSpectrum(s)for all boxes b in box_list do
peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do
(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)
(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)
push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)
end whileoptimizePeakParameter(peak_list,b)
end forend for
Peak picking algorithm
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
for all mass spectra s in experiment dobox_list = splitSpectrum(s)for all boxes b in box_list do
peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do
(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)
(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)
push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)
end whileoptimizePeakParameter(peak_list,b)
end forend for
Peak picking algorithm
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Evaluation of a peak picking method
Compute accurate:• centroid• area• height
test against a set of known composition
Results are heavily affected by:• quality of experimental data• calibration method
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Data set I: peptide mix (peptide standards mix #P2693 from Sigma Aldrich) of nine known peptides.
Measuring method: HPLC-MS using a quadrupole ion trap mass spectrometer equipped with an electrospray ion source in full scan mode (m/z 500-1500)
low resolution (Δm=0.2) and several overlapping isotope patterns
Evaluation of our peak picking method
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Evaluation of our peak picking methodEvaluation:• compare with vendor supplied Bruker Data-
Analysis software (3.2) on the same spectra• no sophisticated calibration, we only allow a
constant mass offset to keep the number of fit parameters as small as possible
Performance criteria:1. number of the discovered and separated isotopic
patterns (given by at least three consecutive peaks) of each peptide in an expected retention time interval
2. average relative error of the monoisotopic mass
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
591 591.5 592 592.5 593 593.5 594 594.5
10x 104
#occOpenMS Apexzmasstheo
01321620.8151I
1821349.7360H
0721183.5730G
2321061.5614F
2311084.4379E
5811007.4365D
15191574.2257C
29291573.3071B
A 19221556.2693
Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))
Data set I (Peptide mix of nine known peptides)
Relative error of centroids:
21ppm, 1.2ppm, 35ppm, 16ppm
OpenMSApex
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
#occOpenMS Apexzmasstheo
01321620.8151I
1821349.7360H
0721183.5730G
2321061.5614F
2311084.4379E
5811007.4365D
15191574.2257C
29291573.3071B
A 19221556.2693
Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))
Data set I (Peptide mix of nine known peptides)
591 591.5 592 592.5 593 593.5 594 594.5
10x 104
Relative error of centroids:
21ppm, 1.2ppm, 35ppm, 16ppm
OpenMSApex
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
#occOpenMS Apexzmasstheo
01321620.8151I
1821349.7360H
0721183.5730G
2321061.5614F
2311084.4379E
5811007.4365D
15191574.2257C
29291573.3071B
A 19221556.2693
Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))
Data set I (Peptide mix of nine known peptides)
591 591.5 592 592.5 593 593.5 594 594.5
10x 104
Relative error of centroids:
21ppm, 1.2ppm, 35ppm, 16ppm
OpenMSApex
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
rel. err. [ppm]
OpenMS Apexzmasstheo
-3721620.8151I
132821349.7360H
-1521183.5730G
645621061.5614F
121811084.4379E
942511007.4365D
21441574.2257C
16161573.3071B
A 39311556.2693
Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))
Data set I (Peptide mix of nine known peptides)
591 591.5 592 592.5 593 593.5 594 594.5
10x 104
Relative error of centroids:
21ppm, 1.2ppm, 35ppm, 16ppm
OpenMSApex
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
rel. err. [ppm]
OpenMS Apexzmasstheo
-3721620.8151I
132821349.7360H
-1521183.5730G
645621061.5614F
121811084.4379E
942511007.4365D
21441574.2257C
16161573.3071B
A 39311556.2693
Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))
Data set I (Peptide mix of nine known peptides)
591 591.5 592 592.5 593 593.5 594 594.5
10x 104
Relative error of centroids:
21ppm, 1.2ppm, 35ppm, 16ppm
OpenMSApex
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
rel. err. [ppm]
OpenMS Apexzmasstheo
-3721620.8151I
132821349.7360H
-1521183.5730G
645621061.5614F
121811084.4379E
942511007.4365D
21441574.2257C
16161573.3071B
A 39311556.2693
Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))
Data set I (Peptide mix of nine known peptides)
591 591.5 592 592.5 593 593.5 594 594.5
10x 104
Relative error of centroids:
21ppm, 1.2ppm, 35ppm, 16ppm
OpenMSApex
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Sanity check of our peak picking methodData set II: tryptic digest of bovine serum albumin (BSA from Sigma Aldrich).
Measuring method: MALDI-TOF using a Ultraflex II Lift mass spectrometer operated in the reflectron mode and using Panorama delayed ion extraction (m/z 500-5000)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Sanity check of our peak picking method
Evaluation:• compare with vendor supplied Bruker FlexAnalysis
software on the same spectra• no sophisticated calibration
Performance criterion:1. compute the sequence coverage using the
determined peak list as a MASCOT peptide mass fingerprinting query
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
80 - 93
Sequence coverageOpenMS Bruker
44%52% - 67%
Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)
rel. err. [ppm]
OpenMS Bruker
95
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
80 - 93
Sequence coverageOpenMS Bruker
44%52% - 67%
Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)
rel. err. [ppm]
OpenMS Bruker
95
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
9580 - 93
Sequence coverageOpenMS Bruker
44%52% - 67%
Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)
rel. err. [ppm]
OpenMS Bruker
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Sequence coverageOpenMS Bruker
44%52% - 67%
Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)
rel. err. [ppm]
OpenMS Bruker
9580 - 93
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Sequence coverageOpenMS Bruker
44%52% - 67%
9580 - 93
Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)
rel. err. [ppm]
OpenMS Bruker
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Overview
• Motivation for OpenMS• Bioinformatics Issues in quantitative
Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Feature Finding
Isotope pattern Elution profile
Peptide (feature)
Capture ALL peaks belonging to a peptide for quantitation!
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Feature Finding
• A two-dimensional model has to beadjusted to the raw data
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Feature Finding …
Attributes of a feature:
mass-to-charge ratio
retention time
intensity, volume
quality
… … …
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Feature Finding
m/z rt
feature model isotope pattern elution profile+=
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Isotope Patterns
• Natural isotopes occur with well-known abundances
• Peak intensities determined by binomialconvolution
• Depend on molecular formula of peptide
12C 98.90% 13C 1.10%14N 99.63% 15N 0.37%16O 99.76% 17O 0.04% 18O 0.20%1H 99.98% 2H 0.02%
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Randomly sample peptides from protein database and compute average composition. Table with average isotope pattern for specified
peptide mass
m [Da] P(k=0)
P(k=1)
P(k=2)
P(k=3)
P(k=4)
1000 0.55 0.30 0.10 0.02 0.00
2000 0.30 0.33 0.21 0.09 0.03
3000 0.17 0.28 0.25 0.15 0.08
4000 0.09 0.20 0.24 0.19 0.12
Estimating the Isotope Pattern
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Feature Models: m/z
• Isotope patterns
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
-2 -1 0 1 2 3 4 5 6
relati
ve in
tensit
y
isotopic rank
Effect of the smoothing width on an averagine isotope pattern at mass 1350
stdev = 0.10.20.30.6
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Feature Models: rt
• Elution profiles– Currently modeled by a normal
distribution, other shapes possible
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peak aggregation to features yields• 1,000-fold compression in data volume• reduction to meaningful entity: a peptide• accurate quantitation: integration of all related peaks
Peak Aggregation
FeaturesRT
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peak Aggregation
Features
Feature MapRaw Map Stick Map
Clemens GröplOle Schulz-Trieglaff
Jens Joachim
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Overview Data Reduction
• Noise/baseline filtering
• Peak picking, integration• Peak aggregation
– De-isotoping: joining peaks from sameisotope pattern
– De-charging: joining features from samepeptide, different charge states
⇒ Aggregated MapReduction in Data Volume: up to 104
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Overview
• Motivation for OpenMS• Bioinformatics Issues in quantitative
Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases
• Subsequent talks give more details
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Map Mapping – At All Levels
Data reduction Mapping
Raw Map 1 Raw Map 2
Stick Map 1 Stick Map 2
Feature Map 1 Feature Map 2 Feature Map n…
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Mapping of Raw Maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Mapping Discrete Peaks
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Geometric Hashing
• Consider difference vectors from features in map 1 (“+”) to features in map 2 (“×”)
• Restrict difference vector to reasonable differences along RT and m/z dimensions
• Hash these vectors into bins (“2D histogram”)
• In sufficiently similar maps, the optimal translation will be the most frequent one
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Finding the Offset Vector
Clemens GröplEva Lange
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Overview
• Motivation for OpenMS• Bioinformatics Issues in quantitative
Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases
• Subsequent talks give more details
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Differential Analysis
• Identify „differential peptides“• Depends on quantitation technique
– Labeling techniques: Find pairs within same map
– Unlabeled (DDQ): Assign matching pairs across maps
• Usually done manually• In HT settings this has to be done
automatically!• Reduction to features helps enormously
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Assuming charge z and n Cys in a peptide,check for pairs (8n)/z Thomson apart in ONE MAP
ICAT Pair Identification
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Assign pairs across two or more mapsRT of peptide may vary between maps⇒ compute suitable mapping
(linear function usually suffices)
DDQ – Peptide Assignment
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Multidimensional Search
• Use efficient data structures for range searches (e.g. Kd-trees or range trees)
• Employ memory management for cached range queries to allow for large data sets (e.g. large stick maps)
• Making use of robust and efficient implementations of these data structures in CGAL (Computational Geometry Algorithms Library)
• CGAL provides standard data structures for d-dimensional computational geometry
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Map Normalization
• Intensities will vary between maps, even for the same sample (e.g. volume error during injection)
• Usually this results in a constant ratio of feature intensities
• Map normalization is required to determine this ratio for accurate relative quantitation
• Assuming a sufficient similarity between two maps, the majority of features will be identical, only few will be differ in concentration between the two samples
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Run several experiments on the same sample
• Too time-consuming for manual assignment
• Automatic assignment increases reliability
1. Features that show up most of the time
are real
2. Features that do not are probably noise
3. Allows statements about statistical
significance
Improve Sensitivity and Specificity
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
feature 1 feature 2 feature 3 feature n………
feature 1 feature 2 feature 3 feature n………
Compute scaling between pairs of maps, group them
Normalize and group measurements
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Statistical analysis of differential patterns
Marker in patient Marker in control
Ole Schulz-TrieglaffTim Conrad
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Overview
• Motivation for OpenMS• Bioinformatics Issues in quantitative
Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peptide Identification (MS-MS)
Secondary fragmentation
Fragment Spectrum
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peptide Fragmentation
H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH
Ri-1 Ri Ri+1
AA residuei-1 AA residuei AA residuei+1
N-terminus C-terminus
The peptide backbone breaks to formfragments with characteristic masses.
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peptide Fragmentation
H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH
Ri-1 Ri Ri+1
AA residuei-1 AA residuei AA residuei+1
N-terminus C-terminus
The peptide backbone breaks to formfragments with characteristic masses.
Ionized parent peptide
H+
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peptide Fragmentation
H...-HN-CH-CO NH-CH-CO-NH-CH-CO-…OH
Ri-1 Ri Ri+1
AA residuei-1AA residuei AA residuei+1
N-terminus C-terminus
The peptide backbone breaks to formfragments with characteristic masses.
Ionized peptide fragment
H+
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peptide Ions in Spectrum
K1166
L1020
E907
D778
E663
E534
L405
F292
G145
S88 b ions
100
0250 500 750 1000 m/z
% In
tens
ity
147260389504633762875102210801166 y ions
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Peptide Ions in Spectrum
147K
1166L
260
1020E
389
907D
504
778E
633
663E
762
534L
875
405F
1022
292G
1080
145S
1166
88
y ions
b ions
100
0250 500 750 1000
y2 y3 y4
y5
y6
y7
b3b4 b5 b8 b9
[M+2H]2+
b6 b7 y9y8
m/z
% In
tens
ity
KLEDEELFGS
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
What’s the problem?
-HN-CH-CO-NH-CH-CO-NH-
Ri CH-R’
bibi+1
yn-iyn-i-1
R”i+1
i+1
Peptide fragmentation possibilities (ion types)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
What’s the problem?
-HN-CH-CO-NH-CH-CO-NH-
Ri CH-R’ai
bici
xn-iyn-i
zn-i
yn-i-1
bi+1R”
di+1
vn-i wn-i
i+1
i+1
low energy fragments high energy fragments
Peptide fragmentation possibilities (ion types)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Identification – SCOPE
Score P-values sequence
IDidentified fragments
Vineet Bafna, Nathan Edwards, Proc. ISMB 2001
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Overview
• Motivation for OpenMS• Bioinformatics Issues in quantitative
Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
OpenMS
• Software framework for shotgun proteomics• ISO/ANSI C++ compliant• Features
– Open Source (LGPL)– Data structures & algorithms
• d-dimensional core• Signal processing for raw data conversion
– Data import/export in standard formats– Database storage of all data
(using an extended PEDRO/MIAPE model)– DB-driven workflow management– Visualization for spectra, maps
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
MDI Window
1D Window
SpecView – 1D Window
DB
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
SpecView – 3D Window
• Visualization of raw, stick, and feature maps
• Rapid interactivezoom in/out through efficientquad treeimplementation
• Data import from files or direct visualization from the DB
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Clustering/ID View
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Visualization Code Example// create application with a single widgetQApplication app(argc, argv);Spectrum1DWidget* wid = new Spectrum1DWidget;app.setMainWidget(wid);
// load data from DBDBAdapter dba;dba.connectDB("OpenMS", "<user>", "<password>");dba.executeQuery("SELECT id FROM PeakList … ", true);wid->setDataSet(dba.objects()[0]);
// execute applicationwid->show();return app.exec();
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Acknowledgements
Dr. Clemens GröplEva Lange, Tim Conrad,Ole Schulz-Trieglaff(Algorithmische Bioinformatik,FU Berlin)
Prof. Hartmut Schlüter(Universitätsmedizin Berlin, Charité)
Prof. Dr. Oliver KohlbacherMarc Sturm, Andreas BertschJens Joachim(SBS/WSI, Tübingen)
Andreas Hildebrandt(Uni Saarbrücken)
Prof. Dr. Christian HuberBettina Mayr et al.(Instr. Analytik & Bioanalytik,Univ. des Saarlandes, Saarbrücken)
Dr. Albert Sickmann(Virchow-Zentrum, Würzburg)
Herbert ThieleJens Decker(Bruker Daltonics, Bremen)
Dr. Christoph Klein(IRMM, Geel; now IHCP Ispra)
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Überblick Assemblierung
1. Überblick : DNA Sequenzierung
2. Überblick : BAC-by-BAC Assembler
3. Überblick : WGS Assembler4. Überblick :
Compartmentalized Assembler5. Details der CSA Pipeline
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
Danke für die Aufmerksamkeit !!!
Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik
DNA DNA SequenzSequenz
GRGRÖÖSSENAUSWAHLSSENAUSWAHL(LIBRARY)(LIBRARY)
AUFBRECHENAUFBRECHEN
Shotgun DNA Sequenzierung (Technologie)
CLONENCLONEN
VectorVector
End Reads(Mates)End Reads(Mates)
SEQUENZIERENSEQUENZIEREN
PrimerPrimer
DNA Sequenzierung