Post on 15-Jan-2016
PEAKS: De Novo Sequencing using MS/MS spectra
Bin Ma, U. Western Ontario, Canada
Kaizhong Zhang, U. Western Ontario, Canada
Chengzhi Liang, Bioinformatics Solutions Inc. Canada
Outline
• Background – Tandem Mass Spectrometry
• De novo sequencing – Problem Definition and Algorithm
• Software implementation – PEAKS
• Future work
Background
• Humans have roughly 100,000 different proteins. Because of post-translational modifications, each protein can exist in many different versions.
• Diseases are closely related to abnormal proteins or to the expression levels of proteins.
• Given a tissue, identifying the proteins (and their modified versions) in it is a fundamental problem for drug design.
Proteins and Peptides
• A protein is a sequence of 20 different types of amino acids.
– A protein is a string over an alphabet of size 20.
• A peptide is a substring of the protein.
• The 20 amino acids have 19 distinct masses.
– I and L have the same mass and are difficult to distinguish by MS/MS.
– Regard them as the same letter.
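The "20 letters, 19 masses" point can be checked with a small sketch. The table holds the standard monoisotopic residue masses in daltons; the names `RESIDUE_MASS`, `canonical` and `peptide_mass` are chosen here for illustration and are not taken from PEAKS.

```python
# Standard monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def canonical(peptide: str) -> str:
    """Treat I and L as the same letter by writing every I as L."""
    return peptide.replace("I", "L")


def peptide_mass(peptide: str) -> float:
    """Sum of the residue masses of the peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide)


# 20 letters but only 19 distinct masses, because I and L coincide.
distinct_masses = {round(mass, 5) for mass in RESIDUE_MASS.values()}
```

Since `peptide_mass("IFVQK")` and `peptide_mass("LFVQK")` are equal, a mass-based method can only report the canonical form.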
Tandem Mass Spectrometry
• MS/MS is the only reliable way for protein identification.
[Figure: protein identification workflow. tissue → fraction → gel → protein (…VITK | GTDIMNEMR | SMW…) → peptide → tandem mass spectrometer → MS/MS spectrum → peptide sequence LGSSEVEQVQLVVDGVK, obtained by de novo sequencing or from a database.]
How Does a Peptide Fragment?
For a peptide A1 A2 A3 A4:
m(y1) = 19 + m(A4)
m(y2) = 19 + m(A4) + m(A3)
m(y3) = 19 + m(A4) + m(A3) + m(A2)
m(b1) = 1 + m(A1)
m(b2) = 1 + m(A1) + m(A2)
m(b3) = 1 + m(A1) + m(A2) + m(A3)
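A minimal sketch of these formulas, using integer (nominal) residue masses and the offsets 1 and 19 from the slide. The table `NOMINAL_MASS` and the function names are illustrative assumptions.

```python
# Nominal (integer) residue masses for a few amino acids; illustration only.
NOMINAL_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99,
                "L": 113, "K": 128, "E": 129, "R": 156}


def b_ion_masses(peptide: str) -> list:
    """m(b_i) = 1 + mass of the first i residues, for i = 1 .. n-1."""
    masses, total = [], 0
    for aa in peptide[:-1]:
        total += NOMINAL_MASS[aa]
        masses.append(1 + total)
    return masses


def y_ion_masses(peptide: str) -> list:
    """m(y_i) = 19 + mass of the last i residues, for i = 1 .. n-1."""
    masses, total = [], 0
    for aa in reversed(peptide[1:]):
        total += NOMINAL_MASS[aa]
        masses.append(19 + total)
    return masses
```

For the peptide PAK this gives b-ions at 98 and 169 and y-ions at 147 and 218. Each complementary pair (b1 with y2, b2 with y1) sums to m(P) + 20 = 316, which is why a prefix and the matching suffix cannot be scored independently.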
Matching Sequence with Spectrum
• For any peptide P = a1…an, m(P) = Σi m(ai).
• De Novo Sequencing
– Given a spectrum and a mass value m, compute a sequence P such that m(P) = m and the matching score score(P) is maximized.
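The problem statement can be made concrete with a brute-force sketch. This is not the PEAKS algorithm: it enumerates every sequence of the right mass, so it only works on toy inputs. The scoring function is an assumption made for illustration (the summed intensity of the peaks hit by any b- or y-ion, each peak counted once), as are the five-letter alphabet and the integer masses.

```python
from itertools import accumulate

# Toy alphabet with nominal residue masses; an assumption for illustration.
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}


def ion_masses(peptide: str) -> set:
    """Masses of all b- and y-ions of the peptide (offsets 1 and 19)."""
    prefix = list(accumulate(MASS[aa] for aa in peptide))
    total = prefix[-1]
    b_ions = {1 + p for p in prefix[:-1]}
    y_ions = {19 + total - p for p in prefix[:-1]}
    return b_ions | y_ions  # a set, so a shared peak is counted only once


def score(peptide: str, spectrum: dict) -> float:
    """Hypothetical matching score: summed intensity of the matched peaks."""
    return sum(spectrum.get(mass, 0.0) for mass in ion_masses(peptide))


def candidates(m: int, prefix: str = ""):
    """Yield every non-empty sequence whose residue masses sum to m."""
    if m == 0:
        if prefix:
            yield prefix
        return
    for aa, mass in MASS.items():
        if mass <= m:
            yield from candidates(m - mass, prefix + aa)


def brute_force_de_novo(spectrum: dict, m: int):
    """Return a sequence P with m(P) = m maximizing score(P), or None."""
    return max(candidates(m), key=lambda p: score(p, spectrum), default=None)
```

With a spectrum that contains exactly the six b- and y-ions of GAPS (total mass 312), the search returns GAPS. The number of candidates grows exponentially with m, which is the motivation for a dynamic programming formulation.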
De Novo Sequencing
A Simpler Case – Only Y-ions
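Y-ions Determined By a Suffix
[Figure: a peptide with its y-ions y1, y2, y3; each y-ion corresponds to a suffix of the peptide plus a mass offset of 19.]
score(Q) can be defined for a suffix Q.
$$DP(u) = \max_{m(Q)=u} \mathrm{score}(Q)$$
$$\mathrm{score}(LVR) = \mathrm{score}(VR) + f(u)$$
$$DP(u) = \max_{a} DP(u - m(a)) + f(u)$$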
Counting Both y and b ions
Strategies
• Consider a prefix R and a suffix Q simultaneously.
• Consider only those pairs (R,Q) that satisfy a nice property, which we call “chummy”.
• Chummy pairs have two useful properties:
– The score of a chummy pair can be computed recursively from a smaller chummy pair.
– There is a series of chummy pairs that grows to the optimal solution.
Dynamic Programming
• Combining Lemmas A and B, we can compute
$$DP(u,v) = \max_{\substack{(R,Q)\ \text{chummy} \\ m(R)=u,\ m(Q)=v}} \mathrm{score}(R,Q)$$
• Suppose (R,Q) is the pair maximizing DP(u,v) under the condition m(R)+m(Q)+m(a)=m. Then RaQ is the optimal peptide.
PEAKS – The Software
Comparison of PEAKS and Lutefisk (on the original slide, red marks the correct residues)

m/z | z | Correct Sequence | PEAKS (de novo) | Comments | Lutefisk (de novo)

MALDI MS/MS BSA
927.4 | 1 | YLYEIAR | YLYEIAR | correct | [276.14]EY[184.08]R
1439.7 | 1 | RHPEYAVSVLLR | GVLMVDVPPADNGR | Wrong (?) | No results
1479.8 | 1 | LGEYGFQNALIVR | LWYGFQNALIVR | correct | No results
1639.8 | 1 | KVPQVSTPTLVEVSR | RAPKVPQVSTPTLVEVSR | correct | No results

ESI MS/MS Cyt-c
482.7 | 2 | EDLIAYLK | EDLIAYLK | correct | [357.15]LAYLK
584.8 | 2 | TGPNLHGLFGR | TGPNLHGLFGR | correct | TGPNLHGLFGR
589.3 | 1 | GDVEK | VDVEK | V = Ac-G | VDVEK
634.4 | 1 | IFVQK | IFVQK | correct | IFVQK
678.3 | 1 | YIPGTK | YIPGTK | correct | YIPGTK
728.8 | 2 | TGQAPGFSYTDANK | TGQAPGFSYTDANK | correct | [199.10]SAPGF[250.09]TWNK
779.4 | 1 | MIFAGIK | MIFAGIK | correct | [244.12]FAGLK
792.9 | 2 | KTGQAPGFSYTDAMK | KTGAGAPGFSYTDAMK | almost | [229.15]QGAPGAYQNHANK
817.3 | 2 | IFVQKCAQCHTVEK | QFVTHMACCHTVEK | partial | [257.08][218.08][GP][260.08][HM]TVEK

Apo-Myoglobin
662.3 | 1 | ASEDLK | ASEDLK | correct | [244.07]SALK
689.9 | 2 | HGTVVLTALGGILK | HGTVVLTALGGILK | correct | HGTVVLTALG[170.1]LK
748.4 | 1 | ALELFR | ALELFR | correct | [184.12]ELFR
803.9 | 2 | VEADIAGHGQEVLIR | LDADIAGHGQEVLIR | almost | no results
908.4 | 2 | GLSDGEWQQVLNVWGK | GLSDGEWQQVLNVWGK | correct | [170.11]SG[244.07]WQQVLNVWGK
943.2 | 2 | YLEFISDAIIHVLHSK | YLEFISDAIIHVLHSK | correct | [276.1]EFLSD[184.12]LHVLHSK
Users
Implementation Particulars
• More accurate scoring:
– sum of the logarithmic intensities
– many other ion types
– coexisting ions, e.g., x2, y2, z2
• Deconvolution
– converting multiply-charged peaks to singly-charged ones
• Recalibration
– compress/stretch the spectrum for calibration error
• Noise reduction
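The deconvolution step follows from the definition of m/z: an ion of neutral mass M carrying z protons appears at (M + z·p)/z, where p is the proton mass. A minimal sketch of the conversion follows; the function name is an assumption and this is not the PEAKS implementation.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def to_singly_charged(mz: float, z: int) -> float:
    """Convert the m/z of a z-charged peak to the m/z of the same ion at charge 1.

    The neutral mass is M = z * mz - z * PROTON_MASS, and the singly-charged
    peak sits at M + PROTON_MASS.
    """
    return mz * z - (z - 1) * PROTON_MASS
```

For example, a doubly-charged peak at m/z 482.7 maps to about 964.39, and a singly-charged peak is left unchanged.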
Acknowledgement
• Bin Ma and Kaizhong Zhang were supported by NSERC.
• Chengzhi Liang was supported by BSI.
• Thanks to the development team at BSI for the software development.
Tandem Mass Spectrometer
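[Figure: tandem mass spectrometer. Peptides (MPSER, SG…, PAK) enter a mass analyzer as precursor ions; a precursor such as PAK+ is fragmented into fragment ions (for example P+, PA+, AK+, K+); a second mass analyzer and an ion detector record the fragments, and the result is passed to de novo sequencing.]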
Algorithm Sandwich
• DP(0,0) = 0; DP(u,v) = −∞ for (u,v) ≠ (0,0);
• for u from 1 to m/2 do
    for v from u − max_a m(a) to u + max_a m(a) do
      for a in Σ do
        if u < v then
          $DP(u+m(a), v) = \max\{DP(u+m(a), v),\ DP(u,v) + f(u,v)\}$
        else
          $DP(u, v+m(a)) = \max\{DP(u, v+m(a)),\ DP(u,v) + g(u,v)\}$
• find u, v, a such that u+v+m(a) = m and DP(u,v) is maximized;
• backtracking;
Dynamic Programming
1. for u from 0 to m
   $DP(u) = \max_{a} DP(u - m(a)) + f(u)$
2. backtracking
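A sketch of this one-dimensional recurrence, DP(u) = max over a of DP(u − m(a)) + f(u), for the y-ions-only case. The choice of f (the intensity of the peak at mass 19 + u, zero if absent), the integer masses and the toy alphabet are assumptions made for illustration.

```python
# Toy alphabet with nominal residue masses; an assumption for illustration.
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}


def de_novo_y_only(spectrum: dict, m: int):
    """Return (best score, peptide) for total residue mass m, or None."""
    neg_inf = float("-inf")
    dp = [neg_inf] * (m + 1)   # dp[u]: best score of a suffix of mass u
    first = [None] * (m + 1)   # first[u]: first letter of that best suffix
    dp[0] = 0.0
    for u in range(1, m + 1):
        f_u = spectrum.get(19 + u, 0.0)  # assumed f(u): y-ion peak intensity
        for aa, mass in MASS.items():
            if mass <= u and dp[u - mass] > neg_inf:
                candidate = dp[u - mass] + f_u
                if candidate > dp[u]:
                    dp[u], first[u] = candidate, aa
    if dp[m] == neg_inf:
        return None
    # Backtracking: going from u - m(a) to u prepends a to the suffix, so
    # walking down from m reads the peptide from left to right.
    letters, u = [], m
    while u > 0:
        letters.append(first[u])
        u -= MASS[first[u]]
    return dp[m], "".join(letters)
```

With y-ion peaks at 106, 203 and 274 (the y-ions of GAPS, total mass 312), the function returns a score of 3.0 and the sequence GAPS. The running time is proportional to m times the alphabet size.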
Dynamic Programming
$$DP(u,v) = \max_{\substack{R\ \text{is prefix},\ Q\ \text{is suffix} \\ m(R)=u,\ m(Q)=v}} \mathrm{score}(R,Q)$$
• We hope DP(u,v) for u+v=m gives the optimal prefix and suffix.
• The optimal solution can be obtained by concatenation of the prefix and suffix.
Chummy Pairs
• Two strings Ra and bQ are called a chummy pair iff either of the following two is true:
(C1) $m(R) + 1 < m(bQ) + 19 \le m(Ra) + 1$
(C2) $m(Q) + 19 \le m(Ra) + 1 < m(bQ) + 19$
• Examples:
– (LGE, LVR) is chummy by (C2)
– (LGE, VR) is chummy by (C1)
– (LGE, R) is chummy by (C1)
– (LG, VR) is not chummy
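The four examples can be checked mechanically. The sketch below assumes that (C1) reads m(R)+1 < m(bQ)+19 ≤ m(Ra)+1 and that (C2) reads m(Q)+19 ≤ m(Ra)+1 < m(bQ)+19. The strictness of each inequality is an assumption, as are the nominal integer masses and the function names.

```python
# Nominal residue masses for the letters used in the examples.
MASS = {"L": 113, "G": 57, "E": 129, "V": 99, "R": 156}


def mass(s: str) -> int:
    """Sum of residue masses; the empty string has mass 0."""
    return sum(MASS[c] for c in s)


def chummy_condition(prefix: str, suffix: str):
    """Return "C1", "C2" or None for a non-empty prefix Ra and suffix bQ."""
    r, q = prefix[:-1], suffix[1:]            # R and Q
    b_r, b_ra = mass(r) + 1, mass(prefix) + 1
    y_q, y_bq = mass(q) + 19, mass(suffix) + 19
    if b_r < y_bq <= b_ra:                    # assumed form of (C1)
        return "C1"
    if y_q <= b_ra < y_bq:                    # assumed form of (C2)
        return "C2"
    return None
```

Under these assumptions the function reproduces the labels of all four examples, for instance `chummy_condition("LGE", "LVR")` gives `"C2"` and `chummy_condition("LG", "VR")` gives `None`.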
Chummy pairs
• Lemma A – Suppose Ra and bQ are a chummy pair, with u=m(Ra) and v=m(bQ).
If (C1) is true,
$$\mathrm{score}(Ra, bQ) = \mathrm{score}(Ra, Q) + f(u,v)$$
If (C2) is true,
$$\mathrm{score}(Ra, bQ) = \mathrm{score}(R, bQ) + g(u,v)$$
Chummy Pairs
• Lemma B – Let P be the optimal solution. Then there is a chummy pair (R,Q) and a letter a such that P=RaQ. Also, there is a series of chummy pairs such that
$$(R_1,Q_1),\ \ldots,\ (R_n,Q_n) = (R,Q)$$