High throughput urine biomarker discovery and integrative analysis for translational medicine
Bruce Ling, Ph.D.
A molecular indicator of a specific biological property; a biochemical feature or facet that can be used to measure the progress of disease or the effects of treatment (NIH, 2002)
Biomarker
• Small molecules
– Glucose (diabetes)
– Serum cholesterol (cardiovascular disease)
• Proteins
– PSA (prostate cancer)
– HER2 (IHC) (breast cancer, Herceptin therapy)
– hCG (pregnancy test)
• RNA/DNA
– HER2 (FISH) (breast cancer)
– Oncotype DX (Genomic Health, breast cancer)
Biomarker examples
Pediatric Diseases
• Kidney transplant Acute Rejection
• Kawasaki Disease
• Systemic Juvenile Idiopathic Arthritis
• Necrotizing Enterocolitis
• Inflammatory Bowel Disease
• Glioblastoma multiforme
• Preterm Labor
Where to look for biomarkers
– Disease tissue
– Proximal/distal fluids
• Plasma/serum, urine, amniotic, synovial fluid, CSF, saliva, tears, etc.
Why Urine?
• Patient consenting
• Non-invasive
• Easy to collect for time course analysis
• Abundant and stable
Urine is a rich resource for biomarker discovery
• Filtration of plasma
– 900 liters daily
• Urine proteome
– > 1500 proteins, ~30 mg/day
– 30% from circulation
– 70% from urogenital tract
• Urine peptidome
– > 100,000 naturally occurring peptides, ~20 mg/day
1) Equal mass of protein and peptide in urine translates into at least a ten-fold greater molar abundance of peptides than proteins
2) Urine peptide analysis is not hampered by highly abundant protein issues
3) One hour one dimensional HPLC separation is sufficient for the analysis of greater than 100,000 urine peptides, allowing a high throughput biomarker discovery
Urine Peptidome: a fertile ground for biomarker discovery
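The first point above can be checked with back-of-envelope arithmetic. The sketch below uses the daily masses quoted on the earlier slide (~30 mg protein, ~20 mg peptide) and two average molecular weights that are assumptions made only for illustration (2 kDa per peptide, 30 kDa per protein); they are not values from the talk.

```python
# Rough molar comparison of urinary peptides and proteins.
# Daily masses are the approximate figures quoted on the slide;
# the average molecular weights are ASSUMED for illustration only.

def micromoles(mass_mg: float, avg_mw_da: float) -> float:
    """Convert a mass in mg to micromoles for a given average MW (g/mol)."""
    return mass_mg / avg_mw_da * 1000.0  # mg / (g/mol) = mmol; x1000 = umol

PEPTIDE_MG_PER_DAY = 20.0     # from the slide (~20 mg/day)
PROTEIN_MG_PER_DAY = 30.0     # from the slide (~30 mg/day)
ASSUMED_PEPTIDE_MW = 2_000.0  # hypothetical average peptide MW (Da)
ASSUMED_PROTEIN_MW = 30_000.0 # hypothetical average protein MW (Da)

peptide_umol = micromoles(PEPTIDE_MG_PER_DAY, ASSUMED_PEPTIDE_MW)
protein_umol = micromoles(PROTEIN_MG_PER_DAY, ASSUMED_PROTEIN_MW)
molar_ratio = peptide_umol / protein_umol

print(f"peptides: {peptide_umol:.1f} umol/day, proteins: {protein_umol:.1f} umol/day")
print(f"molar ratio peptides/proteins: {molar_ratio:.0f}x")
```

Under these assumed weights the ratio comes out at about ten-fold; a larger gap between average protein and peptide size would push it higher.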
Challenges of Urine Analysis
• Dilution factor causing concentration variations
– Solution: content normalization (creatinine; housekeeping urine abundant peptides; equal peptide mass)
• Peptide content can be complicated by
– Diet, exercise, circadian rhythm, circulatory levels of hormones
– Solution: careful experimental design to avoid these confounding issues, e.g.,
• Cohorts of patients of similar demographics
• Multi-center sample collection and validation
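Creatinine normalization, the first remedy listed above, can be sketched as follows. The function name, the units and the example values are hypothetical; the point is only that dividing each peptide signal by the sample's creatinine level removes a pure dilution effect.

```python
# Minimal sketch of creatinine normalization (names and values are hypothetical).

def normalize_to_creatinine(signals: dict, creatinine_mg_dl: float) -> dict:
    """Divide every peptide signal by the sample's creatinine concentration."""
    if creatinine_mg_dl <= 0:
        raise ValueError("creatinine must be positive")
    return {mz: s / creatinine_mg_dl for mz, s in signals.items()}

# Same made-up patient, one dilute and one concentrated urine sample.
dilute = {987.62: 200.0, 1027.51: 50.0}
concentrated = {987.62: 800.0, 1027.51: 200.0}

norm_dilute = normalize_to_creatinine(dilute, creatinine_mg_dl=50.0)
norm_concentrated = normalize_to_creatinine(concentrated, creatinine_mg_dl=200.0)

print(norm_dilute)
print(norm_concentrated)
```

After normalization the two samples give the same values, which is the behavior wanted when the only difference between them is dilution.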
[Figure: Urine Peptidome Profiling by Mass Spectrometry. Workflow labels: 5 ml urine; Centricon 10K filtration (retained: proteins >6K; filtrate: peptides <6K); C18 desalt (Sep-pak); ethyl acetate extraction (post ethyl acetate fraction); C18 HPLC, 30 seconds per fraction; collect on MALDI target plate; mass spectrometry analysis and protein identification]
Urine Peptidome Profiling by Mass Spectrometry
Biomarker HTS Flows
Sample peptides:
-Class 1:1,2,3…
-Class 2:1,2,3…
-Class 3:1,2,3…
RP-HPLC
Collect 120 fractions on MALDI plates
MALDI-TOF MS on each fraction
MASS-Conductor ®
Machine learning
feature discovery and classification
Candidate Biomarkers
987.62
1027.51
1098.55
etc.
Biomarker Confirmation/Validation
[Figure: confirmation/validation pipeline. Stage labels: Exploration, Testing, Validation. Other labels: identify differentiating markers; protein ID by MS/MS; new sample sets; new longitudinal sample sets; new center sample sets; higher-throughput quantitative methods (quantitative MS, immunoassay)]
Data Challenges in Urine Peptide Biomarker Discovery
• Data tracking and storage
– Patient demographics
– Peptide profiles in various fractions/samples
• Dimension reduction and data reduction
– Multi-dimensional data sets
– Huge data sets and lots of noise
A project of 40 samples produced 241.5 GB of raw data in a MySQL database
[Figure: data dimensions tracked for each measurement: patient ID, patient demographics, HPLC fraction, peptide mass, peptide signal]
Decode the Urine Peptidome
Patient 1 Patient 2 Patient 3 Patient 4 …
peptide 1 signal signal signal signal …
peptide 2 signal signal signal signal …
peptide 3 … … … … …
peptide 4 … … … … …
peptide 5 … … … … …
… … … … … …
peptide 100,000
… … … … …
???
Decode the Urine Peptidome
• Peak finding in each fraction for each sample
• Align the peaks across the samples
• Create common peak index
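The three steps above can be sketched in a few lines. This is a simplified illustration, not the MASS-Conductor implementation: peaks are taken as local maxima above a threshold, alignment is a greedy binning of m/z values within a tolerance, and all names, tolerances and intensities are made up.

```python
# Sketch: peak finding, cross-sample alignment, and a common peak index.
# Function names, tolerances and data are hypothetical.

def find_peaks(mz, intensity, min_height):
    """Return (m/z, intensity) for local maxima at or above min_height."""
    peaks = []
    for i in range(1, len(intensity) - 1):
        if (intensity[i] >= min_height
                and intensity[i] > intensity[i - 1]
                and intensity[i] >= intensity[i + 1]):
            peaks.append((mz[i], intensity[i]))
    return peaks

def build_peak_index(sample_peaks, tol):
    """Greedy alignment: sorted m/z values closer than tol share one bin."""
    all_mz = sorted(m for peaks in sample_peaks.values() for m, _ in peaks)
    bins = []
    for m in all_mz:
        if bins and m - bins[-1][-1] <= tol:
            bins[-1].append(m)
        else:
            bins.append([m])
    return [sum(b) / len(b) for b in bins]  # bin centers = common index

def to_matrix(sample_peaks, centers, tol):
    """One row per sample, one column per indexed peak (0.0 if absent)."""
    matrix = {}
    for sample, peaks in sample_peaks.items():
        row = [0.0] * len(centers)
        for m, inten in peaks:
            j = min(range(len(centers)), key=lambda k: abs(centers[k] - m))
            if abs(centers[j] - m) <= tol:
                row[j] = max(row[j], inten)
        matrix[sample] = row
    return matrix

mz_1 = [987.0, 987.3, 987.6, 987.9, 1027.2, 1027.5, 1027.8]
int_1 = [0.0, 10.0, 120.0, 8.0, 5.0, 60.0, 4.0]
mz_2 = [987.1, 987.4, 987.7, 988.0, 1027.3, 1027.6, 1027.9]  # slight m/z shift
int_2 = [0.0, 5.0, 90.0, 7.0, 3.0, 2.0, 1.0]

sample_peaks = {
    "patient_1": find_peaks(mz_1, int_1, min_height=50.0),
    "patient_2": find_peaks(mz_2, int_2, min_height=50.0),
}
centers = build_peak_index(sample_peaks, tol=0.2)
matrix = to_matrix(sample_peaks, centers, tol=0.2)
print(centers)
print(matrix)
```

The resulting matrix has the peptide-by-patient layout shown on the preceding slide, with a zero where a sample has no peak in a given bin.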
Data mining issues in Biomarker Discovery
• Peak number >> sample number
• False discovery in multiple hypothesis testing
• Multi-class classification and validation
• Discovery of biomarker signature
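When the number of peaks far exceeds the number of samples, an uncorrected p < 0.05 cutoff admits many false positives. The sketch below uses the Benjamini-Hochberg step-up procedure as a generic stand-in for false discovery control; it is not stated in the talk that this particular procedure was used, and the p-values are invented.

```python
# Benjamini-Hochberg step-up procedure (stand-in for FDR control; the
# p-values below are made up for illustration).

def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank  # largest rank passing its threshold
    return sorted(order[:cutoff_rank])

p_values = [0.001, 0.008, 0.039, 0.041, 0.6, 0.9]
naive_hits = [i for i, p in enumerate(p_values) if p < 0.05]
bh_hits = benjamini_hochberg(p_values, alpha=0.05)
print("uncorrected:", naive_hits)
print("BH-corrected:", bh_hits)
```

In this toy case four peaks pass the uncorrected cutoff but only two survive the correction.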
• Robust loading and tracking of high-volume proteomic data
• Robust reduction of raw data sets, enabling efficient and accurate peak finding, alignment and indexing
• Robust and automatic high-throughput computing for expensive algorithms
• Integration of FDR analysis and multi-class classification algorithms to obtain statistically differentiating feature panels
• Automatic generation of data reports with graphics
MASS-Conductor® Platform Supports Urine Peptide Biomarker Discovery
MASS-Conductor® Platform High Throughput Computing
Urine Biomarker Discovery: Case Study
Integrative Urinary Peptidomics in Renal Transplantation Identifies Novel Biomarkers for Acute Rejection
Xuefeng B. Ling (2)*, Tara K. Sigdel (1)*, Kenneth Lau (2), Lihua Ying (1), Irwin Lau (2), James Schilling (2)¥, Minnie M. Sarwal (1)¥
Divisions of (1) Nephrology and Department of Pediatrics, (2) Biotechnology Core, Stanford University School of Medicine, Stanford University, Stanford, CA 94305
Kidney Transplant Rejection
• Most effective treatment for end stage renal disease
• 16,000 per year in US
• Grafts monitored by biopsy
• Unmet needs:
– Less invasive and more frequent monitoring
– Acute rejection vs. stable graft
– Acute rejection vs. BK virus
Allograft Acute Rejection Urine Biomarker Discovery
[Figure: four-step analysis workflow. LCMS data reduction (peak finding, peak alignment, peak indexing); supervised data mining (feature selection, training, testing); unsupervised data mining (2D clustering); quantitative LCMS validation]
Biomarker Panel: Supervised Analysis
Biomarker Panel: Unsupervised Analysis
[Figure: THP domain map from NH2 to COOH: EGF-like domain I (28-64), EGF-like domain II (65-107), EGF-like domain III (108-149), ZP domain (334-585)]
Urine THP Peptide Biomarkers Fall into a Tight Cluster in C-Terminus
1. R.VLNLGPITR.K
2. G.SVIDQSRVLNLGPI.T
3. I.DQSRVLNLGPITR.K
4. R.SGSVIDQSRVLNLGPI.T
5. S.VIDQSRVLNLGPITR.K
6. R.SGSVIDQSRVLNLGPIT.R
7. G.SVIDQSRVLNLGPITR.K
8. R.SGSVIDQSRVLNLGPITR.K
MRM: Multiplexed Quantitative Biomarker Validation
[Figure: ROC curves (sensitivity versus 1 - specificity) for two urine THP peptides, THP 1680.98 (VIDQSRVLNLGPITR) and THP 1912.07 (SGSVIDQSRVLNLGPITR), in two panels: AR versus STA and AR versus BK. AUC values as labeled: 0.83, 0.74, 0.92, 0.83]
ROC Analysis of THP Peptide Biomarkers Quantified by MRM
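An AUC like those reported here can be computed directly from two groups of measurements as the probability that a randomly chosen case scores higher than a randomly chosen control. The sketch below uses invented values, not the MRM data from the study, and assumes the marker is higher in cases; for a marker that decreases in disease the caller would negate the values first.

```python
# AUC via pairwise comparison (equivalent to the Mann-Whitney U statistic
# divided by n_cases * n_controls). All values are made up.

def roc_auc(case_values, control_values):
    """Probability that a random case exceeds a random control (ties = 1/2)."""
    wins = 0.0
    for c in case_values:
        for k in control_values:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))

cases = [5.1, 4.2, 3.9, 6.0]     # hypothetical marker levels, group 1
controls = [2.0, 4.0, 3.0, 4.5]  # hypothetical marker levels, group 2

auc = roc_auc(cases, controls)
print(f"AUC = {auc:.4f}")
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of the two groups.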
A. Collagen peptides
1. COL1A1 1235.56 APGDRGEPGPPGP
2. COL1A1 1251.55 APGDRGEPGPPGP
3. COL1A1 1322.57 APGDRGEPGPPGPA
4. COL1A1 1316.59 DAGPVGPPGPPGPPG
5. COL1A1 1409.66 GPPGPPGPPGPPGPPS
6. COL1A1 2048.92 NGDDGEAGKPGRPGERGPPGP
7. COL1A1 2064.91 NGDDGEAGKPGRPGERGPPGP
8. COL1A1 2192.97 NGDDGEAGKPGRPGERGPPGPQ
9. COL1A1 2362.12 GKNGDDGEAGKPGRPGERGPPGPQ
10. COL1A1 2378.10 GKNGDDGEAGKPGRPGERGPPGPQ
11. COL1A1 2645.24 GPPGKNGDDGEAGKPGRPGERGPPGPQ
12. COL1A1 1709.79 PPGEAGKPGEQGVPGDLG
13. COL1A1 2031.95 PPGEAGKPGEQGVPGDLGAPGP
14. COL1A1 2221.97 ADGQPGAKGEPGDAGAKGDAGPPGP
15. COL1A1 2205.99 ADGQPGAKGEPGDAGAKGDAGPPGP
16. COL1A1 2277.01 ADGQPGAKGEPGDAGAKGDAGPPGPA
17. COL1A1 2293.01 ADGQPGAKGEPGDAGAKGDAGPPGPA
18. COL1A1 2617.15 GPPGADGQPGAKGEPGDAGAKGDAGPPGPA
19. COL1A1 2086.93 EGSPGRDGSPGAKGDRGETGPA
20. COL1A1 2157.96 AEGSPGRDGSPGAKGDRGETGPA
21. COL1A1 3014.41 ESGREGAPGAEGSPGRDGSPGAKGDRGETGPA
22. COL1A1 1266.58 SPGPDGKTGPPGPA
23. COL1A1 2129.99 DGKTGPPGPAGQDGRPGPPGPPG
24. COL1A1 2017.93 GRPGEVGPPGPPGPAGEKGSPG
25. COL1A2 2081.94 DGPPGRDGQPGHKGERGYPG
26. COL1A2 2195.99 NDGPPGRDGQPGHKGERGYPG
27. COL2A1 1861.85 SNGNPGPPGPPGPSGKDGPK
28. COL3A1 1738.76 NDGAPGKNGERGGPGGPGP
29. COL3A1 2008.93 DGESGRPGRPGERGLPGPPG
30. COL3A1 2079.92 DAGAPGAPGGKGDAGAPGERGPPG
31. COL3A1 2565.18 GAPGQNGEPGGKGERGAPGEKGEGGPPG
32. COL3A1 2743.24 KNGETGPQGPPGPTGPGGDKGDTGPPGPQG
33. COL4A1 1424.66 PGQQGNPGAQGLPGP
34. COL4A2 1126.51 GLPGLPGPKGFA
35. COL4A3 1161.52 GEPGPPGPPGNLG
36. COL4A4 1218.55 GLPGPPGPKGPRG
37. COL4A5 1144.52 GPPGPPGPLGPLG
38. COL4A5 1269.53 PGLDGMKGDPGLP
39. COL4A5 1733.76 GIKGEKGNPGQPGLPGLP
40. COL4A6 1158.52 GLPGPPGPPGPPS
41. COL5A1 1748.82 KGPQGKPGLAGMPGANGPP
42. COL7A1 1690.80 PGLPGQVGETGKPGAPGR
43. COL9A1 1732.84 KRPDSGATGLPGRPGPPG
44. COL11A1 1441.64 GPPGPPGLPGPQGPKG
45. COL11A1 1828.84 DGPPGPPGERGPQGPQGPV
46. COL17A1 1368.62 LPGPPGPPGSFLSN
47. COL18A1 1142.51 GPPGPPGPPGPPS
B. THP peptides
1. THP 982.59 VLNLGPITR
2. THP 1047.48 SGSVIDQSRV
3. THP 1211.66 DQSRVLNLGPI
4. THP 1225.69 SRVLNLGPITR
5. THP 1324.76 IDQSRVLNLGPI
6. THP 1423.83 VIDQSRVLNLGPI
7. THP 1468.82 DQSRVLNLGPITR
8. THP 1510.87 SVIDQSRVLNLGPI
9. THP 1567.91 GSVIDQSRVLNLGPI
10. THP 1581.91 IDQSRVLNLGPITR
11. THP 1654.91 SGSVIDQSRVLNLGPI
12. THP 1680.98 VIDQSRVLNLGPITR
13. THP 1755.96 SGSVIDQSRVLNLGPIT
14. THP 1768.01 SVIDQSRVLNLGPITR
15. THP 1912.07 SGSVIDQSRVLNLGPITR
16. THP 2040.16 SGSVIDQSRVLNLGPITRK
AR Urine Biomarkers are Collagen and THP Peptides
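The masses listed with each peptide are consistent with singly protonated monoisotopic masses (MH+). As a sanity check, the sketch below recomputes MH+ from sequence using standard monoisotopic residue masses. It ignores all modifications, so it is only expected to match unmodified peptides such as the THP entries; collagen entries that share a sequence but differ by about 16 Da would need a hydroxylation term (about +15.995 Da), which is an interpretation not stated on the slide.

```python
# Monoisotopic MH+ of an unmodified peptide from its sequence.
# Residue masses are standard monoisotopic values (Da).

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the free N- and C-termini
PROTON = 1.00728   # charge carrier for the singly protonated ion

def mh_plus(sequence: str) -> float:
    """Monoisotopic mass of the singly protonated, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON

for listed, seq in [(982.59, "VLNLGPITR"), (1680.98, "VIDQSRVLNLGPITR")]:
    print(f"{seq}: computed {mh_plus(seq):.2f}, listed {listed:.2f}")
```

Both computed values fall within a few hundredths of a dalton of the listed masses.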
[Panel labels: collagen peptide biomarkers; THP peptide biomarkers]
Hypothesis 1: Gene expression alteration in AR
Hypothesis 2: Protease expression alteration in AR
Hypothesis 3: Protease inhibitor expression alteration in AR
Hypothesis of Molecular Mechanisms for AR Urine Biomarkers
[Figure: three-stage study design]
1. Exploration data set [6] (TGCG): Affymetrix HG-U95Av2
(AR: PBL, n=6; BX, n=7) (STA: PBL, n=9; BX, n=10) (NR: PBL, n=8; BX, n=5) (HC: PBL, n=8; BX, n=9)
Exploration Analysis
2. Confirmation data set (Stanford): Affymetrix HU-133
(AR: BX, n=37) (HC: BX, n=23)
Confirmation Analysis
3. Validation data set (Stanford): Quantitative RT-PCR
(AR: BX, n=14) (STA: BX, n=10) (HC: BX, n=10)
Validation Analysis
Analyses at each stage:
– Expression analysis of the peptide biomarkers’ corresponding precursor genes
– Expression analysis of metzincin superfamily genes
– Expression analysis of protease inhibitor genes
– Discovery of mechanism biomarkers
Transcriptome Analysis of Allograft Biopsies
Parental Protein Expression Analysis of Allograft Biopsies Contrasting Urine Peptide Biomarker Changes
Genome-wide Protease and Protease Inhibitor Expression Analysis of Allograft Biopsies Revealed Up Regulation of MMP7, SERPING1, TIMP1
[Figure: signal intensity (0 to 50) of TIMP1, COL1A2, UMOD, SERPING1, MMP7 and COL3A1 in AR, STA and HC biopsies, alongside a ROC curve (sensitivity versus 1 - specificity) with mean AUC: 0.98]
Allograft Biopsies Expression Biomarkers Effectively Classified AR
Proposed Underlying Mechanisms for the AR Urine Peptide Biomarkers
Hypothesis: Collagen Breakdown and Deposition in AR
Decreased Collagen Peptides In AR
Increased TIMP1 (Collagenase Inhibitor) in AR
Increased Collagen Deposition in AR
More Graft Fibrosis After an AR episode?
Biopsy Gene Expression (GSE14328)
Increased MMP7 in AR
Decreased Collagen Breakdown in AR
Decreased Collagenase Activity in AR tissue
Increased Collagen Expression in AR
Integrated Analysis Urine Peptidomics
Urine
Renal Biopsy
Urine Peptide Analysis by MS
Urine Biomarker Discovery: Case Study
Ensemble Analyses of Urine Peptide Profiles with Clinical Findings Sufficiently Predict Pediatric Necrotizing Enterocolitis Outcomes
Running title: NEC peptide biomarkers
Xuefeng B. Ling (1), Kenneth Lau (1), Roger Lu (1), Gigi Liu (1), Harvey Cohen (1), James Schilling (1), Karl G. Sylvester (1)¥
(1) Department of Pediatrics, Stanford University, Stanford, CA 94305
Unmet Medical Needs in Necrotizing Enterocolitis
Necrotizing enterocolitis (NEC) is a medical condition seen primarily in premature infants, in which portions of the bowel undergo necrosis (tissue death).
Despite decades of research, the pathogenesis of NEC remains obscure, the diagnostic parameters are unclear, and both treatment and prevention strategies remain inadequate and dated.
There is a real need for better molecular identification of NEC to help alter its onset and progression.
Clinical parameters do not adequately predict outcome in Necrotizing Enterocolitis
[Figure: rate of NEC-S occurrence (% patients, 0 to 30) versus NEC score (-10 to 40), with patients stratified into Low, Intermediate and High Risk Groups. Group counts as labeled on the plot: M: n = 2, S: n = 15; M: n = 16, S: n = 10; M: n = 26, S: n = 0]
Clinical Parameters Based Model stratifies Necrotizing Enterocolitis Patients
NEC Urine Naturally Occurring Peptide Biomarker Discovery
[Figure: three-step analysis workflow. LCMS data reduction (peak finding, peak alignment, peak indexing); supervised data mining (feature selection, training, testing); unsupervised data mining (2D clustering)]
Biomarker Panel: Supervised Analysis (Training and Testing)
Biomarker Panel: Unsupervised Analysis
Biomarker Panel: Combined data set and ROC analysis
Permutation based FDR analysis of the biomarker signature
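A permutation-based FDR estimate can be sketched as follows: class labels are shuffled many times, and the average number of features that pass the selection threshold under shuffled labels is compared with the number passing under the true labels. The statistic (absolute difference of class means), the threshold and the data are all hypothetical choices for illustration; the talk does not specify the statistic used.

```python
# Permutation-based FDR estimate for a feature panel (illustrative sketch;
# statistic, threshold and data are made up).
import random
from statistics import mean

def abs_mean_diff(values, labels):
    """Absolute difference between the class-1 and class-0 means."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(mean(a) - mean(b))

def permutation_fdr(features, labels, threshold, n_perm=200, seed=0):
    """Estimated FDR = mean null hits per permutation / observed hits."""
    rng = random.Random(seed)
    observed = sum(abs_mean_diff(f, labels) >= threshold for f in features)
    if observed == 0:
        return 0.0
    shuffled = list(labels)
    null_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between labels and signals
        null_counts.append(
            sum(abs_mean_diff(f, shuffled) >= threshold for f in features))
    return min(1.0, mean(null_counts) / observed)

labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = [
    [10, 11, 9, 10, 1, 2, 1, 2],  # clearly differential feature
    [5, 6, 5, 6, 5, 6, 5, 6],     # noise
    [3, 4, 4, 3, 4, 3, 3, 4],     # noise
]
fdr = permutation_fdr(features, labels, threshold=6.0)
print(f"estimated FDR: {fdr:.3f}")
```

A low estimate indicates that shuffled labels rarely reproduce a signal as strong as the observed one.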
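[Figure/table: discovery set, n = 34 (clinical diagnosis: 17 M, 17 S), split by NEC risk group from Medical NEC Scoring and then classified with urine peptides.
Risk group | Clinical diagnosis (M, S) | Urine peptide classification: classified as M (M, S); classified as S (M, S) | Percent agreement with clinical diagnosis (+, -, overall)
Low, n=7 | 7, 0 | 7, 0; 0, 0 | 100 %, 100 %, 100 %
Intermediate, n=15 | 9, 6 | 8, 1; 1, 5 | 88.9 %, 83.3 %, 86.1 %
High, n=9 | 0, 9 | 0, 0; 0, 9 | 100 %, 100 %, 100 %
N/A, n=3
A second panel labeled "Clinical Diagnosis" (rows "Diagnosed as M" and "Diagnosed as S") carries the counts 7, 0 / 0, 0; 4, 3 / 5, 3; 0, 1 / 0, 8, with P = 0.01]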
Proposed Ensemble Approach to Diagnose Necrotizing Enterocolitis Patients
NEC Patients
Clinical Model
NEC Risk
Urine Biomarkers
NEC Diagnosis
TABLE 2
Cluster | Protein | Location | MH+ | Sequence | Relative abundance (M) | Relative abundance (S) | U test P value
1 | COL1A1 | 220-249 | 2924.41 | RGppGPPGKNGDDGEAGKPGRPGERGPpGp | 0.2562 | -0.2562 | 4.25E-03
1 | COL1A1 | 220-249 | 2940.36 | RGPPGppGKNGDDGEAGKpGRpGERGpPGP | 0.2541 | -0.2541 | 6.80E-03
2 | COL1A2 | 485-514 | 2889.36 | ARGEPGNIGFPGPKGPTGDPGKNGDKGHAG | 0.2265 | -0.2265 | 8.93E-05
3 | COL1A2 | 925-952 | 2865.31 | GRDGNpGNDGpPGRDGQpGHKGERGYpG | 0.2919 | -0.2919 | 1.99E-03
3 | COL1A2 | 933-952 | 2081.94 | DGpPGRDGQpGHKGERGYpG | 0.2655 | -0.2655 | 5.39E-03
4 | COL1A2 | 135-157 | 2229.06 | AGpPGKAGEDGHpGKPGRpGERG | 0.2732 | -0.2732 | 1.45E-02
4 | COL1A2 | 131-157 | 2626.27 | ARGpAGpPGKAGEDGHpGKPGRpGERG | 0.223 | -0.223 | 2.16E-02
4 | COL1A2 | 131-157 | 2642.28 | ARGpAGpPGKAGEDGHpGKpGRpGERG | 0.2016 | -0.2016 | 3.14E-02
4 | COL1A2 | 137-157 | 2142.05 | GpPGKAGEDGHPGKPGRpGERG | 0.2624 | -0.2624 | 1.06E-02
4 | COL1A2 | 131-157 | 2158.03 | GPpGKAGEDGHpGKPGRpGERG | 0.3038 | -0.3038 | 2.16E-02
5 | COL3A1 | 813-840 | 2565.18 | GApGQNGEPGGKGERGApGEKGEGGpPG | 0.2623 | -0.2623 | 2.58E-03
6 | COL3A1 | 1168-1194 | 2680.19 | NRGERGSEGSPGHPGQpGppGppGAPGP | -0.2382 | 0.2382 | 1.06E-02
6 | COL3A1 | 1168-1194 | 2696.22 | NRGERGSEGSpGHpGQpGPPGPpGApGp | 0.1893 | -0.1893 | 1.96E-02
Overlapping Urine Peptide Biomarkers for NEC
Proposed Underlying Mechanisms of Urine Naturally Occurring Peptide Biomarkers
[Figure: panels A, B and C; y-axis from 0.00 to 1.20; x-axis samples 1 to 13; treatment labels Enbrel and Anakinra; response labels PR and CR]
Prediction of drug response in SJIA
Urine peptide biomarkers: the discovery process
Sample peptides:
-Class 1:1,2,3…
-Class 2:1,2,3…
-Class 3:1,2,3…
SCX/RP-HPLC
Collect 100 fractions on MALDI plates
MALDI-TOF MS
for each sample
LC fraction -- m/z -- abundance
MASS-Conductor ®
Machine learning
feature discovery and classification
Biomarker panels
MS/MS protein ID
Prospective validation with quantitative mass spec (MRM)
Interdisciplinary Skills for Biomarker Discovery
Biology
Analytic biochemistry
Biostatistics
Computer Science
Medicine
Q & A
Genome vs. Proteome
The Isotope Envelope
MASS-Conductor Urine biomarker discovery and testing
1. Training set (10 AR, 10 STA, 6 BK): LCMS raw spectra; peak finding, peak alignment, feature extraction; 20937 unique features
2. Predictor discovery in training set: classifier training; six-fold cross-validation; classify AR, STA, BK
3. Predictor confirmation in testing set: testing set (10 AR, 10 STA, 4 BK); predictor sets; linear discriminant analysis (LDA); calculate estimates of predicted class probabilities; analysis of goodness of class separation
4. Pattern analysis in all set: all set (20 AR, 20 STA, 10 BK, 10 NS, 10 HC); predictors of 40 peptides; remove background signals; normalization; 2D hierarchical clustering; heatmap plotting; cluster analysis
5. Platform validation: 2 peptide biomarkers; MRM assay development; MRM assay on AR, STA, BK, NS, HC (training + testing samples); correlation analysis of LC-MALDI versus MRM
Allograft Acute Rejection Urine Biomarker Discovery
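The six-fold cross-validation step in this workflow can be illustrated with a small, dependency-free sketch. To keep it short, a nearest-centroid classifier is swapped in for the LDA named on the slide, and the two-dimensional feature values are synthetic; only the class sizes (10 AR, 10 STA, 6 BK) are taken from the training set described there.

```python
# Stratified six-fold cross-validation with a nearest-centroid classifier.
# The classifier is a stand-in for LDA and the features are synthetic.
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(k)]
    offset = 0
    for lab in sorted(by_class):
        for j, i in enumerate(by_class[lab]):
            folds[(offset + j) % k].append(i)
        offset += len(by_class[lab])
    return folds

def nearest_centroid_predict(train_x, train_y, point):
    """Predict the class whose training centroid is closest to point."""
    sums = defaultdict(lambda: [0.0] * len(point))
    counts = defaultdict(int)
    for row, lab in zip(train_x, train_y):
        counts[lab] += 1
        for d, v in enumerate(row):
            sums[lab][d] += v
    def sq_dist(lab):
        return sum((sums[lab][d] / counts[lab] - point[d]) ** 2
                   for d in range(len(point)))
    return min(sorted(counts), key=sq_dist)

def cross_validate(x, y, k=6):
    """Fraction of held-out samples classified correctly over k folds."""
    correct = 0
    for fold in stratified_folds(y, k):
        held_out = set(fold)
        train_x = [r for i, r in enumerate(x) if i not in held_out]
        train_y = [lab for i, lab in enumerate(y) if i not in held_out]
        for i in fold:
            correct += nearest_centroid_predict(train_x, train_y, x[i]) == y[i]
    return correct / len(y)

def make_class(cx, cy, n):
    """Synthetic, well-separated points around (cx, cy)."""
    return [(cx + 0.1 * (i % 3), cy - 0.1 * (i % 2)) for i in range(n)]

x = make_class(5.0, 0.0, 10) + make_class(0.0, 5.0, 10) + make_class(5.0, 5.0, 6)
y = ["AR"] * 10 + ["STA"] * 10 + ["BK"] * 6
accuracy = cross_validate(x, y, k=6)
print(f"six-fold cross-validated accuracy: {accuracy:.2f}")
```

Stratifying the folds keeps every class represented in each training split, which matters for the small BK group.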
Correlation Studies Between LCMS and MRM Platforms
Analytical Challenges: High complexity and wide dynamic range
Tirumalai, R. S. (2003) Mol. Cell. Proteomics 2: 1096-1103
Plasma Proteins
Big Trees Bushes
Grass + Bugs
www.genwaybio.com
Analytical Challenges: Detect low abundance proteins
Big Trees = HAP (high-abundance proteins)
Bushes = MAP (medium-abundance proteins)
Grass + Bugs = LAP (low-abundance proteins)
Bottom up LCMS Biomarker Discovery
[Figure: bottom-up LCMS workflow. Labels: sample preparation; protein mixture; digestion; digested peptides; peptide purification; multi-dimensional chromatography (SCX, RP); mass spec; spectra; data analysis; MS/MS protein ID]
Mass Spectrometry In A Nutshell
[Figure: MS schematic. Labels: ion source (hν), mass analyzer (F = ma, time), detector, MS spectrum (m/z)]
MS/MS Peptide Sequencing
[Figure: MS/MS schematic. Labels: source (hν), 1st mass analyzer, gate, collision cell, fragment ions, 2nd mass analyzer, detector, MS/MS spectrum]
Differential Expression Analysis in Quantitative LCMS
MS based
– Peptide 1: m/z; Peptide 2: m/z’; Peptide 3: m/z’’
– MASS-Conductor® exhaustive MS comparison
MS/MS based
– Peptide 1: protein ID; Peptide 2: protein ID’; Peptide 3: protein ID’’
– Spectrum counting
– Labeling, e.g. iTRAQ
Qualitative Comparative Analysis– Spectrum Counting
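[Figure: spectrum counting for PROTEIN X in Sample A versus Sample B. IF the number of peptides detected by MS/MS differs between the samples, THEN the [PROTEIN X] concentrations are inferred to differ in the same direction]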
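[Figure: iTRAQ workflow. Four samples (S1 to S4) are denatured and digested in parallel, and each is labeled at the peptide N-terminus (-NH) through a peptide reactive group (PRG) with a tag made of a reporter and a balance group (114 + 31, 115 + 30, 116 + 29, 117 + 28), then mixed.
– In MS: the reporter-balance-peptide stays INTACT, so the 4 samples have identical m/z; the labeled peptides are chemically identical and migrate together in HPLC
– In MS/MS: peptide fragments (b and y ions) are EQUAL and give the PROTEIN IDENTIFICATION; reporter ions are DIFFERENT]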
MSMS Based Comparative Analysis– iTRAQ (isobaric tag)
Reporter ions: 114, 115, 116, 117
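Relative quantitation with the four reporter ions comes down to comparing their intensities within one MS/MS spectrum. The sketch below expresses each channel relative to a chosen reference channel; the function name and the intensities are hypothetical, and real workflows typically also apply isotope-impurity correction and normalization across spectra, which are omitted here.

```python
# Relative quantitation from iTRAQ 4-plex reporter ion intensities
# (hypothetical function and made-up intensities; no impurity correction).

REPORTER_CHANNELS = (114, 115, 116, 117)

def reporter_ratios(intensities: dict, reference: int = 114) -> dict:
    """Express every reporter channel relative to the reference channel."""
    ref = intensities[reference]
    if ref <= 0:
        raise ValueError("reference channel intensity must be positive")
    return {ch: intensities[ch] / ref for ch in REPORTER_CHANNELS}

# One MS/MS spectrum of a labeled peptide, samples S1..S4 on 114..117.
spectrum = {114: 1000.0, 115: 2000.0, 116: 500.0, 117: 1000.0}
ratios = reporter_ratios(spectrum)
print(ratios)
```

Here the peptide would be read as two-fold higher in the 115 sample and two-fold lower in the 116 sample relative to the 114 sample.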
• More abundant proteins tend to get more sequence coverage in MS/MS, crowding out MS/MS opportunities for peptides from low-abundance proteins
• Spectrum counting is semi-quantitative
• iTRAQ is not scalable for moderate-throughput biomarker discovery
– iTRAQ cost
– iTRAQ tag number
Issues in MS/MS Based Analysis
MS Based Comparative Analysis– Targeted MASS-Conductor® Approach
1. ALL peptide MS signals are exhaustively compared, leading to the discovery of statistically differential signals
2. ONLY peptides of interest, usually a very small number, are pursued for MS/MS identification. If necessary, MS/MS signals can be enhanced by loading more sample or by fraction enrichment before MS
• Robust handling of high-volume proteomic data
– e.g. one SCX fraction and 120 RP fractions
• 40-sample project MySQL data storage
– Raw data is 241.5 GB
– Peak data is 4.4 GB
• Robust and automatic high-throughput computing
• Robust reduction of raw data sets, enabling efficient and accurate feature discovery
• Sophisticated data mining approaches to obtain statistically differentiating features
• Graphic data analysis
MASS-Conductor® Platform Data Mining Requirements
“MASS-Conductor®”: an in-house software platform, including Java, Perl, R, Ruby and MySQL implementations
• Interface with AB and Thermo mass specs
– Convert LC-MALDI T2D files in a batch manner to text files
• Extract mono-isotopic LC-MALDI peaks
• Track multiple scans of the same MALDI plate and the HPLC SCX/RP fractions where each peak resides
• Cluster mono-isotopic peaks across categorical samples for comparative analysis
• Interface and integrate SAM, PAM, 1D classifiers, 2D classifiers, margin tree and CART algorithm packages for differential feature selection and classification
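Extracting mono-isotopic peaks means collapsing each isotope envelope to its lightest member. The sketch below is a deliberately simple illustration and not the platform's algorithm: it assumes singly charged ions (so isotope peaks sit about 1.00335 Da apart), assumes envelopes do not overlap, and uses a made-up tolerance and peak list.

```python
# Simplified mono-isotopic peak extraction (deisotoping) for singly charged
# ions. Assumes non-overlapping envelopes; tolerance and data are made up.

ISOTOPE_SPACING = 1.00335  # approx. 13C - 12C mass difference (Da), z = 1

def monoisotopic_peaks(peaks, tol=0.02):
    """Keep the first peak of each isotope envelope from (m/z, intensity) pairs."""
    mono = []
    last_in_envelope = None  # m/z of the latest peak in the current envelope
    for mz, intensity in sorted(peaks):
        is_isotope = (last_in_envelope is not None and
                      abs(mz - last_in_envelope - ISOTOPE_SPACING) <= tol)
        if not is_isotope:
            mono.append((mz, intensity))  # start of a new envelope
        last_in_envelope = mz
    return mono

peaks = [
    (982.60, 100.0), (983.60, 55.0), (984.61, 18.0),  # one envelope
    (1047.48, 80.0), (1048.48, 45.0),                 # another envelope
]
mono = monoisotopic_peaks(peaks)
print(mono)
```

Five raw peaks reduce to two mono-isotopic peaks, one per peptide, which is the kind of reduction that makes cross-sample clustering tractable.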
[Figure: “MASS-Conductor®” architecture. Data sets held in the Mass-Conductor database: patient datasets, spectrum raw datasets, peak datasets, feature datasets, indexed datasets. Processing steps: peak extraction; common feature alignment/extraction; feature indexing. Analysis: binary/multi-class classification, false discovery rate analysis and biomarker discovery, yielding potential biomarkers. Web service for collaboration]
DATA REDUCTION in “MASS-Conductor ®” Peak Extraction from Spectra Raw Data
Patient sample LC-MALDI spot/fraction 13, m/z 900 – 4000: 118142 raw data points reduced to 1690 peak data points
[Figure: the m/z 1200 – 1250 window of the same spectrum, shown as 2530 raw data points and as 62 peaks; intensity axis 0 to 1800]
[Figure: MS signal (0 to 1000) versus fractions (0 to 120) for one peptide in samples labeled AR, S and V, grouped as Class A, Class B and Class C, shown before data reduction and after data reduction]
DATA REDUCTION – One Peptide Example
Peak Extraction from Spectra Raw Data
SEQUENCE 640 AA; 69761 MW
001 MGQPSLTWML MVVVASWFIT TAATDTSEAR WCSECHSNAT CTEDEAVTTC TCQEGFTGDG
061 LTCVDLDECA IPGAHNCSAN SSCVNTPGSF SCVCPEGFRL SPGLGCTDVD ECAEPGLSHC
121 HALATCVNVV GSYLCVCPAG YRGDGWHCEC SPGSCGPGLD CVPEGDALVC ADPCQAHRTL
181 DEYWRSTEYG EGYACDTDLR GWYRFVGQGG ARMAETCVPV LRCNTAAPMW LNGTHPSSDE
241 GIVSRKACAH WSGHCCLWDA SVQVKACAGG YYVYNLTAPP ECHLAYCTDP SSVEGTCEEC
301 SIDEDCKSNN GRWHCQCKQD FNITDISLLE HRLECGANDM KVSLGKCQLK SLGFDKVFMY
361 LSDSRCSGFN DRDNRDWVSV VTPARDGPCG TVLTRNETHA TYSNTLYLAD EIIIRDLNIK
421 INFACSYPLD MKVSLKTALQ PMVSALNIRV GGTGMFTVRM ALFQTPSYTQ PYQGSSVTLS
481 TEAFLYVGTM LDGGDLSRFA LLMTNCYATP SSNATDPLKY FIIQDRCPHT RDSTIQVVEN
541 GESSQGRFSV QMFRFAGNYD LVYLHCEVYL CDTMNEKCKP TCSGTRFRSG SVIDQSRVLN
601 LGPITRKGVQ ATVSRAFSSL GLLKVWLPLL LSATLTLTFQ
Human THP precursor, Swiss-Prot: P07911
Urine THP Peptide Biomarkers Fall into Tight Clusters in C-Terminus
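The clustering claim can be checked by locating the reported peptides in the precursor sequence. The sketch below uses only residues 541-640 as printed on the slide (with the spacing removed) and reports 1-based positions in the full 640-residue precursor; the helper name is hypothetical.

```python
# Locate THP peptide biomarkers within the precursor sequence.
# FRAGMENT holds residues 541-640 as printed on the slide (Swiss-Prot P07911).

FRAGMENT_START = 541
FRAGMENT = (
    "GESSQGRFSVQMFRFAGNYDLVYLHCEVYLCDTMNEKCKPTCSGTRFRSGSVIDQSRVLN"
    "LGPITRKGVQATVSRAFSSLGLLKVWLPLLLSATLTLTFQ"
)

def locate(peptide: str):
    """Return 1-based (start, end) positions in the precursor, or None."""
    i = FRAGMENT.find(peptide)
    if i < 0:
        return None
    start = FRAGMENT_START + i
    return start, start + len(peptide) - 1

for peptide in ("VLNLGPITR", "SGSVIDQSRVLNLGPITR", "SGSVIDQSRVLNLGPITRK"):
    print(peptide, locate(peptide))
```

The shortest and longest listed THP peptides map to residues 598-606 and 589-607 respectively, so the whole panel falls within a stretch of about 19 residues near the C-terminus.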