
Development and Evaluation of a Prototype System for Automated Analysis of Clinical Mass Spectrometry Data

Nafeh Fananapazir

Master’s Thesis Defense, February 27th, 2006

Outline

1. Background
   a. Mass spectrometry in context
   b. Overview of MS in the clinical domain
   c. Challenges in MS analysis

2. Development of an automated system (FAST-AIMS)
   a. Requirements for an automated system
   b. Overview of FAST-AIMS
   c. Technical description of FAST-AIMS

3. Evaluation of FAST-AIMS
   a. Evaluation one: multiple user study
   b. Evaluation two: multiple dataset study

4. Conclusions

1. Background

Mass spectrometry in context

• Analytical tool for measuring the molecular weight of sample components
• Used in the measurement of biological samples: lipids, complex carbohydrates, nucleic acids, peptides, proteins
• Resolution on the order of 0.01% of total molecular weight
  • Small samples: small organic molecules measured at ppm accuracy
  • Large samples: within 4 daltons for a 40,000-dalton sample
• Uses
  • Resolution: determination of sample purity, detection of amino acid substitutions, detection of post-translational modifications
  • Reaction monitoring: enzyme reactions, protein digestion
  • Sequencing: SEQUEST/SALSA

1. Background

Mass spectrometry in context

Components of MS analysis

1. Sample isolation (2DE, biopsy, serum)

2. Sample processing (proteolytic digestion, LC)

3. Ionization (soft: molecules intact; hard: molecules fragment)
   a. ESI (electrospray ionization)
      • evaporation of charged, aerosolized droplets under vacuum
      • voltage can be raised, increasing fragmentation
   b. MALDI (matrix-assisted laser desorption ionization)
      • laser excitation of a matrix-embedded sample
      • matrix: UV-absorbing; prevents excessive fragmentation
   c. SELDI (alternative to MALDI)
      • chromatographic separation based on hydrophobicity, cation/anion exchange, metal affinity

1. Background

Mass spectrometry in context

Components of MS analysis (continued)

4. Sample Analyzer (TOF, quadrupole, tandem)

5. Ion detection/recording (m/z vs. intensity)

6. Calibration (internal or external)

7. Data analysis

1. Background

Mass spectrometry in context

Spectra produced
• Mass/charge ratio (m/z) plotted against relative intensity
• 10^4–10^6 data points per spectrum
• Sample SELDI-TOF spectrum: [figure shown in the original slides]
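As a minimal sketch of how such a spectrum can be held in memory (two parallel arrays of m/z and intensity values; the values below are synthetic placeholders, not real data):

```python
import numpy as np

# Hypothetical in-memory representation of a single spectrum: parallel arrays
# of m/z values and relative intensities (synthetic values for illustration only).
rng = np.random.default_rng(0)
mz = np.linspace(2_000, 20_000, 50_000)                # 5x10^4 points, within the 10^4-10^6 range
intensity = rng.exponential(scale=1.0, size=mz.size)   # placeholder intensities

# Trivial example of working with the arrays: report the most intense point.
print(f"Most intense point at m/z = {mz[np.argmax(intensity)]:.1f}")
```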

1. Background

Overview of MS in the clinical domain

Tissue types
• Relatively non-invasive (e.g. blood serum, urine)
• Invasive (e.g. tissue biopsy, pancreatic juice)

Pathology types
• Newborn screening for metabolic diseases (e.g. PKU)
• Cancer (e.g. ovarian, prostate, pancreatic, lung)
• Non-cancer (e.g. hepatitis, cerebrovascular accidents)

Advantages of MS analysis
• Non-invasive
• Potential for early detection
• Early results appear promising

1. Background

Overview of MS in the clinical domain

Brief historical considerations
• [February 2002] Petricoin et al. use SELDI-TOF spectra from blood serum samples to create a classification model for ovarian cancer
• [July 2003] Coombes et al. publish the first publicly available pre-processing algorithms related to peak detection
• Over 15 primary studies related to disease classification using MS data have been published

Early generalizations
• Most proteome MS studies use MALDI-TOF or SELDI-TOF
• Most studies focus on blood serum proteome analysis
• Very few studies are associated with publicly available datasets

1. Background

Challenges in MS analysis

1. Sample collection
   • Consistency/reproducibility of sample collection and processing
   • Abundance of clinically relevant biomarkers for disease
     • Tumors produce very little biomarker [Diamandis]
     • Response: enzymatic amplification, “mopping” effect of larger molecules

2. Data analysis
   • Lack of disclosure of methods and algorithms employed
   • Overfitting

3. Interpretation of results
   • Determination of clinical relevance (performance metric)
   • Determination of biological relevance
     • Are biomarkers specific to the disease of interest?
     • Example: generalized inflammatory response
     • Perhaps focus should be on studying tumor proteomes [Liebler]


1. Background

Challenges in MS analysis

1. Microarray analysis of nucleotides
   • All spots may be known a priori
   • The array is the same (spots are “aligned”) from sample to sample
   • Intensity represents extent of hybridization with known oligonucleotides
   • Possible to limit analysis to known physiologic/pathologic pathways

2. Mass spectrometry analysis of peptides
   • Peptides represented by peaks are not known a priori
   • A peak may represent: noise, a single peptide (known or unknown), or a peptide amalgamation
   • m/z values are not aligned from sample to sample
   • Peak alignment is not straightforward
   • Not possible to limit analysis to known physiologic/pathologic pathways
   • Spectra may represent tens to hundreds of thousands of data points
   • Lack of software performing complete analysis

2. Development of FAST-AIMS

Requirements for an automated system

• Goal 1: Complete analysis of unprocessed MS data, productive of diagnostic or prognostic models and associated biomarkers
  • This goal is being met through the development and application of machine learning and biostatistics techniques by those with expertise.

• Goal 2: Creation of software that performs such analysis, allowing those without related expertise to have access to the information contained in clinical MS data.
  • No publicly available software system (commercial or free) exists that provides complete, high-quality first-pass analysis of MS data.

2. Development of FAST-AIMS

Overview of FAST-AIMS

• FAST-AIMS: Fully Automated Software Tool for Artificial Intelligence in Mass Spectrometry
• First software system to provide integrated analysis from raw data to model development
• Features
  • Mass spectrometry data preprocessing
  • Able to accommodate a range of user expertise
  • Avoids overfitting
  • Does not require additional software
  • Can perform three different tasks:
    1. Generation of a classification model
    2. Estimation of classification performance on unseen data
    3. Application of a generated model to new data

2. Development of FAST-AIMS

Technical description of FAST-AIMS Programming languages employed

• Wizard-like GUI interface: Delphi 7.0• Data analysis algorithms: Matlab 6.5• Integrated into a downloadable executable

Current dataset requirements:• Each row represents one spectrum• The first column contains class information

for each spectrum• binary (0,1): non-cancer/cancer• tertiary (0,1,2): healthy/benign/cancer
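A minimal sketch of reading a dataset in this layout (one spectrum per row, class label in the first column). The file name and comma delimiter are assumptions for illustration; they are not part of FAST-AIMS:

```python
import numpy as np

# Hypothetical input file in the layout described above: one spectrum per row,
# the first column is the class label (0/1 or 0/1/2), the rest are intensities.
data = np.loadtxt("spectra.csv", delimiter=",")

labels = data[:, 0].astype(int)   # class information for each spectrum
spectra = data[:, 1:]             # intensity values

print(f"{spectra.shape[0]} spectra with {spectra.shape[1]} intensity values each")
print("Class counts:", np.bincount(labels))
```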

2. Development of FAST-AIMS

Technical description of FAST-AIMS

[Diagram in original slides: labeled training data (non-cancer vs. cancer) and an unlabeled test set to be classified]

Current operation (a rough sketch in code follows the list):
1. Data acquisition
2. Baseline subtraction
3. Peak detection
4. Peak alignment
5. Normalization
6. Feature selection
7. Build classification model
8. Apply model to new data
9. Evaluate classification
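As a rough illustration of how the first stages of such a pipeline might be chained in code (not the actual FAST-AIMS implementation, which was written in Matlab and uses the Coombes 2003 and Yasui 2003 algorithms), consider the following placeholder sketch:

```python
import numpy as np

# Illustrative stand-ins for the first pipeline stages listed above; the real
# FAST-AIMS algorithms are not reproduced here.
def subtract_baseline(spectrum):
    # crude placeholder: subtract a running-minimum "baseline"
    return spectrum - np.minimum.accumulate(spectrum)

def detect_peaks(spectrum, factor=3.0):
    # placeholder: indices of local maxima above factor * mean intensity
    left, mid, right = spectrum[:-2], spectrum[1:-1], spectrum[2:]
    is_peak = (mid > left) & (mid > right) & (mid > factor * spectrum.mean())
    return np.where(is_peak)[0] + 1

def preprocess(spectrum):
    # per-spectrum steps performed independently of other spectra:
    # baseline subtraction followed by peak detection
    return detect_peaks(subtract_baseline(spectrum))

rng = np.random.default_rng(1)
raw = rng.exponential(size=1_000)          # synthetic stand-in for one raw spectrum
print(f"{preprocess(raw).size} candidate peaks detected")
```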


2. Development of FAST-AIMS

Technical description of FAST-AIMS

Model generation: experimental design
• Generation of classifier/feature-selection permutations
• Extending the range of parameters for a given algorithm has a multiplicative effect on the number of permutations
• Selection of the following would yield 40 permutations: (1 + 2×3 + 3) = 10 feature-selection configurations × (2×2) = 4 classifier configurations

Algorithm             Parameters
FS1 (All features)    —
FS2 (RFE)             cost ∈ {10, 100}; f ∈ {20, 40, 60}
FS3 (HITON)           α ∈ {0.3, 0.5, 0.7}
C1 (SVM)              cost ∈ {10, 100}; degree ∈ {1, 2}
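A small sketch of how such a permutation grid could be enumerated; the algorithm names and parameter values come from the table above, while the enumeration code itself is illustrative rather than FAST-AIMS’s:

```python
from itertools import product

# Feature-selection configurations: 1 (all features) + 2*3 (RFE) + 3 (HITON) = 10
fs_configs = [("FS1-all", {})]
fs_configs += [("FS2-RFE", {"cost": c, "f": f}) for c, f in product([10, 100], [20, 40, 60])]
fs_configs += [("FS3-HITON", {"alpha": a}) for a in [0.3, 0.5, 0.7]]

# Classifier configurations: 2*2 (SVM cost x degree) = 4
clf_configs = [("C1-SVM", {"cost": c, "degree": d}) for c, d in product([10, 100], [1, 2])]

# Cross product: 10 x 4 = 40 feature-selection/classifier permutations
permutations = list(product(fs_configs, clf_configs))
print(len(permutations))   # 40
```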

Model generation: experimental design

Example: 5-fold cross-validation with 4 feature-selection/classifier permutations (P1, P2, P3, P4) to choose from. Each permutation is evaluated on every fold and its per-fold performances are averaged:

Fold        P1     P2     P3     P4
1           0.8    0.6    0.8    0.6
2           0.8    0.6    0.7    0.7
3           0.9    0.5    0.9    0.7
4           0.7    0.6    0.9    0.7
5           0.6    0.7    1.0    0.6
Avg. Perf.  0.76   0.60   0.86   0.66

The permutation with the best average performance is selected: P3 (0.86).
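The selection rule can be written out directly; the sketch below simply recomputes the averages from the table above and keeps the best permutation (illustrative code, not FAST-AIMS’s implementation):

```python
# Per-fold performances from the table above (one list of 5 fold scores per permutation).
fold_performance = {
    "P1": [0.8, 0.8, 0.9, 0.7, 0.6],
    "P2": [0.6, 0.6, 0.5, 0.6, 0.7],
    "P3": [0.8, 0.7, 0.9, 0.9, 1.0],
    "P4": [0.6, 0.7, 0.7, 0.7, 0.6],
}

# Average each permutation's performance across folds and keep the best one.
averages = {name: sum(scores) / len(scores) for name, scores in fold_performance.items()}
best = max(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")         # P1: 0.76, P2: 0.60, P3: 0.86, P4: 0.66
print(f"Selected permutation: {best}")  # P3
```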


3. Evaluation of FAST-AIMS

Evaluation one: multiple user study

Elements of study protocol
• Users were selected to maximize the range of expertise
• Dataset selection
  • should minimize the chance of prior exposure
  • should not have been used during FAST-AIMS development
  • analysis should be non-trivial
• All users were given an unrelated “practice” dataset
• FAST-AIMS users signed disclosure agreements and were given:
  • A general introduction to the study
  • A copy of FAST-AIMS
  • A FAST-AIMS tutorial document
• Users were to work independently towards submission of a classification model for application on a withheld testing set

3. Evaluation of FAST-AIMS

Evaluation one: multiple user study

Dataset: Prostate cancer (n=162) [Banez 2003]
• Cancer (n=106), Non-cancer (n=56)
• Training (n=108), Testing (n=54)

Study participants designated as follows:
• FAST-AIMS users
  • Expert (n=4)
  • Non-expert (n=2)
• Non-FAST-AIMS users
  • Biostatistician (n=1)

3. Evaluation of FAST-AIMS

Evaluation one: multiple user study


3. Evaluation of FAST-AIMS

Evaluation two: multiple dataset study

4. Conclusions

Results
• Mass spectrometry data analysis is currently very difficult and requires expertise
• FAST-AIMS is the first system to show that this analysis can be fully automated and that it can be performed by non-expert users
• Two evaluations of FAST-AIMS were performed
  • Evaluation one:
    • Non-expert and expert users of FAST-AIMS can achieve performances that nearly match those of expert biostatisticians, in a shorter time
    • All participants in the study achieved classification accuracy higher than previously published for the Banez dataset
  • Evaluation two:
    • FAST-AIMS achieved high classification performance when evaluating three different datasets

4. Conclusions

Future directions

Where FAST-AIMS is relatively weak
• Pre-processing techniques need further development
• Number of features (peaks) selected can still be very high; a small trade-off in accuracy may allow for selection of far fewer peaks
• Lack of reporting of statistical significance of each peak
• Error estimation methods could be strengthened
• Interface issues
  • Lack of visualization
  • Results are currently logged/reported in a text file

4. Conclusions

Future directions

Why FAST-AIMS is important
• Shows that analysis can be automated
• Accommodates
  • Naïve users: through a guided screen sequence and incorporation of defaults
  • Expert users: allows selection of a wide range of algorithms and associated parameters
• Allows all users to harness computing power in aid of faster analysis
• Even non-expert users can achieve classification performance greater than published results

Acknowledgements

With thanks extended to:
• Alexander Statnikov, Programmer, Study participant
• Yerbolat Dosbayev, Programmer, Study participant
• Kevin Maas, Study participant
• Yin Aphinyanaphongs, Study participant
• Yu Shyr, Ming Li (biostatistician component of evaluation one)

The presenter also wishes to thank his thesis committee:
• Constantin Aliferis, Primary thesis advisor
• Dean Billheimer, Department of Biostatistics
• Doug Hardin, Department of Mathematics
• Shawn Levy, Department of Biomedical Informatics
• Dan Liebler, Department of Biochemistry
• Ioannis Tsamardinos, Department of Biomedical Informatics

Addendum

Technical description of FAST-AIMS

Based on user-defined parameters, the following sequence of analysis is performed (all steps are logged):

1. Steps that can be performed on an individual spectrum independently of other spectra are performed on all samples
   • peak identification, baseline subtraction [Coombes 2003]

2. Data is partitioned based on the task(s) chosen.

   Generate model: The dataset is divided into n subsets. Samples are assigned to each subset randomly, while maintaining the class distribution within each subset. Each subset forms the testing set for a partition; the remainder of the dataset becomes the training set for that partition.

   Estimate performance: The procedure for generating a model is followed. For each partition, each of the (n−1) subsets associated with the training set forms a train-test subset for a nested partition; the remainder of the training set becomes the train-train set for that nested partition.
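A minimal sketch of the stratified partitioning described in step 2; the fold count, helper name, and random assignment scheme are illustrative, while the class sizes come from the Banez prostate dataset used in evaluation one:

```python
import random
from collections import defaultdict

def stratified_partitions(labels, n_folds=5, seed=0):
    """Assign sample indices to n_folds subsets at random while
    maintaining the class distribution within each subset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        for i, idx in enumerate(members):
            folds[i % n_folds].append(idx)

    # Each fold in turn is the testing set; the remaining folds form the training set.
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test

labels = [0] * 56 + [1] * 106   # non-cancer / cancer class sizes from the Banez dataset
for train, test in stratified_partitions(labels):
    print(len(train), len(test))
```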

2. Development of FAST-AIMS

Technical description of FAST-AIMS

3. Remaining pre-processing steps (those requiring evaluation of multiple spectra) are performed within each training (or train-train) set
   • normalization sequence, peak alignment [Yasui 2003]

4. All permutations of the selected feature selection and classification algorithms and their associated parameters are determined.

5. Each permutation is used to generate a set of features (based on feature selection) and a classification performance as follows:

   Generate model: Feature selection is performed on each training set. Dimensionality of the dataset is reduced based on the features selected. The classifier builds a model on the reduced dataset. The model is applied to the testing set and classification performance is recorded based on the user-selected metric (ROC or accuracy).

   Estimate performance: Within each training set, the procedure for generating a model is followed (considering train-train sets and train-test sets).
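The sketch below shows what step 5’s “generate model” branch might look like for a single permutation. It uses scikit-learn stand-ins (SelectKBest and a polynomial-kernel SVC) in place of FAST-AIMS’s actual RFE/HITON feature selection and SVM implementation, and synthetic data in place of real spectra:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def evaluate_permutation(X_train, y_train, X_test, y_test, k=20, cost=10, degree=1):
    # Feature selection on the training set only (stand-in for RFE/HITON),
    # then dimensionality reduction of both sets to the selected features.
    selector = SelectKBest(f_classif, k=k).fit(X_train, y_train)
    X_train_red, X_test_red = selector.transform(X_train), selector.transform(X_test)

    # Classifier builds a model on the reduced training data.
    clf = SVC(kernel="poly", degree=degree, C=cost).fit(X_train_red, y_train)

    # Performance is recorded on the held-out testing set (ROC AUC as the metric).
    scores = clf.decision_function(X_test_red)
    return roc_auc_score(y_test, scores)

# Synthetic stand-in data: 100 "spectra" with 500 intensity features each.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 500)), rng.integers(0, 2, size=100)
print(evaluate_permutation(X[:70], y[:70], X[70:], y[70:]))
```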

6. Classification models are generated, independently of classification of testing data:

   Generate model: A single classification model is generated by averaging the classification performance of each permutation on each testing set and choosing the permutation with the best performance.

   Estimate performance: Within each training set, a classification model is generated by averaging the classification performance of each permutation on each train-test set and choosing the permutation with the best performance. n models are thus generated.

7. Results are determined:

   Generate model: The result is the classification model determined (permutation and user-defined pre-processing steps) and its associated average performance.

   Estimate performance: The optimized model for each training set is applied to the associated testing set for each partition. A non-overfitted classification performance is determined for each partition, and the average of these is reported as the estimated performance.
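A compact sketch of step 7’s “estimate performance” branch: for each outer partition, the permutation chosen on the inner (train-train/train-test) loop is applied once to the outer testing set, and the outer performances are averaged. All numbers and names below are illustrative placeholders:

```python
import numpy as np

def inner_select(permutations, inner_performance):
    """Choose the permutation with the best average performance
    over the nested (train-train/train-test) partitions."""
    averages = {p: np.mean(inner_performance[p]) for p in permutations}
    return max(averages, key=averages.get)

# Illustrative numbers only: inner-loop performances per permutation for each of
# 3 outer partitions, and each permutation's performance on each outer testing set.
permutations = ["P1", "P2", "P3"]
inner_scores = [
    {"P1": [0.70, 0.75], "P2": [0.80, 0.82], "P3": [0.78, 0.77]},   # partition 1
    {"P1": [0.72, 0.71], "P2": [0.79, 0.83], "P3": [0.80, 0.79]},   # partition 2
    {"P1": [0.69, 0.73], "P2": [0.81, 0.80], "P3": [0.76, 0.78]},   # partition 3
]
outer_scores = [
    {"P1": 0.71, "P2": 0.80, "P3": 0.77},
    {"P1": 0.70, "P2": 0.81, "P3": 0.79},
    {"P1": 0.72, "P2": 0.78, "P3": 0.77},
]

estimates = []
for inner, outer in zip(inner_scores, outer_scores):
    chosen = inner_select(permutations, inner)   # model chosen without seeing the outer test set
    estimates.append(outer[chosen])              # applied once to the outer testing set

print(f"Estimated (non-overfitted) performance: {np.mean(estimates):.2f}")
```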
