Artificial Intelligence and Expert Systems in Mass Spectrometry. Ronald C. Beavis, Steven M. Colby, Royston Goodacre, Peter de B. Harrington, James P. Reilly, Stephen Sokolow, and Charles W. Wilkerson. In: Encyclopedia of Analytical Chemistry, R.A. Meyers (Ed.), pp. 11558–11597. John Wiley & Sons Ltd, Chichester, 2000.




Artificial Intelligence and Expert Systems in Mass Spectrometry

Ronald C. Beavis, Proteometrics LLC, New York, NY, USA

Steven M. Colby, Scientific Instrument Services, Inc., Ringoes, NJ, USA

Royston Goodacre, University of Wales, Aberystwyth, UK

Peter de B. Harrington, Ohio University, Athens, OH, USA

James P. Reilly, Indiana University, Bloomington, IN, USA

Stephen Sokolow, Bear Instruments, Santa Clara, CA, USA

Charles W. Wilkerson, Los Alamos National Laboratory, Los Alamos, NM, USA

1 Introduction
  1.1 Definitions of Artificial Intelligence and Expert Systems
  1.2 Growth in Technology
  1.3 Article Summary

2 Brief History of Computers in Mass Spectrometry
  2.1 Introduction
  2.2 Early Devices
  2.3 Instrument Design
  2.4 Computerization
  2.5 Brief Introduction to Artificial Intelligence and Expert Systems
  2.6 Brief Overview of Artificial Intelligence and Expert Systems in Mass Spectrometry

3 Mass Spectrometry Data Systems
  3.1 Introduction
  3.2 Fundamental Tasks of a Data System
  3.3 Requirements for Operating Systems
  3.4 Impact of Continuing Advances in Computers on Mass Spectrometry Data Systems
  3.5 Programmability

4 Biological Applications
  4.1 Protein Sequence Determination
  4.2 Database Search Strategies
  4.3 Nucleotide Databases
  4.4 Protein Modification Analysis
  4.5 Use with Differential Displays
  4.6 Alternate Splicing

5 Mass Spectrometry Applications of Principal Component and Factor Analyses
  5.1 Introduction
  5.2 Selected History
  5.3 Introductory Example
  5.4 Theoretical Basis
  5.5 Related Methods and Future Applications
  5.6 Reviews and Tutorials
  5.7 Acknowledgments

6 Artificial Neural Networks
  6.1 Summary
  6.2 Introduction to Multivariate Data
  6.3 Supervised Versus Unsupervised Learning
  6.4 Biological Inspiration
  6.5 Data Selection
  6.6 Cluster Analyses with Artificial Neural Networks
  6.7 Supervised Analysis with Artificial Neural Networks
  6.8 Applications of Artificial Neural Networks to Pyrolysis Mass Spectrometry
  6.9 Concluding Remarks

7 Optimization Techniques in Mass Spectrometry
  7.1 Introduction
  7.2 Time-of-flight Mass Spectrometry Mass Calibration

Abbreviations and Acronyms

Related Articles

References

This article provides a brief introduction to aspects of mass spectrometry (MS) that employ artificial intelligence (AI) and expert system (ES) technology. These areas have grown rapidly with the development of computer software and hardware capabilities. In many cases, they have become fundamental parts of modern mass spectrometers.

Specific attention is paid to applications that demonstrate how important features of MS are now dependent on AI and ESs. The following topics are specifically covered: history, MS data systems, biological applications, artificial neural networks (ANNs), and optimization techniques.

    1 INTRODUCTION

1.1 Definitions of Artificial Intelligence and Expert Systems

This article covers the application of AI and ESs as applied to the techniques of MS. ESs are methods or programs by which a fixed set of rules or data is used to control a system, analyze data, or generate a result. In contrast, AI is associated with the higher intellectual processes, such as the ability to reason, discover meanings, generalize, or learn. In relation to MS, AI is generally limited to cases wherein ANNs are employed to learn or discover new patterns or relationships between data. Reviews of AI and ESs are available.(1,2)

    1.2 Growth in Technology

The growth in MS has been spurred by improvements in software sophistication and computer capabilities. The ability of computing systems to both collect and analyze data has grown very rapidly since the 1970s. The most important improvements have been in the calculation speed of the machines, their ability to store large amounts of data very quickly, and their size. These improvements have allowed processes such as multitasking during data acquisition, where the computer both collects data and controls the instrument operation, and automated spectral matching, where large volumes of data are quickly analyzed.

The improvements in computer technology have resulted in an increase in the performance and types of mass spectrometers available. For example, instruments with even the simplest types of mass analyzers are now computer controlled. This has dramatically increased the stability, reproducibility, and capabilities of these devices. It is now possible to perform a tandem mass spectrometry (MS/MS) experiment where the data collection parameters are changed on the millisecond timescale in response to the data collected.(3) This allows a library search to be performed, possible match candidates to be experimentally tested, and a positive identification to be made, all during the elution of a chromatography peak.

The development of computers has also allowed the use of new types of MS. For example, the data generated by Fourier transform mass spectrometry (FTMS), pyrolysis, and electrospray MS would be very difficult, if not impossible, to collect and analyze without high-speed computers.

    1.3 Article Summary

This article includes sections on the history of computers in MS, MS data systems, biological applications, MS applications of principal component analysis (PCA) and factor analysis (FA), ANNs, and optimization techniques in MS. This article does not include a discussion of the use and development of libraries of electron impact ionization data or of peak deconvolution and component identification based on these libraries. Reviews of these topics are available.(4,5)

2 BRIEF HISTORY OF COMPUTERS IN MASS SPECTROMETRY

    2.1 Introduction

Digital computers are now an indispensable part of most analytical instruments. There are many reasons for this pervasive presence. Perhaps most important is the ability of computers to perform repetitive tasks without variation (in the absence of hardware failure), which is critical to reproducible and defensible experimental results. Further, properly designed and implemented computer control/data systems maximize instrument and laboratory efficiency, resulting in higher sample throughput, faster results to the end-user, and increased profitability (either in terms of publications or currency) for the laboratory. As a technique that arguably provides more chemical and structural information per unit sample than any other, MS has been employed in a variety of environments over its long history. The evolution of the mass spectrometer from a fundamental research tool for the elucidation of atomic and molecular properties to a benchtop turn-key instrument in large measure parallels the evolution of both discrete and integrated electronic devices and computational hardware and software.

In this article no attempt is made to tabulate an exhaustive list of historical references to the application of computers in MS; rather, selected citations are presented to provide a flavor of the development of the field. There is one monograph dedicated to computers in MS,(6) and the topic is given treatments ranging from cursory to complete in a variety of books on mass spectrometric techniques.

    2.2 Early Devices

Early mass analyzers were spatially dispersive instruments, or mass spectrographs,(7) utilizing static magnetic or DC (direct current) electric fields to perturb the trajectories of accelerated ions. At the time of their development (ca. 1910–1920), photographic plates were placed in the focal plane of the spectrograph, and after exposure to the ion beam an image was developed and the resulting data analyzed. Quantitative analyses were effected by comparing the exposure produced on film by the unknown sample with that of calibrated plates prepared from known amounts of reference materials. This technique is still in use today in certain specialized (and somewhat archaic) applications. In the 1930s and 1940s, detectors based on direct measurement of ion beam flux (such as the Faraday cup and electron multiplier) were introduced. Such detectors are single-channel transducers, and require that slits be positioned (in a dispersive instrument) to limit the exposure of the detector to a single mass at any given time. The signal is then amplified and recorded as a function of some independent variable (such as magnetic field strength, or the ion accelerating voltage) that is proportional to the mass-to-charge ratio (m/z) of the ions in the sample.

With the introduction of electronic detectors, it became practical to couple detector output to a digital computer via some type of interface. For low-intensity signals, such as measurement of discrete ions, pulse-counting techniques are employed. As this is inherently a digital process, transmission of data to a computer is relatively straightforward. Larger signals, characterized by significant and measurable detector currents, employ analog-to-digital converters (ADCs) prior to storage and manipulation on the computer.

    2.3 Instrument Design

    2.3.1 Time-of-flight

Time-of-flight (TOF) mass spectrometers were first developed in 1932,(8) but the most familiar design, which forms the basis of current instruments, was described by Wiley and McLaren in 1955.(9) Accurate measurement of ion time of arrival at the detector is key to achieving optimum resolving power and mass accuracy with this instrument. Prior to the introduction of computer data acquisition, oscillographic recording was required, with manual postprocessing.

    2.3.2 Quadrupole

The quadrupole mass filter was first described in 1958.(10)

The advantages of this instrument include small size, low ion energy (volts rather than kilovolts for dispersive and TOF instruments), modest production costs, and the ability to quickly scan through a wide range of m/z values. As a result, this design has become by far the most popular variety of mass spectrometer. A related mass analyzer, the quadrupole ion trap, was not widely developed until the 1970s.(11) Like the linear quadrupole mass filter, the ion trap is small, inexpensive, and robust. Both of these devices rely on the application of concerted radiofrequency (RF) and DC fields in order to define conditions under which ions have stable trajectories in the instrument.

    2.3.3 Ion Cyclotron Resonance

The ion cyclotron resonance (ICR) mass spectrometer, first reported in 1968,(12) relies on the absorption of RF energy and the natural precession of charged particles in the presence of a magnetic field for mass separation. Nominal resolving power is obtained in this instrument when operated in a continuous scanning mode, where the RF frequency is slowly swept and energy is absorbed when ions in the cell are resonant with the excitation. The most common incarnation of ICR is often referred to as FTMS,(13) and spectral information is extracted from the digitally recorded decay and dephasing of ion orbits after a pulsed application of RF energy. This approach allows for significantly improved resolving power (1000-fold improvement) over the scanning experiment.
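The transform step described above can be sketched numerically. The following is an illustrative sketch, not taken from this article: it simulates the decaying image-current transient of a single ion packet in a hypothetical 3 T instrument, recovers the cyclotron frequency with a deliberately naive discrete Fourier transform, and converts it back to m/z via f = qB/(2πm). Real FTMS software uses the fast Fourier transform and handles many ions at once.

```python
import cmath
import math

def dft_magnitude(signal):
    """Naive discrete Fourier transform magnitude spectrum (positive bins only).
    A real data system would use an FFT; this O(n^2) version is for clarity."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

# Hypothetical instrument values (for illustration only): 3 T magnet, one ion at m/z 100.
Q = 1.602176634e-19    # elementary charge, C
U = 1.66053907e-27     # unified atomic mass unit, kg
B = 3.0                # magnetic field, T
f_true = Q * B / (2 * math.pi * 100 * U)    # cyclotron frequency, ~461 kHz

fs, n = 2.0e6, 1024    # sampling rate (Hz) and record length
transient = [math.cos(2 * math.pi * f_true * t / fs) * math.exp(-t / (0.7 * n))
             for t in range(n)]             # decaying, dephasing image current

spectrum = dft_magnitude(transient)
k_peak = max(range(1, len(spectrum)), key=spectrum.__getitem__)  # skip the DC bin
f_est = k_peak * fs / n
mz_est = Q * B / (2 * math.pi * f_est) / U  # back to m/z, close to 100
```

The mass accuracy here is limited by the frequency resolution fs/n; in practice, long transients and zero-filling sharpen the estimate considerably.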

    2.4 Computerization

As a result of the widespread availability of minicomputers in the late 1960s, and microcomputers in the 1970s and 1980s, automation of mass spectrometer control, tuning, data acquisition, and data processing became practical. The reduction in both size and cost of computational engines, with a concomitant increase in processing power, cannot be overemphasized in the development of automated data systems for mass spectrometers (and other analytical instrumentation). Certainly, the widespread implementation of gas chromatography/mass spectrometry (GC/MS) would have been significantly delayed without the availability of reasonably priced quadrupole mass spectrometers and minicomputer-based data acquisition and processing equipment. The operation of FTMS would be nearly impossible without the involvement of computers.

2.5 Brief Introduction to Artificial Intelligence and Expert Systems

Almost from the beginning of the digital computing era, both hardware and software engineers have been interested in developing computing tools that can monitor their environments, and subsequently make decisions and/or carry out actions based on rules either known a priori (from programming) or deduced (as a result of iterative observation/decision/feedback experiences). Such computational devices may be called ‘expert systems’, or may be said to operate based on ‘artificial intelligence’. It is certainly beyond the scope of this article to provide a complete history of AI and ESs, but there are a multitude of both books and research articles related to this topic.(14,15) Today, many parts of our world are monitored, and in some cases controlled, by automated, computerized equipment. In an effort to make these devices more responsive and efficient, many of them employ embedded ESs of varying degrees of sophistication. Programming languages such as LISP and PROLOG have been developed specifically to facilitate the development of software to implement AI and ESs. The combination of powerful hardware, innovative algorithms, and capture of years of expert knowledge has allowed instruments to become increasingly independent of operator interaction, reducing the possibility for error and allowing the scientist to concentrate on the interpretation of the processed data and the formulation of new experiments.

2.6 Brief Overview of Artificial Intelligence and Expert Systems in Mass Spectrometry

In the world of MS, AI and ES tools are used in three primary areas: optimization and control of the performance of the mass spectrometer itself, collection of the detector signal as a function of m/z, and analysis of the data.

    2.6.1 Spectrometer Control

There are many instrumental parameters that need to be adjusted and held at an optimum value for best spectrometer performance. Initially, the instrument must be tuned, i.e. brought to a state in which peak intensity, peak shape, and mass calibration are all within acceptable limits. This is accomplished by introducing into the spectrometer a known compound, such as PFTBA (perfluorotributylamine), that produces a variety of well-characterized fragments over the mass range of interest, and adjusting (in an optimized fashion) the various instrument parameters to achieve the desired level of performance. Computers are almost invariably used to perform this task, because the adjustable parameters are often highly interrelated (repeller voltage, ion focusing lens potentials, electron multiplier voltage, mass scan rate, ion storage time, chemical ionization reagent gas pressure, time delay for ion extraction, etc.). Techniques such as simplex optimization are used to efficiently locate the best-tune conditions in parameter space. After tuning is complete, the computer can then monitor all of the vital signs of the instrument during operation, alert the spectrometrist to marginal performance conditions, and even recommend appropriate interventions, before data quality is affected.
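The simplex tuning mentioned above can be sketched as follows. This is an illustrative sketch, not this article's algorithm: a small Nelder-Mead-style simplex searches two hypothetical lens voltages for the maximum of an invented detector response standing in for measured peak intensity.

```python
import math

def response(v):
    """Hypothetical detector response; a real system would measure peak intensity.
    This stand-in peaks at repeller = 12 V, focus lens = -35 V."""
    return math.exp(-((v[0] - 12.0) ** 2 / 50.0 + (v[1] + 35.0) ** 2 / 200.0))

def simplex_maximize(f, start, step=5.0, iters=200):
    """Simplified Nelder-Mead simplex search (reflect/expand/contract/shrink)."""
    cost = lambda v: -f(v)                 # maximize f by minimizing -f
    n = len(start)
    pts = [list(start)] + [[start[j] + (step if i == j else 0.0) for j in range(n)]
                           for i in range(n)]
    for _ in range(iters):
        pts.sort(key=cost)                 # best vertex first
        best, worst = pts[0], pts[-1]
        cen = [sum(p[j] for p in pts[:-1]) / n for j in range(n)]
        refl = [2 * cen[j] - worst[j] for j in range(n)]
        if cost(refl) < cost(best):        # try expanding past the reflection
            exp = [3 * cen[j] - 2 * worst[j] for j in range(n)]
            pts[-1] = exp if cost(exp) < cost(refl) else refl
        elif cost(refl) < cost(pts[-2]):
            pts[-1] = refl
        else:                              # contract toward the centroid
            con = [(cen[j] + worst[j]) / 2 for j in range(n)]
            if cost(con) < cost(worst):
                pts[-1] = con
            else:                          # shrink everything toward the best vertex
                pts = [best] + [[(p[j] + best[j]) / 2 for j in range(n)]
                                for p in pts[1:]]
    return min(pts, key=cost)

best = simplex_maximize(response, [0.0, 0.0])   # converges near (12, -35)
```

The appeal of the simplex for instrument tuning is that it needs only comparisons of measured responses, never derivatives, so it tolerates noisy, interrelated parameters.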

    2.6.2 Data Collection

In almost all data systems, the operator uses the computer to define the scope of the measurements to be made. Subsequently, the computer sets instrument parameters to control, for example, the speed of data collection, the mass range to be recorded, and other instrument type-dependent variables. As the experiment is performed, the computer records the detector signal via either a direct digital interface (for counting experiments) or an ADC. Correlation of the detector signal with the corresponding m/z condition is accomplished through a mass-axis calibration routine. Depending on the mass spectrometer type, this may be a DC, RF, or time reference.
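For a TOF analyzer, the mass-axis calibration mentioned above amounts to fitting the relation sqrt(m/z) = a·t + b to reference peaks. A minimal two-point sketch with hypothetical flight times (real routines fit many calibrant peaks by least squares):

```python
import math

def tof_calibration(points):
    """Two-point TOF mass calibration: sqrt(m/z) = a*t + b.
    `points` is [(flight_time, known_mz), (flight_time, known_mz)]."""
    (t1, m1), (t2, m2) = points
    a = (math.sqrt(m2) - math.sqrt(m1)) / (t2 - t1)
    b = math.sqrt(m1) - a * t1
    return lambda t: (a * t + b) ** 2    # convert a flight time to m/z

# Hypothetical calibrants: m/z 100 arriving at 10.0 us, m/z 400 at 20.0 us.
mz_at = tof_calibration([(10.0, 100.0), (20.0, 400.0)])
unknown = mz_at(15.0)   # an unknown peak at 15.0 us lands at m/z 225
```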

    2.6.3 Data Analysis

After the data have been collected, their chemical information must be extracted and interpreted. There has been a significant amount of development in the area of data analysis software since the first report of such use in 1959.(16) In this early work, a system of simultaneous linear equations was used to convert raw peak areas to normalized analyte mole fractions. A 17-component sample required 0.5–3 min of computing time for processing. Today, mixtures with nearly an order of magnitude more analytes can be reduced in less time, providing significantly more information than simply peak quantitation. In addition to quantifying analytes, mass spectrometer data systems routinely provide identification of species from their mass spectral fingerprints. One of the earliest examples of the application of AI to mass spectral interpretation was the work of Djerassi et al.(17) A LISP (a list processing language)-based code, DENDRAL, was developed and subsequently applied to a variety of analyte classes. Most mass spectrometrists are familiar with spectral libraries, ranging from the print version of the so-called eight-peak index(18) to the most modern computerized systems. The latter use intelligent peak-searching and pattern-matching algorithms to provide the operator with the most likely identities of species in a spectrum.
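The 1959-style quantitation described above, converting raw peak areas to mole fractions through simultaneous linear equations, can be sketched with a hypothetical two-component sensitivity matrix (the numbers below are invented for illustration):

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]              # pivot for stability
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back-substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

# Hypothetical sensitivities: S[i][j] = peak area at characteristic m/z i
# per unit amount of component j (calibrated from pure reference spectra).
S = [[4.0, 1.0],
     [1.0, 4.0]]
areas = [6.0, 9.0]                      # measured raw peak areas
amounts = solve(S, areas)               # component amounts: [1.0, 2.0]
fractions = [x / sum(amounts) for x in amounts]   # normalized mole fractions
```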

    3 MASS SPECTROMETRY DATA SYSTEMS

    3.1 Introduction

Since the mid-1970s the programming of mass spectral data systems has changed enormously. Although the basic tasks of an MS data system are fundamentally the same now as they were in the 1970s, many of the numbers involved have become substantially larger. In addition, developing mass spectral technologies such as FTMS have placed very heavy demands on the acquisition process.


Spectrum libraries have become larger. Analyses of large complex molecules (e.g. peptides) may consume a great deal of computer resources. Fortunately, the changes in computer and operating system technologies since the 1970s have been even more staggering than the changes in MS.

Section 3.2 defines the basic tasks of an MS data system. Section 3.3 describes the requirements imposed on the computers and operating systems that aspire to perform these tasks. Section 3.4 examines some of the specifics of how changes in computer technology have affected mass spectral data systems. Section 3.5 treats the subject of programmability. As the number of MS algorithms proliferates, the need for a data system to be customizable (i.e. programmable) has become ever more important: if users cannot define their own ways of collecting and analyzing data, unlimited computer power may be useless. Practical examples from actual data systems are presented, to show that the concerns of a programmer are often quite different from those of a chemist.

    3.2 Fundamental Tasks of a Data System

The tasks of an MS data system are often neatly divided into instrument control, acquisition of data to a storage medium, and analysis of the data. The division is, of course, not really so simple. The collection of data depends significantly on simultaneous instrument control, and the analysis of the collected data may be fed back into the instrument control. For example, in the process of tuning an instrument, the software may vary a variety of different parameters, each time collecting and assessing some data before trying a new set of conditions. In this case there is a feedback loop that involves control, acquisition, and analysis. The feedback must be very tightly orchestrated to be useful.

    3.2.1 Instrument Control

The task of instrument control has several aspects: routine operation, instrument protection, tuning, and diagnostic programs. During routine operation many voltages must be set or scanned, and as much instrument status as possible must be read from the instrument. This status information may be stored with the data. It may be used to keep temperatures stable within the instrument by running PID (proportional-integral-derivative) loops on heaters. Or, it may be used to protect the instrument. For example, a sudden rise in pressure may indicate a leak, and some voltages should be turned off. If mass peaks are saturated, perhaps the detector voltage should be decreased, or a warning message should be shown on the computer screen. The tuning and diagnostic programs, each in their own way, are microcosms of the entire MS data system. Those experienced in designing MS data systems have learned that it is advantageous to first write the diagnostic programs, basing them on very small and easily understood modules. These will, after all, be needed for the first evaluation of the instrument. It is then possible to base the ordinary operation of the instrument on these same modules. Doing so tends to provide the entire system with a relatively good structure. This bottom-up modular structure also makes it easy to add elementary operations (e.g. when adding new hardware), and higher-level operations can almost always be defined as combinations of the elementary processes.
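The PID heater loops mentioned above can be sketched as follows. The plant model and gains are invented for illustration; a real controller would also limit heater power and guard against integral windup.

```python
def pid_step(state, setpoint, reading, kp, ki, kd, dt):
    """One update of a textbook PID controller.
    `state` carries the running integral and the previous error."""
    integral, prev_err = state
    err = setpoint - reading
    integral += err * dt
    derivative = (err - prev_err) / dt
    output = kp * err + ki * integral + kd * derivative
    return (integral, err), output

# Hypothetical ion-source heater: first-order drift toward 25 C ambient,
# plus heating proportional to the applied power.
temp, state, dt = 25.0, (0.0, 0.0), 0.1
for _ in range(3000):                    # 300 s of simulated time
    state, power = pid_step(state, 200.0, temp, kp=2.0, ki=0.5, kd=0.1, dt=dt)
    temp += (-0.05 * (temp - 25.0) + 0.02 * power) * dt
# temp has settled at the 200 C setpoint
```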

    3.2.2 Data Collection

The task of data collection is fundamentally important. Today's computer operating systems are multitasking and therefore capable of running several processes at once. If the mass spectrometer is connected to a chromatograph or other time-dependent sample-introduction device, then the data collection must have priority over all other operations. A disaster can result if some data are missed. To guard against this, an MS data system may use more than one processor, dedicating at least one processor to data collection.
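The idea of insulating data collection from slower tasks can be sketched with a buffered producer/consumer pair. This illustration uses Python threads and an unbounded queue in place of a dedicated acquisition processor; the point is only that acquisition never waits on analysis.

```python
import queue
import threading

def acquire(n_scans, buf):
    """Acquisition task: push scans into the buffer and never block.
    A real system would be reading an ADC or ion counter here."""
    for i in range(n_scans):
        buf.put(("scan", i))
    buf.put(None)                      # end-of-run marker

def analyze(buf, results):
    """Analysis task: drain the buffer at its own (possibly slower) pace."""
    while (item := buf.get()) is not None:
        results.append(item[1])        # stand-in for peak finding, etc.

buf = queue.Queue()                    # unbounded: the producer cannot stall
results = []
worker = threading.Thread(target=analyze, args=(buf, results))
worker.start()
acquire(100, buf)                      # collect a 100-scan run
worker.join()                          # every scan was buffered; none lost
```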

    3.2.3 Data Analysis

Analysis of the collected data includes the following items:

• Conversion of raw (e.g. profile or Fourier-transform) data to mass peaks.

• Data display for the chemist.

• Enhancement of the data by background subtraction or other means.

• Use of the area under chromatogram peaks or other MS data to compare unknowns with standards and to achieve quantitative results.

• Library searching.

• Report generation.

A modern data analysis program includes other more advanced topics, which are covered elsewhere in this article; even the elementary operations listed above have many variations. Data systems must be flexible enough to allow the user to perform the operations in exactly the way and order required, hence the importance of programmability. The control, collection, and analysis are all achieved through a user interface. This element of the data system determines the ways in which the user is able to enter information and communicate with the system. Section 3.4 looks at how changes in operating systems have affected the user interface and hence the ease of using mass spectral data systems.
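The background-subtraction step listed above can be sketched with a simple rolling-minimum baseline estimate. The spectrum values are invented, and real systems use more careful baseline models; this only illustrates the shape of the operation.

```python
def subtract_background(intensities, window=5):
    """Estimate the baseline at each point as the local minimum within
    +/- `window` points, then subtract it from the signal."""
    n = len(intensities)
    baseline = [min(intensities[max(0, i - window): i + window + 1])
                for i in range(n)]
    return [y - b for y, b in zip(intensities, baseline)]

# Hypothetical profile data: two peaks riding on a background of ~10 counts.
spectrum = [10, 11, 10, 12, 80, 12, 10, 11, 55, 10, 10]
clean = subtract_background(spectrum)   # peaks survive, background drops to ~0
```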

  • 6 MASS SPECTROMETRY

It should be noted that, from the programmer's point of view, the design of an easy-to-use user interface is generally a much harder and more time-consuming part of the programmer's task than implementing all of the chemical algorithms. The user interface includes the display of data and instrument status, as well as input devices such as menus and buttons that allow the user to control the system.

The display must respond to real changes in instrument status in a timely manner. For example, suppose that in the process of tuning an instrument the user is manually increasing a voltage setting by clicking a button on the screen. If nothing happens to the status display for more than a second, the user is very likely to click on the button again to accelerate the change in the system. This is simply because faster computer response time has naturally led to greater user impatience. However, overclicking can result in overshooting an optimum setting, and this makes instrument adjustment almost impossible. Therefore, a crucial task of the data system is to reflect the real-time status of the instrument.

    3.3 Requirements for Operating Systems

As noted above, data collection must never fail. As the operating system used by a chemist is almost certainly a multitasking system, it is necessary to ensure that the highest possible priority is given to the data collection task. It must not be possible for other tasks to usurp the precious time required by the data collection procedures. This is the overriding concern in the selection of an operating system. For this reason Windows NT is a much more appropriate choice than Windows 95 for MS data systems. Several other operations also require high priority because they cannot be interrupted, such as those that involve delicate timing or real-time feedback.

If multiple processors are used, other requirements must be considered. For example, if an embedded processor in the instrument communicates with the data system over a serial or parallel line, it is important that the instrument be plug-and-play; that is, both sides should disconnect cleanly when the cable is disconnected and reconnect automatically when the cable is reconnected. If the embedded processor depends on the data system for control and the connection is broken, the embedded processor should go into a standby state for safety purposes.

Most instrument manufacturers have chosen to base their data systems on PCs running Microsoft operating systems. A brief survey of 22 instrument manufacturers found that 18 of them were using a version of Microsoft Windows. Others used OS/2, and operating systems from Hewlett-Packard, Sun, and Apple.

3.4 Impact of Continuing Advances in Computers on Mass Spectrometry Data Systems

The most obvious improvements in computers have been the dramatic increases in speed and in the size of computer memories and storage. The forefathers of today's data systems were developed on home-built computers using Intel chipsets or on systems produced by Data General, Digital Equipment, Commodore, or Apple (section 2). These systems typically had 16–64 kB of RAM and sometimes included a 5 or 10 MB disk. Since the 1970s the availability of memory and storage has increased by over three orders of magnitude. Execution speeds have also increased, albeit to a lesser extent. For example, library searches are now four to eight times faster.

Operations that require large arrays of data and massive amounts of arithmetic have benefited most from the improvements in hardware design. These improvements have also made it much easier to implement algorithms. Previously, developers had to resort to programming tricks to handle very large arrays of data. Activities such as library searches required extensive coding in order for their execution to be completed in a reasonable amount of time. Today even more advanced and thorough searches can be implemented with a few lines of C code. These advantages also apply to algorithms written by the user of the data system (if a programming language is available; see section 3.5).
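A dot-product library search of the kind alluded to above really does fit in a few lines. The three-entry library below is hypothetical and drastically simplified; production systems search hundreds of thousands of spectra with weighted, pre-indexed comparisons.

```python
import math

def library_search(unknown, library):
    """Rank library entries by cosine similarity to the unknown spectrum.
    Spectra are dicts mapping integer m/z to relative intensity."""
    def cosine(a, b):
        dot = sum(v * b.get(k, 0.0) for k, v in a.items())
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm
    return sorted(((cosine(unknown, s), name) for name, s in library.items()),
                  reverse=True)

# Tiny invented library of electron-impact spectra {m/z: intensity}.
library = {
    "toluene":  {91: 100, 92: 60, 65: 10},
    "benzene":  {78: 100, 77: 20, 52: 15},
    "pyridine": {79: 100, 52: 60, 51: 30},
}
unknown = {91: 95, 92: 55, 65: 12}     # measured spectrum of the analyte
hits = library_search(unknown, library)
best_score, best_name = hits[0]        # top hit: "toluene"
```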

Networks are beginning to have a major impact on data systems. Local networking provides a great advantage by giving the user a wide variety of high-capacity storage options. The internet allows easier transfer of data and results, but has found only limited use in instrument control. In both cases security issues are a major concern. Although most laboratory management systems provide security features, such as passwords, the proper set-up and administration of these controls is required. This may be beyond the resources of some laboratories and is clearly an added cost.

The current operating systems have had a significant impact on the standardization of user interfaces. Each of the first mass spectral data systems had its own way of entering commands and using the mouse. It was therefore a major challenge to instruct users on how to use a data system. In some cases the operator had to develop significant programming skills to use the system. In current user interfaces many operations, such as cut and paste, are standardized on the computer. As these are performed in the same way in most computer programs, everyone who has worked with a computer is well-versed in the art of using menus and mouse clicks to interact with a computer program. The fact that a large majority of data systems are based on Windows makes this even more true. Chemists now have a much easier time learning to use new data systems because they already have a good idea of how the user interface will work. This standardization has one drawback, in that many programs now look the same and it is becoming a challenge for programmers to make their systems unique and original.

    3.5 Programmability

As discussed above, many aspects of modern mass spectral data systems require that they be programmable (or customizable). Every system has a finite number of built-in operating modes and algorithms. The chemist, therefore, needs to have the ability to mix modes and tailor algorithms to suit experimental objectives. The programmer who writes the data system is not able to anticipate which aspects of an algorithm the user may wish to vary, so ultimately the user needs to be able to program functions into the system. This section describes the elements that a system must include to be truly programmable.

First, the user needs a language in which to write algorithms. The language needs to incorporate basic arithmetic and common math functions. It also needs to have program flow control elements such as loop and logic structures (‘if’, ‘while’, and ‘repeat’). The user needs to be able to use predefined variables such as ‘first mass’, ‘last mass’, and ‘detector voltage’. They also need to control MS operations with built-in commands such as ‘Do one scan’, ‘filament on’, and ‘filament off’. The language must have built-in feedback so that decisions can be based on the state of the instrument or the nature of the data. Functions such as ‘Source temperature’ or ‘Manifold pressure’ can serve this purpose. The most advanced systems include functions such as ‘Intensity of mass 278 in the last dataset’ or ‘Mass of the biggest peak in the last dataset’ that return facts about the data.

The language should be able to perform all control, collection, and analysis steps. It ought to be possible to run more than one process at once, so that the system can collect one set of data while analyzing another, and perhaps reporting on a third. For good laboratory practice, it is important to have functions to write any sort of information into a file. This will ensure that every dataset has enough information stored within it to show exactly how it was acquired. It also allows diagnostic programs to keep track of instrument performance over any period of time.

The feedback functions in the language can be used for a wide variety of algorithms. For example, in the arena of safety, the chemist can specify the actions to be taken if a temperature or pressure gets too high. Alternatively, the chemist could write a tuning loop that sets a voltage, collects a scan of data, and reads back information about a peak.

    The following sections (3.5.1–3.5.3) present a number of illustrative examples. The procedures are written in a pseudocode quite similar to an actual programming language. The first example shows the optimization of data collection by timing acquisition. The second is part of an autotune algorithm. The third is a higher-level procedure for automatic quantitation, meant to run continuously in the background.

    3.5.1 Example 1: Timed Acquisition

    One can increase the amount of analytically relevant information by collecting only data that are appropriate for the retention time. The following routine is for an MS/MS instrument that performs selected reaction monitoring of several different reactions: 219 → 69 for the first two minutes, 512 → 69 for the next two minutes, and 131 → 69 thereafter:

    start collection
        srm(219, 69)
        while retention time < 2 : scan : end
        srm(512, 69)
        while retention time < 4 : scan : end
        srm(131, 69)
        while retention time < 10 : scan : end
    end collection

    The functions referred to have the following meanings:

    • srm(m1, m2): set the instrument to monitor the reaction m1 → m2.

    • scan: collect one scan of data.
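The pseudocode above translates naturally into a conventional language. The sketch below is illustrative only: it assumes a hypothetical instrument object exposing srm() and scan(), and a now() callback reporting elapsed minutes; none of these names come from a real data system.

```python
def timed_srm_acquisition(instrument, schedule, now):
    """Step through (end_minute, precursor, product) segments,
    scanning until each segment's end time -- the pseudocode's logic."""
    for end_minute, precursor, product in schedule:
        instrument.srm(precursor, product)   # monitor precursor -> product
        while now() < end_minute:
            instrument.scan()                # collect one scan of data

class SimulatedMS:
    """Stand-in instrument whose clock advances 0.5 min per scan."""
    def __init__(self):
        self.elapsed = 0.0
        self.events = []
    def srm(self, m1, m2):
        self.events.append(("srm", m1, m2))
    def scan(self):
        self.elapsed += 0.5
        self.events.append(("scan", self.elapsed))

ms = SimulatedMS()
timed_srm_acquisition(ms, [(2, 219, 69), (4, 512, 69), (10, 131, 69)],
                      now=lambda: ms.elapsed)
```

With the simulated 0.5-min scan time, the run performs 4, 4, and 12 scans in the three segments, switching the monitored reaction between them.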

    3.5.2 Example 2: Tuning

    This is an example of a tuning algorithm called 'optimize lens'; its single argument specifies which lens to tune. While tuning, the system collects raw data. For these data, 'height' refers to the height of the biggest peak in the dataset. As before, 'scan' means collect one scan of data. The items 'biggest height' and 'best lens' are temporary variables. The goal of the procedure is to find an optimum value of a lens.

    optimize lens(n)
        biggest height = 0
        for lens(n) = -100 to 0 in steps of 1
            scan
            if height >= biggest height
                biggest height = height
                best lens = lens(n)
            end
        end
        lens(n) = best lens


    When this is done, lens n will have been optimized. Such a routine may be built into a higher-level routine:

    optimize all lenses
        optimize lens(1)
        optimize lens(2)
        optimize lens(3)
        etc . . .

    This process may be abstracted to as high a level as required.
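The 'optimize lens' search can be sketched as an ordinary function. The set_lens/read_height callbacks below are illustrative stand-ins for "set the lens voltage" and "scan, then report the biggest peak height"; they are not a real instrument API.

```python
def optimize_lens(set_lens, read_height, lo=-100, hi=0, step=1):
    """Sweep one lens voltage from lo to hi and keep the setting
    that gave the tallest peak, as in the pseudocode above."""
    biggest_height = float("-inf")
    best_voltage = lo
    for voltage in range(lo, hi + 1, step):
        set_lens(voltage)
        height = read_height()        # one scan; height of biggest peak
        if height >= biggest_height:
            biggest_height = height
            best_voltage = voltage
    set_lens(best_voltage)            # leave the lens at its optimum
    return best_voltage

# A simulated lens whose peak-height response is best at -40 V:
state = {"v": 0}
set_lens = lambda v: state.__setitem__("v", v)
read_height = lambda: 1000 - (state["v"] + 40) ** 2
```

For this simulated response, optimize_lens(set_lens, read_height) returns -40 and leaves the lens set there; an 'optimize all lenses' routine simply calls it once per lens.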

    3.5.3 Example 3: Automatic Quantitation

    If the data system is designed properly, rules can be defined to run continuously in the background. Here is an example of a high-level algorithm that provides automatic updating of a quantitation list when the chemist changes the calculations for one of the compounds in the list. For example, suppose the user has collected several data files, including analytes and internal and external standards. They have quantitated a set of compounds in these data files, using mass chromatograms to obtain an area for each unknown or standard. The areas and concentrations of the standards are used to create a response curve. The areas of unknowns are used, in conjunction with the response curve, to calculate the unknown concentrations. One now has a list of areas and quantities for each compound, along with the information on how they were computed. If the user were to change the area of one of the standard compounds by changing the parameters that went into its calculation, we would like to see the amounts of all related peaks change correspondingly. Here is an example of a procedure that performs this operation.

    repeat forever
        if some compound changed
            for r = 1 to number of response points
                c = external standard(r)
                c1 = internal standard(r)
                response x(r) = compound area(c) / compound area(c1)
                response y(r) = compound amount(c) / compound amount(c1)
            end
            for c = 1 to number of compounds
                compute amount(c)
            end
        end
        sleep one second
    end

    The functions referred to have the following meanings:

    • some compound changed: set to 'True' if any one of the compounds in the list changed area or amount, which means that 'compound area' or 'compound amount' changed for that compound.

    • number of response points: the number of points in the response list.

    • external standard(r): the compound number of the external standard at position r in the response list.

    • internal standard(r): the compound number of the internal standard at position r in the response list.

    • compound area(c): the area under the chromatogram for compound c.

    • compound amount(c): the calculated or given amount of compound c.

    • number of compounds: the number of compounds in the list.

    • compute amount(c): computes the amount of compound c from its area and the response list.

    • sleep one second: prevents the procedure from hogging the system – there is no need to check more than once a second whether the user has changed the data.

    This procedure checks whether some compound has changed area or amount (changed by the user). If so, it recalculates the response curve by filling in each point on the response curve from the areas and the amounts of the appropriate compounds. Then, for each compound, it computes the amount of that compound ('compute amount' uses the response curve). If the display of data is responsive to changes in the data, the user will see all areas and amounts change as soon as one value is changed. In section 3.2 an example was given of the necessity of a close link between data and display; this procedure is another example.

    To keep the code simple, this example assumes that there is only one response list involved. However, it is easy to extend the code to a system that includes several response lists.
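The recalculation step can be sketched in a conventional language. This is a simplification under stated assumptions: compounds are keyed by name, the response curve is a list of (area-ratio, amount-ratio) points, and unknowns are estimated with a straight line through the response points rather than whatever calibration model a real data system would fit.

```python
def rebuild_response(pairs, area, amount):
    """One (x, y) response point per (external, internal) standard
    pair, mirroring the inner loop of the pseudocode."""
    return [(area[ext] / area[ref], amount[ext] / amount[ref])
            for ext, ref in pairs]

def compute_amount(unknown_area, internal_area, response):
    """Estimate an unknown's amount ratio from its area ratio using a
    line through the first and last response points (illustrative)."""
    x = unknown_area / internal_area
    (x0, y0), (x1, y1) = response[0], response[-1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Invented example data: two external standards against one internal.
area = {"std_low": 10.0, "std_high": 40.0, "istd": 10.0}
amount = {"std_low": 1.0, "std_high": 4.0, "istd": 1.0}
response = rebuild_response([("std_low", "istd"), ("std_high", "istd")],
                            area, amount)
```

If the user edits one standard's area, rerunning rebuild_response and then compute_amount for every unknown reproduces the cascade of updates the background rule provides.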

    These examples give an indication of how programmable a data system can be. The challenge for the designers of data systems is to balance flexibility with simplicity, for the sake of the chemist who is content with the basic operation of the system. MS is not a trivial technique, and operating a mass spectral data system is likely to remain a challenging task as the functionality of MS data systems continues to expand. Hopefully, the user interface, which is what makes it possible to use all this functionality, will keep pace.

    4 BIOLOGICAL APPLICATIONS

    4.1 Protein Sequence Determination

    MS has long had as a goal the ability to determine the sequence within polymeric, biologically important molecules, such as DNA and proteins. There have been notable advances in this area in the period 1990–1999.(19–24) However, the goal of developing a simple yet general method for rapidly sequencing these molecules by MS has remained elusive.

    Fortunately, alternative approaches have been introduced that take advantage of the large amount of DNA and RNA sequence information that has been generated by genome sequencing projects and which is currently stored in databases. Using this nucleic acid sequence information, it is possible to determine whether the results of a mass spectrometric experiment correspond to a sequence in a database. If such a correspondence exists, it is no longer necessary to sequence the protein (or corresponding RNA) by MS or other means – the answer can simply be read from the database. If the database information is incomplete, it can serve as a starting point for other studies, greatly reducing the experimental work required for the determination of the full sequence.

    4.1.1 Peptide Cleavage and Extraction

    All protein sequence identification experiments begin with the creation of a set of smaller oligopeptide molecules from the intact protein. The patterns generated from these oligopeptides are then used to search nucleotide sequence databases. These oligopeptides (frequently referred to simply as 'peptides') are produced by the action of a reagent that cleaves the protein's peptide bond backbone at sequence-specific sites, such as peptide bonds that are adjacent to a limited set of amino acids. Peptide-digesting enzymes, such as trypsin or endopeptidase Lys-C, are commonly used for this purpose. Reactive amino acids, particularly cysteine residues, are protected with chemical reagents that prevent their modification during the process.
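The cleavage simulation at the heart of these experiments is straightforward to express in code. The sketch below applies the usual textbook trypsin rule (cleave after Lys or Arg, but not when the next residue is Pro); real enzymes have further exceptions, so this is illustrative rather than a faithful trypsin model.

```python
def tryptic_digest(sequence):
    """Split a protein sequence at tryptic cleavage sites:
    after K or R, unless the following residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        at_end = (i + 1 == len(sequence))
        if residue in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])    # C-terminal peptide
    return peptides
```

For example, tryptic_digest("AKRPGKW") yields ['AK', 'RPGK', 'W']: the R–P bond is skipped, so the arginine stays inside its peptide.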

    4.1.2 Dataset Generation – Mass Spectrometry, Matrix-assisted Laser Desorption/Ionization and Electrospray Ionization

    Once the oligopeptides have been generated, the masses of all of the peptides generated from a protein can be measured at once, using matrix-assisted laser desorption/ionization (MALDI) or electrospray ionization (ESI) ion sources, mounted on a variety of different types of mass analyzers. Analysis using a MALDI ion source is currently the most common method, but the use of sophisticated deconvolution will make ESI a viable option. Proteins produce patterns containing 10–1000 isotopic peak clusters, depending on the sequence of a particular protein. Each peak cluster represents a particular peptide sequence.

    Alternatively, the ions corresponding to an individual peptide from a protein digestion can be isolated, either using chromatography or MS/MS techniques. The resulting ions can then be fragmented in a gas-phase collision cell, producing a pattern of masses characteristic of the sequence of the original peptide (MS/MS or MS/MS/MS, i.e. MSn). This pattern can be used to search databases, using the accumulated knowledge of the preferred gas-phase peptide bond cleavage rate constants. The resulting pattern is strongly affected by the time elapsed between collision and measurement of the product ion distribution, so different rules must be applied for different types of MS/MS analyzers.

    4.2 Database Search Strategies

    The data sets generated by mass spectrometric experiments can be compared to the nucleotide sequence information present in databases in several ways. All of these methods share some common features. In order to compare sequences, the chemical reactions involved in producing the cleaved peptides are simulated, producing a theoretical set of peptides for each known protein sequence in the database. This simulation can either be done during the search process, or a specialized database consisting of the peptides resulting from a particular cleavage and protection chemistry can be prepared in advance. The simulations are then compared to the experimental data, either using specialized correlation functions or using multiple-step discrete pattern matching. This comparison is done by assuming that sequences that correspond to the experimental data set will contain a set of peptides with masses that agree with the experimental data, within some experimental error.
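The comparison step described here — does each measured mass agree with some simulated peptide mass within the experimental error — can be sketched as a simple tolerance match. The candidate data below are invented for illustration.

```python
def count_matches(measured, theoretical, tol=0.5):
    """Number of measured masses within +/- tol (Da) of any
    theoretical peptide mass for one candidate sequence."""
    return sum(1 for m in measured
               if any(abs(m - t) <= tol for t in theoretical))

def rank_candidates(measured, candidates, tol=0.5):
    """Rank candidate sequences by match count, best first --
    the naive scoring the text later calls 'simplistic'."""
    return sorted(((name, count_matches(measured, masses, tol))
                   for name, masses in candidates.items()),
                  key=lambda pair: -pair[1])

measured = [500.3, 720.4, 910.2]
candidates = {"protein_A": [500.2, 910.5, 1200.8],
              "protein_B": [720.5, 333.1]}
```

rank_candidates(measured, candidates) places protein_A first with two matched masses; the later discussion of scoring explains why a bare match count is rarely enough.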

    4.3 Nucleotide Databases

    Databases of complete gene sequences can be searched as though they were protein sequence databases. The existence of known start codons and intron/exon assignments allows the use of either MS or MSn patterns. Nucleotide databases that contain incomplete sequence information, such as the database of expressed sequence tags (dbEST),(25) present special challenges. In this type of database, there are six possible peptide sequences for each nucleotide sequence, and each must be searched independently. The short length of the sequences makes the use of MS-only data impractical; these databases require the use of MSn fragmentation patterns.
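The six conceptual translations come from reading the sequence in three frames on each strand. A minimal sketch follows; the codon table is deliberately tiny (unknown codons translate to 'X'), whereas a real implementation would carry all 64 codons of the standard genetic code.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Demonstration-only codon table; real code needs the full genetic code.
CODON = {"ATG": "M", "AAA": "K", "CGT": "R", "TTT": "F", "TAA": "*"}

def six_frame_peptides(dna):
    """Return the six conceptual translations of a DNA sequence:
    three forward frames and three on the reverse complement, as
    needed when searching unannotated databases such as dbEST."""
    frames = []
    for strand in (dna, dna.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames.append("".join(CODON.get(c, "X") for c in codons))
    return frames
```

Each of the six frames must be digested and matched independently, which is why EST searching is so much more expensive than searching an annotated protein database.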

    4.3.1 Annotated Protein Databases

    Dedicated protein sequence databases that store annotated oligopeptide translations of nucleic acid sequences are the best databases for any MS-related search strategy. The annotations in the database indicate what is known about post-translational modification of the protein, allowing the chemical cleavage simulation to be performed more accurately than is possible using nucleotide information alone. The number of protein sequences in this type of database is still very limited – annotation is time-consuming and only possible when detailed experimental results are available for a particular sequence.

    4.3.2 Confirmation and Scoring Results

    Comparing a set of experimental masses to a sequence database usually results in the identification of a number of candidate sequences that match the experimental data to some extent. The task of distinguishing random matches from the 'correct' match has been approached in a number of ways. The simplest scoring system involves counting the number of masses that agree within a given error and reporting the sequence with the most matches as being the best candidate sequence. This approach is very simplistic and frequently deceptive. More sophisticated scoring schemes appraise pattern matches on the following criteria:

    • sequence coverage – the fraction of the candidate protein represented by the experimental masses;

    • sequence distribution – the pattern of matched peptides in the candidate protein;

    • mass deviation – the pattern of experimental mass deviations from the simulation values;

    • statistical significance – the likelihood that the match could have occurred at random.

    Research into the appropriate scoring scheme for MS and MSn match scoring is still ongoing. The most successful scoring systems will be the basis for the next generation of fully automated protein identification instruments.
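Of the criteria above, sequence coverage is the easiest to make concrete: the fraction of the candidate protein's residues that fall under at least one matched peptide. A minimal sketch, locating peptides by substring search:

```python
def sequence_coverage(protein, matched_peptides):
    """Fraction of residues in `protein` covered by at least
    one matched peptide (all occurrences are counted)."""
    covered = [False] * len(protein)
    for peptide in matched_peptides:
        pos = protein.find(peptide)
        while pos != -1:
            for i in range(pos, pos + len(peptide)):
                covered[i] = True
            pos = protein.find(peptide, pos + 1)
    return sum(covered) / len(protein)
```

Two matched tripeptides in a ten-residue protein give a coverage of 0.6; a scoring scheme would combine this with mass deviation and significance estimates.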

    Currently, none of the protein identification algorithms makes use of AI or algorithm training methods. The ProFound algorithm is currently the closest to using AI – it uses a Bayesian statistical approach to evaluating data sets, allowing for the unbiased evaluation of search results and for the detection of multiple sequences in a single MS data set.(26)

    4.4 Protein Modification Analysis

    MS may have limitations in the determination of protein sequences de novo, but it is very well suited to the detection of chemical modifications of a known sequence. The detection of these modifications is very dependent on good software, as there is too much information for manual data reduction. The general strategy is very similar to that used to identify proteins, a process that grew out of the standard practice for finding modifications. The general strategy is as follows: determination of the intact protein molecular mass; cleavage to peptides; generation of mass spectra; and automated, detailed comparison of the MS data set with a known sequence.

    4.4.1 Peptide Cleavage and Extraction

    The cleavage and protection chemistry available for detection of modifications is much broader than that used in protein identification experiments. Any proteolytic enzyme, chemical cleavage or protection method can be used, depending on the type of modification sought. Popular endoproteinase enzymes are trypsin, endoproteinases Lys-C and Asp-N, and Staph. V8 proteinase.(27) Exopeptidases, such as carboxypeptidases A, B, and P, can also be useful for generating C-terminal sequencing ladders for smaller peptides.(28) Unlike the protein identification procedure, it is very useful to follow a time course of protein cleavage, as the dynamics of proteolysis can provide valuable clues to the identity and location of modifications. Chemical cleavage reagents, such as cyanogen bromide, iodosobenzoic acid and hydroxylamine, can be used in place of enzymes. These reagents are less popular than enzymes, because of their propensity for producing complicating modifications in the sequence through side-reactions.

    4.4.2 Generation of Mass Spectrometry Datasets

    Mass spectrometry datasets are collected in the same way as for protein identification experiments. Typically, a number of experiments are run, using different cleavage reagents with different and complementary specificity. For example, both a trypsin and an endoproteinase Asp-N digest would be run, taking several time points during the reaction to reconstruct its time course. All of the data collected are stored for later analysis.


    Datasets for MSn can be prepared that greatly assist analysis in the detection of common modifications, such as phosphorylation or disulfide cross-linking. These modifications produce characteristic fragmentation signals following gas-phase collisions. The most popular method for collecting this type of specialized data is directly coupling the output from high-performance liquid chromatography (HPLC) to an MS/MS device (such as a triple quadrupole or an ion trap analyzer) and flagging spectra that contain these characteristic signals.

    4.4.3 Comparison with Sequence

    Once a dataset has been assembled, it must be compared with the results that would be expected from the predicted amino acid sequence. For a simple enzymatic cleavage experiment on a protein that has 30 theoretical cleavage sites (N) and no cystines, there are approximately 450 possible product peptides (on the order of N(N + 1)/2 when all partial-digestion products are counted). The task of examining a dataset for each of the possible products and locating the peaks that do not agree is clearly too time-consuming and error-prone to be performed manually.

    The majority of data is analyzed using automated systems to assist the investigator – no system that performs a complete and reliable analysis is currently available. Modern analysis is performed by first determining the mass of a peak in the MS dataset and then searching a sequence for a peptide with a mass that is within a user-defined error of the experimental value. The dataset can be a single mass spectrum containing all of the cleaved peptides, or an HPLC/MS dataset that contains thousands of individual spectra, each of which will contain zero or more of the peptides, depending on the chromatographic conditions.

    The best analysis systems use a multifactorial, fuzzy-logic-based approach to analyzing the data. The entire dataset is interrogated and individual matches are rated with respect to all of the other assignments. Peptides with the same mass (within the allowed error) are assigned based on the kinetics of the cleavage reaction, as inferred by the fuzzy logic rules. Peaks that can be assigned by mass, but which are unlikely based on the entire data set, are flagged for further examination and confirmation. These flagged peaks, as well as those that could not be assigned, are then iterated through a selection of known modifications and the complete sequence assignment process repeated. The fuzzy logic assignments depend on the entire data set, so a change of value in the simulated experiment requires a complete reexamination of the assignments.

    Once this iterative process is finished, the results can be projected back onto the theoretical sequence, with each assignment flagged and color coded so that interesting portions of the sequence are displayed. This process is particularly effective if the three-dimensional structure of the protein is known, where the peptides can be located in a structure diagram shown in a stereoscopic display.

    4.5 Use with Differential Displays

    Differential displays are a particularly useful tool in current cell biology. They consist of some type of high-resolution protein separation system, such as two-dimensional gel electrophoresis, and a signal detection process such as affinity or silver staining. A cell challenged in various ways will produce displays that differ as the protein complement being expressed in the cell changes. By overlaying displays, spots that change are apparent. The challenge is then to determine what protein has appeared (or disappeared or changed positions).

    The techniques described in sections 4.1–4.3 can be applied to these displays. By excising interesting areas of the separation bed and extracting the protein components in various ways, the protein sequence can be rapidly identified. A new generation of automated differential display devices utilizing MS as a protein identification system is currently being designed. These instruments will replace the current practice of manual sample preparation and mass analysis, although the protein identification algorithms will remain the same. The fully automated instruments will probably perform best on data derived from species with known genomes.

    4.6 Alternate Splicing

    When a eukaryotic organism transcribes its DNA into RNA in the nucleus (the primary transcript), the RNA is usually edited before it is exported out of the nucleus as messenger RNA for translation into a peptide chain. This editing process, generally referred to as RNA splicing, involves the removal of portions of the RNA that do not code for peptide sequence (introns), leaving the portions that do code for sequence and transcription regulatory functions (exons). In multicellular organisms with differentiated cell and tissue types – which includes all animals and plants – different exons can be spliced into the messenger RNA in different cell types, resulting in different protein sequences that originate from the same gene. These different proteins that originate from the same gene are called 'alternate splices'. The regions of genomic DNA that will be deleted or included can be predicted with some accuracy for the most likely messenger RNA product; however, the alternate forms cannot be predicted in advance and must be discovered experimentally.


    Protein identification-type experiments are ideally suited to the rapid identification of alternately spliced proteins. In order to distinguish alternate splicing from proteolytic processing, the existing generation of protein recognition algorithms will need to include a method for searching and scoring multiple gaps, using the genomic sequence as a starting point. By using predicted exon/intron divisions, it should be possible to search the possible DNA-to-RNA translation sequences to determine whether an alternate splice form is present in a particular differential display. Such a search is beyond the capabilities of the current generation of software: they all require an accurate RNA translation. However, with the introduction of AI-type training capabilities, it should be possible to apply the most sophisticated of the current algorithms to this problem.

    5 MASS SPECTROMETRY APPLICATIONS OF PRINCIPAL COMPONENT AND FACTOR ANALYSES

    5.1 Introduction

    PCA calculates an orthogonal basis (i.e. coordinate system) for sets of mass spectra, in which each axis maximizes the variation of the spectral dataset. Each axis is represented as a vector that relates the linear dependence of the mass spectral features (i.e. m/z variables). Typically, the new coordinate system has a reduced dimensionality. The PCA procedure allows the scientist to compress the data, remove noise, and discover the underlying or latent linear structure of the data.

    FA rotates the principal components away from directions that maximize variance towards new, chemically relevant directions; it allows scientists to resolve underlying pure components in mixtures, build classification models, and determine mass spectral features that relate to specific properties such as concentration or class.

    5.2 Selected History

    When computers were interfaced with mass spectrometers, numerical calculations could be used to simplify the data. A brief and somewhat selective history follows. The PCA technique was developed for the study of psychological measurements, which are inherently complicated by many difficult-to-control factors.(29) These factors can be attributed to the different environmental, behavioral, or genetic influences on the human subjects who are evaluated. Some method was needed that would determine which factors were important and which factors were correlated.

    The earliest applications of PCA in analytical chemistry determined the number of underlying components in mixtures. Specifically, for optical measurements, a mixture could be effectively modeled by a linear combination of the spectra of the pure components. The number of pure components of the mixture would correspond to the rank of the data matrix. The rank of a matrix of optical spectra of mixtures was computed using Gaussian elimination.(30,31) The application of FA to solving problems in chemical analysis was pioneered by Malinowski et al.(32,33)

    The applications of PCA and FA to gas chromatography (GC) and MS first occurred in the 1970s. Initially, FA was employed to study the relationships between chemical structure and GC retention indices.(34–37) Then PCA was demonstrated as a tool for deconvolving overlapping GC peaks.(38) Next, FA was applied to 22 isomers of alkyl benzenes to assist the interpretation of fragmentation pathways and as a method for compressing the mass spectra to lower dimensionality.(39,40) The FA method was also used for classifying mass spectra.(41)

    The coupling of multichannel detection, specifically MS to GC, allowed PCA and FA to resolve overlapping components of GC/MS peaks.(42,43) The target transform FA method was automated for GC/MS analysis.(44)

    FA was initially applied to solving problems of overlapping peaks in GC/MS. Soon it was realized that FA was a useful tool for the analysis of complex mixtures such as biological (bacteria, proteins, and hair) and geological (coal, atmospheric particles, and kerogen) samples. These complex samples were all amenable to pyrolysis mass spectrometry (PyMS).(45) Discriminant analysis and FA were applied to various biological samples.(46) An unsupervised graphical rotation method was developed and applied to geological samples.(47) Canonical variates analysis (CVA)(48) was used to take advantage of measurement errors furnished by replicate spectra; it was combined with rotation for mixtures of glycogen, dextran, and bovine serum albumin,(49) and has become one of the methods of choice for the analysis of MS fingerprints from bacteria.(50) The FA method was also demonstrated as an effective tool for the analysis of smoke particles by PyMS.(49) A related method that exploits PCA for classification is soft independent modeling of class analogies (SIMCA).(51)

    Other techniques that have benefited from FA and PCA are laser ionization mass spectrometry (LI/MS),(52) fast atom bombardment mass spectrometry (FAB/MS),(53) electrospray MS,(54) and secondary ion mass spectrometry (SIMS).(55) In the SIMS work, cluster analysis was used to help align high-resolution mass measurements into optimized columns of the data matrix, which was evaluated using PCA.


    Table 1 The number of hydrocarbon spectra in the data set with respect to class and carbon number

    Hydrocarbon        Carbon number                     Total
    class              4     5     6     7     10
    ---------------------------------------------------------
    Diene              16    40    52    56    33     197
    Alkene             12    17    60    61    37     187
    Alkane             8     14    28    31    62     143
    Total              36    71    140   148   132    527

    5.3 Introductory Example

    A brief demonstration of PCA and FA is presented with accompanying graphs. A data set of mass spectra was obtained from the Wiley Registry of Mass Spectra, 5th edition,(245) comprising spectra of hydrocarbons that were alkanes, alkenes, or dienes. This data matrix is exemplary because the MS fragmentation patterns are easy to predict. These data were part of a larger project that built classification models for identifying spectra of plastic recycling products.(56) The data matrix was composed of 527 spectra and 95 columns that correspond to m/z values. The m/z values ranged from 50 to 144. Typically, if none of the spectra has a mass peak at a specified m/z, that column is excluded from the data matrix. Table 1 gives the design of the hydrocarbon data set.

    The principal components were calculated by singular value decomposition (SVD)(57) in a Win32 program written in C++. The analysis of these data required less than 5 s on a 300-MHz PC with 128 MB of random access memory, operating under Windows 98 in console mode.

    The spectra were preprocessed by normalizing to unit vector length and centering the spectra about their mean spectrum before the PCA. Figure 1 gives the eigenvalues with respect to the component number. The eigenvalues measure the variance spanned by each eigenvector. For intricate data sets, the eigenvalues typically approach zero asymptotically. The relative variance of each eigenvalue is calculated by dividing the eigenvalue by the total variance of the data matrix. The total variance is easily obtained as the sum of the eigenvalues. From this calculation, the first two principal components account for approximately half the variance in this data set.
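The preprocessing and decomposition just described can be reproduced in a few lines of linear algebra. This sketch uses NumPy's SVD in place of the original C++ program; the small matrix stands in for real spectra and is invented for illustration.

```python
import numpy as np

def pca(D):
    """Normalize each spectrum (row) to unit length, center about
    the mean spectrum, and decompose by SVD. Eigenvalues are the
    squared singular values; relative variance divides by their sum."""
    D = D / np.linalg.norm(D, axis=1, keepdims=True)  # unit-length rows
    D = D - D.mean(axis=0)                            # mean-centering
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    eigenvalues = s ** 2
    scores = U * s            # observation scores (as in Figure 2)
    loadings = Vt             # variable loadings (as in Figure 3)
    return scores, loadings, eigenvalues / eigenvalues.sum()

spectra = np.array([[10.0, 1.0, 0.0],
                    [9.0, 2.0, 1.0],
                    [1.0, 8.0, 3.0],
                    [0.0, 9.0, 4.0]])
scores, loadings, relative_variance = pca(spectra)
```

The relative variances sum to one and decrease with component number, which is exactly the eigenvalue curve the text describes.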

    Examination of the mass spectral scores on the first two components in Figure 2 shows that the spectra tend to cluster by class (i.e. degree of unsaturation). The first component has the largest range of values and is responsible for separating the spectra in order of diene, alkene, and alkane. This component can be investigated further using the variable loadings in Figure 3. This graph shows the principal component plotted with respect to m/z, so that key spectral features may be investigated.

    [Figure 1: Eigenvalues plotted as a function of the number of components for a set of 527 mass spectra with 95 variables.]

    [Figure 2: Observation scores of hydrocarbon mass spectra on the first two principal components (47% of the cumulative variance): a, alkanes; d, dienes; e, alkenes.]

    The principal components point in mathematically, but not necessarily chemically, relevant directions. Target transform FA was used to rotate 13 principal components that spanned 95% of the variance into directions that correlate with the specific structural classes of the spectra. Figures 4–6 give the rotated factors for the diene, alkene, and alkane classes. Notice that the periodicity of the fragmentation pattern is precisely as one would expect for these sets of data. The alkenes follow a pattern of carbon number times 14; the dienes follow the same pattern shifted two mass units lower, and the alkanes two mass units higher. The shifts account for the change in mass of the molecule by the loss of two hydrogen atoms for each degree of unsaturation.
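The 14n, 14n − 2, and 14n + 2 periodicity follows directly from nominal atomic masses (C = 12, H = 1): each degree of unsaturation removes two hydrogens. A quick arithmetic check:

```python
def nominal_mass(n_carbons, degrees_of_unsaturation):
    """Nominal mass of an acyclic CnH(2n+2-2d) hydrocarbon,
    using integer masses C = 12 and H = 1."""
    hydrogens = 2 * n_carbons + 2 - 2 * degrees_of_unsaturation
    return 12 * n_carbons + hydrogens

# The carbon numbers in Table 1 reproduce the 14n +/- 2 pattern:
for n in (4, 5, 6, 7, 10):
    assert nominal_mass(n, 0) == 14 * n + 2   # alkane series
    assert nominal_mass(n, 1) == 14 * n       # alkene series
    assert nominal_mass(n, 2) == 14 * n - 2   # diene series
```

The same spacing appears between the labeled fragment peaks in the rotated factors, e.g. m/z 96, 98, and 100 for the C7 diene, alkene, and alkane molecular ions.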


    [Figure 3: Variable loadings for the first principal component of the mass spectra dataset, 21% of the cumulative variance; prominent loadings at m/z 53, 57, 68, 71, and 85.]

    [Figure 4: The target-transformed factor for dienes obtained from a set of 13 principal components that spanned 95% of the cumulative variance; labeled peaks at m/z 53, 57, 67, 69, 81, 95, 109, 123, and 138.]

    5.4 Theoretical Basis

    5.4.1 Principal Component Analysis

Typically, data are arranged into a matrix format so that each row corresponds to a mass spectrum and each column to a measurement at a specific m/z value. This matrix is designated as D. The PCA method is mathematically based on eigenvectors or eigenanalysis.

The method decomposes a set of data into two sets of matrices. The matrices are special in that the columns point in directions of major sources of variation of the data matrix. These vectors are eigenvectors. (Eigen is the German word for characteristic.) Because these vectors already point in a direction inherent to the data matrix, they will not change direction when multiplied by the data matrix. This property is referred to as the eigenvector

[Plot: relative intensity vs m/z (60–140); labeled peaks at m/z 55, 57, 67, 69, 71, 81, 83, 98, 140.]

Figure 5 The target-transformed factor for alkenes obtained from a set of 13 principal components that spanned 95% of the cumulative variance.

[Plot: relative intensity vs m/z (60–140); labeled peaks at m/z 55, 57, 67, 69, 71, 81, 85, 113, 142.]

Figure 6 The target-transformed factor for alkanes obtained from a set of 13 principal components that spanned 95% of the cumulative variance.

relationship and is defined by Equations (1) and (2):

D^T D v_i = λ_i v_i   (1)

D D^T u_i = λ_i u_i   (2)

where D^T D is a square symmetric matrix that characterizes the covariance of the columns of the data set D, v_i is eigenvector i, which is in the row-space of D, and λ_i is eigenvalue i. In Equation (2), D D^T is a square symmetric matrix that characterizes the covariance of the rows of the data set, and u_i is in the column-space of D.
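As a brief illustration (not part of the original article), the eigenvector relationship of Equation (1) can be verified numerically; the small random data matrix below is hypothetical.

```python
import numpy as np

# Hypothetical data matrix: 6 spectra (rows) x 4 m/z channels (columns).
rng = np.random.default_rng(0)
D = rng.random((6, 4))

# Eigenanalysis of the square symmetric matrix D^T D gives the
# row-space eigenvectors v_i and eigenvalues lambda_i.
eigvals, V = np.linalg.eigh(D.T @ D)

# Equation (1): multiplying an eigenvector by D^T D only scales it by
# its eigenvalue; the direction is unchanged.
for lam, v in zip(eigvals, V.T):
    assert np.allclose(D.T @ D @ v, lam * v)
```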

Besides maximizing the variance, the sets of eigenvectors form orthogonal bases. This property can be expressed by Equations (3) and (4),

V^T V = I   (3)

U^T U = I   (4)

for which V is a matrix comprising row-space eigenvectors (v_i) and U is a matrix comprising column-space eigenvectors (u_i). The identity matrix I comprises values of unity along the diagonal and values of zero for all other matrix elements. The relationship given in Equations (3) and (4) is important because it shows that the transpose of an orthogonal matrix is equal to its inverse.

For large sets of data, computing the covariance matrix is time-consuming. A method that is precise and fast for computing both sets of eigenvectors is SVD:(58)

D = U S V^T   (5)

From Equation (5), D can be decomposed into the two matrices of eigenvectors and a diagonal matrix S of singular values (Equation 6):

λ_i = s_i^2   (6)

The singular values s_i are equal to the square roots of the eigenvalues, which leads to another important property that is given by Equation (7):

D^n = U S^n V^T   (7)

This relationship is important because any power of D can be calculated by decomposing the matrix, raising the diagonal matrix of singular values to the nth power, and reconstructing the matrix. A useful power is negative unity, because D^−1 can be used for calculating calibration models. Furthermore, pseudoinverses can be calculated from singular or ill-conditioned data matrices by reconstructing using only the components (i.e. vectors) that correspond to singular values above a threshold.
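The SVD relationships of Equations (5)–(7) can be sketched with NumPy as follows; the matrix dimensions are hypothetical and the example is illustrative rather than part of the original text.

```python
import numpy as np

# Hypothetical 5 x 3 data matrix.
rng = np.random.default_rng(1)
D = rng.random((5, 3))

# Equation (5): D = U S V^T (numpy returns the singular values as a vector).
U, s, Vt = np.linalg.svd(D, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, D)

# Equation (6): the eigenvalues are the squared singular values.
eigvals = s ** 2

# Equation (7) with n = -1: the pseudoinverse, formed by raising the
# singular values to the -1 power. For singular or ill-conditioned
# matrices, components below a threshold would be dropped first.
D_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(D_pinv, np.linalg.pinv(D))
```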

The other important element of PCA is the concept of a principal component. Because the row-space and column-space eigenvectors are orthogonal and hence independent, the number of eigenvectors will always be less than or equal to the dimensionality (i.e. the minimum of the number of rows or columns) of D. The number of nonzero eigenvalues or singular values gives the mathematical rank r of D. The rank gives the number of underlying linear components of a matrix. However, besides mathematical rank, there are physical and chemical ranks. The physical rank gives all the sources of variance that are associated with the physics of obtaining the measurement, including noise. These variances may correspond to background or instrumental components. The chemical rank corresponds to variances related to the chemical components of interest. Therefore, the mathematical rank is the number of components with eigenvalues greater than zero. The physical rank corresponds to eigenvalues greater than a threshold that characterizes the indeterminate error of making the measurement. The chemical rank is typically the smallest and corresponds to the number of chemical components, when the variances of the data follow a linear model.

Typically, the components that are less than either the physical or chemical rank are referred to as principal components. The components that correspond to the smaller eigenvalues are referred to as secondary components. Secondary components usually characterize noise or undesirable variances in the data. The determination of the correct number of principal components r is important. If the number of principal components is too small, then characteristic variances will be removed from the data. If the number of principal components is too large, then noise will be embedded in the components as well as signal. There are several methods to evaluate the calculation of the correct number of principal components.

One of the simplest methods is to reconstruct the data D using subsets of the eigenvectors. When the reconstructed data resemble the original data within the precision of the measurement, then the proper number of principal components has been obtained. An empirical approach determines the minimum of the indicator function (IND), which is not well understood, but furnishes reliable estimates of the chemical rank.(59)
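The reconstruction approach can be sketched as follows, assuming a hypothetical two-component data set with a known noise level standing in for the measurement precision:

```python
import numpy as np

# Hypothetical data: two underlying linear components plus small noise.
rng = np.random.default_rng(2)
scores = rng.random((20, 2))
loadings = rng.random((2, 30))
D = scores @ loadings + 0.001 * rng.standard_normal((20, 30))

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Reconstruct with r components until the residual falls within the
# (assumed) precision of the measurement -- here 0.01 absolute error.
for r in range(1, len(s) + 1):
    D_hat = (U[:, :r] * s[:r]) @ Vt[:r, :]
    if np.abs(D - D_hat).max() < 0.01:
        break
print("estimated number of principal components:", r)
```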

There are three key pieces of information furnished by PCA. The first is the relationship of the variance that is spanned by each component. Plotting the eigenvalues as a function of component number gives information regarding the distribution of information in the data. The eigenvalues also convey information regarding the condition number of the data matrix. The condition number is obtained by dividing the largest eigenvalue by the smallest. This condition number can be used to assess the error bounds on a regression model(60) and as a means to evaluate selectivity.(61)

This approach made PCA useful for assessing the number of analytical components contained in a GC peak. This methodology is still used; however, it is now referred to as window or evolving factor analysis (EFA). Instead of processing only the spectra contained in a chromatographic peak, a window (i.e. a user-defined subset of the data) is moved along the chromatogram. The chemical rank is evaluated and gives the number of chemical components in the window.

The second piece of information is furnished by the observation scores. Score plots display the distribution of spectra or rows of D in a lower-dimension graph. The scores of the first two components provide a two-dimensional window that maximizes the information content of the spectra. If the rows are ordered with respect to time, the observation scores give trajectories of the changes that occur in the data over time (Equation 8):

o_i = d_i V = u_i S   (8)

for which o_i is the row vector of observation scores of spectrum i (d_i). It may be calculated by multiplying a spectrum, the ith row of D, by the matrix of principal components. The observation scores can also be calculated directly from the results of SVD by multiplying the ith row of the column-space eigenvectors U by the matrix of singular values S. Plots of the observation scores are also referred to as Karhunen–Loève plots. These plots allow clustering of the data to be visualized.
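The equivalence in Equation (8), scores by projection versus scores taken directly from the SVD, can be checked numerically; the data matrix below is hypothetical and illustrative only.

```python
import numpy as np

# Hypothetical mass-spectral data matrix (rows = spectra).
rng = np.random.default_rng(3)
D = rng.random((8, 5))

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Equation (8): scores from projection, o_i = d_i V, equal the scores
# obtained directly from the SVD, o_i = u_i S.
scores_projection = D @ Vt.T
scores_svd = U * s          # broadcasts S along the columns of U
assert np.allclose(scores_projection, scores_svd)

# The first two columns give the coordinates for a 2-D score plot.
plot_xy = scores_svd[:, :2]
```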

The final piece of information is yielded by the variable loadings, which indicate the direction in which the row-space eigenvectors point. The variable loadings show the importance of each variable for a given principal component. Plots of variable loadings can be examined for characteristic spectral features. They are also used together with the observation score plots to see which spectral features are responsible for separating objects in the score plots.

In some instances, the data matrix D can be modified so that the principal components point in directions that are more chemically relevant. These modifications to D are referred to collectively as preprocessing. Typically, the spectra are mean-centered, which refers to subtracting the average spectrum from each spectrum in the dataset. This centers the distribution of spectra about the origin. If the data are not mean-centered, the first principal component will span the variance characterized by the overall distance of the data set from the origin.

In some cases, the spectra are normalized so as to remove any variations related to concentration. Normalization scales the rows of D so that each row is weighted equally. Mathematically, normalizing the spectra to unit vector length will achieve this equalized weighting. For spectra that vary linearly with concentration, the concentration information is manifested in the vector length of the spectrum. Other methods of normalization include normalizing to a constant base peak intensity (i.e. maximum peak of unity) or to a constant integrated intensity (i.e. sum of peaks of unity).

The data may also be scaled so that the variables or columns of D are weighted equally. Scaling is important for mass spectra because peaks of higher mass, which tend to convey more information, have smaller intensities and so tend to be less influential in the calculation of principal components.

Autoscaling gives each variable or column of data equal weight. This method of scaling is useful when the noise or the signals are not uniformly distributed across the mass range. For this method of preprocessing, each column of D is divided by its standard deviation. The problem with autoscaling is that variables that convey only noise are given equal weight with those that convey signal. A better approach is to scale the data by the experimental error for each variable. Experimental error can be measured as the standard deviation of replicate spectra. The variances of these standard deviations can be added for different samples to calculate an estimate of the experimental error. Scaling by the experimental error avoids the diminution of the signals during scaling.
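The preprocessing steps described above (mean-centering, normalization to unit vector length, and autoscaling) can be sketched in a few lines; the data matrix is hypothetical.

```python
import numpy as np

# Hypothetical spectra matrix: rows are spectra, columns are m/z channels.
rng = np.random.default_rng(4)
D = rng.random((10, 6))

# Mean-centering: subtract the average spectrum from every spectrum.
D_centered = D - D.mean(axis=0)

# Normalization to unit vector length: each row (spectrum) weighted equally.
D_norm = D / np.linalg.norm(D, axis=1, keepdims=True)

# Autoscaling: each column (m/z variable) divided by its standard deviation.
D_auto = D_centered / D_centered.std(axis=0)

assert np.allclose(D_centered.mean(axis=0), 0.0)
assert np.allclose(np.linalg.norm(D_norm, axis=1), 1.0)
assert np.allclose(D_auto.std(axis=0), 1.0)
```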

An alternative to scaling is transformation. In some cases the data may be converted to the square root or logarithm to give greater weight to smaller features. A useful method for preprocessing mass spectra is modulo compression.(62)

    5.4.2 Canonical Variates Analysis

For supervised classification, a useful method related to PCA is CVA,(63) which is also applied with discriminant (function) analysis.(64) The CVA method is not usually performed on the original feature space (mass spectra) because the mass spectra have collinear variables or too many variables for CVA. This problem may be resolved by compressing the data, for example by using principal component scores(52) or by calculating the pseudo-inverse of the covariance matrix.(65) The canonical variates (CVs) are principal components that are calculated from a matrix that is related to the Fisher variance and analysis of variance. In the traditional method, two covariance matrices are calculated. The first matrix characterizes the covariance of the class means about the grand mean of the data. The second matrix characterizes the variation of the spectra about their class means.

The CVA approach uses PCA twice in the calculation. First, SVD is used to compute the pseudo-inverse of the within-class sum of squares matrix (SS_w^+). The CVs are the variable loadings obtained from PCA applied to R, which is obtained by Equation (9):

R = SS_b SS_w^+   (9)

for which SS_b is the between-class sum of squares matrix and SS_w^+ is the pseudo-inverse of the within-class sum of squares matrix (SS_w). These are calculated by Equations (10) and (11),

SS_b = Σ_{i=1}^{N_c} N_i (x̄_i − x̄)(x̄_i − x̄)^T   (10)

SS_w = Σ_{i=1}^{N_c} Σ_{j=1}^{N_i} (x_ji − x̄_i)(x_ji − x̄_i)^T   (11)

for which N_c is the number of classes, N_i is the number of spectra in the ith class, x̄_i is the class mean, and x̄ is the global mean. The rank of R will be equal to the number of classes less one (i.e. N_c − 1), because a degree of freedom is lost by centering the spectra about the global mean, and the rank of a product of two matrices cannot exceed the minimum rank of the two multiplier matrices.(66) The CVs are a basis set of orthogonal vectors that maximize the separations of the classes (i.e. maximize the distance among the means and minimize the distance of the spectra from their class means). Thus the principle of CVA is similar to PCA but, because the objective of CVA is to maximize the ratio of the between-group to within-group variance, a plot of the first two CVs displays the best two-dimensional representation of the class separation.
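A compact sketch of the CVA calculation of Equations (9)–(11), assuming a hypothetical two-class data set and using a pseudo-inverse for SS_w:

```python
import numpy as np

# Hypothetical two-class data (e.g. spectra of two compound classes).
rng = np.random.default_rng(5)
classes = [rng.random((10, 4)) + offset for offset in (0.0, 2.0)]
grand_mean = np.vstack(classes).mean(axis=0)

# Between-class (Equation 10) and within-class (Equation 11) sums of squares.
SSb = np.zeros((4, 4))
SSw = np.zeros((4, 4))
for X in classes:
    m = X.mean(axis=0)
    SSb += len(X) * np.outer(m - grand_mean, m - grand_mean)
    SSw += (X - m).T @ (X - m)

# Equation (9): R = SS_b SS_w^+; the CVs are the leading eigenvectors of R.
R = SSb @ np.linalg.pinv(SSw)
eigvals = np.linalg.eigvals(R)

# With Nc = 2 classes, the rank of R is Nc - 1 = 1: one nonzero eigenvalue.
assert np.sum(np.abs(eigvals) > 1e-8) == 1
```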

    5.4.3 Factor Analysis

PCA yields variable loadings that are characteristic of the data matrix. The variable loadings are meaningful with respect to maximizing variances. For other applications it is useful to rotate these loadings in other directions that pertain to specific problems. Once the principal components are rotated, the technique is referred to as FA. Rotations are either oblique or orthogonal. Orthogonal rotations maintain the linear independence of the principal components and the basis. Oblique rotations remove the constraint of linear independence and therefore model physical and chemical phenomena more closely. These methods calculate a square matrix T of coefficients that rotates the components, with a dimensionality of r, where r is the number of principal components. Typically, the column-space components or observation scores are rotated in the forward direction, and the row-space components or variable loadings are rotated in the reverse direction using T^−1. The rotation matrices can be computed by numerical optimization of a variety of objective functions, or the components can be rotated graphically until they resemble a target. For orthogonal rotation, the most popular objective function is Varimax.(67) This rotation method seeks to increase the magnitude of the observation scores on a single component and reduce the scores' magnitudes on all other components.

Target transformation calculates a transformation matrix that rotates the row-space and column-space eigenvectors or components in directions that agree with a target vector. Typically, the targets are a set of properties that may correlate with the objects, and the transformation matrix is calculated by regression. These transformation matrices may be calculated using the eigenvectors from SVD (Equations 12 and 13),

T = U^T X   (12)

X̂ = U T   (13)

for which X is composed of columns of targets, T is the transformation matrix, and X̂ is the estimated target matrix. The loadings can be rotated by regressing the matrix of variable loadings V onto the target matrix T, which has r rows and a number of columns equal to the number of target vectors (Equation 14):

Ŷ = V T (T^T T)^−1   (14)

In some cases, it is advisable to use the pseudo-inverse of T, because the inner product of T may be ill-conditioned or singular. The factor variable loadings for the targets are estimated by Ŷ.
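Equations (12) and (13) amount to regressing a target onto the retained components; a numerical sketch with a hypothetical decomposition and target matrix:

```python
import numpy as np

# Hypothetical data matrix and its SVD; r retained components.
rng = np.random.default_rng(6)
D = rng.random((12, 5))
U, s, Vt = np.linalg.svd(D, full_matrices=False)
r = 3
U_r = U[:, :r]                       # r column-space components

# Target: one property vector per column (a single hypothetical target here).
X = rng.random((12, 1))

# Equation (12): transformation matrix by regression onto the target.
T = U_r.T @ X
# Equation (13): the estimated (rotated) target.
X_hat = U_r @ T

# X_hat is the projection of the target onto the retained components,
# so the residual is orthogonal to them.
assert np.allclose(U_r.T @ (X - X_hat), 0.0)
```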

    5.5 Related Methods and Future Applications

    5.5.1 Calibration

There are various methods that exploit the properties of eigenvectors to accomplish calibration. Calibration furnishes models that predict properties from data such as mass spectra. The most common use for calibration is to construct models that estimate the concentrations of components in complex samples from their mass spectra.

Principal component regression (PCR) uses the observation scores for computing the regression model. The advantage of this approach is that for MS data, in many cases D is underdetermined (i.e. there are more m/z measurements than spectra). Because the number of score variables will equal the chemical rank, the number of variables is reduced and regression by inverse least squares becomes possible.

A related method uses SVD to calculate the pseudoinverse D^+. The SVD regression is computationally more efficient than PCR, but is mathematically equivalent. A very effective method for many problems is partial least squares (PLS). This calculates common column-space eigenvectors between the independent block (i.e. D) and dependent block (i.e. Y) of data. The PLS method was initially developed in the field of econometrics. Both PLS and PCA are described in a tutorial;(68) PLS has been enhanced to handle multiway or higher-order data.(69)
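A sketch of SVD-based regression for an underdetermined data matrix (more m/z channels than spectra); the dimensions and the noise-free linear model are hypothetical.

```python
import numpy as np

# Hypothetical underdetermined case: 8 spectra, 20 m/z channels,
# with concentrations generated from a known linear model.
rng = np.random.default_rng(7)
D = rng.random((8, 20))
b_true = rng.random(20)
y = D @ b_true

# SVD regression: b = D^+ y, using only components with significant
# singular values, since ordinary inverse least squares is impossible
# with more channels than spectra.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
keep = s > 1e-10 * s[0]
b = Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])

# The model reproduces the calibration concentrations.
assert np.allclose(D @ b, y)
```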

Quantitative analysis of complex binary and tertiary biochemical mixtures analyzed with PyMS(70) showed that, of the latent-variable PCR and PLS methods, the best technique was PLS, a finding generally confirmed by other studies.(71,72)

    5.5.2 Multivariate Curve Resolution

The same FA methods that were initially applied to peaks of GC/MS data have evolved so that they can be applied to entire chromatographic runs. These methods start with a set of principal components. The components are rotated by a method known as alternating least squares (ALS). The key is to apply mathematical constraints such as non-negativity (no negative peaks) and unimodality (a spectrum will appear in only one peak of a chromatogram).
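The ALS refinement with a non-negativity constraint can be sketched as follows; the noise-free mixture data are hypothetical, and the unimodality constraint is omitted for brevity.

```python
import numpy as np

# Hypothetical noise-free mixture data: D = C S, with non-negative
# concentration profiles C (30 time points x 2 components) and
# non-negative component spectra S (2 x 15 channels).
rng = np.random.default_rng(8)
D = rng.random((30, 2)) @ rng.random((2, 15))

C = rng.random((30, 2))              # rough initial profile estimate
resid = []
for _ in range(100):
    # Solve for the spectra given the profiles, then apply the
    # non-negativity constraint (no negative peaks) by clipping.
    S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0], 0.0, None)
    # Solve for the profiles given the spectra, same constraint.
    C = np.clip(np.linalg.lstsq(S.T, D.T, rcond=None)[0].T, 0.0, None)
    resid.append(np.linalg.norm(D - C @ S))

# The constrained alternating updates refine the fit.
assert resid[-1] < resid[0]
```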

Curve resolution provides a means to enhance the spatial or depth resolution of ion measurements of surfaces, or could be exploited to examine changes in electrospray mass spectra as a function of changing solvent conditions. Curve resolution will continue to exploit PCA and FA to detect impure chromatographic peaks and mathematically resolve the overlapping components.

EFA and window factor analysis (WFA) use the eigenvalues to model the change in concentrations of components in the data matrices. The eigenvalues can be combined to form initial concentration profiles that are regressed onto the data. The concentration profiles and extracted spectra are refined using ALS with constraints.

    5.5.3 Multiway Analysis

The entire chromatographic mass spectral data matrix D is only the beginning. If several chromatographic runs are used to characterize a chemical process, or if multidimensional MS matrices of data are collected, a tensor or cube of data is obtained. Using methods based on the Tucker model,(73) these higher-order sets of data can be decomposed into vectors or planes of principal components. A method related to the Tucker model is PARAFAC.

    5.6 Reviews and Tutorials

Malinowski’s monograph is an excellent resource for PCA and FA.(74) Tutorials on FA and related methods can be found in the literature – the philosophical basis of PCA and FA,(75) EFA,(76) and target transform FA.(77)

Multivariate curve resolution applied to chromatography with multichannel detection has been published as a tutorial(78) and reviewed specifically for GC/MS.(79)

Tutorials on the multiway PCA method PARAFAC(80) and on PLS(68) are also useful entry points into these methods. The text by Martens and Næs on multivariate calibration thoroughly describes PLS.(81)

    5.7 Acknowledgments

Tricia Buxton, Guoxiang Chen, and Aaron Urbas are thanked for their help with preparing this section. Thomas Isenhour and Kent Voorhees are thanked for their help with searching the literature. The introductory example data set was initially prepared by Peter Tandler.

    6 ARTIFICIAL NEURAL NETWORKS

    6.1 Summary

The availability of powerful desktop computers, in conjunction with the development of several user-friendly packages that can simulate ANNs, has led to the increased adoption of these ‘intelligent’ systems by analytical scientists for pattern recognition. The nature, properties, and exploitation of ANNs, with particular reference to MS, are reviewed.

    6.2 Introduction to Multivariate Data

Multivariate data consist of the results of observations of many different characters (variables) for a number of individuals (objects).(82,83) Each variable may be regarded as constituting a different dimension, such that if there are n variables each object may be said to reside at a unique position in an abstract entity referred to as n-dimensional hyperspace. In the case of MS, these variables are represented by the intensities of particular mass ions. This hyperspace is necessarily difficult to visualize, and the underlying theme of multivariate analysis (MVA) is thus simplification(84) or dimensionality reduction, which usually means that we want to summarize a large body of data by means of relatively few parameters, preferably the two or three that lend themselves to graphical display, with minimal loss of information.

    6.3 Supervised Versus Unsupervised Learning

Conventionally, the reduction of the multivariate data generated by MS(85–87) has been carried out using PCA;(84,88–90) the PCA technique is well known for reducing the dimensionality of multivariate data while preserving most of the variance, and the principal component scores can easily be plotted and clusters in the data visualized.

Analyses of this type fall into the category of unsupervised learning (Figure 7a), in which the relevant multivariate algorithms seek clusters in the data.(90) Provided that the data set contains standards – of known origin and relevant to the analyses – it is evident that one can establish the closeness of any unknown samples to a standard, and thus effect the identification of the former. This technique is termed ‘operational fingerprinting’ by Meuzelaar et al.(91)

Such methods, although in some sense quantitative, are better seen as qualitative, as their chief purpose is merely to distinguish objects or populations. More recently, a variety of related but much more powerful methods, which are most often referred to within the framework of chemometrics, have been applied to the supervised analysis of multivariate data (Figure 7b). In these methods, one seeks to relate the multivariate MS inputs to the concentrations of target determinands, i.e. to generate a quantitative analysis, essentially via suitable types of multidimensional curve fitting or linear regression analysis.(83,92–96) Although nonlinear versions of these techniques are increasingly available,(97–103) the usual implementations of these methods are linear in scope. However, a related approach to chemometrics, which is inherently nonlinear, is the use of ANNs.


[Figure 7 schematic: (a) Multivariate data → Feature extraction → Clustering → Human interpretation. (b) Multivariate data → Calibration system → Output → Comparison with known target → Error.]

Figure 7 (a) Unsupervised learning – when learning is unsupervised, the system is shown a set of inputs (multivariate MS data) and then left to cluster them into groups. For MVA this optimization procedure is usually simplification or dimensionality reduction; this means that a large body of data (the inputs) is summarized by means of a few parameters with minimal loss of information. After clustering, the results then have to be interpreted. (b) Supervised learning – when the desired responses (targets) associated with each of the inputs (multivariate data) are known, then the system may be supervised. The goal of supervised learning is to find a model that will correctly associate the inputs with the targets; this is usually achieved by minimizing the error between the known target and the model’s response (output).

    6.4 Biological Inspiration

ANNs are biologically inspired; they are composed of processing units that act in a manner analogous to the basic function of the biological neuron (Figure 8). In essence, the functionality of the biological neuron consists of receiving signals, or stimuli, from other cells at its synapses, processing this information, and deciding (usually on a threshold basis) whether or not to produce a response, which is passed on to other cells. In ANNs these neurons are replaced with very simple computational units which can take a numerical input and transform it (usually via summation) into an output. These processing units are then organized in a way that models the organization of the biological neural network, the brain.
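A single such processing unit can be sketched in a few lines; the weights and threshold below are hypothetical, chosen so that the unit implements a logical AND.

```python
import numpy as np

# A processing unit of the kind described above: it sums weighted
# inputs ("stimuli") and fires on a threshold basis.
def neuron(inputs, weights, threshold):
    return 1 if np.dot(inputs, weights) >= threshold else 0

weights = [1.0, 1.0]
threshold = 1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", neuron(x, weights, threshold))
```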

Despite the rather superficial resemblance between the ANN and the biological neural network, ANNs do exhibit a surprising number of the brain’s characteristics. For example, they learn from experience, general