exploring multivariate clinical routine data concerning...

12
Journal of Automatic Chemistry, Vol. 10, No. 2 (April-June 1988), pp. 67-78 Exploring multivariate clinical chemical routine data concerning three major disea se groups Jan B. Hemel, Central Laboratory for Clinical Chemistry, University Hospital Groningen, P.O. Box 30001, NL-9700 RB Groningen, The Netherlands Hilko van tier Voet*, Research Group Chemometrics, Pharmaceutical Laboratories, State University of Groningen, A. Deusinglaan 2, NL-9713 A W Groningen, The Netherlands Roll Hendriks, Frans R. Hindriks and Willem van tier Slik Central Laboratory for Clinical Chemistry, University Hospital Groningen, P.O. Box 30001, NL-9700 RB Groningen, The Netherlands In preparation for multivariate analysis, an exploratory study has been undertaken to investigate the relative position, separability, homogeneity and shape of three major disease groups, using data from a clinical chemical routine package. The data set consists of 46 hepatology patients, 50 nephrology patients and 46 cardiology patients, and the measured blood levels include 20 common clinical chemical routine assays. Missing value problems were avoided by deleting some of the variables and objects. A univariate analysis was used as the basis of a rescaling of the data. Bivariate (pairwise) plots of some major assays each show limited separation. The set of three such plots of the three major principal components reveals more distinction between the groups than was offered by univariate analysis. Three-dimensional extensions of these techniques allow better insight than any of the two- dimensional plots, but these three-dimensional versions require more plots for complete interpretation. Non-linear mapping of the data is the best way of retaining the distances and a fairly good separation is achieved in the plot. The plot is less informative about shape and relative position of the classes. Representation of the data as pictures of faces does not offer additional information and visual clustering is worse than in any of the techniques mentioned. During the analysis many assumed properties of the data are confirmed and a good starting point for multivariate classification is attained. Easy visual detection of outliers is offered by all techniques. Unfortunately, valuable information is lost in this data set by deleting some incomplete variables. Introduction During the past decades the number of constituents that can be measured in body fluids has increased substan- tially. Simultaneously, the costs per assay decreased at an * Present address: Agricultural Mathematics Group, P.O. Box 100, NL-6700 AC Wageningen, The Netherlands. even more rapid rate due to large-scale laboratory automation. These facts have stimulated the physician to order more assays than can be effectively interpreted by a human without help from sophisticated techniques. Although each assay may contribute some information, the sequential univariate way of interpreting the results leaves part of the information unrevealed. Fortunately, the impasse that results can be broken. Firstly, the physician could be advised to order only those assays that will give him the desired information, in other words, to order very selectively. This requires a thorough knowledge of the value of the assays for each diagnosis. As clinical chemistry offers ever more new and possibly valuable assays, it is difficult to keep this knowledge up to date. The introduction of protocols of diagnosis and treatment in medicine, supported by recent medical decision-making techniques, is an attempt to optimize the use of laboratory results by selection of an optimal subset of all assays for a specific problem. Advantages of this approach include reduction of overall costs and growing experience with a selection of assays, leading to assess- ment of their value. Another approach is to change the way of interpreting the results from sequentially univariate (the assay outcomes are judged successively) to multivariate (all assays are judged simultaneously). As the maximum number of features that one can simultaneously grasp does not exceed three by much, the wealth of data coming from a modern screening series of, say 20, laboratory assays is far too large to be interpreted without help. This is where statistical multivariate techniques may be useful. By representing a patient’s record as a point in a p-dimen- sional space spanned by the p assays as axes, the results of all p assays can be judged at once. Mathematical models enveloping each disease class may be developed, and this can be followed by classification of an unknown test point into the class whose model fits best. Furthermore, multivariate processing of data may indicate which assays contain most discriminatory information between disease classes and thus supplies suggestions about which measurements should be done to clear what doubts. In this way, multivariate data analysis may both reduce the number of assays needed by maximizing the un- covered information, and indicate which assays offer most. Since medical diagnosis is in fact a kind of classification, multivariate classification is the most interesting branch of multivariate analysis of medical data. It is stressed that classification is not necessarily choosing the one and only diagnosis, but may very well take the form of a 67

Upload: others

Post on 14-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Journal of Automatic Chemistry, Vol. 10, No. 2 (April-June 1988), pp. 67-78

Exploring multivariate clinical chemicalroutine data concerning three major disea segroups

Jan B. Hemel,Central Laboratory for Clinical Chemistry, University Hospital Groningen,P.O. Box 30001, NL-9700 RB Groningen, The Netherlands

Hilko van tier Voet*,Research Group Chemometrics, Pharmaceutical Laboratories, State University ofGroningen, A. Deusinglaan 2, NL-9713 AW Groningen, The Netherlands

Roll Hendriks, Frans R. Hindriks andWillem van tier SlikCentral Laboratory for Clinical Chemistry, University Hospital Groningen,P.O. Box 30001, NL-9700 RB Groningen, The Netherlands

In preparationfor multivariate analysis, an exploratory study hasbeen undertaken to investigate the relative position, separability,homogeneity and shape of three major disease groups, using datafrom a clinical chemical routine package.The data set consists of 46 hepatology patients, 50 nephrologypatients and 46 cardiology patients, and the measured blood levelsinclude 20 common clinical chemical routine assays. Missing valueproblems were avoided by deleting some ofthe variables and objects.

A univariate analysis was used as the basis ofa rescaling of thedata.

Bivariate (pairwise) plots ofsome major assays each show limitedseparation. The set ofthree such plots ofthe three major principalcomponents reveals more distinction between the groups than was

offered by univariate analysis. Three-dimensional extensions ofthese techniques allow better insight than any of the two-dimensional plots, but these three-dimensional versions requiremore plots for complete interpretation.

Non-linear mapping of the data is the best way of retaining thedistances and a fairly good separation is achieved in the plot. Theplot is less informative about shape and relative position of theclasses.

Representation of the data as pictures offaces does not offeradditional information and visual clustering is worse than in any ofthe techniques mentioned.

During the analysis many assumed properties of the data are

confirmed and a good starting pointfor multivariate classificationis attained. Easy visual detection of outliers is offered by alltechniques. Unfortunately, valuable information is lost in this dataset by deleting some incomplete variables.

Introduction

During the past decades the number of constituents thatcan be measured in body fluids has increased substan-tially. Simultaneously, the costs per assay decreased at an

* Present address: Agricultural Mathematics Group, P.O. Box100, NL-6700 AC Wageningen, The Netherlands.

even more rapid rate due to large-scale laboratoryautomation. These facts have stimulated the physician toorder more assays than can be effectively interpreted by ahuman without help from sophisticated techniques.Although each assay may contribute some information,the sequential univariate way of interpreting the resultsleaves part of the information unrevealed. Fortunately,the impasse that results can be broken.

Firstly, the physician could be advised to order only thoseassays that will give him the desired information, in otherwords, to order very selectively. This requires a thoroughknowledge ofthe value ofthe assays for each diagnosis. Asclinical chemistry offers ever more new and possiblyvaluable assays, it is difficult to keep this knowledge up todate. The introduction of protocols of diagnosis andtreatment in medicine, supported by recent medicaldecision-making techniques, is an attempt to optimize theuse of laboratory results by selection of an optimal subsetof all assays for a specific problem. Advantages of thisapproach include reduction of overall costs and growingexperience with a selection of assays, leading to assess-ment of their value.

Another approach is to change the way ofinterpreting theresults from sequentially univariate (the assay outcomesare judged successively) to multivariate (all assays arejudged simultaneously). As the maximum number offeatures that one can simultaneously grasp does notexceed three by much, the wealth of data coming from amodern screening series of, say 20, laboratory assays is fartoo large to be interpreted without help. This is wherestatistical multivariate techniques may be useful. Byrepresenting a patient’s record as a point in a p-dimen-sional space spanned by the p assays as axes, the results ofall p assays can be judged at once. Mathematical modelsenveloping each disease class may be developed, and thiscan be followed by classification ofan unknown test pointinto the class whose model fits best. Furthermore,multivariate processing of data may indicate whichassays contain most discriminatory information betweendisease classes and thus supplies suggestions about whichmeasurements should be done to clear what doubts.

In this way, multivariate data analysis may both reducethe number of assays needed by maximizing the un-covered information, and indicate which assays offer most.

Since medical diagnosis is in fact a kind of classification,multivariate classification is the most interesting branchofmultivariate analysis ofmedical data. It is stressed thatclassification is not necessarily choosing the one and onlydiagnosis, but may very well take the form of a

67

Page 2: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. I-Iemel et al. Exploring multivariate clinical chemistry routine data

probabilistic differential diagnosis. Multivariate classi-fication is an often-occurring aim ofdata analysis [see, forinstance, 1-8]. Although at first sight the procedure mayseem to consist only of choosing a multivariate classifica-tion method (MCM) and applying it to the data using anappropriate computer program, in most cases manyunexpected problems are encountered. Problems ofscaling, transformation, outlier detection, class model-ling, and many more will trouble the investigator andmay prevent him from reaching his goal. Some of theseproblems may (and should) be foreseen and even solvedby exploring the data before starting the actual classifica-tion procedure. A carefully executed exploration willanswer questions about separability, relative position andshape of the classes, it may suggest adequate classmodels, and it will indicate possible outliers. Tukey haswritten a comprehensive work about data exploration[9].

Equipment and computer programs

All samples were analysed on a Technicon SMACcontinuous flow analyser. Calculations were done on

Groningen University’s Control Data Cyber 170/760computer. Plots were made on the University’s VersatecV80 electrostatic plotter.

The following programs were used:

(1) SIMCA-3B [12], for the calculation of modellingpowers.

(2) ARTHUR [13], for variable selection, non-linearmapping and principal-components plots.

(3) Some ad hoc programs for the three-dimensionalplots.

(4) CLAS [14], for all other calculations.

Before starting with the exploration it is useful to

preprocess the data by eliminating missing values andscaling the variables [2, 7, 8 and 10].

The actual exploration includes several stages.

In the first stage univariate statistics, like means andmoments, may be calculated for the complete data set aswell as for each class separately. These will detect clearseparability on a single variable. Histograms and othergraphical displays may add univariate distributionalinformation. The knowledge of the univariate separa-bility that can be acquired in this way gives an impressionof the multivariate distinction as well. However, badunivariate separation does not imply multivariate over-lap, making a multivariate approach useful.

The second step involves calculation of correlations.These provide us with an impression ofthe dispensabilityof each variable in the presence of another one.

In the third stage, several display techniques that aim at

retaining the multivariate aspect of information in thedata complete the exploration.

This paper is dedicated to the exploration of a medicaldataset, consisting of clinical chemical screening data ofpatients suffering from a heart, liver or kidney disease.Usually, in medical diagnosis a sequential univariateapproach is used implicitly in the application ofreferenceintervals 11]. Multivariate diagnosis is expected to use a

larger part of the information present in the data. Toinvestigate the separability, relative positions and shapeof these major disease groups, and to detect the presenceof atypical cases, this study was undertaken.

After the description of the computer equipment and theprograms that were used the data set is introduced. In thefollowing section a selection of multivariate displaymethods is discussed. After a section discussing thenecessary preprocessing of the data the main section willcontain a discussion of exploration results of the data.The paper is concluded with an overview of the informa-tion that is derived from the various techniques.

Data

Most multivariate studies aim at distinguishing betweenvery similar groups, using a highly selected set of veryspecific assays. However, classification into major diseasegroups using only clinical chemical routine data can serveas the first step in referring a patient to the right medicalspecialist. Although separation of major disease groupsmay at first seem trivial, the heterogeneity within theselarge groups and cases with multiple diseases makeseparation difficult.

The data set that is explored consists of 142 patientrecords. The patients suffered from the following diseasegroups: 46 liver diseases (15 alcoholic cirrhosis, fiveprimary biliary cirrhosis, 26 cirrhosis due to chronicactive hepatitis- the LIVER class), 50 kidney diseases(various, unspecified- the KIDNEY class), 46 heartdiseases (24 myocardial infarction, 22 coronary arterydisease- the HEART class).

The criteria for selection of the patients were: (1)diagnosis certain within the boundaries of the class; (2)no concurrent disease from any of the two other classespresent; (3) admitted to hospital in 1982; (4) bloodsample available within 10 days of admittance.

The only information about the patients, apart from theirdiagnostic class, consists of their blood concentrations ofsodium, potassium, chloride, urea, creatinine, uric acid,alkaline phosphatase (AP), lactate dehydrogenase(LDH), aspartate aminotransferase (ASAT), alanineaminotransferase (ALAT), total bilirubin, direct bili-rubin, calcium, inorganic phosphate, total protein,albumen, cholesterol, triglycerides, iron and y-glutamyltransferase (GGT), expressed in appropriate units. Thisset of assays is the routine clinical chemical package usedat Groningen University Hospital. It has not beenoptimized for discrimination between HEART, LIVERor KIDNEY patients. For each patient the completeseries of assays was performed on a single blood sample,taken within 10 days ofadmission. Twelve per cent of thevalues in the data is missing due to a selective ordering ofassays by the physican. More details about the data are

given in table 1.

68

Page 3: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. Hemel et al. Exploring multivariate clinical chemistry routine data

Table 1. Summary ofthe data set before and after reduction because ofmissing values.

With missing values Without missing values

Variable Unit Class Mean s.d. Mean s.d.

Sodium mMol/1

Potassium mMol/l

Chloride mMol/1

Urea mMol/1

Creatinine Mol/1

Uric acid mMol/l

AP U/1

LDH U/1

ASAT U/1

ALAT U/1

Bilirubin, tot. tMol/1

Bilirubin, dir. gMol/1

Calcium mMol/1

Phosphate mMol/l

Protein g/l

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

heartliverkidneyall

139 2.6138 3.8138 4.7138 3.9

4-3 0.474.0 0.504.6 0.624.3 0.59

103 2.6102 6.2102 6.6102 5.4

6 1.96 5.717 12.410 9.5

93 1484 61384 389191 270

0.36 0.080.29 0.100.42 0.120.36 0.12

87 27238 219101 54139 143

326 245280 140260 86287 168

48 64118 15025 2762 101

25 12105 9827 5451 74

9 664 1418 7

30 93

1.5 0.732.5 80.61.9 1.4

14.0 52.8

2.4 0.142.3 0.162.4 0.212"3 0.19

1.0 0.261.1 0"301.5 0.661.2 0.51

66 5.465 9.362 9.664 8"6

4403

2724

4244

0000

002

41116

19

2924

9044

020

2402527

1424

4621420

264211

3940

14

11024

139 2-6137 4"2137 4-7138 4.0

4"2 0-463.9 0.534.5 0.644.3 0.60

103 2.7101 6"6101 6.8102 5-7

6 2"06 6"318 12-911 10"2

94 1588 68

407 402207 289

89251104144

334292263295

541272566

25992849

2723257151

25615388178

6916629109

12875868

69

Page 4: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. Hemel et al. Exploring multivariate clinical chemistry routine data

Table 1 continued.

With missing values Without missing values

Variable Unit Class Mean s.d. % miss. Mean s.d.

Albumen g/1 heartliverkidneyall

Cholesterol mMol/1 heartliverkidneyall

Trig yceride mMo /1 heartliverkidneyall

Iron bMol/1 heartliverkidneyall

GGT U/1 heartliverkidneyall

44 2"9 3935 5"7 436 7"4 238 7"0 15

7 1"6 245 2"7 46 2"8 246 2"6 18

2"4 1.1 611"5 1.0 152"2 1.4 361"9 1"2 37

14 8"2 5419 10"1 1114 8"0 2016 9"2 28

39 32 50372 632 1551 123 16169 423 27

Numbers of patients initially:Nheart=46; Nli,,r=46; Nkidney--50; Ntotal-" 142;

After removing missing values:Nheart=38; Nliver=36; Nkidney=43; Ntot 117

Italics: variable is deleted after missing value removal.

Multivariate display methods

To get information about the separability of the classes,visual inspection of the multidimensional space and thesituation of the object patterns in that space can be ahelpful tool. Since human imagination can only cope withspaces with up to three dimensions, so-called displaymethods map the multidimensional data onto a low(usually two) dimensional space.

What could be the use of display methods? One may getan impression of the separability of the classes, of thepresence of outliers or atypical cases, and, if mathemat-ical models are fitted, one may visually judge the models.It is stressed that separability need not- and perhapsshould not- be the only purpose of exploring data.

To reach these goals there are some requirements to befumned. () distances between objects (and classes)should be conserved as well as possible; (2) the informa-tion content, in the sense of variance retained in thepicture, should be as large as possible; (3) the pictureshould be easily interpretable.

Many different display methods have been proposed inliterature [9, 15-19]. Some of them will be brieflydiscussed here.

Pairwise scatter plots ofsome or all variables against eachother. These plots show relations between the variables.With growing number ofvariables the number ofpossibleplots grows unmanageably large. These pictures are

separately simple to interpret. They contain each on theirown but a fraction of the information content, buttogether virtually all. Distances are only partiallyretained.

New orthogonal axes may be chosen in the p-dimensionalspace, for example by calculating the principal com-ponents (PCs) of the data. The first principal componentis an axis that is chosen so that the projections of allpresent points on that axis show maximum variance. Thesecond PC is the axis that is orthogonal to the first oneand contains next largest variance, and so on. Thetransformation of the data to scores on these new PC-axesis called the Karhunen-Lo?eve transformation. The pair-wise scatter plots of the PCs against each other are calledPC-plots. As the first PCs contain, by definition, most ofthe variance in the data, only the first few PC-plots maysuffice to depict most of the information (in the sensedefined before), so a dimension reduction results. Thenew axes are chosen in a way to guarantee a maximuminformation content in the first two PCs. The only otherdifference with pairwise plots ofvariables is the interpret-ability, which is generally much worse, since the axes inthe plots are mixtures of the original assay-axes. Insightinto this mixture can be obtained from a biplot, a PC-plotwith the original axes projected into it. This biplot givesinformation about the relation of the original axes to thePCs.

Scatter plots of variables or PCs are conceivable in threedimensions as well as in two. Grotch has reported thisand similar techniques [23]. For the actual plotting, a

projection of three-space on two-space is necessary,

70

Page 5: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. Hemel e! al. Exploring multivariate clinical chemistry routine data

resulting in pseudo-three-dimensional plots. By rotatingthe three-dimensional space in different ways before theprojection, a three-dimensional illusion may be obtained.By using advanced real-time 3D-graphics programs thisillusion can be significantly enhanced, but only at highcosts.

A technique that is not a simple orthogonal projection ofthe data onto two-space is extensively described byKowalski and Bender [10]. This so-called non-linearmapping attempts to retain the distances between thedata points. One may think of all points connected to allother points with tcnsionlcss springs and pressing thisp-dimensional structure onto two-space. This techniqueis offered by ARTHUR as the NLM routine. Inter-individual distances arc best retained by this method; theinterpretation of other aspects than these distances ishard. A problem with this method is that an NLM-plot isdependent on the initial projection direction, so it is not

uniquely determined.

Chcrnoff, contemplating the fact that in two dimensions ahuman is the best pattern recognizer, and that everyonehas been studying faces right from his birth, proposed a

procedure to translate every variable into a feature of a

cartoon face [19 and 20]. By looking at the picture manyvariables arc interpreted simultaneously. A problem withthis technique is that not every facial feature is equallyprominent, and that the ranking for recognition issub.jcctivc [21].

Preprocessing the data

As mentioned before, the data presented quite a few gaps.This is a very usual phenomenon with retrospectivemedical data: a physician is not interested in all assays, so

only a selection is executed. The locations ofthe gaps mayreflect the surmises (or certainties) of the physician. If allincomplete objects arc excluded from the data, it isprobable that patients that arc very typical for their classarc lost. Patients that arc easily recognized as sufferingfrom a liver disease may not be examined for kidneyfailure, and so kidney function tests may not be ordered.On the other hand, omitting incomplete variables leavesonly very non-specific tests. We chose deletion of allvariables that missed more than 10% of the values in anyclass, removing the remaining gaps by deletion ofobjects.Nine variables and 117 ob.jccts wcrc retained. Thevariables were sodium, potassium, chloride, urea, crcati-nine, AP, LDH, ASAT and ALAT. Excluded ob.jcctscame about equally from all classes. (The arbitrariness ofthis way of missing data removal has stimulated us in a

later study to develop another missing data handlingtechnique, finally emerging into the method of slepwisedelelion [24].)

Variation between different objects in the data comesfrom four causes. The first cause is inter-class variation(pathology): each disease has its own physiology andresulting blood concentrations. This is the differencebetween the patients in which we arc interested. Thesecond cause of difference between patient records isintra-class variation: this includes diflkrences betweenpatients suffering from the same disease. Another cause is

intra-patient variation: variations of concentrations intime within a patient, the staging of the disease, as well asrandom or circadian fluctuations. The last source ofdifference is analytical error, the error made in themeasurement.

The aim of class separation implies that variation fromother sources than inter-class differences should beeliminated as well as possible, leaving the differencescaused by pathology untouched. The usual approaches ofautoscaling and class-scaling, that equalize the variationin each feature in the entire data set or per classrespectively, implicitly assume that no distributionalcharaceristics arc known about the unwanted sources ofvariation. However, in many medical data sets, as in theone under consideration, such information is available forhealthy people. For most assays, reference limits arcdetermined on the basis of a healthy population. If thedata are scaled according to the standard deviationsfound in this population, healthy persons will aggregateinto a more or less spherical shape, depending on thedistribution being symmetrical or not, and on thepresence or absence of inter-assay correlations. Inter-individual and analytical variation is in this approach’downscalcd’, leaving inter-class and disease-stage varia-tion unaffected. In this reproducibilily scaling it is assumedthat analytical error and intra-individual variation arethe same for both the healthy and the ill. In an earlierstudy by our group, reference limits were determinedbased on a patient population [11]. The standarddeviations given in table 2 were derived from this study.The variables are divided by these numbers to get thescaled ones. Table 2 also includes the abbreviations usedin the following.

Disease-stage variation is more difficult to combat. In thepresent study the effect of the stage of the disease islimited by choosing the moment of admittance to thehospital as reference, and sampling within 10 days. It

Table 2. Slandard devialions usedfor reproducibilily scaling.

Variable Abbreviation s.d.

Sodium NA 2"3Potassium K 0"36Chloride CL 2"7Urea UR 1"4Creatinine CR 18Uric acid UA 0"067AP AP 26LI)H LDH 38ASAT ASAT 4.7ALAT ALAT 6.4Bilirubin, tot. DBI 2"2Bilirubin, dir. TBI 0"60Calcium CA O" 11Phosphate P 0"22Protein TP 4"6Albumen ALB 3"9Cholesterol CItO 1.6Trig ycerides TGL 0"50Iron FE 13GGT GGT 6"9

Italics: variable is deleted after missing value removal.

71

Page 6: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. Hemel et al. Exploring multivariate clinical chemistry routine data

must be noted that for myocardial infarction this periodmight be too long to reduce disease-stage variationsufficiently.

For the determination of the most important variablesseveral methods are available. The Umea group hasdefined [22] the Modelling Power of a variable j for themodel of class c as being

MPj ?d/S}Iwith SF,,d: residual standard deviation of variable j in

class c with respect to the class model.

t.ot. residual standard deviation of variable j inclass c without application of any model.

This MP emphasizes the description of a class with PCmodels. The discriminating powers defined by this groupstress the difference between the classes. In ARTHUR afeature selection method called SELECT is available thatoffers a selection according to interclass/intra-class ratioof variance. SELECT, too, takes separation between theclasses as its primary aim. Because we are concerned withexploration, modelling powers were calculated usingSIMCA-3B. Since SIMCA-3B can elegantly cope withmissing values, these calculations were done also beforethe reduction of the data set. In this way we get someimpression of the importance ofomitted variables as well.Modelling powers were calculated for models consistingof one and two PCs for each class. Some major results are

summarized in table 3. It is seen that some deletedvariables are very promising, although the figures mustbe interpreted carefully, because of the sometimes largeproportion of missing data.

To investigate the value of the systematically missingvariables a large study is started at our laboratory inwhich complete sets ofroutine data are prepared for everypatient that is admitted to the ward for internal diseasesat the Groningen University Hospital. The results aresaved for later use when the patient’s diagnoses areavailable.

Table 3. Largest modelling powers.

Results and discussion

First of all the univariate stalistics of the unscaled data arecalculated. The most relevant statistics are summarizedin table 1. From this table it can be seen that the presenceof some variables differs greatly between the diseases, asis expected because of the mentioned selectivity inordering assays. Interesting variables are beginning toshow because of differences in means between thediseases, but the optimism about their relevance is oftenreduced by the accompanying large standard deviations.It is noticed, that standard deviations and means are notmuch influenced by the deletion of missing values. Thissuggests that deletion of objects has been sufficientlyrandom, since shift ofmeans and/or standard deviation isexpected if systematically some subclasses ofobjects weredeleted.

More information may come from a bivariate approach.Correlations are easily obtainable. In CLAS they arecalculated across the entire data set by default. Thecorrelations found to be greater than 0.5 are reported infigure 1. They are all significant at a level of at least99.95% (one-sided). The variables concerned turn out toform small clusters. It must be stressed that somecorrelations may be strongly dependent on the classesincluded in the data set. For instance a variable may behigh for a class and normal for other classes in whichanother variable’s level may be elevated. This situationwill result in a high (negative) correlation between thesevariables that might be absent in a healthy control group.It follows that correlations calculated in this data set arcnot applicable to data-sets consisting of different classes.However, from these correlations variables that areredundant in the context of these classes can be detected.

It is seen that although many highly significant correla-tions exist, the correlations do not allow us to leave manyvariables out because of complete redundancy. Only thebilirubins and, to a lesser extent, creatinine and urea, andAP and direct bilirubin are roughly equivalent in thisdata set.

Class Model size* Data set Variables with largest modelling powers

Heart FR

2 F2 R

Liver FR

2 F2 R

Kidney FR

2 F2 R

ASATASAT

ASATASAT

Bilirubin, dir.ASAT

Bilirubin, dir.ASAT

CreatinineCreatinine

CreatinineALAT

LDH > ALATLDH > ALATGGT > LDHLDH > ALAT

bilirubin, tot. > APLDH > ALATGGT > Bilirubin, tot.ALAT > LDHUrea > PhosphateUrea >> LDHGGT > Urea

Creatinine > ASAT

Italics: variable is deleted after missing value removal.* Number of PCs in the model.

" F: full data set including data gaps.R: reduced data set, freed from data gaps.

72

Page 7: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

.]an B. Hemel et al. Exploring multivariate clinical chemistry routine data

NA 0.63 CL CHO 0.51 TGL

CA 0. 63 ALB 0. 61 TP

GGT

0.0 0.54

AP 0.80 OBI

8.78 0.99

DBI

UA 0. 59 UR 0. 85 CR

8.67 8.61

Figure 1. Correlations in the data set that are greater than 0"5.Abbreviations from table 2.

Plots of one variable versus another variable give visualinformation that may remain hidden otherwise. Themodelling powers as listed in table 3 supported by clinicalchemical knowledge suggest that ASAT, ALAT, LDHand creatininc may be interesting. The most intbrmativcof the set of possible plots appears to be ASAT versuscrcatininc. The plot is shown in figure 2. ASAT isimportant tbr the modelling of heart and liver diseasegroups, and crcatininc is characteristic tbr kidneydiseases. In this exploration we are concerned withdescription of the classes. We should be aware of the tactthat for optimal separation another pair of variablesmight be more appropriate.

2{ 22 22

CRFigure 2. ASA T plolted versus creatinine (CR) (after scaling).1 liver; 2 kidney; 3 hearl.

It is obvious from the plot that the classes arc not

separated completely. Patients in an early stage of illnessand almost recovered patients fi’om all disease classesjoinin a common region where the healthy population will be

found as well. There are two apparent causes for thiseffect. Restricting patient samples to a period of 10 daysafter admittance is not enough to assure the same stage ofdisease, and illness is not equally manifest in all patients(for example because of therapy). Because of the rapidchanges of enzyme blood concentrations after a heartattack the former phenomenon is expected to be import-ant in the heart group. In the following we will be mainlyconcerned with the diverting ’tails’ of the classes.

The plot shows that the KIDNEY group is well separatedfrom the other groups, but that little separation is seenbetween the LIVER and HEART classes. In variable-variable plots in which best-modelling variables are

chosen, this is likely to occur, unless the chosen variablesare important in all models, but show different levels foreach class. A KIDNEY group outlier (marked with an

arrow in all figures) is clearly detected amidst HEARTand LIVER patients. It should not be included in theKIDNEY class model construction. Another atypicalpatient, coming from the LIVER class, is noticed becauseof an exceptionally high ASAT level (circled in thefigures). Although this patient definitely belongs to theLIVER class, the rarity of the ASAT blood concentrationmay disturb the construction of an adequate LIVERmodel. It is advisable to construct a model that does notinclude this patient, and therefore fits the more typicalmembers of the LIVER group.

Another approach to bivariatc plotting of the data is theuse of principal components instead of variables in theplots. The plot of the first two PCs against each other isshown in figure 3. The second PC, at the abscissa,consists largely of crcatininc, so KIDNEY patientsspread along this axis. ASAT and ALAT are the maincontributors to the first PC, therefore some similarity ofthis plot with the previous one is seen.

"-I

113lit

PC-2

73

Page 8: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. Hemel et al. Exploring multivariate clinical-chemistry routine data

The Non-Linear Map ofthe data presents a different viewon its distribution. The NLM plot offigure 4 is calculatedwith the first two eigenvectors as initial projection plane.Because of the high computational demands only arandom selection of 47 patients (plus the KIDNEYoutlier for illustration) is plotted. In this plot interindivi-dual distances are best retained. KIDNEY is wellseparated from LIVER, but HEART is overlapped byboth other classes. As only part of the data.is plotted thedegree of overlap might be underestimated.

11

Figure 4. Non-linear map, initiated with first two eigen vectors.1 liver; 2 kidney; 3 heart (based on only 48).

As an extension of the bivariate plottings shown before,pseudo-three-dimensional plots of three variables as wellas using three PCs are drawn in the figures 5 and 6. Thesefigures are extracts of larger sets of drawings. Everycomplete series of plots consists of(l) an overview: thedata as seen if looking along the (1,1,1) vector in thedirection of the origin; (2) for each class separately a plotdrawn from the same viewpoint, with the fitted PC-planein it; (3) a rotated view on this space, one for every class,to look in a direction parallel to the PC-plane fitted to thisclass. Only points from the class at hand are plotted; (4) arotated view of this space, one for every class, to lookperpendicular to the PC-plane fitted to this class. Onlypoints from the class at hand are drawn in this plot.

The selection of the variables can be based on severalcriteria. Using the modelling powers from table 3creatinine-LDH-ALAT and creatinine-LDH-ASATwere most prominent. Figure 5 represents the former set.For three classes, as in the data under consideration, acomplete series would result in 10 plots, of which only aselection (complete data and LIVER class) is shown.Apart from an overview, as presented by plots (a) and (b),this series provide information about the sufficiency ofthetwo-dimensional PC-models (c), and the distributionwithin the model (d). Ifa line or point shape model wouldbe sufficient for the data, this will be apparent in plot (d).

From these plots it can be seen that the classes are largelyseparable, but that they meet and overlap in a centralpart, the place where the healthy population would befound. In figure 5(b) the LIVER class is drawn sepa-rately. A two-dimensional PC model is tentatively fittedto it. This plot gives some insight into the relative positionof the LIVER gro.up. Plot 5(c) offers a view on the LIVERclass parallel to the PC-plane. Deviations from the modelcan bejudged. Apart from two patients a two dimensionalmodel seems to be adequate in this space. In plot 5(d),which offers a view perpendicular to the class model, theusefulness of a two-dimensional model as opposed to aone-dimensional one can be judged. In our opinion bothdimensions are sufficiently ’used’ to retain them. In thisway an impression can be obtained of the validity of theclass model, and of its location in space relative to theother classes. As only the central part of the plot is drawnthe KIDNEY outlier is not seen. The other atypicalpatient (from the LIVER class) is apparently lessextreme in this plot (without ASAT) than was seenbefore. However, plot 5(d) shows him to be rathereccentric.

The series of plots numbered-6 gives the same views onthe data as those from figure 5, but this time thethree-dimensional space spanned by the first three PCscalculated from the whole data set is used to look at. Theconclusions are about the same. The usefulness of (atleast) two dimensions for the LIVER class model showseven more pronounced. In figure 6(a) the KIDNEYoutlier detected earlier is caught again. The loss as

compared to the previous plots is the interpretability: theaxes are combinations of the original axes and cannot benamed easily. The gain is that the directions that containmost variance in the data are presented in the plot. Ifclass distinction is the major source of variance a betterseparation between classes may be seen in these PC-plotsthan in the set of variable plots.

To interpret simultaneously as much of the multivariateinformation as possible, faces were drawn according to avariation on the Chernoff faces by Frith [21]. As theselection of variables (a maximum of nine) to betranslated to face features might be crucial for therecognition, the first nine PCs of the entire data set wereused instead of the original variables. In this way at leastmost of the variance is present in the pictures. Most ofthese faces were very difficult to classify. Therefore onlyfive stereotypical faces from each class are portrayed infigure 7. These faces were selected as stereotypical usingthe other techniques mentioned above: the patients comefrom the tails ofthe ’class clouds’. Note that these patientsare not the most typical ones, but, rather, the mostextreme cases. Even so, LIVER and HEART appear tobe similar while only the ’extreme’ KIDNEY patientsdiffer reasonably from them. The distinction that isvisible is similar to the distinction in the PC plot (figure3). Thus the PCs numbered three and higher do not seemto have much influence.

Also shown in figure 7 is the KIDNEY class member thatresembled markedly the LIVER family (X). This patientis the outlier that was detected also in previous displays.

74

Page 9: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. Hemel et al. Exploring multivariate clinical chemistry routine data

(A) (B)

(c)

Figure 5. Pseudo-3D plots, using ASA T, LDH and creatinine (CR) (all scaled) as the axes.

Fq liver; [X kidney; *= heart.(a) Overview ofcomplete data.(b) 0 Zas LZV, fom am epoz.(c) Only class LIVER, in parallel view on PC-plane ofLIVERclass.

(d) Only class LIVER, in perpendicular view on PC-plane ofLIVER class.To improve readability the axes are cut at 50 (scaled).

The atypical LIVER patient is denoted with Y. This faceis seen to differ considerably from all other faces shownhere. The patients that arc not ’portrayed’ in figure 7have faces that are somewhere in between the shownpictures.

It is possible that another selection of variables to beincluded in the face features improves the distinction, butas long as it is not known which features contribute mostto visual recognition, even with an optimal set ofvariables, all permutations must be tried in order to

identify the best clustering ordering of the set.

Conclusions

The aim of this study was to get an impression of theseparability of HEART, LIVER and KIDNEY patientson basis of 20 routine assays, and also of the presence ofatypical cases in the data, and of the applicability andhomogeneity of low dimensional class models. Sinceseparability is not the only aim, the selection of variablesshould not be based on discrimination alone. If discrimi-nation between classes is the leading guide for an

exploration., the selection of the variables to be plottedmay be biased. These plots are not characteristic for the

75

Page 10: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. Hemel et al. Exploring multivariate clinical chemistry routine data

(A) (B)

/,,’(D)

/’/ //

Figure 6. Pseudo-3D plots, using PC-l, PC-2, and PC-3 (afterscaling) as the axes.7 liver; / kidney; heart.(a) Overview ofcomplete data.(b) Only class LIVER, from same viewpoint.

(c) Only class LIVER, in parallel view on PC-plane ofLIVERclass.(d) Only class Liver, in perpendicular view on PC-plane ofLIVER class.To improve readability the axes are cut al SO (scaled).

76

Page 11: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. Hemel et al. Exploring multivariate clinical chemistry routine data

class models and their distances. Atypical objects cannotbe objectively recognized. In this paper, therefore, viewson the data are chosen that stress class characteristicsrather than class differences. The possibility of classifica-tion may be estimated conservatively.

As a means for data exploration, pictures of faces do notturn out to be particularly useful. Outliers may be easilyrecognized, but clustering of the classes is hardly visiblein the analysed data set. There is little theory available forranking of features according to ’recognizability’.

LIVER

For patient classification it is necessary that all othersources of variance than inter-class variation (pathology)are removed. Reproducibility scaling based on standarddeviations in a population of healthy persons reducesanalytical error and .random intra-patient and inter-patient fluctuations. Differences between patients thatare in a different stage of their illness are not sufficientlyreduced by sampling patients within 10 days of admit-tance. This is especially relevant with patients fromdisease groups with rapidly changing patterns (forexample myocardial infarction). This is a serious prob-lem which gives classification of relatively stable diseasegroups, and clearly staged diseases.; the best chance forsuccess. Patients under effective therapy are not alwaysrecognized as ill; it is a matter ofopinion ifthey should be.Both effects cause the data set to be shaped like a spider,with a large proportion of seemingly ’healthy’ patients,and offshoots with the more seriously ill ones.

Figure 7. Representation ofpatients from the HEART, LIVERand KIDNEY classes as faces.X: outlier from KIDNEY class.Y: atypical LIVER case.

The selection of variables for the displays is largely basedon modelling powers as calculated by SIMCA-3B. Thesecalculations accommodate missing values. Some vari-ables that had to be omitted because of missing .valuesnevertheless contain valuable class modelling informa-tion. So a complete data set in which these variables areincluded can be expected to offer more informativedisplays. A large complete data set is being built now toinvestigate the use of the systematically missing variablescores.

Techniques that offer simultaneous display of threedimensions give more insight into the data than two-dimensional displays. It turns out to be easier for ahuman to combine these pseudo-three-dimensional plotsin his mind to a single image than a set of two-dimensional plots.

if the properties of the classes in relation to the assays(variables) is the subject of the study, pseudo-3-D plotswith three assays as basis serves interpretability. On theother hand, a more complete image of the separability islikely to result from 3-D plots of a three-PC space.

All graphical techniques make atypical values easilydetectable.

Acknowledgements

The authors would like to thank Professor Dr C. H. Gips,Professor Dr G. K. van der Hem, and Dr J. W. Viersmafor kindly giving access to their patient data.

References

1. VAN DER VOLT, H., Dooos, D. A., MF.S, M. and VANDE HAAR, G., Analytica Chimica Acta, 159 (1984), 159.

2. KYa, L., Talanta, 28 (1981), 871.3. COOMANS, D., BROECKAERT, I., JocnvFa, M. and

MASSART, D. L., Methods of Information in Medicine, 2(1983), 93.

4. SOLBERG, H. E., CRC Critical Reviews in Clinical LaboratorySciences 1978), 209.

5. COOMANS, D., JONCKHEER, M., MASSART, D. L.,BROECKAERT, I. and BLOCKX, P., Analytica Chimica Acta, 103(1978), 409.

6. VAN VARK, G. N., van PER SMAN, P. G. M., Z. Morph.Anthrop., 73 (1982), 21.

7. WOLD, S., ALBANO, C., DUNN III, W. J., ESBENSN, K.,HF,II3G, S., JOI-ANSSON, E. and SjCsTCM, M., in:Martens,J. (Ed.), Proceedings ofthe IUFOST Conference,Food Research and Data Analysis (Applied Science Publishers,London, 1983).

8. VAUZA, K., Pattern Recognition in Chemistry (Springer-Verlag, Berlin, 1980), 97.

9. TUKFY, J. W., Exploratory Data Analysis (Addison-Wesley,Reading, Massachusetts, 1977).

10. KOWACSKI, B. R. and BF.N)F.R, C. F., Journal ofthe AmericanChemical Society, 95/3 (1973), 686.

11. HF.FJ,, J. B., HINt)IKS, F. A. and VAN DF,R SLm, W.,Journal ofAutomatic Chemistry, 7 (1985), 20.

12. Wo,D, S., Pattern Recognition, 8 (1976), 127.13. HARPER, A. M., DUEWER, D. L., KOWALSKI, B. R. and

FASCHING, J. g., in Kowalski, B. R. (Ed.), Chemometrics:Theory and Application, ACS Symposium Series 52(American Chemical Society, Washington D.C., 1977), p.14.

77

Page 12: Exploring multivariate clinical routine data concerning ...downloads.hindawi.com/journals/jammc/1988/176716.pdf · JournalofAutomaticChemistry,Vol. 10, No.2 (April-June 1988),pp

Jan B. Hemel et al. Exploring multivariate clinical chemistry routine data

14. HEMEL, J. B. and VAN DER VOET, H., Analytica Chimica Acta,191 (1986), 33.

15. CHAMBERS, J. M., CLEAVELAND, W. S., KLEINER, B., andTUKEY, P. A., Graphical Methods for Data Analysis (Wads-worth International Group, Belmont, Duxbury Press,Boston, 1983).

16. WAN, P. C. C. (Ed), Graphical Representation ofMultivariateData (Academic Press, New York, 1978).

17. EVERITT, B., Graphical Techniques for Multivariate Data,(Heinemann, London, 1978).

18. CHI-HSIUNG LxN and HWA-FU CHEN, Analytical Chemistry, 49(1977), 1357.

19. CHERNOFF, H., Journal ofthe American Statistical Association, 68(1973), 361.

20. BRUCKNER, L. A., in Wang, P. C. C. (Ed), GraphicalRepresentation of Multivariate Data (Academic Press, NewYork, 1978), p. 93.

21. EVERITT, B., Graphical Techniques for Multivariate Data(Heinemann, London, 1978), p. 87.

22.ALBANO, C., BLOMQVIST, G., COOMANS, D., DVNN III, W.J.,EDLUND, g., ELIASSON, B., HELLBERG, S., JOHANSSON, E.,NORDiN, B., JOKNELS, D., SJ6STR6M, M., WOLD, H. andWOLD, S., in H6skuldsson A. and Esbensen, K. (Eds),proceedings of the Symposium on Applied Statistics(NEUCC, RECAU, RECKU and Danish Society ofTheoretical Statistics, Copenhagen, 1981).

23. GROTCH, S. L., in Kowalski, B. R. (Ed), Chemometrics.Mathematics and Statistics in Chemistry (D. Reidel, Dordrecht,Boston, 1984), p. 439.

24. HEMEL, J. B., VAN BER VOET, H., HINDRIKS, F. R. and VANDER SLIK, W., Analytica Chimica Acta, 193 (1987), 255.

THERMAL ANALYSIS IN RESEARCH AND PRODUCTION- AN INTENSIVE SHORT COURSE

To be heldfrom 16 to 18 May in Basel, Switzerland

Key topics are:

InstrumentationMeasurable parametersOrganic compoundsPolymers

ThermoplasticsHigh temperature polymersElastomersThermosetsFibresCompositesCopolymers, blends and polymer miscibilityCompounding selection of additives

Pharmaceuticals

InorganicsCatalystsCeramics and glassesMetals and alloysSemiconductorsSuperconductors

Stability determinationAging of materialsHazards evaluationOptimizing processing conditionsEnergyQuality controlProblem solving

The faculty include:

Patrick K. Gallagher (AT&T Bell Laboratories, USA)Daniele Giron (Sandoz Ltd, Switzerland)Edith A. Turi (Polytechnic University, USA)Robert A. Weiss (University of Connecticut, USA).

On 17 May there will be a commercial exhibition combined with a social hour and consultation.

Each registrant will be provided with a workbook containing the copies of the projected slides and therecommended bibliography. Each will have the opportunity to discuss informally with the lecturers problems oftheir particular interest.

Forfurther information, contact Prof Edith A. Turi, Polytechnic University, 5 Oxford Drive, Livingston, New Jersey 07039,USA, or the Technomic Publishing A G, Elisabethenstrasse 15, CH-4051 Basel, Switzerland.

78