
Chapter 8: Classification and Clustering Methods

8.1 Introduction

8.2 Parametric classification approaches

8.2.1 Distance between measurements

8.2.2 Statistical classification

8.2.3 OLS regression method

8.2.4 Discriminant function analysis

8.2.5 Bayesian classification

8.3 Heuristic classification methods

8.3.1 Rule-based methods

8.3.2 Decision trees

8.3.3 k nearest neighbors

8.4 Classification and regression trees (CART) and treed regression

8.5 Clustering methods

8.5.1 Types of clustering methods

8.5.2 Partitional clustering methods

8.5.3 Hierarchical clustering methods


8.1 Introduction

Cluster analysis involves several procedures by which a group of samples (or multivariate observations) can be clustered, partitioned, or separated into sub-sets of greater homogeneity, i.e., based on some pre-determined similarity criteria.

• Examples:

- clustering individuals based on their similarities with respect to physical attributes or mental attitudes or medical problems;

- multivariate performance data of a piece of mechanical equipment can be separated into sets which represent normal operation as against faulty operation.

Thus, cluster analysis reveals inter-relationships between samples which can serve to group them in situations where one does not know the number of sub-groups beforehand.

Often, clustering results depend largely on the clustering method used.

 


• Classification analysis applies to situations where the groups are known beforehand.

• The intent is to identify models which best characterize the boundaries between groups, so that future objects can be allocated into the appropriate group.

• Since the groups are pre-determined, classification problems are somewhat simpler to analyze than clustering problems.

• The challenge in classification modeling is dealing with the misclassification rate of objects.

• Three types of classification methods are briefly treated:

- parametric methods (involving statistical, ordinary least squares, discriminant analysis and Bayesian techniques),

- heuristic methods (rule-based, decision trees, k nearest neighbors), and

- non-parametric tree-based methods (classification and regression trees).


8.2 Parametric classification approaches

8.2.1 Distance between measurements

The Euclidean distance between two objects (x1, y1) and (x2, y2) plotted on Cartesian coordinates is characterized, in two dimensions, by:

$d = [(x_2 - x_1)^2 + (y_2 - y_1)^2]^{1/2}$

In general, for p variables, the generalized Euclidean distance from object i to object j is:

$d_{ij} = \left[ \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right]^{1/2}$

where $x_{ik}$ is the value of the variable $X_k$ for object i and $x_{jk}$ is the value of the same variable for object j.

The distance term will be affected by the magnitude of the variables, and the one with the largest numerical values will overwhelm the variation in the others. Thus, some sort of normalization is warranted. One common approach is to equalize the variances by defining a new variable $(x_1/s_1)$, where $s_1^2$ is an estimate of the variance of variable $x_1$.


Example 8.2.1. Using distance measures for evaluating canine samples

A biologist wishes to evaluate whether the modern dog in Thailand descended from prehistoric dogs of the same region or was interbred with similar dogs which migrated from nearby China or India.

The basis of this evaluation will be six measurements all related to the mandible or lower jaw of the dog:

x1 – breadth of mandible, x2 – height of mandible below the first molar, x3 – length of first molar, x4 – breadth of first molar, x5 – length from first to third molar, x6 – length from first to fourth premolar.


Table 8.1 Mean mandible (or lower jaw) measurements of six variables for four canine groups (modified example from Higham et al., 1980)

                     x1      x2      x3      x4      x5      x6
Modern dog           9.7     21.0    19.4    7.7     32.0    36.5
Chinese wolf         13.5    27.3    26.8    10.6    41.9    48.1
Indian wolf          11.5    24.3    24.5    9.3     40.0    44.6
Prehistoric dog      10.3    22.1    19.1    8.1     32.2    35.0

Mean                 11.25   23.675  22.45   8.925   36.525  41.05
Standard deviation   1.68    2.78    3.81    1.31    5.17    6.31

Table 8.2 Standardized measurements

                     z1      z2      z3      z4      z5      z6
Modern dog           -0.922  -0.963  -0.800  -0.937  -0.875  -0.721
Chinese wolf         1.342   1.304   1.140   1.281   1.040   1.117
Indian wolf          0.149   0.225   0.537   0.287   0.672   0.562
Prehistoric dog      -0.567  -0.567  -0.878  -0.631  -0.837  -0.958

The measurements have to be standardized, and so the mean and standard deviation of each variable across the groups are determined. For example, the standardized value for the modern dog is z1 = (9.7 - 11.25)/1.68 = -0.922, and so on.


Table 8.3 Euclidean distances between canine groups

                  Modern dog   Chinese wolf   Indian wolf   Prehistoric dog
Modern dog        -
Chinese wolf      5.100        -
Indian wolf       3.145        2.094          -
Prehistoric dog   0.665        4.765          2.928         -

Finally, the Euclidean distances among all groups are computed as shown. It is clear that prehistoric dogs are similar to modern ones because the distance between them is much smaller than that of any other pair. Next in terms of similarity are the Chinese and Indian wolves.

Other measures of distance:
- Manhattan measure: based on absolute differences
- Mahalanobis measure: superior when variables are correlated
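The calculation of Tables 8.2 and 8.3 can be scripted directly. Below is a minimal Python sketch (illustrative, not from the book) that standardizes the Table 8.1 measurements and computes the pairwise Euclidean distances; the array layout and names are assumptions.

```python
import numpy as np

groups = ["Modern dog", "Chinese wolf", "Indian wolf", "Prehistoric dog"]
X = np.array([
    [ 9.7, 21.0, 19.4,  7.7, 32.0, 36.5],   # Modern dog (Table 8.1)
    [13.5, 27.3, 26.8, 10.6, 41.9, 48.1],   # Chinese wolf
    [11.5, 24.3, 24.5,  9.3, 40.0, 44.6],   # Indian wolf
    [10.3, 22.1, 19.1,  8.1, 32.2, 35.0],   # Prehistoric dog
])

# Standardize each variable across the four groups (reproduces Table 8.2)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Pairwise Euclidean distances between the standardized groups (Table 8.3)
for i in range(len(groups)):
    for j in range(i):
        d = np.sqrt(((Z[i] - Z[j]) ** 2).sum())
        print(f"{groups[i]:16s} - {groups[j]:16s}: {d:.3f}")
```

Running this reproduces, up to rounding, the values of Table 8.3 (e.g., about 0.66 between the modern and prehistoric dogs).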


Fig. 8.1 Errors of misclassification for the univariate case (distributions of the measured variable for Group A and Group B with a cut-off value; Group B objects falling on Group A's side of the cut-off are misclassified into Group A, and vice versa).

• When the two distributions are similar and equal misclassification rates are sought, the cut-off value or score is selected at the intersection point of the two distributions.
• When the two distributions are not similar, such a cut-off value will result in different misclassification errors for the two groups.

8.2.2 Statistical classification

Classification methods provide the necessary methodology to: (i) statistically distinguish or "discriminate" between two or more groups when one knows beforehand that such groupings exist, and (ii) subsequently assign a future unclassified observation to a specific group with the smallest probability of error.


Table 8.4 Data table specifying type and associated EUI and the results of the classification analysis (only misclassified ones are indicated)

Building #   Type   EUI (kBtu/ft2/yr)   Phase      Assuming cut-off score of 38.2
C1           O      40.1                Training
C2           O      41.4                "
C3           O      38.7                "
C4           O      37.5                "          Misclassified
C5           O      43.0                "
C6           E      37.4                "
C7           E      38.3                "          Misclassified
C8           E      36.9                "
C9           E      35.3                "
C10          E      36.1                "
C11          O      37.2                Testing    Misclassified
C12          O      39.2                "
C13          E      37.2                "
C14          E      36.3                "

Example 8.2.2. Statistical classification of office buildings

The objective is to distinguish between medium-sized office buildings which are ordinary (type O) and those which are energy efficient (type E), as judged by their "energy use index" (EUI).

The first 10 buildings (C1 to C10) are used for training (to determine the cut-off score), while the last four are used for testing (to determine the misclassification rate).


A simple first attempt at determining this value is to take it as the mid-point of the two group means.

-Average values:

for the first five buildings (C1-C5 are type O): 39.0

second five (C6-C10 are type E): 36.8

- If an average cut-off value of 38.2 is selected, one should expect the EUI for ordinary buildings to be higher than this value and that for energy efficient ones to be lower.

- The results listed in the last column of Table 8.4 indicate one misclassification in each category during the training phase.

- Thus, this simple-minded cutoff value may be acceptable since it leads to equal misclassification rates among both categories.

- Note that among the last 4 buildings (C11-C14) used for testing the selected cut-off score, one of the ordinary buildings is improperly classified.
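A minimal Python sketch (not from the book) of this mid-point cut-off rule follows; the EUI values are those listed in Table 8.4. Note that group means recomputed from these listed values may differ slightly from the rounded figures quoted above, so the book's cut-off of 38.2 is used directly for the classification check.

```python
ordinary_train  = [40.1, 41.4, 38.7, 37.5, 43.0]    # C1-C5, type O
efficient_train = [37.4, 38.3, 36.9, 35.3, 36.1]    # C6-C10, type E

mean_O = sum(ordinary_train) / len(ordinary_train)
mean_E = sum(efficient_train) / len(efficient_train)
midpoint = (mean_O + mean_E) / 2     # mid-point rule; the book selects 38.2
cutoff = 38.2                        # cut-off score used in Table 8.4

# EUI above the cut-off -> ordinary (O); below -> energy efficient (E)
testing = {"C11": ("O", 37.2), "C12": ("O", 39.2), "C13": ("E", 37.2), "C14": ("E", 36.3)}
for bldg, (actual, eui) in testing.items():
    predicted = "O" if eui > cutoff else "E"
    note = "" if predicted == actual else "  <- misclassified"
    print(f"{bldg}: EUI={eui}, predicted {predicted}{note}")
```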


Fig. 8.2 Bivariate classification (normalized variables x1 and x2) using the distance approach, showing the centers (CA, CB) and circular boundaries of the two groups, Class A and Class B.

Classification using the simple distance measure: two groups are shown in Fig. 8.2 with normalized variables.
- During training, the centers and separating boundaries of each class are determined.
- The two circles (shown continuous) encircle all the points.
- A future observation will be classified into the group whose class center is closest.
However, a deficiency is that some of the points will be misclassified.
- One solution for decreasing the misclassification rates is to reduce the boundaries (shown by dotted circles).
- However, some of the observations now fall outside the dotted circles, and these points cannot be classified into either Class A or Class B.
- Hence, the option of reducing the boundary diameters is acceptable only if one is willing to group certain points into a third class called "unable to classify".
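As an illustrative sketch of the distance-based rule just described (not the book's code), the snippet below assigns a new observation to the class with the nearest center, with an optional boundary radius that creates the "unable to classify" category; the function name and the radius handling are assumptions.

```python
import numpy as np

def nearest_center_classify(X_A, X_B, x_new, radius=None):
    """Assign x_new to the class (A or B) whose center is closest.
    If a boundary radius is given and the point lies outside it,
    return 'unable to classify' instead."""
    centers = {"A": X_A.mean(axis=0), "B": X_B.mean(axis=0)}
    dist = {label: np.linalg.norm(np.asarray(x_new) - c) for label, c in centers.items()}
    label = min(dist, key=dist.get)
    if radius is not None and dist[label] > radius:
        return "unable to classify"
    return label
```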


8.2.3 Ordinary Least Squares regression method

Fig. 8.3 Classification involves identifying a separating boundary between two known groups (shown as dots and crosses) based on two attributes (or variables or regressors) x1 and x2 which will minimize the misclassification of the points. The figure shows the separation line and the discriminant line for Groups A and B.

Regression methods can also be used to model differences between groups and thereby assign a future observation to a particular group.

- During regression, the response variable is a continuous one.
- During classification, the response variable must be (or be converted to) categorical, while the regressors may be continuous, discrete, or mixed.
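A minimal sketch (not from the book) of using OLS on a 0/1-coded class variable as a classifier, with 0.5 taken as the cut-off score; the helper name and data layout are illustrative assumptions.

```python
import numpy as np

def fit_ols_classifier(X, y):
    """Fit an OLS model to a 0/1-coded class variable and return a
    classifier that applies a 0.5 cut-off to the predicted score."""
    Xa = np.column_stack([np.ones(len(X)), X])       # add intercept column
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)    # ordinary least-squares fit
    return lambda x_new: int(np.r_[1.0, np.asarray(x_new)] @ beta > 0.5)
```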


Fig. 8.4 Linear (left) and piecewise linear (right) decision boundaries for two-dimensional data (color intensity versus alcohol content in %) used to classify the type of wine into one of three classes.


8.2.4 Discriminant Function Analysis

Linear discriminant function analysis (LDA) is similar to MLR analysis but approaches the problem differently.

The similarity lies in identifying a linear model from a set of p observed quantitative variables $x_i$, such as:

$z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p$

where z is called the discriminant score and the $w_i$ are the model weights. A model such as this allows multivariate observations $\mathbf{x}_i$ to be converted into univariate observations $z_i$.

However, the determination of the weights, which are similar to the model parameters during regression, is done differently. LDA seeks to determine weights that maximize the ratio of the between-class scatter to the within-class scatter:

$$\max \left\{ \frac{(\bar{z}_1 - \bar{z}_2)^2}{\operatorname{var}(z)} \right\} = \max \left\{ \frac{\text{squared distance between the means of } z}{\text{variance of } z} \right\} \qquad (8.4)$$

where $\bar{z}_1$ and $\bar{z}_2$ are the average values of $z_i$ for Group A and Group B respectively, and the variance of z is that of any one group, with the assumption that the variances of both groups are equal.

The method is optimal when the two classes are normally distributed with equal covariance matrices; even when they are not, the method gives satisfactory results.


The model, once identified, can be used for discrimination, i.e., to classify new observations as belonging to one or another group (for two groups only).

This is done by determining the threshold or separating score: new objects having scores larger than this score are assigned to one class, and those with lower scores are assigned to the other class.

If $\bar{z}_A$ and $\bar{z}_B$ are the mean discriminant scores of pre-classified samples from groups A and B, the optimal choice for the threshold score $z_{thres}$, when the two classes are of equal size, are distributed with similar variance, and equal misclassification rates are sought, is $z_{thres} = (\bar{z}_A + \bar{z}_B)/2$.

A new sample will be classified to one group or another depending on whether z is larger than or less than zthres.

Misclassification errors pertaining to one class can be reduced by appropriate weighting if the resulting consequences are more severe in one group as compared to the other.
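The two-group discriminant of Eq. 8.4 and the threshold rule above can be sketched directly with NumPy. This is an illustrative implementation (not the book's code); it assumes the two training matrices hold one observation per row and that equal misclassification rates are sought.

```python
import numpy as np

def fisher_lda(XA, XB):
    """Two-class linear discriminant: the weights maximize the ratio of
    between-class to within-class scatter (Eq. 8.4); the threshold is the
    midpoint of the mean discriminant scores of the two groups."""
    mA, mB = XA.mean(axis=0), XB.mean(axis=0)
    Sw = (np.cov(XA, rowvar=False) * (len(XA) - 1) +
          np.cov(XB, rowvar=False) * (len(XB) - 1))      # pooled within-class scatter
    w = np.linalg.solve(Sw, mB - mA)                     # discriminant weights
    z_thres = ((XA @ w).mean() + (XB @ w).mean()) / 2    # z_thres = (zA_bar + zB_bar)/2
    return w, z_thres

def classify(x_new, w, z_thres):
    # scores above the threshold are assigned to group B, below to group A
    return "B" if np.dot(x_new, w) > z_thres else "A"
```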


Table 8.5 Analysis results for chiller fault detection example using two different methods: OLS regression and the linear discriminant analysis (LDA)

Assigned grouping (Class) and regressor variables (COP, Tcd-sub, Tcd-app) with the resulting model scores:

          Class   COP     Tcd-sub (°C)   Tcd-app (°C)   OLS model score   LDA model score
Training  0       3.765   4.911          2.319          -0.08             -5.59
          0       3.405   3.778          1.822          -0.01             -4.91
          0       2.425   2.611          1.009           0.09             -3.95
          0       4.512   5.800          3.376           0.04             -4.49
          0       4.748   4.589          2.752          -0.09             -5.71
          0       4.513   3.356          1.892          -0.19             -6.71
          0       3.503   2.244          1.272          -0.01             -4.96
          0       3.593   4.878          2.706           0.14             -3.48
          0       3.252   3.700          1.720           0.00             -4.83
          0       2.463   2.578          1.102           0.13             -3.60
          0       4.274   5.422          3.323           0.14             -3.49
          0       4.684   4.989          3.140           0.03             -4.58
          0       4.641   3.589          2.188          -0.14             -6.18
          0       3.038   1.989          1.061           0.06             -4.26
          0       3.763   4.656          2.687           0.13             -3.62
          0       3.342   3.456          1.926           0.11             -3.79
          0       2.526   2.600          1.108           0.11             -3.77
          0       4.411   5.411          3.383           0.13             -3.56
          0       4.029   3.844          2.128          -0.05             -5.31
          0       4.443   3.556          2.121          -0.11             -5.91
          0       3.151   2.333          1.224           0.04             -4.42
          1       3.587   6.656          4.497           0.62              1.14
          1       3.198   5.767          3.881           0.60              0.99
          1       2.416   4.333          2.793           0.58              0.74

(Portion of table)


Example 8.2.3. Using discriminant analysis to distinguish fault-free and faulty behavior of chillers

MLR and LDA are used to classify two data sets representative of normal and faulty operation of a large centrifugal chiller using three regressor variables.

The two data sets consist of 27 operating points each; however, 21 points are used for training the model while the remaining points are used for evaluating the classification model.

Fault-free data is assigned a class value of 0 while faulty data a value of 1.

Models identified from training data used to predict class membership for testing data.

The cut-off score is 0.5 for both approaches, i.e., if calculated score is less than 0.5, then the observation is deemed to belong to the fault-free behavior, and vice versa.


Fig. 8.5 Scatter plot of the three variables (COP, Tcd_sub and Tcd_app) coded by class score (0 or 1). Fault-free data (coded 0) is shown as diamonds while faulty data (coded 1) is shown as crosses. Clearly there is no overlap between the data sets, and the analysis indicates no misclassified data points.

The following models were identified (with all coefficients being statistically significant):

OLS: z = 1.03467 - 0.33752*COP - 0.3446*Tcd-sub + 0.74493*Tcd-app

LDA: z = 2.1803 - 2.51329*COP - 1.63738*Tcd-sub + 4.19767*Tcd-app


Fig. 8.6 Observed versus predicted values of the coded "class" variable for the OLS model. Though an OLS model can be used for classification, it is not really meant for this purpose. This is illustrated by the poor correspondence between the observed and predicted values: the observed values can assume numerical values of either 0 or 1 only, while the values predicted by the model range from about -0.2 to 1.2.


Several studies use standardized regressors (with zero mean and unit variance) to identify the discriminant function. Others argue otherwise.

One difference is that when using standardized variables, the discriminant function would not need an intercept term, while this is needed when using untransformed variables.

Though LDA is widely used for classification problems, it is increasingly being replaced by logistic regression since the latter:

- makes fewer assumptions on distribution of variables (so more flexible)

- more robust statistically when dealing with actual data.

- more parsimonious

- value of the weights easier to interpret.

A drawback (if it is one!) is that the model weights have to be estimated by the maximum likelihood method.

Summary of example results: the LDA model scores also fall on either side of 0.5 but are magnified compared to the OLS scores, leading to more robust classification. In this example, there are no misclassified data points for either model during either the training or the testing period.


8.2.5 Bayesian Classification

Recall: Bayesian statistics provide the formal manner by which prior opinion expressed as probabilities can be revised in the light of new information (from additional data collected) to yield posterior probabilities.

It can also be used for classification tasks.

The simplified or Naïve Bayes method assumes that the predictors are statistically independent and uses prior probabilities for training the model, which subsequently can be used along with the likelihood function of a new sample to classify the sample into the most likely group.

- Advantages: training and classification results are easily interpreted; can handle a large number of predictors; easy to use; handles missing data well; requires little computational effort.

- Disadvantages: does not handle continuous data well; does not always yield satisfactory results.

Despite these limitations, it is a useful analysis approach to have in one’s toolkit.


Table 8.6 Count data of the 200 samples collected and calculated probabilities of attributes (Example 8.2.4)

Attribute        Number of samples          Calculated probabilities of attributes
                 Poor   Average   Good      Poor    Average   Good
Type
  Bituminous     72     20        8         0.72    0.20      0.08
  Anthracite     0      44        56        0.0     0.44      0.56
Carbon %
  50-60%         41     0         0         41/72   0         0
  60-70%         31     42        0         31/72   42/64     0
  70-80%         0      22        28        0       22/64     28/64
  80-90%         0      0         36        0       0         36/64
Total            72     64        64        -       -         -

Example 8.2.4. Bayesian classification of a coal sample based on carbon content

A power plant gets two types of coal: bituminous and anthracite.

Each of these two types can contain different fixed carbon content depending on the time at which and the location from where the sample was mined.

Further, each of these types of coal can be assigned into one of three categories: Poor, Average and Good depending on the carbon content whose thresholds are different for the two types of coal. For example, a bituminous sample can be graded as “good” while an anthracite sample can be graded as “average” even though both samples may have the same carbon content.

Training Data Set


Table 8.7 Calculation of the prior, sample and posterior probabilities

Grade        Prior probabilities          Sample probabilities              Likelihood               Posterior probabilities
Poor (p)     p(p) = (41+31)/200 = 0.36    p(s|p) = 0.72 x 0 = 0             0.36 x 0 = 0             0/0.0332 = 0
Average (a)  p(a) = (42+22)/200 = 0.32    p(s|a) = 0.20 x (22/64) = 0.0687  0.32 x 0.0687 = 0.022    0.022/0.0332 = 0.663
Good (g)     p(g) = (28+36)/200 = 0.32    p(s|g) = 0.08 x (28/64) = 0.035   0.32 x 0.035 = 0.0112    0.0112/0.0332 = 0.337
                                                                            Sum = 0.0332

The power plant operator wants to classify a new sample of bituminous carbon which is found to contain 70-80% carbon content.

- The sample probabilities are shown in the third column of Table 8.7.

- The values in the likelihood column add up to 0.0332, which is used to determine the actual posterior probabilities shown in the last column.

- The category which has the highest posterior probability can then be identified. Thus, the new sample will be classified as "average".
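The Table 8.7 calculation can be reproduced in a few lines of Python. This sketch (not from the book) simply multiplies the prior for each grade by the two attribute probabilities read off Table 8.6 for a bituminous sample with 70-80% carbon content, and then normalizes.

```python
# Naive Bayes posterior calculation of Table 8.7 (bituminous, 70-80% carbon)
priors   = {"Poor": 72/200, "Average": 64/200, "Good": 64/200}   # from the training counts
p_type   = {"Poor": 0.72,  "Average": 0.20,   "Good": 0.08}      # bituminous row, Table 8.6
p_carbon = {"Poor": 0/72,  "Average": 22/64,  "Good": 28/64}     # 70-80% carbon row, Table 8.6

likelihood = {g: priors[g] * p_type[g] * p_carbon[g] for g in priors}
total = sum(likelihood.values())                                  # ~0.0332
posterior = {g: likelihood[g] / total for g in priors}
print(posterior)   # Poor ~0, Average ~0.66, Good ~0.34 -> classify as "average"
```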


8.3 Heuristic Classification Methods

8.3.1 Rule Based Methods

• Simplest method: uses "if-then" rules.
• Such classification rules consist of the "if" or antecedent part and the "then" or consequent part of the rule.
• These rules must cover all the possibilities, and every instance must be uniquely assigned to a particular group.
• Such a heuristic approach is widely used in several fields because of the ease of interpretation and implementation of the algorithm.


Example 8.3.1 Rule-based admission policy into the Yale medical school

The selection committee framed the following set of rules for interviewing applicants into the school, based on undergraduate (UG) GPA and MCAT verbal (V) and MCAT quantitative (Q) scores:

- If UG GPA < 3.47 and MCAT-V < 555, then Group A, reject

- If UG GPA < 3.47 and MCAT-V ≥ 555 and MCAT-Q < 655, then Group B, reject

- If UG GPA < 3.47 and MCAT-V ≥ 555 and MCAT-Q ≥ 655, then Group C, interview

- If UG GPA ≥ 3.47 and MCAT-V < 535, then Group D, reject

- If UG GPA ≥ 3.47 and MCAT-V ≥ 535, then Group E, interview.

For example, consider an applicant with UG GPA = 3.6 and MCAT-V = 525. He would fall under Group D and be rejected without an interview.

Thus, the pre-determined threshold or selection criteria of GPA, MCAT-V and MCAT-Q are in essence the classification model, while classification of a future applicant is straightforward.
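The rule set translates directly into code. The sketch below (illustrative, not the committee's actual software) encodes the five rules as nested if-then statements.

```python
def screen_applicant(gpa, mcat_v, mcat_q):
    """Return (group, decision) per the admission rules of Example 8.3.1."""
    if gpa < 3.47:
        if mcat_v < 555:
            return "A", "reject"
        if mcat_q < 655:
            return "B", "reject"
        return "C", "interview"
    if mcat_v < 535:
        return "D", "reject"
    return "E", "interview"

print(screen_applicant(3.6, 525, 600))   # ('D', 'reject'), as in the text
```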


8.3.2 Decision Trees

Fig. 8.7 Tree diagram for the medical school admission process with five terminal nodes, each representing a different group: the applicant pool of 727 is first split on GPA (< 3.47 or ≥ 3.47), then on MCAT-V and MCAT-Q, leading to the reject/interview Groups A through E of Example 8.3.1. This is a binary tree with three levels. How the applicant pool of 727 is screened at each stage is shown by the counts at each node.

Similar to probability trees, decision trees are a graphical way of dividing a decision problem into a hierarchical structure for easier understanding and analysis.

Decision trees are predictive modeling approaches which can be used for classification, clustering as well as for regression model building.

They essentially divide the space of the variables such that each branch can be associated with a different sub-region. A rule is associated with each node of the tree, and observations which satisfy the rule are assigned to the corresponding branch of the tree.

Though similar to if-then rules in their structure, decision trees are easier to comprehend in more complex situations, and are more efficient computationally.


8.3.3 k Nearest Neighbors (kNN)

• kNN is a conceptually simple approach widely used for classification.
• It is based on the distance measure, and requires a training data set of observations from different groups identified as such.
• A future object can be classified by determining the point closest to this new object, and simply assigning the new object to the group to which the closest point belongs. The classification is more robust if a few points are used rather than a single closest neighbor.

This, however, leads to the following issues which complicate the classification:
• how many closest points "k" should be used for the classification, and
• how to reconcile differences when the nearest neighbors come from different groups.

kNN is more of an algorithm than a clear-cut analytical procedure.

An allied classification method is the closest neighborhood scheme, where an object is classified into the group for which its distance from the center of that group is smallest compared to its distances from the centers of the other possible groups. Training would involve computing the centers of each group and the distances of individual objects from these centers.

• A redeeming feature of kNN is that it does not impose a priori any assumptions about the distribution from which the modeling sample is drawn.
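A minimal NumPy sketch (not from the book) of the kNN classification rule: compute Euclidean distances to the training points, take the k nearest, and vote. The function name and tie handling are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest training points
    (Euclidean distance)."""
    d = np.linalg.norm(X_train - np.asarray(x_new), axis=1)
    nearest = np.argsort(d)[:k]                 # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```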


Example 8.3.3. Using the k nearest neighbors approach to calculate uncertainty in building energy savings

This example illustrates how the nearest neighbor approach can be used to estimate the uncertainty in building energy savings after energy conservation measures (ECM) have been installed.

The analysis methodology consists of four steps:

(i) identify a baseline multivariate regression model for energy use against climatic and operating variables before the retrofits were implemented,

(ii) use this baseline model along with post-retrofit climatic and operating variables to predict the energy use $E_{pre,model}$ reflective of consumption during the pre-retrofit stage,

(iii) compute energy savings as the difference between the model-predicted baseline energy use and the actual measured energy use during the post-retrofit period, $E_{savings} = (E_{pre,model} - E_{post,meas})$, and

(iv) determine the uncertainty in the energy savings based on the multivariate baseline model goodness-of-fit (such as the RMSE) and the uncertainty in the post-retrofit measured energy use.


Unfortunately, step (iv) is not straightforward. Energy use models identified by global or year-long data do not adequately capture seasonal changes in energy use due to control and operation changes done to the various building systems since these variables do not appear explicitly as regressor variables. Hence, classical models identified from whole-year data are handicapped in this respect, and this often leads to model residuals that have different patterns during different times of the year.

Such improper residual behavior results in improper estimates of the uncertainty surrounding the measured savings. An alternative is to use the nearest neighbors approach, which relies on "local" model behavior as against global estimates such as the overall RMSE.

However, the k nearest neighbors approach requires two aspects to be defined specific to the problem at hand:
(i) definition of the distance measure, and
(ii) deciding on the number of neighbor points to select.


The uncertainty in this estimate is better characterized by identifying a certain number of days in the pre-retrofit period which closely match the specific values of the regressor set for the post-retrofit day j, and then determining the error distribution from this set of days. Thus, the method is applicable regardless of the type of model residual behavior encountered.

Not all regressors have the same effect on the response variable; hence, those that are more influential need to be weighted more heavily, and vice versa. The distance $d_{ij}$ between two given days i and j, specified by the sets of regressor variables $x_{k,i}$ and $x_{k,j}$, is defined as:

$$d_{ij} = \left[ \frac{1}{p} \sum_{k=1}^{p} w_k^2 \, (x_{k,i} - x_{k,j})^2 \right]^{1/2} \qquad (8.5)$$

where the weights $w_k$ are given in terms of the derivative of energy use with respect to the regressors:

$$w_k = \frac{\partial E_{pre,model}}{\partial x_k} \qquad (8.6)$$

The partial derivatives can be determined numerically by perturbation. Days that are at a given "energy distance" from a given day lie on an ellipsoid whose axis in the k-direction is proportional to $(1/w_k)$.
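A sketch of the weighted "energy distance" of Eqs. 8.5 and 8.6 for the two-regressor case used below (DBT and ResDPT). The gradient values are those quoted in Example 8.3.3, the function name is an illustrative assumption, and small differences from Table 8.8 arise from rounding.

```python
import numpy as np

w = np.array([5.0685, 7.606])      # w_k = dE_pre,model/dx_k for (DBT, ResDPT), by perturbation

def energy_distance(x_i, x_j, w):
    """d_ij of Eq. 8.5: weighted root-mean-square distance over the p regressors."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.sqrt(np.sum((w * diff) ** 2) / len(w))

x_post = (75.0, 5.0)               # post-retrofit day: DBT = 75 F, ResDPT = 5 F
x_pre  = (74.67, 4.76)             # closest pre-retrofit neighbour in Table 8.8
print(round(energy_distance(x_pre, x_post, w), 2))   # ~1.75 (Table 8.8 lists 1.78)
```

Sorting all pre-retrofit days by this distance and keeping the k closest then yields the local residual distribution used to set the uncertainty limits.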


Fig. 8.8 Illustration of the neighborhood concept for a baseline with two regressors: daily dry-bulb temperature DBT and dew point temperature DPT (both in degrees F).

If the DBT variable has more “weight” than DPT on the variation of the response variable, this would translate geometrically into an elliptic domain as shown.

The data set of “neighbor points” to the post datum point (75, 60) would consist of all points contained within the ellipse. Further, a given point within this ellipse may be assigned more “influence” the closer it is to the center of the ellipse (from Subbarao et al., 2011).


Fig. 8.9 Scatter plot of measured thermal cooling load Qc (10^6 Btu/day) versus daily DBT (degrees F).

The approach is illustrated with a simple example involving synthetic daily data of building energy use. Only two regressor variables are considered: (i) ambient air dry-bulb temperature (DBT), and (ii) dew point temperature (DPT).


Consider the case when one wishes to determine the uncertainty in the response variable corresponding to a set of operating conditions specified by DBT = 75°F and ResDPT = 5°F, which results in Qc = 233.88 MBtu/day. The gradients with respect to these regressors are:

$\partial Q_c / \partial(DBT) = 5.0685$  and  $\partial Q_c / \partial(ResDPT) = 7.606$

The "distance" statistic for each of the 249 days in our synthetic data set has been computed, and the data sorted by this statistic. The top 20 data points (with smallest distance) are shown in Table 8.8, with the last column listing the "distance" variable.

The 90% limits are -8.43 and 8.29 (shown bolded in Table 8.8) around the model-predicted value of 233.88 MBtu/day for the cooling energy use. In this case, the distribution is fairly symmetric, and one could report a local prediction value of (233.88 ± 8.3) MBtu/day at the 90% confidence level.

If the traditional method of reporting uncertainty were to be adopted, the RMSE would result in (± 9.44 MBtu/day) at the 90% confidence level.

Thus, using the k-nearest neighbors approach has led to some reduction in the uncertainty interval around the local prediction value; but more importantly, this estimate of uncertainty is more realistic and robust.


Table 8.8 Table showing how the model residual values can be used to ascertain pre-specified confidence levels of the response adopting a non-parametric approach (Example 8.3.3). Values shown of the regressor and response variables are for the 20 closest neighborhood points from a data set of 249 points. Reference point: DBT = 75°F and ResDPT = DPT − 55 = 5°F; model predictions determined using an ANN model (2-10-1 architecture). The "distance" and the model residual values are also shown. The residual values shown bolded are the 5% and 95% values.

#    DBT (°F)   ResDPT (°F)   Q_c_Meas (10^6 Btu/day)   Q_c_Model (10^6 Btu/day)   Residuals (10^6 Btu/day)   Distance
1    74.67      4.76          225.59                    233.88                     8.29 (bold)                1.78
2    75.42      4.22          236.36                    231.39                     -4.97                      4.47
3    77.08      5.45          240.21                    248.19                     7.98                       7.85
4    76.54      6.23          251.99                    252.94                     0.95                       8.62
5    77.00      4.08          239.01                    238.55                     -0.46                      8.71
6    77.46      4.32          241.97                    242.60                     0.64                       9.54
7    72.83      3.88          224.54                    217.61                     -6.93                      9.82
8    77.75      5.00          240.63                    247.13                     6.50                       9.86
9    75.04      2.88          221.23                    223.84                     2.61                       11.40
10   75.29      2.85          224.36                    222.07                     -2.29                      11.61
11   74.71      2.80          214.56                    220.89                     6.33                       11.89
12   78.08      6.08          252.04                    256.14                     4.10                       12.49
13   77.33      3.07          231.57                    234.58                     3.00                       13.32
14   71.42      3.37          210.83                    204.44                     -6.39                      15.53
15   78.83      3.61          238.85                    242.55                     3.70                       15.63
16   73.67      2.19          200.35                    209.02                     8.66                       15.87
17   79.42      3.60          250.10                    241.68                     -8.43 (bold)               17.54
18   73.00      1.62          213.35                    204.23                     -9.12                      19.55
19   79.96      2.74          240.66                    239.73                     -0.92                      21.53
20   78.42      1.37          224.75                    230.06                     5.32                       23.06


8.4 Classification and Regression Trees (CART)

• Constructing a tree is analogous to training in a model-building context, but here it involves:

(i) choosing the splitting attributes, i.e., the set of important variables to perform the splitting; in many engineering problems, this is a moot step,

(ii) ordering the splitting attributes, i.e., ranking them by order of importance in explaining the variation in the dependent variable,

(iii) deciding on the number of splits of the splitting attributes which is dictated by the domain or range of variation of that particular attribute,

(iv)defining the tree structure, i.e., number of nodes and branches,

(v) selecting stopping criteria which are a set of pre-defined rules which reveal that no further gain is being made in the model;

(vi) pruning which involves making modifications to the tree constructed using the training data so that it applies well to the testing data.


• CART is a non-parametric decision tree technique that can be applied either to classification or regression problems, depending on whether the dependent variable is categorical or numeric respectively.

• It is one of an increasing number of computer intensive methods which perform an exhaustive search to determine best tree size and configuration in multivariate data. While being a fully automatic method, it is flexible, powerful and parsimonious, i.e., it identifies a tree with the fewest number of branches.

• Another appeal of CART is that it chooses the splitting variables and splitting points that best discriminate between the outcome classes. The algorithm, however, suffers from the danger of over-fitting, and hence a cross-validation data set is essential.


• Most trees, including CART, are binary decision trees (i.e., the tree splits into two branches at each node), though this is not mandatory.

• Each branch of the tree ends in a terminal node while each observation falls into one and exactly one terminal node.

• The tree is created by an exhaustive search performed at each node to determine the best split. The computation stops when any further split does not improve the classification.

• Treed regression is very similar to CART except that the latter fits the mean of the dependent variable in each terminal node, while treed regression can assume any functional form.
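A minimal NumPy sketch (not from the book) of the exhaustive split search that CART performs at a single node for a numeric response: every candidate split point of a regressor is evaluated by the sum of squared errors (SSE) of the two resulting children. A full tree repeats this search recursively, over all candidate regressors, until a stopping criterion is met; the function name is an illustrative assumption.

```python
import numpy as np

def best_split(x, y):
    """Return (split_value, sse) for the single split of regressor x that
    minimizes the total within-node sum of squared errors of response y."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_value, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                              # no split between identical x values
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_value, best_sse = (x[i - 1] + x[i]) / 2, sse
    return best_value, best_sse
```

For treed regression, the same splitting search is used, but a regression model (rather than a simple mean) is fitted within each terminal node.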


Example 8.4.1. Using treed regression to model atmospheric ozone variation with climatic variables.

Cleveland (1994) presents data from 111 days in the New York City metropolitan region in the early 1970s consisting of:
- ozone concentration (an index of air pollution) in parts per billion (ppb), and
- three climatic variables: ambient temperature (in °F), wind speed (in mph) and solar radiation (in langleys).

It is the intent to develop a regression model for predicting ozone levels against the three variables.

One notes that though some sort of correlation exists, the scatter is considerable. An obvious approach is to use multiple regression with the inclusion of higher-order terms as necessary.

Fig. 8.10(a) Scatter plots of ozone versus climatic data (with permission from Cleveland, 1994)


Fig. 8.10(b) Scatter plots and linear regression models for the three terminal nodes of the treed regression model (with permission from Grimshaw, 1997)

An alternative, and in many cases, superior approach is to use treed regression.

This involves partitioning the spatial region into sub-regions, and identifying models for each region or terminal node separately.

A treed regression approach identifies three terminal nodes as shown:

(i) wind speed < 6 mph (representative of stagnant air conditions),

(ii) wind speed > 6 mph and ambient temperature < 82.5°F, and

(iii) wind speed > 6 mph and ambient temperature > 82.5°F.


Fig. 8.10(c) Treed regression model for predicting ozone level against climatic variables (with permission from Grimshaw, 1997)


• CART and treed regression are robust methods which are ideally suited for the analysis of complex data which can be numeric or categorical, involving nonlinear relationships, high-order interactions, and missing values in either response or regressor variables.

• Despite such difficulties, the methods are simple to understand and give easily interpretable results.

• Trees explain variation of a single response variable by repeatedly splitting the data into more homogeneous groups or spatial ranges, using combinations of explanatory variables that may be categorical and/or numeric.

• Each group is characterized by a typical value of the response variable, the number of observations in the group, and the values of the explanatory variables that define it.

• The tree is represented graphically, and this aids exploration and understanding.

Classification and regression trees have a wide range of applications, including scientific experiments, medical diagnosis, fraud detection, credit approval, and target marketing.


8.5 Clustering Methods

The aim of cluster analysis is to allocate a set of observations into groups which are similar or "close" to one another with respect to certain attribute(s) or characteristic(s).

Thus, an observation can be placed in one and only one cluster.

For example, performance data collected from mechanical equipment could be classified as representing good, faulty or uncertain operation.

• In general, the number of clusters is not predefined and has to be gleaned from the data set.
• This, and the fact that one does not have a training data set with which to build a model, make clustering a much more difficult problem than classification.
• A wide variety of clustering techniques and algorithms has been proposed, and there is no generally accepted best method.
• Some authors point out that, except when the clusters are clear-cut, the resulting clusters often depend on the analysis approach used and are somewhat subjective.
• Thus, there is often no single best result, and there exists the distinct possibility that different analysts will arrive at different results.


Broadly speaking, there are two types of clustering methods, both of which are based on distance algorithms where objects are clustered into groups depending on their relative closeness to each other:

(a) partitional clustering, where non-overlapping clusters are identified;

(b) hierarchical clustering, which allows one to identify the closeness of different objects at different levels of aggregation. Thus, one starts by identifying several lower-level clusters or groups, and then gradually merges these in a sequential manner depending on their relative closeness, until finally only one group results.

Both approaches rely, in essence, on identifying clusters which exhibit small within-cluster variation as against large between-cluster variation.

Several algorithms are available for cluster analysis.




Fig. 8.11 Schematic of two clusters with individual points shown as x. The within-cluster variation is the sum of the individual distances from the centroid to the points within the cluster, while the between-cluster variation is the distance between the two centroids.

8.5.2 Partitional Clustering Methods

Partitional clustering (or disjoint clustering) determines the optimal number of clusters by performing the analysis with different pre-selected numbers of clusters.

A widely used criterion is the within-cluster variation, i.e., a squared-error metric which measures the squared distance from each point within the cluster to the centroid of the cluster. Similarly, a between-cluster variation can be computed, representative of the distance from one cluster center to another. The ratio of the between-cluster variation to the average within-cluster variation is analogous to the F-ratio used in ANOVA tests.


Algorithm

- Start with an arbitrary number of cluster centers,
- Assign objects to what is deemed to be the nearest cluster center,
- Compute the F-ratio of the resulting clusters,
- Jiggle the objects back and forth between the clusters, each time re-calculating the means, so that the F-ratio is maximized or is sufficiently large.

It is recommended that this process be repeated with different seeds or initial centers since their initial selection may result in cluster formations which are localized. This tedious process can only be done by computers for most practical problems.

A slight variant of the above algorithm is the widely used k-means algorithm where, instead of an F-test, the sum of the squared errors is directly used for clustering. This is best illustrated with a simple two-dimensional example.


Example 8.5.1. k-means clustering algorithm

Consider five objects or points characterized by two Cartesian coordinates: x1 = (0,2), x2 = (0,0), x3 = (1.5,0), x4 = (5,0), and x5 = (5,2).

The process of clustering these five objects involves:

(a) Select an initial partition of k clusters containing randomly chosen samples and compute their centroids.
Say one selects two clusters and assigns C1 = (x1, x2, x4) and C2 = (x3, x5). Next, the centroids of the two clusters are determined:

M1 = {(0+0+5)/3, (2+0+0)/3} = {1.66, 0.66}
M2 = {(1.5+5)/2, (0+2)/2} = {3.25, 1.0}

(b) Compute the within-cluster variations:

e1^2 = [(0-1.66)^2 + (2-0.66)^2] + [(0-1.66)^2 + (0-0.66)^2] + [(5-1.66)^2 + (0-0.66)^2] = 19.36
e2^2 = [(1.5-3.25)^2 + (0-1)^2] + [(5-3.25)^2 + (2-1)^2] = 8.12

and the total error E^2 = e1^2 + e2^2 = 19.36 + 8.12 = 27.48

(c) Generate a new partition by assigning each sample to the closest cluster center.
For example, the distance of x1 from the centroid M1 is d(M1,x1) = (1.66^2 + 1.34^2)^(1/2) = 2.14, while d(M2,x1) = 3.40. Thus object x1 will be assigned to the group which has the smaller distance, namely C1. Similarly, one can compute distance measures of all other objects, and assign each object as shown in Table 8.9.


Table 8.9 Distance measures of the five objects with respect to the two groups

d(M1,x1) = 2.14    d(M2,x1) = 3.40    so assign x1 to C1
d(M1,x2) = 1.79    d(M2,x2) = 3.40    x2 to C1
d(M1,x3) = 0.83    d(M2,x3) = 2.01    x3 to C1
d(M1,x4) = 3.41    d(M2,x4) = 2.01    x4 to C2
d(M1,x5) = 3.60    d(M2,x5) = 2.01    x5 to C2

(d) Compute new cluster centers as the centroids of the clusters.
The new cluster centers are M1 = {0.5, 0.67} and M2 = {5.0, 1.0}.

(e) Repeat steps (b) and (c) until an optimum value is found or until the cluster membership stabilizes.
For the new clusters C1 = (x1, x2, x3) and C2 = (x4, x5), the within-cluster variations and the total squared error are: e1^2 = 4.17, e2^2 = 2.00, E^2 = 6.17. Thus, the total error has decreased significantly after just one iteration.
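The iterations of Example 8.5.1 can be reproduced with a short script. This sketch (not from the book) starts from the same initial partition and alternates the centroid and assignment steps until cluster membership stabilizes; the printed totals are approximately 27.5 and 6.2, matching the values above up to rounding.

```python
import numpy as np

X = np.array([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]], dtype=float)
labels = np.array([0, 0, 1, 0, 1])    # initial partition: x1,x2,x4 -> C1; x3,x5 -> C2

for it in range(10):
    # centroid step and current total squared error
    centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
    sse = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in (0, 1))
    print(f"iteration {it}: centroids={centroids.round(2).tolist()}, total E^2={sse:.2f}")
    # assignment step: move each point to its nearest centroid
    new_labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
    if np.array_equal(new_labels, labels):
        break                          # cluster membership has stabilized
    labels = new_labels
```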


Fig. 8.12 Visual clustering may not always lead to an optimal splitting

It is recommended that the data be plotted so that starting values of the cluster centers could be visually determined.

Though this is a good strategy in general, there are instances when this is not optimal.

Consider the data set in Fig. 8.12a, where one would intuitively draw the two clusters (dotted circles) as shown. However, it turns out that the split depicted in Fig. 8.12b results in a lower sum of squared errors, which is the better manner of performing the clustering.

Thus, initial definitions of cluster centers done visually have to be verified by analytical measures. Though the k-means clustering method is very popular, it is said to be sensitive to noise and outlier points.


8.5.3 Hierarchical Clustering Methods

Hierarchical clustering does not start by partitioning a set of objects into mutually exclusive clusters, but forms them sequentially in a nested fashion.

 

For example, the eight objects shown at the left of the tree diagram (also called dendrogram) are merged into clusters at different stages depending on their relative similarity.

This allows one to identify objects which are close to each other at different levels.


Fig. 8.14 Example of hierarchical agglomerative clustering of eight objects: (a) eight clusters, (b) five clusters, (c) four clusters, (e) two clusters, (f) one cluster.

The sets of objects (O1, O2), (O4, O5), and (O6, O7), are the most similar to each other, and are merged together resulting in the five cluster diagram in Fig. 8.14b. If one wishes to form four clusters, it is best to merge object O3 with the first set (Fig.8.14c). This merging is continued till all objects have been combined into a single undifferentiated group. This process of starting with individual objects and repeatedly merging nearest objects into clusters till one is left with a single cluster is referred to as agglomerative clustering.

Another approach, called divisive clustering, tackles the problem in the other direction: it starts by placing all objects in a single cluster and repeatedly splits the clusters in two until each object is placed in its own cluster. These two somewhat complementary approaches are akin to the forward and backward stepwise regression approaches. Note that the two approaches are not always consistent in the way they cluster a set of data.
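As an illustrative sketch (not from the book), agglomerative clustering is readily available in SciPy: linkage() performs the sequential merges and fcluster() cuts the resulting dendrogram at a chosen number of clusters. The coordinates below are made up for demonstration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.1], [1.1, 1.0],
              [3.0, 3.0], [3.1, 2.9], [3.2, 3.1], [5.0, 0.5]])

Z = linkage(X, method="average")                  # agglomerative merge history
labels = fcluster(Z, t=4, criterion="maxclust")   # cut the dendrogram into 4 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding tree diagram
```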


Hierarchical techniques are appropriate for instances when the data set has naturally-occurring or physically-based nested relationships, such as plant or animal taxonomies.

The partitional clustering algorithm is advantageous in applications involving large data sets for which hierarchical clustering is computationally complex.