Chapter 6: Regression Algorithms in Data Mining
Fit data
Time-series data: Forecast
Other data: Predict
Contents

Describes OLS (ordinary least squares) regression and logistic regression
Describes linear discriminant analysis and centroid discriminant analysis
Demonstrates techniques on small data sets
Reviews the real applications of each model
Shows the application of models to larger data sets
Use in Data Mining

Telecommunications industry: customer turnover (churn)
One of the major analytic models for classification problems

Linear regression
The standard: ordinary least squares regression
Can be used for discriminant analysis
Can apply stepwise regression

Nonlinear regression: more complex (but less reliable) data fitting
Logistic regression: when data are categorical (usually binary)
OLS Model

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

where Y is the dependent variable, β0 is the intercept term, β1 … βn are the coefficients for the n independent variables, and ε is the error term.
OLS Regression

Uses intercept and slope coefficients (β) to minimize squared error terms over all i observations
Fits the data with a linear model
Time-series data: observations over past periods; best-fit line (in terms of minimizing the sum of squared errors)
Regression Output
R2 : 0.987
Intercept: 0.642 t=0.286 P=0.776
Week: 5.086 t=53.27 P=0
Requests = 0.642 + 5.086*Week
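The slide's fitted line can be reproduced in spirit with an ordinary least-squares fit. The chapter's actual weekly data are not reprinted here, so the numbers below are illustrative only; the slide's own result is Requests = 0.642 + 5.086*Week with R2 = 0.987.

```python
import numpy as np

# Hypothetical weekly request counts (illustrative, not the chapter's data)
week = np.arange(1.0, 11.0)
requests = 5.0 * week + 1.0 + np.array(
    [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1])

# OLS: choose intercept and slope to minimize the sum of squared errors
slope, intercept = np.polyfit(week, requests, 1)
print(f"Requests = {intercept:.3f} + {slope:.3f}*Week")
```

The fitted coefficients land close to the true generating values (slope near 5, intercept near 1) because the added noise is small.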
Example

[Slide figure: computation of SSE, SST, and R2 for the sample data]
Example (continued)
A graph of the time-series model

[Figure: (X1) Requests vs. (X2) Pred_lmreg_1 — observed requests plotted against the regression's predictions]
Time-Series Forecast

[Figure: time-series prediction — requests (0 to 250) plotted over periods 0 to 50]
Regression Tests

Fit:
SSE: sum of squared errors (synonym: SSR, sum of squared residuals)
R2: proportion of variance explained by the model
Adjusted R2: adjusts the calculation to penalize for the number of independent variables

Significance:
F-test: test of overall model significance
t-test: test of significant difference between a model coefficient and zero
P: probability that the coefficient is zero (or at least on the other side of zero from the coefficient estimate)

See page 103
Regression Model Tests

SSE (sum of squared errors): for each observation, subtract the model value from the observed value, square the difference, and total over all observations
By itself means nothing
Can compare across models (lower is better)
Can be used to evaluate the proportion of variance in the data explained by the model

R2: ratio of the explained sum of squares (MSR) to the total sum of squares (SST)
SST = MSR + SSE
0 ≤ R2 ≤ 1

See page 104
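The SSE/SST decomposition can be checked in a few lines; the observed values and predictions below are hypothetical.

```python
import numpy as np

# Hypothetical observed values and model predictions
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
y_hat = np.array([9.5, 12.5, 15.5, 19.5, 23.0])

sse = np.sum((y - y_hat) ** 2)     # sum of squared errors (residuals)
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
msr = sst - sse                    # explained (model) sum of squares
r2 = msr / sst                     # proportion of variance explained
print(f"SSE={sse:.2f} SST={sst:.2f} R2={r2:.3f}")
```

For an OLS fit with an intercept, SST = MSR + SSE holds exactly; here MSR is simply taken as SST minus SSE.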
Multiple Regression

Can include more than one independent variable
Trade-off:
Too many variables: many spurious variables, overlapping information
Too few variables: miss important content
Adding variables always increases R2
Adjusted R2 penalizes for additional independent variables
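The adjusted R2 penalty follows directly from the standard formula. With n = 20 observations — a value inferred from the chapter's adjusted R2 figures, not stated on this slide — it reproduces the hiring example's numbers.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2: penalizes R^2 for the number of independent
    variables k, given n observations (standard formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hiring example: R2 = 0.252 with 5 variables gives a negative
# adjusted R2 when n = 20 (n inferred, not stated on the slide)
print(round(adjusted_r2(0.252, 20, 5), 3))   # -0.015
```

A negative adjusted R2 signals that the model explains less than the penalty for its variable count, which is exactly why the chapter's model is called weak.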
Example: Hiring Data

Dependent variable: Sales
Independent variables:
Years of Education
College GPA
Age
Gender
College Degree

See pages 104-105
Regression Model

Sales = 269025
- 17148*YrsEd   P = 0.175
- 7172*GPA      P = 0.812
+ 4331*Age      P = 0.116
- 23581*Male    P = 0.266
+ 31001*Degree  P = 0.450

R2 = 0.252, Adj R2 = -0.015
Weak model; no coefficient significant at 0.10
Improved Regression Model
Sales = 173284
- 9991*YrsEd P = 0.098*
+3537*Age P = 0.141
-18730*Male P = 0.328
R2 = 0.218 Adj R2 = 0.070
Logistic Regression

Data often ordinal or nominal
Regression based on continuous numbers is not appropriate; need dummy variables
Binary (either are or are not): LOGISTIC REGRESSION (probability of 1 or 0)
Two or more categories: DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)
Logistic Regression

For dependent variables that are nominal or ordinal
Probability of assigning case i to class j
Sigmoidal function (in English, an S curve from 0 to 1):

Pj = 1 / (1 + e^-(β0 + β1xi))
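The sigmoidal function can be sketched in a few lines (a minimal illustration; the scores fed in are arbitrary):

```python
import math

def logistic_probability(score):
    """Map a linear regression score (beta0 + sum of beta_i * x_i)
    to a probability between 0 and 1 via the S-shaped logistic curve."""
    return 1.0 / (1.0 + math.exp(-score))

# A score of 0 sits exactly at the curve's midpoint
print(logistic_probability(0.0))   # 0.5
```

Large positive scores approach probability 1, large negative scores approach 0, which is what makes the curve suitable for binary outcomes.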
Insurance Claim Model

Fraud = 81.824
- 2.778 * Age         P = 0.789
- 75.893 * Male       P = 0.758
+ 0.017 * Claim       P = 0.757
- 36.648 * Tickets    P = 0.824
+ 6.914 * Prior       P = 0.935
- 29.362 * Atty Smith P = 0.776

Can get the probability by running the score through the logistic formula

See pages 107-109
Linear Discriminant Analysis

Group objects into a predetermined set of outcome classes
Regression is one means of performing discriminant analysis:
2 groups: find a cutoff for the regression score
More than 2 groups: multiple cutoffs
Centroid Method (NOT regression)

Binary data
Divide the training set into two groups by binary outcome
Standardize the data to remove scales
Identify the mean of each independent variable by group (the CENTROID)
Calculate a distance function
Fraud Data
Age Claim Tickets Prior Outcome
52 2000 0 1 OK
38 1800 0 0 OK
19 600 2 2 OK
21 5600 1 2 Fraud
41 4200 1 2 Fraud
Standardized & Sorted Fraud Data

Age    Claim  Tickets  Prior  Outcome
1      0.60   1        0.5    0
0.9    0.64   1        1      0
0      0.88   0        0      0
0.633  0.707  0.667    0.500  0  (centroid)
0.05   0      1        0      1
1      0.16   1        0      1
0.525  0.080  1.000    0.000  1  (centroid)
Distance Calculations

         New   To 0                     To 1
Age      0.50  (0.633-0.5)^2 = 0.018    (0.525-0.5)^2 = 0.001
Claim    0.30  (0.707-0.3)^2 = 0.166    (0.08-0.3)^2  = 0.048
Tickets  0     (0.667-0)^2   = 0.445    (1-0)^2       = 1.000
Prior    1     (0.5-1)^2     = 0.250    (0-1)^2       = 1.000
Totals                         0.879                    2.049
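These distance calculations can be reproduced directly from the standardized centroids (values taken from the slides):

```python
# Group centroids from the standardized fraud data (means by outcome)
centroid_ok    = [0.633, 0.707, 0.667, 0.500]   # outcome 0 (OK)
centroid_fraud = [0.525, 0.080, 1.000, 0.000]   # outcome 1 (Fraud)
new_case       = [0.50, 0.30, 0.0, 1.0]         # Age, Claim, Tickets, Prior

def squared_distance(case, centroid):
    """Sum of squared differences, variable by variable."""
    return sum((c - x) ** 2 for x, c in zip(case, centroid))

d_ok    = squared_distance(new_case, centroid_ok)     # ~0.879
d_fraud = squared_distance(new_case, centroid_fraud)  # ~2.049
# Classify into the nearer group: OK here, since 0.879 < 2.049
print("OK" if d_ok < d_fraud else "Fraud")
```

The new case is closer to the OK centroid, so the centroid method classifies it as OK.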
Discriminant Analysis with Regression (standardized data, binary outcomes)

Intercept      0.430  P = 0.670
Age           -0.421  P = 0.671
Gender         0.333  P = 0.733
Claim         -0.648  P = 0.469
Tickets        0.584  P = 0.566
Prior Claims  -1.091  P = 0.399
Attorney       0.573  P = 0.607

R2 = 0.804
Cutoff (average of the two group averages): 0.429
Case: Stepwise Regression

Stepwise regression: automatic selection of independent variables
Look at F scores of simple regressions; add the variable with the greatest F statistic
Check partial F scores for adding each variable not in the model
Delete variables that are no longer significant
If no external variable is significant, quit
Considered inferior to selection of variables by experts
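A greatly simplified sketch of the forward-selection half of stepwise regression: it uses raw R2 improvement in place of partial F tests and omits the deletion step, and the variable names and data are hypothetical.

```python
import numpy as np

def forward_stepwise(X, y, names, min_gain=0.01):
    """Greedy forward selection: repeatedly add the candidate variable
    that most improves R^2, stopping when the best gain falls below
    min_gain. (True stepwise regression uses partial F tests and also
    deletes variables that lose significance.)"""
    selected, best_r2 = [], 0.0
    sst = np.sum((y - y.mean()) ** 2)
    while len(selected) < X.shape[1]:
        gains = {}
        for j in range(X.shape[1]):
            if j in selected:
                continue
            # OLS fit with an intercept plus the candidate column set
            A = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            gains[j] = 1 - resid @ resid / sst
        j_best = max(gains, key=gains.get)
        if gains[j_best] - best_r2 < min_gain:
            break
        selected.append(j_best)
        best_r2 = gains[j_best]
    return [names[j] for j in selected], best_r2

# Hypothetical data: y depends only on the first variable
X = np.column_stack([np.arange(8.0),
                     np.array([3., 1., 4., 1., 5., 9., 2., 6.])])
y = 2.0 * X[:, 0] + 1.0
chosen, r2 = forward_stepwise(X, y, ["x0", "x1"])
print(chosen, round(r2, 3))   # ['x0'] 1.0
```

The procedure picks x0 (the true driver of y) and then stops, since adding x1 yields essentially no further R2 gain.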
Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association

Data on 244,000 credit card accounts over a 12-month period
1 percent default rate
Cost of granting a loan that defaults: almost $5,000
Cost of denying a loan that would have paid: about $50
Data Treatment

Divided observations into 5 groups
Used one for training (any smaller would have problems due to insufficient default cases)
Used 80% of the data for detailed testing

Regression performed better than the C5 model, even though C5 used costs and regression didn't
Summary

Regression is a basic classical model with many forms
Logistic regression is very useful in data mining: outcomes are often binary, and it can also be used on categorical data
Regression can be used for discriminant analysis, to classify