Chapter 6: Regression Algorithms in Data Mining
Fit data
Time-series data: Forecast
Other data: Predict
Contents

Describes OLS (ordinary least squares) regression and logistic regression
Describes linear discriminant analysis and centroid discriminant analysis
Demonstrates techniques on small data sets
Reviews the real applications of each model
Shows the application of models to larger data sets
Use in Data Mining

Telecommunications industry: customer turnover (churn)
One of the major analytic models for classification problems

Linear regression
The standard: ordinary least squares regression
Can be used for discriminant analysis
Can apply stepwise regression

Nonlinear regression: more complex (but less reliable) data fitting
Logistic regression: when data are categorical (usually binary)
OLS Model

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

where Y is the dependent variable, β0 is the intercept term, β1 … βn are the coefficients for the n independent variables, and ε is the error term.
OLS Regression

Uses intercept and slope coefficients (β) to minimize squared error terms over all i observations
Fits the data with a linear model
Time-series data: observations over past periods; best-fit line (in terms of minimizing the sum of squared errors)
Regression Output
R2 : 0.987
Intercept: 0.642 t=0.286 P=0.776
Week: 5.086 t=53.27 P=0
Requests = 0.642 + 5.086*Week
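The slide's fitted line can be reproduced in spirit with an ordinary least-squares fit. The chapter's actual weekly data are not reprinted here, so the numbers below are illustrative only; the slide's own result is Requests = 0.642 + 5.086*Week with R2 = 0.987.

```python
import numpy as np

# Hypothetical weekly request counts (illustrative, not the chapter's data)
week = np.arange(1.0, 11.0)
requests = 5.0 * week + 1.0 + np.array(
    [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1])

# OLS: choose intercept and slope to minimize the sum of squared errors
slope, intercept = np.polyfit(week, requests, 1)
print(f"Requests = {intercept:.3f} + {slope:.3f}*Week")
```

The fitted coefficients land close to the true generating values (slope near 5, intercept near 1) because the added noise is small.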
Example

[Slide figure: computation of SSE, SST, and R2 for the sample data]
Example (continued)
A graph of the time-series model

[Figure: (X1) Requests vs. (X2) Pred_lmreg_1 — observed requests plotted against the regression's predictions]
Time-Series Forecast

[Figure: time-series prediction — requests (0 to 250) plotted over periods 0 to 50]
Regression Tests

Fit:
SSE: sum of squared errors (synonym: SSR, sum of squared residuals)
R2: proportion of variance explained by the model
Adjusted R2: adjusts the calculation to penalize for the number of independent variables

Significance:
F-test: test of overall model significance
t-test: test of significant difference between a model coefficient and zero
P: probability that the coefficient is zero (or at least on the other side of zero from the coefficient estimate)

See page 103
Regression Model Tests

SSE (sum of squared errors): for each observation, subtract the model value from the observed value, square the difference, and total over all observations
By itself means nothing
Can compare across models (lower is better)
Can be used to evaluate the proportion of variance in the data explained by the model

R2: ratio of the explained sum of squares (MSR) to the total sum of squares (SST)
SST = MSR + SSE
0 ≤ R2 ≤ 1

See page 104
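The SSE/SST decomposition can be checked in a few lines; the observed values and predictions below are hypothetical.

```python
import numpy as np

# Hypothetical observed values and model predictions
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
y_hat = np.array([9.5, 12.5, 15.5, 19.5, 23.0])

sse = np.sum((y - y_hat) ** 2)     # sum of squared errors (residuals)
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
msr = sst - sse                    # explained (model) sum of squares
r2 = msr / sst                     # proportion of variance explained
print(f"SSE={sse:.2f} SST={sst:.2f} R2={r2:.3f}")
```

For an OLS fit with an intercept, SST = MSR + SSE holds exactly; here MSR is simply taken as SST minus SSE.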
Multiple Regression

Can include more than one independent variable
Trade-off:
Too many variables: many spurious variables, overlapping information
Too few variables: miss important content
Adding variables always increases R2
Adjusted R2 penalizes for additional independent variables
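The adjusted R2 penalty follows directly from the standard formula. With n = 20 observations — a value inferred from the chapter's adjusted R2 figures, not stated on this slide — it reproduces the hiring example's numbers.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2: penalizes R^2 for the number of independent
    variables k, given n observations (standard formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hiring example: R2 = 0.252 with 5 variables gives a negative
# adjusted R2 when n = 20 (n inferred, not stated on the slide)
print(round(adjusted_r2(0.252, 20, 5), 3))   # -0.015
```

A negative adjusted R2 signals that the model explains less than the penalty for its variable count, which is exactly why the chapter's model is called weak.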
Example: Hiring Data

Dependent variable: Sales
Independent variables:
Years of Education
College GPA
Age
Gender
College Degree

See pages 104-105
Regression Model

Sales = 269025
- 17148*YrsEd   P = 0.175
- 7172*GPA      P = 0.812
+ 4331*Age      P = 0.116
- 23581*Male    P = 0.266
+ 31001*Degree  P = 0.450

R2 = 0.252, Adj R2 = -0.015
Weak model; no coefficient significant at 0.10
Improved Regression Model
Sales = 173284
- 9991*YrsEd P = 0.098*
+3537*Age P = 0.141
-18730*Male P = 0.328
R2 = 0.218 Adj R2 = 0.070
Logistic Regression

Data often ordinal or nominal
Regression based on continuous numbers is not appropriate; need dummy variables
Binary (either are or are not): LOGISTIC REGRESSION (probability of 1 or 0)
Two or more categories: DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)
Logistic Regression

For dependent variables that are nominal or ordinal
Probability of assigning case i to class j
Sigmoidal function (in English, an S curve from 0 to 1):

Pj = 1 / (1 + e^-(β0 + β1xi))
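The sigmoidal function can be sketched in a few lines (a minimal illustration; the scores fed in are arbitrary):

```python
import math

def logistic_probability(score):
    """Map a linear regression score (beta0 + sum of beta_i * x_i)
    to a probability between 0 and 1 via the S-shaped logistic curve."""
    return 1.0 / (1.0 + math.exp(-score))

# A score of 0 sits exactly at the curve's midpoint
print(logistic_probability(0.0))   # 0.5
```

Large positive scores approach probability 1, large negative scores approach 0, which is what makes the curve suitable for binary outcomes.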
Insurance Claim Model

Fraud = 81.824
- 2.778 * Age         P = 0.789
- 75.893 * Male       P = 0.758
+ 0.017 * Claim       P = 0.757
- 36.648 * Tickets    P = 0.824
+ 6.914 * Prior       P = 0.935
- 29.362 * Atty Smith P = 0.776

Can get the probability by running the score through the logistic formula

See pages 107-109
Linear Discriminant Analysis

Group objects into a predetermined set of outcome classes
Regression is one means of performing discriminant analysis:
2 groups: find a cutoff for the regression score
More than 2 groups: multiple cutoffs
Centroid Method (NOT regression)

Binary data
Divide the training set into two groups by binary outcome
Standardize the data to remove scales
Identify the mean of each independent variable by group (the CENTROID)
Calculate a distance function
Fraud Data
Age Claim Tickets Prior Outcome
52 2000 0 1 OK
38 1800 0 0 OK
19 600 2 2 OK
21 5600 1 2 Fraud
41 4200 1 2 Fraud
Standardized & Sorted Fraud Data

Age    Claim  Tickets  Prior  Outcome
1      0.60   1        0.5    0
0.9    0.64   1        1      0
0      0.88   0        0      0
0.633  0.707  0.667    0.500  0  (centroid)
0.05   0      1        0      1
1      0.16   1        0      1
0.525  0.080  1.000    0.000  1  (centroid)
Distance Calculations

         New   To 0                     To 1
Age      0.50  (0.633-0.5)^2 = 0.018    (0.525-0.5)^2 = 0.001
Claim    0.30  (0.707-0.3)^2 = 0.166    (0.08-0.3)^2  = 0.048
Tickets  0     (0.667-0)^2   = 0.445    (1-0)^2       = 1.000
Prior    1     (0.5-1)^2     = 0.250    (0-1)^2       = 1.000
Totals                         0.879                    2.049
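These distance calculations can be reproduced directly from the standardized centroids (values taken from the slides):

```python
# Group centroids from the standardized fraud data (means by outcome)
centroid_ok    = [0.633, 0.707, 0.667, 0.500]   # outcome 0 (OK)
centroid_fraud = [0.525, 0.080, 1.000, 0.000]   # outcome 1 (Fraud)
new_case       = [0.50, 0.30, 0.0, 1.0]         # Age, Claim, Tickets, Prior

def squared_distance(case, centroid):
    """Sum of squared differences, variable by variable."""
    return sum((c - x) ** 2 for x, c in zip(case, centroid))

d_ok    = squared_distance(new_case, centroid_ok)     # ~0.879
d_fraud = squared_distance(new_case, centroid_fraud)  # ~2.049
# Classify into the nearer group: OK here, since 0.879 < 2.049
print("OK" if d_ok < d_fraud else "Fraud")
```

The new case is closer to the OK centroid, so the centroid method classifies it as OK.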
Discriminant Analysis with Regression (standardized data, binary outcomes)

Intercept      0.430  P = 0.670
Age           -0.421  P = 0.671
Gender         0.333  P = 0.733
Claim         -0.648  P = 0.469
Tickets        0.584  P = 0.566
Prior Claims  -1.091  P = 0.399
Attorney       0.573  P = 0.607

R2 = 0.804
Cutoff (average of the two group averages): 0.429
Case: Stepwise Regression

Stepwise regression: automatic selection of independent variables
Look at F scores of simple regressions; add the variable with the greatest F statistic
Check partial F scores for adding each variable not in the model
Delete variables that are no longer significant
If no external variable is significant, quit
Considered inferior to selection of variables by experts
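A greatly simplified sketch of the forward-selection half of stepwise regression: it uses raw R2 improvement in place of partial F tests and omits the deletion step, and the variable names and data are hypothetical.

```python
import numpy as np

def forward_stepwise(X, y, names, min_gain=0.01):
    """Greedy forward selection: repeatedly add the candidate variable
    that most improves R^2, stopping when the best gain falls below
    min_gain. (True stepwise regression uses partial F tests and also
    deletes variables that lose significance.)"""
    selected, best_r2 = [], 0.0
    sst = np.sum((y - y.mean()) ** 2)
    while len(selected) < X.shape[1]:
        gains = {}
        for j in range(X.shape[1]):
            if j in selected:
                continue
            # OLS fit with an intercept plus the candidate column set
            A = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            gains[j] = 1 - resid @ resid / sst
        j_best = max(gains, key=gains.get)
        if gains[j_best] - best_r2 < min_gain:
            break
        selected.append(j_best)
        best_r2 = gains[j_best]
    return [names[j] for j in selected], best_r2

# Hypothetical data: y depends only on the first variable
X = np.column_stack([np.arange(8.0),
                     np.array([3., 1., 4., 1., 5., 9., 2., 6.])])
y = 2.0 * X[:, 0] + 1.0
chosen, r2 = forward_stepwise(X, y, ["x0", "x1"])
print(chosen, round(r2, 3))   # ['x0'] 1.0
```

The procedure picks x0 (the true driver of y) and then stops, since adding x1 yields essentially no further R2 gain.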
Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association

Data on 244,000 credit card accounts over a 12-month period
1 percent default rate
Cost of granting a loan that defaults: almost $5,000
Cost of denying a loan that would have paid: about $50
Data Treatment

Divided observations into 5 groups
Used one for training (any smaller would have problems due to insufficient default cases)
Used 80% of the data for detailed testing

Regression performed better than the C5 model, even though C5 used costs and regression didn't
Summary

Regression is a basic classical model with many forms
Logistic regression is very useful in data mining: outcomes are often binary, and it can also be used on categorical data
Regression can be used for discriminant analysis, to classify