Data Mining Applications in P&C Insurance
CASE Spring Meeting, April 12, 2005
Lijia Guo, PhD, ASA, MAAA
University of Central Florida

Agenda
- Introduction to data mining modeling
- Understanding the data mining process
- Data mining (DM) techniques
- Applications in P&C insurance
- Case study

Introduction – What is Data Mining?
- The process of exploring and analyzing large quantities of data in order to discover meaningful patterns and rules.
- Uses a variety of data analysis tools to discover relationships that may be used to make valid predictions.
- It is not a magic wand. You must:
  - Know your business
  - Understand your data
  - Understand the analytical methods

Introduction - DM Modeling
An information discovery process:
- Knowing your goals
- Understanding your data
- Choosing the right methods
- Understanding the limitations
- Validation and testing
- Making crucial business decisions

Introduction – DM Process
[Diagram: the DM process, with boxes labeled Define the Goal, Identify Data Sources, Understand the Economics, Prepare Data, Transform Data, Apply DM Models, Validate DM Models, and Implement.]

Introduction – DM Goals
- Identifying responsive potential customers
- Identifying existing customers who are more likely to terminate
- Identifying high-risk purchasers
- Identifying the factors that cause large claims
- Identifying interactions among risk factors

Introduction – DM Process

DM Techniques
- Decision trees
- Logistic regression
- Neural networks
- Fuzzy logic
- Genetic algorithms
- Clustering
- Association discovery
- Sequence discovery
- Bayesian analysis
- Visualization
- Hybrid algorithms

DM Techniques -- Decision Trees
What are decision trees?
- Classify observations based on the values of nominal, binary, or ordinal targets
- Predict outcomes for interval targets
- Predict the appropriate decision when you specify decision alternatives

DM Techniques -- Decision Trees Example
[Diagram: decision tree for the classification of surrender risk. Root node: Income > $50,000 (yes or no). Child nodes: Job > 5 years (yes or no) and High debt (yes or no). Each child node carries the label "if yes, low risk; else, high risk".]
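Each yes/no node in a tree like this one is chosen by searching for the question that best separates the classes. The slides do not say which splitting criterion was used, so the sketch below swaps in one common choice, Gini impurity. The function names and the four-row data set are hypothetical and exist only for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(rows, labels):
    """Return (feature, weighted_gini) for the yes/no feature whose split
    gives the lowest weighted impurity. `rows` is a list of dicts of booleans."""
    best = None
    for feature in rows[0]:
        yes = [y for r, y in zip(rows, labels) if r[feature]]
        no = [y for r, y in zip(rows, labels) if not r[feature]]
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if best is None or score < best[1]:
            best = (feature, score)
    return best

# Hypothetical toy data, invented for illustration only.
rows = [
    {"income_gt_50k": True,  "job_gt_5y": True},
    {"income_gt_50k": True,  "job_gt_5y": False},
    {"income_gt_50k": False, "job_gt_5y": True},
    {"income_gt_50k": False, "job_gt_5y": False},
]
labels = ["low", "low", "high", "high"]
feature, score = best_binary_split(rows, labels)
```

In this toy data the income question separates the two risk classes completely, so it would be picked for the root node; a real tree builder repeats the same search inside each child node.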

DM Techniques -- Decision Trees
Strengths and weaknesses:
- Insights into the decision-making process
- Efficient, and thus suitable for large data sets
- Relatively unstable
- Difficult to detect linear or quadratic relationships

DM Techniques -- Logistic Regression
- What is logistic regression?
- How logistic regression works
  - Odds ratios
  - Each independent variable affects the logit linearly:

$$\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}, \quad \text{where } i = 1, 2, \ldots, n.$$
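The logit relationship can be made concrete with a few lines of Python. This is a minimal sketch: the function names and the coefficient values are hypothetical and are not taken from the presentation. It shows the logit, its inverse, and why exp(beta_j) is read as an odds ratio.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def predict_probability(beta0, betas, x):
    """Invert the logit: p = 1 / (1 + exp(-(beta0 + sum_j beta_j * x_j)))."""
    eta = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-eta))

def odds_ratio(beta_j):
    """Multiplicative change in the odds for a one-unit increase in x_j."""
    return math.exp(beta_j)

# Hypothetical coefficients, not taken from the presentation.
beta0, betas = -2.0, [0.5, 1.0]
p1 = predict_probability(beta0, betas, [1.0, 1.0])
p2 = predict_probability(beta0, betas, [2.0, 1.0])  # first input raised by one unit

# The ratio of the two odds equals exp(beta_1), whatever the other inputs are.
observed_ratio = (p2 / (1.0 - p2)) / (p1 / (1.0 - p1))
```

Because the predictors enter the logit linearly, a one-unit change in a predictor always multiplies the odds by the same factor, which is what makes the fitted coefficients easy to interpret.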

DM Techniques - Logistic Regression
Strengths and weaknesses:
- Maximum likelihood curve fitting
- Multiple logistic regression model
- Interaction-effect modifier
- Multinomial logistic regression model

DM Techniques -- Neural Networks
What are neural networks?
[Diagram: network architecture with one hidden layer of two units. Inputs $x_1, x_2, x_3$ connect to hidden units $H_1, H_2$ through weights $w_{ij}$; the hidden units connect to the output $y$ through weights $w_1, w_2$.]
- Input layer: a unit for each input variable
- Hidden layer: hidden units (neurons)
- Output layer: the target

DM Techniques – Neural Networks

$$g_0^{-1}(E(y)) = w_0 + w_1 H_1 + w_2 H_2$$
$$H_1 = g_1(w_{01} + w_{11} x_1 + w_{21} x_2 + w_{31} x_3)$$
$$H_2 = g_2(w_{02} + w_{12} x_1 + w_{22} x_2 + w_{32} x_3)$$

- $g_0(\cdot)$: output activation function
- $g_i(\cdot)$: activation functions (nonlinear transformations)
- $w_{11}, w_{21}, \ldots, w_{32}, w_1, w_2$: weights
- $w_0, w_{01}, w_{02}$: bias
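The equations for this network amount to a short forward pass. The sketch below assumes tanh hidden activations and an identity output activation, so that the returned value is E(y) itself; neither choice is stated on the slide. The function name and all weight values are hypothetical. Note that `w_hidden[j][i]` holds the weight from input i to hidden unit j, which is the transpose of the slide's subscript order.

```python
import math

def forward(x, w_hidden, b_hidden, w_out, b_out, g_hidden=math.tanh):
    """Forward pass for 3 inputs -> 2 hidden units -> 1 output.

    w_hidden[j][i] is the weight from input i to hidden unit j and
    b_hidden[j] is that unit's bias. The output activation is assumed to
    be the identity, so the result is w_0 + w_1*H_1 + w_2*H_2.
    """
    hidden = [
        g_hidden(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(w_hidden, b_hidden)
    ]
    return b_out + sum(w * h for w, h in zip(w_out, hidden)), hidden

# Hypothetical weights chosen only to make the arithmetic easy to follow.
y, hidden = forward(
    x=[1.0, 2.0, 3.0],
    w_hidden=[[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]],
    b_hidden=[-1.4, 0.0],
    w_out=[2.0, 5.0],
    b_out=0.5,
)
```

With these weights both hidden pre-activations are (up to rounding) zero, so both hidden units output about 0 and the prediction collapses to the output bias of 0.5.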

DM Techniques – Neural Networks
How neural networks work:
- Processing elements
- Training
- Predicting
- Activation functions
  - Logistic function: $l(x) = \dfrac{1}{1 + e^{-x}}$
  - Hyperbolic tangent: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
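The two activation functions are closely related, which a short Python check makes visible. This is only an illustrative sketch; the function names are ours, and the argument is written as `x` for both functions.

```python
import math

def logistic(x):
    """Logistic activation, with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent written out from its definition, with range (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for x in (-2.0, 0.0, 0.5, 3.0):
    # tanh is a rescaled logistic: tanh(x) = 2 * logistic(2x) - 1
    assert abs(tanh(x) - (2.0 * logistic(2.0 * x) - 1.0)) < 1e-12
    # and it agrees with the standard library implementation
    assert abs(tanh(x) - math.tanh(x)) < 1e-12
```

Both functions squash any real input into a bounded range; the practical difference is that tanh is centered at zero while the logistic function is centered at 0.5.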

DM Techniques -- Neural Networks
Strengths and weaknesses:
- Accurate prediction for complex problems
- Black-box prediction engine
- Overtraining
- Training speed

DM Techniques -- Hybrid Algorithms
- Problems with standard algorithms
- Advanced algorithms
- Discovery-driven approaches
- Mixture of algorithms

DM Applications in P&C Insurance
- Data warehouse
- Underwriting
- Pricing/rate making
- Claim scoring
- Risk management
- Policy level analysis
- Variable selection

Data Warehousing Example
[Diagram: data warehousing flow. Hospital claims, pharmacy claims, and physician claims feed an operational data store. The labeled stages are:
- Primary selection: WHO? (unique patient list)
- Secondary selection: WHAT DATA? (transactions, surveys, demographics)
- Tertiary selection: WHAT DOES THE TRANSACTION DATA TELL US? (derived variables/flags from Rx, med claims, surveys, ...; service level table)
- Group by patient
- Summary: WHAT DO WE KNOW ABOUT THIS PATIENT? (summary level variables, service level variables, summary level table)]

DM in Insurance Underwriting
- Improving profit margin
- Gaining competitive edge
- Risk evaluation process
  - Lots of variables
  - Lots of interactions
- Easy-to-follow procedure
  - Decision trees can be used

DM in Insurance Underwriting - Auto Driver's Claim Information

| Variable | Variable Type | Measurement Level | Description |
|---|---|---|---|
| Age | Continuous | Interval | Driver's age in years |
| Car age | Continuous | Interval | Age of the car |
| Car type | Categorical | Nominal | Type of the car |
| Gender | Categorical | Binary | F = female, M = male |
| Coverage level | Categorical | Nominal | Policy coverage |
| Education | Categorical | Nominal | Education level of the driver |
| Location | Categorical | Nominal | Location of residence |
| Climate | Categorical | Nominal | Climate code for residence |
| Credit rating | Continuous | Interval | Credit score of the driver |
| ID | Input | Nominal | Driver's identification number |
| No. of claims | Categorical | Nominal | Number of claims |

DM in Insurance Underwriting - Decision Tree Diagram

DM in Pricing/Rate Making
- Data: Auto Driver's Claim Information
- Decision tree analysis to identify risk factors that predict profits, claims, and losses
- Logistic regression applied to model:
  - Claim frequency
  - Effect of each risk factor

DM in Pricing/Rate Making
[Figure: effect T-scores from the logistic regression.]

DM in Pricing/Rate Making - Assessment
- Cross-model comparisons of the expected to actual profits/losses
  - Independent of all other factors (sample size, ...)
- Lift charts
  - Compare the % claim-occurrence value to a random baseline model
  - Performance quality is demonstrated by the degree to which the lift chart curve pushes upward and to the left
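The numbers behind a lift chart can be computed directly from model scores and observed claims. The sketch below is one common way to do it (cumulative gains by decile, divided by the random baseline) and is not necessarily the calculation used in the presentation. The function names, scores, and claim indicators are hypothetical, and the code assumes at least one claim in the data.

```python
def cumulative_gains(scores, outcomes, n_bins=10):
    """Return the cumulative share of claims captured after each bin,
    with policies sorted from highest to lowest predicted score."""
    ranked = [y for _, y in sorted(zip(scores, outcomes), key=lambda t: -t[0])]
    total = sum(ranked)  # assumed to be non-zero
    n = len(ranked)
    gains = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * n / n_bins)
        gains.append(sum(ranked[:cutoff]) / total)
    return gains

def lift(gains):
    """Lift per cumulative bin: captured share divided by the share a
    random baseline would capture, which is (b + 1) / n_bins."""
    n_bins = len(gains)
    return [g / ((b + 1) / n_bins) for b, g in enumerate(gains)]

# Hypothetical scores and claim indicators (1 = claim occurred).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
gains = cumulative_gains(scores, outcomes)
lifts = lift(gains)
```

A model with no predictive power would have a lift of about 1 in every bin; the further the early bins sit above 1, the more the gains curve bends upward and to the left.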

DM in Pricing/Rate Making - Lift Chart for Logistic Regression
Logistic regression:
- Captured 30% of the drivers in the 10th percentile
- Better predictive power from about the 20th to the 80th percentiles

DM in Risk Management
- Reinsurance
  - To structure more effectively by segmentation
- Hedging
- Target retention and building loyalty

DM in Policy Level Analysis
- Retention analysis
- Profitability analysis
- Policyholder's behavior
- DM methods used:
  - Neural networks
  - Decision trees
  - Logistic regression

Applications – Variable Selection
Problem:
- Given $\{Y, X\}$, where $X = \{x_1, x_2, \ldots, x_N\}$
- Find $F$ such that $F(X) \approx Y$
- Find $Z \subset X$ and $F^*$ such that $F^*(Z) \approx Y$
- Improving model accuracy and efficiency
- Making crucial business decisions

Case Study - Group Insurance
- Identify ways to build upon the current manual rating structure, utilizing existing rating variables to develop a practical tool to guide underwriting in rate adjustments
- Identify any new rating variables with significant predictive power
  - Data currently gathered but not utilized
  - Transformations of existing variables
  - Introduction of new rating variables (e.g., external financial data)

Case Study – Group Insurance
- Profit margin over an x-year period
- 128 input variables
- Principal Components Analysis applied
- 42 variables remain
- How to improve business profit?

Case Study - Goals
- Developing a practical underwriting tool
- Detecting deviations
- Identifying key drivers
- Improving model predictive power
- Risk selection

Function Approximation

$$F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \cdots + \beta_M T_M(X)$$

- $F_0$ is the initial guess
- Stagewise approximation
- Each stage is added to reduce errors
- Each stage is a weak learner: a small tree
- Sequential adjustment

Regression Tree Example
Profit = 6.5%
  +0.8% if AS > 421; -0.5% otherwise
  +1.2% if male younger than 30; -1.1% otherwise
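Read as code, this example is a base value plus one adjustment from each of two small trees. The sketch below transcribes the slide's numbers; the function and parameter names are ours, and `as_value` simply stands for the quantity the slide abbreviates as AS, which the slide does not define.

```python
def predicted_profit(as_value, is_male, age):
    """Additive prediction from the slide: a base profit plus one
    adjustment from each of two small trees (all values in percent)."""
    profit = 6.5
    # First small tree: split on AS at 421.
    profit += 0.8 if as_value > 421 else -0.5
    # Second small tree: male and younger than 30 versus everyone else.
    profit += 1.2 if (is_male and age < 30) else -1.1
    return profit

best_case = predicted_profit(as_value=500, is_male=True, age=25)    # 6.5 + 0.8 + 1.2
worst_case = predicted_profit(as_value=400, is_male=False, age=40)  # 6.5 - 0.5 - 1.1
```

Each tree contributes only a small correction, which is the sense in which every stage of the stagewise approximation is a weak learner.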

Function Approximation
GIVEN:
- $Y$: output
- $X$: inputs or predictors
- $L(Y, F)$: loss function

ESTIMATE:

$$F^*(X) = \arg\min_{F(X)} E_{Y,X}\left[L(Y, F(X))\right]$$

Classical Function Approximation

$$F = F(X, B), \quad B = \{\beta_j\}$$

Solve $\{\beta_j\}$ from

$$\min L(Y, F(X, B))$$

Nonparametric Function Approximation
- Initial guess $\{F_0(X_i)\}$
- Compute the gradient

$$g = \left\{\frac{\partial L}{\partial F(X_i)}\right\}_{i=1}^{N}$$

- Take a step in the steepest descent direction

Gradient Boosting

$$L(\{F(X_i)\}) = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - F(X_i)\right)^2$$

- Initial guess $\{F_0(X_i)\}$
- FOR m = 1 TO M
  - Gradient: $g_m = \nabla L(\{F_{m-1}(X_i)\})$
  - Fit an L-node regression tree to the current residuals
  - For each given node, calculate the node average residual $h_m(X_i)$
  - Update: $\{F_m(X_i)\} = \{F_{m-1}(X_i)\} + h_m(X_i)$
- END
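The loop above can be sketched in plain Python. For squared-error loss the negative gradient is proportional to the residual, so each stage fits a tree to the residuals. This sketch makes several assumptions that are not on the slide: the trees are 2-node stumps on a single numeric feature, the initial guess is the mean of Y, and an optional `learning_rate` shrinkage parameter is included (its default of 1.0 reproduces the slide's update). All names and the data set are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a 2-node regression tree (a stump) on one numeric feature:
    pick the threshold that minimizes squared error and predict the
    average residual in each node."""
    best = None
    for t in sorted(set(x))[:-1]:  # needs at least two distinct x values
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_stages=20, learning_rate=1.0):
    """Squared-error gradient boosting: start from the mean of y, then
    repeatedly fit a stump to the current residuals and add it in."""
    f0 = sum(y) / len(y)
    stumps = []
    pred = [f0] * len(y)
    for _ in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, residuals)
        stumps.append(h)
        pred = [pi + learning_rate * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + learning_rate * sum(h(xi) for h in stumps)

def mse(model, x, y):
    """Mean squared error of a fitted model on (x, y)."""
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Hypothetical one-feature data set, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(x, y)
baseline = lambda xi: sum(y) / len(y)  # the initial guess on its own
```

Comparing `mse(model, x, y)` with `mse(baseline, x, y)` shows how much the added stages reduce the training error relative to the initial guess; in practice a learning rate below 1 and a held-out validation set are commonly used to limit overfitting.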

Case Study
[Figure: two-predictor partial dependence plots for PROFIT_MARGIN against AVG_SALARY (x-axis from 2000 to 14000), shown as slices at REGION ≈ 1.081 and REGION ≈ 1.000.]

Case Study
[Figure: two-predictor partial dependence plots for PROFIT_MARGIN against AVG_SALARY (x-axis from 5000 to 25000), shown as slices at SIZE ≈ 1.000 and SIZE ≈ 1.027.]

Case Study - Single Stats and Variable Importance

| Input | Additive | Multiplicative | Importance |
|---|---|---|---|
| Variable 1 | 0.2679 | 0.2690 | 100.00 |
| Variable 2 | 0.2779 | 0.3203 | 75.23 |
| Variable 3 | 0.1456 | 0.1771 | 54.65 |
| Variable 4 | 0.2263 | 0.2469 | 47.41 |
| Variable 5 | 0.1059 | 0.1425 | 42.81 |
| Variable 6 | 0.2741 | 0.2847 | 34.81 |
| Variable 7 | 0.1289 | 0.1306 | 34.27 |
| Variable 8 | 0.0797 | 0.0864 | 25.35 |
| Variable 9 | 0.1129 | 0.1148 | 23.37 |

Case Study - Pair Stats and Variable Importance

| Variables | Additive | Multiplicative |
|---|---|---|
| Variable 1 & Variable 2 | 0.3714 | 0.3847 |
| Variable 2 & Variable 3 | 0.3704 | 0.4066 |
| Variable 2 & Variable 4 | 0.3686 | 0.4010 |
| Variable 2 & Variable 7 | 0.3401 | 0.3856 |
| Variable 3 & Variable 4 | 0.2795 | 0.3137 |
| Variable 3 & Variable 6 | 0.2895 | 0.3082 |
| Variable 4 & Variable 7 | 0.2417 | 0.2592 |
| Variable 5 & Variable 6 | 0.2622 | 0.2766 |
| Variable 6 & Variable 7 | 0.2904 | 0.3066 |

Predictive Modeling
- Predicts deviations from expected profitability (9 variables used)
- Practical guide for underwriters to use for rate adjustments
- New variables identified to have strong predictive power
- Improve business profit (20% profit margin)

Importance of Multiple Techniques
- Robust model with high predictive accuracy
- Practical constraints
- Algorithm complexity
- Ease of understanding of results

Is Data Mining for you?
- Defining the goals
- Understanding your data
- Using multiple techniques
- Improving your decision-making process
- Gaining competitive edges!

Thank you!