TRANSCRIPT
Managing Data (Big
and Small) for Analytics
June 11, 2015
1
Importance of Predictive Analytics
• Predictive Analytics can help insurers be more
effective in all segments of the value chain
– Marketing – Target and acquire the right customers
– Actuarial – Prices that accurately reflect risk
– Underwriting – Select the proper risks and proper
products
– Claims – Identify suspicious claims
• The industry is getting more competitive
– Top 10 personal auto insurers had 1/2 the market share
in 1980; now they have 2/3 of the market share
– Only the “fittest” will survive; analytics can provide the
needed competitive advantage
– The industry has recognized the value of analytics
2
• Predictive analytics is used
most often in personal lines.
• 100% of the larger personal
lines insurers we surveyed use
predictive analytics!
• Of course, personal lines
(and PL auto, in particular) is
the largest and one of the
most competitive segments
of the P&C market. Insurers
are looking for any
competitive edge they can
find.
Who Uses Predictive Modeling?
Use of predictive analytics by
size of personal lines book
Source: ISO and Earnix 2013 Insurance Predictive Modeling Survey
• Pricing is the most common
use of predictive modeling.
• A majority of insurers also use
predictive modeling for
underwriting at least
frequently.
• But there is still significant
usage in marketing, claims,
and reserving.
How Insurers Use Predictive Modeling
Predictive modeling use by function
[Chart: stacked bars by function, 0% to 100%; legend: Rarely, Occasionally, Frequently, Always]
Source: ISO and Earnix 2013 Insurance Predictive Modeling Survey
• Lack of sufficient data is the
biggest challenge – both
quantity and scope.
• Lack of skilled modelers
is a close second most
challenging factor for those
building an internal
predictive modeling
capability.
Predictive Modeling Challenges
Predictive modeling challenges
Other: 9%
We have no challenges: 6%
Need better modeling tools: 23%
Need additional computing resources: 31%
Not enough skilled modelers: 47%
Lack of sufficient data: 53%
Source: ISO and Earnix 2013 Insurance Predictive Modeling Survey
• Data is a challenge for
everybody, but large and
small insurers have different
challenges.
• Larger insurers are most
concerned with data quality.
• Smaller insurers don’t have
enough observations.
Data Challenges
Data challenges by company size
[Chart: stacked bars, 0% to 100%, for insurers <$1B and >$1B; legend: Other; Data is not clean, tough to use; Data is not current; Not enough variability in the data; Not enough observations]
Numbers may not add up to 100% due to rounding
Source: ISO and Earnix 2013 Insurance Predictive Modeling Survey
• More than 90% of insurers
supplement their internal
data with one or more types
of third-party data.
• The most common data
types are credit-related data
and geo-demographic
data.
Third-Party Data
Types of third-party data used
Other: 5%
3rd party telematics data: 9%
Weather data: 42%
Catastrophe models data: 46%
Competitive pricing data: 53%
Geo-demographical data: 67%
Insurance score or raw credit attributes: 80%
Source: ISO and Earnix 2013 Insurance Predictive Modeling Survey
• Preparing data for analysis is
a major bottleneck and
drain on resources for most
insurers.
• 54% of insurers typically
spend more than 3 months
to prepare their data for a
project.
Data Preparation
Data extraction and preparation time
Less than 1 month: 18%
1 – 2 months: 28%
3 – 6 months: 37%
More than 6 months: 17%
Source: ISO and Earnix 2013 Insurance Predictive Modeling Survey
• The insurance industry,
especially many smaller
insurers, has not yet figured
out how to take advantage
of Big Data.
• Once new big data
technologies have been
used to extract the useful
“nuggets” from the new, vast
data streams, the data
management tasks are
similar to other types of
analytic data.
Big Data
Role of Big Data in Modeling Initiatives
[Chart: stacked bars, 0% to 100%, for insurers <$1B and >$1B; legend: Doing nothing with Big Data; Would like to take advantage of Big Data, but it is cost prohibitive; Evaluating/implementing the use of Big Data; Use Big Data and have incorporated it in our models]
Source: ISO and Earnix 2013 Insurance Predictive Modeling Survey
Good Data: Good Analytics
Good quality data can often compensate for
mediocre analysis …
… but, the reverse is never true.
No matter how skilled the analyst, …
… bad data will always lead to bad results!
10
Data Use for Analytics is Different
• Some Characteristics of Analytics Use of Data
– Sophisticated Users
– Ad Hoc & Iterative
– Repurposed Data
– Granular and Denormalized Data
– Data Quality and Metadata are different
11
• Most modelers will have advanced degrees
in Statistics, Economics, Applied
Mathematics, etc.
• Users will be looking at the data from new
perspectives and using the data in new ways
• Can lead to new insights into the data for
data owners – can also cause friction.
• Data managers need to understand the
predictive analytics process
Sophisticated Users
12
Predictive Analytics Process Overview
13
Data
Business
Understanding
Data
Understanding
Data
Preparation
Modeling
Evaluation
Deployment
CRISP – DM: Cross-Industry
Standard Process for Data Mining
• Analysts will design their queries as needed
• The iterative nature of the analytic process
means the analyst will be back again and
again for more and different data
• There is no “standard” analytics data report
that can be specified when designing the
data resource. The structure needs to efficiently
and flexibly support ad hoc queries.
Ad Hoc Nature of Data Access
14
• Rarely are the operational data stores collected into a single “Enterprise Data Warehouse”
– You will need to create a useful analytic data store
• Even rarer is data that has been collected specifically for analytics – usually, analytics is an opportunistic user of data that has been collected for other purposes
– Data will need to be cleaned, transformed, conformed, and documented before it is certified as “fit for use” for analytics and included in the analytic data store
Repurposed Data
15
• Insurance companies collect vast quantities of data in the course of business
• Typical Insurance Analytics Data Sources
– Customer Relationship Management
– Quoting/Underwriting
– Policy Management
– Billing
– Claims
– Audit
– Actuarial Research
– Financial Reporting
– Publicly Available Data
– Third-party data vendors
Insurance Company Data Sources
16
• End goal – Always remember the goal
– Two-dimensional flat file for input into modeling
software
– Each record contains an identifier, candidate
predictor variables for testing, and one or more
target variables
• Analysis requires historical data and the
vintage of the predictor variables must be
matched to the target variables
• Must support the granularity required for the level
of analysis
Granular and Denormalized Data
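A minimal sketch of the flat-file goal described on this slide, assuming hypothetical policy records and a hypothetical history of dated insurance-score snapshots. For each record it takes the latest predictor value known on or before the policy effective date, which is one simple way to match the predictor vintage to the target. All field names and values are illustrative assumptions, not data from the presentation.

```python
from datetime import date

# Hypothetical target data: one row per policy term (identifier + target).
policies = [
    {"policy_id": "P1", "effective": date(2013, 1, 1), "claim_count": 0},
    {"policy_id": "P2", "effective": date(2014, 6, 1), "claim_count": 2},
]

# Hypothetical predictor history: dated snapshots of an insurance score.
score_history = {
    "P1": [(date(2012, 7, 1), 710), (date(2013, 7, 1), 690)],
    "P2": [(date(2013, 3, 1), 640), (date(2014, 3, 1), 655)],
}


def value_as_of(snapshots, as_of):
    """Return the latest snapshot value dated on or before as_of, else None."""
    eligible = [(d, v) for d, v in snapshots if d <= as_of]
    return max(eligible)[1] if eligible else None


def build_flat_file(policies, score_history):
    """One row per policy: identifier, vintage-matched predictor, target."""
    rows = []
    for p in policies:
        snapshots = score_history.get(p["policy_id"], [])
        rows.append({
            "policy_id": p["policy_id"],
            "insurance_score": value_as_of(snapshots, p["effective"]),
            "claim_count": p["claim_count"],
        })
    return rows


flat_file = build_flat_file(policies, score_history)
```

Policy P1 picks up the 2012 snapshot rather than the later 2013 one, because the later score was not yet known when the policy took effect.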
17
• What does each record represent?
• Common record types for insurance analytics
– Customer-related
  • First named Insured
  • Household
  • Quote
– Policy
– Coverage
– Claim
– Geography
  • Census tract
  • County
  • State
  • Underwriting Territory
  • Zip Code
Granularity
18
• Star Schema is often adopted to support analytics
• Data will often be denormalized and aggregated from source systems
• Analytic databases will often grow to contain more history than the source systems. Plan for growth.
• Every variable needs a vintage
• Indexing needs will be imperfectly defined. Count on supporting multiple table joins from any direction = many indexes.
• Granularity – pick the lowest level as your base
– This means more data, but …
– … it is the most flexible design. Data can usually be aggregated to a higher level of granularity but you can never go below your base level.
Implications for Analytic Database Design
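To illustrate the granularity point, here is a small sketch that stores data at a hypothetical base grain (one row per policy and coverage) and rolls it up to policy level and to state level. The rows, field names, and amounts are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical base table at the lowest grain: one row per policy and coverage.
coverage_rows = [
    {"policy_id": "P1", "state": "WI", "coverage": "BI", "premium": 300.0, "loss": 0.0},
    {"policy_id": "P1", "state": "WI", "coverage": "COLL", "premium": 200.0, "loss": 150.0},
    {"policy_id": "P2", "state": "WI", "coverage": "BI", "premium": 350.0, "loss": 500.0},
    {"policy_id": "P3", "state": "NM", "coverage": "COLL", "premium": 250.0, "loss": 0.0},
]


def aggregate(rows, key_fields):
    """Sum premium and loss over any grouping of the base-grain rows."""
    totals = defaultdict(lambda: {"premium": 0.0, "loss": 0.0})
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        totals[key]["premium"] += row["premium"]
        totals[key]["loss"] += row["loss"]
    return dict(totals)


# The same base table supports several higher levels of granularity.
by_policy = aggregate(coverage_rows, ["policy_id"])
by_state = aggregate(coverage_rows, ["state"])
```

Going the other way is not possible: a table stored only at state level could not be split back into policies or coverages.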
19
• Important Data Quality measures for Analytics:
– Accuracy – how well does data match reality?
– Reliability – is measurement repeatable and consistent?
– Timeliness – is data available at time of prediction?
– Completeness – is data available for most cases?
– Permissibility – can you use the data as intended?
• Analysts usually can’t control the quality of the data when acquired. So, they must at least
know the quality of the data in order to
determine the usefulness of the data.
• Metadata – documentation of this information
Data Quality and Metadata
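As one hedged example of recording data quality as metadata, the sketch below measures completeness (the share of records with a value present) for each field and stores the result in a small metadata dictionary. The records and field names are hypothetical, and completeness is only one of the measures listed on the slide.

```python
# Hypothetical analytic records; None marks a missing value.
records = [
    {"policy_id": "P1", "roof_type": "shingle", "insurance_score": 710},
    {"policy_id": "P2", "roof_type": None, "insurance_score": 655},
    {"policy_id": "P3", "roof_type": "tile", "insurance_score": None},
    {"policy_id": "P4", "roof_type": None, "insurance_score": 690},
]


def completeness(records, field):
    """Share of records with a non-missing value for the given field."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


# Store the measure alongside the field name so analysts can look it up.
metadata = {
    field: {"completeness": completeness(records, field)}
    for field in ("roof_type", "insurance_score")
}
```

An analyst could then decide, for instance, whether a field that is only half populated is still useful for the model at hand.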
20
• Advanced analytics requires different data
management support than most other uses of
the data
• Two broad areas that demonstrate those
needs:
– Database design
– Data quality and metadata
• Strive to build an analytic data store that
considers the unique needs of analytics.
Analytic Data Management Summary
21
The great big data debate
• Meaning of big data
• The ROI question
• Finding the right data for the problem*
• Challenges and approaches
• Emerging data sources
* - … or, finding the right problem for the data
22
The definition of big data is big data
23
The broad range of new and massive data types that have appeared over the last decade or so. – Tom Davenport, Big Data @ Work
“The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value” and “…things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.” – Viktor Mayer-Schönberger and Kenneth Cukier
Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. – McKinsey
Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges. – Oxford English Dictionary
An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. – Wikipedia
Source: 12 Big Data Definitions: What’s Yours?
http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/
Volume
Velocity
Variety
Veracity
Top twenty words across definitions
24
Examples of big data in P&C today
25
• Aerial
• Business
• Connected home
• Consumer
• Econometric
• Financial
• GIS
• Government
• Psychographic
• Retail
• Social media
• Topographic
• Traffic Cam
• Vehicle Build
• Vehicle Telematics
• Weather
26
The ROI question
Driving behavior data
27
2013/08/18 22:47:53.07 UTC
34° 59’ 20”, -106° 36’ 52”
-9.8 m/s²
3.27 Gal / fuel
256.6°F
4200 RPM
72,852 Miles
Dr. Seatbelt: Y
Interstate 40 (Freeway)
Speed Limit 65 MPH
Albuquerque, New Mexico
101°F
25 Mil Vis
Wind: 2mph NW
Sunny
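A single observation like the one on this slide could be held in a small record type. The sketch below is an assumption about how such a record might be structured: the class name, field names, and the hard-braking threshold are hypothetical, and only the values themselves come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TelematicsObservation:
    """One timestamped driving-behavior reading (hypothetical layout)."""
    timestamp: datetime
    latitude_deg: float
    longitude_deg: float
    acceleration_ms2: float
    engine_rpm: int
    odometer_miles: int
    driver_seatbelt: bool
    speed_limit_mph: int


def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    sign = -1 if degrees < 0 else 1
    return sign * (abs(degrees) + minutes / 60 + seconds / 3600)


obs = TelematicsObservation(
    timestamp=datetime(2013, 8, 18, 22, 47, 53, 70000, tzinfo=timezone.utc),
    latitude_deg=dms_to_decimal(34, 59, 20),
    longitude_deg=dms_to_decimal(-106, 36, 52),
    acceleration_ms2=-9.8,
    engine_rpm=4200,
    odometer_miles=72852,
    driver_seatbelt=True,
    speed_limit_mph=65,
)

# Hypothetical threshold: decelerations at or below this count as hard braking.
HARD_BRAKE_THRESHOLD_MS2 = -3.0


def is_hard_braking(observation, threshold=HARD_BRAKE_THRESHOLD_MS2):
    return observation.acceleration_ms2 <= threshold
```

Context such as road type, speed limit, and weather would typically be joined on from other sources using the timestamp and location.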
Simulating the ROI on UBI
28
ROI
Selection
Driving Improvement
Ancillary Revenue
vs. Cost
Savings
Reunderwriting
Technology
Incentives and Rewards
Analytics and Models
Logistics
Partners and Services
Elasticity and Cost
Mix of Business
Competitive Landscape
Regulation
Distribution
Anti-selection example
29
All values are hypothetical and illustrative. In the example, policyholders
switch to insurers with UBI if they can find a rate 25% lower.
                                                        Avg Pure   Loss
Year   Rate   Danny   DJ    Michelle   Jesse   Joey     Prem       Ratio
1       800   228     423   520        618     813      520        65%
2       800   X       423   520        618     813      593        74%
3       913   X       X     520        618     813      650        71%
4      1000   X       X     X          618     813      715        72%
5      1100   X       X     X          618     813      715        65%
        884                                             611        69%
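The loss ratios in the table can be reproduced with a few lines of arithmetic. The sketch below takes the yearly rate and the list of remaining policyholders directly from the table (it does not model the switching decision itself) and computes each year's loss ratio as total pure premium divided by total premium charged. The variable and function names are illustrative.

```python
# Hypothetical pure premiums (expected annual losses) from the table.
pure_premium = {"Danny": 228, "DJ": 423, "Michelle": 520, "Jesse": 618, "Joey": 813}

# Rate charged each year and who is still insured, as shown in the table.
years = [
    (800, ["Danny", "DJ", "Michelle", "Jesse", "Joey"]),
    (800, ["DJ", "Michelle", "Jesse", "Joey"]),
    (913, ["Michelle", "Jesse", "Joey"]),
    (1000, ["Jesse", "Joey"]),
    (1100, ["Jesse", "Joey"]),
]


def yearly_results(years, pure_premium):
    """Total premium, total expected losses, and loss ratio for each year."""
    results = []
    for rate, insureds in years:
        losses = sum(pure_premium[name] for name in insureds)
        premium = rate * len(insureds)
        results.append({
            "premium": premium,
            "losses": losses,
            "loss_ratio": losses / premium,
        })
    return results


results = yearly_results(years, pure_premium)
loss_ratio_pct = [round(100 * r["loss_ratio"]) for r in results]

total_premium = sum(r["premium"] for r in results)
total_losses = sum(r["losses"] for r in results)
overall_loss_ratio = total_losses / total_premium
```

As the best risks leave, the average pure premium of the remaining book rises, so the insurer has to keep raising its rate just to return to the loss ratio it started with.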
© Insurance Services Office, Inc., 2015
30
How much does the model matter
[Chart, three panels: Hypothetical Five Year Annualized ROI (-75% to 50%) by Model Power (High Tertile ÷ Lower Tertile, 1 to 6); series: Constant Learning, Graduated Learning]
“Common Dongle / Book of Business” Assumption Set
• Approximately industry average premiums / expenses
• $100 hardware, $5 monthly wireless
• Three year useful life
• Three vehicles per year
• 10% annual cost reductions
Example of learning
31
UBI Score by Week of Driving
[Chart: ISO Safety Score (0 to 70, increasing risk) by Weeks of Driving (0 to 20); series: Initially Safest, Initially Average, Initially Riskiest]
Source: Sample of over 1,000 private passenger vehicles observed in late 2014
Monte Carlo simulation
32
[Chart: Percentage of Simulations (0% to 9%) by Five year return on investment (ROI)]
Assumes common dongle cost structure, device deployment to three
vehicles per year in typical book, and 3x model power.
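A heavily simplified sketch of the Monte Carlo idea, not the presenters' model. It uses the device cost from the "common dongle" assumption set on the earlier slide ($100 hardware plus $5 per month wireless over a three-year useful life) and, as a purely hypothetical input, draws the annual benefit per vehicle from a normal distribution. It looks at one device over its three-year life rather than a five-year book, and the function names and distribution parameters are assumptions.

```python
import random


def simulate_roi(n_sims=10_000, seed=42):
    """Simulate the ROI of one device over its three-year useful life."""
    rng = random.Random(seed)
    # From the slide's assumption set: $100 hardware + $5/month for 3 years.
    device_cost = 100 + 5 * 12 * 3
    rois = []
    for _ in range(n_sims):
        # Hypothetical: annual benefit per vehicle in dollars, floored at zero.
        annual_benefit = max(0.0, rng.gauss(110.0, 40.0))
        total_benefit = annual_benefit * 3
        rois.append((total_benefit - device_cost) / device_cost)
    return rois


def histogram(rois, bin_width=0.25):
    """Share of simulations falling in each ROI bin, keyed by bin start."""
    counts = {}
    for roi in rois:
        bin_start = (roi // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return {b: c / len(rois) for b, c in sorted(counts.items())}


rois = simulate_roi()
shares = histogram(rois)
```

The histogram of shares plays the role of the "percentage of simulations" axis: rather than a single ROI estimate, the output is a distribution that shows how often the investment pays off under the assumed inputs.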
33
Finding the right data for the problem
• Carrier XYZ
• 2007 $80M DWP
• 2007 Loss Ratio: 65.8%
• Carrier charges single rate
for entire territory
• Does not believe in granular
rating, so does nothing while
competitors implement
granular rates
Location of policies for a single territory
Best Practice Case Study: Granular Rating
34
• Carrier XYZ (4 Years Later)
• Competitors have cherry-
picked best risks, leaving
XYZ with concentration of
poor risks
• 2011 $75M DWP
• 2011 Loss Ratio: 72.7%
• Down $5M in premium
• Loss Ratio Up 6.9 points
• Profits down $7M
Location of policies for a single territory
Best Practice Case Study: Granular Rating
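The "profits down $7M" figure can be checked from the premium and loss ratio numbers on the two slides, under the simplifying assumption that profit is premium minus losses and that expenses are unchanged. The function name is illustrative.

```python
def underwriting_margin(premium_m, loss_ratio):
    """Premium minus losses in $M, ignoring expenses (simplifying assumption)."""
    return premium_m * (1.0 - loss_ratio)


margin_2007 = underwriting_margin(80.0, 0.658)  # $80M DWP at a 65.8% loss ratio
margin_2011 = underwriting_margin(75.0, 0.727)  # $75M DWP at a 72.7% loss ratio

premium_change_m = 75.0 - 80.0
loss_ratio_change_points = (0.727 - 0.658) * 100
margin_change_m = margin_2011 - margin_2007
```

Under that assumption the margin falls from about $27.4M to about $20.5M, a drop of roughly $6.9M, which matches the slide's rounded $7M.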
35
Going granular
36
Territories: 1 ZIP Codes: 34 Block Groups: 669
Example: Milwaukee, Wisconsin, Geographic Area
Environmental Risk Factors for Auto
Importance of Peril
Estimated Market Share DWP (A.M. Best)
2007: 28% / 72%
2012: 34% / 66%
Non-By-Peril Insurers Loss Ratio: 76.6%
25 By-Peril Insurers’ Loss Ratio: 69.2%
Same age, ZIP, age, PPC: same risk?
Traditional approach: these homes may all be rated similarly based on the amount of insurance
Attribute approach: roof materials, HVAC, number of bathrooms, basements, garages etc. make a difference – esp. by peril
Building characteristics approach
40
Brickface, attached garage ↓ wind exposure
Larger floor plan, carport ↑ wind exposure
Fireplace ↑ hail exposure
Composite shingles ↓ hail exposure
3½ bath, two stories ↑ water (non-weather) exposure
Single story, pool ↑ theft/vandalism exposure
41
Challenges and approaches
Data bucket challenge
Wet Weather
… eaking [sic] ice maker in bar …
after heavy downpour, insured noticed water damage to ceiling and walls in den
… freeze damage to swimming pool …
… freezer defrosted and water leaked …
Wet, not Weather
Fuzzy matching example
43
Model Years Size Body Vehicle
2011-2013 Small Two-door Car Honda Civic
2011-2013 Small Two-door Car Honda Civic Si
Year Make Model
2013 Honda Civic 2DR FWD
Year Make Model
2013 Honda Civic 2-door coupe
Class Make Model
Small Family Car Honda Civic
Small Family Car Honda Civic Hybrid
Year Make Model Style
2013 Honda Civic EX
Sources:
Cars.com
Edmunds.com
Euroncap.org
Iihs.org/iihs/ratings
Iihs.org/iihs/topics/insurance-loss-information
Safercar.gov
Year Make Model Configuration
2013 Honda Civic Coupe EX
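The vehicle descriptions above refer to the same car in different words. One simple way to match them, shown here as a sketch and not as the method used in the presentation, is to normalize known variants (for example "2DR", "2-door", "Two-door") to a common token and then score token overlap with Jaccard similarity. The function names and the normalization rule are assumptions.

```python
import re


def tokens(description):
    """Lowercase, normalize door-count variants, and split into a token set."""
    text = description.lower()
    # Treat "2dr", "2-door" and "two-door" as the same token.
    text = re.sub(r"\b(?:two|2)[\s-]?(?:door|dr)\b", "2door", text)
    return set(re.findall(r"[a-z0-9]+", text))


def similarity(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def best_match(query, candidates):
    """Return the candidate description most similar to the query."""
    return max(candidates, key=lambda c: similarity(query, c))


# Descriptions taken from the slide's example sources.
query = "2013 Honda Civic 2DR FWD"
candidates = [
    "2013 Honda Civic 2-door coupe",
    "2013 Honda Civic EX",
    "2013 Honda Civic Coupe EX",
]
match = best_match(query, candidates)
```

In practice the normalization table would need many more rules (trim names, body styles, drivetrain codes), and borderline scores would still call for manual review.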
44
Symbol approaches -- attributes
VIN ABC
Relative Frequency and Severity
Covariates: Territory, Operator Age, Marital Status, Driving Record, Insurance Score, Limits, Deductibles, Affinity
Manufacturer Data: Wheelbase, Height, Weight, Body Style, Engine Size, Horsepower, Airbags
Ratings and Tests: e.g. Crash Tests
Car Gurus: e.g. Braking Distance
Econometrics: e.g. model year CPI or KBB
Potential decision tree – theft claims
46
Anti-Theft Alarm: Yes / No
Body Style: Cool / Uncool
SVR System: Yes / No
Frequency 2%
Frequency 8% Frequency 1%
Frequency 0.25% Frequency 7%
Frequency 9.75% Frequency 11.5%
Frequency 2.5%
Frequency 22.75%
Frequency 4.75%
Frequency 1.75%
Frequency 13.75%
NOTE: These results are hypothetical. Please do not reproduce. © Insurance Services Office, Inc. 2015
Comparing approaches
47
              Experience                       Attribute
Speed         Trailing indicators              Leading indicators
Granularity   Reliant on MSRP w/in series      Trim level predictions
Objectivity   Intangibles and evolution        Defined set of attributes
Maintenance   Annual review                    Resolution and remodeling
Accuracy      High at series level but         Limited for variations beyond
              limited within series            modeled set of attributes
Best of all worlds
48
Mixer
Best Estimate
49
Emerging Data Sources
Speed kills
50
Twitter Universe's Reaction to Autonomous Cars
[Chart: Number of Tweets per day (0 to 2000) by Day of observation period ending April 3, 2015 (0 to 400)]
Annotated events:
5/29/2014 Google reveals “no steering wheel” concept
7/27/2014 Baidu announces autonomous concept
10/21/2014 Audi sets speed record for autonomous cars
1/6/2015 Ford announces five-year plan, Benz concept
3/16/2015 Coast-to-coast experiment
3/18/2015 Tesla teases self-driving software
Big data fits on a single page
51
Patrick W.
Basic Info
Gender: Male
Birthday: MYOB
Hometown: Island Paradise
Status: Happily Unmarried
Contact Info
Email: [email protected]
Phone: 123.456.7890
Personal Information
Activities: Fine dining, skiing, oration
Interests: World War II, spirits, fast cars
Work History
Employer: Insuranceplex
Position: Actuarial Sensei
Tenure: 1974 – Present
Groups
Toastmasters
Straphangers
Woodstock Revival
Slopes Loyalty Program
Tapas for the Road
Friends
Patrick has 314 friends including: Jim W., David C., Mary V., Jeff D., Shalini L. … see more …
Recent Activity
Patrick W. said: Just installed a gadget in my Acura to save $ on car insurance… will need the $ for the third home I’m looking at in FL!
Mary V. commented: Your crazy.
(This example is completely hypothetical and
intended for illustrative purposes only.)
Data uniquely suited for LTV
52
Source: ISO Social Media
Insurer Focus Group
(December 2014)
Model-ready social media data
53
Engage with insurer on social media? Yes / No
  Positive / Neutral / Negative
Posts on social review sites? No / Yes
  Mostly negative reviews / Mostly positive reviews
Reviewed on commerce sites? Yes / No
  Mostly positive reviews / Mostly negative reviews
What are hobbies and interests?
  Active lifestyle / Sedentary lifestyle
Hypothetical machine learning example
e(d) = 18 months e(d) = 96 months
e(d) = 78 months e(d) = 32 months
e(d) = 63 months e(d) = 38 months
e(d) = 44 months
Is the ‘next trillion dollar industry’ waiting for P&C to get well soon?
• Obese 75+% more likely to die in car crash
• Diabetics 10-20% more likely to crash car
• 2-3% of crashes involve drowsy driving
• Poor eyesight linked to ~60% of car crashes
• 50+% of work accidents due to drowsiness
• Stress-related workplace claims ~2x costlier
54
Can P&C insurers make a positive difference?
Sources:
Geggel (New York Times 1/28/2013), “Precautions Urged for Drivers with Diabetes”
Melamed and Oksenberg, “Excessive Daytime Sleepiness and Risk of Occupational Injuries in Non-Shift Daytime Workers”
Maine Employers Mutual Insurance Company and David Lee, “Managing Employee Stress and Safety”
National Highway Traffic Safety Administration, “Traffic Safety Facts: Drowsy Driving” (March 2011)
Rice and Zhu (Emergency Medicine Journal 1/21/2013), “Driver Obesity and the Risk of Fatal Injury During Traffic Collisions”
Vision Impact Institute, “Ten Things You Need to Know: Concise Facts on Vision Economics”
The little source of transformative data
55
Small takeaways…
• Understand the value proposition
• Confirm the data matches the problem
• Clean, timely data required to compete
• Machine learning increasing in prominence
• Policyholders expect value for their big data
56
#thanks
No part of this presentation may be copied or redistributed without the
prior written consent of ISO. This material was used exclusively as an
exhibit to an oral presentation. It may not be, nor should it be relied
upon as reflecting, a complete record of the discussion.
© Insurance Services Office, Inc., 2015