Predictive Modeling and Data Mining for Actuaries
Dan Steinberg, Ph.D.
Mikhail Golovnya
Salford Systems
http://www.salford-systems.com


Page 1:

Predictive Modeling and Data Mining for Actuaries

Dan Steinberg, Ph.D.
Mikhail Golovnya

Salford Systems
http://www.salford-systems.com

Page 2:

What Everyone Hears

Machine Learning

Pattern Recognition

Artificial Intelligence

Predictive Analytics

CRISP-DM, CRM, KDD, etc.

Business Intelligence

Data Warehousing

OLAP, CART, SVM, NN, etc.

DATA MINING

– Statistics
– Computer science
– Database management
– Insurance
– Finance
– Marketing
– Electrical engineering
– Robotics
– Biotech and more

Page 3:

Data Mining Defined

• Data mining is the search for patterns in data using modern, highly automated, computer-intensive methods
– Data mining may be best defined as the use of a specific class of tools (data mining methods) in the analysis of data
– The term "search" is key to this definition, as is "automated"
• The literature often refers to finding hidden information in data

Page 4:

The Three Sisters

Science:
– Study the phenomenon
– Understand its nature
– Try to discover a law
– The laws usually hold for a long time

Statistics:
– Collect some data
– Guess the model (perhaps using science)
– Use the data to clarify and/or validate the model
– If it looks "fishy", pick another model and do it again

Data Mining:
– Access to lots of data
– No clue what the model might be
– No long-term law is even possible
– Let the machine build a model
– And let's use this model while we can

Page 5:

Data Mining Mythology

• Quest for the Holy Grail – build an algorithm that will always find 100% accurate models

• Absolute Powers – data mining will finally find and explain everything

• Gold Rush – with the right tool one can beat the stock market and become obscenely rich

• Magic Wand – getting a complete solution from start to finish with a single button push

• Doomsday Scenario – all conventional analysts will eventually be replaced by smart computer chips

Page 6:

Patterns Searched For

• We will focus on patterns that allow us to accomplish two tasks:
– Classification
– Regression
This is known as "supervised learning"
• We will briefly touch on a third common task:
– Finding groups in data (clustering, density estimation)
This is known as "unsupervised learning"
• There are other patterns we will not discuss today, including:
– Patterns in sequences
– Connections in networks (the web, social networks, link analysis)

Page 7:

Major Data Mining Tools

• CART® (Decision Trees; C4.5, CHAID among others)
• MARS® (Multivariate Adaptive Regression Splines)
• Artificial Neural Networks (ANNs, many commercial)
• Association Rules (clustering, market basket analysis)
• TreeNet® (Stochastic Gradient Tree Boosting)
• RandomForests® (ensembles of trees with random splits)
• Genetic Algorithms (evolutionary model development)
• Self-Organizing Maps (SOM, like k-means clustering)
• Support Vector Machines (SVM, wrapped in many patents)
• Nearest Neighbor Classifiers

Page 8:

CART® Overview

• Classification and Regression Trees (CART®) – the original approach based on the "let the data decide local regions" concept developed by Breiman, Friedman, Olshen, and Stone:
– For each current data region, consider all possible orthogonal splits (based on one variable) into 2 sub-regions
– The best split is then defined as the one having the smallest MSE after fitting a constant in each sub-region (regression) or the smallest resulting impurity (classification)
– Proceed sequentially until all structure in the training set has been completely exhausted; the largest tree is produced
– Create a sequence of nested sub-trees with different amounts of localization
• Pick the best tree based on the Test set performance
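The split search described above can be sketched in a few lines. The following is a minimal illustration in Python with numpy (not Salford's implementation) of the regression criterion: for one predictor, try every threshold, fit a constant (the mean) in each sub-region, and keep the split with the smallest total squared error. The classification case is analogous, with an impurity measure such as Gini in place of squared error.

```python
import numpy as np

def best_split_mse(x, y):
    """Find the orthogonal split on one numeric predictor that minimizes
    total squared error after fitting a constant (the mean) in each of the
    two resulting sub-regions (the regression flavour of CART)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue                            # no valid threshold between tied values
        threshold = (x_sorted[i] + x_sorted[i - 1]) / 2.0
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold, best_sse

# Tiny demonstration on synthetic data with a jump at x = 0.5
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.where(x < 0.5, 1.0, 3.0) + rng.normal(0, 0.2, 200)
print(best_split_mse(x, y))                     # threshold lands near 0.5
```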

Page 9:

CART® – Major Strengths

• Relatively fast
• Requires minimal supervision by analyst
• Produces easy-to-understand models
• Conducts automatic variable selection
• Handles missing values via surrogate splits
• Invariant to monotonic transformations of predictors
• Impervious to outliers

Page 10:

Classification with CART® (real-world study, early 1990s)

• Fixed line service provider offering a new mobile phone service

• Wants to identify customers most likely to accept new mobile offer

• Data set based on limited market trial

Page 11:

Real World Trial

• 830 households offered a mobile phone package
• All offered an identical package, but pricing was varied at random
– Handset prices ranged from low to high values
– Per-minute prices ranged separately from low to high rates
• Households asked to make a yes or no decision on the offer
• 15.2% of the households accepted the offer
• Our goal was to answer two key questions:
– Who to make offers to?
– How to price?

Page 13:

Model Results

• CART completes the analysis and gives access to all results from the NAVIGATOR
– Shown on the next slide
• Upper section displays the tree of a selected size
• Lower section displays the error rate for trees of all possible sizes
• Green bar marks the most accurate tree
• We display a compact 10-node tree for further scrutiny

Page 14:

Navigator Display

• The most accurate tree is marked with the green bar. We select the 10-node tree for the convenience of a more compact display.

Page 15:

Root Node

• The root node displays details of the TARGET variable in the overall training data
• We see that 15.2% of the 830 households accepted the offer
• The goal of the analysis is now to extract patterns characteristic of responders

• If we could use only a single piece of information to separate responders from non-responders, CART chooses the HANDSET PRICE
• Those offered the phone at a price > 130 contain only 9.9% responders
• Those offered a lower price respond at 21.9%

Page 18:

Examine Individual Node

• Hover the mouse over a node to see inside
• Even though this node is on the "high price" side of the tree, it still exhibits the strongest response across all terminal node segments (43.5% response)

• Here we select rules expressed in C
• The entire tree can also be rendered in Java, XML/PMML, or SAS

Page 23:

Modeling Automation

• Here we display results for each of the 6 major tree-growing methods. Twoing yields the best performance here. This is one of 18 different automation schemes.

Page 25:

Vary Penalty on False Positives

Page 27:

Backwards Elimination of Features

Page 29:

Constrained Trees

• Many predictive models can benefit from Salford's patent-pending "Structured Trees"
• Trees constrained in how they are grown to reflect decision-support requirements
• In the mobile phone example: want the tree to first segment on customer characteristics and then complete using price variables
• Price variables are under the control of the company
• Customer characteristics are not under company control

Page 32:

Translating Tree into Code
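To show what "translating a tree into code" produces, here is a hedged sketch in Python rather than the C, Java, XML/PMML, or SAS output CART actually generates. The first split mirrors the handset-price rule from the mobile-offer example; the secondary splits, variable names, thresholds, and leaf probabilities are hypothetical placeholders, not the actual CART output.

```python
def score_household(handset_price, minutes_price, income):
    """Illustrative hand-translation of a small decision tree into code.
    Only the root split (HANDSET_PRICE > 130) comes from the deck; the
    remaining splits and leaf values are hypothetical."""
    if handset_price > 130:
        # high handset price branch: weaker response overall (~9.9% at the root split)
        if minutes_price > 0.40:        # hypothetical threshold
            return 0.05                 # predicted probability of acceptance
        return 0.20
    else:
        # low handset price branch: stronger response overall (~21.9% at the root split)
        if income > 50_000:             # hypothetical threshold
            return 0.35
        return 0.15

print(score_household(handset_price=99, minutes_price=0.25, income=62_000))
```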

Page 33:

Further Development of CART

• Hybrid models
– Combining CART with Logistic Regression
– Combining CART with Neural Nets
• Linear combination splits
• MARS – a better way to handle regression
• Committees of trees
– Bagging
– Arcing
– RF
• Stochastic Gradient Boosting (MART/TreeNet)
• Random Forests

Page 34:

Multivariate Adaptive Regression Splines

• MARS is a highly automated tool for regression
• Developed by Jerome H. Friedman of Stanford University
– Annals of Statistics, 1991; a dense 65-page article
– Takes some inspiration from its ancestor CART®
– Produces smooth curves and surfaces, not the step functions of CART
• Appropriate target variables are continuous
• Can also perform well on binary dependent variables
– Censored survival models (waiting-time models, as in churn)

Page 35:

Key Features of MARS

• The end result of a MARS run is a regression model
– MARS automatically chooses which variables to use
– Variables are optimally transformed
– Interactions are detected
– The model is self-tested to protect against over-fitting
• Some formal studies find MARS can outperform Neural Nets

Page 36:

Boston Housing Data Set

• Harrison, D. and D. Rubinfeld. Hedonic Housing Prices & Demand for Clean Air. Journal of Environmental Economics and Management, v5, 81-102, 1978
– 506 census tracts in the City of Boston for the year 1970
– Goal: study the relationship between quality-of-life variables and property values
– MV: median value of owner-occupied homes in tract ('000s)
– CRIM: per capita crime rate
– NOX: concentration of nitrogen oxides (pphm)
– AGE: percent built before 1940
– DIS: weighted distance to centers of employment
– RM: average number of rooms per house
– LSTAT: percent neighborhood 'lower SES'
– RAD: accessibility to radial highways
– CHAS: borders Charles River (0/1)
– INDUS: percent non-retail business
– TAX: tax rate
– PT: pupil-teacher ratio

Page 37:

Target: Median House Value (MV)

Page 40:

Challenge: Searching for Multiple Knots

• Finding the one best knot in a simple regression is a straightforward search problem
– Try a large number of potential knots and choose the one with the best R-squared
– The computation can be implemented efficiently using update algorithms; the entire regression does not have to be rerun for every possible knot (just update the X'X matrices)
• Finding more than two knots simultaneously requires far more computation
• To preserve linear problem complexity, one has to add or remove knots sequentially, one at a time

Page 41:

Example: Flat-Top Function

• The true function (in the graph on the left) has two knots, at X=30 and X=60
• The observed data on the right contains random error
• The best single knot will be at X=45, and MARS finds this first

[Two plots of Y (0–80) versus X (0–90): the true flat-top function on the left and the noisy observed data on the right.]
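A minimal sketch of the single-knot search described on the previous slide, applied to a noisy flat-top function like the one pictured above. It uses brute-force refitting with numpy rather than the efficient X'X update, and the data-generating function is an assumption chosen to have knots at 30 and 60.

```python
import numpy as np

def best_single_knot(x, y, n_candidates=50):
    """Brute-force single-knot search: for each candidate knot t, fit a
    linear model in the hinge bases max(0, x - t) and max(0, t - x) and
    keep the knot with the highest R-squared."""
    candidates = np.quantile(x, np.linspace(0.05, 0.95, n_candidates))
    tss = ((y - y.mean()) ** 2).sum()
    best_t, best_r2 = None, -np.inf
    for t in candidates:
        X = np.column_stack([np.ones_like(x),
                             np.maximum(0, x - t),
                             np.maximum(0, t - x)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r2 = 1 - ((y - X @ beta) ** 2).sum() / tss
        if r2 > best_r2:
            best_t, best_r2 = t, r2
    return best_t, best_r2

# Noisy flat-top function with true knots at 30 and 60; the single best
# knot tends to land near the middle (about X = 45), as described above.
rng = np.random.default_rng(1)
x = rng.uniform(0, 90, 400)
y = np.clip(np.minimum(x, 90 - x), 0, 30) * 2 + rng.normal(0, 3, 400)
print(best_single_knot(x, y))
```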

Page 43:

Computational Strategy

• Multiple knot placement is implemented in a step-wise manner
• Need a forward/backward procedure, as used in CART
• The forward procedure results in a model that is clearly overfit, with too many knots
• The backward procedure removes the least-contributing knots sequentially
– Using an appropriate statistical criterion, remove all knots that add sufficiently little to model quality
• The resulting model will have approximately correct knot locations

Page 44:

Basis Functions

• Thinking in terms of knot selection works very well to illustrate splines in one dimension
• Thinking in terms of knot locations is unwieldy when working with a large number of variables simultaneously
– Need a concise notation and programming expressions that are easy to manipulate
– Not clear how to construct or represent interactions using knot locations
• Basis Functions (BF) provide the analytical machinery to express the knot placement strategy
• MARS creates sets of basis functions to decompose the information in each variable individually

Page 45:

Spline with 3 Basis Functions

• We now have the following basis functions in INDUS:
  BF1 = max(0, INDUS - 4)
  BF2 = max(0, INDUS - 8)
  BF3 = max(0, 4 - INDUS)
  MV = 29.433 + 0.925*BF3 - 2.180*BF1 + 1.939*BF2
• All 3 line segments have negative slope, even though 2 of the coefficients above are > 0

[Plot of MV (0–40) versus INDUS (0–30) showing the fitted three-segment spline.]
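The fitted spline above can be checked directly. The short sketch below (numpy, illustrative only) evaluates the three basis functions and confirms that all three segments slope downward even though two of the coefficients are positive.

```python
import numpy as np

# The fitted MARS spline for MV as a function of INDUS, written exactly as
# on the slide: three hinge basis functions plus an intercept.
def mv_hat(indus):
    bf1 = np.maximum(0, indus - 4)
    bf2 = np.maximum(0, indus - 8)
    bf3 = np.maximum(0, 4 - indus)
    return 29.433 + 0.925 * bf3 - 2.180 * bf1 + 1.939 * bf2

# Slopes of the three line segments (INDUS < 4, 4 to 8, > 8): all negative.
for lo, hi in [(0, 4), (4, 8), (8, 30)]:
    slope = (mv_hat(hi) - mv_hat(lo)) / (hi - lo)
    print(f"INDUS in ({lo}, {hi}): slope = {slope:.3f}")
```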

Page 46:

MARS Process

• MARS technology is similar to CART's:
– Grow a deliberately overfit model and then prune it back
– The core notion is that a good model cannot be built from forward stepping plus a stopping rule
– Must overfit generously and then remove unneeded basis functions
• The forward step always adds basis functions in pairs
• The backward step always removes basis functions one at a time
• In practice the user specifies an upper limit on the number of knots to be generated in the forward stage
– The limit should be large enough to ensure that the true model can be captured
– It will have to be set by trial and error

Page 47:

Multi Tree Methods

• Grow a tree on the training data
• Find a way to grow another, different tree (change something in the set-up)
• Repeat many times, e.g. 500 replications
• Average the results or create a voting scheme
– E.g. relate PD to the fraction of trees predicting default for a given case

The beauty of the method is that every new tree starts with a complete set of data. Any one tree can run out of data, but when that happens we just start again with a new tree and all the data (before sampling).

Prediction via voting

Page 48:

Sampling with Replacement

• Have a training set of size N
• Create a new data set of size N by sampling with replacement from the training set
• The new set will be different from the original!
– About a third of the original data will be left out
– Another third will enter exactly once
– The other third will enter more than once, but rarely more than 6 times
• May do this repeatedly to generate multiple bootstrap samples
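A minimal numpy sketch (not any Salford tool) of the effect just described: draw one bootstrap sample of size N and count how much of the original data is left out, included once, or included several times.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000
sample = rng.integers(0, N, size=N)            # draw N row indices with replacement

counts = np.bincount(sample, minlength=N)      # how often each original row was drawn
print("left out       :", np.mean(counts == 0))   # about 1/e ~ 0.368
print("included once  :", np.mean(counts == 1))   # also about 0.368
print("included > once:", np.mean(counts >= 2))   # roughly the remaining quarter
print("max repeats    :", counts.max())           # rarely exceeds about 6
```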

Page 49:

Bootstrap Resampling

• Probability of being omitted in a single draw is (1 - 1/n)
• Probability of being omitted in all n draws is (1 - 1/n)^n
• The limit of this series as n increases is 1/e = 0.368
• Example: distribution of weights in a 2,000-record resample:
– approximately 36.5% of the sample excluded: 0% of the resample
– 37.5% of the sample included once: 37.5% of the resample
– 18% of the sample included twice: 36% of the resample
– 6% of the sample included three times: 18% of the resample
– 2% of the sample included four or more times: 8% of the resample
– Total: 100%

Page 50:

Combining Trees by Voting

• Trees are combined via voting (classification) or averaging (regression)
• Classification trees "vote"
– Recall that classification trees classify: they assign each case to ONE class only
– With 50 trees, there are 50 class assignments for each case
– The winner is the class with the most votes
– Votes could be weighted – say, by the accuracy of the individual trees
• Regression trees assign a real predicted value to each case
– Predictions are combined via averaging
– Results will be much smoother than from a single tree

Page 51:

Bootstrap Aggregation Performance Gains
Statlog Data Set Summary

Data Set    # Training   # Variables   # Classes   # Test Set
Letters     15,000       16            26          5,000
Satellite   4,435        36            6           2,000
Shuttle     43,500       9             7           14,500
DNA         2,000        60            3           6,186

Test Set Misclassification Rate (%)

Data Set    1 Tree    Bag      Decrease
Letters     12.6      6.4      49%
Satellite   14.8      10.3     30%
Shuttle     0.062     0.014    77%
DNA         6.2       5.0      19%

Page 52:

Adaptive Resampling and Combining (ARCing, a Variant of Boosting)

• Bagging proceeds by independent, identically-distributed resampling draws

• Adaptive resampling: probability that a case is sampled varies dynamically

• Starts with all cases having equal probability

• After first tree is grown, weight is increased on all misclassified cases

• For regression, weight increases with prediction error for that case

• Idea is to focus tree on those cases most difficult to predict correctly

Page 53:

ARCing Reweights Training Data

• A similar procedure was first introduced by Freund & Schapire (1996)
• The Breiman variant (ARC-x4) is easier to understand:
– Suppose we have already grown K trees; let m_i = # times case i was misclassified (0 ≤ m_i ≤ K)
– Define w_i = 1 + m_i^4
– Prob(sample inclusion) ∝ w_i
– Weight = 1 for cases with zero misclassifications
– Weight = 1 + K^4 for cases with K misclassifications
• The weight rapidly becomes large if a case is difficult to classify
– Samples will tend to be increasingly dominated by misclassified cases
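The ARC-x4 reweighting loop can be sketched directly from the formulas above. The example below is an illustrative Python version using scikit-learn trees as the base learner, not Breiman's original implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

n_trees = 25
miss = np.zeros(len(X))                       # m_i: times case i misclassified so far
trees = []
for _ in range(n_trees):
    w = 1.0 + miss ** 4                       # ARC-x4 weights: w_i = 1 + m_i^4
    p = w / w.sum()                           # Prob(sample inclusion) proportional to w_i
    idx = rng.choice(len(X), size=len(X), p=p)
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    trees.append(tree)
    miss += (tree.predict(X) != y)            # update misclassification counts

# Combine the trees by voting, as on the earlier slides
votes = np.stack([t.predict(X) for t in trees])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("ARC-x4 ensemble training accuracy:", (pred == y).mean())
```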

Page 54:

Boston Housing Data

Total Sum of Squares => 5,628

                    SUM(RES^2)   SUM(ABS(DEV))   R-squared
FIRST TREE ONLY*    2078.620     272.000         63%
BAGGING             837.814      215.580         85%
ARCING**            744.090      214.423         87%

*  The single tree now performs worse than a stand-alone CART run (R-squared = 72%) because in bagging we always work with exploratory trees only
** Arcing performance beats the MARS additive model but is still inferior to the MARS interactions model

Page 55:

Problems with Boosting

• Boosting (and Bagging) are very slow and consume a lot of memory; the final models tend to be awkwardly large and unwieldy
• Boosting in general is vulnerable to overtraining
– Much better fit on training than on test data
– Tendency to perform poorly on future data
• Boosting is highly vulnerable to errors in the data
– The technique is designed to obsess over errors
– It will keep trying to "learn" patterns to predict miscoded data
• Documented in a study by Dietterich (1998): An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization

Page 56:

Introduction to Random Forests

• A new approach for many data-analytical tasks developed by Leo Breiman of the University of California, Berkeley
– Co-author of CART® with Friedman, Olshen and Stone
– Author of the Bagging and Arcing approaches to combining trees
• Good for classification and regression problems
– Also for clustering, density estimation
– Outlier and anomaly detection
– Explicit missing value imputation
• Builds on the notion of committees of experts but is substantially different in key implementation details

Page 57:

Random Forest

• A random forest is a collection of single trees grown in a special way

• The overall prediction is determined by voting (in classification) or averaging (in regression)

• The Law of Large Numbers ensures convergence

• The key to accuracy is low correlation and bias

• To keep bias low, trees are grown to maximum depth

Page 58:

The Algorithm

• Each tree is grown on a bootstrap sample from the learning set
• A number R is specified (the square root of the number of predictors by default) such that it is noticeably smaller than the total number of available predictors
• During the tree-growing phase, at each node only R randomly selected predictors are tried

Page 59:

Performance

• All major advantages of a single tree are automatically preserved
• Since each tree is grown on a bootstrap sample, one can:
– Use the out-of-bag samples to compute an unbiased estimate of the accuracy
– Use the out-of-bag samples to determine variable importances
• There is no overfitting as the number of trees increases

Page 60:

Further Derivatives

• It is possible to compute a generalized proximity between any pair of cases
• Based on proximities one can:
– Proceed with a well-defined clustering solution
– Detect outliers
– Generate informative data views/projections using scaling coordinates
– Do missing value imputation
• Easy expansion into the unsupervised learning domain

Page 61:

Introduction to TreeNet

• A new approach to machine learning and function approximation developed by Jerome H. Friedman at Stanford University
– Co-author of CART® with Breiman, Olshen and Stone
– Author of MARS®, PRIM, Projection Pursuit, COSA, RuleFit™ and more
• Also known as Stochastic Gradient Boosting
• Very strong for classification and regression problems
• Builds on the notions of committees of experts and boosting but is substantially different in key implementation details

Page 62:

Some TreeNet Successes

• 2006 PAKDD competition: customer type discrimination, 3rd place
– Model built in one day; 1st place accuracy 81.9%, TreeNet accuracy 81.2%
• 2005 BI-CUP sponsored by the University of Chile attracted 60 competitors: "Most Accurate"
• 2004 KDDCup: "Most Accurate"
• 2003 Duke University/NCR Teradata CRM modeling competition
– "Most Accurate" and "Best Top Decile Lift" on both in- and out-of-time samples
• In use in major financial services and web operations

Page 63:

TreeNet Results Display

• Performance on train (blue) and test (red) data as model evolves

Page 64:

Benefits of TreeNet

• Built on CART trees and thus:
– Immune to outliers
– Handles missing values automatically
– Selects variables
– Results are invariant under monotone transformations of variables
• Resistant to mislabeled target data
– In medicine, cases are commonly misdiagnosed
– In business, non-responders are occasionally flagged as "responders"
• Resistant to overtraining – generalizes very well
• Can be remarkably accurate with little effort
• Trains very rapidly; comparable to CART

Page 65:

TreeNet Process

• Begin with one very small tree as the initial model
– Could be as small as ONE split generating 2 terminal nodes
– A typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes
– Output is a probability (e.g. of default)
– The model is intentionally "weak"
• Compute "residuals" for this simple model (prediction error) for every record in the data (even for a classification model)
• Grow a second small tree to predict the residuals from the first tree

Page 66:

TreeNet: Sample Individual Tree

[Diagram: the predicted response is an intercept (4.172) plus small additive adjustments from a sequence of two-split trees, with further trees contributing additional adjustments. Tree 1 splits on EQ2CUST_STF < 5.04 and COST2INC < 65.4 (leaf adjustments +0.068, -0.051, -0.228); Tree 2 splits on TOTAL_DEPS < 65.5M and EQ2TOT_AST < 2.8 (+0.065, -0.213, -0.010); Tree 3 splits on EQ2CUST_STF < 5.04 and EQ2TOT_AST < 2.8 (-0.088, +0.140, -0.007).]

Page 67:

Key Points

• Trees are kept small (2-6 nodes common)
• Updates are small (downweighted); the update factor can be as small as .01, .001, .0001
• Random subsets of the training data are used in each cycle
– Never train on all the training data in any one cycle
• Highly problematic cases are IGNORED
– If the model prediction starts to diverge substantially from the observed data, that data will not be used in further updates
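A minimal sketch of the boosting cycle described on these two slides (small trees fit to residuals, a small update factor, and a random subsample in each cycle), written with numpy and scikit-learn regression trees. It is a generic least-squares stochastic gradient boosting loop, not the TreeNet implementation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

learn_rate = 0.01           # small, downweighted updates
n_trees = 500
subsample = 0.5             # random subset of the training data each cycle

pred = np.full(len(y), y.mean())      # start from a trivial (intercept-only) model
trees = []
for _ in range(n_trees):
    resid = y - pred                                   # residuals of the current model
    idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
    tree = DecisionTreeRegressor(max_leaf_nodes=6, random_state=0)
    tree.fit(X[idx], resid[idx])                       # small tree fit to residuals
    pred += learn_rate * tree.predict(X)               # downweighted update
    trees.append(tree)

tss = ((y - y.mean()) ** 2).sum()
print("R-squared on training data:", 1 - ((y - pred) ** 2).sum() / tss)
```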

Page 68:

TreeNet Modeling Options

• Strictly additive model (no interactions allowed)
• Low-level interactions allowed
• High-level interactions allowed
• Constraints: only specific interactions allowed (TN PRO)
• Standard classification
• Binary non-parametric logistic regression
• Regression: Least Squares, LAD, Huber-M hybrid of LS/LAD
• In development:
– Survival models, matched-sample case-control logistic, Poisson

Page 69:

Interpreting TN Models

• As TN models consist of hundreds or even thousands of trees, there is no useful way to represent the model via a display of one or two trees
• However, the model can be summarized in a variety of ways:
– Partial dependency plots: these exhibit the relationship between the target and any predictor, as captured by the model
– Variable importance rankings: these stable rankings give an excellent assessment of the relative importance of predictors
– ROC curves: TN models produce scores that are typically unique for each scored record
– Confusion matrix: using an adjustable score threshold, this matrix displays the model's false positive and false negative rates

Page 72:

Interaction Detection

• TreeNet models based on 2-node trees by definition EXCLUDE interactions
– The model may be highly nonlinear but is by definition strictly additive
– Every term in the model is based on a single variable (single split)
• Build TreeNet on larger trees (the default is 6 nodes)
– Permits up to 5-way interactions, but in practice behaves more like 3-way interactions
• Can conduct an informal likelihood ratio test: TN(2-node) versus TN(6-node)
• Large differences signal important interactions
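One way to see the 2-node versus 6-node comparison in action is the hedged sketch below, which uses scikit-learn's gradient boosting (not TreeNet) on synthetic data containing a genuine two-way interaction; the data, parameters, and test-set comparison are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic target with a genuine two-way interaction term x0 * x1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(4000, 5))
y = X[:, 0] + X[:, 1] + 3 * X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 4000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for leaves, label in [(2, "2-node trees (additive only)"), (6, "6-node trees")]:
    gb = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                   max_leaf_nodes=leaves, random_state=0)
    gb.fit(X_tr, y_tr)
    print(label, "test R-squared:", round(gb.score(X_te, y_te), 3))

# A large gap between the two scores signals that interactions matter.
```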

Page 73:

Case I: Customer Retention

• Have a dataset containing information about last year's renewals of insurance policies for existing customers
– About 100,000 observations randomly split into 50% learn and 50% test samples
– Overall renewal rate about 50%
• Want to build a segmentation model to identify segments likely to renew

Page 74:

Key Segmentation Variables

• GENDER – customer's gender
• AREA – customer's area
• POLICY_TYPE – type of policy
• POLICY_AGE – years with the customer
• RESTRICTION – policy restrictions
• AGE – customer age
• POLICYCHANGE – recent change in premium
• MARKETINTENSITY – how competitive the current market is

Page 76:

Extreme Segments

• Segment 4: customers with at least an 8-year history are likely to renew when market intensity is low
• Segment 11: when market intensity is high, only the most loyal customers (11 or more years of history) are likely to renew, provided there was no significant premium increase
• Segment 12: when there is a significant premium increase, we are likely to lose our customers under intense market conditions

Page 80:

Using Battery PRIOR

• Battery PRIOR results in a sequence of CART runs with a varying degree of richness of the constructed segments
• We can now quickly identify hot-spots and the corresponding rules that lead there

Page 81:

Case III: Dodgy Data

• Try to replicate premiums calculated by a mainframe system – only 95% accuracy is achieved
• What's wrong with the remaining 5%?
• Want to find a possible error in the program responsible for computing premiums
• Will run CART on a dataset where the error records are marked as the "Yes" class

Page 82:

Dodgy Data Example – Solved

• CART produced the following solution within minutes
• Some EDI transactions are erroneously being set to Finance Code "N" at the stage of loading into the mainframe
• The entire problem is solved in 30 minutes!

Page 83:

Case IV: Price Change

• When looking at the distribution of prices in the portfolio, a small number of extreme outliers getting large increases is detected
• Assigning all outliers to the "Yes" group and running CART results in a single-split tree
• Per CART's finding, one of the rate tables still contains an added dummy premium that by pure accident was not removed
• The whole process takes 5 minutes!

Page 84:

Case V: Who Gets a Big Decrease?

• A group of customers is getting a large decrease on the proposed rates
• The actual rating system is very complicated and hard to track – it is not easy to tell who belongs to this group
• CART builds a simple tree that tells a story suitable for presenting to management
• All of these seemingly trivial examples would have taken a lot more time using conventional analysis tools

Page 85:

Case VI: Tracking Portfolio Change

• Interested in how the portfolio changed before and after a price change
• Extracted several months of new business before and after the price change; the two separate groups were classified as "Yes" and "No"
• Run CART to detect the major differences between the two groups

Page 86:

CART Solution (Home Insurance)

• The tree reflects the fact that the default quoted excess rate has changed
• A perfect example of the actuarial quality control cycle

Page 87:

Case VII: Missing Value Analysis

• CART uses the powerful mechanism of surrogates to deal with missing values
• One may also introduce missing value indicators that mark all cases where a given variable is missing
• Now run a sequence of analyses declaring one missing value indicator at a time as the target variable – this often reveals very interesting structure behind the missingness
• This process is generally much cheaper than a direct inquiry into operations and the data collection process
• CART short-circuits the path to understanding the data
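A minimal sketch of the missingness analysis just described, assuming pandas/scikit-learn, a synthetic stand-in for a policy file, and a hypothetical variable name (VEHICLE_AGE): declare the missing-value indicator as the target and grow a small tree on the remaining predictors.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for a policy file: VEHICLE_AGE (hypothetical name) is
# missing mostly for one distribution channel, i.e. not missing at random.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "CHANNEL": rng.integers(0, 3, n),          # 0 = broker, 1 = web, 2 = phone
    "PREMIUM": rng.gamma(2.0, 300.0, n),
    "VEHICLE_AGE": rng.integers(0, 20, n).astype(float),
})
df.loc[(df["CHANNEL"] == 1) & (rng.random(n) < 0.8), "VEHICLE_AGE"] = np.nan

# Declare the missingness indicator as the target and grow a small tree
target = df["VEHICLE_AGE"].isna().astype(int)
predictors = df.drop(columns=["VEHICLE_AGE"])
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=100).fit(predictors, target)
print(export_text(tree, feature_names=list(predictors.columns)))
# A clean split on CHANNEL here shows the values are not missing at random.
```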

Page 88:

Case VIII: Checking Assumptions

• We often make or have assumptions about the data; for example:
• When excluding $0 claims from an analysis, one might assume that these claims are randomly spread – CART can quickly verify this
• Assign $0 claims as "Yes" and the rest as "No"
• When the assumption is valid, expect to get no tree
• Otherwise, the tree will pinpoint exactly how and where the assumption is violated
• This process oftentimes results in unexpected and striking discoveries

Page 89:

References

• Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
• Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth. Reprinted by CRC Press.
• Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19, 1-141 (March).
• Friedman, J. H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.
• Friedman, J. H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.
• Hastie, T., Tibshirani, R., and Friedman, J. H. (2000). The Elements of Statistical Learning. Springer.