Post on 22-Dec-2015
Data Mining Executive Overview
Alan Montgomery
VP Business Development, SPSS
amontgomery@spss.com
“Data mining makes the difference”
Agenda
• What is data mining?
• Who is using data mining, and for what?
• How data mining fits into an IT system
• Some myths about data mining
Information: Internet
• SPSS: http://www.spss.com
• Two Crows Corp (Herb Edelstein): http://www.twocrows.com
• Andy Pryke’s Data Mine: http://www.cs.bham.ac.uk/~anp/TheDataMine.html
• Knowledge Discovery Mine: http://www.kdnuggets.com
Bibliography (by Herb Edelstein)
M. Berry, G. Linoff, Data Mining Techniques, John Wiley, 1997
William S. Cleveland, The Elements of Graphing Data, Hobart Press, 1994
Howard Wainer, Visual Revelations, Copernicus, 1997
R. Kennedy, Lee, Reed, Van Roy, Solving Pattern Recognition Problems, Prentice-Hall, 1998
U. Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, Advances in Knowledge Discovery and Data Mining, MIT Press, 1996
Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999
C. Westphal, T. Blaxton, Data Mining Solutions, John Wiley, 1998
Vasant Dhar, Roger Stein, Seven Methods for Transforming Corporate Data into Business Intelligence, Prentice Hall, 1997
Joseph P. Bigus, Data Mining With Neural Networks, McGraw-Hill, 1996
L. Breiman, Friedman, Olshen, Stone, Classification and Regression Trees, Wadsworth, 1984
J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992
Data holds Knowledge
• Data can hold an organization’s operations history: what we did . . . and what was the outcome
• Can we find which actions gave good (bad) outcomes?
• So learn from our past failures and successes to do better in future.
What we learn from data
• Marketing - who’s likely to buy?
• Forecasts - what demand will we have?
• Loyalty - who’s likely to defect?
• Credit - which loans were profitable?
• Fraud - when did it occur?
In each case can we: find the signs?
. . . find others showing similar signs?
Data mining is natural
• This process is simply “learning from experience”
• It is a totally natural and routine part of every successful business.
• Data mining just helps you do it more quickly, accurately, and systematically.
Winterthur: Customer Loyalty or “Churn”
• Churn is a common data mining issue.
• What’s at stake? Losing car insurance clients at a rate of 13.25% a year ($$$$).
• Business Goal: retain profitable clients.
• Data Mining Goal: predict which clients are likely to resign their policy.
• Winterthur can then take action.
Approach to churn
Select data on customers who resigned
• Divide this sample into:
– a training set to learn from;
– a test set to check the results.
• Compare leavers in training set with similar customers who did not leave.
• Learn the signature of likely churners.
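The split into a training set and a test set can be sketched as below. This is a minimal illustration, not Winterthur's actual procedure: the record layout, the 30% test fraction, and the function name `split_train_test` are all assumptions made for the example.

```python
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and return (training_set, test_set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical customer records: (customer_id, years_as_client, left_company)
customers = [(i, i % 7, i % 5 == 0) for i in range(100)]
train, test = split_train_test(customers)
print(len(train), len(test))
```

The model is built only from `train`; `test` is held back so the results are checked on customers the model has not seen.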
Winterthur Application
• Two complementary approaches
• In both we learn from a training set, and build a model.
1 Classify customers into leavers and non-leavers. Model gives Yes/No answer.
2 Predict “likelihood” of people leaving. Generates a “propensity to leave”, or “score”, for each case. Model gives numeric answer.
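The difference between the two kinds of answer can be shown with a deliberately simple model. The sketch below is an assumption for illustration only: it scores a customer by the fraction of leavers seen in the same (hypothetical) segment of the training set, and it gets the Yes/No answer by thresholding that score. The slides do not say how the real models were built.

```python
from collections import defaultdict

def learn_propensity(training_set):
    """Return {segment: fraction of training customers in that segment who left}."""
    totals = defaultdict(int)
    leavers = defaultdict(int)
    for segment, left in training_set:
        totals[segment] += 1
        leavers[segment] += int(left)
    return {seg: leavers[seg] / totals[seg] for seg in totals}

def score(model, segment):
    """Approach 2: numeric 'propensity to leave' for one customer."""
    return model.get(segment, 0.0)

def classify(model, segment, threshold=0.5):
    """Approach 1: Yes/No answer, here obtained by thresholding the score."""
    return score(model, segment) >= threshold

# Hypothetical training set: (segment, left_company)
training_set = [("young", True), ("young", True), ("young", False),
                ("older", False), ("older", False), ("older", True), ("older", False)]
model = learn_propensity(training_set)
print(score(model, "young"), classify(model, "young"))
print(score(model, "older"), classify(model, "older"))
```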
Winterthur Results
Result on churn classification.
• Achieved > 91.5% accuracy predicting churn (Yes/No) on the test set.
• This was 20% better than next competitor!
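As a sketch of what an accuracy figure measures: it is the share of test-set customers for whom the Yes/No prediction matches what actually happened. The second half of the example is an assumption, not something the slides state: if a test set contained leavers at the same 13.25% rate as the client base, always predicting "stays" would already be right 86.75% of the time, which is why the make-up of the test set matters when reading such a number.

```python
def accuracy(predicted, actual):
    """Fraction of cases where the Yes/No prediction matches the outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions and outcomes for four test-set customers
print(accuracy([True, False, False, False], [True, False, True, False]))

# Assumed scenario: test set with the same churn rate as the client base
churn_rate = 0.1325
always_stays_accuracy = 1 - churn_rate
print(always_stays_accuracy)
```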
Summary: Data Mining
• Data Mining means
• finding patterns in your data
• which you can use
• to do your business better.
• Decisions from data
• It is a completely natural business process
• . . . with a very wide range of applicability.
Applications of Data Mining
Four Case Studies
Reuters
BBC
Halfords
Survey of other users and applications
Reuters: Validating Forex Data
• Reuters gets currency prices from many sources
• May contain errors
• Easy to spot afterwards (spikes, dips)
• Conventional checking systems spot only obvious errors
• What’s at stake?
Reuters’ reputation, therefore sales
Reuters - Validating Forex Data
• Used historical Forex data
• Derived dynamic, time-based descriptors
• Built models (neural networks, rules) to predict price movements
• Report deviations from predictions
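The "predict, then report deviations" idea can be sketched with a much simpler predictor than the neural networks and rules used at Reuters. In this illustration the predicted price is just the median of the last few quotes, and a quote is reported when it differs from that prediction by more than a relative tolerance. The window size, the tolerance, and the sample quotes are all assumptions.

```python
from statistics import median

def flag_deviations(prices, window=3, tolerance=0.01):
    """Return the indices of prices that deviate from the prediction.

    The prediction for each price is the median of the previous `window`
    prices (a stand-in for a trained model).
    """
    flagged = []
    for i in range(window, len(prices)):
        predicted = median(prices[i - window:i])
        if abs(prices[i] - predicted) > tolerance * predicted:
            flagged.append(i)
    return flagged

# Hypothetical quote stream with one spike at index 4
quotes = [1.600, 1.601, 1.602, 1.601, 1.700, 1.602, 1.603]
print(flag_deviations(quotes))  # [4]
```

A median is used rather than a mean so that one bad quote does not distort the predictions for the quotes that follow it.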
BBC TV Audience Prediction
• What’s at stake? Survival of BBC!
• Business goal
– increase audience for TV programs
• Proposed business action
– better scheduling of programs
• Data mining goal
– predict audience share a programme will achieve in a particular slot
BBC Results
• Neural network trained on 1 year’s data
– predicts audience share within 4%
– equals best (> 2 years) human schedulers
• Some problem programmes
– human schedulers had same problems!
• Rules gave insight into “reasons”
• . . . but beware of reasons . . .
Take care with “explanations”
• “Any program (X) which follows a UK “soap” will achieve 6% less share than if X is put anywhere else”
• So UK “soaps” cause audience to turn off ??
• No! The competition is at work!
Halfords - Predicting Sales
• Halfords are a retail organization
• . . . planning to open new stores
• What’s at stake?
$10M investment / store
• Goal: predict sales from a new store
• 500 stores to learn from, many factors:
– site, competition, catchment area, management practice, . . . .
Halfords - Predicting Sales
Clementine models much more accurate than previous statistical models
[Figure: two plots of predicted sales against actual sales, one for the Regression Model (6m) and one for the Clementine Model (3w)]
Who is using data mining?
• Telcos: AT&T, Cable & Wireless, Cellnet, Airtouch Cellular, Singapore Telecoms
• Finance: Reuters, Barclays, National Westminster, Citibank
• Pharmaceutical: Glaxo-Wellcome, Pfizer, Du Pont, Unilever
• Government: HM Customs & Excise, IRS, The Home Office, DERA
• Manufacturing: Daimler Benz, Ford, British Steel, Caterpillar
• Retail: Boots, Tandy, ICL Retail, Halfords
Value of Reducing Attrition by 5%
[Figure: bar chart of increase in profitability (scale 0 to 100) by industry: Auto/Home Insurance, Branch Bank Deposits, Credit Card, Industrial Brokerage, Industrial Distribution, Life Insurance, Publishing, Software]
Based on The Loyalty Effect; Frederick F. Reichheld, Thomas Teal; Harvard Business School Press, 1996
Two Crows Survey Results
[Figure: bar chart of % of respondents (scale 0 to 80) by type of application: Customer profiling, Targeted marketing, Market basket analysis, Attrition management, Fraud detection, Credit risk analysis]
Evolution of Marketing
• Market products to
– Everyone
– Segments
– Customers based on behavior (RFM)
– Customers and non-customers based on demographics and psychographics
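RFM stands for recency, frequency, and monetary value: how recently a customer bought, how often, and how much they spent. A minimal scoring sketch follows; the three-level scale and every threshold in it are illustrative assumptions, not values from the presentation.

```python
def rfm_score(days_since_last_purchase, purchases_last_year, total_spend):
    """Return (recency, frequency, monetary) scores from 1 (worst) to 3 (best).

    All thresholds are hypothetical and would normally be derived from
    the company's own customer data.
    """
    if days_since_last_purchase <= 30:
        recency = 3
    elif days_since_last_purchase <= 90:
        recency = 2
    else:
        recency = 1

    if purchases_last_year >= 10:
        frequency = 3
    elif purchases_last_year >= 3:
        frequency = 2
    else:
        frequency = 1

    if total_spend >= 500:
        monetary = 3
    elif total_spend >= 100:
        monetary = 2
    else:
        monetary = 1

    return recency, frequency, monetary

print(rfm_score(10, 12, 600))  # (3, 3, 3)
print(rfm_score(200, 1, 150))  # (1, 1, 2)
```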
Evolution of Marketing Technology
• Mailing list management
• Ad-hoc segmentation
• RFM
• Statistical selection: clustering, regression, logistic regression, etc.
• Statistical selection: CHAID
• Statistical selection: data mining
Lift
[Figure: number of responses against size of mailing (thousands), for Random and Scored selections]
Lift measures the improvement between two treatments of the data
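One common way to put a number on lift, sketched here under assumptions, is to compare the response rate among the top-scored prospects with the response rate of the whole list (which is what a random selection would achieve on average). The response flags below are made up and are assumed to be sorted from highest to lowest model score.

```python
def cumulative_responses(responded_flags):
    """Running total of responses as the mailing grows (one curve of a lift chart)."""
    total = 0
    curve = []
    for responded in responded_flags:
        total += int(responded)
        curve.append(total)
    return curve

def lift_at(scored_flags, depth):
    """Response rate in the top `depth` scored prospects divided by the overall rate."""
    top_rate = sum(scored_flags[:depth]) / depth
    overall_rate = sum(scored_flags) / len(scored_flags)
    return top_rate / overall_rate

# Hypothetical outcomes (1 = responded), sorted by descending model score
scored = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(cumulative_responses(scored))
print(lift_at(scored, 4))
```

In this made-up list, mailing only the top four prospects reaches three of the four responders, so the scored selection does nearly twice as well as a random one at that depth.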
Return on Investment
[Figure: ROI (scale -40% to 100%) against % of total population (0 to 100), for Random and Scored selections]
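The ROI of a mailing can be worked out as (revenue - cost) / cost. The sketch below assumes a fixed cost per piece mailed and a fixed value per response; both figures, and the response counts, are hypothetical.

```python
def roi(responses, pieces_mailed, cost_per_piece, value_per_response):
    """Return on investment of a mailing: (revenue - cost) / cost."""
    cost = pieces_mailed * cost_per_piece
    revenue = responses * value_per_response
    return (revenue - cost) / cost

# Hypothetical figures: mail only the 4 best-scored prospects (3 respond) ...
print(roi(responses=3, pieces_mailed=4, cost_per_piece=1.0, value_per_response=2.0))
# ... versus mailing all 10 prospects (4 respond)
print(roi(responses=4, pieces_mailed=10, cost_per_piece=1.0, value_per_response=2.0))
```

With these made-up numbers the targeted mailing returns 50% while mailing everyone loses 20%, which illustrates why ROI can fall below zero as a larger share of the population is mailed.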
Typical Applications
• Finance and Financial Services
– Lending risk assessment
– Prediction of customer profitability
– Targeting direct marketing
– Predicting market rates
– Fraud detection
– Calculating insurance claim profiles
• Utilities
– Electricity demand forecasting
– Modeling energy pricing
– Developing control algorithms
• Retail
– “Basket Analysis” (shopping patterns)
– Promotions analysis
– Analysis of personnel data
• Science and Healthcare
– Drug discovery
– Predicting corrosivity of chemicals
– Assessing treatment effectiveness
– Monitoring intensive care patients
– Predicting crop yield from environmental factors
– Choosing dental treatment for children
– Predicting recovery time
– Analysis of child care projects
• Market Research
– Increasing response rates to surveys
– Estimating missing values in data
• Manufacturing/Defence
– Analyzing equipment failures
– Managing spares, warranty claims, recalls
– Quality management
– Supply logistics
Customer relationships
• Forecasting
– what demand will we have?
• Loyalty
– who’s likely to defect?
• Credit analysis
– what loans are the most risky?
• Profit modeling
– which customers generate most, or least, profit?
• Fraud detection
– when did it occur; what were the signs?
– do others show same signs?
Summary
• Data mining has a very broad range of applications
• It is already being used by leading companies in many sectors world-wide
Agenda
• What is data mining?
• Who is using data mining, and for what?
• Systems Architecture for data mining
• Some myths about data mining
Recall the decision-value pyramid
• Data from operational systems: TPS, D/B, Management Reports
• Management information: RD/B, EIS, OLAP
• Knowledge: Data Mining
Axis: Decision Value
“Typical” multi-level IS
• Transaction Databases (Receipts, Orders, Invoices)
– Designed for: short transactions, resilience.
– Big danger: killer SQL query
• Data Warehouse
– Designed for: killer SQL query.
– Big dangers: size? politics? unclean data?
• Data Marts
• Management levels: Operations management, Supervisory Management, Strategy
BI architecture
• Data sources: Data collection software, External data, ERP systems, Other transaction systems, Functional department systems, Legacy databases
• Data preparation: Extract, Cleanse, Manage, Load, Calculate, Enrich, Impute, Transform
• Data storage: Data warehouse, Data marts
• Data analysis & data mining: Reporting, OLAP, Pattern recognition, Exception detection, Segmentation, Classification, Profiling, Scoring, Forecasting, Simulation, Optimization
• Deployment: Web server, Desktop software, Browser, Paper reports
• People: Knowledge workers, Information consumers, Model builders
• Services / Application development / Prototyping
DM in an Information System
• The only requirements for data mining are
– a business problem
– some relevant data
• The data can come from any data source
• . . . or combination of data sources
• Successful data mining requires two viewpoints
– knowledge of the business meaning of the data
– some common-sense analytical knowledge
Data Mining Process in a multi-level IS
• Transaction Databases: Orders, Invoices, Receipts
• Data Warehouse
• Data Marts
• Other data, e.g. geographic, demographic, etc.
• Eureka??
Business intelligence tools: the data “mine”
• Neural networks
• Tree builders, Rule induction
• Statistics
• Data visualisation
• On Line Analytical Processing (OLAP)
• Query, SQL, Spreadsheets
[Diagram labels at one end: Automatic, High dimensionality, Non-linear relations, Highly predictive; at the other end: User driven, Low dimensionality, Little predictive value]
Business intelligence compared
• Query/Reporting
– Validation driven
– Manual
– ‘What were sales of product X in October’
• OLAP
– Visualisation-driven
– Manual
– Dimensions: time, profit, product
– ‘Drill down October Sales of product X at 4% profit level, all regions’
• Data Mining
– Goal-driven
– Automatic
– Goal = ‘significant loss’: ‘If period = week 40 and product = BBQ then profit level = significant loss’
Outputs: Reports & Graphs; Executable Decision Model
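The example rule on this slide is already close to executable form. A minimal sketch of it as a function follows; the function name and the fallback value returned when the rule does not fire are assumptions added for the illustration.

```python
def profit_level(period, product):
    """The slide's example rule written as code.

    Returns 'significant loss' when the rule fires; the 'unknown'
    fallback is an assumption, since the slide shows only one rule.
    """
    if period == "week 40" and product == "BBQ":
        return "significant loss"
    return "unknown"

print(profit_level("week 40", "BBQ"))
print(profit_level("week 41", "BBQ"))
```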
Discovered Knowledge is a non-trivial pattern in data
• classification: these people will buy; those people will not
• association: people who buy beer also buy nuts
• sequence: after marriage, people buy insurance
• clustering/segmentation: health, convenience, luxury food eaters . . .
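An association such as "people who buy beer also buy nuts" is usually judged by two measures: support (the share of all baskets containing both items) and confidence (the share of beer baskets that also contain nuts). The sketch below computes both for a handful of made-up baskets; the data and the function name are assumptions.

```python
def support_and_confidence(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n_antecedent = sum(1 for basket in baskets if antecedent in basket)
    n_both = sum(1 for basket in baskets
                 if antecedent in basket and consequent in basket)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical shopping baskets
baskets = [{"beer", "nuts"}, {"beer", "nuts", "crisps"}, {"beer"}, {"milk"}, {"nuts"}]
support, confidence = support_and_confidence(baskets, "beer", "nuts")
print(support, confidence)
```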
Select appropriate modeling technique
• Classification: categorize your customers or clients
– rule induction, neural networks
• Prediction: forecast future sales or usage
– tree generators, rule induction, neural networks, regression
• Segmentation: group similar customers or clients
– Kohonen networks, rule induction, k-means
• Association: discover products that are purchased together
– web diagrams, Apriori, rule induction
• Sequence: find patterns and trends over time
– trend functions, rule induction, neural networks
Decision models
• The ideal result is actionable knowledge
• … executable software which makes a decision
– market to these people out of the list
– accept/decline this loan application
– predicted revenue from this store is $205M
– weight this premium by -5%
– sales in this area are below par: investigate!
• Models (software agents) can be deployed wherever appropriate in the existing IS
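A decision model of the first kind ("market to these people out of the list") can be as small as a function that takes scored prospects and returns the ones to contact. The sketch below assumes the scores come from a model built earlier; the names, scores, and threshold are hypothetical.

```python
def select_targets(scored_prospects, threshold):
    """Return the names of prospects whose score meets the threshold."""
    return [name for name, score in scored_prospects if score >= threshold]

# Hypothetical (name, propensity-to-buy score) pairs produced by a model
prospects = [("A. Jones", 0.82), ("B. Patel", 0.35), ("C. Lee", 0.61)]
print(select_targets(prospects, threshold=0.6))  # ['A. Jones', 'C. Lee']
```

Because the decision is ordinary code, it can be called from whichever existing system holds the list.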
Models deployed in an IS
Decision models (“agents”) in action
• Transaction Databases: Orders, Invoices, Receipts
• Data Warehouse
• Data Marts
• Reports
• Model used for new process: New product? New promotion?
Warehousing and mining
• Warehouse not required for data mining...
• ... but it is usually an excellent platform
• Warehouse cleans data and solves politics
– mine first, learn what the warehouse should hold
– mine first, use the savings to pay for warehouse!
• Data Warehouse ($0.5-5M): Storage, Management, Organisation, Control
• Data Mining ($30-200K): Discovery, Understanding, Modelling
Data mining is natural
• DM automates the oldest, most natural process: learning from experience
• Finds models of best business practice that can be deployed throughout the enterprise
[Diagram: enterprise learning feedback loop linking Data, Mining, and Deploy models for best practice]
The Vision
decision-enabled enterprises
that continually adapt to
new customer and market situations
Summary of this section
• Data mining automates “learning from experience”
• . . . helps create organizations that adapt
• there is no limit to the number of applications
• only requirement is business problem plus relevant data
• results can be reports, but better as active best practice models learned from data
• models provide benefit only when deployed!
• you don’t need to have a warehouse,
. . . but it can help.
Agenda
• What is data mining?
• Who is using data mining, and for what?
• How data mining fits into an IT system
• Some myths about data mining
Data mining myths
• Myth: “data mining is something algorithms do to large volumes of data; algorithms can discover new knowledge”
• Fact: “Data mining is something people do on their businesses.” High-value results are often obtained with modest amounts of data.
• Myth: Data mining requires a high degree of analytical skills (e.g. a PhD in statistics)
• Fact: The best data miner is someone who knows and understands the business.
Data mining vendors - the myth-makers!
• Vendors position DM to sell their:
– parallel machines or large disks
– expensive parallel algorithms
– dramatic visualisation
– high-power external consulting
• Some problems need these (and their cost); many do not.
Mine data intelligently
• Data mining is not blundering blindly about in data using the most powerful shovel (algorithm).
• Though it is smart to have a lot of quality tools (algorithms) available.
• Contrast:
– hydraulic mining by washing away mountains
– mining by intelligent prospecting
Good Data Mining is:
. . . “intelligent prospecting”
• decide what you are looking for first,
• then apply knowledge (cf. geology, mineralogy..),
• then take samples,
• assay the results from the samples,
• finally mine.
Good Data Mining is:
. . . best with known business problem / opportunity patterns to learn from
(known buyers, bad debts, fraud cases, good promotions, profitable lines . . .)
• This determines:
– business goals and goal variables,
– data that is rich in information for this problem,
– a suggested analysis strategy
Understand the Business Problem First
[Diagram labels: Business problem (Increase revenue, Improve processes); What you know; Data; Clustering (C1, C2); Insight; $; ?]
DM rarely requires massive data during the prospecting phase
Case of the mysterious disappearing Terabytes
• “Can Clementine handle our data base? We have 3Tb going back 20 years, 17M clients.”
• “Probably, tell us what you want to investigate.”
• “Account closure patterns, to reduce churn”
• “How many occur each month?” (1700) 10^-4
• “What’s important?” (age, marriage, . . . . ) 10^-5
• “When did you start saving this?” (2 years ago) 10^-6
• “When do closure signs begin?” (3 months) 10^-7
Winterthur Result
Recall the Winterthur “churn” problem
• Result on churn classification.
• Achieved > 91.5% accuracy predicting churn (Yes/No) on the (unseen) test set.
• This was 20% better than the next competitor! (SAS EM, IBM IM, HNC, Thinking Machines Inc.)
Halfords - Predicting Sales
Recall the store sales prediction result
[Figure: two panels, "Regression Model (6 months)" and "Clementine Model (3 weeks)"; axes: predicted sales vs. actual sales]
The data is not the business
Business Data
Name       Age  Income  Mar/Sin/Div  Car  CCard  Purch  Val    LastPurch  Children  Source
F. Bloggs  25   25000   Single       Yes  M/C    5      23.5   34         0         L1
J. Smith   37   33000   Mar.         Yes  VISA   3      123.4  102        2         L2
J. Dow     45   40000   Div.         No   VISA   12     15.2   48         1         L1
The Business
Business deals with the real world
• Most of what is interesting to business is fuzzy - customers, customers’ behaviour
• Hard to give a numeric value.
• Business/market people know strengths and weaknesses in the data
• Garbage (or bias) in = garbage (or bias) out.
What’s in the chasm?
• Business knowledge that’s in your head (or in the library, or in another department)
• Data we aren’t yet using, e.g. MR data.
• E.g. a company launched a new product; either:
– 90% of our non-buyers are close to buying, or
– 90% of our non-buyers will never buy
• Same transaction data, but dramatically different prospects
Business knowledge
• Which factors are relevant?
– quality/blend of raw materials
– time of year / weather
• Maybe key predictors must be derived
– a sum: household income,
– a trend: rate of sales decrease
– a ratio: sales/sq ft.
• Business/Market knowledge is the key
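The three kinds of derived predictor listed above (a sum, a trend, a ratio) can be sketched in a few lines. This is a minimal illustration, not a description of how Clementine derives fields: the function names and all sample figures are hypothetical, and a least-squares slope is just one reasonable way to measure a trend.

```python
def household_income(incomes):
    """A sum: combine the individual incomes within one household."""
    return sum(incomes)


def sales_trend(sales_by_period):
    """A trend: least-squares slope of sales per period.

    A negative value means sales are decreasing. Needs at least two periods.
    """
    n = len(sales_by_period)
    mean_x = (n - 1) / 2  # periods are numbered 0, 1, ..., n-1
    mean_y = sum(sales_by_period) / n
    numerator = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(sales_by_period)
    )
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def sales_per_sq_ft(sales, floor_area_sq_ft):
    """A ratio: sales normalised by store floor area."""
    return sales / floor_area_sq_ft


# Made-up figures, for illustration only.
print(household_income([25000, 33000]))
print(sales_trend([100, 90, 80, 70]))
print(sales_per_sq_ft(50000, 2000))
```

None of these values exists as a column in the raw data; someone who knows the business has to decide that they are worth computing.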
Halfords’ application
Higher accuracy than previous statistical models.
Why?
[Figure: two panels of predicted sales vs. actual sales. Regression (6 months), external statistics company; Clementine (3 weeks), in-house business manager]
1 Split into train and test data
2 Train models
3 Test the models
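The split, train and test steps can be sketched with a toy model. Everything here is an assumption made for illustration: the synthetic records, the 30% test fraction, and the one-threshold rule standing in for a real data mining algorithm. The point is only that accuracy is measured on data the model never saw during training.

```python
import random


def split(records, test_fraction=0.3, seed=0):
    """Step 1: shuffle and split the records into train and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(threshold, records):
    """Share of records where the rule 'value >= threshold' matches the label."""
    hits = sum((value >= threshold) == label for value, label in records)
    return hits / len(records)


def train_stump(train):
    """Step 2: pick the threshold with the best accuracy on the training set."""
    candidates = sorted({value for value, _ in train})
    return max(candidates, key=lambda t: accuracy(t, train))


# Synthetic (value, label) records: the label is True when value >= 6.
records = [(value, value >= 6) for value in range(12)] * 5

train, test = split(records)
threshold = train_stump(train)

# Step 3: test the model on the held-out records only.
print("threshold:", threshold)
print("test accuracy:", accuracy(threshold, test))
```

Keeping the test records out of training is what makes a figure such as accuracy on the unseen test set meaningful.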
Rationale for Clementine™
• Algorithms have no business knowledge or common sense
• Need to use algorithms alongside business/market expertise
• DM is a creative/discovery process. We need fluency to follow a train of thought (hunches).
• Hunching is hard if the business user must keep telling a technology expert what to do.
Clementine objectives
• A data mining system which users can drive themselves
• Many fully-packaged algorithms (no one silver bullet)
• Can follow up clues discovered in the data
• Easy to input own ideas / knowledge
• As easy as a spreadsheet
SPSS’ data mining workbench of the future
[Diagram: three layers. User interface: Clementine. Algorithms: Clementine, SPSS, other algorithms. Infrastructure: scalable architecture, common deployment vehicles.]
Data mining: decisions from data to do your business better
[Diagram labels: Business problem; What you know; Data; Clustering (C1, C2); Insight; Increase revenue; Improve processes]