Science in Business Data Mining?
Background: support managerial decision making
Is there a science to data mining (with CI-methods)?
Outline
1. Data Mining in Business & Management
2. Rules established in Business practices vs. Data mining?
   1. Statistics vs. Data driven modelling
   2. A personal view
3. How to develop meta-knowledge
Sven F. Crone, Lancaster University Management School, Research Centre for Forecasting
YES, but it depends
(and it may be empirical wizardry driven by efficiency rather than effectiveness!)
Business Data Mining?
Main areas for Data Mining:
• Finance: Credit risk (personal & corporate)
• Marketing: Customer Relationship Management (= Direct Marketing, Database Marketing)
[Figure: customer lifecycle and the related forecasting / data mining tasks. Customer states: Prospect, New Customer, Initial Customer, Established Customer (High value, High potential, Low value), Former Customer (Voluntary Churn, Forced Churn). Stages: Acquisition, Activation, Relationship, Retention, Resignation. Demand levels: Target Market, Aggregate Demand, Individual Demand. Tasks: Market Experiments, Intentions, Adoption, Marketing Response, Extrapolative Forecasting (incl. Judgement), Churn Prediction, Credit Scoring, Direct Marketing. Adapted from Berry and Linoff (2004) and Olafson et al (2006).]
Best practices
Credit Scoring
• Small & balanced classes
  • Use 2000 of minority class
  • Use undersampling
• Discretise all (!) variables
  • Binary dummies / WOE to capture non-linearity
• Use Logistic regression
Cross-Selling
• Large & imbalanced sample
  • Use large sample sizes
  • Original (imbalanced) class distribution
• …
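The credit scoring steps above (undersample to balanced classes, discretise, recode bins as WOE) can be sketched as follows. This is a minimal illustration and not the procedure used in the talk: the field names `age_band` and `bad`, the toy records, the good/bad sign convention and the 0.5 smoothing term are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def undersample(rows, label_key="bad", seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    bads = [r for r in rows if r[label_key] == 1]
    goods = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted((bads, goods), key=len)
    return minority + rng.sample(majority, len(minority))


def weight_of_evidence(rows, bin_key, label_key="bad"):
    """WOE per bin: ln(share of goods in the bin / share of bads in the bin)."""
    goods = Counter(r[bin_key] for r in rows if r[label_key] == 0)
    bads = Counter(r[bin_key] for r in rows if r[label_key] == 1)
    n_good, n_bad = sum(goods.values()), sum(bads.values())
    woe = {}
    for b in set(goods) | set(bads):
        # Assumed smoothing: add 0.5 so an empty cell never yields log(0).
        good_share = (goods[b] + 0.5) / (n_good + 0.5)
        bad_share = (bads[b] + 0.5) / (n_bad + 0.5)
        woe[b] = math.log(good_share / bad_share)
    return woe


# Hypothetical toy records: one already-discretised variable and a bad flag.
rows = (
    [{"age_band": "young", "bad": 1}] * 3
    + [{"age_band": "old", "bad": 1}] * 1
    + [{"age_band": "young", "bad": 0}] * 4
    + [{"age_band": "old", "bad": 0}] * 12
)

balanced = undersample(rows)                # 4 bads + 4 sampled goods
woe = weight_of_evidence(rows, "age_band")  # one WOE value per bin
```

The WOE values would then replace the raw variable as input to a logistic regression; positive values mark bins with relatively more goods under the convention assumed here.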
A personal view:
• Data selection is best using prior domain knowledge (use filters)
• Pre-processing more important than method [Crone et al, 2006; Keogh 2002]
• (Balanced) sampling & pre-processing is method dependent
• Best practices exist & are domain dependent (e.g. homogeneous datasets in credit scoring)
• Flat Maximum effect [Lovie & Lovie, 1986]
GAP
Extensive use of expert domain knowledge: efficient solution ≠ best
Practitioners & Consultants use statistics
How to derive (meta-)knowledge?
Lessons from other disciplines: Time Series Forecasting
• More 'evidence based methods' [Armstrong 2000]
Empirical Evidence
• Conditions under which methods perform well (multiple hypotheses)
• Domain specific competitions (valid & reliable)
  • Multiple out-of-sample evaluations (≠ single fold, one origin)
  • Multiple homogeneous datasets from one domain
  • Use of valid benchmark methods & unbiased error measures
  • Honour the domain & decision context (active learning, cost sensitive)
Replications
• Studies must allow replications: document all steps / parameters
STOP FINE-TUNING / MARGINAL EXTENSION OF SINGLE METHOD ON SINGLE TOY DATASET
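The call for multiple out-of-sample evaluations against a valid benchmark can be illustrated with a rolling-origin evaluation. This is a sketch under stated assumptions, not a method from the talk: the toy series, the choice of mean absolute error, the naive benchmark and the hypothetical `mean_forecast` candidate are all made up for the example.

```python
from statistics import mean


def naive_forecast(history, horizon):
    """Benchmark: repeat the last observed value for every step ahead."""
    return [history[-1]] * horizon


def mean_forecast(history, horizon):
    """Hypothetical candidate method: forecast the mean of the history."""
    return [mean(history)] * horizon


def rolling_origin_mae(series, forecaster, first_origin, horizon=1):
    """Mean absolute error pooled over every forecast origin, not just one."""
    errors = []
    for origin in range(first_origin, len(series) - horizon + 1):
        history = series[:origin]                  # data available at the origin
        actual = series[origin:origin + horizon]   # held-out future values
        predicted = forecaster(history, horizon)
        errors.extend(abs(a - p) for a, p in zip(actual, predicted))
    return mean(errors)


series = [10, 12, 11, 13, 12, 14, 13, 15]  # made-up toy series
naive_mae = rolling_origin_mae(series, naive_forecast, first_origin=4)
candidate_mae = rolling_origin_mae(series, mean_forecast, first_origin=4)
```

Comparing `candidate_mae` with `naive_mae` over several origins, and ideally over several homogeneous series from one domain, is what separates this from a single-fold, single-origin result.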
Develop solutions for domain (Why make life harder?)
Where to start? Follow high impact approach!
• Identify most prominent application domains (e.g. credit risk)
• Select promising application domains for CI-methods
• Get corporate sponsor & run competition
• Analyse conditions (!) using meta-studies!
• Embed findings as methodology in SOFTWARE
Literature
• Ian Ayres (2007) Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart, Bantam
• Thomas H. Davenport, Jeanne G. Harris (2007) Competing on Analytics: The New Science of Winning, Harvard Business School Press
• Fildes, Nikolopoulos, Crone, Syntetos (2009) Forecasting and Operational Research: a Review, JORS, forthcoming
• Finlay, Crone (under review) Sampling issues in Credit Scoring: the effect of sample size and sample distribution on predictive accuracy, EJOR
• Keogh, Kasetty (2002, 2004) On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration, SIGKDD'02 & Data Mining Journal