Copyright © 2006, SAS Institute Inc. All rights reserved.
Everything you ever wanted to know about data mining but were afraid to ask.
David Duling
Data Mining R&D Director
SAS Institute
Copyright © 2006, SAS Institute Inc. All rights reserved. Company confidential - for internal use only
Abstract
Data mining is the process of systematically sifting through often large databases to identify patterns and trends relevant to solving business problems such as increasing sales and efficiency. Successful examples of data mining can be found in many areas of business and science including customer relations management, market basket analysis, human resources, bio-informatics and medicine, fraud detection, and searching the web. The rapid growth in data mining is largely due to the increased availability of large databases, advances in large scale computing and development of data mining algorithms. This presentation will begin with a brief history of data mining, cover current trends with case studies and finish with a look into the future of data mining from the SAS perspective.
Q: Where did data mining start?
1841
• Lewis Tappan formed the Mercantile Agency to provide creditworthiness reports (i.e., credit scores) to New York merchants.
• He employed a number of ‘correspondents’ in western frontier towns to monitor the behavior of local traders, in addition to pooling merchant records. A huge database was accumulated.
• Enormous success!
1849
• John Bradstreet starts a credit reporting company in Cincinnati, OH.
1859
• The Mercantile Agency was sold to Robert Graham Dun.
1933
• The Dun company merged with the Bradstreet company.
Q: How about something with computers?
Date: September 1963
Authors: James Myers and Edward Forgy
Title: The Development of Numerical Credit Evaluation Systems
Publication: Journal of the American Statistical Association
Abstract
Several discriminant and multiple regression analyses were performed on retail credit application data to develop a numerical scoring system for predicting credit risk in a finance company. Results showed that equal weights for all significantly predictive items were as effective as weights from the more sophisticated techniques of discriminant analysis and "stepwise multiple regression." However, a variation of the basic discriminant analysis produced a better separation of groups at the lower score levels, where more potential losses could be eliminated with a minimum cost of potentially good accounts.
Data Mining, circa 1963
IBM 7090
600 applicants
“Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”
Q: That sounds like statistics, so what’s the difference?
Statistics
Experimental
Prior Hypothesis
• Idea before data acquisition
• Data acquisition planned
Experimental Design
• Sampling strategies
• Factorial designs
• Required confidence
• Minimize model terms
Inference
• Hypothesis testing
• Prediction
Data mining
Commercial
Posterior Hypothesis
• Idea after data acquisition
• Data acquisition opportunistic
No Experimental Design
• Explore data
• Create hypothesis
• Generate query
• Create models
Prediction
• Lift, Profit, Response
• Inference
It’s all about the Data
             Experimental          Opportunistic
Purpose      Research              Operational
Value        Scientific            Commercial
Generation   Actively controlled   Passively observed
Size         Small                 Massive
Hygiene      Clean                 Dirty
State        Static                Dynamic
Where does mining data come from?
Continent
  Continent_Id: INTEGER
  Continent_Name: CHARACTER(30)
Country
  Country: CHARACTER(2)
  Country_Name: CHARACTER(45); Population: NUMERIC(6); Office: CHARACTER(2); Dir: CHARACTER(3); Country_Id: INTEGER; Continent_Id: INTEGER (FK); Country_Former_Nam: CHARACTER(45)
Customer
  Customer_Id: INTEGER
  Country: CHARACTER(2); Gender: CHARACTER(1); Personal_Id: CHARACTER(15); Customer_Name: CHARACTER(40); Customer_Firstname: CHARACTER(20); Customer_Lastname: CHARACTER(30); Birthday: DATE; Customer_Address: CHARACTER(40); Street_Id: INTEGER (FK); Street_Number: CHARACTER(8); Customer_Type_Id: INTEGER (FK)
Customer_Type
  Customer_Type_Id: INTEGER
  Customer_Type: CHARACTER(40); Customer_Group_Id: INTEGER; Customer_Group: CHARACTER(40)
Order
  Order_Id: INTEGER
  Employee_Id: INTEGER (FK); Customer_Id: INTEGER (FK); Order_Date: DATE; Delivery_Date: DATE; Order_Type: INTEGER
Order_Item
  Order_Id: INTEGER (FK); Order_Item_No: INTEGER
  Product_Id: INTEGER (FK); Amount: SMALLINT; Price: DECIMAL(12,2); Unit_Cost_Price: DECIMAL(12,2); Promotion: DECIMAL(5,2)
Organization
  Employee_Id: INTEGER
  Org_Name: CHARACTER(40); Country: CHARACTER(2); Org_Level_Id: INTEGER (FK); Start_Date: DATE; End_Date: DATE; Org_Ref_Id: INTEGER (FK)
Org_Level
  Org_Level_Id: INTEGER
  Org_Text: CHARACTER(40)
Price_List
  Product_Id: INTEGER (FK); Start_Date: DATE
  End_Date: DATE; Unit_Cost_Price: DECIMAL(12,2); Unit_Sales_Price: DECIMAL(12,2)
Product
  Product_Id: INTEGER
  Product_Name: CHARACTER(45); Supplier_Id: INTEGER (FK); Product_Level_Id: INTEGER (FK); Product_Ref_Id: INTEGER (FK); Product_Level: NUMERIC(3)
Street_Code
  Street_Id: INTEGER
  Country: CHARACTER(2); Street_Name: CHARACTER(30); Zip_Code: CHARACTER(10); From_Street_No: NUMERIC(8); To_Street_No: NUMERIC(8); City: CHARACTER(22); County: CHARACTER(25); State: CHARACTER(2); City_Id: INTEGER; State_Id: INTEGER; County_Id: INTEGER (FK); Zip_Id: INTEGER (FK)
Supplier
  Supplier_Id: INTEGER
  Supplier_Name: CHARACTER(30); Street_Id: INTEGER (FK); Supplier_Address: CHARACTER(30); Supplier_Street_Nu: NUMERIC(3); Country: CHARACTER(2)
Product_Level
  Product_Level_Id: INTEGER
  Product_Level_Name: CHARACTER(30)
Staff
  Employee_Id: INTEGER (FK); Start_Date: DATE
  Salary: DECIMAL(12,2); Birthday: DATE; End_Date: DATE; Emp_Hire_Date: DATE; Gender: CHARACTER(1); Emp_Term_Date: DATE; Job_Title: CHARACTER(25)
Promotion
  Product_Id: INTEGER (FK); Start_Date: DATE
  End_Date: DATE; Sales_Price: DECIMAL(12,2); Promotion: DECIMAL(5,2)
County
  County_Id: INTEGER
  County_Type: INTEGER (FK); County_Name: CHARACTER(30); Region_Id: INTEGER (FK)
Region
  Region_Id: INTEGER
  Region_Type: INTEGER (FK); Region_Name: CHARACTER(30); State_Id: INTEGER (FK)
State
  State_Id: INTEGER
  State_Type: CHARACTER(30); State_Name: CHARACTER(30); Country: CHARACTER(2) (FK); Geo_Type_Id: INTEGER (FK)
Geo_Type
  Geo_Type_Id: INTEGER
  Geo_Type_Name: CHARACTER(30)
Zip_Code
  Zip_Id: INTEGER
  City_Name: CHARACTER(30); Zipcode: CHARACTER(18); City_Id: INTEGER (FK)
City
  City_Id: INTEGER
  City_Name: CHARACTER(30)
Data Warehouses store detail data on transactions and states
Very simple demo example
Data: majority of time spent on mining
Intelligent Enterprise Magazine
http://www.intelligententerprise.com/030405/606feat2_1.jhtml
Each data type has its own mix of data prep and mining
Market baskets, Item sets
Market baskets with time order
Web paths: unique sequences
Time stamped transactions
Text normalization
Demographics, Personal Information
Use integrated data for mining
ID columns
Web paths
Market baskets
Seasonal indices
Demographic, Financial
Text dimensions
Interactions
Example Quest: Maximize response to this year’s summer promotion
How: Find those customers most likely to respond
Use response to last year’s summer promotion as indicator of response to this year’s promotion. This is the dependent variable.
Use all customer data available before last summer. These are the independent variables.
• Demographics
• Sales item history
• Sales amount history
• Web site history
• Call center records
• …
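A minimal sketch of how such a modeling table might be assembled: inputs are aggregated only from records dated before last summer, and the target is last year's response. The field names, the cutoff date, and the sample records are all hypothetical and chosen only for illustration.

```python
from datetime import date

# Assumed start of last year's summer promotion (illustrative value).
CUTOFF = date(2005, 6, 1)

# Hypothetical purchase history; not taken from the presentation.
purchases = [
    {"customer_id": 1, "date": date(2005, 3, 14), "amount": 40.0},
    {"customer_id": 1, "date": date(2005, 7, 2), "amount": 25.0},   # after cutoff: excluded
    {"customer_id": 2, "date": date(2004, 11, 30), "amount": 15.0},
    {"customer_id": 3, "date": date(2005, 5, 20), "amount": 60.0},
]

# Dependent variable: 1 if the customer responded to last year's promotion.
responded_last_summer = {1: 1, 2: 0, 3: 1}


def build_training_rows(purchases, responses, cutoff):
    """Return one row per customer: inputs from before the cutoff, plus the target."""
    rows = {
        cid: {"customer_id": cid, "n_orders": 0, "total_amount": 0.0, "target": y}
        for cid, y in responses.items()
    }
    for p in purchases:
        # Only data available before the promotion may be used as an input.
        if p["date"] < cutoff and p["customer_id"] in rows:
            row = rows[p["customer_id"]]
            row["n_orders"] += 1
            row["total_amount"] += p["amount"]
    return [rows[cid] for cid in sorted(rows)]


training_rows = build_training_rows(purchases, responded_last_summer, CUTOFF)
```

Filtering on the cutoff keeps information from after the promotion out of the inputs, so the model only sees what would have been known at decision time.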
Q: Why don’t we just select last year’s customer lists?
Data is non-stationary (remember this point)
• Move to new locations
• Change jobs
• Income goes up or down
• Debt increases or decreases
• Marital status
• Parental status
• …
The model is a function of the attributes, not the individuals
Q: What Functions are Popular?
Associations          unsupervised
Clustering            unsupervised
PCA/SVD               unsupervised
Logistic Regression   supervised
Decision Tree         supervised
Neural Network        supervised
Ensembles             supervised
… and many other forms, variants, and names
How do I find Patterns? Try ASSOCIATIONS and SEQUENCES
Searches for frequent patterns
(Car Wreck → Dr. X) (Diagnosis Code xxx → MRI)
Confidence:
If (A) happens then (B) happens 80% of the time
C = count(A and B) / count(A)
Support:
(A) → (B) happens in 10% of all itemsets
S = count(A and B) / N, where N is the total number of itemsets
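The two measures can be sketched in a few lines. This is an illustration only: the function name and the sample baskets are made up, and a real association miner would search over many candidate rules and not evaluate a single one.

```python
def support_and_confidence(itemsets, a, b):
    """Support and confidence of the rule (a) -> (b) over a list of itemsets."""
    n = len(itemsets)
    count_a = sum(1 for s in itemsets if a in s)
    count_ab = sum(1 for s in itemsets if a in s and b in s)
    support = count_ab / n if n else 0.0                   # S = count(A and B) / N
    confidence = count_ab / count_a if count_a else 0.0    # C = count(A and B) / count(A)
    return support, confidence


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

support, confidence = support_and_confidence(baskets, "bread", "milk")
```

Here bread appears in four of the five baskets and bread with milk in three, so the rule bread → milk has support 3/5 and confidence 3/4.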
Rule Sets show the next most likely action
Associations and Sequences / Visualization
What is the Most Popular Data Mining Function?
[Figure: Prob plotted over two Input axes]
- Logistic Regression Still Rules!
- Linear combination of terms (z)
- Relatively easy to compute
- Converges to a solution
- Explainable
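As a sketch of the scoring function, z is a weighted sum of the inputs and the logistic function maps it to a probability. The weights and inputs below are arbitrary illustrative numbers, not fitted values.

```python
import math


def logistic_probability(inputs, weights, intercept):
    """Map a linear combination of terms z to a probability in (0, 1)."""
    z = intercept + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative values: z = 0.5 * 1.0 + (-0.25) * 2.0 = 0, so the probability is 0.5.
p = logistic_probability([1.0, 2.0], [0.5, -0.25], 0.0)
```

Because z is linear in the inputs, each weight can be read as the change in log-odds per unit change of its input, which is what makes the model explainable.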
Q: What is CART and why do I need it?
A Decision Tree!
Classification and Regression
Strategy Development
Hunt 1966 Concept Learning System
Kass 1980 Chi-squared Automatic Interaction Detection
Breiman 1984 Classification and Regression Trees
Quinlan 1993 C4.5 rule sets
Numerous others…
• Algorithms for efficiently building trees
• Hypothesis tests for finding split points
− Various measurement scales
Building a Decision Tree
Keep doing that until there are no more beneficial splits...
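A minimal sketch of the idea for one numeric input and a 0/1 target: find the split that most reduces impurity, then repeat on each side until no split helps. Gini impurity is used here as an assumption (it is the criterion commonly associated with CART); the slides also mention hypothesis tests such as chi-squared for choosing split points. The function names and sample data are illustrative.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(xs, ys):
    """Return (threshold, impurity decrease) for the best split x <= threshold, or None."""
    parent = gini(ys)
    n = len(ys)
    best = None
    for t in sorted(set(xs))[:-1]:   # the largest value cannot separate anything
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > 1e-12 and (best is None or gain > best[1]):
            best = (t, gain)
    return best


def build_tree(xs, ys):
    """Recursively partition until there are no more beneficial splits."""
    split = best_split(xs, ys)
    if split is None:
        return {"leaf": True, "prob": sum(ys) / len(ys)}
    t = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {
        "leaf": False,
        "threshold": t,
        "left": build_tree([x for x, _ in left], [y for _, y in left]),
        "right": build_tree([x for x, _ in right], [y for _, y in right]),
    }


# Hypothetical data: age versus response.
ages = [22, 25, 31, 38, 45, 52]
responded = [0, 0, 0, 1, 1, 1]
tree = build_tree(ages, responded)
```

Growing until no split helps tends to fit noise, which is why validation-based pruning is used in practice.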
Recursive Partitioning
Benefits of Trees
Interpretability
• Tree structured presentation
Mixed Measurement Scales
• Nominal, ordinal, interval
• Regression trees
Robustness
Missing Values
…Benefits
Automatically
• Detects interactions (AID)
• Accommodates non-linearity
• Selects input variables
[Figure: Multivariate Step Function, Prob plotted over two Input axes]
Drawbacks of Trees
Roughness
Linear, Main Effects
Instability
Q: Why do they call it ‘Neural’ network?
Neuron
Hidden Unit
Feed Forward Neural Network
Hidden Layers
Output Layer
Input Layer
How does it work?
C = combination ( Weights * Inputs )
A = Activation ( C )
f(W,I) = A[C] + b -> output
…
y ~ f(W,(s,t))
s = f(S,(p,q,r))
t = f(T,(p,q,r))
p = f(P,X)
q = f(Q,X)
r = f(R,X)
…
y ~ f(W,(f(S,(f(P,X), f(Q,X), f(R,X))), f(T,(f(P,X), f(Q,X), f(R,X)))))
Err = E(Y,y) ~ (Y - y)^2
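The nested functions can be sketched as a forward pass through three units (p, q, r), then two units (s, t), then the output y. The tanh activation and all weight values are assumptions made for illustration. The bias is added after the activation to follow the formula as written on the slide, whereas many implementations add it inside the combination instead.

```python
import math


def unit(weights, bias, inputs):
    """One unit: f(W, I) = A[C] + b, with tanh as the assumed activation."""
    c = sum(w * i for w, i in zip(weights, inputs))   # C = combination(Weights * Inputs)
    a = math.tanh(c)                                   # A = Activation(C)
    return a + bias


def forward(x, params):
    """y ~ f(W, (s, t)) with s, t built from p, q, r, which are built from X."""
    p, q, r = (unit(*params[name], x) for name in ("P", "Q", "R"))
    s, t = (unit(*params[name], (p, q, r)) for name in ("S", "T"))
    return unit(*params["W"], (s, t))


def squared_error(target, output):
    """Err = (Y - y)^2 for a single case."""
    return (target - output) ** 2


# Illustrative (weights, bias) pairs for each unit; not fitted values.
params = {
    "P": ([0.5, -0.2], 0.0), "Q": ([0.1, 0.3], 0.0), "R": ([-0.4, 0.2], 0.0),
    "S": ([0.3, 0.3, 0.3], 0.0), "T": ([-0.2, 0.5, 0.1], 0.0),
    "W": ([0.7, -0.6], 0.1),
}

y = forward([1.0, 2.0], params)
err = squared_error(1.0, y)
```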
Activation Function
[Figure: activation function plotted against the Layer Input]
Training
[Figure: error surface over Parameter 1 and Parameter 2]
Error Function
Iterative Optimization Algorithm
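A minimal sketch of iterative optimization on a two-parameter error function. Plain gradient descent on the mean squared error of a straight line is used here as an assumption; the slides do not say which optimizer is used. The data, learning rate, and iteration count are illustrative.

```python
def train(xs, ys, learning_rate=0.1, iterations=500):
    """Fit y ~ w * x + b by gradient descent; return the parameters and error history."""
    w, b = 0.0, 0.0          # Parameter 1 and Parameter 2
    n = len(xs)
    history = []
    for _ in range(iterations):
        preds = [w * x + b for x in xs]
        error = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n   # error function
        history.append(error)
        # Gradient of the mean squared error with respect to each parameter.
        grad_w = sum(-2.0 * (y - p) * x for x, y, p in zip(xs, ys, preds)) / n
        grad_b = sum(-2.0 * (y - p) for y, p in zip(ys, preds)) / n
        # Step downhill on the error surface.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


# Illustrative data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, history = train(xs, ys)
```

The `history` list plays the role of a training history: one error value per iteration.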
Training history for our example
Error measure goes down with every iteration.
Weights evolve at every iteration
Neural Pros and Cons
Very Flexible functions
Implicit transformation and interactions
Good algorithms for controlling complexity
No inference
Complex function
Many possible networks – large search space
Q: How do I know the model will work on new data?
Make sure that you don’t have a perfect model!
• Real data has multiple forms of the dependent variable effect
Limit exposure to data that changes over time
• Examine distributions of data at several time points
• Select stable data
• Use standardizations
• Use category=other
Backtest
• Use a hold out sample from a later time period
Monitor Performance
• Compare actual and expected results
• Compare input term distributions
Don’t fit the noise.
Q: How do I model signal instead of noise?
Limit model complexity by using Validation Data.
Decision Tree: Pruning
Neural Network: Early Stopping
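Early stopping can be sketched as a rule applied to the validation error recorded after each training iteration: keep the iteration with the lowest validation error and stop once it has not improved for a few iterations. The `patience` parameter and the error values below are made up for illustration.

```python
def early_stop(validation_errors, patience=2):
    """Return (best_iteration, stopped_at) for a sequence of validation errors."""
    best_iter, best_err = 0, float("inf")
    for i, err in enumerate(validation_errors):
        if err < best_err:
            best_iter, best_err = i, err
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop and keep the best model.
            return best_iter, i
    return best_iter, len(validation_errors) - 1


# Hypothetical curves: training error keeps falling, validation error turns upward.
training_errors = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
validation_errors = [0.95, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.67]

best_iteration, stopped_at = early_stop(validation_errors, patience=2)
```

In this example the weights from iteration 3 would be kept, even though the training error continues to fall afterwards. Tree pruning applies the same idea by choosing the subtree with the best validation performance.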
Complexity -> Overfitting
[Figure panels: Training Set, Test Set]
Better Fitting … with a simpler model
[Figure panels: Training Set, Test Set]
How do I select the best model?
ROC for overall model performance:
Decision Tree
Lift for targeted model performance:
Neural Network
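Both measures can be computed from scored cases with known outcomes. In this sketch the area under the ROC curve summarizes overall ranking quality, and lift compares the response rate in the top-scored cases with the overall rate. The scores and labels are hypothetical, and the function names are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve: chance that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift(scores, labels, top_n):
    """Response rate among the top_n highest scores divided by the overall response rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Hypothetical model scores and actual responses for ten cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

overall = auc(scores, labels)
top_two_lift = lift(scores, labels, top_n=2)
```

A model with a lower overall AUC can still be the better choice for a campaign that only contacts the top-scored cases if its lift in that range is higher.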
Computers keep getting so much faster, why does my neural network take so long to run?
The enemy: growth in data warehouses
In the aggregate, the 2001 survey pool reported 632 TB of storage. Just two years later, those surveyed were using almost 2 petabytes (2,000 TB) of storage. Based on the number of survey respondents, the average large database — whether used for decision support or transaction processing — increased its storage requirements three and one-half times in just two years.
Source: DM-REVIEW
Single disk size growth
Clock speed vs. disk sizes
Even worse, it’s all about complexity of data
1M rows x 100 columns x 8 bytes = 800MB
1000 rows x 1000 columns x 8 bytes = 8MB
Which data is more complex?
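The arithmetic can be checked directly. As one possible reading of the question (an assumption, not something the slide states), the sketch also counts the number of column pairs, a rough proxy for how many two-way interactions a model could consider.

```python
from math import comb


def table_stats(rows, columns, bytes_per_value=8):
    """Storage size in bytes and number of distinct column pairs for a dense table."""
    return {
        "bytes": rows * columns * bytes_per_value,
        "column_pairs": comb(columns, 2),   # assumed proxy for complexity
    }


tall = table_stats(1_000_000, 100)   # 1M rows x 100 columns
wide = table_stats(1_000, 1_000)     # 1000 rows x 1000 columns
```

By this measure the 8MB table offers about a hundred times more column pairs to examine than the 800MB table (499,500 versus 4,950), with far fewer rows to estimate each one.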
Q: So how does this make me money?
Models get deployed to operational systems
• New data is acquired
• Each case is scored with the model function
• Action taken on each case:
− Send promotion or don’t send promotion
− Select item for cross sell offer
− Grant credit or don’t grant credit
− Alert engineers that a manufacturing defect has been found.
Model driven decisions are nearly always better than intuition
…iff… the data miner has accounted for enough sources of variation.
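A sketch of the deployment loop: each new case is scored with the model function and the score is translated into an action. The logistic model form, the weights, the 0.5 threshold, and the case fields are all assumptions chosen for illustration.

```python
import math


def score(case, weights, intercept):
    """Score one case with a logistic model function (assumed model form)."""
    z = intercept + sum(weights[name] * case[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def decide(probability, threshold=0.5):
    """Translate a score into an action."""
    return "send promotion" if probability >= threshold else "don't send promotion"


# Illustrative model parameters and newly acquired cases.
weights = {"orders_last_year": 0.8, "web_visits": 0.1}
intercept = -2.0
new_cases = [
    {"customer_id": 101, "orders_last_year": 4, "web_visits": 10},
    {"customer_id": 102, "orders_last_year": 0, "web_visits": 1},
]

actions = {c["customer_id"]: decide(score(c, weights, intercept)) for c in new_cases}
```

In practice the threshold would be set from the profit and cost of each action, not fixed at 0.5.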
Offline Applications
• ETL for model development and scoring
• Scores generated on nightly basis
• ID and Score data pre-loaded into data store
• Score tables pushed to external applications
[Diagram labels: Campaign Planning; Campaign Execution; Data Mining; Data Store Scores; Scoring Engine; BI Application; Scheduled Scoring ETL process; Operations; ETL engine; Model Development; Information Technology]
Online Applications
• Scores generated on nightly basis
• ID and Score data pre-loaded into data store
• Individual score requests contain one or more IDs
• Decision server translates score to action
[Diagram labels: Customer call center; Data Store Scores; BI Application; Scheduled Scoring ETL process; Front Office Application; ETL engine; Model Development; Decision Server; Scoring Engine]
On-Demand Applications
• Model input data pre-loaded into data store
• New data provided by application
• Score engine pulls data by ID from data store
• Joins with new data
• Scores generated immediately
• Decision Server translates score to action
Fraud detection
Money laundering
Medical diagnostics
[Diagram labels: Front Office Application; Scheduled Scoring ETL process; Automation Application; ETL engine; Model Development; Decision Server; Scoring Engine]
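The on-demand flow can be sketched as three steps: pull the pre-loaded inputs by ID, join them with the new data supplied by the application, then score and decide immediately. The data store contents, the toy scoring rule, and the threshold are hypothetical stand-ins for a real model and decision server.

```python
# Hypothetical pre-loaded model inputs, keyed by account ID.
DATA_STORE = {
    "A1": {"avg_amount": 50.0, "n_prior_alerts": 0},
    "B2": {"avg_amount": 20.0, "n_prior_alerts": 3},
}


def score_on_demand(case_id, new_data, store=DATA_STORE):
    """Pull stored inputs by ID, join with new data, and return a score in [0, 1]."""
    stored = store[case_id]                  # pull data by ID from the data store
    record = {**stored, **new_data}          # join with the new data
    ratio = record["amount"] / record["avg_amount"]
    # Toy scoring rule for illustration only; a deployed model would replace it.
    return min(1.0, 0.2 * ratio + 0.1 * record["n_prior_alerts"])


def decision_server(score, threshold=0.8):
    """Translate the score to an action."""
    return "hold for review" if score >= threshold else "approve"


first = decision_server(score_on_demand("A1", {"amount": 60.0}))
second = decision_server(score_on_demand("B2", {"amount": 80.0}))
```

Unlike the offline and online patterns, the score here depends on data that did not exist when the nightly batch ran, so it has to be computed at request time.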
Q: So what are the cool applications right now?
GOOGLE, YAHOO, ASK, etc…
• Huge model training task: index and summarize the web
• Techniques: text data processing; page rank
• Real time scoring task: process your query
NETFLIX
• $1M challenge: beat their statisticians
• Huge sparse matrix: fill in the blanks
• Techniques: SVD by numerical approximation
aka: Hebbian-learning Neural Net
Ensembles