Copyright © 2006, SAS Institute Inc. All rights reserved. Everything you ever wanted to know about data mining but were afraid to ask. David Duling Data Mining R&D Director SAS Institute




Copyright © 2006, SAS Institute Inc. All rights reserved. Company confidential - for internal use only

Abstract

Data mining is the process of systematically sifting through often large databases to identify patterns and trends relevant to solving business problems such as increasing sales and efficiency. Successful examples of data mining can be found in many areas of business and science, including customer relationship management, market basket analysis, human resources, bioinformatics and medicine, fraud detection, and searching the web. The rapid growth in data mining is largely due to the increased availability of large databases, advances in large-scale computing, and the development of data mining algorithms. This presentation will begin with a brief history of data mining, cover current trends with case studies, and finish with a look into the future of data mining from the SAS perspective.



Q: Where did data mining start?

1841
• Lewis Tappan formed the Mercantile Agency to provide creditworthiness reports (i.e., credit scores) to New York merchants.
• Employed a number of ‘correspondents’ in western frontier towns to monitor the behavior of local traders, in addition to pooling merchant records. A huge database was accumulated.
• Enormous success!

1849
• John Bradstreet starts a credit reporting company in Cincinnati, OH.

1859
• The Mercantile Agency was sold to Robert Graham Dun.

1933
• The Dun company merged with the Bradstreet company.


Q: How about something with computers?

Date: September 1963

Authors: James Myers and Edward Forgy

Title: The Development of Numerical Credit Evaluation Systems

Publication: Journal of the American Statistical Association

Abstract

Several discriminant and multiple regression analyses were performed on retail credit application data to develop a numerical scoring system for predicting credit risk in a finance company. Results showed that equal weights for all significantly predictive items were as effective as weights from the more sophisticated techniques of discriminant analysis and "stepwise multiple regression." However, a variation of the basic discriminant analysis produced a better separation of groups at the lower score levels, where more potential losses could be eliminated with a minimum cost of potentially good accounts.


Data Mining, circa 1963

IBM 7090

600 applicants

“Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”


Q: That sounds like statistics, so what’s the difference?

Statistics (Experimental)

Prior Hypothesis
• Idea before data acquisition
• Data acquisition planned

Experimental Design
• Sampling strategies
• Factorial designs
• Required confidence
• Minimize model terms

Inference
• Hypothesis testing
• Prediction

Data mining (Commercial)

Posterior Hypothesis
• Idea after data acquisition
• Data acquisition opportunistic

No Experimental Design
• Explore data
• Create hypothesis
• Generate query
• Create models

Prediction
• Lift, Profit, Response
• Inference


It’s all about the Data

              Experimental           Opportunistic
Purpose       Research               Operational
Value         Scientific             Commercial
Generation    Actively controlled    Passively observed
Size          Small                  Massive
Hygiene       Clean                  Dirty
State         Static                 Dynamic


Where does mining data come from?

Continent
  Continent_Id: INTEGER
  Continent_Name: CHARACTER(30)

Country
  Country: CHARACTER(2)
  Country_Name: CHARACTER(45)
  Population: NUMERIC(6)
  Office: CHARACTER(2)
  Dir: CHARACTER(3)
  Country_Id: INTEGER
  Continent_Id: INTEGER (FK)
  Country_Former_Nam: CHARACTER(45)

Customer
  Customer_Id: INTEGER
  Country: CHARACTER(2)
  Gender: CHARACTER(1)
  Personal_Id: CHARACTER(15)
  Customer_Name: CHARACTER(40)
  Customer_Firstname: CHARACTER(20)
  Customer_Lastname: CHARACTER(30)
  Birthday: DATE
  Customer_Address: CHARACTER(40)
  Street_Id: INTEGER (FK)
  Street_Number: CHARACTER(8)
  Customer_Type_Id: INTEGER (FK)

Customer_Type
  Customer_Type_Id: INTEGER
  Customer_Type: CHARACTER(40)
  Customer_Group_Id: INTEGER
  Customer_Group: CHARACTER(40)

Order
  Order_Id: INTEGER
  Employee_Id: INTEGER (FK)
  Customer_Id: INTEGER (FK)
  Order_Date: DATE
  Delivery_Date: DATE
  Order_Type: INTEGER

Order_Item
  Order_Id: INTEGER (FK)
  Order_Item_No: INTEGER
  Product_Id: INTEGER (FK)
  Amount: SMALLINT
  Price: DECIMAL(12,2)
  Unit_Cost_Price: DECIMAL(12,2)
  Promotion: DECIMAL(5,2)

Organization
  Employee_Id: INTEGER
  Org_Name: CHARACTER(40)
  Country: CHARACTER(2)
  Org_Level_Id: INTEGER (FK)
  Start_Date: DATE
  End_Date: DATE
  Org_Ref_Id: INTEGER (FK)

Org_Level
  Org_Level_Id: INTEGER
  Org_Text: CHARACTER(40)

Price_List
  Product_Id: INTEGER (FK)
  Start_Date: DATE
  End_Date: DATE
  Unit_Cost_Price: DECIMAL(12,2)
  Unit_Sales_Price: DECIMAL(12,2)

Product
  Product_Id: INTEGER
  Product_Name: CHARACTER(45)
  Supplier_Id: INTEGER (FK)
  Product_Level_Id: INTEGER (FK)
  Product_Ref_Id: INTEGER (FK)
  Product_Level: NUMERIC(3)

Street_Code
  Street_Id: INTEGER
  Country: CHARACTER(2)
  Street_Name: CHARACTER(30)
  Zip_Code: CHARACTER(10)
  From_Street_No: NUMERIC(8)
  To_Street_No: NUMERIC(8)
  City: CHARACTER(22)
  County: CHARACTER(25)
  State: CHARACTER(2)
  City_Id: INTEGER
  State_Id: INTEGER
  County_Id: INTEGER (FK)
  Zip_Id: INTEGER (FK)

Supplier
  Supplier_Id: INTEGER
  Supplier_Name: CHARACTER(30)
  Street_Id: INTEGER (FK)
  Supplier_Address: CHARACTER(30)
  Supplier_Street_Nu: NUMERIC(3)
  Country: CHARACTER(2)

Product_Level
  Product_Level_Id: INTEGER
  Product_Level_Name: CHARACTER(30)

Staff
  Employee_Id: INTEGER (FK)
  Start_Date: DATE
  Salary: DECIMAL(12,2)
  Birthday: DATE
  End_Date: DATE
  Emp_Hire_Date: DATE
  Gender: CHARACTER(1)
  Emp_Term_Date: DATE
  Job_Title: CHARACTER(25)

Promotion
  Product_Id: INTEGER (FK)
  Start_Date: DATE
  End_Date: DATE
  Sales_Price: DECIMAL(12,2)
  Promotion: DECIMAL(5,2)

County
  County_Id: INTEGER
  County_Type: INTEGER (FK)
  County_Name: CHARACTER(30)
  Region_Id: INTEGER (FK)

Region
  Region_Id: INTEGER
  Region_Type: INTEGER (FK)
  Region_Name: CHARACTER(30)
  State_Id: INTEGER (FK)

State
  State_Id: INTEGER
  State_Type: CHARACTER(30)
  State_Name: CHARACTER(30)
  Country: CHARACTER(2) (FK)
  Geo_Type_Id: INTEGER (FK)

Geo_Type
  Geo_Type_Id: INTEGER
  Geo_Type_Name: CHARACTER(30)

Zip_Code
  Zip_Id: INTEGER
  City_Name: CHARACTER(30)
  Zipcode: CHARACTER(18)
  City_Id: INTEGER (FK)

City
  City_Id: INTEGER
  City_Name: CHARACTER(30)

Data Warehouses store detail data on transactions and states

Very simple demo example


Data: majority of time spent on mining

Intelligent Enterprise Magazine

http://www.intelligententerprise.com/030405/606feat2_1.jhtml


Each data type has its own mix of data prep and mining

Market baskets, Item sets

Market baskets with time order

Web paths: unique sequences

Time stamped transactions

Text normalization

Demographics, Personal Information


Use integrated data for mining

• ID columns
• Web paths
• Market baskets
• Seasonal indices
• Demographic, Financial
• Text dimensions
• Interactions


Example Quest: Maximize response to this year’s summer promotion

How: Find those customers most likely to respond

Use response to last year’s summer promotion as indicator of response to this year’s promotion. This is the dependent variable.

Use all customer data available before last summer. These are the independent variables.
• Demographics
• Sales item history
• Sales amount history
• Web site history
• Call center records
• …
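The setup above can be sketched in a few lines: inputs are summarized only from events before the promotion, and the target is whether the customer responded afterwards. The event records, field names, and the `PROMO_START` cutoff date below are hypothetical, chosen only to illustrate the time split.

```python
from datetime import date

# Hypothetical event records: (customer_id, event_date, event_type, amount)
events = [
    (1, date(2005, 3, 1), "purchase", 40.0),
    (1, date(2005, 7, 10), "promo_response", 0.0),
    (2, date(2005, 2, 15), "purchase", 15.0),
    (2, date(2005, 8, 2), "purchase", 60.0),
    (3, date(2005, 5, 20), "purchase", 25.0),
]
PROMO_START = date(2005, 6, 1)  # assumed start of last year's summer promotion


def build_modeling_table(events, cutoff):
    """One row per customer: inputs from before the cutoff, target from after it."""
    rows = {}
    for cust, when, kind, amount in events:
        row = rows.setdefault(
            cust, {"n_purchases": 0, "total_spend": 0.0, "responded": 0}
        )
        if when < cutoff and kind == "purchase":
            # Independent variables: history available before the promotion.
            row["n_purchases"] += 1
            row["total_spend"] += amount
        elif when >= cutoff and kind == "promo_response":
            # Dependent variable: response to last year's promotion.
            row["responded"] = 1
    return rows


table = build_modeling_table(events, PROMO_START)
```

Purchases made after the cutoff are deliberately ignored, so the inputs never contain information that would not have been available when the promotion was sent.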


Q: Why don’t we just select last year’s customer lists ?

Data is non-stationary (remember this point)

Move to new locations

Change jobs

Income goes up or down

Debt increases or decreases

Marital status

Parental status

The model is a function of the attributes, not the individuals


Q: What Functions are Popular?

Associations           unsupervised
Clustering             unsupervised
PCA/SVD                unsupervised
Logistic Regression    supervised
Decision Tree          supervised
Neural Network         supervised
Ensembles              supervised

… and many other forms, variants, and names


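How do I find Patterns?

Try ASSOCIATIONS and SEQUENCES

Searches for frequent patterns

(Car Wreck -> Dr. X) (Diagnosis Code xxx -> MRI)

Confidence:

If (A) happens then (B) happens 80% of the time

C = N(A and B) / N(A)

Support:

(A) and (B) happen together in 10% of all itemsets

S = N(A and B) / N

Here N(A) is the number of itemsets containing A, N(A and B) is the number containing both A and B, and N is the total number of itemsets.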
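As a minimal sketch of the two measures, assuming each itemset is a Python set and the rule is "A implies B", support and confidence reduce to simple counts. The function name and the basket data are made up for illustration.

```python
def support_and_confidence(itemsets, a, b):
    """Support and confidence of the rule a -> b over a list of itemsets."""
    n = len(itemsets)
    n_a = sum(1 for s in itemsets if a in s)               # itemsets containing A
    n_ab = sum(1 for s in itemsets if a in s and b in s)   # itemsets with A and B
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical market baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
s, c = support_and_confidence(baskets, "bread", "milk")
```

In this toy data, bread and milk appear together in 3 of 5 baskets (support 0.6), and 3 of the 4 baskets with bread also contain milk (confidence 0.75).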


Rule Sets show the next most likely action


Associations and Sequences / Visualization


What is the Most Popular Data Mining Function ?

[Figure: Prob plotted against two Inputs]

- Logistic Regression Still Rules!
- Linear combination of terms (z)
- Relatively easy to compute
- Converges to a solution
- Explainable
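To make the "linear combination of terms (z)" concrete, here is a minimal scoring sketch: the inputs are combined linearly into z and passed through the logistic function to give a probability. The weights and intercept are illustrative values, not fitted coefficients.

```python
import math


def logistic_probability(inputs, weights, intercept):
    """Score one case: p = 1 / (1 + exp(-z)), z = intercept + sum(w_i * x_i)."""
    z = intercept + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative (not fitted) coefficients for two inputs
weights = [0.5, -1.0]
p_neutral = logistic_probability([2.0, 1.0], weights, 0.0)  # z = 0
p_higher = logistic_probability([4.0, 1.0], weights, 0.0)   # z = 1
```

Because z is linear in the inputs, each weight can be read as the change in log-odds per unit of its input, which is what makes the model explainable.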


Q: What is CART and why do I need it?

A Decision Tree!

Classification and Regression

Strategy Development

Hunt 1966 Concept Learning System

Kass 1980 Chi-squared Automatic Interaction Detection

Breiman 1984 Classification and Regression Trees

Quinlan 1993 C4.5 rule sets

Numerous others…
• Algorithms for efficiently building trees
• Hypothesis tests for finding split points
− Various measurement scales


Building a Decision Tree

Keep doing that until there are no more beneficial splits...


Recursive Partitioning
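A minimal sketch of recursive partitioning for a binary target with numeric inputs follows. It uses Gini impurity as the split criterion, which is an assumption made for brevity; the methods listed earlier (for example chi-squared tests in CHAID) use other criteria. The data and the `max_depth` limit are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return (gain, feature index, threshold) of the best split, or None."""
    n = len(rows)
    parent = gini(labels)
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, j, t)
    return best


def grow(rows, labels, depth=0, max_depth=3):
    """Split recursively until no beneficial split remains or max_depth is hit."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return {"leaf": True, "prob": sum(labels) / len(labels)}
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "leaf": False, "feature": j, "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return that leaf's response rate."""
    while not tree["leaf"]:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["prob"]


# Hypothetical data: [age, has_card] with a 0/1 response
rows = [[25, 0], [30, 1], [45, 0], [50, 1], [35, 1], [60, 0]]
labels = [0, 0, 1, 1, 0, 1]
tree = grow(rows, labels)
```

On this toy data the first split lands on age at 35, after which both children are pure and the recursion stops.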


Benefits of Trees

Interpretability

• Tree structured presentation

Mixed Measurement Scales

• Nominal, ordinal, interval

• Regression trees

Robustness

Missing Values


…Benefits

Automatically

• Detects interactions (AID)

• Accommodates non-linearity

• Selects input variables

[Figure: Prob plotted against two Inputs as a Multivariate Step Function]


Drawbacks of Trees

Roughness

Linear, Main Effects

Instability


Q: Why do they call it a ‘Neural’ network?

Neuron

Hidden Unit


Feed Forward Neural Network

Input Layer

Hidden Layers

Output Layer


How does it work?

C = combination ( Weights * Inputs )

A = Activation ( C )

f(W,I) = A[C] + b -> output

y ~ f(W,(s,t))

s = f(S,(p,q,r))

t = f(T,(p,q,r))

p = f(P,X)

q = f(Q,X)

r = f(R,X)

y ~ f(W,(f(S,(f(P,X), f(Q,X), f(R,X))), f(T,(f(P,X), f(Q,X), f(R,X)))))

Err = E(Y,y) ~ (Y - y)^2
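The nested formula above can be traced as a forward pass. The sketch below keeps the same names (P, Q, R, S, T, W for the weight vectors; p, q, r, s, t for hidden units), assumes tanh as the activation, and adds the bias after the activation as written in f(W,I) = A[C] + b; many implementations add the bias inside the activation instead. The example weights are arbitrary.

```python
import math


def unit(weights, inputs, bias=0.0):
    """One unit: C = sum(w * i), A = tanh(C), output = A + b."""
    c = sum(w * i for w, i in zip(weights, inputs))
    return math.tanh(c) + bias  # tanh is an assumed activation


def forward(x, P, Q, R, S, T, W):
    """y = f(W, (s, t)), with s and t built from the hidden units p, q, r."""
    p, q, r = unit(P, x), unit(Q, x), unit(R, x)
    s, t = unit(S, (p, q, r)), unit(T, (p, q, r))
    return unit(W, (s, t))


def squared_error(target, y):
    """Err = (Y - y)^2 for a single case."""
    return (target - y) ** 2


# Arbitrary illustrative weights for a two-input network
x = [1.0, 2.0]
y = forward(
    x,
    P=[0.1, 0.2], Q=[-0.3, 0.4], R=[0.5, -0.6],
    S=[0.7, -0.8, 0.9], T=[-0.2, 0.3, 0.1],
    W=[1.0, -1.0],
)
err = squared_error(1.0, y)
```

Training then amounts to adjusting all six weight vectors so that the summed error over the training cases goes down.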

[Figure: network diagram with nodes labeled a through j, hidden units p, q, r and s, t, and output y]


Activation Function

[Figure: activation function plotted against Layer Input]


Training

Iterative Optimization Algorithm

[Figure: Error Function surface over Parameter 1 and Parameter 2]


Training history for our example

Error measure goes down with every iteration.

Weights evolve at every iteration
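A training history of this kind can be reproduced with a small sketch: plain gradient descent on mean squared error for a two-parameter linear model. The slides do not say which optimizer was used, so gradient descent, the learning rate, and the data are all assumptions for illustration.

```python
def train(data, lr=0.2, iterations=200):
    """Fit y ~ w1*x1 + w2*x2 by gradient descent; record error and weights."""
    w1, w2 = 0.0, 0.0
    history = []
    n = len(data)
    for _ in range(iterations):
        residuals = [y - (w1 * x1 + w2 * x2) for x1, x2, y in data]
        err = sum(e * e for e in residuals) / n
        history.append((err, w1, w2))
        # Gradient of the mean squared error with respect to each weight
        g1 = sum(-2.0 * e * x1 for e, (x1, _, _) in zip(residuals, data)) / n
        g2 = sum(-2.0 * e * x2 for e, (_, x2, _) in zip(residuals, data)) / n
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2, history


# Hypothetical noise-free data generated from y = 2*x1 - 1*x2
data = [(1.0, 0.0, 2.0), (0.0, 1.0, -1.0), (1.0, 1.0, 1.0), (2.0, 1.0, 3.0)]
w1, w2, history = train(data)
```

Each entry of `history` holds the error and the two weights at one iteration, which is the information the training plots display.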


Neural Pros and Cons

Very Flexible functions

Implicit transformation and interactions

Good algorithms for controlling complexity

No inference

Complex function

Many possible networks – large search space


Q: How do I know the model will work on new data?

Make sure that you don’t have a perfect model!
• Real data has multiple forms of the dependent variable effect

Limit exposure to data that changes over time
• Examine distributions of data at several time points
• Select stable data
• Use standardizations
• Use category=other

Backtest
• Use a hold out sample from a later time period

Monitor Performance
• Compare actual and expected results
• Compare input term distributions

Don’t fit the noise.


Q: How do I model signal instead of noise?

Limit model complexity by using Validation Data.
• Decision Tree: Pruning
• Neural Network: Early Stopping
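Early stopping can be sketched as a rule over the validation error recorded at each training iteration: keep the iteration with the lowest validation error and stop once it has not improved for a while. The `patience` parameter and the error values are hypothetical.

```python
def early_stopping_iteration(validation_errors, patience=2):
    """Return (best iteration, best validation error) under a patience rule."""
    best_err = float("inf")
    best_iter = -1
    for i, err in enumerate(validation_errors):
        if err < best_err:
            best_err, best_iter = err, i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop training here.
            break
    return best_iter, best_err


# Hypothetical validation error per iteration: it falls, then starts to rise
val = [0.50, 0.40, 0.35, 0.36, 0.38, 0.30]
stop = early_stopping_iteration(val)
```

With a patience of 2, training halts at the fifth entry and the weights from iteration 2 are kept; the later value 0.30 is never reached, which is the trade-off a patience rule accepts.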


Complexity -> Overfitting

[Figure panels: Training Set, Test Set]


Better Fitting … with a simpler model

[Figure panels: Training Set, Test Set]


How do I select the best model ?

ROC for overall model performance:

Decision Tree

Lift for targeted model performance:

Neural Network
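Lift for a targeted selection can be computed directly from scored cases: rank by score, take the top fraction, and compare its response rate with the overall rate. The sketch below assumes 0/1 outcomes with at least one responder; the scores and outcomes are made up.

```python
def lift_at(scores, outcomes, fraction=0.1):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(outcomes) / len(outcomes)  # assumes at least one responder
    return top_rate / base_rate


# Hypothetical model scores and observed responses on a holdout sample
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lift_top20 = lift_at(scores, outcomes, fraction=0.2)
```

Here the top 20% of cases all responded while the overall rate is 30%, so the lift is about 3.3; comparing this figure across candidate models at the depth the campaign will actually use is one way to pick the winner.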


Computers keep getting so much faster, why does my neural network take so long to run?


The enemy: growth in data warehouses

In the aggregate, the 2001 survey pool reported 632 TB of storage. Just two years later, those surveyed were using almost 2 petabytes (2,000 TB) of storage. Based on the number of survey respondents, the average large database — whether used for decision support or transaction processing — increased its storage requirements three and one-half times in just two years. DM-REVIEW


Single disk size growth


Clock speed vs. disk sizes


Even worse, it’s all about complexity of data

1M rows x 100 columns x 8 bytes = 800MB

1000 rows x 1000 columns x 8 bytes = 8MB

Which data is more complex ?


Q: So how does this make me money?

Models get deployed to operational systems
• New data is acquired
• Each case is scored with the model function
• Action taken on each case:
− Send promotion or don’t send promotion
− Select item for cross sell offer
− Grant credit or don’t grant credit
− Alert engineers that a manufacturing defect has been found.

Model-driven decisions are nearly always better than intuition

…iff… the data miner has accounted for enough sources of variation.
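The deployment loop described above (score each new case, then act on the score) can be sketched as follows. The scoring function, the 0.5 threshold, and the case data are placeholders standing in for a deployed model and a business rule.

```python
def decide(cases, score, threshold=0.5):
    """Score each case and translate the score into an action."""
    actions = {}
    for case_id, inputs in cases.items():
        if score(inputs) >= threshold:
            actions[case_id] = "send promotion"
        else:
            actions[case_id] = "don't send promotion"
    return actions


def toy_score(inputs):
    """Placeholder for a deployed model function; not a real model."""
    return min(1.0, 0.1 * inputs["n_purchases"])


# Hypothetical new cases arriving in the operational system
new_cases = {
    "A17": {"n_purchases": 8},
    "B42": {"n_purchases": 2},
}
actions = decide(new_cases, toy_score)
```

Keeping the model function separate from the action rule lets the threshold change with campaign budget or risk appetite without retraining the model.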


Offline Applications
• ETL for model development and scoring
• Scores generated on nightly basis
• ID and Score data pre-loaded into data store
• Score tables pushed to external applications

[Diagram labels: Campaign Planning; Campaign Execution; Data Mining; Data Store: Scores; Scoring Engine; BI Application; Scheduled Scoring ETL process; Operations; ETL engine; Model Development; Information Technology]


Online Applications
• Scores generated on nightly basis
• ID and Score data pre-loaded into data store
• Individual score requests contain one or more IDs
• Decision server translates score to action

[Diagram labels: Customer call center; Data Store: Scores; BI Application; Scheduled Scoring ETL process; Front Office Application; ETL engine; Model Development; Decision Server; Scoring Engine]


On-Demand Applications
• Model input data pre-loaded into data store
• New data provided by application
• Score engine pulls data by ID from data store and joins it with the new data
• Scores generated immediately
• Decision Server translates score to action

Examples: Fraud detection, Money laundering, Medical diagnostics

[Diagram labels: Front Office Application; Scheduled Scoring ETL process; Automation Application; ETL engine; Model Development; Decision Server; Scoring Engine]


Q: So what are the cool applications right now?

GOOGLE, YAHOO, ASK, etc…
• Huge model training task: index and summarize the web
• Techniques: text data processing; page rank
• Real time scoring task: process your query

NETFLIX
• $1M challenge: beat their statisticians
• Huge sparse matrix: fill in the blanks
• Techniques:
− SVD by numerical approximation (aka: Hebbian-learning Neural Net)
− Ensembles
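One common reading of "SVD by numerical approximation" for a sparse ratings matrix is to fit low-rank user and item factors by stochastic gradient descent on the known entries only, then multiply the factors to fill in the blanks. The sketch below follows that reading; it is an assumption about the technique meant here, and the rank, learning rate, epoch count, and ratings are illustrative.

```python
def factorize(ratings, n_users, n_items, rank=1, lr=0.02, epochs=5000, init=0.5):
    """Fit factors U, V so that dot(U[u], V[i]) approximates each known rating."""
    U = [[init] * rank for _ in range(n_users)]
    V = [[init] * rank for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(U[u], V[i]))
            err = r - pred
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step on the squared error of this single rating
                U[u][k] += lr * err * vk
                V[i][k] += lr * err * uk
    return U, V


def predict_rating(U, V, u, i):
    """Estimate any cell, known or missing, from the fitted factors."""
    return sum(a * b for a, b in zip(U[u], V[i]))


# Hypothetical known ratings (user, item, rating); cell (1, 2) is missing.
# The values follow an exact rank-1 pattern, so the missing cell should be near 6.
known = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
U, V = factorize(known, n_users=2, n_items=3)
missing = predict_rating(U, V, 1, 2)
```

Only observed cells contribute to the updates, which is what keeps the method practical when the matrix is huge and mostly empty.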
