DAT205: Advanced Data Mining Using SQL Server 2000

ZhaoHui Tang, Program Manager, SQL Server Analysis Services, Microsoft Corporation

Page 1: DAT205 Advanced Data Mining Using SQL Server 2000

ZhaoHui Tang
Program Manager
SQL Server Analysis Services
Microsoft Corporation

DAT205
Advanced Data Mining Using SQL Server 2000

Page 2: DAT205 Advanced Data Mining Using SQL Server 2000

Agenda

• Microsoft Data Mining Algorithms
• OLE DB for DM Data mining query
• Data Mining Case Study: Click Stream Analysis
  – Customer Segmentation
  – Site affiliation
  – Target ads in banner
• Performance of Microsoft Data Mining Algorithm
• Q&A

Page 3: DAT205 Advanced Data Mining Using SQL Server 2000

Data Mining Algorithms in SQL Server 2000

Page 4: DAT205 Advanced Data Mining Using SQL Server 2000

Decision Tree

• Popular technique for classification and prediction tasks
  – Churn analysis
  – Credit risk analysis
  – …
• Easy to understand
  – any path from the root node to a leaf forms a rule
• Fast to build
• Prediction based on leaf node stats
• Variations: C4.5, C5, CART, CHAID

Example tree (Attend College):

All Students: 55% Yes, 45% No
  IQ = High: 79% Yes, 21% No
    Parent Income = High: 94% Yes, 6% No
    Parent Income = Low: 69% Yes, 31% No
  IQ <> High: 35% Yes, 65% No
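The two claims on this slide (a path to a leaf forms a rule, and the prediction comes from leaf node statistics) can be sketched in a few lines of Python. This is only an illustration: the `TREE` structure and the `predict` helper are hypothetical, and placing the Parent Income nodes under IQ = High is an assumption read off the slide's percentages.

```python
# Hypothetical encoding of the example tree; the nesting of the
# Parent Income nodes under IQ = High is an assumption.
TREE = {
    "stats": {"Yes": 0.55, "No": 0.45},
    "children": [
        {
            "label": "IQ = High",
            "test": lambda case: case.get("IQ") == "High",
            "stats": {"Yes": 0.79, "No": 0.21},
            "children": [
                {
                    "label": "Parent Income = High",
                    "test": lambda case: case.get("Parent Income") == "High",
                    "stats": {"Yes": 0.94, "No": 0.06},
                },
                {
                    "label": "Parent Income = Low",
                    "test": lambda case: case.get("Parent Income") == "Low",
                    "stats": {"Yes": 0.69, "No": 0.31},
                },
            ],
        },
        {
            "label": "IQ <> High",
            "test": lambda case: case.get("IQ") != "High",
            "stats": {"Yes": 0.35, "No": 0.65},
        },
    ],
}


def predict(node, case):
    """Follow matching branches as far as possible.

    Returns the rule (the conditions along the path) and the statistics
    of the last node reached, which serve as the prediction.
    """
    rule = []
    while True:
        children = node.get("children", [])
        match = next((child for child in children if child["test"](case)), None)
        if match is None:
            return " AND ".join(rule) or "All Students", node["stats"]
        rule.append(match["label"])
        node = match


rule, stats = predict(TREE, {"IQ": "High", "Parent Income": "High"})
print(rule, stats)
```

A case that matches no child of a node stops there and uses that node's statistics, which is one simple way to handle a missing attribute value.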

Page 5: DAT205 Advanced Data Mining Using SQL Server 2000

How tree works

CollegePlan counts by attribute state:

Attribute             State     Yes     No
IQ                    High      300    100
IQ                    Medium    500   1000
IQ                    Low       200    900
Parent Encouragement  True      700    400
Parent Encouragement  False     300   1600
Parent Income         High      400    400
Parent Income         False     600   1600
Gender                Male      500   1100
Gender                Female    500    900

[Charts: Yes/No CollegePlan counts per state for IQ, Parent Income (PI), Parent Encouragement (PE) and Gender]
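One way to see why a tree would split on one attribute before the others is to score each candidate split on the counts above. The sketch below uses plain entropy-based information gain; it is a textbook calculation, not the scoring code used by Microsoft Decision Trees, and the names `COUNTS`, `entropy` and `information_gain` are illustrative.

```python
from math import log2

# (Yes, No) CollegePlan counts per attribute state, copied from the slide.
COUNTS = {
    "IQ": {"High": (300, 100), "Medium": (500, 1000), "Low": (200, 900)},
    "Parent Encouragement": {"True": (700, 400), "False": (300, 1600)},
    "Parent Income": {"High": (400, 400), "False": (600, 1600)},
    "Gender": {"Male": (500, 1100), "Female": (500, 900)},
}


def entropy(counts):
    """Shannon entropy (in bits) of a class-count tuple."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def information_gain(states):
    """Entropy before the split minus the weighted entropy after it."""
    yes = sum(y for y, _ in states.values())
    no = sum(n for _, n in states.values())
    total = yes + no
    after = sum((y + n) / total * entropy((y, n)) for y, n in states.values())
    return entropy((yes, no)) - after


gains = {attr: information_gain(states) for attr, states in COUNTS.items()}
best = max(gains, key=gains.get)
for attr, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{attr:22s} {gain:.4f}")
```

On these counts Parent Encouragement has the largest gain, followed by IQ, Parent Income and Gender.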

Page 6: DAT205 Advanced Data Mining Using SQL Server 2000

Split recursively

All Students: College Plan 33% Yes, 67% No
  Parent Encouragement = True: College Plan 63% Yes, 37% No
  Parent Encouragement = False: College Plan 16% Yes, 84% No

CollegePlan counts within the Parent Encouragement = True node:

Attribute             State     Yes     No
IQ                    High      200     50
IQ                    Medium    400    250
IQ                    Low       100    100
Parent Encouragement  True      700    400
Parent Encouragement  False       0      0
Parent Income         High      300    100
Parent Income         False     400    300
Gender                Male      400    250
Gender                Female    250    150

Page 7: DAT205 Advanced Data Mining Using SQL Server 2000

Microsoft Decision Trees

• Probabilistic Classification Tree
• Splitting methods: Bayesian score and Entropy
• Forward pruning
• Tree shape: Binary and N-ary tree
• Scalable framework

Page 8: DAT205 Advanced Data Mining Using SQL Server 2000

Clustering Algorithm (EM)

• A popular method for customer segmentation, mailing list, profiling…
• Algorithm process
  – Assign a set of initial points
  – Assign an initial cluster to each point
  – Assign data points to each cluster with a probability
  – Compute a new central point based on weighted computation
  – Cycle until convergence
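The steps above can be sketched for one-dimensional data with a small Gaussian mixture. This is a generic textbook EM loop under stated assumptions (Gaussian clusters, caller-supplied starting centers, a variance floor to keep the arithmetic stable); it is not the Microsoft Clustering implementation, and `em_1d` and `gaussian` are illustrative names.

```python
from math import exp, pi, sqrt


def gaussian(x, mean, var):
    """Density of a normal distribution at x."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)


def em_1d(points, initial_means, max_iterations=100, tolerance=1e-6):
    k = len(initial_means)
    means = list(initial_means)          # initial central points
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(max_iterations):
        # E-step: assign every point to every cluster with a probability.
        resp = []
        for x in points:
            dens = [w * gaussian(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [1.0 / k] * k)
        # M-step: recompute each center as a probability-weighted mean.
        old_means = list(means)
        for j in range(k):
            mass = sum(r[j] for r in resp)
            if mass == 0:
                continue
            means[j] = sum(r[j] * x for r, x in zip(resp, points)) / mass
            spread = sum(r[j] * (x - means[j]) ** 2
                         for r, x in zip(resp, points)) / mass
            variances[j] = max(spread, 1e-3)   # floor avoids a zero variance
            weights[j] = mass / len(points)
        # Cycle until the centers stop moving.
        if max(abs(a - b) for a, b in zip(means, old_means)) < tolerance:
            break
    return means, weights


means, weights = em_1d([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], initial_means=[0.0, 6.0])
print(means, weights)
```

Unlike k-means, each point contributes to every cluster in proportion to its membership probability, which is what the "weighted computation" step refers to.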

Page 9: DAT205 Advanced Data Mining Using SQL Server 2000

EM Illustration

[Figure: EM illustration]

Page 10: DAT205 Advanced Data Mining Using SQL Server 2000

Microsoft Clustering Algorithm (Scalable EM)

[Flow diagram with the following elements:]
• Data
• Fill Buffer
• Build/Update Model
• Identify Data to be Compressed
• Compressed data, sufficient stats
• Stop?
• Final Model
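The diagram's "compressed data / sufficient stats" box can be illustrated with a small sketch: once a buffer of rows has been summarized by its count, sum and sum of squares, the raw rows are no longer needed to recover a mean and variance. This shows the general idea only. How the real algorithm decides which points to compress is not described in the slides and is not modeled here; `SufficientStats` and `compress_in_chunks` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SufficientStats:
    """Count, sum and sum of squares: enough to recover mean and variance."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other):
        self.n += other.n
        self.total += other.total
        self.total_sq += other.total_sq

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        return self.total_sq / self.n - self.mean ** 2


def compress_in_chunks(values, buffer_size):
    """Read the data one buffer at a time, keeping only the summary."""
    summary = SufficientStats()
    for start in range(0, len(values), buffer_size):
        buffer = values[start:start + buffer_size]   # fill buffer
        chunk = SufficientStats()
        for x in buffer:
            chunk.add(x)
        summary.merge(chunk)                         # buffer can now be dropped
    return summary


summary = compress_in_chunks([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], buffer_size=4)
print(summary.n, summary.mean, summary.variance)
```

Because the summaries merge by simple addition, memory use depends on the buffer size rather than on the number of rows.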

Page 11: DAT205 Advanced Data Mining Using SQL Server 2000

OLE DB for Data Mining

Page 12: DAT205 Advanced Data Mining Using SQL Server 2000

OLE DB for DM

• Industry standard for data mining
• Based on existing technologies
  – SQL
  – OLE DB
• Define common concepts for DM
  – Case, Nested Case
  – Mining Model
  – Model Creation
  – Model Training
  – Prediction
• Language based API

Page 13: DAT205 Advanced Data Mining Using SQL Server 2000

Customer Table

Customer ID  Profession  Income  Gender  Risk
1            Engineer    85      Male    No
2            Worker      40      Male    Yes
3            Doctor      90      Female  No
4            Teacher     50      Female  No
5            Worker      45      Male    No
…            …           …       …       …

Page 14: DAT205 Advanced Data Mining Using SQL Server 2000

DM Query Language

Create Mining Model CreditRisk
(CustomerID long key,
 Gender text discrete,
 Income long continuous,
 Profession text discrete,
 Risk text discrete predict)
Using Microsoft_Decision_Trees

Insert into CreditRisk
(CustomerId, Gender, Income, Profession, Risk)
Select CustomerID, Gender, Income, Profession, Risk
From Customers

Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk.Risk)
From CreditRisk Prediction Join NewCustomers
On CreditRisk.Gender = NewCustomers.Gender
And CreditRisk.Income = NewCustomers.Income
And CreditRisk.Profession = NewCustomers.Profession
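Because the API is language based, an application typically assembles statements like the ones above as text and hands them to the data mining provider. The sketch below only builds the text of a Create Mining Model statement from a column list; `create_mining_model` is a hypothetical helper, not part of any Microsoft API, and sending the statement to a server is not shown.

```python
def create_mining_model(name, columns, algorithm):
    """Return the text of a CREATE MINING MODEL statement.

    columns: list of (column name, data type, content type and flags).
    """
    body = ",\n".join(
        f"  {column} {data_type} {content}"
        for column, data_type, content in columns
    )
    return f"CREATE MINING MODEL {name}\n(\n{body}\n)\nUSING {algorithm}"


statement = create_mining_model(
    "CreditRisk",
    [
        ("CustomerID", "LONG", "KEY"),
        ("Gender", "TEXT", "DISCRETE"),
        ("Income", "LONG", "CONTINUOUS"),
        ("Profession", "TEXT", "DISCRETE"),
        ("Risk", "TEXT", "DISCRETE PREDICT"),
    ],
    "Microsoft_Decision_Trees",
)
print(statement)
```

Keeping the column list as data makes it easy to reuse the same definitions for the matching Insert into statement.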

Page 15: DAT205 Advanced Data Mining Using SQL Server 2000

Schema Rowsets

• Tabular data to provide metadata information
• List of Schema Rowsets in OLE DB for DM
  – Mining_Services
  – Mining_Service_Parameters
  – Mining_Models
  – Mining_Columns
  – Mining_Model_Contents
  – Model_Content_PMML

Page 16: DAT205 Advanced Data Mining Using SQL Server 2000

Mining Model Contents Schema Rowsets

Page 17: DAT205 Advanced Data Mining Using SQL Server 2000

Schema Rowsets & Thin Client Browser

Page 18: DAT205 Advanced Data Mining Using SQL Server 2000
Page 19: DAT205 Advanced Data Mining Using SQL Server 2000

Case Study: Click Stream Analysis

Page 20: DAT205 Advanced Data Mining Using SQL Server 2000

Schema

Customer
  CustomerGuid
  DayTimeOnLine
  NightTimeOnLine
  BrowserType
  EmailTime
  ChatTime
  GeoLocation

WebClick
  CustomerGuid
  URLCategory
  Time
  Duration
  ReferPage

Page 21: DAT205 Advanced Data Mining Using SQL Server 2000

Web Customer Segmentation

Page 22: DAT205 Advanced Data Mining Using SQL Server 2000

Web Visitors Segmentation

Page 23: DAT205 Advanced Data Mining Using SQL Server 2000

Segmentation based on Customer table

Create Mining Model CustomerClustering
(CustomerID text key,
 DayTimeOnline long continuous,
 NightTimeOnline long continuous,
 BrowserType text discrete,
 ChatTime long continuous,
 EmailTime long continuous,
 GeoLocation text discrete
)
Using Microsoft_Clustering

Page 24: DAT205 Advanced Data Mining Using SQL Server 2000

Segmentation based on Customer and WebClick

Create Mining Model CustomerClustering
(CustomerID text key,
 DayTimeOnline long continuous,
 NightTimeOnline long continuous,
 BrowserType text discrete,
 ChatTime long continuous,
 EmailTime long continuous,
 GeoLocation text discrete,
 WebClick table (
   UrlCategory text key )
)
Using Microsoft_Clustering

Page 25: DAT205 Advanced Data Mining Using SQL Server 2000

MSFTies Segmentation

Page 26: DAT205 Advanced Data Mining Using SQL Server 2000

Web Site Affiliation

Page 27: DAT205 Advanced Data Mining Using SQL Server 2000

Association analysis using Microsoft Decision Trees

[Figure: decision trees labeled Business, Insurance, Stock and Loan, with splits such as Insurance / No Insurance, Loan / No Loan, Stock / No Stock, Business / No Business and Shopping / No Shopping]

Page 28: DAT205 Advanced Data Mining Using SQL Server 2000


Page 29: DAT205 Advanced Data Mining Using SQL Server 2000

Site Affiliation

Page 30: DAT205 Advanced Data Mining Using SQL Server 2000

Site Affiliation

Create Mining Model SiteAffiliation
(CustomerID text key,
 WebClick table predict (
   UrlCategory text key )
)
Using Microsoft_Decision_Trees

Insert into SiteAffiliation (CustomerID, WebClick (skip, UrlCategory))
OpenRowset('MSDataShape',
  'data provider=SQLOLEDB;Server=myserver;UID=me;PWD=mypass',
  'Shape {Select CustomerID from Customer}
   Append ({Select customerid, URLCategory from WebClick}
   relate CustomerID to CustomerID) as WebClick'
)
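The Shape ... Append ... relate clause turns two flat tables into one row per customer with a nested WebClick table, which is the "case with nested case" form the model is trained on. A rough Python equivalent of that reshaping is sketched below; the `shape` function and the sample rows are made up for illustration and are not output from a real database.

```python
from collections import defaultdict


def shape(customer_ids, web_clicks):
    """Attach each customer's clicks as a nested list (one case per customer)."""
    clicks_by_customer = defaultdict(list)
    for customer_id, url_category in web_clicks:
        clicks_by_customer[customer_id].append({"UrlCategory": url_category})
    return [
        {"CustomerID": cid, "WebClick": clicks_by_customer.get(cid, [])}
        for cid in customer_ids
    ]


# Illustrative rows only.
customers = ["c1", "c2", "c3"]
clicks = [("c1", "Business"), ("c1", "Stock"), ("c2", "Insurance")]

cases = shape(customers, clicks)
for case in cases:
    print(case)
```

A customer with no clicks still produces a case, with an empty nested table.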

Page 31: DAT205 Advanced Data Mining Using SQL Server 2000
Page 32: DAT205 Advanced Data Mining Using SQL Server 2000

Path Prediction

Page 33: DAT205 Advanced Data Mining Using SQL Server 2000

Path Prediction

Page 34: DAT205 Advanced Data Mining Using SQL Server 2000

Singleton Prediction

Select Flattened
  Topcount((select URLCategory, $adjustedProbability as prob
            From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)), prob, 5)
From
  WebLog PREDICTION JOIN
  (select (select 'Business' as URLCategory) union (select 'Telecom' as URLCategory) as WebClick) as input
On
  WebLog.[Web Click].URLCategory = input.WebClick.URLCategory
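The Topcount(..., prob, 5) call in the query keeps the five predicted rows with the highest prob value. The sketch below mimics that selection on a plain list; `top_count` is an illustrative stand-in rather than the provider's implementation, and the categories and probabilities are made-up sample values.

```python
from heapq import nlargest


def top_count(rows, rank_column, count):
    """Return the `count` rows with the largest value in `rank_column`."""
    return nlargest(count, rows, key=lambda row: row[rank_column])


# Made-up predictions for a visitor who has seen Business and Telecom pages.
predictions = [
    {"URLCategory": "Stock", "prob": 0.41},
    {"URLCategory": "Insurance", "prob": 0.12},
    {"URLCategory": "Loan", "prob": 0.27},
    {"URLCategory": "Shopping", "prob": 0.08},
    {"URLCategory": "News", "prob": 0.33},
    {"URLCategory": "Sports", "prob": 0.05},
]

top5 = top_count(predictions, "prob", 5)
print([row["URLCategory"] for row in top5])
```

In the query, the EXCLUSIVE flag keeps the categories supplied as input out of the predicted set, so only new categories are ranked.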

Page 35: DAT205 Advanced Data Mining Using SQL Server 2000

Architecture

[Architecture diagram for real-time prediction. Components shown: Web Customer, Internet, IIS, ASP, ADO/DSO, DM Provider, DMM]

Page 36: DAT205 Advanced Data Mining Using SQL Server 2000

Performance of DM Algorithms

Page 37: DAT205 Advanced Data Mining Using SQL Server 2000

DM Performance Study

• Joint effort between Unisys & Microsoft
• Two parts of the white paper:
  – First part: Use AS2k to build DM Models for a banking business scenario
  – Second part: Performance results of DM algorithms study
• Some results in this session…
• Details in the paper and SQL Server magazine articles…

Page 38: DAT205 Advanced Data Mining Using SQL Server 2000

Data Source for DMMs

Page 39: DAT205 Advanced Data Mining Using SQL Server 2000

Training Performance Results…

Page 40: DAT205 Advanced Data Mining Using SQL Server 2000

Sample Business Question for Non Nested MDT

1. Identify those customers that are most likely to churn (leave) based on customer demographic information.

Page 41: DAT205 Advanced Data Mining Using SQL Server 2000

Non Nested: Training Times for varying Number of Input attributes

[Chart: Training Time (minutes) vs. Number of Attributes]

Assumptions:
• 1 mm (million) cases
• 25 states
• 1 predictable attribute

I/P Attributes  Training Time (minutes)
10              4.08
20              7.27
50              31.54
100             40.55
200             129.35

Page 42: DAT205 Advanced Data Mining Using SQL Server 2000

Non Nested: Training Times for varying Number of Cases

[Chart: Training Time (minutes) vs. Number of Cases]

Assumptions:
• 20 attributes
• 25 states
• 1 predictable attribute

Cases       Training Time (minutes)
10,000      0.38
1,000,000   11.32
5,000,000   34.19
10,000,000  100.53

Page 43: DAT205 Advanced Data Mining Using SQL Server 2000

Sample Business Question for Nested MDT

2. Find the list of other products that the customer may be interested in based on the products the customer has purchased.

Page 44: DAT205 Advanced Data Mining Using SQL Server 2000

Nested Cases: Training Times for varying Sample size of Case Table

[Chart: Training Time (minutes) vs. Number of Master Cases]

Assumptions:
• Avg. customer purchases = 25
• States in nested = 200
• Nested key predictable

Master Cases  Training Time (minutes)
10,000        15.09
50,000        67.79
100,000       120.88
200,000       240.62
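A quick way to read the table above is to fit a straight line through the four measurements. The sketch below does an ordinary least-squares fit in plain Python. The numbers are the ones on the slide; the linear model itself is an assumption made for illustration and is not a claim from the performance paper.

```python
def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Values from the slide: master cases (in thousands) and training time (minutes).
master_cases_thousands = [10, 50, 100, 200]
training_minutes = [15.09, 67.79, 120.88, 240.62]

slope, intercept = least_squares(master_cases_thousands, training_minutes)
print(f"{slope:.3f} minutes per 1,000 master cases, intercept {intercept:.2f} minutes")
```

The fitted slope is roughly 1.2 minutes per additional 1,000 master cases, so on these four points training time grows close to linearly with the size of the case table.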

Page 45: DAT205 Advanced Data Mining Using SQL Server 2000

Nested Cases: Training Times for varying Number of Products purchased per customer

Assumptions:
• 200000 cases
• 1000 products in nested

Nested Cases  Training Time
10            85.26
25            120.82
50            172.96
100           281.65

Page 46: DAT205 Advanced Data Mining Using SQL Server 2000

For more info…

• DM URL
  – www.microsoft.com/data/oledb
  – www.microsoft.com/data/oledb/DMResKit.htm
• News Group:
  – Microsoft.public.SQLserver.datamining
  – Communities.msn.com/AnalysisServicesDataMining
• White papers:
  – Performance paper:
    www.unisys.com/windows2000/default-07.asp
    www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp

Page 47: DAT205 Advanced Data Mining Using SQL Server 2000

Don't forget to complete the on-line Session Feedback form on the Attendee Web site

https://web.mseventseurope.com/teched/

Page 48: DAT205 Advanced Data Mining Using SQL Server 2000