BigML Fall 2015 Release

Post on 13-Feb-2017


Introducing Association Discovery

BigML 2015 Fall Release

BigML Inc Fall 2015 Release

Today's Webinar

• Speaker:
  • Poul Petersen, CIO
• Moderator:
  • Atakan Cetinsoy, VP Predictive Applications
• Enter questions into chat box – we'll answer some via text; others at the end of the session
• email: info@bigml.com
• Twitter: @bigmlcom


Association Discovery

Algorithm: "Magnum Opus" from Geoff Webb

Unsupervised Learning: unlabelled data

Learning Task: Find "interesting" relations between variables.

BigML Workflow

[Diagram: SOURCE, DATASET, and the resources built from a dataset: MODEL (Decision Trees, Bagging, Decision Forest), CLUSTER (K-Means, G-Means), ANOMALY (Isolation Forest), ASSOCIATION (Magnum Opus)]

Unsupervised Learning

date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51

• Clustering: find instances that are similar
• Anomaly Detection: find instances that are unusual

Association Rules

Rules found in the same transaction data:

Antecedent                           Consequent
{customer = Bob, account = 3421}  →  zip = 46140
{class = gas}                     →  amount > 80


Use Cases

• Market Basket Analysis
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• Medical risk factors


Market Basket Analysis

• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at checkout

GOAL: Discover "interesting" rules about what store items are typically purchased together.
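As a rough illustration of "each row is a list of items", the sketch below counts how often pairs of items share a cart. The baskets are made up for the example and are not rows from the 9,834-transaction dataset; counting pairs is only a simple stand-in for rule discovery, not the Magnum Opus algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets for illustration only (not from the webinar dataset).
baskets = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["beer", "chips"],
]

# Count every unordered pair of distinct items that appears in the same cart.
pair_counts = Counter()
for basket in baskets:
    items = sorted(set(basket))  # de-duplicate and fix an order for the pair keys
    pair_counts.update(combinations(items, 2))

# Fraction of carts containing each pair.
pair_share = {pair: count / len(baskets) for pair, count in pair_counts.items()}

for pair, share in sorted(pair_share.items()):
    print(pair, share)
```

Pairs that appear in a large share of carts are candidates for rules; the metrics on the following slides are what separate the genuinely "interesting" ones from pairs that are merely common.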

Association Metrics: Coverage

Percentage of instances which match antecedent "A".

[Venn diagram: sets A and C within all Instances]

Association Metrics: Support

Percentage of instances which match antecedent "A" and consequent "C".

[Venn diagram: sets A and C within all Instances]

Association Metrics: Confidence

Percentage of instances in the antecedent which also contain the consequent.

Confidence = Support / Coverage

[Venn diagrams: sets A and C within all Instances]

Confidence ranges from 0% to 100%:

• 0%: A never implies C
• in between: A sometimes implies C
• 100%: A always implies C
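A minimal sketch of coverage, support, and confidence, computed for the rule {customer = Bob, account = 3421} → zip = 46140 on the seven-row transaction table from the earlier slide. Only the columns the rule uses are copied here; the helper names `matches` and `rule_metrics` are illustrative and not part of any BigML API.

```python
from fractions import Fraction

# The seven transactions from the earlier slide (only the columns the rule needs).
ROWS = [
    {"customer": "Bob",   "account": 3421, "zip": 46140},
    {"customer": "Bob",   "account": 3421, "zip": 46140},
    {"customer": "Alice", "account": 2456, "zip": 12222},
    {"customer": "Sally", "account": 6788, "zip": 26339},
    {"customer": "Bob",   "account": 3421, "zip": 21350},
    {"customer": "Bob",   "account": 3421, "zip": 46140},
    {"customer": "Sally", "account": 6788, "zip": 26339},
]

def matches(row, conditions):
    """True if the row satisfies every field = value condition."""
    return all(row[field] == value for field, value in conditions.items())

def rule_metrics(rows, antecedent, consequent):
    """Return (coverage, support, confidence) as exact fractions."""
    n = len(rows)
    n_a = sum(matches(r, antecedent) for r in rows)
    n_ac = sum(matches(r, antecedent) and matches(r, consequent) for r in rows)
    coverage = Fraction(n_a, n)    # share of instances matching A
    support = Fraction(n_ac, n)    # share of instances matching A and C
    confidence = support / coverage if coverage else None
    return coverage, support, confidence

coverage, support, confidence = rule_metrics(
    ROWS, {"customer": "Bob", "account": 3421}, {"zip": 46140}
)
print(coverage, support, confidence)  # 4/7 3/7 3/4
```

The confidence of 3/4 reflects Bob's one purchase in zip 21350: here A "sometimes implies" C.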

Association Metrics: Lift

Ratio of observed support to support if A and C were statistically independent.

Lift = Support / (p(A) * p(C)) = Confidence / p(C)

[Venn diagrams: overlap of A and C, Independent vs. Observed]

• Lift < 1: Negative Correlation
• Lift = 1: No Association
• Lift > 1: Positive Correlation
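The lift formula can be checked with a small sketch that works from raw counts. The counts below are read off the seven-row transaction table for the rule {customer = Bob, account = 3421} → zip = 46140 (4 rows match A, 3 match C, 3 match both); the function name `lift` is illustrative.

```python
from fractions import Fraction

def lift(n, n_a, n_c, n_ac):
    """Lift from raw counts: n instances, n_a matching A, n_c matching C,
    n_ac matching both. Assumes n_a and n_c are non-zero."""
    support = Fraction(n_ac, n)
    expected = Fraction(n_a, n) * Fraction(n_c, n)  # support if A and C were independent
    return support / expected

bob_lift = lift(n=7, n_a=4, n_c=3, n_ac=3)

# The second form of the formula, Confidence / p(C), gives the same value.
confidence = Fraction(3, 4)
p_c = Fraction(3, 7)
print(bob_lift, confidence / p_c)  # 7/4 7/4
```

A lift of 7/4 is above 1, so on this toy table the rule shows a positive correlation.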

Association Metrics: Leverage

Difference of observed support and support if A and C were statistically independent.

Leverage = Support - [p(A) * p(C)]

[Venn diagrams: overlap of A and C, Independent vs. Observed]

• Leverage < 0: Negative Correlation
• Leverage = 0: No Association
• Leverage > 0: Positive Correlation

Scale shown on the slide: -1 … 1
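A short sketch of leverage from raw counts, applied to both rules from the transaction table. The counts are read off the seven rows: for the Bob rule, 4 rows match A, 3 match C, 3 match both; for {class = gas} → amount > 80, 2 rows match A, 6 match C, 2 match both. The function name `leverage` is illustrative.

```python
from fractions import Fraction

def leverage(n, n_a, n_c, n_ac):
    """Leverage = observed support minus the support expected under independence."""
    support = Fraction(n_ac, n)
    expected = Fraction(n_a, n) * Fraction(n_c, n)
    return support - expected

bob_leverage = leverage(n=7, n_a=4, n_c=3, n_ac=3)
gas_leverage = leverage(n=7, n_a=2, n_c=6, n_ac=2)
print(bob_leverage, gas_leverage)  # 9/49 2/49
```

Both values are above 0, but the gas rule's leverage is small because its consequent (amount > 80) already holds for six of the seven rows.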

Medical Risk

• Dataset of diagnostic measurements of 768 patients.
• Each patient labelled True/False for diabetes.

GOAL: Find general rules that indicate diabetes.

Medical Risk

Association Rule:
If plasma glucose > 146 then diabetes = TRUE

Decision Tree:
If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44 then diabetes = TRUE
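To make the contrast concrete, the two rules above can be written as predicates and applied to a few records. The patient values and field names below are hypothetical, chosen only to show that the single-condition association rule fires on more records than the five-condition tree path.

```python
def association_rule(p):
    """Association rule from the slide: plasma glucose > 146."""
    return p["plasma_glucose"] > 146

def tree_path(p):
    """Decision tree path from the slide (all five conditions must hold)."""
    return (
        p["plasma_glucose"] > 155
        and p["bmi"] > 29.32
        and p["diabetes_pedigree"] > 0.32
        and p["insulin"] <= 629
        and p["age"] <= 44
    )

# Hypothetical records for illustration; not taken from the 768-patient dataset.
patients = [
    {"plasma_glucose": 150, "bmi": 31.0, "diabetes_pedigree": 0.50, "insulin": 100, "age": 50},
    {"plasma_glucose": 160, "bmi": 30.0, "diabetes_pedigree": 0.40, "insulin": 200, "age": 40},
    {"plasma_glucose": 120, "bmi": 25.0, "diabetes_pedigree": 0.20, "insulin": 80,  "age": 30},
]

results = [(association_rule(p), tree_path(p)) for p in patients]
print(results)  # [(True, False), (True, True), (False, False)]
```

Any record that satisfies the tree path also satisfies the association rule, since glucose > 155 implies glucose > 146; the reverse does not hold.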

Partial Dependence Plots

Visualize Ensembles

Flatline Editor

https://github.com/bigmlcom/flatline

BigML Workflow

[Workflow diagram repeated, with Flatline and the Flatline Editor added at the DATASET step]

Logistic Regression

DATASET → LOGISTIC REGRESSION

• Classification algorithm
• Categorical: one-hot encoded
• Text: mapped to token freq
• Bindings support local model
• L1/L2 regularization
• Currently API only

https://bigml.com/developers/logisticregressions
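The bullets on one-hot encoding and L1/L2 regularization can be illustrated with a generic sketch in plain Python. This is textbook logistic regression, not BigML's implementation or API; the weights are hypothetical, and the categories reuse the `auth` values (pin, sign) from the transaction table.

```python
import math

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector, one slot per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def predict_proba(features, weights, bias):
    """Logistic (sigmoid) function applied to the weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def l1_penalty(weights):
    """L1 regularization term: sum of absolute weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """L2 regularization term: sum of squared weights."""
    return sum(w * w for w in weights)

categories = ["pin", "sign"]
x = one_hot("sign", categories)           # [0.0, 1.0]
weights = [0.5, -1.5]                     # hypothetical, for illustration only
p = predict_proba(x, weights, bias=0.0)   # sigmoid(-1.5)

print(x, round(p, 4), l1_penalty(weights), l2_penalty(weights))
```

During training, one of the penalty terms is added to the loss to discourage large weights; the output `p` is a probability, which matches the "outputs probability" advantage listed for logistic regression later in the deck.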

BigML Workflow

[Workflow diagram repeated, with Logistic Regression added under MODEL]

BigML Classifiers

                     Advantages                                Disadvantages
Single Tree          easy to interpret; robust to missing data overfitting
Ensemble             top performer; robust to missing data     hard to interpret
Logistic Regression  robust to noise; outputs probability      no missing data; hard to interpret

BigML Workflow

[Workflow diagram repeated, with STATS (Statistical Tests, Correlations) added alongside DATASET]

Correlations

DATASET → CORRELATION

• Pearson Coefficient
• Spearman Coefficient
• Chi-Square
• Cramér's V
• Tschuprow's T
• One-way ANOVA

https://bigml.com/developers/correlations
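As one example from the list above, the Pearson coefficient can be sketched in a few lines of plain Python. This is the standard textbook formula, not a call into the BigML API, and the sample values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.
    Assumes neither sequence is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values: ys rises exactly with xs, zs falls exactly with xs.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
zs = [8, 6, 4, 2]

r_up = pearson(xs, ys)     # close to +1
r_down = pearson(xs, zs)   # close to -1
print(round(r_up, 6), round(r_down, 6))
```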

Statistical Tests

DATASET → STATISTICAL TESTS

• Benford's Law
• Anderson-Darling
• Jarque-Bera
• Z-score
• Grubbs

https://bigml.com/developers/statisticaltests
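Benford's Law, the first test in the list, predicts that a leading digit d occurs with probability log10(1 + 1/d). The sketch below computes those expected frequencies and extracts leading digits from the `amount` column of the transaction table. It is a generic illustration, not BigML's implementation, and seven values are far too few for a meaningful test.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

expected = benford_expected()

# Amounts from the transaction table earlier in the deck.
amounts = [135, 401, 234, 94, 2459, 83, 51]
observed = Counter(first_digit(a) for a in amounts)

for d in range(1, 10):
    print(d, round(expected[d], 3), observed.get(d, 0) / len(amounts))
```

A real test would compare observed and expected frequencies over a much larger sample, for example with a chi-square statistic.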


Q&A

• Ask questions and get a free BigML T-shirt!
• All demonstrated features are immediately available to all users, including:
  • All subscription plans
  • Virtual Private Cloud (VPC) customers
  • On-premise implementations
• Documentation: https://bigml.com/releases


FEEDBACK

@bigmlcom TWITTER

info@bigml.com

Get Started Today!

RESOURCES Join us for future webinars & hangouts

OFFICE HOURS

Every Wednesday 9:30am Pacific Time
