bigml fall 2015 release

Introducing Association Discovery BigML 2015 Fall Release

Upload: bigml-inc

Post on 13-Feb-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: BigML Fall 2015 Release

Introducing Association Discovery

BigML 2015 Fall Release

Page 2: BigML Fall 2015 Release

BigML Inc, Fall 2015 Release

Today's Webinar

• Speaker:
  • Poul Petersen, CIO
• Moderator:
  • Atakan Cetinsoy, VP Predictive Applications
• Enter questions into the chat box; we'll answer some via text, others at the end of the session
• email: [email protected]
• Twitter: @bigmlcom

Page 3: BigML Fall 2015 Release

Association Discovery

Algorithm “Magnum Opus” from Geoff Webb

Unsupervised Learning: unlabelled data

Learning Task: Find “interesting” relations between variables.

Page 4: BigML Fall 2015 Release

BigML Workflow

SOURCE → DATASET →
• MODEL: Decision Trees, Bagging, Decision Forest
• CLUSTER: K-Means, G-Means
• ANOMALY: Isolation Forest
• ASSOCIATION: Magnum Opus

Page 5: BigML Fall 2015 Release

Unsupervised Learning

date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51

• Clustering: groups similar rows
• Anomaly Detection: flags unusual rows

Page 6: BigML Fall 2015 Release

Association Rules

(Same transaction table as on page 5.)

Rules:

{customer = Bob, account = 3421} → zip = 46140

{class = gas} → amount > 80

Page 7: BigML Fall 2015 Release

Association Rules

(Same transaction table as on page 5.)

Rules:

Antecedent → Consequent

{customer = Bob, account = 3421} → zip = 46140

{class = gas} → amount > 80
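The antecedent/consequent split can be sketched in Python using the seven transactions from the table on page 5. This only illustrates how a rule reads against data; it is not how Magnum Opus searches for rules. The `Rule` class and its `matches` method are hypothetical names, and only the columns the two rules use are included.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]

# The seven transactions from the slide table (only the columns the rules use).
ROWS: List[Row] = [
    {"customer": "Bob", "account": 3421, "class": "clothes", "zip": 46140, "amount": 135},
    {"customer": "Bob", "account": 3421, "class": "food", "zip": 46140, "amount": 401},
    {"customer": "Alice", "account": 2456, "class": "food", "zip": 12222, "amount": 234},
    {"customer": "Sally", "account": 6788, "class": "gas", "zip": 26339, "amount": 94},
    {"customer": "Bob", "account": 3421, "class": "tech", "zip": 21350, "amount": 2459},
    {"customer": "Bob", "account": 3421, "class": "gas", "zip": 46140, "amount": 83},
    {"customer": "Sally", "account": 6788, "class": "food", "zip": 26339, "amount": 51},
]


@dataclass
class Rule:
    """Hypothetical container: an antecedent test plus a consequent test."""
    name: str
    antecedent: Callable[[Row], bool]
    consequent: Callable[[Row], bool]

    def matches(self, rows: List[Row]) -> Tuple[List[Row], List[Row]]:
        """Return (rows matching the antecedent, those that also match the consequent)."""
        hit_a = [r for r in rows if self.antecedent(r)]
        hit_ac = [r for r in hit_a if self.consequent(r)]
        return hit_a, hit_ac


bob_rule = Rule(
    name="{customer = Bob, account = 3421} -> zip = 46140",
    antecedent=lambda r: r["customer"] == "Bob" and r["account"] == 3421,
    consequent=lambda r: r["zip"] == 46140,
)
gas_rule = Rule(
    name="{class = gas} -> amount > 80",
    antecedent=lambda r: r["class"] == "gas",
    consequent=lambda r: r["amount"] > 80,
)

for rule in (bob_rule, gas_rule):
    hit_a, hit_ac = rule.matches(ROWS)
    print(f"{rule.name}: antecedent matches {len(hit_a)} rows, "
          f"consequent also holds in {len(hit_ac)}")
```

On this table the first rule's antecedent matches four rows and its consequent holds in three of them (Bob's tech purchase has a different zip), while the gas rule holds in both rows it matches.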

Page 8: BigML Fall 2015 Release

Use Cases

• Market Basket Analysis
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• Medical risk factors

Page 9: BigML Fall 2015 Release

Market Basket Analysis

• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at checkout

GOAL: Discover “interesting” rules about what store items are typically purchased together.

Page 10: BigML Fall 2015 Release

Association Metrics

Coverage

Percentage of instances which match antecedent “A”.

[Diagram: antecedent set A and consequent set C within all Instances]

Page 11: BigML Fall 2015 Release

Association Metrics

Support

Percentage of instances which match both antecedent “A” and consequent “C”.

[Diagram: antecedent set A and consequent set C within all Instances]

Page 12: BigML Fall 2015 Release

Association Metrics

Confidence

Percentage of instances in the antecedent which also contain the consequent.

Confidence = Support / Coverage

[Diagram: antecedent set A and consequent set C within all Instances]
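A minimal sketch of coverage, support, and confidence, assuming each instance has already been reduced to two flags: whether it matches the antecedent and whether it matches the consequent. The flags below were read off the seven-row table on page 5 for the rule {customer = Bob, account = 3421} → zip = 46140; the function names are hypothetical and this is not BigML's implementation.

```python
from fractions import Fraction

# One flag per row of the page 5 table, in order.
A = [True, True, False, False, True, True, False]   # customer = Bob and account = 3421
C = [True, True, False, False, False, True, False]  # zip = 46140


def coverage(a):
    """Share of instances that match the antecedent."""
    return Fraction(sum(a), len(a))


def support(a, c):
    """Share of instances that match both antecedent and consequent."""
    both = sum(1 for x, y in zip(a, c) if x and y)
    return Fraction(both, len(a))


def confidence(a, c):
    """Support divided by coverage (undefined when nothing matches the antecedent)."""
    return support(a, c) / coverage(a)


print(coverage(A), support(A, C), confidence(A, C))
```

Exact fractions are used so the relationship Confidence = Support / Coverage can be checked without rounding: here 3/7 divided by 4/7 gives 3/4.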

Page 13: BigML Fall 2015 Release

Association Metrics

Confidence

• 0%: A never implies C
• Between 0% and 100%: A sometimes implies C
• 100%: A always implies C

[Diagrams: overlap of A and C within all Instances for each case]

Page 14: BigML Fall 2015 Release

Association Metrics

Lift

Ratio of observed support to support if A and C were statistically independent.

Lift = Support / [p(A) * p(C)] = Confidence / p(C)

[Diagram: overlap of A and C if independent vs. as observed]
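As a worked check that the two forms of the lift formula agree, take the rule {customer = Bob, account = 3421} → zip = 46140 on the seven-row table from page 5. Counting by hand (an illustration, not BigML output): the antecedent matches 4 rows, the consequent matches 3 rows, and both match 3 rows, so p(A) = 4/7, p(C) = 3/7, Support = 3/7 and Confidence = 3/4.

```latex
\[
\text{Lift}
  = \frac{\text{Support}}{p(A)\,p(C)}
  = \frac{3/7}{(4/7)(3/7)}
  = \frac{7}{4}
  = 1.75,
\qquad
\frac{\text{Confidence}}{p(C)}
  = \frac{3/4}{3/7}
  = \frac{7}{4}.
\]
```

Both forms give 1.75: on this small table, A and C occur together 1.75 times as often as they would if they were independent.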

Page 15: BigML Fall 2015 Release

Association Metrics

Lift

• Lift < 1: Negative Correlation
• Lift = 1: No Association
• Lift > 1: Positive Correlation

[Diagrams: overlap of A and C if independent vs. as observed, for each case]

Page 16: BigML Fall 2015 Release

Association Metrics

Leverage

Difference of observed support and support if A and C were statistically independent.

Leverage = Support - [p(A) * p(C)]

[Diagram: overlap of A and C if independent vs. as observed]
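The leverage formula can be sketched directly from four counts. The first call uses counts read by hand from the page 5 table for the rule {customer = Bob, account = 3421} → zip = 46140; the second uses made-up counts chosen so that A and C are exactly independent. The function name and argument names are hypothetical.

```python
from fractions import Fraction


def leverage(n, n_a, n_c, n_ac):
    """Observed support minus the support expected if A and C were independent.

    n    -- total number of instances
    n_a  -- instances matching the antecedent A
    n_c  -- instances matching the consequent C
    n_ac -- instances matching both
    """
    observed = Fraction(n_ac, n)
    expected = Fraction(n_a, n) * Fraction(n_c, n)
    return observed - expected


# Counts from the seven-row table: A matches 4 rows, C matches 3, both match 3.
bob = leverage(n=7, n_a=4, n_c=3, n_ac=3)

# Hypothetical counts where p(A) * p(C) equals the observed support.
independent = leverage(n=100, n_a=50, n_c=40, n_ac=20)

print(bob, float(bob))
print(independent)
```

A positive result means A and C co-occur more often than independence would predict; zero means no association.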

Page 17: BigML Fall 2015 Release

Association Metrics

Leverage (range: -1 … 1)

• Leverage < 0: Negative Correlation
• Leverage = 0: No Association
• Leverage > 0: Positive Correlation

[Diagrams: overlap of A and C if independent vs. as observed, for each case]

Page 18: BigML Fall 2015 Release

Medical Risk

GOAL: Find general rules that indicate diabetes.

• Dataset of diagnostic measurements of 768 patients.
• Each patient labelled True/False for diabetes.

Page 19: BigML Fall 2015 Release

Medical Risk

Association Rule

If plasma glucose > 146 then diabetes = TRUE

Decision Tree

If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44 then diabetes = TRUE
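To make the contrast concrete, the two rules can be written as predicates. This is a sketch only: the field names are hypothetical, and the two patients are made-up values chosen to show where the rules differ, not rows from the 768-patient dataset.

```python
def association_rule(p):
    """The single-condition association rule from the slide."""
    return p["plasma_glucose"] > 146


def tree_path(p):
    """The five-condition decision tree path from the slide."""
    return (
        p["plasma_glucose"] > 155
        and p["bmi"] > 29.32
        and p["diabetes_pedigree"] > 0.32
        and p["insulin"] <= 629
        and p["age"] <= 44
    )


# Hypothetical patients (illustrative values only).
patient_1 = {"plasma_glucose": 150, "bmi": 31.0, "diabetes_pedigree": 0.5,
             "insulin": 120, "age": 50}
patient_2 = {"plasma_glucose": 160, "bmi": 31.0, "diabetes_pedigree": 0.5,
             "insulin": 120, "age": 40}

for name, p in (("patient_1", patient_1), ("patient_2", patient_2)):
    print(name, "rule:", association_rule(p), "tree path:", tree_path(p))
```

Because 155 is greater than 146, any patient who satisfies the tree path also satisfies the association rule, but not the other way round: the rule is the more general statement.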

Page 20: BigML Fall 2015 Release

Partial Dependence Plots

Visualize Ensembles

Page 21: BigML Fall 2015 Release

Flatline Editor

https://github.com/bigmlcom/flatline

Page 22: BigML Fall 2015 Release

BigML Workflow

SOURCE → DATASET →
• MODEL: Decision Trees, Bagging, Decision Forest
• CLUSTER: K-Means, G-Means
• ANOMALY: Isolation Forest
• ASSOCIATION: Magnum Opus
• DATASET: Flatline, Flatline Editor

Page 23: BigML Fall 2015 Release

Logistic Regression

DATASET → LOGISTIC REGRESSION

• Classification algorithm
• Categorical: one-hot encoded
• Text: mapped to token freq
• Bindings support local model
• L1/L2 regularization
• Currently API only

https://bigml.com/developers/logisticregressions
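The two input encodings named in the list can be sketched as follows. This is a generic illustration of one-hot encoding and token frequencies, not BigML's implementation: the function names are hypothetical, and splitting text on whitespace after lowercasing is an assumption made for brevity.

```python
from collections import Counter


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector with one slot per category."""
    return [1 if value == c else 0 for c in categories]


def token_freq(text, vocabulary):
    """Count how often each vocabulary token appears in the text.

    Tokenization here is a plain lowercase whitespace split (an assumption).
    """
    counts = Counter(text.lower().split())
    return [counts[token] for token in vocabulary]


# Categories taken from the "class" column of the page 5 table.
categories = ["clothes", "food", "gas", "tech"]
print(one_hot("gas", categories))

# Hypothetical text field and vocabulary.
print(token_freq("Gas station and more gas", ["gas", "food"]))
```

Each categorical field becomes several numeric inputs and each text field becomes one numeric input per token, which is what lets a logistic regression (a model over numeric features) consume them.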

Page 24: BigML Fall 2015 Release

BigML Workflow

SOURCE → DATASET →
• MODEL: Decision Trees, Bagging, Decision Forest, Logistic Regression
• CLUSTER: K-Means, G-Means
• ANOMALY: Isolation Forest
• ASSOCIATION: Magnum Opus
• DATASET: Flatline, Flatline Editor

Page 25: BigML Fall 2015 Release

BigML Classifiers

                     Advantages                                 Disadvantages
Single Tree          easy to interpret; robust to missing data  overfitting
Ensemble             top performer; robust to missing data      hard to interpret
Logistic Regression  robust to noise; outputs probability       no missing data; hard to interpret

Page 26: BigML Fall 2015 Release

BigML Workflow

SOURCE → DATASET →
• MODEL: Decision Trees, Bagging, Decision Forest, Logistic Regression
• CLUSTER: K-Means, G-Means
• ANOMALY: Isolation Forest
• ASSOCIATION: Magnum Opus
• STATS: Statistical Tests, Correlations
• DATASET: Flatline, Flatline Editor

Page 27: BigML Fall 2015 Release

Correlations

DATASET → CORRELATION

• Pearson Coefficient
• Spearman Coefficient
• Chi-Square
• Cramér's V
• Tschuprow's T
• One-way ANOVA

https://bigml.com/developers/correlations
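As an illustration of the first measure in the list, the Pearson coefficient can be computed from its textbook definition. This sketch does not call the BigML API; the function name and the sample lists are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: a perfectly increasing and a perfectly decreasing relation.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3], [3, 2, 1]))
```

The result ranges from -1 (perfect decreasing linear relation) to 1 (perfect increasing linear relation), with values near 0 indicating no linear relation.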

Page 28: BigML Fall 2015 Release

Statistical Tests

DATASET → STATISTICAL TESTS

• Benford’s Law
• Anderson-Darling
• Jarque-Bera
• Z-score
• Grubbs

https://bigml.com/developers/statisticaltests
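For the first test in the list, Benford's Law predicts that a leading digit d appears with probability log10(1 + 1/d). The sketch below computes that expected distribution and compares it with the leading digits of the seven amounts from the page 5 table. It shows the idea only: how BigML scores the comparison is not covered here, the function names are hypothetical, and seven values are far too few for a meaningful test.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """Leading decimal digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n


def observed_shares(values):
    """Observed share of each leading digit 1..9 in a list of positive integers."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


amounts = [135, 401, 234, 94, 2459, 83, 51]  # "amount" column, page 5 table
expected = benford_expected()
observed = observed_shares(amounts)

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

Large gaps between the observed and expected columns on a sufficiently large field are what a Benford check flags as suspicious.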

Page 29: BigML Fall 2015 Release

BigML Workflow

SOURCE → DATASET →
• MODEL: Decision Trees, Bagging, Decision Forest, Logistic Regression
• CLUSTER: K-Means, G-Means
• ANOMALY: Isolation Forest
• ASSOCIATION: Magnum Opus
• STATS: Statistical Tests, Correlations
• DATASET: Flatline, Flatline Editor

Page 30: BigML Fall 2015 Release

Q&A

• Ask questions and get a Free BigML T-shirt!
• All demonstrated features are immediately available to all users including:
  • All subscription plans
  • Virtual Private Cloud (VPC) customers
  • On-premise implementations.
• Documentation @ https://bigml.com/releases

Page 31: BigML Fall 2015 Release

FEEDBACK

• TWITTER: @bigmlcom
• [email protected]

Get Started Today!

RESOURCES Join us for future webinars & hangouts

OFFICE HOURS

Every Wednesday 9:30am Pacific Time