frequent pattern mining - krishna sridhar, feb 2016

Pattern Mining: Getting the most out of your log data. Krishna Sridhar Staff Data Scientist, Dato Inc. krishna_srd

Upload: seattle-daml-meetup

Post on 24-Jan-2017


TRANSCRIPT

Page 1: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Mining: Getting the most out of your log data.

Krishna Sridhar
Staff Data Scientist, Dato Inc.
@krishna_srd

Page 2: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

• Background
- Machine Learning (ML) Research.
- Ph.D. Numerical Optimization @Wisconsin

• Now
- Build ML tools for data scientists & developers @Dato.
- Help deploy ML algorithms.

@krishna_srd, @DatoInc

About Me!

Page 3: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

45+ and growing fast!

About Us!

Page 4: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

+ =

Questions?
• (Now) I love questions. Feel free to interrupt for questions!
• (Later) Email me [email protected].

DAML Talks!

Page 5: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

About you?

Page 6: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Creating a model pipeline

Ingest Transform Model Deploy Unstructured Data

exploration

data

modeling

Data Science Workflow

Ingest Transform Model Deploy

Page 7: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Log Journey

Lots of data

Insights Profits

Page 8: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Log Mining: Pattern Mining

Page 9: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Logs are everywhere!

Page 10: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Machine Learning in Logs

Source: Mining Your Logs - Gaining Insight Through Visualization

Page 11: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Coffee shop

Coffee Shops Menu

Page 12: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Receipts

Coffee Shops Menu

Page 13: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Coffee Store Logs

Page 14: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Frequent Pattern Mining

What sets of items were bought together?

Page 15: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Real Applications

Page 16: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Real Applications

Page 17: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Real Applications

Page 18: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Log Mining: Rule Mining

Page 19: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Can we recommend items?

Rule Mining
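One common way to turn frequent patterns into recommendations is to score a rule "antecedent → consequent" by its confidence: the fraction of receipts containing the antecedent that also contain the consequent. The slides do not name a scoring measure, so confidence is an assumption here, and the receipts and item names below are hypothetical.

```python
def support(pattern, receipts):
    """Number of receipts that contain every item in `pattern`."""
    pattern = set(pattern)
    return sum(1 for receipt in receipts if pattern <= set(receipt))


def confidence(antecedent, consequent, receipts):
    """Fraction of receipts with `antecedent` that also contain `consequent`."""
    base = support(antecedent, receipts)
    if base == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), receipts)
    return both / base


# Hypothetical coffee-shop receipts, for illustration only.
receipts = [
    {"latte", "muffin"},
    {"latte", "muffin", "cookie"},
    {"latte"},
    {"tea", "cookie"},
]

# How often does a latte purchase come with a muffin?
latte_to_muffin = confidence({"latte"}, {"muffin"}, receipts)
```

A rule with high confidence and enough support is a candidate recommendation: if the basket contains the antecedent, suggest the consequent.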

Page 20: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Real Applications

Page 21: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Log Mining: Feature Extraction

Page 22: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Feature Extraction

0 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 1 1 1 0

Receipt Space → Features in Menu Space

ML
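A minimal sketch of the mapping from receipt space to menu space, assuming the binary rows on the slide are one indicator per menu item. The menu and receipts are hypothetical.

```python
def to_menu_space(receipt, menu):
    """Encode one receipt as a 0/1 vector with one slot per menu item."""
    return [1 if item in receipt else 0 for item in menu]


# Hypothetical menu and receipts, for illustration only.
menu = ["latte", "mocha", "tea", "muffin", "cookie", "bagel"]
receipts = [
    {"latte", "muffin"},
    {"tea", "cookie", "bagel"},
]

# One fixed-length row per receipt, ready to feed to an ML model.
features = [to_menu_space(receipt, menu) for receipt in receipts]
```

Every receipt becomes a vector of the same length, whatever was bought, which is what most ML models expect as input.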

Page 23: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

3 Useful Data Mining Tasks

Rule Mining, Pattern Mining, Feature Extraction

Page 24: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Demo

Page 25: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Mining: Explained

Page 26: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Formulating Pattern Mining

N distinct items → 2^N itemsets
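To see why the search space blows up: every subset of the menu is a possible itemset, so N distinct items give 2^N itemsets (counting the empty set). A small sketch with a hypothetical four-item menu:

```python
from itertools import combinations


def all_itemsets(items):
    """Yield every subset of `items`, including the empty set."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


menu = {"A", "B", "C", "D"}  # hypothetical item ids
itemsets = list(all_itemsets(menu))

# 4 items -> 2 ** 4 = 16 itemsets; 30 items would already exceed a billion.
assert len(itemsets) == 2 ** len(menu)
```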

Page 27: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Formulating Pattern Mining

Find the top K most frequent sets of length at least L that occur at least M times.

Page 28: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Formulating Pattern Mining

Find the top K most frequent sets of length at least L that occur at least M times.

- max_patterns (K)
- min_length (L)
- min_support (M)
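The formulation can be made concrete with a brute-force sketch. This is not how the Dato toolkit implements it; the function name `top_k_patterns` and the toy receipts are assumptions, and the approach enumerates every subset of every receipt, so it is only usable on tiny data.

```python
from collections import Counter
from itertools import combinations


def top_k_patterns(transactions, max_patterns, min_length, min_support):
    """Brute force: count every subset of every transaction, then filter."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(min_length, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    # Keep patterns that occur at least min_support times.
    frequent = [(p, n) for p, n in counts.items() if n >= min_support]
    # Most frequent first; ties broken by the pattern itself for determinism.
    frequent.sort(key=lambda pn: (-pn[1], pn[0]))
    return frequent[:max_patterns]


receipts = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
result = top_k_patterns(receipts, max_patterns=3, min_length=2, min_support=4)
```

The algorithms discussed later in the talk reach the same answer without enumerating every subset.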

Page 29: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Mining

N distinct items → 2^N itemsets

Page 30: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Mining: Principles

Page 31: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Mining: Principles

Page 32: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 1: What is frequent?

A pattern is frequent if it occurs at least M times.

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

Is the pattern {C, D} frequent?

M = 4

Patterns

Page 33: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 1: What is frequent?

A pattern is frequent if it occurs at least M times.

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

{C, D} occurs 5 times

M = 4

Patterns

Page 34: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 1: What is frequent?

A pattern is frequent if it occurs at least M times.

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

M = 4

Patterns

{C, D} occurs 5 times

Frequent!

Page 35: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 1: What is frequent?

A pattern is frequent if it occurs at least M times.

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

Is the pattern {A, D} frequent?

M = 4

Patterns

Page 36: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 1: What is frequent?

A pattern is frequent if it occurs at least M times.

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

M = 4

Patterns

{A, D} occurs 3 times.

Not frequent.

Page 37: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 1: What is frequent?

A pattern is frequent if it occurs at least M times.

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

{C, D}: 5 is frequent

M = 4

{A, D}: 3 is not frequent

min_support
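The frequency test on these slides can be sketched in a few lines. The function names are hypothetical; the six transactions and M = 4 are the ones shown above.

```python
def support(pattern, transactions):
    """Count the transactions that contain every item of `pattern`."""
    pattern = set(pattern)
    return sum(1 for t in transactions if pattern <= set(t))


def is_frequent(pattern, transactions, min_support):
    """A pattern is frequent if it occurs at least min_support times."""
    return support(pattern, transactions) >= min_support


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
M = 4

cd_count = support({"C", "D"}, transactions)  # 5, so frequent
ad_count = support({"A", "D"}, transactions)  # 3, so not frequent
```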

Page 38: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 2: Apriori principle

A pattern is frequent only if all of its subsets are frequent.

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

{B, C, D} : 4 is frequent therefore {C, D} : 5 is frequent

M = 4

Page 39: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 2: Apriori principle

A pattern is frequent only if all of its subsets are frequent.

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

{B, C, D} : 4 is frequent therefore {C, D} : 5 is frequent

M = 4

Why? {C, D} must occur at least as many times as {B, C, D}.

Page 40: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 2: Apriori principle (Contrapositive)

If a pattern is not frequent then all supersets are not frequent

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

M = 4

{A} : 3 is not frequent therefore {A, D} : 3 is not frequent

Page 41: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Principle 2: Apriori principle (Contrapositive)

If a pattern is not frequent then all supersets are not frequent

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

M = 4

{A} : 3 is not frequent therefore {A, D} : 3 is not frequent

Why? {A, D} cannot occur more times than {A}.
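Both directions of the Apriori principle come from one fact: removing an item from a pattern can never lower its count. A small check on the six transactions above (the helper name `support` is an assumption) confirms this for every itemset:

```python
from itertools import combinations

transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]


def support(pattern):
    """Count the transactions that contain every item of `pattern`."""
    pattern = set(pattern)
    return sum(1 for t in transactions if pattern <= t)


items = sorted(set().union(*transactions))
violations = []
for size in range(2, len(items) + 1):
    for combo in combinations(items, size):
        for dropped in combo:
            subset = set(combo) - {dropped}
            # A superset may never occur more often than its subset.
            if support(combo) > support(subset):
                violations.append((combo, subset))

assert violations == []
```

Because {A} occurs only 3 times, no pattern containing A can reach M = 4, so none of them needs to be counted.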

Page 42: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Two Main Algorithms

• Candidate Generation
- Apriori
- Eclat

• Pattern Growth
- FP-Growth
- TopK FP-Growth

Page 43: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

Page 44: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Lots of Generalizations

Source: http://www.philippe-fournier-viger.com/spmf/

Page 45: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

Two phases
1. Candidate generation.
2. Candidate filtering.

Exploit Apriori Principle!

Page 46: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?

{A} : ? {B} : ? {C} : ? {D} : ?

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

Page 47: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?

{A} : ? {B} : ? {C} : ? {D} : ?

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

Page 48: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?

{A} : 3 {B} : 4 {C} : 5 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

Page 49: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?

{A} : 3 {B} : 4 {C} : 5 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

Not frequent

Page 50: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?

{A} : 3 {B} : 4 {C} : 5 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

No need to explore!

Page 51: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?

{A} : 3 {B} : 4 {C} : 5 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

Page 52: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

{AB} : ? {AC} : ? {AD} : ? {BC} : 4 {BD} : 4 {CD} : 5

{A} : 3 {B} : 4 {C} : 5 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

Page 53: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

{AB} : ? {AC} : ? {AD} : ? {BC} : 4 {BD} : 4 {CD} : 5

{A} : 3 {B} : 4 {C} : 5 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{B, C, D}

{A, C, D}

{A, B, C, D}

{A, D}

{B, C, D}

{B, C, D}

Page 54: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Two Main Algorithms

• Candidate Generation
- Apriori
- Eclat

• Pattern Growth
- FP-Growth
- TopK FP-Growth

Page 55: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Candidate Generation

Two phases
1. Candidate generation: Enumerate all subsets.
2. Candidate filtering: Eliminate infrequent subsets.

Exploit Apriori Principle!
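The two phases can be sketched as a level-wise loop in the style of Apriori. This is a teaching sketch, not a tuned implementation; the function name `apriori` is an assumption, and the data is the six-transaction example from the walkthrough with M = 4.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every frequent itemset, level by level."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    singles = count({frozenset([item]) for item in items})
    frequent = {c: n for c, n in singles.items() if n >= min_support}
    result = dict(frequent)
    size = 2
    while frequent:
        prev = set(frequent)
        # Phase 1, candidate generation: join frequent itemsets of size - 1.
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Apriori principle: drop candidates with any infrequent subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, size - 1))
        }
        # Phase 2, candidate filtering: count and keep the frequent ones.
        counted = count(candidates)
        frequent = {c: n for c, n in counted.items() if n >= min_support}
        result.update(frequent)
        size += 1
    return result


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
frequent_itemsets = apriori(transactions, min_support=4)
```

On this data {A} is dropped at the first level, so {AB}, {AC}, {AD} and their supersets are never counted, which is the "No need to explore!" step in the walkthrough.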

Page 56: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth

Page 57: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth

Two phases
1. Candidate filtering.
2. Conditional database constructions.

Avoid full scans over the data & large candidate sets!

Page 58: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth - Depth First

{B, C, D}

{A, C, D}

{B, D}

{A, C, D}

{B, C, D}

{A, B, D}

{AB} : 1 {AC} : 2 {AD} : 3 {BD} : 4 {CD} : 4

{A} : 3 {B} : 4 {C} : 4 {D} : 6

{ } : 6

{ABC} : 0 {ABD} : 1 {ACD} : 2 {BCD} : 2

{BC} : 2

Page 59: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth - Preprocessing

{B, C, D}

{A, C, D}

{B, D}

{A, C, D}

{B, C, D}

{A, B, D}

{A} : 3 {B} : 4 {C} : 4 {D} : 6

{ } : 6

Preprocessing

First, count the number of times each item (singleton) occurs.

Page 60: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth - Depth First

{B, C, D}

{A, C, D}

{B, D}

{A, C, D}

{B, C, D}

{A, B, D}

{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?

{A} : 3 {B} : 4 {C} : 4 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{BC} : ?

Page 61: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth - Depth First

{B, C, D}

{A, C, D}

{B, D}

{A, C, D}

{B, C, D}

{A, B, D}

{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?

{A} : 3 {B} : 4 {C} : 4 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{BC} : ?

Page 62: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth - Depth First

{B, C, D}

{A, C, D}

{B, D}

{A, C, D}

{B, C, D}

{A, B, D}

{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?

{A} : 3 {B} : 4 {C} : 4 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{BC} : ?

Page 63: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth - Depth First

{B, C, D}

{A, C, D}

{B, D}

{A, C, D}

{B, C, D}

{A, B, D}

{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?

{A} : 3 {B} : 4 {C} : 4 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{BC} : ?

No need to explore!

Page 64: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth - Depth First

{B, C, D}

{A, C, D}

{B, D}

{A, C, D}

{B, C, D}

{A, B, D}

{AB} : X {AC} : ? {AD} : ? {BD} : X {CD} : ?

{A} : 3 {B} : 4 {C} : 4 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{BC} : X

Explore depth first on {B}

Page 65: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth

{B} : 4

{ } : 6

Conditional Database Construction

DB{} DB{B}

{B, C, D}

{A, C, D}

{B, D}

{A, C, D}

{B, C, D}

{A, B, D}

{C, D}

{D}

{C, D}

{D}

Page 66: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth

{B} : 4

{ } : 6

Candidate Filtering

DB{B}

{C, D}

{D}

{C, D}

{D}

{D} : 4

{C} : 2

DB{}

{B, C, D}

{A, C, D}

{B, D}

{A, C, D}

{B, C, D}

{A, B, D}

DB{B}

Add {BD} as frequent
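The step from DB{} to DB{B} and the filtering inside DB{B} can be sketched as follows. The helper names are hypothetical, and one assumption is made to match the slide: items that are infrequent in the full database (here A) are dropped while the conditional database is built.

```python
from collections import Counter


def item_counts(db):
    """Count how many transactions each item appears in."""
    return Counter(item for t in db for item in t)


def conditional_database(db, item, frequent_items):
    """Keep transactions containing `item`; strip it and infrequent items."""
    return [(set(t) - {item}) & frequent_items for t in db if item in t]


db = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"},
    {"A", "C", "D"}, {"B", "C", "D"}, {"A", "B", "D"},
]
M = 4

# Preprocessing: singleton counts decide which items survive at all.
frequent_items = {i for i, n in item_counts(db).items() if n >= M}

# Conditional database construction for B.
db_b = conditional_database(db, "B", frequent_items)

# Candidate filtering inside DB{B}: D occurs 4 times, C only 2.
local_counts = item_counts(db_b)
extensions = {i for i, n in local_counts.items() if n >= M}
```

Only D survives the filter, so {B, D} is added as frequent and {B, C} is discarded without ever scanning the full database again.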

Page 67: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth - Depth First

{C, D}

{D}

{C, D}

{D}

{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?

{A} : 3 {B} : 4 {C} : 4 {D} : 6

{ } : 6

{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?

{BC} : 2

Explore depth first on {BD}

Page 68: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Pattern Growth - Depth First

{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?

{A} : 3 {B} : 4 {C} : 4 {D} : 6

{ } : 6

{ABC} : ? {ABD} : X {ACD} : ? {BCD} : X

{BC} : 2
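The whole depth-first walk can be written as a short recursion over conditional databases. This sketch uses plain lists of sets rather than an FP-tree, so it shows the search order but not the compression that FP-Growth adds; the function name `pattern_growth` is an assumption.

```python
from collections import Counter


def pattern_growth(db, min_support, prefix=frozenset()):
    """Depth-first mining: returns {itemset: count} for frequent itemsets."""
    # Candidate filtering: count items in the current (conditional) database.
    counts = Counter(item for t in db for item in t)
    frequent = sorted(i for i, n in counts.items() if n >= min_support)
    patterns = {}
    for idx, item in enumerate(frequent):
        pattern = prefix | {item}
        patterns[pattern] = counts[item]
        # Conditional database: transactions with `item`, restricted to
        # items later in the order so each pattern is produced only once.
        later = set(frequent[idx + 1:])
        projected = [set(t) & later for t in db if item in t]
        patterns.update(pattern_growth(projected, min_support, pattern))
    return patterns


db = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"},
    {"A", "C", "D"}, {"B", "C", "D"}, {"A", "B", "D"},
]
mined = pattern_growth(db, min_support=4)
```

With M = 4 the recursion finds {B}, {C}, {D}, {B, D} and {C, D}, and never counts any itemset containing A or the pair {B, C} beyond the single filtering step that rejects it.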

Page 69: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Compare & Contrast

• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data

• Pattern Growth
+ Fewer passes over the data
+ Space efficient.

Page 70: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Compare & Contrast

• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data

• Pattern Growth
+ Fewer passes over the data
+ Space efficient.

Better choice

Page 71: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Some cool ideas

Page 72: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

FP-Tree Compression

Figures From Florian Verhein’s Slides on FP-Growth

Page 73: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

FP-Growth Algorithm

Figures From Florian Verhein’s Slides on FP-Growth

Two phases
1. Candidate filtering.
2. Conditional database constructions.

Page 74: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

TopK FP-Growth Algorithm

Similar to FP-Growth
1. Dynamically raise min_support.
2. Estimates of min_support greatly help.
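One way to picture "dynamically raise min_support": keep the best K patterns found so far in a min-heap, and once the heap is full, use its smallest count as the new threshold. The sketch below applies that idea to a plain recursive pattern-growth search without an FP-tree. It is an illustration under those assumptions, not the toolkit's implementation, and `top_k_pattern_growth` is a hypothetical name.

```python
import heapq
from collections import Counter


def top_k_pattern_growth(db, k, min_support=1):
    """Return the k best (count, pattern) pairs, raising the threshold as we go."""
    heap = []  # min-heap of (count, pattern); heap[0] is the weakest kept
    threshold = min_support

    def grow(db, prefix):
        nonlocal threshold
        counts = Counter(item for t in db for item in t)
        # Visit high-count items first so the threshold rises early.
        frequent = sorted(
            (i for i, n in counts.items() if n >= threshold),
            key=lambda i: (-counts[i], i),
        )
        for idx, item in enumerate(frequent):
            if counts[item] < threshold:
                continue  # threshold rose since this level was filtered
            pattern = prefix + (item,)
            heapq.heappush(heap, (counts[item], tuple(sorted(pattern))))
            if len(heap) > k:
                heapq.heappop(heap)
            if len(heap) == k:
                # Nothing below the current k-th best can make the final list.
                threshold = max(threshold, heap[0][0])
            later = set(frequent[idx + 1:])
            projected = [set(t) & later for t in db if item in t]
            grow(projected, pattern)

    grow([set(t) for t in db], ())
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))


db = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"},
    {"A", "C", "D"}, {"B", "C", "D"}, {"A", "B", "D"},
]
top3 = top_k_pattern_growth(db, k=3)
```

Starting from a good initial estimate of min_support instead of 1 prunes even the first few branches, which is the point of the second bullet. When several patterns tie at the K-th count, which of them are kept is arbitrary in this sketch.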

Page 75: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Future Work

Page 76: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Distributed FP-Growth

Partition database on item-ids.

Database

Page 77: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Bags + Sequences

× 2

Itemset: {Item}

Bags: {Item: quantity}

Sequences : (item)
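The three representations differ only in what they remember about an order. A small sketch using Python's built-in types, with a hypothetical order:

```python
from collections import Counter

# Hypothetical order, in the sequence the items were rung up.
order = ["latte", "muffin", "latte"]

as_itemset = frozenset(order)   # which items: {latte, muffin}
as_bag = Counter(order)         # items with quantities: latte x 2, muffin x 1
as_sequence = tuple(order)      # items in order, repeats kept
```

Itemsets are what the algorithms in this talk mine; bags and sequences keep more information and need generalized versions of them.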

Page 78: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Demo: Model built, now what?

Page 79: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

Summary

Log Data Mining ≠ Rocket Science

• FP-Growth for finding frequent patterns.
• Find rules from patterns to make predictions.
• Extract features for useful ML in pattern space.

Page 80: Frequent Pattern Mining - Krishna Sridhar, Feb 2016

SELECT questions FROM audience
WHERE difficulty == “Easy”

Thanks!