Declarative Sequential Pattern Mining. Prof. Dr. Rainer Gemulla, Data and Web Science Group, Universität Mannheim. BBDC Symposium, November 8, 2016.


Declarative Sequential Pattern Mining

Prof. Dr. Rainer Gemulla

Data and Web Science Group, Universität Mannheim

BBDC Symposium, November 8, 2016


Data and Web Science Group

• Research group at University of Mannheim, Germany
  – 5 professors, 9 postdocs, 18 Ph.D. students
• European Network of National Big Data Centers of Excellence
• Research focus: Understand and leverage heterogeneous data in order to improve applications using knowledge
• Contribution to community via open data and open software

R. Gemulla, Universitat Mannheim Declarative Sequential Pattern Mining 2/24


Outline

1. Sequential Pattern Mining

2. Scalability

3. Usability

4. Summary


Before and after

Anni wants to watch a movie.

Anni loves LOTR1. But she does not want to see it. She had seen LOTR2 last week!

[Figure: movie streaming site showing a "Recommended for you" panel]

Let’s look at some data

• Data from Netflix' online movie-streaming platform
  – 500k users, 18k movies, 100M ratings with timestamps
• 125k users rated both LOTR1 and LOTR2
• In which order?
  – [Figure: two pairs of movie posters, each joined by an arrow] 105k users in one order, 20k users in the other
• Order matters!
  – How to discover patterns in sequential data?
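The ordering question above can be sketched in a few lines. This is only an illustration, assuming ratings are available as (user, movie, timestamp) tuples; the function name `count_orders` and the toy data are hypothetical and are not taken from the Netflix dataset.

```python
from collections import Counter

def count_orders(ratings, first, second):
    """Count users who rated `first` before `second`, and the reverse."""
    # Earliest timestamp at which each user rated each of the two movies.
    seen = {}
    for user, movie, ts in ratings:
        if movie in (first, second):
            key = (user, movie)
            seen[key] = min(ts, seen.get(key, ts))
    orders = Counter()
    for (user, movie), ts in seen.items():
        # Visit each user once (via `first`) and only if both movies were rated.
        if movie != first or (user, second) not in seen:
            continue
        other = seen[(user, second)]
        if ts < other:
            orders[(first, second)] += 1
        elif ts > other:
            orders[(second, first)] += 1
    return orders

# Toy data (hypothetical): user, movie, timestamp
ratings = [
    ("u1", "LOTR1", 1), ("u1", "LOTR2", 5),
    ("u2", "LOTR1", 2), ("u2", "LOTR2", 3),
    ("u3", "LOTR2", 1), ("u3", "LOTR1", 9),
    ("u4", "LOTR1", 4),
]
orders = count_orders(ratings, "LOTR1", "LOTR2")
```

Users who rated only one of the two movies are ignored, which mirrors the restriction to users who rated both.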


Sequential pattern mining

• Sequential pattern mining is a fundamental task in data mining
  – Data modeled as collection of sequences of items or events
  – Often items are arranged in a hierarchy
  – We seek useful sequential patterns
• E.g., market-basket data
  – Sequence = purchases of a customer over time
  – Item = product (or set of products) + product hierarchy
  – Example pattern: DSLR Camera → Tripod → Flash
• E.g., natural-language text
  – Sequence = sentence or document
  – Item = word + syntactic/semantic hierarchy
  – Example pattern: person was born in location
• E.g., amino acid sequences
  – Sequence = protein
  – Item = amino acid
  – Example pattern: S L R
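As a rough sketch of the data model above: a database is a collection of item sequences, and one common notion of frequency (assumed here) counts the sequences that contain a pattern as a subsequence, with gaps allowed. The helper names `occurs_in` and `support` and the toy purchases are hypothetical.

```python
def occurs_in(pattern, sequence):
    """True if `pattern` is a subsequence of `sequence` (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must appear in order.
    return all(item in it for item in pattern)

def support(pattern, database):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(occurs_in(pattern, seq) for seq in database)

# Toy market-basket data (hypothetical): one purchase sequence per customer.
database = [
    ["DSLR camera", "memory card", "tripod", "flash"],
    ["DSLR camera", "tripod", "lens", "flash"],
    ["tripod", "DSLR camera", "flash"],
]
pattern = ["DSLR camera", "tripod", "flash"]
pattern_support = support(pattern, database)
```

The third customer bought the same three products but in a different order, so that sequence does not count towards the pattern's support.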


What constitutes a good pattern?

• Extensively studied
  – Interesting patterns should be new, surprising, understandable, actionable
  – No random patterns, common knowledge, redundancy
  – Details application-specific
• Many different variants, many algorithms
  – Constraints: length, positional/temporal, hierarchy, regex, ...
  – Scoring: frequency, utility, information gain, significance, ...
  – Pattern sets: all, top-k, maximality, closedness, MDL, ...
• Our research focuses on unifying sequential pattern mining
  – Study general properties instead of special cases
  – Avoid need for customized mining algorithms


DESQ

• DESQ = system for declarative sequential pattern mining [ICDM16]
  – (Will be) open source
  – Work in progress
• Key design goals are
  1. Usefulness
     – Can be tailored to application
     – Flexible constraints
     – Flexible notions of interestingness
  2. Usability
     – Describe pattern mining task in an intuitive, declarative way
     – Hide technical and implementation details
  3. Efficiency
     – Fast
     – Scalable [SIGMOD15, TODS15, SIGMOD13]
     – Competitive with specialized miners


Special case: n-gram mining

An n-gram is a sequence of n consecutive words

• Extensively used in text mining and natural-language processing
• Web-scale n-gram models published by Google and Microsoft
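A minimal sketch of the definition above: extracting and counting n-grams from tokenized sentences. The helper name `ngrams` and the two toy sentences are assumptions for illustration.

```python
from collections import Counter

def ngrams(words, n):
    """All sequences of n consecutive words, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy corpus (hypothetical), already split into words.
sentences = [
    "have a good day".split(),
    "what a good day".split(),
]
# Count every 3-gram occurrence across the corpus.
counts = Counter(g for s in sentences for g in ngrams(s, 3))
```

Sentences shorter than n simply yield no n-grams.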


MG-FSM

• Distributed framework for scalable frequent sequence mining
• Originally built on top of MapReduce

Key idea

• Partition data into smaller overlapping partitions using item-based partitioning
  – One partition for each frequent item
  – Inexpensive rewrites
• Mine each partition using any FSM algorithm
• Combine results
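The three steps above can be sketched on a single machine as follows. This is a simplified illustration, not the MG-FSM implementation: the item order (more frequent items first), the rewrite (blanking out items that are irrelevant for a partition), and the local miner (contiguous n-grams up to `max_len`) are all assumptions chosen to keep the example short, and every function name is hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(seq, max_len):
    """Distinct contiguous subsequences of `seq` with 1..max_len items."""
    return {tuple(seq[i:j])
            for i in range(len(seq))
            for j in range(i + 1, min(i + max_len, len(seq)) + 1)}

def mine_partitioned(db, sigma, max_len):
    # Frequent items, counted once per input sequence.
    df = Counter(item for seq in db for item in set(seq))
    freq = {i: c for i, c in df.items() if c >= sigma}
    # Assumed total order: more frequent items get a smaller rank.
    rank = {item: r for r, item in
            enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}

    # Step 1: one partition per frequent item (the "pivot").
    partitions = defaultdict(list)
    for seq in db:
        for pivot in {i for i in seq if i in rank}:
            # Rewrite: blank out infrequent items and items ranked after the pivot.
            rewritten = [i if i in rank and rank[i] <= rank[pivot] else None
                         for i in seq]
            partitions[pivot].append(rewritten)

    # Step 2: mine each partition independently, keeping only patterns
    # that contain the pivot (so no pattern is produced twice).
    result = {}
    for pivot, part in partitions.items():
        counts = Counter(g for seq in part for g in ngrams(seq, max_len)
                         if None not in g and pivot in g)
        # Step 3: combine the disjoint per-partition results.
        result.update({g: c for g, c in counts.items() if c >= sigma})
    return result

db = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "b", "c"]]
patterns = mine_partitioned(db, sigma=2, max_len=2)
```

Because each pattern is reported only by the partition of its lowest-ranked (least frequent) item, the partitions can be mined in parallel and their outputs merged without deduplication.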

[Figure: item-based partitioning. The database D is split into partitions D1, D2, ..., Dn, one per item (a, b, ..., n); an FSM algorithm mines each partition to produce F1, F2, ..., Fn, which are combined into the result F.]

How fast is it? (10-node Hadoop cluster)

5-grams from New York Times data (50M sentences)

• Naive: 21 min
• Suffix-σ n-gram miner: 217 s
• MG-FSM: 103 s

Gapped 5-grams from New York Times data (50M sentences)

• Naive: 3.7 h
• Suffix-σ n-gram miner: N/A
• MG-FSM: 137 s

5-grams from ClueWeb data (1B sentences)

• MG-FSM: 20 min


MG-FSM also mines...

• Maximal and closed sequences
  – Compact, smaller output (e.g., factor 3 on NYT corpus)
  – No or minimal information loss
• Event sequences
  – Input sequences of time-annotated events
  – E.g., movie views, purchase transactions, session logs
  – Supports temporal constraints
• Hierarchies
  – Canon EOS 70D → DSLR camera → camera → electronics
  – E.g.: some DSLR camera, some photography book, some flash
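The hierarchy bullet can be illustrated with a small sketch, assuming the hierarchy is a tree stored as a child-to-parent map. Only the Canon EOS 70D chain comes from the slide; the other products, the `PARENT` map, and the function names are hypothetical.

```python
# Child -> parent links of an assumed product hierarchy.
PARENT = {
    "Canon EOS 70D": "DSLR camera",
    "DSLR camera": "camera",
    "camera": "electronics",
    "Photo Guide A": "photography book",  # hypothetical product
    "Flash X": "flash",                   # hypothetical product
}

def generalizations(item):
    """The item itself followed by all of its ancestors."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def matches(pattern, sequence):
    """True if the pattern items match successive sequence items
    (gaps allowed), where a pattern item matches a sequence item
    if it equals that item or one of its ancestors."""
    it = iter(sequence)
    return all(any(p in generalizations(x) for x in it) for p in pattern)

purchases = ["Canon EOS 70D", "Photo Guide A", "SD card", "Flash X"]
pattern = ["DSLR camera", "photography book", "flash"]
found = matches(pattern, purchases)
```

Here the generalized pattern matches even though none of its items literally appears in the purchase sequence.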


Going declarative

• If we simply mined all frequent n-grams, we might
  1. Produce many uninteresting patterns (low frequency threshold)
  2. Miss out on interesting patterns (high frequency threshold)
• DESQ allows data analysts to focus on what they consider relevant
  – Supports all traditional constraints (length, gap, hierarchy, ...)
  – Supports customized constraints that go beyond traditional constraints
• Based on a declarative pattern expression language
  – Describe relevant patterns, let DESQ take care of mining them
  – Syntax similar to regular expressions
  – Adds capture groups and hierarchies


Some examples for text mining

1. Noun modified by adjective or noun
   Ex: big country (110), green tea (337), research scientist (473)
   PE: ([ADJ|NOUN] NOUN)

2. Relational phrase between entities
   Ex: lives in (847), is being advised by (15), has coached (10)
   PE: ENTITY (VERB+ NOUN+? PREP?) ENTITY

3. Typed relational phrases
   Ex: ORG headed by ENTITY (275), PERS born in LOC (481)
   PE: (ENTITY↑ VERB+ NOUN+? PREP? ENTITY↑)

4. Google n-gram viewer data
   Ex: a good day, a ADJ day, DET ADJ NOUN, have a good day
   PE: (.↑) (.↑)? (.↑)? | (.....?)
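To give a feel for what the first pattern expression describes, here is a plain-Python emulation. It does not use DESQ or its syntax; it only assumes that sentences are lists of (word, part-of-speech tag) pairs and that the parentheses mark what is captured and counted. The function name, the tags on the toy sentences, and the threshold `sigma` are hypothetical.

```python
from collections import Counter

def adj_or_noun_then_noun(tagged):
    """Yield captured word pairs for consecutive (ADJ|NOUN, NOUN) tokens."""
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 in ("ADJ", "NOUN") and t2 == "NOUN":
            yield (w1, w2)

# Toy tagged sentences (hypothetical).
sentences = [
    [("she", "PRON"), ("drinks", "VERB"), ("green", "ADJ"), ("tea", "NOUN")],
    [("green", "ADJ"), ("tea", "NOUN"), ("is", "VERB"), ("popular", "ADJ")],
    [("a", "DET"), ("research", "NOUN"), ("scientist", "NOUN")],
]
sigma = 2  # minimum number of sentences a captured pattern must occur in

# Count each captured pair at most once per sentence, then keep frequent ones.
counts = Counter(p for s in sentences for p in set(adj_or_noun_then_noun(s)))
frequent = {p: c for p, c in counts.items() if c >= sigma}
```

A declarative system spares the analyst from writing such a matcher by hand for every new constraint.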


Pattern mining

• Under the hood, DESQ translates pattern expressions to finite state transducers (FST)
  – FST outputs all patterns that occur in a given input sequence
• Naive approach ("WordCount")
  – For every input sequence, simulate FST to obtain all outputs
  – Count how often each output occurred, return the frequent ones
  – Simple, inefficient
• DesqCount ("WordCount" with frequency pruning)
  – Lemma: frequent patterns cannot contain infrequent items
  – As naive, but ignore FST transitions that produce infrequent items
  – Simple, more efficient but still inefficient
• DesqDfs (depth-first search)
  – Lemma: partial outputs are at least as frequent as the respective final outputs
  – Apply a variant of prefix-growth to grow patterns incrementally and prune early
  – Not that simple, efficient
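The pruning idea behind the depth-first approach can be sketched with plain prefix-growth. This is not DesqDfs: there is no FST and no pattern expression, only unconstrained subsequence patterns up to a maximum length, which is an assumption made to keep the example short. The function name `prefix_growth` is hypothetical.

```python
from collections import defaultdict

def prefix_growth(db, sigma, max_len):
    """Frequent subsequence patterns (gaps allowed) with their support."""
    frequent = {}

    def grow(prefix, projections):
        # projections: (sequence id, offset) pairs; the suffix starting at
        # `offset` is what remains after the first match of `prefix`.
        if len(prefix) == max_len:
            return
        # For every item, the first position at which it occurs per suffix.
        first_pos = defaultdict(dict)
        for sid, start in projections:
            seq = db[sid]
            for pos in range(start, len(seq)):
                first_pos[seq[pos]].setdefault(sid, pos)
        for item, occ in sorted(first_pos.items()):
            if len(occ) < sigma:
                continue  # prune: no extension of this prefix can be frequent
            pattern = prefix + (item,)
            frequent[pattern] = len(occ)
            grow(pattern, [(sid, pos + 1) for sid, pos in occ.items()])

    grow((), [(sid, 0) for sid in range(len(db))])
    return frequent

db = [list("abc"), list("ac"), list("bc"), list("abc")]
patterns = prefix_growth(db, sigma=3, max_len=2)
```

In this toy database the prefix (a, b) occurs in only two sequences, so it is discarded without ever being extended, which is the early pruning the lemma permits.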


Performance comparison (traditional constraints)

[Figure: bar chart of total time in seconds (axis ticks at 10, 100, 1000) for the settings σ, γ, λ = 100,0,3; 100,0,5; 100,1,5; 100,2,5; 1K,0,5(+H). Left: cSPADE, center: prefix-growth, right: DesqDfs. Three bars are marked >12Hr.]

DESQ is competitive with state-of-the-art miners for traditional constraints.


Performance comparison (new constraints)

[Figure: bar chart of total time in seconds (axis ticks at 10, 100, 1000, 10000) per pattern expression (σ): N1(10), N2(100), N3(10), N4(1K), N5(1K), A1(500), A2(100), A3(100), A4(100). Series: Naive+cFST, DESQ-COUNT, DESQ-DFS.]

DesqDfs is the method of choice and can be orders of magnitude faster than Naive or DesqCount.


Summary

• DESQ system for declarative sequential pattern mining
  – Find patterns in sequential data
  – First step towards a unifying framework
  – Pattern expressions to express constraints in an intuitive way
  – Item-based partitioning to scale to large datasets
• Directions for future work
  – Better algorithms & analysis
  – More powerful pattern expression language
  – Interestingness beyond frequency
  – Trees & graphs

Make sequential pattern mining useful, usable, and efficient.

Thank you!
