CIKM 2010 Demo
SEQUEL: Query Completion via Pattern Mining on Multi-Column Structural Data
Chuancong Gao, Qingyan Yang, Jianyong Wang
Tsinghua University, Beijing, China
Structural Data Description
Mined Pattern Structure
Suggestion Process
STEP 1: Search the index of each column to find at least one combination (matching order) of columns that matches the input query. E.g., the query “www da” is matched against the indexes shown on the right side.
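As a rough illustration of STEP 1, the sketch below enumerates matching orders for a query over toy column indexes. It is not the SEQUEL implementation: the function names, the use of plain Python sets in place of the trie indexes, and the rule that only the final query segment may be an incomplete prefix are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

MatchingOrder = List[Tuple[str, str]]  # [(column, matched text), ...]


def has_prefix(phrases: Set[str], text: str) -> bool:
    """True if some indexed phrase starts with `text`."""
    return any(p.startswith(text) for p in phrases)


def matching_orders(query: str, indexes: Dict[str, Set[str]]) -> List[MatchingOrder]:
    """Split the query into per-column segments, greedily taking the
    longest run of tokens that a column index accepts (hypothetical helper)."""
    tokens = query.lower().split()
    results: List[MatchingOrder] = []

    def extend(pos: int, used: FrozenSet[str], order: MatchingOrder) -> None:
        if pos == len(tokens):
            results.append(order)
            return
        for column, phrases in indexes.items():
            if column in used:
                continue
            # Greedy: try the longest remaining run of tokens first.
            for end in range(len(tokens), pos, -1):
                text = " ".join(tokens[pos:end])
                is_last = end == len(tokens)
                # Assumption: only the final segment may be a partial prefix.
                ok = has_prefix(phrases, text) if is_last else text in phrases
                if ok:
                    extend(end, used | {column}, order + [(column, text)])
                    break

    extend(0, frozenset(), [])
    return results


# Toy indexes; the phrases are illustrative, not taken from the mined data.
toy_indexes = {
    "title": {"data", "database", "data web"},
    "venue": {"www", "icde"},
}
print(matching_orders("www da", toy_indexes))
```

With these toy indexes the query “www da” yields a single matching order: “www” on the venue column followed by the prefix “da” on the title column.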
Advantages Compared to Other Systems
Pattern Index Structure – Trie Tree
Example on Column Title Phrase and Venue
[Figure: system workflow.
Offline Part: Structural Data → Formalize, Mine & Index → Mined Patterns and Indexes for Each Column.
Online Part: Query → Preprocess → Try to Match Greedily on Each Column Index → Patterns for m Match Combinations → Top-k Selection on Last-Matched Column for m Combinations → Top-k Selection from m×k Candidates → Output.
Legend: ≥ denotes Ranking Score Comparison.]
The DBLP Computer Science Bibliography (DBLP)
• > 1,400,000 Publication Entries
• Four Attributes for each Publication Entry:
  • Authors (e.g. Jiawei Han, Guozhu Dong, Yiwen Yin)
  • Title (e.g. Efficient Mining of Partial Periodic Patterns in Time Series Database)
  • Venue (e.g. ICDE)
  • Year (e.g. 1999)
1. Title Phrase “frequent patterns” appears 17 times in Venue “icdm”
2. Title Phrase “pattern” appears 14 times for Authors “jian pei” and “jiawei han”
• Suggests Patterns Mined from the Underlying Data instead of Query Logs
  • More Accurate and Meaningful
  • Query Logs on Structural Data are Low in Amount and Quality
• No Need to Explicitly Specify Different Columns in the Query
• Suggests Phrases instead of Single Terms
• Fast for both Offline Pattern Mining and Online Suggestion
[Figure: Title Phrase Index and Venue Index, drawn as character-level trie trees with lists of pattern ids attached to their nodes. Node legend: Blank Node, Normal Node, Phrase-end Node. Terms legible in the tries include “data”, “database”, “web”, “icde” and “www”.]
[Table: Some Selected Patterns, with columns id, Title Phrase, Venue, sup. Legible rows (ids 1 to 10) include “data”, “data icde”, “data www”, “data web www”, “database icde”, “icde” and “www”. The sup values shown are 50263, 514, 14, 14, 312, 2666, 880, 4 and 1262. A further entry reads “www data 17”.]
http://dbgroup.cs.tsinghua.edu.cn/chuancong/sequel
STEP 2: Suggest on the last matched column of each matching order.
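A minimal sketch of STEP 2 follows, assuming each matching order is a list of (column, text) pairs whose last pair holds the partially typed text, and assuming the ranking score is simply the pattern support `sup`. Both are assumptions for illustration, as are the function names and the toy pattern table. The two-level selection mirrors the workflow: top-k per matching order, then top-k over the resulting m×k candidates.

```python
import heapq
from typing import Dict, List, Tuple

Pattern = Dict[str, object]            # column -> phrase, plus "sup"
MatchingOrder = List[Tuple[str, str]]  # [(column, matched text), ...]


def suggest_for_order(order: MatchingOrder, patterns: List[Pattern], k: int) -> List[Pattern]:
    """Top-k patterns for one matching order, completing its last column."""
    *fixed, (last_col, prefix) = order
    hits = [
        p for p in patterns
        # Earlier columns must match exactly ...
        if all(p.get(col) == text for col, text in fixed)
        # ... and the last-matched column is completed from its prefix.
        and str(p.get(last_col, "")).startswith(prefix)
    ]
    return heapq.nlargest(k, hits, key=lambda p: p["sup"])


def suggest(orders: List[MatchingOrder], patterns: List[Pattern], k: int) -> List[Pattern]:
    """Top-k over the (at most) m*k candidates from m matching orders."""
    candidates: List[Pattern] = []
    for order in orders:
        candidates.extend(suggest_for_order(order, patterns, k))
    return heapq.nlargest(k, candidates, key=lambda p: p["sup"])


# Toy pattern table; the support values are made up for this example.
toy_patterns = [
    {"title": "data", "venue": "www", "sup": 30},
    {"title": "data web", "venue": "www", "sup": 12},
    {"title": "database", "venue": "icde", "sup": 50},
    {"venue": "www", "sup": 100},
]
toy_orders = [[("venue", "www"), ("title", "da")]]
print(suggest(toy_orders, toy_patterns, k=2))
```

The sketch does not remove duplicates when two matching orders return the same pattern; a real system would likely need to.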
Pattern Mining Algorithm
Based on Frequent Sequential Pattern Mining algorithm PrefixSpan:
• Treat Authors as Itemset
• Treat Title as Sequence
• Treat Venue & Year as Single-Item
• Concatenate all the columns together as a new Sequence
• Mine and Index
Used Minimum Support (Frequency) Threshold: 10
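The formalization used for mining (Authors as an itemset, Title as a sequence, Venue and Year as single items, all concatenated into one sequence) can be sketched as below. The record layout, the function names and the tagging of items with their column are assumptions for this example. The second function is a brute-force support counter for (title phrase, venue) pairs; it is a simple stand-in used to show the support threshold and is not PrefixSpan.

```python
from collections import Counter
from typing import Dict, List, Tuple

MIN_SUPPORT = 10  # minimum support (frequency) threshold stated on the poster


def formalize(entry: dict) -> list:
    """Turn one publication entry into a single sequence of elements."""
    authors = frozenset(("author", a.lower()) for a in entry["authors"])  # itemset
    title = [("title", w) for w in entry["title"].lower().split()]       # sequence
    venue = ("venue", entry["venue"].lower())                            # single item
    year = ("year", str(entry["year"]))                                  # single item
    return [authors, *title, venue, year]


def phrase_venue_support(entries: List[dict], max_len: int = 2,
                         min_support: int = MIN_SUPPORT) -> Dict[Tuple[str, str], int]:
    """Count entries containing each contiguous title phrase (up to
    `max_len` words) per venue, keeping pairs that reach `min_support`."""
    counts: Counter = Counter()
    for entry in entries:
        words = entry["title"].lower().split()
        venue = entry["venue"].lower()
        phrases = {
            " ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)
        }
        for phrase in phrases:  # each phrase counts once per entry
            counts[(phrase, venue)] += 1
    return {key: sup for key, sup in counts.items() if sup >= min_support}


example = {
    "authors": ["Jiawei Han", "Guozhu Dong", "Yiwen Yin"],
    "title": "Efficient Mining of Partial Periodic Patterns in Time Series Database",
    "venue": "ICDE",
    "year": 1999,
}
print(formalize(example))
```

The example entry is the DBLP record quoted in the data description; the formalized sequence starts with the author itemset, continues with the ten title words in order, and ends with the venue and year items.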
Pattern index structure (Trie tree):
• Used for fast column text matching
• Every column has one corresponding Trie tree
• All the indexes share a global table storing all the patterns
• Close to 2GB in total in memory
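The index layout described above can be sketched as one character-level trie per column, with every trie storing only ids into a shared global pattern table. The class and function names and the toy patterns are assumptions for this example, and the sketch makes no attempt at the memory layout of the real system.

```python
from typing import Dict, List, Sequence


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.pattern_ids: List[int] = []  # non-empty only at phrase-end nodes


class ColumnTrie:
    """Character-level trie over the phrases of one column."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, phrase: str, pattern_id: int) -> None:
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.pattern_ids.append(pattern_id)

    def ids_with_prefix(self, prefix: str) -> List[int]:
        """Ids of all patterns whose phrase in this column starts with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        ids: List[int] = []
        stack = [node]
        while stack:  # collect ids from the whole subtree
            cur = stack.pop()
            ids.extend(cur.pattern_ids)
            stack.extend(cur.children.values())
        return sorted(ids)


def build_indexes(patterns: Sequence[dict], columns: Sequence[str]) -> Dict[str, ColumnTrie]:
    """One trie per column; a pattern id is its position in the global table."""
    tries = {col: ColumnTrie() for col in columns}
    for pattern_id, pattern in enumerate(patterns):
        for col in columns:
            if col in pattern:
                tries[col].insert(pattern[col], pattern_id)
    return tries


# Toy global pattern table; the support values are made up.
global_table = [
    {"title": "data", "sup": 50},
    {"title": "data", "venue": "icde", "sup": 5},
    {"title": "data", "venue": "www", "sup": 3},
    {"title": "database", "venue": "icde", "sup": 4},
    {"venue": "www", "sup": 20},
]
indexes = build_indexes(global_table, columns=("title", "venue"))
both = set(indexes["title"].ids_with_prefix("da")) & set(indexes["venue"].ids_with_prefix("www"))
print([global_table[i] for i in sorted(both)])
```

Intersecting the id lists returned by two column tries gives the patterns that match on both columns, which is one plausible way to combine per-column matches through the shared table.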