Computing n-Gram Statistics in MapReduce
Klaus Berberich ([email protected])
Srikanta Bedathur ([email protected])
Computing n-Gram Statistics in MapReduce – Klaus Berberich / 26
n-Gram Statistics
✦ Statistics about variable-length word sequences (e.g., lord of the rings, at the end of, …) have many applications in fields including
✦ Information Retrieval
✦ Natural Language Processing
✦ Digital Humanities
✦ Example n-grams: rates hilton paris, the hilton paris offers great rates, in the summer, siri how is the, thou shalt not, don’t ya
Problem Statement
✦ Can be seen as a special case of frequent sequence mining (no gaps, single-item transactions only) with a slightly different notion of frequency
✦ Our focus is on large-scale document collections (millions of documents or more, natural language)
How can we efficiently compute statistics about n-grams that occur at least τ times and consist of at most σ words, using MapReduce?
Outline
✦ Motivation
✦ Competitors & Challenges
✦ SUFFIX-σ
✦ Extensions
✦ Experimental Evaluation
✦ Conclusion
MapReduce
✦ Distributed data processing platform by Google [1]
✦ for clusters of commodity hardware
✦ handles hardware/software failures transparently
✦ available as open-source Apache Hadoop
✦ Programming model operating on key-value pairs
✦ map(): <k1,v1> -> list<k2,v2>
✦ reduce(): <k2,list<v2>> -> list<k3,v3>
✦ customizable compare() and partition() functions
[1] J. Dean and S. Ghemawat: Simplified Data Processing on Large Clusters, OSDI 2004
WORD COUNT
✦ Determine counts of all individual words

map(did, content):
  for all words in content:
    emit(word, did)

reduce(word, list<did>):
  emit(word, length(list<did>))

✦ Example: for documents d1@t1 = "a x b b a y" and d2@t2 = "b y a x a b", map() emits (a,d1@t1), (x,d1@t1), …, (b,d2@t2), (y,d2@t2), …; the shuffle phase (partition() and compare()) groups pairs by word across the m reducers, and reduce() emits (a,4), (b,4), (x,2), (y,2), …
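The dataflow above can be simulated in plain Python (a sketch, not Hadoop code; `word_count` and the example documents are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def word_count(docs):
    """Simulate WORD COUNT: map() emits (word, did) pairs, the shuffle
    groups them by word, and reduce() counts each group."""
    # map(): one (word, did) pair per word occurrence
    pairs = [(word, did)
             for did, content in docs.items()
             for word in content.split()]
    # shuffle: partition()/compare() bring equal keys together
    pairs.sort(key=itemgetter(0))
    # reduce(): emit (word, length(list<did>))
    return {word: sum(1 for _ in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

counts = word_count({"d1@t1": "a x b b a y", "d2@t2": "b y a x a b"})
# counts == {"a": 4, "b": 4, "x": 2, "y": 2}
```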
N-GRAM COUNT
✦ Determine counts of all n-grams of at most σ words [2]

map(did, content):
  for k in <1 … σ>:
    for all k-grams in content:
      emit(k-gram, did)

reduce(n-gram, list<did>):
  if length(list<did>) >= τ:
    emit(n-gram, length(list<did>))

✦ Example: for d1@t1 = "a x b b a y", map() emits (a,d1@t1), (x,d1@t1), …, (ax,d1@t1), (xb,d1@t1), …, (axb,d1@t1), (xbb,d1@t1), …, (axbbay,d1@t1)
[2] T. Brants et al.: Large Language Models in Machine Translation, EMNLP-CoNLL 2007
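As a single-machine sketch (illustrative function name, not the distributed implementation), the same computation is:

```python
def ngram_count(docs, sigma, tau):
    """Naive N-GRAM COUNT: emit every k-gram (k <= sigma),
    keep those occurring at least tau times."""
    counts = {}
    for did, content in docs.items():
        words = content.split()
        for i in range(len(words)):          # map(): all k-grams per position
            for k in range(1, sigma + 1):
                if i + k > len(words):
                    break
                gram = tuple(words[i:i + k])
                counts[gram] = counts.get(gram, 0) + 1
    # reduce(): threshold by tau
    return {g: c for g, c in counts.items() if c >= tau}

stats = ngram_count({"d1": "a x b b a y", "d2": "b y a x a b"}, sigma=2, tau=2)
```

Note that every position emits up to σ key-value pairs, which is the source of this method's high communication cost.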
APRIORI-SCAN & APRIORI-INDEX
✦ Apriori principle: a k-gram can occur at least τ times only if its constituent (k-1)-grams occur at least τ times
✦ APRIORI-SCAN (round-wise scans of the collection), e.g., for d1@t1 = "a x b b a y":
  (1) frequent set { } → emit (a,d1@t1), (b,d1@t1), (x,d1@t1), (y,d1@t1)
  (2) frequent set {a, b, x, y} → emit (ax,d1@t1), (ay,d1@t1), …, (bb,d1@t1), (ba,d1@t1), …, (xb,d1@t1)
✦ APRIORI-INDEX (round-wise joins of positional inverted lists):
  (2) ab: d5@t5 [2,7], d7@t7 [1,11] and bx: d5@t5 [8], d7@t7 [2] → abx: d5@t5 [7], d7@t7 [1]
  (3) abx: d8@t8 [2,7], d9@t9 [1,11] and bxy: d8@t8 [3] → abxy: d8@t8 [2]
[1] R. Srikant and R. Agrawal: Mining Sequential Patterns: Generalizations & Performance Improvements, EDBT 1996
[2] M. J. Zaki: SPADE: An Efficient Algorithm for Mining Frequent Sequences, ML 42(1/2):31–60, 2001
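The level-wise pruning can be sketched in a single-machine function (the MapReduce variants run one job per round; `apriori_ngrams` is an illustrative name):

```python
def apriori_ngrams(docs, sigma, tau):
    """Level-wise n-gram mining sketch: round k counts only those k-grams
    whose two constituent (k-1)-grams were frequent in round k-1."""
    token_docs = [content.split() for content in docs.values()]
    frequent, prev = {}, None
    for k in range(1, sigma + 1):
        counts = {}
        for words in token_docs:
            for i in range(len(words) - k + 1):
                gram = tuple(words[i:i + k])
                # Apriori pruning: prefix and suffix (k-1)-grams must be frequent
                if prev is not None and (gram[:-1] not in prev or gram[1:] not in prev):
                    continue
                counts[gram] = counts.get(gram, 0) + 1
        prev = {g: c for g, c in counts.items() if c >= tau}
        if not prev:          # no frequent k-grams -> no longer ones either
            break
        frequent.update(prev)
    return frequent

freq = apriori_ngrams({"d1": "a x b b a y", "d2": "b y a x a b"}, sigma=3, tau=2)
```

Each round depends on the previous round's output, which is why neither approach fits into a single MapReduce job.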
Challenges & Desiderata
✦ Single MapReduce Job (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)
✦ Communication Cost (N-GRAM COUNT ✗ / APRIORI-SCAN ✓ / APRIORI-INDEX ✓)
✦ Main-Memory Consumption (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)
✦ Ease of Implementation (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)
SUFFIX-σ
✦ SUFFIX-σ is based on three key ideas, inspired by methods from string processing (e.g., suffix arrays):
✦ emit only suffixes of documents in map() to reduce communication cost
✦ partition() suffixes based on their first word
✦ sort suffixes in reverse lexicographic order to limit main-memory consumption in reduce()
Suffixes
✦ SUFFIX-σ emits only suffixes of documents in map()
✦ each suffix represents multiple n-grams corresponding to its prefixes (e.g., axbbay represents a, ax, axb, axbb, axbba, and axbbay)
✦ Example: for d1@t1 = "a x b b a y", instead of every k-gram (a,d1@t1), (ax,d1@t1), …, (axbbay,d1@t1), only the suffixes (axbbay,d1@t1), (xbbay,d1@t1), (bbay,d1@t1), (bay,d1@t1), (ay,d1@t1), (y,d1@t1) are emitted
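In code, map() reduces to emitting one record per document position (a sketch; it assumes suffixes may be truncated to σ words, since n-grams longer than σ are never counted):

```python
def emit_suffixes(did, content, sigma):
    """map() in SUFFIX-sigma: emit each document suffix (truncated to
    sigma words) with its document id, instead of every individual k-gram."""
    words = content.split()
    return [(tuple(words[i:i + sigma]), did) for i in range(len(words))]

pairs = emit_suffixes("d1@t1", "a x b b a y", sigma=3)
```

This brings the number of emitted records per document down from O(σ · length) to O(length).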
Partitioning
✦ SUFFIX-σ partitions suffixes based on their first word
✦ brings together suffixes representing the same n-gram
✦ crucial for computation in a single MapReduce job
✦ Example: (axbbay,d1@t1) and (axbyyx,d4@t4) go to the partition for a; (xbbay,d1@t1) and (xbyyx,d4@t4) to the partition for x; (yyabbx,d3@t3) and (yabbx,d3@t3) to the partition for y
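A minimal sketch of the partitioning function (`NUM_REDUCERS` is an illustrative value for m):

```python
NUM_REDUCERS = 4  # illustrative value for m

def partition(suffix, m=NUM_REDUCERS):
    """partition() in SUFFIX-sigma: route by the first word only, so that
    all suffixes that can represent the same n-gram meet at one reducer."""
    return hash(suffix[0]) % m

# suffixes starting with the same word land in the same partition
same = partition(("a", "x", "b")) == partition(("a", "b", "b"))
```

Since every n-gram is a prefix of some suffix starting with the same word, this routing suffices to compute all counts in one job.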
Sorting
✦ SUFFIX-σ sorts suffixes in reverse lexicographic order
✦ bookkeeping using a stack of bounded height σ
✦ crucial for low main-memory consumption
✦ Example: the reducer responsible for a receives (axxxa,d7@t7), (axbyyx,d4@t4), (axbbay,d1@t1), (abbxa,d5@t5), (aaxxa,d6@t6), (aax,d8@t9) in reverse lexicographic order; adjacent suffixes share prefixes, so counts can be maintained on a stack and emitted as prefixes are popped, e.g., (axbbay,1), (axbba,1), (axbb,1), (axb,2), (ax,3)
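The stack-based bookkeeping can be sketched as follows (an illustrative reconstruction; the actual reduce() also tracks document-level information, which is omitted here):

```python
def reduce_suffixes(suffixes, sigma, tau):
    """Sketch of SUFFIX-sigma's reduce(): consume suffixes in reverse
    lexicographic order, keeping one (word, count) entry per level of the
    current prefix on a stack of height at most sigma."""
    stack = []    # (word, count) pairs; stack[:i+1] spells an n-gram prefix
    result = []

    def pop_to(depth):
        # emit counts for all prefixes no longer shared with the next suffix
        while len(stack) > depth:
            word, count = stack.pop()
            gram = tuple(w for w, _ in stack) + (word,)
            if count >= tau:
                result.append((gram, count))

    for suffix in suffixes:
        suffix = suffix[:sigma]           # n-grams longer than sigma are irrelevant
        shared = 0                        # length of common prefix with the stack
        while (shared < len(stack) and shared < len(suffix)
               and stack[shared][0] == suffix[shared]):
            shared += 1
        pop_to(shared)
        for word in suffix[shared:]:      # push the new, unshared part
            stack.append((word, 0))
        for i in range(len(suffix)):      # this suffix counts toward every prefix
            stack[i] = (stack[i][0], stack[i][1] + 1)
    pop_to(0)
    return result

stats = reduce_suffixes(
    [tuple("axxxa"), tuple("axbyyx"), tuple("axbbay"),
     tuple("abbxa"), tuple("aaxxa"), tuple("aax")],
    sigma=6, tau=1)
```

At any point the reducer holds at most σ stack entries, which is what bounds main-memory consumption.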
SUFFIX-σ

map(did, content):
  for all suffixes in content:
    emit(suffix, did)

partition(suffix, did):
  return suffix[0] % m

compare(suffix0, suffix1):
  return -strcmp(suffix0, suffix1)
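The compare() trick, negating an ordinary string comparison to obtain reverse lexicographic order, can be checked directly (a Python sketch of the comparator):

```python
from functools import cmp_to_key

def compare(suffix0, suffix1):
    """compare() in SUFFIX-sigma: negate strcmp so that the framework's
    ascending sort yields reverse lexicographic order."""
    return -((suffix0 > suffix1) - (suffix0 < suffix1))

suffixes = ["aax", "axbbay", "aaxxa", "axxxa", "abbxa", "axbyyx"]
suffixes.sort(key=cmp_to_key(compare))
# -> ["axxxa", "axbyyx", "axbbay", "abbxa", "aaxxa", "aax"]
```

Reusing the framework's sort this way means SUFFIX-σ needs no custom sorting code of its own.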
Extensions
✦ Closed/Maximal n-Grams
✦ SUFFIX-σ can emit only prefix-closed/prefix-maximal n-grams in reduce(); an additional MapReduce job then identifies suffix-closed/suffix-maximal n-grams
✦ Other Aggregations
✦ n-gram time series [1]
✦ n-gram inverted index, e.g., a b → d2, d7, d9; a b c → d2, d7; b c x → d3, d6
[1] J.-B. Michel et al.: Quantitative Analysis of Culture Using Millions of Digitized Books, Science 2010
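One plausible reading of prefix-closedness, as a post-processing sketch over already-computed counts (`prefix_closed` is an illustrative name; the paper's exact definition may differ):

```python
def prefix_closed(counts):
    """Drop n-grams that have a one-word right-extension of equal frequency;
    those are subsumed by the longer n-gram and thus redundant."""
    return {gram: c for gram, c in counts.items()
            if not any(len(other) == len(gram) + 1
                       and other[:-1] == gram and oc == c
                       for other, oc in counts.items())}

closed = prefix_closed({("a",): 4, ("a", "b"): 4, ("a", "b", "c"): 2})
```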
Datasets & Setup
✦ The New York Times Annotated Corpus (NYT): 1.8 million newspaper articles, 1987–2007, ~3 GB
✦ ClueWeb09-B (CW): 50 million web documents, 2009, ~246 GB
✦ 10 Cluster Nodes (2x6 cores, 64 GB RAM, 4x2 TB HDD, Debian 5.0.9, 1 Gbit Ethernet, CDH3u0)
✦ Implementation operates on compressed integer sequences; datasets pre-processed accordingly
Use Cases
✦ Training a Statistical Language Model (LM)
✦ σ = 5 (i.e., n-grams consisting of up to five words)
✦ τ = 10 (NYT) / τ = 100 (CW)
✦ Identifying Repeated Text Fragments (RT)
✦ σ = 100 (to also capture quotations, idioms, etc.)
✦ τ = 100 (NYT) / τ = 1,000 (CW)
Results (LM)
[Bar chart: wallclock time in minutes (log scale, 1–10,000) for N-GRAM COUNT, APRIORI-SCAN, APRIORI-INDEX, and SUFFIX-σ on NYT and CW; extracted bar values (unordered): 81, 3, 3,809, 37, 240, 9, 309, 10]
Results (RT)
[Bar chart: wallclock time in minutes (log scale, 1–10,000) for N-GRAM COUNT, APRIORI-SCAN, APRIORI-INDEX, and SUFFIX-σ on NYT and CW; extracted bar values (unordered): 229, 5, 393, 62, 338, 77, 15,000, 117]
Conclusion
✦ SUFFIX-σ computes n-gram statistics in MapReduce
✦ based on the “suffix idea” from string processing
✦ robust to a wide variety of parameter choices
✦ outperforms state-of-the-art competitors
✦ runs in a single MapReduce job, consumes little main memory, and is easy to implement
Advertisements
✦ Code: http://github.com/kberberi/mpiingrams
✦ EU Project: Longitudinal Analytics of Web Archive Data
✦ Follow-Up Work: I. Miliaraki, K. Berberich, R. Gemulla, S. Zoupanos: Mind the Gap: Large-Scale Frequent Sequence Mining, SIGMOD 2013
Thank you!
Questions?