NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
TRANSCRIPT
Who am I
User of Spark since 2012
Organiser of the London Spark Meetup
Run the Data Science team at Skimlinks
Apache Spark
The RDD
RDD.map
>>> thisrdd = sc.parallelize(range(12), 4)
>>> thisrdd.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> otherrdd = thisrdd.map(lambda x: x % 3)
>>> otherrdd.collect()
[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
RDD.reduceByKey
>>> otherrdd.zip(thisrdd).collect()
[(0, 0), (1, 1), (2, 2), (0, 3), (1, 4), (2, 5), (0, 6), (1, 7), (2, 8), (0, 9), (1, 10), (2, 11)]
>>> otherrdd.zip(thisrdd).reduceByKey(lambda x, y: x + y).collect()
[(0, 18), (1, 22), (2, 26)]
How to not crash your Spark job
Set the number of reducers sensibly
Configure your PySpark cluster properly
Don't shuffle (unless you have to)
Don't groupBy
Repartition your data if necessary (see the sketch below)
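The last two tips in code: a minimal sketch, assuming a pair RDD named pairs; the partition count of 200 is illustrative, not a recommendation:

from operator import add

# set the number of reducers explicitly rather than taking the default
counts = pairs.reduceByKey(add, numPartitions=200)

# rebalance skewed or too-coarse data before an expensive stage
balanced = counts.repartition(200)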
Lots of people will say 'use scala'
Don't listen to those people.
Naive Bayes - recap
from operator import add

# get (class label, word) tuples
label_token = gettokens(docs)
# [(False, u'https'), (True, u'fashionblog'), (True, u'dress'), (False, u'com'), ...]

# key by token, with a one-hot (true, false) count for each label
tokencounter = label_token.map(lambda lt: (lt[1], (int(lt[0]), int(not lt[0]))))
# [(u'https', [0, 1]), (u'fashionblog', [1, 0]), (u'dress', [1, 0]), (u'com', [0, 1]), ...]

# get the word count for each class
termcounts = tokencounter.reduceByKey(lambda x, y: list(map(add, x, y)))
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]
Naive Bayes in Spark
from operator import truediv

# add pseudocounts for Laplace smoothing
termcounts_plus_pseudo = termcounts.map(lambda tc: (tc[0], list(map(add, tc[1], (1, 1)))))
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), ...]
# => [(u'https', [101, 113]), (u'fashionblog', [1, 101]), (u'dress', [6, 16]), ...]

# get the total number of words in each class
totals = termcounts_plus_pseudo.values().reduce(lambda x, y: list(map(add, x, y)))
# [1321, 2345]

# per-class term probabilities P(term | class)
P_t = termcounts_plus_pseudo.map(lambda tc: (tc[0], list(map(truediv, tc[1], totals))))
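To classify with these estimates, sum log probabilities per class; a minimal sketch, assuming P_t fits in driver memory and that class priors p_true and p_false have been computed elsewhere (both names are assumptions, not the talk's code):

from math import log

p_table = dict(P_t.collect())  # {token: [P(t|True), P(t|False)]}

def classify(tokens):
    score_true, score_false = log(p_true), log(p_false)
    for t in tokens:
        if t in p_table:
            score_true += log(p_table[t][0])
            score_false += log(p_table[t][1])
    return score_true > score_false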
Naive Bayes in Spark
reduceByKey(combineByKey)
[Diagram: reduceByKey is implemented with combineByKey. combineLocally first merges the values within each partition into a per-partition map, e.g. (k1, 1), (k1, 1) -> {k1: 2, ...}; (k1, 2), (k1, 1) -> {k1: 3, ...}; (k1, 5) -> {k1: 5, ...}. _mergeCombiners then merges the per-partition maps across the shuffle: {k1: 2}, {k1: 3}, {k1: 5} -> {k1: 10, ...}. reduceByKey(numPartitions) controls how many partitions the merge step uses.]
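At the API level the same idea looks roughly like this; pairs is an illustrative pair RDD, and the three arguments are combineByKey's createCombiner, mergeValue and mergeCombiners:

from operator import add

# reduceByKey(add) is effectively:
summed = pairs.combineByKey(lambda v: v,  # createCombiner: start a combiner from one value
                            add,          # mergeValue: fold a value into a combiner (per partition)
                            add)          # mergeCombiners: merge combiners across partitions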
RDD.aggregate(zeroValue, seqOp, combOp)
Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value."
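For intuition, a classic aggregate example, computing a sum and a count in a single pass (nums is illustrative):

nums = sc.parallelize(range(12), 4)
total, count = nums.aggregate((0, 0),
                              lambda acc, x: (acc[0] + x, acc[1] + 1),  # seqOp
                              lambda a, b: (a[0] + b[0], a[1] + b[1]))  # combOp
# total = 66, count = 12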
Naive Bayes in Spark
from operator import add

class WordFrequencyAggregator(object):
    def __init__(self):
        self.S = {}

    def add(self, token_count):
        # fold one (token, counts) pair into the running totals
        token, count = token_count
        if token not in self.S:
            self.S[token] = (0, 0)
        self.S[token] = list(map(add, self.S[token], count))
        return self

    def merge(self, other):
        # combine two partial aggregations
        for term, count in other.S.items():
            if term not in self.S:
                self.S[term] = (0, 0)
            self.S[term] = list(map(add, self.S[term], count))
        return self
Naive Bayes in Spark: Aggregation
With reduceByKey:
termcounts = tokencounter.reduceByKey(lambda x, y: list(map(add, x, y)))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), ...]
# => [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), ...]

With aggregate:
aggregates = tokencounter.aggregate(WordFrequencyAggregator(),
                                    lambda x, y: x.add(y),    # seqOp
                                    lambda x, y: x.merge(y))  # combOp

RDD.aggregate(zeroValue, seqOp, combOp)
Naive Bayes in Spark: treeAggregate
RDD.treeAggregate(zeroValue, seqOp, combOp, depth=2)
Aggregates the elements of this RDD in a multi-level tree pattern.

With reduceByKey:
termcounts = tokencounter.reduceByKey(lambda x, y: list(map(add, x, y)))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
# => [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]

With treeAggregate:
aggregates = tokencounter.treeAggregate(WordFrequencyAggregator(),
                                        lambda x, y: x.add(y),
                                        lambda x, y: x.merge(y),
                                        depth=4)
treeAggregate performance
On 1B short documents:
RDD.reduceByKey: 18 min
RDD.treeAggregate: 10 min
https://gist.github.com/martingoodson/aad5d06e81f23930127b
Word2Vec

Training Word2Vec in Spark
from pyspark.mllib.feature import Word2Vec
inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)
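Once fitted, the model can be queried directly; a small usage sketch (the query word is illustrative):

# nearest neighbours in the embedding space
for word, similarity in model.findSynonyms('dress', 5):
    print(word, similarity)

# the raw vector for a single word
vec = model.transform('dress')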
How to use word2vec vectors for classification problems
Averaging
Clustering
Convolutional Neural Network
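A minimal sketch of the first option, averaging word vectors into a single document vector; it assumes the mllib model above (whose transform raises ValueError for out-of-vocabulary words) and mllib's default 100-dimensional embedding:

import numpy as np

def doc_vector(tokens, model, dim=100):
    # average the vectors of the in-vocabulary tokens
    vecs = []
    for t in tokens:
        try:
            vecs.append(model.transform(t).toArray())
        except ValueError:  # out-of-vocabulary token
            continue
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)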
K-Means in Spark
from numpy import array
from pyspark.mllib.clustering import KMeans, KMeansModel

word = sc.textFile('GoogleNews-vectors-negative300.txt')
vectors = word.map(lambda line: array(
    [float(x) for x in line.split('\t')[1:]]
))
clusters = KMeans.train(vectors, 50000, maxIterations=10,
                        runs=10, initializationMode="random")

# broadcast the trained model so each worker can label points locally
clusters_b = sc.broadcast(clusters)
labels = vectors.map(lambda x: clusters_b.value.predict(x))
Semi-Supervised Naive Bayes
● Build an initial naive Bayes classifier, ŵ, from the labeled documents, X, only
● Loop while the classifier parameters improve:
○ (E-step) Use the current classifier, ŵ, to estimate the component membership of each unlabeled document, i.e. the probability that each class generated each document
○ (M-step) Re-estimate the classifier, ŵ, given the estimated class membership of each document
Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-Supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.), Semi-Supervised Learning. MIT Press: Boston. 2006.
Instead of hard labels:
tokencounter = label_token.map(lambda lt: (lt[1], (int(lt[0]), int(not lt[0]))))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]

use probabilities:
# [(u'https', [0.1, 0.3]), (u'fashionblog', [0.01, 0.11]), (u'dress', [0.02, 0.02]), (u'com', [0.13, 0.05]), ...]
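How the loop could look on RDDs; train_nb, predict_proba and train_nb_from_counts are illustrative stand-ins for the model-fitting code above, not the talk's actual implementation:

model = train_nb(labelled_docs)  # initial classifier from the labelled data only
for _ in range(10):
    # E-step: soft class membership P(True | doc) for each unlabelled document
    soft = unlabelled_docs.map(lambda doc: (doc, predict_proba(model, doc)))
    # M-step: re-estimate counts, with the probabilities as fractional labels
    tokencounter = soft.flatMap(lambda dp: [(tok, (dp[1], 1 - dp[1])) for tok in dp[0]])
    model = train_nb_from_counts(tokencounter.reduceByKey(lambda x, y: list(map(add, x, y))))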
Naive Bayes in Spark: EM
500K labelled examples: Precision 0.27, Recall 0.15, F1 0.099
Add 10M unlabelled examples, 10 EM iterations: Precision 0.26, Recall 0.31, F1 0.14
240M training examples: Precision 0.31, Recall 0.19, F1 0.12
Add 250M unlabelled examples, 10 EM iterations: Precision 0.26, Recall 0.22, F1 0.12
PySpark Memory and Configuration: Worked Example
10 x r3.4xlarge (122GB, 16 cores each)
Use half of each node for the executor: 60GB
Use 12 cores per node: 120 cores in total
OS: ~12GB per node
Each Python worker: ~4GB; 12 x 4GB = 48GB per node
Cache: 60% x 60GB x 10 nodes = 360GB
Each Java thread: 40% x 60GB / 12 = ~2GB
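As Spark 1.x era configuration this budget would look roughly like the following; the property names are from that era and the values are just this example's assumptions:

spark.executor.memory         60g
spark.executor.cores          12
spark.python.worker.memory    4g
spark.storage.memoryFraction  0.6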
more here: http://files.meetup.com/13722842/Spark%20Meetup.pdf