NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
TRANSCRIPT
Who am I
User of Spark since 2012
Organiser of the London Spark Meetup
Run the Data Science team at Skimlinks
Apache Spark
The RDD
RDD.map
>>> thisrdd = sc.parallelize(range(12), 4)
>>> thisrdd.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> otherrdd = thisrdd.map(lambda x: x % 3)
>>> otherrdd.collect()
[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
RDD.reduceByKey
>>> otherrdd.zip(thisrdd).collect()
[(0, 0), (1, 1), (2, 2), (0, 3), (1, 4), (2, 5), (0, 6), (1, 7), (2, 8), (0, 9), (1, 10), (2, 11)]
>>> otherrdd.zip(thisrdd).reduceByKey(lambda x, y: x + y).collect()
[(0, 18), (1, 22), (2, 26)]
How to not crash your Spark job
Set the number of reducers sensibly
Configure your PySpark cluster properly
Don't shuffle (unless you have to)
Don't groupBy
Repartition your data if necessary (see the sketch below)
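The last two tips in code: a minimal sketch, assuming a pair RDD named pairs; the partition count of 200 is illustrative, not a recommendation:

from operator import add

# set the number of reducers explicitly rather than taking the default
counts = pairs.reduceByKey(add, numPartitions=200)

# rebalance skewed or too-coarse data before an expensive stage
balanced = counts.repartition(200)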
Lots of people will say 'use scala'
Don't listen to those people.
Naive Bayes - recap
from operator import add

# get (class label, word) tuples
label_token = gettokens(docs)
# [(False, u'https'), (True, u'fashionblog'), (True, u'dress'), (False, u'com'), ...]

# key by token, with a one-hot (true, false) count for each label
tokencounter = label_token.map(lambda lt: (lt[1], (int(lt[0]), int(not lt[0]))))
# [(u'https', [0, 1]), (u'fashionblog', [1, 0]), (u'dress', [1, 0]), (u'com', [0, 1]), ...]

# get the word count for each class
termcounts = tokencounter.reduceByKey(lambda x, y: list(map(add, x, y)))
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]
Naive Bayes in Spark
from operator import truediv

# add pseudocounts for Laplace smoothing
termcounts_plus_pseudo = termcounts.map(lambda tc: (tc[0], list(map(add, tc[1], (1, 1)))))
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), ...]
# => [(u'https', [101, 113]), (u'fashionblog', [1, 101]), (u'dress', [6, 16]), ...]

# get the total number of words in each class
totals = termcounts_plus_pseudo.values().reduce(lambda x, y: list(map(add, x, y)))
# [1321, 2345]

# per-class term probabilities P(term | class)
P_t = termcounts_plus_pseudo.map(lambda tc: (tc[0], list(map(truediv, tc[1], totals))))
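To classify with these estimates, sum log probabilities per class; a minimal sketch, assuming P_t fits in driver memory and that class priors p_true and p_false have been computed elsewhere (both names are assumptions, not the talk's code):

from math import log

p_table = dict(P_t.collect())  # {token: [P(t|True), P(t|False)]}

def classify(tokens):
    score_true, score_false = log(p_true), log(p_false)
    for t in tokens:
        if t in p_table:
            score_true += log(p_table[t][0])
            score_false += log(p_table[t][1])
    return score_true > score_false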
Naive Bayes in Spark
reduceByKey(combineByKey)
[Diagram: reduceByKey is implemented with combineByKey. combineLocally first merges the values within each partition into a per-partition map, e.g. (k1, 1), (k1, 1) -> {k1: 2, ...}; (k1, 2), (k1, 1) -> {k1: 3, ...}; (k1, 5) -> {k1: 5, ...}. _mergeCombiners then merges the per-partition maps across the shuffle: {k1: 2}, {k1: 3}, {k1: 5} -> {k1: 10, ...}. reduceByKey(numPartitions) controls how many partitions the merge step uses.]
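At the API level the same idea looks roughly like this; pairs is an illustrative pair RDD, and the three arguments are combineByKey's createCombiner, mergeValue and mergeCombiners:

from operator import add

# reduceByKey(add) is effectively:
summed = pairs.combineByKey(lambda v: v,  # createCombiner: start a combiner from one value
                            add,          # mergeValue: fold a value into a combiner (per partition)
                            add)          # mergeCombiners: merge combiners across partitions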
RDD.aggregate(zeroValue, seqOp, combOp)
Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value."
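For intuition, a classic aggregate example, computing a sum and a count in a single pass (nums is illustrative):

nums = sc.parallelize(range(12), 4)
total, count = nums.aggregate((0, 0),
                              lambda acc, x: (acc[0] + x, acc[1] + 1),  # seqOp
                              lambda a, b: (a[0] + b[0], a[1] + b[1]))  # combOp
# total = 66, count = 12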
Naive Bayes in Spark
from operator import add

class WordFrequencyAggregator(object):
    def __init__(self):
        self.S = {}

    def add(self, token_count):
        # fold one (token, counts) pair into the running totals
        token, count = token_count
        if token not in self.S:
            self.S[token] = (0, 0)
        self.S[token] = list(map(add, self.S[token], count))
        return self

    def merge(self, other):
        # combine two partial aggregations
        for term, count in other.S.items():
            if term not in self.S:
                self.S[term] = (0, 0)
            self.S[term] = list(map(add, self.S[term], count))
        return self
Naive Bayes in Spark: Aggregation
With reduceByKey:
termcounts = tokencounter.reduceByKey(lambda x, y: list(map(add, x, y)))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), ...]
# => [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), ...]

With aggregate:
aggregates = tokencounter.aggregate(WordFrequencyAggregator(),
                                    lambda x, y: x.add(y),    # seqOp
                                    lambda x, y: x.merge(y))  # combOp

RDD.aggregate(zeroValue, seqOp, combOp)
Naive Bayes in Spark: treeAggregate
RDD.treeAggregate(zeroValue, seqOp, combOp, depth=2)
Aggregates the elements of this RDD in a multi-level tree pattern.

With reduceByKey:
termcounts = tokencounter.reduceByKey(lambda x, y: list(map(add, x, y)))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
# => [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]

With treeAggregate:
aggregates = tokencounter.treeAggregate(WordFrequencyAggregator(),
                                        lambda x, y: x.add(y),
                                        lambda x, y: x.merge(y),
                                        depth=4)
treeAggregate performance
On 1B short documents:
RDD.reduceByKey: 18 min
RDD.treeAggregate: 10 min
https://gist.github.com/martingoodson/aad5d06e81f23930127b
Word2Vec

Training Word2Vec in Spark
from pyspark.mllib.feature import Word2Vec
inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)
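Once fitted, the model can be queried directly; a small usage sketch (the query word is illustrative):

# nearest neighbours in the embedding space
for word, similarity in model.findSynonyms('dress', 5):
    print(word, similarity)

# the raw vector for a single word
vec = model.transform('dress')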
How to use word2vec vectors for classification problems
Averaging
Clustering
Convolutional Neural Network
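A minimal sketch of the first option, averaging word vectors into a single document vector; it assumes the mllib model above (whose transform raises ValueError for out-of-vocabulary words) and mllib's default 100-dimensional embedding:

import numpy as np

def doc_vector(tokens, model, dim=100):
    # average the vectors of the in-vocabulary tokens
    vecs = []
    for t in tokens:
        try:
            vecs.append(model.transform(t).toArray())
        except ValueError:  # out-of-vocabulary token
            continue
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)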
K-Means in Spark
from numpy import array
from pyspark.mllib.clustering import KMeans, KMeansModel

word = sc.textFile('GoogleNews-vectors-negative300.txt')
vectors = word.map(lambda line: array(
    [float(x) for x in line.split('\t')[1:]]
))
clusters = KMeans.train(vectors, 50000, maxIterations=10,
                        runs=10, initializationMode="random")

# broadcast the trained model so each worker can label points locally
clusters_b = sc.broadcast(clusters)
labels = vectors.map(lambda x: clusters_b.value.predict(x))
Semi-Supervised Naive Bayes
● Build an initial naive Bayes classifier, ŵ, from the labeled documents, X, only
● Loop while the classifier parameters improve:
○ (E-step) Use the current classifier, ŵ, to estimate the component membership of each unlabeled document, i.e. the probability that each class generated each document
○ (M-step) Re-estimate the classifier, ŵ, given the estimated class membership of each document
Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-Supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.), Semi-Supervised Learning. MIT Press: Boston. 2006.
Instead of hard labels:
tokencounter = label_token.map(lambda lt: (lt[1], (int(lt[0]), int(not lt[0]))))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]

use probabilities:
# [(u'https', [0.1, 0.3]), (u'fashionblog', [0.01, 0.11]), (u'dress', [0.02, 0.02]), (u'com', [0.13, 0.05]), ...]
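How the loop could look on RDDs; train_nb, predict_proba and train_nb_from_counts are illustrative stand-ins for the model-fitting code above, not the talk's actual implementation:

model = train_nb(labelled_docs)  # initial classifier from the labelled data only
for _ in range(10):
    # E-step: soft class membership P(True | doc) for each unlabelled document
    soft = unlabelled_docs.map(lambda doc: (doc, predict_proba(model, doc)))
    # M-step: re-estimate counts, with the probabilities as fractional labels
    tokencounter = soft.flatMap(lambda dp: [(tok, (dp[1], 1 - dp[1])) for tok in dp[0]])
    model = train_nb_from_counts(tokencounter.reduceByKey(lambda x, y: list(map(add, x, y))))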
Naive Bayes in Spark: EM
500K labelled examples: Precision 0.27, Recall 0.15, F1 0.099
Add 10M unlabelled examples, 10 EM iterations: Precision 0.26, Recall 0.31, F1 0.14
240M training examples: Precision 0.31, Recall 0.19, F1 0.12
Add 250M unlabelled examples, 10 EM iterations: Precision 0.26, Recall 0.22, F1 0.12
PySpark Memory and Configuration: Worked Example
10 x r3.4xlarge (122GB, 16 cores each)
Use half of each node for the executor: 60GB
Use 12 cores per node: 120 cores in total
OS: ~12GB per node
Each Python worker: ~4GB; 12 x 4GB = 48GB per node
Cache: 60% x 60GB x 10 nodes = 360GB
Each Java thread: 40% x 60GB / 12 = ~2GB
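As Spark 1.x era configuration this budget would look roughly like the following; the property names are from that era and the values are just this example's assumptions:

spark.executor.memory         60g
spark.executor.cores          12
spark.python.worker.memory    4g
spark.storage.memoryFraction  0.6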
more here: http://files.meetup.com/13722842/Spark%20Meetup.pdf