[Research] deploying predictive models with the actor framework - Brian Gawalt

PAPIs 2015 Akka & Data Science: Making real-time predictions Brian Gawalt 2nd International Conference on Predictive APIs and Apps August 7, 2015

TRANSCRIPT

Page 1: [Research] deploying predictive models with the actor framework - Brian Gawalt

PAPIs 2015

Akka & Data Science: Making real-time predictions
Brian Gawalt
2nd International Conference on Predictive APIs and Apps
August 7, 2015

Page 2: [Research] deploying predictive models with the actor framework - Brian Gawalt

[A] Sometimes, data scientists need to worry about throughput.

Page 3: [Research] deploying predictive models with the actor framework - Brian Gawalt

[B] One way to increase throughput is with concurrency.

Page 4: [Research] deploying predictive models with the actor framework - Brian Gawalt

[C] The Actor Model is an easy way to build a concurrent system.

Page 5: [Research] deploying predictive models with the actor framework - Brian Gawalt

[D] Scala+Akka provides an easy-to-use Actor Model context.

Page 6: [Research] deploying predictive models with the actor framework - Brian Gawalt

[A + B + C + D ⇒ E] Data scientists should check out Scala+Akka.

Page 7: [Research] deploying predictive models with the actor framework - Brian Gawalt

Consider:
● building a model,
● vs. using a model

Page 8: [Research] deploying predictive models with the actor framework - Brian Gawalt

Lots of ways to practice building a model

Page 9: [Research] deploying predictive models with the actor framework - Brian Gawalt

The Classic Process

1. Load your data set’s raw materials
2. Produce feature vectors:
   o Training,
   o Validation,
   o Testing
3. Build the model with training and validation vectors
4. Use the model ...

Page 10: [Research] deploying predictive models with the actor framework - Brian Gawalt

The Classic Process: One-time Testing

Build:
1. Load train/valid./test materials
2. Make train/valid./test feature vectors
3. Train Model

Use:
4. Make test predictions

Page 11: [Research] deploying predictive models with the actor framework - Brian Gawalt

The Classic Process: Repeated Testing

Build:
1. Load train/valid. materials
2. Make train/valid. feature vectors
3. Train Model
   (saved model)

Use (repeat every K minutes):
4. Load test/new materials
5. Make test/new feature vectors
6. Make test/new predictions
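A minimal Python sketch of the repeated-testing flow: the model is built once, saved, and then reloaded on every scoring pass. The toy one-feature threshold model and the names `build_model` and `use_model` are assumptions made for illustration; nothing here comes from the talk itself.

```python
import pickle


def build_model(train_vectors, train_labels):
    """Build stage: fit a toy model and return it in saved (pickled) form.

    The "model" is a single threshold halfway between the two class means.
    """
    pos = [x for x, y in zip(train_vectors, train_labels) if y == 1]
    neg = [x for x, y in zip(train_vectors, train_labels) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return pickle.dumps({"threshold": threshold})


def use_model(saved_model, new_vectors):
    """Use stage: reload the saved model and score a fresh batch."""
    model = pickle.loads(saved_model)
    return [1 if x > model["threshold"] else 0 for x in new_vectors]


saved = build_model([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])  # done once
predictions = use_model(saved, [3.0, 7.0])               # repeated every K minutes
```

Only `use_model` sits inside the repeating loop, so its running time is what bounds how often predictions can be refreshed.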

Page 12: [Research] deploying predictive models with the actor framework - Brian Gawalt

Sometimes my tasks work like that, too!

Page 13: [Research] deploying predictive models with the actor framework - Brian Gawalt

But this talk is about the other kind of tasks.

Page 14: [Research] deploying predictive models with the actor framework - Brian Gawalt

[A] Sometimes, data scientists need to worry about throughput.

Page 15: [Research] deploying predictive models with the actor framework - Brian Gawalt

Example: Freelancer availability on Upwork

Page 16: [Research] deploying predictive models with the actor framework - Brian Gawalt

Hiring Freelancers on Upwork

1. Post a job
2. Search for freelancers
3. Find someone you like
4. Ask them to interview
   o Request Accepted!
   o or rejected/ignored...

THE TASK:

Look at recent freelancer behavior, and predict, at the time of Step 2, who’s likely to accept an invite at the time of Step 4

Page 17: [Research] deploying predictive models with the actor framework - Brian Gawalt

Building this model is business as usual:

Page 18: [Research] deploying predictive models with the actor framework - Brian Gawalt

Building Availability Model

1. Load raw materials:
   o Examples of accepts/rejects
   o Histories of freelancer site activity
     Job applications sent or received
     Hours worked
     Click logs
     Profile updates
2. Produce feature vectors:

Data sources labeled on the slide: Greenplum, Amazon S3, Internal Service

Page 19: [Research] deploying predictive models with the actor framework - Brian Gawalt

Using Availability Model

1. Load train/valid. materials
2. Make train/valid. feature vectors
3. Train Model
   (saved model)
4. Load test/new materials
5. Make test/new feature vectors
6. Make test/new predictions
   (repeat steps 4-6 every 60 minutes)

Page 20: [Research] deploying predictive models with the actor framework - Brian Gawalt

Using Availability Model

(saved model)

1. Load test/new materials
   o Load job app data (4 min.)
   o Load click log data (30 min.)
   o Load work hours data (5 min.)
   o Load profile data (20 ms/profile)
2. Make test/new feature vectors
3. Make test/new predictions

(repeat every 60 minutes)

Page 21: [Research] deploying predictive models with the actor framework - Brian Gawalt

Using Availability Model

Load job app data (4 min.)
Load click log data (30 min.)
Load work hours data (5 min.)
Load profile data (20 ms/profile)

● Left with under 21 minutes to collect profile data
  ○ Rate limit: 20 ms/profile
  ○ At most, 63K profiles per hour
● Six Million freelancers who need avail. predictions: expect ~90 hours between re-scoring any individual
● Still need to spend time actually building vectors and exporting scores!
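The budget on this slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch in Python: every figure is taken from the slide, the variable names are illustrative, and times are kept in integer milliseconds to avoid floating-point rounding.

```python
# All figures come from the slide; the names are illustrative.
CYCLE_MIN = 60
BULK_LOADS_MIN = {"job apps": 4, "click log": 30, "work hours": 5}
PROFILE_MS = 20            # rate limit: one profile per 20 ms
FREELANCERS = 6_000_000

# Minutes left in the hourly cycle once the three bulk loads are done.
profile_budget_min = CYCLE_MIN - sum(BULK_LOADS_MIN.values())

# Profiles that fit in that window at the rate limit.
profiles_per_cycle = profile_budget_min * 60 * 1000 // PROFILE_MS

# Hours needed to cycle through every freelancer once.
hours_between_rescoring = FREELANCERS / profiles_per_cycle

print(profile_budget_min)               # 21
print(profiles_per_cycle)               # 63000
print(round(hours_between_rescoring))   # 95
```

The result is roughly 95 hours, the same ballpark as the slide's ~90, and that is before any time is set aside for building vectors and exporting scores.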

Page 22: [Research] deploying predictive models with the actor framework - Brian Gawalt

[B] One way to increase throughput is with concurrency.

Page 23: [Research] deploying predictive models with the actor framework - Brian Gawalt

Expensive Option: Major infrastructure overhaul

Page 24: [Research] deploying predictive models with the actor framework - Brian Gawalt

… but that takes a lot of time, attention, and cooperation…

Page 25: [Research] deploying predictive models with the actor framework - Brian Gawalt

Simpler Option: The Actor Model

Page 26: [Research] deploying predictive models with the actor framework - Brian Gawalt

[C] The Actor Model is an easy way to build a concurrent system.

Page 27: [Research] deploying predictive models with the actor framework - Brian Gawalt

An Actor

● Imagine a mailbox with a brain
● Computation only begins when/if a message arrives
● Keeps its thoughts private:
  ○ No other actor can actively read this actor’s state
  ○ Other actors will have to wait to hear a message from this actor

Page 28: [Research] deploying predictive models with the actor framework - Brian Gawalt

The Actor Model of Concurrency

● Lots of Actors, and each has:
  ○ Private message queue
  ○ Private state, shared only by sending more messages
● Execution context:
  ○ Manages threading of each Actor’s computation
  ○ Handles asynch. message routing
  ○ Can send prescheduled messages
● Each received message’s computation is fully completed before Actor moves on to next message in queue
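The properties listed here can be sketched in plain Python, without Akka. This is an illustration only: the `Actor` and `Counter` classes and the `tell` and `stop` methods are hypothetical names, and each actor gets its own thread for simplicity, whereas an execution context would normally share a pool of threads among many actors.

```python
import queue
import threading


class Actor:
    """A mailbox with a brain: private queue, one message handled at a time."""

    _STOP = object()  # sentinel that ends the message loop

    def __init__(self):
        self._mailbox = queue.Queue()  # private message queue
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Finish everything already queued, then shut the actor down."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()  # nothing runs until a message arrives
            if message is self._STOP:
                break
            self.receive(message)  # completes before the next message is taken

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    def __init__(self, reply_to):
        self._count = 0            # private state, set before the thread starts
        self._reply_to = reply_to  # the only way the state leaves is a message
        super().__init__()

    def receive(self, message):
        if message == "increment":
            self._count += 1
        elif message == "report":
            self._reply_to.put(self._count)


replies = queue.Queue()
counter = Counter(replies)
for _ in range(1000):
    counter.tell("increment")
counter.tell("report")
counter.stop()
total = replies.get()  # 1000
```

Because only the actor's own thread ever touches `_count`, and messages are handled strictly one at a time, no lock is needed around the counter.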

Page 29: [Research] deploying predictive models with the actor framework - Brian Gawalt

The Actor Model of Concurrency

[Diagram: Execution Context]

Page 30: [Research] deploying predictive models with the actor framework - Brian Gawalt

Parallelizing predictions

● Refresh work hours (Schedule: Fetch once per hour)
● Refresh job apps (Schedule: Fetch once per hour)
● Refresh click log (Schedule: Fetch once per hour)
● Fetch 10 profiles (Schedule: Fetch every 300ms)
● Vectorizer:
  ○ Keep copies of raw data
  ○ Emit vector for each new profile received
● Apply model; export prediction

[Diagram: arrows labeled "raw data" connect the refresh steps to the Vectorizer]
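A rough Python sketch of the Vectorizer's two behaviors: keep the latest copy of each raw data set, and emit a vector whenever a profile arrives. The message classes `RawData` and `Profile`, their fields, and the feature layout are all hypothetical, and the `emit` callback stands in for sending a message to the next actor.

```python
from dataclasses import dataclass


@dataclass
class RawData:
    """Hypothetical message: one hourly bulk refresh."""
    source: str    # "job_apps", "work_hours" or "click_log"
    by_user: dict  # user id -> numeric summary for that user


@dataclass
class Profile:
    """Hypothetical message: one freshly fetched profile."""
    user_id: int
    completeness: float


class Vectorizer:
    SOURCES = ("job_apps", "work_hours", "click_log")

    def __init__(self, emit):
        self._raw = {}     # private state: latest copy of each raw data set
        self._emit = emit  # stands in for "send a message to the next actor"

    def receive(self, message):
        if isinstance(message, RawData):
            # Hourly refresh: replace the stored copy for this source.
            self._raw[message.source] = message.by_user
        elif isinstance(message, Profile):
            # New profile: build its vector from whatever raw data is on hand.
            features = [
                self._raw.get(source, {}).get(message.user_id, 0.0)
                for source in self.SOURCES
            ]
            features.append(message.completeness)
            self._emit((message.user_id, features))


emitted = []
vectorizer = Vectorizer(emitted.append)
vectorizer.receive(RawData("job_apps", {7: 3.0}))
vectorizer.receive(Profile(user_id=7, completeness=0.5))
```

Slow hourly refreshes and fast profile fetches arrive as independent messages, so the Vectorizer never has to wait for a bulk load before it can emit a vector.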

Page 31: [Research] deploying predictive models with the actor framework - Brian Gawalt

Serial processing

(repeat every 60 minutes)

1. Refresh job apps (4 min)
2. Refresh work hours (5 min)
3. Refresh click log (30 min)
4. Fetch ~50K profiles (55 - 4 - 5 - 30 = 16 min)
5. Make feature vectors; export predictions (5 min)

Steps 1-4: 55 min total

Page 32: [Research] deploying predictive models with the actor framework - Brian Gawalt

Serial processing

(repeat every 60 minutes)

1. Refresh job apps (4 min)
2. Refresh work hours (5 min)
3. Refresh click log (30 min)
4. Fetch ~50K profiles (55 - 4 - 5 - 30 = 16 min)
5. Make feature vectors; export predictions (5 min)

Steps 1-4: 55 min total

Throughput: 48K users/hr

Page 33: [Research] deploying predictive models with the actor framework - Brian Gawalt

Parallel Processing with Actors

[Timeline diagram: messages sent and rx’d across four threads (Thr. 1, Thr. 2, Thr. 3, Thr. 4). Tasks shown: Refresh job apps, Refresh click log, Refresh work hrs., Fetch pro., Rx data, Store, Vectorize, Export. Rate labels: 1/hr., 3/sec., (as rx’ed). Msg. processing time not to scale.]

Page 34: [Research] deploying predictive models with the actor framework - Brian Gawalt

Parallel Processing with Actors

[Timeline diagram: messages sent and rx’d across four threads (Thr. 1, Thr. 2, Thr. 3, Thr. 4). Tasks shown: Refresh job apps, Refresh click log, Refresh work hrs., Fetch pro., Rx data, Store, Vectorize, Export. Rate labels: 1/hr., 3/sec., (as rx’ed).]

Throughput: 180K users/hr
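Both throughput figures in the talk are consistent with the 20 ms/profile rate limit. The Python sketch below assumes that, in the actor version, profile fetching keeps the rate limit saturated for the full hour; that reading is an inference from the numbers, not something the slides state outright.

```python
PROFILE_MS = 20          # rate limit from the talk: one profile per 20 ms
MS_PER_MIN = 60 * 1000


def profiles_fetched(window_min):
    """Profiles that fit in a fetch window at the rate limit."""
    return window_min * MS_PER_MIN // PROFILE_MS


serial_per_hour = profiles_fetched(16)       # 16 min left after the bulk loads
concurrent_per_hour = profiles_fetched(60)   # fetching never waits on other loads
speedup = concurrent_per_hour / serial_per_hour

print(serial_per_hour)       # 48000
print(concurrent_per_hour)   # 180000
print(speedup)               # 3.75
```

Under that assumption the gain comes entirely from overlapping the profile fetches with the bulk loads; the rate limit itself is unchanged.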

Page 35: [Research] deploying predictive models with the actor framework - Brian Gawalt

[D] Scala+Akka provides an easy-to-use Actor Model context.

Page 36: [Research] deploying predictive models with the actor framework - Brian Gawalt

Message passing, scheduling, & computation behavior defined in 445 lines.

Page 37: [Research] deploying predictive models with the actor framework - Brian Gawalt

Scala+Akka Actors

● Create Scala class, mix in Actor trait

● Implement the required partial function: receive: PartialFunction[Any, Unit]

● Define family of message objects this actor’s planning to handle

● Define behavior for each message case in receive

Page 38: [Research] deploying predictive models with the actor framework - Brian Gawalt

Scala+Akka Actors

Annotations on the slide's code listing:

● Mixin same code used for export in non-Actor version
● Private, mutable state: stored scores
● Private, mutable state: time of last export
● If receiving new scores: store them!
● If storing lots of scores, or if it’s been a while: upload what’s stored, then erase them
● If told to shut down, stop accepting new scores
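The slide's Scala listing did not survive in this transcript. As a stand-in, here is a Python sketch of the behavior the annotations describe; it is not the talk's code and does not use Akka. The class and message names, both thresholds, and the injected `upload` and `clock` callables are hypothetical, and flushing leftover scores at shutdown is an added assumption.

```python
import time
from dataclasses import dataclass


@dataclass
class NewScores:
    """Hypothetical message carrying fresh predictions."""
    scores: dict  # user id -> predicted availability


class Shutdown:
    """Hypothetical message telling the exporter to stop."""


class ScoreExporter:
    def __init__(self, upload, max_stored=1000, max_age_sec=60.0,
                 clock=time.monotonic):
        self._upload = upload            # export routine, injected by the caller
        self._max_stored = max_stored    # "lots of scores" (invented threshold)
        self._max_age_sec = max_age_sec  # "been a while" (invented threshold)
        self._clock = clock
        self._stored = {}                # private, mutable state: stored scores
        self._last_export = clock()      # private, mutable state: last export time
        self._accepting = True

    def receive(self, message):
        if isinstance(message, NewScores) and self._accepting:
            self._stored.update(message.scores)  # new scores: store them
            too_many = len(self._stored) >= self._max_stored
            too_old = self._clock() - self._last_export >= self._max_age_sec
            if too_many or too_old:
                self._flush()
        elif isinstance(message, Shutdown):
            self._accepting = False  # stop accepting new scores
            self._flush()            # assumption: not stated on the slide

    def _flush(self):
        """Upload whatever is stored, then erase it."""
        if self._stored:
            self._upload(dict(self._stored))
            self._stored.clear()
        self._last_export = self._clock()


uploads = []
exporter = ScoreExporter(uploads.append, max_stored=2, max_age_sec=3600.0)
exporter.receive(NewScores({1: 0.9}))   # stored, below both thresholds
exporter.receive(NewScores({2: 0.1}))   # second score triggers an upload
exporter.receive(Shutdown())
exporter.receive(NewScores({3: 0.5}))   # ignored after shutdown
```

Passing `upload` in from outside loosely mirrors the slide's first annotation: the same export routine can be reused by a non-actor version of the job.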

Page 39: [Research] deploying predictive models with the actor framework - Brian Gawalt

Scala+Akka Pros

● Easy to get productive in the Scala language

● SBT dependency management makes it easy to move to any box with a JRE

● No global interpreter lock!

Page 40: [Research] deploying predictive models with the actor framework - Brian Gawalt

Scala+Akka Cons

● Moderate Scala learning curve

● Object representation on the JVM has pretty lousy memory efficiency

● Not a lot of great options for building models in Scala (compared to R, Python, Julia)

Page 41: [Research] deploying predictive models with the actor framework - Brian Gawalt

[A] Sometimes, data scientists need to worry about throughput.

Page 42: [Research] deploying predictive models with the actor framework - Brian Gawalt

[B] One way to increase throughput is with concurrency.

Page 43: [Research] deploying predictive models with the actor framework - Brian Gawalt

[C] The Actor Model is an easy way to build a concurrent system.

Page 44: [Research] deploying predictive models with the actor framework - Brian Gawalt

[D] Scala+Akka provides an easy-to-use Actor Model context.

Page 45: [Research] deploying predictive models with the actor framework - Brian Gawalt

[A + B + C + D ⇒ Z] Data scientists should check out Scala+Akka

Page 46: [Research] deploying predictive models with the actor framework - Brian Gawalt

Thanks! Questions?

bgawalt@{upwork, gmail}.com
twitter.com/bgawalt