[research] deploying predictive models with the actor framework - brian gawalt

Post on 21-Apr-2017


PAPIs 2015

Akka & Data Science: Making real-time predictions

Brian Gawalt

2nd International Conference on Predictive APIs and Apps

August 7, 2015

[A] Sometimes, data scientists need to worry about throughput.

[B] One way to increase throughput is with concurrency.

[C] The Actor Model is an easy way to build a concurrent system.

[D] Scala+Akka provides an easy-to-use Actor Model context.

[A + B + C + D ⇒ E] Data scientists should check out Scala+Akka.

Consider:
● building a model,
● vs. using a model

Lots of ways to practice building a model


The Classic Process

1. Load your data set’s raw materials

2. Produce feature vectors:

o Training,

o Validation,

o Testing

3. Build the model with training and validation vectors

4. Use the model

The Classic Process: One-time Testing

Build:
● Load train/valid./test materials
● Make train/valid./test feature vectors
● Train Model

Use:
● Make test predictions

The Classic Process: Repeated Testing

Build (produces a saved model):
● Load train/valid. materials
● Make train/valid. feature vectors
● Train Model

Use (repeat every K minutes, with the saved model):
● Load test/new materials
● Make test/new feature vectors
● Make test/new predictions

Sometimes my tasks work like that, too!


But this talk is about the other kind of tasks.


[A] Sometimes, data scientists need to worry about throughput.

Example: Freelancer availability on Upwork

Hiring Freelancers on Upwork

1. Post a job

2. Search for freelancers

3. Find someone you like

4. Ask them to interview

o Request Accepted!

o or rejected/ignored...

THE TASK:

Look at recent freelancer behavior, and predict, at time Step 2, who’s likely to accept an invite at time Step 4


Building this model is business as usual:


Building Availability Model

1. Load raw materials:

o Examples of accepts/rejects

o Histories of freelancer site activity

   ○ Job applications sent or received
   ○ Hours worked
   ○ Click logs
   ○ Profile updates

2. Produce feature vectors:

Data sources: Greenplum, Amazon S3, Internal Service

Using Availability Model

Produce a saved model:
● Load train/valid. materials
● Make train/valid. feature vectors
● Train Model

Repeat every 60 minutes, with the saved model:
● Load test/new materials
● Make test/new feature vectors
● Make test/new predictions

Using Availability Model

Repeat every 60 minutes, with the saved model:
● Load test/new materials:
   ○ Load job app data (4 min.)
   ○ Load click log data (30 min.)
   ○ Load work hours data (5 min.)
   ○ Load profile data (20 ms/profile)
● Make test/new feature vectors
● Make test/new predictions

Using Availability Model

● Load job app data (4 min.)
● Load click log data (30 min.)
● Load work hours data (5 min.)
● Load profile data (20 ms/profile)

● Left with under 21 minutes to collect profile data
   ○ Rate limit: 20 ms/profile
   ○ At most, 63K profiles per hour
● Six million freelancers who need avail. predictions: expect ~90 hours between re-scoring any individual
● Still need to spend time actually building vectors and exporting scores!
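The arithmetic behind these bullets can be checked with a short sketch. It assumes the load times listed on the slide and treats the 20 ms/profile rate limit as the only cost of fetching a profile. The names (`ProfileBudget`, `profilesPerCycle`, `hoursPerFullPass`) are hypothetical and do not come from the talk.

```scala
object ProfileBudget {
  val cycleMinutes     = 60 // predictions are refreshed once per hour
  val jobAppMinutes    = 4
  val clickLogMinutes  = 30
  val workHoursMinutes = 5
  val msPerProfile     = 20 // rate limit on the profile source
  val freelancers      = 6000000L

  // Minutes left in the hour once the three bulk loads have finished.
  val profileMinutes: Int =
    cycleMinutes - jobAppMinutes - clickLogMinutes - workHoursMinutes

  // Upper bound on profiles fetched per hourly cycle.
  val profilesPerCycle: Long = profileMinutes.toLong * 60 * 1000 / msPerProfile

  // Hours needed to fetch every freelancer's profile once at that rate.
  val hoursPerFullPass: Double = freelancers.toDouble / profilesPerCycle

  def main(args: Array[String]): Unit = {
    println(s"minutes left for profiles: $profileMinutes")
    println(s"profiles per hour: $profilesPerCycle")
    println(s"hours per full pass: $hoursPerFullPass")
  }
}
```

Under these assumptions the budget is 21 minutes and 63,000 profiles per hour, and a full pass over six million freelancers takes roughly 95 hours, the same order as the slide's "~90 hours".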

[B] One way to increase throughput is with concurrency.

Expensive Option: Major infrastructure overhaul

… but that takes a lot of time, attention, and cooperation…

Simpler Option: The Actor Model

[C] The Actor Model is an easy way to build a concurrent system.

An Actor

● Imagine a mailbox with a brain
● Computation only begins when/if a message arrives
● Keeps its thoughts private:
   ○ No other actor can actively read this actor’s state
   ○ Other actors will have to wait to hear a message from this actor

The Actor Model of Concurrency

● Lots of Actors, and each has:
   ○ Private message queue
   ○ Private state, shared only by sending more messages
● Execution context:
   ○ Manages threading of each Actor’s computation
   ○ Handles asynch. message routing
   ○ Can send prescheduled messages
● Each received message’s computation is fully completed before Actor moves on to next message in queue
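To make the mailbox idea concrete, here is a toy actor written with only the Scala and Java standard libraries. It is not Akka and not code from the talk: `ToyActor`, `Counter`, `Add`, and `Report` are hypothetical names, and one dedicated thread per actor stands in for a real execution context. It shows a private queue, private state, and one message handled to completion before the next is taken.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Toy actor: one mailbox, one worker thread, state touched only by that thread.
abstract class ToyActor[M] {
  private val mailbox = new LinkedBlockingQueue[Option[M]]()

  // Behavior for one message; runs to completion before the next is taken.
  protected def receive(msg: M): Unit

  private val worker = new Thread(() => {
    var running = true
    while (running) {
      mailbox.take() match {
        case Some(msg) => receive(msg)
        case None      => running = false // poison pill: leave the loop
      }
    }
  })
  worker.start()

  def !(msg: M): Unit = mailbox.put(Some(msg)) // asynchronous send
  def stop(): Unit = { mailbox.put(None); worker.join() }
}

sealed trait CounterMsg
final case class Add(n: Int) extends CounterMsg
final case class Report(replyTo: Int => Unit) extends CounterMsg

class Counter extends ToyActor[CounterMsg] {
  private var total = 0 // never read or written outside receive
  protected def receive(msg: CounterMsg): Unit = msg match {
    case Add(n)          => total += n
    case Report(replyTo) => replyTo(total)
  }
}

object ToyActorDemo {
  def run(): Int = {
    val counter = new Counter
    // Four sender threads share the actor without any explicit locking.
    val senders = (1 to 4).map { _ =>
      new Thread(() => (1 to 1000).foreach(_ => counter ! Add(1)))
    }
    senders.foreach(_.start())
    senders.foreach(_.join())
    val answer = Promise[Int]()
    counter ! Report(n => { answer.success(n); () })
    val total = Await.result(answer.future, 5.seconds)
    counter.stop()
    total
  }
}
```

Because the counter's state is only ever touched while handling a message, the four senders cannot corrupt it, and the only way to learn the total is to send a `Report` message and wait for the reply.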

[Figure: The Actor Model of Concurrency. Labeled element: Execution Context]

Parallelizing predictions

● Refresh work hours (Schedule: Fetch once per hour)
● Refresh job apps (Schedule: Fetch once per hour)
● Refresh click log (Schedule: Fetch once per hour)
● Fetch 10 profiles (Schedule: Fetch every 300ms)
● Vectorizer:
   ○ Keep copies of raw data
   ○ Emit vector for each new profile received
● Apply model; export prediction

Arrows in the diagram are labeled: raw data

Serial processing

(repeat every 60 minutes)

● Load data: 55 min
   ○ Refresh job apps: 4 min
   ○ Refresh work hours: 5 min
   ○ Refresh click log: 30 min
   ○ Fetch ~50K profiles: 55 - 4 - 5 - 30 = 16 min
● Make feature vectors; export predictions: 5 min

Serial processing

Throughput: 48K users/hr

Parallel Processing with Actors

[Figure: message timeline across four threads (Thr. 1 to Thr. 4). Message types: Refresh job apps, Refresh click log, Refresh work hrs., Fetch pro., Rx data, Store, Vectorize, Export. Rate labels: 1/hr., 3/sec., (as rx’ed). Legend: msg. sent, msg. rx’d. Msg. processing time not to scale.]

Parallel Processing with Actors

Throughput: 180K users/hr
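The slides do not spell out how the two throughput figures were derived. One reading that reproduces both numbers is that the 20 ms/profile rate limit is the bottleneck in each design, and that only the time available for fetching profiles changes. The sketch below works through that reading; `ThroughputComparison` and `profilesIn` are hypothetical names.

```scala
object ThroughputComparison {
  val msPerProfile = 20 // profile source rate limit from the earlier slide

  // Profiles that can be fetched in `minutes` at one profile per `msPerProfile` ms.
  def profilesIn(minutes: Int): Long = minutes.toLong * 60 * 1000 / msPerProfile

  // Serial: profile fetching only gets the 16 minutes left over in each hour.
  val serialPerHour: Long = profilesIn(55 - 4 - 5 - 30)

  // Actors: the profile fetcher keeps running for the whole hour
  // while the other loads proceed in their own actors.
  val parallelPerHour: Long = profilesIn(60)

  val speedup: Double = parallelPerHour.toDouble / serialPerHour

  def main(args: Array[String]): Unit =
    println(s"serial=$serialPerHour parallel=$parallelPerHour speedup=$speedup")
}
```

On this reading the gain comes from never leaving the rate-limited profile source idle, not from fetching any single profile faster.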

[D] Scala+Akka provides an easy-to-use Actor Model context.

Message passing, scheduling, & computation behavior defined in 445 lines.


Scala+Akka Actors

● Create Scala class, mix in Actor trait

● Implement the required partial function: receive: PartialFunction[Any, Unit]

● Define family of message objects this actor’s planning to handle

● Define behavior for each message case in receive
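The four steps above can be sketched with the Akka classic API, assuming the `akka-actor` library is on the classpath and a recent enough version that `ActorSystem.terminate()` is available. The actor and its messages (`ScoreCounter`, `NewScore`, `HowMany`) are hypothetical and are not taken from the talk's code.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Family of messages this actor plans to handle (hypothetical names).
sealed trait ScoreMsg
final case class NewScore(freelancerId: Long, score: Double) extends ScoreMsg
case object HowMany extends ScoreMsg

// A Scala class with the Actor trait mixed in.
class ScoreCounter extends Actor {
  private var seen = 0 // private state, touched only inside receive

  // The required partial function: one case per message type.
  def receive: PartialFunction[Any, Unit] = {
    case NewScore(_, _) => seen += 1
    case HowMany        => sender() ! seen
  }
}

object ScoreCounterDemo {
  def run(): Int = {
    val system  = ActorSystem("demo")
    val counter = system.actorOf(Props(new ScoreCounter), "counter")
    counter ! NewScore(1L, 0.9) // fire-and-forget sends
    counter ! NewScore(2L, 0.4)
    // Ask for a reply and block briefly; acceptable in a demo, rare in real actor code.
    implicit val timeout: Timeout = Timeout(3.seconds)
    val count = Await.result((counter ? HowMany).mapTo[Int], 3.seconds)
    system.terminate()
    count
  }
}
```

Messages that match no case in `receive` are not handled by the actor, which is why the message family is usually declared next to the actor that consumes it.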


Scala+Akka Actors

● Mix in same code used for export in non-Actor version
● Private, mutable state: stored scores
● Private, mutable state: time of last export
● If receiving new scores: store them!
● If storing lots of scores, or if it’s been a while: upload what’s stored, then erase them
● If told to shut down, stop accepting new scores
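The code these callouts annotate is not part of the transcript. The following is a guess at what an actor with that behavior could look like in Akka classic, not a reconstruction of the author's code. Every name (`ScoreUploader`, `ExportActor`, `NewScores`, `ShutDown`), both thresholds, and the choice to upload remaining scores on shutdown are assumptions.

```scala
import akka.actor.Actor
import scala.collection.mutable.ArrayBuffer

// Stand-in for the export code shared with the non-Actor version (hypothetical).
trait ScoreUploader {
  def upload(scores: Seq[(Long, Double)]): Unit =
    println(s"uploading ${scores.size} scores")
}

// Hypothetical message family.
final case class NewScores(scores: Seq[(Long, Double)])
case object ShutDown

class ExportActor(maxBuffered: Int, maxWaitMs: Long)
    extends Actor with ScoreUploader {

  // Private, mutable state: stored scores and time of last export.
  private val stored = ArrayBuffer.empty[(Long, Double)]
  private var lastExport = System.currentTimeMillis()

  // Upload what is stored, then erase it.
  private def flush(): Unit = {
    if (stored.nonEmpty) upload(stored.toList)
    stored.clear()
    lastExport = System.currentTimeMillis()
  }

  def receive: Receive = {
    case NewScores(scores) =>
      stored ++= scores
      val waited = System.currentTimeMillis() - lastExport
      // Lots of scores stored, or it has been a while since the last export.
      if (stored.size >= maxBuffered || waited >= maxWaitMs) flush()
    case ShutDown =>
      flush()                 // assumption: do not drop scores already stored
      context.become(stopped) // from here on, new scores are ignored
  }

  private def stopped: Receive = { case _ => () }
}
```

An instance could be started with `system.actorOf(Props(new ExportActor(1000, 60000L)))`, where both numbers are placeholders. Note that the time check only runs when a message arrives, so a quiet period delays the next export until the next batch of scores comes in.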


Scala+Akka Pros

● Easy to get productive in the Scala language

● SBT dependency management makes it easy to move to any box with a JRE

● No global interpreter lock!


Scala+Akka Cons

● Moderate Scala learning curve

● Object representation on the JVM has pretty lousy memory efficiency

● Not a lot of great options for building models in Scala (compared to R, Python, Julia)


[A] Sometimes, data scientists need to worry about throughput.

[B] One way to increase throughput is with concurrency.

[C] The Actor Model is an easy way to build a concurrent system.

[D] Scala+Akka provides an easy-to-use Actor Model context.

[A + B + C + D ⇒ E] Data scientists should check out Scala+Akka.

Thanks! Questions?

bgawalt@{upwork, gmail}.com

twitter.com/bgawalt
