distributed time travel for feature generation ... - db tsai · distributed time travel for feature...

30
Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi

Upload: others

Post on 28-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Distributed Time Travel for Feature Generation

Prasanna Padmanabhan DB Tsai

Mohammad H. Taghavi

Page 2: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Turn on Netflix, and the absolute best content for you would automatically

start playing

Page 3: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Ranking

Everything is a RecommendationR

ows

Over 80% of what members watch comes from our recommendations

Recommendations are driven by Machine Learning Algorithms

Page 4: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Data Driven

• Try an idea offline using historical data to see if it would have made better recommendations

• If it did, deploy a live A/B test to see if it performs well in Production

Page 5: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Why build a Time Machine?

Page 6: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Quickly try ideas on historical data and transition to online A/B test

Page 7: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

The Past•Generate features based on event data logged in Hive

– Need to reimplement features for online A/B test– Data discrepancies between offline and online sources

• Log features online where the model will be used– Need to deploy each idea into production

• Feature generation calls online services and filters data past a certain time

– Works only when a service records a log of historical events– Additional load on online services

Page 8: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

DeLorean image by JMortonPhoto.com & OtoGodfrey.com

Page 9: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Time Travel using Snapshots

• Snapshot online services and use the snapshot data offline to generate features

• Share facts and features between experiments without calling live systems

Page 10: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

How to build a Time Machine

Page 11: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Context Selection

Data Snapshots

APIs for Time Travel

Page 12: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Context Selection

Context Selection

Runs once a day

Hive

S3

Context SetStratified

SamplingContexts tagged with meta data

Page 13: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Data SnapshotsS3

Context Set

Data Snapshots Runs once

a day

S3

Snapshot

Prana (Netflix

Libraries)

Viewing History Service

MyList Service

Ratings Service

Snapshot data for each Context

Thrift

Parquet

Page 14: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

APIs for Time Travel

Page 15: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Data Architecture

S3

Snapshot

S3

Context Set

Runs once a day

Prana (Netflix

Libraries)

Viewing History Service

MyList Service

RatingsService

Context Selection

Runs once a day

HiveStratified Sampling

Contexts tagged with meta data

Thrift

Context Selection

Data Snapshots

Batch APIs

RDD of Snapshot Objects

Data Snapshots

Batch APIs

Page 16: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Generating Features via Time Travel

Page 17: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Great Scott!

• DeLorean: A time-traveling vehicle

– uses data snapshots to travel in time

– scales with Apache Spark

– prototypes new ideas with Zeppelin

– requires minimal code changes from

experimentation to A/B test to production

https://en.wikipedia.org/wiki/Emmett_Brown

There’s the DeLorean!

Page 18: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Running Time Travel Experiment

Select the destination time

Bring it up to 88 miles per hour!

Page 19: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Running Time Travel Experiment

Design Experiment

Collect Label Dataset

DeLorean: Offline Feature Generation

Distributed Model Training

Parallel training of individual

models using different

executors

Compute Validation Metrics

Model Testing

Choose best model

Design a New Experiment to Test Out Different Ideas

GoodMetrics

Offline Experiment

OnlineSystem

Online AB Testing

Bad Metrics

Selected Contexts

Page 20: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

DeLorean Input Data

• Contexts: The setting for evaluating a set of items (e.g. tuples of member profiles, country, time, device, etc.)

• Items: The elements to be trained on, scored, and/or ranked (e.g. videos, rows, search entities).

• Labels: For supervised learning, this will be the label (target) for each item.

Page 21: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Feature Encoders

• Compute features for each item in a given context

• Each type of raw data element has its own data key

• Data map is a map from data keys to data objects in a

given context

• Data map is consumed by feature encoder to compute

features

Page 22: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Two type of Data Elements• Context-dependent data elements

– Viewing History

– Mylist

– ...

• Context-independent data elements– Video Metadata

– Genre Metadata

– ...

Page 23: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Video Country of Origin Matching Fraction

Context-Items

Context: s

Items:

Context: s

Items:

Context Dependent

Data ElementViewing History

Context: s

Items:

Context: s

Items:

Context: s Items:

= 0.5

= 0.5

= 0.5

Context IndependentData Element

Video Metadata

Context: s Items:

= 1.0

= 0.0

= 1.0

Features

Page 24: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Feature GenerationS3

Snapshot

Model Training

Label Features

Feature EncodersLabel Data Feature EncodersData Elements

Feature Model(JSON) Feature EncodersFeature EncodersFeature Encoders

RequiredFeature Keys

Data

Data Map

Features

Data in POJOs

Data Keys

Data Keys

Page 25: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Features

•Represented in Spark’s DataFrames

•In nested structure to avoid data shuffling in ranking process

•Stored with Parquet format in S3

Page 26: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

FeaturesContext

Item, label, and features

Page 27: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Going OnlineS3

Snapshot

DeLorean: Offline Feature Generation

Online Ranking / Scoring Service

Model Training / Validation / Testing

Offline Experiment

Online SystemViewing History Service

MyList Service

RatingsService

Online Feature Generation

Deploymodels

Shared Feature Encoders

Page 28: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Conclusion

Spark helped us significantly reduce the time from an idea to an AB Test

Page 29: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

Future work

Event Driven Data Snapshots

Time Travel to the Future!!

Page 30: Distributed Time Travel for Feature Generation ... - DB Tsai · Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi. Turn on Netflix, and

We’re hiring!(come talk to us)

https://jobs.netflix.com/

Tech Blog: http://bit.ly/sparktimetravel