Post on 20-May-2020

Machine Learning meets Databases

Ioannis Papapanagiotou

Cloud Database Engineering

Create personalized recommendations for discovering engaging video content that maximizes member joy.

Personalize Everything

90 seconds

What do caches touch?

● Signing up*
● Logging in
● Choosing a profile
● Picking liked videos
● Personalization*
● Loading home page*
● Scrolling home page*
● A/B tests
● Video image selection
● Searching*
● Viewing title details
● Playing a title*
● Subtitle / language prefs
● Rating a title
● My List
● Video history*
● UI strings
● Video production*

* multiple caches involved

Key-Value store optimized for AWS and tuned for Netflix

Ephemeral Volatile Cache

What is EVCache?

● Distributed, sharded, replicated key-value store
● Tunable in-region and global replication
● Based on Memcached
● Resilient to failure
● Topology aware
● Linearly scalable
● Seamless deployments

Why Optimize for AWS

● Instances disappear
● Zones fail
● Regions become unstable
● Network is lossy
● Customer requests bounce between regions

Failures happen and we test all the time

EVCache Use @ Netflix

● Hundreds of terabytes of data
● Trillions of ops / day
● Tens of billions of items stored
● Tens of millions of ops / sec
● Millions of replications / sec
● Thousands of servers
● Hundreds of instances per cluster
● Hundreds of microservice clients
● Tens of distinct clusters
● 3 regions

Architecture

[Diagram: Server (Memcached, EVCar); Application (Client Library, Client); Eureka (Service Discovery)]

Architecture

[Diagram: zones us-west-2a, us-west-2b, us-west-2c; three clients]

Reading (get)

[Diagram: zones us-west-2a, us-west-2b, us-west-2c; one client; labels: Primary, Secondary]

Writing (set, delete, add, etc.)

[Diagram: zones us-west-2a, us-west-2b, us-west-2c; three clients]
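
The read and write slides can be sketched in a few lines. This is a minimal in-memory model, not the EVCache client: it assumes one complete copy of the data per availability zone, writes that fan out to every zone, and reads that try the local zone first and fall back to another zone. The class name `ZoneReplicatedCache` and the key names are hypothetical.

```python
class ZoneReplicatedCache:
    """Toy model: one full copy of the data per availability zone."""

    def __init__(self, zones, local_zone):
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self.local_zone = local_zone
        # Each dict stands in for the Memcached servers of one zone.
        self.replicas = {zone: {} for zone in zones}

    def set(self, key, value):
        # Writes go to every zone so each copy stays complete.
        for store in self.replicas.values():
            store[key] = value

    def delete(self, key):
        for store in self.replicas.values():
            store.pop(key, None)

    def get(self, key):
        # Reads try the local zone first (primary), then the others (secondary).
        order = [self.local_zone] + [z for z in self.replicas if z != self.local_zone]
        for zone in order:
            value = self.replicas[zone].get(key)
            if value is not None:
                return value
        return None


cache = ZoneReplicatedCache(["us-west-2a", "us-west-2b", "us-west-2c"], "us-west-2a")
cache.set("profile:42", "row-of-titles")
cache.replicas["us-west-2a"].clear()      # simulate losing the local zone's copy
fallback_value = cache.get("profile:42")  # served by another zone
```

Because every zone holds a full copy, losing one zone costs latency on the fallback read rather than data.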

Use Case: Lookaside Cache

[Diagram: Application (Client Library, Client); REST/gRPC Client; four S nodes; four C nodes; arrows show data flow]
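
A lookaside cache is usually implemented as "check the cache, and on a miss call the backing service and store the result". The sketch below illustrates that general pattern under stated assumptions: a plain dict stands in for the cache, and `slow_title_service` is a hypothetical stand-in for the service behind the REST/gRPC client.

```python
class LookasideReader:
    """Read-through helper: consult the cache, fall back to the service on a miss."""

    def __init__(self, cache, load_from_service):
        self.cache = cache                      # any dict-like store
        self.load_from_service = load_from_service
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        # Miss: ask the backing service, then populate the cache for next time.
        self.misses += 1
        value = self.load_from_service(key)
        self.cache[key] = value
        return value


def slow_title_service(key):
    # Hypothetical backing service; a real one would be a remote call.
    return f"details-for-{key}"


reader = LookasideReader({}, slow_title_service)
first = reader.get("title:1")   # miss, loaded from the service
second = reader.get("title:1")  # hit, served from the cache
```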

Use Case: Transient Data Store

[Diagram: three applications, each with a Client Library and Client, arranged along a time axis]
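
One way to picture a transient data store is short-lived state that one application writes and another reads a little later, after which the entry expires. The sketch below assumes entries carry a time-to-live (Memcached supports expiration times, but the slide does not say how TTLs are used here); `TransientStore` and the key name are hypothetical, and the clock is injected so the example is deterministic.

```python
class TransientStore:
    """Toy key-value store whose entries expire after a time-to-live."""

    def __init__(self, clock):
        self.clock = clock   # callable returning the current time in seconds
        self.items = {}      # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self.items[key] = (value, self.clock() + ttl_seconds)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Expired entries are dropped lazily on read.
            del self.items[key]
            return None
        return value


now = [0.0]  # mutable fake clock
store = TransientStore(clock=lambda: now[0])
store.set("playback-session:7", {"position": 120}, ttl_seconds=30)
seen_by_second_app = store.get("playback-session:7")  # still live
now[0] = 31.0
after_expiry = store.get("playback-session:7")         # expired
```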

Use Case: Primary Store

Offline / Nearline Precomputes for Recommendations

[Diagram: Online Services; Offline Services; Online Application (Client Library, Client); arrows show data flow]

Use Case: Impression store

[Diagram: Hive; Online Services; Offline Services; Online Application (Client Library, Client); arrows show data flow]

Pipeline of Personalization

[Diagram: Compute A, Compute B, Compute C, Compute D, Compute E; Online Services; Offline Services; Online 1, Online 2; arrows show data flow]

Additional Features

Kafka
● Global data replication
● Consistency metrics

Key Iteration
● Cache warming
● Lost instance recovery
● Backup (and restore)
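
Cache warming by key iteration can be sketched as "walk the keys of an existing replica and copy them into the new one". The code below is a minimal illustration under stated assumptions, not the EVCache warmer: dicts stand in for replicas, the batch size is arbitrary, and keys already present in the target are left alone on the assumption that they are newer writes. `warm_replica` and `_copy_batch` are hypothetical names.

```python
def _copy_batch(source, target, keys):
    """Copy the given keys, skipping ones the target already holds."""
    copied = 0
    for key in keys:
        if key in source and key not in target:
            target[key] = source[key]
            copied += 1
    return copied


def warm_replica(source, target, batch_size=2):
    """Iterate over every key in source and copy it to target in batches."""
    copied = 0
    batch = []
    for key in list(source):  # key iteration over the existing replica
        batch.append(key)
        if len(batch) == batch_size:
            copied += _copy_batch(source, target, batch)
            batch = []
    if batch:
        copied += _copy_batch(source, target, batch)
    return copied


existing_replica = {"a": 1, "b": 2, "c": 3}
new_replica = {"c": 99}  # already received a fresh write for "c"
copied_count = warm_replica(existing_replica, new_replica)
```

The same iteration could feed lost-instance recovery or a backup, with the target being a replacement node or a file instead of a new replica.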

Cross-Region Replication

[Diagram: an APP, Repl Proxy, Repl Relay, and Kafka in each of Region A and Region B]

1 mutate
2 send metadata
3 poll msg
4 get data for set
5 https send msg
6 mutate
7 read
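
The numbered replication steps can be walked through with a small simulation. This is a sketch under stated assumptions: a `deque` stands in for the Kafka topic, a direct function call stands in for the HTTPS hop between the relay and the remote proxy, and the names `Region`, `relay`, and `proxy_apply` are hypothetical.

```python
from collections import deque


class Region:
    def __init__(self, name):
        self.name = name
        self.cache = {}
        self.queue = deque()  # stands in for the Kafka topic

    def app_set(self, key, value):
        self.cache[key] = value          # 1. mutate the local cache
        self.queue.append(("set", key))  # 2. send metadata only, not the value

    def app_delete(self, key):
        self.cache.pop(key, None)
        self.queue.append(("delete", key))


def proxy_apply(region, op, key, value):
    # 6. the remote proxy mutates its own region's cache
    if op == "set":
        region.cache[key] = value
    elif op == "delete":
        region.cache.pop(key, None)


def relay(source, destination):
    """Drain the source queue and forward each mutation to the destination."""
    sent = 0
    while source.queue:
        op, key = source.queue.popleft()   # 3. poll msg
        value = None
        if op == "set":
            if key not in source.cache:
                continue                   # value already gone, nothing to ship
            value = source.cache[key]      # 4. get data for set
        proxy_apply(destination, op, key, value)  # 5. send msg (HTTPS in the slide)
        sent += 1
    return sent


region_a = Region("A")
region_b = Region("B")
region_a.app_set("title:9", "v1")
region_a.app_set("title:9", "v2")
sent = relay(region_a, region_b)
remote_value = region_b.cache.get("title:9")  # 7. read in the other region
```

Queuing only metadata keeps the messages small, and because the relay fetches the value at send time, the remote region receives the latest value even when a key was written several times.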

Open Source

https://github.com/netflix/EVCache (client and REST proxy)

Viewing History

Requirements for Viewing History
● Time series dataset
● Support high writes
● Cross region replication
● Large Dataset

Growth of Viewing History

1) Massively scalable architecture
2) Multi-datacenter, multi-directional replication
3) Linear scale performance
4) Transparent fault detection and recovery
5) Flexible, dynamic schema data

Viewing History

1) Apply Custom Filters (user, device, subtitle, episode, season)

2) Tunable consistency to trade off performance against data consistency
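
Tunable consistency is commonly reasoned about with replica counts: with N replicas, a read that contacts R of them is guaranteed to overlap a write acknowledged by W of them when R + W > N. The sketch below shows that arithmetic only; the level names `ONE`, `QUORUM`, and `ALL` are illustrative assumptions and are not taken from the slides.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def reads_see_latest_write(n, w, r):
    """True when every read set must overlap every write set."""
    return r + w > n


def levels(n):
    # Illustrative level names mapped to the number of replicas contacted.
    return {"ONE": 1, "QUORUM": quorum(n), "ALL": n}


n = 3
lv = levels(n)
fast_but_weak = reads_see_latest_write(n, lv["ONE"], lv["ONE"])
balanced = reads_see_latest_write(n, lv["QUORUM"], lv["QUORUM"])
write_heavy = reads_see_latest_write(n, lv["ONE"], lv["ALL"])
```

Contacting fewer replicas lowers latency and keeps the store available through more failures, at the cost of possibly reading stale data.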

Growth of Viewing History

New Data Model

Use Case: A/B Metadata

● Wanted to capture information about each test
  ○ Owner
  ○ Properties
  ○ Start time / End time
  ○ Allocation

Dynomite

A framework that makes non-distributed data stores distributed. It can be used with many key-value storage engines.

Features: highly available, automatic failover, node warmup, tunable consistency, backups/restores

Pluggable Storage Engines

● Layer on top of a non-distributed key value data store
  ○ Peer-peer, Shared Nothing
  ○ Auto-Sharding
  ○ Multi-datacenter
  ○ Linear scale
  ○ Replication
  ○ Gossiping
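
Auto-sharding on top of a non-distributed store is often done with a token ring: each node owns a range of hash values, and a key goes to the node whose token follows the key's hash. The sketch below illustrates that general idea under stated assumptions. It assumes evenly spaced tokens, uses MD5 purely for illustration, and models replication as one ring per rack; Dynomite's actual hashing and token assignment may differ, and `TokenRing`, `replicas`, and the node names are hypothetical.

```python
import bisect
import hashlib


class TokenRing:
    """Evenly spaced tokens; a key belongs to the first token >= its hash."""

    def __init__(self, nodes, ring_size=2**32):
        self.nodes = list(nodes)
        self.ring_size = ring_size
        step = ring_size // len(self.nodes)
        self.tokens = [i * step for i in range(len(self.nodes))]

    def token_for(self, key):
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.ring_size

    def owner(self, key):
        # Wrap around to the first node when the hash is past the last token.
        idx = bisect.bisect_left(self.tokens, self.token_for(key)) % len(self.nodes)
        return self.nodes[idx]


def replicas(rings_by_rack, key):
    """One owner per rack, assuming each rack holds a full copy of the data."""
    return {rack: ring.owner(key) for rack, ring in rings_by_rack.items()}


rings = {
    "rack-1": TokenRing(["r1-n1", "r1-n2", "r1-n3"]),
    "rack-2": TokenRing(["r2-n1", "r2-n2", "r2-n3"]),
}
owners = replicas(rings, "abtest:1234")
```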

Replication

Dyno - Java Client

● Connection Pooling
● Load Balancing
● Effective failover
● Pipelining
● Scatter/Gather
● Metrics
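
Load balancing with failover, two of the client features listed, can be sketched as a round-robin over hosts that moves on to the next host when one fails. This is a generic illustration, not the Dyno API: `FailoverPool`, `ping`, and the host names are hypothetical, and a raised `ConnectionError` stands in for a failed connection.

```python
import itertools


class FailoverPool:
    """Round-robin over hosts, retrying on the next host after a failure."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self._cycle = itertools.cycle(range(len(self.hosts)))

    def execute(self, operation):
        last_error = None
        # Try each host at most once per call.
        for _ in range(len(self.hosts)):
            host = self.hosts[next(self._cycle)]
            try:
                return operation(host)
            except ConnectionError as err:
                last_error = err
        raise ConnectionError("all hosts failed") from last_error


down_hosts = {"node-a"}


def ping(host):
    # Hypothetical operation; a real client would send a command over a pooled connection.
    if host in down_hosts:
        raise ConnectionError(host)
    return f"PONG from {host}"


pool = FailoverPool(["node-a", "node-b", "node-c"])
first_reply = pool.execute(ping)   # node-a fails, node-b answers
second_reply = pool.execute(ping)  # rotation continues with node-c
```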

Moving Across Storage Engines

Data Explorer for Dynomite (UI)

Open Source

https://github.com/netflix

/dynomite Proxy (C)

/dyno Client (Jedis)

/dynomite-manager Sidecar (Tomcat Container)

/dyno-queues Distributed queue recipe (Java)

Other Datastores
● Source of truth: Hive backed by S3
● Elasticsearch
● MySQL, Postgres, AWS Aurora

We are Hiring!
https://jobs.netflix.com/jobs/865007

Twitter: @ipapapa
LinkedIn: https://www.linkedin.com/in/ipapapa/
GitHub: https://github.com/ipapapa

Thank you.
