TRANSCRIPT
Women in Big Data x Pinterest
Welcome!
Regina Karson, WiBD Chapter Director
Tian-Ying Chang, Engineering Manager
Goku: Pinterest's In-House Time-Series Database
Tian-Ying Chang
Sr. Staff Engineer Manager
Pinterest
● Discover new ideas and find inspiration to do the things they love
○ 300M+ MAU, billions of pins
● Metrics for monitoring site health
○ Latency, QPS, CPU, memory
● Metrics about product quality
○ MAU, impressions, etc.
● The monitoring service needs to be fast, reliable, and scalable
● Graphite
○ Easy to set up at small scale
○ Downsampling supports long-range queries well
○ Hard to scale
○ Deprecated at Pinterest's current scale
● OpenTSDB
○ Rich query and tagging support
○ Easy to scale horizontally with the underlying HBase cluster
○ Long latency for high-cardinality data
○ Long latency for queries over longer time ranges
■ No downsampling
○ Heavy GC, worsened by the combination of heavy write QPS and long-range scans
Monitoring at Pinterest
● HBase schema
○ Row key: <metric><timestamp>[<tagk1><tagv1><tagk2><tagv2>...] (metric and tag keys/values are each encoded in 3 bytes)
○ Column qualifier: <delta to row key timestamp (up to 4 bytes)>
● Unnecessary scans
○ Query: m1{rpc=delete} [t1 to t2]
○ <m1><t1><host=h1><rpc=delete>
○ <m1><t1><host=h1><rpc=get>
○ <m1><t1><host=h1><rpc=put>
○ <m1><t2><host=h2><rpc=delete>
● Data size
○ 20 bytes per data point
● Aggregation
○ Read data onto one OpenTSDB instance and aggregate there
○ E.g. ostrich.gauges.singer.processor.stuck_processors{host=*}
● Serialization
○ JSON: very slow when there are many data points to return
Why OpenTSDB is not a good fit
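To make the unnecessary-scan point concrete, here is a small illustrative sketch (simplified string row keys instead of OpenTSDB's binary encoding): because tags come after the timestamp in the row key, a query for m1{rpc=delete} over [t1, t2] must scan every row in the time range regardless of its tags and filter afterwards.

```python
# Illustrative only: tags live *after* the timestamp in the row key, so a
# time-range scan on metric m1 reads rows for every tag combination.

rows = [  # sorted HBase rows: (<metric>, <timestamp>, <tags...>)
    ("m1", 1, {"host": "h1", "rpc": "delete"}),
    ("m1", 1, {"host": "h1", "rpc": "get"}),
    ("m1", 1, {"host": "h1", "rpc": "put"}),
    ("m1", 2, {"host": "h2", "rpc": "delete"}),
]

def query(metric, t1, t2, tag_filter):
    scanned, matched = 0, []
    for m, ts, tags in rows:                 # scan by metric prefix + time range
        if m == metric and t1 <= ts <= t2:
            scanned += 1                     # every row in range is read...
            if all(tags.get(k) == v for k, v in tag_filter.items()):
                matched.append((ts, tags))   # ...but only some rows match
    return scanned, matched

scanned, matched = query("m1", 1, 2, {"rpc": "delete"})
print(f"scanned {scanned} rows, matched {len(matched)}")  # scanned 4, matched 2
```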
[Diagram: OpenTSDB boxes on top of HBase RegionServers]
Goku is here to save
[Diagram: Statsboard (read client) and Ingestor (write client, fed by Kafka) talk to OpenTSDB boxes, which sit on HBase RegionServers]
● Read/write requests are sent to a randomly selected OpenTSDB box, then routed to the corresponding RegionServer based on row key range
● Reads: raw data is read from the individual HBase RegionServers, sent to an OpenTSDB box, aggregated there, and the result is sent to the client
Goku cluster
[Diagram: Statsboard (read client) and Ingestor (write client, fed by Kafka) send requests to any box in the Goku cluster]
● A Goku box is not only a storage engine, but also:
○ A proxy that routes requests
○ An aggregation engine
● A client can send requests to any Goku box, which will route them
○ Scatter and gather
Two-level sharding
● Group # is hashed from the metric name
○ E.g. tc.metrics.rpc_latency
● Shard # is hashed from the metric plus its set of tag keys and values
○ E.g. tc.metrics.rpc_latency{rpc=put,host=m1}
● Controls read fanout while keeping it easy to scale out individual groups
[Diagram: shards such as G1:S1, G1:S2, G2:S1, G2:S2, G3:S1, G4:S1 spread across Goku boxes, with a shard config for lookups]
Query flow:
1. Requests are sent to a random Goku box
2. Compute sharding to G2:S1 and S2, then look up the shard config
3. Route requests
4. Retrieve data and aggregate locally
5. Another round of aggregation
6. Return the response
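A minimal sketch of the two-level sharding described above; the hash function, group count, and per-group shard counts are illustrative, not Goku's actual values.

```python
# Level 1 hashes the metric name to a group; level 2 hashes the full time
# series (metric + sorted tags) to a shard within that group. A query for one
# metric therefore only fans out to the shards of a single group.
import zlib

NUM_GROUPS = 4
SHARDS_PER_GROUP = {1: 3, 2: 2, 3: 3, 4: 2}   # groups can scale out independently

def group_of(metric_name: str) -> int:
    return zlib.crc32(metric_name.encode()) % NUM_GROUPS + 1

def shard_of(metric_name: str, tags: dict) -> tuple:
    group = group_of(metric_name)
    key = metric_name + "".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return group, zlib.crc32(key.encode()) % SHARDS_PER_GROUP[group] + 1

# Different tag combinations of one metric spread across one group's shards.
print(shard_of("tc.metrics.rpc_latency", {"rpc": "put", "host": "m1"}))
```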
Goku #1: Time-Series Database based on Beringei
Beringei
● In-memory key-value store
○ Key: string
○ Value: list of <timestamp, value> pairs
● Gorilla compression
○ Delta-of-delta encoding on timestamps
○ XOR-based encoding on values
● Stores the most recent 24 hours of data (configurable)
● One level of sharding to distribute data
● Data point size reduced from 20 bytes to 1.37 bytes
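A toy sketch of the timestamp half of Gorilla compression: because most series arrive at a fixed interval, the delta-of-delta is usually zero, which the real format encodes in a single bit (the bit-packing itself is omitted here).

```python
# Delta-of-delta on timestamps, sketched: store the first timestamp, then the
# change in the *delta* between consecutive points. Regular series yield zeros.
timestamps = [60, 120, 180, 240, 301, 360]   # roughly once a minute

def delta_of_delta(ts):
    out, prev, prev_delta = [ts[0]], ts[0], 0
    for t in ts[1:]:
        delta = t - prev
        out.append(delta - prev_delta)       # 0 for a perfectly regular series
        prev, prev_delta = t, delta
    return out

print(delta_of_delta(timestamps))  # [60, 60, 0, 0, 1, -2] -> mostly tiny ints
```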
[Diagram: a Beringei shard — writes are Gorilla-encoded into in-memory time-series buckets, reads are Gorilla-decoded; buckets are persisted to disk]
Goku #2: Query Engine -- Inverted Index
Inverted Index
● A map from a search term to its bitset
● Built while processing incoming data points
● Fast lookup when serving queries
● Supported query filters:
○ ExactMatch: metricname{host=h1,api=get} => intersect the bitsets of metricname, host=h1, and api=get
○ Or: metricname{host=h1|h2} => union the bitsets of host=h1 and host=h2, then intersect with the bitset of metricname
○ Nor: metricname{host=not_literal_or(h1|h2)} => remove the bitsets of host=h1 and host=h2 from the bitset of metricname
○ Wildcard: (a) metricname{host=*} => intersect the bitsets of metricname and host=*; (b) metricname{host=h*} => convert to a regex filter
○ Regex: metricname{host=h[1|2].*, api=get, az=us-east-1} => apply the other filters first, then build a regex pattern from the filter values and iterate over the full metric names of all ids that survive the other filters
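A minimal sketch of the filter evaluation above, with Python sets standing in for the compressed bitsets over time-series ids.

```python
# Each search term maps to the set of time-series ids that contain it.
index = {
    "metricname": {1, 2, 3, 4},
    "host=h1":    {1, 2},
    "host=h2":    {3},
    "api=get":    {1, 3},
}

# ExactMatch metricname{host=h1,api=get}: intersect all term bitsets
exact = index["metricname"] & index["host=h1"] & index["api=get"]     # {1}

# Or metricname{host=h1|h2}: union the host bitsets, then intersect
either = index["metricname"] & (index["host=h1"] | index["host=h2"])  # {1, 2, 3}

# Nor metricname{host=not_literal_or(h1|h2)}: subtract both host bitsets
neither = index["metricname"] - index["host=h1"] - index["host=h2"]   # {4}

print(exact, either, neither)
```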
[Diagram: Goku Phase #1 — the inverted index sits alongside the Gorilla-encoded in-memory buckets, which are persisted to disk]
Goku #3: Query Engine -- Aggregation
Aggregation
● Post-processing after retrieving all relevant time series
● Mimics OpenTSDB's aggregation layer
● Supports basic aggregators, including SUM, AVG, MAX, MIN, COUNT, DEV, and downsampling
● Versus OpenTSDB:
○ OpenTSDB aggregates on a single instance, since HBase RegionServers don't know how to aggregate
○ Goku aggregates in two phases: first on each leaf Goku node, then on the routing Goku node
○ This distributes the computation and saves data on the wire
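A sketch of the two-phase idea for SUM/COUNT-style aggregators (shapes and names are illustrative, not Goku's implementation): leaves ship partial aggregates instead of raw points, and the routing node merges them.

```python
def leaf_aggregate(series):
    """series: list of [(timestamp, value), ...] lists on one leaf node."""
    partial = {}                                  # timestamp -> [sum, count]
    for s in series:
        for ts, v in s:
            acc = partial.setdefault(ts, [0.0, 0])
            acc[0] += v
            acc[1] += 1
    return partial

def routing_aggregate(partials):
    """Merge one partial per leaf, then finish an AVG without raw points."""
    merged = {}
    for p in partials:
        for ts, (s, c) in p.items():
            acc = merged.setdefault(ts, [0.0, 0])
            acc[0] += s
            acc[1] += c
    return {ts: s / c for ts, (s, c) in merged.items()}

leaf1 = leaf_aggregate([[(0, 1.0), (60, 2.0)], [(0, 3.0)]])
leaf2 = leaf_aggregate([[(0, 5.0), (60, 4.0)]])
print(routing_aggregate([leaf1, leaf2]))   # {0: 3.0, 60: 3.0}
```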
[Diagram: Goku Phase #1 — aggregation layer on top of the inverted index and the Gorilla-encoded in-memory buckets]
AWS EFS
● Stores log and data files for recovery
● POSIX compliant
● Data durability
● Operated asynchronously, so latency isn't an issue
● Easy to move shards
● Easy to use on AWS
[Diagram: Goku Phase #1 persisting its buckets, inverted index, and logs to AWS EFS]
Phase #2: Disk-based Goku
[Diagram: Goku Phase #2 — a Hadoop job compacts data from the in-memory tier into a distributed KV store (RockStore) for disk-based serving]
Goku Phase #2 -- Disk-based
● A Hadoop job runs continuously to compact data onto disk, with downsampling
● Data is stored in S3 for better availability and lower cost
● RocksDB is used for serving data online
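The downsampling step of the compaction can be pictured with a small sketch (the window size and averaging policy here are illustrative, not what the Hadoop job actually uses):

```python
# Bucket raw points into fixed windows and keep one averaged point per window.
def downsample(points, window_secs):
    buckets = {}
    for ts, v in points:
        buckets.setdefault(ts - ts % window_secs, []).append(v)
    return sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())

raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)]
print(downsample(raw, 60))   # [(0, 2.0), (60, 6.0)]
```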
● Replication
○ Currently dual-writing to two clusters for fault tolerance
○ Replication to improve availability and consistency
● More query possibilities
○ TopK
○ Percentile
● Analytics use cases
○ Another big consumer of time-series data
Next steps for Goku
Thanks!
Scheduling Asynchronous Tasks at Pinterest
Isabel Tallam
Data (Core Services) Team, Pinterest
Why asynchronous tasks?
Asynchronous task processing service
Design considerations
Why asynchronous tasks?
[Diagram: spam and junk content ("SPAM", "%$#*") piling up as an example workload]
Why asynchronous tasks?
Asynchronous task processing service
Design considerations
Pinlater: Asynchronous Task Processing Service
Pinlater features
- High throughput
- Easily create new tasks
- At-least-once guarantee
- Strict ack mechanism
- Metrics and debugging support
- Different task priorities
- Scheduling of future tasks
- Python and Java support
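A hypothetical sketch of the enqueue/dequeue/ack cycle; the class and method names are made up, not Pinlater's real API. Tasks stay in flight until explicitly acked, and a negative ack re-enqueues them, which is what makes delivery at-least-once.

```python
import itertools, time

class TaskQueue:
    _ids = itertools.count()

    def __init__(self):
        self.pending, self.in_flight = [], {}

    def enqueue(self, body, run_after_secs=0.0):
        # run_after_secs > 0 implements "scheduling future tasks"
        self.pending.append((time.time() + run_after_secs, body))

    def dequeue(self):
        now = time.time()
        for job in self.pending:
            if job[0] <= now:
                self.pending.remove(job)
                token = next(self._ids)
                self.in_flight[token] = job    # held until acked
                return token, job[1]
        return None

    def ack(self, token, success):
        job = self.in_flight.pop(token)        # strict ack: must be explicit
        if not success:
            self.pending.append(job)           # retried => at-least-once

q = TaskQueue()
q.enqueue({"task": "process_spam_report", "pin_id": 42})
token, body = q.dequeue()
q.ack(token, success=True)
```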
Pinlater: Asynchronous Task Processing Service
Pinlater components
[Diagram: clients send insert requests to Pinlater servers; Pinlater workers dequeue tasks and ack back; storage is sharded, each shard with a master and a slave]
Pinlater: Asynchronous Task Processing Service
Pinlater Stats
~1000 different tasks defined
~8 billion task instances processed per day
~3000 Pinlater hosts
Why asynchronous tasks?
Asynchronous task processing service
Design considerations
Pinlater: Asynchronous Task Processing Service
Storage Layer
[Diagram: Pinlater servers backed by sharded storage, each shard a master/slave pair, with a cache in front]
Pinlater: Asynchronous Task Processing Service
Handling failures in the system
[Diagram: clients insert tasks and workers ack them; a timeout monitor on the Pinlater servers watches in-flight tasks against the master/slave storage]
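The timeout monitor's role can be sketched as follows (illustrative data structures, not Pinlater's implementation): anything dequeued but never acked within its timeout goes back to the pending queue, so a crashed worker can't lose a task.

```python
import time

def reclaim_expired(in_flight, pending, timeout_secs, now=None):
    """in_flight: token -> (dequeue_time, job); mutates both structures."""
    now = now if now is not None else time.time()
    for token, (t0, job) in list(in_flight.items()):
        if now - t0 > timeout_secs:            # worker likely crashed
            del in_flight[token]
            pending.append(job)                # task will be retried

in_flight = {7: (0.0, {"task": "send_email"})}
pending = []
reclaim_expired(in_flight, pending, timeout_secs=30, now=60.0)
print(pending)   # [{'task': 'send_email'}]
```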
Thank You!
Experimentation at Pinterest
Lu Yang
Data (Data Analytics - Core Product Data) Team, Pinterest
Outline
1 Background
2 Platform
3 Architecture
What is an A/B experiment? It is a method of comparing two (or more) variations of something to determine which performs better against your target metrics.
With an Experiment Mindset:
Idea → Feature Development → Release to a small % of users → Measure impact → Release to 100% of users based on the impact of the sample launch
A randomized, controlled trial with measurement
[Diagram: all users split into CONTROL (existing code), ENABLED (changed code), and users not in the experiment]
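Deterministic bucketing is what makes such a trial reproducible; here is a minimal sketch (the hash salt and group boundaries are invented for illustration, not Pinterest's actual allocation):

```python
import hashlib

def bucket(user_id: str, experiment: str) -> str:
    # Hashing (experiment, user) gives a stable pseudo-random slot, so the
    # same user always lands in the same group for a given experiment.
    h = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    slot = h % 100
    if slot < 5:
        return "CONTROL"        # existing code
    if slot < 10:
        return "ENABLED"        # changed code
    return "NOT_IN_EXPERIMENT"  # everyone else

print(bucket("user_123", "new_home_feed"))
```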
Experimentation by the Numbers
[Chart: number of experiments over time]
● 140+ new experiments/wk
● 7 languages and platforms
● 800+ different metrics
● <5 min experiment setup
● 8.3/10 developer satisfaction score
Outline
1 Background
2 Platform
3 Architecture
Typical Experiment Timeline
Experiments vary, but here is a typical timeline:
Idea → Form a hypothesis → Feature Development → Experiment Setup and Launch → Analyze results → Make Decision and Iterate
Platform
● Experimentor
● Gatekeeper
● Experiment Dashboard
● Languages: Python, Java, Scala, Go, …
● Platforms: web, iOS, Android, ...
Outline
1 Background
2 Platform
3 Architecture
Experiment Data Pipeline
[Diagram: batch processing and real-time analytics pipelines, backed by HBase]
Airflow at Pinterest
Indy Prentice
Data (Big Data Compute) Team, Pinterest
What is a workflow?
● Define: what to run, dependencies, etc.
● Schedule: when to run it
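For instance, in Airflow (the subject of this talk) a single workflow file covers both halves; a minimal example, assuming Airflow 2.x import paths and made-up task names:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_workflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",       # when to run it
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    report = BashOperator(task_id="report", bash_command="echo report")
    extract >> report                 # what to run, and in what order
```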
Scheduled with Pinball
*check it out at https://github.com/pinterest/pinball!
[Diagram: a coordinator hands jobs — "train the visual search model", "make the search index", "rank the ads", "download all pin images", "count number of Pinterest users", "find great recommendations", "calculate experiment metrics" — to workers with a fixed number of slots; one slot is free]
Fixed number of workers and slots
Unfortunately…
Fixed number of workers and slots
[Diagram: nearly all worker slots sit free while only a couple of jobs run — capacity is wasted]
Fixed number of workers and slots
[Diagram: every worker slot is full and jobs queue up behind them — capacity runs out]
Fixed number of workers and slots
Shared environment
Unfortunately…
Shared host resources
[Diagram: jobs on the same worker compete for the host's CPU, memory, and disk]
Shared codebase
[Diagram: all jobs on the workers run from a single shared codebase]
Fixed number of workers and slots
Shared environment
Implementation doesn’t scale
Unfortunately…
● Industry + community support
● Will it support our scale?
○ We think so
● Will it solve the problems with Pinball?
○ With the Kubernetes executor
Enter Spinner (get it???)
Jobs run in containers
Containers are scheduled by Kubernetes
Scheduler submits jobs to run on k8s
[Diagram: a physical machine (8 CPUs, 10 GB of disk) packs jobs by their resource requests — "train the visual search model" (4 CPUs, 4 GB of disk), "download all images" (10 GB of disk), "generate experiment metrics" (2 CPUs, 1 GB of disk) — while reserved capacity "can't be touched"]
Enter Spinner (get it???)
Jobs run in containers
Containers are scheduled by Kubernetes
Scheduler submits jobs to run on k8s
[Diagram: tasks such as "train the visual search model", "make the search index", "rank the ads", "generate experiment metrics", "index the pins", "write 10TB of data to disk!", and "count number of Pinterest users" are matched to servers by the Kubernetes scheduler]
Enter Spinner (get it???)
Jobs run in containers
Containers are scheduled by Kubernetes
Scheduler submits jobs to run on k8s
Autoscaling happens at k8s level
Jobs run in isolated containers
Lightweight maintenance
On Airflow...
● Integration with Pinterest k8s infrastructure
● Scheduler scalability
● Integration with existing composers
Airflow @ Pinterest: Adoption challenges
Machine Learning and Big Data on K8S at Pinterest
June Liu
Infrastructure (Core Cloud - Cloud Management Platform) Team, Pinterest
Agenda
● Motivation
● Architecture
● Journey to K8S
● The Future

Motivation
● Unify orchestration and big data infrastructure
○ Simplify the tech stack and reduce operational overhead
● Trend in community support for ML and BD workloads on K8S
● Better interfaces and richer features, including CNI, CSI, Autoscaling, etc.
Architecture
Journey to Kubernetes
Sidecar
No good support for the sidecar lifecycle of run-to-finish Pods
● A lot of sidecars to nanny workloads:
○ Security
○ Metadata
○ Logs
○ Metrics
○ Traffic, service discovery
○ ...
Sidecar
No good support for the sidecar lifecycle of run-to-finish Pods
● Kill the Pod?
○ Pod gets recreated due to reconcile
● Force-mark the Pod state?
○ Confuses the scheduler and Kubelet
● Docker-kill the sidecar?
○ Messes up the restart policy
● Inter-container signals?
○ Not operationally scalable
Sidecar
No good support for the sidecar lifecycle of run-to-finish Pods
● Write our own and contribute!
○ Main container obeys the restart policy; sidecars always restart
○ Kill sidecars after the main container quits
○ Pod phase computed based on the main container's exit code
https://github.com/zhan849/kubernetes/tree/pinterest-sidecar-1.14.5
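The pod-phase rule can be sketched in a few lines (illustrative logic only, not the actual Kubernetes patch linked above):

```python
# Compute a pod's phase from its *main* containers only, ignoring sidecars.
def pod_phase(containers):
    """containers: list of dicts like
    {"name": str, "sidecar": bool, "running": bool, "exit_code": int or None}"""
    mains = [c for c in containers if not c["sidecar"]]
    if any(c["running"] for c in mains):
        return "Running"
    if any(c["exit_code"] is None for c in mains):
        return "Pending"          # a main container hasn't started yet
    if all(c["exit_code"] == 0 for c in mains):
        return "Succeeded"        # sidecars can now be killed
    return "Failed"

# Example: main job finished cleanly while a log-shipping sidecar still runs.
print(pod_phase([
    {"name": "job", "sidecar": False, "running": False, "exit_code": 0},
    {"name": "log-shipper", "sidecar": True, "running": True, "exit_code": None},
]))  # -> Succeeded
```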
Volume
Is PVC really an option for serving models?
● Ideally:
○ Flexibility to select the storage medium
○ Data isolation
○ Dynamic provisioning saves money
○ ...
Volume
Is PVC really an option for serving models?
● Actually…
○ The EBS provisioner is not efficient, nor is EC2
○ ~100% 500s with batch EBS provisioning
Volume
Is PVC really an option for serving models?
● DescribeInstance: by name -> by ID (#78140)
● Optimize EBS provisioner cloud provider calls (#78276)

Batch Size | Total Calls (Original → Optimized) | Peak QPS (Original → Optimized)
50  | 1360 → 116 | 52 → 8
75  | 1464 → 139 | 70 → 10
100 | 2427 → 157 | 75 → 8
150 | 4384 → 209 | 93 → 11
AutoScaling
Nodes scale slower than Pods
● This is expected; otherwise containers would be meaningless
● Long-tail scheduling of Parameter Servers can cause the whole job to fail
● Use bogus low-priority pods as cluster buffers to scale in advance
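A minimal sketch of such a buffer pod (a "balloon" pod built as a plain spec dict; the PriorityClass name and resource sizes are assumptions, not Pinterest's values). It holds capacity so the autoscaler adds nodes early, and real workloads preempt it on arrival.

```python
import json

def balloon_pod(name: str, cpus: str, memory: str) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumes a PriorityClass "cluster-buffer" exists with a very low
            # priority, so any real workload preempts these placeholders.
            "priorityClassName": "cluster-buffer",
            "containers": [{
                "name": "pause",
                "image": "k8s.gcr.io/pause:3.1",  # does nothing, just holds resources
                "resources": {"requests": {"cpu": cpus, "memory": memory}},
            }],
        },
    }

print(json.dumps(balloon_pod("buffer-0", "4", "16Gi"), indent=2))
```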
Our Future
Future Works
● Gang scheduling
● Federation layer takes care of data / network locality
● Fine-grained preemption, task queuing
● More caching storage options (PVC/EBS)
○ https://github.com/kubeflow/community/pull/263