TRANSCRIPT
Women in Big Data x Pinterest
Welcome!
Regina Karson, WiBD Chapter Director
Tian-Ying Chang, Engineering Manager
Goku: Pinterest's In-House Time-Series Database
Tian-Ying Chang
Sr. Staff Engineer Manager
Pinterest
● Discover new ideas and find inspiration to do the things they love
○ 300M+ MAU, billions of pins
● Metrics for monitoring site health
○ Latency, QPS, CPU, memory
● Metrics about product quality
○ MAU, impressions, etc.
● The monitoring service needs to be fast, reliable, and scalable
● Graphite
○ Easy to set up at small scale
○ Downsampling supports long-range queries well
○ Hard to scale
○ Deprecated at Pinterest's current scale
● OpenTSDB
○ Rich query and tagging support
○ Easy to scale horizontally with the underlying HBase cluster
○ Long latency for high-cardinality data
○ Long latency for queries over longer time ranges
■ No downsampling
○ Heavy GC, worsened by the combination of heavy write QPS and long-range scans
Monitoring at Pinterest
● HBase schema
○ Row key: <metric><timestamp>[<tagk1><tagv1><tagk2><tagv2>...] (metric and tag keys/values are each encoded in 3 bytes)
○ Column qualifier: <delta to row key timestamp (up to 4 bytes)>
● Unnecessary scans
○ Query: m1{rpc=delete} [t1 to t2]
○ <m1><t1><host=h1><rpc=delete>
○ <m1><t1><host=h1><rpc=get>
○ <m1><t1><host=h1><rpc=put>
○ <m1><t2><host=h2><rpc=delete>
● Data size
○ 20 bytes per data point
● Aggregation
○ Read data onto one OpenTSDB instance and aggregate there
○ E.g. ostrich.gauges.singer.processor.stuck_processors{host=*}
● Serialization
○ JSON: very slow when there are many data points to return
Why OpenTSDB is not a good fit
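To make the unnecessary-scan point concrete, here is a small illustrative sketch (simplified string row keys instead of OpenTSDB's binary encoding): because tags come after the timestamp in the row key, a query for m1{rpc=delete} over [t1, t2] must scan every row in the time range regardless of its tags and filter afterwards.

```python
# Illustrative only: tags live *after* the timestamp in the row key, so a
# time-range scan on metric m1 reads rows for every tag combination.

rows = [  # sorted HBase rows: (<metric>, <timestamp>, <tags...>)
    ("m1", 1, {"host": "h1", "rpc": "delete"}),
    ("m1", 1, {"host": "h1", "rpc": "get"}),
    ("m1", 1, {"host": "h1", "rpc": "put"}),
    ("m1", 2, {"host": "h2", "rpc": "delete"}),
]

def query(metric, t1, t2, tag_filter):
    scanned, matched = 0, []
    for m, ts, tags in rows:                 # scan by metric prefix + time range
        if m == metric and t1 <= ts <= t2:
            scanned += 1                     # every row in range is read...
            if all(tags.get(k) == v for k, v in tag_filter.items()):
                matched.append((ts, tags))   # ...but only some rows match
    return scanned, matched

scanned, matched = query("m1", 1, 2, {"rpc": "delete"})
print(f"scanned {scanned} rows, matched {len(matched)}")  # scanned 4, matched 2
```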
[Diagram: OpenTSDB boxes on top of HBase RegionServers]
Goku is here to save
[Diagram: Statsboard (read client) and Ingestor (write client, fed by Kafka) talk to OpenTSDB boxes, which sit on HBase RegionServers]
● Read/write requests are sent to a randomly selected OpenTSDB box, then routed to the corresponding RegionServer based on row key range
● Reads: raw data is read from the individual HBase RegionServers, sent to an OpenTSDB box, aggregated there, and the result is sent to the client
Goku cluster
[Diagram: Statsboard (read client) and Ingestor (write client, fed by Kafka) send requests to any box in the Goku cluster]
● A Goku box is not only a storage engine, but also:
○ A proxy that routes requests
○ An aggregation engine
● A client can send requests to any Goku box, which will route them
○ Scatter and gather
Two-level sharding
● Group # is hashed from the metric name
○ E.g. tc.metrics.rpc_latency
● Shard # is hashed from the metric plus its set of tag keys and values
○ E.g. tc.metrics.rpc_latency{rpc=put,host=m1}
● Controls read fanout while keeping it easy to scale out individual groups
[Diagram: shards such as G1:S1, G1:S2, G2:S1, G2:S2, G3:S1, G4:S1 spread across Goku boxes, with a shard config for lookups]
Query flow:
1. Requests are sent to a random Goku box
2. Compute sharding to G2:S1 and S2, then look up the shard config
3. Route requests
4. Retrieve data and aggregate locally
5. Another round of aggregation
6. Return the response
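A minimal sketch of the two-level sharding described above; the hash function, group count, and per-group shard counts are illustrative, not Goku's actual values.

```python
# Level 1 hashes the metric name to a group; level 2 hashes the full time
# series (metric + sorted tags) to a shard within that group. A query for one
# metric therefore only fans out to the shards of a single group.
import zlib

NUM_GROUPS = 4
SHARDS_PER_GROUP = {1: 3, 2: 2, 3: 3, 4: 2}   # groups can scale out independently

def group_of(metric_name: str) -> int:
    return zlib.crc32(metric_name.encode()) % NUM_GROUPS + 1

def shard_of(metric_name: str, tags: dict) -> tuple:
    group = group_of(metric_name)
    key = metric_name + "".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return group, zlib.crc32(key.encode()) % SHARDS_PER_GROUP[group] + 1

# Different tag combinations of one metric spread across one group's shards.
print(shard_of("tc.metrics.rpc_latency", {"rpc": "put", "host": "m1"}))
```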
Goku #1: Time-Series Database based on Beringei
Beringei
● In-memory key-value store
○ Key: string
○ Value: list of <timestamp, value> pairs
● Gorilla compression
○ Delta-of-delta encoding on timestamps
○ XOR-based encoding on values
● Stores the most recent 24 hours of data (configurable)
● One level of sharding to distribute data
● Data point size reduced from 20 bytes to 1.37 bytes
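A toy sketch of the timestamp half of Gorilla compression: because most series arrive at a fixed interval, the delta-of-delta is usually zero, which the real format encodes in a single bit (the bit-packing itself is omitted here).

```python
# Delta-of-delta on timestamps, sketched: store the first timestamp, then the
# change in the *delta* between consecutive points. Regular series yield zeros.
timestamps = [60, 120, 180, 240, 301, 360]   # roughly once a minute

def delta_of_delta(ts):
    out, prev, prev_delta = [ts[0]], ts[0], 0
    for t in ts[1:]:
        delta = t - prev
        out.append(delta - prev_delta)       # 0 for a perfectly regular series
        prev, prev_delta = t, delta
    return out

print(delta_of_delta(timestamps))  # [60, 60, 0, 0, 1, -2] -> mostly tiny ints
```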
[Diagram: a Beringei shard — writes are Gorilla-encoded into in-memory time-series buckets, reads are Gorilla-decoded; buckets are persisted to disk]
Goku #2: Query Engine -- Inverted Index
Inverted Index
● A map from a search term to its bitset
● Built while processing incoming data points
● Fast lookup when serving queries
● Supported query filters:
○ ExactMatch: metricname{host=h1,api=get} => intersect the bitsets of metricname, host=h1, and api=get
○ Or: metricname{host=h1|h2} => union the bitsets of host=h1 and host=h2, then intersect with the bitset of metricname
○ Nor: metricname{host=not_literal_or(h1|h2)} => remove the bitsets of host=h1 and host=h2 from the bitset of metricname
○ Wildcard: (a) metricname{host=*} => intersect the bitsets of metricname and host=*; (b) metricname{host=h*} => convert to a regex filter
○ Regex: metricname{host=h[1|2].*, api=get, az=us-east-1} => apply the other filters first, then build a regex pattern from the filter values and iterate over the full metric names of all ids that survive the other filters
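A minimal sketch of the filter evaluation above, with Python sets standing in for the compressed bitsets over time-series ids.

```python
# Each search term maps to the set of time-series ids that contain it.
index = {
    "metricname": {1, 2, 3, 4},
    "host=h1":    {1, 2},
    "host=h2":    {3},
    "api=get":    {1, 3},
}

# ExactMatch metricname{host=h1,api=get}: intersect all term bitsets
exact = index["metricname"] & index["host=h1"] & index["api=get"]     # {1}

# Or metricname{host=h1|h2}: union the host bitsets, then intersect
either = index["metricname"] & (index["host=h1"] | index["host=h2"])  # {1, 2, 3}

# Nor metricname{host=not_literal_or(h1|h2)}: subtract both host bitsets
neither = index["metricname"] - index["host=h1"] - index["host=h2"]   # {4}

print(exact, either, neither)
```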
[Diagram: Goku Phase #1 — the inverted index sits alongside the Gorilla-encoded in-memory buckets, which are persisted to disk]
Goku #3: Query Engine -- Aggregation
Aggregation
● Post-processing after retrieving all relevant time series
● Mimics OpenTSDB's aggregation layer
● Supports basic aggregators, including SUM, AVG, MAX, MIN, COUNT, DEV, and downsampling
● Versus OpenTSDB:
○ OpenTSDB aggregates on a single instance, since HBase RegionServers don't know how to aggregate
○ Goku aggregates in two phases: first on each leaf Goku node, then on the routing Goku node
○ This distributes the computation and saves data on the wire
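A sketch of the two-phase idea for SUM/COUNT-style aggregators (shapes and names are illustrative, not Goku's implementation): leaves ship partial aggregates instead of raw points, and the routing node merges them.

```python
def leaf_aggregate(series):
    """series: list of [(timestamp, value), ...] lists on one leaf node."""
    partial = {}                                  # timestamp -> [sum, count]
    for s in series:
        for ts, v in s:
            acc = partial.setdefault(ts, [0.0, 0])
            acc[0] += v
            acc[1] += 1
    return partial

def routing_aggregate(partials):
    """Merge one partial per leaf, then finish an AVG without raw points."""
    merged = {}
    for p in partials:
        for ts, (s, c) in p.items():
            acc = merged.setdefault(ts, [0.0, 0])
            acc[0] += s
            acc[1] += c
    return {ts: s / c for ts, (s, c) in merged.items()}

leaf1 = leaf_aggregate([[(0, 1.0), (60, 2.0)], [(0, 3.0)]])
leaf2 = leaf_aggregate([[(0, 5.0), (60, 4.0)]])
print(routing_aggregate([leaf1, leaf2]))   # {0: 3.0, 60: 3.0}
```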
[Diagram: Goku Phase #1 — aggregation layer on top of the inverted index and the Gorilla-encoded in-memory buckets]
AWS EFS
● Stores log and data files for recovery
● POSIX compliant
● Data durability
● Operated asynchronously, so latency isn't an issue
● Easy to move shards
● Easy to use on AWS
[Diagram: Goku Phase #1 persisting its buckets, inverted index, and logs to AWS EFS]
Phase #2: Disk-based Goku
[Diagram: Goku Phase #2 — a Hadoop job compacts data from the in-memory tier into a distributed KV store (RockStore) for disk-based serving]
Goku Phase #2 -- Disk-based
● A Hadoop job runs continuously to compact data onto disk, with downsampling
● Data is stored in S3 for better availability and lower cost
● RocksDB is used for serving data online
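The downsampling step of the compaction can be pictured with a small sketch (the window size and averaging policy here are illustrative, not what the Hadoop job actually uses):

```python
# Bucket raw points into fixed windows and keep one averaged point per window.
def downsample(points, window_secs):
    buckets = {}
    for ts, v in points:
        buckets.setdefault(ts - ts % window_secs, []).append(v)
    return sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())

raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)]
print(downsample(raw, 60))   # [(0, 2.0), (60, 6.0)]
```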
● Replication
○ Currently dual-writing to two clusters for fault tolerance
○ Replication to improve availability and consistency
● More query possibilities
○ TopK
○ Percentile
● Analytics use cases
○ Another big consumer of time-series data
Next steps for Goku
Thanks!
Scheduling Asynchronous Tasks at Pinterest
Isabel Tallam
Data (Core Services) Team, Pinterest
Why asynchronous tasks?
Asynchronous task processing service
Design considerations
Why asynchronous tasks?
[Diagram: spam and junk content ("SPAM", "%$#*") piling up as an example workload]
Why asynchronous tasks?
Asynchronous task processing service
Design considerations
Pinlater: Asynchronous Task Processing Service
Pinlater features
- High throughput
- Easily create new tasks
- At-least-once guarantee
- Strict ack mechanism
- Metrics and debugging support
- Different task priorities
- Scheduling of future tasks
- Python and Java support
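A hypothetical sketch of the enqueue/dequeue/ack cycle; the class and method names are made up, not Pinlater's real API. Tasks stay in flight until explicitly acked, and a negative ack re-enqueues them, which is what makes delivery at-least-once.

```python
import itertools, time

class TaskQueue:
    _ids = itertools.count()

    def __init__(self):
        self.pending, self.in_flight = [], {}

    def enqueue(self, body, run_after_secs=0.0):
        # run_after_secs > 0 implements "scheduling future tasks"
        self.pending.append((time.time() + run_after_secs, body))

    def dequeue(self):
        now = time.time()
        for job in self.pending:
            if job[0] <= now:
                self.pending.remove(job)
                token = next(self._ids)
                self.in_flight[token] = job    # held until acked
                return token, job[1]
        return None

    def ack(self, token, success):
        job = self.in_flight.pop(token)        # strict ack: must be explicit
        if not success:
            self.pending.append(job)           # retried => at-least-once

q = TaskQueue()
q.enqueue({"task": "process_spam_report", "pin_id": 42})
token, body = q.dequeue()
q.ack(token, success=True)
```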
Pinlater: Asynchronous Task Processing Service
Pinlater components
[Diagram: clients send insert requests to Pinlater servers; Pinlater workers dequeue tasks and ack back; storage is sharded, each shard with a master and a slave]
Pinlater: Asynchronous Task Processing Service
Pinlater Stats
~1000 different tasks defined
~8 billion task instances processed per day
~3000 Pinlater hosts
Why asynchronous tasks?
Asynchronous task processing service
Design considerations
Pinlater: Asynchronous Task Processing Service
Storage Layer
[Diagram: Pinlater servers backed by sharded storage, each shard a master/slave pair, with a cache in front]
Pinlater: Asynchronous Task Processing Service
Handling failures in the system
[Diagram: clients insert tasks and workers ack them; a timeout monitor on the Pinlater servers watches in-flight tasks against the master/slave storage]
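The timeout monitor's role can be sketched as follows (illustrative data structures, not Pinlater's implementation): anything dequeued but never acked within its timeout goes back to the pending queue, so a crashed worker can't lose a task.

```python
import time

def reclaim_expired(in_flight, pending, timeout_secs, now=None):
    """in_flight: token -> (dequeue_time, job); mutates both structures."""
    now = now if now is not None else time.time()
    for token, (t0, job) in list(in_flight.items()):
        if now - t0 > timeout_secs:            # worker likely crashed
            del in_flight[token]
            pending.append(job)                # task will be retried

in_flight = {7: (0.0, {"task": "send_email"})}
pending = []
reclaim_expired(in_flight, pending, timeout_secs=30, now=60.0)
print(pending)   # [{'task': 'send_email'}]
```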
Thank You!
Experimentation at Pinterest
Lu Yang
Data (Data Analytics - Core Product Data) Team, Pinterest
Outline
1 Background
2 Platform
3 Architecture
What is an A/B experiment? It is a method of comparing two (or more) variations of something to determine which performs better against your target metrics.
With an Experiment Mindset:
Idea → Feature Development → Release to a small % of users → Measure impact → Release to 100% of users based on the impact of the sample launch
A randomized, controlled trial with measurement
[Diagram: all users split into CONTROL (existing code), ENABLED (changed code), and users not in the experiment]
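Deterministic bucketing is what makes such a trial reproducible; here is a minimal sketch (the hash salt and group boundaries are invented for illustration, not Pinterest's actual allocation):

```python
import hashlib

def bucket(user_id: str, experiment: str) -> str:
    # Hashing (experiment, user) gives a stable pseudo-random slot, so the
    # same user always lands in the same group for a given experiment.
    h = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    slot = h % 100
    if slot < 5:
        return "CONTROL"        # existing code
    if slot < 10:
        return "ENABLED"        # changed code
    return "NOT_IN_EXPERIMENT"  # everyone else

print(bucket("user_123", "new_home_feed"))
```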
Experimentation by the Numbers
[Chart: number of experiments over time]
● 140+ new experiments/wk
● 7 languages and platforms
● 800+ different metrics
● <5 min experiment setup
● 8.3/10 developer satisfaction score
Outline
1 Background
2 Platform
3 Architecture
Typical Experiment Timeline
Experiments vary, but here is a typical timeline:
Idea → Form a hypothesis → Feature Development → Experiment Setup and Launch → Analyze results → Make Decision and Iterate
Platform
● Experimentor
● Gatekeeper
● Experiment Dashboard
● Languages: Python, Java, Scala, Go, …
● Platforms: web, iOS, Android, ...
Outline
1 Background
2 Platform
3 Architecture
Experiment Data Pipeline
[Diagram: batch processing and real-time analytics pipelines, backed by HBase]
Airflow at Pinterest
Indy Prentice
Data (Big Data Compute) Team, Pinterest
What is a workflow?
● Define: what to run, dependencies, etc.
● Schedule: when to run it
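For instance, in Airflow (the subject of this talk) a single workflow file covers both halves; a minimal example, assuming Airflow 2.x import paths and made-up task names:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_workflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",       # when to run it
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    report = BashOperator(task_id="report", bash_command="echo report")
    extract >> report                 # what to run, and in what order
```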
Scheduled with Pinball
*check it out at https://github.com/pinterest/pinball!
[Diagram: a coordinator hands jobs — "train the visual search model", "make the search index", "rank the ads", "download all pin images", "count number of Pinterest users", "find great recommendations", "calculate experiment metrics" — to workers with a fixed number of slots; one slot is free]
Fixed number of workers and slots
Unfortunately…
Fixed number of workers and slots
[Diagram: nearly all worker slots sit free while only a couple of jobs run — capacity is wasted]
Fixed number of workers and slots
[Diagram: every worker slot is full and jobs queue up behind them — capacity runs out]
Fixed number of workers and slots
Shared environment
Unfortunately…
Shared host resources
[Diagram: jobs on the same worker compete for the host's CPU, memory, and disk]
Shared codebase
[Diagram: all jobs on the workers run from a single shared codebase]
Fixed number of workers and slots
Shared environment
Implementation doesn’t scale
Unfortunately…
● Industry + community support
● Will it support our scale?
○ We think so
● Will it solve the problems with Pinball?
○ With the Kubernetes executor
Enter Spinner (get it???)
Jobs run in containers
Containers are scheduled by Kubernetes
Scheduler submits jobs to run on k8s
[Diagram: a physical machine (8 CPUs, 10 GB of disk) packs jobs by their resource requests — "train the visual search model" (4 CPUs, 4 GB of disk), "download all images" (10 GB of disk), "generate experiment metrics" (2 CPUs, 1 GB of disk) — while reserved capacity "can't be touched"]
Enter Spinner (get it???)
Jobs run in containers
Containers are scheduled by Kubernetes
Scheduler submits jobs to run on k8s
[Diagram: tasks such as "train the visual search model", "make the search index", "rank the ads", "generate experiment metrics", "index the pins", "write 10TB of data to disk!", and "count number of Pinterest users" are matched to servers by the Kubernetes scheduler]
Enter Spinner (get it???)
Jobs run in containers
Containers are scheduled by Kubernetes
Scheduler submits jobs to run on k8s
Autoscaling happens at k8s level
Jobs run in isolated containers
Lightweight maintenance
On Airflow...
● Integration with Pinterest k8s infrastructure
● Scheduler scalability
● Integration with existing composers
Airflow @ Pinterest: Adoption challenges
Machine Learning and Big Data on K8S at Pinterest
June Liu
Infrastructure (Core Cloud - Cloud Management Platform) Team, Pinterest
Agenda
● Motivation
● Architecture
● Journey to K8S
● The Future

Motivation
● Unify orchestration and big data infrastructure
○ Simplify the tech stack and reduce operational overhead
● Trend in community support for ML and BD workloads on K8S
● Better interfaces and richer features, including CNI, CSI, Autoscaling, etc.
Architecture
Journey to Kubernetes
Sidecar
No good support for the sidecar lifecycle of run-to-finish Pods
● A lot of sidecars to nanny workloads:
○ Security
○ Metadata
○ Logs
○ Metrics
○ Traffic, service discovery
○ ...
Sidecar
No good support for the sidecar lifecycle of run-to-finish Pods
● Kill the Pod?
○ Pod gets recreated due to reconcile
● Force-mark the Pod state?
○ Confuses the scheduler and Kubelet
● Docker-kill the sidecar?
○ Messes up the restart policy
● Inter-container signals?
○ Not operationally scalable
Sidecar
No good support for the sidecar lifecycle of run-to-finish Pods
● Write our own and contribute!
○ Main container obeys the restart policy; sidecars always restart
○ Kill sidecars after the main container quits
○ Pod phase computed based on the main container's exit code
https://github.com/zhan849/kubernetes/tree/pinterest-sidecar-1.14.5
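The pod-phase rule can be sketched in a few lines (illustrative logic only, not the actual Kubernetes patch linked above):

```python
# Compute a pod's phase from its *main* containers only, ignoring sidecars.
def pod_phase(containers):
    """containers: list of dicts like
    {"name": str, "sidecar": bool, "running": bool, "exit_code": int or None}"""
    mains = [c for c in containers if not c["sidecar"]]
    if any(c["running"] for c in mains):
        return "Running"
    if any(c["exit_code"] is None for c in mains):
        return "Pending"          # a main container hasn't started yet
    if all(c["exit_code"] == 0 for c in mains):
        return "Succeeded"        # sidecars can now be killed
    return "Failed"

# Example: main job finished cleanly while a log-shipping sidecar still runs.
print(pod_phase([
    {"name": "job", "sidecar": False, "running": False, "exit_code": 0},
    {"name": "log-shipper", "sidecar": True, "running": True, "exit_code": None},
]))  # -> Succeeded
```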
Volume
Is PVC really an option for serving models?
● Ideally:
○ Flexibility to select the storage medium
○ Data isolation
○ Dynamic provisioning saves money
○ ...
Volume
Is PVC really an option for serving models?
● Actually…
○ The EBS provisioner is not efficient, nor is EC2
○ ~100% 500s with batch EBS provisioning
Volume
Is PVC really an option for serving models?
● DescribeInstance: by name -> by ID (#78140)
● Optimize EBS provisioner cloud provider calls (#78276)

Batch Size | Total Calls (Original → Optimized) | Peak QPS (Original → Optimized)
50  | 1360 → 116 | 52 → 8
75  | 1464 → 139 | 70 → 10
100 | 2427 → 157 | 75 → 8
150 | 4384 → 209 | 93 → 11
AutoScaling
Nodes scale slower than Pods
● This is expected; otherwise containers would be meaningless
● Long-tail scheduling of Parameter Servers can cause the whole job to fail
● Use bogus low-priority pods as cluster buffers to scale in advance
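A minimal sketch of such a buffer pod (a "balloon" pod built as a plain spec dict; the PriorityClass name and resource sizes are assumptions, not Pinterest's values). It holds capacity so the autoscaler adds nodes early, and real workloads preempt it on arrival.

```python
import json

def balloon_pod(name: str, cpus: str, memory: str) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumes a PriorityClass "cluster-buffer" exists with a very low
            # priority, so any real workload preempts these placeholders.
            "priorityClassName": "cluster-buffer",
            "containers": [{
                "name": "pause",
                "image": "k8s.gcr.io/pause:3.1",  # does nothing, just holds resources
                "resources": {"requests": {"cpu": cpus, "memory": memory}},
            }],
        },
    }

print(json.dumps(balloon_pod("buffer-0", "4", "16Gi"), indent=2))
```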
Our Future
Future Works
● Gang scheduling
● Federation layer takes care of data / network locality
● Fine-grained preemption, task queuing
● More caching storage options (PVC/EBS)
○ https://github.com/kubeflow/community/pull/263