Large-Scale Web Apps @ Pinterest


DESCRIPTION

Speaker: Varun Sharma (Pinterest) Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.

TRANSCRIPT

Large Scale Web Apps @Pinterest (Powered by Apache HBase)

May 5, 2014

What is Pinterest?

Pinterest is a visual discovery tool for collecting the things you love and discovering related content along the way.

Challenges @scale

• 100s of millions of pins/repins per month
• Billions of requests per week
• Millions of daily active users
• Billions of pins
• One of the largest discovery tools on the internet

Storage stack @Pinterest

• MySQL
• Redis (persistence and cache)
• Memcache (consistent hashing)

[Diagram: the app tier implements manual sharding; the sharding logic lives in application code and routes each request to the right storage shard]
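As a rough, hypothetical illustration of what that "sharding logic" box does (illustrative only, not Pinterest's code), the app tier maps an entity id deterministically to one of a fixed number of logical shards, and each shard maps to a MySQL host:

    // Hypothetical sketch of manual sharding in the app tier; names and shard count are assumptions.
    public final class ShardRouterSketch {
        private static final int NUM_SHARDS = 4096;  // assumed fixed logical shard count
        private final String[] shardHosts;           // shard index -> MySQL host

        public ShardRouterSketch(String[] shardHosts) {
            this.shardHosts = shardHosts;
        }

        public String hostFor(long userId) {
            int shard = (int) (userId % NUM_SHARDS);        // deterministic placement
            return shardHosts[shard % shardHosts.length];   // many logical shards per host
        }
    }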

Why HBase?

• High write throughput: unlike MySQL's B-tree storage, writes never seek on disk
• Seamless integration with Hadoop
• Distributed operation
  - Fault tolerance
  - Load balancing
  - Easily add/remove nodes

Non-technical reasons:
• Large, active community
• Large-scale online use cases

Outline

• Features powered by HBase
• SaaS (Storage as a Service)
  - MetaStore
  - HFile Service (Terrapin)
• Our HBase setup: optimizing for high availability and low latency

Applications/Features

• Offline
  - Analytics
  - Search indexing
  - ETL/Hadoop workflows
• Online
  - Personalized feeds
  - Rich Pins
  - Recommendations


Why HBase ?

Personalized Feeds

WHY HBASE? Write-heavy load due to pin fan-out: each new pin is written into the feed of every follower.

[Diagram: a pinner's home feed is fanned out from the users they follow plus recommended pins]

Rich Pins

WHY HBASE? Bloom filters make negative lookups (keys that do not exist) cheap.
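A minimal sketch of enabling a row-level Bloom filter on a column family, which is what makes those negative hits cheap; the "rich_pins" table and "d" family names are invented here, and the class names follow the HBase 0.96+ client API:

    // Sketch: a get for a row that does not exist can skip most HFiles via the ROW bloom filter.
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.regionserver.BloomType;

    public class RichPinsSchemaSketch {
        public static HTableDescriptor richPinsTable() {
            HColumnDescriptor family = new HColumnDescriptor("d");   // assumed family name
            family.setBloomFilterType(BloomType.ROW);                // ROWCOL is the finer-grained option
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("rich_pins"));
            table.addFamily(family);
            return table;                                            // pass to Admin.createTable(...)
        }
    }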

Recommendations

WHY HBASE? Seamless data transfer from Hadoop.

[Diagram: recommendations are generated on Hadoop (1.0 and 2.0) clusters; DistCP jobs copy the output into the HBase + Hadoop 2.0 serving cluster]

SaaS

• Large number of feature requests
• 1 cluster per feature
• Scaling with organizational growth
• Need for “defensive” multi-tenant storage
• Previous solutions reaching their limits

MetaStore I

• Key-value store on top of HBase
• 1 HBase table per feature, with salted keys
• Pre-split tables
• Table-level rate limiting (online/offline reads/writes)
• No scan support
• Simple client API (below)


string getValue(string feature, string key, boolean online);
void setValue(string feature, string key, string value, boolean online);
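A minimal, hypothetical sketch of how such a get could be implemented on top of the plain HBase Java client, assuming a one-byte salt prefix, 256 salt buckets, and "d"/"v" column names that are invented here rather than taken from the talk; the online/offline flag (used for rate limiting) is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MetaStoreGetSketch {
        private static final int SALT_BUCKETS = 256;          // assumed; matches the pre-split regions
        private static final byte[] CF = Bytes.toBytes("d");  // assumed column family
        private static final byte[] COL = Bytes.toBytes("v"); // assumed qualifier

        // Prefix the user key with a stable hash bucket so writes spread evenly across regions.
        static byte[] saltedKey(String key) {
            int bucket = (key.hashCode() & 0x7fffffff) % SALT_BUCKETS;
            return Bytes.add(new byte[] {(byte) bucket}, Bytes.toBytes(key));
        }

        // getValue(feature, key): one HBase table per feature, as described on the slide.
        static String getValue(Connection conn, String feature, String key) throws Exception {
            try (Table table = conn.getTable(TableName.valueOf(feature))) {
                Result r = table.get(new Get(saltedKey(key)));
                byte[] v = r.getValue(CF, COL);
                return v == null ? null : Bytes.toString(v);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf)) {
                System.out.println(getValue(conn, "some_feature", "some_key"));
            }
        }
    }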

MetaStore II

[Diagram: clients issue gets/sets over Thrift to the MetaStore Thrift Server, which applies the salting and rate limiting; the server talks to a primary HBase cluster with master/master replication to a secondary; ZooKeeper holds the MetaStore config (rate limits, primary cluster) and sends notifications when it changes]

HFile Service (Terrapin)

• Solve the bulk upload problem
• HBase-backed solution:
  - Bulk upload + major compact
  - Major compact to delete old data
• Design a solution from scratch using a mashup of:
  - HFile
  - HBase BlockCache
  - Avoid compactions
  - Low-latency key-value lookups


High Level Architecture I

[Diagram: ETL/batch jobs load/reload HFiles into Amazon S3; HFile servers (multiple HFiles per server) pull them down and serve key/value lookups to the client library/service]

High Level Architecture II

• Each HFile server runs 2 processes:
  - Copier: pulls HFiles from S3 to local disk
  - Supershard: serves multiple HFile shards to clients
• ZooKeeper
  - Detects live servers
  - Coordinates loading/swapping of new data
  - Lets clients detect availability of new data
• Loader module (replaces DistCP)
  - Triggers the new data copy
  - Triggers the swap through ZooKeeper
  - Updates ZooKeeper and notifies clients
• Client library understands sharding (see the routing sketch below)
• Old data is deleted by a background process
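To make "client library understands sharding" concrete, here is a hedged sketch of how such a client might route a key to the HFile server that owns its shard. The /terrapin/... ZooKeeper paths, the hash function, and the address encoding are invented for illustration; only the routing idea comes from the talk.

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;

    public class TerrapinRoutingSketch {
        private final ZooKeeper zk;
        private final int numShards;

        public TerrapinRoutingSketch(ZooKeeper zk, int numShards) {
            this.zk = zk;
            this.numShards = numShards;
        }

        // Pluggable sharding function (modulus here); a range function could be swapped in.
        int shardFor(byte[] key) {
            int h = 17;
            for (byte b : key) h = 31 * h + b;
            return (h & 0x7fffffff) % numShards;
        }

        // Which live server currently owns a shard, as published in ZooKeeper by the loader.
        String serverFor(int shard) throws Exception {
            byte[] data = zk.getData("/terrapin/fileset/shards/" + shard, false, null);
            return new String(data, StandardCharsets.UTF_8);  // e.g. "hfileserver-12:9090" (made up)
        }
    }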


Salient Features

• Multi-tenancy through namespacing
• Pluggable sharding functions: modulus, range & more
• HBase BlockCache
• Multiple clusters for redundancy
• Speculative execution across clusters for low latency (sketched below)
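The last bullet, speculative execution across clusters, can be sketched with plain java.util.concurrent primitives: issue the same read to both redundant clusters and return whichever answers first, falling back if the first fails. The Function arguments stand in for whatever per-cluster client the service actually uses.

    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.function.Function;

    public class SpeculativeReadSketch {
        private final ExecutorService pool = Executors.newCachedThreadPool();

        public byte[] get(byte[] key,
                          Function<byte[], byte[]> primaryCluster,
                          Function<byte[], byte[]> secondaryCluster) throws Exception {
            CompletionService<byte[]> cs = new ExecutorCompletionService<>(pool);
            cs.submit(() -> primaryCluster.apply(key));    // normal request
            cs.submit(() -> secondaryCluster.apply(key));  // speculative duplicate
            Future<byte[]> first = cs.take();              // whichever cluster answers first
            try {
                return first.get();
            } catch (ExecutionException e) {
                return cs.take().get();                    // fall back to the other cluster
            }
        }
    }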


Setting Up for Success

• Many online use cases/applications
• Optimize for:
  - Low MTTR (high availability)
  - Low latency (performance)


MTTR - I

[Figure: DataNode state transitions, LIVE → STALE (20 sec) → DEAD (9 min 40 sec)]


• Stale nodes are avoided:
  - As candidates for reads
  - As candidate replicas for writes
  - During lease recovery
• Copying of under-replicated blocks starts only when a node is marked “dead” (see the config sketch below)
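A minimal sketch of the HDFS settings that produce this behavior, using the standard stale-DataNode keys introduced by the HDFS-3703/3912 line of work; the 20-second interval mirrors the figure above and is not a general recommendation.

    import org.apache.hadoop.conf.Configuration;

    public class StaleNodeConfigSketch {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Mark a DataNode stale after ~20s without a heartbeat, long before it is declared dead.
            conf.setLong("dfs.namenode.stale.datanode.interval", 20000L);
            // Skip stale nodes when choosing replicas for reads and for writes.
            conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
            conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
            return conf;
        }
    }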


MTTR - II

[Diagram: recovery pipeline: Failure Detection (30 sec ZooKeeper session timeout) → Lease Recovery (HDFS-4721) → Log Split (HDFS-3703 + HDFS-3912) → Recover Regions, for an end-to-end recovery of < 2 min]


• Avoid stale nodes at each point of the recovery process
• Multi-minute timeouts ==> multi-second timeouts

Simulate, Simulate, Simulate

Simulate “pull the plug” failures and “tail -f” the logs:
• kill -9 both the DataNode and the RegionServer: causes connection-refused errors
• kill -STOP both the DataNode and the RegionServer: causes socket timeouts
• Blackhole hosts using iptables: connect timeouts + “No route to host”; most representative of AWS failures

Performance

Configuration tweaks:
• Small block size, 4K-16K
• Prefix compression to cache more: when data is in the key, close to a 4X reduction for some data sets
• Separate RPC handler threads for reads vs. writes
• Short-circuit local reads
• HBase-level checksums (HBASE-5074)
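A hedged sketch of the first two tweaks (small block size, prefix compression) using the HBase column-family descriptor API (0.96+ class names); the family name and the exact 4 KB value are illustrative, not the talk's settings.

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

    public class FamilyTuningSketch {
        public static HColumnDescriptor tunedFamily() {
            HColumnDescriptor family = new HColumnDescriptor("d");  // assumed family name
            family.setBlocksize(4 * 1024);                          // 4K-16K blocks favor point reads
            family.setDataBlockEncoding(DataBlockEncoding.PREFIX);  // prefix compression of keys
            return family;
        }
    }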

Hardware

• SATA (m1.xl/c1.xl) and SSD (hi1.4xl)
• Choose based on the limiting factor:
  - Disk space: pick SATA for max GB/$$
  - IOPS: pick SSD for max IOPS/$$, i.e. clusters with heavy reads or heavy compaction activity

Performance (SSDs)

HFile read performance:
• Turn off the block cache for data blocks to reduce GC and heap fragmentation
• Keep the block cache on for index blocks
• Increase “dfs.client.read.shortcircuit.streams.cache.size” from 100 to 10,000 (with short-circuit reads)
• Approx. 3X improvement in read throughput
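A hedged sketch of this SSD read-path tuning, assuming the standard HBase/HDFS property names; the family name is made up, and index and bloom blocks stay cached even when data-block caching is disabled for a family.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;

    public class SsdReadTuningSketch {
        public static HColumnDescriptor dataFamily() {
            HColumnDescriptor family = new HColumnDescriptor("d");  // assumed family name
            family.setBlockCacheEnabled(false);  // do not cache data blocks; SSD reads are cheap
            return family;
        }

        public static Configuration clientConf() {
            Configuration conf = HBaseConfiguration.create();
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 10000);  // up from a small default
            return conf;
        }
    }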

Write performance:
• WAL contention when the client sets AutoFlush=true
• HBASE-8755
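One client-side way to soften that WAL contention is to batch many Puts into a single call rather than flushing row by row; the sketch below uses made-up table and column names and does not reproduce the server-side change in HBASE-8755.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedWritesSketch {
        public static void writeBatch(Connection conn, List<byte[]> keys, List<byte[]> values)
                throws Exception {
            List<Put> puts = new ArrayList<>();
            for (int i = 0; i < keys.size(); i++) {
                Put p = new Put(keys.get(i));
                p.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), values.get(i));  // assumed names
                puts.add(p);
            }
            try (Table table = conn.getTable(TableName.valueOf("feeds"))) {
                table.put(puts);  // one batched client call instead of a flush per Put
            }
        }
    }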

In the Pipeline...

• Building a graph database on HBase
• Disaster recovery: snapshot + incremental backup + restore
• Off-heap cache: reduce GC overhead and make better use of hardware
• Read-path optimizations

And we are hiring!
