PNUTS: YAHOO!’S HOSTED DATA SERVING PLATFORM

BRIAN F. COOPER, RAGHU RAMAKRISHNAN, UTKARSH SRIVASTAVA, ADAM SILBERSTEIN, PHILIP BOHANNON, HANS-ARNO JACOBSEN, NICK PUZ, DANIEL WEAVER AND RAMANA YERNENI

YAHOO! RESEARCH

Presented by Team Silverlining: Rakesh Nair, Navya Sruti Sirugudi, Shantanu Sardal, Smruti Aski, Chandra Sekhar


Post on 27-Dec-2015



DISTRIBUTED DATABASES – OVERVIEW

Web applications need:
  Scalability
    And the ability to scale linearly
  Geographic scope
  High availability and fault tolerance

Web applications typically have:
  Simplified query needs
    No joins, aggregations
  Relaxed consistency needs
    Applications can tolerate stale or reordered data

AGENDA

Introduction
PNUTS Features
Architecture
PNUTS applications
Experimental Results
Feature Enhancements
Related Work


PNUTS

A massive-scale hosted database system

Focus on data serving for web applications

Provides data storage organized as hashed or ordered tables

Low latency for large numbers of concurrent requests

Novel per-record consistency guarantees

WHAT IS PNUTS?

[Figure: rows of the Parts table drawn in several copies]

CREATE TABLE Parts (
    ID VARCHAR,
    StockNumber INT,
    Status VARCHAR
    …
)

ID  StockNumber  Status
A   42342        E
B   42521        W
C   66354        W
D   12352        E
E   75656        C
F   15677        E

Parallel database
Geographic replication
Indexes and views
Structured, flexible schema
Hosted, managed infrastructure

FEATURES

Data Model and Features
  Relational data model, scatter-gather operations, asynchronous notifications, bulk loading
Fault Tolerance
  Employs redundancy, supports low-latency reads and writes even after failure
Pub-Sub Message System
  Asynchronous operations carried out using YMB
Record-level Mastering
  All high-latency operations are asynchronous
Hosting
  Centrally managed database service shared by multiple applications

DESIGN DECISIONS

Record-level, asynchronous geographic replication

Guaranteed message delivery service

Consistency model which is not fully serialized

Hashed and ordered table organizations, flexible schema

Data management as a hosted service

SCALABILITY

[Figure: data-path components. Labels: Clients, REST API, Routers, Tablet controller, Storage units, Message Broker]

REPLICATION

[Figure: replication architecture. Labels: Clients, REST API, Routers, Tablet controller, Storage units, YMB, Local region, Remote regions]

DATA AND QUERY MODEL

Data organized into tables of records with attributes

Query language of PNUTS supports selection and projection from a single table.

PNUTS allows applications to declare tables to be hashed or ordered.

QUERY MODEL

Per-record operations
  Get
  Set
  Delete
Multi-record operations
  Multiget
  Scan
  Getrange
Web service (RESTful) API
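The operations listed on this slide can be illustrated with a small in-memory sketch. This is not the PNUTS API: the class name `OrderedTable`, the method signatures, and the half-open range convention are assumptions made for illustration, and the real system exposes these calls through a RESTful web service.

```python
import bisect


class OrderedTable:
    """Toy single-node stand-in for an ordered table (hypothetical, illustrative only)."""

    def __init__(self):
        self._rows = {}   # primary key -> record (a dict of attributes)
        self._keys = []   # primary keys kept in sorted order

    # Per-record operations
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = record

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.remove(key)

    # Multi-record operations
    def multiget(self, keys):
        return {k: self._rows[k] for k in keys if k in self._rows}

    def getrange(self, low, high):
        """Records with low <= key < high, in key order (assumed convention)."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_left(self._keys, high)
        return [(k, self._rows[k]) for k in self._keys[start:end]]

    def scan(self, predicate=lambda record: True):
        """Selection over the whole table, in key order."""
        return [(k, self._rows[k]) for k in self._keys if predicate(self._rows[k])]


parts = OrderedTable()
parts.set("A", {"StockNumber": 42342, "Status": "E"})
parts.set("C", {"StockNumber": 66354, "Status": "W"})
parts.set("B", {"StockNumber": 42521, "Status": "W"})
```

A hashed table would support the same per-record calls, but without a meaningful key order for `getrange`.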

CONSISTENCY MODEL

Web applications typically manipulate one record at a time.

Per-record timeline consistency
  Data in PNUTS is replicated across sites
  Each record contains
    Sequence number – number of updates since the time of creation
    Version number – changes on each update to the record
  Hidden field in each record stores which copy is the master copy
    Updates can be submitted to any copy
    Forwarded to master, applied in order received by master
  Record also contains origin of last few updates
    Mastership can be changed by current master, based on this information
    Mastership change is simply a record update
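A minimal sketch of record-level mastering, under stated assumptions: the class `MasteredRecord`, the region names, and the policy "move mastership when the last three updates all came from one non-master region" are hypothetical. The slide only says that the record remembers the origin of the last few updates and that the current master may hand over mastership based on them.

```python
from collections import deque


class MasteredRecord:
    """Toy model of one record replicated in several regions (illustrative only)."""

    def __init__(self, regions, master, history=3):
        self.master = master                       # hidden "master copy" field
        self.version = 0                           # bumped by the master on every update
        self.replicas = {r: None for r in regions}
        self.origins = deque(maxlen=history)       # origin of the last few updates

    def write(self, origin_region, value):
        # The update may arrive at any copy; it is forwarded to the master,
        # which applies updates in the order it receives them.
        self.version += 1
        self.replicas[self.master] = (self.version, value)
        self.origins.append(origin_region)
        self._maybe_move_mastership()
        return self.version

    def propagate(self):
        # Stand-in for asynchronous replication: copy the master's state out.
        for region in self.replicas:
            self.replicas[region] = self.replicas[self.master]

    def _maybe_move_mastership(self):
        # Assumed policy: hand over when every remembered update
        # originated in the same non-master region.
        if len(self.origins) == self.origins.maxlen and len(set(self.origins)) == 1:
            candidate = self.origins[0]
            if candidate != self.master:
                self.replicas[candidate] = self.replicas[self.master]
                self.master = candidate


rec = MasteredRecord(["west", "east"], master="west")
for value in ("a", "b", "c"):
    rec.write("east", value)
```

After three writes that all originate in `east`, the sketch moves mastership there, so later writes from that region no longer need to be forwarded.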

CONSISTENCY MODEL

Goal: make it easier for applications to reason about updates and cope with asynchrony

What happens to a record with primary key “Brian”?

[Figure: timeline of the record over time. Record inserted, then a series of updates producing v. 1 through v. 8 within Generation 1, then Delete.]

CONSISTENCY MODEL (APIS)

Read

[Figure: version timeline v. 1 to v. 8 (Generation 1) marking stale versions and the current version; Read is shown against this timeline.]

CONSISTENCY MODEL (APIS)

Read up-to-date

[Figure: version timeline v. 1 to v. 8 (Generation 1) marking stale versions and the current version; Read up-to-date is shown against this timeline.]

CONSISTENCY MODEL (APIS)

Read-critical(required version): Read ≥ v.6

[Figure: version timeline v. 1 to v. 8 (Generation 1) marking stale versions and the current version; the read requires v. 6 or later.]

CONSISTENCY MODEL (APIS)

Write

[Figure: version timeline v. 1 to v. 8 (Generation 1) marking stale versions and the current version; Write is shown against this timeline.]

CONSISTENCY MODEL (APIS)

Test-and-set-write(required version): Write if = v.7

[Figure: version timeline v. 1 to v. 8 (Generation 1) marking stale versions and the current version; the "Write if = v.7" request returns ERROR.]

CONSISTENCY MODEL (APIS)

Write if = v.7

[Figure: version timeline v. 1 to v. 8 (Generation 1) marking stale versions and the current version; the "Write if = v.7" request returns ERROR.]

Mechanism: per record mastership
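The API calls walked through on the preceding slides can be sketched against a toy record with one master copy and one lagging replica. The Python names (`ReplicatedRecord`, `read_up_to_date`, `VersionMismatch`, and so on) are hypothetical; only the call semantics follow the slides.

```python
class VersionMismatch(Exception):
    """Raised when test-and-set-write sees an unexpected current version."""


class ReplicatedRecord:
    """Toy record with one master copy and one possibly stale local replica."""

    def __init__(self, value):
        self.master = (1, value)      # (version, value) at the master copy
        self.replica = (1, value)     # local copy; may lag behind the master

    def sync(self):
        self.replica = self.master    # stand-in for asynchronous replication

    def read(self):
        return self.replica           # any version, possibly stale

    def read_up_to_date(self):
        return self.master            # always the current version

    def read_critical(self, required_version):
        # Serve locally when the replica is new enough, otherwise ask the master.
        if self.replica[0] >= required_version:
            return self.replica
        return self.master

    def write(self, value):
        self.master = (self.master[0] + 1, value)
        return self.master[0]

    def test_and_set_write(self, required_version, value):
        if self.master[0] != required_version:
            raise VersionMismatch(
                f"current version is {self.master[0]}, not {required_version}")
        return self.write(value)


rec = ReplicatedRecord("v1")
for n in range(2, 9):                 # seven updates bring the master to version 8
    rec.write(f"v{n}")
```

With the master at version 8 and the replica still at version 1, `rec.test_and_set_write(7, ...)` raises `VersionMismatch`, which mirrors the ERROR case on the slide.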

SYSTEM ARCHITECTURE

System divided into regions, typically geographically distributed

Each region contains a complete copy of each table

Use pub/sub mechanism for reliability and replication (Yahoo Message Broker)

Data tables are horizontally partitioned into groups of records called tablets.

Each server might have hundreds or thousands of tablets.

TABLET SPLITTING AND BALANCING

Each storage unit has many tablets (horizontal partitions of the table)

Tablets may grow over time

Overfull tablets split

Storage unit may become a hotspot

Shed load by moving tablets to other servers

[Figure: storage units and the tablets they hold]
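A rough sketch of the two mechanisms on this slide. Both policies are assumptions chosen for illustration: splitting an overfull tablet at its median key, and measuring load as the number of keys a storage unit holds. The slide does not say how PNUTS picks split points or detects hotspots.

```python
def split_tablet(keys, max_size):
    """Split a sorted list of keys into tablets holding at most max_size keys."""
    if len(keys) <= max_size:
        return [keys]
    mid = len(keys) // 2              # assumed policy: split at the median key
    return split_tablet(keys[:mid], max_size) + split_tablet(keys[mid:], max_size)


def shed_load(assignment):
    """Move one tablet from the most loaded storage unit to the least loaded one.

    assignment maps a storage unit name to its list of tablets (lists of keys).
    """
    def load(unit):
        return sum(len(tablet) for tablet in assignment[unit])

    hot = max(assignment, key=load)
    cold = min(assignment, key=load)
    if hot != cold and len(assignment[hot]) > 1:
        assignment[cold].append(assignment[hot].pop())
    return assignment


tablets = split_tablet(list("abcdefgh"), max_size=3)
placement = shed_load({"SU1": list(tablets), "SU2": []})
```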

READING DATA

Three components:
  Storage Unit (SU)
  Router
  Tablet Controller

Each router contains an interval mapping: each tablet boundary is mapped to the SU containing the tablet.
  For ordered tables, the primary key space is divided into intervals.
  For hash tables, the hash space is divided into intervals for each tablet.
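The interval mapping can be sketched as a sorted list of tablet boundaries searched with binary search. The class `IntervalMap`, the helper `hash_point`, the choice of SHA-256, and the 32-bit hash space are all assumptions for illustration; the boundary values for the ordered table are borrowed from the range query slide later in this deck.

```python
import bisect
import hashlib


class IntervalMap:
    """Sorted tablet boundaries mapped to storage units (illustrative only)."""

    def __init__(self, boundaries, storage_units):
        # storage_units[0] serves everything below boundaries[0];
        # storage_units[i + 1] serves keys from boundaries[i] upward.
        if len(storage_units) != len(boundaries) + 1:
            raise ValueError("need exactly one more storage unit than boundaries")
        self.boundaries = boundaries
        self.storage_units = storage_units

    def lookup(self, point):
        return self.storage_units[bisect.bisect_right(self.boundaries, point)]


def hash_point(key):
    """Map a key to a point in an assumed 32-bit hash space."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


# Ordered table: intervals over the primary key space.
ordered = IntervalMap(["Canteloupe", "Lime", "Strawberry"],
                      ["SU1", "SU3", "SU2", "SU1"])

# Hash table: intervals over the hash space.
hashed = IntervalMap([2**30, 2**31, 3 * 2**30],
                     ["SU1", "SU2", "SU3", "SU4"])
```

For a hashed table the router would call `hashed.lookup(hash_point(key))`; for an ordered table it passes the primary key directly.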

TABLET CONTROLLER

Routers contain only a cached copy of the interval mapping.

Mapping owned by tablet controller

Routers get an update of the mapping from the tablet controller when a read request fails

Simplifies router’s failure recovery
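A minimal sketch of the cached-mapping idea, assuming a router that retries once after refreshing its cache from the tablet controller. All class names and the single-retry policy are hypothetical; the slide only states that routers refresh the mapping when a read request fails.

```python
class TabletController:
    """Owns the authoritative tablet -> storage unit mapping (toy version)."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)


class StorageUnit:
    def __init__(self, tablets):
        self.tablets = tablets        # tablet name -> {key: record}

    def read(self, tablet, key):
        if tablet not in self.tablets:
            raise LookupError(f"tablet {tablet} is not hosted here")
        return self.tablets[tablet].get(key)


class Router:
    def __init__(self, controller, storage_units):
        self.controller = controller
        self.storage_units = storage_units
        self.cache = dict(controller.mapping)    # cached copy only
        self.refreshes = 0

    def read(self, tablet, key):
        try:
            return self.storage_units[self.cache[tablet]].read(tablet, key)
        except LookupError:
            # Stale cache: fetch the mapping again and retry once.
            self.cache = dict(self.controller.mapping)
            self.refreshes += 1
            return self.storage_units[self.cache[tablet]].read(tablet, key)


units = {"SU1": StorageUnit({"t1": {"k": "record"}}), "SU2": StorageUnit({})}
controller = TabletController({"t1": "SU1"})
router = Router(controller, units)

# Move tablet t1 to SU2 without telling the router.
units["SU2"].tablets["t1"] = units["SU1"].tablets.pop("t1")
controller.mapping["t1"] = "SU2"
value = router.read("t1", "k")
```

Because the router holds no state of its own beyond the cache, a restarted router in this sketch only needs to copy the mapping from the controller again.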

ACCESSING SINGLE RECORD

[Figure: numbered message flow for a single-record read, with three storage units (SU)]

1. Get key k
2. Get key k
3. Record for key k
4. Record for key k

BULK READ

[Figure: a scatter/gather server in front of three storage units (SU)]

1. {k1, k2, … kn}
2. Get k1, Get k2, Get k3
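The scatter/gather step can be sketched as grouping the requested keys by storage unit and then collecting the records. The function `multiget` and its `locate` and `storage` parameters are hypothetical stand-ins; a real scatter/gather server would issue the per-unit requests in parallel.

```python
from collections import defaultdict


def multiget(keys, locate, storage):
    """Scatter keys to their storage units, then gather the records.

    locate(key) returns a storage unit name; storage maps that name
    to a dict of records held by the unit.
    """
    by_unit = defaultdict(list)
    for key in keys:                              # scatter
        by_unit[locate(key)].append(key)

    result = {}
    for unit, unit_keys in by_unit.items():       # gather
        for key in unit_keys:
            if key in storage[unit]:
                result[key] = storage[unit][key]
    return result


storage = {"SU1": {"k1": 1, "k3": 3}, "SU2": {"k2": 2}}


def locate(key):
    return "SU2" if key == "k2" else "SU1"


records = multiget(["k1", "k2", "k3", "k4"], locate, storage)
```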

RANGE QUERIES

Router interval mapping:

Interval            Storage unit
MIN-Canteloupe      SU1
Canteloupe-Lime     SU3
Lime-Strawberry     SU2
Strawberry-MAX      SU1

Keys held across storage units 1, 2 and 3: Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon

The range query Grapefruit…Pear? is split into Grapefruit…Lime? and Lime…Pear?
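The split shown on this slide can be reproduced with a short sketch. The boundaries and storage units are taken from the slide; the function `split_range` and the half-open `[low, high)` convention are assumptions.

```python
import bisect

BOUNDARIES = ["Canteloupe", "Lime", "Strawberry"]   # tablet lower bounds from the slide
UNITS = ["SU1", "SU3", "SU2", "SU1"]                # storage unit for each interval


def split_range(low, high):
    """Split [low, high) into per-tablet sub-ranges, each paired with its storage unit."""
    pieces = []
    start = low
    index = bisect.bisect_right(BOUNDARIES, low)    # interval that contains `low`
    while start < high:
        # The sub-range ends at the next tablet boundary or at `high`, whichever is first.
        end = BOUNDARIES[index] if index < len(BOUNDARIES) else high
        end = min(end, high)
        pieces.append((start, end, UNITS[index]))
        start = end
        index += 1
    return pieces


pieces = split_range("Grapefruit", "Pear")
```

For the query on the slide this yields one sub-range on SU3 (Grapefruit up to Lime) and one on SU2 (Lime up to Pear).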

UPDATES

[Figure: numbered message flow for a write, involving routers, storage units (SU) and message brokers]

1. Write key k
2. Write key k
3. Write key k
4. (no label)
5. SUCCESS
6. Write key k
7. Sequence # for key k
8. Sequence # for key k

YAHOO MESSAGE BROKER

Distributed publish-subscribe service
Guarantees delivery once a message is published
  Logging at site where message is published, and at other sites when received
Guarantees messages published to a particular cluster will be delivered in same order at all other clusters
Record updates are published to YMB by master copy (Record-level mastering)
  All replicas subscribe to the updates, and get them in same order for a particular record
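A toy broker illustrating the two guarantees on this slide: messages are logged when published, and every subscriber receives them in publish order. The class `MessageBroker`, the helper `make_replica`, and the explicit `deliver` step are hypothetical and do not reflect the YMB interface.

```python
class MessageBroker:
    """Toy publish-subscribe broker: logs every message, delivers in publish order."""

    def __init__(self):
        self.log = []            # messages are logged before delivery
        self.subscribers = []    # each entry: [callback, next log index to deliver]

    def subscribe(self, callback):
        self.subscribers.append([callback, 0])

    def publish(self, message):
        self.log.append(message)

    def deliver(self):
        # Delivery happens when this method runs, not when publish() is called,
        # which stands in for asynchronous replication.
        for entry in self.subscribers:
            callback, position = entry
            for message in self.log[position:]:
                callback(message)
            entry[1] = len(self.log)


def make_replica(store):
    """Return a subscriber callback that applies record updates to `store`."""
    def apply(update):
        key, version, value = update
        store[key] = (version, value)
    return apply


broker = MessageBroker()
east, west = {}, {}
broker.subscribe(make_replica(east))
broker.subscribe(make_replica(west))

# The master copy publishes updates for record "Brian".
broker.publish(("Brian", 1, "a"))
broker.publish(("Brian", 2, "b"))
pending = dict(east)             # replicas have not seen anything yet
broker.deliver()
```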


ASYNCHRONOUS REPLICATION

OTHER FEATURES

Per record transactions
Copying a tablet (on failure, for example)
  Request copy
  Publish checkpoint message
  Get copy of tablet as of when checkpoint is received
  Apply later updates
Tablet split
  Has to be coordinated across all copies
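The tablet copy steps can be sketched over a single ordered message stream. The message shapes, the `CHECKPOINT` marker, and the function `copy_tablet` are hypothetical; the point is only that a snapshot taken at the checkpoint plus a replay of later updates reproduces the source tablet.

```python
CHECKPOINT = ("checkpoint",)


def apply_update(tablet, message):
    _, key, value = message
    tablet[key] = value


def copy_tablet(stream):
    """Rebuild a tablet copy from an ordered stream that contains one checkpoint.

    Returns (source, new_copy) so the two can be compared.
    """
    source = {}
    snapshot = None
    later = []
    for message in stream:
        if message == CHECKPOINT:
            snapshot = dict(source)       # copy of the tablet as of the checkpoint
            continue
        apply_update(source, message)
        if snapshot is not None:
            later.append(message)         # arrived after the checkpoint

    new_copy = dict(snapshot)
    for message in later:                 # apply later updates
        apply_update(new_copy, message)
    return source, new_copy


stream = [("update", "a", 1), CHECKPOINT, ("update", "b", 2), ("update", "a", 3)]
source, new_copy = copy_tablet(stream)
```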

QUERY PROCESSING

Range scan can span tablets
  Done by scatter-gather engine (in router)
  Only one tablet scanned at a time
  Client may not need all results at once
    Continuation object returned to client to indicate where range scan should continue
Notification
  One pub-sub topic per tablet
  Client knows about tables, does not know about tablets
    Automatically subscribed to all tablets, even as tablets are added/removed.
  Usual problem with pub-sub: undelivered notifications, handled in usual way
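A sketch of a range scan that returns a continuation, assuming the continuation is simply the next key to read. The function `range_scan`, its parameters, and this representation of the continuation object are hypothetical.

```python
import bisect


def range_scan(tablets, low, high, limit, continuation=None):
    """Scan [low, high) one tablet at a time, returning at most `limit` keys.

    tablets is a list of sorted key lists, ordered by key range.
    Returns (rows, continuation); continuation is None when the scan is done.
    """
    start = continuation if continuation is not None else low
    rows = []
    for tablet in tablets:                        # only one tablet scanned at a time
        begin = bisect.bisect_left(tablet, start)
        end = bisect.bisect_left(tablet, high)
        for key in tablet[begin:end]:
            if len(rows) == limit:
                return rows, key                  # resume from this key next time
            rows.append(key)
    return rows, None


tablets = [["Apple", "Banana"], ["Grape", "Kiwi", "Lemon"], ["Lime", "Mango"]]

first, token = range_scan(tablets, "B", "M", limit=2)
second, token2 = range_scan(tablets, "B", "M", limit=2, continuation=token)
third, token3 = range_scan(tablets, "B", "M", limit=2, continuation=token2)
```

The client in this sketch never sees tablet boundaries; it only passes the continuation back until it comes back as `None`.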

PNUTS APPLICATIONS

User Database
  Millions of active Yahoo users – user profiles, IM buddy lists
  Record timeline – relaxed consistency
  Hosted DB – many apps sharing same data
Social and Web 2.0 Apps
  Rapidly evolving and expanding – flexible schema
  Connections in a social graph – ordered table abstraction
Content Metadata
  Bulk data – distributed FS, metadata – PNUTS
  Helps high performance operations like file creation, deletion, renaming
Listings Management
  Comparison shopping (sorted by price, rating, etc.)
  Ordered table and views – data sorted by price, ratings, etc.
Session Data
  Large session-state storage
  PNUTS as a service – easy access to session store

EXPERIMENTAL SETUP

Production PNUTS code
  Enhanced with ordered table type
Three PNUTS regions
  2 west coast, 1 east coast
  5 storage units, 2 message brokers, 1 router
  West: Dual 2.8 GHz Xeon, 4GB RAM, 6 disk RAID 5 array
  East: Quad 2.13 GHz Xeon, 4GB RAM, 1 SATA disk
Workload
  1200-3600 requests/second
  0-50% writes
  80% locality


INSERTS

Required 75.6 ms per insert in West 1 (tablet master)

131.5 ms per insert into the non-master West 2, and

315.5 ms per insert into the non-master East.


10% writes by default

SCALABILITY

[Figure: average latency (ms) versus number of storage units (1 to 6), for hash table and ordered table]

REQUEST SKEW

[Figure: average latency (ms) versus Zipf parameter (0 to 1), for hash table and ordered table]

SIZE OF RANGE SCANS

[Figure: average latency (ms) versus fraction of table scanned (0 to 0.12), for 30 clients and 300 clients]

RELATED WORK

Distributed and parallel databases
  Especially query processing and transactions
BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
Distributed filesystems
  Ceph, Boxwood, Sinfonia
Distributed (P2P) hash tables
  Chord, Pastry, …
Database replication
  Master-slave, epidemic/gossip, synchronous…

CONCLUSIONS AND ONGOING WORK

PNUTS is an interesting research product
  Research: consistency, performance, fault tolerance, rich functionality
  Product: make it work, keep it (relatively) simple, learn from experience and real applications
Ongoing work
  Indexes and materialized views
  Bundled updates
  Batch query processing

SUMMARY

Aim of PNUTS
  Rich database functionality
  Low latency on a massive scale
Tradeoffs between functionality, performance and scalability
  Asynchronous replication – Low write latency
  Consistency Model – Useful guarantees without sacrificing scalability
  Hosted Service – Minimize operation costs for applications
  Features Limited – Preserving Reliability and Scale
Novel Aspects
  Per-record timeline consistency – Asynchronous replication
  Message broker – Replication mechanism, Redo log
  Flexible mapping of tablets to storage units – Auto Failover, Load Balancing


THANK YOU!

Questions??