Scalable, Distributed Data Structures for Internet Service Construction


Page 1: Scalable, Distributed Data Structures for Internet Service

Scalable, Distributed Data Structures for Internet Service Construction

Steve Gribble, Eric Brewer, Joe Hellerstein, and David Culler

gribble@cs.washington.edu

Ninja Research Group (http://ninja.cs.berkeley.edu)

The University of California at Berkeley, Computer Science Division

Page 2: Scalable, Distributed Data Structures for Internet Service

Challenges: Simplicity and Scalability

"It's like preparing an aircraft carrier to go to war," said Schwab spokeswoman Tracey Gordon of the daily efforts to keep afloat a site that has already capsized eight times this year.

New York Times, June 20, 1999

In response to MailExcite’s outages caused by a surge in traffic, one user wrote in a message, “If MailExcite were a car we’d all be dead right now.”

c|net news, December 14, 1998

Page 3: Scalable, Distributed Data Structures for Internet Service

Motivation

• Building and running Internet services is very hard!
  – especially those that need to manage persistent state
  – their design involves many tradeoffs…

• scalability, availability, consistency, simplicity/manageability

  – and there are very few adequate reusable pieces

• Goals of this work:
  – to design/build a reusable storage layer for services
  – to demonstrate properties of this layer quantitatively

• all of the ‘ilities, plus acceptable performance

Page 4: Scalable, Distributed Data Structures for Internet Service

Outline of Talk

• Motivation
• Introduction: Distributed Data Structures (DDS)
• Distributed hash table prototype
• Performance numbers
• Example services
• Wrapup

Page 5: Scalable, Distributed Data Structures for Internet Service

Context

• Clusters are natural platforms for Internet services
  – incremental scalability, natural parallelism, redundancy
  – but, state management is hard (must keep nodes consistent)

• But, no appropriate cluster state mgmt. tool exists
  – (parallel) RDBMS? expensive, overly powerful semantic guarantees, limited availability under faults
  – distributed FS? overly general abstractions, high overhead, often no fault tolerance/availability, ill-defined consistency
  – roll your own? $$$, not reusable, complex to get right

• optimal performance is possible this way…

Page 6: Scalable, Distributed Data Structures for Internet Service

An alternative storage layer for clusters

• Distributed Data Structures (DDS)
  – start w/ hash table, tree, log, etc., and:
    • partition it across nodes (parallel access, scalability, …)
    • replicate partitions in replica groups (availability)
    • sync replicas to disk (durability)
  – DDS maintains a consistent view across cluster
    • atomic state changes (but not transactions)
    • engenders a simple architectural model: any node can do any task

[Figure: clients (C) connect to any of several service nodes (S) in the cluster; all service nodes share one DDS]

Page 7: Scalable, Distributed Data Structures for Internet Service

Guiding Principles

1. Simplification through separation of concerns
  – decouple persistence/consistency from rest of service
  – DDS abstraction: programmers understand data structures, so this is a natural extension

2. Appeal to properties of clusters to mitigate the hard distributed systems problems
  – “cluster ≠ wide area”: physically secure, well administered, redundant SAN, controlled heterogeneity
    • e.g. low-latency network → two-phase commit not prohibitive
    • e.g. redundant SAN → no network partitions → “presumed commit” optimistic two-phase commit

Page 8: Scalable, Distributed Data Structures for Internet Service

Outline of Talk

• Motivation
• Introduction: Distributed Data Structures (DDS)
• Distributed hash table prototype (in Java)
• Performance numbers
• Example services
• Wrapup

Page 9: Scalable, Distributed Data Structures for Internet Service

Prototype DDS: distributed hash table

[Figure: clients reach service instances over the WAN; each service instance links the DDS library and talks to storage “bricks” over the cluster SAN; one distributed HT partition is shown with 3 replicas in its group]

• service interacts with DDS via library
  – library is 2PC coordinator, handles partitioning, replication, etc., and exports hashtable API (a sketch follows)
• clients interact with any service “front-end”
  – all persistent state is in DDS and is consistent across cluster
• “brick” is durable single-node hashtable plus RPC skeletons for network access
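The slide says the library exports a hashtable API but does not spell it out, so the following is a minimal sketch of what a front-end's use of it might look like; the interface, exception type, and method names are illustrative stand-ins, not the actual Ninja DDS API.

```java
// Hypothetical sketch of the hashtable API the DDS library might export to a
// service front-end; names are illustrative, not the actual Ninja DDS API.
import java.nio.charset.StandardCharsets;

interface DistributedHashtable {
    byte[] get(byte[] key) throws DDSException;              // read from any one replica
    void put(byte[] key, byte[] value) throws DDSException;  // 2PC write to every replica in the group
    void remove(byte[] key) throws DDSException;
}

class DDSException extends Exception {
    DDSException(String msg) { super(msg); }
}

// A front-end handler keeps all of its persistent state in the table, so any
// node in the cluster can serve any request.
class PrefsHandler {
    private final DistributedHashtable table;

    PrefsHandler(DistributedHashtable table) { this.table = table; }

    void savePrefs(String user, String prefs) throws DDSException {
        table.put(user.getBytes(StandardCharsets.UTF_8),
                  prefs.getBytes(StandardCharsets.UTF_8));
    }

    String loadPrefs(String user) throws DDSException {
        byte[] v = table.get(user.getBytes(StandardCharsets.UTF_8));
        return v == null ? null : new String(v, StandardCharsets.UTF_8);
    }
}
```

Because every front-end sees the same consistent table, a load balancer can hand a client to any node; this is the “any node can do any task” model from the earlier slide.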

Page 10: Scalable, Distributed Data Structures for Internet Service

Distribution: cluster-wide metadata structures

• Two data structures are maintained across cluster:
  – data partitioning map (DPmap)
    • given key, returns name of replica group that handles key
    • as the hash table grows in size, map subdivides
    • “subdivision” ensures localized changes (bounds # of groups affected)
  – replica group membership map (RGmap)
    • given replica group name, returns list of bricks in replica group
    • nodes can be dynamically added/removed from replica groups
      – node failure is subtraction from group
      – node recovery is addition to group

• the consistency of these maps is maintained, but lazily
  – clients piggyback operations w/ hash of their view of maps
  – if view is out of date, bricks send new map to client

• maps are also broadcast periodically
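A rough client-side sketch of the two maps above, assuming the DPmap is represented by its trie leaves (replica-group names that are low-order bit patterns of the key's hash, matching the "DP map trie" lookup on the next slide) and the RGmap is a flat name-to-brick-list table; the class, field layout, and bit ordering are my own illustration, not the actual code.

```java
// Illustrative sketch of the cluster-wide metadata maps; the representation
// and names are assumptions, not the actual implementation.
import java.util.*;

class ClusterMaps {
    // DPmap, stored as its trie leaves: each replica-group name is a distinct
    // low-order bit pattern of the key's hash.
    private final Set<String> dpLeaves = new HashSet<>();
    // RGmap: replica-group name -> bricks currently in that group.
    private final Map<String, List<String>> rgMap = new HashMap<>();

    void addGroup(String rgName, List<String> bricks) {
        dpLeaves.add(rgName);
        rgMap.put(rgName, bricks);
    }

    // DPmap lookup: take successively more low-order bits of the key's hash
    // until the bit string names a replica group (the trie walk).
    String lookupGroup(long keyHash) {
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < 64; i++) {
            bits.insert(0, (keyHash >>> i) & 1);   // prepend the next low-order bit
            if (dpLeaves.contains(bits.toString())) {
                return bits.toString();
            }
        }
        throw new IllegalStateException("no replica group covers this key");
    }

    // RGmap lookup: group name -> list of bricks to contact.
    List<String> lookupMembers(String rgName) {
        return rgMap.get(rgName);
    }
}
```

Because both maps are small, a client can hash its view and piggyback that hash on every operation, which is what lets bricks detect and correct stale views lazily.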

Page 11: Scalable, Distributed Data Structures for Internet Service

Metadata maps: hash table put

[Figure: an example put; the key’s low-order bits are matched against the DP map trie, which resolves to replica group 100]

1. lookup RG name in DP map trie
2. lookup RG members in RG map table
3. two-phase commit put to all RG members (sketched below)

DP map: trie over low-order key bits; leaves shown include 000, 100, 10, 011, and 111.

RG map:

  RG name   RG membership list
  000       dds1.cs, dds2.cs
  100       dds3.cs, dds4.cs, dds5.cs
  10        dds2.cs, dds3.cs, dds6.cs
  011       dds7.cs

In the example, the key resolves to RG 100, whose members are dds3.cs, dds4.cs, and dds5.cs.
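Step 3 is the only multi-node step; below is a bare-bones sketch of the coordinator side of that two-phase-commit put. The BrickStub interface and its prepare/commit/abort methods are hypothetical RPC stand-ins, and failure handling is reduced to the “best effort” behavior of the next slide: vote no, abort, and let the caller retry.

```java
// Sketch of the client library's two-phase-commit put across one replica group.
// BrickStub and its methods are hypothetical stand-ins for per-brick RPC stubs.
import java.util.List;

interface BrickStub {
    boolean preparePut(byte[] key, byte[] value);  // phase 1: brick locks/logs the intent and votes
    void commit(byte[] key);                       // phase 2: make the new value visible
    void abort(byte[] key);                        // phase 2: discard the intent
}

class TwoPhasePut {
    // Returns false if any replica votes no; the DDS is "best effort", so the
    // caller simply retries or reports failure rather than blocking.
    static boolean put(List<BrickStub> replicaGroup, byte[] key, byte[] value) {
        // Phase 1: every replica in the group must agree to the put.
        for (BrickStub brick : replicaGroup) {
            if (!brick.preparePut(key, value)) {
                for (BrickStub b : replicaGroup) b.abort(key);   // saying "no" is ok
                return false;
            }
        }
        // Phase 2: all voted yes; commit everywhere.
        for (BrickStub brick : replicaGroup) brick.commit(key);
        return true;
    }
}
```

The cluster assumptions from the Guiding Principles slide (low-latency SAN, no network partitions) are what make running this on every write affordable and the presumed-commit optimization safe.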

Page 12: Scalable, Distributed Data Structures for Internet Service

Recovery

• Insights:
  1. make hash table “best effort”: it’s ok to say no
     – e.g., if can’t get lock, replica group membership changes during operation, etc.
  2. enforce invariants to simplify
     – no state changes unless client + all replicas agree on current maps
  3. make partitions small (10-100 MB), but have many
     – given fast SAN, copying an entire partition is fast (1-10 seconds)
  4. brick failures don’t happen often (once per week)

• Given these insights, brick failure recovery is easy (sketched below):
  – grab write lock over one replica in a partition
  – copy the entire replica to the recovering node
  – propagate new RGmap to other nodes in replica group
  – release lock
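A compact sketch of that recovery path; the helper types (Brick, ReplicaGroup, RGmap) are hypothetical and hide all RPC, locking, and durability details.

```java
// Sketch of the brick-failure recovery path described above; helper types are
// hypothetical stand-ins, and errors are ignored for brevity.
import java.util.List;

class Recovery {
    // Bring a recovering brick back into its replica group.
    static void recover(ReplicaGroup group, Brick recovering) {
        Brick source = group.anyLiveReplica();
        source.acquireWriteLock();                   // 1. grab write lock over one replica
        try {
            byte[] image = source.copyPartition();   // 2. copy entire replica (10-100 MB,
            recovering.loadPartition(image);         //    roughly 1-10 s over the SAN)
            RGmap updated = group.map().withMember(recovering.name());
            for (Brick b : group.members()) {        // 3. propagate new RGmap to the group
                b.installRGmap(updated);
            }
        } finally {
            source.releaseWriteLock();               // 4. release lock
        }
    }
}

interface Brick {
    String name();
    void acquireWriteLock();
    void releaseWriteLock();
    byte[] copyPartition();
    void loadPartition(byte[] image);
    void installRGmap(RGmap map);
}

interface ReplicaGroup {
    Brick anyLiveReplica();
    List<Brick> members();
    RGmap map();
}

interface RGmap {
    RGmap withMember(String brickName);
}
```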

Page 13: Scalable, Distributed Data Structures for Internet Service

Outline of Talk

• Motivation
• Introduction: Distributed Data Structures (DDS)
• Distributed hash table prototype
• Performance numbers
• Example services
• Wrapup

Page 14: Scalable, Distributed Data Structures for Internet Service

Scalability (reads and writes)

[Graph: max throughput (ops/s, log scale from 100 to 100,000) vs. # of bricks (log scale from 1 to 1,000), one curve for reads and one for writes; at 128 bricks, reads reach 61,432 ops/s and writes 13,582 ops/s]

Page 15: Scalable, Distributed Data Structures for Internet Service

Recovery Behavior

[Graph: read throughput (reads/s, 0 to 600) vs. time (0 to 300,000 ms), with six numbered points marked along the curve; caption: “Read throughput, replica failure (1/3)”]

Page 16: Scalable, Distributed Data Structures for Internet Service

Recovery Behavior

Page 17: Scalable, Distributed Data Structures for Internet Service

Capacity

• Scaled a single hash table to 1.2 TB
  – 128 brick nodes
  – 128 disks (10 GB per disk)
  – 512 partitions (1 replica per partition, i.e. no replication)

• Performance:
  – 200 8KB inserts per second per brick node
  – 1.5 hours to load the full terabyte table
    • is about 2 MB/s per disk
    • DP map + single-node hash function = seek-dominated traffic
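A quick sanity check on these numbers (my arithmetic, not from the slide): 128 bricks × 200 inserts/s × 8 KB ≈ 200 MB/s aggregate, so 1.2 TB / 200 MB/s ≈ 6,000 s, roughly the quoted 1.5 hours; per brick, 200 × 8 KB = 1.6 MB/s, which matches the “about 2 MB/s per disk” figure and is far below sequential disk bandwidth, consistent with seek-dominated traffic.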

Page 18: Scalable, Distributed Data Structures for Internet Service

Outline of Talk

• Motivation
• Introduction: Distributed Data Structures (DDS)
• I/O layer design
• Distributed hash table prototype
• Performance numbers
• Example services
• Wrapup

Page 19: Scalable, Distributed Data Structures for Internet Service

Example service: “Sanctio”

• instant messaging gateway
  – ICQ <-> AIM <-> email <-> voice
  – Babelfish language translation

• large routing and user pref. state maintained in service
  – each task needs two HT lookups (sketched below)
    • one for user pref, one to find correct “proxy” to send through
  – strong consistency required, write traffic is common (change routes)

• very rapid development
  – 1 person-month, most effort on IM protocols. State management: 1 day

[Figure: an AOL client and an ICQ client messaging through the Sanctio gateway]
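A tiny illustration of the two hash-table lookups per task mentioned above; the Lookup interface, key prefixes, and class name are hypothetical, since the slide does not show Sanctio's actual schema.

```java
// Illustrative only: two DDS hash-table lookups per routed message.
import java.nio.charset.StandardCharsets;

interface Lookup {
    byte[] get(byte[] key);   // stand-in for the DDS hash table's get
}

class SanctioRouter {
    private final Lookup prefs;   // user preference table
    private final Lookup routes;  // user -> outbound "proxy" table

    SanctioRouter(Lookup prefs, Lookup routes) {
        this.prefs = prefs;
        this.routes = routes;
    }

    // Returns the proxy to forward through, or null for an unknown user.
    String proxyFor(String user) {
        byte[] pref  = prefs.get(("pref:" + user).getBytes(StandardCharsets.UTF_8));   // lookup 1
        byte[] route = routes.get(("route:" + user).getBytes(StandardCharsets.UTF_8)); // lookup 2
        if (pref == null || route == null) return null;
        return new String(route, StandardCharsets.UTF_8);
    }
}
```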

Page 20: Scalable, Distributed Data Structures for Internet Service

Outline of Talk

• Motivation
• Introduction: Distributed Data Structures (DDS)
• I/O layer design
• Distributed hash table prototype
• Performance numbers
• Example services
• Wrapup

Page 21: Scalable, Distributed Data Structures for Internet Service

Wrapup

• Distributed data structures are a viable mechanism to simplify Internet service construction
  – they possess the ‘ilities: scalability, availability, durability
  – they engender a simple and familiar programming model

• Some guiding principles of DDS design
  – exploit properties of clusters to simplify
    • two-phase commit optimizations, fault recovery design
  – make hash table “best effort”
    • saying ‘no’ simplifies recovery, implementation, etc.

For all the gory details, PhD thesis online at: http://www.cs.washington.edu/homes/gribble

Page 22: Scalable, Distributed Data Structures for Internet Service

Backup Slides

Page 23: Scalable, Distributed Data Structures for Internet Service

I/O layer design decisions

• It turns out the interesting design choices are:
  i. APIs: subtle changes in API lead to radical changes in usage
     • e.g.: always allow user to pass in a token to an async. enqueue that will be returned with the corresponding completion (sketched below)
     • e.g.: allow user to specify destination of completions on every enqueue
     • it took me 6 versions of the library to get all this right…!
  ii. mechanisms for passing completions and chaining queues/sinks
     • polling (polls fan down chains) vs. upcalls (completions run up queues)
     • polling seemed correct, but:
       – when do you poll? (always, maybe with some timing delay loops)
       – what do you poll? (everything, as can’t know what is ready)
       – who does the polling? (everybody waiting for completions)
     • upcalls much more efficient: events generated exactly when data ready
       – “dream OS”: async. everything, no app contexts but upcall handlers
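A hedged sketch of the two API ideas above: a caller-supplied token on every asynchronous enqueue, and a per-enqueue completion sink that receives an upcall when the operation finishes. The interface names are illustrative, not the actual Ninja I/O layer.

```java
// Illustrative async-I/O API shape: per-enqueue tokens and completion sinks.
interface CompletionSink {
    // Upcall: invoked exactly when the request finishes, carrying the same
    // token the caller passed to the enqueue.
    void complete(Object token, byte[] result, Throwable error);
}

interface AsyncSource {
    // Non-blocking: queues the request and returns immediately; the result is
    // delivered later via the sink chosen on this particular enqueue.
    void enqueueRead(long offset, int length, Object token, CompletionSink sink);
}

// A user of the API: the token lets one handler serve many outstanding requests.
class Example {
    static void issue(AsyncSource disk) {
        CompletionSink sink = (token, result, error) -> {
            if (error != null) System.err.println("request " + token + " failed");
            else System.out.println("request " + token + ": " + result.length + " bytes");
        };
        for (int i = 0; i < 4; i++) {
            disk.enqueueRead(i * 8192L, 8192, "req-" + i, sink);  // token identifies the request
        }
    }
}
```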

Page 24: Scalable, Distributed Data Structures for Internet Service

Layering on top of basic HT

• Lightweight layering through FSMs is heavily exploited
  – “basic” distributed hash table layer
    • operations may suffer transient failures (locks, timeouts, etc.)
    • maximum value size 8KB
  – “sugar” distributed hash table layer
    • bust up large HT values (>8KB), stripe across many smaller values (sketched below)
  – “reliable” distributed hash table layer
    • on transient failures, retry operation a few times

• Additional data structures can reuse layers
  – planned: tree, log, skiplist?
  – layer on top of existing 2PC, brick, I/O substrate
    • replace data partitioning map
  – less efficient: layer on “basic” or “sugar” distributed hash table
    • may negatively impact performance (e.g. could specialize lower layers for that particular data structure)
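One plausible shape for the “sugar” layer's striping, sketched on top of a basic-layer interface; the chunk-key derivation, header format, and class names are my own guesses, not the actual implementation.

```java
// Sketch: stripe an oversized value across sub-8KB chunks on the basic layer.
import java.nio.ByteBuffer;
import java.util.Arrays;

interface BasicHashtable {
    void put(byte[] key, byte[] value);
    byte[] get(byte[] key);
}

class SugarHashtable {
    private static final int CHUNK = 8 * 1024;   // the basic layer's max value size
    private final BasicHashtable basic;

    SugarHashtable(BasicHashtable basic) { this.basic = basic; }

    void put(byte[] key, byte[] value) {
        int chunks = (value.length + CHUNK - 1) / CHUNK;
        // Header under the original key records chunk count and total length.
        basic.put(key, ByteBuffer.allocate(8).putInt(chunks).putInt(value.length).array());
        for (int i = 0; i < chunks; i++) {
            int from = i * CHUNK, to = Math.min(value.length, from + CHUNK);
            basic.put(chunkKey(key, i), Arrays.copyOfRange(value, from, to));
        }
    }

    byte[] get(byte[] key) {
        byte[] header = basic.get(key);
        if (header == null) return null;
        ByteBuffer hdr = ByteBuffer.wrap(header);
        int chunks = hdr.getInt(), length = hdr.getInt();
        byte[] out = new byte[length];
        for (int i = 0; i < chunks; i++) {
            byte[] part = basic.get(chunkKey(key, i));
            System.arraycopy(part, 0, out, i * CHUNK, part.length);
        }
        return out;
    }

    private static byte[] chunkKey(byte[] key, int index) {
        // Derive a per-chunk key by appending the chunk index (illustrative only).
        return ByteBuffer.allocate(key.length + 4).put(key).putInt(index).array();
    }
}
```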

Page 25: Scalable, Distributed Data Structures for Internet Service

DDS vs RDBMS

• DDS uses RDBMS techniques…
  – buffer cache, lock manager, HT access method, two-phase commit, recovery path

• but with different goals, abstractions, and semantics
  – high availability and consistency
  – HT API is a simple declarative language
    • does give both data independence and implementation freedom
    • but is at lower semantic level: exposes intention of operations
  – current semantics: atomic single operations
    • but, “Telegraph” project at Berkeley: transactional system on top of same I/O layer API and implementation

Page 26: Scalable, Distributed Data Structures for Internet Service

DDS vs. distributed, persistent objects

• Current DDS’s don’t provide:
  – pointers between objects
    • especially those that exist outside of object infrastructure
    • (distributed objects: anonymous references are possible)
    • the need to GC is especially hard in this case
  – extensibility
    • intention of access is not as readily apparent as DDS
    • with objects, ability to create any DS out of them
  – type enforcement
    • extra metadata and constraints to enforce at access time

Page 27: Scalable, Distributed Data Structures for Internet Service

The Big Picture

[Figure: the Internet, proxies, and the base (cluster)]

Base: well-protected cluster environment for services. Boundary of strong consistency under faults. “Solve hard problems here.”

Page 28: Scalable, Distributed Data Structures for Internet Service

Java: what worked

• strong typing, rich class library, no pointers
  – made software engineering much, much simpler
  – conservatively estimate 3x speedup in implementation time

• subclassing, declared interfaces
  – much, much cleaner I/O core API as a result

• portability
  – it was possible to pick up DDS and run it on:
    • NT, linux, solaris
  – but, of course, each JDK had its own peculiarities

Page 29: Scalable, Distributed Data Structures for Internet Service

Java: what didn’t work

• garbage collection
  – performance bottleneck if uncontrolled
    • Jaguar: bottleneck factor over 100 Mb/s network
  – induced metastable equilibrium

• strong typing and no pointers
  – forced many byte array copies

• lack of appropriate I/O abstractions
  – everything is thread-centric
    • no non-blocking APIs

• Java + linux = pain
  – linux kernel threads: very heavyweight contended locks
  – linux JDK’s are behind the curve

Page 30: Scalable, Distributed Data Structures for Internet Service

“Brick” implementation

[Figure: brick layering, top to bottom: split-phase “RPC” skeletons; single-node, asynchronous, persistent hash table (buffer cache, lock manager); asynchronous I/O layer (“sinks and sources”) over TCP network, VIA network, filesystem storage, and raw disk storage; HT and I/O requests flow down, completion upcalls flow back up]

• single node hash table
  – RPC skeletons slapped on for remote hash table access

• composed of many layers
  – each layer consists of state machines + chained upcalls
  – layers themselves are asynchronous
    • e.g. buffer cache, lock mgr

• implementation (sketched below):
  – chained hash table, static # of buckets specified at creation
  – key = 64 bit number
  – value = array of bytes
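The data-structure core described in the last bullet group is simple enough to sketch: a chained hash table with a fixed bucket count, 64-bit keys, and byte-array values. This shows only the in-memory shape; the real brick adds persistence, the buffer cache, the lock manager, and asynchronous upcalls.

```java
// Minimal sketch of the brick's table shape: chained buckets, long keys,
// byte[] values. Not the actual implementation.
class BrickTable {
    private static final class Node {
        final long key;
        byte[] value;
        Node next;
        Node(long key, byte[] value, Node next) { this.key = key; this.value = value; this.next = next; }
    }

    private final Node[] buckets;   // static number of buckets, chosen at creation

    BrickTable(int numBuckets) { this.buckets = new Node[numBuckets]; }

    private int bucketFor(long key) {
        return (int) ((key ^ (key >>> 32)) & 0x7fffffff) % buckets.length;
    }

    synchronized void put(long key, byte[] value) {
        int b = bucketFor(key);
        for (Node n = buckets[b]; n != null; n = n.next) {
            if (n.key == key) { n.value = value; return; }   // overwrite existing entry
        }
        buckets[b] = new Node(key, value, buckets[b]);       // chain a new entry
    }

    synchronized byte[] get(long key) {
        for (Node n = buckets[bucketFor(key)]; n != null; n = n.next) {
            if (n.key == key) return n.value;
        }
        return null;
    }
}
```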

Page 31: Scalable, Distributed Data Structures for Internet Service

Graceful Degradation (Reads)

[Graph: “Read Throughput vs. # clients”; throughput (reads/s, 0 to 20,000) vs. # client nodes (0 to 30), one curve each for 2, 4, 8, 16, and 32 servers]

Page 32: Scalable, Distributed Data Structures for Internet Service

But…an unexpected imbalance on writes

[Graph: “Write Throughput vs. servers' CPU utilization”; write throughput (writes/s, 0 to 300) and the two servers' CPU utilization (%, 0 to 120) vs. time (0 to 200,000 ms), with a “pause clients” event marked]

Page 33: Scalable, Distributed Data Structures for Internet Service

Garbage Collection Considered Harmful…

• What if… service rate S ∝ (queue length Q)^-1
• then, there is a Qthresh where Q > Qthresh ⇒ R > S (worked through below)
• Unfortunately, garbage collection tickles this case…
  – more objects means more time spent on GC

[Figure: arrival rate R feeds a queue of length Q, drained at service rate S]

• Physical analogy: ball on a windy flat-topped hill
  – classic unstable equilibrium
  – need “anti-gravity” force, or need windshield
    • admission control, flow control, discard, …
• Feedback effect: replica group runs at speed of slowest node (for inserts)
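A quick way to see the instability (my restatement, using only the relations on this slide): if the service rate is S(Q) = c/Q for some constant c, the queue grows whenever the arrival rate R exceeds c/Q, i.e. whenever Q > c/R, so Qthresh = c/R. Past that point, each additional queued object adds GC work, which lowers S further and makes the queue grow faster still, so the backlog diverges unless arrivals are shed (admission control, flow control, discard).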

Page 34: Scalable, Distributed Data Structures for Internet Service

More Example Services

• Scalable web server
  – “service” is HTTPD, fetches content from DDS
  – uses lightweight FSM-layering for CGIs
  – 900 lines of Java, 750 for HTTP parsing etc., <50 for DDS

• “Parallelisms” what’s-related server
  – inversion of Yahoo!
    • given a URL, identifies what Yahoo categories it is in
    • returns other URLs in those categories
    • 400 lines, 130 for app-specific logic (rest is HTTP junk)

• Many services in the “Ninja” platform
  – user preference repository, user key repository, collaborative filtering engine for a communal jukebox, …

Page 35: Scalable, Distributed Data Structures for Internet Service

Guiding Principles

3. “Internet service” means huge # of parallel tasks
  – optimize system to maximize task throughput
    • minimizing task latency is secondary, if needed at all
  – thread per task breaks!
    • focus changes from “pushing a task” to maintaining flows
    • need asynchronous I/O and event-driven model

4. A layered implementation with much reuse is possible
  – I/O subsystem and an event framework
  – RPC-accessible storage “bricks”
    + two-phase commit code, recovery code, locking code, etc.
  – data structures are built on top of these reusable pieces

Page 36: Scalable, Distributed Data Structures for Internet Service

Throughput vs. Read Size

[Graph: “Read throughput vs. element size (8 server nodes)”; throughput (reads/s, 0 to 6,000) vs. # client nodes (0 to 10), one curve each for element sizes of 50, 200, 500, 1000, 2000, 5000, and 8000 bytes]