Systems Issues for Scalable, Fault Tolerant Internet Services

Yatin Chawathe, Eric Brewer
To appear in Middleware ’98
http://www.cs.berkeley.edu/~yatin/papers/sns-crc.ps

Page 1: Systems Issues for Scalable, Fault Tolerant Internet Services

Systems Issues for Scalable, Fault Tolerant Internet Services

Yatin Chawathe
Eric Brewer

To appear in Middleware ’98
http://www.cs.berkeley.edu/~yatin/papers/sns-crc.ps

Page 2: Systems Issues for Scalable, Fault Tolerant Internet Services

Motivation
• Proliferation of network-based services
• Two critical issues must be addressed by Internet services:
  – System scalability
    • Incremental and linear scalability
  – Availability and fault tolerance
    • 24x7 operation

Page 3: Systems Issues for Scalable, Fault Tolerant Internet Services

A Reusable SNS Framework
• Clusters of workstations are ideal for Internet services [FGC+97]
• But clusters are difficult to manage
  – To ensure linear scalability, the service must distribute load across the cluster
  – The service must grow the cluster with increasing load
  – Partial failures within a cluster complicate fault management

Isolate common requirements of cluster-based Internet apps into a reusable substrate -- the Scalable Network Services (SNS) framework

Page 4: Systems Issues for Scalable, Fault Tolerant Internet Services

Architecture

[Figure: architecture diagram. Labels: SNS Manager, Internal Network, Worker, Worker Driver, Outside World.]

Page 5: Systems Issues for Scalable, Fault Tolerant Internet Services

Workers
• Workers are grouped into classes. Within a class, workers are identical
• Workers can receive tasks from the outside world, or from other workers
• Workers have a simple serial interface for tasks
  – The originator sends a task to the consumer by specifying the class and inputs for the task
  – Tasks are atomic and restartable
  – Worker Drivers present a narrow interface between the SNS substrate and the worker application
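The task interface described on this slide could be sketched roughly as follows. This is a minimal illustration, not the SNS API: the names `Task`, `WorkerDriver`, and `run` are hypothetical, and the handler stands in for an arbitrary worker application.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """A unit of work: the originator names a worker class and supplies inputs."""
    worker_class: str
    inputs: bytes


class WorkerDriver:
    """Hypothetical narrow interface between the substrate and one worker."""

    def __init__(self, worker_class: str, handler: Callable[[bytes], bytes]):
        self.worker_class = worker_class
        self._handler = handler

    def run(self, task: Task) -> bytes:
        # Serial interface: one task in, one result out. No state is kept
        # between tasks, so a failed task can be re-sent to any worker of
        # the same class (the "restartable" property).
        if task.worker_class != self.worker_class:
            raise ValueError("task sent to the wrong worker class")
        return self._handler(task.inputs)


# Usage: a toy worker class that upper-cases its input.
driver = WorkerDriver("upper", lambda data: data.upper())
task = Task("upper", b"hello")
first = driver.run(task)
second = driver.run(task)  # restarting the same task yields the same result
```

Because the originator only names a class, not a specific worker, the substrate stays free to pick whichever worker of that class it prefers.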

Page 6: Systems Issues for Scalable, Fault Tolerant Internet Services

Centralized SNS Manager
• SNS Manager is intentionally centralized
  – makes it easier to reason about and implement the various policies
  – “all” we need to do is ensure the fault tolerance of the manager, and make sure it is not a performance bottleneck
• Three key functions
  – Resource location
  – Load balancing and scalability
  – Fault tolerance

Page 7: Systems Issues for Scalable, Fault Tolerant Internet Services

Resource Location

[Figure: resource location diagram. Labels: Worker, Worker Driver, SNS Manager, Multicast Beacons, Register, Find, Found, Persistent Connection.]
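One plausible reading of the labels in this diagram is that the manager announces itself through beacons, workers register in response, and a worker that needs another class asks the manager to find one. The in-memory sketch below illustrates that reading only; the class `Manager`, its methods, and the addresses are assumptions, and no real multicast traffic is involved.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class Manager:
    """Hypothetical in-memory stand-in for the SNS Manager's registry."""

    def __init__(self, address: str):
        self.address = address
        self.registry: Dict[str, List[str]] = defaultdict(list)

    def beacon(self) -> str:
        # A real system would send this on a multicast group; here the
        # beacon is simply the manager's address (assumed format).
        return self.address

    def register(self, worker_class: str, worker_address: str) -> None:
        # Registering twice is harmless, which suits periodic re-registration.
        if worker_address not in self.registry[worker_class]:
            self.registry[worker_class].append(worker_address)

    def find(self, worker_class: str) -> Optional[str]:
        # "Find" request answered by "Found": return a known worker, if any.
        workers = self.registry.get(worker_class)
        return workers[0] if workers else None


manager = Manager("10.0.0.1:4000")
heard = manager.beacon()                    # a worker hears the beacon
manager.register("distiller", "10.0.0.2:5000")
manager.register("distiller", "10.0.0.2:5000")
found = manager.find("distiller")
missing = manager.find("cache")
```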

Page 8: Systems Issues for Scalable, Fault Tolerant Internet Services

Load Balancing
• Load measurement and reporting
  – Each worker examines incoming requests and estimates the “load” that would be generated
  – Simplest load metric: queue length at workers
  – Workers periodically report their current load to the SNS Manager
  – SNS Manager maintains load history and aggregates load reports from all workers
  – Load reports are piggybacked on manager beacons to the rest of the system
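The reporting path above can be illustrated with a small load table. This is a sketch under stated assumptions: the slide does not say how history is aggregated, so a sliding-window average of queue lengths is assumed here, and `LoadTable`, `report`, and `snapshot` are hypothetical names.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class LoadTable:
    """Hypothetical manager-side record of recent load reports per worker."""

    def __init__(self, history: int = 4):
        # Keep only the most recent `history` reports for each worker.
        self._reports: Dict[str, Deque[int]] = defaultdict(
            lambda: deque(maxlen=history)
        )

    def report(self, worker: str, queue_length: int) -> None:
        # Called when a worker's periodic load report arrives.
        self._reports[worker].append(queue_length)

    def snapshot(self) -> Dict[str, float]:
        # Aggregate view that could be piggybacked on the next beacon.
        # Averaging over the window is an assumption, not the paper's rule.
        return {w: sum(h) / len(h) for w, h in self._reports.items()}


table = LoadTable(history=2)
for queue_length in (4, 6, 10):
    table.report("worker-1", queue_length)
table.report("worker-2", 2)
loads = table.snapshot()
```

With a window of two reports, the oldest report for `worker-1` has been dropped by the time the snapshot is taken.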

Page 9: Systems Issues for Scalable, Fault Tolerant Internet Services

Load Balancing
• Each worker performs local load balancing decisions
• Use lottery scheduling -- the number of tickets is inversely proportional to worker load
• Stale load reports can cause oscillations
  – Use a correction factor based on the number of requests that were sent since the last load report
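A minimal sketch of lottery-based selection with a staleness correction follows. The slide gives no formula, so two assumptions are made: tickets are computed as 1 / (1 + estimated load), and the correction simply adds the number of requests sent since the last report to the reported load. `LotteryBalancer` and its methods are hypothetical names.

```python
import random
from typing import Dict, Optional


class LotteryBalancer:
    """Hypothetical originator-side balancer using lottery scheduling."""

    def __init__(self, rng: Optional[random.Random] = None):
        self._rng = rng or random.Random()
        self._reported: Dict[str, float] = {}
        self._sent_since_report: Dict[str, int] = {}

    def update(self, loads: Dict[str, float]) -> None:
        # A fresh load report arrived: reset the correction counters.
        self._reported = dict(loads)
        self._sent_since_report = {w: 0 for w in loads}

    def tickets(self, worker: str) -> float:
        # Assumed correction: reported load plus requests sent since then.
        estimated = self._reported[worker] + self._sent_since_report[worker]
        return 1.0 / (1.0 + estimated)  # fewer tickets for busier workers

    def pick(self) -> str:
        if not self._reported:
            raise RuntimeError("no load reports received yet")
        workers = list(self._reported)
        weights = [self.tickets(w) for w in workers]
        chosen = self._rng.choices(workers, weights=weights, k=1)[0]
        self._sent_since_report[chosen] += 1
        return chosen


balancer = LotteryBalancer(random.Random(0))
balancer.update({"w1": 0.0, "w2": 9.0})
tickets_before = {w: balancer.tickets(w) for w in ("w1", "w2")}
chosen = balancer.pick()
```

Each pick lowers the chosen worker's ticket count until the next report arrives, which damps the herd effect that stale reports would otherwise cause.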

Page 10: Systems Issues for Scalable, Fault Tolerant Internet Services

Auto-launch for Scalability
• Worker replication to handle short traffic bursts
  – Multiple workers handle requests in parallel
  – If load on a class of workers gets too high, the SNS Manager launches a new one
• Overflow pool for long bursts
  – non-dedicated set of machines (e.g. users’ desktop machines)
  – when all dedicated nodes are exhausted, harness an overflow node; release it after burst subsides
  – useful for incremental scalability
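The launch policy on this slide can be sketched as a threshold check that prefers dedicated nodes and falls back to the overflow pool. The single fixed threshold, the class name `Launcher`, and the node names are illustrative assumptions; the slide does not specify how "too high" is decided.

```python
from typing import List, Optional


class Launcher:
    """Hypothetical spawn policy: dedicated nodes first, then overflow nodes."""

    def __init__(self, dedicated: List[str], overflow: List[str], threshold: float):
        self.free_dedicated = list(dedicated)
        self.free_overflow = list(overflow)
        self.in_use_overflow: List[str] = []
        self.threshold = threshold  # assumed: average queue length per worker

    def maybe_launch(self, avg_load: float) -> Optional[str]:
        """Return the node to start a new worker on, or None."""
        if avg_load <= self.threshold:
            return None
        if self.free_dedicated:
            return self.free_dedicated.pop(0)
        if self.free_overflow:
            node = self.free_overflow.pop(0)
            self.in_use_overflow.append(node)
            return node
        return None  # every node is already in use

    def release_overflow(self, avg_load: float) -> Optional[str]:
        """Give back one overflow node once the burst has subsided."""
        if avg_load <= self.threshold and self.in_use_overflow:
            node = self.in_use_overflow.pop()
            self.free_overflow.append(node)
            return node
        return None


launcher = Launcher(dedicated=["d1"], overflow=["o1"], threshold=10.0)
quiet = launcher.maybe_launch(5.0)
first = launcher.maybe_launch(12.0)
second = launcher.maybe_launch(12.0)
exhausted = launcher.maybe_launch(12.0)
released = launcher.release_overflow(3.0)
```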

Page 11: Systems Issues for Scalable, Fault Tolerant Internet Services

Fault Tolerance
• Starfish fault tolerance
  – “Peer” monitoring as opposed to primary/secondary fault tolerance
• Two mechanisms:
  – Timeouts and retries
  – Preemptive detection and component restart
• Reliance on soft state simplifies crash recovery
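The first mechanism, timeouts and retries, relies on tasks being restartable. A minimal sketch, assuming a hypothetical transport that raises `WorkerTimeout` when a worker does not answer in time, and a simple round-robin choice of the next worker to try:

```python
from typing import Callable, Optional, Sequence


class WorkerTimeout(Exception):
    """Raised by a (hypothetical) transport when a worker does not answer."""


def run_with_retries(
    task: bytes,
    workers: Sequence[Callable[[bytes], bytes]],
    max_attempts: int = 3,
) -> bytes:
    """Send a restartable task, moving to another worker after each timeout."""
    if not workers:
        raise ValueError("no workers available")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]  # assumed round-robin order
        try:
            return worker(task)
        except WorkerTimeout as exc:
            last_error = exc  # safe to retry because tasks are restartable
    raise RuntimeError("all attempts timed out") from last_error


def dead_worker(task: bytes) -> bytes:
    raise WorkerTimeout()


def live_worker(task: bytes) -> bytes:
    return task[::-1]


result = run_with_retries(b"abc", [dead_worker, live_worker])
```

The retry loop needs no per-task recovery logic, which is the practical payoff of keeping tasks atomic and restartable.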

Page 12: Systems Issues for Scalable, Fault Tolerant Internet Services

Fault Tolerance

[Figure: fault tolerance diagram. Labels: Worker, Worker Driver, SNS Manager, Am Restarting, ReRegister.]

Page 13: Systems Issues for Scalable, Fault Tolerant Internet Services

Example Applications
• TranSend
  – Web proxy for on-the-fly content distillation
• Wingman
  – The world’s only graphical web browser for the 3COM PalmPilot
• TopGun Mediaboard
  – PDA groupware: shared electronic whiteboard for the 3COM PalmPilot
• MARS
  – MBone archive server

Page 14: Systems Issues for Scalable, Fault Tolerant Internet Services

Evaluation

[Figure: load (queue length, 0 to 18) versus time (seconds, 0 to 70) for Worker 1 and Worker 2.]

Page 15: Systems Issues for Scalable, Fault Tolerant Internet Services

Evaluation

[Figure: load (queue length, 0 to 18) versus time (seconds, 0 to 70) for Worker 1 and Worker 2.]

Page 16: Systems Issues for Scalable, Fault Tolerant Internet Services

Evaluation

[Figure: queue length (0 to 25) for Workers 1 to 5 and offered load (requests/second, 0 to 60) versus time (seconds, 0 to 800). Annotations: “Worker 2 started”, “Worker 3 started”, “Workers 4 & 5 started”.]

Page 17: Systems Issues for Scalable, Fault Tolerant Internet Services

Summary
• Reusable architecture substrate for building Internet service applications
• Application developers program their services to a well-defined narrow interface
• SNS takes care of resource location, spawning, load balancing, fault tolerance
• A number of interesting applications on top of the SNS substrate
• Next step: SNSv2 NINJA