cluster computing overview cs444i internet services winter 00 © 1999-2000 armando fox...

Cluster Computing Overview

CS444I Internet Services, Winter 00

© 1999-2000 Armando Fox [email protected]@cs.stanford.edu

© 1999, Armando Fox

Today’s Outline

Clustering: the Holy Grail

The Case For NOW

Clustering and Internet Services

Meeting the Cluster Challenges


Clustering: Holy Grail

Goal: Take a cluster of commodity workstations and make them look like a supercomputer.

Problems:

Application structure

Partial failure management

Interconnect technology

System administration


Cluster Prehistory: Tandem NonStop

Early (1974) foray into transparent fault tolerance through redundancy:

Mirror everything (CPU, storage, power supplies…), can tolerate any single fault (later: processor duplexing)

“Hot standby” process pair approach

What’s the difference between high availability and fault tolerance?

Noteworthy:

“Shared nothing”--why?

Performance and efficiency costs?

Later evolved into Tandem Himalaya, which used clustering for both higher performance and higher availability


Pre-NOW Clustering in the 90’s

IBM Parallel Sysplex and DEC OpenVMS:

Targeted at conservative (read: mainframe) customers

Shared disks allowed under both (why?)

All devices have cluster-wide names (shared everything?)

1500 installations of Sysplex, 25,000 of OpenVMS Cluster

Programming the clusters:

All System/390 and/or VAX VMS subsystems were rewritten to be cluster-aware

OpenVMS: cluster support exists even in single-node OS!

An advantage of locking into proprietary interfaces

What about fault tolerance?


The Case For NOW: MPP’s a Near Miss

uproc perf. improves 50% / yr (4%/month):

1 year lag: WS = 1.50 MPP node perf.

2 year lag: WS = 2.25 MPP node perf.

No economy of scale in 100s => +$

Software incompatibility (OS & apps) => +$$$$

More efficient utilization of compute resources (statistical multiplexing)

“Scale makes availability affordable” (Pfister)

Which of these do commodity clusters actually solve?
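The lag figures on this slide follow from compounding the 50%/yr growth rate. The sketch below reproduces them, assuming smooth compound growth (an assumption, not something the slide states); the function names are illustrative only. Under that assumption the equivalent monthly rate comes out near 3.4%, so the slide's 4%/month reads as a rounded figure (simple division, 50/12, gives about 4.2%).

```python
# Assumption: performance compounds smoothly at the slide's 50% per year.
ANNUAL_GROWTH = 0.50


def monthly_rate(annual_growth: float = ANNUAL_GROWTH) -> float:
    """Compound monthly rate equivalent to the given annual growth."""
    return (1.0 + annual_growth) ** (1.0 / 12.0) - 1.0


def workstation_advantage(lag_years: float,
                          annual_growth: float = ANNUAL_GROWTH) -> float:
    """Speed of a current workstation relative to an MPP node whose
    processor generation is lag_years old."""
    return (1.0 + annual_growth) ** lag_years


if __name__ == "__main__":
    print(f"compound monthly rate: {monthly_rate():.1%}")
    for lag in (1, 2):
        print(f"{lag} year lag: WS = {workstation_advantage(lag):.2f} x MPP node perf.")
```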


Philosophy: “Systems of Systems”

Higher Order systems research: aggressively use off-the-shelf hardware and OS software

Advantages:

easier to track technological advances

less development time

easier to transfer technology (reduce lag)

New challenges (“the case against NOW”):

maintaining performance goals

system is changing underneath you

underlying system has other people's bugs

underlying system is poorly documented


Clusters: “Enhanced Standard Litany”

Hardware redundancy

Aggregate capacity

Incremental scalability

Absolute scalability

Price/performance sweet spot

Software engineering

Partial failure management

Incremental scalability

Heterogeneity


Clustering and Internet Services

Aggregate capacity:

TB of disk storage, THz of compute power (if we can harness in parallel!)

Redundancy:

Partial failure behavior: only small fractional degradation from loss of one node

Availability: industry average across “large” sites during 1998 holiday season was 97.2% availability (source: CyberAtlas)

Compare: mission-critical systems have “four nines” (99.99%)
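To make the availability gap concrete, the sketch below converts an availability fraction into downtime and computes the capacity left after node losses. Two assumptions are mine, not the slide's: the 97.2% figure is annualized (the source measured only a holiday season), and load spreads evenly over identical nodes. Function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours per year for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def capacity_after_failures(nodes: int, failed: int = 1) -> float:
    """Fraction of aggregate capacity remaining, assuming identical nodes
    and evenly spread load (an idealization)."""
    return (nodes - failed) / nodes


if __name__ == "__main__":
    print(f"97.2%  -> {downtime_hours_per_year(0.972):.1f} hours/year")
    print(f"99.99% -> {downtime_hours_per_year(0.9999) * 60:.1f} minutes/year")
    print(f"100 nodes, 1 lost -> {capacity_after_failures(100):.0%} capacity")
```

On these assumptions, 97.2% corresponds to roughly 245 hours (about ten days) of downtime per year, while four nines allows under an hour.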


Spike Absorption

Internet traffic is self-similar:

Bursty at all granularities less than about 24 hours

What’s bad about burstiness?

Spike Absorption:

Diurnal variation: peak vs. average demand typically a factor of 3 or more

Starr Report: CNN peaked at 20M hits/hour (compared to usual peak of 12M hits/hour; that’s +66%)

Really the holy grail: capacity on demand

Is this realistic?
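The provisioning arithmetic behind these bullets can be sketched as follows. It uses only the slide's numbers (20M vs. 12M hits/hour, a peak-to-average factor of 3); the function names are hypothetical, and the idea that a site is sized for its usual peak is my simplification.

```python
def spike_over_usual_peak(spike_hits: float, usual_peak_hits: float) -> float:
    """Fractional excess of a spike over the usual peak load."""
    return spike_hits / usual_peak_hits - 1.0


def average_utilization(peak_to_average: float) -> float:
    """Mean utilization of a system sized exactly for its usual peak."""
    return 1.0 / peak_to_average


def hits_per_second(hits_per_hour: float) -> float:
    """Convert an hourly hit count into a per-second rate."""
    return hits_per_hour / 3600.0


if __name__ == "__main__":
    print(f"spike over usual peak: {spike_over_usual_peak(20e6, 12e6):+.1%}")
    print(f"average utilization at 3x peak/avg: {average_utilization(3):.0%}")
    print(f"spike rate: {hits_per_second(20e6):.0f} hits/s")
```

A system sized for its usual peak sits about two-thirds idle on average and still falls short of a spike like this one, which is why capacity on demand is so attractive.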


Diurnal Cycle (UCB dialups, Jan. 1997)

~750 modems at UC Berkeley

Instrumented early 1997


Clustering and Internet Workloads

Internet vs. “traditional” workloads:

e.g. Database workloads (TPC benchmarks)

e.g. traditional scientific codes (matrix multiply, simulated annealing and related simulations, etc.)

Some characteristic differences:

Read mostly

Quality of service (best-effort vs. guarantees)

Task granularity

“Embarrassingly parallel”

…but are they balanced? (we’ll return to this later)


Meeting the Cluster Challenges

Software & programming models

Partial failure and application semantics



Software Challenges

Message-passing & Active Messages

Shared memory: Network RAM

CC-NUMA, Software DSM: “Anyone who thinks cache misses can take milliseconds is an idiot.” (Paraphrasing Larry McVoy at OSDI 96)

MP vs SM a long-standing religious debate

Arbitrary object migration (“network transparency”):

What are the problems with this?

Hints: RPC, checkpointing, residual state


Partial Failure Management

What does partial failure mean for…

a transactional database?

A read-only database striped across cluster nodes?

A compute-intensive shared service?

What are appropriate “partial failure abstractions”?

Incomplete/imprecise results?

Longer latency?

What current programming idioms make partial failure hard?

Hint: remember the original RPC papers?
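One way to picture the "incomplete results" abstraction for a read-only database striped across nodes is a scatter-gather query that returns whatever the live shards produce, plus a coverage fraction. This is only an illustrative sketch: the shards are in-memory dicts, a dead node is modeled as `None`, and `query_shard` and `scatter_gather` are hypothetical names, not part of any system described in these slides.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def query_shard(shard, term):
    """Hypothetical per-node lookup; a shard of None simulates a dead node."""
    if shard is None:
        raise ConnectionError("node down")
    return shard.get(term, [])


def scatter_gather(shards, term, deadline_s=0.5):
    """Query every shard in parallel and return (hits, coverage), where
    coverage is the fraction of shards that answered before the deadline."""
    pool = ThreadPoolExecutor(max_workers=len(shards))
    try:
        futures = [pool.submit(query_shard, s, term) for s in shards]
        done, _late = wait(futures, timeout=deadline_s)
    finally:
        # Do not block on stragglers; their results are simply dropped.
        pool.shutdown(wait=False)
    hits, answered = [], 0
    for fut in done:
        if fut.exception() is None:  # failed shards contribute nothing
            hits.extend(fut.result())
            answered += 1
    return sorted(hits), answered / len(shards)


if __name__ == "__main__":
    shards = [{"now": ["doc1"]}, None, {"now": ["doc3"]}]
    hits, coverage = scatter_gather(shards, "now")
    print(hits, f"coverage={coverage:.0%}")
```

The caller sees a degraded answer rather than an exception, in contrast to the RPC idiom of making a remote call look like a local one that either returns or fails outright.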


Software Challenges, Again?

Real issue: we have to think differently about programming…

…to harness clusters?

…to get decent failure semantics?

…to really exploit software modularity?

Traditional uniprocessor programming idioms/models don’t seem to scale up to clusters

Question: Is there a “natural to use” cluster model that scales down to uniprocessors?

If so, is it general or application-specific?

What would be the obstacles to adopting such a model?


System Administration on a Cluster

Thanks to Eric Anderson (1998) for some of this material.

Total cost of ownership (TCO) way high for clusters:

Median sysadmin cost per machine per year (1996): ~$700

Cost of a headless workstation today: ~$1500

Previous Solutions:

Pay someone to watch

Ignore or wait for someone to complain

“Shell Scripts From Hell” (not general => vast repeated work)

Need an extensible and scalable way to automate the gathering, analysis, and presentation of data
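A rough sense of why TCO dominates comes from comparing the two dollar figures above. The sketch assumes the 1996 administration cost still applies unchanged to a machine bought "today", and the 100-node cluster size is a hypothetical chosen for illustration.

```python
ADMIN_COST_PER_MACHINE_YEAR = 700.0   # slide: median sysadmin cost (1996)
WORKSTATION_PRICE = 1500.0            # slide: headless workstation today


def years_until_admin_matches_hardware(price: float = WORKSTATION_PRICE,
                                       admin: float = ADMIN_COST_PER_MACHINE_YEAR) -> float:
    """Years after which cumulative admin cost equals the purchase price."""
    return price / admin


def cluster_admin_cost_per_year(nodes: int,
                                admin: float = ADMIN_COST_PER_MACHINE_YEAR) -> float:
    """Yearly admin cost, assuming cost scales linearly with node count."""
    return nodes * admin


if __name__ == "__main__":
    print(f"break-even: {years_until_admin_matches_hardware():.2f} years")
    print(f"100 nodes: ${cluster_admin_cost_per_year(100):,.0f} per year")
```

Under these assumptions administration overtakes the hardware price in a little over two years, and linear scaling is exactly what automated monitoring is meant to break.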


System Administration, cont’d.

Extensible Scalable Monitoring For Clusters of Computers (Anderson & Patterson, UC Berkeley)

Relational tables allow properties & queries of interest to evolve as the cluster evolves

Extensive visualization support allows humans to make sense of masses of data

Multiple levels of caching decouple data collection from aggregation

Data updates can be “pulled” on demand or triggered by push
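The point about relational tables can be sketched with an in-memory SQLite database. The schema and names below (`samples`, `node`, `metric`, `value`) are my own illustrative assumptions and are not taken from the Anderson & Patterson system; the sketch shows only that a new property or query needs no schema change.

```python
import sqlite3


def build_table(samples):
    """Load (node, metric, value) rows into an in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE samples (node TEXT, metric TEXT, value REAL)")
    db.executemany("INSERT INTO samples VALUES (?, ?, ?)", samples)
    return db


def aggregate(db, metric):
    """Return (node count, mean, max) for one metric across the cluster."""
    return db.execute(
        "SELECT COUNT(*), AVG(value), MAX(value) FROM samples WHERE metric = ?",
        (metric,),
    ).fetchone()


if __name__ == "__main__":
    db = build_table([
        ("u0", "cpu_load", 0.5),
        ("u1", "cpu_load", 1.5),
        ("u0", "disk_free_gb", 2.0),  # a new property is just another row
    ])
    print(aggregate(db, "cpu_load"))
```

Because each property is a row rather than a column, adding a metric later is an insert, and a new question about the cluster is just another query.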


Visualizing Data: Example

Display aggregates of various interesting machine properties on the NOW’s

Note use of aggregation, color


Case Study: The Berkeley NOW

History and Pictures of an early research cluster:

NOW-0: four HP-735’s

NOW-1: 32 headless Sparc-10’s and Sparc-20’s

NOW-2: 100 UltraSparc 1’s, Myrinet interconnect

inktomi.berkeley.edu: four Sparc-10’s

www.hotbot.com: 160 Ultra’s, 200 CPU’s total

NOW-3: eight 4-way SMP’s

Myrinet interconnection:

In addition to commodity switched Ethernet

Originally Sparc SBus, now available on PCIbus

http://now.cs.berkeley.edu/pictures.html


The Adventures of NOW: Applications

AlphaSort: 8.41 GB in one minute, 95 UltraSparcs:

(runner up: Ordinal Systems nSort on SGI Origin, 5 GB)

pre-1997 record, 1.6 GB on an SGI Challenge

40-bit DES key crack in 3.5 hours:

“NOW+”: headless and some headed machines

inktomi.berkeley.edu (now inktomi.com):

now fastest search engine, largest aggregate capacity

TranSend proxy & Top Gun Wingman Pilot browser:

~15,000 users, 3-10 machines
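The headline numbers on this slide translate into throughput as sketched below. Two assumptions are mine: GB means 10^9 bytes, and the key search covered the entire 40-bit keyspace (an expected-case search would test about half of it). Function names are illustrative.

```python
def sort_throughput_mb_per_s(gigabytes: float, seconds: float, nodes: int):
    """Return (aggregate, per-node) sort throughput in MB/s, with GB = 10^9 bytes."""
    aggregate = gigabytes * 1000.0 / seconds
    return aggregate, aggregate / nodes


def keys_per_second(key_bits: int, hours: float) -> float:
    """Key-test rate needed to sweep the whole keyspace in the given time."""
    return (2 ** key_bits) / (hours * 3600.0)


if __name__ == "__main__":
    agg, per_node = sort_throughput_mb_per_s(8.41, 60.0, 95)
    print(f"sort: {agg:.1f} MB/s aggregate, {per_node:.2f} MB/s per node")
    print(f"key search: {keys_per_second(40, 3.5) / 1e6:.0f} million keys/s")
```

The per-node sort rate is modest (about 1.5 MB/s); the record comes from aggregating it across 95 machines.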


The Adventures of NOW: Tools

GLUnix (coming up, later today)

xFS, a serverless network filesystem:

Why not just a big RAID on a single server?

Support for the Myrinet fast interconnect:

Active Message (AM-1 and AM-2) over Myrinet

Fast Sockets: one-copy TCP fast path over AM-1 on Myrinet

Moral: cluster tools are hard?


Cluster Summary

Clusters have potential advantages…but serious challenges to achieving them in practice:

Kind of like Network Computers?

Everyone and their brother is now selling a cluster:

Who’s selling a system, and who’s selling a promise?

Can clustering be sold as a “secret sauce”?

Next: non-clustering, and approaches to clustering
