
Cluster Computing with Linux

Prabhaker Mateti

Wright State University

Abstract

Cluster computing distributes the computational load to collections of similar machines. This talk describes what cluster computing is, the typical Linux packages used, and examples of large clusters in use today. This talk also reviews cluster computing modifications of the Linux kernel.

What Kind of Computing, did you say?

Sequential
Concurrent
Parallel
Distributed
Networked
Migratory

Cluster
Grid
Pervasive
Quantum
Optical
Molecular

Fundamentals Overview

Granularity of Parallelism
Synchronization
Message Passing
Shared Memory

Granularity of Parallelism

Fine-Grained Parallelism
Medium-Grained Parallelism
Coarse-Grained Parallelism
NOWs (Networks of Workstations)

Fine-Grained Machines

Tens of thousands of Processor Elements
Processor Elements
Slow (bit serial)
Small Fast Private RAM
Shared Memory
Interconnection Networks
Message Passing
Single Instruction Multiple Data (SIMD)

Medium-Grained Machines

Typical Configurations
Thousands of processors
Processors have power between coarse- and fine-grained
Either shared or distributed memory
Traditionally: Research Machines
Single Code Multiple Data (SCMD)

Coarse-Grained Machines

Typical Configurations
Hundreds/Thousands of Processors
Processors
Powerful (fast CPUs)
Large (cache, vectors, multiple fast buses)
Memory: Shared or Distributed-Shared
Multiple Instruction Multiple Data (MIMD)

Networks of Workstations

Exploit inexpensive Workstations/PCs
Commodity network
The NOW becomes a "distributed memory multiprocessor"
Workstations send+receive messages
C and Fortran programs with PVM, MPI, etc. libraries
Programs developed on NOWs are portable to supercomputers for production runs

Definition of "Parallel"

S1 begins at time b1, ends at e1
S2 begins at time b2, ends at e2
S1 || S2
Begins at min(b1, b2)
Ends at max(e1, e2)
Commutative (equivalent to S2 || S1)

Data Dependency

x := a + b; y := c + d;
x := a + b || y := c + d;
y := c + d; x := a + b;
x depends on a and b, y depends on c and d
Assumed a, b, c, d were independent

Types of Parallelism

Result: Data structure can be split into parts of the same structure.
Specialist: Each node specializes. Pipelines.
Agenda: Have a list of things to do. Each node can generalize.

Result Parallelism

Also called
Embarrassingly Parallel
Perfect Parallel
Computations that can be subdivided into sets of independent tasks that require little or no communication
Monte Carlo simulations
F(x, y, z)

Specialist Parallelism

Different operations performed simultaneously on different processors
E.g., simulating a chemical plant: one processor simulates the preprocessing of chemicals, one simulates reactions in the first batch, another simulates refining the products, etc.

Agenda Parallelism: MW Model

Manager
Initiates computation
Tracks progress
Handles workers' requests
Interfaces with user
Workers
Spawned and terminated by manager
Make requests to manager
Send results to manager
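A minimal sketch of this manager/worker (MW) pattern in C with MPI (the message passing library discussed later in this talk). Rank 0 is the manager; the other ranks are workers. do_work() and the task count are illustrative stand-ins, not part of the original slides.

#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1
#define TAG_STOP 2

static int do_work(int task) { return task * task; }   /* stand-in for real work */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                   /* manager */
        int ntasks = 20, next = 0, received = 0, result, stop = 0;
        MPI_Status st;
        for (int w = 1; w < size; w++) {               /* seed every worker */
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        while (received < next) {                      /* track progress */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            received++;
            if (next < ntasks) {                       /* hand out more work */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {                                   /* or terminate that worker */
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            }
        }
    } else {                                           /* worker */
        int task, result;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;         /* terminated by manager */
            result = do_work(task);
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}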

Embarrassingly Parallel

Result Parallelism is obvious
Ex1: Compute the square root of each of the million numbers given.
Ex2: Search for a given set of words among a billion web pages.
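A sketch of Ex1 using MPI in C, under the assumption that the input size divides evenly among the processes: the root scatters chunks, each process takes square roots independently (no communication), and the root gathers the results.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1000000;              /* "the million numbers given" */
    int chunk = N / size;               /* assume N is divisible by size for brevity */
    double *all = NULL;
    double *mine = malloc(chunk * sizeof(double));

    if (rank == 0) {                    /* root fills the input array */
        all = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) all[i] = (double)i;
    }

    /* distribute chunks, compute independently, collect results */
    MPI_Scatter(all, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk; i++) mine[i] = sqrt(mine[i]);
    MPI_Gather(mine, chunk, MPI_DOUBLE, all, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(mine);
    if (rank == 0) free(all);
    MPI_Finalize();
    return 0;
}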

Reduction

Combine several sub-results into one
Reduce r1 r2 … rn with op
Becomes r1 op r2 op … op rn
Hadoop is based on this idea
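A minimal sketch of a reduction with MPI, assuming each process already holds one sub-result r_i; the op here is addition (MPI_SUM).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double r_i, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    r_i = (double)(rank + 1);   /* stand-in for this process's sub-result */

    /* total = r1 op r2 op ... op rn, with op = MPI_SUM, delivered to rank 0 */
    MPI_Reduce(&r_i, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("reduced result = %g\n", total);
    MPI_Finalize();
    return 0;
}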

Shared Memory

Process A writes to a memory location
Process B reads from that memory location
Synchronization is crucial
Excellent speed
Semantics … ?

Shared Memory

Needs hardware support: multi-ported memory
Atomic operations:
Test-and-Set
Semaphores
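A minimal illustration of Test-and-Set in C11: two threads increment a shared counter under a spinlock built from an atomic flag. This only demonstrates the primitive; production code would normally use a mutex or semaphore.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* the test-and-set bit */
static long counter = 0;                      /* shared memory location */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        while (atomic_flag_test_and_set(&lock))   /* spin until we set the flag */
            ;                                      /* busy wait */
        counter++;                                 /* critical section */
        atomic_flag_clear(&lock);                  /* release */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);            /* expect 2000000 */
    return 0;
}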

Shared Memory Semantics: Assumptions

Global time is available. Discrete increments.
Shared variable, s = vi at ti, i = 0, …
Process A: s := v1 at time t1
Assume no other assignment occurred after t1.
Process B reads s at time t and gets value v.

Shared Memory: Semantics

Value of Shared Variable
v = v1, if t > t1
v = v0, if t < t1
v = ??, if t = t1
t = t1 ± discrete quantum
Next Update of Shared Variable
Occurs at t2
t2 = t1 + ?

Distributed Shared Memory

"Simultaneous" read/write access by spatially distributed processors
Abstraction layer of an implementation built from message passing primitives
Semantics not so clean

Semaphores

Semaphore s;
V(s) ::= s := s + 1
P(s) ::= when s > 0 do s := s - 1
Deeply studied theory.
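The same P and V operations in POSIX terms (sem_wait is P, sem_post is V); a minimal sketch where the main thread blocks on P until a worker thread performs V.

#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

static sem_t s;                    /* counting semaphore */

static void *worker(void *arg)
{
    (void)arg;
    printf("worker: doing some work\n");
    sem_post(&s);                  /* V(s): s := s + 1 */
    return NULL;
}

int main(void)
{
    pthread_t t;
    sem_init(&s, 0, 0);            /* initial value 0, shared between threads only */
    pthread_create(&t, NULL, worker, NULL);

    sem_wait(&s);                  /* P(s): when s > 0 do s := s - 1 (blocks until V) */
    printf("main: worker has signalled\n");

    pthread_join(t, NULL);
    sem_destroy(&s);
    return 0;
}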

Condition Variables

Condition C;
C.wait()
C.signal()
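A minimal sketch of C.wait()/C.signal() with POSIX threads; note that waiting is done under a mutex and inside a loop that re-checks the condition.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready = 0;                 /* the condition being waited on */

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    ready = 1;                        /* make the condition true */
    pthread_cond_signal(&c);          /* C.signal() */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&m);
    while (!ready)                    /* guard against spurious wakeups */
        pthread_cond_wait(&c, &m);    /* C.wait(): releases m, sleeps, re-acquires m */
    pthread_mutex_unlock(&m);

    printf("condition signalled\n");
    pthread_join(t, NULL);
    return 0;
}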

Distributed Shared Memory

A common address space that all the computers in the cluster share.
Difficult to describe semantics.

Distributed Shared Memory: Issues

Distributed
Spatially
LAN
WAN
No global time available

Distributed Computing

No shared memory
Communication among processes
Send a message
Receive a message
Asynchronous
Synchronous
Synergy among processes

Messages

Messages are sequences of bytes moving between processes
The sender and receiver must agree on the type structure of values in the message
"Marshalling": data layout so that there is no ambiguity such as "four chars" v. "one integer".

Message Passing

Process A sends a data buffer as a message to process B.
Process B waits for a message from A, and when it arrives copies it into its own local memory.
No memory shared between A and B.

Message Passing

Obviously,
Messages cannot be received before they are sent.
A receiver waits until there is a message.
Asynchronous
Sender never blocks, even if infinitely many messages are waiting to be received
Semi-asynchronous is a practical version of the above, with a large but finite amount of buffering

Message Passing: Point to Point

Q: send(m, P)
Send message m to process P
P: recv(x, Q)
Receive a message from process Q, and place it in variable x
The message data
Type of x must match that of m
As if x := m
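The send(m, P) / recv(x, Q) pair as it looks with MPI: process 0 plays Q and process 1 plays P, and the receive buffer must match the type and length of the sent data, "as if x := m". Run with at least two processes (e.g., mpirun -np 2).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, m = 42, x = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Q: send(m, P) -- send one int to process 1, message tag 0 */
        MPI_Send(&m, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* P: recv(x, Q) -- receive one int from process 0 into x */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received x = %d\n", x);   /* as if x := m */
    }

    MPI_Finalize();
    return 0;
}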

Broadcast

One sender Q, multiple receivers P
Not all receivers may receive at the same time
Q: broadcast(m)
Send message m to processes
P: recv(x, Q)
Receive message from process Q, and place it in variable x
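The same idea with MPI's broadcast: one root (Q) supplies the value and, after the call, every process has it in its own x.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        x = 99;                 /* the message m exists only at the sender Q */

    /* every process calls MPI_Bcast; afterwards all have x == 99 */
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("process %d has x = %d\n", rank, x);
    MPI_Finalize();
    return 0;
}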

Synchronous Message Passing

Sender blocks until receiver is ready to receive.
Cannot send messages to self.
No buffering.

Asynchronous Message Passing

Sender never blocks.
Receiver receives when ready.
Can send messages to self.
Infinite buffering.
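A sketch of the asynchronous style using MPI's non-blocking send: MPI_Isend returns immediately, and the sender only waits later, in MPI_Wait, for the send to complete. Since MPI buffering is large but finite, this is in practice the semi-asynchronous case described above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, m = 7, x = 0;
    MPI_Request req;
    MPI_Status  status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&m, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);  /* returns at once */
        /* ... the sender can keep computing here ... */
        MPI_Wait(&req, &status);     /* complete the send before reusing m */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("received %d\n", x);
    }

    MPI_Finalize();
    return 0;
}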

Message Passing

Speed not so good
Sender copies message into system buffers.
Message travels the network.
Receiver copies message from system buffers into local memory.
Special virtual memory techniques help.
Programming Quality
Less error-prone cf. shared memory

Computer Architectures

Architectures of Top 500 Sys
[Charts: architecture share of the Top 500 systems; images not reproduced in this transcript.]

"Parallel" Computers

Traditional supercomputers
SIMD, MIMD, pipelines
Tightly coupled shared memory
Bus level connections
Expensive to buy and to maintain
Cooperating networks of computers

Traditional Supercomputers

Very high starting cost
Expensive hardware
Expensive software
High maintenance
Expensive to upgrade

Computational Grids

"Grids are persistent environments that enable software applications to integrate instruments, displays, computational and information resources that are managed by diverse organizations in widespread locations."

Computational Grids

Individual nodes can be supercomputers, or NOWs
High availability
Accommodate peak usage
LAN : Internet :: NOW : Grid

Buildings-Full of Workstations

1. Distributed OSs have not taken a foothold.
2. Powerful personal computers are ubiquitous.
3. Mostly idle: more than 90% of the up-time?
4. 100 Mb/s LANs are common.
5. Windows and Linux are the top two OSs in terms of installed base.

Networks of Workstations (NOW)

Workstation
Network
Operating System
Cooperation
Distributed+Parallel Programs

What is a Workstation?

PC? Mac? Sun …?
"Workstation OS"

"Workstation OS"

Authenticated users
Protection of resources
Multiple processes
Preemptive scheduling
Virtual Memory
Hierarchical file systems
Network centric

Clusters of Workstations

Inexpensive alternative to traditional supercomputers
High availability
Lower down time
Easier access
Development platform with production runs on traditional supercomputers

Clusters of Workstations

Dedicated Nodes
Come-and-Go Nodes

Clusters with Part Time Nodes

Cycle Stealing: Running jobs on a workstation that does not belong to the job's owner.
Definition of Idleness: E.g., no keyboard and no mouse activity
Tools/Libraries
Condor
PVM
MPI

Cooperation

Workstations are "personal"
Others' use slows you down …
Willing to share
Willing to trust

Cluster Characteristics

Commodity off-the-shelf hardware
Networked
Common Home Directories
Open source software and OS
Support message passing programming
Batch scheduling of jobs
Process migration

Beowulf Cluster

Dedicated nodes
Single System View
Commodity off-the-shelf hardware
Internal high speed network
Open source software and OS
Support parallel programming such as MPI, PVM
Full trust in each other
Login from one node into another without authentication
Shared file system subtree

Example Clusters

July 1999
1000 nodes
Used for genetic algorithm research by John Koza, Stanford University
www.genetic-programming.com/

Typical Big Beowulf

1000 nodes Beowulf Cluster System
Used for genetic algorithm research by John Koza, Stanford University
http://www.genetic-programming.com/

Largest Cluster System

IBM BlueGene, 2007
DOE/NNSA/LLNL
Memory: 73728 GB
OS: CNK/SLES 9
Interconnect: Proprietary
PowerPC 440
106,496 nodes
478.2 TeraFLOPS on LINPACK

2008 World's Fastest: Roadrunner

Operating System: Linux
Interconnect: Infiniband
129,600 cores: PowerXCell 8i 3200 MHz
1105 TFlops
at DOE/NNSA/LANL

Cluster Computers for Rent

Transfer executable files, source code or data to your secure personal account on TTI servers (1). Do this securely using winscp for Windows or "secure copy" scp for Linux.

To execute your program, simply submit a job (2) to the scheduler using the "menusub" command or do it manually using "qsub" (we use the popular PBS batch system). There are working examples on how to submit your executable. Your executable is securely placed on one of our in-house clusters for execution (3).

Your results and data are written to your personal account in real time. Download your results (4).

Turnkey Cluster Vendors

Fully integrated Beowulf clusters with commercially supported Beowulf software systems are available from:
HP www.hp.com/solutions/enterprise/highavailability/
IBM www.ibm.com/servers/eserver/clusters/
Northrop Grumman.com
Accelerated Servers.com
Penguin Computing.com
www.aspsys.com/clusters
www.pssclabs.com

Why are Linux Clusters Good?

Low initial implementation cost
Inexpensive PCs
Standard components and Networks
Free Software: Linux, GNU, MPI, PVM
Scalability: can grow and shrink
Familiar technology; easy for users to adopt the approach, use and maintain the system.

2007 OS Share of Top 500

OS       Count   Share    Rmax (GF)   Rpeak (GF)   Processor
Linux      426   85.20%     4897046      7956758      970790
Windows      6    1.20%       47495        86797       12112
Unix        30    6.00%      408378       519178       73532
BSD          2    0.40%       44783        50176        5696
Mixed       34    6.80%     1540037      1900361      580693
MacOS        2    0.40%       28430        44816        5272
Totals     500     100%     6966169     10558086     1648095

http://www.top500.org/stats/list/30/osfam Nov 2007

2007 OS Share of Top 500

OS Family Count Share % Rmax (GF) Rpeak (GF) Procs

Linux 439 87.80 % 13309834 20775171 2099535

Windows 5 1.00 % 328114 429555 54144

Unix 23 4.60 % 881289 1198012 85376

BSD Based 1 0.20 % 35860 40960 5120

Mixed 31 6.20 % 2356048 2933610 869676

Mac OS 1 0.20 % 16180 24576 3072

Totals 500 100% 16927325 25401883 3116923

Many Books on Linux Clusters

Search: google.com, amazon.com
Example book:
William Gropp, Ewing Lusk, Thomas Sterling, MIT Press, 2003, ISBN: 0-262-69292-9

Why Is Beowulf Good?

Low initial implementation cost
Inexpensive PCs
Standard components and Networks
Free Software: Linux, GNU, MPI, PVM
Scalability: can grow and shrink
Familiar technology; easy for users to adopt the approach, use and maintain the system.

Single System Image

Common filesystem view from any node
Common accounts on all nodes
Single software installation point
Easy to install and maintain system
Easy to use for end-users

Closed Cluster Configuration

[Diagram: compute nodes on an internal high speed network plus a service network; a gateway node connects the cluster to the external network; a file server node acts as the front end.]

Open Cluster Configuration

[Diagram: compute nodes, a file server node, and a front end on a high speed network, all reachable from the external network.]

DIY Interconnection Network

Most popular: Fast Ethernet
Network topologies
Mesh
Torus
Switch v. Hub

Software Components

Operating System
Linux, FreeBSD, …
"Parallel" Programs
PVM, MPI, …
Utilities
Open source

Cluster Computing

Running ordinary programs as-is on a cluster is not cluster computing
Cluster computing takes advantage of:
Result parallelism
Agenda parallelism
Reduction operations
Process-grain parallelism

Google Linux Clusters

GFS: The Google File System
thousands of terabytes of storage across thousands of disks on over a thousand machines
150 million queries per day
Average response time of 0.25 sec
Near-100% uptime

Cluster Computing Applications

Mathematical
fftw (fast Fourier transform)
pblas (parallel basic linear algebra software)
atlas (a collection of mathematical libraries)
sprng (scalable parallel random number generator)
MPITB -- MPI toolbox for MATLAB
Quantum Chemistry software
Gaussian, qchem
Molecular Dynamics solvers
NAMD, gromacs, gamess
Weather modeling
MM5 (http://www.mmm.ucar.edu/mm5/mm5-home.html)

Development of Cluster Programs

New algorithms + code
Old programs re-done:
Reverse engineer design, and re-code
Use new languages that have distributed and parallel primitives
With new libraries
Parallelize legacy code
Mechanical conversion by software tools

Distributed Programs

Spatially distributed programs
A part here, a part there, …
Parallel
Synergy
Temporally distributed programs
Compute half today, half tomorrow
Combine the results at the end
Migratory programs
Have computation, will travel

Technological Bases of Distributed+Parallel Programs

Spatially distributed programs
Message passing
Temporally distributed programs
Shared memory
Migratory programs
Serialization of data and programs

Technological Bases for Migratory Programs

Same CPU architecture
X86, PowerPC, MIPS, SPARC, …, JVM
Same OS + environment
Be able to "checkpoint"
suspend, and then resume computation without loss of progress

Parallel Programming Languages

Shared-memory languages
Distributed-memory languages
Object-oriented languages
Functional programming languages
Concurrent logic languages
Data flow languages

Linda: Tuple Spaces, Shared Memory

<v1, v2, …, vk>
Atomic Primitives
In (t)
Read (t)
Out (t)
Eval (t)
Host language: e.g., C/Linda, JavaSpaces

Data Parallel Languages

Data is distributed over the processors as arrays
Entire arrays are manipulated:
A(1:100) = B(1:100) + C(1:100)
Compiler generates parallel code
Fortran 90
High Performance Fortran (HPF)

Parallel Functional Languages

Erlang http://www.erlang.org/
SISAL http://www.llnl.gov/sisal/
PCN Argonne
Haskell-Eden http://www.mathematik.uni-marburg.de/~eden
Objective Caml with BSP
SAC Functional Array Language

Message Passing Libraries

Programmer is responsible for initial data distribution, synchronization, and sending and receiving information
Parallel Virtual Machine (PVM)
Message Passing Interface (MPI)
Bulk Synchronous Parallel model (BSP)

BSP: Bulk Synchronous Parallel model

Divides computation into supersteps
In each superstep a processor can work on local data and send messages.
At the end of the superstep, a barrier synchronization takes place and all processors receive the messages which were sent in the previous superstep
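A rough illustration of one superstep, written here with plain MPI rather than a BSP library: local computation, then communication with a neighbour, then the barrier that closes the superstep; the received value is only used in the next superstep.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, local, from_left = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* ---- superstep 1: compute on local data, then communicate ---- */
    local = rank * rank;                               /* local work */
    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    MPI_Sendrecv(&local, 1, MPI_INT, right, 0,         /* send to right neighbour */
                 &from_left, 1, MPI_INT, left, 0,      /* receive from left neighbour */
                 MPI_COMM_WORLD, &status);
    MPI_Barrier(MPI_COMM_WORLD);                       /* end of superstep */

    /* ---- superstep 2: the received value is now available ---- */
    printf("process %d got %d from its left neighbour\n", rank, from_left);

    MPI_Finalize();
    return 0;
}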

BSP: Bulk Synchronous Parallel model

http://www.bsp-worldwide.org/
Book:
Rob H. Bisseling, "Parallel Scientific Computation: A Structured Approach using BSP and MPI," Oxford University Press, 2004, 324 pages, ISBN 0-19-852939-2.

BSP Library

Small number of subroutines to implement
process creation,
remote data access, and
bulk synchronization.
Linked to C, Fortran, … programs

Portable Batch System (PBS)

Prepare a .cmd file
naming the program and its arguments
properties of the job
the needed resources
Submit .cmd to the PBS Job Server: qsub command
Routing and Scheduling: The Job Server
examines .cmd details to route the job to an execution queue
allocates one or more cluster nodes to the job
communicates with the Execution Servers (moms) on the cluster to determine the current state of the nodes
When all of the needed nodes are allocated, passes the .cmd on to the Execution Server on the first node allocated (the "mother superior")
Execution Server
will login on the first node as the submitting user and run the .cmd file in the user's home directory
Runs an installation-defined prologue script
Gathers the job's output to the standard output and standard error
Executes an installation-defined epilogue script
Delivers stdout and stderr to the user

TORQUE, an open source PBS

Tera-scale Open-source Resource and QUEue manager (TORQUE) enhances OpenPBS
Fault Tolerance
Additional failure conditions checked/handled
Node health check script support
Scheduling Interface
Scalability
Significantly improved server to MOM communication model
Ability to handle larger clusters (over 15 TF/2,500 processors)
Ability to handle larger jobs (over 2000 processors)
Ability to support larger server messages
Logging
http://www.supercluster.org/projects/torque/

PVM, and MPI

Message passing primitives
Can be embedded in many existing programming languages
Architecturally portable
Open-sourced implementations

Parallel Virtual Machine (PVM)

PVM enables a heterogeneous collection of networked computers to be used as a single large parallel computer.
Older than MPI
Large scientific/engineering user community
http://www.csm.ornl.gov/pvm/
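A minimal PVM sketch along the lines of the classic master/worker examples: the same executable acts as master (no PVM parent) or worker (a spawned copy). The program name "ptest" and the squaring "work" are illustrative assumptions, and the executable must be installed where pvm_spawn can find it.

#include <pvm3.h>
#include <stdio.h>

int main(void)
{
    int mytid = pvm_mytid();           /* enroll in PVM, get my task id */
    int ptid  = pvm_parent();          /* task id of the task that spawned me */

    if (ptid == PvmNoParent) {         /* master branch */
        int tid, n = 5, reply;
        /* spawn one copy of this same executable; "ptest" is an illustrative name */
        pvm_spawn("ptest", NULL, PvmTaskDefault, "", 1, &tid);

        pvm_initsend(PvmDataDefault);  /* pack and send n to the worker */
        pvm_pkint(&n, 1, 1);
        pvm_send(tid, 1);

        pvm_recv(tid, 2);              /* wait for the worker's reply */
        pvm_upkint(&reply, 1, 1);
        printf("master got %d\n", reply);
    } else {                           /* worker branch */
        int n;
        pvm_recv(ptid, 1);
        pvm_upkint(&n, 1, 1);
        n = n * n;                     /* the "work" */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&n, 1, 1);
        pvm_send(ptid, 2);
    }

    pvm_exit();                        /* leave the virtual machine */
    return 0;
}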

Message Passing Interface (MPI)

http://www-unix.mcs.anl.gov/mpi/
MPI-2.0 http://www.mpi-forum.org/docs/
MPICH: www.mcs.anl.gov/mpi/mpich/ by Argonne National Laboratory and Mississippi State University
LAM: http://www.lam-mpi.org/
http://www.open-mpi.org/

OpenMP for shared memory

Distributed shared memory API
User gives hints as directives to the compiler
http://www.openmp.org
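A minimal OpenMP sketch in C: the pragma is the hint, and an OpenMP-aware compiler (e.g., gcc -fopenmp) generates the threaded code; the reduction clause combines the per-thread partial sums.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int N = 1000000;
    static double a[1000000];
    double sum = 0.0;

    /* the directive is the hint: split the loop across threads, combine partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i * 0.5;
        sum += a[i];
    }

    printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}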

SPMD

Single program, multiple data
Contrast with SIMD
Same program runs on multiple nodes
May or may not be lock-step
Nodes may be of different speeds
Barrier synchronization

Condor

Cooperating workstations: come and go.
Migratory programs
Checkpointing
Remote IO
Resource matching
http://www.cs.wisc.edu/condor/

Migration of Jobs

Policies
Immediate-Eviction
Pause-and-Migrate
Technical Issues
Check-pointing: Preserving the state of the process so it can be resumed.
Migrating from one architecture to another

Kernels Etc Mods for Clusters

Dynamic load balancing
Transparent process-migration
Kernel Mods
http://openmosix.sourceforge.net/
http://kerrighed.org/
http://openssi.org/
http://ci-linux.sourceforge.net/
CLuster Membership Subsystem ("CLMS") and Internode Communication Subsystem
http://www.gluster.org/
GlusterFS: Clustered File Storage of petabytes
GlusterHPC: High Performance Compute Clusters
http://boinc.berkeley.edu/
Open-source software for volunteer computing and grid computing

OpenMosix Distro

Quantian Linux
Boot from DVD-ROM
Compressed file system on DVD
Several GB of cluster software
http://dirk.eddelbuettel.com/quantian.html
Live CD/DVD or Single Floppy Bootables
http://bofh.be/clusterknoppix/
http://sentinix.org/
http://itsecurity.mq.edu.au/chaos/
http://openmosixloaf.sourceforge.net/
http://plumpos.sourceforge.net/
http://www.dynebolic.org/
http://bccd.cs.uni.edu/
http://eucaristos.sourceforge.net/
http://gomf.sourceforge.net/
Can be installed on HDD

What is openMOSIX?

An open source enhancement to the Linux kernel
Cluster with come-and-go nodes
System image model: Virtual machine with lots of memory and CPU
Granularity: Process
Improves the overall (cluster-wide) performance
Multi-user, time-sharing environment for the execution of both sequential and parallel applications
Applications unmodified (no need to link with a special library)

What is openMOSIX?

Execution environment:
farm of diskless x86 based nodes
UP (uniprocessor), or SMP (symmetric multiprocessor)
connected by standard LAN (e.g., Fast Ethernet)
Adaptive resource management to dynamic load characteristics
CPU, RAM, I/O, etc.
Linear scalability

Users' View of the Cluster

Users can start from any node in the cluster, or the sysadmin sets up a few nodes as login nodes
Round-robin DNS: "hpc.clusters" with many IPs assigned to the same name
Each process has a Home-Node
Migrated processes always appear to run at the home node, e.g., "ps" shows all your processes, even if they run elsewhere

MOSIX architecture

network transparency
preemptive process migration
dynamic load balancing
memory sharing
efficient kernel communication
probabilistic information dissemination algorithms
decentralized control and autonomy

A two tier technology

Information gathering and dissemination
Support scalable configurations by probabilistic dissemination algorithms
Same overhead for 16 nodes or 2056 nodes
Pre-emptive process migration that can migrate any process, anywhere, anytime - transparently
Supervised by adaptive algorithms that respond to global resource availability
Transparent to applications, no change to user interface

Tier 1: Information gathering and dissemination

In each unit of time (e.g., 1 second) each node gathers information about:
CPU(s) speed, load and utilization
Free memory
Free proc-table/file-table slots
Info sent to a randomly selected node
Scalable - more nodes, better scattering

Tier 2: Process migration

Load balancing: reduce variance between pairs of nodes to improve the overall performance
Memory ushering: migrate processes from a node that has nearly exhausted its free memory, to prevent paging
Parallel File I/O: bring the process to the file-server, direct file I/O from migrated processes

Network transparency

The user and applications are provided a virtual machine that looks like a single machine.
Example: Disk access from diskless nodes on the fileserver is completely transparent to programs

Preemptive process migration

Any user's process, transparently and at any time, can/may migrate to any other node.
The migrating process is divided into:
system context (deputy) that may not be migrated from the home workstation (UHN);
user context (remote) that can be migrated to a diskless node

Splitting the Linux process

System context (environment) - site dependent - "home" confined
Connected by an exclusive link for both synchronous (system calls) and asynchronous (signals, MOSIX events)
Process context (code, stack, data) - site independent - may migrate

[Diagram: the deputy stays in the userland and kernel of the master (home) node; the remote part runs on a diskless node; the two halves are connected by the openMOSIX link.]

Dynamic load balancing

Initiates process migrations in order to balance the load of the farm
Responds to variations in the load of the nodes, runtime characteristics of the processes, number of nodes and their speeds
Makes continuous attempts to reduce the load differences among nodes
The policy is symmetrical and decentralized
All of the nodes execute the same algorithm
The reduction of the load differences is performed independently by any pair of nodes

Memory sharing

Places the maximal number of processes in the farm's main memory, even if it implies an uneven load distribution among the nodes
Delays as much as possible the swapping out of pages
The decision of which process to migrate and where to migrate it is based on the knowledge of the amount of free memory in other nodes

Efficient kernel communication

Reduces the overhead of the internal kernel communications (e.g., between the process and its home site, when it is executing in a remote site)
Fast and reliable protocol with low startup latency and high throughput

Probabilistic information dissemination algorithms

Each node has sufficient knowledge about available resources in other nodes, without polling
Measure the amount of available resources on each node
Receive resource indices that each node sends at regular intervals to a randomly chosen subset of nodes
The use of a randomly chosen subset of nodes facilitates dynamic configuration and overcomes node failures

Decentralized control and autonomy

Each node makes its own control decisions independently.
No master-slave relationships
Each node is capable of operating as an independent system
Nodes may join or leave the farm with minimal disruption

File System Access

MOSIX is particularly efficient for distributing and executing CPU-bound processes
However, the processes are inefficient with significant file operations
I/O accesses through the home node incur high overhead
"Direct FSA" is for better handling of I/O:
Reduce the overhead of executing I/O oriented system-calls of a migrated process
A migrated process performs I/O operations locally, in the current node, not via the home node
Processes migrate more freely

DFSA Requirements

DFSA can work with any file system that satisfies some properties.
Unique mount point: The file systems are identically mounted on all nodes.
File consistency: when an operation is completed in one node, any subsequent operation on any other node will see the results of that operation
Required because an openMOSIX process may perform consecutive syscalls from different nodes
Time-stamp consistency: if file A is modified after B, A must have a timestamp > B's timestamp

DFSA Conforming FS

Global File System (GFS)
openMOSIX File System (MFS)
Lustre global file system
General Parallel File System (GPFS)
Parallel Virtual File System (PVFS)
Available operations: all common file-system and I/O system-calls

Mateti, Linux ClustersMateti, Linux Clusters 115115

Global File System (GFS)Global File System (GFS)

Provides local caching and cache consistency Provides local caching and cache consistency over the cluster using a unique locking over the cluster using a unique locking mechanismmechanism

Provides direct access from any node to any Provides direct access from any node to any storage entity storage entity

GFS + process migration combine the GFS + process migration combine the advantages of load-balancing with direct disk advantages of load-balancing with direct disk access from any node - for parallel file access from any node - for parallel file operationsoperations

Non-GNU License (SPL)Non-GNU License (SPL)


The MOSIX File System (MFS)

Provides a unified view of all files and all mounted file systems on all the nodes of a MOSIX cluster, as if they were within a single file system.
Makes all directories and regular files throughout an openMOSIX cluster available from all the nodes.
Provides cache consistency.
Allows parallel file access by proper distribution of files (a process migrates to the node with the needed files).


MFS Namespace

[Figure: two directory trees, each a root / containing etc, usr, var, bin, and mfs; the second tree illustrates how another node's entire file system appears under the local /mfs mount point.]
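A minimal sketch of how this namespace is typically set up and used, following openMosix HOWTO conventions; the fstab line, node numbers, and paths are assumptions and may differ between MOSIX/openMosix versions:

    # /etc/fstab entry mounting MFS at /mfs with DFSA enabled (verify against your version)
    mfs_mnt  /mfs  mfs  dfsa=1  0  0

    # After mounting, node 2's root file system is reachable under /mfs/2
    ls /mfs/2/etc
    cat /mfs/2/etc/hostname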


Lustre: A Scalable File System

http://www.lustre.org/
Scalable data serving through parallel data striping.
Scalable metadata.
Separation of file metadata and storage-allocation metadata, to further increase scalability.
Object technology, allowing stackable, value-add functionality.
Distributed operation.


Parallel Virtual File System (PVFS)

http://www.parl.clemson.edu/pvfs/
User-controlled striping of files across nodes.
Commodity network and storage hardware.
MPI-IO support through ROMIO.
Traditional Linux file-system access through the pvfs-kernel package (see the sketch below).
The native PVFS library interface.
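A hedged sketch of the two access paths listed above. The mount helper name, server host, and file names are assumptions that vary with the PVFS version in use:

    # Kernel-module path (pvfs-kernel package): mount the PVFS volume like an ordinary FS
    /sbin/mount.pvfs mgr-node:/pvfs-meta /mnt/pvfs     # helper name and paths are assumptions
    cp big-input.dat /mnt/pvfs/                        # the file is striped across the I/O nodes

    # MPI-IO path: ROMIO lets MPI programs perform parallel I/O on files under the mount
    mpirun -np 8 ./mpi_io_app /mnt/pvfs/big-input.dat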


General Parallel File System (GPFS)

www.ibm.com/servers/eserver/clusters/software/gpfs.html
"GPFS for Linux provides world class performance, scalability, and availability for file systems. It offers compliance to most UNIX file standards for end user applications and administrative extensions for ongoing management and tuning. It scales with the size of the Linux cluster and provides NFS Export capabilities outside the cluster."


Mosix Ancillary Tools

Kernel debugger
Kernel profiler
Parallel make (all exec() calls become mexec())
openMosix PVM
openMosix MM5
openMosix HMMER
openMosix Mathematica


Cluster Administration

LTSP (www.ltsp.org)
ClumpOS (www.clumpos.org)
mps
mtop
mosctl


Mosix commands & files

setpe – starts and stops Mosix on the current node
tune – calibrates the node speed parameters
mtune – calibrates the node MFS parameters
migrate – forces a process to migrate
mosctl – comprehensive Mosix administration tool
mosrun, nomig, runhome, runon, cpujob, iojob, nodecay, fastdecay, slowdecay – various ways to start a program with a specific placement or migration policy
mon & mosixview – command-line and graphical interfaces to monitor the cluster status
/etc/mosix.map – contains the IP numbers of the cluster nodes
/etc/mosgates – contains the number of gateway nodes present in the cluster
/etc/overheads – contains the output of the 'tune' command, to be loaded at startup
/etc/mfscosts – contains the output of the 'mtune' command, to be loaded at startup
/proc/mosix/admin/* – various files, some of them binary, to check and control Mosix
A short worked example appears below.
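A minimal administration sketch tying a few of these pieces together. The map syntax follows the openMosix HOWTO (node number, first IP address, range size); the IP addresses, PID, and exact flags are illustrative assumptions:

    # /etc/mosix.map — node number, first IP address, range size
    1  192.168.1.10  8      # declares nodes 1-8 at 192.168.1.10 .. 192.168.1.17

    # Load the map and start MOSIX on this node, then steer processes by hand
    setpe -w -f /etc/mosix.map
    mosctl stay             # keep local processes from migrating away
    migrate 4211 3          # force PID 4211 to move to node 3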


Monitoring

Cluster monitor – 'mosmon' (or 'qtop'):
Displays load, speed, utilization and memory information across the cluster.
Uses the /proc/hpc/info interface for retrieving the information.
Applet/CGI-based monitoring tools – display cluster properties:
Access via the Internet
Multiple resources
openMosixview with X GUI
Typical command-line usage is shown below.
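The command-line monitors mentioned on this and the surrounding slides are typically invoked without arguments; output details vary by MOSIX/openMosix version:

    mosmon     # curses display of per-node load, speed, utilization and memory
    mtop       # 'top' extended with the node on which each process is currently running
    mps        # 'ps' extended with the current location of each process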


openMosixview

by Matthias Rechenburg
www.mosixview.com


Qlusters OS

http://www.qlusters.com/
Based in part on openMosix technology
Migrating sockets
Network RAM already implemented
Cluster Installer, Configurator, Monitor, Queue Manager, Launcher, Scheduler
Partnership with IBM, Compaq, Red Hat and Intel


QlusterOS Monitor


More Information on Clusters

www.ieeetfcc.org/ – IEEE Task Force on Cluster Computing (now the Technical Committee on Scalable Computing, TCSC)
lcic.org/ – "a central repository of links and information regarding Linux clustering, in all its forms"
www.beowulf.org – resources for clusters built on commodity hardware running the Linux OS and open-source software
linuxclusters.com/ – "Authoritative resource for information on Linux Compute Clusters and Linux High Availability Clusters"
www.linuxclustersinstitute.org/ – "To provide education and advanced technical training for the deployment and use of Linux-based computing clusters to the high-performance computing community worldwide."


Levels of Parallelism

[Figure: code-granularity hierarchy, from coarsest to finest]
Large grain (task level) – whole programs (Task i-1, Task i, Task i+1) – exploited with PVM/MPI
Medium grain (control level) – functions/threads (func1, func2, func3) – exploited with threads
Fine grain (data level) – loop iterations (a(0)=.., a(1)=.., a(2)=..) – exploited by compilers
Very fine grain (multiple issue) – individual instructions (+, x, load) – exploited in CPU hardware