
Page 1: High Performance Cluster Computing

1

By: Rajkumar Buyya, Monash University, Melbourne.

[email protected] http://www.dgs.monash.edu.au/~rajkumar

High Performance Cluster Computing

Page 2: High Performance Cluster Computing

2

Objectives

Learn and Share Recent advances in cluster computing (both in research and commercial settings):

– Architecture
– System Software
– Programming Environments and Tools
– Applications

Page 3: High Performance Cluster Computing

3

Agenda

Overview of Computing
Motivations & Enabling Technologies
Cluster Architecture & its Components
Clusters Classifications
Cluster Middleware
Single System Image
Representative Cluster Systems
– Berkeley NOW and Solaris-MC
Resources and Conclusions

Page 4: High Performance Cluster Computing

4

Computing Elements

[Diagram: a multi-processor computing system layered from bottom to top - hardware (processors P), a microkernel operating system, a threads interface, processes and threads, applications, and programming paradigms.]

Page 5: High Performance Cluster Computing

5

Two Eras of Computing

[Diagram: timeline from 1940 to 2030 showing the Sequential Era and the Parallel Era, each progressing through Architectures, System Software, Applications, and P.S.Es, and moving from R & D to Commercialization to Commodity.]

Page 6: High Performance Cluster Computing

6

Announcement: formation of the IEEE Task Force on Cluster Computing (TFCC)

http://www.dgs.monash.edu.au/~rajkumar/tfcc/

http://www.dcs.port.ac.uk/~mab/tfcc/

Page 7: High Performance Cluster Computing

7

TFCC Activities...

Network Technologies
OS Technologies
Parallel I/O
Programming Environments
Java Technologies
Algorithms and Applications
Analysis and Profiling
Storage Technologies
High Throughput Computing

Page 8: High Performance Cluster Computing

8

TFCC Activities...

High Availability
Single System Image
Performance Evaluation
Software Engineering
Education
Newsletter
Industrial Wing

– All the above have their own pages; see pointers from: http://www.dgs.monash.edu.au/~rajkumar/tfcc/

Page 9: High Performance Cluster Computing

9

TFCC Activities...

Mailing list, Workshops, Conferences, Tutorials, Web-resources etc.

Resources for introducing the subject at senior undergraduate and graduate levels.

Tutorials/Workshops at IEEE Chapters, and so on.

Visit TFCC Page for more details:

– http://www.dgs.monash.edu.au/~rajkumar/tfcc/ periodically (updated daily!).

Page 10: High Performance Cluster Computing

10

Computing Power and Computer Architectures

Page 11: High Performance Cluster Computing

11

Need for more Computing Power:

Grand Challenge Applications

Solving technology problems using computer modeling, simulation and analysis:

Life Sciences
Mechanical Design & Analysis (CAD/CAM)
Aerospace
Geographic Information Systems

Page 12: High Performance Cluster Computing

12

How to Run App. Faster ?

There are 3 ways to improve performance:

– 1. Work Harder
– 2. Work Smarter
– 3. Get Help

Computer Analogy

– 1. Use faster hardware: e.g. reduce the time per instruction (clock cycle).

– 2. Use optimized algorithms and techniques.

– 3. Use multiple computers to solve the problem: that is, increase the number of instructions executed per clock cycle.

Page 13: High Performance Cluster Computing

13

Sequential Architecture Limitations

Sequential architectures are reaching physical limitations (speed of light, thermodynamics).

Hardware improvements like pipelining, superscalar execution, etc., are non-scalable and require sophisticated compiler technology.

Vector processing works well for certain kinds of problems.

Page 14: High Performance Cluster Computing

14

Computational Power Improvement

[Graph: computational power (C.P.I.) vs. number of processors, comparing multiprocessor and uniprocessor growth.]

Page 15: High Performance Cluster Computing

15

Human Physical Growth Analogy: Computational Power Improvement

[Graph: growth vs. age - vertical growth early in life, horizontal growth later.]

Page 16: High Performance Cluster Computing

16

Why Parallel Processing NOW?

The technology of parallel processing (PP) is mature and can be exploited commercially; there is significant R & D work on the development of tools and environments.

Significant developments in networking technology are paving the way for heterogeneous computing.

Page 17: High Performance Cluster Computing

17

History of Parallel Processing

PP can be traced to a tablet dated around 100 BC.

The tablet has 3 calculating positions. From the multiple positions we can infer they were intended for:

Reliability / Speed

Page 18: High Performance Cluster Computing

18

Motivating Factors

The aggregate speed with which complex calculations are carried out by millions of neurons in the human brain is amazing, although an individual neuron's response is slow (milliseconds) - this demonstrates the feasibility of PP.

Page 19: High Performance Cluster Computing

19

Taxonomy of Architectures

Simple classification by Flynn (based on the number of instruction and data streams):

SISD - conventional
SIMD - data parallel, vector computing
MISD - systolic arrays
MIMD - very general, multiple approaches

Current focus is on the MIMD model, using general purpose processors or multicomputers.

Page 20: High Performance Cluster Computing

20

SISD : A Conventional Computer

Speed is limited by the rate at which the computer can transfer information internally.

[Diagram: a single processor with one instruction stream, data input, and data output.]

Ex: PC, Macintosh, Workstations

Page 21: High Performance Cluster Computing

21

The MISD Architecture

More of an intellectual exercise than a practical configuration. A few have been built, but none are commercially available.

[Diagram: a single data input stream passes through processors A, B, and C, each driven by its own instruction stream, producing a data output stream.]

Page 22: High Performance Cluster Computing

22

SIMD Architecture

Ex: CRAY vector processing machines, Thinking Machines CM*

Ci <= Ai * Bi

[Diagram: a single instruction stream drives processors A, B, and C, each with its own data input stream and data output stream.]
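To make Ci <= Ai * Bi concrete, here is a minimal sketch in C of the same data-parallel operation (array names and size are illustrative): one instruction, the multiply, is applied across all elements, exactly what a SIMD machine would issue to many processing elements at once.

/* Data-parallel elementwise multiply, C[i] = A[i] * B[i].
   On a SIMD machine the single multiply instruction is broadcast to all
   processing elements; here a plain loop expresses the same operation. */
#include <stdio.h>

#define N 8

int main(void)
{
    double A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2.0 * i; }
    for (int i = 0; i < N; i++)
        C[i] = A[i] * B[i];     /* same instruction, different data */
    printf("C[3] = %g\n", C[3]);
    return 0;
}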

Page 23: High Performance Cluster Computing

23

MIMD Architecture

Unlike SISD and MISD machines, a MIMD computer works asynchronously.

Shared memory (tightly coupled) MIMD
Distributed memory (loosely coupled) MIMD

[Diagram: processors A, B, and C, each with its own instruction stream, data input stream, and data output stream.]

Page 24: High Performance Cluster Computing

26

Main HPC Architectures..1a

SISD - mainframes, workstations, PCs.
SIMD Shared Memory - Vector machines, Cray...
MIMD Shared Memory - Sequent, KSR, Tera, SGI, SUN.
SIMD Distributed Memory - DAP, TMC CM-2...
MIMD Distributed Memory - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation clusters (IBM SP2, DEC, Sun, HP).

Page 25: High Performance Cluster Computing

27

Main HPC Architectures..1b.

NOTE: Modern sequential machines are not purely SISD - advanced RISC processors use many concepts from vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc.) in order to achieve one or more arithmetic operations per clock cycle.

Page 26: High Performance Cluster Computing

28

Parallel Processing Paradox

The time required to develop a parallel application for solving a Grand Challenge Application (GCA) is equal to:

– Half Life of Parallel Supercomputers.

Page 27: High Performance Cluster Computing

29

The Need for Alternative

Supercomputing Resources

Vast numbers of under-utilised workstations are available to use.

Huge numbers of unused processor cycles and resources could be put to good use in a wide variety of application areas.

Reluctance to buy supercomputers, due to their cost and short life span.

Distributed compute resources “fit” better into today's funding model.

Page 28: High Performance Cluster Computing

30

Scalable Parallel Computers

Page 29: High Performance Cluster Computing

31

Design Space of Competing Computer Architectures

Page 30: High Performance Cluster Computing

32

Towards Inexpensive Supercomputing

It is:

Cluster Computing.. The Commodity Supercomputing!

Page 31: High Performance Cluster Computing

33

Motivation for using Clusters

Surveys show utilisation of CPU cycles of desktop workstations is typically <10%.

Performance of workstations and PCs is rapidly improving

As performance grows, percent utilisation will decrease even further!

Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.

Page 32: High Performance Cluster Computing

34

Motivation for using Clusters

The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.

Workstation clusters are easier to integrate into existing networks than special parallel computers.

Page 33: High Performance Cluster Computing

35

Motivation for using Clusters

The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.

Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.

Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!

Page 34: High Performance Cluster Computing

36

Cycle Stealing

Usually a workstation will be owned by an individual, group, department, or organisation - it is dedicated to the exclusive use of its owners.

This brings problems when attempting to form a cluster of workstations for running distributed applications.

Page 35: High Performance Cluster Computing

37

Cycle Stealing

Typically, there are three types of owners, who use their workstations mostly for:

1. Sending and receiving email and preparing documents.

2. Software development - edit, compile, debug and test cycle.

3. Running compute-intensive applications.

Page 36: High Performance Cluster Computing

38

Cycle Stealing

Cluster computing aims to steal spare cycles from (1) and (2) to provide resources for (3).

However, this requires overcoming the ownership hurdle - people are very protective of their workstations.

Usually this requires an organisational mandate that computers are to be used in this way.

Page 37: High Performance Cluster Computing

39

Cycle Stealing

Stealing cycles outside standard work hours (e.g. overnight) is easy, but stealing idle cycles during work hours without impacting interactive use (both CPU and memory) is much harder.

Page 38: High Performance Cluster Computing

40

Rise & Fall of Computing Technologies

1970: Mainframes -> Minis
1980: Minis -> PCs
1995: PCs -> Network Computing

Page 39: High Performance Cluster Computing

41

Original Food Chain Picture

Page 40: High Performance Cluster Computing

42

1984 Computer Food Chain

Mainframe, Vector Supercomputer, Mini Computer, Workstation, PC

Page 41: High Performance Cluster Computing

43

1994 Computer Food Chain

Mainframe, Vector Supercomputer, MPP, Workstation, PC
Mini Computer (hitting wall soon)
(future is bleak)

Page 42: High Performance Cluster Computing

44

Computer Food Chain (Now and Future)

Page 43: High Performance Cluster Computing

45

What is a cluster?

Cluster:
– a collection of nodes connected together
– Network: faster, closer connection than a typical network (LAN)
– Looser connection than a symmetric multiprocessor (SMP)

Page 44: High Performance Cluster Computing

46

1990s Building Blocks

There is no “near commodity” component.

Building block = complete computers (HW & SW) shipped in 100,000s: Killer micro, Killer DRAM, Killer disk, Killer OS, Killer packaging, Killer investment

Leverage billion $ per year investment

Interconnecting Building Blocks => Killer Net
– High Bandwidth
– Low latency
– Reliable
– Commodity (ATM?)

Page 45: High Performance Cluster Computing

47

Why Clusters now? (Beyond Technology and Cost)

Building block is big enough (vs. Intel 8086)
Workstation performance is doubling every 18 months
Networks are faster
– Higher link bandwidth (vs. 10 Mbit Ethernet)
– Switch-based networks coming (ATM)
– Interfaces simple & fast (Active Messages)
Striped files preferred (RAID)
Demise of Mainframes, Supercomputers, & MPPs

Page 46: High Performance Cluster Computing

48

Architectural Drivers…(cont)

Node architecture dominates performance
– processor, cache, bus, and memory
– design and engineering $ => performance

Greatest demand for performance is on large systems
– must track the leading edge of technology without lag

MPP network technology => mainstream
– system area networks

System on every node is a powerful enabler
– very high speed I/O, virtual memory, scheduling, …

Page 47: High Performance Cluster Computing

49

...Architectural Drivers

Clusters can be grown: incremental scalability (up, down, and across)
– Individual node performance can be improved by adding additional resources (new memory blocks/disks)
– New nodes can be added or nodes can be removed
– Clusters of Clusters and Metacomputing

Complete software tools
– Threads, PVM, MPI, DSM, C, C++, Java, Parallel C++, Compilers, Debuggers, OS, etc.

Wide class of applications
– Sequential and grand challenge parallel applications

Page 48: High Performance Cluster Computing

50

Example Clusters: Berkeley NOW

100 Sun UltraSparcs
– 200 disks

Myrinet SAN
– 160 MB/s

Fast communication
– AM, MPI, ...

Ether/ATM switched external net

Global OS

Self Config

Page 49: High Performance Cluster Computing

51

Basic Components

[Diagram: each node is a Sun Ultra 170 with processor (P), cache ($), and memory (M), plus a Myricom NIC on the I/O bus connecting to the 160 MB/s MyriNet.]

Page 50: High Performance Cluster Computing

52

Massive Cheap Storage Cluster

Basic unit:

2 PCs double-ending four SCSI chains of 8 disks each

Currently serving Fine Art at http://www.thinker.org/imagebase/

Page 51: High Performance Cluster Computing

53

Cluster of SMPs (CLUMPS)

Four Sun E5000s

– 8 processors each
– 4 Myricom NICs each

Multiprocessor, Multi-NIC, Multi-Protocol

NPACI => Sun 450s

Page 52: High Performance Cluster Computing

54

Millennium PC Clumps

Inexpensive, easy to manage Cluster

Replicated in many departments

Prototype for very large PC cluster

Page 53: High Performance Cluster Computing

55

Adoption of the Approach

Page 54: High Performance Cluster Computing

56

So What’s So Different?

Commodity parts?
Communications packaging?
Incremental scalability?
Independent failure?
Intelligent network interfaces?
Complete system on every node
– virtual memory
– scheduler
– files
– ...

Page 55: High Performance Cluster Computing

57

OPPORTUNITIES & CHALLENGES

Page 56: High Performance Cluster Computing

58

Opportunity of Large-scale Computing on NOW

Shared pool of computing resources: processors, memory, disks, connected by an interconnect.

Guarantee at least one workstation to many individuals (when active).

Deliver a large % of collective resources to a few individuals at any one time.

Page 57: High Performance Cluster Computing

59

Windows of Opportunities

MPP/DSM:
– Compute across multiple systems: parallel processing.

Network RAM:
– Use idle memory in other nodes; page across other nodes' idle memory.

Software RAID:
– File system supporting parallel I/O and reliability; mass storage.

Multi-path Communication:
– Communicate across multiple networks: Ethernet, ATM, Myrinet

Page 58: High Performance Cluster Computing

60

Parallel Processing

Scalable parallel applications require:
– good floating-point performance
– low overhead communication
– scalable network bandwidth
– parallel file system

Page 59: High Performance Cluster Computing

61

Network RAM

Performance gap between processor and disk has widened.

Thrashing to disk degrades performance significantly

Paging across networks can be effective with high performance networks and an OS that recognizes idle machines

Typically thrashing to network RAM can be 5 to 10 times faster than thrashing to disk

Page 60: High Performance Cluster Computing

62

Software RAID: Redundant Array of Workstation Disks

I/O Bottleneck:
– Microprocessor performance is improving by more than 50% per year.
– Disk access improvement is < 10%.
– Applications often perform I/O.

RAID cost per byte is high compared to single disks.

RAIDs are connected to host computers, which are often a performance and availability bottleneck.

RAID in software, writing data across an array of workstation disks, provides performance, and some degree of redundancy provides availability.
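A minimal sketch of the striping idea behind software RAID across workstation disks (the disk count and the round-robin layout are assumptions for illustration, not any particular system's format): a logical block number is mapped to a (disk, block-within-disk) pair so that consecutive blocks land on different workstations' disks and can be accessed in parallel.

/* Sketch: map a logical block to (disk, block-within-disk) for RAID-0 style
   striping across an array of workstation disks. Illustrative only. */
#include <stdio.h>

#define NDISKS 8   /* assumed number of workstation disks in the array */

struct stripe_loc { int disk; long block; };

static struct stripe_loc map_block(long logical_block)
{
    struct stripe_loc loc;
    loc.disk  = (int)(logical_block % NDISKS);   /* round-robin across disks */
    loc.block = logical_block / NDISKS;          /* position on that disk    */
    return loc;
}

int main(void)
{
    for (long lb = 0; lb < 10; lb++) {
        struct stripe_loc loc = map_block(lb);
        printf("logical block %ld -> disk %d, block %ld\n", lb, loc.disk, loc.block);
    }
    return 0;
}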

Page 61: High Performance Cluster Computing

63

Software RAID, Parallel File Systems, and Parallel I/O

Page 62: High Performance Cluster Computing

64

Enabling Technologies

Efficient communication hardware and software

Global co-ordination of multiple workstation Operating Systems

Page 63: High Performance Cluster Computing

65

Efficient Communication

The key Enabling Technology

Communication overhead components:
– bandwidth
– network latency
– processor overhead

Switched LANs allow bandwidth to scale

Network latency can be overlapped with computation

Processor overhead is the real problem - it consumes CPU cycles

Page 64: High Performance Cluster Computing

66

Efficient Communication (Contd...)

SS10 connected by Ethernet
– 456 µs processor overhead

With ATM
– 626 µs processor overhead

Target:
– MPP communication performance: low latency and scalable bandwidth
– CM5 user-level network overhead: 5.7 µs

Page 65: High Performance Cluster Computing

67

Efficient Communication (Contd...)

Constraints in clusters

– greater routing delay and less than complete reliability

– constraints on where the network connects into the node

– UNIX has a rigid device and scheduling interface

Page 66: High Performance Cluster Computing

68

Efficient Communication Approaches

Efficient Network Interface Hardware

Minimal Interface into the Operating System

– user must transmit directly into and receive from the network without OS intervention

– communication protection domains to be established by interface card and OS

– treat message loss as an infrequent case

Page 67: High Performance Cluster Computing

69

Cluster Computer and its Components

Page 68: High Performance Cluster Computing

70

Clustering Today

Clustering gained momentum when 3 technologies converged:

– 1. Very high performance microprocessors
  • workstation performance = yesterday's supercomputers
– 2. High speed communication
  • Communication between cluster nodes >= between processors in an SMP
– 3. Standard tools for parallel/distributed computing & their growing popularity

Page 69: High Performance Cluster Computing

71

Cluster Computer Architecture

Page 70: High Performance Cluster Computing

72

Cluster Components...1a

Nodes

Multiple High Performance Components:
– PCs
– Workstations
– SMPs (CLUMPS)
– Distributed HPC Systems leading to Metacomputing

They can be based on different architectures and run different OSes.

Page 71: High Performance Cluster Computing

73

Cluster Components...1b

Processors: there are many (CISC/RISC/VLIW/Vector..)
– Intel: Pentiums, Xeon, Merced…
– Sun: SPARC, UltraSPARC
– HP PA
– IBM RS6000/PowerPC
– SGI MIPS
– Digital Alphas

Integrate memory, processing and networking into a single chip
– IRAM (CPU & Mem): http://iram.cs.berkeley.edu
– Alpha 21364 (CPU, Memory Controller, NI)

Page 72: High Performance Cluster Computing

74

Cluster Components…2: OS

State of the art OS:
– Linux (Beowulf)
– Microsoft NT (Illinois HPVM)
– SUN Solaris (Berkeley NOW)
– IBM AIX (IBM SP2)
– HP UX (Illinois - PANDA)
– Mach (microkernel based OS) (CMU)
– Cluster Operating Systems: Solaris MC, SCO Unixware, MOSIX (academic project)
– OS gluing layers: Berkeley Glunix

Page 73: High Performance Cluster Computing

75

Cluster Components…3: High Performance Networks

Ethernet (10 Mbps)
Fast Ethernet (100 Mbps)
Gigabit Ethernet (1 Gbps)
SCI (Dolphin - MPI - 12 microsecond latency)
ATM
Myrinet (1.2 Gbps)
Digital Memory Channel
FDDI

Page 74: High Performance Cluster Computing

76

Cluster Components…4: Network Interfaces

Network Interface Card

– Myrinet has NIC

– User-level access support

– Alpha 21364 processor integrates processing, memory controller, network interface into a single chip..

Page 75: High Performance Cluster Computing

77

Cluster Components…5: Communication Software

Traditional OS-supported facilities (heavyweight due to protocol processing):
– Sockets (TCP/IP), Pipes, etc.

Lightweight protocols (user level):
– Active Messages (Berkeley)
– Fast Messages (Illinois)
– U-net (Cornell)
– XTP (Virginia)

Systems can be built on top of the above protocols.
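For contrast with the lightweight protocols above, a minimal sketch of the traditional heavyweight path in C: a plain TCP socket send, where every message passes through the kernel's protocol stack before reaching the NIC. The peer address and port are placeholders, not part of any system described here.

/* Sketch: traditional TCP/IP socket communication - every send traverses the
   OS protocol stack. Host and port are placeholders for illustration. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);            /* kernel-managed socket */
    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5000);                         /* placeholder port */
    inet_pton(AF_INET, "192.168.1.10", &peer.sin_addr);  /* placeholder node address */

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0) {
        const char *msg = "hello from a cluster node";
        write(fd, msg, strlen(msg));   /* data copied into the kernel, TCP/IP
                                          processing, then out via the NIC */
    }
    close(fd);
    return 0;
}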

Page 76: High Performance Cluster Computing

78

Cluster Components…6a: Cluster Middleware

Resides between OS and applications and offers an infrastructure for supporting:
– Single System Image (SSI)
– System Availability (SA)

SSI makes a collection appear as a single machine (globalised view of system resources), e.g. telnet cluster.myinstitute.edu

SA - checkpointing and process migration.

Page 77: High Performance Cluster Computing

79

Cluster Components…6b: Middleware Components

Hardware
– DEC Memory Channel, DSM (Alewife, DASH), SMP techniques

OS / Gluing Layers
– Solaris MC, Unixware, Glunix

Applications and Subsystems
– System management and electronic forms
– Runtime systems (software DSM, PFS, etc.)
– Resource management and scheduling (RMS):
  • CODINE, LSF, PBS, NQS, etc.

Page 78: High Performance Cluster Computing

80

Cluster Components…7a: Programming Environments

Threads (PCs, SMPs, NOW..)
– POSIX Threads
– Java Threads

MPI
– Linux, NT, on many Supercomputers

PVM

Software DSMs (Shmem)
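A minimal sketch of the message-passing style these environments support, using the standard MPI API (cluster-specific launch details are omitted): each process learns its rank and the total process count, and rank 1 sends a message that rank 0 receives.

/* Minimal MPI sketch: each process reports its rank; rank 0 also receives a
   message from rank 1 if more than one process is running.
   Compile with mpicc, launch with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("process %d of %d\n", rank, size);

    if (size > 1) {
        int token = 42;
        if (rank == 1)
            MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        else if (rank == 0) {
            MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank 1\n", token);
        }
    }
    MPI_Finalize();
    return 0;
}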

Page 79: High Performance Cluster Computing

81

Cluster Components…7b: Development Tools ?

Compilers
– C/C++/Java
– Parallel programming with C++ (MIT Press book)

RAD (rapid application development) tools
– GUI based tools for PP modeling

Debuggers
Performance Analysis Tools
Visualization Tools

Page 80: High Performance Cluster Computing

82

Cluster Components…8: Applications

Sequential

Parallel / Distributed (cluster-aware applications)
– Grand Challenge applications
  • Weather Forecasting
  • Quantum Chemistry
  • Molecular Biology Modeling
  • Engineering Analysis (CAD/CAM)
  • ……………….
– PDBs, web servers, data-mining

Page 81: High Performance Cluster Computing

83

Key Operational Benefits of Clustering

System availability (HA): clusters offer inherent high system availability due to the redundancy of hardware, operating systems, and applications.

Hardware fault tolerance: redundancy for most system components (e.g. disk RAID), including both hardware and software.

OS and application reliability: run multiple copies of the OS and applications, and tolerate failures through this redundancy.

Scalability: add servers to the cluster, add more clusters to the network as the need arises, or add CPUs to an SMP.

High performance: running cluster-enabled programs.

Page 82: High Performance Cluster Computing

84

Classification of Cluster Computers

Page 83: High Performance Cluster Computing

85

Clusters Classification..1

Based on Focus (in Market)

– High Performance (HP) Clusters
  • Grand Challenge Applications
– High Availability (HA) Clusters
  • Mission Critical applications

Page 84: High Performance Cluster Computing

86

HA Cluster: Server Cluster with "Heartbeat" Connection

Page 85: High Performance Cluster Computing

87

Clusters Classification..2

Based on Workstation/PC Ownership

– Dedicated Clusters

– Non-dedicated clusters
  • Adaptive parallel computing
  • Also called communal multiprocessing

Page 86: High Performance Cluster Computing

88

Clusters Classification..3

Based on Node Architecture..

– Clusters of PCs (CoPs)

– Clusters of Workstations (COWs)

– Clusters of SMPs (CLUMPs)

Page 87: High Performance Cluster Computing

89

Building Scalable Systems: Cluster of SMPs (Clumps)

Performance of SMP Systems Vs. Four-Processor Servers in a Cluster

Page 88: High Performance Cluster Computing

90

Clusters Classification..4

Based on Node OS Type..

– Linux Clusters (Beowulf)

– Solaris Clusters (Berkeley NOW)

– NT Clusters (HPVM)

– AIX Clusters (IBM SP2)

– SCO/Compaq Clusters (Unixware)

– …….Digital VMS Clusters, HP clusters, ………………..

Page 89: High Performance Cluster Computing

91

Clusters Classification..5

Based on node components architecture & configuration (Processor Arch, Node Type: PC/Workstation.. & OS: Linux/NT..):

– Homogeneous Clusters
  • All nodes have a similar configuration
– Heterogeneous Clusters
  • Nodes based on different processors and running different OSes

Page 90: High Performance Cluster Computing

92

Clusters Classification..6a

Dimensions of Scalability & Levels of Clustering

[Diagram: three dimensions of scalability - (1) platform: uniprocessor, SMP, cluster, MPP; (2) CPU / I/O / memory / OS; (3) network technology - and levels of clustering from workgroup, department, campus, and enterprise up to public metacomputing.]

Page 91: High Performance Cluster Computing

93

Clusters Classification..6b

Levels of Clustering

Group Clusters (#nodes: 2-99)
– a set of dedicated/non-dedicated computers, mainly connected by a SAN like Myrinet

Departmental Clusters (#nodes: 99-999)

Organizational Clusters (#nodes: many 100s), using ATM networks

Internet-wide Clusters = Global Clusters (#nodes: 1000s to many millions)
– Metacomputing
– Web-based Computing
– Agent Based Computing
  • Java plays a major role in web and agent based computing

Page 92: High Performance Cluster Computing

94

Cluster Middleware and Single System Image

Page 93: High Performance Cluster Computing

95

Contents

What is Middleware?
What is Single System Image?
Benefits of Single System Image
SSI Boundaries
SSI Levels
Relationship between Middleware Modules
Strategy for SSI via OS
Solaris MC: an example OS supporting SSI
Cluster Monitoring Software

Page 94: High Performance Cluster Computing

96

What is Cluster Middleware ?

An interface between user applications and the cluster hardware and OS platform.

Middleware packages support each other at the management, programming, and implementation levels.

Middleware Layers:

– SSI Layer

– Availability Layer: it enables the cluster services of
  • checkpointing, automatic failover, recovery from failure,
  • fault-tolerant operation among all cluster nodes.

Page 95: High Performance Cluster Computing

97

Middleware Design Goals

Complete Transparency
– Lets the user see a single cluster system
  • Single entry point, ftp, telnet, software loading...

Scalable Performance
– Easy growth of cluster
  • no change of API & automatic load distribution

Enhanced Availability
– Automatic recovery from failures
  • Employ checkpointing & fault tolerant technologies
– Handle consistency of data when replicated

Page 96: High Performance Cluster Computing

98

What is Single System Image (SSI) ?

A single system image is the illusion, created by software or hardware, that a collection of computing elements appear as a single computing resource.

SSI makes the cluster appear like a single machine to the user, to applications, and to the network.

A cluster without an SSI is not a cluster.

Page 97: High Performance Cluster Computing

99

Benefits of Single System Image

Usage of system resources transparently

Improved reliability and higher availability

Simplified system management

Reduction in the risk of operator errors

User need not be aware of the underlying system architecture to use these machines effectively

Page 98: High Performance Cluster Computing

100

SSI vs. Scalability (design space of competing architectures)

Page 99: High Performance Cluster Computing

101

Desired SSI Services

Single Entry Point
– telnet cluster.my_institute.edu
– telnet node1.cluster.institute.edu

Single File Hierarchy: xFS, AFS, Solaris MC Proxy

Single Control Point: management from a single GUI

Single virtual networking

Single memory space - DSM

Single Job Management: Glunix, Codine, LSF

Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may use Web technology

Page 100: High Performance Cluster Computing

102

Availability Support Functions

Single I/O Space (SIO):
– any node can access any peripheral or disk device without knowledge of its physical location.

Single Process Space (SPS):
– any process on any node can create processes with cluster-wide process IDs, and they communicate through signals, pipes, etc., as if they were on a single node.

Checkpointing and Process Migration:
– saves the process state and intermediate results in memory or on disk to support rollback recovery when a node fails; process migration (PM) supports load balancing.
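A minimal sketch of the checkpointing idea (not any particular SSI system's API; the state structure and file name are assumptions for illustration): the application periodically writes its state to disk so that, after a node failure, it can resume from the last checkpoint instead of starting over.

/* Sketch: application-level checkpointing for rollback recovery.
   The state structure and file name are illustrative assumptions. */
#include <stdio.h>

struct app_state { long iteration; double partial_result; };

static void checkpoint(const struct app_state *s)
{
    FILE *f = fopen("checkpoint.dat", "wb");
    if (f) { fwrite(s, sizeof(*s), 1, f); fclose(f); }
}

static int restore(struct app_state *s)
{
    FILE *f = fopen("checkpoint.dat", "rb");
    if (!f) return 0;
    int ok = fread(s, sizeof(*s), 1, f) == 1;
    fclose(f);
    return ok;
}

int main(void)
{
    struct app_state s = { 0, 0.0 };
    restore(&s);                              /* resume from last checkpoint, if any */
    for (; s.iteration < 1000000; s.iteration++) {
        s.partial_result += 1.0 / (s.iteration + 1);
        if (s.iteration % 100000 == 0)
            checkpoint(&s);                   /* periodically save state to disk */
    }
    printf("result = %f\n", s.partial_result);
    return 0;
}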


Page 101: High Performance Cluster Computing

103

SSI Levels

It is a computer science notion of levels of abstraction (a house is at a higher level of abstraction than walls, ceilings, and floors).

Application and Subsystem Level

Operating System Kernel Level

Hardware Level

Page 102: High Performance Cluster Computing

104

SSI at Application and Subsystem Level

Level | Examples | Boundary | Importance
application | cluster batch system, system management | an application | what a user wants
subsystem | distributed DB, OSF DME, Lotus Notes, MPI, PVM | a subsystem | SSI for all applications of the subsystem
file system | Sun NFS, OSF DFS, NetWare, and so on | shared portion of the file system | implicitly supports many applications and subsystems
toolkit | OSF DCE, Sun ONC+, Apollo Domain | explicit toolkit facilities: user, service name, time | best level of support for heterogeneous system

(c) In Search of Clusters

Page 103: High Performance Cluster Computing

105

SSI at Operating System Kernel Level

Level | Examples | Boundary | Importance
kernel/OS layer | Solaris MC, Unixware, MOSIX, Sprite, Amoeba/GLUnix | each name space: files, processes, pipes, devices, etc. | kernel support for applications, adm subsystems
kernel interfaces | UNIX (Sun) vnode, Locus (IBM) vproc | type of kernel objects: files, processes, etc. | modularizes SSI code within kernel
virtual memory | none supporting operating system kernel | each distributed virtual memory space | may simplify implementation of kernel objects
microkernel | Mach, PARAS, Chorus, OSF/1 AD, Amoeba | each service outside the microkernel | implicit SSI for all system services

(c) In Search of Clusters

Page 104: High Performance Cluster Computing

106

SSI at Hardware Level

Level | Examples | Boundary | Importance
memory | SCI, DASH | memory space | better communication and synchronization
memory and I/O | SCI, SMP techniques | memory and I/O device space | lower overhead cluster I/O

(c) In Search of Clusters

Page 105: High Performance Cluster Computing

107

SSI Characteristics

1. Every SSI has a boundary.

2. Single system support can exist at different levels within a system, one able to be built on another.

Page 106: High Performance Cluster Computing

108

SSI Boundaries - an application's SSI boundary

[Diagram: a batch system and its SSI boundary.]

(c) In Search of Clusters

Page 107: High Performance Cluster Computing

109

Relationship Among Middleware Modules

Page 108: High Performance Cluster Computing

110

PARMON: A Cluster Monitoring Tool

[Diagram: PARMON clients (parmon, running on a JVM) communicate over a high-speed switch with PARMON servers (parmond) on each Solaris node.]

Page 109: High Performance Cluster Computing

111

Motivations

Monitoring such huge systems is a tedious and challenging task since typical workstations are designed to work as a standalone system, rather than a part of workstation clusters.

System administrators require tools to effectively monitor such huge systems. PARMON provides the solution to this challenging problem.

Page 110: High Performance Cluster Computing

112

PARMON - Salient Features

Allows monitoring of system activities at the component, node, group, or entire cluster level.

Monitoring of system components:
– CPU, Memory, Disk and Network

Allows monitoring of multiple instances of the same component.

PARMON provides a GUI interface for initiating activities/requests and presents results graphically.

Page 111: High Performance Cluster Computing

113

Resource Utilization at a Glance

Page 112: High Performance Cluster Computing

114

CPU Usage Monitoring

Page 113: High Performance Cluster Computing

115

Memory Usage monitoring

Page 114: High Performance Cluster Computing

116

Kernel Data Catalog - CPU

Page 115: High Performance Cluster Computing

117

Strategy for SSI via OS

1. Build as a layer on top of the existing OS. (eg. Glunix)

– Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time.

– i.e. new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath. Eg: Glunix/Solaris-MC

2. Build SSI at kernel level, True Cluster OS

– Good, but can't leverage OS improvements made by the vendor

– E.g. Unixware and Mosix (built using BSD Unix)

Page 116: High Performance Cluster Computing

118

Cluster Computing - Research Projects

Beowulf (CalTech and NASA) - USA
CCS (Computing Centre Software) - Paderborn, Germany
Condor - University of Wisconsin-Madison, USA
DJM (Distributed Job Manager) - Minnesota Supercomputing Center
DQS (Distributed Queuing System) - Florida State University, USA
EASY - Argonne National Lab, USA
HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA
far - University of Liverpool, UK
Gardens - Queensland University of Technology, Australia
Generic NQS (Network Queuing System) - University of Sheffield, UK
NOW (Network of Workstations) - Berkeley, USA
NIMROD - Monash University, Australia
PBS (Portable Batch System) - NASA Ames and LLNL, USA
PRM (Prospero Resource Manager) - Uni. of S. California, USA
QBATCH - Vita Services Ltd., USA

Page 117: High Performance Cluster Computing

119

Cluster Computing - Commercial Software

Codine (Computing in Distributed Network Environment) - GENIAS GmbH, Germany
LoadLeveler - IBM Corp., USA
LSF (Load Sharing Facility) - Platform Computing, Canada
NQE (Network Queuing Environment) - Craysoft Corp., USA
OpenFrame - Centre for Development of Advanced Computing, India
RWPC (Real World Computing Partnership) - Japan
Unixware (SCO - Santa Cruz Operation) - USA
Solaris-MC (Sun Microsystems) - USA

Page 118: High Performance Cluster Computing

120

Representative Cluster Systems

1. Solaris-MC
2. Berkeley NOW
3. Their comparison with Beowulf & HPVM

Page 119: High Performance Cluster Computing

121

Next Generation Distributed Computing:

The Solaris MC Operating System

Page 120: High Performance Cluster Computing

122

Why new software?

Without software, a cluster is:
– just a network of machines
– requires specialized applications
– hard to administer

With a cluster operating system:
– the cluster becomes a scalable, modular computer
– users and administrators see a single large machine
– runs existing applications
– easy to administer

New software makes the cluster better for the customer.

Page 121: High Performance Cluster Computing

123

Cluster computing and Solaris MC

Goal: use computer clusters for general-purpose computing

Support existing customers and applications

Solution: Solaris MC (Multi Computer) operating system

A distributed operating system (OS) for multi-computers

Page 122: High Performance Cluster Computing

124

What is the Solaris MC OS ?

Solaris MC extends standard Solaris

Solaris MC makes the cluster look like a single machine

Global file system
Global process management
Global networking

Solaris MC runs existing applications unchanged
Supports the Solaris ABI (application binary interface)

Page 123: High Performance Cluster Computing

125

Applications

Ideal for:
– Web and interactive servers
– Databases
– File servers
– Timesharing

Benefits for vendors and customers:
– Preserves investment in existing applications
– Modular servers with low entry-point price and low cost of ownership
– Easier system administration
– Solaris could become a preferred platform for clustered systems

Page 124: High Performance Cluster Computing

126

Solaris MC is a running research system

Designed, built and demonstrated Solaris MC prototype
– Cluster of SPARCstations connected with a Myrinet network
– Runs unmodified commercial parallel database, scalable Web server, parallel make

Next: Solaris MC Phase II
– High availability
– New I/O work to take advantage of clusters
– Performance evaluation

Page 125: High Performance Cluster Computing

127

Advantages of Solaris MC

Leverages continuing investment in Solaris
– Same applications: binary-compatible
– Same kernel, device drivers, etc.
– As portable as base Solaris - will run on SPARC, x86, PowerPC

State of the art distributed systems techniques
– High availability designed into the system
– Powerful distributed object-oriented framework

Ease of administration and use
– Looks like a familiar multiprocessor server to users, system administrators, and applications

Page 126: High Performance Cluster Computing

128

Solaris MC details

Solaris MC is a set of C++ loadable modules on top of Solaris
– Very few changes to the existing kernel

A private Solaris kernel per node provides reliability

Object-oriented system with well-defined interfaces

Page 127: High Performance Cluster Computing

129

Key components of Solaris-MC providing SSI:

– global file system
– globalized process management
– globalized networking and I/O

Solaris MC Architecture

[Diagram: applications sit above the system call interface; the Solaris MC modules (file system, processes, network), written in C++ on an object framework, run on top of the existing Solaris 2.5 kernel and communicate with other nodes via object invocations.]

Page 128: High Performance Cluster Computing

130

Solaris MC components

Object and communication support

High availability support

PXFS global distributed file system

Process management

Networking

[Diagram: Solaris MC architecture, as on the previous slide.]

Page 129: High Performance Cluster Computing

131

Object Orientation

Better software maintenance, change, and evolution
– Well-defined interfaces
– Separate implementation from interface
– Interface inheritance

Solaris MC uses:
– IDL: a better way to define interfaces
– CORBA object model: a better RPC (Remote Procedure Call)
– C++: a better C

Page 130: High Performance Cluster Computing

132

Object and Communication Framework

Mechanism for nodes and modules to communicate
– Inter-node and intra-node interprocess communication
– Optimized protocols for the trusted computing base
– Efficient, low-latency communication primitives

Object communication independent of interconnect
– We use Ethernet, Fast Ethernet, FibreChannel, Myrinet
– Allows interconnect hardware to be upgraded

Page 131: High Performance Cluster Computing

133

High Availability Support

Node failure doesn't crash the entire system
– Unaffected nodes continue running
– Better than an SMP
– A requirement for the mission critical market

Well-defined failure boundaries
– Separate kernel per node - OS does not use shared memory

Object framework provides support
– Delivers failure notifications to servers and clients
– Group membership protocol detects node failures

Each subsystem is responsible for its own recovery
– Filesystem, process management, networking, applications

Page 132: High Performance Cluster Computing

134

PXFS: Global Filesystem

Single-system image of the file system

Backbone of Solaris MC

Coherent access and caching of files and directories
– Caching provides high performance

Access to I/O devices

Page 133: High Performance Cluster Computing

135

PXFS: An object-oriented VFS

PXFS builds on existing Solaris file systems
– Uses the vnode/virtual file system (VFS) interface externally
– Uses object communication internally

Page 134: High Performance Cluster Computing

136

Process management

Provides a global view of processes on any node
– Users, administrators, and applications see the global view
– Supports existing applications

Uniform support for local and remote processes
– Process creation/waiting/exiting (including remote execution)
– Global process identifiers, groups, sessions
– Signal handling
– procfs (/proc)

Page 135: High Performance Cluster Computing

137

Process management benefits

Global process management helps users and administrators

Users see familiar single machine process model

Can run programs on any node

Location of process in the cluster doesn’t matter

Use existing commands and tools: unmodified ps, kill, etc.

Page 136: High Performance Cluster Computing

138

Networking goals

Cluster appears externally as a single SMP server
– Familiar to customers
– Access the cluster through a single network address
– Multiple network interfaces supported but not required

Scalable design
– Protocol and network application processing on any node
– Parallelism provides high server performance

Page 137: High Performance Cluster Computing

139

Networking: Implementation

A programmable “packet filter”
– Packets routed between the network device and the correct node
– Efficient, scalable, and supports parallelism
– Supports multiple protocols with existing protocol stacks

Parallelism of protocol processing and applications
– Incoming connections are load-balanced across the cluster
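A minimal sketch of the dispatch decision such a packet filter makes (not the actual Solaris MC code; node count and field names are assumptions): hash the connection identifiers so that all packets of one TCP connection go to the same node, while different connections are spread across the cluster.

/* Sketch: choose the node that should process a packet by hashing its
   connection 4-tuple, so each connection sticks to one node while
   different connections are spread across the cluster. Illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define NNODES 4   /* assumed cluster size */

struct conn_id { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };

static int pick_node(const struct conn_id *c)
{
    uint32_t h = c->src_ip ^ c->dst_ip ^ ((uint32_t)c->src_port << 16) ^ c->dst_port;
    h *= 2654435761u;               /* simple multiplicative mix */
    return (int)(h % NNODES);
}

int main(void)
{
    struct conn_id c = { 0x0A000001, 0x0A000002, 40000, 80 };
    printf("connection handled by node %d\n", pick_node(&c));
    return 0;
}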

Page 138: High Performance Cluster Computing

140

Status

4 node, 8 CPU prototype with Myrinet demonstrated:
– Object and communication infrastructure
– Global file system (PXFS) with coherency and caching
– Networking: TCP/IP with load balancing
– Global process management (ps, kill, exec, wait, rfork, /proc)
– Monitoring tools
– Cluster membership protocols

Demonstrated applications:
– Commercial parallel database
– Scalable Web server
– Parallel make
– Timesharing

The Solaris-MC team is working on high availability.

Page 139: High Performance Cluster Computing

141

Summary of Solaris MC

Clusters are likely to be an important market

Solaris MC preserves customer investment in Solaris
– Uses existing Solaris applications
– Familiar to customers
– Looks like a multiprocessor, not a special cluster architecture
– Ease of administration and use

Clusters are ideal for important applications
– Web server, file server, databases, interactive services

State-of-the-art object-oriented distributed implementation

Designed for future growth

Page 140: High Performance Cluster Computing

142

Berkeley NOW Project

Page 141: High Performance Cluster Computing

143

NOW @ Berkeley

Design & implementation of higher-level system:
– Global OS (Glunix)
– Parallel File Systems (xFS)
– Fast Communication (HW for Active Messages)
– Application Support

Overcoming technology shortcomings:
– Fault tolerance
– System management

NOW Goal: Faster for Parallel AND Sequential

Page 142: High Performance Cluster Computing

144

NOW Software Components

[Diagram: Unix (Solaris) workstations, each with an Active Messages L.C.P. and a VN segment driver, connected by a Myrinet scalable interconnect; Global Layer Unix (with name server and scheduler) and communication layers (Active Messages; Sockets, Split-C, MPI, HPF, vSM) support large sequential and parallel applications.]

Page 143: High Performance Cluster Computing

145

Active Messages: Lightweight Communication Protocol

Key Idea: a Network Process ID is attached to every message, which the HW checks upon receipt
– Net PID match: as fast as before
– Net PID mismatch: interrupt and invoke the OS

Can mix LAN messages and MPP messages; invoke OS & TCP/IP only when not cooperating (if everyone uses the same physical layer format)

Page 144: High Performance Cluster Computing

146

MPP Active Messages

Key Idea: associate a small user-level handler directly with each message

Sender injects the message directly into the network

Handler executes immediately upon arrival

Pulls the message out of the network and integrates it into the ongoing computation, or replies

No buffering (beyond transport), no parsing, no allocation, primitive scheduling

Page 145: High Performance Cluster Computing

147

Active Message Model

Every message contains in its header the address of a user-level handler, which is executed immediately at user level

No receive side buffering of messages

Supports protected multiprogramming of a large number of users onto finite physical network resource

Active message operations, communication events and threads are integrated in a simple and cohesive model

Provides naming and protection
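A minimal sketch of the active message idea in C (the message layout and handler table are illustrative, not the Berkeley AM API): each message names its handler and carries a small payload, and the receive path simply invokes that handler on arrival, with no intermediate buffering or parsing.

/* Sketch of the active message model: a message names its handler, and the
   receive path just calls that handler on the payload. Layout is illustrative. */
#include <stdio.h>
#include <stdint.h>

typedef void (*am_handler)(uint32_t arg);

struct am_msg { uint16_t handler_idx; uint32_t arg; };

static void add_to_sum(uint32_t arg) { printf("integrating %u into computation\n", arg); }
static void send_reply(uint32_t arg) { printf("replying to request %u\n", arg); }

static am_handler handler_table[] = { add_to_sum, send_reply };

/* Called by the network layer when a message arrives: no buffering beyond
   transport, no parsing - just dispatch to the named user-level handler. */
static void am_deliver(const struct am_msg *m)
{
    handler_table[m->handler_idx](m->arg);
}

int main(void)
{
    struct am_msg m1 = { 0, 17 };   /* "add_to_sum(17)" arriving from the network */
    struct am_msg m2 = { 1, 42 };   /* "send_reply(42)" */
    am_deliver(&m1);
    am_deliver(&m2);
    return 0;
}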

Page 146: High Performance Cluster Computing

148

Active Message Model (Contd..)

[Diagram: an active message carries a handler PC and data across the network; on arrival the handler runs immediately, moving the data into the receiver's data structures alongside the primary computation.]

Page 147: High Performance Cluster Computing

149

xFS: File System for NOW

Serverless File System: all data with clients
– Uses MP cache coherency to reduce traffic
– Files striped for parallel transfer
– Large file cache (“cooperative caching” - Network RAM)

               Miss Rate   Response Time
Client/Server     10%         1.8 ms
xFS                4%         1.0 ms

(42 WS, 32 MB/WS, 512 MB/server, 8 KB/access)

Page 148: High Performance Cluster Computing

150

Glunix: Gluing Unix

It is built on top of Solaris; it glues together Solaris running on the cluster nodes.

Supports transparent remote execution and load balancing, and allows existing applications to run.

Provides a globalized view of system resources, like Solaris MC.

Gang-schedules parallel jobs, so the cluster is as good as a dedicated MPP for parallel jobs.

Page 149: High Performance Cluster Computing

151

3 Paths for Applications on NOW?

Revolutionary (MPP style): write new programs from scratch using MPP languages, compilers, libraries, …

Porting: port programs from mainframes, supercomputers, MPPs, …

Evolutionary: take a sequential program & use
1) Network RAM: first use the memory of many computers to reduce disk accesses; if not fast enough, then:
2) Parallel I/O: use many disks in parallel for accesses not in the file cache; if not fast enough, then:
3) Parallel program: change the program until it sees enough processors that it is fast
=> Large speedup without a fine-grain parallel program

Page 150: High Performance Cluster Computing

152

Comparison of 4 Cluster Systems

Page 151: High Performance Cluster Computing

153

Pointers to Literature on Cluster Computing

Page 152: High Performance Cluster Computing

154

Reading Resources..1a: Internet & WWW

– Computer Architecture:
  • http://www.cs.wisc.edu/~arch/www/
– PFS & Parallel I/O:
  • http://www.cs.dartmouth.edu/pario/
– Linux Parallel Processing:
  • http://yara.ecn.purdue.edu/~pplinux/Sites/
– DSMs:
  • http://www.cs.umd.edu/~keleher/dsm.html

Page 153: High Performance Cluster Computing

155

Reading Resources..1b: Internet & WWW

– Solaris-MC:
  • http://www.sunlabs.com/research/solaris-mc
– Microprocessors: Recent Advances:
  • http://www.microprocessor.sscc.ru
– Beowulf:
  • http://www.beowulf.org
– Metacomputing:
  • http://www.sis.port.ac.uk/~mab/Metacomputing/

Page 154: High Performance Cluster Computing

156

Reading Resources..2: Books

– In Search of Clusters
  • by G. Pfister, Prentice Hall (2nd ed.), 1998
– High Performance Cluster Computing
  • Volume 1: Architectures and Systems
  • Volume 2: Programming and Applications
  • Edited by Rajkumar Buyya, Prentice Hall, NJ, USA
– Scalable Parallel Computing
  • by K. Hwang & Z. Xu, McGraw Hill, 1998

Page 155: High Performance Cluster Computing

157

Reading Resources..3: Journals

– A Case for NOW, IEEE Micro, Feb '95
  • by Anderson, Culler, Patterson
– Fault Tolerant COW with SSI, IEEE Concurrency (to appear)
  • by Kai Hwang, Chow, Wang, Jin, Xu
– Cluster Computing: The Commodity Supercomputing, Journal of Software Practice and Experience (available from my web page)
  • by Mark Baker & Rajkumar Buyya

Page 156: High Performance Cluster Computing

158

Clusters Revisited

Page 157: High Performance Cluster Computing

159

Summary

We have discussed clusters:

Enabling Technologies
Architecture & its Components
Classifications
Middleware
Single System Image
Representative Systems

Page 158: High Performance Cluster Computing

160

Conclusions

Clusters are promising..

– They solve the parallel processing paradox.
– They offer incremental growth and match funding patterns.
– New trends in hardware and software technologies are likely to make clusters even more promising, so that cluster-based supercomputers can be seen everywhere!

Page 159: High Performance Cluster Computing

161

Breaking High Performance Computing Barriers

[Chart: GFLOPS achievable by a single processor, shared memory system, local parallel cluster, and global parallel cluster.]

Page 160: High Performance Cluster Computing

162

Well, Read my book for….

http://www.dgs.monash.edu.au/~rajkumar/cluster/

Thank You ...


?
