High Performance Cluster Computing
1
By: Rajkumar Buyya, Monash University, Melbourne.
[email protected] http://www.dgs.monash.edu.au/~rajkumar
High Performance Cluster Computing
2
Objectives
Learn and Share Recent advances in cluster computing (both in research and commercial settings):
– Architecture
– System Software
– Programming Environments and Tools
– Applications
3
Agenda
Overview of Computing
Motivations & Enabling Technologies
Cluster Architecture & its Components
Clusters Classifications
Cluster Middleware
Single System Image
Representative Cluster Systems
Berkeley NOW and Solaris-MC
Resources and Conclusions
4
Computing Elements
[Figure: the computing elements as layers - hardware (a multi-processor computing system with processors P), the microkernel/operating system with a threads interface, processes and threads, applications, and programming paradigms on top.]
5
Two Eras of Computing
[Figure: the sequential era and the parallel era, each progressing through Architectures, System Software, Applications, and Problem Solving Environments (P.S.Es), and each moving from R & D to commercialization to commodity over the 1940-2030 timeline.]
6
Announcement: formation of
IEEE Task Force on Cluster Computing
(TFCC)
http://www.dgs.monash.edu.au/~rajkumar/tfcc/
http://www.dcs.port.ac.uk/~mab/tfcc/
7
TFCC Activities...
Network Technologies
OS Technologies
Parallel I/O
Programming Environments
Java Technologies
Algorithms and Applications
Analysis and Profiling
Storage Technologies
High Throughput Computing
8
TFCC Activities...
High Availability
Single System Image
Performance Evaluation
Software Engineering
Education
Newsletter
Industrial Wing
– All the above have their own pages, see pointers from: http://www.dgs.monash.edu.au/~rajkumar/tfcc/
9
TFCC Activities...
Mailing list, Workshops, Conferences, Tutorials, Web-resources etc.
Resources for introducing subject in senior undergraduate and graduate levels.
Tutorials/Workshops at IEEE Chapters... and so on.
Visit TFCC Page for more details:
– http://www.dgs.monash.edu.au/~rajkumar/tfcc/ periodically (updated daily!).
10
Computing Power and Computer Architectures
11
Need for more Computing Power:
Grand Challenge Applications
Solving technology problems using computer modeling, simulation and analysis:
Life Sciences
Mechanical Design & Analysis (CAD/CAM)
Aerospace
Geographic Information Systems
12
How to Run App. Faster ?
There are 3 ways to improve performance:
– 1. Work Harder
– 2. Work Smarter
– 3. Get Help
Computer Analogy
– 1. Use faster hardware: e.g. reduce the time per instruction (clock cycle).
– 2. Optimized algorithms and techniques.
– 3. Multiple computers to solve the problem: that is, increase the number of instructions executed per clock cycle.
13
Sequential Architecture Limitations
Sequential architectures are reaching physical limitations (speed of light, thermodynamics).
Hardware improvements like pipelining, superscalar execution, etc., are non-scalable and require sophisticated compiler technology.
Vector Processing works well for certain kind of problems.
14
Computational Power Improvement
[Figure: C.P.I. (computational power improvement) versus number of processors - a multiprocessor keeps improving as processors are added, while a uniprocessor stays flat.]
15
Human Physical Growth Analogy: Computational Power Improvement
[Figure: growth versus age - vertical growth dominates early and horizontal growth later, as an analogy for single-processor versus multiprocessor improvement.]
16
The technology of parallel processing (PP) is mature and can be exploited commercially; there is significant R & D work on the development of tools & environments.
Significant developments in networking technology are paving the way for heterogeneous computing.
Why Parallel Processing NOW?
17
History of Parallel Processing
PP can be traced back to a tablet dated around 100 BC.
The tablet has 3 calculating positions; we can infer that multiple positions were used for reliability and/or speed.
18
The aggregate speed with which complex calculations are carried out by millions of neurons in the human brain is amazing, although an individual neuron's response is slow (milliseconds) - this demonstrates the feasibility of PP.
Motivating Factors
19
Simple classification by Flynn: (No. of instruction and data streams)
SISD - conventional
SIMD - data parallel, vector computing
MISD - systolic arrays
MIMD - very general, multiple approaches
Current focus is on MIMD model, using general purpose processors or multicomputers.
Taxonomy of Architectures
20
SISD : A Conventional Computer
Speed is limited by the rate at which computer can transfer information internally.
[Figure: a single processor transforms a data input stream into a data output stream under a single instruction stream.]
Ex: PC, Macintosh, Workstations
21
The MISD Architecture
More of an intellectual exercise than a practical configuration. Few built, but commercially not available
[Figure: processors A, B and C each receive their own instruction stream (A, B, C) but operate on the same data input stream, producing a data output stream.]
22
SIMD Architecture
Ex: Cray vector processing machines, Thinking Machines CM*
Ci <= Ai * Bi
[Figure: one instruction stream drives processors A, B and C, each with its own data input stream (A, B, C) and its own data output stream. A data-parallel sketch of this idea follows below.]
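A minimal sketch in plain C of the data-parallel idea on this slide (Ci <= Ai * Bi): one operation applied across all data elements. The array names and size are illustrative, not from the slides.

/* Minimal sketch of the SIMD idea (Ci = Ai * Bi): one operation, many data. */
#include <stdio.h>

#define N 8

int main(void)
{
    double A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2.0 * i; }

    /* On a SIMD/vector machine this loop maps to a single vector
       multiply issued to all processing elements in lockstep. */
    for (int i = 0; i < N; i++)
        C[i] = A[i] * B[i];

    for (int i = 0; i < N; i++)
        printf("C[%d] = %.1f\n", i, C[i]);
    return 0;
}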
23
Unlike SISD and MISD machines, an MIMD computer works asynchronously.
Shared memory (tightly coupled) MIMD
Distributed memory (loosely coupled) MIMD
MIMD Architecture
[Figure: processors A, B and C each receive their own instruction stream (A, B, C) and their own data input stream, producing separate data output streams.]
26
Main HPC Architectures..1a
SISD - mainframes, workstations, PCs.
SIMD Shared Memory - Vector machines, Cray...
MIMD Shared Memory - Sequent, KSR, Tera, SGI, SUN.
SIMD Distributed Memory - DAP, TMC CM-2...
MIMD Distributed Memory - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation clusters (IBM SP2, DEC, Sun, HP).
27
Main HPC Architectures..1b.
NOTE: Modern sequential machines are not purely SISD - advanced RISC processors use many concepts from vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc.) in order to achieve one or more arithmetic operations per clock cycle.
28
Parallel Processing Paradox
Time required to develop a parallel application for solving GCA is equal to:
– Half Life of Parallel Supercomputers.
29
The Need for Alternative
Supercomputing Resources
Vast numbers of under utilised workstations available to use.
Huge numbers of unused processor cycles and resources that could be put to good use in a wide variety of applications areas.
Reluctance to buy Supercomputer due to their cost and short life span.
Distributed compute resources “fit” better into today's funding model.
30
Scalable Parallel Computers
31
Design Space of Competing Computer Architectures
32
Towards Inexpensive Supercomputing
It is:
Cluster Computing..The Commodity Supercomputing!
33
Motivation for using Clusters
Surveys show utilisation of CPU cycles of desktop workstations is typically <10%.
Performance of workstations and PCs is rapidly improving
As performance grows, percent utilisation will decrease even further!
Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.
34
Motivation for using Clusters
The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.
Workstation clusters are easier to integrate into existing networks than special parallel computers.
35
Motivation for using Clusters
The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.
Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.
Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!
36
Cycle Stealing
Usually a workstation will be owned by an individual, group, department, or organisation - it is dedicated to the exclusive use of its owners.
This brings problems when attempting to form a cluster of workstations for running distributed applications.
37
Cycle Stealing
Typically, there are three types of owners, who use their workstations mostly for:
1. Sending and receiving email and preparing documents.
2. Software development - edit, compile, debug and test cycle.
3. Running compute-intensive applications.
38
Cycle Stealing
Cluster computing aims to steal spare cycles from (1) and (2) to provide resources for (3).
However, this requires overcoming the ownership hurdle - people are very protective of their workstations.
Usually requires organisational mandate that computers are to be used in this way.
39
Cycle Stealing
Stealing cycles outside standard work hours (e.g. overnight) is easy; stealing idle cycles during work hours without impacting interactive use (of both CPU and memory) is much harder.
40
Rise & Fall of Computing
Technologies
1970: Mainframes -> Minis
1980: Minis -> PCs
1995: PCs -> Network Computing
41
Original Food Chain Picture
42
1984 Computer Food Chain
Mainframe
Vector Supercomputer
Mini Computer
Workstation
PC
43
1994 Computer Food Chain
Mainframe
Vector Supercomputer
MPP
Workstation
PC
Mini Computer (hitting wall soon)
(future is bleak)
44
Computer Food Chain (Now and Future)
45
What is a cluster?
Cluster:
– a collection of nodes connected together
– Network: faster, closer connection than a typical network (LAN)
– Looser connection than a symmetric multiprocessor (SMP)
46
1990s Building Blocks
There is no “near commodity” component.
Building block = complete computers (HW & SW) shipped in 100,000s: Killer micro, Killer DRAM, Killer disk, Killer OS, Killer packaging, Killer investment.
Leverage billion $ per year investment.
Interconnecting Building Blocks => Killer Net: high bandwidth, low latency, reliable, commodity (ATM?)
47
Why Clusters now? (Beyond Technology and Cost)
Building block is big enough (vs. Intel 8086)
Workstation performance is doubling every 18 months.
Networks are faster
Higher link bandwidth (vs. 10 Mbit Ethernet)
Switch based networks coming (ATM)
Interfaces simple & fast (Active Msgs)
Striped files preferred (RAID)
Demise of Mainframes, Supercomputers, & MPPs
48
Architectural Drivers…(cont)
Node architecture dominates performance
– processor, cache, bus, and memory
– design and engineering $ => performance
Greatest demand for performance is on large systems
– must track the leading edge of technology without lag
MPP network technology => mainstream
– system area networks
System on every node is a powerful enabler
– very high speed I/O, virtual memory, scheduling, …
49
...Architectural Drivers
Clusters can be grown: incremental scalability (up, down, and across)
– Individual node performance can be improved by adding additional resources (new memory blocks/disks)
– New nodes can be added or nodes can be removed
– Clusters of Clusters and Metacomputing
Complete software tools
– Threads, PVM, MPI, DSM, C, C++, Java, Parallel C++, Compilers, Debuggers, OS, etc.
Wide class of applications
– Sequential and grand challenging parallel applications
50
Example Clusters:Berkeley NOW
100 Sun UltraSparcs
– 200 disks
Myrinet SAN
– 160 MB/s
Fast comm.
– AM, MPI, ...
Ether/ATM switched external net
Global OS
Self Config
51
Basic Components
[Figure: a NOW node - a Sun Ultra 170 workstation with processor (P), cache ($), and memory (M) on the memory bus, plus a Myricom NIC on the I/O bus connecting to the Myrinet at 160 MB/s.]
52
Massive Cheap Storage Cluster
Basic unit:
2 PCs double-ending four SCSI chains of 8 disks each
Currently serving Fine Art at http://www.thinker.org/imagebase/
53
Cluster of SMPs (CLUMPS)
Four Sun E5000s
– 8 processors each
– 4 Myricom NICs each
Multiprocessor, Multi-NIC, Multi-Protocol
NPACI => Sun 450s
54
Millennium PC Clumps
Inexpensive, easy to manage Cluster
Replicated in many departments
Prototype for very large PC cluster
55
Adoption of the Approach
56
So What’s So Different?
Commodity parts?
Communications packaging?
Incremental scalability?
Independent failure?
Intelligent network interfaces?
Complete system on every node
– virtual memory
– scheduler
– files
– ...
57
OPPORTUNITIES &
CHALLENGES
58
Opportunity of Large-scale Computing on NOW
Shared Pool of Computing Resources: Processors, Memory, Disks, Interconnect
Guarantee at least one workstation to many individuals (when active)
Deliver a large % of the collective resources to a few individuals at any one time
59
Windows of Opportunities
MPP/DSM:
– Compute across multiple systems: parallel.
Network RAM:
– Idle memory in other nodes. Page across other nodes' idle memory.
Software RAID:
– file system supporting parallel I/O and reliability, mass storage.
Multi-path Communication:
– Communicate across multiple networks: Ethernet, ATM, Myrinet
60
Parallel Processing
Scalable Parallel Applications require
– good floating-point performance
– low overhead communication
– scalable network bandwidth
– parallel file system
61
Network RAM
Performance gap between processor and disk has widened.
Thrashing to disk degrades performance significantly
Paging across networks can be effective with high performance networks and OS that recognizes idle machines
Typically thrashing to network RAM can be 5 to 10 times faster than thrashing to disk
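A back-of-the-envelope sketch of why paging to idle remote memory beats paging to disk; every parameter value below is an assumption chosen only to illustrate the 5-10x claim above, not a measurement from the slides.

/* Rough comparison of paging to disk vs. paging to idle remote memory.
   All parameter values are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double page_bytes    = 8192.0;    /* 8 KB page                        */
    double disk_access_s = 10e-3;     /* seek + rotation + transfer       */
    double net_latency_s = 0.5e-3;    /* round-trip on a fast LAN         */
    double net_bw_Bps    = 12.5e6;    /* ~100 Mbit/s link                 */

    double t_disk = disk_access_s;
    double t_net  = net_latency_s + page_bytes / net_bw_Bps;

    printf("disk page-in  : %.2f ms\n", t_disk * 1e3);
    printf("remote page-in: %.2f ms (about %.1fx faster)\n",
           t_net * 1e3, t_disk / t_net);
    return 0;
}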
62
Software RAID: Redundant Array of Workstation Disks
I/O Bottleneck:
– Microprocessor performance is improving by more than 50% per year.
– Disk access improvement is < 10% per year.
– Applications often perform I/O.
RAID cost per byte is high compared to single disks.
RAIDs are connected to host computers, which are often a performance and availability bottleneck.
RAID in software - writing data across an array of workstation disks - provides performance, and some degree of redundancy provides availability. (A striping sketch follows below.)
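The striping sketch referred to above: a minimal C example of the software-RAID idea of spreading logical blocks round-robin across workstation disks so transfers can proceed in parallel. The disk count and mapping are illustrative assumptions, not any particular system's layout.

/* Sketch of RAID-0 style striping across workstation disks: map a logical
   block number to (disk, block-within-disk). Values are illustrative. */
#include <stdio.h>

#define NUM_DISKS 4   /* workstation disks in the array */

typedef struct { int disk; long block; } stripe_loc;

static stripe_loc map_block(long logical_block)
{
    stripe_loc loc;
    loc.disk  = (int)(logical_block % NUM_DISKS);  /* round-robin across disks */
    loc.block = logical_block / NUM_DISKS;         /* offset on that disk      */
    return loc;
}

int main(void)
{
    for (long lb = 0; lb < 8; lb++) {
        stripe_loc loc = map_block(lb);
        printf("logical block %ld -> disk %d, block %ld\n", lb, loc.disk, loc.block);
    }
    return 0;
}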
63
Software RAID, Parallel File Systems, and Parallel I/O
64
Enabling Technologies
Efficient communication hardware and software
Global co-ordination of multiple workstation Operating Systems
65
Efficient Communication
The key Enabling Technology
Communication overhead components:
– bandwidth
– network latency
– processor overhead
Switched LANs allow bandwidth to scale
Network latency can be overlapped with computation
Processor overhead is the real problem - it consumes CPU cycles
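A hedged sketch of the cost decomposition above as a simple model: total message time = processor overhead + network latency + size/bandwidth. The parameter values are illustrative assumptions; they only show why cutting processor overhead matters most for small messages.

/* Simple message-cost sketch: time = send/receive overhead + latency +
   size/bandwidth. Parameter values are illustrative assumptions. */
#include <stdio.h>

static double msg_time_us(double bytes, double overhead_us,
                          double latency_us, double bw_MBps)
{
    return 2.0 * overhead_us       /* CPU cost at sender and receiver          */
         + latency_us              /* wire and switch latency                  */
         + bytes / bw_MBps;        /* transfer time in us (1 MB/s = 1 byte/us) */
}

int main(void)
{
    /* Compare a high-overhead kernel path with a user-level path. */
    printf("kernel TCP path : %.1f us\n", msg_time_us(1024, 300.0, 50.0, 10.0));
    printf("user-level path : %.1f us\n", msg_time_us(1024,  10.0, 50.0, 10.0));
    return 0;
}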
66
Efficient Communication (Contd...)
SS10s connected by Ethernet
– 456 µs processor overhead
With ATM
– 626 µs processor overhead
Target:
– MPP communication performance: low latency and scalable bandwidth
– CM5 user-level network overhead: 5.7 µs
67
Efficient Communication (Contd...)
Constraints in clusters
– greater routing delay and less than complete reliability
– constraints on where the network connects into the node
– UNIX has a rigid device and scheduling interface
68
Efficient Communication Approaches
Efficient Network Interface Hardware
Minimal Interface into the Operating System
– user must transmit directly into and receive from the network without OS intervention
– communication protection domains to be established by interface card and OS
– treat message loss as an infrequent case
69
Cluster Computer and its Components
70
Clustering Today
Clustering gained momentum when 3 technologies converged:
– 1. Very high performance microprocessors
• workstation performance = yesterday's supercomputers
– 2. High speed communication
• Communication between cluster nodes >= between processors in an SMP.
– 3. Standard tools for parallel/distributed computing & their growing popularity.
71
Cluster Computer Architecture
72
Cluster Components...1a
Nodes
Multiple High Performance Components:
– PCs
– Workstations
– SMPs (CLUMPS)
– Distributed HPC Systems leading to Metacomputing
They can be based on different architectures and run different OSs.
73
Cluster Components...1b
Processors - there are many (CISC/RISC/VLIW/Vector..)
– Intel: Pentium, Xeon, Merced…
– Sun: SPARC, UltraSPARC
– HP PA
– IBM RS6000/PowerPC
– SGI MIPS
– Digital Alpha
Integrate memory, processing and networking into a single chip
– IRAM (CPU & Mem): (http://iram.cs.berkeley.edu)
– Alpha 21364 (CPU, Memory Controller, NI)
74
Cluster Components…2 OS
State of the art OS:
– Linux (Beowulf)
– Microsoft NT (Illinois HPVM)
– SUN Solaris (Berkeley NOW)
– IBM AIX (IBM SP2)
– HP UX (Illinois - PANDA)
– Mach (microkernel based OS) (CMU)
– Cluster Operating Systems (Solaris MC, SCO Unixware, MOSIX (academic project))
– OS gluing layers (Berkeley Glunix)
75
Cluster Components…3 High Performance Networks
Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps)
SCI (Dolphin - MPI - 12-microsecond latency)
ATM
Myrinet (1.2 Gbps)
Digital Memory Channel
FDDI
76
Cluster Components…4 Network Interfaces
Network Interface Card
– Myrinet has NIC
– User-level access support
– Alpha 21364 processor integrates processing, memory controller, network interface into a single chip..
77
Cluster Components…5 Communication Software
Traditional OS-supported facilities (heavy weight due to protocol processing):
– Sockets (TCP/IP), Pipes, etc.
Lightweight protocols (user level):
– Active Messages (Berkeley)
– Fast Messages (Illinois)
– U-net (Cornell)
– XTP (Virginia)
Systems can be built on top of the above protocols; a socket-level sketch of the traditional path follows below.
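The socket-level sketch referred to above: a minimal TCP send in C illustrating the traditional, kernel-mediated path the slide calls heavy weight. The peer address and port are assumptions for illustration only.

/* Minimal sketch of the kernel-mediated "heavy weight" path: a TCP send. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);                  /* assumed peer port    */
    inet_pton(AF_INET, "10.0.0.2", &peer.sin_addr); /* assumed node address */

    /* Every send below traps into the kernel and runs the full TCP/IP
       stack -- the per-message processor overhead that user-level
       protocols such as Active Messages try to remove. */
    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) == 0) {
        const char *msg = "hello from another cluster node";
        send(fd, msg, strlen(msg), 0);
    } else {
        perror("connect");
    }
    close(fd);
    return 0;
}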
78
Cluster Components…6a
Cluster Middleware
Resides between the OS and applications and offers an infrastructure for supporting:
– Single System Image (SSI)
– System Availability (SA)
SSI makes the collection appear as a single machine (globalised view of system resources). Telnet cluster.myinstitute.edu
SA - checkpointing and process migration.
79
Cluster Components…6b
Middleware Components
Hardware
– DEC Memory Channel, DSM (Alewife, DASH), SMP techniques
OS / Gluing Layers
– Solaris MC, Unixware, Glunix
Applications and Subsystems
– System management and electronic forms
– Runtime systems (software DSM, PFS, etc.)
– Resource management and scheduling (RMS):
• CODINE, LSF, PBS, NQS, etc.
80
Cluster Components…7a Programming Environments
Threads (PCs, SMPs, NOW..)
– POSIX Threads
– Java Threads
MPI
– Linux, NT, on many Supercomputers
PVM
Software DSMs (Shmem)
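A minimal MPI sketch in C of the message-passing style listed above: rank 0 sends one integer to rank 1. This is a generic example, not code from the slides; compile with mpicc and launch with mpirun.

/* Minimal MPI point-to-point example: rank 0 sends a value to rank 1. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}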
81
Cluster Components…7b
Development Tools ?
Compilers
– C/C++/Java
– Parallel programming with C++ (MIT Press book)
RAD (rapid application development tools)
– GUI based tools for PP modeling
Debuggers
Performance Analysis Tools
Visualization Tools
82
Cluster Components…8 Applications
Sequential
Parallel / Distributed (cluster-aware applications)
– Grand Challenging applications
• Weather Forecasting
• Quantum Chemistry
• Molecular Biology Modeling
• Engineering Analysis (CAD/CAM)
• ……………….
– PDBs, web servers, data-mining
83
Key Operational Benefits of Clustering
System availability (HA): clusters offer inherent high system availability due to the redundancy of hardware, operating systems, and applications.
Hardware fault tolerance: redundancy for most system components (e.g. disk-RAID), covering both hardware and software.
OS and application reliability: run multiple copies of the OS and applications and, through this redundancy, tolerate failures.
Scalability: add servers to the cluster, add more clusters to the network, or add CPUs to an SMP as the need arises.
High performance: running cluster-enabled programs.
84
Classification
of Cluster Computer
85
Clusters Classification..1
Based on Focus (in Market)
– High Performance (HP) Clusters
• Grand Challenging Applications
– High Availability (HA) Clusters
• Mission Critical applications
86
HA Cluster: Server Cluster with "Heartbeat" Connection
87
Clusters Classification..2
Based on Workstation/PC Ownership
– Dedicated Clusters
– Non-dedicated clusters
• Adaptive parallel computing
• Also called communal multiprocessing
88
Clusters Classification..3
Based on Node Architecture..
– Clusters of PCs (CoPs)
– Clusters of Workstations (COWs)
– Clusters of SMPs (CLUMPs)
89
Building Scalable Systems: Cluster of SMPs (Clumps)
Performance of SMP Systems Vs. Four-Processor Servers in a Cluster
90
Clusters Classification..4
Based on Node OS Type..
– Linux Clusters (Beowulf)
– Solaris Clusters (Berkeley NOW)
– NT Clusters (HPVM)
– AIX Clusters (IBM SP2)
– SCO/Compaq Clusters (Unixware)
– …….Digital VMS Clusters, HP clusters, ………………..
91
Clusters Classification..5
Based on node components architecture & configuration (Processor Arch, Node Type: PC/Workstation, & OS: Linux/NT):
– Homogeneous Clusters
• All nodes have similar configurations
– Heterogeneous Clusters
• Nodes based on different processors and running different OSs.
92
Clusters Classification..6a
Dimensions of Scalability & Levels of Clustering
[Figure: dimensions of scalability - (1) platform (uniprocessor, SMP, cluster, MPP), (2) CPU / I/O / memory / OS, and (3) network technology - with levels of clustering ranging from workgroup, department and enterprise/campus up to public metacomputing.]
93
Clusters Classification..6b
Levels of Clustering
Group Clusters (#nodes: 2-99)
– a set of dedicated/non-dedicated computers, mainly connected by a SAN such as Myrinet
Departmental Clusters (#nodes: 99-999)
Organizational Clusters (#nodes: many 100s) (using ATM nets)
Internet-wide Clusters = Global Clusters (#nodes: 1000s to many millions)
– Metacomputing
– Web-based Computing
– Agent Based Computing
• Java plays a major role in web and agent based computing
94
Cluster Middleware
and
Single System Image
95
Contents
What is Middleware?
What is Single System Image?
Benefits of Single System Image
SSI Boundaries
SSI Levels
Relationship between Middleware Modules
Strategy for SSI via OS
Solaris MC: an example OS supporting SSI
Cluster Monitoring Software
96
What is Cluster Middleware ?
An interface between user applications and the cluster hardware and OS platform.
Middleware packages support each other at the management, programming, and implementation levels.
Middleware Layers:
– SSI Layer
– Availability Layer: enables the cluster services of
• checkpointing, automatic failover, recovery from failure,
• fault-tolerant operation among all cluster nodes.
97
Middleware Design Goals
Complete Transparency
– Lets the user see a single cluster system.
• Single entry point, ftp, telnet, software loading...
Scalable Performance
– Easy growth of the cluster
• no change of API & automatic load distribution.
Enhanced Availability
– Automatic recovery from failures
• Employ checkpointing & fault tolerance technologies
– Handle consistency of data when replicated.
98
What is Single System Image (SSI) ?
A single system image is the illusion, created by software or hardware, that a collection of computing elements appear as a single computing resource.
SSI makes the cluster appear like a single machine to the user, to applications, and to the network.
A cluster without an SSI is not a cluster.
99
Benefits of Single System Image
Usage of system resources transparently
Improved reliability and higher availability
Simplified system management
Reduction in the risk of operator errors
User need not be aware of the underlying system architecture to use these machines effectively
100
SSI vs. Scalability(design space of competing arch.)
101
Desired SSI Services
Single Entry Point
– telnet cluster.my_institute.edu
– telnet node1.cluster.institute.edu
Single File Hierarchy: xFS, AFS, Solaris MC Proxy
Single Control Point: management from a single GUI
Single virtual networking
Single memory space - DSM
Single Job Management: Glunix, Codine, LSF
Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may use Web technology
102
Availability Support Functions
Single I/O Space (SIO):
– any node can access any peripheral or disk device without knowledge of its physical location.
Single Process Space (SPS):
– Any process on any node can create processes with cluster-wide process IDs, and they communicate through signals, pipes, etc., as if they were on a single node.
Checkpointing and Process Migration:
– Checkpointing saves the process state and intermediate results to disk to support rollback recovery when a node fails; process migration supports load balancing. (A minimal checkpointing sketch follows below.)
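The minimal checkpointing sketch referred to above, in C: the application periodically writes its state to disk so a restarted run can roll back to the last checkpoint rather than starting over. The file name and state layout are assumptions for illustration.

/* Application-level checkpointing sketch for rollback recovery. */
#include <stdio.h>

typedef struct { long iteration; double partial_result; } app_state;

static int save_checkpoint(const app_state *s)
{
    FILE *f = fopen("checkpoint.dat", "wb");   /* assumed checkpoint file */
    if (!f) return -1;
    size_t ok = fwrite(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

static int load_checkpoint(app_state *s)
{
    FILE *f = fopen("checkpoint.dat", "rb");
    if (!f) return -1;                         /* no checkpoint: start fresh */
    size_t ok = fread(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void)
{
    app_state s = {0, 0.0};
    if (load_checkpoint(&s) == 0)
        printf("restarting from iteration %ld\n", s.iteration);

    for (; s.iteration < 1000000; s.iteration++) {
        s.partial_result += 1.0 / (s.iteration + 1);
        if (s.iteration % 100000 == 0)
            save_checkpoint(&s);               /* rollback point */
    }
    printf("result = %f\n", s.partial_result);
    return 0;
}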
103
SSI Levels
It is a computer science notion of levels of abstractions (house is at a higher level of abstraction than walls, ceilings, and floors).
Application and Subsystem Level
Operating System Kernel Level
Hardware Level
104
SSI at Application and Subsystem Level
Level       | Examples                                        | Boundary                                               | Importance
application | cluster batch system, system management         | an application                                         | what a user wants
subsystem   | distributed DB, OSF DME, Lotus Notes, MPI, PVM  | a subsystem                                            | SSI for all applications of the subsystem
file system | Sun NFS, OSF DFS, NetWare, and so on            | shared portion of the file system                      | implicitly supports many applications and subsystems
toolkit     | OSF DCE, Sun ONC+, Apollo Domain                | explicit toolkit facilities: user, service name, time  | best level of support for heterogeneous system
(c) In search of clusters
105
SSI at Operating System Kernel Level
Level             | Examples                                            | Boundary                                                 | Importance
kernel/OS layer   | Solaris MC, Unixware, MOSIX, Sprite, Amoeba/GLUnix  | each name space: files, processes, pipes, devices, etc.  | kernel support for applications, adm subsystems
kernel interfaces | UNIX (Sun) vnode, Locus (IBM) vproc                 | type of kernel objects: files, processes, etc.           | modularizes SSI code within kernel
virtual memory    | none supporting operating system kernel             | each distributed virtual memory space                    | may simplify implementation of kernel objects
microkernel       | Mach, PARAS, Chorus, OSF/1 AD, Amoeba               | each service outside the microkernel                     | implicit SSI for all system services
(c) In search of clusters
106
SSI at Hardware Level
Level          | Examples            | Boundary                     | Importance
memory         | SCI, DASH           | memory space                 | better communication and synchronization
memory and I/O | SCI, SMP techniques | memory and I/O device space  | lower overhead cluster I/O
(c) In search of clusters
107
SSI Characteristics
1. Every SSI has a boundary.
2. Single system support can exist at different levels within a system, one able to be built on another.
108
SSI Boundaries -- an application's SSI boundary
Batch System
SSIBoundary
(c) In search of clusters
109
Relationship Among Middleware Modules
110
PARMON: A Cluster Monitoring Tool
[Figure: PARMON architecture - a parmond server daemon runs on each Solaris node, and the parmon client (on a JVM) communicates with the servers over a high-speed switch.]
111
Motivations
Monitoring such huge systems is a tedious and challenging task since typical workstations are designed to work as a standalone system, rather than a part of workstation clusters.
System administrators require tools to effectively monitor such huge systems. PARMON provides the solution to this challenging problem.
112
PARMON - Salient Features
Allows monitoring of system activities at Component, Node, Group, or entire Cluster level
Monitoring of system components:
– CPU, Memory, Disk and Network
Allows monitoring of multiple instances of the same component.
Provides a GUI interface for initiating activities/requests and presents results graphically.
113
Resource Utilization at a Glance
114
CPU Usage Monitoring
115
Memory Usage monitoring
116
Kernel Data Catalog - CPU
117
Strategy for SSI via OS
1. Build as a layer on top of the existing OS (e.g. Glunix)
– Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time.
– i.e. new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath. E.g.: Glunix/Solaris-MC.
2. Build SSI at kernel level - a true cluster OS
– Good, but can't leverage OS improvements by the vendor.
– E.g. Unixware and MOSIX (built using BSD Unix).
118
Cluster Computing - Research Projects
Beowulf (Caltech and NASA) - USA
CCS (Computing Centre Software) - Paderborn, Germany
Condor - University of Wisconsin, USA
DJM (Distributed Job Manager) - Minnesota Supercomputing Center
DQS (Distributed Queuing System) - Florida State University, USA
EASY - Argonne National Lab, USA
HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA
far - University of Liverpool, UK
Gardens - Queensland University of Technology, Australia
Generic NQS (Network Queuing System) - University of Sheffield, UK
NOW (Network of Workstations) - Berkeley, USA
NIMROD - Monash University, Australia
PBS (Portable Batch System) - NASA Ames and LLNL, USA
PRM (Prospero Resource Manager) - Uni. of S. California, USA
QBATCH - Vita Services Ltd., USA
119
Cluster Computing - Commercial Software
Codine (Computing in Distributed Network Environment) - GENIAS GmbH, Germany
LoadLeveler - IBM Corp., USA
LSF (Load Sharing Facility) - Platform Computing, Canada
NQE (Network Queuing Environment) - Craysoft Corp., USA
OpenFrame - Centre for Development of Advanced Computing, India
RWPC (Real World Computing Partnership), Japan
Unixware (SCO - Santa Cruz Operation), USA
Solaris-MC (Sun Microsystems), USA
120
Representative Cluster Systems
1. Solaris-MC
2. Berkeley NOW
3. Their comparison with Beowulf & HPVM
121
Next Generation Distributed Computing:
The Solaris MC Operating System
122
Why new software?
Without software, a cluster is:
– Just a network of machines
– Requires specialized applications
– Hard to administer
With a cluster operating system:
– Cluster becomes a scalable, modular computer
– Users and administrators see a single large machine
– Runs existing applications
– Easy to administer
New software makes the cluster better for the customer
123
Cluster computing and Solaris MC
Goal: use computer clusters for general-purpose computing
Support existing customers and applications
Solution: Solaris MC (Multi Computer) operating system
A distributed operating system (OS) for multi-computers
124
What is the Solaris MC OS ?
Solaris MC extends standard Solaris
Solaris MC makes the cluster look like a single machine
– Global file system
– Global process management
– Global networking
Solaris MC runs existing applications unchanged
– Supports the Solaris ABI (Application Binary Interface)
125
Applications
Ideal for: Web and interactive servers, databases, file servers, timesharing
Benefits for vendors and customers
– Preserves investment in existing applications
– Modular servers with low entry-point price and low cost of ownership
– Easier system administration
– Solaris could become a preferred platform for clustered systems
126
Solaris MC is a running research system
Designed, built and demonstrated Solaris MC prototype
– Cluster of SPARCstations connected with a Myrinet network
– Runs unmodified commercial parallel database, scalable Web server, parallel make
Next: Solaris MC Phase II
– High availability
– New I/O work to take advantage of clusters
– Performance evaluation
127
Advantages of Solaris MC
Leverages continuing investment in Solaris
– Same applications: binary-compatible
– Same kernel, device drivers, etc.
– As portable as base Solaris - will run on SPARC, x86, PowerPC
State of the art distributed systems techniques
– High availability designed into the system
– Powerful distributed object-oriented framework
Ease of administration and use
– Looks like a familiar multiprocessor server to users, system administrators, and applications
128
Solaris MC details
Solaris MC is a set of C++ loadable modules on top of Solaris
– Very few changes to the existing kernel
A private Solaris kernel per node provides reliability
Object-oriented system with well-defined interfaces
129
Key components of Solaris-MC providing SSI
– global file system
– globalized process management
– globalized networking and I/O
[Figure: Solaris MC architecture - applications run above the system call interface; the file system, process management, networking and the C++ object framework are Solaris MC modules layered on the existing Solaris 2.5 kernel, with object invocations going to other nodes.]
130
Solaris MC components
Object and communication support
High availability support
PXFS global distributed file system
Process management
Networking
[Figure: the same Solaris MC architecture diagram as on the previous slide.]
131
Object Orientation
Better software maintenance, change, and evolution
– Well-defined interfaces
– Separate implementation from interface
– Interface inheritance
Solaris MC uses:
– IDL: a better way to define interfaces
– CORBA object model: a better RPC (Remote Procedure Call)
– C++: a better C
132
Object and Communication Framework
Mechanism for nodes and modules to communicate
– Inter-node and intra-node interprocess communication
Optimized protocols for the trusted computing base
Efficient, low-latency communication primitives
Object communication independent of interconnect
– Uses Ethernet, fast Ethernet, FibreChannel, Myrinet
– Allows interconnect hardware to be upgraded
133
High Availability Support
Node failure doesn't crash the entire system
– Unaffected nodes continue running
– Better than an SMP
– A requirement for the mission-critical market
Well-defined failure boundaries
– Separate kernel per node - the OS does not use shared memory
Object framework provides support
– Delivers failure notifications to servers and clients
– Group membership protocol detects node failures (see the heartbeat sketch below)
Each subsystem is responsible for its own recovery
– File system, process management, networking, applications
134
PXFS: Global Filesystem
Single-system image of the file system
Backbone of Solaris MC
Coherent access and caching of files and directories
– Caching provides high performance
Access to I/O devices
135
PXFS: An object-oriented VFS
PXFS builds on existing Solaris file systems
– Uses the vnode/virtual file system interface (VFS) externally
– Uses object communication internally
136
Process management
Provide a global view of processes on any node
– Users, administrators, and applications see the global view
– Supports existing applications
Uniform support for local and remote processes
– Process creation/waiting/exiting (including remote execution)
– Global process identifiers, groups, sessions
– Signal handling
– procfs (/proc)
137
Process management benefits
Global process management helps users and administrators
Users see familiar single machine process model
Can run programs on any node
Location of process in the cluster doesn’t matter
Use existing commands and tools: unmodified ps, kill, etc.
138
Networking goals
Cluster appears externally as a single SMP server
– Familiar to customers
– Access the cluster through a single network address
– Multiple network interfaces supported but not required
Scalable design
– Protocol and network application processing on any node
– Parallelism provides high server performance
139
Networking: Implementation
A programmable “packet filter”
– Packets routed between the network device and the correct node
– Efficient, scalable, and supports parallelism
– Supports multiple protocols with existing protocol stacks
Parallelism of protocol processing and applications
– Incoming connections are load-balanced across the cluster (see the sketch below)
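The sketch referred to above: one way to spread incoming connections across cluster nodes is to hash each connection's addresses and ports to a node number, so the cluster stays reachable through one external address. The hash and node count are illustrative assumptions, not the actual Solaris MC packet filter.

/* Connection load-balancing sketch: hash the TCP 4-tuple to a node. */
#include <stdio.h>
#include <stdint.h>

#define CLUSTER_NODES 4

static int pick_node(uint32_t src_ip, uint16_t src_port,
                     uint32_t dst_ip, uint16_t dst_port)
{
    uint32_t h = src_ip ^ dst_ip ^ ((uint32_t)src_port << 16) ^ dst_port;
    h ^= h >> 16;                        /* mix the bits a little         */
    return (int)(h % CLUSTER_NODES);     /* node that owns this connection */
}

int main(void)
{
    /* Three example client connections to the cluster's single address. */
    printf("conn 1 -> node %d\n", pick_node(0x0A000001, 40001, 0x0A0000FE, 80));
    printf("conn 2 -> node %d\n", pick_node(0x0A000002, 40002, 0x0A0000FE, 80));
    printf("conn 3 -> node %d\n", pick_node(0x0A000003, 51000, 0x0A0000FE, 80));
    return 0;
}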
140
Status
4 node, 8 CPU prototype with Myrinet demonstrated
– Object and communication infrastructure
– Global file system (PXFS) with coherency and caching
– Networking: TCP/IP with load balancing
– Global process management (ps, kill, exec, wait, rfork, /proc)
– Monitoring tools
– Cluster membership protocols
Demonstrated applications
– Commercial parallel database
– Scalable Web server
– Parallel make
– Timesharing
Solaris-MC team is working on high availability
141
Summary of Solaris MC
Clusters likely to be an important market
Solaris MC preserves customer investment in Solaris
– Uses existing Solaris applications
Familiar to customers
– Looks like a multiprocessor, not a special cluster architecture
– Ease of administration and use
Clusters are ideal for important applications
– Web server, file server, databases, interactive services
State-of-the-art object-oriented distributed implementation
Designed for future growth
142
Berkeley NOW Project
143
NOW @ Berkeley
Design & implementation of higher-level systems
– Global OS (Glunix)
– Parallel File Systems (xFS)
– Fast Communication (HW for Active Messages)
– Application Support
Overcoming technology shortcomings
– Fault tolerance
– System Management
NOW Goal: Faster for Parallel AND Sequential
144
NOW Software Components
[Figure: NOW software layers - each Unix (Solaris) workstation runs an Active Messages (AM) low-level communication protocol and a VN segment driver above the Myrinet scalable interconnect; the Global Layer Unix (GLUnix) layer, with name server and scheduler, supports Sockets, Split-C, MPI, HPF and vSM, running large sequential and parallel applications.]
145
Active Messages: Lightweight
Communication Protocol
Key Idea: a Network Process ID is attached to every message; the HW checks it upon receipt
– Net PID match: as fast as before
– Net PID mismatch: interrupt and invoke the OS
Can mix LAN messages and MPP messages; invoke the OS & TCP/IP only when not cooperating (if everyone uses the same physical layer format)
146
MPP Active Messages
Key Idea: associate a small user-level handler directly with each message
Sender injects the message directly into the network
Handler executes immediately upon arrival
Pulls the message out of the network and integrates it into the ongoing computation, or replies
No buffering (beyond transport), no parsing, no allocation, primitive scheduling
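A hedged C sketch of the active-message idea described above: each message names a user-level handler that runs as soon as the message is delivered and folds the payload into the ongoing computation. The handler table and message layout are illustrative, not the Berkeley AM API.

/* Active-message dispatch sketch: the message selects its own handler. */
#include <stdio.h>

typedef void (*am_handler)(int arg);

static double running_sum = 0.0;

static void add_handler(int arg)   { running_sum += arg; }
static void print_handler(int arg) { printf("sum so far: %.0f (tag %d)\n", running_sum, arg); }

static am_handler handler_table[] = { add_handler, print_handler };

typedef struct { int handler_index; int arg; } active_message;

/* In a real system this runs at message arrival (NIC interrupt or poll);
   here we call it directly to show the dispatch. */
static void deliver(const active_message *m)
{
    handler_table[m->handler_index](m->arg);
}

int main(void)
{
    active_message msgs[] = { {0, 10}, {0, 32}, {1, 0} };
    for (int i = 0; i < 3; i++)
        deliver(&msgs[i]);
    return 0;
}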
147
Active Message Model
Every message contains at its header the address of a user level handler which gets executed immediately in user level
No receive side buffering of messages
Supports protected multiprogramming of a large number of users onto finite physical network resource
Active message operations, communication events and threads are integrated in a simple and cohesive model
Provides naming and protection
148
Active Message Model (Contd..)
[Figure: an active message carries a handler PC and data through the network; on arrival the handler runs and feeds the data into the receiver's data structures and primary computation.]
149
xFS: File System for NOW
Serverless File System: all data lives with clients
– Uses MP cache coherency to reduce traffic
– Files striped for parallel transfer
– Large file cache (“cooperative caching” - Network RAM)

              Miss Rate   Response Time
Client/Server    10%         1.8 ms
xFS               4%         1.0 ms
(42 WS, 32 MB/WS, 512 MB/server, 8KB/access)
150
Glunix: Gluing Unix
It is built on top of Solaris
It glues together Solaris running on the cluster nodes.
Supports transparent remote execution and load balancing, and allows existing applications to run.
Provides a globalized view of system resources, like Solaris MC
Gang schedules parallel jobs so the cluster is as good as a dedicated MPP for parallel jobs
151
3 Paths for Applications on NOW?
Revolutionary (MPP style): write new programs from scratch using MPP languages, compilers, libraries, …
Porting: port programs from mainframes, supercomputers, MPPs, …
Evolutionary: take a sequential program & use
1) Network RAM: first use the memory of many computers to reduce disk accesses; if not fast enough, then:
2) Parallel I/O: use many disks in parallel for accesses not in the file cache; if not fast enough, then:
3) Parallel program: change the program until it uses enough processors to be fast
=> Large speedup without a fine-grain parallel program
152
Comparison of 4 Cluster Systems
153
Pointers to Literature on Cluster Computing
154
Reading Resources..1a Internet & WWW
– Computer Architecture:• http://www.cs.wisc.edu/~arch/www/
– PFS & Parallel I/O• http://www.cs.dartmouth.edu/pario/
– Linux Parallel Processing
– DSMs• http://www.cs.umd.edu/~keleher/dsm.html
155
Reading Resources..1b Internet & WWW
– Solaris-MC• http://www.sunlabs.com/research/solaris-mc
– Microprocessors: Recent Advances• http://www.microprocessor.sscc.ru
– Beowulf:• http://www.beowulf.org
– Metacomputing• http://www.sis.port.ac.uk/~mab/Metacomputing/
156
Reading Resources..2 Books
– In Search of Clusters
• by G. Pfister, Prentice Hall (2nd ed.), 1998
– High Performance Cluster Computing
• Volume 1: Architectures and Systems
• Volume 2: Programming and Applications
– Edited by Rajkumar Buyya, Prentice Hall, NJ, USA.
– Scalable Parallel Computing
• by K. Hwang & Z. Xu, McGraw-Hill, 1998
157
Reading Resources..3 Journals
– A Case for NOW (Networks of Workstations), IEEE Micro, Feb '95
• by Anderson, Culler, Patterson
– Fault Tolerant COW with SSI, IEEE Concurrency (to appear)
• by Kai Hwang, Chow, Wang, Jin, Xu
– Cluster Computing: The Commodity Supercomputing, Journal of Software Practice and Experience (available from my web page)
• by Mark Baker & Rajkumar Buyya
158
Clusters Revisited
159
Summary
We have discussed Clusters
– Enabling Technologies
– Architecture & its Components
– Classifications
– Middleware
– Single System Image
– Representative Systems
160
Conclusions
Clusters are promising..
– Solve the parallel processing paradox
– Offer incremental growth and match funding patterns.
– New trends in hardware and software technologies are likely to make clusters more promising.. so that
– Cluster-based supercomputers can be seen everywhere!
161
Breaking High Performance Computing Barriers
[Figure: delivered GFLOPS grow as systems move from a single processor to shared memory, to a local parallel cluster, and finally to a global parallel cluster.]
162
Well, Read my book for….
http://www.dgs.monash.edu.au/~rajkumar/cluster/
Thank You ...
?
163
Thank You ...
?