
Page 1:

Cluster Computing

Cheng-Zhong Xu

Page 2:

Outline

Cluster Computing Basics
– Multicore architecture
– Cluster Interconnect
Parallel Programming for Performance
MapReduce Programming
Systems Management

Page 3:

What’s a Cluster?

Broadly, a group of networked autonomous computers that work together so that, in many respects, they form a single machine:

– To improve performance (speed)

– To improve throughput

– To improve service availability (high-availability clusters)

Built from commercial off-the-shelf components, the system is often more cost-effective than a single machine of comparable speed or availability

Page 4:

Highly Scalable Clusters

High Performance Cluster (aka Compute Cluster)
– A form of parallel computer that aims to solve problems faster by using multiple compute nodes
– For parallel efficiency, the nodes are often closely coupled in a high-throughput, low-latency network

Server Cluster and Datacenter
– Aims to improve the system's throughput, service availability, power consumption, etc. by using multiple nodes

Page 5:

Top500 Installation of Supercomputers

Source: top500.org

Page 6:

Clusters in Top500

Page 7:

An Example of Top500 Submission (F’08)

Location: Tukwila, WA
Hardware – Machines: 256 dual-CPU, quad-core Intel 5320 Clovertown machines, 1.86 GHz CPUs and 8 GB RAM
Hardware – Networking: Private & public: Broadcom GigE; MPI: Cisco InfiniBand SDR, 34 IB switches in a leaf/node configuration
Number of Compute Nodes: 256
Total Number of Cores: 2048
Total Memory: 2 TB of RAM

Particulars of the current Linpack runs:
Best Linpack Result: 11.75 TFLOPS
Best Cluster Efficiency: 77.1%

For comparison:
Linpack rating from the June 2007 Top500 run (#106) on the same hardware: 8.99 TFLOPS
Cluster efficiency from the June 2007 Top500 run (#106) on the same hardware: 59%
Typical Top500 efficiency for Clovertown motherboards w/ IB, regardless of operating system: 65-77% (2 instances of 79%)

About a 30% improvement in efficiency on the same hardware; about one hour to deploy.

Page 8:

Beowulf Cluster

A cluster of inexpensive PCs for low-cost personal supercomputing
Based on commodity off-the-shelf components:
– PC computers running a Unix-like OS (BSD, Linux, or OpenSolaris)
– Interconnected by an Ethernet LAN
A head node plus a group of compute nodes
– The head node controls the cluster and serves files to the compute nodes
Standard, free and open-source software
– Programming in MPI (a minimal example follows)
– MapReduce
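To give a flavor of the MPI style just mentioned, here is a minimal, illustrative C sketch (my addition, not part of the original slides) of the kind of program a Beowulf head node launches across compute nodes; the mpicc/mpirun commands in the comment are the usual convention.

/* Build and run (typical convention): mpicc hello.c -o hello && mpirun -np 4 ./hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(host, &len);     /* which node we landed on */

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}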

Page 9:

Why Clustering Today

Powerful nodes (CPU, memory, storage)
– Today's PC is yesterday's supercomputer
– Multi-core processors
High-speed networks
– Gigabit Ethernet (56% of the Top500 as of Nov 2008)
– InfiniBand System Area Network (SAN) (24.6%)
Standard tools for parallel/distributed computing, and their growing popularity
– MPI, PBS, etc.
– MapReduce for data-intensive computing

Page 10:

Major Issues in Cluster Design

Programmability
– Sequential vs parallel programming
– MPI, DSM, DSA: hybrids of multithreading and MPI
– MapReduce
Cluster-aware resource management
– Job scheduling (e.g., PBS)
– Load balancing, data locality, communication optimization, etc.
System management
– Remote installation, monitoring, diagnosis
– Failure management, power management, etc.

Page 11:

Cluster Architecture

Multi-core node architecture
Cluster Interconnect

Page 12:

Single-core computer

Page 13:

Single-core CPU chip

[Figure: chip diagram highlighting the single core]

Page 14:

Multicore Architecture

Combines two or more independent cores (normally CPUs) in a single package

Supports multitasking and multithreading within a single physical package

Page 15:

Multicore is Everywhere

Dual-core commonplace in laptops
Quad-core in desktops
Dual quad-core in servers
All major chip manufacturers produce multicore CPUs
– SUN Niagara (8 cores, 64 concurrent threads)
– Intel Xeon (6 cores)
– AMD Opteron (4 cores)

Page 16:

Multithreading on multi-core

David Geer, IEEE Computer, 2007

Page 17:

Interaction with the OS

OS perceives each core as a separate processor

OS scheduler maps threads/processes to different cores

Most major OSes support multi-core today: Windows, Linux, Mac OS X, … (a pinning sketch follows)
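As a concrete illustration of how software can influence the thread-to-core mapping described above, here is a small Linux-specific C sketch (my addition; other OSes expose different APIs) that pins the calling thread to core 0 with pthread_setaffinity_np.

/* Compile with: gcc -pthread pin.c -o pin (Linux/glibc only) */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);     /* start with an empty CPU set */
    CPU_SET(0, &set);   /* add core 0 to the set */

    /* Ask the OS scheduler to run this thread only on core 0. */
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return 1;
    }
    printf("pinned to core 0 (currently on core %d)\n", sched_getcpu());
    return 0;
}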

Page 18:

Cluster Interconnect

Network fabric connecting the compute nodes
Objective is to strike a balance between
– Processing power of the compute nodes
– Communication ability of the interconnect
A more specialized LAN, providing many opportunities for performance optimization
– Switch in the core
– Latency vs bandwidth

[Figure: switch internals: input ports and receivers feed input buffers; a cross-bar connects them to output buffers, transmitters, and output ports, under control logic that performs routing and scheduling.]

Page 19:

Goal: Bandwidth and Latency

[Figure: two plots. Left: latency vs delivered bandwidth, rising sharply at saturation. Right: delivered bandwidth vs offered bandwidth, leveling off at saturation.]

Page 20:

Ethernet Switch: Allows Multiple Simultaneous Transmissions

Hosts have dedicated, direct connections to the switch
Switches buffer packets
The Ethernet protocol is used on each incoming link, but there are no collisions; full duplex
– each link is its own collision domain
Switching: A-to-A' and B-to-B' can proceed simultaneously, without collisions
– not possible with a dumb hub

[Figure: a switch with six interfaces (1-6) connecting hosts A, B, C, A', B', C'.]

Page 21:

Switch Table

Q: How does the switch know that A' is reachable via interface 4, and B' via interface 5?
A: Each switch has a switch table; each entry contains:
– (MAC address of host, interface to reach the host, time stamp)
This looks like a routing table!
Q: How are entries created and maintained in the switch table?
– something like a routing protocol?

[Figure: the same six-interface switch (1-6) with hosts A, B, C, A', B', C'.]

Page 22:

Switch: Self-Learning

The switch learns which hosts can be reached through which interfaces
– when a frame is received, the switch "learns" the location of the sender: the incoming LAN segment
– it records the sender/location pair in the switch table (a toy sketch follows)

[Figure: A sends a frame (Source: A, Dest: A') into interface 1; the switch table, initially empty, gains the entry (MAC addr: A, interface: 1, TTL: 60).]
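The following toy C sketch (illustrative only; the MAC strings and table layout are made up) captures the self-learning rule: record (source MAC, incoming interface, TTL) on every received frame, and when forwarding, send selectively if the destination is known, otherwise flood.

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 64
#define FLOOD -1

/* One switch-table entry: (MAC address, interface, TTL), as on the slide. */
struct entry { char mac[18]; int iface; int ttl; };

static struct entry table[TABLE_SIZE];
static int n_entries = 0;

/* Self-learning: remember which interface a source MAC arrived on. */
static void learn(const char *src_mac, int in_iface)
{
    for (int i = 0; i < n_entries; i++)
        if (strcmp(table[i].mac, src_mac) == 0) {
            table[i].iface = in_iface;
            table[i].ttl = 60;            /* refresh the existing entry */
            return;
        }
    if (n_entries < TABLE_SIZE) {
        snprintf(table[n_entries].mac, sizeof table[n_entries].mac, "%s", src_mac);
        table[n_entries].iface = in_iface;
        table[n_entries].ttl = 60;
        n_entries++;
    }
}

/* Forwarding: selective send if the destination is known, else flood. */
static int lookup(const char *dst_mac)
{
    for (int i = 0; i < n_entries; i++)
        if (strcmp(table[i].mac, dst_mac) == 0)
            return table[i].iface;
    return FLOOD;                          /* unknown destination: flood */
}

int main(void)
{
    learn("aa:aa:aa:aa:aa:aa", 1);         /* frame from A arrives on interface 1 */
    printf("to A:  iface %d\n", lookup("aa:aa:aa:aa:aa:aa"));   /* 1 */
    printf("to A': iface %d\n", lookup("bb:bb:bb:bb:bb:bb"));   /* -1 = flood */
    return 0;
}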

Page 23:

Self-learning, forwarding: example

[Figure: A sends a frame destined to A'. The switch records (A, interface 1, TTL 60); since the frame's destination is unknown, the frame is flooded out of all other interfaces. A' replies with a frame destined to A: the switch learns (A', interface 4, TTL 60), and because destination A's location is known, it sends the reply selectively on interface 1.]

Page 24:

Interconnecting switches

Switches can be connected together

Q: Sending from A to G: how does S1 know to forward a frame destined to G via S4 and S3?
A: Self-learning! (it works exactly the same as in the single-switch case)
Q: What about latency and bandwidth for a large-scale network?

[Figure: hosts A through I attached across four interconnected switches S1, S2, S3, S4.]

Page 25:

What characterizes a network?

Topology (what)
– physical interconnection structure of the network graph
– regular vs irregular
Routing Algorithm (which)
– restricts the set of paths that messages may follow
– table-driven, or based on a routing algorithm
Switching Strategy (how)
– how the data in a message traverses its route
– store-and-forward vs cut-through
Flow Control Mechanism (when)
– when a message, or portions of it, traverses its route
– what happens when traffic is encountered?

The interplay of all of these determines performance

Page 26:

Tree: An Example

Diameter and average distance are logarithmic
– k-ary tree, height d = log_k N
– address specified as a d-vector of radix-k coordinates describing the path down from the root
Fixed degree
Route up to the common ancestor, then down (see the sketch after this slide):
– R = B xor A
– let i be the position of the most significant 1 in R; route up i+1 levels
– go down in the direction given by the low i+1 bits of B
Bandwidth and bisection bandwidth?
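A small illustrative C sketch (my addition, for the binary case k = 2) of this routing rule: compute R = A xor B, find the most significant set bit, go up that many levels plus one, then descend following the low bits of B.

#include <stdio.h>

/* Tree routing for a binary (k = 2) tree with leaf addresses a and b.
 * Rule from the slide: R = B xor A; with i the position of the most
 * significant 1 in R, route up i+1 levels, then down along the
 * directions given by the low i+1 bits of B. */
static void route(unsigned a, unsigned b)
{
    unsigned r = a ^ b;
    if (r == 0) { printf("same node\n"); return; }

    int i = 0;                                /* position of the MSB set in R */
    for (unsigned t = r; t > 1; t >>= 1) i++;

    printf("route %u -> %u: up %d level(s), then down:", a, b, i + 1);
    for (int lvl = i; lvl >= 0; lvl--)        /* low i+1 bits of B, high to low */
        printf(" %u", (b >> lvl) & 1u);
    printf("\n");
}

int main(void)
{
    route(5u, 3u);   /* leaves 0b101 and 0b011 differ first at bit 2:
                        up 3 levels, then down 0, 1, 1 */
    return 0;
}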

Page 27:

Bandwidth

– Point-to-point bandwidth
– Bisection bandwidth of the interconnect fabric: the rate at which data can be sent across an imaginary line dividing the cluster into two halves, each with an equal number of nodes

For a switch with N ports:
– If it is non-blocking, the bisection bandwidth = N * the point-to-point bandwidth
– An oversubscribed switch delivers less bisection bandwidth than a non-blocking one, but is cost-effective. It scales the bandwidth per node up to a point, after which increasing the number of nodes decreases the available bandwidth per node
– Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth (a worked example follows)
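As a worked example with hypothetical numbers (a rack switch with 48 GigE host ports and 4 GigE uplinks; none of these figures come from the slides), this small C sketch computes the oversubscription ratio and the worst-case per-host bandwidth.

#include <stdio.h>

/* Hypothetical rack: 48 host-facing GigE ports, 4 GigE uplinks. */
int main(void)
{
    double host_ports = 48, uplinks = 4, link_gbps = 1.0;

    double offered   = host_ports * link_gbps;  /* what the hosts can inject */
    double bisection = uplinks * link_gbps;     /* what can cross the cut */

    printf("oversubscription = %.0f:1\n", offered / bisection);          /* 12:1 */
    printf("worst-case per-host bw = %.3f Gbps\n", bisection / host_ports); /* ~0.083 */
    return 0;
}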

Page 28:

How to Maintain Constant BW per Node?

Limited ports in a single switch
– use multiple switches
The link between a pair of switches can be a bottleneck
– fast uplinks
How to organize multiple switches?
– Irregular topology
– Regular topologies: ease of management

Page 29:

Scalable Interconnect: Examples

[Figure: a 16-node butterfly network assembled from 2x2 switch building blocks (inputs/outputs labeled 0 and 1), and a fat tree.]

Page 30:

Multidimensional Meshes and Tori

d-dimensional array
– n = k_{d-1} x ... x k_0 nodes
– described by a d-vector of coordinates (i_{d-1}, ..., i_0)
d-dimensional k-ary mesh: N = k^d
– k = N^(1/d)
– described by a d-vector of radix-k coordinates (a small addressing sketch follows)
d-dimensional k-ary torus (or k-ary d-cube)?

[Figure: a 2D mesh, a 2D torus, and a 3D cube.]
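A small illustrative C sketch (my addition) of the radix-k addressing: converting a node id to its d-vector of coordinates and back, for k = 4, d = 3.

#include <stdio.h>

#define D 3   /* dimensions (illustrative) */
#define K 4   /* radix: nodes per dimension */

/* Node id -> d-vector of radix-K coordinates (i_{d-1}, ..., i_0). */
static void id_to_coords(int id, int coord[D])
{
    for (int dim = 0; dim < D; dim++) {
        coord[dim] = id % K;   /* coord[0] is i_0, the least significant digit */
        id /= K;
    }
}

/* Inverse mapping: coordinates back to the node id. */
static int coords_to_id(const int coord[D])
{
    int id = 0;
    for (int dim = D - 1; dim >= 0; dim--)
        id = id * K + coord[dim];
    return id;
}

int main(void)
{
    int c[D];
    id_to_coords(27, c);   /* 27 in radix 4 is (1, 2, 3) */
    printf("node 27 -> (i2,i1,i0) = (%d,%d,%d)\n", c[2], c[1], c[0]);
    printf("back to id: %d\n", coords_to_id(c));
    return 0;
}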

Page 31:

Packet Switching Strategies

Store and Forward (SF)
– move the entire packet one hop toward the destination
– buffer it until the next hop is permitted
Virtual Cut-Through and Wormhole
– pipeline the hops: the switch examines the header, decides where to send the message, and then starts forwarding it immediately
– Virtual Cut-Through: buffer the message on blockage
– Wormhole: leave the message spread through the network on blockage

Page 32:

SF vs WH (VCT) Switching

Unloaded latency: h(n/b + Δ) for store-and-forward vs n/b + hΔ for cut-through
– h: distance (number of hops)
– n: size of message
– b: bandwidth
– Δ: additional routing delay per hop (a numeric comparison follows)

[Figure: time-space diagrams of a 4-flit packet under store-and-forward routing, where each switch waits for the entire packet before forwarding, and cut-through routing, where the flits are pipelined from source to destination.]
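To make the two latency models concrete, here is a short C sketch with purely illustrative numbers (h, n, b, and Δ are made up for the example).

#include <stdio.h>

/* Unloaded latency models from the slide:
 *   store-and-forward: h * (n/b + delta)
 *   cut-through:       n/b + h * delta
 * h = hops, n = message size, b = bandwidth, delta = per-hop routing delay. */
int main(void)
{
    double h = 4, n = 1024, b = 128, delta = 0.5;   /* illustrative units */

    double t_sf = h * (n / b + delta);
    double t_ct = n / b + h * delta;

    printf("store-and-forward: %.1f\n", t_sf);   /* 4*(8+0.5) = 34.0 */
    printf("cut-through:       %.1f\n", t_ct);   /* 8 + 4*0.5 = 10.0 */
    return 0;
}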

Page 33:

Conventional Datacenter Network

Page 34:

Problems with the Architecture

Resource fragmentation:
– If an application grows and requires more servers, it cannot use available servers in other layer-2 domains, resulting in fragmentation and underutilization of resources
Poor server-to-server connectivity:
– Servers in different layer-2 domains have to communicate through the layer-3 portion of the network

See the papers in the reading list on Datacenter Network Design for proposed approaches

Page 35:

Parallel Programming for Performance

Page 36:

Steps in Creating a Parallel Program

4 steps: Decomposition, Assignment, Orchestration, Mapping
– Done by the programmer or by system software (compiler, runtime, ...)
– The issues are the same, so assume the programmer does it all explicitly

[Figure: a sequential computation is decomposed into tasks, the tasks are assigned to processes p0-p3, orchestration turns the processes into a parallel program, and mapping places them on processors P0-P3; decomposition and assignment together are called partitioning.]

Page 37:

Some Important Concepts

Task:
– Arbitrary piece of undecomposed work in a parallel computation
– Executed sequentially; concurrency is only across tasks
– Fine-grained versus coarse-grained tasks
Process (thread):
– Abstract entity that performs the tasks assigned to it
– Processes communicate and synchronize to perform their tasks
Processor:
– Physical engine on which a process executes
– Processes virtualize the machine to the programmer
• first write the program in terms of processes, then map it to processors

Page 38:

Decomposition

Break up the computation into tasks to be divided among processes
– Tasks may become available dynamically
– The number of available tasks may vary with time
Identify concurrency and decide the level at which to exploit it
Goal: enough tasks to keep processes busy, but not too many
– The number of tasks available at a time is an upper bound on the achievable speedup

Page 39:

Assignment

Specifying the mechanism to divide work among processes
– Together with decomposition, also called partitioning
– Balance the workload; reduce communication and management cost
Structured approaches usually work well
– Code inspection (parallel loops) or understanding of the application
– Well-known heuristics
– Static versus dynamic assignment
As programmers, we worry about partitioning first
– Usually independent of architecture or programming model
– But the cost and complexity of using primitives may affect decisions
As architects, we assume the program does a reasonable job of it

Page 40:

Orchestration

– Naming data
– Structuring communication
– Synchronization
– Organizing data structures and scheduling tasks temporally
Goals
– Reduce the cost of communication and synchronization as seen by processors
– Preserve locality of data reference (incl. data structure organization)
– Schedule tasks to satisfy dependences early
– Reduce the overhead of parallelism management
Closest to the architecture (and programming model & language)
– Choices depend a lot on the communication abstraction and the efficiency of the primitives
– Architects should provide appropriate primitives efficiently

Page 41:

Orchestration (cont'd)

Shared address space
– Shared and private data explicitly separate
– Communication implicit in access patterns
– No correctness need for data distribution
– Synchronization via atomic operations on shared data
– Synchronization explicit and distinct from data communication
Message passing (a minimal sketch follows)
– Data distribution among local address spaces needed
– No explicit shared structures (implicit in communication patterns)
– Communication is explicit
– Synchronization implicit in communication (at least in the synchronous case)
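As an illustrative C sketch (my addition, not from the slides) of the message-passing column above: communication is explicit, and the blocking receive doubles as synchronization. A minimal MPI exchange between ranks 0 and 1:

#include <mpi.h>
#include <stdio.h>

/* Explicit communication: rank 0 sends, rank 1 receives. The blocking
 * receive also synchronizes: rank 1 cannot proceed until the data arrives. */
int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}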

Page 42:

Mapping

After orchestration, we already have a parallel program
Two aspects of mapping:
– Which processes/threads will run on the same processor (core), if necessary
– Which process/thread runs on which particular processor (core)
• mapping to a network topology
One extreme: space-sharing
– Machine divided into subsets; only one application at a time per subset
– Processes can be pinned to processors, or left to the OS
Another extreme: leave resource management control to the OS
The real world is between the two
– The user specifies desires in some aspects; the system may ignore them
Usually we adopt the view: process <-> processor

Page 43:

Basic Trade-offs for Performance

Page 44:

Trade-offs

Load Balance
– fine-grain tasks
– random or dynamic assignment
Parallelism Overhead
– coarse-grain tasks
– simple assignment
Communication
– decompose to obtain locality
– recompute from local data
– big transfers: amortize overhead and latency
– small transfers: reduce overhead and contention

Page 45:

Load Balancing in HPC

Based on notes of James Demmel and David Culler

Page 46:

LB in Parallel and Distributed Systems

Load balancing problems differ in:

Task costs
– Do all tasks have equal costs?
– If not, when are the costs known?
• before starting, when the task is created, or only when the task ends
Task dependencies
– Can all tasks be run in any order (including in parallel)?
– If not, when are the dependencies known?
• before starting, when the task is created, or only when the task ends
Locality
– Is it important for some tasks to be scheduled on the same (or a nearby) processor to reduce communication cost?
– When is the information about communication between tasks known?

Page 47:

Task cost spectrum

Page 48:

Task Dependency Spectrum

Page 49:

Task Locality Spectrum (Data Dependencies)

Page 50:

Spectrum of Solutions

One of the key questions is when certain information about the load balancing problem is known.

This leads to a spectrum of solutions:

Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (offline algorithms)

Semi-static scheduling. Information may be known at program startup, at the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic.

Dynamic scheduling. Information is not known until mid-execution. (online algorithms)

Page 51:

Representative Approaches

Static load balancing
Semi-static load balancing
Self-scheduling
Distributed task queues
Diffusion-based load balancing
DAG scheduling
Mixed parallelism

Page 52:

Self-Scheduling

Basic ideas (a minimal sketch follows):
– Keep a centralized pool of tasks that are available to run
– When a processor completes its current task, it looks at the pool
– If the computation of one task generates more tasks, add them to the pool

It is useful when:
– there is a batch (or set) of tasks without dependencies
– the cost of each task is unknown
– locality is not important
– you are using a shared-memory multiprocessor, so a centralized pool of tasks is fine (how about on a distributed-memory system like a cluster?)
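A minimal shared-memory C sketch of self-scheduling (my addition; task counts and costs are simulated, and all names are illustrative): worker threads repeatedly pull the next task from a centralized, mutex-protected pool until it is empty.

/* Compile with: gcc -pthread selfsched.c -o selfsched */
#include <pthread.h>
#include <stdio.h>

#define NTASKS   16
#define NWORKERS 4

/* Centralized task pool: a shared counter protected by a mutex stands in
 * for the pool; each "task" just burns time proportional to its id. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;

static int get_task(void)
{
    pthread_mutex_lock(&lock);
    int t = (next_task < NTASKS) ? next_task++ : -1;   /* -1: pool is empty */
    pthread_mutex_unlock(&lock);
    return t;
}

static void *worker(void *arg)
{
    long id = (long)arg;
    int t;
    while ((t = get_task()) != -1) {
        volatile long spin = 100000L * (t + 1);   /* simulate unknown task cost */
        while (spin--) ;
        printf("worker %ld finished task %d\n", id, t);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

On a distributed-memory cluster (the slide's closing question), the centralized pool would itself become a communication bottleneck, which is what motivates the distributed task queues listed among the representative approaches.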

Page 53:

Cluster Management

Page 54:

Rocks Cluster Distribution: An Example

www.rocksclusters.org
Based on CentOS Linux
Mass installation is a core part of the system
– Mass re-installation for application-specific configuration
Front-end central server + compute & storage nodes
Rolls: collections of packages
– The Base roll includes: PBS (Portable Batch System), PVM (Parallel Virtual Machine), MPI (Message Passing Interface), job launchers, …
– Rolls ver. 5.1: support for virtual clusters, virtual front ends, virtual compute nodes

Page 55:

Microsoft HPC Server 2008: Another Example

Windows Server 2008 + clustering package
Systems management
– Management Console: plug-in to the System Center UI with support for Windows PowerShell
– RIS (Remote Installation Service)
Networking
– MS-MPI (Message Passing Interface)
– ICS (Internet Connection Sharing): NAT for cluster nodes
– Network Direct RDMA (Remote DMA)
Job scheduler
Storage: iSCSI SAN and SMB support
Failover support

Page 56:

Microsoft's Productivity Vision for HPC

Administrator
– Integrated turnkey HPC cluster solution
– Simplified setup and deployment
– Built-in diagnostics
– Efficient cluster utilization
– Integrates with IT infrastructure and policies

Application Developer
– Integrated tools for parallel programming
– Highly productive parallel programming frameworks
– Service-oriented HPC applications
– Support for key HPC development standards
– Unix application migration

End-User
– Seamless integration with workstation applications
– Integration with existing collaboration and workflow solutions
– Secure job execution and data access

Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users' existing skills and integrating with the tools they are already using.