1
Cluster Computing
Cheng-Zhong Xu
2
Outline
Cluster Computing Basics
– Multicore architecture
– Cluster Interconnect
Parallel Programming for Performance
MapReduce Programming
Systems Management
3
What’s a Cluster?
Broadly, a group of networked autonomous computers that work together to form a single machine in many respects:
– To improve performance (speed)
– To improve throughput
– To improve service availability (high-availability clusters)
Built from commercial off-the-shelf components, the system is often more cost-effective than a single machine of comparable speed or availability
4
Highly Scalable Clusters
High Performance Cluster (aka Compute Cluster)
– A form of parallel computer that aims to solve problems faster by using multiple compute nodes
– For parallel efficiency, the nodes are often closely coupled in a high-throughput, low-latency network
Server Cluster and Datacenter
– Aims to improve the system's throughput, service availability, power consumption, etc. by using multiple nodes
5
Top500 Installation of Supercomputers
Top500.com
6
Clusters in Top500
7
An Example of Top500 Submission (F’08)
Location: Tukwila, WA
Hardware – Machines: 256 machines, each with two quad-core Intel 5320 Clovertown 1.86 GHz CPUs and 8 GB RAM
Hardware – Networking: private & public: Broadcom GigE; MPI: Cisco InfiniBand SDR, 34 IB switches in leaf/node configuration
Number of compute nodes: 256
Total number of cores: 2048
Total memory: 2 TB of RAM
Particulars for the current Linpack runs:
– Best Linpack result: 11.75 TFLOPS
– Best cluster efficiency: 77.1%
For comparison…
– Linpack rating from the June 2007 Top500 run (#106) on the same hardware: 8.99 TFLOPS
– Cluster efficiency from the June 2007 Top500 run (#106) on the same hardware: 59%
– Typical Top500 efficiency for Clovertown motherboards w/ IB, regardless of operating system: 65-77% (2 instances of 79%)
30% improvement in efficiency on the same hardware; about one hour to deploy
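The 77.1% figure can be reproduced from the hardware specs above. A quick sketch in Python (assuming 4 double-precision FLOPs per cycle per core, the usual figure for the Clovertown microarchitecture; efficiency = Rmax / Rpeak):

```python
# Linpack efficiency check: Rmax / Rpeak for the cluster above.
cores = 256 * 2 * 4              # 256 nodes x 2 CPUs x 4 cores = 2048
clock_hz = 1.86e9                # 1.86 GHz Clovertown
flops_per_cycle = 4              # assumed: 2-wide SSE, multiply + add per cycle

rpeak = cores * clock_hz * flops_per_cycle   # theoretical peak
rmax = 11.75e12                              # best Linpack result above

print(f"Rpeak = {rpeak / 1e12:.2f} TFLOPS")  # ~15.24 TFLOPS
print(f"efficiency = {rmax / rpeak:.1%}")    # ~77.1%, matching the table
```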
8
Beowulf Cluster
A cluster of inexpensive PCs for low-cost personal supercomputing
Based on commodity off-the-shelf components:
– PC computers running a Unix-like OS (BSD, Linux, or OpenSolaris)
– Interconnected by an Ethernet LAN
Head node, plus a group of compute nodes
– Head node controls the cluster, and serves files to the compute nodes
Standard, free and open-source software
– Programming in MPI
– MapReduce
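For a taste of the MPI style mentioned above, here is a minimal sketch using the mpi4py binding (my choice of binding; the slides don't prescribe one). It would be launched with something like mpiexec -n 4 python hello.py:

```python
# Minimal MPI program: every process reports in; rank 0 gathers and prints.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()            # this process's id in the communicator
size = comm.Get_size()            # total number of processes

msg = f"hello from rank {rank} of {size}"
msgs = comm.gather(msg, root=0)   # collective: collect all messages at rank 0

if rank == 0:
    for m in msgs:
        print(m)
```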
9
Why Clustering Today
Powerful nodes (CPU, memory, storage)
– Today's PC is yesterday's supercomputer
– Multi-core processors
High-speed networks
– Gigabit Ethernet (56% in Top500 as of Nov 2008)
– InfiniBand System Area Network (SAN) (24.6%)
Standard tools for parallel/distributed computing and their growing popularity
– MPI, PBS, etc.
– MapReduce for data-intensive computing
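The MapReduce model can be sketched in a few lines of plain Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase combines each group. This is a toy single-process illustration of the programming model, not a distributed implementation:

```python
# Toy MapReduce word count: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(doc):
    return [(word, 1) for word in doc.split()]   # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)                # group values by key
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))              # {'the': 2, 'quick': 1, ...}
```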
10
Major Issues in Cluster Design
Programmability
– Sequential vs parallel programming
– MPI, DSM, DSA: hybrids of multithreading and MPI
– MapReduce
Cluster-aware resource management
– Job scheduling (e.g., PBS)
– Load balancing, data locality, communication optimization, etc.
System management
– Remote installation, monitoring, diagnosis
– Failure management, power management, etc.
11
Cluster Architecture
Multi-core node architecture Cluster Interconnect
12
Single-core computer
13
Single-core CPU chip
14
Multicore Architecture
Combines two or more independent cores (normally CPUs) into a single package
Supports multitasking and multithreading in a single physical package
15
Multicore is Everywhere
Dual-core commonplace in laptops
Quad-core in desktops
Dual quad-core in servers
All major chip manufacturers produce multicore CPUs
– Sun Niagara (8 cores, 64 concurrent threads)
– Intel Xeon (6 cores)
– AMD Opteron (4 cores)
16
Multithreading on multi-core
David Geer, IEEE Computer, 2007
17
Interaction with the OS
OS perceives each core as a separate processor
OS scheduler maps threads/processes to different cores
Most major OSes support multi-core today: Windows, Linux, Mac OS X, …
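Because the OS exposes each core as a schedulable processor, a program can simply size a worker pool to the core count and let the scheduler spread the work. A minimal sketch (the work function is illustrative):

```python
# Spread CPU-bound work across all the cores the OS exposes.
import os
from concurrent.futures import ProcessPoolExecutor

def busy_square(n):
    return n * n                 # stand-in for real CPU-bound work

if __name__ == "__main__":
    cores = os.cpu_count()       # cores the OS scheduler can map onto
    with ProcessPoolExecutor(max_workers=cores) as pool:
        results = list(pool.map(busy_square, range(16)))
    print(f"{cores} cores -> {results}")
```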
18
Cluster Interconnect
Network fabric connecting the compute nodes
Objective is to strike a balance between
– Processing power of the compute nodes
– Communication ability of the interconnect
A more specialized LAN, providing many opportunities for performance optimization
– Switch in the core
– Latency vs bandwidth
[Diagram: switch internals, with receivers and input buffers on the input ports, a crossbar, control logic for routing and scheduling, and output buffers and transmitters on the output ports]
19
Goal: Bandwidth and Latency
[Plots: latency vs delivered bandwidth, rising steeply toward saturation; delivered bandwidth vs offered bandwidth, leveling off at saturation]
20
Ethernet Switch: allows multiple simultaneous transmissions
hosts have dedicated, direct connections to the switch
switches buffer packets
Ethernet protocol used on each incoming link, but no collisions; full duplex
– each link is its own collision domain
switching: A-to-A' and B-to-B' simultaneously, without collisions
– not possible with a dumb hub
[Diagram: switch with six interfaces (1-6), one per host: A, A', B, B', C, C']
21
Switch Table
Q: how does the switch know that A' is reachable via interface 4, and B' via interface 5?
A: each switch has a switch table; each entry:
– (MAC address of host, interface to reach host, time stamp)
looks like a routing table!
Q: how are entries created and maintained in the switch table?
– something like a routing protocol?
[Diagram: the same switch with six interfaces (1-6) and hosts A, A', B, B', C, C']
22
Switch: self-learning
switch learns which hosts can be reached through which interfaces
– when a frame is received, the switch "learns" the location of the sender: the incoming LAN segment
– records the sender/location pair in the switch table
[Diagram: A sends a frame (source: A, dest: A') into interface 1; the switch table, initially empty, gains the entry (MAC addr A, interface 1, TTL 60)]
23
Self-learning, forwarding: example
[Diagram: A sends a frame (source: A, dest: A') into interface 1; the table, initially empty, gains the entry (A, interface 1, TTL 60)]
frame destination unknown: flood
destination A location known: selective send
[When A' replies, the frame for A is sent selectively on interface 1 only, and the table gains the entry (A', interface 4, TTL 60)]
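The learn/flood/forward logic of the last few slides fits in a few lines; here is a toy simulation (TTL aging is omitted, and the class name is illustrative):

```python
# Self-learning switch: learn the sender's port, then forward selectively
# if the destination is known, otherwise flood all other ports.
class Switch:
    def __init__(self, num_ports):
        self.table = {}                 # MAC address -> interface
        self.ports = range(1, num_ports + 1)

    def receive(self, src, dst, in_port):
        self.table[src] = in_port       # learn the sender's location
        if dst in self.table:
            return [self.table[dst]]    # selective send
        return [p for p in self.ports if p != in_port]   # flood

sw = Switch(6)
print(sw.receive("A", "A'", 1))   # A' unknown: flood ports [2, 3, 4, 5, 6]
print(sw.receive("A'", "A", 4))   # A known: forward on port [1] only
```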
24
Interconnecting switches
Switches can be connected together
Q: sending from A to G, how does S1 know to forward a frame destined to G via S4 and S3?
A: self-learning! (works exactly the same as in the single-switch case!)
Q: latency and bandwidth for a large-scale network?
[Diagram: four interconnected switches S1-S4 with hosts A-I attached across them]
25
What characterizes a network?
Topology (what)
– physical interconnection structure of the network graph
– regular vs irregular
Routing Algorithm (which)
– restricts the set of paths that messages may follow
– table-driven, or routing-algorithm based
Switching Strategy (how)
– how the data in a message traverses its route
– store-and-forward vs cut-through
Flow Control Mechanism (when)
– when a message, or portions of it, traverses its route
– what happens when traffic is encountered?
The interplay of all of these determines performance
26
Tree: An Example
Diameter and average distance are logarithmic
– k-ary tree, height d = log_k N
– address specified by a d-vector of radix-k coordinates describing the path down from the root
Fixed degree
Route up to the common ancestor, then down
– R = B xor A
– let i be the position of the most significant 1 in R; route up i+1 levels
– go down in the direction given by the low i+1 bits of B
Bandwidth and bisection BW?
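For the binary case (k = 2), the XOR rule above translates directly into code. A sketch (leaf addresses are integers; `tree_route` is an illustrative name):

```python
# Route in a binary tree: climb to the lowest common ancestor, then descend.
def tree_route(a, b):
    r = a ^ b                  # R = B xor A: the differing address bits
    i = r.bit_length() - 1     # position of the most significant 1 in R
    up = i + 1                 # climb i+1 levels to the common ancestor
    down = [(b >> lvl) & 1 for lvl in range(i, -1, -1)]   # low i+1 bits of B
    return up, down

# Leaves 5 (0b101) and 6 (0b110): R = 0b011, so up 2 levels,
# then descend right (1), then left (0) to reach leaf 6.
print(tree_route(5, 6))        # (2, [1, 0])
```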
27
Bandwidth
– Point-to-point bandwidth
– Bisection bandwidth of the interconnect fabric: the rate at which data can be sent across an imaginary line dividing the cluster into two halves, each with an equal number of nodes
For a switch with N ports,
– If it is non-blocking, the bisection bandwidth = N × the point-to-point bandwidth
– An oversubscribed switch delivers less bisection bandwidth than a non-blocking one, but is cost-effective. It scales the bandwidth per node up to a point, after which an increase in the number of nodes decreases the available bandwidth per node
– Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth
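A worked example of these definitions (the numbers are illustrative, not from the slides):

```python
# Bisection bandwidth and oversubscription for an illustrative edge switch:
# 64 hosts, 4 uplinks toward the rest of the fabric, all links 10 Gb/s.
link_gbps = 10
hosts = 64
uplinks = 4

host_bw = hosts * link_gbps          # worst-case aggregate demand: 640 Gb/s
bisection_bw = uplinks * link_gbps   # what can cross the bisection: 40 Gb/s

print(f"oversubscription = {host_bw / bisection_bw:.0f}:1")     # 16:1
print(f"per-host bw under full load = {bisection_bw / hosts} Gb/s")  # 0.625
```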
28
How to Maintain Constant BW per Node?
Limited ports in a single switch
– Use multiple switches
The link between a pair of switches can become a bottleneck
– Fast uplink
How to organize multiple switches?
– Irregular topology
– Regular topologies: ease of management
29
Scalable Interconnect: Examples
[Diagrams: a 16-node butterfly built from 2x2 switch building blocks, and a fat tree]
30
Multidimensional Meshes and Tori
d-dimensional array
– n = k_{d-1} × … × k_0 nodes
– described by a d-vector of coordinates (i_{d-1}, …, i_0)
d-dimensional k-ary mesh: N = k^d
– k = N^{1/d}
– described by a d-vector of radix-k coordinates
d-dimensional k-ary torus (or k-ary d-cube)?
[Diagrams: 2D mesh, 2D torus, 3D cube]
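The radix-k coordinate vector is just the base-k expansion of the node number; a small sketch:

```python
# Node id <-> d-vector of radix-k coordinates in a k-ary d-dimensional mesh.
def to_coords(node, k, d):
    coords = []
    for _ in range(d):
        coords.append(node % k)    # peel off least significant digit
        node //= k
    return coords[::-1]            # (i_{d-1}, ..., i_0)

def to_node(coords, k):
    node = 0
    for digit in coords:
        node = node * k + digit
    return node

# 4-ary 3-cube (N = 4**3 = 64 nodes):
print(to_coords(37, 4, 3))    # [2, 1, 1] since 37 = 2*16 + 1*4 + 1
print(to_node([2, 1, 1], 4))  # 37
```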
31
Packet Switching Strategies
Store and Forward (SF)
– move the entire packet one hop toward the destination
– buffer till the next hop is permitted
Virtual Cut-Through and Wormhole
– pipeline the hops: the switch examines the header, decides where to send the message, and then starts forwarding it immediately
– Virtual Cut-Through: buffer on blockage
– Wormhole: leave the message spread through the network on blockage
32
SF vs WH (VCT) Switching
Unloaded latency: h(n/b + Δ) for SF vs n/b + hΔ for cut-through
– h: distance (hops)
– n: size of message
– b: bandwidth
– Δ: additional routing delay per hop
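Plugging illustrative numbers into the two formulas shows why cut-through wins on multi-hop paths:

```python
# Unloaded latency: store-and-forward vs cut-through.
def sf_latency(h, n, b, delta):
    return h * (n / b + delta)    # whole packet is buffered at every hop

def ct_latency(h, n, b, delta):
    return n / b + h * delta      # header routed ahead, data pipelined

# 4 hops, 1 KB message, 1 GB/s links, 100 ns routing delay per hop:
h, n, b, delta = 4, 1024, 1e9, 100e-9
print(f"SF: {sf_latency(h, n, b, delta) * 1e6:.2f} us")   # ~4.50 us
print(f"CT: {ct_latency(h, n, b, delta) * 1e6:.2f} us")   # ~1.42 us
```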
[Figure: time-space diagram of a 4-flit packet (flits 0-3) crossing three hops from source to destination, contrasting store-and-forward routing with cut-through routing]
33
Conventional Datacenter Network
34
Problems with the Architecture
Resource fragmentation:
– If an application grows and requires more servers, it cannot use available servers in other layer-2 domains, resulting in fragmentation and underutilization of resources
Poor server-to-server connectivity:
– Servers in different layer-2 domains must communicate through the layer-3 portion of the network
See the papers in the reading list on Datacenter Network Design for proposed approaches
35
Parallel Programming for Performance
36
Steps in Creating a Parallel Program
4 steps: Decomposition, Assignment, Orchestration, Mapping
– Done by the programmer or by system software (compiler, runtime, ...)
– Issues are the same, so assume the programmer does it all explicitly
[Figure: a sequential computation is decomposed into tasks; the tasks are assigned to processes p0-p3 (decomposition + assignment = partitioning); orchestration yields the parallel program; mapping places the processes on processors P0-P3]
37
Some Important Concepts
Task:
– Arbitrary piece of undecomposed work in a parallel computation
– Executed sequentially; concurrency is only across tasks
– Fine-grained versus coarse-grained tasks
Process (thread):
– Abstract entity that performs the tasks assigned to it
– Processes communicate and synchronize to perform their tasks
Processor:
– Physical engine on which a process executes
– Processes virtualize the machine to the programmer
• first write the program in terms of processes, then map to processors
38
Decomposition
Break up computation into tasks to be divided among processes
– Tasks may become available dynamically
– No. of available tasks may vary with time
Identify concurrency and decide level at which to exploit it
Goal: Enough tasks to keep processes busy, but not too many
– No. of tasks available at a time is upper bound on achievable speedup
39
Assignment
Specifying the mechanism to divide work among processes
– Together with decomposition, also called partitioning
– Balance workload, reduce communication and management cost
Structured approaches usually work well
– Code inspection (parallel loops) or understanding of the application
– Well-known heuristics
– Static versus dynamic assignment
As programmers, we worry about partitioning first
– Usually independent of architecture or programming model
– But the cost and complexity of using primitives may affect decisions
As architects, we assume the program does a reasonable job of it
40
Orchestration
– Naming data
– Structuring communication
– Synchronization
– Organizing data structures and scheduling tasks temporally
Goals
– Reduce the cost of communication and synchronization as seen by processors
– Preserve locality of data reference (incl. data structure organization)
– Schedule tasks to satisfy dependences early
– Reduce the overhead of parallelism management
Closest to architecture (and programming model & language)
– Choices depend a lot on the communication abstraction and the efficiency of primitives
– Architects should provide appropriate primitives efficiently
41
Orchestration (cont’)
Shared address space– Shared and private data explicitly separate
– Communication implicit in access patterns
– No correctness need for data distribution
– Synchronization via atomic operations on shared data
– Synchronization explicit and distinct from data communication
Message passing– Data distribution among local address spaces needed
– No explicit shared structures (implicit in comm. patterns)
– Communication is explicit
– Synchronization implicit in communication (at least in synch. Case)
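The two models can be contrasted in a toy Python sketch: in the shared-address-space half, communication is implicit (both sides touch `total`) and synchronization is the explicit lock; in the message-passing half, `put`/`get` are explicit communication and the blocking `get` is the implicit synchronization. All names are illustrative:

```python
import threading, queue

# Shared address space: implicit communication, explicit synchronization.
total = 0
lock = threading.Lock()

def add_shared(vals):
    global total
    for v in vals:
        with lock:               # explicit synchronization on shared data
            total += v

# Message passing: explicit communication, implicit synchronization.
def producer(q, vals):
    q.put(sum(vals))             # explicit send

def consumer(q, results):
    results.append(q.get())      # blocking receive doubles as synchronization

q, results = queue.Queue(), []
t1 = threading.Thread(target=add_shared, args=([1, 2, 3],))
t2 = threading.Thread(target=producer, args=(q, [4, 5, 6]))
t3 = threading.Thread(target=consumer, args=(q, results))
for t in (t1, t2, t3): t.start()
for t in (t1, t2, t3): t.join()
print(total, results)            # 6 [15]
```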
42
Mapping
After orchestration, we already have a parallel program
Two aspects of mapping:
– Which processes/threads will run on the same processor (core), if necessary
– Which process/thread runs on which particular processor (core)
• mapping to a network topology
One extreme: space-sharing
– Machine divided into subsets; only one app at a time in a subset
– Processes can be pinned to processors, or left to the OS
Another extreme: leave resource management control to the OS
The real world is between the two
– User specifies desires in some aspects; the system may ignore them
Usually adopt the view: process <-> processor
43
Basic Trade-offs for Performance
44
Trade-offs
Load Balance
– fine-grain tasks
– random or dynamic assignment
Parallelism Overhead
– coarse-grain tasks
– simple assignment
Communication
– decompose to obtain locality
– recompute from local data
– big transfers: amortize overhead and latency
– small transfers: reduce overhead and contention
45
Load Balancing in HPC
Based on notes of James Demmel and David Culler
46
LB in Parallel and Distributed Systems
Load balancing problems differ in:
Task costs
– Do all tasks have equal costs?
– If not, when are the costs known?
• before starting, when the task is created, or only when the task ends
Task dependencies
– Can all tasks be run in any order (including in parallel)?
– If not, when are the dependencies known?
• before starting, when the task is created, or only when the task ends
Locality
– Is it important for some tasks to be scheduled on the same (or a nearby) processor to reduce communication cost?
– When is the information about communication between tasks known?
47
Task cost spectrum
48
Task Dependency Spectrum
49
Task Locality Spectrum (Data Dependencies)
50
Spectrum of Solutions
One of the key questions is when certain information about the load balancing problem is known.
This leads to a spectrum of solutions:
Static scheduling: all information is available to the scheduling algorithm, which runs before any real computation starts (offline algorithms).
Semi-static scheduling: information may be known at program startup, at the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic.
Dynamic scheduling: information is not known until mid-execution (online algorithms).
51
Representative Approaches
Static load balancing Semi-static load balancing Self-scheduling Distributed task queues Diffusion-based load balancing DAG scheduling Mixed Parallelism
52
Self-Scheduling
Basic ideas:
– Keep a centralized pool of tasks that are available to run
– When a processor completes its current task, it takes another from the pool
– If the computation of one task generates more tasks, add them to the pool
It is useful when
– there is a batch (or set) of tasks without dependencies
– the cost of each task is unknown
– locality is not important
– using a shared-memory multiprocessor, so a centralized pool of tasks is fine (how about on a distributed-memory system like a cluster?)
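A minimal self-scheduling sketch for the shared-memory case: a centralized queue, idle workers pulling the next task, and finished tasks possibly pushing new tasks back (the task rule here is made up purely for illustration):

```python
# Self-scheduling: workers pull from a central pool; finishing a task
# may generate new tasks, which go back into the pool.
import queue, threading

pool = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        n = pool.get()
        if n is None:              # sentinel: no more work
            break
        with lock:
            results.append(n * n)  # stand-in for real computation
        if n > 1:
            pool.put(n // 2)       # the computation generates a new task
        pool.task_done()

for n in (8, 5):                   # seed the pool
    pool.put(n)
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
pool.join()                        # wait until the pool is drained
for _ in threads: pool.put(None)   # release the workers
for t in threads: t.join()
print(sorted(results))             # [1, 1, 4, 4, 16, 25, 64]
```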
53
Cluster Management
54
Rocks Cluster Distribution: An Example
www.rocksclusters.org
Based on CentOS Linux
Mass installation is a core part of the system
– Mass re-installation for application-specific configuration
Front-end central server + compute & storage nodes
Rolls: collections of packages
– Base roll includes: PBS (portable batch system), PVM (parallel virtual machine), MPI (message passing interface), job launchers, …
– Rolls ver. 5.1: support for virtual clusters, virtual front ends, virtual compute nodes
55
Microsoft HPC Server 2008: Another Example
Windows Server 2008 + clustering package
Systems Management
– Management Console: plug-in to the System Center UI with support for Windows PowerShell
– RIS (Remote Installation Service)
Networking
– MS-MPI (Message Passing Interface)
– ICS (Internet Connection Sharing): NAT for cluster nodes
– Network Direct RDMA (Remote DMA)
Job scheduler
Storage: iSCSI SAN and SMB support
Failover support
Microsoft’s Productivity Vision for HPC
Administrator
– Integrated turnkey HPC cluster solution
– Simplified setup and deployment
– Built-in diagnostics
– Efficient cluster utilization
– Integrates with IT infrastructure and policies
Application Developer
– Integrated tools for parallel programming
– Highly productive parallel programming frameworks
– Service-oriented HPC applications
– Support for key HPC development standards
– Unix application migration
End-User
– Seamless integration with workstation applications
– Integration with existing collaboration and workflow solutions
– Secure job execution and data access
Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users' existing skills and integrating with the tools they are already using.