
Talk based on material by Google

Block II: Cluster/Grid/Cloud Programming & The Message Passing Interfaces (MPI)

Clusters: History, Architectures, Programming Concepts, Scheduling, Components, Middleware, Single System Image, Resource Management, Programming Environments & Tools, Applications, Message Passing, Load-balancing, Distributed Shared-memory, Parallel I/O

Grids: History, Technologies, Programming Concepts, Grid Projects, Open Standards, Resource, Protocol, Network Enabled Service, API, SDK, Syntax, Hourglass Model, Grid Layers, The Globus Toolkit, Data Grid, Portals, Resource Managers, Scheduling, Security, Economy Patterns, Projects, proteomics.net


History: Remote Procedure Calls (RPC), Message Passing Interface (MPI)

Rajkumar Buyya

Taxonomy◦ based on how processors, memory & interconnect are laid out and how resources are managed: Massively Parallel Processors (MPP), Symmetric Multiprocessors (SMP), Cache-Coherent Non-Uniform Memory Access (CC-NUMA), Clusters, Distributed Systems – Grids/P2P

MPP◦ A large parallel processing system with a shared-nothing architecture◦ Consists of several hundred nodes with a high-speed interconnection network/switch◦ Each node consists of a main memory & one or more processors◦ Runs a separate copy of the OS

SMP◦ 2-64 processors today◦ Shared-everything architecture◦ All processors share all the global resources available◦ Single copy of the OS runs on these systems

CC-NUMA◦ a scalable multiprocessor system having a cache-coherent nonuniform memory access architecture◦ every processor has a global view of all of the memory

Clusters◦ a collection of workstations / PCs that are interconnected by a high-speed network◦ work as an integrated collection of resources◦ have a single system image spanning all its nodes

Distributed systems◦ considered conventional networks of independent computers◦ have multiple system images as each node runs its own OS◦ the individual machines could be combinations of MPPs, SMPs, clusters, & individual computers

Vector Computers (VC) - proprietary systems:◦ provided the breakthrough needed for the emergence of computational science, but they were only a partial answer.

Massively Parallel Processors (MPP) - proprietary systems:◦ high cost and a low performance/price ratio.

Symmetric Multiprocessors (SMP):◦ suffer from limited scalability.

Distributed Systems:◦ difficult to use and hard to extract parallel performance from.

Clusters - gaining popularity:◦ High Performance Computing - Commodity Supercomputing◦ High Availability Computing - Mission Critical Applications

ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research (SGI? Tera), Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar, Elxsi, ETA Systems, Evans & Sutherland Computer Division, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, Intl. Parallel Machines, KSR, MasPar, Meiko, Myrias, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Suprenum

[Image: Convex C4600]

• Network of Workstations

The promise of supercomputing to the average PC user?

Performance of PC/workstation components has almost reached the performance of those used in supercomputers…◦ Microprocessors (50% to 100% per year)◦ Networks (Gigabit SANs)◦ Operating Systems (Linux, ...)◦ Programming environments (MPI, …)◦ Applications (.edu, .com, .org, .net, .shop, .bank)

The rate of performance improvement of commodity systems is much more rapid than that of specialized systems.

◦ Linking together two or more computers to jointly solve computational problems

◦ Since the early 1990s, an increasing trend to move away from expensive and specialized proprietary parallel supercomputers towards clusters of workstations - it is hard to find the money to buy expensive systems

◦ The rapid improvement in the availability of commodity high-performance components for workstations and networks - low-cost commodity supercomputing

◦ From specialized traditional supercomputing platforms to cheaper, general purpose systems consisting of loosely coupled components built up from single or multiprocessor PCs or workstations

[Timeline figure: computing platforms from 1960 through the 1980s, 1990, 1995+ and 2000+, ending with PDAs and clusters]

A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource.

A node:◦ a single or multiprocessor system with memory, I/O facilities, & OS

A cluster:◦ generally 2 or more computers (nodes) connected together◦ in a single cabinet, or physically separated & connected via a LAN◦ appear as a single system to users and applications ◦ provide a cost-effective way to gain features and benefits

[Cluster architecture diagram: sequential and parallel applications run on top of a parallel programming environment and cluster middleware (single system image and availability infrastructure); the middleware spans multiple PC/workstation nodes, each with its own communications software and network interface hardware, all connected by the cluster interconnection network/switch.]

Commodity Parts? Communications Packaging? Incremental Scalability? Independent Failure? Intelligent Network Interfaces? Complete System on every node◦ virtual memory◦ scheduler◦ files◦ …

Nodes can be used individually or jointly...

Parallel Processing◦ Use multiple processors to build MPP/DSM-like systems for parallel computing

Network RAM◦ Use memory associated with each workstation as aggregate DRAM cache

Software RAID◦ Redundant array of inexpensive disks◦ Use the arrays of workstation disks to provide cheap, highly available and scalable file storage◦ Possible to provide parallel I/O support to applications

Multipath Communication◦ Use multiple networks for parallel data transfer between nodes

MPP: Massively Parallel Processing; DSM: Distributed Shared Memory

• Enhanced Performance (performance @ low cost)
• Enhanced Availability (failure management)
• Single System Image (look-and-feel of one system)
• Size Scalability (physical & application)
• Fast Communication (networks & protocols)
• Load Balancing (CPU, Net, Memory, Disk)
• Security and Encryption (clusters of clusters)
• Distributed Environment (social issues)
• Manageability (admin. and control)
• Programmability (simple API if required)
• Applicability (cluster-aware and non-aware app.)

Cluster Design Issues

High Performance (dedicated). High Throughput (idle cycle harvesting). High Availability (fail-over).

A Unified System – HP and HA within the same cluster

Shared Pool of Computing Resources: Processors, Memory, Disks

Interconnect

Guarantee at least one workstation to many individuals (when active)

Deliver a large % of collective resources to a few individuals at any one time

• Best of both Worlds: (world is heading towards this configuration)

Work queues allow threads from one task to send processing work to another task in a decoupled fashion

[Diagram: multiple producer threads (P) put work items into a shared queue; multiple consumer threads (C) take work items from it.]

To make this work in a distributed setting, we would like this to simply “happen over the network”

[Diagram: the same producers (P) and consumers (C), now running on separate machines, share the queue over the network.]

Where does the queue live? How do you access it? (A custom protocol? A generic memory-sharing protocol?) How do you guarantee that it doesn't become a bottleneck / source of deadlock?

... Some well-defined solutions exist to support inter-machine programming, which we'll see next

Regular client-server protocols involve sending data back and forth according to a shared state

Client: GET index.html HTTP/1.0
Server: 200 OK
        Length: 2400
        (file data)

Client: GET hello.gif HTTP/1.0
Server: 200 OK
        Length: 81494

RPC servers will call arbitrary functions in a DLL or EXE, with arguments passed over the network, and return values sent back over the network

Client: foo.dll, bar(4, 10, “hello”)
Server: “returned_string”

Client: foo.dll, baz(42)
Server: err: no such function

RPC can be used with two basic interfaces: synchronous and asynchronous

Synchronous RPC is a “remote function call” – the client blocks and waits for the return value

Asynchronous RPC is a “remote thread spawn”

[Diagram: asynchronous RPC over time. The client calls h = Spawn(server_name, “foo.dll”, long_runner, x, y, …) and keeps running more code; on the server, the RPC dispatcher invokes long_runner(x, y) from foo.dll, which eventually returns a GiantObject; later the client calls GiantObject myObj = Sync(h) to block and collect the result.]

Writing rpc_call(foo.dll, bar, arg0, arg1..) is poor form◦ Confusing code◦ Breaks abstraction

A wrapper “stub” function makes the code cleaner:

bar(arg0, arg1); // programmer writes this; the stub makes the RPC “under the hood”
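To make the stub idea concrete, here is a minimal C sketch. The rpc_call() transport below is only a local stand-in for the slide's hypothetical RPC layer (not a real library API); the point is that callers of bar() never see it:

#include <stdio.h>

/* Stand-in transport: in a real system this would marshal the arguments,
 * send them to the server, and return the server's reply. */
static const char *rpc_call(const char *module, const char *function,
                            int a, int b, const char *msg)
{
    printf("[rpc_call] %s,%s(%d, %d, \"%s\")\n", module, function, a, b, msg);
    return "returned_string";   /* pretend reply from the server */
}

/* The stub: callers just write bar(4, 10, "hello"); the stub hides the
 * module name and makes the RPC "under the hood". */
static const char *bar(int a, int b, const char *msg)
{
    return rpc_call("foo.dll", "bar", a, b, msg);
}

int main(void)
{
    printf("%s\n", bar(4, 10, "hello"));   /* reads like a local call */
    return 0;
}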

Who can call RPC functions? Anybody? How do you handle multiple versions of a function? Need to marshal objects. How do you handle error conditions? Numerous protocols: DCOM, CORBA, Java RMI…

“Imagine a Beowulf cluster of these…”-- common Slashdot meme

Traditional cluster computing involves explicitly forming a cluster from computer nodes and dispatching jobs

Beowulf is a style of system that links Linux machines together

MPI (Message Passing Interface) describes an API for allowing programs to communicate with their parallel components

Makes a cluster of computers present a single computer interface

One computer is the “master”◦ Starts tasks◦ User terminal / external network is connected to this machine

Several “worker” nodes form the backend; they are not usually individually accessed

Runs on commodity PCs. Uses a standard Ethernet network (though faster networks can be used too). Open-source software.

Beowulf is an architecture style◦ It is not itself an explicit library

Client nodes are set up in a very dumb fashion◦ Use NFS to share the file system with the master

The user starts programs on the master machine; scripts use rsh to invoke subprograms on worker nodes

If you need several totally isolated jobs done in parallel, the above is all you need

Most systems require more inter-thread communication than Beowulf offers

Special libraries make this easier

MPI is an API that allows programs running on multiple computers to interoperate

MPI itself is a standard; implementations of it exist in C and Fortran

Provides synchronization and communication operations to processes

Messages are sequences of bytes moving between processes

The sender and receiver must agree on the type structure of values in the message

“Marshalling”: data layout so that there is no ambiguity such as “four chars” v. “one integer”.

Mateti, Linux Clusters
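A minimal C sketch of that type-structure agreement in MPI (assuming an MPI implementation such as MPICH or Open MPI, launched with two ranks, e.g. mpirun -np 2): both sides commit the same description of a struct's layout before exchanging it, so there is no "four chars vs. one integer" ambiguity:

#include <mpi.h>
#include <stddef.h>   /* offsetof */
#include <stdio.h>

struct sample { int id; double value; char tag[8]; };

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Both processes build and commit the same layout description. */
    int          lens[3]  = { 1, 1, 8 };
    MPI_Aint     disps[3] = { offsetof(struct sample, id),
                              offsetof(struct sample, value),
                              offsetof(struct sample, tag) };
    MPI_Datatype types[3] = { MPI_INT, MPI_DOUBLE, MPI_CHAR };
    MPI_Datatype sample_type;
    MPI_Type_create_struct(3, lens, disps, types, &sample_type);
    MPI_Type_commit(&sample_type);

    if (rank == 0) {
        struct sample s = { 42, 3.14, "hello" };
        MPI_Send(&s, 1, sample_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        struct sample s;
        MPI_Recv(&s, 1, sample_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received id=%d value=%g tag=%s\n", s.id, s.value, s.tag);
    }

    MPI_Type_free(&sample_type);
    MPI_Finalize();
    return 0;
}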

Process A sends a data buffer as a message to process B.

Process B waits for a message from A, and when it arrives copies it into its own local memory.

No memory shared between A and B.


Obviously,◦ Messages cannot be received before they are sent.◦ A receiver waits until there is a message.

Asynchronous◦ Sender never blocks, even if infinitely many messages are waiting to be received◦ Semi-asynchronous is a practical version of the above, with a large but finite amount of buffering


Q: send(m, P)◦ Send message m to process P

P: recv(x, Q)◦ Receive a message from process Q, and place it in variable x◦ The message data: the type of x must match that of m◦ As if x := m
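In MPI, send(m, P) and recv(x, Q) map onto MPI_Send and MPI_Recv. A minimal C sketch for two ranks (assuming a launch like mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                  /* process Q: send(m, P) */
        int m = 123;
        MPI_Send(&m, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {           /* process P: recv(x, Q) */
        int x = 0;
        /* Blocking receive: completes only when a matching message
         * has arrived, after which x holds a copy of m (x := m). */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("x = %d\n", x);        /* prints 123 */
    }

    MPI_Finalize();
    return 0;
}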


One sender Q, multiple receivers P. Not all receivers may receive at the same time.

Q: broadcast(m)◦ Send message m to all processes

P: recv(x, Q)◦ Receive message from process Q, and place it in variable x
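In MPI this 1:n pattern is a collective call: every process in the group calls MPI_Bcast, and the root's buffer is copied into everyone else's. A minimal C sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m = (rank == 0) ? 42 : 0;     /* only the root (Q) has the value */

    /* Collective call: every process participates; afterwards each
     * process's m holds the root's value. */
    MPI_Bcast(&m, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d has m = %d\n", rank, m);   /* every rank prints 42 */

    MPI_Finalize();
    return 0;
}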


Sender blocks until receiver is ready to receive.

Cannot send messages to self. No buffering.


Sender never blocks. Receiver receives when ready. Can send messages to self. Infinite buffering.


Speed not so good◦ Sender copies message into system buffers.◦ Message travels the network.◦ Receiver copies message from system buffers into local memory.◦ Special virtual memory techniques help.

Programming Quality◦ less error-prone cf. shared memory


User explicitly spawns child processes to do work

MPI library aware of the size of the “universe” – the number of available machines

MPI system will spawn processes on different machines◦ Do not need to be the same executable
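A rough C sketch of that spawning step, using MPI-2's MPI_Comm_spawn; the "worker" executable name here is only an illustrative placeholder, and the children need not be the same program as the parent:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm workers;                 /* intercommunicator to the children */
    int errcodes[4];

    /* Ask the MPI runtime to start 4 copies of "worker" (placeholder
     * executable name); they may run on different machines. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &workers, errcodes);

    int nworkers;
    MPI_Comm_remote_size(workers, &nworkers);
    printf("spawned %d worker processes\n", nworkers);

    MPI_Finalize();
    return 0;
}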

MPI programs define a “Window” of a certain size as a shared memory region

Multiple processes attach to the window◦ Get() and Put() primitives copy data into and out of the shared memory asynchronously◦ Fence() blocks until all users of the window reach the fence, at which point their shared memories are consistent◦ The user is responsible for ensuring that stale data is not read from the shared memory buffer
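A minimal C sketch of this one-sided style with MPI's window primitives (MPI_Win_create, MPI_Put, MPI_Win_fence); the two fences delimit an epoch, after which the put data is guaranteed to be visible at the target. Run with at least two ranks:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = -1;                   /* memory exposed through the window */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int value = 99;                   /* data rank 0 will put into rank 1 */

    MPI_Win_fence(0, win);            /* open the access epoch */
    if (rank == 0)
        /* Put one int into rank 1's window at displacement 0. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);            /* close the epoch: data now visible */

    if (rank == 1)
        printf("rank 1 sees %d\n", local);   /* prints 99 */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}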

Supports intuitive notion of “barriers” with Fence()

Mutual exclusion locks also supported◦ Library ensures that multiple machines cannot access the lock at the same time◦ Ensuring that failed nodes cannot deadlock an entire distributed process will increase system complexity

Basic communication unit in MPI is a message – a piece of data sent from one machine to another

MPI provides message-sending and receiving functions that allow processes to exchange messages in a thread-safe fashion over the network

Also includes multi-party messages...

1:n broadcast – one process sends a message to all processes in a group

n:1 reduce – all processes in a group send data to a designated process which merges the data

n:n messaging communication also supported

• One process in a group can send a message which all group members receive (e.g., a global “stop processing” signal)

• Processes in a group can all report data together (asynchronously) which is gathered into a single message reported to one process (e.g., reporting results of a distributed computation)

• Combination of above paradigms; individual processes contribute components to a global message which reaches all group members
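A minimal C sketch of the reduce and gather patterns above: every rank contributes a partial result, and rank 0 receives both the merged sum and the individual contributions:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int partial = rank + 1;           /* each rank's local result */

    /* n:1 reduce: rank 0 receives the merged (summed) result. */
    int sum = 0;
    MPI_Reduce(&partial, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* n:1 gather: rank 0 receives every rank's individual value. */
    int *all = (rank == 0) ? malloc(size * sizeof(int)) : NULL;
    MPI_Gather(&partial, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("reduced sum = %d\n", sum);
        for (int i = 0; i < size; i++)
            printf("rank %d contributed %d\n", i, all[i]);
        free(all);
    }

    MPI_Finalize();
    return 0;
}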

Programmers have very explicit control over data manipulation; allows high performance applications

The trade-off is a steep learning curve. Systems such as MapReduce have a considerably lower learning curve (but cannot handle as complex system interactions).

Generic RPC and shared-memory libraries allow flexible definition of software systems

Require programmers to think hard about how the network is involved in the process

Systems such as MapReduce (next lecture) automate much of the lower-level inter-machine communication, in exchange for some inflexibility of design