lec3 cloud computing
TRANSCRIPT
11Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 11
Distributed and Cloud ComputingDistributed and Cloud Computing
Chapter 2: Computer Clusters for Chapter 2: Computer Clusters for Scalable parallel ComputingScalable parallel Computing
Adapted from Kai HwangAdapted from Kai Hwang
22
OutlineOutline
1 Clustering for massive parallelism
2 Cluster organization and architecture
3 Design principles of computer clusters
4 Cluster job and resource management
5 Case studies
33
Clustering for massive parallelism
OutlineOutline
1 Clustering for massive parallelism
2 Cluster organization and architecture
3 Design principles of computer clusters
4 Cluster job and resource management
5 Case studies
44Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 44
What is a computing cluster? A computing cluster consists of a collection of
interconnected stand-alone/complete computers, which can cooperatively working together as a single, integrated computing resource. Cluster explores parallelism at job level and distributed computing with higher availability.
A typical cluster: Merging multiple system images to a SSI
(single-system image) at certain functional levels. Low latency communication protocols applied Loosely coupled than an SMP(symmetric MULTI-proc) with a SSI
55Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 55
Multicomputer Clusters:
• Achieve scalable performance, HA, fault tolerance, modular growth, and use of commodity components
• Became popular in the mid-1990s as traditional mainframes and supercomputers were proven to be less cost-effective in many high-performance computing (HPC) applications
• Of the Top 500 supercomputers reported in 2010, 85% were built with homogeneous nodes
66Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 66
Multi-Computer Cluster Components Multi-Computer Cluster Components
77
Attributes Attribute Value
Packaging Compact Slack
Control Centralized Decentralized
Homogeneity Homogeneous Heterogeneous
Security Enclosed Exposed
Example Dedicated cluster Enterprise cluster
Attributes Used in Cluster Classification
88
Clustering for massive parallelism
Design Objectives of Computer Clusters
Scalability
Clustering of computers is based on the concept of modular growthScaling a cluster from hundreds of uniprocessor nodes to a supercluster with 10,000 multicore nodes is a nontrivial taskThe scalability could be limited by a number of factors, such as the multicore chip technology, cluster topology, power consumption, packaging method and cooling schemeOther limiting factors are the, disk I/O bottlenecks, and latency tolerance
99
Clustering for massive parallelism
Design Objectives of Computer Clusters
Packaging
Cluster nodes can be packaged in a compact or a slack fashionIn a compact cluster, the nodes are closely packaged in one or more racks sitting in a room, and the nodes are not attached to peripherals (monitors, keyboards, mice, etc.)In a slack cluster, the nodes are attached to their usual peripherals, and may be located in different rooms, different buildings, or even remote regionsPackaging directly affects communication wire length, and thus the selection of interconnection technology usedCompact cluster can utilize a high-bandwidth, low-latency communication network that is often proprietaryNodes of a slack cluster are normally connected through standard LANs or WANs
September 4, 2013 9 /
74
1010
Clusterin for massive parallelism
Control
Clusters can be managed or controlled in a centralized or decentralized fasionA compact cluster normally has centralized control, while a slack cluster can be controlled either wayIn a centralized cluster, all the nodes are owned, controlled, managed, and administered by a central operatorIn a decentralized cluster, the nodes have individual ownersDifficult to manage decentralized clusters (process scheduling, workload migration, checkpointing, accounting)
September 4, 2013 10
/ 74
1111
Clustering for massive parallelism
Design Objectives of Computer Clusters
Homogeneity
A homogeneous cluster uses nodes from the same platform (the same processor architecture and the same operating system; often, the nodes are from the same vendors)A heterogeneous cluster uses nodes of different platforms Interoperability is an important issue in heterogeneous clustersFor instance, process migration is often needed for load balancing or availabilityIn a homogeneous cluster, a binary process image can migrate to another node and continue executionThis is not feasible in a heterogeneous cluster, as the binary code will not be executable when the process migrates to a node of a different platform
Computer clustring
Distributed Computing
September 4, 2013 11
/ 74
1212
Clustering for massive parallelism
Design Objectives of Computer Clusters
Security
Intracluster communication can be either exposed or enclosedIn an exposed cluster, the communication paths among the nodes are exposed to the outside worldAn outside machine can access the communication paths using standard protocols (e.g., TCP/IP)Such exposed clusters are easy to implement, but have several disadvantages
Computer clustring
Distributed Computing
September 4, 2013 12
/ 74
1313
Clustering for massive parallelism Design Objectives of Computer Clusters
DisadvantagesDisadvantages of exposed communicationof exposed communication
Intracluster communication is not secure, unless Intracluster communication is not secure, unless the communication subsystem performs the communication subsystem performs additional work to ensure privacy and securityadditional work to ensure privacy and security
Outside communications may disrupt intracluster Outside communications may disrupt intracluster communications in an unpredictable fashioncommunications in an unpredictable fashion
Standard communication protocols tend to have Standard communication protocols tend to have high overheadhigh overhead
Computer clustring
Distributed Computing
September 4, 2013 13
/ 74
1414
Clustering for massive parallelism Design Objectives of Computer Clusters
Enclosed intracluster communicationEnclosed intracluster communication
In an enclosed cluster, intracluster In an enclosed cluster, intracluster communication is shielded from the outside communication is shielded from the outside worldworld
Lack of standard for efficient, enclosed Lack of standard for efficient, enclosed intracluster communication. Consequently, most intracluster communication. Consequently, most commercial or academic clusters realize fast commercial or academic clusters realize fast communications through one-of-a-kind protocolscommunications through one-of-a-kind protocols
Computer clustring
Distributed Computing
September 4, 2013 14
/ 74
1515Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 1515
Operational Benefits of Clustering High availability (HA) : Cluster offers inherent high system
availability due to the redundancy of hardware, operating systems, and applications.
Hardware Fault Tolerance: Cluster has some degree of redundancy in most system components including both hardware and software modules.
OS and application reliability : Run multiple copies of the OS and applications, and through this redundancy
Scalability : Adding servers to a cluster or adding more clusters to a network as the application need arises.
High Performance : Running cluster enabled programs to yield higher throughput.
1616
Cluster Organization and Cluster Organization and architecturearchitecture
1717
Basic Cluster ArchitectureBasic Cluster Architecture
Computer Cluster built from commodity hardware, software, middleware, and network supporting HA and SSI
1818
Cluster organization and architecture
Abstract architecture
A basic cluster architecture
A plateform independent cluster OS is not availableCluster middleware is required to glue together all node platforms at the user spaceAn SSI layer provides a single entry point, a single file hierarchy, a single point of control, and a single job management systemSingle memory may be realized with the help of scalable coherent interface
Computer clustring
Distributed Computing
September 4, 2013 18
/ 74
1919
Cluster organization and architecture
Abstract architecture
Resource sharing
Shared-nothingSimply connects two or more autonomous computers via a LAN such as EthernetShared-diskCan hold checkpoint files or critical system images to enhance cluster availabilityWithout shared disks, checkpointing, rollback recovery, failover, and failback are not possible in a clusterShared-memoryThe memory buses nodes can be inter-connected by a scalable coherent interface (SCI) ringIn the other two, the interconnect is attached to the I/O bus which has widely accepted standards (PCI I/O bus)The I/O bus evolves at a much slower rate than the memory bus
2020
2121
2222
Cluster organization and architecture
Abstract architecture
Cluster nodes
Service nodeCould be built with different processors, mainly used to handle I/O, file access, and system monitoring.Compute nodeMainly used for large scale searching or parallel floating-point computationsAppear in larger quantities and dominating in system cost
2323Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 2323
Resource Sharing in Cluster of Computers
Three ways to connect cluster nodes.
2424
High Bandwidth Interconnects
2525
RDMA Remote Direct Memory Access (RDMA) is a
technology that allows computers in a network to exchange data in main memory without involving the processor, cache, or operating system of either computer.
2626
Flow Control Flow control is the management of
data flow between computers or devices or between nodes in a network so that the data can be handled at an efficient pace. Too much data arriving before a device can handle it causes data overflow, meaning the data is either lost or must be retransmitted.
2727
Cluster organization and architecture Cluster System Interconnects
Interconnect family system shareInterconnect family system share
http://www.top500.org/
Computer clustring
Distributed Computing
September 4, 2013 27
/ 74
2828
Example : InfiniBand (1) Provides applications with an easy-to-use Provides applications with an easy-to-use
messaging service.messaging service. Gives every application direct access to the Gives every application direct access to the
messaging service without need not rely on the messaging service without need not rely on the operating system to transfer messages.operating system to transfer messages.
Provides a messaging service by creating a Provides a messaging service by creating a channel connecting an application to any other channel connecting an application to any other application or service with which the application application or service with which the application needs to communicate.needs to communicate.
Create these channels between virtual address Create these channels between virtual address spaces.spaces.
2929
Example : InfiniBand (2)
• InfiniBand creates a channel directly connecting an application in its virtual address space to an application in another virtual address space.
• The two applications can be in disjoint physical address spaces – hosted by different servers.
3030
InfiniBand Architecture HCA – Host Channel Adapter. An HCA is the point at
which an InfiniBand end node, such as a server or storage device, connects to the InfiniBand network.
TCA – Target Channel Adapter. This is a specialized version of a channel adapter intended for use in an embedded environment such as a storage appliance.
Switches – An InfiniBand Architecture switch is conceptually similar to any other standard networking switch, but molded to meet InfiniBand’s performance and cost targets.
Routers – Although not currently in wide deployment, an InfiniBand router is intended to be used to segment a very large network into smaller subnets connected together by an InfiniBand router.
3131
3232Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 3232
Example: InfiniBand System Fabric
3333
Cluster organization and architecture
Cluster System Interconnects
Ethernet versus InfiniBand
InfiniBand had a clear advantage of speed over EthernetEthernet has closed the gap considerably by improving on the core reasons for low latency forwardingHost latency: IB takes the data from the wire and puts it directly in the operating system user space. Today, Ethernet can deliver the same capability of bypassing the O/S kernel through Ethernet NICs with Kernel Bypass and TCP acceleration in hardwareSerialization delay: 10GbE reduces the serialization delay by a factor of ten taking about 1.2 microseconds to serialize at 10Gb/s and 1.8-2.0 microseconds to transmit. InifiniBand takes about 600 nanoseconds to do so
Computer clustring
Distributed Computing
September 4, 2013 33
/ 74
3434
3535
Cluster organization and architecture
Cluster System Interconnects
Ethernet versus InfiniBand
Cut-through switching: Ethernet was store-and-forward while InfiniBand uses cut-through switchingIn store-and-forward mechanism, the entire packet must be received to be examined and then forwarded.The cut-through switching just receives the header, makes the forwarding decision and then pipelines the rest of the packet for transmissionEthernet cut-through switches are available now
Computer clustring
Distributed Computing
September 4, 2013 35
/ 74
3636
Cluster organization and architecture
Cluster System Interconnects
Ethernet versus InfiniBand
Ethernet is considered a better choice sinceIt does not require for a separate network investment and technology they need to learn, staff, and supportGetting off of an Infiniband island is complex and involves multiple gateways that slow down each transaction and create choke-points in the infrastructureInfiniband has poor performance for IP traffic. Any application based on TCP/IP, UDP, or IP Multicast should use 10GbE
Computer clustring
Distributed Computing
September 4, 2013 36
/ 74
3737
Example: The Big Google Search Engine Example: The Big Google Search Engine
A Supercluster built over high-speed PCs and A Supercluster built over high-speed PCs and Gigabit LANs for global web page searching Gigabit LANs for global web page searching applications provided by Google.applications provided by Google.
Physically, the cluster is housed in 40 PC/switch Physically, the cluster is housed in 40 PC/switch racks with 80 PCs per rack and 3200 PCs in racks with 80 PCs per rack and 3200 PCs in totaltotal
Two racks to house two 128 x 128 Gigabit Two racks to house two 128 x 128 Gigabit Ethernet switches, the front hosts, and UPSs, Ethernet switches, the front hosts, and UPSs, etc.etc.
All commercially available hardware parts with All commercially available hardware parts with Google designed software systems for Google designed software systems for supporting parallel search, URL linking, page supporting parallel search, URL linking, page ranking, file and database management, etc.ranking, file and database management, etc.
3838Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 3838
Google Search Engine Cluster Google Search Engine Cluster
3939
VideoVideo
4040
4141
Design Principles of Clusters
4242
Design Principles of Clusters Single-system image (SSI)Single-system image (SSI) High availability (HA)High availability (HA) Fault toleranceFault tolerance Rollback recoveryRollback recovery
4343Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 4343
1. Single System Image (SSI)
A single system image is the illusion, created by software or hardware, that presents a collection of resources as an integrated powerful resource.
SSI makes the cluster appear like a single machine to the user, applications, and network.
A cluster with multiple system images is nothing but a collection of independent computers (Distributed systems in general)
4444Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 4444
Single-System-Image Features Single System: The entire cluster is viewed by the
users as one system, which has multiple processors. Single Control: Logically, an end user or system user
utilizes services from one place with a single interface.
Symmetry: A user can use a cluster service from any node. All cluster services and functionalities are symmetric to all nodes and all users, except those protected by access rights.
Location Transparent: The user is not aware of the whereabouts of the physical device that eventually provides a service.
4545Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 4545
Basic SSI Services A. Single Entry Point
telnet cluster.usc.edu telnet node1.cluster.usc.edu
B. Single File Hierarchy: xFS, AFS, Solaris MC ProxyC. Single I/O, Networking, and Memory Space
Other Single Job Management: GlUnix, Codine, LSF, etc. Single User Interface: Like CDE in Solaris/NT Single process space
4646Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 4646
A: Example: Realizing a single entry point in a cluster of computers
1. Four nodes of a cluster are used as host nodes to receive users’ login requests. 2. To log into the cluster a standard Unix command such as “telnet
cluster.cs.hku.hk”, using the symbolic name of the cluster system is issued.3. The symbolic name is translated by the DNS, which returns with the IP address
159.226.41.150 of the least-loaded node, which happens to be node Host1. 4. The user then logs in using this IP address.5. The DNS periodically receives load information from the host nodes to make
load-balancing translation decisions.
4747
B. Single File Hierarchy
Single file hierarchy Single file hierarchy - the illusion of a single, huge file - the illusion of a single, huge file system image that transparently integrates local and global system image that transparently integrates local and global disks and other file devices (e.g., tapes). disks and other file devices (e.g., tapes). Files can reside on 3 types of locations in a cluster:Files can reside on 3 types of locations in a cluster:Local storage Local storage - disk on the local node. - disk on the local node. Remote storage Remote storage - disks on remote nodes. - disks on remote nodes. Stable storage Stable storage - -
Persistent - data, once written to the stable storage, Persistent - data, once written to the stable storage, will stay there at least for a period of time (e.g., a will stay there at least for a period of time (e.g., a week), even after the cluster shuts down. week), even after the cluster shuts down.
Fault tolerant - to some degree, by using redundancy Fault tolerant - to some degree, by using redundancy and periodical backup to tapes.and periodical backup to tapes.
4848
Example: Stable Storage
Could be implemented as one centralized, large RAID disk or distributed using local disks of cluster nodes.
• First approach uses a large disk, which is a single point of failure and a potential performance bottleneck.
• Second approach is more difficult to implement, but potentially more economical, more efficient, and more available.
4949Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 4949
To achieve SSI, we need a:•single control point•single address space •single job management system•single user interface•single process control
C. Single I/O, Networking, and Memory Space
5050Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 5050
Example: Distributed RAID - The RAID-x Architecture
• Distributed RAID architecture with a single I/O space over 12 distributed disks attached to 4 host machines (nodes) in a Linux cluster.
• Di stands for disk I• Bj for disk block j• Mj’ an image (shaded
plates) of block Bj. • P/M refers to processor
/memory node • CDD is a cooperative disk
driver.
5151
2. High Availability Through Redundancy
• Three terms often go together: reliability, availability, and serviceability (RAS).
• Availability combines the concepts of reliability and serviceability as defined below:• Reliability measures how long a system can operate
without a breakdown.• Availability indicates the percentage of time that a
system is available to the user, that is, the percentage of system uptime.
• Serviceability refers to how easy it is to service the system, including hardware and software maintenance, repair, upgrade, etc.
5252
Availability = MTTF / (MTTF + MTTR)
The operate-repair cycle of a computer system.
Availability and Failure Rate
Recent Find/SVP Survey of Fortune 1000 companies: •An average computer is down 9 times a year with an average downtime of 4 hours. •The average loss of revenue per hour downtime is $82,500.
5353Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 5353
5454
Q : system will fail 10 time in years after Q : system will fail 10 time in years after running 700 Hours and will repair after 7 hours running 700 Hours and will repair after 7 hours what it’s availability ?what it’s availability ?
5555Copyright © 2012, Elsevier Inc. All rights reserved. 1 - 1 - 5555
Single Points of Failure in SMP and Clusters
5656
3. Fault Tolerant Cluster ConfigurationsRedundant components configured based on different cost, Redundant components configured based on different cost, availability, and performance requirements. availability, and performance requirements. The following three configurations are frequently used:The following three configurations are frequently used:Hot Standby: Hot Standby: A primary component provides service, while A primary component provides service, while a redundant backup component stands by without doing any a redundant backup component stands by without doing any work, but is ready (hot) to take over as soon as the primary work, but is ready (hot) to take over as soon as the primary fails. fails. Mutual Takeover: Mutual Takeover: All components are primary in that they All components are primary in that they all actively perform useful workload. When one fails, its all actively perform useful workload. When one fails, its workload is redistributed to other components. (More workload is redistributed to other components. (More Economical)Economical)Fault-Tolerance: Fault-Tolerance: Most expensive configuration, as N Most expensive configuration, as N components deliver performance of only one component, at components deliver performance of only one component, at more than N times the cost. The failure of N–1 components more than N times the cost. The failure of N–1 components is masked (not visible to the user).is masked (not visible to the user).
5757
Failover Probably the most important feature demanded in current clusters Probably the most important feature demanded in current clusters
for commercial applications. for commercial applications. When a component fails, this technique allows the remaining When a component fails, this technique allows the remaining
system to take over the services originally provided by the failed system to take over the services originally provided by the failed component. component.
A failover mechanism must provide several functions: failure A failover mechanism must provide several functions: failure diagnosis, failure notification, and failure recovery. diagnosis, failure notification, and failure recovery.
Failure diagnosis - detection of a failure and the location of the Failure diagnosis - detection of a failure and the location of the failed component that causes the failure. failed component that causes the failure. Commonly used technique - heartbeat, where the cluster Commonly used technique - heartbeat, where the cluster
nodes send out a stream of heartbeat messages to one nodes send out a stream of heartbeat messages to one another.another.
If the system does not receive the stream of heartbeat If the system does not receive the stream of heartbeat messages from a node, it can conclude that either the node or messages from a node, it can conclude that either the node or the network connection has failed.the network connection has failed.
5858
4. Recovery Schemes Failure recovery refers to the actions needed to take over the Failure recovery refers to the actions needed to take over the
workload of a failed component. workload of a failed component. Two types of recovery techniques: Two types of recovery techniques: Backward recovery Backward recovery - the processes running on a cluster periodically - the processes running on a cluster periodically
save a consistent state (called a checkpoint) to a stable storage. save a consistent state (called a checkpoint) to a stable storage. After a failure, the system is reconfigured to isolate the failed After a failure, the system is reconfigured to isolate the failed
component, restores the previous checkpoint, and resumes component, restores the previous checkpoint, and resumes normal operation. This is called normal operation. This is called rollbackrollback. .
Backward recovery is easy to implement and is widely used. Backward recovery is easy to implement and is widely used. Rollback implies wasted execution. If execution time is crucial, a Rollback implies wasted execution. If execution time is crucial, a
forward recovery scheme should be used.forward recovery scheme should be used. Forward recovery - Forward recovery - The system uses the failure diagnosis The system uses the failure diagnosis
information to reconstruct a valid system state and continues information to reconstruct a valid system state and continues execution. execution. Forward recovery is application-dependent and may need extra Forward recovery is application-dependent and may need extra
hardware.hardware.
5959
Checkpointing and Recovery Techniques
Kernel, Library, and Application Levels Kernel, Library, and Application Levels : Checkpointing : Checkpointing at the operating system kernel level, where the OS at the operating system kernel level, where the OS transparently checkpoints and restarts processes.transparently checkpoints and restarts processes.
Checkpoint OverheadsCheckpoint Overheads: The time consumed and : The time consumed and storage required for checkpointing. storage required for checkpointing.
Choosing an Optimal Checkpoint Interval: Choosing an Optimal Checkpoint Interval: The time The time period between two checkpoints is called the checkpoint period between two checkpoints is called the checkpoint interval.interval.
Incremental CheckpointIncremental Checkpoint: Instead of saving the full state : Instead of saving the full state at each checkpoint, an incremental checkpoint scheme at each checkpoint, an incremental checkpoint scheme saves only the portion of the state that is changed from saves only the portion of the state that is changed from the previous checkpoint.the previous checkpoint.
User-Directed Checkpointing: UUser-Directed Checkpointing: User inserts code (e.g., ser inserts code (e.g., library or system calls) to tell the system when to save, library or system calls) to tell the system when to save, what to save, and what not to save.what to save, and what not to save.
6060
ENDEND