Scheduling and Resource Management for Next-generation Clusters

Yanyong Zhang, Penn State University, www.cse.psu.edu/~yyzhang

Post on 16-Jan-2016


Page 1: Scheduling and Resource Management for Next-generation Clusters

Scheduling and Resource Management for Next-generation Clusters

Yanyong Zhang
Penn State University

www.cse.psu.edu/~yyzhang

Page 2: Scheduling and Resource Management for Next-generation Clusters

What is a Cluster?

•Cost effective

•Easily scalable

•Highly available

•Readily upgradeable

Page 3: Scheduling and Resource Management for Next-generation Clusters

Scientific & Engineering Applications

• HPTi wins a 5-year, $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm)

• Sandia's expansion of their Alpha-based C-plant system.

• Maui HPCC LosLobos Linux Super-cluster (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)

• A performance-price ratio of … is demonstrated in simulations of wind instruments using a cluster of 20 ….

(http://www.swiss.ai.mit.edu/~pas/p/sc95.html)

• The PC cluster based parallel simulation environment and the technologies … will have a positive impact on networking research nationwide ….

(http://www.osc.edu/press/releases/2001/approved.shtml)

Page 4: Scheduling and Resource Management for Next-generation Clusters

Commercial Applications

• Business applications
  – Transaction Processing (IBM DB2, Oracle …)
  – Decision Support System (IBM DB2, Oracle …)
• Internet applications
  – Web serving / searching (Google.com …)
  – Infowares (Yahoo.com, AOL.com)
  – Email, eChat, ePhone, eBook, eBank, eSociety, eAnything
  – Computing portal

Page 5: Scheduling and Resource Management for Next-generation Clusters

Resource Management

• Each application is demanding
• Several applications/users can be present at the same time

Resource management and Quality-of-service become important.

Page 6: Scheduling and Resource Management for Next-generation Clusters

System Model

• Each node is independent
• Maximum MPL
• Arrival queue

(Figure: nodes P0 to P4 connected by a high-speed network, with an arrival queue of waiting jobs.)

Page 7: Scheduling and Resource Management for Next-generation Clusters

Two Phases in Resource Management

• Allocation Issues
  – Admission Control
  – Arrival Queue Principle
• Scheduling Issues (CPU Scheduling)
  – Resource Isolation
  – Co-allocation

Page 8: Scheduling and Resource Management for Next-generation Clusters

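Co-allocation / Co-scheduling

(Figure: time diagram of processes P0 and P1 communicating through a switch, with SEND and RECV events at times t0 and t1, illustrating scheduling skewness.)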

Page 9: Scheduling and Resource Management for Next-generation Clusters

Outline

• From OS’s perspective
  – Contribution 1: boosting the CPU utilization at supercomputing centers (NEXT)
  – Contribution 2: providing quick responses for commercial workloads
  – Contribution 3: scheduling multiple classes of applications
• From application’s perspective
  – Contribution 4: optimizing clustered DB2

Page 10: Scheduling and Resource Management for Next-generation Clusters

Contribution 1: Boosting CPU Utilization at Supercomputing Centers

Page 11: Scheduling and Resource Management for Next-generation Clusters

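Objective

(Figure: job timeline. Response Time spans Wait Time and Execute Time; labeled intervals: wait in the arrival Q, wait in the ready/blocked Q.)

$\text{slowdown} = \dfrac{\text{Response Time}}{\text{Execute Time in Isolation}}$

Goal: minimize slowdown.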
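The slowdown metric on this slide can be sketched in a few lines of Python. This is only an illustration: the function name, the argument names, and the sample numbers are assumptions, and it takes response time to be the sum of wait time and execute time, as the timeline on the slide suggests.

```python
def slowdown(wait_time, execute_time, isolated_time):
    """Slowdown = response time / execute time in isolation.

    Assumes response time = wait time + execute time, where execute time
    is the time the job spends in the system after leaving the arrival queue.
    """
    if isolated_time <= 0:
        raise ValueError("isolated_time must be positive")
    response_time = wait_time + execute_time
    return response_time / isolated_time


# Hypothetical job: waits 30 s, then takes 90 s on a shared machine,
# but would take 60 s if it ran alone.
print(slowdown(30.0, 90.0, 60.0))
```

A slowdown of 1.0 would mean the job finished as quickly as it would have on a dedicated machine.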

Page 12: Scheduling and Resource Management for Next-generation Clusters

Existing Techniques

• Back Filling (BF)
• Gang Scheduling (GS)
• Migration (M)

(Figure: space-time diagram of job placement on a machine with # of CPUs = 14.)

Page 13: Scheduling and Resource Management for Next-generation Clusters

Proposed Scheme

• MBGS = GS + BF + M
  – Use GS as the basic framework
  – At each row of GS matrix, apply BF technique
  – Whenever GS matrix is re-calculated, M should be considered.
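To make the "GS matrix" concrete, here is a simplified Python sketch of filling a gang-scheduling matrix, where each row is a time slice and each job occupies some number of CPUs in one row. It is not MBGS: it does plain first-fit placement, so a later job may slip into a row where an earlier, larger job does not fit, but it uses no runtime estimates or reservations (as real backfilling does) and performs no migration. All names and the sample jobs are assumptions.

```python
def fill_gs_matrix(jobs, num_cpus, max_rows):
    """First-fit placement of jobs into a gang-scheduling matrix.

    jobs: list of (name, cpus_needed) in arrival order.
    Returns (rows, waiting): rows[i] lists the jobs sharing time slice i,
    waiting lists the jobs that stay in the arrival queue.
    """
    rows, free, waiting = [], [], []
    for name, need in jobs:
        if need > num_cpus:
            waiting.append(name)          # can never fit on this machine
            continue
        for i in range(len(rows)):
            if free[i] >= need:           # fits into an existing time slice
                rows[i].append(name)
                free[i] -= need
                break
        else:
            if len(rows) < max_rows:      # open a new time slice (row)
                rows.append([name])
                free.append(num_cpus - need)
            else:
                waiting.append(name)      # multiprogramming level exhausted
    return rows, waiting


jobs = [("A", 8), ("B", 8), ("C", 6), ("D", 2), ("E", 10)]
rows, waiting = fill_gs_matrix(jobs, num_cpus=14, max_rows=2)
print(rows, waiting)
```

The `max_rows` bound plays the role of the maximum multiprogramming level; the sketch ignores which physical CPUs a job lands on.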

Page 14: Scheduling and Resource Management for Next-generation Clusters

How Does MBGS Perform?

Page 15: Scheduling and Resource Management for Next-generation Clusters

Outline

• From OS’s perspective
  – Contribution 1: boosting the CPU utilization at supercomputing centers
  – Contribution 2: providing quick responses for commercial workloads (NEXT)
  – Contribution 3: scheduling multiple classes of applications
• From application’s perspective
  – Contribution 4: optimizing clustered DB2

Page 16: Scheduling and Resource Management for Next-generation Clusters

Contribution 2: Reducing Response Times for Commercial Applications

Page 17: Scheduling and Resource Management for Next-generation Clusters

Objective

(Figure: job timeline. Response Time spans Wait Time and Execute Time; labeled intervals: wait in the arrival Q, wait in the ready/blocked Q.)

• Minimize wait time
• Minimize response time

Page 18: Scheduling and Resource Management for Next-generation Clusters

Previous Work I: Gang Scheduling (GS)

GS is not responsive enough!

(Figure: GS schedule with callouts (1) and (2); annotations: "MINUTES!", "wasted".)

Page 19: Scheduling and Resource Management for Next-generation Clusters

Previous Work II: Dynamic Co-scheduling

(Figure: nodes P0 to P3 with jobs B, D, A, and C; annotations: "B just gets a msg", "Everybody else is blocked", "It’s A’s turn", "C just finishes I/O".)

The scheduler on each node makes independent decisions based on local events, without global synchronization.

Page 20: Scheduling and Resource Management for Next-generation Clusters

Dynamic Co-scheduling Heuristics

Columns: How do you wait for a message?
Rows: What do you do on message arrival?

                          Busy Wait   Spin Block   Spin Yield
No Explicit Reschedule    Local       SB           SY
Interrupt & Reschedule    DCS         DCS-SB       DCS-SY
Periodically Reschedule   PB          PB-SB        PB-SY
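As a rough illustration of the "Spin Block" column, the Python sketch below compares the CPU time a waiting process burns under busy waiting and under spin-block. It assumes the usual reading of these terms (busy wait spins until the message arrives; spin-block spins for a bounded time and then gives up the CPU); the function names and numbers are hypothetical and context-switch costs are ignored.

```python
def busy_wait_cost(msg_delay):
    """CPU time spent spinning when the process spins until the message arrives."""
    return msg_delay


def spin_block_cost(msg_delay, spin_limit):
    """Spin for at most spin_limit, then block.

    Returns (cpu_time_spent_spinning, blocked), where blocked tells whether
    the process had to give up the CPU before the message arrived.
    """
    if msg_delay <= spin_limit:
        return msg_delay, False   # message arrived while still spinning
    return spin_limit, True       # gave up the CPU; another job can run


# Hypothetical delays in microseconds.
print(busy_wait_cost(500))
print(spin_block_cost(500, spin_limit=100))
print(spin_block_cost(50, spin_limit=100))
```

The choice of `spin_limit` is the trade-off the later "Optimal Spin Time for SB" result refers to: too short and the process blocks just before the message arrives, too long and it wastes CPU.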

Page 21: Scheduling and Resource Management for Next-generation Clusters

Simulation Study

• A detailed simulator at a microsecond granularity

• System parameters
  – System configurations (maximum MPL, to partition or not)
  – System overheads (context switch overheads, interrupt costs, costs associated with manipulating queues)

Page 22: Scheduling and Resource Management for Next-generation Clusters

Simulation Study (Cont’d)

• Application parameters
  – Injection load
  – Characteristics (CPU intensive, I/O intensive, communication intensive, or somewhere in the middle)

Page 23: Scheduling and Resource Management for Next-generation Clusters

Impact of Load

Page 24: Scheduling and Resource Management for Next-generation Clusters

Impact of Workload Characteristics

(Figure panels: communication intensive; I/O intensive.)

Page 25: Scheduling and Resource Management for Next-generation Clusters

Periodic Boost Heuristics

• S1: Compute Phase
• S2: S1 + Unconsumed Msg.
• S3: Recv. + Msg. Arrived
• S4: Recv. + No Msg.

• A: S3 -> {S2,S1}
• B: S3 -> S2 -> S1
• C: {S3,S2,S1}
• D: {S3,S2} -> S1
• E: S2 -> S3 -> S1

(Figure: average job response time (×10000 seconds) for heuristics A to E; y-axis from 2.3 to 2.9.)

Page 26: Scheduling and Resource Management for Next-generation Clusters

Analytical Modeling Study

• The state space is impossible to handle.

(Figure: nodes P0, P1, P2, P3, …, Pp connected by a high-speed network, with dynamic job arrival.)

Page 27: Scheduling and Resource Management for Next-generation Clusters

Analysis Description

Original State Space (impossible to handle!!)

(Equation: definition of the full state X, with components for every node k = 1,…,P, where P is the number of nodes.)

Assumption: The state of each processor is stochastically independent of, and identical to, the state of the other processors.

Reduced State Space (much more tractable!!)

(Equation: definition of the per-node state Y, including the number of jobs on node k.)

Page 28: Scheduling and Resource Management for Next-generation Clusters

Analysis Description (Cont)

• Address the state transition rates using a continuous-time Markov model; build the generator matrix Q.

• Get the invariant probability vector π by solving πQ = 0 and πe = 1.

• Use fixed-point iteration to get the solution.
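A minimal Python sketch of the "solve πQ = 0, πe = 1" step for a small generator matrix. It uses uniformization followed by power iteration, which is one standard way to obtain the invariant vector; it is not claimed to be the fixed-point scheme used in the talk. The function name and the two-state example matrix are illustrative assumptions.

```python
def stationary_distribution(Q, tol=1e-12, max_iter=100000):
    """Invariant vector pi of a CTMC generator Q (pi Q = 0, sum(pi) = 1).

    Q is a square list of lists whose rows sum to zero, with at least one
    nonzero rate. Uniformization turns Q into the stochastic matrix
    P = I + Q / rate; pi is then found by iterating pi <- pi P.
    """
    n = len(Q)
    # Rate strictly above the largest exit rate keeps P aperiodic.
    rate = 1.1 * max(-Q[i][i] for i in range(n))
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical two-state chain: leave state 0 at rate 2, leave state 1 at rate 1.
Q = [[-2.0, 2.0],
     [1.0, -1.0]]
pi = stationary_distribution(Q)
print(pi)  # close to [1/3, 2/3]
```

For the reduced per-node state space a direct linear solve would work as well; power iteration is shown only because it needs no external library.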

Page 29: Scheduling and Resource Management for Next-generation Clusters

SB Example

(Figure: state-transition diagram for the SB example with two jobs, 1 and 2. Each state gives a status for each job, using the labels C, IO, SN, SP, and B; transitions carry rates such as r1, r1’, and r2.)

Page 30: Scheduling and Resource Management for Next-generation Clusters

Results

(Figure panels: optimal PB frequency; optimal spin time for SB.)

Page 31: Scheduling and Resource Management for Next-generation Clusters

Results – Optimal Quantum Length

(Figure panels: Comm Intensive; CPU Intensive; I/O Intensive.)

Page 32: Scheduling and Resource Management for Next-generation Clusters

Outline

• From OS’s perspective
  – Contribution 1: boosting the CPU utilization at supercomputing centers
  – Contribution 2: providing quick responses for commercial workloads
  – Contribution 3: scheduling multiple classes of applications (NEXT)
• From application’s perspective
  – Contribution 4: optimizing clustered DB2

Page 33: Scheduling and Resource Management for Next-generation Clusters

Contribution 3: Scheduling Multiple Classes of Applications

realtime

interactive

batch

Page 34: Scheduling and Resource Management for Next-generation Clusters

Objective

(Figure: BE and RT applications sharing a cluster.)

• BE: "How long did it take me to finish?" Metric: response time
• RT: "How many deadlines have been missed?" Metric: miss rate

Page 35: Scheduling and Resource Management for Next-generation Clusters

Fairness Ratio (x:y)

Cluster resource is divided between the two classes:
• RT: x / (x + y)
• BE: y / (x + y)
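A small worked example of the fairness ratio, in Python. It assumes, as the later result slides suggest, that x is the real-time (RT) weight and y the best-effort (BE) weight; the function name is illustrative.

```python
from fractions import Fraction


def resource_shares(x, y):
    """Split the cluster resource between RT and BE for a fairness ratio x:y.

    Returns (rt_share, be_share) as exact fractions: x/(x+y) and y/(x+y).
    """
    if x < 0 or y < 0 or x + y == 0:
        raise ValueError("x and y must be non-negative and not both zero")
    total = x + y
    return Fraction(x, total), Fraction(y, total)


# The ratios that appear on the result slides.
for x, y in [(2, 1), (1, 9), (9, 1)]:
    rt, be = resource_shares(x, y)
    print(f"RT:BE = {x}:{y} -> RT gets {rt}, BE gets {be}")
```

With x:y = 2:1, RT applications are entitled to two thirds of the cluster and BE applications to one third.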

Page 36: Scheduling and Resource Management for Next-generation Clusters

How to Adhere to Fairness Ratio?

(Figure: schedules over time on P0 and P1 for three schemes, 1GS, 2DCS-TDM, and 2DCS-PS, with jobs RT1, RT2, and BE and x:y = 2:1.)

Page 37: Scheduling and Resource Management for Next-generation Clusters

BE response time

(Figure panels: RT : BE = 2:1; RT : BE = 1:9; RT : BE = 9:1.)

Page 38: Scheduling and Resource Management for Next-generation Clusters

RT Deadline Miss Rate

(Figure panels: RT : BE = 2:1; RT : BE = 1:9; RT : BE = 9:1.)

Page 39: Scheduling and Resource Management for Next-generation Clusters

Outline

• From OS’s perspective
  – Contribution 1: boosting the CPU utilization at supercomputing centers
  – Contribution 2: providing quick responses for commercial workloads
  – Contribution 3: scheduling multiple classes of applications
• From application’s perspective
  – Characterizing decision support workloads on the clustered database server (NEXT)
  – Resource management for transaction processing workloads on the clustered database server

Page 40: Scheduling and Resource Management for Next-generation Clusters

Experiment Setup

• IBM DB2 Universal Database for Linux, EEE, Version 7.2

• A Linux/Pentium cluster of 8 dual nodes, each with 256 MB RAM and an 18 GB disk.

• TPC-H workload. Queries are run sequentially (Q1 – Q20). Completion time for each query is measured.

Page 41: Scheduling and Resource Management for Next-generation Clusters

Platform

(Figure: a client and the server nodes connected by Myrinet. Table T, with rows 001A, 002B, 003C, and 004D, is spread across the nodes; the query "Select * from T" is handled through a coordinator node in numbered steps 1 to 5.)

Page 42: Scheduling and Resource Management for Next-generation Clusters

Methodology

• Identify the components with high system overhead.

• For each such component, characterize the request distribution.

• Come up with ways of optimization.

• Quantify potential benefits from the optimization.

Page 43: Scheduling and Resource Management for Next-generation Clusters

Sampling OS Statistics

• Sample the statistics provided by stat, net/dev, process/stat.
  – User/system CPU %
  – # of page faults
  – # of blocks read/written
  – # of reads/writes
  – # of packets sent/received
  – CPU utilization during I/O
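The sampling step can be illustrated with a short Python sketch that turns two samples of the aggregate "cpu" line into user and system CPU percentages. It assumes the "stat" source on this slide is Linux's /proc/stat, whose first line lists cumulative tick counts (user, nice, system, idle, and on newer kernels further fields). The function names and sample lines are hypothetical.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts.

    Only the first eight counters (user .. steal) are kept, because the
    trailing guest counters are already included in user and nice.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("not an aggregate cpu line")
    return [int(v) for v in fields[1:9]]


def cpu_percentages(before, after):
    """Return (user %, system %) between two samples of the cpu line.

    'user' here includes niced user time.
    """
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta)
    if total == 0:
        return 0.0, 0.0
    return 100.0 * (delta[0] + delta[1]) / total, 100.0 * delta[2] / total


# Two hypothetical samples taken some interval apart.
print(cpu_percentages("cpu 100 0 50 850", "cpu 160 0 70 970"))
```

On a live system the two strings would come from reading the first line of /proc/stat at the start and end of the sampling interval.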

Page 44: Scheduling and Resource Management for Next-generation Clusters

Kernel Instrumentation

• Instrument each system call in the kernel.

(Figure: instrumentation points along a system call: enter system call, block, unblock, resume execution, exit system call.)

Page 45: Scheduling and Resource Management for Next-generation Clusters

Operating System Profile

• A considerable part of the execution time is spent in the pread system call.

• There is good overlap of computation with I/O for some queries.

• More reads than writes.

Page 46: Scheduling and Resource Management for Next-generation Clusters

TPC-H pread Overhead

Query   % of exe time    Query   % of exe time
Q6      20.0             Q13     10.0
Q14     19.0             Q3      9.6
Q19     16.9             Q4      9.1
Q12     15.4             Q18     9.0
Q15     13.4             Q20     7.9
Q7      12.1             Q2      5.2
Q17     10.8             Q9      5.2
Q8      10.5             Q5      4.6
Q10     10.3             Q16     4.1
Q1      10.0             Q11     3.5

pread overhead = (# of preads) × (overhead per pread).

Page 47: Scheduling and Resource Management for Next-generation Clusters

pread Optimization

(Figure: data path between the page cache and user space, steps 1 and 2, with the page table.)

pread(dest, chunk) {
    for each page in the chunk {
        if the page is not in cache {
            bring it in from disk
        }
        copy the page into dest
    }
}

Optimization:
• Re-mapping the buffer
• Copy on write

30s

Page 48: Scheduling and Resource Management for Next-generation Clusters

Copy-on-write

(Figure: user-space buffer mapped to the page cache as read only.)

Query   % reduction    Query   % reduction
Q1      98.9           Q11     96.1
Q2      85.7           Q12     87.1
Q3      96.0           Q13     100.0
Q4      80.9           Q14     96.1
Q5      100.0          Q15     96.8
Q6      100.0          Q16     70.7
Q7      79.7           Q17     94.5
Q8      79.3           Q18     100.0
Q9      88.7           Q19     95.7
Q10     77.8           Q20     94.4

% reduction = 1 − (# of copy-on-write) / (# of preads)

Page 49: Scheduling and Resource Management for Next-generation Clusters

Operating System Profile

• Socket calls are the next dominant system calls.

Page 50: Scheduling and Resource Management for Next-generation Clusters

Message Characteristics

(Figure: message characteristics for Q11 and Q16: message size (bytes), message inter-injection time (milliseconds), and message destination.)

Page 51: Scheduling and Resource Management for Next-generation Clusters

Observations on Messages

• Only a small set of message sizes is used.

• Many messages are sent in a short period.

• Message destination distribution is uniform.

• Many messages are point-to-point implementations of multicast/broadcast messages.

• Multicast can reduce # of messages.

Page 52: Scheduling and Resource Management for Next-generation Clusters

Potential % Reduction in Messages

query   total   small   large    query   total   small   large
Q1      44.7    71.4    38.7     Q11     9.6     28.6    0.1
Q2      20.4    58.7    0.2      Q12     8.3     7.8     2.9
Q3      48.2    64.3    38.0     Q13     24.5    75.2    0.1
Q4      22.6    58.6    0.1      Q14     27.9    80.4    0.7
Q5      8.0     7.1     8.4      Q15     46.6    56.5    0.7
Q6      76.4    78.6    45.5     Q16     59.1    63.0    56.9
Q7      57.5    71.4    56.2     Q17     41.5    66.7    27.3
Q8      29.1    75.5    4.8      Q18     11.4    32.3    0.0
Q9      66.8    78.5    61.1     Q19     26.7    79.4    0.2
Q10     25.0    73.6    0.1      Q20     21.1    62.8    0.1

Page 53: Scheduling and Resource Management for Next-generation Clusters

Online Algorithm

Baseline send:

Send(msg, dest) {
    send msg to node dest;
}

Buffered send with multicast:

Send(msg, dest) {
    if (msg == buffered_msg && dest ∉ dest_set)
        dest_set = dest_set ∪ { dest };
    else
        buffer the msg;
}

Send_bg() {
    foreach buffered_msg
        if (it has been buffered longer than threshold)
            send multicast msg to nodes in dest_set;
}
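The buffered send above can be sketched as runnable Python. This is an illustration under assumptions, not the implementation from the talk: the class name, the explicit `now` argument, the `multicast` callback, and the removal of a message from the buffer once it has been multicast are all hypothetical choices.

```python
class MulticastBuffer:
    """Coalesce identical point-to-point messages into one multicast."""

    def __init__(self, threshold, multicast):
        self.threshold = threshold   # how long a message may stay buffered
        self.multicast = multicast   # callable(msg, dest_set) doing the real send
        self.pending = []            # entries: [msg, time_buffered, dest_set]

    def send(self, msg, dest, now):
        # Same payload already buffered and dest not yet covered: extend dest_set.
        for entry in self.pending:
            if entry[0] == msg and dest not in entry[2]:
                entry[2].add(dest)
                return
        # Otherwise buffer the message as a new entry.
        self.pending.append([msg, now, {dest}])

    def send_bg(self, now):
        # Background pass: flush entries buffered longer than the threshold.
        still_pending = []
        for msg, t0, dests in self.pending:
            if now - t0 > self.threshold:
                self.multicast(msg, dests)
            else:
                still_pending.append([msg, t0, dests])
        self.pending = still_pending


sent = []
buf = MulticastBuffer(threshold=5, multicast=lambda m, d: sent.append((m, sorted(d))))
buf.send("x", 1, now=0)
buf.send("x", 2, now=1)   # coalesced with the first "x"
buf.send("y", 3, now=2)
buf.send_bg(now=6)        # only "x" has been buffered longer than 5
print(sent)
```

A larger threshold gives more chances to coalesce destinations but delays every buffered message, which is the trade-off the threshold study examines.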

Page 54: Scheduling and Resource Management for Next-generation Clusters

Impact of Threshold

(Figure panels: Q7 and Q16; x-axis: threshold (milliseconds).)

Page 55: Scheduling and Resource Management for Next-generation Clusters

Outline

• From OS’s perspective
  – Contribution 1: boosting the CPU utilization at supercomputing centers
  – Contribution 2: providing quick responses for commercial workloads
  – Contribution 3: scheduling multiple classes of applications
• From application’s perspective
  – Characterizing decision support workloads on the clustered database server
  – Resource management for clustered database applications (NEXT)

Page 56: Scheduling and Resource Management for Next-generation Clusters

Ongoing/Near-term Work

• What is the optimal number of jobs which should be admitted?

• Can we dynamically pause some processes based on resource requirement and resource availability?

• Which dynamic co-scheduling scheme works best here?

• How do we exploit application level information in scheduling?

Page 57: Scheduling and Resource Management for Next-generation Clusters

Future Work

• Some next-generation applications
  – Real time medical imaging and collaborative surgery

Application requirements:
• VAST processing power, disk capacity and network bandwidth
• absolute availability
• deterministic performance

Page 58: Scheduling and Resource Management for Next-generation Clusters

Future Work

– E-business on demand

Requirements:
• performance
  – more users
  – responsive
  – Quality-of-service
• availability
• security
• power consumption
• pricing model

Page 59: Scheduling and Resource Management for Next-generation Clusters

Future Work

• What does it take to get there?
  – Hardware innovations
  – Resource management and isolation
  – Good scalability
  – High availability
  – Deterministic Performance

Page 60: Scheduling and Resource Management for Next-generation Clusters

Future Work

• Not only high performance
  – Energy consumption
  – Security
  – Pricing for service
  – User satisfaction
  – System management
  – Ease of use

Page 61: Scheduling and Resource Management for Next-generation Clusters

Related Work

• Parallel job scheduling:
  – Gang Scheduling ([Ousterhout82])
  – Backfilling ([Lifka95], [Feitelson98])
  – Migration ([Epima96])
• Dynamic co-scheduling:
  – Spin Block ([Arpaci-Dusseau98], [Anglano00])
  – Periodic Boost ([Nagar99])
  – Demand-based Coscheduling ([Sobalvarro97])

Page 62: Scheduling and Resource Management for Next-generation Clusters

Related Work (Cont’d)

• Real-time Scheduling:
  – Earliest Deadline First
  – Rate Monotonic
  – Least Laxity First
• Single node Multi-class scheduling
  – Hierarchical scheduling ([Goyal96])
  – Proportional share ([Waldspurger95])
• Commercial clustered server (Pai[98], reserve)

Page 63: Scheduling and Resource Management for Next-generation Clusters

Related Work (Cont’d)

• Commercial Workloads (CAECW, [Barford99], Kant[99])

• Database Characterizing ([Keeton99], [Ailamaki99], [Rosenblum97])

• OS support for database ([Stonebraker81], [Gray78], [Christmann87])

• Reducing copies in IO ([Pai00], [Druschel93], [Thadani95])

Page 64: Scheduling and Resource Management for Next-generation Clusters

Publications

• IEEE Transactions on Parallel and Distributed Systems.

• International Parallel and Distributed Processing Symposium (IPDPS 2000)

• ACM International Conference on Supercomputing (ICS 2000)

• International Euro-par Conference (Europar 2000)

• ACM Symposium on Parallel Algorithms and Architectures (SPAA 2001)

• Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2001)

• Workshop on Computer Architecture Evaluation Using Commercial Workloads (CAECW 2002)

Page 65: Scheduling and Resource Management for Next-generation Clusters

Publications I: Batch Applications

• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and Migration. 7th Workshop on Job Scheduling Strategies for Parallel Processing.

• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. The Impact of Migration on Parallel Job Scheduling for Distributed Systems. Proceedings of 6th International Euro-Par Conference Lecture Notes in Computer Science 1900, pages 242-251, Munich, Aug/Sep 2000.

• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving Parallel Job Scheduling by combining Gang Scheduling and Backfilling Techniques. International Parallel and Distributed Processing Symposium (IPDPS'2000), pages 133-142, May 2000.

• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. A Comparative Analysis of Space- and Time-Sharing Techniques for Parallel Job Scheduling in Large Scale Parallel Systems. Submitted to IEEE Transactions on Parallel and Distributed Systems.

Page 66: Scheduling and Resource Management for Next-generation Clusters

Publications II: Interactive Applications

• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Penn State CSE tech report CSE-01-004.

• Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms. To appear in IEEE Transactions on Parallel and Distributed Systems.

• Y. Zhang, A. Sivasubramaniam, H. Franke, J. Moreira. A Simulation-based Performance Study of Cluster Scheduling Mechanisms. 14th ACM International Conference on Supercomputing (ICS'2000), pages 100-109, May 2000.

• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Submitted to ACM Transactions on Modeling and Computer Simulation (TOMACS).

Page 67: Scheduling and Resource Management for Next-generation Clusters

Publications III: Multi-class Applications

• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. The 13th Annual ACM Symposium on Parallel Algorithms and Architectures.

• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. Submitted to IEEE Transactions on Parallel and Distributed Systems.

Page 68: Scheduling and Resource Management for Next-generation Clusters

Publications IV: Database

• Y. Zhang, J. Zhang, A. Sivasubramaniam, C. Liu, H. Franke. Decision-Support Workload Characteristics on a Clustered Database Server from the OS Perspective. Penn State Technical Report CSE-01-003.

Page 69: Scheduling and Resource Management for Next-generation Clusters

Thank You !

Page 70: Scheduling and Resource Management for Next-generation Clusters

I/O Characteristics (Q6)