Scheduling and Resource Management for Next-generation Clusters

Yanyong Zhang, Penn State University

www.cse.psu.edu/~yyzhang

What is a Cluster?

• Cost effective

• Easily scalable

• Highly available

• Readily upgradeable

Scientific & Engineering Applications

• HPTi wins a 5-year, $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm)

• Sandia's expansion of their Alpha-based C-plant system.

• Maui HPCC LosLobos Linux Super-cluster (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)

• A performance-price ratio of … is demonstrated in simulations of wind instruments using a cluster of 20 …. (http://www.swiss.ai.mit.edu/~pas/p/sc95.html)

• The PC cluster based parallel simulation environment and the technologies … will have a positive impact on networking research nationwide …. (http://www.osc.edu/press/releases/2001/approved.shtml)

Commercial Applications

• Business applications
– Transaction Processing (IBM DB2, Oracle …)
– Decision Support System (IBM DB2, Oracle …)

• Internet applications
– Web serving / searching (Google.com …)
– Infowares (yahoo.com, AOL.com)
– Email, eChat, ePhone, eBook, eBank, eSociety, eAnything
– Computing portal

Resource Management

• Each application is demanding
• Several applications/users can be present at the same time

Resource management and quality of service become important.

System Model

• Each node is independent
• Maximum MPL
• Arrival queue

[Figure: an arrival queue feeding nodes P0, P1, P2, P3, P4, which are connected by a high-speed network]

Two Phases in Resource Management

• Allocation Issues
– Admission Control
– Arrival Queue Principle

• Scheduling Issues (CPU Scheduling)
– Resource Isolation
– Co-allocation

Co-allocation / Co-scheduling

[Figure: timeline (t0 to t1) for P0 and P1 showing a SEND, a RECV, and a context switch; labeled "Scheduling skewness"]

Outline

• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers (NEXT)
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications

• From application’s perspective
– Contribution 4: optimizing clustered DB2

Contribution 1: Boosting CPU Utilization at Supercomputing Centers

Objective

Response Time = Wait Time + Execute Time
• Wait in the arrival Q
• Wait in the ready/blocked Q

slowdown = Response Time / Execute Time in Isolation

Goal: minimize
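The slowdown metric above can be illustrated with a minimal sketch. The `Job` record and its field names are hypothetical, and the example assumes that each job's isolated execution time is known.

```python
from dataclasses import dataclass


@dataclass
class Job:
    arrival: float       # time the job entered the arrival queue
    finish: float        # time the job completed
    isolated_run: float  # execution time when run alone (assumed known)


def response_time(job: Job) -> float:
    # Response time covers all waiting plus execution.
    return job.finish - job.arrival


def slowdown(job: Job) -> float:
    # slowdown = response time / execute time in isolation
    return response_time(job) / job.isolated_run


def mean_slowdown(jobs: list) -> float:
    return sum(slowdown(j) for j in jobs) / len(jobs)


jobs = [Job(arrival=0.0, finish=30.0, isolated_run=10.0),
        Job(arrival=5.0, finish=25.0, isolated_run=20.0)]
print(mean_slowdown(jobs))
```

A slowdown of 1.0 means the job ran as fast as it would have in isolation; larger values reflect queueing and time-sharing delays.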

Existing Techniques

• Back Filling (BF)
• Gang Scheduling (GS)
• Migration (M)

[Figure: space-time diagram of jobs scheduled on a machine with # of CPUs = 14]

Proposed Scheme

• MBGS = GS + BF + M
– Use GS as the basic framework
– At each row of GS matrix, apply BF technique
– Whenever GS matrix is re-calculated, M should be considered.
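To make the idea of a GS matrix concrete, here is a deliberately simplified sketch: first-fit packing of queued jobs into the rows (time slices) of a gang-scheduling matrix. It is not MBGS itself. It omits the reservations that real backfilling uses to protect the queue head, and it omits migration. The function name, the job tuples, and the `max_mpl` parameter are assumptions made for illustration.

```python
def pack_gang_matrix(queue, num_cpus, max_mpl):
    """First-fit packing of jobs into gang-scheduling rows.

    queue:    list of (job_id, cpus_needed) in arrival order
    num_cpus: CPUs per row (columns of the matrix)
    max_mpl:  maximum number of rows (multiprogramming level)
    Returns (rows, waiting): rows hold the placed job ids and the free
    CPU count; waiting lists jobs that stay in the arrival queue.
    """
    rows = []
    waiting = []
    for job_id, need in queue:
        if need > num_cpus:
            waiting.append(job_id)  # can never fit in a single row
            continue
        for row in rows:
            if row["free"] >= need:  # fill a hole in an existing row
                row["jobs"].append(job_id)
                row["free"] -= need
                break
        else:
            if len(rows) < max_mpl:  # open a new time slice
                rows.append({"free": num_cpus - need, "jobs": [job_id]})
            else:
                waiting.append(job_id)
    return rows, waiting


rows, waiting = pack_gang_matrix(
    [("a", 8), ("b", 8), ("c", 6), ("d", 4), ("e", 5)],
    num_cpus=14, max_mpl=2)
```

In this example jobs c and d fill the holes left by a and b, while e has to wait because both rows are too full and the MPL limit prevents a third row.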

How Does MBGS Perform?







Contribution 2: Reducing Response Times for Commercial Applications

Objective

Response Time = Wait Time + Execute Time
• Wait in the arrival Q
• Wait in the ready/blocked Q

• Minimize wait time
• Minimize response time

Previous Work I: Gang Scheduling (GS)

GS is not responsive enough!

[Figure: GS schedule with annotations (1) and (2), "MINUTES!", and "wasted"]

Previous Work II: Dynamic Co-scheduling

[Figure: nodes P0, P1, P2, P3 running B, D, A, C; annotations: "B just gets a msg", "Everybody else is blocked", "It’s A’s turn", "C just finishes I/O"]

The scheduler on each node makes independent decisions based on local events, without global synchronization.

Dynamic Co-scheduling Heuristics

Columns: How do you wait for a message?
Rows: What do you do on message arrival?

                          Busy Wait   Spin Block   Spin Yield
No Explicit Reschedule    Local       SB           SY
Interrupt & Reschedule    DCS         DCS-SB       DCS-SY
Periodically Reschedule   PB          PB-SB        PB-SY
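As an illustration of the Spin Block (SB) waiting strategy, the following sketch spins for a bounded time and then blocks. It is a user-level approximation with Python threads, not the mechanism evaluated in these slides; `spin_block_wait` and its parameters are hypothetical names.

```python
import threading
import time


def spin_block_wait(msg_arrived: threading.Event, spin_seconds: float) -> str:
    """Wait for a message: poll for up to spin_seconds, then block.

    Returns "spun" if the message arrived while spinning, else "blocked".
    """
    deadline = time.monotonic() + spin_seconds
    while time.monotonic() < deadline:
        if msg_arrived.is_set():  # message arrived while still on the CPU
            return "spun"
    msg_arrived.wait()            # give up the CPU until the message arrives
    return "blocked"


# A message that is already present is picked up during the spin phase.
ready = threading.Event()
ready.set()
print(spin_block_wait(ready, spin_seconds=0.01))

# With no spin budget, the waiter blocks until a later arrival wakes it.
late = threading.Event()
threading.Timer(0.05, late.set).start()
print(spin_block_wait(late, spin_seconds=0.0))
```

The spin budget is the tuning knob: too short and the process pays a context switch for messages that were about to arrive, too long and it wastes CPU time that other jobs could use.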

Simulation Study

• A detailed simulator at a microsecond granularity

• System parameters
– System configurations (maximum MPL, to partition or not)
– System overheads (context switch overheads, interrupt costs, costs associated with manipulating queues)

Simulation Study (Cont’d)

• Application parameters
– Injection load
– Characteristics (CPU intensive, I/O intensive, communication intensive, or somewhere in the middle)

Impact of Load

Impact of Workload Characteristics

[Figure panels: Comm intensive; I/O intensive]

Periodic Boost Heuristics

• S1: Compute Phase
• S2: S1 + Unconsumed Msg.
• S3: Recv. + Msg. Arrived
• S4: Recv. + No Msg.

• A: S3 -> {S2, S1}
• B: S3 -> S2 -> S1
• C: {S3, S2, S1}
• D: {S3, S2} -> S1
• E: S2 -> S3 -> S1

[Figure: Average Job Response Time (x10000 seconds, axis from 2.3 to 2.9) for heuristics A, B, C, D, E]

Analytical Modeling Study

• The state space is impossible to handle.

[Figure: nodes P0, P1, P2, P3, …, Pp connected by a high-speed network, with dynamic arrival of jobs]

Analysis Description

Original State Space (impossible to handle!!)

[Equation: definition of the full state vector X over all nodes k = 1, …, P; annotations: "number of nodes", "Number of jobs on node k"]

Assumption: The state of each processor is stochastically independent and identical to the state of the other processors.

Reduced State Space (much more tractable!!)

[Equation: definition of the reduced, per-processor state vector Y]

Analysis Description (Cont’d)

• Address the state transition rates using a continuous-time Markov model; build the generator matrix Q.

• Get the invariant probability vector π by solving πQ = 0 and πe = 1.

• Use fixed-point iteration to get the solution.
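For a fixed generator matrix, the invariant vector can be obtained by solving the balance equations πQ = 0 together with the normalization πe = 1. The sketch below does this with plain Gaussian elimination on a toy two-state chain. It is only an illustration: the rates are made up, and the model in these slides additionally needs fixed-point iteration, presumably because its transition rates depend on the solution itself.

```python
def stationary_distribution(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for a small generator matrix Q."""
    n = len(Q)
    # pi * Q = 0 is equivalent to Q^T * pi^T = 0. One balance equation is
    # redundant, so the last one is replaced by the normalization row.
    A = [[float(Q[j][i]) for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]

    # Back substitution.
    pi = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][c] * pi[c] for c in range(i + 1, n))
        pi[i] = (b[i] - tail) / A[i][i]
    return pi


# Toy chain: state 0 -> 1 at rate 2, state 1 -> 0 at rate 3.
Q = [[-2.0, 2.0],
     [3.0, -3.0]]
pi = stationary_distribution(Q)
print(pi)
```

For this toy chain the balance condition 2·π0 = 3·π1 together with π0 + π1 = 1 gives π = (0.6, 0.4).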

SB Example

[Figure: state-transition diagram for SB with two jobs, 1 and 2, whose per-job states are C, IO, SN, SP, and B, together with the expressions for the transition rates r1, r1’, and r2]

Results

[Figure panels: Optimal PB Frequency; Optimal Spin Time for SB]

Results – Optimal Quantum Length

[Figure panels: Comm Intensive; CPU Intensive; I/O Intensive]







Contribution 3: Scheduling Multiple Classes of Applications

[Figure: real-time, interactive, and batch applications sharing a cluster]

Objective

• BE: How long did it take me to finish? (Response time)
• RT: How many deadlines have been missed? (Miss rate)

Fairness Ratio (x:y)

• RT receives x/(x+y) of the cluster resource
• BE receives y/(x+y) of the cluster resource
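The fairness ratio translates into resource shares by simple arithmetic. The sketch below computes the two shares and, as one hypothetical way to apply them, splits a scheduling cycle of fixed-size quanta between the RT and BE classes. The function names and the quantum-based split are assumptions, not the mechanism used in the slides.

```python
from fractions import Fraction


def class_shares(x: int, y: int):
    """Return the (RT, BE) shares for a fairness ratio x:y."""
    total = x + y
    return Fraction(x, total), Fraction(y, total)


def split_quanta(x: int, y: int, quanta_per_cycle: int):
    """Divide one scheduling cycle between RT and BE.

    RT's allotment is rounded down; BE receives the remainder.
    """
    rt_share, _ = class_shares(x, y)
    rt_quanta = int(rt_share * quanta_per_cycle)
    return rt_quanta, quanta_per_cycle - rt_quanta


print(class_shares(2, 1))
print(split_quanta(2, 1, 9))
```

With x:y = 2:1, RT is entitled to two thirds of the cluster and BE to one third.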

How to Adhere to Fairness Ratio?

[Figure: time-versus-node (P0, P1) schedules of RT1, RT2, and BE under three schemes, 1GS, 2DCS-TDM, and 2DCS-PS, for x:y = 2:1]

BE response time

[Figure panels: RT : BE = 2:1; RT : BE = 1:9; RT : BE = 9:1]

RT Deadline Miss Rate

[Figure panels: RT : BE = 2:1; RT : BE = 1:9; RT : BE = 9:1]

Outline

• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications

• From application’s perspective
– Characterizing decision support workloads on the clustered database server (NEXT)
– Resource management for transaction processing workloads on the clustered database server

Experiment Setup

• IBM DB2 Universal Database for Linux, EEE, Version 7.2

• A Linux/Pentium cluster of 8 dual nodes, with 256 MB RAM and 18 GB disk on each node.

• TPC-H workload. Queries are run sequentially (Q1 – Q20). Completion time for each query is measured.

[Figure: platform with a client and server nodes connected by Myrinet. Rows 001A, 002B, 003C, 004D of Table T are spread across the nodes; a coordinator node handles "Select * from T" in steps 1 to 5.]

Methodology

• Identify the components with high system overhead.

• For each such component, characterize the request distribution.

• Devise optimizations for each component.

• Quantify potential benefits from the optimization.

Sampling OS Statistics

• Sample the statistics provided by stat, net/dev, process/stat.
– User/system CPU %
– # of page faults
– # of blocks read/written
– # of reads/writes
– # of packets sent/received
– CPU utilization during I/O
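A minimal sketch of how user/system CPU percentages could be derived from two samples, assuming that "stat" refers to Linux's /proc/stat and using only the first four counters of its aggregate "cpu" line (user, nice, system, idle). The helper names are hypothetical, and the sample lines are made-up values rather than measurements from this study.

```python
def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat (first four counters)."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    user, nice, system, idle = (int(v) for v in fields[1:5])
    return {"user": user, "nice": nice, "system": system, "idle": idle}


def cpu_percentages(before: dict, after: dict) -> dict:
    """Percentage of time spent in each mode between two samples."""
    delta = {key: after[key] - before[key] for key in before}
    total = sum(delta.values())
    if total == 0:
        return {key: 0.0 for key in delta}
    return {key: 100.0 * value / total for key, value in delta.items()}


# Two made-up samples taken some interval apart.
sample_1 = parse_cpu_line("cpu  100 0 50 850")
sample_2 = parse_cpu_line("cpu  160 0 70 970")
print(cpu_percentages(sample_1, sample_2))
```

On a live Linux system the two lines would come from reading the first line of /proc/stat at the start and end of the sampling interval.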

Kernel Instrumentation

• Instrument each system call in the kernel.

Enter system call -> block -> unblock -> resume execution -> exit system call

Operating System Profile

• A considerable part of the execution time is taken by the pread system call.

• There is good overlap of computation with I/O for some queries.

• More reads than writes.

TPC-H pread Overhead

Query   % of exe time   Query   % of exe time
Q6      20.0            Q13     10.0
Q14     19.0            Q3      9.6
Q19     16.9            Q4      9.1
Q12     15.4            Q18     9.0
Q15     13.4            Q20     7.9
Q7      12.1            Q2      5.2
Q17     10.8            Q9      5.2
Q8      10.5            Q5      4.6
Q10     10.3            Q16     4.1
Q1      10.0            Q11     3.5

pread overhead = # of preads × overhead per pread.

pread Optimization

[Figure: user space, page cache, and page table, with steps 1 and 2]

pread(dest, chunk) {
    for each page in the chunk {
        if the page is not in cache {
            bring it in from disk
        }
        copy the page into dest
    }
}

Optimization:
• Re-mapping the buffer
• Copy on write

30s

Copy-on-write

[Figure: user space mapped onto the page cache, read only]

Query   % reduction   Query   % reduction
Q1      98.9          Q11     96.1
Q2      85.7          Q12     87.1
Q3      96.0          Q13     100.0
Q4      80.9          Q14     96.1
Q5      100.0         Q15     96.8
Q6      100.0         Q16     70.7
Q7      79.7          Q17     94.5
Q8      79.3          Q18     100.0
Q9      88.7          Q19     95.7
Q10     77.8          Q20     94.4

% reduction = 1 - (# of copy-on-write / # of preads)

Operating System Profile

• Socket calls are the next dominant system calls.

Message Characteristics

[Figure: for Q11 and Q16, plots of Message Size (bytes), Message Inter-injection Time (milliseconds), and Message Destination]

Observations on Messages

• Only a small set of message sizes is used.

• Many messages are sent in a short period.

• Message destination distribution is uniform.

• Many messages are point-to-point implementations of multicast/broadcast messages.

• Multicast can reduce # of messages.

Potential % Reduction in Messages

query   total   small   large   query   total   small   large
Q1      44.7    71.4    38.7    Q11     9.6     28.6    0.1
Q2      20.4    58.7    0.2     Q12     8.3     7.8     2.9
Q3      48.2    64.3    38.0    Q13     24.5    75.2    0.1
Q4      22.6    58.6    0.1     Q14     27.9    80.4    0.7
Q5      8.0     7.1     8.4     Q15     46.6    56.5    0.7
Q6      76.4    78.6    45.5    Q16     59.1    63.0    56.9
Q7      57.5    71.4    56.2    Q17     41.5    66.7    27.3
Q8      29.1    75.5    4.8     Q18     11.4    32.3    0.0
Q9      66.8    78.5    61.1    Q19     26.7    79.4    0.2
Q10     25.0    73.6    0.1     Q20     21.1    62.8    0.1

Online Algorithm

Original point-to-point send:

Send(msg, dest) {
    send msg to node dest;
}

Send with buffering for multicast:

Send(msg, dest) {
    if (msg == buffered_msg && dest ∉ dest_set)
        dest_set = dest_set ∪ {dest};
    else
        buffer the msg;
}

Send_bg() {
    foreach buffered_msg
        if (it has been buffered longer than threshold)
            send multicast msg to nodes in dest_set;
}
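The buffering scheme in the pseudocode can be sketched in Python as follows. This is an illustration under assumptions, not the implementation measured in the slides: the class name, the `send_multicast` callback, and the explicit `now` timestamps are all hypothetical, and a real system would drive `send_bg` from a timer.

```python
class MulticastBuffer:
    """Buffer identical messages and flush them as one multicast."""

    def __init__(self, threshold, send_multicast):
        self.threshold = threshold            # max buffering time
        self.send_multicast = send_multicast  # callback(msg, dests), assumed
        self.pending = []                     # buffered entries

    def send(self, msg, dest, now):
        # Merge into an existing entry when the same message is headed to
        # a destination that entry does not cover yet.
        for entry in self.pending:
            if entry["msg"] == msg and dest not in entry["dests"]:
                entry["dests"].add(dest)
                return
        # Otherwise buffer the message as a new entry.
        self.pending.append({"msg": msg, "since": now, "dests": {dest}})

    def send_bg(self, now):
        # Flush every entry that has waited longer than the threshold.
        still_waiting = []
        for entry in self.pending:
            if now - entry["since"] > self.threshold:
                self.send_multicast(entry["msg"], sorted(entry["dests"]))
            else:
                still_waiting.append(entry)
        self.pending = still_waiting


sent = []
buf = MulticastBuffer(threshold=5, send_multicast=lambda m, d: sent.append((m, d)))
buf.send("x", 1, now=0)
buf.send("x", 2, now=1)
buf.send("y", 3, now=4)
buf.send_bg(now=6)   # only "x" has waited longer than the threshold
buf.send_bg(now=10)  # now "y" is flushed as well
print(sent)
```

The threshold trades latency for message count: a larger value merges more point-to-point sends into one multicast but delays every buffered message by up to that amount.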

Impact of Threshold

[Figure panels: Q7; Q16. x-axis: Threshold (milliseconds)]





• From application’s perspective
– Characterizing decision support workloads on the clustered database server
– Resource management for clustered database applications (NEXT)

Ongoing/Near-term Work

• What is the optimal number of jobs which should be admitted?

• Can we dynamically pause some processes based on resource requirement and resource availability?

• Which dynamic co-scheduling scheme works best here?

• How do we exploit application level information in scheduling?

Future Work

• Some next-generation applications
– Real time medical imaging and collaborative surgery

Application requirements:
• VAST processing power, disk capacity and network bandwidth
• absolute availability
• deterministic performance

Future Work

– E-business on demand

Requirements:
• performance (more users, responsive, quality of service)
• availability
• security
• power consumption
• pricing model

Future Work

• What does it take to get there?
– Hardware innovations
– Resource management and isolation
– Good scalability
– High availability
– Deterministic performance

Future Work

• Not only high performance
– Energy consumption
– Security
– Pricing for service
– User satisfaction
– System management
– Ease of use

Related Work

• Parallel job scheduling:
– Gang Scheduling ([Ousterhout82])
– Backfilling ([Lifka95], [Feitelson98])
– Migration ([Epima96])

• Dynamic co-scheduling:
– Spin Block ([Arpaci-Dusseau98], [Anglano00])
– Periodic Boost ([Nagar99])
– Demand-based Coscheduling ([Sobalvarro97])

Related Work (Cont’d)

• Real-time scheduling:
– Earliest Deadline First
– Rate Monotonic
– Least Laxity First

• Single node multi-class scheduling
– Hierarchical scheduling ([Goyal96])
– Proportional share ([Waldspurger95])

• Commercial clustered server ([Pai98], reserve)

Related Work (Cont’d)

• Commercial Workloads (CAECW, [Barford99], [Kant99])

• Database Characterizing ([Keeton99], [Ailamaki99], [Rosenblum97])

• OS support for database ([Stonebraker81], [Gray78], [Christmann87])

• Reducing copies in IO ([Pai00], [Druschel93], [Thadani95])

Publications

• IEEE Transactions on Parallel and Distributed Systems
• International Parallel and Distributed Processing Symposium (IPDPS 2000)
• ACM International Conference on Supercomputing (ICS 2000)
• International Euro-Par Conference (Euro-Par 2000)
• ACM Symposium on Parallel Algorithms and Architectures (SPAA 2001)
• Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2001)
• Workshop on Computer Architecture Evaluation Using Commercial Workloads (CAECW 2002)

Publications I: Batch Applications

• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and Migration. 7th Workshop on Job Scheduling Strategies for Parallel Processing.

• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. The Impact of Migration on Parallel Job Scheduling for Distributed Systems. Proceedings of 6th International Euro-Par Conference Lecture Notes in Computer Science 1900, pages 242-251, Munich, Aug/Sep 2000.

• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving Parallel Job Scheduling by combining Gang Scheduling and Backfilling Techniques. International Parallel and Distributed Processing Symposium (IPDPS'2000), pages 133-142, May 2000.

• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. A Comparative Analysis of Space- and Time-Sharing Techniques for Parallel Job Scheduling in Large Scale Parallel Systems. Submitted to IEEE Transactions on Parallel and Distributed Systems.

Publications II: Interactive Applications

• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Penn State CSE tech report CSE-01-004.

• Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms. To appear in IEEE Transactions on Parallel and Distributed Systems.

• Y. Zhang, A. Sivasubramaniam, H. Franke, J. Moreira. A Simulation-based Performance Study of Cluster Scheduling Mechanisms. 14th ACM International Conference on Supercomputing (ICS'2000), pages 100-109, May 2000.

• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Submitted to ACM Transactions on Modeling and Computer Simulation (TOMACS).

Publications III: Multi-class Applications

• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. 13th Annual ACM Symposium on Parallel Algorithms and Architectures.

• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. Submitted to IEEE Transactions on Parallel and Distributed Systems.

Publications IV: Database

• Y. Zhang, J. Zhang, A. Sivasubramaniam, C. Liu, H. Franke. Decision-Support Workload Characteristics on a Clustered Database Server from the OS Perspective. Penn State Technical Report CSE-01-003.

Thank You !

I/O Characteristics (Q6)
