topology-aware job allocation on hpc system

Topology-aware Job allocation on HPC system Xu Yang ID: A20280429 Email: [email protected]



Outline

I. Motivation

II. Solutions

(1) Dimensional Ordering

(2) Space Filling Curve Ordering

III. Evaluation Metrics

IV. Experimental Results

What is job allocation on HPC?

• An HPC system consists of hundreds or thousands of processors, organized hierarchically:

Processors -> Nodes -> Midplanes (or blade cards) -> Racks (chassis) -> Cabinets

• Jobs submitted to an HPC system request different numbers of processors.

Why we care about job allocation

• HPC network resources are limited, especially bandwidth and connections (routing paths).

• Communication in HPC is expensive, more expensive than computation.

Table 1: Byte-to-flop ratios

IBM system     Bytes/flop     Cray system    Bytes/flop
Blue Gene/L    0.375          XT3            8.77
Blue Gene/P    0.375          XT4            1.36
Blue Gene/Q    0.117          XT5            0.23

With each new generation, the interconnection network is able to communicate fewer and fewer bytes for each flop on the node.

Topology-aware job scheduling and allocation will therefore be of great importance for HPC systems.

Currently, only 6% of the Top500 machines (primarily the IBM Blue Gene series) provide contiguous node allocation for their jobs.

Contiguous vs. Non-Contiguous Job Allocation

Contiguous

Pros:
• Low communication cost / network contention

Cons:
• Low system utilization
• Long wait time
• Fragmentation

Non-Contiguous

Pros:
• High system utilization
• Short wait time
• No fragmentation

Cons:
• High communication cost / network contention

Processor Ordering: Sequence of Allocation

Consider a 3D torus topology with dimensions w, l, d. Each node has an index ind and coordinates (x, y, z).

1. Dimensional Ordering

ind = z*w*l + y*w + x

2. Space Filling Curve (Hilbert Curve) Ordering

ind = H(x, y, z) = (h(x), h(y), h(z))
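The dimensional ordering formula can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `dimensional_index` and `dimensional_coords` are hypothetical, and the 4 x 4 x 4 torus size is an assumed example.

```python
def dimensional_index(x, y, z, w, l):
    """Index of node (x, y, z) under dimensional ordering.

    x varies fastest, then y, then z, as in ind = z*w*l + y*w + x.
    """
    return z * w * l + y * w + x


def dimensional_coords(ind, w, l):
    """Inverse mapping: recover (x, y, z) from a dimensional-ordering index."""
    z, rest = divmod(ind, w * l)
    y, x = divmod(rest, w)
    return x, y, z


# Assumed example: a 4 x 4 x 4 torus. The first eight indices cover two
# full rows of the z = 0 plane, i.e. a flat slab rather than a compact cube.
w, l, d = 4, 4, 4
first_eight = [dimensional_coords(i, w, l) for i in range(8)]
print(first_eight)
```

Because x is exhausted before y and y before z, a job that takes consecutive indices tends to receive a flat, planar block of nodes.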

[Figure: example node index numbering, indices 0 to 15]

colored by job and illustrates the planar and fragmented nature of the default selection algorithm.

The new node selection algorithm was designed to select nodes in a cubic geometry by using a node ordering mask, a static, total ordering of all compute nodes, constructed by taking the shortest path through the machine from node to node. The mask was then used to order free nodes on each scheduling cycle, assigning the first N nodes from this list to a job requiring N nodes. The reader is encouraged to view an animation [7] illustrating the construction of the node ordering mask by comparing the physical and wired views of the machine as nodes are added to the mask.

Ordering the list of free nodes according to this mask is computationally no more expensive than sorting them numerically, so there is no additional overhead in using this new algorithm.
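The mask-based selection described above can be sketched as follows. This is an illustrative sketch only: `allocate`, `rank_in_mask`, and the eight-node example mask are hypothetical names and data, not taken from the excerpt.

```python
def allocate(free_nodes, rank_in_mask, n):
    """Give a job the first n free nodes in mask order.

    free_nodes   -- set of currently free node ids (modified in place)
    rank_in_mask -- dict mapping node id to its position in the static mask
    Returns the chosen node ids, or None if fewer than n nodes are free.
    """
    if n > len(free_nodes):
        return None
    # Sorting by mask position costs the same as sorting by node id.
    ordered = sorted(free_nodes, key=rank_in_mask.__getitem__)
    chosen = ordered[:n]
    free_nodes.difference_update(chosen)
    return chosen


# Assumed example mask: a static total ordering of eight node ids.
mask = [0, 1, 3, 2, 6, 7, 5, 4]
rank_in_mask = {node: pos for pos, node in enumerate(mask)}

free = {0, 2, 3, 5, 7}
job_nodes = allocate(free, rank_in_mask, 3)
print(job_nodes, free)
```

Only the sort key differs from allocation by numeric node id, which is why the mask adds no extra scheduling cost.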

Figure 5. Xt3dmon wired view showing planar nature of default node selection algorithm leading to non-contiguous node assignment within a job. Jobs are color coded. Service nodes are yellow.

To illustrate the node selection differences between the default and new algorithms on a set of real jobs, a time lapse animation [8] has been produced that shows a six hour window starting from an empty state on the machine. This animation contrasts the differences between the two algorithms on the same set of jobs and shows how larger jobs generally get contiguous nodes in a cubic geometry using the new algorithm, while jobs using the old default node id ordering algorithm have a more planar and non-contiguous geometry. Figures 5 and 6 also help to illustrate these differences.

4.0 System Changes to Benefit Specific Jobs

The changes detailed in section 3 were made to help improve interconnect performance for all jobs. In this section, system changes to accommodate applications that understand the machine topology and that can assign tasks to take advantage of node proximity will be reviewed.

For these topology-aware codes, each must be given a specific geometry or shape. In addition, the codes must know the coordinates of the nodes that have been assigned so that they may assign tasks appropriately.

Figure 6. Xt3dmon wired view showing cubic nature of new node selection algorithm leading to contiguous node assignment within a job. Jobs are color coded. Service nodes are yellow.

Figure 7. Xt3dmon wired view showing an 8x8x8 node job allocation in red.

4.1 OpenAtom

OpenAtom is a quantum chemistry code that is highly communications bound and its performance is highly influenced by placement on a torus topology machine [9]. The goal of the researchers working with this code on BigBen is to minimize the communication volume of

I. Dimensional Ordering

Figure: The dimensional ordering job allocation algorithm leads to non-contiguous node assignment within a job. Jobs are color coded.

II. Hilbert Curve Ordering

[Figures: Hilbert curve on a 2D mesh; Hilbert curve on 3D meshes of size 2 x 2 x 2, 4 x 4 x 4, and 8 x 8 x 8]
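For the 2D mesh case, a Hilbert index can be computed with the widely used bit-manipulation method. The sketch below is a standard textbook formulation, not necessarily the variant used in the talk, and it does not cover the 3D case; the function names `hilbert_index_2d` and `hilbert_order_2d` are hypothetical, and the mesh side n is assumed to be a power of two.

```python
def hilbert_index_2d(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n mesh.

    Assumes n is a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order_2d(n):
    """All cells of an n x n mesh, sorted by Hilbert index."""
    cells = [(x, y) for x in range(n) for y in range(n)]
    return sorted(cells, key=lambda c: hilbert_index_2d(n, c[0], c[1]))


order = hilbert_order_2d(4)
print(order[:4])
```

Consecutive cells along the curve are always mesh neighbors, so a job that takes a run of consecutive indices receives a compact, roughly square region instead of a thin strip.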

Evaluation Metrics

Parameter    Geo-metric
α1           Average pairwise distance (m1)
α2           Diameter (m2)
α3           Max dimension (m3)
α4           Distance between logical neighbors (m4)

• Each αi is obtained by running benchmarks on Blue Gene/Q

• Penalty function: p = ∑ αi · mi
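The four geo-metrics and the penalty function can be sketched for a single job's allocation on a 3D torus. Several details here are assumptions, since the slides do not define them precisely: distance is taken as hop count with wraparound, max dimension as the largest bounding-box extent (ignoring wraparound), and distance between logical neighbors as the mean distance between consecutive ranks. The weights passed to `penalty` are placeholders, not the benchmark-derived αi values.

```python
from itertools import combinations


def torus_distance(a, b, dims):
    """Hop count between nodes a and b on a torus with wraparound links."""
    return sum(min(abs(p - q), size - abs(p - q))
               for p, q, size in zip(a, b, dims))


def geo_metrics(nodes, dims):
    """Return (m1, m2, m3, m4) for a job; nodes are in rank order.

    Requires at least two nodes.
    """
    pair_d = [torus_distance(a, b, dims) for a, b in combinations(nodes, 2)]
    m1 = sum(pair_d) / len(pair_d)                      # average pairwise distance
    m2 = max(pair_d)                                    # diameter
    m3 = max(max(c) - min(c) + 1 for c in zip(*nodes))  # max dimension (assumed)
    neigh = [torus_distance(a, b, dims) for a, b in zip(nodes, nodes[1:])]
    m4 = sum(neigh) / len(neigh)                        # logical neighbors (assumed)
    return m1, m2, m3, m4


def penalty(metrics, alphas):
    """p = sum of alpha_i * m_i."""
    return sum(a * m for a, m in zip(alphas, metrics))


dims = (8, 8, 8)
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]  # compact 2 x 2 block
line = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]    # 4 x 1 strip
alphas = (1.0, 1.0, 1.0, 1.0)                          # placeholder weights
print(penalty(geo_metrics(square, dims), alphas))
print(penalty(geo_metrics(line, dims), alphas))
```

With equal placeholder weights, the compact block scores a lower penalty than the strip, which is the behavior the penalty function is meant to reward.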

Communication Pattern

• Broadcast

Communication pattern    Dominant metric
All-to-All               Average pairwise distance
One-to-All               Diameter

• P2P

Communication pattern    Dominant metric
Nearest Neighbor         Distance between neighbors

Traces and Evaluation

I. SDSC Blue

• System: IBM SP at SDSC; 144 nodes; 1,152 processors
• Duration: Apr 2000 to May 2000
• Jobs: 2,440

SDSC-BLUE Trace

[Figures, HSFC vs DO, one point per job: average pairwise distance difference; max dimension difference; diameter difference; job runtime improvement (%)]

• The horizontal axis is the id of each job
• HSFC: Hilbert Space Filling Curve Ordering
• DO: Dimensional Ordering

II. LLNL Thunder

• System: Linux cluster (Thunder) at LLNL; 1,024 nodes; 4,096 processors
• Duration: Feb 2007 to Mar 2007
• Jobs: 2,662

LLNL Thunder Trace

[Figures, HSFC vs DO, one point per job: average pairwise distance difference; max dimension difference; diameter difference; job runtime improvement (%)]

• The horizontal axis is the id of each job
• HSFC: Hilbert Space Filling Curve Ordering
• DO: Dimensional Ordering

Thank You

Q&A