Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading
DESCRIPTION
Yosuke Wakisaka, Naoki Shibata, Keiichi Yasumoto, Minoru Ito, and Junji Kitamichi: ``Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading," In Proc. of the 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'14), pp. 229-235.

In this paper, we propose a task scheduling algorithm for multiprocessor systems with Turbo Boost and Hyper-Threading technologies. The proposed algorithm minimizes the total computation time, taking account of the dynamic changes of processing speed caused by the two technologies, in addition to the network contention among the processors. We constructed a clock speed model with which the changes of processing speed under Turbo Boost and Hyper-Threading can be estimated for various processor usage patterns. We then constructed a new scheduling algorithm that minimizes the total execution time of a task graph, considering network contention and the two technologies. We evaluated the proposed algorithm by simulations and experiments with a multiprocessor system consisting of 4 PCs. In the experiments, the proposed algorithm produced a schedule that reduces the total execution time by 36% compared to conventional methods, which are straightforward extensions of an existing method.

TRANSCRIPT
Yosuke Wakisaka, Naoki Shibata, Junji Kitamichi‡, Keiichi Yasumoto, Minoru Ito
Nara Institute of Science and Technology / ‡University of Aizu
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading
Background
• Multicore processors
– Widely used in distributed computing environments
• Data centers, supercomputers, ordinary PCs, mobile devices
– Dynamic control of the CPU's operating frequency
• Turbo Boost (Intel), Turbo Core (AMD), Dynamic frequency scaling (nVidia GPUs)
– Efficient computation while saving power
• Hyper-Threading (Intel), Simultaneous multithreading (IBM POWER)
Objective of Research
• Efficient task scheduling
– Utilizing the latest technologies
– Minimizing total computation time
• No known scheduling algorithm takes account of these technologies
– Dynamic scaling of operating speed by Turbo Boost and Hyper-Threading
– Delay of communication due to network contention
Related Works (1/2)
• Task scheduling algorithm taking account of network contention [1]
– Formulates communication delay between processors
– Reduces network contention
– No multicore processors
– No scaling of operating frequency

[1] O. Sinnen and L. A. Sousa: ``Communication Contention in Task Scheduling," IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 6, pp. 503-515, 2005.
Related Works (2/2)
• Task scheduling algorithm taking account of multicore processors and the fail-stop model [2]
– Considers both network contention and failure of multiple processors on a single die
– No consideration of changes of operating frequency

[2] S. Gotoda, N. Shibata and M. Ito: ``Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault," International Symposium on Cluster, Cloud, and Grid Computing (CCGrid), pp. 260-267, 2012.
Outline
1. Background
2. Related Works
3. Modelling of Computing System
4. Proposed Method
5. Evaluation
6. Conclusion
Task Graph
• A group of tasks that can be executed in parallel
• Vertex (task node)
– Task to be executed on a single CPU core
• Edge (task link)
– Data dependence between tasks
[Figure: task graph with task nodes and task links]
Processor Graph
• Topology of the computer network
• Vertex (processor node)
– CPU core (circle): has only one link
– Switch (rectangle): has more than 2 links
• Edge (processor link)
– Communication path between processors
[Figure: processor graph with processor nodes 1-3, processor links, and a switch]
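The two graphs can be written down as plain adjacency structures. A minimal sketch in Python, with hypothetical names (`task_graph`, `processor_graph`) that are not from the paper:

```python
# Task graph: task node -> set of successors that depend on its output
# (each edge is a task link, i.e. a data dependence).
task_graph = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": {"d"},
    "d": set(),
}

# Processor graph: vertex -> set of neighbours. Cores ("core*") have
# exactly one link; switches ("sw*") have more than 2 links.
processor_graph = {
    "core1": {"sw"},
    "core2": {"sw"},
    "core3": {"sw"},
    "sw": {"core1", "core2", "core3"},
}

# Check the degree conditions stated on the slide.
for v, nbrs in processor_graph.items():
    if v.startswith("core"):
        assert len(nbrs) == 1   # a core has only one link
    else:
        assert len(nbrs) >= 3   # a switch has more than 2 links
```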
Task Scheduling
• Task scheduling problem
– Assigns a processor node to each task node
– Minimizes total execution time
– An NP-hard problem
[Figure: one processor node is assigned to each task node, mapping the task graph onto the processor graph]
Turbo Boost
• Technology for accelerating multicore processors
– Scales operating frequencies according to the usage of each core
– Higher frequency if a small number of cores are working
– Lower frequency if all of the cores are working
• Assumptions
– The operating frequency depends only on the usage of each core
• It does not depend on temperature
– The frequency is switched instantly, at any time
[Figure: frequency levels of Core0-Core3 as 2, 3, and 4 cores are working, with idle cores shown]
Hyper-Threading
• Multiple hardware threads are executed on a single physical processor core
– Hardware resources are shared among the threads
– Reduces the performance penalty of pipeline stalls
• We define the effective operating frequency
– A speed index reflecting the reduction of execution speed caused by sharing resources between hardware threads
[Figure: a physical processor core with shared hardware resources and two register files, running hardware threads 1 and 2]
Modelling Frequency Control by Turbo Boost and Hyper-Threading
• Effective operating frequency depends only on
– The number of working processors
– The properties of the executed program
• We measure the operating frequencies corresponding to processor states beforehand
– Programs with different properties are executed on multiple cores at a time
– We measure the operating frequency on the real devices
Modelling Effective Frequency
• We want to predict the operating frequency from the number of working cores
– We changed the number of working cores and measured the operating frequency
• Notation: each [ x , y ] is one physical core with two hardware threads
– i: idle, c: computation-heavy, m: memory-access-heavy, n: in between c and m

Operating frequency when Turbo Boost is enabled (the processor has 4 cores / 8 hardware threads):

State of cores                          | Frequency
[ c , i ] [ i , i ] [ i , i ] [ i , i ] | to be measured
[ c , i ] [ c , i ] [ i , i ] [ i , i ] | to be measured
Results of Measurement
• When Turbo Boost is used, the effective frequency depends on the number of working cores
• When Hyper-Threading is used, the effective frequency depends on the properties of the programs

Only Turbo Boost is used (i: idle, c: computation-heavy, m: memory-access-heavy, n: in between c and m):

State of cores                          | Effective freq.
[ c , i ] [ i , i ] [ i , i ] [ i , i ] | 3.7 GHz
[ c , i ] [ c , i ] [ i , i ] [ i , i ] | 3.5 GHz
[ c , i ] [ c , i ] [ c , i ] [ i , i ] | 3.3 GHz
[ c , i ] [ c , i ] [ c , i ] [ c , i ] | 3.1 GHz
[ m , i ] [ i , i ] [ i , i ] [ i , i ] | 3.7 GHz
[ m , i ] [ m , i ] [ m , i ] [ m , i ] | 3.1 GHz
[ n , i ] [ i , i ] [ i , i ] [ i , i ] | 3.7 GHz
[ n , i ] [ n , i ] [ n , i ] [ n , i ] | 3.1 GHz

Only Hyper-Threading is used:

State     | Freq.   | Ratio | Effective freq.
[ c , c ] | 3.1 GHz | 0.84  | 2.6 GHz
[ m , m ] | 3.1 GHz | 0.76  | 2.3 GHz
[ n , n ] | 3.1 GHz | 0.79  | 2.5 GHz
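The measured tables above can be turned into a small lookup model. A minimal sketch, using the measured numbers from the tables; `effective_frequency` is a hypothetical helper name, not the paper's API, and combining the two tables multiplicatively is our assumption:

```python
# Turbo Boost: frequency (GHz) by number of busy physical cores,
# one hardware thread per core (from the "Only Turbo Boost" table).
TURBO_FREQ_GHZ = {1: 3.7, 2: 3.5, 3: 3.3, 4: 3.1}

# Hyper-Threading: slowdown ratio when both hardware threads of a core
# are busy (c: computation-heavy, m: memory-access-heavy, n: in between),
# from the "Only Hyper-Threading" table.
HT_RATIO = {"c": 0.84, "m": 0.76, "n": 0.79}

def effective_frequency(busy_cores, both_threads_busy, workload="c"):
    """Estimated effective frequency (GHz) of one busy core."""
    base = TURBO_FREQ_GHZ[busy_cores]
    return base * HT_RATIO[workload] if both_threads_busy else base

print(effective_frequency(1, False))        # one core, one thread -> 3.7
print(effective_frequency(4, True, "m"))    # roughly matches the measured 2.3 GHz
```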
Outline
1. Background
2. Related Works
3. Modelling of Computing System
4. Proposed Method
5. Evaluation
6. Conclusion
Problem in Existing Scheduling
• Tasks are assigned in order
• Assumes a fixed execution speed
• When the operating frequency is dynamically changed
– The operating frequency is high when the first task is assigned
– The frequency is reduced as more tasks are assigned to cores
[Figure: task graph and the resulting schedule on Core1-Core4, annotated with the frequencies for 1-core and 2-core execution]
[Figure: the schedule is built up step by step; the execution time of task a increases once task b is assigned, because the operating frequency drops]
Problem and Solution
• Problem
– Execution time changes as subsequent tasks are assigned
– We cannot know which core is assigned to which task until the scheduling is finished
• Solution
– Tentatively assign tasks and estimate the operating frequencies
Estimating Execution Time of Tasks
• Initial assignment of the first task
[Figure: a task graph (nodes 1-5 with communication costs) is scheduled onto a processor graph (Proc1 with Core1-Core4, and Proc2); the Gantt chart marks communication time and the Turbo Boost 1-core, Turbo Boost 2-core, and Hyper-Threading frequency levels]
• Initial assignment of the second task
[Figure: the same Gantt chart, now also containing task N2]
Tentative Assignment
• We calculate communication time using the existing method
• The operating frequency is tentatively fixed
[Figure: tasks N3, N4, and N5 are tentatively placed in the Gantt chart, together with communication blocks]
Determining Operating Frequency
• The operating frequency is determined by observing the usage of each core
[Figure: Gantt chart with tasks N1-N5; the frequency of each time slot follows the number of busy cores]
Selecting Processor to Assign
• The processor is selected so that the whole processing time of the task graph is minimized
– Compare the execution time of the whole task graph for each candidate assignment
[Figure: two candidate schedules finishing at times 9 and 11; the shorter one is chosen]
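The tentative-assignment idea on the preceding slides can be sketched as follows. This is a hypothetical simplification, not the authors' implementation: communication time and Hyper-Threading are ignored, and one Turbo Boost frequency (from the measured model) is applied to a whole candidate schedule rather than recomputed per time slot.

```python
# Measured Turbo Boost model (GHz) by number of busy physical cores.
TURBO_FREQ_GHZ = {1: 3.7, 2: 3.5, 3: 3.3, 4: 3.1}

def estimate_makespan(placement, deps, work):
    """Finish time of the whole task graph for one tentative placement.
    placement: task id -> core; deps: id -> set of predecessor ids;
    work: id -> workload in giga-cycles; ids are topologically ordered."""
    freq = TURBO_FREQ_GHZ[len(set(placement.values()))]
    finish, core_free = {}, {}
    for t in sorted(placement):
        ready = max((finish[p] for p in deps[t] if p in finish), default=0.0)
        start = max(ready, core_free.get(placement[t], 0.0))
        finish[t] = start + work[t] / freq
        core_free[placement[t]] = finish[t]
    return max(finish.values())

def schedule(tasks, deps, work, n_cores=4):
    """Greedily place each task on the core that minimizes the estimated
    makespan of the whole task graph (tentative assignment)."""
    placement = {}
    for t in tasks:                      # tasks in topological order
        best = None
        for core in range(n_cores):
            cand = {**placement, t: core}
            makespan = estimate_makespan(cand, deps, work)
            if best is None or makespan < best[0]:
                best = (makespan, core)
        placement[t] = best[1]
    return placement

# Diamond task graph 1 -> {2, 3} -> 4, workloads in giga-cycles.
deps = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}}
work = {1: 7.4, 2: 7.0, 3: 7.0, 4: 7.4}
print(schedule([1, 2, 3, 4], deps, work))
```

In this toy run the dependent chain 1 -> 2 stays on one core, keeping the Turbo Boost frequency high, while the independent task 3 is worth moving to a second core despite the lower 2-core frequency, which is exactly the trade-off the slides describe.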
Outline
1. Background
2. Related Works
3. Modelling of Computing System
4. Proposed Method
5. Evaluation
6. Conclusion
Evaluation
• Items to be evaluated
– Execution time of the whole task graph
– Accuracy of the estimation of execution time
• Compared methods
– We extended Sinnen's scheduling method in a straightforward way
• Sinnen's method only considers network contention
• We replaced its subroutine for calculating task execution times with our model
• The first method assigns tasks only to physical cores (SinnenPhysical)
• The second method assigns tasks to all hardware threads (SinnenLogical)
Configuration
• System
– CPU: Intel Core i7-3770T (2.5 GHz, 4 cores, 8 hardware threads, single socket)
– Memory: 16.0 GB
– OS: Windows 7 SP1 (64 bit)
– Java SE 1.6.0_21 (64 bit)
– Gigabit LAN subsystem using the Intel 82579V Gigabit Ethernet Controller
• Network
– 4 computers are connected with Gigabit Ethernet
– 420 Mbps bandwidth (measured on the real system)
Network Topologies for Evaluation
• Star: 16 processors, 5 switches
• Tree: 16 processors, 7 switches
• Mesh: 16 processors, 4 switches (only for simulation)
[Figure: star, tree, and mesh topologies connecting the 16 cores through switches]
Task Graphs for Evaluation
• High parallelism (Matrix Solver): 98 nodes, 67 edges
• Low parallelism (Robot Control): 90 nodes, 135 edges

T. Tobita and H. Kasahara: ``A standard task graph set for fair evaluation of multi-processor scheduling algorithms," Journal of Scheduling, Vol. 5, pp. 379-394, 2002.
Execution Time of Whole Task Graph
• Execution time is reduced by up to 35%
• Higher parallelism leads to more reduction
– Because of the greater freedom in choosing which core to assign each task
[Figure: task execution time (sec) of SinnenPhysical, SinnenLogical, and Proposed for Robot/Solver on the Tree and Star topologies; reductions are 16%, 35%, 9%, and 24%]
Accuracy of Estimation of Execution Time
• Estimation error is 4% on average, 7% at maximum
[Figure: simulated vs. measured task execution time (sec) for each method on Robot/Solver on the Tree and Star topologies; per-configuration errors range from 1% to 7%]
Conclusion
• A task scheduling method for multicore processor systems
– Considers dynamic scaling of the operating frequency by Turbo Boost and Hyper-Threading
– Considers network contention
• We modeled the frequency control of Turbo Boost and Hyper-Threading