Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading
DESCRIPTION
Yosuke Wakisaka, Naoki Shibata, Keiichi Yasumoto, Minoru Ito, and Junji Kitamichi: ``Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading," In Proc. of the 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'14), pp. 229-235.

In this paper, we propose a task scheduling algorithm for multiprocessor systems with Turbo Boost and Hyper-Threading technologies. The proposed algorithm minimizes the total computation time, taking account of the dynamic changes of processing speed caused by the two technologies, in addition to the network contention among the processors. We constructed a clock speed model with which the changes of processing speed under Turbo Boost and Hyper-Threading can be estimated for various processor usage patterns. We then constructed a new scheduling algorithm that minimizes the total execution time of a task graph, considering network contention and the two technologies. We evaluated the proposed algorithm by simulations and experiments with a multiprocessor system consisting of 4 PCs. In the experiments, the proposed algorithm produced a schedule that reduces the total execution time by 36% compared to conventional methods, which are straightforward extensions of an existing method.

TRANSCRIPT
Yosuke Wakisaka, Naoki Shibata, Junji Kitamichi‡, Keiichi Yasumoto, Minoru Ito
Nara Institute of Science and Technology / ‡University of Aizu
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading
Background
• Multicore processors
– Widely used in distributed computing environments
• Data centers, supercomputers, ordinary PCs, mobile devices
– Dynamic control of the CPU's operating frequency
• Turbo Boost (Intel), Turbo Core (AMD), Dynamic frequency scaling (nVidia GPUs)
– Efficient computation while saving power
• Hyper-Threading (Intel), Simultaneous multithreading (IBM POWER)
Objective of Research
• Efficient task scheduling
– Utilizing the latest technologies
– Minimizing total computation time
• No known scheduling algorithm takes account of these technologies
– Dynamic scaling of operating speed by Turbo Boost and Hyper-Threading
– Delay of communication due to network contention
Related Works (1/2)
• Task scheduling algorithm taking account of network contention [1]
– Formulates communication delay between processors
– Reduces network contention
– No multicore processors
– No scaling of operating frequency

[1] O. Sinnen and L. A. Sousa: ``Communication Contention in Task Scheduling," IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 6, pp. 503-515, 2005.
Related Works (2/2)
• Task scheduling algorithm taking account of multicore processors and the fail-stop model [2]
– Considers both network contention and failure of multiple processors on a single die
– No consideration of changes of operating frequency

[2] S. Gotoda, N. Shibata and M. Ito: ``Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault," International Symposium on Cluster, Cloud, and Grid Computing (CCGrid), pp. 260-267, 2012.
Outline
1. Background
2. Related Works
3. Modelling of Computing System
4. Proposed Method
5. Evaluation
6. Conclusion
Task Graph
• A group of tasks that can be executed in parallel
• Vertex (task node)
– Task to be executed on a single CPU core
• Edge (task link)
– Data dependence between tasks
[Figure: task graph with task nodes and task links]
Processor Graph
• Topology of the computer network
• Vertex (processor node)
– CPU core (circle): has only one link
– Switch (rectangle): has more than 2 links
• Edge (processor link)
– Communication path between processors
[Figure: processor graph with processor nodes 1-3, processor links, and a switch]
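The two graphs can be written down as plain adjacency structures. A minimal sketch in Python, with hypothetical names (`task_graph`, `processor_graph`) that are not from the paper:

```python
# Task graph: task node -> set of successors that depend on its output
# (each edge is a task link, i.e. a data dependence).
task_graph = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": {"d"},
    "d": set(),
}

# Processor graph: vertex -> set of neighbours. Cores ("core*") have
# exactly one link; switches ("sw*") have more than 2 links.
processor_graph = {
    "core1": {"sw"},
    "core2": {"sw"},
    "core3": {"sw"},
    "sw": {"core1", "core2", "core3"},
}

# Check the degree conditions stated on the slide.
for v, nbrs in processor_graph.items():
    if v.startswith("core"):
        assert len(nbrs) == 1   # a core has only one link
    else:
        assert len(nbrs) >= 3   # a switch has more than 2 links
```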
Task Scheduling
• Task scheduling problem
– Assigns a processor node to each task node
– Minimizes total execution time
– An NP-hard problem
[Figure: one processor node is assigned to each task node, mapping the task graph onto the processor graph]
Turbo Boost
• Technology for accelerating multicore processors
– Scales operating frequencies according to the usage of each core
– Higher frequency if a small number of cores are working
– Lower frequency if all of the cores are working
• Assumptions
– The operating frequency depends only on the usage of each core
• It does not depend on temperature
– The frequency is switched instantly, at any time
[Figure: frequency levels of Core0-Core3 as 2, 3, and 4 cores are working, with idle cores shown]
Hyper-Threading
• Multiple hardware threads are executed on a single physical processor core
– Hardware resources are shared among the threads
– Reduces the performance penalty of pipeline stalls
• We define the effective operating frequency
– A speed index reflecting the reduction of execution speed caused by sharing resources between hardware threads
[Figure: a physical processor core with shared hardware resources and two register files, running hardware threads 1 and 2]
Modelling Frequency Control by Turbo Boost and Hyper-Threading
• Effective operating frequency depends only on
– The number of working processors
– The properties of the executed program
• We measure the operating frequencies corresponding to processor states beforehand
– Programs with different properties are executed on multiple cores at a time
– We measure the operating frequency on the real devices
Modelling Effective Frequency
• We want to predict the operating frequency from the number of working cores
– We changed the number of working cores and measured the operating frequency
• Notation: each [ x , y ] is one physical core with two hardware threads
– i: idle, c: computation-heavy, m: memory-access-heavy, n: in between c and m

Operating frequency when Turbo Boost is enabled (the processor has 4 cores / 8 hardware threads):

State of cores                          | Frequency
[ c , i ] [ i , i ] [ i , i ] [ i , i ] | to be measured
[ c , i ] [ c , i ] [ i , i ] [ i , i ] | to be measured
Results of Measurement
• When Turbo Boost is used, the effective frequency depends on the number of working cores
• When Hyper-Threading is used, the effective frequency depends on the properties of the programs

Only Turbo Boost is used (i: idle, c: computation-heavy, m: memory-access-heavy, n: in between c and m):

State of cores                          | Effective freq.
[ c , i ] [ i , i ] [ i , i ] [ i , i ] | 3.7 GHz
[ c , i ] [ c , i ] [ i , i ] [ i , i ] | 3.5 GHz
[ c , i ] [ c , i ] [ c , i ] [ i , i ] | 3.3 GHz
[ c , i ] [ c , i ] [ c , i ] [ c , i ] | 3.1 GHz
[ m , i ] [ i , i ] [ i , i ] [ i , i ] | 3.7 GHz
[ m , i ] [ m , i ] [ m , i ] [ m , i ] | 3.1 GHz
[ n , i ] [ i , i ] [ i , i ] [ i , i ] | 3.7 GHz
[ n , i ] [ n , i ] [ n , i ] [ n , i ] | 3.1 GHz

Only Hyper-Threading is used:

State     | Freq.   | Ratio | Effective freq.
[ c , c ] | 3.1 GHz | 0.84  | 2.6 GHz
[ m , m ] | 3.1 GHz | 0.76  | 2.3 GHz
[ n , n ] | 3.1 GHz | 0.79  | 2.5 GHz
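The measured tables above can be turned into a small lookup model. A minimal sketch, using the measured numbers from the tables; `effective_frequency` is a hypothetical helper name, not the paper's API, and combining the two tables multiplicatively is our assumption:

```python
# Turbo Boost: frequency (GHz) by number of busy physical cores,
# one hardware thread per core (from the "Only Turbo Boost" table).
TURBO_FREQ_GHZ = {1: 3.7, 2: 3.5, 3: 3.3, 4: 3.1}

# Hyper-Threading: slowdown ratio when both hardware threads of a core
# are busy (c: computation-heavy, m: memory-access-heavy, n: in between),
# from the "Only Hyper-Threading" table.
HT_RATIO = {"c": 0.84, "m": 0.76, "n": 0.79}

def effective_frequency(busy_cores, both_threads_busy, workload="c"):
    """Estimated effective frequency (GHz) of one busy core."""
    base = TURBO_FREQ_GHZ[busy_cores]
    return base * HT_RATIO[workload] if both_threads_busy else base

print(effective_frequency(1, False))        # one core, one thread -> 3.7
print(effective_frequency(4, True, "m"))    # roughly matches the measured 2.3 GHz
```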
Outline
1. Background
2. Related Works
3. Modelling of Computing System
4. Proposed Method
5. Evaluation
6. Conclusion
Problem in Existing Scheduling
• Tasks are assigned in order
• Assumes a fixed execution speed
• When the operating frequency is dynamically changed
– The operating frequency is high when the first task is assigned
– The frequency is reduced as more tasks are assigned to cores
[Figure: task graph and the resulting schedule on Core1-Core4, annotated with the frequencies for 1-core and 2-core execution]
[Figure: the schedule is built up step by step; the execution time of task a increases once task b is assigned, because the operating frequency drops]
Problem and Solution
• Problem
– Execution time changes as subsequent tasks are assigned
– We cannot know which core is assigned to which task until the scheduling is finished
• Solution
– Tentatively assign tasks and estimate the operating frequencies
Estimating Execution Time of Tasks
• Initial assignment of the first task
[Figure: a task graph (nodes 1-5 with communication costs) is scheduled onto a processor graph (Proc1 with Core1-Core4, and Proc2); the Gantt chart marks communication time and the Turbo Boost 1-core, Turbo Boost 2-core, and Hyper-Threading frequency levels]
• Initial assignment of the second task
[Figure: the same Gantt chart, now also containing task N2]
Tentative Assignment
• We calculate communication time using the existing method
• The operating frequency is tentatively fixed
[Figure: tasks N3, N4, and N5 are tentatively placed in the Gantt chart, together with communication blocks]
Determining Operating Frequency
• The operating frequency is determined by observing the usage of each core
[Figure: Gantt chart with tasks N1-N5; the frequency of each time slot follows the number of busy cores]
Selecting Processor to Assign
• The processor is selected so that the whole processing time of the task graph is minimized
– Compare the execution time of the whole task graph for each candidate assignment
[Figure: two candidate schedules finishing at times 9 and 11; the shorter one is chosen]
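The tentative-assignment idea on the preceding slides can be sketched as follows. This is a hypothetical simplification, not the authors' implementation: communication time and Hyper-Threading are ignored, and one Turbo Boost frequency (from the measured model) is applied to a whole candidate schedule rather than recomputed per time slot.

```python
# Measured Turbo Boost model (GHz) by number of busy physical cores.
TURBO_FREQ_GHZ = {1: 3.7, 2: 3.5, 3: 3.3, 4: 3.1}

def estimate_makespan(placement, deps, work):
    """Finish time of the whole task graph for one tentative placement.
    placement: task id -> core; deps: id -> set of predecessor ids;
    work: id -> workload in giga-cycles; ids are topologically ordered."""
    freq = TURBO_FREQ_GHZ[len(set(placement.values()))]
    finish, core_free = {}, {}
    for t in sorted(placement):
        ready = max((finish[p] for p in deps[t] if p in finish), default=0.0)
        start = max(ready, core_free.get(placement[t], 0.0))
        finish[t] = start + work[t] / freq
        core_free[placement[t]] = finish[t]
    return max(finish.values())

def schedule(tasks, deps, work, n_cores=4):
    """Greedily place each task on the core that minimizes the estimated
    makespan of the whole task graph (tentative assignment)."""
    placement = {}
    for t in tasks:                      # tasks in topological order
        best = None
        for core in range(n_cores):
            cand = {**placement, t: core}
            makespan = estimate_makespan(cand, deps, work)
            if best is None or makespan < best[0]:
                best = (makespan, core)
        placement[t] = best[1]
    return placement

# Diamond task graph 1 -> {2, 3} -> 4, workloads in giga-cycles.
deps = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}}
work = {1: 7.4, 2: 7.0, 3: 7.0, 4: 7.4}
print(schedule([1, 2, 3, 4], deps, work))
```

In this toy run the dependent chain 1 -> 2 stays on one core, keeping the Turbo Boost frequency high, while the independent task 3 is worth moving to a second core despite the lower 2-core frequency, which is exactly the trade-off the slides describe.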
Outline
1. Background
2. Related Works
3. Modelling of Computing System
4. Proposed Method
5. Evaluation
6. Conclusion
Evaluation
• Items to be evaluated
– Execution time of the whole task graph
– Accuracy of the estimation of execution time
• Compared methods
– We extended Sinnen's scheduling method in a straightforward way
• Sinnen's method only considers network contention
• We replaced its subroutine for calculating task execution times with our model
• The first method assigns tasks only to physical cores (SinnenPhysical)
• The second method assigns tasks to all hardware threads (SinnenLogical)
Configuration
• System
– CPU: Intel Core i7-3770T (2.5 GHz, 4 cores, 8 hardware threads, single socket)
– Memory: 16.0 GB
– OS: Windows 7 SP1 (64 bit)
– Java SE 1.6.0_21 (64 bit)
– Gigabit LAN subsystem using the Intel 82579V Gigabit Ethernet Controller
• Network
– 4 computers are connected with Gigabit Ethernet
– 420 Mbps bandwidth (measured on the real system)
Network Topologies for Evaluation
• Star: 16 processors, 5 switches
• Tree: 16 processors, 7 switches
• Mesh: 16 processors, 4 switches (only for simulation)
[Figure: star, tree, and mesh topologies connecting the 16 cores through switches]
Task Graphs for Evaluation
• High parallelism (Matrix Solver): 98 nodes, 67 edges
• Low parallelism (Robot Control): 90 nodes, 135 edges

T. Tobita and H. Kasahara: ``A standard task graph set for fair evaluation of multi-processor scheduling algorithms," Journal of Scheduling, Vol. 5, pp. 379-394, 2002.
Execution Time of Whole Task Graph
• Execution time is reduced by up to 35%
• Higher parallelism leads to more reduction
– Because of the greater freedom in choosing which core to assign each task
[Figure: task execution time (sec) of SinnenPhysical, SinnenLogical, and Proposed for Robot/Solver on the Tree and Star topologies; reductions are 16%, 35%, 9%, and 24%]
Accuracy of Estimation of Execution Time
• Estimation error is 4% on average, 7% at maximum
[Figure: simulated vs. measured task execution time (sec) for each method on Robot/Solver on the Tree and Star topologies; per-configuration errors range from 1% to 7%]
Conclusion
• A task scheduling method for multicore processor systems
– Considers dynamic scaling of the operating frequency by Turbo Boost and Hyper-Threading
– Considers network contention
• We modeled the frequency control of Turbo Boost and Hyper-Threading