
University of Michigan, Electrical Engineering and Computer Science

Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems

Hyoun Kyu Cho and Scott Mahlke
University of Michigan, Ann Arbor

December 2, 2012


Critical Path


• Longest path between source and sink in DAG
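As a minimal sketch of the definition above, the following C function computes the length of the longest path in a small node-weighted DAG. The names `Edge`, `MAX_NODES`, and `critical_path_length`, the node-weighted representation, and the requirement that nodes are numbered in topological order are all illustrative assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical edge type: a dependence from node `from` to node `to`. */
typedef struct {
    int from;
    int to;
} Edge;

#define MAX_NODES 64 /* illustrative upper bound on graph size */

/*
 * Length of the longest path in a node-weighted DAG.
 * Assumption: nodes are numbered in topological order, so every edge
 * satisfies from < to.
 */
long critical_path_length(int n, const long *weight,
                          const Edge *edges, size_t n_edges)
{
    long finish[MAX_NODES];
    long best = 0;

    assert(n > 0 && n <= MAX_NODES);

    /* A node with no predecessors finishes after its own weight. */
    for (int v = 0; v < n; v++)
        finish[v] = weight[v];

    /* Visit sources in topological order; finish[u] is final when u is visited. */
    for (int u = 0; u < n; u++) {
        for (size_t e = 0; e < n_edges; e++) {
            if (edges[e].from != u)
                continue;
            int v = edges[e].to;
            assert(v > u && v < n);
            long candidate = finish[u] + weight[v];
            if (candidate > finish[v])
                finish[v] = candidate;
        }
    }

    /* The critical path ends at the node with the latest finish time. */
    for (int v = 0; v < n; v++)
        if (finish[v] > best)
            best = finish[v];
    return best;
}
```

For a diamond-shaped graph with weights 3, 2, 5, 1 and edges 0→1, 0→2, 1→3, 2→3, the longest path runs through nodes 0, 2, 3.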


Critical Path

[Figure: example graph with numeric weights illustrating a critical path] [Saidi`08]


Critical Path for Multithreaded Programs

[Figure: (a) Mutex Lock: threads T1 and T2, with Call, StartLock, EndLock, and Unlock nodes. (b) Barrier: threads T1, T2, and T3, with Call, ArBarrier, and LvBarrier nodes.] [Hollingsworth`98]


Scalability of Multithreaded Programs

Some benchmarks do not scale well!


CPU Time Wasted on Synchronizations

Synchronization is a major bottleneck!


Arrival Time Variation



Accelerating Critical Path

• ACS [Suleman et al. ASPLOS `09]
  – Critical sections

• Voltage Boosting [Dreslinski `11]
  – Transactional bottlenecks

• Booster [Miller et al. HPCA `12]
  – Alleviate performance variation
  – Reactive acceleration for barriers


Challenges and Opportunities of NTC

• Poor single thread performance

• Very sensitive to process variation
  – Running all cores at the speed of the slowest one leads to severe performance loss
  – Likely to have performance heterogeneity

• Potential for bigger frequency boosting


Objectives


• Systematic way of identifying critical paths

• Dealing with performance variation

• Flexible control of core boosting


System Architecture

[Figure: system architecture, split into offline and online parts. Labels: Target Program, Compilation, Intermediate Representation, Parallelism Analysis, Monitoring Logic, instrumentation, Instrumented Executable, Monitor, Observe, Adjust Priority, Priority, Schedule, Weighted Probabilistic Priority Scheduler.]


Lottery Scheduling

• Each thread holds a number of tickets
• Scheduler selects the fast mode thread by picking a ticket
• Efficient implementation of proportional-share resource management
• Responsive, flexible control over relative execution rate

[Waldspurger`94]

Example: five threads hold 10, 2, 5, 1, and 2 tickets; total = 20, random [0 .. 19] = 15
  – ∑ = 10, ∑ > 15? no
  – ∑ = 12, ∑ > 15? no
  – ∑ = 17, ∑ > 15? yes, so the third thread is selected
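The running-sum selection in the ticket example above can be sketched in C as follows. The function name `lottery_pick` and its signature are assumptions made for illustration; the slides do not show the scheduler's code, and the random draw itself is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the thread holding the winning ticket.
 * Assumption: `winning` was drawn uniformly from [0, total - 1],
 * where total is the sum of all entries in `tickets`.
 */
size_t lottery_pick(const unsigned *tickets, size_t n_threads, unsigned winning)
{
    unsigned running_sum = 0;

    assert(n_threads > 0);
    for (size_t i = 0; i < n_threads; i++) {
        running_sum += tickets[i];
        /* The first thread whose cumulative count exceeds the draw wins. */
        if (running_sum > winning)
            return i;
    }
    /* Reached only if winning >= total; fall back to the last thread. */
    return n_threads - 1;
}
```

With ticket counts 10, 2, 5, 1, 2 and a draw of 15, the running sums are 10, 12, and 17, so the function returns index 2 (the third thread). A thread's chance of being selected is proportional to its ticket count, which is why adjusting tickets adjusts the relative execution rate.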


Progress Monitoring

• For data parallel threads
• Slower threads are more likely to be in the critical path
• Divide the task into multiple smaller chunks and instrument monitoring code
• Monitoring code reduces the number of tickets as a thread makes progress


Example of Progress Monitoring

…
pthread_barrier_wait(barrier);
long PROGRESS_GRANULE = (k2 - k1) / NUM_STEPS;
if ( PROGRESS_GRANULE < 1 ) PROGRESS_GRANULE = 1;  /* avoid modulo by zero for short ranges */
for ( i = k1 ; i < k2 ; i++ ) {
    /* Loop body */
    float x_cost = dist(points->p[i], points->p[x], points->dim) * points->p[i].weight;
    float current_cost = points->p[i].cost;
    if ( x_cost < current_cost ) {
        switch_membership[i] = 1;
        cost_of_opening_x += x_cost - current_cost;
    } else {
        int assign = points->p[i].assign;
        lower[center_table[assign]] += current_cost - x_cost;
    }
    /* Inserted monitoring code: lower this thread's priority after each chunk */
    if ( (i - k1) % PROGRESS_GRANULE == 0 )
        halve_priority_tickets();
}
pthread_barrier_wait(barrier);
…
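The slides do not show the body of `halve_priority_tickets()`. A minimal sketch, assuming each thread keeps its own ticket count in thread-local storage, might look like the following. `INITIAL_TICKETS`, `MIN_TICKETS`, `reset_priority_tickets`, and `current_priority_tickets` are hypothetical names, and how the scheduler reads the count is left out.

```c
#include <assert.h>

#define INITIAL_TICKETS 1024u /* hypothetical allotment at the start of a phase */
#define MIN_TICKETS 1u        /* keep every thread eligible for the lottery */

/* Assumption: one ticket counter per thread (C11 thread-local storage). */
static _Thread_local unsigned my_tickets = INITIAL_TICKETS;

/* Hypothetical helper: restore the full allotment, e.g. after a barrier. */
void reset_priority_tickets(void)
{
    my_tickets = INITIAL_TICKETS;
}

/* Hypothetical helper: read the calling thread's current ticket count. */
unsigned current_priority_tickets(void)
{
    return my_tickets;
}

/* Called once per progress granule: threads that are further along hold
 * fewer tickets, so slower threads are more likely to win the lottery. */
void halve_priority_tickets(void)
{
    assert(my_tickets >= MIN_TICKETS);
    my_tickets /= 2;
    if (my_tickets < MIN_TICKETS)
        my_tickets = MIN_TICKETS;
}
```

Halving gives a geometric gap between threads: a thread that is two granules behind holds four times as many tickets as one that is ahead, under these assumed constants.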


Priority Delegation

• A thread holding a mutex is likely to be in the critical path
  – Increase its tickets when it acquires the mutex

• It is more likely to be in the critical path if other threads are waiting
  – Temporarily transfer the waiting threads' tickets to the thread holding the mutex
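The two rules above can be sketched as simple ticket bookkeeping in C. Everything here is an illustrative assumption rather than the authors' implementation: the `ThreadPriority` type, the `MUTEX_BONUS` constant, and the four hook names. The sketch also assumes callers serialize access to the structures and that no other ticket adjustment happens while tickets are on loan.

```c
#include <assert.h>

#define MUTEX_BONUS 8u /* hypothetical boost for holding a mutex */

typedef struct {
    unsigned tickets; /* tickets currently usable in the lottery */
    unsigned lent;    /* tickets temporarily lent to a mutex holder */
} ThreadPriority;

/* Rule 1: a thread that acquires a mutex gets extra tickets. */
void on_mutex_acquire(ThreadPriority *holder)
{
    holder->tickets += MUTEX_BONUS;
}

/* Rule 2: a blocked waiter lends all of its tickets to the holder. */
void on_mutex_wait(ThreadPriority *waiter, ThreadPriority *holder)
{
    assert(waiter != holder);
    holder->tickets += waiter->tickets;
    waiter->lent += waiter->tickets;
    waiter->tickets = 0;
}

/* When the waiter stops waiting, the loan is returned. */
void on_mutex_handoff(ThreadPriority *waiter, ThreadPriority *holder)
{
    assert(holder->tickets >= waiter->lent);
    holder->tickets -= waiter->lent;
    waiter->tickets += waiter->lent;
    waiter->lent = 0;
}

/* Releasing the mutex removes the bonus granted at acquisition. */
void on_mutex_release(ThreadPriority *holder)
{
    assert(holder->tickets >= MUTEX_BONUS);
    holder->tickets -= MUTEX_BONUS;
}
```

Because a waiting thread cannot make progress anyway, lending its tickets costs it nothing while raising the chance that the lock holder runs in fast mode.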


Performance Evaluation

• Post processing traces
  – Generated on 32-core machine
    • Four 8-core Intel Xeon X7560
    • 24MB L3 cache per chip
    • 32GB total memory
  – Augmented progress time indication
    • 1.5x, 2x, 5x, 10x acceleration for 1 fast mode core
    • Varying scheduling quantum from 1us to 1ms


Speedup for Streamcluster

[Figure: speedup for streamcluster. Legend: H/W, OS, Usermode.]


Current Status

[Figure: system architecture as on the System Architecture slide, with a set of cores labeled Normal, Turbo, Normal, Turbo. Labels: Target Program, Compilation, Intermediate Representation, Parallelism Analysis, Monitoring Logic, instrumentation, Instrumented Executable, Monitor, Observe, Adjust Priority, Priority, Schedule, Weighted Probabilistic Priority Scheduler, Cores.]


Conclusion & Future Work

• Introduces a S/W framework to improve multithreaded programs’ performance using core boosting

• Combines static analysis, dynamic monitoring, and probabilistic priority scheduling to predict critical paths

• Shows 5% to 27% performance improvement for streamcluster

• Future work: better model the tradeoff between performance and energy

• Future work: predict critical paths for other types of parallelism


Thank you!
