University of Michigan, Electrical Engineering and Computer Science

Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems

Hyoun Kyu Cho and Scott Mahlke
University of Michigan, Ann Arbor

December 2, 2012

Critical Path

• Longest path between source and sink in DAG
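The definition above can be made concrete with a minimal sketch in C. It is an illustration only, not code from the talk: the `struct edge` layout, the function name `critical_path_length`, and the requirement that nodes are already numbered in topological order with non-negative edge costs are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One weighted edge of the DAG: from -> to, taking `cost` time units. */
struct edge {
    int from;
    int to;
    long cost;   /* assumed non-negative */
};

/*
 * Returns the length of the longest path from `source` to `sink`,
 * or -1 if the sink cannot be reached from the source.
 * Assumption: nodes are numbered in topological order, so every edge
 * satisfies from < to. `dist` is caller-provided scratch space of n longs.
 */
long critical_path_length(int n, const struct edge *edges, size_t n_edges,
                          int source, int sink, long *dist)
{
    assert(n > 0 && source >= 0 && source < n && sink >= 0 && sink < n);

    for (int v = 0; v < n; v++)
        dist[v] = -1;                  /* -1 marks "not reached yet" */
    dist[source] = 0;

    /* Visit nodes in topological order and relax their outgoing edges. */
    for (int v = 0; v < n; v++) {
        if (dist[v] < 0)
            continue;                  /* unreachable from the source */
        for (size_t e = 0; e < n_edges; e++) {
            if (edges[e].from != v)
                continue;
            assert(edges[e].to > v);   /* topological numbering */
            long candidate = dist[v] + edges[e].cost;
            if (candidate > dist[edges[e].to])
                dist[edges[e].to] = candidate;
        }
    }
    return dist[sink];
}
```

The edge scan makes this O(n × edges), which is enough for a sketch; an adjacency list would bring it down to linear time.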

Critical Path

[Figure: critical path example; only numeric labels survive in the transcript, the diagram is not recoverable]

[Saidi`08]

Critical Path for Multithreaded Programs

[Figure: per-thread event graphs. (a) Mutex Lock: threads T1 and T2 with Call, StartLock, EndLock, and Unlock events. (b) Barrier: threads T1, T2, and T3 with Call, ArBarrier, and LvBarrier events.]

[Hollingsworth`98]

Scalability of Multithreaded Programs

Some benchmarks do not scale very well!

CPU Time Wasted on Synchronizations

Synchronization is a major bottleneck!

Arrival Time Variation

Accelerating Critical Path

• ACS [Suleman et al. ASPLOS `09]
  – Critical sections
• Voltage Boosting [Dreslinski `11]
  – Transactional bottlenecks
• Booster [Miller et al. HPCA `12]
  – Alleviate performance variation
  – Reactive acceleration for barriers

Challenges and Opportunities of NTC

• Poor single-thread performance
• Very sensitive to process variation
  – Running every core at the speed of the slowest one leads to severe performance loss
  – Likely to have performance heterogeneity
• Potential for bigger frequency boosting

Objectives

• Systematic way of identifying critical paths

• Dealing with performance variation

• Flexible control of core boosting

System Architecture

[Figure: system architecture block diagram, split into offline and online parts. Labels: Target Program, Compilation, Intermediate Representation, Parallelism Analysis, Monitoring Logic, instrumentation, Instrumented Executable, Monitor, Observe, Adjust Priority, Priority, Schedule, Weighted Probabilistic Priority Scheduler.]

Lottery Scheduling

• Each thread holds a number of tickets
• Scheduler selects the fast-mode thread by picking a ticket at random
• Efficient implementation of proportional-share resource management
• Responsive, flexible control over relative execution rate

[Waldspurger`94]

Example: threads hold 10, 2, 5, 1, and 2 tickets; total = 20; random [0 .. 19] = 15
  ∑ = 10, ∑ > 15? no
  ∑ = 12, ∑ > 15? no
  ∑ = 17, ∑ > 15? yes (the third thread is selected)
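The ticket walk in the example (tickets 10, 2, 5, 1, 2 and a draw of 15) can be sketched in C as follows. The function name `lottery_pick` and the choice to pass the random draw in as a parameter are assumptions made for illustration; the talk does not show its scheduler code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the index of the thread holding the winning ticket.
 * tickets[i] is the number of tickets held by thread i, and `draw` is a
 * random number in [0, total), where total is the sum of all tickets.
 * Generating `draw` is left to the caller (an assumption of this sketch).
 */
size_t lottery_pick(const unsigned *tickets, size_t n_threads, unsigned draw)
{
    unsigned running_sum = 0;

    for (size_t i = 0; i < n_threads; i++) {
        running_sum += tickets[i];
        if (running_sum > draw)      /* the "sum > draw?" test from the example */
            return i;
    }
    /* Only reached if draw >= total, which violates the precondition. */
    assert(!"draw must be smaller than the total number of tickets");
    return n_threads;
}
```

With the example's values, `lottery_pick` stops at index 2, the thread holding 5 tickets, because the running sum first exceeds 15 there (10, 12, 17). A thread's chance of being selected is proportional to its share of the tickets, which is what lets ticket adjustments steer which thread gets the fast-mode core.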

Progress Monitoring

• For data-parallel threads
• Slower threads are more likely to be on the critical path
• Divide each task into multiple smaller chunks and instrument them with monitoring code
• Monitoring code reduces the thread's number of tickets as it makes progress

Example of Progress Monitoring

    …
    pthread_barrier_wait(barrier);
    /* Iterations per progress chunk; clamp to 1 so the modulo below is defined. */
    long PROGRESS_GRANULE = (k2 - k1) / NUM_STEPS;
    if (PROGRESS_GRANULE < 1)
        PROGRESS_GRANULE = 1;
    for (i = k1; i < k2; i++) {
        float x_cost = dist(points->p[i], points->p[x], points->dim) * points->p[i].weight;
        float current_cost = points->p[i].cost;
        if (x_cost < current_cost) {
            switch_membership[i] = 1;
            cost_of_opening_x += x_cost - current_cost;
        } else {
            int assign = points->p[i].assign;
            lower[center_table[assign]] += current_cost - x_cost;
        }
        /* Progress monitoring: halve this thread's tickets at the start of each chunk. */
        if ((i - k1) % PROGRESS_GRANULE == 0)
            halve_priority_tickets();
    }
    pthread_barrier_wait(barrier);
    …

Loop Body
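The slide calls `halve_priority_tickets()` but does not show its body. As a rough sketch of what such a hook might do, the hypothetical helper below halves a shared ticket counter with C11 atomics, so that a scheduler running on another core can read it safely. The name `halve_tickets`, the explicit counter argument, and the floor of one ticket are assumptions, not details from the talk.

```c
#include <assert.h>
#include <stdatomic.h>

#define MIN_TICKETS 1u   /* hypothetical floor: keep every thread in the lottery */

/*
 * Atomically halves *tickets, never dropping below MIN_TICKETS.
 * Returns the new ticket count.
 */
unsigned halve_tickets(atomic_uint *tickets)
{
    unsigned old = atomic_load(tickets);
    unsigned halved;

    do {
        halved = old / 2;
        if (halved < MIN_TICKETS)
            halved = MIN_TICKETS;
        /* On failure, `old` is reloaded with the current value and we retry. */
    } while (!atomic_compare_exchange_weak(tickets, &old, halved));

    return halved;
}
```

Under this scheme a thread that has completed more chunks holds fewer tickets, so threads that lag behind are more likely to win the lottery for the fast-mode core.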

Priority Delegation

• A thread holding a mutex is likely to be on the critical path
  – Increase its tickets when it acquires the mutex
• It is even more likely to be on the critical path if other threads are waiting
  – Temporarily transfer the waiting threads' tickets to the thread holding the mutex
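One way to picture the ticket bookkeeping behind priority delegation is the sketch below. Everything in it is hypothetical: the struct names, the four hook functions, and the `MUTEX_BONUS` value are invented for illustration, and callers are assumed to serialize these updates (for example under the scheduler's own lock). It only tracks tickets and does not wrap a real mutex.

```c
#include <assert.h>
#include <stddef.h>

#define MUTEX_BONUS 4u              /* hypothetical boost for holding a mutex */

struct thread_prio {
    unsigned tickets;               /* tickets the thread currently holds */
    unsigned lent;                  /* tickets lent out while waiting */
};

struct prio_mutex {
    struct thread_prio *holder;     /* NULL when the mutex is free */
    unsigned delegated;             /* tickets lent by all current waiters */
};

/* Called right after `t` acquires the mutex. */
void prio_acquired(struct prio_mutex *m, struct thread_prio *t)
{
    assert(m->holder == NULL);
    m->holder = t;
    t->tickets += MUTEX_BONUS + m->delegated;   /* bonus plus waiters' tickets */
}

/* Called right before the holder releases the mutex. */
void prio_released(struct prio_mutex *m)
{
    assert(m->holder != NULL);
    m->holder->tickets -= MUTEX_BONUS + m->delegated;
    m->holder = NULL;
}

/* Called when `waiter` starts blocking on the mutex. */
void prio_wait_begin(struct prio_mutex *m, struct thread_prio *waiter)
{
    waiter->lent = waiter->tickets;
    waiter->tickets = 0;
    m->delegated += waiter->lent;
    if (m->holder != NULL)
        m->holder->tickets += waiter->lent;     /* temporary transfer */
}

/* Called when `waiter` stops blocking, before it acquires the mutex. */
void prio_wait_end(struct prio_mutex *m, struct thread_prio *waiter)
{
    m->delegated -= waiter->lent;
    if (m->holder != NULL)
        m->holder->tickets -= waiter->lent;
    waiter->tickets += waiter->lent;            /* take the tickets back */
    waiter->lent = 0;
}
```

Keeping the lent total on the mutex rather than on the holder means that tickets from threads still waiting follow the mutex to its next holder, and every thread ends up with its original count once it has stopped waiting and released the mutex.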

Performance Evaluation

• Post-processing traces
  – Generated on a 32-core machine
    • Four 8-core Intel Xeon X7560
    • 24MB L3 cache per chip
    • 32GB total memory
  – Augmented progress time indication
    • 1.5x, 2x, 5x, 10x acceleration for 1 fast-mode core
    • Varying scheduling quantum from 1us to 1ms

Speedup for Streamcluster

[Figure: speedup chart; legend: H/W, OS, Usermode]

Current Status

[Figure: the block diagram from the System Architecture slide, repeated with the added labels Normal, Turbo, and Cores.]

Conclusion & Future Work

• Introduces a S/W framework to improve multithreaded programs' performance using core boosting

• Combines static analysis, dynamic monitoring, and probabilistic priority scheduling to predict critical paths

• Shows 5% to 27% performance improvement for streamcluster

• Future work: better modeling of the tradeoff between performance and energy

• Future work: predicting critical paths for other types of parallelism

Thank you!