self-tuning htm paolo romano. based on icac’14 paper n. diegues and paolo romano self-tuning intel...

41
SELF-TUNING HTM Paolo Romano

Upload: claud-warner

Post on 11-Jan-2016

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

SELF-TUNING HTMPaolo Romano

Page 2: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

2

Based on ICAC’14 paper

N. Diegues and Paolo Romano

Self-Tuning Intel Transactional Synchronization Extensions

11th USENIX International Conference on Autonomic Computing (ICAC), June 2014

Best paper award

Page 3: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Best-Effort Nature of HTM

3

No progress guarantees:

• A transaction may always abort

…due to a number of reasons:

• Forbidden instructions

• Capacity of caches (L1 for writes, L2 for reads)

• Faults and signals

• Contending transactions, aborting each other

Need for a fallback path, typically a lock or an STM

Page 4: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

When and how to activate the fallback?

4

• How many retries before triggering the fall-back?• Ranges from never retrying to insisting many times

• How to cope with capacity aborts?• GiveUp – exhaust all retries left• Half – drop half of the retries left• Stubborn – drop only one retry left

• How to implement the fall-back synchronization?• Wait – single lock should be free before retrying• None – retry immediately and hope the lock will be freed• Aux – serialize conflicting transactions on auxiliary lock

Page 5: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Is static tuning enough?

5

Focus on single global lock fallback

Heuristic:

Try to tune the parameters according to best practices

• Empirical work in recent papers [SC13, HPCA14]

• Intel optimization manual

GCC:

Use the existing support in GCC out of the box

Page 6: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Why Static Tuning is not enough

6

Benchmark GCC Heuristic Best Tuning

genome 1.54 3.14 3.36 wait-giveup-4

intruder 2.03 1.81 3.02 wait-giveup-4

kmeans-h 2.73 2.66 3.03 none-stubborn-10

rbt-l-w 2.48 2.43 2.95 aux-stubborn-3

ssca2 1.71 1.69 1.78 wait-giveup-6

vacation-h 2.12 1.61 2.51 aux-half-5

yada 0.19 0.47 0.81 wait-stubborn-15

Speedup with 4 threads (vs 1 thread non-instrumented)

Intel Haswell Xeon with 4 cores (8 hyperthreads)

room for improvement

Page 7: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

No one size fits all

7

Intruder from STAMP benchmarks

Page 8: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Are all optimization dimensions relevant?

8

• How many retries before triggering the fall-back?• Ranges from never retrying to insisting many times

• How to cope with capacity aborts?• GiveUp – exhaust all retries left• Half – drop half of the retries left• Stubborn – drop only one retry left

• How to implement the fall-back synchronization?• Wait – single lock should be free before retrying• None – retry immediately and hope the lock will be freed• Aux – serialize conflicting transactions on auxiliary lock

• aux and wait perform similarly

• When none is best, it is by a marginal amount

• Reduce this dimension in the optimization problem

Page 9: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

9

Self-tuning design choices

3 key choices:

• How should we learn?

• At what granularity should we adapt?

• What metrics should we optimize for?

Page 10: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

10

How should we learn?• Off-line learning

• test with some mix of applications & characterize their workload• infer a model (e.g., based on decision trees) mapping:

workload optimal configuration• monitor the workload of your target application, feed the model with

this info and accordingly tune the system

• On-line learning• no preliminary training phase• explore the search space while the application is running • exploit the knowledge acquired via exploration for tuning

Page 11: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

11

How should we learn?• Off-line learning

• PRO:• no exploration costs

• CONs: • initial training phase is time-consuming and “critical”

• accuracy is strongly affected by training set representativeness

• non-trivial to incorporate new knowledge from target application

• On-line learning• PROs:

• no training phase plug-and-play effect• naturally incorporate newly available knowledge

• CONs:• exploration costs

reconfiguration cost is low with HTM exploring is affordable

Page 12: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Which on-line learning techniques?

12

Uses 2 on-line reinforcement learning techniques in synergy:

• Upper Confidence Bounds: how to cope with capacity aborts?

• Gradient Descent: how many retries in hardware?

• Key features:

• both techniques are extremely lightweight practical

• coupled in a hierarchical fashion:

• they optimize non-independent parameters

• avoid ping-pong effects

Page 13: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

13

Self-tuning design choices

3 key choices:

• How should we learn?

• At what granularity should we adapt?

• What metrics should we optimize for?

Page 14: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

14

At what granularity should we adapt?• Per thread & atomic block

• PRO:• exploit diversity and maximize flexibility

• CON: • possibly large number of optimizers running in parallel

• redundancy larger overheads• interplay of multiple local optimizers

• Whole application• PRO:

• lower overhead, simpler convergence dynamics

• CON: • reduced flexibility

Page 15: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

15

Self-tuning design choices

3 key choices:

• How should we learn?

• At what granularity should we adapt?

• What metrics should we optimize for?

Page 16: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

16

What metrics should we optimize for?• Performance? Power? A combination of the two?

• Key issues/questions:• Cost and accuracy of monitoring the target metric

• Performance: • RTDSC allow for lightweight, fine-grained measurement of latency

• Energy:• RAPL: coarse granularity (msec) and requires system calls

• How correlated are the two metrics?

Page 17: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

17

Energy and performance in (H)TM: two sides of the same coin?

• How correlated are energy consumption and throughput?• 480 different configurations (number of retries, capacity aborts

handling, no. threads) per each benchmark:• includes both optimal and sub-optimal configurations

Page 18: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

18

Energy and performance in (H)TM: two sides of the same coin?

• How suboptimal is the energy consumption if we use a configuration that is optimal performance-wise?

Page 19: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

(G)Tuner

19

Performance measured through processor cycles (RTDSC)

Support fine and coarse grained optimization granularity:

• Tuner: per atomic block, per thread

• no synchronization among threads

• G(lobal)-Tuner: application-wide configuration

• Threads collect statistics privately

• An optimizer thread periodically:

• Gathers stats & decides (a possibly) new configuration

Periodic profiling and re-optimization to minimize overhead

Integrated in GCC

Page 20: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Evaluation

20

• Idealized “Best” variant

• Tuner

• G-Tuner

• Heuristic: GiveUp-5

• NOrec (STM)

Intel Haswell Xeon with 4 cores (8 hyper-threads)

RTM-SGL RTM-NOrec

• Idealized “Best” variant

• Tuner

• G-Tuner

• Heuristic: GiveUp-5

• GCC

• Adaptive Locks [PACT09]

Page 21: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

RTM-SGL

21

Intruder from STAMP benchmarks

4% avg offset

+50%

Threads

Spe

edup

Page 22: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

RTM-NORec

22

Intruder from STAMP benchmarks

G-Tuner better with

NOrec fallback

Threads

Spe

edup

Page 23: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Evaluating the granularity trade-off

23

Genome from STAMP benchmarks, 8 threads

adaptingover time

also adapting, but large constant overheads

static configuration

Page 24: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Take home messages

24

• Tuning of fall-back policy strongly impacts performance

• Self-tuning of HTM via on-line learning is feasible:

• plug & play: no training phase

• gains largely outweigh exploration overheads

• Tuning granularity hides subtle trade-offs:

• flexibility vs overhead vs convergence speed

• Optimize for performance or for energy?

• Strong correlation between the 2 metrics

• How general is this claim? Seems the case also for STM

Page 25: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Thank you!

25

Questions?

Page 26: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

26

BACKUP SLIDES

Dagstuhl Seminar 2015

Page 27: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

27

Single lock fallback path• After “some” failed attempts using HTM, acquire a single

global lock and execute the tx pessimistically

• How to couple transactions executing in hardware and fallback?• Subscribe the lock in the HTM transaction:

• read the state of the global lock from the HTM transaction• activating the fallback path aborts any concurrent hw transaction

STRONG IMPACT ON PERFORMANCE

BETTER TUNE THIS MECHANISM PROPERLY!

ICAC 2014

Page 28: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Why Static Tuning is not enough

Self-Tuning Intel RTM 28

Page 29: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

How to handle capacity aborts?

Self-Tuning Intel RTM 29

Reduction to “Bandit Problem”

• 3-levers slot machine with unknown reward distributions

Exploitation vs Exploration dilemma

how often to test apparently unfavorable levers?

Too little: convergence to wrong solution

Too much: many suboptimal choices

Lever A Lever B Lever C

giveup half stubbornStrategy:

Reward: ? ? ?

Page 30: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Upper Confidence Bounds (UCB)

Self-Tuning Intel RTM 30

Solution to exploration vs exploitation dilemma

• Online estimation of “uncertainty” of each strategy

• upper confidence bound on expected reward

• amplify bound of rarely explored strategies

• Appealing theoretical guarantees:

• logarithmic bound on optimization error

• Very lightweight and efficient:

• …practical!

Page 31: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Upper Confidence Bounds (UCB)

Self-Tuning Intel RTM 31

• Basic reward function for each strategy i :

xi=

• Estimate upper bound on reward of each strategy:

• Amplify confidence bound of rarely explored levers

avg. #cycles using strategy i1

Page 32: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

How many attempts using HTM?

Self-Tuning Intel RTM 32

UCB not a good fit too many levers to explore!

Page 33: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Gradient Descent

Self-Tuning Intel RTM 33

1

23

4?

Problems:

1- unnecessary oscillations

2- stuck in local maxima

Page 34: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Gradient Descent

Self-Tuning Intel RTM 34

1

23

4?

Problems:

1- unnecessary oscillations * stabilization threshold2- stuck in local maxima * random jumps

5

Page 35: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Gradient Descent

Self-Tuning Intel RTM 35

1

23

4?

Problems:

1- unnecessary oscillations * stabilization threshold2- stuck in local maxima * random jumps

56

7

8revert to

curr. maximum upon “unlucky”

jumps

Page 36: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Optimizers in action

Self-Tuning Intel RTM 36

One atomic block in Yada benchmark (8 threads).

the two optimizers are *not* independent

Page 37: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Coupling the Optimizers

Self-Tuning Intel RTM 37

UCB and Gradient Descent overlap in responsibilities:

• Optimize consumption of attempts upon capacity aborts

• Optimize allocation of budget for attempts

Minimize interference via hierarchical organization:

• UCB rules over Grad:

• UCB can force Grad to explore with random jump

• Direction and length defined by UCB belief

• More details in the paper

Page 38: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Coupling the Optimizers

Self-Tuning Intel RTM 38

Speedup of coupled techniques vs individual ones

Page 39: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Overhead of self-tuning

Self-Tuning Intel RTM 39

Profiling and decision-making are performed, but discarded.

Uses a static configuration (and compares with it).

Page 40: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Integration in GCC

Self-Tuning Intel RTM 40

• Workload-oblivious

• Transparent to the programmer

• Lightweight for general purpose use

• Ideal candidate for integration at the compiler level

• (current prototype does not support G-Tuner yet)

Page 41: SELF-TUNING HTM Paolo Romano. Based on ICAC’14 paper N. Diegues and Paolo Romano Self-Tuning Intel Transactional Synchronization Extensions 11 th USENIX

Integration in GCC

Self-Tuning Intel RTM 41

our extensions