EazyHTM: Eager-Lazy Hardware Transactional Memory
Saša Tomić, Cristian Perfumo, Chinmay Kulkarni,
Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris, Mateo Valero
Barcelona Supercomputing Center, UPC
BITS Pilani
Microsoft Research Cambridge
Why Transactional Memory?
• Lock-based parallel programming has problems
  – Deadlocks, races, complexity, performance, …
• Transactional Memory (TM) to the rescue
  – Optimistic concurrency control mechanism
  – Easy to use
  – Deadlock free
  – Supports composability
  – Protects data in critical sections
• Hardware TM (HTM), software TM (STM), and hybrid
HTM terminology
• Atomic section/transaction: a group of instructions that appear to take effect instantaneously
• Where speculative values are stored (version management):
  – in place, with the original value logged, or
  – buffered in private storage and published on commit
• Conflict: one TX writes a location that other TXs read
  – Detection: the action in which we check for conflicts
  – Resolution: the action performed to resolve the conflict
    • Can be an abort, stalling the execution, …
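The two version-management options above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the hardware design of any system named in these slides: memory is modeled as a dictionary, and the class names `EagerVersioning` and `LazyVersioning` are hypothetical.

```python
class EagerVersioning:
    """Eager version management: write in place, log the original value."""

    def __init__(self, memory):
        self.memory = memory      # shared memory, modeled as a dict (assumption)
        self.undo_log = []        # (address, original value) pairs

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value             # new value lives in place

    def commit(self):
        self.undo_log.clear()                 # nothing to copy on commit

    def abort(self):
        while self.undo_log:                  # restore values in reverse order
            addr, original = self.undo_log.pop()
            self.memory[addr] = original


class LazyVersioning:
    """Lazy version management: buffer writes privately, publish on commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}    # private speculative values

    def read(self, addr):
        # A transaction must see its own buffered writes.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value       # shared memory is untouched

    def commit(self):
        self.memory.update(self.write_buffer) # publish all writes
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()             # just drop the buffer
```

In this sketch the eager variant has nothing to copy at commit but must walk the log on abort, while the lazy variant discards its buffer on abort and does the copying at commit.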
Eager HTM
• A.k.a. pessimistic
• Writes in place, detects and resolves conflicts on every access
• LogTM [Moore, HPCA06], LogTM-SE [Yen, HPCA07]
• Limited concurrency
• Fast commit
• Slow abort
[Figure: timeline of TX 1, TX 2, and TX 3 with a write (W), two reads (R), a stall, and a fast commit]
Lazy HTM
• A.k.a. optimistic
• Writes buffered, detects and resolves conflicts on commit
• TCC [Hammond, ISCA04], Scalable-TCC [Chafi, HPCA07]
• Good concurrency
• Fast abort
• Complex commit: validate + write
[Figure: timeline of TX 1, TX 2, and TX 3 with a write (W), two reads (R), and a complex commit]
The Motivation
Splitting conflict management
• Eager-lazy hardware-software TM exists (FlexTM [Shriraman, ISCA08]):
  – Software begin, commit, and abort
  – Probabilistic (signature-based) conflict detection
• EazyHTM is the first pure-hardware eager-lazy TM

                      Conflict resolution
Conflict detection    Eager         Lazy
Eager                 LogTM         EazyHTM (fast commit, good concurrency)
Lazy                  Impossible    TCC, S-TCC
Outline
• Motivation
• Contributions
• Hardware changes
• The Protocol
• Evaluation
• Conclusions
EazyHTM Contributions
• The best of two worlds
  – Eager conflict detection: simple commit, exact list of conflicts known in advance
  – Lazy conflict resolution: good concurrency
• Parallel commits of non-conflicting TXs
• Designed for CMPs (chip multiprocessors)
  – Exploits core proximity
  – MESI/MOESI protocol upgrade (easier verification)
Hardware changes
[Figure: several cores, each with a CPU and private cache(s), connected to a directory]
• CPU: register file checkpoint, racers list, killers list
  – Racers list: 1 bit per core
  – Killers list: 1 bit per core
  – The lists track conflicts; each is a bit-vector (32 bits for 32 cores)
• Private cache(s): existing cache logic plus SR and SM bits
  – SR: 1 bit per line
  – SM: 1 bit per line
  – SR and SM hold the read/write set
• Directory: existing directory logic plus TD bit
  – TD: 1 bit per line
  – Read-only optimization bit (details in the paper)
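The bit counts listed for the hardware changes can be added up with a small helper. This is a rough back-of-the-envelope sketch only: the function name `tm_state_bits` and the sizing in the example (in particular the number of directory lines) are hypothetical and are not taken from the slides.

```python
def tm_state_bits(num_cores, cache_lines, directory_lines):
    """Count the extra state bits named on the slide (rough estimate)."""
    return {
        "lists_per_core": 2 * num_cores,    # racers + killers bit-vectors
        "cache_per_core": 2 * cache_lines,  # SR + SM, 1 bit per line each
        "directory": directory_lines,       # TD, 1 bit per line
    }


# Hypothetical sizing: 32 cores, one 32KB private cache with 64B lines per
# core, and (purely for illustration) the same number of directory lines.
lines = (32 * 1024) // 64
bits = tm_state_bits(num_cores=32, cache_lines=lines, directory_lines=lines)
print(bits)
```

The per-core lists grow linearly with the core count, while the SR, SM, and TD bits grow with the number of tracked lines.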
Racers and killers list
• If a line is shared between two TXs:
  – Read-Read
    • No conflict
  – Write-Read, Read-Write, Write-Write
    • Writer adds reader TX into its “racers” list
      – “TXs that I have to abort” list, if I commit first
    • Reader adds writer TX into its “killers” list
      – “TXs that can abort me” list, if they commit first
• We illustrate only the Write-after-Read (WAR) conflict
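The bookkeeping rule above can be sketched with one integer bit-vector per list. The names `CoreTxState`, `record_conflict`, and `on_shared_access` are hypothetical. The slides do not spell out the Write-Write case, so treating it symmetrically (each TX is both a racer and a killer of the other) is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class CoreTxState:
    core_id: int
    racers: int = 0    # bit i set: abort the TX on core i if I commit first
    killers: int = 0   # bit i set: the TX on core i can abort me


def record_conflict(writer, reader):
    """Writer remembers whom to abort; reader remembers who can abort it."""
    writer.racers |= 1 << reader.core_id
    reader.killers |= 1 << writer.core_id


def on_shared_access(a, a_writes, b, b_writes):
    """Apply the rule for a line shared between TXs a and b."""
    if not a_writes and not b_writes:
        return                                # Read-Read: no conflict
    if a_writes:
        record_conflict(writer=a, reader=b)
    if b_writes:
        record_conflict(writer=b, reader=a)   # Write-Write: assumed symmetric


# WAR example from the slides: TX 0 reads line A, then TX 2 writes it.
tx0, tx2 = CoreTxState(core_id=0), CoreTxState(core_id=2)
on_shared_access(tx0, False, tx2, True)
print(bin(tx2.racers), bin(tx0.killers))
```

With 32 cores each list fits in a 32-bit vector, matching the "1 bit per core" cost on the hardware slide.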
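EazyHTM Protocol
Conflict Detection (1/2)
[Figure: TX 0 and TX 2, each with racers and killers lists, and the directory entry holding the sharers of @A]
• Example code
  – TX 0: BTX; RD A; CTX
  – TX 2: BTX; WR A; CTX
• 1: TX 0 sends txMark @A to the directory (replaces GETS/GETX)
• 2: the directory replies ACK @A, 0 (no other sharers)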
EazyHTM Protocol
Conflict Detection (2/2)
[Figure: same TX 0 / TX 2 example (TX 0: BTX; RD A; CTX and TX 2: BTX; WR A; CTX), steps 1 to 5]
• TX 2 sends txMark @A; the directory replies ACK @A, 1 (1 other sharer: potential conflict)
• Messages shown: txAccessor #2, @A; Reader #0, @A; Writer #2, @A
• TX 2 notes in its racers list: “Remember: abort TX#0 on commit”
• TX 0 notes in its killers list: “Remember: TX#2 can abort me”
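The directory side of conflict detection, where a txMark request is answered with an ACK carrying the number of other sharers, can be sketched as follows. The `Directory` class and its `tx_mark` method are hypothetical names for illustration, and the sketch ignores timing, message routing, and the TD bit.

```python
class Directory:
    """Tracks, per cache line, which cores have marked it transactionally."""

    def __init__(self):
        self.sharers = {}   # line address -> set of core ids

    def tx_mark(self, core_id, addr):
        """Handle txMark: register the requester, return the other sharers.

        The ACK in the slides carries the number of other sharers; the list
        itself is returned here so the caller can see whom it conflicts with.
        """
        line = self.sharers.setdefault(addr, set())
        others = sorted(line - {core_id})
        line.add(core_id)
        return others


directory = Directory()
ack_tx0 = directory.tx_mark(core_id=0, addr="A")   # TX 0: RD A
ack_tx2 = directory.tx_mark(core_id=2, addr="A")   # TX 2: WR A
print(len(ack_tx0), len(ack_tx2))                  # prints "0 1"
```

A TX that marks a line nobody else has touched gets back zero sharers, so no conflict is recorded for it.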
EazyHTM Protocol
Conflict Resolution
[Figure: same TX 0 / TX 2 example (TX 0: BTX; RD A; CTX and TX 2: BTX; WR A; CTX), steps 1 to 3, with the racers and killers lists of both TXs and the directory entry for @A]
• TX#2 came to the commit point first: abort TX#0!
• Messages shown: Abort from TX#2; Abort Ack from TX#0; WR @A (commit)
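The resolution step, in which the first TX to reach its commit point aborts the TXs in its racers list and then writes, can be sketched as below. This is a simplified model rather than the protocol itself: the `Tx` class and `commit` function are hypothetical, the racers list is a Python set instead of a bit-vector, abort acknowledgements are implicit, and buffered writes are an assumption of the sketch.

```python
class Tx:
    def __init__(self, core_id):
        self.core_id = core_id
        self.racers = set()       # core ids to abort if this TX commits first
        self.write_buffer = {}    # speculative writes (assumed buffered)
        self.status = "running"


def commit(tx, txs, memory):
    """Commit tx, aborting every still-running TX in its racers list."""
    if tx.status != "running":
        return False                      # a killer committed first
    for victim_id in sorted(tx.racers):
        victim = txs[victim_id]
        if victim.status == "running":
            victim.status = "aborted"     # abort message; ack is implicit here
            victim.write_buffer.clear()
    memory.update(tx.write_buffer)        # the write at commit, e.g. WR @A
    tx.status = "committed"
    return True


memory = {"A": 0}
txs = {0: Tx(0), 2: Tx(2)}
txs[2].write_buffer["A"] = 42     # TX 2: WR A
txs[2].racers.add(0)              # recorded at detection time: TX 0 read A
committed = commit(txs[2], txs, memory)
```

Because the racers list was built during execution, the committing TX in this sketch knows exactly whom to abort and needs no validation pass at commit time.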
EazyHTM Protocol
Disjoint data => parallel commit
[Figure: TX 0 and TX 2, each with empty racers and killers lists, and the directory entries holding the sharers of @A and @B]
• Example code
  – TX 0: BTX; WR A; CTX
  – TX 2: BTX; WR B; CTX
• TX#0 works with line @A; TX#2 works with line @B
• 1: txMark @A and txMark @B
• 2: ACK @A, 0 and ACK @B, 0 (0 other sharers)
• 3: WR @A (commit) and WR @B (commit)
• NO SERIALIZATION
Implementation
• Implemented in M5, a full-system simulator (Alpha)
• Private L1 (32KB, 4-way, 64B CL, 2 cycles)
• Private L2 (512KB, 8-way, 64B CL, 10 cycles)
• Memory (with directory, 100 cycles)
• ICN (2D mesh, 10 cycles per hop)
Evaluation
• Evaluated STAMP benchmarks
• Compared with a Scalable-TCC-like HTM
  – Same base simulator
  – Implemented specialized directory protocol
• Compared with an ideal lazy HTM (MESI based)
  – Magical conflict detection
  – Instant conflict resolution
  – Parallel write-back commit
Kmeans Low
• Small TXs (RS 15 CL, WS 5 CL)
• Low contention (10% aborts)
• Similar profile to “replacing locks with atomic”
• Near-ideal performance
• K-means: groups points in an N-dimensional space into K clusters
• Most of the SPLASH-2 suite has a similar profile
[Figure: Kmeans-Low, speedup vs. number of processors for Ideal, EazyHTM, and STCC]
SSCA2
• Small TXs (RS 50 CL, WS 10 CL)
• Low contention (1.2% aborts)
• Near-ideal performance
• Scalability affected by barriers, not by contention
• SSCA2: large directed graph operations
[Figure: SSCA2, speedup vs. number of processors for Ideal, EazyHTM, and STCC]
Yada
• Large TXs (260 CL RS, 140 CL WS)
• Moderate contention (35% aborts)
• Good performance for large TXs as well!
• Yada: Delaunay mesh refinement
[Figure: Yada, speedup vs. number of processors for Ideal, EazyHTM, and STCC]
Intruder
• Medium TXs (53 CL RS, 20 CL WS)
• High contention (85% aborts)
• Very bad scalability for all HTMs
• Every transaction detects conflicts over and over again; the many conflict-detection messages slow down execution
• Intruder: signature-based network intrusion detection system
[Figure: Intruder, speedup vs. number of processors for Ideal, EazyHTM, and STCC]
Only high-conflict STAMP
• >50% abort rate only
• High-contention, high-core-count execution should be optimized
• Averages:
  – Labyrinth
  – Intruder
  – Kmeans-Hi
• Results highly affected by Intruder
[Figure: High-conflict STAMP, speedup vs. number of processors for Ideal, EazyHTM, and STCC]
Only low-conflict STAMP
• <50% abort rate only
• Low abort rate necessary for scaling
• Excludes:
  – Labyrinth at 8-32 processors
  – Intruder at 16-32 processors
  – Kmeans-Hi at 32 processors
[Figure: Scaling STAMP, speedup vs. number of processors for Ideal, EazyHTM, and STCC]
Conclusions
• Introduced EazyHTM, a new HTM implementation
  – Eager conflict detection, lazy conflict resolution
  – Fast: performs well for low-conflict parallel applications
  – Minimal changes to directory protocols (easier verification)
  – As scalable as a standard directory protocol
• The EazyHTM mechanism could allow (future work):
  – Simpler transaction prioritization
  – Less wasted work
  – Better performance optimization
  – Power-efficient TM mechanisms