Post on 27-Jun-2020
EazyHTM: Eager-Lazy Hardware
Transactional Memory
Saša Tomić, Cristian Perfumo, Chinmay Kulkarni,
Adrià Armejach, Adrián Cristal, Osman Unsal,
Tim Harris, Mateo Valero
Barcelona Supercomputing Center, UPC
BITS Pilani
Microsoft Research Cambridge
Why Transactional Memory?
• Lock-based parallel programming has problems
– Deadlocks, races, complexity, performance, …
• Transactional Memory (TM) to the rescue
– Optimistic concurrency control mechanism
– Easy to use
– Deadlock free
– Supports composability
– Protects data in critical sections
• Hardware-TM (HTM), Software-TM (STM) and hybrid
HTM terminology
• Atomic section/transaction: group of instructions that
appear to take effect instantaneously
• Where are speculative values stored (version
management):
– in-place, and log the original value, or
– buffered in private storage, publish on commit
• Conflict: a TX writes a location that another TX reads or writes
– Detection: an action in which we check for conflicts
– Resolution: an action performed to resolve the conflict
• Can be abort, stalling the execution, …
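The two version-management options above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware mechanism: the class names, the dictionary standing in for memory, and the method names are all assumptions made for the example.

```python
class EagerVersioning:
    """Writes in place and logs the original value (hypothetical class)."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, original value) pairs

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value  # speculative value is visible in place

    def commit(self):
        self.undo_log.clear()  # new values are already in place

    def abort(self):
        # Undo in reverse order so the oldest logged value is restored last.
        for addr, original in reversed(self.undo_log):
            self.memory[addr] = original
        self.undo_log.clear()


class LazyVersioning:
    """Buffers writes privately and publishes them on commit (hypothetical class)."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def write(self, addr, value):
        self.write_buffer[addr] = value  # memory is untouched until commit

    def read(self, addr):
        return self.write_buffer.get(addr, self.memory[addr])

    def commit(self):
        self.memory.update(self.write_buffer)  # publish buffered values
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()  # nothing to undo in memory
```

In this model the eager variant has the cheap commit and the costly abort, while the lazy variant has the cheap abort and does its work at commit.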
Eager HTM
• A.k.a. pessimistic
• Writes in place; detects and resolves conflicts on every access
• LogTM [Moore, HPCA06], LogTM-SE [Yen, HPCA07]
• Fast commit, slow abort, limited concurrency
[Figure: timeline of TX 1, TX 2 and TX 3 with reads (R), a write (W), a stall and a fast commit]
Lazy HTM
• A.k.a. optimistic
• Writes buffered; detects and resolves conflicts on commit
• TCC [Hammond, ISCA04], Scalable-TCC [Chafi, HPCA07]
• Fast abort, complex commit, good concurrency
[Figure: timeline of TX 1, TX 2 and TX 3 with reads (R) and a write (W); complex commit: validate + write]
The Motivation
Splitting conflict management
• Eager-Lazy hardware-software TM exists (FlexTM [Shriraman, ISCA08]):
– Software begin, commit and abort
– Probabilistic (signature based) conflict detection
• EazyHTM is the first pure-hardware eager-lazy TM
Conflict detection vs. conflict resolution:
                  | Eager resolution | Lazy resolution
Eager detection   | LogTM            | EazyHTM (fast commit, good concurrency)
Lazy detection    | Impossible       | TCC, S-TCC
Outline
• Motivation
• Contributions
• Hardware changes
• The Protocol
• Evaluation
• Conclusions
EazyHTM Contributions
• The best of both worlds
– Eager conflict detection: simple commit, with the exact list of
conflicts known in advance
– Lazy conflict resolution: good concurrency
• Parallel commits of non-conflicting TXs
• Designed for CMPs (Chip-Multiprocessors)
– Exploits the proximity of cores
– MESI/MOESI protocol upgrade (easier verification)
Hardware changes
• CPU: register file checkpoint, racers list and killers list
– Racers list: 1 bit per core
– Killers list: 1 bit per core
– The lists track conflicts; each is a bit-vector (32 bits for 32 cores)
• Private cache(s): SR and SM bits next to the existing cache logic
– SR: 1 bit per line
– SM: 1 bit per line
– Hold the read/write set
• Directory: TD bit next to the existing directory logic
– TD: 1 bit per line
– Read-only optimization bit (details in the paper)
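As a rough sense of scale, the per-core storage these additions imply can be worked out with simple arithmetic. The sketch below assumes 64-byte cache lines and, as a labeled assumption not stated on the slide, that every line of both private caches (32KB L1, 512KB L2, the sizes used in the evaluation setup) carries an SR and an SM bit. The function names are hypothetical.

```python
def cache_lines(cache_bytes, line_bytes=64):
    """Number of lines in a cache of the given size."""
    return cache_bytes // line_bytes


def per_core_bits(num_cores, l1_bytes, l2_bytes, line_bytes=64):
    """Return (list bits, read/write-set bits) added per core.

    Assumption: SR and SM bits exist on every line of both private caches.
    """
    list_bits = 2 * num_cores  # racers list + killers list, 1 bit per core each
    lines = cache_lines(l1_bytes, line_bytes) + cache_lines(l2_bytes, line_bytes)
    set_bits = 2 * lines  # SR + SM, 1 bit per line each
    return list_bits, set_bits


list_bits, set_bits = per_core_bits(32, 32 * 1024, 512 * 1024)
print(list_bits, set_bits)  # 64 17408
```

Under these assumptions the two lists cost 64 bits per core, and the read/write-set bits cost 17,408 bits (about 2.1KB) per core.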
Racers and killers list
• If a line is shared between two TXs:
– Read-Read
• No conflict
– Write-Read, Read-Write, Write-Write
• Writer adds reader TX into “racers” list
– “TXs that I have to abort” list, if I commit first
• Reader adds writer TX into “killers” list
– “TXs that can abort me” list, if they commit first
• We illustrate only the Write-after-Read (WAR) conflict
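The bookkeeping for the Write-after-Read case can be sketched with plain integers used as bit-vectors. This is a minimal model under assumptions: the class and function names are invented for the example, and only the reader/writer pairing described above is handled.

```python
NUM_CORES = 32  # one bit per core in each list


class TxCore:
    """Per-core conflict lists, modeled as integer bit-vectors (hypothetical)."""

    def __init__(self, core_id):
        assert 0 <= core_id < NUM_CORES
        self.core_id = core_id
        self.racers = 0   # bit i set: abort the TX on core i if I commit first
        self.killers = 0  # bit i set: the TX on core i can abort me


def is_conflict(first_is_write, second_is_write):
    """Read-Read is the only combination that is not a conflict."""
    return first_is_write or second_is_write


def record_conflict(writer, reader):
    """Writer remembers the reader as a racer; reader remembers the writer as a killer."""
    writer.racers |= 1 << reader.core_id
    reader.killers |= 1 << writer.core_id


tx0, tx2 = TxCore(0), TxCore(2)
if is_conflict(first_is_write=False, second_is_write=True):  # TX 0 read, TX 2 writes
    record_conflict(writer=tx2, reader=tx0)
print(bin(tx2.racers), bin(tx0.killers))  # 0b1 0b100
```

Keeping both lists lets either side act locally later: the committing writer knows whom to abort, and the reader knows who may abort it.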
EazyHTM Protocol
Conflict Detection (1/2)
• Example schedule:
– TX 0: BTX, RD A, CTX
– TX 2: BTX, WR A, CTX
• TX 0 reads line A first:
– 1: txMark @A goes to the directory (replaces GETS/GETX)
– 2: ACK @A, 0 comes back: no other sharers
[Figure: racers and killers lists of TX 0 and TX 2, and the directory entry "sharers @A"]
EazyHTM Protocol
Conflict Detection (2/2)
• TX 2 then writes line A. Messages in the figure (numbered 1 to 5 on the slide):
– txMark @A
– ACK @A, 1: 1 other sharer, so a potential conflict
– txAccessor #2, @A
– Reader #0, @A
– Writer #2, @A
• TX 2 remembers: abort TX#0 on commit (TX 0 goes into its racers list)
• TX 0 remembers: TX#2 can abort me (TX 2 goes into its killers list)
[Figure: racers and killers lists of TX 0 and TX 2, and the directory entry "sharers @A"]
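The detection step can be modeled in software as a directory that tracks sharers per line and returns the sharer count with its acknowledgement. This is a simplified sketch under assumptions, not the paper's protocol: the class names, `tx_mark`, and `resolve_roles` are hypothetical, Python sets stand in for the SR/SM bits and the bit-vector lists, and the write-write case is left out because only the Write-after-Read case is illustrated here.

```python
class Core:
    def __init__(self, cid):
        self.cid = cid
        self.read_set, self.write_set = set(), set()  # stand-ins for SR/SM bits
        self.racers, self.killers = set(), set()


def resolve_roles(requester, sharer, addr, is_write):
    """Requester and sharer compare access types and update their lists."""
    if is_write and addr in sharer.read_set:
        requester.racers.add(sharer.cid)   # "abort the reader when I commit"
        sharer.killers.add(requester.cid)  # "the writer can abort me"
    elif not is_write and addr in sharer.write_set:
        sharer.racers.add(requester.cid)
        requester.killers.add(sharer.cid)
    # Read-Read: no conflict. Write-Write is not modeled in this sketch.


class Directory:
    def __init__(self):
        self.sharers = {}  # address -> set of core ids

    def tx_mark(self, cores, requester, addr, is_write):
        """Register the access and return the number of other sharers (the ACK count)."""
        others = self.sharers.setdefault(addr, set()) - {requester.cid}
        for cid in sorted(others):
            resolve_roles(requester, cores[cid], addr, is_write)
        self.sharers[addr].add(requester.cid)
        (requester.write_set if is_write else requester.read_set).add(addr)
        return len(others)


cores = {0: Core(0), 2: Core(2)}
directory = Directory()
print(directory.tx_mark(cores, cores[0], "A", is_write=False))  # 0
print(directory.tx_mark(cores, cores[2], "A", is_write=True))   # 1
print(cores[2].racers, cores[0].killers)  # {0} {2}
```

Note that nothing is aborted at this point: the conflict is only recorded, which is what allows resolution to wait until commit.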
EazyHTM Protocol
Conflict Resolution
• TX#2 came to the commit point first: abort TX#0!
• Messages in the figure (numbered 1 to 3 on the slide):
– Abort from TX#2
– Abort Ack from TX#0
– WR @A (commit)
• Schedule: TX 0: BTX, RD A, CTX; TX 2: BTX, WR A, CTX
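Commit-time resolution can be sketched as follows: the committing TX aborts every TX in its racers list, collects the acknowledgements, and only then publishes its buffered writes. The `Tx` class, its fields, and the `commit` function are assumptions made for this illustration.

```python
class Tx:
    def __init__(self, tid):
        self.tid = tid
        self.racers = set()     # TXs to abort if this TX commits first
        self.write_buffer = {}  # speculative writes, published on commit
        self.state = "running"


def commit(tx, txs, memory):
    """Abort all racers, then write back. Returns the number of abort acks."""
    acks = 0
    for rid in sorted(tx.racers):
        racer = txs[rid]
        if racer.state == "running":
            racer.write_buffer.clear()  # "Abort from TX#..."
            racer.state = "aborted"
        acks += 1                       # "Abort Ack from TX#..."
    memory.update(tx.write_buffer)      # "WR @A (commit)"
    tx.write_buffer.clear()
    tx.state = "committed"
    return acks


memory = {"A": 0}
tx0, tx2 = Tx(0), Tx(2)
tx2.racers.add(0)          # recorded earlier, during conflict detection
tx2.write_buffer["A"] = 7
print(commit(tx2, {0: tx0, 2: tx2}, memory), tx0.state, memory)
# 1 aborted {'A': 7}
```

Because the racers list is exact and known in advance, the committer does not need a separate validation phase in this model.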
EazyHTM Protocol
Disjoint data => parallel commit
• TX#0 works with line @A, TX#2 works with line @B
– Schedule: TX 0: BTX, WR A, CTX; TX 2: BTX, WR B, CTX
• Both TXs go through the same steps in parallel:
– 1: txMark @A / txMark @B
– 2: ACK @A, 0 / ACK @B, 0 (0 other sharers)
– 3: WR @A (commit) / WR @B (commit)
• NO SERIALIZATION
[Figure: racers and killers lists of TX 0 and TX 2, and the directory entries "sharers @A" and "sharers @B"]
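One way to see why disjoint transactions need no serialization: with empty racers lists, a commit sends no abort messages, and the final memory state does not depend on commit order. The sketch below checks this for the two-TX example; the dictionary layout and function names are hypothetical.

```python
from itertools import permutations


def commit(tx, memory):
    """Publish the TX's writes; return the abort messages it would send."""
    aborts = [("abort", tx["id"], rid) for rid in sorted(tx["racers"])]
    memory.update(tx["writes"])
    return aborts


def run(order, txs):
    """Commit the TXs in the given order and report memory and messages."""
    memory = {"A": 0, "B": 0}
    messages = []
    for tid in order:
        messages += commit(txs[tid], memory)
    return memory, messages


txs = {
    0: {"id": 0, "writes": {"A": 1}, "racers": set()},  # works with line A only
    2: {"id": 2, "writes": {"B": 2}, "racers": set()},  # works with line B only
}
for order in permutations(txs):
    print(order, run(order, txs))
# (0, 2) ({'A': 1, 'B': 2}, [])
# (2, 0) ({'A': 1, 'B': 2}, [])
```

Both orders give the same result with no messages, so neither commit has to wait for the other.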
Implementation
• Implemented in M5, a full-system simulator (Alpha)
• Private L1 (32KB, 4-way, 64B CL, 2 cycles)
• Private L2 (512KB, 8-way, 64B CL, 10 cycles)
• Memory (with directory, 100 cycles)
• ICN (2D Mesh, 10 cycles per hop)
Evaluation
• Evaluated STAMP benchmarks
• Compared with Scalable-TCC-like HTM
– Same base simulator
– Implemented specialized directory protocol
• Compared with ideal lazy HTM (MESI based)
– magical conflict detection
– instant conflict resolution
– parallel write-back commit
Kmeans Low
• Small TXs (RS 15 CL; WS 5 CL)
• Low contention
(10% aborts)
• Similar profile to
“replacing locks with atomic”
• Near ideal performance
• K-means: groups N-dimensional
space into K clusters
• Most of the SPLASH-2 suite has
similar profile
[Figure: Kmeans-Low, speedup (0 to 30) vs. number of processors (0 to 40), for Ideal, EazyHTM and STCC]
SSCA2
• Small TXs (RS 50 CL, WS 10 CL)
• Low contention
(1.2% aborts)
• Near ideal performance
• Scalability affected by barriers,
not by contention
• SSCA2: large directed graph
operations
[Figure: SSCA2, speedup (0 to 4.5) vs. number of processors (0 to 40), for Ideal, EazyHTM and STCC]
Yada
• Large TXs (260 CL RS, 140 CL
WS)
• Moderate contention
(35% aborts)
• Good performance for large TXs as well
• Yada: Delaunay mesh refinement
[Figure: Yada, speedup (0 to 12) vs. number of processors (0 to 40), for Ideal, EazyHTM and STCC]
Intruder
• Medium TXs (53 CL RS, 20 CL
WS)
• High contention (85%
aborts)
• Very bad scalability for all HTMs
• Every transaction detects conflicts over and over again; the many
conflict detection messages slow down execution
• Intruder: signature-based network intrusion detection system
[Figure: Intruder, speedup (0 to 12) vs. number of processors (0 to 40), for Ideal, EazyHTM and STCC]
Only high-conflict STAMP
• >50% abort rate only
• High-contention, high-core-count execution should be optimized
• Averages over:
– Labyrinth
– Intruder
– Kmeans-Hi
• Results highly affected by
Intruder
[Figure: High-conflict STAMP, speedup (0 to 12) vs. number of processors (0 to 40), for Ideal, EazyHTM and STCC]
Only low-conflict STAMP
• <50% abort rate only
• Low abort rate necessary for
scaling
• Excludes:
– Labyrinth 8-32
– Intruder 16-32
– Kmeans-Hi 32
[Figure: Scaling STAMP, speedup (0 to 12) vs. number of processors (0 to 40), for Ideal, EazyHTM and STCC]
Conclusions
• Introduced EazyHTM, a new HTM implementation
– Eager conflict detection, lazy conflict resolution
– Fast: performs well for low-conflict parallel applications
– Minimal changes to directory protocols (easier verification)
– As scalable as a standard directory protocol
• EazyHTM mechanism could allow (future work):
– Simpler transaction prioritization
– Less wasted work
– Better performance optimization
– Power-efficient TM mechanisms