how to implement any concurrent data structure
marcos k. aguilera
vmware
jointly with irina calciu, siddhartha sen, mahesh balakrishnan
Where to find more information about this work
How to Implement Any Concurrent Data Structure. By Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, Marcos K. Aguilera. Communications of the ACM, 2018
Black-box Concurrent Data Structures for NUMA Architectures. Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, Marcos K. Aguilera. ASPLOS, 2017
concurrent data structures are everywhere
kernel
application libraries
applications
but efficient ones are hard to design
locks
transactional memory
lock-free and wait-free
effort in 2012–2014
The Future(s) of Shared Data Structures. Alex Kogan and Maurice Herlihy. PODC 2014
Concurrent Updates with RCU: Search Tree as an Example. Maya Arbel and Hagit Attiya. PODC 2014
Dynamic-Sized Nonblocking Hash Tables. Yujie Liu, Kunlong Zhang and Michael Spear. PODC 2014
Efficient Lock-free Binary Search Trees. Bapi Chatterjee, Nhan Nguyen and Philippas Tsigas. PODC 2014
The Amortized Complexity of Non-blocking Binary Search Trees. Faith Ellen, Panagiota Fatourou, Joanna Helga and Eric Ruppert. PODC 2014
The Adaptive Priority Queue with Elimination and Combining. Irina Calciu, Hammurabi Mendes and Maurice Herlihy. DISC 2014
Solo-fast Universal Constructions for Deterministic Abortable Objects. Claire Capdevielle, Colette Johnen and Alessia Milani. DISC 2014
On Deterministic Abortable Objects. Vassos Hadzilacos and Sam Toueg. PODC 2013
Leaplist: Lessons Learned in Designing TM-Supported Range Queries. Hillel Avni, Nir Shavit, and Adi Suissa. PODC 2013
The SkipTrie: Low-Depth Concurrent Search without Rebalancing. Rotem Oshman and Nir Shavit. PODC 2013
Pragmatic Primitives for Non-blocking Data Structures. Trevor Brown, Faith Ellen, and Eric Ruppert. PODC 2013
Lock-Free Data Structure Iterators. Erez Petrank and Shahar Timnat. DISC 2013
Practical Non-blocking Unordered Lists. Kunlong Zhang, Yujiao Zhao, Yajun Yang, Yujie Liu and Michael Spear. DISC 2013
Atomic snapshots in expected $O(\log^3 n)$ steps using randomized helping. James Aspnes and Keren Censor-Hillel. DISC 2013
An Optimal Implementation of Fetch-and-Increment. Faith Ellen and Philipp Woelfel. DISC 2013
On the Time and Space Complexity of Randomized Test-And-Set. George Giakkoupis and Philipp Woelfel. PODC 2012
Universal Constructions that Ensure Disjoint-Access Parallelism and Wait-Freedom. Faith Ellen, Panagiota Fatourou, Eleftherios Kosmas, Alessia Milani, and Corentin Travers. PODC 2012
Faster than Optimal Snapshots (for a While). James Aspnes, Hagit Attiya, Keren Censor-Hillel, and Faith Ellen. PODC 2012
Strongly Linearizable Implementations: Possibilities and Impossibilities. Maryam Helmi, Lisa Higham, and Philipp Woelfel. PODC 2012
CBTree: A Practical Concurrent Self-Adjusting Search Tree. Yehuda Afek, Haim Kaplan, Boris Korenfeld, Adam Morrison, Robert E. Tarjan. DISC 2012
Efficient Fetch-and-Increment. Faith Ellen, Vijaya Ramachandran, Philipp Woelfel. DISC 2012
problems with concurrent data structure design
herculean effort for each data structure
rigid designs
an even greater problem… new hardware architectures
our options?
1. underutilize the system
2. develop new data structures… for each new architecture
3. we think there is a better way
architecture-aware black-box data structures
[diagram: sequential data structures pass through transformation 1, transformation 2, and transformation 3 to target architecture 1, architecture 2, and architecture 3, respectively]
focus of rest of talk: NUMA architecture
the NR algorithm
NUMA architecture: Non-Uniform Memory Access
❖ local access more efficient
[diagram: two nodes; each node has four cores with one cache per core, a shared cache, and its own memory]
evaluation
Intel Xeon E7-4850 v3, 56 cores, 4 nodes
2.2 GHz, 512 GB RAM, L3 35 MB, L2 256 KB, L1 64 KB
[plot: skip list priority queue, 10% updates; x-axis: # threads (1 to 110); y-axis: ops/us (0 to 60); series: (NR) Node Replication, (LF) Lock-free, (FC) Flat Combining, (FC+) FC + RWL, (RWL) Readers-Writer Lock, (SL) Spinlock]
[plot: data structure in REDIS, 10% updates; x-axis: # threads (1 to 110); y-axis: ops/us (0 to 6); series: (NR) Node Replication, (FC) Flat Combining, (FC+) FC + RWL, (RWL) Readers-Writer Lock, (SL) Spinlock]
the transformation
given single-threaded
execute(op,parameters)
isReadOnly(op)
we produce multi-threaded
execute(op,parameters)
works well in NUMA servers
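To make the black-box contract concrete, here is a minimal Python sketch of the two entry points the transformation is given. The names `SequentialMap`, `SingleLockTransformation`, and `demo` are hypothetical, and the wrapper shown is deliberately the simplest possible transformation (one lock around everything). It is not the NR algorithm; it only illustrates that the transformation needs nothing beyond `execute` and `is_read_only`.

```python
import threading


class SequentialMap:
    """Hypothetical single-threaded structure exposing the two entry points."""

    def __init__(self):
        self._data = {}

    def execute(self, op, *parameters):
        if op == "put":
            key, value = parameters
            self._data[key] = value
            return None
        if op == "get":
            (key,) = parameters
            return self._data.get(key)
        raise ValueError(f"unknown operation: {op}")

    def is_read_only(self, op):
        return op == "get"


class SingleLockTransformation:
    """Assumed baseline, NOT NR: serialize every call behind one lock."""

    def __init__(self, sequential):
        self._sequential = sequential
        self._lock = threading.Lock()

    def execute(self, op, *parameters):
        # The wrapper treats the sequential structure as a black box.
        with self._lock:
            return self._sequential.execute(op, *parameters)


def demo():
    shared = SingleLockTransformation(SequentialMap())
    workers = [
        threading.Thread(target=shared.execute, args=("put", i, i * i))
        for i in range(8)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return [shared.execute("get", i) for i in range(8)]
```

The multi-threaded `execute` has the same signature as the single-threaded one, which is what lets the sequential code stay unmodified.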
key ideas
1. replicate data structure across (NUMA) nodes: state machine approach with a shared log
2. provide efficient NUMA-aware log: large effort to optimize log
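Key idea 1 can be sketched as follows, assuming one replica per node, a log of update operations, and a per-replica index of how much of the log has been applied. This is a single-threaded simulation meant only to show the state machine approach; the class names (`SequentialStack`, `SharedLog`, `Replica`) are hypothetical, and none of the NUMA-aware log optimizations from key idea 2 are modeled.

```python
class SequentialStack:
    """Hypothetical sequential structure used as the black box."""

    def __init__(self):
        self._items = []

    def execute(self, op, *parameters):
        if op == "push":
            self._items.append(parameters[0])
            return None
        if op == "pop":
            return self._items.pop() if self._items else None
        if op == "peek":
            return self._items[-1] if self._items else None
        raise ValueError(f"unknown operation: {op}")

    def is_read_only(self, op):
        return op == "peek"


class SharedLog:
    """Append-only list of update operations shared by all replicas."""

    def __init__(self):
        self.entries = []

    def append(self, op, parameters):
        self.entries.append((op, parameters))
        return len(self.entries) - 1  # index of the new entry


class Replica:
    """One copy of the structure plus how far it has replayed the log."""

    def __init__(self, make_sequential, log):
        self._sequential = make_sequential()
        self._log = log
        self._local_tail = 0  # next log index this replica must apply

    def _catch_up(self, upto):
        results = {}
        while self._local_tail < upto:
            op, parameters = self._log.entries[self._local_tail]
            results[self._local_tail] = self._sequential.execute(op, *parameters)
            self._local_tail += 1
        return results

    def execute(self, op, *parameters):
        if self._sequential.is_read_only(op):
            # Reads are not logged; the replica only needs to be fresh.
            self._catch_up(len(self._log.entries))
            return self._sequential.execute(op, *parameters)
        index = self._log.append(op, parameters)
        # Replay everything up to and including our own entry.
        return self._catch_up(index + 1)[index]
```

Because every replica applies the same updates in the same log order, all replicas pass through the same sequence of states, and a read served by any up-to-date replica sees the effects of all earlier updates.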
[diagram: each NUMA node holds a local replica and a local tail and serves that node's threads; all nodes share one log with a log tail]
how to implement log?
key observation: coordination within node cheaper than across nodes
within node: we use flat combining
across nodes: we use lock-free appending to log
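The two levels of coordination can be sketched roughly as below, under strong simplifying assumptions. Within a node, threads publish their operations and whichever thread grabs the combiner lock handles the whole batch; across nodes, each combiner reserves a contiguous range of log slots in one step. Python has no compare-and-swap, so the tail reservation is simulated with a small lock standing in for the lock-free append; the names `SharedLog`, `NodeCombiner`, and `run_demo` are hypothetical, and replaying the log into a replica is omitted.

```python
import threading


class SharedLog:
    def __init__(self):
        self.entries = []
        self._tail_lock = threading.Lock()  # stand-in for a CAS on the log tail

    def append_batch(self, ops):
        with self._tail_lock:
            start = len(self.entries)
            self.entries.extend(ops)
        return start  # index of the first reserved slot


class NodeCombiner:
    """Flat-combining sketch for the threads of one node."""

    def __init__(self, log):
        self._log = log
        self._combiner_lock = threading.Lock()
        self._pending_lock = threading.Lock()
        self._pending = []

    def submit(self, op):
        request = {"op": op, "done": threading.Event(), "result": None}
        with self._pending_lock:
            self._pending.append(request)
        while not request["done"].is_set():
            if self._combiner_lock.acquire(blocking=False):
                try:
                    self._combine()
                finally:
                    self._combiner_lock.release()
            else:
                request["done"].wait(0.001)  # another thread is combining
        return request["result"]

    def _combine(self):
        with self._pending_lock:
            batch, self._pending = self._pending, []
        if not batch:
            return
        # One cross-node step for the whole batch.
        start = self._log.append_batch([r["op"] for r in batch])
        for offset, r in enumerate(batch):
            r["result"] = start + offset  # here the result is just the log index
            r["done"].set()


def run_demo(nodes=2, threads_per_node=4, ops_per_thread=50):
    log = SharedLog()
    indices, indices_lock = [], threading.Lock()

    def worker(combiner, name):
        mine = [combiner.submit((name, i)) for i in range(ops_per_thread)]
        with indices_lock:
            indices.extend(mine)

    threads = []
    for n in range(nodes):
        combiner = NodeCombiner(log)
        for t in range(threads_per_node):
            threads.append(
                threading.Thread(target=worker, args=(combiner, f"n{n}t{t}"))
            )
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log, indices
```

In this sketch, threads on the same node contend only on node-local state, and the shared tail is touched once per batch rather than once per operation.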
correctness
linearizability [Herlihy Wing 1990]: each operation appears to take effect instantaneously at a point between its invocation and response
whence performance comes
• trade memory + computation for less communication
• compact representation of operations
• limited cross-node synchronization and contention
• enable parallelism
  • combiners across nodes
  • readers within a node
  • readers and the combiner on the same node
• leverage batching
conclusion
• fundamental changes in hardware
• exposed to software developers
• take-away: instead of individual data structures, let’s develop general architecture-aware techniques