
how to implement any concurrent data structure

Marcos K. Aguilera

VMware

jointly with Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan

Where to find more information about this work

How to Implement Any Concurrent Data Structure. By Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, Marcos K. Aguilera. Communications of the ACM, 2018

Black-box Concurrent Data Structures for NUMA Architectures. Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, Marcos K. Aguilera. ASPLOS, 2017

concurrent data structures are everywhere

kernel

application libraries

applications

but efficient ones are hard to design

locks

transactional memory

lock-free and wait-free

effort in 2012–2014

The Future(s) of Shared Data Structures. Alex Kogan and Maurice Herlihy. PODC 2014

Concurrent Updates with RCU: Search Tree as an Example. Maya Arbel and Hagit Attiya. PODC 2014

Dynamic-Sized Nonblocking Hash Tables. Yujie Liu, Kunlong Zhang and Michael Spear. PODC 2014

Efficient Lock-free Binary Search Trees. Bapi Chatterjee, Nhan Nguyen and Philippas Tsigas. PODC 2014

The Amortized Complexity of Non-blocking Binary Search Trees. Faith Ellen, Panagiota Fatourou, Joanna Helga and Eric Ruppert. PODC 2014

The Adaptive Priority Queue with Elimination and Combining. Irina Calciu, Hammurabi Mendes and Maurice Herlihy. DISC 2014

Solo-fast Universal Constructions for Deterministic Abortable Objects. Claire Capdevielle, Colette Johnen and Alessia Milani. DISC 2014

On Deterministic Abortable Objects. Vassos Hadzilacos and Sam Toueg. PODC 2013

Leaplist: Lessons Learned in Designing TM-Supported Range Queries. Hillel Avni, Nir Shavit, and Adi Suissa. PODC 2013

The SkipTrie: Low-Depth Concurrent Search without Rebalancing. Rotem Oshman and Nir Shavit. PODC 2013

Pragmatic Primitives for Non-blocking Data Structures. Trevor Brown, Faith Ellen, and Eric Ruppert. PODC 2013

Lock-Free Data Structure Iterators. Erez Petrank and Shahar Timnat. DISC 2013

Practical Non-blocking Unordered Lists. Kunlong Zhang, Yujiao Zhao, Yajun Yang, Yujie Liu and Michael Spear. DISC 2013

Atomic snapshots in expected $O(\log^3 n)$ steps using randomized helping. James Aspnes and Keren Censor-Hillel. DISC 2013

An Optimal Implementation of Fetch-and-Increment. Faith Ellen and Philipp Woelfel. DISC 2013

On the Time and Space Complexity of Randomized Test-And-Set. George Giakkoupis and Philipp Woelfel. PODC 2012

Universal Constructions that Ensure Disjoint-Access Parallelism and Wait-Freedom. Faith Ellen, Panagiota Fatourou, Eleftherios Kosmas, Alessia Milani, and Corentin Travers. PODC 2012

Faster than Optimal Snapshots (for a While). James Aspnes, Hagit Attiya, Keren Censor-Hillel, and Faith Ellen. PODC 2012

Strongly Linearizable Implementations: Possibilities and Impossibilities. Maryam Helmi, Lisa Higham, and Philipp Woelfel. PODC 2012

CBTree: A Practical Concurrent Self-Adjusting Search Tree. Yehuda Afek, Haim Kaplan, Boris Korenfeld, Adam Morrison, Robert E. Tarjan. DISC 2012

Efficient Fetch-and-Increment. Faith Ellen, Vijaya Ramachandran, Philipp Woelfel. DISC 2012

problems with concurrent data structure design

herculean effort for each data structure

rigid designs

an even greater problem… new hardware architectures

our options?

1. underutilize the system

2. develop new data structures… for each new architecture

3. we think there is a better way

architecture-aware black-box data structures

[Figure: sequential data structures pass through transformation 1, 2, or 3 to target architecture 1, 2, or 3, respectively.]

focus of rest of talk: NUMA architecture

the NR algorithm

NUMA architecture: Non-Uniform Memory Access

❖ local access more efficient

[Figure: two NUMA nodes, each with its own memory and four cores; every core has a cache, and each node has an additional cache.]

evaluation

Intel Xeon E7-4850v3: 56 cores, 4 nodes, 2.2 GHz, 512 GB RAM, L3 35 MB, L2 256 KB, L1 64 KB

[Plot: skip list priority queue, 10% updates. x-axis: # threads (1, 28, 56, 84, 110); y-axis: ops/us (0 to 60). Series: (NR) Node Replication, (LF) Lock-free, (FC) Flat Combining, (FC+) FC + RWL, (RWL) Readers-Writer Lock, (SL) Spinlock.]

[Plot: data structure in REDIS, 10% updates. x-axis: # threads (1, 28, 56, 84, 110); y-axis: ops/us (0 to 6). Series: (NR) Node Replication, (FC) Flat Combining, (FC+) FC + RWL, (RWL) Readers-Writer Lock, (SL) Spinlock.]

the transformation

given single-threaded: execute(op,parameters), isReadOnly(op)

we produce multi-threaded: execute(op,parameters)

works well in NUMA servers
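To make the black-box contract concrete, here is a minimal sketch of the interface only. The names `SequentialStack`, `LockedWrapper`, `is_read_only`, and `demo` are hypothetical, and the wrapper is the simplest possible transformation (one global lock), which is a baseline and not the NR algorithm. The point is that the wrapper never looks inside the sequential structure; it only calls `execute`. This trivial wrapper does not need `is_read_only`; NR-style transformations use it to route reads differently.

```python
import threading


class SequentialStack:
    """Hypothetical single-threaded structure exposing the black-box interface."""

    def __init__(self):
        self._items = []

    def execute(self, op, *parameters):
        if op == "push":
            self._items.append(parameters[0])
            return None
        if op == "pop":
            return self._items.pop() if self._items else None
        if op == "peek":
            return self._items[-1] if self._items else None
        raise ValueError(f"unknown op: {op}")

    def is_read_only(self, op):
        return op == "peek"


class LockedWrapper:
    """Baseline transformation (not NR): serialize every call with one lock."""

    def __init__(self, sequential):
        self._seq = sequential
        self._lock = threading.Lock()

    def execute(self, op, *parameters):
        with self._lock:
            return self._seq.execute(op, *parameters)


def demo():
    shared = LockedWrapper(SequentialStack())

    def worker(base):
        for i in range(100):
            shared.execute("push", base + i)

    threads = [threading.Thread(target=worker, args=(t * 1000,)) for t in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Drain the stack and count how many pushes survived.
    count = 0
    while shared.execute("pop") is not None:
        count += 1
    return count
```

Calling `demo()` pushes 100 values from each of four threads and then counts what can be popped back.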

key ideas

1. replicate data structure across (NUMA) nodes: state machine approach with a shared log

2. provide efficient NUMA-aware log: large effort to optimize log
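The first key idea can be sketched as follows, under loud assumptions: `Counter`, `Replica`, `ReplicatedStructure`, and `demo` are hypothetical names, the caller passes its node number explicitly, and plain locks stand in for both the per-node combiner and the NUMA-aware log. Updates are appended to the shared log and replayed in log order on the local replica; a read-only operation notes the log tail at invocation, lets the local replica catch up to it, and then reads locally.

```python
import threading


class Counter:
    """Hypothetical sequential structure; replay requires it to be deterministic."""

    def __init__(self):
        self.value = 0

    def execute(self, op, *parameters):
        if op == "add":
            self.value += parameters[0]
            return self.value
        if op == "get":
            return self.value
        raise ValueError(f"unknown op: {op}")

    def is_read_only(self, op):
        return op == "get"


class Replica:
    def __init__(self, make_sequential):
        self.seq = make_sequential()
        self.local_tail = 0            # log entries already applied to this replica
        self.lock = threading.Lock()   # assumption: stands in for the node's combiner


class ReplicatedStructure:
    def __init__(self, make_sequential, num_nodes):
        self.log = []                      # shared log of (op, parameters)
        self.log_lock = threading.Lock()   # assumption: stands in for the real log
        self.replicas = [Replica(make_sequential) for _ in range(num_nodes)]

    def _catch_up(self, replica, target):
        # Replay entries [local_tail, target) in log order; return the last result.
        result = None
        while replica.local_tail < target:
            op, parameters = self.log[replica.local_tail]
            result = replica.seq.execute(op, *parameters)
            replica.local_tail += 1
        return result

    def execute(self, node, op, *parameters):
        replica = self.replicas[node]
        if replica.seq.is_read_only(op):
            with self.log_lock:
                target = len(self.log)     # log tail at invocation
            with replica.lock:
                self._catch_up(replica, target)
                return replica.seq.execute(op, *parameters)
        with replica.lock:
            with self.log_lock:
                self.log.append((op, parameters))
                target = len(self.log)
            # Our entry is the last one replayed, so its result is ours.
            return self._catch_up(replica, target)


def demo():
    shared = ReplicatedStructure(Counter, num_nodes=2)
    results = []
    results_lock = threading.Lock()

    def worker(node):
        mine = [shared.execute(node, "add", 1) for _ in range(250)]
        with results_lock:
            results.extend(mine)

    threads = [threading.Thread(target=worker, args=(t % 2,)) for t in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    reads = (shared.execute(0, "get"), shared.execute(1, "get"))
    return reads, sorted(results) == list(range(1, 1001))
```

Because every replica applies the same log in the same order, both nodes report the same final value, and each `add` returns a distinct value determined by its position in the log.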

[Figure: each NUMA node holds a local replica and a local tail and runs several threads; all nodes share a single log with a log tail.]

how to implement log?

key observation: coordination within node cheaper than across nodes

within node: we use flat combining

across nodes: we use lock-free appending to log
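The two levels of coordination can be illustrated with a simplified sketch. Everything named here (`AtomicInt`, `SharedLog`, `Node`, `demo`) is hypothetical. Python has no hardware compare-and-swap, so `AtomicInt` emulates one with a small lock; what matters is the retry loop that reserves a contiguous range of log slots for a whole batch. The within-node part is a simplification of flat combining: threads post to a shared pending list, and whichever thread holds the combiner lock appends everything posted so far. The log in this sketch has a fixed capacity and never recycles entries.

```python
import threading


class AtomicInt:
    """Emulated atomic integer (assumption: a lock stands in for hardware CAS)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


class SharedLog:
    def __init__(self, capacity):
        self.entries = [None] * capacity
        self.tail = AtomicInt(0)

    def append_batch(self, batch):
        # Reserve len(batch) consecutive slots; retry if another node won the race.
        while True:
            start = self.tail.load()
            if start + len(batch) > len(self.entries):
                raise OverflowError("log full (this sketch does not recycle entries)")
            if self.tail.compare_and_swap(start, start + len(batch)):
                break
        for offset, op in enumerate(batch):
            self.entries[start + offset] = op
        return start


class Node:
    def __init__(self, log):
        self.log = log
        self.pending = []                      # ops posted by this node's threads
        self.pending_lock = threading.Lock()
        self.combiner_lock = threading.Lock()  # at most one combiner per node

    def submit(self, op):
        with self.pending_lock:
            self.pending.append(op)
        # The thread holding the combiner lock appends the whole pending batch.
        with self.combiner_lock:
            with self.pending_lock:
                batch, self.pending = self.pending, []
            if batch:                          # empty if an earlier combiner took our op
                self.log.append_batch(batch)


def demo():
    log = SharedLog(capacity=1024)
    nodes = [Node(log), Node(log)]

    def worker(node_id, thread_id):
        for i in range(50):
            nodes[node_id].submit((node_id, thread_id, i))

    threads = [threading.Thread(target=worker, args=(n, t))
               for n in range(2) for t in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    tail = log.tail.load()
    return tail, len(set(log.entries[:tail]))
```

Each node touches the shared tail once per batch rather than once per operation, which is the reason for combining inside a node before appending across nodes.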

correctness

linearizability [Herlihy Wing 1990]: each operation appears to take effect instantaneously at a point between its invocation and response

whence performance comes

• trade memory + computation for less communication

• compact representation of operations

• limited cross-node synchronization and contention

• enable parallelism

  • combiners across nodes

  • readers within a node

  • readers and the combiner on the same node

• leverage batching


conclusion

• fundamental changes in hardware

• exposed to software developers

• take-away: instead of individual data structures, let’s develop general architecture-aware techniques
