smart data structures

SAN FRANCISCO, CA, USA

Smart Data Structures

Jonathan Eastep, David Wingate, Anant Agarwal

06/3/2012

Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments

Multicores are Complex!

• The Problem
  – System complexity is skyrocketing!
  – Multicore architecture is a moving target
  – The best algorithm and algorithm settings depend
  – Application inputs and workloads can be dynamic
  – Online tuning is necessary but typically absent



The Big Picture

• Developed a dynamic optimization framework to auto-tune software and minimize programmer burden

• Framework is based on online machine learning technologies

• Demonstrated the framework by designing “Smart Data Structures” for parallel programs

• The framework is general; could apply to systems such as Clouds, OS, Runtimes



Smart Data Structures

• Smart Data Structures are parallel data structures that self-optimize to minimize programmer burden

• They use online machine learning to adapt to changing app or system needs and achieve the best performance

• A library of Smart Data Structures is open sourced on GitHub (GPL): github.com/mit-carbon/Smart-Data-Structures

• Publications: [1], [2], [3], [4]



A Sketch of The Benefits of SDS

• Use a Smart Lock to optimize a master-worker program
  – Measure rate of completed work items
  – Emulate dynamic frequency scaling due to Intel Turbo Boost®
  – Workload 1: Worker 0 @ 3GHz, others @ 2GHz
  – Workload 2: Worker 3 @ 3GHz, others @ 2GHz

[Figure: heart rate (items per second / 1e6) vs. time (seconds), running Workload #1, then Workload #2, then Workload #1. Series: Optimal; Smartlock; Priority lock: policy 1; Priority lock: policy 2; Spin-lock: Reactive Lock; Spin-lock: Test and Set. Annotations mark the Ideal, Smartlock, and Baseline curves and the gap between them.]


Outline

• Smart Data Structures
  – Anatomy of a Smart Data Structure
  – Implementation Example
  – Research Challenges and Solutions
  – Online Machine Learning Algorithm
  – Empirical Benchmark Results
  – Empirical Scalability Studies
  – Future Directions
  – Conclusions



What are Smart Data Structures?

• Self-aware computing applied to data structures
• Data Structures that self-optimize using online learning
• We can optimize knobs in other systems too

[Diagram: a conventional Data Structure (e.g. Queue) next to a Smart Data Structure (e.g. Smart Queue). In both, threads t1, t2, …, tn use an Interface (add, remove, peek) on top of an Algorithm and Storage. The conventional structure's knobs are static: hand-tuned, per system, per app. The Smart Data Structure adds Online Learning, so its knobs are self-tuned, automatically, at runtime.]


Smart Data Structure Library

• C++/C Library of Popular Parallel Data Structures

• Supported (with ML optimization type):
  – Smart Lock: Lock Acquisition Scheduling
  – Smart Queue: Tuning Flat Combining
  – Smart SkipList: Tuning Flat Combining
  – Smart PairHeap: Tuning Flat Combining
  – Smart Stack: Tuning Flat Combining

• Future Work:
  – Smart DHT: Dynamic Load-Balancing


Smart Queue, SkipList, PairHeap, Stack

• Implementation should leverage best-performing prior work
• What are the best? Determine with experiments.
• Result: Flat Combining Data Structures from Hendler et al.
• This is contrary to conventional wisdom
• Reason: FC Algorithm minimizes synchronization overheads by combining data structure ops and applying multiple ops at once


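Flat Combining Primer

[Diagram: threads post requests (enq a, enq b, enq c, enq d) while Working. One thread takes the Lock and switches from Working to Combining: it scans the posted requests, counting the Scancount down (3, 2, 1, 0), and applies them to the Serial Data Structure.]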


Smart Queue, SkipList, PairHeap, Stack

• Here the application of learning is to auto-tune a performance-critical knob called the scancount

[Diagram: a Smart Queue. Threads t1, t2, …, tn call the Interface (enqueue, dequeue, peek) and post Thread Request Records; under the Lock, the combiner applies them to a Serial Queue. The knob is the Scancount, the number of scans over request records. Reinforcement Learning (of a discrete variable) dynamically tunes the time spent combining.]


Why Does the Scancount Matter?

• Scancount controls how long threads spend as the combiner

• Increasing scancount allows combiner to do more data structure ops within the same lock

• But, increasing scancount increases latency of the combiner’s op

• It’s good to increase scancount up to a point, but after that latency can hurt performance

• Smart Data Structures use online learning to find the ideal scancount at any given time
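
A minimal sketch of the combiner loop and its scancount knob, in Python rather than the library's C++. The names `FlatCombiningQueue`, `post`, `combine`, and `between_scans` are hypothetical and are not the library's API. Real flat combining uses per-thread publication records with threads running concurrently; here the `between_scans` callback stands in for other threads posting requests while the combiner holds the lock, so the behavior is deterministic.

```python
import threading
from collections import deque


class FlatCombiningQueue:
    """Toy flat-combining queue: threads post requests, one combiner applies them."""

    def __init__(self, scancount=1):
        self.scancount = scancount              # knob: scans per lock hold
        self._lock = threading.Lock()           # the single combiner lock
        self._records_guard = threading.Lock()  # protects the request list
        self._records = []                      # pending request records
        self._queue = deque()                   # underlying serial queue
        self.lock_acquisitions = 0              # lightweight statistic

    def post(self, op, value=None):
        """Publish a request record and return it so the caller can read the result."""
        record = {"op": op, "value": value, "done": False, "result": None}
        with self._records_guard:
            self._records.append(record)
        return record

    def pending(self):
        with self._records_guard:
            return len(self._records)

    def combine(self, between_scans=None):
        """Act as combiner: scan the records `scancount` times under one lock hold."""
        applied = 0
        with self._lock:
            self.lock_acquisitions += 1
            for _ in range(self.scancount):
                with self._records_guard:
                    batch, self._records = self._records, []
                for record in batch:
                    if record["op"] == "enq":
                        self._queue.append(record["value"])
                    elif record["op"] == "deq":
                        record["result"] = self._queue.popleft() if self._queue else None
                    record["done"] = True
                    applied += 1
                if between_scans is not None:
                    between_scans()  # simulated arrivals from other threads
        return applied


def demo(scancount):
    """Three requests are waiting; two more arrive while the combiner is scanning."""
    q = FlatCombiningQueue(scancount)
    for v in "abc":
        q.post("enq", v)
    late = iter("de")

    def late_arrival():
        v = next(late, None)
        if v is not None:
            q.post("enq", v)

    applied = q.combine(between_scans=late_arrival)
    return applied, q.pending()


one_scan = demo(scancount=1)     # (ops applied, requests still pending)
three_scans = demo(scancount=3)
```

With one scan the late request must wait for another lock acquisition; with three scans the same lock hold also picks up the late arrivals, at the cost of the combiner's own op returning later.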



SDS Implementation

• Goal: minimize application disruption

• Internal lightweight statistics or external application-specific reward signal

• Number of learning threads is one by default; it runs learning engines for all SDS

[Diagram: Application Threads t1, t2, …, tn use a Smart Data Structure (e.g. Smart Queue) through its Interface (add, remove, peek) over an Algorithm and Storage. A Learning Thread runs the Online Learning. Its Reward comes either from an internal stat, throughput (ops/s), or from an External Perf. Monitor, e.g. Heartbeats.]
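
A minimal sketch of how a single helper thread can keep learning off the application threads' critical path. Every name here (`LearningThread`, `OpCounter`, `choose`) is hypothetical, and the `choose` rule is only a placeholder for a learning engine, not a learning algorithm.

```python
import threading


class OpCounter:
    """Lightweight internal statistic: operations completed since the last reading."""

    def __init__(self):
        self.total = 0
        self._seen = 0

    def record(self, n=1):
        self.total += n

    def delta(self):
        now = self.total            # read once so no increments are lost
        done = now - self._seen
        self._seen = now
        return done


class LearningThread:
    """Periodically reads a reward, picks a knob setting, and applies it."""

    def __init__(self, read_reward, choose_knob, apply_knob, interval_s=0.01):
        self._read_reward = read_reward
        self._choose_knob = choose_knob
        self._apply_knob = apply_knob
        self._interval_s = interval_s
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.steps = 0

    def _step(self):
        self._apply_knob(self._choose_knob(self._read_reward()))
        self.steps += 1

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop_event.wait(self._interval_s):
            self._step()
        self._step()  # final step so even a very short run learns once

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()


counter = OpCounter()
knob = {"scancount": 1}
rewards = []


def choose(reward):
    rewards.append(reward)
    return 2 if reward > 0 else 1   # placeholder rule, NOT a learning algorithm


learner = LearningThread(counter.delta, choose, lambda v: knob.update(scancount=v))
learner.start()
for _ in range(1000):
    counter.record()                # application work; never waits on the learner
learner.stop()
```

The application threads only bump a counter; all reward processing and knob updates happen in the helper thread.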


SDS Implementation

– Machine learning co-optimization framework
– Supports joint optimization: multiple knobs
– Supports discrete, Gaussian, boolean, permutation knobs
– Designed explicitly to support other systems than SDS
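
To make the four knob types concrete, here is a sketch of how a joint setting over several knobs could be represented and sampled. All class names and knob names (`scancount`, `backoff_us`, `use_fc`, `priority_order`) are hypothetical. The distributions are fixed here for illustration; in the framework they would be learned.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DiscreteKnob:
    values: Sequence[int]

    def sample(self, rng):
        return rng.choice(list(self.values))


@dataclass
class GaussianKnob:
    mean: float
    stddev: float

    def sample(self, rng):
        return rng.gauss(self.mean, self.stddev)


@dataclass
class BooleanKnob:
    p_true: float = 0.5

    def sample(self, rng):
        return rng.random() < self.p_true


@dataclass
class PermutationKnob:
    items: Sequence[str]

    def sample(self, rng):
        order = list(self.items)
        rng.shuffle(order)          # one ordering of the items
        return order


def sample_joint(knobs, rng):
    """Draw one setting per knob; together they form one joint configuration."""
    return {name: knob.sample(rng) for name, knob in knobs.items()}


knobs = {
    "scancount": DiscreteKnob(range(1, 9)),
    "backoff_us": GaussianKnob(mean=5.0, stddev=1.0),
    "use_fc": BooleanKnob(p_true=0.5),
    "priority_order": PermutationKnob(["t1", "t2", "t3"]),
}
setting = sample_joint(knobs, random.Random(0))
```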



Major SDS Research Challenges

1. How do you find knob settings with best long-term effects?

2. How do you measure if a knob setting is helping?

3. How do you optimize quickly enough to not miss opportunities?

4. How do you manage a potentially intractable search space?

Challenges 1 and 2 are Quality Challenges; challenges 3 and 4 are Timeliness Challenges.


Addressing Quality Challenges

1. How do you find settings with best long-term effects?
  – Leverage one of the machine learning technologies for planning
  – Use online RL to adapt to workload or phase changes

2. How do you measure if a knob setting is helping?
  – Extensible reward signal interface for performance monitoring
  – Heartbeats Framework for application-specific perf. evaluations
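
One way an extensible reward signal interface could look, as a sketch only. `RewardSignal`, `ThroughputReward`, and `HeartbeatReward` are hypothetical names, and `HeartbeatReward` does not reproduce the actual Heartbeats Framework API; it only mimics the idea of the application reporting progress. Timestamps are passed in explicitly so the example is deterministic.

```python
from typing import Protocol


class RewardSignal(Protocol):
    def rate(self, now_s: float) -> float:
        """Reward per second since the previous reading."""
        ...


class ThroughputReward:
    """Internal signal: data structure operations completed per second."""

    def __init__(self, start_s=0.0):
        self.ops = 0
        self._last_ops = 0
        self._last_s = start_s

    def record_op(self, n=1):
        self.ops += n

    def rate(self, now_s):
        elapsed = now_s - self._last_s
        done = self.ops - self._last_ops
        self._last_ops, self._last_s = self.ops, now_s
        return done / elapsed if elapsed > 0 else 0.0


class HeartbeatReward:
    """External signal: the application emits one beat per completed work item."""

    def __init__(self, start_s=0.0):
        self._beats = []
        self._last_s = start_s

    def heartbeat(self, at_s):
        self._beats.append(at_s)

    def rate(self, now_s):
        elapsed = now_s - self._last_s
        beats = sum(1 for t in self._beats if self._last_s < t <= now_s)
        self._beats = [t for t in self._beats if t > now_s]
        self._last_s = now_s
        return beats / elapsed if elapsed > 0 else 0.0


throughput = ThroughputReward()
throughput.record_op(500)
ops_per_s = throughput.rate(now_s=0.5)

heart = HeartbeatReward()
for t in (0.1, 0.2, 0.3, 0.4):
    heart.heartbeat(t)
beats_per_s = heart.rate(now_s=2.0)
```

Because both classes expose the same `rate` method, a learning engine can consume either one without knowing where the measurement comes from.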



Addressing Timeliness Challenges

3. How to optimize fast enough not to miss opportunities?
  – Choose a fast gradient-based machine learning algorithm
  – Use learning helper thread to decouple learning from app threads

4. How to manage potentially intractable search space?
  – Relax potentially exponential discrete action space into continuous one
  – Use a stochastic soft-max policy which enables gradient-based learning
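
One standard way to write this relaxation, given as a sketch. The parameterization is an assumption for illustration and may differ from the library's: a single discrete knob with K settings gets one real-valued preference θ_k per setting, and the soft-max turns the preferences into a probability distribution that is differentiable in θ.

```latex
\pi_\theta(k) = \frac{e^{\theta_k}}{\sum_{j=1}^{K} e^{\theta_j}},
\qquad
\frac{\partial \log \pi_\theta(k)}{\partial \theta_i} = \mathbf{1}[i = k] - \pi_\theta(i)
```

If each of N knobs has its own soft-max, the learner adjusts K_1 + … + K_N continuous parameters instead of searching K_1 × … × K_N joint settings directly.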

[Cartoon: “Sorry I’m late dear…have you been waiting long?”]


Reinforcement Learning Algorithm

• Goal: optimize rate of reward (e.g. heart rate)
• Method: Policy Gradients Algorithm
  – Online, model-free, handles exponential knob spaces
  – Learn a stochastic policy which will give a probability distribution over knob settings for each knob
  – Sample settings for each knob from the policy, try them empirically, and listen to performance feedback signal
  – Improve the policy using a method analogous to gradient ascent, i.e. estimate gradient of the reward wrt policy and step policy in the gradient direction to get maximum reward

• Balance exploration vs. exploitation + make policy differentiable via stochastic soft-max policy
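
The sample, try, and improve loop can be sketched for a single discrete knob as follows. This is the textbook soft-max policy-gradient update with a running-average baseline, not necessarily the variant the library implements. The reward function `toy_reward`, its peak at scancount 4, the step size, and the step count are all made up for illustration.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)                                   # subtract max for stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(scancount, rng):
    """Hypothetical reward: best at scancount 4, with measurement noise."""
    return 10.0 - 2.0 * (scancount - 4) ** 2 + rng.gauss(0.0, 0.5)


def tune(settings, reward_fn, steps=5000, step_size=0.05, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(settings)    # policy parameters, one per knob setting
    baseline = 0.0                   # running average reward
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        # Sample a setting from the stochastic policy (exploration is built in).
        k = rng.choices(range(len(settings)), weights=probs)[0]
        # Try it and listen to the performance feedback.
        reward = reward_fn(settings[k], rng)
        advantage = reward - baseline
        baseline += (reward - baseline) / t
        # Gradient ascent step: d log pi(k) / d pref_i = 1[i == k] - probs[i].
        for i in range(len(prefs)):
            grad_log = (1.0 if i == k else 0.0) - probs[i]
            prefs[i] += step_size * advantage * grad_log
    return softmax(prefs)


settings = list(range(1, 9))
policy = tune(settings, toy_reward)
best = settings[max(range(len(settings)), key=lambda i: policy[i])]
```

Settings that earn more than the running average become more likely and the rest less likely, so the policy concentrates near the reward peak while still sampling alternatives occasionally.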



How Does SDS Perform?

• Full sweep over SDS, load: compare against Static Oracle
• Result: near-ideal performance in many cases
• Result: Quality Challenge is met

[Figure: throughput (ops/ms) vs. post computation (ns) for Smart Queue, Smart Skip List, and Smart Pair Heap, at 14 threads. Series: Static Oracle (Ideal Static), SDS Dynamic, Static Avg.]


What if Workload Changes Rapidly?

• Inject changes in the data structure “load” (i.e. post computation between ops)

• Sweep over SDS, random load schedules, frequencies
• Result: Good benefit even when load changes every 10μs
• Result: Quality and Timeliness Challenges are met

[Figure: throughput (ops/ms) vs. interval frequency (1/µs), from 1/10000 to 1/10, under load schedule 1 for Smart Queue, Smart Skip List, and Smart Pairing Heap, at 14 threads. Series: Dynamic Oracle, SDS Dynamic, Dynamic Average.]


Future Directions

• Extend this work to a common framework to coordinate tuning across all system layers
  – E.g.: application -> runtime -> OS -> HW
  – Scalable, decentralized optimization methods


Conclusions

• Developed a framework to dynamically tune systems and minimize programmer burden via online machine learning

• Demonstrated the framework through a case study of self-tuning “Smart Data Structures”

• Now looking at uses in systems beyond data structures

• Reinforcement Learning will play an increasingly important role in the development of future software and hardware

Contact: jonathan dot eastep at gmail


Presentation References

[1] J. Eastep, D. Wingate, M. D. Santambrogio, A. Agarwal, “Smartlocks: Lock Acquisition Scheduling for Self-Aware Synchronization,” 7th IEEE International Conference on Autonomic Computing (ICAC’10), 2010. Best Student Paper Award. http://groups.csail.mit.edu/carbon/wordpress/wp-content/uploads/2009/11/smartlocks_icac10_final.pdf

[2] J. Eastep, D. Wingate, M. D. Santambrogio, A. Agarwal, “Smartlocks: Self-Aware Synchronization through Lock Acquisition Scheduling,” MIT CSAIL Technical Report, MIT-CSAIL-TR-2009-055, November 2009. http://dspace.mit.edu/bitstream/handle/1721.1/49808/MIT-CSAIL-TR-2009-055.pdf?sequence=1

[3] J. Eastep, D. Wingate, A. Agarwal, “Smart Data Structures: A Reinforcement Learning Approach to Multicore Data Structures,” 8th IEEE International Conference on Autonomic Computing (ICAC’11), 2011. http://groups.csail.mit.edu/carbon/wordpress/wp-content/uploads/2011/03/eastep-smart-data-structures-icac11.pdf

[4] J. Eastep, “Smart Data Structures: An Online Machine Learning Approach to Multicore Data Structures,” Doctoral Dissertation, MIT, May 2011. http://dspace.mit.edu/bitstream/handle/1721.1/65967/751867152.pdf?sequence=1
