performance tuning in computer systems with machine …ey204/pubs/talks/2019_12_11_rais.pdfdeep...

Performance Tuning in Computer Systems with Machine Learning

Eiko Yoneki

http://www.cl.cam.ac.uk/~ey204

Systems Research Group, University of Cambridge Computer Laboratory

Alan Turing Institute

Tuning Computer Systems is Complex

Complex configuration parameter space / increasing # of parameters

Configurations need tuning to optimise resource utilisation

Cluster Workload Management

A poorly tuned system degrades performance when processing massive data

Compiler Optimisation

Complex and High Dimension Parameter Space

Device Allocation for Distributed Training

UBER

Parameter Space of Task Scheduler

Tuning a distributed SGD scheduler over TensorFlow: 10 heterogeneous machines with ~32 parameters, ~10^53 possible valid configurations

Objective function: minimise distributed SGD iteration time

Computer Systems Optimisation

What is performance?

Resource usage (e.g. time, power)

Computational properties (e.g. accuracy, fairness, latency)

How do we improve it?

Manual tuning

Runtime autotuning

Static time autotuning

Manual Tuning: Profiling

Always the first step

Simplest case: Poor man’s profiler

Debugger + Pause

Higher level tools

perf, VTune, gprof…

Distributed profiling: a difficult active research area

No clock synchronisation guarantee

Many resources to consider

System logs can be leveraged

Tune implementation based on profiling (never captures all interactions)

Static time Autotuning

Especially useful when:

There is a variety of environments (hardware, input distributions)

The parameter space is difficult to explore manually

Defining a parameter space

e.g. PetaBricks: A language and compiler for algorithmic choice (2009)

BNF-like language for parameter space

Uses an evolutionary algorithm for optimisation

Applied to Sort, matrix multiplication

Auto-tuning systems

Properties:

Many dimensions (30+)

Expensive objective function

Understanding of the underlying behaviour

Hardware

System

Application

Input data

Flags

Auto-tuning Complex Systems

Grid search θ ∈ [1, 2, 3, …]

Evolutionary approaches (e.g. )

Hill-climbing (e.g. )

Bayesian optimisation (e.g. )

1000s of evaluations of objective function

Computation more expensive

Fewer samples

Many dimensions

Expensive objective function

Hand-crafted solutions impractical (e.g. extensive offline analysis)
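To make the contrast between exhaustive grid search and local search concrete, here is a minimal hill-climbing sketch in C++. Everything in it is hypothetical: `toy_cost` is a cheap stand-in for an expensive benchmark run, and the names `TuneResult` and `hill_climb` are illustrative, not taken from any tuner mentioned in the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical objective: a cheap stand-in for an expensive benchmark run.
// Its minimum is at theta = (3, 7, 5).
double toy_cost(const std::vector<int>& theta) {
    assert(theta.size() == 3);
    const int target[3] = {3, 7, 5};
    double cost = 0.0;
    for (std::size_t i = 0; i < theta.size(); ++i) {
        double d = theta[i] - target[i];
        cost += d * d;
    }
    return cost;
}

struct TuneResult {
    std::vector<int> best;  // best configuration found
    double cost;            // objective value at `best`
    int evaluations;        // number of objective evaluations used
};

// Coordinate-wise hill climbing over integer parameters in [lo, hi]:
// try -1 and +1 on each parameter, move to the first improving neighbour,
// and stop when no neighbour improves (a local optimum).
TuneResult hill_climb(const std::function<double(const std::vector<int>&)>& f,
                      std::vector<int> start, int lo, int hi) {
    assert(lo <= hi);
    TuneResult r{start, f(start), 1};
    bool improved = true;
    while (improved) {
        improved = false;
        for (std::size_t i = 0; i < r.best.size() && !improved; ++i) {
            for (int step : {-1, +1}) {
                std::vector<int> cand = r.best;
                cand[i] += step;
                if (cand[i] < lo || cand[i] > hi) continue;
                double c = f(cand);
                ++r.evaluations;
                if (c < r.cost) {
                    r.best = cand;
                    r.cost = c;
                    improved = true;
                    break;
                }
            }
        }
    }
    return r;
}
```

Calling `hill_climb(toy_cost, {0, 0, 0}, 0, 10)` reaches the optimum of this convex toy objective with far fewer evaluations than the 11 x 11 x 11 = 1331 points of a full grid. On a real system objective with many local optima, hill climbing can stall, which is one motivation for model-based search.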

Blackbox Optimisation

can surpass human expert-level tuning

Deep Learning, Machine Learning, and AI…

e.g. CNN, LSTM

e.g. Logistic regression, Neural Networks, Bayesian, Reinforcement Learning..

Machine learning: a set of methods for creating models that describe or predict something about the world. It does so by learning those models from data.

Bayesian optimisation

[Figure: plot with axes labelled Domain and Objective]



① Find a promising point (parameter values with high performance value in the model)

② Evaluate the objective function at that point

③ Update the model to reflect this new measurement

Iteratively build a probabilistic model of the objective function



Pros:

✓ Data efficient: converges in few iterations

✓ Able to deal with noisy observations

Cons:

✗ In many dimensions, model does not converge to the objective function
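The three-step loop can be sketched in a few dozen lines of C++. This is only an illustration under stated assumptions: the surrogate is a one-dimensional Gaussian process with a squared-exponential kernel, zero prior mean and unit length scale, and the acquisition rule is a lower confidence bound. The talk names none of these, and all identifiers (`kernel`, `posterior`, `bayes_opt`, `Observation`) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Squared-exponential kernel with an assumed length scale of 1.0.
double kernel(double a, double b) {
    double d = a - b;
    return std::exp(-0.5 * d * d);
}

// Solve A x = b by Gaussian elimination with partial pivoting (A is small).
std::vector<double> solve(std::vector<std::vector<double>> A, std::vector<double> b) {
    const std::size_t n = b.size();
    for (std::size_t c = 0; c < n; ++c) {
        std::size_t p = c;
        for (std::size_t r = c + 1; r < n; ++r)
            if (std::fabs(A[r][c]) > std::fabs(A[p][c])) p = r;
        std::swap(A[c], A[p]);
        std::swap(b[c], b[p]);
        for (std::size_t r = c + 1; r < n; ++r) {
            double f = A[r][c] / A[c][c];
            for (std::size_t k = c; k < n; ++k) A[r][k] -= f * A[c][k];
            b[r] -= f * b[c];
        }
    }
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t k = i + 1; k < n; ++k) s -= A[i][k] * x[k];
        x[i] = s / A[i][i];
    }
    return x;
}

struct Observation { double x, y; };

// Gaussian-process posterior (mean, variance) at query point q.
std::pair<double, double> posterior(const std::vector<Observation>& obs, double q) {
    const std::size_t n = obs.size();
    std::vector<std::vector<double>> K(n, std::vector<double>(n));
    std::vector<double> y(n), kq(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j)
            K[i][j] = kernel(obs[i].x, obs[j].x) + (i == j ? 1e-6 : 0.0);  // jitter
        y[i] = obs[i].y;
        kq[i] = kernel(obs[i].x, q);
    }
    std::vector<double> alpha = solve(K, y);  // K^{-1} y
    std::vector<double> beta = solve(K, kq);  // K^{-1} k(q)
    double mean = 0.0, var = kernel(q, q);
    for (std::size_t i = 0; i < n; ++i) {
        mean += kq[i] * alpha[i];
        var -= kq[i] * beta[i];
    }
    return {mean, var > 0.0 ? var : 0.0};
}

// Minimise f over a fixed candidate list: two initial samples at the end
// points, then `iterations` model-guided evaluations.
std::vector<Observation> bayes_opt(const std::function<double(double)>& f,
                                   const std::vector<double>& candidates,
                                   int iterations) {
    assert(candidates.size() >= 2);
    std::vector<bool> used(candidates.size(), false);
    std::vector<Observation> obs;
    auto evaluate = [&](std::size_t i) {
        used[i] = true;
        obs.push_back({candidates[i], f(candidates[i])});
    };
    evaluate(0);
    evaluate(candidates.size() - 1);
    for (int it = 0; it < iterations; ++it) {
        // Step 1: pick the unevaluated candidate with the lowest
        // lower confidence bound (mean - 2 * stddev).
        std::size_t best = candidates.size();
        double best_score = 0.0;
        for (std::size_t i = 0; i < candidates.size(); ++i) {
            if (used[i]) continue;
            std::pair<double, double> mv = posterior(obs, candidates[i]);
            double score = mv.first - 2.0 * std::sqrt(mv.second);
            if (best == candidates.size() || score < best_score) {
                best = i;
                best_score = score;
            }
        }
        if (best == candidates.size()) break;  // every candidate evaluated
        // Steps 2 and 3: evaluate the objective and record the measurement;
        // the model is refitted from `obs` on the next posterior() call.
        evaluate(best);
    }
    return obs;
}
```

The sketch refits the model from scratch on every query, which is acceptable for a handful of samples but not for real tuners. It also shows why the approach struggles in many dimensions: the candidate set and the data needed for a useful surrogate both grow quickly with the number of parameters.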


Structured Bayesian Optimisation

Probabilistic model in probabilistic programming: user-given probabilistic model of the parameter space

Extend current Probabilistic C++ with various inference algorithms, multiple objectives and other language support (e.g. Python)

Probabilistic Model

Probabilistic models incorporate random variables and probability distributions into the model

Deterministic model gives a single possible outcome

Probabilistic model gives a probability distribution

Used for various probabilistic logic inference (e.g. MCMC-based inference, Bayesian inference…)

Python based PP:

Pyro: https://pyro.ai/examples

Edward: http://edwardlib.org

Performance Improvement from Structure

1. User-given probabilistic model structured as semi-parametric models using a directed acyclic graph (DAG)

2. Sub-Optimisation in numerical optimisation

Exploit structure to split problem into smaller optimisations

(enables nested optimisation)

Use decomposition mechanisms

Semi-parametric Model

Easy to use and well suited to SBO

Understand general trend of Objective function

High precision in region of Optimum for finding highest performance

Too restrictive

Too generic

Just right

Example:

Cassandra's garbage collection

Minimise 99th percentile latency of Cassandra

Cassandra

JVM

Garbage collection flags:

● Young generation size

● Survivor ratio

● Max tenuring threshold

Define DAG Model

Define a directed acyclic graph (DAG) of models

[Figure: DAG of models. GC flags feed the GC rate model and the GC average duration model; GC rate and average GC duration feed the latency model, which outputs the 99th percentile latency]

Tune JVM parameters of a database (Cassandra) to minimise latency

DAG model in BOAT

struct CassandraModel : public DAGModel<CassandraModel> {
  void model(int ygs, int sr, int mtt) {
    // Calculate the size of the heap regions
    double es = ygs * sr / (sr + 2.0);  // Eden space's size
    double ss = ygs / (sr + 2.0);       // Survivor space's size

    // Define the dataflow between semi-parametric models
    double rate = output("rate", rate_model, es);
    double duration = output("duration", duration_model, es, ss, mtt);
    double latency = output("latency", latency_model, rate, duration, es, ss, mtt);
  }

  ProbEngine<GCRateModel> rate_model;
  ProbEngine<GCDurationModel> duration_model;
  ProbEngine<LatencyModel> latency_model;
};
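The heap-region arithmetic at the top of `model()` can be checked in isolation. The sketch below is standalone C++ and does not use BOAT. It assumes the usual JVM layout in which the young generation holds one Eden space and two survivor spaces, and the survivor ratio is Eden size divided by the size of one survivor space. The names `HeapRegions` and `split_young_generation` are hypothetical.

```cpp
#include <cassert>

struct HeapRegions {
    double eden;      // Eden space size
    double survivor;  // size of each of the two survivor spaces
};

// Split a young generation of size `ygs` using survivor ratio `sr`,
// where sr = eden / survivor and ygs = eden + 2 * survivor.
// The result is in the same unit as `ygs` (for example MB).
HeapRegions split_young_generation(int ygs, int sr) {
    assert(ygs > 0 && sr > 0);
    HeapRegions r;
    r.eden = ygs * sr / (sr + 2.0);
    r.survivor = ygs / (sr + 2.0);
    return r;
}
```

For example, a young generation of 1000 with a survivor ratio of 8 gives an Eden space of 800 and two survivor spaces of 100 each. Deriving these sizes first lets each sub-model take the physical quantities it depends on rather than the raw flags.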

GC Rate Semi-parametric model

Evaluation: Garbage collection

Evaluation: Neural networks (SGD) scheduling

[Figure: communication model and machine models; machine times tm1, tm2, tm3, tm4 combined by max; predicted time]

Load balancing, worker allocation over 10 machines = 30 parameters

Use TensorFlow

Evaluation: Neural networks scheduling

Default configuration: 9.82s

OpenTuner: 8.71s

BOAT: 4.31s

Existing systems don’t converge!

Case Studies

Task Scheduling in Cluster Computing

JVM Garbage Collector

Neural Network Hyper-parameter tuning

LLVM Compiler

ASICs/SoC Design

Limitations of Bayesian Optimisation

Not efficient for modelling dynamic and/or combinatorial problems

LLVM compiler pass list optimisation (BaysOpt vs Random Search)

[Figure: run time (s) against iteration]

Computer Systems Optimisation Models

Long-term planning: requires a model of how actions affect future states.

Only a few system optimisations fall into this category, e.g. network routing optimisation.

Short-term dynamic control: major system components are under dynamic load, such as resource allocation and stream processing, where the future load is not statistically dependent on the current load. BaysOpt is sufficient to optimise distinct workloads. For dynamic workloads, Reinforcement Learning would perform better.

Combinatorial optimisation: a set of options to be selected from a larger set under potential rules of combination. There is no straightforward similarity between different combinations. Many problems in device assignment, indexing, and compiler optimisation fall into this category. BaysOpt cannot be easily applied. Options: learn online via random sampling if the task is cheap, use RL with pre-training if the task is expensive, or use massively parallel online training if the resources are available.

Many systems problems are combinatorial in nature

Deep Reinforcement Learning for Optimisation

Deep RL provides an attractive framework for differentiable control

Blackbox optimisation for dynamic/combinatorial problems

Trained model can continuously make decisions on new instances

Problems:

Difficult task: make right decision in large discrete action spaces

Exploration in production system not a good idea: unstable/unpredictable

Simulations can oversimplify the problem and are expensive to build

Long online training to build a model…

Many deep learning tools, no standard library for modern RL (~2014-2018)

Some standard flavours emerge but mostly tightly coupled logic/execution

e.g. TensorForce/RLgraph: 20-30K downloads

A brief history of Deep RL software

1. Gen (2014-16): Loose research scripts (e.g. DQN), high expertise required, only specific simulators

2. Gen (2016-17): OpenAI gym gives unified task interface, reference implementations (e.g. OpenAI baselines)

3. Gen (2017-18): Generic declarative APIs, distributed abstractions (Ray RLlib), some standard flavours emerge

Problems: Tightly coupled execution/logic, testing, reuse,..

Problem: Controlling dynamic behaviour

Reinforcement Learning

Agent interacts with a dynamic environment

Goal: maximise expected reward over the agent’s lifetime

Notion of Planning/Control, not single static configuration
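A minimal tabular Q-learning sketch in C++ illustrates the agent and environment loop and the goal of maximising expected (discounted) reward. The chain environment, the struct `Chain`, the function `q_learning`, and all hyper-parameter values are hypothetical choices for illustration. Deep RL systems replace the table with a neural network.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical toy environment: a chain of states 0..n-1. Action 0 moves
// left, action 1 moves right. Reaching the last state yields reward 1 and
// ends the episode; every other step yields reward 0.
struct Chain {
    int n;
    int step(int state, int action, double& reward, bool& done) const {
        int next = action == 1 ? state + 1 : (state > 0 ? state - 1 : 0);
        done = (next == n - 1);
        reward = done ? 1.0 : 0.0;
        return next;
    }
};

// Tabular Q-learning with a uniformly random behaviour policy.
// Q-learning is off-policy, so it still learns values for the greedy policy.
std::vector<std::vector<double>> q_learning(const Chain& env, int episodes,
                                            double alpha, double gamma,
                                            unsigned seed) {
    assert(env.n >= 2);
    std::vector<std::vector<double>> q(env.n, std::vector<double>(2, 0.0));
    std::mt19937 rng(seed);
    for (int e = 0; e < episodes; ++e) {
        int s = 0;
        bool done = false;
        while (!done) {
            int a = static_cast<int>(rng() & 1u);  // explore: random action
            double r = 0.0;
            int s2 = env.step(s, a, r, done);
            // The reward arrives only at the goal, so earlier states learn
            // their value through the bootstrapped max over next actions.
            double target = r;
            if (!done) target += gamma * (q[s2][0] > q[s2][1] ? q[s2][0] : q[s2][1]);
            q[s][a] += alpha * (target - q[s][a]);
            s = s2;
        }
    }
    return q;
}
```

After training, for example with `q_learning(Chain{5}, 500, 0.5, 0.9, 42u)`, the greedy action in every non-terminal state is "right". The result is a policy over states rather than a single static configuration.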

What makes RL different from other ML paradigms?

There is no supervisor, only a reward signal

Feedback is delayed, not instantaneous

Time really matters (sequential)

Agent’s actions affect the subsequent data it receives

The paradigm most similar to how the human brain behaves…

Where are the applications?

RL Workloads

Unlike supervised learning, not a single dominant execution pattern

Distributed workloads: Hierarchies of sync/async data exchange

Algorithms highly sensitive to hyper-parameters

From large scale parallel training (e.g. AlphaGo) to single core

RL in Computer Systems: Practical Considerations

Action spaces do not scale:

Systems problems often combinatorial

Exploration in production system not a good idea

Unstable, unpredictable

Simulations can oversimplify problem

Expensive to build, not justified versus gain

Online steps take too long

Deep Reinforcement Learning for Optimisation

New programming model: Separation of logical dataflow from execution

(no standardised interface)

Automated graph generation/transformation

RLgraph: Modular Dataflow Composition

RLgraph: Separate Local and Distributed Execution

High-performance computation graphs for RL with different distributed backends

Evaluation: Distributed training

Evaluation: Distributed TensorFlow (DM 3D task)

Performance (Atari Pong) – APEX DQN based

Left: distributed sample performance. Right: time to solve Pong (score ~21)

LIFT: Learning from Traces

Idea:

Task may be hard to scale, human can give examples

Ground model with demonstrations

Difficulty: Combining imperfect examples and experience

Results (IMDB data set)

Query latencies: mean (left), 99th percentile (right)

Learn from demonstration and pre-training: reducing online training time

Optimising DNN Computation with Graph Substitutions

TASO (SOSP, 2019): Performance improvement by transformation of computation graphs

In progress: use of Reinforcement Learning

Case Studies

Packet Classification with RL

Match a network packet to a rule from a set of rules

Objective: minimise the classification time and memory footprint

Deep RL solution to build decision trees
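For reference, the classification task itself can be stated as a linear scan over the rule set. The C++ sketch below is a baseline, not the deep RL solution: its classification time grows with the number of rules, which is the cost that a learned decision tree aims to cut. The two-field `Rule` layout and all names are hypothetical simplifications.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Packet {
    std::uint32_t src_ip;
    std::uint16_t dst_port;
};

// Hypothetical two-field rule: inclusive ranges over source address and
// destination port. Real classifiers typically match more header fields.
struct Rule {
    std::uint32_t src_lo, src_hi;
    std::uint16_t port_lo, port_hi;
    int action;  // opaque identifier of the action to apply
};

// Returns the index of the first (highest-priority) matching rule, or -1
// if no rule matches. Cost is linear in the number of rules.
int classify(const std::vector<Rule>& rules, const Packet& p) {
    for (std::size_t i = 0; i < rules.size(); ++i) {
        const Rule& r = rules[i];
        assert(r.src_lo <= r.src_hi && r.port_lo <= r.port_hi);
        if (p.src_ip >= r.src_lo && p.src_ip <= r.src_hi &&
            p.dst_port >= r.port_lo && p.dst_port <= r.port_hi)
            return static_cast<int>(i);
    }
    return -1;
}
```

A decision tree over the same rules partitions the field ranges so that each packet is compared against only a few rules at a leaf, trading memory footprint for classification time.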

DB compound indexing

Stream Processing

Cluster Scheduling

Traffic Signal Control

PARK: RL Opensource Platform

AutoML: Neural Architecture Search

Current: ML expertise + Data + Computation

AutoML aims to turn this into: Data + 100 x Computation

Use of Reinforcement Learning, Evolutionary Algorithms

..and tune network model?

Graph transformation

Compression

+ Hyper parameter tuning

Tuning Complex Computer Systems

BOAT: Building Auto-Tuners with Structured Bayesian Optimization. WWW 2017. (Morning Paper, 2017.5.18) https://github.com/VDalibard/BOAT

RLgraph: Modular Computation Graphs for Deep Reinforcement Learning. SysML 2019. (https://arxiv.org/abs/1810.09028) https://github.com/rlgraph/rlgraph

LIFT: Reinforcement Learning in Computer Systems by Learning From Demonstrations. (https://arxiv.org/abs/1808.07903)

Wield: Systematic Reinforcement Learning with Progressive Randomization. 2019. (https://arxiv.org/abs/1909.06844)
