for content delivery network caching learning relaxed...

33
Learning Relaxed Belady for Content Delivery Network Caching Zhenyu Song, Daniel S. Berger, Kai Li, Wyatt Lloyd NSDI 2020

Upload: others

Post on 28-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Learning Relaxed Beladyfor Content Delivery Network Caching

Zhenyu Song, Daniel S. Berger, Kai Li, Wyatt Lloyd

NSDI 2020

Page 2: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Edge cache

CDN Caching Goal: Minimize WAN Traffic

Requests

MissHit

User

Requests

Wide Area Network (WAN)traffic is expensive

miss bytes total bytesKey metric byte miss ratio:

2

Page 3: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Caching Remains ChallengingHeuristic-based algorithms (1965–): LRU, LRUK, GDSF, ARC, ...● Work well for some workloads, but work poorly for otherML-based adaptation of heuristics (2017–): UCB, LeCAR, ...● Also work well for some workloads, but poorly for othersBelady’s MIN algorithm (1966)● Oracle: requires future knowledge● Large gap in byte miss ratio between state-of-the-art and Belady:

3

● 20–40% on production traces

Page 4: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Introducing Learning Relaxed Belady (LRB)

New approach: mimic Belady using machine learning

4

Page 5: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

General Overview of our Approach

R R R R R R·········

Now

Cache

R

Past information

ML architecture

Training data

Eviction candidates

5

Page 6: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 1: Past Information

R R R R R R·········

Now

Cache

RWhat past information to use?

Past information

ML architecture

Training data

Eviction candidates

6

More data improves training but increases mem overhead

Page 7: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 2: Generate Online Training Data

R R R R R R·········

Now

Cache

RWhat past information to use?

Generate online training data?

Past information

ML architecture

Training data

Eviction candidates

7

Page 8: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 3: ML Architecture

R R R R R R·········

Now

Cache

RWhat past information to use?

Generate online training data?

What ML architecture to select?

Past information

ML architecture

Training data

Eviction candidates

8

Large design space: features, model, prediction target, loss function

Page 9: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 4: Eviction Candidates

R R R R R R·········

Now

Cache

R

Past information

ML architecture

Training data

Eviction candidates

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

9

Page 10: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 5: Quickly Evaluate Design Decisions

R R R R R R·········

Now

Cache

R

End-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

Past information

ML architecture

Training data

Eviction candidates

10

Page 11: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Solutions: Relaxed Belady Algorithm & Good Decision Ratio

End-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?Relaxed Belady algorithm

Good decision ratio

11

Page 12: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Solutions: Relaxed Belady Algorithm & Good Decision Ratio

Relaxed Belady algorithm

Good decision ratio: minsEnd-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

12

Page 13: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Solutions: Relaxed Belady Algorithm & Good Decision Ratio

End-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?Relaxed Belady algorithm

Good decision ratio: mins

13

Page 14: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge: Hard to Mimic Belady (Oracle) Algorithm

Mimicking exact Belady is impractical ● Need predictions for all objects → prohibitive computational cost● Need exact prediction of next access → further prediction are harder

Belady: evict object with next access farthest in the future

14

Cache (now)

A······

B

C D

Time to next request

D B A C······ ······

Evict

Page 15: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Introducing the Relaxed Belady Algorithm

Observation: many objects are good candidates for eviction

Relaxed Belady evicts an objects beyond boundary

15

Cache (now)

A······

B

C D

Time to next request

D B A C······ ······

EvictBelady boundary

● Do not need predictions for all objects → reasonable computation● No need to differentiate beyond boundary → simplifies the prediction

Page 16: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Good Decision Ratio: Directly Measures Eviction Decisions

Insight: relaxed Belady enables evaluating eviction decisions

Good decision ratio:

Bad eviction decision evicted obj’s next access < boundary

# good eviction decisions # total eviction decisions

Good eviction decision evicted obj’s next access > boundary

16

Belady boundary

Time to next request

Page 17: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 5: Quickly Evaluate Design Decisions

R R R R R R·········

Now

Cache

R

End-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

Past information

ML architecture

Training data

Eviction candidates

17

Page 18: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Evaluate Design Decisions w/o Simulation

ML architecture

Training data

Eviction candidates

Log

18

Simulate once,reuse for many designs

Evaluate designs on log using good decision ratio in minutes

Page 19: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 1: Past Information

R R R R R R·········

ML architecture

Training data

Eviction candidates

More data improves training but increases mem overhead

Past informationWhat past information to use? Now

Cache

R

19

Page 20: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Track Objects within a Sliding Memory Window

Per object features

R R R R R R·········

Now

R

Sliding memory window mimics Belady boundary

Only track objects within memory window

20

Window size is LRB’s main hyperparameter

Page 21: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 2: Training Data

R R R R R R·········

Now

Cache

RWhat past information to use?

Generate online training data?

Past information

ML architecture

Training data

Eviction candidates

21

Page 22: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Sample Training Data & Label on Access or Boundary

Per object features

R R R R R R·········

Now

R

Sliding memory window

Sample

Unlabeled training data

Past memory windowAccess

Labeled training data

22

Page 23: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 3: ML Architecture

Large potential design space

R R R R R R·········

Now

Cache

RWhat past information to use?

Generate online training data?

What ML architecture to select?

Past information

ML architecture

Training data

Eviction candidates

23

Page 24: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Solution 3: Feature & Model Selection

Gradient boosting decision trees

Lightweight & high good decision ratio

Training ~300 ms, prediction ~30 us

Features

Object size

Object type

Inter-request distances(recency)

Exponential decay counters (long-term frequencies)

Use good decision ratio to evaluate new designs

24

Page 25: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Challenge 4: Eviction Candidates

R R R R R R·········

Now

Cache

R

Past information

ML architecture

Training data

Eviction candidates

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

25

Page 26: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Solution 4: Random Sampling for Eviction

Can mimic relaxed Belady if we canfind 1 object beyond the boundary

k=64 candidates; more does not improve good decision ratio

R R R R R R·········

Now

Cache

R

Past information

Random k candidates

26

Page 27: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Label

Labeled dataset

Sample

Unlabeled dataset

Learning Relaxed Belady

27

Now

Cache

RR R R R R R······

Memory window

RR

Train

Model Eviction Candidates

···

Sample

Predict

Evict

Page 28: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

● Simulator implementation○ LRB + 14 other algorithms

● Prototype implementation○ C++ on top of production system (Apache Traffic Server)○ Many optimizations

Implementation

28

Page 29: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

● Q1: Learning Relaxed Belady (LRB) traffic reduction vs state-of-the-art

● Q2: overhead of LRB vs CDN production system

● Traces: 6 production traces from 3 CDNs

● Hyperparameter (memory window/model/...) tuned on 20% of trace

Evaluation Setup

29

Page 30: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

LRB Reduces WAN Traffic20% traffic reduction over B-LRU10% reduction over the best SOA

Wikipedia trace

Industry standard

30

Page 31: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

CDN-B1 CDN-B3CDN-B2

LRB Consistently Improves on the State of the Art

Wikipedia CDN-A1 CDN-A2

31

Page 32: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

LRB Overhead Is Modest

Throughput: 11.7 Gbps vs 11.7 Gbps (unmodified)

Memory overhead=1‒3% cache size

Peak CPU: 16% vs 9% (unmodified)

32

Edge cache

Requests

User

Page 33: for Content Delivery Network Caching Learning Relaxed Beladywlloyd/papers/lrb-nsdi20-talk-public.pdf · Challenge 2: Training Data R R R R R R ··· ······ Now Cache R What

Conclusion● LRB reduces WAN traffic with modest overhead

● Key insight: relaxed Belady

→ Simplifies machine learning & reduces system overhead

→ Good decision ratio enables fast design evaluation & design iteration

Code & Wikipedia trace: https://github.com/sunnyszy/lrb

33