TRANSCRIPT
Towards continual learning for networking algorithms
Keith Winstein (with Francis Yan, Hudson Ayers, Chenzhi Zhu, Sadjad Fouladi, James Hong, Keyi Zhang, and Philip Levis)
Assistant Professor of Computer Science
Assistant Professor of Electrical Engineering (by courtesy)
Stanford University
High level takeaways
I Networking presents unique challenges for machine learning.
I We don’t know how to emulate the Internet very accurately.
  I Training algorithms in emulation: disappointing real-world results.
  I Evaluating algorithms in emulation: not predictive of real-world results.
  I Running in real life: things change over time, space, etc.
I Continual learning in situ mitigates some problems.
  I But hard to arrange. . .
  I . . . only workable for some kinds of algorithms
  I . . . and means a substantial shift in how we work.
Background
I Networking algorithms have many tunable parameters.
I Natural setting for machine learning, but:
Networked systems present unique challenges for machine learning.
Example: Sprout (NSDI 2013) in publication
Sprout, NSDI 2013 (figure 7)
Sprout in real life in America
Stanford Pantheon result (July 31, 2018, T-Mobile in California), https://pantheon.stanford.edu/result/3455/
Sprout in real life in India
Stanford Pantheon result (August 1, 2018, Airtel in New Delhi), https://pantheon.stanford.edu/result/3474/
Example: Remy in simulation (SIGCOMM 2013)
Remy in emulation
RemyCC
Stanford Pantheon result (March 8, 2019, “Bottleneck buffer = BDP/3”), https://pantheon.stanford.edu/result/5809/
Remy in real life
RemyCC
Stanford Pantheon result (March 9, 2019, HostDime India to AWS India), https://pantheon.stanford.edu/result/5822/
Remy in real life (2)
RemyCC
Stanford Pantheon result (March 9, 2019, GCE Sydney to GCE Tokyo), https://pantheon.stanford.edu/result/5840/
Example: Vivace (NSDI 2018) in publication
Vivace, NSDI 2018 (figure 7)
Vivace in real life
Stanford Pantheon result (March 9, 2019, AWS Brazil to HostDime Colombia), https://pantheon.stanford.edu/result/5814/ (page 204 of PDF)
Example: Pensieve (SIGCOMM 2017) in publication
Pensieve, SIGCOMM 2017 (figure 11)
Pensieve in reproduction
Stanford CS244 student project, https://reproducingnetworkresearch.files.wordpress.com/2018/07/recreating pensieve.pdf
Example: BBR (ACM Queue 2016)
“BBR converges toward a fair share of the bottleneck bandwidth whether competing with other BBR flows or with loss-based congestion control. [...] Unmanaged router buffers exceeding several BDPs, however, cause long-lived loss-based competitors to bloat the queue and grab more than their fair share.”
Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, Van Jacobson, BBR: Congestion-Based Congestion Control, ACM Queue (December 1, 2016), https://queue.acm.org/detail.cfm?id=3022184
Academics pile on BBR
Ranysha Ware, Matthew K. Mukerjee, Justine Sherry, and Srinivasan Seshan, The Battle for Bandwidth: Fairness and Heterogeneous Congestion Control, NSDI 2018 poster (April 2018)
“Among the evaluated protocols, BBR yields the worst TCP friendliness. . . . as we add more CUBIC flows, BBR becomes increasingly aggressive: it effectively treats all competing CUBIC flows as a single “bundle” . . . . Therefore, in practice, BBR can be very unfriendly when there are multiple competing CUBIC flows.”
Mo Dong, Tong Meng, Doron Zarchy, Engin Arslan, Yossi Gilad, Brighten Godfrey, Michael Schapira, PCC Vivace: Online-Learning Congestion Control, NSDI 2018 (April 2018), https://www.usenix.org/system/files/conference/nsdi18/nsdi18-dong.pdf
Operators pile on BBR
Cubic vs BBR over a 12ms RTT 10G circuit
Geoff Huston, TCP and BBR, RIPE 76 (May 2018) https://ripe76.ripe.net/presentations/10-2018-05-15-bbr.pdf
BBR in BBR2 presentation
Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, Ian Swett, Jana Iyengar, Victor Vasiliev, Priyaranjan Jha, Yousuk Seung, Kevin Yang, Matt Mathis, Van Jacobson, BBR Congestion Control Work at Google: IETF 102 Update (July 2018), https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00
Example: Google Flu Trends
Proposal (2008): train a model to predict flu incidence from historical search engine queries. Then deploy the model to predict flu in advance of the government.
Nov. 11, 2008 announcement
Keith Winstein [email protected] CSAIL
The 2012–2013 Divergence of Google Flu Trends
Nov. 19, 2008: Publication in Nature
Accuracy figures
Index based on 45 queries (e.g. “pnumonia”).
I Training data (2003–2007): 0.80 ≤ r ≤ 0.96 (mean 0.90)
I Verification (2007–2008): 0.92 ≤ r ≤ 0.99 (mean 0.97)
“We intend to update our model each year with the latest sentinel provider ILI data, obtaining a better fit and adjusting as online health-seeking behaviour evolves over time.”
Performance in the first year
[Plot: outpatient visits for influenza-like illness (CDC), 2009–2010, y-axis 0–5%]
Aug. 19, 2011: PLoS ONE paper
I Training data (2003–2007): Mean correlation 0.90
I Verification (2007–2008): Mean correlation 0.97
I Actual (March–August 2009): Mean correlation 0.29!
Model retrained in September 2009, now 160 queries. “We will continue to perform annual updates of Flu Trends models to account for additional changes in behavior, should they occur.”
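The drop from r ≈ 0.97 on verification data to r ≈ 0.29 in actual use is the signature of dataset shift with a fixed model. A toy sketch of the correlation metric the slides use (all series below are synthetic, purely for illustration — not GFT data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic illustration: a model fit to seasonal flu tracks it well in the
# years it was trained on, then diverges when behavior shifts (e.g. an
# off-season pandemic wave changes what people search for).
actual_2007 = [1.0, 1.5, 2.5, 3.0, 2.0, 1.2]
model_2007  = [1.1, 1.4, 2.6, 2.9, 2.1, 1.1]   # tracks closely: r near 1
actual_2009 = [1.0, 2.0, 4.5, 5.0, 4.0, 2.5]   # off-season wave
model_2009  = [1.1, 1.4, 1.2, 1.0, 1.3, 1.1]   # model expects a quiet summer

print(pearson_r(actual_2007, model_2007))  # high
print(pearson_r(actual_2009, model_2009))  # low or negative
```

The point is not the particular numbers but that r computed in-sample (or on a held-out year with the same behavior) says little about r after behavior changes.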
Google Flu Trends plot as of today
(http://www.google.org/flutrends/about/how.html)
Most of plot is training data
(http://www.google.org/flutrends/about/how.html)
Large divergence (3.7×) in New England (HHS region 1)
[Plot: outpatient visits for influenza-like illness (CDC), 2010–2013, y-axis 0–14%]
Substantial divergence (+72%) in France
[Plot: outpatient visits for influenza-like illness (Sentinelles), 2010–2013, y-axis 0–1.2%]
Example: predicting unemployment claims from Twitter (9/2015)
Twitter, Big Data, and Job Numbers
People turn to Twitter to see what’s trending in the news. But LSA Economics Professor Matthew Shapiro has found a new way to harvest employment information from tweets and hashtags faster and more accurately than the government’s official reports.
LSA Economics Professor Matthew Shapiro is one of the rare people who can claim to be getting work done when looking at social media. He’s not following sports scores or breaking news; he’s tracking the nation’s labor market.
“When we started,” explains Shapiro, “we had no idea if we could track job loss with tweets, but over a two-year period, we’ve seen the social media index perform quite well. In 2011–2012, for example, our index leveled off just like the official numbers. We captured big job-loss fluctuations around Hurricane Sandy, and around the government shutdown in October 2013.”
There were times when Shapiro’s numbers matched the reports, and there were times when they didn’t. When they differed, Shapiro’s numbers were more accurate than the government’s. When the state of California got new computers, for example, there were delays in processing unemployment claims. Government data reflected the slowdown in processing applications, but social media captured a more accurate picture of what was happening in the labor market.
“Our series was stable,” says Shapiro, the Lawrence R. Klein Collegiate Professor of Economics, “so our numbers were, in some ways, a better indicator.”
By the Numbers
Many important economic indicators, including the Bureau of Labor Statistics unemployment rate and the University of Michigan Consumer Sentiment Index, are collected based on surveys. But it’s hard and costly to get enough people to answer a survey in order to get a representative sample. It’s hard to verify whether the people responding to the survey are speaking for just [...]
by Susan Hutton
Predicting unemployment claims from Twitter (at time of publication)
[Plot: weekly Initial Claims for Unemployment Insurance (thousands, y-axis 240–440) with the Twitter-based prediction overlaid, July 2011–July 2017]
Predicting unemployment claims from Twitter (post-publication)
[Plot: the same series extended past publication — Initial Claims for Unemployment Insurance (thousands) vs. the Twitter-based prediction, July 2011–July 2017]
Example: observational study of Bing searches
By John Markoff
June 7, 2016
Microsoft scientists have demonstrated that by analyzing large samples of search engine queries they may in some cases be able to identify internet users who are suffering from pancreatic cancer, even before they have received a diagnosis of the disease.
The scientists said they hoped their work could lead to early detection of cancer. Their study was published on Tuesday in The Journal of Oncology Practice [...]
The researchers focused on searches conducted on Bing, Microsoft’s search engine, that indicated someone had been diagnosed with pancreatic cancer. From there, they worked backward, looking for earlier queries that could have shown that the Bing user was experiencing symptoms before the diagnosis. Those early searches, they believe, can be warning flags.
While five-year survival rates for pancreatic cancer are extremely low, early detection of the disease can prolong life in a very small percentage of cases. The study suggests that early screening can increase the five-year survival rate of pancreatic patients to 5 to 7 percent, from just 3 percent.
The researchers reported that they could identify from 5 to 15 percent of pancreatic cases with false positive rates of as low as one in 100,000. The researchers noted that false positives could lead to raised medical costs or create significant anxiety for people who later found out they were not sick.
The data used by the researchers was anonymized, meaning it did not carry identifying markers like a user name, so the individuals conducting the searches could not be contacted.
Microsoft Finds Cancer Clues in Search Queries
Example: spam filtering
SpamAssassin (spam filtering engine):
I Anybody can propose a spam-filtering algorithm.
I Central party learns best weights based on predictive power of each algorithm.
I Weights are then deployed in the field.
Deployment: spam filtering
I 2007: a rule is added
I Rule: “Does the year match 200x?”
I Catches a lot of spam.
I Zero false positives in training! . . . but what happens in 2010?
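This is recognizable as SpamAssassin’s FH_DATE_PAST_20XX incident: a rule scoring any Date header year in 2010–2099 as “grossly in the future.” A minimal sketch of the failure mode (simplified to a single rule; real SpamAssassin combines many weighted tests):

```python
import re

# Sketch of the FH_DATE_PAST_20XX pattern: a Date-header year of
# 2010-2099 is treated as "grossly in the future." In 2007, every
# legitimate message is dated 200x, so the rule has zero false positives
# in training while catching spam with bogus far-future dates.
FUTURE_YEAR = re.compile(r"20[1-9][0-9]")

def looks_like_spam(date_header: str) -> bool:
    return bool(FUTURE_YEAR.search(date_header))

# 2007: works exactly as intended.
assert not looks_like_spam("Date: Tue, 3 Jul 2007 10:00:00 -0700")  # ham
assert looks_like_spam("Date: Thu, 1 Jan 2099 00:00:00 +0000")      # spam

# January 1, 2010: every legitimate message now trips the rule.
assert looks_like_spam("Date: Fri, 8 Jan 2010 09:00:00 -0700")
```

Nothing about the training data could have flagged this: the rule was perfectly predictive until the world (the calendar) shifted under it.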
Yesterday. . .
Observation
Many apparent ML advances in networking decay with time, or with the move from simulator/emulator to deployment.
Is the Internet a uniquely challenging setting for ML?
Unique challenges for ML in networking
I Model mismatch: training environment (simulator, emulator,testbed) does not match deployment environment (Internet).
I Dataset shift: change in network/userbase/workload over time
I Each participant has only partial information
I Failure may not be evident to any individual
I Greedy behavior may be locally rewarded, but globally bad
I Adversaries, heavy tails, . . .
High level takeaways
I Networking presents unique challenges for machine learning.
I We don’t know how to emulate the Internet very accurately.
  I Training algorithms in emulation: disappointing real-world results.
  I Evaluating algorithms in emulation: not predictive of real-world results.
  I Running in real life: things change over time, space, etc.
I Continual learning in situ mitigates some problems.
  I But hard to arrange. . .
  I . . . only workable for some kinds of algorithms
  I . . . and means a substantial shift in how we work.
We don’t know how to emulate the Internet very accurately
Francis Yan et al., Pantheon: the training ground for Internet congestion-control research, USENIX ATC 2018, figure 3(b), https://pantheon.stanford.edu/static/pantheon/documents/pantheon-paper.pdf
Possible solution: continual learning in place
I “Evaluate-first, idea-second” approach: deploy something and start collecting real metrics on real traffic
I Retrain frequently
I Compare against held-back “sane baseline”
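The three bullets above can be sketched as a small harness (all names here are hypothetical; the real Puffer infrastructure is far more involved):

```python
import random

class ContinualDeployment:
    """Sketch of "evaluate-first": randomize sessions across candidate
    algorithms plus a held-back baseline, log real metrics per arm, and
    retrain the candidates periodically on what was collected."""

    def __init__(self, candidates, baseline, seed=0):
        self.arms = {**candidates, "baseline": baseline}
        self.metrics = {name: [] for name in self.arms}
        self.rng = random.Random(seed)

    def assign(self, session_id):
        # Uniform randomization over arms; a real system might stratify
        # by client network, ISP, time of day, etc.
        return self.rng.choice(sorted(self.arms))

    def log(self, arm, quality_db, stalled_frac):
        self.metrics[arm].append((quality_db, stalled_frac))

    def retrain(self):
        # The baseline is deliberately never retrained: it is the sane
        # yardstick the learned arms are compared against.
        for name, algo in self.arms.items():
            if name != "baseline" and hasattr(algo, "fit"):
                algo.fit(self.metrics[name])

class DummyABR:
    def fit(self, data):
        self.trained_on = len(data)

deploy = ContinualDeployment({"learned_abr": DummyABR()}, baseline=object())
for sid in range(100):
    arm = deploy.assign(sid)
    deploy.log(arm, quality_db=16.0, stalled_frac=0.01)
deploy.retrain()
```

The held-back baseline is what makes the retraining loop safe to run indefinitely: any regression in the learned arms shows up as a gap against an arm that never changes.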
Puffer: continual learning for Internet video streaming
I Live TV-streaming website (https://puffer.stanford.edu)
I Approved by Stanford lawyers, opened to public December 2018
I Streaming about 1,000 hours of video per day
I Randomizes sessions to different algorithms
I Goal: realistic testbed and learning environment for research in:
  I congestion control
  I throughput prediction, and
  I adaptive bitrate (ABR)
Algorithms that affect video streaming
I Congestion control: when to send each packet
I Throughput prediction: how fast can server send in near future?
I Adaptive bitrate (ABR): what version of each upcoming “chunk” to send
Demo
Google ad for “tv streaming”
Reddit ad
Reddit (continued)
Puffer algorithm
I Continual learning algorithm for video streaming
I Model-based RL: combines classical control and neural network
I Neural network predicts “how long would each chunk take?”
  I given internal TCP statistics (weakly cross-layer)
  I given size of proposed chunk
  I producing probability distribution (vs. point estimate)
I Classical controller: probabilistic model-predictive control
  I Maximize SSIM
  I Minimize stalls
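A stripped-down version of the control step might look like this (all numbers and the penalty weight are made up, and the horizon is one chunk; the real controller plans several chunks ahead):

```python
def expected_stall(buffer_s, txtime_dist):
    """Expected stall (seconds) if the chunk's transmission time -- given
    as a discrete distribution of (seconds, probability) pairs -- exceeds
    the client's current playback buffer."""
    return sum(p * max(0.0, t - buffer_s) for t, p in txtime_dist)

def choose_chunk(candidates, buffer_s, stall_penalty=10.0):
    """candidates: one (ssim_db, txtime_dist) per encoded version of the
    next chunk. Return the index maximizing SSIM minus a penalized
    expected stall (one-step horizon for simplicity)."""
    def score(ssim_db, dist):
        return ssim_db - stall_penalty * expected_stall(buffer_s, dist)
    return max(range(len(candidates)),
               key=lambda i: score(*candidates[i]))

# Three versions of the next chunk: higher quality shifts more probability
# mass into the tail of the predicted transmission-time distribution.
candidates = [
    (14.0, [(0.5, 0.9), (1.0, 0.1)]),              # low bitrate
    (16.5, [(1.0, 0.7), (2.0, 0.2), (4.0, 0.1)]),  # medium
    (18.0, [(2.0, 0.4), (4.0, 0.4), (8.0, 0.2)]),  # high
]
print(choose_chunk(candidates, buffer_s=3.0))  # healthy buffer: risk medium
print(choose_chunk(candidates, buffer_s=0.8))  # nearly empty: play safe
```

Working with a full distribution rather than a point estimate is what lets the controller weigh tail risk: the high-bitrate version looks fine on average but carries a fat stall tail.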
[System diagram: Data Aggregation feeds daily training of the Transmission Time Predictor (model update); the predictor drives the MPC Controller (model-based control), which exchanges bitrate selections and state updates with the “OurTube” video server]
Results
I Data from January 19–28, 2019
I 8,131 hours of video streamed
I 3,719 unique users
I 2 congestion-control schemes (BBR and Cubic), 7 ABR schemes
I Daily retraining for Puffer/BBR and Puffer/Cubic
Stalls vs. SSIM (BBR)
[Scatter plot: SSIM vs. time stalled for each scheme over BBR, arrow toward better QoE; Puffer variants labeled]
Results (BBR)
Algorithm (over BBR congestion control) | Time stalled (lower is better) | Mean SSIM (higher is better) | SSIM variation (mean; lower is better)
Puffer | 0.06% | 17.03 dB | 0.53 dB
Buffer-based | 0.34% | 16.58 dB | 0.79 dB
MPC | 0.81% | 16.32 dB | 0.54 dB
RobustMPC | 0.29% | 15.74 dB | 0.63 dB
Pensieve | 0.52% | 16.19 dB | 0.88 dB
Non-continual Puffer | 0.31% | 16.49 dB | 0.60 dB
Emulation-trained Puffer | 0.76% | 15.26 dB | 1.03 dB
Stalls vs. SSIM (BBR, < 6 Mbps only)
[Scatter plot: SSIM vs. time stalled over BBR, sessions under 6 Mbps only; Puffer variants labeled]
Results in emulation don’t match real world
[Scatter plot: the same schemes evaluated in emulation; Puffer variants labeled, ordering differs from the real-world results]
People on Puffer watched 2× as long
Algorithm (any congestion control) | Stream duration, mean ± std. err. (higher may suggest less frustrated users)
Puffer | 20.3 ± 1.3 minutes
Buffer-based | 12.9 ± 0.8 minutes
MPC | 9.5 ± 0.5 minutes
RobustMPC | 10.0 ± 0.5 minutes
Pensieve | 10.0 ± 0.5 minutes
Non-continual Puffer | 12.5 ± 0.7 minutes
Emulation-trained Puffer | 7.9 ± 1.3 minutes
Multivariate regression predicts that each % of stall is associated with 6 minutes less watch time. Each dB of SSIM predicts 2 minutes more watch time (R² = 0.36).
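To make the regression concrete, here is the stated linear relationship applied to two rows of the BBR results table (the intercept is not reported, so only differences between algorithms are meaningful; this is an association, not a causal estimate):

```python
def predicted_watch_minutes(ssim_db, stall_pct, base=0.0):
    """Slopes from the slide's multivariate regression (R^2 = 0.36):
    +2 minutes of watch time per dB of SSIM, -6 minutes per 1% of time
    stalled. The intercept `base` is not reported, so only deltas
    between algorithms are meaningful."""
    return base + 2.0 * ssim_db - 6.0 * stall_pct

# Moving a viewer from RobustMPC-like numbers (15.74 dB, 0.29% stalled)
# to Puffer-like numbers (17.03 dB, 0.06% stalled):
delta = (predicted_watch_minutes(17.03, 0.06)
         - predicted_watch_minutes(15.74, 0.29))
print(round(delta, 2))  # about 4 extra predicted minutes per stream
```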
Contributors to transmission-time accuracy
TTP ablation | Prediction error rate (lower is better)
No History + No cwnd/in-flight | 19.8%
No History + No delivery rate | 17.9%
No History + No RTT | 24.0%
No History | 17.1%
Full TTP | 13.7%
Harmonic Mean | 17.9%
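For reference, the “Harmonic Mean” row is the classical non-learned predictor from MPC-style ABR work. A minimal sketch (the window size of 5 is an assumption):

```python
def harmonic_mean_throughput(samples, window=5):
    """Classical throughput predictor used by MPC-style ABR schemes: the
    harmonic mean of the last `window` observed per-chunk throughputs.
    Equivalent to averaging per-byte transfer times, so brief spikes
    barely move the estimate while brief dips drag it down."""
    recent = samples[-window:]
    return len(recent) / sum(1.0 / s for s in recent)

print(harmonic_mean_throughput([4.0, 4.0, 4.0, 4.0, 40.0]))  # ~4.9
print(harmonic_mean_throughput([4.0, 4.0, 4.0, 4.0, 0.4]))   # ~1.4
```

That conservatism is why the harmonic mean is a strong baseline (17.9% here) — and why the learned TTP’s gain over it (13.7%) has to come from features beyond recent throughput alone.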
Networking research if the Internet is sui generis
I Much-debated truism: “Data > algorithms” (even neural networks)
I In networking, “data” not enough. Need responsive environment.
I “Training set vs. testing set” is not an acceptable methodology.
I Much research to do in realistic network emulation so we can train offline
I Until then, need to learn in situ + may need to continually relearn over time
I Not so easy when:
  I Fairness is a concern: greedy behavior may be locally rewarded, but globally bad
  I Hard to attract enough users to get meaningful training and evaluation
  I Hard to compare notes across different deployments
Practical difficulties in deploying learned algorithms
I “We’ll give you a trace to learn or evaluate on.”
I Hard to predict real performance from an emulator/simulator
  I We need to see more emulator calibration diagrams
I “Give us an algorithm, and we’ll evaluate it on 0.001% of users.”
I Any sufficiently complex algorithm needs to be trained/tuned/debugged on real traffic
  I The real learning/developing usually starts after the first flow is served.
I “We invented a new congestion-control algorithm, tuned it for months, deployed it in an A/B test, and it reduced ABR playback stalls and page load times by n%.”
  I Do these gains come at the expense of competing flows?
  I Should we care?
Emulator calibration diagram (example)
High level takeaways
I Networking presents unique challenges for machine learning.
I We don’t know how to emulate the Internet very accurately.
I Continual learning in situ may mitigate some difficulties.
I Win probably not from a smarter algorithm, but from continual access to real data.
  I “Evaluate-first, idea-second”
I We are opening Puffer for other researchers to try ideas with daily feedback.
I But, Puffer is one small site with ≈ 1,000 hours/day of usage.
I Algorithm development would probably benefit from tighter collaborationbetween “people with candidate algorithms” and “people with users.”
Keith Winstein ([email protected]) + Francis Yan, Hudson Ayers, Chenzhi Zhu, Sadjad Fouladi, James Hong, Keyi Zhang, Philip Levis. Funded by: Platform Lab, VMware, Huawei, Google, Amazon, Dropbox, Facebook, NSF, DARPA.