TRANSCRIPT
Towards continual learning for networking algorithms
Keith Winstein (with Francis Yan, Hudson Ayers, Chenzhi Zhu, Sadjad Fouladi, James Hong, Keyi Zhang, and Philip Levis)
Assistant Professor of Computer Science
Assistant Professor of Electrical Engineering (by courtesy)
Stanford University
High level takeaways
I Networking presents unique challenges for machine learning.
I We don’t know how to emulate the Internet very accurately.
  I Training algorithms in emulation: disappointing real-world results.
  I Evaluating algorithms in emulation: not predictive of real-world results.
  I Running in real life: things change over time, space, etc.
I Continual learning in situ mitigates some problems.
  I But hard to arrange. . .
  I . . . only workable for some kinds of algorithms
  I . . . and means a substantial shift in how we work.
Background
I Networking algorithms have many tunable parameters.
I Natural setting for machine learning, but:
Networked systems present unique challenges for machine learning.
Example: Sprout (NSDI 2013) in publication
Sprout, NSDI 2013 (figure 7)
Sprout in real life in America
Stanford Pantheon result (July 31, 2018, T-Mobile in California), https://pantheon.stanford.edu/result/3455/
Sprout in real life in India
Stanford Pantheon result (August 1, 2018, Airtel in New Delhi), https://pantheon.stanford.edu/result/3474/
Example: Remy in simulation (SIGCOMM 2013)
Remy in emulation
RemyCC
Stanford Pantheon result (March 8, 2019, “Bottleneck buffer = BDP/3”), https://pantheon.stanford.edu/result/5809/
Remy in real life
RemyCC
Stanford Pantheon result (March 9, 2019, HostDime India to AWS India), https://pantheon.stanford.edu/result/5822/
Remy in real life (2)
RemyCC
Stanford Pantheon result (March 9, 2019, GCE Sydney to GCE Tokyo), https://pantheon.stanford.edu/result/5840/
Example: Vivace (NSDI 2018) in publication
Vivace, NSDI 2018 (figure 7)
Vivace in real life
Stanford Pantheon result (March 9, 2019, AWS Brazil to HostDime Colombia), https://pantheon.stanford.edu/result/5814/ (page 204 of PDF)
Example: Pensieve (SIGCOMM 2017) in publication
Pensieve, SIGCOMM 2017 (figure 11)
Pensieve in reproduction
Stanford CS244 student project, https://reproducingnetworkresearch.files.wordpress.com/2018/07/recreating pensieve.pdf
Example: BBR (ACM Queue 2016)
“BBR converges toward a fair share of the bottleneck bandwidth whether competing with other BBR flows or with loss-based congestion control. [...] Unmanaged router buffers exceeding several BDPs, however, cause long-lived loss-based competitors to bloat the queue and grab more than their fair share.”
Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, Van Jacobson, BBR: Congestion-Based Congestion Control, ACM Queue (December 1, 2016), https://queue.acm.org/detail.cfm?id=3022184
Academics pile on BBR
Ranysha Ware, Matthew K. Mukerjee, Justine Sherry, and Srinivasan Seshan, The Battle for Bandwidth: Fairness and Heterogeneous Congestion Control, NSDI 2018 poster (April 2018)
“Among the evaluated protocols, BBR yields the worst TCP friendliness. . . . as we add more CUBIC flows, BBR becomes increasingly aggressive: it effectively treats all competing CUBIC flows as a single “bundle” . . . . Therefore, in practice, BBR can be very unfriendly when there are multiple competing CUBIC flows.”
Mo Dong, Tong Meng, Doron Zarchy, Engin Arslan, Yossi Gilad, Brighten Godfrey, Michael Schapira, PCC Vivace: Online-Learning Congestion Control, NSDI 2018 (April 2018), https://www.usenix.org/system/files/conference/nsdi18/nsdi18-dong.pdf
Operators pile on BBR
Cubic vs BBR over a 12ms RTT 10G circuit
Geoff Huston, TCP and BBR, RIPE 76 (May 2018) https://ripe76.ripe.net/presentations/10-2018-05-15-bbr.pdf
BBR in BBR2 presentation
Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, Ian Swett, Jana Iyengar, Victor Vasiliev, Priyaranjan Jha, Yousuk Seung, Kevin Yang, Matt Mathis, Van Jacobson, BBR Congestion Control Work at Google: IETF 102 Update (July 2018), https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00
Example: Google Flu Trends
Proposal (2008): train a model to predict flu incidence from historical search engine queries. Then deploy the model to predict flu in advance of the government.
Nov. 11, 2008 announcement
Keith Winstein [email protected] CSAIL
The 2012–2013 Divergence of Google Flu Trends
Nov. 19, 2008: Publication in Nature
Accuracy figures
Index based on 45 queries (e.g. “pnumonia”).
I Training data (2003–2007): 0.80 ≤ r ≤ 0.96 (mean 0.90)
I Verification (2007–2008): 0.92 ≤ r ≤ 0.99 (mean 0.97)
“We intend to update our model each year with the latest sentinel provider ILI data, obtaining a better fit and adjusting as online health-seeking behaviour evolves over time.”
Performance in the first year
[Plot: outpatient visits for influenza-like illness (CDC), 2009–2010, y-axis 0–5%]
Aug. 19, 2011: PLoS ONE paper
I Training data (2003–2007): Mean correlation 0.90
I Verification (2007–2008): Mean correlation 0.97
I Actual (March–August 2009): Mean correlation 0.29!
Model retrained in September 2009, now 160 queries. “We will continue to perform annual updates of Flu Trends models to account for additional changes in behavior, should they occur.”
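The drop from r ≈ 0.97 on verification data to r ≈ 0.29 in actual use is the signature of dataset shift with a fixed model. A toy sketch of the correlation metric the slides use (all series below are synthetic, purely for illustration — not GFT data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic illustration: a model fit to seasonal flu tracks it well in the
# years it was trained on, then diverges when behavior shifts (e.g. an
# off-season pandemic wave changes what people search for).
actual_2007 = [1.0, 1.5, 2.5, 3.0, 2.0, 1.2]
model_2007  = [1.1, 1.4, 2.6, 2.9, 2.1, 1.1]   # tracks closely: r near 1
actual_2009 = [1.0, 2.0, 4.5, 5.0, 4.0, 2.5]   # off-season wave
model_2009  = [1.1, 1.4, 1.2, 1.0, 1.3, 1.1]   # model expects a quiet summer

print(pearson_r(actual_2007, model_2007))  # high
print(pearson_r(actual_2009, model_2009))  # low or negative
```

The point is not the particular numbers but that r computed in-sample (or on a held-out year with the same behavior) says little about r after behavior changes.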
Google Flu Trends plot as of today
(http://www.google.org/flutrends/about/how.html)
Most of plot is training data
(http://www.google.org/flutrends/about/how.html)
Large divergence (3.7×) in New England (HHS region 1)
[Plot: outpatient visits for influenza-like illness (CDC), 2010–2013, y-axis 0–14%]
Substantial divergence (+72%) in France
[Plot: outpatient visits for influenza-like illness (Sentinelles), 2010–2013, y-axis 0–1.2%]
Example: predicting unemployment claims from Twitter (9/2015)
Twitter, Big Data, and Job Numbers
People turn to Twitter to see what’s trending in the news. But LSA Economics Professor Matthew Shapiro has found a new way to harvest employment information from tweets and hashtags faster and more accurately than the government’s official reports.
LSA Economics Professor Matthew Shapiro is one of the rare people who can claim to be getting work done when looking at social media. He’s not following sports scores or breaking news; he’s tracking the nation’s labor market.
“When we started,” explains Shapiro, “we had no idea if we could track job loss with tweets, but over a two-year period, we’ve seen the social media index perform quite well. In 2011–2012, for example, our index leveled off just like the official numbers. We captured big job-loss fluctuations around Hurricane Sandy, and around the government shutdown in October 2013.”
There were times when Shapiro’s numbers matched the reports, and there were times when they didn’t. When they differed, Shapiro’s numbers were more accurate than the government’s. When the state of California got new computers, for example, there were delays in processing unemployment claims. Government data reflected the slowdown in processing applications, but social media captured a more accurate picture of what was happening in the labor market.
“Our series was stable,” says Shapiro, the Lawrence R. Klein Collegiate Professor of Economics, “so our numbers were, in some ways, a better indicator.”
By the Numbers
Many important economic indicators, including the Bureau of Labor Statistics unemployment rate and the University of Michigan Consumer Sentiment Index, are collected based on surveys. But it’s hard and costly to get enough people to answer a survey in order to get a representative sample. It’s hard to verify whether the people responding to the survey are speaking for just [...]
by Susan Hutton
Predicting unemployment claims from Twitter (at time of publication)
[Plot: weekly Initial Claims for Unemployment Insurance (thousands, y-axis 240–440) with the Twitter-based prediction overlaid, July 2011–July 2017]
Predicting unemployment claims from Twitter (post-publication)
[Plot: the same series extended past publication — Initial Claims for Unemployment Insurance (thousands) vs. the Twitter-based prediction, July 2011–July 2017]
Example: observational study of Bing searches
By John Markoff
June 7, 2016
Microsoft scientists have demonstrated that by analyzing large samples of search engine queries they may in some cases be able to identify internet users who are suffering from pancreatic cancer, even before they have received a diagnosis of the disease.
The scientists said they hoped their work could lead to early detection of cancer. Their study was published on Tuesday in The Journal of Oncology Practice [...]
The researchers focused on searches conducted on Bing, Microsoft’s search engine, that indicated someone had been diagnosed with pancreatic cancer. From there, they worked backward, looking for earlier queries that could have shown that the Bing user was experiencing symptoms before the diagnosis. Those early searches, they believe, can be warning flags.
While five-year survival rates for pancreatic cancer are extremely low, early detection of the disease can prolong life in a very small percentage of cases. The study suggests that early screening can increase the five-year survival rate of pancreatic patients to 5 to 7 percent, from just 3 percent.
The researchers reported that they could identify from 5 to 15 percent of pancreatic cases with false positive rates of as low as one in 100,000. The researchers noted that false positives could lead to raised medical costs or create significant anxiety for people who later found out they were not sick.
The data used by the researchers was anonymized, meaning it did not carry identifying markers like a user name, so the individuals conducting the searches could not be contacted.
Microsoft Finds Cancer Clues in Search Queries
Example: spam filtering
SpamAssassin (spam filtering engine):
I Anybody can propose a spam-filtering algorithm.
I Central party learns best weights based on predictive power of each algorithm.
I Weights are then deployed in the field.
Deployment: spam filtering
I 2007: a rule is added
I Rule: “Does the year match 200x?”
I Catches a lot of spam.
I Zero false positives in training! . . . but what happens in 2010?
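This is recognizable as SpamAssassin’s FH_DATE_PAST_20XX incident: a rule scoring any Date header year in 2010–2099 as “grossly in the future.” A minimal sketch of the failure mode (simplified to a single rule; real SpamAssassin combines many weighted tests):

```python
import re

# Sketch of the FH_DATE_PAST_20XX pattern: a Date-header year of
# 2010-2099 is treated as "grossly in the future." In 2007, every
# legitimate message is dated 200x, so the rule has zero false positives
# in training while catching spam with bogus far-future dates.
FUTURE_YEAR = re.compile(r"20[1-9][0-9]")

def looks_like_spam(date_header: str) -> bool:
    return bool(FUTURE_YEAR.search(date_header))

# 2007: works exactly as intended.
assert not looks_like_spam("Date: Tue, 3 Jul 2007 10:00:00 -0700")  # ham
assert looks_like_spam("Date: Thu, 1 Jan 2099 00:00:00 +0000")      # spam

# January 1, 2010: every legitimate message now trips the rule.
assert looks_like_spam("Date: Fri, 8 Jan 2010 09:00:00 -0700")
```

Nothing about the training data could have flagged this: the rule was perfectly predictive until the world (the calendar) shifted under it.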
Yesterday. . .
Observation
Many apparent ML advances in networking decay with time, or with the move from simulator/emulator to deployment.
Is the Internet a uniquely challenging setting for ML?
Unique challenges for ML in networking
I Model mismatch: training environment (simulator, emulator,testbed) does not match deployment environment (Internet).
I Dataset shift: change in network/userbase/workload over time
I Each participant has only partial information
I Failure may not be evident to any individual
I Greedy behavior may be locally rewarded, but globally bad
I Adversaries, heavy tails, . . .
High level takeaways
I Networking presents unique challenges for machine learning.
I We don’t know how to emulate the Internet very accurately.
  I Training algorithms in emulation: disappointing real-world results.
  I Evaluating algorithms in emulation: not predictive of real-world results.
  I Running in real life: things change over time, space, etc.
I Continual learning in situ mitigates some problems.
  I But hard to arrange. . .
  I . . . only workable for some kinds of algorithms
  I . . . and means a substantial shift in how we work.
We don’t know how to emulate the Internet very accurately
Francis Yan et al., Pantheon: the training ground for Internet congestion-control research, USENIX ATC 2018, figure 3(b), https://pantheon.stanford.edu/static/pantheon/documents/pantheon-paper.pdf
Possible solution: continual learning in place
I “Evaluate-first, idea-second” approach: deploy something and start collecting real metrics on real traffic
I Retrain frequently
I Compare against held-back “sane baseline”
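The three bullets above can be sketched as a small harness (all names here are hypothetical; the real Puffer infrastructure is far more involved):

```python
import random

class ContinualDeployment:
    """Sketch of "evaluate-first": randomize sessions across candidate
    algorithms plus a held-back baseline, log real metrics per arm, and
    retrain the candidates periodically on what was collected."""

    def __init__(self, candidates, baseline, seed=0):
        self.arms = {**candidates, "baseline": baseline}
        self.metrics = {name: [] for name in self.arms}
        self.rng = random.Random(seed)

    def assign(self, session_id):
        # Uniform randomization over arms; a real system might stratify
        # by client network, ISP, time of day, etc.
        return self.rng.choice(sorted(self.arms))

    def log(self, arm, quality_db, stalled_frac):
        self.metrics[arm].append((quality_db, stalled_frac))

    def retrain(self):
        # The baseline is deliberately never retrained: it is the sane
        # yardstick the learned arms are compared against.
        for name, algo in self.arms.items():
            if name != "baseline" and hasattr(algo, "fit"):
                algo.fit(self.metrics[name])

class DummyABR:
    def fit(self, data):
        self.trained_on = len(data)

deploy = ContinualDeployment({"learned_abr": DummyABR()}, baseline=object())
for sid in range(100):
    arm = deploy.assign(sid)
    deploy.log(arm, quality_db=16.0, stalled_frac=0.01)
deploy.retrain()
```

The held-back baseline is what makes the retraining loop safe to run indefinitely: any regression in the learned arms shows up as a gap against an arm that never changes.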
Puffer: continual learning for Internet video streaming
I Live TV-streaming website (https://puffer.stanford.edu)
I Approved by Stanford lawyers, opened to public December 2018
I Streaming about 1,000 hours of video per day
I Randomizes sessions to different algorithms
I Goal: realistic testbed and learning environment for research in:
  I congestion control
  I throughput prediction, and
  I adaptive bitrate (ABR)
Algorithms that affect video streaming
I Congestion control: when to send each packet
I Throughput prediction: how fast can server send in near future?
I Adaptive bitrate (ABR): what version of each upcoming “chunk” to send
Demo
Google ad for “tv streaming”
Reddit ad
Reddit (continued)
Puffer algorithm
I Continual learning algorithm for video streaming
I Model-based RL: combines classical control and neural network
I Neural network predicts “how long would each chunk take?”
  I given internal TCP statistics (weakly cross-layer)
  I given size of proposed chunk
  I producing probability distribution (vs. point estimate)
I Classical controller: probabilistic model-predictive control
  I Maximize SSIM
  I Minimize stalls
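A stripped-down version of the control step might look like this (all numbers and the penalty weight are made up, and the horizon is one chunk; the real controller plans several chunks ahead):

```python
def expected_stall(buffer_s, txtime_dist):
    """Expected stall (seconds) if the chunk's transmission time -- given
    as a discrete distribution of (seconds, probability) pairs -- exceeds
    the client's current playback buffer."""
    return sum(p * max(0.0, t - buffer_s) for t, p in txtime_dist)

def choose_chunk(candidates, buffer_s, stall_penalty=10.0):
    """candidates: one (ssim_db, txtime_dist) per encoded version of the
    next chunk. Return the index maximizing SSIM minus a penalized
    expected stall (one-step horizon for simplicity)."""
    def score(ssim_db, dist):
        return ssim_db - stall_penalty * expected_stall(buffer_s, dist)
    return max(range(len(candidates)),
               key=lambda i: score(*candidates[i]))

# Three versions of the next chunk: higher quality shifts more probability
# mass into the tail of the predicted transmission-time distribution.
candidates = [
    (14.0, [(0.5, 0.9), (1.0, 0.1)]),              # low bitrate
    (16.5, [(1.0, 0.7), (2.0, 0.2), (4.0, 0.1)]),  # medium
    (18.0, [(2.0, 0.4), (4.0, 0.4), (8.0, 0.2)]),  # high
]
print(choose_chunk(candidates, buffer_s=3.0))  # healthy buffer: risk medium
print(choose_chunk(candidates, buffer_s=0.8))  # nearly empty: play safe
```

Working with a full distribution rather than a point estimate is what lets the controller weigh tail risk: the high-bitrate version looks fine on average but carries a fat stall tail.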
[System diagram: Data Aggregation feeds daily training of the Transmission Time Predictor (model update); the predictor drives the MPC Controller (model-based control), which exchanges bitrate selections and state updates with the “OurTube” video server]
Results
I Data from January 19–28, 2019
I 8,131 hours of video streamed
I 3,719 unique users
I 2 congestion-control schemes (BBR and Cubic), 7 ABR schemes
I Daily retraining for Puffer/BBR and Puffer/Cubic
Stalls vs. SSIM (BBR)
[Scatter plot: SSIM vs. time stalled for each scheme over BBR, arrow toward better QoE; Puffer variants labeled]
Results (BBR)
Algorithm (over BBR congestion control) | Time stalled (lower is better) | Mean SSIM (higher is better) | SSIM variation (mean; lower is better)
Puffer | 0.06% | 17.03 dB | 0.53 dB
Buffer-based | 0.34% | 16.58 dB | 0.79 dB
MPC | 0.81% | 16.32 dB | 0.54 dB
RobustMPC | 0.29% | 15.74 dB | 0.63 dB
Pensieve | 0.52% | 16.19 dB | 0.88 dB
Non-continual Puffer | 0.31% | 16.49 dB | 0.60 dB
Emulation-trained Puffer | 0.76% | 15.26 dB | 1.03 dB
Stalls vs. SSIM (BBR, < 6 Mbps only)
[Scatter plot: SSIM vs. time stalled over BBR, sessions under 6 Mbps only; Puffer variants labeled]
Results in emulation don’t match real world
[Scatter plot: the same schemes evaluated in emulation; Puffer variants labeled, ordering differs from the real-world results]
People on Puffer watched 2× as long
Algorithm (any congestion control) | Stream duration, mean ± std. err. (higher may suggest less frustrated users)
Puffer | 20.3 ± 1.3 minutes
Buffer-based | 12.9 ± 0.8 minutes
MPC | 9.5 ± 0.5 minutes
RobustMPC | 10.0 ± 0.5 minutes
Pensieve | 10.0 ± 0.5 minutes
Non-continual Puffer | 12.5 ± 0.7 minutes
Emulation-trained Puffer | 7.9 ± 1.3 minutes
Multivariate regression predicts that each % of stall is associated with 6 minutes less watch time. Each dB of SSIM predicts 2 minutes more watch time (R² = 0.36).
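To make the regression concrete, here is the stated linear relationship applied to two rows of the BBR results table (the intercept is not reported, so only differences between algorithms are meaningful; this is an association, not a causal estimate):

```python
def predicted_watch_minutes(ssim_db, stall_pct, base=0.0):
    """Slopes from the slide's multivariate regression (R^2 = 0.36):
    +2 minutes of watch time per dB of SSIM, -6 minutes per 1% of time
    stalled. The intercept `base` is not reported, so only deltas
    between algorithms are meaningful."""
    return base + 2.0 * ssim_db - 6.0 * stall_pct

# Moving a viewer from RobustMPC-like numbers (15.74 dB, 0.29% stalled)
# to Puffer-like numbers (17.03 dB, 0.06% stalled):
delta = (predicted_watch_minutes(17.03, 0.06)
         - predicted_watch_minutes(15.74, 0.29))
print(round(delta, 2))  # about 4 extra predicted minutes per stream
```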
Contributors to transmission-time accuracy
TTP ablation | Prediction error rate (lower is better)
No History + No cwnd/in-flight | 19.8%
No History + No delivery rate | 17.9%
No History + No RTT | 24.0%
No History | 17.1%
Full TTP | 13.7%
Harmonic Mean | 17.9%
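For reference, the “Harmonic Mean” row is the classical non-learned predictor from MPC-style ABR work. A minimal sketch (the window size of 5 is an assumption):

```python
def harmonic_mean_throughput(samples, window=5):
    """Classical throughput predictor used by MPC-style ABR schemes: the
    harmonic mean of the last `window` observed per-chunk throughputs.
    Equivalent to averaging per-byte transfer times, so brief spikes
    barely move the estimate while brief dips drag it down."""
    recent = samples[-window:]
    return len(recent) / sum(1.0 / s for s in recent)

print(harmonic_mean_throughput([4.0, 4.0, 4.0, 4.0, 40.0]))  # ~4.9
print(harmonic_mean_throughput([4.0, 4.0, 4.0, 4.0, 0.4]))   # ~1.4
```

That conservatism is why the harmonic mean is a strong baseline (17.9% here) — and why the learned TTP’s gain over it (13.7%) has to come from features beyond recent throughput alone.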
Networking research if the Internet is sui generis
I Much-debated truism: “Data > algorithms” (even neural networks)
I In networking, “data” not enough. Need responsive environment.
I “Training set vs. testing set” is not an acceptable methodology.
I Much research to do in realistic network emulation so we can train offline
I Until then, need to learn in situ + may need to continually relearn over time
I Not so easy when:
  I Fairness is a concern: greedy behavior may be locally rewarded, but globally bad
  I Hard to attract enough users to get meaningful training and evaluation
  I Hard to compare notes across different deployments
Practical difficulties in deploying learned algorithms
I “We’ll give you a trace to learn or evaluate on.”
I Hard to predict real performance from an emulator/simulator
  I We need to see more emulator calibration diagrams
I “Give us an algorithm, and we’ll evaluate it on 0.001% of users.”
I Any sufficiently complex algorithm needs to be trained/tuned/debugged on real traffic
  I The real learning/developing usually starts after the first flow is served.
I “We invented a new congestion-control algorithm, tuned it for months, deployed it in an A/B test, and it reduced ABR playback stalls and page load times by n%.”
  I Do these gains come at the expense of competing flows?
  I Should we care?
Emulator calibration diagram (example)
High level takeaways
I Networking presents unique challenges for machine learning.
I We don’t know how to emulate the Internet very accurately.
I Continual learning in situ may mitigate some difficulties.
I Win probably not from a smarter algorithm, but from continual access to real data.
  I “Evaluate-first, idea-second”
I We are opening Puffer for other researchers to try ideas with daily feedback.
I But, Puffer is one small site with ≈ 1,000 hours/day of usage.
I Algorithm development would probably benefit from tighter collaborationbetween “people with candidate algorithms” and “people with users.”
Keith Winstein ([email protected]) + Francis Yan, Hudson Ayers, Chenzhi Zhu, Sadjad Fouladi, James Hong, Keyi Zhang, Philip Levis. Funded by: Platform Lab, VMware, Huawei, Google, Amazon, Dropbox, Facebook, NSF, DARPA.