Traffic Prediction on the Internet
Anne Denton
Outline
Paper by Y. Baryshnikov, E. Coffman, D. Rubenstein and B. Yimwadsana
Solutions
Time-series prediction
Our work for the KDD-cup 03
Time Series Prediction on the Internet
By Y. Baryshnikov, E. Coffman, D. Rubenstein and B. Yimwadsana
Adjustment to “hot spots”
Avoiding degradation, even “denial of service”
Can “hot spots” be predicted?
Can predicted “hot spots” be avoided?
What are “hot spots”?
  Exceptionally large numbers of requests
  Spontaneous, short lifetime
“Instant” ramp-up in traffic
  Only valid on long time scales
  Claim: the time scale for the increase is larger than the time scale to react
Why does the increase take time?
  Passing on the word
How good does a predictor have to be?
  Cost of missing a “hot spot” is higher than the aggregate cost of false alarms (similar to hurricane warnings)
Examples
Olympics (Nagano 98)
Soccer World Cup (98)
NASA (95)
What to do about “hot spots”?
<Detour>
“The Columbia Hotspot Rescue Service: A Research Plan”, E. Coffman, P. Jelenkovic, J. Nieh, and D. Rubenstein
Approaches
  Deal ad hoc with high request rates
  Build a better network (expensive)
  Content delivery services
    Caching
    Extra bandwidth
Suggested solution: use available and underutilized resources
Hotspot Rescue Service
Server-based approach
  Requires additional resources from server when necessary
  Resources provided by other members of Hotspot Rescue Service
Peer-to-Peer approach
  Requires additional resources from client when necessary
  Caching
Four Phases
Prediction (see rest of presentation)
  Server-based: daemons
  P2P: plug-ins
Replication
  Server-based: replication of objects
  P2P: identified cached copies
  More advanced: redistribution of traffic load
Notification
  Modifications to DNS (Domain Name System)
  P2P system proactively announces hot objects and indicates alternative locations?
Termination
<End of Detour>
Tail of Distribution
Requests per 10-second time slot
X-axis: number of hits per time slot
Y-axis: probability that that number of hits will be exceeded
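The tail plot described above can be estimated directly from per-slot hit counts. The following is a minimal sketch; the function name `tail_distribution` and the sample counts are hypothetical and are not data from the paper.

```python
from collections import Counter

def tail_distribution(hits_per_slot):
    """Return sorted (n, P[hits > n]) pairs estimated from per-slot counts."""
    total = len(hits_per_slot)
    counts = Counter(hits_per_slot)
    result = []
    exceeding = total
    for n in sorted(counts):
        exceeding -= counts[n]  # slots with strictly more than n hits
        result.append((n, exceeding / total))
    return result

# Hypothetical hit counts for eight 10-second slots (one slot is a burst).
sample = [2, 3, 3, 5, 8, 3, 2, 40]
for n, p in tail_distribution(sample):
    print(n, p)
```

A heavy tail shows up as exceedance probabilities that decay slowly as the number of hits grows.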
Time Scales
Prediction relies on correlation between values at different times
Autocorrelation function
  $$\int f(t)\, f(t+\tau)\, dt$$
Predictability on time scales of 5-30 min
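For a sampled request series, the autocorrelation can be estimated with a discrete sum. The sketch below uses the mean-subtracted, variance-normalized estimator, which is an assumption made here for illustration; the slide's integral is the continuous form. The sample series is hypothetical.

```python
def autocorrelation(series, lag):
    """Normalized sample autocorrelation of `series` at integer `lag` >= 0."""
    n = len(series)
    if lag >= n:
        return 0.0
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # constant series: correlation is undefined, report 0
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# Hypothetical per-slot request counts with a period of two slots.
requests = [1, 0, 1, 0, 1, 0, 1, 0]
for lag in range(4):
    print(lag, autocorrelation(requests, lag))
```

Lags at which this value stays well above zero indicate time scales on which past traffic carries information about future traffic.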
Prediction Algorithm
Standard problem
  Signal processing
  Econometrics
Internet traffic
  Particularly bursty
Simplest model
  Linear extrapolation
Structure of Prediction Algorithms
Traffic observation
  # of requests in time unit $(t-1, t]$
  Time unit usually 1 s
Prediction window
  Duration $W_p \ge 0$
Advance notice $\tau \ge 0$
Prediction at time $t$:
  Mapping of observations in $[t-W_p, t]$ to a number $p_t \ge 0$ of requests predicted in the interval $[t+\tau, t+\tau+1]$, that is, $\tau$ units in the future
Linear Prediction
Linear fit: least-squares linear fit
  $p_t = f_t(t+\tau)$ with $f_t(s) = a_t s + b_t$
  Minimizing $\sum_{i=t-W_p}^{t} \left( f_t(i) - r_i \right)^2$, where $r_i$ is the number of requests in time unit $(i-1, i]$
Performance: O(W+T)
  W: window size
  T: uptime duration
Problems
  Prediction window size must match burstiness parameters governing request flow
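The linear predictor above can be sketched as a least-squares line fitted over the prediction window and evaluated at the advance-notice time. This is a minimal illustration, not the paper's implementation: the names `linear_prediction`, `window`, and `advance` are chosen here, the fit is recomputed from scratch for each call rather than updated incrementally, and clamping at zero is an assumption that keeps the predicted count non-negative.

```python
def linear_prediction(requests, t, window, advance):
    """Fit a least-squares line to requests[t-window .. t] and evaluate it
    at time t + advance. requests[i] is the count in time unit (i-1, i]."""
    xs = list(range(t - window, t + 1))
    ys = [requests[i] for i in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # single-point window: no slope information
        return max(0.0, mean_y)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    # Clamp at zero so the predicted number of requests is never negative.
    return max(0.0, slope * (t + advance) + intercept)

# Hypothetical request counts that grow by 2 per time unit.
requests = [0, 2, 4, 6, 8, 10]
print(linear_prediction(requests, t=5, window=3, advance=2))
```

On bursty traffic the result is sensitive to `window`: a window that is too short follows noise, and one that is too long reacts late to a ramp-up.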
Results
Depends on properties of auto-correlation function
Conclusions of Paper
  Build a load-based taxonomy of web server traffic
  Depends on technological, sociological, and psychological factors
  Look for quantification of basic patterns reflecting behavior
Do we agree ???
  Why cluster when we can classify!!
Our Approach
Normally time series prediction uses only data in that time series
We use similarity to other instances
  E.g., other web sites
Model-free
  Weighted Nearest Neighbor approach
Problem: how to integrate time?
Typical Nearest Neighbor Classification / Regression
R(A1, …, An, C)
  Attributes Ai
  C: class label (classification) or continuous variable (regression)
Based on distance function on Ai
  K nearest neighbors
  Neighbors within a range
  Use kernel function to weight closer ones higher
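The regression variant described above can be sketched as follows: take the k nearest rows under a distance on the attributes and average their target values, weighting closer rows higher with a kernel. The Euclidean distance, the Gaussian kernel, and the names `weighted_knn_regress` and `bandwidth` are assumptions chosen for illustration.

```python
import math

def weighted_knn_regress(train, query, k=3, bandwidth=1.0):
    """Predict a continuous value for `query` from (attributes, value) rows."""
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Keep the k rows whose attributes are closest to the query.
    neighbors = sorted(train, key=lambda row: distance(row[0], query))[:k]
    # Gaussian kernel: closer neighbors receive larger weights.
    weights = [math.exp(-distance(attrs, query) ** 2 / (2 * bandwidth ** 2))
               for attrs, _ in neighbors]
    total = sum(weights)
    if total == 0:  # every weight underflowed: fall back to the plain mean
        return sum(value for _, value in neighbors) / len(neighbors)
    return sum(w * value for w, (_, value) in zip(weights, neighbors)) / total

# Hypothetical one-attribute training rows: (attributes, target value).
train = [((0.0,), 1.0), ((1.0,), 3.0), ((10.0,), 100.0)]
print(weighted_knn_regress(train, (0.5,), k=2))
```

With `k=2` the distant row at 10.0 is ignored, and the two remaining neighbors are equally far from the query, so their values are averaged with equal weight.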
Weighting of Attributes
Some attributes are more important than others
Apply scaling to space
Optimize weights through
  Hill-climbing
  Genetic Algorithm
How does this generalize to a time-series?
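Hill-climbing over attribute weights can be sketched as a greedy search that changes one weight at a time and keeps any change that lowers a loss. The coordinate-wise scheme, the step halving, and the name `hill_climb_weights` are assumptions for illustration; in practice `loss` would be the prediction error of the weighted nearest-neighbor model on held-out data, and the toy loss below only stands in for it.

```python
def hill_climb_weights(loss, weights, step=0.5, rounds=20):
    """Greedy coordinate-wise hill climbing over non-negative weights."""
    best = list(weights)
    best_loss = loss(best)
    for _ in range(rounds):
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                candidate = list(best)
                candidate[i] = max(0.0, candidate[i] + delta)
                candidate_loss = loss(candidate)
                if candidate_loss < best_loss:
                    best, best_loss = candidate, candidate_loss
                    improved = True
        if not improved:
            step /= 2  # refine the search once no neighbor improves
    return best, best_loss

# Toy stand-in loss: the first attribute matters (weight 2), the second does not.
def toy_loss(w):
    return (w[0] - 2.0) ** 2 + w[1] ** 2

print(hill_climb_weights(toy_loss, [1.0, 1.0]))
```

A weight driven to zero removes that attribute from the distance function, which is how scaling the space expresses that some attributes are more important than others.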
Our Answer
Identify “relevant” sections in the time series
  E.g. times with already high download rates
We’ll call each relevant section a “prediction”
Predictions
Each prediction contains information about
  The nature of the time series
  The time instance in question, i.e. the history of requests
  The actual change in requests
Make a table of predictions
  Leads to a relation just as in the standard classification / regression setting
Data Set
Paper citations in the arXiv e-print archive
Background: KDD-cup 03
  Predict the change in citations in successive 3-month periods
  Only consider periods with at least 6 citations
  Evaluation: L1 distance (Manhattan distance) between predicted and real difference
Very close match between citation history and request history
  Predict change in requests
  Only consider periods that already show a large number of requests
Attributes of a “Prediction”
Quantitative attributes
  Number of citations in window
  Gradient of citations in window
  Aggregate number of citations up to and through window (assume finite time series)
Attribute values given by time series
  Keyword occurrences
  Author
  Number of revisions of papers
  Maximum time interval between revisions
  Country of origin
  Format
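The quantitative attributes above can be turned into rows of a prediction table, one row per relevant period. This is a sketch under stated assumptions: each window is a single period, the gradient is taken as the difference from the previous period, and the name `build_prediction_rows` and the sample series are hypothetical.

```python
def build_prediction_rows(series, min_count=6):
    """Turn per-period counts into rows of
    (count, gradient, cumulative, change) for every relevant period."""
    rows = []
    cumulative = 0
    for t, count in enumerate(series):
        cumulative += count
        if t == 0 or t + 1 >= len(series):
            continue  # need a previous period (gradient) and a next one (target)
        if count < min_count:
            continue  # only periods that already show enough activity
        gradient = count - series[t - 1]   # assumed definition of "gradient"
        change = series[t + 1] - count     # the value to be predicted
        rows.append((count, gradient, cumulative, change))
    return rows

# Hypothetical citation counts for six successive 3-month periods.
for row in build_prediction_rows([1, 4, 7, 9, 6, 2]):
    print(row)
```

Rows from many time series can be stacked into one table, where the first three columns act as attributes and the last column is the continuous target for regression.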
Similarity Function
Common kernel function
  $$K(x_0, x_1) = \exp\left( -\frac{(x_1 - x_0)^2}{2\sigma^2} \right)$$
What worked better
  $$K(x_0, x_1) = \frac{1}{1 + w\,|x_1 - x_0|}$$
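The two similarity functions can be compared numerically. The sketch assumes the Gaussian form exp(-(x1 - x0)^2 / (2 sigma^2)) and the form 1 / (1 + w |x1 - x0|); the parameter values `sigma=5.0` and `w=1.0` are arbitrary choices for illustration.

```python
import math

def gaussian_kernel(x0, x1, sigma=1.0):
    """Gaussian similarity: falls off very quickly beyond a few sigma."""
    return math.exp(-((x1 - x0) ** 2) / (2 * sigma ** 2))

def inverse_kernel(x0, x1, w=1.0):
    """1 / (1 + w * distance): drops fast near zero but keeps a long tail."""
    return 1.0 / (1.0 + w * abs(x1 - x0))

for d in (0, 1, 5, 10, 20):
    print(d, gaussian_kernel(0, d, sigma=5.0), inverse_kernel(0, d, w=1.0))
```

The practical difference is in the tail: distant instances still contribute a small weight under the second function, while under the Gaussian their weight is effectively zero.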
Plot of Similarity Function
[Figure: f(x) against x for x from 0 to 20, comparing the Gaussian kernel with 1/(1+x)]
Accuracy
No linear extrapolation data available
  Could lead to negative citations
Comparison
  Default prediction (no change): 1851
  Very simple model (decrease by 0.3 in 3 months): 1532
  Prediction based on average of time series (synchronized at first non-0): 1593
  Prediction based on quantitative attributes: 1465
  Full prediction (preliminary): 1357
  Weight optimized (very preliminary): reduction 1414 -> 1391
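The two simplest baselines in the comparison can be scored with an L1 error as sketched below. The list of actual changes is made up for illustration and has no relation to the KDD-cup data or to the figures listed above.

```python
def l1_error(predicted, actual):
    """Sum of absolute differences between predicted and actual changes."""
    return sum(abs(p - a) for p, a in zip(predicted, actual))

# Hypothetical actual changes in citations for four periods.
actual = [2, -3, -4, 1]

no_change = [0.0] * len(actual)        # default prediction: no change
small_decrease = [-0.3] * len(actual)  # very simple model: decrease by 0.3

print("no change:", l1_error(no_change, actual))
print("decrease by 0.3:", l1_error(small_decrease, actual))
```

Lower totals are better, so any learned predictor is useful only if it beats both constant baselines on the same periods.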
Results
[Figure: chart of four data series (Series1 to Series4) over x values 1 to 11, y-axis from 0 to 3000; series labels not recoverable]
Conclusions
Method works well for citation prediction
Yet to be tested for hot-spot prediction