efficient monitoring algorithm for fast news alertsia/present/monitor.pdf · ka cheung sia...
TRANSCRIPT
Efficient Monitoring Algorithm forFast News Alert
Ka Cheung ”Richard” [email protected]
UCLA
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Backgroud
GoalMonitor and collect information from the WebAnswer most of users’ queries
ChallengesBillions of pages to monitorInformation are updated frequentlyUsers want information fresh!
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Information aggregator framework
Server-based monitoring and dissemination
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Overview
Modeling the posting generation processDefinition of delayPoisson process
Crawl schedulingResource allocation (how often to contact?)Retrieval scheduling (when to contact?)
The collected data∼10k RSS (since September 2004)∼40k Weblogs (since April 2004)
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Overview
Modeling the posting generation processDefinition of delayPoisson process
Crawl schedulingResource allocation (how often to contact?)Retrieval scheduling (when to contact?)
The collected data∼10k RSS (since September 2004)∼40k Weblogs (since April 2004)
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Overview
Modeling the posting generation processDefinition of delayPoisson process
Crawl schedulingResource allocation (how often to contact?)Retrieval scheduling (when to contact?)
The collected data∼10k RSS (since September 2004)∼40k Weblogs (since April 2004)
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
New challenges
Higher requirement on freshness
Finer time granularity (will traditional assumptionbe valid?)
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Terminology
ti - posting generation time
τj - time of the jth contact
D(O) =∑k
i=1(τj − ti), where ti ∈ [τj−1, τj]
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Posting generation model
Homogeneous Poisson modelλ(t) = λ at any t
Periodic inhomogeneous Poisson modelλ(t) = λ(t− nT ), n = 1, 2, . . .
0 2 4 6 8 10 12 14 160
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
2.2x 10
5 Weekly number of postings (Sep 26 − Jan 8)
Num
ber o
f pos
tings
Week
Oct Nov Dec
0 20 40 60 80 100 120 140 1600
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
Hours (since Oct 3 12:00 am)
Num
ber o
f pos
tings
in 2
hou
rs
2 hours posting count (Oct 3 − Oct 9)
Sun Mon Tue Wed Thu Fri Sat
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Posting generation model
Homogeneous Poisson modelλ(t) = λ at any t
Periodic inhomogeneous Poisson modelλ(t) = λ(t− nT ), n = 1, 2, . . .
0 2 4 6 8 10 12 14 160
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
2.2x 10
5 Weekly number of postings (Sep 26 − Jan 8)
Num
ber o
f pos
tings
Week
Oct Nov Dec
0 20 40 60 80 100 120 140 1600
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
Hours (since Oct 3 12:00 am)
Num
ber o
f pos
tings
in 2
hou
rs
2 hours posting count (Oct 3 − Oct 9)
Sun Mon Tue Wed Thu Fri Sat
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Expected retrieval delay
Inhomogeneous Poisson modelrate - λ(t)retrieval time - τj−1, τj
expected delay -∫ τj
τj−1λ(t)(τj − t)dt
Homoegeneous Poisson model
expected delay -λ(τj−τj−1)2
2
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Expected retrieval delay
Inhomogeneous Poisson modelrate - λ(t)retrieval time - τj−1, τj
expected delay -∫ τj
τj−1λ(t)(τj − t)dt
Homoegeneous Poisson model
expected delay -λ(τj−τj−1)2
2
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Objective
Maximize resource utlilization to provide timelyinformaiton.
Resource allocationHow often to contact data sources?
Retrieval schedulingWhen to contact data sources within a day?
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Objective
Maximize resource utlilization to provide timelyinformaiton.
Resource allocationHow often to contact data sources?
Retrieval schedulingWhen to contact data sources within a day?
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Objective
Maximize resource utlilization to provide timelyinformaiton.
Resource allocationHow often to contact data sources?
Retrieval schedulingWhen to contact data sources within a day?
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Resource allocation
Consider n data source O1, . . . , On
λi - posting rate of Oi
wi - weight of Oi (how important)N - total number of retrievals per daymi - number of retrievals per day allocated to Oi
Optimal allocation
mi ∝√
wiλi
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Resource allocation
Consider n data source O1, . . . , On
λi - posting rate of Oi
wi - weight of Oi (how important)N - total number of retrievals per daymi - number of retrievals per day allocated to Oi
Optimal allocation
mi ∝√
wiλi
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Retrieval scheduling
m retrieval(s) per day is allocated for data sourceO, how should we schedule these m retrievals?
m = 1
m > 1
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Multiple retrievals per period
m retrievals per period are allocated, whenscheduled at time τ1, . . . , τm, the expected delay is:
D(O) =∑m
i=1
∫ τi+1
τiλ(t)(τi+1 − t)dt
τm+1 = T + τ1
Criteria for optimality
λ(τj)(τj+1 − τj) =∫ τj
τj−1λ(t)dt
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Multiple retrievals per period
Example: λ(t) = 2 + 2 sin(2πt)
0 0.2 0.4 0.6 0.8 10
0.5
1
1.5
2
2.5
3
3.5
time − t
λ(t)
τj−1
τj τ
j+1
∫τj−1
τj λ(t)dt
λ(τj)(τ
j+1−τ
j)
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Experiment
∼10k RSS feeds from Sep 21 - Dec 20 2004
Characteristics of posting generation
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Distribution of posting rate
100
101
102
103
104
105
100
101
102
103
Number of postings
Num
ber o
f RS
S fe
eds
9634 feeds
9634 RSS feeds are used
Power-law distribution
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Is posting rate stable and predictable?
0 10 20 30 40 500
5
10
15
20
25
30
35
40
45
50
Posting rate on week 1−2 (Oct 3−16)
Pos
ting
rate
on
wee
k 3−
4 (O
ct 1
7−30
) Remainingtop 50% − top 90%top 50%
0 10 20 30 40 500
5
10
15
20
25
30
35
40
45
50
Posting rate on week 1−2 (Oct 3−16)
Pos
ting
rate
on
wee
k 12
−13
(Dec
12−
25) Remainingtop 50% − top 90%top 50%
The closer to diagonal, the more the stability andpredictability
red - top 50%, green - top 80%, blue - rest
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
How much history to keep?
0 10 20 30 40 5080
100
120
140
160
180
200
220
240
260
280
300
Size of sliding window (days)
Ave
rage
del
ay (m
inut
es)
Resource allocation only: 4 retrievals per day
Reallocate resource everyday
2 weeks is a good choice
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
What is the posting pattern?
0 5 10 15 200
0.5
1
1.5
2
2.5x 10
5
Hours
Tota
l num
ber o
f pos
tings
bet
wee
n A
pr 1
− D
ec 3
1 20
04 Changes of posting rate within one day
Periodic (daily pattern)
inactive at night
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
What are the individual pattern?
0 5 10 15 200
0.05
0.1
0 5 10 15 200
0.1
0.2
0 5 10 15 200
0.2
0.4
0 5 10 15 200
0.2
0.4
0 5 10 15 200
0.2
0.4
0 5 10 15 200
0.2
0.4
1625 feeds 1346 feeds
384 feeds 377 feeds
369 feeds 305 feeds
K-mean clustering
Optimize for different patterns
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Performance
1 2 3 4 5 6 70
100
200
300
400
500
600
700
Number of retrievals per day per feed
Ave
rage
del
ay (m
inut
es)
Even schedulingRetrieval schedulingResource allocationCombined
1. Even scheduling2. Retrieval scheduling only3. Resource allocation only4. Combined
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Performance
strategy 1 2 3 4
average delay (in min) 645 581 433 395max delay (in min) 1440 1440 9120 10073standard deviation 392 405 542 560
Statistics breakdown of posting delay using one retrieval perday.
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert
Summary
Efficient MonitoringResource allocationRetrieval scheduling→ Include user access pattern (extension)
Data1 year of weblogs and half year of RSS dataFor prototype testing
Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert