efficient monitoring algorithm for fast news alertsia/present/monitor.pdf · ka cheung sia...

29
Efficient Monitoring Algorithm for Fast News Alert Ka Cheung ”Richard” Sia [email protected] UCLA Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Upload: lehanh

Post on 15-Apr-2019

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Efficient Monitoring Algorithm forFast News Alert

Ka Cheung ”Richard” [email protected]

UCLA

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 2: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Backgroud

GoalMonitor and collect information from the WebAnswer most of users’ queries

ChallengesBillions of pages to monitorInformation are updated frequentlyUsers want information fresh!

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 3: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Information aggregator framework

Server-based monitoring and dissemination

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 4: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Overview

Modeling the posting generation processDefinition of delayPoisson process

Crawl schedulingResource allocation (how often to contact?)Retrieval scheduling (when to contact?)

The collected data∼10k RSS (since September 2004)∼40k Weblogs (since April 2004)

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 5: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Overview

Modeling the posting generation processDefinition of delayPoisson process

Crawl schedulingResource allocation (how often to contact?)Retrieval scheduling (when to contact?)

The collected data∼10k RSS (since September 2004)∼40k Weblogs (since April 2004)

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 6: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Overview

Modeling the posting generation processDefinition of delayPoisson process

Crawl schedulingResource allocation (how often to contact?)Retrieval scheduling (when to contact?)

The collected data∼10k RSS (since September 2004)∼40k Weblogs (since April 2004)

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 7: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

New challenges

Higher requirement on freshness

Finer time granularity (will traditional assumptionbe valid?)

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 8: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Terminology

ti - posting generation time

τj - time of the jth contact

D(O) =∑k

i=1(τj − ti), where ti ∈ [τj−1, τj]

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 9: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Posting generation model

Homogeneous Poisson modelλ(t) = λ at any t

Periodic inhomogeneous Poisson modelλ(t) = λ(t− nT ), n = 1, 2, . . .

0 2 4 6 8 10 12 14 160

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

2.2x 10

5 Weekly number of postings (Sep 26 − Jan 8)

Num

ber o

f pos

tings

Week

Oct Nov Dec

0 20 40 60 80 100 120 140 1600

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

Hours (since Oct 3 12:00 am)

Num

ber o

f pos

tings

in 2

hou

rs

2 hours posting count (Oct 3 − Oct 9)

Sun Mon Tue Wed Thu Fri Sat

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 10: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Posting generation model

Homogeneous Poisson modelλ(t) = λ at any t

Periodic inhomogeneous Poisson modelλ(t) = λ(t− nT ), n = 1, 2, . . .

0 2 4 6 8 10 12 14 160

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

2.2x 10

5 Weekly number of postings (Sep 26 − Jan 8)

Num

ber o

f pos

tings

Week

Oct Nov Dec

0 20 40 60 80 100 120 140 1600

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

Hours (since Oct 3 12:00 am)

Num

ber o

f pos

tings

in 2

hou

rs

2 hours posting count (Oct 3 − Oct 9)

Sun Mon Tue Wed Thu Fri Sat

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 11: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Expected retrieval delay

Inhomogeneous Poisson modelrate - λ(t)retrieval time - τj−1, τj

expected delay -∫ τj

τj−1λ(t)(τj − t)dt

Homoegeneous Poisson model

expected delay -λ(τj−τj−1)2

2

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 12: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Expected retrieval delay

Inhomogeneous Poisson modelrate - λ(t)retrieval time - τj−1, τj

expected delay -∫ τj

τj−1λ(t)(τj − t)dt

Homoegeneous Poisson model

expected delay -λ(τj−τj−1)2

2

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 13: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Objective

Maximize resource utlilization to provide timelyinformaiton.

Resource allocationHow often to contact data sources?

Retrieval schedulingWhen to contact data sources within a day?

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 14: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Objective

Maximize resource utlilization to provide timelyinformaiton.

Resource allocationHow often to contact data sources?

Retrieval schedulingWhen to contact data sources within a day?

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 15: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Objective

Maximize resource utlilization to provide timelyinformaiton.

Resource allocationHow often to contact data sources?

Retrieval schedulingWhen to contact data sources within a day?

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 16: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Resource allocation

Consider n data source O1, . . . , On

λi - posting rate of Oi

wi - weight of Oi (how important)N - total number of retrievals per daymi - number of retrievals per day allocated to Oi

Optimal allocation

mi ∝√

wiλi

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 17: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Resource allocation

Consider n data source O1, . . . , On

λi - posting rate of Oi

wi - weight of Oi (how important)N - total number of retrievals per daymi - number of retrievals per day allocated to Oi

Optimal allocation

mi ∝√

wiλi

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 18: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Retrieval scheduling

m retrieval(s) per day is allocated for data sourceO, how should we schedule these m retrievals?

m = 1

m > 1

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 19: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Multiple retrievals per period

m retrievals per period are allocated, whenscheduled at time τ1, . . . , τm, the expected delay is:

D(O) =∑m

i=1

∫ τi+1

τiλ(t)(τi+1 − t)dt

τm+1 = T + τ1

Criteria for optimality

λ(τj)(τj+1 − τj) =∫ τj

τj−1λ(t)dt

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 20: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Multiple retrievals per period

Example: λ(t) = 2 + 2 sin(2πt)

0 0.2 0.4 0.6 0.8 10

0.5

1

1.5

2

2.5

3

3.5

time − t

λ(t)

τj−1

τj τ

j+1

∫τj−1

τj λ(t)dt

λ(τj)(τ

j+1−τ

j)

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 21: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Experiment

∼10k RSS feeds from Sep 21 - Dec 20 2004

Characteristics of posting generation

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 22: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Distribution of posting rate

100

101

102

103

104

105

100

101

102

103

Number of postings

Num

ber o

f RS

S fe

eds

9634 feeds

9634 RSS feeds are used

Power-law distribution

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 23: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Is posting rate stable and predictable?

0 10 20 30 40 500

5

10

15

20

25

30

35

40

45

50

Posting rate on week 1−2 (Oct 3−16)

Pos

ting

rate

on

wee

k 3−

4 (O

ct 1

7−30

) Remainingtop 50% − top 90%top 50%

0 10 20 30 40 500

5

10

15

20

25

30

35

40

45

50

Posting rate on week 1−2 (Oct 3−16)

Pos

ting

rate

on

wee

k 12

−13

(Dec

12−

25) Remainingtop 50% − top 90%top 50%

The closer to diagonal, the more the stability andpredictability

red - top 50%, green - top 80%, blue - rest

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 24: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

How much history to keep?

0 10 20 30 40 5080

100

120

140

160

180

200

220

240

260

280

300

Size of sliding window (days)

Ave

rage

del

ay (m

inut

es)

Resource allocation only: 4 retrievals per day

Reallocate resource everyday

2 weeks is a good choice

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 25: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

What is the posting pattern?

0 5 10 15 200

0.5

1

1.5

2

2.5x 10

5

Hours

Tota

l num

ber o

f pos

tings

bet

wee

n A

pr 1

− D

ec 3

1 20

04 Changes of posting rate within one day

Periodic (daily pattern)

inactive at night

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 26: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

What are the individual pattern?

0 5 10 15 200

0.05

0.1

0 5 10 15 200

0.1

0.2

0 5 10 15 200

0.2

0.4

0 5 10 15 200

0.2

0.4

0 5 10 15 200

0.2

0.4

0 5 10 15 200

0.2

0.4

1625 feeds 1346 feeds

384 feeds 377 feeds

369 feeds 305 feeds

K-mean clustering

Optimize for different patterns

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 27: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Performance

1 2 3 4 5 6 70

100

200

300

400

500

600

700

Number of retrievals per day per feed

Ave

rage

del

ay (m

inut

es)

Even schedulingRetrieval schedulingResource allocationCombined

1. Even scheduling2. Retrieval scheduling only3. Resource allocation only4. Combined

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 28: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Performance

strategy 1 2 3 4

average delay (in min) 645 581 433 395max delay (in min) 1440 1440 9120 10073standard deviation 392 405 542 560

Statistics breakdown of posting delay using one retrieval perday.

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert

Page 29: Efficient Monitoring Algorithm for Fast News Alertsia/present/monitor.pdf · Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert. How much history to keep? 0 10 20 30

Summary

Efficient MonitoringResource allocationRetrieval scheduling→ Include user access pattern (extension)

Data1 year of weblogs and half year of RSS dataFor prototype testing

Ka Cheung Sia Efficient Monitoring Algorithm for Fast News Alert