Searching the Web (Junghoo Cho, UCLA Computer Science)
Post on 19-Dec-2015
![Page 1: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/1.jpg)
1
Searching the Web
Junghoo Cho
UCLA Computer Science
![Page 2: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/2.jpg)
2
Information Galore
[Figure: legacy database, plain text files, biblio server]
![Page 3: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/3.jpg)
3
Information Overload Problem
![Page 4: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/4.jpg)
4
Solution
Indexing approach
– Google, Excite, AltaVista
Integration approach
– MySimon, BizRate
![Page 5: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/5.jpg)
5
Indexing Approach
Central Index
![Page 6: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/6.jpg)
6
Challenges
Page selection and download
– What page to download?
Page and index update
– How to update pages?
Page ranking
– What page is “important” or “relevant”?
Scalability
![Page 7: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/7.jpg)
7
Integration Approach
[Figure: a mediator on top of Source 1, Source 2, ..., Source n, each reached through its own wrapper]
![Page 8: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/8.jpg)
8
Challenges
Heterogeneous sources
– Different data models: relational, object-oriented
– Different schemas and representations: “Keanu Reeves” or “Reeves, K.” etc.
Limited query capabilities
Mediator caching
![Page 9: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/9.jpg)
9
Focus of the Talk
Indexing approach
How to keep pages up to date?
![Page 10: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/10.jpg)
10
Outline of This Talk
How can we keep pages fresh?
How does the Web change?
What do we mean by “fresh” pages?
How should we refresh pages?
![Page 11: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/11.jpg)
11
Web Evolution Experiment
How often does a Web page change?
How long does a page stay on the Web?
How long does it take for 50% of the Web to change?
How do we model Web changes?
![Page 12: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/12.jpg)
12
Experimental Setup
February 17 to June 24, 1999
270 sites visited (with permission)
– identified 400 sites with highest “PageRank”
– contacted administrators
720,000 pages collected
– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
![Page 13: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/13.jpg)
13
Average Change Interval
[Figure: bar chart of fraction of pages (0.00 to 0.35) by average change interval: 1 day, 1 day to 1 week, 1 week to 1 month, 1 month to 4 months, 4 months]
![Page 14: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/14.jpg)
14
Change Interval – By Domain
[Figure: bar chart of fraction of pages (0 to 0.6) by average change interval (1 day, 1 day to 1 week, 1 week to 1 month, 1 month to 4 months, 4 months), with one series per domain: com, netorg, edu, gov]
![Page 15: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/15.jpg)
15
Modeling Web Evolution
Poisson process with rate $\lambda$
$T$ is time to next event
$$f_T(t) = \lambda e^{-\lambda t} \quad (t > 0)$$
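As a rough illustration of the Poisson model above, the sketch below evaluates the exponential density of the time to the next change and draws change intervals from it. The function names, the sample size, and the seed are illustrative assumptions, not part of the talk.

```python
import math
import random


def interval_density(t, lam):
    """Density of the time T to the next change under a Poisson process with rate lam."""
    if t <= 0:
        return 0.0
    return lam * math.exp(-lam * t)


def sample_intervals(lam, n, seed=0):
    """Draw n change intervals; random.expovariate takes the rate lam."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) for _ in range(n)]


# A page that changes every 10 days on average has rate lam = 0.1 per day.
intervals = sample_intervals(lam=0.1, n=10_000)
mean_interval = sum(intervals) / len(intervals)  # should be close to 1 / lam = 10 days
```

With a rate of 0.1 changes per day the sampled mean interval should land near 10 days, and short intervals are more likely than long ones.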
![Page 16: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/16.jpg)
16
Change Interval of Pages
[Figure: for pages that change every 10 days on average, fraction of changes with given interval plotted against interval in days, compared with the Poisson model]
![Page 17: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/17.jpg)
17
Change Metrics
Freshness
– Freshness of element $e_i$ at time $t$ is
$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$
– Freshness of the database $S$ at time $t$ is
$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$
(Assume “equal importance” of pages)
[Figure: elements $e_i$ on the web and their copies in the database]
![Page 18: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/18.jpg)
18
Change Metrics
Age
– Age of element $e_i$ at time $t$ is
$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
– Age of the database $S$ at time $t$ is
$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$
(Assume “equal importance” of pages)
[Figure: elements $e_i$ on the web and their copies in the database]
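The two metrics can be sketched for a single snapshot of the database as follows. This is a minimal illustration under the equal-importance assumption; the data layout (`stale_since`, with `None` marking an up-to-date copy) and the example values are hypothetical.

```python
def freshness(up_to_date):
    """Database freshness: fraction of elements that are up to date (equal importance)."""
    return sum(1 for ok in up_to_date if ok) / len(up_to_date)


def age(t, stale_since):
    """Database age at time t.

    stale_since[i] is the time at which the real page i was modified after our
    last copy was taken (None if our copy is still up to date).
    """
    ages = [0.0 if m is None else t - m for m in stale_since]
    return sum(ages) / len(ages)


# Hypothetical snapshot at t = 10 days: pages 0 and 2 are current,
# page 1 changed on day 4 and page 3 changed on day 8.
stale_since = [None, 4.0, None, 8.0]
F = freshness([m is None for m in stale_since])  # 2 of 4 pages are fresh
A = age(10.0, stale_since)                       # (0 + 6 + 0 + 2) / 4
```

Freshness only records whether a copy is stale, while age also records for how long, which is why the two metrics can rank refresh policies differently.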
![Page 19: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/19.jpg)
19
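Change Metrics
[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ (starting at 0) plotted against time, with page update and refresh events marked]
Time averages:
$$F(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; t)\, dt$$
$$F(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; t)\, dt$$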
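For one page, the time average has a closed form if we assume Poisson changes with rate `lam` and evenly spaced refreshes at frequency `f`. At time s after a refresh the copy is still fresh with probability exp(-lam * s), and averaging that over one refresh interval gives (1 - exp(-r)) / r with r = lam / f. The sketch below states this closed form and checks it against a numerical average. The evenly spaced refresh assumption and the function names are ours, not taken from the slide.

```python
import math


def expected_freshness(lam, f):
    """Time-averaged freshness of a page with change rate lam refreshed f times per unit time.

    Assumes Poisson changes and evenly spaced refreshes, so the copy is fresh
    at time s after a refresh with probability exp(-lam * s).
    """
    if f <= 0:
        return 0.0  # never refreshed: a changing page goes stale and stays stale
    r = lam / f
    if r == 0:
        return 1.0  # a page that never changes is always fresh
    return (1.0 - math.exp(-r)) / r


def expected_freshness_numeric(lam, f, steps=100_000):
    """Midpoint-rule average of exp(-lam * s) over one refresh interval of length 1/f."""
    interval = 1.0 / f
    h = interval / steps
    total = sum(math.exp(-lam * (k + 0.5) * h) for k in range(steps))
    return total * h / interval
```

For example, `expected_freshness(0.1, 1 / 30)` describes a page that changes every 10 days on average and is revisited once every 30 days.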
![Page 20: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/20.jpg)
20
Trick Question
Two page database
e1 changes daily
e2 changes once a week
Can visit one page per week
How should we visit pages?
– e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional]
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– ?
[Figure: pages e1 and e2 on the web and their copies in the database]
![Page 21: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/21.jpg)
21
Proportional Often Not Good!
Visit fast changing e1
get 1/2 day of freshness
Visit slow changing e2
get 1/2 week of freshness
Visiting e2 is a better deal!
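The argument above can be checked numerically for the two-page example. The sketch below assumes Poisson changes and evenly spaced visits, so a page with change rate `lam` visited `f` times per week has time-averaged freshness (1 - exp(-lam / f)) / (lam / f). The rates (7 and 1 changes per week), the budget of one visit per week, and the policy names are our reading of the example, not code from the talk.

```python
import math


def avg_freshness(lam, f):
    """Time-averaged freshness under Poisson changes (rate lam) and f evenly spaced visits per week."""
    if f <= 0:
        return 0.0
    r = lam / f
    return (1.0 - math.exp(-r)) / r


LAM = {"e1": 7.0, "e2": 1.0}   # changes per week: daily vs. weekly
BUDGET = 1.0                    # one page visit per week in total

policies = {
    "uniform":      {"e1": BUDGET / 2, "e2": BUDGET / 2},
    "proportional": {"e1": BUDGET * 7 / 8, "e2": BUDGET * 1 / 8},
    "only e1":      {"e1": BUDGET, "e2": 0.0},
    "only e2":      {"e1": 0.0, "e2": BUDGET},
}

# Mean freshness of the two-page database under each policy.
scores = {
    name: sum(avg_freshness(LAM[p], f[p]) for p in LAM) / len(LAM)
    for name, f in policies.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions the ranking is only e2, uniform, proportional, only e1, which agrees with the point that spending visits on the slow-changing page is the better deal.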
![Page 22: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/22.jpg)
22
Optimal Refresh Frequency
Problem
Given $\lambda_1, \lambda_2, \ldots, \lambda_N$ and $f$, find
$$f_1, f_2, \ldots, f_N \quad \left( \sum_{i=1}^{N} f_i / N = f \right)$$
that maximize
$$F(S) = \frac{1}{N} \sum_{i=1}^{N} F(e_i)$$
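For two pages the problem is small enough to solve by brute force. The sketch below grid-searches the split of a fixed visit budget between the pages, using the Poisson closed form (1 - exp(-lam / f)) / (lam / f) for each page's time-averaged freshness. The grid search, the total-budget formulation, and the example rates are illustrative assumptions, not the method used in the talk.

```python
import math


def avg_freshness(lam, f):
    """Time-averaged freshness under Poisson changes (rate lam) and f evenly spaced visits."""
    if f <= 0:
        return 0.0
    r = lam / f
    return (1.0 - math.exp(-r)) / r


def best_split(lam1, lam2, budget, steps=1000):
    """Grid-search f1 in [0, budget] (with f2 = budget - f1) for the highest mean freshness."""
    best_f1, best_score = 0.0, -1.0
    for k in range(steps + 1):
        f1 = budget * k / steps
        f2 = budget - f1
        score = (avg_freshness(lam1, f1) + avg_freshness(lam2, f2)) / 2
        if score > best_score:
            best_f1, best_score = f1, score
    return best_f1, budget - best_f1, best_score


# Two-page example: e1 changes 7 times a week, e2 once a week, one visit per week in total.
f1, f2, score = best_split(lam1=7.0, lam2=1.0, budget=1.0)
```

For this instance the search assigns the whole budget to the slow-changing page (f1 = 0, f2 = 1). With many pages a grid search no longer scales, and a constrained optimizer would be needed instead.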
![Page 23: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/23.jpg)
23
Optimal Refresh Frequency
• Shape of curve is the same in all cases
• Holds for any change frequency distribution
![Page 24: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/24.jpg)
24
Optimal Refresh for Age
• Shape of curve is the same in all cases
• Holds for any change frequency distribution
![Page 25: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/25.jpg)
25
Comparing Policies

| Policy | Freshness | Age |
|---|---|---|
| Proportional | 0.12 | 400 days |
| Uniform | 0.57 | 5.6 days |
| Optimal | 0.62 | 4.3 days |

Based on statistics from the experiment and a revisit frequency of once every month
![Page 26: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/26.jpg)
26
Not Every Page is Equal!
Some pages are “more important”
[Figure: pages e1 and e2, one accessed by users 20 times/day and the other 10 times/day]
For these two pages, $F(S)$ weights $F(e_1)$ and $F(e_2)$ in the ratio 2 : 1 and divides the weighted sum by 3.
In general,
$$F(S) = \sum_{i=1}^{N} w_i F(e_i) \Big/ \sum_{i=1}^{N} w_i$$
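The weighted form of database freshness can be sketched as below. The per-page freshness values are made up for illustration, and using daily access counts directly as weights is an assumption suggested by the example.

```python
def weighted_freshness(freshness, weights):
    """Weighted average: sum_i w_i * F(e_i) / sum_i w_i."""
    if len(freshness) != len(weights):
        raise ValueError("freshness and weights must have the same length")
    total_weight = sum(weights)
    return sum(w * f for w, f in zip(weights, freshness)) / total_weight


# Hypothetical freshness values for two pages; weights are daily access counts.
page_freshness = [0.9, 0.3]
accesses_per_day = [20, 10]
score = weighted_freshness(page_freshness, accesses_per_day)  # (20*0.9 + 10*0.3) / 30
```

With equal weights the function reduces to the unweighted mean used earlier, so the unweighted metric is the special case where every $w_i$ is 1.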
![Page 27: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/27.jpg)
27
Weighted Freshness
[Figure: curves for w = 1 and w = 2; axis label f]
![Page 28: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/28.jpg)
28
Change Frequency Estimation
How to estimate change frequency?
– Naïve Estimator: X/T
– X: number of detected changes
– T: monitoring period
– 2 changes in 10 days: 0.2 times/day
Incomplete change history
[Figure: timeline with 1-day spacing marking page changes, page visits, and detected changes]
![Page 29: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/29.jpg)
29
Improved Estimator
Based on the Poisson model
– X: number of detected changes
– N: number of accesses
– f: access frequency
[Equation: change frequency estimated from f and the logarithm of a ratio involving N and X]
3 changes in 10 days: 0.36 times/day
Accounts for “missed” changes
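The exact formula on this slide is not recoverable from the transcript, so the sketch below is an assumption: it uses the plain log-based estimator that follows from the Poisson model, and the version in the talk may carry extra bias-correction terms. Under that model a visit made 1/f after the previous one sees no change with probability exp(-lam / f), and the observed fraction (N - X) / N estimates that probability. This form reproduces the 0.36 times/day quoted above for 3 detected changes in 10 daily accesses.

```python
import math


def naive_estimate(changes_detected, monitoring_period):
    """Naive estimator X / T: detected changes per unit time."""
    return changes_detected / monitoring_period


def log_estimate(changes_detected, accesses, access_frequency):
    """Log-based estimator -f * log((N - X) / N) (assumed form, see the text above).

    Several changes between two visits are detected as a single change, so the
    naive count X undercounts; the logarithm corrects for those missed changes.
    """
    if changes_detected >= accesses:
        raise ValueError("every access detected a change; the estimate is unbounded")
    return -access_frequency * math.log((accesses - changes_detected) / accesses)


naive = naive_estimate(2, 10)          # 2 changes in 10 days
improved = log_estimate(3, 10, 1.0)    # 3 changes in 10 daily accesses
```

For the same counts the log-based estimate is always at least as large as the naive one, and the gap widens as the page changes more often relative to the visit frequency.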
![Page 30: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/30.jpg)
30
Improvement Significant?
Application to a Web crawler
– Visit pages once every week for 5 weeks
– Estimate change frequency
– Adjust revisit frequency based on the estimate
» Uniform: do not adjust
» Naïve: based on the naïve estimator
» Ours: based on our improved estimator
![Page 31: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/31.jpg)
31
Improvement from Our Estimator

| Policy | Detected changes | Ratio to uniform |
|---|---|---|
| Uniform | 2,147,589 | 100% |
| Naïve | 4,145,582 | 193% |
| Ours | 4,892,116 | 228% |

(9,200,000 visits in total)
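The ratio column can be recomputed from the detected-change counts as a quick arithmetic check. The variable names are ours, and the counts are copied from the table above.

```python
# Detected changes per policy, as reported above.
detected = {"Uniform": 2_147_589, "Naive": 4_145_582, "Ours": 4_892_116}

baseline = detected["Uniform"]
# Ratio to the uniform policy, rounded to a whole percentage.
ratio_to_uniform = {name: round(100 * n / baseline) for name, n in detected.items()}
```

The rounded ratios come out to 100, 193, and 228 percent, matching the reported column.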
![Page 32: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/32.jpg)
32
Summary
Information overload problem
– Indexing approach
– Integration approach
Page update
– Web evolution experiment
– Change metric
– Refresh policy
– Frequency estimator
![Page 33: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/33.jpg)
33
Research Opportunity
Efficient query processing?
Automatic source discovery?
Automatic data extraction?
![Page 34: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/34.jpg)
34
Web Archive Project
Can we store the history of the Web?
– Web is ephemeral
– Study of the evolution of the Web
Challenges
– Update policy?
– Compression?
– New storage structure?
– New index structure?
![Page 35: 1 Searching the Web Junghoo Cho UCLA Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d3a5503460f94a14fb0/html5/thumbnails/35.jpg)
35
The End
Thank you for your attention
For more information visit
http://www.cs.ucla.edu/~cho/