![Page 1: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/1.jpg)
December 20, 2002 CUL Metadata WG Meeting 1
Focused Crawling and Collection Synthesis
Donna Bergmark
Cornell Information Systems
![Page 2: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/2.jpg)
Outline
• Crawlers
• Collection Synthesis
• Focused Crawling
• Some Results
• Student Project (Fall 2002)
![Page 3: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/3.jpg)
Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.
![Page 4: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/4.jpg)
Crawlers – some background
• Resource discovery
• Crawlers and internet history
• Crawling and crawlers
• Mercator
![Page 5: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/5.jpg)
Resource Discovery
• Finding info on the Web
– Surfing (random strategy, goal is serendipity)
– Searching (inverted indices; specific info)
– Crawling (“all” the info)
• Uses for crawling
– Find stuff
– Gather stuff
– Check stuff
![Page 6: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/6.jpg)
Crawlers and internet history
• 1991: HTTP
• 1992: 26 servers
• 1993: 60+ servers; self-register; archie
• 1994 (early) – first crawlers
• 1996 – search engines abound
• 1998 – focused crawling
• 1999 – web graph studies
• 2002 – use for digital libraries
![Page 7: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/7.jpg)
Crawling and Crawlers
• Web overlays the internet
• A crawl overlays the web
[Figure label: seed]
![Page 8: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/8.jpg)
Crawler Issues
• The web is so big
• Visit Order
• The URL itself
• Politeness
• Robot Traps
• The hidden web
• System Considerations
![Page 9: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/9.jpg)
Standard for Robot Exclusion
• Martijn Koster (1994)
• http://any-server:80/robots.txt
• Maintained by the webmaster
• Forbid access to pages, directories
• Commonly excluded: /cgi-bin/
• Adherence is voluntary for the crawler
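A polite crawler can check the exclusion file before each fetch. The following is a minimal Python sketch using the standard library's `urllib.robotparser`; the robots.txt content, the host name, and the user-agent string `example-crawler` are made up for illustration and do not come from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a webmaster might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

def make_checker(robots_txt):
    """Parse robots.txt text and return a parser that answers can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

checker = make_checker(ROBOTS_TXT)

# The crawler asks before downloading; nothing forces it to obey the answer.
blocked = checker.can_fetch("example-crawler", "http://any-server/cgi-bin/search")
allowed = checker.can_fetch("example-crawler", "http://any-server/index.html")
print(blocked, allowed)
```

In a live crawl the parser would be fed from `http://any-server/robots.txt` (for example with `set_url` and `read`) and cached per host; parsing a literal string keeps the sketch self-contained.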
![Page 10: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/10.jpg)
Robot Traps
• Cycles in the Web graph
• Infinite links on a page
• Traps set out by the Webmaster
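Two common defenses against such traps are remembering which URLs have been seen (which breaks cycles) and refusing suspiciously deep paths (which limits endlessly generated links). The Python sketch below illustrates both; the function name `should_visit` and the depth limit are assumptions chosen for the example, not settings from the talk or from Mercator.

```python
from urllib.parse import urldefrag, urlsplit

MAX_PATH_DEPTH = 8  # hypothetical limit; a real crawler would tune this

def should_visit(url, seen):
    """Return True only for URLs that are new and not suspiciously deep."""
    url, _fragment = urldefrag(url)          # "#top" variants are the same page
    if url in seen:                          # already queued: breaks cycles
        return False
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_PATH_DEPTH:       # e.g. /a/a/a/a/... generated forever
        return False
    seen.add(url)
    return True
```

This guards against accidental traps only; deliberately hostile pages usually need extra heuristics, such as per-host page budgets.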
![Page 11: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/11.jpg)
The Hidden Web
• Dynamic pages increasing
• Subscription pages
• Username and password pages
• Research in progress on how crawlers can “get into” the hidden web
![Page 12: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/12.jpg)
System Issues
• Crawlers are complicated systems
• Efficiency is of utmost importance
• Crawlers are demanding of system and network resources
![Page 13: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/13.jpg)
![Page 14: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/14.jpg)
Mercator Features
• Written in Java
• One file configures a crawl
• Can add your own code
– Extend one or more of Mercator’s base classes
– Add totally new classes called by your own
• Industrial-strength crawler:
– uses its own DNS and java.net package
![Page 15: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/15.jpg)
Collection Synthesis
• The NSDL
– National Science Digital Library
– Educational materials for K-thru-grave
– A collection of digital collections
• Collection (automatically derived)
– 20-50 items on a topic, represented by their URLs, expository in nature, precision trumps recall
![Page 16: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/16.jpg)
Crawler is the Key
• A general search engine is good for precise results, few in number
• A search engine must cover all topics, not just scientific
• For automatic collection assembly, a Web crawler is needed
• A focused crawler is the key
![Page 17: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/17.jpg)
Focused Crawling
![Page 18: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/18.jpg)
Focused Crawling
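[Figure: two panels, “Breadth-first crawl” and “Focused crawl”, each showing numbered nodes reached from a root R; in the focused crawl, two branches are marked X]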
![Page 19: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/19.jpg)
Collections and Clusters
• Traditional – document universe is divided into clusters, or collections
• Each collection represented by its centroid
• Web – size of document universe is infinite
• Agglomerative clustering is used instead
• Two aspects:
– Collection descriptor
– Rule for when items belong to that Collection
![Page 20: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/20.jpg)
[Figure: panels labeled Q = 0.2 and Q = 0.6]
![Page 21: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/21.jpg)
The Setup
A virtual collection of items about Chebyshev Polynomials
![Page 22: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/22.jpg)
Adding a Centroid
An empty collection of items about Chebyshev Polynomials
![Page 23: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/23.jpg)
Document Vector Space
• Classic information retrieval technique
• Each word is a dimension in N-space
• Each document is a vector in N-space
Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>
• Normalize the weights
Both the “centroid” and the downloaded document are term vectors
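Because both the centroid and a downloaded document are term vectors, they can be compared directly. Below is a minimal Python sketch of the cosine correlation between two sparse term vectors, assuming each vector is stored as a dict from term to weight; the function names and sample weights are illustrative only.

```python
import math

def normalize(vec):
    """Scale a sparse term vector (term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def correlation(a, b):
    """Cosine correlation: dot product of the two normalized vectors."""
    a, b = normalize(a), normalize(b)
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical centroid and document vectors.
centroid = {"chebyshev": 0.9, "polynomial": 0.4}
document = {"chebyshev": 2.0, "polynomial": 1.0, "matlab": 1.0}
print(round(correlation(centroid, document), 3))
```

Terms missing from a dict count as weight 0, so only terms shared by both vectors contribute to the score.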
![Page 24: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/24.jpg)
Agglomerate
A collection with 3 items about Chebyshev Polynomials
![Page 25: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/25.jpg)
Where does the Centroid come from?
“Chebyshev Polynomials”
A really good centroid for a collection about Chebyshev Polynomials
![Page 26: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/26.jpg)
Building a Centroid
1. Google(“Chebyshev Polynomials”) → {url1 … urln}
2. Let H be a hash (k, v) where k = word, v = frequency
3. For each url in {url1 … urln} do
   D ← download(url)
   V ← term vector(D)
   For each term t in V do
      If t not in H, add it; H(t)++
4. Compute tf-idf weights. C ← top 20 terms.
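The steps above can be sketched in Python. This is an illustration under stated assumptions, not the code used in the project: the search and download steps are skipped (the function takes already-downloaded page texts), the tokenizer is deliberately naive, and the IDF values are passed in as a hypothetical `idf` mapping with a default of 1.0 for unknown terms.

```python
import re
from collections import Counter

def term_vector(text):
    """Word counts for one document (naive lowercase tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def build_centroid(documents, idf, top_k=20):
    """Accumulate term frequencies, weight by IDF, keep the top_k terms."""
    h = Counter()                      # H on the slide: word -> frequency
    for text in documents:
        h.update(term_vector(text))    # adds each term's count into H
    weights = {t: f * idf.get(t, 1.0) for t, f in h.items()}
    # Sort by descending weight, then alphabetically so ties are stable.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    norm = sum(w * w for _, w in top) ** 0.5
    return {t: w / norm for t, w in top} if norm else {}

# Hypothetical page texts and IDF values.
docs = ["Chebyshev polynomials are orthogonal polynomials",
        "the Chebyshev nodes"]
centroid = build_centroid(docs, idf={"the": 0.0, "are": 0.0}, top_k=2)
print(sorted(centroid))
```

Normalizing the result means the centroid can be compared with document vectors by a plain dot product.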
![Page 27: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/27.jpg)
Dictionary
• Given centroids C1, C2, C3 …
• Dictionary is C1 + C2 + C3 …
– Terms are union of terms in Ci
– Term Frequencies are total frequency in Ci
– Document Frequency is how many C’s have t
– Term IDF is as from Berkeley
• Dictionary is 300-500 terms
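The merge described above might look like the following Python sketch, assuming each centroid is a dict from term to frequency. The function name `build_dictionary` and the sample centroids are illustrative; the externally sourced IDF values are not modeled here.

```python
from collections import Counter

def build_dictionary(centroids):
    """Merge centroids into total term frequencies and document frequencies."""
    term_freq = Counter()   # total frequency of each term over all centroids
    doc_freq = Counter()    # number of centroids that contain each term
    for centroid in centroids:
        term_freq.update(centroid)          # adds the centroid's frequencies
        doc_freq.update(centroid.keys())    # counts each term once per centroid
    return term_freq, doc_freq

# Hypothetical centroids for two topics.
c1 = {"chebyshev": 5, "polynomial": 3}
c2 = {"polynomial": 2, "fourier": 4}
term_freq, doc_freq = build_dictionary([c1, c2])
print(len(term_freq), term_freq["polynomial"], doc_freq["polynomial"])
```

The key set of `term_freq` is the union of terms, which is what bounds the dictionary size.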
![Page 28: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/28.jpg)
Focused Crawling
• Recall the cartoon for a focused crawl:
• A simple way to do it is with 2 “knobs”
[Figure: focused crawl from root R over numbered nodes, with two branches marked X]
![Page 29: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/29.jpg)
Focusing the Crawl
• Threshold: page is on-topic if correlation to the closest centroid is above this value
• Cutoff: follow links from pages whose “distance” from closest on-topic ancestor is less than the cutoff
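The two knobs can be illustrated with a small Python sketch over an in-memory link graph. Everything named here is an assumption made for the example: `links` stands in for downloading and link extraction, `score` stands in for the correlation to the closest centroid, and distance is taken to be the number of consecutive off-topic pages since the closest on-topic ancestor. The comparisons follow the slides literally (on-topic when corr >= threshold; follow links when distance < cutoff).

```python
from collections import deque

def focused_crawl(links, score, seeds, threshold, cutoff):
    """Return on-topic URLs reached from the seeds, in visit order."""
    collection = []
    seen = set(seeds)
    frontier = deque((url, 0) for url in seeds)   # (url, parent's distance)
    while frontier:
        url, parent_distance = frontier.popleft()
        on_topic = score(url) >= threshold
        # Distance resets to 0 on an on-topic page, else grows by one link.
        distance = 0 if on_topic else parent_distance + 1
        if on_topic:
            collection.append(url)
        if distance < cutoff:                     # otherwise prune this branch
            for child in links.get(url, []):
                if child not in seen:
                    seen.add(child)
                    frontier.append((child, distance))
    return collection

# Hypothetical graph: "a" and "d" are off-topic pages hiding good pages.
links = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "d": ["e"]}
scores = {"seed": 0.9, "a": 0.1, "b": 0.8, "c": 0.9, "d": 0.0, "e": 0.9}
print(focused_crawl(links, scores.get, ["seed"], threshold=0.5, cutoff=1))
print(focused_crawl(links, scores.get, ["seed"], threshold=0.5, cutoff=2))
```

Under this reading, a cutoff of 1 follows links only from on-topic pages, while a cutoff of 2 lets the crawl pass through one off-topic page to reach on-topic pages behind it.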
![Page 30: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/30.jpg)
Illustration
[Figure: crawl tree with numbered nodes; legend: Corr >= threshold; Cutoff = 1]
![Page 31: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/31.jpg)
Min-avg-max correlation vs. crawl length
[Chart: correlation (0 to 0.8) vs. number of documents downloaded (0 to 120,000); series: Maximum, Average, Minimum; annotations: Closest, Furthest]
![Page 32: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/32.jpg)
Collection “Evaluation”
• Assume higher correlations are good
• With human relevance assessments, one can also compute a “precision” curve
• Precision P(n) after considering the n most highly ranked items is the number of relevant items among them, divided by n.
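The precision curve defined above is straightforward to compute once the items are ranked and judged. A minimal Python sketch follows; the function name and the relevance judgments are made up for illustration.

```python
def precision_curve(relevant_flags):
    """P(n) for n = 1..N, given relevance judgments in rank order."""
    curve = []
    hits = 0
    for n, is_relevant in enumerate(relevant_flags, start=1):
        hits += bool(is_relevant)      # relevant items among the top n
        curve.append(hits / n)
    return curve

# Hypothetical judgments for the four highest-ranked items.
judgments = [True, True, False, True]
print(precision_curve(judgments))
```

The input list is assumed to be sorted by descending correlation, so position n in the output corresponds to rank n.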
![Page 33: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/33.jpg)
Cutoff = 0
Threshold = 0.3
![Page 34: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/34.jpg)
Precision vs. Rank
[Chart: precision (0 to 1.2) vs. rank (0 to 60); series: Crawling]
![Page 35: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/35.jpg)
Tunneling with Cutoff
• Nugget – dud – dud… - dud – nugget
Notation: 0 – X – X … - X – 0
• Fixed cutoff: 0 – X1 – X2 - … Xc
• Adaptive cutoff: 0 – X1 – X2 - … X?
![Page 36: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/36.jpg)
Statistics Collected
• 500,000 documents
• Number of seeds: 4
• Path data for all but seeds
• 6620 completed paths (0-x…x-0)
• 100,000s incomplete paths (0-x…x..)
![Page 37: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/37.jpg)
Nuggets that are x steps from a nugget
[Chart: number of nuggets (0 to 1000) vs. X, the number of links from a nugget (0 to 14)]
![Page 38: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/38.jpg)
Nuggets that are x steps from a seed and/or a nugget
[Chart: number of nuggets (0 to 1200) vs. X, the number of links (0 to 14); legend: from nugget, from seeds]
![Page 39: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/39.jpg)
Better parents have better children.
[Chart: vertical axis “Number of nodes” (0 to 0.25); horizontal axis “Correlation bracket” (1 to 17); series: General Population, children of .45-.5 nodes]
![Page 40: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/40.jpg)
Using the Empirical Observations
• Use the path history
• Use the page quality - cosine correlation
• Current distance should increase exponentially as you get away from quality nodes
Distance = 0 if this is a nugget; otherwise:
1 or (1 − corr) × exp(2 × parent’s distance / cutoff)
![Page 41: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/41.jpg)
Results
• Details in the ECDL paper
• Smaller frontier → more docs/second
• More documents downloaded in same time
• Higher-scoring documents were downloaded
• Cutoff of 20 averaged 7 steps at the cutoff
![Page 42: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/42.jpg)
Fall 2002 Student Project
[Diagram labels: Query; Mercator; Centroid Collection Description; Term vectors; Centroids, Dictionary; Collection URLs; Chebyshev P.s; HTML]
![Page 43: Focused Crawling and Collection Synthesis](https://reader035.vdocument.in/reader035/viewer/2022062517/56812e55550346895d93fbb7/html5/thumbnails/43.jpg)
Conclusion
• We’ve covered crawling – history, technology, use
• Focused crawling with tunneling
• Adaptive cutoff with tunneling
• We have a good experimental setup for exploring automatic collection synthesis