instructor: p.krishna reddy e-mail: [email protected]@iiit.net pkreddy 1)efficient crawling...
TRANSCRIPT
Instructor: P.Krishna Reddy
E-mail: [email protected]
http://www.iiit.net/~pkreddy
1) Efficient crawling through URL ordering, 7th WWW Conf.
2) The PageRank Citation Ranking: Bringing order to the Web, 7th WWW Conf.
Course Roadmap• Introduction to Data mining
– Introduction, Preprocessing, Clustering, Association Rules, Classification,…• Web structure mining
– Authoritative Sources in a Hyperlinked environment, by John.M.Kleonberg– Efficient Crawling through URL ordering, J.Cho, H.Garcia Molina and L.Page– The PageRank Citation Ranking: Bringing order to the Web. – The anatomy large scale Hyper-textual Web Search engine, Sergey Brin and Lawrance
Page– Graph Structure in the Web– Focused Crawling: A new approach to topic-specific web resource discovery, by Souman
Chakravarthi et al.– Trawling the web for emerging cyber communities, Ravi Kumar…– Building a cyber-community hierarchy based on link analysis, P.Krishna Reddy and Masaru
Kitsuregawa– Efficient identification of web communities, GW Flake, S..Lawrence, et al.– Finding related pages in WWW, J Dean and MR Herizinger
• Web content mining/Information Retrieval• Web log mining/Recommendation systems/E-commerce
Efficient Crawling Through URL ordering
• Introduction
• Important metrics
• Problem definition
• Ordering metrics
• Experiments
Introduction
• A crawler is a program that retrieves web pages; used by search engine of web cache.
• It starts off with the URL of initial page P0.• It retrievs P0, extracts any URLs in it and adds
them to a queue of URLs to be scanned.• The crawler get the URLS in some order.• It must carefully decide what URL to scan in
what order.• It must also decide how frequently to revisit the
web pages it has already seen.
In this paper
• How to select URLs to scan from its queue of known URLs.
• Most crawlers may not want to visit all the URLs.– The client may have limited storage capacity.– Crawling takes time
• So, it is important to visit important pages first.
Important metrics
• Given a web page P, we can define the importance of web page, I(P) in one of the following ways.– Similarity to driving query Q
• If a query Q drives the crawling process and I(P) is defined as textual similarity between P and Q.
• To compute similarity we can view each document an m- dimensional vector (w1,w2,…,wn) where wi represents ith word.
– Back link count (IB(P)): The value of I(P) is the number of links to P that appear over the entire Web.
• A crawler may estimate IB(P) with the # of links to P that have been seen so far.
Important metrics….
– Page rank: IB(P) metric treats all links equally.• Example: A link from Yahoo home page counts
same as the link from the individual home page.• So, a page rank back link metric IR(P) recuursively
defines the importance of the page to be the weighted sum of back links to it.
– Suppose a page p is pointed by pages T1, …,Tn.– Let Ci be the number of links going out of page Ti.– Let d be a damping factor– IR(P)=(1-d)+d(IR(T1)/C1+….+IR(Tn)/Cn)– One equation per web page.– d when the user is on a page, there is some
probability ‘d’ that next visited page will be completely random.
Important metrics….
– Forward link count:• IF(P) # of links that emanate from P.
– Location metric:• IL(P) is a function of its location not of its
contents.• Example:
– URLs ending with .com may be more useful than URLs with other endings.
– URLs with fewer slashes more useful than with more slashes.
• We can also combine the importance.– IC(P)= K1.IS(P,Q) +K2.IB(P)
Problem definition
• Define a crawler that if possible visits high I(P) pages before lower ranked pages for some definition of I(P).
• Options:– Crawl and stop: after visiting K pages– Crawl and stop with threshold: IP(P) >= G is
considered HOT.– Limited buffer crawl: After filling the buffer
drop unimportant pages, lowest I(P) value.
Ordering Metrics
• A crawler keeps a queue of URLs it has seen during a crawl, and selects next URL from this queue.
• The ordering metric O is used by the crawler for this selection.– It selects the URL u such that O(u) has the
highest value among all URLs in the queue.– The O metric can use the information seen.– The O metric is based on importance metric.
Experiments
• Data set– Repository consisting of all Stanford web
pages.– Experiments are conducted on this data set.– # of pages=179,000
Results• X-axis: Fraction of Stanford pages crawled over time.• Y-axis: fraction of the total hot pages that has been
crawled at a given point (Pst).• Threshold G=3,10,100.• A page with G or more back links considered as hot.• Crawl and stop model is used with G=100.• Ordering metrics:
– Random– Ideal– Breadth first: Do nothing– Back link: Sorl URL-queue by back-link count– Pagerank: Sort URL queue with IR[u]
• IR(u)=(1-0.9)+0.9 IR(vi)/ci, where (vi,u) ε links and ci is the number of links in the page vi.
Results
• Page rank metric outperforms back link metric.
• Reason– IB(P) carwler behaved like depth first one by
frequently visiting pages within one cluster.– IR(p) crawler combined breath and depth in a
better way, because it is based on ranking.
Similarity based crawlers
• Similarity based importance metric: IS(P) measures the relevance of each page to a topic or the query the user has in mind.
• For first two experiments: A page is hot if it contains computer in its title or if it has more than 10 occurrences of computer in its body.
• Two queues are maintained:– Hot queue– URL queue– Crawler prefers Hot queue.
• Observation:– When similarity is important it is effective to use ordering metric
that considers the content of anchors and URLs and the distance to the hot pages that have been discovered.
Conclusion
• Page rank is excellent ordering metric when pages with many backlinks or with high Page rank or sought.
• If similarity to a driving query is important, it is useful to visit the URLs that:– Have anchor text that is similar to the driving
query.– Have some of the query terms within the URL
itself.– Have a short link distance to a page that is
known to be hot.
Contents
• Motivation
• Related work
• Page Rank & Random Surfer Model
• Implementation
• Application
• Conclusion
MotivationMotivationWeb: heterogeneous and unstructuredThe pages are diverse
Containing casual activities to research papers.
Free of quality control on the webCommercial interest to manipulate rankingSimplicity of publishing web pages results in a
large fraction of low quality web pages that the users are unlikely to read.
Related Work
• Academic citation analysis– Theory of information flow in academic
community (epidemic process).
• Link-based analysis– Charecterizing world wide web ecologies, Jim
Pitkow.
• Clustering methods of link structure• Hubs & Authorities Model• Notion of Quality from library community
Backlink• Link Structure of the Web• Every page has some number of forward links and back
links.• Highly back linked pages are more important than with
few links.• Page rank is an approximation of importance / quality
PageRank• Pages with lots of backlinks are important• Backlinks coming from important pages convey
more importance to a page• Let u be a web page. Fu : the set of pages u
points to Bu : the set of pages that point to u• Nu = | Fu |
• Problem: Rank Sink
uBv vN
vRcuR
)()(
Rank Sink• Page cycles pointed by some incoming
link
• Problem: this loop will accumulate rank but never distribute any rank outside
Escape Term• Let E(u) be some vector over the web pages that
corresponds to a source of rank. The page rank of a set of web pages is an assignment, R’, to the the web pages which satisfies
• c is maximized and = 1• E(u) is some vector over the web pages
– uniform, favorite page etc.
)()(
)( ucEN
vRcuR
uBv v
1R
Matrix Notation
• R is the dominant eigenvector and c is the dominant eigenvalue of because c is maximized
ReEAcR TT )(
)( TeEA
Computing PageRank - initialize vector over web pages
loop:
- new ranks sum of normalized backlink ranks
- compute normalizing factor
- add escape term
- control parameter
while - stop when converged
SR 0
iT
i RAR 1
111 ii RRd
dERR ii 11
ii RR 1
Random Surfer Model
• Page Rank corresponds to the probability distribution of a random walk on the web graphs
• E(u) can be re-phrased as the random surfer gets bored periodically and jumps to a different page and not kept in a loop forever
Implementation
• Computing resources — 24 million pages — 75 million URLs
• Memory and disk storage
Weight Vector
(4 byte float)
Matrix A (linear access)
Implementation (Con't)
• Unique integer ID for each URL
• Sort and Remove dangling links
• Rank initial assignment
• Iteration until convergence
• Add back dangling links and Re-compute
Convergence Properties
• Graph (V, E) is an expander with factor if for all (not too large) subsets S: |As| |s|
• Random walk converges fast to a limiting probability distribution on a set of nodes in the graph.
Convergence Properties (con't)
• PageRank computation is O(log(|V|)) due to rapidly mixing graph G of the web.
Personalized PageRank• Rank Source E can be initialized :
– uniformly over all pages: e.g. copyright warnings, disclaimers, mailing lists archives
result in overly high ranking– total weight on a single page, e.g. Netscape, McCarthy
great variation of ranks under different single pages as rank source
– and everything in-between, e.g. server root pages
allow manipulation by commercial interests
Applications I
• Estimate web traffic– Server/page aliases
– Link/traffic disparity,
• Backlink predictor– Citation counts have been used to predict future citations
– very difficult to map the citation structure of the web completely
– avoid the local maxima that citation counts get stuck in and get better performance
Applications II - Ranking Proxy
• Surfer's Navigation Aid
• Annotating links by PageRank (bar graph)
• Not query dependent
Issues• Users are no random walkers – Content based methods
• Starting point distribution – Actual usage data as starting vector
• Reinforcing effects/bias towards main pages• How about traffic to ranking pages?• No query specific rank• Linkage spam – PageRank favors pages that managed to get other pages to link to them – Linkage not necessarily a sign of relevancy, only of promotion (advertisement…)