
Page 1: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Crawling

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Reading 20.1, 20.2 and 20.3

Page 2: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Spidering

24 hours a day, 7 days a week, "walking" over a graph

What about the graph? It has a bow-tie structure. Directed graph G = (N, E):
  N changes (insert, delete): >> 50 × 10^9 nodes
  E changes (insert, delete): > 10 links per node

10 × 50 × 10^9 = 500 × 10^9 1-entries in the adjacency matrix

Page 3: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Crawling Issues

How to crawl?
  Quality: "best" pages first
  Efficiency: avoid duplication (or near-duplication)
  Etiquette: robots.txt, server-load concerns (minimize load)

How much to crawl? How much to index?
  Coverage: how big is the Web? How much do we cover?
  Relative coverage: how much do competitors have?

How often to crawl?
  Freshness: how much has changed?

How to parallelize the process?

Page 4: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Page selection

Given a page P, define how “good” P is.

Several metrics:
  BFS, DFS, Random
  Popularity-driven (PageRank, full vs. partial)
  Topic-driven or focused crawling
  Combined

Page 5: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Is this page a new one?

Check whether the file has been parsed or downloaded before:
  after 20 million pages, we have "seen" over 200 million URLs
  each URL is at least 100 bytes on average
  overall, about 20 GB of URLs

Options: compress URLs in main memory, or use disk
  Bloom filter (Archive)
  Disk access with caching (Mercator, AltaVista)
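
A minimal sketch of the Bloom-filter option for the URL-seen test (the filter size, the number of hash functions, and the use of MD5 slices are illustrative choices, not the Archive's actual implementation). A Bloom filter may report "seen" for a URL that was never inserted (false positive), but never the opposite.

import hashlib

class BloomFilter:
    """Minimal Bloom filter for the URL-seen test (illustrative parameters)."""
    def __init__(self, n_bits=1 << 24, n_hashes=5):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, url):
        # Derive n_hashes bit positions from one MD5 digest of the URL.
        digest = hashlib.md5(url.encode()).digest()
        for i in range(self.n_hashes):
            chunk = int.from_bytes(digest[i*3:i*3 + 3], "big")   # 24-bit slices
            yield chunk % self.n_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()
if "http://www.di.unipi.it/" not in seen:
    seen.add("http://www.di.unipi.it/")   # crawl it, then mark it as seen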

Page 6: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Link Extractor:
  while (<Page Repository is not empty>) {
    <take a page p (check if it is new)>
    <extract links contained in p within href>
    <extract links contained in javascript>
    <extract .....>
    <insert these links into the Priority Queue>
  }

Downloaders:
  while (<Assigned Repository is not empty>) {
    <extract url u>
    <download page(u)>
    <send page(u) to the Page Repository>
    <store page(u) in a proper archive, possibly compressed>
  }

Crawler Manager:
  while (<Priority Queue is not empty>) {
    <extract some URLs u having the highest priority>
    foreach u extracted {
      if ( (u ∉ "Already Seen Pages") ||
           (u ∈ "Already Seen Pages" && <u's version on the Web is more recent>) ) {
        <resolve u wrt DNS>
        <send u to the Assigned Repository>
      }
    }
  }

Crawler "cycle of life": Link Extractor → Priority Queue (PQ) → Crawler Manager → Assigned Repository (AR) → Downloaders → Page Repository (PR) → Link Extractor
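
A minimal single-threaded sketch of this cycle in Python. In a real crawler the three components run concurrently; here the priority function, the href regex, and the page limit are toy assumptions.

import re
import heapq
import urllib.request

HREF_RE = re.compile(r'href="(http[^"]+)"')

def crawl(seeds, max_pages=100):
    already_seen = set()
    pq = [(0, url) for url in seeds]          # (priority, URL); lower = better
    heapq.heapify(pq)
    page_repository = {}

    while pq and len(page_repository) < max_pages:
        _, url = heapq.heappop(pq)            # Crawler Manager: highest priority first
        if url in already_seen:
            continue
        already_seen.add(url)
        try:                                  # Downloader
            page = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        page_repository[url] = page           # store a (possibly compressed) copy
        for link in HREF_RE.findall(page):    # Link Extractor
            if link not in already_seen:
                heapq.heappush(pq, (len(link), link))   # toy priority: shorter URL first
    return page_repository

# pages = crawl(["http://www.di.unipi.it/"])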

Page 7: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Parallel Crawlers

The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication.

Dynamic assignment
  A central coordinator dynamically assigns URLs to crawlers
  Discovered links are given to the central coordinator (bottleneck?)

Static assignment
  The Web is statically partitioned and assigned to crawlers
  Each crawler crawls only its part of the Web

Page 8: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Two problems

Load balancing the #URLs assigned to downloaders:
  Static schemes based on hosts may fail
    www.geocities.com/…. vs. www.di.unipi.it/
  Dynamic "relocation" schemes may be complicated

Managing fault tolerance:
  What about the death of a downloader? D → D-1, new hash!!!
  What about a new downloader? D → D+1, new hash!!!

Let D be the number of downloaders.
hash(URL) maps a URL to {0, ..., D-1}.
Downloader x fetches the URLs U s.t. hash(U) = x

Page 9: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

A nice technique: Consistent Hashing

A tool for: spidering, Web caching, P2P, router load balancing, distributed file systems

Items and servers are mapped to the unit circle; item K is assigned to the first server N such that ID(N) ≥ ID(K)

What if a downloader goes down?

What if a new downloader appears?

Each server gets replicated log S times

[monotone] adding a new server moves items only from one old server to the new one
[balance] Prob. an item goes to a given server is ≤ O(1)/S
[load] any server gets ≤ (I/S) log S items w.h.p. (I = #items)
[scale] you can replicate each server more times...
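
A small sketch of consistent hashing for assigning URLs to downloaders (the hash function, replica count, and server names are illustrative). Removing a downloader only reassigns the URLs it owned; the rest of the assignment is untouched.

import bisect
import hashlib

def _h(key):
    """Map a string to a point on the circle (here: a 64-bit integer)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class ConsistentHash:
    """Consistent hashing with several replicas ("virtual nodes") per server."""
    def __init__(self, servers, replicas=16):
        self.replicas = replicas
        self._points = []                     # sorted (point, server) pairs
        for s in servers:
            self.add(s)

    def add(self, server):
        for r in range(self.replicas):
            bisect.insort(self._points, (_h(f"{server}#{r}"), server))

    def remove(self, server):
        self._points = [(p, s) for p, s in self._points if s != server]

    def lookup(self, url):
        # First server clockwise from the item's position (wrap around at the end).
        point = _h(url)
        i = bisect.bisect_right(self._points, (point, ""))
        return self._points[i % len(self._points)][1]

ring = ConsistentHash(["downloader-0", "downloader-1", "downloader-2"])
owner = ring.lookup("http://www.di.unipi.it/")
ring.remove("downloader-1")   # only URLs owned by downloader-1 get reassigned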

Page 10: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Examples: Open Source

Nutch, also used by WikiSearch http://www.nutch.org

Heritrix, used by Archive.org http://archive-crawler.sourceforge.net/index.html

Consistent Hashing: Amazon's Dynamo

Page 11: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Ranking

Link-based Ranking (2nd generation)

Reading 21

Page 12: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Query-independent ordering

First generation: using link counts as simple measures of popularity.

Undirected popularity: each page gets a score given by the number of its in-links plus the number of its out-links (e.g., 3+2=5).

Directed popularity: score of a page = number of its in-links (e.g., 3).

Easy to SPAM (see the toy computation below)
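
A tiny sketch of these first-generation scores computed from an edge list; the graph and page names are illustrative (page "B" reproduces the 3+2=5 example above).

from collections import defaultdict

# Toy edge list: (u, v) means u -> v.
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "E"), ("B", "F")]

in_deg, out_deg = defaultdict(int), defaultdict(int)
for u, v in edges:
    out_deg[u] += 1
    in_deg[v] += 1

nodes = set(in_deg) | set(out_deg)
undirected_popularity = {p: in_deg[p] + out_deg[p] for p in nodes}  # B: 3 + 2 = 5
directed_popularity = {p: in_deg[p] for p in nodes}                 # B: 3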

Page 13: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Second generation: PageRank

Each link has its own importance!!

PageRank is independent of the query, and it admits many interpretations...

Page 14: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Basic Intuition…

What about nodes with no in/out links?

Page 15: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Google’s Pagerank

T_{i,j} = 1/#out(j) if j links to i, and 0 otherwise

B(i): set of pages linking to i.
#out(j): number of outgoing links from j.
e: vector of components 1/sqrt{N}.

Equivalently, for each page i:  r(i) = α · Σ_{j ∈ B(i)} r(j)/#out(j) + (1-α)/N

Random jump: with probability 1-α, move to a page chosen uniformly at random.

r is the principal eigenvector:

r = [ α T + (1-α) e e^T ] × r
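
A minimal power-iteration sketch of this formula (the toy graph, α = 0.85, and the iteration count are illustrative assumptions; dangling nodes are not handled).

import numpy as np

def pagerank(out_links, alpha=0.85, iters=100):
    """Power iteration for r = [alpha*T + (1-alpha)*(1/N)*ones] r.
    out_links[j] = list of pages j points to."""
    pages = sorted(set(out_links) | {q for ps in out_links.values() for q in ps})
    idx = {p: i for i, p in enumerate(pages)}
    N = len(pages)
    T = np.zeros((N, N))
    for j, targets in out_links.items():
        for i in targets:
            T[idx[i], idx[j]] = 1.0 / len(targets)   # T[i,j] = 1/#out(j) if j -> i
    r = np.full(N, 1.0 / N)
    for _ in range(iters):
        r = alpha * T @ r + (1 - alpha) / N          # random-jump term
        r /= r.sum()                                  # keep r a distribution
    return dict(zip(pages, r))

scores = pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]})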

Page 16: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Three different interpretations

Graph (intuitive interpretation): co-citation

Matrix (easy for computation): eigenvector computation or a linear-system solution

Markov Chain (useful to prove convergence): a sort of usage simulation

(Random-surfer picture: from the current page, move to one of its neighbors via an out-link, or jump to any node.)

"In the steady state" each page has a long-term visit rate - use this as the page's score.

Page 17: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Pagerank: use in Search Engines

Preprocessing: given the graph, build the matrix α T + (1-α) e e^T and compute its principal eigenvector r; r[i] is the PageRank of page i

We are interested in the relative order

Query processing: retrieve the pages containing the query terms and rank them by their PageRank

The final order is query-independent

Page 18: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

HITS: Hypertext Induced Topic Search

Page 19: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Calculating HITS

It is query-dependent

Produces two scores per page:
  Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.
  Hub score: a good hub page for a topic points to many authoritative pages for that topic.

Page 20: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Authority and Hub scores

(Example graph: pages 2, 3, 4 point to page 1; page 1 points to pages 5, 6, 7.)

a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)

Page 21: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

HITS: Link Analysis Computation

Where
  a: vector of authority scores
  h: vector of hub scores
  A: adjacency matrix, in which a_{i,j} = 1 if i → j

h = A a and a = A^T h, hence h = A A^T h and a = A^T A a

Thus, h is an eigenvector of A A^T and a is an eigenvector of A^T A

A A^T and A^T A are symmetric matrices
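
A small numpy sketch of this iteration (the example graph and iteration count are assumptions).

import numpy as np

def hits(adj, iters=50):
    """Iterate h = A a, a = A^T h with normalization; adj[i][j] = 1 if i -> j."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        h = A @ a                 # hubs point to good authorities
        a = A.T @ h               # authorities are pointed to by good hubs
        h /= np.linalg.norm(h)
        a /= np.linalg.norm(a)
    return a, h                   # eigenvectors of A^T A and A A^T respectively

# Example: pages 0, 1, 2 all point to page 3; page 3 points to page 4.
adj = [[0,0,0,1,0], [0,0,0,1,0], [0,0,0,1,0], [0,0,0,0,1], [0,0,0,0,0]]
auth, hub = hits(adj)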

Page 22: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Weighting links

Weight a link more if the query occurs in the neighborhood of the link (e.g., in the anchor text).

Unweighted:
  h(x) = Σ_{x → y} a(y)
  a(x) = Σ_{y → x} h(y)

Weighted:
  h(x) = Σ_{x → y} w(x,y) · a(y)
  a(x) = Σ_{y → x} w(y,x) · h(y)

Page 23: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Latent Semantic Indexing (mapping onto a smaller space of latent concepts)

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Reading 18

Page 24: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Speeding up cosine computation

What if we could take our vectors and "pack" them into fewer dimensions (say 50,000 → 100) while preserving distances? Now: O(nm); then: O(km + kn), where k << n, m

Two methods: "Latent Semantic Indexing" and random projection

Page 25: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

A sketch

LSI is data-dependent
  Create a k-dim subspace by eliminating redundant axes
  Pull together "related" axes - hopefully car and automobile

Random projection is data-independent
  Choose a k-dim subspace that guarantees good stretching properties, with high probability, between pairs of points

What about polysemy?

Page 26: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Notions from linear algebra

Matrix A, vector v; matrix transpose (A^t); matrix product; rank; eigenvalues λ and eigenvectors v: Av = λv

Page 27: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Overview of LSI

Pre-process docs using a technique from linear algebra called Singular Value Decomposition

Create a new (smaller) vector space

Queries handled (faster) in this new space

Page 28: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Singular-Value Decomposition

Recall the m × n matrix of terms × docs, A. A has rank r ≤ m, n

Define the term-term correlation matrix T = A A^t
  T is a square, symmetric m × m matrix
  Let P be the m × r matrix of eigenvectors of T

Define the doc-doc correlation matrix D = A^t A
  D is a square, symmetric n × n matrix
  Let R be the n × r matrix of eigenvectors of D

Page 29: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

A’s decomposition

There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product)

It turns out that A = P Σ R^t

where Σ is an r × r diagonal matrix containing the singular values of A (the square roots of the eigenvalues of T = A A^t) in decreasing order.

A (m × n) = P (m × r) · Σ (r × r) · R^t (r × n)

Page 30: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

For some k << r, zero out all but the k biggest singular values in Σ [the choice of k is crucial]

Denote by Σ_k this new version of Σ, having rank k

Typically k is about 100, while r (A's rank) is > 10,000

Dimensionality reduction: A_k = P Σ_k R^t

Since Σ_k has zero rows/columns beyond the k-th, the product is determined by the first k columns of P (m × k), the top-left k × k block of Σ_k, and the first k rows of R^t (k × n); the rest is useless due to the 0-columns/0-rows of Σ_k.

Page 31: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Guarantee

A_k is a pretty good approximation to A: relative distances are (approximately) preserved

Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:

min_{B: rank(B)=k} ||A - B||_2 = ||A - A_k||_2 = σ_{k+1}

min_{B: rank(B)=k} ||A - B||_F^2 = ||A - A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + ... + σ_r^2

Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + ... + σ_r^2

Page 32: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Reduction

X_k = Σ_k R^t is the doc matrix, k × n, hence reduced to k dimensions

Take the doc-correlation matrix: it is D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t). Approximating with k, we get A^t A ≈ X_k^t X_k (both are n × n matrices)

We use X_k to define A's projection: X_k = Σ_k R^t; substituting Σ R^t = P^t A, the k nonzero rows of X_k equal P_k^t A, where P_k^t (the first k rows of P^t) is a k × m matrix

This means that to reduce a doc/query vector it is enough to multiply it by P_k^t

Cost of sim(q,d), for all d, is O(kn + km) instead of O(mn)

R and P are formed by the orthonormal eigenvectors of the matrices D and T

Page 33: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Which are the concepts ?

The c-th concept = the c-th row of P_k^t (which is k × m)

Denote it by P_k^t[c]; its size is m = #terms

P_k^t[c][i] = strength of association between the c-th concept and the i-th term

Projected document: d'_j = P_k^t d_j
  d'_j[c] = strength of concept c in d_j

Projected query: q' = P_k^t q
  q'[c] = strength of concept c in q
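
A toy sketch of the whole LSI pipeline with numpy's SVD; the term-document matrix, the term labels, and k are illustrative.

import numpy as np

A = np.array([[1, 1, 0, 0],      # "car"
              [1, 0, 1, 0],      # "automobile"
              [0, 0, 1, 1],      # "flower"
              [0, 1, 0, 1]], dtype=float)   # m terms x n docs

k = 2
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)   # A = P diag(sigma) Rt
Pk = P[:, :k]                      # first k columns of P  (m x k)

X_k = np.diag(sigma[:k]) @ Rt[:k]  # reduced doc matrix, k x n
docs_k = Pk.T @ A                  # equivalently: project the docs with Pk^t (k x n)
query = np.array([1, 1, 0, 0], dtype=float)
query_k = Pk.T @ query             # project the query into the k-dim concept space

# Cosine similarity in the reduced space: O(k) per document instead of O(m).
sims = (docs_k.T @ query_k) / (np.linalg.norm(docs_k, axis=0)
                               * np.linalg.norm(query_k) + 1e-12)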

Page 34: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Random Projections

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Slides only !

Page 35: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

An interesting math result

(The result, of Johnson-Lindenstrauss type: a suitable random linear map f from R^d to R^k preserves squared distances up to a factor 1 ± ε with high probability, i.e., (1-ε)·||u-v||^2 ≤ ||f(u)-f(v)||^2 ≤ (1+ε)·||u-v||^2.)

Setting v = 0 we also get a bound on f(u)'s stretching!!!

d is our previous m = #terms

Page 36: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

What about the cosine-distance ?

Write the dot product as f(u)·f(v) = (||f(u)||^2 + ||f(v)||^2 - ||f(u)-f(v)||^2) / 2, bound f(u)'s and f(v)'s stretching, and substitute the formula above: the cosine is approximately preserved too.

Page 37: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

A practical-theoretical idea !!!

Use a random projection matrix R whose entries r_{i,j} are i.i.d. with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g., r_{i,j} = ±1 with equal probability)
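
A minimal sketch of such a random projection with ±1 entries; the dimensions m and k and the seed are illustrative.

import numpy as np

rng = np.random.default_rng(0)
m, k = 10_000, 400
R = rng.choice([-1.0, 1.0], size=(k, m))   # E[r_ij] = 0, Var[r_ij] = 1

def project(v):
    # Scaling by 1/sqrt(k) keeps expected squared lengths unchanged.
    return (R @ v) / np.sqrt(k)

u, v = rng.random(m), rng.random(m)
print(np.linalg.norm(u - v), np.linalg.norm(project(u) - project(v)))  # close w.h.p.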

Page 38: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Finally...

Random projections hide large constants: k ≈ (1/ε)^2 · log d, so it may be large...
  but it is simple and fast to compute

LSI is intuitive and may scale to any k
  optimal under various metrics
  but costly to compute

Page 39: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Document duplication (exact or approximate)

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Slides only!

Page 40: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Duplicate documents

The Web is full of duplicated content
  Few cases of exact duplicates
  Many cases of near-duplicates
  E.g., the last-modified date is the only difference between two copies of a page

Sec. 19.6

Page 41: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Natural Approaches

Fingerprinting: only works for exact matches

Random sampling: sample substrings (phrases, sentences, etc.)
  hope: similar documents → similar samples
  but even samples of the same document will differ

Edit distance: a metric for approximate string matching
  expensive, even for one pair of strings
  impossible for 10^32 web documents

Page 42: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Exact-Duplicate Detection

Obvious techniques:
  Checksum - no worst-case collision-probability guarantees
  MD5 - cryptographically secure string hashes, but relatively slow

Karp-Rabin's scheme
  Algebraic technique - arithmetic on primes
  Efficient, and other nice properties...

Page 43: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Karp-Rabin Fingerprints

Consider an m-bit string A = a_1 a_2 ... a_m

Assume a_1 = 1 and fixed-length strings (wlog)

Basic values:
  Choose a prime p in the universe U, such that 2p uses few memory words (hence U ≈ 2^64)
  Set h = d^{m-1} mod p

Fingerprints: f(A) = A mod p

Nice property: if B = a_2 ... a_m a_{m+1}, then f(B) = [ d·(A - a_1·h) + a_{m+1} ] mod p

Prob[false hit] = Prob[p divides (A-B)] = #div(A-B)/U ≈ (log(A+B)) / U = m/U
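
A small sketch of the rolling computation over a bit string (d = 2); the prime p and the input bits are illustrative choices.

def karp_rabin_windows(bits, m, p=(1 << 61) - 1, d=2):
    """Yield f(window) = window mod p for every length-m window, in O(1) per shift."""
    h = pow(d, m - 1, p)                       # h = d^(m-1) mod p
    f = 0
    for b in bits[:m]:                         # fingerprint of the first window
        f = (f * d + b) % p
    yield f
    for i in range(m, len(bits)):
        # Slide the window: drop bits[i-m], append bits[i].
        f = (d * (f - bits[i - m] * h) + bits[i]) % p
        yield f

bits = [1, 0, 1, 1, 0, 1, 0, 0, 1]
prints = list(karp_rabin_windows(bits, m=4))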

Page 44: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Near-Duplicate Detection

Problem: given a large collection of documents, identify the near-duplicate documents

Web search engines: proliferation of near-duplicate documents
  Legitimate - mirrors, local copies, updates, ...
  Malicious - spam, spider traps, dynamic URLs, ...
  Mistaken - spider errors

30% of web pages are near-duplicates [1997]

Page 45: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Desiderata

Storage: only small sketches of each document

Computation: the fastest possible

Stream processing: once the sketch is computed, the source is unavailable

Error guarantees: at this problem scale, small biases have large impact, so we need formal guarantees - heuristics will not do

Page 46: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Basic Idea [Broder 1997]

Shingling: dissect each document into q-grams (shingles), represent documents by their shingle sets, and reduce the problem to set intersection [Jaccard]

They are near-duplicates if their shingle sets intersect enough

We know how to cope with "set intersection":
  fingerprints of shingles (for space efficiency)
  min-hash to estimate intersection sizes (for time and space efficiency)

Page 47: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Pipeline: Doc → shingling → multiset of shingles → fingerprint → multiset of fingerprints (documents become sets of 64-bit fingerprints)

Fingerprints:
  Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
  Fingerprint space [0, ..., U-1]
  In practice, use 64-bit fingerprints, i.e., U = 2^64
  Prob[collision] ≈ (8q)/2^64 << 1
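
A minimal sketch of shingling plus fingerprinting; here a truncated SHA-1 stands in for Karp-Rabin, and q and the word-level shingles are assumptions.

import hashlib

def shingles(text, q=4):
    words = text.lower().split()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

def fingerprint(shingle):
    # 64-bit fingerprint, i.e., a value in [0, 2^64).
    return int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:8], "big")

def fingerprint_set(text, q=4):
    return {fingerprint(s) for s in shingles(text, q)}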

Page 48: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Similarity of Documents

(Diagram: Doc_A with shingle set S_A, Doc_B with shingle set S_B; fingerprint universe U = [0 ... N-1].)

Jaccard measure - similarity of S_A, S_B:

sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|

Claim: A and B are near-duplicates if sim(S_A, S_B) is high

Page 49: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Speeding-up: Sketch of a document

Intersecting the shingle sets directly is too costly

Create a "sketch vector" (of size ~200) for each document

Documents that share ≥ t (say 80%) of the corresponding vector elements are near-duplicates

Sec. 19.6

Page 50: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Sketching by Min-Hashing

Consider S_A, S_B ⊆ P (the fingerprint universe)

Pick a random permutation π of P (such as π(x) = ax + b mod |P|)

Define α = π^{-1}( min{π(S_A)} ) and β = π^{-1}( min{π(S_B)} ), the minimal elements of S_A and S_B under the permutation π

Lemma: P[α = β] = |S_A ∩ S_B| / |S_A ∪ S_B|
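
A small sketch that estimates the Jaccard similarity by repeating the min-hash experiment; the prime P, the linear permutations, the sets, and the number of trials are illustrative.

import random

P = (1 << 61) - 1

def min_hash(S, a, b):
    return min(S, key=lambda x: (a * x + b) % P)   # pi^-1( min{pi(S)} )

def estimate_jaccard(SA, SB, trials=500):
    hits = 0
    for _ in range(trials):
        a, b = random.randrange(1, P), random.randrange(P)
        hits += min_hash(SA, a, b) == min_hash(SB, a, b)
    return hits / trials

SA = set(range(0, 80))
SB = set(range(40, 120))
print(estimate_jaccard(SA, SB))     # the exact Jaccard here is 40/120 = 1/3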

Page 51: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Sum up…

Similarity sketch sk(A) = the k minimal elements of π(S_A)
  Is k fixed, or a fixed ratio of |S_A|, |S_B|?
  We might also take k permutations and the min of each

Similarity sketches sk(A):
  Succinct representation of the fingerprint set S_A
  Allow efficient estimation of sim(S_A, S_B)
  Basic idea: use min-hash of the fingerprints

Note: we can reduce the variance by using a larger k

Page 52: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Computing Sketch[i] for Doc1

(Diagram: Document 1's fingerprints placed on the number line [0, 2^64).)

Start with the 64-bit fingerprints f(shingles)

Permute the number line [0, 2^64) with π_i

Pick the min value: this is Doc1.Sketch[i]

Sec. 19.6

Page 53: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

(Diagram: the fingerprints of Document 1 and Document 2 on the number line [0, 2^64), permuted by the same π_i; A = the min for Document 1, B = the min for Document 2.)

Are these equal?

Test for 200 random permutations: π_1, π_2, ..., π_200

Sec. 19.6

Page 54: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

However…

(Same picture as before: both documents' fingerprints on [0, 2^64) under permutation π_i.)

A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)

Claim: this happens with probability Size_of_intersection / Size_of_union

Sec. 19.6

Page 55: Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3

Sum up…

Brute force: compare sk(A) vs. sk(B) for all pairs of documents A and B.

Locality-sensitive hashing (LSH):
  Compute sk(A) for each document A
  Use LSH over all the sketches; briefly:
    Take h elements of sk(A) as an ID (may induce false positives)
    Create t IDs (to reduce the false negatives)
    If one ID matches another one (wrt the same h-selection), then the corresponding docs are probably near-duplicates; hence compare them.
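
A minimal sketch of this LSH step over min-hash sketches; h, t, and the input format are illustrative. Documents that share a bucket for some band become candidate pairs to be compared in full.

from collections import defaultdict

def lsh_candidates(sketches, h=4, t=50):
    """sketches: dict doc_id -> sequence of t*h min-hash values."""
    buckets = defaultdict(set)
    for doc, sk in sketches.items():
        for band in range(t):
            band_id = tuple(sk[band * h:(band + 1) * h])   # h elements used as an ID
            buckets[(band, band_id)].add(doc)
    candidates = set()
    for docs in buckets.values():
        if len(docs) > 1:
            for a in docs:
                for b in docs:
                    if a < b:
                        candidates.add((a, b))             # probable near-duplicates
    return candidates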