Web Algorithmics
DESCRIPTION
Web Algorithmics. Web Search Engines. Goal of a Search Engine: retrieve docs that are "relevant" for the user query. Doc: Word or PDF file, web page, email, blog, e-book, ... Query: "bag of words" paradigm. Relevant?!? The Web: languages and encodings: hundreds…
![Page 1: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/1.jpg)
Web Algorithmics
Web Search Engines
![Page 2: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/2.jpg)
Retrieve docs that are "relevant" for the user query
Doc: Word or PDF file, web page, email, blog, e-book, ...
Query: "bag of words" paradigm
Relevant ?!?
Goal of a Search Engine
![Page 3: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/3.jpg)
Two main difficulties
The Web: languages and encodings: hundreds…
Distributed authorship: SPAM, format-less pages, …
Dynamic: in one year only 35% of pages survive, and 20% remain untouched
The User: query composition: short (2.5 terms on average) and imprecise
Query results: 85% of users look at just one result page
Several needs: informational, navigational, transactional
Extracting “significant data” is difficult !!
Matching “user needs” is difficult !!
![Page 4: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/4.jpg)
Evolution of Search Engines
First generation -- use only on-page, web-text data: word frequency and language (1995-1997: AltaVista, Excite, Lycos, etc.)
Second generation -- use off-page, web-graph data: link (or connectivity) analysis, anchor text (how people refer to a page) (1998: Google)
Third generation -- answer "the need behind the query": focus on the "user need" rather than on the query; integrate multiple data sources; click-through data (Google, Yahoo, MSN, ASK, …)
Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
![Page 5: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/5.jpg)
![Page 6: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/6.jpg)
![Page 7: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/7.jpg)
![Page 8: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/8.jpg)
This is a search engine!!!
![Page 9: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/9.jpg)
Wolfram Alpha
![Page 10: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/10.jpg)
Clusty
![Page 11: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/11.jpg)
Yahoo! Correlator
![Page 12: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/12.jpg)
Web Algorithmics
The structure of a Search Engine
![Page 13: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/13.jpg)
The structure
[Diagram: the Web is fetched by a Crawler, driven by a Control module, into a Page archive; a Page analyzer extracts text and structure for the Indexer, which builds the index plus auxiliary data structures; at query time a Query resolver and a Ranker use the indexes to answer the user's Query.]
![Page 14: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/14.jpg)
![Page 15: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/15.jpg)
Generating the snippets!
![Page 16: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/16.jpg)
The big fight: find the best ranking...
![Page 17: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/17.jpg)
Ranking: Google vs Google.cn
![Page 18: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/18.jpg)
Problem: Indexing
Consider Wikipedia En: collection size ≈ 10 GB, #docs ≈ 4 × 10^6, #terms in total > 1 billion (avg term length = 6 chars), #distinct terms = several millions
Which kind of data structure do we build to support word-based searches ?
![Page 19: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/19.jpg)
DB-based solution: Term-Doc matrix
1 if the play contains the word, 0 otherwise

|  | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|---|---|---|---|---|---|---|
| Antony | 1 | 1 | 0 | 0 | 0 | 1 |
| Brutus | 1 | 1 | 0 | 1 | 0 | 0 |
| Caesar | 1 | 1 | 0 | 1 | 1 | 1 |
| Calpurnia | 0 | 1 | 0 | 0 | 0 | 0 |
| Cleopatra | 1 | 0 | 0 | 0 | 0 | 0 |
| mercy | 1 | 0 | 1 | 1 | 1 | 1 |
| worser | 1 | 0 | 1 | 1 | 1 | 0 |

With #terms > 1M and #docs ≈ 4M, the matrix takes space ≈ 4 Tb !
![Page 20: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/20.jpg)
Current solution: Inverted index
Brutus → 1 2 3 5 8 13 21 34
the → 2 4 6 10 32
Calpurnia → 13 16

Currently compressed indexes take about 13% of the original text size.
A rare term like Calpurnia may use log2 N bits per occurrence; a very frequent term like the should take about 1 bit per occurrence.
![Page 21: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/21.jpg)
Gap-coding for postings
Sort the docIDs, then store the gaps between consecutive docIDs:
Brutus: 33, 47, 154, 159, 202 … becomes 33, 14, 107, 5, 43 …
Two advantages: Space: smaller integers to store (and clustering helps). Speed: a query requires just a scan.
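The gap transformation above can be sketched in a few lines of Python (illustrative code, not from the slides):

```python
def to_gaps(docids):
    """Convert a sorted docID list into gaps between consecutive IDs."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    """Recover the original docIDs by prefix-summing the gaps."""
    out, cur = [], 0
    for g in gaps:
        cur += g
        out.append(cur)
    return out

# The slide's posting list for "Brutus":
assert to_gaps([33, 47, 154, 159, 202]) == [33, 14, 107, 5, 43]
assert from_gaps([33, 14, 107, 5, 43]) == [33, 47, 154, 159, 202]
```

The gaps are small integers, which the variable-length codes on the next slides can store compactly.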
![Page 22: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/22.jpg)
γ code for integer encoding
Assume v > 0 and let Length = ⌊log2 v⌋ + 1
Representation: (Length − 1) zeros, followed by v in binary
e.g., v = 9 is represented as <000, 1001>
The γ code for v takes 2⌊log2 v⌋ + 1 bits (i.e. a factor of 2 from optimal)
Optimal for Pr(v) = 1/(2v²), and i.i.d. integers
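A minimal Python sketch of the γ code (bit strings stand in for an actual bit stream):

```python
def gamma_encode(v):
    """γ-encode a positive integer: (Length-1) zeros, then v in binary."""
    assert v > 0
    binary = bin(v)[2:]                    # v in binary; len = floor(log2 v) + 1
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Decode one γ-coded integer: count leading zeros z, read z+1 bits."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

assert gamma_encode(9) == "0001001"        # <000, 1001> as on the slide
assert gamma_decode("0001001") == 9
assert len(gamma_encode(9)) == 7           # 2*floor(log2 9) + 1 = 7 bits
```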
![Page 23: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/23.jpg)
Rice code (simplification of Golomb code)
It is a parametric code: it depends on k
Quotient q = ⌊(v − 1)/k⌋, remainder r = v − k·q − 1
Useful when the integers are concentrated around k
How do we choose k? Usually k ≈ 0.69 · mean(v) [Bernoulli model]
Optimal for Pr(v) = p(1 − p)^(v−1), where mean(v) = 1/p, and i.i.d. integers
Encoding: Unary(q + 1), i.e. q zeros followed by a 1, then the remainder r in binary using log2 k bits
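A sketch of the Rice code in Python, assuming (as Rice does) that k is a power of 2 so the remainder takes exactly log2 k bits:

```python
import math

def rice_encode(v, k):
    """Rice-encode positive integer v with parameter k (a power of 2, k >= 2)."""
    q = (v - 1) // k                 # quotient, coded in unary: q zeros + a 1
    r = v - k * q - 1                # remainder in [0, k-1], coded in log2 k bits
    bits = int(math.log2(k))
    return "0" * q + "1" + (format(r, f"0{bits}b") if bits else "")

def rice_decode(code, k):
    q = code.index("1")              # position of the unary terminator
    bits = int(math.log2(k))
    r = int(code[q + 1:q + 1 + bits], 2) if bits else 0
    return k * q + r + 1

assert rice_encode(10, 8) == "01001"     # q=1, r=1: "0" + "1" + "001"
for v in range(1, 100):
    assert rice_decode(rice_encode(v, 8), 8) == v
```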
![Page 24: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/24.jpg)
PForDelta coding
Take a block of 128 numbers, e.g. 2 3 3 … 1 1 3 3 23 13 42 2
Use b (e.g. 2) bits to encode each of the 128 numbers, or create exceptions for the values that do not fit (here 23, 13, 42)
Encode the exceptions separately: via ESC codes or pointers
Choose b to encode ≈ 90% of the values, or trade off: a larger b wastes more bits per value, a smaller b creates more exceptions
Translate the data: [base, base + 2^b − 1] → [0, 2^b − 1]
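A simplified sketch of the exception mechanism only (real PForDelta packs the b-bit codes into machine words and chains exceptions; the names and list-of-pairs representation here are illustrative):

```python
def pfor_encode(block, b):
    """Split a block into b-bit codes plus an exception list.
    Values >= 2**b cannot be stored in b bits and become exceptions,
    kept verbatim together with their positions."""
    limit = 1 << b
    codes, exceptions = [], []
    for i, v in enumerate(block):
        if v < limit:
            codes.append(v)
        else:
            codes.append(0)              # placeholder in the b-bit stream
            exceptions.append((i, v))    # patched back at decode time
    return codes, exceptions

def pfor_decode(codes, exceptions):
    out = list(codes)
    for i, v in exceptions:
        out[i] = v
    return out

block = [2, 3, 3, 1, 1, 3, 3, 23, 13, 42, 2]    # the slide's example values
codes, exc = pfor_encode(block, 2)
assert exc == [(7, 23), (8, 13), (9, 42)]        # the three out-of-range values
assert pfor_decode(codes, exc) == block
```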
![Page 25: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/25.jpg)
Interpolative coding
Gaps G = 1 1 1 2 2 2 2 4 3 1 1 1, i.e. M = 1 2 3 5 7 9 11 15 18 19 20 21
Recursive coding = preorder traversal of a balanced binary tree
At every step we know (initially, they are encoded): num = |M| = 12, Lidx = 1, low = 1, Ridx = 12, hi = 21
Take the middle element: h = ⌊(Lidx + Ridx)/2⌋ = 6, M[6] = 9; left_size = h − Lidx = 5, right_size = Ridx − h = 6
Since low + left_size = 1 + 5 = 6 ≤ M[h] ≤ hi − right_size = 21 − 6 = 15, we can encode 9 in ⌈log2(15 − 6 + 1)⌉ = 4 bits
Recurse on the left part with lo = 1, hi = 9 − 1 = 8, num = 5, and on the right part with lo = 9 + 1 = 10, hi = 21, num = 6
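The recursion can be sketched as follows (illustrative Python, not from the slides): instead of emitting bits, the function records, for each middle element, the [low, high] range that determines how many bits it needs.

```python
import math

def interp_encode(M, lo, hi, out):
    """Binary interpolative coding of sorted list M, all values in [lo, hi].
    Appends (value, low, high) triples in preorder; a real coder would emit
    ceil(log2(high - low + 1)) bits for each middle element."""
    if not M:
        return
    h = (len(M) - 1) // 2               # middle position (0-indexed)
    left, mid, right = M[:h], M[h], M[h + 1:]
    low = lo + len(left)                # mid cannot be smaller than this
    high = hi - len(right)              # ... nor larger than this
    out.append((mid, low, high))
    interp_encode(left, lo, mid - 1, out)
    interp_encode(right, mid + 1, hi, out)

M = [1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21]   # the slide's example
out = []
interp_encode(M, 1, 21, out)
mid, low, high = out[0]
assert (mid, low, high) == (9, 6, 15)             # as computed on the slide
assert math.ceil(math.log2(high - low + 1)) == 4  # 9 fits in 4 bits
```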
![Page 26: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/26.jpg)
Query processing
1) Retrieve all pages matching the query
Brutus → 1 2 3 5 8 13 21 34
the → 2 4 6 13 32
Caesar → 4 13 17
![Page 27: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/27.jpg)
Some optimization
Best order for query processing? Shorter lists first…
Query: Brutus AND Calpurnia AND The
Brutus → 1 2 3 5 8 13 21 34
The → 2 4 6 13 32
Calpurnia → 4 13 17
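A sketch of the shortest-list-first AND processing (illustrative Python):

```python
def intersect_two(a, b):
    """Merge-intersect two sorted docID lists in O(|a| + |b|)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(postings):
    """Process an AND query starting from the shortest posting list,
    since the intermediate result can never grow beyond it."""
    postings = sorted(postings, key=len)
    result = postings[0]
    for plist in postings[1:]:
        result = intersect_two(result, plist)
    return result

brutus    = [1, 2, 3, 5, 8, 13, 21, 34]
the       = [2, 4, 6, 13, 32]
calpurnia = [4, 13, 17]
assert and_query([brutus, the, calpurnia]) == [13]
```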
![Page 28: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/28.jpg)
Phrase queries
Expand the posting lists with word positions:
to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
Larger space occupancy, 5–8% on the Web
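With positions stored, a two-word phrase such as "to be" is answered by a positional intersection; a sketch using the slide's postings (the dict-based layout is an illustration, not the slides' actual data structure):

```python
def phrase_match(pos_a, pos_b):
    """Docs where some occurrence of the second word immediately follows
    the first. pos_a, pos_b: dict docID -> sorted list of word positions."""
    hits = {}
    for doc in pos_a.keys() & pos_b.keys():
        bset = set(pos_b[doc])
        matches = [p for p in pos_a[doc] if p + 1 in bset]
        if matches:
            hits[doc] = matches
    return hits

# Positional postings from the slide (docID: positions):
to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
assert phrase_match(to, be) == {4: [16, 190, 429, 433]}   # "to be" occurs only in doc 4
```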
![Page 29: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/29.jpg)
Query processing
1) Retrieve all pages matching the query
2) Order pages according to various scores: term position & frequency (body, title, anchor, …), link popularity, user clicks or preferences
Brutus → 1 2 3 5 8 13 21 34
the → 2 4 6 13 32
Caesar → 4 13 17
![Page 30: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/30.jpg)
The structure
[Diagram again: Web, Crawler, Control, Page archive, Page analyzer (text, structure), Indexer with auxiliary structures, Query resolver, Ranker.]
![Page 31: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/31.jpg)
Web Algorithmics
Text-based Ranking (1st generation)
![Page 32: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/32.jpg)
A famous “weight”: tf-idf
w_{t,d} = tf_{t,d} · log(n / n_t)

where tf_{t,d} = frequency of term t in doc d = #occ_{t,d} / |d|
and idf_t = log(n / n_t), with n_t = #docs containing term t and n = #docs in the indexed collection

|  | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|---|---|---|---|---|---|---|
| Antony | 13.1 | 11.4 | 0.0 | 0.0 | 0.0 | 0.0 |
| Brutus | 3.0 | 8.3 | 0.0 | 1.0 | 0.0 | 0.0 |
| Caesar | 2.3 | 2.3 | 0.0 | 0.5 | 0.3 | 0.3 |
| Calpurnia | 0.0 | 11.2 | 0.0 | 0.0 | 0.0 | 0.0 |
| Cleopatra | 17.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mercy | 0.5 | 0.0 | 0.7 | 0.9 | 0.9 | 0.3 |
| worser | 1.2 | 0.0 | 0.6 | 0.6 | 0.6 | 0.0 |
Vector Space model
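The weight formula translates directly to code. A minimal sketch (the slide does not state the log base; base 10 is assumed here, and the counts are hypothetical):

```python
import math

def tf_idf(term_occurrences, doc_len, docs_with_term, n_docs):
    """tf-idf as on the slide: tf = #occ / |d|, idf = log(n / n_t)."""
    tf = term_occurrences / doc_len
    idf = math.log10(n_docs / docs_with_term)
    return tf * idf

# Hypothetical counts, for illustration only: a term occurring 3 times in a
# 100-word document, appearing in 1,000 of 1,000,000 indexed docs.
w = tf_idf(3, 100, 1_000, 1_000_000)
assert abs(w - 0.09) < 1e-12    # tf = 0.03, idf = log10(1000) = 3
```

A term that appears in every document gets idf = log(n/n) = 0, i.e. it carries no discriminating weight.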
![Page 33: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/33.jpg)
A graphical example
Postulate: documents that are "close together" in the vector space talk about the same things. Euclidean distance is sensitive to vector length!!
[Figure: docs d1…d5 plotted as vectors in the term space t1, t2, t3]
cos(θ) = v · w / (||v|| · ||w||)
The user query is a very short doc
Easy to spam
Sophisticated algos to find the top-k docs for a query Q
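The cosine formula above, and why it is preferred over Euclidean distance, in a short Python sketch (illustrative vectors):

```python
import math

def cosine(v, w):
    """cos(theta) = v.w / (||v|| * ||w||): similarity independent of
    vector length, unlike Euclidean distance."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# A doc and a 10x-scaled copy point in the same direction: cosine = 1,
# even though their Euclidean distance is large.
d1 = [1.0, 2.0, 0.0]
d2 = [10.0, 20.0, 0.0]
assert abs(cosine(d1, d2) - 1.0) < 1e-12
assert cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]) == 0.0   # orthogonal docs
```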
![Page 34: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/34.jpg)
Approximate top-k results
Preprocess: assign to each term its m best documents
Search: if Q has q terms, merge their preferred lists (≤ m·q candidate answers); compute the cosine between Q and these docs, and choose the top k. Empirically, one needs to pick m > k to work well.
Nowadays search engines use tf-idf PLUS PageRank (PLUS other weights)
|  | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|---|---|---|---|---|---|---|
| Antony | 13.1 | 11.4 | 0.0 | 0.0 | 0.0 | 0.0 |
| Brutus | 3.0 | 8.3 | 0.0 | 1.0 | 0.0 | 0.0 |
| Caesar | 2.3 | 2.3 | 0.0 | 0.5 | 0.3 | 0.3 |
| Calpurnia | 0.0 | 11.2 | 0.0 | 0.0 | 0.0 | 0.0 |
| Cleopatra | 17.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mercy | 0.5 | 0.0 | 0.7 | 0.9 | 0.9 | 0.3 |
| worser | 1.2 | 0.0 | 0.6 | 0.6 | 0.6 | 0.0 |
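The preprocess/search scheme can be sketched as follows (illustrative Python; the per-term preferred lists are often called "champion lists", and the tiny weights below are made up for the example):

```python
def preferred_lists(scores, m):
    """Preprocessing: for each term keep only its m highest-weight docs.
    scores: term -> {docID: weight}."""
    return {t: sorted(d, key=d.get, reverse=True)[:m] for t, d in scores.items()}

def approx_top_k(query_terms, champions, score_fn, k):
    """Search: union the preferred lists of the query terms (<= m*q docs),
    score only those candidates, and return the best k."""
    candidates = set()
    for t in query_terms:
        candidates |= set(champions.get(t, []))
    return sorted(candidates, key=score_fn, reverse=True)[:k]

# Made-up weights for two terms over three docs:
scores = {"brutus": {1: 3.0, 2: 8.3, 4: 1.0}, "caesar": {1: 2.3, 2: 2.3, 4: 0.5}}
champ = preferred_lists(scores, m=2)
assert champ["brutus"] == [2, 1]                 # doc 4 is pruned away
top = approx_top_k(["brutus", "caesar"], champ,
                   lambda d: sum(s.get(d, 0.0) for s in scores.values()), k=2)
assert top == [2, 1]
```

The result is approximate: a doc ranked just below the m-th place for every query term is never considered, which is why m must exceed k.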
![Page 35: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/35.jpg)
Web Algorithmics
Link-based Ranking (2nd generation)
![Page 36: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/36.jpg)
Query-independent ordering
First generation: use link counts as simple measures of popularity.
Undirected popularity: each page gets a score equal to the number of its in-links plus the number of its out-links (e.g. 3 + 2 = 5).
Directed popularity: score of a page = number of its in-links (e.g. 3).
Easy to SPAM
![Page 37: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/37.jpg)
Second generation: PageRank
Each link has its own importance!!
PageRank is independent of the query and admits many interpretations…
![Page 38: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/38.jpg)
Basic Intuition…
[Figure: from any node, jump to a random node with probability 1 − d, or follow a link to a neighbor with probability d]
![Page 39: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/39.jpg)
Google’s Pagerank
r(i) = (1 − d)/N + d · Σ_{j ∈ B(i)} r(j) / #out(j)

with the link matrix L_{j,i} = 1/#out(j) if j links to i, and 0 otherwise.
B(i): set of pages linking to i. #out(j): number of outgoing links from j.
(1 − d)/N is a fixed value; the vector r is the principal eigenvector of the resulting matrix.
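The eigenvector can be computed by power iteration; a minimal sketch (the damping value d = 0.85 and the handling of dangling nodes are assumptions, the slide only gives the formula):

```python
def pagerank(out_links, d=0.85, iters=50):
    """Power iteration for r(i) = (1-d)/N + d * sum_{j in B(i)} r(j)/#out(j).
    out_links: node -> list of nodes it links to. Dangling nodes simply
    lose their mass in this sketch."""
    nodes = list(out_links)
    n = len(nodes)
    r = {i: 1.0 / n for i in nodes}
    for _ in range(iters):
        nxt = {i: (1 - d) / n for i in nodes}
        for j, outs in out_links.items():
            if outs:
                share = d * r[j] / len(outs)   # j splits its rank among its out-links
                for i in outs:
                    nxt[i] += share
        r = nxt
    return r

# A tiny 3-page web: A and B both link to C, C links back to A.
g = {"A": ["C"], "B": ["C"], "C": ["A"]}
r = pagerank(g)
assert r["C"] > r["A"] > r["B"]               # C has two in-links, A one, B none
assert abs(sum(r.values()) - 1.0) < 1e-9      # no dangling nodes, mass is conserved
```

Note that B, with no in-links at all, still gets the fixed (1 − d)/N share: every page has a nonzero score.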
![Page 40: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/40.jpg)
Three different interpretations
Graph (intuitive interpretation) Co-citation
Matrix (easy for computation) Eigenvector computation or a linear system solution
Markov Chain (useful to prove convergence) a sort of Usage Simulation
[Figure as before: random jump with probability 1 − d, follow a link with probability d]
"In the steady state" each page has a long-term visit rate: use this as the page's score.
![Page 41: Web Algorithmics](https://reader031.vdocument.in/reader031/viewer/2022012919/56814531550346895db1f8e1/html5/thumbnails/41.jpg)
Pagerank: use in Search Engines
Preprocessing: given the graph of links, build the matrix L; compute its principal eigenvector r; r[i] is the PageRank of page i
We are interested in the relative order
Query processing: retrieve the pages containing the query terms; rank them by their PageRank
The final order is query-independent