DETECTING NEAR-DUPLICATES FOR WEB CRAWLING
Authors:
Gurmeet Singh Manku,
Arvind Jain, and
Anish Das Sarma
Presentation by: Fernando Arreola
Presented: 6/20/2011
Outline
De-duplication
Goal of the Paper
Why is De-duplication Important?
Algorithm
Experiment
Related Work
Tying it Back to Lecture
Paper Evaluation
Questions
De-duplication
The process of eliminating near-duplicate web documents in a generic crawl
The challenge is near-duplicates: identifying exact duplicates is easy (use checksums), but how do we identify near-duplicates?
Near-duplicates are largely identical in content but differ in small areas, such as ads, counters, and timestamps
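Exact duplicates can be caught with a plain checksum, as the slide notes. A minimal Python sketch (the choice of SHA-256 and the sample pages are illustrative assumptions):

```python
import hashlib

def checksum(page_bytes):
    """Exact-duplicate check: byte-identical pages get identical digests."""
    return hashlib.sha256(page_bytes).hexdigest()

seen, duplicates = set(), 0
for page in (b"<html>same</html>", b"<html>same</html>", b"<html>other</html>"):
    digest = checksum(page)
    if digest in seen:
        duplicates += 1  # exact duplicate of a previously seen page
    seen.add(digest)
print(duplicates)  # 1
```

A near-duplicate (the same page with a fresh timestamp, say) would produce a completely different digest, which is exactly why simhash is needed.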
Goal of the Paper
Present a near-duplicate detection system that improves web crawling
The near-duplicate detection system includes:
Simhash, a technique that transforms a web page into an f-bit fingerprint
A solution to the Hamming Distance Problem: given an f-bit fingerprint, find all fingerprints in a given collection that differ from it in at most k bit positions
Why is De-duplication Important?
Eliminating near-duplicates:
Saves network bandwidth: content similar to previously crawled content does not have to be crawled
Reduces storage cost: content similar to previously crawled content does not have to be stored in the local repository
Improves the quality of search indexes: the local repository used to build search indexes is not polluted by near-duplicates
Algorithm: Simhash Technique
Convert the web page to a set of features using Information Retrieval techniques, e.g. tokenization and phrase detection
Give each feature a weight
Hash each feature into an f-bit value
Maintain an f-dimensional vector whose components start at 0
Update the vector with the weight of each feature:
If the i-th bit of the feature's hash value is 0, subtract the feature's weight from the i-th component of the vector
If the i-th bit is 1, add the feature's weight to the i-th component
The final vector has positive and negative components; the sign (+/-) of each component gives one bit of the fingerprint
Algorithm: Simhash Technique (cont.)
A very simple example:
One web page with the text "Simhash Technique"
Reduced to two features: "Simhash" (weight = 2) and "Technique" (weight = 4)
Hash each feature to 4 bits: "Simhash" -> 1101, "Technique" -> 0110
Algorithm: Simhash Technique (cont.)
Start with the all-zero vector [0, 0, 0, 0]
Algorithm: Simhash Technique (cont.)
Apply the "Simhash" feature (weight = 2):

feature's f-bit value   calculation   vector
1                       0 + 2         2
1                       0 + 2         2
0                       0 - 2         -2
1                       0 + 2         2
Algorithm: Simhash Technique (cont.)
Apply the "Technique" feature (weight = 4):

feature's f-bit value   calculation   vector
0                       2 - 4         -2
1                       2 + 4         6
1                       -2 + 4        2
0                       2 - 4         -2
Algorithm: Simhash Technique (cont.)
Final vector: [-2, 6, 2, -2]
The signs of the components are -, +, +, -
Final 4-bit fingerprint = 0110
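The steps above can be sketched in Python, using the example's feature hashes and weights (bits are taken most significant first, matching the worked tables):

```python
def simhash(features, f=4):
    """Compute an f-bit simhash from (hash_value, weight) pairs."""
    v = [0] * f
    for h, w in features:
        for i in range(f):
            # i-th bit of the feature's hash, most significant bit first
            bit = (h >> (f - 1 - i)) & 1
            v[i] += w if bit else -w  # add the weight on 1, subtract it on 0
    # the sign of each component becomes one fingerprint bit
    return int("".join("1" if x > 0 else "0" for x in v), 2)

# "Simhash" -> 1101 (weight 2), "Technique" -> 0110 (weight 4)
fp = simhash([(0b1101, 2), (0b0110, 4)])
print(format(fp, "04b"))  # 0110
```

For real pages the paper uses 64-bit fingerprints, so f would be 64 and each hash would come from hashing the feature string itself.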
Algorithm: Solution to Hamming Distance Problem
Problem: given an f-bit fingerprint F, find all fingerprints in a given collection that differ from F in at most k bit positions
Solution: create tables containing the fingerprints
Each table i has a permutation (π_i) and a small integer (p_i) associated with it
Apply each table's permutation to its fingerprints, then sort each table
Store the tables in the main memory of a set of machines and iterate through them in parallel
In table i, find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F)
For the fingerprints that matched, check whether they differ from π_i(F) in at most k bits
Algorithm: Solution to Hamming Distance Problem (cont.)
A simple example:
F = 0100 1101, k = 3
A collection of 8 fingerprints; create two tables
Fingerprints:
1100 0101
1111 1111
0101 1100
0111 1110
1111 1110
0010 0001
1111 0101
1101 0010
Algorithm: Solution to Hamming Distance Problem (cont.)
Table 1 (p = 3; π = swap last four bits with first four bits):
0101 1100
1111 1111
1100 0101
1110 0111
Table 2 (p = 3; π = move last two bits to the front):
1011 1111
0100 1000
0111 1101
1011 0100
Algorithm: Solution to Hamming Distance Problem (cont.)
Sort each table:
Table 1 (p = 3; π = swap last four bits with first four bits), sorted:
0101 1100
1100 0101
1110 0111
1111 1111
Table 2 (p = 3; π = move last two bits to the front), sorted:
0100 1000
0111 1101
1011 0100
1011 1111
Algorithm: Solution to Hamming Distance Problem (cont.)
F = 0100 1101, so π1(F) = 1101 0100 and π2(F) = 0101 0011
Table 1, sorted:
0101 1100
1100 0101   <- top 3 bits match π1(F)
1110 0111
1111 1111
Table 2, sorted:
0100 1000   <- top 3 bits match π2(F)
0111 1101
1011 0100
1011 1111
Algorithm: Solution to Hamming Distance Problem (cont.)
With k = 3, only the candidate in the first table is a near-duplicate of F:
Table 1: π1(F) = 1101 0100 vs. 1100 0101 -> differ in 2 bit positions (<= k)
Table 2: π2(F) = 0101 0011 vs. 0100 1000 -> differ in 4 bit positions (> k)
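The permute-sort-probe procedure can be sketched in Python. One hedge: the slides show each table holding only four of the eight fingerprints, whereas in the full algorithm every table stores all fingerprints under its own permutation; either way, this example finds exactly the one near-duplicate above.

```python
def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def swap_halves(x):      # table 1's permutation: swap last four bits with first four
    return ((x & 0x0F) << 4) | (x >> 4)

def move_last_two(x):    # table 2's permutation: move last two bits to the front
    return ((x & 0b11) << 6) | (x >> 2)

fingerprints = [0b11000101, 0b11111111, 0b01011100, 0b01111110,
                0b11111110, 0b00100001, 0b11110101, 0b11010010]
F, k, p = 0b01001101, 3, 3

matches = set()
for perm in (swap_halves, move_last_two):
    table = sorted(perm(fp) for fp in fingerprints)
    pF = perm(F)
    for entry in table:
        # probe: top p bits must equal the top p bits of perm(F),
        # then verify the Hamming distance is at most k
        if entry >> (8 - p) == pF >> (8 - p) and hamming(entry, pF) <= k:
            matches.add(entry)

print([format(m, "08b") for m in matches])  # ['11000101']
```

Since swapping halves is its own inverse, the matching permuted entry 1100 0101 corresponds to the stored fingerprint 0101 1100.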
Algorithm: Compression of Tables
Store the first fingerprint of a block (1024 bytes) in full
XOR the current fingerprint with the previous one
Append to the block the Huffman code for the position of the most significant 1 bit
Append to the block the bits after the most significant 1 bit
Repeat steps 2-4 until the block is full
To compare against a query fingerprint: use the last fingerprint (key) of each block, perform interpolation search over the keys, and decompress only the appropriate block
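A hedged sketch of the delta encoding above: the real scheme Huffman-codes the position of the most significant 1 bit, while this sketch stores that position as a plain integer purely to make the round trip visible. Fingerprints are assumed sorted and distinct, so every XOR delta is nonzero.

```python
def compress(fps):
    """Delta-encode sorted, distinct fingerprints against the previous one."""
    first, deltas = fps[0], []
    prev = first
    for cur in fps[1:]:
        delta = prev ^ cur
        msb = delta.bit_length() - 1                    # position of the most significant 1 bit
        deltas.append((msb, delta & ((1 << msb) - 1)))  # the bits after that 1
        prev = cur
    return first, deltas

def decompress(first, deltas):
    fps, prev = [first], first
    for msb, rest in deltas:
        prev ^= (1 << msb) | rest   # rebuild the delta, then undo the XOR
        fps.append(prev)
    return fps

first, deltas = compress([3, 7, 200, 1000])
print(decompress(first, deltas))  # [3, 7, 200, 1000]
```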
Algorithm: Extending to Batch Queries
Problem: want near-duplicates for a batch of query fingerprints, not just one
Solution: use the Google File System (GFS) and MapReduce
Create two files: file F holds the collection of fingerprints, file Q holds the query fingerprints
Store both files in GFS, which breaks them up into chunks
Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q
MapReduce creates one task per chunk; the chunks are processed in parallel, and each task outputs the near-duplicates it found
Produce a sorted file from the outputs of all tasks, removing duplicates if necessary
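A toy sketch of the batch step without GFS or MapReduce: file F is split into chunks, and an independent task resolves every query in Q against one chunk before the outputs are merged and sorted. The chunk size, file contents, and k here are made-up values for illustration.

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

def chunks(seq, size):
    """Stand-in for GFS splitting file F into fixed-size chunks."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def task(chunk, queries, k):
    """One MapReduce-style task: near-duplicate pairs for one chunk of F."""
    return [(q, fp) for q in queries for fp in chunk if hamming(q, fp) <= k]

F_file = [0b11000101, 0b11111111, 0b01011100, 0b01111110]   # the collection
Q_file = [0b01011101, 0b00000000]                           # the query batch
# run one task per chunk, then merge, de-duplicate, and sort the outputs
results = sorted(set().union(*(task(c, Q_file, 2) for c in chunks(F_file, 2))))
print(results)  # [(93, 92)]
```

In the real system each task would of course use the table-based Hamming search rather than this brute-force scan.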
Experiment: Parameters
8 billion web pages used
k = 1 ... 10
Pairs were manually tagged as follows:
True positives: pairs that differ only slightly
False positives: radically different pairs
Unknown: pairs that could not be evaluated
Experiment: Results
Accuracy:
A low k value yields many false negatives; a high k value yields many false positives
Best value: k = 3, where 75% of near-duplicates are reported and 75% of reported cases are true positives
Running time:
Hamming Distance solution: O(log(p))
Batch query + compression: a 32 GB file with 200 tasks runs in under 100 seconds
Related Work
Clustering related documents: detect near-duplicates to show related pages
Data extraction: determine the schema of similar pages to obtain information
Plagiarism: detect pages that have borrowed from each other
Spam: detect spam before the user receives it
Tying it Back to Lecture
Similarities:
Both noted the importance of de-duplication for saving crawler resources
Both briefly summarized several uses for near-duplicate detection
Differences:
Lecture focus: a breadth-first look at algorithms for near-duplicate detection
Paper focus: an in-depth look at the simhash and Hamming Distance algorithm, including how to implement it and how effective it is
Paper Evaluation: Pros
Thorough step-by-step explanation of the algorithm implementation
Thorough explanation of how the conclusions were reached
Includes a brief description of how to improve the simhash + Hamming Distance algorithm: categorize web pages before running simhash, create an algorithm to remove ads or timestamps, etc.
Paper Evaluation: Cons
No comparison: how much more effective or faster is it than other algorithms? By how much did it improve the crawler?
Batch queries are tied to a specific technology: the implementation requires GFS; an approach not restricted to one technology might be more widely applicable
Any Questions?
???