Do Not Crawl in the DUST: Different URLs Similar Text
DESCRIPTION
TRANSCRIPT
![Page 1: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/1.jpg)
Do Not Crawl In The DUST: Different URLs Similar Text
Uri Schonfeld, Department of Electrical Engineering, Technion
Joint work with Dr. Ziv Bar-Yossef and Dr. Idit Keidar
![Page 2: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/2.jpg)
Problem statement and motivation
Related work
Our contribution
The DustBuster algorithm
Experimental results
Concluding remarks
Talk Outline
![Page 3: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/3.jpg)
DUST – Different URLs Similar Text. Examples:
Standard canonization: "http://domain.name/index.html" → "http://domain.name"
Domain names and virtual hosts: "http://news.google.com" → "http://google.com/news"
Aliases and symbolic links: "http://domain.name/~shuri" → "http://domain.name/people/shuri"
Parameters with little effect on content: Print=1
URL transformations: "http://domain.name/story_" → "http://domain.name/story?id="
Even the WWW Gets Dusty
![Page 4: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/4.jpg)
DUST rule: transforms one URL to another. Example: "index.html" → ""
Valid DUST rule:
r is a valid DUST rule w.r.t. site S if, for every URL u ∈ S:
r(u) is a valid URL, and r(u) and u have "similar" contents.
Why similar and not identical? Comments, news, text ads, counters.
DUST Rules!
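The authors' own code is not shown on the slides; as a rough sketch under the definition above, an alias DUST rule α → β can be modeled as a simple substring substitution. The function name apply_rule and the example URLs below are illustrative only.

```python
from typing import Optional

def apply_rule(url: str, alpha: str, beta: str) -> Optional[str]:
    """Return r(url) for the alias rule alpha -> beta, or None if not applicable."""
    if alpha not in url:
        return None
    return url.replace(alpha, beta, 1)  # substitute the first occurrence

print(apply_rule("http://domain.name/index.html", "index.html", ""))
# -> http://domain.name/
print(apply_rule("http://domain.name/story_2659", "story_", "story?id="))
# -> http://domain.name/story?id=2659
```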
![Page 5: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/5.jpg)
Expensive to crawl: access the same document via multiple URLs.
Forces us to shingle: an expensive technique used to discover similar documents.
Ranking algorithms suffer: references to a document are split among its aliases.
Multiple identical results: the same document is returned several times in the search results.
Any algorithm based on URLs suffers.
DUST is Bad
![Page 6: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/6.jpg)
Given: a list of URLs from a site S (a crawl log or a web server log).
Want: to find valid DUST rules w.r.t. S, as many as possible, including site-specific ones, while minimizing the number of fetches.
Applications: site-specific canonization, more efficient crawling.
We Want To
![Page 7: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/7.jpg)
Domain name aliases
Standard extensions
Default file names: index.html, default.htm
File path canonizations: "dirname/../" → "", "//" → "/"
Escape sequences: "%7E" → "~"
How do we Fight DUST Today? (1) Standard Canonization
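As a hedged illustration of these standard canonizations (not the authors' implementation), the sketch below lowercases the host, decodes harmless escapes such as %7E, normalizes the path, and drops default file names. The names canonize and DEFAULT_FILES are hypothetical.

```python
# A rough sketch of the standard canonizations listed above; assumptions are noted inline.
import posixpath
from urllib.parse import urlsplit, urlunsplit, unquote

DEFAULT_FILES = {"index.html", "index.htm", "default.htm"}

def canonize(url: str) -> str:
    parts = urlsplit(url)
    host = parts.netloc.lower()                # domain names are case-insensitive
    path = unquote(parts.path)                 # "%7E" -> "~"
    path = posixpath.normpath(path)            # "dirname/../" -> "", inner "//" -> "/"
    if path == ".":
        path = "/"
    if path.startswith("//"):                  # normpath keeps a leading "//"
        path = path[1:]
    if posixpath.basename(path) in DEFAULT_FILES:
        path = posixpath.dirname(path)         # ".../index.html" -> ".../"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

print(canonize("http://Domain.Name/people/%7Eshuri/../shuri/index.html"))
# -> http://domain.name/people/shuri
```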
![Page 8: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/8.jpg)
Site-specific DUST:
"story_" → "story?id="
"news.google.com" → "google.com/news"
"labs" → "laboratories"
This DUST is harder to find.
Standard Canonization is not Enough
![Page 9: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/9.jpg)
Shingles are document sketches [Broder, Glassman, Manasse 97].
Used to compare documents for similarity: Pr(shingles are equal) = document similarity.
Compare documents by comparing shingles.
Calculating a shingle: take all m-word sequences, hash them with h_i, and choose the minimum. That's your shingle.
How do we Fight DUST Today? (2) Shingles
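A minimal sketch of one min-hash shingle as just described, assuming words are whitespace tokens and using a single hash h_i (a salted SHA-1 stands in for it); real systems use several independent hashes and compare the resulting sketches.

```python
import hashlib

def shingle(text: str, m: int = 4, salt: str = "h1") -> int:
    """Min over the hashes of all m-word sequences; one such value per hash h_i."""
    words = text.split()
    if len(words) < m:
        words = words + [""] * (m - len(words))   # pad very short documents
    hashes = []
    for i in range(len(words) - m + 1):
        seq = " ".join(words[i:i + m])
        digest = hashlib.sha1((salt + seq).encode("utf-8")).hexdigest()
        hashes.append(int(digest, 16))
    return min(hashes)

# Two documents are compared via the probability that their shingles agree;
# here a single comparison with one hash:
print(shingle("the quick brown fox jumps") == shingle("the quick brown fox jumps"))
```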
![Page 10: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/10.jpg)
Shingles are expensive: they require a fetch, parsing, and hashing.
Shingles do not find rules, and are therefore not applicable to new pages.
Shingles are Not Perfect
![Page 11: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/11.jpg)
Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]
Identifying plagiarized documents [Hoad, Zobel 03]
Finding near-replicas [Shivakumar, Garcia-Molina 98], [Di Iorio, Diligenti, Gori, Maggini, Pucci 03]
Copy detection [Brin, Davis, Garcia-Molina 95], [Garcia-Molina, Gravano, Shivakumar 96], [Shivakumar, Garcia-Molina 96]
More Related Work
![Page 12: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/12.jpg)
An algorithm that finds site-specific valid DUST rules while requiring a minimal number of fetches.
Convincing results in experiments; benefits to crawling.
Our contributions
![Page 13: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/13.jpg)
Alias DUST: simple substring substitutions
"story_1259" → "story?id=1259"
"news.google.com" → "google.com/news"
"/index.html" → ""
Parameter DUST: standard URL structure is
protocol://domain.name/path/name?para=val&pa=va
Some parameters do not affect content: they can be removed, or changed to a default value.
Types of DUST
![Page 14: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/14.jpg)
Input: URL list
Detect likely DUST rules
Eliminate redundant rules
Validate DUST rules using samples:
Eliminate DUST rules that are "wrong"
Further eliminate duplicate DUST rules
No Fetch Required
Our Basic Framework
![Page 15: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/15.jpg)
Large support principle: likely DUST rules have lots of "evidence" supporting them.
Small buckets principle: ignore evidence that supports many different rules.
How to detect likely DUST rules?
![Page 16: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/16.jpg)
Large Support Principle
A pair of URLs (u, v) is an instance of rule r if r(u) = v.
Support(r) = all instances (u, v) of r.
Large Support Principle: the support of a valid DUST rule is large.
![Page 17: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/17.jpg)
Rule Support: An Equivalent View
α: a string. Example: α = "story_"
u: a URL that contains α as a substring. Example: u = "http://www.sitename.com/story_2659"
Envelope of α in u:
A pair of strings (p, s): p is the prefix of u preceding α, s is the suffix of u succeeding α.
Example: p = "http://www.sitename.com/", s = "2659"
E(α): all envelopes of α in URLs that appear in the input URL list.
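A small sketch of computing E(α), assuming envelopes are taken around every occurrence of α in each URL; the helper name envelopes is hypothetical.

```python
def envelopes(urls, alpha):
    """E(alpha): all (prefix, suffix) pairs around occurrences of alpha in the URL list."""
    E = set()
    for u in urls:
        start = u.find(alpha)
        while start != -1:
            E.add((u[:start], u[start + len(alpha):]))   # (p, s)
            start = u.find(alpha, start + 1)
    return E

urls = ["http://www.sitename.com/story_2659",
        "http://www.sitename.com/story?id=2659"]
print(envelopes(urls, "story_"))
# {('http://www.sitename.com/', '2659')}
```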
![Page 18: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/18.jpg)
Envelopes Example
![Page 19: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/19.jpg)
Rule Support: An Equivalent View
α → β: an alias DUST rule. Example: α = "story_", β = "story?id="
Lemma: |Support(α → β)| = |E(α) ∩ E(β)|
Proof:
bucket(p,s) = { α | (p,s) ∈ E(α) }
Observation: (u,v) is an instance of α → β if and only if u = p◦α◦s and v = p◦β◦s for some (p,s).
Hence, (u,v) is an instance of α → β iff (p,s) ∈ E(α) ∩ E(β).
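Following the lemma, a sketch that reconstructs the instances of α → β directly from E(α) ∩ E(β); the helpers E and support are hypothetical, and for brevity envelopes are taken at the first occurrence of the substring only.

```python
def E(urls, a):
    """Envelopes (p, s) of the first occurrence of substring a in each URL (sketch)."""
    return {(u[:i], u[i + len(a):]) for u in urls for i in [u.find(a)] if i != -1}

def support(urls, alpha, beta):
    """One instance (u, v) of alpha -> beta per envelope in E(alpha) ∩ E(beta)."""
    return [(p + alpha + s, p + beta + s) for (p, s) in E(urls, alpha) & E(urls, beta)]

urls = ["http://www.sitename.com/story_2659",
        "http://www.sitename.com/story?id=2659"]
print(support(urls, "story_", "story?id="))
# [('http://www.sitename.com/story_2659', 'http://www.sitename.com/story?id=2659')]
```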
![Page 20: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/20.jpg)
Large Buckets
Often there is a large set of substrings that are interchangeable within a given URL while not being DUST:
page=1, page=2, …
lecture-1.pdf, lecture-2.pdf
This gives rise to large buckets.
![Page 21: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/21.jpg)
Big buckets: popular prefix and suffix. They often do not contain similar content, and big buckets are expensive to process.
I am a DUCK not a DUST
Small Buckets Principle
Most of the support of valid alias DUST rules is likely to belong to small buckets.
![Page 22: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/22.jpg)
Scan the log and form buckets.
Ignore big buckets.
For each small bucket: for every two substrings α, β in the bucket, print (α, β).
Sort by (α, β).
For every pair (α, β): count its occurrences; if (count > threshold), print α → β.
Algorithm – Detecting Likely DUST Rules. No fetch here!
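A toy sketch of this detection step, under two assumptions: the candidate substrings are given explicitly (whereas the algorithm above forms buckets from the log's own substrings), and MAX_BUCKET / MIN_SUPPORT are illustrative thresholds.

```python
from collections import defaultdict, Counter

MAX_BUCKET = 6    # "ignore big buckets" (illustrative value)
MIN_SUPPORT = 3   # "count > threshold" (illustrative value)

def detect_likely_rules(urls, substrings):
    buckets = defaultdict(set)                      # (p, s) -> substrings seen in that envelope
    for u in urls:
        for a in substrings:
            i = u.find(a)
            if i != -1:
                buckets[(u[:i], u[i + len(a):])].add(a)
    counts = Counter()
    for members in buckets.values():
        if len(members) > MAX_BUCKET:               # skip big buckets
            continue
        for a in members:
            for b in members:
                if a != b:
                    counts[(a, b)] += 1             # one envelope supports a -> b
    return [rule for rule, c in counts.items() if c >= MIN_SUPPORT]

log = ["http://x/story_%d" % i for i in (1, 2, 3)] + \
      ["http://x/story?id=%d" % i for i in (1, 2, 3)]
print(detect_likely_rules(log, ["story_", "story?id="]))
# both directions of "story_" <-> "story?id=" (order may vary)
```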
![Page 23: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/23.jpg)
Size and Comments
Consider only instances of rules whose sizes "match"; use ranges of sizes.
Running time: O(L log L).
Process only short substrings.
Tokenize URLs.
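One possible tokenization as mentioned above (assumption: tokens are maximal alphanumeric runs, with delimiters kept as single-character tokens); the slides do not specify the exact scheme.

```python
import re

def tokenize(url: str):
    """Split a URL into alphanumeric tokens and single-character delimiters."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)

print(tokenize("http://domain.name/story?id=2659"))
# ['http', ':', '/', '/', 'domain', '.', 'name', '/', 'story', '?', 'id', '=', '2659']
```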
![Page 24: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/24.jpg)
Input: URL list
Detect likely DUST rules
Eliminate redundant rules
Validate DUST rules using samples:
Eliminate DUST rules that are "wrong"
Further eliminate duplicate DUST rules
No Fetch Required
Our Basic Framework
![Page 25: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/25.jpg)
Eliminating Redundant Rules
Example: "/vlsi/" → "/labs/vlsi/" refines "/vlsi" → "/labs/vlsi"
Lemma: A substitution rule α' → β' refines rule α → β if and only if there exists an envelope (γ, δ) such that α' = γ◦α◦δ and β' = γ◦β◦δ.
The lemma helps us identify refinements easily: does φ refine ψ? Remove ψ if their supports match.
Rule φ refines rule ψ if SUPPORT(φ) ⊆ SUPPORT(ψ).
No Fetch here!
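A sketch of the refinement test implied by the lemma: α' → β' refines α → β iff α' = γ◦α◦δ and β' = γ◦β◦δ for some (γ, δ). The helper name refines is illustrative.

```python
def refines(alpha2, beta2, alpha, beta):
    """Does the rule alpha2 -> beta2 refine the rule alpha -> beta?"""
    for i in range(len(alpha2) - len(alpha) + 1):
        if alpha2[i:i + len(alpha)] == alpha:
            g, d = alpha2[:i], alpha2[i + len(alpha):]   # candidate envelope (gamma, delta)
            if g + beta + d == beta2:
                return True
    return False

print(refines("/vlsi/", "/labs/vlsi/", "/vlsi", "/labs/vlsi"))   # True: the slide's example
print(refines("/vlsi", "/labs/vlsi", "/vlsi/", "/labs/vlsi/"))   # False: not the other way
```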
![Page 26: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/26.jpg)
Validating Likely Rules
For each likely rule r, in both directions:
Find sample URLs from the list to which r is applicable.
For each URL u in the sample: v = r(u); fetch u and v; check if content(u) is similar to content(v).
If the fraction of similar pairs > threshold, declare rule r valid.
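A hedged sketch of this validation loop; fetch and similar (a shingle-based comparison) are passed in as assumed callables so the example runs without a crawler, and sample_size / threshold are illustrative values.

```python
import random

def validate_rule(alpha, beta, urls, fetch, similar, sample_size=20, threshold=0.9):
    """Declare the rule alpha -> beta valid if enough sampled pairs have similar content."""
    applicable = [u for u in urls if alpha in u]
    if not applicable:
        return False
    sample = random.sample(applicable, min(sample_size, len(applicable)))
    ok = 0
    for u in sample:
        v = u.replace(alpha, beta, 1)      # v = r(u)
        cu, cv = fetch(u), fetch(v)        # fetch u and v
        if cu is not None and cv is not None and similar(cu, cv):
            ok += 1
    return ok / len(sample) >= threshold

# Toy check with a fake fetcher that serves identical content for both forms:
pages = {"http://x/story_1": "same text", "http://x/story?id=1": "same text"}
print(validate_rule("story_", "story?id=", ["http://x/story_1"],
                    fetch=pages.get, similar=lambda a, b: a == b))   # True
```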
![Page 27: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/27.jpg)
Assumption: if a rule passes the validation threshold on a sample of 100 URLs, it would pass for any larger validation sample as well.
Why isn't the threshold 100%? A 95%-valid rule may still be worth it, and dynamic pages change often.
Comments About Validation
![Page 28: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/28.jpg)
We experimented on logs of two web sites: a dynamic forum and an academic site.
Rules were detected from a log of about 20,000 unique URLs.
On each site we used four logs from different time periods.
Experimental Setup
![Page 29: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/29.jpg)
Precision at k
![Page 30: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/30.jpg)
Precision vs. Validation
![Page 31: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/31.jpg)
How much of the DUST do we find?
What other duplicates are there?
Soft errors
True copies: last semester's course, all authors of a paper
Frames
Image galleries
Recall
![Page 32: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/32.jpg)
In a crawl we examined, 18% of the crawl was reduced.
DUST Distribution
47.1% DUST
25.7% Images
7.6% Soft errors
17.9% Exact copies
1.8% Misc
![Page 33: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/33.jpg)
DustBuster is an efficient algorithm: it finds DUST rules, can reduce a crawl, and can benefit ranking algorithms.
Conclusions
![Page 34: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/34.jpg)
THE END
![Page 35: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/35.jpg)
= => --> all rules with “” Fix drawing urls crossing alpha not
all p and all s
Things to fix
![Page 36: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/36.jpg)
So far, rules are non-directional.
Prefer shrinking rules.
Prefer lexicographically lowering rules.
Check those directions first.
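A minimal sketch of such a direction preference, assuming a rule is oriented toward the shorter (then lexicographically smaller) side; orient is a hypothetical helper.

```python
def orient(alpha, beta):
    """Return (source, target) so the rule shrinks, or lowers lexicographically on ties."""
    if (len(beta), beta) < (len(alpha), alpha):
        return alpha, beta
    return beta, alpha

print(orient("story_", "story?id="))   # ('story?id=', 'story_'): prefer shrinking
print(orient("", "index.html"))        # ('index.html', ''): prefer shrinking
```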
![Page 37: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/37.jpg)
For each parameter name and its possible values, which rules are considered:
Remove the parameter
Substitute one value with another
Substitute all values with a single value
These rules are validated the same way the alias rules are.
Not discussed further here.
Parametric DUST
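A sketch of two of these parameter-rule types using the standard library's URL parsing; drop_parameter and default_parameter are illustrative names, the Print parameter comes from the slides' example, and lang is a hypothetical one.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_parameter(url, name):
    """Rule type 1: remove a parameter that does not affect content."""
    parts = urlsplit(url)
    query = [(k, v) for (k, v) in parse_qsl(parts.query) if k != name]
    return urlunsplit(parts._replace(query=urlencode(query)))

def default_parameter(url, name, default):
    """Rule type 3: substitute all values of a parameter with a single default value."""
    parts = urlsplit(url)
    query = [(k, default if k == name else v) for (k, v) in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(drop_parameter("http://x/story?id=1259&Print=1", "Print"))
# http://x/story?id=1259
print(default_parameter("http://x/story?id=1259&lang=fr", "lang", "en"))
# http://x/story?id=1259&lang=en
```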
![Page 38: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/38.jpg)
Unfortunately, we see a lot of "wrong" rules.
Just wrong: substituting "1" with "2", or one domain name with another running similar software.
Examples of false rules:
/YoninaEldar/ ≠ /DavidMalah/
/labs/vlsi/oldsite ≠ /labs/vlsi
-2. ≠ -3.
False Rules
![Page 39: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/39.jpg)
Filtering out False Rules
Getting rid of the big buckets.
Using the size field: false DUST rules may give valid URLs, but the content is not similar and the size is probably different; size ranges are used.
Tokenization helps.
![Page 40: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/40.jpg)
DustBuster – cleaning up the rules:
Go over the list with a window.
If rule a refines rule b and their support sizes are close, leave only rule a.
![Page 41: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/41.jpg)
DustBuster – Validation
Validation per rule:
Get sample URLs to which the rule can be applied.
Apply the rule: URL => applied URL.
Get the content.
Compare using shingles.
![Page 42: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/42.jpg)
DustBuster - Validation
Stop fetching when: #failures > 100 * (1 - threshold).
A page that doesn't exist is not similar to anything else.
Why use a threshold < 100%? Shingles are not perfect, and dynamic pages may change a lot, fast.
![Page 43: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/43.jpg)
Detect Alias DUST – take 2
Tokenize, of course.
Form buckets; ignore big buckets.
Count support only if sizes match.
Don't count long substrings.
Results are cleaner.
![Page 44: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/44.jpg)
Eliminate Redundancies

EliminateRedundancies(pairs_list R)
  for i = 1 to |R| do
    if (already eliminated R[i]) continue
    to_eliminate_current := false
    /* Go over a window */
    for j = 1 to min(MW, |R| - i) do
      /* Support not close? Stop checking */
      if (R[i].size - R[i+j].size > max(MRD*R[i].size, MAD)) break
      /* a refines b? remove b */
      if (R[i] refines R[i+j])
        eliminate R[i+j]
      else if (R[i+j] refines R[i]) then
        to_eliminate_current := true
        break
    if (to_eliminate_current)
      eliminate R[i]
  return R
No Fetch here!
![Page 45: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/45.jpg)
Validate a Single Rule

ValidateRule(R, L)
  positive := 0
  negative := 0
  /* Stop when you are sure you either succeeded or failed */
  while (positive < (1 - ε)N AND negative < εN) do
    u := a random URL from L to which R is applicable
    v := outcome of applying R to u
    fetch u and v
    if (fetch u failed) continue
    /* Something went wrong, negative sample */
    if (fetch v failed) OR (shingling(u) ≠ shingling(v))
      negative := negative + 1
    /* Another positive sample */
    else
      positive := positive + 1
  if (negative ≥ εN)
    return FALSE
  return TRUE
![Page 46: Do not crawl in the dust different ur ls similar text](https://reader033.vdocument.in/reader033/viewer/2022042813/54be2de94a79598c1e8b4587/html5/thumbnails/46.jpg)
Validate Rules

Validate(rules_list R, test_log L)
  create list of rules LR
  for i = 1 to |R| do
    /* Go over rules that survived = valid rules */
    for j = 1 to i - 1 do
      if (R[j] was not eliminated AND R[i] refines R[j])
        eliminate R[i] from the list
        break
    if (R[i] was eliminated)
      continue
    /* Test one direction */
    if (ValidateRule(R[i].alpha → R[i].beta, L))
      add R[i].alpha → R[i].beta to LR
    /* Test other direction only if first direction failed */
    else if (ValidateRule(R[i].beta → R[i].alpha, L))
      add R[i].beta → R[i].alpha to LR
    else
      eliminate R[i] from the list
  return LR