Search Engine and Web Mining · 2020-03-29
Search Engine and Web Mining
Hamed Monkaresi
Department of Computer Engineering and Information Technology,
Razi University, Kermanshah
Outline
• Web challenges
• Search engines
• Web crawling
• Web ranking
– Ranking algorithms
– Ranking challenges
Why has the Web succeeded?
• A distributed system
• A simple protocol
• Producing and publishing content is very simple
Web Retrieval
[Figure: the Search Engine performs Matching between the User Space (Retrieval, Browsing) and the Information Space (Index terms, Full text, Full text + Structure, e.g. hypertext)]
IR vs Data Retrieval
• Data retrieval (DR) aims at retrieving all objects that satisfy clearly defined conditions, such as those in a regular expression
• DR does not solve the problem of retrieving information about a subject or topic
Comparing IR to databases (vs data retrieval)
| | Databases | IR |
|---|---|---|
| Data | Structured | Unstructured |
| Fields | Clear semantics (SSN, age) | No fields (other than text) |
| Queries | Defined (relational algebra, SQL) | Free text (“natural language”), Boolean |
| Query specification | Complete | Incomplete |
| Matching | Exact (results are always “correct”) | Imprecise (need to measure effectiveness) |
| Error response | Sensitive | Insensitive |
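The contrast between exact matching and imprecise matching can be sketched in a few lines of Python. This is only an illustration on made-up data: the records, documents, and the helper names `select_older_than` and `rank_by_overlap` are assumptions, and term overlap is a deliberately naive stand-in for a real relevance score.

```python
# Data retrieval: structured records, exact condition, every result is "correct".
records = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 25},
    {"name": "Eve", "age": 42},
]

def select_older_than(rows, min_age):
    """Return exactly the rows whose age exceeds min_age."""
    return [row["name"] for row in rows if row["age"] > min_age]

# Information retrieval: unstructured text, free-text query, imprecise ranking.
documents = {
    "d1": "web crawling and indexing for search engines",
    "d2": "cooking pasta with tomato sauce",
    "d3": "ranking web pages with link analysis",
}

def rank_by_overlap(docs, query):
    """Score each document by the number of distinct query terms it contains."""
    terms = set(query.lower().split())
    scores = {doc_id: len(terms & set(text.split())) for doc_id, text in docs.items()}
    matching = [doc_id for doc_id, score in scores.items() if score > 0]
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(matching, key=lambda doc_id: (-scores[doc_id], doc_id))

exact = select_older_than(records, 30)
ranked = rank_by_overlap(documents, "web search ranking")
print(exact)
print(ranked)
```

The first call either matches a row or does not; the second returns a best-effort ordering whose quality has to be measured, which is the point of the "Matching" row.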
Main points in IR
• What is the definition of relevancy?
• Evaluation!
– Subjective (unlike the evaluation of hardware or networks)
Web Challenges
• Huge volume of information
– 11.5 billion pages (2005)
– 64 billion pages (5 June 2008)
• Proliferation and dynamic nature
– New pages are created at a rate of 8% per week
– Only 20% of the current pages will still be accessible after one year
– New links are created at a rate of 25% per week
• Heterogeneous content
– HTML/Text/Audio/…
Web IR (SE) Challenges (1)
• The definition of relevancy
• The connectivity of content on the Web
– A huge graph
• Different types of queries
– Narrow
  • Needle in a haystack
– Wide
  • Overlapping with many areas
• Users have little patience: they commonly browse only the first ten results (i.e. one screen), hoping to find the “right” document for their query there
Web IR (SE) Challenges (2)
• Spamming phenomenon
– It is crucial for business sites to be ranked highly by the major search engines.
– Quite a few companies sell this kind of expertise (also known as “search engine optimization”): they actively research the ranking algorithms and heuristics of search engines, and know how many keywords to place (and where) in a Web page so as to improve the page’s ranking
– SEO books
• Content and connectivity spamming
• Anti-spamming solutions
Web IR (SE) Challenges (3)
• Rich-get-richer problem
– It takes a long time for a young, high-quality web page to receive a ranking that reflects its quality
– Unfairness
– The growth of web content is pushed in bad directions
Web IR (SE) Challenges (4)
• Crawling challenges
– Huge volume of information with a dynamic nature
– Freshness & coverage
  • Google covers only 70% of the Web
– A suitable scheduling policy
– Hidden web (600 times bigger)
• Using meta search engines to increase coverage
– Merging and ranking problem
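The merging problem for a meta search engine can be illustrated with a small sketch. The slides do not name a merging method; the one used here, reciprocal rank fusion, is a swapped-in heuristic chosen only because it needs nothing but the ranks. The constant `k=60`, the engine names, and the URLs `u1`…`u4` are all assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

# Hypothetical result lists returned by two underlying search engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
merged = reciprocal_rank_fusion([engine_a, engine_b])
print(merged)
```

Documents returned by both engines rise to the top, while coverage grows because `u3` and `u4` each come from only one engine.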
Web IR (SE) Challenges (5)
• User evaluation is subjective and changes over time
– The relevance of a document to a query depends on the user and the time
– Two users with the same query expect different results
Web IR (SE) Challenges (6)
• Query Ambiguity
– Python
– Car & automobile
Web Structure
• The Web graph has a bow-tie shape
• It has a scale-free topology
– Many features of the graph follow a power-law distribution
– The core has the small-world property
  • The shortest directed path from any page in the core to any other page in the core involves 16–20 links on average
Power law: $p(x) \propto x^{-\gamma}$
Distribution of Web Graph: Power-Law
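A power law can be written as follows; the symbols $C$ and $\gamma$ are notation assumed here rather than taken from the slides. Taking logarithms shows why such a distribution appears as a straight line with slope $-\gamma$ on a log-log plot:

```latex
p(x) = C\,x^{-\gamma}, \quad \gamma > 0
\qquad\Longrightarrow\qquad
\log p(x) = \log C - \gamma \log x
```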
Search Engine Trends
• 625 million search queries are received by major search engines each day
• 80% of web surfers discover the new sites that they visit through search engines
• Web search currently generates more than 85% of the traffic to most web sites
Components of Search Engines
• Crawling
• Indexing
• Ranking
Architecture of Search Engines
[Figure: search engine architecture. Components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Client (Queries, Results)]
Web Crawling Issues
• Coverage
– Google, the biggest search engine, covers only 70% of web content
– We must focus on high-quality pages
• Freshness
– Keep the local copy synchronized with the source pages
• Politeness
– Crawl without disrupting the web, and obey the webmasters’ constraints
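One common way for a crawler to honor webmasters' constraints is to consult a site's robots.txt before fetching. The sketch below uses Python's standard `urllib.robotparser` on an inline, made-up robots.txt; the crawler name `ExampleCrawler`, the rules, and the URLs are assumptions, and a real crawler would download the file from the site instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name

def allowed(url):
    """Check the webmaster's rules before downloading a URL."""
    return parser.can_fetch(USER_AGENT, url)

# Seconds to wait between requests to this host (None if not specified).
delay = parser.crawl_delay(USER_AGENT)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
print(delay)
```

Checking `allowed(url)` before each download and sleeping for `delay` seconds between requests to the same host covers both halves of the politeness bullet: obeying constraints and not overloading servers.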
Web crawling
[Figure: the crawler]
Crawling Scheduling
• Breadth-First
• Back-link count
• PageRank,…
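The difference between scheduling policies can be seen on a toy link graph. The sketch below is a simplified illustration, not a production scheduler: the graph `LINKS` and both function names are assumptions, and the back-link policy counts only links seen from pages already downloaded.

```python
from collections import deque

# Hypothetical toy web: page -> pages it links to.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def crawl_breadth_first(seed):
    """Visit pages in the order they are discovered (FIFO frontier)."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in LINKS[page]:
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order

def crawl_by_backlink_count(seed):
    """Always download next the discovered page with the most known in-links."""
    backlinks = {seed: 0}  # in-links seen so far from downloaded pages
    done, order = set(), []
    while len(done) < len(backlinks):
        # Pick the undownloaded page with the highest count; break ties by name.
        page = min((p for p in backlinks if p not in done),
                   key=lambda p: (-backlinks[p], p))
        done.add(page)
        order.append(page)
        for target in LINKS[page]:
            backlinks[target] = backlinks.get(target, 0) + 1
    return order

print(crawl_breadth_first("A"))
print(crawl_by_backlink_count("A"))
```

On this graph, breadth-first reaches `D` before `E`, while the back-link policy promotes `E` as soon as two downloaded pages point to it. A PageRank-based policy would replace the count with a score computed over the partial graph.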
Crawling scheduling
[Figure: crawl scheduling loop. Components: Web, Downloader, Repository, Ranking Algorithm; data flow: URLs and Links]
Indexing
• Text operations form the index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of word-to-document pointers.
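The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the stopword list is tiny and made up, `stem` is a crude suffix stripper standing in for a real stemmer, and the three documents are invented.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "is", "in"}  # tiny illustrative list

def stem(word):
    """Very crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords, then stem."""
    return [stem(w) for w in text.lower().split() if w not in STOPWORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "the crawler is crawling the web",
    2: "ranking of web pages",
    3: "indexing pages and links",
}
index = build_inverted_index(docs)
print(index["web"])
print(index["page"])
```

A query term is looked up with the same text operations applied, so a search for "pages" is stemmed to `page` and finds documents 2 and 3 without scanning any text.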
Indexing Systems
• Google File System
• MG4J (Managing Gigabytes for Java)
• Lucene (Java, Apache License)
• Swish-e (C, Linux)
Ranking: Definition
• Ranking is the process that estimates the quality of the results retrieved by a search engine
• Ranking is the most important part of a search engine
Ranking Types
• Content-based
– Classical IR
• Connectivity-based (web)
– Query independent
– Query dependent
• User-behavior based
Classical Information Retrieval
• Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
– Vector space
– Probabilistic
[Figure: Query, Words (1, 2, …, w), Docs (1, 2, …, n)]
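A vector-space ranking based on tf and idf can be sketched as below. The weighting used here (raw term count times log(N/df)) and cosine similarity are one common variant, assumed for illustration; the three documents and the function names are made up.

```python
import math
from collections import Counter

docs = {
    "d1": "web search engine ranking",
    "d2": "web crawler downloads web pages",
    "d3": "cooking pasta at home",
}

tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for words in tokenized.values() for term in set(words))

def idf(term):
    """Inverse document frequency; terms absent from the collection get 0."""
    return math.log(n_docs / df[term]) if df[term] else 0.0

def tfidf_vector(words):
    """Sparse vector: term -> raw term frequency times idf."""
    tf = Counter(words)
    return {term: count * idf(term) for term, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query):
    """Return document ids sorted by decreasing similarity to the query."""
    q = tfidf_vector(query.split())
    scores = {d: cosine(q, tfidf_vector(words)) for d, words in tokenized.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))

print(rank("web ranking"))
```

The rarer term "ranking" carries a higher idf than "web", so `d1`, which contains both, outranks `d2`, which only repeats the more common term.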
Classical Information Retrieval
• This works because of the following assumptions in classical IR:
– Queries are long and well specified
– Documents (e.g., newspaper articles) are coherent, well authored, and usually about one topic
– The vocabulary is small and relatively well understood
Web information retrieval
• Queries are short: 2.35 terms on average
• Huge variety in documents: language, quality, duplication
• Huge vocabulary: hundreds of millions of terms
• Deliberate misinformation
• Spamming!
– With content-based ranking, a page’s rank is completely under the control of the Web page’s author
Ranking in Web IR
• Ranking is a function of the query terms and of the hyperlink structure
– Using the content of other pages to rank the current page
• It is out of the control of the page’s author
– Spamming is hard
[Figure: Query, Words (1, 2, …, w), Docs (1, 2, …, n), and the Web graph linking Docs (1, 2, …, n)]
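As an example of a ranking signal derived from the hyperlink structure, the sketch below runs a simplified PageRank by power iteration. It is not presented as the method any particular engine uses: the damping factor 0.85, the iteration count, the treatment of dangling pages, and the toy graph are assumptions, and every link target is assumed to appear as a key in the dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration on a dict: page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy graph: C is linked to by every other page.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = pagerank(links)
best = max(scores, key=scores.get)
print(best, round(sum(scores.values()), 6))
```

Page `D` can say anything about itself, but with no in-links its score stays at the minimum; a page's score depends on what other pages do, which is why this kind of ranking is harder to spam than content-only ranking.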
Books
• Main textbook:
– C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
– http://www.cs.utexas.edu/~mooney/ir-course/
• Secondary:
– R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
Assessment
• Final Exam: 10 Marks
• Project: 5 Marks
• Homework: 2 Marks
• Paper Review and Presentation: 3 Marks
Papers for Review
• Cho, Junghoo, and Sourashis Roy. "Impact of search engines on page popularity." Proceedings of the 13th international conference on World Wide Web. ACM, 2004.
• Spink, Amanda, et al. "Searching the web: The public and their queries." Journal of the Association for Information Science and Technology 52.3 (2001): 226-234.
• Berners-Lee, Tim, James Hendler, and Ora Lassila. "The semantic web." Scientific American 284.5 (2001): 34-43.
Contacts
• Join this group on Shagerdaneh:
https://shagerdaneh.ir/
• Telegram Channel
https://telegram.me/RaziWM982
• Instructor’s email:
Questions?