Web Information Retrieval (Web IR)

Handout #0: Introduction

Ali Mohammad Zareh Bidoki
ECE Department, Yazd University
alizareh@yaduni.ac.ir

Autumn 2011


Outline

• Web challenges
• Search engines
• Web crawling
• Web ranking
  – Ranking algorithms
  – Ranking challenges


Web Challenges

• Huge size of information
  – 11.5 billion pages (2005)
  – 64 billion pages (5 June 2008)
• Proliferation and dynamic nature
  – New pages are created at a rate of 8% per week
  – Only 20% of the current pages will still be accessible after one year
  – New links are created at a rate of 25% per week
• Heterogeneous contents
  – HTML/Text/Audio/…
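
To get a feel for the churn figures above, the yearly survival rate can be converted into a weekly one. This is only a back-of-the-envelope sketch: it assumes that every page disappears independently with the same probability each week, which the slide does not state.

```python
import math

# Figure taken from the slide: 20% of today's pages are still there after a year.
yearly_survival = 0.20
weeks_per_year = 52

# Assumption (not from the slide): constant, independent weekly survival.
weekly_survival = yearly_survival ** (1 / weeks_per_year)
weekly_loss = 1 - weekly_survival

# Number of weeks after which half of today's pages are gone.
half_life_weeks = math.log(0.5) / math.log(weekly_survival)

print(f"weekly loss: {weekly_loss:.2%}")
print(f"half-life:   {half_life_weeks:.1f} weeks")
```

Under this assumption roughly 3% of pages vanish each week, so a crawled copy goes stale within months rather than years.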


Web Structure

• The Web graph has a bow-tie shape
• It has a scale-free topology
  – Many features of the graph follow a power-law distribution, p(x) ∝ x^(-γ)
  – The core has the small-world property
    • The shortest directed path from any page in the core to any other page in the core involves 16–20 links on average
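
A small numeric sketch may help make the power-law claim concrete. It builds a normalized discrete power law over x = 1 … x_max and checks its defining property: doubling x divides the probability by 2^γ. The exponent value 2.1 and the cut-off 1000 are illustrative assumptions, not values from the slide.

```python
def power_law_pmf(gamma, x_max):
    """Return [p(1), ..., p(x_max)] with p(x) proportional to x ** -gamma."""
    weights = [x ** -gamma for x in range(1, x_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

GAMMA = 2.1   # illustrative exponent (assumption)
pmf = power_law_pmf(GAMMA, x_max=1000)

# Scale-free behaviour: the ratio p(x) / p(2x) is the same for every x.
ratio_1_2 = pmf[0] / pmf[1]       # p(1) / p(2)
ratio_10_20 = pmf[9] / pmf[19]    # p(10) / p(20)
print(ratio_1_2, ratio_10_20, 2 ** GAMMA)
```

For the Web graph, x could for example be the number of in-links of a page: most pages have very few, while a small number of pages have a great many.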


Web Retrieval

[Figure: Web retrieval. Labels: User Space, Information Space, Matching, Retrieval, Browsing, Index terms, Full text, Full text + Structure (e.g. hypertext), Search Engine]


Search Engine Trends

• 625 million search queries are received by major search engines each day

• 80% of web surfers discover the new sites that they visit through search engines

• Web search currently generates more than 85% of the traffic to most web sites


Components of Search Engines

• Crawling
• Indexing
• Ranking


Architecture of Search Engines

[Figure: search engine architecture. Components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Client (Queries, Results)]


Web Crawling Issues

• Coverage
  – Google, the biggest search engine, covers only 70% of web content
  – We must focus on high-quality pages
• Freshness
  – Keep the local copy synchronized with the source pages
• Politeness
  – Crawl without disrupting the web, and obey the webmasters' constraints
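
One common way for webmasters to express such constraints is a robots.txt file. The sketch below shows how a crawler might check it with Python's standard `urllib.robotparser` module. The robots.txt content, the bot name `ExampleBot` and the `example.com` URLs are made up for illustration, and the file is parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as a webmaster might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask before fetching: is this URL allowed for our (hypothetical) crawler?
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")

# Seconds to wait between requests to this host, if the site specifies it.
delay = parser.crawl_delay("ExampleBot")

print(allowed, blocked, delay)
```

A polite crawler would skip the blocked URL and wait at least `delay` seconds between two requests to the same host.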


Web Crawling

[Figure: crawler]


Crawling Scheduling

• Breadth-first
• Back-link count
• PageRank, …
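
The difference between the first two policies can be sketched on a toy graph. The graph `LINKS`, the function names and the alphabetical tie-breaking rule are all assumptions made for this illustration. A real crawler discovers links only as it downloads pages, which is what the back-link counter below imitates.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c", "d"],
    "b": ["e"],
    "c": ["e"],
    "d": [],
    "e": [],
}

def crawl_breadth_first(seed):
    """Download pages in the order in which they were discovered."""
    order, seen, queue = [], {seed}, deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in LINKS[page]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

def crawl_by_backlink_count(seed):
    """Always download the known URL with the most back-links seen so far."""
    order, done = [], set()
    backlinks = {seed: 0}  # URL -> in-links found on already downloaded pages
    while len(done) < len(backlinks):
        candidates = [p for p in backlinks if p not in done]
        # Highest count first; ties are broken alphabetically (assumption).
        page = min(candidates, key=lambda p: (-backlinks[p], p))
        done.add(page)
        order.append(page)
        for target in LINKS[page]:
            backlinks[target] = backlinks.get(target, 0) + 1
    return order

bfs_order = crawl_breadth_first("a")
backlink_order = crawl_by_backlink_count("a")
print(bfs_order)
print(backlink_order)
```

On this graph the back-link policy fetches page "e" before page "d", because two already downloaded pages point to "e" while only one points to "d".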


Crawling scheduling

[Figure: crawl scheduling loop. Components: Web, Downloader, Repository, Ranking Algorithm; data passed between them: URLs and links]


Indexing

• Text operations form the index words (tokens).
  – Stopword removal
  – Stemming
• Indexing constructs an inverted index of word-to-document pointers.
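
The two steps above can be sketched in a few lines. The stopword list is a tiny illustrative one, and `naive_stem` is a deliberately crude stand-in for a real stemmer such as Porter's. Both are assumptions for the example, as are the three sample documents.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}  # tiny illustrative list

def naive_stem(token):
    """Crude stand-in for a real stemmer: strip a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    """Text operations: lowercase, split, remove stopwords, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Crawling the web",
    2: "Ranking web pages",
    3: "Indexing pages of the web",
}
index = build_inverted_index(docs)
print(index)
```

Answering a one-word query is then a dictionary lookup, and a Boolean AND query is an intersection of the posting lists.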


Comparing IR to databases (IR vs. data retrieval)

                      Databases                              IR
Data                  Structured                             Unstructured
Fields                Clear semantics (SSN, age)             No fields (other than text)
Queries               Defined (relational algebra, SQL)      Free text (“natural language”), Boolean
Query specification   Complete                               Incomplete
Matching              Exact (results are always “correct”)   Imprecise (need to measure effectiveness)
Error response        Sensitive                              Insensitive


Indexing Systems

• Google File System
• MG4J (Managing Gigabytes for Java)
• Lucene (Java, Apache License)
• Swish-e (C++-Linux)


Ranking: Definition

• Ranking is the process which estimates the quality of a set of results retrieved by a search engine

• Ranking is the most important part of a search engine


Ranking Types

• Content-based
  – Classical IR
• Connectivity-based (web)
  – Query-independent
  – Query-dependent
• User-behavior-based


Classical Information Retrieval

• Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
  – Vector space
  – Probabilistic

[Figure: words (1 … w) by documents (1 … n) matrix, matched against the query]
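
As a minimal sketch of the vector-space variant, the code below weights each term by tf × idf and ranks documents by cosine similarity to the query. The three sample documents, the whitespace tokenization and the exact weighting formula (raw tf times log(N/df)) are assumptions chosen for brevity. Many other tf-idf variants exist.

```python
import math
from collections import Counter

docs = {
    "d1": "web search engines rank web pages",
    "d2": "crawlers download web pages",
    "d3": "classical retrieval ranks newspaper articles",
}

def tf_idf_vectors(docs):
    """Return ({doc: {term: tf * idf}}, {term: idf}) for the collection."""
    tokenized = {d: text.split() for d, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    idf = {t: math.log(n_docs / count) for t, count in df.items()}
    vectors = {
        d: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document ids ordered from most to least similar to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.split()).items()}
    scores = {d: cosine(q, vec) for d, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank("web search", docs)
print(ranking)
```

The rare term "search" carries more weight than the frequent term "web", so the document containing both query terms comes first and the one containing neither comes last.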


Classical Information Retrieval

• This works because of the following assumptions in classical IR:
  – Queries are long and well specified
  – Documents (e.g., newspaper articles) are coherent, well authored, and usually about one topic
  – The vocabulary is small and relatively well understood


Web information retrieval

• Queries are short: 2.35 terms on average
• Huge variety in documents: language, quality, duplication
• Huge vocabulary: hundreds of millions of terms
• Deliberate misinformation
• Spamming!
  – A page's content-based rank is completely under the control of the page's author


Ranking in Web IR

• Ranking is a function of the query terms and of the hyperlink structure
  – Using the content of other pages to rank the current page
• It is out of the control of the page’s author
  – Spamming is hard

[Figure: words (1 … w) by documents (1 … n) matrix combined with the document-to-document Web graph (1 … n), matched against the query]
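
PageRank, named earlier under crawl scheduling, is one well-known example of a query-independent, link-based score. The sketch below computes it by simple power iteration on a toy graph. The graph, the damping factor 0.85, the fixed iteration count and the uniform handling of pages without out-links are assumptions of this example, not details given in the handout.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; `links` maps page -> list of linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without out-links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical graph: "c" is linked from three pages, "d" from none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
worst = min(ranks, key=ranks.get)
print(ranks, best, worst)
```

The author of page "d" cannot raise its score by editing the page itself, since the score depends only on links from other pages. This is the sense in which link-based ranking is harder to spam.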


Books

– Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
– Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999


Grading

• Exam: 50%
• Project & Homework: 30%
• Paper review: 10%
• Paper presentation: 10%


Web Site

• http://ce.yazduni.ac.ir/zareh/courses/webir/


Next paper for Review

• Impact of Search Engines on Page Popularity by Cho


Course Outline

• Web structure
• Crawling/Ranking/Indexing in Web search engines
• Retrieval in Persian documents
  – Query processing
  – Indexing solutions
• Cross-language information retrieval
• Semantic web
