information retrieval - indian institute of technology bombay · web crawler a web crawler is an...

Information Retrieval Shehzaad Dhuliawala Maulik Vachhani

Post on 27-May-2020


Page 1: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose

Information Retrieval

Shehzaad Dhuliawala

Maulik Vachhani

Presentation Outline

• Introduction

• Boolean Retrieval

• Indexing

    Term Vocabulary

    Postings List

    Index Creation

• Retrieval Models and Scoring

    Vector Space Model

    Probabilistic Model

• Web Crawling

• Cross Lingual Information Retrieval

Content we will refer to

1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/

2. Coursera Natural Language Processing course by Dan Jurafsky and Christopher Manning. https://class.coursera.org/nlp/

3. NPTEL course on Natural Language Processing by Pushpak Bhattacharyya. http://nptel.ac.in/courses/106101007/

What is Information Retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). [1]

Unstructured Text

• What differentiates an IR system from a database?

IR Models

• An IR model is a quadruple

[D, Q, F, R(di, qi)]

D: Collection of documents

Q: Collection of queries

F: Framework for modelling the documents, the queries and their relationship

R: A ranking/scoring function that returns a real number expressing the relevance of document di to query qi

Boolean Retrieval

• It is a simple model based on set theory

• It checks whether terms are present in a document or not

Example

• We have a collection of scientific papers in the field of computer science

• The information need: a collection of papers which are about information retrieval using machine learning

• Query: information ∧ retrieval ∧ machine ∧ learning

• Because the query is a conjunction, the result is the intersection of the term sets: Set(information) ∩ Set(retrieval) ∩ Set(machine) ∩ Set(learning)

[Figure: Venn diagram of the document sets for information, retrieval, machine and learning]

Grepping

• The Unix grep command lets you search for the presence of a term in a document

• Why does this approach pose a problem?

The term-document matrix

         compiler  machine  learning  deep  information  retrieval  translation
Doc 1       1         0        0       0        0            0           0
Doc 2       0         1        1       0        1            1           0
Doc 3       0         0        1       1        1            0           0
Doc 4       0         0        0       0        0            0           1
Doc 5       1         1        1       1        1            1           0
Doc 6       1         1        0       0        1            0           0

Query: machine ∧ learning ∧ ¬(compiler)

(010011) ∧ (011010) ∧ (011100) = (010000)

So relevant set -> Doc 2
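The bit-vector evaluation above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names (`term_vector`, `docs_from_bits`) are made up for this example, and each term's column is packed into an integer with Doc 1 as the highest bit.

```python
# Term-document incidence matrix from the slide: one row per document,
# one column per term (1 = the term occurs in the document).
TERMS = ["compiler", "machine", "learning", "deep",
         "information", "retrieval", "translation"]
MATRIX = {
    "Doc 1": [1, 0, 0, 0, 0, 0, 0],
    "Doc 2": [0, 1, 1, 0, 1, 1, 0],
    "Doc 3": [0, 0, 1, 1, 1, 0, 0],
    "Doc 4": [0, 0, 0, 0, 0, 0, 1],
    "Doc 5": [1, 1, 1, 1, 1, 1, 0],
    "Doc 6": [1, 1, 0, 0, 1, 0, 0],
}
DOCS = list(MATRIX)


def term_vector(term):
    """Pack a term's column into an int bitmask (Doc 1 is the highest bit)."""
    col = TERMS.index(term)
    bits = 0
    for doc in DOCS:
        bits = (bits << 1) | MATRIX[doc][col]
    return bits


def docs_from_bits(bits):
    """Translate a bitmask back into the list of matching document names."""
    n = len(DOCS)
    return [DOCS[i] for i in range(n) if bits & (1 << (n - 1 - i))]


# Mask that keeps NOT within the six documents of the collection.
ALL_DOCS = (1 << len(DOCS)) - 1

# machine AND learning AND NOT compiler
result = (term_vector("machine")
          & term_vector("learning")
          & (ALL_DOCS & ~term_vector("compiler")))

print(format(result, "06b"), docs_from_bits(result))
```

The mask `ALL_DOCS` is needed because Python integers are unbounded, so a bare `~` would set bits beyond the six documents.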

TDM: Sparseness

• Space complexity: |V| · |D|

• |V| -> Vocabulary size

• |D| -> No. of documents

• |V| = 500,000

• |D| = 1 million

• Space required: ~500 GB (5 × 10^11 cells at one byte per cell)

Inverted Index

[Figure: inverted index. Each term (compilers, machine, learning, information, retrieval) points to its postings list, a sorted list of the IDs of the documents that contain the term]
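A minimal sketch of the idea, assuming a toy collection that is not from the slides: each term maps to a sorted postings list, and a conjunctive (AND) query is answered by merging two lists in a single pass. The function names and the four sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping the IDs present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical toy collection (document ID -> text).
docs = {
    1: "compilers for machine code",
    2: "machine learning for information retrieval",
    5: "information retrieval and machine learning compilers",
    6: "compilers and information theory",
}
index = build_inverted_index(docs)
hits = intersect(index["machine"], index["learning"])
print(hits)
```

Only the terms that actually occur are stored, which is what avoids the space cost of the full term-document matrix.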

NLP and IR

How NLP helps IR

• Tokenization

• Stemming/Lemmatization

• Stopword removal

• Normalization

• Named Entities

• Multi-word expressions

Tokenization

• Text is a sequence of characters

• For term-based indexing we need to decide how to tokenize the text

• Where does this become a problem?

• O’Neal, Knock-out: how do we tokenize these?

Stemming

• Q: The best cars

• D: The best car in 2016 is the Honda…

• More important in morphologically richer languages (e.g., Marathi)

• मुंबईहून पुण्याला जाणाऱ्या बसची वेळ (“the timing of the bus going from Mumbai to Pune”)

• Is stemming always beneficial?

Stopword Removal

• Which words actually convey the meaning of the text?

• Taj Mahal is situated in Agra which is close to Delhi

• The same sentence with the stopwords (is, in, which, to) removed: Taj Mahal situated Agra close Delhi

• It has been shown that removing stopwords often improves the performance of an IR system and reduces index size

• Is it always beneficial to remove stopwords?
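As a rough illustration of stopword removal, the sketch below filters tokens against a tiny hand-picked list. The list and the function name are assumptions made for this example; real systems use much longer, language-specific lists.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"is", "in", "which", "to", "the", "a", "of"}


def remove_stopwords(text):
    """Split on whitespace and drop tokens that appear in the stopword list."""
    return [tok for tok in text.split() if tok.lower() not in STOPWORDS]


content_words = remove_stopwords(
    "Taj Mahal is situated in Agra which is close to Delhi")
print(content_words)
```

A query made up mostly of stopwords (a song title or a quotation, say) would lose most of its terms under this filter, which is one reason removal is not always beneficial.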

Normalization

• Text often contains stylistic variation, and usage may not be consistent

• For example, one document may contain the term USA, while another contains U.S.A

• Should both be indexed separately?
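One possible answer, sketched under the assumption that periods and letter case carry no meaning for the index: map both spellings to the same normalized term. The rule below is illustrative only; it would also merge forms that some applications want to keep apart.

```python
def normalize(token):
    """Drop periods and fold case so that variant spellings share one index term."""
    return token.replace(".", "").lower()


print(normalize("USA"), normalize("U.S.A"))
```

The same function has to be applied to query terms at search time, otherwise the normalized index will not match the raw query.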

Named Entities and Multiword Expressions

• Often a group of words may be more relevant together than individually

• Q: machine learning

• D: …the machine was used by several students and this was a good learning experience for them…

• Such terms are called multiword expressions

• Should they be indexed together?

Retrieval Models

Problems with Boolean search

• Boolean queries often result in either too few (=0) or too many (1000s) results.

• Query 1: “standard user dlink 650” → many results

• Query 2: “standard user dlink 650 no card found”: 0 hits

• It takes a lot of skill to come up with a query that produces a manageable number of hits.

    AND gives too few; OR gives too many

• Retrieved documents are not returned in any order of relevance.

Ranked retrieval models

• Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query

• Ranked retrieval models are:

    1. Vector Space Model

    2. Probabilistic model

Scoring as the basis of ranked retrieval

• We wish to return, in order, the documents most likely to be useful to the searcher

• Assign a score – say in [0, 1] – to each document

• We need a way of assigning a score to a query/document pair

• The more frequent the query term in the document, the higher the score (should be)

• Rare terms are more informative than frequent terms

Term frequency tf

• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.

• Raw term frequency is not what we want: relevance does not increase proportionally with term frequency.

• So we use log frequency.

• The log frequency weight of term t in d is

$$w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$

idf weight

• Frequent terms are less informative than rare terms

• We want to give higher weights to rare terms.

• We will use document frequency (df) to capture this.

• df_t is the document frequency of t: the number of documents that contain t

• We define the idf (inverse document frequency) of t by

$$\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$$

• N is the total number of documents.

tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight.

$$w_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$$

• Alternative names: tf.idf, tf x idf

• Increases with the number of occurrences within a document

• Increases with the rarity of the term in the collection

$$\mathrm{Score}(q, d) = \sum_{t \in q \cap d} \mathrm{tf.idf}_{t,d}$$
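The scoring rule can be sketched as follows. This is an illustration under assumptions: it uses the log-frequency weight from the term frequency slide (1 + log10 tf) rather than the log(1 + tf) variant, documents are plain token lists, and the four sample documents and function names are made up.

```python
import math
from collections import Counter


def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def idf(term, docs):
    """idf_t = log10(N / df_t); defined as 0 here if no document has the term."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(len(docs) / df) if df else 0.0


def score(query, doc_id, docs):
    """Sum tf-idf weights over the terms shared by the query and the document."""
    counts = Counter(docs[doc_id])
    return sum(log_tf(counts[t]) * idf(t, docs)
               for t in set(query) if counts[t] > 0)


# Hypothetical toy collection (document ID -> list of tokens).
docs = {
    "d1": "the best car is the honda".split(),
    "d2": "the best bike".split(),
    "d3": "car car car review".split(),
    "d4": "weather report".split(),
}
query = ["best", "car"]
for doc_id in docs:
    print(doc_id, round(score(query, doc_id, docs), 4))
```

In this toy run d1 scores highest because it matches both query terms, while d3 gains only a little from repeating "car" three times, which is the dampening effect of the log.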

Documents and query as vectors

• So we have a |V|-dimensional vector space

• Terms are axes of the space

• Documents and query are points or vectors in this space

• Find the cosine similarity between documents and query.

$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

• Since we are interested in relative values only, the query norm |q| can be dropped: it is the same for every document. The document norm |d| differs per document, so it must be kept unless the document vectors are already length-normalized, in which case the cosine reduces to the dot product.

Summary – vector space model

• Represent the query as a weighted tf-idf vector

• Represent each document as a weighted tf-idf vector

• Compute the cosine similarity score for the query vector and each document vector

• Rank documents with respect to the query by score

• Return the top K (e.g., K = 10) to the user
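The steps above can be sketched as a small ranking routine. The sparse `{term: weight}` dictionaries, the sample weights and the function names are assumptions for illustration; in practice the weights would be tf-idf values.

```python
import math


def cosine(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query_vec, doc_vecs, k=10):
    """Score every document against the query and return the top k."""
    scored = [(cosine(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    # Highest score first; ties are broken by document ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, s) for s, doc_id in scored[:k]]


# Hypothetical weighted vectors.
query_vec = {"machine": 1.0, "learning": 1.0}
doc_vecs = {
    "a": {"machine": 1.0, "learning": 1.0},
    "b": {"machine": 3.0, "compiler": 4.0},
    "c": {"translation": 2.0},
}
top = rank(query_vec, doc_vecs, k=2)
print(top)
```

Scoring every document is fine for a sketch; a real engine would only score documents found through the postings lists of the query terms.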

Probabilistic Model

• Probability Ranking Principle

    Let d be a document in the collection.

    R represents relevant documents

    NR represents non-relevant documents

• In a probabilistic model, the obvious way to produce the output is to rank documents by the estimated probability of their relevance with respect to the information need.

• That is, we order documents d by P(R|d, q), where q is the query.

• Examples are BM25, the Binary Independence Model, etc.

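BM25

• Ranks documents based on the query terms appearing in each document

• Given a query Q containing keywords q_1, …, q_n, the BM25 score of a document D is

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{\mathrm{TF}(q_i, D) \cdot (k_1 + 1)}{\mathrm{TF}(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgDL}}\right)}$$

$$\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

• Here TF(q_i, D) is the frequency of q_i in D, |D| is the length of D in words, avgDL is the average document length in the collection, N is the total number of documents, n(q_i) is the number of documents containing q_i, and k_1 and b are free parameters.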
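A direct transcription of the two formulas, offered as a sketch rather than a reference implementation. The defaults `k1=1.2` and `b=0.75` are commonly quoted values and are an assumption here, as are the function names and the five sample documents.

```python
import math
from collections import Counter


def bm25_idf(term, docs):
    """IDF(q_i) = log((N - n(q_i) + 0.5) / (n(q_i) + 0.5))."""
    n_docs = len(docs)
    n_q = sum(1 for tokens in docs.values() if term in tokens)
    return math.log((n_docs - n_q + 0.5) / (n_q + 0.5))


def bm25_score(query, doc_id, docs, k1=1.2, b=0.75):
    """BM25 score of one document for a list of query keywords."""
    tokens = docs[doc_id]
    avg_dl = sum(len(t) for t in docs.values()) / len(docs)
    counts = Counter(tokens)
    total = 0.0
    for q in query:
        tf = counts[q]
        denom = tf + k1 * (1 - b + b * len(tokens) / avg_dl)
        total += bm25_idf(q, docs) * tf * (k1 + 1) / denom
    return total


# Hypothetical toy collection (document ID -> list of tokens).
docs = {
    "d1": ["honda", "car", "review"],
    "d2": ["car", "prices"],
    "d3": ["weather", "report", "today"],
    "d4": ["bike", "review"],
    "d5": ["train", "times"],
}
print(bm25_score(["honda"], "d1", docs))
```

With this IDF form, a term that occurs in more than half of the documents gets a negative weight; implementations often clamp or shift the value to avoid that.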

Link based Model

Link Structure of the Web

• Intuitively, a webpage is important if it has a lot of backlinks.

• In-links and out-links: if A and B both link to C, then A and B are C’s in-links, and C is an out-link of both A and B.

PageRank

$$PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$

• p_1, p_2, …, p_N are the pages under consideration.

• M(p_i) is the set of pages that link to p_i.

• L(p_j) is the number of outbound links on page p_j.

• N is the total number of pages.

• d is the damping factor, a value between 0 and 1.
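The formula can be evaluated by repeated substitution, as in the sketch below. Several things are assumptions: the damping factor `d=0.85`, the fixed number of iterations, the three-page graph (A and B link to C, C links back to A), and the handling of pages without out-links, which the formula does not cover and which are spread evenly here.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the PageRank formula. links maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                # Each out-link receives an equal share PR(p) / L(p).
                share = d * pr[p] / len(out)
                for target in out:
                    new[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for target in pages:
                    new[target] += d * pr[p] / n
        pr = new
    return pr


# Hypothetical three-page web.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

C ends up with the highest rank because it has two in-links, and B, which nobody links to, keeps only the baseline (1 - d) / N.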

An example of Simplified PageRank

[Figure: PageRank calculation, first iteration]

Evaluation

Set based effectiveness measures

[Figure: Venn diagram of the Retrieved and Relevant document sets; their overlap is the set of documents that are relevant and retrieved]

Precision and recall

Precision (P) is the fraction of retrieved documents that are relevant

Recall (R) is the fraction of relevant documents that are retrieved

Precision/recall tradeoff

• You can increase recall by returning more docs.

• Recall is a non-decreasing function of the number of docs retrieved.

• A system that returns all docs has 100% recall!

• The converse is also true (usually): it’s easy to get high precision for very low recall.

• So we can use the harmonic mean of both.

$$F = \frac{2PR}{P + R}$$
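A small sketch of the three set-based measures, with made-up document IDs. Returning 0 when a denominator is empty is a convention chosen for this example, not something the slides specify.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical run: 4 documents retrieved, 6 relevant, 2 in common.
p, r, f = precision_recall_f1([1, 2, 3, 4], [2, 4, 5, 6, 7, 8])
print(p, r, f)
```

Returning every document would push recall to 1 while precision falls towards the fraction of relevant documents in the collection, so F stays low.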

Measures

Average Precision is the average of P@K over all ranks K at which the document is relevant.

Advantage of average precision: no need to select any particular K.

Mean Average Precision (MAP) is average precision averaged across a set of queries.

Advantage of MAP: a single figure summarizes the effectiveness of the whole system across queries.
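The two measures can be sketched as below. One assumption to note: the sum of P@K values is divided by the total number of relevant documents, so relevant documents that never appear in the ranking count as zero; this matches the definition above whenever every relevant document is retrieved. The rankings and judgements are invented.

```python
def average_precision(ranking, relevant):
    """Average of P@K over the ranks K that hold a relevant document."""
    relevant = set(relevant)
    hits = 0
    precisions = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)  # P@K at this relevant rank
    return sum(precisions) / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings and relevance judgements for two queries.
ap1 = average_precision(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5"})
ap2 = average_precision(["a", "b", "c"], {"b"})
map_score = mean_average_precision([
    (["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5"}),
    (["a", "b", "c"], {"b"}),
])
print(ap1, ap2, map_score)
```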

NDCG

Normalized Discounted Cumulative Gain (NDCG) is used when the relevance judgement is not binary.

Suppose there are five levels of relevance judgement: Perfect, Excellent, Good, Fair, Bad.

We assign a relevance score to each level. Suppose Perfect = 4, Excellent = 3, Good = 2, Fair = 1 and Bad = 0.

$$NDCG(Q, k) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} Z_{kj} \sum_{m=1}^{k} \frac{2^{R(j,m)} - 1}{\log_2(1 + m)}$$

NDCG can be measured at rank k. Here Q is the set of queries, R(j, m) is the relevance score of the document at rank m for query j, and Z_{kj} is a normalizing factor chosen so that a perfect ranking for query j has an NDCG of 1 at k.
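A sketch of the computation, assuming the normalizing factor is the reciprocal of the DCG of the ideal ordering of the same judged documents (an assumption: a full evaluation would take the ideal ordering over all judged documents for the query). Function names and the sample gain lists are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k relevance scores."""
    return sum((2 ** g - 1) / math.log2(1 + m)
               for m, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(per_query_gains, k):
    """Average NDCG at k over a set of queries."""
    return sum(ndcg_at_k(g, k) for g in per_query_gains) / len(per_query_gains)


# Relevance scores in ranked order, on the 0 (Bad) to 4 (Perfect) scale.
perfect = ndcg_at_k([4, 3, 2, 1, 0], 5)
swapped = ndcg_at_k([0, 4], 2)
print(perfect, swapped, mean_ndcg([[4, 3, 2, 1, 0], [0, 4]], 2))
```

Placing the Perfect document at rank 2 instead of rank 1 costs a factor of log2(3), which is the positional discount at work.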

Evaluation Fora

Cranfield experiments (1960)

[Timeline figure: Cranfield experiments, Cranfield collection]

Initial experiments on text retrieval were started by Cyril Cleverdon in the 1960s at Cranfield University. Cleverdon’s retrieval test collection formed the blueprint for TREC.

TREC (1992)

[Timeline figure: Cranfield experiments, TREC; TREC collections (1-4)]

The Text Retrieval Conference (TREC) was started in 1992 by NIST. TREC focuses on several tracks ranging from question answering to cross lingual information retrieval.

NTCIR (1999)

[Timeline figure: Cranfield experiments, TREC, NTCIR; Asian language collections]

NTCIR, the Japanese counterpart of TREC, was launched in 1999. NTCIR focuses largely on datasets for Asian languages (Japanese, Korean, Chinese).

CLEF (2000)

[Timeline figure: Cranfield experiments, TREC, NTCIR, CLEF; European language collections]

CLEF (Cross-Language Evaluation Forum) started out as an evaluation forum focused on cross lingual IR. Today it has become a fully peer-reviewed conference. CLEF focuses largely on European languages.

FIRE (2007)

[Timeline figure: Cranfield experiments, TREC, NTCIR, CLEF, FIRE; Indian language collections]

FIRE (Forum for IR Evaluation) started as a spin-off of a CLEF 2007 task on retrieval for Indian languages. FIRE has released collections for 10 Indian languages.

Web Crawling

Web Crawler

A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing.

A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter.

Web search engines and some other sites use Web crawling or spidering software to update their web content or their indexes of other sites' web content.

List of web crawlers

• Apache Nutch

• WebCrawler

• DataparkSearch

• HTTrack

• MnoGoSearch

Web Crawler Architecture

Crawl cycle

• Create a URL seed list (one-time process)

• Generate: in this phase, the list of URLs that need to be fetched in this cycle is generated.

• Fetcher: in this phase, the generated URLs are fetched from the internet.

• Parser: in this phase, the fetched documents are parsed and their out-links are extracted.

• UpdateDb: in this phase, the extracted out-links are added to the database.
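The cycle above can be sketched as a loop over generate, fetch, parse and update. To keep the example runnable without network access, fetching is passed in as a function and served from a small in-memory "web"; the URLs, status strings and names are all hypothetical, and this is not how Nutch itself is implemented. Politeness delays and robots.txt handling are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds, fetch, max_cycles=3):
    """Run generate/fetch/parse/update cycles; fetch(url) returns HTML or None."""
    db = {url: "unfetched" for url in seeds}  # seed list, created once
    pages = {}
    for _ in range(max_cycles):
        # Generate: URLs still waiting to be fetched in this cycle.
        fetch_list = sorted(u for u, status in db.items() if status == "unfetched")
        if not fetch_list:
            break
        for url in fetch_list:
            html = fetch(url)  # Fetch
            if html is None:
                db[url] = "failed"
                continue
            db[url] = "fetched"
            pages[url] = html
            parser = LinkExtractor(url)  # Parse: extract out-links
            parser.feed(html)
            for link in parser.links:  # UpdateDb: record newly seen URLs
                db.setdefault(link, "unfetched")
    return db, pages


# Hypothetical in-memory web standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b": '<a href="/c">C</a>',
}
db, pages = crawl(["http://example.org/"], FAKE_WEB.get)
print(db)
```

Each pass of the outer loop goes one link deeper from the seeds, which is roughly what a crawl depth setting controls.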

Cross Lingual Information Retrieval

The Problem

• You have a collection of documents in language L1

• The user gives a query in language L2

Possible pipelines: Document translation

[Figure: the document collection passes through the translation system before it is indexed; the IR system matches the query against that index and returns a ranked list of documents]

Possible pipelines: Query translation

[Figure: the document collection is indexed as it is; the query passes through the translation system before the IR system matches it against the index and returns a ranked list of documents]

Sandhan: A Case Study

http://www.sandhansearch.in

[Figure: Sandhan CLIR flow. A Hindi query, e.g. तिरूपति यात्रा (“Tirupati travel”), goes to the CLIR engine, which uses language resources and a target-language index in English built from crawled and indexed web pages holding the target information in English. The engine returns a ranked list of results with result snippets in Hindi. Sample snippet, translated: “Rail options for reaching Tirupati: many trains are available to reach the holy city of Tirupati. If you are travelling from Mumbai, you can take the Mumbai-Chennai Express.”]

Sandhan – Consortium Project

• IIT Bombay (co-ordinator)

• CDAC Noida (co-coordinator)

• CDAC Pune

• IIT Kharagpur

• Jadavpur University

• ISI Kolkata

• IIIT Hyderabad

• AU KBC

• AU CEG

• Gauhati University

• DAIICT Gujarat

• IIIT Bhubaneswar

• TDIL

Problem definition

• Cross Lingual Information Retrieval (CLIR) engine for Indian languages

    Input: Query in one of the supported Indian languages (the original six: Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi; plus Assamese, Gujarati and Oriya)

    Output: In Hindi, English and the query language

• Currently in the second phase of the project

• Three new languages were added in the second phase

    Assamese, Gujarati, Oriya

• Built on top of the Nutch framework

Software Used

• Nutch v0.9 – Framework

• Hadoop – Distributed Crawling

• Lucene – Indexing

• Moses/GIZA++ – Training models

• Tomcat – Deployment

[Figure: system architecture. Components shown: Web, Fetcher, Font Transcoder, Language Identifier, Domain Identifier, Analyzer (with NE Lookup and MWE Lookup), Information Extraction, CMLifier, Indexer, Index, UNL Index, Query Formulation, Translation/Transliteration, Snippet Generation, Summary Generation, Snippet Translation]

Resources Developed

• Language-specific analyzers

• Stop word list

• Bilingual dictionary (X-English, X-Hindi)

• NE list

• MWE list

• Transliteration models

Nutch and Lucene Framework: Demo

- Arjun Atreya V

RS-IITB

Outline

• Introduction

• Behavior of Nutch (Offline and Online)

• Lucene Features

Resources Used

• Gospodnetic, Otis; Hatcher, Erik (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. 456 pp. ISBN 978-1-932394-28-3.

• Nutch Wiki http://wiki.apache.org/nutch/

Introduction

• Nutch is an open-source search engine

• Implemented in Java

• Nutch builds on Lucene, Solr, Hadoop, etc.

• Lucene provides indexing and searching of the crawled data

• Both Nutch and Lucene are developed using a plugin framework

• Easy to customize

Where do they fit in IR?

Nutch – complete search engine

Nutch – offline processing

• Crawling

    Starts with a set of seed URLs

    Goes deeper into the web and fetches the content

    Content needs to be analyzed before storing

    Stores the content

    Makes it suitable for searching

• Issues

    Time-consuming process

    Freshness of the crawl (How often should I crawl?)

    Coverage of content

Nutch – online processing

• Searching

    Analysis of the query

    Processing of the few words (tokens) in the query

    Query tokens are matched against stored tokens (the index)

• Fast and accurate

• Involves ordering the matching results

• Ranking affects the user’s satisfaction directly

• Supports distributed searching

Nutch – Data structures

• Web Database or WebDB: mirrors the properties/structure of the web graph being crawled

• Segment: intermediate index

    Contains pages fetched in a single run

• Index: final inverted index obtained by “merging” segments (Lucene)

Nutch – Crawling

• Inject: initial creation of CrawlDB

    Insert seed URLs

    Initial LinkDB is empty

• Generate new shard's fetchlist

• Fetch raw content

• Parse content (discovers outlinks)

• Update CrawlDB from shards

• Update LinkDB from shards

• Index shards

Wide Crawling vs. Focused Crawling

• Differences:

    Little technical difference in configuration

    Big difference in operations, maintenance and quality

• Wide crawling:

    (Almost) unlimited crawling frontier

    High risk of spamming and junk content

    “Politeness” is a very important limiting factor

    Bandwidth & DNS considerations

• Focused (vertical or enterprise) crawling:

    Limited crawling frontier

    Bandwidth or politeness is often not an issue

    Low risk of spamming and junk content

Crawling Architecture

Step 1: The Injector injects the list of seed URLs into the CrawlDB.

Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms a fetch list, and adds a crawl_generate folder to the segments.

Page 77: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose

Step 3: Fetchers use these fetch lists to fetch the raw content of the documents, which is then stored in segments.



Step 4: The Parser is called to parse the content of each document, and the parsed content is stored back in segments.



Step 5: The links are inverted in the link graph and stored in the LinkDB.



Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.



Step 7: Information on the newly fetched documents is updated in the CrawlDB.
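The link inversion in Step 5 turns the parser's "page to outlinks" view into a "page to inlinks" view. A minimal in-memory sketch of that idea follows; the class name `LinkInverter` and the page names are hypothetical, and anchor text, which a real link database would typically keep alongside each inlink, is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy link inversion: turn page -> outlinks into page -> inlinks. */
final class LinkInverter {
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            String source = e.getKey();
            for (String target : e.getValue()) {
                // Record that 'source' links to 'target'.
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(source);
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> outlinks = new TreeMap<>();
        outlinks.put("a.html", Arrays.asList("b.html", "c.html"));
        outlinks.put("b.html", Arrays.asList("c.html"));
        // prints {b.html=[a.html], c.html=[a.html, b.html]}
        System.out.println(invert(outlinks));
    }
}
```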



Crawling: 10-stage process


bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log

1. admin db -create: Create a new WebDB.

2. inject: Inject root URLs into the WebDB.

3. generate: Generate a fetchlist from the WebDB in a new segment.

4. fetch: Fetch content from URLs in the fetchlist.

5. updatedb: Update the WebDB with links from fetched pages.

6. Repeat steps 3-5 until the required depth is reached.

7. updatesegs: Update segments with scores and links from the WebDB.

8. index: Index the fetched pages.

9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.

10. merge: Merge the indexes into a single index for searching
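The generate / fetch / updatedb loop of steps 3 to 6 can be sketched over an in-memory "web". This is only an illustration of how `-depth` bounds the number of rounds; the class `ToyCrawl`, its data structures, and the page names are assumptions, not Nutch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy depth-bounded crawl over an in-memory web (URL -> outlinks). */
final class ToyCrawl {
    static Set<String> crawl(Map<String, List<String>> web, List<String> seeds, int depth) {
        Set<String> db = new LinkedHashSet<>(seeds);      // inject: URLs known so far
        Set<String> fetched = new LinkedHashSet<>();
        for (int round = 0; round < depth; round++) {
            // generate: every known URL that has not been fetched yet
            List<String> fetchlist = new ArrayList<>();
            for (String url : db) {
                if (!fetched.contains(url)) {
                    fetchlist.add(url);
                }
            }
            if (fetchlist.isEmpty()) {
                break;                                    // frontier exhausted
            }
            // fetch + updatedb: record the page, then add its outlinks to the db
            for (String url : fetchlist) {
                fetched.add(url);
                db.addAll(web.getOrDefault(url, Collections.<String>emptyList()));
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("seed", Arrays.asList("a", "b"));
        web.put("a", Arrays.asList("c"));
        web.put("c", Arrays.asList("d"));
        System.out.println(crawl(web, Arrays.asList("seed"), 2));   // [seed, a, b]
    }
}
```

With depth 2 the pages linked from the seed are fetched, but the pages they link to are only discovered, not fetched; each extra unit of depth moves the frontier one link further out.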


De-duplication Algorithm


For each page, a tuple is recorded:

(MD5 hash, float score, int indexID, int docID, int urlLen)

To eliminate URL duplicates from a segmentsDir:

    open a temporary file
    for each segment:
        for each document in its index:
            append a tuple for the document to the temporary file with hash=MD5(URL)
    close the temporary file
    sort the temporary file by hash
    for each group of tuples with the same hash:
        for each tuple but the first:
            delete the specified document from the index
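The pseudocode above can be sketched in Java as follows. This is a simplified illustration, not Nutch's implementation: an in-memory list stands in for the temporary file, the `score` and `urlLen` tuple fields are omitted, and the names `UrlDedup`, `Tuple`, and `duplicatesToDelete` are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy URL de-duplication over several per-segment indexes. */
final class UrlDedup {
    /** One tuple per indexed document; score and urlLen are left out of this sketch. */
    static final class Tuple {
        final String hash;
        final int indexId;
        final int docId;

        Tuple(String hash, int indexId, int docId) {
            this.hash = hash;
            this.indexId = indexId;
            this.docId = docId;
        }
    }

    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** urlsPerIndex[i][d] is the URL of document d in index i; returns "indexId:docId" entries to delete. */
    static List<String> duplicatesToDelete(String[][] urlsPerIndex) {
        List<Tuple> tuples = new ArrayList<>();   // stands in for the temporary file
        for (int i = 0; i < urlsPerIndex.length; i++) {
            for (int d = 0; d < urlsPerIndex[i].length; d++) {
                tuples.add(new Tuple(md5Hex(urlsPerIndex[i][d]), i, d));
            }
        }
        tuples.sort(Comparator.comparing((Tuple t) -> t.hash));   // stable sort by hash
        List<String> toDelete = new ArrayList<>();
        for (int k = 1; k < tuples.size(); k++) {
            // Same hash as the previous tuple: not the first of its group, so delete it.
            if (tuples.get(k).hash.equals(tuples.get(k - 1).hash)) {
                toDelete.add(tuples.get(k).indexId + ":" + tuples.get(k).docId);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        String[][] urls = {
            {"http://a.example/", "http://b.example/"},
            {"http://a.example/", "http://c.example/"}
        };
        System.out.println(duplicatesToDelete(urls));   // [1:0]
    }
}
```

Sorting by hash brings all copies of a URL next to each other, so one linear pass is enough to find them; that is why the approach also works when the tuples live in a file too large for memory.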


URL Filtering


URL Filters (Text file) (conf/crawl-urlfilter.txt)

Regular expression to filter URLs during crawling

E.g.

To ignore files with certain suffix:

-\.(gif|exe|zip|ico)$

To accept host in a certain domain

+^http://([a-z0-9]*\.)*apache.org/
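How such a rule file might be applied can be sketched as below. The sketch assumes that rules are tried in file order, that the first rule whose pattern is found in the URL decides, and that a URL matching no rule is rejected; the class `RegexUrlFilter` here is a hypothetical stand-in written for illustration, not Nutch's own filter class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Toy URL filter: each rule is '+' (accept) or '-' (reject) followed by a regex. */
final class RegexUrlFilter {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RegexUrlFilter(String... lines) {
        for (String line : lines) {
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** Assumption: the first matching rule wins; no match means the URL is rejected. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RegexUrlFilter filter = new RegexUrlFilter(
                "-\\.(gif|exe|zip|ico)$",
                "+^http://([a-z0-9]*\\.)*apache.org/");
        System.out.println(filter.accepts("http://lucene.apache.org/nutch/"));   // true
        System.out.println(filter.accepts("http://www.apache.org/logo.gif"));    // false
        System.out.println(filter.accepts("http://example.com/"));               // false
    }
}
```

Rule order matters under this assumption: the suffix rule is listed first so that an image on an accepted host is still skipped.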


A Few APIs


• Site we would crawl: http://www.iitb.ac.in

bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log

• Analyze the database:

bin/nutch readdb <db dir> -stats

bin/nutch readdb <db dir> -dumppageurl

bin/nutch readdb <db dir> -dumplinks

s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s


Map-Reduce Function


• Works in distributed environment

• map() and reduce() functions are implemented in most of the modules

• Both map() and reduce() functions use <key, value> pairs

• Useful for processing large data (e.g., indexing)

• Some applications need a sequence of map-reduce jobs:

Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
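The <key, value> flow and the chaining of jobs can be illustrated with a tiny single-process sketch. It only mimics the programming model (map, group by key, reduce) and does not use Hadoop; the class `MiniMapReduce`, its nested interfaces, and the two example jobs are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory map-reduce with String keys and Integer values. */
final class MiniMapReduce {
    interface Mapper { List<Map.Entry<String, Integer>> map(String key, Integer value); }
    interface Reducer { Integer reduce(String key, List<Integer> values); }

    /** One round: map every pair, group the emitted values by key, reduce each group. */
    static List<Map.Entry<String, Integer>> run(List<Map.Entry<String, Integer>> input,
                                                Mapper mapper, Reducer reducer) {
        Map<String, List<Integer>> groups = new TreeMap<>();   // the "shuffle" step
        for (Map.Entry<String, Integer> in : input) {
            for (Map.Entry<String, Integer> kv : mapper.map(in.getKey(), in.getValue())) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            output.add(new SimpleEntry<>(g.getKey(), reducer.reduce(g.getKey(), g.getValue())));
        }
        return output;
    }

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Job 1: <line, lineNo> pairs in, <word, occurrences> pairs out. */
    static List<Map.Entry<String, Integer>> wordCount(List<Map.Entry<String, Integer>> lines) {
        return run(lines, (line, lineNo) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : line.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!w.isEmpty()) {
                    out.add(new SimpleEntry<>(w, 1));
                }
            }
            return out;
        }, (word, ones) -> sum(ones));
    }

    /** Job 2: <word, occurrences> pairs in, <"words occurring Nx", how many words> pairs out. */
    static List<Map.Entry<String, Integer>> histogram(List<Map.Entry<String, Integer>> counts) {
        return run(counts, (word, n) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            out.add(new SimpleEntry<>("words occurring " + n + "x", 1));
            return out;
        }, (label, ones) -> sum(ones));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> lines = new ArrayList<>();
        lines.add(new SimpleEntry<>("the quick fox", 1));
        lines.add(new SimpleEntry<>("the lazy dog", 2));
        List<Map.Entry<String, Integer>> counts = wordCount(lines);   // Map-1 -> Reduce-1
        System.out.println(counts);
        System.out.println(histogram(counts));                        // Map-2 -> Reduce-2
    }
}
```

The output of the first job is fed unchanged into the second, which is all that "Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n" requires: each stage consumes and produces <key, value> pairs.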


Map-Reduce Architecture



Nutch – Map-Reduce Indexing


• map() just assembles all parts of documents

• reduce() performs text analysis + indexing:

Adds to a local Lucene index

Other possible MR indexing models:

• Hadoop contrib/indexing model:

Analysis and indexing on map() side

Index merging on reduce() side

• Modified Nutch model:

Analysis on map() side

Indexing on reduce() side


Nutch - Ranking


• Nutch Ranking

queryNorm(): normalization factor for the query

coord(): factor reflecting how many of the query terms are present in the given document

norm(): field-based normalization factor

tf: term frequency; idf: inverse document frequency

t.boost(): score indicating the importance of a term's occurrence in a particular field
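These factors are combined multiplicatively. As a reference point, Lucene's classic TF-IDF similarity documents its practical scoring function in the form below, written here with the factor names used above; that the Nutch version discussed here uses exactly this form is an assumption.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
  \sum_{t \in q} \Bigl( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
  t.\mathrm{boost}()\cdot \mathrm{norm}(t,d) \Bigr)
```

Here q is the query, d the document, and the sum runs over the query terms t; idf appears squared because it enters once through the query weight and once through the document weight.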


Lucene - Features


• Field based indexing and searching

• Different fields of a webpage are

Title

URL

Anchor text

Content, etc.

• Different boost factors to give importance to fields

• Uses inverted index to store content of crawled documents

• Open source Apache project


Lucene - Index


• Concepts

Index: sequence of documents (a.k.a. Directory)

Document: sequence of fields

Field: named sequence of terms

Term: a text string (e.g., a word)

• Statistics

Term frequencies and positions
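To make "term frequencies and positions" concrete, here is a minimal single-field inverted index that records, for every term, the documents it occurs in and the positions within each document. It is a sketch for illustration only: `TinyIndex` and its methods are hypothetical names, tokenization is plain lowercasing plus whitespace splitting, and nothing here reflects Lucene's on-disk format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: term -> (docId -> positions of the term in that document). */
final class TinyIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document (lowercased, whitespace-split) and returns its id. */
    int addDocument(String text) {
        int docId = nextDocId++;
        String[] terms = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            if (terms[pos].isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(terms[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
        return docId;
    }

    /** Positions of the term in the document; empty if the term does not occur there. */
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docId)) {
            return new ArrayList<>();
        }
        return docs.get(docId);
    }

    /** Term frequency: how often the term occurs in the document. */
    int termFrequency(String term, int docId) {
        return positions(term, docId).size();
    }

    /** Document frequency: number of documents containing the term. */
    int docFrequency(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument("the quick brown fox");
        int d1 = index.addDocument("The lazy dog and the fox");
        System.out.println(index.positions("the", d1));     // [0, 4]
        System.out.println(index.termFrequency("the", d1)); // 2
        System.out.println(index.docFrequency("fox"));      // 2
        System.out.println(Arrays.asList(index.docFrequency("cat"))); // [0]
    }
}
```

Term frequency and document frequency are exactly the inputs that tf and idf need at scoring time, while positions are what make phrase queries possible.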


Writing to Index


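import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Third argument true: create a new index, replacing any existing one
IndexWriter writer = new IndexWriter(directory, analyzer, true);
Document doc = new Document();
// add fields to document (next slide)
writer.addDocument(doc);
// Close the writer so the index is written out
writer.close();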


Adding Fields


doc.add(Field.Keyword("isbn", isbn));
doc.add(Field.Keyword("category", category));
doc.add(Field.Text("title", title));
doc.add(Field.Text("author", author));
doc.add(Field.UnIndexed("url", url));
// third argument true: also store a term vector for this field
doc.add(Field.UnStored("subjects", subjects, true));
doc.add(Field.Keyword("pubmonth", pubmonth));
doc.add(Field.UnStored("contents", author + " " + subjects));
doc.add(Field.Keyword("modified", DateField.timeToString(file.lastModified())));


Fields Description


• Attributes

Stored: original content retrievable

Indexed: inverted, searchable

Tokenized: analyzed, split into tokens

• Factory methods

Keyword: stored and indexed as a single term (not tokenized)

Text: indexed and tokenized; stored if given a String

UnIndexed: stored only (not searchable)

UnStored: indexed and tokenized, but not stored

• Terms are what matters for searching


Searching an Index


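IndexSearcher searcher = new IndexSearcher(directory);
// Parse the query against the "contents" field, using the same analyzer as at index time
Query query = QueryParser.parse(queryExpression, "contents", analyzer);
Hits hits = searcher.search(query);
// Print the stored title of every matching document
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}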


Analyzer


• Analysis occurs

For each tokenized field during indexing

For each term or phrase in QueryParser

• Several analyzers built-in

Many more in the sandbox

Straightforward to create your own

• Choosing the right analyzer is important!


WhiteSpace Analyzer


The quick brown fox jumps over the lazy dog.

[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]


Simple Analyzer


The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]


Stop Analyzer


The quick brown fox jumps over the lazy dog.

[quick] [brown] [fox] [jumps] [over] [lazy] [dog]


Snowball Analyzer


The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jump] [over] [the] [lazi] [dog]
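The behaviour of the whitespace, simple, and stop analyzers shown on the preceding slides can be imitated with a few lines of plain Java. This is a sketch of the token streams only, not of Lucene's analyzer classes: the class `ToyAnalyzers`, its method names, and the short stop-word list are assumptions (Lucene ships its own list), and stemming as done by the Snowball analyzer is not attempted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy imitations of three analyzers, for illustration only. */
final class ToyAnalyzers {
    // A tiny illustrative stop list, not Lucene's actual list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Whitespace-style: split on whitespace only; case and punctuation are kept. */
    static List<String> whitespace(String text) {
        return nonEmpty(text.split("\\s+"));
    }

    /** Simple-style: lowercase, then split at every run of non-letters. */
    static List<String> simple(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"));
    }

    /** Stop-style: simple analysis followed by stop-word removal. */
    static List<String> stop(String text) {
        List<String> out = new ArrayList<>();
        for (String token : simple(text)) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    private static List<String> nonEmpty(String[] parts) {
        List<String> out = new ArrayList<>();
        for (String p : parts) {
            if (!p.isEmpty()) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog.";
        System.out.println(whitespace(text)); // [The, quick, brown, fox, jumps, over, the, lazy, dog.]
        System.out.println(simple(text));     // [the, quick, brown, fox, jumps, over, the, lazy, dog]
        System.out.println(stop(text));       // [quick, brown, fox, jumps, over, lazy, dog]
    }
}
```

The three outputs differ only in case, punctuation, and stop words, which is enough to make a query for "the" or "dog" match under one analyzer and miss under another.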


Query Creation


• Searching by a term – TermQuery

• Searching within a range – RangeQuery

• Searching by prefix – PrefixQuery

• Combining queries – BooleanQuery

• Searching by phrase – PhraseQuery

• Searching by wildcard – WildcardQuery

• Searching for similar terms – FuzzyQuery
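What the prefix, wildcard, and fuzzy query types match can be illustrated at the level of single terms. The sketch below assumes the usual wildcard convention ('*' for any run of characters, '?' for exactly one) and uses Levenshtein edit distance as the similarity measure for fuzzy matching; `ToyTermMatchers` and its methods are hypothetical helpers, not part of Lucene's API, and they say nothing about how Lucene evaluates these queries internally.

```java
import java.util.regex.Pattern;

/** Toy term-level matchers mirroring three query types, for illustration only. */
final class ToyTermMatchers {
    /** Prefix match: the term starts with the given prefix. */
    static boolean prefix(String term, String prefix) {
        return term.startsWith(prefix);
    }

    /** Wildcard match: '*' matches any run of characters, '?' exactly one character. */
    static boolean wildcard(String term, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), term);
    }

    /** Levenshtein distance: minimum number of single-character insertions, deletions, substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(substitute, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Fuzzy match: the term is within maxEdits edits of the query term. */
    static boolean fuzzy(String term, String query, int maxEdits) {
        return editDistance(term, query) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(prefix("nutch", "nut"));            // true
        System.out.println(wildcard("lucene", "lu*e"));        // true
        System.out.println(wildcard("lucene", "lu?e"));        // false
        System.out.println(fuzzy("lucene", "lucine", 1));      // true
        System.out.println(editDistance("kitten", "sitting")); // 3
    }
}
```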


Lucene Queries



Conclusions


• Nutch as a starting point

• Crawling in Nutch

• Detailed map-reduce architecture

• Different query formats in Lucene

• Built-in analyzers in Lucene

• The same analyzer needs to be used for both indexing and searching


Thanks


• Questions?