
Information Retrieval

Shehzaad Dhuliawala

Maulik Vachhani

Presentation Outline

• Introduction

• Boolean Retrieval

• Indexing

– Term Vocabulary

– Postings List

– Index Creation

• Retrieval Models and Scoring

– Vector Space Model

– Probabilistic Model

• Web Crawling

• Cross Lingual Information Retrieval

Content we will refer to

1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/

2. Coursera Natural Language Processing Course by Dan Jurafsky and Christopher Manning https://class.coursera.org/nlp/

3. NPTEL course on Natural Language Processing by Pushpak Bhattacharyya http://nptel.ac.in/courses/106101007/

What is Information Retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). [1]

Unstructured Text

• What differentiates an IR system from a database? IR operates over unstructured text, whereas a database queries structured records.

IR Models

• An IR model is a quadruple [D, Q, F, R(d_i, q_i)]

D: collection of documents

Q: collection of queries

F: framework for modelling the document, the query, and their relationship

R: a ranking/scoring function which returns a real number expressing the relevance of document d_i to query q_i

Boolean Retrieval

• It's a simple model based on set theory

• It checks whether terms are present in a document or not

Example

• We have a collection of scientific papers in the field of computer science

• The information need: a collection of papers which are about information retrieval using machine learning

• Query: information ∧ retrieval ∧ machine ∧ learning

• Set(information) ∩ Set(retrieval) ∩ Set(machine) ∩ Set(learning)

[Venn diagram: the four term sets — information, retrieval, machine, learning — and their intersection]

Grepping

• The Unix grep command lets you search for the presence of a term in a document

• Why does this approach pose a problem? Grep scans every document linearly at query time, which does not scale to large collections.

The term-document matrix

        compiler  machine  learning  deep  information  retrieval  translation
Doc 1      1        0         0       0        0           0           0
Doc 2      0        1         1       0        1           1           0
Doc 3      0        0         1       1        1           0           0
Doc 4      0        0         0       0        0           0           1
Doc 5      1        1         1       1        1           1           0
Doc 6      1        1         0       0        1           0           0

Query: machine ∧ learning ∧ ¬compiler

(010011) ∧ (011010) ∧ (011100) = (010000)

So the relevant set is Doc 2 (see the sketch below).
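A minimal sketch of this bit-vector intersection in Java, using BitSet; the three incidence rows are copied from the toy matrix above, and the class name is illustrative:

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class BooleanRetrieval {
    // Build a BitSet from a 6-document incidence string, e.g. "010011".
    static BitSet row(String bits) {
        BitSet b = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++)
            if (bits.charAt(i) == '1') b.set(i);
        return b;
    }

    public static void main(String[] args) {
        Map<String, BitSet> index = new HashMap<>();
        index.put("compiler", row("100011"));
        index.put("machine",  row("010011"));
        index.put("learning", row("011010"));

        // machine AND learning AND NOT compiler
        BitSet result = (BitSet) index.get("machine").clone();
        result.and(index.get("learning"));
        result.andNot(index.get("compiler"));

        // Prints "Doc 2" for the toy matrix above.
        for (int d = result.nextSetBit(0); d >= 0; d = result.nextSetBit(d + 1))
            System.out.println("Doc " + (d + 1));
    }
}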

TDM: Sparseness

• Space complexity: |V| · |D|

• |V| -> vocabulary size

• |D| -> number of documents

• With |V| = 500,000 and |D| = 1 million, the matrix has 5 × 10^11 cells

• Space required: ~500 GB (at one byte per cell), even though almost every entry is 0

Inverted Index

• For each term in the vocabulary, store a sorted postings list of the IDs of the documents that contain it.

[Figure: postings lists for the terms compilers, machine, learning, information and retrieval, each pointing to a sorted list of document IDs]
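A minimal in-memory sketch of building and intersecting such postings lists; the whitespace tokenizer and document IDs are simplified assumptions:

import java.util.*;

public class InvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Tokenize naively on whitespace and record docId in each term's postings list.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+"))
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // AND query: intersect the (sorted) postings lists of all terms.
    public Set<Integer> and(String... terms) {
        Set<Integer> result = null;
        for (String t : terms) {
            Set<Integer> list = postings.getOrDefault(t, new TreeSet<>());
            if (result == null) result = new TreeSet<>(list);
            else result.retainAll(list);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(2, "machine learning information retrieval");
        idx.add(5, "compilers machine learning information retrieval");
        System.out.println(idx.and("machine", "learning")); // [2, 5]
    }
}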

NLP and IR

How NLP helps IR

• Tokenization

• Stemming/Lemmatization

• Stopword removal

• Normalization

• Named Entities

• Multi-word expressions

Tokenization

• Text is a sequence of characters

• For term based indexing we need to decide how to tokenize the text

• Where does this become a problem?

• O'Neal, knock-out – how do we tokenize these?

Stemming

• Q: The best cars

• D: The best car in 2016 is the Honda…

• Stemming matters more in morphologically richer languages (e.g., Marathi)

• मुंबईहून पुण्याला जाणाऱ्या बसची वेळ (the time of the bus going from Mumbai to Pune)

• Is stemming always beneficial?

Stopword Removal

• Which words actually convey the meaning of the text?

• Taj Mahal is situated in Agra which is close to Delhi

• Content words only: Taj Mahal, situated, Agra, close, Delhi

• It has been shown that removing stopwords often boosts the performance of IR systems and lowers index size

• Is it always beneficial to remove stopwords?

Normalization

• Text often contains stylistic variation, and usage may not be consistent

• For example, one document may contain the term USA, while another has U.S.A.

• Should both be indexed separately?

Named Entities and Multiword Expressions

• Often a group of words may be more relevant together than individually

• Q: machine learning

• D: …the machine was used by several students and this was a good learning experience for them…

• Such terms are called Multiword expressions

• Should they be indexed together?

Retrieval Models

Problems with Boolean search

• Boolean queries often result in either too few (=0) or too many (1000s) results.

• Query 1: "standard user dlink 650" → many results

• Query 2: "standard user dlink 650 no card found" → 0 hits

• It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few; OR gives too many.

• Retrieved documents are not ranked in any order.

Ranked retrieval models

• Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query

• Ranked retrieval models include:

1. Vector Space Model

2. Probabilistic Model

Scoring as the basis of ranked retrieval

• We wish to return in order the documents most likely to be useful to the searcher

• Assign a score – say in [0, 1] – to each document

• We need a way of assigning a score to a query/document pair

• The more frequent the query term in the document, the higher the score (should be)

• Rare terms are more informative than frequent terms

Term frequency tf

• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.

• Raw term frequency is not what we want: relevance does not increase proportionally with term frequency.

• So we use log frequency. The log frequency weight of term t in d is

$$w_{t,d} = \begin{cases} 1 + \log_{10}\mathrm{tf}_{t,d}, & \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$

idf weight

• Frequent terms are less informative than rare terms.

• We want to give higher weight to rare terms.

• We will use document frequency (df) to capture this.

• df_t is the document frequency of t: the number of documents that contain t.

• We define the idf (inverse document frequency) of t by

$$\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$$

where N is the total number of documents.

tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight.

• Alternative names: tf.idf, tf × idf

• Increases with the number of occurrences within a document

• Increases with the rarity of the term in the collection (see the sketch below)

$$w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \cdot \log_{10}(N/\mathrm{df}_t)$$

$$\mathrm{Score}(q,d) = \sum_{t \in q \cap d} \text{tf-idf}_{t,d}$$
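A small sketch of these two weights in Java; the collection statistics (N, df, tf) are toy assumptions:

public class TfIdf {
    // Log-frequency weight: 1 + log10(tf) when tf > 0, else 0.
    static double tfWeight(int tf) {
        return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
    }

    // Inverse document frequency: log10(N / df).
    static double idf(int n, int df) {
        return Math.log10((double) n / df);
    }

    public static void main(String[] args) {
        int n = 1_000_000;   // documents in the collection
        int df = 100;        // documents containing the term
        int tf = 3;          // occurrences within this document
        double w = tfWeight(tf) * idf(n, df);
        System.out.printf("tf-idf weight = %.4f%n", w); // (1 + log10 3) * 4
    }
}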

Documents and query as vectors

• So we have a |V|-dimensional vector space

• Terms are the axes of the space

• Documents and the query are points or vectors in this space

• Find the cosine similarity between the documents and the query:

$$\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

• We can remove the denominator, as we are interested in relative values only.

Summary – vector space model

• Represent the query as a weighted tf-idf vector

• Represent each document as a weighted tf-idf vector

• Compute the cosine similarity score for the query vector and each document vector

• Rank documents with respect to the query by score

• Return the top K (e.g., K = 10) to the user
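A compact sketch of the cosine scoring step, representing tf-idf vectors as sparse term-to-weight maps instead of full |V|-dimensional arrays; the example weights are toy values:

import java.util.Map;

public class CosineScore {
    // Cosine similarity between two sparse tf-idf vectors (term -> weight).
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
            qNorm += e.getValue() * e.getValue();
        }
        for (double w : d.values()) dNorm += w * w;
        if (qNorm == 0 || dNorm == 0) return 0;
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        Map<String, Double> query = Map.of("machine", 2.3, "learning", 1.8);
        Map<String, Double> doc   = Map.of("machine", 1.1, "learning", 0.9, "deep", 2.0);
        System.out.printf("cos = %.4f%n", cosine(query, doc));
    }
}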

Probabilistic Model

• Probability Ranking Principle: let D be the document collection, R the set of relevant documents, and NR the set of non-relevant documents.

• In a probabilistic model, the obvious way to produce the output is to rank documents by the estimated probability of their relevance with respect to the information need.

• That is, we order documents d by P(R|d, q), where q is the query.

• Examples are BM25, the Binary Independence Model, etc.

BM25

• Ranks documents based on the query terms appearing in a document.

• Given a query Q containing keywords q_1, …, q_n, the BM25 score of a document D is

$$\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\cdot\frac{\mathrm{TF}(q_i,D)\cdot(k_1+1)}{\mathrm{TF}(q_i,D) + k_1\cdot\left(1 - b + b\cdot\frac{|D|}{\mathrm{avgDL}}\right)}$$

$$\mathrm{IDF}(q_i) = \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

where |D| is the length of document D, avgDL is the average document length in the collection, n(q_i) is the number of documents containing q_i, and k_1 and b are free parameters.
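A minimal per-term BM25 scorer matching the formula above; k_1 = 1.2 and b = 0.75 are the commonly used defaults, assumed here rather than taken from the slides:

public class Bm25 {
    static final double K1 = 1.2, B = 0.75; // common defaults (assumption)

    // IDF(q) = log((N - n + 0.5) / (n + 0.5)), n = docs containing the term.
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score contribution of one query term with frequency tf in a document
    // of length docLen, given the collection's average document length.
    static double termScore(int tf, double docLen, double avgDl,
                            long numDocs, long docFreq) {
        double norm = tf + K1 * (1 - B + B * docLen / avgDl);
        return idf(numDocs, docFreq) * (tf * (K1 + 1)) / norm;
    }

    public static void main(String[] args) {
        // Toy numbers: tf=3, docLen=120 terms, avgDL=100, N=1e6, df=1000.
        double s = termScore(3, 120, 100, 1_000_000, 1_000);
        System.out.printf("BM25 term score = %.4f%n", s);
    }
}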

Link based Model

Link Structure of the Web

• Intuitively, a webpage is important if it has a lot of backlinks.

In-links and out-links: if A and B link to C, then A and B are C's in-links, and C is an out-link of A and of B.

PageRank

$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$

• p_1, p_2, …, p_N are the pages under consideration.

• M(p_i) is the set of pages that link to p_i.

• L(p_j) is the number of outbound links on page p_j.

• N is the total number of pages.

• d is the damping factor (typically 0.85); a power-iteration sketch follows this list.
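A small power-iteration sketch of the PageRank formula; the three-page toy graph and the fixed iteration count are assumptions:

public class PageRank {
    public static void main(String[] args) {
        // Toy graph as adjacency lists: out[i] holds the pages that page i links to.
        int[][] out = { {1, 2}, {2}, {0} };
        int n = out.length;
        double d = 0.85;                 // damping factor (common default)
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - d) / n);
            // Each page j distributes its rank evenly over its L(j) out-links.
            for (int j = 0; j < n; j++)
                for (int target : out[j])
                    next[target] += d * pr[j] / out[j].length;
            pr = next;
        }
        for (int i = 0; i < n; i++)
            System.out.printf("PR(p%d) = %.4f%n", i, pr[i]);
    }
}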

[Figure: an example of simplified PageRank, with the calculation for the first iteration]

Evaluation

Set based effectiveness measures

[Venn diagram: the retrieved set and the relevant set, with their intersection "relevant and retrieved"]

Precision and recall

Precision (P) is the fraction of retrieved documents that are relevant: P = |relevant ∩ retrieved| / |retrieved|

Recall (R) is the fraction of relevant documents that are retrieved: R = |relevant ∩ retrieved| / |relevant|

Precision/recall tradeoff

• You can increase recall by returning more docs.

• Recall is a non-decreasing function of the number of docs retrieved.

• A system that returns all docs has 100% recall!

• The converse is also true (usually): it's easy to get high precision at very low recall.

• So we can use the harmonic mean of both (sketched below):

$$F = \frac{2PR}{P+R}$$
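A tiny sketch computing precision, recall and F on toy relevant/retrieved sets:

import java.util.HashSet;
import java.util.Set;

public class PrecisionRecall {
    public static void main(String[] args) {
        Set<Integer> relevant  = Set.of(1, 2, 3, 5, 8);
        Set<Integer> retrieved = Set.of(2, 3, 4, 5);

        Set<Integer> hit = new HashSet<>(retrieved);
        hit.retainAll(relevant);                    // relevant AND retrieved

        double p = (double) hit.size() / retrieved.size();
        double r = (double) hit.size() / relevant.size();
        double f = 2 * p * r / (p + r);             // harmonic mean

        System.out.printf("P = %.2f, R = %.2f, F = %.2f%n", p, r, f);
    }
}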

Measures

• Average Precision is the average of P@K over all ranks K at which a relevant document appears.

• Advantage of average precision: no need to select any particular K.

• Mean Average Precision (MAP) is average precision averaged across a set of queries.

• Advantage of MAP: the result reflects the performance of the whole system.
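A short sketch of average precision over one ranked list; the binary relevance judgments are toy values, and MAP is simply the mean of this quantity over a query set:

public class AveragePrecision {
    // relevanceAtRank[k] is true if the document at rank k+1 is relevant.
    static double averagePrecision(boolean[] relevanceAtRank, int totalRelevant) {
        double sum = 0;
        int hits = 0;
        for (int k = 0; k < relevanceAtRank.length; k++) {
            if (relevanceAtRank[k]) {
                hits++;
                sum += (double) hits / (k + 1);   // P@K at each relevant rank
            }
        }
        return totalRelevant == 0 ? 0 : sum / totalRelevant;
    }

    public static void main(String[] args) {
        boolean[] ranked = {true, false, true, false, true};
        System.out.printf("AP = %.4f%n", averagePrecision(ranked, 3));
    }
}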

NDCG

Normalized Discounted Cumulative Gain (NDCG) is used when relevance judgments are not binary.

Suppose there are five levels of relevance judgment: Perfect, Excellent, Good, Fair, Bad. We assign a relevance score to each level, say Perfect = 4, Excellent = 3, Good = 2, Fair = 1 and Bad = 0.

$$NDCG(Q,k) = \frac{1}{|Q|}\sum_{j=1}^{|Q|} Z_{kj} \sum_{m=1}^{k} \frac{2^{R(j,m)} - 1}{\log_2(1+m)}$$

NDCG can be measured at rank k. Here Q is the set of queries, R(j,m) is the relevance score for query j and the document at rank m, and Z_{kj} is a normalizing factor chosen so that a perfect ranking gets an NDCG of 1.
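A sketch of NDCG@k for a single query, following the inner sum above; the graded gains are toy values, and the normalizer Z_k is computed from the ideal (descending) ordering so a perfect ranking scores 1:

import java.util.Arrays;

public class Ndcg {
    // DCG@k over graded relevance scores in ranked order.
    static double dcg(int[] gains, int k) {
        double s = 0;
        for (int m = 1; m <= Math.min(k, gains.length); m++)
            s += (Math.pow(2, gains[m - 1]) - 1) / (Math.log(1 + m) / Math.log(2));
        return s;
    }

    public static void main(String[] args) {
        int[] ranked = {3, 2, 4, 0, 1};              // gains in system order
        int[] ideal = ranked.clone();
        Arrays.sort(ideal);                          // ascending...
        for (int i = 0; i < ideal.length / 2; i++) { // ...then reverse to descending
            int t = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = t;
        }
        int k = 5;
        // Z_k = 1 / DCG of the ideal ranking, so a perfect ranking scores 1.
        System.out.printf("NDCG@%d = %.4f%n", k, dcg(ranked, k) / dcg(ideal, k));
    }
}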

Evaluation Fora

• 1960 – Cranfield experiments: Initial experiments on text retrieval were started by Cyril Cleverdon in the 60s at Cranfield University. Cleverdon's retrieval test collection formed the blueprint for TREC.

• 1992 – TREC: The Text Retrieval Conference was started in 1992 by NIST. TREC focuses on several tracks ranging from question answering to cross lingual information retrieval, and released collections 1-4.

• 1999 – NTCIR: NTCIR was launched in 1999 as the Japanese counterpart of TREC. It focuses largely on datasets for Asian languages (Japanese, Korean, Chinese).

• 2000 – CLEF: CLEF, the Cross-Language Evaluation Forum, started out as an evaluation forum focused on cross lingual IR. Today it has become a fully peer reviewed conference. CLEF focuses largely on European languages.

• 2007 – FIRE: FIRE (Forum for IR Evaluation) started as a spin-off of a CLEF 2007 task on retrieval for Indian languages. FIRE has released collections for 10 Indian languages.

Web Crawling

Web Crawler

• A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing.

• A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter.

• Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indexes of other sites' web content.

List of web crawlers

• Apache Nutch

• WebCrawler

• DataparkSearch

• HTTrack

• MnoGoSearch

Web Crawler Architecture

Crawl cycle

• Create a URL seed list (one-time process).

• Generate: a list of URLs that need to be fetched in this cycle is generated.

• Fetch: the generated URLs are fetched from the internet.

• Parse: each fetched document is parsed and its out-links are extracted.

• UpdateDb: the out-links are written back into the database (a simplified loop is sketched below).
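A highly simplified, single-threaded sketch of this generate/fetch/parse/update loop; the fetcher and link-extractor stubs are placeholders, not Nutch code:

import java.util.*;

public class CrawlCycle {
    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<>(List.of("http://www.iitb.ac.in"));
        Set<String> crawled = new HashSet<>();   // the "database" of seen URLs
        int depth = 2;

        for (int cycle = 0; cycle < depth; cycle++) {
            // Generate: pick the URLs to fetch in this cycle.
            List<String> batch = new ArrayList<>(frontier);
            frontier.clear();
            for (String url : batch) {
                if (!crawled.add(url)) continue; // skip already-fetched URLs
                String html = fetch(url);        // Fetch: download raw content
                for (String out : extractLinks(html))  // Parse: find out-links
                    if (!crawled.contains(out))
                        frontier.add(out);       // UpdateDb: enqueue new URLs
            }
        }
        System.out.println("Crawled: " + crawled);
    }

    // Stubs standing in for a real HTTP fetcher and HTML parser.
    static String fetch(String url) { return "<html>...</html>"; }
    static List<String> extractLinks(String html) { return List.of(); }
}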

Cross Lingual Information Retrieval

The Problem

• You have a collection of documents in language L1

• The user gives a query in language L2

Possible pipelines: Document translation

[Diagram: the document collection is run through a translation system, the IR system indexes the translated documents, and the query is matched against this index to produce a ranked list of documents]

Possible pipelines: Query translation

[Diagram: the document collection is indexed as-is by the IR system; the query is run through a translation system, and the translated query is matched against the index to produce a ranked list of documents]

Sandhan: A Case Study
http://www.sandhansearch.in

[Diagram: a Hindi query तिरूपति यात्रा (Tirupati pilgrimage) goes to the CLIR engine, which uses language resources and a target language index in English, built over crawled and indexed web pages holding the target information in English, and returns a ranked list of results with result snippets in Hindi]

Example result snippet (in Hindi): "Rail options for reaching Tirupati: many trains are available to reach the holy town of Tirupati. If travelling from Mumbai, one can take the Mumbai-Chennai Express."

Sandhan – Consortium Project

• IIT Bombay (co-ordinator)

• CDAC Noida (co-coordinator)

• CDAC Pune

• IIT Kharagpur

• Jadavpur University

• ISI Kolkata

• IIIT Hyderabad

• AU KBC

• AU CEG

• Gauhati University

• DAIICT Gujarat

• IIIT Bhubaneswar

• TDIL

Problem definition

• Cross Lingual Information Retrieval (CLIR) engine for Indian languages

Input: a query in one of nine Indian languages (Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi, Assamese, Gujarati, Oriya)

Output: results in Hindi, English and the query language

• Currently in the second phase of the project

• Three new languages were added in the second phase: Assamese, Gujarati, Oriya

• Built on top of the Nutch framework

Software Used

• Nutch v0.9 – framework

• Hadoop – distributed crawling

• Lucene – indexing

• Moses/GIZA++ – training translation models

• Tomcat – deployment

[Architecture diagram: an offline pipeline runs from the Web through the Fetcher, an Analyzer with MWE Lookup and NE Lookup, a Domain Identifier, Language Identifier and Font Transcoder, into the Indexer, CMLifier and UNL Index; an online pipeline runs the query through an Analyzer with MWE Lookup and NE Lookup, Translation/Transliteration and Query Formulation against the Index, followed by Snippet Generation, Snippet Translation, Summary Generation and Information Extraction]

Resources Developed

• Language specific analyzers

• Stop word lists

• Bilingual dictionaries (X-English, X-Hindi)

• NE lists

• MWE lists

• Transliteration models

Nutch and Lucene Framework: Demo
Arjun Atreya V
RS, IIT Bombay

Outline

• Introduction

• Behavior of Nutch (offline and online)

• Lucene features


Resources Used


• Gospodnetic, Otis; Erik Hatcher (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. pp. 456. ISBN 978-1-932394-28-3.

• Nutch Wiki http://wiki.apache.org/nutch/

Introduction


• Nutch is an open source search engine

• Implemented in Java

• Nutch is composed of Lucene, Solr, Hadoop, etc.

• Lucene provides the indexing and searching of the crawled data

• Both Nutch and Lucene are developed using a plugin framework

• Easy to customize

Where do they fit in IR?


Nutch – complete search engine


Nutch – offline processing


• Crawling

Starts with a set of seed URLs

Goes deeper into the web and fetches the content

Content needs to be analyzed before storing

Stores the content

Makes it suitable for searching

• Issues

Time consuming process

Freshness of the crawl (how often should I crawl?)

Coverage of content

Nutch – online processing


• Searching

Analysis of the query

Processing of the few words (tokens) in the query

Query tokens are matched against the stored tokens (index)

• Fast and accurate

• Involves ordering the matching results

• Ranking affects the user's satisfaction directly

• Supports distributed searching

Nutch – Data structures


• Web database (WebDB): mirrors the properties/structure of the web graph being crawled

• Segment: an intermediate index; contains the pages fetched in a single run

• Index: the final inverted index, obtained by "merging" segments (Lucene)

Nutch – Crawling


• Inject: initial creation of CrawlDB

Insert seed URLs

Initial LinkDB is empty

• Generate new shard's fetchlist

• Fetch raw content

• Parse content (discovers outlinks)

• Update CrawlDB from shards

• Update LinkDB from shards

• Index shards

Wide Crawling vs. Focused Crawling


• Differences:

Little technical difference in configuration

Big difference in operations, maintenance and quality

• Wide crawling:

(Almost) Unlimited crawling frontier

High risk of spamming and junk content

“Politeness” a very important limiting factor

Bandwidth & DNS considerations

• Focused (vertical or enterprise) crawling:

Limited crawling frontier

Bandwidth or politeness is often not an issue

Low risk of spamming and junk content

Crawling Architecture


Step1 : Injector injects the list of seed URLs into the CrawlDB


Step2 : Generator takes the list of seed URLs from CrawlDB, forms fetch list, adds crawl_generate folder into the segments


Step3 : These fetch lists are used by the fetchers to fetch the raw content of the documents, which is then stored in the segments.


Step4 : The parser is called to parse the content of the document, and the parsed content is stored back in the segments.


Step5 : The links are inverted in the link graph and stored in the LinkDB.


Step6 : The terms present in the segments are indexed, and the indices are updated in the segments.


Step7 : Information on the newly fetched documents is updated in the CrawlDB.


Crawling: 10 stage process


bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log

1. admin db -create: Create a new WebDB.

2. inject: Inject root URLs into the WebDB.

3. generate: Generate a fetchlist from the WebDB in a new segment.

4. fetch: Fetch content from URLs in the fetchlist.

5. updatedb: Update the WebDB with links from fetched pages.

6. Repeat steps 3-5 until the required depth is reached.

7. updatesegs: Update segments with scores and links from the WebDB.

8. index: Index the fetched pages.

9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.

10. merge: Merge the indexes into a single index for searching

De-duplication Algorithm


A tuple (MD5 hash, float score, int indexID, int docID, int urlLen) is kept for each page.

To eliminate URL duplicates from a segmentsDir:

  open a temporary file
  for each segment:
    for each document in its index:
      append a tuple for the document to the temporary file, with hash = MD5(URL)
  close the temporary file
  sort the temporary file by hash
  for each group of tuples with the same hash:
    for each tuple but the first:
      delete the specified document from the index
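A condensed in-memory sketch of the same idea (hash each URL with MD5 and keep only the first document per hash); the real algorithm sorts tuples in a temporary file, which this simplification skips:

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.*;

public class UrlDedup {
    static String md5(String url) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
        return new BigInteger(1, d).toString(16);
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
            "http://apache.org/", "http://apache.org/", "http://nutch.apache.org/");
        Set<String> seen = new HashSet<>();
        for (String url : urls) {
            // Keep only the first document for each URL hash; drop the rest.
            if (!seen.add(md5(url)))
                System.out.println("duplicate, would delete: " + url);
        }
    }
}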

URL Filtering


URL filters (text file: conf/crawl-urlfilter.txt)

Regular expressions to filter URLs during crawling. E.g.:

To ignore files with certain suffixes:
-\.(gif|exe|zip|ico)$

To accept hosts in a certain domain:
+^http://([a-z0-9]*\.)*apache.org/

A Few APIs


• Site we would crawl: http://www.iitb.ac.in

bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log

• Analyze the database:

bin/nutch readdb <db dir> -stats

bin/nutch readdb <db dir> -dumppageurl

bin/nutch readdb <db dir> -dumplinks

s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s

Map-Reduce Function


• Works in a distributed environment

• map() and reduce() functions are implemented in most of the modules

• Both map() and reduce() functions operate on <key, value> pairs

• Useful for processing large data (e.g., indexing)

• Some applications need a sequence of map-reduce jobs, as sketched below:

Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
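A self-contained sketch of the map/reduce pattern over <key, value> pairs, using word counting as the classic example; this illustrates only the data flow and deliberately uses no Hadoop APIs:

import java.util.*;

public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> docs = List.of("machine learning", "machine translation");

        // map(): emit a <term, 1> pair for every token in every document.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String doc : docs)
            for (String term : doc.split("\\s+"))
                mapped.add(Map.entry(term, 1));

        // shuffle: group the emitted values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // reduce(): sum each key's values to get <term, count>.
        grouped.forEach((term, counts) ->
            System.out.println(term + " -> " + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}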

Map-Reduce Architecture


Nutch – Map-Reduce Indexing


• map() just assembles all parts of the documents

• reduce() performs text analysis + indexing: adds to a local Lucene index

• Other possible MR indexing models:

Hadoop contrib/indexing model: analysis and indexing on the map() side, index merging on the reduce() side

Modified Nutch model: analysis on the map() side, indexing on the reduce() side

Nutch - Ranking


• Nutch ranking (the factors in Lucene's scoring function):

queryNorm() : the normalization factor for the query

coord() : how many of the query terms are present in the given document

norm() : a field-based normalization factor

tf : term frequency, and idf : inverse document frequency

t.boost() : the importance of a term's occurrence in a particular field
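Assembled, these factors give Lucene's classic practical scoring function; the following form is the standard one for the Lucene generation that Nutch 0.9 built on, stated here as background rather than taken from the slides:

$$\mathrm{score}(q,d) = \mathrm{queryNorm}(q)\cdot\mathrm{coord}(q,d)\cdot\sum_{t\in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^2\cdot t.\mathrm{boost}\cdot\mathrm{norm}(t,d)$$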

Lucene - Features


• Field based indexing and searching

• Different fields of a webpage are

Title

URL

Anchor text

Content, etc.

• Different boost factors to give importance to fields

• Uses an inverted index to store the content of crawled documents

• Open source Apache project

Lucene - Index


• Concepts

Index: sequence of documents (a.k.a. Directory)

Document: sequence of fields

Field: named sequence of terms

Term: a text string (e.g., a word)

• Statistics

Term frequencies and positions

Writing to Index


IndexWriter writer = new IndexWriter(directory, analyzer, true);
Document doc = new Document();
// add fields to document (next slide)
writer.addDocument(doc);
writer.close();

Adding Fields


doc.add(Field.Keyword("isbn", isbn));
doc.add(Field.Keyword("category", category));
doc.add(Field.Text("title", title));
doc.add(Field.Text("author", author));
doc.add(Field.UnIndexed("url", url));
doc.add(Field.UnStored("subjects", subjects, true));
doc.add(Field.Keyword("pubmonth", pubmonth));
doc.add(Field.UnStored("contents", author + " " + subjects));
doc.add(Field.Keyword("modified", DateField.timeToString(file.lastModified())));

Fields Description


• Attributes

Stored: original content retrievable

Indexed: inverted, searchable

Tokenized: analyzed, split into tokens

• Factory methods

Keyword: stored and indexed as single term

Text: indexed, tokenized, and stored if String

UnIndexed: stored

UnStored: indexed, tokenized

• Terms are what matters for searching

Searching an Index


IndexSearcher searcher = new IndexSearcher(directory);
Query query = QueryParser.parse(queryExpression, "contents", analyzer);
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
  Document doc = hits.doc(i);
  System.out.println(doc.get("title"));
}

Analyzer


• Analysis occurs

For each tokenized field during indexing

For each term or phrase in QueryParser

• Several analyzers built-in

Many more in the sandbox

Straightforward to create your own

• Choosing the right analyzer is important!

WhiteSpace Analyzer


The quick brown fox jumps over the lazy dog.

[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]

Simple Analyzer


The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]

Stop Analyzer


The quick brown fox jumps over the lazy dog.

[quick] [brown] [fox] [jumps] [over] [lazy] [dog]

Snowball Analyzer


The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]

Query Creation


• Searching by a term – TermQuery

• Searching within a range – RangeQuery

• Searching by prefix – PrefixQuery

• Combining queries – BooleanQuery

• Searching by phrase – PhraseQuery

• Searching by wildcard – WildcardQuery

• Searching for similar terms - FuzzyQuery
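A sketch combining several of these query types, written against the same old Lucene 1.x-style API as the snippets above (BooleanQuery.add with required/prohibited flags); treat the exact signatures as an assumption about that era's API:

// Match documents whose contents contain "machine" AND "learning",
// but NOT "compiler" (required / prohibited flags).
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("contents", "machine")), true, false);
query.add(new TermQuery(new Term("contents", "learning")), true, false);
query.add(new TermQuery(new Term("contents", "compiler")), false, true);
Hits hits = searcher.search(query);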

Lucene Queries


Conclusions


• Nutch as a starting point

• Crawling in Nutch

• Detailed map-reduce architecture

• Different query formats in Lucene

• Built-in analyzers in Lucene

• The same analyzer needs to be used both while indexing and searching

Thanks


• Questions?
