![Page 1: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/1.jpg)
Information Retrieval
Shehzaad Dhuliawala
Maulik Vachhani
![Page 2: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/2.jpg)
Presentation Outline
• Introduction
• Boolean Retrieval
• Indexing
    Term Vocabulary
    Postings List
    Index Creation
• Retrieval Models and Scoring
    Vector Space Model
    Probabilistic Model
• Web Crawling
• Cross Lingual Information Retrieval
![Page 3: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/3.jpg)
Content we will refer to
1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/
2. Coursera Natural Language Processing course by Dan Jurafsky and Christopher Manning. https://class.coursera.org/nlp/
3. NPTEL course on Natural Language Processing by Pushpak Bhattacharyya. http://nptel.ac.in/courses/106101007/
![Page 4: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/4.jpg)
What is Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). [1]
![Page 5: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/5.jpg)
Unstructured Text
• What differentiates an IR system from a database
![Page 6: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/6.jpg)
IR Models
• An IR model is a quadruple [D, Q, F, R(di, qi)]
    D: the collection of documents
    Q: the collection of queries
    F: a framework for modelling the documents, the queries and their relationship
    R: a ranking (scoring) function that returns a real number expressing the relevance of document di to query qi
![Page 7: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/7.jpg)
Boolean Retrieval
• It is a simple model based on set theory
• It checks whether terms are present in a document or not
![Page 8: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/8.jpg)
Example
• We have a collection of scientific papers in the field of computer science
• The information need: papers that are about information retrieval using machine learning
• Query: information ∧ retrieval ∧ machine ∧ learning
• Result: Set(information) ∩ Set(retrieval) ∩ Set(machine) ∩ Set(learning), because AND corresponds to set intersection
[Figure: Venn diagram of the four term sets information, retrieval, machine and learning]
![Page 9: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/9.jpg)
Grepping
• The Unix grep command lets you search for the presence of a term in a document
• Why does this approach pose a problem?
![Page 10: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/10.jpg)
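The term-document matrix

| | compiler | machine | learning | deep | information | retrieval | translation |
|---|---|---|---|---|---|---|---|
| Doc 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doc 2 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| Doc 3 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| Doc 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Doc 5 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Doc 6 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |

Query: machine ∧ learning ∧ ¬(compiler)
Each term's bit vector is its column read from Doc 1 to Doc 6; ¬(compiler) = ¬(100011) = (011100).
(010011) ∧ (011010) ∧ (011100) = (010000)
So the relevant set is {Doc 2}.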
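The bitwise evaluation above can be sketched in a few lines of Python. This is only an illustration: the helper name `to_bits` and the dictionary layout are assumptions made for the example, and the rows are copied from the matrix on this slide.

```python
# Incidence vectors copied from the term-document matrix (Doc 1 .. Doc 6).
incidence = {
    "compiler": [1, 0, 0, 0, 1, 1],
    "machine":  [0, 1, 0, 0, 1, 1],
    "learning": [0, 1, 1, 0, 1, 0],
}

def to_bits(row):
    """Pack a list of 0/1 values into an int; the leftmost bit is Doc 1."""
    value = 0
    for bit in row:
        value = (value << 1) | bit
    return value

n_docs = 6
mask = (1 << n_docs) - 1  # restricts NOT to the n_docs low-order bits

machine = to_bits(incidence["machine"])
learning = to_bits(incidence["learning"])
compiler = to_bits(incidence["compiler"])

# machine AND learning AND NOT compiler
result = machine & learning & (~compiler & mask)
result_bits = format(result, f"0{n_docs}b")
matching_docs = [i + 1 for i, b in enumerate(result_bits) if b == "1"]
print(result_bits, matching_docs)
```

The mask matters because Python integers are unbounded, so a bare `~compiler` would set bits beyond the six documents.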
![Page 11: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/11.jpg)
TDM: Sparseness
• Space complexity: |V| · |D|
• |V| -> vocabulary size
• |D| -> number of documents
• |V| = 500,000
• |D| = 1 million
• Space required: 5 × 10^11 cells, i.e. ~500 GB at one byte per cell, even though most cells are 0
![Page 12: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/12.jpg)
Inverted Index
[Figure: an inverted index. Each term (compilers, machine, learning, information, retrieval) points to its postings list, a sorted list of the IDs of the documents that contain the term, for example compilers → 1, 5, 6, ...]
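A minimal sketch of the idea, assuming whitespace tokenization and a small made-up collection (`toy_docs` is hypothetical, not the collection in the figure): build a postings list per term, then answer an AND query by merging two sorted lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted postings lists, keeping the IDs present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical toy collection, for illustration only.
toy_docs = {
    1: "compilers",
    2: "machine learning information retrieval",
    3: "deep learning information",
    5: "compilers machine learning retrieval",
}
index = build_inverted_index(toy_docs)
hits = intersect(index["machine"], index["learning"])
print(index["compilers"], hits)
```

Only the terms that actually occur in a document take up space, which is what avoids the sparse |V| · |D| matrix.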
![Page 13: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/13.jpg)
NLP and IR
![Page 14: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/14.jpg)
How NLP helps IR
• Tokenization
• Stemming/Lemmatization
• Stopword removal
• Normalization
• Named Entities
• Multi-word expressions
![Page 15: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/15.jpg)
Tokenization
• Text is a sequence of characters
• For term-based indexing we need to decide how to tokenize the text
• Where does this become a problem?
• O’Neal, Knock-out: how do we tokenize these?
![Page 16: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/16.jpg)
Stemming
• Q: The best cars
• D: The best car in 2016 is the Honda…
• The need for stemming is greater in morphologically richer languages (e.g., Marathi)
• मुंबईहून पुण्याला जाणाऱ्या बसची वेळ ("the timing of the bus going from Mumbai to Pune")
• Is stemming always beneficial?
![Page 17: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/17.jpg)
Stopword Removal
• Which words actually convey the meaning of the text?
• Taj Mahal is situated in Agra which is close to Delhi
• Content words: Taj Mahal, situated, Agra, close, Delhi; stopwords: is, in, which, to
• It has been shown that removing stopwords often boosts the performance of an IR system and reduces the index size
• Is it always beneficial to remove stopwords?
![Page 18: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/18.jpg)
Normalization
• Text often contains stylistic variation, and usage may not be consistent
• For example, one document may contain the term USA, while another contains U.S.A
• Should both be indexed separately?
![Page 19: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/19.jpg)
Named Entities and Multiword Expressions
• Often a group of words is more relevant together than individually
• Q: machine learning
• D: …the machine was used by several students and this was a good learning experience for them…
• Such terms are called multiword expressions
• Should they be indexed together?
![Page 20: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/20.jpg)
Retrieval Models
![Page 21: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/21.jpg)
Problems with Boolean search
• Boolean queries often result in either too few (= 0) or too many (1000s) results.
• Query 1: “standard user dlink 650” → many results
• Query 2: “standard user dlink 650 no card found” → 0 hits
• It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few; OR gives too many.
• Retrieved documents are not ranked in any order.
![Page 22: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/22.jpg)
Ranked retrieval models
• Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query
• Ranked retrieval models are:
1. Vector Space Model
2. Probabilistic Model
![Page 23: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/23.jpg)
Scoring as the basis of ranked retrieval
• We wish to return, in order, the documents most likely to be useful to the searcher
• Assign a score, say in [0, 1], to each document
• We need a way of assigning a score to a query/document pair
• The more frequent the query term in the document, the higher the score (should be)
• Rare terms are more informative than frequent terms
![Page 24: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/24.jpg)
Term frequency tf
• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
• Raw term frequency is not what we want: relevance does not increase proportionally with term frequency.
• So we use log frequency.
• The log frequency weight of term t in d is

$$
w_{t,d} =
\begin{cases}
1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\
0, & \text{otherwise}
\end{cases}
$$
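As a quick numerical illustration of the log frequency weight defined on this slide (the function name `log_tf_weight` is chosen only for this sketch):

```python
import math

def log_tf_weight(tf):
    """Log frequency weight: 1 + log10(tf) when tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Raw counts grow by factors of 10, the weight grows by 1 each time.
for tf in (0, 1, 10, 100, 1000):
    print(tf, log_tf_weight(tf))
```

A term occurring 1000 times gets a weight only four times that of a term occurring once, which reflects the point that relevance does not grow proportionally with raw frequency.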
![Page 25: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/25.jpg)
idf weight
• Frequent terms are less informative than rare terms
• We want to give higher weights to rare terms.
• We will use document frequency (df) to capture this.
• df_t is the document frequency of t: the number of documents that contain t
• We define the idf (inverse document frequency) of t by

$$
\mathrm{idf}_t = \log_{10}\left(N / \mathrm{df}_t\right)
$$

• N is the total number of documents.
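A small sketch of the idf definition, assuming a collection of N = 1,000,000 documents chosen purely for illustration; the function name `idf` and the sample df values are not from the slides.

```python
import math

def idf(df, n_docs):
    """Inverse document frequency: log10(N / df). Requires df >= 1."""
    return math.log10(n_docs / df)

n_docs = 1_000_000  # assumed collection size for the example
for df in (1, 100, 10_000, 1_000_000):
    print(df, idf(df, n_docs))
```

A term that appears in every document gets idf 0, so it contributes nothing to a tf-idf score.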
![Page 26: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/26.jpg)
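tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight.
• Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection

$$
w_{t,d} = \log\left(1 + \mathrm{tf}_{t,d}\right) \times \log_{10}\left(N / \mathrm{df}_t\right)
$$

$$
\mathrm{Score}(q, d) = \sum_{t \in q \cap d} \mathrm{tf.idf}_{t,d}
$$

• Here the tf component is written as log(1 + tf_{t,d}), a variant of the log frequency weight that is 0 when tf_{t,d} = 0.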
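The scoring formula can be sketched as follows. This is a minimal illustration under stated assumptions: documents are already tokenized, the tf component uses base-10 logarithms (the slide does not state the base), and the four-document collection `docs` is made up for the example.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, df, n_docs):
    """Sum tf-idf weights over terms that occur in both the query and the document."""
    tf = Counter(doc_tokens)
    score = 0.0
    for t in set(query_terms):
        if tf[t] > 0 and df.get(t, 0) > 0:
            # Assumption: base-10 log for the tf part as well as the idf part.
            score += math.log10(1 + tf[t]) * math.log10(n_docs / df[t])
    return score

# Hypothetical tokenized collection.
docs = {
    "d1": ["machine", "learning", "machine"],
    "d2": ["machine", "translation"],
    "d3": ["deep", "learning"],
    "d4": ["compiler"],
}
n_docs = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(t for tokens in docs.values() for t in set(tokens))

query = ["machine", "learning"]
scores = {name: tf_idf_score(query, tokens, df, n_docs) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

The document containing both query terms, one of them twice, ranks first; the document sharing no terms with the query scores 0.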
![Page 27: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/27.jpg)
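Documents and query as vectors
• So we have a |V|-dimensional vector space
• Terms are the axes of the space
• Documents and the query are points or vectors in this space
• Find the cosine similarity between each document and the query.
• The query norm |q| is the same for every document, so it can be dropped when only the relative ordering matters; if document vectors are length-normalized in advance, the cosine reduces to a dot product.

$$
\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|}
= \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|}
= \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}
$$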
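A minimal sketch of the cosine computation for dense vectors of equal length. Returning 0 for an all-zero vector is a convention chosen for this example, not something the slides specify.

```python
import math

def cosine(q, d):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_q * norm_d)

print(cosine([1, 2, 3], [2, 4, 6]))  # same direction, different length
print(cosine([1, 0], [0, 1]))        # no shared terms
```

Because the vectors are normalized by their lengths, a long document that repeats the same content does not score higher than a short one pointing in the same direction.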
![Page 28: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/28.jpg)
Summary – vector space model
• Represent the query as a weighted tf-idf vector
• Represent each document as a weighted tf-idf vector
• Compute the cosine similarity score for the query vector and each document vector
• Rank documents with respect to the query by score
• Return the top K (e.g., K = 10) to the user
![Page 29: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/29.jpg)
Probabilistic Model
• Probability Ranking Principle
    Let d be a document in the collection.
    R represents the relevant documents
    NR represents the non-relevant documents
• In a probabilistic model, the obvious way to produce the output is to rank documents by the estimated probability of their relevance with respect to the information need.
• That is, we order documents d by P(R|d, q), where q is the query.
• Examples are BM25, the Binary Independence Model, etc.
![Page 30: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/30.jpg)
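BM25
• Ranks documents based on the query terms appearing in each document
• Given a query Q containing keywords q_1, ..., q_n, the BM25 score of a document D is

$$
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{\mathrm{TF}(q_i, D) \cdot (k_1 + 1)}{\mathrm{TF}(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgDL}}\right)}
$$

$$
\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
$$

• TF(q_i, D) is the frequency of q_i in D, |D| is the length of D in words, avgDL is the average document length in the collection, N is the total number of documents, n(q_i) is the number of documents containing q_i, and k_1 and b are free tuning parameters.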
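The BM25 formula can be sketched directly in Python. The values k1 = 1.2 and b = 0.75 are commonly used defaults and are assumptions here, as are the collection statistics (`n_docs`, `df`, `avg_dl`) and the toy documents.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, df, n_docs, avg_dl, k1=1.2, b=0.75):
    """BM25 score of one tokenized document for a list of query terms."""
    tf = Counter(doc_tokens)
    score = 0.0
    for q in query_terms:
        n_q = df.get(q, 0)  # number of documents containing q
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
        f = tf[q]           # frequency of q in this document
        denom = f + k1 * (1 - b + b * len(doc_tokens) / avg_dl)
        score += idf * f * (k1 + 1) / denom
    return score

# Hypothetical collection statistics and documents.
n_docs = 100
df = {"information": 10, "retrieval": 5}
avg_dl = 5
doc_a = ["information", "retrieval", "information", "systems", "overview"]
doc_b = ["information", "theory", "and", "coding", "basics"]
doc_c = ["compiler", "design"]

query = ["information", "retrieval"]
score_a = bm25_score(query, doc_a, df, n_docs, avg_dl)
score_b = bm25_score(query, doc_b, df, n_docs, avg_dl)
score_c = bm25_score(query, doc_c, df, n_docs, avg_dl)
print(score_a, score_b, score_c)
```

Note that with this IDF definition a term occurring in more than half of the documents gets a negative IDF; practical systems often clamp or adjust it.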
![Page 31: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/31.jpg)
Link based Model
![Page 32: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/32.jpg)
Link Structure of the Web
• Intuitively, a webpage is important if it has a lot of backlinks.
• In-links and out-links: if pages A and B both link to page C, then A and B are C’s in-links, and C is an out-link of both A and B.
![Page 33: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/33.jpg)
PageRank

$$
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
$$

• p_1, p_2, …, p_N are the pages under consideration.
• M(p_i) is the set of pages that link to p_i.
• L(p_j) is the number of outbound links on page p_j.
• N is the total number of pages.
• d is the damping factor.
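A minimal iterative sketch of the PageRank formula, under stated assumptions: d = 0.85 is a commonly used damping factor, every page has at least one out-link (no dangling pages), and the three-page graph `toy_links` mirrors the A, B, C example from the link-structure slide with one extra link from C to A added for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to (none empty)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum PR(q) / L(q) over every page q that links to p.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr

# A and B link to C; C links back to A (the last link is an assumption).
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
print(ranks)
```

Page B has no in-links, so its rank settles at the baseline (1 − d)/N, while C, with two in-links, ends up highest.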
![Page 34: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/34.jpg)
An example of Simplified PageRank
PageRank Calculation: first iteration
![Page 35: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/35.jpg)
Evaluation
![Page 36: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/36.jpg)
Set based effectiveness measures
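[Figure: Venn diagram of the Retrieved and Relevant document sets; their intersection is the set of documents that are both relevant and retrieved]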
![Page 37: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/37.jpg)
Precision and recall
Precision (P) is the fraction of retrieved documents that are relevant
Recall (R) is the fraction of relevant documents that are retrieved
![Page 38: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/38.jpg)
Precision/recall tradeoff
• You can increase recall by returning more docs.
• Recall is a non-decreasing function of the number of docs retrieved.
• A system that returns all docs has 100% recall!
• The converse is also true (usually): it’s easy to get high precision for very low recall.
• To balance the two, we can use their harmonic mean, the F measure:

$$
F = \frac{2PR}{P + R}
$$
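Precision, recall and F can be computed from the retrieved and relevant sets as sketched below. The function name and the document IDs are made up for the example; returning 0 when a set is empty is an assumed convention.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and their harmonic mean F."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant and retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical run: 4 documents retrieved, 8 documents relevant, 2 in common.
p, r, f = precision_recall_f1([1, 2, 3, 4], [2, 4, 5, 6, 7, 8, 9, 10])
print(p, r, f)
```

The harmonic mean stays close to the smaller of the two values, so a system cannot get a high F by maximizing only precision or only recall.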
![Page 39: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/39.jpg)
Measures
Average Precision is the average of all P@K values taken at the ranks K where the document is relevant.
Advantage of average precision: no need to select any particular K.
Mean Average Precision (MAP) is average precision averaged across a set of queries.
Advantage of MAP: a single number summarizes the effectiveness of the whole system across queries.
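The two measures can be sketched as below, following the definition on this slide (averaging P@K over the ranks where a relevant document appears). A common variant divides by the total number of relevant documents instead, which penalizes relevant documents that are never retrieved. The rankings and judgments are invented for the example.

```python
def average_precision(ranking, relevant):
    """Average of P@K over the ranks K at which a relevant document appears."""
    relevant = set(relevant)
    hits = 0
    precisions = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)  # P@K at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical results for two queries.
ap1 = average_precision(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5"})
ap2 = average_precision(["d2", "d1"], {"d1"})
map_score = mean_average_precision([
    (["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5"}),
    (["d2", "d1"], {"d1"}),
])
print(ap1, ap2, map_score)
```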
![Page 40: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/40.jpg)
NDCG
Normalized Discounted Cumulative Gain (NDCG):
It is used when the relevance judgement is not binary.
Suppose there are five levels of relevance judgement: Perfect, Excellent, Good, Fair, Bad.
We assign a relevance score to each level. Suppose Perfect = 4, Excellent = 3, Good = 2, Fair = 1 and Bad = 0.

$$
\mathrm{NDCG}(Q, k) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} Z_{kj} \sum_{m=1}^{k} \frac{2^{R(j,m)} - 1}{\log_2(1 + m)}
$$

NDCG can be measured at rank k. Here Q is the set of queries, R(j, m) is the relevance score of the document at rank m for query j, and Z_{kj} is a normalizing factor chosen so that a perfect ranking for query j has an NDCG of 1 at rank k.
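For a single query, the inner sum and the normalization can be sketched as follows. The sketch assumes the list of graded judgments covers all judged documents for the query, so the ideal ordering is obtained by sorting that same list; the function names are chosen for the example. Averaging the per-query values gives NDCG(Q, k).

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k relevance scores."""
    return sum((2 ** g - 1) / math.log2(1 + m)
               for m, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Relevance scores by rank, using Perfect = 4 ... Bad = 0.
perfect_order = [4, 3, 2, 1, 0]
poor_order = [0, 1, 4]
print(ndcg_at_k(perfect_order, 5), ndcg_at_k(poor_order, 3))
```

Here 1 / (ideal DCG) plays the role of the normalizing factor Z_{kj}.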
![Page 41: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/41.jpg)
Evaluation Fora
![Page 42: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/42.jpg)
Cranfield experiments (1960): the Cranfield collection
Initial experiments on text retrieval were started by Cyril Cleverdon in the 1960s at Cranfield University. Cleverdon’s retrieval test collection formed the blueprint for TREC.
![Page 43: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/43.jpg)
TREC (1992): TREC Collections (1-4)
The Text Retrieval Conference (TREC) was started in 1992 by NIST. TREC runs several tracks, ranging from question answering to cross-lingual information retrieval.
![Page 44: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/44.jpg)
NTCIR (1999): Asian language collections
NTCIR, the Japanese counterpart of TREC, was launched in 1999. NTCIR focuses largely on datasets for Asian languages (Japanese, Korean, Chinese).
![Page 45: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/45.jpg)
CLEF (2000): European language collections
CLEF, the Cross-Language Evaluation Forum, started out as an evaluation forum focused on cross-lingual IR. Today it has become a fully peer-reviewed conference. CLEF focuses largely on European languages.
![Page 46: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/46.jpg)
FIRE (2007): Indian language collections
FIRE (Forum for Information Retrieval Evaluation) started as a spin-off of a CLEF 2007 task on retrieval for Indian languages. FIRE has released collections for 10 Indian languages.
![Page 47: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/47.jpg)
Web Crawling
![Page 48: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/48.jpg)
Web Crawler
A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing.
A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter.
Web search engines and some other sites use Web crawling or spidering software to update their web content or their indexes of other sites' web content.
![Page 49: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/49.jpg)
List of web crawlers
• Apache Nutch
• WebCrawler
• DataparkSearch
• HTTrack
• MnoGoSearch
![Page 50: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/50.jpg)
Web Crawler Architecture
![Page 51: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/51.jpg)
Crawl cycle
• Create a URL seed list (one-time process)
• Generate: in this phase, the list of URLs to be fetched in this cycle is generated.
• Fetcher: in this phase, the generated URLs are fetched from the internet.
• Parser: in this phase, the fetched documents are parsed and their out-links are extracted.
• UpdateDb: in this phase, the extracted out-links are added to the database so that they can be fetched in later cycles.
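The cycle can be sketched with an in-memory stand-in for the web. This is a conceptual illustration only and not Nutch's actual API: the function `crawl`, the status strings, and the dictionary `toy_web` (which maps a URL directly to its out-links) are all hypothetical.

```python
def crawl(seed_urls, fetch, parse_outlinks, cycles=2):
    """Run a fixed number of generate / fetch / parse / update cycles."""
    # One-time step: inject the seed list into the crawl database.
    crawl_db = {url: "unfetched" for url in seed_urls}
    for _ in range(cycles):
        # Generate: choose the URLs to fetch in this cycle.
        to_fetch = [u for u, status in crawl_db.items() if status == "unfetched"]
        # Fetch: download the selected URLs.
        pages = {u: fetch(u) for u in to_fetch}
        for u in to_fetch:
            crawl_db[u] = "fetched"
        # Parse: extract the out-links from every fetched page.
        outlinks = [link for content in pages.values()
                    for link in parse_outlinks(content)]
        # UpdateDb: add newly discovered URLs for later cycles.
        for link in outlinks:
            crawl_db.setdefault(link, "unfetched")
    return crawl_db

# Hypothetical four-page web; "fetching" a page just returns its out-links.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": []}
db = crawl(["a"],
           fetch=lambda url: toy_web.get(url, []),
           parse_outlinks=lambda content: content,
           cycles=2)
print(db)
```

After two cycles, pages a, b and c have been fetched, and d has been discovered but is still waiting for the next generate phase.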
![Page 52: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/52.jpg)
Cross Lingual Information Retrieval
![Page 53: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/53.jpg)
The Problem
• You have a collection of documents in language L1
• The user gives a query in language L2
![Page 54: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/54.jpg)
Possible pipelines: Document translation
[Figure: the document collection passes through the translation system and is then indexed; the IR system takes the query, searches the index and returns a ranked list of documents]
![Page 55: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/55.jpg)
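Possible pipelines: Query translation
[Figure: the document collection is indexed as is; the query passes through the translation system before reaching the IR system, which searches the index and returns a ranked list of documents]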
![Page 56: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/56.jpg)
Sandhan: A Case Study
http://www.sandhansearch.in
![Page 57: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/57.jpg)
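[Figure: Sandhan CLIR flow. A Hindi query, तिरूपति यात्रा ("Tirupati travel"), goes to the CLIR engine, which uses language resources and the target-language index (in English) built from crawled and indexed web pages. The engine produces a ranked list of results; the target information in English is presented as result snippets in Hindi.]
Example result snippet in Hindi: तिरूपति आने के लिए रेल साधन. तिरूपति पुण्य नगर पहुँचने के लिए बहुत रेल उपलब्ध हैं। अगर मुंबई से यात्रा कर रहे हैं तो मुंबई-चेन्नई एक्सप्रेस गाड़ी से प्रवास कर सकते हैं।
English translation: "Rail options for reaching Tirupati. Many trains are available to reach the holy city of Tirupati. If you are travelling from Mumbai, you can take the Mumbai-Chennai Express."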
![Page 58: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/58.jpg)
Sandhan – Consortium Project
• IIT Bombay (co-ordinator)
• CDAC Noida (co-coordinator)
• CDAC Pune
• IIT Kharagpur
• Jadavpur University
• ISI Kolkata
• IIIT Hyderabad
• AU KBC
• AU CEG
• Gauhati University
• DAIICT Gujarat
• IIIT Bhubaneswar
• TDIL
![Page 59: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/59.jpg)
Problem definition
• Cross Lingual Information Retrieval (CLIR) engine for Indian languages
    Input: Query in one of nine Indian languages (Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi, Assamese, Gujarati, Oriya)
    Output: Results in Hindi, English and the query language
• Currently in the second phase of the project
• Three new languages were added to the original six in the second phase: Assamese, Gujarati, Oriya
• Built on top of the Nutch framework
![Page 60: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/60.jpg)
Software Used
• Nutch v0.9 – Framework
• Hadoop – Distributed Crawling
• Lucene – Indexing
• Moses/GIZA++ – Training models
• Tomcat – Deployment
![Page 61: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/61.jpg)
[Figure: Sandhan system architecture. Components shown: Web, Fetcher, Font Transcoder, Language Identifier, Domain Identifier, Analyzer, NE Lookup, MWE Lookup, CMLifier, Indexer, Index, UNL Index, Information Extraction, Query Formulation, Translation/Transliteration, Snippet Generation, Summary Generation, Snippet Translation. Analyzer, NE Lookup and MWE Lookup each appear twice in the diagram.]
![Page 62: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/62.jpg)
Resources Developed
• Language specific analyzers
• Stop word List
• Bilingual Dictionary (X-English, X-Hindi)
• NE List
• MWE List
• Transliteration Models
![Page 63: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/63.jpg)
Nutch and Lucene Framework: Demo
Arjun Atreya V
RS-IITB
![Page 64: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/64.jpg)
Outline
• Introduction
• Behavior of Nutch (Offline and Online)
• Lucene Features
![Page 65: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/65.jpg)
Resources Used
• Gospodnetic, Otis; Erik Hatcher (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. pp. 456. ISBN 978-1-932394-28-3.
• Nutch Wiki http://wiki.apache.org/nutch/
![Page 66: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/66.jpg)
Introduction
• Nutch is an open-source search engine
• Implemented in Java
• Nutch builds on Lucene, Solr, Hadoop, etc.
• Lucene implements indexing and searching of the crawled data
• Both Nutch and Lucene are developed using a plugin framework
• Easy to customize
![Page 67: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/67.jpg)
Where do they fit in IR?
![Page 68: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/68.jpg)
Nutch – complete search engine
![Page 69: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/69.jpg)
Nutch – offline processing
• Crawling
Starts with a set of seed URLs
Goes deeper into the web and fetches the content
Content needs to be analyzed before storing
Stores the content
Makes the content suitable for searching
• Issues
Time consuming process
Freshness of the crawl (How often should I crawl?)
Coverage of content
![Page 70: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/70.jpg)
Nutch – online processing
• Searching
Analysis of the query
Processing of the few words (tokens) in the query
Query tokens are matched against stored tokens (index)
• Fast and Accurate
• Involves ordering the matching results
• Ranking affects User’s satisfaction directly
• Supports distributed searching
![Page 71: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/71.jpg)
Nutch – Data structures
• Web Database or WebDB: mirrors the properties/structure of the web graph being crawled
• Segment: intermediate index
Contains pages fetched in a single run
• Index: final inverted index obtained by “merging” segments (Lucene)
![Page 72: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/72.jpg)
Nutch – Crawling
• Inject: initial creation of CrawlDB
Insert seed URLs
Initial LinkDB is empty
• Generate new shard's fetchlist
• Fetch raw content
• Parse content (discovers outlinks)
• Update CrawlDB from shards
• Update LinkDB from shards
• Index shards
![Page 73: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/73.jpg)
Wide Crawling vs. Focused Crawling
• Differences:
Little technical difference in configuration
Big difference in operations, maintenance and quality
• Wide crawling:
(Almost) Unlimited crawling frontier
High risk of spamming and junk content
“Politeness” a very important limiting factor
Bandwidth & DNS considerations
• Focused (vertical or enterprise) crawling:
Limited crawling frontier
Bandwidth or politeness is often not an issue
Low risk of spamming and junk content
![Page 74: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/74.jpg)
Crawling Architecture
![Page 75: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/75.jpg)
Step 1: The Injector injects the list of seed URLs into the CrawlDB
![Page 76: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/76.jpg)
Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms a fetch list, and adds a crawl_generate folder to the segments
![Page 77: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/77.jpg)
Step 3: Fetchers use these fetch lists to fetch the raw content of the documents, which is then stored in segments.
![Page 78: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/78.jpg)
Step 4: The Parser is called to parse the content of the document, and the parsed content is stored back in segments.
![Page 79: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/79.jpg)
Step 5: The links are inverted in the link graph and stored in the LinkDB
![Page 80: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/80.jpg)
Step 6: The terms present in the segments are indexed, and the indices are updated in the segments
![Page 81: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/81.jpg)
Step 7: Information on the newly fetched documents is updated in the CrawlDB
![Page 82: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/82.jpg)
Crawling: 10-stage process
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
1. admin db -create: Create a new WebDB.
2. inject: Inject root URLs into the WebDB.
3. generate: Generate a fetchlist from the WebDB in a new segment.
4. fetch: Fetch content from URLs in the fetchlist.
5. updatedb: Update the WebDB with links from fetched pages.
6. Repeat steps 3-5 until the required depth is reached.
7. updatesegs: Update segments with scores and links from the WebDB.
8. index: Index the fetched pages.
9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
10. merge: Merge the indexes into a single index for searching
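The generate/fetch/update loop of steps 3-6 can be sketched in plain Java. This is only an illustration over a tiny in-memory link graph: the class `CrawlLoopSketch`, the `web` map and its URLs are hypothetical, and nothing here uses the Nutch API.

```java
import java.util.*;

// Illustrative only: a toy crawl loop, not Nutch code.
class CrawlLoopSketch {
    // Crawl from the seed URLs for at most `depth` generate/fetch/update rounds.
    // `web` is a hypothetical stand-in for the Web: page URL -> outlinks.
    static Set<String> crawl(Map<String, List<String>> web, List<String> seeds, int depth) {
        Set<String> known = new LinkedHashSet<>(seeds);   // inject: seed the "WebDB"
        Set<String> fetched = new LinkedHashSet<>();
        for (int round = 0; round < depth; round++) {
            // generate: every known URL that has not been fetched yet
            List<String> fetchList = new ArrayList<>();
            for (String url : known) {
                if (!fetched.contains(url)) fetchList.add(url);
            }
            if (fetchList.isEmpty()) break;               // nothing left to fetch
            // fetch + updatedb: record the page and add its outlinks to the DB
            for (String url : fetchList) {
                fetched.add(url);
                List<String> outlinks = web.get(url);
                if (outlinks != null) known.addAll(outlinks);
            }
        }
        return fetched;
    }
}
```

Each round reaches pages one link further from the seeds, which is why the `-depth` argument bounds how far the crawl goes.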
![Page 83: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/83.jpg)
De-duplication Algorithm
Tuple kept for each page: (MD5 hash, float score, int indexID, int docID, int urlLen)
To eliminate URL duplicates from a segmentsDir:
  open a temporary file
  for each segment:
    for each document in its index:
      append a tuple for the document to the temporary file with hash=MD5(URL)
  close the temporary file
  sort the temporary file by hash
  for each group of tuples with the same hash:
    for each tuple but the first:
      delete the specified document from the index
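A minimal Java sketch of the same idea, under stated assumptions: an in-memory list stands in for the temporary file, a document's position in the input list plays the role of its docID, and "the first" tuple of a group is simply the one that appeared first in the input. The class and method names are hypothetical, and this is not the Nutch implementation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

// Illustrative only: URL de-duplication by MD5(URL).
class UrlDedupSketch {
    // Hex-encoded MD5 digest of a URL string.
    static String md5(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // not expected on a standard JDK
        }
    }

    // Returns the docIDs (list positions) to delete: within each group of
    // equal hashes, every entry except the first.
    static List<Integer> duplicates(List<String> urls) {
        // One tuple per document: {hash, docID}.
        List<String[]> tuples = new ArrayList<>();
        for (int docId = 0; docId < urls.size(); docId++) {
            tuples.add(new String[] { md5(urls.get(docId)), String.valueOf(docId) });
        }
        // Sort by hash; List.sort is stable, so ties keep their input order.
        tuples.sort(Comparator.comparing((String[] t) -> t[0]));
        List<Integer> toDelete = new ArrayList<>();
        for (int i = 1; i < tuples.size(); i++) {
            if (tuples.get(i)[0].equals(tuples.get(i - 1)[0])) {
                toDelete.add(Integer.valueOf(tuples.get(i)[1]));
            }
        }
        Collections.sort(toDelete);
        return toDelete;
    }
}
```

Sorting by hash places duplicates next to each other, so a single linear pass is enough to find them.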
![Page 84: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/84.jpg)
URL Filtering
URL Filters (Text file) (conf/crawl-urlfilter.txt)
Regular expressions to filter URLs during crawling
E.g.
To ignore files with a certain suffix:
-\.(gif|exe|zip|ico)$
To accept hosts in a certain domain:
+^http://([a-z0-9]*\.)*apache.org/
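How such +/- rules could be applied is sketched below in plain Java. The sketch assumes that rules are tried in file order, that the first rule whose expression is found anywhere in the URL decides, and that a URL matching no rule is rejected. The class `UrlFilterSketch` is hypothetical and does not use the Nutch plugin API.

```java
import java.util.*;
import java.util.regex.Pattern;

// Illustrative only: first-match-wins URL filtering with +/- regex rules.
class UrlFilterSketch {
    private final List<Boolean> accept = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    // Each rule is '+' (accept) or '-' (reject) followed by a regular expression.
    UrlFilterSketch(List<String> rules) {
        for (String rule : rules) {
            accept.add(rule.charAt(0) == '+');
            patterns.add(Pattern.compile(rule.substring(1)));
        }
    }

    // Assumption: the first matching rule decides; no match means reject.
    boolean accepts(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) return accept.get(i);
        }
        return false;
    }
}
```

With the two rules from the slide, a page under apache.org would be accepted unless its URL ends in one of the listed suffixes, because the suffix rule comes first.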
![Page 85: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/85.jpg)
A few APIs
• Site we would crawl: http://www.iitb.ac.in
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
• Analyze the database:
bin/nutch readdb <db dir> -stats
bin/nutch readdb <db dir> -dumppageurl
bin/nutch readdb <db dir> -dumplinks
s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s
![Page 86: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/86.jpg)
Map-Reduce Function
• Works in a distributed environment
• map() and reduce() functions are implemented in most of the modules
• Both map() and reduce() functions use <key, value> pairs
• Useful when processing large data (e.g., indexing)
• Some applications need a sequence of map-reduce steps
Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
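The <key, value> contract can be illustrated with a single-machine model in plain Java. This is a sketch only: `MapReduceSketch`, `run` and `wordCount` are hypothetical names, everything runs in memory, and the Hadoop API is not used.

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.function.Function;

// Illustrative only: an in-memory, single-machine model of map-reduce.
class MapReduceSketch {
    // One pass: map each input to <key, value> pairs, group the values by key,
    // then reduce each group to one output value.
    static <I, K, V, O> Map<K, O> run(List<I> inputs,
                                      Function<I, List<Map.Entry<K, V>>> map,
                                      BiFunction<K, List<V>, O> reduce) {
        Map<K, List<V>> groups = new TreeMap<>();          // "shuffle": group by key
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : map.apply(input)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<K, O> out = new TreeMap<>();
        for (Map.Entry<K, List<V>> g : groups.entrySet()) {
            out.put(g.getKey(), reduce.apply(g.getKey(), g.getValue()));
        }
        return out;
    }

    // Example job: count how often each word occurs across all lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        return MapReduceSketch.<String, String, Integer, Integer>run(lines,
            line -> {
                List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                for (String w : line.toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!w.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
                }
                return pairs;
            },
            (word, ones) -> ones.size());
    }
}
```

A sequence of map-reduce steps would feed the entries of one `run` result into the next call as its inputs.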
![Page 87: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/87.jpg)
Map-Reduce Architecture
![Page 88: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/88.jpg)
Nutch – Map-Reduce Indexing
• Map() just assembles all parts of documents
• Reduce() performs text analysis + indexing:
Adds to a local Lucene index
Other possible MR indexing models:
• Hadoop contrib/indexing model:
analysis and indexing on map() side
Index merging on reduce() side
• Modified Nutch model:
Analysis on map() side
Indexing on reduce() side
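The first model above (map() assembles, reduce() analyzes and indexes) might look roughly like this in plain Java. It is a sketch under assumptions: each input part is a hypothetical `{url, text}` pair, the "index" is an in-memory map from term to URLs rather than a Lucene index, and the tokenization is deliberately naive.

```java
import java.util.*;

// Illustrative only: map() groups document parts by URL, reduce() analyzes and indexes.
class MrIndexingSketch {
    // Each part is {url, text}; parts of one page may come from different
    // sources (e.g. parsed content, anchor text). Returns term -> URLs.
    static Map<String, Set<String>> index(List<String[]> parts) {
        // map(): emit <url, text> so that all parts of a document meet in one group
        Map<String, List<String>> byUrl = new TreeMap<>();
        for (String[] part : parts) {
            byUrl.computeIfAbsent(part[0], k -> new ArrayList<>()).add(part[1]);
        }
        // reduce(): per document, run text analysis and add its terms to the index
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : byUrl.entrySet()) {
            for (String text : doc.getValue()) {
                for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                    if (term.isEmpty()) continue;
                    inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return inverted;
    }
}
```

Moving the tokenization loop into the map step would correspond to the modified model, where analysis happens on the map() side.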
![Page 89: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/89.jpg)
Nutch - Ranking
• Nutch Ranking
queryNorm(): the normalization factor for the query
coord(): indicates how many query terms are present in the given document
norm(): field-based normalization factor
tf: term frequency; idf: inverse document frequency
t.boost(): score indicating the importance of a term's occurrence in a particular field
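These factors are the ones used by Lucene's classic TF-IDF similarity. Assuming the slide refers to that scoring function, the score of document $d$ for query $q$ combines them as follows (written with the slide's names):

$$
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot \sum_{t \in q} \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot t.\mathrm{boost}()\cdot \mathrm{norm}(t,d)
$$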
![Page 90: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/90.jpg)
Lucene - Features
• Field based indexing and searching
• Different fields of a web page:
Title
URL
Anchor text
Content, etc.
• Different boost factors to give importance to fields
• Uses inverted index to store content of crawled documents
• Open source Apache project
![Page 91: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/91.jpg)
Lucene - Index
• Concepts
Index: sequence of documents (a.k.a. Directory)
Document: sequence of fields
Field: named sequence of terms
Term: a text string (e.g., a word)
• Statistics
Term frequencies and positions
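The concepts above (documents made of named fields, fields made of terms, with frequencies and positions recorded) can be sketched as a toy in-memory structure. All names are hypothetical, the tokenization is plain whitespace splitting, and this does not reflect Lucene's actual file format.

```java
import java.util.*;

// Illustrative only: a toy positional index.
class PositionalIndexSketch {
    // field name -> term -> docID -> positions of the term in that field
    private final Map<String, Map<String, Map<Integer, List<Integer>>>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // A document is a set of named fields; each field value is split into terms.
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String[] terms = field.getValue().toLowerCase(Locale.ROOT).trim().split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                if (terms[pos].isEmpty()) continue;
                postings.computeIfAbsent(field.getKey(), k -> new TreeMap<>())
                        .computeIfAbsent(terms[pos], k -> new TreeMap<>())
                        .computeIfAbsent(docId, k -> new ArrayList<>())
                        .add(pos);
            }
        }
        return docId;
    }

    // Positions of `term` in `field` of document `docId` (empty if absent).
    List<Integer> positions(String field, String term, int docId) {
        Map<String, Map<Integer, List<Integer>>> byTerm = postings.get(field);
        if (byTerm == null) return Collections.emptyList();
        Map<Integer, List<Integer>> byDoc = byTerm.get(term);
        if (byDoc == null) return Collections.emptyList();
        List<Integer> found = byDoc.get(docId);
        return found == null ? Collections.<Integer>emptyList() : found;
    }

    // Term frequency = number of recorded positions.
    int termFrequency(String field, String term, int docId) {
        return positions(field, term, docId).size();
    }
}
```

Keeping positions alongside frequencies is what makes phrase matching possible: two terms form a phrase when their positions in the same field are adjacent.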
![Page 92: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/92.jpg)
Writing to Index
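import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// third argument true: create a new index (overwrites an existing one)
IndexWriter writer =
    new IndexWriter(directory, analyzer, true);
Document doc = new Document();
// add fields to document (next slide)
writer.addDocument(doc);
writer.close();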
![Page 93: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/93.jpg)
Adding Fields
doc.add(Field.Keyword("isbn", isbn));
doc.add(Field.Keyword("category", category));
doc.add(Field.Text("title", title));
doc.add(Field.Text("author", author));
doc.add(Field.UnIndexed("url", url));
doc.add(Field.UnStored("subjects", subjects, true));
doc.add(Field.Keyword("pubmonth", pubmonth));
doc.add(Field.UnStored("contents", author + " " + subjects));
doc.add(Field.Keyword("modified", DateField.timeToString(file.lastModified())));
![Page 94: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/94.jpg)
Fields Description
• Attributes
Stored: original content retrievable
Indexed: inverted, searchable
Tokenized: analyzed, split into tokens
• Factory methods
Keyword: stored and indexed as a single term
Text: indexed and tokenized; stored only if the value is a String (not a Reader)
UnIndexed: stored only
UnStored: indexed and tokenized, not stored
• Terms are what matters for searching
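The attribute combinations behind the four factory methods can be written down as a small table in code. The enum below is hypothetical and only restates the slide; it is not part of Lucene. `TEXT` describes the String case.

```java
// Illustrative only: the attribute combinations from the slide as a hypothetical enum.
enum FieldKindSketch {
    KEYWORD(true, true, false),    // stored, indexed as a single term
    TEXT(true, true, true),        // stored (String value), indexed, tokenized
    UNINDEXED(true, false, false), // stored only
    UNSTORED(false, true, true);   // indexed and tokenized, not stored

    final boolean stored, indexed, tokenized;

    FieldKindSketch(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    // Only indexed fields contribute terms, so only they can be searched.
    boolean searchable() { return indexed; }

    // Only stored fields can return their original value with a hit.
    boolean retrievable() { return stored; }
}
```

Read this way, a URL added with UnIndexed can be shown in results but not searched, while content added with UnStored can be searched but not shown.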
![Page 95: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/95.jpg)
Searching an Index
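import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

IndexSearcher searcher =
    new IndexSearcher(directory);
// parse the query text; "contents" is the default field to search
Query query = QueryParser.parse(queryExpression,
    "contents", analyzer);
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}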
![Page 96: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/96.jpg)
Analyzer
• Analysis occurs
For each tokenized field during indexing
For each term or phrase in QueryParser
• Several analyzers built-in
Many more in the sandbox
Straightforward to create your own
• Choosing the right analyzer is important!
![Page 97: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/97.jpg)
WhiteSpace Analyzer
The quick brown fox jumps over the lazy dog.
[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
![Page 98: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/98.jpg)
Simple Analyzer
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
![Page 99: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/99.jpg)
Stop Analyzer
The quick brown fox jumps over the lazy dog.
[quick] [brown] [fox] [jumps] [over] [lazy] [dog]
![Page 100: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/100.jpg)
Snowball Analyzer
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
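The behaviour of the first three analyzers on the example sentence can be approximated in a few lines of plain Java. These are simplified stand-ins, not the Lucene classes: the method names are hypothetical, the stop-word list is an assumed, abbreviated one, and stemming (the Snowball case) is left out.

```java
import java.util.*;

// Illustrative only: simplified stand-ins for three analyzers.
class AnalyzerSketch {
    // Assumed, abbreviated stop-word list for the example.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Whitespace-style: split on whitespace only; case and punctuation are kept.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Simple-style: split at non-letters and lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Stop-style: simple analysis followed by stop-word removal.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : simple(text)) {
            if (!STOP_WORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }
}
```

The three methods differ only in how much they normalize, which is why a query analyzed one way may fail to match terms that were indexed another way.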
![Page 101: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/101.jpg)
Query Creation
• Searching by a term – TermQuery
• Searching within a range – RangeQuery
• Searching by prefix – PrefixQuery
• Combining queries – BooleanQuery
• Searching by phrase – PhraseQuery
• Searching by wildcard – WildcardQuery
• Searching for similar terms – FuzzyQuery
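What some of these query types select can be illustrated on a sorted term dictionary. The sketch below is plain Java with hypothetical method names and does not use the Lucene query classes; it covers only the prefix, range and wildcard cases and assumes the term set uses natural string ordering.

```java
import java.util.*;
import java.util.regex.Pattern;

// Illustrative only: how several query types pick terms from a sorted dictionary.
class TermMatchSketch {
    // Prefix-style: all terms starting with `prefix`.
    static SortedSet<String> prefix(SortedSet<String> terms, String prefix) {
        SortedSet<String> out = new TreeSet<>();
        for (String t : terms.tailSet(prefix)) {
            if (!t.startsWith(prefix)) break;   // sorted order: matches are contiguous
            out.add(t);
        }
        return out;
    }

    // Range-style: all terms between lower and upper, both inclusive.
    static SortedSet<String> range(NavigableSet<String> terms, String lower, String upper) {
        return new TreeSet<>(terms.subSet(lower, true, upper, true));
    }

    // Wildcard-style: '*' matches any run of characters, '?' exactly one.
    static SortedSet<String> wildcard(SortedSet<String> terms, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append('.');
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        Pattern p = Pattern.compile(regex.toString());
        SortedSet<String> out = new TreeSet<>();
        for (String t : terms) {
            if (p.matcher(t).matches()) out.add(t);
        }
        return out;
    }
}
```

The prefix and range cases can exploit the sorted order, whereas this wildcard sketch has to scan every term.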
![Page 102: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/102.jpg)
Lucene Queries
![Page 103: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/103.jpg)
Conclusions
• Nutch as a starting point
• Crawling in Nutch
• Detailed map-reduce architecture
• Different query formats in Lucene
• Built-in analyzers in Lucene
• The same analyzer needs to be used for both indexing and searching
![Page 104: Information Retrieval - Indian Institute of Technology Bombay · Web Crawler A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose](https://reader034.vdocument.in/reader034/viewer/2022042306/5ed2a33a7d90860af766aae0/html5/thumbnails/104.jpg)
Thanks
• Questions?