PANIMALAR INSTITUTE OF TECHNOLOGY IR SEMESTER VII
CS6007 – Information Retrieval
UNIT I
Introduction-History of IR-Components of IR-Issues-Open source search
engine frameworks-the impact of the web on IR-The role of Artificial
intelligence (AI) on IR-IR Versus Web search- Components of a Search
Engine- Characterizing the web.
Introduction:
Definition:
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
Information Retrieval - Calvin Mooers's definition, 1951:
“Information retrieval is a field concerned with the structure, analysis,
organization, storage, searching, and retrieval of information.” It is the
activity of obtaining information resources relevant to an information need
from a collection of information resources.
Example:
To determine which plays of Shakespeare contain the words Brutus AND
Caesar and NOT Calpurnia, one way is to start at the beginning and read
through all the text, noting for each play whether it contains Brutus and
Caesar and excluding it from consideration if it contains Calpurnia.
The simplest form of document retrieval is for a computer to do this sort of
linear scan through documents. This process is commonly referred to
as grepping through text, after the Unix command grep, which performs this
process.
The way to avoid linearly scanning the texts for each query is to index the
documents in advance. Shakespeare's Collected Works is used here to introduce
the basics of the Boolean retrieval model. Suppose we record for each document
(here a play of Shakespeare's) whether it contains each word out of all the
words Shakespeare used (about 32,000 distinct words). The
result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the
indexed units. Now, depending on whether we look at the matrix rows or
columns, we can have a vector for each term, showing the documents it
appears in, or a vector for each document, showing the terms that occur in it.
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the
vectors for Brutus, Caesar, and Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100
The answers for this query are thus Antony and Cleopatra and Hamlet.
The Boolean retrieval model is a model for information retrieval in which we can
pose any query which is in the form of a Boolean expression of terms, that is, in
which terms are combined with the operators AND, OR, and NOT. The model
views each document as just a set of words.
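The bitwise computation above can be sketched in Python. The incidence vectors for Brutus and Caesar are taken from the worked example; the Calpurnia vector (010000) is inferred from its complement (101111), and the play list is illustrative.

```python
# Six Shakespeare plays; their order gives the bit positions of the vectors.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Each term's incidence vector: bit i is 1 if the term occurs in play i.
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

MASK = 0b111111  # six documents -> six bits

# Brutus AND Caesar AND NOT Calpurnia, as a bitwise AND of the vectors.
result = (incidence["Brutus"] & incidence["Caesar"]
          & (~incidence["Calpurnia"] & MASK))

# Decode the answer bits back into play titles (leftmost bit = first play).
answers = [plays[i] for i in range(6) if result >> (5 - i) & 1]
print(answers)  # ['Antony and Cleopatra', 'Hamlet']
```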
Figure: Results from Shakespeare for the query Brutus AND Caesar AND NOT
Calpurnia.
1. Concepts:
The major concept in information retrieval is the inverted index. The inverted index,
or sometimes inverted file, has become the standard term in information
retrieval. The basic idea of an inverted index is shown in the figure.
We use dictionary for the data structure and vocabulary for the set of terms. Then for
each term, we have a list that records which documents the term occurs in. Each
item in the list, which records that a term appeared in a document, is
conventionally called a posting. The list is then called a postings list (or inverted
list), and all the postings lists taken together are referred to as the postings. The
dictionary in the figure has been sorted alphabetically and each postings list is
sorted by document ID.
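A minimal sketch of inverted index construction in Python, together with the classic two-pointer merge used to AND two postings lists. The three sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings list: the sorted IDs of the
    documents in which the term occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort each postings list by document ID, as in the figure.
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted postings lists (the classic AND intersection)."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = build_inverted_index(docs)
print(index["home"])                              # [1, 2, 3]
print(intersect(index["sales"], index["july"]))   # [2, 3]
```

Because both postings lists are sorted by document ID, the merge runs in time linear in their combined length, which is why postings are kept sorted.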
1.1 The term vocabulary and postings list
The steps in inverted index construction include: (1) collect the documents to
be indexed, (2) tokenize the text, turning each document into a list of tokens,
(3) apply linguistic preprocessing to produce normalized tokens, which are the
index terms, and (4) index the documents that each term occurs in, creating an
inverted index consisting of a dictionary and postings.
1.2 Stop words
Figure : A stop list of 25 semantically non-selective words which are common in
Reuters-RCV1.
Some extremely common words which would appear to be of little value in
helping select documents matching a user need are excluded from the
vocabulary entirely. These words are called stop words. The general strategy
for determining a stop list is to sort the terms by collection frequency (the total
number of times each term appears in the document collection) and then to take
the most frequent terms, often hand-filtered for their semantic content relative
to the domain of the documents being indexed, as a stop list.
Token normalization is the process of canonicalizing tokens so that matches
occur despite superficial differences in the character sequences of the tokens.
The most standard way to normalize is to implicitly create equivalence classes,
which are normally named after one member of the set.
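A small sketch of stop-word removal plus a simple normalization (lowercasing as an implicit equivalence class). The stop list here is illustrative, in the spirit of the Reuters-RCV1 list above, not the exact list from the figure.

```python
# Illustrative stop list of common, semantically non-selective words.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
              "from", "in", "is", "it", "of", "on", "the", "to", "with"}

def tokenize(text):
    """Lowercase each token (a simple equivalence-classing normalization)
    and drop stop words before indexing."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("The House of the Rising Sun"))  # ['house', 'rising', 'sun']
```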
1.3 Stemming and lemmatization
Stemming usually refers to a crude heuristic process that chops off the ends of
words in the hope of achieving this goal correctly most of the time, and often
includes the removal of derivational affixes.
Example:
am, are, is → be
car, cars, car's, cars' → car
saw → s
“surfing”, “surfed” → “surf”
Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma.
Example:
saw → see or saw
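The "chop off the ends of words" heuristic can be illustrated with a deliberately naive suffix stripper. This is not the Porter stemmer, which real systems use; it is only a sketch of the crude rule-based idea, and the suffix list and minimum-stem length are arbitrary choices.

```python
def naive_stem(token):
    """Strip a few common English suffixes, keeping at least a
    three-letter stem. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print(naive_stem("surfing"))  # surf
print(naive_stem("surfed"))   # surf
print(naive_stem("cars"))     # car
```

Note how a heuristic like this has no way to map "saw" to "see"; that mapping requires the vocabulary and morphological analysis that lemmatization performs.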
1.6 Edit distance :
Given two character strings s1 and s2, the edit distance between them is
the minimum number of edit operations required to transform s1 into s2.
Most commonly, the edit operations allowed for this purpose are (i) insert
a character into a string, (ii) delete a character from a string, and (iii)
replace a character of a string by another character; with these operations,
edit distance is sometimes known as Levenshtein distance.
For example, the edit distance between cat and dog is three.
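The standard way to compute edit distance is dynamic programming over a table d, where d[i][j] is the minimum number of edits to turn s1[:i] into s2[:j]. A sketch:

```python
def edit_distance(s1, s2):
    """Levenshtein distance via the standard dynamic-programming table."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # i deletions to reach the empty string
    for j in range(n + 1):
        d[0][j] = j          # j insertions from the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete from s1
                          d[i][j - 1] + 1,         # insert into s1
                          d[i - 1][j - 1] + cost)  # replace (or match)
    return d[m][n]

print(edit_distance("cat", "dog"))  # 3
```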
Two main search paradigms:
Retrieval and Browse
Retrieval
o Search for particular information
o Usually focused and purposeful
Browsing
o General looking around for information
o For example: Asia → Thailand → Phuket → Tsunami
IR vs. DBMS
IR                                   DBMS
Imprecise semantics                  Precise semantics
Keyword search                       SQL
Unstructured data format             Structured data
Read-mostly; add docs occasionally   Expect reasonable number of updates
Page through top k results           Generate full answer
Information Retrieval vs. Information Extraction
Information Retrieval:
Given a set of query terms and a set of document terms, select only the most
relevant documents (precision), and preferably all the relevant ones (recall).
Information Extraction:
Extract from the text what the document means.
2. History of IR:
The idea of using computers to search for relevant pieces of information
was popularized in the article “As We May Think” by Vannevar Bush in
1945.
It would appear that Bush was inspired by patents for a 'statistical
machine', filed by Emanuel Goldberg in the 1920s and '30s, that
searched for documents stored on film.
The first description of a computer searching for information was
given by Holmstrom in 1948, detailing an early mention of
the Univac computer.
Automated information retrieval systems were introduced in the 1950s;
one even featured in the 1957 romantic comedy Desk Set.
• 1960-70’s:
– Initial exploration of text retrieval systems for “small” corpora of
scientific abstracts, and law and business documents.
– Development of the basic Boolean and vector-space models of
retrieval.
– Prof. Salton and his students at Cornell University were the leading
researchers in the area.
• 1980’s:
– Large document database systems, many run by companies:
• Lexis-Nexis
• Dialog
• MEDLINE
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering
• 2000’s
– Link analysis for Web Search
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization
– Learning to Rank
2.1 Historical Milestones in IR Research
2.2 Past ,Present and Future:
2.2.1 Early Developments:
An old and popular data structure for faster information retrieval is a collection
of selected words or concepts with which are associated pointers to the related
information, called an index. In one form or another, indexes are at the core of
every modern information retrieval system. They provide faster access to the
data and allow the query processing task to be speeded up.
For centuries, indexes were created manually as categorization hierarchies. In
fact, most libraries still use some form of categorical hierarchy to classify their
volumes. Such hierarchies have usually been conceived by human subjects from
the library sciences field. More recently, the advent of modern computers has
made possible the construction of large indexes automatically. Automatic
indexes provide a view of the retrieval problem which is much more related to
the system itself than to the user need. It is important to distinguish between
two different views of the IR problem: a computer-centered one and a human-
centered one.
In the computer-centered view, the IR problem consists mainly of building
efficient indexes, processing user queries with high performance, and developing
ranking algorithms which improve the quality of the answer set.
In the human-centered view, the IR problem consists mainly of studying the
behavior of the user, understanding his main needs, and determining
how such understanding affects the organization and operation of the retrieval
system.
2.2.2 Information Retrieval in the library:
Libraries were among the first institutions to adopt IR systems for retrieving
information. Usually, systems to be used in libraries were initially developed by
academic institutions and later by commercial vendors.
In the first generation, such systems basically allowed searches
based on author name and title.
In the second generation, increased search functionality was added,
which allowed searching by subject headings, by keywords, and with
some complex query facilities.
In the third generation, which is currently being deployed, the focus
is on improved graphical interfaces, electronic forms, hypertext
features, and open system architectures.
2.2.3 Web and Digital Library:
Three dramatic and fundamental changes have occurred due to the
advances in modern computer technology and the boom of the web. They are:
1. Cheaper access to various sources of information.
2. Greater access to networks.
3. Publishing freedom.
3. Components of IR System:
Information retrieval locates relevant documents on the basis of user input
such as keywords or example documents, for example: find documents
containing the words “database systems”. The figure shows the information
retrieval system block diagram. It consists of three components: Query or
Documents, IR System, and Ranked Results.
1) Query/Collections: store only a representation of the document or query,
which means that the text of a document is lost once it has been processed
for the purpose of generating its representation.
2) IR System: involves performing the actual retrieval function, executing the
search strategy in response to a query.
3) Ranked Results: a set of documents which improves the subsequent run
after information retrieval.
Figure: Block diagram of IR
Architecture of IR System:
Logical View of the Documents:
Due to historical reasons, documents in a collection are frequently represented
through a set of index terms or keywords. Such keywords might be extracted
directly from the text of the document or might be specified by a human subject
(as frequently done in the information sciences arena). No matter whether these
representative keywords are derived automatically or generated by a specialist,
they provide a logical view of the document
With very large collections, however, even modern computers might have to
reduce the set of representative keywords. This can be accomplished through
the elimination of stopwords (such as articles and connectives), the use of
stemming (which reduces distinct words to their common grammatical
root), and the identification of noun groups (which eliminates adjectives,
adverbs, and verbs). Further, compression might be employed. These
operations are called text operations (transformations). Text operations reduce
the complexity of the document representation and allow moving the logical
view from that of a full text to that of a set of index terms.
As illustrated in the figure, we view the issue of logically representing a document
as a continuum in which the logical view of a document might shift (smoothly)
from a full text representation to a higher level representation specified by a
human subject.
The Retrieval Process:
To describe the retrieval process, we use a simple and generic software
architecture as shown in the figure. First of all, before the retrieval process can
even be initiated, it is necessary to define the text database. This is usually done
by the manager of the database, who specifies the following: (a) the documents
to be used, (b) the operations to be performed on the text, and (c) the text model
(i.e., the text structure and what elements can be retrieved). The text operations
transform the original documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager
(using the DB Manager Module) builds an index of the text. An index is a
critical data structure because it allows fast searching over large volumes of
data. Different index structures might be used, but the most popular one is the
inverted index, as indicated in the figure. The resources (time and storage space)
spent on defining the text database and building the index are amortized by
querying the retrieval system many times.
Given that the document database is indexed, the retrieval process can be
initiated. The user first specifies a user need, which is then parsed and
transformed by the same text operations applied to the text. Then, query
operations might be applied before the actual query, which provides a system
representation for the user need, is generated. The query is then processed to
obtain the retrieved documents. Fast query processing is made possible by the
index structure previously built.
Before being sent to the user, the retrieved documents are ranked according to a
likelihood of relevance. The user then examines the set of ranked documents in
search of useful information. At this point, he might pinpoint a subset of
the documents seen as definitely of interest and initiate a user feedback cycle.
In such a cycle, the system uses the documents selected by the user to change
the query formulation. Hopefully, this modified query is a better representation
of the real user need.
Text Operations form index words (tokens):
o Stop-word removal
o Stemming
Indexing constructs an inverted index of word-to-document pointers.
Searching retrieves documents that contain a given query token from the
inverted index.
Ranking scores all retrieved documents according to a relevance metric.
User Interface manages interaction with the user:
o Query input and document output
o Relevance feedback
o Visualization of results
Query Operations transform the query to improve retrieval:
o Query expansion using a thesaurus
o Query transformation using relevance feedback
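The pipeline just described (text operations, indexing, searching, ranking) can be sketched end-to-end in Python. The stop list and documents are illustrative, and raw term frequency stands in for the relevance metric; real engines use weightings such as TF-IDF or BM25.

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "of", "in", "to", "and"}

def text_ops(text):
    """Text operations: lowercase and remove stop words (stemming omitted)."""
    return [t for t in text.lower().split() if t not in STOP]

def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text_ops(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, query):
    """Searching + ranking: score each matching document by summed
    term frequency and return doc IDs in descending score order."""
    scores = Counter()
    for term in text_ops(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return [doc_id for doc_id, _ in scores.most_common()]

docs = {1: "the quick brown fox",
        2: "the lazy dog",
        3: "the quick dog in the fog"}
index = build_index(docs)
print(search(index, "quick dog"))  # doc 3 ranks first: it matches both terms
```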
4. Issues in IR
The main objective of an IR system is to retrieve all the items that are relevant
to a user query, while retrieving as few non relevant items as possible.
4.1 Main problems in IR:
o Document and query indexing
o How to best represent their contents?
o Query evaluation (or retrieval process)
o To what extent does a document correspond to a query?
o System evaluation
o How good is a system?
o Are the retrieved documents relevant? (precision)
o Are all the relevant documents retrieved? (recall)
Information retrieval is concerned with representing, searching, and
manipulating large collections of electronic text and other human-
language data.
Three Big Issues in IR
1. Relevance
It is the fundamental concept in IR.
A relevant document contains the information that a person was
looking for when she submitted a query to the search engine.
There are many factors that go into a person's decision as to whether a
document is relevant.
These factors must be taken into account when designing algorithms
for comparing text and ranking documents.
Simply comparing the text of a query with the text of a document and
looking for an exact match, as might be done in a database system,
produces very poor results in terms of relevance.
To address the issue of relevance, retrieval models are used.
A retrieval model is a formal representation of the process of matching
a query and a document. It is the basis of the ranking algorithm that is
used in a search engine to produce the ranked list of documents.
A good retrieval model will find documents that are likely to be
considered relevant by the person who submitted the query.
The retrieval models used in IR typically model the statistical
properties of text rather than the linguistic structure. For example, the
ranking algorithms are concerned more with the counts of word
occurrences than with whether the word is a noun or an adjective.
2. Evaluation
Two of the evaluation measures are precision and recall.
Precision is the proportion of retrieved documents that are relevant.
Recall is the proportion of relevant documents that are retrieved.
Precision = |Relevant documents ∩ Retrieved documents| / |Retrieved documents|
Recall = |Relevant documents ∩ Retrieved documents| / |Relevant documents|
When the recall measure is used, there is an assumption that all the
relevant documents for a given query are known. Such an assumption
is clearly problematic in a web search environment, but with smaller
test collections of documents, this measure can be useful. It is not
suitable for large volumes of log data.
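The two measures can be computed directly from the retrieved and relevant document sets. The document IDs below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant           # relevant AND retrieved
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

# Hypothetical query: 4 documents retrieved, 2 of them among the 5 relevant.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
print(p, r)  # 0.5 0.4
```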
3. Emphasis on users and their information needs
The users of a search engine are the ultimate judges of quality. This has
led to numerous studies on how people interact with search engines
and in particular, to the development of techniques to help people
express their information needs.
Text queries are often poor descriptions of what the user actually wants,
compared to a request to a database system, such as for the balance of
a bank account.
Despite their lack of specificity, one-word queries are very common in
web search. A one-word query such as “cats” could be a request for
information on where to buy cats or for a description of the Cats
(musical).
Techniques such as query suggestion, query expansion and relevance
feedback use interaction and context to refine the initial query in order
to produce better ranked results.
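One of these techniques, query expansion via pseudo-relevance feedback, can be sketched simply: assume the top-ranked results are relevant and add their most frequent terms to the query. Raw term counts here stand in for proper term weighting (real systems use Rocchio-style or relevance-model weighting), and the sample results are hypothetical.

```python
from collections import Counter

def expand_query(query_terms, top_docs, n_terms=2):
    """Pseudo-relevance feedback sketch: add the n_terms most frequent
    terms from the top-ranked documents to the original query."""
    counts = Counter()
    for text in top_docs:
        counts.update(text.lower().split())
    extra = [t for t, _ in counts.most_common()
             if t not in query_terms][:n_terms]
    return query_terms + extra

top = ["jaguar speed engine", "jaguar engine specs"]
print(expand_query(["jaguar"], top))  # adds 'engine' and another frequent term
```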
• The figure summarizes the major issues involved in search engine
design
5. Open source Search Engine Frameworks:
Open source
Open source software is software whose source code is available for
modification or enhancement by anyone. "Source code" is the part of software
that most computer users don't ever see; it's the code computer programmers
can manipulate to change how a piece of software—a "program" or
"application"—works. Programmers who have access to a computer
program's source code can improve that program by adding features to it or
fixing parts that don't always work correctly.
Advantages of open source
The right to use the software in any way.
There is usually no license cost; the software is free of charge.
The source code is open and can be modified freely.
Open standards.
It provides higher flexibility.
Disadvantages of open source
There is no guarantee that development will happen.
It is sometimes difficult to know that a project exists, and what its current
status is.
No secured follow-up development strategy.
Closed software
Closed software is a term for software whose license does not allow for
the release or distribution of the software's source code. Generally it means
only the binaries of a computer program are distributed, and the
license provides no access to the program's source code. The source code of
such programs is usually regarded as a trade secret of the company. Access
to source code by third parties commonly requires the party to sign a non-
disclosure agreement.
Search Engine
A search engine is a document retrieval system designed to help find
information stored in a computer system, such as on the WWW. The search
engine allows one to ask for content meeting specific criteria and retrieves a
list of items that match those criteria. The following are some famous search
engines.
5.1 Lucene
Lucene is an indexing and search system implemented in Java, with ports to
other programming languages. The project was started by Doug Cutting in
1997. It was initially available for download from its home at
the SourceForge web site. It joined the Apache Software Foundation's
Jakarta family of open-source Java products in September 2001 and became its
own top-level Apache project in February 2005.
Since then, it has grown from a single-developer effort to a global project
involving hundreds of developers in various countries. It is currently hosted by
the Apache Foundation. Lucene is by far the most successful open source
search engine. Its largest installation is quite likely Wikipedia: all queries
entered into Wikipedia's search form are handled by Lucene. A list of other
projects relying on its indexing and search capabilities can be found on
Lucene's “PoweredBy” page. Known for its modularity and extensibility, Lucene
allows developers to define their own indexing and retrieval rules and formulae.
Under the hood, Lucene's retrieval framework is based on the concept of fields:
every document is a collection of fields, such as its title, body, URL, and so
forth. This makes it easy to specify structured search requests and to give
different weights to different parts of a document. The latest version of Lucene
is 6.1.0, which was released on June 17, 2016.
5.2.Indri
Indri is an academic information retrieval system written in C++. It is
developed by researchers at the University of Massachusetts and is part of the
Lemur project, a joint effort of the University of Massachusetts and Carnegie
Mellon University.
Indri is well known for its high retrieval effectiveness and is frequently found
among the top-scoring search engines at TREC. Its retrieval model is a
combination of language modeling approaches. Like Lucene, Indri can
handle multiple fields per document, such as title, body, and anchor text, which
is important in the context of Web search.
It supports automatic query expansion by means of pseudo-relevance feedback,
a technique that adds related terms to an initial search query, based on the
contents of an initial set of search results. It also supports query-independent
document scoring that may, for instance, be used to prefer more recent
documents over less recent ones when ranking the search results.
5.3.Wumpus
Wumpus is an academic search engine written in C++ and developed at the
University of Waterloo. Unlike most other search engines, Wumpus has no
built-in notion of “documents” and does not know about the beginning and the
end of each document when it builds the index.
Instead, every part of the text collection may represent a potential unit for
retrieval, depending on the structural search constraints specified in the query.
This makes the system particularly attractive for search tasks in which the ideal
search result may not always be a whole document, but may be a section, a
paragraph, or a sequence of paragraphs within a document. Wumpus supports
a variety of retrieval methods, including proximity ranking, the BM25 ranking
function, and the language modeling and divergence-from-randomness
approaches. In addition, it is able to carry
out real-time index updates (i.e., adding/removing files to/from the index) and
provides support for multi-user security restrictions that are useful if the
system has more than one user, and each user is allowed to search only parts of
the index.
6. The Impact of the web on IR
The Web is very large, public, unstructured but ubiquitous repository that need
efficient tools to manage, retrieve, and filter information. The search engines
have become a central tool in the Web.
Two characteristics make retrieval of relevant information from the Web is a
really hard task
the large and distributed volume of data available
the fast pace of change
Main challenges posted by Web are
data-centric: related to the data itself
interaction-centric: related to the users and their interactions
Data-centric challenges are varied and include
distributed data
high percentage of volatile data
large volume of data
unstructured and redundant data
quality of data
heterogeneous data
Interaction-centric challenges, related to the users and their interactions, include:
Expressing a query
Interpreting results
Impact of the web
o The first impact of the web on search is related to the characteristics
of the document collection itself.
o The web is composed of pages distributed over millions of
sites and connected through hyperlinks.
o This requires collecting all documents and storing copies of
them in a central repository, prior to indexing.
o This new phase in the IR process, introduced by the web, is
called crawling.
o The second impact of the web on search is related to:
o The size of the collection
o The volume of user queries submitted on a daily basis
o As a consequence, performance and scalability have become critical
characteristics of the IR system.
o The third impact: in a very large collection, predicting relevance is
much harder than before.
o Fortunately, the web also includes new sources of evidence
o Ex. hyperlinks and user clicks on documents in the answer set
o The fourth impact derives from the fact that the web is also a
medium to do business.
o The search problem has been extended beyond the seeking of text
information to also encompass other user needs
o Ex. the price of a book, the phone number of a hotel
o The fifth impact of the web on search is web spam.
o Web spam: abusive availability of commercial information
disguised in the form of informational content.
o This difficulty is so large that today we talk of adversarial web
retrieval.
Practical issues in the Web
o Security
o Commercial transactions over the internet are not yet a completely
safe procedure
o Privacy
o Frequently people are willing to exchange information as long as it
does not become public
o Copyright and patent rights
o It is far from clear how the widespread availability of data on the web
affects copyright and patent laws in the various countries.
o Scanning, Optical Character Recognition (OCR), and cross-language
retrieval
7. Role of Artificial intelligence in IR
Artificial Intelligence:
The study of how to construct intelligent machines & systems that can
simulate or extend the development of human intelligence. Both IR and AI
fields developed in parallel during the early days of computers. The fields of
artificial intelligence and information retrieval share a common interest in
developing more capable computer systems.
What is Intelligence?
According to Cook et al. [1988]:
1. Acquisition: the ability to acquire new knowledge.
2. Automatization: the ability to refine procedures for dealing with a novel
situation into an efficient functional form.
3. Comprehension: the ability to know, understand, and deal with novel
problems.
4. Memory management: the ability to represent knowledge in memory, to map
knowledge on to that memory representation, and to access the knowledge in
memory.
5. Metacontrol: the ability to control various processes in intelligent behavior.
6. Numeric ability: the ability to perform arithmetic operations.
7. Reasoning: the ability to use problem-solving knowledge.
8. Social competence: the ability to interact with and understand other people,
machines or programs.
9. Verbal perception: the ability to recognize natural language.
10. Visual perception: the ability to recognize visual images.
What are Intelligent IR Systems?
The concept of 'intelligent' information retrieval was first suggested in the
late 1970s, but was not pursued by the IR community until the early 1990s.
An intelligent IR system can simulate the human thinking process on
information processing and intelligence activities to achieve information and
knowledge storage, retrieval and reasoning, and to provide intelligence support.
How to introduce AI into IR systems?
A conventional IR program takes a query as input and returns documents as
output, without affording the opportunity for judgment, modification, and
especially interaction with text.
The question is, “where” should AI be introduced into the IR system?
Levels of user and system involvement, according to Bates '90:
Level 0 – No system involvement (the user comes up with a tactic, formulating
a query, coming up with a strategy, and thinking about the outcome).
Level 1 – User can ask for information about searching (the system suggests
tactics that can be used to formulate queries, e.g. help).
Level 2 – User simply enters a query, suggests what needs to be done, and the
system executes the query to return results.
Level 3 – First signs of AI. The system actually starts suggesting improvements
to the user.
Level 4 – Full automation. User queries are entered and the rest is done by the
system.
Some AI methods currently used in Intelligent IR Systems
Web Crawlers (for information extraction)
Mediator Techniques (for information integration)
Ontologies (for intelligent information access by making semantics of
information explicit and machine readable)
Neural Networks (for document clustering & preprocessing)
Kohonen Neural Networks - Self Organizing maps
Hopfield Networks
Semantic Networks
Neural Networks in IR
Based on neural networks, document clustering can be viewed as classification
in the document × document space. Thesaurus construction can be viewed as
laying out a coordinate system in the index × index space. Indexing itself can
be viewed as mappings in the document × index space. Searching can be
conceptualized as connections and activations in the index × document space.
Applying neural networks to information retrieval will likely produce information systems that will be able to:
recall memories despite failed individual memory units
modify stored information in response to new inputs from the user
retrieve "nearest neighbor" data when no exact data match exists
associatively recall information despite noise or missing pieces in the input
categorize information by its associative patterns
AI offers us a powerful set of tools, especially when they are combined with
conventional and other innovative computing tools. However, it is not an easy
task to master those tools and employ them skillfully to build truly significant
intelligent systems. By recognizing the limitations of modern artificial
intelligence techniques, we can establish realistic goals for intelligent
information retrieval systems and devise appropriate system development
strategies. AI models like the neural network will probably not replace
traditional IR approaches anytime soon. However, the application of neural
network models can make an IR system more powerful.
8. IR on the Web vs. Traditional IR
Traditional IR systems normally index a closed collection of documents,
which are mainly text-based and usually offer little linkage between
documents. Traditional IR systems are often referred to as full-text retrieval
systems. Libraries were among the first to adopt IR to index their catalogs and
later, to search through information which was typically imprinted onto CD-
ROMs. The main aim of traditional IR was to return relevant documents that
satisfy the user’s information need. Although the main goal of satisfying the
user’s need is still the central issue in web IR (or web search), there are some
very specific challenges that web search poses that have required new and
innovative solutions.
The first important difference is the scale of web search, as we have
seen that the current size of the web is approximately 600 billion pages.
This is well beyond the size of traditional document collections.
The Web is dynamic in a way that was unimaginable to traditional IR, in terms of its rate of change and the different types of web pages, ranging from static types (HTML, Portable Document Format (PDF), DOC, PostScript, XLS) to a growing number of dynamic pages written in scripting languages such as JSP, PHP or Flash. A large number of images, videos, and a growing number of programs are also delivered through the Web to our browsers.
The Web also contains an enormous amount of duplication, estimated
at about 30%. Such redundancy is not present in traditional corpora
and makes the search engine’s task even more difficult.
The quality of web pages varies dramatically; for example, some web sites create pages with the sole intention of manipulating the search engine's ranking, documents may contain misleading information, the information on some pages is simply out of date, and the overall quality of a web page may be poor in terms of its use of language and the amount of useful information it contains. The issue of quality is of prime importance to web search engines, as they would very quickly lose their audience if they presented poor-quality pages to users in the top-ranked positions.
The range of topics covered on the Web is completely open, as opposed to the closed collections indexed by traditional IR systems, where the topics, as in library catalogues, are much better defined and constrained.
Another aspect of the Web is that it is globally distributed. This poses
serious logistic problems to search engines in building their indexes,
and moreover, in delivering a service that is being used from all over
the globe. The sheer size of the problem is daunting, considering that users will not tolerate anything but an immediate response to their query. Users also vary in their level of expertise, interests, information-seeking tasks, the language(s) they understand, and in many other ways.
Users also tend to submit short queries (between two to three
keywords), avoid the use of anything but the basic search engine
syntax, and when the results list is returned, most users do not look at
more than the top 10 results, and are unlikely to modify their query.
This is all contrary to typical usage of traditional IR.
The hypertextual nature of the Web is also different from traditional
document collections, in giving users the ability to surf by following
links.
On the positive side (for the Web), there are many roads (or paths of
links) that “lead to Rome” and you need only find one of them, but
often, users lose their way in the myriad of choices they have to make.
Another positive aspect of the Web is that it has provided and is providing
impetus for the development of many new tools, whose aim is to improve the
user’s experience.
                     Classical IR            Web IR
Volume               Large                   Huge
Data quality         Clean, no duplicates    Noisy, duplicates available
Data change rate     Infrequent              In flux
Data accessibility   Accessible              Partially accessible
Format diversity     Homogeneous             Widely diverse
Documents            Text                    HTML
No. of matches       Small                   Large
IR techniques        Content based           Link based
9. Components of a search engine:
Search engines are among the most important applications or services on the web. Most successful search engines use a centralized architecture and global ranking algorithms to generate the ranking of the documents crawled into their databases.
A search engine is a program designed to help find information stored on a computer system such as the World Wide Web.
Major building blocks of a search engine are:
a) Indexing
   a. Text Acquisition
   b. Text Transformation
   c. Index Creation
b) Query Processing
   a. User Interaction
   b. Ranking
   c. Evaluation
a) Indexing Process
[Figure: Indexing process — input documents (email, web pages, letters) pass through Text Acquisition, Text Transformation and Index Creation, producing the document data store and the index]
1. Text Acquisition – identifies and stores documents for indexing
2. Text Transformation – transforms documents into index terms or features
3. Index Creation – takes index terms and creates data structures
1. Text Acquisition:
Crawler identifies and acquires documents for search engine
Web crawlers follow links to find documents
o Must efficiently find huge numbers of web pages and keep them up to date
o Single-site crawlers for site search
o Topical or focused crawlers for vertical search
o Document crawlers for enterprise and desktop search
   Follow links and scan directories
Feeds
o Real-time streams of documents, e.g. web feeds for news and blogs
o RSS is a common standard; an RSS reader can provide new XML documents to the search engine
Conversion
o Converts a variety of document formats (e.g. HTML, XML, Word) into a consistent text-plus-metadata format
o Converts text encodings for different languages using a Unicode standard such as UTF-8
Document Datastore
o Stores text, metadata and other related content for documents
   Metadata is information about a document, such as its type and creation date
   Other content includes links and anchor text
o Provides fast access to document contents for search engine components
o Could use a relational database system
o More typically, a simpler, more efficient storage system is used due to the huge number of documents
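The crawler component described above can be sketched, purely for illustration, as a breadth-first traversal. Here fetch_links is a hypothetical stand-in for real HTTP fetching and link extraction, and the toy link graph replaces the actual Web:

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit each URL once, queueing newly discovered links."""
    frontier = deque(seeds)      # URLs waiting to be fetched
    visited = set()              # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):       # outgoing links of the fetched page
            if link not in visited:
                frontier.append(link)
    return visited

# A toy in-memory link graph stands in for real HTTP fetching
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = crawl(["a"], lambda url: graph.get(url, []))
```

A real crawler would add politeness delays, robots.txt handling and freshness scheduling on top of this skeleton.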
2. Text Transformation
Parser
o Processes the sequence of text tokens in the document to recognize structural elements, e.g. titles and links
o Tokenizer recognizes words in the text
   Must consider issues like capitalization and hyphens
o Markup languages such as HTML and XML are often used to specify structure
o Tags are used to specify document elements
o The document parser uses the syntax of the markup language to identify structure
Stopping
o Removes common words, e.g. "and", "or", "the"
o Has some impact on efficiency and effectiveness
Stemming
o Groups words derived from a common stem, e.g. "computer", "computers", "computing", "compute"
o Usually effective, but not for all queries
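The tokenizing, stopping and stemming steps can be illustrated with a toy Python sketch. The stopword list and suffix rules below are illustrative only, a crude stand-in for a real stemmer such as Porter's:

```python
import re

STOPWORDS = {"and", "or", "the", "a", "of", "to", "in"}   # illustrative list

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(word):
    """Crude suffix-stripping stemmer (a toy stand-in for e.g. Porter stemming)."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def transform(text):
    """Tokenize, drop stopwords, then stem the remaining terms."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

terms = transform("The Computers and Computing")
```

Note how "Computers" and "Computing" collapse to the same stem, which is exactly what lets a query on one form match documents containing the other.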
Link Analysis
o Makes use of links and anchor text in web pages
o Link analysis identifies popularity and community information, e.g. PageRank
o Anchor text can significantly enhance the representation of the pages pointed to by links
o Has a significant impact on web search
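As an illustrative sketch of link analysis, a minimal PageRank iteration over a toy link graph might look like this (assuming the standard damping-factor formulation; the graph and iteration count are made up for the example):

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank for a small link graph.

    links maps each page to the list of pages it links to; every linked page
    is assumed to appear as a key. d is the usual damping factor."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}   # teleportation share
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:                               # dangling page: spread rank everywhere
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Toy graph: a links to b and c, b links to c, c links back to a
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Page c ends up with more rank than b because it receives links from both a and b, which is the "popularity" signal the notes refer to.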
Information Extraction
o Identifies classes of index terms that are important for some applications
o E.g. named entity recognizers identify classes such as people and locations
Classifier
o Identifies class-related metadata for documents, i.e. assigns labels to documents, e.g. topics or reading levels
3. Index Creation:
Document statistics
o Gathers counts and positions of words and other features
o Used in ranking algorithm
Weighting
o Computes weights for index terms, used in the ranking algorithm
o E.g. the tf.idf weight: a combination of the term's frequency in the document and its inverse document frequency in the collection
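A minimal illustration of the tf.idf combination, using a raw term count and the logarithmic idf (real systems use many weighting variants; the toy collection is made up):

```python
import math

def tf_idf(term, doc, collection):
    """tf.idf weight: raw term frequency times log inverse document frequency."""
    tf = doc.count(term)                              # term frequency in the document
    df = sum(1 for d in collection if term in d)      # document frequency in collection
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

# Tiny toy collection of already-tokenized "plays"
docs = [["brutus", "caesar"], ["caesar", "caesar"], ["calpurnia"]]
w = tf_idf("caesar", docs[1], docs)   # tf = 2, df = 2, idf = log(3/2)
```

Terms that occur in every document get idf = log(1) = 0, so they contribute nothing to the weight, which is the intended effect of the inverse document frequency.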
Inversion
o Core of the indexing process
o Converts document–term information to term–document information for indexing
   Difficult for very large numbers of documents
   The format of the inverted file is designed for fast query processing
o Must handle updates
o Compression is used for efficiency
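The inversion step, turning document→term lists into a term→document (posting) structure with positions, can be sketched as follows (an in-memory toy; real inverted files are compressed on-disk structures):

```python
def invert(documents):
    """Convert document -> term lists into a term -> {doc_id: positions} index."""
    index = {}
    for doc_id, terms in enumerate(documents):
        for pos, term in enumerate(terms):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

index = invert([["brutus", "caesar"], ["caesar", "calpurnia"]])
```

The resulting postings let query processing jump straight to the documents containing a term instead of scanning every document, which is the whole point of indexing in advance.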
Index Distribution
o Distributes indexes across multiple computers and/or multiple sites
o Essential for fast query processing with large numbers of documents
o P2P and distributed IR involve search across multiple sites
b) Query Process
[Figure: Query process — the user's query flows through User Interaction, Ranking and Evaluation, drawing on the index and the document data store]
1. User Interaction – supports creation and refinement of the query, and display of results
2. Ranking – uses the query and indexes to generate a ranked list of documents
3. Evaluation – monitors and measures effectiveness and efficiency
1. User Interaction
Query input
o Provides an interface and parser for a query language
o Most web queries are very simple; other applications may use forms
o A query language is used to describe more complex queries and the results of query transformation, e.g. Boolean queries or the Indri query language
   IR query languages also allow content and structure specifications, but focus on content
Query Transformation
o Improves the initial query, both before and after the initial search
o Includes the text transformation techniques used for documents
o Spell checking and query suggestion provide alternatives to the original query
o Query expansion and relevance feedback modify the original query with additional terms
Results Output
o Constructs the display of ranked documents for a query
o Generates snippets to show how queries match documents
o Highlights important words and passages
o Retrieves appropriate advertising in many applications
o May provide clustering and other visualization tools
2. Ranking
Scoring
o Calculates scores for documents using a ranking algorithm
o Core component of search engine
o The basic form of the score is Σᵢ qᵢ · dᵢ, where qᵢ and dᵢ are the query and document term weights for term i
o Many variations of ranking algorithms and retrieval models exist
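The inner-product score above can be illustrated directly; the term weights here are made-up values, not derived from any real collection:

```python
def score(query_weights, doc_weights):
    """Inner-product score: sum of q_i * d_i over terms shared by query and document."""
    return sum(q * doc_weights.get(term, 0.0)
               for term, q in query_weights.items())

# Made-up weights for illustration
s = score({"brutus": 1.0, "caesar": 0.5},
          {"caesar": 2.0, "calpurnia": 1.0})
```

Only "caesar" appears in both the query and the document, so the score is 0.5 × 2.0; terms missing from either side contribute zero.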
Performance Optimization
o Designing ranking algorithms for efficient processing
   Term-at-a-time vs. document-at-a-time processing
   Safe vs. unsafe optimizations
Distribution
o Processing queries in a distributed environment
o A query broker distributes queries and assembles results
o Caching is a form of distributed searching
3. Evaluation
Logging
o Logging user queries and interactions is crucial for improving search effectiveness and efficiency
o Query logs and clickthrough data are used for query suggestion, spell checking, query caching, ranking, advertising search, and other components
Ranking analysis
o Measuring and tuning ranking effectiveness
Performance Analysis
o Measuring and tuning system efficiency
10. Characterizing the Web
Measuring the Internet and the Web is difficult because of their highly dynamic nature:
o more than 778 million computers on the Internet (Internet Domain Survey, October 2010)
o the estimated number of Web servers currently exceeds 285 million (Netcraft Web Survey, February 2011)
Hence, there is about one Web server for every three computers directly connected to the Internet.
How many institutions (not servers) maintain Web data?
o number is smaller than the number of servers
o many places have multiple servers
o exact number is unknown
o should be larger than 40% of the number of Web servers
How many pages and how much traffic in the Web?
o studies on the size of search engines, done in 2005, estimated over 20 billion pages
o the same studies estimated that the size of the static Web is roughly doubling every eight months
The exact number of static Web pages was important before the wide use of dynamic pages.
Nowadays, the Web is infinite for practical purposes
o can generate an infinite number of dynamic pages
o Example: an on-line calendar
Most popular formats on Web
o HTML
o followed by GIF and JPG, ASCII text, and PDF
Structure of the Web Graph
The Web can be viewed as a graph, where
o the nodes represent individual pages
o the edges represent links between pages
Broder et al compared the topology of the Web graph to a bow-tie
Original bow-tie structure of the Web
In Baeza-Yates et al, the graph notation was extended
by dividing the CORE component into four parts:
Bridges: sites in CORE that can be reached directly from the IN component and that can reach the OUT component directly
Entry points: sites in CORE that can be reached directly from the IN component but are not in Bridges
Exit points: sites in CORE that reach the OUT component directly, but are not in Bridges
Normal: sites in CORE not belonging to the previously defined sub-components
[Figure: Bow-tie structure of the Web]
[Figure: Refined view of the bow-tie structure]
Modeling the Web
Heaps' and Zipf's laws are also valid on the Web.
» In particular, the vocabulary grows faster (larger b) and the word distribution is more biased (larger q)
Heaps' Law
» An empirical rule which describes vocabulary growth as a function of text size.
» It establishes that a text of n words has a vocabulary of size O(n^b), for some 0 < b < 1
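A small numeric illustration of Heaps' law; the constants K and b below are illustrative, not fitted to real data:

```python
def heaps_vocabulary(n, K=40.0, b=0.5):
    """Heaps' law: a text of n words has roughly V(n) = K * n**b distinct words.
    K and b are illustrative constants; for English b is typically 0.4-0.6."""
    return K * n ** b

v1 = heaps_vocabulary(1_000_000)   # vocabulary of a 1M-word text
v2 = heaps_vocabulary(4_000_000)   # 4x the text only doubles the vocabulary (b = 0.5)
```

The sublinear exponent is why quadrupling the text size here only doubles the vocabulary: new text keeps reusing words already seen.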
Zipf's Law
» An empirical rule that describes the frequency of the words of a text.
» It states that the i-th most frequent word appears as many times as the most frequent one divided by i^q, for some q > 1
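Similarly, a small illustration of Zipf's law; q = 1 is used here for simplicity, whereas on the Web q is typically larger, as noted above:

```python
def zipf_frequency(i, top_frequency, q=1.0):
    """Zipf's law: the i-th most frequent word occurs top_frequency / i**q times."""
    return top_frequency / i ** q

# With q = 1: the 2nd word occurs half as often as the 1st, the 3rd a third as often...
freqs = [zipf_frequency(i, 1000.0) for i in range(1, 6)]
```

Plotting such frequencies against rank on log-log axes gives the straight line characteristic of a power law.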
[Figure: Zipf's and Heaps' laws — distribution of sorted word frequencies F (left) and vocabulary size V as a function of text size (right)]
The CORE component follows a power-law distribution.
Power law: a function that is invariant to scale changes:
   f(x) = a / x^α, with α > 0
Depending on the value of α, the moments of the distribution will be finite or not:
   α ≤ 2: the average and all higher-order moments are infinite
2 < α ≤ 3: mean exists, but variance and higher-order moments are
infinite
Web measures that follow a power law include
o number of pages per Web site
o number of Web sites per domain
o incoming and outgoing link distributions
o number of connected components of the Web graph
Also the case for the host-graph
o the connectivity graph at the level of Web sites
Distribution of document sizes: self-similar model
o based on mixing two different distributions
o the main body of the distribution follows a log-normal distribution
[Figure: Example of file size distribution in a semi-log graph]
The right tail of the distribution is heavy-tailed
o majority of documents are small
o but there is a non-trivial number of large documents, so the area under the curve is relevant
A good fit is obtained with a Pareto distribution, which is similar to a power law.
Important Questions
1. Differentiate between Information Retrieval and Web Search. (8) Nov/Dec 2017 AN
2. Explain the issues in the process of Information Retrieval. (8) Nov/Dec 2017 U
3. Explain in detail the components of Information Retrieval and a search engine. (16) Nov/Dec 2017, Nov/Dec 2018, Apr/May 2018 U
4. Explain in detail the components of IR. Nov/Dec 2016 U
5. Write short notes on: Nov/Dec 2016 U
   i. Characterizing the web for search. (8) U
   ii. Role of AI in IR. (8) AN
6. Explain the historical development of Information Systems. Discuss the sophistication in technology in detail. U
7. Analyze the challenges in an IR system and give your suggestions to overcome them. AN
8. Write briefly about open source search engine frameworks. (6) Nov/Dec 2018 U
9. Explain the impact of the web on information retrieval systems. (7) Nov/Dec 2018 AN
10. How will you characterize the web? U
11. Compare Web IR with classical IR and describe the web's impact on the information retrieval process. AN