PANIMALAR INSTITUTE OF TECHNOLOGY IR SEMESTER VII
CS6007 – Information Retrieval
UNIT I
Introduction-History of IR-Components of IR-Issues-Open source search
engine frameworks-the impact of the web on IR-The role of Artificial
intelligence (AI) on IR-IR Versus Web search- Components of a Search
Engine- Characterizing the web.
Introduction:
Definition:
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
Information Retrieval - Calvin Mooers's definition, 1951:
“Information retrieval is a field concerned with the structure, analysis,
organization, storage, searching, and retrieval of information.” It is the
activity of obtaining information resources relevant to an information need
from a collection of information resources.
Example:
To determine which plays of Shakespeare contain the words Brutus AND
Caesar and NOT Calpurnia, one way is to start at the beginning and read
through all the text, noting for each play whether it contains Brutus and
Caesar and excluding it from consideration if it contains Calpurnia.
The simplest form of document retrieval is for a computer to do this sort of
linear scan through documents. This process is commonly referred to
as grepping through text, after the Unix command grep, which performs this
process.
The way to avoid linearly scanning the texts for each query is to index the
documents in advance. Shakespeare's Collected Works is used here to introduce
the basics of the Boolean retrieval model. Suppose we record for each document
(here a play of Shakespeare's) whether it contains each word out of all the
words Shakespeare used (about 32,000 distinct words). The
result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the
indexed units. Now, depending on whether we look at the matrix rows or
columns, we can have a vector for each term, showing the documents it
appears in, or a vector for each document, showing the terms that occur in it.
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the
vectors for Brutus, Caesar, and Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100
The answers for this query are thus Antony and Cleopatra and Hamlet.
The Boolean retrieval model is a model for information retrieval in which we can
pose any query which is in the form of a Boolean expression of terms, that is, in
which terms are combined with the operators AND, OR, and NOT. The model
views each document as just a set of words.
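The bitwise computation above can be sketched in Python. The incidence vectors for Brutus and Caesar are taken from the worked example; the Calpurnia vector (010000) is inferred from its complement (101111), and the play list is illustrative.

```python
# Six Shakespeare plays; their order gives the bit positions of the vectors.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Each term's incidence vector: bit i is 1 if the term occurs in play i.
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

MASK = 0b111111  # six documents -> six bits

# Brutus AND Caesar AND NOT Calpurnia, as a bitwise AND of the vectors.
result = (incidence["Brutus"] & incidence["Caesar"]
          & (~incidence["Calpurnia"] & MASK))

# Decode the answer bits back into play titles (leftmost bit = first play).
answers = [plays[i] for i in range(6) if result >> (5 - i) & 1]
print(answers)  # ['Antony and Cleopatra', 'Hamlet']
```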
Figure: Results from Shakespeare for the query Brutus AND Caesar AND NOT
Calpurnia.
1. Concepts:
The major concept in information retrieval is the inverted index. The inverted index,
or sometimes inverted file, has become the standard term in information
retrieval. The basic idea of an inverted index is shown in the figure.
We use dictionary for the data structure and vocabulary for the set of terms. Then for
each term, we have a list that records which documents the term occurs in. Each
item in the list, which records that a term appeared in a document, is
conventionally called a posting. The list is then called a postings list (or inverted
list), and all the postings lists taken together are referred to as the postings. The
dictionary in the figure has been sorted alphabetically and each postings list is
sorted by document ID.
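A minimal sketch of inverted index construction in Python, together with the classic two-pointer merge used to AND two postings lists. The three sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings list: the sorted IDs of the
    documents in which the term occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort each postings list by document ID, as in the figure.
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted postings lists (the classic AND intersection)."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = build_inverted_index(docs)
print(index["home"])                              # [1, 2, 3]
print(intersect(index["sales"], index["july"]))   # [2, 3]
```

Because both postings lists are sorted by document ID, the merge runs in time linear in their combined length, which is why postings are kept sorted.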
1.1 The term vocabulary and postings list
The steps in inverted index construction include: (1) collect the documents to
be indexed, (2) tokenize the text, turning each document into a list of tokens,
(3) apply linguistic preprocessing to produce normalized tokens, which are the
index terms, and (4) index the documents that each term occurs in, creating an
inverted index consisting of a dictionary and postings.
1.2 Stop words
Figure : A stop list of 25 semantically non-selective words which are common in
Reuters-RCV1.
Some extremely common words which would appear to be of little value in
helping select documents matching a user need are excluded from the
vocabulary entirely. These words are called stop words. The general strategy
for determining a stop list is to sort the terms by collection frequency (the total
number of times each term appears in the document collection) and then to take
the most frequent terms, often hand-filtered for their semantic content relative
to the domain of the documents being indexed, as a stop list.
Token normalization is the process of canonicalizing tokens so that matches
occur despite superficial differences in the character sequences of the tokens.
The most standard way to normalize is to implicitly create equivalence classes,
which are normally named after one member of the set.
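A small sketch of stop-word removal plus a simple normalization (lowercasing as an implicit equivalence class). The stop list here is illustrative, in the spirit of the Reuters-RCV1 list above, not the exact list from the figure.

```python
# Illustrative stop list of common, semantically non-selective words.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
              "from", "in", "is", "it", "of", "on", "the", "to", "with"}

def tokenize(text):
    """Lowercase each token (a simple equivalence-classing normalization)
    and drop stop words before indexing."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("The House of the Rising Sun"))  # ['house', 'rising', 'sun']
```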
1.3 Stemming and lemmatization
Stemming usually refers to a crude heuristic process that chops off the ends of
words in the hope of achieving this goal correctly most of the time, and often
includes the removal of derivational affixes.
Example:
am, are, is → be
car, cars, car's, cars' → car
saw → s
“surfing”, “surfed” → “surf”
Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma.
Example:
saw → see or saw
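The "chop off the ends of words" heuristic can be illustrated with a deliberately naive suffix stripper. This is not the Porter stemmer, which real systems use; it is only a sketch of the crude rule-based idea, and the suffix list and minimum-stem length are arbitrary choices.

```python
def naive_stem(token):
    """Strip a few common English suffixes, keeping at least a
    three-letter stem. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print(naive_stem("surfing"))  # surf
print(naive_stem("surfed"))   # surf
print(naive_stem("cars"))     # car
```

Note how a heuristic like this has no way to map "saw" to "see"; that mapping requires the vocabulary and morphological analysis that lemmatization performs.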
1.6 Edit distance :
Given two character strings s1 and s2, the edit distance between them is
the minimum number of edit operations required to transform s1 into s2.
Most commonly, the edit operations allowed for this purpose are (i) insert
a character into a string, (ii) delete a character from a string, and (iii)
replace a character of a string by another character; with these operations,
edit distance is sometimes known as Levenshtein distance.
For example, the edit distance between cat and dog is three.
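The standard way to compute edit distance is dynamic programming over a table d, where d[i][j] is the minimum number of edits to turn s1[:i] into s2[:j]. A sketch:

```python
def edit_distance(s1, s2):
    """Levenshtein distance via the standard dynamic-programming table."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # i deletions to reach the empty string
    for j in range(n + 1):
        d[0][j] = j          # j insertions from the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete from s1
                          d[i][j - 1] + 1,         # insert into s1
                          d[i - 1][j - 1] + cost)  # replace (or match)
    return d[m][n]

print(edit_distance("cat", "dog"))  # 3
```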
Two main search paradigms:
Retrieval and Browse
Retrieval
o Search for particular information
o Usually focused and purposeful
Browsing
o General looking around for information
o For example: Asia → Thailand → Phuket → Tsunami
IR vs. DBMS
IR                                   DBMS
Imprecise semantics                  Precise semantics
Keyword search                       SQL
Unstructured data format             Structured data
Read-mostly; add docs occasionally   Expect reasonable number of updates
Page through top k results           Generate full answer
Information Retrieval vs. Information Extraction
Information Retrieval:
Given a set of query terms and a set of document terms, select only the most
relevant documents (precision), and preferably all the relevant ones (recall).
Information Extraction:
Extract from the text what the document means.
2. History of IR:
The idea of using computers to search for relevant pieces of information
was popularized in the article “As We May Think” by Vannevar Bush in
1945.
It would appear that Bush was inspired by patents for a 'statistical
machine', filed by Emanuel Goldberg in the 1920s and '30s, that
searched for documents stored on film.
The first description of a computer searching for information was
given by Holmstrom in 1948, detailing an early mention of
the Univac computer.
Automated information retrieval systems were introduced in the 1950s;
one even featured in the 1957 romantic comedy Desk Set.
• 1960-70’s:
– Initial exploration of text retrieval systems for “small” corpora of
scientific abstracts, and law and business documents.
– Development of the basic Boolean and vector-space models of
retrieval.
– Prof. Salton and his students at Cornell University were the leading
researchers in the area.
• 1980’s:
– Large document database systems, many run by companies:
• Lexis-Nexis
• Dialog
• MEDLINE
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering
• 2000’s
– Link analysis for Web Search
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization
– Learning to Rank
2.1 Historical Milestones in IR Research
2.2 Past ,Present and Future:
2.2.1 Early Developments:
An old and popular data structure for faster information retrieval is a collection
of selected words or concepts with which are associated pointers to the related
information, called an index. In one form or another, indexes are at the core of
every modern information retrieval system. They provide faster access to the
data and allow the query processing task to be speeded up.
For centuries, indexes were created manually as categorization hierarchies. In
fact, most libraries still use some form of categorical hierarchy to classify their
volumes. Such hierarchies have usually been conceived by human subjects from
the library sciences field. More recently, the advent of modern computers has
made possible the construction of large indexes automatically. Automatic
indexes provide a view of the retrieval problem which is much more related to
the system itself than to the user need. It is important to distinguish between
two different views of the IR problem: a computer-centered one and a human-
centered one.
In the computer-centered view, the IR problem consists mainly of building
efficient indexes, processing user queries with high performance, and developing
ranking algorithms which improve the quality of the answer set.
In the human-centered view, the IR problem consists mainly of studying the
behavior of the user, understanding his main needs, and determining
how such understanding affects the organization and operation of the retrieval
system.
2.2.2 Information Retrieval in the library:
Libraries were among the first institutions to adopt IR systems for retrieving
information. Usually, systems to be used in libraries were initially developed by
academic institutions and later by commercial vendors.
In the first generation, such systems basically allowed searches
based on author name and title.
In the second generation, increased search functionality was added,
which allowed searching by subject headings, by keywords, and with
some complex query facilities.
In the third generation, which is currently being deployed, the focus
is on improved graphical interfaces, electronic forms, hypertext
features, and open system architectures.
2.2.3 Web and Digital Library:
Three dramatic and fundamental changes have occurred due to the
advances in modern computer technology and the boom of the web. They are:
1. Cheaper access to various sources of information.
2. Greater access to networks.
3. Publishing freedom.
3. Components of IR System:
Information retrieval locates relevant documents on the basis of user input
such as keywords or example documents, for example: find documents
containing the words “database systems”. The figure shows the information
retrieval system block diagram. It consists of three components: Query or
Documents, IR System, and Ranked Results.
1) Query/Collections: store only a representation of the document or query,
which means that the text of a document is lost once it has been processed
for the purpose of generating its representation.
2) IR System: involves performing the actual retrieval function, executing the
search strategy in response to a query.
3) Ranked Results: a set of documents which improves the subsequent run
after information retrieval.
Figure: Block diagram of IR
Architecture of IR System:
Logical View of the Documents:
Due to historical reasons, documents in a collection are frequently represented
through a set of index terms or keywords. Such keywords might be extracted
directly from the text of the document or might be specified by a human subject
(as frequently done in the information sciences arena). No matter whether these
representative keywords are derived automatically or generated by a specialist,
they provide a logical view of the document
With very large collections, however, even modern computers might have to
reduce the set of representative keywords. This can be accomplished through
the elimination of stopwords (such as articles and connectives), the use of
stemming (which reduces distinct words to their common grammatical
root), and the identification of noun groups (which eliminates adjectives,
adverbs, and verbs). Further, compression might be employed. These
operations are called text operations (transformations). Text operations reduce
the complexity of the document representation and allow moving the logical
view from that of a full text to that of a set of index terms.
As illustrated in the figure, we view the issue of logically representing a document
as a continuum in which the logical view of a document might shift (smoothly)
from a full text representation to a higher level representation specified by a
human subject.
The Retrieval Process:
To describe the retrieval process, we use a simple and generic software
architecture as shown in the figure. First of all, before the retrieval process can
even be initiated, it is necessary to define the text database. This is usually done
by the manager of the database, who specifies the following: (a) the documents
to be used, (b) the operations to be performed on the text, and (c) the text model
(i.e., the text structure and what elements can be retrieved). The text operations
transform the original documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager
(using the DB Manager Module) builds an index of the text. An index is a
critical data structure because it allows fast searching over large volumes of
data. Different index structures might be used, but the most popular one is the
inverted index, as indicated in the figure. The resources (time and storage space)
spent on defining the text database and building the index are amortized by
querying the retrieval system many times.
Given that the document database is indexed, the retrieval process can be
initiated. The user first specifies a user need, which is then parsed and
transformed by the same text operations applied to the text. Then, query
operations might be applied before the actual query, which provides a system
representation for the user need, is generated. The query is then processed to
obtain the retrieved documents. Fast query processing is made possible by the
index structure previously built.
Before being sent to the user, the retrieved documents are ranked according to a
likelihood of relevance. The user then examines the set of ranked documents in
search of useful information. At this point, he might pinpoint a subset of
the documents seen as definitely of interest and initiate a user feedback cycle.
In such a cycle, the system uses the documents selected by the user to change
the query formulation. Hopefully, this modified query is a better representation
of the real user need.
Text Operations form index words (tokens):
o Stop-word removal
o Stemming
Indexing constructs an inverted index of word-to-document pointers.
Searching retrieves documents that contain a given query token from the
inverted index.
Ranking scores all retrieved documents according to a relevance metric.
User Interface manages interaction with the user:
o Query input and document output
o Relevance feedback
o Visualization of results
Query Operations transform the query to improve retrieval:
o Query expansion using a thesaurus
o Query transformation using relevance feedback
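The pipeline just described (text operations, indexing, searching, ranking) can be sketched end-to-end in Python. The stop list and documents are illustrative, and raw term frequency stands in for the relevance metric; real engines use weightings such as TF-IDF or BM25.

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "of", "in", "to", "and"}

def text_ops(text):
    """Text operations: lowercase and remove stop words (stemming omitted)."""
    return [t for t in text.lower().split() if t not in STOP]

def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text_ops(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, query):
    """Searching + ranking: score each matching document by summed
    term frequency and return doc IDs in descending score order."""
    scores = Counter()
    for term in text_ops(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return [doc_id for doc_id, _ in scores.most_common()]

docs = {1: "the quick brown fox",
        2: "the lazy dog",
        3: "the quick dog in the fog"}
index = build_index(docs)
print(search(index, "quick dog"))  # doc 3 ranks first: it matches both terms
```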
4. Issues in IR
The main objective of an IR system is to retrieve all the items that are relevant
to a user query, while retrieving as few non relevant items as possible.
4.1 Main problems in IR:
o Document and query indexing
o How to best represent their contents?
o Query evaluation (or retrieval process)
o To what extent does a document correspond to a query?
o System evaluation
o How good is a system?
o Are the retrieved documents relevant? (precision)
o Are all the relevant documents retrieved? (recall)
Information retrieval is concerned with representing, searching, and
manipulating large collections of electronic text and other human-
language data.
Three Big Issues in IR
1. Relevance
It is the fundamental concept in IR.
A relevant document contains the information that a person was
looking for when she submitted a query to the search engine.
There are many factors that go into a person's decision as to whether a
document is relevant.
These factors must be taken into account when designing algorithms
for comparing text and ranking documents.
Simply comparing the text of a query with the text of a document and
looking for an exact match, as might be done in a database system,
produces very poor results in terms of relevance.
To address the issue of relevance, retrieval models are used.
A retrieval model is a formal representation of the process of matching
a query and a document. It is the basis of the ranking algorithm that is
used in a search engine to produce the ranked list of documents.
A good retrieval model will find documents that are likely to be
considered relevant by the person who submitted the query.
The retrieval models used in IR typically model the statistical
properties of text rather than the linguistic structure. For example, the
ranking algorithms are concerned more with the counts of word
occurrences than with whether the word is a noun or an adjective.
2. Evaluation
Two of the evaluation measures are precision and recall.
Precision is the proportion of retrieved documents that are relevant.
Recall is the proportion of relevant documents that are retrieved.
Precision = |Relevant documents ∩ Retrieved documents| / |Retrieved documents|
Recall = |Relevant documents ∩ Retrieved documents| / |Relevant documents|
When the recall measure is used, there is an assumption that all the
relevant documents for a given query are known. Such an assumption
is clearly problematic in a web search environment, but with smaller
test collections of documents, this measure can be useful. It is not
suitable for large volumes of log data.
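The two measures can be computed directly from the retrieved and relevant document sets. The document IDs below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant           # relevant AND retrieved
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

# Hypothetical query: 4 documents retrieved, 2 of them among the 5 relevant.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
print(p, r)  # 0.5 0.4
```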
3. Emphasis on users and their information needs
The users of a search engine are the ultimate judges of quality. This has
led to numerous studies on how people interact with search engines
and in particular, to the development of techniques to help people
express their information needs.
Text queries are often poor descriptions of what the user actually wants,
compared to a request to a database system, such as for the balance of
a bank account.
Despite their lack of specificity, one-word queries are very common in
web search. A one-word query such as “cats” could be a request for
information on where to buy cats or for a description of the Cats
(musical).
Techniques such as query suggestion, query expansion and relevance
feedback use interaction and context to refine the initial query in order
to produce better ranked results.
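One of these techniques, query expansion via pseudo-relevance feedback, can be sketched simply: assume the top-ranked results are relevant and add their most frequent terms to the query. Raw term counts here stand in for proper term weighting (real systems use Rocchio-style or relevance-model weighting), and the sample results are hypothetical.

```python
from collections import Counter

def expand_query(query_terms, top_docs, n_terms=2):
    """Pseudo-relevance feedback sketch: add the n_terms most frequent
    terms from the top-ranked documents to the original query."""
    counts = Counter()
    for text in top_docs:
        counts.update(text.lower().split())
    extra = [t for t, _ in counts.most_common()
             if t not in query_terms][:n_terms]
    return query_terms + extra

top = ["jaguar speed engine", "jaguar engine specs"]
print(expand_query(["jaguar"], top))  # adds 'engine' and another frequent term
```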
• The figure summarizes the major issues involved in search engine
design
5. Open source Search Engine Frameworks:
Open source
Open source software is software whose source code is available for
modification or enhancement by anyone. "Source code" is the part of software
that most computer users don't ever see; it's the code computer programmers
can manipulate to change how a piece of software—a "program" or
"application"—works. Programmers who have access to a computer
program's source code can improve that program by adding features to it or
fixing parts that don't always work correctly.
Advantages of open source
The right to use the software in any way.
There is usually no license cost; the software is free of charge.
The source code is open and can be modified freely.
Open standards.
It provides higher flexibility.
Disadvantages of open source
There is no guarantee that development will happen.
It is sometimes difficult to know that a project exists, and what its current
status is.
No secured follow-up development strategy.
Closed software
Closed software is a term for software whose license does not allow for
the release or distribution of the software's source code. Generally it means
only the binaries of a computer program are distributed, and the
license provides no access to the program's source code. The source code of
such programs is usually regarded as a trade secret of the company. Access
to source code by third parties commonly requires the party to sign a non-
disclosure agreement.
Search Engine
A search engine is a document retrieval system designed to help find
information stored in a computer system, such as on the WWW. The search
engine allows one to ask for content meeting specific criteria and retrieves a
list of items that match those criteria. The following are some famous search
engines.
5.1 Lucene
Lucene is an indexing and search system implemented in Java, with ports to
other programming languages. The project was started by Doug Cutting in
1997. It was initially available for download from its home at
the SourceForge web site. It joined the Apache Software Foundation's
Jakarta family of open-source Java products in September 2001 and became its
own top-level Apache project in February 2005.
Since then, it has grown from a single-developer effort to a global project
involving hundreds of developers in various countries. It is currently hosted by
the Apache Foundation. Lucene is by far the most successful open source
search engine. Its largest installation is quite likely Wikipedia: all queries
entered into Wikipedia's search form are handled by Lucene. A list of other
projects relying on its indexing and search capabilities can be found on
Lucene's “PoweredBy” page. Known for its modularity and extensibility, Lucene
allows developers to define their own indexing and retrieval rules and formulae.
Under the hood, Lucene's retrieval framework is based on the concept of fields:
every document is a collection of fields, such as its title, body, URL, and so
forth. This makes it easy to specify structured search requests and to give
different weights to different parts of a document. The latest version of Lucene
is 6.1.0, which was released on June 17, 2016.
5.2.Indri
Indri is an academic information retrieval system written in C++. It is
developed by researchers at the University of Massachusetts and is part of the
Lemur project, a joint effort of the University of Massachusetts and Carnegie
Mellon University.
Indri is well known for its high retrieval effectiveness and is frequently found
among the top-scoring search engines at TREC. Its retrieval model is a
combination of language modeling approaches. Like Lucene, Indri can
handle multiple fields per document, such as title, body, and anchor text, which
is important in the context of Web search.
It supports automatic query expansion by means of pseudo-relevance feedback,
a technique that adds related terms to an initial search query, based on the
contents of an initial set of search results. It also supports query-independent
document scoring that may, for instance, be used to prefer more recent
documents over less recent ones when ranking the search results.
5.3.Wumpus
Wumpus is an academic search engine written in C++ and developed at the
University of Waterloo. Unlike most other search engines, Wumpus has no
built-in notion of “documents” and does not know about the beginning and the
end of each document when it builds the index.
Instead, every part of the text collection may represent a potential unit for
retrieval, depending on the structural search constraints specified in the query.
This makes the system particularly attractive for search tasks in which the ideal
search result may not always be a whole document, but may be a section, a
paragraph, or a sequence of paragraphs within a document. Wumpus supports
a variety of retrieval methods, including proximity ranking, the BM25 ranking
function, and the language modeling and divergence-from-randomness
approaches. In addition, it is able to carry
out real-time index updates (i.e., adding/removing files to/from the index) and
provides support for multi-user security restrictions that are useful if the
system has more than one user, and each user is allowed to search only parts of
the index.
6. The Impact of the web on IR
The Web is very large, public, unstructured but ubiquitous repository that need
efficient tools to manage, retrieve, and filter information. The search engines
have become a central tool in the Web.
Two characteristics make retrieval of relevant information from the Web is a
really hard task
the large and distributed volume of data available
the fast pace of change
Main challenges posted by Web are
data-centric: related to the data itself
interaction-centric: related to the users and their interactions
Data-centric challenges are varied and include
distributed data
high percentage of volatile data
large volume of data
unstructured and redundant data
quality of data
heterogeneous data
Interaction-centric challenges, related to the users and their interactions, include:
Expressing a query
Interpreting results
Impact of the web
o The first impact of the web on search is related to the characteristics
of the document collection itself.
o The web is composed of pages distributed over millions of
sites and connected through hyperlinks.
o This requires collecting all documents and storing copies of
them in a central repository, prior to indexing.
o This new phase in the IR process, introduced by the web, is
called crawling.
o The second impact of the web on search is related to:
o The size of the collection
o The volume of user queries submitted on a daily basis
o As a consequence, performance and scalability have become critical
characteristics of the IR system.
o The third impact: in a very large collection, predicting relevance is
much harder than before.
o Fortunately, the web also includes new sources of evidence
o Ex. hyperlinks and user clicks on documents in the answer set
o The fourth impact derives from the fact that the web is also a
medium to do business.
o The search problem has been extended beyond the seeking of text
information to also encompass other user needs
o Ex. the price of a book, the phone number of a hotel
o The fifth impact of the web on search is web spam.
o Web spam: abusive availability of commercial information
disguised in the form of informational content.
o This difficulty is so large that today we talk of adversarial web
retrieval.
Practical issues in the Web
o Security
o Commercial transactions over the internet are not yet a completely
safe procedure
o Privacy
o Frequently people are willing to exchange information as long as it
does not become public
o Copyright and patent rights
o It is far from clear how the widespread availability of data on the web
affects copyright and patent laws in the various countries.
o Scanning, Optical Character Recognition (OCR), and cross-language
retrieval
7. Role of Artificial intelligence in IR
Artificial Intelligence:
The study of how to construct intelligent machines & systems that can
simulate or extend the development of human intelligence. Both IR and AI
fields developed in parallel during the early days of computers. The fields of
artificial intelligence and information retrieval share a common interest in
developing more capable computer systems.
What is Intelligence?
According to Cook et al. [1988]:
1. Acquisition: the ability to acquire new knowledge.
2. Automatization: the ability to refine procedures for dealing with a novel
situation into an efficient functional form.
3. Comprehension: the ability to know, understand, and deal with novel
problems.
4. Memory management: the ability to represent knowledge in memory, to map
knowledge on to that memory representation, and to access the knowledge in
memory.
5. Metacontrol: the ability to control various processes in intelligent behavior.
6. Numeric ability: the ability to perform arithmetic operations.
7. Reasoning: the ability to use problem-solving knowledge.
8. Social competence: the ability to interact with and understand other people,
machines or programs.
9. Verbal perception: the ability to recognize natural language.
10. Visual perception: the ability to recognize visual images.
What are Intelligent IR Systems?
The concept of 'intelligent' information retrieval was first suggested in the
late 1970s, but was not pursued by the IR community until the early 1990s.
An intelligent IR system can simulate the human thinking process on
information processing and intelligence activities to achieve information and
knowledge storage, retrieval and reasoning, and to provide intelligence support.
How to introduce AI into IR systems?
A conventional IR program takes a query as input and returns documents as
output, without affording the opportunity for judgment, modification, and
especially interaction with text.
The question is, “where” should AI be introduced into the IR system?
Levels of user and system involvement, according to Bates '90:
Level 0 – No system involvement (the user comes up with a tactic, formulating
a query, coming up with a strategy, and thinking about the outcome).
Level 1 – User can ask for information about searching (the system suggests
tactics that can be used to formulate queries, e.g. help).
Level 2 – User simply enters a query, suggests what needs to be done, and the
system executes the query to return results.
Level 3 – First signs of AI. The system actually starts suggesting improvements
to the user.
Level 4 – Full automation. User queries are entered and the rest is done by the
system.
Some AI methods currently used in Intelligent IR Systems
Web Crawlers (for information extraction)
Mediator Techniques (for information integration)
Ontologies (for intelligent information access by making semantics of
information explicit and machine readable)
Neural Networks (for document clustering & preprocessing)
Kohonen Neural Networks - Self Organizing maps
Hopfield Networks
Semantic Networks
Neural Networks in IR
Based on neural networks, document clustering can be viewed as classification
in the document × document space. Thesaurus construction can be viewed as
laying out a coordinate system in the index × index space. Indexing itself can
be viewed as mappings in the document × index space. Searching can be
conceptualized as connections and activations in the index × document space.
Applying neural networks to information retrieval will likely produce information systems that will be able to:
recall memories despite failed individual memory units
modify stored information in response to new inputs from the user
retrieve "nearest neighbor" data when no exact data match exists
associatively recall information despite noise or missing pieces in the input
categorize information by its associative patterns
AI offers us a powerful set of tools, especially when they are combined with
conventional and other innovative computing tools. However, it is not an easy
task to master those tools and employ them skillfully to build truly significant
intelligent systems. By recognizing the limitations of modern artificial
intelligence techniques, we can establish realistic goals for intelligent
information retrieval systems and devise appropriate system development
strategies. AI models like the neural network will probably not replace
traditional IR approaches anytime soon. However, the application of neural
network models can make an IR system more powerful.
8. IR on the Web vs. Traditional IR
Traditional IR systems normally index a closed collection of documents,
which are mainly text-based and usually offer little linkage between
documents. Traditional IR systems are often referred to as full-text retrieval
systems. Libraries were among the first to adopt IR to index their catalogs and
later, to search through information which was typically imprinted onto CD-
ROMs. The main aim of traditional IR was to return relevant documents that
satisfy the user’s information need. Although the main goal of satisfying the
user’s need is still the central issue in web IR (or web search), there are some
very specific challenges that web search poses that have required new and
innovative solutions.
The first important difference is the scale of web search, as we have
seen that the current size of the web is approximately 600 billion pages.
This is well beyond the size of traditional document collections.
The Web is dynamic in a way that was unimaginable to traditional IR, in terms of its rate of change and the different types of web pages, ranging from static types (HTML, Portable Document Format (PDF), DOC, PostScript, XLS) to a growing number of dynamic pages written in scripting languages such as JSP, PHP or Flash. A large number of images, videos, and a growing number of programs are also delivered through the Web to our browsers.
The Web also contains an enormous amount of duplication, estimated
at about 30%. Such redundancy is not present in traditional corpora
and makes the search engine’s task even more difficult.
The quality of web pages varies dramatically; for example, some web sites create pages with the sole intention of manipulating the search engine's ranking, documents may contain misleading information, the information on some pages is simply out of date, and the overall quality of a web page may be poor in terms of its use of language and the amount of useful information it contains. The issue of quality is of prime importance to web search engines, as they would very quickly lose their audience if they presented poor-quality pages to users in the top-ranked positions.
The range of topics covered on the Web is completely open, as opposed to the closed collections indexed by traditional IR systems, where the topics, as in library catalogues, are much better defined and constrained.
Another aspect of the Web is that it is globally distributed. This poses
serious logistic problems to search engines in building their indexes,
and moreover, in delivering a service that is being used from all over
the globe. The sheer size of the problem is daunting, considering that users will not tolerate anything but an immediate response to their query. Users also vary in their level of expertise, interests, information-seeking tasks, the language(s) they understand, and in many other ways.
Users also tend to submit short queries (between two to three
keywords), avoid the use of anything but the basic search engine
syntax, and when the results list is returned, most users do not look at
more than the top 10 results, and are unlikely to modify their query.
This is all contrary to typical usage of traditional IR.
The hypertextual nature of the Web is also different from traditional
document collections, in giving users the ability to surf by following
links.
On the positive side (for the Web), there are many roads (or paths of
links) that “lead to Rome” and you need only find one of them, but
often, users lose their way in the myriad of choices they have to make.
Another positive aspect of the Web is that it has provided and is providing
impetus for the development of many new tools, whose aim is to improve the
user’s experience.
                     Classical IR            Web IR
Volume               Large                   Huge
Data quality         Clean, no duplicates    Noisy, duplicates available
Data change rate     Infrequent              In flux
Data accessibility   Accessible              Partially accessible
Format diversity     Homogeneous             Widely diverse
Documents            Text                    HTML
No. of matches       Small                   Large
IR techniques        Content based           Link based
9. Components of a search engine:
Search engines are among the most important applications or services on the web. Most successful search engines use a centralized architecture and global ranking algorithms to generate the ranking of the documents crawled into their databases.
A search engine is a program designed to help find information stored on a computer system such as the World Wide Web.
Major building blocks of a search engine are:
a) Indexing
   a. Text Acquisition
   b. Text Transformation
   c. Index Creation
b) Query Processing
   a. User Interaction
   b. Ranking
   c. Evaluation
a) Indexing Process
[Figure: Indexing process — input documents (email, web pages, letters) pass through Text Acquisition, Text Transformation and Index Creation, producing the document data store and the index]
1. Text Acquisition – identifies and stores documents for indexing
2. Text Transformation – transforms documents into index terms or features
3. Index Creation – takes index terms and creates data structures
1. Text Acquisition:
Crawler identifies and acquires documents for search engine
Web crawlers follow links to find documents
o Must efficiently find huge numbers of web pages and keep them up to date
o Single-site crawlers for site search
o Topical or focused crawlers for vertical search
o Document crawlers for enterprise and desktop search
   Follow links and scan directories
Feeds
o Real-time streams of documents, e.g. web feeds for news and blogs
o RSS is a common standard; an RSS reader can provide new XML documents to the search engine
Conversion
o Converts a variety of document formats (e.g. HTML, XML, Word) into a consistent text-plus-metadata format
o Converts text encodings for different languages using a Unicode standard such as UTF-8
Document Datastore
o Stores text, metadata and other related content for documents
   Metadata is information about a document, such as its type and creation date
   Other content includes links and anchor text
o Provides fast access to document contents for search engine components
o Could use a relational database system
o More typically, a simpler, more efficient storage system is used due to the huge number of documents
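The crawler component described above can be sketched, purely for illustration, as a breadth-first traversal. Here fetch_links is a hypothetical stand-in for real HTTP fetching and link extraction, and the toy link graph replaces the actual Web:

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit each URL once, queueing newly discovered links."""
    frontier = deque(seeds)      # URLs waiting to be fetched
    visited = set()              # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):       # outgoing links of the fetched page
            if link not in visited:
                frontier.append(link)
    return visited

# A toy in-memory link graph stands in for real HTTP fetching
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = crawl(["a"], lambda url: graph.get(url, []))
```

A real crawler would add politeness delays, robots.txt handling and freshness scheduling on top of this skeleton.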
2. Text Transformation
Parser
o Processes the sequence of text tokens in the document to recognize structural elements, e.g. titles and links
o Tokenizer recognizes words in the text
   Must consider issues like capitalization and hyphens
o Markup languages such as HTML and XML are often used to specify structure
o Tags are used to specify document elements
o The document parser uses the syntax of the markup language to identify structure
Stopping
o Removes common words, e.g. "and", "or", "the"
o Has some impact on efficiency and effectiveness
Stemming
o Groups words derived from a common stem, e.g. "computer", "computers", "computing", "compute"
o Usually effective, but not for all queries
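The tokenizing, stopping and stemming steps can be illustrated with a toy Python sketch. The stopword list and suffix rules below are illustrative only, a crude stand-in for a real stemmer such as Porter's:

```python
import re

STOPWORDS = {"and", "or", "the", "a", "of", "to", "in"}   # illustrative list

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(word):
    """Crude suffix-stripping stemmer (a toy stand-in for e.g. Porter stemming)."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def transform(text):
    """Tokenize, drop stopwords, then stem the remaining terms."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

terms = transform("The Computers and Computing")
```

Note how "Computers" and "Computing" collapse to the same stem, which is exactly what lets a query on one form match documents containing the other.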
Link Analysis
o Makes use of links and anchor text in web pages
o Link analysis identifies popularity and community information, e.g. PageRank
o Anchor text can significantly enhance the representation of the pages pointed to by links
o Has a significant impact on web search
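As an illustrative sketch of link analysis, a minimal PageRank iteration over a toy link graph might look like this (assuming the standard damping-factor formulation; the graph and iteration count are made up for the example):

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank for a small link graph.

    links maps each page to the list of pages it links to; every linked page
    is assumed to appear as a key. d is the usual damping factor."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}   # teleportation share
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:                               # dangling page: spread rank everywhere
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Toy graph: a links to b and c, b links to c, c links back to a
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Page c ends up with more rank than b because it receives links from both a and b, which is the "popularity" signal the notes refer to.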
Information Extraction
o Identifies classes of index terms that are important for some applications
o E.g. named entity recognizers identify classes such as people and locations
Classifier
o Identifies class-related metadata for documents, i.e. assigns labels to documents, e.g. topics or reading levels
3. Index Creation:
Document statistics
o Gathers counts and positions of words and other features
o Used in ranking algorithm
Weighting
o Computes weights for index terms, used in the ranking algorithm
o E.g. the tf.idf weight: a combination of the term's frequency in the document and its inverse document frequency in the collection
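A minimal illustration of the tf.idf combination, using a raw term count and the logarithmic idf (real systems use many weighting variants; the toy collection is made up):

```python
import math

def tf_idf(term, doc, collection):
    """tf.idf weight: raw term frequency times log inverse document frequency."""
    tf = doc.count(term)                              # term frequency in the document
    df = sum(1 for d in collection if term in d)      # document frequency in collection
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

# Tiny toy collection of already-tokenized "plays"
docs = [["brutus", "caesar"], ["caesar", "caesar"], ["calpurnia"]]
w = tf_idf("caesar", docs[1], docs)   # tf = 2, df = 2, idf = log(3/2)
```

Terms that occur in every document get idf = log(1) = 0, so they contribute nothing to the weight, which is the intended effect of the inverse document frequency.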
Inversion
o Core of the indexing process
o Converts document–term information to term–document information for indexing
   Difficult for very large numbers of documents
   The format of the inverted file is designed for fast query processing
o Must handle updates
o Compression is used for efficiency
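The inversion step, turning document→term lists into a term→document (posting) structure with positions, can be sketched as follows (an in-memory toy; real inverted files are compressed on-disk structures):

```python
def invert(documents):
    """Convert document -> term lists into a term -> {doc_id: positions} index."""
    index = {}
    for doc_id, terms in enumerate(documents):
        for pos, term in enumerate(terms):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

index = invert([["brutus", "caesar"], ["caesar", "calpurnia"]])
```

The resulting postings let query processing jump straight to the documents containing a term instead of scanning every document, which is the whole point of indexing in advance.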
Index Distribution
o Distributes indexes across multiple computers and/or multiple sites
o Essential for fast query processing with large numbers of documents
o P2P and distributed IR involve search across multiple sites
b) Query Process
[Figure: Query process — the user's query flows through User Interaction, Ranking and Evaluation, drawing on the index and the document data store]
1. User Interaction – supports creation and refinement of the query, and display of results
2. Ranking – uses the query and indexes to generate a ranked list of documents
3. Evaluation – monitors and measures effectiveness and efficiency
1. User Interaction
Query input
o Provides an interface and parser for a query language
o Most web queries are very simple; other applications may use forms
o A query language is used to describe more complex queries and the results of query transformation, e.g. Boolean queries or the Indri query language
   IR query languages also allow content and structure specifications, but focus on content
Query Transformation
o Improves the initial query, both before and after the initial search
o Includes the text transformation techniques used for documents
o Spell checking and query suggestion provide alternatives to the original query
o Query expansion and relevance feedback modify the original query with additional terms
Results Output
o Constructs the display of ranked documents for a query
o Generates snippets to show how queries match documents
o Highlights important words and passages
o Retrieves appropriate advertising in many applications
o May provide clustering and other visualization tools
2. Ranking
Scoring
o Calculates scores for documents using a ranking algorithm
o Core component of search engine
o The basic form of the score is Σᵢ qᵢ · dᵢ, where qᵢ and dᵢ are the query and document term weights for term i
o Many variations of ranking algorithms and retrieval models exist
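The inner-product score above can be illustrated directly; the term weights here are made-up values, not derived from any real collection:

```python
def score(query_weights, doc_weights):
    """Inner-product score: sum of q_i * d_i over terms shared by query and document."""
    return sum(q * doc_weights.get(term, 0.0)
               for term, q in query_weights.items())

# Made-up weights for illustration
s = score({"brutus": 1.0, "caesar": 0.5},
          {"caesar": 2.0, "calpurnia": 1.0})
```

Only "caesar" appears in both the query and the document, so the score is 0.5 × 2.0; terms missing from either side contribute zero.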
Performance Optimization
o Designing ranking algorithms for efficient processing
   Term-at-a-time vs. document-at-a-time processing
   Safe vs. unsafe optimizations
Distribution
o Processing queries in a distributed environment
o A query broker distributes queries and assembles results
o Caching is a form of distributed searching
3. Evaluation
Logging
o Logging user queries and interactions is crucial for improving search effectiveness and efficiency
o Query logs and clickthrough data are used for query suggestion, spell checking, query caching, ranking, advertising search, and other components
Ranking analysis
o Measuring and tuning ranking effectiveness
Performance Analysis
o Measuring and tuning system efficiency
10. Characterizing the Web
Measuring the Internet and the Web is difficult because of their highly dynamic nature:
o more than 778 million computers on the Internet (Internet Domain Survey, October 2010)
o the estimated number of Web servers currently exceeds 285 million (Netcraft Web Survey, February 2011)
Hence, there is about one Web server for every three computers directly connected to the Internet.
How many institutions (not servers) maintain Web data?
o number is smaller than the number of servers
o many places have multiple servers
o exact number is unknown
o should be larger than 40% of the number of Web servers
How many pages and how much traffic in the Web?
o studies on the size of search engines, done in 2005, estimated over 20 billion pages
o the same studies estimated that the size of the static Web is roughly doubling every eight months
The exact number of static Web pages was important before the wide use of dynamic pages.
Nowadays, the Web is infinite for practical purposes
o can generate an infinite number of dynamic pages
o Example: an on-line calendar
Most popular formats on Web
o HTML
o followed by GIF and JPG, ASCII text, and PDF
Structure of the Web Graph
The Web can be viewed as a graph, where
o the nodes represent individual pages
o the edges represent links between pages
Broder et al compared the topology of the Web graph to a bow-tie
Original bow-tie structure of the Web
In Baeza-Yates et al, the graph notation was extended
by dividing the CORE component into four parts:
Bridges: sites in CORE that can be reached directly from the IN component and that can reach the OUT component directly
Entry points: sites in CORE that can be reached directly from the IN component but are not in Bridges
Exit points: sites in CORE that reach the OUT component directly, but are not in Bridges
Normal: sites in CORE not belonging to the previously defined sub-components
[Figure: Bow-tie structure of the Web]
[Figure: Refined view of the bow-tie structure]
Modeling the Web
Heaps' and Zipf's laws are also valid on the Web.
» In particular, the vocabulary grows faster (larger b) and the word distribution is more biased (larger q)
Heaps' Law
» An empirical rule which describes vocabulary growth as a function of text size.
» It establishes that a text of n words has a vocabulary of size O(n^b), for some 0 < b < 1
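A small numeric illustration of Heaps' law; the constants K and b below are illustrative, not fitted to real data:

```python
def heaps_vocabulary(n, K=40.0, b=0.5):
    """Heaps' law: a text of n words has roughly V(n) = K * n**b distinct words.
    K and b are illustrative constants; for English b is typically 0.4-0.6."""
    return K * n ** b

v1 = heaps_vocabulary(1_000_000)   # vocabulary of a 1M-word text
v2 = heaps_vocabulary(4_000_000)   # 4x the text only doubles the vocabulary (b = 0.5)
```

The sublinear exponent is why quadrupling the text size here only doubles the vocabulary: new text keeps reusing words already seen.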
Zipf's Law
» An empirical rule that describes the frequency of the words of a text.
» It states that the i-th most frequent word appears as many times as the most frequent one divided by i^q, for some q > 1
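Similarly, a small illustration of Zipf's law; q = 1 is used here for simplicity, whereas on the Web q is typically larger, as noted above:

```python
def zipf_frequency(i, top_frequency, q=1.0):
    """Zipf's law: the i-th most frequent word occurs top_frequency / i**q times."""
    return top_frequency / i ** q

# With q = 1: the 2nd word occurs half as often as the 1st, the 3rd a third as often...
freqs = [zipf_frequency(i, 1000.0) for i in range(1, 6)]
```

Plotting such frequencies against rank on log-log axes gives the straight line characteristic of a power law.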
[Figure: Zipf's and Heaps' laws — distribution of sorted word frequencies F (left) and vocabulary size V as a function of text size (right)]
The CORE component follows a power-law distribution.
Power law: a function that is invariant to scale changes:
   f(x) = a / x^α, with α > 0
Depending on the value of α, the moments of the distribution will be finite or not:
   α ≤ 2: the average and all higher-order moments are infinite
2 < α ≤ 3: mean exists, but variance and higher-order moments are
infinite
Web measures that follow a power law include
o number of pages per Web site
o number of Web sites per domain
o incoming and outgoing link distributions
o number of connected components of the Web graph
Also the case for the host-graph
o the connectivity graph at the level of Web sites
Distribution of document sizes: self-similar model
o based on mixing two different distributions
o the main body of the distribution follows a log-normal distribution
[Figure: Example of file size distribution in a semi-log graph]
The right tail of the distribution is heavy-tailed
o majority of documents are small
o but there is a non-trivial number of large documents, so the area under the curve is relevant
A good fit is obtained with a Pareto distribution, which is similar to a power law.
Important Questions
1. Differentiate between Information Retrieval and Web Search. (8) Nov/Dec 2017 AN
2. Explain the issues in the process of Information Retrieval. (8) Nov/Dec 2017 U
3. Explain in detail the components of Information Retrieval and a search engine. (16) Nov/Dec 2017, Nov/Dec 2018, Apr/May 2018 U
4. Explain in detail the components of IR. Nov/Dec 2016 U
5. Write short notes on: Nov/Dec 2016 U
   i. Characterizing the web for search. (8) U
   ii. Role of AI in IR. (8) AN
6. Explain the historical development of Information Systems. Discuss the sophistication in technology in detail. U
7. Analyze the challenges in an IR system and give your suggestions to overcome them. AN
8. Write briefly about open source search engine frameworks. (6) Nov/Dec 2018 U
9. Explain the impact of the web on information retrieval systems. (7) Nov/Dec 2018 AN
10. How will you characterize the web? U
11. Compare Web IR with classical IR and describe the web's impact on the information retrieval process. AN