Information Retrieval



SECTION CONTENTS

Preface
Acknowledgement
1. Introduction
2. Applications of IR
3. Web search engine
4. Performance measures
5. Models of IR
6. Problems in IR
References


i. Preface
ii. Acknowledgment

1. Introduction
1.1 Definition
1.2 History
1.3 Purpose
1.4 Basic IR system architecture
1.5 Databases v/s. IR
1.6 How IR systems work
1.7 The process of IR

2. Applications of Information Retrieval
2.1 General applications of IR
2.2 Domain specific applications of IR
2.3 Other retrieval methods

3. Web search engine
3.1 Overview & Definition


3.2 Popular search engines
3.3 Google Search
3.4 Yahoo Search
3.5 Bing Search
3.6 How search engines work

4. Performance measures
4.1 Performance measures
4.2 Other measures

5. Models of Information Retrieval
5.1 Model Types

6. Problems in Information Retrieval
6.1 Problem 1
6.2 Problem 2
6.3 Problem 3
6.4 Problem 4
6.5 Problem 5

iii. References

1. INTRODUCTION

1.1 DEFINITION

Information Retrieval (IR), also called information storage and retrieval (ISR or ISAR) or information organization and retrieval, is the science of searching for documents, for information within documents, and for metadata about documents, as well as of searching relational databases and the World Wide Web. The central challenge is to retrieve what is useful while leaving behind what is not. The process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection; instead, several objects may match the query, perhaps with different degrees of relevancy.

An object is an entity that is represented by information in a database. User queries are matched against the database information. Depending on the application, the data objects may be, for example, text documents, images, audio, mind maps or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata.

Most IR systems compute a numeric score for how well each object in the database matches the query, and rank the objects according to this value. The top-ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.
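This scoring-and-ranking loop can be sketched in a few lines of Python. The sketch below is a toy illustration, not any real system's ranking: the `overlap` scorer is a stand-in for the weighting schemes discussed in later sections.

```python
def rank(documents, query, score):
    """Score every object in the collection against the query and return
    the matching objects sorted by descending score, as an IR system does
    before showing the top-ranking results to the user."""
    scored = sorted(((score(doc, query), doc_id)
                     for doc_id, doc in documents.items()), reverse=True)
    return [doc_id for s, doc_id in scored if s > 0]

def overlap(doc, query):
    """Toy score: how many query terms the document contains."""
    return len(set(doc.split()) & set(query.split()))

docs = {1: "web search engines", 2: "image retrieval", 3: "web image search"}
print(rank(docs, "web search", overlap))  # [3, 1]; doc 2 matches nothing
```

If the user refines the query, the same loop simply runs again with the new query string.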


1.2 HISTORY

The idea of using computers to search for relevant pieces of information was popularized in the article "As We May Think" by Vannevar Bush in 1945. The first automated information retrieval systems were introduced in the 1950s and 1960s. By 1970 several different techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.

1.3 PURPOSE

1.3.1 To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.

1.3.2 To allow more flexible matching operations.

1.3.3 To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.


1.4 BASIC IR SYSTEM ARCHITECTURE

[Figure: Components of an IR system. The user's information need becomes a query to the search engine, which consults an index built over the documents (kept current through additions and deletions) and returns a result.]


The figure above illustrates the major components in an IR system. Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic, particularly when it is presented in written form as part of a test collection for IR evaluation.

The user's query is processed by a search engine, which may be running on the user's local machine, on a large cluster of machines in a remote geographic location, or anywhere in between.

    A major task of a search engine is to maintain and manipulate an inverted index for a document

    collection. As its basic function, an inverted index provides a mapping between terms and the

    locations in the collection in which they occur.

    To support relevance ranking algorithms, the search engine maintains collection statistics

    associated with the index, such as the number of documents containing each term and the length

    of each document.
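As a concrete illustration, here is a minimal sketch of such an index in Python. The structure and helper names are illustrative, not any particular engine's internals; it records, for each term, the documents and positions where the term occurs, together with the collection statistics just mentioned.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index mapping term -> {doc_id: [positions]},
    plus the statistics a ranking algorithm needs: the number of
    documents containing each term and the length of each document."""
    index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        terms = text.lower().split()
        doc_lengths[doc_id] = len(terms)
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)
    doc_freq = {term: len(postings) for term, postings in index.items()}
    return index, doc_freq, doc_lengths

index, df, lengths = build_index({1: "information retrieval systems",
                                  2: "database systems"})
print(dict(index["systems"]))  # {1: [2], 2: [1]}: locations per document
print(df["systems"], lengths)  # 2 documents contain "systems"
```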

    In addition, the search engine usually has access to the original content of the documents, in

    order to report meaningful results back to the user.

    Using the inverted index, collection statistics, and other data, the search engine accepts queries

    from its users, processes these queries, and returns ranked lists of results. To perform relevance

    ranking, the search engine computes a score, sometimes called a retrieval status value (RSV), for

    each document.

    After sorting documents according to their scores, the result list may be subjected to further

    processing, such as the removal of duplicate or redundant results.


1.5 DATABASES v/s. INFORMATION RETRIEVAL

DATABASES:
o We know the schema in advance, so the semantic correlation between queries and data is clear.
o We can get exact answers.
o Strong theoretical foundation (at least with relational databases).

IR:
o No schema, but rather unstructured natural language text. The result is that there is not a clear semantic correlation between queries and data.
o We get inexact, estimated answers.
o Theory not well understood (especially natural language processing).


1.6 HOW INFORMATION RETRIEVAL SYSTEMS WORK

IR is a component of an information system. An information system must make sure that everybody it is meant to serve has the information needed to accomplish tasks, solve problems, and make decisions, no matter where that information is available. To this end, an information system must:

1.6.1 Actively find out what users need. Determining user needs involves
(a) studying user needs in general as a basis for designing responsive systems (such as determining what information students typically need for assignments), and
(b) actively soliciting the needs of specific users, expressed as query descriptions, so that the system can provide the information.

1.6.2 Acquire documents (or computer programs, or products, or data items, and so on), resulting in a collection.

1.6.3 Match documents with needs.

Figuring out what information the user really needs to solve a problem is essential for successful retrieval. Matching involves taking a query description and finding relevant documents in the collection; this is the task of the IR system.

1.7 THE PROCESS OF IR

[Figure: diagram of the IR process, referenced in the text below.]

It will also involve performing the actual retrieval function, that is, executing the search strategy in response to a query. In the diagram, the documents have been placed in a separate box to emphasize the fact that they are not just input but can be used during the retrieval process in such a way that their structure is more correctly seen as part of the retrieval process.

Finally, we come to the output, which is usually a set of citations or document numbers. In an operational system the story ends here. In an experimental system, however, the evaluation of the output remains to be done.


2. APPLICATIONS OF IR

Areas where information retrieval techniques are employed can be divided into three categories:

2.1 General applications of information retrieval
2.2 Domain specific applications of information retrieval
2.3 Other retrieval methods

2.1 General applications of Information Retrieval

1. Digital libraries - A digital library is a library in which collections are stored in digital formats and accessible by computers. The digital content may be stored locally, or accessed remotely via computer networks. A digital library is a type of information retrieval system.

2. Information filtering - An information filtering system is a system that removes redundant or unwanted information from an information stream using (semi)automated or computerized methods prior to presentation to a human user.
   i. Recommender systems - Recommender systems work from a specific type of information filtering technique that attempts to recommend information items (movies, TV programs/shows/episodes, video on demand, music, books, news, images, web pages, scientific literature such as research papers, etc.) that are likely to be of interest to the user.


3. Media search
   i. Image retrieval - An image retrieval system is a computer system for browsing, searching and retrieving images from a large database of digital images.
   ii. Music retrieval - Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music.
   iii. News search
   iv. Speech retrieval
   v. Video retrieval

4. Search engines - A search engine is designed to help find information stored on a computer system.
   i. Desktop search - Desktop search is the name for the field of search tools which search the contents of a user's own computer files, rather than searching the Internet. These tools are designed to find information on the user's PC, including web browser histories, e-mail archives, text documents, sound files, images and video.
   ii. Enterprise search - Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.
   iii. Federated search - Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources.
   iv. Mobile search - Mobile search is an evolving branch of information retrieval services that is centered on the convergence of mobile platforms, mobile phones and other mobile devices.
   v. Social search - Social search or a social search engine is a type of web search that takes into account the social graph of the person initiating the search query.


   vi. Web search - A web search engine is designed to search for information on the World Wide Web and FTP servers. The search results are generally presented in a list of results and are often called hits.

2.2 Domain specific applications of Information Retrieval

1. Geographic information retrieval - Geographic Information Retrieval (GIR) or Geographical Information Retrieval is the augmentation of Information Retrieval with geographic metadata.

2. Legal information retrieval - Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works.

3. Vertical search - A vertical search engine, as distinct from a general Web search engine, focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content. Common examples include legal, medical, patent (intellectual property), travel, and automobile search engines.


2.3 Other retrieval methods

Methods/techniques in which information retrieval techniques are employed include:

1. Adversarial information retrieval - Adversarial information retrieval (adversarial IR) is a topic in information retrieval related to strategies for working with a data source where some portion of it has been manipulated maliciously.

2. Automatic summarization - Automatic summarization is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text.
   i. Multi-document summarization - Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic.

3. Compound term processing - Compound term processing is the name that is used for a category of techniques in information retrieval applications that performs matching on the basis of compound terms. Compound terms are built by combining two (or more) simple terms; for example, "triple" is a single-word term but "triple heart bypass" is a compound term.

4. Cross-lingual retrieval - Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query. For example, a user may pose their query in English but retrieve relevant documents written in French.


5. Document classification - Document classification/categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents.

6. Spam filtering - Email filtering is the processing of e-mail to organize it according to specified criteria. Most often this refers to the automatic processing of incoming messages, but the term also applies to the intervention of human intelligence in addition to anti-spam techniques, and to outgoing emails as well as those being received.

7. Question answering - In information retrieval and natural language processing (NLP), question answering (QA) is the task of automatically answering a question posed in natural language.

3. WEB SEARCH ENGINE

3.1 SEARCH ENGINE / WEB SEARCH ENGINE: DEFINITION & OVERVIEW

A search engine is the popular term for an Information Retrieval (IR) system designed to help find information stored on a computer system, such as on the World Wide Web (WWW), inside a corporate or proprietary network, or in a personal computer. The search engine allows one to ask for content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of items that match those criteria.

Regular users of Web search engines casually expect to receive accurate and near-instantaneous answers to questions and requests merely by entering a short query (a few words) into a text box and clicking on a search button. Underlying this simple and intuitive interface are clusters of computers, comprising thousands of machines, working cooperatively to generate a ranked list of those Web pages that are likely to satisfy the information need embodied in the query. These machines identify a set of Web pages containing the terms in the query, compute a score for each page, eliminate duplicate and redundant pages, generate summaries of the remaining pages, and finally return the summaries and links back to the user for browsing.

In order to achieve the sub-second response times expected from Web search engines, they incorporate layers of caching and replication, taking advantage of commonly occurring queries and exploiting parallel processing, allowing them to scale as the number of Web pages and users increases. In order to produce accurate results, they store a snapshot of the Web. This snapshot must be gathered and refreshed constantly by a Web crawler, also running on a cluster of hundreds or thousands of machines, periodically (perhaps once a week) downloading a fresh copy of each page. Pages that contain rapidly changing information of high quality, such as news services, may be refreshed daily or hourly.
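The crawl cycle itself reduces to a simple loop. The following is a toy single-machine sketch (using the third-party `requests` and `beautifulsoup4` packages); a production crawler would additionally respect robots.txt, partition the frontier across machines, and schedule per-page refresh intervals.

```python
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10, delay=1.0):
    """Breadth-first toy crawl: fetch a page, store a timestamped copy
    in the snapshot, queue its outgoing links, and repeat."""
    frontier = deque([seed_url])
    snapshot = {}                       # url -> (fetch_time, html)
    while frontier and len(snapshot) < max_pages:
        url = frontier.popleft()
        if url in snapshot:
            continue                    # already fetched in this pass
        html = requests.get(url, timeout=10).text
        snapshot[url] = (time.time(), html)
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            frontier.append(urljoin(url, link["href"]))
        time.sleep(delay)               # politeness delay between requests
    return snapshot
```

Re-running the loop produces a fresh snapshot, which is the "refresh" half of the crawler's job.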

Consider a simple example. If you have a computer connected to the Internet nearby, pause for a minute to launch a browser and try the query "information retrieval" on one of the major


commercial Web search engines. It is likely that the search engine responded in well under a second. Take some time to review the top ten results. Each result lists the URL for a Web page and usually provides a title and a short snippet of text extracted from the body of the page. Overall, the results are drawn from a variety of different Web sites and include sites associated with leading textbooks, journals, conferences, and researchers. As is common for informational queries such as this one, the Wikipedia article may be present. Do the top ten results contain anything inappropriate? Could their order be improved? Have a look through the next ten results and decide whether any one of them could better replace one of the top ten results.

Now, consider the millions of Web pages that contain the words "information" and "retrieval". This set of pages includes many that are relevant to the subject of information retrieval but are much less general in scope than those that appear in the top ten, such as student Web pages and individual research papers. In addition, the set includes many pages that just happen to contain these two words, without having any direct relationship to the subject. From these millions of possible pages, a search engine's ranking algorithm selects the top-ranked pages based on a variety of features, including the content and structure of the pages (e.g., their titles), their relationship to other pages (e.g., the hyperlinks between them), and the content and structure of the Web as a whole. For some queries, characteristics of the user, such as her geographic location or past searching behavior, may also play a role. Balancing these features against each other in order to rank pages by their expected relevance to a query is an example of relevance ranking.


3.2 POPULAR SEARCH ENGINES

Google www.google.com
Bing www.bing.com
Ask www.ask.com

3.2.1 Web Directories

Web directories are human-compiled indexes of sites, which are then categorized. The fact that your site is reviewed by an editor before being placed in the index means that getting listed in a directory is often quite difficult. Consequently, having a listing in a directory will guarantee you a good amount of well-targeted visitors. Most search engines will rank you higher if they find your site in one of the directories below.

    Yahoo www.yahoo.com

    3.2.2 Open Directory

    The Open Directory is another human-compiled directory, but one where any Internet user can

    become an editor and be responsible for some part of the index. Many other services use Open

    Directory listings, including Google, Netscape, Lycos, AOLsearch, AltaVista and HotBot.

    dmoz.com


    3.2.3 And the rest...

    You can submit your site to these search engines if you want. Most of these companies are

    struggling on this new Google-dominated web.

    Lycos www.lycos.com

    FAST Search www.alltheweb.com

    AOL Search search.aol.com

    AltaVista www.altavista.com

    DogPile www.dogpile.com


3.2.4 The three most widely used web search engines and their approximate share

[Figure: chart of the approximate market share of the three most widely used web search engines.]


3.3 Google Search

URL: www.google.com (list of domain names)
Commercial?: Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (124)
Owner: Google
Created by: Sergey Brin and Larry Page
Launched: September 15, 1997
Alexa rank: 1
Revenue: From AdWords
Current status: Active


3.4 Yahoo! Search

URL: search.yahoo.com
Commercial?: Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (40)
Owner: Yahoo!
Created by: Yahoo!
Launched: March 1, 1995
Alexa rank: 4
Current status: Active


3.5 Bing (search engine)

URL: www.bing.com
Slogan: Bing & decide
Commercial?: Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (40)
Owner: Microsoft
Created by: Microsoft
Launched: June 1, 2009
Alexa rank: 23
Current status: Active


3.6 HOW SEARCH ENGINES WORK

Search engines match queries against an index that they create. The index consists of the words in each document, plus pointers to their locations within the documents. This is called an inverted file.

A search engine or IR system comprises four essential modules:

3.6.1 A document processor
3.6.2 A query processor
3.6.3 A search and matching function
3.6.4 A ranking capability

3.6.1 Document Processor

o The document processor prepares, processes, and inputs the documents, pages, or sites that users search against.

o Steps of the Document Processor
1. Normalizes the document stream.
2. Breaks the document stream into desired retrievable units.
3. Isolates and metatags subdocument pieces.
4. Identifies potential indexable elements in documents.
5. Deletes stop words.
6. Stems terms.
7. Extracts index entries.


8. Computes weights.
9. Creates and updates the main inverted file against which the search engine searches in order to match queries.

Steps 1 to 3

Preprocessing.

o While essential and potentially important in affecting the outcome of a search, these first three steps simply standardize the multiple formats encountered when deriving documents from various providers or handling various Web sites.

o The steps serve to merge all the data into a single consistent data structure that all the downstream processes can handle. The need for a well-formed, consistent format is of relative importance in direct proportion to the sophistication of later steps of document processing.

o Step 2 is important because the pointers stored in the inverted file will enable a system to retrieve various sized units: site, page, document, section, paragraph, or sentence.

Step 4

Identify elements to index.

o Identifying potential indexable elements in documents dramatically affects the nature and quality of the document representation that the engine will search against.

o In designing the system, we must define the word "term." Is it the alpha-numeric characters between blank spaces or punctuation? If so, what about non-compositional phrases (phrases in which the separate words do not convey the meaning of the phrase, like "skunk works" or "hot dog"), multi-word proper names, or inter-word symbols such as hyphens or apostrophes that can denote the difference between "small business men" versus "small-business men"?


o Each search engine depends on a set of rules that its document processor must execute to determine what action is to be taken by the "tokenizer," i.e. the software used to define a term suitable for indexing.

Step 5

Deleting stop words.

o This step helps save system resources by eliminating from further processing, as well as potential matching, those terms that have little value in finding useful documents in response to a customer's query.

o This step used to matter much more than it does now, when memory has become so much cheaper and systems so much faster, but since stop words may comprise up to 40 percent of text words in a document, it still has some significance.

o A stop word list typically consists of those word classes known to convey little substantive meaning, such as
  o articles (a, the),
  o conjunctions (and, but),
  o interjections (oh, but),
  o prepositions (in, over),
  o pronouns (he, it), and
  o forms of the "to be" verb (is, are).
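In code, this step is just a set-membership filter. A minimal sketch follows; the tiny stop list only mirrors the word classes listed above, whereas real lists run to a few hundred words.

```python
STOP_WORDS = {"a", "the", "and", "but", "oh", "in", "over", "he", "it",
              "is", "are"}

def delete_stop_words(tokens):
    """Drop tokens that convey little substantive meaning before indexing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(delete_stop_words("the engine is fast and accurate".split()))
# ['engine', 'fast', 'accurate']
```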

Step 6

Term Stemming.

o Stemming removes word suffixes. The process has two goals.


o In terms of efficiency, stemming reduces the number of unique words in the index, which in turn reduces the storage space required for the index and speeds up the search process.

o In terms of effectiveness, stemming improves recall by reducing all forms of the word to a base or stemmed form.

o For example, if a user asks for "analyze," they may also want documents which contain analysis, analyzing, analyzer, analyzes, and analyzed. Therefore, the document processor stems document terms to "analyze-" so that documents which include various forms of "analyze-" will have equal likelihood of being retrieved.
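A crude suffix-stripping stemmer shows the idea. This sketch is not the Porter algorithm that real engines typically use, and its suffix list is invented for the example.

```python
SUFFIXES = ("izing", "izer", "ized", "izes", "ing", "ed", "er", "es", "s", "e")

def stem(word):
    """Strip the first matching suffix so that inflected forms of a word
    collapse onto one index entry (e.g. analyzes/analyzed -> analyz)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["analyze", "analyzing", "analyzer", "analyzes", "analyzed"]:
    print(w, "->", stem(w))   # every form maps to "analyz"
```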

Step 7

Extract index entries.

o Having completed steps 1 through 6, the document processor extracts the remaining entries from the original document. For example, the following paragraph shows the full text sent to a search engine for processing:

  o Milosevic's comments, carried by the official news agency Tanjug, cast doubt over the governments at the talks, which the international community has called to try to prevent an all-out war in the Serbian province. "President Milosevic said it was well known that Serbia and Yugoslavia were firmly committed to resolving problems in Kosovo, which is an integral part of Serbia, peacefully in Serbia with the participation of the representatives of all ethnic communities," Tanjug said. Milosevic was speaking during a meeting with British Foreign Secretary Robin Cook, who delivered an ultimatum to attend negotiations in a week's time on an autonomy proposal for Kosovo with ethnic Albanian leaders from the province. Cook earlier told a conference that Milosevic had agreed to study the proposal.

o Steps 1 to 6 reduce this text for searching to the following:

  o Milosevic comm carri offic new agen Tanjug cast doubt govern talk interna commun call try prevent all-out war Serb province President Milosevic said well


    known Serbia Yugoslavia firm commit resolv problem Kosovo integr part Serbia

    peace Serbia particip representa ethnic commun Tanjug said Milosevic speak

    meeti British Foreign Secretary Robin Cook deliver ultimat attend negoti week

    time autonomy propos Kosovo ethnic Alban lead province Cook earl told

    conference Milosevic agree study propos.

o The output of step 7 is then inserted and stored in an inverted file that lists the index entries and an indication of their position and frequency of occurrence.

o The specific nature of the index entries, however, will vary based on the decision in Step 4 concerning what constitutes an indexable term.

o Document processors will have phrase recognizers, as well as named entity recognizers and categorizers, to ensure index entries such as Milosevic are tagged as a Person and entries such as Yugoslavia and Serbia as Countries.

Step 8

Term weight assignment.

o Weights are assigned to terms in the index file. The simplest of search engines just assign a binary weight: 1 for presence and 0 for absence.

o Measuring the frequency of occurrence of a term in the document creates more sophisticated weighting, with length-normalization of frequencies still more sophisticated. Extensive experience in information retrieval research over many years has clearly demonstrated that the optimal weighting comes from use of "tf/idf." This algorithm measures the frequency of occurrence of each term within a document. Then it compares that frequency against the frequency of occurrence in the entire database.

o A simple example would be the word "the." This word appears in too many documents to help distinguish one from another. A less obvious example would be the word "antibiotic." In a sports database, when we compare each document to the database as a whole, the term "antibiotic" would probably be a good discriminator among documents, and therefore would be assigned a high weight. Conversely, in a database devoted to health or medicine, "antibiotic" would probably be a poor discriminator, since it occurs


    very often. The TF/IDF weighting scheme assigns higher weights to those terms that

    really distinguish one document from the others.
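The tf/idf idea fits in a few lines. The sketch below uses raw counts and a plain logarithm, whereas real systems use smoothed variants of both factors.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight = (length-normalized term frequency) * log(N / doc frequency).
    Terms that occur in every document get weight 0; terms rare across the
    collection but frequent inside one document get the highest weights."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [{term: (count / len(tokens)) * math.log(n / df[term])
             for term, count in Counter(tokens).items()}
            for tokens in tokenized]

weights = tf_idf(["the antibiotic cured the infection",
                  "the team won the game",
                  "the game was close"])
print(weights[0]["antibiotic"])  # high: rare across the collection
print(weights[0]["the"])         # 0.0: occurs in every document
```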

Step 9

Create index.

o The index or inverted file is the internal data structure that stores the index information and that will be searched for each query. Inverted files range from a simple listing of every alpha-numeric sequence in a set of documents/pages being indexed, along with the overall identifying numbers of the documents in which the sequence occurs, to a more linguistically complex list of entries, the tf/idf weights, and pointers to where inside each document the term occurs.

o The more complete the information in the index, the better the search result.
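A sketch of the richer end of that range follows, with one posting per (term, document) pair holding the term's weight and its in-document positions. The input names assume the outputs of the earlier steps and are illustrative only.

```python
from collections import defaultdict

def create_inverted_file(processed_docs, weights):
    """processed_docs: doc_id -> token list surviving steps 1-6;
    weights: doc_id -> {term: tf/idf weight} from Step 8.
    Returns term -> [(doc_id, weight, [positions]), ...]."""
    inverted = defaultdict(list)
    for doc_id, tokens in processed_docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            inverted[term].append((doc_id, weights[doc_id][term], pos_list))
    return inverted

inv = create_inverted_file({1: ["milosevic", "said", "milosevic"]},
                           {1: {"milosevic": 0.4, "said": 0.1}})
print(inv["milosevic"])  # [(1, 0.4, [0, 2])]: doc id, weight, positions
```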


3.6.2 Query Processor

o Query processing has seven possible steps, though a system can cut these steps short and proceed to match the query to the inverted file at any of a number of places during the processing. Document processing shares many steps with query processing.

o Steps of the Query Processor
The steps in query processing are as follows (with the option to stop processing and start matching indicated as "Matcher"):

1. Tokenize query terms.
2. Parse the query: recognize query terms vs. special operators.
   > Matcher
3. Delete stop words.
4. Stem words.
5. Create the query.
   > Matcher
6. Expand the query.
7. Compute weights.
   > Matcher


Step 1

Tokenizing.

o As soon as a user inputs a query, the search engine must tokenize the query stream, i.e., break it down into understandable segments. Usually a token is defined as an alpha-numeric string that occurs between white space and/or punctuation.

Step 2

Parsing.

o Since users may employ special operators in their query, including Boolean, adjacency, or proximity operators, the system needs to parse the query first into query terms and operators. These operators may occur in the form of reserved punctuation (e.g., quotation marks) or reserved terms in specialized format (e.g., AND, OR).
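A minimal tokenizer-plus-parser sketch is shown below; the quoted-phrase and AND/OR/NOT conventions here are illustrative, not any particular engine's syntax.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}

def parse_query(query):
    """Split a query into plain terms and reserved operators; a quoted
    string survives as a single phrase token."""
    tokens = re.findall(r'"[^"]+"|\S+', query)
    terms = [t.strip('"').lower() for t in tokens if t not in OPERATORS]
    ops = [t for t in tokens if t in OPERATORS]
    return terms, ops

print(parse_query('"information retrieval" AND evaluation'))
# (['information retrieval', 'evaluation'], ['AND'])
```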

Steps 3 & 4

Stop list and stemming.

o Some search engines will go further and stop-list and stem the query, similar to the processes described above in the Document Processor section. The stop list might also contain words from commonly occurring querying phrases, such as "I'd like information about." However, since most publicly available search engines encourage very short queries, as evidenced in the size of the query window provided, the engines may drop these two steps.


Step 7

Query term weighting (assuming more than one query term).

o The final step in query processing involves computing weights for the terms in the query. Sometimes the user controls this step by indicating either how much to weight each term or simply which term or concept in the query matters most and must appear in each retrieved document to ensure relevance.

o Leaving the weighting up to the user is not common, because research has shown that users are not particularly good at determining the relative importance of terms in their queries. They can't make this determination for several reasons. First, they don't know what else exists in the database, and document terms are weighted by being compared to the database as a whole. Second, most users seek information about an unfamiliar subject, so they may not know the correct terminology.

o Few search engines implement system-based query weighting, but some do an implicit weighting by treating the first term(s) in a query as having higher significance. The engines use this information to provide a list of documents/pages to the user.

o After this final step, the expanded, weighted query is searched against the inverted file of documents.

4. PERFORMANCE MEASURES

    4.1 Performance measures

    Many different measures for evaluating the performance of information retrieval systems have

    been proposed. The measures require a collection of documents and a query. All common

    measures described here assume a ground truth notion of relevancy: every document is known to

    be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and

    there may be different shades of relevancy.

    4.1.1 Precision

Precision is the fraction of the documents retrieved that are relevant to the user's information need:

$$ \mathrm{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} $$

    In binary classification, precision is analogous to positive predictive value. Precision takes all

    retrieved documents into account. It can also be evaluated at a given cut-off rank, considering

    only the topmost results returned by the system. This measure is called precision at n or P@n.

    Note that the meaning and usage of "precision" in the field of Information Retrieval differs from

    the definition of accuracy and precision within other branches of science and technology.


    4.1.2 Recall

Recall is the fraction of the documents that are relevant to the query that are successfully retrieved:

$$ \mathrm{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} $$

    In binary classification, recall is called sensitivity. So it can be looked at as the probability that a

    relevant document is retrieved by the query.

It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore recall alone is not enough; one also needs to measure the number of non-relevant documents retrieved, for example by computing the precision.

    4.1.3 Fall-Out

Fall-out is the proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

$$ \mathrm{fall\text{-}out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|} $$

In binary classification, fall-out is closely related to specificity: fall-out = 1 - specificity. It can be looked at as the probability that a non-relevant document is retrieved by the query.

    It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.


4.1.4 F-measure

The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score, is:

$$ F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $$

This is also known as the F1 measure, because recall and precision are evenly weighted.

The general formula for non-negative real $\beta$ is:

$$ F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}} $$

Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall.

The F-measure was derived by van Rijsbergen (1979) so that F "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure $E = 1 - 1/(\alpha/P + (1-\alpha)/R)$. Their relationship is $F_\beta = 1 - E$, where $\alpha = 1/(\beta^2 + 1)$.
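All four measures are one-liners once the retrieved and relevant document sets are known; a sketch over toy document-id sets:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def fall_out(retrieved, relevant, collection):
    """Fraction of non-relevant documents that were retrieved."""
    non_relevant = collection - relevant
    return len(retrieved & non_relevant) / len(non_relevant)

def f_measure(p, r, beta=1.0):
    """General F_beta; beta=1 gives the balanced F1 score."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

retrieved, relevant = {1, 2, 3, 4}, {2, 4, 5}
collection = set(range(1, 11))
p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(p, r, f_measure(p, r))                      # 0.5, ~0.667, ~0.571
print(fall_out(retrieved, relevant, collection))  # 2 of 7 non-relevant
```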

    4.1.5 Mean Average precision

    Precision and recall are single-value metrics based on the whole list of documents returned by

    the system. For systems that return a ranked sequence of documents, it is desirable to also


consider the order in which the returned documents are presented. Average precision emphasizes ranking relevant documents higher. It is the average of the precisions computed at the position of each relevant document in the ranked sequence:

$$ \mathrm{AveP} = \frac{\sum_{r=1}^{N} P(r) \cdot \mathrm{rel}(r)}{|\{\text{relevant documents}\}|} $$

where r is the rank, N the number of documents retrieved, rel(r) a binary function indicating the relevance of the document at rank r, and P(r) the precision at cut-off rank r:

$$ P(r) = \frac{|\{\text{relevant retrieved documents of rank} \le r\}|}{r} $$

    This metric is also sometimes referred to geometrically as the area under the Precision-Recall

    curve.

    Note that the denominator (number of relevant documents) is the number of relevant documents

    in the entire collection, so that the metric reflects performance over all relevant documents,

    regardless of a retrieval cutoff.
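Average precision translates directly into code; mean average precision is then just the mean of this value over a set of queries.

```python
def average_precision(ranking, relevant):
    """ranking: ordered list of returned doc ids; relevant: set of ALL
    relevant doc ids in the collection, so relevant documents that are
    never retrieved still lower the score."""
    hits, total = 0, 0.0
    for r, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / r        # P(r) at this relevant document
    return total / len(relevant)

# Relevant docs at ranks 1 and 3; a third relevant doc is never retrieved.
print(average_precision(["a", "b", "c"], {"a", "c", "z"}))
# (1/1 + 2/3) / 3 = ~0.556
```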

4.1.6 Discounted cumulative gain

DCG uses a graded relevance scale of documents from the result set to evaluate the usefulness, or gain, of a document based on its position in the result list. The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically proportional to the position of the result.

The DCG accumulated at a particular rank position p is defined as:

$$ \mathrm{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i} $$


Since result sets may vary in size among different queries or systems, to compare performance the normalized version of DCG uses an ideal DCG (IDCG, obtained by sorting the documents of a result list by relevance) to normalize the score:

$$ \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} $$

    The nDCG values for all queries can be averaged to obtain a measure of the average performance

    of a ranking algorithm. Note that in a perfect ranking algorithm, the DCGp will be the same as

    the IDCGp producing an nDCG of 1.0. All nDCG calculations are then relative values on the

    interval 0.0 to 1.0 and so are cross-query comparable.
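Both definitions translate directly into code; a sketch using the $rel_1 + \sum_{i \ge 2} rel_i / \log_2 i$ form given above:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the ideal DCG: the same grades sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

print(ndcg([3, 2, 3, 0, 1]))  # < 1.0: a grade-3 document sits at rank 3
print(ndcg([3, 3, 2, 1, 0]))  # 1.0: the ranking is already ideal
```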

    4.2 Other Measures

    4.2.1 Mean reciprocal rank

Mean reciprocal rank is a statistic for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q:

$$ \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} $$


For example, suppose we have the following three sample queries for a system that tries to translate English words to their plurals. In each case, the system makes three guesses, with the first one being the one it thinks is most likely correct:

Query | Results | Correct response | Rank | Reciprocal rank
cat | catten, cati, cats | cats | 3 | 1/3
torus | torii, tori, toruses | tori | 2 | 1/2
virus | viruses, virii, viri | viruses | 1 | 1

Given those three samples, we could calculate the mean reciprocal rank as (1/3 + 1/2 + 1)/3 = 11/18, or about 0.61.

This basic definition does not specify what to do if:

1. None of the proposed results are correct (use mean reciprocal rank 0), or
2. There are multiple correct answers in the list (consider using mean average precision, MAP).
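The worked example above can be reproduced in a few lines:

```python
def mean_reciprocal_rank(guess_lists, correct_answers):
    """Average of 1/rank of the first correct answer per query; a query
    with no correct answer in its list contributes 0, per the convention
    noted above."""
    total = 0.0
    for guesses, correct in zip(guess_lists, correct_answers):
        for rank, guess in enumerate(guesses, start=1):
            if guess == correct:
                total += 1.0 / rank
                break
    return total / len(guess_lists)

guesses = [["catten", "cati", "cats"],
           ["torii", "tori", "toruses"],
           ["viruses", "virii", "viri"]]
print(mean_reciprocal_rank(guesses, ["cats", "tori", "viruses"]))
# 11/18 = ~0.611
```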

4.2.2 Spearman's rank correlation coefficient

In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles Spearman and often denoted by the Greek letter $\rho$ (rho) or as $r_s$, is a non-parametric measure of statistical dependence between two variables. The Spearman correlation coefficient is often thought of as being the Pearson correlation coefficient between the ranked variables. In practice, however, a simpler procedure is normally used to calculate $\rho$. The n raw scores $X_i, Y_i$ are


converted to ranks $x_i, y_i$, and the differences $d_i = x_i - y_i$ between the ranks of each observation on the two variables are calculated.

If there are no tied ranks, then $\rho$ is given by:

$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$

    If tied ranks exist, Pearson's correlation coefficient between ranks should be used for the

    calculation. One has to assign the same rank to each of the equal values. It is an average of their

    positions in the ascending order of the values.

A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. In contrast, this does not give a perfect Pearson correlation.


When the data are roughly elliptically distributed and there are no prominent outliers, the Spearman correlation and the Pearson correlation give similar values.

The Spearman correlation is less sensitive than the Pearson correlation to strong outliers that are in the tails of both samples.


A positive Spearman correlation coefficient corresponds to an increasing monotonic trend between X and Y. A negative Spearman correlation coefficient corresponds to a decreasing monotonic trend between X and Y.
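The no-ties formula is easy to check numerically. The sketch below assumes no tied values, per the caveat above; with ties one computes Pearson's correlation on the ranks instead.

```python
def spearman_rho(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n(n^2 - 1)), d_i = rank difference."""
    def ranks(values):
        return {v: i + 1 for i, v in enumerate(sorted(values))}
    rx, ry = ranks(x), ranks(y)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    n = len(x)
    return 1 - 6 * d2 / (n * (n**2 - 1))

x = [1, 2, 3, 4, 5]
y = [v**3 for v in x]        # monotonic but non-linear relationship
print(spearman_rho(x, y))    # 1.0, though Pearson's r would be < 1
```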

5. MODELS OF INFORMATION RETRIEVAL

    5.1 MODEL TYPES

For information retrieval to be efficient, the documents are typically transformed into a suitable representation. There are several representations; common models can be categorized according to two dimensions: the mathematical basis and the properties of the model.

5.1.1 First dimension: mathematical basis

1. Set-theoretic models represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are:
   i. Standard Boolean model
   ii. Extended Boolean model
   iii. Fuzzy retrieval


2. Algebraic models represent documents and queries usually as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value (see the vector space sketch after this list).
   i. Vector space model
   ii. Generalized vector space model
   iii. (Enhanced) Topic-based Vector Space Model
   iv. Extended Boolean model
   v. Latent semantic indexing, a.k.a. latent semantic analysis

3. Probabilistic models treat the process of document retrieval as probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like Bayes' theorem are often used in these models.
   i. Binary Independence Model
   ii. Probabilistic relevance model, on which the Okapi (BM25) relevance function is based
   iii. Uncertain inference
   iv. Language models
   v. Divergence-from-randomness model
   vi. Latent Dirichlet allocation

4. Machine-learned ranking models view documents as vectors of ranking features (some of which often incorporate other ranking models mentioned above) and try to find the best way to combine these features into a single relevance score by machine learning methods.
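As a concrete taste of the algebraic family, the vector space model scores a document by the cosine of the angle between its term vector and the query's term vector. This sketch uses raw term counts; real systems would use tf/idf components.

```python
import math
from collections import Counter

def cosine_similarity(doc_tokens, query_tokens):
    """Cosine of the angle between two term-count vectors: 1.0 means the
    vectors point the same way, 0.0 means no shared terms."""
    d, q = Counter(doc_tokens), Counter(query_tokens)
    dot = sum(d[t] * q[t] for t in q)
    norm = (math.sqrt(sum(c * c for c in d.values())) *
            math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0

doc = "information retrieval models rank documents".split()
print(cosine_similarity(doc, "retrieval models".split()))  # ~0.63
```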


5.1.2 Second dimension: properties of the model

1. Models without term interdependencies treat different terms/words as independent. This is usually represented in vector space models by the orthogonality assumption of term vectors, and in probabilistic models by an independence assumption for term variables.

2. Models with immanent term interdependencies allow a representation of interdependencies between terms. However, the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.

3. Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not allege how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human or a sophisticated algorithm).

6. PROBLEMS IN INFORMATION RETRIEVAL

6.1 Problem 1. Assisting the user in clarifying and analyzing the problem and determining information needs.

i. Clarifying and analyzing the problem.
ii. Determining what part of the problem solution can be affected by the system and what part is left to the user.
iii. Determining what knowledge the user requires for her part in the problem solution.
iv. Determining what the user knows already.
v. Deducing what information is necessary to lead the user from her present knowledge state to the required knowledge state.


6.2 Problem 2. Knowing how people use and process information.

i. Assembling a package of information that enables the user to come closer to a solution of his problem.
ii. Relationship of information use to the problem-solving/decision-making process.
iii. How do people make relevance judgments?
iv. How do people organize information in their minds, acquire it, and process it for output?

6.3 Problem 3. Knowledge representation.

i. Choosing the general approach to knowledge representation.
ii. Constructing a conceptual schema.
iii. Constructing a list of values for each entity type, or rules for generating such values.
iv. Knowledge/data acquisition and assimilation.
v. Representing uncertainty.


vi. Developing fine-grained information systems.

6.4 Problem 4. Procedures for processing knowledge/information.

i. Transformation from one representation to another.
ii. Translation from one natural language to another.
iii. Translation from natural language to a formal representation.
iv. Translation from a formal representation to natural language.
v. Translation from one formal representation into another.
vi. Expression of data in tabular or graphic form adapted to the user's purpose.
vii. Removing redundancy.
viii. Summarizing.
ix. Computing statistics.
x. Deriving generalizations.
xi. Drawing inferences.
xii. Search and selection.
xiii. Indexing: attaching predictive clues.


6.5 Problem 5. The human-computer interface.

i. Functions in the human-computer interface.
ii. Formal design of the human-computer interface.
iii.


REFERENCES

o Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Cambridge University Press, England. Online edition (c) 2009 Cambridge UP.
o Database System Concepts, 5th Edition. Silberschatz, Korth and Sudarshan.
o http://www.google.com
o http://www.wikipedia.com