Information Retrieval



SECTION CONTENTS

Preface
Acknowledgement
1. Introduction
2. Applications of IR
3. Web search engine
4. Performance measures
5. Models of IR
6. Problems in IR
References


i. Preface
ii. Acknowledgment

1. Introduction
1.1 Definition
1.2 History
1.3 Purpose
1.4 Basic IR system architecture
1.5 Databases v/s. IR
1.6 How IR systems work
1.7 The process of IR

2. Applications of Information Retrieval
2.1 General applications of IR
2.2 Domain specific applications of IR
2.3 Other retrieval methods

3. Web search engine
3.1 Overview & Definition


3.2 Popular search engines
3.3 Google Search
3.4 Yahoo Search
3.5 Bing Search
3.6 How search engines work

4. Performance measures
4.1 Performance measures
4.2 Other measures

5. Models of Information Retrieval
5.1 Model Types

6. Problems in Information Retrieval
6.1 Problem 1
6.2 Problem 2
6.3 Problem 3
6.4 Problem 4
6.5 Problem 5

iii. References

1. INTRODUCTION

1.1 DEFINITION

Information Retrieval (IR), also called information storage and retrieval (ISR or ISAR) or information organization and retrieval, is the science of searching for documents, for information within documents, and for metadata about documents, as well as of searching relational databases and the World Wide Web. The central challenge is to retrieve what is useful while leaving behind what is not. The process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection; instead, several objects may match the query, perhaps with different degrees of relevancy.

An object is an entity that is represented by information in a database. User queries are matched against the database information. Depending on the application, the data objects may be, for example, text documents, images, audio, mind maps or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata.

Most IR systems compute a numeric score for how well each object in the database matches the query, and rank the objects according to this value. The top-ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.
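This scoring-and-ranking loop can be sketched in a few lines of Python. The sketch below is a toy illustration, not any real system's ranking: the `overlap` scorer is a stand-in for the weighting schemes discussed in later sections.

```python
def rank(documents, query, score):
    """Score every object in the collection against the query and return
    the matching objects sorted by descending score, as an IR system does
    before showing the top-ranking results to the user."""
    scored = sorted(((score(doc, query), doc_id)
                     for doc_id, doc in documents.items()), reverse=True)
    return [doc_id for s, doc_id in scored if s > 0]

def overlap(doc, query):
    """Toy score: how many query terms the document contains."""
    return len(set(doc.split()) & set(query.split()))

docs = {1: "web search engines", 2: "image retrieval", 3: "web image search"}
print(rank(docs, "web search", overlap))  # [3, 1]; doc 2 matches nothing
```

If the user refines the query, the same loop simply runs again with the new query string.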


1.2 HISTORY

The idea of using computers to search for relevant pieces of information was popularized in the article "As We May Think" by Vannevar Bush in 1945. The first automated information retrieval systems were introduced in the 1950s and 1960s. By 1970 several different techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.

1.3 PURPOSE

1.3.1 To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.

1.3.2 To allow more flexible matching operations.

1.3.3 To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.


1.4 BASIC IR SYSTEM ARCHITECTURE

[Figure: Components of an IR system. The user's information need becomes a query to the search engine, which consults an index built over the documents (kept current through additions and deletions) and returns a result.]


The figure above illustrates the major components in an IR system. Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic, particularly when it is presented in written form as part of a test collection for IR evaluation.

The user's query is processed by a search engine, which may be running on the user's local machine, on a large cluster of machines in a remote geographic location, or anywhere in between.

    A major task of a search engine is to maintain and manipulate an inverted index for a document

    collection. As its basic function, an inverted index provides a mapping between terms and the

    locations in the collection in which they occur.

    To support relevance ranking algorithms, the search engine maintains collection statistics

    associated with the index, such as the number of documents containing each term and the length

    of each document.
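As a concrete illustration, here is a minimal sketch of such an index in Python. The structure and helper names are illustrative, not any particular engine's internals; it records, for each term, the documents and positions where the term occurs, together with the collection statistics just mentioned.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index mapping term -> {doc_id: [positions]},
    plus the statistics a ranking algorithm needs: the number of
    documents containing each term and the length of each document."""
    index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        terms = text.lower().split()
        doc_lengths[doc_id] = len(terms)
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)
    doc_freq = {term: len(postings) for term, postings in index.items()}
    return index, doc_freq, doc_lengths

index, df, lengths = build_index({1: "information retrieval systems",
                                  2: "database systems"})
print(dict(index["systems"]))  # {1: [2], 2: [1]}: locations per document
print(df["systems"], lengths)  # 2 documents contain "systems"
```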

    In addition, the search engine usually has access to the original content of the documents, in

    order to report meaningful results back to the user.

    Using the inverted index, collection statistics, and other data, the search engine accepts queries

    from its users, processes these queries, and returns ranked lists of results. To perform relevance

    ranking, the search engine computes a score, sometimes called a retrieval status value (RSV), for

    each document.

    After sorting documents according to their scores, the result list may be subjected to further

    processing, such as the removal of duplicate or redundant results.


1.5 DATABASES v/s. INFORMATION RETRIEVAL

DATABASES:
o We know the schema in advance, so the semantic correlation between queries and data is clear.
o We can get exact answers.
o Strong theoretical foundation (at least with relational databases).

IR:
o No schema, but rather unstructured natural language text. The result is that there is not a clear semantic correlation between queries and data.
o We get inexact, estimated answers.
o Theory not well understood (especially natural language processing).


1.6 HOW INFORMATION RETRIEVAL SYSTEMS WORK

IR is a component of an information system. An information system must make sure that everybody it is meant to serve has the information needed to accomplish tasks, solve problems, and make decisions, no matter where that information is available. To this end, an information system must:

1.6.1 Actively find out what users need. Determining user needs involves
(a) studying user needs in general as a basis for designing responsive systems (such as determining what information students typically need for assignments), and
(b) actively soliciting the needs of specific users, expressed as query descriptions, so that the system can provide the information.

1.6.2 Acquire documents (or computer programs, or products, or data items, and so on), resulting in a collection.

1.6.3 Match documents with needs.

Figuring out what information the user really needs to solve a problem is essential for successful retrieval. Matching involves taking a query description and finding relevant documents in the collection; this is the task of the IR system.

1.7 THE PROCESS OF IR

[Figure: diagram of the IR process, referenced in the text below.]

It will also involve performing the actual retrieval function, that is, executing the search strategy in response to a query. In the diagram, the documents have been placed in a separate box to emphasize the fact that they are not just input but can be used during the retrieval process in such a way that their structure is more correctly seen as part of the retrieval process.

Finally, we come to the output, which is usually a set of citations or document numbers. In an operational system the story ends here. In an experimental system, however, the evaluation of the output remains to be done.


2. APPLICATIONS OF IR

Areas where information retrieval techniques are employed can be divided into three categories:

2.1 General applications of information retrieval
2.2 Domain specific applications of information retrieval
2.3 Other retrieval methods

2.1 General applications of Information Retrieval

1. Digital libraries - A digital library is a library in which collections are stored in digital formats and accessible by computers. The digital content may be stored locally, or accessed remotely via computer networks. A digital library is a type of information retrieval system.

2. Information filtering - An information filtering system is a system that removes redundant or unwanted information from an information stream using (semi)automated or computerized methods prior to presentation to a human user.
   i. Recommender systems - Recommender systems work from a specific type of information filtering technique that attempts to recommend information items (movies, TV programs/shows/episodes, video on demand, music, books, news, images, web pages, scientific literature such as research papers, etc.) that are likely to be of interest to the user.


3. Media search
   i. Image retrieval - An image retrieval system is a computer system for browsing, searching and retrieving images from a large database of digital images.
   ii. Music retrieval - Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music.
   iii. News search
   iv. Speech retrieval
   v. Video retrieval

4. Search engines - A search engine is designed to help find information stored on a computer system.
   i. Desktop search - Desktop search is the name for the field of search tools which search the contents of a user's own computer files, rather than searching the Internet. These tools are designed to find information on the user's PC, including web browser histories, e-mail archives, text documents, sound files, images and video.
   ii. Enterprise search - Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.
   iii. Federated search - Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources.
   iv. Mobile search - Mobile search is an evolving branch of information retrieval services that is centered on the convergence of mobile platforms, mobile phones and other mobile devices.
   v. Social search - Social search or a social search engine is a type of web search that takes into account the social graph of the person initiating the search query.


   vi. Web search - A web search engine is designed to search for information on the World Wide Web and FTP servers. The search results are generally presented in a list of results and are often called hits.

2.2 Domain specific applications of Information Retrieval

1. Geographic information retrieval - Geographic Information Retrieval (GIR) or Geographical Information Retrieval is the augmentation of Information Retrieval with geographic metadata.

2. Legal information retrieval - Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works.

3. Vertical search - A vertical search engine, as distinct from a general Web search engine, focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content. Common examples include legal, medical, patent (intellectual property), travel, and automobile search engines.


2.3 Other retrieval methods

Methods/techniques in which information retrieval techniques are employed include:

1. Adversarial information retrieval - Adversarial information retrieval (adversarial IR) is a topic in information retrieval related to strategies for working with a data source where some portion of it has been manipulated maliciously.

2. Automatic summarization - Automatic summarization is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text.
   i. Multi-document summarization - Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic.

3. Compound term processing - Compound term processing is the name that is used for a category of techniques in information retrieval applications that performs matching on the basis of compound terms. Compound terms are built by combining two (or more) simple terms; for example, "triple" is a single-word term but "triple heart bypass" is a compound term.

4. Cross-lingual retrieval - Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query. For example, a user may pose their query in English but retrieve relevant documents written in French.


5. Document classification - Document classification/categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents.

6. Spam filtering - Email filtering is the processing of e-mail to organize it according to specified criteria. Most often this refers to the automatic processing of incoming messages, but the term also applies to the intervention of human intelligence in addition to anti-spam techniques, and to outgoing emails as well as those being received.

7. Question answering - In information retrieval and natural language processing (NLP), question answering (QA) is the task of automatically answering a question posed in natural language.

3. WEB SEARCH ENGINE

3.1 SEARCH ENGINE / WEB SEARCH ENGINE: DEFINITION & OVERVIEW

A search engine is the popular term for an Information Retrieval (IR) system designed to help find information stored on a computer system, such as on the World Wide Web (WWW), inside a corporate or proprietary network, or in a personal computer. The search engine allows one to ask for content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of items that match those criteria.

Regular users of Web search engines casually expect to receive accurate and near-instantaneous answers to questions and requests merely by entering a short query (a few words) into a text box and clicking on a search button. Underlying this simple and intuitive interface are clusters of computers, comprising thousands of machines, working cooperatively to generate a ranked list of those Web pages that are likely to satisfy the information need embodied in the query. These machines identify a set of Web pages containing the terms in the query, compute a score for each page, eliminate duplicate and redundant pages, generate summaries of the remaining pages, and finally return the summaries and links back to the user for browsing.

In order to achieve the sub-second response times expected from Web search engines, they incorporate layers of caching and replication, taking advantage of commonly occurring queries and exploiting parallel processing, allowing them to scale as the number of Web pages and users increases. In order to produce accurate results, they store a snapshot of the Web. This snapshot must be gathered and refreshed constantly by a Web crawler, also running on a cluster of hundreds or thousands of machines, periodically (perhaps once a week) downloading a fresh copy of each page. Pages that contain rapidly changing information of high quality, such as news services, may be refreshed daily or hourly.
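The crawl cycle itself reduces to a simple loop. The following is a toy single-machine sketch (using the third-party `requests` and `beautifulsoup4` packages); a production crawler would additionally respect robots.txt, partition the frontier across machines, and schedule per-page refresh intervals.

```python
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10, delay=1.0):
    """Breadth-first toy crawl: fetch a page, store a timestamped copy
    in the snapshot, queue its outgoing links, and repeat."""
    frontier = deque([seed_url])
    snapshot = {}                       # url -> (fetch_time, html)
    while frontier and len(snapshot) < max_pages:
        url = frontier.popleft()
        if url in snapshot:
            continue                    # already fetched in this pass
        html = requests.get(url, timeout=10).text
        snapshot[url] = (time.time(), html)
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            frontier.append(urljoin(url, link["href"]))
        time.sleep(delay)               # politeness delay between requests
    return snapshot
```

Re-running the loop produces a fresh snapshot, which is the "refresh" half of the crawler's job.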

Consider a simple example. If you have a computer connected to the Internet nearby, pause for a minute to launch a browser and try the query "information retrieval" on one of the major


commercial Web search engines. It is likely that the search engine responded in well under a second. Take some time to review the top ten results. Each result lists the URL for a Web page and usually provides a title and a short snippet of text extracted from the body of the page. Overall, the results are drawn from a variety of different Web sites and include sites associated with leading textbooks, journals, conferences, and researchers. As is common for informational queries such as this one, the Wikipedia article may be present. Do the top ten results contain anything inappropriate? Could their order be improved? Have a look through the next ten results and decide whether any one of them could better replace one of the top ten results.

Now, consider the millions of Web pages that contain the words "information" and "retrieval". This set of pages includes many that are relevant to the subject of information retrieval but are much less general in scope than those that appear in the top ten, such as student Web pages and individual research papers. In addition, the set includes many pages that just happen to contain these two words, without having any direct relationship to the subject. From these millions of possible pages, a search engine's ranking algorithm selects the top-ranked pages based on a variety of features, including the content and structure of the pages (e.g., their titles), their relationship to other pages (e.g., the hyperlinks between them), and the content and structure of the Web as a whole. For some queries, characteristics of the user, such as her geographic location or past searching behavior, may also play a role. Balancing these features against each other in order to rank pages by their expected relevance to a query is an example of relevance ranking.


3.2 POPULAR SEARCH ENGINES

Google www.google.com
Bing www.bing.com
Ask www.ask.com

3.2.1 Web Directories

Web directories are human-compiled indexes of sites, which are then categorized. The fact that your site is reviewed by an editor before being placed in the index means that getting listed in a directory is often quite difficult. Consequently, having a listing in a directory will guarantee you a good amount of well-targeted visitors. Most search engines will rank you higher if they find your site in one of the directories below.

    Yahoo www.yahoo.com

    3.2.2 Open Directory

    The Open Directory is another human-compiled directory, but one where any Internet user can

    become an editor and be responsible for some part of the index. Many other services use Open

    Directory listings, including Google, Netscape, Lycos, AOLsearch, AltaVista and HotBot.

    dmoz.com


    3.2.3 And the rest...

    You can submit your site to these search engines if you want. Most of these companies are

    struggling on this new Google-dominated web.

    Lycos www.lycos.com

    FAST Search www.alltheweb.com

    AOL Search search.aol.com

    AltaVista www.altavista.com

    DogPile www.dogpile.com


3.2.4 The three most widely used web search engines and their approximate share

[Figure: chart of the approximate market share of the three most widely used web search engines.]


3.3 Google Search

URL: www.google.com (list of domain names)
Commercial?: Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (124)
Owner: Google
Created by: Sergey Brin and Larry Page
Launched: September 15, 1997
Alexa rank: 1
Revenue: From AdWords
Current status: Active


3.4 Yahoo! Search

URL: search.yahoo.com
Commercial?: Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (40)
Owner: Yahoo!
Created by: Yahoo!
Launched: March 1, 1995
Alexa rank: 4
Current status: Active


3.5 Bing (search engine)

URL: www.bing.com
Slogan: Bing & decide
Commercial?: Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (40)
Owner: Microsoft
Created by: Microsoft
Launched: June 1, 2009
Alexa rank: 23
Current status: Active


3.6 HOW SEARCH ENGINES WORK

Search engines match queries against an index that they create. The index consists of the words in each document, plus pointers to their locations within the documents. This is called an inverted file.

A search engine or IR system comprises four essential modules:

3.6.1 A document processor
3.6.2 A query processor
3.6.3 A search and matching function
3.6.4 A ranking capability

3.6.1 Document Processor

o The document processor prepares, processes, and inputs the documents, pages, or sites that users search against.

o Steps of the Document Processor
1. Normalizes the document stream.
2. Breaks the document stream into desired retrievable units.
3. Isolates and metatags subdocument pieces.
4. Identifies potential indexable elements in documents.
5. Deletes stop words.
6. Stems terms.
7. Extracts index entries.


8. Computes weights.
9. Creates and updates the main inverted file against which the search engine searches in order to match queries.

Steps 1 to 3

Preprocessing.

o While essential and potentially important in affecting the outcome of a search, these first three steps simply standardize the multiple formats encountered when deriving documents from various providers or handling various Web sites.

o The steps serve to merge all the data into a single consistent data structure that all the downstream processes can handle. The need for a well-formed, consistent format is of relative importance in direct proportion to the sophistication of later steps of document processing.

o Step 2 is important because the pointers stored in the inverted file will enable a system to retrieve various sized units: site, page, document, section, paragraph, or sentence.

Step 4

Identify elements to index.

o Identifying potential indexable elements in documents dramatically affects the nature and quality of the document representation that the engine will search against.

o In designing the system, we must define the word "term." Is it the alpha-numeric characters between blank spaces or punctuation? If so, what about non-compositional phrases (phrases in which the separate words do not convey the meaning of the phrase, like "skunk works" or "hot dog"), multi-word proper names, or inter-word symbols such as hyphens or apostrophes that can denote the difference between "small business men" versus "small-business men"?


o Each search engine depends on a set of rules that its document processor must execute to determine what action is to be taken by the "tokenizer," i.e. the software used to define a term suitable for indexing.

Step 5

Deleting stop words.

o This step helps save system resources by eliminating from further processing, as well as potential matching, those terms that have little value in finding useful documents in response to a customer's query.

o This step used to matter much more than it does now, when memory has become so much cheaper and systems so much faster, but since stop words may comprise up to 40 percent of text words in a document, it still has some significance.

o A stop word list typically consists of those word classes known to convey little substantive meaning, such as
  o articles (a, the),
  o conjunctions (and, but),
  o interjections (oh, but),
  o prepositions (in, over),
  o pronouns (he, it), and
  o forms of the "to be" verb (is, are).
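In code, this step is just a set-membership filter. A minimal sketch follows; the tiny stop list only mirrors the word classes listed above, whereas real lists run to a few hundred words.

```python
STOP_WORDS = {"a", "the", "and", "but", "oh", "in", "over", "he", "it",
              "is", "are"}

def delete_stop_words(tokens):
    """Drop tokens that convey little substantive meaning before indexing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(delete_stop_words("the engine is fast and accurate".split()))
# ['engine', 'fast', 'accurate']
```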

Step 6

Term Stemming.

o Stemming removes word suffixes. The process has two goals.


o In terms of efficiency, stemming reduces the number of unique words in the index, which in turn reduces the storage space required for the index and speeds up the search process.

o In terms of effectiveness, stemming improves recall by reducing all forms of the word to a base or stemmed form.

o For example, if a user asks for "analyze," they may also want documents which contain analysis, analyzing, analyzer, analyzes, and analyzed. Therefore, the document processor stems document terms to "analyze-" so that documents which include various forms of "analyze-" will have equal likelihood of being retrieved.
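A crude suffix-stripping stemmer shows the idea. This sketch is not the Porter algorithm that real engines typically use, and its suffix list is invented for the example.

```python
SUFFIXES = ("izing", "izer", "ized", "izes", "ing", "ed", "er", "es", "s", "e")

def stem(word):
    """Strip the first matching suffix so that inflected forms of a word
    collapse onto one index entry (e.g. analyzes/analyzed -> analyz)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["analyze", "analyzing", "analyzer", "analyzes", "analyzed"]:
    print(w, "->", stem(w))   # every form maps to "analyz"
```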

Step 7

Extract index entries.

o Having completed steps 1 through 6, the document processor extracts the remaining entries from the original document. For example, the following paragraph shows the full text sent to a search engine for processing:

  o Milosevic's comments, carried by the official news agency Tanjug, cast doubt over the governments at the talks, which the international community has called to try to prevent an all-out war in the Serbian province. "President Milosevic said it was well known that Serbia and Yugoslavia were firmly committed to resolving problems in Kosovo, which is an integral part of Serbia, peacefully in Serbia with the participation of the representatives of all ethnic communities," Tanjug said. Milosevic was speaking during a meeting with British Foreign Secretary Robin Cook, who delivered an ultimatum to attend negotiations in a week's time on an autonomy proposal for Kosovo with ethnic Albanian leaders from the province. Cook earlier told a conference that Milosevic had agreed to study the proposal.

o Steps 1 to 6 reduce this text for searching to the following:

  o Milosevic comm carri offic new agen Tanjug cast doubt govern talk interna commun call try prevent all-out war Serb province President Milosevic said well


    known Serbia Yugoslavia firm commit resolv problem Kosovo integr part Serbia

    peace Serbia particip representa ethnic commun Tanjug said Milosevic speak

    meeti British Foreign Secretary Robin Cook deliver ultimat attend negoti week

    time autonomy propos Kosovo ethnic Alban lead province Cook earl told

    conference Milosevic agree study propos.

o The output of step 7 is then inserted and stored in an inverted file that lists the index entries and an indication of their position and frequency of occurrence.

o The specific nature of the index entries, however, will vary based on the decision in Step 4 concerning what constitutes an indexable term.

o Document processors will have phrase recognizers, as well as named entity recognizers and categorizers, to ensure index entries such as Milosevic are tagged as a Person and entries such as Yugoslavia and Serbia as Countries.

Step 8

Term weight assignment.

o Weights are assigned to terms in the index file. The simplest of search engines just assign a binary weight: 1 for presence and 0 for absence.

o Measuring the frequency of occurrence of a term in the document creates more sophisticated weighting, with length-normalization of frequencies still more sophisticated. Extensive experience in information retrieval research over many years has clearly demonstrated that the optimal weighting comes from use of "tf/idf." This algorithm measures the frequency of occurrence of each term within a document. Then it compares that frequency against the frequency of occurrence in the entire database.

o A simple example would be the word "the." This word appears in too many documents to help distinguish one from another. A less obvious example would be the word "antibiotic." In a sports database, when we compare each document to the database as a whole, the term "antibiotic" would probably be a good discriminator among documents, and therefore would be assigned a high weight. Conversely, in a database devoted to health or medicine, "antibiotic" would probably be a poor discriminator, since it occurs


    very often. The TF/IDF weighting scheme assigns higher weights to those terms that

    really distinguish one document from the others.
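The tf/idf idea fits in a few lines. The sketch below uses raw counts and a plain logarithm, whereas real systems use smoothed variants of both factors.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight = (length-normalized term frequency) * log(N / doc frequency).
    Terms that occur in every document get weight 0; terms rare across the
    collection but frequent inside one document get the highest weights."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [{term: (count / len(tokens)) * math.log(n / df[term])
             for term, count in Counter(tokens).items()}
            for tokens in tokenized]

weights = tf_idf(["the antibiotic cured the infection",
                  "the team won the game",
                  "the game was close"])
print(weights[0]["antibiotic"])  # high: rare across the collection
print(weights[0]["the"])         # 0.0: occurs in every document
```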

Step 9

Create index.

o The index or inverted file is the internal data structure that stores the index information and that will be searched for each query. Inverted files range from a simple listing of every alpha-numeric sequence in a set of documents/pages being indexed, along with the overall identifying numbers of the documents in which the sequence occurs, to a more linguistically complex list of entries, the tf/idf weights, and pointers to where inside each document the term occurs.

o The more complete the information in the index, the better the search result.
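A sketch of the richer end of that range follows, with one posting per (term, document) pair holding the term's weight and its in-document positions. The input names assume the outputs of the earlier steps and are illustrative only.

```python
from collections import defaultdict

def create_inverted_file(processed_docs, weights):
    """processed_docs: doc_id -> token list surviving steps 1-6;
    weights: doc_id -> {term: tf/idf weight} from Step 8.
    Returns term -> [(doc_id, weight, [positions]), ...]."""
    inverted = defaultdict(list)
    for doc_id, tokens in processed_docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            inverted[term].append((doc_id, weights[doc_id][term], pos_list))
    return inverted

inv = create_inverted_file({1: ["milosevic", "said", "milosevic"]},
                           {1: {"milosevic": 0.4, "said": 0.1}})
print(inv["milosevic"])  # [(1, 0.4, [0, 2])]: doc id, weight, positions
```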


3.6.2 Query Processor

o Query processing has seven possible steps, though a system can cut these steps short and proceed to match the query to the inverted file at any of a number of places during the processing. Document processing shares many steps with query processing.

o Steps of the Query Processor
The steps in query processing are as follows (with the option to stop processing and start matching indicated as "Matcher"):

1. Tokenize query terms.
2. Parse the query: recognize query terms vs. special operators.
   > Matcher
3. Delete stop words.
4. Stem words.
5. Create the query.
   > Matcher
6. Expand the query.
7. Compute weights.
   > Matcher


Step 1

Tokenizing.

o As soon as a user inputs a query, the search engine must tokenize the query stream, i.e., break it down into understandable segments. Usually a token is defined as an alpha-numeric string that occurs between white space and/or punctuation.

Step 2

Parsing.

o Since users may employ special operators in their query, including Boolean, adjacency, or proximity operators, the system needs to parse the query first into query terms and operators. These operators may occur in the form of reserved punctuation (e.g., quotation marks) or reserved terms in specialized format (e.g., AND, OR).
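A minimal tokenizer-plus-parser sketch is shown below; the quoted-phrase and AND/OR/NOT conventions here are illustrative, not any particular engine's syntax.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}

def parse_query(query):
    """Split a query into plain terms and reserved operators; a quoted
    string survives as a single phrase token."""
    tokens = re.findall(r'"[^"]+"|\S+', query)
    terms = [t.strip('"').lower() for t in tokens if t not in OPERATORS]
    ops = [t for t in tokens if t in OPERATORS]
    return terms, ops

print(parse_query('"information retrieval" AND evaluation'))
# (['information retrieval', 'evaluation'], ['AND'])
```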

Steps 3 & 4

Stop list and stemming.

o Some search engines will go further and stop-list and stem the query, similar to the processes described above in the Document Processor section. The stop list might also contain words from commonly occurring querying phrases, such as "I'd like information about." However, since most publicly available search engines encourage very short queries, as evidenced in the size of the query window provided, the engines may drop these two steps.


Step 7

Query term weighting (assuming more than one query term).

o The final step in query processing involves computing weights for the terms in the query. Sometimes the user controls this step by indicating either how much to weight each term or simply which term or concept in the query matters most and must appear in each retrieved document to ensure relevance.

o Leaving the weighting up to the user is not common, because research has shown that users are not particularly good at determining the relative importance of terms in their queries. They can't make this determination for several reasons. First, they don't know what else exists in the database, and document terms are weighted by being compared to the database as a whole. Second, most users seek information about an unfamiliar subject, so they may not know the correct terminology.

o Few search engines implement system-based query weighting, but some do an implicit weighting by treating the first term(s) in a query as having higher significance. The engines use this information to provide a list of documents/pages to the user.

o After this final step, the expanded, weighted query is searched against the inverted file of documents.

4. PERFORMANCE MEASURES

    4.1 Performance measures

    Many different measures for evaluating the performance of information retrieval systems have

    been proposed. The measures require a collection of documents and a query. All common

    measures described here assume a ground truth notion of relevancy: every document is known to

    be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and

    there may be different shades of relevancy.

    4.1.1 Precision

Precision is the fraction of the documents retrieved that are relevant to the user's information need:

$$ \mathrm{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} $$

    In binary classification, precision is analogous to positive predictive value. Precision takes all

    retrieved documents into account. It can also be evaluated at a given cut-off rank, considering

    only the topmost results returned by the system. This measure is called precision at n or P@n.

    Note that the meaning and usage of "precision" in the field of Information Retrieval differs from

    the definition of accuracy and precision within other branches of science and technology.


    4.1.2 Recall

Recall is the fraction of the documents that are relevant to the query that are successfully retrieved:

$$ \mathrm{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} $$

    In binary classification, recall is called sensitivity. So it can be looked at as the probability that a

    relevant document is retrieved by the query.

It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore recall alone is not enough; one also needs to measure the number of non-relevant documents retrieved, for example by computing the precision.

    4.1.3 Fall-Out

Fall-out is the proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

$$ \mathrm{fall\text{-}out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|} $$

In binary classification, fall-out is closely related to specificity: fall-out = 1 - specificity. It can be looked at as the probability that a non-relevant document is retrieved by the query.

    It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.


4.1.4 F-measure

The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score, is:

$$ F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $$

This is also known as the F1 measure, because recall and precision are evenly weighted.

The general formula for non-negative real $\beta$ is:

$$ F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}} $$

Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall.

The F-measure was derived by van Rijsbergen (1979) so that F "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure $E = 1 - 1/(\alpha/P + (1-\alpha)/R)$. Their relationship is $F_\beta = 1 - E$, where $\alpha = 1/(\beta^2 + 1)$.
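All four measures are one-liners once the retrieved and relevant document sets are known; a sketch over toy document-id sets:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def fall_out(retrieved, relevant, collection):
    """Fraction of non-relevant documents that were retrieved."""
    non_relevant = collection - relevant
    return len(retrieved & non_relevant) / len(non_relevant)

def f_measure(p, r, beta=1.0):
    """General F_beta; beta=1 gives the balanced F1 score."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

retrieved, relevant = {1, 2, 3, 4}, {2, 4, 5}
collection = set(range(1, 11))
p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(p, r, f_measure(p, r))                      # 0.5, ~0.667, ~0.571
print(fall_out(retrieved, relevant, collection))  # 2 of 7 non-relevant
```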

    4.1.5 Mean Average precision

    Precision and recall are single-value metrics based on the whole list of documents returned by

    the system. For systems that return a ranked sequence of documents, it is desirable to also


consider the order in which the returned documents are presented. Average precision emphasizes ranking relevant documents higher. It is the average of the precisions computed at the position of each relevant document in the ranked sequence:

$$ \mathrm{AveP} = \frac{\sum_{r=1}^{N} P(r) \cdot \mathrm{rel}(r)}{|\{\text{relevant documents}\}|} $$

where r is the rank, N the number of documents retrieved, rel(r) a binary function indicating the relevance of the document at rank r, and P(r) the precision at cut-off rank r:

$$ P(r) = \frac{|\{\text{relevant retrieved documents of rank} \le r\}|}{r} $$

    This metric is also sometimes referred to geometrically as the area under the Precision-Recall

    curve.

    Note that the denominator (number of relevant documents) is the number of relevant documents

    in the entire collection, so that the metric reflects performance over all relevant documents,

    regardless of a retrieval cutoff.
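Average precision translates directly into code; mean average precision is then just the mean of this value over a set of queries.

```python
def average_precision(ranking, relevant):
    """ranking: ordered list of returned doc ids; relevant: set of ALL
    relevant doc ids in the collection, so relevant documents that are
    never retrieved still lower the score."""
    hits, total = 0, 0.0
    for r, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / r        # P(r) at this relevant document
    return total / len(relevant)

# Relevant docs at ranks 1 and 3; a third relevant doc is never retrieved.
print(average_precision(["a", "b", "c"], {"a", "c", "z"}))
# (1/1 + 2/3) / 3 = ~0.556
```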

4.1.6 Discounted cumulative gain

DCG uses a graded relevance scale of documents from the result set to evaluate the usefulness, or gain, of a document based on its position in the result list. The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically proportional to the position of the result.

The DCG accumulated at a particular rank position p is defined as:

$$ \mathrm{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i} $$


Since result sets may vary in size among different queries or systems, to compare performance the normalized version of DCG uses an ideal DCG (IDCG, obtained by sorting the documents of a result list by relevance) to normalize the score:

$$ \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} $$

    The nDCG values for all queries can be averaged to obtain a measure of the average performance

    of a ranking algorithm. Note that in a perfect ranking algorithm, the DCGp will be the same as

    the IDCGp producing an nDCG of 1.0. All nDCG calculations are then relative values on the

    interval 0.0 to 1.0 and so are cross-query comparable.
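Both definitions translate directly into code; a sketch using the $rel_1 + \sum_{i \ge 2} rel_i / \log_2 i$ form given above:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the ideal DCG: the same grades sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

print(ndcg([3, 2, 3, 0, 1]))  # < 1.0: a grade-3 document sits at rank 3
print(ndcg([3, 3, 2, 1, 0]))  # 1.0: the ranking is already ideal
```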

    4.2 Other Measures

    4.2.1 Mean reciprocal rank

Mean reciprocal rank is a statistic for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q:

$$ \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} $$


For example, suppose we have the following three sample queries for a system that tries to translate English words to their plurals. In each case, the system makes three guesses, with the first one being the one it thinks is most likely correct:

Query | Results | Correct response | Rank | Reciprocal rank
cat | catten, cati, cats | cats | 3 | 1/3
torus | torii, tori, toruses | tori | 2 | 1/2
virus | viruses, virii, viri | viruses | 1 | 1

Given those three samples, we could calculate the mean reciprocal rank as (1/3 + 1/2 + 1)/3 = 11/18, or about 0.61.

This basic definition does not specify what to do if:

1. None of the proposed results are correct (use mean reciprocal rank 0), or
2. There are multiple correct answers in the list (consider using mean average precision, MAP).
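The worked example above can be reproduced in a few lines:

```python
def mean_reciprocal_rank(guess_lists, correct_answers):
    """Average of 1/rank of the first correct answer per query; a query
    with no correct answer in its list contributes 0, per the convention
    noted above."""
    total = 0.0
    for guesses, correct in zip(guess_lists, correct_answers):
        for rank, guess in enumerate(guesses, start=1):
            if guess == correct:
                total += 1.0 / rank
                break
    return total / len(guess_lists)

guesses = [["catten", "cati", "cats"],
           ["torii", "tori", "toruses"],
           ["viruses", "virii", "viri"]]
print(mean_reciprocal_rank(guesses, ["cats", "tori", "viruses"]))
# 11/18 = ~0.611
```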

4.2.2 Spearman's rank correlation coefficient

In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles Spearman and often denoted by the Greek letter $\rho$ (rho) or as $r_s$, is a non-parametric measure of statistical dependence between two variables. The Spearman correlation coefficient is often thought of as being the Pearson correlation coefficient between the ranked variables. In practice, however, a simpler procedure is normally used to calculate $\rho$. The n raw scores $X_i, Y_i$ are


converted to ranks $x_i, y_i$, and the differences $d_i = x_i - y_i$ between the ranks of each observation on the two variables are calculated.

If there are no tied ranks, then $\rho$ is given by:

$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$

    If tied ranks exist, Pearson's correlation coefficient between ranks should be used for the

    calculation. One has to assign the same rank to each of the equal values. It is an average of their

    positions in the ascending order of the values.

A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. In contrast, this does not give a perfect Pearson correlation.


When the data are roughly elliptically distributed and there are no prominent outliers, the Spearman correlation and the Pearson correlation give similar values.

The Spearman correlation is less sensitive than the Pearson correlation to strong outliers that are in the tails of both samples.


A positive Spearman correlation coefficient corresponds to an increasing monotonic trend between X and Y. A negative Spearman correlation coefficient corresponds to a decreasing monotonic trend between X and Y.
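The no-ties formula is easy to check numerically. The sketch below assumes no tied values, per the caveat above; with ties one computes Pearson's correlation on the ranks instead.

```python
def spearman_rho(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n(n^2 - 1)), d_i = rank difference."""
    def ranks(values):
        return {v: i + 1 for i, v in enumerate(sorted(values))}
    rx, ry = ranks(x), ranks(y)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    n = len(x)
    return 1 - 6 * d2 / (n * (n**2 - 1))

x = [1, 2, 3, 4, 5]
y = [v**3 for v in x]        # monotonic but non-linear relationship
print(spearman_rho(x, y))    # 1.0, though Pearson's r would be < 1
```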

5. MODELS OF INFORMATION RETRIEVAL

    5.1 MODEL TYPES

For information retrieval to be efficient, the documents are typically transformed into a suitable representation. There are several representations; common models can be categorized according to two dimensions: the mathematical basis and the properties of the model.

5.1.1 First dimension: mathematical basis

1. Set-theoretic models represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are:
   i. Standard Boolean model
   ii. Extended Boolean model
   iii. Fuzzy retrieval


2. Algebraic models represent documents and queries usually as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value (see the vector space sketch after this list).
   i. Vector space model
   ii. Generalized vector space model
   iii. (Enhanced) Topic-based Vector Space Model
   iv. Extended Boolean model
   v. Latent semantic indexing, a.k.a. latent semantic analysis

3. Probabilistic models treat the process of document retrieval as probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like Bayes' theorem are often used in these models.
   i. Binary Independence Model
   ii. Probabilistic relevance model, on which the Okapi (BM25) relevance function is based
   iii. Uncertain inference
   iv. Language models
   v. Divergence-from-randomness model
   vi. Latent Dirichlet allocation

4. Machine-learned ranking models view documents as vectors of ranking features (some of which often incorporate other ranking models mentioned above) and try to find the best way to combine these features into a single relevance score by machine learning methods.
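As a concrete taste of the algebraic family, the vector space model scores a document by the cosine of the angle between its term vector and the query's term vector. This sketch uses raw term counts; real systems would use tf/idf components.

```python
import math
from collections import Counter

def cosine_similarity(doc_tokens, query_tokens):
    """Cosine of the angle between two term-count vectors: 1.0 means the
    vectors point the same way, 0.0 means no shared terms."""
    d, q = Counter(doc_tokens), Counter(query_tokens)
    dot = sum(d[t] * q[t] for t in q)
    norm = (math.sqrt(sum(c * c for c in d.values())) *
            math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0

doc = "information retrieval models rank documents".split()
print(cosine_similarity(doc, "retrieval models".split()))  # ~0.63
```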


5.1.2 Second dimension: properties of the model

1. Models without term interdependencies treat different terms/words as independent. This is usually represented in vector space models by the orthogonality assumption of term vectors, and in probabilistic models by an independence assumption for term variables.

2. Models with immanent term interdependencies allow a representation of interdependencies between terms. However, the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.

3. Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not allege how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human or a sophisticated algorithm).

6. PROBLEMS IN INFORMATION RETRIEVAL

6.1 Problem 1. Assisting the user in clarifying and analyzing the problem and determining information needs.

i. Clarifying and analyzing the problem.
ii. Determining what part of the problem solution can be affected by the system and what part is left to the user.
iii. Determining what knowledge the user requires for her part in the problem solution.
iv. Determining what the user knows already.
v. Deducing what information is necessary to lead the user from her present knowledge state to the required knowledge state.


6.2 Problem 2. Knowing how people use and process information.

i. Assembling a package of information that enables the user to come closer to a solution of his problem.
ii. Relationship of information use to the problem-solving/decision-making process.
iii. How do people make relevance judgments?
iv. How do people organize information in their minds, acquire it, and process it for output?

6.3 Problem 3. Knowledge representation.

i. Choosing the general approach to knowledge representation.
ii. Constructing a conceptual schema.
iii. Constructing a list of values for each entity type, or rules for generating such values.
iv. Knowledge/data acquisition and assimilation.
v. Representing uncertainty.


vi. Developing fine-grained information systems.

6.4 Problem 4. Procedures for processing knowledge/information.

i. Transformation from one representation to another.
ii. Translation from one natural language to another.
iii. Translation from natural language to a formal representation.
iv. Translation from a formal representation to natural language.
v. Translation from one formal representation into another.
vi. Expression of data in tabular or graphic form adapted to the user's purpose.
vii. Removing redundancy.
viii. Summarizing.
ix. Computing statistics.
x. Deriving generalizations.
xi. Drawing inferences.
xii. Search and selection.
xiii. Indexing: attaching predictive clues.


6.5 Problem 5. The human-computer interface.

i. Functions in the human-computer interface.
ii. Formal design of the human-computer interface.
iii.


REFERENCES

o Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Cambridge University Press, England. Online edition (c) 2009 Cambridge UP.
o Database System Concepts, 5th Edition. Silberschatz, Korth and Sudarshan.
o http://www.google.com
o http://www.wikipedia.com