information retrieval 1


Upload: chethanm

Post on 11-Apr-2015


IR – Introduction Created by Chethan.M

Information Retrieval

ISiM Syllabus : MISM 623 – Information Retrieval Systems Course Objectives - This course examines information retrieval within the context of full-text datasets. The students should be able to understand and critique existing information retrieval systems and to design and build information retrieval systems themselves. The course will introduce students to traditional methods as well as recent advances in information retrieval (IR), handling and querying of textual data. The focus will be on newer techniques of processing and retrieving textual information, including hypertext documents available on the World Wide Web.

Course Outline
Topics covered include:

• IR Models
o Boolean Model
o Vector Space Model
o Relational DBMS
o Probabilistic Models
o Language Models

• Web Information Retrieval
o citation network analysis
o social collaboration (PageRank and HITS algorithms)

• Term Indexing
o Zipf's Law
o term weighting

• Searching and Data Structures
o Inverted files to support Boolean and Vector Models
o Clustering
  • non-hierarchical
    • single pass
    • reallocation
  • hierarchical agglomerative
o String Searching
o Tries, binary tries, binary digital tries, suffix trees, etc.

• Retrieval Effectiveness Evaluation
o Recall, Precision, Fallout
o Comparing systems using average precision

Course Readings: (Chethan is using)

ISiM


1. Modern Information Retrieval / by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2001
2. Introduction to Information Retrieval / Christopher D. Manning, Prabhakar

Example of IR: Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval.

What is Information Retrieval? Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Motivation: Information retrieval deals with the representation, storage, organization of, and access to information items. The representation and organization of the information items should provide the user with easy access to the information in which he is interested. Unfortunately, characterization of the user information need is not a simple problem.

Given the user query, the key goal of an IR system (search engine) is to retrieve information which might be useful or relevant to the user. The emphasis is on the retrieval of information as opposed to the retrieval of data.

Information Retrieval is...
• The indexing and retrieval of textual documents.
• Concerned firstly with retrieving documents relevant to a query.
• Concerned secondly with retrieving from large sets of documents efficiently.
• Selectivity
• Finding some desired info in a store of information
• IR = the process of selecting from a source
• IR and literature searching (finding documents)

Information Retrieval System: “An information retrieval system is a device interposed between a potential user of information & the information collection itself. For a given information problem, the purpose of the system is to capture wanted items & to filter out unwanted items.”

Information retrieval systems can be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales.

In web search, the system has to provide search over billions of documents stored on millions of computers.


At the other extreme is personal information retrieval. In the last few years, consumer operating systems have integrated information retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search).

In between is the space of enterprise, institutional, and domain-specific search, where retrieval might be provided for collections such as a corporation’s internal documents, a database of patents, or research articles on biochemistry.

Data Retrieval:
• Which documents contain a set of keywords
• Well defined semantics
• A single erroneous object implies failure.

Information Retrieval:
• Information about a subject or topic
• Semantics is frequently loose
• Small errors are tolerated
• NLP retrieval & unstructured data
• Ranking & relevance

IR System:
• Interpret contents of information items.
• Generate a ranking which reflects relevance.
• Notion of relevance is most important.

Data Retrieval vs. Information Retrieval

What we’re retrieving
– Databases: Structured data. Clear semantics based on a formal model.
– Information Retrieval: Mostly unstructured. Free text with some metadata.

Queries we’re posing
– Databases: Formally defined queries. Unambiguous.
– Information Retrieval: Vague, imprecise information needs (often expressed in natural language).

Results we get
– Databases: Exact. Always correct in a formal sense.
– Information Retrieval: Sometimes relevant, often not.

Interaction with system
– Databases: One-shot queries.
– Information Retrieval: Interaction is important (relevance feedback).
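The contrast between the two kinds of retrieval can be sketched in a few lines of Python. This is only an illustration under assumptions: the records, documents, and the function names `data_retrieval` and `information_retrieval` are invented here and do not come from the course material. The first function answers an exact, formally defined query over structured records; the second ranks free-text documents by a simple word-overlap score.

```python
# Hypothetical toy data, invented for illustration only.
records = [
    {"id": 1, "name": "Asha", "dept": "MISM"},
    {"id": 2, "name": "Ravi", "dept": "MCA"},
]

documents = {
    "d1": "information retrieval ranks documents by relevance",
    "d2": "a database stores structured records",
    "d3": "retrieval of information from text collections",
}


def data_retrieval(rows, field, value):
    """Exact match on a structured field: a row satisfies the query or it does not."""
    return [row for row in rows if row[field] == value]


def information_retrieval(docs, query):
    """Rank free-text documents by how many distinct query words they contain."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

The exact-match function has no notion of "almost right", while the ranking function returns every document that shares at least one word with the query, best match first.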

Text Database vs. Database

1. Text database: emphasis on retrieval processing. Database: transaction processing.
2. Text database: no data updates. Database: data updates.
3. Text database: no data integrity. Database: data integrity.
4. Text database: no data structure (book, web page). Database: data structure (student record, registration data).

History (Past)

• 1960-70’s:
– Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents.
– Development of the basic Boolean and vector-space models of retrieval.
– Prof. Salton and his students at Cornell University are the leading researchers in the area.

• 1980’s:
– Large document database systems, many run by companies:
  • Lexis-Nexis
  • Dialog
  • MEDLINE

• 1990’s:
– Searching FTPable documents on the Internet
  • Archie
  • WAIS
– Searching the World Wide Web
  • Lycos
  • Yahoo
  • Altavista

• 1990’s continued:
– Organized Competitions
  • NIST TREC
– Recommender Systems
  • Ringo
  • Amazon


  • NetPerceptions
– Automated Text Categorization & Clustering

• 2000’s:
– Link analysis for Web Search
  • Google
– Automated Information Extraction
  • Whizbang
  • Fetch
  • Burning Glass
– Question Answering
  • TREC Q/A track

• 2000’s continued:
– Multimedia IR
  • Image
  • Video
  • Audio and music
– Cross-Language IR
  • DARPA Tides
– Document Summarization

Present

Source of data
• Electronic library
• University documents
• Online data (web sites)

Example: AltaVista, Google, etc.

Past, Present and Future
• The library is the first organization for IR
• Indexes are assigned by academic and private institutions
• Searching technique (past: in library)
– Title, subject
– Hierarchical search systems (e.g. Dewey Decimal), controlled vocabularies, collections of abstracts
• Searching technique (present: in library)
– Department (of a faculty), term index
– Development of user interface formats


– Electronic service
– Hypertext service

Related Research Areas of IR (Future)
• Electronic Commerce on Web (Digital Library Online)
• Database Management
• Library and Information Science
• Artificial Intelligence (AI)
• Natural Language Processing (NLP)
• Machine Learning (ML)

Typical IR Task
• Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to the query.


Relevance
• Relevance is a subjective judgment and may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her intended use of the information (information need).
• Much of IR depends upon the idea that
– Similar vocabulary -> relevant to same queries
• Usually look for documents matching query words
• “Similar” can be measured in many ways
– String matching/comparison
– Same vocabulary used
– Probability that documents arise from same model
– Same meaning of text
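As one possible reading of "same vocabulary used", the sketch below scores two texts by the overlap of their word sets (the Jaccard coefficient). The notes do not prescribe this measure; it is chosen here only as a simple illustration, and the function names are assumptions.

```python
def vocabulary(text):
    """Return the set of lowercase whitespace-separated words in the text."""
    return set(text.lower().split())


def jaccard(text_a, text_b):
    """Shared vocabulary divided by combined vocabulary, between 0.0 and 1.0."""
    a, b = vocabulary(text_a), vocabulary(text_b)
    if not a and not b:
        return 0.0  # two empty texts: define the similarity as 0 by convention
    return len(a & b) / len(a | b)
```

For example, `jaccard("cheap italian restaurant", "italian restaurant reviews")` shares two of four distinct words, so it evaluates to 0.5.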

Keyword Search
• Simplest notion of relevance is that the query string appears verbatim in the document.
• Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
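The two notions of keyword relevance above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and case-insensitive matching; the function names are invented for this example.

```python
from collections import Counter


def verbatim_match(query, document):
    """Strict notion: the whole query string appears verbatim in the document."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(document.lower().split())
    return sum(counts[word] for word in query.lower().split())
```

With the document "retrieval of information is information retrieval", the query "retrieval information" fails the verbatim test because the words never appear in that order, yet it still receives a bag-of-words score of 4.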

[Figure: a query string and a document corpus are input to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]

Problems with Keywords

• May not retrieve relevant documents that include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
– “bat” (baseball vs. mammal)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)
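Both problems show up even in a tiny example. The sketch below uses three invented documents and a hypothetical helper, `keyword_hits`, that returns the documents containing every query word; it has no knowledge of synonyms or word senses.

```python
# Hypothetical documents, invented for illustration only.
docs = {
    "d1": "a cosy café near the station",
    "d2": "the bat flew out of the cave",
    "d3": "he swung the bat and hit a home run",
}


def keyword_hits(query, documents):
    """Return ids of documents that contain every query word as a whole word."""
    words = query.lower().split()
    return [
        doc_id
        for doc_id, text in documents.items()
        if all(word in text.lower().split() for word in words)
    ]
```

A query for "restaurant" finds nothing, although d1 is arguably relevant (synonym problem). A query for "bat" returns both d2 and d3, although a user interested in baseball wants only d3 (ambiguity problem).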

Intelligent IR
• Taking into account the meaning of the words used.
• Taking into account the order of words in the query.
• Adapting to the user based on direct or indirect feedback.
• Taking into account the authority of the source.

IR Basic Concepts

• The User Task
– Retrieval
  • information or data
  • purposeful
– Browsing
  • glancing around
  • F1; cars, Le Mans, France tourism

Fig: Interaction of the user with the retrieval system through distinct tasks (figure labels: Retrieval, Browsing, Database).

• Document representation viewed as a continuum: logical view of documents might shift
• Document set to term index
• Indexing
– Automatic or by a specialist
– Full text: all occurrences of words in the document
– Selected keywords
  • Stop word removal
  • Stemming
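The step from full text to selected keywords can be sketched as below. Everything here is an assumption made for illustration: the stop-word list is a tiny hand-picked set, and `stem` is a crude suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are"}

# Crude suffix stripping for illustration; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Full-text view: every lowercase alphanumeric word occurrence."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Keyword view: drop stop words, then stem what is left."""
    return [stem(token) for token in tokenize(text) if token not in STOP_WORDS]
```

For instance, `index_terms("The indexing of documents is tested")` reduces six word occurrences to the three index terms `["index", "document", "test"]`.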

Two IR main Functions:

1. Indexing (System perspective)
- Text processing
- Index construction

2. Retrieval (User perspective)
- User interface
- Query processing
- Searching from index (index lookup)
- Search result ranking
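The two functions can be sketched together: an inverted index is built once from the documents, and each query is then answered by looking up its terms in that index and ranking the hits. This is a minimal sketch under assumptions: the documents are invented, tokenization is plain whitespace splitting, and the score is a raw sum of term frequencies rather than any particular weighting scheme from the course.

```python
from collections import defaultdict

# Hypothetical documents, invented for illustration only.
docs = {
    "d1": "web search engines index the web",
    "d2": "library search",
    "d3": "cooking recipes",
}


def build_inverted_index(documents):
    """Indexing: map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Retrieval: look up each query term, sum term frequencies, rank the hits."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, term_frequency in index.get(term, {}).items():
            scores[doc_id] += term_frequency
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

The point of the inverted index is that a query only touches the postings of its own terms, so documents such as d3 that share no term with the query are never examined.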

IR System: (1) Indexing


IR System: (2) Retrieval


Fig: The Process of retrieving information.

IR System Components

• Text Operations forms index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of word to document pointers.
• Searching retrieves documents that contain a given query token from the inverted index.
• Ranking scores all retrieved documents according to a relevance metric.
• User Interface manages interaction with the user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to improve retrieval:
– Query expansion using a thesaurus.
– Query transformation using relevance feedback.
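Query expansion using a thesaurus can be sketched as a simple lookup that adds known synonyms to the query terms. The thesaurus below is a hypothetical hand-made dictionary with two entries, and `expand_query` is an invented name; a real system would draw on a much larger resource.

```python
# Hypothetical hand-made thesaurus, invented for illustration only.
THESAURUS = {
    "restaurant": ["café"],
    "china": ["prc"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query words followed by their synonyms, without duplicates."""
    expanded = []
    for word in query.lower().split():
        for candidate in [word] + thesaurus.get(word, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

Searching with the expanded term list lets a query for "restaurant" also reach documents that only mention "café", at the cost of possibly admitting less relevant documents.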

IR System Architecture

[Figure: IR system architecture. Labeled components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Text Database, Index (Inverted File). Labeled flows: User Need, Text, Logical View, Query, User Feedback, Retrieved Docs, Ranked Docs.]


References:
1. Modern Information Retrieval / by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2001
2. Introduction to Information Retrieval / Christopher D. Manning, Prabhakar
3. Intelligent Information Retrieval and Web Search, Raymond Mooney, University of Texas at Austin
4. Introduction to Information Retrieval (IR), T.Keerati Boonchote
