information retrieval 1
IR – Introduction Created by Chethan.M
Information Retrieval
ISiM Syllabus: MISM 623 – Information Retrieval Systems
Course Objectives
This course examines information retrieval within the context of full-text datasets. The students should be able to understand and critique existing information retrieval systems and to design and build information retrieval systems themselves. The course will introduce students to traditional methods as well as recent advances in information retrieval (IR), handling and querying of textual data. The focus will be on newer techniques of processing and retrieving textual information, including hypertext documents available on the World Wide Web.
Course Outline
Topics covered include:
• IR Models
  o Boolean Model
  o Vector Space Model
  o Relational DBMS
  o Probabilistic Models
  o Language Models
• Web Information Retrieval
  o citation network analysis
  o social collaboration (PageRank and HITS algorithms)
• Term Indexing
  o Zipf's Law
  o term weighting
• Searching and Data Structures
  o Inverted files to support Boolean and Vector Models
  o Clustering
    • non-hierarchical
    • single pass
    • reallocation
  o hierarchical agglomerative
  o String Searching
  o Tries, binary tries, binary digital tries, suffix trees, etc.
• Retrieval Effectiveness Evaluation
  o Recall, Precision, Fallout
  o Comparing systems using average precision
Course Readings: (Chethan is using)
ISiM
1. Modern Information Retrieval / by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2001
2. Introduction to Information Retrieval / Christopher D. Manning, Prabhakar
Example of IR: Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval.
What is Information Retrieval? Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Motivation: Information retrieval deals with the representation, storage, and organization of, and access to, information items. The representation and organization of the information items should provide the user with easy access to the information in which he or she is interested. Unfortunately, characterizing the user's information need is not a simple problem.
Given the user query, the key goal of an IR system (search engine) is to retrieve information that might be useful or relevant to the user. The emphasis is on the retrieval of information as opposed to the retrieval of data.
Information Retrieval is…
• The indexing and retrieval of textual documents.
• Concerned firstly with retrieving relevant documents to a query.
• Concerned secondly with retrieving from large sets of documents efficiently.
• Selectivity
• Finding some desired info in a store of information
• IR = select from source process
• IR and literature searching (finding documents)
Information Retrieval System: “An information retrieval system is a device interposed between a potential user of information and the information collection itself. For a given information problem, the purpose of the system is to capture wanted items and to filter out unwanted items.”
Information retrieval systems can be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales.
In web search, the system has to provide search over billions of documents stored on millions of computers.
At the other extreme is personal information retrieval. In the last few years, consumer operating systems have integrated information retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search).
In between is the space of enterprise, institutional, and domain-specific search, where retrieval might be provided for collections such as a corporation’s internal documents, a database of patents, or research articles on biochemistry.
Data Retrieval:
• Which documents contain a set of keywords
• Well-defined semantics
• A single erroneous object implies failure.
Information Retrieval:
• Information about a subject or topic
• Semantics is frequently loose
• Small errors are tolerated
• NLP retrieval & unstructured data
• Ranking & relevance
IR System:
• Interpret contents of information items.
• Generate a ranking which reflects relevance.
• Notion of relevance is most important.
Data Retrieval VS Information Retrieval
| | Databases | Information Retrieval |
| What we’re retrieving | Structured data. Clear semantics based on a formal model. | Mostly unstructured. Free text with some metadata. |
| Queries we’re posing | Formally defined queries. Unambiguous. | Vague, imprecise information needs (often expressed in natural language). |
| Results we get | Exact. Always correct in a formal sense. | Sometimes relevant, often not. |
| Interaction with system | One-shot queries. | Interaction is important (relevance feedback). |
Text Database VS Database
| | Text Database | Database |
| 1 | Emphasis on retrieval processing | Transaction processing |
| 2 | No data updates | Data updates |
| 3 | No data integrity | Data integrity |
| 4 | No data structure (e.g. book, web page) | Data structure (e.g. student record, registration data) |
History (Past)
• 1960-70’s:
  – Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents.
  – Development of the basic Boolean and vector-space models of retrieval.
  – Prof. Salton and his students at Cornell University are the leading researchers in the area.
• 1980’s:
  – Large document database systems, many run by companies:
    • Lexis-Nexis
    • Dialog
    • MEDLINE
• 1990’s:
  – Searching FTPable documents on the Internet
    • Archie
    • WAIS
  – Searching the World Wide Web
    • Lycos
    • Yahoo
    • Altavista
• 1990’s continued:
  – Organized Competitions
    • NIST TREC
  – Recommender Systems
    • Ringo
    • Amazon
    • NetPerceptions
  – Automated Text Categorization & Clustering
• 2000’s
  – Link analysis for Web Search
    • Google
  – Automated Information Extraction
    • Whizbang
    • Fetch
    • Burning Glass
  – Question Answering
    • TREC Q/A track
• 2000’s continued:
  – Multimedia IR
    • Image
    • Video
    • Audio and music
  – Cross-Language IR
    • DARPA Tides
  – Document Summarization
Present
Source of data
• Electronic library
• University documents
• Online data (web sites)
Example
• AltaVista, Google, etc.
Past, Present and Future
• The library was the first organization for IR
• Index terms assigned by academic and private institutions
• Searching technique (past: in library)
  – Title, subject
  – Hierarchical search systems (e.g. Dewey Decimal), controlled vocabularies, collections of abstracts
• Searching technique (present: in library)
  – Department (of a faculty), term index
  – Development of user-interface formats
  – Electronic service
  – Hypertext service
Related Research Areas of IR (Future)
• Electronic Commerce on Web (Digital Library Online)
• Database Management
• Library and Information Science
• Artificial Intelligence (AI)
• Natural Language Processing (NLP)
• Machine Learning (ML)
Typical IR Task
• Given:
  – A corpus of textual natural-language documents.
  – A user query in the form of a textual string.
• Find:
  – A ranked set of documents that are relevant to the query.
Relevance
• Relevance is a subjective judgment and may include:
  – Being on the proper subject.
  – Being timely (recent information).
  – Being authoritative (from a trusted source).
  – Satisfying the goals of the user and his/her intended use of the information (information need).
• Much of IR depends upon the idea that
  – Similar vocabulary -> relevant to same queries
• Usually look for documents matching query words
• “Similar” can be measured in many ways
  – String matching/comparison
  – Same vocabulary used
  – Probability that documents arise from same model
  – Same meaning of text
Keyword Search
• Simplest notion of relevance is that the query string appears verbatim in the document.
• Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
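The two notions of keyword relevance can be sketched in a few lines of Python. This is only an illustration, not a method taken from the course: the function names, the tokenizer, and the three-document corpus are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim in the document."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in tokenize(query))


def rank(query, corpus):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    scored = [(doc_id, bag_of_words_score(query, text))
              for doc_id, text in corpus.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy corpus, invented for this example.
corpus = {
    "Doc1": "Information retrieval finds relevant documents.",
    "Doc2": "Retrieval of information, and more information, from the web.",
    "Doc3": "A recipe for apple pie.",
}
```

With this corpus, the query "information retrieval" matches only Doc1 verbatim, while the bag-of-words score also retrieves Doc2 and ranks it first because "information" occurs twice there.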
[Figure: Typical IR task. A query string and a document corpus feed the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  – “restaurant” vs. “café”
  – “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
  – “bat” (baseball vs. mammal)
  – “Apple” (company vs. fruit)
  – “bit” (unit of data vs. act of eating)
Intelligent IR
• Taking into account the meaning of the words used.
• Taking into account the order of words in the query.
• Adapting to the user based on direct or indirect feedback.
• Taking into account the authority of the source.
IR Basic Concepts
• The User Task
  – Retrieval
    • information or data
    • purposeful
  – Browsing
    • glancing around
    • F1; cars, Le Mans, France tourism
Fig: Interaction of the user with the retrieval system through distinct tasks (retrieval and browsing over a database).
• Document representation viewed as a continuum: the logical view of documents might shift
• Document set to term index
• Indexing
  – Automatic
  – By a specialist
• Full text: all occurrences of words in the document
• Select keywords
  – Stop words
  – Stemming
Two main IR functions:
1. Indexing (system perspective)
- Text processing
- Index construction
2. Retrieval (user perspective)
- User interface
- Query processing
- Searching from index (index lookup)
- Search result ranking
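The split between the two functions can be made concrete with a small Python sketch: index construction maps each term to the documents containing it, and retrieval is then a lookup in that index. The function names, the Boolean AND semantics, and the toy corpus are assumptions made for illustration; result ranking is left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Text processing: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(corpus):
    """Indexing: map each term to the sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, query):
    """Retrieval: doc ids that contain every query term (index lookup)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical toy corpus, invented for this example.
corpus = {
    1: "the bat flew out of the cave",
    2: "he swung the baseball bat",
    3: "the cave was dark",
}
index = build_inverted_index(corpus)
```

Here `index["bat"]` lists documents 1 and 2, and `boolean_and(index, "bat cave")` returns only document 1. The index is built once from the corpus, so each query touches only the postings of its own terms rather than scanning every document.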
IR System: (1) Indexing
IR System: (2) Retrieval
Fig: The Process of retrieving information.
IR System Components
• Text Operations forms index words (tokens).
  – Stopword removal
  – Stemming
• Indexing constructs an inverted index of word to document pointers.
• Searching retrieves documents that contain a given query token from the inverted index.
• Ranking scores all retrieved documents according to a relevance metric.
• User Interface manages interaction with the user:
  – Query input and document output.
  – Relevance feedback.
  – Visualization of results.
• Query Operations transform the query to improve retrieval:
  – Query expansion using a thesaurus.
  – Query transformation using relevance feedback.
IR System Architecture
[Figure: IR system architecture. Labeled components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Text Database, and Index (inverted file). Labeled flows: user need, text, logical view, query, user feedback, retrieved docs, ranked docs.]
References:
1. Modern Information Retrieval / by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2001
2. Introduction to Information Retrieval / Christopher D. Manning, Prabhakar
3. Intelligent Information Retrieval and Web Search, Raymond Mooney, University of Texas at Austin
4. Introduction to Information Retrieval (IR), T.Keerati Boonchote