Chapter 2: Information Retrieval
Ms. Malak Bagais
[textbook]: Chapter 2

Slide 3: Objectives
By the end of this lecture, students will be able to:
- List the components of information retrieval
- Describe document representation
- Apply Porter's algorithm
- Compare and apply different retrieval models
- Evaluate retrieval performance

Slide 4: Information Retrieval
- summarization
- searching
- indexing

Slide 5: Information Retrieval Components
- Document representation
- Query representation
- Ranking the documents
- Evaluation of the quality of retrieval

Slide 6: Document Representation
Transforming a text document to a weighted list of keywords

Slide 7: Stopwords

Slide 8: Sample Document
Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion. Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.
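
As a rough illustration of the stopword step, the sketch below tokenizes the first sentence of the sample document and drops stopwords. The stopword list from Slide 7 is not reproduced in these notes, so `STOPWORDS` here is a small hypothetical list chosen only for this sentence; `tokenize` and `remove_stopwords` are illustrative names, not part of the lecture.

```python
import re

# Hypothetical stopword list: the deck's own list (Slide 7) is not available
# in these notes, so these entries are an assumption for illustration.
STOPWORDS = {"a", "an", "and", "as", "has", "in", "is", "most", "of", "one", "the", "to"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (inner hyphens kept)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in stopwords]


sentence = ("Data Mining has emerged as one of the most exciting "
            "and dynamic fields in computing science.")

# Distinct remaining terms in alphabetical order, as on the stopword-deletion slide.
keywords = sorted(set(remove_stopwords(tokenize(sentence))))
print(keywords)
# ['computing', 'data', 'dynamic', 'emerged', 'exciting', 'fields', 'mining', 'science']
```

All eight remaining terms also appear in the term list on Slide 9, which suggests (but does not prove) that the deck's list treats words such as "one" and "most" as stopwords.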

Slide 9: Delete stopwords
Terms of the sample document after stopwords are deleted:
ad, algorithms, analysis, analyze, archives, artificial, assimilate, availability, billion, bits, challenges, commercial, complexity, computational, computing, concept, data, database, databases, detect, distributed, driving, dynamic, efficiently, emerged, enterprises, excess, exciting, expected, family, field, fields, force, form, hidden, high, hoc, information, intelligence, interesting, intersection, large, largely, learning, lies, machine, manage, market, mining, nature, nuggets, online, pattern, petabyte-scale, potentially, practice, presence, present, quick, recognition, recognize, refers, relationships, science, size, software, span, statistics, store, systems, techniques, theoretical, thrust, time, underpinnings, valuable, years

Slide 10: Stemming
A given word may occur in a variety of syntactic forms:
- plurals
- past tense
- gerund forms

Slide 11: Stemming (continued)
connector, connected, preconnection, connection, connecting, postconnection, connections, connects

Slide 12: Stemming
A stem is what is left of a word after its affixes (prefixes and suffixes) are removed.
Stem: connect
Suffixes: connector, connection, connections, connected, connecting, connects
Prefixes: preconnection, postconnection

Slide 13: Porter's Algorithm
- The letters A, E, I, O, and U are vowels.
- A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y.
- The letter Y is a vowel if it is preceded by a consonant; otherwise it is a consonant.
- For example, Y in synopsis is a vowel, while in toy it is a consonant.
- In the algorithm description, a consonant is denoted by c and a vowel by v.

Slide 14
- m: the measure of vc repetition (for example, OATS has m = 1)
- *S: the stem ends with S (similarly for other letters)
- *v*: the stem contains a vowel
- *d: the stem ends with a double consonant (e.g., -TT)
- *o: the stem ends cvc, where the second c is not W, X, or Y (e.g., -WIL)

Slide 15: Porter's Algorithm
What is the value of m in the following words?
BY, PRIVATE, OATEN, ORRERY, IVY, TROUBLES, TREES, TROUBLE, OATS, Y, TREE, EE, TR

Slide 16: Porter's Algorithm
What is the value of m in the following words?
m = 0: TR, EE, TREE, Y, BY
m = 1: TROUBLE, OATS, TREES, IVY
m = 2: TROUBLES, PRIVATE, OATEN, ORRERY

Slide 17: Porter's Algorithm, Step 1
Step 1: plurals and past participles

Slide 18: Porter's Algorithm, Step 2
Steps 2-4: straightforward stripping of suffixes

Slide 19: Porter's Algorithm, Step 3
Steps 2-4: straightforward stripping of suffixes

Slide 20: Porter's Algorithm, Step 4
Steps 2-4: straightforward stripping of suffixes

Slide 21: Example
GENERALIZATIONS
  Step 1: GENERALIZATION
  Step 2: GENERALIZE
  Step 3: GENERAL
  Step 4: GENER
OSCILLATORS
  Step 1: OSCILLATOR
  Step 2: OSCILLATE
  Step 4: OSCILL
  Step 5: OSCIL

Slide 22: Porter's Algorithm
In an experiment reported on Porter's site (http://www.tartarus.org/~martin/), suffix stripping was applied to a vocabulary of 10,000 words.
Number of words reduced in:
  Step 1: 3597
  Step 2: 766
  Step 3: 327
  Step 4: 2424
  Step 5: 1373
Number of words not reduced: 3650

Slide 23: Term-Document Matrix
- A term-document matrix (TDM) is a two-dimensional representation of a document collection.
- Rows of the matrix represent the documents.
- Columns correspond to the index terms.
- Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row).
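
Returning to the measure m from Slides 13 to 16: the sketch below labels each letter of a word as c or v using the Y rule from Slide 13 and then counts how many times a vowel is immediately followed by a consonant, which is the number of vc repetitions. The function names `is_consonant` and `measure` are illustrative choices, and the code works on whole words as the slide exercise does; it is not a full implementation of Porter's algorithm.

```python
def is_consonant(word, i):
    """Return True if word[i] is a consonant under the rules on Slide 13."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel when preceded by a consonant, otherwise a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return m, the number of vc repetitions in the word."""
    word = word.lower()
    pattern = "".join("c" if is_consonant(word, i) else "v"
                      for i in range(len(word)))
    # Each place where a v is directly followed by a c closes one vc group.
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "v" and b == "c")


for w in ["TR", "EE", "TREE", "Y", "BY",
          "TROUBLE", "OATS", "TREES", "IVY",
          "TROUBLES", "PRIVATE", "OATEN", "ORRERY"]:
    print(w, measure(w))
```

For the thirteen exercise words this prints 0 for the first five, 1 for the next four, and 2 for the last four.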

Slide 24: Term-Document Matrix

Slide 25: Sparse Matrices: Triples

Slide 26: Sparse Matrices: Pairs

Slide 27: Normalization
- Raw frequency values are not useful for a retrieval model.
- Prefer normalized weights, usually between 0 and 1, for each term in a document.
- Dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization.

Slide 28: Normalized Term-Document Matrix

Slide 29: Vector Representation
Vector representation of the sample document showing the terms, their frequencies, and their normalized frequencies:

term          frequency  normalized
ad            1          0.125
algorithm     1          0.125
analysi       1          0.125
analyz        1          0.125
archiv        1          0.125
artifici      1          0.125
assimil       1          0.125
avail         1          0.125
billion       1          0.125
bit           1          0.125
challeng      1          0.125
commerci      1          0.125
complex       1          0.125
comput        2          0.25
concept       1          0.125
data          8          1.00
databas       3          0.375
detect        1          0.125
distribut     1          0.125
drive         1          0.125
dynam         1          0.125
effici        1          0.125
emerg         1          0.125
enterpris     1          0.125
excess        1          0.125
excit         1          0.125
expect        1          0.125
famili        1          0.125
field         2          0.25
forc          1          0.125
form          1          0.125
hidden        1          0.125
high          1          0.125
hoc           1          0.125
inform        1          0.125
intellig      1          0.125
interest      2          0.25
intersect     1          0.125
knowledg      1          0.125
larg          2          0.25
learn         1          0.125
li            1          0.125
machin        1          0.125
manag         1          0.125
market        1          0.125
mine          5          0.625
natur         1          0.125
nugget        1          0.125
onlin         1          0.125
pattern       1          0.125
petabyte      1          0.125
potenti       1          0.125
practic       1          0.125
presenc       1          0.125
present       1          0.125
quick         1          0.125
recogn        1          0.125
recognit      1          0.125
refer         1          0.125
relationship  1          0.125
scienc        1          0.125
size          1          0.125
softwar       1          0.125
span          1          0.125
statist       1          0.125
store         1          0.125
system        1          0.125
techniqu      3          0.375
theoret       1          0.125
thrust        1          0.125
time          1          0.125
underpin      1          0.125
valuabl       1          0.125
year          1          0.125

Slide 30: Retrieval Models
Retrieval models match a query with documents to:
- separate documents into relevant and non-relevant classes
- rank the documents according to their relevance
Retrieval models:
- Boolean model
- Vector space model (VSM)
- Probabilistic models

Slide 31: Boolean Retrieval Model
- One of the simplest and most efficient retrieval mechanisms
- Based on set theory and Boolean algebra
- Conventional numeric representations of false as 0 and true as 1
- The Boolean model is interested only in the presence or absence of a term in a document
- In the term-document matrix, replace all the nonzero values with 1

Slide 32: Boolean Term-Document Matrix

Slide 33: Examples
Document sets:
  DocSet(K0) = {D1, D3, D5}
  DocSet(K4) = {D2, D3, D4, D6}
Queries:
  K0 and K4: DocSet(K0) ∩ DocSet(K4) = {D3}
  K0 or K4: DocSet(K0) ∪ DocSet(K4) = {D1, D2, D3, D4, D5, D6}

Slide 34

Slide 35: Boolean Query
- User Boolean queries are usually simple Boolean expressions
- A Boolean query can be represented in disjunctive normal form (DNF)
  - disjunction corresponds to "or"
  - conjunction refers to "and"
- A DNF consists of a disjunction of conjunctive Boolean expressions

Slide 36: DNF form
- K0 or (not K3 and K5) is in DNF
- DNF query processing can be very efficient
- If any one of the conjunctive expressions is true, the entire DNF is true
- Short-circuit the expression evaluation: stop matching the expression with a document as soon as a conjunctive expression matches the document, and label the document as relevant to the query

Slide 37: Boolean Model Advantages
- Simplicity and efficiency of implementation
- Binary values can be stored using bits
  - reduced storage requirements
  - retrieval using bitwise operations is efficient
- Boolean retrieval was adopted by many commercial bibliographic systems
- Boolean queries are akin to database queries

Slide 38: Boolean Model Disadvantages
- A document is either relevant or non-relevant to the query
- It is not possible to assign a degree of relevance
- Complicated Boolean queries are difficult for users
- Boolean queries retrieve too few or too many documents
  - K0 and K4 retrieved only 1 out of 6 documents
  - K0 or K4 retrieved 5 out of a possible 6 documents

Slide 39: Vector Space Model
- Treats both the documents and the queries as vectors
- A weight based on the frequency in the document: [equation]

Slide 40: Graphical representation of the VSM Model

Slide 41
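
The Boolean examples on Slide 33 and the short-circuit idea on Slide 36 can be sketched with Python sets. The document sets for K0 and K4 come from Slide 33; the DNF encoding (a list of conjunctions, each a list of `(term, negated)` pairs) and the names `matches` and `retrieve` are assumptions made for this illustration, as is restricting the collection to the documents that appear in the two sets.

```python
# Document sets taken from Slide 33.
DOC_SETS = {
    "K0": {"D1", "D3", "D5"},
    "K4": {"D2", "D3", "D4", "D6"},
}

# "and" is set intersection, "or" is set union.
print(sorted(DOC_SETS["K0"] & DOC_SETS["K4"]))  # ['D3']
print(sorted(DOC_SETS["K0"] | DOC_SETS["K4"]))  # ['D1', 'D2', 'D3', 'D4', 'D5', 'D6']


def matches(doc, dnf, doc_sets=DOC_SETS):
    """Return True if doc satisfies a query given in DNF.

    dnf is a list of conjunctions; each conjunction is a list of
    (term, negated) literals. any() stops at the first conjunction that
    matches, which mirrors the short-circuit evaluation on Slide 36.
    """
    def literal_holds(term, negated):
        present = doc in doc_sets.get(term, set())
        return present != negated

    return any(all(literal_holds(t, n) for t, n in conj) for conj in dnf)


def retrieve(dnf, docs):
    """Return the documents labelled relevant to the DNF query."""
    return sorted(d for d in docs if matches(d, dnf))


# Assumption: the collection is just the documents named in the two sets.
ALL_DOCS = sorted(set().union(*DOC_SETS.values()))
k0_and_k4 = [[("K0", False), ("K4", False)]]
k0_or_k4 = [[("K0", False)], [("K4", False)]]
print(retrieve(k0_and_k4, ALL_DOCS))  # ['D3']
print(retrieve(k0_or_k4, ALL_DOCS))   # ['D1', 'D2', 'D3', 'D4', 'D5', 'D6']
```

A negated literal such as "not K3" would be written `("K3", True)`; it is left out of the example because Slide 33 does not list a document set for K3.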

Slide 42: Computing the similarity

Slide 43: Relevance Values and Ranking
Similarity between the documents and the query, and the ranking based on that similarity:
  D0 (0.7774)
  D6 (0.4953)
  D2 (0.3123)
  D1 (0.2590)
  D5 (0.2122)
  D4 (0.1727)
  D3 (0.1084)

Slide 44: Variations of VSM
- Variations of the normalized frequency
- Inverse document frequency (idf)
- The idf for the jth term: [equation]
  - N = number of documents
  - n_j = number of documents containing the jth term
- Modified weights: [equation]

Slide 45: Inverse Document Frequencies

Slide 46: TDM using idf

Slide 47: Similarity and ranking using idf
Similarity between the documents and the query, and the ranking based on that similarity:
  D0 (0.7867)
  D6 (0.4953)
  D2 (0.3361)
  D1 (0.2590)
  D5 (0.2215)
  D4 (0.1208)
  D3 (0.0969)

Slide 48: VSM vs. Boolean
- Queries are easier to express: users can attach relative weights to terms
- A descriptive query can be transformed into a query vector similar to the documents
- Matching between a query and a document is not precise: each document is allocated a degree of similarity
- Documents are ranked by their similarity scores instead of being split into relevant/non-relevant classes
- Users can go through the ranked list until their information needs are met

Slide 49: Evaluation of Retrieval Performance
Evaluation should include:
- Functionality
- Response time
- Storage requirement
- Accuracy

Slide 50: Accuracy Testing
- Early days: batch testing
  - Document collection such as cacm.all
  - Query collection such as query.text
- Present day: interactive tests are used
  - Difficult to conduct and time consuming
  - Batch testing is still important

Slide 51: Precision and Recall
- Precision: how many of the retrieved documents are relevant?
- Recall: how many of the relevant documents are retrieved?

Slide 52: Example

Slide 53: F-measure

Slide 54: Average Precision
The choice of three retrieved documents was arbitrary

Slide 55: Relationship between precision and recall
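
The precision and recall questions on Slide 51 translate directly into set arithmetic, sketched below. The F-measure formula from Slide 53 is not preserved in these notes, so `f_measure` assumes the common balanced form (the harmonic mean of precision and recall). The retrieved and relevant sets are made-up values for illustration, not the example from Slide 52.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def f_measure(p, r):
    """Balanced F-measure (assumed form): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical data: three documents retrieved, four documents relevant.
retrieved = {"D1", "D2", "D3"}
relevant = {"D1", "D3", "D4", "D5"}

p = precision(retrieved, relevant)  # 2 of the 3 retrieved are relevant
r = recall(retrieved, relevant)     # 2 of the 4 relevant were retrieved
print(round(p, 4), round(r, 4), round(f_measure(p, r), 4))
# 0.6667 0.5 0.5714
```

Retrieving more documents usually raises recall and lowers precision, which is why a single combined score such as the F-measure is convenient for comparing systems.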

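For the vector space model (Slides 27, 29, and 39 to 43), the sketch below normalizes raw term frequencies by the largest frequency in the document, as described on Slide 27, and then ranks documents against a query. The similarity formula from Slide 42 is not preserved in these notes; cosine similarity is assumed here because it is a common choice for the VSM. The query vector and the second document are hypothetical.

```python
import math


def normalize(freqs):
    """Divide every term frequency by the largest frequency in the document."""
    largest = max(freqs.values())
    return {term: f / largest for term, f in freqs.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# A few of the sample document's stemmed terms and frequencies from Slide 29.
sample = normalize({"data": 8, "mine": 5, "databas": 3, "techniqu": 3, "comput": 2})
print(sample["mine"], sample["databas"], sample["comput"])  # 0.625 0.375 0.25

# Hypothetical second document and query, for illustration only.
other = normalize({"comput": 4, "scienc": 2, "data": 1})
query = {"data": 1.0, "mine": 1.0}

# Rank the documents by decreasing similarity to the query.
ranking = sorted(
    [("sample", cosine(sample, query)), ("other", cosine(other, query))],
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(name, round(score, 4))
```

An idf variation along the lines of Slide 44 would rescale each term weight before the similarity is computed; it is omitted because the slide's formula is not available here.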