
Slide 1

Introduction to MIRS
Session 1
COURSE: T7223 MULTIMEDIA INFORMATION RETRIEVAL SYSTEM (MIRS)

Slide 3

Learning Objectives
After carefully studying this lecture, students will be able to:
- Describe the principal components of a multimedia information retrieval system and how they differ from those of other retrieval systems.

Slide 4

Recent Interesting Facts
Over the past years, extreme amounts of digital documents have been created and archived, due to:
- enormously improved computational power at a constant price level,
- significantly smaller costs per memory unit,
- digitalization, ubiquitous media capturing, HD videos, and
- new and improved sensors (scanners, satellites, recorders).
Only a small fraction of these documents is currently exploited economically. A few examples follow.

Slide 5

A Few Examples
- The Internet contains around 500 EB of data, but only a fraction is easy to find.
- Companies operate numerous information systems, but consistent access across the diverse systems rarely exists; duplicated data makes things even harder; regulatory requirements ask for archival of all information for 5-10 years.
- Media archives encompass large amounts of photos, articles, and videos, but writers mostly browse through the archives by hand; upload features like those on YouTube inflate repositories with new materials.
- Surveillance produces terabytes of data per day through numerous sensors; most of the data is moved directly into an archive without being considered by a human being beforehand.

Slide 6

The General Retrieval Problem
GIVEN:
- N documents (D0, D1, ..., DN-1)
- a query Q of the user
PROBLEM:
- Produce a ranked list of k documents Dj (0 <= j < N) that match the query sufficiently well; the ranking reflects the relevance of each document to the query. (A minimal sketch of this top-k problem follows below.)
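To make the problem concrete, here is a minimal sketch (not from the slides) of top-k retrieval in Python; the scoring function `overlap_score` is a toy stand-in for a real retrieval model:

```python
# A minimal sketch of the general top-k retrieval problem: score every
# document against the query and return the k best, best first.
from typing import Callable, List, Tuple

def retrieve(docs: List[str], query: str,
             score: Callable[[str, str], float], k: int) -> List[Tuple[int, float]]:
    """Return the k best (document index, score) pairs, best first."""
    scored = [(j, score(query, d)) for j, d in enumerate(docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def overlap_score(query: str, doc: str) -> float:
    # Toy relevance: number of words shared by query and document.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

docs = ["white house press release", "house music archive", "satellite image feed"]
print(retrieve(docs, "white house", overlap_score, k=2))
# -> [(0, 2.0), (1, 1.0)]
```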

Slide 7

MIR Problems
- Document and query indexing: how to best represent their contents?
- Query evaluation (or retrieval process): to what extent does a document correspond to a query?
- System evaluation: how good is a system? Are the retrieved documents relevant (precision)? Are all the relevant documents retrieved (recall)?

Slide 8

Search Paradigms and Retrieval Models
KEYWORD-BASED SEARCH: Today, the typical approach is to enter a few keywords and to browse through huge result lists to find relevant documents. This works especially well with text documents but suffers from a semantic gap between keywords and signal information.
- Boolean retrieval: white AND house
- Vector space retrieval: retrieval with weighted terms

Slide 9

Search Paradigms and Retrieval Models
SIMILARITY SEARCH: Instead of entering keywords, the user provides a few examples of how the result should look (query by example). The search engine then finds the documents that best match the patterns of the query. Similarity search works with text, images, audio, and video files. What similarity actually means depends on the chosen models and features.

Slide 10

Indexing & Retrieval
- The CBR architecture is composed of three important components: extraction, representation, and retrieval.
- Extraction and representation constitute the indexing component.
- The extraction component extracts regions and features, automatically or semi-automatically.
- The extracted contents are represented as, or transformed into, suitable models and data structures, and then stored in a persistent index.
- The retrieval process computes distances between source and target features and sorts the most similar contents. (A minimal sketch of this pipeline follows below.)
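The following toy sketch illustrates the extraction/representation/retrieval pipeline just described; the feature extractor is invented for the example (document length and vowel ratio), where a real system would use color histograms, term weights, or audio fingerprints:

```python
# Query-by-example sketch: extraction (toy feature function),
# representation (plain vectors held in a list as the "index"),
# and retrieval (distance computation plus sort).
import math
from typing import List, Tuple

def extract_features(doc: str) -> List[float]:
    # Hypothetical stand-in for a real extractor.
    vowels = sum(c in "aeiou" for c in doc.lower())
    return [float(len(doc)), vowels / max(len(doc), 1)]

def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Indexing: extract and store one feature vector per document.
corpus = ["red sunset photo", "blue ocean video", "red rose image"]
index: List[Tuple[str, List[float]]] = [(d, extract_features(d)) for d in corpus]

# Retrieval: distance between the query example's features and each
# indexed vector, sorted ascending (most similar first).
query_features = extract_features("red sunrise photo")
ranked = sorted(index, key=lambda item: euclidean(query_features, item[1]))
print([doc for doc, _ in ranked])  # most similar document first
```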

Slide 11

Indexing & Retrieval
Multimedia data require support for multi-dimensional datasets, e.g., a 256-dimensional feature vector. This implies:
- specialized kinds of queries, and
- new indexing approaches, with two choices:
  1. map N-dimensional data to a single dimension and use traditional indexing structures, or
  2. develop specialized indexing structures.

Slide 12

Indexing-Based MIR

[Diagram: the documents Dj and the query Q each pass through an indexing step (query analysis on the query side) into a common representation; query evaluation then matches the query representation against the document representations.]

Slide 13

Index Structures
Feature extraction algorithms condense the contents of a document to a few vectors and values. These features have to be indexed such that the relevant documents for a given query can be efficiently retrieved. A short overview of the index structures addressed in this course follows; a toy inverted-list sketch appears after the list.
- Inverted list: usually applied to text documents with a large dictionary.
- High-dimensional index structures: many feature extraction algorithms compute high-dimensional vectors; such vectors are usually maintained in special index structures.
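As an illustration of the first structure, here is a minimal inverted-list sketch, assuming whitespace tokenization and a conjunctive (AND) keyword query; it is a generic textbook construction, not a specific implementation from the course:

```python
# Inverted list: each term maps to the set of document ids containing it.
# A conjunctive query is answered by intersecting the posting sets.
from collections import defaultdict
from typing import Dict, List, Set

def build_inverted_index(docs: List[str]) -> Dict[str, Set[int]]:
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def and_query(index: Dict[str, Set[int]], terms: List[str]) -> Set[int]:
    # Documents containing every query term.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = ["white house report", "house music", "white noise study"]
index = build_inverted_index(docs)
print(and_query(index, ["white", "house"]))  # -> {0}
```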


Slide 16

IR Model: Formal Description

- D: set of logical views (representations) of the documents in the collection.
- Q: set of logical views (representations) of the user information needs, i.e., the queries.
- F: framework for modeling document representations, queries, and their relationships.
- R(qi, dj): ranking function; it associates a real number with a query qi ∈ Q and a document representation dj ∈ D, and thereby defines an ordering among the documents with regard to the query qi.

Slide 17

Example of a Modeling Framework: Boolean Model
- Query terms are combined logically using the Boolean operators AND, OR, and NOT, e.g., ((data AND mining) AND (NOT text)).
- Retrieval: given a Boolean query, the system retrieves every document that makes the query logically true. This is called exact match. (A small sketch follows below.)
- The retrieval results are usually quite poor because term frequency is not considered.
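A small sketch of exact-match evaluation for the example query above, with the Boolean expression hand-translated into Python; the sample documents are invented:

```python
# A document is returned iff ((data AND mining) AND (NOT text))
# is logically true over its set of terms.
from typing import Set

def matches(terms: Set[str]) -> bool:
    # Hand-translated form of ((data AND mining) AND (NOT text)).
    return ("data" in terms and "mining" in terms) and ("text" not in terms)

docs = ["data mining methods", "text mining of data", "data warehousing"]
hits = [d for d in docs if matches(set(d.lower().split()))]
print(hits)  # -> ['data mining methods']
```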

Slide 18

Modeling Framework: Vector Space Model
- The query q is represented in the same way as the documents, or slightly differently.
- Relevance of di to q: compare the similarity of the query q and the document di.
- Cosine similarity (the cosine of the angle between the two vectors): cos(q, di) = (q · di) / (||q|| ||di||)
- Cosine similarity is also commonly used in text clustering.

Slide 19

An Example
A document space is defined by three terms: hardware, software, users.
A set of documents is defined as:
A1 = (1, 0, 0), A2 = (0, 1, 0), A3 = (0, 0, 1)
A4 = (1, 1, 0), A5 = (1, 0, 1), A6 = (0, 1, 1)
A7 = (1, 1, 1), A8 = (1, 0, 1), A9 = (0, 1, 1)
If the query is "hardware and software", which documents should be retrieved?

Slide 20

An Example (cont.)
In Boolean query matching:
- AND: documents A4 and A7 will be retrieved.
- OR: A1, A2, A4, A5, A6, A7, A8, A9 will be retrieved.
In vector space matching (cosine) with q = (1, 1, 0):
S(q, A1) = 0.71, S(q, A2) = 0.71, S(q, A3) = 0
S(q, A4) = 1, S(q, A5) = 0.5, S(q, A6) = 0.5
S(q, A7) = 0.82, S(q, A8) = 0.5, S(q, A9) = 0.5
Retrieved document set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
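The scores on this slide can be reproduced with a few lines of Python; the resulting ranking (including the tie order) matches the slide:

```python
# Reproducing the cosine scores for q = (1, 1, 0) over the document
# space (hardware, software, users) from Slide 19.
import math

def cosine(q, d):
    dot = sum(x * y for x, y in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)
for name in sorted(docs, key=lambda n: cosine(q, docs[n]), reverse=True):
    print(name, round(cosine(q, docs[name]), 2))
# A4 1.0, A7 0.82, A1 0.71, A2 0.71, then A5/A6/A8/A9 at 0.5, A3 at 0.0
```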

Slide 22

CBIR and CBR
- Content-based image retrieval (CBIR) is an example of content-based retrieval (CBR).
- It concentrates on low-level features.
- Main ideas of CBIR (see the sketch below):
  - Represent an image as a set of feature descriptors.
  - Define similarity measures over the descriptors.
  - When a user specifies a query, the system returns images sorted by similarity.
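A minimal sketch of this pipeline, using an invented grayscale-histogram descriptor over toy pixel lists; real CBIR systems use richer color, texture, and shape descriptors, but the represent/compare/sort structure is the same:

```python
# Represent each "image" by a normalized intensity histogram and rank
# images by L1 histogram distance to the query image.
from typing import List

def histogram(pixels: List[int], bins: int = 4) -> List[float]:
    counts = [0] * bins
    for p in pixels:  # pixel values assumed in 0..255
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]  # normalize for size-independence

def l1_distance(h1: List[float], h2: List[float]) -> float:
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Toy "images" as flat pixel lists.
images = {"dark": [10, 20, 30, 40],
          "mid": [120, 130, 140, 150],
          "bright": [200, 220, 240, 250]}
query = histogram([15, 25, 35, 45])
ranked = sorted(images, key=lambda name: l1_distance(query, histogram(images[name])))
print(ranked)  # -> ['dark', 'mid', 'bright']
```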

Slides 23-24

Image Representation
[Figures only.]

Slides 25-26

Content-Based Image Retrieval (CBIR)
[Figures only.]

Slide 27

System Evaluation
- Efficiency: time, space.
- Effectiveness: how well is a system capable of retrieving relevant documents? Is one system better than another?
- Metrics often used (together; a small sketch follows below):
  Precision = retrieved relevant docs / retrieved docs
  Recall = retrieved relevant docs / relevant docs
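Both metrics are straightforward to compute for a single query, given the retrieved set and the ground-truth relevant set:

```python
# Precision and recall for one query, directly from the definitions above.
def precision_recall(retrieved: set, relevant: set) -> tuple:
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}
print(precision_recall(retrieved, relevant))  # -> (0.5, 0.666...)
```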

Slide 28

General Form of Precision/Recall
[Figures: overlapping sets of relevant and retrieved documents, whose intersection is the relevant retrieved set, and a precision/recall curve with precision and recall each ranging from 0 to 1.0.]

- Precision changes with respect to recall; it is not a fixed point (see the curve above).
- Systems cannot be compared at a single precision/recall point.
- Average precision is therefore computed over 11 points of recall: 0.0, 0.1, ..., 1.0. (A sketch of this computation follows after Slide 30.)

Slide 29

Why is IR Difficult?
- Vocabulary mismatches: synonymy (e.g., car vs. automobile) and polysemy (e.g., table).
- Queries are ambiguous; they are partial specifications of the user's need.
- Content representation may be inadequate and incomplete.
- The user is the ultimate judge, but we don't know how the judge judges.
- The notion of relevance is imprecise, and context- and user-dependent.

Slide 30

Final Remarks on MIR
- MIR is related to many areas: NLP, AI, databases, machine learning, user modeling, library science, the Web, multimedia search, ...
- Relatively weak theories.
- Very strong tradition of experiments.
- Many remaining (and exciting) problems.
- Difficult area: intuitive methods do not necessarily improve effectiveness in practice.
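Returning to the evaluation point on Slide 28: here is a sketch of 11-point average precision, assuming the input is the list of relevance flags of a ranked result and the total number of relevant documents for the query:

```python
# 11-point average precision: sample interpolated precision at recall
# levels 0.0, 0.1, ..., 1.0 and average the samples.
def eleven_point_ap(is_relevant: list, total_relevant: int) -> float:
    # (recall, precision) after each rank position.
    points, hits = [], 0
    for i, rel in enumerate(is_relevant, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / i))

    # Interpolated precision at recall r: max precision at any recall >= r.
    def interp(r):
        candidates = [p for (rec, p) in points if rec >= r]
        return max(candidates) if candidates else 0.0

    levels = [i / 10 for i in range(11)]
    return sum(interp(r) for r in levels) / len(levels)

# Ranked list: 1 = relevant, 0 = not; assume 3 relevant docs exist overall.
print(round(eleven_point_ap([1, 0, 1, 0, 1], total_relevant=3), 3))  # -> 0.764
```

Averaging over the 11 recall levels yields a single number per system, which avoids comparing systems at one arbitrary precision/recall point.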
