Latent Semantic Indexing

Prepared by: HATOUM Saria, DONGO ESCALANTE Irvin Franco
Presented to: Prof. CHBEIR Richard
Bayonne, 2013
Overview
• Introduction
  – Information Retrieval
  – Vector Space Model
• Problems
• Latent Semantic Indexing
  – Algorithm
  – Example
  – Advantages
  – Disadvantages
Introduction
• Many documents are available.
• We need to extract information from them.
• The information must be sorted and classified.
• Users query the information.
Information Retrieval
• Before LSI: literal text matching.
  – A text corpus contains many documents; given a query, find the relevant ones.
  – Some terms in a user's query will literally match terms in irrelevant documents.
Some Methods for IR
• Set-theoretic
  – Fuzzy set
• Algebraic
  – Vector space
  – Generalised vector space
  – Latent semantic indexing
• Probabilistic
  – Binary independence
Vector Space Model
• An algebraic model for representing text documents.
• Documents and queries are both vectors:
  – dj = (w1,j, w2,j, …, wt,j)
  – q = (w1,q, w2,q, …, wt,q)
Vector Space Method
• Term (rows) by document (columns) matrix, based on term occurrence.
• One vector is associated with each document.
• The cosine measures the distance between vectors (documents):
  – small angle = large cosine = similar
  – large angle = small cosine = dissimilar
Cosine Similarity Measure
• sim(di, dj) = 1 if di = dj (identical direction).
• sim(di, dj) = 0 if di and dj share no terms (orthogonal vectors).
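These two boundary cases can be checked with a short sketch (assuming NumPy is available; d1 and d2 are made-up term-weight vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A vector compared with itself: angle 0, so cosine approximately 1.
d1 = np.array([1.0, 2.0, 0.0])
# A vector sharing no terms with d1: orthogonal, so cosine approximately 0.
d2 = np.array([0.0, 0.0, 3.0])

print(cosine_similarity(d1, d1))  # ≈ 1.0
print(cosine_similarity(d1, d2))  # ≈ 0.0
```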
[Figure: documents and a query plotted as vectors in a two-term space (axes: Word 1, Word 2).]
Problem Introduction
• The traditional term-matching method does not work well in information retrieval.
• We want to capture concepts instead of words. Concepts are reflected in words; however:
  – one term may have multiple meanings;
  – different terms may have the same meaning.
The Problems
• Two problems arose using the vector space model:
  – Synonymy: many ways to express a given concept, e.g. "automobile" when querying on "car".
    • Leads to poor recall (the percentage of all relevant documents that are retrieved).
  – Polysemy: words have multiple meanings, e.g. "surfing".
    • Leads to poor precision (the percentage of retrieved documents that are relevant).
• The context of the documents matters.
Polysemy and Context
• Document similarity at the single-word level is confused by polysemy and context.
• Example: "saturn" can appear near "ring", "jupiter", "space", "planet" (the planet) or near "car", "company", "dodge", "ford" (the car brand).
• A shared word contributes to similarity if it is used in the same meaning in both documents, but not if the meanings differ.
Problem Statement
• Allow users to retrieve information on the basis of the conceptual topic or meaning of a document.
Latent Semantic Indexing
• LSI overcomes these problems of lexical matching:
  – It uses a statistical information retrieval method capable of retrieving text based on the concepts it contains, not just by matching specific keywords.
Characteristics of LSI
• Documents are represented as "bags of words": the order of the words in a document does not matter, only how many times each word appears.
• LSI projects queries and documents into a space with "latent" semantic dimensions.
• It converts a high-dimensional space into a lower-dimensional one.
Characteristics of LSI
• Concepts are represented as patterns of words that usually appear together in documents.
  – For example, "jaguar", "car", and "speed" might appear together in documents about sports cars, whereas "jaguar", "animal", and "hunting" might refer to the concept of the jaguar as an animal.
• LSI is based on the principle that words used in the same contexts tend to have similar meanings.
• LSI uses Singular Value Decomposition (SVD) to map terms to concepts.
Generate the Matrix
• The number of distinct words is huge.
• Throw out noise words (stop words): "and", "is", "at", "the", etc.
• Select and use a smaller set of words that are of interest.
• Apply stemming, i.e. remove word endings: "learning", "learned" → "learn".
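This preprocessing step can be sketched as follows (the stop-word list and suffix rules below are tiny illustrative stand-ins; real systems use full stop-word lists and a proper stemmer such as Porter's):

```python
# Hypothetical miniature stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "the"}

def stem(word):
    """Very naive suffix stripping (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, stem the rest."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]

print(stem("learning"), stem("learned"))  # learn learn
print(preprocess("Shipment of gold damaged in a fire."))
```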
"Semantic" Space
• In the reduced space, related words cluster together: {house, home, domicile} in one region, {kumquat, apple, orange, pear} in another.
Information Retrieval with LSI
• Represent each document as a word vector.
• Represent the corpus as a term-document (T-D) matrix and analyse it with a linear-algebra method called SVD.
• Then apply the classical method:
  – create a new vector from the query terms;
  – find the documents with the highest cosine similarity.
Singular Value Decomposition (SVD)
• We decompose the term-document matrix A into three matrices: A = U Σ V^T.
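A minimal sketch with NumPy, using a small made-up term-document matrix:

```python
import numpy as np

# A small made-up term-document matrix (terms x documents), raw counts.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
])

# U: term-concept matrix, s: singular values (concept strengths),
# Vt: concept-document matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together reconstructs A.
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))  # True
```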
Example
• d1: "Shipment of gold damaged in a fire."
• d2: "Delivery of silver arrived in a silver truck."
• d3: "Shipment of gold arrived in a truck."
• q: "gold silver truck"
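Assuming NumPy, the raw term-document count matrix A and the query vector q for this example can be built like this (terms taken in alphabetical order; stop words are kept, as in the original example):

```python
import numpy as np

docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
query = "gold silver truck"

# Vocabulary in alphabetical order: a, arrived, damaged, delivery, fire,
# gold, in, of, shipment, silver, truck (11 terms).
terms = sorted({w for d in docs for w in d.split()})

def to_vector(text):
    words = text.split()
    return np.array([float(words.count(t)) for t in terms])

A = np.column_stack([to_vector(d) for d in docs])  # 11 x 3 count matrix
q = to_vector(query)                               # query vector

print(A)  # e.g. "silver" occurs twice in d2, so its row is [0, 2, 0]
```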
Example: New Vectors
• Document vectors in the reduced two-dimensional space:
  – d1 = [-0.4945, 0.6492]
  – d2 = [-0.6458, -0.7194]
  – d3 = [-0.5817, 0.2469]
Example: Similarities
• sim(q, di) = cos θ
  – sim(q, d1) = -0.0541
  – sim(q, d2) = 0.9910
  – sim(q, d3) = 0.4478
• Ranking: d2 > d3 > d1, so d2 is the most relevant document.
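The whole example can be reproduced end-to-end (a sketch assuming NumPy; the signs of the SVD factors are not unique, so intermediate coordinates may differ from the slides by sign flips, but the cosine similarities agree):

```python
import numpy as np

docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
terms = sorted({w for d in docs for w in d.split()})

def to_vector(text):
    words = text.split()
    return np.array([float(words.count(t)) for t in terms])

A = np.column_stack([to_vector(d) for d in docs])  # term-document matrix
q = to_vector("gold silver truck")

# Decompose and keep only the k = 2 strongest latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_coords = Vt[:k].T            # one 2-D vector per document
q_coords = q @ U[:, :k] / s[:k]  # fold the query into the same space

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = [cos_sim(q_coords, d) for d in doc_coords]
print([round(x, 4) for x in sims])  # d2 ranks highest, then d3, then d1
```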
Advantages
• LSI overcomes two of the most problematic constraints of queries:
  – synonymy;
  – polysemy.
• True (latent) dimensions: the new dimensions are a better representation of documents and queries.
• Term dependence: the traditional vector space model assumes term independence, whereas LSI captures the strong associations between terms that exist in natural language.
Disadvantages
• Storage: many documents have more than 150 unique terms, so the term-document matrix is large and sparse.
• Efficiency: with LSI, the query must be compared to every document in the collection.
• Static matrix: if new documents arrive, the SVD of the main matrix must be recomputed.
Thank you for your attention!
Milesker anitz! ("Many thanks" in Basque)