![Page 1: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/1.jpg)
INFORMATION SEARCH
Presenter: Pham Kim Son
Saint Petersburg State University
![Page 2: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/2.jpg)
Overview Introduction Overview of IR system and basic terminologies Classical models of Information Retrieval
Boolean Model Vector Model
Modern models of Information Retrieval Latent Semantic Indexing Correlation Method
Conclusion
![Page 3: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/3.jpg)
Introduction People need information to solve problem
Very simple thing Complex thing
“Perfect search machine” defined by Larry Page is something that understand exactly what you mean and return exactly what you want.
![Page 4: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/4.jpg)
Challenges to IR Introduction of www
www is large Heterogeneous
Gives challenges to IR
![Page 5: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/5.jpg)
Overview of IR and basic terminologies
IR can be divided in to 3 components Input Processor Output
Processor
Query
Documents
Output
Input
![Page 6: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/6.jpg)
Input The main task: Obtain a good
representation of each document and query for computer to use A document representative: a list of extracted
keywords Idea: Let the computer process the natural
language in the document
Input
![Page 7: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/7.jpg)
Input (cont) Obstacles
Theory of human language has not been sufficiently developed for every language
Not clear how to use it to enhance information retrieval
![Page 8: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/8.jpg)
Input (cont) Representing documents by keywords
Step1:Removing common words Step2: Stemming
![Page 9: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/9.jpg)
Step1: Removing common words Very high frequency words very common
words They should removed comparing with stop-list So
Non-significant words not interfere IR process Reduce the size document between 30->50 per
cent (C.J. van Rijsbergen)
willof the
be
are
By
![Page 10: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/10.jpg)
Step2:Stemming Def: Stemming : The process of chopping
the ending of a term, e.g. removing “ed”, ”ing”
Algorithm Porter
![Page 11: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/11.jpg)
Processor
This part of the IR system are concerned with the retrieval process Structuring the documents in an appropriate
way Performing actual retrieval function, using a
predefined model
Input ProcessorInput
![Page 12: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/12.jpg)
Output The output will be a set of documents,
assumed to be relevant to users. Purpose of IR:
Retrieved all relevant documents Retrieved irrelevant documents as few as
possible
Input Processor output
![Page 13: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/13.jpg)
Definition
documents relevant of number Total
retrieved documents relevant of Number recall
retrieved documents of number Total
retrieved documents relevant of Number precision
Relevant docs retrieved
relevant docs
Relevant docs retrieved
All docs retrieved
![Page 14: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/14.jpg)
Returns relevant documents butmisses many useful ones too
Returns most relevantdocuments but includes lots of junk
The ideal
0
Pre
cisi
on
Recall
1
1
![Page 15: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/15.jpg)
Information Retrieval Models Classical models
Boolean model Vector model
Novel models Latent semantic indexing model Correlation method
![Page 16: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/16.jpg)
Boolean model Earliest and simplest method, widely used
in IR systems today Based on set theories and Boolean algebra. Queries are Boolean expressions of
keywords, connected by operators AND, OR, ANDNOT
Ex: (Saint Petersburg AND Russia) | (beautiful city AND Russia)
![Page 17: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/17.jpg)
Inverted files are widely used Ex:
Term1 : doc 2 , doc 5, doc6 ; Term2 : doc 2, doc4, doc5; Query : q = (term1 AND term2) Result: doc2, doc5
Term-document matrix can be used
![Page 18: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/18.jpg)
Thinking about Boolean model Advantages:
Very simple model based on sets theory Easy to understand and implement Supports exact query
Disadvantages: Retrieval based on binary decision criteria , without notion of
partial matching Sets are easy, but complex Boolean expressions are not The Boolean queries formulated by users are most often so
simplistic Retrieves so many or so few documents Gives unranked results
![Page 19: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/19.jpg)
Vector Model Why Vector Model ?
Boolean model just takes into account the existence or
nonexistence of terms in a document Has no sense about their different contributions to
documents
![Page 20: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/20.jpg)
Overview theory of vector model Documents and queries are displayed as vectors
in index-term space Space dimension is equal to the vocabulary size Components of these vectors: the weights of the
corresponding index term, which reflects its significant in terms of representative and discrimination power
Retrieval is based on whether the “query vector” and “document vector” are closed enough.
![Page 21: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/21.jpg)
Set of document:
A finite set of terms :
Every document can be displayed as vector:
the same to the query:
1 2{ , ,..., }mD D D D
1 2{ , ,..., }nT t t t
1 2( , ,..., )j j j njd w w w
1 2( , ,..., )q q nqq w w w
![Page 22: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/22.jpg)
Similarity of query q and document d:
Given a threshold , all documents with similarity > threshold are retrieved
i
j
dj
q
( , )( , ) cos( )
(|| || * || ||)
d qsimilarity q d
d q
![Page 23: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/23.jpg)
Compute a good weight A variety of weighting schemes are available They are based on three proven principles:
Terms that occur in only a few documents are more valuable than ones that appear in many
The more often a term occur in a document, the more likely it is to be important to that document
A term that appears the same number of times in a short document and in a long one is likely to be more available for the former
![Page 24: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/24.jpg)
tf-idf-dl (tf-idf) scheme The term frequency of a term ti in document dj:
The length of document dj: DLj = total number of terms occurrences in document dj Inverted document frequency: collection of N documents,
inverted document frequency of a term ti that appears in n document is :
Weight:
ij i jTF Number of occurences of term t in document d
logi
NIDF
n
*ij iij
j
TF IDFw
DL
![Page 25: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/25.jpg)
Think about Vector model Advantages:
Term weighting improves the quality of the answer Partial matching allows to retrieve the documents
that approximate the query conditions Cosine ranking formula sorts the answer
Disadvantages Assumes the independences of terms Polysemy and synonymy problem are unsolved
![Page 26: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/26.jpg)
Modern models of IR Why ?
Problems with polysemy: Bass (fish or music ?)
Problems with synonymy: Car or automobile ?
These failures can be traced to : The way index terms are identified is incomplete Lack of efficient methods to deal with polysemy
Idea to solve this problem: take the advance of implicit higher order structure(latent) in the association terms with documents .
![Page 27: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/27.jpg)
Latent Semantic Indexing (LSI) LSI overview
Representing documents roughly by terms is unreliability, ambiguity and redundancy
Should find a method , which can : Documents and terms are displayed as vectors in
a k-dim concepts space. Its weights indicating the strength of association with each of these concepts
That method should be flexible enough to remove the weak concepts, considered as noises
![Page 28: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/28.jpg)
Document-term matrix A[mxn] are built
Matrix A is factored in to 3 matrices, using Singular value decomposition SVD
U,V are orthogonal matrices
d1 d2 …. dm
t1 w11 w12 … w1m
t2 w21 w22 … w2m
: : : : : : : :tn wn1 wn2 … wnm
TA U V T T
nU U V V I
1 2( , ,..., )ndiag
1 2 1Let ( ) ... ... 0r r nrank A r
![Page 29: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/29.jpg)
These special matrices show a break down of the original relationship (doc-term) to a linearly independent components (factors)
Many of these components are very small: ( considered as noises) and should be ignored
1 0 ... 0
0 ... 0 ...
... ... ...
0 0 ... 0
T
r
A U V
![Page 30: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/30.jpg)
Criteria to choose k: Ignore noises Important information are
not lost
Documents and terms are displayed as vectors in k-dim space
Theory Eckart & Young ensures us about not losing important information
2 2 2 21
( )min || || || || ...F k F k r
rank B kA B A A
Tk k kA U V
![Page 31: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/31.jpg)
Query Should find a method to display a query to
k-dim space Query q can be seen as a document. From equation: We have
Tk k kA U V
1Tk k kq q U
![Page 32: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/32.jpg)
Similarity between objects
Term-Term: Dot product between two rows vector of matrix Ak
reflects the similarity between two terms Term-term similarity matrix :
Can consider the rows of matrix as coordinate of terms.
The relation between taking rows of as coordinate and rows of as coordinates is simple
( )( )T T T Tk k k k k k k k k k k kT A A U V V U U U
k kU
kU
k kU
![Page 33: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/33.jpg)
Document-document Dot product between two columns vectors of matrix Ak
reflect the similarity between two documents.
Can consider the row of matrix as coordinates of documents.
Term-document This value can be obtained by looking at the element of
matrix Ak
Drawback: between and within comparisons can not be done simultaneously without resizing the coordinate.
( )( )T T T Tk k k k k k k k k k k kD A A V U U V V V
k kV
1/ 2 1/ 2Tk k k k k k k kA U V U V
![Page 34: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/34.jpg)
Example q1: human machine interface for Lab ABC computer applications q2: a survey of user opinion of computer system response time. q3: the EPS user interface management system q4: System and human system engineering testing of EPS q5:Relation of user-perceived response time to error measurement
q6: The generation of random, binary, unordered tree q7: The intersection graph of paths in trees q8: Graph minors IV: Widths of trees and well-quasi-ordering q9: Graph minors: A survey
![Page 35: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/35.jpg)
Numerical results1 2 3 4 5 6 7 8 9
1 0 0 1 0 0 0 0 0
1 0 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
0 1 1 0 1 0 0 0 0
0 1 1 2 0 0 0 0 0
0 1 0 0 1 0 0 0 0
0 1 0 0 1 0 0 0 0
0 0 1 1 0 0 0 0 0
0 1 0 0 0 0 0 0 1
0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 1 1 1
0 0 0 0 0 0 0 1 1
q q q q q q q q q
human
interface
computer
user
system
A response
time
EPS
survey
trees
graph
minors
0.22 0.11
0.20 0.07
0.24 0.04
0.40 0.06
0.64 0.17
0.27 0.11 3.34 0 0.20 0.61 0.46 0.54 0.28 0.00 0.02 0.02 0.08
0.27 0.11 0 2.54 0.06 0.
0.30 0.14
0.21 0.27
0.01 0.49
0.04 0.62
0.03 0.45
kA
17 0.13 0.23 0.11 0.19 0.44 0.62 0.53
T Tk k k kA U V A U V
![Page 36: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/36.jpg)
Query: human computerinteraction
![Page 37: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/37.jpg)
Updating
Folding in New document New terms and docs has no effect on the
presentation of pre-existing docs and terms Re-computing SVD
Re-compute SVD Requires times and memory
Choosing one of these two methods
1k Tnew new k kd d U
![Page 38: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/38.jpg)
Think about LSI Advantages:
Synonymy problem is solved Displaying documents in a more reliable
space: Concepts space Disadvantages:
Polysemy problem is still unsolved A special algorithms for handling with large
size matrices should be implemented
![Page 39: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/39.jpg)
Correlation Method Idea: If a keyword is present in the document,
correlated keywords should be taken into account as well. So, the concepts containing in the document aren’t obscured by the choices of a specific vocabulary.
In vector space model: similarity vector :
Depend on the user query q, we now build the best query, taking the correlated keyword into account as well.
Tr A q
![Page 40: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/40.jpg)
The correlation matrix is built based on the term-document matrix Let: A is the term-document matrix D : number of document, is the mean vector.
Clearly that : Covariance matrix is computed:
Correlation Matrix S :
d
1
1 D
jj
d dD
1
1( )( )
1
D
j jj
C d d d dD
ijij
ii jj
cs
c c
,ij m m
S s
![Page 41: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/41.jpg)
Better query: We now use SVD to reduce noises in the
correlation of keywords: We choose the first k largest factors to obtain
the k dimensional of S
Generate our best query: Vector of similarity :
Define a projection, defined by :
betterq Sq
Tk k k kS X X
best kq S q
1/ 2 1/ 2T T T T Tk k k k k k k kr A S q A X X q A X X q
1/ 2 Tk kP X
![Page 42: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/42.jpg)
Strength of correlation method In real world, correlation between words are static. Number of terms has a higher stability level when
comparing with number of documents Number of documents are many times larger than
number of keywords. This method is able to handle database with a
very large number of documents and doesn’t have to update the correlation matrix every time adding new documents.
Its importance in the electronic networks.
![Page 43: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/43.jpg)
Conclusion IR overview Classical IR models :
Boolean model Vector model
Modern IR models : LSI Correlation methods
![Page 44: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/44.jpg)
Any questions ?
![Page 45: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/45.jpg)
References Gheorghe Muresan: Using document Clustering and
language modeling in mediated IR. Georges Dupret: Latent Concepts and the Number
Orthogonal Factors in Latent Semantic Indexing C.J. van RIJSBERGEN: Information Retrieval Ricardo Baeza-Yates: Modern Information Retrieval Sandor Dominich: Mathematical foundation of
information retrieval. IR lectures note from : www.cs.utexas.edu Scott Deerwester, Susan T.Dumais: Indexing by Latent
Semantic Analysis
![Page 46: INFORMATION SEARCH Presenter: Pham Kim Son Saint Petersburg State University](https://reader036.vdocument.in/reader036/viewer/2022062500/56649f045503460f94c190e8/html5/thumbnails/46.jpg)
Desktop Search Design the desktop search that satisfies
Not affects computer’s performance Privacy is protected Ability to search as many types of files as
possible( consider music file) Multi languages search User-friendly Support semantic search if possible