techniques of information retrieval
![Page 1: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/1.jpg)
![Page 2: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/2.jpg)
Techniques of Information Retrieval
Tariq Hassan & Sabahat
![Page 3: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/3.jpg)
Road Map:
• What is IR?
• Why & how does it work?
• Evaluation Techniques
• Global & Local Methods
1. Relevance Feedback
2. Probabilistic Relevance Feedback
3. Indirect Relevance Feedback
4. Rocchio Algorithm
5. Linear Classifiers
6. Naïve Bayes Text Classification
Questions & Discussion
![Page 4: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/4.jpg)
What is IR? Why & How?
• What? Finding the information needed to satisfy a user.
• Why? Data comes in many different formats.
• How?
– Stop list
– Stemming
– Inverse document frequency
– Word counts
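The four steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list, the suffix rules in `stem`, and the sample documents are hypothetical, and the stemmer is a crude suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import math
import re
from collections import Counter

# Hypothetical stop list and suffix rules, chosen only for illustration.
STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to"}
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Crude stemmer: strip one known suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def word_counts(text):
    """Count how often each stemmed term occurs in one document."""
    return Counter(preprocess(text))


def inverse_document_frequency(docs):
    """idf(t) = log(N / df(t)), with df(t) = number of documents containing t."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(preprocess(doc)))
    return {term: math.log(n_docs / count) for term, count in df.items()}


docs = [
    "The cat is chasing the mice",
    "Dogs chased the cat",
    "Indexing of documents",
]
counts = word_counts(docs[0])
idf = inverse_document_frequency(docs)
```

Terms that occur in fewer documents (here `mice`) receive a higher IDF than terms that occur in many (here `cat`), which is what lets rare terms count for more when ranking.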
![Page 5: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/5.jpg)
What is IR? Why & How?
IR is generally used in three scenarios:
1. Web search
2. Personal IR (text classification)
3. Enterprise level
![Page 6: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/6.jpg)
Evaluation Techniques
• Why?
• How? Relevant & non-relevant documents
Precision and Recall
$$P = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})}$$
$$R = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})}$$
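As a minimal sketch of the two measures, the function below computes precision and recall from a list of retrieved document IDs and a list of relevant document IDs. The function name and the sample IDs are hypothetical; returning 0.0 for an empty set is an assumption made to avoid division by zero.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["d1", "d2", "d3", "d4"]  # what the system returned
relevant = ["d2", "d4", "d5"]         # what the user actually needed
p, r = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant, so precision is 2/4, and two of the three relevant documents were found, so recall is 2/3.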
![Page 7: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/7.jpg)
Methods:
1. Global Methods: reformulate the query.
2. Local Methods: adjust the query relative to the initial results returned for it.
![Page 8: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/8.jpg)
Local Methods
1. Relevance Feedback
– Feedback given by the user about the relevance of the documents in the initial set of results.
2. Probabilistic Relevance Feedback
– Implemented by building a classifier.
3. Indirect Relevance Feedback
– Works without user intervention:
1. By using user actions.
2. By using user histories or logs.
![Page 9: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/9.jpg)
Conclusion: Relevance Feedback
Assumption: The user has enough initial knowledge to pose a first query.
Issues:
– Misspellings
– Cross-language retrieval
– Vocabulary mismatch
![Page 10: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/10.jpg)
Rocchio Algorithm
• Incorporates the relevance feedback mechanism into the vector space model.
• Also uses:
– Cosine similarity function
– Euclidean mechanism
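The standard Rocchio update moves the query vector toward the centroid of the documents marked relevant and away from the centroid of those marked non-relevant: $\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d} \in D_r} \vec{d} - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d} \in D_{nr}} \vec{d}$. The sketch below is one possible rendering, not the presenters' implementation. The weights 1.0, 0.75 and 0.15 are commonly quoted defaults and are an assumption here, as are the toy vectors and the clipping of negative weights to zero.

```python
import math


def centroid(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector; negative term weights are clipped to 0."""
    dim = len(query)
    rel_c = centroid(relevant, dim)
    non_c = centroid(non_relevant, dim)
    updated = [
        alpha * query[i] + beta * rel_c[i] - gamma * non_c[i] for i in range(dim)
    ]
    return [max(0.0, w) for w in updated]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-term vocabulary; all vectors are hypothetical.
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
non_relevant = [[0.0, 0.0, 2.0]]
new_query = rocchio(query, relevant, non_relevant)
```

After the update, the query is closer (by cosine similarity) to the relevant documents than the original query was, which is the intended effect of the feedback step.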
![Page 11: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/11.jpg)
Example
![Page 12: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/12.jpg)
Outcome
• Relevance feedback plays an important role in understanding the user's requirements.
• The Rocchio algorithm is not the best-performing method, but it is a practical choice because it is simple and gives good results.
• These techniques have significant importance for content-based systems.
![Page 13: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/13.jpg)
Classification Problems
• Given:
– A document d
– A fixed set of categories: Sports, Informatics, Literature, Medical, Entertainment
– A training set of documents, each labeled with its class
• Determine:
– A learning method or algorithm that enables us to learn a classifier
– For a test document dT, its category
![Page 14: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/14.jpg)
Classification Techniques
• Manual (a.k.a. Knowledge Engineering)
– typically, rule-based expert systems
• Machine Learning
– Naïve Bayesian (Probabilistic)
– Decision Trees (Decision Structures)
– Support Vector Machines (Linear Classification)
![Page 15: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/15.jpg)
Document Representation
• Binary Representation
• Frequency Representation
• TF*IDF Representation
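To make the three representations concrete, the sketch below builds each vector for the same tokenized document. The tiny corpus is hypothetical, and the weighting `tf * log(N / df)` is one common TF*IDF variant, assumed here because the slide does not state which one it uses.

```python
import math
from collections import Counter

# Hypothetical, already tokenized corpus.
docs = [
    ["goal", "match", "goal"],
    ["match", "score"],
    ["code", "bug"],
]
# Fixed term order shared by every vector: bug, code, goal, match, score.
vocab = sorted({term for doc in docs for term in doc})


def frequency_vector(doc):
    """Raw count of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def binary_vector(doc):
    """1 if the term occurs in the document, else 0."""
    return [1 if count > 0 else 0 for count in frequency_vector(doc)]


def tfidf_vector(doc):
    """Term frequency scaled by log(N / df), df = documents containing the term."""
    n_docs = len(docs)
    vector = []
    for term, tf in zip(vocab, frequency_vector(doc)):
        df = sum(1 for d in docs if term in d)
        vector.append(tf * math.log(n_docs / df))
    return vector


freq = frequency_vector(docs[0])
binary = binary_vector(docs[0])
tfidf = tfidf_vector(docs[0])
```

In the first document, `goal` appears twice and in no other document, so it dominates the TF*IDF vector, while `match` is down-weighted because it also appears in the second document.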
![Page 16: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/16.jpg)
Naïve Bayes document classification example
• Probabilistic
– Prior vs. posterior
• Bernoulli Model
– Feature vector with binary elements
• Multinomial Model
– Integers representing frequency of words
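A minimal sketch of the multinomial model follows, assuming add-one (Laplace) smoothing and log-probabilities, neither of which is stated on the slide. The training documents, the category names drawn from the earlier slide, and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(labeled_docs):
    """Collect log-priors, per-class term counts, and the vocabulary."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_doc_counts.values())
    priors = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    return priors, term_counts, vocab


def classify(tokens, priors, term_counts, vocab):
    """Pick the class with the highest posterior score (in log space)."""
    scores = {}
    for label, log_prior in priors.items():
        total = sum(term_counts[label].values())
        score = log_prior
        for term in tokens:
            if term not in vocab:
                continue  # assumption: ignore terms never seen in training
            # Add-one smoothing keeps unseen (term, class) pairs above zero.
            likelihood = (term_counts[label][term] + 1) / (total + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get), scores


training = [
    (["goal", "match", "team"], "sports"),
    (["match", "score", "goal"], "sports"),
    (["code", "bug", "compiler"], "informatics"),
    (["code", "algorithm"], "informatics"),
]
priors, term_counts, vocab = train_multinomial_nb(training)
label, scores = classify(["goal", "code", "goal"], priors, term_counts, vocab)
```

The prior is the class frequency in the training set; the posterior score adds the evidence from each word occurrence, so a word that appears twice in the test document contributes twice.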
![Page 17: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/17.jpg)
![Page 18: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/18.jpg)
Classify the document
![Page 19: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/19.jpg)
Naïve Bayes classification
• Very fast learning and testing
– Why?
• Low storage requirements
• Very good in domains with many equally important features
• More robust to irrelevant features than many learning methods
![Page 20: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/20.jpg)
Linear Classification
• Documents as labeled vectors
• Documents in the same class form a contiguous region of space
• Documents from different classes don’t overlap (much)
• Learning a classifier: build surfaces to delineate classes in the space
![Page 21: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/21.jpg)
Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
![Page 22: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/22.jpg)
Support Vector Machines
• One Possible Solution
[Figure: candidate decision boundary B1]
![Page 23: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/23.jpg)
Support Vector Machines
• Another possible solution
[Figure: candidate decision boundary B2]
![Page 24: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/24.jpg)
Support Vector Machines
• Other possible solutions
[Figure: further candidate decision boundaries]
![Page 25: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/25.jpg)
Support Vector Machines
• Which one is better? B1 or B2?• How do you define better?
[Figure: decision boundaries B1 and B2]
![Page 26: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/26.jpg)
Support Vector Machines
• Find the hyperplane that maximizes the margin
[Figure: boundaries B1 and B2, with margin hyperplanes b11, b12 for B1 and b21, b22 for B2]
![Page 27: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/27.jpg)
Support Vector Machines
[Figure: boundaries B1 and B2 with margin hyperplanes b11, b12, b21, b22; the margin and the support vectors are marked]
![Page 28: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/28.jpg)
Support Vector Machines
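[Figure: boundary B1 with margin hyperplanes b11 and b12]
$$\vec{w} \cdot \vec{x} + b = 0$$
$$\vec{w} \cdot \vec{x} + b = 1, \qquad \vec{w} \cdot \vec{x} + b = -1$$
$$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}$$
$$\text{Margin} = \frac{2}{\lVert \vec{w} \rVert}$$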
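The quantities on this slide can be checked numerically. The sketch below assumes a weight vector `w` and bias `b` are already known (training an SVM is out of scope here) and uses the distance between the hyperplanes w·x + b = 1 and w·x + b = -1, which is 2 / ||w||. The values of `w`, `b` and the sample points are hypothetical, and `predict` uses the sign of w·x + b for points that fall inside the margin, which is an assumption beyond the slide's two-case definition.

```python
import math


def decision_value(w, x, b):
    """Compute w . x + b for one point."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, x, b):
    """Label a point by the side of the boundary w . x + b = 0 it falls on."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def margin_width(w):
    """Distance between the hyperplanes w . x + b = 1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_margin(w, b, points, labels):
    """True if every training point lies on or outside its margin hyperplane."""
    return all(y * decision_value(w, x, b) >= 1 for x, y in zip(points, labels))


# Hypothetical separating hyperplane x1 + x2 - 3 = 0 and four labeled points.
w, b = [1.0, 1.0], -3.0
points = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0], [3.0, 3.0]]
labels = [-1, -1, 1, 1]
width = margin_width(w)
```

In this toy setup the points (1, 1) and (2, 2) lie exactly on the two margin hyperplanes, so they play the role of support vectors, and the distance between them equals the computed margin width.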
![Page 29: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/29.jpg)
Support Vector Machines
![Page 30: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/30.jpg)
Questions & Discussion
![Page 31: Techniques of information retrieval](https://reader036.vdocument.in/reader036/viewer/2022062412/58f37ccf1a28abae518b45a5/html5/thumbnails/31.jpg)
Bottom Line
• Which classifier do I use for a given document classification problem?
Answer: It depends.
– How much training data is available?
– How simple/complex is the problem?
– How noisy is the data?
– How stable is the problem over time?
For an unstable problem, it's better to use a simple and robust classifier.