Techniques for Collaboration in Text Filtering 1
Techniques for Collaboration in Text Filtering
Ian Soboroff
Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Overview
• Text filtering and collaborative filtering
• Finding collaboration among content profiles
• Experimental results
• Ongoing work
Information Filtering
• Given
  • a stream of documents (news articles, movies)
  • a set of users (with stable and specific interests)
• Recommend documents to users who will be interested in them
  • "Tell me when a jazz CD comes out that I'll like."
  • "Tell me when an earthquake is reported."
Content Filtering
• Construct profiles from example documents
  • vector of weights for terms in documents
  • can use known relevant and nonrelevant docs
  • can use external resources such as a home page, job description, or research papers
• Match new documents against content profiles
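The two steps on this slide can be sketched in a few lines. The slides do not say how the profile weights are computed, so this sketch assumes a Rocchio-style profile (mean of relevant vectors minus a scaled mean of nonrelevant ones) and cosine matching. The function names, the `beta` and `gamma` parameters, and the four-term toy vocabulary are all hypothetical.

```python
import numpy as np

def build_profile(relevant, nonrelevant, beta=1.0, gamma=0.5):
    """Rocchio-style profile (an assumption, not the slides' stated method)."""
    profile = beta * np.mean(relevant, axis=0)
    if len(nonrelevant) > 0:
        profile = profile - gamma * np.mean(nonrelevant, axis=0)
    # Clip negative weights to zero, a common simplification.
    return np.maximum(profile, 0.0)

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy term space (hypothetical): [earthquake, turkey, jazz, mingus]
relevant = np.array([[3.0, 1.0, 0.0, 0.0],
                     [2.0, 2.0, 0.0, 0.0]])
nonrelevant = np.array([[0.0, 0.0, 2.0, 1.0]])
profile = build_profile(relevant, nonrelevant)

# Match two new documents against the profile.
quake_doc = np.array([1.0, 1.0, 0.0, 0.0])
jazz_doc = np.array([0.0, 0.0, 1.0, 1.0])
quake_score = cosine(profile, quake_doc)
jazz_score = cosine(profile, jazz_doc)
```

In this toy case the earthquake document scores close to 1 and the jazz document scores 0, so only the former would be recommended under any reasonable threshold.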
Filtering in a Community
• Many people will be watching the same stream
• Some of them may have overlapping interests
  • earthquakes, mideast politics, building codes, Turkey
  • Charles Mingus, Duke Ellington, Kenny G
• Want to take advantage of group effort
"Pure" Collaborative Filtering
• collect users' ratings for documents
  • thumbs up/down, or 1-5 scale
• compute correlations among users
• predict ratings for new/unseen items using existing ratings and correlation values
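A minimal sketch of these three steps follows. It assumes Pearson correlation over co-rated items and a mean-offset weighted prediction, which is one common formulation and not necessarily the one the slides have in mind. The users, items, and ratings are invented for illustration.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = [item for item in ratings_a if item in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def predict(user, item, ratings):
    """Predict a rating as the user's mean plus correlation-weighted offsets."""
    own = ratings[user]
    own_mean = sum(own.values()) / len(own)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(own, theirs)
        their_mean = sum(theirs.values()) / len(theirs)
        num += w * (theirs[item] - their_mean)
        den += abs(w)
    return own_mean if den == 0 else own_mean + num / den

# Invented ratings on a 1-5 scale.
ratings = {
    "ann": {"d1": 5, "d2": 1, "d3": 4},
    "ben": {"d1": 4, "d2": 2, "d3": 5, "d4": 5},
    "cy":  {"d1": 1, "d2": 5, "d3": 2, "d4": 1},
}
ann_mean = sum(ratings["ann"].values()) / len(ratings["ann"])
prediction = predict("ann", "d4", ratings)
```

Here `ann` agrees with `ben` and disagrees with `cy`, so both neighbours push the prediction for `d4` above her mean: `ben` liked it, and `cy`, whose tastes run opposite, did not.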
Pure CF Example
[Table: ratings given by Alice, Bob, Carmen, and Doug to comedies and dramas; "?" marks the unrated items whose ratings are to be predicted.]
Combining Content and Collaboration
• Pure collaborative filtering
  • can recommend anything
  • must have ratings to give predictions
  • don't know much about documents or ratings
• Adding content to collaboration
  • content filtering can recommend an unrated document
  • exploit common themes among content profiles
One Approach to CBCF
• Construct content profiles
  • Documents are vectors of weighted features
  • Build profiles from known relevant and nonrelevant documents
• Collaborative step
  • Combine profile vectors into single matrix
  • Compute latent semantic index of profile collection
• Route new documents in profiles' "LSI space"
Latent Semantic Indexing
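• Compute singular value decomposition of a content matrix M
  • D, a representation of M in r dimensions
  • T, a matrix for transforming new documents
  • Σ, a diagonal matrix that gives relative importance of dimensions

$M_{t \times d} = T_{t \times r}\,\Sigma_{r \times r}\,D^{\mathsf{T}}_{r \times d}$, where $M = [w_{td}]$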
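To make the role of T concrete, the standard LSI "fold-in" maps a new document into the r-dimensional space as follows. The symbols $q$ (the new document's term vector, $t \times 1$), $\hat{q}$ (its image in LSI space), and $\Sigma$ (the diagonal matrix of singular values in the decomposition $M = T \Sigma D^{\mathsf{T}}$) are notation assumed here, since the slide's own symbols did not all survive.

```latex
\hat{q} = \Sigma^{-1} T^{\mathsf{T}} q
\qquad\text{since}\qquad
\Sigma^{-1} T^{\mathsf{T}} M
  = \Sigma^{-1} T^{\mathsf{T}} T \,\Sigma\, D^{\mathsf{T}}
  = D^{\mathsf{T}}
```

The second identity uses $T^{\mathsf{T}} T = I$. It shows that folding in a column of $M$ reproduces that document's row of $D$, so new and existing documents end up in the same space and can be compared directly, for example by cosine similarity.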
Collaborating with LSI
• LSI dimensions are ...
  • based on term co-occurrence patterns between documents (profiles)
  • ordered by their prominence in collection
• LSI space built from profiles
  • highlights common patterns among profiles
  • "noisy" dimensions can be pruned
  • project new documents into a collaborative space for routing
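The idea can be sketched with NumPy as below: take the SVD of a terms-by-profiles matrix, keep the top k dimensions, fold a new document into that space, and score it against each profile by cosine. The function names, the toy vocabulary, and the three profiles (loosely echoing topic names mentioned later in the talk) are hypothetical, and the scoring rule is an assumption rather than the talk's documented procedure.

```python
import numpy as np

def lsi_of_profiles(profiles, k):
    """SVD of a terms x profiles matrix, truncated to the top k dimensions.

    Returns (T_k, s_k, P_k); row j of P_k is profile j in LSI space.
    """
    T, s, Dt = np.linalg.svd(profiles, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T

def fold_in(doc, T_k, s_k):
    """Project a term vector into LSI space (Sigma^-1 T^T doc)."""
    return (T_k.T @ doc) / s_k

def route(doc, T_k, s_k, P_k):
    """Cosine score of a new document against every profile in LSI space."""
    q = fold_in(doc, T_k, s_k)
    norms = np.linalg.norm(P_k, axis=1) * np.linalg.norm(q)
    return (P_k @ q) / np.where(norms > 0, norms, 1.0)

# Toy vocabulary: [hydrogen, fuel, car, hybrid, sweatshop, clothing]
# One column per profile: hydrogen energy, hybrid fuel cars, clothing sweatshops.
profiles = np.array([
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 2.0],
])
T_k, s_k, P_k = lsi_of_profiles(profiles, k=2)

# A document that mentions only "hybrid" shares no term with profile 0.
doc = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
scores = route(doc, T_k, s_k, P_k)
```

With k=2 the first two profiles collapse onto one shared dimension because they both use "fuel", so the "hybrid" document scores highly against the hydrogen profile despite having no terms in common with it, while the sweatshop profile scores about zero. That is the collaborative effect in miniature; with too large a k (here, k=3) the effect disappears.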
Experiments with Cranfield
• Cranfield, a standard (if small) IR collection
  • 1398 documents, 255 scored queries
• Profiles: selected Cranfield queries
  • 26 queries with 15 relevant documents
  • 70% of profile's relevant docs used in each profile
• Results show an improvement from using LSI of profiles
  • compared to using profiles alone
  • compared to using LSI of all of Cranfield
Results: Average Precision
Method                                   k-value   Set 1    Set 2
Content (log-tfidf)                      -         0.2894   0.2705
Content LSI (LSI of all of Cranfield)    25        0.2656   0.1980
                                         50        0.3136   0.2686
                                         100       0.3251   0.3053
                                         200       0.3314   0.3144
                                         500       0.3302   0.3149
Collaborative LSI (LSI of profiles)      8         0.3136   0.2583
                                         15        0.4151   0.3745
                                         18        0.3600   0.3615
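For reference, average precision for a single profile can be computed as sketched below. The slides do not say which variant was used, so this assumes the common non-interpolated form: the mean of the precision values at each rank where a relevant document appears, divided by the total number of relevant documents. The document IDs are made up.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document's rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)

# Relevant documents appear at ranks 1 and 4: (1/1 + 2/4) / 2 = 0.75
ap = average_precision(["d3", "d1", "d7", "d2"], {"d3", "d2"})
```

Averaging this value over all profiles in a test set gives a single figure per method, which is presumably how the per-set numbers above were obtained.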
Results: Precision-Recall
Experiments with TREC
• TREC-8 routing task
  • Profiles: 50 topics (351-400)
  • Test Documents: Financial Times 1993-4
  • Training Documents: FT 92, LA Times 89-90, FBIS
• Building profiles
  • short topic description
  • known relevant documents in training set
  • sample of non-relevant documents from training set
Average Precision in TREC
• Average precision...
  • with profiles alone = 0.4464
  • with profile LSI = 0.3971
• LSI shows no improvement over original profiles
  • Some topics conceivably have common interests
    • "hydrogen energy"; "hydrogen fuel automobiles"; "hybrid fuel cars"
    • "clothing sweatshops"; "human smuggling"
• But too little training overlap?
Conclusions
• LSI can improve filtering performance
  • but might not, if SVD can't find anything to work with
• LSI of profiles is much cheaper to compute than LSI of a whole collection (or even a sample!)
Current and Future Work
• Looking at other collections
  • More TREC!
  • Reuters-21578
  • Collaborative filtering collections... such as?
• Looking at other techniques
  • Comparison to collaboration alone?
  • Other methods of combining content and collaboration