Techniques for Collaboration in Text Filtering

Ian Soboroff

Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County

[email protected]

Overview

• Text filtering and collaborative filtering

• Finding collaboration among content profiles

• Experimental results

• Ongoing work

Information Filtering

• Given
  • a stream of documents (news articles, movies)
  • a set of users (with stable and specific interests)

• Recommend documents to users who will be interested in them
  • "Tell me when a jazz CD comes out that I'll like."
  • "Tell me when an earthquake is reported."

Content Filtering

• Construct profiles from example documents
  • vector of weights for terms in documents
  • can use known relevant and nonrelevant docs
  • can use external resources such as a home page, job description, or research papers

• Match new documents against content profiles
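
The two steps above can be sketched in a few lines. This is a minimal illustration, not the system behind the talk: the log-scaled term weights, the `neg_weight` parameter, the function names, and the example texts are all assumptions made for the sketch. A profile is the average of the relevant document vectors minus a fraction of the average nonrelevant vector, and a new document is matched by cosine similarity.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words vector with log-scaled term frequencies (an assumed weighting)."""
    counts = Counter(text.lower().split())
    return {term: 1.0 + math.log(n) for term, n in counts.items()}

def build_profile(relevant, nonrelevant=(), neg_weight=0.5):
    """Average the relevant vectors, then subtract a fraction of the nonrelevant average."""
    profile = Counter()
    for doc in relevant:
        for term, w in tf_vector(doc).items():
            profile[term] += w / len(relevant)
    for doc in nonrelevant:
        for term, w in tf_vector(doc).items():
            profile[term] -= neg_weight * w / len(nonrelevant)
    return dict(profile)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical example documents, chosen only to exercise the functions.
profile = build_profile(
    relevant=["earthquake strikes coastal city", "strong earthquake damages buildings"],
    nonrelevant=["jazz album released today"],
)
on_topic = cosine(profile, tf_vector("earthquake reported near city"))
off_topic = cosine(profile, tf_vector("new jazz album reviewed"))
print(on_topic, off_topic)
```

A filtering system would deliver a document when its score exceeds a per-profile threshold; choosing that threshold is outside the scope of this sketch.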

Filtering in a Community

• Many people will be watching the same stream

• Some of them may have overlapping interests
  • earthquakes, Mideast politics, building codes, Turkey
  • Charles Mingus, Duke Ellington, Kenny G

• Want to take advantage of group effort

"Pure" Collaborative Filtering

• collect users' ratings for documents
  • thumbs up/down, or 1-5 scale

• compute correlations among users

• predict ratings for new/unseen items using existing ratings and correlation values
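
The three steps can be illustrated with a small sketch. The slides do not say which correlation or prediction formula is meant, so this assumes one common choice: Pearson correlation over co-rated items, and a prediction equal to the user's mean rating plus a correlation-weighted average of other users' deviations from their own means. The users, items, and ratings below are hypothetical.

```python
import math

# Hypothetical ratings on a 1-5 scale; not taken from the slides.
ratings = {
    "ann": {"a": 5, "b": 3, "c": 4},
    "ben": {"a": 4, "b": 2, "c": 3, "d": 5},
    "cat": {"a": 1, "b": 5, "c": 2, "d": 1},
}

def mean(user):
    """Mean of all ratings given by `user`."""
    vals = ratings[user].values()
    return sum(vals) / len(vals)

def pearson(u, v):
    """Pearson correlation between two users over the items both have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    mu = sum(ratings[u][i] for i in common) / len(common)
    mv = sum(ratings[v][i] for i in common) / len(common)
    du = [ratings[u][i] - mu for i in common]
    dv = [ratings[v][i] - mv for i in common]
    denom = math.sqrt(sum(x * x for x in du) * sum(y * y for y in dv))
    return sum(x * y for x, y in zip(du, dv)) / denom if denom else 0.0

def predict(user, item):
    """User's mean plus the correlation-weighted deviations of users who rated `item`."""
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        w = pearson(user, other)
        num += w * (ratings[other][item] - mean(other))
        den += abs(w)
    return mean(user) + num / den if den else mean(user)

prediction = predict("ann", "d")
print(prediction)
```

Here "ann" agrees with "ben" and disagrees with "cat", so both neighbours push the prediction for item "d" above her mean. A practical system would also clip the result to the rating scale.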

Pure CF Example

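[Ratings table: rows Alice, Bob, Carmen, Doug; column groups Comedies and Dramas; "?" marks an unknown rating. Cell positions were lost in extraction. Values in extracted order: 5; ? 9; 4 9; ? 9; 7; 7 ? 2 9; 7 8 1 8]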

Combining Content and Collaboration

• Pure collaborative filtering
  • can recommend anything
  • must have ratings to give predictions
  • don't know much about documents or ratings

• Adding content to collaboration
  • content filtering can recommend an unrated document
  • exploit common themes among content profiles

One Approach to CBCF

• Construct content profiles
  • Documents are vectors of weighted features
  • Build profiles from known relevant and nonrelevant documents

• Collaborative step
  • Combine profile vectors into single matrix
  • Compute latent semantic index of profile collection

• Route new documents in profiles' "LSI space"

Latent Semantic Indexing

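• Compute singular value decomposition of a content matrix M
  • D, a representation of M in r dimensions
  • T, a matrix for transforming new documents
  • Σ gives relative importance of dimensions

$$
\underset{t \times d}{M} \;=\; \underset{t \times r}{T}\;\; \underset{r \times r}{\Sigma}\;\; \underset{r \times d}{D^{T}}, \qquad M = [\,w_{td}\,]
$$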
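
One step that the decomposition implies, but the slide leaves implicit, is how T transforms a new document. The following uses the standard LSI "folding-in" convention, which is an assumption here since the slides do not state which projection was used. Writing the decomposition as $M = T \Sigma D^{T}$, with $T$ having orthonormal columns and $\Sigma$ the diagonal matrix of singular values, a new document vector $d_{\text{new}}$ of length $t$ maps into the $r$-dimensional space as

$$
\hat{d} \;=\; \Sigma^{-1} T^{T} d_{\text{new}}
$$

Applying the same map to the original matrix recovers the document representation, which is why the two live in the same space:

$$
\Sigma^{-1} T^{T} M \;=\; \Sigma^{-1} T^{T} T\, \Sigma\, D^{T} \;=\; D^{T}
$$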

Collaborating with LSI

• LSI dimensions are ...
  • based on term co-occurrence patterns between documents (profiles)
  • ordered by their prominence in collection

• LSI space built from profiles
  • highlights common patterns among profiles
  • "noisy" dimensions can be pruned
  • project new documents into a collaborative space for routing
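
A small NumPy sketch of the idea, under stated assumptions: the term list, the four toy profiles, the choice of k = 2, and the helper names are all hypothetical, and new documents are projected with the standard folding-in formula rather than any method confirmed by the slides. Profile 0 never mentions "turkey", but it shares "earthquake" with profile 1, so the two end up on the same LSI dimension.

```python
import numpy as np

# Hypothetical term-by-profile matrix: rows are terms, columns are user profiles.
terms = ["earthquake", "turkey", "building", "codes", "jazz", "mingus"]
profiles = np.array([
    [2.0, 1.0, 0.0, 0.0],  # earthquake
    [0.0, 2.0, 0.0, 0.0],  # turkey
    [1.0, 0.0, 0.0, 0.0],  # building
    [1.0, 0.0, 0.0, 0.0],  # codes
    [0.0, 0.0, 2.0, 1.0],  # jazz
    [0.0, 0.0, 1.0, 2.0],  # mingus
])

def lsi(matrix, k):
    """Rank-k truncated SVD, so that matrix is approximated by T @ diag(s) @ D.T."""
    T, s, Dt = np.linalg.svd(matrix, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T

def fold_in(doc, T, s):
    """Project a term-space document vector into the k-dimensional LSI space."""
    return (T.T @ doc) / s

def cosine_rows(rows, vec):
    """Cosine similarity between each row of `rows` and `vec`."""
    norms = np.linalg.norm(rows, axis=1) * np.linalg.norm(vec)
    return rows @ vec / np.maximum(norms, 1e-12)

# Keep only the two most prominent dimensions; the rest are pruned as "noisy".
T, s, D = lsi(profiles, k=2)

# A new document that mentions only "turkey".
doc = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])

raw_scores = cosine_rows(profiles.T, doc)        # matching against the profiles alone
lsi_scores = cosine_rows(D, fold_in(doc, T, s))  # matching in the profiles' LSI space
print(raw_scores)
print(lsi_scores)
```

In this toy setup the raw score for profile 0 is zero, while in the two-dimensional LSI space both earthquake-related profiles score close to 1 and the jazz profiles stay near 0. Real profiles are far less cleanly separated, so the choice of k matters much more than it does here.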

Experiments with Cranfield

• Cranfield, a standard (if small) IR collection
  • 1398 documents, 255 scored queries

• Profiles: selected Cranfield queries
  • 26 queries with 15 relevant documents
  • 70% of a profile's relevant docs used in each profile

• Results show improvement from using LSI of profiles
  • compared to using profiles alone
  • compared to using LSI of all of Cranfield

Results: Average Precision

Method                                  k-value   Set 1    Set 2
Content (log-tfidf)                     -         0.2894   0.2705
Content LSI (LSI of all of Cranfield)   25        0.2656   0.1980
                                        50        0.3136   0.2686
                                        100       0.3251   0.3053
                                        200       0.3314   0.3144
                                        500       0.3302   0.3149
Collaborative LSI (LSI of profiles)     8         0.3136   0.2583
                                        15        0.4151   0.3745
                                        18        0.3600   0.3615
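
For readers unfamiliar with the metric, the sketch below shows one way average precision is computed for a single profile and then averaged across profiles. It assumes the non-interpolated variant; the slides do not say which variant produced the figures above, and the document ids are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant document
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of per-profile average precision; `runs` is a list of (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical ranking: relevant documents appear at ranks 1, 3, and 5.
ap = average_precision(["d3", "d7", "d1", "d9", "d4"], {"d3", "d1", "d4"})
print(ap)
```

Relevant documents that never appear in the ranking still count in the denominator, so a profile that misses them is penalized.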

Results: Precision-Recall

Experiments with TREC

• TREC-8 routing task
  • Profiles: 50 topics (351-400)
  • Test Documents: Financial Times 1993-4
  • Training Documents: FT 92, LA Times 89-90, FBIS

• Building profiles
  • short topic description
  • known relevant documents in training set
  • sample of non-relevant documents from training set
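
The slides do not name the formula used to combine these three sources. One standard way to do it, given here only as an assumed illustration, is Rocchio-style weighting: with $\vec{q}$ the vector for the topic description, $R$ the known relevant training documents, and $N$ the sampled non-relevant ones, the profile vector is

$$
\vec{p} \;=\; \alpha\,\vec{q} \;+\; \frac{\beta}{|R|}\sum_{\vec{d} \in R} \vec{d} \;-\; \frac{\gamma}{|N|}\sum_{\vec{d} \in N} \vec{d}
$$

where $\alpha$, $\beta$, and $\gamma$ are tuning weights (hypothetical parameters, not values from the talk) that set how much each source contributes.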

Average Precision in TREC

• Average precision...
  • with profiles alone = 0.4464
  • with profile LSI = 0.3971

• LSI shows no improvement over original profiles

• Some topics conceivably have common interests
  • "hydrogen energy"; "hydrogen fuel automobiles"; "hybrid fuel cars"
  • "clothing sweatshops"; "human smuggling"

• But too little training overlap?

Conclusions

• LSI can improve filtering performance
  • but might not, if SVD can't find anything to work with

• LSI of profiles is much cheaper to compute than LSI of a whole collection (or even a sample!)

Current and Future Work

• Looking at other collections
  • More TREC!
  • Reuters-21578
  • Collaborative filtering collections... such as?

• Looking at other techniques
  • Comparison to collaboration alone?
  • Other methods of combining content and collaboration