![Page 1: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/1.jpg)
Text Representation & Text Classification
for Intelligent Information Retrieval
Ning Yu
School of Library and Information Science
Indiana University at Bloomington
![Page 2: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/2.jpg)
Outline
The big picture
A specific problem – opinion detection
![Page 3: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/3.jpg)
Intelligent information retrieval
Characteristics:
Not restricted to keyword matching and Boolean search
Deals with natural language queries and advanced search criteria
Coarse-to-fine levels of granularity
Automatically organizes, evaluates, and interprets the solution space
User-centered, e.g., adapts to the user’s learning habits
Etc.
![Page 4: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/4.jpg)
Intelligent information retrieval
System preferences:
Various sources of evidence
Natural language processing
Semantic web technologies
Automatic text classification
Etc.
![Page 5: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/5.jpg)
Intelligent IR system diagram
![Page 6: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/6.jpg)
A Specific Question:
Semi-Supervised Learning for Identifying Opinions in Web Content
Dissertation work
![Page 7: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/7.jpg)
Growing demand for online opinions
Enormous body of user-generated content
About anything, published anywhere and at any time
Useful for literature review, decision making, market monitoring, etc.
![Page 8: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/8.jpg)
Major approaches for opinion detection
![Page 9: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/9.jpg)
To acquire a broad and comprehensive collection of opinion-bearing features (e.g., bag-of-words, POS words, n-grams (n > 1), linguistic collocations, stylistic features, contextual features)
To generate complex patterns (e.g., “good amount”) that can approximate the context of words
To generate and evaluate opinion detection systems
To allow evaluation of opinion detection strategies with high confidence
What’s essential? Labeled data! And lots of it!
![Page 10: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/10.jpg)
Challenges for opinion detection
Shortage of opinion-labeled data: manual annotation is tedious, error-prone and difficult to scale up
Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another domain
![Page 11: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/11.jpg)
Motivations & research question
Easy to collect unlabeled user-generated content that contains opinions
Semi-Supervised Learning (SSL) requires only a limited amount of labeled data to automatically label unlabeled data, and it has achieved promising results in NLP studies
Is SSL effective in opinion detection both in sparse data situations and for domain adaptation?
![Page 12: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/12.jpg)
Datasets & data split
Data split (figure):
SSL: Labeled (1-5%), Unlabeled (90%), Evaluation (5%)
Baseline Supervised Learning (SL): Labeled (1-5%), Evaluation (5%)
Full SL: Labeled (95%), Evaluation (5%)
| Dataset (sentences) | Blog Posts | Movie Reviews | News Articles |
|---|---|---|---|
| Opinion | 4,843 | 5,000 | 5,297 |
| Non-opinion | 4,843 | 5,000 | 5,174 |
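The percentages on this slide can be turned into a simple shuffled split. The sketch below is only an illustration: the function name `split_dataset`, the fixed random seed, and the rounding of set sizes are assumptions, not details taken from the dissertation.

```python
import random

def split_dataset(items, labeled_frac=0.05, unlabeled_frac=0.90,
                  eval_frac=0.05, seed=0):
    """Shuffle items and cut them into (labeled, unlabeled, evaluation) lists.

    Hypothetical helper: the fractions mirror the slide (1-5% labeled,
    90% unlabeled, 5% evaluation); anything left over is simply unused.
    """
    if labeled_frac + unlabeled_frac + eval_frac > 1 + 1e-9:
        raise ValueError("fractions must sum to at most 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n = len(shuffled)
    n_eval = round(n * eval_frac)
    n_lab = round(n * labeled_frac)
    n_unl = round(n * unlabeled_frac)
    evaluation = shuffled[:n_eval]
    labeled = shuffled[n_eval:n_eval + n_lab]
    unlabeled = shuffled[n_eval + n_lab:n_eval + n_lab + n_unl]
    return labeled, unlabeled, evaluation

# SSL setting with 1% labeled data; the baseline SL run would use the same
# labeled and evaluation sets and ignore the unlabeled one.
labeled, unlabeled, evaluation = split_dataset(range(1000), labeled_frac=0.01)
# Full SL setting: 95% labeled, nothing left unlabeled.
full_labeled, _, full_eval = split_dataset(range(1000), labeled_frac=0.95,
                                           unlabeled_frac=0.0)
```

Holding the evaluation set fixed across the three settings keeps the SSL, baseline SL, and full SL numbers comparable.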
![Page 13: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/13.jpg)
Two major SSL methods: Self-training
Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set.
Limitation: Auto-labeled data may be biased by the particular opinion classifier.
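A minimal sketch of the self-training loop may make the assumption concrete. It is not the dissertation's implementation: the tiny `BinaryNB` class (a Naive Bayes over sets of present features with add-one smoothing), the `self_train` function, the confidence threshold, and the iteration cap are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

class BinaryNB:
    """Multinomial Naive Bayes over sets of binary (present/absent) features."""
    def fit(self, docs, labels):
        self.n = len(labels)
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.prior}
        for feats, y in zip(docs, labels):
            self.counts[y].update(feats)
        self.vocab = set().union(*docs)
        return self

    def predict(self, feats):
        """Return (most likely label, its posterior probability)."""
        logp = {}
        for c in sorted(self.prior):
            total = sum(self.counts[c].values()) + len(self.vocab)
            logp[c] = math.log(self.prior[c] / self.n) + sum(
                math.log((self.counts[c][f] + 1) / total)  # add-one smoothing
                for f in feats if f in self.vocab)          # skip unseen features
        top = max(logp.values())
        z = sum(math.exp(v - top) for v in logp.values())
        best = max(logp, key=logp.get)
        return best, math.exp(logp[best] - top) / z

def self_train(labeled, unlabeled, threshold=0.9, max_iters=10):
    """labeled: list of (feature_set, label); unlabeled: list of feature_set."""
    train, pool = list(labeled), list(unlabeled)
    fit = lambda data: BinaryNB().fit([d for d, _ in data], [y for _, y in data])
    model = fit(train)
    for _ in range(max_iters):
        newly, rest = [], []
        for feats in pool:
            label, conf = model.predict(feats)
            if conf >= threshold:
                newly.append((feats, label))  # trust the confident prediction
            else:
                rest.append(feats)
        if not newly:
            break                              # nothing confident left to add
        train, pool = train + newly, rest
        model = fit(train)                     # retrain on the enlarged set
    return model, train
```

Because every auto-labeled example comes from the same classifier, its early mistakes are fed back into training, which is the bias the limitation above refers to.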
![Page 14: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/14.jpg)
Two major SSL methods: Co-training
Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other.
Limitation: It is not always easy to create two different classifiers.
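The idea can be sketched with two classifiers that see different views of the same sentence, here unigrams and bigrams. Everything in the block is an illustrative assumption rather than the dissertation's code: the `BinaryNB` class, the `views`, `fit_views`, and `co_train` helpers, whitespace tokenization, the threshold, and the simplification that the more confident of the two classifiers decides the label added to a shared training set.

```python
import math
from collections import Counter

class BinaryNB:
    """Multinomial Naive Bayes over sets of binary (present/absent) features."""
    def fit(self, docs, labels):
        self.n = len(labels)
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.prior}
        for feats, y in zip(docs, labels):
            self.counts[y].update(feats)
        self.vocab = set().union(*docs)
        return self

    def predict(self, feats):
        """Return (most likely label, its posterior probability)."""
        logp = {}
        for c in sorted(self.prior):
            total = sum(self.counts[c].values()) + len(self.vocab)
            logp[c] = math.log(self.prior[c] / self.n) + sum(
                math.log((self.counts[c][f] + 1) / total)
                for f in feats if f in self.vocab)
        top = max(logp.values())
        z = sum(math.exp(v - top) for v in logp.values())
        best = max(logp, key=logp.get)
        return best, math.exp(logp[best] - top) / z

def views(sentence):
    """Two views of one sentence: (unigram set, bigram set)."""
    tokens = sentence.lower().split()
    return set(tokens), {" ".join(p) for p in zip(tokens, tokens[1:])}

def fit_views(train):
    """Train one classifier per view on the shared labeled set."""
    labels = [y for _, y in train]
    return [BinaryNB().fit([views(s)[v] for s, _ in train], labels)
            for v in (0, 1)]

def co_train(labeled, unlabeled, threshold=0.9, max_iters=10):
    """labeled: list of (sentence, label); unlabeled: list of sentence."""
    train, pool = list(labeled), list(unlabeled)
    models = fit_views(train)
    for _ in range(max_iters):
        newly, rest = [], []
        for s in pool:
            guesses = [m.predict(v) for m, v in zip(models, views(s))]
            label, conf = max(guesses, key=lambda g: g[1])
            if conf >= threshold:
                newly.append((s, label))  # one view teaches the other
            else:
                rest.append(s)
        if not newly:
            break
        train, pool = train + newly, rest
        models = fit_views(train)
    return models, train
```

A sentence that one view cannot label confidently can still enter the training set through the other view, which is how two classifiers with different strengths help each other.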
![Page 15: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/15.jpg)
Experimental design
General settings for SSL:
Naïve Bayes classifier for self-training
Binary values for unigram and bigram features
Co-training strategies:
Unigrams and bigrams (content vs. context)
Two randomly split feature/training sets
A character-based language model (CLM) and a bag-of-words model (BOW)
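"Binary values" means a feature records only whether a unigram or bigram occurs in a sentence, not how often. A small sketch, assuming lowercasing and whitespace tokenization (the slides do not say how text was tokenized) and using a hypothetical function name:

```python
def binary_ngram_features(sentence, use_unigrams=True, use_bigrams=True):
    """Return the set of unigrams and/or bigrams present in a sentence.

    A set encodes binary values directly: a feature is either in it (1)
    or not (0), so repeated words do not count twice.
    """
    tokens = sentence.lower().split()  # assumed tokenization
    features = set()
    if use_unigrams:
        features.update(tokens)
    if use_bigrams:
        features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features

features = binary_ngram_features("Good good amount")
# features == {"good", "amount", "good good", "good amount"}
```

Turning one of the two flags off yields the separate unigram (content) and bigram (context) views used by the first co-training strategy.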
![Page 16: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/16.jpg)
Results: Overall
For movie reviews and news articles, co-training proved to be most robust
For blog posts, SSL showed no benefits over SL due to the low initial accuracy
![Page 17: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/17.jpg)
Results: Movie reviews
Both self-training and co-training can improve opinion detection performance
Co-training is more effective than self-training
![Page 18: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/18.jpg)
Results: Movie reviews (cont.)
The more different the two classifiers, the better the performance
![Page 19: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/19.jpg)
Results: Domain transfer (movie reviews->blog posts)
For a difficult domain (e.g., blog), simple self-training alone is promising for tackling the domain transfer problem.
![Page 20: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/20.jpg)
Contributions
Comprehensive research expands the spectrum of SSL application to opinion detection
Investigation of the SSL model that best fits the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation
Generation of guidelines and evaluation baselines advances later studies using SSL algorithms in opinion detection
Research extensible to other data domains, non-English texts, and other text mining tasks
![Page 21: Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University](https://reader036.vdocument.in/reader036/viewer/2022062314/56649eb15503460f94bb6f09/html5/thumbnails/21.jpg)
www.CartoonStock.com
“All my opinions are posted on my online blog.”
“A grade of 85 or higher will get you favorable mention on my blog.”
“If you want a second opinion, I’ll ask my computer”
Thank you!