text categorization and images thesis defense for carl sable committee: kathleen mckeown, vasileios...

Text Categorization and ImagesThesis Defense for Carl Sable

Committee: Kathleen McKeown, Vasileios Hatzivassiloglou, Shree Nayar, Kenneth W. Church, Shih-Fu Chang

Text Categorization

• Text categorization (TC) refers to the automatic labeling of documents, using natural language text contained in or associated with each document, into one or more pre-defined categories.

• Idea: TC techniques can be applied to image captions or articles to label the corresponding images.

Clues for Indoor versus Outdoor:Text (as opposed to visual image features)

Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21.

The two engines of an Amtrak passenger train lie in the mud at the edge a marsh after the train, bound for Boston from Washington, derailed on the bank of the Hackensack River, just after crossing a bridge.

Two Paradigms of Research

• Machine learning (ML) techniques– Common in the literature– Usually involve the exploration of new algorithms

applied to bag of words representations of documents• Novel representation

– Rare in the literature– Usually more specific, but often interesting and can

lead to substantial improvement– Important for certain tasks involving images!

Contributions• General:

– An in-depth exploration of the categorization of images based on associated text

– Incorporating research into Newsblaster• Novel machine learning (ML) techniques:

– The creation of two novel TC approaches– The combination of high-precision/low-recall rules

with other systems• Novel representation:

– The integration of NLP and IR– The use of low-level image features

Framework

• Collection of Experiments– Various tasks– Multiple techniques– No clear winner for all tasks– Characteristics of tasks often dictate which

techniques work best• “No Free Lunch”

Overview

I. The Main IdeaII. Description of CorpusIII. Novel ML SystemsIV. NLP Based SystemV. High-Precision/Low-Recall RulesVI. Image FeaturesVII. NewsblasterVIII. Conclusions and Future Work

Corpus

• Raw data:– Postings from news related Usenet newsgroups– Over 2000 include embedded captioned images

• Data sets:– Multiple sets of categories representing various

levels of abstraction– Mutually exclusive and exhaustive categories

Outdoor Indoor

Events Categories

Politics Struggle

Disaster Crime Other

Subcategories for Disaster Images

Politics Struggle

Category F1

Politics 89%Struggle 88%Disaster 97%Crime 90%Other 59%

Affected People OtherWreckageWorkers Responding

Disaster Image Categories

Affected People

OtherWreckage

Workers Responding

Subcategories for Politics Images

Politics Struggle

Category F1

Politics 89%Struggle 88%Disaster 97%Crime 90%Other 59%

Meeting OtherPoliticianPhotographed

Announcement Civilians Military

Politics Image Categories

Meeting

CiviliansAnnouncement

MilitaryPolitician Photographed

Collect Labels to Train Systems

Overview

Two Novel ML Approaches

• Density estimation– Applied to the results of some other system– Often improves performance– Always provides probabilistic confidence measures for

predictions• BINS

– Uses binning to estimate accurate term weights for words with scarce evidence

– Extremely competitive for two data sets in my corpus

Density Estimation

• First apply a standard system:– For each document, compute a similarity or score for

every category.– Apply to training documents as well as test documents.

• For each test document:– Find all documents from training set with similar

category scores.– Use categories of close training documents to predict

categories of test documents.

Density Estimation Example

85, 35, 25, 95, 20

100, 75, 20, 30, 5

60, 95, 20, 30, 5

90, 25, 50, 110, 25

40, 30, 80, 25, 40

80, 45, 20, 75, 10

Category score vectorsfor training documents:

Category score vectorfor test document:

20.092.5

27.491.4

Predictions:Rocchio/TF*IDF: StruggleDE: Crime (Probability .679)

100, 40, 30, 90, 10

Struggle

Politics

Disaster

Distances:

679.07.36

(Crime)

(Struggle)

(Disaster)

(Struggle)

(Politics)

(Crime)

Actual Categories:

Density Estimation Significantly Improves Performancefor the Indoor versus Outdoor Data Set

OverallAccuracy

Indoor F1 Outdoor F1

Density EstimationRocchio/TF*IDF

Density Estimation Slightly Degrades Performancefor the Events Data Set

30.0%40.0%50.0%60.0%70.0%80.0%90.0%

100.0%

Overall

Strugg

Politic

Disaste

Crime F

Other F1

Density EstimationRocchio/TF*IDF

Density Estimation Sometimes Improves Performance,Always Provides Confidence Measures

DensityEstimation

Rocchio/TF*IDF

Indoor versus Outdoor Events: Politics, Struggle, Disaster, Crime, Other

Confidence Range # of Images Overall Accuracy %

High (P 0.9) 285 92.6

Medium (0.9 > P 0.7) 98 75.5

Low (0.7 > P 0.5) 62 72.6

Confidence Range # of Documents Overall Accuracy %

High (P 0.9) 301 94.4

Medium (0.9 > P 0.7) 68 79.4

Low (0.7 > P 0.5) 60 53.3

Very Low (0.5 > P) 14 42.9

Results of Density Estimation Experiments for the Events Data Set:

Results of Density Estimation Experiments for the Indoor versus Outdoor Data Set:

BINS System:Naïve Bayes + Smoothing

• Binning: based on smoothing in the speech recognition literature– Not enough training data to estimate term weights for

words with scarce evidence– Words with similar statistical features are grouped into

a common “bin”• Estimate a single weight for each bin

– This weight is assigned to all words in the bin– Credible estimates even for small (or zero) counts

Binning Uses Statistical Features of Words

Intuition Word

Indoor Category

Outdoor Category

CountQuantized

Clearly Indoor

conference 14 1 4

bed 1 0 8

Clearly Outdoor

plane 0 9 5

earthquake 0 4 6

Unclearspeech 2 2 6

ceremony 3 8 5

“plane”

• Sparse data– “plane” does not occur in any Indoor training

documents– Infinitely more likely to be Outdoor ???

• Assign “plane” to bins of words with similar features (e.g. IDF, category counts)

• In first half of training set, “plane” appears in:– 9 Outdoor documents – 0 Indoor documents

Lambdas: Weights• First half of training set: Assign words to bins• Second half of training set: Estimate term weights

binword ||

1)|( docswordDF

binbinobsP

)|(log2bin binobsP

Lambdas for “plane”:4.03 times more likely in an Outdoor document

310*31.5)bin |obs( IndoorP

210*13.2)bin |obs( OutdoorP

01.2bin) |P(obs

bin) |P(obslog221 OutdoorIndoor

Binning Credible Log Likelihood Ratios

Intuition Word

Indoor minus Outdoor

Indoor Category

Outdoor Category

CountQuantized

Clearly Indoor

conference 4.84 14 1 4

bed 1.35 1 0 8

Clearly Outdoor

plane -2.01 0 9 5

earthquake -1.00 0 4 6

Unclearspeech 0.84 2 2 6

ceremony -0.50 3 8 5

Lambdas Decrease with IDF

Disaster lambdas

-11-10-9-8-7-6-5-4

1 2 3 4 5 6 7 8IDF

count=0count=1

Methodology of BINS

• Divide training set into two halves:– First half used to determine bins for words– Second half used to determine lambdas for bins

• For each test document:– Map every word to a bin for each category– Add lambdas, obtaining a score for each category

• Switch halves of training and repeat • Combine results and assign each document to

category with highest score

Binning Improves Performancefor the Indoor versus Outdoor Data Set

OverallAccuracy

Indoor F1 Outdoor F1

BINSNaïve Bayes

Binning Improves Performancefor the Events Data Set

20.0%30.0%40.0%50.0%60.0%70.0%80.0%90.0%

100.0%

Overall

Strugg

Politic

Disaste

Crime F

Other F1

BINSNaïve Bayes

BINS: Robust Version of Naïve Bayes

Naïve Bayes

baseline

humans

Combining Bin Weights and Naïve Bayes Weights

• Idea:– It might be better to use the Naïve Bayes weight when

there is enough evidence for a word– Back off to the bin weight otherwise

• BINS allows combinations of weights to be used based on the level of evidence

• How can we automatically determine when to use which weights???– Entropy– Minimum Squared Error (MSE)

Can Provide File to BINS that Specifies How to Combine Weights

Based on Entropy: Based on MSE:

Use only bin weight for evidence of 0

Average bin weight and NB weight for evidence of 1

Use only NB weight for evidence of 2 or more

Appropriately Combining the Bin Weight and the Naïve Bayes Weight Leads to the Best Performance Yet

BINS (Combo #2)

BINS (Combo #1)

Naïve Bayes

BINS Performs the Best of All Systems Tested

95.0%Density Estimation

Rocchio/TF*IDF

BINS (Combo #2)

BINS (Combo #1)

Naïve Bayes

K-Nearest Neighbor

Naïve Bayes [R]

Rocchio/TF*IDF [R]

K-Nearest Neighbor [R]

Probabilistic Indexing [R]

Support Vector Machines [R]

Maximum Entropy [R]

BINS BINSSVMs SVMs

How Can We Improve Results?

• One idea: Label more documents!– Usually works – Boring

• Another idea: Use unlabeled documents!– Easily obtainable– But can this really work??? – Maybe it can…

Binning Using Unlabeled Documents

• Apply system to unlabeled documents• Choose documents with “confident” predictions

– Each word has new feature: # of unlabeled documents containing the word that are confidently predicted to belong to each category (unlabeled category counts)

– Probably less important than regular category counts– Binning provides a natural mechanism for weighting

the new feature appropriately

Determining Confident Predictions

• BINS computes a score for each category– BINS predicts category with highest score– Confidence for predicted category is score of that category

minus score of second place category– Confidence for non-predicted category is score of that

category minus score of chosen category• Cross validation experiments can be used to determine

a confidence cutoff for each category– Maximize F for category– Beta of 1 gives precision and recall equal weight, lower beta

weights precision higher

Results for Struggle Category

-600 -300 0 300

Confidence Cutoff

Precision

Recall

F(1/3)

Use F to Optimize Confidence Cutoffs (example for a single category)

Use F to Optimize Confidence Cutoffs (important region of graph highlighted)

Results for Struggle Category

0 20 40 60 80 100

Confidence Cutoff

Precision

Recall

F(1/3)

Should the New Feature Matter?

zero count lambdas (IDF=8, beta=0.5)

0 1 2 3 4 5 6 7 8 9

unlabeled category count

DisStrPolCri

zero count lambdas (category=Disaster, beta=1.0)

0 1 2 3 4 5 6 7 8 9 10

unlabeled category count

Does the New Feature Help?

• No• Why???

– New features add info but make bins smaller– Perhaps more data isn’t needed in the first place

• Should more data matter?– Hard to accumulate more labeled data– Easy to try out less labeled data!

Does Size Matter?

Effect of Training Data

Percentage Used

IN/OUT

EVENTS

Overview

Disaster Image Categories

Affected People

OtherWreckage

Workers Responding

Performance of Standard SystemsNot Very Satisfying

52.00%

54.00%

56.00%

58.00%

60.00%

62.00%

64.00%

66.00%Density Estimation

Rocchio/TF*IDF

Naïve Bayes [R]

Rocchio/TF*IDF [R]

Maximum Entropy [R]

Ambiguity for Disaster Images:Workers Responding vs. Affected People

Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.

Hypothetical alternative caption: A fire victim who perished in a blaze at a Manila disco is carried by Philippine rescuers March 19.

Workers Responding Affected People

Summary of Observations About Task

• Need to distinguish foreground from background, determine focus of image

• Not all words are important; some are misleading• Hypothesis: the main subject and verb are

particularly useful for this task– Problematic for bag of words approaches– Need linguistic analysis to determine predicate

argument relationships

Hypothesis: Subject and Verbare Useful Clues

Subject Verb Category Guessable?

Truck makes Wreckage No

couple mourn Affected People Yesblocks suffered Wreckage YesNAME gather Affected People No

child sleeps Affected People Yesinspectors search Workers Responding Yes

NAME observes Workers Responding No

workers confer Workers Responding Yes

child covers Affected People Yeschimney stands Wreckage Yes

Experiments with Humans Subjects: 4 Conditions

Test Hypothesis: Subject and Verb are Useful Clues

SENT: First sentence of caption

RAND: All words from first sentence in random order

At perished disco who Manila a a in 19 carry Philippine blaze victim a rescuers March fire

IDF: Top two TF*IDF words

disco rescuers

S-V: Subject and verb subject = “rescuers”, verb = “carry”

• More words are better than fewer words– SENT, RAND > S-V, IDF

• Syntax is important– SENT > RAND; S-V > IDF

Experiments with Humans Subjects: ResultsHypothesis: Subject and Verb are Useful Clues

55.0%60.0%

65.0%70.0%

75.0%80.0%

85.0%90.0%

SENTRANDIDFS-V Condition Average Time

(in seconds)RAND 68SENT 34IDF 22S-V 20

RAND is Very Slow!

• Perhaps human subjects unscrambled words, regaining syntactic information

Condition Average Time (in seconds)

RAND 68SENT 34IDF 22S-V 20

Using Just Two Words (S-V)Almost as Good as All the Words (Bag of Words)

52.00%

54.00%

56.00%

58.00%

60.00%

62.00%

64.00%

66.00%

SENT S-V

Density Estimation

Rocchio/TF*IDF

Naïve Bayes [R]

Rocchio/TF*IDF [R]

Maximum Entropy [R]

Operational NLP Based System

• For each test document:– Extract subject and verb– Compare to those from training set using some method

of word-to-word similarity– Based on similarities, generate a score for every

text categorization and images thesis defense for carl sable committee: kathleen mckeown, vasileios...

Documents

isw08 nayar

thermodynamical formalism vasileios chousionis thesis...

automatic detection of tags for political blogs khairun-nisa...

challenges in bioinformatics part i vasileios...

a biology primer part i: classification, cells and proteins...

mr. rajat nayar best astrologer in india

ananda subramani kannan , vasileios naserentin , andreas

classifier evaluation vasileios hatzivassiloglou university...

ntovros vasileios research - springer...ntovros vasileios...

predicting the semantic orientation of adjectives vasileios...

mycorrhizae benefits atul nayar mon jan 25 at 2

curriculum vitae dr vasileios a. bontzorlos structure ......

kotsidis, vasileios (2018) aspects of pro-social behaviour...

ranvir nayar india hotel industrypresentationeicc2006

from smooth to rough surfaces - geometry and algorithms...

dossier de prensa - home | nayar systems, el iot de la...

collocations and terminology vasileios hatzivassiloglou...

early childhood learning & the media environment priya s....

a biology primer part iii: transcription, translation, and...

vinesh unikrishnan nayar music faculty b.a ... - dps nashik