
Post on 30-Dec-2015


Identifying Comparative Sentences in Text Documents

Nitin Jindal and Bing Liu

University of Illinois

SIGIR 2006

Introduction

• Comparisons are one of the most convincing ways of evaluation.

• Much of this information is available on the Web in customer reviews, forum discussions, and blogs.

• Useful for product manufacturers and potential customers (to make purchasing decisions).

Comparisons vs. Opinions

• Comparisons can be either objective or subjective.

• Comparative sentences have different language constructs from typical opinion sentences.

• Comparative sentences may contain some indicators.

Car X is much better than Car Y

Car X is two feet longer than Car Y

Related Work

• Linguistics: based on grammars (syntax and semantics) and logic (gradability), which is more for human consumption than for automatic identification.

• Opinion tasks: opinion extraction and classification, which are quite different problems from comparison identification.

Comparatives (Linguistic)

• Comparatives are used to express explicit orderings between objects with respect to the degree or amount to which they possess some gradable property.

John is taller than he was

=>

John is tall to degree d

Comparatives (Linguistic)

• Two broad types:

– Metalinguistic Comparatives: compare properties of one entity.

Ronaldo is angrier than upset.

– Propositional Comparatives: compare between two propositions. Three subcategories:

Comparatives (Propositional)

• Nominal Comparatives: (two sets of entities)

Paul ate more grapes than bananas.

• Adjectival Comparatives: (than, as good as)

Ford is cheaper than Volvo.

• Adverbial Comparatives: (occur after a verb phrase)

Tom ate more quickly than Jane.

Superlatives

• Adjectival Superlatives:

John is the tallest person.

• Adverbial Superlatives:

Jill did her homework most frequently.

• Equality: conjunctions like and, or, …

John and Sue both like sushi.

POS involved

• NN: Noun

• NNP: Proper noun

• VBZ: Verb, present tense, 3rd person singular

• JJ: Adjective

• RB: Adverb

• JJR: Adjective, comparative

• JJS: Adjective, superlative

• RBR: Adverb, comparative

• RBS: Adverb, superlative

Limitations of Linguistic Classification

• Non-comparatives with comparative words: many non-comparatives contain comparative words.

In the context of speed, faster means better.

John has to try his best to win this game.

• Limited coverage: many comparatives contain no comparative words.

In market capital, Intel is way ahead of AMD.

Nokia, Samsung, both cell phones perform badly on heat dissipation index.

The M7500 earned a World bench score of 85, whereas Asus A3V posted a mark of 89.

Enhancements

• First limitation: machine learning methods to distinguish comparatives and non-comparatives.

• Second limitation:

– User preferences:

I prefer Intel to AMD = Intel is better than AMD

– Implicit comparatives:

Camera X has 2 MP, whereas camera Y has 5 MP.

Types of Comparatives

• Non-Equal Gradable: greater or less than type, including user preferences.

• Equative (Gradable): equal to type

• Superlative (Gradable): greater or less than all others type

• Non-Gradable:

– A is similar to B; A has feature F1 while B has F2; A has feature F but B doesn’t

Tasks

• Identifying comparative sentences from a given text data set.

• Extracting comparative relations from sentences. (Mining comparative sentences and relations, AAAI 2006)

Class Sequential Rules with Multiple Minimum Supports

• In a class sequential rule (CSR), the sequential pattern is on the left and the class is on the right.

• Select patterns: keywords – POS (JJR, RBR, JJS, RBS) + Words (favor, prefer, win, beat, but…) + Phrases (number one, up against)

• Using keywords alone gives precision P = 32% and recall R = 94%.
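The keyword filter above can be sketched as follows. This is a minimal illustration, not the authors' code: the word and phrase sets contain only the examples listed on the slide (the full lists are not given here), and the helper names parse_tagged and has_keyword are hypothetical. Input is assumed to be well-formed word/TAG text.

```python
# POS tags for comparative/superlative adjectives and adverbs (from the slide).
COMPARATIVE_TAGS = {"JJR", "RBR", "JJS", "RBS"}
# Assumption: only the examples named on the slide; the real lists are longer.
KEYWORDS = {"favor", "prefer", "win", "beat", "but"}
PHRASES = {("number", "one"), ("up", "against")}


def parse_tagged(sentence):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in sentence.split()]


def has_keyword(tagged):
    """True if the sentence has a comparative POS tag, keyword, or key phrase."""
    words = [w.lower() for w, _ in tagged]
    if any(tag in COMPARATIVE_TAGS for _, tag in tagged):
        return True
    if any(w in KEYWORDS for w in words):
        return True
    # Two-word phrases are matched on adjacent word pairs.
    return any(pair in PHRASES for pair in zip(words, words[1:]))
```

A filter like this is deliberately permissive: it keeps almost every comparative sentence but also many non-comparative ones, which is consistent with the high recall and low precision reported above.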

Support and Confidence

• Using the minimum support of 20% and minimum confidence of 40%, one of the discovered CSRs is:

Building the Sequence DB

this/DT camera/NN has/VBZ significantly/RB more/JJR noise/NN at/IN iso/NN 100/CD than/IN the/DT nikon/NN 4500/CD

{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN} -> comparative

• Sequences that exceed the 60% confidence threshold become rules. Minimum support = 10%.

• 13 manual rules with conjunctions such as whereas/IN, but/CC, however/RB, while/IN, though/IN, although/IN, etc.
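A rough sketch of how a sequence could be built from a tagged sentence and how support and confidence could be computed for a candidate rule. Several things here are assumptions rather than statements from the slides: the window of 3 tokens on each side of the keyword is inferred from the camera example (it reproduces that sequence), each item is treated as a single symbol rather than an itemset, only POS keywords are handled, and the function names are hypothetical.

```python
COMPARATIVE_TAGS = {"JJR", "RBR", "JJS", "RBS"}


def build_sequence(tagged, radius=3):
    """Items around the first POS keyword: tags only, keyword kept as word+tag.

    Assumption: a window of `radius` tokens on each side of the keyword.
    Returns None when the sentence has no POS keyword.
    """
    for i, (_, tag) in enumerate(tagged):
        if tag in COMPARATIVE_TAGS:
            lo = max(0, i - radius)
            hi = min(len(tagged), i + radius + 1)
            return [w.lower() + t if j == i else t
                    for j, (w, t) in enumerate(tagged[lo:hi], start=lo)]
    return None


def contains(sequence, pattern):
    """True if pattern is a (not necessarily contiguous) subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support_confidence(db, pattern, label):
    """db is a list of (sequence, class) pairs.

    support    = share of all sequences that contain the pattern and have `label`
    confidence = share of pattern-containing sequences that have `label`
    """
    covered = [y for seq, y in db if contains(seq, pattern)]
    hits = sum(1 for y in covered if y == label)
    support = hits / len(db) if db else 0.0
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence
```

Under these assumptions, a candidate pattern would be kept as a rule only when its support and confidence both reach the chosen thresholds.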

Classification Learning

• Machine learning methods:

Feature Set = {X | X is the sequential pattern in CSR X → y} ∪ {Z | Z is the pattern in a manual rule Z → y}
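One way to turn this feature set into classifier input is a binary vector with one position per pattern. The encoding below is an assumption (the slide does not say how feature values are assigned), the function names are hypothetical, and the patterns in the usage note are made up for illustration.

```python
def contains(sequence, pattern):
    """True if pattern is a (not necessarily contiguous) subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def to_features(sequence, patterns):
    """One binary feature per pattern: 1 if the sequence contains it, else 0."""
    return [1 if contains(sequence, p) else 0 for p in patterns]
```

For example, with the illustrative patterns [["NN", "moreJJR"], ["whereasIN"], ["JJ"]], the sequence ["NN", "VBZ", "RB", "moreJJR", "NN", "IN", "NN"] maps to [1, 0, 0]. The resulting vectors can be passed to any standard classifier.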

Data Preparation

• Consumer reviews on products such as digital cameras, DVD players, MP3 players and cellular phones.

• Forum discussions on topics such as Intel vs. AMD, Coke vs. Pepsi, and Microsoft vs. Google.

• News articles on topics such as automobiles, iPods, and soccer vs. football.

Number of Sentences in Data Sets

Experimental Results (1)

Experimental Results (2)

• Reviews: low recall, high precision -> sentences are short, so patterns are hard to find.

• Articles and Forums: high recall, low precision -> sentences are long, so patterns match too easily or too many patterns are found.
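For reference, precision (P) and recall (R) as used above can be computed from predicted and true labels as in this small sketch. The function name is hypothetical, and the labels in the usage note are toy values, not results from the paper.

```python
def precision_recall(predicted, actual):
    """predicted/actual: equal-length lists of booleans (True = comparative)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With toy labels predicted = [True, True, True, False] and actual = [True, False, False, False], the one true comparative is found (recall 1.0) but only one of the three flagged sentences is correct (precision 1/3). This is the "R high, P low" pattern described for articles and forums.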

Conclusion and Future Work

• Identifying comparative sentences.

• Analyzing different types of comparative sentences.

• Studying how to automatically classify subjective and objective comparisons.
