A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources

Carmen Banea, Rada Mihalcea
University of North Texas

Janyce Wiebe
University of Pittsburgh



Subjectivity analysis

Subjectivity analysis (opinions and sentiments)
Used in a wide variety of applications:
  Tracking sentiment timelines in news (Lloyd et al., 2005)
  Review classification (Turney, 2002; Pang et al., 2002)
  Mining opinions from product reviews (Hu and Liu, 2004)
  Expressive text-to-speech synthesis (Alm et al., 2005)
  Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani, 2006)
  Question answering (Yu and Hatzivassiloglou, 2003)

Much work on subjectivity analysis has focused on English
  Work on other languages: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim and Hovy, 2006)

Proportion of Languages on the Web

[Chart; source: internetworldstats.com, updated November 30, 2007]

Objective

Develop a method for subjectivity analysis that
  Requires few electronic resources
  Can be easily ported to a new language

Applicable to the large number of languages that have scarce electronic resources

Related Work

Tools that rely on manually or semi-automatically constructed lexicons
  Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006
  Enable efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in text

These tools assume the availability of advanced language processing tools:
  Syntactic parsers (Wiebe, 2000)
  Information extraction (Riloff and Wiebe, 2003)
  Broad-coverage, rich lexical resources such as WordNet (Esuli and Sebastiani, 2006)

Our approach relates most closely to the method of Turney (2002) for the construction of lexicons annotated for polarity
  We address the task of acquiring a subjectivity lexicon
  We rely on fewer, smaller-scale resources

Our Method

Based on bootstrapping
Requires:
  A small seed set of subjective entries
  One or more electronic dictionaries
  A small training corpus (approx. 500,000 words)
Experiments focused on Romanian
  Applicable to other languages as well

Bootstrapping Process

[Flowchart: seeds, query, online dictionary, candidate synonyms, fixed filtering, selected synonyms, decision "Max. no. of iterations?" (yes/no), variable filtering]
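The loop in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `bootstrap`, `lookup`, `keep`, and `TOY_DICTIONARY` are hypothetical, the toy dictionary entries are invented for the example, and the final variable-filtering step is left out.

```python
from typing import Callable, Iterable, Set


def bootstrap(
    seeds: Iterable[str],
    lookup: Callable[[str], Set[str]],
    keep: Callable[[str], bool],
    max_iterations: int = 5,
) -> Set[str]:
    """Grow a lexicon from seeds by repeated dictionary lookups (illustrative only)."""
    lexicon = set(seeds)
    frontier = set(seeds)  # entries to query in the current iteration
    for _ in range(max_iterations):
        # Query the dictionary for every entry selected in the previous round.
        candidates: Set[str] = set()
        for word in frontier:
            candidates |= lookup(word)
        # Fixed filtering: keep only unseen candidates that pass the filter.
        selected = {c for c in candidates if c not in lexicon and keep(c)}
        if not selected:
            break  # nothing new to expand
        lexicon |= selected
        frontier = selected
    return lexicon


# Hypothetical toy dictionary standing in for an online dictionary.
TOY_DICTIONARY = {
    "dulce": {"placut", "dulceag"},
    "placut": {"agreabil"},
}


def toy_lookup(word: str) -> Set[str]:
    """Return the toy candidate set for a word, or an empty set if unknown."""
    return TOY_DICTIONARY.get(word, set())


one_round = bootstrap(["dulce"], toy_lookup, keep=lambda c: True, max_iterations=1)
two_rounds = bootstrap(["dulce"], toy_lookup, keep=lambda c: True, max_iterations=2)
print(sorted(one_round))
print(sorted(two_rounds))
```

Passing the dictionary lookup and the filter as callables keeps the loop independent of any particular dictionary or similarity measure, which matches the goal of porting the method to other languages.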

Seed Set

Category    Sample entries (with their English translation)
Noun        blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
Verb        iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
Adjective   frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
Adverb      posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)

60 seeds, evenly sampled from verbs, nouns, adjectives and adverbs
Manually selected
Seed sources:
  11th-grade curriculum for Romanian Language and Literature
  Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)

Expansion

Romanian dictionary: http://www.dexonline.ro
  Dictionaries for other languages are also available, or can be obtained from paper dictionaries through OCR

[Diagram: Seed, Definition, Candidate synonyms]

Candidate synonyms:
  All open-class words that have a definition in the dictionary
  Longer than 3 letters
  Diacritics are removed
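The candidate rules above (length over 3 letters, diacritics removed) might look like this in code. This is a sketch under assumptions: the function names and the `closed_class` word list are hypothetical, and a real open-class test would need part-of-speech information that the slides do not describe.

```python
import re
import unicodedata
from typing import FrozenSet, List


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. Romanian 'plăcut' becomes 'placut'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def candidate_words(
    definition: str,
    closed_class: FrozenSet[str] = frozenset(),
    min_length: int = 4,
) -> List[str]:
    """Extract candidate words from a dictionary definition.

    Keeps words longer than 3 letters that are not in the (hypothetical)
    closed-class list, in order of first appearance and without repeats.
    """
    tokens = re.findall(r"[^\W\d_]+", strip_diacritics(definition.lower()))
    seen = set()
    result = []
    for token in tokens:
        if len(token) >= min_length and token not in closed_class and token not in seen:
            seen.add(token)
            result.append(token)
    return result


# Toy definition text, invented for illustration.
print(candidate_words("Plăcut la gust, cu gust dulce"))
```

Short function words such as "la" and "cu" fall out through the length rule alone, which may be why a length cutoff is a cheap stand-in when no part-of-speech tagger is available.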

Filtering

Candidates are filtered based on a measure of similarity with the original seeds
  We use Latent Semantic Analysis (LSA) (Dumais et al., 1988) trained on the SemCor corpus (Miller et al., 1993)

After each iteration, only candidates with an LSA score higher than a given threshold are selected for further expansion

Example:
  Seed: dulce (sweet)
  Candidate synonyms: cu gust dulce (sweet-tasting), placut (pleasant), dulceag (quasi-sweet)
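A threshold filter of this kind can be sketched with cosine similarity over word vectors. Everything specific here is an assumption: the vectors are invented two-dimensional stand-ins for LSA vectors, and scoring a candidate by its best similarity to any seed is one possible choice, not necessarily how the LSA score was computed in the original work.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(
    seed_vectors: Dict[str, Sequence[float]],
    candidate_vectors: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Keep candidates whose best similarity to any seed exceeds the threshold."""
    kept = {}
    for word, vector in candidate_vectors.items():
        score = max(cosine(vector, seed) for seed in seed_vectors.values())
        if score > threshold:
            kept[word] = score
    return kept


# Invented toy vectors; a real system would take them from a trained LSA model.
seeds = {"dulce": [1.0, 0.0]}
candidates = {"placut": [0.8, 0.6], "masa": [0.0, 1.0]}
selected = filter_candidates(seeds, candidates, threshold=0.5)
print(selected)
```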

Filtering

Several iterations of the bootstrapping process will result in a subjectivity lexicon consisting of a ranked list of candidates in decreasing order of similarity to the original seeds

A variable filtering threshold can be used to further restrict the similarity, yielding a purer lexicon

Filtering parameters:
  Similarity threshold
  Number of iterations
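The ranking and variable threshold can be illustrated with a short sketch. The function name `rank_lexicon` and the similarity scores are made up for the example; only the idea of sorting by decreasing similarity and cutting at a chosen threshold comes from the slide.

```python
from typing import Dict, List, Tuple


def rank_lexicon(scores: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Return entries scoring above the threshold, most similar first."""
    kept = [(word, score) for word, score in scores.items() if score > threshold]
    # Sort by decreasing score; break ties alphabetically for a stable order.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Invented scores for illustration only.
toy_scores = {"placut": 0.8, "dulceag": 0.6, "gust": 0.45}

for cutoff in (0.4, 0.5, 0.7):
    ranked = rank_lexicon(toy_scores, cutoff)
    print(cutoff, [word for word, _ in ranked])
```

Raising the cutoff shrinks the lexicon and keeps only the entries closest to the seeds, which is the trade-off between lexicon size and purity that the threshold controls.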

Lexicon Acquisition

Evaluation

Rule-based classifier of subjectivity (Riloff and Wiebe, 2003)
  Subjective sentence: three or more subjective entries
  Objective sentence: two or fewer subjective entries

Gold standard data set (Mihalcea, Banea and Wiebe, 2007)
  504 sentences from five SemCor documents (manually translated into Romanian)
  Labeled by two annotators
  Agreement (all): 83% (κ = 0.67)
  Agreement (uncertain removed): 89% (κ = 0.77)
  Baseline: 54% (all subjective)
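The classification rule stated above is simple enough to sketch directly. The tokenizer, the function name `classify`, and the toy lexicon are assumptions, as is counting every matching token (rather than distinct entries) toward the total.

```python
import re
from typing import Set


def classify(sentence: str, lexicon: Set[str], min_entries: int = 3) -> str:
    """Label a sentence 'subjective' if it has at least min_entries lexicon hits."""
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    hits = sum(1 for token in tokens if token in lexicon)
    return "subjective" if hits >= min_entries else "objective"


# Toy lexicon drawn from the sample seed entries (diacritics already removed).
toy_lexicon = {"frumos", "dulce", "fericit", "urat"}

print(classify("Un om fericit, frumos si dulce", toy_lexicon))
print(classify("Un om fericit si frumos", toy_lexicon))
```

Because the rule only checks for lexicon entries in the text, it needs no labeled training data, so its accuracy depends entirely on the quality and coverage of the lexicon.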

Number of Iterations

F-measure for the bootstrapping subjectivity lexicon over 5 iterations and an LSA threshold of 0.5

Similarity Threshold

F-measure for the fifth bootstrapping iteration for varying LSA scores

Comparison

Bootstrapping rule-based classifier: uses a 3,913-entry subjectivity lexicon obtained through 5 iterations with a similarity threshold of 0.5

Conclusions

Our bootstrapping method uses few electronic resources:
  A small seed set
  One or more dictionaries
  A small corpus of half a million words

A large subjectivity lexicon of approx. 4000 entries was extracted

Using an unsupervised rule-based classifier, a subjectivity F-measure of 66.20% and an overall F-measure of 61.69% can be achieved

Questions?