Rada Mihalcea University of Michigan Linguistic Ethnography: Identifying Dominant Word Classes in Text Stephen Pulman Oxford University



Linguistic Ethnography: Identifying Dominant Word Classes in Text

Rada Mihalcea, University of Michigan
Stephen Pulman, Oxford University


Linguistic Ethnography?

• Finding and understanding patterns in given types of text
  – Find the characteristics of a text
  – Reflective of behavior or style

• Examples
  – Female vs. male authored texts (gender)
  – Texts describing happy vs. sad moods (mood)
  – Humorous vs. non-humorous text (comic)
  – Introvert vs. extrovert authors (psychology)


Linguistic Ethnography vs. Text Classification

• Text classification:
  – Automatic separation of classes of text
  – Supervised or semi-supervised algorithms (Naïve Bayes, SVM, perceptron, etc.)
  – Feature weighting and selection

• Linguistic ethnography:
  – Identification of classes of words over salient features
  – Understand the characteristics of the texts
  – Insights into the properties and behaviors modeled by those texts


An Example: Finding Happiness

Well kids, I had an awesome birthday thanks to you. =D Just wanted to so thank you for coming and thanks for the gifts and junk. =) I have many pictures and I will post them later. hearts

mood:

Home alone for too many hours, all week long ... screaming child, headache, tears that just won’t let themselves loose.... and now I’ve lost my wedding band. I hate this.

mood:


Corpus-derived Happiness Factors

Word       Score      Word       Score
yay        86.67      goodbye    18.81
shopping   79.56      hurt       17.39
awesome    79.71      tears      14.35
birthday   78.37      cried      11.39
lovely     77.39      upset      11.12
concert    74.85      sad        11.11
cool       73.72      cry        10.56
cute       73.20      died       10.07
lunch      73.02      lonely      9.50
books      73.02      crying      5.50


Identifying Word Classes in Text

• Foreground corpus: corpus of texts of interest

• Background corpus: “neutral” texts
  – Collection of texts that do not have the property shared by the foreground corpus
  – Balanced corpus
    • Mix of texts

• Goal: identify word classes that are dominant in the foreground corpus


Word Class Dominance

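• C = {W1, W2, …, Wn}

$$\mathrm{Coverage}_F(C) = \frac{\sum_{W_i \in C} \mathrm{Frequency}_F(W_i)}{\mathrm{Size}_F}$$

$$\mathrm{Coverage}_B(C) = \frac{\sum_{W_i \in C} \mathrm{Frequency}_B(W_i)}{\mathrm{Size}_B}$$

$$\mathrm{Dominance}_F(C) = \frac{\mathrm{Coverage}_F(C)}{\mathrm{Coverage}_B(C)}$$

• Score significantly higher than 1: word classes that are dominant in the foreground corpus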
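
The coverage and dominance scores defined on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization, and the two toy corpora are assumptions made for the example, and the word class is a shortened version of the LIWC YOU class.

```python
from collections import Counter


def coverage(word_class, tokens):
    """Share of the corpus tokens that belong to the word class."""
    if not tokens:
        raise ValueError("corpus must contain at least one token")
    counts = Counter(tokens)
    # Use a set so a word listed twice in the class is counted once.
    return sum(counts[word] for word in set(word_class)) / len(tokens)


def dominance(word_class, foreground_tokens, background_tokens):
    """Foreground coverage divided by background coverage."""
    background_coverage = coverage(word_class, background_tokens)
    if background_coverage == 0:
        raise ValueError("word class never occurs in the background corpus")
    return coverage(word_class, foreground_tokens) / background_coverage


# Toy data (hypothetical, for illustration only).
YOU = {"you", "thou", "thy", "thee"}
foreground = "you know you are wrong when you smell trouble".split()
background = "the markets fell on monday and you could see prices fall again".split()

score = dominance(YOU, foreground, background)
print(f"Dominance of YOU: {score:.2f}")
```

In this toy case the class covers 3 of 9 foreground tokens and 1 of 12 background tokens, so the score is well above 1 and the class would count as dominant in the foreground corpus.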


Lexical Resources for Word Classes

• Roget
  – Thesaurus of the English language
  – 100,000 words grouped based on synonymy and other semantic relations
• Linguistic Inquiry and Word Count (LIWC)
  – Lexicon developed for psycholinguistic analysis (Pennebaker et al.)
  – 2,200 words grouped into 70 classes
• WordNet-Affect
  – Resource built on top of WordNet
  – Annotations with the emotions in the classification of Ortony
  – Focus on: anger, disgust, fear, joy, sadness, surprise


Word Class Examples

• Roget:
  – PERFECTION: perfection, purity, integrity, impeccability, …
  – MEDIOCRITY: mediocrity, dullness, indifference, inferiority, …

• LIWC:
  – OPTIMISM: accept, best, confidence, glorious, hope, …
  – SOCIAL: adult, advice, affair, boy, buddies, comrade, …

• WordNet-Affect:
  – ANGER: offense, temper, irritation, fury, rage, …
  – JOY: worship, adoration, sympathy, tenderness, respect, love, …


A Case Study: Verbal Humour

• Gain insights into the “language of humour”

• Find classes of words that are dominant in humorous text

• Foreground corpus: humorous text
  – Two types of verbal humour:
    • One-liners
    • Humorous news articles

• Background corpus: non-humorous text
  – A mix of data from non-humorous sources: Reuters newspapers, British National Corpus, proverbs, Open Mind Common Sense


Humorous Data: One-liners

• “He who smiles in a crisis has found someone to blame”

• Short sentence, simple syntax
• Deliberate use of rhetorical devices (alliteration, rhyme)
• Frequent use of creative language
• Comic effect

• Web-based bootstrapping
  • Start with a few manually selected seeds
  • Identify a list of Web pages including at least one seed
  • Parse Web pages and find new one-liners
  • Repeat
  – 16,000 one-liners
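
The bootstrapping loop listed on this slide can be sketched as follows. This is only an illustration under strong assumptions: the "Web" is replaced by a hypothetical in-memory dictionary that maps a page name to the sentences already extracted from it, the function name `bootstrap_one_liners` is invented for the example, and the real collection process would also need search, HTML parsing, and filtering of non-humorous sentences, none of which is shown.

```python
def bootstrap_one_liners(seeds, pages, max_rounds=10):
    """Grow a set of one-liners from seeds.

    pages: dict mapping a page name to the list of candidate
    sentences found on that page (a stand-in for parsed Web pages).
    """
    known = set(seeds)
    for _ in range(max_rounds):
        new = set()
        for sentences in pages.values():
            # A page qualifies if it contains at least one known one-liner.
            if known.intersection(sentences):
                new.update(s for s in sentences if s not in known)
        if not new:
            break  # nothing new was found, so stop repeating
        known |= new
    return known


# Hypothetical pages built from one-liners quoted in these slides.
pages = {
    "page_a": [
        "He who smiles in a crisis has found someone to blame",
        "You can always find what you are not looking for.",
    ],
    "page_b": [
        "You can always find what you are not looking for.",
        "Only adults have trouble with child-proof bottles.",
    ],
    "page_c": [
        "When everything comes your way, you are in the wrong lane.",
    ],
}
seeds = ["He who smiles in a crisis has found someone to blame"]
collected = bootstrap_one_liners(seeds, pages)
print(len(collected))
```

The sentence on `page_c` is never collected because that page shares no one-liner with the growing set, which is the same reason a real crawl only expands through pages that contain at least one seed.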


Humorous Data: News stories

• “The Onion”
  – “the best source of humour out there” (Jeff Greenfield, CNN)

• “Canadian Prime Minister Jean Chrétien and Indian President Abdul Kalam held a subdued press conference in the Canadian Capitol building Monday to announce that the two nations have peacefully and sheepishly resolved a dispute over their common border. "We are - well, I guess proud isn't the word - relieved, I suppose, to restore friendly relations with India after the regrettable dispute over the exact coordinates of our shared border," said Chrétien, who refused to meet reporters' eyes as he nervously crumpled his prepared statement. "The border that, er... Well, I guess it turns out that we don't share a border after all." Chrétien then officially withdrew his country's demand that India hand over a 20-mile-wide stretch of land that was to have served as a demilitarized buffer zone between the two nations.”

  – 1,125 news articles from August 2005 – March 2006
    • 1,000-10,000 characters


Dominant Roget Word Classes in Humorous Text

• anonymity 3.48 : you, person, cover, anonymous, unknown, unidentified, unspecified

• odor 3.36 : nose, smell, strong, breath, inhale, stink, pong, perfume, flavor

• secrecy 2.96 : close, wall, secret, meeting, apart, ourselves, security, censorship

• wrong 2.83 : wrong, illegal, evil, terrible, shame, beam, incorrect, pity, horror

• unorthodoxy 2.52 : error, non, err, wander, pagan, fallacy, atheism, erroneous, fallacious

• overestimation 2.45 : think, exaggerate, overestimated, overestimate, exaggerated

• disarrangement 2.18 : trouble, throw, ball, bug, insanity, confused, upset, mess, confuse


Dominant LIWC Word Classes in Humorous Text

• you 3.17 : you, thou, thy, thee, thin
• I 2.84 : myself, mine
• swear 2.81 : hell, ass, butt, suck, dick, arse, bastard, sucked, sucks, boobs
• self 2.23 : our, myself, mine, lets, ourselves, ours
• sexual 2.07 : love, loves, loved, naked, butt, gay, dick, boobs, cock, horny, fairy
• groom 2.06 : soap, shower, perfume, makeup
• cause 1.99 : why, how, because, found, since, product, depends, thus, cos
• humans 1.79 : man, men, person, children, human, child, kids, baby, girl, boy


Dominant WordNet-Affect Word Classes in Humorous Text

• surprise 3.31 : stupid, wonder, wonderful, beat, surprised, surprise, amazing, terrific


Evaluation

• How good are these classes?
• Derive word classes from different data sets and measure correlation
  • Split the one-liners in two: 8,000 one-liners vs. 8,000 one-liners
  • Split the news stories in two: 550 stories vs. 550 stories
  • 16,000 one-liners vs. 1,100 news stories

                                 Roget   LIWC
one-liners vs. one-liners        0.95    0.96
news stories vs. news stories    0.84    0.88
one-liners vs. news stories      0.63    0.42
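
The comparison behind these figures can be sketched as a correlation between the dominance scores that two data sets assign to the same word classes. The slide does not name the correlation measure, so the use of Pearson correlation here is an assumption, as are the function names and the score values, which are made up for illustration and are not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def class_correlation(scores_a, scores_b):
    """Correlate dominance scores over the classes present in both sets."""
    shared = sorted(set(scores_a) & set(scores_b))
    return pearson([scores_a[c] for c in shared],
                   [scores_b[c] for c in shared])


# Hypothetical dominance scores from two halves of a corpus.
half_1 = {"you": 3.2, "swear": 2.8, "cause": 2.0, "humans": 1.8}
half_2 = {"you": 3.0, "swear": 2.9, "cause": 1.9, "humans": 1.7, "groom": 2.1}

print(round(class_correlation(half_1, half_2), 2))
```

Only classes scored in both data sets enter the comparison, so the extra `groom` entry in `half_2` is ignored.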


Characteristics of Verbal Humour

• Observed by analyzing the word classes
• Human-centeredness
  – YOU, I, SELF, HUMANS
    • you occurs in more than 25% of the one-liners
    • “You can always find what you are not looking for.”
    • professional communities
    • “It was so cold last winter, that I saw a lawyer with his hands in his own pockets.”


Characteristics of Verbal Humour

• Negative polarity
  – WRONG, UNORTHODOXY, DISARRANGEMENT
    • “Only adults have trouble with child-proof bottles.”
    • “When everything comes your way, you are in the wrong lane.”


Dominant Classes in Humour

– Human-centeredness: human-related semantic classes found dominant in humorous text as compared to non-humorous text

– Negative polarity: semantic classes with negative orientation

• Humour as “natural therapy” where tensions related to negative scenarios concerning us humans are relieved through laughter

• Correlation with empirical observations from previous work
  • Human-centeredness, negative polarity, sexual vocabulary, swear words, surprise


Conclusions

• Find the dominant word classes in types of text
  • Reflective of behavior or style
  • Systematic and portable

• Case study on humour:
  • Good correlation among classes derived from different corpora
  • Correlation with empirical observations from previous work


“A conclusion is simply the place where you got tired of thinking.”