passive-aggressive sequence labeling with discriminative post-editing for recognising person...

Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for
Recognising Person Entities in Tweets.

Leon Derczynski
Kalina Bontcheva

Problem

Finding person NEs in tweets, a diverse genre

Need to know participants in events / claims

Twitter as the D. melanogaster of social media [1]

Newswire: regulated

"our most frequently-used corpora [..] written and edited predominantly by working-age white men" [2]

Twitter: wild; many styles

Headlines

Conversations

Colloquial

Just noise (hashtags, URLs, mentions)

[1] Tufekci, 2014. Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls. Proc. ICWSM.
[2] Eisenstein, 2013. What to do about bad language on the internet. Proc. NAACL.
Image: Mr.checker, Wikimedia Commons

Why person entities?

There are many entity types and classification schemes

ACE (PER, GPE, ORG); maybe add PROD

Freebase top-level (à la Ritter)

Have a long tail, making them resistant to gazetteer approaches

Required to mine conversations and claims

Unfortunately, they're difficult to find in tweets:

Stanford NER on CoNLL news: 92.29 F1
Stanford NER on Ritter tweets: 63.20 F1

Machine learning for twitter NER

We know twitter's diverse & noisy, so let's add word shape (Xxx) and lemma features
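As an illustration of the word shape feature mentioned above, here is a minimal sketch. The function name `word_shape` and the `collapse` option are hypothetical, and the exact shape scheme used in the talk is not stated, so treat the character classes here (X for upper case, x for lower case, d for digit) as an assumption.

```python
def word_shape(token: str, collapse: bool = False) -> str:
    """Map each character to a class symbol: X (upper), x (lower), d (digit).

    Other characters (e.g. '#', '@') are kept as-is. With collapse=True,
    runs of the same symbol are shortened to a single symbol.
    """
    shape = []
    for ch in token:
        if ch.isupper():
            code = "X"
        elif ch.islower():
            code = "x"
        elif ch.isdigit():
            code = "d"
        else:
            code = ch
        # When collapsing, skip a symbol that repeats the previous one.
        if collapse and shape and shape[-1] == code:
            continue
        shape.append(code)
    return "".join(shape)


if __name__ == "__main__":
    for tok in ["Obama", "KANYE", "charlie", "#NER2014"]:
        print(tok, word_shape(tok), word_shape(tok, collapse=True))
```

A shape feature like this lets a model treat "Obama" and an unseen capitalised name alike, which may help with the long tail of person names; lemma features would need a separate lemmatiser and are not sketched here.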

Conventional approaches use sequence labelling

Lots of dysfluency, differs from newswire

What if we throw out the whole-sequence idea and only use local context?

Stanford: 72.19 F1 (up from ~63)
SVM: 75.89 F1
MaxEnt: 76.76 F1
CRF: 78.89 F1

Looks like sequence labelling is useful

Two ML adaptations

SVM/UM

Hyperplane may lie between two unbalanced classes

Move closer to minority class, to reflect prior distribution

CRF-PA

Passive: when example's hinge loss is zero, skip updates

Aggressive: when hinge loss > 0, scale down example's weight

Single-pass results

Corpus: person entities from MSM2013, Ritter, UMBC tweet datasets (86k toks, 1.7k ents)

            P       R       F
Stanford    90.60   60.00   72.19
Ritter      77.23   80.18   78.68
SVM/UM      81.16   74.97   77.94
CRF-PA      86.85   74.71   80.32
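The F column is consistent with the harmonic mean of P and R. As a quick check, the following sketch recomputes F1 from the precision and recall figures reported above; the helper name `f1` is hypothetical.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs as reported on the slide.
reported = {
    "Stanford": (90.60, 60.00),
    "Ritter": (77.23, 80.18),
    "SVM/UM": (81.16, 74.97),
    "CRF-PA": (86.85, 74.71),
}

if __name__ == "__main__":
    for system, (p, r) in reported.items():
        print(f"{system}: {f1(p, r):.2f}")
```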

Honourable mention: MaxEnt, precision 91.10

Ritter: good recall, possibly from huge bootstrapped integrated resource

How can we improve recall without this?

Recall problems

Typical missed entities:

Under Obama 's tax plan , ...

delighted for you & Dave !

Strategies for selling in a slow market : by Denise Calaman

Looks like things we'd find in a gazetteer

How can we include these without reducing precision?

Post-editing can be effective in fixing up MT output

Post-editing

Formulate as binary discriminative problem

Is a given non-entity text actually a person?

Narrow search space:

Does a token in an out-of-entity sequence begin with a known person name?

Confine window to two tokens

Given a set of triggers, are tokens in a bigram beginning with a trigger, a person?

Best Ann Coulter quotes

Under Obama 's tax plan
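The candidate-generation step described above can be sketched as follows. This is an assumption-laden illustration, not the authors' code: the function name `candidate_bigrams`, the use of "O" as the out-of-entity label, and the lower-cased trigger set are all hypothetical choices.

```python
def candidate_bigrams(tokens, labels, triggers):
    """Find out-of-entity bigrams whose first token is a trigger.

    tokens:   list of token strings
    labels:   first-pass labels, with "O" assumed to mean out-of-entity
    triggers: set of lower-cased known names (e.g. from a first-name list)
    Returns a list of (index, (token, next_token_or_None)).
    """
    candidates = []
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab != "O" or tok.lower() not in triggers:
            continue
        # Confine the window to two tokens; the second must also be out-of-entity.
        has_next = i + 1 < len(tokens) and labels[i + 1] == "O"
        nxt = tokens[i + 1] if has_next else None
        candidates.append((i, (tok, nxt)))
    return candidates


if __name__ == "__main__":
    triggers = {"ann", "obama"}
    for text in ["Best Ann Coulter quotes", "Under Obama 's tax plan"]:
        toks = text.split()
        print(candidate_bigrams(toks, ["O"] * len(toks), triggers))
```

Each candidate would then go to the binary classifier, which decides whether the tokens in the bigram are really part of a person name ("Ann Coulter") or only the trigger itself is ("Obama", not "'s").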

Evaluation

Baselines: no editing, gazetteer term, gazetteer term+1

Goal is to improve recall: use cost-sensitive SVM
                    Missed entity F1   Overall
No editing          0.00               80.32
Term only           5.82               82.58
Term+1              6.05               81.67
SVM Cost 0.1 (P)    78.26              83.07
SVM Cost 1.5 (R)    92.73              83.83
Ritter              -                  78.68
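To illustrate the idea behind a cost-sensitive classifier, here is a minimal sketch of a hinge loss in which errors on the positive (person) class are weighted by a cost factor. How the cost parameter in the table is actually defined is not stated on the slide, so the function `weighted_hinge` and its `pos_cost` parameter are assumptions for illustration only.

```python
def weighted_hinge(score: float, y: int, pos_cost: float = 1.0) -> float:
    """Hinge loss with an extra cost on positive-class examples.

    score:    raw classifier output for one example
    y:        gold label in {-1, +1}, where +1 is assumed to mean "person"
    pos_cost: hypothetical multiplier applied to losses on positive examples
    """
    loss = max(0.0, 1.0 - y * score)
    return pos_cost * loss if y == 1 else loss


if __name__ == "__main__":
    # The same margin violation costs more on a missed person when pos_cost > 1.
    print(weighted_hinge(0.2, +1, pos_cost=1.5))
    print(weighted_hinge(-0.2, -1, pos_cost=1.5))
```

Under this assumed definition, a cost above 1 penalises missed persons more heavily and pushes the classifier toward recall, while a cost below 1 favours precision.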

Error analysis

False positives:

Other-class entities (Huff Post, Exodus Porter)

Descriptive titles (Millionaire Rob Ford)

Names in non-name senses (Marie Claire)

Polysemous names (Mark)

False negatives:

Capitalisation (charlie gibson, KANYE WEST)

Spelling errors (Russel Crowe)

Common nouns (Jack Straw)

Uncommon names (Spicy Pickle Jr.)

Conclusion

PA adaptation of CRF helps NER in diverse domain

Automatic post-editing improves recall

SVM using context much better than gazetteer

Only external resource is first name lists

Thank you for your time!

Do you have any questions?

Research partially supported by the European Union/EU under the Information and Communication Technologies (ICT) theme of the 7th Framework Programme for R&D (FP7), grant PHEME (611233).

Entities in tweets

PER
News: Politicians, business leaders, journalists, celebrities
Tweets: Sportsmen, actors, TV personalities, celebrities, names of friends

LOC
News: Countries, cities, rivers, and other places related to current affairs
Tweets: Restaurants, bars, local landmarks/areas, cities, rarely countries

ORG
News: Public and private companies, government organisations
Tweets: Bands, internet companies, sports clubs
