Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets
Leon Derczynski
Kalina Bontcheva
Problem
Finding person NEs in tweets, a diverse genre
Need to know who participates in events / claims
Twitter as the D. melanogaster of social media [1]
Newswire: regulated
"our most frequently-used corpora [..] written and edited predominantly by working-age white men" [2]
Twitter: wild; many styles
Headlines
Conversations
Colloquial
Just noise (hashtags, URLs, mentions)
[1] Tufekci, 2014. Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls. Proc. ICWSM.
[2] Eisenstein, 2013. What to do about bad language on the internet. Proc. NAACL.
Image: Mr.checker, Wikimedia Commons
Why person entities?
There are many entity types and classification schemes
ACE (PER, GPE, ORG); maybe add PROD
Freebase top-level (à la Ritter)
Person entities have a long tail, making them resistant to gazetteer approaches
Required to mine conversations and claims
Unfortunately, they're difficult to find in tweets:
Stanford NER on CoNLL news: 92.29 F1
Stanford NER on Ritter tweets: 63.20 F1
Machine learning for Twitter NER
We know Twitter's diverse & noisy, so let's add word shape (Xxx) and lemma features
Conventional approaches use sequence labelling
Lots of dysfluency, differs from newswire
What if we throw out the whole-sequence idea and only use local context?
Stanford: 72.19 F1 (up from ~63)
SVM: 75.89 F1
MaxEnt: 76.76 F1
CRF: 78.89 F1
Looks like sequence labelling is useful
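The word shape feature and the local-context idea on this slide can be illustrated with a minimal sketch. The function names, the character classes (X, x, d) and the one-token window are assumptions made for illustration, not the feature set used in the talk; the lemma feature is left out because it needs an external lemmatiser.

```python
def word_shape(token: str) -> str:
    """Map each character to a class: X (upper), x (lower), d (digit), else itself."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation such as @ or # visible
    return "".join(shape)


def local_features(tokens, i):
    """Features for token i using only a one-token window on each side (assumed size)."""
    feats = {"shape": word_shape(tokens[i]), "lower": tokens[i].lower()}
    if i > 0:
        feats["prev_shape"] = word_shape(tokens[i - 1])
    if i + 1 < len(tokens):
        feats["next_shape"] = word_shape(tokens[i + 1])
    return feats


if __name__ == "__main__":
    print(local_features(["Under", "Obama", "'s", "tax", "plan"], 1))
```

A per-token classifier such as an SVM or MaxEnt model would consume dictionaries like this one; a CRF would additionally model transitions between neighbouring labels.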
Two ML adaptations
SVM/UM
Hyperplane may lie between two unbalanced classes
Move closer to minority class, to reflect prior distribution
CRF-PA
Passive: when example's hinge loss is zero, skip updates
Aggressive: when hinge loss > 0, scale down example's weight
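The passive and aggressive cases can be sketched for the simplest setting, a binary linear classifier with the standard PA-I update. This is a simplification: the CRF-PA model in the talk is a structured sequence labeller, and the cap C and the function names here are assumptions for illustration.

```python
def hinge_loss(w, x, y):
    """Hinge loss of a linear scorer on one example; y is +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - y * score)


def pa_update(w, x, y, C=1.0):
    """One PA-I step; returns a new weight list and leaves w untouched."""
    loss = hinge_loss(w, x, y)
    if loss == 0.0:
        return list(w)  # passive: margin already satisfied, skip the update
    sq_norm = sum(xi * xi for xi in x)
    if sq_norm == 0.0:
        return list(w)  # an all-zero feature vector cannot move the weights
    tau = min(C, loss / sq_norm)  # aggressive: step size, capped by C (assumed)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    w = pa_update([0.0, 0.0], [1.0, 2.0], +1)
    print(w, hinge_loss(w, [1.0, 2.0], +1))
```

The cap C limits how far a single noisy example can move the weights, which is one way to read "scale down example's weight" on the slide.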
Single-pass results
Corpus: person entities from MSM2013, Ritter, UMBC tweet datasets (86k tokens, 1.7k entities)
System     P      R      F1
Stanford   90.60  60.00  72.19
Ritter     77.23  80.18  78.68
SVM/UM     81.16  74.97  77.94
CRF-PA     86.85  74.71  80.32
Honourable mention: MaxEnt, precision 91.10
Ritter: good recall, possibly from huge bootstrapped integrated resource
How can we improve recall without this?
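As a quick check on the single-pass results, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the P and R figures on the slide; the function and variable names are illustrative only.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the single-pass results slide
results = {
    "Stanford": (90.60, 60.00),
    "Ritter": (77.23, 80.18),
    "SVM/UM": (81.16, 74.97),
    "CRF-PA": (86.85, 74.71),
}

if __name__ == "__main__":
    for name, (p, r) in results.items():
        print(f"{name}: F1 = {f1(p, r):.2f}")
```

The recomputed values agree with the reported F1 column to two decimal places. They also show why recall is the target: the CRF-PA precision of 86.85 is paired with a recall of only 74.71.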
Recall problems
Typical missed entities:
Under Obama 's tax plan , ...
delighted for you & Dave !
Strategies for selling in a slow market : by Denise Calaman
Looks like things we'd find in a gazetteer
How can we include these without reducing precision?
Post-editing can be effective in fixing up MT output
Post-editing
Formulate as binary discriminative problem
Is a given non-entity text actually a person?
Narrow search space:
Does a token in an out-of-entity sequence begin with a known person name?
Confine window to two tokens
Given a set of triggers, are tokens in a bigram beginning with a trigger, a person?
Best Ann Coulter quotes
Under Obama 's tax plan
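The candidate-generation step described on this slide might look roughly like the sketch below. The "O" tag for out-of-entity tokens, the lower-cased trigger set and the requirement that both tokens of the bigram lie outside any entity are assumptions; the binary classifier that would then accept or reject each candidate is not shown.

```python
def candidate_bigrams(tokens, labels, triggers):
    """Return (start_index, (token, next_token)) for each out-of-entity bigram
    whose first token is a known trigger, e.g. a first name."""
    candidates = []
    for i in range(len(tokens) - 1):  # a trigger in final position has no bigram
        if labels[i] != "O" or labels[i + 1] != "O":
            continue  # only text the first-pass tagger left unlabelled
        if tokens[i].lower() in triggers:
            candidates.append((i, (tokens[i], tokens[i + 1])))
    return candidates


if __name__ == "__main__":
    tokens = ["Best", "Ann", "Coulter", "quotes"]
    labels = ["O", "O", "O", "O"]  # hypothetical first-pass output that missed the name
    print(candidate_bigrams(tokens, labels, {"ann"}))
```

Restricting candidates to trigger-initial bigrams keeps the search space small, so the post-editor only has to make one yes/no decision per candidate.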
Evaluation
Baselines: no editing, gazetteer term, gazetteer term+1
Goal is to improve recall: use cost-sensitive SVM
Method             Missed entity F1   Overall F1
No editing          0.00              80.32
Term only           5.82              82.58
Term+1              6.05              81.67
SVM Cost 0.1 (P)   78.26              83.07
SVM Cost 1.5 (R)   92.73              83.83
Ritter              -                 78.68
Error analysis
False positives:
Other-class entities (Huff Post, Exodus Porter)
Descriptive titles (Millionaire Rob Ford)
Names in non-name senses (Marie Claire)
Polysemous names (Mark)
False negatives:
Capitalisation (charlie gibson, KANYE WEST)
Spelling errors (Russel Crowe)
Common nouns (Jack Straw)
Uncommon names (Spicy Pickle Jr.)
Conclusion
PA adaptation of CRF helps NER in diverse domain
Automatic post-editing improves recall
SVM using context much better than gazetteer
Only external resource is first name lists
Thank you for your time!
Do you have any questions?
Research partially supported by the European Union/EU under the Information and Communication Technologies (ICT) theme of the 7th Framework Programme for R&D (FP7), grant PHEME (611233).
Entities in tweets
PER
News: Politicians, business leaders, journalists, celebrities
Tweets: Sportsmen, actors, TV personalities, celebrities, names of friends
LOC
News: Countries, cities, rivers, and other places related to current affairs
Tweets: Restaurants, bars, local landmarks/areas, cities, rarely countries
ORG
News: Public and private companies, government organisations
Tweets: Bands, internet companies, sports clubs