Presentation at NLDB 2012

Two-stage Named Entity Recognition using averaged perceptrons

Lars Buitinck, Maarten Marx

Information and Language Processing Systems
Informatics Institute

University of Amsterdam

17th Int’l Conf. on Applications of NLP to Information Systems

Buitinck, Marx Two-stage NER

Named Entity Recognition

Find names in text and classify them as belonging to persons, locations, organizations, events, products or “miscellaneous”
Use machine learning

Named Entity Recognition for Dutch

State-of-the-art algorithm for Dutch by Desmet and Hoste (2011): voting classifiers with a GA to train the weights
Good training sets are just becoming available
Many practitioners retrain the Stanford CRF-NER tagger

Overview

Realize that NER is two problems in one: recognition and classification
Pipeline solution with two classifiers
Use custom feature sets for each
Do not use a precompiled list of names (“gazetteer”)
Work at the sentence level (because of how training sets are set up)
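The two-classifier pipeline can be sketched as a small piece of glue code. This is only an illustration under assumptions: `two_stage_ner`, `toy_recognize` and `toy_classify` are hypothetical names, and the toy rules stand in for the trained classifiers of the actual system.

```python
def two_stage_ner(words, recognize, classify):
    """Run recognition first, then classify each span it found.

    recognize(words) -> list of (start, end) spans, end exclusive
    classify(words, start, end) -> category label for that span
    Both callables are placeholders for trained models.
    """
    entities = []
    for start, end in recognize(words):
        label = classify(words, start, end)
        entities.append((start, end, label))
    return entities


# Toy stand-ins (assumptions, not the real classifiers):
def toy_recognize(words):
    # Treat every capitalized token as a one-token entity span.
    return [(i, i + 1) for i, w in enumerate(words) if w[:1].isupper()]


def toy_classify(words, start, end):
    # Crude rule, for illustration only.
    return "LOC" if words[start].endswith("dam") else "MISC"


result = two_stage_ner(["Ik", "woon", "in", "Amsterdam"],
                       toy_recognize, toy_classify)
```

Keeping the stages behind separate callables mirrors the point of the slide: each stage can have its own feature set and be trained on its own.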

Recognition stage

Token-level task: is a token the Beginning of, Inside, or Outside any entity name?
Features:

  Word window w_{i-2}, ..., w_{i+2}
  POS tags for words in window
  Conjunction of words and POS tags in window, e.g. (w_{i-1}, p_{i-1})
  Capitalization of tokens in window
  (Character) prefixes and suffixes of w_i and w_{i-1}
  REs for digits, Roman numerals and punctuation
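A rough sketch of how the recognition features listed above might be extracted for one token. The function name, the feature string format, the padding symbol, the lowercasing and the maximum affix length of 3 are all assumptions made for illustration; the slides do not specify them.

```python
import re

# Simple patterns for the "REs" feature group (assumed forms).
DIGITS = re.compile(r"^\d+$")
ROMAN = re.compile(r"^[IVXLCDM]+$")
PUNCT = re.compile(r"^\W+$")


def token_features(words, pos, i, max_affix=3):
    """Return string-valued features for token i of a sentence."""
    feats = []
    n = len(words)
    # Window of two tokens to either side of position i.
    for off in range(-2, 3):
        j = i + off
        if 0 <= j < n:
            w, p = words[j], pos[j]
            feats.append(f"w[{off}]={w.lower()}")
        else:
            w, p = "<PAD>", "<PAD>"   # outside the sentence
            feats.append(f"w[{off}]=<PAD>")
        feats.append(f"p[{off}]={p}")
        feats.append(f"wp[{off}]={w.lower()}|{p}")   # word/POS conjunction
        feats.append(f"cap[{off}]={w[:1].isupper()}")
    # Character prefixes and suffixes of w_i and w_{i-1}.
    for off in (-1, 0):
        j = i + off
        if 0 <= j < n:
            w = words[j]
            for k in range(1, min(max_affix, len(w)) + 1):
                feats.append(f"pre[{off}]={w[:k]}")
                feats.append(f"suf[{off}]={w[-k:]}")
    # Regular-expression features on the current token.
    w = words[i]
    if DIGITS.match(w):
        feats.append("is_digits")
    if ROMAN.match(w):
        feats.append("is_roman")
    if PUNCT.match(w):
        feats.append("is_punct")
    return feats
```

For example, `token_features(["Lars", "woont", "in", "Amsterdam", "."], ["N", "V", "Prep", "N", "Punc"], 3)` yields features for "Amsterdam" using only the sentence itself, in line with the no-gazetteer design.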

Classification stage

Don’t do this at token-level; we know the entity spans!
Input is a list of tokens considered an entity by the recognition stage
Features:

  The tokens we got from recognition
  The four surrounding tokens
  Their pre- and suffixes up to length four
  Capitalization pattern, as a string over the alphabet (L|U|O)*
  The occurrence of capitalized tokens, digits and dashes in the entire sentence
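The span-level features above could be computed along these lines. This is a sketch under several assumptions: "the four surrounding tokens" is read as two on each side, affixes are taken from both entity and context tokens, and the capitalization pattern uses one symbol per token based on its first character. The names `cap_pattern` and `span_features` are hypothetical.

```python
def cap_pattern(tokens):
    """Map each token to U (uppercase initial), L (lowercase initial) or O (other)."""
    out = []
    for t in tokens:
        c = t[:1]
        if c.isupper():
            out.append("U")
        elif c.islower():
            out.append("L")
        else:
            out.append("O")
    return "".join(out)


def span_features(words, start, end, max_affix=4):
    """Features for the candidate entity words[start:end] (end exclusive)."""
    entity = words[start:end]
    # Two tokens before and two after the span (assumed reading of "four").
    context = words[max(0, start - 2):start] + words[end:end + 2]
    feats = [f"tok={t.lower()}" for t in entity]
    feats += [f"ctx={t.lower()}" for t in context]
    # Prefixes and suffixes up to length four.
    for t in entity + context:
        for k in range(1, min(max_affix, len(t)) + 1):
            feats.append(f"pre={t[:k]}")
            feats.append(f"suf={t[-k:]}")
    feats.append(f"shape={cap_pattern(entity)}")
    # Sentence-wide indicators.
    if any(t[:1].isupper() for t in words):
        feats.append("sent_has_cap")
    if any(ch.isdigit() for t in words for ch in t):
        feats.append("sent_has_digit")
    if any("-" in t for t in words):
        feats.append("sent_has_dash")
    return feats
```

With this reading, the span "Universiteit van Amsterdam" gets the pattern string `ULU`.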

Learning algorithm

Use averaged perceptron for both stages
Learns an approximation of the max-margin solution (linear SVM)
40 iterations
Used the LBJ machine learning toolkit
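For readers unfamiliar with the learner, here is a minimal multiclass averaged perceptron over sparse string features. It is a plain-Python sketch, not the LBJ implementation used in the work; the class name and the lazy-averaging bookkeeping are choices made for this illustration, and only the default of 40 iterations is taken from the slide.

```python
from collections import defaultdict


class AveragedPerceptron:
    def __init__(self, labels):
        self.labels = sorted(labels)
        self.w = defaultdict(float)        # (feature, label) -> weight
        self._total = defaultdict(float)   # running sum of weight over steps
        self._stamp = defaultdict(int)     # step of last change per weight
        self._step = 0

    def score(self, feats, label):
        return sum(self.w.get((f, label), 0.0) for f in feats)

    def predict(self, feats):
        return max(self.labels, key=lambda y: self.score(feats, y))

    def _bump(self, key, delta):
        # Credit the old weight for the steps it was in force, then update.
        self._total[key] += (self._step - self._stamp[key]) * self.w[key]
        self._stamp[key] = self._step
        self.w[key] += delta

    def fit(self, data, iterations=40):
        """data: list of (features, gold_label) pairs."""
        for _ in range(iterations):
            for feats, gold in data:
                self._step += 1
                guess = self.predict(feats)
                if guess != gold:
                    for f in feats:
                        self._bump((f, gold), +1.0)
                        self._bump((f, guess), -1.0)
        # Replace each weight by its average over all steps.
        if self._step:
            for key, weight in list(self.w.items()):
                total = self._total[key] + (self._step - self._stamp[key]) * weight
                self.w[key] = total / self._step
        return self
```

Averaging damps the influence of late updates, which is what makes the result behave like an approximate large-margin classifier while training stays a simple mistake-driven loop.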

Evaluation

Aim for F1 score, as defined in the CoNLL 2002 shared task on NER
Two corpora: CoNLL 2002 and a subset of SoNaR (courtesy Desmet and Hoste)
Compare against Stanford and Desmet and Hoste’s algorithm
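The CoNLL-style F1 counts an entity as correct only when both its span and its category match exactly. A small sketch of that computation follows; the tag notation `B-PER`, `I-PER`, `O` and the function names are assumptions for this example, and this is not the official evaluation script.

```python
def bio_spans(tags):
    """Collect (start, end, category) triples from one sentence of BIO tags."""
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel closes any entity still open at sentence end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        # "B-" always opens an entity; a stray "I-" is treated as an opening too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def conll_f1(gold_sents, pred_sents):
    """Entity-level precision, recall and F1 over parallel lists of tag sequences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Because matching is exact, a correctly located name with the wrong category costs both precision and recall, which is why per-category difficulty shows up so clearly in the scores.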

Results on CoNLL 2002

309,686 tokens containing 19,901 names, four categories
65% training, 22% validation and 12% test sets
Stanford achieves F1 = 74.72; "miscellaneous" category is hard (< 0.7)
We achieve F1 = 75.14; "organization" category is hard

Results on SoNaR

New, large corpus with manual annotations
Used a 200k-token subset of a preliminary version, three-fold cross validation
State of the art is Desmet and Hoste (2011) with F1 = 84.44
Best individual classifier from that paper (CRF) gets 83.77
Our system: 83.56
Here, “product” and “miscellaneous” categories are hard

Conclusion

Near state-of-the-art performance from simple learners with good feature sets
No gazetteers, so should be fairly reusable
(Side conclusion: SoNaR is more easily learnable than CoNLL)

Future work

Being integrated in UvA’s xTAS text analysis pipeline
Used to find entities in the Dutch Hansard corpus (forthcoming) and link entities to Wikipedia
Full SoNaR is now available; new evaluation needed
