Presentation at NLDB 2012

Two-stage Named Entity Recognition using averaged perceptrons. Lars Buitinck, Maarten Marx. Information and Language Processing Systems, Informatics Institute, University of Amsterdam. 17th Int’l Conf. on Applications of NLP to Information Systems.

Post on 10-May-2015

TRANSCRIPT

Page 1: Presentation at NLDB 2012

Two-stage Named Entity Recognition using averaged perceptrons

Lars Buitinck, Maarten Marx

Information and Language Processing Systems
Informatics Institute

University of Amsterdam

17th Int’l Conf. on Applications of NLP to Information Systems

Buitinck, Marx Two-stage NER

Page 2: Presentation at NLDB 2012

Outline

Page 3: Presentation at NLDB 2012

Named Entity Recognition

Find names in text and classify them as belonging to persons, locations, organizations, events, products or “miscellaneous”
Use machine learning

Page 5: Presentation at NLDB 2012

Named Entity Recognition for Dutch

State-of-the-art algorithm for Dutch by Desmet and Hoste (2011); voting classifiers with GA to train weights
Good training sets are just becoming available
Many practitioners retrain the Stanford CRF-NER tagger

Page 8: Presentation at NLDB 2012

Overview

Realize that NER is two problems in one: recognition and classification
Pipeline solution with two classifiers
Use custom feature sets for each
Do not use a precompiled list of names (“gazetteer”)
Work at the sentence level (because of how training sets are set up)
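
The two-classifier pipeline can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: `bio_to_spans`, `two_stage_ner`, and the toy `toy_recognize` / `toy_classify` functions are hypothetical names, and the toy rules stand in for the trained classifiers.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Turn per-token B/I/O tags into entity spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new entity starts here; close the open one, if any.
            # A stray "I" without a preceding "B" is treated as a start.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def two_stage_ner(
    tokens: List[str],
    recognize: Callable[[List[str]], List[str]],
    classify: Callable[[List[str], int, int], str],
) -> List[Tuple[int, int, str]]:
    """Stage 1 finds spans in one sentence, stage 2 labels each span."""
    tags = recognize(tokens)
    return [(s, e, classify(tokens, s, e)) for s, e in bio_to_spans(tags)]


# Toy stand-ins for the two trained classifiers (assumptions, for demo only).
def toy_recognize(tokens: List[str]) -> List[str]:
    tags = []
    for i, tok in enumerate(tokens):
        if not tok[:1].isupper():
            tags.append("O")
        elif i > 0 and tags[-1] != "O":
            tags.append("I")
        else:
            tags.append("B")
    return tags


def toy_classify(tokens: List[str], start: int, end: int) -> str:
    return "LOC" if start > 0 and tokens[start - 1] == "in" else "PER"


sentence = ["Lars", "Buitinck", "werkt", "in", "Amsterdam", "."]
result = two_stage_ner(sentence, toy_recognize, toy_classify)
print(result)  # [(0, 2, 'PER'), (4, 5, 'LOC')]
```

Because the second stage receives whole spans, it can use span-level features that a token-level tagger cannot express directly.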

Page 13: Presentation at NLDB 2012

Recognition stage

Token-level task: is a token the Beginning of, Inside, or Outside any entity name?
Features:

Word window $w_{i-2}, \ldots, w_{i+2}$
POS tags for words in window
Conjunction of words and POS tags in window, e.g. $(w_{i-1}, p_{i-1})$
Capitalization of tokens in window
(Character) prefixes and suffixes of $w_i$ and $w_{i-1}$
Regular expressions (REs) for digits, Roman numerals and punctuation
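
A minimal sketch of how such token-level features might be extracted as a set of string-valued indicators. Several details are assumptions not stated on the slide: the function name `token_features`, the feature naming scheme, the `<PAD>` symbol for positions outside the sentence, the maximum affix length of 3, the exact regular expressions, and applying them to the current token only.

```python
import re
from typing import List, Set

DIGITS = re.compile(r"^\d+$")
ROMAN = re.compile(r"^[IVXLCDM]+$")  # crude: any run of Roman-numeral letters
PUNCT = re.compile(r"^[^\w\s]+$")
PAD = "<PAD>"  # assumed padding symbol for out-of-sentence positions


def token_features(words: List[str], pos: List[str], i: int,
                   max_affix: int = 3) -> Set[str]:
    """Binary indicator features for the token at position i."""
    feats: Set[str] = set()
    n = len(words)

    # Window of two tokens on either side of position i.
    for off in range(-2, 3):
        j = i + off
        w, p = (words[j], pos[j]) if 0 <= j < n else (PAD, PAD)
        feats.add(f"w[{off}]={w}")        # word identity
        feats.add(f"p[{off}]={p}")        # POS tag
        feats.add(f"wp[{off}]={w}|{p}")   # word/POS conjunction
        if w[:1].isupper():
            feats.add(f"cap[{off}]")      # capitalization

    # Character prefixes and suffixes of w_i and w_{i-1}.
    for off in (0, -1):
        j = i + off
        if 0 <= j < n:
            w = words[j]
            for k in range(1, min(max_affix, len(w)) + 1):
                feats.add(f"pre[{off}]={w[:k]}")
                feats.add(f"suf[{off}]={w[-k:]}")

    # Regular-expression features on the current token.
    if DIGITS.match(words[i]):
        feats.add("digits")
    if ROMAN.match(words[i]):
        feats.add("roman")
    if PUNCT.match(words[i]):
        feats.add("punct")
    return feats


words = ["Lars", "Buitinck", "woont", "in", "Amsterdam"]
pos = ["N", "N", "V", "P", "N"]
print(sorted(token_features(words, pos, 0)))
```

A sparse indicator set like this plugs directly into a linear learner such as a perceptron, where each (feature, class) pair gets one weight.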

Page 21: Presentation at NLDB 2012

Classification stage

Don’t do this at token-level; we know the entity spans!
Input is a list of tokens considered an entity by the recognition stage
Features:

The tokens we got from recognition
The four surrounding tokens
Their pre- and suffixes up to length four
Capitalization pattern, as a string on the alphabet (L|U|O)*
The occurrence of capitalized tokens, digits and dashes in the entire sentence
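
The span-level features could look roughly like the sketch below. The slide leaves several points open, so the following are assumptions: one pattern symbol per entity token (U for an initial uppercase letter, L for lowercase, O for anything else), "four surrounding tokens" read as two on each side of the span, affixes taken from the entity tokens, and the hypothetical names `cap_pattern` and `span_features`.

```python
from typing import List, Set


def cap_pattern(tokens: List[str]) -> str:
    """Capitalization pattern over {L, U, O}, one symbol per token (assumed)."""
    def symbol(tok: str) -> str:
        if tok[:1].isupper():
            return "U"
        if tok[:1].islower():
            return "L"
        return "O"
    return "".join(symbol(t) for t in tokens)


def span_features(tokens: List[str], start: int, end: int,
                  max_affix: int = 4) -> Set[str]:
    """Indicator features for the candidate entity tokens[start:end]."""
    feats: Set[str] = set()
    entity = tokens[start:end]

    for tok in entity:
        feats.add(f"tok={tok}")
        # Prefixes and suffixes up to length four.
        for k in range(1, min(max_affix, len(tok)) + 1):
            feats.add(f"pre={tok[:k]}")
            feats.add(f"suf={tok[-k:]}")

    # Two context tokens before the span and two after it.
    for off in (-2, -1):
        j = start + off
        if j >= 0:
            feats.add(f"ctx[{off:+d}]={tokens[j]}")
    for off in (1, 2):
        j = end - 1 + off
        if j < len(tokens):
            feats.add(f"ctx[{off:+d}]={tokens[j]}")

    feats.add("cap=" + cap_pattern(entity))

    # Sentence-wide indicators.
    if any(t[:1].isupper() for t in tokens):
        feats.add("sent_has_cap")
    if any(ch.isdigit() for t in tokens for ch in t):
        feats.add("sent_has_digit")
    if any("-" in t for t in tokens):
        feats.add("sent_has_dash")
    return feats


sent = ["De", "Universiteit", "van", "Amsterdam", "ligt", "in",
        "Noord-Holland", "."]
print(cap_pattern(sent[1:4]))  # ULU
print(sorted(span_features(sent, 1, 4)))
```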

Page 29: Presentation at NLDB 2012

Learning algorithm

Use averaged perceptron for both stages
Learns an approximation of max-margin solution (linear SVM)
40 iterations
Used the LBJ machine learning toolkit
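
For readers unfamiliar with the learner, here is a minimal multiclass averaged perceptron over sparse indicator features. It is a generic textbook-style sketch, not the LBJ implementation used for the talk; the class name `AveragedPerceptron`, the lazy-averaging bookkeeping, and the toy data are assumptions. Only the default of 40 passes over the data is taken from the slide.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


class AveragedPerceptron:
    def __init__(self, classes: Iterable[str]):
        self.classes = list(classes)
        self.w = defaultdict(float)       # (feature, class) -> weight
        self._total = defaultdict(float)  # accumulated weight over steps
        self._stamp = defaultdict(int)    # step of the last change per weight
        self._step = 0

    def predict(self, feats: Set[str]) -> str:
        scores = {c: sum(self.w.get((f, c), 0.0) for f in feats)
                  for c in self.classes}
        return max(self.classes, key=lambda c: scores[c])

    def _update(self, key, delta: float) -> None:
        # Credit the old value for every step it was in effect, then change it.
        self._total[key] += (self._step - self._stamp[key]) * self.w[key]
        self._stamp[key] = self._step
        self.w[key] += delta

    def fit(self, data: List[Tuple[Set[str], str]], epochs: int = 40):
        for _ in range(epochs):
            for feats, gold in data:
                self._step += 1
                guess = self.predict(feats)
                if guess != gold:
                    # Standard perceptron update on a mistake.
                    for f in feats:
                        self._update((f, gold), +1.0)
                        self._update((f, guess), -1.0)
        if self._step:
            # Replace each weight by its average over the weight vectors
            # obtained after every training example.
            for key, value in list(self.w.items()):
                steps_in_effect = self._step - self._stamp[key] + 1
                total = self._total[key] + steps_in_effect * value
                self.w[key] = total / self._step
        return self


# Toy, linearly separable data (made up for illustration).
train = [
    ({"cap", "suf=dam"}, "LOC"),
    ({"cap", "pre=Mr"}, "PER"),
    ({"cap", "suf=dam", "ctx=in"}, "LOC"),
    ({"cap", "pre=Mr", "ctx=zei"}, "PER"),
]
model = AveragedPerceptron(["LOC", "PER"]).fit(train)
print([model.predict(feats) for feats, _ in train])
```

Averaging damps the influence of late updates, which in practice tends to make the final weights less sensitive to the order of the training examples than those of a plain perceptron.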

Page 33: Presentation at NLDB 2012

Evaluation

Aim for F1 score, as defined in the CoNLL 2002 shared task on NER
Two corpora: CoNLL 2002 and a subset of SoNaR (courtesy Desmet and Hoste)
Compare against Stanford and Desmet and Hoste’s algorithm
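
In the CoNLL 2002 shared task, an entity counts as correct only if both its span and its type match the gold annotation exactly; F1 is the harmonic mean of the resulting precision and recall. A small sketch of that computation follows. The tuple layout `(sentence_id, start, end, label)` and the function name `prf1` are assumptions for illustration, and the example entities are made up.

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, int, str]  # (sentence_id, start, end, label)


def prf1(gold: Iterable[Entity],
         predicted: Iterable[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity phrases."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 0, 2, "PER"), (0, 4, 5, "LOC"), (1, 1, 2, "ORG")]
pred = [(0, 0, 2, "PER"), (0, 4, 5, "ORG"), (1, 1, 2, "ORG"), (1, 3, 4, "MISC")]
p, r, f = prf1(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=0.500 R=0.667 F1=0.571
```

Note that the second prediction has the right span but the wrong type, so it counts as both a false positive and a missed entity.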

Page 36: Presentation at NLDB 2012

Results on CoNLL 2002

309,686 tokens containing 19,901 names, four categories
65% training, 22% validation and 12% test sets
Stanford achieves F1 = 74.72; "miscellaneous" category is hard (< 0.7)
We achieve F1 = 75.14; "organization" category is hard

Page 40: Presentation at NLDB 2012

Results on SoNaR

New, large corpus with manual annotations
Used a 200k-token subset of a preliminary version, three-fold cross validation
State of the art is Desmet and Hoste (2011) with F1 = 84.44
Best individual classifier from that paper (CRF) gets 83.77
Our system: 83.56
Here, “product” and “miscellaneous” categories are hard

Page 46: Presentation at NLDB 2012

Conclusion

Near-state-of-the-art performance from simple learners with good feature sets
No gazetteers, so should be fairly reusable
(Side conclusion: SoNaR is more easily learnable than CoNLL)

Page 49: Presentation at NLDB 2012

Future work

Being integrated in UvA’s xTAS text analysis pipeline
Used to find entities in Dutch Hansard corpus (forthcoming) and link entities to Wikipedia
Full SoNaR is now available; new evaluation needed
