A Survey of NLP Toolkits
Jing Jiang
Mar 8, 2007
03/08/2007 2
Outline
• WordNet
• Statistics-based phrases
• POS taggers
• Parsers
• Chunkers (syntax-based phrases)
• NER
• SNoW, OpenNLP and LingPipe
Outline (cont.)
• What does the tool provide?
• Is the tool easy to use as a stand-alone program?
• Is the tool easy to modify or integrate with my program?
WordNet
• Background:
  – Princeton, George Miller, 1985
  – “WordNet: An Electronic Lexical Database”
  – Current version: WordNet 3.0
• What does it provide?
  – A database of words and their relations
    • Nouns, verbs, adjectives and adverbs
    • Lexical relations: morphology
    • Semantic relations: synonyms, hypernyms/hyponyms, holonyms/meronyms, etc.
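To make "a database of words and their relations" concrete, here is a minimal Python sketch of a relation store with a hypernym lookup. The class name `LexicalDB`, its methods, and the handful of entries are hypothetical and only illustrate the idea; this is not WordNet's data format or API.

```python
from collections import defaultdict


class LexicalDB:
    """Toy store of typed relations between words (illustrative only)."""

    def __init__(self):
        # relation name -> source word -> set of related words
        self.relations = defaultdict(lambda: defaultdict(set))

    def add(self, relation, source, target):
        self.relations[relation][source].add(target)

    def related(self, relation, word):
        """Return the words linked to `word` by `relation`, sorted."""
        return sorted(self.relations[relation].get(word, ()))

    def hypernym_chain(self, word):
        """Follow the alphabetically first hypernym link upward until none is left."""
        chain, seen, current = [], {word}, word
        while True:
            parents = sorted(self.relations["hypernym"].get(current, ()))
            if not parents or parents[0] in seen:  # stop at the root or on a cycle
                return chain
            current = parents[0]
            seen.add(current)
            chain.append(current)


# Hand-written toy entries, not taken from WordNet itself.
db = LexicalDB()
db.add("synonym", "dog", "domestic dog")
db.add("hypernym", "dog", "canine")
db.add("hypernym", "canine", "carnivore")
db.add("hypernym", "carnivore", "mammal")
db.add("meronym", "dog", "paw")

print(db.hypernym_chain("dog"))
```

Storing each relation type separately keeps lookups such as "all hypernyms of X" or "all meronyms of X" symmetric, which is roughly how the relations listed above are used in practice.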
WordNet
• To use as a stand-alone program?
  – A command line program
  – Web interface
• To modify or integrate with my program?
  – API in C
  – Online manual not very clear (http://wordnet.princeton.edu/doc)
  – Interfaces in other languages (http://wordnet.princeton.edu/links#local)
    • Java
    • Perl
    • Many others
WordNet::Similarity
• Background
  – Ted Pedersen et al.
• What does it provide:
  – Semantic similarity between two words, measured in various ways using WordNet
  – Need to understand the measures to make the best use of them
• Demo:
  – http://marimba.d.umn.edu/cgi-bin/similarity.cgi
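As an illustration of what one such measure can look like, the sketch below computes an inverse-path-length similarity over a tiny hand-written hypernym hierarchy. It is written in Python rather than Perl, the taxonomy `TOY_HYPERNYM` and the function names are hypothetical, and whether any particular measure in the package is defined exactly this way should be checked against its documentation.

```python
# Toy child -> parent (hypernym) links; hand-written, not WordNet data.
TOY_HYPERNYM = {
    "dog": "canine",
    "cat": "feline",
    "canine": "carnivore",
    "feline": "carnivore",
    "carnivore": "mammal",
}


def ancestors(word):
    """Return [word, parent, grandparent, ...] up to the root."""
    path = [word]
    while path[-1] in TOY_HYPERNYM:
        path.append(TOY_HYPERNYM[path[-1]])
    return path


def path_similarity(a, b):
    """1 / (number of nodes on the shortest path between a and b), or 0.0 if unconnected."""
    up_a = ancestors(a)
    steps_from_b = {node: i for i, node in enumerate(ancestors(b))}
    for steps_from_a, node in enumerate(up_a):
        if node in steps_from_b:  # lowest common ancestor
            nodes_on_path = steps_from_a + steps_from_b[node] + 1
            return 1.0 / nodes_on_path
    return 0.0


print(path_similarity("dog", "cat"))     # path: dog, canine, carnivore, feline, cat
print(path_similarity("dog", "canine"))  # path: dog, canine
```

Other measures combine path information with corpus statistics, which is why the scores from different measures are not directly comparable.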
WordNet::Similarity
• To use as a stand-alone program?
  – A Perl script to call from command line
  – Web interface
• To modify or integrate with my program?
  – A Perl module
  – Online API with details and examples
Ngram Statistics Package
• What does it provide:
  – N-grams from a corpus ranked by a user-selected statistical measure of association (e.g. mutual information, chi-squared test)
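The ranking idea can be sketched in a few lines of Python using pointwise mutual information as the association measure. This is only an illustration of the technique; the function name `rank_bigrams_by_pmi` and the `min_count` parameter are hypothetical and do not correspond to the package's Perl scripts or options.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_count=1):
    """Return [((w1, w2), pmi), ...] sorted from strongest to weakest association."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare bigrams get unreliable PMI estimates
        p_joint = count / n_bigrams
        p_independent = (unigrams[w1] / n_unigrams) * (unigrams[w2] / n_unigrams)
        scored.append(((w1, w2), math.log2(p_joint / p_independent)))

    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


corpus = "new york is big and new york is old and big".split()
for bigram, score in rank_bigrams_by_pmi(corpus, min_count=2):
    print(bigram, round(score, 3))
```

PMI tends to favor rare pairs, so a frequency cutoff such as `min_count` is commonly applied; swapping in a different measure only changes the scoring line.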
Ngram Statistics Package
• To use as a stand-alone program?
  – count.pl, statistic.pl
  – Input can be flat text
  – Regular expressions to define tokens can be specified by the user
• To modify or integrate with my program?
  – Perl module
  – Online API with details and examples
  – User can define new statistical measures of association
LingPipe: Significant Phrases
• What does it provide:
  – Collocations (similar to NSP)
  – Relatively new terms
    • Foreground vs. background
• Web application: Amazon “SIPs”, Yahoo “Buzz Index”, Google “in the news”
• http://www.alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html
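One simple way to picture the foreground vs. background comparison is to score each term by how much more frequent it is in a recent (foreground) corpus than in an older (background) corpus. The Python sketch below uses a smoothed frequency ratio; this scoring function is an assumption chosen for illustration and is not claimed to be the statistic LingPipe uses, and the name `new_terms` is hypothetical.

```python
from collections import Counter


def new_terms(foreground_tokens, background_tokens, top_n=3):
    """Rank foreground terms by smoothed relative frequency ratio against the background."""
    fg = Counter(foreground_tokens)
    bg = Counter(background_tokens)
    vocab_size = len(set(fg) | set(bg))

    # Add-one smoothing so terms unseen in the background do not divide by zero.
    fg_total = len(foreground_tokens) + vocab_size
    bg_total = len(background_tokens) + vocab_size

    scores = {}
    for term, count in fg.items():
        p_fg = (count + 1) / fg_total
        p_bg = (bg[term] + 1) / bg_total
        scores[term] = p_fg / p_bg

    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


background = "the market rose the market fell".split()
foreground = "the podcast market rose podcast podcast".split()
print(new_terms(foreground, background, top_n=1))
```

The same scheme extends to phrases by counting n-grams instead of single tokens.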
POS Taggers
• What do they provide?
  – POS tags
• How many POS tags are there?
  – Penn Treebank Tag Set: http://www.cis.upenn.edu/~treebank/
  – Which tags are useful to your task?
Brill Tagger
• Background
  – Eric Brill, PhD thesis, U Penn, 1993
  – Transformation-based error-driven learning
• Accuracy and speed
  – ~96% accuracy
  – ~5000 sentences in ~4 seconds
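The tagging side of transformation-based learning can be sketched as follows: assign each word an initial tag, then apply an ordered list of rewrite rules of the form "change tag A to tag B when the previous tag is C". The Python sketch below covers only rule application, not rule learning; the lexicon, the single rule, and the function names are hypothetical illustrations rather than the tagger's actual resources.

```python
def initial_tags(tokens, lexicon, default="NN"):
    """Give each token its lexicon tag, falling back to a default tag."""
    return [lexicon.get(token.lower(), default) for token in tokens]


def apply_rules(tags, rules):
    """Apply (from_tag, to_tag, previous_tag) rules in order, left to right."""
    tags = list(tags)
    for from_tag, to_tag, previous_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == previous_tag:
                tags[i] = to_tag
    return tags


# Toy lexicon and rule, written by hand for this example.
toy_lexicon = {"i": "PRP", "want": "VBP", "to": "TO", "race": "NN"}
toy_rules = [("NN", "VB", "TO")]  # a noun right after "to" is retagged as a verb

tokens = ["I", "want", "to", "race"]
print(list(zip(tokens, apply_rules(initial_tags(tokens, toy_lexicon), toy_rules))))
```

During training, rules like this are chosen greedily by how many tagging errors each one removes on annotated data, which is what "error-driven" refers to.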
Brill Tagger
• To use as a stand-alone program?
  – Call from command line
  – Input must be one sentence per line, tokenized
    • E.g. We ’re going today , are you ?
• To modify or integrate with my program?
  – No API
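Because the input has to be pre-tokenized with one sentence per line, raw text needs a small preprocessing step first. The Python sketch below is one rough way to produce that layout with regular expressions; the function names and the splitting rules are assumptions for illustration, and a real sentence splitter and tokenizer would handle abbreviations, numbers, and quotes more carefully.

```python
import re

# Clitics such as 're and n't are split off; punctuation becomes its own token.
_TOKEN = re.compile(r"\w+(?=n['’]t)|n['’]t|['’]\w+|\w+|[^\w\s]")


def tokenize(sentence):
    """Split one sentence into word, clitic, and punctuation tokens."""
    return _TOKEN.findall(sentence)


def to_tagger_input(text):
    """Return the text as lines of space-separated tokens, one sentence per line."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    return "\n".join(" ".join(tokenize(s)) for s in sentences)


print(to_tagger_input("We're going today, are you? I can't."))
```

The resulting text can be written to a file and passed to the tagger on the command line.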
Charniak Parser
• Background
  – Eugene Charniak, Brown University
  – State-of-the-art
• What does it provide?
  – Syntactic parse tree
Charniak Parser
• To use as a stand-alone program?
  – Call from command line
  – Input must be one sentence per line
• To modify or integrate with my program?
  – No API
Collins Parser
• Background
  – Michael Collins, PhD thesis, U Penn, 1999
  – Head-driven statistical models
• What does it provide?
  – Syntactic parse trees
  – Head word for each production (dependency relations, but no relation labels)
Collins Parser
• To use as a stand-alone program?
  – Call from command line
  – Input must be one sentence per line, tokenized, POS tagged
• To modify or integrate with my program?
  – No API
MiniPar
• Background
  – Dekang Lin, U Alberta
• What does it provide?
  – Dependency parse trees
  – Dependency relation labels
• Accuracy and speed
  – ~88% precision, ~80% recall for dependency relations
  – 300 words / second (Pentium II 300, 128MB)
Examples of Dependency Relations
• The Fulton County Grand Jury said Friday an investigation of Atlanta 's recent primary election produced…
• say V:s:N Fulton County Grand Jury
• Fulton County Grand Jury N:det:Det the
• Fulton County Grand Jury N:lex-mod:U Fulton
• Fulton County Grand Jury N:lex-mod:U County
• Fulton County Grand Jury N:lex-mod:U Grand
• say V:subj:N Fulton County Grand Jury
• say V:guest:N Friday
• produce V:s:N investigation
• investigation N:det:Det an
• investigation N:mod:Prep of
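Relation lines in the layout shown above (head, then head-category:relation:dependent-category, then dependent) are easy to turn into structured records. The Python sketch below assumes exactly that textual layout; the `Dependency` record and `parse_relation` function are hypothetical helpers, not part of MiniPar, and the parser's raw output may be formatted differently.

```python
import re
from collections import namedtuple

Dependency = namedtuple("Dependency", "head head_cat relation dep_cat dependent")

# head  HEADCAT:relation:DEPCAT  dependent   (head and dependent may span several words)
_RELATION_LINE = re.compile(
    r"^(?P<head>.+?)\s+(?P<head_cat>\w+):(?P<relation>[\w-]+):(?P<dep_cat>\w+)"
    r"\s+(?P<dependent>.+)$"
)


def parse_relation(line):
    """Parse one 'head Cat:rel:Cat dependent' line into a Dependency record."""
    match = _RELATION_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a dependency relation line: %r" % line)
    return Dependency(**match.groupdict())


lines = [
    "say V:subj:N Fulton County Grand Jury",
    "Fulton County Grand Jury N:det:Det the",
    "investigation N:mod:Prep of",
]
for dep in map(parse_relation, lines):
    print(dep.relation, "|", dep.head, "->", dep.dependent)
```

With records like these, filtering by relation label (for example, keeping only subject relations) becomes a one-line list comprehension.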
MiniPar
• To use as a stand-alone program?
  – A command line program
  – Input must be one sentence per line
• To modify or integrate with my program?
  – API in C
  – Parse tree and dependency relations are stored in some data structure for easy access
Comparison of Parsers
• Accuracy:
  – Charniak > Collins > MiniPar
• Dependency relations:
  – Collins, MiniPar
• Dependency relation labels:
  – MiniPar
• Speed:
  – MiniPar
Chunkers (Shallow Parsers)
• What do they provide?
  – Phrase structure of a sentence
  – E.g. [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September]
• Compare with collocations
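Bracketed chunker output like the example above can be read back into (label, tokens) pairs with a short regular expression. The Python sketch below assumes the flat, non-nested bracket notation shown in the example; the function name `parse_chunks` is hypothetical.

```python
import re

# Matches one flat chunk such as "[NP the current account deficit]".
_CHUNK = re.compile(r"\[(\w+) ([^\]]+)\]")


def parse_chunks(bracketed):
    """Return [(label, [token, ...]), ...] for each bracketed chunk, in order."""
    return [(label, words.split()) for label, words in _CHUNK.findall(bracketed)]


example = (
    "[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] "
    "[PP to] [NP only 1.8 billion] [PP in] [NP September]"
)
chunks = parse_chunks(example)
noun_phrases = [" ".join(tokens) for label, tokens in chunks if label == "NP"]
print(noun_phrases)
```

Unlike statistically ranked collocations, phrases extracted this way are defined by syntactic category, so every noun phrase is returned regardless of how often it occurs.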
Named Entity Recognizers
• What do they provide?
  – Named entities of various pre-defined types (e.g. Person, Location, Organization, Number, etc.)
SNoW-based Tools
• Use SNoW as the underlying learner
• In C++
• API available for many components
SNoW-based Tools
• Sentence splitter
• Tokenizer
• POS tagger
• Dependency parser
• Chunker
• NE tagger
• SRL
OpenNLP
• Java-based, open source project
• Maximum entropy models
• Pipeline structure
  – Sentence detector → tokenizer → POS tagger → chunker
• Java API
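The pipeline structure means each component consumes the output of the one before it. The Python sketch below shows that composition pattern with toy stages; it is not OpenNLP's Java API, the stage functions and `TOY_LEXICON` are hypothetical stand-ins, and the real components are trained maximum entropy models rather than regular expressions and lookups.

```python
import re

# Hand-written toy lexicon standing in for a trained tagging model.
TOY_LEXICON = {"he": "PRP", "she": "PRP", "runs": "VBZ", "sleeps": "VBZ", ".": "."}


def detect_sentences(text):
    """Stage 1: raw text -> list of sentence strings."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def tokenize(sentences):
    """Stage 2: sentence strings -> lists of tokens."""
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]


def tag(tokenized):
    """Stage 3: token lists -> lists of (token, tag) pairs."""
    return [[(t, TOY_LEXICON.get(t.lower(), "NN")) for t in sent] for sent in tokenized]


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data


print(run_pipeline("He runs. She sleeps.", [detect_sentences, tokenize, tag]))
```

One consequence of this design is that errors propagate: a wrong sentence boundary or token split affects every later stage.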
OpenNLP
• Sentence boundary detector
• Tokenizer
• POS tagger
• Chunker
• Parser
• Name Finder
• Coreference
LingPipe
• Java-based libraries for various kinds of linguistic analysis
• http://www.alias-i.com/lingpipe/index.html