NLP Tools
By: Asef Pourmasoumi, Hossein Kamyar
Supervisor: Dr. Kahani

Upload: lindsay-reed

Post on 16-Dec-2015




NLP Tasks

• Sentence splitter & Tokenizer
• Stemming
• Discourse analysis
• Coreference Resolution
• Named entity recognition (NER)
• Natural language generation
• Natural language understanding
• Part of speech tagging (POS)
• Optical character recognition (OCR)
• Semantic role labeling (SRL)
• Parsing & Chunker
• Relationship extraction
• Question answering
• Text Summarization
• Summarization Evaluation


NLP Tasks

• Machine Translation
• Sentiment analysis
• Speech recognition
• Speech segmentation
• Topic segmentation
• Word sense disambiguation
• Text simplification
• Text-to-speech
• Query expansion
• Recognizing textual entailment (RTE)
• Text to image
• Clustering & Classification & IR
• And more


Sentence splitter & Tokenizer

• GATE
• University of Illinois
  • Sentence Segmentation tool
  • download link : http://cogcomp.cs.illinois.edu/page/tools_view/2
• Stanford University
  • CoreNLP suite, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, and the coreference resolution system.
  • download link : http://nlp.stanford.edu/software/corenlp.shtml
• MontyTagger
  • link : http://web.media.mit.edu/~hugo/montylingua/
• LingPipe
• OpenNLP
  • link : http://incubator.apache.org/opennlp/index.html
• Natural Language Toolkit
  • open source Python modules for Windows, Mac OS X and Linux.
  • link : http://www.nltk.org/download

Sentence splitting is also known as sentence breaking or sentence boundary disambiguation.
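To make the task concrete, here is a minimal sketch of a rule-based sentence splitter and tokenizer using only the Python standard library. It is not how any of the tools listed above work internally; the abbreviation list and the function names `split_sentences` and `tokenize` are assumptions made for illustration.

```python
import re

# Abbreviations that end in a period but usually do not end a sentence.
# This tiny list is illustrative only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}

def split_sentences(text):
    """Split at '.', '!' or '?' followed by whitespace, skipping abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # whatever remains after the last boundary
    return sentences

def tokenize(sentence):
    """Return words (with inner hyphens/apostrophes) and punctuation marks as tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

text = "Dr. Kahani supervised the talk. It covers NLP tools! Is NLTK listed?"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The abbreviation check is what makes this "boundary disambiguation" rather than plain splitting: the period after "Dr" is not treated as a sentence end. Real tools typically learn such decisions from data instead of using a fixed list.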


Stemming

• Oleander Porter's algorithm - stemming library in C++ released under BSD
• Lovins stemming algorithm - with source code in a couple of languages
• Porter stemming algorithm - including source code in several languages
• Lancaster stemming algorithm - Lancaster University, UK
• UEA-Lite Stemmer - University of East Anglia, UK
• Themis - open source IR framework, includes Porter stemmer implementation (PostgreSQL, Java API)
• Snowball - free stemming algorithms for many languages, includes source code, including stemmers for five Romance languages
• PTStemmer - A Java/Python/.Net stemming toolkit for the Portuguese language
• jsSnowball - open source JavaScript implementation of Snowball stemming algorithms for many languages
• hindi_stemmer - open source stemmer for Hindi
• czech_stemmer - open source stemmer for Czech
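As a rough illustration of what a stemmer does, the sketch below strips a handful of English suffixes. It is deliberately much simpler than the Porter, Lovins, or Lancaster algorithms listed above and should not be mistaken for any of them; the rule table and the name `toy_stem` are assumptions made for this example.

```python
# (suffix, replacement) rules, tried in order.
# The first rule that matches and leaves a long enough stem wins.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # keep a final "ss" so "class" is not cut to "clas"
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def toy_stem(word, min_stem_length=3):
    """Strip one common English suffix; a toy, not the Porter algorithm."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_length:
                return stem
    return word  # no rule applied

for w in ["caresses", "ponies", "parsing", "tools", "class", "sing"]:
    print(w, "->", toy_stem(w))
```

The minimum stem length guards against over-stripping short words such as "sing". Published algorithms use far richer conditions on the remaining stem.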


Coreference Resolution

• Illinois has online & downloadable CR
• Stanford University
  • integrated in the Stanford suite of NLP tools, StanfordCoreNLP.
  • download link : http://nlp.stanford.edu/software/corenlp.shtml
• LingPipe
• OpenNLP
  • link : http://incubator.apache.org/opennlp/index.html
• Natural Language Toolkit
  • download link : http://www.nltk.org/download
• BART (Beautiful Anaphora Resolution Toolkit)
  • download link : http://www.bart-coref.org/
• GuiTAR (A General Tool for Anaphora Resolution)
  • download link : http://cswww.essex.ac.uk/Research/nle/GuiTAR/

Coreference resolution (CR) determines which words ("mentions") refer to the same objects ("entities").


Named entity recognition

• Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization).

• Illinois
• Stanford Natural Language Processing Group
  • link : http://nlp.stanford.edu/software/CRF-NER.shtml
  • downloadable (written in Java), English & German.
• LingPipe
• OpenNLP
  • link : http://incubator.apache.org/opennlp/index.html
• Natural Language Toolkit
  • link : http://www.nltk.org/download
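The input and output of the task can be sketched with a dictionary (gazetteer) lookup. This is only a toy: the recognizers listed above are statistical and learned from annotated text, whereas the `GAZETTEER` entries and the function `tag_entities` below are hypothetical names invented for this example.

```python
# Hypothetical mini-gazetteer mapping lowercased token sequences to entity types.
GAZETTEER = {
    ("stanford", "university"): "ORGANIZATION",
    ("noam", "chomsky"): "PERSON",
    ("east", "anglia"): "LOCATION",
    ("edinburgh",): "LOCATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def tag_entities(tokens):
    """Greedy longest-match lookup; returns (entity text, type) pairs."""
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first, then shorter ones.
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + length])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + length]), GAZETTEER[key]))
                i += length
                break
        else:
            i += 1  # no entry starts at this token, move on
    return entities

tokens = "Noam Chomsky spoke at Stanford University and in Edinburgh".split()
print(tag_entities(tokens))
```

A lookup like this cannot handle unseen names or ambiguous ones, which is the main reason practical systems rely on context features instead.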


Part of speech tagging

• Illinois
• Stanford Natural Language Processing Group
  • link : http://nlp.stanford.edu/software/tagger.shtml
  • downloadable (written in Java). English, Arabic, Chinese.
• LingPipe
• OpenNLP
  • link : http://incubator.apache.org/opennlp/index.html
• MontyTagger
  • link : http://web.media.mit.edu/~hugo/montylingua/
• Natural Language Toolkit
  • open source Python modules for Windows, Mac OS X and Linux.
  • link : http://www.nltk.org/download
• GATE
• And many others at http://nlp.stanford.edu/links/statnlp.htm

• Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight").
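The "book" ambiguity can be sketched with a tiny lexicon plus two context rules. This is an illustrative toy under stated assumptions, not the method used by any tagger listed above; the `LEXICON` contents, the tag names, and the function `tag` are all made up for the example.

```python
# Hypothetical lexicon: each word lists the tags it may take.
LEXICON = {
    "the": ["DET"], "a": ["DET"], "to": ["TO"],
    "on": ["PREP"], "table": ["NOUN"], "flight": ["NOUN"],
    "book": ["NOUN", "VERB"],
}

def tag(tokens):
    """Tag each token, using the previous tag to resolve ambiguous words."""
    tagged, previous = [], None
    for token in tokens:
        options = LEXICON.get(token.lower(), ["NOUN"])  # unknown words default to NOUN
        if len(options) == 1:
            choice = options[0]
        elif previous == "TO" and "VERB" in options:
            choice = "VERB"      # "to book" -> verb
        elif previous == "DET" and "NOUN" in options:
            choice = "NOUN"      # "the book" -> noun
        else:
            choice = options[0]  # fall back to the first listed tag
        tagged.append((token, choice))
        previous = choice
    return tagged

print(tag("the book on the table".split()))
print(tag("to book a flight".split()))
```

Statistical taggers generalize this idea by learning, from tagged corpora, how likely each tag is given the word and its neighbors.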


Semantic role labeling

• Illinois has online & downloadable SRL
• MontyTagger
  • Link : http://web.media.mit.edu/~hugo/montylingua/
• ASSERT (Automatic Statistical SEmantic Role Tagger)
  • Link : http://cemantix.org/assert.html
  • Downloadable, OS : RedHat Linux
  • Designed and implemented by Sameer S. Pradhan, with some initial contribution from Daniel Gildea at the University of Rochester.
  • ASSERT is trained to tag: i) PropBank arguments, ii) Thematic roles, and iii) Opinions, in plain text.
• SwiRL: The Semantic Role Labeler
  • A labeler for English, built on top of full syntactic analysis of text using Eugene Charniak's parser.
  • SwiRL trains one classifier for each argument label using a rich set of syntactic and semantic features.
  • Link : http://www.surdeanu.name/mihai/swirl/
• CoNLL-2005 Shared Task: Semantic Role Labeling: Systems & Results
  • Link : http://www.lsi.upc.edu/~srlconll/st05/st05.html


Parser & Chunker

• Illinois
• Stanford
  • link : http://nlp.stanford.edu/software/tagger.shtml
  • downloadable (written in Java), English, Arabic, Chinese.
• OpenNLP
  • link : http://incubator.apache.org/opennlp/index.html
• Natural Language Toolkit
  • link : http://www.nltk.org/download

Determine the parse tree (grammatical analysis) of a given sentence.


Question answering

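• List of question-and-answer websites (blank cells were empty in the original table)

| Website | Founded | Alexa Ranking | Registration? |
|---|---|---|---|
| Allexperts | 1998 | 1957 | No |
| AOL Answers | 2006 | 6634 | Yes |
| Answerbag | 2003 | 1128 | |
| Answers | 2005 | 127 | No |
| Askpedia | | 123765 | |
| Ask Me Help Desk | 2003 | 6686 | Yes |
| Askville | | | Yes |
| Blurtit | 2006 | 1716 | |
| ChaCha | | 1198 | |
| Experts Exchange | 1996 | 1424 | Yes |
| Wolfram Alpha | 2009 | 3883 | No |
| Wikipedia Reference Desk | 2001 | 7 | No |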


Automatic Summarization

• http://topicmarks.com/dashboard

• http://www.tools4noobs.com/summarize/

• http://www.uoguelph.ca/~wdarling/summ/

• Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.

Other
• http://swesum.nada.kth.se/index-eng.html
• http://www.summarization.com/mead/
• http://textcompactor.com/

Multi-document online text summarizers
• http://newsfeedresearcher.com/
• http://iresearch-reporter.com/
• http://shablast.com/


Summarization Evaluation

• ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
  • Link : http://berouge.com/default.aspx
  • Downloadable, written in Perl.
• MEADeval (An Evaluation Framework for Extractive Summarization)
  • Link : http://tangra.si.umich.edu/clair/meadeval/
  • Downloadable, written in Perl.
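The core idea behind ROUGE-N is n-gram recall: the fraction of the reference summary's n-grams that also appear in the candidate summary. The sketch below computes only that ratio for a single reference, with whitespace tokenization as a simplifying assumption. It is not the Perl toolkit linked above, which offers further options, and the function names are chosen for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of n-grams in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0  # reference too short to contain any n-gram
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams matched
```

Because the denominator comes from the reference, a longer candidate can only raise the score, which is why the measure is described as recall-oriented.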


Machine Translation

• Stanford : Entailment-based MT evaluation
  • Link : http://nlp.stanford.edu/software/mteval.shtml
  • Downloadable (written in Java)
  • Based on the Stanford RTE system, which performs inference between two short texts, determining if one is entailed by the other. This inference mechanism is used to predict the adequacy of MT system output at the segment level compared to a reference translation.
• EGYPT system: system from the 1999 JHU workshop. Mainly of historical interest.
• GIZA++ and mkcls: Franz Och. C++. GPL.
• Thot: phrase-based model building kit
• Phramer: an open-source Java statistical phrase-based MT decoder
• Moses: an open-source phrase-based MT decoder with functionality beyond Pharaoh.
• SRILM: for building n-gram language models.
• Syntax Augmented Machine Translation via Chart Parsing: Andreas Zollmann and Ashish Venugopal
• Rewrite: a decoder for IBM Model
• BLEU scoring tool for machine translation evaluation

Free, but obtaining them takes some effort:
• Pharaoh decoder: Philipp Koehn, ISI.
• MTTK: Machine Translation Tool Kit. Deng and Byrne.


Topic segmentation

• Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment.

• Stanford
  • Link : http://nlp.stanford.edu/software/tmt/tmt-0.3/
  • Downloadable (written in Java)
  • English, Arabic, Chinese version, 14.7 MB
  • Features
    • Import and manipulate text from cells in Excel and other spreadsheets.
    • Train topic models (LDA and Labeled LDA) to create summaries of the text.
    • Select parameters (such as the number of topics) via a data-driven process.
    • Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data.


Word sense disambiguation

• WordNet::SenseRelate
  • Link : http://senserelate.sourceforge.net/
  • Three related packages:
    • WordNet-SenseRelate-AllWords : Assigns a sense to each word in a text.
    • WordNet-SenseRelate-TargetWord : Assigns a sense to a given target word.
    • WordNet-SenseRelate-WordToSet : Assigns the meaning to a word that is most related to a given set of words.
  • They carry out word sense disambiguation by measuring the semantic similarity between a word and its neighbors. In particular, a word is assigned the sense that is most related to its neighbors.
• GWSD is a system for unsupervised all-words graph-based word sense disambiguation
  • Link : http://lit.csci.unt.edu/~rada/downloads/GWSD/GWSD.1.0.tar.gz

• Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.
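The idea of picking the sense most related to the surrounding words can be sketched with a simplified Lesk-style gloss overlap. Note the substitution: SenseRelate uses WordNet-based relatedness measures, while this sketch merely counts shared words between a sense definition and the context. The `SENSES` inventory, the stopword list, and the function names are hypothetical.

```python
# Hypothetical sense inventory; a real system would read glosses from WordNet.
SENSES = {
    "bank": {
        "bank_financial": "an institution that accepts deposits and lends money",
        "bank_river": "sloping land beside a river or lake",
    },
}
STOPWORDS = {"a", "an", "the", "and", "or", "that", "of", "to", "in", "on", "at", "by"}

def content_words(text):
    """Lowercased words of the text, minus stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def disambiguate(target, context):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context) - {target}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[target].items():
        overlap = len(content_words(gloss) & context_words)
        if overlap > best_overlap:  # ties keep the earlier-listed sense
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "she sat on the bank of the river"))
print(disambiguate("bank", "he went to the bank to deposit money"))
```

Exact word overlap is brittle ("deposit" does not match "deposits" here), which is one motivation for the graded similarity measures the listed tools use.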


List of Toolkits

| Name | Language | Creators | Site |
|---|---|---|---|
| AlchemyAPI | C, C++, C#, Java, Python, Perl, Ruby | Orchestr8 | [1] |
| Antelope framework | C#, VB.net | Proxem | [2] |
| Apertium | C++, Java | (various) | [3] |
| Cogito | | Expert System S.p.A. | [4] |
| Carabao Language Kit | Any COM+ compliant language | Digital Sonata Pty Ltd | [5] |
| DELPH-IN | LISP, C++ | Deep Linguistic Processing with HPSG Initiative | [6] |
| Distinguo | C++ | Ultralingua Inc. | [7] |
| Ellogon | C / C++ | Georgios Petasis | [8] |
| FreeLing | C++ | Universitat Politècnica de Catalunya | [9] |
| General Architecture for Text Engineering | Java | GATE open source community | [10] |
| Graph Expression | Java | Startup huti.ru | [11] |
| Learning Based Java | Java | Cognitive Computation Group at the University of Illinois | [12] |
| LingPipe | Java | Alias-i | [13] |
| LinguaStream | Java | University of Caen, France | [14] |


List of Toolkits

| Name | Language | Creators | Site |
|---|---|---|---|
| Mallet | Java | University of Massachusetts Amherst | [15] |
| MII nlp toolkit | Java | UCLA Medical Imaging Informatics (MII) Group | [16] |
| Modular Audio Recognition Framework | Java | The MARF Research and Development Group, Concordia University | [17] |
| MontyLingua | Python, Java | MIT | [18] |
| Natural Language Toolkit (NLTK) | Python | | [19] |
| NooJ (based on INTEX) | .NET | University of Franche-Comté, France | [20] |
| OpenNLP | Java | Online community | [21] |
| Rosette | C, C++, Java, .NET | Basis Technology | [22] |
| ScalaNLP | Scala | David Hall and Daniel Ramage | [23] |
| Stanford NLP | Java | The Stanford Natural Language Processing Group | [24] |
| Text Engineering Software Laboratory (Tesla) | Java | University of Cologne | [25] |
| Thinktelligence Delegator | Java | Thinktelligence Corporation | [26] |
| UIMA | Java / C++ | Apache | [27] |
| WebLab-project | Java | OW2 | [28] |
| UniteX | Java & C++ | Laboratoire d'Automatique Documentaire et Linguistique | [29] |
| The Dragon Toolkit | Java | Drexel University | [30] |
| Factorie | Java | University of Massachusetts Amherst | [31] |
| Silpa Indic Language Processing Toolkit | Python | Silpa opensource community developers | [32] |


Corpora

• LDC (Linguistic Data Consortium) and its catalogue by year. Email: [email protected]. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., the ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs.
• European Language Resources Association and its catalogue. The distribution agency is ELDA. Rapidly growing collection of materials in European languages.
• ICAME (International Computer Archive of Modern English). Sells various corpora (including Brown and London-Lund).
• Reuters @ NIST. Reuters corpora are now distributed by NIST.
• TRACTOR (TELRI Research Archive of Computational Tools and Resources). Corpora, many multilingual, in European community languages.
• CLR (Consortium for Lexical Research). Focuses more on language processing tools and lexicons, but does have some corpora.
• OTA (Oxford Text Archive). Provides mainly literary texts. Has a bright new web site. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk.
• Leipzig Corpora Collection. Sentence collections in a MySQL database for 17 mainly European languages.


Corpora

• BNC (British National Corpus). A 100 million word corpus of British English. Now also available in an XML edition.
• European Corpus Initiative Multilingual Corpus I (ECI/MCI). A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap.
• Survey of English Usage. At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, the Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).
• International Corpus of English (ICE). Million-word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc.
• Corpora held by Lancaster University. The linked page provides its own annotations.
• The European Language Activity Network. Promises a uniform query language for accessing corpora in all EU languages, but isn't quite there yet.
• Talkbank. Rich video and transcripts.


NLP Research Group

• Academic departments with computational linguistics programs
  • Institute for Communicating and Collaborative Systems at the University of Edinburgh
  • Institute for Research in Cognitive Science at the University of Pennsylvania
  • Computational Linguistics & Phonetics at Saarland University
  • Computational Linguistics and Language Technology at Ohio State University
  • Stanford Natural Language Processing Group
  • Computational Linguistics at the University of Washington
  • Human Language Technology Research Institute at the University of Texas at Dallas
  • Department of Computer Science at the University of Illinois Urbana-Champaign (Cognitive Computation Group)
  • Center for Language and Speech Processing at Johns Hopkins University
• Non-university computational linguistics groups
  • German Research Center for Artificial Intelligence


NLP Research Sponsors

Summer Internships and Opportunities
• Google Internships
• Summer of Code 2008
• Data Science Summer Institute


Blogs, Video Lectures

• Blogs
  • Hal Daume III's NLP blog
  • LingPipe blog (Bob Carpenter)
  • Fernando Pereira's Structured Learning blog
  • Language Log
  • John Langford's Machine Learning blog
  • Jamie Pennebaker's Wordwatcher's blog
• Video lectures
  • ACL Video Archive
  • Videos of Machine Learning lectures
  • Machine Learning and Cognitive Science 2007: includes talks by Chris Manning, Sharon Goldwater, John Goldsmith, and others.
  • MIT workshop: Where Does Syntax Come From? Have We All Been Wrong? Speakers include Chris Manning, Noam Chomsky, Partha Niyogi, Howard Lasnik and Joshua Tenenbaum.
  • NIPS 2007 tutorials: including Geoffrey Hinton, Ben Taskar, and Robert Schapire.
  • Graduate Summer School: Probabilistic Models of Cognition: The Mathematics of Mind (July 9 - 26, 2007): slides and webcast links of all the talks. A lot of good introductory material on graphical models, Bayesian learning, etc.
  • Microsoft Research: videos on ResearchChannel.
  • Google Roundtable


Conferences

• General (World Wide): ACL / ANLP / COLING / LREC / HLT
• General (USA): NAACL / CICLING
• General (Europe): EACL / RANLP / AMLaP
• General (Asia): IJCNLP (formerly NLPRS) / PACLIC / PACLING / JNLP / IALP
• Formal Grammar: FG / LFG / HPSG / TAG+
• Machine Learning: ICML / ECML / NIPS
• Statistical NLP: EMNLP / CoNLL / WVLC
• Information Retrieval: SIGIR / ECIR
• Computational Semantics: IWCS / ICoS
• Others: IWPT / WAS / MOL / SENSEVAL / FSMNLP


Journals

• NLP/CL
  • Computational Linguistics
  • Natural Language Engineering
  • Journal on Research on Language and Computation
  • Language Resources and Evaluation (formerly Computers and the Humanities)
  • Research on Language and Computation
  • Logic, Language and Information
  • Computer Speech and Language
  • Linguistic Issues in Language Technology (LiLT)
  • Journal of Interesting Negative Results in Natural Language Processing and Machine Learning
    • CfP: Interesting Negative Results in Summarization
  • Terminology
  • Traitement Automatique des Langues
    • CfP: Special Issue on Scaling NLP
  • Texto!
  • Corpus Linguistics and Linguistic Theory
  • ICAME Journal


Journals

• IR/IS
  • Information Retrieval
  • D-Lib Magazine
  • Information Processing & Management
  • Journal of the American Society for Information Science and Technology
  • Information Science
  • Information Development
  • Information Design Journal + Document Design
• Speech Processing
  • International Journal of Speech Technology
  • Speech Communication
  • Journal of the Acoustical Society of America
  • IEEE Transactions on Signal Processing
  • IEEE Transactions on Audio, Speech & Language Processing
    • CfP: Special Issue on New Approaches to Statistical Speech and Text Processing


Journals

• Language and Identity
  • Language in Society
  • Journal of Language, Identity, and Education
  • Language & Intercultural Communication
• BioInformatics
  • Bioinformatics
  • Biomedical Informatics
  • Applied Bioinformatics
  • Online Journal of Bioinformatics
  • In Silico Biology
  • Artificial Intelligence in Medicine


Supplementary Links

• http://lac.essex.ac.uk/vm
• http://comp.ling.utexas.edu/wiki/doku.php/nlp_links
• http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/nlp.html
• http://www.coli.uni-saarland.de/~csporled/page.php?id=tools
• http://www.elsnet.org/toolslist.html
• http://zope.bioinfo.cnio.es/bionlp_tools/all_bionlp_tools
• http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits


Questions?
