4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007
From Synergy to Knowledge: Integrating multiple language resources
Part I: Language Resources and Tools
Chu-Ren Huang
Academia Sinica
http://cwn.ling.sinica.edu.tw/huang/huang.htm
Outline: Language Resources and Tools
Introduction: 10 Years in Chinese Language Processing, a Mirror for Other Asian Languages
The Starting Point: Resources and Resources Sharing
OLAC: The Open Language Archives Community
Asian Language Resources Committee of AFNLP
Standards: ISO TC37 Language Resources Management
Language Archives Project of Taiwan
Tools: Getting Started in NLP with NLTK
Why Resources and Tools
Language Resources
Foundation and empirical basis of scientific studies of natural languages
The only reliable source for language specific features
Infrastructure for knowledge representation and knowledge engineering
Essential to preserve linguistic and cultural diversity
Tools
Needed to ‘process’
General enough for multilingual processing and cross-lingual comparison
Robust enough to deal with language specific issues
Chinese Language Processing as a Mirror
For the development of Asian Language Processing
Unlike Japanese, which has enjoyed being one of the leaders in technological innovation
The development of Chinese language processing coincides with the developing economies of Taiwan and China
Especially the availability of Chinese language PC’s
Similar to the situation of many Asian languages now
CLP in the past 10 years
A review of what happened in the past ten years in Chinese Language Processing (1992-2002), from a somewhat personal perspective
1992 – Corpora
Completion of the first Chinese corpus for linguistic research (Huang and Chen, COLING ’92, pp. 1214-1217)
-untagged, non-segmented
-but searchable
CLP 1992 – 1993
1992 – Segmentation Standard
Announcement of the first national standard for word segmentation by the PRC government.
GB 13715, 《信息處理用現代漢語分詞規範》 (Contemporary Chinese Word Segmentation Specification for Information Processing).
1993 – Lexicon
Completion and Release of the first version of CKIP lexicon (with the category set and ICG thematic roles)
First version of K. Chen’s parser for Chinese
CLP Corpus 1994 – 1995
1994
10th anniversary of the automation of Chinese historical textual databases.
Completion of the pre-Qin Classical Chinese corpus at Academia Sinica.
1995
Completion of Sinica Corpus (v. 1.0, 1 million words), the first balanced and tagged Chinese corpus.
CLP 1996
– Research Institutes
10th Anniversary of the Institute of Computational Linguistics at Peking University
10th Anniversary of the Chinese Knowledge Information Processing Group at Academia Sinica
–Anthology of Papers
Readings in Chinese Natural Language Processing (Journal of Chinese Linguistics Monograph)
Editors: Huang, Chen, and T’sou
CLP 1996 November-1997
Sinica Corpus on Web
One of the first fully searchable language corpora on the WWW
http://www.sinica.edu.tw/ftms-bin/kiwi.sh (old webpage in web archives)
http://www.sinica.edu.tw/SinicaCorpus/ (current page)
1997
Publication of the first Chinese dictionary compiled directly from a corpus (Huang et al.’s Mandarin Daily Classifier Dictionary and Noun-Classifier Collocation Dictionary )
The Tenth Annual ROCLING conference
CLP 1998
–KnowledgeNet
Release of HowNet, the first full-fledged Chinese and English-Chinese LKB
http://www.keenage.com/
-Segmentation Standard
Official announcement of CNS14366 for Taiwan
CLP 2000 –Treebanks
Simultaneous completion and announcement of two Chinese Treebanks:
*Penn Chinese Treebank
*Sinica Treebank
ACL Workshop on Chinese Language Processing
CLP 2001-2002
2001 – Society
Formal approval of the formation of ACL SigHAN, the first international organization on Chinese Language Processing
2002
First SigHAN workshop on Chinese Language Processing
Formal launch of Hsieh’s Intelligent Character Encoding System (a sustainable solution to the missing character problem)
COLING2002 in Taipei
CLP 2003 -
2003
The First International Chinese Word Segmentation Bakeoff
http://www.sighan.org/bakeoff2003/
2002-2005
Chinese Proposition Bank
http://www.cis.upenn.edu/~chinese/cpb/
2003,2005,2007
Chinese Gigaword Corpus v.1, v.2, and tagged version
What Did CLP Development Show?
Resources lead
When tools and standards complete a comprehensive infrastructure,
research will bloom
Resources Development
Towards a Sharable and Sustainable Model of Resources Development
OLAC
Open Language Archives Community
http://www.language-archives.org
OLAC Aims
OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by:
developing consensus on best current practice for the digital archiving of language resources;
developing a network of interoperating repositories and services for housing and accessing such resources.
OLAC Organization
Coordinators: Steven Bird & Gary Simons
Council: Anthony Aristar (Linguist List), Christopher Cieri (LDC), Gary Holton (Alaska Native Language Center), Chu-Ren Huang (Academia Sinica), Heidi Johnson (Archive of the Indigenous Languages of Latin America), Laurent Romary (Atilf, University of Nancy), Joan Spanne (SIL), Martin Wynne (Oxford Text Archive)
Participating Archives & Services: 39 archives including LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht, Academia Sinica, TalkBank, Rosetta, MPI
Individual Members: ~120
Types of Language Resource
DATA: any information which documents or describes a language, such as a:
monograph, data file, shoebox of index cards, unanalyzed recordings, heavily annotated texts, complete descriptive grammar
TOOLS: computational resources that facilitate creating, viewing, querying, or otherwise using language data
includes fonts, stylesheets, DTDs, Schemas
ADVICE: any information about reliable data sources, appropriate tools and practices
The Gap
Coordinated Approach
OAI and OLAC
"A shared architectural vision, having many components, and implemented in stages by the community, will bridge the gap"
Analogies: federated databases; semantic web
[Diagram labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES, OLAC SERVICES, USER SERVICES; OLAC PROC, OLAC MHP, MS, OAI, DC; Software, Recommendations, Initiatives, Standards]
The Foundation: 3 initiatives
Dublin Core Metadata Initiative (DC)
founded in 1995 (Dublin, Ohio)
conventions for resource discovery on the web
Open Archives Initiative (OAI)
founded in 1999 (Santa Fe)
interoperability of e-print services
Open Language Archives Community (OLAC)
founded in 2000 (Philadelphia)
a partnership of institutions and individuals
creating a worldwide virtual library of language resources
Foundation 1: DC Elements
15 metadata elements:
broad interdisciplinary consensus
each element is optional and repeatable
applies to digital and traditional formats
Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.
dublincore.org
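
The "optional and repeatable" rule can be made concrete with a small sketch. This is only an illustration, not part of any Dublin Core tooling: the function name `validate_record` and the sample record are hypothetical, and a record is assumed to be a plain dictionary mapping an element name to a list of values.

```python
# The 15 Dublin Core element names, as listed on the slide.
DC_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}


def validate_record(record):
    """Return the element names in `record` that are not Dublin Core.

    `record` maps an element name to a list of values: a missing key
    models "optional", a list with several values models "repeatable".
    """
    return sorted(name for name in record if name not in DC_ELEMENTS)


# Hypothetical record, for illustration only.
record = {
    "Title": ["Example corpus"],
    "Language": ["en"],
    "Creator": ["First Author", "Second Author"],  # a repeated element
}
print(validate_record(record))  # an empty list means all names are valid
```

Because every element is optional, the check only rejects unknown names; it never complains about missing ones.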
Foundation 1: DC Qualifiers
Encoding Schemes:
a controlled vocabulary or notation used to express the value of an element
helps a client system to interpret the element content
e.g. Language = "en" (not "English", "Anglais", ...)
Refinements:
makes the meaning of an element more specific
e.g. Subject.language, Type.linguistic
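
A minimal sketch of the two kinds of qualifier, assuming a tiny hand-written lookup table. `NAME_TO_CODE`, `encode_language` and `refine` are hypothetical names used only for this illustration; a real encoding scheme would be a published controlled vocabulary, not a three-entry dictionary.

```python
# Hypothetical lookup table: free-text language names mapped to the
# two-letter code used in the slide's example ("en").
NAME_TO_CODE = {"english": "en", "anglais": "en", "en": "en"}


def encode_language(value):
    """Return the controlled code for `value`, or None if it is unknown."""
    return NAME_TO_CODE.get(value.strip().lower())


def refine(element, qualifier):
    """Build a refined element name such as "Subject.language"."""
    return f"{element}.{qualifier}"


print(encode_language("English"), encode_language("Anglais"))
print(refine("Subject", "language"))
```

The point of the encoding scheme is that a client only ever has to compare codes, whatever spelling the cataloguer started from.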
Foundation 2: OAI Repository
Foundation 2: OAI Standards
To implement the OAI infrastructure, an archive must comply with two standards:
1. The OAI Shared Metadata Set: Dublin Core
interoperability across all repositories
2. The OAI Metadata Harvesting Protocol
HTTP requests - 6 verbs:
Identify, ListIdentifiers, ListMetadataFormats, ListSets, ListRecords, GetRecord
XML responses
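
As a rough sketch of what "HTTP requests with six verbs" looks like in practice, the snippet below only builds request URLs and does not contact any server. The base URL is a made-up placeholder, `build_request` is a hypothetical helper, and the `metadataPrefix=oai_dc` argument comes from the OAI protocol specification rather than from the slide.

```python
from urllib.parse import urlencode

# The six protocol verbs named on the slide.
OAI_VERBS = {
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
}


def build_request(base_url, verb, **arguments):
    """Return the HTTP GET URL for one harvesting request."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository address, for illustration only.
BASE_URL = "http://archive.example.org/oai"
print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

A harvester would fetch each URL and parse the XML response; that step is left out here to keep the example free of network access.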
Foundation 2: OAI Service Providers and Data Providers
Foundation 3: OLAC & OAI
Recall: OAI data providers must support:
Dublin Core Metadata
OAI Metadata harvesting protocol
BUT: OAI data providers can support:
a more specialized metadata format
a more specialized harvesting protocol
What OLAC does:
specialized metadata for language resources
specialized harvesting (extra validation)
OLAC Standards
Aside:
standards = the protocols and interfaces that allow the community to function
recommendations = "standards" for representing linguistic content
OLAC has three primary standards:
OLACMS: the OLAC Metadata Set (Qualified DC)
OLAC MHP: refinements to the OAI protocol
OLAC Process: a procedure for identifying Best Common Practice Recommendations
The OLAC Metadata Set
The three categories of metadata:
Work language: describes information entities and their intellectual attributes
e.g. names of works and their creators
Document language: describes and provides access to the physical manifestation of information
e.g. format, publisher, date, rights
Subject language: describes what a document is about
e.g. subject, description
OLACMS and Controlled Vocabularies
Language:
A language of the intellectual content of the resource (OLAC-Language)
Subject.language:
A language which the content of the resource describes or discusses (OLAC-Language)
OLAC-Language:
A vocabulary for identifying the language(s) that the data is in, or that a piece of linguistic description is about, or that a particular tool can process
Summary: With the software in place, we have a complete platform
[Diagram labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC PROC, OLAC MHP, MS, OAI, DC; Software, Recommendations, Initiatives, Standards]
Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources
[Diagram labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES; OLAC PROC, OLAC MHP, MS, OAI, DC; Software, Recommendations, Initiatives, Standards]
[Diagram labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES, OLAC SERVICES, USER SERVICES; OLAC PROC, OLAC MHP, MS, OAI, DC; Software, Recommendations, Initiatives, Standards]
Acknowledgements: ISLE and TalkBank projects (NSF), participants of the Philadelphia workshop, Eva Banik (programmer), Hernando de Soto (the analogy)
OLACMS helps archive versatility
Given a shared metadata standard:
New language archives can be created on the fly by harvesting existing archives
Rich information can be inferred by establishing temporal and geographic anchors for each document.
OLAC Infrastructure
Helps to Solve Language Archive Problems such as
Language Identification
and
Metadata Set for Multi-lingual Language Archives
The Language Identification Problem
The DC code (e.g. ‘en’ for English) is not enough to describe all the languages in the world
Ethnologue (http://www.ethnologue.org) is comprehensive but not complete
Potential problems of using Ethnologue (or any existing language list)
over-splitting
over-chunking
omission
A Fundamental Solution to Language Identification Problems
Registering language groups with an OLAC registration service
OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes)
AS:Amis = {ALV, AIS}
ALV= Amis, AIS= Nataoran
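
The idea of an extensional definition can be sketched as a mapping from a user-defined group name to a set of codes. This is an assumption-laden illustration, not the OLAC registration service itself: `LANGUAGE_GROUPS`, `expand` and `matches` are hypothetical names, and the only entry is the Amis example from the slide.

```python
# Extensional definitions: a user-defined group name mapped to a set of
# Ethnologue codes. The single entry is the example from the slide.
LANGUAGE_GROUPS = {"AS:Amis": {"ALV", "AIS"}}


def expand(code_or_group):
    """Return the set of Ethnologue codes a query term stands for."""
    return set(LANGUAGE_GROUPS.get(code_or_group, {code_or_group}))


def matches(record_code, query):
    """True if a record tagged `record_code` falls under `query`."""
    return record_code in expand(query)


print(sorted(expand("AS:Amis")))
print(matches("AIS", "AS:Amis"), matches("ALV", "AIS"))
```

A search for the registered group then retrieves resources tagged with either code, while a search for a single code stays narrow.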
Describing Multi-Lingual Resources in OLACMS
Directionality is crucial in multilingual resources
However, OLAC metadata is flat and unordered
Bi-directional MT
<Language code="X"/>
<Language code="Y"/>
<Subject.language code="X"/>
<Subject.language code="Y"/>
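
To see why flat, unordered metadata cannot express direction, the sketch below builds the four elements for a bi-directional MT resource and compares two orderings. It is only an illustration: the element names follow the slide, while the `record` wrapper element and the helper names `describe` and `as_flat_set` are hypothetical.

```python
import xml.etree.ElementTree as ET


def describe(languages, subject_languages):
    """Build a flat record using the element names shown on the slide."""
    record = ET.Element("record")  # hypothetical wrapper element
    for code in languages:
        ET.SubElement(record, "Language", code=code)
    for code in subject_languages:
        ET.SubElement(record, "Subject.language", code=code)
    return record


def as_flat_set(record):
    """View a record the way unordered metadata does: a set of pairs."""
    return {(child.tag, child.get("code")) for child in record}


# Bi-directional MT between X and Y, written in two different orders.
both_ways = describe(["X", "Y"], ["X", "Y"])
reordered = describe(["Y", "X"], ["Y", "X"])
print(ET.tostring(both_ways, encoding="unicode"))
print(as_flat_set(both_ways) == as_flat_set(reordered))  # True
```

Since the two orderings are indistinguishable as sets, any notion of source and target has to be carried by the element names, not by their position.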
Multi-lingual Resources II
Text: language
Bitext (bilingual aligned corpus)
There is always a directionality
Original: language
Translation: Subject.language
Language Description (Field Notes)
Elicitation, transcription, translation, notes
Multiple related resources
Language Archives Project of Taiwan
Part of the National Digital Archives Project (NDAP)
Pilot Stage 2000-2001
First Phase: 2002-2006
Both Language Archives
And Linguistic Anchor
Language and Digital Archives
[Diagram labels: Where, Historical Maps, Language Changes, Language Variations, Language, When, Digital Archives, How and What]
Digital Archives are Linguistically Anchored
• Archives are anchored with a Lexical KnowledgeBase (LKB)
- because the LKB, as a collection of lexical types instantiated in archives, uniquely defines each archive
- and each lexical item is the conceptual atom projecting knowledge from archive to archive
Multi-anchor Knowledge Linking
Geographical anchor based on GIS (geographic information system)
-Ecology (Fauna, Weather, Geology etc.)
-Socio-Anthropological classification
Linguistic anchor based on LKB
-etymology, language grouping, loan words
Institute of Linguistics
Language Archives
Two branch projects:
1. Chinese Archives (5 sub-projects):
• Early Mandarin Chinese Lexicon
• Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
• Modern Chinese Corpus and Treebank
• New Age Corpus: Linguistic Representations and Archives of Multimedia Data
• Southern-Min Archive: A Database of Historical Change in Language Distribution
2. Formosan Language Archives
Early Mandarin Chinese Lexicon
GOAL:
1. Collect the corpus and the lexicon of the Early Mandarin Chinese period.
2. Provide a systematic knowledge thesaurus as well as a powerful instrument for the study of grammatical development.
Archives Description:
1. Digitization of texts (10,000,000 characters).
2. Tagging of grammatical markers (3,500,000 characters).
3. Construction of the lexical database.
http://www.sinica.edu.tw/Early_Mandarin
Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
Archives Description:
• Digitization of the bronze inscriptions from the Shang to the Eastern Chou dynasties.
• Construction of a typological lexicon of bronze inscriptions and bamboo scripts; accurate encoding and analysis of the bronze inscriptions and Chu scripts.
Achievement:
• Proof-read bronze inscriptions (12,113 pieces).
http://Inscription.sinica.edu.tw
Modern Chinese Corpus and Treebank
Achievement:
Segmented words tagged with their part of speech (10-million-word version in 2006).
Syntactic tree structures: 30,000.
http://www.sinica.edu.tw/SinicaCorpus
http://treebank.sinica.edu.tw
Treebank
New Age Corpus: Linguistic Representations and Archives of Multimedia Data
Archives Description:
1. A multimodal corpus of spoken Mandarin in Taiwan.
2. Collected by means of different designs of tasks and scenarios.
3. Combines written transcripts with digital video and audio processing technology.
New Age Corpus: Linguistic Representations and Archives of Multimedia Data
Achievement:
Transcribed and transformed 11 hours of digital data.
Tagged 5 hours of speech data.
http://mmc.sinica.edu.tw
Southern-Min Archive: A Database of Historical Change in Language Distribution
Archives Description:
1. Approached from the perspectives of historical change and geographical distribution.
2. A tagged corpus of Southern Min written documents from the 16th century to the 20th century.
3. A linguistic Geographical Information System displaying distributions of languages in Hsinfeng.
Formosan Language Archives
Archives Description:
1. Preserve the endangered Formosan Austronesian languages
1.1 corpora, lexicons and grammars
1.2 integration of linguistic information with GIS.
2. Fifteen extant Formosan languages
2.1 Rukai, Yami, Saisiyat, Tsou, Atayal, Bunun, Paiwan, Amis and Puyuma
http://formosan.sinica.edu.tw/
Sinica BOW: Bilingual Ontological Wordnet
To construct a Chinese WordNet as the linguistic ontology for knowledge representation;
To provide linguistic anchoring grounded with temporal information by building a synchronic lexicon for all historical periods; and
To provide linguistic anchoring reference and implementation services.
Asian Language Resources Committee
Mail List: [email protected]
Affiliated with AFNLP
Cataloguing Asian Language Resources
Will adopt OLACMS and search engine
Hosting ALR Workshops (5 so far)
Asian Language Processing Special Issues in Language Resources and Evaluation
Co-Chairs:
Tokunaga [email protected]
Huang [email protected]
http://www.cl.cs.titech.ac.jp/alr/
An overview of the Natural Language Toolkit
http://nltk.sourceforge.net
Project Leaders: Steven Bird, Edward Loper, Ewan Klein
Acknowledgement: I would like to thank Steven Bird for agreeing to let me use these slides on NLTK
Summary
NLTK is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing
Download NLTK from nltk.sourceforge.net
A Truly Multilingual Toolkit accessible to beginning researchers in NLP
A good way to attract international scholars to research on your language
Also a good stepping stone for a developing HLT language to test a full range of NLP applications
Components of NLTK
1. Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, wordnet, ... (50k lines of code)
2. Corpora: 20+ annotated data sets widely used in natural language processing (300Mb data)
3. Documentation: a 360-page book, articles, reviews, API documentation
1. Code
corpus readers
tokenizers
stemmers
taggers
parsers
wordnet
semantic interpretation
clusterers
evaluation metrics
…
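
As a small taste of the tokenizer component, the sketch below counts word frequencies in a sentence. It assumes a current NLTK release, whose module paths may differ from the 2007 version described in these slides, and it falls back to an equivalent regular expression when NLTK is not installed; `word_frequencies` and the sample sentence are made up for this illustration.

```python
import re
from collections import Counter

try:
    # wordpunct_tokenize is regex-based and needs no downloaded data.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback for machines without NLTK: the same word/punctuation split.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def word_frequencies(text):
    """Count lower-cased alphabetic tokens in `text`."""
    tokens = wordpunct_tokenize(text)
    return Counter(t.lower() for t in tokens if t.isalpha())


sample = "Resources lead. Tools and standards follow resources."
freq = word_frequencies(sample)
print(freq.most_common(1))  # [('resources', 2)]
```

The same pattern (tokenize, normalize, count) is the usual first step before trying the taggers or parsers on a new language.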
2. Corpora
Brown Corpus
Carnegie Mellon Pronouncing Dictionary
CoNLL 2000 Chunking Corpus
Project Gutenberg Selections
NIST 1999 Information Extraction: Entity Recognition Corpus
US Presidential Inaugural Address Corpus
Indian Language POS-Tagged Corpus
Prepositional Phrase Attachment Corpus
SENSEVAL 2 Corpus
Sinica Treebank Corpus Sample
Universal Declaration of Human Rights Corpus
Stopwords Corpus
TIMIT Corpus Sample
Treebank Corpus Sample
…
3. Documentation
a 360-page book about natural language processing in Python and NLTK
teaches Python and NLP
provides numerous examples and exercises
installation instructions
presentation slides for some of the book chapters
API Documentation: describes every module, interface, class, and method
Parser demonstrations
Interactive session (WordNet)
Adoption in NLP courses
Amsterdam, Ben-Gurion, Brown, Bryn Mawr, CDAC-Mumbai, Coruña, Edinburgh, Erlangen, Georgetown, Helsinki, IIT-Bombay, Iowa State, Konstanz, MIT, Macquarie, Magdeburg, Malta, Marquette, Melbourne, Nancy, Naval Postgraduate School, Northeastern, Ohio State, Pitt, San Diego State, Simon Fraser, Stanford, Syracuse University, Tsuda College, U Colorado, UC Berkeley, UMass Amherst, UNAM, U Penn, UT Austin, Warsaw
Contribute…
NLTK is an open source project
all code, data, documentation is free
dozens of people have contributed over the past 6 years
please visit the website for project ideas
sign up on the NLTK-Announce mailing list to hear about new releases