
4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007

From Synergy to Knowledge: Integrating multiple language resources

Part I: Language Resources and Tools

Chu-Ren Huang

Academia Sinica

http://cwn.ling.sinica.edu.tw/huang/huang.htm


Outline: Language Resources and Tools

Introduction: 10 Years in Chinese Language Processing, a Mirror for Other Asian Languages

The Starting Point: Resources and Resources Sharing

OLAC: The Open Language Archives Community

Asian Language Resources Committee of AFNLP

Standards: ISO TC37 Language Resources Management

Language Archives Project of Taiwan

Tools: Getting Started in NLP with NLTK


Why Resources and Tools

Language Resources

Foundation and empirical basis of scientific studies of natural languages

The only reliable source for language specific features

Infrastructure for knowledge representation and knowledge engineering

Essential to preserve linguistic and cultural diversity

Tools

Needed to ‘process’

General enough for multilingual processing and cross-lingual comparison

Robust enough to deal with language specific issues


Chinese Language Processing as a Mirror

For the development of Asian Language Processing

Unlike Japanese, which has enjoyed being one of the leaders in technological innovation

The development of Chinese language processing coincides with the developing economies of Taiwan and China

Especially the availability of Chinese-language PCs

Similar to the situation of many Asian languages now


CLP in the Past 10 Years

A review of what happened in the past ten years in Chinese Language Processing (1992-2002), from a somewhat personal perspective

1992 – Corpora

Completion of the first Chinese corpus for linguistic research (Huang and Chen, COLING ’92, pp. 1214-1217)

-untagged, non-segmented

-but searchable


CLP 1992 – 1993

1992 – Segmentation Standard

Announcement of the first national standard for word segmentation by the PRC government.

《GB 13715 信息處理用現代漢語分詞規範》 (Contemporary Chinese Word Segmentation Specification for Information Processing).

1993 – Lexicon

Completion and Release of the first version of CKIP lexicon (with the category set and ICG thematic roles)

First version of K. Chen’s parser for Chinese


CLP Corpus 1994 – 1995

1994

10th anniversary of the automation of Chinese historical textual databases.

Completion of the pre-Qin Classical Chinese corpus at Academia Sinica.

1995

Completion of Sinica Corpus (v. 1.0, 1 million words), the first balanced and tagged Chinese corpus.


CLP 1996

– Research Institutes

10th Anniversary of the Institute of Computational Linguistics at Peking University

10th Anniversary of the Chinese Knowledge Information Processing Group at Academia Sinica

– Anthology of Papers

Readings in Chinese Natural Language Processing (Journal of Chinese Linguistics Monograph)

Editors: Huang, Chen, and T’sou


CLP 1996 November-1997

Sinica Corpus on Web

One of the first fully searchable language corpora on the WWW

http://www.sinica.edu.tw/ftms-bin/kiwi.sh (old webpage in web archives)

http://www.sinica.edu.tw/SinicaCorpus/ (current page)

1997

Publication of the first Chinese dictionary compiled directly from a corpus (Huang et al.’s Mandarin Daily Classifier Dictionary and Noun-Classifier Collocation Dictionary)

The Tenth Annual ROCLING conference


CLP 1998

– KnowledgeNet

Release of HowNet, the first full-fledged Chinese and English-Chinese LKB

http://www.keenage.com/

– Segmentation Standard

Official announcement of CNS14366 for Taiwan


CLP 2000

– Treebanks

Simultaneous completion and announcement of two Chinese Treebanks:

*Penn Chinese Treebank

*Sinica Treebank

ACL Workshop on Chinese Language Processing


CLP 2001-2002

2001 – Society

Formal approval of the formation of ACL SigHAN, the first international organization on Chinese Language Processing

2002

First SigHAN workshop on Chinese Language Processing

Formal launch of Hsieh’s Intelligent Character Encoding System (a sustainable solution to the missing character problem)

COLING2002 in Taipei


CLP 2003 -

2003

The First International Chinese Word Segmentation Bakeoff

http://www.sighan.org/bakeoff2003/

2002-2005

Chinese Proposition Bank

http://www.cis.upenn.edu/~chinese/cpb/

2003,2005,2007

Chinese Gigaword Corpus v.1, v.2, and tagged version


What Did CLP Development Show?

Resources lead

When tools and standards complete a comprehensive infrastructure

Research will bloom


Resources Development

Towards a Sharable and Sustainable Model of Resources Development

OLAC

Open Language Archives Community

http://www.language-archives.org


OLAC Aims

OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by:

developing consensus on best current practice for the digital archiving of language resources;

developing a network of interoperating repositories and services for housing and accessing such resources.


OLAC Organization

Coordinators: Steven Bird & Gary Simons

Council: Anthony Aristar (Linguist List), Christopher Cieri (LDC), Gary Holton (Alaska Native Language Center), Chu-Ren Huang (Academia Sinica), Heidi Johnson (Archive of the Indigenous Languages of Latin America), Laurent Romary (Atilf, University of Nancy), Joan Spanne (SIL), Martin Wynne (Oxford Text Archive)

Participating Archives & Services: 39 archives including LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht, Academia Sinica, TalkBank, Rosetta, MPI

Individual Members: ~120


Types of Language Resource

DATA: any information which documents or describes a language, such as a:

monograph, data file, shoebox of index cards, unanalyzed recordings, heavily annotated texts, complete descriptive grammar

TOOLS: computational resources that facilitate creating, viewing, querying, or otherwise using language data

includes fonts, stylesheets, DTDs, Schemas

ADVICE: any information about:

reliable data sources, appropriate tools and practices


The Gap


Coordinated Approach

OAI, OLAC

"A shared architectural vision, having many components, and implemented in stages by the community, will bridge the gap"

Analogies: federated databases; semantic web


[Diagram: OLAC architecture. Labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES, OLAC SERVICES, USER SERVICES; OLAC PROC, OLAC MHP, OAI, MS, DC; Software, Recommendations, Initiatives, Standards]


The Foundation: 3 initiatives

Dublin Core Metadata Initiative (DC)

founded in 1995 (Dublin, Ohio)

conventions for resource discovery on the web

Open Archives Initiative (OAI)

founded in 1999 (Santa Fe)

interoperability of e-print services

Open Language Archives Community (OLAC)

founded in 2000 (Philadelphia)

a partnership of institutions and individuals

creating a worldwide virtual library of language resources


Foundation 1: DC Elements

15 metadata elements:

broad interdisciplinary consensus

each element is optional and repeatable

applies to digital and traditional formats

Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.

dublincore.org
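The "optional and repeatable" property can be illustrated with a minimal Python sketch. It assumes a record is stored as a list of (element, value) pairs; the function name and the sample values are hypothetical and are not part of Dublin Core itself.

```python
# The 15 Dublin Core elements, as listed on the slide.
DC_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}


def unknown_elements(record):
    """Return the element names in `record` that are not Dublin Core elements.

    `record` is a list of (element, value) pairs, so any element may be
    left out (optional) or appear more than once (repeatable).
    """
    return [name for name, _ in record if name not in DC_ELEMENTS]


# Hypothetical record: Creator is repeated, Publisher is simply absent.
record = [
    ("Title", "Example corpus"),
    ("Creator", "First Author"),
    ("Creator", "Second Author"),
    ("Language", "en"),
]

print(unknown_elements(record))                   # no unknown elements
print(unknown_elements([("Author", "Someone")]))  # "Author" is not a DC element
```

Storing pairs rather than a dictionary keeps repeated elements without any special handling.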


Foundation 1: DC Qualifiers

Encoding Schemes:

a controlled vocabulary or notation used to express the value of an element

helps a client system to interpret the element content

e.g. Language = "en" (not "English", "Anglais", ...)

Refinements:

makes the meaning of an element more specific

e.g. Subject.language, Type.linguistic
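Both kinds of qualifier can be sketched in a few lines of Python. The vocabulary table below is a hypothetical three-entry stand-in for a real encoding scheme, and the function names are illustrative only.

```python
# Hypothetical controlled vocabulary: free-text language names mapped to codes.
LANGUAGE_CODES = {"english": "en", "anglais": "en", "chinese": "zh"}


def encode_language(value):
    """Map a free-text language name to its code, or return None if unknown."""
    key = value.strip().lower()
    if key in LANGUAGE_CODES.values():
        return key  # the value is already a code from the vocabulary
    return LANGUAGE_CODES.get(key)


def split_refinement(name):
    """Split a qualified name such as 'Subject.language' into (element, refinement)."""
    element, _, refinement = name.partition(".")
    return element, refinement or None


print(encode_language("English"), encode_language("Anglais"))
print(split_refinement("Subject.language"))
print(split_refinement("Title"))
```

A client that only understands unqualified Dublin Core can drop the refinement and still use the base element.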


Foundation 2: OAI Repository


Foundation 2: OAI Standards

To implement the OAI infrastructure, an archive must comply with two standards:

1. The OAI Shared Metadata Set

Dublin Core

interoperability across all repositories

2. The OAI Metadata Harvesting Protocol

HTTP requests - 6 verbs:

Identify, ListIdentifiers, ListMetadataFormats, ListSets, ListRecords, GetRecord

XML responses
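As a rough illustration of "HTTP requests with six verbs", the sketch below builds harvesting request URLs in Python without contacting any server. The base URL is a made-up placeholder; the `metadataPrefix=oai_dc` argument follows the later OAI-PMH 2.0 specification and may differ from the protocol version current when these slides were written.

```python
from urllib.parse import urlencode

# The six verbs named on the slide.
OAI_VERBS = {
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
}


def build_request(base_url, verb, **arguments):
    """Build the URL for one harvesting request; reject anything but the six verbs."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository address, used only for illustration.
BASE_URL = "http://archive.example.org/oai"

url = build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
```

A harvester would send such a URL as an HTTP GET and parse the XML response.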


Foundation 2: OAI Service Providers and Data Providers


Foundation 3: OLAC & OAI

Recall: OAI data providers must support:

Dublin Core Metadata

OAI Metadata harvesting protocol

BUT: OAI data providers can support:

a more specialized metadata format

a more specialized harvesting protocol

What OLAC does:

specialized metadata for language resources

specialized harvesting (extra validation)


OLAC Standards

Aside:

standards = the protocols and interfaces that allow the community to function

recommendations = "standards" for representing linguistic content

OLAC has three primary standards:

OLACMS: the OLAC Metadata Set (Qualified DC)

OLAC MHP: refinements to the OAI protocol

OLAC Process: a procedure for identifying Best Common Practice Recommendations


The OLAC Metadata Set

The three categories of metadata:

Work language: describes information entities and their intellectual attributes

e.g. names of works and their creators

Document language: describes and provides access to the physical manifestation of information

e.g. format, publisher, date, rights

Subject language: describes what a document is about

e.g. subject, description


OLACMS and Controlled Vocabularies

Language:

A language of the intellectual content of the resource (OLAC-Language)

Subject.language:

A language which the content of the resource describes or discusses (OLAC-Language)

OLAC-Language:

A vocabulary for identifying the language(s) that the data is in, or that a piece of linguistic description is about, or that a particular tool can process


Summary: With the software in place, we have a complete platform

[Diagram: OLAC architecture. Labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC PROC, OLAC MHP, OAI, MS, DC; Software, Recommendations, Initiatives, Standards]


Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources

[Diagram: OLAC architecture. Labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES; OLAC PROC, OLAC MHP, OAI, MS, DC; Software, Recommendations, Initiatives, Standards]


[Diagram: OLAC architecture. Labels: CREATE, CONVERT, FORMAT, EXPORT, DELIVER; CONTENT, METADATA; OLAC REPOSITORIES, OLAC SERVICES, USER SERVICES; OLAC PROC, OLAC MHP, OAI, MS, DC; Software, Recommendations, Initiatives, Standards]

Acknowledgements: ISLE and TalkBank projects (NSF), participants of the Philadelphia workshop, Eva Banik (programmer), Hernando de Soto (the analogy)


OLACMS helps archive versatility

Given a Shared Metadata Standard

New language archives can be created on the fly by harvesting existing archives

Rich information can be inferred by establishing temporal and geographic anchors for each document.
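A minimal Python sketch of the idea, assuming harvested metadata records are plain dictionaries: a new "virtual" archive is just a filtered view over records harvested from existing archives, which can then be grouped by a temporal anchor. The record contents, identifiers, and function name are hypothetical.

```python
def build_virtual_archive(harvested_records, language_code):
    """Select the harvested records whose Language element includes the code."""
    return [
        record for record in harvested_records
        if language_code in record.get("Language", [])
    ]


# Hypothetical records, as if harvested from two existing archives.
harvested = [
    {"Identifier": "archive-a:1", "Language": ["zh"], "Date": "1995", "Coverage": "Taiwan"},
    {"Identifier": "archive-a:2", "Language": ["en"], "Date": "2000", "Coverage": "Philippines"},
    {"Identifier": "archive-b:7", "Language": ["zh", "en"], "Date": "2003", "Coverage": "Taiwan"},
]

chinese_archive = build_virtual_archive(harvested, "zh")

# Group the selected records by their temporal anchor (the Date element).
by_date = {}
for record in chinese_archive:
    by_date.setdefault(record["Date"], []).append(record["Identifier"])

print([record["Identifier"] for record in chinese_archive])
print(by_date)
```

The same grouping could be done on Coverage to obtain a geographic anchor.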


OLAC Infrastructure

Helps to Solve Language Archive Problems such as

Language Identification

and

Metadata Set for Multi-lingual Language Archives


The Language Identification Problem

The DC code (e.g. ‘en’ for English) is not enough to describe all the languages in the world

Ethnologue (http://www.ethnologue.org) is comprehensive but not complete

Potential problems of using Ethnologue (or any existing language list)

over-splitting

over-chunking

omission


A Fundamental Solution to Language Identification Problems

Registering language groups with an OLAC registration service

An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes)

AS:Amis = {ALV, AIS}

ALV= Amis, AIS= Nataoran
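The extensional definition above can be sketched in Python as a lookup from a user-defined group label to a set of language codes. The registry is an in-memory dictionary standing in for the proposed registration service, and the function names are hypothetical.

```python
# Group label -> set of language codes, using the example from the slide.
LANGUAGE_GROUPS = {"AS:Amis": {"ALV", "AIS"}}

# Code -> language name, also from the slide.
CODE_NAMES = {"ALV": "Amis", "AIS": "Nataoran"}


def expand(label):
    """Return the sorted codes a label stands for; a plain code stands for itself."""
    return sorted(LANGUAGE_GROUPS.get(label, {label}))


def matches(query_label, record_code):
    """True if a record tagged with `record_code` falls under `query_label`."""
    return record_code in LANGUAGE_GROUPS.get(query_label, {query_label})


print(expand("AS:Amis"))
print([CODE_NAMES[code] for code in expand("AS:Amis")])
print(matches("AS:Amis", "AIS"))
```

With such a registry, a search for the group label finds records tagged with either code, which sidesteps disagreements about over-splitting.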


Describing Multi-Lingual Resources in OLACMS

Directionality is crucial in multilingual resources

However, OLAC metadata is flat and unordered

Bi-directional MT

<Language code= X/>

<Language code= Y/>

<Subject.language code= X/>

<Subject.language code= Y/>
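Why flat, unordered metadata loses directionality can be shown with a small Python sketch. It assumes, purely for illustration, that the source language of each translation direction is recorded as Language and the target as Subject.language; the function name and the language placeholders X, Y, Z, W are hypothetical.

```python
def describe(directions):
    """Flatten (source, target) translation directions into unordered metadata.

    Each source language becomes a Language element and each target a
    Subject.language element. The result is a set, so neither the order
    nor the pairing of the elements is kept.
    """
    elements = set()
    for source, target in directions:
        elements.add(("Language", source))
        elements.add(("Subject.language", target))
    return frozenset(elements)


# A bi-directional MT system yields the four elements shown above.
bidirectional = describe([("X", "Y"), ("Y", "X")])
print(sorted(bidirectional))

# Two different resources end up with identical flat metadata.
first = describe([("X", "Y"), ("Z", "W")])
second = describe([("X", "W"), ("Z", "Y")])
print(first == second)
```

Once the pairs are flattened, nothing records which source goes with which target, so the two resources cannot be told apart from their metadata alone.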


Multi-lingual Resources II

Text: language

Bitext (bilingual aligned corpus)

There is always a directionality

Original: language

Translation: Subject.language

Language Description (Field Notes)

Elicitation, transcription, translation, notes

Multiple related resources


Language Archives Project of Taiwan

Part of the National Digital Archives Project (NDAP)

Pilot Stage 2000-2001

First Phase: 2002-2006

Both Language Archives

And Linguistic Anchor


Language and Digital Archives

Where: Historical Maps

Language Changes

Language Variations

Language

When

Digital Archives

How and What


Digital Archives are Linguistically Anchored

• Archives are anchored with a Lexical KnowledgeBase (LKB)

- because the LKB, as a collection of lexical types instantiated in archives, uniquely defines each archive

- and each lexical item is the conceptual atom projecting knowledge from archive to archive


Multi-anchor Knowledge Linking

Geographical anchor based on GIS (geographic information system)

-Ecology (Fauna, Weather, Geology etc.)

-Socio-Anthropological classification

Linguistic anchor based on LKB

-etymology, language grouping, loan words,


Institute of Linguistics

Language Archives


Two branch projects:

1. Chinese Archives -- 5 sub-projects:

• Early-Mandarin Chinese Lexicon

• Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts

• Modern Chinese Corpus and Treebank

• New Age Corpus: Linguistic Representations and Archives of Multimedia Data

• Southern-Min Archive: A Database of Historical Change in Language Distribution

2. Formosan Language Archives


Early-Mandarin Chinese Lexicon

Goal:

1. Collect the corpus and the lexicon of the Early Mandarin Chinese period.

2. Provide a systematic knowledge thesaurus as well as a powerful instrument for the study of grammatical development.

Archives Description:

1. Digitization of texts (10,000,000 characters).

2. Tagging of grammatical markers (3,500,000 characters).

3. Construction of the lexical database.

http://www.sinica.edu.tw/Early_Mandarin


Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts

Archives Description:

• Digitization of the bronze inscriptions from the Shang to the Eastern Chou dynasties.

• Construction of a typological lexicon of bronze inscriptions and bamboo scripts.

• Accurate encoding and analysis of the bronze inscriptions and Chu scripts.

Achievement:

• Proof-read bronze inscriptions (12,113 pieces of bronze inscriptions).

http://Inscription.sinica.edu.tw


Modern Chinese Corpus and Treebank

Achievement:

Segmented words tagged with their part of speech (10-million-word version in 2006).

Syntactic tree structures: 30,000.

http://www.sinica.edu.tw/SinicaCorpus

http://treebank.sinica.edu.tw


Treebank


New Age Corpus: Linguistic Representations and Archives of Multimedia Data

Archives Description:

1. A multimodal corpus of spoken Mandarin in Taiwan.

2. Collected by means of different designs of tasks and scenarios.

3. Combines written transcripts with digital video and audio processing.


New Age Corpus: Linguistic Representations and Archives of Multimedia Data

Achievement:

Transcribed and transformed 11 hours of digital data.

Tagged 5 hours of speech data.

http://mmc.sinica.edu.tw


Southern-Min Archive: A Database of Historical Change in Language Distribution

Archives Description:

1. Built from the perspectives of historical change and geographical distribution.

2. A tagged corpus of Southern Min written documents from the 16th century to the 20th century.

3. A linguistic Geographical Information System displaying distributions of languages in Hsinfeng.


Formosan Language Archives

Archives Description:

1. Preserve the endangered Formosan Austronesian languages

1.1 corpora, lexicons and grammars

1.2 integration of linguistic information with GIS.

2. Fifteen extant Formosan languages

2.1 Rukai, Yami, Saisiyat, Tsou, Atayal, Bunun, Paiwan, Amis and Puyuma

http://formosan.sinica.edu.tw/


Sinica BOW: Bilingual Ontological Wordnet

To construct a Chinese WordNet as the linguistic ontology for knowledge representation;

To provide linguistic anchoring grounded with temporal information by building a synchronic lexicon for all historical periods; and

To provide linguistic anchoring reference and implementation services.


Asian Language Resources Committee

Mail List: [email protected]

Affiliated with AFNLP

Cataloguing Asian Language Resources

Will adopt OLACMS and search engine

Hosting ALR Workshops (5 so far)

Asian Language Processing Special Issues in Language Resources and Evaluation

Co-Chairs:

Tokunaga [email protected]

Huang [email protected]

http://www.cl.cs.titech.ac.jp/alr/


An overview of the Natural Language Toolkit

http://nltk.sourceforge.net

Project Leaders: Steven Bird, Edward Loper, Ewan Klein

Acknowledgement: I would like to thank Steven Bird for agreeing to let me use these slides on NLTK


Summary

NLTK is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing

Download NLTK from nltk.sourceforge.net

A Truly Multilingual Toolkit accessible to beginning researchers in NLP

A good way to attract international scholars to research on your language

Also a good stepping stone for a developing HLT language to test a full range of NLP applications


Components of NLTK

1. Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, wordnet, ... (50k lines of code)

2. Corpora: 20+ annotated data sets widely used in natural language processing (300Mb data)

3. Documentation: a 360-page book, articles, reviews, API documentation


1. Code

corpus readers

tokenizers

stemmers

taggers

parsers

wordnet

semantic interpretation

clusterers

evaluation metrics
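A short sketch of how a few of these components fit together, written against the current NLTK API, which differs from the 2007 release described here. It assumes NLTK is installed; the tokenizer, stemmer, and frequency counter used below need no corpus download. The sample sentence is made up.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Made-up sample text for illustration.
text = "The parsers are parsing sentences. The taggers tag tokens."

# Tokenizer: split the text into word and punctuation tokens.
tokens = wordpunct_tokenize(text)

# Stemmer: reduce each token to its Porter stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Frequency distribution over lowercased alphabetic tokens.
freq = FreqDist(token.lower() for token in tokens if token.isalpha())

print(tokens)
print(stems)
print(freq.most_common(3))
```

The Porter stemmer is English-specific; for another language one would substitute a tokenizer and stemmer suited to it.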


2. Corpora

Brown Corpus

Carnegie Mellon Pronouncing Dictionary

CoNLL 2000 Chunking Corpus

Project Gutenberg Selections

NIST 1999 Information Extraction: Entity Recognition Corpus

US Presidential Inaugural Address Corpus

Indian Language POS-Tagged Corpus

Prepositional Phrase Attachment Corpus

SENSEVAL 2 Corpus

Sinica Treebank Corpus Sample

Universal Declaration of Human Rights Corpus

Stopwords Corpus

TIMIT Corpus Sample

Treebank Corpus Sample


3. Documentation

a 360-page book about natural language processing in Python and NLTK

teaches Python and NLP

provides numerous examples and exercises

installation instructions

presentation slides for some of the book chapters

API Documentation: describes every module, interface, class, and method


Parser demonstrations


Interactive session (WordNet)


Adoption in NLP courses

Amsterdam, Ben-Gurion, Brown, Bryn Mawr, CDAC-Mumbai, Coruña, Edinburgh, Erlangen, Georgetown, Helsinki, IIT-Bombay, Iowa State, Konstanz, MIT, Macquarie, Magdeburg, Malta, Marquette, Melbourne, Nancy, Naval Postgraduate School, Northeastern, Ohio State, Pitt, San Diego State, Simon Fraser, Stanford, Syracuse University, Tsuda College, U Colorado, UC Berkeley, UMass Amherst, UNAM, U Penn, UT Austin, Warsaw


Contribute…

NLTK is an open source project

all code, data, and documentation are free

dozens of people have contributed over the past 6 years

please visit the website for project ideas

sign up on the NLTK-Announce mailing list to hear about new releases