chu-ren huang academia sinica cwn.ling.sinica.tw/huang/huang.htm

4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007

From Synergy to Knowledge: Integrating multiple language resources

Part I: Language Resources and Tools

Chu-Ren Huang

Academia Sinica

http://cwn.ling.sinica.edu.tw/huang/huang.htm




Outline: Language Resources and Tools

Introduction: 10 Years in Chinese Language Processing - A Mirror for Other Asian Languages

The Starting Point: Resources and Resources Sharing

OLAC: The Open Language Archives Community

Asian Language Resources Committee of AFNLP

Standards: ISO TC37 Language Resources Management

Language Archives Project of Taiwan

Tools: Getting Started in NLP with NLTK


Why Resources and Tools

Language Resources

Foundation and empirical basis of scientific studies of natural languages

The only reliable source for language specific features

Infrastructure for knowledge representation and knowledge engineering

Essential to preserve linguistic and cultural diversity

Tools

Needed to ‘process’

General enough for multilingual processing and cross-lingual comparison

Robust enough to deal with language specific issues


Chinese Language Processing as a Mirror

For the development of Asian Language Processing

Unlike Japanese, which has enjoyed being one of the leaders in technological innovation

The development of Chinese language processing coincides with the developing economies of Taiwan and China

Especially the availability of Chinese language PCs

Similar to the situation of many Asian languages now


CLP in the past 10 years

A review of what happened in the past ten years in Chinese Language Processing (1992-2002)

from a somewhat personal perspective

1992 – Corpora

Completion of the first Chinese corpus for linguistic research (Huang and Chen, COLING ’92, pp. 1214-1217)

-untagged, non-segmented

-but searchable


CLP 1992 – 1993

1992 – Segmentation Standard

Announcement of the first national standard for word segmentation by the PRC government.

GB 13715, Contemporary Chinese Word Segmentation Specification for Information Processing (《信息處理用現代漢語分詞規範》).

1993 – Lexicon

Completion and Release of the first version of CKIP lexicon (with the category set and ICG thematic roles)

First version of K. Chen’s parser for Chinese


CLP Corpus 1994 – 1995

1994

10th anniversary of the automation of Chinese historical textual databases.

Completion of the pre-Qin Classic Chinese corpus at Academia Sinica.

1995

Completion of Sinica Corpus (v. 1.0, 1 million words), the first balanced and tagged Chinese corpus.


CLP 1996

– Research Institutes

10th Anniversary of the Institute of Computational Linguistics at Peking University

10th Anniversary of the Chinese Knowledge Information Processing Group at Academia Sinica

– Anthology of Papers

Readings in Chinese Natural Language Processing (Journal of Chinese Linguistics Monograph)

Editors: Huang, Chen, and T’sou


CLP 1996 November-1997

Sinica Corpus on Web

One of the first fully searchable language corpora on the WWW

http://www.sinica.edu.tw/ftms-bin/kiwi.sh (old webpage in web archives)

http://www.sinica.edu.tw/SinicaCorpus/ (current page)

1997

Publication of the first Chinese dictionary compiled directly from a corpus (Huang et al.’s Mandarin Daily Classifier Dictionary and Noun-Classifier Collocation Dictionary)

The Tenth Annual ROCLING conference



CLP 1998

– KnowledgeNet

Release of HowNet, the first full-fledged Chinese and English-Chinese LKB

http://www.keenage.com/

– Segmentation Standard

Official announcement of CNS14366 for Taiwan



CLP 2000

– Treebanks

Simultaneous completion and announcement of two Chinese Treebanks:

*Penn Chinese Treebank

*Sinica Treebank

ACL Workshop on Chinese Language Processing


CLP 2001-2002

2001 – Society

Formal approval of the formation of ACL SigHAN, the first international organization on Chinese Language Processing

2002

First SigHAN workshop on Chinese Language Processing

Formal launch of Hsieh’s Intelligent Character Encoding System (a sustainable solution to the missing character problem)

COLING2002 in Taipei


CLP 2003 -

2003

The First International Chinese Word Segmentation Bakeoff

http://www.sighan.org/bakeoff2003/

2002-2005

Chinese Proposition Bank

http://www.cis.upenn.edu/~chinese/cpb/

2003,2005,2007

Chinese Gigaword Corpus v.1., v.2, and tagged version



What CLP Development Showed

Resources lead

When tools and standards complete a comprehensive infrastructure

Research will bloom


Resources Development

Towards a Sharable and Sustainable Model of Resources Development

OLAC

Open Language Archives Community

http://www.language-archives.org


OLAC Aims

OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by:

developing consensus on best current practice for the digital archiving of language resources;

developing a network of interoperating repositories and services for housing and accessing such resources.


OLAC Organization

Coordinators: Steven Bird & Gary Simons

Council: Anthony Aristar (Linguist List), Christopher Cieri (LDC), Gary Holton (Alaska Native Language Center), Chu-Ren Huang (Academia Sinica), Heidi Johnson (Archive of the Indigenous Languages of Latin America), Laurent Romary (Atilf, University of Nancy), Joan Spanne (SIL), Martin Wynne (Oxford Text Archive)

Participating Archives & Services: 39 archives including LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht, Academia Sinica, TalkBank, Rosetta, MPI

Individual Members: ~120


Types of Language Resource

DATA: any information which documents or describes a language, such as a:

monograph, data file, shoebox of index cards, unanalyzed recordings, heavily annotated texts, complete descriptive grammar

TOOLS: computational resources that facilitate creating, viewing, querying, or otherwise using language data

includes fonts, stylesheets, DTDs, Schemas

ADVICE: any information about reliable data sources, appropriate tools and practices


The Gap


Coordinated Approach

OAI, OLAC

"A shared architectural vision, having many components, and implemented in stages by the community, will bridge the gap"

Analogies: federated databases; semantic web


[Figure: OLAC architecture diagram. Labels: Create, Convert, Format, Export, Deliver; Content, Metadata; OLAC Repositories, OLAC Services, User Services; OLAC PROC, OLAC MHP, OAI, MS, DC; Software, Recommendations, Initiatives, Standards.]


The Foundation: 3 initiatives

Dublin Core Metadata Initiative (DC)

founded in 1995 (Dublin, Ohio)

conventions for resource discovery on the web

Open Archives Initiative (OAI)

founded in 1999 (Santa Fe)

interoperability of e-print services

Open Language Archives Community (OLAC)

founded in 2000 (Philadelphia)

a partnership of institutions and individuals

creating a worldwide virtual library of language resources


Foundation 1: DC Elements

15 metadata elements:

broad interdisciplinary consensus

each element is optional and repeatable

applies to digital and traditional formats

Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.

dublincore.org
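The properties listed on this slide can be sketched in a few lines of Python. Because each of the 15 elements is optional and repeatable, a record is modelled here as a mapping from element name to a list of values. The record contents and the helper name `check_record` are hypothetical, chosen only for illustration.

```python
# The 15 Dublin Core elements named on the slide.
DC_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}


def check_record(record):
    """Return the element names in `record` that are not Dublin Core elements."""
    return sorted(name for name in record if name not in DC_ELEMENTS)


# Hypothetical record: elements may be left out (optional)
# or carry several values (repeatable).
record = {
    "Title": ["Example word list"],
    "Creator": ["A. Researcher", "B. Researcher"],
    "Language": ["en"],
}

print(check_record(record))                          # no unknown elements
print(check_record({"Author": ["A. Researcher"]}))   # "Author" is not a DC element
```

A list per element keeps repeated values (two creators, several languages) without needing a special case.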


Foundation 1: DC Qualifiers

Encoding Schemes:

a controlled vocabulary or notation used to express the value of an element

helps a client system to interpret the element content

e.g. Language = "en" (not "English", "Anglais", ...)

Refinements:

makes the meaning of an element more specific

e.g. Subject.language, Type.linguistic
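The two kinds of qualifier can be illustrated with a minimal Python sketch. The small vocabulary table and the function names `encode_language` and `base_element` are assumptions made for this example; they are not part of Dublin Core or OLAC.

```python
# Hypothetical controlled vocabulary: several free-text names map to one code.
LANGUAGE_CODES = {"english": "en", "anglais": "en", "en": "en"}


def encode_language(value):
    """Encoding scheme: map a free-text language name to its controlled code."""
    key = value.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"no code for language value: {value!r}")
    return LANGUAGE_CODES[key]


def base_element(qualified_name):
    """Refinement: return the unrefined element of a name like 'Subject.language'."""
    return qualified_name.split(".", 1)[0]


print(encode_language("Anglais"))        # the controlled value a client can interpret
print(base_element("Subject.language"))  # the element that the refinement narrows
```

A client that does not understand a refinement can fall back to the base element, which is why the refined name keeps the element as its prefix.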


Foundation 2: OAI Repository


Foundation 2: OAI Standards

To implement the OAI infrastructure, an archive must comply with two standards:

1. The OAI Shared Metadata Set: Dublin Core

interoperability across all repositories

2. The OAI Metadata Harvesting Protocol

HTTP requests - 6 verbs:

Identify, ListIdentifiers, ListMetadataFormats, ListSets, ListRecords, GetRecord

XML responses
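Since the protocol consists of HTTP requests built from six verbs, a harvesting request can be sketched as a URL builder. This is a minimal illustration, not a harvester: the base URL is a placeholder, the function name `build_request` is hypothetical, and `metadataPrefix=oai_dc` is the protocol's argument for requesting plain Dublin Core.

```python
from urllib.parse import urlencode

# The six verbs named on the slide.
OAI_VERBS = {
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
}


def build_request(base_url, verb, **arguments):
    """Build the URL of one harvesting request; the response would be XML."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = {"verb": verb, **arguments}
    return base_url + "?" + urlencode(query)


# Placeholder repository address, for illustration only.
BASE_URL = "http://archive.example.org/oai"

print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

Fetching the URL and parsing the XML response are left out, so the sketch runs without network access.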


Foundation 2: OAI Service Providers and Data Providers


Foundation 3: OLAC & OAI

Recall: OAI data providers must support:

Dublin Core Metadata

OAI Metadata harvesting protocol

BUT: OAI data providers can support:

a more specialized metadata format

a more specialized harvesting protocol

What OLAC does:

specialized metadata for language resources

specialized harvesting (extra validation)


OLAC Standards

Aside:

standards = the protocols and interfaces that allow the community to function

recommendations = "standards" for representing linguistic content

OLAC has three primary standards:

OLACMS: the OLAC Metadata Set (Qualified DC)

OLAC MHP: refinements to the OAI protocol

OLAC Process: a procedure for identifying Best Common Practice Recommendations


The OLAC Metadata Set

The three categories of metadata:

Work language: describes information entities and their intellectual attributes

e.g. names of works and their creators

Document language: describes and provides access to the physical manifestation of information

e.g. format, publisher, date, rights

Subject language: describes what a document is about

e.g. subject, description


OLACMS and Controlled Vocabularies

Language:

A language of the intellectual content of the resource (OLAC-Language)

Subject.language:

A language which the content of the resource describes or discusses (OLAC-Language)

OLAC-Language:

A vocabulary for identifying the language(s) that the data is in, or that a piece of linguistic description is about, or that a particular tool can process



Summary: With the software in place, we have a complete platform






Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources









Acknowledgements: ISLE and TalkBank projects (NSF), participants of the Philadelphia workshop, Eva Banik (programmer), Hernando de Soto (the analogy)


OLACMS helps archive versatility

Given a shared metadata standard

New language archives can be created on the fly by harvesting existing archives

Rich information can be inferred by establishing temporal and geographic anchors for each document.
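One way to picture an archive created "on the fly" is as a filter over records already harvested from several archives. The sketch below assumes records are dictionaries keyed by metadata element; the archive names, identifiers, and the function name `virtual_archive` are hypothetical.

```python
def virtual_archive(harvested, language_code):
    """Collect the records in one language from several harvested archives."""
    selected = {}
    for archive_name, records in harvested.items():
        for record in records:
            if language_code in record.get("Language", []):
                # Key on the identifier so a record harvested twice appears once.
                selected.setdefault(
                    record["Identifier"], {**record, "archive": archive_name}
                )
    return list(selected.values())


# Hypothetical harvest results from two archives sharing one metadata standard.
harvested = {
    "archive-a": [
        {"Identifier": "a-1", "Language": ["ALV"]},
        {"Identifier": "a-2", "Language": ["en"]},
    ],
    "archive-b": [
        {"Identifier": "b-1", "Language": ["ALV", "en"]},
    ],
}

amis_archive = virtual_archive(harvested, "ALV")
print([record["Identifier"] for record in amis_archive])
```

The filter only works because every archive uses the same element names and the same language codes, which is the point of the shared standard.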


OLAC Infrastructure

Helps to Solve Language Archive Problems such as

Language Identification

and

Metadata Set for Multi-lingual Language Archives


The Language Identification Problem

The DC code (e.g. ‘en’ for English) is not enough to describe all the languages in the world

Ethnologue (http://www.ethnologue.org) is comprehensive but not complete

Potential problems of using Ethnologue (or any existing language list)

over-splitting

over-chunking

omission



A Fundamental Solution to Language Identification Problems

Registering language groups with an OLAC registration service

An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes)

AS:Amis = {ALV, AIS}

ALV= Amis, AIS= Nataoran
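The extensional definition above can be sketched as a lookup table from a user-defined group name to a set of codes. The group and codes are the ones on the slide; the table layout and the function names `expand_group` and `in_group` are assumptions for illustration, not an OLAC interface.

```python
# User-defined group name -> set of Ethnologue codes (from the slide).
LANGUAGE_GROUPS = {"AS:Amis": {"ALV", "AIS"}}

# Code -> language name (from the slide).
CODE_NAMES = {"ALV": "Amis", "AIS": "Nataoran"}


def expand_group(group_name):
    """Return the set of codes that defines a registered language group."""
    return set(LANGUAGE_GROUPS[group_name])


def in_group(code, group_name):
    """Check whether a resource tagged with `code` belongs to the group."""
    return code in LANGUAGE_GROUPS.get(group_name, set())


for code in sorted(expand_group("AS:Amis")):
    print(code, CODE_NAMES[code])
```

Because a group is just a set of existing codes, a registry of this kind can regroup languages that a list has split too finely without changing the list itself.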


Describing Multi-Lingual Resources in OLACMS

Directionality is crucial in multilingual resources

However, OLAC metadata is flat and unordered

Bi-directional MT

<Language code="X"/>

<Language code="Y"/>

<Subject.language code="X"/>

<Subject.language code="Y"/>


Multi-lingual Resources II

Text: language

Bitext (bilingual aligned corpus): there is always a directionality

Original: language

Translation: Subject.language

Language Description (Field Notes)

Elicitation, transcription, translation, notes

Multiple related resources
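The conventions on these two slides can be sketched as functions that return the metadata elements for each resource type. Since the metadata is flat and unordered, direction is carried by the choice of element rather than by position. The function names are hypothetical, and the rendered strings only mimic the schematic markup used on the slide.

```python
def describe_bitext(original, translation):
    """Bitext: the original is Language, the translation is Subject.language."""
    return [("Language", original), ("Subject.language", translation)]


def describe_bidirectional_mt(x, y):
    """Bi-directional MT: both languages appear under both elements."""
    return [
        ("Language", x), ("Language", y),
        ("Subject.language", x), ("Subject.language", y),
    ]


def render(pairs):
    """Render (element, code) pairs in the schematic markup of the slide."""
    return [f'<{element} code="{code}"/>' for element, code in pairs]


for line in render(describe_bitext("X", "Y")):
    print(line)
```

Swapping the arguments of `describe_bitext` yields a different description, which is how the direction of translation survives in an unordered record.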


Language Archives Project of Taiwan

Part of the National Digital Archives Project (NDAP)

Pilot Stage 2000-2001

First Phase: 2002-2006

Both Language Archives and Linguistic Anchor


Language and Digital Archives

Where Historical Maps

Language Changes

Language Variations

Language

When

Digital Archives

How and What


Digital Archives are Linguistically Anchored

• Archives are anchored with a Lexical KnowledgeBase (LKB)

- because the LKB, as a collection of lexical types instantiated in archives, uniquely defines each archive

- and each lexical item is the conceptual atom projecting knowledge from archive to archive


Multi-anchor Knowledge Linking

Geographical anchor based on GIS (geographic information system)

-Ecology (Fauna, Weather, Geology etc.)

-Socio-Anthropological classification

Linguistic anchor based on LKB

-etymology, language grouping, loan words


Institute of Linguistics

Language Archives


Two branch projects:

1. Chinese Archives -- 5 sub-projects:

• Early Mandarin Chinese Lexicon

• Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts

• Modern Chinese Corpus and Treebank

• New Age Corpus: Linguistic Representations and Archives of Multimedia Data

• Southern-Min Archive: A Database of Historical Change in Language Distribution

2. Formosan Language Archives


Early Mandarin Chinese Lexicon

Goal:

1. Collect a corpus and a lexicon of the Early Mandarin Chinese period.

2. Provide a systematic knowledge thesaurus as well as a powerful instrument for the study of grammatical development.

Archives Description:

1. Digitization of texts (10,000,000 characters).

2. Tagging of grammatical markers (3,500,000 characters).

3. Construction of the lexical database.

http://www.sinica.edu.tw/Early_Mandarin


Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts

Archives Description:

• Digitization of the bronze inscriptions from the Shang to the Eastern Chou dynasties.

• Construction of a typological lexicon of bronze inscriptions and bamboo scripts, with accurate encoding and analysis of the bronze inscriptions and Chu scripts.

Achievement:

• Proof-read bronze inscriptions (12,113 pieces).

http://inscription.sinica.edu.tw/


Modern Chinese Corpus and Treebank

Achievement:

Segmented words tagged with their part of speech (10-million-word version in 2006).

Syntactic tree structures: 30,000.

http://www.sinica.edu.tw/SinicaCorpus

http://treebank.sinica.edu.tw


Treebank



New Age Corpus: Linguistic Representations and Archives of Multimedia Data

1. A multimodal corpus of spoken Mandarin in Taiwan.

2. Collected by means of different task and scenario designs.

3. Combines written transcripts with digital video and audio processing.


New Age Corpus: Linguistic Representations and Archives of Multimedia Data

Achievement:

Transcribed and converted 11 hours of digital data.

Tagged 5 hours of speech data.

http://mmc.sinica.edu.tw



Southern-Min Archive: A Database of Historical Change in Language Distribution

1. Approached from the perspectives of historical change and geographical distribution.

2. A tagged corpus of Southern Min written documents from the 16th century to the 20th century.

3. A linguistic Geographic Information System (GIS) displaying the distribution of languages in Hsinfeng.



Formosan Language Archives

1. Preserve the endangered Formosan Austronesian languages

1.1 corpora, lexicons and grammars

1.2 integration of linguistic information with GIS.

2. Fifteen extant Formosan languages

2.1 Rukai, Yami, Saisiyat, Tsou, Atayal, Bunun, Paiwan, Amis and Puyuma

http://formosan.sinica.edu.tw/


Sinica BOW: Bilingual Ontological Wordnet

To construct a Chinese WordNet as the linguistic ontology for knowledge representation;

To provide linguistic anchoring grounded with temporal information by building a synchronic lexicon for all historical periods; and

To provide linguistic anchoring reference and implementation services.


Asian Language Resources Committee

Mail List: [email protected]

Affiliated with AFNLP

Cataloguing Asian Language Resources

Will adopt OLACMS and search engine

Hosting ALR Workshops (5 so far)

Asian Language Processing Special Issues in Language Resources and Evaluation

Co-Chairs: Tokunaga [email protected]

Huang [email protected]

http://www.cl.cs.titech.ac.jp/alr/


An overview of the Natural Language Toolkit

http://nltk.sourceforge.net

Project Leaders: Steven Bird, Edward Loper, Ewan Klein

Acknowledgement: I would like to thank Steven Bird for agreeing to let me use these slides on NLTK

http://www.csse.unimelb.edu.au/~sb/

http://www.cis.upenn.edu/~edloper/

http://www.ltg.ed.ac.uk/~ewan/


Summary

NLTK is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing

Download NLTK from nltk.sourceforge.net

A Truly Multilingual Toolkit accessible to beginning researchers in NLP

A good way to attract international scholars to research on your language

Also a good stepping stone for a developing HLT language to test a full range of NLP applications


Components of NLTK

1. Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, wordnet, ... (50k lines of code)

2. Corpora: 20+ annotated data sets widely used in natural language processing (300Mb data)

3. Documentation: a 360-page book, articles, reviews, API documentation


1. Code

corpus readers

tokenizers

stemmers

taggers

parsers

wordnet

semantic interpretation

clusterers

evaluation metrics

…
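A short sketch of how a few of these components fit together, assuming a current NLTK 3.x installation (`pip install nltk`), whose API differs from the 2007 release described in these slides. Only components that need no corpus download are used; the sample sentence is made up for illustration.

```python
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Made-up sample text, for illustration only.
text = "Resources lead. When tools and standards complete the infrastructure, research will bloom."

# Tokenizer: split the text into word tokens and lowercase them.
tokenizer = RegexpTokenizer(r"\w+")
tokens = [token.lower() for token in tokenizer.tokenize(text)]

# Stemmer: reduce each token to a stem (English-specific Porter algorithm).
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Frequency distribution: count how often each token occurs.
freq = FreqDist(tokens)

print(tokens[:3])
print(stems[:3])
print(freq.most_common(3))
```

The tokenizer and the frequency count are language-neutral, while the Porter stemmer is specific to English; a different language would substitute its own stemmer or segmenter at that step.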


2. Corpora

Brown Corpus

Carnegie Mellon Pronouncing Dictionary

CoNLL 2000 Chunking Corpus

Project Gutenberg Selections

NIST 1999 Information Extraction: Entity Recognition Corpus

US Presidential Inaugural Address Corpus

Indian Language POS-Tagged Corpus

Prepositional Phrase Attachment Corpus

SENSEVAL 2 Corpus

Sinica Treebank Corpus Sample

Universal Declaration of Human Rights Corpus

Stopwords Corpus

TIMIT Corpus Sample

Treebank Corpus Sample

…


3. Documentation

a 360-page book about natural language processing in Python and NLTK

teaches Python and NLP

provides numerous examples and exercises

installation instructions

presentation slides for some of the book chapters

API Documentation: describes every module, interface, class, and method


Parser demonstrations


Interactive session (WordNet)


Adoption in NLP courses

Amsterdam, Ben-Gurion, Brown, Bryn Mawr, CDAC-Mumbai, Coruña, Edinburgh, Erlangen, Georgetown, Helsinki, IIT-Bombay, Iowa State, Konstanz, MIT, Macquarie, Magdeburg, Malta, Marquette, Melbourne, Nancy, Naval Postgraduate School, Northeastern, Ohio State, Pitt, San Diego State, Simon Fraser, Stanford, Syracuse University, Tsuda College, U Colorado, UC Berkeley, UMass Amherst, UNAM, U Penn, UT Austin, Warsaw


Contribute…

NLTK is an open source project

all code, data, and documentation are free

dozens of people have contributed over the past 6 years

please visit the website for project ideas

sign up on the NLTK-Announce mailing list to hear about new releases
