4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007

From Synergy to Knowledge: Integrating multiple language resources

Part II: Creating Synergy and Multi-functionality of Language Resources

Chu-Ren Huang

Academia Sinica

http://cwn.ling.sinica.edu.tw/huang/huang.htm

Outline

o From Language Resources to Language Technology

o A word’s company

o Classical Paradigm of Language Resource Development

o A new paradigm: Integrating Multiple Language resources

o Introduction: CGW Corpus

o Chinese WordSketch: Integrating multiple resources

o Wen-Guo: Merging different resources to create new synergy

From Language Resources to Language Technology

Language Modeling and Knowledge Generation: How do we acquire linguistic models and/or generalizations from language resources?

Sharability: Can two or more resources be combined to create bigger and better resources?

Re-usability: Can a resource be used for a different purpose than what it is designed for?

A word’s company: Corpus KeyWord In Context (KWIC) and the color pen

1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...

The coloured pens method from Kilgarriff et al. 2005

1 arity, which will be used to take a party of under-privileged children to D
2 from outside. You are invited to a party and after a couple of drinks you d
3 tion, we believe politicians of all parties will listen to our views. &equo
4 ould be reaching agreement with all parties concerned, as to which events,
5 lack people. I have certainly been party to one or two discussions amongst
6 . These should be discussed by both parties before entering into the relatio
7 presents They had hosted a cocktail party at Kensington palace, for example
8 akes. By midnight the end-of-course party is in full swing, but most cadet
9 e should be a right for the injured party to terminate the contract. A mana
10 by the Safran Peoples ' Liberation Party. This presents the powerful neigh
11 s. Ahead I could see the rest of my party plodding towards the final slope t
12 cial ethic. The two main political parties - the Tories and the Liberals -
13 ritish successes in Perth The small party of British players competing in th
14 to help control. One member of the party went to summon the rescue team and
15 rket society fashion magazine. The party was held at his flat which was a l
16 security and secrecy than any Tory Party Conference : it seems that bootleg
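A KWIC display of this kind can be produced with very little code. The following is a minimal Python sketch, not the tool used for the slide; the function name `kwic`, the `width` parameter, and the sample sentences are assumptions made for illustration.

```python
import re

def kwic(text, keyword_pattern, width=30):
    """Return one keyword-in-context line per match of keyword_pattern."""
    lines = []
    for m in re.finditer(keyword_pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
    return lines

# Illustrative sample text (not taken from the BNC).
sample = (
    "You are invited to a party and after a couple of drinks you relax. "
    "We believe politicians of all parties will listen to our views."
)
rows = kwic(sample, r"\bpart(?:y|ies)\b", width=20)
for row in rows:
    print(row)
```

Each output line keeps the keyword in the same column, which is what lets a lexicographer scan the contexts and mark the senses by colour.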

A Word’s Company Automatically Detected: WordSketch with BNC Data

Sketch Engine and Chinese WordSketch

Sketch Engine http://www.sketchengine.co.uk

Developed by team led by Adam Kilgarriff

A new corpus viewing tool

Discovering grammatical information from a gigantic corpus

Chinese Wordsketch by Academia Sinica

http://www.ling.sinica.edu.tw/wordsketch (for Taiwan only)

Academia Sinica, Taiwan (Huang, Smith, Ma, Simon 黃居仁,史尚明,馬偉雲,石穆 )

Classical Paradigm of Language Resource Development

Data Collection and Preparation:

Design Criteria : by human

Data collection : executed or supervised by human

digitization : input and/or proofreading by human

Knowledge Enrichment: tagging and structural annotation

Knowledge source : by human

Representational standard and annotation : by human

Quality and speed of human labor become the bottleneck of language resource development

Current Challenges to Corpus and Language Resource Research

Corpus size is too small : Disambiguation

Collocation

Grammatical functions and other dependencies

These usually require a corpus of 100 million words or more to yield significant distributional information.

Resource development is slow and tedious

Semantic Role Tagging

POS tagging post-processing

Estimating Corpus Scale for Automatic Extraction of Linguistic Knowledge

How many events do we need to automatically establish a reliable description of a word from a corpus?

Grammatical Information based on Word-word Collocation

V + N: 「開立」 (issue) + 「發票」 (invoice)

A + N: 「不實」 (false) + 「發票」 (invoice)

Collocational information between any two given mid-frequency words (frequency rank 10,000 or above)

that occur within a 10-word window of the keyword (5 before and 5 after)

requires a corpus of 1 billion words or more
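The 10-word window (5 before and 5 after the keyword) can be made concrete with a short counting routine. This is a minimal Python sketch under stated assumptions: the function name, its parameters, and the segmented toy sentence are hypothetical, and real counts would come from a segmented corpus.

```python
from collections import Counter

def window_cooccurrences(tokens, keyword, before=5, after=5):
    """Count tokens that occur within the window around each keyword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo = max(0, i - before)
        hi = min(len(tokens), i + after + 1)
        for j in range(lo, hi):
            if j != i:  # skip the keyword occurrence itself
                counts[tokens[j]] += 1
    return counts

# Hypothetical pre-segmented toy sentence built around the slide's examples.
tokens = ["公司", "開立", "發票", "後", "查獲", "不實", "發票"]
wide = window_cooccurrences(tokens, "發票")
narrow = window_cooccurrences(tokens, "發票", before=1, after=1)
print(wide.most_common(3))
print(narrow)
```

With only a handful of keyword hits the counts are too sparse to separate real collocates from noise, which is the motivation for the corpus sizes quoted above.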

Classical Chinese Corpora: Million Word Scale

Corpus Name           Online Year   Data                                            Duration/Content
Sinica 4.0 (Taiwan)   1996          5.2 M words; 7.9 M characters                   1990-1996; Fully Tagged
Sinica 5.0 (Taiwan)   2006          10 M words                                      1990-2004; Fully Tagged
Sinorama (Taiwan)     2003          3.2 M English words; 5.3 M Chinese characters   1976-2000 (1999-2000); Aligned
CCL (Peking)          2003          85 M simplified characters                      1919-2003; Partially tagged (1 million)

M = million

A new paradigm: Integrating Multiple Language Resources

From Synergy to Knowledge

Integrate multiple existing (language) resources to create new resource

Allow resources to scale up beyond existing resources,

Generate new knowledge which does not exist in any individual resource

General methodology (without too much additional manual work):

merging existing, similarly annotated resources, or

creating an overall conceptual framework for different knowledge/language resources to be integrated

Automatically

From Synergy to Knowledge

When A and B have synergy, we say in Chinese that A and B bring out the advantages of each other

Knowledge is what we know about the world, either descriptive or explanatory

Knowledge cannot be created from nothing; it comes from

Keen observation of facts

Sharp reasoning when we put two or more facts together

Different language resources can be put together to

Facilitate observation of facts, and

Create an environment where different linguistic facts can be more easily associated (for knowledge discovery)

Synergy: Integrating different types of language resources

Research based on Chinese Gigaword Corpus

Chinese Gigaword Corpus: Introduction

Implementation of fully automatic corpus tagging

Word Sketch Engine: Introduction

Chinese Word Sketch

Integrating corpus program with

lexico-grammatical information

Introduction: CGW Corpus I

Chinese Gigaword Second Edition (2005)

Produced and released by Linguistic Data Consortium (LDC) in 2003 (first edition).

Newswire text data in Chinese.

Second edition contains additional data collected after the publication of the first edition.

Three distinct international sources :

Central News Agency of Taiwan

Xinhua News Agency of Beijing

Zaobao Newspaper of Singapore

Introduction: CGW Corpus II

                        CNA                     Xinhua                  Zaobao
First Edition           1991-2002               1990-2002               -
New in Second Edition   Oct. 2002 - Dec. 2004   Jan. 2003 - Dec. 2004   Oct. 2000 - Sep. 2003

Table 1. Coverage of Chinese GigaWord Corpus

Introduction: CGW Corpus III

Markup Structure

All text data are presented in SGML form, using a very simple, minimal markup structure.

<DOC id="CNA19910101.0003" type="story">
<HEADLINE>捷運局對工程噪音採多項防治措施</HEADLINE>
<DATELINE>( 中央社台北一日電 )</DATELINE>
<TEXT>
<P>台北都會區捷運工程正處於積極趕工階段 ,…</P>
<P>淡水線工程進度百分之三十六點一九 , 落後百分之二點六七 ,…</P>
</TEXT>
</DOC>
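Because the markup is this minimal, a document can be pulled apart with regular expressions. The sketch below is an illustration only, assuming well-formed, non-nested tags as in the sample; `parse_docs` and the dictionary keys are hypothetical names, and a production reader would more likely use a proper SGML parser.

```python
import re

sample = (
    '<DOC id="CNA19910101.0003" type="story">'
    "<HEADLINE>捷運局對工程噪音採多項防治措施</HEADLINE>"
    "<DATELINE>( 中央社台北一日電 )</DATELINE>"
    "<TEXT><P>台北都會區捷運工程正處於積極趕工階段 ,…</P>"
    "<P>淡水線工程進度百分之三十六點一九 , 落後百分之二點六七 ,…</P></TEXT></DOC>"
)

DOC_RE = re.compile(r'<DOC id="([^"]+)" type="([^"]+)">(.*?)</DOC>', re.DOTALL)

def parse_docs(sgml):
    """Extract id, type, headline and paragraphs from each DOC element."""
    docs = []
    for doc_id, doc_type, body in DOC_RE.findall(sgml):
        headline = re.search(r"<HEADLINE>(.*?)</HEADLINE>", body, re.DOTALL)
        paragraphs = re.findall(r"<P>(.*?)</P>", body, re.DOTALL)
        docs.append({
            "id": doc_id,
            "type": doc_type,
            "headline": headline.group(1).strip() if headline else None,
            "paragraphs": [p.strip() for p in paragraphs],
        })
    return docs

docs = parse_docs(sample)
print(docs[0]["id"], len(docs[0]["paragraphs"]))
```

In the sample, the document id begins with the source code (CNA) followed by a date-like string, so the source of each story can be read off the id without further markup.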

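Introduction: CGW Corpus IV

Statistics

Edition          Resource   Characters   Words   Documents
First Edition    CNA        735          462     1,649
First Edition    Xinhua     382          252     817
First Edition    TOTAL      1,118        714     2,466
Second Edition   CNA        792          497     1,769
Second Edition   Xinhua     471          310     992
Second Edition   Zaobao     28           18      41
Second Edition   TOTAL      1,291        825     2,803

Table 2. Content of data from each source

Unit: million (characters and words); the document counts appear to be in thousands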

CGW after fully automatic tagging

        Word Type    Word Token
CNA     1,917,093    496,465,879
XIN     1,409,747    305,595,420
ZBN     273,111      18,328,571
Total   2,999,590    820,389,870

 

II. 1. Corpus Preparation: (Almost) Fully Automatic Segmentation and Tagging

Strategy (Ma and Chen 2005): HMM method for POS tagging of words existing in the basic lexicon, and a morpheme-analysis-based method (Tseng and Chen 2002) to predict POS's for new words.

Integrating Language Resources

Sinica lexicon with 80,000 word entries.

A 50,000-word set collected from Sinica Corpus 3.0 (10 million words balanced corpus).

5,000 new words from Xinhua new-words dictionary.

Tagset : Adopting Sinica Tagset as a uniform tagging set.
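The HMM step for in-lexicon words can be illustrated with Viterbi decoding. This is a toy sketch and not the CKIP implementation: the two tags, all probabilities, the `floor` smoothing value, and the function name are assumptions chosen for illustration.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely tag sequence under a first-order HMM, computed in log space."""
    def lp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    # Each column maps tag -> (best log score, back-pointer to previous tag).
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p, (p, t))) for p in tags),
                key=lambda x: x[1],
            )
            col[t] = (score + lp(emit_p, (t, w)), prev)
        best.append(col)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical toy model: VC = transitive verb, Na = common noun.
tags = ["VC", "Na"]
start_p = {"VC": 0.6, "Na": 0.4}
trans_p = {("VC", "Na"): 0.7, ("VC", "VC"): 0.3, ("Na", "VC"): 0.5, ("Na", "Na"): 0.5}
emit_p = {("VC", "吃"): 0.9, ("VC", "飯"): 0.1, ("Na", "吃"): 0.1, ("Na", "飯"): 0.9}
tagged = viterbi(["吃", "飯"], tags, start_p, trans_p, emit_p)
print(tagged)
```

Words missing from the lexicon have no emission estimates, which is where a separate new-word predictor such as the morpheme-analysis-based method fits in.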

Preparation: Implementation

Environment: 2 PCs (2.8 GHz CPU)

Time consumed: over 3 days

Output: 462 million words of CNA; 252 million words of XIN

Ma and Huang 2006 (LREC 2006)

See http://ckipsvr.iis.sinica.edu.tw/ for demo of the CKIP Segmentation program

Preparation: Tagging

Segmented and Tagged Article

<DOC id="CNA19910101.0003" type="story">
<HEADLINE>捷運局 (Nc) 對 (P31) 工程 (Nac) 噪音 (Nad) 採 (VC2) 多 (Neqa) 項 (Nfa) 防治 (VC2) 措施(Nac)</HEADLINE>
<DATELINE>((PARENTHESISCATEGORY) 中央社 (Nca) 台北 (Nca) 一日 (Nd) 電 (VC2) )(PARENTHESISCATEGORY)</DATELINE>
<TEXT>
<P>台北 (Nca) 都會區 (Ncb) 捷運 (Nad) 工程 (Nac) 正 (Dd) 處於 (VJ3) 積極 (VH11) 趕工 (VA4) 階段 (Nac) , (COMMACATEGORY) …</P>
<P>淡水線 (Na) 工程 (Nac) 進度 (Nad) 百分之三十六點一九 (Neqa), (COMMACATEGORY)落後 (VJ1) 百分之二點六七 (Neqa) , (COMMACATEGORY)…</P>
</TEXT>
</DOC>
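The tagged output follows a regular `word (TAG)` layout, so word/tag pairs can be recovered with one regular expression. This is a minimal sketch assuming tags consist only of letters and digits, as in the sample; `parse_tagged` is a hypothetical helper name.

```python
import re

# A run of non-space characters, optional whitespace, then the tag in parentheses.
PAIR_RE = re.compile(r"(\S+?)\s*\(([A-Za-z0-9]+)\)")

def parse_tagged(text):
    """Return (word, tag) pairs from text in the 'word (TAG)' layout."""
    return PAIR_RE.findall(text)

headline = "捷運局 (Nc)  對 (P31)  工程 (Nac)  噪音 (Nad)  採 (VC2)  多 (Neqa)  項 (Nfa)  防治 (VC2)  措施(Nac)"
dateline = "((PARENTHESISCATEGORY)  中央社 (Nca)  台北 (Nca)  一日 (Nd)  電 (VC2)   )(PARENTHESISCATEGORY)"

headline_pairs = parse_tagged(headline)
dateline_pairs = parse_tagged(dateline)
print(headline_pairs[:3])
```

The non-greedy word group is what allows a literal parenthesis to be read as a token of its own in the dateline.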

Summary of Fully Tagged CGW Corpus

Fully segmented and tagged with Sinica tagset by Academia Sinica

Being processed by PKU with their tagset

Potentially the most important source for processing and comparative studies of Mandarin Chinese

Will be available from LDC in 2007.

CWS and Integration of Corpus Search Engine with Lexico-grammatical Information

Overview

A word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behavior.

The Word Sketch Engine, which takes as input a corpus of any language and corresponding grammar patterns, generates word sketches for the words of that language.

We synergize rich lexicon-based grammatical information (ICG, Chen and Huang 1992) with stochastic information.

Word Sketch Engine (Kilgarriff et al.)

Register for trial usage at http://www.sketchengine.co.uk

A Versatile Corpus Viewing and Searching Tool

The Word Sketch Engine, which takes as input a corpus of any language and corresponding grammar patterns, generates word sketches for the words of that language.

Based on pre-defined context-free rules to identify grammatical functions (relations)

Ranked by Saliency: frequency adjusted MI (based on Dekang Lin’s definition of Pair-wise MI)
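One way to read "frequency adjusted MI" is to compute mutual information for each (head, relation, collocate) triple, in the general form of Lin's triple-based measure, and multiply it by the log of the triple's frequency. The sketch below assumes that reading; the exact adjustment used by the Sketch Engine is not given on the slide, and the toy counts and function name are invented for illustration.

```python
import math
from collections import Counter

def salience_scores(triples):
    """Score (head, relation, collocate) triples by MI times log frequency."""
    triples = list(triples)
    n_triple = Counter(triples)
    n_head = Counter((h, r) for h, r, _ in triples)
    n_coll = Counter((r, c) for _, r, c in triples)
    n_rel = Counter(r for _, r, _ in triples)
    scores = {}
    for (h, r, c), n in n_triple.items():
        # MI = log( |h,r,c| * |*,r,*| / (|h,r,*| * |*,r,c|) )
        mi = math.log(n * n_rel[r] / (n_head[(h, r)] * n_coll[(r, c)]))
        scores[(h, r, c)] = mi * math.log(n + 1)  # assumed frequency adjustment
    return scores

# Invented toy counts built around the <eat, obj, rice> example.
toy = (
    [("eat", "obj", "rice")] * 4
    + [("eat", "obj", "bread")]
    + [("cook", "obj", "rice")]
    + [("buy", "obj", "bread")] * 4
)
scores = salience_scores(toy)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Multiplying by a frequency term counteracts the tendency of plain MI to favour rare pairs, so one-off tagging or pattern errors rank low.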

Design Criteria of Sketch Engine

Grammatical relations (GRs) are the information of most interest to both HLT and linguistic research

However, GRs can only be discovered from collocational data, which requires a very large corpus and high-quality annotation at the same time, a seemingly unsolvable dilemma

There is a solution when the corpus is big enough

Context-free patterns allow fairly reliable extraction of a substantial number of relations, if not all

(When there are enough instances of relations extracted), the saliency ranking correctly picks the distributional tendencies and allows users to ignore idiosyncrasies/errors.

WordSketch’s Approach: From Lexical Types to Relation Types

BNC has 100,000,000 words

939,028 word types

70,000,000 tuples (relations) extracted

More than 70 relations per lemma

For CWS II and the CGW corpus (CNA data)

1,917,093 word types

59,183,238 tuples (<eat, obj, rice>)

More than 30 relations per lemma
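The "relations per lemma" figures can be checked by dividing the tuple counts by the word-type counts given on the slide. A small worked calculation, using word types as a stand-in for lemmas (an assumption, since lemma counts are not given):

```python
# Counts copied from the slide.
bnc_types, bnc_tuples = 939_028, 70_000_000
cna_types, cna_tuples = 1_917_093, 59_183_238

bnc_ratio = bnc_tuples / bnc_types
cna_ratio = cna_tuples / cna_types

print(f"BNC: {bnc_ratio:.1f} relations per word type")
print(f"CNA: {cna_ratio:.1f} relations per word type")
```

This gives roughly 74.5 for the BNC and 30.9 for the CNA data, in line with "more than 70" and "more than 30".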

Chinese WordSketch: An Overview

Concordance

WordSketch

Sketch Difference

Thesaurus

CWS: SketchDiff

Comparing the behaviors of two words

CWS: Thesaurus of 快樂

Application to Chinese Corpus: Comparing Thesaurus

We shall know a word by the company it keeps

Context-free patterns: Does Quality of Grammatical Knowledge Matter?

The implementation of CWS I simply adopts English-like context-free (CF) grammatical patterns (since Chinese and English supposedly share very similar PS rules)

However, the result was not very satisfactory

Missing a lot of relations, such as objects which do not appear right next to a verb

Mis-classifying topicalized objects as subjects

Missing objects in non-canonical positions

Linguistic Knowledge Should Solve the above Problems

Comprehensive Lexical Knowledge of Verb Frames exists

Information-based Case Grammar (ICG)

Encoded on over 40,000 verbs in Sinica Lexicon

ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)

EXPERIENCER<GOAL[PP[ 對 ]]<VI

EXPERIENCER<VI<<GOAL[PP[ 於 ]]

THEME<GOAL[PP{ 對、以 }]<VI

THEME<VI<<GOAL[PP[ 於 ]]

THEME<VI<<SOURCE[PP{ 自、於 }]

THEME< SOURCE[PP{ 歸、為 }]<VI

Comparing Lexical Knowledge Between CWS I and CWS II

CWS I: 11 definitions, 11 patterns

One single pattern for verb-object relation

CWS II: 32 definitions, 80 patterns

20 patterns for verb-object relation

59,183,238 tuples (<eat, obj, rice>)

from 496,465,879 words

English has 39 definitions, 40 patterns

Synergy among tagging, statistics, and linguistic knowledge

Collocations are identified with Context free rules in Word Sketch Engine

Collocating Pattern for Object from CSE I

1:"V[BCJ]" "Di"? "N[abc]"? "DE"? "N[abc]"? 2: "Na" [tag!= "Na"]

Challenge: Long-distance relations

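全穀麵包,吃了很健康。

quan.gu mian.bao, chi le hen jian.kang

"Whole-grain bread is healthy to eat."

有人嘗試要將這荷花分類,卻越分越累。

you ren chang.shi yao jiang zhe he.hua fen.lei, que yue fen yue lei

"Someone tried to classify these lotuses, but the more they classified, the more tired they got."

他 只 吃了 一 口 飯 …

Ta zhi chi le yi kou fan

"He ate only one mouthful of rice..."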
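To see why a single adjacency-based object pattern misses such cases, the CSE I pattern can be approximated as a sequence of tag regexes matched over a tagged sentence. The sketch below is a simplified stand-in for the Sketch Engine's own pattern matcher, not its implementation. The tokenization (吃 and 了 split into two tokens) and the coarse tags are assumptions made for illustration.

```python
import re

# Each element: (tag regex, optional?, label or None). Approximates
# 1:"V[BCJ]" "Di"? "N[abc]"? "DE"? "N[abc]"? 2:"Na" [tag!="Na"]
OBJECT_PATTERN = [
    (r"V[BCJ]", False, "verb"),
    (r"Di", True, None),
    (r"N[abc]", True, None),
    (r"DE", True, None),
    (r"N[abc]", True, None),
    (r"Na", False, "object"),
    (r"(?!Na$).*", False, None),  # the following token must not be tagged Na
]

def match_at(tagged, pos, k=0, labels=None):
    """Match pattern elements k.. against tagged[pos:], with backtracking."""
    labels = labels or {}
    if k == len(OBJECT_PATTERN):
        return labels
    regex, optional, label = OBJECT_PATTERN[k]
    if pos < len(tagged) and re.fullmatch(regex, tagged[pos][1]):
        extended = {**labels, label: tagged[pos][0]} if label else labels
        found = match_at(tagged, pos + 1, k + 1, extended)
        if found is not None:
            return found
    if optional:
        return match_at(tagged, pos, k + 1, labels)
    return None

def find_objects(tagged):
    """Return (verb, object) pairs found anywhere in the tagged sentence."""
    pairs = []
    for start in range(len(tagged)):
        found = match_at(tagged, start)
        if found:
            pairs.append((found["verb"], found["object"]))
    return pairs

# Hypothetical tagged sentences with assumed coarse tags.
adjacent = [("他", "Nh"), ("吃", "VC"), ("了", "Di"), ("飯", "Na"), ("。", "PERIODCATEGORY")]
distant = [("他", "Nh"), ("只", "Da"), ("吃", "VC"), ("了", "Di"),
           ("一", "Neu"), ("口", "Nf"), ("飯", "Na"), ("。", "PERIODCATEGORY")]
print(find_objects(adjacent))
print(find_objects(distant))
```

The object is found when it directly follows the verb, but the intervening numeral and classifier in the second sentence block the match, so that object goes unrecorded.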

Integrating prior Knowledge in Processing

Knowledge Source

Information-based Case Grammar (ICG, Chen and Huang 1992)

Encoded on over 40,000 verbs in Sinica Lexicon

ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)

EXPERIENCER<GOAL[PP[ 對 ]]<VI

EXPERIENCER<VI<<GOAL[PP[ 於 ]]

THEME<GOAL[PP{ 對、以 }]<VI

THEME<VI<<GOAL[PP[ 於 ]]

THEME<VI<<SOURCE[PP{ 自、於 }]

THEME< SOURCE[PP{ 歸、為 }]<VI

Integrating prior Knowledge in Processing: Examples

村莊 (object) 明天將 被 夷為平地 (VB11)

cunzhuang mingtian jiang bei yiweipingdi

"The village will be razed to the ground tomorrow."

begin time1 location time1 adv? passive_prep adv_string 1:"V[BCJ].*" [tag!="DE"]

大量 的 遊客 破壞 (VC2) 公園 景觀 (object)

daliang de youke pohuai gongyuan jingguan

"Large numbers of tourists damage the park's scenery."

1:"VC.*" (particle|prep)? NP not_noun

(NP is defined as “…noun_modifier{0,2} 2:noun…”.)

Integrating prior Knowledge in Processing: Partial Result

Object Recall Comparison

                       CSE I     CSE II
hong2 (red)            0         0
pao3 (run)             0         8,704
kan4 (look)            32,350    64,096
da3 (hit)              26,016    47,182
song4 (give)           0         76,378
shuo1 (say)            0         20,350
xiang1xin4 (believe)   0         52,373
quan4 (persuade)       0         3,852

Integrating prior Knowledge in Processing: Partial Result II

Most salient objects for chi1 「吃」 (eat) in CSE II

Those among the top 20 salient objects from CSE I, but not II

飯 fan4 rice 802 70.96 (4)

虧 kui1 disadvantage 329 59.24 (12)

苦頭 ku3tou2 suffering 194 58.71 (14)

Applications: Chinese WordSketch

Test version of Chinese Word Sketch is available

Permanent version of CWS will be available from Academia Sinica soon

http://wordsketch.ling.sinica.edu.tw

Application: Resolving Nominalization

[Equation not recoverable from this transcript. The surviving symbols are log P terms over w, t, v, and nom, with subscripts i, i-1, and i+1.]

Chinese verbs are nominalized without overt markup

Resolving Categorical ambiguity with distributional information only

Two Approaches: HMM and Bayesian Classifier

HMM: N-grams

Classifier: left, right contexts, plus own verb sub-class, weighted

2.0 ,3.0 ,5.0

Nominalization Results (Ma and Huang 2006)

[Figure: F-score (%) by topic (文學 Literature, 生活 Life, 社會 Society, 科學 Science, 哲學 Philosophy, 藝術 Arts, 綜合 General) for HMM-1, HMM-2, Classifier-1, Classifier-2, and Classifier-3; y-axis from 0 to 100.]

Best overall HMM performance: 69%

Best Overall Bayesian classifier performance: 74%

    

Mining Cross-Strait Lexical Difference

Strategy: Using a pair of known contrasting words as seeds and looking up their Sketch Difference

Clinton: 克林頓 ke4 vs. 柯林頓 ke1

What is found: Other translations unique to either the PRC or Taiwan

克林頓 (PRC) only and/or patterns (vs 柯林頓 only)

葉利欽 88 54.6 Yeltsin 葉爾勤 (3)

布什 65 49.7 Bush 布希 (4)

萊溫斯基 10 41.3 Lewinsky 呂茵斯基 / 呂女 (1)

戈爾 20 39.4 Gore 高爾 (2)
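The seed-pair strategy amounts to comparing the collocate lists of the two seed words and keeping what appears with only one of them. The sketch below shows that comparison in a simplified form; it is not the Sketch Difference implementation. The PRC-side frequencies are taken from the slide, while the Taiwan-side counts and the shared collocate 總統 are invented placeholders.

```python
def sketch_difference(collocates_a, collocates_b):
    """Split collocates into those unique to A, unique to B, and shared."""
    only_a = {w: n for w, n in collocates_a.items() if w not in collocates_b}
    only_b = {w: n for w, n in collocates_b.items() if w not in collocates_a}
    shared = sorted(set(collocates_a) & set(collocates_b))
    return only_a, only_b, shared

# Collocates of 克林頓 (PRC form); 總統 and its count are hypothetical.
prc = {"葉利欽": 88, "布什": 65, "戈爾": 20, "萊溫斯基": 10, "總統": 5}
# Collocates of 柯林頓 (Taiwan form); all counts here are hypothetical.
taiwan = {"布希": 4, "葉爾勤": 3, "高爾": 2, "呂茵斯基": 1, "總統": 7}

only_prc, only_taiwan, shared = sketch_difference(prc, taiwan)
print(sorted(only_prc, key=only_prc.get, reverse=True))
print(sorted(only_taiwan, key=only_taiwan.get, reverse=True))
```

Collocates that occur with only one seed are candidates for region-specific translations, which a human can then pair up across the two lists.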

Adventures in Wen-Land: 文國尋寶記

http://www.sinica.edu.tw/Wen/

Integrating the following

Corpora: Sinica Corpus, Textbook Corpus (3 different editions), Tang poems, Dream of the Red Chamber, On the Water Margin…

Lexicon: General, Classifier, Idiom ( 成語 )

Linked with a corpus/lexicon interface

Developed by: Huang, Fengju Lo, Hui-chun Hsiao, and team of teachers

The Substantive Issues

Language Resources Used in WenGuo

Textual Databases (of classical texts)

Text Corpora

Linguistic and Philological Knowledge from previous research

LKB Extracted and composed from the above

Adventures in Wen-Land (2001)

Adventures in Wen-Land

What: A virtual theme park for on-line Chinese language learning and teaching.

How: The end product of a National Digital Museum Project sponsored by the National Science Council, ROC (A Linguistic and Literary KnowledgeNet for Elementary School Children)

When: Completed in spring 2001

Adventures in Wen-Land

Who: The team included

Chu-Ren Huang, a linguist

Feng-ju Lo, a literary scholar

Hui-chun Hsiao, a web-based art designer

Ching-Chun Hsieh, a computer scientist

Chi-chao Liao, Chiu-Jung Lu, Pei-chuan Wei...

Mei-ling Li, Hsiou-Hua Chiu, Shu-wen Huang, Cheng-chi Jiang, elementary school Chinese teachers

An Adventure in Seven Parts

The Geography of Wen-Land

名稱 Scene: 學習目標 Content

梁山 mountain: 天罡地煞梁山泊論英雄好漢 On the Water Margin
大觀園 garden: 大觀園一探紅樓兒女情懷 The Dream of the Red Chamber
西園 music hall: 進入時光隧道,回味唐宋流行歌 Song Poetry
倒影湖 lake: 語文的無窮趣味,遊戲的新鮮挑戰 Games
接龍瀑布 falls: 出口成章,妙語串成珠璣 Chinese Idiom Dict.
黑白宮 castle: 名詞語量詞配出中文的特色 Noun-class. Dict.
學堂 colleges: 由教科書有限的字數裡找出豐富的知識與無窮的趣味 Three versions of textbooks

The Adventure’s Seven Guides

The Denizens of Wen-Land

名稱 Scene: 導覽人物 Featured Character

梁山 mountain: 神行太保 The Chinese Mercury (one of the 108 heroes)
大觀園 garden: 鴛鴦 A Maid who knows the ins and outs
西園 music hall: 宋代少婦 Young Song Dynasty Woman
倒影湖 lake: 平平與明明 A Twin
接龍瀑布 falls: 哪吒 The mythical flying child acrobat
黑白宮 castle: 林三本 A medieval estate owner
學堂 colleges: 李小哲 a learned young scholar (a miniature version of Y.T Lee)

Designing Adventures: Threads that hold the KnowledgeNet Together

A Thread without a guiding needle goes nowhere

穿針引線 A Lexical Needle Picks Up & Connects

-Only the Textual Materials that it is allowed to go through

Pulling Through and Pulling Together

Lexical thread and hyperlink

Lexical KnowledgeBase (LKB) guides us through all language resources that use the same word

-In WenGuo, we assume users will be using textbook vocabulary to guide them

Pulling Through and Pulling Together

Lexical thread and Textual Filter

LKB provides the chronological (such as when a word is first taught/learned) and distributional (such as frequency) features of each word.

-In WenGuo, by knowing a user’s level at school, we can gauge/pace learning

Integrated Resources as Learning Background in Wenguo

Using LKB to Pace Linguistic Knowledge

A learner identifies his/her school year (3rd grade, etc.) when logging in

-controls the vocabulary level of learning activities

-paces/monitors development of linguistic skill

A user can also specify which textbook version to view

-allows cross-track comparison of linguistic development

-allows supplementation at corresponding learning level
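The pacing idea can be sketched as a filter over LKB entries that record when a word is first taught and how frequent it is. Everything in this sketch is hypothetical: the `LkbEntry` fields, the grade levels, and the frequencies are invented to illustrate the mechanism, not taken from the WenGuo data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LkbEntry:
    word: str
    first_taught_grade: int  # chronological feature (hypothetical values)
    frequency: int           # distributional feature (hypothetical values)

def vocabulary_for(entries, grade):
    """Words a learner at the given grade has met, most frequent first."""
    known = [e for e in entries if e.first_taught_grade <= grade]
    return sorted(known, key=lambda e: (-e.frequency, e.word))

entries = [
    LkbEntry("個", first_taught_grade=1, frequency=500),
    LkbEntry("張", first_taught_grade=2, frequency=120),
    LkbEntry("雲海", first_taught_grade=5, frequency=3),
]
third_grade_words = [e.word for e in vocabulary_for(entries, grade=3)]
print(third_grade_words)
```

Keeping one such entry list per textbook version would allow the same filter to support the cross-version comparison described above.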

Synergizing Archive-based LKB

LKBs based on classical or prototypical texts facilitate quick and accurate lexical comparison and allow immediate reference to the original text

-In WenGuo, users can easily find out the literary references and citations in several classics and go immediately from vocabulary to text

Integrated Lexical KnowledgeBase entry of 雲海 yun2hai3

Citation of Classifier 個 ge5 in Three Textbooks

Collocating Nouns of Classifier 張 From Huang et al. 1997 國語日報量辭典

Concluding Remarks

-A corpus is a sample of the ‘real words in action’

-Corpora and other language resources can be combined to create powerful language teaching and learning tools

-The integration must be linked by lexical terms

-Corpora must be tagged with POS

-In practice, different editions of textbooks can be treated as different corpora

-And be linked for comparison or borrowing

-Corpus facilitates creation of synergy for learning and teaching

Other Useful Resources

Sinica Corpus 中央研究院現代漢語平衡語料庫 (Academia Sinica Balanced Corpus of Modern Chinese), first tagged corpus of Chinese, online since 1996

http://www.sinica.edu.tw/SinicaCorpus

SouWenJieZi - A Linguistic KnowledgeNet. August 1999.

http://words.sinica.edu.tw/

SINICA BOW 2002

http://bow.sinica.edu.tw

Chinese Wordnet 2005, >16,000 synsets

http://cwn.ling.sinica.edu.tw/

Conclusion

The synergy of different language resources creates knowledge

生生不息 (growth that never ceases)

Concluding Remarks

Other NLP Research Activities at Academia Sinica

Chinese Wordnet: ongoing, >10,000 synsets

http://cwn.ling.sinica.edu.tw

Bilingual Wordnet linked to SUMO ontology

http://bow.sinica.edu.tw

Fully Sense-tagged corpus: combining cwn and Sinica corpus with machine learning algorithm

Directed by Sue-Jin Ker of Soochow Univ.

Subset to be available soon

Asian lexicon standard: NEDO project

Tokunaga, Calzolari, Shirai, Virach, Prevot…

Reference

LDC CGW Corpus: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14

Chinese Word Sketch trial URL: http://corpora.fi.muni.cz/chinese_all/ (username: chinese, password: chinese)

Wei-yun Ma and Chu-Ren Huang. 2006. Uniform and Effective Tagging of a Heterogeneous Giga-word Corpus. To be presented at the 5th International Conference on Language Resources and Evaluation (LREC2006). Genoa, Italy. 24-28 May, 2006.

CKIP (Chinese Knowledge Information Processing Group). (1995/1998). The Content and Illustration of Academia Sinica Corpus. (Technical Report no. 95-02/98-04). Taipei: Academia Sinica

Huang Chu-Ren, Feng-Yi Chen, Keh-Jiann Chen, Zhao-Ming Gao and Kuang-Yu Chen. (2000). Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface. Proceedings of 2nd Chinese Language Processing Workshop pp. 29-37.

Kilgarriff, Adam, Chu-Ren Huang, Pavel Rychly, Simon Smith, and David Tugwell. (2005). Chinese Word Sketches. ASIALEX 2005: Words in Asian Cultural Context.

Reference (continued)

Ma, Wei-Yun and Keh-Jiann Chen, (2005). Design of CKIP Chinese Word Segmentation System, Chinese and Oriental Languages Information Processing Society, Vol 14. No. 3. pp. 235-249.

Tseng, H.H. & K.J. Chen, (2002). Design of Chinese Morphological Analyzer, Proceedings of SIGHAN Workshop on Chinese Language Processing, pp. 49-55.

Tsai Yu-Fang and Keh-Jiann Chen, (2003). Reliable and Cost-Effective Pos-Tagging, Proceedings of ROCLING XV, pp. 161-174.

Tsai Yu-Fang and Keh-Jiann Chen, (2003). Context-rule Model for POS Tagging, Proceedings of PACLIC 17, pp. 146-151.

Tsai Yu-Fang and Keh-Jiann Chen, (2004). Reliable and Cost-Effective Pos-Tagging, International Journal of Computational Linguistics & Chinese Language Processing, Vol. 9 No. 1, pp. 83-96.
