
Page 1: Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services World Library and Information Congress: 72nd IFLA

Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services

World Library and Information Congress: 72nd IFLA General Conference and Council, 20-24 August 2006, Seoul, Korea

Library of Chinese Academy of Sciences

Zhang Zhixiong, Li Sa, Wu Zhengxin, Lin Ying

Page 2:

Outline

1. Introduction
2. What is IE (Information Extraction)?
3. Potential functions in Innovations of Library Services
4. Constructing a Chinese Information Extraction System
5. Tests and Evaluation

Page 3:

1. Introduction

Library of Chinese Academy of Sciences
– now changing its name to National Science Library of China
– about 400 staff; headquarters in Beijing, 3 branches in Lanzhou, Chengdu and Wuhan
– serves 90 CAS research institutes across the country
– in 2001, initiated the Chinese National Science Digital Library (CSDL) program

Page 4:

1. Introduction

CSDL (Chinese National Science Digital Library)
– provided abundant digital information resources for users (e-journals: 6000 Western, 11000 Chinese; 15000 in one day)
– developed information systems to support networked services

Page 5:

Union Catalogs & Document Delivery

Page 6:

Federated database search

Page 7:

Digital reference

Page 8:

Remote authentication

Page 9:

1. Introduction

CSDL (Chinese National Science Digital Library)
– provided abundant digital information resources for users (e-journals: 6000 Western, 11000 Chinese; 15000 in one day)
– developed information systems to support networked services
– carried out many training and publicity programs

Page 10:

1. Introduction

CSDL has become one of the key research facilities for researchers and graduate students of CAS.

However:
– the information requirements of researchers and graduate students change rapidly
– traditional information retrieval methods are not sufficient

Page 11:

1. Introduction

The users of CSDL want to:
– get rid of information noise
– effectively get a comprehensive view of recent developments in a domain
– disclose significant relationships between pieces of information

The librarians of CSDL want to:
– improve the service standard of CSDL
– turn the digital library into a knowledge repository

Page 12:

1. Introduction

Information Extraction (IE) is the emerging technology that serves our needs.


Page 14:

2. What is IE (Information Extraction)?

NLP Research Group, University of Sheffield
– Information extraction (IE) is a term that has come to be applied to the activity of automatically extracting pre-specified sorts of information from natural language texts

Page 15:

2. What is IE (Information Extraction)?

Dr. Hamish Cunningham
– IE is a process that takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output
– Input: unstructured free text
– Output: fixed-format, unambiguous data
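The input/output contrast above can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the `Mention` record, the `PATTERNS` table and the `extract` function are hypothetical names, and the two hand-written regular expressions stand in for the much richer resources a real IE system would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One fixed-format output record (hypothetical layout)."""
    kind: str   # pre-specified sort of information, e.g. "Year"
    text: str   # the matched span
    start: int  # character offsets into the input text
    end: int


# Hypothetical hand-written patterns, one per pre-specified information type.
PATTERNS = {
    "Year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "Place": re.compile(r"\b(?:Beijing|Lanzhou|Chengdu|Wuhan|Seoul)\b"),
}


def extract(text):
    """Turn unstructured free text into a list of fixed-format records."""
    mentions = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            mentions.append(Mention(kind, m.group(), m.start(), m.end()))
    # Order the records by their position in the text.
    return sorted(mentions, key=lambda mention: mention.start)


text = "In 2001 the library, with headquarters in Beijing, initiated the CSDL program."
records = extract(text)
for record in records:
    print(record)
```

Each record carries a type, the matched text and its offsets, so it can be searched, counted or indexed without re-reading the sentence.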

Page 16:

2. What is IE (Information Extraction)?

Output (structured information source) can be used for:
– searching
– analysis
– generating summaries
– constructing indices

Page 17:

IE: an example

##### ####### NHS TRUST - PATIENT CASE NOTE ########:######### ####### DOB: 1944 CLEF-RMH-Entry-Key: 52A4F6DB2B46E

AB 1992 Seen in General Surgical This lady who has had a mastectomy and left open capsulotomy and removal of her prosthesis was seen by me in the clinic today on behalf of XXXXXXXXXXX. She has extensive bony lymphoedema in her left arm which does not seem to be getting any better although she is more or less reconciled to the problem. The original problem was that she complained of shooting pain in the direction of ulna nerve and although there does not seem to be any evidence of local, regional or distant recurrence the pain itself warrants management in a pain clinic. XXXXXXXXX could be seen in the pain clinic at the XXXXXXX but as this would involve a lot of travelling would like to be treated nearer her home. I wonder whether it would be possible for you to investigate if there is a pain clinic available at XXXXXXXXXXX as I am sure XXXXX could be treated and benefit from its management. I have otherwise arranged for her to be seen in the clinic again in a year's time. There are no signs of recurrence at this time.

5213A4F612F1

[Figure: the case note above with phrases highlighted by annotation type: Interventions, Problems, Problem Site, Locations, Time]

Page 18:

IE: an example

Extracted information could be collected…

Problems: recurrence; no signs of recurrence; bony lymphoedema; shooting pain in the direction of ulna nerve; pain
Problem Site: left arm; local, regional or distant
Time: a year’s time; today; at this time
Locations: pain clinic; clinic; pain clinic; General Surgical; pain clinic
Interventions: mastectomy; left open capsulotomy; removal of her prosthesis; management; management

Page 19:

2. What is IE (Information Extraction)?

5 kinds of Information Extraction tasks
– Named Entity recognition (NE)
– Coreference resolution (CO)
– Template Element construction (TE)
– Template Relation construction (TR)
– Scenario Template production (ST)

Page 20:

2. What is IE (Information Extraction)?

NE is about finding entities
CO is about which entities and references (such as pronouns) refer to the same thing
TE is about what attributes entities have
TR is about what relationships there are between entities
ST is about events that the entities participate in
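To show what each of the five tasks contributes, here is a hand-filled sketch of their outputs for one short passage. Nothing here is computed by an extractor: the example passage, the class names (`Entity`, `Relation`, `Event`) and the field names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    text: str
    kind: str
    attributes: dict = field(default_factory=dict)  # filled in by TE


@dataclass
class Relation:
    kind: str
    source: Entity
    target: Entity


@dataclass
class Event:
    kind: str
    participants: dict


# Illustrative passage, written for this sketch.
text = ("Zhang Zhixiong works at the Library of Chinese Academy of Sciences. "
        "He presented the paper in Seoul, Korea in August 2006.")

# NE: find the entities.
person = Entity("Zhang Zhixiong", "Person")
library = Entity("Library of Chinese Academy of Sciences", "Organization")
city = Entity("Seoul", "Location")

# CO: the pronoun "He" refers to the same thing as "Zhang Zhixiong".
coreference = {"He": person}

# TE: attach attributes to an entity.
city.attributes["country"] = "Korea"

# TR: record a relationship between two entities.
relation = Relation("works_at", person, library)

# ST: record an event the entities participate in.
event = Event("presentation", {
    "speaker": coreference["He"],
    "place": city,
    "date": "August 2006",
})

print(relation)
print(event)
```

The point of the sketch is the layering: ST reuses the entity resolved by CO, which in turn is one of the entities found by NE.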

Page 21:

2. What is IE (Information Extraction)?

Information Extraction will:
– play a very important role in coping with huge collections of digital information
– bring innovations in library services


Page 23:

3. Potential functions in Innovations of Library Services

1. Automatic annotation and metadata creation
– automatic annotation of digital materials
– automatic acquisition of metadata
– for example, MnM, S-CREAM, AERODAML, SemTag, KIM, hTechsight
– ontology-based IE techniques

Page 24:

3. Potential functions in Innovations of Library Services

2. Improving data mining in information analysis
– large-scale data analysis
– detection of many types of evidence
– getting enough structured data for analysis

Page 25:

3. Potential functions in Innovations of Library Services

3. Developing knowledge bases from free text
– statistical and numeric databases
– terminological databases
– fact sheets
– SOBA (SmartWeb Ontology-Based Annotation)

Page 26:

3. Potential functions in Innovations of Library Services

4. Generating answers in digital reference systems
– most research libraries have established digital reference services
– can we get answers directly from information systems?
– natural language QA (Question Answering)

Page 27:

So…

IE is very important
How to build an IE system (Chinese)?
CSDL is trying to find an effective way


Page 29:

4. Constructing a Chinese Information Extraction System

A Chinese IE solution
– which makes full use of GATE
– trying to develop a Chinese IE plug-in to process Chinese information resources based on the GATE framework

Page 30:

4. Constructing a Chinese Information Extraction System

GATE
– General Architecture for Text Engineering
– open source, developed since 1995
– GATE, a framework: Language Resources (LRs), Processing Resources (PRs), Visual Resources (VRs)
– ANNIE (A Nearly-New IE system): tokeniser, sentence splitter, POS tagger, gazetteer, finite state transducer and orthomatcher

Page 31:

ANNIE Pipeline
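The idea of a pipeline of processing resources can be sketched in a few lines. This is not GATE's API and not ANNIE itself: the stage functions, the dictionary-based document and the tiny `GAZETTEER` are hypothetical stand-ins for three of the listed components (tokeniser, sentence splitter, gazetteer), each of which adds its own annotations to a shared document.

```python
import re

# Hypothetical mini-gazetteer mapping a token to a lookup type.
GAZETTEER = {"Beijing": "location", "Seoul": "location"}


def tokenise(doc):
    """Add (token, start, end) tuples for words and punctuation marks."""
    doc["tokens"] = [(m.group(), m.start(), m.end())
                     for m in re.finditer(r"\w+|[^\w\s]", doc["text"])]
    return doc


def split_sentences(doc):
    """Split on sentence-final punctuation (a deliberately naive rule)."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", doc["text"]):
        sentences.append(doc["text"][start:m.end()].strip())
        start = m.end()
    tail = doc["text"][start:].strip()
    if tail:
        sentences.append(tail)
    doc["sentences"] = sentences
    return doc


def lookup(doc):
    """Mark tokens that appear in the gazetteer; needs the tokeniser's output."""
    doc["lookups"] = [(token, GAZETTEER[token])
                      for token, _, _ in doc["tokens"] if token in GAZETTEER]
    return doc


def run_pipeline(text, stages):
    """Pass one document through the stages in order."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline("The library is in Beijing. The congress met in Seoul.",
                   [tokenise, split_sentences, lookup])
print(doc["sentences"])
print(doc["lookups"])
```

Because later stages read what earlier stages wrote, replacing a single stage (for example the tokeniser) changes everything downstream, which is why tokenizing is the first difficulty to solve for Chinese.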

Page 32:

GATE: good for English

Page 33:

GATE: not so good for Chinese

Page 34:

4. Constructing a Chinese Information Extraction System

Key difficulties for Chinese information extraction
– Chinese tokenizing
– Chinese gazetteers
– Chinese named entity recognition

Page 35:

Chinese tokenizing

English language
– words are separated by white space and punctuation

Chinese language
– no separation between words

Page 36:

A simple sentence

(I am a Chinese)

can be broken into several forms with segmenter

(I am a Chinese)

(I am China person)

(I am center country person)
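The ambiguity can be reproduced with a small dictionary-based segmenter. The sketch below uses forward maximum matching, a simple textbook technique chosen here for illustration; it is not the method used by ICTCLAS. The Chinese sentence 我是中国人 is an assumption, since only the English gloss survives in this transcript, and the three lexicons are hypothetical.

```python
def forward_max_match(sentence, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon word at
    each position, falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


sentence = "我是中国人"  # assumed original of "(I am a Chinese)"

full_lexicon = {"我", "是", "中国", "中国人", "人"}
no_compound = {"我", "是", "中国", "人"}
empty_lexicon = set()

by_full = forward_max_match(sentence, full_lexicon)
by_partial = forward_max_match(sentence, no_compound)
by_chars = forward_max_match(sentence, empty_lexicon)

print(by_full)     # I / am / Chinese-person
print(by_partial)  # I / am / China / person
print(by_chars)    # I / am / center / country / person
```

The same sentence yields three segmentations depending only on which words the lexicon contains, which matches the three readings listed above.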

Page 37:

Chinese gazetteers

GATE gazetteer lists for English
– very abundant

GATE gazetteer lists for Chinese processing
– simple and short gazetteers such as date, time, organization, location, money, province, etc.
– for a flexible language like Chinese, the lists are very limited
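What a gazetteer contributes can be sketched as a typed word list plus a longest-match lookup over the raw text. The list contents, the type names and the `find_lookups` function below are hypothetical and are not GATE's gazetteer format or API; the sample sentence is written for this sketch from the facts on the introduction slide.

```python
# Hypothetical gazetteer: type name -> entries of that type.
GAZETTEER = {
    "location": ["北京", "兰州", "成都", "武汉"],
    "organization": ["中国科学院"],
}


def find_lookups(text, gazetteer):
    """Return (start, end, entry, type) for every gazetteer entry found,
    preferring the longest entry at each position."""
    index = {entry: kind for kind, entries in gazetteer.items() for entry in entries}
    longest = max(map(len, index), default=0)
    found, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            kind = index.get(text[i:i + size])
            if kind is not None:
                found.append((i, i + size, text[i:i + size], kind))
                i += size
                break
        else:
            i += 1  # no entry starts here; move on one character
    return found


# "The CAS headquarters library is in Beijing; branches are in Lanzhou, Chengdu and Wuhan."
text = "中国科学院的总馆在北京,分馆在兰州、成都和武汉。"
lookups = find_lookups(text, GAZETTEER)
for item in lookups:
    print(item)
```

Anything missing from the lists is simply not found, so a short gazetteer directly limits what the later named entity rules can recognize.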

Page 38:

Chinese named entity recognition

The GATE system uses JAPE (Java Annotation Patterns Engine) rules to recognize NEs

Page 39:

JAPE rules

The grammar of Chinese is quite different from that of English
The JAPE rules provided by GATE are not suitable for Chinese texts
We need to rewrite the JAPE rules to implement Chinese information extraction
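JAPE rules match patterns over existing annotations and add new annotations when a pattern fires. The sketch below imitates that idea in Python rather than in JAPE syntax, and it is not one of the authors' rules: the rule itself (one or more location lookups followed by an organization key word form an Organization), the tag names and the sample tokens are all assumptions.

```python
def apply_org_rule(tokens):
    """tokens: list of (word, lookup_type) pairs from segmentation plus
    gazetteer lookup. Pattern: location+ followed by org_key -> Organization;
    location words not followed by a key word stay Location."""
    annotations, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j][1] == "location":
            j += 1
        if j > i and j < len(tokens) and tokens[j][1] == "org_key":
            span = "".join(word for word, _ in tokens[i:j + 1])
            annotations.append(("Organization", span))
            i = j + 1
        elif j > i:
            annotations.extend(("Location", word) for word, _ in tokens[i:j])
            i = j
        else:
            i += 1
    return annotations


# Assumed segmentation of "北京大学位于北京" (Peking University is located in Beijing).
tokens = [("北京", "location"), ("大学", "org_key"), ("位于", None), ("北京", "location")]
result = apply_org_rule(tokens)
print(result)
```

The rule depends on word order and key words specific to Chinese, which illustrates why rules written for English text cannot be reused unchanged.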

Page 40:

Solutions to the problems

Page 41:

Three main tasks we have done

1. Integrating ICTCLAS to perform word segmentation

Page 42:

Three main tasks we have done

2. Developing Chinese gazetteers to enrich GATE language resources

Page 43:

Three main tasks we have done

3. Rewriting JAPE rules to recognize Chinese NEs

Page 44:

Chinese JAPE rule


Page 46:

5. Tests and Evaluation

After one year of work, we implemented the system
We carried out experiments

Page 47:

The same article

Page 48:

Our output

Page 49:

Conclusions

We brought forth a solution for a Chinese information extraction system
We carried out a valuable experiment
Much work still needs to be done
This lays a good foundation for our future work

Page 50:

Thanks!

谢谢!