WP2: Named Entity Recognition and Classification
Claire Grover, University of Edinburgh
Final Review, 31 October 2003



Multilingual IE Architecture

[Architecture diagram: web pages feed into the four monolingual NERC modules (ENERC, FNERC, HNERC, INERC); their output passes through the Demarcator to Fact Extraction and into a database, with a shared Domain Ontology supporting all components.]

WP2: Objectives

• Specification of language-neutral NERC architecture (month 6: D2.1)

• NERC v.1: adaptation and integration of the four existing NERC modules (month 12: D2.2)

• Specification of Corpus Collection Methodology

• NERC v.2: improvement of NERC v.1, incorporation of name matching (month 18: D2.3)

• NERC-based Demarcation

• NERC v.3: improvement of NERC v.2, incorporation of rapid adaptation mechanisms, porting to the 2nd domain (month 26: D2.4)

Features Specific to CROSSMARC NERC

• Multilinguality. Currently 4 languages but should be able to add new languages.

• Web pages as input. Conversion of HTML to XHTML and use of XML as common exchange format with a specific DTD per domain.

• Extensible to new domains. There is a need to rapidly add new domains.

Shared Features of the NERC Components

• XHTML input and output, shared DTD

• Shared domain ontology

• Each reuses existing NLP tools and linguistic resources

• Stepwise transformation of the XHTML to incrementally add mark-up, e.g. tokenisation, sentence identification, part-of-speech tagging, entity recognition.
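The stepwise pipeline can be pictured as each stage reading the XML document and adding one layer of annotation. A minimal sketch of the tokenisation step, using hypothetical element names (`<p>`, `<w>`) rather than the project's actual DTD:

```python
# Sketch of incremental XML annotation; element names are illustrative,
# not taken from the CROSSMARC DTDs.
import xml.etree.ElementTree as ET

def tokenise(doc: ET.Element) -> ET.Element:
    """One pipeline step: replace the text of each <p> with <w> token elements."""
    for p in doc.iter("p"):
        words = (p.text or "").split()
        p.text = None
        for word in words:
            w = ET.SubElement(p, "w")
            w.text = word
    return doc

doc = ET.fromstring("<body><p>Pentium III 800 MHz</p></body>")
doc = tokenise(doc)  # later steps would add sentence, POS and entity mark-up
print(ET.tostring(doc, encoding="unicode"))
```

Because each step only adds mark-up, later components (sentence splitting, POS tagging, entity recognition) can be chained in the same way over the growing document.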

NERC Version 2

• Final version of NERC for the 1st domain

• All four monolingual systems use hand-coded rule sets
– HNERC uses the Ellogon Text Engineering Platform.
– ENERC uses the LT TTT and LT XML tools and adds XML annotations incrementally.
– INERC is implemented as a sequence of XSLT transformations of the XML document.
– FNERC uses Lingway’s XTIRP Extraction Tool, which applies a sequence of rule-based modules.

NERC Version 3

• Reported in D2.4.

• Final version of NERC, dealing with the 2nd domain.

• Main focus is customisation methodology and experimentation to allow rapid adaptation to new domains.

• Because the monolingual components of the NERC architecture differ from each other, customisation methods are defined per component.

ENERC Customisation Methodology

• Retain XML pipeline architecture.

• Replace the named entity rule sets with a maximum entropy tagger.

• Experiments with the C&C Tagger and OpenNLP.

• Limited human intervention (selection of appropriate features).
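The "selection of appropriate features" is the main human input to a maximum-entropy tagger: each token is turned into a feature dictionary that the tagger classifies. A sketch of typical token features (illustrative only; this is not the actual C&C or OpenNLP feature set):

```python
# Illustrative NE-tagging features for a maximum-entropy classifier.
# The feature names and the example tokens are hypothetical.
def token_features(tokens, i):
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalised": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

feats = token_features(["Intel", "Pentium", "III"], 1)
print(feats["is_capitalised"], feats["prev_word"])  # True intel
```

A trained model then weights these features to choose an entity tag per token, so porting to a new domain mostly means re-training on new annotated data rather than rewriting rules.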

FNERC Customisation Methodology

• Retain XTIRP-based architecture and modules.

• Use machine learning to assist in the acquisition of regular-expression named entity rules.

• The machine-learning module produces a first version of human-readable rules plus lists of examples and counter-examples.

• The human expert modifies the rule set appropriately.

• This method reduces rule-set development time to about a third.
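One very naive way to picture how a first draft of human-readable regex rules could be induced from annotated examples (this is a sketch of the general idea only, not Lingway's actual algorithm):

```python
# Naive regex induction sketch: generalise literal examples into a draft
# pattern a human expert can then edit. Illustrative only.
import re

def induce_pattern(examples):
    """Map each example to a character-class 'shape' and join the shapes."""
    def shape(s):
        out = []
        for ch in s:
            if ch.isdigit():
                tok = r"\d"
            elif ch.isalpha():
                tok = "[A-Za-z]"
            else:
                tok = re.escape(ch)
            if not out or out[-1] != tok:
                out.append(tok)
        return "".join(t + "+" for t in out)
    return "|".join(sorted({shape(e) for e in examples}))

draft = induce_pattern(["800 MHz", "1.2 GHz"])
print(draft)
```

The draft matches unseen strings of the same shape (e.g. "2.4 GHz"), and the expert's job shifts from writing rules to pruning over-general ones against the counter-example lists.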

HNERC Customisation Methodology

• ML-HNERC comprises:

• Token-based HNERC
– operates over word tokens, treating NERC as a tagging problem.
– word-token classification is performed by five independent taggers, with the final tag chosen through a simple majority voter.

• Phrase-based HNERC
– operates over phrases which have been identified using a grammar automatically induced from the training corpus.
– uses a C4.5 decision-tree classifier to recognize phrases that describe entities.
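The majority voting in token-based HNERC can be sketched as follows (tagger outputs and tag names are hypothetical; the real system combines five taggers):

```python
# Simple per-token majority voting over several taggers' tag sequences.
from collections import Counter

def vote(tag_sequences):
    """For each token position, pick the tag most taggers agree on."""
    return [Counter(tags).most_common(1)[0][0]
            for tags in zip(*tag_sequences)]

# three hypothetical taggers disagreeing on some tokens
final = vote([
    ["B-MANUF", "O", "B-MODEL"],
    ["B-MANUF", "O", "O"],
    ["B-MANUF", "B-MODEL", "B-MODEL"],
])
print(final)  # ['B-MANUF', 'O', 'B-MODEL']
```

Voting lets individually weak taggers correct each other's errors, which is why the combined output can beat any single tagger.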

INERC Customisation Methodology

• INERC is modular, with components which are general and reusable in new domains. Customization can therefore be restricted to the lexical knowledge bases.

• A statistically driven process generalizes from the annotated corpus material to derive broader lexical resources.

• A frequency score is computed to decide which terms to add when expanding the lexical resources.
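A minimal sketch of frequency-based lexicon expansion, assuming token-level annotations and a simple count threshold (the actual INERC scoring is statistical but not specified in these slides; all names and data below are hypothetical):

```python
# Frequency-threshold lexicon expansion sketch (illustrative only).
from collections import Counter

def expand_lexicon(annotated, seed, category, min_freq=2):
    """Add terms annotated with `category` at least `min_freq` times."""
    counts = Counter(tok for tok, cat in annotated if cat == category)
    return seed | {tok for tok, n in counts.items() if n >= min_freq}

annotated = [("Toshiba", "MANUF"), ("Toshiba", "MANUF"),
             ("Satellite", "MODEL"), ("Dell", "MANUF")]
lex = expand_lexicon(annotated, {"Compaq"}, "MANUF")
print(sorted(lex))  # ['Compaq', 'Toshiba']
```

The threshold trades coverage against noise: rare annotations ("Dell" above, seen only once) stay out of the lexicon until more evidence accumulates.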

Evaluation Methodology

• For both domains we have a hand annotated corpus of 100 pages per language, split 50-50 into training and testing material.

• Each monolingual NERC is evaluated against the testing corpus.

• Standard measures of precision, recall and f-measure are used.
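The standard measures can be computed over the sets of gold and predicted entities; a small sketch (the entity encoding as (type, start, end) tuples is an assumption for illustration):

```python
# Precision, recall and balanced F-measure over entity sets.
def prf(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # correctly found entities
    p = tp / len(predicted) if predicted else 0.0    # precision
    r = tp / len(gold) if gold else 0.0              # recall
    f = 2 * p * r / (p + r) if p + r else 0.0        # harmonic mean
    return p, r, f

# hypothetical (type, start, end) entity spans
gold = {("MANUF", 0, 1), ("MODEL", 2, 4), ("SPEED", 5, 7)}
pred = {("MANUF", 0, 1), ("MODEL", 2, 3)}
p, r, f = prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.33 0.4
```

Note that a predicted entity with the right type but wrong span ("MODEL" above) counts against both precision and recall under this strict matching.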

Evaluation Summary

         Domain 1 F-score   Domain 2 F-score
ENERC    0.73               0.59
FNERC    0.77               0.75
HNERC    0.86               0.68
INERC    0.82               0.77

Conclusions

• The rule-based approach gives better results, but it is knowledge-intensive and requires significant resources for customisation to each new domain.

• The FNERC approach to rule induction is promising.

• In our experiments the machine-learning approaches give lower results but:
– they allow easy adaptation to new domains
– there is scope to improve performance
– more training material would give better performance.

Other WP2 Activities

• Collection and annotation of corpora for each language and domain.

• NERC-based Demarcation

Corpus Collection Methodology

• For each domain the process follows two steps:
– identification of interesting characteristics of product descriptions, and the collection of statistics relevant to these characteristics from at least 50 different sites per language.
– collection of pages and their separation into training and testing corpora.

Corpus Collection Principles

Domain-independent principles:
• Training and testing corpora have the same number of pages.
• Corpus size is fixed for all languages.
• Corpora are representative of the statistics found per language in the site classification step.

Domain-specific principles:
• The maximum number of pages from one site allowed in a corpus must be decided depending on the domain.
• The testing corpus must contain X pages that come from sites not represented in the training corpus.

Annotation

• Annotation performed using NCSR’s annotation tool.

• Annotation guidelines drawn up per domain.

• Each corpus annotated by two separate annotators, with inter-annotator agreement checked.

• The final corpus is the result of correcting the cases of disagreement.
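A simple way to sketch an agreement check over parallel token labels (the slides do not specify which agreement measure the project used; this shows plain observed agreement on hypothetical labels):

```python
# Observed inter-annotator agreement over parallel token labels (sketch).
def agreement(ann1, ann2):
    """Fraction of tokens the two annotators labelled identically."""
    assert len(ann1) == len(ann2), "annotations must be parallel"
    matches = sum(a == b for a, b in zip(ann1, ann2))
    return matches / len(ann1)

a1 = ["MANUF", "O", "MODEL", "O"]
a2 = ["MANUF", "O", "O", "O"]
print(agreement(a1, a2))  # 0.75
```

Disagreeing positions (the third token here) are exactly the cases that get adjudicated when producing the final corpus.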

NERC-Based Demarcator

• Operates after NERC and before FE (Fact Extraction).

• Locates different product descriptions inside a web page.

• Current version is heuristics-based.

• Characteristic information:
– 1st domain: manufacturer, model, price
– 2nd domain: job_title, organization, education title

• Output: Product_No attribute on entities.
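One way a heuristic demarcator of this kind can work is to start a new product description whenever a boundary-type entity (such as a manufacturer) appears, numbering everything in between. This is a sketch of the idea, not the project's actual heuristics, which also use model and price information:

```python
# Heuristic demarcation sketch: number entities by product description.
# The Product_No attribute name comes from the slides; data is hypothetical.
def demarcate(entities, boundary_types=("MANUF",)):
    """Attach a product number to each (type, text) entity in page order."""
    product_no = 0
    numbered = []
    for etype, text in entities:
        if etype in boundary_types:
            product_no += 1          # a new product description starts here
        numbered.append((etype, text, product_no))
    return numbered

ents = [("MANUF", "Toshiba"), ("MODEL", "Satellite"),
        ("MANUF", "Dell"), ("MODEL", "Inspiron")]
print(demarcate(ents))
```

Downstream Fact Extraction can then group entities sharing a Product_No into a single database record.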

Demarcator Evaluation

1st domain:
         Greek   Italian   English   French
NE       0.77    0.91      0.63      0.52
NUMEX    0.75    0.84      0.54      0.52
TIMEX    0.59    0.72      0.44      0.41

2nd domain:
NE       0.77    0.64      0.47      0.62

Results Overview

• Successful multilingual NERC system which is an integral part of a research platform for extracting information from web pages.

• An architecture that allows for new languages and swift adaptation to new domains.

• Four independent approaches, each of which provides good results.

• Well-motivated corpus collection methodology.

• Publicly distributed corpora for all languages and both domains.


Shared DTDs

Domain 1
NE: MANUF, MODEL, PROCESSOR, SOFT_OS
TIMEX: TIME, DATE, DURATION
NUMEX: LENGTH, WEIGHT, SPEED, CAPACITY, RESOLUTION, MONEY, PERCENT

Domain 2
NE: MUNICIPALITY, REGION, COUNTRY, ORGANIZATION, JOB_TITLE, EDU_TITLE, LANGUAGE, S/W
TIMEX: DATE, DURATION
NUMEX: MONEY
TERM: SCHEDULE, ORG_UNIT

1st Domain Evaluation Results

                   ENERC   FNERC   HNERC   INERC
NE     MANUF       0.52    0.68    0.86    0.93
       MODEL       0.70    0.58    0.71    0.70
       SOFT_OS     0.76    0.90    0.80    0.94
       PROCESSOR   0.91    0.93    0.91    0.96
NUMEX  SPEED       0.78    0.84    0.90    0.88
       CAPACITY    0.90    0.85    0.84    0.96
       LENGTH      0.85    0.61    0.88    0.89
       RESOLUTION  0.96    0.83    0.75    0.89
       MONEY       0.62    0.80    0.80    0.74
       PERCENT     0.67    0.75    0.77    0.86
       WEIGHT      0.96    0.93    1.00    0.88
TIMEX  DATE        0.45    0.84    0.96    0.57
       DURATION    0.73    0.85    0.87    0.41
       TIME        0.47    0.69    1.00    -
Overall (approx.)  0.73    0.77    0.86    0.82

2nd Domain Evaluation Results

                     ENERC   FNERC   HNERC   INERC
NE     MUNICIPALITY  0.70    0.77    0.82    0.92
       REGION        0.65    0.81    0.40    0.94
       COUNTRY       0.87    0.73    0.84    0.86
       ORGANIZATION  0.56    0.58    0.50    0.71
       JOB_TITLE     0.55    0.71    0.50    0.78
       EDU_TITLE     0.36    0.57    0.67    0.82
       LANGUAGE      0.67    0.69    0.95    0.83
       S/W           0.55    0.82    0.70    0.75
NUMEX  MONEY         0.25    0.93    0.00    0.00
TIMEX  DATE          0.79    0.61    0.93    0.77
       DURATION      0.83    0.88    0.91    0.74
TERM   ORG_UNIT      0.37    0.66    0.39    0.51
       SCHEDULE      0.00    0.57    0.00    0.40
Overall              0.59    0.75    0.68    0.77