WP2: Named Entity Recognition and Classification
Claire Grover, University of Edinburgh
Final Review, 31 October 2003



Multilingual IE Architecture

[Architecture diagram: web pages feed into the four monolingual NERC modules (ENERC, FNERC, HNERC, INERC); their output passes through the Demarcator to Fact Extraction and into a database, with a shared Domain Ontology supporting all components.]

WP2: Objectives

• Specification of language-neutral NERC architecture (month 6: D2.1)

• NERC v.1: adaptation and integration of the four existing NERC modules (month 12: D2.2)

• Specification of Corpus Collection Methodology

• NERC v.2: improvement of NERC v.1, incorporation of name matching (month 18: D2.3)

• NERC-based Demarcation

• NERC v.3: improvement of NERC v.2, incorporation of rapid adaptation mechanisms, porting to the 2nd domain (month 26: D2.4)

Features Specific to CROSSMARC NERC

• Multilinguality. Currently 4 languages but should be able to add new languages.

• Web pages as input. Conversion of HTML to XHTML and use of XML as common exchange format with a specific DTD per domain.

• Extensible to new domains. There is a need to rapidly add new domains.

Shared Features of the NERC Components

• XHTML input and output, shared DTD

• Shared domain ontology

• Each reuses existing NLP tools and linguistic resources

• Stepwise transformation of the XHTML to incrementally add mark-up, e.g. tokenisation, sentence identification, part-of-speech tagging, entity recognition.
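The stepwise pipeline can be pictured as each stage reading the XML document and adding one layer of annotation. A minimal sketch of the tokenisation step, using hypothetical element names (`<p>`, `<w>`) rather than the project's actual DTD:

```python
# Sketch of incremental XML annotation; element names are illustrative,
# not taken from the CROSSMARC DTDs.
import xml.etree.ElementTree as ET

def tokenise(doc: ET.Element) -> ET.Element:
    """One pipeline step: replace the text of each <p> with <w> token elements."""
    for p in doc.iter("p"):
        words = (p.text or "").split()
        p.text = None
        for word in words:
            w = ET.SubElement(p, "w")
            w.text = word
    return doc

doc = ET.fromstring("<body><p>Pentium III 800 MHz</p></body>")
doc = tokenise(doc)  # later steps would add sentence, POS and entity mark-up
print(ET.tostring(doc, encoding="unicode"))
```

Because each step only adds mark-up, later components (sentence splitting, POS tagging, entity recognition) can be chained in the same way over the growing document.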

NERC Version 2

• Final version of NERC for the 1st domain

• All four monolingual systems use hand-coded rule sets
– HNERC uses the Ellogon Text Engineering Platform.
– ENERC uses the LT TTT and LT XML tools and adds XML annotations incrementally.
– INERC is implemented as a sequence of XSLT transformations of the XML document.
– FNERC uses Lingway’s XTIRP Extraction Tool, which applies a sequence of rule-based modules.

NERC Version 3

• Reported in D2.4.

• Final version of NERC, dealing with the 2nd domain.

• Main focus is customisation methodology and experimentation to allow rapid adaptation to new domains.

• Because the monolingual components of the NERC architecture differ from each other, customisation methods are defined per component.

ENERC Customisation Methodology

• Retain XML pipeline architecture.

• Replace the named entity rule sets with a maximum entropy tagger.

• Experiments with the C&C Tagger and OpenNLP.

• Limited human intervention (selection of appropriate features).
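The "selection of appropriate features" is the main human input to a maximum-entropy tagger: each token is turned into a feature dictionary that the tagger classifies. A sketch of typical token features (illustrative only; this is not the actual C&C or OpenNLP feature set):

```python
# Illustrative NE-tagging features for a maximum-entropy classifier.
# The feature names and the example tokens are hypothetical.
def token_features(tokens, i):
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalised": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

feats = token_features(["Intel", "Pentium", "III"], 1)
print(feats["is_capitalised"], feats["prev_word"])  # True intel
```

A trained model then weights these features to choose an entity tag per token, so porting to a new domain mostly means re-training on new annotated data rather than rewriting rules.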

FNERC Customisation Methodology

• Retain XTIRP-based architecture and modules.

• Use machine learning to assist in the acquisition of regular-expression named entity rules.

• The machine-learning module produces a first version of human-readable rules plus lists of examples and counter-examples.

• The human expert modifies the rule set appropriately.

• This method reduces rule-set development time to about a third.
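One very naive way to picture how a first draft of human-readable regex rules could be induced from annotated examples (this is a sketch of the general idea only, not Lingway's actual algorithm):

```python
# Naive regex induction sketch: generalise literal examples into a draft
# pattern a human expert can then edit. Illustrative only.
import re

def induce_pattern(examples):
    """Map each example to a character-class 'shape' and join the shapes."""
    def shape(s):
        out = []
        for ch in s:
            if ch.isdigit():
                tok = r"\d"
            elif ch.isalpha():
                tok = "[A-Za-z]"
            else:
                tok = re.escape(ch)
            if not out or out[-1] != tok:
                out.append(tok)
        return "".join(t + "+" for t in out)
    return "|".join(sorted({shape(e) for e in examples}))

draft = induce_pattern(["800 MHz", "1.2 GHz"])
print(draft)
```

The draft matches unseen strings of the same shape (e.g. "2.4 GHz"), and the expert's job shifts from writing rules to pruning over-general ones against the counter-example lists.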

HNERC Customisation Methodology

• ML-HNERC comprises:

• Token-based HNERC
– operates over word tokens, treating NERC as a tagging problem.
– word-token classification is performed by five independent taggers, with the final tag chosen through a simple majority voter.

• Phrase-based HNERC
– operates over phrases which have been identified using a grammar automatically induced from the training corpus.
– uses a C4.5 decision-tree classifier to recognize phrases that describe entities.
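The majority voting in token-based HNERC can be sketched as follows (tagger outputs and tag names are hypothetical; the real system combines five taggers):

```python
# Simple per-token majority voting over several taggers' tag sequences.
from collections import Counter

def vote(tag_sequences):
    """For each token position, pick the tag most taggers agree on."""
    return [Counter(tags).most_common(1)[0][0]
            for tags in zip(*tag_sequences)]

# three hypothetical taggers disagreeing on some tokens
final = vote([
    ["B-MANUF", "O", "B-MODEL"],
    ["B-MANUF", "O", "O"],
    ["B-MANUF", "B-MODEL", "B-MODEL"],
])
print(final)  # ['B-MANUF', 'O', 'B-MODEL']
```

Voting lets individually weak taggers correct each other's errors, which is why the combined output can beat any single tagger.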

INERC Customisation Methodology

• INERC is modular, with components which are general and reusable in new domains. Customization can therefore be restricted to the lexical knowledge bases.

• A statistically driven process generalizes from the annotated corpus material to derive broader lexical resources.

• A frequency score is computed to decide which terms to add when expanding the lexical resources.
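A minimal sketch of frequency-based lexicon expansion, assuming token-level annotations and a simple count threshold (the actual INERC scoring is statistical but not specified in these slides; all names and data below are hypothetical):

```python
# Frequency-threshold lexicon expansion sketch (illustrative only).
from collections import Counter

def expand_lexicon(annotated, seed, category, min_freq=2):
    """Add terms annotated with `category` at least `min_freq` times."""
    counts = Counter(tok for tok, cat in annotated if cat == category)
    return seed | {tok for tok, n in counts.items() if n >= min_freq}

annotated = [("Toshiba", "MANUF"), ("Toshiba", "MANUF"),
             ("Satellite", "MODEL"), ("Dell", "MANUF")]
lex = expand_lexicon(annotated, {"Compaq"}, "MANUF")
print(sorted(lex))  # ['Compaq', 'Toshiba']
```

The threshold trades coverage against noise: rare annotations ("Dell" above, seen only once) stay out of the lexicon until more evidence accumulates.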

Evaluation Methodology

• For both domains we have a hand annotated corpus of 100 pages per language, split 50-50 into training and testing material.

• Each monolingual NERC is evaluated against the testing corpus.

• Standard measures of precision, recall and f-measure are used.
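The standard measures can be computed over the sets of gold and predicted entities; a small sketch (the entity encoding as (type, start, end) tuples is an assumption for illustration):

```python
# Precision, recall and balanced F-measure over entity sets.
def prf(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # correctly found entities
    p = tp / len(predicted) if predicted else 0.0    # precision
    r = tp / len(gold) if gold else 0.0              # recall
    f = 2 * p * r / (p + r) if p + r else 0.0        # harmonic mean
    return p, r, f

# hypothetical (type, start, end) entity spans
gold = {("MANUF", 0, 1), ("MODEL", 2, 4), ("SPEED", 5, 7)}
pred = {("MANUF", 0, 1), ("MODEL", 2, 3)}
p, r, f = prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.33 0.4
```

Note that a predicted entity with the right type but wrong span ("MODEL" above) counts against both precision and recall under this strict matching.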

Evaluation Summary

         Domain 1 F-score   Domain 2 F-score
ENERC    0.73               0.59
FNERC    0.77               0.75
HNERC    0.86               0.68
INERC    0.82               0.77

Conclusions

• The rule-based approach gives better results, but it is knowledge-intensive and requires significant resources for customisation to each new domain.

• The FNERC approach to rule induction is promising.

• In our experiments the machine-learning approaches give lower results but:
– they allow easy adaptation to new domains
– there is scope to improve performance
– more training material would give better performance.

Other WP2 Activities

• Collection and annotation of corpora for each language and domain.

• NERC-based Demarcation

Corpus Collection Methodology

• For each domain the process follows two steps:
– identification of interesting characteristics of product descriptions, and the collection of statistics relevant to these characteristics from at least 50 different sites per language.
– collection of pages and their separation into training and testing corpora.

Corpus Collection Principles

Domain-independent principles:
• Training and testing corpora have the same number of pages.
• Corpus size is fixed for all languages.
• Corpora are representative of the statistics found per language in the site classification step.

Domain-specific principles:
• The maximum number of pages from one site allowed in a corpus must be decided depending on the domain.
• The testing corpus must contain X pages that come from sites not represented in the training corpus.

Annotation

• Annotation performed using NCSR’s annotation tool.

• Annotation guidelines drawn up per domain.

• Each corpus annotated by two separate annotators, with inter-annotator agreement checked.

• The final corpus is the result of correcting the cases of disagreement.
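A simple way to sketch an agreement check over parallel token labels (the slides do not specify which agreement measure the project used; this shows plain observed agreement on hypothetical labels):

```python
# Observed inter-annotator agreement over parallel token labels (sketch).
def agreement(ann1, ann2):
    """Fraction of tokens the two annotators labelled identically."""
    assert len(ann1) == len(ann2), "annotations must be parallel"
    matches = sum(a == b for a, b in zip(ann1, ann2))
    return matches / len(ann1)

a1 = ["MANUF", "O", "MODEL", "O"]
a2 = ["MANUF", "O", "O", "O"]
print(agreement(a1, a2))  # 0.75
```

Disagreeing positions (the third token here) are exactly the cases that get adjudicated when producing the final corpus.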

NERC-Based Demarcator

• Operates after NERC and before FE (Fact Extraction).

• Locates different product descriptions inside a web page.

• Current version is heuristics-based.

• Characteristic information:
– 1st domain: manufacturer, model, price
– 2nd domain: job_title, organization, education title

• Output: Product_No attribute on entities.
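One way a heuristic demarcator of this kind can work is to start a new product description whenever a boundary-type entity (such as a manufacturer) appears, numbering everything in between. This is a sketch of the idea, not the project's actual heuristics, which also use model and price information:

```python
# Heuristic demarcation sketch: number entities by product description.
# The Product_No attribute name comes from the slides; data is hypothetical.
def demarcate(entities, boundary_types=("MANUF",)):
    """Attach a product number to each (type, text) entity in page order."""
    product_no = 0
    numbered = []
    for etype, text in entities:
        if etype in boundary_types:
            product_no += 1          # a new product description starts here
        numbered.append((etype, text, product_no))
    return numbered

ents = [("MANUF", "Toshiba"), ("MODEL", "Satellite"),
        ("MANUF", "Dell"), ("MODEL", "Inspiron")]
print(demarcate(ents))
```

Downstream Fact Extraction can then group entities sharing a Product_No into a single database record.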

Demarcator Evaluation

1st domain:
         Greek   Italian   English   French
NE       0.77    0.91      0.63      0.52
NUMEX    0.75    0.84      0.54      0.52
TIMEX    0.59    0.72      0.44      0.41

2nd domain:
NE       0.77    0.64      0.47      0.62

Results Overview

• Successful multilingual NERC system which is an integral part of a research platform for extracting information from web pages.

• An architecture that allows for new languages and swift adaptation to new domains.

• Four independent approaches, each of which provides good results.

• Well-motivated corpus collection methodology.

• Publicly distributed corpora for all languages and both domains.


Shared DTDs

Domain 1
NE: MANUF, MODEL, PROCESSOR, SOFT_OS
TIMEX: TIME, DATE, DURATION
NUMEX: LENGTH, WEIGHT, SPEED, CAPACITY, RESOLUTION, MONEY, PERCENT

Domain 2
NE: MUNICIPALITY, REGION, COUNTRY, ORGANIZATION, JOB_TITLE, EDU_TITLE, LANGUAGE, S/W
TIMEX: DATE, DURATION
NUMEX: MONEY
TERM: SCHEDULE, ORG_UNIT

1st Domain Evaluation Results

                   ENERC   FNERC   HNERC   INERC
NE     MANUF       0.52    0.68    0.86    0.93
       MODEL       0.70    0.58    0.71    0.70
       SOFT_OS     0.76    0.90    0.80    0.94
       PROCESSOR   0.91    0.93    0.91    0.96
NUMEX  SPEED       0.78    0.84    0.90    0.88
       CAPACITY    0.90    0.85    0.84    0.96
       LENGTH      0.85    0.61    0.88    0.89
       RESOLUTION  0.96    0.83    0.75    0.89
       MONEY       0.62    0.80    0.80    0.74
       PERCENT     0.67    0.75    0.77    0.86
       WEIGHT      0.96    0.93    1.00    0.88
TIMEX  DATE        0.45    0.84    0.96    0.57
       DURATION    0.73    0.85    0.87    0.41
       TIME        0.47    0.69    1.00    -
Overall (approx.)  0.73    0.77    0.86    0.82

2nd Domain Evaluation Results

                     ENERC   FNERC   HNERC   INERC
NE     MUNICIPALITY  0.70    0.77    0.82    0.92
       REGION        0.65    0.81    0.40    0.94
       COUNTRY       0.87    0.73    0.84    0.86
       ORGANIZATION  0.56    0.58    0.50    0.71
       JOB_TITLE     0.55    0.71    0.50    0.78
       EDU_TITLE     0.36    0.57    0.67    0.82
       LANGUAGE      0.67    0.69    0.95    0.83
       S/W           0.55    0.82    0.70    0.75
NUMEX  MONEY         0.25    0.93    0.00    0.00
TIMEX  DATE          0.79    0.61    0.93    0.77
       DURATION      0.83    0.88    0.91    0.74
TERM   ORG_UNIT      0.37    0.66    0.39    0.51
       SCHEDULE      0.00    0.57    0.00    0.40
Overall              0.59    0.75    0.68    0.77