
Page 1:

Named Entity Recognition

http://gate.ac.uk/ http://nlp.shef.ac.uk/

Hamish Cunningham, Kalina Bontcheva

RANLP, Borovets, Bulgaria, 8th September 2003

Page 2:

Structure of the Tutorial

• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
  – preprocessing
  – approaches to NE
  – baseline
  – rule-based approaches
  – learning-based approaches
• multilinguality
• future challenges

Page 3:

Information Extraction

• Information Extraction (IE) pulls facts and structured information from the content of large text collections.

• IR – IE – NLU
• MUC: Message Understanding Conferences
• ACE: Automatic Content Extraction

Page 4:

MUC-7 tasks

• NE: Named Entity recognition and typing
• CO: co-reference resolution
• TE: Template Elements
• TR: Template Relations
• ST: Scenario Templates

Page 5:

An Example

• The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
• NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
• CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
• TE: the rocket is "shiny red" and Head's "brainchild"
• TR: Dr. Head works for We Build Rockets Inc.
• ST: a rocket launching event occurred with the various participants

Page 6:

Performance levels

• Vary according to text type, domain, scenario, language
• NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
• CO: 60-70% resolution
• TE: 80%
• TR: 75-80%
• ST: 60% (but: human level may be only 80%)

Page 7:

What are Named Entities?

• NER involves identification of proper names in texts, and classification into a set of predefined categories of interest
• Person names
• Organizations (companies, government organisations, committees, etc.)
• Locations (cities, countries, rivers, etc.)
• Date and time expressions

Page 8:

What are Named Entities (2)

• Other common types: measures (percent, money, weight, etc.), email addresses, Web addresses, street addresses, etc.
• Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references, etc.
• MUC-7 entity definition guidelines (Chinchor '97): http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

Page 9:

What are NOT NEs (MUC-7)

• Artefacts – Wall Street Journal
• Common nouns referring to named entities – the company, the committee
• Names of groups of people and things named after people – the Tories, the Nobel prize
• Adjectives derived from names – Bulgarian, Chinese
• Numbers which are not times, dates, percentages, or money amounts

Page 10:

Basic Problems in NE

• Variation of NEs – e.g. John Smith, Mr Smith, John
• Ambiguity of NE types:
  – John Smith (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"

Page 11:

More complex problems in NE

• Issues of style, structure, domain, genre, etc.
• Punctuation, spelling, spacing, formatting, ... all have an impact:

  Dept. of Computing and Maths
  Manchester Metropolitan University
  Manchester
  United Kingdom

  Tell me more about Leonardo
  Da Vinci

Page 12:

Structure of the Tutorial

• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
  – preprocessing
  – approaches to NE
  – baseline
  – rule-based approaches
  – learning-based approaches
• multilinguality
• future challenges

Page 13:

Applications

• Can help summarisation, ASR and MT
• Intelligent document access:
  – browse document collections by the entities that occur in them
  – formulate more complex queries than IR can answer
  – application domains:
    • News
    • Scientific articles, e.g. MEDLINE abstracts

Page 14:

Structure of the Tutorial

• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
  – preprocessing
  – approaches to NE
  – baseline
  – rule-based approaches
  – learning-based approaches
• multilinguality
• future challenges

Page 15:

Some NE Annotated Corpora

• MUC-6 and MUC-7 corpora – English
• CONLL shared task corpora:
  – http://cnts.uia.ac.be/conll2003/ner/ – NEs in English and German
  – http://cnts.uia.ac.be/conll2002/ner/ – NEs in Spanish and Dutch
• TIDES surprise language exercise (NEs in Cebuano and Hindi)
• ACE – English – http://www.ldc.upenn.edu/Projects/ACE/

Page 16:

The MUC-7 corpus

• 100 documents in SGML
• News domain
• 1880 Organizations (46%)
• 1324 Locations (32%)
• 887 Persons (22%)
• http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf

Page 17:

The MUC-7 Corpus (2)

<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.

<p>Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.

Page 18:

NE Annotation Tools - Alembic

Page 19:

NE Annotation Tools – Alembic (2)

Page 20:

NE Annotation Tools - GATE

Page 21:

Corpora and System Development

• Corpora are typically divided into a training and a testing portion
• Rules/learning algorithms are trained on the training part
• They are tuned on the testing portion in order to optimise:
  – rule priorities, rule effectiveness, etc.
  – parameters of the learning algorithm and the features used
• Evaluation set – the best system configuration is run on this data and the system performance is obtained
• No further tuning once the evaluation set has been used!

Page 22:

Structure of the Tutorial

• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
  – preprocessing
  – approaches to NE
  – baseline
  – rule-based approaches
  – learning-based approaches
• multilinguality
• future challenges

Page 23:

Performance Evaluation

• Evaluation metric – mathematically defines how to measure the system's performance against a human-annotated gold standard
• Scoring program – implements the metric and provides performance measures:
  – for each document and over the entire corpus
  – for each type of NE

Page 24:

The Evaluation Metric

• Precision = correct answers / answers produced
• Recall = correct answers / total possible correct answers
• Trade-off between precision and recall
• F-measure = (β² + 1)PR / (β²R + P)   [van Rijsbergen 75]
• β reflects the weighting between precision and recall; typically β = 1

Page 25:

The Evaluation Metric (2)

• Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
• Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
• Why: NE boundaries are often misplaced, so there are some partially correct results (a small worked sketch follows below)
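As a minimal sketch in Python, assuming plain integer counts of correct, partial, incorrect and missing answers, exactly as in the formulas above:

def precision(cor, par, inc):
    # Precision = (Correct + 0.5 * Partial) / (Correct + Incorrect + Partial)
    return (cor + 0.5 * par) / (cor + inc + par)

def recall(cor, par, mis):
    # Recall = (Correct + 0.5 * Partial) / (Correct + Missing + Partial)
    return (cor + 0.5 * par) / (cor + mis + par)

def f_measure(p, r, beta=1.0):
    # F = (beta^2 + 1) * P * R / (beta^2 * R + P)   [van Rijsbergen 75]
    return (beta ** 2 + 1) * p * r / (beta ** 2 * r + p)

# Example: 8 correct, 2 partial, 1 incorrect, 3 missing
# precision(8, 2, 1) = 9/11 ≈ 0.82; recall(8, 2, 3) = 9/13 ≈ 0.69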

Page 26:

The MUC scorer (1)

Document: 9601020572
-----------------------------------------------------------------
               POS  ACT | COR PAR INC | MIS SPU NON | REC PRE
------------------------+-------------+-------------+-----------
SUBTASK SCORES          |             |             |
enamex                  |             |             |
 organization   11   12 |   9   0   0 |   2   3   0 |  82  75
 person         24   26 |  24   0   0 |   0   2   0 | 100  92
 location       27   31 |  25   0   0 |   2   6   0 |  93  81
 ...

* * * SUMMARY SCORES * * *
-----------------------------------------------------------------
               POS  ACT | COR PAR INC | MIS SPU NON | REC PRE
------------------------+-------------+-------------+-----------
TASK SCORES             |             |             |
enamex                  |             |             |
 organization 1855 1757 |1553   0  37 | 265 167  30 |  84  88
 person        883  859 | 797   0  13 |  73  49   4 |  90  93
 location     1322 1406 |1199   0  13 | 110 194   7 |  91  85

Page 27:

The MUC scorer (2)

• Tracking errors in each document, for each instance in the text

ENAMEX cor inc PERSON PERSON "Wernher von Braun" "Braun"
ENAMEX cor inc PERSON PERSON "von Braun" "Braun"
ENAMEX cor cor PERSON PERSON "Braun" "Braun"
...
ENAMEX cor cor LOCATI LOCATI "Saturn" "Saturn"
...

Page 28:

The GATE Evaluation Tool

Page 29:

Regression Testing

• Need to track the system's performance over time
• When a change is made to the system, we want to know what the implications are over the entire corpus
• Why: because an improvement in one case can lead to problems in others
• GATE offers an automated tool to help with the NE development task over time

Page 30:

Regression Testing (2)

At corpus level – GATE's corpus benchmark tool – tracking the system's performance over time

Page 31:

Structure of the Tutorial

• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
  – preprocessing
  – approaches to NE
  – baseline
  – rule-based approaches
  – learning-based approaches
• multilinguality
• future challenges

Page 32:

Pre-processing for NE Recognition

• Format detection
• Word segmentation (for languages like Chinese)
• Tokenisation
• Sentence splitting
• POS tagging

Page 33:

Two kinds of NE approaches

Knowledge Engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• requires only small amount of training data
• development could be very time consuming
• some changes may be hard to accommodate

Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• requires large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus
• annotators are cheap (but you get what you pay for!)

Page 34:

List lookup approach - baseline

• System that recognises only entities stored in its lists (gazetteers)
• Advantages – simple, fast, language independent, easy to retarget (just create lists)
• Disadvantages – collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity (see the toy sketch below)
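A toy Python illustration of such a baseline (the gazetteer entries here are invented; a real system would load its lists from files), using longest-match lookup over tokens:

GAZETTEER = {
    ("John", "Smith"): "PERSON",
    ("New", "York"): "LOCATION",
    ("IBM",): "ORGANIZATION",
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)

def lookup(tokens):
    """Longest-match gazetteer tagging: returns (start, end, type) spans."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            entry = tuple(tokens[i:i + n])
            if entry in GAZETTEER:
                spans.append((i, i + n, GAZETTEER[entry]))
                i += n
                break
        else:              # no list entry starts at this token: move on
            i += 1
    return spans

# lookup("John Smith visited New York".split())
# -> [(0, 2, 'PERSON'), (3, 5, 'LOCATION')]

Note how this illustrates the disadvantages above: "John" alone, or an unlisted variant such as "Mr Smith", is simply missed.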

Page 35:

Creating Gazetteer Lists

• Online phone directories and yellow pages for person and organisation names (e.g. [Paskaleva02])
• Location lists:
  – US GEOnet Names Server (GNS) data – 3.9 million locations with 5.37 million names (e.g. [Manov03])
  – UN site: http://unstats.un.org/unsd/citydata
  – Global Discovery database from Europa Technologies Ltd, UK (e.g. [Ignat03])
• Automatic collection from annotated training data

Page 36:

Structure of the Tutorial

• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
  – preprocessing
  – approaches to NE
  – baseline
  – rule-based approaches
  – learning-based approaches
• multilinguality
• future challenges

Page 37:

Shallow Parsing Approach (internal structure)

• Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location:
  – Cap. Word + {City, Forest, Center, River}, e.g. Sherwood Forest
  – Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
  (a regex sketch of these patterns follows below)
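A rough regex sketch of these internal-evidence patterns (trigger-word lists copied from the slide; everything else is an illustrative assumption):

import re

# Capitalised word(s) followed by a location "trigger" word
NATURAL = re.compile(r"\b(?:[A-Z][a-z]+ )+(?:City|Forest|Center|River)\b")
STREET = re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Street|Boulevard|Avenue|Crescent|Road)\b")

text = "They met on Portobello Street, not far from Sherwood Forest."
print(NATURAL.findall(text))  # ['Sherwood Forest']
print(STREET.findall(text))   # ['Portobello Street']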

Page 38:

Problems with the shallow parsing approach

• Ambiguously capitalised words (first word in a sentence): [All American Bank] vs. All [State Police]
• Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organisation
• Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

Page 39:

Shallow Parsing Approach with Context

• Use of context-based patterns is helpful in ambiguous cases
• "David Walton" and "Goldman Sachs" are indistinguishable
• But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly (sketched below)
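A Python sketch of this contextual rule (the helper and its interface are hypothetical, purely for illustration):

import re

def orgs_after_person(text, persons):
    """Apply the "[Person] of [Organization]" pattern: when a known
    Person is followed by 'of' + a capitalised phrase, tag that
    phrase as an Organization."""
    orgs = []
    for person in persons:
        m = re.search(re.escape(person) + r" of ((?:[A-Z][\w&.]* ?)+)", text)
        if m:
            orgs.append(m.group(1).strip())
    return orgs

# orgs_after_person("David Walton of Goldman Sachs said ...", ["David Walton"])
# -> ['Goldman Sachs']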

Page 40:

Identification of Contextual Information

• Use KWIC index and concordancer to find windows of context around entities

• Search for repeated contextual patterns of either strings, other entities, or both

• Manually post-edit list of patterns, and incorporate useful patterns into new rules

• Repeat with new entities

Page 41:

Examples of context patterns

• [PERSON] earns [MONEY]
• [PERSON] joined [ORGANIZATION]
• [PERSON] left [ORGANIZATION]
• [PERSON] joined [ORGANIZATION] as [JOBTITLE]
• [ORGANIZATION]'s [JOBTITLE] [PERSON]
• [ORGANIZATION] [JOBTITLE] [PERSON]
• the [ORGANIZATION] [JOBTITLE]
• part of the [ORGANIZATION]
• [ORGANIZATION] headquarters in [LOCATION]
• price of [ORGANIZATION]
• sale of [ORGANIZATION]
• investors in [ORGANIZATION]
• [ORGANIZATION] is worth [MONEY]
• [JOBTITLE] [PERSON]
• [PERSON], [JOBTITLE]

Page 42:

Caveats

• Patterns are only indicators, based on likelihood
• Can set priorities based on frequency thresholds
• Need training data for each domain
• More semantic information would be useful (e.g. to cluster groups of verbs)

Page 43:

Rule-based Example: FACILE

• FACILE – used in MUC-7 [Black et al 98]
• Uses Inxight's LinguistiX tools for tagging and morphological analysis
• Database for external information, with a role similar to a gazetteer
• Linguistic info per token, encoded as a feature vector:
  – text offsets
  – orthographic pattern (first/all capitals, mixed, lowercase)
  – token and its normalised form
  – syntax – category and features
  – semantics – from database or morphological analysis
  – morphological analyses
• Example:
  (1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (ˆPER_CIV_F) (("Mrs." "Title" "Abbr")) NIL)
  PER_CIV_F – female civilian (from database)

Page 44:

FACILE (2)

• Context-sensitive rules written in a special rule notation, executed by an interpreter (writing such rules directly in Perl is too error-prone and hard)
• Rules of the kind A => B\C/D, where:
  – A is a set of attribute-value expressions with an optional score; the attributes refer to elements of the input token feature vector
  – B and D are the left and right context respectively, and can be empty
  – B, C, D are sequences of attribute-value pairs and Kleene regular expression operations; variables are also supported
• Example:
  [syn=NP, sem=ORG] (0.9) =>
  \ [norm="university"], [token="of"], [sem=REGION|COUNTRY|CITY] / ;

Page 45:

FACILE (3)

# Rule for the mark-up of person names when the first name is not
# present or known from the gazetteers, e.g. 'Mr J. Cass':
[SYN=PROP, SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M, LAST=_S]
    # _F, _I, _M, _S are variables; they transfer info from the RHS
=> [SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE]
\  [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
   [ORTH=C|A, SYN=PROP, TOKEN=_F]?,
   [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
   [SYN=NAME, TOKEN=_M]?,
   [ORTH=C|A|O, SYN=PROP, TOKEN=_S, SOURCE!=RULE]   # proper name, not recognised by a rule
/ ;

Page 46:

FACILE (4)

• Preference mechanism:
  – the rule with the highest score is preferred
  – longer matches are preferred to shorter matches
  – the result is always one semantic categorisation of the named entity in the text
• Evaluation (MUC-7 scores):
  – Organization: 86% precision, 66% recall
  – Person: 90% precision, 88% recall
  – Location: 81% precision, 80% recall
  – Dates: 93% precision, 86% recall

Page 47:

Example Rule-based System - ANNIE

• Created as part of GATE
• GATE – Sheffield's open-source infrastructure for language processing
• GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
• GATE has a finite-state pattern-action rule language, used by ANNIE

Page 48:

NE Components

The ANNIE system – a reusable and easily extendable set of components

Page 49:

Gazetteer lists for rule-based NE

• Needed to store the indicator strings for the internal structure and context rules
• Internal location indicators – e.g. {river, mountain, forest} for natural locations; {street, road, crescent, place, square, ...} for address locations
• Internal organisation indicators – e.g. company designators {GmbH, Ltd, Inc, ...}
• Produces Lookup results of the given kind

Page 50:

The Named Entity Grammars

• Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
• Hand-coded rules applied to annotations to identify NEs
• Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
• Use of contextual information
• Finds person names, locations, organisations, dates, addresses

Page 51:

NE Rule in JAPE

JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
• Simplifies multi-phase regex processing

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+   // from tokeniser
  {Lookup.kind == companyDesignator}         // from gazetteer lists
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }

Page 52:

Named Entities in GATE

Page 53:

Using co-reference to classify ambiguous NEs

• Orthographic co-reference module that matches proper names in a document
• Improves NE results by assigning entity types to previously unclassified names, based on relations with classified NEs
• May not reclassify already classified entities
• Classification of unknown entities is very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM] (see the sketch below)
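A much-simplified Python sketch of this kind of orthographic matching (the real ANNIE orthomatcher has many more rules; the designator list here is an abbreviated assumption):

DESIGNATORS = {"Ltd.", "Inc.", "GmbH"}

def ortho_match(short, full):
    """True if `short` plausibly refers to the same entity as `full`."""
    tokens = [t for t in full.split() if t not in DESIGNATORS]
    if short in tokens:                       # surname matches a full name
        return True
    initials = "".join(t[0] for t in tokens if t[0].isupper())
    return short == initials                  # acronym matches the full form

# ortho_match("Bonfield", "Sir Peter Bonfield")               -> True
# ortho_match("IBM", "International Business Machines Ltd.")  -> True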

Page 54:

Named Entity Coreference

Page 55:

Structure of the Tutorial

• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
  – preprocessing
  – approaches to NE
  – baseline
  – rule-based approaches
  – learning-based approaches
• multilinguality
• future challenges

Page 56:

Machine Learning Approaches

• ML approaches frequently break the NE task down into two parts:
  – recognising the entity boundaries
  – classifying the entities into the NE categories
• Some work is only on one task or the other
• Tokens in text are often coded with the IOB scheme (decoded in the sketch below):
  – O – outside; B-XXX – first word in an NE; I-XXX – all other words in an NE
  – easy to convert to/from inline MUC-style markup

    Argentina   B-LOC
    played      O
    with        O
    Del         B-PER
    Bosque      I-PER
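Decoding the IOB scheme back into typed spans takes a few lines of Python (a sketch, assuming a plain list of tags parallel to the tokens):

def iob_to_spans(tags):
    """Convert a list of IOB tags into (start, end, type) spans."""
    spans, start, kind = [], None, None
    for i, tag in enumerate(tags + ["O"]):    # sentinel flushes the last NE
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, kind))
                start = kind = None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans

# iob_to_spans(["B-LOC", "O", "O", "B-PER", "I-PER"])
# -> [(0, 1, 'LOC'), (3, 5, 'PER')]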

Page 57:

IdentiFinder [Bikel et al 99]

• Based on Hidden Markov Models
• Their HMM has 7 regions – one for each MUC type, not-name, begin-sentence and end-sentence
• Features:
  – capitalisation
  – numeric symbols
  – punctuation marks
  – position in the sentence
  – 14 features in total, combining the above info, e.g. containsDigitAndDash (09-96), containsDigitAndComma (23,000.00) – see the sketch below
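Such word features are simple predicates on the token string; approximately (a sketch covering a few of the fourteen classes, with names paraphrased from the paper):

def word_feature(w):
    """Map a token to one coarse feature class, in the spirit of
    [Bikel et al 99]."""
    has_digit = any(c.isdigit() for c in w)
    if has_digit and "-" in w:
        return "containsDigitAndDash"    # e.g. "09-96": date-like
    if has_digit and "," in w:
        return "containsDigitAndComma"   # e.g. "23,000.00": amount-like
    if w.isdigit():
        return "number"
    if w[:1].isupper():
        return "capitalised"
    return "other"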

Page 58:

IdentiFinder (2)

• Back-off models and smoothing
• Unknown words
• Further back-off and smoothing
• Different strategies for name-class bigrams, first-word bigrams and non-first-word bigrams

Page 59:

IdentiFinder (3)

• MUC-6 (English) and MET-1 (Spanish) corpora used for evaluation
• Mixed-case English:
  – IdentiFinder – 94.9% f-measure
  – best rule-based – 96.4%
• Spanish mixed case:
  – IdentiFinder – 90%
  – best rule-based – 93%
  – (lower-case names, noisy training data, less training data)
• Training data: 650,000 words, but similar performance with half of the data; less than 100,000 words reduces the performance to below 90% on English

Page 60:

MENE [Borthwick et al 98]

• Combines rule-based and ML NE to achieve better performance
• Tokens tagged as: XXX_start, XXX_continue, XXX_end, XXX_unique, other (non-NE), where XXX is an NE category
• Uses Maximum Entropy:
  – one only needs to find the best features for the problem
  – the ME estimation routine finds the best relative weights for the features

Page 61:

MENE (2)

• Features (a small extraction sketch follows below):
  – binary features – "token begins with capitalised letter", "token is a four-digit number"
  – lexical features – dependencies on the surrounding tokens (window ±2), e.g. "Mr" for people, "to" for locations
  – dictionary features – equivalent to gazetteers (first names, company names, dates, abbreviations)
  – external systems – whether the current token is recognised as an NE by a rule-based system
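A sketch of how such features might be collected for one token (the feature names and dictionary interface are invented for illustration; MENE's actual feature templates differ in detail):

def features(tokens, i, first_names=frozenset()):
    """Binary, lexical (window of +-2) and dictionary features for token i."""
    w = tokens[i]
    feats = {
        "init-cap": w[:1].isupper(),              # binary feature
        "four-digit-number": w.isdigit() and len(w) == 4,
        "in-first-name-dict": w in first_names,   # dictionary feature
    }
    for off in (-2, -1, 1, 2):                    # lexical context features
        if 0 <= i + off < len(tokens):
            feats["word@%+d=%s" % (off, tokens[i + off])] = True
    return feats

# features("Mr Smith flew to London".split(), 1)
# -> {'init-cap': True, ..., 'word@-1=Mr': True, 'word@+1=flew': True, ...}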

Page 62:

MENE (3)

• MUC-7 formal run corpus:
  – MENE – 84.2% f-measure
  – the rule-based systems it uses – 86%-91%
  – MENE + rule-based systems – 92%
• Learning curve:
  – 20 docs – 80.97%
  – 40 docs – 84.14%
  – 100 docs – 89.17%
  – 425 docs – 92.94%

Page 63:

Fine-grained Classification of NEs [Fleischman 02]

• Finer-grained categorisation needed for applications like question answering
• Person classification into 8 sub-categories: athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police
• Supervised approach using local context and global semantic information such as WordNet
• Used a decision-list classifier and IdentiFinder to automatically construct a training set from untagged data
• A held-out set of 1300 instances was hand-annotated

Page 64:

Fine-grained Classification of NEs (2)

• Word frequency features:
  – for each of the 8 categories, 10 distinct word positions = 80 features per instance
  – 3 words before and after the instance
  – the two-word bigrams immediately before and after the instance
  – the three-word trigrams before/after the instance
• Topic signatures and WordNet information:
  – compute lists of terms that signal relevance to a topic/category [Lin&Hovy 00] and expand them with WordNet synonyms to counter unseen examples
  – Politician – campaign, republican, budget

Page 65:

Fine-grained Classification of NEs (3)

• Due to differing contexts, instances of the same name in a single text were classified differently
• MemRun chooses the prevailing sub-category based on the instances' most frequent classification
• An orthomatching-like algorithm is developed to match George Bush, Bush, and George W. Bush
• Experiments with k-NN, Naïve Bayes, SVMs, Neural Networks and C4.5 show that C4.5 is best
• Future work: treating finer-grained classification as a WSD task (the categories are different senses of a person)

Page 66:

Structure of the Tutorial

• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
  – preprocessing
  – approaches to NE
  – baseline
  – rule-based approaches
  – learning-based approaches
• multilinguality
• future challenges

Page 67:

Multilingual Named Entity Recognition

• Recent experiments are aimed at NE recognition in multiple languages
• The TIDES surprise language evaluation exercise measures how quickly researchers can develop NLP components in a new language
• CONLL'02 and CONLL'03 focus on language-independent NE recognition

Page 68:

Analysis of the NE Task in Multiple Languages [Palmer&Day 97]

Language     NEs   Time/Date  Numeric exprs.  Org/Per/Loc
Chinese     4454     17.2%         1.8%          80.9%
English     2242     10.7%         9.5%          79.8%
French      2321     18.6%         3%            78.4%
Japanese    2146     26.4%         4%            69.6%
Portuguese  3839     17.7%        12.1%          70.3%
Spanish     3579     24.6%         3%            72.5%

Page 69:

Analysis of Multilingual NE (2)

• Numerical and time expressions are very easy to capture using rules
• Together they constitute about 20-30% of all NEs
• All numerical expressions in the 6 languages required only 5 patterns
• Time expressions similarly require only a few rules (fewer than 30 per language)
• Many of these rules are reusable across the languages

Page 70:

Analysis of Multilingual NE (3)

• Suggest a method for calculating the lower bound for system performance given a corpus in the target language
• Conclusion: much of the NE task can be achieved by simple string analysis and common phrasal contexts
• Zipf's law: the prevalence of frequent phenomena allows high scores to be achieved directly from the training data
• The Chinese, Japanese, and Portuguese corpora had a lower bound above 70%
• Substantial further advances require language specificity

Page 71:

What is needed for multilingual NE

• Extensive support for non-Latin scripts and text encodings, including conversion utilities:
  – automatic recognition of encoding [Ignat et al 03]
  – this occupied up to 2/3 of the TIDES Hindi effort
• Bi-lingual dictionaries
• Annotated corpus for evaluation
• Internet resources for gazetteer list collection (e.g. phone books, yellow pages, bi-lingual pages)

Page 72:

Multilingual support - Alembic

Japanese example

Page 73:

                     

Editing Multilingual Data

GATE Unicode Kit (GUK) complements Java's facilities:
• support for defining Input Methods (IMs)
• currently 30 IMs for 17 languages
• pluggable into other applications (e.g. JEdit)

Page 74:

Multilingual Data – GATE

All processing, visualisation and editing tools use GUK

Page 75:

Gazetteer-based Approach to Multilingual NE [Ignat et al 03]

• Deals with locations only
• Even more ambiguity than in one language:
  – multiple places share the same name, such as the fourteen cities and villages in the world called 'Paris'
  – place names are also words in one or more languages, such as 'And' (Iran) or 'Split' (Croatia)
  – places have varying names in different languages (Italian 'Venezia' vs. English 'Venice', German 'Venedig', French 'Venise')

Page 76:

Gazetteer-based multilingual NE (2)

• Disambiguation module applies heuristics based on location size and country mentions (prefer the locations from the country mentioned most – sketched below)
• Performance evaluation:
  – 853 locations from 80 English texts
  – 96.8% precision
  – 96.5% recall
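A Python sketch of the country-preference heuristic (the candidate records and their fields are invented for illustration):

from collections import Counter

def disambiguate(candidates, country_mentions):
    """Prefer the candidate from the country mentioned most often in the
    text; break remaining ties by location size (population)."""
    counts = Counter(country_mentions)
    return max(candidates,
               key=lambda c: (counts[c["country"]], c["population"]))

paris = [{"name": "Paris", "country": "France", "population": 2100000},
         {"name": "Paris", "country": "USA", "population": 25000}]
disambiguate(paris, ["France", "France", "USA"])  # -> the French Paris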

Page 77:

Machine Learning for Multilingual NE

• The CONLL 2002 and 2003 shared tasks were NE in Spanish, Dutch, English, and German
• The most popular ML techniques used:
  – Maximum Entropy (5 systems)
  – Hidden Markov Models (4 systems)
  – connectionist methods (4 systems)
• Combining ML methods has been shown to boost results

Page 78:

ML for NE at CONLL (2)

• The choice of features is at least as important as the choice of ML algorithm:
  – lexical features (words)
  – part-of-speech
  – orthographic information
  – affixes
  – gazetteers
• External, unmarked data is useful for deriving gazetteers and for extracting training instances

Page 79:

ML for NE at CONLL (3)

• English (f-measure):
  – baseline – 59.5%
  – systems – between 60.2% and 88.76%
• German (f-measure):
  – baseline – 30.3%
  – systems – between 47.7% and 72.4%
• Spanish (f-measure):
  – baseline – 35.9%
  – systems – between 60.9% and 81.4%
• Dutch (f-measure):
  – baseline – 53.1%
  – systems – between 56.4% and 77%

Page 80:

Language Independent NE Recognition [Cucerzan&Yarowsky 02]

• Uses iterative learning and re-estimation of contextual and morphological patterns, using trie models
• Learns from unannotated text and requires only a small list of labelled names, without using other language-specific tools
• Word-internal features:
  – some prefixes and suffixes are good indicators
  – for example -escu, -wski, -ova, -ov for person names (see the sketch below)
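A tiny Python sketch of using such suffix evidence (the suffixes are taken from the slide; treating them as a hard test is a simplification, since the paper re-estimates such clues iteratively):

PERSON_SUFFIXES = ("escu", "wski", "ova", "ov")

def person_suffix(word):
    """Morphological clue: word endings that strongly indicate person
    names in several languages."""
    return word.endswith(PERSON_SUFFIXES)   # str.endswith accepts a tuple

# person_suffix("Ionescu")   -> True
# person_suffix("Sharapova") -> True
# person_suffix("London")    -> False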

Page 81:

Language Independent NE Recognition [Cucerzan&Yarowsky 02] (2)

• Classifies all occurrences of an entity in the text together, combining the contextual and morphological clues from each instance
• "One NE class per document/discourse" assumption, similar to the "one sense per discourse" assumption used in word sense disambiguation [Gale,Church&Yarowsky 92]
• 70.5%-75.4% f-measure for Romanian
• Measured on two tasks – NE identification and classification (NE boundaries are pre-defined)

Page 82:

TIDES surprise language exercise

• Collaborative effort between a number of sites to develop resources and tools for various LE tasks on a surprise language
• Tasks: IE (including NE), machine translation, summarisation, cross-language IR
• Dry run lasted 10 days, on the Cebuano language from the Philippines
• The surprise language was Hindi, announced at the start of June 2003; duration 1 month

Page 83:

Language categorisation

• LDC – survey of the 300 largest languages (by population) to establish what resources are available: http://www.ldc.upenn.edu/Projects/TIDES/language-summary-table.html
• Classification dimensions:
  – dictionaries, news texts, parallel texts (e.g. the Bible)
  – script, orthography, words separated by spaces

Page 84:

The Surprise Languages

• Cebuano:
  – Latin script and words are spaced, but
  – few resources and little prior work, so
  – medium difficulty
• Hindi:
  – non-Latin script, different encodings used, words are spaced, no capitalisation
  – many resources available
  – medium difficulty

Page 85:

Named Entity Recognition for TIDES

• Information on other systems and results from TIDES is still unavailable to non-TIDES participants
• It will be made available by the end of 2003 in a special issue of ACM Transactions on Asian Language Information Processing (TALIP): Rapid Development of Language Capabilities: The Surprise Languages
• The Sheffield approach is presented below, because it is not subject to these restrictions

Page 86:

Dictionary-based Adaptation of an English POS tagger

• Substituted a Hindi/Cebuano lexicon for the English one in a Brill-like tagger
• The Hindi/Cebuano lexicon was derived from a bi-lingual dictionary
• Used an empty ruleset, since no training data was available
• Used default heuristics (e.g. return NNP for capitalised words)
• Very experimental, but reasonable results

Page 87:

Evaluation of the Tagger

• No formal evaluation was possible
• Estimated around 67% accuracy on Hindi – evaluated by a native speaker on 1000 words
• Created in 2 person-days
• Results and a tagging service were made available to other researchers in TIDES
• An important pre-requisite for NE recognition

Page 88:

NE grammars

• Most English JAPE rules are based on POS tags and gazetteer lookup
• Grammars can be reused for languages with similar word order, orthography, etc.
• No time to make a detailed study of Cebuano, but it is very similar in structure to English
• Most of the rules were left as for English, with some adjustments, especially to handle dates
• Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages

Page 89:

Page 90:

Evaluation Results

             Cebuano            English baseline
Entity      P    R    F         P    R     F
Person     71   65   68        36   36    36
Org        75   71   73        31   47    38
Location   73   78   76        65    7    12
Date       83  100   92        42   58    49
Total      76   79   77.5      45   41.7  43

Page 91:

Structure of the Tutorial

• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
  – preprocessing
  – approaches to NE
  – baseline
  – rule-based approaches
  – learning-based approaches
• multilinguality
• future challenges

Page 92:

Future challenges

• Towards semantic tagging of entities
• New evaluation metrics for semantic entity recognition
• Expanding the set of entities recognised – e.g. vehicles, weapons, substances (food, drug)
• Finer-grained hierarchies, e.g. types of Organizations (government, commercial, educational, etc.), Locations (regions, countries, cities, water, etc.)

Page 93:

Future challenges (2)

• Standardisation of the annotation formats:
  – [Ide & Romary 02] – RDF-based annotation standards
  – [Collier et al 02] – multi-lingual named entity annotation guidelines
  – aimed at defining how to annotate in order to make corpora more reusable and lower the overhead of writing format conversion tools
• MUC used inline markup
• TIDES and ACE used stand-off markup, but of two different kinds (XML vs. one word per line)

Page 94:

Towards Semantic Tagging of Entities

• The MUC NE task tagged selected segments of text whenever that text represented the name of an entity
• In ACE (Automatic Content Extraction), these names are viewed as mentions of the underlying entities. The main task is to detect (or infer) the mentions in the text of the entities themselves
• ACE focuses on domain- and genre-independent approaches
• The ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)

Page 95:

ACE Entities

• Dealing with:
  – proper names – e.g. England, Mr. Smith, IBM
  – pronouns – e.g. he, she, it
  – nominal mentions – the company, the spokesman
• Identify which mentions in the text refer to which entities, e.g.:
  – Tony Blair, Mr. Blair, he, the prime minister, he
  – Gordon Brown, he, Mr. Brown, the chancellor

Page 96:

ACE Example

<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE"
        entity_type="ORGANIZATION">
  <entity_mention ID="M003" TYPE="NAME"
                  string="National Air Traffic Services"></entity_mention>
  <entity_mention ID="M004" TYPE="NAME" string="NATS"></entity_mention>
  <entity_mention ID="M005" TYPE="PRO" string="its"></entity_mention>
  <entity_mention ID="M006" TYPE="NAME" string="Nats"></entity_mention>
</entity>

Page 97:

ACE Entities (2)

• Some entities can have different roles, i.e. behave as Organizations, Locations, or Persons – GPEs (geo-political entities):
  – New York [GPE – role: Person], flush with Wall Street money, has a lot of loose change jangling in its pockets.
  – All three New York [GPE – role: Location] regional commuter train systems were found to be punctual more than 90 percent of the time.

Page 98:

Further information on ACE

• ACE is a closed-evaluation initiative, which does not allow the publication of results
• Further information on guidelines and corpora is available at http://www.ldc.upenn.edu/Projects/ACE/
• ACE also includes other IE tasks; for further details see Doug Appelt's presentation: http://www.clsp.jhu.edu/ws03/groups/sparse/presentations/doug.ppt

Page 99:

Evaluating Richer NE Tagging

• Need for new metrics when evaluating hierarchy/ontology-based NE tagging
• Need to take into account distance in the hierarchy
• Tagging a company as a charity is less wrong than tagging it as a person (see the sketch below)
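One way such a distance-sensitive score could be computed (a sketch; the toy hierarchy and the 1/(1+distance) weighting are illustrative assumptions, not a standard metric):

PARENT = {"company": "organization", "charity": "organization",
          "organization": "entity", "person": "entity", "entity": None}

def ancestors(t):
    chain = []
    while t is not None:
        chain.append(t)
        t = PARENT[t]
    return chain

def tag_score(gold, predicted):
    """1.0 for an exact tag, decaying with distance in the hierarchy."""
    g, p = ancestors(gold), ancestors(predicted)
    lca = next(a for a in g if a in p)        # lowest common ancestor
    distance = g.index(lca) + p.index(lca)
    return 1.0 / (1 + distance)

# tag_score("company", "company") -> 1.0
# tag_score("company", "charity") -> 0.33  (siblings: less wrong)
# tag_score("company", "person")  -> 0.25  (further apart: more wrong)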

Page 100:

THANK YOU!

Page 101:

Further Reading

• Aberdeen J., Day D., Hirschman L., Robinson P. and Vilain M. MITRE: Description of the Alembic System Used for MUC-6. MUC-6 Proceedings, pages 141-155, Columbia, Maryland, 1995.

• Black W.J., Rinaldi F., Mowatt D. Facile: Description of the NE System Used For MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.

• Borthwick A. A Maximum Entropy Approach to Named Entity Recognition. PhD Dissertation, 1999.

• Bikel D., Schwartz R., Weischedel R. An algorithm that learns what's in a name. Machine Learning 34, pp. 211-231, 1999.

• Carreras X., Màrquez L., Padró L. Named Entity Extraction using AdaBoost. The 6th Conference on Natural Language Learning, 2002.

• Chang J.S., Chen S. D., Zheng Y., Liu X. Z., and Ke S. J. Large-corpus-based methods for Chinese personal name recognition. Journal of Chinese Information Processing, 6(3):7-15, 1992

• Chen H.H., Ding Y.W., Tsai S.C. and Bian G.W. Description of the NTU System Used for MET2. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.

• Chinchor N. MUC-7 Named Entity Task Definition, Version 3.5. Available from ftp.muc.saic.com/pub/MUC/MUC7-guidelines, 1997.

Page 102:

Further reading (2)

• Collins M., Singer Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

• Collins M. Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp. 489-496, July 2002.
• Gotoh Y., Renals S. Information extraction from broadcast news. Philosophical Transactions of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 2000.

• Grishman R. The NYU System for MUC-6 or Where's the Syntax? Proceedings of the MUC-6 workshop, Washington. November 1995.

• [Ign03a] C. Ignat, B. Pouliquen, A. Ribeiro and R. Steinberger. Extending an Information Extraction Tool Set to Eastern-European Languages. Proceedings of the Workshop on Information Extraction for Slavonic and other Central and Eastern European Languages (IESL'03), 2003.

• Krupka G. R., Hausman K. IsoQuest Inc.: Description of the NetOwlTM Extractor System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.

• McDonald D. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B. Boguraev and J. Pustejovsky, editors: Corpus Processing for Lexical Acquisition, pages 21-39. MIT Press, Cambridge, MA, 1996.

• Mikheev A., Grover C. and Moens M. Description of the LTG System Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998

• Miller S., Crystal M., et al. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998

Page 103:

Further reading (3)

• Palmer D., Day D.S. A Statistical Profile of the Named Entity Task. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31 - April 3, 1997.

• Sekine S., Grishman R. and Shinou H. A decision tree method for finding and classifying names in Japanese texts. Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998

• Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N. Chinese Named Entity Identification Using Class-based Language Model. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), pp. 967-973, 2002.

• Takeuchi K., Collier N. Use of Support Vector Machines in Extended Named Entity Recognition. The 6th Conference on Natural Language Learning. 2002

• D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.

• M. M. Wood and S. J. Lydon and V. Tablan and D. Maynard and H. Cunningham. Using parallel texts to improve recall in IE. Recent Advances in Natural Language Processing, Bulgaria, 2003.

• D. Maynard, V. Tablan and H. Cunningham. NE recognition without training data on a language you don't speak. ACL Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo, Japan, 2003.

Page 104:

Further reading (4)

• H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, Y. Wilks. Multimedia Indexing through Multisource and Multilingual Information Extraction: the MUMIS project. Data and Knowledge Engineering, 2003.

• D. Manov and A. Kiryakov and B. Popov and K. Bontcheva and D. Maynard, H. Cunningham. Experiments with geographic knowledge for information extraction. Workshop on Analysis of Geographic References, HLT/NAACL'03, Canada, 2003.

• H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002.

• H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, volume 36, pp. 223-254, 2002.

• D. Maynard, H. Cunningham, K. Bontcheva, M. Dimitrov. Adapting A Robust Multi-Genre NE System for Automatic Content Extraction. Proc. of the 10th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2002), 2002.

• E. Paskaleva, G. Angelova, M. Yankova, K. Bontcheva, H. Cunningham and Y. Wilks. Slavonic Named Entities in GATE. Technical Report CS-02-01, 2003.

• K. Pastra, D. Maynard, H. Cunningham, O. Hamza, Y. Wilks. How feasible is the reuse of grammars for Named Entity Recognition? Language Resources and Evaluation Conference (LREC'2002), 2002.