Fefor, March 2002

Named-Entity Recognition for Swedish

Past, Present and Way Ahead...

Dimitrios Kokkinakis

Outline

Looking Back: AVENTINUS, flexers,...
Current Status & Workplan:
– Resources: Lexical, Textual and Algorithmic
– NER on Part-of-Speech Annotated Material
– Way Ahead, Approach and Evaluation Samples
Resource Localization (if required...)
NE Tagset and Guidelines
Survey of the Market for NER: Tools, Projects,...
Problems: Ambiguity, Metonymy, Text Format (Orthography, Source Modality...)...

Looking Back...

NER in the AVENTINUS project (LE4) without lists
No proper evaluation on a large scale
Collection of a few types of resources; e.g. appositives
Method: finite-state grammars (‘semantic grammars’); one for each category
Delivered rules (for Swedish NER) that were compiled in a user-required product
See Kokkinakis (2001): svenska.gu.se/~svedk/publics/swe_ner.ps for a grammar used for identifying “Transportation Means”

Snapshots from AVE (1)

Police report from Europol

Snapshots from AVE (2)

Snapshots from AVE (3)

Swe-NER without Lists

...see the flexers example

How long can we go without lists?

Swe-NER Evaluation Sample in AWB

See also SUC2

In the framework of...

my PhD, a collection of 35 documents was manually tagged; newspaper articles (30) & reports from a popular science periodical (5)

ENTITY          #AMOUNT
Persons         419 (84f)
Locations       569 (89f)
Organizations   272 (83f)
Temporal        504 (89f)
Monetary        80 (97f)

DOCUMENTS       35
TOKENS          20,927
PROPER NOUNS    1,422

Status & Workplan

Resources: Lexical, Textual and Algorithmic
NER on Part-of-Speech Annotated Material
Way Ahead, Approach and Evaluation Samples

Evidence

McDonald (1996):
Internal evidence: is taken from within the sequence of words that comprise the name, such as the content of lists of proper names (gazetteers), abbreviations and acronyms (Ltd, Inc., GmbH)
External evidence: provided by the context in which a name appears – the characteristic properties or events in a syntactic relation (verbs, adjectives) with a proper noun can be used to provide confirming or criterial evidence for a name’s category – an important type of complementary information since internal evidence can never be complete...

Lexical Resources (1) (Internal Evidence)

Name Lists (Gazetteers)

Multiword names
– Organizations (profit): 1,200
– Organizations (non-profit): 60
– Locations: 40

Single names
– Org/commerc.: 1,500
– Org/no-comm: 200
– Person First: 70,000
– Person Last: 5,000
– Cities non-Swe.: 2,200
– Cities Swe.: 1,600
– Provinces: 70
– Airports: 10
– Countries: 230
– Events: 10
– ...

Lexical Resources (2) (Internal Evidence)

Designators, affixes, and trigger words

Titles, premodifiers, appositions...

e.g. persons
– PostMods: Jr, Junior,…
– PreTitles: VD, Dr, sir,…
– Nationality: belgaren, brasilianaren, dansken,…
– Occupation: amiral, kriminolog, psykolog,...

e.g. organizations
– Design. & Triggers: bolaget X, föreningen X, institutet X, organisationen X, stiftelsen X, förbundet X,… X Agency, X Biotech, X Chemical, X Consultancy,…
– Affixes: +kollegium, +verket,...
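
The organization cues above can be tried out with a few regular expressions. This is only a minimal sketch: the word lists are the handful of items named on the slide, while the `NAME` pattern (a capitalized token) and the function name `find_organizations` are assumptions made for illustration, not part of the original system.

```python
import re

# Cue words copied from the slide; the real lists are longer.
PRE_TRIGGERS = ["bolaget", "föreningen", "institutet",
                "organisationen", "stiftelsen", "förbundet"]
POST_DESIGNATORS = ["Agency", "Biotech", "Chemical", "Consultancy"]
SUFFIXES = ["kollegium", "verket"]

# Assumption: a candidate name is a single capitalized token.
NAME = r"[A-ZÅÄÖ][\w-]*"

PATTERNS = [
    # "bolaget X": the trigger precedes the name and is not part of it.
    ("trigger", re.compile(r"\b(?:%s) (%s)" % ("|".join(PRE_TRIGGERS), NAME))),
    # "X Biotech": the designator follows the name and is part of it.
    ("designator", re.compile(r"\b(%s (?:%s))\b" % (NAME, "|".join(POST_DESIGNATORS)))),
    # "+verket": the affix is the tail of the name itself.
    ("affix", re.compile(r"\b(%s(?:%s))\b" % (NAME, "|".join(SUFFIXES)))),
]

def find_organizations(text):
    """Return (name, cue_type) pairs for every cue that fires in text."""
    hits = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.group(1), label))
    return hits

print(find_organizations("bolaget Astra köpte Nova Biotech, enligt Skolverket."))
```

Such cues only propose candidates; a name with no trigger, designator or affix still needs a gazetteer entry or contextual evidence.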

Lexical Resources (External Evidence)

the Volvo/Saab case (can be generalized)
a typical, frequent and fairly difficult example

For instance:
...Saab 9000...
...mellanklassbilar som Volvo,...
...att köra Volvo i en Volvostad som...
... i en stor svart Volvo och blinkade...
...tjuven försvinner i en stulen Saab
...tappat kontrollen över sin Volvo
Volvo steg med 12 kronor
Saab backade med 1 procent
...gick Volvo ned med 10 kronor...
...

object: car

object: share

organization

...ignore infrequent cases and details

Flexers Example

Sense1: object, the product (vehicle)

Morphology: number (singular/plural), case (nominative/genitive), definiteness

Samples: Volvon är billigare, singular, e.g. en svart Volvo ...

Corpus Analysis/Usage:

1. Saab/Volvo NUM
2. Saab/Volvo NUM? (coupé|turbo|dieselcabriolet|corvette|transporter|cc|...)
3. (GENITIVE/POSS-PRN/ARTCL) ADJ/PRTCPL* Saab/Volvo NUM?
4. (GENITIVE/POSS-PRN/ARTCL)? ADJ/PRTCPL+ Saab/Volvo NUM?
5. bilar som Saab/Volvo
6. typen/kör/*köra Saab/Volvo

>9 out of 10 cases

no rule without exception: [Saab/Volvo TimeExpression; När Volvo 1994...]

Flexers Example

Sense2: object, the share

Morphology: number (singular/plural), case (nominative/genitive), definiteness

Samples: Volvon har gått upp med...

Corpus Analysis/Usage:

1. Saab/Volvo AUX? VERB(steg/stig*/backa*)
2. Saab/Volvo AUX? VERB(öka*/minska*)? med NUM procent
3. Saab/Volvo gick (tillbaka kraftigt|mot strömmen|upp|ned)
4. Saab/Volvo NUM procent

Rest of cases? Sense3 the building <not found>

Rest of cases? Sense4 the organization

Flexers Example

CAR_TYPE    (Saab|Volvo|Ford|...)/NP...
VERB        (stiga|stiger|stigit|steg|backa[^/ ]+|...)/(VMISA|VMU0A|...)
AUX_VERB    [^/ ]+/(VTISA|VTU0A|...)
MC          [0-9][0-9]?[0-9]?/MC|[0-9][0-9]?[.,][0-9][0-9]?/MC
SPACE       [ \t]+

{CAR_TYPE}{SPACE}({AUX_VERB}{SPACE})?{VERB}("med/S "{MC}{SPACE}procent)?    {tag-as-sense2;}
{CAR_TYPE}{SPACE}{MC}{SPACE}procent    {tag-as-sense2;}
{CAR_TYPE}{SPACE}gick{SPACE}("tillbaka/ kraftigt"|"mot/S strömm"|"upp/"|"ned/")    {tag-as-sense2;}
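
The first two flex rules can be approximated with Python's `re` module. This is a sketch under stated assumptions: the elided alternatives ("...") are simply left out, the tail of the proper-noun tag after `NP` is matched with `[^ \t]*`, the "gick" rule is omitted because its quoted strings are incomplete on the slide, and the function name `is_share_sense` and the tagged test sentences are made up for illustration.

```python
import re

# Definitions mirror the flex macros; elided alternatives are not filled in.
CAR_TYPE = r"(?:Saab|Volvo|Ford)/NP[^ \t]*"   # assumption: any tag tail after NP
VERB = r"(?:stiga|stiger|stigit|steg|backa[^/ ]+)/(?:VMISA|VMU0A)"
AUX_VERB = r"[^/ ]+/(?:VTISA|VTU0A)"
MC = r"(?:[0-9][0-9]?[0-9]?|[0-9][0-9]?[.,][0-9][0-9]?)/MC"
SPACE = r"[ \t]+"

SENSE2_RULES = [
    # CAR (AUX)? VERB ("med" NUM "procent")?
    re.compile(
        f"{CAR_TYPE}{SPACE}(?:{AUX_VERB}{SPACE})?{VERB}"
        f"(?:{SPACE}med/S{SPACE}{MC}{SPACE}procent)?"
    ),
    # CAR NUM "procent"
    re.compile(f"{CAR_TYPE}{SPACE}{MC}{SPACE}procent"),
]

def is_share_sense(tagged_sentence):
    """True if any sense-2 (share) rule fires on a word/TAG sentence."""
    return any(rule.search(tagged_sentence) for rule in SENSE2_RULES)

print(is_share_sense("Volvo/NP steg/VMISA med/S 12/MC kronor/NCUPN"))
print(is_share_sense("en/DI svart/AQ Volvo/NP och/CC blinkade/VMISA"))
```

In flex the action `{tag-as-sense2;}` would rewrite the matched text; here the function only reports whether a rule matched, which is enough to test the patterns against corpus lines.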

SUC-2

The second version of SUC has been semi-automatically?? annotated with “NAMES”

15131 PERSON
8771 PLACE
6309 INST
1887 WORK
638 PRODUCT
540 OTHER
364 ANIMAL
280 MYTH
245 EVENT
242 FORMULA

Här har <NAME TYPE=ANIMAL>Nalle </NAME> frukosterat...

...ber <NAME TYPE=MYTH>Herren </NAME> välsigna vår...

...årsmöte i <NAME TYPE=OTHER> Kristiansborgskyrkan</NAME>…

...till nitrat ( <DISTINCT TYPE=FORMULA> NO3-</DISTINCT> ) och därefter...
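
Counts like the ones above can be reproduced by scanning the annotated text for the two element types shown in the samples. The following is a rough sketch that assumes the markup looks exactly like the sample lines (unquoted `TYPE` values, no nesting); the function names are illustrative only.

```python
import re
from collections import Counter

# Matches <NAME TYPE=X>...</NAME> and <DISTINCT TYPE=X>...</DISTINCT>,
# trimming the stray spaces seen inside the sample elements.
NAME_RE = re.compile(r"<(NAME|DISTINCT) TYPE=(\w+)>\s*(.*?)\s*</\1>")

def extract_names(text):
    """Return (type, string) pairs in document order."""
    return [(m.group(2), m.group(3)) for m in NAME_RE.finditer(text)]

def count_name_types(text):
    """Return a Counter of annotation types."""
    return Counter(m.group(2) for m in NAME_RE.finditer(text))

sample = (
    "Här har <NAME TYPE=ANIMAL>Nalle </NAME> frukosterat... "
    "ber <NAME TYPE=MYTH>Herren </NAME> välsigna vår... "
    "till nitrat ( <DISTINCT TYPE=FORMULA> NO3-</DISTINCT> ) och därefter"
)
print(extract_names(sample))
print(count_name_types(sample))
```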

POS Taggers & Tagset

Three off-the-shelf POS taggers have been downloaded and are currently under development with our new tagset

TreeTagger: HMM + Decision Trees

TnT: Viterbi (HMM)

Brill's: Transformation-based

NER is a complex of different tasks; POS tagging is a basic task which can aid the detection of entities

POS Taggers & Tagset

The NER will be/is applied on part-of-speech annotated material. The relevant tags for marking proper nouns (as found in the training corpus-SUC2):

NPNSND    ...i Europa/NPNSND har inte...

NPNSGD    ...för Litauens/NPNSGD parlament där...

NPUSND    ...berättar Torgny/NPUSND Lindgren/...

NPUSGD    ...är Mona Eliassons/NPUSGD recept...

NP*SND    Ulf Norrman vann H-43/NP*SND...

XF        …vunnit en Grand/XF Slam/XF...

Y         ...ÖB/Y under kriget i Libanon...
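
Given word/TAG text, the tags listed above are enough to pull out candidate name strings before any classification. A minimal sketch, assuming tokens are whitespace-separated `word/TAG` pairs and that adjacent proper-noun tokens belong to the same name; the function name and the non-proper tags in the example are placeholders.

```python
import re

# The seven tags from the slide: five NP variants, XF (foreign word), Y (abbreviation).
PROPER_TAG = re.compile(r"NP[NU*]S[NG]D|XF|Y")

def proper_noun_chunks(tagged):
    """Group adjacent tokens carrying a proper-noun tag into candidate names."""
    chunks, current = [], []
    for token in tagged.split():
        word, _, tag = token.rpartition("/")  # split on the last slash
        if word and PROPER_TAG.fullmatch(tag):
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

print(proper_noun_chunks("berättar/VMIPA Torgny/NPUSND Lindgren/NPUSND om/S Europa/NPNSND"))
```

Splitting on the last slash keeps tokens such as `H-43/NP*SND` intact.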

Explore JAPE & GATE2

Java Annotation Pattern Engine (JAPE) Grammar
– Set of rules
» LHS regular expression over annotations
» RHS annotations to be added
» Priority
» Left and Right context around the pattern
– Rules are compiled in a FST over annotations

JAPE Rules

Rule: Location1
Priority: 25
(
  ({Lookup.majorType==loc_key,Lookup.minorType==pre}{SpaceToken})?
  {Lookup.majorType==location}
  ({SpaceToken}{Lookup.majorType==loc_key,Lookup.minorType==post})?
):locName --> :locName.Location={kind="location",rule="Location1"}

China sea: location
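
What the Location1 rule does can be imitated outside GATE with a small Python function. This is not JAPE and not GATE's API; it is a sketch that assumes each token is a `(text, majorType, minorType)` triple, leaves `SpaceToken` implicit by working on non-space tokens only, and assumes that "sea" carries a `loc_key`/`post` lookup in the "China sea" example.

```python
RULE_FEATURES = {"kind": "location", "rule": "Location1"}

def is_key(token, minor):
    """True for a Lookup with majorType loc_key and the given minorType."""
    return token[1] == "loc_key" and token[2] == minor

def apply_location1(tokens):
    """Emulate: (pre-key)? location (post-key)? --> Location annotation."""
    matches = []
    i = 0
    while i < len(tokens):
        j = i
        # Optional pre-modifier key, consumed only when a location follows.
        if is_key(tokens[j], "pre") and j + 1 < len(tokens) and tokens[j + 1][1] == "location":
            j += 1
        if tokens[j][1] == "location":
            end = j + 1
            # Optional post-modifier key directly after the location.
            if end < len(tokens) and is_key(tokens[end], "post"):
                end += 1
            text = " ".join(t[0] for t in tokens[i:end])
            matches.append((text, dict(RULE_FEATURES)))
            i = end
        else:
            i += 1
    return matches

tokens = [("the", None, None), ("China", "location", None),
          ("sea", "loc_key", "post"), ("is", None, None)]
print(apply_location1(tokens))
```

The real engine compiles all rules into one finite-state transducer and uses the priority value to choose between competing matches; the loop above handles a single rule only.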

Plan for (the rest of) 2002

January-April: inventory of existing L&A resources; re-training of POS taggers with Språkdata's tagset; localization, ‘completion’ & structuring of L-resources; provision of (draft) guidelines for the NER task; working with ‘WORK&ART’ and ‘EVENTS’;

May-September: implementations; porting of old scripts to the current state-of-affairs; SUC2 with ML?; developing a Swedish JAPE module in GATE2

October: evaluation
November: new web-interface and GATE2 integration
December: wrapping-up

Annotation Guidelines

First draft specifications for the creation of simple guidelines for the NER work as applied on Swedish data have been written

Ideas from MUC, ACE and own experience
The guidelines are expected to evolve during the course of the project, refined and extended
The purpose of the guidelines is to try and impose some consistency measures for annotation and evaluation, and to give the potential future users of the system a clearer picture of what the recognition components can offer

Pragmatic rather than theoretic...

Guidelines cont’d

Named Entity Recognition (NER) consists of a number of subtasks, corresponding to a number of XML tag elements

The only insertions allowed during tagging are tags enclosed in angled brackets. No extra white space or carriage returns are to be inserted

The markup will have the form of the entity type and attribute information:
<ELEMENT-NAME ATTR-NAME="ATTR-VALUE">a text-string</ELEMENT-NAME>

Six (+1) categories will be recognized
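
The "tags only, no extra white space" constraint is easy to check mechanically: removing the tags from an annotated text must give back the original text, character for character. A small sketch, assuming entities are given as non-overlapping character spans; the element name `ENAMEX` and the type `G-PLC` come from the tagset slides, while the function names are hypothetical.

```python
import re

def annotate(text, spans):
    """Insert ENAMEX tags around (start, end, type) spans; nothing else changes."""
    out, pos = [], 0
    for start, end, etype in sorted(spans):
        out.append(text[pos:start])
        out.append(f'<ENAMEX TYPE="{etype}">{text[start:end]}</ENAMEX>')
        pos = end
    out.append(text[pos:])
    return "".join(out)

TAG_RE = re.compile(r"</?ENAMEX[^>]*>")

def strip_tags(annotated):
    """Remove ENAMEX tags so the result can be compared with the source text."""
    return TAG_RE.sub("", annotated)

text = "i Europa har inte"
annotated = annotate(text, [(2, 8, "G-PLC")])
print(annotated)
print(strip_tags(annotated) == text)
```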

“PLACE” NAMES

<ENAMEX TYPE="G-PLC">; Description: a (natural) geographically/geologically or astronomically defined location, with physical extent; such as bodies of water, rivers, mountains, geological formations, islands, continents, stars, galaxies, …

<ENAMEX TYPE="P-PLC">; Description: (geo-political entities) politically defined geographical regions; nations, states, cities, villages, provinces, regions, other populated urban areas, …; this includes cases where, e.g., the capital city is used to refer to the nation’s government, e.g. USA attackerade X;

<ENAMEX TYPE="F-PLC">; Description: facility entities which are (permanent) man-made artefacts falling under the domains of architecture, transportation infrastructure and civil engineering; such as streets, parks, stadiums, airports, ports, museums, tunnels, bridges,…

“PERSON” NAMES

<ENAMEX TYPE="H-PRS">; Description: person entities are limited to humans, fictional human characters appearing in TV, movies etc.; Christian names, family names, nicknames, group names, tribes,…

<ENAMEX TYPE="O-PRS">; Description: Saints, gods, names of animals and pets,…

e.g. Herren, Gud, Athena, Ior,...

“ORGANIZATION” NAMES

<ENAMEX TYPE="C-ORG">; Description: organization entities are divided into two categories; the first is limited to commercial corporations, multinational organizations, tv-channels,… (both multiword and single word entities)

<ENAMEX TYPE="G-ORG">; Description: organization entities of the second group are limited to governmental and non-profit organizations such as political parties, governmental bodies at any level of importance, political groups, non-profit organizations, unions, universities, embassies, army… (sport teams, music groups, stock exchanges, orchestras, churches,...)?

“EVENT” NAMES

<ENAMEX TYPE="EVN">; Description: Historical, sports, festivals, races, War and Peace events (Battles), conferences, Christmas, holidays

e.g. formel-1, andra världskriget, Julitrav, VM, OS, Mittmässan, elitserien, ...

Open category; orthography might not be enough...

“WORK/ART” NAMES

<ENAMEX TYPE="WRK">; Description: This is one of the most difficult categories since the name of a work of art usually consists of tokens that are seldom proper nouns. Titles of books, films, songs, artwork, paintings, tv-programs, magazines, newspapers, …

e.g. X sjöng “Barnens visa”
Ett fotografi med titeln Galna turister visar en gatumarknad i Brasilien

Open category; long chains; orthography is not enough...

“OBJECT” NAMES

<ENAMEX TYPE="OBJ">; Description: ships, machines, artefacts, products, diseases/prizes named after people, boats, …

e.g. fartyget Miriam, Alzheimers sjukdom

Tool Comparison-1 (IE)

INFORMATION EXTRACTION SYSTEMS

System | Named Entities, Nominal Entities, Normalized Time, Relations, Events | Multi-Lingual | Extensible via Machine Learning, Extensible via Programming

COMMERCIAL COMPANIES
AeroText, Lockheed Martin | x x x x x | EN,ES,ZH,JP | x x
IdentiFinder, BBN/Verizon | POLTM x x x | EN, ZH, (AR) | x
Intelligent Miner for Text, IBM | x x x | EN | x
Net Owl, SRA | POLTM+ x x | EN (ES, AR, Ti, JP, DE, FR, (Russian) | x
Thing Finder, Inxight | POLTM+ 0 | EN,ES,ZH,JP,FR | x
Context, Oracle | x | EN | x
Semio Taxonomy | | EN,FR,ES,IT,JP,DU,(ZH,DE) | x
LexiQuest Mine | x x | EN,FR,ES,DE,DU | x
LingSoft | x | EN | x
CoGenTex/Cornell | x x x | EN | x
TextWise/Syracuse Univ. | x x x | EN | x

NON-PROFIT ORGANIZATIONS
Alembic, MITRE | x x x x | EN, ZH, ES | x x
GATE, U. Sheffield | x x x | EN | x
Univ. of Arizona | x x | EN | x
New Mexico State University | x x | EN | x
Fastus/TextPro, SRI International | x x x x | EN |
Proteus, New York University | x x x x x | EN, JP, ES | x x
TIMEX, MITRE | x | EN, ES | x
Univ. of Massachusetts/Amherst | x x x | EN | x x

EN=English ZH=Chinese ES=Spanish JP=Japanese IT=Italian FR=French DE=German DU=Dutch AR=Arabic
P=People, O=Organization, L=Location, T=Time, M=Money

Screenshot taken from Mark Maybury

Entity Extraction Tools – Commercial Vendors 020204

AeroText - Lockheed Martin's AeroText™
– www.lockheedmartin.com/factsheets/product589.html
BBN's Identifinder: www.bbn.com/speech/identifinder.html
IBM's Intelligent Miner for Text
– www-4.ibm.com/software/data/iminer/fortext/index.html
SRA NetOwl: www.netowl.com
Inxight's ThingFinder
– www.inxight.com/products/thing_finder/
Semio taxonomies: www.semio.com
Context: technet.oracle.com/products/oracle7/context/tutorial/
LexiQuest Mine: www.lexiquest.com
Lingsoft: www.lingsoft.fi
CoGenTex: www.cogentex.com
TextWise: www.textwise.com & www.infonortics.com/searchengines/boston1999/arnold/sld001.htm

Entity Extraction Tools – Non-Profit Organizations

MITRE’s Alembic extraction system and Alembic Workbench annotation tool: www.mitre.org/technology/nlp
Univ. of Sheffield’s GATE: gate.ac.uk
Univ. of Arizona: ai.bpa.arizona.edu
New Mexico State University (Tabula Rasa system): http://crl.nmsu.edu/Research/Projects/tr/index.html
SRI International's Fastus/TextPro:
– www.ai.sri.com/~appelt/fastus.html
– www.ai.sri.com/~appelt/TextPro (not free since Jan 2002!)
New York University’s Proteus
– www.cs.nyu.edu/cs/projects/proteus/
University of Massachusetts (Badger and Crystal):
– www-nlp.cs.umass.edu/

Name Analysis Software

Language Analysis Systems Inc.’s (Herndon, VA) “Name Reference Library” www.las-inc.com & www.onomastix.com/

Supports analysis of Arabic, Hispanic, Chinese, Thai, Russian, Korean, and Indonesian names; others in future versions...

Product Features:
– Identifying the cultural classification of a person name
– Given a name, provides common variants on that name, e.g., “Abd Al Rahman” or “Abdurrahman” or ...
– Implied gender
– Identifies title, affixes, qualifiers, e.g., "Bin," means "son of" as in Osama Bin Laden
– List top countries where name occurs

Cost: $3,535 a copy and a $990 annual fee!

Example 1: IBM’s Intelligent Miner

See: www-4.ibm.com/software/data/iminer/fortext/index.html

Example 2: GATE2

Example 3: AWB

Some Relevant Projects

ACE: Automated Content Extraction (www.nist.gov/speech/tests/ace)

NIST: National Institute of Standards and Technology (http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html); +evaluation tools

TIDES: Translingual Information Detection Extraction and Summarization; DARPA; multilingual name extraction (www.darpa.mil/ito/research/tides)

MUSE: A MUlti-Source Entity finder (http://www.dcs.shef.ac.uk/~hamish/muse.html)

Identifying Named Entities in Speech (HUB)
Other...

Tool Comparison-2 (DC,TM...)

Document Clustering, Mining, Topic Detection, and Visualization Systems

System | All words equally, Noun Phrases, Named Entities, Accepts Predefined Terms, Predefined Taxonomies, Generates Taxonomies | MultiLingual? | Story Segmentation, New Topic Detection?, Topic Tracking (Predefined Topics)

Inxight Categorizer, Tree Studio, Inxight | x x | EN, FR, ES, DE, DU, … (12) | x
Semio Taxonomy | x x x x | EN,FR,ES,IT,JP,DU (ZH,DE) |
LexiQuest Mine | x x x x | EN, FR, ES, DE, DU | x
InterMedia Text, Oracle | x | EN | x
NorthernLight | x | EN | x x
Autonomy | x | EN | x
Lotus Discovery Server (LDS), Lotus | x x | EN | x
QKS Classifier, Quiver | x x x | EN | x
Fulcrum Knowledge Server, Hummingbird | x x x | EN |
SPIRE/Themeview, PNNL | x x x | EN |
VantagePoint, Search Technology Inc. | x x x | EN |
Mohomine, Inc. | x x | EN | x
Intelligent Miner for Text, IBM | x x | EN | x x x
Oasis, OnTopic, BBN/Verizon | x x | EN, ZH, AR | x x x

EN=English ZH=Chinese ES=Spanish JP=Japanese DE=German DU=Dutch FR=French AR=Arabic IT=Italian

Screenshot taken from Mark Maybury

Evaluation

Evaluation consists of (at least) three parts:
– Entity Detection (of the string that names an entity): <ENAMEX>Fjärran Östern</ENAMEX>
– Attribute Recognition/Classification (of the entity); <ENAMEX TYPE="LOCATION">Fjärran Östern</ENAMEX>
– Extent Recognition (measure the ability of a system to correctly determine an entity’s extent; partial correctness): Fjärran <ENAMEX TYPE="LOCATION">Östern</ENAMEX>
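
The three parts can be scored separately for one key/response pair. A minimal sketch, assuming both annotations are `(start, end, type)` character spans and that "detected" simply means the spans overlap; the function name and the result keys are invented for this example.

```python
def compare(key, response):
    """Score one response annotation against one key annotation."""
    key_start, key_end, key_type = key
    resp_start, resp_end, resp_type = response
    overlap = resp_start < key_end and key_start < resp_end
    return {
        "detected": overlap,                                   # entity detection
        "type_correct": overlap and key_type == resp_type,     # classification
        "extent_correct": (key_start, key_end) == (resp_start, resp_end),
    }

# "Fjärran Östern": the key covers both words, the response only "Östern".
key = (0, 14, "LOCATION")
response = (8, 14, "LOCATION")
print(compare(key, response))
```

For the example on the slide this reports a detected and correctly typed entity whose extent is wrong, which is the partial-correctness case.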

Evaluation cont’d

Systems exist that identify names ~90-95% accurately in newswire texts (in several languages)

Metrics: Vary from test case to test case; the “simplest” definitions are:

Precision = #CorrectReturned/#TotalReturned

Recall = #CorrectReturned/#CorrectPossible

Quite high figures in P&R can be found in the literature based exclusively on these simpler metrics...

Almost non-existent discussion on metonymy or other difficult cases makes the results suspect?!
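
The “simplest” definitions amount to set intersection. A small sketch, assuming each entity is represented as a hashable item such as a `(string, type)` pair and that only exact matches count as correct; the sample data is made up.

```python
def precision_recall(returned, gold):
    """Precision = correct/returned, recall = correct/possible (exact match only)."""
    returned, gold = set(returned), set(gold)
    correct = len(returned & gold)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold = {("Volvo", "ORG"), ("Europa", "LOC"), ("Torgny Lindgren", "PRS"), ("OS", "EVN")}
returned = {("Volvo", "ORG"), ("Europa", "LOC"), ("Slam", "PRS")}
print(precision_recall(returned, gold))
```

Under exact matching a response with the wrong extent or the wrong type earns nothing, and a metonymic use such as Volvo the share versus Volvo the company is judged only against whatever label the key happens to carry.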

Evaluation cont’d

Guidelines for more rigid evaluation criteria have been imposed by the MUC; e.g.

Precision = (Correct + 0.5 * Partially Correct) / Actual
Recall = (Correct + 0.5 * Partially Correct) / Possible

Correct: two single fills are considered identical
Partially Correct: two single fills are not identical, but partial credit should still be given
Actual = Correct + Incorrect + Partially Correct + Spurious
Spurious: a response object has no key object aligned with it

See: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_sw/muc_sw_manual.html
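
The MUC-style figures can be computed directly from the five tallies. In this sketch `Actual` follows the definition on the slide; `Possible` is not defined there and is taken from the MUC scoring definitions as Correct + Incorrect + Partially Correct + Missing, where Missing counts key objects with no aligned response. The numbers in the example are invented.

```python
def muc_scores(correct, partial, incorrect, spurious, missing):
    """Return (precision, recall) with half credit for partially correct fills."""
    actual = correct + incorrect + partial + spurious     # what the system returned
    possible = correct + incorrect + partial + missing    # what the key contains
    credit = correct + 0.5 * partial
    precision = credit / actual if actual else 0.0
    recall = credit / possible if possible else 0.0
    return precision, recall

precision, recall = muc_scores(correct=8, partial=2, incorrect=1, spurious=1, missing=3)
print(round(precision, 3), round(recall, 3))
```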

Resource Localization (Organizations: Governmental)

See: http://www.gksoft.com/govt/

181 governmental orgs for Norway

Resource Localization (Organizations: Governmental)

See: http://www.odci.gov/cia/publications/factbook/index.html


Resource Localization (Organizations: Publishers)

See: http://www.netlibrary.com

500 publishers


Resource Localization (Locations: Countries)

See: http://www.reseguide.se

184 countries


Resource Localization (Locations: Cities)

www.calle.com


Problems: Metonymy

A speaker uses a reference to one entity to refer to another entity, or entities, related to it; are ALL words metonyms?!

(In ACE) Classic metonymies and composites

Classic metonymies: reference to two entities, one explicit and one indirect; commonly this is the case of capital city names standing in for national governments

Composites: apply to GPEs, which typically have a government, a populace, a geographic location and an abstract notion of statehood


Problems: DCA?

The DCA approach might not work for some of the NE categories whose names are long and mentioned only once; particularly EVENTS, ARTWORK, …

In these cases context-sensitive grammars might be the alternative; they work fairly well for novel entities, and the rules can be created by hand or learned via machine learning or statistical algorithms

example....


Rules that capture local patterns that characterize entities, from instances of annotated training data or semi-automatic analysis of corpora:

– XXX köpte YYY ("XXX bought YYY"): XXX and YYY are with very high probability organizations

EMI köpte Virgin_Music_Group
Grundin köpte Hornline
Moyne köpte Trustor
Optiroc köpte Stråbruken
Pandox köpte Park_Avenue_Hotel
SF köpte Europafilm
Stagecoach köpte Swebus
Trelleborg köpte Intertrade
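
A local pattern such as "XXX köpte YYY" could be approximated with a regular expression, as in the minimal sketch below. The pattern, the function name `organization_candidates` and the capitalization constraint are illustrative assumptions, not the rules actually used in the system; multi-word names are assumed to be joined with underscores, as in the examples above.

```python
import re

# Hypothetical pattern: a capitalized token on each side of "köpte" ("bought").
# Multi-word names are assumed to be pre-joined with underscores.
KOPTE_PATTERN = re.compile(
    r"\b(?P<buyer>[A-ZÅÄÖ][\w-]*) köpte (?P<bought>[A-ZÅÄÖ][\w-]*)"
)


def organization_candidates(sentences):
    """Collect both arguments of 'köpte' as candidate organization names."""
    candidates = set()
    for sentence in sentences:
        for match in KOPTE_PATTERN.finditer(sentence):
            candidates.add(match.group("buyer"))
            candidates.add(match.group("bought"))
    return candidates


examples = [
    "EMI köpte Virgin_Music_Group",
    "SF köpte Europafilm",
    "Stagecoach köpte Swebus",
]
print(sorted(organization_candidates(examples)))
```

The output is only a list of candidates: a person can also be the subject of "köpte", so in practice such a rule would be one piece of evidence among several.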


DCA more problems...

<Dagens Industri 020306 s.18>

Fords VD och delägare Bill Ford stal showen från Volvo PV när bilsalongen i Genève... Ford köpte Volvo Personvagnar 1999.... På Fords egen presskonferens betonade Bill Ford att Volvo...

(Gloss: Ford's CEO and part-owner Bill Ford stole the show from Volvo PV when the motor show in Geneva... Ford bought Volvo Personvagnar in 1999.... At Ford's own press conference Bill Ford stressed that Volvo...)

<Dagens Industri 020306 s.22>

Industri- och finansmannen Carl Bennet, via sitt bolag Carl Bennet AB, börsnoterade... Carl Bennet framhåller att...

(Gloss: The industrialist and financier Carl Bennet, via his company Carl Bennet AB, listed on the stock exchange... Carl Bennet points out that...)
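
The Ford example can be made concrete with a deliberately naive sketch. It assumes that DCA here means a document-centred approach in which a label assigned in one confident context is propagated to other mentions in the same document that share the same token; the function `propagate` and the label names are hypothetical and are not taken from any existing system.

```python
def propagate(confident, mentions):
    """Naively copy a confident label to every mention sharing a token with it.

    confident: dict mapping a surface token to a label found in a reliable
               context (e.g. "Ford" as the subject of "köpte").
    mentions:  list of mention strings found elsewhere in the same document.
    """
    labels = {}
    for mention in mentions:
        tokens = mention.split()
        for surface, label in confident.items():
            if surface in tokens:
                labels[mention] = label
    return labels


# "Ford köpte Volvo Personvagnar" suggests that "Ford" is an organization...
confident = {"Ford": "ORGANIZATION"}
# ...but the same document also mentions the person "Bill Ford".
result = propagate(confident, ["Bill Ford", "Ford"])
print(result)
```

Under these assumptions the person mention "Bill Ford" inherits the organization label, which is the kind of error the two newspaper excerpts illustrate; the Carl Bennet / Carl Bennet AB case is the mirror image.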


Some Final Remarks

A challenge with NER is creating a stable definition of what an entity is and creating a taxonomy of entities to map to...

Having done that, it becomes simpler to solve metonymy and other ambiguity problems...

Problems remain; where shall we draw the entity boundaries?

Text format...

Shall we just go for it or try and rationalize the entity types?

Time will tell...