the nerd project

21
Giuseppe Rizzo <[email protected] >

Upload: giuseppe-rizzo

Post on 10-May-2015

2.950 views

Category:

Technology


7 download

DESCRIPTION

Seminar at Ecole Centrale, Paris, France

TRANSCRIPT

Page 2: The NERD project

What is a Named Entity recognition task?

A task that aims to locate and classify the name of a person or an organization, a location, a brand, a product, a numeric expression including time, date, money and percent in a textual document

12 March 2012 Seminar @ Ecole Centrale, Paris 2/21

Page 3: The NERD project

History of NER benchmarks CoNLL 2003 and CoNLL 2005

schema (4 types): person, organization, location and miscellaneous language independent task

ACE 2004, ACE 2005 and ACE 2007 schema (7 types): person, organization, location, facility, weapon,

vehicle and geo-political entity entity recognition, not just name (e.g. description, pronoun) find relationships among entities extracted

TAC 2009 (Knowledge Base Track) schema (3 types): person, organization and location create a knowledge base from the named entities extracted

ETAPE 2012 (Named Entity Task) schema: Quaero (7 main types, 32 sub-types)

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 3

Page 4: The NERD project

NER Tools

Standalone software GATE Stanford CoreNLP Temis

Web APIs

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 4

Page 5: The NERD project

Factual comparison of 10 Web NER tools

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 5

Alchemy DBpedia

Evri Extr Lup Calais Saplo WM Yahoo Zemanta

Granularity OEN OEN OED OEN OEN OEN OED OEN OEN OED

Language EN FR GR IT PT RU SP SW

EN GR* PT* SP*

EN IT

EN EN FR IT

EN FR SP

EN SW

EN FR SP

EN EN

Quota (calls/day)

30000 unl 3000 3000 unl 50000 1333 unl 5000 10000

Sample Clients

C/C++ C# Java Perl PHP5 Python Ruby

Java JS PHP

AS Java PHP

Java N/A Java Java JS PHP Python

Java Perl

JS PHP

C# Java JS Perl PHP Python Ruby

Content chunk

150KB 452KB 8KB 32KB 20KB 8KB 26KB 80KB 7769KB 970KB

Page 6: The NERD project

12 March 2012 Seminar @ Ecole Centrale, Paris 6/21

Factual comparison (II) Alchemy DBpedia Evri Extr Lup Calais Saplo WM Yahoo Zemanta

Response Format

JSON MicroF XML RDF

HTML JSON RDF XML

HTML JSON RDF

HTML JSON RDF XML

HTML JSON RDFa XML

JSON MicroF

JSON JSON XML

JSON XML

XML JSON RDF

Entity type number

324 320 5 34 319 95 5 7 13 81

Entity position

N/A char offset

N/A word offset

range of chars

char offset

N/A POS offset

range of chars

N/A

Classif. Ontologies

Alchemy DBpedia FreeBase Scema.org

Evri DBpedia

DBpedia LinkedMDB

OpenCalais

N/A ESTER

Yahoo FreeBase

Defer. Vocabularies

DBpeda FreeBase USCensus UMBEL OpenCyc YAGO MusicBrainz CrunchBase

DBpedia Evri DBpedia

DBpedia LinkedMDB

OpenCalais

N/A DBpedia Geonames CIAFactbook Wikicompanies

Wikipedia

Wikipedia IMDB MusicBrainz Amazon YouTube TechCrunch ...

Page 7: The NERD project

Human made benchmarks

We performed two evaluation experiments: WEKEX 2011 ISWC 2011

Each field has been rated by a Boolean value: true if correct, false otherwise

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 7

t = (entity, type, URI, relevant)

Rizzo G., Troncy R. (2011), NERD: A Framework for Evaluating Named Entity Recognition Tools in the Web of Data. In: International Semantic Web Conference 2011 (ISWC'11), Bonn, Germany.

Page 8: The NERD project

WEKEX 2011 Benchmark

Controlled experiment 4 human raters 10 English news articles (5 from BBC and 5 from The

New York Times) Each rater evaluated each article for 5 extractors

200 total evaluations

Fleiss's kappa score moderate agreement among raters

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 8

Rizzo G., Troncy R. (2011), NERD: Evaluating Named Entity Recognition Tools in the Web of Data. In: (ISWC'11) Workshop on Web Scale Knowledge Extraction (WEKEX'11), Bonn, Germany.

Page 9: The NERD project

Results

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 9

different behavior for different sources

Page 10: The NERD project

ISWC 2011 Benchmark

Controlled experiment 10 human raters 2 English news articles from The New York Times each rater evaluated each article for 6 extractors

120 total evaluations

Fleiss's kappa score substantial

agreement among raters

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 10

Page 11: The NERD project

Results

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 11

Page 12: The NERD project

12 March 2012 Seminar @ Ecole Centrale, Paris 12/21

What is NERD?

REST API2 ontology1

UI3

1 http://nerd.eurecom.fr/ontology 2 http://nerd.eurecom.fr/api/application.wadl 3 http://nerd.eurecom.fr/

The NERD ontology has been integrated in the NIF project,

a EU FP7 in the context of the LOD2: Creating Knowledge

out of Interlinked Data

Page 13: The NERD project

NERD Ontology

Align the taxonomies used by the extractors

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 13

Page 14: The NERD project

Building the NERD ontology

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 14

NERD type Occurrence

Person 10

Organization 10

Country 6

Company 6

Location 6

Continent 5

City 5

RadioStation 5

Album 5

Product 5

... ...

Page 15: The NERD project

NERD REST API

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 15

GET, POST, PUT,

DELETE

/document /user /annotation/{extractor} /extraction /evaluation ...

JSON/RDF*

“entities” : [{ “entity”: “Tim Berners-Lee” , “type”: “Person” , “uri”: "http://dbpedia.org/resource/Tim_berners_lee", “nerdType”: "http://nerd.eurecom.fr/ontology#Person", “startChar”: 30, “endChar”: 45, “confidence”: 1, “relevance”: 0.5 }]

Rizzo G., Troncy R. (2012), NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Web Extraction Tools. In: European chapter of the Association for Computational Linguistics (EACL'12), Avignon, France.

Page 16: The NERD project

NIF: NLP Interchange Format Framework

Different outputs for the NLP tools

Manual effort required for integration or reuse time consuming need to capture the definition of the attributes used in the

response format

NIF uses RDF for representing NER results as Linked Data

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 16

OpenCalais "_type": "Organization", “name": "North Atlantic Treaty Organization", "organizationtype": "governmental civilian", "nationality": "N/A", "_typeReference": http://s.opencalais.com/1/type/em/e/Organization", ...

DBpedia Spotlight "@URI": "http://dbpedia.org/resource/DBpedia", "@types": "DBpedia:Software,DBpedia:Work” "@surfaceForm": "dbpedia", "@offset": "0", "@support": "11", "@similarityScore": "0.2387271374464035", …

Page 17: The NERD project

Named Entities as textual annotations

Let's consider the document: http://www.w3.org/DesignIssues/LinkedData.html

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 17

The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.…. All the above plus, Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff ...

entities: { … [entity: W3C, startChar: 23107, endChar: 23110], … }

Page 18: The NERD project

NERD meets NIF

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 18

Model documents through a set of strings deferencable on the Web

: offset_23107_ 23110 a str:String ; str:referenceContext :offset_0_26546 .

: offset_23107_ 23110 sso:oen dbpedia:W3C.

dbpedia:W3C rdf:type nerd:Organization .

Map string to entity

Classification

Rizzo G, Troncy R., Hellmann S. and Bruemmer M. (2012), NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. In: (LDOW'12) Linked Data on the Web (WWW'12), Lyon, France.

Page 19: The NERD project

NERD Demo

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 19

Page 20: The NERD project

NERD Timeline and Future Work

12/03/2012 - Multimedia Semantics and Interaction - Séminaire Ecole Centrale - 20

beginning

today

Comparison of named entity extractors

NERD REST API and NERD ontology

NERD “smart” service: combining the best of all NER tools

NERD benchmarks

Lift NERD output results to the LOD cloud

Dashboard for improving the NERD user experience

Page 21: The NERD project

12/03/2012 - - 21 Multimedia Semantics and Interaction - Séminaire Ecole Centrale

http://www.slideshare.net/giusepperizzo

@giusepperizzo @rtroncy #nerd

http://nerd.eurecom.fr