what you can make out of linked data

Marco Fossati <[email protected]>, Steven R. Loomis <[email protected]>

Upload: marco-fossati

Post on 05-Jul-2015

DESCRIPTION

Tutorial given at the 38th Internationalization & Unicode Conference.

TRANSCRIPT

Page 1: What you Can Make Out of Linked Data

What you can make out of Linked Data

Marco Fossati <[email protected]>

Steven R. Loomis <[email protected]>

1

Page 2: What you Can Make Out of Linked Data

Let's meet the presenters first!

2

Page 3: What you Can Make Out of Linked Data

Marco Fossati

Natural Language Processing Advocate

Recommender Systems Aficionado

Open Data Apologist

3

Page 4: What you Can Make Out of Linked Data

Steven R. Loomis

IBM

Chair, Unicode ULI-TC

Projects: ICU, CLDR, ULI

Page 5: What you Can Make Out of Linked Data

Outline

1. Linked Open Data 101

2. DBpedia

3. The ULI use case

5

Page 6: What you Can Make Out of Linked Data

Warning! Highly interactive tutorial

6

Page 7: What you Can Make Out of Linked Data

Let's get started!

7

Page 8: What you Can Make Out of Linked Data

Linked Open Data 101

The Big Picture

8

Page 9: What you Can Make Out of Linked Data

What is data?

Data is how we express facts in a reusable form

9

Page 10: What you Can Make Out of Linked Data

Why data?

The ingredients for...

...Information

Knowledge

Wisdom

10

Page 11: What you Can Make Out of Linked Data

OK it's data, what else?

Big: billions of facts (“Santa Clara is a city”)

Linked: richly structured

Open: open licenses

11

Page 12: What you Can Make Out of Linked Data

Facts, not words

A fact is...

Human mind: an assertion about the world

Natural language: subject + predicate + object

Machine: a triple

12

Page 13: What you Can Make Out of Linked Data

Human mind

Perceiving relationships between entities

13

Page 14: What you Can Make Out of Linked Data

Natural language

"Elvis Presley sings Jailhouse Rock"

14

Page 15: What you Can Make Out of Linked Data

Machine

The triple

Elvis Presley

sings

Jailhouse Rock

15

Page 16: What you Can Make Out of Linked Data

The graph

Rich structure made of triples

16
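The idea of a graph built from triples can be sketched in a few lines of Python. This is only an illustration: the `Triple` type, the `match` helper, and the extra facts next to the slide's own examples are assumptions made for the sketch, not part of any DBpedia tooling.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# A graph is simply a set of triples; the facts below are illustrative only.
graph = {
    Triple("Elvis Presley", "sings", "Jailhouse Rock"),
    Triple("Elvis Presley", "born in", "Tupelo"),
    Triple("Santa Clara", "is a", "city"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every non-None field, sorted."""
    return sorted(
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


songs = [t.obj for t in match(graph, subject="Elvis Presley", predicate="sings")]
print(songs)
```

Leaving a field as `None` turns it into a wildcard, which is the same pattern-matching idea that SPARQL variables express.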

Page 17: What you Can Make Out of Linked Data

From the web of documents...

17

Page 18: What you Can Make Out of Linked Data

...to the web of entities

18

Page 19: What you Can Make Out of Linked Data

The web of entities

An entity can be...

Identified

Described through relationships

Understood both by humans and machines

19

Page 20: What you Can Make Out of Linked Data

Towards a WWW of entities

Identify via HTTP URIs

http://dbpedia.org/resource/Elvis_Presley

Describe via RDF statements

:Elvis_Presley :sings :Jailhouse_Rock

Understand via

HTML for humans

RDF for machines

20
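To make the step from the short form to full HTTP URIs concrete, here is a minimal sketch that expands the three local names and prints one N-Triples line. It assumes the empty prefix `:` stands for `http://dbpedia.org/resource/`; `sings` is the slide's illustrative predicate and is not claimed to be a real DBpedia property. The `to_ntriple` helper is hypothetical.

```python
RESOURCE = "http://dbpedia.org/resource/"  # namespace taken from the slide's URI


def to_ntriple(subject: str, predicate: str, obj: str, base: str = RESOURCE) -> str:
    """Expand three local names against `base` and format one N-Triples line."""
    s, p, o = (f"<{base}{name}>" for name in (subject, predicate, obj))
    return f"{s} {p} {o} ."


line = to_ntriple("Elvis_Presley", "sings", "Jailhouse_Rock")
print(line)
```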

Page 21: What you Can Make Out of Linked Data

Hands-on Time!

https://pad.okfn.org/p/DBpediaULI

21

Page 22: What you Can Make Out of Linked Data

Next in line…

22

Page 23: What you Can Make Out of Linked Data

DBpedia

Extracting Knowledge from Wikipedia

23

Page 24: What you Can Make Out of Linked Data

DBpedia is…

A. …a data extraction framework from Wikipedia semi-structured data

B. …an open-source community effort

24

Page 25: What you Can Make Out of Linked Data

Why?

25

Page 26: What you Can Make Out of Linked Data

Wikipedia can’t answer simple questions

“What do Santa Clara and San Francisco have in common?”

26

Page 27: What you Can Make Out of Linked Data

Wikipedia can’t answer complex questions

“Which are the black and white movies produced in Italy that have soundtracks which were composed by musicians who were born in a city of the Trentino-Alto-Adige region with less than 40,000 inhabitants?”

27

Page 28: What you Can Make Out of Linked Data

The story so far

Project started in 2007

From good ol’ PHP to Java + Scala

Steadily growing community

Internationalization Committee

Freely available on GitHub

28

Page 29: What you Can Make Out of Linked Data

Data in Wikipedia

Title

Short abstract

Long abstract

29

Page 30: What you Can Make Out of Linked Data

Structure in Wikipedia

Infobox

Images

30

Page 31: What you Can Make Out of Linked Data

Structure in Wikipedia

Links

Categories

31

Page 32: What you Can Make Out of Linked Data

Structure in Wikipedia

Interlanguage Links

32

Page 33: What you Can Make Out of Linked Data

Much more at

http://dbpedia.org/Datasets

33

Page 34: What you Can Make Out of Linked Data

DBpedia Extraction Framework (DEF)

Wikipedia dump → Extractors → RDF graph

34

Page 35: What you Can Make Out of Linked Data

Extractors

Article Features

Abstract, redirects, categories, geo-coordinates, interlanguage links, etc.

Infobox

Raw

Mapping-based

35

Page 36: What you Can Make Out of Linked Data

Raw Infobox Extractor

:Elvis_Presley

:born “Elvis Aaron Presley…”

:died “August 16, 1977…”

:restingPlace “Graceland…”

:education “L.C. Humes…”

:occupation “Singer…”

36
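As a rough illustration of what "raw" extraction means, the sketch below turns each `| key = value` line of an infobox into a triple without interpreting it. It is not the DBpedia extraction framework: the wikitext sample is a hypothetical, heavily shortened infobox, and `raw_infobox_triples` is a made-up helper.

```python
import re

# Hypothetical, heavily shortened infobox wikitext; real infoboxes are messier.
WIKITEXT = """{{Infobox person
| born = Elvis Aaron Presley
| died = August 16, 1977
| occupation = Singer
}}"""

# One "| key = value" pair per line; spaces and tabs around the parts are ignored.
FIELD = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def raw_infobox_triples(subject, wikitext):
    """Turn each `| key = value` line into a (subject, :key, literal) triple."""
    return [(subject, f":{key}", value) for key, value in FIELD.findall(wikitext)]


triples = raw_infobox_triples(":Elvis_Presley", WIKITEXT)
for t in triples:
    print(t)
```

The keys come straight from the template, which is why raw extraction inherits every naming inconsistency between infoboxes and languages.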

Page 37: What you Can Make Out of Linked Data

The Big Issues

Data is heterogeneous!

Data is multilingual!

37

Page 38: What you Can Make Out of Linked Data

38

Page 39: What you Can Make Out of Linked Data

Solution

• The DBpedia ontology as a multilingual glue

• Wikipedia-to-ontology Mapping

39

Page 40: What you Can Make Out of Linked Data

DBpedia Ontology

Encoding the worldwide encyclopedic knowledge

40

Page 41: What you Can Make Out of Linked Data

Mapping-based Extractor

Combines what belongs together

Separates what is different

41
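One way to picture "combines what belongs together, separates what is different" is a lookup keyed on both the infobox template and the raw key. The table below is entirely hypothetical (template names, raw keys, and the choice of `dbo:` properties are assumptions for illustration); the real mappings are maintained on the Mappings Wiki and are much richer than a flat dictionary.

```python
# Hypothetical mapping table: (infobox template, raw key) -> ontology property.
MAPPINGS = {
    # Combine: different keys and languages that mean the same thing.
    ("person", "born"): "dbo:birthDate",
    ("person", "birth_date"): "dbo:birthDate",
    ("persona", "nato"): "dbo:birthDate",
    # Separate: the same key meaning different things in different templates.
    ("river", "length"): "dbo:length",
    ("song", "length"): "dbo:runtime",
}


def map_property(template, raw_key):
    """Return the ontology property for a raw key, or None if unmapped."""
    return MAPPINGS.get((template, raw_key))


print(map_property("persona", "nato"))
print(map_property("song", "length"))
```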

Page 42: What you Can Make Out of Linked Data

DIEF: Mapping-based Infobox Extractor

42

Page 43: What you Can Make Out of Linked Data

The Mappings Wiki

Anybody can contribute to mappings.dbpedia.org

43

Page 44: What you Can Make Out of Linked Data

Download the latest DBpedia dump at

http://downloads.dbpedia.org/current/

44

Page 45: What you Can Make Out of Linked Data

English SPARQL endpoint

dbpedia.org/sparql

45
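A query can be sent to the endpoint as an ordinary HTTP GET. The sketch below only builds the request URL and does not touch the network. `query` is the standard SPARQL protocol parameter; the `format` parameter and the `http://` scheme are assumptions about this particular endpoint, and the example query and `sparql_url` helper are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Elvis_Presley>
      <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 5
"""


def sparql_url(query, endpoint=ENDPOINT,
               result_format="application/sparql-results+json"):
    """Build a GET URL; `format` is an assumed, endpoint-specific parameter."""
    params = {"query": query.strip(), "format": result_format}
    return endpoint + "?" + urlencode(params)


url = sparql_url(QUERY)
print(url)
```

Fetching `url` with any HTTP client would then return the result set, provided the endpoint honours the assumed `format` value.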

Page 46: What you Can Make Out of Linked Data

Language chapters

DBpedia in your mother tongue

46

Page 47: What you Can Make Out of Linked Data

Active chapters

International (English-based)

Basque, Czech, Dutch, French, German, Greek, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Spanish

47

Page 48: What you Can Make Out of Linked Data

Host your own language chapter!

48

Page 49: What you Can Make Out of Linked Data

Applications

Get the best out of DBpedia data

49

Page 50: What you Can Make Out of Linked Data

Knowledge Graphs

Highly informative summaries in your own language

50

Page 51: What you Can Make Out of Linked Data

Question Answering

“Who is Bram Stoker?”

51

Page 52: What you Can Make Out of Linked Data

Entity Linking

Detecting Things in Text

52

Page 53: What you Can Make Out of Linked Data

Language and Domain-specific Resources for Short Sentences Classification

Automatic Huge Gazetteers

53

Page 54: What you Can Make Out of Linked Data

DBpedia Stakeholders

Who is using the knowledge base?

54

Page 55: What you Can Make Out of Linked Data

Open Government

Linking Local Data

55

Page 56: What you Can Make Out of Linked Data

Digital Libraries

Enriching the Catalogue

56

Page 57: What you Can Make Out of Linked Data

Data-driven Journalism

Building Infographics

57

Page 58: What you Can Make Out of Linked Data

Hands-on Time!

https://pad.okfn.org/p/DBpediaULI

58

Page 59: What you Can Make Out of Linked Data

And now the final part!

59

Page 60: What you Can Make Out of Linked Data

The ULI use case

Putting Linked Open Data to work

Page 61: What you Can Make Out of Linked Data

What’s wrong with Localization Interoperability?

Inconsistent application, implementation, and interpretation of standards

Lack of clear requirements for localization data interchange

Page 62: What you Can Make Out of Linked Data

Unicode Localization Interoperability

Technical Committee of Unicode

Focus Areas:

1. Translation memory

2. Translation source strings / translations

3. Segmentation rules

Page 63: What you Can Make Out of Linked Data

ULI: Segmentation

Given: Thanks to Dr. Jones for this effort.

UAX #29 segmentation: |Thanks to Dr.| Jones for this effort.|

English: |Thanks to Dr. Jones for this effort.|

Page 64: What you Can Make Out of Linked Data

ULI Suppression: Abbreviations

English: Mr., Mrs., Dr., St., …

Spanish: Sr., Dto., Sra., Avda., …

Russian: проф., февр., тел., кв., …
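The effect of a suppression list can be shown with a deliberately naive sentence breaker. This sketch does not implement the UAX #29 rules or any ICU API: it simply breaks after `.`, `!` or `?` followed by whitespace, and skips the break when the preceding token is a listed abbreviation. The `split_sentences` helper is hypothetical.

```python
import re

# Suppression list copied from the English examples on the slide.
SUPPRESSIONS = frozenset({"Mr.", "Mrs.", "Dr.", "St."})


def split_sentences(text, suppressions=frozenset()):
    """Naive breaker: split after ., ! or ? plus whitespace, unless the token
    that ends at the candidate break is in `suppressions`."""
    segments, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:m.end()].rstrip()  # segment up to the punctuation
        if candidate.split()[-1] in suppressions:
            continue  # suppressed: keep scanning within the same segment
        segments.append(candidate)
        start = m.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


text = "Thanks to Dr. Jones for this effort."
plain = split_sentences(text)
suppressed = split_sentences(text, SUPPRESSIONS)
print(plain)
print(suppressed)
```

Without the list the sentence is cut after "Dr."; with it the sentence stays whole, which is the behaviour the English line of the segmentation example asks for.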

Page 65: What you Can Make Out of Linked Data

Demo: ULI Breaks

http://demo.icu-project.org/icu-bin/icusegments

DEMO

Page 66: What you Can Make Out of Linked Data

DBpedia applied to ULI

(University of Leipzig)

Sebastian Hellmann, Martin Brümmer, Dimitris Kontokostas

Opportunity:

Help segmentation by supplying abbreviation data

Page 67: What you Can Make Out of Linked Data

Yes!

Evaluation shows that especially for small texts, abbreviations can contribute to precision and recall of segmentation

Page 68: What you Can Make Out of Linked Data

Success rate

Page 69: What you Can Make Out of Linked Data

multilingual with over 100 languages

structured data eases extraction

additional data like entity types and categories

Page 70: What you Can Make Out of Linked Data

Example: Mr.

“MR” disambiguation page links to “Mr.” article.

Ends in full stop, so may be an abbreviation.

Page 71: What you Can Make Out of Linked Data

The “Mr.” SPARQL query

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?entryExample ?exampleTested ?indegreeRanking
WHERE {
  <http://dbpedia.org/resource/Mr.> rdfs:label ?entryExample ;
                                    rdfs:comment ?exampleTested .
  FILTER ( lang(?entryExample) = lang(?exampleTested) )
  # subselect: count the incoming links as a ranking signal
  {
    SELECT (COUNT(?in) AS ?indegreeRanking)
    WHERE { ?in ?p <http://dbpedia.org/resource/Mr.> }
  }
}
LIMIT 100

DEMO

Page 72: What you Can Make Out of Linked Data

Example DBpedia data (English)

St.

Street

<http://en.wikipedia.org/wiki/Street>

<http://schema.org/Place>

<http://dbpedia.org/ontology/Place>

<http://dbpedia.org/ontology/PopulatedPlace>

Page 73: What you Can Make Out of Linked Data

Example DBpedia data (Russian)

Проф.

Профессор (Professor)

<http://ru.wikipedia.org/wiki/Профессор>

Page 74: What you Can Make Out of Linked Data
Page 75: What you Can Make Out of Linked Data

1. Get abbreviation URIs

Page 76: What you Can Make Out of Linked Data

2. Load DBpedia data into local DB

Page 77: What you Can Make Out of Linked Data

3. Query the data with SPARQL and write TSV output
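The last step can be sketched with the standard `csv` module. The rows below stand in for query results: the first two reuse the English and Russian examples from the slides, the third is hypothetical and only there to show the filter. The column names, the `looks_like_abbreviation` heuristic (a label ending in a full stop, as in the "Mr." example), and the `to_tsv` helper are assumptions for illustration, not the actual ULI pipeline.

```python
import csv
import io

# Stand-in query results: (label, meaning, language).
ROWS = [
    ("St.", "Street", "en"),
    ("Проф.", "Профессор", "ru"),
    ("MR", "Mr.", "en"),  # hypothetical row without a trailing full stop
]


def looks_like_abbreviation(label):
    """Heuristic from the 'Mr.' example: the label ends in a full stop."""
    return label.endswith(".")


def to_tsv(rows):
    """Write the rows that pass the heuristic as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["abbreviation", "meaning", "lang"])
    writer.writerows(r for r in rows if looks_like_abbreviation(r[0]))
    return buf.getvalue()


tsv = to_tsv(ROWS)
print(tsv)
```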

Page 78: What you Can Make Out of Linked Data

22859 abbreviations with 78197 meanings in 99 languages

Page 79: What you Can Make Out of Linked Data

Long Tail

Only 25 languages have more than 100 abbreviations.

Only 7 languages have more than 1000 abbreviations.

22859 abbreviations with 78197 meanings in 99 languages

Page 80: What you Can Make Out of Linked Data

Long tail (total abbrevs)

Page 81: What you Can Make Out of Linked Data

Long tail (total abbrevs) (zoom)

Page 82: What you Can Make Out of Linked Data

ULI Process

[Diagram labels: Wikipedia; DBpedia; Extraction; ULI Review: Comparison (Translation Memory), Manual review, CLDR abbrs.; CLDR Suppressions]

Image credit: "Lupa.na.encyklopedii" by Julo - Own work. Licensed under Public domain via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Lupa.na.encyklopedii.jpg#mediaviewer/File:Lupa.na.encyklopedii.jpg

Page 83: What you Can Make Out of Linked Data

Comparison with Translation Memory

Entry       % in TM
Corp.       0.0307%
St.         0.0023%
P.T.T.C.    0%

Image credit: "Trichtermitfilter" by Gmhofmann - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Trichtermitfilter.jpg#mediaviewer/File:Trichtermitfilter.jpg
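The slide does not say how "% in TM" was computed. One plausible reading, sketched below, is the share of translation-memory segments that contain the entry as a whitespace-delimited token. The toy segments are invented for illustration and have nothing to do with the figures in the table; `percent_in_tm` is a hypothetical helper.

```python
def percent_in_tm(entry, segments):
    """Share of TM segments (in percent) containing `entry` as a whole token."""
    if not segments:
        return 0.0
    hits = sum(1 for seg in segments if entry in seg.split())
    return 100.0 * hits / len(segments)


# Toy translation-memory segments, invented for illustration.
TM = [
    "Acme Corp. announced results",
    "Turn left on Main St. and stop",
    "No abbreviations here",
    "Acme Corp. is hiring",
]

for entry in ("Corp.", "St.", "P.T.T.C."):
    print(entry, percent_in_tm(entry, TM))
```

An entry that never shows up in the translation memory, like "P.T.T.C." in the table, is a natural candidate for dropping or for closer manual review.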

Page 84: What you Can Make Out of Linked Data

CLDR Input

Extract abbreviations from CLDR localized data

Days of week: Sun. Mon. Tue. Wed. Thu. …

Months: Jan. Feb. Mar. …

etc…

Page 85: What you Can Make Out of Linked Data

Manual Review

Page 86: What you Can Make Out of Linked Data

CLDR output format

<segmentations>
  <segmentation type="SentenceBreak">
    <!--From ULI data, http://uli.unicode.org-->
    <suppressions type="standard">
      <suppression>Port.</suppression>
      <suppression>Alt.</suppression>
      <suppression>Di.</suppression>
      <suppression>Ges.</suppression>
      <suppression>frz.</suppression>
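A fragment of this shape can be produced with the standard library. The sketch below builds only the elements visible on the slide; whatever encloses `<segmentations>` in a real CLDR file is left out, and `build_suppressions` is a hypothetical helper, not a CLDR tool.

```python
import xml.etree.ElementTree as ET


def build_suppressions(abbreviations):
    """Build <segmentations>/<segmentation>/<suppressions> with one child each."""
    root = ET.Element("segmentations")
    seg = ET.SubElement(root, "segmentation", type="SentenceBreak")
    sup = ET.SubElement(seg, "suppressions", type="standard")
    for abbr in abbreviations:
        ET.SubElement(sup, "suppression").text = abbr
    return root


root = build_suppressions(["Port.", "Alt.", "Di.", "Ges.", "frz."])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```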

Page 87: What you Can Make Out of Linked Data

CLDR 26 Output

http://cldr.unicode.org

“Break Suppression”

de  239
en  151
es  164
fr   82
it   45
pt  170
ru   18
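The per-locale counts above can be tallied the same way the long-tail slides tally abbreviations per language. The numbers are copied from the slide; the threshold of 100 and the `locales_above` helper are choices made for this sketch.

```python
# Counts copied from the slide (CLDR 26 break suppressions per locale).
SUPPRESSION_COUNTS = {
    "de": 239, "en": 151, "es": 164, "fr": 82, "it": 45, "pt": 170, "ru": 18,
}


def locales_above(counts, threshold):
    """Locales whose count exceeds `threshold`, largest first."""
    return sorted(
        (loc for loc, n in counts.items() if n > threshold),
        key=lambda loc: -counts[loc],
    )


total = sum(SUPPRESSION_COUNTS.values())
big = locales_above(SUPPRESSION_COUNTS, 100)
print(total, big)
```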

Page 88: What you Can Make Out of Linked Data

Challenges

"Long Tail" Languages

harder to find existing TM data

harder to find linguistic rules/review

harder to find tagged corpora to benchmark

Systematic issues with using redirects/disambiguation

Page 89: What you Can Make Out of Linked Data

Opportunity

Scope:

Punctuation other than the full stop, e.g. "Yahoo!"

Language specific abbreviation rules

Context (Medical, Business, …)

Leverage

Schema/Taxonomy ( “Place” vs “Person” etc. ) to filter

DBpedia lists

Additional LOD
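Filtering by schema type could look roughly like the sketch below. The first entry reuses the type URIs from the English example slide; the second entry, including its type, is hypothetical and only there to give the filter something to reject. `filter_by_type` is a made-up helper.

```python
PLACE = "http://dbpedia.org/ontology/Place"

MEANINGS = [
    # Types as listed on the English example slide.
    {"abbr": "St.", "meaning": "Street",
     "types": {"http://schema.org/Place", PLACE,
               "http://dbpedia.org/ontology/PopulatedPlace"}},
    # Hypothetical second meaning with a hypothetical type.
    {"abbr": "St.", "meaning": "Saint",
     "types": {"http://dbpedia.org/ontology/Person"}},
]


def filter_by_type(meanings, wanted_type):
    """Keep the meanings whose type set contains `wanted_type`."""
    return [m["meaning"] for m in meanings if wanted_type in m["types"]]


places = filter_by_type(MEANINGS, PLACE)
print(places)
```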

Page 90: What you Can Make Out of Linked Data

Thank You!

Further Q&A?

Slides & contact info: https://pad.okfn.org/p/DBpediaULI