02 semantic multimedia - einfuehrungs-workshop ss2012

Master Seminar SS2012 Semantic Multimedia, Introductory Workshop, 16.04.2012, Dr. Harald Sack / Nadine Steinmetz, Thursday, 3 May 2012

DESCRIPTION

Introductory workshop for the seminar Semantic Multimedia, summer semester 2012, Hasso-Plattner-Institut, Universität Potsdam, Dr. Harald Sack

TRANSCRIPT

Page 1: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Master Seminar SS2012 Semantic Multimedia

Introductory Workshop, 16.04.2012

Dr. Harald Sack / Nadine Steinmetz

Thursday, 3 May 2012

Page 2: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Introductory Workshop, Master Seminar SS 2012 - Semantic Multimedia, Dr. Harald Sack / Nadine Steinmetz, Hasso-Plattner-Institut, Potsdam

Overview - Building Blocks

Linked Data - Introduction

Text Mining - Introduction

Linked Data Dumps

RDF / OWL / SPARQL / JENA

POS Tagging / Stemming

NER data

Disambiguation

Category systems

Information Retrieval

Bibliographic data

Page 3: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

RDF / OWL / SPARQL / JENA

OWL

<owl:Class rdf:about="http://dbpedia.org/ontology/Spacecraft">
  <rdfs:label xml:lang="en">spacecraft</rdfs:label>
  <rdfs:label xml:lang="fr">vaisseau spatial</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://dbpedia.org/ontology/MeanOfTransportation"/>
</owl:Class>

RDF

<http://dbpedia.org/resource/Autism> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Philosopher> .

SPARQL

SELECT * WHERE {
  <http://dbpedia.org/resource/Berlin> ?p ?o .
  ?o a <http://dbpedia.org/ontology/Person> .
}
LIMIT 100

JENA

// qexecw2 is a QueryExecution that was created earlier (not shown on the slide)
com.hp.hpl.jena.query.ResultSet result = qexecw2.execSelect();
if (result != null) {
    while (result.hasNext()) {
        QuerySolution querysol = result.nextSolution();
        // value bound to the variable ?o in the current result row
        Object aux2 = querysol.get("o");
    }
}
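To make the triple model behind the RDF and SPARQL snippets above concrete, here is a minimal Python sketch that parses IRI-only N-Triples lines and answers a single triple pattern. It is not how Jena works internally; the function names `parse_ntriples` and `match_pattern` are made up for this illustration, and literals and blank nodes are deliberately ignored.

```python
import re

# Matches one N-Triples line whose subject, predicate and object are all IRIs.
# Assumption of this sketch: literals and blank nodes are not handled.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def parse_ntriples(text):
    """Return a list of (subject, predicate, object) IRI tuples."""
    triples = []
    for line in text.splitlines():
        match = TRIPLE.match(line.strip())
        if match:
            triples.append(match.groups())
    return triples


def match_pattern(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None behaves like a SPARQL variable."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


data = """
<http://dbpedia.org/resource/Autism> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Philosopher> .
"""

triples = parse_ntriples(data)
philosophers = match_pattern(triples, p=RDF_TYPE,
                             o="http://dbpedia.org/ontology/Philosopher")
print(philosophers[0][0])  # http://dbpedia.org/resource/Aristotle
```

For real dumps a proper RDF library is the safer choice, since full N-Triples also contains literals, language tags and escapes.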

Page 4: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Category systems

Wikipedia categories (SKOS)

<http://dbpedia.org/resource/Alabama> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Former_British_colonies> .
<http://dbpedia.org/resource/Alabama> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Place_names_in_Alabama_of_Native_American_origin> .
<http://dbpedia.org/resource/Alabama> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:States_and_territories_established_in_1819> .

DBpedia ontology (OWL Lite)

<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/AdministrativeRegion> .
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/PopulatedPlace> .
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place> .

Page 5: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Category systems

Yago (RDFS)

<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/StatesOfTheConfederateStatesOfAmerica> .
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/SouthernUnitedStates> .
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/StatesOfTheUnitedStates> .
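The category systems shown for Alabama (Wikipedia categories, DBpedia ontology, Yago) can be told apart by the URI prefix of the triple's object. The following sketch groups object URIs accordingly; the prefixes are taken from the slides, while the short system names and the function `group_by_system` are labels invented for this illustration.

```python
from collections import defaultdict

# URI prefixes as they appear on the slides; the system names are
# hypothetical labels chosen for this sketch.
SYSTEMS = {
    "http://dbpedia.org/resource/Category:": "wikipedia_category",
    "http://dbpedia.org/ontology/": "dbpedia_ontology",
    "http://dbpedia.org/class/yago/": "yago",
}


def group_by_system(object_uris):
    """Group object URIs by category system and strip the system prefix."""
    groups = defaultdict(list)
    for uri in object_uris:
        for prefix, system in SYSTEMS.items():
            if uri.startswith(prefix):
                groups[system].append(uri[len(prefix):])
                break
    return dict(groups)


alabama = [
    "http://dbpedia.org/resource/Category:Former_British_colonies",
    "http://dbpedia.org/ontology/PopulatedPlace",
    "http://dbpedia.org/ontology/Place",
    "http://dbpedia.org/class/yago/SouthernUnitedStates",
]

groups = group_by_system(alabama)
print(groups["dbpedia_ontology"])  # ['PopulatedPlace', 'Place']
```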

Page 6: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Bibliographic data

Dump available for download at: http://mediaglobe.yovisto.com/semmul2012/

www.bibsonomy.org

http://citeseer.ist.psu.edu

www.mendeley.com

Page 7: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Bibliographic data as Linked Data

DBLP on the Semantic Web

Page 8: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Linked Data Dumps & Data Processing

instance_types_en.nt

labels_en.nt

freebase_links.nt

disambiguations_en.nt

# Concatenate all ontology types of a resource into one comma-separated list
gawk 'BEGIN { FS = "\t" }
      { a[$1]++; b[$1] = $3 "," b[$1] }
      END { for (i in a) print(i "\t" gensub(/http:\/\/dbpedia\.org\/ontology\//, "", "g", gensub(/,$/, "", "g", b[i]))) }' \
    instance_types_de_wTreeDepth_sorted.txt \
  | sed 's/Person,Person/Person/g' \
  | sort -t $'\t' -k1,1 > instance_types_de_concat.txt

# Join with the owl:sameAs links; keep the original URI where no link exists
join -a 1 -1 1 -2 1 -e' null ' -o1.1,2.2,1.2 instance_types_de_concat.txt ../owlSameAs_all_urlEncoded_sorted.txt \
  | awk -F'\t' '{ if (gsub(/ null /, " null ", $2) > 0) print($1 "\t" $3); else print($2 "\t" $3) }' \
  > instance_types_de_concat_wOwlSameAs.txt
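For readers less familiar with awk, the first pipeline step (collecting all types of a resource into one line) can be sketched in Python. This is an approximation under assumptions: the input is taken to be tab-separated with the resource in column 1 and the type URI in column 3; what column 2 holds is not stated on the slide. The sample rows and the name `concat_types` are made up for this illustration.

```python
ONTOLOGY_PREFIX = "http://dbpedia.org/ontology/"


def concat_types(lines):
    """Collect the type (column 3) of every resource (column 1) into one
    comma-separated list per resource, sorted by resource."""
    types = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        subject, type_uri = fields[0], fields[2]
        # Like b[$1] = $3 "," b[$1] in the gawk script: later rows are prepended.
        types.setdefault(subject, []).insert(0, type_uri.replace(ONTOLOGY_PREFIX, ""))
    result = []
    for subject in sorted(types):  # code-point order; sort(1) depends on the locale
        joined = ",".join(types[subject]).replace("Person,Person", "Person")
        result.append(subject + "\t" + joined)
    return result


# Hypothetical input rows; the numbers in column 2 are placeholders.
rows = [
    "http://dbpedia.org/resource/Aristotle\t1\thttp://dbpedia.org/ontology/Person",
    "http://dbpedia.org/resource/Aristotle\t2\thttp://dbpedia.org/ontology/Philosopher",
]

for out in concat_types(rows):
    print(out)
```

Holding everything in a dictionary mirrors the awk arrays; for very large dumps a streaming approach over sorted input would use less memory.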

Page 9: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

POS Tagging / Stemming

POS-tagged: Arjen_ADJA Robben_NN hätte_VAFIN der_ART Held_NN im_APPRART Spitzenspiel_NN bei_APPR Borussia_NE Dortmund_NE werden_VAFIN können._ADJA Stattdessen_NN steht_VVFIN der_ART Bayern-Star_NN nach_APPR seinem_PPOSAT vergebenen_ADJA Elfmeter_NN als_APPR der_ART große_ADJA Verlierer_NN da._XY

Stemmed: Arjen Robb hatt der Held im Spitzenspiel bei Borussia Dortmund werd konnen. Stattdess steht der Bayern-Star nach sein vergeb Elfmet als der gross Verli da.

Original text: Arjen Robben hätte der Held im Spitzenspiel bei Borussia Dortmund werden können. Stattdessen steht der Bayern-Star nach seinem vergebenen Elfmeter als der große Verlierer da.
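Tagger output in the `word_TAG` form shown above is easy to post-process. The sketch below splits tokens into (word, tag) pairs and keeps nouns; in the STTS tag set NN marks common nouns and NE proper nouns. It takes the tags at face value, including the tagger's mistakes (for example "Arjen" tagged as ADJA). The function name `parse_tagged` is chosen for this illustration.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if sep:  # ignore tokens that carry no tag
            pairs.append((word, tag))
    return pairs


tagged = ("Arjen_ADJA Robben_NN hätte_VAFIN der_ART Held_NN im_APPRART "
          "Spitzenspiel_NN bei_APPR Borussia_NE Dortmund_NE werden_VAFIN können._ADJA")

pairs = parse_tagged(tagged)
# Keep common nouns (NN) and proper nouns (NE) as candidate terms.
nouns = [word for word, tag in pairs if tag in ("NN", "NE")]
print(nouns)  # ['Robben', 'Held', 'Spitzenspiel', 'Borussia', 'Dortmund']
```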

Page 10: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Information Retrieval Approaches

tf*idf: weighting of the relevance of a word for a document relative to a document corpus

Distance measures: for comparing documents

Okapi BM25: ranking function for the relevance of documents with respect to search queries

Latent Semantic Analysis (LSA): relationship between documents with respect to the terms they contain and the concepts generated for those terms

Latent Semantic Indexing (LSI): pattern detection with respect to the terms contained in an unstructured set of text documents
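As a small illustration of the first two approaches, the sketch below computes tf*idf weights and compares documents with cosine similarity. It uses one common variant (term frequency normalised by document length, idf = log(N/df)); the slides do not say which variant is meant, and the three toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


docs = [
    "ford mustang coupe".split(),
    "ford mustang convertible".split(),
    "wild mustang horse".split(),
]

vecs = tf_idf(docs)
print(vecs[0]["mustang"])        # 0.0, the term occurs in every document
print(cosine(vecs[0], vecs[2]))  # 0.0, only "mustang" is shared
```

The example shows why idf matters: "mustang" appears in all three documents, so it carries no weight, and the two car documents end up more similar to each other than to the horse document.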

Page 11: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

NER data

http://dbpedia.org/resource/Ford_Mustang

Original label: "Ford Mustang"

Alternative labels: the ford mustang, s-197, s197, ronaele mustang, ronaele, mustnag, mustang svt cobra, mustangs, mustang, ford mustang coupe, ford mustang convertible, ford, 1972 ford mustang

http://mediaglobe.yovisto.com:8080/semex/

Page 12: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

NER data

Redirects

http://dbpedia.org/resource/1972_Ford_Mustang --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ford_Mustang_GT --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ford_Mustang_GT_Convertible --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ford_Mustang_GT_Coupe --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Mustang_%28car%29 --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Mustang_GT --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Mustang_SVT_Cobra_R --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ronaele --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ronaele_Mustang --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/S-197 --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/S197 --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/The_Ford_Mustang --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Mustnag --> http://dbpedia.org/resource/Mustang

Disambiguation pages

http://dbpedia.org/resource/Mustang --> http://dbpedia.org/resource/Ford_Mustang
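The redirect and disambiguation mappings above can be combined into a simple resolution step: follow redirects first, then expand a disambiguation page into the entities it lists. The sketch uses a few of the mappings from this slide; the function `resolve` and the hop limit are assumptions of this illustration, not the workshop's implementation.

```python
REDIRECTS = {
    "http://dbpedia.org/resource/Mustnag": "http://dbpedia.org/resource/Mustang",
    "http://dbpedia.org/resource/S197": "http://dbpedia.org/resource/Ford_Mustang",
}

DISAMBIGUATIONS = {
    "http://dbpedia.org/resource/Mustang": ["http://dbpedia.org/resource/Ford_Mustang"],
}


def resolve(uri, redirects, disambiguations, max_hops=10):
    """Follow redirects, then expand a disambiguation page into its targets.

    Returns a list of candidate URIs. The seen set and max_hops guard
    against redirect cycles in dirty data.
    """
    seen = set()
    while uri in redirects and uri not in seen and len(seen) < max_hops:
        seen.add(uri)
        uri = redirects[uri]
    return disambiguations.get(uri, [uri])


print(resolve("http://dbpedia.org/resource/Mustnag", REDIRECTS, DISAMBIGUATIONS))
# ['http://dbpedia.org/resource/Ford_Mustang']
```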

Page 14: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Disambiguation

Steve McQueen rast mit dem Mustang die Route 101 entlang. (Steve McQueen races along Route 101 in the Mustang.)

Entity recognition

Candidate search: 3 candidates, 34 candidates, 23 candidates

Disambiguation

http://dbpedia.org/resource/Steve_McQueen

http://dbpedia.org/resource/Ford_Mustang

http://dbpedia.org/resource/U.S._Route_101_in_California
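The candidate search step can be sketched as a greedy longest-match lookup of word n-grams in a label index. The index below is a tiny hypothetical excerpt that reuses candidate names from the workshop slides; a real index would hold far more candidates per label (the slide reports 3, 34 and 23). `find_candidates` and `max_len` are names chosen for this illustration.

```python
def find_candidates(tokens, label_index, max_len=3):
    """Greedy longest-match lookup of token n-grams in a label index.

    Returns a list of (matched phrase, candidate URIs) in sentence order.
    """
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in label_index:
                found.append((phrase, label_index[phrase]))
                i += n
                break
        else:
            i += 1  # no label starts at this token
    return found


# Hypothetical, heavily shortened label index.
label_index = {
    "steve mcqueen": ["dbp:Steve_McQueen", "dbp:Steve_McQueen_(artist)"],
    "mustang": ["dbp:Ford_Mustang", "dbp:Mustang_(horse)", "dbp:Mustang_(Jeans)"],
    "route 101": ["dbp:U.S._Route_101_in_California"],
}

sentence = "Steve McQueen rast mit dem Mustang die Route 101 entlang."
tokens = sentence.rstrip(".").split()
matches = find_candidates(tokens, label_index)
print([phrase for phrase, _ in matches])
# ['steve mcqueen', 'mustang', 'route 101']
```

Each matched phrase still has several candidates; choosing among them is the job of the disambiguation step.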

Page 15: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Disambiguation

dbp:Steve_McQueen_(artist)

dbp:Mustang_(horse)

dbp:Mustang_(Jeans)

dbp:U.S._Route_101_in_California

dbp:Steve_McQueen

dbp:Ford_Mustang

Page 16: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Disambiguation: Co-occurrence Analysis

Mustang: http://dbpedia.org/resource/Ford_Mustang

context tags: Route 101, Steve McQueen

score: 2.0

a combination (j = 3), 259 combinations are generated. Subsequently, in this paper by terms we will refer to single terms as well as to valid term groups.

Term Mapping: The terms are then mapped to distinct semantic entities. For our approach we use entities of DBpedia. DBpedia provides labels for the identification of distinct entities in 92 languages. We use English and German as well as Finnish labels, as we have noticed that neither the English nor the German labels contain important acronyms as labels, but the Finnish language version does. As tagging users prefer to keep it short and simple [4], resources dealing with "Domain Name System" would rather be tagged with "DNS" than "Domain Name System".

After simple string matching of the terms of the context to DBpedia entities, the URIs are revised for redirects and disambiguation URIs. That is, the URIs concerned are replaced by their redirect targets or by the URIs they link to as disambiguation URIs. For our sample context, 120 candidates overall are mapped to 8 terms. These entity candidates have to be disambiguated within the given context. This disambiguation process is described in the next sections.

C. Co-occurrence Analysis of Context Terms in Wikipedia Articles

To find the appropriate entity for a term of the context, the disambiguation is processed for every entity candidate mapped to the term. In the first step, we use the Wikipedia article referring to the entity candidate to count occurrences of all the other terms in the context of the term currently processed (subsequently, this analysis step is referred to as CA). The score for an entity candidate is calculated as follows:

$$ C(t) = \{t_j\}, \quad j = 1 \dots k $$

$$ W(uri(t)_i) = \{w_r\}, \quad r = 1 \dots |W(uri(t)_i)| $$

$t$ is the term currently disambiguated. $C(t)$ is the set of terms in the context in which $t$ has to be disambiguated. $W(uri(t)_i)$ is the set of all terms in the Wikipedia article for the current entity candidate $uri(t)_i$ of the term $t$. To calculate the CA score, the number ($counter_{cooc_i}$) of how often all other terms of the context occur in the article for the entity candidate is determined as:

$$ counter_{cooc_i} = \sum_{j=1}^{k} \sum_{r=1}^{|W(uri(t)_i)|} \delta(t_j, w_r) $$

with

$$ \delta(x, y) = \begin{cases} 1 & x = y \\ 0 & \text{else} \end{cases} $$

Finally, the CA score is calculated as follows:

$$ score_{CA_i} = counter_{cooc_i} \cdot \frac{|W(uri(t)_i) \cap C(t)|}{|C(t)|} $$
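The CA score can be sketched in a few lines of Python. This is one reading of the formulas above under assumptions: C(t) is taken to contain only the other context terms, the article is given as an already extracted list of terms (with repetitions), and multi-word terms are treated as single items. The article terms in the example are invented.

```python
def ca_score(context_terms, article_terms):
    """Co-occurrence (CA) score of one entity candidate.

    context_terms: the other terms of the context, C(t)
    article_terms: terms of the candidate's Wikipedia article, W(uri(t)_i)
    """
    context = set(context_terms)
    if not context:
        return 0.0
    # counter_cooc: how often context terms occur in the article (with repeats)
    counter_cooc = sum(1 for term in article_terms if term in context)
    # |W ∩ C|: how many distinct context terms occur in the article at all
    overlap = len(context & set(article_terms))
    return counter_cooc * overlap / len(context)


# Hypothetical article terms for the candidate Ford_Mustang.
context = ["route 101", "steve mcqueen"]
article = ["ford", "steve mcqueen", "bullitt", "route 101"]

print(ca_score(context, article))  # 2.0
```

The second factor rewards candidates whose article covers many different context terms, so a single term repeated often does not dominate the score.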

D. Link Graph Analysis of Relationships between Entities

We assume that entities that are related to each other are also linked by means of their Wikipedia articles. Thus, for this analysis step we evaluate the link graphs for the entity candidates of a context. Subsequently, this analysis step is referred to as WA.

For our approach we have identified three different link types that describe certain relationships between entities. The link types are shown in Fig. 2 in descending order of the strength of the relationship between the relevant entities.

Link types b) and c) are links with a path length of w = 2. That means these entities are linked through a node, which is also an entity. E.g., Albert Einstein and Gottfried Leibniz both have incoming and outgoing links to the Berlin Academy of Sciences, but they are not directly linked in their Wikipedia articles. So, these two entities are linked with a link type b).

There are some entities in Wikipedia that refer to numerous other entities and that are referred to by lots of other entities. We ignored these entities with the highest in- and outdegrees (such as "United States"⁷ with over 300,000 incoming and almost 1,000 outgoing links), because entities that are only linked through such a highly frequented hub are probably not closely related to each other.

The WA detects connections between the entity currently processed and the entity candidates of the other terms in the context. A score for every link type is calculated similarly to the calculation of the score in the CA.

We count the entity candidates the processed candidate is linked to. For link types b) and c) we also count the number of different paths between two candidates. We calculate the score for direct links as follows:

$$ counter_{dlinks_i} = \sum_{j=1}^{k} \sum_{l=1}^{m} |uri(t)_i \rightarrow uri(t_j)_l| $$

$$ score_{dlinks_i} = \frac{|t \rightarrow t_k|}{|C(t)|} \cdot counter_{dlinks_i} $$

$counter_{dlinks_i}$ is the number of candidates the processed candidate ($uri(t)_i$) is linked to directly.
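A possible reading of the direct-link score in code: the number of directly linked candidates is multiplied by the share of context terms that have at least one linked candidate. The extracted formula is hard to read, so both this interpretation and the link data below are assumptions of this sketch rather than the paper's definitive method.

```python
def dlinks_score(candidate, other_candidates, links):
    """Direct-link score of one entity candidate (one possible reading).

    candidate: URI of the candidate currently processed
    other_candidates: {term: [candidate URIs]} for the other context terms
    links: {uri: set of URIs that uri links to}
    """
    if not other_candidates:
        return 0.0
    outgoing = links.get(candidate, set())
    # Number of directly linked candidates, counted per context term.
    per_term = {term: sum(1 for c in cands if c in outgoing)
                for term, cands in other_candidates.items()}
    counter_dlinks = sum(per_term.values())
    # Number of context terms with at least one linked candidate.
    linked_terms = sum(1 for count in per_term.values() if count > 0)
    return linked_terms / len(other_candidates) * counter_dlinks


# Hypothetical link graph excerpt.
others = {
    "steve mcqueen": ["dbp:Steve_McQueen", "dbp:Steve_McQueen_(artist)"],
    "route 101": ["dbp:U.S._Route_101_in_California"],
}
links = {"dbp:Ford_Mustang": {"dbp:Steve_McQueen"}}

print(dlinks_score("dbp:Ford_Mustang", others, links))  # 0.5
```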

This calculation yields higher scores for entity candidates that are linked to only one of the candidates of the other terms. Such candidates have fewer links, but these links are more explicit. An entity candidate that is linked to more than one of the candidates of a specific term in the context is much less relevant, because these links might reveal ambiguity again. The ranking we achieve by our score calculation is shown in Fig. 3. "uri 1" is linked to one entity candidate of every term in the context. That implies that this entity candidate is strongly related within this context. Also, relationships of this candidate to the other terms in the context are not ambiguous as the candidate is

⁷ http://dbpedia.org/resource/United_States

Page 17: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Disambiguation

Link analysis

Direct links

Symmetric links

Unidirectional links

Page 18: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Disambiguation

Link analysis

$$ score_{dlinks_i} = \frac{|t \rightarrow t_k|}{|C(t)|} \cdot counter_{dlinks_i} $$

Page 19: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Further possible approaches

Recommender systems

Content-Based Filtering

• Recommendation based on the properties of documents
• Property analysis for the algorithmic comparison of documents
• Subject headings (Schlagworte) vs. keywords (Schlüsselworte)

Collaborative Filtering

• Recommendation based on user behavior
• Similarity of user profiles
• Profile with respect to the usage of documents
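The difference between the two recommender approaches can be shown with a toy example. Both functions below use Jaccard similarity, which is just one possible similarity measure; the user profiles, document keywords and function names are invented for this sketch.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_collaborative(user, usage, k=1):
    """Collaborative filtering: suggest documents used by the k users
    whose usage profile is most similar to the given user's."""
    others = sorted((u for u in usage if u != user),
                    key=lambda u: jaccard(usage[user], usage[u]),
                    reverse=True)
    recommendations = []
    for other in others[:k]:
        for doc in usage[other]:
            if doc not in usage[user] and doc not in recommendations:
                recommendations.append(doc)
    return recommendations


def recommend_content(doc, keywords):
    """Content-based filtering: rank the other documents by keyword overlap."""
    return sorted((d for d in keywords if d != doc),
                  key=lambda d: jaccard(keywords[doc], keywords[d]),
                  reverse=True)


# Hypothetical data: which user used which document, and document keywords.
usage = {"anna": ["d1", "d2"], "ben": ["d1", "d2", "d3"], "cara": ["d4"]}
keywords = {"d1": ["rdf", "sparql"], "d2": ["rdf", "owl"], "d3": ["stemming"]}

print(recommend_collaborative("anna", usage))  # ['d3']
print(recommend_content("d1", keywords))       # ['d2', 'd3']
```

Collaborative filtering never looks at the documents themselves, only at who used them; content-based filtering never looks at users, only at document properties.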

Page 20: 02 Semantic Multimedia - Einfuehrungs-Workshop SS2012

Hands-On Workshop

RDF / OWL / SPARQL / JENA

or

Data processing

or

Disambiguation
