02 Semantic Multimedia - Introductory Workshop SS2012
DESCRIPTION
Introductory workshop for the seminar Semantic Multimedia, summer semester 2012, Hasso-Plattner-Institut, Universität Potsdam, Dr. Harald Sack
TRANSCRIPT
Master Seminar SS2012 Semantic Multimedia
Introductory Workshop, 16.04.2012
Dr. Harald Sack / Nadine Steinmetz
Thursday, 3 May 2012
Introductory Workshop, Master Seminar SS 2012 - Semantic Multimedia, Dr. Harald Sack / Nadine Steinmetz, Hasso-Plattner-Institut, Potsdam
Overview - Building Blocks
Linked Data - Introduction
Text Mining - Introduction
Linked Data dumps
RDF / OWL / SPARQL / JENA
POS tagging / stemming
NER data
Disambiguation
Category systems
Information retrieval
Bibliographic data
RDF / OWL / SPARQL / JENA

OWL:
<owl:Class rdf:about="http://dbpedia.org/ontology/Spacecraft">
  <rdfs:label xml:lang="en">spacecraft</rdfs:label>
  <rdfs:label xml:lang="fr">vaisseau spatial</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://dbpedia.org/ontology/MeanOfTransportation"/>
</owl:Class>

RDF:
<http://dbpedia.org/resource/Autism> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Philosopher> .

SPARQL:
SELECT * WHERE {
  <http://dbpedia.org/resource/Berlin> ?p ?o .
  ?o a <http://dbpedia.org/ontology/Person> .
} LIMIT 100

JENA:
// qexecw2 is a QueryExecution prepared elsewhere
com.hp.hpl.jena.query.ResultSet result = qexecw2.execSelect();
if (result != null) {
    while (result.hasNext()) {
        QuerySolution querysol = result.nextSolution();
        Object aux2 = querysol.get("o");  // value bound to ?o
    }
}
Category Systems

Wikipedia categories (SKOS):
<http://dbpedia.org/resource/Alabama> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Former_British_colonies> .
<http://dbpedia.org/resource/Alabama> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Place_names_in_Alabama_of_Native_American_origin> .
<http://dbpedia.org/resource/Alabama> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:States_and_territories_established_in_1819> .

DBpedia ontology (OWL Lite):
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/AdministrativeRegion> .
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/PopulatedPlace> .
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place> .
Category Systems

YAGO (RDFS):
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/StatesOfTheConfederateStatesOfAmerica> .
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/SouthernUnitedStates> .
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/StatesOfTheUnitedStates> .
Bibliographic Data
Dump available for download at: http://mediaglobe.yovisto.com/semmul2012/
www.bibsonomy.org
http://citeseer.ist.psu.edu
www.mendeley.com
Bibliographic Data as Linked Data
DBLP on the Semantic Web
Linked Data Dumps & Data Processing
instance_types_en.nt
labels_en.nt
freebase_links.nt
disambiguations_en.nt
gawk 'BEGIN{FS="\t"} {a[$1]++; b[$1]=$3","b[$1]} END{for(i in a) print(i"\t"gensub(/http:\/\/dbpedia\.org\/ontology\//,"","g",gensub(/,$/,"","g",b[i])))}' instance_types_de_wTreeDepth_sorted.txt | sed 's/Person,Person/Person/g' | sort -t $'\t' -k1,1 > instance_types_de_concat.txt

join -a 1 -1 1 -2 1 -e' null ' -o1.1,2.2,1.2 instance_types_de_concat.txt ../owlSameAs_all_urlEncoded_sorted.txt | awk -F'\t' '{if(gsub(/ null /," null ",$2)>0) print($1"\t"$3); else print($2"\t"$3)}' > instance_types_de_concat_wOwlSameAs.txt
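The gawk pipeline above collapses all type statements of a resource into one comma-separated line. As a rough illustration of that grouping step, here is a minimal Python sketch. The regular expression, the function name `types_per_subject`, and the choice to keep classes in input order are assumptions made for this example; the sample lines are the Alabama statements shown on the category-systems slide.

```python
import re
from collections import defaultdict

# Matches one N-Triples statement whose object is a URI (not a literal).
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ONTOLOGY = "http://dbpedia.org/ontology/"


def types_per_subject(lines):
    """Map each subject URI to its DBpedia ontology classes, in input order."""
    grouped = defaultdict(list)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip comments, literal objects and malformed lines
        subject, predicate, obj = match.groups()
        if predicate == RDF_TYPE and obj.startswith(ONTOLOGY):
            name = obj[len(ONTOLOGY):]  # drop the ontology namespace
            if name not in grouped[subject]:
                grouped[subject].append(name)
    return dict(grouped)


sample = [
    "<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/AdministrativeRegion> .",
    "<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/PopulatedPlace> .",
    "<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place> .",
]

for subject, names in types_per_subject(sample).items():
    # One tab-separated line per resource, like the concatenated output file.
    print(subject + "\t" + ",".join(names))
```

For full-size dumps, the same function can be fed a file object line by line instead of a list.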
POS Tagging / Stemming
Original: Arjen Robben hätte der Held im Spitzenspiel bei Borussia Dortmund werden können. Stattdessen steht der Bayern-Star nach seinem vergebenen Elfmeter als der große Verlierer da.
POS-tagged: Arjen_ADJA Robben_NN hätte_VAFIN der_ART Held_NN im_APPRART Spitzenspiel_NN bei_APPR Borussia_NE Dortmund_NE werden_VAFIN können._ADJA Stattdessen_NN steht_VVFIN der_ART Bayern-Star_NN nach_APPR seinem_PPOSAT vergebenen_ADJA Elfmeter_NN als_APPR der_ART große_ADJA Verlierer_NN da._XY
Stemmed: Arjen Robb hatt der Held im Spitzenspiel bei Borussia Dortmund werd konnen. Stattdess steht der Bayern-Star nach sein vergeb Elfmet als der gross Verli da.
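To make the stemming step more concrete, the following is a deliberately naive suffix-stripping sketch in Python. It is not the stemmer used for the slide and will not reproduce every stem shown above; the suffix list, the minimum stem length, and the function name `naive_stem` are assumptions chosen only to illustrate the idea of normalising umlauts and cutting common German endings.

```python
# Common German inflection endings, longest first (an illustrative selection).
SUFFIXES = ("en", "er", "em", "es", "e", "n", "s")

# Umlauts and sharp s are mapped to plain letters before stripping.
UMLAUTS = str.maketrans({
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U",
    "ß": "ss",
})


def naive_stem(word, min_stem=3):
    """Strip at most one suffix, keeping at least min_stem characters."""
    word = word.translate(UMLAUTS)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for token in ["Robben", "hätte", "werden", "große", "der", "Held"]:
    print(token, "->", naive_stem(token))
```

A production stemmer applies several ordered rule steps and region checks; this sketch only applies a single pass.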
Information Retrieval Approaches
tf*idf: weighting of the relevance of a word for a document with respect to a document corpus
Distance measures for comparing documents
Okapi BM25: ranking function for the relevance of documents with respect to search queries
Latent Semantic Analysis (LSA): relationship between documents with respect to the terms they contain and the concepts generated for those terms
Latent Semantic Indexing (LSI): pattern detection over the terms contained in an unstructured collection of text documents
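As a small worked example of the tf*idf weighting listed above, here is a minimal Python sketch. It uses one common variant (relative term frequency times the natural logarithm of the inverse document frequency); the slide does not specify a variant, so this choice, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = [
    "ford mustang coupe".split(),
    "mustang horse".split(),
    "ford focus".split(),
]

for doc, w in zip(docs, tf_idf(docs)):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

Terms that occur in only one document ("coupe") get a higher weight than terms shared by several documents ("mustang"), and a term that occurs in every document gets weight 0.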
NER Data
http://dbpedia.org/resource/Ford_Mustang
Original label: "Ford Mustang"
Alternative labels: the ford mustang, s-197, s197, ronaele mustang, ronaele, mustnag, mustang svt cobra, mustangs, mustang, ford mustang coupe, ford mustang convertible, ford, 1972 ford mustang
http://mediaglobe.yovisto.com:8080/semex/
NER Data

Redirects:
http://dbpedia.org/resource/1972_Ford_Mustang --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ford_Mustang_GT --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ford_Mustang_GT_Convertible --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ford_Mustang_GT_Coupe --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Mustang_%28car%29 --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Mustang_GT --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Mustang_SVT_Cobra_R --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ronaele --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Ronaele_Mustang --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/S-197 --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/S197 --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/The_Ford_Mustang --> http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/Mustnag --> http://dbpedia.org/resource/Mustang

Disambiguation pages:
http://dbpedia.org/resource/Mustang --> http://dbpedia.org/resource/Ford_Mustang
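The redirect and disambiguation data above can be combined into a simple candidate lookup. The following Python sketch assumes both datasets have been loaded into plain dictionaries; the names `REDIRECTS`, `DISAMBIGUATIONS`, and `resolve`, as well as the hop limit, are hypothetical and only serve the illustration. The entries are taken from the slide.

```python
BASE = "http://dbpedia.org/resource/"

# Redirect source -> redirect target (subset of the slide data).
REDIRECTS = {
    BASE + "Mustnag": BASE + "Mustang",
    BASE + "S197": BASE + "Ford_Mustang",
    BASE + "Ronaele": BASE + "Ford_Mustang",
}

# Disambiguation page -> entities it links to (only the link shown on the slide).
DISAMBIGUATIONS = {
    BASE + "Mustang": [BASE + "Ford_Mustang"],
}


def resolve(uri, redirects, disambiguations, max_hops=5):
    """Follow redirects, then expand a disambiguation page into candidates."""
    hops = 0
    while uri in redirects and hops < max_hops:  # guard against redirect loops
        uri = redirects[uri]
        hops += 1
    return list(disambiguations.get(uri, [uri]))


print(resolve(BASE + "Mustnag", REDIRECTS, DISAMBIGUATIONS))
print(resolve(BASE + "Ford_Mustang", REDIRECTS, DISAMBIGUATIONS))
```

A misspelled label such as "Mustnag" is first redirected to the disambiguation page and then expanded into the entity candidates listed there.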
NER Data
Further data sources
http://wordnet.princeton.edu/
http://wortschatz.uni-leipzig.de/
http://de.wiktionary.org/
Disambiguation
Steve McQueen races along Route 101 in the Mustang.
Entity recognition
Candidate finding: 3 candidates, 34 candidates, 23 candidates
Disambiguation:
http://dbpedia.org/resource/Steve_McQueen
http://dbpedia.org/resource/Ford_Mustang
http://dbpedia.org/resource/U.S._Route_101_in_California
Disambiguation
dbp:Steve_McQueen_(artist)
dbp:Mustang_(horse)
dbp:Mustang_(Jeans)
dbp:U.S._Route_101_in_California
dbp:Steve_McQueen
dbp:Ford_Mustang
Disambiguation
Co-occurrence Analysis
Mustang: http://dbpedia.org/resource/Ford_Mustang
context tags: Route 101, Steve McQueen
score: 2.0
a combination (j = 3), 259 combinations are generated. In the remainder of this paper, "terms" refers to single terms as well as to valid term groups.
Term Mapping: The terms are then mapped to distinct semantic entities. For our approach we use DBpedia entities. DBpedia provides labels for the identification of distinct entities in 92 languages. We use English, German, and Finnish labels, because we noticed that neither the English nor the German labels include important acronyms, whereas the Finnish language version does. As tagging users prefer to keep it short and simple [4], resources dealing with "Domain Name System" would rather be tagged with "DNS" than with "Domain Name System".
After simple string matching of the context terms to DBpedia entities, the URIs are checked for redirects and disambiguation URIs. That is, redirect URIs are replaced by their redirect targets, and disambiguation URIs by the URIs they link to. For our sample context, 120 candidates overall are mapped to 8 terms. These entity candidates have to be disambiguated within the given context. The disambiguation process is described in the next sections.
C. Co-occurrence Analysis of Context Terms in Wikipedia Articles
To find the appropriate entity for a term of the context, the disambiguation is carried out for every entity candidate mapped to the term. In the first step, we use the Wikipedia article of the entity candidate to count the occurrences of all other terms in the context of the term currently being processed (subsequently, this analysis step is referred to as CA). The score for an entity candidate is calculated as follows:
$$C(t) = \{t_j\}, \quad j = 1 \dots k$$
$$W(uri(t)_i) = \{w_r\}, \quad r = 1 \dots |W(uri(t)_i)|$$
$t$ is the term currently disambiguated. $C(t)$ is the set of terms in the context in which $t$ has to be disambiguated. $W(uri(t)_i)$ is the set of all terms in the Wikipedia article for the current entity candidate $uri(t)_i$ of the term $t$. To calculate the CA score, the number ($counter_{cooc_i}$) of how often all other terms of the context occur in the article for the entity candidate is determined as:
$$counter_{cooc_i} = \sum_{j=1}^{k} \sum_{r=1}^{|W(uri(t)_i)|} \delta(t_j, w_r)$$
with
$$\delta(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{else} \end{cases}$$
Finally, the CA score is calculated as follows:
$$score_{CA_i} = counter_{cooc_i} \cdot \frac{|W(uri(t)_i) \times C(t)|}{|C(t)|}$$
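The CA computation can be sketched in a few lines of Python. The counter follows the double sum over context terms and article terms. The excerpt does not define the operator that combines W(uri(t)_i) and C(t) in the fraction, so this sketch reads that factor as the number of distinct context terms found in the article; that reading, the function name `ca_score`, and the toy article term lists are assumptions.

```python
def ca_score(context_terms, article_terms):
    """Co-occurrence score of one entity candidate.

    context_terms: the other terms of the context, C(t)
    article_terms: all terms of the candidate's Wikipedia article, W(uri(t)_i)
    """
    if not context_terms:
        return 0.0
    # counter_cooc: how often the context terms occur in the article in total
    counter = sum(1 for w in article_terms for t in context_terms if t == w)
    # Assumed reading of the fraction: share of context terms found at least once
    overlap = len(set(context_terms) & set(article_terms))
    return counter * overlap / len(context_terms)


context = ["route 101", "steve mcqueen"]

# Invented term lists standing in for two candidate articles of "Mustang".
car_article = ["ford", "mustang", "steve mcqueen", "bullitt", "route 101"]
horse_article = ["horse", "feral", "north america"]

print(ca_score(context, car_article))
print(ca_score(context, horse_article))
```

A candidate whose article mentions more of the context terms, and mentions them more often, ranks higher than one whose article mentions none.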
D. Link Graph Analysis of Relationships between Entities
We assume that entities that are related to each other are also linked by means of their Wikipedia articles. Thus, for this analysis step we evaluate the link graphs for the entity candidates of a context. Subsequently, this analysis step is referred to as WA.
For our approach we have identified three different link types that describe certain relationships between entities. The link types are shown in Fig. 2 in descending order of the strength of the relationship between the relevant entities.
Link types b) and c) are links with a path length of w = 2. That means these entities are linked through a node, which is also an entity. E.g., Albert Einstein and Gottfried Leibniz both have incoming and outgoing links to the Berlin Academy of Sciences, but they are not directly linked in their Wikipedia articles. So, these two entities are linked with a link of type b).
Some entities in Wikipedia refer to numerous other entities and are referred to by many other entities. We ignored the entities with the highest in- and outdegrees (such as "United States" (footnote 7), with over 300,000 incoming and almost 1,000 outgoing links), because entities that are only linked through such a highly frequented hub are probably not closely related to each other.
The WA detects connections between the entity currently processed and the entity candidates of the other terms in the context. A score for every link type is calculated similarly to the score in the CA.
We count the entity candidates the processed candidate is linked to. For link types b) and c) we also count the number of different paths between two candidates. We calculate the score for direct links as follows:
$$counter_{dlinks_i} = \sum_{j=1}^{k} \sum_{l=1}^{m} |uri(t)_i - uri(t_j)_l|$$
$$score_{dlinks_i} = \frac{|t - t_k|}{|C(t)| \cdot counter_{dlinks_i}}$$
$counter_{dlinks_i}$ is the number of candidates the processed candidate ($uri(t)_i$) is linked to directly, so each summand is 1 if the two candidates are directly linked and 0 otherwise.
With this calculation, entity candidates that are linked to only one of the candidates of the other terms receive higher scores. Such candidates have fewer links, but these links are more explicit. An entity candidate that is linked to more than one of the candidates of a specific term in the context is much less relevant, because these links might reveal ambiguity again. The ranking we achieve by our score calculation is shown in Fig. 3. "uri 1" is linked to one entity candidate of every term in the context. That implies that this entity candidate is strongly related within this context. Also, relationships of this candidate to the other terms in the context are not ambiguous as the candidate is
7 http://dbpedia.org/resource/United_States
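One possible reading of the direct-link score can be sketched in Python as follows. The excerpt does not define the numerator of the score, so this sketch takes it to be the number of context terms with at least one directly linked candidate, and it places the counter in the denominator so that candidates linked to several candidates of the same term score lower, as the text describes. Both readings, the treatment of a link in either direction as "direct", and all names and toy URIs are assumptions.

```python
def direct_link_score(candidate, other_candidates, links):
    """Score one candidate by its direct links to candidates of other terms.

    other_candidates: {term: [candidate URIs]} for the other context terms
    links: set of (source, target) pairs of directly linked articles
    """
    def linked(a, b):
        # Assumption: a link in either direction counts as a direct link.
        return (a, b) in links or (b, a) in links

    counter = 0       # counter_dlinks: linked candidates over all terms
    linked_terms = 0  # assumed numerator: terms with at least one linked candidate
    for uris in other_candidates.values():
        hits = sum(1 for uri in uris if linked(candidate, uri))
        counter += hits
        if hits:
            linked_terms += 1
    if counter == 0:
        return 0.0
    return linked_terms / (len(other_candidates) * counter)


others = {"x": ["X1", "X2"], "y": ["Y1"]}

explicit = {("A", "X1"), ("Y1", "A")}
ambiguous = explicit | {("A", "X2")}

print(direct_link_score("A", others, explicit))
print(direct_link_score("A", others, ambiguous))
```

With one linked candidate per term the score is higher than when the same candidate is also linked to a second candidate of term "x", which matches the ranking behaviour described in the text.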
Disambiguation
Link Analysis
Direct links
Symmetric links
Unidirectional links
Disambiguation
Link Analysis
$$score_{dlinks_i} = \frac{|t - t_k|}{|C(t)| \cdot counter_{dlinks_i}}$$
Further Possible Approaches
Recommender Systems
Content-Based Filtering
• Recommendation based on the properties of documents
• Property analysis for the algorithmic comparison of documents
• Tags (Schlagworte) vs. keywords (Schlüsselworte)
Collaborative Filtering
• Recommendation based on user behaviour
• Similarity of user profiles
• Profile with respect to the use of documents
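The collaborative-filtering idea (recommend what similar users have used) can be illustrated with a very small Python sketch. Jaccard similarity over sets of used documents is just one simple choice of profile similarity; the slide does not name a measure, and the function names and user profiles below are invented for the example.

```python
def jaccard(a, b):
    """Similarity of two usage profiles given as sets of document ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, profiles):
    """Suggest documents used by the most similar other user."""
    others = [name for name in profiles if name != user]
    if not others:
        return set()
    nearest = max(others, key=lambda name: jaccard(profiles[user], profiles[name]))
    # Recommend only what the user has not used yet.
    return profiles[nearest] - profiles[user]


profiles = {
    "anna": {"d1", "d2", "d3"},
    "ben": {"d2", "d3", "d4"},
    "cem": {"d7"},
}

print(recommend("anna", profiles))
```

A content-based recommender would instead compare document properties, for example tf*idf vectors, without looking at other users.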
Hands-On Workshop
RDF / OWL / SPARQL / JENA
or
Data processing
or
Disambiguation