web intelligence 2013 - characterizing concepts of interest leveraging linked data and the social...
Paper presented at the 2013 IEEE/WIC/ACM International Conference on Web Intelligence, Atlanta, GA, USA
Copyright 2013 INSIGHT Centre for Data Analytics. All rights reserved.
INSIGHT Centre for Data Analytics www.insight-centre.org
Semantic Web & Linked Data
Research Programme
Characterising concepts of interest
leveraging Linked Data
and the Social Web
Fabrizio Orlandi, Pavan Kapanipathi,
Amit Sheth, Alexandre Passant
IEEE/WIC/ACM Web Intelligence
Atlanta, GA, USA
20th November 2013
Scenario:
Personalisation and User Profiling on the Social Web
http://www.flickr.com/photos/giladlotan/
User Profile
Solution
Interlink social websites
Merge and model user data
Personalise users’ experience
using their profile
Integration & User Modelling
Recommendations
Search Personalisation
Adaptive Systems
[Orlandi et al., I-Semantics 2012]
Entity-based user profiles of interests:
Sport
CEV Volleyball Cup
Music
Heavy Metal
Mastodon
Atlanta
…
Problem
Semantics?
Pragmatics?
Relevance?
Linking Open Data
The Semantics of the Web of Data
LOD Cloud by R. Cyganiak and A. Jentzsch
Music
Heavy Metal
Mastodon
Atlanta
CEV Champions League
Volleyball
Semantic Web
RDF
Example
“Mastodon is the best heavy metal band from Atlanta…
Can’t wait to see them live again!”
“Trentino vs Lugano about to start - Diatec youngster to
impress again in CEV Champions League #volleyball”
“W3C Invites Implementations of five Candidate
Recommendations for RDF 1.1 #SemanticWeb”
• Named entity recognition and disambiguation
• Frequency + time-decay weighting scheme
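A frequency + time-decay weighting can be sketched as follows. The slides do not give the decay function, so the exponential decay, the 30-day half-life and the function name weight_interests are assumptions made only for illustration. Each mention of an entity adds one vote, discounted by its age.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def weight_interests(mentions, now, half_life_days=30.0):
    """Sum one exponentially decayed vote per (entity, timestamp) mention.

    The exponential form and the half-life are assumptions, not the
    authors' published scheme.
    """
    weights = defaultdict(float)
    for entity, when in mentions:
        age_days = (now - when).total_seconds() / 86400.0
        # A mention loses half of its weight every `half_life_days` days.
        weights[entity] += 0.5 ** (age_days / half_life_days)
    return dict(weights)


now = datetime(2013, 11, 20)
mentions = [
    ("Mastodon (band)", now),                       # fresh mention, weight 1
    ("Mastodon (band)", now - timedelta(days=30)),  # one half-life old
    ("RDF", now - timedelta(days=60)),              # two half-lives old
]
profile = weight_interests(mentions, now)
```

Entities mentioned often and recently end up with the highest weights; old mentions fade without being dropped entirely.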
Music
Heavy Metal
Mastodon (band)
Atlanta (GA.)
CEV Champions League
Volleyball
Semantic Web
RDF
Example
Very abstract, very popular
Specific and time-dependent on events, etc.
Specific, very popular and time-dependent
Specific and time-dependent on events, etc.
Abstract and not popular
Abstract and popular
Specific and not popular
Very popular
Are all the extracted entities useful for personalisation?
How are concepts/entities being used on the Social Web? (Pragmatics)
The Dimensions of our
Characterisation
Specificity
The level of abstraction that an entity has in a common
conceptual schema shared by humans
Popularity
How popular an entity is on the Social Web
– How frequently is it mentioned/used at that point of time?
Temporal Dynamics
The trend and evolution of the frequency of mentions of an
entity on the Social Web
– i.e. popularity over time
Requirements
Our use case: real-time personalisation of Social
Web streams
1. (quasi-) Real-time computation of the dimensions
2. Results constantly up to date with the real world
3. Knowledge base and domain independent approach
Popularity
We chose the Twitter Search API
We search for an entity on the Twitter stream in a short recent time
frame.
Run entity disambiguation on the resulting tweets to filter out noisy
tweets.
Count the remaining tweets in a given timeframe.
The Popularity measure is the resulting value in tweets/second.
This is fast, simple and up to date, but works only for a short, recent time frame.
e.g. “Music” ~ 16.6 tw/s
“Heavy Metal” ~ 0.09 tw/s
“Semantic Web” ~ 0.0008 tw/s
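The popularity steps above (search, filter by disambiguation, count, divide by the time frame) can be sketched as below. This does not call the Twitter Search API: tweets are assumed to be already fetched as dictionaries with "text" and "created_at" keys, and the mentions_entity callback is a hypothetical stand-in for the entity disambiguation step.

```python
from datetime import datetime, timedelta


def popularity(tweets, window_start, window_end, mentions_entity):
    """Return tweets/second for the tweets that survive disambiguation."""
    seconds = (window_end - window_start).total_seconds()
    if seconds <= 0:
        raise ValueError("the time frame must have positive length")
    kept = [
        t for t in tweets
        if window_start <= t["created_at"] < window_end
        and mentions_entity(t["text"])  # drop noisy tweets
    ]
    return len(kept) / seconds


end = datetime(2013, 11, 20, 12, 0, 0)
start = end - timedelta(seconds=100)
tweets = [
    {"text": "Mastodon is the best heavy metal band",
     "created_at": end - timedelta(seconds=10)},
    {"text": "Mastodon live again, great band",
     "created_at": end - timedelta(seconds=50)},
    {"text": "A mastodon fossil was found",
     "created_at": end - timedelta(seconds=20)},
]


def is_band(text):
    # Toy disambiguation rule, for illustration only.
    return "band" in text.lower()


rate = popularity(tweets, start, end, is_band)  # 2 tweets / 100 s
```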
Temporal Dynamics
We use Wikipedia page views
Entities are already mapped to DBpedia
MediaWiki API provides a long history of daily page views of
Wikipedia articles
We use Mean and Standard Deviation for the last 30 days of page
views to identify if the popularity of an entity is:
– Stable/Unstable
– Trendy/Non-Trendy
(Diagrams from stats.grok.se: CEV_Champions_League and Typhoon_Haiyan (2013))
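A minimal sketch of how mean and standard deviation over the last 30 days of page views could yield the two labels. The slides do not state the decision rules, so everything here is an assumption: the coefficient-of-variation threshold for "stable", the comparison of the last 7 days against the 30-day mean for "trendy", and the parameter names. The daily counts are assumed to be already retrieved.

```python
from statistics import mean, pstdev


def temporal_dynamics(daily_views, cv_threshold=0.5, trend_ratio=1.5,
                      recent_days=7):
    """Label an entity as stable/unstable and trendy/non-trendy.

    Thresholds are illustrative assumptions, not values from the slides.
    """
    window = daily_views[-30:]            # last 30 days of page views
    mu = mean(window)
    sigma = pstdev(window)
    cv = sigma / mu if mu > 0 else 0.0    # relative spread around the mean
    recent = mean(window[-recent_days:])  # average of the latest days
    return {
        "mean": mu,
        "std": sigma,
        "stable": cv <= cv_threshold,
        "trendy": mu > 0 and recent >= trend_ratio * mu,
    }


flat = temporal_dynamics([100] * 30)                 # constant interest
spike = temporal_dynamics([100] * 23 + [1000] * 7)   # sudden burst
```

Under these assumed thresholds, the flat series is stable and non-trendy, while the series with a burst in the last week is unstable and trendy.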
Specificity
We use the Linking Open Data (LOD) cloud
Most of the available knowledge bases (e.g. DMOZ, Wordnet,
OpenCyc) are not up-to-date.
Wikipedia would be large, domain-independent, continuously
updated, but:
– entities are not organised hierarchically in a taxonomy
– We cannot use taxonomy-based methods (i.e. super-/sub-type relations)
– Plus: expensive algorithms would not be suitable for real-time computation
LOD Links Structure!
Graph-based measures
State-of-the-art (SOA) graph-based method:
indegree and outdegree
(here called Incoming/Outgoing Predicates – IP and OP)
We can use these methods with RDF triples
We introduce “distinct in/out-degree” (IDP and ODP)
[Diagram: node “m” with three incoming edges from s1, s2, s3 and two outgoing edges to o1, o2; edge labels p1, p1, p2, p3, p4]
Values for “m”:
IP (indegree) = 3
OP (outdegree) = 2
IDP (distinct indegree) = 2
ODP (distinct outdegree) = 2
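The four values for “m” can be reproduced with a few lines over (subject, predicate, object) triples. The assignment of predicates to edges below is an assumption chosen to be consistent with the values on the slide; the function name degree_measures is hypothetical.

```python
def degree_measures(triples, node):
    """Count incoming/outgoing predicates of `node`, with and without duplicates."""
    incoming = [p for s, p, o in triples if o == node]
    outgoing = [p for s, p, o in triples if s == node]
    return {
        "IP": len(incoming),         # indegree
        "OP": len(outgoing),         # outdegree
        "IDP": len(set(incoming)),   # distinct incoming predicates
        "ODP": len(set(outgoing)),   # distinct outgoing predicates
    }


# Assumed edge labelling: p1 is used twice on incoming edges.
triples = [
    ("s1", "p1", "m"),
    ("s2", "p1", "m"),
    ("s3", "p2", "m"),
    ("m", "p3", "o1"),
    ("m", "p4", "o2"),
]
measures = degree_measures(triples, "m")
```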
Our Specificity Measure
DRR (Distinct Relations Ratio):
DRR = Incoming Distinct Predicates (IDP) / Outgoing Distinct Predicates (ODP)
Compared with:
IP/OP, IP+OP, IP, IDP
Computed on the Sindice SPARQL endpoint in less than 1 second.
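As a sketch, DRR can be computed over an in-memory list of triples. The slides compute it on a SPARQL endpoint instead; the in-memory form, the function name drr, the example triples and the choice to return None when a node has no outgoing predicates are all assumptions.

```python
def drr(triples, node):
    """Distinct Relations Ratio: distinct incoming / distinct outgoing predicates."""
    idp = len({p for s, p, o in triples if o == node})
    odp = len({p for s, p, o in triples if s == node})
    # The ratio is undefined without outgoing predicates (assumed handling).
    return idp / odp if odp else None


# Hypothetical triples, for illustration only.
triples = [
    ("s1", "p1", "m"), ("s2", "p1", "m"), ("s3", "p2", "m"),
    ("m", "p3", "o1"), ("m", "p4", "o2"),
    ("a", "q1", "hub"), ("b", "q2", "hub"), ("c", "q3", "hub"),
    ("hub", "q4", "z"),
]
drr_m = drr(triples, "m")        # 2 distinct in / 2 distinct out
drr_hub = drr(triples, "hub")    # 3 distinct in / 1 distinct out
drr_leaf = drr(triples, "z")     # no outgoing predicates
```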
Alternative SOA Method
DMOZ (Open Directory Project) taxonomy
We use the hierarchical structure of DMOZ as an alternative method to
measure specificity.
We manually map entities to the DMOZ entities and compute the
distance from the root of the DMOZ tree.
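Once an entity has been mapped by hand to a DMOZ category, the distance from the root is simply the depth of the category path. The sketch assumes slash-separated paths whose root is named "Top"; the example paths are hypothetical and not taken from the evaluation.

```python
def dmoz_depth(category_path):
    """Return the number of levels below the root of a category path."""
    parts = [p for p in category_path.strip("/").split("/") if p]
    if parts and parts[0] == "Top":  # assumed root label
        parts = parts[1:]
    return len(parts)


depth_root = dmoz_depth("Top")
depth_music = dmoz_depth("Top/Arts/Music")
depth_metal = dmoz_depth("Top/Arts/Music/Styles/Heavy_Metal")  # hypothetical path
```

A deeper category is read as a more specific entity.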
Generation of a Gold Standard
Binary classification of entities
5 humans classified 160 entities in:
– Generic (38%)
– Specific (62%)
Substantial agreement (κ = 0.61)
Ranking of entities
5 humans rated the specificity of 160 entities in:
– 1 to 10 scale (1=very generic, 10=very specific)
Average rating: 7.03
Average std. dev.: 1.45
Average rating of the 30 entities with the highest std. dev.: 5.66
Average rating of the 30 entities with the lowest std. dev.: 7.51
Abstract entities are harder for humans to rate
Evaluation: Classification
We compared the different methods against the gold standard
created manually by the users
Agreement with gold std. in the binary classification task:
Method      DMOZ    DRR     IP/OP   IP+OP   IP      random
Agreement   83.9%   84.1%   70.0%   70.0%   72.5%   61.9%
The performance of the DRR measure for this classification task is comparable to a manual classification done using the DMOZ taxonomy and to human judgement.
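Agreement with the gold standard in a binary task is the share of entities that receive the same label from the method and from the human judges. A minimal sketch follows; the labels are made up for illustration and are not the evaluation data, and how each method turns its score into a generic/specific label is not covered here.

```python
def agreement(predicted, gold):
    """Fraction of shared entities with identical generic/specific labels."""
    shared = predicted.keys() & gold.keys()
    if not shared:
        raise ValueError("no entities in common")
    hits = sum(predicted[e] == gold[e] for e in shared)
    return hits / len(shared)


# Illustrative labels only.
gold = {"Music": "generic", "Volleyball": "generic",
        "Mastodon (band)": "specific", "RDF": "specific"}
predicted = {"Music": "generic", "Volleyball": "specific",
             "Mastodon (band)": "specific", "RDF": "specific"}
score = agreement(predicted, gold)  # 3 of 4 labels match
```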
Evaluation: Ranking
We rank the specificity of 50 randomly chosen entities using:
Gold standard (average of the 5 users’ rates for each entity)
DMOZ levels (integers, 0 to 9)
– We compute “DMOZ-” and “DMOZ+” as the worst and best possible rankings
compared to the gold standard ranking.
DRR, IP/OP, IP+OP and random values (real numbers)
We compute NDCG (Normalized Discounted Cumulative Gain) at
different ranking positions “p”.
(The ideal DCG is the DCG of the gold-standard ranking.)
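NDCG at position p can be sketched as below. The slides do not show which DCG variant was used, so the common form with a log2(i + 1) discount is assumed; gold scores stand for the average human ratings, and the three-entity example is made up.

```python
import math


def dcg(relevances, p):
    """Discounted cumulative gain over the first p positions (log2 discount)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:p], start=1))


def ndcg(ranked_entities, gold_scores, p):
    """DCG of a ranking divided by the DCG of the gold-standard ordering."""
    gains = [gold_scores[e] for e in ranked_entities]
    ideal_dcg = dcg(sorted(gains, reverse=True), p)
    return dcg(gains, p) / ideal_dcg if ideal_dcg else 0.0


# Illustrative gold scores (higher = more specific).
gold_scores = {"a": 9.0, "b": 5.0, "c": 2.0}
perfect = ndcg(["a", "b", "c"], gold_scores, p=3)
reversed_rank = ndcg(["c", "b", "a"], gold_scores, p=3)
```

A ranking identical to the gold standard scores 1, and any other ordering of the same entities scores less.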
Evaluation: Ranking
DRR: +5% for NDCG at 10 and 20
Evaluation on User Profiles
We evaluate the impact of the proposed measures on user
profiles of interests, a real use case
27 volunteers
Interests extracted from users’ posts on Facebook and Twitter
with NLP tools (as described in our previous work [1])
Frequency-based + time decay weighting strategy
Each user rated his/her Top 30 list of interests generated (total
of 794 user ratings)
Ratings on a “1 to 5” scale according to how relevant/interesting
each entity of interest is to the user (5 is highly relevant)
[1] Orlandi et al., I-Semantics 2012
Evaluation on User Profiles
Average score (1 to 5 scale) is computed according to groups of types of entities
Not-popular and generic entities better represent users’ perception of their interests (but we have only 17% of them)
This behaviour might be different in other applications and use cases! (e.g. news recommendations, etc.)
[Chart annotations: (+8%), (+12%), (17%)]
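Averaging the user ratings per group of entity types can be sketched as follows. The grouping by two boolean flags (popular, specific), the field names and the sample ratings are assumptions for illustration; they are not the 794 ratings collected in the study.

```python
from collections import defaultdict
from statistics import mean


def average_by_group(ratings):
    """Average the 1-to-5 scores within each (popularity, specificity) group."""
    groups = defaultdict(list)
    for r in ratings:
        key = ("popular" if r["popular"] else "not popular",
               "specific" if r["specific"] else "generic")
        groups[key].append(r["score"])
    return {key: mean(scores) for key, scores in groups.items()}


# Made-up ratings and flags.
ratings = [
    {"entity": "Music", "popular": True, "specific": False, "score": 3},
    {"entity": "Semantic Web", "popular": False, "specific": False, "score": 5},
    {"entity": "RDF", "popular": False, "specific": False, "score": 4},
    {"entity": "Mastodon (band)", "popular": True, "specific": True, "score": 4},
]
averages = average_by_group(ratings)
```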
Conclusions
Introduced dimensions for characterisation of concepts of interest:
specificity, popularity and temporal dynamics.
Proposed methods for their computation satisfying requirements for
real-time personalisation of Social Web streams:
Real-time, domain independent, up to date.
Introduced a novel measure (DRR) for specificity of concepts based
on the LOD cloud
Evaluated for two different tasks (classification and ranking) against SOA
methods (humans, DMOZ, graph measures)
Evaluated the impact of the measures on user profiles of interests
(27 users and ~800 ratings)
Abstract and non-popular interests are preferred by users
Future work
Experiment with the measures on user profiles used for different
personalisation tasks.
E.g. a tweet recommender system should give priority to trendy,
popular and specific entities instead.
Improve the simple popularity and trend detection methods.
Improve the DRR measure by adding more “semantics”, i.e. considering
the different types of edges.
Thanks!
@badmotorf
@pavankaps
@amit_p
@terraces