web intelligence 2013 - characterizing concepts of interest leveraging linked data and the social...

Copyright 2013 INSIGHT Centre for Data Analytics. All rights reserved. INSIGHT Centre for Data Analytics www.insight-centre.org Semantic Web & Linked Data Research Programme Characterising concepts of interest leveraging Linked Data and the Social Web Fabrizio Orlandi, Pavan Kapanipathi, Amit Sheth, Alexandre Passant IEEE/WIC/ACM Web Intelligence Atlanta, GA, USA 20th November 2013

Upload: fabrizio-orlandi

Post on 28-Aug-2014

DESCRIPTION

Paper presented at the 2013 IEEE/WIC/ACM International Conference on Web Intelligence, Atlanta, GA, USA

TRANSCRIPT

Page 1: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Copyright 2013 INSIGHT Centre for Data Analytics. All rights reserved.

INSIGHT Centre for Data Analytics www.insight-centre.org

Semantic Web & Linked Data

Research Programme

Characterising concepts of interest

leveraging Linked Data

and the Social Web

Fabrizio Orlandi, Pavan Kapanipathi,

Amit Sheth, Alexandre Passant

IEEE/WIC/ACM Web Intelligence

Atlanta, GA, USA

20th November 2013

Page 2: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Scenario:

Personalisation and User Profiling on the Social Web

http://www.flickr.com/photos/giladlotan/

Page 3: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Page 4: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Page 5: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

User Profile

Solution

Interlink social websites

Merge and model user data

Personalise users’ experience using their profile

Integration & User Modelling

Recommendations

Search Personalisation

Adaptive Systems

[Orlandi et al., I-Semantics 2012]

Page 6: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Entity-based user profiles of interests:

Sport

CEV Volleyball Cup

Music

Heavy Metal

Mastodon

Atlanta

Problem

Page 7: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Entity-based user profiles of interests:

Sport

CEV Volleyball Cup

Music

Heavy Metal

Mastodon

Atlanta

Problem

Semantics?

Pragmatics?

Relevance?

Page 8: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Linking Open Data

The Semantics of the Web of Data

LOD Cloud by R. Cyganiak and A. Jentzsch

Page 9: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Example

“Mastodon is the best heavy metal band from Atlanta… Can’t wait to see them live again!”

“Trentino vs Lugano about to start - Diatec youngster to impress again in CEV Champions League #volleyball”

“W3C Invites Implementations of five Candidate Recommendations for RDF 1.1 #SemanticWeb”

• Named entity recognition and disambiguation

• Frequency + time-decay weighting scheme

Extracted entities: Music, Heavy Metal, Mastodon, Atlanta, CEV Champions League, Volleyball, Semantic Web, RDF
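
A frequency + time-decay weighting scheme can be sketched as follows. The slides do not state the decay function, so the exponential decay, the `half_life_days` parameter and the function name `profile_weights` are assumptions made purely for illustration.

```python
import math
from collections import defaultdict

def profile_weights(mentions, now, half_life_days=30.0):
    """Weight each entity by its mentions, discounting older mentions.

    mentions: iterable of (entity, day) pairs, where day is a number
              (e.g. days since some reference date).
    now:      the current day on the same scale.
    The exponential decay and its half-life are illustrative assumptions.
    """
    decay_rate = math.log(2) / half_life_days
    weights = defaultdict(float)
    for entity, day in mentions:
        age = max(0.0, now - day)
        # A mention made exactly one half-life ago counts as 0.5.
        weights[entity] += math.exp(-decay_rate * age)
    return dict(weights)

# Hypothetical mentions: two for "Mastodon" (today and 30 days ago), one for "RDF".
example = [("Mastodon", 100), ("Mastodon", 70), ("RDF", 100)]
weights = profile_weights(example, now=100)
print(weights)
```

With this choice, frequently and recently mentioned entities end up with the highest weights in the profile.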

Page 10: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Music

Heavy Metal

Mastodon (band)

Atlanta (GA.)

CEV Champions League

Volleyball

Semantic Web

RDF

Example

Very abstract, very popular

Specific and time-dependent on events, etc.

Specific, very popular and time-dependent

Specific and time-dependent on events, etc.

Abstract and not popular

Abstract and popular

Specific and not popular

Very popular

Are all the extracted entities useful for personalisation?

How are concepts/entities being used on the Social Web? (Pragmatics)

Page 11: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

The Dimensions of our Characterisation

Specificity

The level of abstraction that an entity has in a common conceptual schema shared by humans

Popularity

How popular an entity is on the Social Web

– How frequently is it mentioned/used at that point in time?

Temporal Dynamics

The trend and evolution of the frequency of mentions of an entity on the Social Web

– i.e. popularity over time

Page 12: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Requirements

Our use case: real-time personalisation of Social Web streams

1. (Quasi-)real-time computation of the dimensions

2. Results constantly up to date with the real world

3. Knowledge-base- and domain-independent approach

Page 13: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Popularity

We chose the Twitter Search API

We search for an entity on the Twitter stream in a short, recent time frame.

Run entity disambiguation on the resulting tweets to filter out noisy tweets.

Count the remaining tweets in the given time frame.

The Popularity measure is the resulting value in tweets/second (tw/s).

This is fast, simple and up to date, but only covers a short, recent time frame.

e.g. “Music” ~ 16.6 tw/s

“Heavy Metal” ~ 0.09 tw/s

“Semantic Web” ~ 0.0008 tw/s
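
The final step of the popularity measure (counting tweets in a time frame and dividing by its length) can be sketched as below. The search and disambiguation steps are not shown; the function name `popularity` and the sample timestamps are hypothetical, and the input is assumed to be the timestamps of tweets that already passed disambiguation.

```python
from datetime import datetime, timedelta

def popularity(tweet_times, window_start, window_end):
    """Return tweets per second for tweets falling inside the window.

    tweet_times: datetimes of tweets that mention the entity and
                 survived disambiguation (assumed to be given).
    """
    seconds = (window_end - window_start).total_seconds()
    if seconds <= 0:
        raise ValueError("window_end must be after window_start")
    count = sum(1 for t in tweet_times if window_start <= t < window_end)
    return count / seconds

# Hypothetical data: one tweet every 10 seconds for a minute, plus one outside the window.
start = datetime(2013, 11, 20, 12, 0, 0)
end = start + timedelta(seconds=60)
times = [start + timedelta(seconds=10 * i) for i in range(6)]
times.append(start + timedelta(seconds=120))
rate = popularity(times, start, end)
print(rate)
```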

Page 14: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Temporal Dynamics

We use Wikipedia page views

Entities are already mapped to DBpedia

The MediaWiki API provides a long history of daily page views of Wikipedia articles

We use Mean and Standard Deviation for the last 30 days of page views to identify if the popularity of an entity is:

– Stable/Unstable

– Trendy/Non-Trendy

(Diagrams from: stats.grok.se) CEV_Champions_League, Typhoon_Haiyan (2013)
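
One way to turn the 30-day mean and standard deviation into the Stable/Unstable and Trendy/Non-Trendy labels is sketched below. The slides give no thresholds or decision rule, so the coefficient-of-variation test, the 7-day recent window and every default parameter here are assumptions, not the authors' settings.

```python
from statistics import mean, pstdev

def temporal_dynamics(daily_views, cv_threshold=0.5, recent_days=7, trend_factor=1.5):
    """Label an entity from its daily page views (oldest first).

    Assumed rules, for illustration only:
      stable: std / mean over the last 30 days is at most cv_threshold
      trendy: mean of the last recent_days is at least trend_factor
              times the 30-day mean
    """
    window = daily_views[-30:]
    mu = mean(window)
    sigma = pstdev(window)
    cv = sigma / mu if mu > 0 else 0.0
    recent = mean(window[-recent_days:])
    return {
        "mean": mu,
        "std": sigma,
        "stable": cv <= cv_threshold,
        "trendy": mu > 0 and recent >= trend_factor * mu,
    }

# Hypothetical series: a flat one and one with a spike in the last three days.
flat = temporal_dynamics([100] * 30)
spike = temporal_dynamics([100] * 27 + [5000, 4000, 3000])
print(flat["stable"], flat["trendy"])
print(spike["stable"], spike["trendy"])
```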

Page 15: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Specificity

We use the Linking Open Data (LOD) cloud

Most of the available knowledge bases (e.g. DMOZ, WordNet, OpenCyc) are not up to date.

Wikipedia would be large, domain-independent and continuously updated, but:

– Entities are not organised hierarchically in a taxonomy

– We cannot use taxonomy-based methods (i.e. super-/sub-type relations)

– PLUS: expensive algorithms would not be good for real-time computation

LOD Links Structure!

Page 16: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Graph-based measures

State-of-the-art (SOA) graph-based method: indegree and outdegree (here called Incoming/Outgoing Predicates – IP and OP)

We can use these methods with RDF triples

We introduce “distinct in/out-degree” (IDP and ODP)

[Figure: example RDF graph centred on node “m”, with subjects s1, s2, s3 linking to “m” and “m” linking to objects o1, o2; edge labels p1, p1, p2, p3, p4]

Values for “m”:

IP (indegree) = 3

OP (outdegree) = 2

IDP (distinct indegree) = 2

ODP (distinct outdegree) = 2
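
The four degree measures can be computed directly from a list of RDF triples, as in the sketch below. The function name `predicate_degrees` is hypothetical, and the assignment of predicates to edges in the example (p1, p1, p2 incoming; p3, p4 outgoing) is an assumption chosen to be consistent with the values given for “m”.

```python
def predicate_degrees(triples, node):
    """Count incoming/outgoing predicates of a node, total and distinct.

    triples: iterable of (subject, predicate, object) tuples.
    """
    incoming = [p for s, p, o in triples if o == node]
    outgoing = [p for s, p, o in triples if s == node]
    return {
        "IP": len(incoming),        # indegree
        "OP": len(outgoing),        # outdegree
        "IDP": len(set(incoming)),  # distinct indegree
        "ODP": len(set(outgoing)),  # distinct outdegree
    }

# Assumed reconstruction of the example graph around "m".
triples = [
    ("s1", "p1", "m"),
    ("s2", "p1", "m"),
    ("s3", "p2", "m"),
    ("m", "p3", "o1"),
    ("m", "p4", "o2"),
]
degrees = predicate_degrees(triples, "m")
print(degrees)  # {'IP': 3, 'OP': 2, 'IDP': 2, 'ODP': 2}
```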

Page 17: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Our Specificity Measure

DRR (Distinct Relations Ratio):

DRR = Incoming Distinct Predicates (IDP) / Outgoing Distinct Predicates (ODP)

Compared with: IP/OP, IP+OP, IP, IDP

Computed on the Sindice SPARQL endpoint in less than 1 sec.
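
A minimal sketch of how DRR could be obtained with two SPARQL 1.1 aggregate queries follows. The query text, the helper names and the choice to return `None` when ODP is zero are assumptions; the slides do not show the actual queries, and no endpoint is contacted here.

```python
def distinct_predicate_queries(entity_uri):
    """Build two SPARQL 1.1 queries counting distinct predicates.

    Returns (incoming_query, outgoing_query) as strings; running them
    against an endpoint is left to the caller.
    """
    incoming = "SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE { ?s ?p <%s> }" % entity_uri
    outgoing = "SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE { <%s> ?p ?o }" % entity_uri
    return incoming, outgoing

def drr(idp, odp):
    """Distinct Relations Ratio: IDP / ODP.

    Returning None for ODP == 0 is an illustrative choice; the slides
    do not say how that case is handled.
    """
    if odp == 0:
        return None
    return idp / odp

# Illustrative entity URI; the counts reuse the values of node "m" (IDP = 2, ODP = 2).
in_q, out_q = distinct_predicate_queries("http://dbpedia.org/resource/Heavy_metal_music")
print(in_q)
print(out_q)
print(drr(2, 2))  # 1.0
```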

Page 18: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Alternative SOA Method

DMOZ (Open Directory Project) taxonomy

We use the hierarchical structure of DMOZ as an alternative method to measure specificity.

We manually map entities to the DMOZ entities and compute the distance from the root of the DMOZ tree.
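
Once an entity has been mapped by hand to a DMOZ category, the distance from the root reduces to counting path segments. The sketch below assumes categories are written as slash-separated paths starting at a root named "Top"; the function name and the sample paths are illustrative only.

```python
def dmoz_depth(category_path):
    """Return the number of levels below the root for a category path.

    Assumes a slash-separated path whose first segment is the root "Top".
    """
    parts = [p for p in category_path.strip("/").split("/") if p]
    if parts and parts[0] == "Top":
        parts = parts[1:]
    return len(parts)

# Illustrative paths, not taken from the slides.
print(dmoz_depth("Top"))             # 0
print(dmoz_depth("Top/Arts/Music"))  # 2
```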

Page 19: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Generation of a Gold Standard

Binary classification of entities

5 humans classified 160 entities as:

– Generic (38%)

– Specific (62%)

Substantial agreement (κ = 0.61)

Ranking of entities

5 humans rated the specificity of 160 entities on a:

– 1 to 10 scale (1 = very generic, 10 = very specific)

Average rating: 7.03

Average Std. Dev.: 1.45

AVG Top 30 High Std. Dev.: 5.66

AVG Top 30 Low Std. Dev.: 7.51

Abstract entities are harder for humans to rate

Page 20: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Evaluation: Classification

We compared the different methods against the gold standard created manually by the users

Agreement with gold std. in the binary classification task:

Method      DMOZ    DRR     IP/OP   IP+OP   IP      random
Agreement   83.9%   84.1%   70.0%   70.0%   72.5%   61.9%

The performance of the DRR measure for this classification task is comparable to a manual classification done using the DMOZ taxonomy and to human judgement.

Page 21: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Evaluation: Ranking

We rank the specificity of 50 randomly chosen entities using:

Gold standard (average of the 5 users’ ratings for each entity)

DMOZ levels (integers, 0 to 9)

– We compute “DMOZ-” and “DMOZ+” as the worst and best possible rankings compared to the gold standard ranking.

DRR, IP/OP, IP+OP and random values (real numbers)

We compute NDCG (Normalized Discounted Cumulative Gain) at different ranking positions “p”.

(The ideal DCG is that of the gold standard ranking.)
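
NDCG at position p can be sketched as below. The formula on the slide did not survive extraction, so this uses the common linear-gain variant (relevance divided by log2 of rank + 1); whether the authors used this or another variant is an assumption. Relevance values stand in for the gold-standard specificity ratings.

```python
import math

def dcg_at(relevances, p):
    """Discounted cumulative gain of the first p items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:p]))

def ndcg_at(ranked_relevances, p):
    """NDCG at p: DCG of the ranking divided by DCG of the ideal ordering.

    ranked_relevances: gold-standard scores listed in the order produced
                       by the method under evaluation.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at(ideal, p)
    return dcg_at(ranked_relevances, p) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical gold-standard scores in the order a method ranked the entities.
perfect = ndcg_at([3, 2, 1], p=3)
reversed_order = ndcg_at([1, 2, 3], p=3)
print(perfect, reversed_order)
```

A ranking that matches the gold standard scores 1.0, and any other ordering scores lower.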

Page 22: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Evaluation: Ranking

DRR: +5% for NDCG at 10 and 20

Page 23: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Evaluation on User Profiles

We evaluate the impact of the proposed measures on user profiles of interests, a real use case

27 volunteers

Interests extracted from users’ posts on Facebook and Twitter with NLP tools (as described in our previous work [1])

Frequency-based + time-decay weighting strategy

Each user rated his/her generated Top 30 list of interests (total of 794 user ratings)

Ratings on a “1 to 5” scale according to how relevant/interesting each entity of interest is to the user (5 is highly relevant)

[1] Orlandi et al., I-Semantics 2012

Page 24: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Evaluation on User Profiles

Average score (1 to 5 scale) is computed according to groups of types of entities

[Chart annotations: (+8%), (+12%), (17%)]

Not-popular and generic entities better represent users’ perception of their interests (but we have only 17% of them)

This behaviour might be different in other applications and use cases! (e.g. news recommendations, etc.)
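
Computing the average score per group of entity types amounts to a grouped mean over the user ratings. In the sketch below the group labels and the ratings are made up for illustration; they are not the study's data.

```python
from collections import defaultdict

def average_score_by_group(ratings):
    """Return the mean rating per group.

    ratings: iterable of (group_label, score) pairs, where score is a
             user rating on the 1 to 5 scale.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for group, score in ratings:
        totals[group][0] += score
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

# Hypothetical ratings grouped by the characterisation of each entity.
sample = [
    ("generic, not popular", 5),
    ("generic, not popular", 4),
    ("specific, popular", 3),
]
averages = average_score_by_group(sample)
print(averages)
```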

Page 25: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Conclusions

Introduced dimensions for characterisation of concepts of interest: specificity, popularity and temporal dynamics.

Proposed methods for their computation satisfying requirements for real-time personalisation of Social Web streams:

Real-time, domain independent, up to date.

Introduced a novel measure (DRR) for specificity of concepts based on the LOD cloud

Evaluated for two different tasks (classification and ranking) against SOA methods (humans, DMOZ, graph measures)

Evaluated the impact of the measures on user profiles of interests (27 users and ~800 ratings)

Abstract and non-popular interests are preferred by users

Page 26: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Future work

Experiment with the measures on user profiles used for different personalisation tasks.

E.g. a tweet recommender system should give priority to trendy, popular and specific entities instead.

Improve the simple popularity and trend detection methods.

Improve the DRR measure by adding more “semantics”, i.e. considering the different types of edges.

Page 27: Web Intelligence 2013 - Characterizing concepts of interest leveraging Linked Data and the Social Web

Thanks!

@badmotorf

@pavankaps

@amit_p

@terraces
