Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Institute for Web Science & Technologies – WeST Of Sampling and Smoothing: Approximating Distributions over Linked Open Data Thomas Gottron May 26th, 2014 PROFILES Workshop, Crete


DESCRIPTION

Talk at the PROFILES 2014 workshop (co-located with ESWC) on sampling RDF graphs and smoothing techniques for estimating data distributions

TRANSCRIPT

Page 1: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Institute for Web Science & Technologies – WeST

Of Sampling and Smoothing: Approximating Distributions over

Linked Open Data

Thomas Gottron

May 26th, 2014

PROFILES Workshop, Crete

Page 2: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Thomas Gottron PROFILES 26.5.2014, 2Approximating Distributions over LOD

Distributions over Linked Data

Probability to observe a certain pattern k

Pattern types (with the examples shown on the slide):

• Predicates: foaf:knows
• RDF class types: rdf:type foaf:Person
• Property sets: ?x with foaf:knows, rdfs:label, sioc:follows
• Type sets: ?y with rdf:type foaf:Person and rdf:type dbpedia:Actor
• ECS: ?z with rdf:type dbpedia:Actor, foaf:knows, foaf:name
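As a rough illustration of how such pattern instances could be counted, here is a minimal Python sketch. It assumes that a property set is the set of predicates used with one subject and that a type set is the set of rdf:type classes of one subject; the function name `pattern_counts`, the prefixed names and the toy triples are made up for this example, and ECS is left out.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"  # prefixed name used as a plain string in this sketch


def pattern_counts(triples):
    """Count four kinds of pattern instances in an iterable of (s, p, o) triples."""
    predicates = Counter()
    class_types = Counter()
    props_by_subject = defaultdict(set)
    types_by_subject = defaultdict(set)
    for s, p, o in triples:
        predicates[p] += 1
        props_by_subject[s].add(p)
        if p == RDF_TYPE:
            class_types[o] += 1
            types_by_subject[s].add(o)
    # Assumption: one property set / type set per subject URI.
    property_sets = Counter(frozenset(ps) for ps in props_by_subject.values())
    type_sets = Counter(frozenset(ts) for ts in types_by_subject.values())
    return predicates, class_types, property_sets, type_sets


# Hypothetical toy data, loosely following the slide's examples.
triples = [
    ("ex:x", "foaf:knows", "ex:y"),
    ("ex:x", "rdfs:label", '"X"'),
    ("ex:y", "rdf:type", "foaf:Person"),
    ("ex:y", "rdf:type", "dbpedia:Actor"),
    ("ex:z", "rdf:type", "dbpedia:Actor"),
    ("ex:z", "foaf:knows", "ex:x"),
]
predicates, class_types, property_sets, type_sets = pattern_counts(triples)
```

Each returned `Counter` maps a pattern instance k to the number of times it was observed, which is the raw material for the distributions discussed on the following slides.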

Page 3: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Distributions over Linked Data

Effectively: estimate a distribution over pattern instances k_i (k_1, k_2, k_3, ..., k_n)

Applications:

• Query federation
• Data mining
• Schema inferencing

Page 4: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Distributions over Linked Data

Using the entire LOD cloud becomes less and less feasible.

Solution: operate on a sample

Challenges:

• How to sample?
• How to deal with unobserved instances of a pattern?

The distribution over k_1, k_2, k_3, ..., k_n obtained from a sample is only an approximation!

Page 5: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Sampling Linked Open Data

Page 6: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Data Format

Linked Data as N-Quads: (s p o c)

• Triple (s p o): what is the information?
• Context URI (c): where does it come from?
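To make the (s p o c) layout concrete, the following Python sketch splits one N-Quads line into its four terms. It is a simplified illustration and not a complete N-Quads parser: the regular expression, the `Quad` type and the example URIs are assumptions of this sketch, and it expects the context term to be present even though the N-Quads format allows it to be omitted.

```python
import re
from typing import NamedTuple


class Quad(NamedTuple):
    s: str  # subject
    p: str  # predicate
    o: str  # object
    c: str  # context URI: where the triple comes from


# One term: an IRI, a blank node, or a literal with optional datatype/language tag.
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'
_QUAD = re.compile(r"^\s*" + r"\s+".join([_TERM] * 4) + r"\s*\.\s*$")


def parse_quad(line: str) -> Quad:
    """Split a single N-Quads line into (s, p, o, c); raise ValueError otherwise."""
    match = _QUAD.match(line)
    if match is None:
        raise ValueError(f"not a four-term N-Quads line: {line!r}")
    return Quad(*match.groups())


# Hypothetical example line.
line = ('<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> '
        '"Alice Smith"@en <http://example.org/doc1> .')
quad = parse_quad(line)
```

The first three fields answer "what is the information?", while `quad.c` records the source document.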

Page 7: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Sampling Strategies

• Triple (edge) based sampling: sample over triples (s p o)
• Unique subject URI (node) based sampling: sample over subjects s
• Context based sampling: sample over contexts c

For all sampling approaches: unbiased sampling based on a uniform distribution
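The three strategies can be sketched in Python as follows. This is one plausible reading of the slide, not the code used in the talk: it assumes that subject-based and context-based sampling draw subjects or contexts uniformly and then keep every quad belonging to the drawn units. All function names and the toy data are hypothetical.

```python
import random


def sample_triples(quads, rate, rng):
    """Triple (edge) based: draw individual quads uniformly without replacement."""
    return rng.sample(quads, round(rate * len(quads)))


def _sample_by_key(quads, rate, rng, key_index):
    # Draw the units (subjects or contexts) uniformly, then keep all their quads.
    keys = sorted({q[key_index] for q in quads})
    chosen = set(rng.sample(keys, round(rate * len(keys))))
    return [q for q in quads if q[key_index] in chosen]


def sample_subjects(quads, rate, rng):
    """Unique subject URI (node) based: keep all quads of the drawn subjects."""
    return _sample_by_key(quads, rate, rng, 0)


def sample_contexts(quads, rate, rng):
    """Context based: keep all quads of the drawn context URIs."""
    return _sample_by_key(quads, rate, rng, 3)


# Hypothetical toy data: 8 quads, 4 subjects, 2 contexts.
quads = [(f"ex:s{i % 4}", "ex:p", f"ex:o{i}", f"ex:c{i % 2}") for i in range(8)]
rng = random.Random(42)
by_triple = sample_triples(quads, 0.5, rng)
by_subject = sample_subjects(quads, 0.5, rng)
by_context = sample_contexts(quads, 0.5, rng)
```

In this sketch the sampling rate refers to the units being drawn, so a 50% context sample can contain more or fewer than 50% of the triples when contexts differ in size.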

Page 8: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Smoothing Distributions

Page 9: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Obtaining a Distribution from an Index

The index maps each pattern instance k_i to its data items d_i,j:

k_1 → d_1,1  d_1,2  d_1,3  ...
k_2 → d_2,1  d_2,2
k_3 → d_3,1  d_3,2  d_3,3  ...
...
k_n → d_n,1  d_n,2  d_n,3  ...

https://github.com/gottron/lod-index-models

Page 10: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Obtaining a Distribution from an Index

Number of index entries per pattern instance, normalised to relative frequencies:

k_1: 4
k_2: 2
k_3: 10
...
k_n: 8
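A minimal Python sketch of this step, assuming the index is available as a dictionary from pattern instance to its list of data items. The dictionary contents are placeholders that reuse the slide's example counts and pretend that these four instances are the whole index.

```python
def relative_frequencies(index):
    """Turn an index {pattern instance: [data items]} into relative frequencies."""
    counts = {k: len(items) for k, items in index.items()}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("index contains no data items")
    return {k: c / total for k, c in counts.items()}


# Placeholder index with 4, 2, 10 and 8 entries, as in the slide's example.
index = {
    "k1": [f"d1_{j}" for j in range(4)],
    "k2": [f"d2_{j}" for j in range(2)],
    "k3": [f"d3_{j}" for j in range(10)],
    "kn": [f"dn_{j}" for j in range(8)],
}
distribution = relative_frequencies(index)
```

With a total of 24 entries, `distribution["k3"]` is 10/24, and any pattern instance missing from the index implicitly receives probability 0.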

Page 11: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Unobserved patterns!

Unobserved pattern instance (e.g. predicate, type sets)

Relative frequencies: P(<new>) = 0

Adjusted relative frequencies: add λ to every count, so that P(<new>) > 0

k_1: 4 + λ
k_2: 2 + λ
k_3: 10 + λ
...
k_n: 8 + λ
<new>: 0 + λ

Page 12: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Unobserved patterns!

Unobserved pattern instance (e.g. predicate, type sets)

• Lidstone smoothing with parameter λ
• Laplace smoothing (add-one) for λ = 1

k_1: 4 + λ
k_2: 2 + λ
k_3: 10 + λ
...
k_n: 8 + λ
<new>: 0 + λ
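Lidstone smoothing can be written as P(k) = (count(k) + λ) / (N + λ·V), where N is the total count and V the number of pattern instances considered. The Python sketch below applies it to the slide's example counts. Treating the unobserved mass as a single placeholder instance `"<new>"` follows the slide's picture and is an assumption of this sketch, since the number of truly unobserved instances is not known in practice; the function name `lidstone` is likewise made up.

```python
def lidstone(counts, lam, extra_keys=()):
    """Lidstone-smoothed probabilities; lam = 1 gives Laplace (add-one) smoothing."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    keys = list(counts) + [k for k in extra_keys if k not in counts]
    total = sum(counts.values())
    denominator = total + lam * len(keys)
    return {k: (counts.get(k, 0) + lam) / denominator for k in keys}


# Example counts from the slide, assumed to be the complete set of observations.
counts = {"k1": 4, "k2": 2, "k3": 10, "kn": 8}
laplace = lidstone(counts, lam=1.0, extra_keys=["<new>"])
lidstone_half = lidstone(counts, lam=0.5, extra_keys=["<new>"])
```

With λ = 1 the unobserved instance receives 1/29 of the probability mass, and with λ = 0.5 it receives 0.5/26.5; smaller values of λ stay closer to the raw relative frequencies.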

Page 13: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Evaluation

Page 14: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Experimental Evaluation

Obtain different distributions based on:

• Sampling strategy: triple, USU, context
• Sampling rate: 5% - 90%
• Smoothing: Laplace; Lidstone with λ = 0.5, λ = 0.1 and λ = 0.01

Compare to the full data set, 10 iterations

Data: Dynamic Linked Data Observatory, weekly snapshots, 16M triples (only the first snapshot used here)

Page 15: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Comparing Distributions

Information-theoretic measures for comparing distributions P and Q:

Cross-entropy of P and Q: H(P, Q) = -Σ_k P(k) log Q(k)

Kullback-Leibler divergence: D_KL(P || Q) = Σ_k P(k) log (P(k) / Q(k))
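Both measures use their standard definitions, H(P, Q) = -Σ P(k) log Q(k) and D_KL(P || Q) = Σ P(k) log (P(k) / Q(k)), which are related by H(P, Q) = H(P) + D_KL(P || Q). The Python sketch below computes them for distributions stored as dictionaries; the base-2 logarithm and the toy distributions are choices made for this example only.

```python
import math


def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pk * math.log2(pk) for pk in p.values() if pk > 0)


def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits; infinite if Q misses an event that P allows."""
    total = 0.0
    for k, pk in p.items():
        if pk == 0:
            continue
        qk = q.get(k, 0.0)
        if qk == 0:
            return math.inf
        total -= pk * math.log2(qk)
    return total


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) = H(P, Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)


# Toy distributions: P plays the role of the full data, Q of an estimate.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
unsmoothed = {"a": 1.0}  # "b" was never observed in this hypothetical sample
```

The infinite result for `unsmoothed` shows why smoothing matters in this comparison: an estimate that assigns probability 0 to an instance present in the full data cannot be scored with these measures.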

Page 16: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Experimental Setup

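Index construction / estimation of distributions on samples of 5%, 10%, 20%, 30%, ..., 90% and on the full data (100%)

The "deviation" of each sample-based distribution (5%, 10%, 20%, 30%, ..., 90%) from the distribution over the full data (100%) is measured.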
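The setup can be sketched as a small Python loop. Several simplifications here are assumptions of this sketch rather than details from the talk: it covers only triple sampling and predicate distributions, uses cross-entropy as the "deviation", and smooths over the predicate set of the full data. The function names and the synthetic triples are hypothetical.

```python
import math
import random
from collections import Counter


def predicate_distribution(triples, keys, lam):
    """Predicate distribution over `keys`, Lidstone-smoothed with lam (0 = none)."""
    counts = Counter(p for _, p, _ in triples)
    denominator = sum(counts.values()) + lam * len(keys)
    return {k: (counts[k] + lam) / denominator for k in keys}


def cross_entropy(p, q):
    return -sum(pk * math.log2(q[k]) for k, pk in p.items() if pk > 0)


def evaluate(triples, rates, lam, iterations, seed=0):
    """Average deviation of sample-based estimates from the full distribution."""
    rng = random.Random(seed)
    keys = sorted({p for _, p, _ in triples})
    full = predicate_distribution(triples, keys, lam=0.0)
    results = {}
    for rate in rates:
        scores = []
        for _ in range(iterations):
            size = max(1, round(rate * len(triples)))
            sample = rng.sample(triples, size)
            estimate = predicate_distribution(sample, keys, lam)
            scores.append(cross_entropy(full, estimate))
        results[rate] = sum(scores) / len(scores)
    return results


# Synthetic data: three predicates with 30, 20 and 10 triples.
triples = [(f"ex:s{i}", f"ex:p{j}", f"ex:o{i}")
           for j, n in enumerate([30, 20, 10]) for i in range(n)]
results = evaluate(triples, rates=[0.1, 0.5, 1.0], lam=0.5, iterations=10)
```

Each value in `results` is bounded from below by the entropy of the full distribution, so the interesting quantity is how far a given sampling rate and λ stay above that bound.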

Page 17: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Impact of Sampling Strategy

[Plots: Predicates, RDF class types, Property sets, Type sets; ECS similar]

Page 18: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Impact of Smoothing

[Plots: Predicates, context sampling; Predicates, triple sampling; ECS, context sampling; ECS, USU sampling]

Page 19: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Conclusion

Summary

• Baseline for sampling and smoothing techniques
• Little difference between classical smoothing techniques
• Quality of context-based sampling as realistic scenario
• Other samplings suitable for generating VoID descriptions

Future Work

• Smarter smoothing techniques: inspired by language modelling, specific for LOD

Page 20: Of Sampling and Smoothing: Approximating Distributions over Linked Open Data

Thanks!

Contact: Thomas Gottron

WeST – Institute for Web Science and Technologies

Universität Koblenz-Landau
