Observing Linked Data Dynamics
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
1) INSTITUTE AIFB, KARLSRUHE INSTITUTE OF TECHNOLOGY, GERMANY; 2) DERI, NATIONAL UNIVERSITY OF IRELAND, GALWAY
http://swse.deri.org/dyldo/
Observing Linked Data Dynamics
Tobias Käfer 1), Ahmed Abdelrahman 2), Patrick O’Byrne 2), Jürgen Umbrich 2), Aidan Hogan 2)
May 30, 2013
Extended Semantic Web Conference (ESWC 2013), Montpellier, France
Observing Linked Data Dynamics // TOBIAS KÄFER, Ahmed Abdelrahman, Patrick O'Byrne, Jürgen Umbrich, Aidan Hogan // ESWC 2013 // http://swse.deri.org/dyldo/
Linked Data Dynamics
… more than the growth of the LOD-Cloud
Why you might care:
As a publisher:
• Versioning
• Link maintenance
As a consumer:
• Reasoning
• Hybrid Linked Data warehouses
The Dynamic Linked Data Observatory – Part of a Bigger Movement (Web Observatories)
“[…] in order to study the Web, you need to observe what happens on the Web. To do this, one has to study it every day to understand the dynamics of the Web and the interaction with technology, and what people do with it.”
“[…] to create a distributed archive of data on the Web and its activity, and […] mechanisms and tools that will be able to explore its development in the past, to examine its present condition and to establish potential developments in the future.”
Prof. Dame Wendy Hall, 2013
http://www.thehindu.com/sci-tech/internet/web-observatory-for-cybergazing/article4386613.ece
Web Science Trust: definition of a Web Observatory
The Dynamic Linked Data Observatory
Mission: To capture the dynamics of Linked Data
Data sources (diagram):
• Billion Triple Challenge dataset of 2010 + LOD cloud: fixed URI list
• + crawl
• The Linked Data Web
See our paper at LDOW 2012 (more on that in a second)
Parts of the URI list:
• Fixed URI list: the “kernel”, “core”, or “seedlist” part
• + crawl: the “extended” or “crawl” part
Core part: combination of LOD/CKAN and BTC
• 220 example URIs from the datasets in the LOD cloud
• 220 top PageRanked URIs from the BTC 2010 dataset
• Crawled from there to get approx. 100k URIs (union of 10 crawls)
Weekly snapshots of a URI list derived from the LOD cloud and the 2010 Billion Triple Challenge dataset, chosen for coverage and variety.
[Timeline: one snapshot per week, from May 6, 2012 until today]
This presentation: Findings from the first half year of observation
• Nominal size of a snapshot: 95,737 URIs (kernel) / 191,474 URIs (extended)
• May to November 2012: 6 months, 29 (weekly) snapshots
Statistics on the data basis:
Statistic                 Kernel                   Extended
Mean pay-level domains    573.6 ± 16.6             1,738.6 ± 218
Mean documents            68,996.9 ± 5,555.2       152,355.7 ± 2,356.3
Mean quadruples           16,001,671 ± 988,820     94,725,595 ± 10,279,806
Sum quadruples            464,048,460              2,747,042,282
[Timeline: May 6, 2012 to today in 1-week steps, with the portion analysed in this paper marked]
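As a quick sanity check on the table, the mean number of quadruples per snapshot should equal the sum divided by the 29 snapshots. The following is a minimal Python sketch that uses only the figures from this slide; the variable and function names are illustrative and not part of the original work.

```python
# Figures copied from the slide; all names below are illustrative only.
SNAPSHOTS = 29

stats = {
    "kernel": {"sum_quads": 464_048_460, "mean_quads": 16_001_671},
    "extended": {"sum_quads": 2_747_042_282, "mean_quads": 94_725_595},
}


def mean_per_snapshot(total: int, n: int = SNAPSHOTS) -> float:
    """Average number of quadruples per weekly snapshot."""
    return total / n


def relative_gap(part: str) -> float:
    """Relative difference between the recomputed and the reported mean."""
    recomputed = mean_per_snapshot(stats[part]["sum_quads"])
    reported = stats[part]["mean_quads"]
    return abs(recomputed - reported) / reported


for part in stats:
    print(part, round(mean_per_snapshot(stats[part]["sum_quads"]), 1))
```

Both recomputed means agree with the reported ones to within one quadruple, and the nominal extended size (191,474 URIs) is exactly twice the kernel size (95,737 URIs).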
Secret questions of a Linked Data geek
• How often do links between documents change?
• Are document updates mostly additions or mostly deletions?
• Are there provider-dependent publishing patterns?
• How frequently does a Linked Data document change?
• Can I assume schema data to be static?
• What are the most dynamic predicates?
(… vs. <html>)
Call for observations on different levels of abstraction:
[Axis "granularity": RDF Graphs Documents Hosts (PLD)]
Document-level dynamics: Life (Availability)…
[Histogram: number of snapshots a document appeared in (x-axis, ticks 0 to 25) vs. % of the 87k documents *) (y-axis, 0 to 30%)]
• Mean = 23.1 snapshots (~80%)
• 26% of URIs available in all snapshots
*) 86,696 RDF documents ever appeared in ≥ 1 kernel snapshot
You probably miss 20% of the sources in a download (cf. 50% for the HTML web in Fetterly et al. (2003)).
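The availability figures above can be derived from per-snapshot document lists. Below is a minimal Python sketch, assuming each snapshot is represented as a set of the document URIs that were successfully retrieved that week. This representation and the function names are assumptions for illustration, not the observatory's actual data format.

```python
from collections import Counter
from typing import Dict, List, Set


def appearance_counts(snapshots: List[Set[str]]) -> Counter:
    """Count, per document URI, the number of snapshots it appeared in."""
    counts: Counter = Counter()
    for docs in snapshots:
        counts.update(docs)
    return counts


def availability_summary(snapshots: List[Set[str]]) -> Dict[str, float]:
    """Summarise availability over all documents seen in >= 1 snapshot."""
    counts = appearance_counts(snapshots)
    n_snapshots = len(snapshots)
    n_docs = len(counts)
    mean_appearances = sum(counts.values()) / n_docs
    in_all = sum(1 for c in counts.values() if c == n_snapshots)
    return {
        "mean_appearances": mean_appearances,
        "mean_share": mean_appearances / n_snapshots,
        "share_in_all": in_all / n_docs,
    }


# Toy data: three snapshots over four hypothetical documents.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
summary = availability_summary(toy)
print(summary)
```

With the slide's figures, a mean of 23.1 appearances over 29 snapshots gives 23.1 / 29 ≈ 0.80, which is where the "miss 20% of the sources" estimate comes from.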
Document-level dynamics: … and Death
• Last heart-beat: overestimates death…
• … and death certificate filled: underestimates death
[Figure label: HTTP-500 etc.]
Of documents, 5% are likely to go dead in 6 months (cf. 20% and 48% for the HTML web in Koehler (1999) and Ntoulas et al. (2004), respectively).
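To illustrate why the two estimates bracket the true death rate, here is a minimal Python sketch. Both rules are assumptions made for this example and are not the exact definitions used in the paper: the "last heart-beat" rule treats a document as dead when it did not return HTTP 200 in the final snapshot(s), and the "death certificate" rule only counts an explicit 404 or 410 in the final snapshot.

```python
from typing import Dict, Optional, Sequence, Set

Status = Optional[int]  # HTTP status per weekly snapshot; None = no response


def dead_by_last_heartbeat(history: Sequence[Status], quiet_weeks: int = 1) -> bool:
    """Assumed rule: dead if none of the last `quiet_weeks` snapshots returned 200."""
    return all(s != 200 for s in history[-quiet_weeks:])


def dead_by_certificate(history: Sequence[Status]) -> bool:
    """Assumed rule: dead only if the final response is an explicit 404 or 410."""
    return history[-1] in (404, 410)


# Hypothetical, non-empty status histories for four documents.
histories: Dict[str, Sequence[Status]] = {
    "doc-gone": [200, 200, 404, 404],
    "doc-flaky": [200, 500, 200, 500],     # temporary server error at the end
    "doc-silent": [200, 200, None, None],  # host stopped answering
    "doc-alive": [200, 200, 200, 200],
}

heartbeat: Set[str] = {d for d, h in histories.items() if dead_by_last_heartbeat(h)}
certificate: Set[str] = {d for d, h in histories.items() if dead_by_certificate(h)}
print(sorted(heartbeat), sorted(certificate))
```

In this toy data the heart-beat rule also counts the flaky document (a transient HTTP 500), while the certificate rule misses the silent host, so the two rules overestimate and underestimate death respectively.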
Document-level dynamics: Changes
62% of all documents were static (cf. 56%, 66%, or 50% reported for
the HTML web (Brewington and Cybenko (2000), Fetterly et al. (2003),
and Ntoulas et al. (2004)))
Document-level changes clustered by host (PLD)
[Plot: share of documents with changes on the host (PLD) (x-axis) vs. avg. # snapshots with changes in documents with changes (y-axis)]
Regions of the plot:
• Hardly any changes at all
• Only few documents change, but frequently
• Few changes, but if so, to most documents
• Frequent changes in most documents
Decide per host (PLD) on a refreshing strategy (cf. Ntoulas et al. (2004) on per-site HTML change predictions).
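The two plot axes can be computed per host and turned into a coarse refresh class. The sketch below is one possible way to do this in Python. The input format, the function names, and both thresholds are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def host_profiles(changes_per_doc: Dict[Tuple[str, str], int]) -> Dict[str, Tuple[float, float]]:
    """Map each PLD to (share of documents with changes,
    avg. number of snapshots with changes among changed documents).

    `changes_per_doc` maps (pld, doc_uri) -> number of snapshots with a change.
    """
    by_host: Dict[str, List[int]] = defaultdict(list)
    for (pld, _doc), n_changes in changes_per_doc.items():
        by_host[pld].append(n_changes)
    profiles = {}
    for pld, counts in by_host.items():
        changed = [c for c in counts if c > 0]
        share = len(changed) / len(counts)
        avg = mean(changed) if changed else 0.0
        profiles[pld] = (share, avg)
    return profiles


def refresh_class(share: float, avg: float,
                  share_cut: float = 0.5, avg_cut: float = 10.0) -> str:
    """Assign one of the four plot regions; thresholds are hypothetical."""
    if share < share_cut:
        return ("only few documents change, but frequently"
                if avg >= avg_cut else "hardly any changes at all")
    return ("frequent changes in most documents"
            if avg >= avg_cut else "few changes, but if so, to most documents")


# Toy input with hypothetical hosts and change counts.
toy_changes = {
    ("a.org", "d1"): 0, ("a.org", "d2"): 0,
    ("b.org", "d1"): 20, ("b.org", "d2"): 0, ("b.org", "d3"): 0,
    ("c.org", "d1"): 1, ("c.org", "d2"): 2,
    ("d.org", "d1"): 25, ("d.org", "d2"): 27,
}
classes = {pld: refresh_class(*p) for pld, p in host_profiles(toy_changes).items()}
print(classes)
```

A crawler could then pick a refresh interval per class, for example revisiting hosts in the "frequent changes in most documents" region every week and the "hardly any changes at all" region far less often.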
Document-level changes per topic and party
Grouping domains by metadata from the LOD cloud and the DataHub
[Figure: the LOD cloud colour-coded by topic; panels: LOD-cloud topic, Party]
RDF-level dynamics: triples
• Only 27.6% of the documents updated values for terms (i.e. one per triple) *
• 24% monotonic additions *
* given there are changes at all
Deletions and additions almost always balance out, which calls for efficient data revision strategies in Linked Data warehouses.
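The distinction between monotonic additions and term updates can be made concrete by diffing two versions of a document as sets of triples. The following Python sketch works under simplifying assumptions: triples are plain string tuples, blank nodes are ignored, and a "term update" is approximated as every added triple differing from some deleted triple in exactly one position. The category names and helper functions are illustrative, not the paper's implementation.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def diff(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (added, deleted) triples between two versions of a document."""
    return new - old, old - new


def differs_in_one_term(a: Triple, b: Triple) -> bool:
    """True if the two triples differ in exactly one of subject, predicate, object."""
    return sum(x != y for x, y in zip(a, b)) == 1


def classify_change(old: Set[Triple], new: Set[Triple]) -> str:
    added, deleted = diff(old, new)
    if not added and not deleted:
        return "static"
    if not deleted:
        return "monotonic addition"
    if not added:
        return "monotonic deletion"
    # Simplified check: balanced change where each added triple has a
    # deleted counterpart that differs in exactly one term.
    if len(added) == len(deleted) and all(
        any(differs_in_one_term(a, d) for d in deleted) for a in added
    ):
        return "term update"
    return "mixed"


old = {("ex:s", "ex:p", "1"), ("ex:s", "ex:q", "x")}
print(classify_change(old, old | {("ex:s", "ex:r", "y")}))
print(classify_change(old, {("ex:s", "ex:p", "2"), ("ex:s", "ex:q", "x")}))
```

A term update shows up as one deletion plus one addition, which is why deletions and additions tend to balance out and why a warehouse needs revision (not just append) support.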
RDF-level dynamics: terms
Figure annotations:
• We're talking small numbers
• Most active; cf. most active predicates
• Static schema signature of a document
Because of the static schema structure of documents, VoID descriptions don't need to be updated frequently.
RDF-level dynamics: The most dynamic predicates
Annotation: indicating a timestamp
*) provenance time updated, and provenance time added, respectively
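One simple way to rank predicates by dynamicity is to count how often triples with that predicate are added or deleted between consecutive snapshots. The Python sketch below assumes each snapshot of a document is a set of string triples. The predicate names `ex:timeUpdated` and `ex:title` are hypothetical placeholders, not the predicates shown on the slide.

```python
from collections import Counter
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def predicate_activity(snapshots: List[Set[Triple]]) -> Counter:
    """Count added or deleted triples per predicate across consecutive snapshots."""
    activity: Counter = Counter()
    for old, new in zip(snapshots, snapshots[1:]):
        for _s, p, _o in old ^ new:  # symmetric difference: added or deleted
            activity[p] += 1
    return activity


# Toy document whose (hypothetical) timestamp predicate changes every week.
weeks = ["2012-05-06", "2012-05-13", "2012-05-20"]
toy_snapshots = [
    {("ex:d", "ex:timeUpdated", w), ("ex:d", "ex:title", "Doc")} for w in weeks
]
activity = predicate_activity(toy_snapshots)
print(activity.most_common(1))
```

In this toy data only the timestamp predicate shows any activity, mirroring the slide's observation that the most dynamic predicates tend to indicate timestamps.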
Dynamics of the RDF link structure
Outward links from the kernel to other documents
• Period of stabilisation (cf. unavailability of documents)
• If there is a trend, then it is decreasing (cf. dying documents)
• Cf. non-200 HTTP responses
• Not many new links introduced
Exceptions (cf. publishing patterns):
• Low-volume but constant stream of fresh outward links: sec.gov, identi.ca, zitgist.com, dbtropes.org, ontologycentral.com, freebase.com
• New links in batches: bbc.co.uk, bnf.fr, dbpedia.org, linkedct.org, bio2rdf.org
Cf. Ntoulas et al. (2004): 25% new links each week (in a growing HTML data set)
Summary and Q&A
• Analyses from first half year
• Data collection is continuing
• Future work: more sources & analyses, results as RDF
• We appreciate your feedback and speculations: What would you look for in the data?
Thanks for your attention
Our home page with
• more details,
• a Google group,
• the data for download,
• and a UI to play around with the data:
http://swse.deri.org/dyldo/
This presentation is CC BY SA – picture credits
• Picture on title slide based on a picture by A. Sparrow, http://www.flickr.com/photos/49937157@N03/ (CC BY 2.0)
• Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/ (CC BY SA)
• Evolution, http://commons.wikimedia.org/wiki/File:Human_evolution_scheme.svg (CC BY SA)
• Death, http://commons.wikimedia.org/wiki/File:Death.jpg (CC BY SA 3.0)
• Seismogram, http://www.flickr.com/photos/brettneilson/2281403809/ (CC BY)