thu bernstein key_warp_speed
Processing Linked Data at Warp Speed
Abraham Bernstein
CCBY NASA http://www.flickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
"Earth's Location in the Universe (JPEG)" by Andrew Z. Colvin - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg#mediaviewer/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg
"IBM Electronic Data Processing Machine - GPN-2000-001881" NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Processing Graphs
Semantic Web Reasoning
KB: Asserted Triples
Entailed KB: Asserted & Inferred Triples
DL Reasoning
Inductive R.
Analogical R.
Your R.
Signal/Collect
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web. International Semantic Web Conference–ISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
• Vertices as stateful processing units
• Vertices interact through signals along edges
• Which are collected by a processing function that updates the vertex state
Processing Graphs Naturally
• Define the graph structure
• Vertices represent RDFS classes
• Edges from superclasses to subclasses
• Vertex state initialized with the class that the vertex represents
Signal/Collect: An Intuition for RDFS subclass inference
Initial state (Stutz et al., 2010):
id: animal, state: {animal}
id: bird, state: {bird}
id: owl, state: {owl}
id: penguin, state: {penguin}
Signals: animal sends {animal} to bird; bird sends {bird, animal} to owl and to penguin.
Final state:
id: animal, state: {animal}
id: bird, state: {bird, animal}
id: owl, state: {owl, bird, animal}
id: penguin, state: {penguin, bird, animal}
def signal = state
def collect = state ∪ ⋃_{s ∈ signals} s
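The subclass example can be sketched without the Signal/Collect library. The following is a minimal illustration, assuming synchronous rounds for simplicity; the names SubclassSketch and run are hypothetical and not part of the Signal/Collect API.

```scala
// Library-free sketch of signal/collect for RDFS subclass inference.
// SubclassSketch and run are hypothetical names, not the Signal/Collect API.
object SubclassSketch {
  // Edges point from a superclass to its subclasses.
  def run(edges: Seq[(String, String)]): Map[String, Set[String]] = {
    val ids = edges.flatMap { case (sup, sub) => Seq(sup, sub) }.distinct
    // Vertex state is initialized with the class that the vertex represents.
    var state: Map[String, Set[String]] = ids.map(id => id -> Set(id)).toMap
    var changed = true
    while (changed) {
      // signal: every edge carries the full state of its source vertex.
      val signals = edges.groupBy(_._2).map { case (target, in) =>
        target -> in.map { case (source, _) => state(source) }
      }
      // collect: union of the old state with all incoming signals.
      val updated = state.map { case (id, s) =>
        id -> signals.getOrElse(id, Seq.empty).foldLeft(s)(_ ++ _)
      }
      changed = updated != state
      state = updated
    }
    state
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq("animal" -> "bird", "bird" -> "owl", "bird" -> "penguin")
    run(edges).toSeq.sortBy(_._1).foreach { case (id, s) =>
      println(s"$id: ${s.toSeq.sorted.mkString("{", ", ", "}")}")
    }
  }
}
```

The loop stops once a round changes no state, which mirrors the idea that vertices only need to signal again after their state changed.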
Scoring/Asynchronicity: Single-Source Shortest Path
[Animation: every vertex starts with state: ∞ except the source (state: 0); signals of 1, 2 and 3 travel along the edges, and the vertices settle at state: 1 and state: 2.]
def signal = state + weight
def collect = min(state, min(signals))
Scoring/Asynchronicity: Single-Source Shortest Path
[Animation: the source (state: 0) sends signals of 1; updated vertices (state: 1) send signals of 2; vertices at state: 2 send signals of 3.]
var oldState = infinity
def scoreSignal =
  if (state != oldState) 1
  else 0
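The signal, collect and scoreSignal definitions above can be combined in a small library-free sketch. It is an illustration under assumptions: ShortestPathSketch, Edge and run are hypothetical names, and the sequential scheduler stands in for the asynchronous execution of the framework.

```scala
import scala.collection.mutable

// Library-free sketch of score-guided single-source shortest path.
// All names are hypothetical; weights are assumed to be non-negative.
object ShortestPathSketch {
  final case class Edge(source: String, target: String, weight: Double)

  def run(edges: Seq[Edge], origin: String): Map[String, Double] = {
    val ids = edges.flatMap(e => Seq(e.source, e.target)).distinct
    val state = mutable.Map.empty[String, Double]
    // oldState remembers the state a vertex had when it last signaled.
    val oldState = mutable.Map.empty[String, Double]
    ids.foreach { id =>
      state(id) = Double.PositiveInfinity
      oldState(id) = Double.PositiveInfinity
    }
    state(origin) = 0.0

    // scoreSignal: 1 if the state changed since the last signal, else 0.
    def scoreSignal(id: String): Int = if (state(id) != oldState(id)) 1 else 0

    var active = ids.filter(scoreSignal(_) > 0)
    while (active.nonEmpty) {
      val id = active.head
      oldState(id) = state(id)
      for (e <- edges if e.source == id) {
        val signal = state(id) + e.weight                   // signal = state + weight
        state(e.target) = math.min(state(e.target), signal) // collect = min(state, signals)
      }
      active = ids.filter(scoreSignal(_) > 0)
    }
    state.toMap
  }
}
```

Only vertices with a positive score are scheduled, so a vertex that receives a signal that does not improve its distance stays idle.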
PageRank in Code
Algorithm:
class Document(id: Any) extends Vertex(id, 0.15) {
  def collect = 0.15 + 0.85 * signals[Double].foldLeft(0.0)(_ + _)
}
class Citation(citer: Any, cited: Any) extends Edge(citer, cited) {
  def signal = source.state.asInstanceOf[Double] * weight / source.sumOfOutWeights
}
Initialization and Execution:
object Algorithm {
  def executeCitationRank(db: SparqlAccessor) {
    val computeGraph = new AsynchronousComputeGraph()
    val citations = new SparqlTuples(db,
      "select ?source ?target where {" +
      "?source <http://lsdis.cs.uga.edu/projects/semdis/opus#cites> ?target}")
    citations foreach {
      case (citer, cited) =>
        computeGraph.addVertex[Document](citer)
        computeGraph.addVertex[Document](cited)
        computeGraph.addEdge[Citation](citer, cited)
    }
    computeGraph.execute()
  }
}
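The slide code relies on Signal/Collect classes (Vertex, Edge, AsynchronousComputeGraph, SparqlTuples) that are not shown here. As a rough, library-free sketch of the same update rule, the following assumes unit edge weights, synchronous rounds and a fixed iteration count; PageRankSketch and run are hypothetical names.

```scala
// Library-free sketch of the collect/signal rule from the slide.
// Assumptions: every citation has weight 1, rounds are synchronous.
object PageRankSketch {
  def run(citations: Seq[(String, String)], iterations: Int = 100): Map[String, Double] = {
    val ids = citations.flatMap { case (citer, cited) => Seq(citer, cited) }.distinct
    val outDegree = citations.groupBy(_._1).map { case (id, out) => id -> out.size.toDouble }
    var state: Map[String, Double] = ids.map(id => id -> 0.15).toMap
    for (_ <- 1 to iterations) {
      // signal: source.state * weight / source.sumOfOutWeights, with weight = 1.
      val signals = citations.map { case (citer, cited) =>
        cited -> state(citer) / outDegree(citer)
      }
      // collect: 0.15 + 0.85 * sum of the incoming signals.
      state = ids.map { id =>
        id -> (0.15 + 0.85 * signals.filter(_._1 == id).map(_._2).sum)
      }.toMap
    }
    state
  }
}
```

For two documents that cite each other, the fixed point of x = 0.15 + 0.85 * x is 1.0, which gives a quick sanity check.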
Signal/Collect as a Platform
TripleRush
Fraud Detection
DCOPs
Images: Bo Insogna, https://flic.kr/p/9famxT, CC-NC-ND, https://creativecommons.org/licenses/by-nc-nd/2.0/; Antana, https://flic.kr/p/gGQPhA, CC-BY-SA, https://creativecommons.org/licenses/by-sa/2.0/; Jeff Kubina, https://flic.kr/p/2CG5PU, CC-BY-SA
TripleStore
Data: Elvis inspired Dylan; Dylan inspired Jobs
Query: ?X inspired ?Y, ?Y inspired ?Z
[Figure: index graph. Each triple is a vertex, e.g. [Elvis inspired Dylan] and [Dylan inspired Jobs]. Above it sit index vertices with one wildcard ([Elvis inspired *], [* inspired Dylan], [Elvis * Dylan]), two wildcards ([Elvis * *], [* inspired *], [* * Dylan]), and the root [* * *].]
[Figure: the part of the index used for the query: [* inspired *], [* inspired Dylan], [Dylan inspired *], [* inspired Jobs], [Elvis inspired Dylan], [Dylan inspired Jobs].]
Query Vertex
?X inspired ?Y, ?Y inspired ?Z
Elvis inspired Dylan, Dylan inspired ?Z
Dylan inspired Jobs, Jobs inspired ?Z
✗ No vertex with ID [ Jobs inspired * ]
✓ Elvis inspired Dylan, Dylan inspired Jobs
Query Vertex
{ ?X = Elvis, ?Y = Dylan, ?Z = Jobs }
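The binding steps of the query example can be reproduced with a small sketch. It matches patterns against a flat list of triples and does not model the index vertices or the message routing of TripleRush; PatternMatchSketch, bind and matchAll are hypothetical names.

```scala
// Sketch of matching the two-pattern query by successively binding variables.
// This is a plain nested-loop match, not the TripleRush index structure.
object PatternMatchSketch {
  type Triple = (String, String, String)
  type Bindings = Map[String, String]

  private def isVariable(term: String): Boolean = term.startsWith("?")

  // Unify one pattern term with one data term under the current bindings.
  private def bind(term: String, value: String, b: Bindings): Option[Bindings] =
    if (!isVariable(term)) {
      if (term == value) Some(b) else None
    } else {
      b.get(term) match {
        case Some(bound) if bound != value => None
        case _                             => Some(b + (term -> value))
      }
    }

  // Match the patterns left to right, extending the bindings triple by triple.
  def matchAll(patterns: List[Triple], data: Seq[Triple],
               b: Bindings = Map.empty): Seq[Bindings] =
    patterns match {
      case Nil => Seq(b)
      case (s, p, o) :: rest =>
        for {
          t <- data
          b1 <- bind(s, t._1, b).toSeq
          b2 <- bind(p, t._2, b1).toSeq
          b3 <- bind(o, t._3, b2).toSeq
          result <- matchAll(rest, data, b3)
        } yield result
    }
}
```

With the two triples from the slide, the branch that binds ?Y = Jobs finds no triple with subject Jobs and is dropped, while the other branch yields the single result.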
Performance Results
Distributed (8 nodes), LUBM 10240 (~1.36 billion triples)
Fastest of 10 runs   L1         L2        L3        L4    L5    L6     L7         Geo. mean
TripleRush           3,111.2    1,457.9   0.7       3.5   9.5   29.1   1,165.8    62.1
Trinity.RDF          12,648.0   6,018.0   8,735.0   5.0   4.0   9.0    31,214.0   450.0
TriAD                7,631.0    1,663.0   4,290.0   2.1   0.5   69.0   14,895.0   249.0
TriAD-SG             2,146.0    2,025.0   1,647.0   1.3   0.7   1.4    16,863.0   106.0
Single-node, LUBM 160 (~21 million triples)
Fastest of 10 runs   L1      L2      L3      L4    L5    L6     L7      Geo. mean
TripleRush           22.6    27.8    0.4     1.0   0.4   0.9    21.2    2.94
Trinity.RDF          281.0   132.0   110.0   5.0   4.0   9.0    630.0   46.0
TriAD                427.0   117.0   210.0   2.0   0.5   19.0   693.0   39.0
TriAD-SG             97.0    140.0   31.0    1.0   0.2   1.8    711.0   14.0
Ongoing Work: Graph Partitioning
• https://github.com/uzh/triplerush
Decomposing Fraud Patterns
• Participants can be labeled as:
• Splitters
• Aggregators
• Forwarders
[Figure: example money flows labeled $10k, $11k, 2 x $5k, $10k and 10k CHF, 2k CHF, 4k CHF, 6k CHF, 8k CHF]
Elicitation Process
Filter, Connect, Match
Bitcoin Transactions: Runtime vs. Matching Complexity
[Chart: Time (sec), 0 to 7000, vs. Matching Complexity, 4 to 12; series: Time in GC, Total Processing Time]
Dataset Size: 50M Transactions, Matching Duration: 1 week
Runtime: 27 min Throughput: 1.8M / min
Runtime: 35 min Throughput: 1.4M / min
Runtime: 98 min Throughput: 0.5M / min
Distributed Constraint Optimization
Variables: x, y, z, q
Constraints: x ≠ y, y ≠ z, z ≠ x, q ≠ z
Domains: x ∈ {0,1,2}, y ∈ {0,1}, z ∈ {1,2}, q ∈ {0,1,2}
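For a problem this small, the constraints can be checked by exhaustive enumeration. The sketch below is a centralized brute-force search, not the distributed DSA variant mentioned in the talk; ColoringSketch and its members are hypothetical names.

```scala
// Brute-force check of the four-variable example (centralized, not DSA).
object ColoringSketch {
  val domains: Map[String, Seq[Int]] = Map(
    "x" -> Seq(0, 1, 2), "y" -> Seq(0, 1), "z" -> Seq(1, 2), "q" -> Seq(0, 1, 2))
  val notEqual: Seq[(String, String)] =
    Seq("x" -> "y", "y" -> "z", "z" -> "x", "q" -> "z")

  // Number of violated inequality constraints under an assignment.
  def conflicts(a: Map[String, Int]): Int =
    notEqual.count { case (l, r) => a(l) == a(r) }

  // Enumerate every complete assignment (feasible only for tiny problems).
  def assignments: Seq[Map[String, Int]] =
    domains.toSeq.foldLeft(Seq(Map.empty[String, Int])) { case (partial, (v, dom)) =>
      for (a <- partial; value <- dom) yield a + (v -> value)
    }

  // Assignments that violate no constraint.
  def solutions: Seq[Map[String, Int]] = assignments.filter(conflicts(_) == 0)
}
```

The conflict count is the quantity a local-search algorithm would try to reduce per vertex; here it only serves as a filter.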
Vertex Coloring in action
Optimized version of DSA running on a MacBook Pro with 8 workers (slow, due to lots of IO for logging, bookkeeping, etc.)
• Scaled to: 10 million vertices / variables
Verman and Bernstein, 2014
Industry Usage!
“We are using Signal/Collect to analyze millions of claims every day to identify opportunities for our clients to save money through better healthcare or avoiding fraud, waste, and abuse.”
US Healthcare Analytics Company
Processing Graph Streams
147.1b € 97.5b €
5’100
3m Viewers 300 Events/s
250 TV Channels 25 frames/s
EPG for 7 days LOD enhanced
Traditional TripleStore
[Figure: graph whose nodes and edges are labeled with full URIs: http://www.mpi.de/, http://www.ifi.uzh.ch/, http://www.uni-sb.de/courses/, http://www.universities.de/Saarbruecken, http://www.ifi.uzh.ch/i_am_a_URI (repeated six times), http://www.lubm.org/teaches, http://www.lubm.org/]
Semantic Flow Processing
[Chart: ViSTA-TV Viewership Data. Num of Valid Data Entries (0 to 30000) over time of day (0h to 24h) for day 1 to day 3]
Cache
ViSTA-TV project: UserLog and EPG streams
Semantic Flow Processing is:
• Time-stamped triples t = <s,p,o> [time]
• Semantic flow F = [t1, t2, ... tn]
• Perform query matching on cached subset of F
• Subject to stress: the incoming data rate overwhelms the system's processing capability
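The definitions above can be sketched as a small data model. This is an illustration under assumptions: timestamps are taken to be milliseconds, the predicates "watch" and "play" follow the UserLog and EPG examples later in the talk, and FlowSketch, TimedTriple, window and join are hypothetical names.

```scala
// Sketch of a semantic flow with a time window and a two-way join.
object FlowSketch {
  // t = <s,p,o> [time]; the millisecond unit is an assumption.
  final case class TimedTriple(s: String, p: String, o: String, time: Long)

  // Cached subset of the flow: triples from the last windowMs before now.
  def window(flow: Seq[TimedTriple], now: Long, windowMs: Long): Seq[TimedTriple] =
    flow.filter(t => t.time > now - windowMs && t.time <= now)

  // Two-way join: UserLog <user, watch, channel> with EPG <channel, play, show>.
  def join(cached: Seq[TimedTriple]): Seq[(String, String)] =
    for {
      w <- cached if w.p == "watch"
      e <- cached if e.p == "play" && e.s == w.o
    } yield (w.s, e.o)
}
```

Once an EPG triple falls out of the window, later UserLog triples for that channel no longer produce a result, which is the recall cost that any cache reduction has to manage.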
Context: Time Window
last 1 min
?within 1 sec
Load Shedding
last 1 min
Eviction
last 1 min
?
Eviction is:
• Remove cached data
• Note: Eviction may lower recall
• Evict potential to produce future results
• Types of eviction strategies (considered):
• Random
• Time-based (i.e., FIFO)
• Least Recently Used (LRU)
Why not LRU?
Two-way Join: UserLog, EPG. The EPG cache is managed by LRU and holds entries for channels B and C.
1. Input: UserLog: User 1, <watch>, Channel B. Join with the EPG cache.
2. Input: UserLog: User 1, <watch>, Channel C. Join with the EPG cache.
3. Input: EPG: Channel D, <Play>, Show 4. D enters the EPG cache. Input: UserLog: User 3, <watch>, Channel ?
4. Input: EPG: Channel D, <Play>, Show 4; Channel E, <Play>, Show 5; Channel F, <Play>, Show 6; Channel G, <Play>, Show 7; Channel H, <Play>, Show 8.
CLOCK is:
• Considers both recency (LRU) and past results
• Gives each data entry a score
• The score can be incremented and depreciated
• Named after the buffer management algorithm CLOCK
CLOCK
EPG cache with CLOCK scores: one entry with score 6, B with score 1, C with score 3.
1. Input: UserLog: User 2, <watch>, Channel C. The join matches C; its score goes from 3 to 4.
2. Input: EPG: Channel D, <Play>, Show 4. dep() is applied: 6 becomes 5, B goes from 1 to 0 and leaves the cache. D enters with init = 1.
3. Input: EPG: Channel E, <Play>, Show 5. dep() is applied: C goes from 4 to 3, 5 becomes 4, D goes from 1 to 0 and leaves the cache. E enters with score 1.
4. Further inputs: EPG: Channel F, <Play>, Show 6; Channel G, <Play>, Show 7; Channel H, <Play>, Show 8.
CLOCK Eviction is:
• The weight is adjusted by dep():
• Linear: dep(w) = w - 1
• Exponential: dep(w) = w * ρ (0< ρ < 1)
Recency History
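A score-based cache with the two dep() variants can be sketched as follows. The sweep order (a clock hand that depreciates entries until one reaches the threshold), the threshold parameter and all names (ClockSketch, ClockCache, hit, insert) are assumptions made for illustration, not the implementation behind the experiments. With exponential depreciation the threshold must be positive, otherwise no score ever reaches it.

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of CLOCK-style eviction with a pluggable depreciation function.
object ClockSketch {
  val linear: Double => Double = w => w - 1                        // dep(w) = w - 1
  def exponential(rho: Double): Double => Double = w => w * rho    // dep(w) = w * rho

  final class ClockCache(capacity: Int, dep: Double => Double,
                         threshold: Double, init: Double = 1.0) {
    private val slots = ArrayBuffer.empty[(String, Double)]
    private var hand = 0

    def scores: Map[String, Double] = slots.toMap

    // A join result (hit) increments the score of the matching entry.
    def hit(key: String): Unit = {
      val i = slots.indexWhere(_._1 == key)
      if (i >= 0) slots(i) = (key, slots(i)._2 + 1)
    }

    // Insert an entry; when full, depreciate entries in hand order until one
    // drops to the threshold, then replace that entry with the new one.
    def insert(key: String, score: Double = init): Unit = {
      if (slots.size < capacity) {
        slots += ((key, score))
      } else {
        var placed = false
        while (!placed) {
          val (k, w) = slots(hand)
          val depreciated = dep(w)
          if (depreciated <= threshold) {
            slots(hand) = (key, score)
            placed = true
          } else {
            slots(hand) = (k, depreciated)
          }
          hand = (hand + 1) % capacity
        }
      }
    }
  }
}
```

Replaying the example from the CLOCK slides (an unnamed entry, here called "A", with score 6, B with 1, C with 3, then a hit on C and inserts of D and E) with linear depreciation leaves A at 4, E at 1 and C at 3.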
Experimental Results: TV Viewership Data
[Chart: Recall (0% to 100%) vs. Cache Size (100%, 50%, 25%, 15%, 1%) for Random, FIFO, LRU and CLOCK]
Two-way join query with 1,919,216 input triples
Experimental Results: Depreciation Function
Exponential factor ρ ∈ {0.25, 0.5, 0.75, 0.95}
Limitations
• Static depreciation function: dep()
• Experiment on other datasets
• Real implementation to investigate other performance metrics
• Only local eviction strategies considered
Philip Stutz, Mihaela Verman, Shen Gao, Daniel Strebel, Bibek Paudel, Lorenz Fischer, Thomas Keller, Robin Hafen, Genc Mazlami