Big Data Management Platforms
M. Tamer Özsu
University of Waterloo, David R. Cheriton School of Computer Science
-
Relational DBMS market (IDC/Gartner)
1994: US$7B; 2011: US$24B (2010-2011 growth: 16.3%; 2010-2017 growth: 143.2%)
[Chart: size of the database market worldwide from 2012 to 2017, in billion U.S. dollars (* = projected):
2012: 32.0   2013: 34.5   2014: 37.7   2015*: 41.5   2016*: 45.7   2017*: 50.1]
Sources: IDG (InfoWorld); IDC; Statista 2018
-
But...
[Pie charts: share of database systems by category, per db-engines]
2013 (204 systems): Relational 38%, NoSQL 44%, Other 18%
2018 (397 systems): Relational 35%, NoSQL 44%, Other 21%
Source: https://db-engines.com
-
But...
Share of systems by category (%), 2013 vs. 2018:
Category     2013   2018
RDBMS        37.7   34.8
Key-value    18.6   16.9
Document     12.8   11.6
Graph        11.3   12.7
Wide column   2.0    3.0
Other        17.7   21.2
Source: https://db-engines.com
-
What is Happening?
Technology push/Application Pull
Many technological changes
Networks
Parallelism (cluster computing & multicore)
Storage systems
Processor technology
. . .
New applications: "big data" applications
IoT
Recommender systems
Personalized health systems
. . .
-
Big Data
Data-driven approaches to solving big problems
Example
Voice recognition
Google voice recognition accuracy: 90%, versus 60% for the model-driven approach
Question-Answering systems
Confusion set disambiguation (e.g., to, two, too) [Banko & Brill, 2001]
-
Big Data Characteristics – Four Vs
Volume: almost every dataset we use is high volume, and volume is increasing; consider these:
Google: in 2016, user uploads to YouTube required 1 PB of new storage capacity per day
Alibaba: in 2017, 320 PB of log data was generated in a six-hour period as a result of customer purchase activity
Variety: data are multi-modal
Velocity: streaming data
Facebook: 900 million photo uploads per day
Alibaba: during 2017/11/11, 470 million event logs per second
Veracity: data quality & cleaning; managing uncertain data; search under uncertainty
-
Big Data Software Stack
[Figure: layered software stack]
Distributed data storage (shards)
Data processing framework
Script/SQL, NoSQL, search, and streaming interfaces
Data analysis
Resource management
-
Outline
1 Distributed Storage Systems
2 Main Memory Systems
3 Big Data Processing
MapReduce [Li et al., 2014]
Spark [Zaharia, 2016]
4 Stream Data Processing
5 Graph Processing
6 RDF Data Management
-
Big Data Storage
Two basic solutions to store big data:
Objects
Files
Different goals
Complementary solutions
-
Object Storage
Objectives
Very large amounts of unstructured data
Object transfer over a long-distance network
Access with REST Internet protocols
Approach
An object includes its data, metadata, and an identifier
Flat object space: no hierarchy of objects to manage
OSD (object storage device): stores objects on disk
-
Object
Object = triple (oid, data, metadata)
Oid: unique, permanent identifier
Once created, an object can be directly accessed by its oid
Data: unstructured data (text, binary, . . . )
Easy to move over the network
Metadata: to understand the data
Much flexibility in defining an object's properties
Protection, replication, lifetime, etc.
-
Amazon Simple Storage Service (S3)
[Figure: a client puts/gets objects into buckets (Bucket 1 ... Bucket N) in Amazon S3]
Online storage service
Simple and easy: put/get an object in a bucket (an online directory) using a unique key given by the user
Very big objects (max object size: 5 TB; per get/put request: 5 GB)
Access via REST or SOAP
Use cases: images, videos
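As a concrete illustration of the put/get interface, here is a minimal sketch using Python's boto3 client; the bucket and key names are made up, and valid AWS credentials are assumed.

import boto3

# Minimal sketch of S3-style object put/get; bucket/key names are
# illustrative only, and credentials are assumed to be configured.
s3 = boto3.client("s3")

# Put: store an object (data + user-chosen key) in a bucket.
s3.put_object(Bucket="example-bucket", Key="videos/clip42",
              Body=b"...object bytes...")

# Get: retrieve the object directly by its key.
obj = s3.get_object(Bucket="example-bucket", Key="videos/clip42")
data = obj["Body"].read()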
-
Object Stores
Cloud services
Amazon S3; Rackspace Files; Microsoft Azure Vault Storage; Google Cloud Storage
Enterprise systems
Swift (OpenStack.org); IBM Spectrum Scale Object; Quantum Lattus; Caringo Swarm; EMC Atmos; Hitachi HCP; Scality RING
-
File Storage
Objectives
Storage of unstructured data (byte strings)
Can also be organized as fixed- or variable-size records
Managed in a directory hierarchy
Sharing in a data center: NAS (network attached storage) or SAN (storage area network)
Approach
Separation of data (file content) from metadata
First read the metadata (in the directory) to locate the content
-
Distributed Files
Objectives
Very big files, with many records
Capabilities
Scalability
Fault-tolerance
Approaches
Block-based: GFS
Object-based: Lustre
Hybrid: Ceph
-
Google File System (GFS)
Used by many Google applications
Search engine, Bigtable, MapReduce, etc.
Proprietary; the basis for open source solutions
Hadoop Distributed File System (HDFS)
Objectives
Very big files, typically several terabytes, containing many elements such as web documents
Shared-nothing cluster of thousands of nodes, built from inexpensive hardware
Node failure is the norm, not the exception
Performance, fault-tolerance, and high availability
Using partitioning and replication
-
GFS: Design Choices
Files are divided into fixed-size partitions, called chunks, of large size, e.g., 64 MB
Partitions are distributed across multiple nodes
Each partition is replicated at several nodes (three by default)
Optimized for read and append
Random updates are rare
Large reads of bulk data (e.g., 1 MB) and small random reads (e.g., 1 KB)
Append operations are also large, and many concurrent clients may append to the same file
High throughput (for bulk data) is more important than low latency
-
GFS: Capabilities
Traditional file system interface (create, open, read, write, close, and delete file)
Two additional operations: snapshot and record append
No update operation: need to do read and write
Relaxed consistency, with atomic record append
No need for distributed lock management
Up to the application to use techniques such as checkpointing and writing self-validating records
Single GFS master
Maintains file metadata such as namespace, access control information, and data placement information
Simple, lightly loaded, fault-tolerant
Fast recovery and replication strategies
-
GFS: Capabilities
[Figure: the application's GFS client asks the master for chunk locations, then gets the chunks directly from the chunk servers]
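A minimal sketch of this read path, with made-up RPC helpers (lookup_chunk, read_from) standing in for the real master and chunkserver protocols; the chunk size follows the 64 MB default above.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as above

def gfs_read(master, path, offset, length):
    # Hypothetical sketch of a GFS-style client read; lookup_chunk and
    # read_from are assumed stand-ins for the master/chunkserver RPCs.
    data = b""
    while length > 0:
        chunk_index = offset // CHUNK_SIZE     # which chunk holds this offset
        chunk_offset = offset % CHUNK_SIZE     # position within that chunk
        # 1) Ask the master where the chunk's replicas live (metadata only).
        handle, replicas = master.lookup_chunk(path, chunk_index)
        # 2) Read the bytes directly from one chunk server (data path).
        n = min(length, CHUNK_SIZE - chunk_offset)
        data += replicas[0].read_from(handle, chunk_offset, n)
        offset += n
        length -= n
    return data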
-
Lustre
Object-based
Used a lot in High Performance Computing (HPC) and scientific applications in the cloud
Intel Cloud Edition for Lustre
Two types of servers
Metadata servers
Object servers: store file contents in object storage targets (OSTs)
Shared-disk cluster
-
Distributed File Systems
Block-based
Google File System (GFS); Hadoop Distributed File System (HDFS); GlusterFS (Red Hat, for shared-nothing clusters); Global File System 2 (GFS2) (Red Hat, for shared-disk clusters)
Object-based
Lustre (Intel); XtreemFS (Xtreem European project)
Hybrid
Ceph (Red Hat); Microsoft Azure HDInsight (integration of HDFS within Azure Blob Storage); Ozone (Hortonworks, extension of HDFS with an object-based level)
-
Outline
1 Distributed Storage Systems
2 Main Memory Systems
3 Big Data Processing
MapReduce [Li et al., 2014]
Spark [Zaharia, 2016]
4 Stream Data Processing
5 Graph Processing
6 RDF Data Management
-
Database Systems and Technological Changes
Database systems were architected for the following typical environment:
1-7 MIPS CPU with about 1 KB cache
Less than 10 MB RAM
About 80 MB hard disk with 1 MB/sec transfer rate
But the technology has changed enormously. Each large server on our cluster:
Intel Xeon E5-4600 (Ivy Bridge) CPU: > 80,000 MIPS with 8 cores; 20 MB L3 cache shared among cores; 256 KB and 64 KB per core of L2 and L1 caches
CPU speeds are now ~300,000 MIPS
32 GB RAM; big servers can now have > 1 TB of RAM
1 TB of hard disk with 6 GB/s transfer rate; hard disks of 6 TB are coming onto the market
Database architectures have not changed since their introduction at the beginning of the 1980s (until recently)
-
Typical DBMS Architecture
[Figure: layered DBMS architecture: SQL commands from web forms and applications enter through the SQL interface; the parser, optimizer, operator evaluator, and executor sit above files and access methods, the buffer manager, and the disk space manager (data files, indices, ...); the transaction, lock, and recovery managers run alongside. Figure inspired by Ramakrishnan/Gehrke: "Database Management Systems", McGraw-Hill, 2003.]
-
Recall the Developments in RAM
[Chart: RAM size over the years, growing from roughly 8 KB in 1980 to the terabyte range by 2016 (log scale)]
-
Memory Hierarchy
Level                  Capacity          Latency
CPU (with registers)   bytes             < 1 ns
Caches                 kilo-/megabytes   < 10 ns
Main memory            gigabytes         70-100 ns
Solid state memory     gigabytes         ~5 usec
Hard disks             terabytes         3-10 ms
Tape library           petabytes         varies

Fast, but expensive and small, memory close to the CPU
Larger, slower memory at the periphery
Increasing gap between CPU and memory speeds
Latency is hidden by using the fast memory as a cache.
-
Main Memory (In-Memory) DBMS
Fundamental characteristic
Store the entire database in main memory for processing, possibly using the hard disk for persistence.
Not a new idea, but increasing memory sizes allow accommodating bigger databases
Usually goes with column-oriented storage, but not always (e.g., SQL Server Hekaton works on rows)
Can target OLTP or OLAP workloads, or both
Requires rethinking of many functions, but there are simplifications
-
Design Decisions
Access patterns: random access in memory is still more expensive than sequential access, partly due to cache-line capacity and cache misses
Main-memory-optimized storage: no page structure necessary; no secondary indexes, only main memory indexes; try to use latch-free data structures
Data organization: partitioned vs. non-partitioned vs. multi-versioned vs. row/column
Concurrency: lock managers are performance bottlenecks; very lightweight locking or optimistic concurrency control
Durability: checkpointing gives "light" durability; transaction logging either to disk or to non-volatile memory; replication
-
Data Organization in Disk-based Systems
Classical disk-based relational systems page data to and from disk on demand (fixed-size blocks, e.g., 8 KB)
The storage engine manages the buffer pool
Page frames are the in-memory representation of pages
A hash table maps page id to location in the pool
If a page is not in memory, disk I/O is needed to populate a target page frame
Logical record identifiers: (page id, offset)
[Figure: buffer pool of page frames, with a hash table mapping page ids to frames]
Elegant solution for abstracting disk away from other layers of the database stack
Too much overhead for main-memory systems
-
Data Organization in Main Memory Systems
Data lives in RAM in main-memory databases; no need to page from disk
Modern systems avoid page indirection through the buffer pool
Avoid logical record identifiers: (page id, offset)
Avoid two-level indirection: hash table access to resolve the page frame in the buffer pool, then calculating a pointer to the record using its offset within the page
Common practice is to use in-memory pointers for direct record access
Can result in order-of-magnitude performance improvements
Avoids page-based indirection to resolve record pointers
Avoids page latch overhead
[Figure: disk-based indexing goes B-tree node -> (key, page-id) -> hash table -> page frame -> record; main-memory indexing goes B-tree node -> (key, record*) -> record directly]
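A toy sketch (not from the slides; all names illustrative) contrasting the two lookup paths:

# Disk-based style: the index stores (page_id, offset); two-level indirection.
buffer_pool = {7: [{"key": "k1", "val": 42}]}  # page_id -> page frame (list of records)
index_disk = {"k1": (7, 0)}                    # key -> (page_id, offset)

def lookup_disk(key):
    page_id, offset = index_disk[key]
    frame = buffer_pool[page_id]   # hash-table lookup (disk I/O on a miss)
    return frame[offset]           # second step: offset within the page

# Main-memory style: the index stores a direct reference to the record.
record = {"key": "k1", "val": 42}
index_mem = {"k1": record}         # key -> record pointer

def lookup_mem(key):
    return index_mem[key]          # single dereference, no page indirection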
-
Organization
Partitioned
Disjoint partitioning of the database; assign each partition to a node, core, etc.
Simple per-node implementation (serial execution, no concurrent data structures)
Partition management burden: rebalance for hotspots
Unpredictable behavior if there are too many cross-partition transactions
Non-partitioned
Any thread/core can access any record in the database
Self-balancing system: no need to worry about partition management
Increased implementation complexity (e.g., concurrent data structures)
-
Organization
Multi-versioned
Allows for high concurrency; important in multi-core environments
Readers are allowed to execute uncontested and do not block writers
Row/columnar layout
Reasonable OLTP performance on an in-memory columnar layout; not true for disk-based column stores
-
System Organization Choices
System          Partitioned  Multi-Versioned  Row/Columnar
Hekaton         No           Yes              Row
HyPer           No           Yes              Hybrid
SAP HANA        No           Yes              Hybrid
H-Store/VoltDB  Yes          No               Row
-
Outline
1 Distributed Storage Systems
2 Main Memory Systems
3 Big Data Processing
MapReduce [Li et al., 2014]
Spark [Zaharia, 2016]
4 Stream Data Processing
5 Graph Processing
6 RDF Data Management
-
MapReduce Basics [Li et al., 2014]
For data analysis of very large data sets
Highly dynamic, irregular, schemaless, etc.
SQL too heavy
"Embarrassingly parallel" problems
New, simple parallel programming model
Data structured as (key, value) pairs, e.g., (doc-id, content), (word, count), etc.
Functional programming style with two functions to be given:
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
Implemented on a distributed file system (e.g., Google File System) on very large clusters
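A minimal word-count sketch of this model (not from the slides; single-process Python): map emits (word, 1) pairs, the library groups them by key, and reduce sums each group.

from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, content):          # Map(k1, v1) -> list(k2, v2)
    return [(word, 1) for word in content.split()]

def reduce_fn(word, counts):          # Reduce(k2, list(v2)) -> list(v3)
    return [(word, sum(counts))]

def run_mapreduce(inputs):
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))            # group by key
    out = []
    for word, group in groupby(intermediate, key=itemgetter(0)):
        out += reduce_fn(word, [c for _, c in group])
    return out

print(run_mapreduce([("d1", "to be or not to be")]))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]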
-
Map Function
User-defined function
Processes input key/value pairs
Produces a set of intermediate key/value pairs
Map function I/O
Input: read a chunk from the distributed file system (DFS)
Output: write to an intermediate file on local disk
MapReduce library
Executes the map function
Groups together all intermediate values with the same key (i.e., generates a set of lists)
Passes these lists to the reduce functions
Effect of the map function
Processes and partitions input data
Builds a distributed map (transparent to the user)
Similar to the "group by" operator in SQL
-
Reduce Function
User-defined function
Accepts one intermediate key and a set of values for that key (i.e., a list)
Merges these values together to form a (possibly) smaller set
Typically, zero or one output value is generated per invocation
Reduce function I/O
Input: read from intermediate files using remote reads on the local files of the corresponding mapper nodes
Output: each reducer writes its output as a file back to the DFS
Effect of the reduce function
Similar to the aggregation operator in SQL
-
MapReduce Processing
[Figure: the input data set is split across Map tasks, which emit (k, v) pairs; the pairs are grouped by key into (k, list(v)); Reduce tasks consume the groups and produce the output data set]
-
Example 1
Assume you are reading the monthly average temperatures for each of the 12 months of a year for a number of cities, i.e., each input is the name of the city (key) and the average monthly temperature. Compute the average annual temperature for each city.
Map: Input: 〈City, Month, MonthAvgTemp〉
1 Create key/value pairs
Output: 〈City, MonthAvgTemp〉
Reduce: Input: 〈City, MonthAvgTemp〉
1 Sort to get 〈City, list(MonthAvgTemp)〉 (i.e., combine in a list the monthly average temperatures for a given city)
2 Compute the average over list(MonthAvgTemp)
Output: 〈City, AnnualAvgTemp〉
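A runnable single-process sketch of Example 1 (not from the slides; the dictionary grouping plays the role of the MapReduce shuffle, and the temperatures are made up):

from collections import defaultdict

def map_fn(record):                        # record = (city, month, temp)
    city, _month, temp = record
    return (city, temp)                    # emit <City, MonthAvgTemp>

def reduce_fn(city, temps):
    return (city, sum(temps) / len(temps)) # emit <City, AnnualAvgTemp>

records = [("Waterloo", m, t) for m, t in enumerate(
    [-7, -6, 0, 7, 14, 19, 22, 21, 16, 9, 3, -4], 1)]

groups = defaultdict(list)
for rec in records:
    city, temp = map_fn(rec)
    groups[city].append(temp)              # shuffle: group by key

print([reduce_fn(c, ts) for c, ts in groups.items()])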
-
Example 2
Consider EMP (ENAME, TITLE, CITY)
Query:
SELECT   CITY, COUNT(*)
FROM     EMP
WHERE    ENAME LIKE '%Smith'
GROUP BY CITY
Map:
Input: (TID, emp); Output: (CITY, 1)
if emp.ENAME like '%Smith' return (CITY, 1)
Reduce:
Input: (CITY, list(1)); Output: (CITY, SUM(list(1)))
return (CITY, SUM(list(1)))
-
MapReduce Architecture
[Figure: a master with a scheduler coordinates workers; each map worker runs a map process composed of input, map, combine, and partition modules; each reduce worker runs a reduce process composed of group, reduce, and output modules]
-
Execution Flow with Architecture [Dean and Ghemawat, 2008]
[Figure 1 from Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", CACM 51(1), 2008: the user program forks a master and workers; the master assigns map and reduce tasks; map workers read the input splits and write intermediate files to their local disks; reduce workers remote-read those intermediate files and write the final output files (one per reduce task)]
-
Characteristics
Flexibility: user can write any map and reduce function code; no need to know how to parallelize
Scalability: elastic scalability; automatic load balancing
Efficiency: simple parallel scan; no database loading
Fault tolerance
Worker failure: the master pings workers periodically and assumes failure if there is no response; tasks (both map and reduce) on failed workers are scheduled on a different worker node
Master failure: checkpoints of master state; recovery after failure, though progress can halt
Replication on the distributed data store
-
Hadoop
Most popular MapReduce implementation; developed by Yahoo!
Two components, deployable on the same machine or on different machines:
Processing engine
HDFS: Hadoop Distributed File System (others possible)
Processes
Job tracker: hosted on the master node; implements the schedule
Task tracker: hosted on the worker nodes; accepts tasks from the job tracker and executes them
HDFS
Name node: stores how data are partitioned, monitors the status of data nodes, and holds the data dictionary
Data node: stores and manages the data chunks assigned to it
[Figure: the MapReduce layer (task trackers on workers, job tracker on the master) sits over the HDFS layer (data nodes on workers, name node on the master)]
-
Hadoop UDF Functions
Phase    Name                              Function
Map      InputFormat::getSplit             Partitions the input data into splits; each split is processed by a mapper and may consist of several chunks.
Map      RecordReader::next                Defines how a split is divided into items; each item is a key/value pair used as input to the map function.
Map      Mapper::map                       Users can customize the map function; the input data are transformed into intermediate key/value pairs.
Map      WritableComparable::compareTo     The comparison function for the key/value pairs.
Map      Job::setCombinerClass             Specifies how the key/value pairs are aggregated locally.
Shuffle  Job::setPartitionerClass          Specifies how the intermediate key/value pairs are shuffled to different reducers.
Reduce   Job::setGroupingComparatorClass   Specifies how the key/value pairs are grouped in the reduce phase.
Reduce   Reducer::reduce                   Users write their own reduce functions to perform the corresponding jobs.
-
MapReduce Implementations
Name       Language            File System          Index              Master Server                       Multiple Job Support
Hadoop     Java                HDFS                 No                 Name node and job tracker           Yes
Disco      Python and Erlang   Distributed index    No                 Disco server                        No
Skynet     Ruby                MySQL or Unix FS     No                 Any node in the cluster             No
FileMap    Shell/Perl scripts  Unix file system     No                 Any node in the cluster             No
Twister    Java                Unix file system     No                 One master node in broker network   Yes
Cascading  Java                HDFS                 No                 Name node and job tracker           Yes
-
MapReduce Languages
                    Hive                                 Pig
Language            Declarative SQL-like                 Dataflow
Data model          Nested                               Nested
UDF                 Supported                            Supported
Data partition      Supported                            Not supported
Interface           Command line, web, JDBC/ODBC server  Command line
Query optimization  Rule based                           Rule based
Metastore           Supported                            Not supported
-
MapReduce Implementations of Database Operators
Select and Project can be easily implemented in the map function
Aggregation is not difficult (see next slide)
Join requires more work
MapReduce join implementations:
[Figure: taxonomy of MapReduce joins: theta-join (via multiple MapReduce jobs); equijoin (repartition join, semi-join, and map-only join: broadcast join, Trojan join, replicated join); similarity join; multiway join (via multiple MapReduce jobs)]
-
Aggregation
[Figure: in the map phase, each mapper extracts the aggregation attribute (Aid) from its tuples of R; the pairs are partitioned by Aid (round robin) to the reducers; in the reduce phase, tuples are grouped by Aid and the aggregation function f is applied to tuples with the same Aid, e.g., reducer 1 outputs (1, f(R1, R3)) and reducer 2 outputs (2, f(R2, R4))]
-
θ-Join
Baseline implementation of R(A,B) ⋈ S(B,C):
1 Partition R and assign each partition to mappers (similarly for S)
2 Each mapper takes 〈a, b〉 tuples and converts them to a list of key/value pairs of the form (b, 〈a, R〉) (and, for S, 〈b, c〉 tuples to (b, 〈c, S〉))
3 Each reducer pulls the pairs with the same key
4 Each reducer joins the tuples of R with the tuples of S
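A single-process sketch of this baseline reduce-side (repartition) join, with the dictionary grouping simulating the shuffle:

from collections import defaultdict

def mr_join(R, S):
    # Map: tag each tuple with its source and key it by the join attribute b.
    groups = defaultdict(lambda: {"R": [], "S": []})
    for a, b in R:                      # map R: emit (b, <a, R>)
        groups[b]["R"].append(a)
    for b, c in S:                      # map S: emit (b, <c, S>)
        groups[b]["S"].append(c)
    # Reduce: each reducer sees all pairs for one key and joins the two sides.
    out = []
    for b, g in groups.items():
        out += [(a, b, c) for a in g["R"] for c in g["S"]]
    return out

print(mr_join(R=[(1, "x"), (2, "y")], S=[("x", 10), ("x", 20)]))
# [(1, 'x', 10), (1, 'x', 20)]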
-
DBMS on MapReduce
                    HadoopDB                                         Llama                                                                  Cheetah
Language            SQL-like                                         Simple interface                                                       SQL
Storage             Row store                                        Column store                                                           Hybrid store
Data compression    No                                               Yes                                                                    Yes
Data partition      Horizontally partitioned                         Vertically partitioned                                                 Horizontally partitioned at chunk level
Indexing            Local index in each database instance            No index                                                               Local index for each data chunk
Query optimization  Rule-based, plus local optimization by PostgreSQL  Column-based optimization, late materialization, multiway join in one job  Multi-query optimization, materialized views
-
Spark System
MapReduce does not perform well in iterative computations
The workflow model is acyclic
Must write to HDFS after each iteration and read from HDFS at the beginning of the next iteration
Spark objectives
Better support for iterative programs
Provide a complete ecosystem
Similar abstraction (to MapReduce) for programming
Maintain MapReduce fault-tolerance and scalability
Fundamental concepts
RDD: Resilient Distributed Datasets
Caching of the working set
Maintaining lineage for fault-tolerance
-
MapReduce Ecosystem [Michiardi, 2015]
[Figure: specialized systems around MapReduce (general batching): streaming (Storm, S4, Samza), iterative (Mahout), ad hoc/SQL (Pig, Hive, Drill, Impala), graph (Giraph)]
-
Spark Ecosystem [Michiardi, 2015]
[Figure: on top of Apache Spark: native Spark apps, Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph processing)]
-
Spark Programming Model [Zaharia et al., 2010, 2012]
[Figure: flow: create an RDD from HDFS; optionally cache it; either transform it, producing a new RDD that may itself be cached or processed, or process it, writing results back to HDFS]
RDDs are created from HDFS files or parallelized arrays; partitioned across worker machines; may be made persistent lazily
Each transform generates a new RDD that may also be cached or processed
Processing is done on one of the RDDs, in parallel across workers; the first processing of an RDD reads from disk; subsequent processing of the same RDD reads from cache
-
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")               // create RDD
errors = lines.filter(_.startsWith("ERROR"))       // transform RDD
messages = errors.map(_.split('\t')(2))            // another transform
cachedMsgs = messages.cache()                      // cache results
cachedMsgs.filter(_.contains("foo")).count         // action
cachedMsgs.filter(_.contains("bar")).count         // another action (accesses cache)

[Figure: the driver ships tasks to workers, each holding one block of the file and a cache; results return to the driver]
-
Example Transform Functions
Function                             Meaning
map(func)                            Return a new RDD formed by passing each element of the source through the function func
filter(func)                         Return a new RDD formed by selecting those elements of the source on which func returns true
flatMap(func)                        Similar to map, but each input item can be mapped to 0 or more output items (func should return a Seq, not a single item)
mapPartitions(func)                  Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator
sample(repl, fraction, seed)         Sample a fraction fraction of the data, with or without replacement (set repl accordingly), using a given random number generator seed
union(otherDataset), intersection()  Return a new RDD that contains the union/intersection of the elements in the source RDD and the argument
groupByKey()                         Operates on an RDD of (K, V) pairs; returns an RDD of (K, Iterable) pairs
reduceByKey(func, ...)               Operates on an RDD of (K, V) pairs; returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function func
-
Example Actions
Action                         Meaning
collect()                      Return all the elements of the dataset as an array at the driver program
count()                        Return the number of elements in the dataset
take(n)                        Return an array with the first n elements of the dataset
first()                        Return the first element of the dataset (similar to take(1))
takeSample(repl, num, [seed])  Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed
takeOrdered(n, [ordering])     Return the first n elements of the RDD using either their natural order or a custom comparator
saveAsTextFile(path)           Write the elements of the RDD as a text file (or set of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system
countByKey()                   Only available on RDDs of type (K, V); returns a hashmap of (K, Int) pairs with the count of each key
foreach(func)                  Run a function func on each element of the dataset
-
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions
Lineage is constructed as an object and stored persistently for recovery
Example

messages = spark.textFile("hdfs://...")
           .filter(_.startsWith("ERROR"))
           .map(_.split('\t')(2))

[Figure: lineage chain: HDFS file -> filter (func = _.contains(...)) -> filtered RDD -> map (func = _.split(...)) -> mapped RDD]
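A toy sketch (illustrative, not Spark's actual implementation) of how lineage permits recomputing a lost partition: each RDD remembers its parent and the function that produced it, so a lost partition can be replayed from stable storage.

class RDD:
    def __init__(self, parent, fn):
        self.parent, self.fn = parent, fn   # lineage: how to rebuild me

    def compute(self, partition):
        if self.parent is None:             # base RDD: read from storage
            return read_hdfs_partition(partition)
        return self.fn(self.parent.compute(partition))

lines    = RDD(None, None)
errors   = RDD(lines,  lambda p: [l for l in p if l.startswith("ERROR")])
messages = RDD(errors, lambda p: [l.split("\t")[2] for l in p])

def read_hdfs_partition(i):                 # stub standing in for HDFS
    return ["ERROR\tdisk\tfull", "INFO\tok\t-"]

# If a partition of `messages` is lost, compute(i) replays the filter
# and map over the original HDFS partition.
print(messages.compute(0))                  # ['full']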
-
Outline
1 Distributed Storage Systems
2 Main Memory Systems
3 Big Data Processing
MapReduce [Li et al., 2014]
Spark [Zaharia, 2016]
4 Stream Data Processing
5 Graph Processing
6 RDF Data Management
-
Stream System Inputs & Outputs
Inputs
One or more sources generate data continuously, in real time, and in fixed order (by timestamp)
Sensor networks: weather monitoring, road traffic monitoring
Web data: financial trading, news/sports tickers
Scientific data: experiments in particle physics
Transaction logs: point-of-sale purchases
Network traffic analysis: IP packet headers
Outputs
Want to collect and process data in real time; up-to-date answers generated continuously or periodically
Environment monitoring
Location monitoring
Correlations across stock prices
Denial-of-service attack detection
-
DBMS versus DSS
Traditional DBMS: transient queries over persistent data, producing one-time results
Data Stream System (DSS): persistent queries over transient data, producing continuous results
Other differences of a DSS:
Push-based (data-driven)
Unbounded stream; query execution as data arrives at the system (one look)
System conditions may not be stable: arrival rates fluctuate, and the workload may change
-
DSS System Architecture
[Figure: DSS architecture: an input monitor receives the streaming inputs; working, summary, and static storage hold state (static storage fed by updates to static data); user queries enter a query repository; the query processor produces streaming outputs through an output buffer]
-
Processing Models
Continuous processing
Each new data item is processed as soon as it arrives in the system. Examples: Apache Storm, Heron
Windowed processing
Incoming data items are batched in windows and executed as a batch; for the user, recently arrived data may be more interesting and useful. Examples: Aurora, STREAM, Spark Streaming
-
Stream Data Models
Append-only sequence of timestamped items that arrive in some order.
〈timestamp, payload〉
What is the payload?
Relational tuples
Revision tuples
Sequence of events (as in publish/subscribe systems)
Sequence of sets (or bags) of elements, with each set storing elements that have arrived during the same unit of time
-
Window Definition
According to the direction of endpoint movement:
Fixed window: both endpoints are fixed
Sliding window: both endpoints can slide (backward or forward)
Landmark window: one endpoint fixed, the other sliding
According to the definition of window size:
Logical window: time-based; window length measured in time units
Physical window: count-based; window length measured in number of data items
Partitioned window: split a window into multiple count-based windows
Predicate window: an arbitrary predicate defines the contents of the window
Most common ones:
Time-based sliding window
Count-based sliding window
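A small sketch (not from the slides) contrasting the two most common definitions: a time-based window evicts by timestamp, a count-based window evicts by count.

from collections import deque

def time_window(stream, length):
    """Time-based sliding window: keep items newer than now - length.
    Items are (timestamp, value) pairs in timestamp order."""
    win = deque()
    for ts, v in stream:
        win.append((ts, v))
        while win and win[0][0] <= ts - length:   # evict by timestamp
            win.popleft()
        yield list(win)

def count_window(stream, n):
    """Count-based sliding window: keep only the last n items."""
    win = deque(maxlen=n)                         # deque evicts by count
    for item in stream:
        win.append(item)
        yield list(win)

s = [(1, "a"), (2, "b"), (5, "c")]
print(list(time_window(s, length=3))[-1])   # [(5, 'c')]: older items expired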
-
Stream Query Models
Recall that queries are persistent
They may be monotonic or non-monotonic
Monotonic: result set always grows
If Q(t) is the result of a query at time t, then for two executions at times ti and tj with ti ≤ tj, Q(ti) ⊆ Q(tj)
Non-monotonic: deletions from the result set are possible
-
Streaming Operators
Stateless operators are no problem: e.g., selection, projection
Stateful operators (e.g., nested loop join) are blocking: you need to see the entire inner operand

for each tuple r in R do
    for each tuple s in S do
        if r == s then
            add <r, s> to result

For some blocking operators, non-blocking versions exist
Symmetric hash join
Windowed execution otherwise
-
Operators over Sliding Windows
Joins and aggregation may require unbounded state, so they typically operate over sliding windows
Example: track the maximum value in an on-line sequence over a sliding window of the last N time units
[Figure: a sequence of values 75 53 67 71 68 67 73 70 68 65 64 62 61 arriving over time; the maximum is 75 while it remains in the window, but once it expires the maximum must be recomputed from the values still in the window]
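A standard way to track a sliding-window maximum in O(1) amortized time per item is a monotonic deque; a sketch over a count-based window (not from the slides):

from collections import deque

def sliding_max(values, n):
    """Max over a count-based sliding window of the last n values,
    using a monotonic deque of (index, value)."""
    dq = deque()
    for i, v in enumerate(values):
        while dq and dq[-1][1] <= v:   # drop values dominated by v
            dq.pop()
        dq.append((i, v))
        if dq[0][0] <= i - n:          # evict the max once it leaves the window
            dq.popleft()
        if i >= n - 1:
            yield dq[0][1]

print(list(sliding_max([75, 53, 67, 71, 68, 67, 73, 70], n=4)))
# [75, 71, 71, 73, 73]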
-
Load Management
What to do when the data arrival rate is higher than the system's processing capacity?
Random load shedding
Randomly drop a fraction of arriving data items
Semantic load shedding
Examine the payload (contents) of a data item before deciding whether or not to drop it; some data items may have more value than others
Or, rather than dropping tuples:
Spill to disk and process during idle times
Shorten the windows
Update the answer less often
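An illustrative sketch of the two shedding policies above; the "value" payload field is made up and stands in for whatever importance measure the application defines.

import random

def random_shed(item, keep_prob=0.8):
    """Random load shedding: keep each item with a fixed probability."""
    return random.random() < keep_prob

def semantic_shed(item, min_value=100):
    """Semantic load shedding: inspect the payload and keep valuable items.
    'value' is a hypothetical payload field used as the importance score."""
    return item["value"] >= min_value

stream = [{"id": 1, "value": 250}, {"id": 2, "value": 10}]
kept = [t for t in stream if semantic_shed(t)]   # drops the low-value tuple
print(kept)                                      # [{'id': 1, 'value': 250}]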
-
Outline
1 Distributed Storage Systems
2 Main Memory Systems
3 Big Data Processing
MapReduce [Li et al., 2014]
Spark [Zaharia, 2016]
4 Stream Data Processing
5 Graph Processing
6 RDF Data Management
-
Graph Data are Very Common
The Internet
Social networks
Trade volumes and connections
Biological networks
-
Graph Data are Very Common
As of September 2011
MusicBrainz
(zitgist)
P20
Turismo de
Zaragoza
yovisto
Yahoo! Geo
Planet
YAGO
World Fact-book
El ViajeroTourism
WordNet (W3C)
WordNet (VUA)
VIVO UF
VIVO Indiana
VIVO Cornell
VIAF
URIBurner
Sussex Reading
Lists
Plymouth Reading
Lists
UniRef
UniProt
UMBEL
UK Post-codes
legislationdata.gov.uk
Uberblic
UB Mann-heim
TWC LOGD
Twarql
[Figure: Linking Open Data cloud — hundreds of interlinked datasets (e.g., DBpedia, Freebase, GeoNames, LinkedGeoData, MusicBrainz, DBLP, UniProt, Eurostat, data.gov.uk) coloured by domain: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content]
© M. Tamer Özsu CS848 - Winter 2019 74 / 137
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
-
Graph Types

Property graph

[Figure: property graph example. Vertices: films ("The Shining" with initial release date "1980-05-23", "The Passenger", "The Last Tycoon", "A Clockwork Orange", "Spartacus"), actors (Jack Nicholson, Shelley Duvall), director Stanley Kubrick, a book (rating 4.7, creator Stephen King) with an Amazon offer, and geo entity "United Kingdom" (population 62348447). Attributes are (key, value) pairs stored on vertices; edges carry labels such as (actor), (director), (relatedBook), (hasOffer), (based near)]

Workload: Online queries and analytic workloads
Query execution: Varies

RDF graph

[Figure: the same data as an RDF graph. Resources are identified by IRIs (mdb:film/2014, mdb:director/8476, bm:books/0743424425, geo:2635167, ...); every attribute becomes its own edge to a literal (movie:initial release date, refs:label, rev:rating, gn:name, gn:population), alongside edges between resources (movie:actor, movie:director, movie:relatedBook, scam:hasOffer, foaf:based near)]

Workload: SPARQL queries
Query execution: subgraph matching by homomorphism

© M. Tamer Özsu CS848 - Winter 2019 75 / 137
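To make the modelling difference concrete, here is a minimal sketch (Python; identifiers taken from the figure, data structures illustrative rather than any particular system's API) of the same facts in both models.

# Property graph: vertices and edges each carry their own attribute map.
vertices = {
    "film_2014": {"label": "The Shining", "initial_release_date": "1980-05-23"},
    "books_0743424425": {"rating": 4.7},
}
edges = [
    ("film_2014", "books_0743424425", {"type": "relatedBook"}),
]

# RDF graph: everything, attributes included, is a (subject, predicate,
# object) triple; objects are IRIs or literals.
triples = [
    ("mdb:film/2014", "refs:label", "The Shining"),
    ("mdb:film/2014", "movie:initial_release_date", "1980-05-23"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("bm:books/0743424425", "rev:rating", "4.7"),
]

In the property graph the rating lives on the book vertex; in RDF it is just one more edge, which is why query execution reduces uniformly to subgraph matching.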
-
Property Graph Processing Systems

System         Memory/Disk  Architecture          Computing paradigm               Supported Workloads
Hadoop         Disk         Parallel/Distributed  MapReduce                        Analytical
HaLoop         Disk         Parallel/Distributed  MapReduce                        Analytical
Pegasus        Disk         Parallel/Distributed  MapReduce                        Analytical
GraphX         Disk         Parallel/Distributed  MapReduce (Spark)                Analytical
Pregel/Giraph  Memory       Parallel/Distributed  Vertex-Centric                   Analytical
GraphLab       Memory       Parallel/Distributed  Vertex-Centric                   Analytical
GraphChi       Disk         Single machine        Vertex-Centric                   Analytical
Stream         Disk         Single machine        Edge-Centric                     Analytical
Trinity        Memory       Parallel/Distributed  Flexible using K-V store on DSM  Online & Analytical
Titan          Disk         Parallel/Distributed  K-V store (Cassandra)            Online
Neo4J          Disk         Single machine        Procedural/Linked-list           Online
© M. Tamer Özsu CS848 - Winter 2019 76 / 137
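Most of the analytical systems in the table follow the vertex-centric ("think like a vertex") paradigm of Pregel/Giraph. Below is a minimal single-machine sketch of its synchronous-superstep structure, assuming an undirected adjacency list in which every vertex appears as a key; the toy compute function floods the minimum vertex id to find connected components, and the loop is illustrative rather than any system's API.

def superstep_loop(adj, compute):
    state = {v: v for v in adj}                  # each vertex starts with its own id
    # superstep 0: every vertex broadcasts its value to its neighbours
    messages = {v: [] for v in adj}
    for v in adj:
        for u in adj[v]:
            messages[u].append(state[v])
    # run supersteps until no messages are produced (all vertices halt)
    while any(messages.values()):
        new_messages = {v: [] for v in adj}
        for v, inbox in messages.items():
            for out in compute(v, state, inbox):  # compute may update state[v]
                for u in adj[v]:
                    new_messages[u].append(out)
        messages = new_messages
    return state

def min_id_compute(v, state, inbox):
    smallest = min(inbox, default=state[v])
    if smallest < state[v]:
        state[v] = smallest
        return [smallest]       # value improved: tell the neighbours
    return []                   # no change: stay silent (vote to halt)

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
print(superstep_loop(adj, min_id_compute))   # {'a': 'a', 'b': 'a', 'c': 'a', 'd': 'd'}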
-
Scale-up or Scale-out? [Lin, 2018]
Scale-up: Single machine execution
Graph datasets are small and can fit in a single machine – even in main memory
Single machine avoids parallel execution complexities

Scale-out: Parallel execution

Graph data sets grow when they are expanded to their storage formats
Workstations big enough to handle even smaller datasets are still expensive
Some graphs are very large: Alibaba: several billion vertices, > 100 million edges
Dataset size may not be the determinant ⇒ parallelizing computation is important

Dataset          |V|            |E|             Regular size  Single Machine∗
Live Journal     4,847,571      68,993,773      1.08GB        6.3GB
USA Road         23,947,347     58,333,344      951MB         9.09GB
Twitter          41,652,230     1,468,365,182   26GB          128GB
UK0705           82,240,700     2,829,101,180   48GB          247GB
World Road       682,496,072    717,016,716     15GB          194GB
CommonCrawl2014  1,727,000,000  64,422,000,000  1.3TB         Out of memory

∗ Using PowerLyra
We focus on parallel graph analytics systems
© M. Tamer Özsu CS848 - Winter 2019 77 / 137
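The jump from "regular size" to in-memory size comes from per-vertex and per-edge bookkeeping (ids stored on both endpoints, pointers, object headers, message buffers). A rough back-of-envelope sketch; the per-object overheads below are hypothetical round numbers, not measurements of any system:

def estimate_gb(num_vertices, num_edges,
                bytes_per_edge_on_disk=16,       # two 8-byte ids (assumed)
                bytes_per_edge_in_memory=40,     # ids + pointers + headers (assumed)
                bytes_per_vertex_in_memory=80):  # state + buffers (assumed)
    disk = num_edges * bytes_per_edge_on_disk
    memory = (num_edges * bytes_per_edge_in_memory
              + num_vertices * bytes_per_vertex_in_memory)
    return disk / 2**30, memory / 2**30

# Twitter-scale input: ~42M vertices, ~1.5B edges
print(estimate_gb(41_652_230, 1_468_365_182))    # ≈ (21.9, 57.8)

With these assumed constants the estimate lands in the right order of magnitude for the 26GB to 128GB expansion in the table; real systems add further replication and framework overheads on top.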
-
Graph Partitioning

Edge-cut (vertex-disjoint)
  Achieve disjoint partitions by allocating each vertex to a partition
  Objective 1: Partitions should be balanced
  Objective 2: Minimize edge-cuts (to reduce communication)
  Good for graphs with low-degree vertices, not for power-law graphs
  Examples: Hashing, METIS [Karypis and Kumar, 1995], label propagation algorithms [Ugander and Backstrom, 2013]

Vertex-cut (edge-disjoint)
  Achieve disjoint partitions by allocating each edge to a partition and replicating vertices as necessary
  Objective 1: Partitions should be balanced
  Objective 2: Minimize vertex-cuts (to reduce replica cost)
  Performs better on power-law graphs
  Examples: Hashing, greedy algorithms [Gonzalez et al., 2012]

Hybrid
  Edge-cut for low-degree vertices / vertex-cut for high-degree ones
  PowerLyra [Chen et al., 2015]

(The two hash-based baselines are sketched after this slide.)
© M. Tamer Özsu CS848 - Winter 2019 78 / 137
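A minimal sketch of the two hash-based baselines, assuming an edge-list input; real partitioners such as METIS or PowerLyra are far more sophisticated:

def edge_cut(vertices, edges, k):
    # Edge-cut: each vertex is owned by exactly one of k partitions; an edge
    # whose endpoints land in different partitions is "cut" and will cause
    # communication between two machines.
    owner = {v: hash(v) % k for v in vertices}
    cut_edges = [(u, v) for u, v in edges if owner[u] != owner[v]]
    return owner, cut_edges

def vertex_cut(edges, k):
    # Vertex-cut: each edge is owned by exactly one partition; a vertex whose
    # edges land in several partitions is replicated in each of them.
    parts = {i: [] for i in range(k)}
    replicas = {}
    for u, v in edges:
        p = hash((u, v)) % k            # arbitrary but consistent within a run
        parts[p].append((u, v))
        replicas.setdefault(u, set()).add(p)
        replicas.setdefault(v, set()).add(p)
    return parts, replicas              # len(replicas[v]) > 1 means v is replicated

For a power-law graph, a high-degree vertex under edge-cut drags most of its edges across partition boundaries, while under vertex-cut its edges spread evenly and only the vertex itself is replicated; this is why vertex-cut (and PowerLyra's hybrid) performs better there.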
-
Graph Workloads
Online graph querying
Reachability
Single source shortest-path
Subgraph matching
SPARQL queries
Offline graph analytics
PageRank
Clustering
Connected components
Diameter finding
Graph colouring
All pairs shortest path
Graph pattern mining
Machine learning algorithms (Belief propagation, Gaussian non-negative matrix factorization)
© M. Tamer Özsu CS848 - Winter 2019 79 / 137
-
Reachability Queries

[Figure: the property graph example from the Graph Types slide (films, actors, director Stanley Kubrick, a related book with its Amazon offer, and United Kingdom geo data)]

Can you reach film 1267 from film 2014?

Is there a book whose rating is > 4.0 associated with a film that was directed by Stanley Kubrick?

Think of the Facebook graph and finding friends of friends.

© M. Tamer Özsu CS848 - Winter 2019 81 / 137
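Reachability is typically answered by graph traversal (or by precomputed reachability indexes). A minimal breadth-first-search sketch; the adjacency list below is a hand-picked, illustratively oriented fragment of the figure, not the full graph:

from collections import deque

def reachable(adj, source, target):
    # BFS over directed edges; True iff target is reachable from source.
    seen, frontier = {source}, deque([source])
    while frontier:
        v = frontier.popleft()
        if v == target:
            return True
        for u in adj.get(v, []):
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return False

adj = {
    "film_2014": ["actor_29704", "director_8476", "books_0743424425"],
    "actor_29704": ["film_3418", "film_1267"],
}
print(reachable(adj, "film_2014", "film_1267"))   # True: via Jack Nicholson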
-
Subgraph Matching

[Figure: a query graph matched by homomorphism against the RDF graph of the Graph Types slide. Query pattern: ?m movie:director ?d ; ?d movie:director name "Stanley Kubrick" ; ?m rdfs:label ?name ; ?m movie:relatedBook ?b ; ?b rev:rating ?r ; FILTER(?r > 4.0)]

© M. Tamer Özsu CS848 - Winter 2019 82 / 137
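Execution finds every binding of the query variables such that each query triple maps onto a data triple. A minimal backtracking sketch (variables start with "?"; the four data triples are a hypothetical extract consistent with the figure; real engines use indexes and optimized join orders):

def unify(term, value, binding):
    # A variable either already has this value or gets bound to it now.
    if term.startswith("?"):
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value      # constants must match exactly

def match(query, data, binding=None):
    binding = binding or {}
    if not query:
        yield dict(binding)
        return
    (s, p, o), rest = query[0], query[1:]
    for ds, dp, do in data:
        trial = dict(binding)
        if unify(s, ds, trial) and unify(p, dp, trial) and unify(o, do, trial):
            yield from match(rest, data, trial)

data = [
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("bm:books/0743424425", "rev:rating", "4.7"),
]
query = [
    ("?m", "movie:director", "?d"),
    ("?d", "movie:director_name", "Stanley Kubrick"),
    ("?m", "movie:relatedBook", "?b"),
    ("?b", "rev:rating", "?r"),
]
for b in match(query, data):
    if float(b["?r"]) > 4.0:              # the FILTER(?r > 4.0) step
        print(b["?b"])                    # bm:books/0743424425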
-
PageRank Computation
A web page is important if it is pointed to by other important pages.

[Figure: example web graph with six pages P1–P6]

$$r(P_i) = (1 - d) + d \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|F_{P_j}|} \qquad (\text{let } d = 1)$$

For example:

$$r(P_2) = \frac{r(P_1)}{2} + \frac{r(P_3)}{3}$$

Iteratively:

$$r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|F_{P_j}|}$$

$B_{P_i}$: in-neighbours of $P_i$
$F_{P_i}$: out-neighbours of $P_i$
© M. Tamer Özsu CS848 - Winter 2019 83 / 137
-
PageRank Computation
A web page is important if it is pointed to by other important pages.

[Figure: example web graph with six pages P1–P6]

$$r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|F_{P_j}|}$$

Page  Iteration 0      Iteration 1      Iteration 2       Rank at Iter. 2
P1    r0(P1) = 1/6     r1(P1) = 1/18    r2(P1) = 1/36     5
P2    r0(P2) = 1/6     r1(P2) = 5/36    r2(P2) = 1/18     4
P3    r0(P3) = 1/6     r1(P3) = 1/12    r2(P3) = 1/36     5
P4    r0(P4) = 1/6     r1(P4) = 1/4     r2(P4) = 17/72    1
P5    r0(P5) = 1/6     r1(P5) = 5/36    r2(P5) = 11/72    3
P6    r0(P6) = 1/6     r1(P6) = 1/6     r2(P6) = 14/72    2

Iterative processing
Touch each vertex
© M. Tamer Özsu CS848 - Winter 2019 83 / 137
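A minimal sketch of the d = 1 update above, driven from the out-edge side. The small graph is hypothetical (the figure's exact edges are not recoverable from the transcript), but it is chosen so that P1 has out-degree 2 and P3 has out-degree 3, matching the r(P2) example:

def pagerank_step(ranks, out_edges):
    # Each page divides its current rank among its out-neighbours F_P.
    next_ranks = {p: 0.0 for p in ranks}
    for p, targets in out_edges.items():
        for q in targets:
            next_ranks[q] += ranks[p] / len(targets)   # assumes no dangling pages
    return next_ranks

out_edges = {            # hypothetical small web graph
    "P1": ["P2", "P3"],
    "P2": ["P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P1"],
}
ranks = {p: 1 / len(out_edges) for p in out_edges}
for _ in range(2):       # two iterations, as in the table above
    ranks = pagerank_step(ranks, out_edges)
print(ranks)

Every iteration touches every vertex, which is what makes this workload a poor fit for one-shot MapReduce jobs, as the next slide discusses.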
-
Can MapReduce be Used for Graph Analytics?

Yes; map and reduce functions can be written for graph analytics workloads
  Scalable Graph processing Class SGC [Qin et al., 2014]
  Connected component computation [Kiveris et al., 2014; Rastogi et al., 2013]

Not suitable for iterative processing due to data movement at each stage
  No guarantee that computation will be assigned to the same worker nodes in the next round

High I/O cost (sketched after this slide)
  Need to save the intermediate results of each iteration in the storage system (HDFS)

There are systems that address these concerns
  HaLoop [Bu et al., 2010, 2012]
  GraphX over Spark [Gonzalez et al., 2014]
© M. Tamer Özsu CS848 - Winter 2019 84 / 137
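To see where the I/O cost comes from, here is a minimal sketch of one PageRank iteration as a map/reduce pair. The functions are pure-Python stand-ins for Hadoop's API with a toy in-process shuffle; on a real cluster each iteration is a separate job, so the reducer output below would be written to HDFS and re-read by the next round's mappers.

from collections import defaultdict

def map_fn(page, value):
    rank, out_links = value
    yield page, ("links", out_links)           # re-emit structure for the reducer
    for q in out_links:
        yield q, ("share", rank / len(out_links))

def reduce_fn(page, values):
    out_links, rank = [], 0.0
    for kind, v in values:
        if kind == "links":
            out_links = v
        else:
            rank += v
    yield page, (rank, out_links)

def run_iteration(graph):                      # graph: page -> (rank, out_links)
    shuffle = defaultdict(list)                # stands in for the shuffle phase
    for page, value in graph.items():
        for key, val in map_fn(page, value):
            shuffle[key].append(val)
    result = {}
    for page, vals in shuffle.items():
        for k, v in reduce_fn(page, vals):
            result[k] = v
    return result

graph = {"P1": (1/3, ["P2"]), "P2": (1/3, ["P1", "P3"]), "P3": (1/3, ["P1"])}
graph = run_iteration(graph)                   # one full job per iteration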
-
GraphX [Gonzalez et al., 2014]
Built on top of Spark
Objective is to combine data analytics with graph processing
Unify computation on tables and graphs
Carefully convert graph to tabular representation
Native GraphX API or can accommodate vertex-centric computation