an introduction to graph analytics platforms · ndl subjects ndlna my experi-ment italian museums...

137
An Introduction to Graph Analytics Platforms M. Tamer ¨ Ozsu University of Waterloo David R. Cheriton School of Computer Science © M. Tamer ¨ Ozsu Dagstuhl Spring School (2016/03/07–09) 1 / 59

Upload: others

Post on 28-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

An Introduction to Graph Analytics Platforms

M. Tamer Ozsu

University of WaterlooDavid R. Cheriton School of Computer Science

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 1 / 59

Page 2: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Data are Very Common

Internet

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 2 / 59

Page 3: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Data are Very Common

Socialnetworks

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 2 / 59

Page 4: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Data are Very Common

Trade volumesand

connections

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 2 / 59

Page 5: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Data are Very Common

Biologicalnetworks

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 2 / 59

Page 6: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Data are Very Common

As of September 2011

MusicBrainz

(zitgist)

P20

Turismo de

Zaragoza

yovisto

Yahoo! Geo

Planet

YAGO

World Fact-book

El ViajeroTourism

WordNet (W3C)

WordNet (VUA)

VIVO UF

VIVO Indiana

VIVO Cornell

VIAF

URIBurner

Sussex Reading

Lists

Plymouth Reading

Lists

UniRef

UniProt

UMBEL

UK Post-codes

legislationdata.gov.uk

Uberblic

UB Mann-heim

TWC LOGD

Twarql

transportdata.gov.

uk

Traffic Scotland

theses.fr

Thesau-rus W

totl.net

Tele-graphis

TCMGeneDIT

TaxonConcept

Open Library (Talis)

tags2con delicious

t4gminfo

Swedish Open

Cultural Heritage

Surge Radio

Sudoc

STW

RAMEAU SH

statisticsdata.gov.

uk

St. Andrews Resource

Lists

ECS South-ampton EPrints

SSW Thesaur

us

SmartLink

Slideshare2RDF

semanticweb.org

SemanticTweet

Semantic XBRL

SWDog Food

Source Code Ecosystem Linked Data

US SEC (rdfabout)

Sears

Scotland Geo-

graphy

ScotlandPupils &Exams

Scholaro-meter

WordNet (RKB

Explorer)

Wiki

UN/LOCODE

Ulm

ECS (RKB

Explorer)

Roma

RISKS

RESEX

RAE2001

Pisa

OS

OAI

NSF

New-castle

LAASKISTI

JISC

IRIT

IEEE

IBM

Eurécom

ERA

ePrints dotAC

DEPLOY

DBLP (RKB

Explorer)

Crime Reports

UK

Course-ware

CORDIS (RKB

Explorer)CiteSeer

Budapest

ACM

riese

Revyu

researchdata.gov.

ukRen. Energy Genera-

tors

referencedata.gov.

uk

Recht-spraak.

nl

RDFohloh

Last.FM (rdfize)

RDF Book

Mashup

Rådata nå!

PSH

Product Types

Ontology

ProductDB

PBAC

Poké-pédia

patentsdata.go

v.uk

OxPoints

Ord-nance Survey

Openly Local

Open Library

OpenCyc

Open Corpo-rates

OpenCalais

OpenEI

Open Election

Data Project

OpenData

Thesau-rus

Ontos News Portal

OGOLOD

JanusAMP

Ocean Drilling Codices

New York

Times

NVD

ntnusc

NTU Resource

Lists

Norwe-gian

MeSH

NDL subjects

ndlna

myExperi-ment

Italian Museums

medu-cator

MARC Codes List

Man-chester Reading

Lists

Lotico

Weather Stations

London Gazette

LOIUS

Linked Open Colors

lobidResources

lobidOrgani-sations

LEM

LinkedMDB

LinkedLCCN

LinkedGeoData

LinkedCT

LinkedUser

FeedbackLOV

Linked Open

Numbers

LODE

Eurostat (OntologyCentral)

Linked EDGAR

(OntologyCentral)

Linked Crunch-

base

lingvoj

Lichfield Spen-ding

LIBRIS

Lexvo

LCSH

DBLP (L3S)

Linked Sensor Data (Kno.e.sis)

Klapp-stuhl-club

Good-win

Family

National Radio-activity

JP

Jamendo (DBtune)

Italian public

schools

ISTAT Immi-gration

iServe

IdRef Sudoc

NSZL Catalog

Hellenic PD

Hellenic FBD

PiedmontAccomo-dations

GovTrack

GovWILD

GoogleArt

wrapper

gnoss

GESIS

GeoWordNet

GeoSpecies

GeoNames

GeoLinkedData

GEMET

GTAA

STITCH

SIDER

Project Guten-berg

MediCare

Euro-stat

(FUB)

EURES

DrugBank

Disea-some

DBLP (FU

Berlin)

DailyMed

CORDIS(FUB)

Freebase

flickr wrappr

Fishes of Texas

Finnish Munici-palities

ChEMBL

FanHubz

EventMedia

EUTC Produc-

tions

Eurostat

Europeana

EUNIS

EU Insti-

tutions

ESD stan-dards

EARTh

Enipedia

Popula-tion (En-AKTing)

NHS(En-

AKTing) Mortality(En-

AKTing)

Energy (En-

AKTing)

Crime(En-

AKTing)

CO2 Emission

(En-AKTing)

EEA

SISVU

education.data.g

ov.uk

ECS South-ampton

ECCO-TCP

GND

Didactalia

DDC Deutsche Bio-

graphie

datadcs

MusicBrainz

(DBTune)

Magna-tune

John Peel

(DBTune)

Classical (DB

Tune)

AudioScrobbler (DBTune)

Last.FM artists

(DBTune)

DBTropes

Portu-guese

DBpedia

dbpedia lite

Greek DBpedia

DBpedia

data-open-ac-uk

SMCJournals

Pokedex

Airports

NASA (Data Incu-bator)

MusicBrainz(Data

Incubator)

Moseley Folk

Metoffice Weather Forecasts

Discogs (Data

Incubator)

Climbing

data.gov.uk intervals

Data Gov.ie

databnf.fr

Cornetto

reegle

Chronic-ling

America

Chem2Bio2RDF

Calames

businessdata.gov.

uk

Bricklink

Brazilian Poli-

ticians

BNB

UniSTS

UniPathway

UniParc

Taxonomy

UniProt(Bio2RDF)

SGD

Reactome

PubMedPub

Chem

PRO-SITE

ProDom

Pfam

PDB

OMIMMGI

KEGG Reaction

KEGG Pathway

KEGG Glycan

KEGG Enzyme

KEGG Drug

KEGG Com-pound

InterPro

HomoloGene

HGNC

Gene Ontology

GeneID

Affy-metrix

bible ontology

BibBase

FTS

BBC Wildlife Finder

BBC Program

mes BBC Music

Alpine Ski

Austria

LOCAH

Amster-dam

Museum

AGROVOC

AEMET

US Census (rdfabout)

Media

Geographic

Publications

Government

Cross-domain

Life sciences

User-generated content

Linked data

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 2 / 59

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.http://lod-cloud.net/

Page 7: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 3 / 59

Page 8: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 4 / 59

Page 9: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Types

Property graph

film 2014(initial release date, “1980-05-23”)

(label, “The Shining”)

books 0743424425(rating, 4.7)

offers 0743424425amazonOffer

geo 2635167(name, “United Kingdom”)

(population, 62348447) actor 29704(actor name, “Jack Nicholson”)

film 3418(label, “The Passenger”)

film 1267(label, “The Last Tycoon”)

director 8476(director name, “Stanley Kubrick”)

film 2685(label, “A Clockwork Orange”)

film 424(label, “Spartacus”)

actor 30013

(relatedBook)

(hasOffer)

(based near)(actor)

(director) (actor)

(actor) (actor)

(director) (director)

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 5 / 59

Page 10: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Types

RDF graph

mdb:film/2014

“1980-05-23”

movie:initial release date

“The Shining”refs:label

bm:books/0743424425

4.7

rev:rating

bm:offers/0743424425amazonOffer

geo:2635167

“United Kingdom”

gn:name

62348447

gn:population

mdb:actor/29704

“Jack Nicholson”

movie:actor name

mdb:film/3418

“The Passenger”

refs:label

mdb:film/1267

“The Last Tycoon”

refs:label

mdb:director/8476

“Stanley Kubrick”

movie:director name

mdb:film/2685

“A Clockwork Orange”

refs:label

mdb:film/424

“Spartacus”

refs:label

mdb:actor/30013

movie:relatedBook

scam:hasOffer

foaf:based nearmovie:actor

movie:directormovie:actor

movie:actor movie:actor

movie:director movie:director

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 5 / 59

Page 11: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Types

Property graph

film 2014(initial release date, “1980-05-23”)

(label, “The Shining”)

books 0743424425(rating, 4.7)

offers 0743424425amazonOffer

geo 2635167(name, “United Kingdom”)

(population, 62348447) actor 29704(actor name, “Jack Nicholson”)

film 3418(label, “The Passenger”)

film 1267(label, “The Last Tycoon”)

director 8476(director name, “Stanley Kubrick”)

film 2685(label, “A Clockwork Orange”)

film 424(label, “Spartacus”)

actor 30013

(relatedBook)

(hasOffer)

(based near)(actor)

(director) (actor)

(actor) (actor)

(director) (director)

Workload: Online queries andanalytic workloads

Query execution: Varies

RDF graph

mdb:film/2014

“1980-05-23”

movie:initial release date

“The Shining”refs:label

bm:books/0743424425

4.7

rev:rating

bm:offers/0743424425amazonOffer

geo:2635167

“United Kingdom”

gn:name

62348447

gn:population

mdb:actor/29704

“Jack Nicholson”

movie:actor name

mdb:film/3418

“The Passenger”

refs:label

mdb:film/1267

“The Last Tycoon”

refs:label

mdb:director/8476

“Stanley Kubrick”

movie:director name

mdb:film/2685

“A Clockwork Orange”

refs:label

mdb:film/424

“Spartacus”

refs:label

mdb:actor/30013

movie:relatedBook

scam:hasOffer

foaf:based nearmovie:actor

movie:directormovie:actor

movie:actor movie:actor

movie:director movie:director

Workload: SPARQL queries

Query execution: subgraphmatching by homomorphism

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 5 / 59

Page 12: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDF Introduction

Everything is an uniquely namedresource

Prefixes can be used to shorten thenames

Properties of resources can be defined

Relationships with other resources canbe defined

Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web

Integrated web “database”

http://data.linkedmdb.org/resource/actor/JN29704

xmlns:y=http://data.linkedmdb.org/resource/actor/

y:JN29704

y:JN29704:hasName “Jack Nicholson”

y:JN29704:BornOnDate “1937-04-22”

y:TS2014:title “The Shining”

y:TS2014:releaseDate “1980-05-23”

y:TS2014

JN29704:movieActor

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 6 / 59

Page 13: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDF Introduction

Everything is an uniquely namedresource

Prefixes can be used to shorten thenames

Properties of resources can be defined

Relationships with other resources canbe defined

Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web

Integrated web “database”

http://data.linkedmdb.org/resource/actor/JN29704

xmlns:y=http://data.linkedmdb.org/resource/actor/

y:JN29704

y:JN29704:hasName “Jack Nicholson”

y:JN29704:BornOnDate “1937-04-22”

y:TS2014:title “The Shining”

y:TS2014:releaseDate “1980-05-23”

y:TS2014

JN29704:movieActor

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 6 / 59

Page 14: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDF Introduction

Everything is an uniquely namedresource

Prefixes can be used to shorten thenames

Properties of resources can be defined

Relationships with other resources canbe defined

Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web

Integrated web “database”

http://data.linkedmdb.org/resource/actor/JN29704

xmlns:y=http://data.linkedmdb.org/resource/actor/

y:JN29704

y:JN29704:hasName “Jack Nicholson”

y:JN29704:BornOnDate “1937-04-22”

y:TS2014:title “The Shining”

y:TS2014:releaseDate “1980-05-23”

y:TS2014

JN29704:movieActor

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 6 / 59

Page 15: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDF Introduction

Everything is an uniquely namedresource

Prefixes can be used to shorten thenames

Properties of resources can be defined

Relationships with other resources canbe defined

Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web

Integrated web “database”

http://data.linkedmdb.org/resource/actor/JN29704

xmlns:y=http://data.linkedmdb.org/resource/actor/

y:JN29704

y:JN29704:hasName “Jack Nicholson”

y:JN29704:BornOnDate “1937-04-22”

y:TS2014:title “The Shining”

y:TS2014:releaseDate “1980-05-23”

y:TS2014

JN29704:movieActor

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 6 / 59

Page 16: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDF Introduction

Everything is an uniquely namedresource

Prefixes can be used to shorten thenames

Properties of resources can be defined

Relationships with other resources canbe defined

Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web

Integrated web “database”

http://data.linkedmdb.org/resource/actor/JN29704

xmlns:y=http://data.linkedmdb.org/resource/actor/

y:JN29704

y:JN29704:hasName “Jack Nicholson”

y:JN29704:BornOnDate “1937-04-22”

y:TS2014:title “The Shining”

y:TS2014:releaseDate “1980-05-23”

y:TS2014

JN29704:movieActor

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 6 / 59

Page 17: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDF Data Model

Triple: Subject, Predicate (Property), Object(s, p, o)

Subject: the entity that is described (URIor blank node)

Predicate: a feature of the entity (URI)Object: value of the feature (URI, blank

node or literal)

(s, p, o) ∈ (U ∪ B)× U × (U ∪ B ∪ L)

Set of RDF triples is called an RDF graph

U

Subject Object

U B U B L

U: set of URIsB: set of blank nodesL: set of literals

Predicate

Subject Predicate Objecthttp://...imdb.../film/2014 rdfs:label “The Shining”http://...imdb.../film/2014 movie:releaseDate “1980-05-23”http://...imdb.../29704 movie:actor name “Jack Nicholson”. . . . . . . . .

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 7 / 59

Page 18: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDF Example InstancePrefixes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/

bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/lexvo=http://lexvo.org/id/;wp=http://en.wikipedia.org/wiki/

Subject Predicate Object

mdb: film/2014 rdfs:label “The Shining”mdb:film/2014 movie:initial release date “1980-05-23”’mdb:film/2014 movie:director mdb:director/8476mdb:film/2014 movie:actor mdb:actor/29704mdb:film/2014 movie:actor mdb: actor/30013mdb:film/2014 movie:music contributor mdb: music contributor/4110mdb:film/2014 foaf:based near geo:2635167mdb:film/2014 movie:relatedBook bm:0743424425mdb:film/2014 movie:language lexvo:iso639-3/engmdb:director/8476 movie:director name “Stanley Kubrick”mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:film/424 movie:director mdb:director/8476mdb:film/424 rdfs:label “Spartacus”mdb:actor/29704 movie:actor name “Jack Nicholson”mdb:film/1267 movie:actor mdb:actor/29704mdb:film/1267 rdfs:label “The Last Tycoon”mdb:film/3418 movie:actor mdb:actor/29704mdb:film/3418 rdfs:label “The Passenger”geo:2635167 gn:name “United Kingdom”geo:2635167 gn:population 62348447geo:2635167 gn:wikipediaArticle wp:United Kingdombm:books/0743424425 dc:creator bm:persons/Stephen+Kingbm:books/0743424425 rev:rating 4.7bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOfferlexvo:iso639-3/eng rdfs:label “English”lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CAlexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn

URI Literal

URI

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 8 / 59

Page 19: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDF Graph

mdb:film/2014

“1980-05-23”

movie:initial release date

“The Shining”refs:label

bm:books/0743424425

4.7

rev:rating

bm:offers/0743424425amazonOffer

geo:2635167

“United Kingdom”

gn:name

62348447

gn:population

mdb:actor/29704

“Jack Nicholson”

movie:actor name

mdb:film/3418

“The Passenger”

refs:label

mdb:film/1267

“The Last Tycoon”

refs:label

mdb:director/8476

“Stanley Kubrick”

movie:director name

mdb:film/2685

“A Clockwork Orange”

refs:label

mdb:film/424

“Spartacus”

refs:label

mdb:actor/30013

movie:relatedBook

scam:hasOffer

foaf:based nearmovie:actor

movie:directormovie:actor

movie:actor movie:actor

movie:director movie:director

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 9 / 59

Page 20: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDF Query Model – SPARQL

Query Model - SPARQL Protocol and RDF Query LanguageGiven U (set of URIs), L (set of literals), and V (set of variables), aSPARQL expression is defined recursively:

an atomic triple pattern, which is an element of

(U ∪ V )× (U ∪ V )× (U ∪ V ∪ L)

?x rdfs:label “The Shining”

P FILTER R, where P is a graph pattern expression and R is a built-inSPARQL condition (i.e., analogous to a SQL predicate)

?x rev:rating ?p FILTER(?p > 3.0)

P1 AND/OPT/UNION P2, where P1 and P2 are graph patternexpressions

Example:SELECT ?nameWHERE {

?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .?d movie : d i r e c t o r n a m e ” S t a n l e y K u b r i c k ” .?m movie : r e l a t e d B o o k ?b . ?b r e v : r a t i n g ? r .FILTER(? r > 4 . 0 )

}© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 10 / 59

Page 21: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

SPARQL Queries

SELECT ?nameWHERE {

?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .?d movie : d i r e c t o r n a m e ” S t a n l e y K u b r i c k ” .?m movie : r e l a t e d B o o k ?b . ?b r e v : r a t i n g ? r .FILTER(? r > 4 . 0 )

}

?m ?dmovie:director

?name

rdfs:label

?b

movie:relatedBook

“Stanley Kubrick”

movie:director name

?rrev:rating

FILTER(?r > 4.0)

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 11 / 59

Page 22: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 12 / 59

Page 23: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 13 / 59

Page 24: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 25: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Focus here is on the

dynamism of the

graphs in whether or

not they change and

how they change.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 26: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Focus here is on the

dynamism of the

graphs in whether or

not they change and

how they change.

Focus here is on how

algorithms behave as

their input changes.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 27: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Focus here is on the

dynamism of the

graphs in whether or

not they change and

how they change.

Focus here is on how

algorithms behave as

their input changes.

The types of workloads

that the approaches are

designed to handle.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 28: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 29: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Graphs do not

change or we

are not inter-

ested in their

changes – only

a snapshot is

considered.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 30: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Graphs do not

change or we

are not inter-

ested in their

changes – only

a snapshot is

considered.

Graphs change

and we are

interested in

their changes.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 31: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Graphs do not

change or we

are not inter-

ested in their

changes – only

a snapshot is

considered.

Graphs change

and we are

interested in

their changes.

Dynamic

graphs with

high veloc-

ity changes –

not possible to

see the entire

graph at once.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 32: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Graphs do not

change or we

are not inter-

ested in their

changes – only

a snapshot is

considered.

Graphs change

and we are

interested in

their changes.

Dynamic

graphs with

high veloc-

ity changes –

not possible to

see the entire

graph at once.

Dynamic

graphs with un-

known changes

– requires re-

discovery of

the graph (e.g.,

LOD).

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 33: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 34: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Computation accesses a

portion of the graph

and the results are

computed for a subset

of vertices; e.g., point-

to-point shortest path,

subgraph matching,

reachability, SPARQL.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 35: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Computation accesses a

portion of the graph

and the results are

computed for a subset

of vertices; e.g., point-

to-point shortest path,

subgraph matching,

reachability, SPARQL.

Computation accesses

the entire graph and

may require multiple

iterations; e.g., PageR-

ank, clustering, graph

colouring, all pairs

shortest path.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 36: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 37: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Sees the en-

tire input in

advance.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 38: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Sees the en-

tire input in

advance.

Sees the input

piece-meal as it

executes.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 39: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Sees the en-

tire input in

advance.

Sees the input

piece-meal as it

executes.

One-pass on-

line algorithm

with limited

memory.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 40: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Sees the en-

tire input in

advance.

Sees the input

piece-meal as it

executes.

One-pass on-

line algorithm

with limited

memory.

Online algo-

rithm with

some info

about forth-

coming input.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 41: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Sees the en-

tire input in

advance.

Sees the input

piece-meal as it

executes.

One-pass on-

line algorithm

with limited

memory.

Online algo-

rithm with

some info

about forth-

coming input.

Sees the en-

tire input

in advance,

which may

change; an-

swers computed

as change oc-

curs.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 42: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Classification [Ammar and Ozsu, 2016]

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Sees the en-

tire input in

advance.

Sees the input

piece-meal as it

executes.

One-pass on-

line algorithm

with limited

memory.

Online algo-

rithm with

some info

about forth-

coming input.

Sees the en-

tire input

in advance,

which may

change; an-

swers computed

as change oc-

curs.

Similar to dynamic,

but computation

happens in batches

of changes.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 14 / 59

Page 43: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example Design Points

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Compute the query result/perform analytic computation over the graphas it exists.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 15 / 59

Page 44: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example Design Points

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Compute the query result/perform analytic computation over the graphas it is revealed.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 15 / 59

Page 45: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example Design Points

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Compute the query result/perform analytic computation on each snap-shot from scratch.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 15 / 59

Page 46: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example Design Points

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Continuously compute the query result/perform analytic computation asthe input changes.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 15 / 59

Page 47: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example Design Points

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Compute the query result/perform analytic computation after a batch ofinput changes.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 15 / 59

Page 48: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example Design Points – Not all alternatives make sense

Graph Dynamism

StaticGraphs

DynamicGraphs

StreamingGraphs

EvolvingGraphs

Algorithm Types

Offline Online

Streaming Incremental

Dynamic

BatchDynamic

Workload Types

OnlineQueries

AnalyticsWorkloads

Dynamic (or batch-dynamic) algorithms do not make sense for staticgraphs.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 16 / 59

Page 49: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Processing Systems

System Memory/Disk

ArchitectureComputingparadigm

SupportedWorkloads

Hadoop Disk Parallel/Distributed MapReduce Analytical

Haloop Disk Parallel/Distributed MapReduce Analytical

Pegasus Disk Parallel/Distributed MapReduce Analytical

GraphX Disk Parallel/DistributedMapReduce

(Spark)Analytical

Pregel/Giraph Memory Parallel/Distributed Vertex-Centric Analytical

GraphLab Memory Parallel/Distributed Vertex-Centric Analytical

GraphChi Disk Single machine Vertex-Centric Analytical

Stream Disk Single machine Edge-Centric Analytical

Trinity Memory Parallel/DistributedFlexible using K-V

store on DSMOnline &Analytical

Titan Disk Parallel/DistributedK-V store

(Cassandra)Online

Neo4J Disk Single machineProcedural/Linked-list

Online

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 17 / 59

Page 50: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Graph Workloads

Online graph querying

Reachability

Single source shortest-path

Subgraph matching

SPARQL queries

Offline graph analytics

PageRank

Clustering

Strongly connectedcomponents

Diameter finding

Graph colouring

All pairs shortest path

Graph pattern mining

Machine learning algorithms(Belief propagation, Gaussiannon-negative matrixfactorization)

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 18 / 59

Page 51: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 19 / 59

Page 52: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Reachability Queries

film 2014(initial release date, “1980-05-23”)

(label, “The Shining”)

books 0743424425(rating, 4.7)

offers 0743424425amazonOffer

geo 2635167(name, “United Kingdom”)

(population, 62348447) actor 29704(actor name, “Jack Nicholson”)

film 3418(label, “The Passenger”)

film 1267(label, “The Last Tycoon”)

director 8476(director name, “Stanley Kubrick”)

film 2685(label, “A Clockwork Orange”)

film 424(label, “Spartacus”)

actor 30013

(relatedBook)

(hasOffer)

(based near)(actor)

(director) (actor)

(actor) (actor)

(director) (director)

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 20 / 59

Page 53: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Reachability Queries

film 2014(initial release date, “1980-05-23”)

(label, “The Shining”)

books 0743424425(rating, 4.7)

offers 0743424425amazonOffer

geo 2635167(name, “United Kingdom”)

(population, 62348447) actor 29704(actor name, “Jack Nicholson”)

film 3418(label, “The Passenger”)

film 1267(label, “The Last Tycoon”)

director 8476(director name, “Stanley Kubrick”)

film 2685(label, “A Clockwork Orange”)

film 424(label, “Spartacus”)

actor 30013

(relatedBook)

(hasOffer)

(based near)(actor)

(director) (actor)

(actor) (actor)

(director) (director)

Can you reach film 1267 from film 2014?

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 20 / 59

Page 54: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Reachability Queries

film 2014(initial release date, “1980-05-23”)

(label, “The Shining”)

books 0743424425(rating, 4.7)

offers 0743424425amazonOffer

geo 2635167(name, “United Kingdom”)

(population, 62348447) actor 29704(actor name, “Jack Nicholson”)

film 3418(label, “The Passenger”)

film 1267(label, “The Last Tycoon”)

director 8476(director name, “Stanley Kubrick”)

film 2685(label, “A Clockwork Orange”)

film 424(label, “Spartacus”)

actor 30013

(relatedBook)

(hasOffer)

(based near)(actor)

(director) (actor)

(actor) (actor)

(director) (director)

Is there a book whose rating is > 4.0 associated with a film that wasdirected by Stanley Kubrick?

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 20 / 59

Page 55: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Reachability Queries

Think of Facebook graph and finding friends of friends.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 20 / 59

Page 56: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Subgraph Matching

?m ?dmovie:director

?name

rdfs:label

?b

movie:relatedBook

“Stanley Kubrick”

movie:director name

?rrev:rating

FILTER(?r > 4.0)

mdb:film/2014

“1980-05-23”

movie:initial release date

“The Shining”refs:label

bm:books/0743424425

4.7

rev:rating

bm:offers/0743424425amazonOffer

geo:2635167

“United Kingdom”

gn:name

62348447

gn:population

mdb:actor/29704

“Jack Nicholson”

movie:actor name

mdb:film/3418

“The Passenger”

refs:label

mdb:film/1267

“The Last Tycoon”

refs:label

mdb:director/8476

“Stanley Kubrick”

movie:director name

mdb:film/2685

“A Clockwork Orange”

refs:label

mdb:film/424

“Spartacus”

refs:label

mdb:actor/30013

movie:relatedBook

scam:hasOffer

foaf:based nearmovie:actor

movie:directormovie:actor

movie:actor movie:actor

movie:director movie:director

SubgraphM

atching

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 21 / 59

Page 57: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 22 / 59

Page 58: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

PageRank Computation

A web page is important if it is pointed to by other importantpages.

P1 P2

P3

P5P6

P4

r(Pi ) =∑

Pj∈BPi

r(Pj)

|FPj|

r(P2) =r(P1)

2+

r(P3)

3

rk+1(Pi ) =∑

Pj∈BPi

rk(Pj)

|FPj|

BPi: in-neighbours of Pi

FPi: out-neighbours of Pi

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 23 / 59

Page 59: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

PageRank Computation

A web page is important if it is pointed to by other importantpages.

P1 P2

P3

P5P6

P4

rk+1(Pi ) =∑

Pj∈BPi

rk(Pj)

|FPj|

Iteration 0 Iteration 1 Iteration 2Rank atIter. 2

r0(P1) = 1/6 r1(P1) = 1/18 r2(P1) = 1/36 5r0(P2) = 1/6 r1(P2) = 5/36 r2(P2) = 1/18 4r0(P3) = 1/6 r1(P3) = 1/12 r2(P3) = 1/36 5r0(P4) = 1/6 r1(P4) = 1/4 r2(P4) = 17/72 1r0(P5) = 1/6 r1(P5) = 5/36 r2(P5) = 11/72 3r0(P6) = 1/6 r1(P6) = 1/6 r2(P6) = 14/72 2

Iterative processing.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 23 / 59

Page 60: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 24 / 59

Page 61: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Some Alternative Computational Models for OfflineAnalytics

Vertex-centric (Scatter-Gather)Specify (a) computation at each vertex, and (b) communication withneighbour verticesSynchronous – Pregel [Malewicz et al., 2010], GiraphAsynchronous – GraphLab [Low et al., 2012]

Block-centricSimilar to vertex-centric but on blocks for communication

Connected subgraph of the graph

Blogel [Yan et al., 2014]MapReduce

Need to save in HDFS intermediate results of each iteration – bothgood and badHadoop, Haloop [Bu et al., 2012]

Modified MapReduceBased on Spark [Zaharia et al., 2010; Zaharia, 2016]

Keep intermediate states in memoryProvide fault-tolerance by keeping lineage

GraphX [Gonzalez et al., 2014]

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 25 / 59

Page 62: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Some Alternative Computational Models for OfflineAnalytics

Vertex-centric (Scatter-Gather)Specify (a) computation at each vertex, and (b) communication withneighbour verticesSynchronous – Pregel [Malewicz et al., 2010], GiraphAsynchronous – GraphLab [Low et al., 2012]

Block-centricSimilar to vertex-centric but on blocks for communication

Connected subgraph of the graph

Blogel [Yan et al., 2014]

MapReduceNeed to save in HDFS intermediate results of each iteration – bothgood and badHadoop, Haloop [Bu et al., 2012]

Modified MapReduceBased on Spark [Zaharia et al., 2010; Zaharia, 2016]

Keep intermediate states in memoryProvide fault-tolerance by keeping lineage

GraphX [Gonzalez et al., 2014]

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 25 / 59

Page 63: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Some Alternative Computational Models for OfflineAnalytics

Vertex-centric (Scatter-Gather)Specify (a) computation at each vertex, and (b) communication withneighbour verticesSynchronous – Pregel [Malewicz et al., 2010], GiraphAsynchronous – GraphLab [Low et al., 2012]

Block-centricSimilar to vertex-centric but on blocks for communication

Connected subgraph of the graph

Blogel [Yan et al., 2014]MapReduce

Need to save in HDFS intermediate results of each iteration – bothgood and badHadoop, Haloop [Bu et al., 2012]

Modified MapReduceBased on Spark [Zaharia et al., 2010; Zaharia, 2016]

Keep intermediate states in memoryProvide fault-tolerance by keeping lineage

GraphX [Gonzalez et al., 2014]

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 25 / 59

Page 64: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Some Alternative Computational Models for OfflineAnalytics

Vertex-centric (Scatter-Gather)Specify (a) computation at each vertex, and (b) communication withneighbour verticesSynchronous – Pregel [Malewicz et al., 2010], GiraphAsynchronous – GraphLab [Low et al., 2012]

Block-centricSimilar to vertex-centric but on blocks for communication

Connected subgraph of the graph

Blogel [Yan et al., 2014]MapReduce

Need to save in HDFS intermediate results of each iteration – bothgood and badHadoop, Haloop [Bu et al., 2012]

Modified MapReduceBased on Spark [Zaharia et al., 2010; Zaharia, 2016]

Keep intermediate states in memoryProvide fault-tolerance by keeping lineage

GraphX [Gonzalez et al., 2014]

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 25 / 59

Page 65: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 26 / 59

Page 66: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Vertex-Centric Computation

“Think like a vertex”

vertex_scatter(vertex v)

Push local computation toneighbours on the out-boundedges

vertex_gather(vertex v)

Gather local computation fromneighbours on the in-bound edges

Continue until all vertices areinactive

Vertex state machine

?

Active Inactive

Vote halt

Message received

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 27 / 59

Page 67: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Vertex-Centric Computation

“Think like a vertex”

vertex_scatter(vertex v)

Push local computation toneighbours on the out-boundedges

vertex_gather(vertex v)

Gather local computation fromneighbours on the in-bound edges

Continue until all vertices areinactive

Vertex state machine

?

Active Inactive

Vote halt

Message received

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 27 / 59

Page 68: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Synchronous Vertex-Centric Computation

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

CommunicationBarrier

Each machine performsvertex-centric computationon its graph partition

CommunicationBarrier

Superstep 1 Superstep 2 Superstep 3

Computation

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 28 / 59

Page 69: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Synchronous Vertex-Centric Computation

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

CommunicationBarrier

Each machine performsvertex-centric computationon its graph partition

CommunicationBarrier

Superstep 1 Superstep 2 Superstep 3

Computation

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 28 / 59

Page 70: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Synchronous Vertex-Centric Computation

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

CommunicationBarrier

Each machine performsvertex-centric computationon its graph partition

CommunicationBarrier

Superstep 1 Superstep 2 Superstep 3

Computation

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 28 / 59

Page 71: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Synchronous Vertex-Centric Computation

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

CommunicationBarrier

Each machine performsvertex-centric computationon its graph partition

CommunicationBarrier

Superstep 1 Superstep 2 Superstep 3

Computation

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 28 / 59

Page 72: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Synchronous Vertex-Centric Computation

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

CommunicationBarrier

Each machine performsvertex-centric computationon its graph partition

CommunicationBarrier

Superstep 1 Superstep 2 Superstep 3

Computation

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 28 / 59

Page 73: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Asynchronous Vertex-Centric Computation

No communication barriers. 3

Uses the most recent vertex values. 3

Implemented via distributed locking

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

v0

v1 v2

v3 v4

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 29 / 59

Page 74: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Asynchronous Vertex-Centric Computation

No communication barriers. 3

Uses the most recent vertex values. 3

Implemented via distributed locking

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

v0

v1 v2

v3 v4

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 29 / 59

Page 75: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Asynchronous Vertex-Centric Computation

No communication barriers. 3

Uses the most recent vertex values. 3

Implemented via distributed locking

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

v0

v1 v2

v3 v4

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 29 / 59

Page 76: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Asynchronous Vertex-Centric Computation

No communication barriers. 3

Uses the most recent vertex values. 3

Implemented via distributed locking

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

v0

v1 v2

v3 v4

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 29 / 59

Page 77: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Asynchronous Vertex-Centric Computation

No communication barriers. 3

Uses the most recent vertex values. 3

Implemented via distributed locking

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

v0

v1 v2

v3 v4

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 29 / 59

Page 78: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Asynchronous Vertex-Centric Computation

No communication barriers. 3

Uses the most recent vertex values. 3

Implemented via distributed locking

Machine 1

Machine 2

Machine 3

Machine 1

Machine 2

Machine 3

v0

v1 v2

v3 v4

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 29 / 59

Page 79: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Summary of an Experiment [Han et al., 2014]

A large study comparing Giraph, GraphLab, GPS, Mizan.

1 Giraph scales better across graphs;GraphLab scales better across more machines.

2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching

3 Graph storage should be memory and mutation efficient.

4 Message processing optimizations are very important.

5 Workloads have different resource demands

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 30 / 59

Page 80: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Summary of an Experiment [Han et al., 2014]

A large study comparing Giraph, GraphLab, GPS, Mizan.

1 Giraph scales better across graphs;GraphLab scales better across more machines.

2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching

3 Graph storage should be memory and mutation efficient.

4 Message processing optimizations are very important.

5 Workloads have different resource demands

64 machines TW UK

Giraph (byte array) 5.8GB 7.0GBGraphLab (sync) 4.5GB 14GB

TW 16 machines 128 machines

Giraph (byte array) 8.5GB 5.8GBGraphLab (sync) 11GB 3.3GB

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 30 / 59

Page 81: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Summary of an Experiment [Han et al., 2014]

A large study comparing Giraph, GraphLab, GPS, Mizan.

1 Giraph scales better across graphs;GraphLab scales better across more machines.

2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching

3 Graph storage should be memory and mutation efficient.

4 Message processing optimizations are very important.

5 Workloads have different resource demands

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 30 / 59

Page 82: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Summary of an Experiment [Han et al., 2014]

A large study comparing Giraph, GraphLab, GPS, Mizan.

1 Giraph scales better across graphs;GraphLab scales better across more machines.

2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching

3 Graph storage should be memory and mutation efficient.

4 Message processing optimizations are very important.

5 Workloads have different resource demands

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 30 / 59

Page 83: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Summary of an Experiment [Han et al., 2014]

A large study comparing Giraph, GraphLab, GPS, Mizan.

1 Giraph scales better across graphs;GraphLab scales better across more machines.

2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching

3 Graph storage should be memory and mutation efficient.

4 Message processing optimizations are very important.

5 Workloads have different resource demands

No Mutations

Time Memory

Byte array 3 3Hash map 7 7

With Mutations (DMST)

Time Memory

Byte array 77 3Hash map 3 7

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 30 / 59

Page 84: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Summary of an Experiment [Han et al., 2014]

A large study comparing Giraph, GraphLab, GPS, Mizan.

1 Giraph scales better across graphs;GraphLab scales better across more machines.

2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching

3 Graph storage should be memory and mutation efficient.

4 Message processing optimizations are very important.

5 Workloads have different resource demands

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 30 / 59

Page 85: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Summary of an Experiment [Han et al., 2014]

A large study comparing Giraph, GraphLab, GPS, Mizan.1 Giraph scales better across graphs;

GraphLab scales better across more machines.2 Distributed locking for asynchronous execution is not scalable –

Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching

3 Graph storage should be memory and mutation efficient.4 Message processing optimizations are very important.5 Workloads have different resource demands

Algorithm CPU Memory Network

PageRank Medium Medium HighSSSP Low Low LowWCC Low Medium MediumDMST High High Medium

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 30 / 59

Page 86: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 31 / 59

Page 87: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Block-Centric Computation

Blogel [Yan et al., 2014]: “Think like a block”; also “think like agraph” [Tian et al., 2013]

Vertex-centric assumes all vertices communicate over the network;this is not efficient

Read-world graphs have skewed vertex degree distribution

Common in power-law graphsProblem: imbalanced communication workloads

Real-world graphs have large diameters

Common in road networks, web graphs, terrain meshesProblem: one superstep per hop ⇒ too many supersteps

Real-world graphs have high average vertex degree

Common in social networks, mobile communication networksProblem: heavy average communication workloads

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 32 / 59

Page 88: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Blogel Principles

Exploit the partitioning of the graph

Message exchanges only among blocks

Block: a connected subgraph of the graph

Within a block, run a serial in-memory algorithm; no need to follow aBSP model

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 33 / 59

Page 89: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Benefits of Block-Centric Computation

High-degree vertices inside a block send no messages

Fewer number of supersteps

Fewer number of blocks than vertices

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 34 / 59

Page 90: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example: Weakly Connected Component

Algorithm exchanges vertex id’swith neighbours

id(vi )← min{vi , vj , . . . , vk}where vj , . . . , vk are neighboursof vi

Vertex-centric requires everyvertex sends to its neighboursuntil every vertex is reached

Block-centric needs twoiterations:

1 All vertices in partition Aexchange ids; X and Y sendids to neighbours in partitionB

2 All vertices in partition Bexchange ids

A B

0

X

Y

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 35 / 59

Page 91: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Block Construction

The partitioning algorithm needs to maximize number of vertices thathave all their edges in the same partition

Hash partitioning is not suitable because many vertices will probablyhave at least one cut-edge

URL partitioner

For web graphs: based on domain names of web page nodes

2D partitioner

For spatial networks: based on coordinates of node

Graph Voronoi diagram partitioner

For general graphs

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 36 / 59

Page 92: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 37 / 59

Page 93: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

MapReduce Basics [Li et al., 2014]

For data analysis of very large data sets

Highly dynamic, irregular, schemaless, etc.SQL too heavy

“Embarrassingly parallel problems”

New, simple parallel programming modelData structured as (key, value) pairs

E.g. (doc-id, content), (word, count), etc.

Functional programming style with two functions to be given:

Map(k1,v1) → list(k2,v2)

Reduce(k2, list (v2)) → list(v3)

Implemented on a distributed file system (e.g., Google File System)on very large clusters

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 38 / 59

Page 94: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

MapReduce Processing

...Inp

ut

dat

ase

t

Map

Map

Map

Map

(k1, v)

(k2, v)(k2, v)

(k2, v)

(k1, v)

(k1, v)

(k2, v)

Group by k

Group by k

(k1, (v , v , v))

(k1, (v , v , v , v)) Reduce

Reduce

Ou

tpu

td

ata

set

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 39 / 59

Page 95: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

MapReduce Architecture

Scheduler

Master

Input Module

Map Module

Combine Module

Partition Module

Map Process

Worker

Input Module

Map Module

Combine Module

Partition Module

Map Process

Worker

Input Module

Map Module

Combine Module

Partition Module

Map Process

Worker

Group Module

Reduce Module

Output Module

Reduce Process

Worker

Group Module

Reduce Module

Output Module

Reduce Process

Worker

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 40 / 59

Page 96: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Execution Flow with Architecture [Dean and Ghemawat, 2008]MapReduce: Simplified Data Processing on Large Clusters

7. When all map tasks and reduce tasks have been completed, the mas-ter wakes up the user program. At this point, the MapReduce callin the user program returns back to the user code.

After successful completion, the output of the mapreduce executionis available in the R output files (one per reduce task, with file namesspecified by the user). Typically, users do not need to combine these Routput files into one file; they often pass these files as input to anotherMapReduce call or use them from another distributed application thatis able to deal with input that is partitioned into multiple files.

3.2 Master Data StructuresThe master keeps several data structures. For each map task andreduce task, it stores the state (idle, in-progress, or completed) and theidentity of the worker machine (for nonidle tasks).

The master is the conduit through which the location of interme-diate file regions is propagated from map tasks to reduce tasks. There -fore, for each completed map task, the master stores the locations andsizes of the R intermediate file regions produced by the map task.Updates to this location and size information are received as map tasksare completed. The information is pushed incrementally to workersthat have in-progress reduce tasks.

3.3 Fault ToleranceSince the MapReduce library is designed to help process very largeamounts of data using hundreds or thousands of machines, the librarymust tolerate machine failures gracefully.

Handling Worker FailuresThe master pings every worker periodically. If no response is receivedfrom a worker in a certain amount of time, the master marks the workeras failed. Any map tasks completed by the worker are reset back to theirinitial idle state and therefore become eligible for scheduling on otherworkers. Similarly, any map task or reduce task in progress on a failedworker is also reset to idle and becomes eligible for rescheduling.

Completed map tasks are reexecuted on a failure because their out-put is stored on the local disk(s) of the failed machine and is thereforeinaccessible. Completed reduce tasks do not need to be reexecutedsince their output is stored in a global file system.

When a map task is executed first by worker A and then later exe-cuted by worker B (because A failed), all workers executing reducetasks are notified of the reexecution. Any reduce task that has notalready read the data from worker A will read the data from worker B.

MapReduce is resilient to large-scale worker failures. For example,during one MapReduce operation, network maintenance on a runningcluster was causing groups of 80 machines at a time to become unreach-able for several minutes. The MapReduce master simply re executed thework done by the unreachable worker machines and continued to makeforward progress, eventually completing the MapReduce operation.

Semantics in the Presence of FailuresWhen the user-supplied map and reduce operators are deterministicfunctions of their input values, our distributed implementation pro-duces the same output as would have been produced by a nonfaultingsequential execution of the entire program.

split 0

split 1

split 2

split 3

split 4

(1) fork

(3) read(4) local write

(1) fork(1) fork

(6) write

worker

worker

worker

Master

UserProgram

outputfile 0

outputfile 1

worker

worker

(2)assignmap

(2)assignreduce

(5) remote

(5) read

Inputfiles

Mapphasr

Intermediate files(on local disks)

Reducephase

Outputfiles

Fig. 1. Execution overview.

COMMUNICATIONS OF THE ACM January 2008/Vol. 51, No. 1 109

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 41 / 59

Page 97: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Hadoop

Most popular MapReduce implementation – developed by Yahoo!Two components

Processing engineHDFS: Hadoop Distributed Storage System – others possibleCan be deployed on the same machine or on different machines

ProcessesJob tracker: hosted on the master node and implements the scheduleTask tracker: hosted on the worker nodes and accepts tasks from job trackerand executes them

HDFSName node: stores how data are partitioned, monitors the status of datanodes, and data dictionaryData node: Stores and manages data chunks assigned to it

Task Tracker Job Tracker Task Tracker

Data Node Name Node Data Node

Worker 1 Name Node Worker n

MapReduce

HDFS

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 42 / 59

Page 98: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

HaLoop [Bu et al., 2012]

Overcome MapReduce shortcomings for iterative jobs

Having to save data in HDFS in between each iterationChecking the fixpoint requires a new job at each iteration

Scheduler change: assign to the same machine the map & reducetasks that occur in different iterations but access the same data

Cache invariant data

Cache reduce output to easily check for fixpoint

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 43 / 59

Page 99: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Outline

1 Introduction – Graph Types

2 Property Graph ProcessingClassificationOnline queryingOffline analytics

3 Graph Analytics Computational ModelsVertex-CentricBlock-CentricMapReduce-BasedModified MapReduce

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 44 / 59

Page 100: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Spark System

MapReduce does not perform well in iterative computations

Workflow model is acyclicHave to write to HDFS after each iteration and have to read fromHDFS at the beginning of next iteration

Spark objectives

Better support for iterative programsProvide a complete ecosystemSimilar abstraction (to MapReduce) for programmingMaintain MapReduce fault-tolerance and scalability

Fundamental concepts

RDD: Reliable Distributed DatasetsCaching of working setMaintaining lineage for fault-tolerance

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 45 / 59

Page 101: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Spark System

MapReduce does not perform well in iterative computations

Workflow model is acyclicHave to write to HDFS after each iteration and have to read fromHDFS at the beginning of next iteration

Spark objectives

Better support for iterative programsProvide a complete ecosystemSimilar abstraction (to MapReduce) for programmingMaintain MapReduce fault-tolerance and scalability

Fundamental concepts

RDD: Reliable Distributed DatasetsCaching of working setMaintaining lineage for fault-tolerance

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 45 / 59

Page 102: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Spark System

MapReduce does not perform well in iterative computations

Workflow model is acyclicHave to write to HDFS after each iteration and have to read fromHDFS at the beginning of next iteration

Spark objectives

Better support for iterative programsProvide a complete ecosystemSimilar abstraction (to MapReduce) for programmingMaintain MapReduce fault-tolerance and scalability

Fundamental concepts

RDD: Reliable Distributed DatasetsCaching of working setMaintaining lineage for fault-tolerance

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 45 / 59

Page 103: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Spark Ecosystem [Michiardi, 2015]

NativeSparkApps

SparkSQL

SparkStreaming

MLlib(machinelearning)

GraphX(graph

processing)

Apache Spark

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 46 / 59

Page 104: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Spark Programming Model [Zaharia et al., 2010, 2012]

HDFS

Create RDD

· · ·

RDD

Cache? CacheYes

TransformRDD?

No

Process

No

TransformYes

HDFS

Each transform generates anew RDD that may also becached or processed

Created from HDFS or parallelized arrays;Partitioned across worker machines;May be made persistent lazily;

Processing done on one of the RDDs;Done in parallel across workers;First processing on a RDD is from disk;Subsequent processing of the same RDD from cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 47 / 59

Page 105: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Spark Programming Model [Zaharia et al., 2010, 2012]

HDFS

Create RDD

· · ·

RDD

Cache? CacheYes

TransformRDD?

No

Process

No

TransformYes

HDFS

Each transform generates anew RDD that may also becached or processed

Created from HDFS or parallelized arrays;Partitioned across worker machines;May be made persistent lazily;

Processing done on one of the RDDs;Done in parallel across workers;First processing on a RDD is from disk;Subsequent processing of the same RDD from cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 47 / 59

Page 106: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Spark Programming Model [Zaharia et al., 2010, 2012]

HDFS

Create RDD

· · ·

RDD

Cache? CacheYes

TransformRDD?

No

Process

No

TransformYes

HDFS

Each transform generates anew RDD that may also becached or processed

Created from HDFS or parallelized arrays;Partitioned across worker machines;May be made persistent lazily;

Processing done on one of the RDDs;Done in parallel across workers;First processing on a RDD is from disk;Subsequent processing of the same RDD from cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 47 / 59

Page 107: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Spark Programming Model [Zaharia et al., 2010, 2012]

HDFS

Create RDD

· · ·

RDD

Cache? CacheYes

TransformRDD?

No

Process

No

TransformYes

HDFS

Each transform generates anew RDD that may also becached or processed

Created from HDFS or parallelized arrays;Partitioned across worker machines;May be made persistent lazily;

Processing done on one of the RDDs;Done in parallel across workers;First processing on a RDD is from disk;Subsequent processing of the same RDD from cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 47 / 59

Page 108: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patterns

lines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

TasksResults

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 109: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patternslines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

TasksResults

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 110: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patternslines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

TasksResults

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 111: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patternslines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

TasksResults

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 112: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patternslines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

TasksResults

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 113: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patternslines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

TasksResults

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 114: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patternslines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

Tasks

Results

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 115: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patternslines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

TasksResults

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 116: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patternslines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

TasksResults

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 117: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Example – Log Mining [Zaharia et al., 2010, 2012]

Load log messages from a file system, create a new file by filtering theerror messages, read this file into memory, then interactively search forvarious patternslines = spark.textFile(hdfs://...)

CreateRDD

errors = lines.filter( .startsWith(ERROR))

Transform RDD

messages = errors.map( .split(‘\t ’)(2))

Another transform

cachedMsgs = messages.cache()

Cache results

cachedMsgs.filter( .contains(foo)).count

Action

cachedMsgs.filter( .contains(bar)).count

Another Action

accesses cache

Driver

WorkerWorkerWorker

Block 1 Block 2 Block 3

TasksResults

Cache Cache Cache

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 48 / 59

Page 118: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDD and Processing

HDFS

lines = spark.textFile(hdfs://...)

linesError, msg1

Warn, msg2

Error, msg1

Info, msg8

Warn, msg2

Info, msg8

Error, msg3

Info, msg5

Info, msg5

Error, msg4

Warn, msg9

Error, msg1

errors

errors = lines.filter( .startsWith(ERROR))

Error, msg1

Error, msg1

Error, msg3 Error, msg4

Error, msg1

messages

messages = errors.map .split(‘\t ’)(2)

msg1

msg1

msg3 msg4

msg1

Th

ese

are

no

tye

tg

ener

ated

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 49 / 59

Page 119: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

RDD and Processing

lineserrors

messagesmsg1

msg1

msg3 msg4

msg1

lines

messages.filter( .contains(foo)).count

errors

messagesmsg1

msg1

msg3 msg4

msg1

Now

the

RD

Ds

are

mat

eria

lized

;

Co

mm

and

no

tye

tex

ecu

ted

Driver

messages.filter( .contains(foo)).count

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 49 / 59

Page 120: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX [Gonzalez et al., 2014]

Built on top of Spark

Objective is to combine data analytics with graph processing

Unify computation on tables and graphs

Carefully convert graph to tabular representation

Native GraphX API or can accommodate vertex-centric computation

NativeSparkApps

SparkSQL

SparkStreaming

MLlib(machinelearning)

GraphX(graph

processing)

Apache Spark

Vertex-centric API

AppApp

App App

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 50 / 59

Page 121: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Representation of Graphs as Tables

A

B

C

D

E

F

G

H

I

J

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 51 / 59

Page 122: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Representation of Graphs as Tables

Partition 1

Partition 2

A

B

C

D

E

F

G

H

I

J

Edge-disjointpartitioning

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 51 / 59

Page 123: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Representation of Graphs as Tables

Partition 1

Partition 2

Mac

hin

e1

Mac

hin

e2

Vertex Table

(RDD)v-prop:vertex prop.

A

B

C

D

E

F

G

H

I

J

Edge-disjointpartitioning

A v-prop

B v-prop

...

I v-prop

D v-prop

E v-prop

F v-prop

J v-prop

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 51 / 59

Page 124: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Representation of Graphs as Tables

Partition 1

Partition 2

Mac

hin

e1

Mac

hin

e2

Vertex Table

(RDD)v-prop:vertex prop.

Edge Table

(RDD)e-prop:edge prop.

A

B

C

D

E

F

G

H

I

J

Edge-disjointpartitioning

A v-prop

B v-prop

...

I v-prop

D v-prop

E v-prop

F v-prop

J v-prop

A e-prop B

A e-prop C

...

F e-prop G

A e-prop D

A e-prop E...

E e-prop F

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 51 / 59

Page 125: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Representation of Graphs as Tables

Partition 1

Partition 2

Mac

hin

e1

Mac

hin

e2

Vertex Table

(RDD)v-prop:vertex prop.

Edge Table

(RDD)e-prop:edge prop.

A

B

C

D

E

F

G

H

I

J

Edge-disjointpartitioning

A v-prop

B v-prop

...

I v-prop

D v-prop

E v-prop

F v-prop

J v-prop

A e-prop B

A e-prop C

...

F e-prop G

A e-prop D

A e-prop E...

E e-prop FJoining vertices

and edgesMove vertices to edges

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 51 / 59

Page 126: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Representation of Graphs as Tables

Partition 1

Partition 2

Mac

hin

e1

Mac

hin

e2

Vertex Table

(RDD)v-prop:vertex prop.

Edge Table

(RDD)e-prop:edge prop.

RoutingTable

(RDD)

A

B

C

D

E

F

G

H

I

J

Edge-disjointpartitioning

A v-prop

B v-prop

...

I v-prop

D v-prop

E v-prop

F v-prop

J v-prop

A e-prop B

A e-prop C

...

F e-prop G

A e-prop D

A e-prop E...

E e-prop F

A 1 2

B 1

...

I 1

F 1 2

D 2

E 2

J 2

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 51 / 59

Page 127: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Computation Model

Mac

hin

e1

Mac

hin

e2

Vertex Table Edge Table

A v-prop

B v-prop

...

I v-prop

D v-prop

E v-prop

F v-prop

J v-prop

A e-prop B

A e-prop C

...

F e-prop G

A e-prop D

A e-prop E...

E e-prop F

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 52 / 59

Page 128: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Computation Model

Mac

hin

e1

Mac

hin

e2

Vertex Table Edge Table

A v-prop

B v-prop

...

I v-prop

D v-prop

E v-prop

F v-prop

J v-prop

A e-prop B

A e-prop C

...

F e-prop G

A e-prop D

A e-prop E...

E e-prop F

First Phase: JoinVertex table on Edge table

Triples View

A v-prop e-prop B v-prop

A v-prop e-prop C v-prop

C v-prop e-prop G v-prop

...

E v-prop e-prop G v-prop

J v-prop e-prop G v-prop

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 52 / 59

Page 129: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Computation Model

Mac

hin

e1

Mac

hin

e2

Vertex Table Edge Table

A v-prop

B v-prop

...

I v-prop

D v-prop

E v-prop

F v-prop

J v-prop

A e-prop B

A e-prop C

...

F e-prop G

A e-prop D

A e-prop E...

E e-prop FTriples View

A v-prop e-prop B v-prop

A v-prop e-prop C v-prop

C v-prop e-prop G v-prop

...

E v-prop e-prop G v-prop

J v-prop e-prop G v-prop

Second Phase: Compute neighbourhoodGroup-by aggregate

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 52 / 59

Page 130: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Operators

Table transform operators – inherited from Sparkmap(func) Return a new RDD formed by passing each element

of the source through a function func

filter(func) Return a new RDD formed by selecting thoseelements of the source on which func returns true

flatMap(func) Similar to map, but each input item can be mappedto 0 or more output items

mapPartitions(func) Similar to map, but runs separately on each partition(block) of the RDD, so func must be of type Iterator

sample(repl , fraction,seed)

Sample a fraction fraction of the data, with orwithout replacement (set repl accordingly), using agiven random number generator seed

union(otherDataset)intersection()

Return a new RDD containing the union/intersectionof the elements in the source RDD and the argument

groupByKey() Operates on a RDD of (K, V) pairs, returns a RDDof (K, Iterable<V>) pairs

reduceByKey(func, . . .) Operates on a RDD of (K, V) pairs, returns a RDDof (K, V) pairs where the values for each key areaggregated using the given reduce function func

Graph operatorsGraph(vertex coll ,edge coll)

Logically binds together a pair of vertex and edgeproperty collections into a property graph; verifiesthat each vertex occurs only once and edges connectexisting vertices

triplets(vertex coll ,vertex coll , edge coll)

Returns the triplets view of the graph

mrTriplets(map,reduce) MapReduce triplets - encodes the two-stage processof join to create triplets and group by

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 53 / 59

Page 131: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

GraphX: Operators

Table transform operators – inherited from Spark

Graph operatorsGraph(vertex coll ,edge coll)

Logically binds together a pair of vertex and edgeproperty collections into a property graph; verifiesthat each vertex occurs only once and edges connectexisting vertices

triplets(vertex coll ,vertex coll , edge coll)

Returns the triplets view of the graph

mrTriplets(map,reduce) MapReduce triplets - encodes the two-stage processof join to create triplets and group by

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 53 / 59

Page 132: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Acknowledgements

This presentation draws upon collaborative research and discussions withthe following colleagues

Khaled Ammar, U. Waterloo Khuzaima Daudjee, U. Waterloo

Young Han, U. Waterloo

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 54 / 59

Page 133: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

Thank you!

Research supported by

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 55 / 59

Page 134: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 56 / 59

Page 135: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

References I

Ammar, K. and Ozsu, M. T. (2016). Approaches to graph processing – an overview. Inpreparation.

Bu, Y., Howe, B., Balazinska, M., and Ernst, M. D. (2012). The HaLoop approach tolarge-scale iterative data analysis. VLDB J., 21(2):169–190.

Dean, J. and Ghemawat, S. (2008). Mapreduce: Simplified data processing on largeclusters. Commun. ACM, 51(1):107–113.

Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., and Stoica, I.(2014). GraphX: graph processing in a distributed dataflow framework. In Proc. 11thUSENIX Symp. on Operating System Design and Implementation, pages 599–613.

Han, M., Daudjee, K., Ammar, K., Ozsu, M. T., Wang, X., and Jin, T. (2014). Anexperimental comparison of Pregel-like graph processing systems. Proc. VLDBEndowment, 7(12):1047–1058.

Li, F., Ooi, B. C., Ozsu, M. T., and Wu, S. (2014). Distributed data management usingMapReduce. ACM Comput. Surv., 46(3):Article No. 31.

Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. M.(2012). Distributed graphlab: A framework for machine learning in the cloud. Proc.VLDB Endowment, 5(8):716–727.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 57 / 59

Page 136: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

References II

Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., andCzajkowski, G. (2010). Pregel: a system for large-scale graph processing. In Proc.ACM SIGMOD Int. Conf. on Management of Data, pages 135–146.

Michiardi, P. (2015). Introduction to spark internals. Slideshare. Available from:http://www.slideshare.net/michiard/introduction-to-spark-internals?

qid=511145e7-79d7-41d8-a133-9e705d4933c3&v=qf1&b=&from_search=11 [Lastretrieved: 9 July 2015].

Tian, Y., Balmin, A., Corsten, S. A., Tatikonda, S., and McPherson, J. (2013). From“think like a vertex” to “think like a graph”. Proc. VLDB Endowment, 7(3):193–204.

Yan, D., Cheng, J., Lu, Y., and Ng, W. (2014). Blogel: A block-centric framework fordistributed computation on real-world graphs. Proc. VLDB Endowment,7(14):1981–1992.

Zaharia, M. (2016). An Architecture for Fast and General Data Processing on LargeClusters. ACM Books. Forthcoming.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J.,Shenker, S., and Stoica, I. (2012). Resilient distributed datasets: A fault-tolerantabstraction for in-memory cluster computing. In Proc. 9th USENIX Symp. onNetworked Systems Design & Implementation, pages 2–2.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 58 / 59

Page 137: An Introduction to Graph Analytics Platforms · NDL subjects ndlna my Experi-ment Italian Museums medu-cator MARC Codes List Man-chester Reading Lists Lotico ... Bio-graphie data

References III

Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. (2010). Spark:Cluster computing with working sets. In Proc. 2nd USENIX Workshop on Hot Topicsin Cloud Computing, pages 10–10.

© M. Tamer Ozsu Dagstuhl Spring School (2016/03/07–09) 59 / 59