big, fast graph analysis and data management for hadoop...nosql/ hdfs clientiniatepgxasa yarntask...

50

Upload: others

Post on 31-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan
Page 2: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Big,  Fast  Graph  Analysis  and  Data  Management  for  Hadoop  

Jay  Banerjee,    PhD,  Senior  Director,  Oracle  Spa@al  &  Graph  Hassan  Chafi,  PhD,  Senior  Research  Manager,  Oracle  Labs  Zhe  Wu,  PhD,  Architect,  Oracle  Spa@al  and  Graph  

October  2,  2014  

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Page 3: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Safe  Harbor  Statement  

The  following  is  intended  to  outline  our  general  product  direc@on.  It  is  intended  for  informa@on  purposes  only,  and  may  not  be  incorporated  into  any  contract.  It  is  not  a  commitment  to  deliver  any  material,  code,  or  func@onality,  and  should  not  be  relied  upon  in  making  purchasing  decisions.  The  development,  release,  and  @ming  of  any  features  or  func@onality  described  for  Oracle’s  products  remains  at  the  sole  discre@on  of  Oracle.  

Page 4: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Program  Agenda  

Introduc)on  to  Graphs  

Data  Management  for  Property  Graphs  in  Hadoop  

PGX  –  Parallel  In-­‐Memory  Graph  Analy)cs  

1  

2  

3  

Current  Direc)on  for  Graph  Research    

Page 5: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Program  Agenda  

Introduc)on  to  Graphs  

Data  Management  for  Property  Graphs  in  Hadoop  

PGX  –  Parallel  In-­‐Memory  Graph  Analy)cs  

1  

2  

3  

Page 6: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Introduc@on  to  Graphs  

Page 7: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

This  en@re  session  covers  the  current  research  and  development  of  a  property  graph  prototype  

Page 8: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Rela@onal  Tables  vs  Rela@onship  Graphs  

anI)neraryReserved  

aPassenger  

aFlightSegment  

aBookingAgent   anAirport  

aFlightSchedule  

aPayment  

aReferenceDate  

aFlightCost  

A  Graph  Representa)on  of  the  Airline  Reserva)on  Data  Model  

Page 9: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

When  to  Choose  Graph  Based  Data  Modeling    •  Flexible  schema  

– No  schema  or  incomplete  schema,  may  evolve  over  @me  – Types  of  rela@onships  may  be  heterogeneous  

• Heavily  rela@onship  centric    – linked  data  model    

• Analy@cs  on  the  en@re  data  set  or  a  subset  is  important  – Find  influencers  in  a  social  network  – Find  clusters  of  similar  individuals  

Page 10: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Major  Types  of  Graphs  in  Oracle  Spa@al  and  Graph  

• Network  Data  Model  graph  – Mainly  used  for  physical,  spa@al  networks  

– Water  and  gas  pipelines,  electricity  networks,  road  networks  

–  Telcos,  u@li@es,  transporta@on  

•  RDF  Seman@c  graph  –  Enterprise  class  RDF  Graph  Database  for  Linked  data;  Knowledge  management,  Intelligence  applica@ons  

–  Scales  to  petabytes  of  triples  –  by  exploi@ng  Exadata,  RAC,  SQL*Loader  ,  Parallelism,  Label  Security  

– W3C  standards:  RDFS,  OWL2  RL,  OWL2  EL,  SPARQL  1.1,  RDB2RDF,  RDFa,  SKOS  plus  User-­‐defined  rules  

–  Seman@c  reasoning,  unifying  metadata  from  many  sources  

Page 11: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Another  kind  of  Graph:  Property  Graph  

•  @nkerpop.com  provides  a  set  of  commonly  used  APIs  •  Primarily  used  for  traversing  enterprise  and  social  networks  

–  Simple  no@on  of  nodes  and  edges  with  proper@es  

•  Property  graph  model  differs  from  RDF  graph  model  –  No  industry  standards  for  data  representa@on  and  query  language  –  No  no@on  of  unique  URIs  (Universal  Resource  Iden@fiers)  –  No  rule-­‐based  seman@cs  (OWL,  user-­‐defined  rules)  –  Edges  may  have  proper@es  (RDF  edge  proper@es  managed  in  named  graphs)  

Page 12: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Property  Graph  Example  

(from  hjps://github.com/@nkerpop/gremlin/wiki/Defining-­‐a-­‐Property-­‐Graph)  

A  property  graph  has  these  elements:  •   a  set  of  ver@ces    

each  vertex  has  a  unique  iden@fier.  each  vertex  has  a  set  of  outgoing  edges.  each  vertex  has  a  set  of  incoming  edges.  each  vertex  has  a  collec@on  of  proper@es                                      defined  by  a  map  from  key  to  value.  

•   a  set  of  edges    each  edge  has  a  unique  iden@fier.  each  edge  has  an  outgoing  tail  vertex.  each  edge  has  an  incoming  head  vertex.  each  edge  has  a  label  that  denotes  the                                        type  of  rela@onship  between  its  two  ver@ces.  each  edge  has  a  collec@on  of  proper@es                                        defined  by  a  map  from  key  to  value.  

Page 13: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Value  of  Property  Graph  Analy@cs  •  Finding  paths  between  ver@ces  •  Iden@fying  primary  influencers  in  a  people  network  

•  Iden@fying  clusters  to  customize  campaigns  

• Predic@ng  trends,  customer  behavior  

• Managing  brand  reputa@on  

•  Sen@ment  analysis  

• Discovering  rela@onships  based  on  pajern  matching  

Page 14: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Prototype  Reference  Architecture  for  Property  Graph  on  Hadoop      

 Scalable  and  Persistent  Storage  Management  

Graph  Data  Access  API  

Graph  Analy)cs  Single-­‐Node  in-­‐Memory  

Parallel  In-­‐memory  Analy@c  Engine  

REST/Web  Service  Blueprints  &  Lucene  

Graph  Model  and  format  

RDF  (RDF/XML,  N-­‐Triples,  N-­‐Quads,  TriG,N3,JSON)  

PG  (GraphML,  GML,  Graph-­‐SON,  

Flat  Files)  

Python,  Perl,  PHP,  Ruby,  Javascript,  …

 

Java  APIs  

Java  APIs  

Property  Graph  formats  

GraphML,    GML,    

Graph-­‐SON,    Flat  Files  

Other  NoSQL  Apache  HBase  

Page 15: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Program  Agenda  

Introduc)on  to  Graphs  

Data  Management  for  Property  Graphs  in  Hadoop  

PGX  –  Parallel  In-­‐Memory  Graph  Analy)cs  

1  

2  

3  

Current  Direc)on  for  Graph  Research    

Page 16: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Research  into  Property  Graph  Data  Management  for  Hadoop  

Oracle  Property  Graph  Prototype  

Page 17: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Property  Graph  Data  Access  Layer  • Key  ajributes  of  the  data  access  layer  

– Persistent  stores  for  storing  graph  data  

– Efficient  Tinkerpop  Blueprints  API  implementa@on  

– Text  search  through  Apache  Lucene  integra@on  

– Easy-­‐to-­‐use  Java  APIs  for    •  Fast,  parallel  bulk  load  (import)  of  property  graph  into  the  database  •  Fast,  parallel  read  (export)  of  property  graph  from  the  database  •  Fast,  parallel  text  indexing  of  property  graph  data  (ver@ces  and  edges)  

– REST  Interface  

Page 18: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

• Oracle  NoSQL  Database  – Key  features  used  

•  Tables,  Parent/Child  table  •  Secondary  indices  •  Parallel  scanner  

• Apache  HBase  ⁻  Key  features  used  •  Tables  •  Parallel  scans  across  regions  •  Column  family  

Persistent  Stores  Used  in  Data  Access  Layer  

Page 19: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

• Oracle  NoSQL  Database  – Key  features  used  

•  Tables,  Parent/Child  table  •  Secondary  indices  •  Parallel  scanner  

• Apache  HBase  ⁻  Key  features  used  •  Tables  •  Parallel  scans  across/within  regions  •  Column  family  

Persistent  Stores  Used  in  the  Data  Access  Layer  

Page 20: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Efficient  Blueprints  API  Implementa@on  • OraclePropertyGraph  

– Leverage  na#ve  database    features  •  Use  column  family/map  to  aggregate  K/V  pairs  •  Use  parallel  table  scan,  secondary  indices    •  Use  parallel  region  scan  

– Maintain  a  small,  in-­‐memory          ver@ces/edges  cache  – Batch  up  opera@ons  performed              against  graph  elements  

Graph   KeyIndexableGraph  Transac@onalGraph   IndexableGraph  

OraclePropertyGraphBase  

oracle.pg.hbase.  OraclePropertyGraph  

oracle.pg.nosql.  OraclePropertyGraph  

Page 21: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Easy-­‐to-­‐Use  Java  APIs  • OraclePropertyGraph  

–  Setup  database  connec@on  informa@on  •  kconfig  =  new  KVStoreConfig("kvstore",  hhosts);    •  hconn  =  [email protected]@on(conf);  

–  Create  OraclePropertyGraph  instance  •  opg  =  OraclePropertyGraph.getInstance(…)  

–  Add/remove/edit  graph  data  •  opgdl  =  OraclePropertyGraphDataLoader.getInstance();  •  opgdl.loadData(opg,  vertexFile,  edgeFile,    32  /*    parallel  degree  */);  

–  Navigate  •  vertex.getEdges(direc@onOut,  “follows”)            vertex.outE.inV.name    

–  Using  PGX  analy@c  func@ons    •  analyst  =  opg.getInMemAnalyst();  

•  triangles  =  analyst.countTriangles().get();  

DB  Connec@on  info  

Create  OraclePropertyGraph  

Add/Remove/Edit  Graph  

Navigate/Search  

Perform  Analy@cs  

Page 22: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Text  Search  through  Apache  Lucene  

•  Integrate  with  Apache  Lucene  •  Support  manual  and  auto  indexing  of  Graph  elements    

•  Manual  index:    

•  oraclePropertyGraph.createIndex(“my_index",  Vertex.class);  

•  indexVer@ces  =  oraclePropertyGraph.getIndex(“my_index”  ,    Vertex.class);  

•  [email protected](“key”,  “value”,    myVertex);  

•  Auto  Index  •  oraclePropertyGraph.createKeyIndex(“name”,  Edge.class);  

•  oraclePropertyGraph.getEdges(“name”,  “*hello*world”);  

•   Scale  to  billions  of  graph  elements  

 Find  ver@ces  using        

                 syntax  like:        "*oracle*  or  *graph*”  

Page 23: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

REST  Interface  • Provides  an  implementa@on/client  agnos@c  way  – Access  graph  data  – Manipulate  graph  data  setenv      pg_graph      hjp://localhost:8182/graphs/oracle_user_graph  curl  …  -­‐X  POST  "${pg_graph}/ver@ces/101"  

curl  …  -­‐X  POST  "${pg_graph}/ver@ces/102“  

curl  …  -­‐X  POST          

                 "${pg_graph}/edges/200?_outV=101&  

                     _label=like&_inV=102&weight=1.2“  

curl  …  -­‐i  -­‐X  DELETE  "${pg_graph}/edges/200"  

Page 24: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Property  Graph  Serializa@on  Formats  

• GML  

• GraphML  

• GraphSON  

• Oracle-­‐defined  Property  Graph  Flat  Files  •  Vertex  file  •  Edge  file  •  Allow  mul@ple  data  types  to  be    

     associated  with  one  key  

•  Support  Date  with  Timezone  and  Serializable  objects  

Different  data  types  associated  with  the  same  

“likes”  rela@onship  

Page 25: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Example  Usage  Flow  with  Typical  Java  APIs  (NoSQL  Database)  –  Setup  database  connec@on  informa@on  

•  hhosts  =  new  String[1];  hhosts[0]  =  "127.0.0.1:5000";  •  kconfig  =  new  KVStoreConfig("kvstore",  hhosts);    

–  Create  OraclePropertyGraph  instance  •  opg  =  OraclePropertyGraph.getInstance(…)  

–  Add/remove/edit  graph  data  •  opgdl  =  OraclePropertyGraphDataLoader.getInstance();  •  opgdl.loadData(opg,  vertexFile,  edgeFile,    32  /*    parallel  degree  */);  •  opg.addEdge(null,  vertexA,  vertexB,  “friend”);  

–  Navigate  •  opg.getVer@ces();  •  vertex.getEdges(direc@onOut,  “follows”)            vertex.outE.inV.name    

–  Using  PGX  analy@c  func@ons    •  analyst  =  opg.getInMemAnalyst();  

•  triangles  =  analyst.countTriangles().get();  

Page 26: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Example  Usage  Flow  with  Typical  Java  APIs  (on  Apache  HBase)  –  Setup  database  connec@on  informa@on  

•  conf  =  [email protected]();    •  conf.set("hbase.zookeeper.quorum",  “host1,host2,host3");    conf.set("hbase.zookeeper.property.clientPort","2181");  

–  Create  OraclePropertyGraph  instance  •  opg  =  OraclePropertyGraph.getInstance(…)  

–  Add/remove/edit  graph  data  •  opgdl  =  OraclePropertyGraphDataLoader.getInstance();  •  opgdl.loadData(opg,  vertexFile,  edgeFile,    32  /*    parallel  degree  */);  •  opg.addEdge(null,  vertexA,  vertexB,  “friend”);  

–  Navigate  •  opg.getVer@ces();  •  vertex.getEdges(direc@onOut,  “follows”)            vertex.outE.inV.name    

–  Using  PGX  analy@c  func@ons    •  analyst  =  opg.getInMemAnalyst();  

•  triangles  =  analyst.countTriangles().get();  

Page 27: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Performance  Preview  • Benchmark  property  graph  used  

– Social  network  data  (1.4  billion  edges)  – Data  loading  completed  in    

•  4.8  hr  with  48  threads  using  6-­‐Node  HBase  cluster  on  BDA  – No  secondary  indices  built  

•  15.8  hrs  with  48  threads  using  8-­‐Node  NoSQL  EE  cluster  – Mul@ple  secondary  indices  created  

– Full  graph  scan  •  1865  seconds  to  retrieve  1.4  Billion  edges  on  HBase  

– 3  splits  per  region  and  48  threads  •  6841  seconds  to  retrieve  1.4  Billion  edges  on  NoSQL  

– maxConcurrentRequests  32  used  in  TableIteratorOp@ons  

Page 28: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Performance  Inves@ga@ons  

0  

2000  

4000  

6000  

8000  

10000  

12000  

14000  

16000  

1   6   11   16   21   26   31  

Time  (secs)  

DOP  

LiveJ  graph  on  HBase  

Load  

0  

100  

200  

300  

400  

500  

600  

700  

1   6   11   16   21   26   31  

Time  (secs)  

DOP  

LiveJ  graph  on  HBase  

getEdges()  

Page 29: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Performance  Inves@ga@ons  

0  

10000  

20000  

30000  

40000  

50000  

60000  

70000  

80000  

1   6   11   16   21   26   31  

Time  (secs)  

DOP  

LiveJ  graph  on  NoSQL  

Load  

0  

200  

400  

600  

800  

1000  

1200  

1400  

1600  

1   6   11   16   21   26   31  Time  (secs)  

DOP  

LiveJ  graph  on  NoSQL  

getEdges()  

Page 30: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Program  Agenda  

Introduc)on  to  Graphs  

Data  Management  for  Property  Graphs  in  Hadoop  

PGX  –  Parallel  In-­‐Memory  Graph  Analy)cs  

1  

2  

3  

Current  Direc)on  for  Graph  Research    

Page 31: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

PGX  –  Parallel  In-­‐Memory  Graph  Analy)cs  

Page 32: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Graph  Analysis  • Data  analysis  that  considers  rela@onships  between  data  en@@es  

– By  modelling  your  data  as  a  graph  

•  Examples  

Purchase  Record  

customer items

Product  Recommenda@on   Influencer  Iden@fica@on  

Communica@on  Stream  (e.g.  tweets)  

 Graph  Pajern  Matching  Community  Detec@on  

Page 33: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

PGX:  Graph  Analysis  Frameworks    •  Large  graph  analysis  is  @me-­‐consuming  because  …  

–  The  computa@on  typically  involves  touching  most  nodes  and  edges  in  the  graph  

–  The  data-­‐access  pajern  is,  inherently,  non-­‐sequen@al  (random)  •  Commodity  HW  and  Rela@onal  SW  is  op@mized  for  sequen@al  access  

•  PGX  is  an  in-­‐memory,  parallel  framework  for  fast  graph  analy@cs  

•  PGX  exploits  the  architecture  of  modern  servers  –  The  computa@on  is  parallelized  using  mul@ple  CPU  cores  

–  The  non-­‐sequen@al  data-­‐access  is  mi@gated  with  large  DRAMs    

Page 34: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

•  Consequently,  PGX  (single  machine)  is  much  faster  than  exis@ng  – Distributed  execu@on  or  – Out-­‐of-­‐core  execu@on    

PGX  Performance  (In-­‐memory)  

0.01  0.1  1  

10  100  

LiveJ   Twijer  Run)

me  in  Secon

ds  

PageRank    

PGX  (SPARC)   PGX  (X86)   GraphLab  (X86  x  16)   SQL  (X86)  

1  

10  

100  

1000  

10000  

100000  

LiveJ   Web-­‐UK  Run)

me  in  Secon

ds  

Triangle  Coun)ng    

Execu@on  @me  of  two  popular  graph  analysis  algorithms    (log  scale,  lower  is  bejer)  

• PGX  (In-­‐memory)  :  x86  and  SPARC  (T5)  • GraphLab  (state-­‐of-­‐art  distributed  framework)  • SQL:  disk-­‐based    

3x  –  10x  faster  than  16-­‐machine  distributed  execu@on  

Two  orders-­‐of-­‐magnitude  faster  than  disk-­‐based  execu@on  

Page 35: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Oracle  Property  Graph  Protoype  and  PGX  •  PGX  can  load  up  a  graph  (through  data  access  layer)  from:  –  Property  Graph  or  RDF  Seman@c  Graph  from  Oracle  Database  

–  Stored  in  Oracle  Database,  Apache  HBase  or  Oracle  NoSQL  Database  

•  PGX  and  the  Data  Access  Layer  –  PGX  handles  analy@c  workloads  while  the  data  access  layer  handles  transac@onal  workloads  

•  PGX  supports  automa@c  refresh  from  an  Oracle  Database  via  delta-­‐update    

Oracle  Property  Graph  or  RDF  (RDB,  Hbase  or  NoSQL)  

PGX  

Analy@c  Request  Analy@c  Request  

Analy@c  Request  Analy@c  Request  Analy@c  Request  Analy@c  Request  

Trans-­‐ac@onal  Request  

Page 36: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Snapshot  Consistency    •  PGX  in-­‐memory  graph  provides  a  consistent  snapshot  to  a  client  session  

•  DB  changes  are  tracked  by  PGX;  new  instances  are  created  quickly  for  new  clients  •  Exis@ng  clients  can  also  explicitly  refresh  the  graph  instance  •  It  becomes  trivial  to  support  many  clients,  consistently,  by  adding  more  servers  

Oracle  property  graph  prototype  

Graph    Instance  

client  

Delta

PGX  …

Graph    Instance  ‘  

New  client  

refresh

PGX  

Graph    Instance  

Graph    Instance  ‘  

Client  Client  Client  

Page 37: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

•  PGX  provides  a  rich  set  of  built-­‐in  (parallel)  graph  algorithms  

•  as    well  as  parallel  graph  muta@on  opera@ons  

Built-­‐in  Algorithms  and  Graph  Muta@on  

Detec)ng  Components  and  Communi)es  

Tarjan’s,  Kosaraju’s,    Weakly  Connected  Components,  Label  Propaga@on  (w/  variants),  Soman  and  Narang’s    Spacifica@on  

Ranking  and  Walking  

Pagerank,  Personalized  Pagerank,  Betwenness  Centrality  (w/  variants),  Closeness  Centrality,  Degree  Centrality,  Eigenvector  Centrality,  HITS,  Random  walking  and  sampling  (w/  variants)  

Evalua)ng  Community  Structures  

Conductance,  Modularity  Clustering  Coefficient  (Triangle  Coun@ng)  Adamic-­‐Adar  

Path-­‐Finding    

Hop-­‐Distance  (BFS)  Dijkstra’s,    Bi-­‐direc@onal  Dijkstra’s    Bellman-­‐Ford’s  

Link  Predic)on   SALSA    (Twijer’s  Who-­‐to-­‐follow)  

Other  Classics   Vertex  Cover  Minimum  Spanning-­‐Tree(Prim’s)  

a  

d  

b   e  

g  

c   i  

f  

h  

The  original  graph  a  

d  

b   e  

g  

c   i  

f  

h  

Create  Undirected  Graph  

Simplify  Graph  

a  

d  

b   e  

g  

c   i  

f  

h  

Le�  Set:  “a,b,e”  a   d  

b  

e  

g  

c  

i  

Create  Bipar)te  Graph  

g  e   b   d   i   a   f   c   h  

Sort-­‐By-­‐Degree  (Renumbering)  

Filtered  Subgraph  

d  

b  g  

i  

e  

Page 38: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

•  PGX  allows  users  to  implement  their  own  graph  algorithms  with  an  intui@ve  DSL  –  Our  compiler  transforms  the  given  program  for  PGX  execu@on  

– While  applying  op@miza@ons  and  paralleliza@on  

•  DSL  gives  you  performance  benefits  over  using  low-­‐level  APIs  –  No  repeated  interac@on  between  client  and  sever  

–  Direct  access  to  internal  data  representa@on  –  Automated  paralleliza@on  from  the  compiler  

Domain-­‐Specific  Language  (DSL)  for  Custom  Algorithms  

foreach(t: G.nodes) { val = 0.85 + d * sum(w:t.inNbrs){w.PR/w.degree()}; ... }

DSL  Program  

Compiler  PGX  

Graph    Sever  

Graph  Analy@c  Client  

Low-­‐level  API  

getNode()

getEdge().getProperty()

.getNbr().

…  

The  compiler  parallelizes  graph  algorithm  if  possible    

Page 39: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Execu@on  from  Hadoop  Environments  •  PGX  is  integrated  with  Hadoop-­‐based  clusters  

–  Via  YARN  

•  Execu)on  modes:  

PGX  

RDF  /  HBASE  /  NoSQL  /  HDFS    

Client  ini@ate  PGX  as  a  YARN  task  

Interac@ve    Execu@on  

pgx> :loadGraph mygraph.json …

Hadoop  Cluster  

Client  

Client  controls  PGX  via  an  interac@ve  shell      

... pgx> :pagerank mygraph 0.85 …

To  load  the  Graph  and  run  the  analysis  

Shared  Server  Execu@on  

PGX  Sever  

RDF  /  HBASE  /  NoSQL  /  HDFS    

PGX  can  be  configured  to  run  as  a    service,  with    certain  graphs  pre-­‐loaded  And  shared  by  mul@ple  

clients  

Batch  Mode  

:loadGraph … … :pagerank …

Client  can    submit  a  PGX  script  as  a  batch  job  

PGX  

Dry  Run  (Local  Execu@on)  

…   Client  can  run  PGX  locally  with  small  data  set  Data  File  

Page 40: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Distributed  Graph  Analysis  •  But  what  about  graphs  that  do  not  fit  in  a  single  machine?  

   Answer:  PGX.DIST,  a  distributed  graph  processing  framework  –  The  graph  is  spread  across  mul@ple  machines  in  a  cluster  

–  The  computa@on  is  distributed  

–  The  communica@on  is  op@mized  for  bandwidth  

•  Early  prototype  available  for  customer  POCs  

Interconnection Network

……  

Computa@on  Machines  

In-memory Graph (distributed)

Page 41: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

PGX  Distributed  Performance  •  Our  prototype  is  already  faster  and  more  scalable  than  the  state-­‐of-­‐the-­‐art  graph  framework  

0  

5  

10  

15  

20  

25  

PGX.DIST   GraphLab   PGX.DIST   GraphLab   PGX.DIST   GraphLab  

Pagerank   SSSP   WCC  

Rela)ve  Perform

ance  

 (Highe

r  is  Be^

er)  

Xeon  E5-­‐2660  (2.20  GHz,  2  socket  x    8  core  x  2  threads),      2  -­‐-­‐  16  machine,  connected  by  Mellanox  Connect-­‐IB  Graph:  Twijer  2010  (1.4  billion  edges)    

2   4   8   16   2   4   8   16   2   4   8   16   2   4   8   16   2   4   8   16   2   4   8   16  

PGX.DIST  performs  much    bejer  when  using  same  #  of  machines    

It  also  scales  bejer  with  increasing  #  of  machines  

Page 42: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Compiler  Support  and  Integra@on  (on-­‐going)  • We  are  adding  the  compiler  support  for  DSL  

– Goal:  the  same  DSL  program  is  compiled  into  PGX.DIST  

• As  a  result,  the  user  can  easily  migrate  execu@on  environments  – Depending  on  the  size  of  their  data      

Compiler  

DSL  Program  

Laptop  

Cluster  (distributed)  

Server  (in-­‐memory)  

Page 43: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Summary  • Property  graph  prototype  reference  architecture  on  Hadoop  • Persistent  and  scalable  property  graph  data  management  

• Parallel  Graph  Analy@cs  (PGX)  

• Prototype  provides  a  solid  founda)on  for  large  scale  graph  applica@ons  by  combining  – Fast,  parallel,  extensible  in-­‐memory  graph  analy@cs  – Fast,  scalable,  persistent  graph  data  management    

Page 44: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

If  you  like  what  you  heard,  sign  up  for  beta  tes)ng  

Page 45: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Page 46: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan
Page 47: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Appendix  

Page 48: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

•  Flexible  Key-­‐Value  Data  Model  

•  ACID  transac@ons  •  Horizontally  Scalable  •  Highly  Available      •  Elas@c  Configura@on  •  Simple  administra@on  

•  Intelligent  Driver  •  Commercial  grade  so�ware  and  support  

Features  

Oracle  NoSQL  Database  Enterprise  Edi@on    Scalable,  Highly  Available,  Key-­‐Value  Database  

Applica@on  

Storage Nodes Datacenter  B  

Storage Nodes Datacenter  A  

Applica@on  

NoSQL  DB  Driver  

Applica@on  

NoSQL  DB  Driver  

Applica@on  

Java  SE  6  (JDK  1.6.0  u25)+;    Solaris  or  Linux  

Page 49: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

Comparing  Graph  Databases  • Modeling  capabili@es  

•  APIs  and  analy@cal  func@ons  

•  Performance  and  scalability  •  Enterprise  features  •  Interoperability  with  other  data  types  •  Tools  integra@on  

Page 50: Big, Fast Graph Analysis and Data Management for Hadoop...NoSQL/ HDFS ClientiniatePGXasa YARNtask InteracveExecuon pgx> :loadGraph mygraph.json … Hadoop Cluster% Client ClientcontrolsPGXviaan

Copyright  ©  2014,  Oracle  and/or  its  affiliates.  All  rights  reserved.    |  

 Scalable  and  Persistent  Database  

Graph  Data  Access  API  

REST/Web  Service  

Java  APIs  

NoSQL  Database  Apache  HBase  

Parallel  Graph  Analy)cs  (PGX)