dato vs graphx

Post on 14-Apr-2017


DATO VS. SPARK GRAPHX

KEIRA ZHOU
OCT, 2015

Details: https://github.com/keiraqz/dato-vs-graphx

SETTINGS

• 1 master node and 3 worker nodes on AWS
• m4.large instances, each with 2 cores and 8 GB of RAM

DATO

• A graph-based, asynchronous, high-performance, distributed computation framework written in C++
• 30-day free trial, then a service fee
• Install GraphLab Create on the local machine and Dato Distributed on the cluster

SPARK GRAPHX

• Comes with Spark

import org.apache.spark._
import org.apache.spark.graphx._

EXPERIMENTS

• Graph algorithms
    • Triangle counting
    • PageRank
    • Connected components
• Datasets: Stanford Large Network Dataset Collection (SNAP)
    • Facebook: Nodes: 4039 | Edges: 88234 | Number of triangles: 1612010
    • YouTube: Nodes: 1134890 | Edges: 2987624 | Number of triangles: 3056386
    • Pokec: Nodes: 1632803 | Edges: 30622564 | Number of triangles: 32557458
    • LiveJournal: Nodes: 3997962 | Edges: 34681189 | Number of triangles: 177820130
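The three algorithms listed above are all available as built-in GraphX operations. The following is a minimal sketch of how they might be invoked on a SNAP edge-list file, not the code used in the experiments (see the GitHub repository for that). The object name `GraphAlgorithms`, the `Summary` case class, and the application name are hypothetical; the input path is taken from the first command-line argument. Older Spark releases expect edges in canonical orientation and a partitioned graph for triangle counting, so the sketch does both.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}

object GraphAlgorithms {
  // Hypothetical result holder, introduced only for this illustration.
  case class Summary(triangles: Long, components: Long, maxPageRank: Double)

  def summarize(graph: Graph[Int, Int]): Summary = {
    // Triangle counting in older Spark releases expects a partitioned graph.
    val g = graph.partitionBy(PartitionStrategy.RandomVertexCut).cache()

    // triangleCount() reports, per vertex, the triangles passing through it;
    // every triangle is therefore counted three times in the sum.
    val triangles =
      g.triangleCount().vertices.map(_._2.toLong).reduce(_ + _) / 3

    // connectedComponents() labels each vertex with the smallest vertex id
    // in its component, so distinct labels = number of components.
    val components =
      g.connectedComponents().vertices.map(_._2).distinct().count()

    // pageRank(tol) iterates until ranks change by less than tol.
    val maxRank =
      g.pageRank(0.001).vertices.map(_._2).reduce((a, b) => math.max(a, b))

    Summary(triangles, components, maxRank)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-algorithms-sketch"))
    // Third argument: canonicalOrientation = true (srcId < dstId).
    val graph = GraphLoader.edgeListFile(sc, args(0), true)
    println(summarize(graph))
    sc.stop()
  }
}
```

Dividing the per-vertex sum by three is what makes the total comparable to the "Number of triangles" figures published by SNAP.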

EXPERIMENTS (CONT’D)

• Default settings
    • Dato: GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY = 4G
    • GraphX: start with executor memory = 1G, later changed to 2G
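For the GraphX side, executor memory is controlled by Spark's `spark.executor.memory` property. A minimal sketch of setting it programmatically is shown here; the object name `ExecutorMemoryConfig` and the application name are hypothetical, and the experiments may equally have passed the value on the command line (for example via `spark-submit --executor-memory 2g`).

```scala
import org.apache.spark.SparkConf

object ExecutorMemoryConfig {
  // Builds a SparkConf with the given executor memory, e.g. "1g" or "2g".
  // The application name is a placeholder chosen for this sketch.
  def buildConf(executorMemory: String): SparkConf =
    new SparkConf()
      .setAppName("dato-vs-graphx-sketch")
      .set("spark.executor.memory", executorMemory)
}
```

Calling `ExecutorMemoryConfig.buildConf("2g")` corresponds to the second setting tried in the experiments.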

RESULTS

• Triangle counting: both Dato and GraphX (when it finishes the job) return the correct answer as listed on the SNAP website.
• For the Pokec and LiveJournal datasets, GraphX has trouble finishing the computation.

TAKE-AWAY FOR GRAPHX

• What I observed was that certain stages within the job kept failing
• Each task in a Spark stage operates on one partition of the RDD at a time (and loads the data in that partition into memory)
• Potential solutions
    • Increase the executor memory
    • Increase the number of partitions of the RDD so that each task processes a smaller amount of data
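The second suggestion, raising the partition count, can be sketched as follows. This is an illustration under stated assumptions rather than the configuration used in the experiments: the object `Repartitioning`, the helper `partitionsFor`, and the idea of sizing partitions by a target number of edges are all hypothetical. The fourth positional argument of `GraphLoader.edgeListFile` is the number of edge partitions (named `minEdgePartitions` in early releases and `numEdgePartitions` in later ones).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}

object Repartitioning {
  // Hypothetical helper: ceil(edgeCount / targetEdgesPerPartition), at least 1.
  def partitionsFor(edgeCount: Long, targetEdgesPerPartition: Long): Int = {
    require(targetEdgesPerPartition > 0, "target must be positive")
    val needed = (edgeCount + targetEdgesPerPartition - 1) / targetEdgesPerPartition
    math.max(1L, needed).toInt
  }

  // Loads an edge list with an explicit number of edge partitions so that
  // each task holds a smaller slice of the graph in memory.
  def loadWithPartitions(sc: SparkContext, path: String, numPartitions: Int): Graph[Int, Int] =
    GraphLoader
      .edgeListFile(sc, path, true, numPartitions) // true = canonical orientation
      .partitionBy(PartitionStrategy.EdgePartition2D)
}
```

For the Pokec graph (30622564 edges), an assumed target of one million edges per partition gives `partitionsFor(30622564L, 1000000L)`, that is, 31 partitions. The right target depends on executor memory and would need to be found by experiment.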

RESULTS (CONT’D)

• PageRank: the convergence threshold for PageRank is set to 0.001

RESULTS (CONT’D)

• Connected components

CONCLUSIONS

• Both tools were set up quickly, without fine-tuning runtime parameters, but:
    • Dato has a clear advantage over GraphX in execution time when processing large-scale graph data
    • However, GraphX is free, while Dato charges a service fee after the free trial
• The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API
• Further experiments could compare overall performance on a task that combines graph algorithms with other data-parallel computation

MORE DETAILS

• https://github.com/keiraqz/dato-vs-graphx

REFERENCES

• Dato: https://dato.com/
• Spark GraphX: https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html
• Stanford Large Network Dataset Collection (SNAP): https://snap.stanford.edu/data/
