Tree and Graph Processing On Hadoop
DESCRIPTION
Tree and Graph Processing On Hadoop. Ted Malaska. Schedule: Intro; Overview of Hadoop and Eco-System; Summarize Tree Rooting; MR Overview/Implementation Options; HBase Overview/Implementation Options; Giraph Overview/Implementation Options; Spark Overview/Implementation Options; Summary.
TRANSCRIPT
![Page 1: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/1.jpg)
1
Tree and Graph Processing On Hadoop
Ted Malaska
![Page 2: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/2.jpg)
2
Schedule
• Intro
• Overview of Hadoop and Eco-System
• Summarize Tree Rooting
• MR Overview/Implementation Options
• HBase Overview/Implementation Options
• Giraph Overview/Implementation Options
• Spark Overview/Implementation Options
• Summary
• Questions
![Page 3: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/3.jpg)
3
Intro
• Hi there
![Page 4: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/4.jpg)
4
Overview of Hadoop and Eco-System
[Figure: Hadoop ecosystem diagram.
Categories: Batch, Ingestion, Streaming, LFPRTQ, Machine Learning, NoSql, Search.
Components: MapReduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, Storm, Spark Streaming, Spark, Impala, Mahout, Oryx, R Python Streaming, SAS, HBase, Accumulo, NFS, Search Solr.
Underlying layers: HDFS; Security and Access Controls; Auditing and Monitoring.]
![Page 5: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/5.jpg)
5
In Scope for Tonight
[Figure: the Hadoop ecosystem diagram from the previous slide, repeated. Which components are marked as in scope is not recoverable from the extracted text.]
![Page 6: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/6.jpg)
6
Summarize Tree Rooting
• Basic Tree
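[Figure: a tree whose vertices are labeled with their depth: one vertex at depth 0, two at depth 1, four at depth 2, and three at depth 3. Callouts: True Root, Leaves, Branches, Vertex, Edge, Depth.]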
![Page 7: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/7.jpg)
7
Summarize Tree Rooting
• More Complex Tree
[Figure: a depth-labeled tree (depths 0 to 3) that now contains a Circular Link and a vertex with Multiple Parents.]
![Page 8: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/8.jpg)
8
Summarize Tree Rooting
• Merging Trees
• Borderline True Graph Problem
[Figure: several depth-labeled trees merged together, with more than one depth-0 vertex. Callouts: Multi-Rooted Vertex, True Root, True Root.]
![Page 9: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/9.jpg)
9
Summarize Tree Rooting
• Know your data
![Page 10: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/10.jpg)
10
Basic Storage Format
• <NodeID>|<EdgeID>
• Example
  • 101
  • 101|201
  • 101|202
  • 201
  • 202|301
  • 301
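The storage format above can be read into an in-memory structure with a few lines of Python. This is only a sketch: it assumes that `<EdgeID>` is the ID of the vertex the edge points to (which is how the later rooting example treats it), and the function name `parse_records` is made up for illustration.

```python
from collections import defaultdict


def parse_records(lines):
    """Parse '<NodeID>' and '<NodeID>|<EdgeID>' records.

    Assumption: the EdgeID is the ID of the vertex the edge points to.
    Returns (set of vertex IDs, dict of vertex ID -> list of edge targets).
    """
    vertices = set()
    edges = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        node_id, _, edge_id = line.partition("|")
        vertices.add(node_id)
        if edge_id:
            edges[node_id].append(edge_id)
            vertices.add(edge_id)
    return vertices, dict(edges)


# The example records from the slide.
records = ["101", "101|201", "101|202", "201", "202|301", "301"]
vertices, edges = parse_records(records)
```

Here `edges` maps 101 to its two targets (201 and 202) and 202 to 301, while 201 and 301 have no outgoing edges.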
![Page 11: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/11.jpg)
11
Preprocessing
• Trimming Data
  • Nodes and edges have data
  • Data has weight
  • Normally linkage information is under 10% of true data size
• Organize Data by Partitioning
![Page 12: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/12.jpg)
12
Basic Solution
• Step 1: Identify roots
  • Echo to all edges
  • Vertices that receive no echoes are roots
  • Root the root (its root is itself)
• Step 2: Walk the tree
  • Echo from the last newly rooted vertex to all of its edges
  • If a vertex is not already rooted, root it
• Input: 101, 101|201, 101|202, 201, 202|301, 301
• After Step 1: 101|R:101, 101|201|R:101, 101|202|R:101, 201|R:Null, 202|301|R:Null, 301|R:Null
• After the first walk: 101|R:101, 101|201|R:101, 101|202|R:101, 201|R:101, 202|301|R:101, 301|R:Null
• After the second walk: 101|R:101, 101|201|R:101, 101|202|R:101, 201|R:101, 202|301|R:101, 301|R:101
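The two steps can be sketched on a single machine in plain Python. This is a minimal illustration under stated assumptions, not the implementation from the talk: it assumes the `<EdgeID>` in each record is the ID of the child vertex, and the name `root_tree` is hypothetical.

```python
def root_tree(records):
    """Root every vertex reachable from a root; unreachable ones stay None."""
    vertices, children = set(), {}
    for record in records:
        node_id, _, edge_id = record.partition("|")
        vertices.add(node_id)
        if edge_id:
            vertices.add(edge_id)
            children.setdefault(node_id, []).append(edge_id)

    # Step 1: echo to all edges; vertices that receive no echo are roots,
    # and each root is rooted to itself.
    echoed = {child for kids in children.values() for child in kids}
    root_of = {v: v for v in vertices if v not in echoed}

    # Step 2: echo from the newly rooted vertices; root anything not yet rooted.
    frontier = list(root_of)
    while frontier:
        next_frontier = []
        for vertex in frontier:
            for child in children.get(vertex, []):
                if child not in root_of:  # guards against circular links
                    root_of[child] = root_of[vertex]
                    next_frontier.append(child)
        frontier = next_frontier

    return {v: root_of.get(v) for v in vertices}


records = ["101", "101|201", "101|202", "201", "202|301", "301"]
rooted = root_tree(records)
```

For the slide's example every vertex ends up rooted at 101, after two passes of the walk. A vertex with multiple parents simply keeps whichever root reaches it first in this sketch.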
![Page 13: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/13.jpg)
13
Map Reduce
• Massively parallel processing on Hadoop
• Based on Google's 2004 MapReduce white paper
• Able to process petabytes of data
![Page 14: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/14.jpg)
14
Map Reduce
[Figure: MapReduce data flow. Three input Data Blocks feed three Mappers, followed by Sort & Shuffle, then two tasks in the reduce position (labeled "Mapper" on the slide) writing two output Data Blocks.]
![Page 15: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/15.jpg)
15
Map Reduce
• Self joins
• Always emit two outputs:
  • Newly Rooted
  • Still Un-Rooted
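[Figure: chained MapReduce jobs.
MR Stage 0 (Root Identifying): All Data in; Un-Rooted and Newly Rooted out.
MR Stage 1 (Rooting): Un-Rooted and Newly Rooted in; Un-Rooted, Newly Rooted, and Old Rooted 0 out.
MR Stage 2 (Rooting): Un-Rooted and Newly Rooted in; Un-Rooted, Newly Rooted, Old Rooted 0, and Old Rooted 1 out.]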
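The staged approach can be imitated in plain Python to show the data flow of one rooting job. This is a simulation under assumptions, not Hadoop code: the map, sort & shuffle, and reduce phases are ordinary functions, the lookup of a parent's root is done in memory rather than through a real shuffle-side self join, and the names `rooting_stage` and `root_all` are hypothetical.

```python
from collections import defaultdict


def rooting_stage(edges, newly_rooted, unrooted):
    """Simulate one rooting job; returns (newly_rooted, still_unrooted)."""
    # Map: edges leaving a vertex rooted in the previous stage echo its root
    # to the child; every un-rooted vertex emits a marker for itself.
    mapped = [(child, ("ROOT", newly_rooted[parent]))
              for parent, child in edges if parent in newly_rooted]
    mapped += [(v, ("UNROOTED", None)) for v in unrooted]

    # Sort & shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in sorted(mapped, key=lambda kv: kv[0]):
        groups[key].append(value)

    # Reduce: write each un-rooted vertex to one of the two outputs.
    next_rooted, still_unrooted = {}, set()
    for key, values in groups.items():
        if ("UNROOTED", None) not in values:
            continue  # rooted in an earlier stage (multiple parents, cycles)
        roots = [root for tag, root in values if tag == "ROOT"]
        if roots:
            next_rooted[key] = roots[0]  # first root wins in this sketch
        else:
            still_unrooted.add(key)
    return next_rooted, still_unrooted


def root_all(vertices, edges):
    # Stage 0 (root identifying): a vertex that is nobody's child is a root.
    children = {child for _, child in edges}
    newly_rooted = {v: v for v in vertices if v not in children}
    unrooted = set(vertices) - set(newly_rooted)
    old_rooted = dict(newly_rooted)
    # Stages 1..n (rooting): repeat until a stage roots nothing new.
    while newly_rooted:
        newly_rooted, unrooted = rooting_stage(edges, newly_rooted, unrooted)
        old_rooted.update(newly_rooted)
    return old_rooted, unrooted


vertices = {"101", "201", "202", "301"}
edges = [("101", "201"), ("101", "202"), ("202", "301")]
rooted, left_over = root_all(vertices, edges)
```

Each pass through the loop corresponds to one full job over the remaining data, which is why the number of stages grows with the depth of the tree.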
![Page 16: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/16.jpg)
16
Map Reduce
• Great for large batch operations
• No memory limit
• Not good at iterations
![Page 17: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/17.jpg)
17
HBase
• Largest and most used NoSql implementation in the world
• Based on Google's 2006 BigTable white paper
• Imagine it as a giant HashMap with keys and values
• Handles 100k operations a second, even on a small 10-node cluster
![Page 18: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/18.jpg)
18
HBase Getting
[Figure: a Client and the HBase Master above three HBase Region Servers, each with its own Block Cache.]
![Page 19: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/19.jpg)
19
HBase Putting
[Figure: a Client and the HBase Master above three HBase Region Servers, each with a WAL and a MemStore; three HFiles sit below.]
![Page 20: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/20.jpg)
20
HBase
• Good for graph traversing
• Bad for large batch processing
• Scan rate about 8x slower than HDFS
• Good for the end of a long tail
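Traversal by key lookup can be illustrated without a cluster by treating the table as the "giant HashMap" described earlier. The sketch below is not the HBase client API: `FakeTable` is a hypothetical in-memory stand-in, and storing a single `parent` column per vertex is an assumption made to keep the example short (a vertex with multiple parents would need more than one column).

```python
class FakeTable:
    """Hypothetical in-memory stand-in for a table: row key -> {column: value}."""

    def __init__(self):
        self.rows = {}
        self.gets = 0  # count point lookups to show the access pattern

    def put(self, row_key, columns):
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        self.gets += 1
        return self.rows.get(row_key, {})


def find_root(table, vertex_id, max_hops=1000):
    """Walk parent links with one get per hop; None on a circular link."""
    seen = set()
    current = vertex_id
    while current not in seen and len(seen) < max_hops:
        seen.add(current)
        parent = table.get(current).get("parent")
        if parent is None:
            return current  # no parent recorded: this vertex is the root
        current = parent
    return None


table = FakeTable()
table.put("201", {"parent": "101"})
table.put("202", {"parent": "101"})
table.put("301", {"parent": "202"})
root = find_root(table, "301")
```

Rooting vertex 301 takes three point lookups (301, 202, 101) and touches nothing else, which is the pattern a key-value store handles well; rooting every vertex this way would instead repeat many of the same lookups.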
![Page 21: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/21.jpg)
21
Giraph
• System built for large batch graph processing
• Based on the 2009 Pregel white paper
• Hardened by LinkedIn and Facebook
• Reported to handle up to a trillion edges
![Page 22: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/22.jpg)
22
Giraph Loading
[Figure: three Data Blocks being loaded by four Workers, coordinated by a Master.]
![Page 23: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/23.jpg)
23
Giraph (Bulk Synchronous Parallel)
[Figure: Bulk Synchronous Parallel supersteps across three Workers: Local vertex computing, Communication, Barrier synchronization.]
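The superstep cycle can be mimicked in plain Python to show how tree rooting fits the vertex-centric model. This is a single-process sketch, not Giraph code: the function name `run_supersteps`, the "ECHO" message, and the choice to spend superstep 0 on echoing and superstep 1 on root detection are all assumptions made for illustration.

```python
def run_supersteps(vertices, out_edges):
    """Root vertices with a simulated local-compute / communicate / barrier loop."""
    root_of = dict.fromkeys(vertices)      # vertex value; None means un-rooted
    inbox = {v: [] for v in vertices}
    superstep = 0
    while True:
        outbox = {v: [] for v in vertices}
        sent = False
        # Local vertex computing: each vertex sees only its own messages.
        for v in vertices:
            messages = inbox[v]
            to_send = None
            if superstep == 0:
                to_send = "ECHO"            # echo to all edges
            elif superstep == 1:
                if not messages:            # no echo received: a root
                    root_of[v] = v
                    to_send = v
            elif root_of[v] is None and messages:
                root_of[v] = messages[0]    # adopt the first root received
                to_send = messages[0]
            # Communication: queue messages for the targets of out-edges.
            if to_send is not None:
                for target in out_edges.get(v, ()):
                    outbox[target].append(to_send)
                    sent = True
        # Barrier synchronization: messages become visible next superstep.
        inbox = outbox
        superstep += 1
        if not sent and superstep > 1:
            break                           # nothing in flight: halt
    return root_of


vertices = ["101", "201", "202", "301"]
out_edges = {"101": ["201", "202"], "202": ["301"]}
root_of = run_supersteps(vertices, out_edges)
```

Unlike the chained MapReduce jobs, the graph stays loaded between supersteps, so each extra level of depth costs only another round of messages.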
![Page 24: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/24.jpg)
24
Giraph
• Most mature bulk graph processing out there
• Of all the solutions, the most graph focused
![Page 25: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/25.jpg)
25
Spark
• Around 2011, some at Berkeley asked if we could do better than MR
• Take advantage of lower-cost memory
• Building on everything before
![Page 26: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/26.jpg)
26
Spark
[Figure: Spark architecture. RDD Objects, for example rdd1.join(rdd2).groupBy(…).filter(…), feed the DAG Scheduler ("like a queue planner"), then the Task Scheduler and the Cluster Manager, which dispatch to Spark Workers. Each worker runs task Threads and a Block Manager.]
![Page 27: Tree and Graph Processing On Hadoop](https://reader036.vdocument.in/reader036/viewer/2022062501/5681642f550346895dd5f91b/html5/thumbnails/27.jpg)
27
Spark
• Implementations
  • Onion MR approach with basic Spark
  • Pregel approach with Bagel or GraphX
    • Bagel is a façade over generic Spark functionality
    • GraphX is an effort to extend Spark
• Less code
• Learning curve
• It's raw and will be changing a lot in the next year