Transcript
Page 1: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Real-Time stream computation on graphs using Storm, Neo4j and Python

Sonal Raj

http://www.sonalraj.com

Presented at PyCon India 2013

Bangalore, India

Copyrights © 2013, Sonal Raj, http://www.sonalraj.com

Page 2: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Introduction

• With data multiplying each day, storage and knowledge extraction are major concerns.

• Social Data Analysis, Business Intelligence

• Constraints of Real-Time and Fault-Tolerant Processing

Page 3: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

In this Talk

• A look at Storm as a distributed computation framework

• Neo4J as a NoSQL graph database

• Some Cool Pictures

• What are we trying to achieve ?

Page 4: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Disclaimer !

• This talk presents an overview of Storm and Neo4J, with fewer of the dirty details.

• I’m going to go pretty fast, so please hang on.

Page 5: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Part 1

Storm – The Hadoop of Real Time

Page 6: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Don’t we have Hadoop ?

Page 7: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Storm v/s Hadoop

STORM

HADOOP

• Distributed Processing

• Fault Tolerance

Page 8: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Storm v/s Hadoop

HADOOP

• Large but Finite Jobs

• Processes a Lot of Data at Once

• High Latency

Page 9: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Storm v/s Hadoop

HADOOP

• Large but Finite Jobs

• Processes a Lot of Data at Once

• High Latency

STORM

• Infinite Computations called Topologies

• Processes Infinite Streams of data one tuple at a time

• Low Latency

Page 10: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

So, what Storm gives us . .

• Real-Time Computations

• Guaranteed Data Processing

• Horizontal Scalability and Fault-Tolerance

• No Intermediate Message Brokers

• Higher Abstraction than Message Passing, so it makes sense !

Page 11: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A little deeper . . Concepts

Streams

Tuple Tuple Tuple Tuple Tuple

An unbounded sequence of Tuples

Page 12: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A little deeper . . Concepts

Streams

Tuple Tuple Tuple Tuple Tuple

An unbounded sequence of Tuples

So, what kind of a tuple is this ?
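
In Storm, a tuple is a named list of values, and a stream is an unbounded sequence of such tuples. The following is a minimal plain-Python sketch of that idea, not Storm or Petrel code; the field names and the `activity_stream` generator are hypothetical and chosen only to resemble the social-network demo later in the talk.

```python
from itertools import count, islice

# Hypothetical field names; a real spout declares its own output fields.
FIELDS = ("user_id", "action", "friend_id")

def activity_stream():
    """Yield an unbounded sequence of tuples, one value per declared field."""
    actions = ("signup", "add_friend", "unfriend")
    for n in count():
        yield (n % 5, actions[n % 3], (n + 1) % 5)

# A stream never ends, so a consumer takes tuples one at a time.
first_three = list(islice(activity_stream(), 3))

# Pairing values with field names shows why a tuple is a *named* list.
named = [dict(zip(FIELDS, t)) for t in first_three]
```

The generator never terminates on its own, which is the point: downstream code has to process each tuple as it arrives rather than wait for the whole data set.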

Page 13: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A little deeper . . Concepts

Spouts

A source of Streams

Page 14: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A little deeper . . Concepts

Spouts

A source of Streams

But, what is the source FOR the spouts ?

Page 15: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A little deeper . . Concepts

Bolts

Computational units processing input streams and producing new streams

Page 16: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A little deeper . . Concepts

Bolts

Computational units processing input streams and producing new streams

Just 1 stream ?

Page 17: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A little deeper . . Concepts

Topologies

A network of spouts and bolts
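
To make the spout, bolt and topology vocabulary concrete, here is a small plain-Python sketch that chains generators. It only imitates the data flow; it does not use Storm's or Petrel's API, and every function name in it is hypothetical. The spout is bounded only so that the example terminates.

```python
def number_spout(limit):
    """Spout: a source of tuples (bounded here only so the sketch terminates)."""
    for n in range(limit):
        yield (n,)

def square_bolt(stream):
    """Bolt: consumes one stream and emits a new stream."""
    for (n,) in stream:
        yield (n, n * n)

def even_filter_bolt(stream):
    """Bolt: emits only the tuples whose squared value is even."""
    for n, sq in stream:
        if sq % 2 == 0:
            yield (n, sq)

def run_topology(limit):
    """Topology: number_spout -> square_bolt -> even_filter_bolt."""
    return list(even_filter_bolt(square_bolt(number_spout(limit))))

result = run_topology(5)
```

In a real cluster each of these stages would run as separate tasks on different machines, with Storm moving the tuples between them.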

Page 18: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Is that it . . . ?

Tasks and Parallelism

A spout or bolt can run as multiple tasks across the cluster

Page 19: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Mr. Tuple: “Oh shoot, where do I go now?”

Page 20: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Groupings . . To the rescue of Mr. Tuple !

• Shuffle Grouping # picks a random task

• Fields Grouping # mod hashing on a subset of tuple fields

• All Grouping # sends to all tasks

• Global Grouping # picks the task with the lowest task id
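
The four groupings above can be sketched as functions that pick target task ids for a tuple. This is a simplified illustration in plain Python, not Storm's implementation: the function names are hypothetical, `zlib.crc32` stands in for whatever hash Storm uses internally, and the shuffle version simply picks at random as the slide describes.

```python
import random
import zlib

def shuffle_grouping(tasks, tup, rng=random):
    """Pick one task at random."""
    return [rng.choice(tasks)]

def fields_grouping(tasks, tup, field_indexes):
    """Hash the selected fields and take the result modulo the task count,
    so tuples with equal field values always reach the same task."""
    key = "|".join(str(tup[i]) for i in field_indexes)
    return [tasks[zlib.crc32(key.encode("utf-8")) % len(tasks)]]

def all_grouping(tasks, tup):
    """Replicate the tuple to every task."""
    return list(tasks)

def global_grouping(tasks, tup):
    """Send every tuple to the task with the lowest id."""
    return [min(tasks)]

tasks = [3, 4, 5, 6]                      # hypothetical task ids of one bolt
t1 = ("alice", "add_friend", "bob")
t2 = ("alice", "unfriend", "carol")
same_task = fields_grouping(tasks, t1, [0]) == fields_grouping(tasks, t2, [0])
```

Fields grouping is the one to reach for when a bolt keeps per-key state, for example a per-user counter, because every tuple for a given user lands on the same task.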

Page 21: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A Storm Cluster

[Diagram: one NIMBUS node, three ZOOKEEPER nodes, five SUPERVISOR nodes]

Page 22: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A Storm Cluster

[Diagram: one NIMBUS node, three ZOOKEEPER nodes, five SUPERVISOR nodes]

If this were Hadoop

Job Tracker (in place of NIMBUS), Task Tracker (in place of SUPERVISOR)

Page 23: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A Storm Cluster

[Diagram: one NIMBUS node, three ZOOKEEPER nodes, five SUPERVISOR nodes]

But it’s NOT Hadoop !

Co-ordinates Everything

Page 24: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Salient Features . .

• Storm 0.7+ supports Transactional Topologies

Processes tuples in small batches

If a failure occurs during commit, both the batch and the commit are retried

• Storm guarantees message processing using acknowledgements

• Petrel by AirSage is a Python wrapper for Storm; you can write and submit topologies in Python.
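
As background to the acknowledgement bullet above: Storm's documentation describes an XOR-based scheme in which an acker keeps one running value per spout tuple and treats the tuple as fully processed when that value returns to zero. The sketch below is a heavily simplified plain-Python illustration of that idea; the `Acker` class, its method names and the fixed ids are hypothetical, and real Storm uses random 64-bit ids and folds child ids in at ack time.

```python
class Acker:
    """Tracks, per spout tuple, the XOR of all tuple ids still in flight."""

    def __init__(self):
        self.pending = {}  # root id -> XOR of outstanding tuple ids

    def emit(self, root_id, tuple_id):
        """Register a tuple created while processing the given root tuple."""
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def ack(self, root_id, tuple_id):
        """Acknowledge a tuple; return True once the whole tree is done."""
        self.pending[root_id] ^= tuple_id
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True
        return False

acker = Acker()
acker.emit("msg-1", 0x1F)               # tuple emitted by the spout
acker.emit("msg-1", 0xA2)               # child tuple emitted by a bolt
first_ack = acker.ack("msg-1", 0x1F)    # one tuple still outstanding
second_ack = acker.ack("msg-1", 0xA2)   # XOR is back to zero
```

Because each id is XORed in once on emit and once on ack, the running value cancels to zero only when every tuple in the tree has been acknowledged, and the bookkeeping stays constant-size per spout tuple.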

Page 25: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Part 2

Neo4J – “Get Graphed”

Page 26: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

This is how Graph Data was represented in RDBMS.

Page 27: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

ENTER, NOSQL DATABASES

Page 28: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Types of NOSQL Databases

[Chart: Graph databases, Document databases, Column-Family and Key-Value Stores plotted on axes of Data Size and Data Complexity]

Page 29: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Why NOSQL Databases

• Easily horizontally scalable

• Dynamic schemas; handle unstructured data really well.

• Excel in speed and volume

• Trade off consistency for efficiency (except in graph databases . . . we’ll see why)

• Pleasure to code

• Free to use any query language (even SQL !)

• Downtime? What Downtime ?

Page 30: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

The Property Graph Model of Graph Databases

• Core Abstractions

Nodes

Relationships between Nodes

Properties of both

• Traversal Framework

High-performance queries on connected datasets

• Bindings

REST, Gremlin, etc.
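
The core abstractions of the property graph model can be sketched in a few lines of plain Python. This is an in-memory toy for illustration only, not Neo4J's or Py2Neo's API; the class names, the `FRIEND` relationship type and the `since` property are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: int
    end: int
    rel_type: str
    properties: dict = field(default_factory=dict)

class PropertyGraph:
    def __init__(self):
        self.nodes = {}          # node id -> Node
        self.relationships = []  # directed, typed edges with properties

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        return self.nodes[node_id]

    def relate(self, start, rel_type, end, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def neighbours(self, node_id, rel_type):
        """One traversal step: follow outgoing relationships of a given type."""
        return [self.nodes[r.end] for r in self.relationships
                if r.start == node_id and r.rel_type == rel_type]

graph = PropertyGraph()
graph.add_node(1, name="A")
graph.add_node(2, name="B")
graph.relate(1, "FRIEND", 2, since=2013)
friend_names = [n.properties["name"] for n in graph.neighbours(1, "FRIEND")]
```

Both nodes and relationships carry their own properties, which is what distinguishes this model from a plain adjacency list.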

Page 31: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Neo4J

• Fully ACID with rollback support (unbelievable!)

• Schema-less, with efficient storage of semi-structured data

• Fast deep traversals instead of slow SQL queries that span many table joins

• Whiteboard friendly

• Very natural to express graph-related problems with traversals (recommendation engine, shortest path, etc.)

Page 32: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Neo4J Pythonized !

• Py2Neo is an excellent binding for Neo4J

• Accesses Neo4J using its RESTful API

• Still under development . . Features like labels are yet to be included !

Page 33: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

So, Will Relational databases be Extinct ?

OOPS!

Page 34: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Categories of Graphical Data

• Social Networks

• Citations

• Product Co-Purchasing

• Internet peer-to-peer

• Road Network and Map Data

• Web Graphs

Excellent source of sample graphical data: http://snap.Stanford.edu/data/

Page 35: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Part 3

Get your hands dirty !

Page 36: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A demo . .

• Sample social network data set

• Data includes sign-up info, adding friends, unfriending, etc. for a month’s activity

• Neo4J

Store and update the social data

• Storm

Calculate “friendship-index”

Page 37: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A demo . .

• “friendship-index”

n = the number of people through whom person “A” is connected to person “B”

Gives an idea of how close two people are !

Useful while searching for friends on social networks (something like the friends-of-friends concept in Facebook’s Graph Search)
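
One way to read the “friendship-index” defined above is as the number of intermediate people on a shortest friendship path, which a breadth-first search can compute. The sketch below assumes that reading; it runs on a plain dictionary rather than on Neo4J, and the function name, the sample data, and the conventions for direct friends (0) and unconnected people (`None`) are assumptions, not taken from the talk's code.

```python
from collections import deque

def friendship_index(friends, a, b):
    """Number of people between a and b on a shortest friend path.

    Returns 0 for direct friends (and for a == b), None if unconnected.
    """
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])  # (person, hops from a)
    while queue:
        person, hops = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == b:
                return hops  # hops equals the count of intermediaries
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None

# Hypothetical friendship data: A - C - D - B, with E unconnected.
friends = {
    "A": {"C"},
    "C": {"A", "D"},
    "D": {"C", "B"},
    "B": {"D"},
    "E": set(),
}
index_ab = friendship_index(friends, "A", "B")
```

Here A reaches B through C and D, so the index is 2. In a graph database the same question would be expressed as a shortest-path traversal instead of a hand-written search.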

Page 38: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

The Topology . .

[Diagram: Source, UpdateSpout, UpdateBolt; Source, QuerySpout, QueryBolt]

Page 39: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Update Spout

Page 40: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Update Spout

Define what kind of tuples are emitted

Page 41: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Update Spout

Gets and emits tuple streams

Page 42: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Update Bolt

Page 43: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Update Bolt

Objects for database access and indexing service

Page 44: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Update Bolt

Page 45: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Query Spout

Page 46: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Query Spout

The tuple to be emitted can contain multiple entities.

Page 47: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Query Bolt

Page 48: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Query Bolt

Page 49: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Query Bolt

Retrieve caller friend and requested friend ids

Page 50: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Query Bolt

Retrieve caller friend and requested friend ids as per database

Page 51: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Create Topology

Page 52: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Create Topology

Import all spout and bolt files

Page 53: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Create Topology

Unfortunately, there was no option in Petrel to turn off console debug output, so the console view is really messy.

Page 54: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Topology.yaml

Configurations for the topology are specified in this file

Page 55: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

A little More . .

[Diagram: Source, UpdateSpout, UpdateBolt; Source, QuerySpout, QueryBolt]

Page 56: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Final Thoughts

• A Storm-Neo4j framework is a boon for real-time graph computations.

• It is quite flexible in Java; Python bindings and implementations still have a long way to go.

• If you are an admin or developer, analyse your data and computing requirements before narrowing down on a framework.

Page 57: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

…to play with Storm and Neo4J

• My PyCon Talk Repo – slides, code skeletons, etc.: http://www.sonalraj.com/neo-storm.html

• Storm documentation (official): http://github.com/nathanmarz/storm

• Storm Book: http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010

• Deployment of Storm on AWS: http://github.com/nathanmarz/storm-deploy

• Neo4J Documentation: http://www.neo4j.org

Page 58: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Ex-terminated . . .

- That’s it

- Thanks for Listening !

- Questions

