© 2016 IBM Corporation

IBM Analytics

Analyzing Flight Data

Jeff Carlson

Rich Tarro

July 21, 2016


Agenda

Spark Overview – a quick review

Introduction to Graph Processing and Spark GraphX

GraphX Overview

Demo Scenario Overview

Demo

Wrap-up


What is Spark?

Spark is an open source, in-memory application framework for distributed data processing and iterative analysis on massive data volumes

“Analytic Operating System”


Key reasons for interest in Spark

Performant
– In-memory architecture greatly reduces disk I/O
– Anywhere from 20-100x faster for common tasks

Productive
– Concise and expressive syntax, especially compared to prior approaches
– Single programming model across a range of use cases and steps in data lifecycle
– Integrated with common programming languages: Java, Python, Scala
– New tools continually reduce skill barrier for access (e.g. SQL for analysts)

Leverages existing investments
– Works well within existing Hadoop ecosystem

Improves with age
– Large and growing community of contributors continuously improves the full analytics stack and extends its capabilities


Spark includes a set of core libraries that enable various analytic methods which can process data from many sources

Spark Core Engine: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions

– Spark SQL: executes SQL statements
– Spark Streaming: performs streaming analytics using micro-batches
– MLlib (machine learning): common machine learning and statistical algorithms
– GraphX (graph): distributed graph processing framework

A large variety of data sources and formats can be supported, both on premise or cloud

– BigInsights (HDFS), Cloudant, dashDB, Object Storage, SQL DB, …many others
– Across IBM Cloud, other clouds, cloud apps, and on-premise


Spark Application Architecture

A Spark application is initiated from a driver program

Spark execution modes:

– Standalone with the built-in cluster manager
– Use Mesos as the cluster manager
– Use YARN as the cluster manager
– Standalone cluster on any cloud (Bluemix, IBM SoftLayer, Amazon, Azure, …)


Spark RDDs

Immutable

Two types of operations

– Transformations ~ DDL (Create View V2 as…): lazy evaluation
• val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10
• val rddNumbers2 = rddNumbers.map(x => x + 1): Numbers from 2 to 11
• The LINEAGE on how to obtain rddNumbers2 from rddNumbers is recorded
• It’s a Directed Acyclic Graph (DAG)
• No actual data processing takes place (lazy evaluation)

– Actions ~ Select (Select * From V2…): perform computations
• rddNumbers2.collect(): Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
• Performs the transformations and the action
• Returns a value (or writes to a file)

Fault tolerance

– If data in memory is lost it will be recreated from lineage
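
The transformation/action split can be tried out with a few lines of Scala. This is a minimal sketch, assuming a local Spark installation; the application name and the local master setting are illustrative choices, and in spark-shell the existing context is simply reused.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation (assumption); getOrCreate reuses one if it already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-lazy-eval").setMaster("local[*]"))

// Transformations only record lineage; nothing is computed yet.
val rddNumbers  = sc.parallelize(1 to 10)
val rddNumbers2 = rddNumbers.map(x => x + 1)

// Print the recorded lineage (the chain of parent RDDs).
println(rddNumbers2.toDebugString)

// The action triggers the actual computation and returns the values to the driver.
val result = rddNumbers2.collect()
println(result.mkString(", "))  // 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
```

If a partition of rddNumbers2 were lost, Spark would rerun the recorded map over the corresponding slice of rddNumbers rather than restore a copy.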


Graphs are Central to Analytics

Data is not just getting bigger, it’s getting more connected

In many use cases, the relationship between data points provides as much value or more than the data points themselves

Discovering data relationships and interdependencies is critical to many applications

– fraud detection
– better understanding customer relationships
– ranking web pages or people in social networks

Graph analytics is a powerful tool for understanding and exploiting the connections in data

Graph applications are everywhere today and are a critical component of many next generation applications


What is a Graph?

A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that connect them. The vertices are the objects and the edges are the relationships between them.

Directed graph

– A graph where the edges have a direction associated with them. An example of a directed graph is Twitter followers. User Bob can follow user Carol without implying that user Carol follows user Bob.

Undirected graph

– A graph where the edges have no direction, so every relationship is mutual. An example of an undirected graph is Facebook friends. If Bob is a friend of Carol, then Carol is also a friend of Bob. (A regular graph, by contrast, is one where each vertex has the same number of edges.)


Spark GraphX

Graph processing system, NOT a database

GraphX extends Spark RDD by introducing a Graph abstraction

– A directed multigraph with properties attached to each vertex and edge

GraphX exposes a set of fundamental operators to support graph computation

– subgraph, joinVertices, aggregateMessages, …

Algorithms to simplify graph analytics tasks

– In addition to a highly flexible API, GraphX comes with a growing library of graph algorithms
– PageRank, Triangle Counting, …


Spark GraphX – Flexible Graphing

GraphX unifies ETL, exploratory analysis, and iterative graph computation

You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API


GraphX and the Alternatives

GraphX

– Optimized for running complex algorithms on the entire graph

Relational databases are poorly suited to most kinds of graph analysis

Graph Databases

– Database transactions: updates and deletes
– Typically work with small sections of the graph
• Ex. Query small groups of vertices


Graph Databases

The same restrictions that enable graph databases to achieve substantial performance gains also limit their ability to express many of the important stages in a typical graph-analytics pipeline

Often require data movement outside of the graph topology to express operations that are more naturally expressed as relational/table operations


GraphX Benefits

Unify graph and data centric computation in one system with a single composable API

Enables users to view data both as graphs and as collections (i.e., RDDs) or tables (DataFrames) without data movement or duplication


Property Graphs

GraphX implements an object called the property graph

– Directed multigraph with user defined objects attached to each vertex and edge

Like RDDs, property graphs are immutable, distributed, and fault-tolerant

Directed multigraphs can have multiple edges in parallel

– Every edge and vertex has user defined properties associated with it
– The parallel edges allow multiple relationships between the same vertices


Vertex and Edge RDDs

GraphX exposes RDD views of the vertices and edges stored within the graph

The VertexRDD[A] extends RDD[(VertexId, A)] and adds the additional constraint that each VertexId occurs only once

The EdgeRDD[ED] extends RDD[Edge[ED]] and organizes the edges in blocks partitioned using one of the various partitioning strategies defined in PartitionStrategy


Example Property Graph


Example – Constructing a Property Graph

Construct a property graph consisting of the various collaborators

– Vertex property might contain the username and occupation
– Edges with a string describing the relationships between collaborators
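
A construction along these lines might look as follows. This is a sketch under assumptions: the usernames, occupations, vertex IDs, and relationship strings are illustrative sample data, and the local master setting is only for experimentation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("property-graph").setMaster("local[*]"))

// Vertex property: (username, occupation). Sample data, not from the slides.
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")),
  (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")),
  (2L, ("istoica", "prof"))))

// Edge property: a string describing the relationship.
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"),
  Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"),
  Edge(5L, 7L, "pi")))

// Default property for vertices that appear in an edge but not in `users`.
val defaultUser = ("John Doe", "Missing")

val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)

println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")
```

The two type parameters of Graph are the vertex property type and the edge property type, here a (String, String) tuple and a String.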


Example – Working with a Property Graph

Deconstructing a Graph

– Vertex and edge views
– Use ‘graph.vertices’ and ‘graph.edges’ members

Alternately, pattern match with the case class type constructor, for example case Edge(src, dst, prop)
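
Both views can be filtered like any other RDD. The sketch below assumes a small illustrative collaborator graph (the names and IDs are sample data) and shows a tuple pattern on the vertex view and the Edge case class pattern on the edge view.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graph-views").setMaster("local[*]"))

// Small illustrative graph: (username, occupation) vertices, relationship edges.
val graph = Graph(
  sc.parallelize(Seq(
    (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof")))),
  sc.parallelize(Seq(
    Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))))

// Vertex view: pattern match on the (id, (name, occupation)) tuple.
val postdocs = graph.vertices
  .filter { case (_, (_, occupation)) => occupation == "postdoc" }
  .count()

// Edge view: pattern match with the Edge case class constructor.
val backEdges = graph.edges
  .filter { case Edge(src, dst, _) => src > dst }
  .count()

println(s"postdocs = $postdocs, edges with srcId > dstId = $backEdges")
```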


Triplet Views

Logically joins the vertex and edge properties

RDD[EdgeTriplet[VD, ED]] contains instances of the EdgeTriplet class

This join can be expressed in SQL as a join of the edges with the vertices twice, once on e.srcId = src.id and once on e.dstId = dst.id

[Figure: triplet view diagram]


EdgeTriplet Class

Extends the Edge class by adding the srcAttr and dstAttr members

The triplet view can be used to render a collection of strings describing relationships between users
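
As a sketch of how the triplet view might be used, the following builds one sentence per edge from srcAttr, attr, and dstAttr. The collaborator graph is illustrative sample data, not the deck's own.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("triplet-view").setMaster("local[*]"))

// Illustrative graph: (username, occupation) vertices, relationship edges.
val graph = Graph(
  sc.parallelize(Seq(
    (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof")))),
  sc.parallelize(Seq(
    Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))))

// Each triplet carries the edge plus the properties of both endpoints.
val facts = graph.triplets
  .map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")
  .collect()

facts.foreach(println)
```

Each EdgeTriplet corresponds to one row of the logical join between an edge and its source and destination vertices.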


Graph Operators

Similar to RDD basic operations like map, filter, and reduceByKey

Core operators have optimized implementations

Graph operator types:

– Property Operators (mapVertices, mapEdges, mapTriplets)
– Structural Operators (reverse, subgraph, mask, groupEdges)
– Join Operators (joinVertices, outerJoinVertices)
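
One operator from each family is sketched below on a three-airport graph. The airport codes, vertex IDs, and distances are illustrative sample values chosen for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graph-operators").setMaster("local[*]"))

// Illustrative graph: airport codes as vertices, distances (miles) as edge properties.
val graph = Graph(
  sc.parallelize(Seq((1L, "SFO"), (2L, "ORD"), (3L, "JFK"))),
  sc.parallelize(Seq(Edge(1L, 2L, 1846), Edge(2L, 3L, 740), Edge(1L, 3L, 2586))))

// Property operator: rewrite vertex properties, structure unchanged.
val lower = graph.mapVertices((_, code) => code.toLowerCase)

// Structural operator: flip the direction of every edge.
val reversed = graph.reverse

// Join operator: attach each vertex's out-degree (0 when it has no outgoing edges).
val withOutDeg = graph.outerJoinVertices(graph.outDegrees) {
  (_, code, degOpt) => (code, degOpt.getOrElse(0))
}

withOutDeg.vertices.collect().sortBy(_._1).foreach(println)
```

Each operator returns a new graph; the original graph is left unchanged because property graphs are immutable.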


Graph Operators - Subgraph

The subgraph operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate and edges that satisfy the edge predicate

– Retained edges must also connect vertices that satisfy the vertex predicate

The subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest or eliminate broken links
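
A sketch of both predicate styles follows, assuming illustrative collaborator data in which two edges point at vertex IDs (0 and 4) that have no user record. Those vertices receive the default property, and the vertex predicate then removes them together with their edges.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("subgraph").setMaster("local[*]"))

val users = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Vertices 0 and 4 have no user record, so these last two edges are "broken links".
val relationships = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
  Edge(4L, 0L, "student"), Edge(5L, 0L, "colleague")))

val graph = Graph(users, relationships, ("John Doe", "Missing"))

// Vertex predicate: keep real users; edges touching removed vertices are dropped too.
val validGraph = graph.subgraph(vpred = (_, attr) => attr._2 != "Missing")

// Edge predicate: keep only advisor relationships (all vertices remain).
val advisorGraph = graph.subgraph(epred = t => t.attr == "advisor")

println(s"valid: ${validGraph.numVertices} vertices, ${validGraph.numEdges} edges")
println(s"advisor: ${advisorGraph.numVertices} vertices, ${advisorGraph.numEdges} edges")
```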


Graph Algorithms - PageRank

An algorithm created by Google to rank websites in their search engine results

– Named after Larry Page, one of the founders of Google

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is

– Underlying assumption is that more important websites are likely to receive more links from other websites

The mathematics of PageRank are entirely general and apply to any graph or network in any domain

– E.g. Personalized PageRank is used by Twitter to present users with other accounts they may wish to follow
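
In GraphX the algorithm is available as a method on the graph. The sketch below runs it on a tiny link graph; the four-page link structure and the tolerance value are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pagerank").setMaster("local[*]"))

// Tiny link graph: pages 2, 3 and 4 all link to page 1; page 1 links back to page 2.
val links = sc.parallelize(Seq(
  Edge(2L, 1L, 1), Edge(3L, 1L, 1), Edge(4L, 1L, 1), Edge(1L, 2L, 1)))
val graph = Graph.fromEdges(links, defaultValue = 0)

// Iterate until the ranks change by less than the tolerance.
val ranks = graph.pageRank(tol = 0.0001).vertices

// Print pages from highest to lowest rank.
ranks.collect().sortBy(-_._2).foreach { case (id, rank) =>
  println(f"page $id: $rank%.3f")
}
```

Page 1 comes out on top because it receives the most links, and page 2 follows because its single inbound link comes from the highest-ranked page, which illustrates the "number and quality" idea.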


Graph Algorithms - Triangle Counting

GraphX implements a triangle counting algorithm

– A triangle is a small graph of three nodes in which every pair of nodes is connected
– Used in many real world applications as a measure of clustering

Determines the number of triangles passing through each vertex

– A vertex is part of a triangle when it has two adjacent vertices with an edge between them

TriangleCount requires the edges to be in canonical orientation (srcId < dstId) and the graph to be partitioned using Graph.partitionBy

– E.g. RandomVertexCut collocates all same-direction edges between two vertices by hashing the source and destination vertex IDs
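
A minimal sketch of the preparation and the count, assuming a four-vertex sample graph whose edges are already written with srcId < dstId. Later Spark releases may canonicalize the graph automatically; the explicit steps here follow the requirements as stated above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("triangle-count").setMaster("local[*]"))

// Edges in canonical orientation (srcId < dstId): triangle 1-2-3 plus a pendant vertex 4.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1), Edge(3L, 4L, 1)))

// Partition the graph before counting, as TriangleCount expects.
val graph = Graph.fromEdges(edges, defaultValue = 0)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// Number of triangles passing through each vertex.
val triangles = graph.triangleCount().vertices.collect().sortBy(_._1)

triangles.foreach { case (id, n) => println(s"vertex $id: $n triangle(s)") }
```

Vertices 1, 2 and 3 each sit on one triangle, while vertex 4 has only one neighbor and therefore none.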


Demo Scenario

Explore and analyze airline data

– Vertices representing airports
– Edges representing flights between airports and their associated distance

Use a number of operators provided by GraphX to analyze data in the graph and the relationships between the data

– E.g. find the airports with the greatest number of inbound and outbound flights

Employ graph operators to transform the graphs into new graphs

– Based on transformation criteria, like the distance between airports

Employ graph algorithms included with GraphX, like PageRank and Triangle Counting, to determine the busiest airports


Demo Scenario Data

Airline data in CSV format is readily available on the US Bureau of Transportation Statistics (BTS) website

– http://www.rita.dot.gov/bts/home

This demo employs US flight data for March 2016


Demo Flow

Download the data (CSV format)

Read in the CSV file as a DataFrame (infer the schema)

Clean up the DataFrame

– Remove the blank column and any rows that contain nulls

Convert the DataFrame to an RDD

– Use a custom case class
– GraphX is based on RDDs, so the DataFrame must be converted into an RDD

Extract data (airport IDs and airport codes) for the graph vertices

Extract data (origin airport ID, destination airport ID, distance between airports) for graph edges

Create the EdgeRDD
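
The preparation steps above could be sketched as follows. Several things here are assumptions rather than the demo's actual code: the SparkSession-based CSV reader (Spark 2.x or later), the column names, the hypothetical Flight case class, and the three-row temporary file that stands in for the downloaded data.

```scala
import java.nio.file.Files
import org.apache.spark.graphx.Edge
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical record layout for one flight.
case class Flight(originId: Long, origin: String, destId: Long, dest: String, distance: Double)

val spark = SparkSession.builder().appName("flight-prep").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the downloaded file: a trailing blank column and one row with a null distance.
val csv = Files.createTempFile("flights", ".csv")
Files.write(csv, Seq(
  "ORIGIN_AIRPORT_ID,ORIGIN,DEST_AIRPORT_ID,DEST,DISTANCE,",
  "14771,SFO,13930,ORD,1846.00,",
  "13930,ORD,12478,JFK,740.00,",
  "12478,JFK,14771,SFO,,").mkString("\n").getBytes("UTF-8"))

// Read the CSV as a DataFrame, inferring the schema.
val raw = spark.read.option("header", "true").option("inferSchema", "true").csv(csv.toString)

// Clean up: keep only the needed columns (which drops the blank one) and remove rows with nulls.
val clean = raw.select(
    col("ORIGIN_AIRPORT_ID").cast("long").as("originId"), col("ORIGIN").as("origin"),
    col("DEST_AIRPORT_ID").cast("long").as("destId"), col("DEST").as("dest"),
    col("DISTANCE").cast("double").as("distance"))
  .na.drop()

// Convert to an RDD of case class records, since GraphX works on RDDs.
val flights = clean.as[Flight].rdd

// Vertices: (airport ID, airport code). Edges: origin ID to destination ID, with distance.
val airports = flights.flatMap(f => Seq((f.originId, f.origin), (f.destId, f.dest))).distinct()
val routes   = flights.map(f => Edge(f.originId, f.destId, f.distance))

println(s"${airports.count()} airports, ${routes.count()} flights")
```

Casting the columns explicitly keeps the case class conversion independent of whatever types schema inference picks for a particular file.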


Example Demo Graph


Demo Flow (continued)

Create the graph

Investigate the graph

– Show vertices
– Count number of vertices/airports
– Show edges/flights
– Count the number of edges/flights and distinct routes
– Query the graph based on vertex and edge attributes and properties

Create a triplet view of the graph

– Query the triplet view

Compute the highest degree vertices (in, out, and total)

Calculate PageRank values for the graph vertices
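
Several of these investigation steps are sketched below on a hand-made flight graph. The airports, flights, and distances are illustrative sample data, and maxDegree is a helper introduced only for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("flight-graph").setMaster("local[*]"))

// Sample data: airport codes, and flights with distances in miles (some routes flown twice).
val airports = sc.parallelize(Seq((1L, "SFO"), (2L, "ORD"), (3L, "JFK")))
val flights = sc.parallelize(Seq(
  Edge(1L, 2L, 1846), Edge(1L, 2L, 1846), Edge(2L, 1L, 1846),
  Edge(2L, 3L, 740), Edge(2L, 3L, 740), Edge(3L, 2L, 740)))
val graph = Graph(airports, flights, "unknown")

println(s"airports: ${graph.numVertices}, flights: ${graph.numEdges}")

// Distinct routes: collapse parallel edges between the same pair of airports.
val distinctRoutes = graph.edges.map(e => (e.srcId, e.dstId)).distinct().count()

// Query the triplet view: routes longer than 1000 miles, as readable strings.
val longHauls = graph.triplets
  .filter(_.attr > 1000)
  .map(t => s"${t.srcAttr} -> ${t.dstAttr} (${t.attr} mi)")
  .distinct()
  .collect()

// Helper (hypothetical name) that keeps the vertex with the larger degree.
def maxDegree(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) =
  if (a._2 > b._2) a else b

val maxIn    = graph.inDegrees.reduce(maxDegree)
val maxOut   = graph.outDegrees.reduce(maxDegree)
val maxTotal = graph.degrees.reduce(maxDegree)

println(s"distinct routes: $distinctRoutes")
longHauls.foreach(println)
println(s"max in: $maxIn, max out: $maxOut, max total: $maxTotal")
```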


Demo Flow (conclusion)

Create a subgraph

Explore the subgraph

– Using both vertex predicates and edge predicates

Create a subgraph for Triangle Counting

– TriangleCount requires the edges to be in canonical orientation
– Also requires the graph to be partitioned

Create a Triangle Count graph

Investigate the vertices/airports with the highest triangle count


Summary

Graphs provide a powerful way to model and analyze connected data

GraphX builds on the massively parallel, fault-tolerant foundation of Spark to provide graph processing

Spark provides the ability to complement graph processing with relational processing in a single consistent framework and set of APIs

GraphX is a graph processing system and not a database

GraphX provides a number of operators and algorithms to facilitate working with and understanding the connections in the data


GraphX Challenges

Scala API only

– No Python or Java APIs

Utilizes lower level RDD (vs. DataFrame) based API

Does not benefit from Spark DataFrame optimizations such as the Catalyst query optimizer or Tungsten memory management


Enter Spark GraphFrames

DataFrame-based graphs for Apache Spark

– Vertices and edges are represented as DataFrames
– Enables arbitrary data to be stored with each vertex and edge

Python, Java and Scala APIs

Simplified interactive queries

– Phrase queries in the familiar, powerful Spark SQL and DataFrame APIs

Supports motif finding for structural pattern search

– For example, to recommend whom to follow, you might search for triplets of users A, B, C where A follows B and B follows C, but A does not follow C

Benefits from Spark DataFrame optimizations

GraphFrames fully integrate with GraphX via conversions between the two representations
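
The who-to-follow motif might be written as below. This is a sketch that assumes the separate GraphFrames package is on the classpath (it does not ship with Spark itself); the three users and two follow edges are illustrative sample data.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("graphframes-motif").master("local[*]").getOrCreate()
import spark.implicits._

// GraphFrames expects an "id" column for vertices and "src"/"dst" columns for edges.
val users   = Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")).toDF("id", "name")
val follows = Seq((1L, 2L), (2L, 3L)).toDF("src", "dst")
val g = GraphFrame(users, follows)

// Motif: a follows b, b follows c, but a does not follow c.
val suggestions = g.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
  .filter("a.id != c.id")  // exclude matches where a and c are the same user
  .selectExpr("a.name AS user", "c.name AS suggestion")

suggestions.show()
```

With this data the only match suggests Carol to Alice, since Alice follows Bob and Bob follows Carol.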