Apache Spark Streaming and HBase


Page 1: Apache Spark streaming and HBase


Overview of Apache Spark Streaming

Carol McDonald

Page 2: Apache Spark streaming and HBase


Agenda
•  Why Apache Spark Streaming?
•  What is Apache Spark Streaming?

–  Key Concepts and Architecture

•  How it works by Example

Page 3: Apache Spark streaming and HBase


Why Spark Streaming?

•  Process time series data:
   –  Results in near-real-time

•  Use cases:
   –  Social network trends
   –  Website statistics, monitoring
   –  Fraud detection
   –  Advertising click monetization

•  Time-stamped data: sensors, system metrics, events, log files; stock ticker, user activity
•  High volume and velocity

Diagram: time-stamped data is continuously put into the data store, providing data for real-time monitoring.

Page 4: Apache Spark streaming and HBase


What is time series data?
•  Stuff with timestamps:
   –  Sensor data
   –  Log files
   –  Phones…

Examples: credit card transactions, web user behavior, social media, log files, geodata, sensors

Page 5: Apache Spark streaming and HBase


Why Spark Streaming?

What if you want to analyze data as it arrives?

For example, time series data: sensors, clicks, logs, stats

Page 6: Apache Spark streaming and HBase


Batch Processing

It's 6:01 and 72 degrees
It's 6:02 and 75 degrees
It's 6:03 and 77 degrees
It's 6:04 and 85 degrees
It's 6:05 and 90 degrees
It's 6:06 and 85 degrees
It's 6:07 and 77 degrees
It's 6:08 and 75 degrees

It was hot at 6:05 yesterday!

Batch processing may be too late for some events

Page 7: Apache Spark streaming and HBase


Event Processing

It's 6:05 and 90 degrees

Someone should open a window!

Streaming

It's becoming important to process events as they arrive

Page 8: Apache Spark streaming and HBase


What is Spark Streaming?

•  Extension of the core Spark API

•  Enables scalable, high-throughput, fault-tolerant stream processing of live data

Data sources → Spark Streaming → data sinks

Page 9: Apache Spark streaming and HBase


Stream Processing Architecture

Diagram: streaming sources/apps → data ingest (Topics, MapR-FS) → stream processing → data storage (MapR-DB, MapR-FS) → apps

Page 10: Apache Spark streaming and HBase


Key Concepts

•  Data sources:
   –  File-based: HDFS
   –  Network-based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor

•  Transformations
•  Output operations
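As a minimal sketch of creating DStreams from a file-based and a network-based source (the app name, batch interval, directory, host, and port here are illustrative, not from the slides):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("SensorStream")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// file-based source: picks up new files as they appear in the directory
val fileDStream = ssc.textFileStream("/mapr/stream")

// network-based source: lines read from a TCP socket
val socketDStream = ssc.socketTextStream("localhost", 9999)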

Page 11: Apache Spark streaming and HBase


Spark Streaming Architecture

•  Divide the data stream into batches of X seconds
   –  Called a DStream = a sequence of RDDs

Diagram: the input data stream enters Spark Streaming and is divided by the batch interval into a DStream of RDD batches – data from time 0 to 1 becomes the RDD @ time 1, data from time 1 to 2 becomes the RDD @ time 2, and so on.

Page 12: Apache Spark streaming and HBase


Resilient Distributed Datasets (RDD)

Spark revolves around RDDs:
•  Read-only collection of elements

Page 13: Apache Spark streaming and HBase


Resilient Distributed Datasets (RDD)

Spark revolves around RDDs:
•  Read-only collection of elements
•  Operated on in parallel
•  Cached in memory
   –  Or on disk
•  Fault tolerant
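A minimal sketch of these properties (the file path is illustrative, and sc is an existing SparkContext):

// distributed, read-only collection of lines
val linesRDD = sc.textFile("/mapr/data/sensor.csv")

// transformations never modify an RDD; they return a new one
val errorsRDD = linesRDD.filter(line => line.contains("ERROR"))

// cache the RDD in memory (other storage levels can spill to disk)
errorsRDD.cache()

// fault tolerance: a lost partition is recomputed from the lineage above
errorsRDD.count()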

Page 14: Apache Spark streaming and HBase


Working With RDDs

RDD → transformations → RDD → action → value

linesRDD = sc.textFile("SomeFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
linesWithErrorRDD.count()   # 6
linesWithErrorRDD.first()   # "# Error line"

Page 15: Apache Spark streaming and HBase


Process DStream

•  Process using transformations
   –  Creates new RDDs

Diagram: each transformation (transform, map, reduceByKey, count, …) is applied to every RDD in the DStream (RDD @ time 1, RDD @ time 2, RDD @ time 3, …), producing the corresponding RDDs of a new DStream.

Page 16: Apache Spark streaming and HBase


Key Concepts

•  Data sources
•  Transformations: create a new DStream
   –  Standard RDD operations: map, filter, union, reduce, join, …
   –  Stateful operations: updateStateByKey(function), countByValueAndWindow, … (see the sketch below)
•  Output operations
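A minimal sketch of a stateful operation, assuming pairDStream is a DStream[(String, Int)] (the checkpoint directory and names are illustrative):

// stateful operations require a checkpoint directory
ssc.checkpoint("/mapr/checkpoints")

// keep a running count per key across batches
val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

val runningCounts = pairDStream.updateStateByKey[Int](updateCount)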

Page 17: Apache Spark streaming and HBase


Spark Streaming Architecture

•  Processed results are pushed out in batches

Diagram: input data stream → Spark Streaming → DStream RDD batches (data from time 0 to 1 = RDD @ time 1, …) → Spark → batches of processed results

Page 18: Apache Spark streaming and HBase


Key Concepts

•  Data sources
•  Transformations
•  Output operations: trigger computation
   –  saveAsHadoopFiles – save to HDFS
   –  saveAsHadoopDataset – save to HBase
   –  saveAsTextFiles
   –  foreachRDD – do anything with each batch of RDDs (see the sketch below)
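A small sketch of two of these output operations applied to the sensorDStream used later in the deck (the output path is illustrative):

// write each batch as text files under the given prefix
sensorDStream.saveAsTextFiles("/mapr/out/sensors")

// foreachRDD gives full control over each batch of RDDs
sensorDStream.foreachRDD { rdd =>
  println(s"events in this batch: ${rdd.count()}")
}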

Page 19: Apache Spark streaming and HBase


Learning Goals

•  How it works by example

Page 20: Apache Spark streaming and HBase


Use Case: Time Series Data

Diagram: oil pump sensor data → read by Spark Streaming → Spark processing → data for real-time monitoring

Page 21: Apache Spark streaming and HBase


Convert Line of CSV data to Sensor Object

case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
                  flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
         p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
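For example (the field values in this sample line are made up, following the row-key format on the next slide):

val line = "COHUTTA,3/10/14,1:01,10.37,0.35,2.5,0.4,84.0,1.5"
val sensor = parseSensor(line)
// sensor.resid == "COHUTTA", sensor.psi == 84.0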

Page 22: Apache Spark streaming and HBase


Schema

•  All events stored; the data CF could be set to expire old data
•  Filtered alerts put in the alerts CF
•  Daily summaries put in the stats CF

Row key                CF data            CF alerts    CF stats
                       hz    …    psi     psi    …     hz_avg    …    psi_min
COHUTTA_3/10/14_1:01   10.37      84      0
COHUTTA_3/10/14                                        10              0

Page 23: Apache Spark streaming and HBase


Basic Steps for Spark Streaming code

These are the basic steps for Spark Streaming code (a minimal skeleton follows):

1.  Create a DStream
    –  Apply transformations
    –  Apply output operations
2.  Start receiving data and processing it using streamingContext.start()
3.  Wait for the processing to be stopped using streamingContext.awaitTermination()
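A minimal skeleton putting the steps together (the path and batch interval are from the following slides; the app name is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("SensorStream")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// 1. create a DStream, apply transformations and output operations
val linesDStream = ssc.textFileStream("/mapr/stream")
val sensorDStream = linesDStream.map(parseSensor)
sensorDStream.print()   // a simple output operation for debugging

// 2. start receiving data and processing it
ssc.start()

// 3. wait for the processing to be stopped
ssc.awaitTermination()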

Page 24: Apache Spark streaming and HBase


Create a DStream

val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream("/mapr/stream")

DStream: a sequence of RDDs representing a stream of data. Each batch (time 0-1, time 1-2, …) of linesDStream is stored in memory as an RDD.

Page 25: Apache Spark streaming and HBase


Process DStream

val linesDStream = ssc.textFileStream("directory path")
val sensorDStream = linesDStream.map(parseSensor)

Diagram: map creates new RDDs for every batch – each linesDStream RDD (batch time 0-1, time 1-2, …) is mapped to the corresponding sensorDStream RDD.

Page 26: Apache Spark streaming and HBase


Process DStream

// for each RDD, filter sensor data for low psi
sensorDStream.foreachRDD { rdd =>
  val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
  . . .
}

Page 27: Apache Spark streaming and HBase


DataFrame and SQL Operations

// for each RDD: parse into sensor objects, filter
sensorDStream.foreachRDD { rdd =>
  . . .
  alertRDD.toDF().registerTempTable("alert")
  // join alert data with pump maintenance info
  val res = sqlContext.sql("select s.resid, s.psi, p.pumpType from alert s join pump p on s.resid = p.resid join maint m on p.resid = m.resid")
  . . .
}
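The pump and maint tables joined above are not shown in the deck; a hypothetical sketch of how such reference data could be registered (the case class fields and file path are assumptions):

import sqlContext.implicits._

case class Pump(resid: String, pumpType: String)

val pumpDF = sc.textFile("/mapr/data/pump.csv")
  .map(_.split(","))
  .map(p => Pump(p(0), p(1)))
  .toDF()
pumpDF.registerTempTable("pump")
// the maint table would be registered the same way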

Page 28: Apache Spark streaming and HBase


Save to HBase

// for each RDD: parse into sensor objects, filter,
// convert alerts to Put objects, and write to the HBase alerts CF
sensorDStream.foreachRDD { rdd =>
  . . .
  alertRDD.map(Sensor.convertToPutAlert).saveAsHadoopDataset(jobConfig)
}
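The deck does not show Sensor.convertToPutAlert; one possible shape, assuming the row-key format from the schema slide (the column family and qualifier names are assumptions):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// would live in the Sensor companion object
def convertToPutAlert(sensor: Sensor): (ImmutableBytesWritable, Put) = {
  val rowkey = sensor.resid + "_" + sensor.date + "_" + sensor.time
  val put = new Put(Bytes.toBytes(rowkey))
  // store the low-psi reading in the alerts column family
  put.add(Bytes.toBytes("alert"), Bytes.toBytes("psi"), Bytes.toBytes(sensor.psi))
  (new ImmutableBytesWritable(Bytes.toBytes(rowkey)), put)
}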

Page 29: Apache Spark streaming and HBase


Save to HBase

rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)

Diagram: for each batch (time 0-1, time 1-2, …), map converts the sensor RDDs into Put objects, which are written to HBase by the save output operation – an output operation persists data to external storage.

Page 30: Apache Spark streaming and HBase


Start Receiving Data

sensorDStream.foreachRDD { rdd =>
  . . .
}

// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()

Page 31: Apache Spark streaming and HBase


Using HBase as a Source and Sink

Diagram: a Spark application reads from and writes to an HBase database.

Example: calculate and store summaries – a pre-computed, materialized view.

Page 32: Apache Spark streaming and HBase



HBase Read and Write

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)

Diagram: newAPIHadoopRDD scans HBase, returning (row key, Result) pairs; saveAsHadoopDataset writes (key, Put) pairs back to HBase.
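The conf used by newAPIHadoopRDD is not shown in the deck; a hypothetical setup sketch:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// create an HBase configuration and point it at the table to scan
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, tableName)  // assumes tableName is defined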

Page 33: Apache Spark streaming and HBase


Read HBase

// load an RDD of (row key, Result) tuples from the HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

// get the Result
val resultRDD = hBaseRDD.map(tuple => tuple._2)

// transform into an RDD of (RowKey, ColumnValue) pairs
val keyValueRDD = resultRDD.map(result =>
  (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))

// group by row key, get statistics for the column value
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))

Page 34: Apache Spark streaming and HBase


Write HBase

// set up the job configuration for writing to the HBase table
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)

// convert psi stats to Puts and write to the HBase table stats column family
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)

Page 35: Apache Spark streaming and HBase


MapR Blog: Spark Streaming with HBase

•  https://www.mapr.com/blog/spark-streaming-hbase

Page 36: Apache Spark streaming and HBase


Free HBase On Demand Training (includes Hive and MapReduce with HBase)

•  https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training

Page 37: Apache Spark streaming and HBase


Soon to Come

•  Spark On Demand Training
   –  https://www.mapr.com/services/mapr-academy/

Page 38: Apache Spark streaming and HBase


References
•  Spark web site: http://spark.apache.org/
•  https://databricks.com/
•  Spark on MapR:
   –  http://www.mapr.com/products/apache-spark
•  Spark SQL and DataFrame Guide
•  Apache Spark vs. MapReduce – Whiteboard Walkthrough
•  Learning Spark – O'Reilly Book
•  Apache Spark

Page 39: Apache Spark streaming and HBase


Q & A

Engage with us!
@mapr | maprtech | MapR | mapr-technologies