Apache Spark: Its Place Within a Big Data Stack

Junjun Olympia

Posted: 21-Apr-2017

TRANSCRIPT

Page 1: Apache Spark: Its Place Within a Big Data Stack

Its Place Within a Big Data Stack

Junjun Olympia

Page 2:

Image from http://mattturck.com/2016/02/01/big-data-landscape/

Page 3:

A Quick Review of What Spark Is

Page 4:

Spark is a fast, large-scale data processing engine
● Runs both in-memory and on-disk
● 10x-100x faster than Hadoop MapReduce
● Applications can be written in Java, Scala, Python, R, & SQL
● Supports both batch and streaming workflows
● Has several modules
  ○ Spark Core
  ○ Spark Streaming
  ○ Spark MLlib
  ○ GraphX
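As a minimal illustration of the batch side, here is a hypothetical word count on Spark Core in local mode. The object name, app name, and input lines are invented for this sketch, and it uses the newer SparkSession entry point rather than the deck's sqlContext-era API:

```scala
import org.apache.spark.sql.SparkSession

// Minimal Spark Core sketch: a classic word count, run in local mode.
// The object name, app name, and input lines are illustrative only.
object WordCountSketch {
  def count(lines: Seq[String]): Map[String, Int] = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("word-count-sketch")
      .getOrCreate()
    val result = spark.sparkContext
      .parallelize(lines)          // distribute the input
      .flatMap(_.split("\\s+"))    // split into words
      .map(word => (word, 1))      // pair each word with a count of 1
      .reduceByKey(_ + _)          // sum counts per word
      .collect()
      .toMap
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit =
    println(count(Seq("spark is fast", "spark is general")))
}
```

The same logic can run unchanged on a cluster by pointing `master` at a cluster manager instead of `local[*]`.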

Page 5:

It is the most active open-source project in big data

Next three images from http://go.databricks.com/2015-spark-survey

Page 6:

Gaining adoption across industries and companies

Page 7:

Used to power several different systems

Page 8:

(Big) Data systems perform common functions

Page 9:

Capture and extract data

Page 10:

Data can come from several sources
● Existing databases and data warehouses
● Flat files from legacy systems
● Web, mobile, and application logs
● Data feeds from social media
● IoT devices

Page 11:

Extract from a database: Sqoop vs the Spark JDBC API

$ sqoop import --connect jdbc:postgresql:dbserver --table schema.tablename \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --optionally-enclosed-by '\"'

val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
      "dbtable" -> "schema.tablename")).load()

Page 12:

Read JSON files

// JSON file as a DataFrame
val df = sqlContext.read.json("people.json")

-- or the equivalent via SQL
CREATE TEMPORARY TABLE people
USING org.apache.spark.sql.json
OPTIONS (path 'people.json')

Page 13:

Ingest streaming data from Kafka

import org.apache.spark.streaming.kafka._

val directKafkaStream = KafkaUtils.createDirectStream[
  [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext, [map of Kafka parameters], [set of topics to consume])

var offsetRanges = Array[OffsetRange]()

directKafkaStream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.map {...}.foreachRDD { rdd => ... }

Page 14:
Page 15:

Transform Data

Page 16:

Data in an analytic pipeline usually needs transformation
● Check and/or correct for data quality issues
● Handle missing values
● Cast values into specific data types or formats
● Compute derived fields
● Split or merge records to achieve desired granularity
● Join with another dataset (e.g. reference lookups)
● Restructure as required by downstream applications or target databases
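A sketch of a few of these steps with the DataFrame API. The dataset, column names, and the 12% rate in the derived field are all invented for the example; the deck's sqlContext-era API works the same way:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of common transformations: fill missing values, cast a type,
// compute a derived field, and join against a reference lookup.
// All data and column names here are invented for illustration.
object TransformSketch {
  def clean(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Raw records with quality issues: a missing country and string amounts
    val raw = Seq(("a1", "PH", "120.50"), ("a2", null, "80"), ("a3", "US", "99.99"))
      .toDF("id", "country", "amount")

    // Reference lookup joined in to resolve country names
    val countries = Seq(("PH", "Philippines"), ("US", "United States"))
      .toDF("country", "country_name")

    raw
      .na.fill("UNKNOWN", Seq("country"))             // handle missing values
      .withColumn("amount", $"amount".cast("double")) // cast to a numeric type
      .withColumn("amount_vat", $"amount" * 0.12)     // compute a derived field
      .join(countries, Seq("country"), "left")        // reference lookup
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]")
      .appName("transform-sketch").getOrCreate()
    clean(spark).show()
    spark.stop()
  }
}
```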

Page 17:

There are plenty of tools that do this
● Before big data
  ○ Informatica PowerCenter
  ○ Pentaho Kettle
  ○ Talend
  ○ SSIS
  ○ OWB
● Early Hadoop
  ○ Apache Pig
  ○ Hive via HQL
  ○ Plain ol’ MapReduce
● Spark core, Streaming, DataFrames

Page 18:
Page 19:

Store data

Page 20:

Data can then be stored several different ways
● As self-describing files like Parquet, Avro, JSON, XML
● Hive metastore-managed tables
● Other low-latency SQL-on-Hadoop engines (e.g. Impala, Drill, Kudu)
● Key-value and wide-table databases for fast random access (e.g. HBase, Cassandra)
● Search databases (e.g. Elasticsearch, Solr)
● Conventional data warehouses and databases
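For example, landing a DataFrame as self-describing Parquet and reading it back. The output path and sample data are placeholders for the sketch:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: persist a DataFrame as self-describing Parquet and read it back.
// The output path and sample data are placeholders.
object StoreSketch {
  def roundTrip(path: String): Long = {
    val spark = SparkSession.builder.master("local[*]")
      .appName("store-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("a1", 120.5), ("a2", 80.0)).toDF("id", "amount")

    // Parquet carries the schema with the data, so no external DDL is needed
    df.write.mode("overwrite").parquet(path)

    val rows = spark.read.parquet(path).count()
    spark.stop()
    rows
  }

  def main(args: Array[String]): Unit =
    println(roundTrip("/tmp/store-sketch-parquet"))  // hypothetical local path
}
```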

Page 21:
Page 22:

Query, analyze, visualize

Page 23:

There are plenty of tools here, too
● Databases offering JDBC/ODBC connectivity
  ○ Hive, Impala, Drill
  ○ MPP data warehouses
  ○ Spark SQL via the JDBC Thrift Server
● BI tools via SQL
  ○ QlikView
  ○ Tableau
  ○ Pentaho BI
● For richer analyses beyond Spark SQL
  ○ Spark shell
  ○ Better with notebooks (e.g. Zeppelin, Jupyter)
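As a small example of that last path, a DataFrame can be registered as a temporary view and queried with Spark SQL, just as you would interactively in the Spark shell or a notebook. The view name and data are invented for the sketch:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: expose a DataFrame to SQL via a temporary view, as you would
// interactively in the Spark shell or a Zeppelin/Jupyter notebook.
// The view name and data are invented for the example.
object QuerySketch {
  def topCountry(): String = {
    val spark = SparkSession.builder.master("local[*]")
      .appName("query-sketch").getOrCreate()
    import spark.implicits._

    Seq(("a1", "PH"), ("a2", "PH"), ("a3", "US"))
      .toDF("id", "country")
      .createOrReplaceTempView("events")

    // Plain SQL over the registered view
    val top = spark.sql(
      "SELECT country, count(*) AS n FROM events GROUP BY country ORDER BY n DESC")
      .first().getString(0)
    spark.stop()
    top
  }

  def main(args: Array[String]): Unit = println(topCountry())
}
```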

Page 24:
Page 25:
Page 26:

Spark is an essential part of the modern big data stack.

Page 27:

A unified framework such as Spark offers benefits, too
● Fewer moving pieces
● Smaller stack to administer and manage
● Common languages
● Familiar patterns
● Encourages team members to become cross-functional

Page 28:

Some common questions about Spark

Page 29:

Is Spark a database?

Page 30:

Fine; is it a data warehouse then?

Page 31:

Is Spark a Hadoop replacement?

Page 32:

Are there other technologies similar to Spark?

Page 33:

Questions?