pipeline unified big data analytics - github pagesfrank19900731.github.io/downloads/file/unified big...

Post on 20-May-2020


Page 1: Pipeline Unified Big Data Analytics - GitHub Pagesfrank19900731.github.io/downloads/file/Unified Big Data... · 2017-02-13 · Unified big data analytics pipeline for Batch / interactive

Unified Big Data Analytics Pipeline

连城

What is Spark

● A fast and general engine for large-scale data processing

● An open source implementation of Resilient Distributed Datasets (RDD)

● Has an advanced DAG execution engine that supports cyclic data flow and in-memory computing

Why Spark

● Fast
  ○ Run iterative programs such as machine learning up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  ○ Run HiveQL-compatible queries up to 100x faster than Hive (with Shark / Spark SQL)

● Compatibility
  ○ Compatible with most popular storage systems on top of HDFS
  ○ New users can deploy Spark without first running ETL on their existing data

● Easy to use
  ○ Fluent Scala / Java / Python API
  ○ Interactive shell
  ○ 2-5x less code (than Hadoop MapReduce)

    sc.textFile("hdfs://...")
      .flatMap(_ split " ")
      .map(_ -> 1)
      .reduceByKey(_ + _)
      .collectAsMap()

Try to write down WordCount in 30 seconds with Hadoop MapReduce :-)

● Unified big data analytics pipeline for
  ○ Batch / interactive (Spark Core vs MR / Tez)
  ○ SQL (Shark / Spark SQL vs Hive)
  ○ Streaming (Spark Streaming vs Storm)
  ○ Machine learning (MLlib vs Mahout)
  ○ Graph (GraphX vs Giraph)

[Figure: the Spark stack. Shark (SQL), Spark Streaming, MLlib (machine learning), and GraphX (graph) are built on top of Spark Core.]

The Hadoop Family

● General batching
  ○ MapReduce
● Specialized systems
  ○ Streaming: Storm, S4, Samza
  ○ Iterative: Mahout
  ○ Ad-hoc / SQL: Pig, Hive, Drill, Impala
  ○ Graph: Giraph

Using separate frameworks

Big Data Analytics Pipeline Today

[Figure: ETL, Train, and Query each run in a separate framework, so every stage reads its input from HDFS and writes its output back to HDFS: HDFS read → ETL → HDFS write → HDFS read → Train → HDFS write → HDFS read → Query → HDFS write.]

[Figure: the desired pipeline keeps only the first HDFS read and the last HDFS write: HDFS read → ETL → Train → Query → HDFS write.]

How?

Resilient Distributed Datasets

● RDDs
  ○ are read-only, partitioned collections of records
  ○ can be cached in memory and are locality aware
  ○ can only be created through deterministic operations on either (1) data in stable storage, or (2) other RDDs

● Core data sharing abstraction in Spark
● Computation can be represented by lazily evaluated RDD lineage DAG(s)
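To make "lazily evaluated lineage" concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark's implementation: `MiniRDD`, `Source`, `Mapped`, `Filtered`, and `LineageDemo` are hypothetical names invented for this example. Each transformation only records its parent and its function; the data is read when an action (`collect`) walks the lineage.

```scala
// Toy model of a lazily evaluated lineage; NOT Spark's real RDD class.
sealed trait MiniRDD[A] {
  // Recompute this dataset from its lineage; nothing runs until this is called.
  def compute(): Seq[A]
  // Transformations only record a new lineage node.
  def map[B](f: A => B): MiniRDD[B] = Mapped(this, f)
  def filter(p: A => Boolean): MiniRDD[A] = Filtered(this, p)
  // Action: triggers evaluation of the whole lineage.
  def collect(): Seq[A] = compute()
}

// Leaf node: stands in for data in stable storage (a function that re-reads it).
final case class Source[A](read: () => Seq[A]) extends MiniRDD[A] {
  def compute(): Seq[A] = read()
}

final case class Mapped[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
  def compute(): Seq[B] = parent.compute().map(f)
}

final case class Filtered[A](parent: MiniRDD[A], p: A => Boolean) extends MiniRDD[A] {
  def compute(): Seq[A] = parent.compute().filter(p)
}

object LineageDemo {
  var reads = 0 // counts how often the source is actually read

  val source: MiniRDD[Int] = Source(() => { reads += 1; Seq(1, 2, 3, 4, 5) })

  // Building the lineage does not touch the data.
  val evensDoubled: MiniRDD[Int] = source.filter(_ % 2 == 0).map(_ * 2)

  def main(args: Array[String]): Unit = {
    println(s"reads before the action: $reads")
    println(evensDoubled.collect())
    println(s"reads after the action: $reads")
  }
}
```

Because every node is a deterministic function of its parent, the same lineage can be replayed at any time, which is what the fault-tolerance claim later in the deck relies on.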

● An RDD
  ○ is a read-only, partitioned, locality-aware distributed collection of records
  ○ either points to a stable data source, or
  ○ is created through deterministic transformations on some other RDD(s)

● Coarse grained
  ○ Applies the same operation on many records
● Fault tolerant
  ○ Failed tasks can be recovered in parallel with the help of the lineage DAG
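The recovery idea can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not Spark code: `RecoveryDemo`, `stableStorage`, `splitWords`, and the partition layout are all made up for the example. Since each partition is a deterministic function of its parent partition, a lost partition is rebuilt by re-running that function on its own input only, and lost partitions are independent of each other, so they could be recomputed in parallel.

```scala
// Toy model of lineage-based recovery; names are illustrative, not Spark APIs.
object RecoveryDemo {
  // Stand-in for stable storage: partition index -> records.
  val stableStorage: Map[Int, Seq[String]] = Map(
    0 -> Seq("a b", "c"),
    1 -> Seq("d e f"),
    2 -> Seq("g")
  )

  // Lineage: one coarse-grained operation applied to every record of a partition.
  val splitWords: Seq[String] => Seq[String] = _.flatMap(_.split(" ").toSeq)

  // Compute a single partition from its lineage.
  def computePartition(index: Int): Seq[String] = splitWords(stableStorage(index))

  // Materialize all partitions, as a running job would.
  def materialize(): Map[Int, Seq[String]] =
    stableStorage.keys.map(i => i -> computePartition(i)).toMap

  // Recompute only the partitions that were lost; surviving ones are reused as-is.
  def recover(surviving: Map[Int, Seq[String]]): Map[Int, Seq[String]] = {
    val lost = stableStorage.keySet -- surviving.keySet
    surviving ++ lost.map(i => i -> computePartition(i))
  }

  def main(args: Array[String]): Unit = {
    val full = materialize()
    val afterFailure = full - 1 // simulate losing partition 1
    val recovered = recover(afterFailure)
    println(s"recovered equals original: ${recovered == full}")
  }
}
```

No replica of the derived data is needed here: the lineage plus the stable input is enough to rebuild what was lost.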

Data Sharing in Spark

● Frequently accessed RDDs can be materialized and cached in memory
  ○ Cached RDDs can also be replicated for fault tolerance
● Spark scheduler takes cached data locality into account
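The effect of materializing and caching a frequently accessed dataset can be illustrated with a small plain-Scala sketch. It is a toy, not Spark's cache: `CacheDemo`, `Uncached`, `Cached`, and `computeFromLineage` are hypothetical names, and replication and locality-aware scheduling are not modeled.

```scala
// Toy model of caching a derived dataset; names are illustrative, not Spark APIs.
object CacheDemo {
  var computations = 0 // how many times the lineage was actually evaluated

  // Stand-in for an expensive, deterministic lineage.
  private def computeFromLineage(): Vector[Int] = {
    computations += 1
    Vector.tabulate(5)(i => i * i)
  }

  // Without caching, every action re-runs the lineage.
  final class Uncached {
    def collect(): Vector[Int] = computeFromLineage()
  }

  // With caching, the first action materializes the data and later ones reuse it.
  final class Cached {
    private lazy val materialized: Vector[Int] = computeFromLineage()
    def collect(): Vector[Int] = materialized
  }

  def main(args: Array[String]): Unit = {
    val uncached = new Uncached
    uncached.collect(); uncached.collect()
    println(s"uncached evaluations: $computations")

    computations = 0
    val cached = new Cached
    cached.collect(); cached.collect()
    println(s"cached evaluations: $computations")
  }
}
```

In an iterative job that touches the same dataset on every pass, this difference is what removes the repeated reads from stable storage.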

[Figure: cluster diagram showing a driver (SparkContext, CacheManager), a cluster manager, and two worker nodes, each running an executor with tasks and a cache.]

The scheduler tries to schedule tasks to nodes where the stable data source or the cached input is available.

Executors report caching information back to the driver to improve data locality.

● As a consequence, intermediate results are
  ○ either cached in memory, or
  ○ written to the local file system as shuffle map output, waiting to be fetched by reducer tasks (similar to MapReduce)
● Enables in-memory data sharing, avoids unnecessary HDFS I/O

Generality of RDDs

● From an expressiveness point of view
  ○ The coarse-grained nature is not a big limitation, since many parallel programs naturally apply the same operation to many records, making them easy to express
  ○ In fact, RDDs can emulate any distributed system, and will do so efficiently in many cases unless the system is sensitive to network latency

● From a system perspective
  ○ RDDs give applications control over the most common bottleneck resources in cluster applications (i.e., network and storage I/O)
  ○ This makes it possible to express the same optimizations that specialized systems have and achieve similar performance

Limitations of RDDs

● Not suitable for
  ○ Millisecond-scale latency requirements
  ○ Communication patterns beyond all-to-all
  ○ Asynchrony
  ○ Fine-grained updates

Analytics Models Built over RDDs

MR / Dataflow

    // `spark` is a SparkContext
    spark.textFile("hdfs://...")
      .flatMap(_ split " ")
      .map(_ -> 1)
      .reduceByKey(_ + _)
      .collectAsMap()

Streaming

    // Same word count, applied to one-second batches read from a socket
    val streaming = new StreamingContext(spark, Seconds(1))
    streaming.socketTextStream(host, port)
      .flatMap(_ split " ")
      .map(_ -> 1)
      .reduceByKey(_ + _)
      .print()
    streaming.start()

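SQL

    val hiveContext = new HiveContext(spark)
    import hiveContext._

    val children = hiveql(
      """SELECT name, age FROM people
        |WHERE age < 13
      """.stripMargin).map {
        case Row(name: String, age: Int) =>
          name -> age
      }
    children.foreach(println)

Machine Learning

    val points = spark.textFile("hdfs://...")
      // The field delimiter is an assumption; the slide omits the argument of split
      .map(_.split(" ").map(_.toDouble).splitAt(1))
      .map { case (Array(label), features) =>
        LabeledPoint(label, features)
      }
    // Slide-level sketch: MLlib's KMeans.train also takes the number of clusters and
    // the maximum number of iterations, and expects unlabeled feature vectors
    val model = KMeans.train(points)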

Mixing Different Analytics Models

MapReduce & SQL

    // ETL with Spark Core
    case class Person(name: String, age: Int)
    val people = spark.textFile("hdfs://people.txt").map { line =>
      val Array(name, age) = line.split(",")
      Person(name, age.trim.toInt)
    }
    // Expose the RDD of case classes to SQL under the name "people"
    people.registerAsTable("people")

    // Query with Spark SQL
    val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.collect().foreach(println)

SQL & iterative computation (machine learning)

    // ETL with Spark SQL & Spark Core
    val trainingData = sql(
      """SELECT e.action, u.age, u.latitude, u.longitude
        |FROM Users u JOIN Events e ON u.userId = e.userId
      """.stripMargin).map {
        // The label is the user's action; age and location are the features
        case Row(action: Double, age: Double, latitude: Double, longitude: Double) =>
          val features = Array(age, latitude, longitude)
          LabeledPoint(action, features)
      }

    // Training with MLlib
    val model = new LogisticRegressionWithSGD().run(trainingData)

Unified Big Data Analytics Pipeline

[Figure: the same ETL → Train → Query pipeline on Spark. Spark Core, MLlib, and Spark SQL run the stages and hand data to each other as RDDs, so only the first HDFS read and the last HDFS write remain. Keeping the data in memory also enables interactive analysis.]

● Unified in-memory data sharing abstraction
● Efficient runtime brings a significant performance boost and enables interactive analysis
● “One stack to rule them all”

Thanks!

Q&A