An Introduction to Apache Spark. Big Data Madison: 29 July 2014

William Benton @willb Red Hat, Inc.

About me

• At Red Hat for almost 6 years, working on distributed computing

• Currently contributing to Spark, active Fedora packager and sponsor

• Before Red Hat: concurrency and program analysis research

Forecast

• Background

• Resilient Distributed Datasets

• Spark Libraries

• Community Overview

Background

Processing data in 2005

• MapReduce paper applied very old ideas to distributed data processing

• Hadoop provided open-source MR and distributed FS implementations

• MR allows scale-out on commodity clusters for some real problems

(MapReduce paper: Dean & Ghemawat, 2004. Very old ideas: McCarthy, 1960.)

Hadoop and MR shortcomings

• MR works well for batch jobs; suffers for iterative or interactive ones

• Special-purpose extensions for machine learning, query, etc.

• MR model is not a natural fit for many programmers or programs

Examples of such extensions: Mahout (machine learning); Hive and Pig (query)

Data scientist time: $460 per 8 hours; EC2 c3.large instance: $0.84 per 8 hours (a ratio of roughly 550 to 1)

Apache Spark

• Introduced in 2009; donated to Apache in 2013; 1.0 release in 2014

• Based on a fundamental abstraction rather than an execution model

• Supports in-memory computing and a wide range of problems

Spark is general

• Libraries: Graph, SQL, ML, Streaming

• All built on Spark core

• Cluster scheduling: ad hoc, Mesos, YARN

• APIs for Scala, Java, Python, and R (3rd-party bindings for Clojure et al.)

RDDs

Resilient distributed datasets

Partitioned across machines by range…

…or by hash
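The two partitioning schemes named above can be sketched in plain Scala, without Spark on the classpath. This is only an illustration of the idea: the object and function names are hypothetical, and the range bounds are supplied by hand rather than sampled from the data.

```scala
object PartitioningSketch {
  // Hash partitioning: a key goes to partition (hashCode mod n),
  // adjusted so that negative hash codes still map to a valid index.
  def hashPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Range partitioning: sorted upper bounds split the key space into
  // contiguous ranges. A key goes to the first range whose bound is >= key;
  // keys above every bound go to one extra, final partition.
  def rangePartition(key: Int, upperBounds: Seq[Int]): Int = {
    val idx = upperBounds.indexWhere(bound => key <= bound)
    if (idx >= 0) idx else upperBounds.length
  }
}
```

For example, `Seq(3, 42, 17, 99, 64).groupBy(k => PartitioningSketch.rangePartition(k, Seq(25, 50, 75)))` keeps neighbouring keys together, while grouping by `hashPartition(k, 4)` spreads them without regard to order.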

Failures mean partitions can disappear…

…but they can be reconstructed!
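A rough sketch of why a lost partition is recoverable: if each partition is defined by a pure function of its parent data, a missing copy can simply be computed again. The `Recomputable` class below is a hypothetical stand-in written for this note, not Spark's implementation.

```scala
import scala.collection.mutable

object LineageSketch {
  // Partition i is described by the function `compute`, so any
  // materialized copy is disposable and can be rebuilt on demand.
  class Recomputable[T](numPartitions: Int, compute: Int => Seq[T]) {
    private val materialized = mutable.Map.empty[Int, Seq[T]]
    var computeCalls = 0 // counts how often a partition had to be (re)built

    def partition(i: Int): Seq[T] = {
      require(i >= 0 && i < numPartitions, s"no partition $i")
      materialized.getOrElseUpdate(i, { computeCalls += 1; compute(i) })
    }

    // Simulate a node failure: the materialized copy of partition i is gone.
    def lose(i: Int): Unit = { materialized.remove(i); () }
  }
}
```

Asking for a partition after `lose(i)` returns the same contents as before; the only visible difference is one more call to `compute`.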

RDDs are partitioned, immutable, lazy collections

• Transformations create new RDDs that encode a dependency DAG

• Actions result in executing cluster jobs and return values to the driver
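The split between lazy transformations and eager actions can be illustrated with a tiny model in plain Scala. All names here (`Dataset`, `Source`, `Mapped`, `Filtered`) are hypothetical; the model has a single partition and a linear chain rather than a general DAG.

```scala
object LazySketch {
  // A transformation only records what to do (a node pointing at its
  // parent); an action walks the recorded chain and does the work.
  sealed trait Dataset[T] {
    def map[U](f: T => U): Dataset[U] = Mapped(this, f)        // transformation
    def filter(p: T => Boolean): Dataset[T] = Filtered(this, p) // transformation
    def collect(): Seq[T]                                       // action
    def count(): Long = collect().size.toLong                   // action
  }

  final case class Source[T](data: Seq[T]) extends Dataset[T] {
    def collect(): Seq[T] = data
  }

  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def collect(): Seq[B] = parent.collect().map(f)
  }

  final case class Filtered[T](parent: Dataset[T], p: T => Boolean) extends Dataset[T] {
    def collect(): Seq[T] = parent.collect().filter(p)
  }
}
```

Building `Source(data).map(f).filter(p)` never calls `f` or `p`; they run only once `collect()` or `count()` is invoked on the result.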

RDDs (more formally)

• A set of partitions

• Lineage information

• A function to compute partitions from parent partitions

• A partitioning strategy (optional)

• Preferred locations for partitions (optional)
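One way to read the five properties above is as an interface. The trait below is a simplified sketch with hypothetical names (`SketchRDD`, `Parallelized`, `MappedRDD`); partitions are plain indices, and the two optional properties default to "not provided".

```scala
object RddShapeSketch {
  trait SketchRDD[T] {
    def partitions: Seq[Int]                   // a set of partitions (indices here)
    def parents: Seq[SketchRDD[_]]             // lineage information
    def compute(partition: Int): Iterator[T]   // build a partition from parent partitions
    def partitioner: Option[Any => Int] = None // optional partitioning strategy
    def preferredLocations(partition: Int): Seq[String] = Nil // optional placement hints
  }

  // A root dataset: no parents, partitions come straight from local slices.
  class Parallelized[T](slices: Vector[Seq[T]]) extends SketchRDD[T] {
    def partitions: Seq[Int] = slices.indices
    def parents: Seq[SketchRDD[_]] = Nil
    def compute(partition: Int): Iterator[T] = slices(partition).iterator
  }

  // A derived dataset: each partition is computed from the parent's partition.
  class MappedRDD[A, B](parent: SketchRDD[A], f: A => B) extends SketchRDD[B] {
    def partitions: Seq[Int] = parent.partitions
    def parents: Seq[SketchRDD[_]] = Seq(parent)
    def compute(partition: Int): Iterator[B] = parent.compute(partition).map(f)
  }
}
```

In this sketch, a derived dataset stores no data of its own; everything it needs is its parent and the function `f`.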

Creating RDDs

• From a collection: parallelize()

• …a local or remote file: textFile()

• …or HDFS: hadoopFile(); sequenceFile(); objectFile()

• (These all act lazily)

RDD[T] transformations

• map(f: T => U): RDD[U]

• flatMap(f: T => Seq[U]): RDD[U]

• filter(f: T => Boolean): RDD[T]

• distinct(): RDD[T]

• keyBy(f: T => K): RDD[(K, T)]
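Most of these operations mirror methods on ordinary Scala collections, so their meaning can be tried locally. The sketch below uses a plain `Seq` as a stand-in for an RDD (an assumption made for illustration; a real RDD is distributed and lazy), and spells out `keyBy` with `map`, since collections have no method of that name.

```scala
object TransformationsSketch {
  val words: Seq[String] = Seq("spark", "rdd", "spark", "scala")

  val lengths: Seq[Int] = words.map(_.length)             // map
  val letters: Seq[Char] = words.flatMap(_.toSeq)         // flatMap
  val longWords: Seq[String] = words.filter(_.length > 3) // filter
  val unique: Seq[String] = words.distinct                // distinct

  // keyBy(f) pairs each element with the key f(element).
  val byInitial: Seq[(Char, String)] = words.map(w => (w.head, w))
}
```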

RDD[(K,V)] transformations

• sortByKey(): RDD[(K, V)]

• groupByKey(): RDD[(K, Iterable[V])]

• reduceByKey(f: (V, V) => V): RDD[(K, V)]

• join(other: RDD[(K, W)]): RDD[(K, (V, W))]

…and many others, including Cartesian product, cogroup, set operations, etc.
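The key-value operations can likewise be imitated on a local `Seq[(K, V)]`. This is a sketch of what each operation means, not of how Spark executes it: the sample data is made up, and nothing here involves a shuffle.

```scala
object PairTransformationsSketch {
  val sales: Seq[(String, Int)] = Seq("b" -> 2, "a" -> 1, "b" -> 5)
  val names: Seq[(String, String)] = Seq("a" -> "apples", "b" -> "bananas")

  // sortByKey: order pairs by their key
  val sorted: Seq[(String, Int)] = sales.sortBy(_._1)

  // groupByKey: collect all values that share a key
  val grouped: Map[String, Seq[Int]] =
    sales.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // reduceByKey: combine the values of each key with a binary function
  val reduced: Map[String, Int] =
    grouped.map { case (k, vs) => k -> vs.reduce(_ + _) }

  // join: pair up values from both sides that share a key (inner join)
  val joined: Seq[(String, (Int, String))] =
    for { (k, v) <- sales; (k2, w) <- names if k == k2 } yield (k, (v, w))
}
```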

Other RDD transformations

• Explicitly repartition and shuffle or coalesce to fewer partitions

• Provide hints to cache an intermediate RDD in memory or persist it to memory and/or disk

(Remember: all transformations are lazy)
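As a rough picture of coalescing, the sketch below merges neighbouring partitions into fewer, larger ones without moving individual records between groups. The function is hypothetical and ignores data locality, which a real scheduler would take into account.

```scala
object CoalesceSketch {
  // Merge adjacent partitions into at most `target` groups.
  // Records stay with their original neighbours, so no shuffle is modelled.
  def coalesce[T](partitions: Vector[Seq[T]], target: Int): Vector[Seq[T]] = {
    require(target > 0, "target must be positive")
    if (partitions.isEmpty) partitions
    else {
      val groupSize = math.ceil(partitions.size.toDouble / target).toInt
      partitions.grouped(groupSize).map(_.flatten).toVector
    }
  }
}
```

A full repartition differs in that every record may be reassigned, for example by hashing its key, which is what makes it a shuffle.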

RDD[T] actions

• collect(): Array[T]

• count(): Long

• reduce(f: (T, T) => T): T

• saveAsTextFile(path)

• saveAsSequenceFile(path)

…and many others, including foreach, take, sampling, etc.

(Remember: all actions are eager)

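Example: word count in Spark

    // `spark` is assumed to be a SparkContext (named `sc` in spark-shell)
    val file = spark.textFile("hdfs://...")

    // split lines into words, pair each word with 1, then sum the 1s per word
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://...")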

Libraries

Spark libraries

• Graph: fixed-point computation

• SQL: query execution

• ML: model parameter optimization

• Streaming

• All built on Spark core, scheduled ad hoc or via Mesos or YARN

Spark MLlib

• Implementations of classic learning algorithms on data in RDDs

• Regression, classification, clustering, recommendation, filtering, etc.

• High performance due to caching, in-memory execution

Spark SQL features

• SchemaRDD type

• Catalyst: a relational algebra analysis/optimization framework

• SQL and HiveQL implementations; Hive warehouse support

• LINQ-style embedded query DSL

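Spark SQL example

    case class Trackpoint(lat: Double, lon: Double, ts: Long)

    // assume points is an RDD of Trackpoints, and that a SQLContext's
    // members are in scope (e.g. via import sqlContext._)
    points.registerAsTable("points")

    // intent: keep the points whose ts is within 600 of the latest ts
    val results = sql("""select * from points where ts > max(ts) - 600""")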

Spark Streaming

• Goal: use the same abstraction for streaming as for batch or interactive

• Discretized stream abstraction: streams as sequences of RDDs

[Diagram: input stream → streaming engine → windowed data (RDDs) → Spark → processed data (RDDs)]
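The discretized-stream idea can be sketched with ordinary collections: cut the input into small time-based batches, then treat each batch (or a window of recent batches) as a regular dataset. The `Event` type and both functions are hypothetical, and intervals that contain no events simply produce no batch in this simplified version.

```scala
object DStreamSketch {
  final case class Event(timestampMs: Long, value: Int)

  // Cut a stream of events into micro-batches, one per time interval,
  // returned in time order.
  def discretize(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timestampMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  // A window is the concatenation of `width` consecutive batches.
  def windows(batches: Seq[Seq[Event]], width: Int): Seq[Seq[Event]] =
    batches.sliding(width).map(_.flatten).toSeq
}
```

Once the stream is a sequence of batches, the same function (a sum or a word count, say) can be applied to each batch or window just as it would be to a batch dataset.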

Community & Applications

Developer community

• https://github.com/apache/spark

• First open-source release in 2010

• 121k lines of code

• 300+ contributors all-time; 80+ in the last month

• Active and friendly mailing list

User community

• Spark Summit: established in 2013; growing and expanding (more than 2x as big in ’14; Spark Summit East: NYC, 2015)

• Lots of cool applications (BI, medical, geospatial, security, fun)

• Many learning resources

http://spark-summit.org/2014/agenda

[email protected] http://chapeau.freevariable.com