Apache Spark - San Diego Big Data Meetup Jan 14th 2015
TRANSCRIPT
![Page 1: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/1.jpg)
CONFIDENTIAL - RESTRICTED
Introduction to Spark
SDBigData Meetup #6
January 14th 2015
Maxime Dumas
Systems Engineer, Cloudera
![Page 2: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/2.jpg)
Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• avid cyclist
• from Montreal, Canada
![Page 3: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/3.jpg)
What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
![Page 4: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/4.jpg)
The Apache Hadoop Ecosystem
Quick and dirty, for context.
![Page 5: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/5.jpg)
Why Hadoop?
• Scalability
• Simply scales just by adding nodes
• Local processing to avoid network bottlenecks
• Efficiency
• Cost efficiency (<$1k/TB) on commodity hardware
• Unified storage, metadata, security (no duplication or synchronization)
• Flexibility
• All kinds of data (blobs, documents, records, etc.)
• In all forms (structured, semi-structured, unstructured)
• Store anything then later analyze what you need
©2014 Cloudera, Inc. All rights reserved.
![Page 6: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/6.jpg)
Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
![Page 7: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/7.jpg)
HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
![Page 8: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/8.jpg)
Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [OSCON ’07]
![Page 9: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/9.jpg)
MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
![Page 10: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/10.jpg)
Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
![Page 11: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/11.jpg)
CDH: the App Store for Hadoop
[Diagram: CDH platform layers. Integration; Storage; Resource Management; Metadata; Security; processing engines: Batch Processing (MapReduce), Analytic MPP DBMS, Search Engine, In-Memory, Machine Learning, NoSQL DBMS, …; System Management; Data Management; Support]
![Page 12: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/12.jpg)
Introduction to Apache Spark
Credits:
• Ben White
• Todd Lipcon
• Ted Malaska
• Jairam Ranganathan
• Jayant Shekhar
• Sandy Ryza
![Page 13: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/13.jpg)
Can we improve on MR?
• Problems with MR:
• Very low-level: requires a lot of code to do simple things
• Very constrained: everything must be described as “map” and “reduce”. Powerful but sometimes difficult to think in these terms.
![Page 14: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/14.jpg)
Can we improve on MR?
• Two approaches to improve on MapReduce:
1. Special purpose systems to solve one problem domain well.
• Giraph / Graphlab (graph processing)
• Storm (stream processing)
• Impala (real-time SQL)
2. Generalize the capabilities of MapReduce to provide a richer foundation to solve problems.
• Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)
Both are viable strategies depending on the problem!
![Page 15: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/15.jpg)
What is Apache Spark?
Spark is a general purpose computational framework
Retains the advantages of MapReduce:
• Linear scalability
• Fault-tolerance
• Data Locality based computations
…but offers so much more:
• Leverages distributed memory for better performance
• Supports iterative algorithms that are not feasible in MR
• Improved developer experience
• Full Directed Graph expressions for data parallel computations
• Comes with libraries for machine learning, graph analysis, etc.
![Page 16: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/16.jpg)
What is Apache Spark?
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
One of the largest open source projects in big data:
• 170+ developers contributing
• 30+ companies contributing
• 400+ discussions per month on the mailing list
![Page 17: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/17.jpg)
Popular project
![Page 18: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/18.jpg)
Getting started with Spark
• Java API
• Interactive shells:
• Scala (spark-shell)
• Python (pyspark)
![Page 19: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/19.jpg)
Execution modes
![Page 20: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/20.jpg)
Execution modes
• Standalone Mode
• Dedicated master and worker daemons
• Dedicated Spark runtime with static resource limits
• YARN Client Mode
• Launches a YARN application with the driver program running locally
• YARN Cluster Mode
• Launches a YARN application with the driver program running in the YARN ApplicationMaster
• Both YARN modes: dynamic resource management between Spark, MR, Impala…
![Page 21: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/21.jpg)
Spark Concepts
![Page 22: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/22.jpg)
RDD – Resilient Distributed Dataset
• Collections of objects partitioned across a cluster
• Stored in RAM or on Disk
• You can control persistence and partitioning
• Created by:
• Distributing local collection objects
• Transformation of data in storage
• Transformation of RDDs
• Automatically rebuilt on failure (resilient)
• Contains lineage to compute from storage
• Lazy materialization
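To make "lineage" and "lazy materialization" concrete, here is a minimal single-process sketch in plain Python. It is not the Spark API: `ToyRDD`, `load`, and `reads` are hypothetical names invented for illustration. Each transformation only records a step, and nothing touches the source until `collect()` is called.

```python
class ToyRDD:
    """A toy, single-process stand-in for an RDD (hypothetical, not Spark)."""

    def __init__(self, source, lineage=()):
        self.source = source      # function that returns the base records
        self.lineage = lineage    # recorded (kind, func) steps, not yet run

    def map(self, func):
        # Record the step; do not compute anything yet.
        return ToyRDD(self.source, self.lineage + (("map", func),))

    def filter(self, func):
        return ToyRDD(self.source, self.lineage + (("filter", func),))

    def collect(self):
        # Materialize: read the source, then replay the lineage in order.
        records = list(self.source())
        for kind, func in self.lineage:
            if kind == "map":
                records = [func(r) for r in records]
            else:
                records = [r for r in records if func(r)]
        return records


reads = []  # tracks how often the "storage" is actually read


def load():
    reads.append("read")
    return range(10)


numbers = ToyRDD(load)
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
reads_before_action = len(reads)   # still 0: nothing has been read
result = evens_squared.collect()   # triggers the read and the two steps
```

Because the object keeps the source plus the list of steps, it can be rebuilt at any time by replaying them, which is the property the "automatically rebuilt on failure" bullet relies on.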
![Page 23: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/23.jpg)
RDD transformations
![Page 24: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/24.jpg)
Operations on RDDs
Transformations lazily transform an RDD into a new RDD
• map
• flatMap
• filter
• sample
• join
• sort
• reduceByKey
• …
Actions run computation to return a value
• collect
• reduce(func)
• foreach(func)
• count
• first, take(n)
• saveAs
• …
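The transformation/action split can be felt without a cluster. The sketch below uses Python 3's built-in lazy iterators as a rough analogy (an assumption for illustration, not Spark itself): `map` and `filter` do no work until something consumes them, and `sum` plays the role of an action. The `parse` helper and `calls` list are hypothetical.

```python
calls = []  # records every line that is actually parsed


def parse(line):
    calls.append(line)
    return int(line)


lines = ["1", "2", "3", "4"]

# "Transformations": these return lazy iterators, so no parsing happens here.
parsed = map(parse, lines)
odd = filter(lambda n: n % 2 == 1, parsed)
calls_before_action = len(calls)

# "Action": consuming the iterator forces the whole pipeline to run.
total = sum(odd)
```

One difference worth keeping in mind: a Python iterator can be consumed only once, whereas an RDD can be evaluated again by any later action.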
![Page 25: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/25.jpg)
Fault Tolerance
• RDDs contain lineage.
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data
msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD
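As a rough illustration of recomputing a lost partition from lineage, the plain-Python sketch below reuses the slide's filter and map steps. The sample log lines, `source_partitions`, `compute_partition`, and the `cache` dictionary are all hypothetical stand-ins; real Spark tracks this per RDD and per partition inside the scheduler.

```python
# Hypothetical source data split into two partitions (stand-in for HDFS blocks).
source_partitions = {
    0: ["ERROR\tdb\ttimeout", "INFO\tweb\tok"],
    1: ["ERROR\tweb\t500", "ERROR\tdb\tdeadlock"],
}

# Lineage: the ordered list of transformations from the example above.
lineage = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]


def compute_partition(pid):
    """Rebuild one partition from its source data by replaying the lineage."""
    records = source_partitions[pid]
    for kind, func in lineage:
        if kind == "filter":
            records = [r for r in records if func(r)]
        else:
            records = [func(r) for r in records]
    return records


# Compute and "cache" every partition once.
cache = {pid: compute_partition(pid) for pid in source_partitions}

# Simulate losing the node that held partition 1, then rebuild only that one.
lost = cache.pop(1)
cache[1] = compute_partition(1)
```

Only the missing partition is recomputed; partition 0 stays in the cache untouched.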
![Page 26: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/26.jpg)
Examples
![Page 27: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/27.jpg)
Word Count in MapReduce
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
![Page 28: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/28.jpg)
Word Count in Spark
sc.textFile("words")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()
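For readers without a Spark shell at hand, the same four steps can be traced in plain Python. This is only a local, single-machine analogy, and the two sample lines are made-up data standing in for the "words" file.

```python
from collections import defaultdict
from itertools import chain

# Made-up input standing in for sc.textFile("words").
lines = ["to be or not to be", "to do"]

# flatMap: split every line and flatten the results into one stream of words.
words = chain.from_iterable(line.split(" ") for line in lines)

# map: pair each word with the count 1.
pairs = ((word, 1) for word in words)

# reduceByKey(_ + _): add up the counts for each distinct word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

# collect: bring the final (word, count) pairs back as a local list.
result = sorted(counts.items())
```

In Spark the `reduceByKey` step runs in parallel across partitions and shuffles data by key; the loop here only mirrors the per-key addition.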
![Page 29: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/29.jpg)
Logistic Regression
• Read two sets of points
• Look for a plane W that separates them
• Perform gradient descent:
• Start with random W
• On each iteration, sum a function of W over the data
• Move W in a direction that improves it
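The gradient descent loop described above can be sketched in plain Python. Everything specific here is an assumption made for illustration: the six toy points, the labels +1/-1, the step size of 0.1, the 100 iterations, the fixed random seed, and the choice of a plane through the origin (no bias term). In Spark, the inner summation would be a `map` followed by a `reduce` over an RDD of points.

```python
import math
import random

# Hypothetical, linearly separable toy data: ((x1, x2), label), label is +1 or -1.
points = [
    ((2.0, 1.5), 1), ((1.0, 2.5), 1), ((3.0, 2.0), 1),
    ((-2.0, -1.0), -1), ((-1.5, -2.5), -1), ((-3.0, -2.0), -1),
]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def gradient(w, x, y):
    # Gradient of the logistic loss log(1 + exp(-y * w.x)) for a single point.
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


random.seed(42)  # fixed seed so the "random W" is reproducible
w = [random.uniform(-1.0, 1.0) for _ in range(2)]

step = 0.1  # assumed step size
for _ in range(100):
    # Sum a function of W over the data.
    total = [0.0, 0.0]
    for x, y in points:
        g = gradient(w, x, y)
        total = [t + gi for t, gi in zip(total, g)]
    # Move W in a direction that improves it.
    w = [wi - step * ti for wi, ti in zip(w, total)]

predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
```

The loop reads the same data set on every iteration, which is why keeping the points in memory matters so much for this kind of algorithm.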
![Page 30: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/30.jpg)
Intuition
![Page 31: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/31.jpg)
Logistic Regression
![Page 32: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/32.jpg)
Logistic Regression Performance
![Page 33: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/33.jpg)
Spark and Hadoop: a Framework within a Framework
![Page 34: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/34.jpg)
![Page 35: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/35.jpg)
[Diagram: CDH platform layers. Integration; Storage; Resource Management; Metadata; Security; processing engines: HBase, Impala, Solr, Spark, MapReduce, …; System Management; Data Management; Support]
![Page 36: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/36.jpg)
Spark Streaming
• Takes the concept of RDDs and extends it to DStreams
• Fault-tolerant like RDDs
• Transformable like RDDs
• Adds new “rolling window” operations
• Rolling averages, etc.
• But keeps everything else!
• Regular Spark code works in Spark Streaming
• Can still access HDFS data, etc.
• Example use cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS.
• Detecting anomalous behavior and triggering alerts.
• Continuous reporting of summary metrics for incoming data.
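A "rolling window" over micro-batches can be pictured with a small plain-Python sketch. It is not the Spark Streaming API: `windowed_averages`, the sample batches, and the window length of two batches are all hypothetical choices for illustration.

```python
from collections import deque


def windowed_averages(batches, window_length):
    """Yield the average of all values in the last `window_length` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    for batch in batches:
        window.append(batch)
        values = [v for b in window for v in b]
        yield sum(values) / len(values) if values else None


# Each inner list stands for the records that arrived in one batch interval.
batches = [[1, 3], [5], [7, 9], [11]]
averages = list(windowed_averages(batches, window_length=2))
```

Each incoming batch is treated as a small, ordinary data set, so the same per-batch logic used in batch jobs can be reused as the stream advances.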
![Page 37: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/37.jpg)
Micro-batching for on-the-fly ETL
![Page 38: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/38.jpg)
What about SQL?
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
![Page 39: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/39.jpg)
Fault Recovery Recap
• RDDs store dependency graph
• Because RDDs are deterministic:
• Missing RDDs are rebuilt in parallel on other nodes
• Stateful RDDs can have infinite lineage
• Periodic checkpoints to disk clear lineage
• Faster recovery times
• Better handling of stragglers vs row-by-row streaming
![Page 40: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/40.jpg)
Why Spark?
• Flexible like MapReduce
• High performance
• Machine learning, iterative algorithms
• Interactive data explorations
• Concise, easy API for developer productivity
![Page 41: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/41.jpg)
Demo Time!
• Log file Analysis
• Machine Learning
• Spark Streaming
![Page 42: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/42.jpg)
What’s Next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Try it online: Cloudera Live
• Cloudera provides pre-loaded VMs
• http://tiny.cloudera.com/quickstartvm
![Page 43: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/43.jpg)
Questions?
Preferably related to the talk… or not.
![Page 45: Apache Spark - San Diego Big Data Meetup Jan 14th 2015](https://reader034.vdocument.in/reader034/viewer/2022052700/55a638a91a28ab701e8b46f5/html5/thumbnails/45.jpg)