Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DESCRIPTION
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.

TRANSCRIPT
![Page 1: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/1.jpg)
Apache Spark
Easy and Fast Big Data Analytics
Pat McDonough
![Page 2: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/2.jpg)
Founded by the creators of Apache Spark out of UC Berkeley’s AMPLab
Fully committed to 100% open source Apache Spark
Support and Grow the Spark Community and Ecosystem
Building Databricks Cloud
![Page 3: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/3.jpg)
Databricks & Datastax
Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5
Databricks & Datastax Have Partnered for Apache Spark Engineering and Support
![Page 4: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/4.jpg)
Big Data Analytics
Where We’ve Been
• 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop
• 2006 & 2007 - Google BigTable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, and Others
![Page 5: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/5.jpg)
Big Data Analytics
A Zoo of Innovation
![Page 9: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/9.jpg)
What's Working?
Many Excellent Innovations Have Come From Big Data Analytics:
• Distributed & Data-Parallel Processing is Disruptive... Because We Needed It
• We Now Have Massive Throughput… Solved the ETL Problem
• The Data Hub/Lake Is Possible
![Page 10: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/10.jpg)
What Needs to Improve? Go Beyond MapReduce
MapReduce is a Very Powerful and Flexible Engine
Processing Throughput Previously Unobtainable on Commodity Equipment
But MapReduce Isn’t Enough:
• Essentially Batch-only
• Inefficient with respect to memory use, latency
• Too Hard to Program
![Page 11: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/11.jpg)
What Needs to Improve? Go Beyond (S)QL
SQL Support Has Been A Welcome Interface on Many Platforms
And in many cases, a faster alternative
But SQL Is Often Not Enough:
• Sometimes you want to write real programs (Loops, variables, functions, existing libraries) but don’t want to build UDFs.
• Machine Learning (see above, plus iterative)
• Multi-step pipelines
• Often an Additional System
![Page 12: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/12.jpg)
What Needs to Improve? Ease of Use
Big Data Distributions Provide a Number of Useful Tools and Systems
Choices are Good to Have
But This Is Often Unsatisfactory:
• Each new system has its own configs, APIs, and management; coordination of multiple systems is challenging
• A typical solution requires stringing together disparate systems - we need unification
• Developers want the full power of their programming language
![Page 13: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/13.jpg)
What Needs to Improve? Latency
Big Data systems are throughput-oriented
Some new SQL Systems provide interactivity
But We Need More:
• Interactivity beyond SQL interfaces
• Repeated access of the same datasets (i.e. caching)
![Page 14: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/14.jpg)
Can Spark Solve These Problems?
![Page 15: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/15.jpg)
Apache Spark
Originally developed in 2009 in UC Berkeley’s AMPLab
Fully open sourced in 2010 – now at Apache Software Foundation
http://spark.apache.org
![Page 16: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/16.jpg)
Project Activity

| | June 2013 | June 2014 |
|---|---|---|
| Total contributors | 68 | 255 |
| Companies contributing | 17 | 50 |
| Total lines of code | 63,000 | 175,000 |
![Page 18: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/18.jpg)
Compared to Other Projects
[Bar charts: Commits (0–1200) and Lines of Code Changed (0–300,000), activity in past 6 months, Spark vs. other projects]
Spark is now the most active project in the Hadoop ecosystem
![Page 20: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/20.jpg)
Spark on Github
So active on Github, sometimes we break it
Over 1200 Forks (can’t display Network Graphs)
~80 commits to master each week
So many PRs We Built our own PR UI
![Page 21: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/21.jpg)
Apache Spark - Easy to Use And Very Fast
Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra
Improved Efficiency: • In-memory computing primitives
• General computation graphs
Improved Usability: • Rich APIs
• Interactive shell
Up to 100× faster (2-10× on disk)
2-5× less code
![Page 23: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/23.jpg)
Apache Spark - A Robust SDK for Big Data Applications
[Stack diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]
Unified System With Libraries to Build a Complete Solution
Full-featured Programming Environment in Scala, Java, Python…
Very Developer-friendly, Functional API for Working with Data
Runtimes Available on Several Platforms
![Page 24: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/24.jpg)
Spark Is A Part Of Most Big Data Platforms
• All Major Hadoop Distributions Include Spark
• Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE
• Spark Applications Can Be Written Once and Deployed Anywhere
[Stack diagram: SQL, Machine Learning, Streaming, and Graph libraries on Spark Core — Deploy Spark Apps Anywhere]
![Page 25: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/25.jpg)
Easy: Get Started Immediately
Interactive Shell Multi-language support
Python: lines = sc.textFile(...) lines.filter(lambda s: "ERROR" in s).count()
Scala: val lines = sc.textFile(...) lines.filter(x => x.contains("ERROR")).count()
Java: JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains("ERROR"); }}).count();
![Page 26: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/26.jpg)
Easy: Clean API
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Write programs in terms of transformations on distributed datasets
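The transformation/action split above can be mimicked, as a rough local analogy only, with Python's built-in lazy map and filter: transformations compose a pipeline without touching the data, and nothing runs until an "action" forces evaluation (the data here is invented for illustration).

```python
# Toy, single-machine analogy for the RDD model (no Spark involved):
# transformations build a lazy pipeline; an action triggers evaluation.
data = ["INFO ok", "ERROR disk", "ERROR net"]

# "Transformations" — lazily composed, like filter/map on an RDD.
errors = filter(lambda s: s.startswith("ERROR"), data)
codes = map(lambda s: s.split(" ")[1], errors)

# "Action" — forces the whole pipeline to run, like count() or collect().
result = list(codes)
print(result)  # ['disk', 'net']
```

Nothing is read or computed until `list(codes)` runs, which is the same reason Spark can plan and optimize an RDD pipeline before executing it.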
![Page 27: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/27.jpg)
Easy: Expressive APImap reduce
![Page 28: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/28.jpg)
Easy: Expressive APImap filter groupBy sort union join leftOuterJoin rightOuterJoin
reduce count fold reduceByKey groupByKey cogroup cross zip
sample take first partitionBy mapWith pipe save ...
![Page 29: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/29.jpg)
Easy: Example – Word Count
Hadoop MapReduce:
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
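For intuition, the three Spark steps (flatMap, map, reduceByKey) can be replayed in plain Python on a tiny in-memory list — purely illustrative, with invented input; a real run distributes each step across the cluster.

```python
from collections import defaultdict

# Plain-Python walk-through of the Spark word count above.
lines = ["to be or", "not to be"]

# flatMap: split every line into words, flattening into one list
words = [w for line in lines for w in line.split(" ")]
# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]
# reduceByKey(_ + _): sum the counts for each distinct word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```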
![Page 31: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/31.jpg)
Easy: Works Well With Hadoop
Data Compatibility
• Access your existing Hadoop Data
• Use the same data formats
• Adheres to data locality for efficient processing
Deployment Models
• “Standalone” deployment
• YARN-based deployment
• Mesos-based deployment
• Deploy on existing Hadoop cluster or side-by-side
![Page 32: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/32.jpg)
Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w
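The same training loop can be run locally with NumPy alone to see what each iteration computes — no Spark, and the dataset, dimension, iteration count, and the scaling of the gradient by the number of points (to keep the fixed step size stable) are all choices made here for illustration, not taken from the slide.

```python
import numpy as np

# Local NumPy sketch of the slide's logistic-regression loop (no Spark).
rng = np.random.default_rng(0)
D = 4
X = rng.standard_normal((200, D))            # the points' p.x
true_w = np.array([1.0, -2.0, 0.5, 3.0])     # invented ground truth
y = np.where(X @ true_w > 0, 1.0, -1.0)      # the points' p.y, in {-1, +1}

w = rng.standard_normal(D)                   # random initial plane
for _ in range(50):
    # Per-point term from the slide: (1/(1+exp(-y*w.x)) - 1) * y * x,
    # combined over all points (Spark's map + reduce; here a mean).
    margins = y * (X @ w)
    coeffs = (1.0 / (1.0 + np.exp(-margins)) - 1.0) * y
    gradient = (coeffs[:, None] * X).mean(axis=0)
    w -= gradient

accuracy = float(np.mean(np.sign(X @ w) == y))
print("Final w: %s" % w)
```

Because `data` is cached, Spark reuses the in-memory points on every pass of this loop, which is exactly where the iteration speedup on the next slide comes from.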
![Page 33: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/33.jpg)
Fast: Using RAM, Operator Graphs
In-memory Caching
• Data Partitions read from RAM instead of disk
Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
[DAG diagram: RDDs A–F connected by map, filter, join, and groupBy operators, divided into Stages 1–3; legend marks RDDs and cached partitions]
![Page 34: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/34.jpg)
Fast: Logistic Regression Performance
[Chart: Running Time (s), 0–4000, vs. Number of Iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
![Page 35: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/35.jpg)
Fast: Scales Down Seamlessly
[Bar chart: Execution time (s) vs. % of working set in cache — 68.8 s with cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully cached]
![Page 36: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/36.jpg)
Easy: Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data
msgs = textFile.filter(lambda s: s.startsWith("ERROR")).map(lambda s: s.split("\t")[2])
[Diagram: HDFS File → filter(func = startsWith(…)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
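A toy model of the lineage idea, in plain Python with invented class and method names (not Spark's implementation): each derived dataset remembers only its parent and the function that produced it, so a lost result can be recomputed on demand instead of replicated up front.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers lineage (parent + function)."""
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn
        self._cache = None

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        if self._cache is None:  # lost (or never built): recompute from lineage
            rows = self.source if self.parent is None else self.parent.compute()
            self._cache = rows if self.fn is None else self.fn(rows)
        return self._cache

logs = ToyRDD(source=["ERROR\ta\tdb down", "INFO\tb\tok", "ERROR\tc\tnet"])
msgs = logs.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
print(msgs.compute())   # ['db down', 'net']
msgs._cache = None      # simulate losing the cached partition
print(msgs.compute())   # rebuilt from lineage, same result
```

Recording the recipe rather than copying the data is what makes Spark's fault recovery cheap in the common case.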
![Page 37: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/37.jpg)
How Spark Works
![Page 38: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/38.jpg)
Working With RDDs

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count() # 74
linesWithSpark.first() # Apache Spark

[Diagram built up across Pages 39–41: Transformations produce new RDDs from existing RDDs; an Action returns a Value to the driver.]
![Page 42: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/42.jpg)
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Animation across Pages 43–58: a Driver coordinates three Workers holding Blocks 1–3 of the HDFS file. The first count() is an Action: the Driver sends tasks to the Workers; each Worker reads its HDFS block, processes and caches the data (Cache 1–3), and returns results. The second count() is served entirely from cache: the Driver sends tasks, and each Worker processes from its cache and returns results, with no HDFS reads.]

Cache your data ➔ Faster Results
Full-text search of Wikipedia
• 60GB on 20 EC2 machines
• 0.5 sec from cache vs. 20s for on-disk
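The same session can be replayed in plain Python on an in-memory list standing in for the HDFS file — no cluster, and the log lines are invented for illustration — to see exactly what each stage keeps.

```python
# The log-mining example above, on a single machine.
lines = [
    "ERROR\t12:01\tmysql connection refused",
    "INFO\t12:02\tstartup complete",
    "ERROR\t12:03\tphp fatal error",
    "ERROR\t12:04\tmysql timeout",
]
errors = [s for s in lines if s.startswith("ERROR")]
messages = [s.split("\t")[2] for s in errors]   # the "cached" dataset

mysql_count = sum("mysql" in s for s in messages)
php_count = sum("php" in s for s in messages)
print(mysql_count, php_count)  # 2 1
```

In Spark the `messages` list is the cached RDD: it is built once on the first query and every later pattern search scans it straight from memory.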
![Page 59: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/59.jpg)
Cassandra + Spark: A Great Combination
Both are Easy to Use
Spark Can Help You Bridge Your Hadoop and Cassandra Systems
Use Spark Libraries and Caching on top of Cassandra-stored Data
Combine Spark Streaming with Cassandra Storage
Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
![Page 60: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms](https://reader033.vdocument.in/reader033/viewer/2022042613/554126554a79596b218b4575/html5/thumbnails/60.jpg)
Schema RDDs (Spark SQL)
• Built-in Mechanism for recognizing Structured data in Spark
• Allow for systems to apply several data access and relational optimizations (e.g. predicate push-down, partition pruning, broadcast joins)
• Columnar in-memory representation when cached
• Native Support for structured formats like Parquet and JSON
• Great Compatibility with the Rest of the Stack (python, libraries, etc.)
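A toy illustration, in plain Python with invented data, of why the columnar layout mentioned above helps: storing one list per column means a predicate touches only the columns it mentions, instead of deserializing whole rows.

```python
# Toy columnar table: one list per column, rather than one dict per row.
# (Spark SQL's cached representation is column-wise for the same reason.)
table = {
    "name": ["ann", "bob", "carl"],
    "age":  [34, 28, 41],
    "city": ["sf", "nyc", "sf"],
}

# A predicate like age > 30 scans just the 'age' column...
keep = [i for i, age in enumerate(table["age"]) if age > 30]
# ...and only the requested column is materialized for surviving rows.
names = [table["name"][i] for i in keep]
print(names)  # ['ann', 'carl']
```

The same shape of reasoning explains predicate push-down and partition pruning: the less data a query is forced to read, the faster it runs.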