Why Spark by Stratio - v.1.0

Page 3: Why spark by Stratio - v.1.0

WHY SPARK?


“In Spark, we explicitly wanted to come up with a single programming model that is very general that covers these interactive [SQL] use cases, the streaming ones, the more complex applications.

I think the thing that really sets Spark apart compared to some other systems that tackle these is that it can actually do all of them. You only have to learn one system and you can easily make an application that combines these. It’s only one thing to manage, and I think that’s what gets people interested in it.”

Databricks co-founder and CTO Matei Zaharia (source)
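As a rough sketch of that single programming model (assuming Spark 1.3+ with the spark-core and spark-sql modules on the classpath; the file path, case class, and query below are made up), the same SparkContext serves both plain RDD transformations and interactive SQL, and Spark Streaming or MLlib would plug into it in the same way:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type for the example
case class Event(user: String, bytes: Long)

object UnifiedApp {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("unified-example"))
    val sql = new SQLContext(sc)
    import sql.implicits._

    // Plain RDD transformations (batch style)
    val events = sc.textFile("hdfs:///logs/events.csv")
      .map(_.split(","))
      .map(f => Event(f(0), f(1).toLong))

    // The same data, queried interactively with Spark SQL
    events.toDF().registerTempTable("events")
    sql.sql("SELECT user, SUM(bytes) FROM events GROUP BY user").show()

    sc.stop()
  }
}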

Page 4: Why spark by Stratio - v.1.0

WHY SPARK?


                  June 2013    June 2014
Contributors             68          255
Companies                17           50
Lines of code        63,000      175,000

Spark is one of the most active projects at Apache, and the most active project in the Hadoop ecosystem.

[Data source: Git logs; chart courtesy of Matei Zaharia]

Spark: real-world use cases, by Datanami

Spark's role in the Big Data ecosystem, by Databricks

Page 5: Why spark by Stratio - v.1.0

WHY SPARK?

Since Spark was open sourced, it has generated rapid interest, with over 200 contributors from 50+ organizations collaborating on the project.

Open-source contributors Cloudera, Databricks, IBM, Intel, and MapR announced last July that they were joining efforts to broaden support for Apache Spark, while simultaneously standardizing it as the framework of choice by bringing popular tools from the MapReduce world to this new engine.

Spark has quickly become a standard in many Hadoop distributions, with rapid customer adoption and use in a variety of use cases, ranging from machine learning to stream processing workloads.

Page 6: Why spark by Stratio - v.1.0

WHY SPARK?

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

All these benchmarks are public and available on the Apache Spark website.
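A minimal sketch of what the in-memory part buys you (the data path and iteration count are invented): the parsed RDD is cached once and reused across iterations, so only the first pass touches disk, whereas MapReduce would re-read the input on every iteration:

import org.apache.spark.{SparkConf, SparkContext}

object IterativeCacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-example"))

    // Parse once, keep the result in executor memory
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(_.split(" ").map(_.toDouble))
      .cache()

    // Each iteration reuses the cached RDD; only the first one reads from disk
    var sum = 0.0
    for (i <- 1 to 10) {
      sum += points.map(p => p.sum).reduce(_ + _)
    }
    println(s"total = $sum")

    sc.stop()
  }
}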

Page 7: Why spark by Stratio - v.1.0

WHY SPARK?

As a general iterative computing framework, Spark is the core of many other products, such as Spark SQL, Spark Streaming, MLlib, and GraphX.

Every contribution and improvement added to Spark core immediately benefits the other modules.

Spark SQL

Spark Streaming

MLlib (machine learning)

GraphX (graph)

RDDs, as a general data abstraction, allow Spark to talk to many file systems and databases.

In fact, any data source that supports a Hadoop InputFormat can easily be integrated into Spark.
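A sketch of that integration point, assuming a Spark 1.x SparkContext and a hypothetical HDFS path: data is read through a standard Hadoop InputFormat (TextInputFormat here), and any storage system that ships an InputFormat can be consumed the same way:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HadoopInputExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hadoop-input"))

    // Any Hadoop InputFormat works here; TextInputFormat is just the simplest case
    val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs:///data/raw/")

    println(records.map(_._2.toString).count())
    sc.stop()
  }
}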

Page 8: Why spark by Stratio - v.1.0

WHY SPARK?

“One of the challenges organizations face when adopting Hadoop is a shortage of developers who have experience building Hadoop applications.

Our professional services organization has helped dozens of companies with the development and deployment of Hadoop applications, and our training department has trained countless engineers.

Organizations are hungry for solutions that make it easier to develop Hadoop applications while increasing developer productivity, and Spark fits this bill. Spark jobs can require as little as 1/5th the number of lines of code.”

Tomer Shiran, VP of Product Management, MapR, in "MapR Integrates the Complete Spark Stack"
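The usual illustration of that productivity claim is word count: in MapReduce it needs a mapper class, a reducer class, and a driver, while in Spark it fits in a few lines (the input and output paths below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))
    // Split lines into words, count each word, and write the result
    sc.textFile("hdfs:///input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///output")
    sc.stop()
  }
}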

Page 9: Why spark by Stratio - v.1.0

WHY SPARK?

RDDs remember the sequence of operations that created them from the original fault-tolerant input data.

Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant.

Data lost due to worker failure can be recomputed from the input data.

Recovers from faults/stragglers within 1 sec

From the Spark Streaming talk at Strata Conference, February 2013
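A small sketch of both mechanisms above, with invented data and an illustrative storage level: toDebugString prints the lineage Spark would replay to rebuild lost partitions, and MEMORY_ONLY_2 keeps a replica of each cached partition on a second worker:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LineageExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage"))

    // Replicate the transformed data in the memory of two workers
    val batch = sc.parallelize(1 to 1000000)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .persist(StorageLevel.MEMORY_ONLY_2)

    // The lineage printed here is what Spark replays after a worker failure
    println(batch.toDebugString)
    println(batch.count())

    sc.stop()
  }
}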

Page 10: Why spark by Stratio - v.1.0

WHY SPARK?

“Hadoop does a pretty terrible job with machine learning. Spark is good with logistic regression, and that can help with anything that involves a binary decision: Is this message spam? Should I show this ad to this user?”

Reynold Xin (source)

Spark is amazing for iterative computing (Machine Learning algorithms) and interactive analytics.

Most ML algorithms run over the same data set iteratively, and in MapReduce there was no easy way to share state from one iteration to the next.

MLlib was added to the Spark ecosystem and is now one of its most active modules.

In addition, SparkR is on its way, and Mahout is working to incorporate the benefits of Spark while also exploring other high-performance back-ends.
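A minimal MLlib sketch for the kind of binary decision in the quote above (spam / not spam), assuming the spark-mllib module and a placeholder LIBSVM-formatted file; it trains a logistic regression model on an RDD of labeled points:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

object SpamModel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spam-model"))

    // Labeled feature vectors: label 1.0 = spam, 0.0 = not spam
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/messages.libsvm").cache()

    // Iterative training is exactly where caching the data set pays off
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)

    println(model.weights)
    sc.stop()
  }
}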

Page 11: Why spark by Stratio - v.1.0

WHY SPARK?

Spark embraces Java, Scala, and Python, and it provides a set of pre-defined APIs for building new programs.

Code with Spark on your machine and deploy to a cluster.
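One common shape of that workflow, sketched under the assumption that the master is hard-coded only while developing: local[*] runs the whole job in a single JVM on your laptop, and the same jar can later be handed to a cluster with spark-submit, which supplies the real master:

import org.apache.spark.{SparkConf, SparkContext}

object DevToCluster {
  def main(args: Array[String]): Unit = {
    // Hard-coding local[*] is convenient while developing; when the jar is
    // launched with spark-submit --master <cluster>, leave the master unset
    // and let the submitter provide it.
    val conf = new SparkConf().setAppName("dev-to-cluster").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}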

Page 12: Why spark by Stratio - v.1.0

WHY SPARK?

Spark can run on hardware clusters managed by Apache Mesos. The advantages include:
• dynamic partitioning between Spark and other frameworks
• scalable partitioning between multiple instances of Spark

If you decide to run Spark on YARN, you can decide on an application-by-application basis whether to run in YARN client mode or cluster mode. When you run Spark in client mode, the driver process runs locally; in cluster mode, it runs remotely on an ApplicationMaster.
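A sketch of how those choices appear to the application (host names and the ZooKeeper path are placeholders; the master is normally passed via spark-submit rather than hard-coded): the master URL selects the cluster manager, and the Spark 1.x-era values yarn-client and yarn-cluster select where the driver runs:

import org.apache.spark.{SparkConf, SparkContext}

object ClusterManagers {
  def main(args: Array[String]): Unit = {
    // The master URL selects the cluster manager. It is usually supplied by
    // spark-submit (--master); it is set here only to show the form it takes.
    //   Mesos:          mesos://zk://zk1:2181,zk2:2181/mesos
    //   YARN (1.x era): yarn-client  (driver runs locally)
    //                   yarn-cluster (driver runs in the ApplicationMaster)
    val conf = new SparkConf()
      .setAppName("cluster-managers")
      .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).count())
    sc.stop()
  }
}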
