
Patrick Wendell

Databricks

Spark.incubator.apache.org

Spark 1.0 and Beyond

About me
Committer and PMC member of Apache Spark

“Former” PhD student at Berkeley

Release manager for Spark 1.0

Background in networking and distributed systems

Today’s Talk

Spark background

About the Spark release process

The Spark 1.0 release

Looking forward to Spark 1.1

What is Spark?

Efficient
General execution graphs

In-memory storage

Usable
Rich APIs in Java, Scala, Python

Interactive shell

Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop

2-5× less code

Up to 10× faster on disk, 100× in memory

30-Day Commit Activity

[Bar charts comparing 30-day commit activity across MapReduce, Storm, YARN, and Spark: patches, lines added, and lines removed]

Spark Philosophy
Make life easy and productive for data scientists

Well documented, expressive APIs

Powerful domain specific libraries

Easy integration with storage systems

… and caching to avoid data movement

Predictable releases, stable APIs

Spark Release Process

Quarterly release cycle (3 months)

2 months of general development

1 month of polishing, QA and fixes

Spark 1.0: Feb 1 → April 8th → April 8th+

Spark 1.1: May 1 → July 8th → July 8th+

Spark 1.0: By the numbers
- 3 months of development

- 639 patches

- 200+ JIRA issues

- 100+ contributors

API Stability in 1.X
APIs are stable for all non-alpha projects

Spark 1.1, 1.2, … will be compatible

@DeveloperApi

Internal API that is unstable

@Experimental

User-facing API that might stabilize later
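Both annotations live in org.apache.spark.annotation. A minimal sketch of how they appear on Spark code (the class and method names below are hypothetical, for illustration only):

import org.apache.spark.annotation.{DeveloperApi, Experimental}

// @DeveloperApi: internal API that may change between releases (hypothetical class).
@DeveloperApi
class InternalShuffleHook

object StatsApi {
  // @Experimental: user-facing API that might stabilize later (hypothetical method).
  @Experimental
  def approximateCount(data: Seq[Int]): Long = data.distinct.length.toLong
}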

Today’s Talk

About the Spark release process

The Spark 1.0 release

Looking forward to Spark 1.1

Spark 1.0 Features
Core engine improvements

Spark Streaming

MLlib

Spark SQL

Spark Core
History server for Spark UI

Integration with YARN security model

Unified job submission tool

Java 8 support

Internal engine improvements

History Server
Configure with:

spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://XX

In Spark Standalone, history server is embedded in the master.

In YARN/Mesos, run history server as a daemon.
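A minimal sketch of that setup (the log path reuses the slide's placeholder; the exact daemon options can vary by version, so check the monitoring docs):

# In spark-defaults.conf for the applications being tracked:
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://XX

# On the machine serving the UI (YARN/Mesos deployments):
./sbin/start-history-server.sh hdfs://XX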

Job Submission Tool
Apps don't need to hard-code master:

conf = new SparkConf().setAppName("My App")
sc = new SparkContext(conf)

./bin/spark-submit <app-jar> \
  --class my.main.Class \
  --name myAppName \
  --master local[4]    # or: --master spark://some-cluster

Java 8 Support
RDD operations can use lambda syntax.

Old:
class Split extends FlatMapFunction<String, String> {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
}
JavaRDD<String> words = lines.flatMap(new Split());

New:
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));

Java 8 Support
NOTE: Minor API changes

(a) If you are extending Function classes, use implements rather than extends.

(b) Return-type sensitive functions

mapToPair, mapToDouble

Python API Coverage
rdd operators

intersection(), take(), top(), takeOrdered()

meta-data

name(), id(), getStorageLevel()

runtime configuration

setJobGroup(), setLocalProperty()

Integration with YARN Security
Supports Kerberos authentication in YARN environments:

spark.authenticate = true

ACL support for user interfaces:

spark.ui.acls.enable = true

spark.ui.view.acls = patrick, matei

Engine Improvements
Job cancellation directly from UI

Garbage collection of shuffle and RDD data

Documentation
Unified Scaladocs across modules

Expanded MLlib guide

Deployment and configuration specifics

Expanded API documentation

Spark stack:

- Spark core: RDDs, Transformations, and Actions
- Spark Streaming (real-time): DStreams, i.e. streams of RDDs
- Spark SQL: SchemaRDDs
- MLlib (machine learning): RDD-based matrices

Spark SQL

Turning an RDD into a Relation

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects, register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")

 

Querying using SQL

// SQL statements can be run directly on RDDs.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language integrated queries (a la LINQ).
val teenagers = people.where('age >= 10).where('age <= 19).select('name)

Import and Export

// Save SchemaRDDs directly to Parquet.
people.saveAsParquetFile("people.parquet")

// Load data stored in Hive.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

// Queries can be expressed in HiveQL.
hql("FROM src SELECT key, value")

In-Memory Columnar Storage
Spark SQL can cache tables using an in-memory columnar format:

- Scan only required columns
- Fewer allocated objects (less GC)
- Automatically selects best compression
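A minimal sketch of using it (assuming a SQLContext named sqlContext and the people table registered in the earlier slide):

// Cache the table in the in-memory columnar format.
sqlContext.cacheTable("people")

// Queries against the cached table scan only the columns they need.
val names = sqlContext.sql("SELECT name FROM people").collect()

// Free the cached columnar buffers when finished.
sqlContext.uncacheTable("people")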

Spark Streaming
Web UI for streaming

Graceful shutdown

User-defined input streams (see the sketch after this list)

Support for creating in Java

Refactored API
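A minimal sketch of a user-defined input stream built on the receiver API (the class name and the fake word source are illustrative, not from the talk):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that emits words from a fixed list, for illustration only.
class RandomWordReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    // Push data into Spark from a background thread via store().
    new Thread("random-word-receiver") {
      override def run(): Unit = {
        val words = Array("spark", "streaming", "receiver")
        while (!isStopped()) {
          store(words(scala.util.Random.nextInt(words.length)))
          Thread.sleep(100)
        }
      }
    }.start()
  }

  // Nothing to clean up: the thread exits once isStopped() returns true.
  def onStop(): Unit = {}
}

// Usage: val words = ssc.receiverStream(new RandomWordReceiver())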

MLlib
Sparse vector support (see the sketch after this list)

Decision trees

Linear algebra

SVD and PCA

Evaluation support

3 contributors in the last 6 months
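A minimal sketch of the new sparse vector support (the values are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors

// Dense and sparse representations of the same 10-dimensional vector.
val dense  = Vectors.dense(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0)
val sparse = Vectors.sparse(10, Array(0, 9), Array(1.0, 3.0))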

MLlib
Note: Minor API change

Old:
val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(s => s.split('\t').map(_.toDouble).toArray)
val clusters = KMeans.train(parsedData, 4, 100)

New:
val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val clusters = KMeans.train(parsedData, 4, 100)

1.1 and Beyond
Data import/export leveraging Catalyst: HBase, Cassandra, etc.

Shark-on-Catalyst

Performance optimizations:
External shuffle

Pluggable storage strategies

Streaming: Reliable input from Flume and Kafka

Unifying Experience
SchemaRDD represents a consistent integration point for data sources

spark-submit abstracts the environmental details (YARN, hosted cluster, etc.).

API stability across versions of Spark

Conclusion
Visit spark.apache.org for videos, tutorials, and hands-on exercises.

Help us test a release candidate!

Spark Summit on June 30th

spark-summit.org

Meetup group: meetup.com/spark-users