Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big Data
John Nestor, 47 Degrees
Seattle Unstructured Data Science Pop-Up, October 7, 2015
www.47deg.com
Why Scala?
• Strong typing
• Concise elegant syntax
• Runs on JVM (Java Virtual Machine)
• Supports both object-oriented and functional
• Small simple programs through large parallel distributed systems
• Easy to cleanly extend with new libraries and DSL’s
• Ideal for parallel and distributed systems
Scala: Strong Typing and Concise Syntax
• Strong typing like Java.
• Compile time checks
• Better modularity via strongly typed interfaces
• Easier maintenance: types make code easier to understand
• Concise syntax like Python.
• Type inference. Compiler infers most types that had to be explicit in Java.
• Powerful syntax that avoids much of the boilerplate of Java code (see next slide).
• Best of both worlds: safety of strong typing with conciseness (like Python).
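A minimal sketch of type inference in action (the object and value names are illustrative, not from the slides): every binding below is fully statically typed, yet no type annotation is needed where Java would require one.

```scala
// Type inference: the compiler infers types that Java requires explicitly.
object InferenceDemo {
  val count = 42                     // inferred as Int
  val names = List("Ada", "Grace")   // inferred as List[String]
  // Result types of simple methods can be inferred too:
  def double(n: Int) = n * 2         // inferred result type: Int
}
```

Mistyped uses (e.g. `InferenceDemo.count.toUpperCase`) are rejected at compile time, which is the "safety of strong typing" point above.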
Scala Case Class
• Java version

```java
class User {
    private String name;
    private int age;
    public User(String name, int age) {
        this.name = name;
        this.age = age;
    }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}
User joe = new User("Joe", 30);
```

• Scala version

```scala
case class User(name: String, var age: Int)
val joe = User("Joe", 30)
```
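Beyond brevity at the definition site, a case class also generates `equals`, `hashCode`, `toString`, and `copy` automatically; a small sketch (the demo object is illustrative):

```scala
// A case class gives structural equality and non-destructive updates for free.
case class User(name: String, var age: Int)

object CaseClassDemo {
  val joe      = User("Joe", 30)
  val olderJoe = joe.copy(age = 31)       // new instance, joe unchanged
  val same     = User("Joe", 30) == joe   // structural equality, not reference
}
```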
Functional Scala
• Anonymous functions.

```scala
(a: Int, b: Int) => a + b
```
• Functions that take and return other functions.
• Rarely need variables or loops
• Immutable collections: Seq[T], Map[K,V], …
• Works well with concurrent or distributed systems
• Natural for functional programming
• Functional collection operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, drop, …
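The operations listed above can be sketched on an immutable `Seq`; the sample values are illustrative:

```scala
// Functional collection operations on an immutable Seq: no loops, no mutation.
object CollectionDemo {
  val nums = Seq(5, 3, 8, 1, 9, 2)
  val evens   = nums.filter(_ % 2 == 0)   // keep even elements
  val doubled = nums.map(_ * 2)           // transform each element
  val total   = nums.reduce(_ + _)        // combine into one value
  val top3    = nums.sortBy(-_).take(3)   // three largest, descending
}
```

Each call returns a new collection; `nums` itself is never modified, which is what makes these operations safe to use concurrently.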
Scala Availability and Support
• Open Source
• Typesafe provides support. Founded by Martin Odersky, who designed Scala.
• IDEs: Intellij IDEA and Eclipse
• Libraries: lots now and more every day
• ScalaNLP - Epic (natural language processing)
• Major Scala users: LinkedIn, Twitter, Goldman Sachs, Coursera, Angie's List, Whitepages
• Major systems written in Scala: Spark, Kafka
Typesafe Scala Components
• Scala Compiler (includes REPL)
• Scala Standard Libraries
• SBT - Scala Build Tool
• Play - scalable web applications
• Scala JS - compiles Scala to JavaScript
• Akka - for parallel and distributed computation
• Spray - high-performance asynchronous TCP/HTTP library
• Spark - Typesafe also supports Spark
• Slick - for SQL database access
• ConductR - Scala deployment/devops tool
• Reactive Monitoring (Beta)
Why Spark?
• Support for not only batch but also (near) real-time
• Fast - keeps data in memory as much as possible
• Often 10X to 100X Hadoop speed
• A clean easy-to-use API
• A richer set of functional operations than just map and reduce
• A foundation for a wide set of integrated data applications
• Can recover from failures - recompute or (optional) replication
• Scales to very large data sets while reducing processing time
Spark RDDs
• RDD[T] - resilient distributed data set
• typed (must be serializable)
• immutable
• ordered
• can be processed in parallel
• lazy evaluation - permits more global optimizations
• Rich set of functional operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, drop, …
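The RDD API deliberately mirrors Scala's collection API, so a pipeline can be sketched on a plain local collection. This hypothetical word count is not an actual Spark job: with Spark, `lines` would be an `RDD[String]` built from something like `sc.textFile(...)`, and the same operation chain would run distributed across the cluster.

```scala
// RDD-style pipeline sketched on a local Scala collection (not a Spark job).
object WordCountDemo {
  val lines = Seq("spark is fast", "scala is concise", "spark is scalable")
  val counts: Map[String, Int] =
    lines
      .flatMap(_.split(" "))                 // split each line into words
      .groupBy(identity)                     // group equal words together
      .map { case (w, ws) => (w, ws.size) }  // count each group
}
```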
Spark Components
• Spark Core
• Scalable multi-node cluster
• Failure detection and recovery
• RDDs and functional operations
• MLlib - for machine learning
• linear regression, SVMs, clustering, collaborative filtering, dimension reduction
• more on the way!
• GraphX - for graph computation
• Streaming - for near real-time
• DataFrames - for SQL and JSON
Spark Availability and Support
• Open Source - top level Apache project
• Over 750 contributors from over 200 organizations
• Can process multiple petabytes on clusters of over 8000 nodes
• Databricks. Matei Zaharia, who wrote the original Spark, is a founder and CTO.
• Packages (more every day)
• Zeppelin - Scala notebooks
• Cassandra, Kafka connectors
Clusters and Scalability
• Scala Akka clusters (process distribution, micro services)
• message passing
• remote Actors
• Spark clusters (data distribution)
• local
• Standalone (optionally with ZooKeeper)
• Apache Mesos
• Hadoop YARN
• all of the above can run on Amazon and Google clouds
Why Scala for Spark?
• Why not Python, R, or Java for Spark?
• Spark is written in Scala
• Scala source code is important Spark documentation
• Spark is best extended in Scala
• The primary API for Spark is Scala
• The functional features of Scala and Spark are a natural fit and easiest to use in Scala
• If you want to build scalable, high-performance production code on Spark: R by itself is too specialized, Python is too slow, and Java is tedious to write and maintain
Seattle Resources
• Seattle Meetups
• Scala at the Sea Meetup http://www.meetup.com/Seattle-Scala-User-Group/
• Seattle Spark Meetup http://www.meetup.com/Seattle-Spark-Meetup/
• Seattle Training: Spark and Typesafe Scala Classes http://www.47deg.com/events#training
• UW Scala Professional Certificate Program http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html