Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big Data
John Nestor, 47 Degrees
Seattle Unstructured Data Science Pop-Up, October 7, 2015
www.47deg.com
Why Scala?
• Strong typing
• Concise elegant syntax
• Runs on JVM (Java Virtual Machine)
• Supports both object-oriented and functional
• Small simple programs through large parallel distributed systems
• Easy to cleanly extend with new libraries and DSL’s
• Ideal for parallel and distributed systems
Scala: Strong Typing and Concise Syntax
• Strong typing like Java.
• Compile time checks
• Better modularity via strongly typed interfaces
• Easier maintenance: types make code easier to understand
• Concise syntax like Python.
• Type inference. Compiler infers most types that had to be explicit in Java.
• Powerful syntax that avoids much of the boilerplate of Java code (see next slide).
• Best of both worlds: safety of strong typing with conciseness (like Python).
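A minimal sketch of type inference in action (the object and value names are illustrative, not from the slides): every binding below is fully statically typed, yet no type annotation is needed where Java would require one.

```scala
// Type inference: the compiler infers types that Java requires explicitly.
object InferenceDemo {
  val count = 42                     // inferred as Int
  val names = List("Ada", "Grace")   // inferred as List[String]
  // Result types of simple methods can be inferred too:
  def double(n: Int) = n * 2         // inferred result type: Int
}
```

Mistyped uses (e.g. `InferenceDemo.count.toUpperCase`) are rejected at compile time, which is the "safety of strong typing" point above.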
Scala Case Class
• Java version

```java
class User {
    private String name;
    private int age;
    public User(String name, int age) {
        this.name = name;
        this.age = age;
    }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}
User joe = new User("Joe", 30);
```

• Scala version

```scala
case class User(name: String, var age: Int)
val joe = User("Joe", 30)
```
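Beyond brevity at the definition site, a case class also generates `equals`, `hashCode`, `toString`, and `copy` automatically; a small sketch (the demo object is illustrative):

```scala
// A case class gives structural equality and non-destructive updates for free.
case class User(name: String, var age: Int)

object CaseClassDemo {
  val joe      = User("Joe", 30)
  val olderJoe = joe.copy(age = 31)       // new instance, joe unchanged
  val same     = User("Joe", 30) == joe   // structural equality, not reference
}
```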
Functional Scala
• Anonymous functions.

```scala
(a: Int, b: Int) => a + b
```
• Functions that take and return other functions.
• Rarely need variables or loops
• Immutable collections: Seq[T], Map[K,V], …
• Works well with concurrent or distributed systems
• Natural for functional programming
• Functional collection operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, drop, …
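The operations listed above can be sketched on an immutable `Seq`; the sample values are illustrative:

```scala
// Functional collection operations on an immutable Seq: no loops, no mutation.
object CollectionDemo {
  val nums = Seq(5, 3, 8, 1, 9, 2)
  val evens   = nums.filter(_ % 2 == 0)   // keep even elements
  val doubled = nums.map(_ * 2)           // transform each element
  val total   = nums.reduce(_ + _)        // combine into one value
  val top3    = nums.sortBy(-_).take(3)   // three largest, descending
}
```

Each call returns a new collection; `nums` itself is never modified, which is what makes these operations safe to use concurrently.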
Scala Availability and Support
• Open Source
• Typesafe provides support. Founded by Martin Odersky, who designed Scala.
• IDEs: Intellij IDEA and Eclipse
• Libraries: lots now and more every day
• ScalaNLP - Epic (natural language processing)
• Major Scala users: LinkedIn, Twitter, Goldman Sachs, Coursera, Angie's List, Whitepages
• Major systems written in Scala: Spark, Kafka
Typesafe Scala Components
• Scala Compiler (includes REPL)
• Scala Standard Libraries
• SBT - Scala Build Tool
• Play - scalable web applications
• Scala JS - compiles Scala to JavaScript
• Akka - for parallel and distributed computation
• Spray - high-performance asynchronous TCP/HTTP library
• Spark - Typesafe also supports Spark
• Slick - for SQL database access
• ConductR - Scala deployment/devops tool
• Reactive Monitoring (Beta)
Why Spark?
• Support for not only batch but also (near) real-time
• Fast - keeps data in memory as much as possible
• Often 10X to 100X Hadoop speed
• A clean easy-to-use API
• A richer set of functional operations than just map and reduce
• A foundation for a wide set of integrated data applications
• Can recover from failures - recompute or (optional) replication
• Scales to very large data sets while reducing processing time
Spark RDDs
• RDD[T] - resilient distributed data set
• typed (must be serializable)
• immutable
• ordered
• can be processed in parallel
• lazy evaluation - permits more global optimizations
• Rich set of functional operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, drop, …
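The RDD API deliberately mirrors Scala's collection API, so a pipeline can be sketched on a plain local collection. This hypothetical word count is not an actual Spark job: with Spark, `lines` would be an `RDD[String]` built from something like `sc.textFile(...)`, and the same operation chain would run distributed across the cluster.

```scala
// RDD-style pipeline sketched on a local Scala collection (not a Spark job).
object WordCountDemo {
  val lines = Seq("spark is fast", "scala is concise", "spark is scalable")
  val counts: Map[String, Int] =
    lines
      .flatMap(_.split(" "))                 // split each line into words
      .groupBy(identity)                     // group equal words together
      .map { case (w, ws) => (w, ws.size) }  // count each group
}
```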
Spark Components
• Spark Core
• Scalable multi-node cluster
• Failure detection and recovery
• RDDs and functional operations
• MLlib - for machine learning
• linear regression, SVMs, clustering, collaborative filtering, dimension reduction
• more on the way!
• GraphX - for graph computation
• Streaming - for near real-time
• DataFrames - for SQL and JSON
Spark Availability and Support
• Open Source - top level Apache project
• Over 750 contributors from over 200 organizations
• Can process multiple petabytes on clusters of over 8000 nodes
• Databricks. Matei Zaharia, who wrote the original Spark, is a founder and CTO.
• Packages (more every day)
• Zeppelin - Scala notebooks
• Cassandra, Kafka connectors
Clusters and Scalability
• Scala Akka clusters (process distribution, micro services)
• message passing
• remote Actors
• Spark clusters (data distribution)
• local
• Standalone (optionally with ZooKeeper)
• Apache Mesos
• Hadoop YARN
• all of the above can run on Amazon and Google clouds
Why Scala for Spark?
• Why not Python, R, or Java for Spark?
• Spark is written in Scala
• Scala source code is important Spark documentation
• Spark is best extended in Scala
• The primary API for Spark is Scala
• The functional features of Scala and Spark are a natural fit and easiest to use in Scala
• If you want to build scalable, high-performance production code on Spark: R by itself is too specialized, Python is too slow, and Java is tedious to write and maintain
Seattle Resources
• Seattle Meetups
• Scala at the Sea Meetup http://www.meetup.com/Seattle-Scala-User-Group/
• Seattle Spark Meetup http://www.meetup.com/Seattle-Spark-Meetup/
• Seattle Training: Spark and Typesafe Scala Classes http://www.47deg.com/events#training
• UW Scala Professional Certificate Program http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html