stratosphere v 0.4 upcoming features
DESCRIPTION
An overview of the key features added to Stratosphere version 0.4, to be released in the coming days.
TRANSCRIPT
Release Preview
Official release coming end of November
Hands-on sessions today with the latest code snapshot
New Features in a Nutshell
• Declarative Scala Programming API
• Iterative Programs
  o Bulk (batch-to-batch in memory) and Incremental (Delta Updates)
  o Automatic caching and cross-loop optimizations
• Runs on top of YARN (Hadoop Next Gen)
• Various deployment methods
  o VMs, Debian packages, EC2 scripts, ...
• Many usability fixes and bug fixes
Stratosphere System Stack
[Figure: system stack with the following layers]
• Sky Java API, Sky Scala API, Meteor, ...
• Stratosphere Optimizer
• Stratosphere Runtime
• Storage: HDFS, Local Files, S3, ...
• Cluster Manager: YARN, EC2, Direct
MapReduce: It is nice and good, but...
[Figure: a cascade of chained Map and Reduce jobs]
• Very verbose and low level. Only usable by system programmers.
• Everything slightly more complex must result in a cascade of jobs. Loses performance and optimization potential.
SQL (or Hive or Pig): It is nice and good, but...
• Allows you to do a subset of the tasks efficiently and elegantly
• What about the cases that do not fit SQL?
  o Custom types
  o Custom non-relational functions (they occur a lot!)
  o Iterative algorithms: machine learning, graph analysis
• How does it look to mix SQL with MapReduce?
SQL (or Hive or Pig) is nice and good, but...
Pig:
A = load 'WordcountInput.txt';
B = MAPREDUCE wordcount.jar store A into 'inputDir' load 'outputDir' as (word:chararray, count: int) 'org.myorg.WordCount inputDir outputDir';
C = sort B by count;
Hive:
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.dt, map_output.uid
  USING 'reduce_script'
  AS date, count;
• Program Fragmentation
• Impedance Mismatch
• Breaks optimization
Sky Language
MapReduce style functions (Map, Reduce, Join, CoGroup, Cross, ...)
Relational Set Operations (filter, map, group, join, aggregate, ...)
Database / UDF Runtime
Scala Embedded Language
Optimizer
Write like a programming language, execute like a database...
Sky Language
Add a bit of "languages and compilers" sauce to the database stack
Scala API by Example
• The classical word count example
val input = TextFile(textInput)
val words = input flatMap { line => line.split("\\W+") }
val counts = words groupBy { word => word } count()
Scala API by Example
• The classical word count example, annotated
  o TextFile(textInput): in-situ data source
  o flatMap { line => line.split("\\W+") }: transformation function
  o groupBy { word => word }: group by entire data type (the words)
  o count(): count per group
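For readers without a Stratosphere installation, the same flatMap / groupBy / count pipeline can be sketched on plain Scala collections. This is only an analogue of the slide's program, not the Stratosphere API: `WordCountSketch` and `wordCount` are hypothetical names, and the empty-token filter is an addition that the slide does not show.

```scala
object WordCountSketch {
  // Plain Scala collections stand in for the Stratosphere data set (assumption).
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\W+"))  // same transformation function as on the slide
      .filter(_.nonEmpty)                    // split can yield an empty leading token
      .groupBy(word => word)                 // group by the entire value (the word)
      .map { case (word, occurrences) => word -> occurrences.size } // count per group

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "be quick"))
    counts.toSeq.sortBy(_._1).foreach { case (w, c) => println(s"$w: $c") }
  }
}
```

The local version materializes every group in memory, which is the kind of detail a data-flow optimizer would be free to plan differently.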
Scala API by Example
• Graph Triangles (Friend-of-a-Friend problem)
  o Recommending friends, finding important connections
• 1) Enumerate candidate triads
• 2) Close as triangles
Scala API by Example
case class Edge(from: Int, to: Int)
case class Triangle(apex: Int, base1: Int, base2: Int)
val vertices = DataSource("hdfs:///...", CsvFormat[Edge])
val byDegree = vertices map { projectToLowerDegree }
val byID = byDegree map { (x) => if (x.from < x.to) x else Edge(x.to, x.from) }
val triads = byDegree groupBy { _.from } reduceGroup { buildTriads }
val triangles = triads join byID where { t => (t.base1, t.base2) } isEqualTo { e => (e.from, e.to) } map { (triangle, edge) => triangle }
Scala API by Example
• Annotations on the triangle program:
  o Custom data types: the case classes Edge and Triangle
  o In-situ data source: DataSource("hdfs:///...", CsvFormat[Edge])
Scala API by Example
• Annotations on the triangle program:
  o Relational join: triads join byID where { ... } isEqualTo { ... }
  o Non-relational library function and non-relational function: the user-defined code passed to map and reduceGroup (projectToLowerDegree, buildTriads, the inline edge-ordering function)
Scala API by Example
• Annotation on the triangle program:
  o Key references: the key selector functions, such as { _.from }, { t => (t.base1, t.base2) }, and { e => (e.from, e.to) }
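The two steps (enumerate candidate triads, then close them as triangles) can be tried out on plain Scala collections. The sketch below is not the Stratosphere program: it orders edges by vertex ID only and skips the degree-based projection (`projectToLowerDegree`), whose implementation the slides do not show. `TrianglesSketch`, `orient`, and this `buildTriads` are hypothetical stand-ins.

```scala
object TrianglesSketch {
  case class Edge(from: Int, to: Int)
  case class Triangle(apex: Int, base1: Int, base2: Int)

  // Orient every edge from the smaller to the larger vertex ID.
  def orient(e: Edge): Edge = if (e.from < e.to) e else Edge(e.to, e.from)

  // Step 1: every pair of neighbours of an apex forms a candidate triad.
  def buildTriads(apex: Int, neighbors: Seq[Int]): Seq[Triangle] =
    for {
      b1 <- neighbors
      b2 <- neighbors
      if b1 < b2
    } yield Triangle(apex, b1, b2)

  def triangles(edges: Seq[Edge]): Set[Triangle] = {
    val byId = edges.map(orient).distinct
    val triads = byId.groupBy(_.from).toSeq.flatMap {
      case (apex, outgoing) => buildTriads(apex, outgoing.map(_.to))
    }
    // Step 2: a triad closes if the edge between its two base vertices exists.
    // The set lookup plays the role of the join on (base1, base2) == (from, to).
    val edgeKeys = byId.map(e => (e.from, e.to)).toSet
    triads.filter(t => edgeKeys.contains((t.base1, t.base2))).toSet
  }
}
```

Because every edge points from the smaller to the larger ID, each triangle is reported once, with its smallest vertex as the apex.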
Optimizing Programs
• Program optimization happens in two phases
  1. Data type and function code analysis inside the Scala Compiler
  2. Relational-style optimization of the data flow
[Figure: pipeline spanning the Scala Compiler, the Stratosphere Optimizer, and the Run Time, with boxes labeled Program, Parser, Type Checker, Analyze Data Types, Generate Glue Code, Code Generation, Instantiate, Optimize, Finalize Glue Code, Create Schedule, Instantiate, Execution]
Type Analysis/Code Gen
• Types and Key Selectors are mapped to flat schema
• Generated code for interaction with runtime
• Primitive Types, Arrays, Lists: Single Value
  o Int, Double, Array[String], ...
• Tuples / Classes: Tuples
  o (a: Int, b: Int, c: String) maps to (a: Int, b: Int, c: String)
  o class T(x: Int, y: Long) maps to (x: Int, y: Long)
• Nested Types: Recursively flattened
  o class T(x: Int, y: Long) maps to (x: Int, y: Long)
  o class R(id: String, value: T) maps to (id: String, x: Int, y: Long)
• Recursive types: Tuples (w/ BLOB for recursion)
  o class Node(id: Int, left: Node, right: Node) maps to (id: Int, left: BLOB, right: BLOB)
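To make the nested-type rule concrete, here is a hand-written analogue of the flattening that, per the slide, the system generates automatically. It is only a sketch: `FlattenSketch`, `flatten`, and `unflatten` are hypothetical names, and `T` and `R` are declared as case classes (the slide uses plain classes) so that their fields are accessible.

```scala
object FlattenSketch {
  case class T(x: Int, y: Long)
  case class R(id: String, value: T)

  // R(id, T(x, y)) is laid out as the flat schema (id: String, x: Int, y: Long).
  def flatten(r: R): (String, Int, Long) = (r.id, r.value.x, r.value.y)

  // The inverse direction rebuilds the nested object from a flat row.
  def unflatten(row: (String, Int, Long)): R = R(row._1, T(row._2, row._3))
}
```

With a flat layout, a key selector such as `_.value.x` refers to a fixed field position (here the second field), so it can be addressed without rebuilding the nested object.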
Optimization
case class Order(id: Int, priority: Int, ...)
case class Item(id: Int, price: Double, ...)
case class PricedOrder(id: Int, priority: Int, price: Double)
val orders = DataSource(...)
val items = DataSource(...)
val filtered = orders filter { ... }
val prio = filtered join items where { _.id } isEqualTo { _.id } map { (o, li) => PricedOrder(o.id, o.priority, li.price) }
val sales = prio groupBy { p => (p.id, p.priority) } aggregate ({ _.price }, SUM)
[Figure: two plans over the Orders and Items inputs, each with Filter, Join, and Grp/Agg operators. One is annotated with the key fields (0,1), (0) = (0), and (∅); the other with the physical steps partition(0), sort (0,1), partition(0), and sort (0)]
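The logical data flow of this program (filter, join on `id`, group by `(id, priority)`, sum of `price`) can be checked on plain Scala collections. This sketch says nothing about the partitioning and sorting choices the optimizer makes. The elided fields of `Order` and `Item` are left out, the filter predicate is a parameter because the slide does not show it, and `OrderSalesSketch` and `sales` are hypothetical names.

```scala
object OrderSalesSketch {
  // Only the fields visible on the slide; the elided ones are omitted.
  case class Order(id: Int, priority: Int)
  case class Item(id: Int, price: Double)
  case class PricedOrder(id: Int, priority: Int, price: Double)

  def sales(orders: Seq[Order], items: Seq[Item], keep: Order => Boolean): Map[(Int, Int), Double] = {
    val filtered = orders.filter(keep)
    // Equi-join on id, implemented locally as a lookup in a hash index.
    val itemsById = items.groupBy(_.id)
    val prio = for {
      o  <- filtered
      li <- itemsById.getOrElse(o.id, Seq.empty[Item])
    } yield PricedOrder(o.id, o.priority, li.price)
    // Group on the composite key and sum the price.
    prio.groupBy(p => (p.id, p.priority)).map {
      case (key, group) => key -> group.map(_.price).sum
    }
  }
}
```

Every grouping key starts with the join key `id`, which is the kind of relationship that lets an optimizer reuse the partitioning established for the join when it plans the aggregation.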
Iterative Programs
• Many programs have a loop and make multiple passes over the data
  o Machine Learning algorithms iteratively refine the model
  o Graph algorithms propagate information hop by hop
[Figure: "Loop outside the system" (a Client drives a sequence of Steps) versus "Loop inside the system" (a single Iteration)]
Why Iterations
• Algorithms that need iterations
  o Clustering (K-Means, …)
  o Gradient descent
  o PageRank
  o Logistic Regression
  o Path algorithms on graphs (shortest paths, centralities, …)
  o Graph communities / dense sub-components
  o Inference (belief propagation)
  o …
All the hot algorithms for building predictive models
Two Types of Iterations
• Bulk Iterations
  o [Figure: Initial Dataset, Iterative Function, Result]
• Incremental Iterations (a.k.a. Workset Iterations)
  o [Figure: Initial Workset, Initial Solution set, Iterative Function, State, Result]
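The difference between the two iteration types can be sketched as two small driver functions on plain Scala collections. These are illustrative assumptions, not the Stratosphere runtime: `IterationSketch`, `bulkIterate`, and `worksetIterate` are hypothetical names, and stopping when the workset is empty is one plausible termination rule.

```scala
object IterationSketch {
  // Bulk iteration: the complete data set is fed through the step
  // function in every pass, for a fixed number of passes.
  def bulkIterate[A](initial: Seq[A], steps: Int)(step: Seq[A] => Seq[A]): Seq[A] =
    (1 to steps).foldLeft(initial)((data, _) => step(data))

  // Workset iteration: the solution set is kept as state, indexed by key.
  // Each pass looks only at the workset, returns a delta that is merged
  // into the solution, and produces the next workset.
  def worksetIterate[K, V](initialSolution: Map[K, V], initialWorkset: Seq[V], key: V => K)(
      step: (Map[K, V], Seq[V]) => (Seq[V], Seq[V])): Map[K, V] = {
    var solution = initialSolution
    var workset = initialWorkset
    while (workset.nonEmpty) {
      val (delta, next) = step(solution, workset)
      solution = solution ++ delta.map(v => key(v) -> v)
      workset = next
    }
    solution
  }
}
```

In the workset variant the cost of a pass depends on the size of the workset rather than on the size of the whole data set, which is what makes it attractive for algorithms that converge unevenly.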
Iterations inside the System
[Figure: # Vertices (thousands) per Iteration, Naïve vs. Incremental. Caption: Computations performed in each iteration for connected communities of a social graph]
[Figure: Runtime (secs) for the Twitter and Webbase (20) data sets]
Iterative Program (Scala)
def step = (s: DataSet[Vertex], ws: DataSet[Vertex]) => {
  val minNeighbor = ws groupBy { _.id } reduceGroup { x => x.minBy { _.component } }
  val delta = s join minNeighbor where { _.id } isEqualTo { _.id } flatMap { (c, o) => if (c.component < o.component) Some(c) else None }
  val nextWs = delta join edges where { v => v.id } isEqualTo { e => e.from } map { (v, e) => Vertex(e.to, v.component) }
  (delta, nextWs)
}
val components = vertices.iterateWithWorkset(initialWorkset, { _.id }, step)
Iterative Program (Scala)
• Annotations on the program:
  o Define step function: def step = (s: DataSet[Vertex], ws: DataSet[Vertex]) => { ... }
  o Return delta and next workset: (delta, nextWs)
  o Invoke iteration: vertices.iterateWithWorkset(initialWorkset, {_.id}, step)
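The connected-components step function can be traced on a small graph with a local sketch on plain Scala collections. It follows the same three stages (minimum candidate per vertex, delta against the solution set, next workset along the edges) but is not the Stratosphere program. `ConnectedComponentsSketch` and `components` are hypothetical names, edges are treated as undirected, and each vertex initially proposes its own ID to its neighbours; all of these are assumptions.

```scala
object ConnectedComponentsSketch {
  case class Vertex(id: Int, component: Int)
  case class Edge(from: Int, to: Int)

  def components(vertexIds: Seq[Int], edges: Seq[Edge]): Map[Int, Int] = {
    // Treat edges as undirected by indexing both directions.
    val neighbors: Map[Int, Seq[Int]] =
      edges.flatMap(e => Seq(e.from -> e.to, e.to -> e.from))
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2) }

    // Solution set: every vertex starts in a component named after itself.
    var solution: Map[Int, Int] = vertexIds.map(id => id -> id).toMap
    // Initial workset: every vertex proposes its own ID to its neighbours.
    var workset: Seq[Vertex] = for {
      id <- vertexIds
      n  <- neighbors.getOrElse(id, Seq.empty[Int])
    } yield Vertex(n, id)

    while (workset.nonEmpty) {
      // Smallest candidate component per vertex.
      val minNeighbor = workset.groupBy(_.id).values.map(_.minBy(_.component)).toSeq
      // Delta: only candidates that improve on the current solution.
      val delta = minNeighbor.filter(c => c.component < solution.getOrElse(c.id, Int.MaxValue))
      solution = solution ++ delta.map(v => v.id -> v.component)
      // Next workset: changed vertices notify their neighbours.
      workset = for {
        v <- delta
        n <- neighbors.getOrElse(v.id, Seq.empty[Int])
      } yield Vertex(n, v.component)
    }
    solution
  }
}
```

Vertices whose component no longer changes drop out of the workset, so later passes touch fewer and fewer elements.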
Iterative Program (Java)
Graph Processing in Stratosphere
Optimizing Iterative Programs
• Caching loop-invariant data
• Pushing work "out of the loop"
• Maintain state as index
Support for YARN
• Clusters are typically shared between applications
  o Different users
  o Different systems, or different versions of the same system
• YARN manages the cluster as a collection of resources
  o Allows systems to deploy themselves on the cluster for a task
[Figure: Stratosphere Client and YARN Manager]
Project: http://stratosphere.eu
Dev: http://github.com/stratosphere
Tweet: #StratoSummit
Be Part of a Great Open Source Project
Use Stratosphere & give us feedback on the experience
Partner with us and become a pilot user/customer
Contribute to the system