Data-Centric Metaprogramming by Vlad Ureche
Vlad Ureche
PhD in the Scala Team @ EPFL. Soon to graduate ;)
● Working on program transformations focusing on data representation
● Author of miniboxing, which improves generics performance by up to 20x
● Contributed to the Scala compiler and to the scaladoc tool.
@VladUreche
Motivation
Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data and used with permission.
Performance gap between RDDs and DataFrames
Motivation
● RDD: strongly typed, slower
● DataFrame: dynamically typed, faster
● Dataset: strongly typed, faster... but only mid-way

Why just mid-way? What can we do to speed them up?
Object Composition

class Employee(...)     // corresponds to a table row: ID, NAME, SALARY
class Vector[T] { … }   // the Vector collection in the Scala library

Vector[Employee]: a vector of references to Employee objects, each a heap object holding ID, NAME, SALARY.

Traversal requires dereferencing a pointer for each employee.
A Better Representation

EmployeeVector: stores the employees column-wise (ID ID ..., NAME NAME ..., SALARY SALARY ...) instead of Vector[Employee]'s one object per row (ID NAME SALARY), giving (see the sketch below):
● more efficient heap usage
● faster iteration
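For concreteness, a minimal sketch of the two layouts in Scala (assuming Employee is a case class with the three fields shown; the EmployeeVector shape is illustrative, not prescribed):

case class Employee(id: Int, name: String, salary: Float)

// Row-wise: one heap object per employee, reached through a reference.
val rows: Vector[Employee] = Vector(Employee(1, "John Doe", 100f))

// Column-wise: three flat arrays, no per-employee object.
class EmployeeVector(
    val ids:      Array[Int],
    val names:    Array[String],
    val salaries: Array[Float]) {
  def length: Int = ids.length
  def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))
}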
The Problem
● Vector[T] is unaware of Employee
  – Which makes Vector[Employee] suboptimal
● Not limited to Vector, other classes are also affected
  – Spark pain point: functions/closures
  – We'd like a "structured" representation throughout

Challenge: no means of communicating this to the compiler.
Data-Centric Metaprogramming
● Compiler plug-in that allows tuning the data representation
● Website: scala-ildl.org
Transformation

Definition (programmer):
● can't be automated
● based on experience
● based on speculation
● one-time effort

Application (compiler, automated):
● repetitive and complex
● affects code readability
● is verbose
● is error-prone
Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {
  // What to transform? What to transform to?
  type Target = Vector[Employee]
  type Result = EmployeeVector

  // How to transform?
  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...

  // How to run methods on the updated representation?
  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}
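The two conversions can be filled in against the EmployeeVector sketched earlier; this is a minimal sketch of the idea, not the plugin's actual code:

// Vector[Employee] → EmployeeVector: gather each field into its column.
def toResult(t: Vector[Employee]): EmployeeVector =
  new EmployeeVector(
    t.map(_.id).toArray,
    t.map(_.name).toArray,
    t.map(_.salary).toArray)

// EmployeeVector → Vector[Employee]: rebuild one object per row.
def toTarget(t: EmployeeVector): Vector[Employee] =
  Vector.tabulate(t.length)(t.apply)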
Outline
● Motivation
● Transformation
● Applications
● Challenges
  – Spark
  – Open World
  – Best Representation?
  – Composition
● Conclusion
Scenario

class Employee(...)      // ID, NAME, SALARY
class Vector[T] { … }

We transform Vector[Employee] (rows of ID, NAME, SALARY) into EmployeeVector (columns of IDs, NAMEs, SALARYs). Then someone writes:

class NewEmployee(...) extends Employee(...)   // ID, NAME, SALARY, DEPT

Oooops... the column-wise representation has no room for the DEPT field.
Open World Assumption
● Globally, anything can happen
● Locally, you have full control:
  – Make class Employee final, or
  – Limit the transformation to code that uses Employee

How? Using Scopes!
Scopes

transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(salary = (1 + by) * employee.salary)
}

Now the method operates on the EmployeeVector representation.
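For intuition, the rewritten method might conceptually look like this on the EmployeeVector sketched earlier (the name indexSalary$T and the exact shape are illustrative):

def indexSalary$T(employees: EmployeeVector, by: Float): EmployeeVector =
  new EmployeeVector(
    employees.ids,
    employees.names,
    employees.salaries.map(s => (1 + by) * s))  // one pass over a flat array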
Scopes
● Can wrap statements, methods, even entire classes
  – Inlined immediately after the parser
  – Definitions are visible outside the "scope"
● Mark locally closed parts of the code
  – Incoming/outgoing values go through conversions
  – You can reject unexpected values
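For example, a call from outside the scope crosses the boundary through the conversions (a conceptual sketch; the rewritten name is illustrative):

val employees: Vector[Employee] = Vector(Employee(1, "John Doe", 100f))

// Outside code calls indexSalary as usual; the compiler conceptually inserts
//   toTarget(indexSalary$T(toResult(employees), 0.10f))
val indexed = indexSalary(employees, 0.10f)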
Best Representation?

It depends:
● Vector[Employee]: rows of ID, NAME, SALARY
● EmployeeVector: columns of IDs, NAMEs, SALARYs
● Tungsten repr.: <compressed binary blob>
● EmployeeJSON: { id: 123, name: "John Doe", salary: 100 }
Scopes allow mixing data representations

transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(salary = (1 + by) * employee.salary)
}

Here the code operates on the EmployeeVector representation. The same code wrapped in transform(VectorOfEmployeeCompact) operates on the compact binary representation, and in transform(VectorOfEmployeeJSON) on the JSON-based representation.
Composition
● Code can be
  – Left untransformed (using the original representation)
  – Transformed using different representations

Original or transformed code can call original or transformed code, under the same or a different transformation. Case by case:
● Original calling original: easy one. Do nothing.
● Original calling transformed, or transformed calling original: automatically introduce conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back.
● Transformed calling transformed, same transformation: hard one. Do not introduce any conversions, even across separate compilation.
● Transformed calling transformed, different transformation: hard one. Automatically introduce double conversions (and warn the programmer), e.g. EmployeeVector → Vector[Employee] → CompactEmpVector.

The same cases arise for overriding, not just calling.
Scopes

trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

class EmployeePrinter extends Printer[Employee] {
  def print(employee: Vector[Employee]) = ...
}

Method print in the class implements method print in the trait. But wrapped in a scope:

transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] {
    def print(employee: Vector[Employee]) = ...
  }
}

The signature of method print changes according to the transformation → it no longer implements the trait. Taken care of by the compiler for you!
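One way to picture the fix is a compiler-generated bridge that keeps the trait satisfied (illustrative; the plugin's actual output may differ):

transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] {
    // transformed version, operating on the optimized representation
    def print$T(employees: EmployeeVector): Unit = ???

    // bridge with the original signature, converting on entry
    def print(employees: Vector[Employee]): Unit =
      print$T(VectorOfEmployeeOpt.toResult(employees))
  }
}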
Column-oriented Storage

Vector[Employee] (rows of ID, NAME, SALARY) → EmployeeVector (columns of IDs, NAMEs, SALARYs):
iteration is 5x faster
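To see why, reusing the EmployeeVector sketch from earlier: the hot field lives in one contiguous array, so the loop does no per-employee pointer chasing (a sketch; the 5x figure is the talk's measurement, not this snippet's):

def totalSalary(ev: EmployeeVector): Float = {
  var sum = 0f
  var i = 0
  while (i < ev.salaries.length) {  // linear scan over a flat Array[Float]
    sum += ev.salaries(i)
    i += 1
  }
  sum
}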
Retrofitting value class status

Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be.

(3, 5) as an object: Header | 3 | 5, reached through a reference.
(3, 5) as a value: a single long encoding both fields, (3L << 32) + 5.

14x faster, lower heap requirements
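A sketch of the encoding, packing the two Ints of the pair into one Long (helper names are illustrative):

def encode(x: Int, y: Int): Long = (x.toLong << 32) | (y & 0xFFFFFFFFL)
def fst(p: Long): Int = (p >>> 32).toInt
def snd(p: Long): Int = p.toInt

val p = encode(3, 5)  // no header, no reference, no heap allocation
// fst(p) == 3, snd(p) == 5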
Deforestation

List(1,2,3).map(_ + 1).map(_ * 2).sum
// builds List(2,3,4), then List(4,6,8), then computes 18

transform(ListDeforestation) {
  List(1,2,3).map(_ + 1).map(_ * 2).sum
}
// the map calls accumulate functions; sum computes 18 directly,
// with no intermediate lists

6x faster
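A minimal sketch of the idea: map only accumulates the function, and the sink (sum) runs the fused pipeline in one pass (the LazyList2 wrapper is hypothetical, not the plugin's API):

final case class LazyList2[A, B](source: List[A], f: A => B) {
  def map[C](g: B => C): LazyList2[A, C] = LazyList2(source, f andThen g)
  def sum(implicit num: Numeric[B]): B =
    source.foldLeft(num.zero)((acc, a) => num.plus(acc, f(a)))
}

LazyList2(List(1, 2, 3), (x: Int) => x).map(_ + 1).map(_ * 2).sum  // 18, one traversal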
Spark
● Optimizations
  – DataFrames do deforestation
  – DataFrames do predicate push-down
  – DataFrames do code generation
● Code is specialized for the data representation
● Functions are specialized for the data representation

RDDs do none of these: this is what makes them slower.

Datasets do deforestation, predicate push-down and code generation, and their code is specialized for the data representation, but user functions still operate on decoded objects.
User Functions

[Diagram] serialized data → encoded data → decode (allocate object) → user function f: X → Y → encode (allocate object) → encoded data

With a modified user function (automatically derived by the compiler), the pipeline becomes: encoded data → modified user function → encoded data, with no decode/encode steps and no object allocations.

Nowhere near as simple as it looks.
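As a sketch of the cost (the EncodedRow type and all names here are placeholders, not Spark's internal API):

type EncodedRow = Array[Byte]  // stand-in for the encoded format

// Today: decode, apply f, re-encode. One object allocated on each side,
// for every record.
def applyUser[X, Y](row: EncodedRow)(decode: EncodedRow => X,
                                     f: X => Y,
                                     encode: Y => EncodedRow): EncodedRow =
  encode(f(decode(row)))

// Goal: a compiler-derived fEnc: EncodedRow => EncodedRow that computes the
// same result directly on the encoded data, skipping both allocations.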
Challenge: Transformation not possible
● Example: calling an outside (untransformed) method
● Solution: issue compiler warnings
  – Explain why it's not possible: due to the method call
  – Suggest how to fix it: enclose the method in a scope
● Reuse the machinery in miniboxing: scala-miniboxing.org
Challenge: Internal API changes
● Spark internals rely on Iterator[T]
  – Requires materializing values
  – Needs to be replaced throughout the code base
  – By rather complex buffers
● Solution: extensive refactoring/rewrite
Challenge: Automation
● Existing code should run out of the box
● Solution:
  – Adapt data-centric metaprogramming to Spark
  – Trade generality for simplicity
  – Do the right thing for most of the cases
Where are we now?
Prototype Hack
● Modified version of Spark core
  – RDD data representation is configurable
● It's very limited:
  – Custom data repr. only in map, filter and flatMap
  – Otherwise we revert to costly objects
  – Large parts of the automation still need to be done

sc.parallelize(/* 1 million */ records).
  map(x => ...).
  filter(x => ...).
  collect()

Not yet 2x faster, but 1.45x faster.
Conclusion
● Object-oriented composition → inefficient representation
● Solution: data-centric metaprogramming
  – Opaque data → Structured data
  – Is it possible? Yes.
  – Is it easy? Not really.
  – Is it worth it? You tell me!
Deforestation and Language Semantics
● Notice that we changed language semantics:
  – Before: collections were eager
  – After: collections are lazy
  – This can lead to effects reordering (example below)
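A concrete instance, assuming both maps print as a side effect:

// Eager semantics: the first map finishes before the second starts.
List(1, 2, 3).map { x => println("A" + x); x + 1 }
             .map { x => println("B" + x); x * 2 }
// prints A1 A2 A3, then B2 B3 B4

// After deforestation the two functions are fused into one traversal,
// so the prints interleave: A1 B2 A2 B3 A3 B4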
● Such transformations are only acceptable with programmer consent
  – JIT compilers/staged DSLs can't change semantics
  – Metaprogramming (macros) can, but it should be documented/opt-in
Code Generation
● Also known as
  – Deep Embedding
  – Multi-Stage Programming
● Awesome speedups, but restricted to small DSLs
● SparkSQL uses code gen to improve performance
  – By 2-4x over Spark
Low-level Optimizers
● Java JIT Compiler
  – Access to the low-level code
  – Can assume a (local) closed world
  – Can speculate based on profiles
● Best optimizations break semantics
  – You can't do this in the JIT compiler!
  – Only the programmer can decide to break semantics
Scala Macros
● Many optimizations can be done with macros
  – :) Lots of power
  – :( Lots of responsibility
    ● Scala compiler invariants
    ● Object-oriented model
    ● Modularity