MLbase
ampcamp.berkeley.edu/wp-content/uploads/2013/08/mlbase...

Evan Sparks and Ameet Talwalkar, UC Berkeley
Collaborators: Tim Kraska (2), Virginia Smith (1), Xinghao Pan (1), Shivaram Venkataraman (1), Matei Zaharia (1), Rean Griffith (3), John Duchi (1), Joseph Gonzalez (1), Michael Franklin (1), Michael I. Jordan (1)
(1) UC Berkeley  (2) Brown  (3) VMware
www.mlbase.org

TRANSCRIPT

Page 1:

MLbase
Evan Sparks and Ameet Talwalkar, UC Berkeley

Collaborators: Tim Kraska (2), Virginia Smith (1), Xinghao Pan (1), Shivaram Venkataraman (1), Matei Zaharia (1), Rean Griffith (3), John Duchi (1), Joseph Gonzalez (1), Michael Franklin (1), Michael I. Jordan (1)
(1) UC Berkeley  (2) Brown  (3) VMware

www.mlbase.org

Page 2:

Problem: Scalable implementations are difficult for ML Developers…

[Architecture diagram: the ML Developer contributes ML Contract + Code and the User submits a Declarative ML Task; the Master Server (Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics) compiles LLP/PLP plans, runs them on DMX Runtime slaves, and returns a result (e.g., fn-model & summary).]

Page 3:

Problem: Scalable implementations are difficult for ML Developers…

[Architecture diagram repeated from Page 2.]

Page 4:

Problem: Scalable implementations are difficult for ML Developers…

[Architecture diagram repeated from Page 2.]

Page 5:

Problem: ML is difficult for End Users…

Too many algorithms…

Page 6:

Problem: ML is difficult for End Users…

Too many algorithms…
Too many knobs…

Page 7:

Problem: ML is difficult for End Users…

Too many algorithms…
Too many knobs…
Difficult to debug…

Page 8:

Problem: ML is difficult for End Users…

Too many algorithms…
Too many knobs…
Difficult to debug…
Doesn't scale…

Page 9:

Problem: ML is difficult for End Users…

Too many algorithms…
Too many knobs…
Difficult to debug…
Doesn't scale…

(Desired: Reliable, Fast, Accurate, Provable)

Page 10:

ML Experts <-> MLbase <-> Systems Experts

Page 11:

1. Easy scalable ML development (ML Developers)
2. User-friendly ML at scale (End Users)

ML Experts <-> MLbase <-> Systems Experts

Page 12:

1. Easy scalable ML development (ML Developers)
2. User-friendly ML at scale (End Users)

Along the way, we gain insight into data intensive computing

ML Experts <-> MLbase <-> Systems Experts

Page 13:

Outline: Vision / MLI Details / Current Status / ML Workflow

Page 14:

Matlab Stack

Page 15:

Matlab Stack
(Single Machine)

Page 16:

Matlab Stack
(Single Machine: Lapack)

✦ Lapack: low-level Fortran linear algebra library

Page 17:

Matlab Stack
(Single Machine: Matlab Interface on top of Lapack)

✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface
  ✦ Higher-level abstractions for data access / processing
  ✦ More extensive functionality than Lapack
  ✦ Leverages Lapack whenever possible

Page 18:

Matlab Stack
(Single Machine: Matlab Interface on top of Lapack)

✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface
  ✦ Higher-level abstractions for data access / processing
  ✦ More extensive functionality than Lapack
  ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python

Page 19:

MLbase Stack
[Stack diagram: Matlab Interface over Lapack (Single Machine)]

Page 20:

MLbase Stack
[Stack diagram adds: Runtime(s)]

Page 21:

MLbase Stack
[Stack diagram adds: Spark]

Spark: cluster computing system designed for iterative computation

Page 22:

MLbase Stack
[Stack diagram adds: MLlib over Spark]

Spark: cluster computing system designed for iterative computation
MLlib: low-level ML library in Spark

Page 23:

MLbase Stack
[Stack diagram adds: MLI over MLlib]

Spark: cluster computing system designed for iterative computation
MLlib: low-level ML library in Spark
MLI: API / platform for feature extraction and algorithm development
✦ Platform independent

Page 24:

MLbase Stack
[Stack diagram: ML Optimizer / MLI / MLlib / Spark, alongside Matlab Interface / Lapack (Single Machine)]

Spark: cluster computing system designed for iterative computation
MLlib: low-level ML library in Spark
MLI: API / platform for feature extraction and algorithm development
✦ Platform independent
ML Optimizer: automates model selection
✦ Solves a search problem over feature extractors and algorithms in MLI

Page 25:

Example: MLlib

Page 26:

Example: MLlib

✦ Goal: Classification of text file

Page 27:

Example: MLlib

✦ Goal: Classification of text file
✦ Featurize data manually

MLI:

    def main(args: Array[String]) {
      val mc = new MLContext("local", "MLILR")

      // Read in file from HDFS
      val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

      // Run feature extraction
      val classes = rawTextTable(??, "class")
      val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n = 2, top = 30000))
      val featurizedTable = classes.zip(ngrams)

      // Classify the data using Logistic Regression.
      val lrModel = LogisticRegression(featurizedTable, stepSize = 0.1, numIter = 12)
    }

MLlib:

    def main(args: Array[String]) {
      val sc = new SparkContext("local", "SparkLR")

      // Load data from HDFS
      val data = sc.textFile(args(0)) // RDD[String]

      // User is responsible for formatting/featurizing/normalizing their RDD!
      val featurizedData: RDD[(Double, Array[Double])] = processData(data)

      // Train the model using MLlib.
      val model = new LogisticRegressionLocalRandomSGD()
        .setStepSize(0.1)
        .setNumIterations(50)
        .train(featurizedData)
    }

Fig. 15: Logistic regression in MLI (top) and directly against Spark/MLlib (bottom).
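For intuition, the bigram tf-idf featurization that `tfIdf(nGrams(..., n=2, top=30000))` performs can be sketched in plain Python. This is an illustrative stand-in for MLI's feature extractors, not their implementation; `bigrams` and `tfidf_featurize` are invented names.

```python
from collections import Counter
import math

def bigrams(text):
    # Tokenize and pair adjacent tokens (n = 2, as on the slide).
    toks = text.lower().split()
    return [" ".join(p) for p in zip(toks, toks[1:])]

def tfidf_featurize(docs, top=5):
    # Count per-document bigram frequencies and document frequencies.
    df = Counter()
    per_doc = []
    for d in docs:
        counts = Counter(bigrams(d))
        per_doc.append(counts)
        df.update(counts.keys())
    # Keep only the `top` most common bigrams (the slide keeps 30,000).
    vocab = [g for g, _ in df.most_common(top)]
    n = len(docs)
    # tf-idf weight: raw count times log(N / document frequency).
    feats = [[counts[g] * math.log(n / df[g]) for g in vocab]
             for counts in per_doc]
    return vocab, feats
```

A classifier such as the logistic regression above would then train on `feats` zipped with the class column.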

Page 28:

Example: MLlib

✦ Goal: Classification of text file
✦ Featurize data manually
✦ Calls MLlib's LR function

(Code listings as on Page 27.)

Page 29:

Example: MLI

Page 30:

Example: MLI

✦ Use built-in feature extraction functionality

(Code listings as on Page 27.)

Page 31:

Example: MLI

✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib

(Code listings as on Page 27.)

Page 32:

Example: MLI

✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
  ✦ Embed in cross-validation routine
  ✦ Use different feature extractors / algorithms
  ✦ Write new ones

(Code listings as on Page 27.)

Page 33:

Example: ML Optimizer

    var X = load("text_file", 2 to 10)
    var y = load("text_file", 1)
    var (fn-model, summary) = doClassify(X, y)

✦ User declaratively specifies task
✦ ML Optimizer searches through MLI
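For intuition, here is a toy Python sketch of the search that a `doClassify` call implies: fit each candidate, score it, return the winner plus a summary. The trainers (`train_majority`, a one-feature threshold) are invented for illustration and are not MLbase's actual search strategy or algorithms.

```python
def accuracy(model, X, y):
    # Fraction of points the model labels correctly.
    return sum(model(x) == yi for x, yi in zip(X, y)) / len(y)

def train_majority(X, y):
    # Baseline: always predict the most frequent class.
    label = max(set(y), key=y.count)
    return lambda x: label

def train_threshold(X, y):
    # Toy classifier: threshold on the first feature at its mean.
    t = sum(x[0] for x in X) / len(X)
    above = [yi for x, yi in zip(X, y) if x[0] > t] or y
    below = [yi for x, yi in zip(X, y) if x[0] <= t] or y
    hi = max(set(above), key=above.count)
    lo = max(set(below), key=below.count)
    return lambda x: hi if x[0] > t else lo

def do_classify(X, y, candidates):
    # Fit every candidate, keep the best by training accuracy, and
    # return (model, summary) -- the shape of the MQL call above.
    scored = [(name, train(X, y)) for name, train in candidates.items()]
    name, model = max(scored, key=lambda nm: accuracy(nm[1], X, y))
    return model, {"algorithm": name,
                   "train_accuracy": accuracy(model, X, y)}
```

The real optimizer additionally searches over feature extractors and cross-validates rather than scoring on training data.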

Page 34:

Example: ML Optimizer

✦ User declaratively specifies task
✦ ML Optimizer searches through MLI

(Analogy: SQL yields a Result; MQL yields a Model.)

Page 35:

Outline: Vision / MLI Details / Current Status / ML Workflow

Page 36:

Lay of the Land
[Chart: Ease of use vs. Performance, Scalability]

Page 37:

Lay of the Land
[Chart: Ease of use vs. Performance, Scalability]

Matlab, R
 +  Easy (resembles math, limited set up)
 +  Sufficient for prototyping / writing papers
 -  Ad-hoc, non-scalable scripts
 -  Loss of translation upon re-implementation

Page 38:

Lay of the Land
[Chart: Ease of use vs. Performance, Scalability]

Matlab, R
 +  Easy (resembles math, limited set up)
 +  Sufficient for prototyping / writing papers
 -  Ad-hoc, non-scalable scripts
 -  Loss of translation upon re-implementation

GraphLab, VW, Mahout
 +  Scalable and (sometimes) fast
 +  Existing open-source libraries
 -  Difficult to set up, extend

Page 39:

Examples

[Architecture diagram repeated from Page 2.]

Page 40:

Examples

[Architecture diagram repeated from Page 2.]

'Distributed' Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (not distributed)
✦ Distributed prototype involving compiled MATLAB

Page 41:

Examples

[Architecture diagram repeated from Page 2.]

'Distributed' Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (not distributed)
✦ Distributed prototype involving compiled MATLAB

Mahout ALS with Early Stopping
✦ Theory: simple if-statement (3 lines of code)

Page 42:

Examples

[Architecture diagram repeated from Page 2.]

'Distributed' Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (not distributed)
✦ Distributed prototype involving compiled MATLAB

Mahout ALS with Early Stopping
✦ Theory: simple if-statement (3 lines of code)
✦ Practice: sift through 7 files, nearly 1K lines of code
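The "simple if-statement" can be sketched generically in Python. Here `step` is a hypothetical callback that runs one ALS sweep and returns the current training error; this illustrates the early-stopping idea, not Mahout's code.

```python
def als_with_early_stopping(step, max_iters=50, tol=1e-4):
    # `step(i)` performs sweep i and returns the training error after it.
    # Stop once the improvement between sweeps drops below `tol` --
    # the three-line check the slide refers to.
    prev = float("inf")
    for i in range(max_iters):
        err = step(i)
        if prev - err < tol:      # early-stopping check
            return i + 1, err
        prev = err
    return max_iters, prev
```

The point of the slide is that in Mahout this conceptually tiny check had to be threaded through many files, whereas an API with the right hooks makes it local.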

Page 43:

Lay of the Land
[Chart: Ease of use vs. Performance, Scalability]

Matlab, R
 +  Easy (resembles math, limited set up)
 +  Sufficient for prototyping / writing papers
 -  Ad-hoc, non-scalable scripts
 -  Loss of translation upon re-implementation

GraphLab, VW, Mahout
 +  Scalable and (sometimes) fast
 +  Existing open-source libraries
 -  Difficult to set up, extend

Page 44:

(Chart from Page 43, adding MLlib.)

Page 45:

(Chart from Page 44, adding MLI.)

Page 46:

ML Developer API (MLI)

Page 47:

ML Developer API (MLI)

OLD:
val x: RDD[Array[Double]]

Page 48:

ML Developer API (MLI)

OLD:
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]

Page 49:

ML Developer API (MLI)

OLD:
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]

Page 50:

ML Developer API (MLI)

OLD:
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]

Page 51:

(Same as Page 50.)

Page 52:

ML Developer API (MLI)

OLD:
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]

NEW:
val x: MLTable

Page 53:

ML Developer API (MLI)

OLD: (as on Page 50)
NEW:
val x: MLTable

✦ Abstract interface for arbitrary backend
✦ Common interface to support an optimizer

Page 54:

ML Developer API (MLI)

Page 55:

ML Developer API (MLI)
✦ Shield ML Developers from low-level details
  ✦ provide familiar mathematical operators in a distributed setting

Page 56:

ML Developer API (MLI)
✦ Shield ML Developers from low-level details
  ✦ provide familiar mathematical operators in a distributed setting
✦ Table Computation (MLTable)
  ✦ Flexibility when loading data (heterogeneous, missing)
  ✦ Common interface for feature extraction / algorithms
  ✦ Supports MapReduce and relational operators
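As a rough illustration, a backend-independent table with named columns and a few functional/relational operators might look like this Python sketch. The class and method names are hypothetical, not MLI's real API.

```python
class MLTable:
    # Minimal sketch: named columns, list-of-lists rows, and a few
    # operators (select a column, map over a column, filter rows).
    def __init__(self, names, rows):
        self.names, self.rows = names, rows

    def select(self, name):
        # Project a single column by name.
        i = self.names.index(name)
        return [r[i] for r in self.rows]

    def map_column(self, name, fn):
        # Apply fn to one column, returning a new table.
        i = self.names.index(name)
        rows = [r[:i] + [fn(r[i])] + r[i + 1:] for r in self.rows]
        return MLTable(self.names, rows)

    def filter(self, pred):
        # Relational selection: keep rows satisfying pred.
        return MLTable(self.names, [r for r in self.rows if pred(r)])
```

A distributed backend would implement the same interface over partitioned data, which is what lets an optimizer treat feature extractors and algorithms uniformly.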

Page 57:

ML Developer API (MLI)
✦ Shield ML Developers from low-level details
  ✦ provide familiar mathematical operators in a distributed setting
✦ Table Computation (MLTable)
  ✦ Flexibility when loading data (heterogeneous, missing)
  ✦ Common interface for feature extraction / algorithms
  ✦ Supports MapReduce and relational operators
✦ Linear Algebra (MLSubMatrix)
  ✦ Linear algebra on *local* partitions
  ✦ Sparse and dense matrix support

Page 58:

ML Developer API (MLI)
✦ Shield ML Developers from low-level details
  ✦ provide familiar mathematical operators in a distributed setting
✦ Table Computation (MLTable)
  ✦ Flexibility when loading data (heterogeneous, missing)
  ✦ Common interface for feature extraction / algorithms
  ✦ Supports MapReduce and relational operators
✦ Linear Algebra (MLSubMatrix)
  ✦ Linear algebra on *local* partitions
  ✦ Sparse and dense matrix support
✦ Optimization Primitives (MLSolve)
  ✦ Distributed implementations of common patterns
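One such common pattern is a map/reduce gradient step: workers compute local gradients on their partitions, the driver averages them and updates the model. A Python sketch of the pattern under those assumptions (`distributed_gd` is illustrative, not MLSolve's actual interface):

```python
def distributed_gd(partitions, grad, w0, step_size=0.1, iters=10):
    # `grad(w, part)` returns the gradient of the loss on one data
    # partition; each call models work done on one worker.
    w = w0
    for _ in range(iters):
        local = [grad(w, part) for part in partitions]    # map on workers
        g = [sum(c) / len(local) for c in zip(*local)]    # reduce on driver
        w = [wi - step_size * gi for wi, gi in zip(w, g)] # driver update
    return w
```

Expressing algorithms against a primitive like this is what keeps DFC and ALS down to tens of lines.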

Page 59:

ML Developer API (MLI)
✦ Shield ML Developers from low-level details
  ✦ provide familiar mathematical operators in a distributed setting
✦ Table Computation (MLTable)
  ✦ Flexibility when loading data (heterogeneous, missing)
  ✦ Common interface for feature extraction / algorithms
  ✦ Supports MapReduce and relational operators
✦ Linear Algebra (MLSubMatrix)
  ✦ Linear algebra on *local* partitions
  ✦ Sparse and dense matrix support
✦ Optimization Primitives (MLSolve)
  ✦ Distributed implementations of common patterns

✦ DFC: ~50 lines of code
✦ ALS: early stopping in 3 lines; < 40 lines total

Page 60:

MLI Ease of Use

Page 61:

MLI Ease of Use

Logistic Regression:
  System         | Lines of Code
  Matlab         | 11
  Vowpal Wabbit  | 721
  MLI            | 55

Alternating Least Squares:
  System    | Lines of Code
  Matlab    | 20
  Mahout    | 865
  GraphLab  | 383
  MLI       | 32

Page 62:

(MLI Ease of Use tables repeated from Page 61.)

Page 63:

(MLI Ease of Use tables repeated from Page 61.)

Page 64:

MLI/Spark Performance

Page 65:

MLI/Spark Performance

✦ Walltime: elapsed time to execute task

Page 66:

MLI/Spark Performance

✦ Walltime: elapsed time to execute task
✦ Weak scaling
  ✦ fix problem size per processor
  ✦ ideally: constant walltime as we grow cluster

Page 67:

MLI/Spark Performance

✦ Walltime: elapsed time to execute task
✦ Weak scaling
  ✦ fix problem size per processor
  ✦ ideally: constant walltime as we grow cluster
✦ Strong scaling
  ✦ fix total problem size
  ✦ ideally: linear speed up as we grow cluster

Page 68: MLI/Spark Performance

✦ Walltime: elapsed time to execute task

✦ Weak scaling
  ✦ fix problem size per processor
  ✦ ideally: constant walltime as we grow cluster

✦ Strong scaling
  ✦ fix total problem size
  ✦ ideally: linear speedup as we grow cluster

✦ EC2 Experiments
  ✦ m2.4xlarge instances, up to 32-machine clusters
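The two scaling regimes above reduce to simple arithmetic over measured walltimes. A minimal sketch (the timing numbers below are made up for illustration, not measurements from this talk):

```python
# Hypothetical walltimes in seconds; machines -> walltime.

# Weak scaling: problem size grows with the cluster, so the ideal
# relative walltime stays at 1.0 as machines are added.
weak = {1: 100.0, 8: 110.0, 32: 130.0}
relative = {m: t / weak[1] for m, t in weak.items()}

# Strong scaling: total problem size is fixed, so the ideal
# speedup on m machines is m.
strong = {1: 1200.0, 8: 180.0, 32: 60.0}
speedup = {m: strong[1] / t for m, t in strong.items()}

for m in (1, 8, 32):
    print(f"{m:2d} machines: relative walltime {relative[m]:.2f}, "
          f"speedup {speedup[m]:.1f}x (ideal {m}x)")
```

With these numbers, 32 machines give a relative walltime of 1.30 (weak scaling overhead of 30%) and a 20x speedup against an ideal of 32x.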


Page 72: Logistic Regression - Weak Scaling

✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling
✦ MLI/Spark within a factor of 2 of VW's walltime

[Fig. 5: Walltime for weak scaling for logistic regression (MLbase/MLI, VW, Matlab; n = 6K to 200K, d = 160K).]

[Fig. 6: Weak scaling for logistic regression (relative walltime vs. number of machines; MLbase, VW, Ideal).]

[Fig. 7: Walltime for strong scaling for logistic regression (MLbase/MLI, VW, Matlab; 1 to 32 machines).]

[Fig. 8: Strong scaling for logistic regression (speedup vs. number of machines; MLbase, VW, Ideal).]

Excerpt from the accompanying paper:

... with respect to computation. In practice, we see comparable scaling results as more machines are added.

In MATLAB, we implement gradient descent instead of SGD, as gradient descent requires roughly the same number of numeric operations as SGD but does not require an inner loop to pass over the data. It can thus be implemented in a 'vectorized' fashion, which leads to a significantly more favorable runtime. Moreover, while we are not adding additional processing units to MATLAB as we scale the dataset size, we show MATLAB's performance here as a reference for training a model on a similarly sized dataset on a single multicore machine.

Results: In our weak scaling experiments (Figures 5 and 6), we see that our clustered system begins to outperform MATLAB at even moderate levels of data, and while MATLAB runs out of memory and cannot complete the experiment on the 200K-point dataset, our system finishes in less than 10 minutes. Moreover, the highly specialized VW is on average 35% faster than our system, and never twice as fast. These times do not include the time spent preparing data for input to VW, which was significant, but we expect that this would be a one-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8), our solution actually outperforms VW in raw time to train a model on a fixed dataset size when using 16 and 32 machines, and it exhibits stronger scaling properties, much closer to the gold standard of linear scaling for these algorithms. We are unsure whether this is due to our simpler (broadcast/gather) communication paradigm or some other property of the system.

TABLE II: Lines of code for various implementations of ALS

  System      Lines of Code
  MLbase      32
  GraphLab    383
  Mahout      865
  MATLAB-Mex  124
  MATLAB      20

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let $M \in \mathbb{R}^{m \times n}$ be some underlying matrix, and suppose that only a small subset, $\Omega(M)$, of its entries are revealed. The goal of matrix factorization is to find low-rank matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, where $k \ll n, m$, such that $M \approx UV^T$. Commonly, $U$ and $V$ are estimated using the following bi-convex objective:

$$\min_{U,V} \sum_{(i,j) \in \Omega(M)} (M_{ij} - u_i^T v_j)^2 + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right). \quad (2)$$

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing $U$ with $V$ fixed, and $V$ with $U$ fixed. ALS is well-suited for parallelism, as each row of $U$ can be solved independently with $V$ fixed, and vice versa. With $V$ fixed, the minimization problem for each row $u_i$ is solved with the closed-form solution

$$u_i^* = \left( V_{\Omega_i}^T V_{\Omega_i} + \lambda I_k \right)^{-1} V_{\Omega_i}^T M_{i\Omega_i},$$

where $u_i^* \in \mathbb{R}^k$ is the optimal solution for the $i$-th row vector of $U$, $V_{\Omega_i}$ is a sub-matrix of rows $v_j$ such that $j \in \Omega_i$, and $M_{i\Omega_i}$ is a sub-vector of observed entries in the ...
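The 'vectorized' gradient descent the excerpt contrasts with SGD can be sketched in a few lines; here in NumPy rather than MATLAB, on synthetic data (nothing below comes from the MLbase codebase, and the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def logistic_gd(X, y, iters=100, lr=0.5):
    """Batch gradient descent for logistic regression.

    One matrix-vector product per iteration replaces SGD's inner
    loop over individual examples -- the 'vectorized' form the
    excerpt refers to.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predictions for all n points
        grad = X.T @ (p - y) / n           # full-batch gradient
        w -= lr * grad
    return w

# Tiny synthetic problem: the label is 1 exactly when feature 0 is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)
w = logistic_gd(X, y)
acc = ((X @ w > 0) == (y == 1)).mean()
```

Each iteration touches the whole dataset through one dense matrix product, which is also the shape of the broadcast/gather pattern mentioned above: broadcast `w`, compute partial gradients per partition, gather and sum.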


Page 76: Logistic Regression - Strong Scaling

✦ Fixed dataset: 50K images, 160K dense features
✦ MLI/Spark exhibits better scaling properties
✦ MLI/Spark faster than VW with 16 and 32 machines



Page 81: ALS - Walltime

✦ Dataset: scaled version of Netflix data (9x in size)
✦ Cluster: 9 machines
✦ MLI/Spark an order of magnitude faster than Mahout
✦ MLI/Spark within a factor of 2 of GraphLab

  System     Walltime (seconds)
  Matlab     15443
  Mahout     4206
  GraphLab   291
  MLI/Spark  481
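The alternating closed-form row updates described in the excerpted paper text can be sketched as follows. This is a minimal dense-data illustration, not the MLI implementation; the observed set Ω is represented by a 0/1 mask, and `lam`, `k`, and the iteration count are illustrative choices:

```python
import numpy as np

def als(M, mask, k=2, lam=0.1, iters=20):
    """Minimal alternating least squares for matrix factorization.

    Approximately solves
        min_{U,V} sum_{(i,j) in Omega} (M_ij - u_i^T v_j)^2
                  + lam * (||U||_F^2 + ||V||_F^2)
    by alternating closed-form row solves. Each row of U (resp. V)
    is solved independently, which is what makes ALS easy to
    parallelize across machines.
    """
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        for i in range(m):                      # update U with V fixed
            obs = mask[i] == 1
            U[i] = np.linalg.solve(V[obs].T @ V[obs] + I,
                                   V[obs].T @ M[i, obs])
        for j in range(n):                      # update V with U fixed
            obs = mask[:, j] == 1
            V[j] = np.linalg.solve(U[obs].T @ U[obs] + I,
                                   U[obs].T @ M[obs, j])
    return U, V

# Recover a rank-2 matrix from roughly 70% of its entries.
rng = np.random.default_rng(1)
truth = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 15))
mask = (rng.random(truth.shape) < 0.7).astype(int)
U, V = als(truth, mask, k=2)
err = np.abs((U @ V.T - truth)[mask == 1]).mean()
```

The per-row solve is exactly the closed-form update from the paper excerpt: a k-by-k regularized normal-equations system, so the expensive part of each iteration is embarrassingly parallel over rows.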

Page 82: Vision | MLI Details | Current Status | ML Workflow

Page 83: MLI Functionality

Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares, DFC
Clustering: K-Means, DP-Means
Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, Naive Bayes, Decision Trees
Optimization Primitives: Parallel Gradient, Local SGD, L-BFGS, ADMM, Adagrad
Feature Extraction: Principal Component Analysis (PCA), N-grams, feature normalization
ML Tools: Cross Validation, Evaluation Metrics



Page 88: MLbase Stack Status

  ML Optimizer  (End User)
  MLI           (ML Developer)
  MLlib
  Spark

Goal 1: Summer Release
Goal 2: Winter Release

[Architecture diagram: an ML Developer (ML Contract + Code) and a User (Declarative ML Task) interact with a Master Server containing a Parser, Optimizer, Executor/Monitoring, and ML Library, backed by Meta-Data and Statistics; DMX Runtimes on slave machines execute the plan (LLP/PLP), and a result (e.g., fn-model & summary) is returned.]

Page 89: Future Directions

✦ Identify minimal set of ML operators

✦ Expose internals of ML algorithms to optimizer

✦ Plug-ins to Python, R

✦ Visualization for unsupervised learning and exploration

✦ Advanced ML capabilities
  ✦ Time-series algorithms
  ✦ Graphical models
  ✦ Advanced optimization (e.g., asynchronous computation)
  ✦ Online updates
  ✦ Sampling for efficiency

Page 90: Vision | MLI Details | Current Status | ML Workflow


Page 97: Typical Data Analysis Workflow

  Obtain / Load Raw Data:  Spark, MLI
  Data Exploration:        Spark, [MLI]
  Feature Extraction:      MLI
  Learning:                MLI, MLlib
  Evaluation:              MLI
  Deployment:              Scala

Adapted from slides by Ariel Kleiner
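The stages in the table compose naturally as a pipeline. A toy end-to-end sketch with plain-Python stand-ins (these function names are illustrative, not Spark/MLI API calls; in MLbase the load/exploration stages would run on Spark and the learning/evaluation stages through MLI/MLlib):

```python
def load_raw():                      # Obtain / Load Raw Data
    # Hypothetical labeled corpus: (text, 1=spam / 0=ham).
    return [("free money now", 1), ("meeting at noon", 0),
            ("win free prize", 1), ("lunch tomorrow?", 0)]

def extract_features(docs, vocab):   # Feature Extraction
    # Bag-of-words presence features over a small vocabulary.
    return [([1.0 if w in text.split() else 0.0 for w in vocab], label)
            for text, label in docs]

def learn(data):                     # Learning
    # Toy "model": predict spam when any spammy vocabulary word appears.
    return lambda x: 1 if sum(x) >= 1.0 else 0

def evaluate(model, data):           # Evaluation
    return sum(model(x) == y for x, y in data) / len(data)

vocab = ["free", "win", "prize"]
data = extract_features(load_raw(), vocab)
model = learn(data)
accuracy = evaluate(model, data)
```

The point is the shape, not the learner: each stage consumes the previous stage's output, so swapping the toy rule for a distributed MLlib classifier changes only the `learn` step.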


Page 99: Binary Classification

Textbook excerpt (section 1.2, Definitions and terminology):

Figure 1.1: The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm ...

Adapted from slides by Ariel Kleiner

Page 100: ML base ML base ML baseampcamp.berkeley.edu/wp-content/uploads/2013/08/mlbase...Code Master Server …. result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library

Binary  ClassificaVon

Goal:  Learn  a  mapping  from  enVVes  to  discrete  labels

1.2 Definitions and terminology 3

Figure 1.1 The zig-zag line on the left panel is consistent over the blue and redtraining sample, but it is a complex separation surface that is not likely to generalizewell to unseen data. In contrast, the decision surface on the right panel is simplerand might generalize better in spite of its misclassification of a few points of thetraining sample.

Which concept families can actually be learned, and under what conditions? Howwell can these concepts be learned computationally?

1.2 Definitions and terminology

We will use the canonical problem of spam detection as a running example toillustrate some basic definitions and to describe the use and evaluation of machinelearning algorithms in practice. Spam detection is the problem of learning toautomatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spamproblem, these examples correspond to the collection of email messages we will usefor learning and testing.

Features: The set of attributes, often represented as a vector, associated to anexample. In the case of email messages, some relevant features may include thelength of the message, the name of the sender, various characteristics of the header,the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems,examples are assigned specific categories, for instance, the spam and non-spamcategories in our binary classification problem. In regression, items are assignedreal-valued labels.

Training sample: Examples used to train a learning algorithm. In our spamproblem, the training sample consists of a set of email examples along with theirassociated labels. The training sample varies for di↵erent learning scenarios, asdescribed in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm

Adapted  from  slides  by  Ariel  Kleiner

Page 101: ML base ML base ML baseampcamp.berkeley.edu/wp-content/uploads/2013/08/mlbase...Code Master Server …. result (e.g., fn-model & summary) Optimizer Parser Executor/Monitoring ML Library

Binary  ClassificaVon

Goal:  Learn  a  mapping  from  enVVes  to  discrete  labels

Example:  Spam  ClassificaVon✦ EnVVes  are  emails✦ Labels  are  {spam,  not-­‐spam}

1.2 Definitions and terminology 3

Figure 1.1 The zig-zag line on the left panel is consistent over the blue and redtraining sample, but it is a complex separation surface that is not likely to generalizewell to unseen data. In contrast, the decision surface on the right panel is simplerand might generalize better in spite of its misclassification of a few points of thetraining sample.

Which concept families can actually be learned, and under what conditions? Howwell can these concepts be learned computationally?

1.2 Definitions and terminology

We will use the canonical problem of spam detection as a running example toillustrate some basic definitions and to describe the use and evaluation of machinelearning algorithms in practice. Spam detection is the problem of learning toautomatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spamproblem, these examples correspond to the collection of email messages we will usefor learning and testing.

Features: The set of attributes, often represented as a vector, associated to anexample. In the case of email messages, some relevant features may include thelength of the message, the name of the sender, various characteristics of the header,the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems,examples are assigned specific categories, for instance, the spam and non-spamcategories in our binary classification problem. In regression, items are assignedreal-valued labels.

Training sample: Examples used to train a learning algorithm. In our spamproblem, the training sample consists of a set of email examples along with theirassociated labels. The training sample varies for di↵erent learning scenarios, asdescribed in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm

Adapted  from  slides  by  Ariel  Kleiner

Page 102

Binary Classification

Goal: Learn a mapping from entities to discrete labels

Example: Spam Classification
✦ Entities are emails
✦ Labels are {spam, not-spam}
✦ Given past labeled emails, we want to predict whether a new email is spam or not-spam

1.2 Definitions and terminology

Figure 1.1: The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm.

Page 103

Binary Classification

Goal: Learn a mapping from entities to discrete labels

Other Examples:
✦ Click (and clickthrough rate) prediction
✦ Fraud detection
✦ Face detection
✦ Exercise: “ARTS” vs “LIFE” on Wikipedia
✦ Real data


Page 109

Classification Pipeline

[Pipeline figure: full dataset → training set + test set; training set → classifier; classifier + test set → accuracy; classifier + new entity → prediction]

1. Randomly split full data into disjoint subsets
2. Featurize the data
3. Use training set to learn a classifier
4. Evaluate classifier on test set (avoid overfitting)
5. Use classifier to predict in the wild

Adapted from slides by Ariel Kleiner
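Step 1 of the pipeline can be sketched in a few lines. This is an illustrative plain-Python version, not the MLI/MLlib API; the function name and the 80/20 default split are assumptions:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Randomly partition `data` into two disjoint subsets:
    a training set and a held-out test set."""
    items = list(data)
    random.Random(seed).shuffle(items)        # random but reproducible order
    n_test = int(len(items) * test_fraction)  # size of the test set
    return items[n_test:], items[:n_test]     # (training set, test set)

train, test = train_test_split(range(10))
```

Keeping the two subsets disjoint is what makes accuracy on the test set an honest estimate of performance on new entities.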

Page 110

E.g., Spam Classification

From: [email protected]
"Eliminate your debt by giving us your money..." → spam

From: [email protected]
"Hi, it's been a while! How are you? ..." → not-spam

Adapted from slides by Ariel Kleiner


Page 114

Featurization

✦ Most classifiers require numeric descriptions of entities
✦ Featurization: Transform each entity into a vector of real numbers
✦ Opportunity to incorporate domain knowledge
✦ Useful even when original data is already numeric
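As a minimal sketch of featurization (plain Python; the keyword features are hypothetical choices for illustration, not from the slides), an email body can be mapped to a vector of real numbers:

```python
def featurize(message, keywords=("money", "debt")):
    """Map a raw email body to [length, keyword indicators...]:
    one place to encode domain knowledge about spam."""
    text = message.lower()
    length = float(len(message))  # already numeric, but still a feature
    indicators = [1.0 if k in text else 0.0 for k in keywords]
    return [length] + indicators

featurize("Eliminate your debt by giving us your money...")
```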

Page 115

E.g., “Bag of Words”

Vocabulary: been, debt, eliminate, giving, how, it's, money, while

Adapted from slides by Ariel Kleiner


Page 118

E.g., “Bag of Words”
✦ Entities are documents
✦ Build Vocabulary
✦ Derive feature vectors from Vocabulary
✦ Exercise: we'll use bigrams

“Eliminate your debt by giving us your money...” as a feature vector:

been       0
debt       1
eliminate  1
giving     1
how        0
it's       0
money      1
while      0
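The 0/1 vector above can be reproduced with a short bag-of-words sketch (plain Python, using the slide's eight-word vocabulary; the regex tokenizer is a simplifying assumption):

```python
import re

VOCAB = ["been", "debt", "eliminate", "giving", "how", "it's", "money", "while"]

def bag_of_words(message, vocab=VOCAB):
    """Binary bag-of-words: 1 if the vocabulary word occurs in the message, else 0."""
    tokens = set(re.findall(r"[a-z']+", message.lower()))
    return [1 if word in tokens else 0 for word in vocab]

bag_of_words("Eliminate your debt by giving us your money...")
# -> [0, 1, 1, 1, 0, 0, 1, 0]
```

The exercise's bigram variant would index pairs of adjacent tokens instead of single words, with the same 0/1 encoding.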


Page 123

Support Vector Machines (SVMs)

✦ “Max-Margin”: find linear separator with the largest separation between the two classes
✦ Extensions:
✦ non-separable setting
✦ non-linear classifiers (kernels)

Figure 4.1: Two possible separating hyperplanes. The right-hand side figure shows a hyperplane that maximizes the margin.

4.2 SVMs — separable case

In this section, we assume that the training sample S can be linearly separated, that is, we assume the existence of a hyperplane that perfectly separates the training sample into two populations of positively and negatively labeled points, as illustrated by the left panel of figure 4.1. But there are then infinitely many such separating hyperplanes. Which hyperplane should a learning algorithm select? The solution returned by the SVM algorithm is the hyperplane with the maximum margin, or distance to the closest points, and is thus known as the maximum-margin hyperplane. The right panel of figure 4.1 illustrates that choice.

We will present later in this chapter a margin theory that provides a strong justification for this solution. We can observe already, however, that the SVM solution can also be viewed as the “safest” choice in the following sense: a test point is classified correctly by a separating hyperplane with margin ρ even when it falls within a distance ρ of the training samples sharing the same label; for the SVM solution, ρ is the maximum margin and thus the “safest” value.

4.2.1 Primal optimization problem

We now derive the equations and optimization problem that define the SVM solution. The general equation of a hyperplane in R^N is

    w · x + b = 0,    (4.3)

where w ∈ R^N is a non-zero vector normal to the hyperplane and b ∈ R a scalar. Note that this definition of a hyperplane is invariant to non-zero scalar multiplication. Hence, for a hyperplane that does not pass through any sample point, we can scale w and b appropriately such that min_{(x,y)∈S} |w · x + b| = 1.

Credit: Foundations of Machine Learning, Mohri, Rostamizadeh, Talwalkar
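The hyperplane equation w · x + b = 0 and the margin from the excerpt can be made concrete with a small plain-Python sketch. Note this only evaluates a given hyperplane; it does not solve the SVM optimization problem that finds the max-margin one:

```python
import math

def classify(w, b, x):
    """Predict +1 or -1 from which side of the hyperplane w·x + b = 0 the point falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def geometric_margin(w, b, points):
    """Distance from the closest point to the hyperplane: min |w·x + b| / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in points) / norm

# Hyperplane x1 = 1 in the plane: w = (1, 0), b = -1
classify((1, 0), -1, (2.0, 0.0))  # -> 1
```

Among all hyperplanes that classify the training points correctly, the SVM solution is the one whose `geometric_margin` over the training set is largest.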


Page 131: Model Evaluation

✦ Test set simulates performance on a new entity
  ✦ Performance on the training data is overly optimistic!
  ✦ "Overfitting"; "Generalization"
✦ Various metrics for quality; accuracy is most common
✦ Evaluation process
  ✦ Train on the training set (don't expose the test set to the classifier)
  ✦ Make predictions using the test set (ignoring test labels)
  ✦ Compute the fraction of correct predictions on the test set
✦ Other more sophisticated evaluation methods, e.g., cross-validation

[Classification diagram: the full dataset is split into a training set and a test set; a classifier trained on the training set is scored for accuracy on the test set, then used to make predictions for new entities.]

Adapted from slides by Ariel Kleiner
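The evaluation process on this slide (train/test split, held-out accuracy, and k-fold cross-validation) can be sketched in plain Python. This is an illustrative toy, not MLbase code: the threshold classifier, the synthetic 1-D data, and all function names are assumptions made for the example.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Randomly partition labeled examples into training and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train(train_set):
    """Toy 1-D classifier: threshold at the midpoint of the class means."""
    pos = [x for x, y in train_set if y == 1]
    neg = [x for x, y in train_set if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0

def accuracy(classifier, test_set):
    """Fraction of correct predictions on held-out examples."""
    correct = sum(1 for x, y in test_set if classifier(x) == y)
    return correct / len(test_set)

def k_fold_cv(data, k=5, seed=0):
    """Average held-out accuracy over k folds (cross-validation)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]                     # fold i is the test set
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(accuracy(train(rest), held_out))
    return sum(scores) / len(scores)

# Synthetic labeled data: (feature, label), two well-separated classes.
rng = random.Random(42)
data = [(rng.gauss(0, 1), 0) for _ in range(100)] + \
       [(rng.gauss(3, 1), 1) for _ in range(100)]

train_set, test_set = train_test_split(data)
model = train(train_set)   # the test set is never seen during training
print("held-out accuracy:", accuracy(model, test_set))
print("5-fold CV accuracy:", k_fold_cv(data))
```

Note that the classifier is fit only on the training folds each time, matching the slide's warning not to expose the test set during training.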

Page 132: Contributions encouraged!

www.mlbase.org

biglearn.org