Ameet Talwalkar, Assistant Professor of Computer Science, UCLA, at MLconf SF
Posted on 04-Jul-2015
Towards an Optimizer for MLbase
Ameet Talwalkar
UC Berkeley
Collaborators: Evan Sparks, Michael Franklin, Michael I. Jordan, Tim Kraska
Problem: Scalable implementations are difficult for ML Developers…
[Architecture diagram: a User submits a Declarative ML Task and an ML Developer supplies ML Contract + Code to the Master Server (Parser, Optimizer with Meta-Data and Statistics, ML Library, Executor/Monitoring); the Master distributes LLP/PLP plans to Slaves running DMX Runtimes and returns a result (e.g., fn-model & summary).]
CHALLENGE: Can we simplify distributed ML development?
Problem: ML is difficult for End Users…
✦ Too many algorithms…
✦ Too many ways to preprocess…
✦ Too many knobs…
✦ Difficult to debug…
✦ Doesn't scale…
CHALLENGE: Can we automate ML pipeline construction?
MLbase
[Stack diagram: MLOpt / MLI / MLlib on top of Apache Spark]
✦ Spark: Cluster computing system designed for iterative computation
✦ MLlib: Spark's core ML library
✦ MLI: API to simplify ML development
✦ MLOpt: Declarative layer to automate hyperparameter tuning
MLbase aims to simplify development and deployment of scalable ML pipelines
MLOpt and MLI are experimental testbeds
MLlib
✦ Scalable and fast
✦ Simple development environment
✦ Part of Spark's robust ecosystem
[Spark ecosystem diagram: Apache Spark with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph)]
Initial Release
• Developed by MLbase team in AMPLab (11 contributors)
• Scala, Java
• Shipped with Spark v0.8 (Sep 2013)
15 months later… Active Development
• 60+ contributors from various organizations
• Scala, Java, Python
• Many more algorithms and ML utilities
• Improved documentation / code examples, API stability
• Latest release part of Spark v1.1 (Sep 2014)
MLlib, MLI and Roadmap
• MLI: Shield ML Developers from low-level details
  • Provide familiar mathematical operators in distributed setting (tables, matrices, optimization primitives)
  • Standard APIs defining ML algorithms and feature extractors
  • MLlib v1.2 will include ML API inspired by MLI
• Longer term for MLlib
  • Scalable implementations of standard ML algorithms and underlying optimization primitives
  • Further support for ML pipeline development (including hyperparameter tuning using ideas from MLOpt)
Feedback and Contributions Encouraged!
Vision: MLOpt
✦ User declaratively specifies task
✦ PAQ = Predictive Analytic Query
✦ Search through MLlib to find the best model/pipeline
[Analogy: SQL → Result]
TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries

ABSTRACT
The proliferation of massive datasets combined with the development of sophisticated analytical techniques has enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. These and many other applications can be supported by Predictive Analytic Queries (PAQs). The major obstacle to supporting these queries is the challenging and expensive process of PAQ planning, which involves identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single-node implementations and have assumed that model training itself is a black box, thus limiting the effectiveness of such approaches on large-scale problems. In this work, we build upon these recent efforts and propose an integrated PAQ planning architecture that combines advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching. The resulting system, TuPAQ, solves the PAQ planning problem with comparable accuracy to exhaustive strategies but an order of magnitude faster, and can scale to models trained on terabytes of data across hundreds of machines.
1. INTRODUCTION
Over the past four decades, a great deal of database systems research has focused on finding efficient execution strategies for a number of different workloads – single node and distributed transactional, analytical, and stream processing workloads have all been active areas of study. Rapidly growing data volumes coupled with the maturity of sophisticated statistical techniques have led to demand for support of a new type of workload: predictive analytics over large-scale, distributed datasets. Indeed, the support of predictive analytic queries in database systems is an increasingly well studied area. Systems like MLbase [37], MADlib [33], COLUMBUS [38], MauveDB [27], BayesStore [52] and DimmWitted [56] are all efforts to integrate statistical query processing with a data management system.
Concretely, users would like to issue queries that involve reasoning about predicted attributes, where predicted attributes are derived from a set of observed input attributes along with a labeled
SELECT e.sender, e.subject, e.message
FROM Emails e
WHERE e.user = 'Bob'
  AND PREDICT(e.spam, e.message) = false
GIVEN LabeledData
(a) Spam labeling.

SELECT um.title
FROM UserMovies um
WHERE um.user = 'Bob' AND um.viewed = false
ORDER BY PREDICT(um.rating) GIVEN Ratings
  DESC LIMIT 50
(b) Movie recommendation.

SELECT p.image
FROM Pictures p
WHERE PREDICT(p.tag, p.photo) = 'Plant' GIVEN LabeledPhotos
  AND p.likes > 500
(c) Photo classification.

Figure 1: Three examples of PAQs, with the predictive clauses highlighted in green. (1a) leverages predicted values in the spam attribute to return Bob's non-spam emails. (1b) leverages Bob's predicted movie ratings to return Bob's list of the top 50 predicted movie titles. (1c) finds popular pictures of photos based on an image classification model – even if the images are not labeled. Each of these use cases may require considerable training data.
training dataset. We refer to such queries as Predictive Analytic Queries, or PAQs. The predictions returned by the system should be highly accurate both on training data and new data as it comes into the system. PAQs consist of traditional database queries along with new predictive clauses, and these predictive clauses are the focus of this work. Examples of PAQs are illustrated in Figure 1, with the predictive clauses highlighted in green.
Given recent advances in statistical methodology, supervised machine learning (ML) techniques are a natural way to support the predictive clauses in PAQs. In the supervised learning setting, a statistical model leverages training data to relate the input attributes to the desired output attribute. Furthermore, ML methods learn better models as the size of the training data increases, and recent advances in distributed ML algorithm development, e.g., [13, 44, 40], enable large-scale model training in the distributed setting. Unfortunately, the application of supervised learning techniques to a new input dataset is computationally demanding and technically challenging. For a non-expert, the process of carefully preprocessing the input attributes, selecting the appropriate ML model, and tuning…
[Analogy: PAQ → Model, via ML]
A Standard ML Pipeline
Data → Feature Extraction → Model Training → Final Model
✦ In practice, model building is an iterative process of continuous refinement
✦ Our grand vision is to automate the construction of these pipelines
Training A Model
✦ For each point in dataset
  ✦ compute gradient
  ✦ update model
  ✦ repeat until converged
✦ Requires multiple passes
✦ Common access pattern
  ✦ Naive Bayes, Trees, etc.
✦ Minutes to train an SVM on 200GB of data on a 16-node cluster
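The loop above (per-point gradient, model update, repeat until converged) can be sketched in a few lines. This is an illustrative NumPy version of a gradient-descent learner; the function name and toy data are ours, not MLlib's actual implementation.

```python
import numpy as np

def train_sgd(X, y, lr=0.1, epochs=100):
    """Multiple passes over the data: a per-point gradient step on a
    logistic-loss linear model, repeated for a fixed epoch budget."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):          # repeat until converged (here: fixed passes)
        for xi, yi in zip(X, y):     # for each point in the dataset
            p = 1.0 / (1.0 + np.exp(-xi.dot(w)))  # predicted probability
            w -= lr * (p - yi) * xi  # compute gradient, update model
    return w

# Tiny synthetic example: separable 1-D data with a bias feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = train_sgd(X, y)
preds = (X.dot(w) > 0).astype(float)
```

The distributed versions discussed in the talk follow the same access pattern, just with the per-point work partitioned across the cluster.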
The Tricky Part
✦ Algorithms
  ✦ Logistic Regression, SVM, Tree-based, etc.
✦ Algorithm hyper-parameters
  ✦ Learning Rate, Regularization, etc.
✦ Featurization
  ✦ Text: n-grams, TF-IDF
  ✦ Images: Gabor filters, random convolutions
  ✦ Random projection? Scaling?
A Standard ML Pipeline
✦ Start with one aspect of the pipeline: model selection
Data → Feature Extraction → Model Training → Final Model
Automated Model Selection
One Approach: Try it all!
✦ Search over all hyperparameters, algorithms, features, etc.
✦ Drawbacks
  ✦ Expensive to compute models
  ✦ Hyperparameter space is large
✦ Some version of this still often done in practice!
[Figure: grid of candidate models over Learning Rate × Regularization, with the best answer marked]
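"Try it all" is literally a nested loop over hyperparameter values. A minimal sketch, where the black-box `validation_error` function stands in for the expensive train-and-evaluate step (its quadratic error surface is invented purely for illustration):

```python
import itertools
import numpy as np

def validation_error(learning_rate, regularization):
    """Stand-in for 'train a model, measure held-out error' -- in reality
    this is the expensive part. The error surface here is illustrative."""
    return (np.log10(learning_rate) + 1) ** 2 + (np.log10(regularization) + 3) ** 2

# Exhaustively evaluate every (learning rate, regularization) pair.
learning_rates = [1e-3, 1e-2, 1e-1, 1.0]
regularizations = [1e-5, 1e-4, 1e-3, 1e-2]
grid = list(itertools.product(learning_rates, regularizations))
errors = {params: validation_error(*params) for params in grid}
best = min(errors, key=errors.get)  # 16 models trained for a 4x4 grid
```

The cost is multiplicative in the number of values per hyperparameter, which is why the space gets large so quickly.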
A Better Approach
✦ Better resource utilization
  ✦ through batching
✦ Algorithmic speedups
  ✦ via early stopping / bandits
✦ Improved search
  ✦ e.g., via randomization
[Figure: search over Learning Rate × Regularization, with the best answer marked]
A Tale Of 3 Optimizations
Better Resource Utilization · Algorithmic Speedups · Improved Search
Better Resource Utilization
✦ Typical model update requires 2-4 flops/double
✦ But modern memory much slower than processors
✦ We can do 25 flops / double read!
✦ This equates to 6-8 model updates per double we read, assuming models fit in cache
✦ Train multiple models simultaneously
What Do We See In Spark?
✦ 2x to 5x increase in models trained/sec with batching
✦ Overhead from virtualization, network, etc.
✦ These numbers are with vector-matrix multiplies
✦ Can do better when rewriting in terms of matrix-matrix multiplies
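The batching idea, amortizing each pass over the data across many models, can be sketched with NumPy: stack the weight vectors of k candidate models as columns of one matrix, so a single matrix-matrix multiply scores every model at once. The synthetic data and the choice of k logistic models differing only in learning rate are our own illustration, not the Spark code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 20, 8            # n points, d features, k models trained at once
X = rng.standard_normal((n, d))
true_w = rng.standard_normal(d)
y = (X @ true_w > 0).astype(float)

# k models stacked column-wise: W is d x k, one column per learning rate.
lrs = np.logspace(-3, 0, k)
W = np.zeros((d, k))
for _ in range(50):
    P = 1.0 / (1.0 + np.exp(-(X @ W)))  # one matrix-matrix multiply scores all k models
    G = X.T @ (P - y[:, None]) / n      # gradients for all k models at once
    W -= G * lrs                        # per-model learning rates (broadcast over columns)

errors = ((X @ W > 0) != y[:, None]).mean(axis=0)  # training error of each model
```

Each pass reads X once but performs k models' worth of flops on it, which is exactly the flops-per-double-read trade the slides describe.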
A Tale Of 3 Optimizations
Better Resource Utilization · Algorithmic Speedups · Improved Search
Bandit Search
✦ Each point is a trained model
✦ Some models look bad early
✦ So we give up early!
✦ Can frame this as a multi-armed bandit problem
  ✦ Each model is an arm
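A minimal sketch of the give-up-early idea, in the style of successive halving: train every configuration ("arm") for one unit of work, drop the worse half, repeat. Everything here, the `run_epoch` stand-in and the per-arm error floors, is invented for illustration; TuPAQ's actual bandit allocation uses runtime algorithm introspection rather than a synthetic error model.

```python
import numpy as np

def run_epoch(arm, state):
    """Stand-in for one unit of training on configuration `arm`; returns the
    updated state and current validation error (illustrative error model:
    each arm converges toward its own floor, so bad arms look bad early)."""
    state += 1
    return state, arm["floor"] + 0.5 / state

rng = np.random.default_rng(1)
arms = [{"floor": f, "state": 0} for f in rng.uniform(0.05, 0.45, 16)]

# Train all active arms briefly, then keep only the better half.
active = list(range(len(arms)))
while len(active) > 1:
    errs = {}
    for i in active:
        arms[i]["state"], errs[i] = run_epoch(arms[i], arms[i]["state"])
    active = sorted(active, key=errs.get)[: len(active) // 2]  # give up early on bad arms

best_arm = active[0]
```

With 16 arms this spends 16 + 8 + 4 + 2 = 30 epochs instead of training all 16 arms to completion.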
A Tale Of 3 Optimizations
Better Resource Utilization · Algorithmic Speedups · Improved Search
What Search Method?
✦ Various derivative-free optimization techniques
  ✦ Simple ones (Grid, Random)
  ✦ Classic derivative-free (Nelder-Mead, Powell's method)
  ✦ Bayesian (SMAC, TPE, Spearmint)
✦ What should we do?
  ✦ Tried on 5 datasets, optimized over 4 hyperparameters!
[Figure: Comparison of Search Methods Across Learning Problems – validation error vs. maximum calls (16, 81, 256, 625) for GRID, NELDER_MEAD, POWELL, RANDOM, SMAC, SPEARMINT, and TPE on the australian, breast, diabetes, fourclass, and splice datasets]
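Random search, the simplest alternative to grid above, replaces the fixed grid with draws from the whole hyperparameter space, which often uses the same budget better when only a few hyperparameters matter. A sketch with an invented error surface standing in for real training:

```python
import numpy as np

def validation_error(lr, reg):
    """Illustrative stand-in for training a model and measuring held-out error."""
    return (np.log10(lr) + 1) ** 2 + (np.log10(reg) + 3) ** 2

rng = np.random.default_rng(42)
budget = 64  # same number of model fits as an 8x8 grid

# Sample each hyperparameter log-uniformly over its plausible range.
candidates = [(10 ** rng.uniform(-4, 1), 10 ** rng.uniform(-6, 0))
              for _ in range(budget)]
best = min(candidates, key=lambda p: validation_error(*p))
```

Unlike grid search, every draw explores a fresh value of every hyperparameter, so adding an unimportant hyperparameter does not dilute coverage of the important ones.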
[Figure: Model Convergence Over Time – best validation error seen so far vs. time elapsed (m), comparing Grid (unoptimized), Random (optimized), and TPE (optimized)]
Putting It All Together
✦ First version of MLbase optimizer
✦ 30GB dense images (240K x 16K)
✦ 2 model families, 5 hyperparams
✦ Baseline: grid search
✦ Our method: combination of
  ✦ Batching
  ✦ Early stopping / Bandit
  ✦ Random or TPE
20x speedup compared to grid search: 15 minutes vs 5 hours!
Does It Scale?
✦ 1.5TB dataset (1.2M x 160K)
✦ 128 nodes, thousands of passes over data
✦ Tried 32 models in 15 hours
✦ Good answer after 11 hours
[Figure: Convergence of Model Accuracy on 1.5TB Dataset – best validation error seen so far vs. time elapsed (h)]
Future Work: Automated ML Pipeline Construction
Data → Feature Extraction → Model Training → Final Model
[Pipeline diagram: Data → Image Parser, feeding a Feature Extractor built from a Patch Extractor → Patch Whitener → KMeans Clusterer branch and a Normalizer → Convolver → sqrt, mean → Symmetric Rectifier → ident, abs / ident, mean → Global Pooler → Zipper branch; the Feature Extractor and a Label Extractor feed a Linear Solver → Linear Mapper → Model; Test Data flows through the Feature Extractor and Label Extractor into an Error Computer → Test Error. Stages range from no hyperparameters, through a few hyperparameters, to lots of hyperparameters.]
Slide courtesy of Evan Sparks
Other Future Work
✦ Ensembling
✦ Leverage sampling
✦ Better parallelism for smaller datasets
✦ Multiple hypothesis testing issues
MLOpt: Declarative layer that aims to automate ML pipeline construction
MLI: API to simplify ML development
MLlib: Spark's core ML library
Spark: Cluster computing system designed for iterative computation
THANKS! QUESTIONS?
www.mlbase.org