
Consensus Optimization and Machine Learning

Stephen Boyd and Steven Diamond

EE & CS Departments

Stanford University

H2O World, 11/10/2015


Outline

Convex optimization

Model fitting via convex optimization

Consensus optimization and model fitting


Convex optimization

Convex optimization problem

convex optimization problem:

    minimize    f0(x)
    subject to  fi(x) ≤ 0,   i = 1, ..., m
                Ax = b

- variable x ∈ R^n
- the equality constraints are linear
- f0, ..., fm are convex: for θ ∈ [0, 1],

      fi(θx + (1−θ)y) ≤ θ fi(x) + (1−θ) fi(y)

  i.e., the fi have nonnegative (upward) curvature

Why convex optimization?

- we can solve convex optimization problems effectively
- there are lots of applications


Application areas

- machine learning, statistics
- finance
- supply chain, revenue management, advertising
- control
- signal and image processing, vision
- networking
- circuit design
- and many others ...

Convex optimization solvers

- medium scale (1000s–10000s variables, constraints): interior-point methods on a single machine
- large scale (100k – 1B variables, constraints): custom (often problem-specific) methods, e.g., SGD
- lots of ongoing research
- growing list of open-source solvers

Convex optimization modeling languages

- (new) high-level language support for convex optimization
  - describe the problem in a high-level language
  - the problem is compiled to standard form and solved
- implementations:
  - YALMIP, CVX (Matlab)
  - CVXPY (Python)
  - Convex.jl (Julia)

CVXPY

(Diamond & Boyd, 2013)

    minimize    ‖Ax − b‖₂² + γ‖x‖₁
    subject to  ‖x‖∞ ≤ 1

from cvxpy import *

# problem data A, b, gamma, and the dimension n are given
x = Variable(n)
cost = sum_squares(A*x - b) + gamma*norm(x, 1)
prob = Problem(Minimize(cost), [norm(x, "inf") <= 1])
opt_val = prob.solve()    # optimal objective value
solution = x.value        # optimal point

Example: Image in-painting

- guess pixel values in obscured/corrupted parts of an image
- total variation in-painting: choose pixel values xij ∈ R³ to minimize the total variation

      TV(x) = ∑ij ‖ (xi+1,j − xij,  xi,j+1 − xij) ‖₂

- a convex problem (a CVXPY sketch follows below)
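A minimal CVXPY sketch of TV in-painting, assuming a grayscale image for simplicity (the slides use color); the image orig and the mask known are synthetic stand-ins for real data:

import cvxpy as cp
import numpy as np

np.random.seed(0)
orig = np.random.rand(50, 50)                         # stand-in grayscale image
known = (np.random.rand(50, 50) < 0.2).astype(float)  # 1 where the pixel is observed

x = cp.Variable((50, 50))
# agree with the image on the known pixels; fill in the rest by minimizing TV
constraints = [cp.multiply(known, x) == cp.multiply(known, orig)]
prob = cp.Problem(cp.Minimize(cp.tv(x)), constraints)
prob.solve()
recovered = x.value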

Example

512 × 512 color image (n ≈ 800,000 variables)

(figure: Original vs. Corrupted)

Example

(figure: Original vs. Recovered)

Example

80% of pixels removed

(figure: Original vs. Corrupted)

Example

80% of pixels removed

(figure: Original vs. Recovered)

Model fitting via convex optimization

Predictor

- given data (xi, yi), i = 1, ..., m
- x is the feature vector, y is the outcome or label
- find a predictor ψ so that

      y ≈ ŷ = ψ(x)   for data (x, y) that you haven't seen

- ψ is a regression model for y ∈ R
- ψ is a classifier for y ∈ {−1, 1}

Loss minimization predictor

- predictor parametrized by θ ∈ R^n
- loss function L(xi, yi, θ) gives the misfit for data point (xi, yi)
- for a given θ, the predictor is

      ψ(x) = argmin_y L(x, y, θ)

- how do we choose the parameter θ?

Model fitting via regularized loss minimization

- choose θ by minimizing the regularized loss

      (1/m) ∑_{i=1}^m L(xi, yi, θ) + λ r(θ)

- the regularization r(θ) penalizes model complexity, enforces constraints, or represents a prior
- λ > 0 scales the regularization
- for many useful cases, this is a convex problem


Examples

    predictor             L(x, y, θ)              ψ(x)         r(θ)
    least-squares         (θᵀx − y)²              θᵀx          0
    ridge regression      (θᵀx − y)²              θᵀx          ‖θ‖₂²
    lasso                 (θᵀx − y)²              θᵀx          ‖θ‖₁
    logistic classifier   log(1 + exp(−y θᵀx))    sign(θᵀx)    0
    SVM                   (1 − y θᵀx)₊            sign(θᵀx)    ‖θ‖₂²

- can mix and match, e.g., r(θ) = ‖θ‖₁ sparsifies
- all lead to convex fitting problems

Robust (Huber) regression

- loss L(x, y, θ) = φhub(θᵀx − y)
- φhub is the Huber function (with threshold M > 0):

      φhub(u) = u²           for |u| ≤ M
                2M|u| − M²   for |u| > M

- same as least-squares for small residuals, but allows (some) large residuals
- and so, robust to outliers (see the CVXPY sketch below)
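A minimal sketch of robust regression using CVXPY's built-in Huber atom; the data A, y and the threshold M are illustrative stand-ins, with dimensions matching the example that follows:

import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n, M = 450, 300, 1.0
A = np.random.randn(m, n)   # regressors x_i stacked as rows
y = np.random.randn(m)      # measurements (stand-in)

theta = cp.Variable(n)
# sum of Huber penalties on the residuals theta^T x_i - y_i
loss = cp.sum(cp.huber(A @ theta - y, M))
cp.Problem(cp.Minimize(loss)).solve()
theta_hat = theta.value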

Example

- m = 450 measurements, n = 300 regressors
- choose θtrue; xi ∼ N(0, I)
- set yi = (θtrue)ᵀxi + εi,  εi ∼ N(0, 1)
- with probability p, replace yi with −yi
- the data has a fraction p of (non-obvious) wrong measurements
- the distributions of 'good' and 'bad' yi are the same
- try to recover θtrue ∈ R^n from the measurements y ∈ R^m
- 'prescient' version: we know which measurements are wrong

Example

50 problem instances, p varying from 0 to 0.15


Quantile regression

- quantile regression: use the tilted ℓ₁ loss

      L(x, y, θ) = τ (r)₊ + (1 − τ) (r)₋,   with r = θᵀx − y,  τ ∈ (0, 1)

- τ = 0.5: equal penalty for over- and under-estimating
- τ = 0.1: 9× more penalty for under-estimating
- τ = 0.9: 9× more penalty for over-estimating
- the τ-quantile of the residuals is zero (a CVXPY sketch follows below)
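A minimal CVXPY sketch of the tilted ℓ₁ loss, using the pos/neg atoms for (r)₊ and (r)₋; X, y, and tau are illustrative stand-ins:

import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n, tau = 200, 10, 0.1
X = np.random.randn(m, n)
y = np.random.randn(m)

theta = cp.Variable(n)
r = X @ theta - y   # residuals
# tilted l1: tau * max(r, 0) + (1 - tau) * max(-r, 0)
loss = cp.sum(tau * cp.pos(r) + (1 - tau) * cp.neg(r))
cp.Problem(cp.Minimize(loss)).solve()
theta_hat = theta.value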

Example

- time series xt, t = 0, 1, 2, ...
- auto-regressive predictor:

      x̂t+1 = θᵀ(1, xt, ..., xt−M)

- M = 10 is the memory of the predictor
- use quantile regression for τ = 0.1, 0.5, 0.9
- at each time t, this gives three one-step-ahead predictions:

      x̂t+1^0.1,  x̂t+1^0.5,  x̂t+1^0.9

Example

(figure: the time series xt)

Example

(figure: xt and the predictions x̂t+1^0.1, x̂t+1^0.5, x̂t+1^0.9; training set, t = 0, ..., 399)

Example

(figure: xt and the predictions x̂t+1^0.1, x̂t+1^0.5, x̂t+1^0.9; test set, t = 400, ..., 449)

Example

(figure: residual distributions for τ = 0.9, 0.5, and 0.1; training set)

Example

(figure: residual distributions for τ = 0.9, 0.5, and 0.1; test set)

Consensus optimization and model fitting

Consensus optimization

- want to solve a problem with N objective terms:

      minimize ∑_{i=1}^N fi(x)

  e.g., fi is the loss function for the ith block of training data
- consensus form:

      minimize    ∑_{i=1}^N fi(xi)
      subject to  xi − z = 0,  i = 1, ..., N

- the xi are local variables
- z is the global variable
- xi − z = 0 are the consistency or consensus constraints

Consensus optimization via ADMM

with x̄^k = (1/N) ∑_{i=1}^N xi^k (the average of the local variables):

      xi^{k+1} := argmin_{xi} ( fi(xi) + (ρ/2) ‖xi − x̄^k + ui^k‖₂² )
      ui^{k+1} := ui^k + (xi^{k+1} − x̄^{k+1})

- get the global minimum, under very general conditions
- ui^k is a running sum of inconsistencies (PI control)
- the minimizations are carried out independently and in parallel
- coordination is via averaging of the local variables xi (see the sketch below)
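A minimal NumPy/CVXPY sketch of these updates, assuming each fi is a least-squares loss on one data block; the block data, ρ, and the iteration count are illustrative:

import cvxpy as cp
import numpy as np

np.random.seed(0)
N, n, m_i, rho = 4, 10, 30, 1.0
blocks = [(np.random.randn(m_i, n), np.random.randn(m_i)) for _ in range(N)]

x = np.zeros((N, n))       # local variables x_i
u = np.zeros((N, n))       # scaled dual variables u_i
xbar = x.mean(axis=0)      # global average
for k in range(50):
    for i, (A, b) in enumerate(blocks):   # these solves can run in parallel
        xi = cp.Variable(n)
        # local loss plus the quadratic coupling term
        obj = cp.sum_squares(A @ xi - b) + (rho / 2) * cp.sum_squares(xi - xbar + u[i])
        cp.Problem(cp.Minimize(obj)).solve()
        x[i] = xi.value
    xbar = x.mean(axis=0)  # averaging: the only coordination step
    u += x - xbar          # dual update: running sum of inconsistencies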

Consensus model fitting

- the variable is θ, the parameter of the predictor
- fi(θi) is the loss + (a share of) the regularizer for the ith data block
- θi^{k+1} minimizes the local loss + an additional quadratic term
- the local parameters converge to consensus, the same as if the whole data set were handled together
- privacy preserving: agents don't reveal their data to each other

Example

- SVM:
  - hinge loss ℓ(u) = (1 − u)₊
  - sum-square regularization r(θ) = ‖θ‖₂²
- baby problem with n = 2, m = 400 to illustrate
- examples split into N = 20 groups, in the worst possible way: each group contains only positive or only negative examples

Iteration 1

(figure: data and consensus SVM fit at this iteration)

Iteration 5

(figure: data and consensus SVM fit at this iteration)

Iteration 40

(figure: data and consensus SVM fit at this iteration)

CVXPY implementation

(Steven Diamond)

- N = 10⁵ samples, n = 10³ (dense) features
- hinge (SVM) loss with ℓ₁ regularization
- data split into 100 chunks
- 100 processes on 32 cores
- 26 sec per ADMM iteration
- 100 iterations for the objective to converge
- 10 iterations (5 minutes) to get a good model

CVXPY implementation

(figure)

H2O implementation

(Tomas Nykodym)

- click-through data derived from a Kaggle data set
- 20,000 features, 20M examples
- logistic loss, elastic net regularization (a CVXPY sketch of this objective follows below)
- examples divided into 100 chunks (of different sizes)
- run on 100 H2O instances
- 5 iterations to get a good global model
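For concreteness, a minimal CVXPY sketch of the per-block objective (logistic loss with elastic net regularization); this is not H2O's code, and X, y, lam, and alpha are illustrative stand-ins:

import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n, lam, alpha = 500, 50, 0.1, 0.5
X = np.random.randn(m, n)
y = np.sign(np.random.randn(m))   # labels in {-1, +1}

theta = cp.Variable(n)
# logistic loss log(1 + exp(-y_i * theta^T x_i)), via the logistic atom
loss = cp.sum(cp.logistic(-cp.multiply(y, X @ theta))) / m
# elastic net: mix of l1 and squared l2 penalties
reg = lam * (alpha * cp.norm(theta, 1) + (1 - alpha) * cp.sum_squares(theta))
cp.Problem(cp.Minimize(loss + reg)).solve()
theta_hat = theta.value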

H2O implementation

(figure: ROC curve, iteration 1)

H2O implementation

(figure: ROC curve, iteration 2)

H2O implementation

(figure: ROC curve, iteration 3)

H2O implementation

(figure: ROC curve, iteration 5)

H2O implementation

(figure: ROC curve, iteration 10)

Summary

ADMM consensus:

- can do machine learning across distributed data sources
- the data never moves
- you get the same model as if you had collected all the data in one place

Resources

many researchers have worked on the topics covered

- Convex Optimization
- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
- EE364a (course slides, videos, code, homework, ...)
- software: CVX, CVXPY, Convex.jl

all available online
