Machine Learning: Intro & Linear Classifier

Olga Veksler

Outline

• Introduction to Machine Learning

• motivation of machine learning

• types of machine learning

• basic concepts

• overfitting, underfitting, generalization

• Optimization with gradient descent

• Linear Classifier

• Two classes

• Loss functions

• Gradient descent schemes

• Multiple classes

INTRO TO ML

Why use Machine Learning?

• Difficult to come up with an explicit program for some tasks

• Digit Recognition, a classic example

• Easy to collect images of digits with their correct labels

• Machine Learning Algorithm takes the collected data and produces a program for recognizing digits

• done right, the program will correctly recognize new images it has never seen

What is Machine Learning?

• General definition (Tom Mitchell):

• Based on experience E, improve performance on task T as measured by performance measure P

• Digit Recognition Example

• T = recognize character in the image

• P = percentage of correctly classified images

• E = dataset of human-labeled images of characters

Types of Machine Learning

• Supervised Learning

• given training examples with corresponding outputs (also called label, target, class, answer, etc.)

• learn to produce the correct output for a new example

• Unsupervised Learning

• given training examples only

• discover good data representation

• e.g. “natural” clusters

• k-means is the most widely known example

• Reinforcement Learning

• learn to select action that maximizes payoff

Supervised Machine Learning: subtypes

• Classification

• output belongs to a finite set

• example: age ∊ {baby, child, adult, elder}

• output is also called class or label

• Regression

• output is continuous

• example: age ∊ [0,130]

• Difference mostly in design of loss function

Supervised Machine Learning

[Figure: fish images labeled salmon and sea bass]

• Have examples with corresponding outputs

• Example: fish classification, salmon or sea bass?

[Figure: examples x1, x2, x3, x4 with outputs y1 = 0, y2 = 1, y3 = 0, y4 = 1]

• Each example is represented by a vector

• data may be given in vector form from the start

• if not, for each example i, extract useful features and put them in a vector xi

• fish classification example

• extract two features, fish length and average fish brightness

• can extract many other features

• for images, can also use raw pixel intensity or color as features

• an example is often called feature vector

• yi is the output for example xi

Supervised Machine Learning

• Training phase

• estimate function y = f(x) from labeled data

• f(x) is called classifier, learning machine, prediction function, etc.

• Testing phase (deployment)

• predict output f(x) for a new (unseen) sample x

• We are given

1. Training examples x1, x2,…, xn

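2. Target output for each sample y1, y2,…, yn

• together, 1 and 2 form the labeled data

Training/Testing Phases Illustrated

[Figure: in training, the training examples are converted to feature vectors and, together with the training labels, produce the learned model f; in testing, a test image is converted to a feature vector and the learned model f outputs a label prediction]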

More on Training Phase

• Estimate prediction function y = f(x) from labeled data

• Choose hypothesis space f(x) belongs to

• hypothesis space f(x,w) is parameterized by vector of weights w

• each setting of w corresponds to a different hypothesis

• find f(x,w) in the hypothesis space s.t. f(xi,w) = yi “as much as possible” for training examples

• define “as much as possible” via some loss function L(f(x,w),y)

[Figure: hypothesis space containing the hypotheses f(x,w1), f(x,w2), f(x,w3), f(x,w4), f(x,w5)]

Training Phase Example in 1D

• 2 class classification problem, yi ∊ {-1,1}

• Training set: class -1: {-2, -1, 1}, class 1: {2, 3, 5}

• Hypothesis space f(x,w) = sign(w0 + w1x)

• with w0 = -1, w1 = 2, f(x) = sign(-1 + 2x), which misclassifies the class -1 sample x = 1

Training Phase Example in 1D

• 2 class classification problem, yi ∊ {-1,1}

• Training set: class -1: {-2, -1, 1}, class 1: {2, 3, 5}

• Hypothesis space f(x,w) = sign(w0 + w1x)

• With w0 = -1.5, w1 = 1, f(x) = sign(-1.5 + x), which classifies all training samples correctly

• The process of finding good w is weight tuning, or training
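The two weight settings in the 1D example can be checked with a few lines of Python. This is a minimal sketch rather than part of the original slides; the helper names `sign`, `predict` and `training_errors` are illustrative, and treating sign(0) as -1 is an assumption the slides do not specify.

```python
def sign(v):
    """Return +1 for positive v, -1 otherwise (sign(0) = -1 is an assumption)."""
    return 1 if v > 0 else -1


def predict(x, w0, w1):
    """Hypothesis f(x, w) = sign(w0 + w1 * x) from the 1D example."""
    return sign(w0 + w1 * x)


# Training set from the slides as (sample, label) pairs.
train = [(-2, -1), (-1, -1), (1, -1), (2, 1), (3, 1), (5, 1)]


def training_errors(w0, w1):
    """Count the training samples that f(x, w) labels incorrectly."""
    return sum(1 for x, y in train if predict(x, w0, w1) != y)


errors_first = training_errors(-1.0, 2.0)   # first setting of w
errors_second = training_errors(-1.5, 1.0)  # second setting of w
print(errors_first, errors_second)
```

The first setting makes one mistake (the sample x = 1), while the second separates the two classes, which is what weight tuning is trying to achieve.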

Training Phase Example in 2D

• For 2 class problem and 2 dimensional samples

f(x,w) = sign(w0+w1x1+w2x2)

[Figure: a linear decision boundary separating two decision regions]

• Can be generalized to examples of arbitrary dimension

• Classifier that makes a decision based on linear combination of features is called a linear classifier

Training Phase: Linear Classifier

• bad setting of w: classification error 38%

• better setting of w: classification error 4%

Training Stage: More Complex Classifier

• for example if f(x,w) is a polynomial of high degree

• 0% classification error

Test Classifier on New Data

• The goal is to classify well on new data

• Test “wiggly” classifier on new data: 25% error

Overfitting

• Have only limited amount of data for training

• Complex model often has too many parameters to fit reliably with limited data

• Complex model may adapt too closely to random noise of training data, rather than look at a ‘big picture’

Overfitting: Extreme Example

• 2 class problem: face and non-face images

• Memorize (i.e. store) all the “face” images

• For a new image, see if it is one of the stored faces

• if yes, output “face” as the classification result

• if no, output “non-face”

• also called “rote learning”

• problem: new “face” images are different from stored “face” examples

• zero error on stored data, 50% error on test (new) data

• decision boundary is very irregular

• Rote learning is memorization without generalization

slide is modified from Y. LeCun

Generalization

• Ability to produce correct outputs on previously unseen examples is called generalization

• Big question of learning theory: how to get good generalization with a limited number of examples

• Intuitive idea: favor simpler classifiers

• William of Occam (1284-1347): “entities are not to be multiplied without necessity”

• Simpler decision boundary may not fit the training data ideally but tends to generalize better to new data

[Figure: the same decision boundary shown on the training data and on new data]

Training and Testing

• How to diagnose overfitting?

• Divide all labeled samples x1,x2,…xn into training set and test set

• Use training set (training samples) to tune classifier weights w

• Use test set (test samples) to see how well the classifier with tuned weights w works on unseen examples

• Thus there are 2 main phases in classifier design

1. training

2. testing
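The split into training and test sets can be sketched as follows. This is only an illustration: the function names, the 25% test fraction, the fixed seed, and the two extra samples (-3 and 4) appended to the 1D data are assumptions made for the example, not something the slides prescribe.

```python
import random


def train_test_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle the labeled samples and hold out a fraction of them for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(test_fraction * len(samples))))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


def error_rate(classifier, data):
    """Fraction of (x, y) pairs in data that the classifier gets wrong."""
    wrong = sum(1 for x, y in data if classifier(x) != y)
    return wrong / len(data)


# 1D samples from the earlier example plus two hypothetical extra samples.
xs = [-2, -1, 1, 2, 3, 5, -3, 4]
ys = [-1, -1, -1, 1, 1, 1, -1, 1]
train_set, test_set = train_test_split(xs, ys)


def f(x):
    """Classifier sign(-1.5 + x) with already tuned weights."""
    return 1 if -1.5 + x > 0 else -1


print(error_rate(f, train_set), error_rate(f, test_set))
```

The error on `train_set` is the training error and the error on `test_set` is the test error; only the second estimates performance on unseen examples.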

Training Phase

• Find weights w s.t. f(xi,w) = yi “as much as possible” for training samples xi

• define “as much as possible”

• penalty whenever f(xi,w) ≠ yi

• via loss function L(f(xi,w), yi)

• how to search for good setting of w?

• usually through optimization

• can be very time consuming

• classification error on training data is called training error

Testing Phase

• The goal is good performance on unseen examples

• Evaluate performance of the trained classifier f(x,w) on the test samples (unseen labeled samples)

• Testing on unseen labeled examples is an approximation of how well classifier will perform in practice

• If testing results are poor, may have to go back to the training phase and redesign f(x,w)

• Classification error on test data is called test error

• Side note

• when we “deploy” the final classifier f(x,w) in practice, this is also called testing

Underfitting

• Can also underfit data, i.e. too simple decision boundary

• chosen hypothesis space is not expressive enough

• No linear decision boundary can separate the samples well

• Training error is too high

• test error is, of course, also high

Underfitting → Overfitting

• underfitting: high training error, high test error

• “just right”: low training error, low test error

• overfitting: low training error, high test error

Sketch of Supervised Machine Learning

• Choose a hypothesis space f(x,w)

• w are tunable weights

• x is the input sample

• tune w so that f(x,w) gives the correct label for training samples x

• Which hypothesis space f(x,w) to choose?

• has to be expressive enough to model our problem well, i.e. to avoid underfitting

• yet not too complicated, to avoid overfitting

Classification System Design Overview

• Collect and label data by hand

[Figure: fish images labeled salmon and sea bass]

• Preprocess data (e.g. segmenting fish from background)

• Extract possibly discriminating features

• length, lightness, width, number of fins, etc.

• Classifier design

• choose model for classifier

• train classifier on training data

• Test classifier on test data

• Split data into training and test sets

mostly look at these steps in the course

OPTIMIZATION

Optimization

• How to minimize a function of a single variable

$J(x) = (x-5)^2$

• From calculus, take derivative, set it to 0

$$\frac{d}{dx} J(x) = 0$$

• Solve the resulting equation

• may be easy or hard to solve

• Example above is easy

$$\frac{d}{dx} J(x) = 2(x-5) = 0 \;\Rightarrow\; x = 5$$

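Optimization

• How to minimize a function of many variables

$J(x) = J(x_1, \dots, x_d)$

• From calculus, take partial derivatives, set them to 0

$$\nabla J(x) = \left[ \frac{\partial J}{\partial x_1}, \dots, \frac{\partial J}{\partial x_d} \right]^T = 0$$

• $\nabla J(x)$ is called the gradient

• Solve the resulting system of d equations

• It may not be possible to solve the system of equations above analytically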

Optimization: Gradient Direction

J(x1, x2)

Picture from Andrew Ng

• Gradient ∇J(x) points in the direction of steepest increase of function J(x)

• -∇J(x) points in the direction of steepest decrease

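Gradient Direction in 2D

• $J(x_1, x_2) = (x_1-5)^2 + (x_2-10)^2$

$$\frac{\partial J}{\partial x_1} = 2(x_1 - 5), \qquad \frac{\partial J}{\partial x_2} = 2(x_2 - 10)$$

• Let $a = [10, 5]^T$

$$\frac{\partial J}{\partial x_1}(a) = 10, \qquad \frac{\partial J}{\partial x_2}(a) = -10$$

• $\nabla J(a) = [10, -10]^T$

• $-\nabla J(a) = [-10, 10]^T$ points from a toward the global min at (5, 10)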

Gradient Descent: Step Size

• $J(x_1, x_2) = (x_1-5)^2 + (x_2-10)^2$

• Which step size to take?

• Controlled by parameter α

• called learning rate

• From previous slide

• $a = [10, 5]^T$, $-\nabla J(a) = [-10, 10]^T$

• Let α = 0.2

$$a - \alpha \nabla J(a) = [10, 5]^T + 0.2\,[-10, 10]^T = [8, 7]^T$$

• J(10, 5) = 50; J(8,7) = 18
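The single step above is easy to verify numerically. The sketch below assumes the point a = (10, 5) and α = 0.2 from the example; the function names are illustrative only.

```python
def J(x):
    """Objective from the slides: J(x1, x2) = (x1 - 5)^2 + (x2 - 10)^2."""
    return (x[0] - 5) ** 2 + (x[1] - 10) ** 2


def grad_J(x):
    """Gradient of J: [2(x1 - 5), 2(x2 - 10)]."""
    return [2 * (x[0] - 5), 2 * (x[1] - 10)]


def gradient_step(x, alpha):
    """Take one step of size alpha in the direction of -grad J(x)."""
    g = grad_J(x)
    return [xi - alpha * gi for xi, gi in zip(x, g)]


a = [10.0, 5.0]
b = gradient_step(a, alpha=0.2)
print(grad_J(a), b, J(a), J(b))
```

One step moves from (10, 5) to (8, 7) and lowers the objective from 50 to 18.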

Gradient Descent Algorithm

k = 1
x(1) = any initial guess
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α∇J(x(k))
    k = k + 1

[Figure: iterates x(1), x(2), … follow -∇J(x(1)), -∇J(x(2)), … until ∇J(x(k)) ≈ 0]
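A direct Python version of the loop above might look like the following sketch. The `max_iters` safeguard and the returned update count are additions for illustration and are not part of the algorithm as stated on the slide.

```python
import math


def gradient_descent(grad, x0, alpha, eps, max_iters=10000):
    """Repeat x = x - alpha * grad(x) until the gradient norm is at most eps.

    max_iters is a safety cap (an assumption, not part of the slide's loop).
    Returns the final point and the number of updates performed.
    """
    x = list(x0)
    for k in range(max_iters):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            return x, k
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, max_iters


def grad_J(x):
    """Gradient of J(x1, x2) = (x1 - 5)^2 + (x2 - 10)^2."""
    return [2 * (x[0] - 5), 2 * (x[1] - 10)]


x_min, n_updates = gradient_descent(grad_J, [10.0, 5.0], alpha=0.2, eps=1e-6)
print(x_min, n_updates)
```

Starting from (10, 5) with α = 0.2, the iterates approach the global minimum at (5, 10) in a few dozen updates.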

Gradient Descent: Local Minimum

• Not guaranteed to find global minimum

• gets stuck in local minimum

[Figure: iterates x(1), x(2), … follow the negative gradient into a local minimum where -∇J(x(k)) = 0, away from the global minimum]

• Still gradient descent is very popular because it is simple and applicable to any differentiable function

How to Set Learning Rate α?

• If too large, may overshoot the local minimum and possibly never even converge

• If too small, too many iterations to converge

[Figure: iterates x(1), x(2), x(3), x(4) jumping from one side of the minimum to the other]

• It helps to compute J(x) as a function of iteration number, to make sure we are properly minimizing it
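To see both failure modes, one can record J(x) at every iteration for several learning rates. The sketch below uses the 1D objective J(x) = (x-5)² from the optimization slides; the starting point 0, the 20 steps, and the three values of α are arbitrary choices for illustration.

```python
def J(x):
    return (x - 5) ** 2


def dJ(x):
    return 2 * (x - 5)


def run(alpha, x0=0.0, steps=20):
    """Run a fixed number of gradient descent steps and record J after each one."""
    x = x0
    history = [J(x)]
    for _ in range(steps):
        x = x - alpha * dJ(x)
        history.append(J(x))
    return history


too_large = run(alpha=1.1)     # every step overshoots and J grows
too_small = run(alpha=0.001)   # J barely moves in 20 steps
reasonable = run(alpha=0.3)    # J shrinks quickly
print(too_large[-1], too_small[-1], reasonable[-1])
```

For this objective each step multiplies the distance to the minimum by (1 - 2α), so any α above 1 makes the iterates diverge, while a very small α leaves J almost unchanged after 20 steps.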

Variable Learning Rate

• Fixed learning rate

k = 1
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α∇J(x(k))
    k = k + 1

• If desired, can change learning rate at each iteration

k = 1
choose ε
while ||∇J(x(k))|| > ε
    choose α(k)
    x(k+1) = x(k) - α(k)∇J(x(k))
    k = k + 1
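One possible way to choose α(k) is a decaying schedule. The slides do not say how α(k) should be chosen, so the schedule α(k) = α0 / (1 + decay·k), its constants, and the function names below are all assumptions made for this sketch.

```python
import math


def gradient_descent_variable_rate(grad, x0, schedule, eps, max_iters=10000):
    """Gradient descent where the learning rate for update k is schedule(k).

    The counter k starts at 0 here, while the pseudocode above starts at 1.
    """
    x = list(x0)
    for k in range(max_iters):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            break
        alpha_k = schedule(k)
        x = [xi - alpha_k * gi for xi, gi in zip(x, g)]
    return x


def decaying_rate(k, alpha0=0.4, decay=0.1):
    """Hypothetical schedule: start at alpha0 and shrink as k grows."""
    return alpha0 / (1.0 + decay * k)


def grad_J(x):
    """Gradient of J(x1, x2) = (x1 - 5)^2 + (x2 - 10)^2."""
    return [2 * (x[0] - 5), 2 * (x[1] - 10)]


x_final = gradient_descent_variable_rate(grad_J, [10.0, 5.0], decaying_rate, eps=1e-6)
print(x_final)
```

Large early steps make fast progress and smaller later steps reduce the risk of overshooting near the minimum.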

Variable Learning Rate

• Version that indexes every iterate

k = 1
choose α, ε
while ||∇J(x(k))|| > ε
    x(k+1) = x(k) - α∇J(x(k))
    k = k + 1

• Usually do not keep track of all intermediate solutions

x = any initial guess
choose α, ε
while ||∇J(x)|| > ε
    x = x - α∇J(x)

Learning Rate

• Monitor learning rate by looking at how fast the objective function decreases

[Figure: objective value versus number of iterations for a very high, a high, a low, and a good learning rate]

Learning Rate: Loss Surface Illustration

[Figure: gradient descent paths on a loss surface for two learning rates, each annotated with the number of updates needed]

Linear Classifiers

Supervised Machine Learning: Recap

• Choose type of f(x,w)

• w are tunable weights, x is the input example

• f(x,w) should output the correct class of sample x

• use labeled samples to tune weights w so that f(x,w) gives the correct class y for x in training data

• loss function L(f(x,w) ,y)

• How to choose type of f(x,w)?

• many choices

• simple choice: linear classifier

• other choices

• Neural Network

• Convolutional Neural Network

Linear Classifier

• Use

• y = 1 for the first class

• y = -1 for the second class

• One choice for linear classifier

f(x,w) = sign(g(x,w))

• 1 if g(x,w) is positive

• -1 if g(x,w) is negative

• Classifier makes a decision based on a linear combination of features

g(x,w) = w0+x1w1 + … + xdwd

• g(x,w) is called the discriminant function

Linear Classifier: Decision Boundary

• f(x,w) = sign(g(x,w)) = sign(w0+x1w1 + … + xdwd)

• Decision boundary is linear

• Find w0, w1,…, wd that give best separation of two classes with a linear boundary

[Figure: a bad boundary and a better boundary for the same two classes]

More on Linear Discriminant Function (LDF)

• LDF: g(x,w,w0) = w0+x1w1 + … + xdwd

• w0 is called bias or threshold

• decision boundary: g(x) = 0

• g(x) > 0: decision region for class 1

• g(x) < 0: decision region for class 2

More on Linear Discriminant Function (LDF)

• Decision boundary: g(x,w) = w0+x1w1 + … + xdwd = 0

• This is a hyperplane, by definition

• a point in 1D

• a line in 2D

• a plane in 3D

• a hyperplane in higher dimensions

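Vector Notation

• Linear discriminant function $g(x, w, w_0) = w^T x + w_0$

• Example in 2D

$$w = \begin{bmatrix} 3 \\ 2 \end{bmatrix}, \quad w_0 = 4, \quad g(x, w, w_0) = 3x_1 + 2x_2 + 4$$

• Shorter notation if add extra feature of value 1 to x

$$z = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}, \quad a = \begin{bmatrix} 4 \\ 3 \\ 2 \end{bmatrix}$$

$$g(z, a) = a^T z = 4 + 3x_1 + 2x_2 = w^T x + w_0 = g(x, w, w_0)$$

• Use $a^T z$ instead of $w^T x + w_0$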

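Fitting Parameters w

• Rewrite

$$g(x, w, w_0) = \begin{bmatrix} w_0 & w^T \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} = a^T z = g(z, a)$$

• new feature vector $z = [1, x^T]^T$

• new weight vector $a = [w_0, w^T]^T$

• z is called augmented feature vector

• new problem equivalent to the old g(z,a) = atz

• f(z,a) = sign(g(z,a))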
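The equivalence between the original and the augmented form can be checked directly. In this sketch w = (3, 2) and w0 = 4 come from the 2D example, while the test point x and the helper names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def g_original(x, w, w0):
    """Discriminant in the original form: w^T x + w0."""
    return dot(w, x) + w0


def augment(x):
    """Prepend the constant feature 1 to get the augmented vector z."""
    return [1.0] + list(x)


def g_augmented(z, a):
    """Discriminant in the augmented form: a^T z."""
    return dot(a, z)


w, w0 = [3.0, 2.0], 4.0
a = [w0] + w          # a = [w0, w1, w2]
x = [1.5, -2.0]       # hypothetical test point
print(g_original(x, w, w0), g_augmented(augment(x), a))
```

Both calls return the same value, so training can work with a single weight vector a and no separate bias term.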

Solution Region

• If there is a weight vector a that classifies all examples correctly, it is called a separating or solution vector

• then there are infinitely many solution vectors a

• original samples x1,… xn are also linearly separable

Loss Function

• How to find solution vector a?

• or, if no separating a exists, a good approximation a?

• Design a non-negative loss function L(a)

• L(a) is small if a is good

• L(a) is large if a is bad

• Minimize L(a) with gradient descent

• Two steps in design of L(a)

1. per-example loss L(f(zi,a),yi)

• penalizes for deviations of f(zi,a) from yi

2. total loss adds up per-example loss over all training examples

$$L(a) = \sum_i L(f(z_i, a), y_i)$$

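Loss Function, First Attempt

• Per-example loss function ‘counts’ if error happens

$$L(f(z_i, a), y_i) = \begin{cases} 1 & \text{if } f(z_i, a) \neq y_i \\ 0 & \text{otherwise} \end{cases}$$

• Example

$$a = \begin{bmatrix} 2 \\ -3 \end{bmatrix}, \quad z_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\ y_1 = 1, \quad z_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix},\ y_2 = -1$$

$$f(z_1, a) = \mathrm{sign}(a^T z_1) = \mathrm{sign}(-4) = -1 \neq y_1 \;\Rightarrow\; L(f(z_1, a), y_1) = 1$$

$$f(z_2, a) = \mathrm{sign}(a^T z_2) = \mathrm{sign}(-10) = -1 = y_2 \;\Rightarrow\; L(f(z_2, a), y_2) = 0$$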

• Total loss function counts total number of errors

$$L(a) = \sum_i L(f(z_i, a), y_i)$$

• For previous example, L(f(z1,a),y1) = 1 and L(f(z2,a),y2) = 0

• Total loss L(a) = 1 + 0 = 1

Loss Function: First Attempt

• ‘error count’ loss function not suitable for gradient descent

• piecewise constant, gradient zero or does not exist

Perceptron Loss Function

• Different Loss Function: Perceptron Loss

$$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^T z_i) & \text{otherwise} \end{cases}$$

• Lp(a) is non-negative

• positive misclassified example zi

• atzi < 0, yi = 1, so yi(atzi) < 0

• negative misclassified example zi

• atzi > 0, yi = -1, so yi(atzi) < 0

• if zi is misclassified then yi(atzi) < 0

• if zi is misclassified then -yi(atzi) > 0

• Lp(a) proportional to distance of misclassified example to boundary

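Perceptron Loss Function

$$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^T z_i) & \text{otherwise} \end{cases}$$

• Example

$$a = \begin{bmatrix} 2 \\ -3 \end{bmatrix}, \quad z_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\ y_1 = 1, \quad z_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix},\ y_2 = -1$$

$$f(z_1, a) = \mathrm{sign}(a^T z_1) = \mathrm{sign}(-4) = -1 \neq y_1 \;\Rightarrow\; L_p(f(z_1, a), y_1) = 4$$

$$f(z_2, a) = \mathrm{sign}(a^T z_2) = \mathrm{sign}(-10) = -1 = y_2 \;\Rightarrow\; L_p(f(z_2, a), y_2) = 0$$

• Total loss Lp(a) = 4 + 0 = 4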
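The error-count loss and the perceptron loss can be compared on the two-example case. The sketch assumes the values a = (2, -3), z1 = (1, 2), z2 = (1, 4) read off the worked example; the function names and the sign(0) = -1 convention are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sign(v):
    """+1 for positive v, -1 otherwise (sign(0) = -1 is an assumption)."""
    return 1 if v > 0 else -1


def zero_one_loss(z, y, a):
    """Error-count loss: 1 if sign(a^T z) differs from y, else 0."""
    return 0 if sign(dot(a, z)) == y else 1


def perceptron_loss(z, y, a):
    """Perceptron loss: -y * (a^T z) for a misclassified example, else 0."""
    return max(0, -y * dot(a, z))


a = [2, -3]
examples = [([1, 2], 1), ([1, 4], -1)]

count_losses = [zero_one_loss(z, y, a) for z, y in examples]
perc_losses = [perceptron_loss(z, y, a) for z, y in examples]
print(count_losses, sum(count_losses))
print(perc_losses, sum(perc_losses))
```

Both losses are zero on the correctly classified second example, but only the perceptron loss reflects how far the misclassified first example lies on the wrong side of the boundary.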

Perceptron Loss Function

• Per-example loss

$$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^T z_i) & \text{otherwise} \end{cases}$$

• Total loss

$$L_p(a) = \sum_i L_p(f(z_i, a), y_i)$$

• Lp(a) is piecewise linear and suitable for gradient descent

Optimizing with Gradient Descent

• Gradient descent to minimize Lp(a)

$$a = a - \alpha \nabla L_p(a)$$

• Need gradient vector ∇Lp(a)

• the same dimension as a

• Total loss

$$L_p(a) = \sum_i L_p(f(z_i, a), y_i)$$

• Compute

$$\nabla L_p(a) = \nabla \sum_i L_p(f(z_i, a), y_i) = \sum_i \nabla L_p(f(z_i, a), y_i)$$

• each term $\nabla L_p(f(z_i, a), y_i)$ is a per example gradient

• Compute and add up per example gradients

Per Example Loss Gradient

• Per-example loss has two cases

$$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^T z_i) & \text{otherwise} \end{cases}$$

• First case, f(zi,a) = yi

• the loss is the constant 0, so its gradient is 0

• Second case, f(zi,a) ≠ yi

• for example, with three weights

$$\nabla \left( -y_i (a_1 z_{i1} + a_2 z_{i2} + a_3 z_{i3}) \right) = -y_i z_i$$

• Gradient for per-example loss

$$\nabla L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i z_i & \text{otherwise} \end{cases}$$

• Simpler formula

$$\nabla L_p(a) = -\sum_{\text{misclassified examples } i} y_i z_i$$

• Gradient descent update rule for Lp(a)

$$a = a + \alpha \sum_{\text{misclassified examples } i} y_i z_i$$

• called batch because it is based on all examples

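Perceptron Loss Batch Example

• Examples

$$\text{class 1: } x_1 = \begin{bmatrix} 2 \\ 3 \end{bmatrix},\ x_2 = \begin{bmatrix} 4 \\ 3 \end{bmatrix},\ x_3 = \begin{bmatrix} 3 \\ 5 \end{bmatrix} \qquad \text{class 2: } x_4 = \begin{bmatrix} 1 \\ 3 \end{bmatrix},\ x_5 = \begin{bmatrix} 5 \\ 6 \end{bmatrix}$$

• Labels

$$y_1 = 1,\ y_2 = 1,\ y_3 = 1,\ y_4 = -1,\ y_5 = -1$$

• Add extra feature

• Pile all examples as rows in matrix Z

$$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$$

• Pile all labels into column vector Y

$$Y = [1, 1, 1, -1, -1]^T$$

• Examples in Z, labels in Y

• Initial weights $a = [1, 1, 1]^T$

• This is line x1 + x2 + 1 = 0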

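• Perceptron Batch

$$a = a + \alpha \sum_{\text{misclassified examples } i} y_i z_i$$

• Let us use learning rate α = 0.2

$$a = a + 0.2 \sum_{\text{misclassified examples } i} y_i z_i$$

• Sample misclassified if y(atz) < 0

• Can compute y(atz) for all samples

$$Y \mathbin{.*} (Z * a) = [6, 8, 9, -5, -12]^T$$

• Samples 4 and 5 misclassified

• Per example loss is $[0, 0, 0, 5, 12]^T$

• Total loss is L(a) = 5 + 12 = 17

• Perceptron Batch rule update

$$a = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + 0.2 \left( -\begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix} - \begin{bmatrix} 1 \\ 5 \\ 6 \end{bmatrix} \right) = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix}$$

• This is line -0.2x1 - 0.8x2 + 0.6 = 0

• Find all misclassified samples

$$Y \mathbin{.*} (Z * a) = [-2.2, -2.6, -4, 2, 5.2]^T$$

• Samples 1, 2, and 3 misclassified

• Total loss is L(a) = 2.2 + 2.6 + 4 = 8.8

• previous loss was 17 with 2 misclassified examples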

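• Perceptron Batch rule update

$$a = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix} + 0.2 \left( \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 1 \\ 4 \\ 3 \end{bmatrix} + \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} \right) = \begin{bmatrix} 1.2 \\ 1.6 \\ 1.4 \end{bmatrix}$$

• This is line 1.6x1 + 1.4x2 + 1.2 = 0

• Find all misclassified samples

$$Y \mathbin{.*} (Z * a) = [8.6, 11.8, 13, -7, -17.6]^T$$

• Samples 4 and 5 misclassified

• Total loss is L(a) = 7 + 17.6 = 24.6

• previous loss was 8.8 with 3 misclassified examples

• loss went up, means learning rate of 0.2 is too high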
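The two batch updates of this example can be reproduced in plain Python. This is a sketch for checking the arithmetic; the rows of Z, the labels, the initial weights (1, 1, 1) and α = 0.2 are taken from the example, and the function names are illustrative.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Augmented examples (rows of Z) and labels Y from the batch example.
Z = [[1, 2, 3], [1, 4, 3], [1, 3, 5], [1, 1, 3], [1, 5, 6]]
Y = [1, 1, 1, -1, -1]


def total_loss(a):
    """Perceptron loss: sum of -y * (a^T z) over samples with y * (a^T z) < 0."""
    margins = [y * dot(a, z) for z, y in zip(Z, Y)]
    return sum(-m for m in margins if m < 0)


def batch_update(a, alpha):
    """a = a + alpha * (sum of y_i * z_i over the misclassified samples)."""
    step = [0.0] * len(a)
    for z, y in zip(Z, Y):
        if y * dot(a, z) < 0:
            step = [s + y * zj for s, zj in zip(step, z)]
    return [aj + alpha * sj for aj, sj in zip(a, step)]


a0 = [1.0, 1.0, 1.0]
a1 = batch_update(a0, 0.2)
a2 = batch_update(a1, 0.2)
losses = [total_loss(a) for a in (a0, a1, a2)]
print(a1, a2, losses)
```

The recorded losses go 17, 8.8, 24.6, which matches the observation that a learning rate of 0.2 is too high here.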

Single-Sample Gradient Descent

• Batch gradient descent can be slow to converge if lots of examples

• Single sample optimization

• update weights a as soon as possible, after seeing 1 example

• One iteration (epoch)

• go over all examples, in random order

• update after seeing one example zi

$$a = a - \alpha \nabla L_p(f(z_i, a), y_i)$$

Batch Size: Loss Surface Illustration

[Figure: paths toward the global min on a loss surface for one iteration of batch gradient descent and for one iteration of single sample gradient descent, where each update sees only one example]

Mini-Batch Gradient Descent

• Mini-Batch optimization

• update weights a after seeing m examples

• One iteration (epoch)

• go over all examples, in random order

• update after seeing m examples

$$a = a - \alpha \sum_{i \in \text{mini-batch}} \nabla L_p(f(z_i, a), y_i)$$

• Most often used in practice

• Practical Issue: both batch and mini-batch algorithms converge faster if features are roughly on the same scale
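One epoch of mini-batch updates for the perceptron loss could be sketched as below. Setting m = 1 gives single-sample gradient descent and m equal to the number of examples gives the batch rule. The function name, the `rng` argument, and keeping the given order when `rng` is None are assumptions made so the example is reproducible; the data are the augmented examples from the batch example.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def minibatch_epoch(a, Z, Y, alpha, m, rng=None):
    """One pass over all examples, updating a after every m examples.

    If rng is given, examples are visited in random order; otherwise
    they are visited in the given order (an assumption for testing).
    """
    order = list(range(len(Z)))
    if rng is not None:
        rng.shuffle(order)
    for start in range(0, len(order), m):
        step = [0.0] * len(a)
        for i in order[start:start + m]:
            if Y[i] * dot(a, Z[i]) < 0:          # misclassified sample
                step = [s + Y[i] * zj for s, zj in zip(step, Z[i])]
        a = [aj + alpha * sj for aj, sj in zip(a, step)]
    return a


Z = [[1, 2, 3], [1, 4, 3], [1, 3, 5], [1, 1, 3], [1, 5, 6]]
Y = [1, 1, 1, -1, -1]
a_batch = minibatch_epoch([1.0, 1.0, 1.0], Z, Y, alpha=0.2, m=5)
a_single = minibatch_epoch([1.0, 1.0, 1.0], Z, Y, alpha=0.2, m=1)
a_random = minibatch_epoch([1.0, 1.0, 1.0], Z, Y, alpha=0.2, m=2, rng=random.Random(0))
print(a_batch, a_single, a_random)
```

With m = 5 the epoch performs exactly one batch update; smaller m performs several cheaper updates per epoch, each based on fewer examples.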

Logistic Regression, Motivation

• Recall line fitting

• solution with squared error loss: one sample misclassified

• solution with perceptron loss: all samples classified correctly

• Instead of trying to get atz close to y

• use differentiable function σ(atz) that “squishes” the range

• try to get σ(atz) close to y

Linear Classifier: Logistic Regression

• Use logistic sigmoid function σ(t) for “squishing” atz

$$\sigma(t) = \frac{1}{1 + \exp(-t)}$$

• Denote classes with 1 and 0 now

• yi = 1 for positive class, yi = 0 for negative

• Despite “regression” in the name, logistic regression is used for classification, not regression

Logistic Regression vs. Regression

[Figure: quadratic loss and logistic regression loss plotted against σ(atz)]

Logistic Regression: Loss Function

• Could use (yi - σ(aTzi))² as per-example loss function

• Instead use a different loss

• if z has label 1, want σ(aTz) close to 1, define loss as

–log [σ(aTz)]

• if z has label 0, want σ(aTz) close to 0, define loss as

–log [1-σ(aTz)]

[Figure: plot of -log t]

• Gradient descent batch update rule

$$a = a + \alpha \sum_i \left( y_i - \sigma(a^T z_i) \right) z_i$$

• Probabilistic interpretation

• P(class 1) = σ(aTz)

• P(class 0) = 1 - P(class 1)

• loss function is negative log-likelihood, -log P(y)

• standard objective in statistics
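The loss and the batch update rule above can be sketched in a few lines. The data reuse the augmented examples from the perceptron batch example with labels recoded to 1 and 0; the zero initial weights, the learning rate 0.01, and the function names are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sigmoid(t):
    """Logistic sigmoid 1 / (1 + exp(-t)), written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def total_loss(a, Z, Y):
    """Negative log-likelihood: -log sigma(a^T z) for label 1, -log(1 - sigma(a^T z)) for label 0."""
    loss = 0.0
    for z, y in zip(Z, Y):
        p = sigmoid(dot(a, z))
        loss += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return loss


def batch_update(a, Z, Y, alpha):
    """a = a + alpha * sum_i (y_i - sigma(a^T z_i)) * z_i."""
    step = [0.0] * len(a)
    for z, y in zip(Z, Y):
        err = y - sigmoid(dot(a, z))
        step = [s + err * zj for s, zj in zip(step, z)]
    return [aj + alpha * sj for aj, sj in zip(a, step)]


Z = [[1, 2, 3], [1, 4, 3], [1, 3, 5], [1, 1, 3], [1, 5, 6]]
Y = [1, 1, 1, 0, 0]
a0 = [0.0, 0.0, 0.0]
a1 = batch_update(a0, Z, Y, alpha=0.01)
print(total_loss(a0, Z, Y), total_loss(a1, Z, Y))
```

With all weights at zero every example has probability 0.5, so the starting loss is 5·log 2; one small update already lowers it.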

Logistic Regression vs. Perceptron

• Green example classified correctly, but close to decision boundary

• no loss under Perceptron

• non-negligible loss under logistic regression

[Figure: σ(wtx) curve with the green example near the value 0.5]

• Logistic Regression encourages the decision boundary to move away from training samples

• better generalization

[Figure: two classifiers that both separate the training samples; one has zero Perceptron loss and smaller LR loss, the other has zero Perceptron loss and larger LR loss]

• red classifier works better for new data

Logistic Regression vs. Regression vs. Perceptron

• Assuming labels are +1 and -1

[Figure: quadratic loss, perceptron loss, and logistic regression loss over three ranges: misclassified; classified correctly but close to decision boundary; classified correctly and not too close to the decision boundary]

Multiple Classes: General Case

• Define m linear discriminant functions

gi(x) = witx + wi0 for i = 1, 2, … m

• Assign x to class i if

gi(x) > gj(x) for all j ≠ i

• Let Ri be decision region for class i

• all points in Ri assigned to class i

[Figure: three decision regions: R1 where g1(x) > g2(x) and g1(x) > g3(x); R2 where g2(x) > g1(x) and g2(x) > g3(x); R3 where g3(x) > g1(x) and g3(x) > g2(x)]

Multiple Classes

• Can be shown that decision regions are convex

• In particular, they must be spatially contiguous

Multiclass Linear Classifier: Matrix Notation

• Assume examples x are augmented with extra feature 1, no need to write bias explicitly

• but from now on will not change notation to z’s

• Define m discriminant functions

gi(x) = witx for i = 1, 2, … m

• Assign x to i that gives maximum gi(x)

• Picture illustration

[Figure: each gi maps x to an output such as 5, -9, or 10; all outputs are piled into one vector and the class with the largest output is decided (class 4 in the picture)]

Multiclass Linear Classifier: Matrix Notation

• Could use one dimensional output yi ∊ {1, 2, 3, …, m}

• Convenient to use multi-dimensional outputs (one-hot encoding)

• if sample is of class i, want output vector to be 0 everywhere except position i, where it should be 1

[Figure: for an x of class 2 among classes 1 to 4, the output vector we got next to the one-hot vector we want]

• Assign x to i that gives maximum gi(x) = witx

• In matrix notation

$$W = \begin{bmatrix} w_1^T \\ \vdots \\ w_m^T \end{bmatrix}, \qquad Wx = \begin{bmatrix} g_1(x) \\ \vdots \\ g_m(x) \end{bmatrix}$$

• Assign x to class that corresponds to largest row of Wx
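Classification by the largest row of Wx and the one-hot target can be sketched as follows. The weight matrix W and the sample x here are hypothetical values chosen for illustration (three classes, x already augmented with the feature 1), and the function names are not from the slides.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def scores(W, x):
    """Compute Wx: one discriminant value g_i(x) = w_i^T x per class."""
    return [dot(row, x) for row in W]


def classify(W, x):
    """Return the class (numbered from 1) whose row of Wx is largest."""
    s = scores(W, x)
    return max(range(len(s)), key=lambda i: s[i]) + 1


def one_hot(i, m):
    """Target vector for class i out of m classes: 1 at position i, 0 elsewhere."""
    return [1 if j == i else 0 for j in range(1, m + 1)]


# Hypothetical weights: row i holds w_i^T.
W = [[1.0, 2.0, -1.0],
     [0.5, -1.0, 3.0],
     [-2.0, 0.0, 1.0]]
x = [1.0, 2.0, 3.0]   # hypothetical augmented sample

print(scores(W, x), classify(W, x), one_hot(2, 3))
```

Here the second row of Wx is the largest, so x is assigned to class 2, whose one-hot target is [0, 1, 0].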

Multiclass Linear Classifier: Matrix Notation

[Figure: numeric example of W, xi, and Wxi]

• Assign sample xi to class that corresponds to largest row of Wxi

• Loss function?

• Can use quadratic loss per sample xi as ½||Wxi - yi||²

• for example above, loss (2² + 4² + 47² + 44²)/2

• total loss on all training samples L(W) = ½Σi||Wxi - yi||²

• gradient of the loss

$$\nabla L(W) = \sum_i (W x_i - y_i)\, x_i^T$$

• ∇L(W) has the same shape as W

• batch gradient descent update

$$W = W - \alpha \sum_i (W x_i - y_i)\, x_i^T$$

Quadratic Loss Function

• Consider gradient descent update, single sample x with α = 1

$$W = W - (Wx - y)\, x^T$$

• Suppose x is of class 2

[Figure: numeric example showing Wx before the update, with outputs marked too small or too large, the update rule applied to W, and Wx with the new W, again with outputs marked too small or too large]

• Already saw that quadratic loss does not work that well for classification

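Softmax Function

• Define softmax(a) function

$$\mathrm{softmax}(a)_i = \frac{\exp(a_i)}{\sum_j \exp(a_j)}$$

• Example

[Figure: numeric example in which each exp(ai) is divided by the sum of all exp(aj)]

• Softmax renormalizes a vector so that it can be interpreted as a vector of probabilities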
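A small implementation makes the renormalization concrete. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; it is an addition for this sketch and not something the slides mention, and the input vector is hypothetical.

```python
import math


def softmax(a):
    """softmax(a)_i = exp(a_i) / sum_j exp(a_j), computed with a max shift for stability."""
    shift = max(a)
    exps = [math.exp(v - shift) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])   # hypothetical raw scores
print(probs, sum(probs))
```

The outputs are positive, sum to 1, and keep the ordering of the raw scores, so they can be read as class probabilities.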

Softmax Loss Function

• Generalization of logistic regression to multiclass case

• Instead of raw scores Wx

• Use softmax scores softmax(Wx), interpreted as Pr(class)

• Classifier output interpreted as probability for each class

• Optimize under –log Pr(yi) loss function

Gradient Descent: Softmax Loss Function

• Example

• softmax(Wxi) gives Pr(class) with entries 0.0000000000000010, 0.9999999942945850, 0.0000000056027960, 0.0000000001026190

• Loss on this example is –log(0.000000000000001) ≈ 34.5 (natural logarithm)

• Update rule for weight matrix W

$$W = W + \alpha \sum_i \left( y_i - \mathrm{softmax}(W x_i) \right) x_i^T$$
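A single-sample version of this update can be sketched as below. The zero initial W, the sample x, its class, and the learning rate 0.1 are hypothetical values for illustration; classes are indexed from 0 inside the code.

```python
import math


def softmax(a):
    shift = max(a)
    exps = [math.exp(v - shift) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def softmax_loss(W, x, c):
    """-log Pr(class c) for sample x, where c is a 0-based class index."""
    return -math.log(softmax(matvec(W, x))[c])


def softmax_update(W, x, c, alpha):
    """W = W + alpha * (y - softmax(Wx)) x^T, with y the one-hot vector of class c."""
    p = softmax(matvec(W, x))
    new_W = []
    for i, row in enumerate(W):
        coeff = (1.0 if i == c else 0.0) - p[i]
        new_W.append([wij + alpha * coeff * xj for wij, xj in zip(row, x)])
    return new_W


W0 = [[0.0, 0.0, 0.0] for _ in range(3)]   # hypothetical start: 3 classes, 3 features
x = [1.0, 2.0, -1.0]                       # hypothetical augmented sample
c = 1                                      # its class (0-based)
W1 = softmax_update(W0, x, c, alpha=0.1)
print(softmax_loss(W0, x, c), softmax_loss(W1, x, c))
```

The update raises the score of the correct class and lowers the others, so the loss on this sample drops from log 3 after one step.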

Gradient Descent: Softmax Loss Function

• Example, single sample gradient descent with α = 0.1

$$W = W + 0.1 \left( y - \mathrm{softmax}(Wx) \right) x^T$$

• Update for W

[Figure: numeric values of W before and after the update]

More General Discriminant Functions

• Linear discriminant functions

• simple decision boundary

• should try simpler models first to avoid overfitting

• optimal for certain type of data

• Gaussian distributions with equal covariance

• May not be optimal for other data distributions

• Discriminant functions can be more general than linear

• For example, polynomial discriminant functions

• Decision boundaries more complex than linear

• Later will look more at non-linear discriminant functions

Summary

• Linear classifier works well when examples are linearly separable, or almost separable

• Perceptron loss was historically the first loss function

• Logistic regression/softmax work better in practice

• Optimization with gradient descent

• stochastic mini-batch works best in practice

Linear Classifier: Quadratic Loss

• Quadratic per-example loss

$$L(f(z_i, a), y_i) = \frac{1}{2} \left( y_i - a^T z_i \right)^2$$

• This is standard line fitting

• note that even correctly classified examples can have a large loss

• Can find optimal weight a analytically with least squares

• expensive for large problems

• Gradient descent more efficient for a larger problem

$$\nabla L(a) = -\sum_i \left( y_i - a^T z_i \right) z_i$$

• Batch update rule

$$a = a + \alpha \sum_i \left( y_i - a^T z_i \right) z_i$$
