
Page 1:

Introduction to Artificial Intelligence
CSCI 3202: The Perceptron Algorithm

Greg Grudic

Page 2:

Questions?

Page 3:

Binary Classification

• A binary classifier is a mapping from a set of $d$ inputs to a single output that can take on one of TWO values

• In the most general setting:

  inputs: $\mathbf{x} \in \mathbb{R}^d$
  output: $y \in \{-1, +1\}$

• Specifying the output classes as $-1$ and $+1$ is arbitrary!
  – Often done as a mathematical convenience

Page 4:

A Binary Classifier

$\mathbf{x} \rightarrow \text{[Classification Model]} \rightarrow \hat{y} \in \{-1, +1\}$

Given learning data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$

A model is constructed: $\hat{y} = M(\mathbf{x})$

Page 5:

Linear Separating Hyper-Planes

[Figure: a separating hyperplane in the $(x_1, x_2)$ plane]

$\beta_0 + \sum_{i=1}^{d} \beta_i x_i \le 0$ : the region where $y = -1$

$\beta_0 + \sum_{i=1}^{d} \beta_i x_i > 0$ : the region where $y = +1$

$\beta_0 + \sum_{i=1}^{d} \beta_i x_i = 0$ : the separating hyperplane itself

Page 6:

Linear Separating Hyper-Planes

• The Model:

  $\hat{y} = M(\mathbf{x}) = \mathrm{sgn}\left[ \hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x} \right]$

• Where:

  $\mathrm{sgn}[A] = \begin{cases} +1 & \text{if } A > 0 \\ -1 & \text{otherwise} \end{cases}$

• The decision boundary:

  $\hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x} = 0$
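As a concrete illustration, the linear model can be evaluated in a few lines. This is a minimal sketch, not code from the slides; the function names and the example hyperplane are illustrative:

```python
import numpy as np

def sgn(a):
    """sgn[A] = +1 if A > 0, -1 otherwise (the definition on the slide)."""
    return 1 if a > 0 else -1

def predict(beta0, beta, x):
    """Evaluate the linear model y_hat = sgn(beta0 + beta^T x)."""
    return sgn(beta0 + np.dot(beta, x))

# Illustrative hyperplane x1 + x2 - 1 = 0 (beta0 = -1, beta = [1, 1])
print(predict(-1.0, np.array([1.0, 1.0]), np.array([2.0, 0.5])))  # above the plane: +1
print(predict(-1.0, np.array([1.0, 1.0]), np.array([0.2, 0.3])))  # below the plane: -1
```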

Page 7:

Linear Separating Hyper-Planes

• The model parameters are: $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$

• The hat on the betas means that they are estimated from the data

• Many different learning algorithms have been proposed for determining $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$

Page 8:

Rosenblatt’s Perceptron Learning Algorithm

• Dates back to the 1950s and is the motivation behind Neural Networks

• The algorithm:
  – Start with a random hyperplane $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$
  – Incrementally modify the hyperplane so that misclassified points move closer to the correct side of the boundary
  – Stop when all learning examples are correctly classified
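The three steps above can be sketched as an online training loop. This is one plausible implementation rather than the slide's exact code; the learning rate, the epoch cap, and the random initialization are illustrative choices:

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, max_epochs=1000, seed=0):
    """Online perceptron: start from a random hyperplane and nudge it
    toward each misclassified point; stop when every example is correct."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    beta0 = rng.standard_normal()          # random starting hyperplane
    beta = rng.standard_normal(d)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (beta0 + beta @ xi) <= 0:   # misclassified (or on boundary)
                beta0 += rho * yi               # move boundary toward correct side
                beta += rho * yi * xi
                errors += 1
        if errors == 0:                         # all examples correctly classified
            break
    return beta0, beta

# A linearly separable toy set (class given by the sign of x1 - x2)
X = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
y = np.array([1, 1, -1, -1])
b0, b = perceptron_train(X, y)
assert all(np.sign(b0 + X @ b) == y)       # converged: every point classified correctly
```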

Page 9:

Rosenblatt’s Perceptron Learning Algorithm

• The algorithm is based on the following property:
  – The signed distance of any point $\mathbf{x}$ to the boundary is

    $\dfrac{\hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x}}{\sqrt{\sum_{i=1}^{d} \hat{\beta}_i^2}} \;\propto\; \hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x}$

• Therefore, if $M$ is the set of misclassified learning examples, we can push them closer to the boundary by minimizing the following:

  $D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d) = -\sum_{i \in M} y_i \left( \hat{\beta}_0 + (\hat{\beta}_1, \ldots, \hat{\beta}_d)^T \mathbf{x}_i \right)$

Page 10:

Rosenblatt’s Minimization Function

• This is classic Machine Learning!

• First define a cost function in model parameter space:

  $D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d) = -\sum_{i \in M} y_i \left( \hat{\beta}_0 + \sum_{k=1}^{d} \hat{\beta}_k x_{ik} \right)$

• Then find an algorithm that modifies $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$ such that this cost function is minimized

• One such algorithm is Gradient Descent

Page 11:

Gradient Descent

[Figure: the cost $D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$ plotted as a surface over the $(\hat{\beta}_0, \hat{\beta}_1)$ plane]

Page 12:

The Gradient Descent Algorithm

$\hat{\beta}_i \leftarrow \hat{\beta}_i - \rho \, \dfrac{\partial D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)}{\partial \hat{\beta}_i}$

Where the learning rate is defined by: $\rho > 0$
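The update rule can be illustrated on a simple one-parameter cost whose minimum is known in advance. A minimal sketch; the quadratic cost $D(\beta) = (\beta - 3)^2$, the learning rate, and the step count are illustrative choices, not from the slides:

```python
# Gradient descent on D(beta) = (beta - 3)^2, minimized at beta = 3.
# The update beta <- beta - rho * dD/dbeta is the rule on the slide.
def gradient_descent(dD, beta=0.0, rho=0.1, steps=200):
    for _ in range(steps):
        beta = beta - rho * dD(beta)
    return beta

beta_star = gradient_descent(lambda b: 2 * (b - 3))  # dD/dbeta = 2(beta - 3)
print(round(beta_star, 6))  # converges to 3.0
```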

Page 13:

The Gradient Descent Algorithm for the Perceptron

• The gradients of the cost with respect to each parameter:

  $\dfrac{\partial D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)}{\partial \hat{\beta}_0} = -\sum_{i \in M} y_i$

  $\dfrac{\partial D(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)}{\partial \hat{\beta}_j} = -\sum_{i \in M} y_i x_{ij}, \quad j = 1, \ldots, d$

• Two Versions of the Perceptron Algorithm:

  Update one misclassified point $i$ at a time (online):

  $\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} + \rho \begin{pmatrix} y_i \\ y_i x_{i1} \\ \vdots \\ y_i x_{id} \end{pmatrix}$

  Update all misclassified points at once (batch):

  $\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} + \rho \begin{pmatrix} \sum_{i \in M} y_i \\ \sum_{i \in M} y_i x_{i1} \\ \vdots \\ \sum_{i \in M} y_i x_{id} \end{pmatrix}$
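The batch version accumulates the contributions of all misclassified points before applying a single update. A sketch under illustrative choices (the helper name `batch_perceptron_step`, the learning rate, and the toy data are introduced here, not taken from the slides):

```python
import numpy as np

def batch_perceptron_step(beta0, beta, X, y, rho=0.1):
    """One batch update: sum y_i (and y_i * x_i) over the currently
    misclassified set M, then apply the whole correction at once."""
    margins = y * (beta0 + X @ beta)
    M = margins <= 0                        # misclassified set (boundary counts)
    beta0 = beta0 + rho * np.sum(y[M])      # + rho * sum_{i in M} y_i
    beta = beta + rho * (y[M] @ X[M])       # + rho * sum_{i in M} y_i x_i
    return beta0, beta

# Toy data separable by the sign of x1
X = np.array([[1.0, 0.2], [2.0, -0.5], [-1.0, 0.3], [-2.0, 0.1]])
y = np.array([1, 1, -1, -1])
beta0, beta = 0.0, np.zeros(2)
for _ in range(100):
    beta0, beta = batch_perceptron_step(beta0, beta, X, y)
assert all(np.sign(beta0 + X @ beta) == y)  # all points correctly classified
```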

Page 14:

The Learning Data

• Matrix Representation of N learning examples of d dimensional inputs

Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$

$X = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}$

Page 15:

The Good Theoretical Properties of the Perceptron Algorithm

• If a solution exists, the algorithm will always converge in a finite number of steps!

• Question: Does a solution always exist?

Page 16:

Linearly Separable Data

• Which of these datasets are separable by a linear boundary?

[Figure: two scatter plots of $+$ and $-$ points, labeled a) and b)]

Page 17:

Linearly Separable Data

• Which of these datasets are separable by a linear boundary?

[Figure: the same two scatter plots of $+$ and $-$ points; dataset b) is annotated "Not Linearly Separable!"]

Page 18:

Bad Theoretical Properties of the Perceptron Algorithm

• If the data is not linearly separable, the algorithm cycles forever!
  – Cannot converge!
  – This property “stopped” active research in this area between 1968 and 1984…
    • Perceptrons, Minsky and Papert, 1969

• Even when the data is separable, there are infinitely many solutions
  – Which solution is best?

• When data is linearly separable, the number of steps to converge can be very large (it depends on the size of the gap between the classes)

Page 19:

What about Nonlinear Data?

• Data that is not linearly separable is called nonlinear data

• Nonlinear data can often be mapped into a nonlinear space where it is linearly separable

Page 20:

Nonlinear Models

• The Linear Model:

  $\hat{y} = M(\mathbf{x}) = \mathrm{sgn}\left[ \hat{\beta}_0 + \sum_{i=1}^{d} \hat{\beta}_i x_i \right]$

• The Nonlinear (basis function) Model:

  $\hat{y} = M(\mathbf{x}) = \mathrm{sgn}\left[ \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i \phi_i(\mathbf{x}) \right]$

• Examples of Nonlinear Basis Functions:

  $\phi_1(\mathbf{x}) = x_1^2, \quad \phi_2(\mathbf{x}) = x_2^2, \quad \phi_3(\mathbf{x}) = x_1 x_2, \quad \phi_4(\mathbf{x}) = \sin(x_5)$
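The basis-function model is just the linear model applied to transformed inputs. A sketch; the particular map `phi` follows the slide's example basis functions ($x_1^2$, $x_2^2$, $x_1 x_2$, $\sin(x_5)$), and the weights chosen below are illustrative:

```python
import numpy as np

def phi(x):
    """Map x to the example basis functions from the slide."""
    return np.array([x[0]**2, x[1]**2, x[0]*x[1], np.sin(x[4])])

def predict_basis(beta0, beta, x):
    """y_hat = sgn(beta0 + sum_i beta_i * phi_i(x)) -- linear in phi-space."""
    z = beta0 + beta @ phi(x)
    return 1 if z > 0 else -1

# Illustrative circular boundary x1^2 + x2^2 = 1: linear in phi-space
# with beta = [1, 1, 0, 0] and beta0 = -1.
beta0, beta = -1.0, np.array([1.0, 1.0, 0.0, 0.0])
print(predict_basis(beta0, beta, np.array([1.5, 0.0, 0.0, 0.0, 0.0])))  # outside circle: +1
print(predict_basis(beta0, beta, np.array([0.3, 0.3, 0.0, 0.0, 0.0])))  # inside circle: -1
```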

Page 21:

Linear Separating Hyper-Planes In Nonlinear Basis Function Space

[Figure: a separating hyperplane in the $(\phi_1, \phi_2)$ basis-function plane]

$\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i \le 0$ : the region where $y = -1$

$\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i > 0$ : the region where $y = +1$

$\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i = 0$ : the separating hyperplane itself

Page 22:

An Example

[Figure: left — the data plotted in the original $(x_1, x_2)$ space (classes $y=+1$ and $y=-1$), where it is not linearly separable; right — the same data after the mapping $\Phi$ to $(\phi_1, \phi_2) = (x_1^2, x_2^2)$, where it is linearly separable]

Page 23:

Kernels as Nonlinear Transformations

• Polynomial:

  $K(\mathbf{x}_i, \mathbf{x}_j) = \left( \langle \mathbf{x}_i, \mathbf{x}_j \rangle + q \right)^k$

• Sigmoid:

  $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left( \kappa \langle \mathbf{x}_i, \mathbf{x}_j \rangle + \theta \right)$

• Gaussian or Radial Basis Function (RBF):

  $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\dfrac{1}{2\sigma^2} \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2 \right)$
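These three kernels translate directly into code. A sketch; the default parameter values ($q$, $k$, $\kappa$, $\theta$, $\sigma$) are illustrative choices, not values prescribed by the slides:

```python
import numpy as np

def poly_kernel(xi, xj, q=1.0, k=2):
    """Polynomial kernel: (<xi, xj> + q)^k."""
    return (np.dot(xi, xj) + q) ** k

def sigmoid_kernel(xi, xj, kappa=1.0, theta=0.0):
    """Sigmoid kernel: tanh(kappa * <xi, xj> + theta)."""
    return np.tanh(kappa * np.dot(xi, xj) + theta)

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian / RBF kernel: exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(poly_kernel(a, b))   # (0 + 1)^2 = 1.0
print(rbf_kernel(a, a))    # a point with itself: exp(0) = 1.0
```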

Page 24:

The Kernel Model

Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$

$\hat{y} = M(\mathbf{x}) = \mathrm{sgn}\left[ \hat{\beta}_0 + \sum_{i=1}^{N} \hat{\beta}_i K(\mathbf{x}, \mathbf{x}_i) \right]$

The number of basis functions equals the number of training examples!
  – Unless some of the betas get set to zero…

Page 25:

Gram (Kernel) Matrix

Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$

$K = \begin{pmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix}$

Properties:
  • Positive definite matrix
  • Symmetric
  • Positive on diagonal
  • $N$ by $N$
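Building the Gram matrix and checking the listed properties is straightforward. A sketch using an RBF kernel; the helper name `gram_matrix` and the toy data are introduced here for illustration:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build the N-by-N Gram matrix with entries K[i, j] = kernel(x_i, x_j)."""
    N = len(X)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K

# RBF kernel with sigma = 1 (an illustrative choice)
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gram_matrix(X, rbf)

print(K.shape)               # (3, 3) -- N by N
print(np.allclose(K, K.T))   # True -- symmetric
print(np.all(np.diag(K) > 0))  # True -- positive on the diagonal
```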

Page 26:

Picking a Model Structure?

• How do you pick the kernels?
  – Kernel parameters

• These are called learning parameters or hyperparameters
  – Two approaches to choosing learning parameters:
    • Bayesian
      – Learning parameters must maximize the probability of correct classification on future data, based on prior biases
    • Frequentist
      – Use the training data to learn the model parameters $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d)$
      – Use validation data to pick the best hyperparameters

• More on learning parameter selection later

Page 27:

Perceptron Algorithm Convergence

• Two problems:
  – No convergence when the data is not separable in basis function space
  – Infinitely many solutions when the data is separable

• Can we modify the algorithm to fix these problems?