computational vision notes

Upload: srconstantin

Post on 06-Apr-2018


8/3/2019 Computational Vision Notes

    Sept 12, 2011

Horseshoe crab, Limulus polyphemus.

Lives in the mud at the bottom of the ocean.

    Mates at high tide, under a full moon.

    Multiple visual systems.

Compound eyes made of little facets. Each facet focuses light on a single point: 3-7 retinular cells, the light-sensitive neurons. Can function over a 10^4 range of input photon levels.

    Neural network: input layer, processing, output layer. Physically arranged in layers

    of the eye. Lateral plexus is the intermediate layer.

Experiment on this by measuring the firing rate in response to the parameter. Max is about 100 spikes per second. Min is about four spikes per second. Spike trains in response to light have a very rapid onset and then settle into a regular firing pattern.

    If the light gets brighter on sensor N, the firing rate decreases at a rate of about 10

    spikes per second. Neighboring sensors have a firing rate that drops off as you

    move away along the eye.

The more area that's illuminated, the more firing is decreased.

    Wiring diagram for the limulus: each sensor is connected to the nearby few neurons.

F_i = e_i - \sum_j \alpha_{ij} e_j

    Firing rate of one neuron, inhibited by neighboring neurons.

    The result of this model is that this picks out edges. Increases apparent contrast.

Like subtracting a smoothing kernel? Ernst Mach called these Mach bands: a bright band and then a dark band, seeming to lie on either side of the edge. This doesn't necessarily find the edges, just intensifies them.

Gradient: Mexican Hat function, an excitatory center surrounded by inhibitory interactions.

Recurrent network vs. feed-forward network. Recurrent networks give an iteration: neurons stimulate sensors, so signals also propagate backwards. Feed-forward: signals only propagate forward.

    Sept. 14, 2011

    Nonlinearity


Limulus linearis: a slice through the compound eye, a lateral inhibitory network, an edge-detection filter.

    In our model, a_{ij} = 1 for excitatory connections, -0.5 for inhibitory.

Example: a signal like 0, 0, 0, 10, 10, 10 gives output 0, 0, -5, 5, 0, 0: a Mach band.

If the signal is 0, 0, 0, 20, 20, 20, the output is 0, 0, -10, 10, 0, 0. If the input is 10, 10, 10, 20, 20, 20, we get 0, 0, -5, 5, 0, 0. Doesn't add linearly.
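These worked examples can be reproduced with a short sketch of the notes' model F_i = e_i - 0.5(e_{i-1} + e_{i+1}). The edge handling is an assumption: boundary samples are replicated so that a constant input maps to zero output, which matches the examples.

```python
# The notes' lateral inhibition model, F_i = e_i - sum_j alpha_ij e_j,
# with alpha = 0.5 for the two nearest neighbors. Edge handling is an
# assumption: boundary samples are replicated so a constant input maps
# to zero output.

def lateral_inhibition(e, alpha=0.5):
    padded = [e[0]] + list(e) + [e[-1]]   # replicate the edges
    return [padded[i] - alpha * (padded[i - 1] + padded[i + 1])
            for i in range(1, len(padded) - 1)]

print(lateral_inhibition([0, 0, 0, 10, 10, 10]))    # [0.0, 0.0, -5.0, 5.0, 0.0, 0.0]
print(lateral_inhibition([10, 10, 10, 20, 20, 20])) # [0.0, 0.0, -5.0, 5.0, 0.0, 0.0]
```

Note that adding a constant background (10 everywhere) leaves the output unchanged: the filter responds to contrast, not absolute level.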

Take two step responses, shifted, and take the difference: S(x) - S(x - Δ) looks like the impulse response h(x).

    Impulse response: h(x) = L(delta(x))

    Step response: S(x) = L(u(x))

S(x) = \int_{-\infty}^{x} h(x') dx'

If I is continuous, \hat{I}(x) = I(0) u(x) + [I(\Delta) - I(0)] u(x - \Delta) + [I(2\Delta) - I(\Delta)] u(x - 2\Delta) and so on. Like a Riemann sum.

Output: O(x) = \int I(\tau) h(x - \tau) d\tau

    Convolution with h.

    If you know the impulse response of a linear system, you know its response to any

    input.
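A minimal discrete version of this fact: convolve an input with the impulse response, and check superposition. The 3-tap kernel values are invented for illustration.

```python
# If you know the impulse response h of a linear shift-invariant system,
# its response to any input is the convolution I * h. A direct discrete
# implementation; the 3-tap kernel values are invented.

def convolve(signal, h):
    out = [0.0] * (len(signal) + len(h) - 1)
    for x in range(len(out)):
        for tau in range(len(signal)):
            if 0 <= x - tau < len(h):
                out[x] += signal[tau] * h[x - tau]
    return out

h = [1.0, -0.5, 0.25]
# An impulse comes back as the impulse response itself:
assert convolve([1, 0, 0], h) == [1.0, -0.5, 0.25, 0.0, 0.0]
# Superposition: L(a + b) = L(a) + L(b).
a, b = [1, 0, 2], [0, 3, 1]
summed = convolve([p + q for p, q in zip(a, b)], h)
assert summed == [p + q for p, q in zip(convolve(a, h), convolve(b, h))]
```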

Unsharp mask: average over a neighborhood, and subtract that from the value at a point. Increases contrast.

    Blur: average over an interval.

    We want: sharp(blur) = blur(sharp) = identity.

    Sept. 19, MATLAB tutorial.

    Loops are very bad. Stick to matrix multiplication. The basic data type is a multi-dimensional array.

A = [1 2; 3 4]

Basic syntax for declaring a variable.

1:5

enumerates the numbers 1 through 5.

    .*

    pointwise multiplication

Don't nest functions too deeply; lookups are slow.

    Sept. 21

    Staggered input to limulus can give ripples where there are no lights.

    Is there an input that could give the same output?


What is the temperature of a wine cellar as a function of time? Sinusoidal with a 24-hour period. The temperature at a given depth is the surface oscillation convolved with a depth-dependent smoothing kernel; this is how Fourier series were developed!

F(t) = a_0 + \sum_n a_n \cos(n \omega t) + \sum_n b_n \sin(n \omega t)

a_n = (2/T) \int_0^T I(t) \cos(n \omega t) dt

b_n = (2/T) \int_0^T I(t) \sin(n \omega t) dt

    Fourier coefficients

A sinusoid is invariant, up to amplitude, under the linear lateral plexus (sinusoids are eigenfunctions of linear shift-invariant systems).

How good is a Fourier approximation? How many frequencies do we want? More frequencies mean smaller error, but the remaining error is at higher frequencies. Blurring can actually make convergence faster.
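The coefficient formulas above can be evaluated numerically; a sketch using a square wave as the test signal (the square wave and the sample count are illustrative choices, not from the notes).

```python
import math

# Numerical Fourier coefficients, following the notes' formulas
# a_n = (2/T) \int I(t) cos(n w t) dt, and likewise for b_n, using a
# midpoint-rule sum. The square-wave test signal is an illustrative
# choice, not from the notes.

def square(t):
    return 1.0 if t % 1.0 < 0.5 else -1.0

def fourier_coeffs(I, T, n, samples=4096):
    w = 2 * math.pi / T
    dt = T / samples
    ts = [(k + 0.5) * dt for k in range(samples)]
    a_n = (2 / T) * sum(I(t) * math.cos(n * w * t) for t in ts) * dt
    b_n = (2 / T) * sum(I(t) * math.sin(n * w * t) for t in ts) * dt
    return a_n, b_n

a1, b1 = fourier_coeffs(square, 1.0, 1)
assert abs(b1 - 4 / math.pi) < 1e-3   # square wave: b_n = 4/(pi n), odd n
assert abs(a1) < 1e-6                 # cosine terms vanish
```

The odd harmonics fall off like 1/n, which is why a discontinuous signal converges slowly and a blurred (smoothed) one converges faster.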

Complex exponentials e^{i\omega x} pass through a linear system unchanged up to a scale factor.

Take F(I * h), the Fourier transform of the convolution of two functions: it is the product F(I) F(h).

    Low pass filter: only low frequencies.

    High pass filter: only high frequencies.

Aliasing: the spectral copies (delta-function replicas of the spectrum) overlap in the frequency domain. Need the sampling to be tighter in time to avoid aliasing. The right rate to avoid aliasing is the Nyquist rate: how fast to sample so as to get no aliasing. Need to filter the sampled signal: multiply by a box in the frequency domain, whose transform is sinc(x) = sin(x)/x.

sinc(t) * (sampled f)

reconstructs the signal from its samples. Bandlimit the input!
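A small numerical illustration of aliasing: sampled at 10 Hz, a 7 Hz cosine is indistinguishable from a 3 Hz cosine, because 7 Hz is above the 5 Hz Nyquist frequency. The specific frequencies are made up.

```python
import math

# Aliasing demo: at fs = 10 Hz (Nyquist frequency 5 Hz), a 7 Hz cosine
# produces the same samples as a 3 Hz cosine, so the two frequencies
# cannot be told apart after sampling.

FS = 10.0  # sampling rate in Hz

def sample_cosine(freq, n_samples=20):
    return [math.cos(2 * math.pi * freq * n / FS) for n in range(n_samples)]

hi, lo = sample_cosine(7.0), sample_cosine(3.0)
assert all(abs(a - b) < 1e-9 for a, b in zip(hi, lo))
```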

Modulation Transfer Function: high frequencies are bandpassed.

    NOW we realize the visual system is not linear!

    Sept. 26

Limulus linearis:

1. Modeling of input-output behavior.
2. From the time or space domain: convolution theorem and superposition.
3. Frequency domain; Fourier series; convolution in the space domain corresponds to multiplication in the frequency domain.

(I * h)(x) -> \hat{I}(\omega) \hat{H}(\omega)

(I h)(x) -> (\hat{I} * \hat{H})(\omega)


    I = I_1 + I_2

    S(I) = S(I_1) + S(I_2).

    The beginnings of a linear algebra for images.

The mating behavior of a male limulus: looking for a female at high tide under a full moon. If he can find one, he'll move in that direction. How does he find her?

He's looking for a dark object against a light background, seen through the lateral inhibitory network: the convolution of h(x) against a tiny spike. As he gets closer, this will be a pair of peaks, the edges of the female.

    |Features * Image|_\theta

Template for seeking those features. A threshold on the output: value 1 if the peak is above the threshold, 0 otherwise. This threshold stage is nonlinear. Simplest possible decision mechanism. But none of the linear-systems material we've discussed thus far holds past it.

The wiring of the network is the template that the animal is looking for in the world.

    Simplest test: lay the template on top of the image and see if the values are in

    agreement or not.

Match(m, n) = \sum_{i, j} |I(i, j) - T(i - m, j - n)|^2

    If the match is close enough, between the image and the template, you decide

    to accept.

    Why squaring? Emphasizes bigger values. Why not a different power?

Various possible norms: l_1 (sum of absolute values), l_2 or SSD, any l_p norm, or l_\infty, the supremum norm.

    Normalized correlation:

\sum_{i, j} I(i, j) T(i - m, j - n) / (\sum_{i, j} I^2(i, j))^{1/2}

    This is the standard, oldest template-matching distance.
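A 1-D sketch of both distances (the signal and template values are invented): the SSD match score at each shift, and the normalized correlation from the formula above.

```python
import math

# 1-D template matching sketch: SSD match score at each shift, and the
# normalized correlation from the notes. Signal and template values are
# invented for illustration.

def ssd(signal, T):
    m = len(T)
    return [sum((signal[i + k] - T[k]) ** 2 for k in range(m))
            for i in range(len(signal) - m + 1)]

def normalized_correlation(signal, T):
    m = len(T)
    norm = math.sqrt(sum(v * v for v in signal))
    return [sum(signal[i + k] * T[k] for k in range(m)) / norm
            for i in range(len(signal) - m + 1)]

signal = [0, 1, 5, 9, 5, 1, 0]
T = [5, 9, 5]                         # the pattern we are looking for
scores = ssd(signal, T)
best = scores.index(min(scores))      # SSD is 0 where the match is exact
nc = normalized_correlation(signal, T)
assert best == 2 and min(scores) == 0
assert max(range(len(nc)), key=lambda i: nc[i]) == 2
```

For SSD you accept when the minimum is below a threshold; for correlation, when the peak is above one.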

It can recognize letters against noisy backgrounds. How do you want to define the threshold? It depends how noisy the background is. Too noisy and you can't detect at all.

Of course an image template is too simple: think of multiple fonts of the letter A. Facial recognition is even harder (we didn't even do it in class!). Graph structure on the face? Nothing really handles variation in age. Hard to handle variation in camera viewpoint, lighting, and facial expression.

    Sept 28, 2011

    Ion channel: in one configuration, allows ions to pass, in another, does not.

Open: H. Closed: T. Strings: TTHTHTHHTHTHHHTHTHTHHHHHTTHHTTT


    Binomial probability:

(n choose k) p^k (1 - p)^(n - k)

    sigma: 1 when the channel is open, 0 when the channel is closed.

E(\sigma): the energy of configuration \sigma.

Z = \sum_\sigma e^{-\beta E(\sigma)} = e^{-\beta E_open} + e^{-\beta E_closed}

Boltzmann probability:

P(open) = e^{-\beta E_open} / (e^{-\beta E_open} + e^{-\beta E_closed})
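The two-state probability can be computed directly; the energies and beta here are made up.

```python
import math

# Boltzmann probability that the two-state channel is open, from the
# partition function Z above. Energies and beta are made-up values.

def p_open(E_open, E_closed, beta=1.0):
    Z = math.exp(-beta * E_open) + math.exp(-beta * E_closed)
    return math.exp(-beta * E_open) / Z

assert abs(p_open(0.0, 0.0) - 0.5) < 1e-12   # equal energies: 50/50
assert p_open(-1.0, 0.0) > 0.5               # the lower-energy state wins
```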

    Bayesian probability:

    P(x = a | y = b) = P(x = a, y = b)/P(y = b)

Chain rule:

P(x, y | H) = P(x | y, H) P(y | H) = P(y | x, H) P(x | H)

    Bayes Rule:

    P(y | x) = P(x | y) P(y)/P(x)

    Binary template matching

    The template is a matrix of zeros and ones.

P(I | Scene) = \prod_{x, y} p^{1 - |I(x, y) - T(x, y)|} (1 - p)^{|I(x, y) - T(x, y)|}

Where they agree, a factor of p; where they disagree, a factor of (1 - p).
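A minimal sketch of this likelihood for binary images flattened to lists; p = 0.9 is an arbitrary choice.

```python
# Binary template likelihood from the notes: each pixel matches the
# template independently with probability p. Images are flattened to
# lists of 0s and 1s; p = 0.9 is an arbitrary choice.

def likelihood(image, template, p=0.9):
    L = 1.0
    for i_px, t_px in zip(image, template):
        L *= p if i_px == t_px else (1 - p)  # agree: p, disagree: 1 - p
    return L

template = [1, 1, 0, 0, 1]
perfect = likelihood([1, 1, 0, 0, 1], template)   # p**5
one_off = likelihood([1, 0, 0, 0, 1], template)   # p**4 * (1 - p)
assert perfect > one_off
```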

    Oct 3, 2011

    Midterm: two weeks from today.

Paper: "What the frog's eye tells the frog's brain" by Jerry Lettvin et al.

Show a black disk to a frog: if you move the disc, a certain group of neurons fires, and if you stop moving, they stop firing. Bug detector. Testing for: dark contrast, size around 1 degree of visual angle, moving with a velocity in some range, etc. Logical AND between all these conditions. Sometimes it would continue to fire when the black spot is occluded, sometimes not.

What do you do if you have two moving spots in your visual world? Where do you jump? Not the mean of the two spots! But it's apparently rare enough that the frog's brain doesn't know that.

    You have a stack of images I(x, y, t), sampled in position and in time.

dI/dt = (dI/dx)(dx/dt): chain rule.


If we think of I(x, y) as frames in a movie, there's some Δt between frames.

Aliasing issue: attached to every point in an image is a vector, describing how it changes in time.

For the fly: Werner Reichardt made a model. Multiple ommatidia, spaced dx apart. Take a product (correlation) between two signals. If the combined output is a significant number, then you know a velocity of a moving point has been detected, at dx/dt.
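A Reichardt-style correlator can be sketched as follows; the one-step delay and the opponent subtraction of the two mirror-image half-detectors are standard modeling choices, assumed here rather than taken from the notes.

```python
# Reichardt-style motion detector sketch: two sensors one sample apart;
# each sensor's delayed signal is multiplied against its neighbor's
# current signal, and the two mirror-image products are subtracted to
# give a signed direction estimate. The one-step delay is an assumption.

def reichardt(left, right, delay=1):
    out = 0.0
    for t in range(delay, len(left)):
        out += left[t - delay] * right[t] - right[t - delay] * left[t]
    return out

# A pulse moving left-to-right reaches the right sensor one step later.
left = [0, 1, 0, 0, 0]
right = [0, 0, 1, 0, 0]
assert reichardt(left, right) > 0     # rightward motion: positive
assert reichardt(right, left) < 0     # leftward motion: negative
```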

To do this in 2-d, track in two dimensions. But we have problems. The view through one ommatidium is one vector; at the following time it could be a new vector. What's the motion? There are more unknowns than there are constraints. This is called the aperture problem. The best we can hope for is a pseudo-solution.

Two dots, both move to the right. Are they drifting together, or is one of them popping to the right of the other? Depends whether the time is short or long.

A moving object that lives for less than 55 milliseconds: what you see is a streak. If it lives longer, you see a dot, it moves, and then disappears.

Apparent motion: your visual system is integrating over periods of 60 to 100 milliseconds. Why don't we see smear?

The blowfly, Calliphora vicina, has big compound eyes. Male flies are focused on female flies and chase them around. Some sensors are an order of magnitude faster, and those are used to keep their eyes on the female.

Like the limulus: ommatidia, lateral complex just below, medulla, and then lobula plate. H1 does the velocity calculation; it integrates over the whole visual field of the compound eye on one side.

Experimental setup: glue a fly down, put electrodes in the visual system, and show a computer screen of a pseudo-random noise pattern. How can you reconstruct a stimulus from the spike train?

Reverse correlation: record the spikes, and note what the pattern of the world was for one second back in time.

Also, you want to know how you are flying. Optical flow patterns. The axonal arborization is highly structured; there's a pattern to the neurons. Provides direct measurements of pitch and yaw.

    Oct 5, 2011


    The Primate Brain

It was once thought (in Da Vinci's day) that different functions of the brain lived in different ventricles.

19th century: mapping the brain. Sherrington. No pain receptors in the brain. Two hemispheres; the back of the brain is attached to the optic nerve.

Light, detection, form, color, ocular movement.

Stereo vision: how did it evolve? There's something about monocular processing that could have an advantage for the creature.

    Occipital lobe in the back of the cerebrum.

Anatomically interesting substructures are determined by staining for myelin. The corpus callosum connects the two halves of the brain.

The brain is layered, dark and light. V3: a layer of cortex. MT: motion area. (Stereo as well as motion; the name is a misnomer.)

    Oct 12, 2011

    Computational religion.

    Unitarian: everyone talks to everyone else.

Catholic: there's an organizational structure: people answer to priests, who answer to bishops, then archbishops, cardinals, and the Pope. It's a tree hierarchy. Decisions are made from the top down.

    Tree structure can also be used for search; categories divided into further

    categories.

Retinal ganglion cells have a receptive field: an area where a spot of light makes them fire most.

Simple cell with odd symmetry: the response depends on orientation!

    Retinal ganglion cells: on center, off center receptive field. Lateral geniculate

    nucleus: again, on center, off center receptive field.

    In the back, first visual area, V1, if you stain for myelin or metabolic enzymes

    or something, you get a layered organization.

Six layers: superficial, intermediate/input, and deep. Layer I, Layer II/III, Layer IV, Layer V, and Layer VI. Layer I has no myelin, so it looks light in pictures.

Parvo and Magno layers in the lateral geniculate nucleus go to X and Y layers in the input layer, kept in positional registration consistent with the x/y coordinates of the visual field. The axons of these cells go down to layer V, which goes down to layer VI, which goes down to midbrain structures including the lateral geniculate nucleus, etc. There's a circuit: V1 deep layers, lateral geniculate, back into V1, then down to the deep layers.


The other path: a dendrite goes up to layer I, forms synapses with V2, and then back to layer I. Superficial layers.

Orientationally selective cells: inhibited in some regions and excited in others. In simple cells it is easy to separate the excitatory center from the inhibitory surround.

1. Stripe of excitation in the center.
2. Excitation everywhere but the center.
3. Excitation on top, inhibition on the bottom.

Profile of the receptive field: like a sinusoidal grating.

    Complex cells: subdomains are difficult to separate from one another.

    Perhaps non-linear in input. Multiple peaks.

    Hypercomplex cells: cells that respond to a corner.

Simple cells might be edge detectors, complex cells generalize the position of edges over a range of locations, and hypercomplex cells detect corners.

Endstopping: if the length gets too long, it stops the response. This functions as a corner detector. Orientation column: at every position, there's a group of cells at different sizes and all orientations.

It doesn't look like the tuning of orientation depends on contrast: it's contrast independent. Does the Hubel-Wiesel model really explain this? How do you look at just the lateral geniculate input without seeing any influence from the cortex? Cool the cortex down to about 10 degrees. David Ferster did this.

Extracellular recording: put an electrode right on the outside. Ferster went inside and looked at intracellular potential: inhibitory and excitatory post-synaptic potentials. So you can tell which orientation each cell is tuned to; orthogonal to that is basically a flat response. Preferred orientation and null orientation. But if you cool the cortex down, the potentials look about the same; the cortex isn't changing much.

Tuning is broad in layer IV, gets much tighter in II/III. And it also gets contrast-invariant. This is a basic fact that requires explanation.

There are 4 excitatory cells for each inhibitory cell. Excitatory connections: long distance. Inhibitory connections: short distance. The sum of a small excitatory Gaussian and a broad inhibitory Gaussian looks a lot like the Laplacian of a Gaussian. DOG = Difference of Gaussians models.
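A 1-D sketch of the DOG profile: an excitatory center with inhibitory flanks. The sigma and the 1.6 ratio are conventional illustrative choices, not from the notes.

```python
import math

# The notes' DOG (Difference of Gaussians): a narrow excitatory Gaussian
# minus a broad inhibitory one. The sigma and the 1.6 ratio are
# conventional illustrative choices, not from the notes.

def gaussian(x, s):
    return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def dog(x, s=1.0, ratio=1.6):
    return gaussian(x, s) - gaussian(x, ratio * s)

assert dog(0.0) > 0                    # excitatory center
assert dog(2.0) < 0 and dog(-2.0) < 0  # inhibitory surround
```

Plotting dog(x) against the Laplacian of a Gaussian shows how closely the two profiles agree, which is why the DOG serves as a cheap stand-in.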


    Turing:

dU/dt = k_1 U + k_2 V + D_U \Delta U

Diffusion equation.

dV/dt = k_3 U + k_4 V + D_V \Delta V

U is making itself in proportion to its concentration, it's inhibited by the presence of the other agent, and it's diffusing out to nearby cells. The inhibitor agent is making itself and diffusing out.

    Equal: U = V = 0: nothing going on.

    Then: excitatory afferent from the lateral geniculate.

    Bump in the concentration of U.

    U diffuses outward.

Models of this flavor explain orientation tuning and how it gets tighter as it goes up into superficial layers. But what about contrast?

    Think of a ball living in layer II/III, living in position/orientation space.

    Connected between cells in nearby columns. Long-range horizontal

    connections.

Contrast-invariant orientation tuning exists on a smaller scale when you get rid of cortical connections; the cortex just amplifies the signal you get from the lateral geniculate. Look at an LGN cell. When the on-center cell sees bright it sees activity, and when the off-center cell sees dark it sees activity. But if you add up the contrast, it only looks at the positive part, and the rest is zero. The average increases as the contrast increases.

Suppose you're a simple cell that wants to see vertical orientation, and there's some horizontal grating. The LGN cells are going to be firing, but the order they'll be firing in won't be the right order that the cells in this receptive field want to see. The wiring of the layer IV cell cares about the orientation.

When two cells are out of phase (antiphase) with respect to one another, one wants to see black where the other wants to see white, and vice versa. How do you wire up in-phase and antiphase cells? The counterphase inhibitory cell inhibits the excitatory cell. The LGN stimulates all the cells. When the stimulus is at the preferred orientation, the excitatory and the inhibitory cell in the preferred direction are firing a lot, but the out-of-phase cells aren't. The excitatory cell is keeping itself active. The inhibitory cell is inhibiting the excitatory counterphase cell, which excites the inhibitory counterphase cell, which inhibits the excitatory in-phase cell. But that inhibition isn't going to matter much. The excitatory cell is going to take over and win.


But take an out-of-phase stimulus. In-phase and out-of-phase cells all get about the same input. Everybody inhibits everybody else; there's no activity.

    This is called push-pull.

    Think of the cortical cells as being arranged in these phase/antiphase

    relationships and balanced between excitation and inhibition.

Are simple cells edge detectors? Are they participating in a hierarchy? This field doesn't look at this. But they're still important questions.

    Oct 19, 2011

    The ancient days of computer vision

How do you segment an object from its background in an image? Analysis of the Pap smear. The threshold selection problem: cells are distributions of high intensity in pixel value. If they're Gaussian distributions of equal variance, you segment them right in the middle. Segmentation: partition the image into parts, assign each pixel to a part.
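A sketch of that threshold rule, with made-up class parameters: for two equal-variance Gaussians with equal priors, the minimum-error threshold is the midpoint of the means.

```python
import math

# Threshold selection sketch: if foreground and background intensities
# are Gaussians of equal variance (and equal priors), the minimum-error
# threshold sits at the midpoint of the two means. The class parameters
# are made up.

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_bg, mu_fg, sigma = 50.0, 150.0, 20.0
threshold = (mu_bg + mu_fg) / 2       # 100.0

# At the midpoint the two class likelihoods are equal.
assert abs(gaussian_pdf(threshold, mu_bg, sigma)
           - gaussian_pdf(threshold, mu_fg, sigma)) < 1e-15

def segment(pixel):
    return "foreground" if pixel > threshold else "background"

assert segment(160) == "foreground" and segment(40) == "background"
```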

GM's bin-of-parts problem. Photo of a part: what part is it?

    Minsky thought we needed to solve this as a computer vision problem; the

    robots should guess which part is which.

Where do you put the threshold for the histogram? Picking the first thing doesn't necessarily pick out the right part.

It's not so clear what to do once you have the segmentation. And you're using almost no structure about the intensities; it's just a first-order statistic.

    Build a derivative operator (discrete approximation) and use high

    derivatives as our feature. Add a little noise. Problems.

D_x * G * I(x): derivative, Gaussian, image.

A regularized derivative. This picks out the edges pretty well.

But the numbers in the Sobel operator are sort of made up.
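A 1-D sketch of the regularized derivative D_x * G * I; the sigma and kernel radius are illustrative choices.

```python
import math

# Regularized derivative D_x * G * I(x): correlate the image row with a
# sampled derivative-of-Gaussian kernel. Sigma and the kernel radius are
# illustrative choices.

def deriv_gaussian_kernel(sigma=1.0, radius=3):
    return [-x / sigma ** 2 * math.exp(-x * x / (2 * sigma ** 2))
            for x in range(-radius, radius + 1)]

def filter_same(signal, kernel):
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

row = [0, 0, 0, 0, 10, 10, 10, 10]   # a step edge
response = filter_same(row, deriv_gaussian_kernel())
edge = max(range(len(response)), key=lambda i: abs(response[i]))
assert edge in (3, 4)                # strongest response at the step
```

Because the derivative is taken of a smoothed signal, small additive noise no longer dominates the output the way it does for a bare finite difference.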

    The Bar Mitzvah of computer vision.

    Take another derivative! Second derivative in 2d or Laplacian.

Larry Roberts: the guy who wrote the first computer vision PhD thesis. Poor man's approximation to the gradient, a pair of 2x2 diagonal-difference masks:

+1  0        0 +1
 0 -1       -1  0

called the Roberts cross.

Larry Roberts, by the way, invented packet switching, which became the basis of the Arpanet.

    David Marr came to MIT; Minsky wanted him to solve vision. First bottom-up

    vision problem.


    The more smoothing you put in, the blurrier the edge becomes.

    Marr says look at the Laplacian of the Gaussian and see where it crosses zero.

    Marr-Hildreth operator.
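A 1-D sketch of the Marr-Hildreth idea: filter with a (1-D) Laplacian of Gaussian and mark zero crossings of the output. Sigma and sizes are illustrative choices.

```python
import math

# Marr-Hildreth in 1-D: filter with the second derivative of a Gaussian
# (the 1-D Laplacian of Gaussian) and mark where the output crosses
# zero. Sigma and sizes are illustrative choices.

def log_kernel(sigma=1.0, radius=4):
    return [(x * x / sigma ** 4 - 1 / sigma ** 2)
            * math.exp(-x * x / (2 * sigma ** 2))
            for x in range(-radius, radius + 1)]

def filter_same(signal, kernel):
    r = len(kernel) // 2
    return [sum(w * signal[i + k - r]
                for k, w in enumerate(kernel)
                if 0 <= i + k - r < len(signal))
            for i in range(len(signal))]

row = [0.0] * 8 + [10.0] * 8         # a step edge at index 8
response = filter_same(row, log_kernel())
crossings = [i for i in range(1, len(response))
             if response[i - 1] * response[i] < 0]
assert 8 in crossings                # zero crossing right at the edge
```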

Next giant step: the Canny operator.

The Marr-Hildreth operator mucks things up in places where there is a lot of structure; it finds too many blobs.

Put a coordinate system on the boundary: parallel and perpendicular.

Curves can be edges, bright lines, or dark lines. Linear operators can't tell which is which.

    NOTE:

    dQ/dt = I = C dV/dt

    Total current is capacitance times the derivative of the voltage.

    Oct 24, 2011

    Logical/Linear Operator

    Think of a matrix of receptors. For a line to be a highlight, it should be

    bright-dark-bright or dark-bright-dark. A boundary is dark-bright or bright-

    dark.

In order to build a scheme like this, you can't convolve a linear operator with the image; it's non-linear.

Look at pairs of receptors: are they positive-positive? At all logical positions?

A hierarchical tree of convolutions and logical combinations of them.

    Hammond/Mackay began finding nonlinearities in responses of cells. This is

    evidence that the hierarchical view is accurate.

    Other ways to build a tree: Fourier expansion.

The Mona Lisa's smile disappears and reappears depending on where you look. The eye is foveated. If you look at the low-frequency content you see a big smile; if you look at the high-frequency content, you don't see a smile. So it depends where you look.

    Face recognition/Machine Learning

    Ax = \lambda x

\Lambda = S^{-1} A S: a similarity transformation, where S is a matrix of eigenvectors.

\Sigma: covariance matrix. \Sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)]


    Eigenvectors of this covariance matrix can tell the components of greatest

    variance.

    Natural expansion of eigenvectors by size of eigenvalue; this helps us with

    approximation.
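A sketch of this idea on made-up 2-D data: build the covariance matrix and find its top eigenvector (the direction of greatest variance) by power iteration.

```python
import math

# Principal components sketch: covariance matrix of made-up 2-D points
# stretched along the diagonal, and its top eigenvector (the direction
# of greatest variance) found by power iteration.

points = [(x, x + 0.1 * (-1) ** i) for i, x in enumerate(range(-5, 6))]

n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
cxx = sum((x - mx) ** 2 for x, _ in points) / n
cyy = sum((y - my) ** 2 for _, y in points) / n
cxy = sum((x - mx) * (y - my) for x, y in points) / n
C = [[cxx, cxy], [cxy, cyy]]

v = [1.0, 0.0]
for _ in range(50):                   # power iteration
    w = [C[0][0] * v[0] + C[0][1] * v[1],
         C[1][0] * v[0] + C[1][1] * v[1]]
    norm = math.hypot(w[0], w[1])
    v = [w[0] / norm, w[1] / norm]

# The direction of greatest variance lies along y = x.
assert abs(abs(v[0]) - abs(v[1])) < 0.05
```

Keeping only the eigenvectors with the largest eigenvalues gives the low-dimensional approximation the notes mention (for faces, the eigenface expansion).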