computational vision notes

Upload: srconstantin

Post on 06-Apr-2018


8/3/2019 Computational Vision Notes

    Sept 12, 2011

Horseshoe crab, Limulus polyphemus.

Lives in the mud at the bottom of the ocean.

    Mates at high tide, under a full moon.

    Multiple visual systems.

Compound eyes made of little facets. Each facet focuses light on a single point: 3-7 retinular cells, the light-sensitive neurons. Can function over a 10^4 range of input photon levels.

    Neural network: input layer, processing, output layer. Physically arranged in layers

    of the eye. Lateral plexus is the intermediate layer.

Experiment on this by measuring the firing rate in response to the parameter. Max is about 100 spikes per second. Min is about four spikes per second. Spike trains in response to light have a very rapid onset and then settle into a regular firing pattern.

    If the light gets brighter on sensor N, the firing rate decreases at a rate of about 10

    spikes per second. Neighboring sensors have a firing rate that drops off as you

    move away along the eye.

The more area that's illuminated, the more firing is decreased.

    Wiring diagram for the limulus: each sensor is connected to the nearby few neurons.

F_i = e_i - \sum_j \alpha_{ij} e_j

    Firing rate of one neuron, inhibited by neighboring neurons.

    The result of this model is that this picks out edges. Increases apparent contrast.

Like subtracting a smoothing kernel? Ernst Mach called these Mach bands: a bright band and then a dark band, seeming to lie on either side of the edge. This doesn't necessarily find the edges, just intensifies them.

Gradient: Mexican Hat function, an excitatory center surrounded by inhibitory interactions.

Recurrent network vs. feed-forward network. Recurrent networks give an iteration: neurons stimulate sensors, so signals also propagate backwards. Feed-forward: signals only propagate forward.

    Sept. 14, 2011

    Nonlinearity


Limulus linearis: a slice through the compound eye, a lateral inhibitory network, an edge-detection filter.

    In our model, a_{ij} = 1 for excitatory connections, -0.5 for inhibitory.

Example: a signal like 0, 0, 0, 10, 10, 10 gives output 0, 0, -5, 5, 0, 0: a Mach band.

If the signal is 0, 0, 0, 20, 20, 20, the output is 0, 0, -10, 10, 0, 0. If the input is 10, 10, 10, 20, 20, 20, we get 0, 0, -5, 5, 0, 0. Doesn't add linearly.
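These worked examples can be reproduced with a short sketch of the notes' model F_i = e_i - 0.5(e_{i-1} + e_{i+1}). The edge handling is an assumption: boundary samples are replicated so that a constant input maps to zero output, which matches the examples.

```python
# The notes' lateral inhibition model, F_i = e_i - sum_j alpha_ij e_j,
# with alpha = 0.5 for the two nearest neighbors. Edge handling is an
# assumption: boundary samples are replicated so a constant input maps
# to zero output.

def lateral_inhibition(e, alpha=0.5):
    padded = [e[0]] + list(e) + [e[-1]]   # replicate the edges
    return [padded[i] - alpha * (padded[i - 1] + padded[i + 1])
            for i in range(1, len(padded) - 1)]

print(lateral_inhibition([0, 0, 0, 10, 10, 10]))    # [0.0, 0.0, -5.0, 5.0, 0.0, 0.0]
print(lateral_inhibition([10, 10, 10, 20, 20, 20])) # [0.0, 0.0, -5.0, 5.0, 0.0, 0.0]
```

Note that adding a constant background (10 everywhere) leaves the output unchanged: the filter responds to contrast, not absolute level.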

Take two step responses, shifted, and take the difference: S(x) - S(x - Δ) looks like the impulse response h(x).

    Impulse response: h(x) = L(delta(x))

    Step response: S(x) = L(u(x))

S(x) = \int_{-\infty}^{x} h(x') dx'

If I is continuous, \hat{I}(x) = I(0) u(x) + [I(\Delta) - I(0)] u(x - \Delta) + [I(2\Delta) - I(\Delta)] u(x - 2\Delta) and so on. Like a Riemann sum.

Output: O(x) = \int I(\tau) h(x - \tau) d\tau

    Convolution with h.

    If you know the impulse response of a linear system, you know its response to any

    input.
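A minimal discrete version of this fact: convolve an input with the impulse response, and check superposition. The 3-tap kernel values are invented for illustration.

```python
# If you know the impulse response h of a linear shift-invariant system,
# its response to any input is the convolution I * h. A direct discrete
# implementation; the 3-tap kernel values are invented.

def convolve(signal, h):
    out = [0.0] * (len(signal) + len(h) - 1)
    for x in range(len(out)):
        for tau in range(len(signal)):
            if 0 <= x - tau < len(h):
                out[x] += signal[tau] * h[x - tau]
    return out

h = [1.0, -0.5, 0.25]
# An impulse comes back as the impulse response itself:
assert convolve([1, 0, 0], h) == [1.0, -0.5, 0.25, 0.0, 0.0]
# Superposition: L(a + b) = L(a) + L(b).
a, b = [1, 0, 2], [0, 3, 1]
summed = convolve([p + q for p, q in zip(a, b)], h)
assert summed == [p + q for p, q in zip(convolve(a, h), convolve(b, h))]
```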

Unsharp mask: average over a neighborhood, and subtract that from the value at a point. Increases contrast.

    Blur: average over an interval.

    We want: sharp(blur) = blur(sharp) = identity.

    Sept. 19, MATLAB tutorial.

    Loops are very bad. Stick to matrix multiplication. The basic data type is a multi-dimensional array.

A = [1 2; 3 4]

Basic syntax for declaring a variable.

1:5

enumerates the numbers 1 through 5.

    .*

    pointwise multiplication

Don't nest functions too deeply; lookups are slow.

    Sept. 21

    Staggered input to limulus can give ripples where there are no lights.

    Is there an input that could give the same output?


What is the temperature of a wine cellar as a function of time? Sinusoidal with a 24-hour period. The temperature at a given depth is the surface oscillation convolved with a depth-dependent smoothing kernel; this is how Fourier series were developed!

F(t) = a_0 + \sum_n a_n \cos(n \omega t) + \sum_n b_n \sin(n \omega t)

a_n = (2/T) \int_0^T I(t) \cos(n \omega t) dt

b_n = (2/T) \int_0^T I(t) \sin(n \omega t) dt

    Fourier coefficients

A sinusoid is invariant, up to amplitude, under the linear lateral plexus (sinusoids are eigenfunctions of linear shift-invariant systems).

How good is a Fourier approximation? How many frequencies do we want? More frequencies mean smaller error, but the remaining error is at higher frequencies. Blurring can actually make convergence faster.
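The coefficient formulas above can be evaluated numerically; a sketch using a square wave as the test signal (the square wave and the sample count are illustrative choices, not from the notes).

```python
import math

# Numerical Fourier coefficients, following the notes' formulas
# a_n = (2/T) \int I(t) cos(n w t) dt, and likewise for b_n, using a
# midpoint-rule sum. The square-wave test signal is an illustrative
# choice, not from the notes.

def square(t):
    return 1.0 if t % 1.0 < 0.5 else -1.0

def fourier_coeffs(I, T, n, samples=4096):
    w = 2 * math.pi / T
    dt = T / samples
    ts = [(k + 0.5) * dt for k in range(samples)]
    a_n = (2 / T) * sum(I(t) * math.cos(n * w * t) for t in ts) * dt
    b_n = (2 / T) * sum(I(t) * math.sin(n * w * t) for t in ts) * dt
    return a_n, b_n

a1, b1 = fourier_coeffs(square, 1.0, 1)
assert abs(b1 - 4 / math.pi) < 1e-3   # square wave: b_n = 4/(pi n), odd n
assert abs(a1) < 1e-6                 # cosine terms vanish
```

The odd harmonics fall off like 1/n, which is why a discontinuous signal converges slowly and a blurred (smoothed) one converges faster.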

Complex exponentials e^{i\omega x} pass through a linear system unchanged up to a scale factor.

Take F(I * h), the Fourier transform of the convolution of two functions: it is the product F(I) F(h).

    Low pass filter: only low frequencies.

    High pass filter: only high frequencies.

Aliasing: the spectral copies (delta-function replicas of the spectrum) overlap in the frequency domain. Need the sampling to be tighter in time to avoid aliasing. The right rate to avoid aliasing is the Nyquist rate: how fast to sample so as to get no aliasing. Need to filter the sampled signal: multiply by a box in the frequency domain, whose transform is sinc(x) = sin(x)/x.

sinc(t) * (sampled f)

reconstructs the signal from its samples. Bandlimit the input!
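A small numerical illustration of aliasing: sampled at 10 Hz, a 7 Hz cosine is indistinguishable from a 3 Hz cosine, because 7 Hz is above the 5 Hz Nyquist frequency. The specific frequencies are made up.

```python
import math

# Aliasing demo: at fs = 10 Hz (Nyquist frequency 5 Hz), a 7 Hz cosine
# produces the same samples as a 3 Hz cosine, so the two frequencies
# cannot be told apart after sampling.

FS = 10.0  # sampling rate in Hz

def sample_cosine(freq, n_samples=20):
    return [math.cos(2 * math.pi * freq * n / FS) for n in range(n_samples)]

hi, lo = sample_cosine(7.0), sample_cosine(3.0)
assert all(abs(a - b) < 1e-9 for a, b in zip(hi, lo))
```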

Modulation Transfer Function: high frequencies are bandpassed.

    NOW we realize the visual system is not linear!

    Sept. 26

Limulus linearis:

1. Modeling of input-output behavior.
2. From the time or space domain: convolution theorem and superposition.
3. Frequency domain; Fourier series; convolution in the space domain corresponds to multiplication in the frequency domain.

(I * h)(x) -> \hat{I}(\omega) \hat{H}(\omega)

(I h)(x) -> (\hat{I} * \hat{H})(\omega)


    I = I_1 + I_2

    S(I) = S(I_1) + S(I_2).

    The beginnings of a linear algebra for images.

The mating behavior of a male limulus: looking for a female at high tide under a full moon. If he can find one, he'll move in that direction. How does he find her?

He's looking for a dark object against a light background, seen through the lateral inhibitory network: the convolution of h(x) against a tiny spike. As he gets closer, this will be a pair of peaks, the edges of the female.

    |Features * Image|_\theta

Template for seeking those features. A threshold on the output: value 1 if the peak is above the threshold, 0 otherwise. This threshold stage is nonlinear. Simplest possible decision mechanism. But none of the linear-systems material we've discussed thus far holds past it.

The wiring of the network is the template that the animal is looking for in the world.

    Simplest test: lay the template on top of the image and see if the values are in

    agreement or not.

Match(m, n) = \sum_{i, j} |I(i, j) - T(i - m, j - n)|^2

    If the match is close enough, between the image and the template, you decide

    to accept.

    Why squaring? Emphasizes bigger values. Why not a different power?

Various possible norms: l_1 (sum of absolute values), l_2 or SSD, any l_p norm, or l_\infty, the supremum norm.

    Normalized correlation:

\sum_{i, j} I(i, j) T(i - m, j - n) / (\sum_{i, j} I^2(i, j))^{1/2}

    This is the standard, oldest template-matching distance.
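A 1-D sketch of both distances (the signal and template values are invented): the SSD match score at each shift, and the normalized correlation from the formula above.

```python
import math

# 1-D template matching sketch: SSD match score at each shift, and the
# normalized correlation from the notes. Signal and template values are
# invented for illustration.

def ssd(signal, T):
    m = len(T)
    return [sum((signal[i + k] - T[k]) ** 2 for k in range(m))
            for i in range(len(signal) - m + 1)]

def normalized_correlation(signal, T):
    m = len(T)
    norm = math.sqrt(sum(v * v for v in signal))
    return [sum(signal[i + k] * T[k] for k in range(m)) / norm
            for i in range(len(signal) - m + 1)]

signal = [0, 1, 5, 9, 5, 1, 0]
T = [5, 9, 5]                         # the pattern we are looking for
scores = ssd(signal, T)
best = scores.index(min(scores))      # SSD is 0 where the match is exact
nc = normalized_correlation(signal, T)
assert best == 2 and min(scores) == 0
assert max(range(len(nc)), key=lambda i: nc[i]) == 2
```

For SSD you accept when the minimum is below a threshold; for correlation, when the peak is above one.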

It can recognize letters against noisy backgrounds. How do you want to define the threshold? It depends how noisy the background is. Too noisy and you can't detect at all.

Of course an image template is too simple: think of multiple fonts of the letter A. Facial recognition is even harder (we didn't even do it in class!). Graph structure on the face? Nothing really handles variation in age. Hard to handle variation in camera viewpoint, lighting, and facial expression.

    Sept 28, 2011

    Ion channel: in one configuration, allows ions to pass, in another, does not.

Open: H. Closed: T. Strings: TTHTHTHHTHTHHHTHTHTHHHHHTTHHTTT


    Binomial probability:

(n choose k) p^k (1 - p)^(n - k)

    sigma: 1 when the channel is open, 0 when the channel is closed.

E(\sigma): the energy of configuration \sigma.

Z = \sum_\sigma e^{-\beta E(\sigma)} = e^{-\beta E_open} + e^{-\beta E_closed}

Boltzmann probability:

P(open) = e^{-\beta E_open} / (e^{-\beta E_open} + e^{-\beta E_closed})
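The two-state probability can be computed directly; the energies and beta here are made up.

```python
import math

# Boltzmann probability that the two-state channel is open, from the
# partition function Z above. Energies and beta are made-up values.

def p_open(E_open, E_closed, beta=1.0):
    Z = math.exp(-beta * E_open) + math.exp(-beta * E_closed)
    return math.exp(-beta * E_open) / Z

assert abs(p_open(0.0, 0.0) - 0.5) < 1e-12   # equal energies: 50/50
assert p_open(-1.0, 0.0) > 0.5               # the lower-energy state wins
```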

    Bayesian probability:

    P(x = a | y = b) = P(x = a, y = b)/P(y = b)

Chain rule:

P(x, y | H) = P(x | y, H) P(y | H) = P(y | x, H) P(x | H)

    Bayes Rule:

    P(y | x) = P(x | y) P(y)/P(x)

    Binary template matching

    The template is a matrix of zeros and ones.

P(I | Scene) = \prod_{x, y} p^{1 - |I(x, y) - T(x, y)|} (1 - p)^{|I(x, y) - T(x, y)|}

Where they agree, a factor of p; where they disagree, a factor of (1 - p).
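A minimal sketch of this likelihood for binary images flattened to lists; p = 0.9 is an arbitrary choice.

```python
# Binary template likelihood from the notes: each pixel matches the
# template independently with probability p. Images are flattened to
# lists of 0s and 1s; p = 0.9 is an arbitrary choice.

def likelihood(image, template, p=0.9):
    L = 1.0
    for i_px, t_px in zip(image, template):
        L *= p if i_px == t_px else (1 - p)  # agree: p, disagree: 1 - p
    return L

template = [1, 1, 0, 0, 1]
perfect = likelihood([1, 1, 0, 0, 1], template)   # p**5
one_off = likelihood([1, 0, 0, 0, 1], template)   # p**4 * (1 - p)
assert perfect > one_off
```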

    Oct 3, 2011

    Midterm: two weeks from today.

Paper: "What the frog's eye tells the frog's brain" by Jerry Lettvin et al.

Show a black disk to a frog: if you move the disc, a certain group of neurons fires, and if you stop moving, they stop firing. Bug detector. Testing for: dark contrast, size around 1 degree of visual angle, moving with a velocity in some range, etc. Logical AND between all these conditions. Sometimes it would continue to fire when the black spot is occluded, sometimes not.

What do you do if you have two moving spots in your visual world? Where do you jump? Not the mean of the two spots! But it's apparently rare enough that the frog's brain doesn't know that.

    You have a stack of images I(x, y, t), sampled in position and in time.

dI/dt = (dI/dx)(dx/dt): chain rule.


If we think of I(x, y) as frames in a movie, there's some Δt between frames.

Aliasing issue: attached to every point in an image is a vector, describing how it changes in time.

For the fly: Werner Reichardt made a model. Multiple ommatidia, spaced dx apart. Take a product (correlation) between two signals. If the combined output is a significant number, then you know a velocity of a moving point has been detected, at dx/dt.
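A Reichardt-style correlator can be sketched as follows; the one-step delay and the opponent subtraction of the two mirror-image half-detectors are standard modeling choices, assumed here rather than taken from the notes.

```python
# Reichardt-style motion detector sketch: two sensors one sample apart;
# each sensor's delayed signal is multiplied against its neighbor's
# current signal, and the two mirror-image products are subtracted to
# give a signed direction estimate. The one-step delay is an assumption.

def reichardt(left, right, delay=1):
    out = 0.0
    for t in range(delay, len(left)):
        out += left[t - delay] * right[t] - right[t - delay] * left[t]
    return out

# A pulse moving left-to-right reaches the right sensor one step later.
left = [0, 1, 0, 0, 0]
right = [0, 0, 1, 0, 0]
assert reichardt(left, right) > 0     # rightward motion: positive
assert reichardt(right, left) < 0     # leftward motion: negative
```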

To do this in 2-d, track in two dimensions. But we have problems. The view through one ommatidium is one vector; at the following time it could be a new vector. What's the motion? There are more unknowns than there are constraints. This is called the aperture problem. The best we can hope for is a pseudo-solution.

Two dots, both move to the right. Are they drifting together, or is one of them popping to the right of the other? Depends whether the time is short or long.

A moving object that lives for less than 55 milliseconds: what you see is a streak. If it lives longer, you see a dot, it moves, and then disappears.

Apparent motion: your visual system is integrating over periods of 60 to 100 milliseconds. Why don't we see smear?

The blowfly, Calliphora vicina, has big compound eyes. Male flies are focused on female flies and chase them around. Some sensors are an order of magnitude faster, and those are used to keep their eyes on the female.

Like the limulus: ommatidia, lateral complex just below, medulla, and then lobula plate. H1 does the velocity calculation; it integrates over the whole visual field of the compound eye on one side.

Experimental setup: glue a fly down, put electrodes in the visual system, and show a computer screen of a pseudo-random noise pattern. How can you reconstruct a stimulus from the spike train?

Reverse correlation: record the spikes, and note what the pattern of the world was for one second back in time.

Also, you want to know how you are flying. Optical flow patterns. The axonal arborization is highly structured; there's a pattern to the neurons. Provides direct measurements of pitch and yaw.

    Oct 5, 2011


    The Primate Brain

It was once thought (in Da Vinci's day) that different functions of the brain lived in different ventricles.

19th century: mapping the brain. Sherrington. No pain receptors in the brain. Two hemispheres; the back of the brain is attached to the optic nerve.

Light, detection, form, color, ocular movement.

Stereo vision: how did it evolve? There's something about monocular processing that could have an advantage for the creature.

    Occipital lobe in the back of the cerebrum.

Anatomically interesting substructures are determined by staining for myelin. The corpus callosum connects the two halves of the brain.

The brain is layered, dark and light. V3: a layer of cortex. MT: motion area. (Stereo as well as motion; the name is a misnomer.)

    Oct 12, 2011

    Computational religion.

    Unitarian: everyone talks to everyone else.

Catholic: there's an organizational structure: people answer to priests, who answer to bishops, then archbishops, cardinals, and the Pope. It's a tree hierarchy. Decisions are made from the top down.

    Tree structure can also be used for search; categories divided into further

    categories.

Retinal ganglion cells have a receptive field: an area where a spot of light makes them fire most.

Simple cell with odd symmetry: the response depends on orientation!

    Retinal ganglion cells: on center, off center receptive field. Lateral geniculate

    nucleus: again, on center, off center receptive field.

    In the back, first visual area, V1, if you stain for myelin or metabolic enzymes

    or something, you get a layered organization.

Six layers: superficial, intermediate/input, and deep. Layer I, Layer II/III, Layer IV, Layer V, and Layer VI. Layer I has no myelin, so it looks light in pictures.

Parvo and Magno layers in the lateral geniculate nucleus go to X and Y layers in the input layer, kept in positional registration consistent with the x/y coordinates of the visual field. The axons of these cells go down to layer V, which goes down to layer VI, which goes down to midbrain structures including the lateral geniculate nucleus, etc. There's a circuit: V1 deep layers, lateral geniculate, back into V1, then down to the deep layers.


The other path: a dendrite goes up to layer I, forms synapses with V2, and then back to layer I. Superficial layers.

Orientationally selective cells: inhibited in some regions and excited in others. In simple cells it is easy to separate the excitatory center from the inhibitory surround.

1. Stripe of excitation in the center.
2. Excitation everywhere but the center.
3. Excitation on top, inhibition on the bottom.

Profile of the receptive field: like a sinusoidal grating.

    Complex cells: subdomains are difficult to separate from one another.

    Perhaps non-linear in input. Multiple peaks.

    Hypercomplex cells: cells that respond to a corner.

Simple cells might be edge detectors, complex cells generalize the position of edges over a range of locations, and hypercomplex cells detect corners.

Endstopping: if the length gets too long, it stops the response. This functions as a corner detector. Orientation column: at every position, there's a group of cells at different sizes and all orientations.

It doesn't look like the tuning of orientation depends on contrast: it's contrast independent. Does the Hubel-Wiesel model really explain this? How do you look at just the lateral geniculate input without seeing any influence from the cortex? Cool the cortex down to about 10 degrees. David Ferster did this.

Extracellular recording: put an electrode right on the outside. Ferster went inside and looked at intracellular potential: inhibitory and excitatory post-synaptic potentials. So you can tell which orientation each cell is tuned to; orthogonal to that is basically a flat response. Preferred orientation and null orientation. But if you cool the cortex down, the potentials look about the same; the cortex isn't changing much.

Tuning is broad in layer IV, gets much tighter in II/III. And it also gets contrast-invariant. This is a basic fact that requires explanation.

There are 4 excitatory cells for each inhibitory cell. Excitatory connections: long distance. Inhibitory connections: short distance. The sum of a small excitatory Gaussian and a broad inhibitory Gaussian looks a lot like the Laplacian of a Gaussian. DOG = Difference of Gaussians models.
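A 1-D sketch of the DOG profile: an excitatory center with inhibitory flanks. The sigma and the 1.6 ratio are conventional illustrative choices, not from the notes.

```python
import math

# The notes' DOG (Difference of Gaussians): a narrow excitatory Gaussian
# minus a broad inhibitory one. The sigma and the 1.6 ratio are
# conventional illustrative choices, not from the notes.

def gaussian(x, s):
    return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def dog(x, s=1.0, ratio=1.6):
    return gaussian(x, s) - gaussian(x, ratio * s)

assert dog(0.0) > 0                    # excitatory center
assert dog(2.0) < 0 and dog(-2.0) < 0  # inhibitory surround
```

Plotting dog(x) against the Laplacian of a Gaussian shows how closely the two profiles agree, which is why the DOG serves as a cheap stand-in.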


    Turing:

dU/dt = k_1 U + k_2 V + D_U \Delta U

Diffusion equation.

dV/dt = k_3 U + k_4 V + D_V \Delta V

U is making itself in proportion to its concentration, it's inhibited by the presence of the other agent, and it's diffusing out to nearby cells. The inhibitor agent is making itself and diffusing out.

    Equal: U = V = 0: nothing going on.

    Then: excitatory afferent from the lateral geniculate.

    Bump in the concentration of U.

    U diffuses outward.

Models of this flavor explain orientation tuning and how it gets tighter as it goes up into superficial layers. But what about contrast?

    Think of a ball living in layer II/III, living in position/orientation space.

    Connected between cells in nearby columns. Long-range horizontal

    connections.

Contrast-invariant orientation tuning exists on a smaller scale when you get rid of cortical connections; the cortex just amplifies the signal you get from the lateral geniculate. Look at an LGN cell. When the on-center cell sees bright it sees activity, and when the off-center cell sees dark it sees activity. But if you add up the contrast, it only looks at the positive part, and the rest is zero. The average increases as the contrast increases.

Suppose you're a simple cell that wants to see vertical orientation, and there's some horizontal grating. The LGN cells are going to be firing, but the order they'll be firing in won't be the right order that the cells in this receptive field want to see. The wiring of the layer IV cell cares about the orientation.

When two cells are out of phase (antiphase) with respect to one another, one wants to see black where the other wants to see white, and vice versa. How do you wire up in-phase and antiphase cells? The counterphase inhibitory cell inhibits the excitatory cell. The LGN stimulates all the cells. When the stimulus is at the preferred orientation, the excitatory and the inhibitory cell in the preferred direction are firing a lot, but the out-of-phase cells aren't. The excitatory cell is keeping itself active. The inhibitory cell is inhibiting the excitatory counterphase cell, which excites the inhibitory counterphase cell, which inhibits the excitatory in-phase cell. But that inhibition isn't going to matter much. The excitatory cell is going to take over and win.


But take an out-of-phase stimulus. In-phase and out-of-phase cells all get about the same input. Everybody inhibits everybody else; there's no activity.

    This is called push-pull.

    Think of the cortical cells as being arranged in these phase/antiphase

    relationships and balanced between excitation and inhibition.

Are simple cells edge detectors? Are they participating in a hierarchy? This field doesn't look at this. But they're still important questions.

    Oct 19, 2011

    The ancient days of computer vision

How do you segment an object from its background in an image? Analysis of the Pap smear. The threshold selection problem: cells are distributions of high intensity in pixel value. If they're Gaussian distributions of equal variance, you segment them right in the middle. Segmentation: partition the image into parts, assign each pixel to a part.
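A sketch of that threshold rule, with made-up class parameters: for two equal-variance Gaussians with equal priors, the minimum-error threshold is the midpoint of the means.

```python
import math

# Threshold selection sketch: if foreground and background intensities
# are Gaussians of equal variance (and equal priors), the minimum-error
# threshold sits at the midpoint of the two means. The class parameters
# are made up.

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_bg, mu_fg, sigma = 50.0, 150.0, 20.0
threshold = (mu_bg + mu_fg) / 2       # 100.0

# At the midpoint the two class likelihoods are equal.
assert abs(gaussian_pdf(threshold, mu_bg, sigma)
           - gaussian_pdf(threshold, mu_fg, sigma)) < 1e-15

def segment(pixel):
    return "foreground" if pixel > threshold else "background"

assert segment(160) == "foreground" and segment(40) == "background"
```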

GM's bin-of-parts problem. Photo of a part: what part is it?

    Minsky thought we needed to solve this as a computer vision problem; the

    robots should guess which part is which.

Where do you put the threshold for the histogram? Picking the first thing doesn't necessarily pick out the right part.

It's not so clear what to do once you have the segmentation. And you're using almost no structure about the intensities; it's just a first-order statistic.

    Build a derivative operator (discrete approximation) and use high

    derivatives as our feature. Add a little noise. Problems.

D_x * G * I(x): derivative, Gaussian, image.

A regularized derivative. This picks out the edges pretty well.

But the numbers in the Sobel operator are sort of made up.
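A 1-D sketch of the regularized derivative D_x * G * I; the sigma and kernel radius are illustrative choices.

```python
import math

# Regularized derivative D_x * G * I(x): correlate the image row with a
# sampled derivative-of-Gaussian kernel. Sigma and the kernel radius are
# illustrative choices.

def deriv_gaussian_kernel(sigma=1.0, radius=3):
    return [-x / sigma ** 2 * math.exp(-x * x / (2 * sigma ** 2))
            for x in range(-radius, radius + 1)]

def filter_same(signal, kernel):
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

row = [0, 0, 0, 0, 10, 10, 10, 10]   # a step edge
response = filter_same(row, deriv_gaussian_kernel())
edge = max(range(len(response)), key=lambda i: abs(response[i]))
assert edge in (3, 4)                # strongest response at the step
```

Because the derivative is taken of a smoothed signal, small additive noise no longer dominates the output the way it does for a bare finite difference.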

    The Bar Mitzvah of computer vision.

    Take another derivative! Second derivative in 2d or Laplacian.

Larry Roberts: the guy who wrote the first computer vision PhD thesis. Poor man's approximation to the gradient, a pair of 2x2 diagonal-difference masks:

+1  0        0 +1
 0 -1       -1  0

called the Roberts cross.

Larry Roberts, by the way, invented packet switching, which became the basis of the Arpanet.

    David Marr came to MIT; Minsky wanted him to solve vision. First bottom-up

    vision problem.


    The more smoothing you put in, the blurrier the edge becomes.

    Marr says look at the Laplacian of the Gaussian and see where it crosses zero.

    Marr-Hildreth operator.
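A 1-D sketch of the Marr-Hildreth idea: filter with a (1-D) Laplacian of Gaussian and mark zero crossings of the output. Sigma and sizes are illustrative choices.

```python
import math

# Marr-Hildreth in 1-D: filter with the second derivative of a Gaussian
# (the 1-D Laplacian of Gaussian) and mark where the output crosses
# zero. Sigma and sizes are illustrative choices.

def log_kernel(sigma=1.0, radius=4):
    return [(x * x / sigma ** 4 - 1 / sigma ** 2)
            * math.exp(-x * x / (2 * sigma ** 2))
            for x in range(-radius, radius + 1)]

def filter_same(signal, kernel):
    r = len(kernel) // 2
    return [sum(w * signal[i + k - r]
                for k, w in enumerate(kernel)
                if 0 <= i + k - r < len(signal))
            for i in range(len(signal))]

row = [0.0] * 8 + [10.0] * 8         # a step edge at index 8
response = filter_same(row, log_kernel())
crossings = [i for i in range(1, len(response))
             if response[i - 1] * response[i] < 0]
assert 8 in crossings                # zero crossing right at the edge
```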

Next giant step: the Canny operator.

The Marr-Hildreth operator mucks things up in places where there is a lot of structure; it finds too many blobs.

Put a coordinate system on the boundary: parallel and perpendicular.

Curves can be edges, bright lines, or dark lines. Linear operators can't tell which is which.

    NOTE:

    dQ/dt = I = C dV/dt

    Total current is capacitance times the derivative of the voltage.

    Oct 24, 2011

    Logical/Linear Operator

    Think of a matrix of receptors. For a line to be a highlight, it should be

    bright-dark-bright or dark-bright-dark. A boundary is dark-bright or bright-

    dark.

In order to build a scheme like this, you can't convolve a linear operator with the image; it's non-linear.

Look at pairs of receptors: are they positive-positive? At all logical positions?

A hierarchical tree of convolutions and logical combinations of them.

    Hammond/Mackay began finding nonlinearities in responses of cells. This is

    evidence that the hierarchical view is accurate.

    Other ways to build a tree: Fourier expansion.

The Mona Lisa's smile disappears and reappears depending on where you look. The eye is foveated. If you look at the low-frequency content you see a big smile; if you look at the high-frequency content, you don't see a smile. So it depends where you look.

    Face recognition/Machine Learning

    Ax = \lambda x

\Lambda = S^{-1} A S: a similarity transformation, where S is a matrix of eigenvectors.

\Sigma: covariance matrix. \Sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)]


    Eigenvectors of this covariance matrix can tell the components of greatest

    variance.

    Natural expansion of eigenvectors by size of eigenvalue; this helps us with

    approximation.
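A sketch of this idea on made-up 2-D data: build the covariance matrix and find its top eigenvector (the direction of greatest variance) by power iteration.

```python
import math

# Principal components sketch: covariance matrix of made-up 2-D points
# stretched along the diagonal, and its top eigenvector (the direction
# of greatest variance) found by power iteration.

points = [(x, x + 0.1 * (-1) ** i) for i, x in enumerate(range(-5, 6))]

n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
cxx = sum((x - mx) ** 2 for x, _ in points) / n
cyy = sum((y - my) ** 2 for _, y in points) / n
cxy = sum((x - mx) * (y - my) for x, y in points) / n
C = [[cxx, cxy], [cxy, cyy]]

v = [1.0, 0.0]
for _ in range(50):                   # power iteration
    w = [C[0][0] * v[0] + C[0][1] * v[1],
         C[1][0] * v[0] + C[1][1] * v[1]]
    norm = math.hypot(w[0], w[1])
    v = [w[0] / norm, w[1] / norm]

# The direction of greatest variance lies along y = x.
assert abs(abs(v[0]) - abs(v[1])) < 0.05
```

Keeping only the eigenvectors with the largest eigenvalues gives the low-dimensional approximation the notes mention (for faces, the eigenface expansion).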