THE BRAIN AS A STATISTICAL INFORMATION PROCESSOR
And you can too!
My History and Ours
(Timeline: 1972, 1992, 2011; SBS)
The Brain as a Statistical IP
Introduction · Evidence for Statistics · Bayes Law · Informative Priors · Joint Models · Inference · Conclusion
Evidence for Statistics
Two examples that seem to indicate that the brain is indeed processing statistical information.
Statistics for Word Segmentation
Saffran, Aslin & Newport (1996), “Statistical Learning by 8-Month-Old Infants”
The infants listen to strings of nonsense words with no auditory cues to word boundaries.
E.g., “bidakupa…”, where “bidaku” is the first word.
They learn to distinguish words from other combinations that occur (with less frequency) over word boundaries.
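A minimal sketch (not from the talk) of the transitional-probability statistic that Saffran et al. argue infants track; the nonsense lexicon and stream length are invented for illustration:

    import random
    from collections import Counter

    # Hypothetical nonsense-word lexicon, in the spirit of the stimuli
    # (these are not the actual experimental materials).
    words = [["bi", "da", "ku"], ["pa", "do", "ti"], ["go", "la", "bu"]]
    stream = [syl for _ in range(200) for syl in random.choice(words)]

    # Transitional probability P(next syllable | current syllable).
    pair_counts = Counter(zip(stream, stream[1:]))
    syl_counts = Counter(stream[:-1])
    def trans_prob(a, b):
        return pair_counts[(a, b)] / syl_counts[a]

    # Within-word transitions ("bi" -> "da") come out near 1.0, while
    # transitions that straddle a word boundary ("ku" -> "pa") are much
    # lower, so this statistic alone suffices to find the word boundaries.
    print(trans_prob("bi", "da"), trans_prob("ku", "pa"))

The infants, of course, are presumed to track something like these conditional frequencies implicitly, not to compute them symbolically.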
They Pay More Attention to Non-Words
(Experimental setup: speaker, light, child)
Statistics in Low-level Vision
Based on Rosenholtz et al. (2011)
Statistics in Low-level Vision
Based on Rosenholtz et al. (2011)
(Crowding demo: target letter among flankers, “A N O B E L”)
Are summary statistics a good choice of representation?
A much better idea than spatial subsampling
Original patch (~1000 pixels)
Are summary statistics a good choice of representation?
A rich set of statistics can capture a lot of useful information
Patch synthesized to match ~1000 statistical parameters (Portilla & Simoncelli, 2000)
Discrimination based on P&S stats predicts crowded letter recognition
Balas, Nakano, & Rosenholtz, JoV, 2009
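To give the flavor of a summary-statistic representation, here is a minimal sketch; it computes only a handful of toy statistics (moments and neighbour correlations), not the ~1000 Portilla & Simoncelli texture parameters:

    import numpy as np

    def patch_summary_stats(patch):
        # Toy stand-in for a summary-statistic representation: a few
        # marginal moments plus horizontal and vertical neighbour
        # correlations of the pixel values.
        p = patch.astype(float)
        stats = [p.mean(), p.std(), ((p - p.mean()) ** 3).mean()]
        stats.append(np.corrcoef(p[:, :-1].ravel(), p[:, 1:].ravel())[0, 1])
        stats.append(np.corrcoef(p[:-1, :].ravel(), p[1:, :].ravel())[0, 1])
        return np.array(stats)

    # Two patches whose summary statistics agree are, on this account,
    # indistinguishable in the visual periphery even if they differ pixel-wise.
    rng = np.random.default_rng(0)
    print(patch_summary_stats(rng.random((32, 32))))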
Bayes Law and Cognitive Science
To my mind, at least, it packs a lot of information
Bayes Law and Cognitive Science
P(M|E) = P(M) P(E|M) / P(E)
M = Learned model of the world
E = Learner’s environment (sensory input)
Bayes Law
P(M|E) = P(M) P(E|M) / P(E)
It divides up responsibility correctly.
It requires a generative model (big, joint).
It (obliquely) suggests that as far as learning goes we ignore the programs that use the model.
But which M?
Bayes Law Does not Pick M
Don’t pick M. Integrate over all of them.
Pick the M that maximizes P(M)P(E|M).
Average over samples of M drawn from the posterior (Gibbs sampling).
P(E) = Σ_M P(M) P(E|M)
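A minimal numerical sketch over a made-up three-element model space, illustrating the marginal P(E) and the MAP choice (the prior and likelihood numbers are invented):

    import numpy as np

    prior = np.array([0.5, 0.3, 0.2])          # P(M) for three hypothetical models
    likelihood = np.array([0.01, 0.10, 0.04])  # P(E|M) for some fixed evidence E

    joint = prior * likelihood                 # P(M) P(E|M)
    p_evidence = joint.sum()                   # P(E) = Σ_M P(M) P(E|M): integrate over all M
    posterior = joint / p_evidence             # P(M|E), by Bayes Law

    map_model = int(np.argmax(joint))          # the M that maximizes P(M) P(E|M)
    print(posterior, map_model)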
My Personal Opinion
Don’t sweat it.
Informative Priors
Three examples where they are critical.
Parsing Visual Scenes (Sudderth, Jordan)
(Example scene parses, with region labels: trees, skyscraper, sky; temple, bell, dome, buildings, sky)
Spatially Dependent Pitman-Yor
• Cut random surfaces (samples from a GP) with thresholds (as in level-set methods)
• Assign each pixel to the first surface which exceeds its threshold (as in layered models)
Duan, Guindani, & Gelfand, Generalized Spatial DP, 2007
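A minimal sketch of that layered construction; blurred noise stands in for GP samples, and the thresholds and sizes are arbitrary illustrative choices rather than anything from the actual model:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    rng = np.random.default_rng(0)
    H, W, K = 64, 64, 5

    # Smooth random surfaces (blurred white noise as a cheap GP stand-in).
    surfaces = np.stack([gaussian_filter(rng.standard_normal((H, W)), sigma=8)
                         for _ in range(K)])
    thresholds = rng.normal(0.0, 0.02, size=K)

    # Assign each pixel to the first surface exceeding its threshold;
    # pixels that never exceed any threshold fall through to the last layer.
    labels = np.full((H, W), K - 1)
    for k in reversed(range(K - 1)):
        labels[surfaces[k] > thresholds[k]] = k

    # The result is a spatially coherent partition of the image into regions.
    print(np.bincount(labels.ravel(), minlength=K))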
Samples from Spatial Prior
Comparison: Potts Markov Random Field
Prior for Word Segmentation
Based on the work of Goldwater et al.
Separate one “word” from the next in child-directed speech.
E.g., yuwanttusiD6bUk = “You want to see the book”
Bag of Words
Generative Story
For each utterance:
  repeat: pick a word w (or STOP) with probability P(w)
  if w = STOP, break
If we pick M to maximize P(E|M), the model memorizes the data, i.e., it creates one “word” which is the concatenation of all the words in the sentence.
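A minimal runnable version of that generative story; the lexicon and probabilities are invented, and the closing comment notes the memorization problem the prior is meant to fix:

    import random

    # Hypothetical unigram lexicon; a real model learns these from data.
    lexicon = {"you": 0.3, "want": 0.2, "to": 0.2, "see": 0.15, "the": 0.1, "book": 0.05}
    p_stop = 0.25

    def generate_utterance():
        words = []
        while True:
            if random.random() < p_stop:       # pick STOP with probability p_stop
                return "".join(words)          # emit the unsegmented string
            w = random.choices(list(lexicon), weights=list(lexicon.values()))[0]
            words.append(w)

    # With no prior over lexicons, likelihood is maximized by a degenerate
    # lexicon containing each whole utterance as one "word"; hence the
    # Dirichlet prior on the next slide.
    print(generate_utterance())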
Results Using a Dirichlet Prior
Precision: 61.6 Recall: 47.6
Example: youwant to see thebook
Part-of-speech Induction
Primarily based on Clark (2003)
Given a sequence of words, deduce their parts of speech (e.g., DT, NN, etc.)
Generative story: for each word position i in the text:
  1) propose a part-of-speech t_i with probability p(t_i | t_{i-1})
  2) propose a word w_i with probability p(w_i | t_i)
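A minimal sketch of that HMM-style generative story with a tiny made-up tag set and tables (not Clark's actual model, which also has to induce the tags themselves):

    import random

    trans = {"START": {"DT": 0.7, "NN": 0.3},      # p(t_i | t_{i-1}), invented numbers
             "DT":    {"NN": 0.9, "DT": 0.1},
             "NN":    {"VB": 0.6, "NN": 0.4},
             "VB":    {"DT": 0.8, "NN": 0.2}}
    emit = {"DT": {"the": 0.7, "a": 0.3},           # p(w_i | t_i)
            "NN": {"dog": 0.5, "bone": 0.5},
            "VB": {"likes": 0.6, "eats": 0.4}}

    def sample(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def generate(n):
        tag, out = "START", []
        for _ in range(n):
            tag = sample(trans[tag])        # 1) propose a tag given the previous tag
            out.append(sample(emit[tag]))   # 2) propose a word given that tag
        return out

    print(generate(5))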
Sparse Tag Distributions
We could put a Dirichlet prior on P(w|t).
But what we really want is a sparse P(t|w): almost all words (by type) have only one part-of-speech.
We do best by only allowing this, e.g., “can” is only a modal verb (we hope!).
Putting a sparse prior on P(word-type|t) also helps.
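For intuition, a two-line sketch of what “sparse” means here: a symmetric Dirichlet with concentration well below 1 puts almost all of its mass on a single outcome (the numbers are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    print(rng.dirichlet([0.01] * 10))   # sparse: essentially one non-zero entry
    print(rng.dirichlet([1.0] * 10))    # flat concentration: mass spread across entries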
Joint Generative Modeling
Two examples that show the strengths of modeling many phenomena jointly.
Joint POS Tagging and Morphology
Clark's POS tagger also includes something sort of like a morphology model.
It assumes POS tags are correlated with spelling.
True morphology would recognize that “ride”, “riding”, and “rides” share a root.
I do not know of any true joint tagging-morphology model.
Joint Reference and (Named) Entity Recognition
Based on Haghighi & Klein 2010
Weiner said the problems were all Facebook’s fault. They should never have given him an account.
Type 1 (person): Obama, Weiner, father
Type 2 (organization): IBM, Facebook, company
Inference
Otherwise known as hardware.
It is not EM
More generally it is not any mechanism that requires tracking all expectations.
Consider the word-boundary problem: between every two phonemes there may or may not be a boundary, so an utterance of n phonemes has 2^(n-1) possible segmentations.
abcde a|bcde ab|cde abc|de abcd|e a|b|cde …
Gibbs Sampling
Start out with random guesses.
Do (roughly) forever:
  Pick a random point.
  Compute p(split) and p(join).
  Pick r, 0 < r < 1:
    if p(split) / (p(split) + p(join)) > r, split;
    else join.
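A runnable sketch of that sampler. The lexical model here is a deliberately dumb stand-in (uniform characters with a geometric length penalty), not the Dirichlet-process lexicon of Goldwater et al., so it cannot actually learn a good segmentation; it only shows the split/join mechanics:

    import math, random

    V, q = 26, 0.3   # alphabet size and word-ending probability (made up)

    def word_logprob(w):
        # P(word) under the toy model: uniform characters, geometric length.
        return len(w) * math.log(1.0 / V) + (len(w) - 1) * math.log(1 - q) + math.log(q)

    def words(phonemes, boundaries):
        out, start = [], 0
        for i, b in enumerate(boundaries, start=1):
            if b:
                out.append(phonemes[start:i])
                start = i
        out.append(phonemes[start:])
        return out

    def seg_logprob(phonemes, boundaries):
        return sum(word_logprob(w) for w in words(phonemes, boundaries))

    def gibbs_sweep(phonemes, boundaries):
        for i in range(len(boundaries)):
            lp = {}
            for v in (1, 0):                   # 1 = split here, 0 = join
                boundaries[i] = v
                lp[v] = seg_logprob(phonemes, boundaries)
            p_split = 1.0 / (1.0 + math.exp(lp[0] - lp[1]))
            boundaries[i] = 1 if random.random() < p_split else 0

    utterance = "yuwanttusiD6bUk"
    bounds = [random.randint(0, 1) for _ in range(len(utterance) - 1)]
    for _ in range(50):
        gibbs_sweep(utterance, bounds)
    print(words(utterance, bounds))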
Gibbs has Very Nice Properties
It is not Gibbs Either
First, the nice properties only hold for “exchangeable” distributions. It seems likely that most of the ones we care about are not (e.g., Haghighi & Klein)
But critically it assumes we have all the training data at once and go over it many times.
It is Particle Filtering
Or something like it. At the level of detail here, just think “beam search.”
Parsing and CKY
(Diagram: CKY parse of “Dogs like bones”: S, NP (NNS Dogs), VP (VBS like, NP (NNS bones)); an “Information Barrier” is marked.)
It is Particle Filtering
Or something like it. At the level of detail here, just think “beam search.”
(ROOT
(ROOT (S (NP (NNS Dogs)
(ROOT (NP (NNS Dogs)
(ROOT (S (NP (NNS Dogs)) (VP (VBS eat)
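A minimal sketch of beam search over partial analyses, which is the flavor of particle filtering intended here. The expansion function and scores are invented (a toy tagger rather than a parser), so only the keep-the-k-best-states logic corresponds to the slides:

    import heapq
    import math

    def beam_search(tokens, expand, beam_size=3):
        beam = [(0.0, ())]                       # (log-prob, partial analysis)
        for tok in tokens:
            candidates = []
            for logp, state in beam:
                for new_state, step_logp in expand(state, tok):
                    candidates.append((logp + step_logp, new_state))
            beam = heapq.nlargest(beam_size, candidates)   # prune back to the beam
        return beam

    # Toy expansion: tag each word as NN or VB with made-up probabilities;
    # a real incremental parser would instead extend partial parses like
    # "(ROOT (S (NP (NNS Dogs)" with grammar rules consistent with the word.
    def expand(state, word):
        probs = {"NN": 0.7, "VB": 0.3} if word != "eat" else {"NN": 0.1, "VB": 0.9}
        return [(state + ((word, tag),), math.log(p)) for tag, p in probs.items()]

    print(beam_search(["Dogs", "eat", "bones"], expand))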
Conclusion
The brain operates by manipulating probabilities.
World-model induction is governed by Bayes Law
This implies we have a large joint generative model
It seems overwhelmingly likely that we have a very informative prior.
Something like particle filtering is the inference/use mechanism.
THANK YOU