LING 696B: Maximum-Entropy and Random Fields
Review: two worlds
Statistical models and OT seem to ask different questions about learning UG:
- OT: what is possible/impossible? Hard-coded generalizations; combinatorial optimization (sorting)
- Statistical: among the things that are possible, what is likely/unlikely? Soft-coded generalizations; numerical optimization
- Marriage of the two?

Review: two worlds
- OT: relate possible/impossible patterns in different languages through constraint reranking
- Stochastic OT: consider a distribution over all possible grammars to generate variation
- Today: model the frequency of input/output pairs (among the possible) directly, using a powerful model

Maximum entropy and OT
Imaginary data:
- Stochastic OT: let *[+voice] >> Ident(voice) hold 50% of the time and Ident(voice) >> *[+voice] the other 50%
- Maximum-Entropy (using positive weights; a code sketch follows the tableau):
  p([bab]|/bap/) = (1/Z) exp{-(2*w1)}
  p([pap]|/bap/) = (1/Z) exp{-(w2)}

| /bap/ | P(.) | *[+voice] | Ident(voice) |
|-------|------|-----------|--------------|
| [bab] | .5   | 2         |              |
| [pap] | .5   |           | 1            |

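A minimal sketch of this computation in Python; the violation vectors come from the tableau above, while the weight values are assumed for illustration (any w1, w2 with 2*w1 = w2 reproduces the 50/50 pattern):

```python
import math

violations = {"bab": (2, 0), "pap": (0, 1)}  # ( *[+voice], Ident(voice) )
w1, w2 = 1.0, 2.0  # assumed positive constraint weights

# Unnormalized scores exp{-(weighted violation sum)}, then divide by Z.
scores = {c: math.exp(-(v1 * w1 + v2 * w2)) for c, (v1, v2) in violations.items()}
Z = sum(scores.values())
probs = {c: s / Z for c, s in scores.items()}
print(probs)  # {'bab': 0.5, 'pap': 0.5} since 2*w1 == w2
```
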
Maximum entropy
Why have Z?
- The result needs to be a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1
- So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called a normalization constant
- Z can quickly become difficult to compute when the number of candidates is large
- A very similar proposal appears in Smolensky (1986)

How to get w1, w2?
- Learned from data, by calculating gradients (sketched below)
- Needed: frequency counts and violation vectors (the same as for stochastic OT)

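A hedged sketch of this gradient-based learning in Python, using the two-candidate tableau above. The 50/50 frequency counts match the imaginary data; the initial weights, learning rate, and iteration count are assumptions for illustration. The gradient of the average negative log-likelihood with respect to each weight works out to the observed violation rate minus the violation rate expected under the model.

```python
import math

violations = {"bab": (2.0, 0.0), "pap": (0.0, 1.0)}  # ( *[+voice], Ident(voice) )
counts = {"bab": 50, "pap": 50}  # observed frequencies (the imaginary 50/50 data)
w = [0.5, 0.5]                   # initial positive weights (assumed)
lr = 0.1                         # learning rate (assumed)

def model_probs(w):
    scores = {c: math.exp(-sum(f * wk for f, wk in zip(fs, w)))
              for c, fs in violations.items()}
    Z = sum(scores.values())
    return {c: s / Z for c, s in scores.items()}

n = sum(counts.values())
for step in range(500):
    p = model_probs(w)
    for k in range(len(w)):
        observed = sum(counts[c] * violations[c][k] for c in counts) / n
        expected = sum(p[c] * violations[c][k] for c in violations)
        # Gradient of the average negative log-likelihood w.r.t. w[k]:
        w[k] -= lr * (observed - expected)

print(w, model_probs(w))  # converges toward p(bab) = p(pap) = 0.5, i.e. 2*w1 == w2
```
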
Maximum entropy
Why exp{.}?
- It is like taking a maximum, but "soft" -- easy to differentiate and optimize

Maximum entropy and OT
- Inputs are violation vectors: e.g. x = (2,0) and x = (0,1)
- Outputs are one of K winners -- essentially a classification problem
- Violating a constraint works against the candidate: prob ~ exp{-(x1*w1 + x2*w2)}
- Crucial difference: candidates are ordered by one score, not by lexicographic order

| /bap/ | P(.) | *[+voice] | Ident(voice) |
|-------|------|-----------|--------------|
| [bab] | .5   | 2         |              |
| [pap] | .5   |           | 1            |

Maximum entropy
- Ordering discrete outputs from input vectors is a common problem: it is also called logistic regression (recall Nearey)
- Explaining the name: let P = p([bab]|/bap/); then log[P/(1-P)] = w2 - 2*w1 (derivation below)
- The model is a linear regression on the log-odds scale; the logistic transform maps it back to probabilities

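To see where this identity comes from, a short worked derivation, using the fact that [bab] and [pap] are the only two candidates in the tableau above (so 1 - P = p([pap]|/bap/)):

$$
\frac{P}{1-P} = \frac{e^{-2w_1}/Z}{e^{-w_2}/Z} = e^{w_2 - 2w_1},
\qquad
\log\frac{P}{1-P} = w_2 - 2w_1 .
$$

The log-odds are linear in the weights, which is the defining property of logistic regression.
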
The power of Maximum Entropy
- Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs
  - Recall Nearey: phones, diphones, ...
  - NLP: tagging, labeling, parsing, ... (anything with a discrete output)
- Easy to learn: the objective has only a global maximum, so optimization is efficient
- Isn't this the greatest thing in the world? We still need to understand the story behind the exp{} (in a few minutes)

Demo: Spanish diminutives
- Data from Arbisi-Kelm
- Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO, and BaseTooLittle

Stochastic OT and Max-Ent
- Is a better fit always a good thing?
- Should model-fitting become a new fashion in phonology?

The crucial difference
- What are the possible distributions of p(.|/bap/) in this case?
- Max-Ent considers a much wider range of distributions (see the sketch below)

| /bap/ | P(.) | *[+voice] | Ident(voice) |
|-------|------|-----------|--------------|
| [bab] |      | 2         |              |
| [pap] |      |           | 1            |
| [bap] |      | 1         |              |
| [pab] |      | 1         | 1            |

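To make the "wider range" concrete, a small Python sketch that sweeps the two weights over an arbitrary grid and prints the resulting distribution over the four candidates. Under strict reranking of these two constraints, only [pap] (when *[+voice] dominates) or [bap] (when Ident(voice) dominates) can ever win, so [bab] and [pab] always get probability 0; Max-Ent assigns every candidate nonzero probability, and each weight pair yields a different distribution. The grid values are assumed for illustration.

```python
import math

# Violation vectors from the four-candidate tableau: ( *[+voice], Ident(voice) ).
violations = {"bab": (2, 0), "pap": (0, 1), "bap": (1, 0), "pab": (1, 1)}

def dist(w1, w2):
    scores = {c: math.exp(-(v1 * w1 + v2 * w2))
              for c, (v1, v2) in violations.items()}
    Z = sum(scores.values())
    return {c: round(s / Z, 3) for c, s in scores.items()}

# Sweep an arbitrary grid of positive weights (values assumed for illustration).
for w1 in (0.1, 1.0, 3.0):
    for w2 in (0.1, 1.0, 3.0):
        print((w1, w2), dist(w1, w2))
```
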
What is Maximum Entropy anyway?
- Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy
- Given a die, which distribution has the largest entropy? (The uniform one.)
- Add constraints to the distribution: the average of some feature functions is assumed to be fixed at its observed value, Ep[fk(x)] = observed average of fk

What is Maximum Entropy anyway?
- Examples of features: violations, word counts, N-grams, co-occurrences, ...
- The constraints change the shape of the maximum-entropy distribution; solve the constrained optimization problem
- This leads to p(x) ~ exp{Σk wk*fk(x)} (a worked die example follows)
- Very general (see later); there are many choices of fk

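As a concrete instance of this recipe, a Python sketch of the die example from the previous slide. Among distributions over faces 1..6 with a fixed mean, the maximum-entropy solution has the form p(x) ~ exp{w*x}; here w is found by bisection so that the mean constraint is met. The target mean of 4.5 is an assumed illustration, not from the slides.

```python
import math

faces = range(1, 7)
target_mean = 4.5  # assumed observed value of the feature f(x) = x

def mean_under(w):
    # Mean of the max-ent distribution p(x) ~ exp(w * x) over the six faces.
    scores = [math.exp(w * x) for x in faces]
    Z = sum(scores)
    return sum(x * s / Z for x, s in zip(faces, scores))

# mean_under is increasing in w, so bisection finds the constrained solution.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_under(mid) < target_mean:
        lo = mid
    else:
        hi = mid

w = (lo + hi) / 2
Z = sum(math.exp(w * x) for x in faces)
print(w, [round(math.exp(w * x) / Z, 3) for x in faces])
```
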
The basic intuition
- Begin as "ignorant" as possible (with maximum entropy), so long as the chosen distribution matches certain "descriptions" of the empirical data (the statistics of the fk(x))
- Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold); this is common practice in NLP
- This is better seen as a "descriptive" model

Going towards Markov random fields
- Maximum entropy applied to conditional/joint distributions: p(y|x) or p(x,y) ~ exp{Σk wk*fk(x,y)}
- There can be many creative ways of extracting features fk(x,y)
- One way is to let a graph structure guide the calculation of features, e.g. neighborhoods/cliques
- Known as a Markov network/random field

Conditional random field
- Impose a chain-structured graph, and assign features to edges
- Still a max-ent model; the calculation is the same (a runnable sketch follows)
- Node features f(xi, yi); edge features m(yi, yi+1)

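A minimal chain-CRF sketch in Python, with assumed toy labels and hand-set feature values (only the f/m structure comes from the slide). The forward recursion sums over all label sequences to get the normalization constant, and log_prob scores one particular labeling.

```python
import math

labels = ["A", "B"]  # toy label set (assumed)

def f(x_i, y_i):   # node feature on the (x_i, y_i) pair (values assumed)
    return 1.0 if x_i == y_i else -1.0

def m(y_prev, y):  # edge feature on the (y_i, y_{i+1}) pair (values assumed)
    return 0.5 if y_prev == y else -0.5

def log_Z(x):
    # Forward algorithm: alpha[y] = log-sum of exp-scores of all prefixes ending in y.
    alpha = {y: f(x[0], y) for y in labels}
    for x_i in x[1:]:
        alpha = {y: f(x_i, y) + math.log(sum(math.exp(alpha[yp] + m(yp, y))
                                             for yp in labels))
                 for y in labels}
    return math.log(sum(math.exp(a) for a in alpha.values()))

def log_prob(x, y_seq):
    # score(y | x) = sum_i f(x_i, y_i) + sum_i m(y_i, y_{i+1}), normalized by Z.
    score = f(x[0], y_seq[0]) + sum(f(x_i, y_i) + m(y_prev, y_i)
                                    for x_i, y_prev, y_i in zip(x[1:], y_seq, y_seq[1:]))
    return score - log_Z(x)

print(math.exp(log_prob(["A", "B", "B"], ["A", "B", "B"])))  # probability of one labeling
```
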
Wilson's idea
- Isn't this a familiar picture in phonology?
- [Diagram: a chain CRF with the underlying form as the input sequence x and the surface form as the output sequence y; the edge features m(yi, yi+1) play the role of Markedness, and the node features f(xi, yi) play the role of Faithfulness]

The story of smoothing
- In Max-Ent models, the weights can get very large and "over-fit" the data (see demo)
- It is common to penalize (smooth) this with a new objective function: new objective = old objective + parameter * magnitude of weights (see the sketch below)
- Wilson's claim: this smoothing parameter has to do with substantive bias in phonological learning -- constraints that force less similarity get a higher penalty for changing value

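A minimal sketch of what the penalty does to the gradient, assuming a Gaussian-prior-style penalty (squared weights scaled per constraint); the per-constraint scales sigma_k are illustrative stand-ins for the substantive-bias idea, with a smaller sigma_k meaning a stronger pull toward zero.

```python
def penalized_gradient(grad_nll, w, sigma):
    """Gradient of: NLL + sum_k w_k**2 / (2 * sigma_k**2).

    The added term pulls each weight toward zero; constraints with small
    sigma_k (a strong penalty, per the substantive-bias idea) resist
    changing value.
    """
    return [g + wk / (s * s) for g, wk, s in zip(grad_nll, w, sigma)]

# Usage: with sigma = 0.1 the second weight is penalized 100x harder than
# the first (all numbers here are assumed for illustration).
print(penalized_gradient([0.2, 0.2], [1.0, 1.0], [1.0, 0.1]))  # [1.2, 100.2]
```
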
Wilson's model fitted to the velar palatalization data