![Page 1: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/1.jpg)
1
Machine Learning in
Natural Language
More on Discriminative models
Dan RothUniversity of Illinois, [email protected]://L2R.cs.uiuc.edu/~danr
![Page 2: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/2.jpg)
2
Generalization (since the representation is the same) How many examples are needed to get to a given level of accuracy?
Efficiency How long does it take to learn a hypothesis and evaluate it (per-example)? Robustness; Adaptation to a new domain, ….
How to Compare?
![Page 3: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/3.jpg)
3
S= I don’t know whether to laugh or cry Define a set of features:
features are relations that hold in the sentence
Map a sentence to its feature-based representation The feature-based representation will give some of the
information in the sentence
Use this as an example to your algorithm
-
Sentence Representation
![Page 4: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/4.jpg)
4
S= I don’t know whether to laugh or cry
Define a set of features: features are relations that hold in the sentence Conceptually, there are two steps in coming up with a feature-based representation
What are the information sources available? Sensors: words, order of words, properties (?) of
words What features to construct based on these?
Why needed?
Sentence Representation
![Page 5: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/5.jpg)
5
Weather
Whether
523341321 xxxxxxxxx 541 yyy
New discriminator in functionally simpler
Embedding
![Page 6: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/6.jpg)
6
The number of potential features is very large
The instance space is sparse
Decisions depend on a small set of features (sparse)
Want to learn from a number of examples that is small relative to the dimensionality
Domain Characteristics
![Page 7: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/7.jpg)
7
Dominated by the sparseness of the function space Most features are irrelevant # of examples required by multiplicative algorithms depends mostly on # of relevant features (Generalization bounds depend on ||w||;)
Lesser issue: Sparseness of features space: advantage to additive. Generalization depend on ||x|| (Kivinen/Warmuth 95)
Generalization
![Page 8: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/8.jpg)
8
Function: At least 10 out of fixed 100 variables are activeDimensionality is n Perceptron,SVMs
n: Total # of Variables (Dimensionality)
Winnow
Mistakes bounds for 10 of 100 of n#
of m
ista
kes
to c
onve
rgen
ce
![Page 9: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/9.jpg)
9
Dominated by the size of the feature space
Could be more efficient since work is done in the original feature space.
i
ii )K(x,xcf(x)
Additive algorithms allow the use of Kernels No need to explicitly generate the complex features
kn ) (x)... (x), (x), (x) n321 (),...,,( 321 kxxxxX
Most features are functions (e.g., n-grams) of raw attributes
Efficiency
![Page 10: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/10.jpg)
10
Update rule: Multiplicative /Additive/NB (+ regularization)
Feature space: Infinite Attribute Space - examples of variable size: only active features - determined in a data driven way Multi Class Learner
Several approaches are possible
{0,1}
• Makes Possible: Generation of many complex/relational types of
features Only a small fraction is actually represented Computationally efficient (on-line!)
SNoW
![Page 11: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/11.jpg)
11
Other methods are used broadly today in NLP: SVM, AdaBoost,
Mutli class classification
Dealing with lack of data: Semi-supervised learning Missing data
Other Issues in Classification
![Page 12: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/12.jpg)
12
Weather
Whether
523341321 xxxxxxxxx 541 yyy
New discriminator in functionally simpler
Embedding
![Page 13: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/13.jpg)
13
M f(x)
zz))S(z)K(x,(Th
A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
Computing the weight vector is done in the original space.
Notice: this pertains only to efficiency. Generalization is still relative to the real
dimensionality. This is the main trick in SVMs. (Algorithm - different)
(although many applications actually use linear kernels).
Kernel Based Methods
![Page 14: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/14.jpg)
14
(demotion) 1)x (if 1- w w,xbut w 0Class If)(promotion 1)x (if 1 w w,xwbut 1Class If
iii
iii
)xxw(Th f(x)
R w:Hypothesis ;{0,1} x :Examplesn
1i ii
nn
)(
Let I be the set t1,t2,t3 …of monomials (conjunctions) over The feature space x1, x2… xn.
Then we can write a linear function over this new feature space.
)xtw(Th f(x) i ii
I
)(
0 (11010)xx 1 (11010)xxx :Example 43421
Kernel Base Methods
![Page 15: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/15.jpg)
15
nn R w:Hypothesis ;{0,1} x :Examples
Great Increase in expressivity Can run Perceptron (and Winnow) but the convergence
bound may suffer exponential growth.
Exponential number of monomials are true in each example.
Also, will have to keep many weights.
)xtw(Th f(x) i ii
I
)(
(demotion) 1)x (if 1- w w,xbut w 0Class If)(promotion 1)x (if 1 w w,xwbut 1Class If
iii
iii
Kernel Based Methods
![Page 16: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/16.jpg)
16
nn R w:Hypothesis ;{0,1} x :Examples
• Consider the value of w used in the prediction.• Each previous mistake, on example z, makes an
additive contribution of +/-1 to w, iff t(z) = 1.• The value of w is determined by the number of
mistakes on which t() was satisfied.
)xtw(Th f(x) i ii
I
)(
(demotion) 1)x (if 1- w w,xbut w 0Class If)(promotion 1)x (if 1 w w,xwbut 1Class If
iii
iii
The Kernel Trick(1)
![Page 17: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/17.jpg)
17
nn R w:Hypothesis ;{0,1} x :Examples
• P – set of examples on which we Promoted• D – set of examples on which we Demoted• M = P D
I
I
((
)(
i iiMz
i i1(z)tD,z1(z)tP,z
x)z)ttS(z)(Th
)xt11(Th f(x)ii
)xtw(Th f(x) i ii
I
)(
(demotion) 1)x (if 1- w w,xbut w 0Class If)(promotion 1)x (if 1 w w,xwbut 1Class If
iii
iii
The Kernel Trick(2)
![Page 18: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/18.jpg)
18
• P – set of examples on which we Promoted• D – set of examples on which we Demoted• M = P D
• Where S(z)=1 if z P and S(z) = -1 if z D. Reordering:
)xtw(Th f(x) i ii
I
)(
I
I
((
)(
i iiMz
i i1(z)tD,z1(z)tP,z
x)z)ttS(z)(Th
)xt11(Th f(x)ii
M
I
(( f(x) z
iii ))xz)ttS(z)(Th
The Kernel Trick(3)
![Page 19: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/19.jpg)
19
• S(y)=1 if y P and S(y) = -1 if y D.
M
I
(( f(x) z
iii ))xz)ttS(z)(Th
• A mistake on z contributes the value +/-1 to all monomials satisfied by z. The total contribution of z to the sum is equal to the number of monomials that satisfy both x and z.
• Define a dot product in the t-space:
M f(x)
zz))S(z)K(x,(Th
)xz)tt z)K(x,i
ii
I
((
)xtw(Th f(x) i ii
I
)(
• We get the standard notation:
The Kernel Trick(4)
![Page 20: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/20.jpg)
20
What does this representation give us?
We can view this Kernel as the distance between x,z in the t-space.
But, K(x,z) can be measured in the original space, without explicitly writing the t-representation of x, z
M f(x)
zz))S(z)K(x,(Th
)xz)tt z)K(x,i
ii
I
((
Kernel Based Methods
![Page 21: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/21.jpg)
21
M f(x)
zz))S(z)K(x,(Th )xz)tt z)K(x,
iii
I
((
• Consider the space of all 3n monomials (allowing both positive and negative literals).
• Then, • Where same(x,z) is the number of features
that have the same value for both x and z.. We get:
• Example: Take n=2; x=(00), z=(01), ….• Other Kernels can be used.
1 z)same(x,2 z)K(x,
M f(x)
zz)same(x, 1)-S(z)(2(Th
Kernel Based Methods
![Page 22: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/22.jpg)
22
M f(x)
zz))S(z)K(x,(Th
)xz)tt z)K(x,i
ii
I
((
• Simply run Perceptron in an on-line mode, but keep track of the set M.
• Keeping the set M allows to keep track of S(z).
• Rather than remembering the weight vector w,
remember the set M (P and D) – all those examples on which we made mistakes.
Dual Representation
Implementation
![Page 23: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/23.jpg)
23
M f(x)
zz))S(z)K(x,(Th
• A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
• Computing the weight vector can still be done in the original feature space.
• Notice: this pertains only to efficiency: The classifier is identical to the one you get by blowing up the feature space.
• Generalization is still relative to the real dimensionality.
• This is the main trick in SVMs. (Algorithm - different) (although most applications actually use linear kernels)
Summary – Kernel Based Methods I
![Page 24: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/24.jpg)
24
There is a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the classifier.
For example, using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions.
In addition, computing with kernels depends strongly on the number of examples. It turns out that sometimes working in the blown up space is more efficient than using kernels.
Next: Kernel methods in NLP
Efficiency-Generalization Tradeoff
![Page 25: Machine Learning in Natural Language More on Discriminative models](https://reader036.vdocument.in/reader036/viewer/2022062410/56815c1d550346895dc9f246/html5/thumbnails/25.jpg)
25
Other methods are used broadly today in NLP: SVM, AdaBoost,
Mutliclass classification
Dealing with lack of data: Semi-supervised learning Missing data: EM
Other Issues in Classification