Machine Learning for Language Technology (Schedule)
Lecture 1: Introduction

Uppsala University
Department of Linguistics and Philology, September 2013
Last updated: 2 Sept 2013
Practical Information
Reading list, assignments, exams, etc.
Most slides are adapted from previous courses (by E. Alpaydin and J. Nivre)
Course web pages:
• http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf
• http://stp.lingfil.uu.se/~santinim/ml/MachineLearning_fall2013.htm
Contact details: [email protected] ([email protected])
About the Course
• Introduction to machine learning
• Focus on methods used in Language Technology and NLP
• Decision trees and nearest neighbor methods (Lecture 2)
• Linear models – the Weka ML package (Lectures 3 and 4)
• Ensemble methods – structured prediction (Lectures 5 and 6)
• Text mining and big data – R and RapidMiner (Lecture 7)
• Unsupervised learning (clustering) (Lecture 8, Magnus Rosell)
• Builds on Statistical Methods in NLP: this course covers mostly discriminative methods; generative probability models were covered in the first course
Digression: Generative vs. Discriminative Methods
A generative method applies only to probabilistic models. A model is generative if it models the joint distribution of input and output together, P(Y, X). It is called generative because it can be used to generate data points with the correct probability distribution.

Conditional methods model the conditional distribution of the output given the input: P(Y | X).

Discriminative methods do not model probabilities at all; they map the input to the output directly.
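To make the distinction concrete, here is a minimal sketch (not from the slides; the toy data are invented) that estimates a joint model and a conditional model from counts. Only the joint model can generate new (x, y) pairs; the conditional model can only score outputs for a given input:

```python
from collections import Counter
import random

# Toy data: (x, y) pairs with one binary feature x and one binary label y.
data = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (0, 1), (1, 1), (0, 0)]
N = len(data)

# Generative view: estimate the full joint distribution P(X, Y) from counts.
joint = {pair: c / N for pair, c in Counter(data).items()}

def generate():
    """Sample an (x, y) pair -- only possible with a joint model."""
    return random.choices(list(joint), weights=list(joint.values()))[0]

# Conditional view: estimate only P(Y | X), enough to predict y from x.
cond = {}
for x in (0, 1):
    n_x = sum(1 for xx, _ in data if xx == x)
    for y in (0, 1):
        cond[(y, x)] = sum(1 for xx, yy in data if xx == x and yy == y) / n_x

print(generate())    # generative: produces a data point, e.g. (1, 1)
print(cond[(1, 1)])  # conditional: P(y=1 | x=1) = 0.75
```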
Compulsory Reading List

Main textbooks:
1. Ethem Alpaydin. 2010. Introduction to Machine Learning. Second Edition. MIT Press (free online version).
2. Hal Daumé III. 2012. A Course in Machine Learning (free online version).
3. Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Second Edition. Morgan Kaufmann (free online version).

Additional material:
1. Thomas G. Dietterich. 2000. Ensemble Methods in Machine Learning. In J. Kittler and F. Roli (eds.), First International Workshop on Multiple Classifier Systems.
2. Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing.
3. Hanna M. Wallach. 2004. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. Department of Computer and Information Science, University of Pennsylvania.
Optional Reading

Hal Daumé III, John Langford, and Daniel Marcu. 2005. Search-Based Structured Prediction as Classification. NIPS Workshop on Advances in Structured Learning for Text and Speech Processing (ASLTSP).
Assignments and Examination
Three assignments:
• Decision trees and nearest neighbors
• Perceptron learning
• Clustering

General info:
• No lab sessions; supervision by email
• Reports for Assignments 1 and 2 must be submitted to [email protected]
• The report for Assignment 3 must be submitted to [email protected]

Examination:
• A written report is submitted for each assignment
• All three assignments are necessary to pass the course
• The grade is determined by the majority grade on the assignments
Practical Organization
• Lecturers: Marina Santini (Lectures 1–7); Magnus Rosell (Lecture 8)
• Format: 45 min lecture + 15 min break
• Lectures are posted on the course webpage and SlideShare
• Email all your questions to me: [email protected]
• Video recordings of the previous ML course: http://stp.lingfil.uu.se/~nivre/master/ml.html
• Send me an email at [email protected], so that I can make sure I have all the correct email addresses
Schedule: http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf
What is Machine Learning?
Introduction to:
• Classification
• Regression
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
What is Machine Learning?
• Machine learning is programming computers to optimize a performance criterion for some task, using example data or past experience.
• Why learning?
  – No known exact method: vision, speech recognition, robotics, spam filters, etc.
  – Exact method too expensive: statistical physics
  – Task evolves over time: network routing
• Compare: there is no need to use machine learning for computing payroll – we just need an algorithm.
Machine Learning – Data Mining – Artificial Intelligence – Statistics
• Machine learning: creation of a model that uses training data or past experience.
• Data mining: application of learning methods to large datasets (e.g. in physics, astronomy, biology). Text mining = machine learning applied to unstructured textual data (e.g. sentiment analysis, social media monitoring; see Text Mining, Wikipedia).
• Artificial intelligence: a model that can adapt to a changing environment.
• Statistics: machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample.
The bio-cognitive analogy
• Imagine a learning algorithm as a single neuron.
• This neuron receives input from other neurons, one for each input feature.
• The strengths of these inputs are the feature values.
• Each input has a weight, and the neuron simply sums up all the weighted inputs.
• Based on this sum, the neuron decides whether to “fire” or not. Firing is interpreted as a positive example; not firing as a negative example.
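A minimal sketch of this analogy (the weights and feature values below are invented):

```python
def neuron_fires(features, weights, threshold=0.0):
    """Sum the weighted inputs and 'fire' (positive example) above the threshold."""
    activation = sum(w * x for w, x in zip(weights, features))
    return activation > threshold

# Two input features with hand-picked weights.
print(neuron_fires([1.0, 2.0], weights=[0.5, 0.3]))   # 1.1 > 0  -> True (fires)
print(neuron_fires([1.0, 2.0], weights=[-0.9, 0.2]))  # -0.5 > 0 -> False (does not fire)
```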
Elements of Machine Learning
1. Generalization:
   • Generalize from specific examples
   • Based on statistical inference
2. Data:
   • Training data: specific examples to learn from
   • Test data: (new) specific examples to assess performance
3. Models:
   • Theoretical assumptions about the task/domain
   • Parameters that can be inferred from data
4. Algorithms:
   • Learning algorithm: infer model (parameters) from data
   • Inference algorithm: infer predictions from model
Types of Machine Learning
• Association
• Supervised learning
  – Classification
  – Regression
• Unsupervised learning
• Reinforcement learning
Learning Associations
• Basket analysis: P(Y | X) is the probability that somebody who buys X also buys Y, where X and Y are products/services.
• Example: P(chips | beer) = 0.7
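A minimal sketch of how such a probability could be estimated from transaction data (the baskets below are made up):

```python
# Hypothetical shopping baskets.
baskets = [
    {"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer", "diapers"},
    {"beer", "chips"}, {"chips"}, {"beer", "chips", "nuts"}, {"beer"},
]

def p(y, given):
    """Estimate P(y | given) as a ratio of basket counts."""
    with_x = [b for b in baskets if given in b]
    return sum(1 for b in with_x if y in b) / len(with_x)

print(p("chips", given="beer"))  # 4 of the 6 beer baskets also contain chips
```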
Classification
• Example: credit scoring
• Differentiating between low-risk and high-risk customers from their income and savings
• Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
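This discriminant translates directly into code; a sketch with hypothetical thresholds θ1 and θ2:

```python
THETA1, THETA2 = 30_000, 10_000  # hypothetical thresholds learned from data

def credit_risk(income, savings):
    """Axis-aligned discriminant: low-risk only if both attributes are high enough."""
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45_000, 12_000))  # low-risk
print(credit_risk(45_000, 5_000))   # high-risk
```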
Classification in NLP
• Binary classification:
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. non-error)
• Multiclass classification:
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)
• Structured prediction:
  – Part-of-speech tagging (classes = tag sequences)
  – Syntactic parsing (classes = parse trees)
Regression
• Example: price of a used car
• x: car attributes; y: price
• Model: y = g(x | θ), where g(·) is the model and θ are the parameters
• Linear model: y = wx + w0
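A minimal sketch of fitting the linear model y = wx + w0 by least squares (the car data below are invented):

```python
# Hypothetical data: car age in years (x) vs. price in kSEK (y).
xs = [1, 2, 3, 5, 8]
ys = [180, 150, 130, 100, 60]

# Closed-form least-squares estimates for y = w*x + w0.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
w0 = my - w * mx

print(f"y = {w:.1f}*x + {w0:.1f}")  # the slope is negative: older cars cost less
```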
Uses of Supervised Learning
• Prediction of future cases: use the rule to predict the output for future inputs
• Knowledge extraction: the rule is easy to understand
• Compression: the rule is simpler than the data it explains
• Outlier detection: exceptions that are not covered by the rule, e.g. fraud
Unsupervised Learning
• Finding regularities in data; no mapping to outputs
• Clustering: grouping similar instances
• Example applications:
  – Customer segmentation in CRM
  – Image compression: color quantization
  – NLP: unsupervised text categorization
Reinforcement Learning
• Learning a policy = sequence of outputs/actions
• No supervised output, but delayed reward
• Example applications:
  – Game playing
  – Robot in a maze
  – NLP: dialogue systems, for example:
    NJFun: A Reinforcement Learning Spoken Dialogue System (http://acl.ldc.upenn.edu/W/W00/W00-0304.pdf)
    Reinforcement Learning for Spoken Dialogue Systems: Comparing Strengths and Weaknesses for Practical Deployment (http://research.microsoft.com/apps/pubs/default.aspx?id=70295)
Supervised Learning

Introduction to:
• Margin
• Noise
• Bias
Supervised Classification
• Learning the class C of a “family car” from examples
• Prediction: is car x a family car?
• Knowledge extraction: what do people expect from a family car?
• Output (labels): positive (+) and negative (–) examples
• Input representation (features): x1 = price, x2 = engine power
Training set X

$$X = \{x^t, r^t\}_{t=1}^{N}$$

$$r = \begin{cases} 1 & \text{if } x \text{ is a positive example} \\ 0 & \text{if } x \text{ is a negative example} \end{cases}$$

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
Hypothesis class H

$$(p_1 \le \text{price} \le p_2) \;\text{AND}\; (e_1 \le \text{engine power} \le e_2)$$
Empirical (training) error

$$h(x) = \begin{cases} 1 & \text{if } h \text{ says } x \text{ is positive} \\ 0 & \text{if } h \text{ says } x \text{ is negative} \end{cases}$$

Empirical error of h on X:

$$E(h \mid X) = \frac{1}{N} \sum_{t=1}^{N} \mathbf{1}\big(h(x^t) \ne r^t\big)$$
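A minimal sketch tying the last three slides together: an axis-aligned rectangle hypothesis h and its empirical error on a toy training set (all numbers invented):

```python
# Toy training set: (price, engine_power, label) with label r = 1 for positive.
X = [(150, 90, 1), (160, 110, 1), (140, 100, 1), (80, 60, 0), (250, 200, 0)]

# Hypothesis h in H: an axis-aligned rectangle (p1 <= price <= p2, e1 <= power <= e2).
p1, p2, e1, e2 = 130, 180, 80, 120

def h(price, power):
    return 1 if p1 <= price <= p2 and e1 <= power <= e2 else 0

# Empirical error E(h | X): the fraction of training examples h gets wrong.
error = sum(1 for price, power, r in X if h(price, power) != r) / len(X)
print(error)  # 0.0 -- this h is consistent with the training set
```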
S, G, and the Version Space
• S: the most specific hypothesis
• G: the most general hypothesis
• Every h ∈ H between S and G is consistent (E(h | X) = 0); together these hypotheses make up the version space.
Margin
• Choose the h with the largest margin, i.e. the largest distance between the decision boundary and the closest training instances.
Noise
• Unwanted anomaly in the data:
  – Imprecision in input attributes
  – Errors in labeling data points
  – Hidden attributes (relative to H)
• Consequence: no h in H may be consistent!
Noise and Model Complexity
Arguments for the simpler model (Occam’s razor principle):
1. Easier to make predictions
2. Easier to train (fewer parameters)
3. Easier to understand
4. Generalizes better (if the data is noisy)
Inductive Bias
• Learning is an ill-posed problem:
  – Training data is never sufficient to find a unique solution
  – There are always infinitely many consistent hypotheses
• We need an inductive bias: assumptions that entail a unique h for a training set X
  1. Hypothesis class H – axis-aligned rectangles
  2. Learning algorithm – find the consistent hypothesis with max margin
  3. Hyperparameters – trade-off between training error and margin
Model Selection and Generalization
• Generalization: how well a model performs on new data
• Overfitting: H more complex than C
• Underfitting: H less complex than C
Triple Trade-Off
Trade-off between three factors:
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data

Dependencies:
• As N ↑, E ↓
• As c(H) ↑, first E ↓ and then E ↑
Model Selection: Generalization Error
• To estimate generalization error, we need data unseen during training:

$$\hat{E} = E(h \mid V) = \frac{1}{M} \sum_{t=1}^{M} \mathbf{1}\big(h(x^t) \ne r^t\big), \qquad V = \{x^t, r^t\}_{t=1}^{M}, \quad V \cap X = \emptyset$$

• Given models (hypotheses) h1, ..., hk induced from the training set X, we can use E(hi | V) to select the model hi with the smallest generalization error.
Model Assessment
• To estimate the generalization error of the best model hi, we need data unseen during both training and model selection
• Standard setup:
  1. Training set X (50–80%)
  2. Validation (development) set V (10–25%)
  3. Test (publication) set T (10–25%)
• Note:
  – Validation data can be added to the training set before testing
  – Resampling methods can be used if data is limited
Cross-Validation

K-fold cross-validation: divide X into K parts X1, ..., XK; in each fold, one part serves as the validation set and the rest as the training set:

$$V_1 = X_1, \quad T_1 = X_2 \cup X_3 \cup \cdots \cup X_K$$
$$V_2 = X_2, \quad T_2 = X_1 \cup X_3 \cup \cdots \cup X_K$$
$$\vdots$$
$$V_K = X_K, \quad T_K = X_1 \cup X_2 \cup \cdots \cup X_{K-1}$$

Note:
• Generalization error is estimated as the mean across the K folds
• Training sets for different folds share K–2 parts
• A separate test set must be maintained for model assessment
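A minimal sketch of the K-fold splitting logic in plain Python (fold assignment by index is just one simple choice):

```python
def kfold_indices(n, k):
    """Yield (validation, training) index lists for K-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]  # K disjoint parts of the data
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield val, train

# With N = 6 instances and K = 3 folds, each instance is validated exactly once.
for val, train in kfold_indices(6, 3):
    print("validate on", val, "train on", sorted(train))
```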
Bootstrapping
• Generate new training sets of size N from X by random sampling with replacement
• Use the original training set as the validation set (V = X)
• The probability that we do not pick a given instance after N draws is

$$\left(1 - \frac{1}{N}\right)^{N} \approx e^{-1} \approx 0.368$$

that is, only 36.8% of instances are new!
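A quick empirical check of this figure, as a sketch using only the standard library:

```python
import random

N, trials = 1000, 200
unseen = 0.0
for _ in range(trials):
    sample = {random.randrange(N) for _ in range(N)}  # one bootstrap draw with replacement
    unseen += (N - len(sample)) / N                   # fraction of instances never picked
print(unseen / trials)  # close to exp(-1), about 0.368
```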
Measuring Error
• Error rate = # of errors / # of instances = (FP + FN) / N
• Accuracy = # of correct / # of instances = (TP + TN) / N
• Recall = # of found positives / # of positives = TP / (TP + FN)
• Precision = # of found positives / # of found = TP / (TP + FP)
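A minimal sketch computing these measures from gold and predicted labels (the labels below are invented):

```python
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # true positives
fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # false positives
fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # false negatives
tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)  # true negatives
n = len(gold)

print("accuracy ", (tp + tn) / n)   # 6/8 = 0.75
print("recall   ", tp / (tp + fn))  # 3/4 = 0.75
print("precision", tp / (tp + fp))  # 3/4 = 0.75
```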
Statistical Inference
• Interval estimation to quantify the precision of our measurements, e.g. a 95% confidence interval for a mean:

$$m \pm 1.96 \frac{\sigma}{\sqrt{N}}$$

• Hypothesis testing to assess whether differences between models are statistically significant, e.g. McNemar’s test:

$$\frac{\big(|e_{01} - e_{10}| - 1\big)^2}{e_{01} + e_{10}} \sim \chi^2_1$$
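A sketch of the test statistic above, given hypothetical disagreement counts for two classifiers on the same test set (e01 = instances misclassified by model 1 but not by model 2, and vice versa for e10):

```python
# Hypothetical disagreement counts between two classifiers.
e01, e10 = 25, 10

# McNemar's statistic, approximately chi-squared with 1 degree of freedom.
stat = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
print(stat)  # 5.6 > 3.84, so the difference is significant at the 0.05 level
```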
Supervised Learning – Summary
• Training data + learner → hypothesis
  – The learner incorporates an inductive bias
• Test data + hypothesis → estimated generalization
  – Test data must be unseen
• Next lectures: different learners used in LT, with different inductive biases
Anatomy of a Supervised Learner (Dimensions of a supervised machine learning algorithm)

• Model: $g(x \mid \theta)$
• Loss function: $E(\theta \mid X) = \sum_t L\big(r^t, g(x^t \mid \theta)\big)$
• Optimization procedure: $\theta^* = \arg\min_\theta E(\theta \mid X)$
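A minimal sketch instantiating these three components for the linear regression model of slide 20 (the data, learning rate, and iteration count are invented; plain gradient descent stands in for the optimization procedure):

```python
xs = [1, 2, 3, 5, 8]            # hypothetical inputs (car age)
rs = [180, 150, 130, 100, 60]   # hypothetical outputs (price)

def g(x, theta):                # model g(x | theta)
    w, w0 = theta
    return w * x + w0

def E(theta):                   # loss: sum of squared errors L(r, g(x | theta))
    return sum((r - g(x, theta)) ** 2 for x, r in zip(xs, rs))

# Optimization: crude gradient descent toward theta* = argmin E(theta | X).
w, w0, lr = 0.0, 0.0, 0.005
for _ in range(5000):
    dw = sum(-2 * x * (r - g(x, (w, w0))) for x, r in zip(xs, rs))
    dw0 = sum(-2 * (r - g(x, (w, w0))) for x, r in zip(xs, rs))
    w, w0 = w - lr * dw, w0 - lr * dw0

print(round(w, 1), round(w0, 1))  # approaches the least-squares solution
```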
Reading
• Alpaydin (2010): chapters 1–2; chapter 19 (mathematical underpinnings)
• Witten and Frank (2005): chapter 1 (examples and domains of application)
End of Lecture 1
Thanks for your attention