
Introduction to Machine Learning for NLP I

Benjamin Roth

CIS LMU München


Outline

1 This Course

2 Overview

3 Machine Learning Definition
   Data (Experience)
   Tasks
   Performance Measures

4 Linear Regression: Overview and Cost Function

5 Summary


Course Overview

Foundations of machine learning
  - loss functions
  - linear regression
  - logistic regression
  - gradient-based optimization
  - neural networks and backpropagation

Deep learning tools in Python
  - Numpy
  - Theano
  - Keras
  - (some) Tensorflow?, (some) Pytorch?

Applications
  - Word Embeddings
  - Sentiment Analysis
  - Relation extraction
  - (some) Machine Translation?
  - Practical projects (NLP-related, to be agreed on during the course)


Lecture Times, Tutorials

Course homepage: dl-nlp.github.io

9-11 is supposed to be the lecture slot, and 11-12 the tutorial slot ...

... but we will not stick to that allocation

We will sometimes have longer Q&A-style/interactive “tutorial” sessions, sometimes more lectures (see next slide)

Tutor: Simon Schafer
  - Will discuss exercise sheets in the tutorials
  - Will help you with the projects


Plan

Date  | 9-11 slot                          | 11-12 slot         | Ex. sheet
10/18 | Overview / ML Intro I              | ML Intro I         | Linear algebra chapter
10/25 | Linear algebra Q&A / ML II         | ML II              | Probability chapter
11/1  | public holiday                     |                    |
11/8  | Probability Q&A / ML III           | Numpy              | Numpy
11/15 | ML IV / Theano Intro               | Convolution        | Theano I
11/22 | Embeddings / CNNs & RNNs for NLP   | Numpy Q&A          | Read LSTM/RNN
11/29 | LSTM (reading group)               | Theano I Q&A       | Theano II
12/6  | Keras                              | Keras              | Keras
12/13 | DL for Relation Prediction         | Theano II Q&A      | Relation Prediction
12/20 | Word Vectors                       | Project Topics     | Project Assignments
1/10  | Keras Q&A, Rel. Extr. Q&A          | Tensorflow         | –
1/17  | Optimization methods / PyTorch     | Help with projects | –
1/24  | Other Work at CIS / LMU, Neural MT | Help with projects | –
1/31  | Project presentations              | Presentations      | –
2/7   | Project presentations              | Presentations      | –


Formalities

This class is graded by a project

The grade of the project is determined by taking the average of:
  - Grade of the code written for the project.
  - Grade of the project documentation / mini-report.
  - Grade of the presentation about your project.
  ⇒ You have to pass all three elements in order to pass the course.

Bonus points: The grade can be improved by 0.5 absolute grades through the exercise sheets before New Year.

Formula:

$$g_{\text{project}} = \frac{g_{\text{project-code}} + g_{\text{project-report}} + g_{\text{project-presentation}}}{3}$$

$$g_{\text{final}} = \operatorname{round}(g_{\text{project}} - 0.5 \cdot x)$$

where x is the fraction of points reached in the exercises (between 0 and 1), and round selects the closest value of 1, 1.3, 1.7, 2, ..., 3.7, 4.
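For concreteness, the same computation as a small Python sketch (the function name, the full list of grade steps, and the tie-breaking of the rounding are my assumptions, not specified above):

```python
# Minimal sketch of the grading formula above; the grade steps between 2 and 3.7
# and the tie-breaking of min() are assumptions.
def final_grade(g_code, g_report, g_presentation, exercise_fraction):
    grade_steps = [1.0, 1.3, 1.7, 2.0, 2.3, 2.7, 3.0, 3.3, 3.7, 4.0]
    g_project = (g_code + g_report + g_presentation) / 3
    g_bonus = g_project - 0.5 * exercise_fraction  # exercise_fraction in [0, 1]
    return min(grade_steps, key=lambda g: abs(g - g_bonus))  # closest grade step

print(final_grade(2.0, 1.7, 2.3, 0.8))  # -> 1.7
```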


Exercise sheets, Projects, Presentations

6 ECTS, 14 weeks ⇒ average workload ∼13 hrs / week (3 in class, 10 at home)
  - In the first weeks, spend enough time to read and prepare so that you are not lost later.
  - From mid-November to mid-December: programming assignments. Coding takes time and can be frustrating (but rewarding)!

Exercise sheets
  - Work on non-programming exercise sheets individually.
  - For exercise sheets that contain programming parts, submit in teams of 2 or 3.

Projects
  - A list of topics will be proposed by me: ∼ implement a deep learning technique applied to information extraction (or another NLP task).
  - Own ideas are also possible, but need to be discussed with me.
  - Work in groups of two or three.
  - Project report: 3 pages / team member.


Good project code ...

... shows that you master the techniques taught in the lectures and exercises.

... shows that you can make your own decisions: e.g. adapt the model, task, or training data if necessary.

... is well-structured and easy to understand (descriptive variable names, meaningful modularization; avoid code duplication and dead code).

... is correct (especially: train/dev/test splits, evaluation).

... is within the scope of this lecture (time-wise it should not exceed 5 × 10 hours).


A good project presentation ...

... is short (10 min. per person + 15 min. Q&A per team).

... similar to the report, contains the problem statement, motivation, model, and results.

... is targeted at your fellow students, who do not know the details beforehand.

... contains interesting stuff: unexpected observations? conclusions / recommendations? did you deviate from some common practice?

... demonstrates that all team members worked together on the project.

Possible outline
  - Background / Motivation
  - Formal characterization of the techniques used
  - Technical approach and difficulties
  - Experiments, results and interpretation


A good project report ...

... is concise (3 pages / person) and clear

... motivates and describes the model that you have implemented and the results that you have obtained.

... shows that you can correctly describe the concepts taught in this class.

... contains interesting stuff: unexpected observations? conclusions / recommendations? did you deviate from some common practice?


Machine Learning

Machine learning for natural language processing
  - Why?
  - Advantages and disadvantages compared to alternatives?
  - Accuracy; coverage; resources required (data, expertise, human labour); reliability/robustness; explainability

(Contrast: hand-written grammar rules, e.g.)
P → NP VP
VP → V NP
NP → Det NN


Deep Learning

Learn complex functions that are (recursively) composed of simpler functions.

Many parameters have to be estimated.


Deep Learning
Main advantage: feature learning
  - Models learn to capture the most essential properties of the data (according to some performance measure) as intermediate representations.
  - No need to hand-craft feature extraction algorithms.


Neural Networks

The first training methods for deep nonlinear NNs appeared in the 1960s (Ivakhnenko and others).

Increasing interest in NN technology (again) for around the last 5 years (“Neural Network Renaissance”): orders of magnitude more data and faster computers now.

Many successes:
  - Image recognition and captioning
  - Speech recognition
  - NLP and machine translation (demo of the Bahdanau / Cho / Bengio system)
  - Game playing (AlphaGo)
  - ...


Machine Learning

Deep Learning builds on general Machine Learning concepts

$$\operatorname*{argmin}_{\theta \in H} \; \sum_{i=1}^{m} L(f(x_i; \theta),\, y_i)$$

Fitting data vs. generalizing from data

[Figure: three scatter plots of prediction vs. feature, showing fits of different complexity to the same data points]



A Definition

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” (Mitchell 1997)

Learning: Attaining the ability to perform a task.

A set of examples (“experience”) represents a more general task.

Examples are described by features: sets of numerical properties that can be represented as vectors x ∈ R^n.


Data

“A computer program is said to learn from experience E [...], if its performance [...] improves with experience E.”

Dataset: collection of examples

Design matrix X ∈ R^{n×m}
  - n: number of examples
  - m: number of features
  - Example: X_{i,j} is the count of feature j (e.g. a stem form) in document i.

Unsupervised learning:
  - Model X, or find interesting properties of X.
  - Training data: only X.

Supervised learning:
  - Predict specific additional properties from X.
  - Training data: label vector y ∈ R^n together with X.
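To make the notation concrete, here is a tiny design matrix in NumPy (a minimal sketch; the counts and labels are invented for illustration):

```python
import numpy as np

# Toy design matrix: n = 3 documents (rows), m = 4 stem-form features (columns).
# X[i, j] = count of feature j in document i (all values invented).
X = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1],
              [1, 1, 0, 0]])

# Supervised setting: one label per document, stacked into a vector y.
y = np.array([1, 0, 1])

n, m = X.shape
print(n, m)  # -> 3 4
```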


Data

Low training error does not mean good generalization.

Algorithm may overfit.

[Figure: two scatter plots of prediction vs. feature, contrasting a fit that generalizes with one that overfits the training points]


Data Splits

Best practice: split the data into a training, a cross-validation and a test set (“cross-validation set” = “development set”).
  - Optimize low-level parameters (feature weights, ...) on the training set.
  - Select models and hyper-parameters on the cross-validation set (type of machine learning model, number of features, regularization, priors).
  - It is possible to overfit both in the training and in the model selection stage!
  - ⇒ Report the final score on the test set only after the model has been selected!

Don’t report the error on the training or cross-validation set as your model performance!
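One simple way to realize such a split with NumPy (a sketch; the 80/10/10 proportions and the helper name are my own choices, not prescribed above):

```python
import numpy as np

def train_dev_test_split(X, y, seed=0):
    """Shuffle once, then cut into 80% train / 10% dev / 10% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_dev = int(0.8 * len(X)), int(0.9 * len(X))
    parts = idx[:n_train], idx[n_train:n_dev], idx[n_dev:]
    return [(X[p], y[p]) for p in parts]

X, y = np.random.rand(100, 5), np.random.rand(100)
(X_tr, y_tr), (X_dev, y_dev), (X_te, y_te) = train_dev_test_split(X, y)
print(len(X_tr), len(X_dev), len(X_te))  # -> 80 10 10
```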


Machine Learning Tasks

“A computer program is said to learn [...] with respect to some class of tasks T [...] if its performance at tasks in T [...] improves [...]”

Types of tasks:

Classification

Regression

Structured Prediction

Anomaly Detection

synthesis and sampling

Imputation of missing values

Denoising

Clustering

Reinforcement learning

. . .


Machine Learning Tasks: Typical Examples & Examples from Recent NLP Research

What are the most important conferences relevant to the intersection of ML and NLP?


Task: Classification

Which of k classes does an example belong to?

f : R^n → {1, ..., k}

Typical example: categorize image patches
  - Feature vector: color intensities for each pixel; derived features.
  - Output categories: predefined set of labels.

Typical example: spam classification
  - Feature vector: high-dimensional, sparse vector. Each dimension indicates the occurrence of a particular word, or other email-specific information.
  - Output categories: “spam” vs. “ham”.


Task: Classification

EMNLP 2017: Given a person name in a sentence that contains keywords related to police (“officer”, “police”, ...) and to killing (“killed”, “shot”), was the person a civilian killed by police?


Task: Regression

Predict a numerical value given some input.

f : R^n → R

Typical examples:
  - Predict the risk of an insurance customer.
  - Predict the value of a stock.


Task: Regression

ACL 2017: Given a response in a multi-turn dialogue, predict a value (on a scale from 1 to 5) for how natural the response is.


Task: Structured Prediction

Predict a multi-valued output with special inter-dependencies and constraints.

Typical examples:
  - Part-of-speech tagging
  - Syntactic parsing
  - Protein folding

Often involves search and problem-specific algorithms.


Task: Structured Prediction

ACL 2017: Jointly find all relations of interest in a sentence by tagging arguments and combining them.


Task: Reinforcement Learning

In reinforcement learning, the model (also called the agent) needs to select a series of actions, but only observes the outcome (reward) at the end. The goal is to predict actions that will maximize the outcome.

EMNLP 2017: The computer negotiates with humans in natural language in order to maximize its points in a game.


Task: Anomaly Detection

Detect atypical items or events.

Common approach: estimate the density and identify items that have low probability.

Examples:
  - Quality assurance
  - Detection of criminal activity

Often, items categorized as outliers are sent to humans for further scrutiny.


Task: Anomaly Detection

ACL 2017: Schizophrenia patients can be detected by their non-standard use of metaphors and more extreme sentiment expressions.


Supervised and Unsupervised Learning

Unsupervised learning: learn interesting properties, such as the probability distribution p(x).

Supervised learning: learn a mapping from x to y, typically by estimating p(y|x).

Supervised learning in an unsupervised way:

$$p(y \mid \mathbf{x}) = \frac{p(\mathbf{x}, y)}{\sum_{y'} p(\mathbf{x}, y')}$$
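A tiny NumPy sketch of this formula, using an invented joint distribution over three feature values and two labels:

```python
import numpy as np

# Invented joint probabilities p(x, y): 3 feature values (rows) x 2 labels (columns).
p_xy = np.array([[0.10, 0.30],
                 [0.25, 0.05],
                 [0.20, 0.10]])   # entries sum to 1

# p(y | x) = p(x, y) / sum_y' p(x, y')
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)
print(p_y_given_x[0])  # -> [0.25 0.75], i.e. p(y | x = 0)
```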


Performance Measures

“A computer program is said to learn [...] with respect to some [...] performance measure P, if its performance [...] as measured by P, improves [...]”

Quantitative measure of algorithm performance.

Task-specific.


Discrete Loss Functions

Can be used to measure classification performance.

Not applicable to measure density estimation or regression performance.

Accuracy
  - Proportion of examples for which the model produces the correct output.
  - 0-1 loss = error rate = 1 − accuracy.

Accuracy may be inappropriate for skewed label distributions, where the relevant category is rare.

$$F_1\text{-score} = \frac{2 \cdot \text{Prec} \cdot \text{Rec}}{\text{Prec} + \text{Rec}}$$
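As a small illustration, these measures can be computed directly with NumPy (a sketch on invented predictions, treating label 1 as the rare relevant class):

```python
import numpy as np

y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1])  # invented gold labels
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1])  # invented model outputs

accuracy = (y_pred == y_true).mean()        # 6 of 8 correct -> 0.75
tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives:  2
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives: 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives: 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # precision, recall and F1 are all 2/3
```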


Discrete vs. Continuous Loss Functions

Discrete loss functions cannot indicate how wrong an incorrect decision for one example is.

Continuous loss functions ...
  - ... are more widely applicable.
  - ... are often easier to optimize (differentiable).
  - ... can also be applied to discrete tasks (classification).

Sometimes algorithms are optimized using one loss (e.g. hinge loss) and evaluated using another loss (e.g. F1-score).


Examples for Continuous Loss Functions

Density estimation: log probability of example

Regression: squared error

Classification: the loss L(y_i · f(x_i)) is a function of label × prediction
  - label ∈ {−1, 1}, prediction ∈ R
  - Correct prediction: y_i · f(x_i) > 0
  - Wrong prediction: y_i · f(x_i) ≤ 0
  - Zero-one loss, hinge loss, logistic loss, ...

The loss on a data set is the sum of the per-example losses.
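A minimal NumPy sketch of the classification losses mentioned above, each written as a function of the margin y_i · f(x_i):

```python
import numpy as np

def zero_one_loss(margin):
    """1 if the prediction is wrong (margin <= 0), else 0."""
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    """max(0, 1 - margin): also penalizes low-confidence correct predictions."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """log(1 + exp(-margin)): smooth and differentiable everywhere."""
    return np.log1p(np.exp(-margin))

margins = np.array([-2.0, -0.5, 0.3, 2.0])  # y_i * f(x_i) for four examples
for loss in (zero_one_loss, hinge_loss, logistic_loss):
    print(loss.__name__, loss(margins), "data set loss:", loss(margins).sum())
```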


Linear Regression

For one instance:
  - Input: vector x ∈ R^n
  - Output: scalar y ∈ R (actual output: y; predicted output: ŷ)
  - Linear function:

$$\hat{y} = \mathbf{w}^T \mathbf{x} = \sum_{j=1}^{n} w_j x_j$$


Linear Regression

Linear function:

$$\hat{y} = \mathbf{w}^T \mathbf{x} = \sum_{j=1}^{n} w_j x_j$$

Parameter vector w ∈ R^n

Weight w_j decides whether the value of feature x_j increases or decreases the prediction ŷ.


Linear Regression

For the whole data set:
  - Use matrix X and vector y to stack the instances on top of each other.
  - Typically the first column contains all 1s for the intercept (bias, shift) term.

$$X = \begin{pmatrix} 1 & x_{12} & x_{13} & \dots & x_{1n} \\ 1 & x_{22} & x_{23} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m2} & x_{m3} & \dots & x_{mn} \end{pmatrix} \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$$

For the entire data set, the predictions are stacked on top of each other:

$$\hat{\mathbf{y}} = X\mathbf{w}$$

Estimate the parameters using X^(train) and y^(train).

Make high-level decisions (which features, ...) using X^(dev) and y^(dev).

Evaluate the resulting model using X^(test) and y^(test).


Simple Example: Housing Prices

Predict Munich property prices (in 1K Euros) from just one feature: square meters of the property.

$$X = \begin{pmatrix} 1 & 450 \\ 1 & 900 \\ 1 & 1350 \end{pmatrix} \qquad \mathbf{y} = \begin{pmatrix} 730 \\ 1300 \\ 1700 \end{pmatrix}$$

The prediction is:

$$\hat{\mathbf{y}} = \begin{pmatrix} w_1 + 450\,w_2 \\ w_1 + 900\,w_2 \\ w_1 + 1350\,w_2 \end{pmatrix} = \begin{pmatrix} 1 & 450 \\ 1 & 900 \\ 1 & 1350 \end{pmatrix} \cdot \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = X\mathbf{w}$$

w_1 will contain the costs incurred in any property acquisition; w_2 will contain the remaining average price per square meter. The optimal parameters for the above case are:

$$\mathbf{w} = \begin{pmatrix} 273.3 \\ 1.08 \end{pmatrix} \qquad \hat{\mathbf{y}} = \begin{pmatrix} 759.1 \\ 1245.1 \\ 1731.1 \end{pmatrix}$$
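These numbers can be checked with NumPy's least-squares solver (a sketch; np.linalg.lstsq minimizes the squared error that the next slide introduces as the cost function, and small differences from the values above come from rounding):

```python
import numpy as np

X = np.array([[1, 450], [1, 900], [1, 1350]], dtype=float)
y = np.array([730, 1300, 1700], dtype=float)

# w minimizing ||Xw - y||^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)      # approx. [273.3, 1.08]
print(X @ w)  # predictions close to the values above (differences are rounding)
```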


Linear Regression: Mean Squared Error

The mean squared error of the training (or test) data set is the average of the squared differences between the predictions and the labels of all m instances.

$$\text{MSE}^{(\text{train})} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i^{(\text{train})} - y_i^{(\text{train})} \right)^2$$

In matrix notation:

$$\text{MSE}^{(\text{train})} = \frac{1}{m} \left\lVert \hat{\mathbf{y}}^{(\text{train})} - \mathbf{y}^{(\text{train})} \right\rVert_2^2 = \frac{1}{m} \left\lVert X^{(\text{train})}\mathbf{w} - \mathbf{y}^{(\text{train})} \right\rVert_2^2$$
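Continuing the housing sketch from above, both forms of the cost can be evaluated directly (a sketch using the rounded parameters from the example):

```python
import numpy as np

X_train = np.array([[1, 450], [1, 900], [1, 1350]], dtype=float)
y_train = np.array([730, 1300, 1700], dtype=float)
w = np.array([273.3, 1.08])  # rounded parameters from the housing example

y_hat = X_train @ w
mse = np.mean((y_hat - y_train) ** 2)                                   # elementwise form
mse_matrix = np.linalg.norm(X_train @ w - y_train) ** 2 / len(y_train)  # matrix form
print(mse, mse_matrix)  # both give the same value
```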


Summary

Deep Learning
  - many successes in recent years
  - feature learning instead of feature engineering
  - builds on general machine learning concepts

Machine learning definition
  - Data
  - Task
  - Cost function

Machine learning tasks
  - Classification
  - Regression
  - ...

Linear regression
  - Output depends linearly on input
  - Cost function: mean squared error

Next up: estimating the parameters
