
Page 1: CS 4375 Introduction to Machine Learning

CS 4375
Introduction to Machine Learning

Nicholas Ruozzi
University of Texas at Dallas

Slides adapted from David Sontag and Vibhav Gogate

Page 2: CS 4375 Introduction to Machine Learning

Course Info.

• Instructor: Nicholas Ruozzi

• Office: ECSS 3.409

• Office hours: M 11:30-12:30, W 12:30pm-1:30pm

• TA: Hailiang Dong

• Office hours and location: T 10:30-12:00, R 13:30-15:00

• Course website: www.utdallas.edu/~nicholas.ruozzi/cs4375/2019fa/

• Book: none required

• Piazza (online forum): sign-up link on eLearning

Page 3: CS 4375 Introduction to Machine Learning

Prerequisites

• CS 3345, Data Structures and Algorithms

• CS 3341, Probability and Statistics in Computer Science

• "Mathematical sophistication"

• Basic probability

• Linear algebra: eigenvalues/vectors, matrices, vectors, etc.

• Multivariate calculus: derivatives, gradients, etc.

• I'll review some concepts as we come to them, but you should brush up on any areas that you aren't as comfortable with

• Take the prerequisite "quiz" on eLearning

Page 4: CS 4375 Introduction to Machine Learning

Course Topics

• Dimensionality reduction
  • PCA
  • Matrix factorizations

• Learning
  • Supervised, unsupervised, active, reinforcement, …
  • SVMs & kernel methods
  • Decision trees, k-NN, logistic regression, …
  • Parameter estimation: Bayesian methods, MAP estimation, maximum likelihood estimation, expectation maximization, …
  • Clustering: k-means & spectral clustering

• Probabilistic models
  • Bayesian networks
  • Naïve Bayes

• Neural networks

• Evaluation
  • AUC, cross-validation, precision/recall

• Statistical methods
  • Boosting, bagging, bootstrapping
  • Sampling

Page 5: CS 4375 Introduction to Machine Learning

Grading

• 5-6 problem sets (50%)
  • See collaboration policy on the web
  • Mix of theory and programming (in MATLAB or Python)
  • Available and turned in on eLearning
  • Approximately one assignment every two weeks

• Midterm Exam (20%)

• Final Exam (30%)

• Attendance policy?

- subject to change -

Page 6: CS 4375 Introduction to Machine Learning

What is ML?


Page 7: CS 4375 Introduction to Machine Learning

What is ML?

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."

- Tom Mitchell

Page 8: CS 4375 Introduction to Machine Learning

Basic Machine Learning Paradigm

• Collect data

• Build a model using "training" data

• Use model to make predictions

Page 9: CS 4375 Introduction to Machine Learning

Supervised Learning

• Input: $(x^{(1)}, y^{(1)}), \ldots, (x^{(M)}, y^{(M)})$

• $x^{(m)}$ is the $m$-th data item and $y^{(m)}$ is the $m$-th label

• Goal: find a function $f$ such that $f(x^{(m)})$ is a "good approximation" to $y^{(m)}$

• Can use it to predict $y$ values for previously unseen $x$ values

Page 10: CS 4375 Introduction to Machine Learning

Examples of Supervised Learning

• Spam email detection

• Handwritten digit recognition

• Stock market prediction

• More?

Page 11: CS 4375 Introduction to Machine Learning

Supervised Learning

• Hypothesis space: set of allowable functions $f : X \to Y$

• Goal: find the "best" element of the hypothesis space

• How do we measure the quality of $f$?

Page 12: CS 4375 Introduction to Machine Learning

Supervised Learning

• Simple linear regression

• Input: pairs of points $(x^{(1)}, y^{(1)}), \ldots, (x^{(M)}, y^{(M)})$ with $x^{(m)} \in \mathbb{R}$ and $y^{(m)} \in \mathbb{R}$

• Hypothesis space: set of linear functions $f(x) = ax + b$ with $a, b \in \mathbb{R}$

• Error metric: squared difference between the predicted value and the actual value

Page 13: CS 4375 Introduction to Machine Learning

Regression

π‘₯π‘₯

𝑦𝑦

13

Page 14: CS 4375 Introduction to Machine Learning

Regression

π‘₯π‘₯

𝑦𝑦

Hypothesis class: linear functions 𝑓𝑓 π‘₯π‘₯ = π‘Žπ‘Žπ‘₯π‘₯ + 𝑏𝑏

How do we compute the error of a specific hypothesis?

14

Page 15: CS 4375 Introduction to Machine Learning

Linear Regression

• For any data point, $x$, the learning algorithm predicts $f(x)$

• In typical regression applications, measure the fit using a squared loss function

$$L(f) = \frac{1}{M} \sum_m \left( f(x^{(m)}) - y^{(m)} \right)^2$$

• Want to minimize the average loss on the training data

• The optimal linear hypothesis is then given by

$$\min_{a,b} \frac{1}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right)^2$$
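As a concrete aside (a minimal sketch, not from the slides; the toy data and names are made up), the loss of a candidate pair $(a, b)$ can be computed directly:

    # Average squared loss of the hypothesis f(x) = a*x + b on training pairs.
    def squared_loss(a, b, xs, ys):
        M = len(xs)
        return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / M

    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]
    print(squared_loss(2.0, 1.0, xs, ys))  # perfect fit: 0.0
    print(squared_loss(1.0, 0.0, xs, ys))  # worse fit: 7.5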

Page 16: CS 4375 Introduction to Machine Learning

Linear Regression

minπ‘Žπ‘Ž,𝑏𝑏

1π‘€π‘€οΏ½π‘šπ‘š

π‘Žπ‘Žπ‘₯π‘₯(π‘šπ‘š) + 𝑏𝑏 βˆ’ 𝑦𝑦(π‘šπ‘š) 2

β€’ How do we find the optimal π‘Žπ‘Ž and 𝑏𝑏?

16

Page 17: CS 4375 Introduction to Machine Learning

Linear Regression

minπ‘Žπ‘Ž,𝑏𝑏

1π‘€π‘€οΏ½π‘šπ‘š

π‘Žπ‘Žπ‘₯π‘₯(π‘šπ‘š) + 𝑏𝑏 βˆ’ 𝑦𝑦(π‘šπ‘š) 2

β€’ How do we find the optimal π‘Žπ‘Ž and 𝑏𝑏?

β€’ Solution 1: take derivatives and solve (there is a closed form solution!)

β€’ Solution 2: use gradient descent

17
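To make Solution 1 concrete (a hedged sketch; the helper name fit_linear is an assumption, not from the slides): setting both partial derivatives of the loss to zero yields $a = \mathrm{Cov}(x, y) / \mathrm{Var}(x)$ and $b = \bar{y} - a \bar{x}$, which can be computed in one pass:

    # Closed-form least-squares fit for f(x) = a*x + b, obtained by setting
    # the partial derivatives of the average squared loss to zero.
    def fit_linear(xs, ys):
        M = len(xs)
        x_mean = sum(xs) / M
        y_mean = sum(ys) / M
        cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        var = sum((x - x_mean) ** 2 for x in xs)
        a = cov / var
        b = y_mean - a * x_mean
        return a, b

    print(fit_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # (2.0, 1.0)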

Page 18: CS 4375 Introduction to Machine Learning

Linear Regression

minπ‘Žπ‘Ž,𝑏𝑏

1π‘€π‘€οΏ½π‘šπ‘š

π‘Žπ‘Žπ‘₯π‘₯(π‘šπ‘š) + 𝑏𝑏 βˆ’ 𝑦𝑦(π‘šπ‘š) 2

β€’ How do we find the optimal π‘Žπ‘Ž and 𝑏𝑏?

β€’ Solution 1: take derivatives and solve (there is a closed form solution!)

β€’ Solution 2: use gradient descent

β€’ This approach is much more likely to be useful for general loss functions

18

Page 19: CS 4375 Introduction to Machine Learning

Gradient Descent

Iterative method to minimize a (convex) differentiable function $f$

A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if

$$\lambda f(x) + (1 - \lambda) f(y) \geq f(\lambda x + (1 - \lambda) y)$$

for all $\lambda \in [0,1]$ and all $x, y \in \mathbb{R}^n$
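For example, $f(x) = x^2$ (the function used in the demo that follows) is convex: for any $x, y \in \mathbb{R}$,

$$\lambda x^2 + (1 - \lambda) y^2 - (\lambda x + (1 - \lambda) y)^2 = \lambda (1 - \lambda)(x - y)^2 \geq 0$$

whenever $\lambda \in [0,1]$.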

Page 20: CS 4375 Introduction to Machine Learning

Gradient Descent

Iterative method to minimize a (convex) differentiable function $f$

• Pick an initial point $x_0$

• Iterate until convergence

$$x_{t+1} = x_t - \gamma_t \nabla f(x_t)$$

where $\gamma_t$ is the $t$-th step size (sometimes called the learning rate)
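A minimal sketch of this iteration in Python (the fixed step size, tolerance, and function names are illustrative assumptions, not course code):

    # Gradient descent with a fixed step size and a simple stopping rule.
    def gradient_descent(grad_f, x0, step=0.1, tol=1e-8, max_iters=10000):
        x = x0
        for _ in range(max_iters):
            g = grad_f(x)
            if abs(g) < tol:      # stop near a stationary point
                break
            x = x - step * g      # x_{t+1} = x_t - gamma * grad f(x_t)
        return x

    # Example: minimize f(x) = x^2, whose gradient is 2x.
    print(gradient_descent(lambda x: 2 * x, x0=-4.0, step=0.8))  # ~0.0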

Page 21: CS 4375 Introduction to Machine Learning

Gradient Descent

[Figure: plot of $f(x) = x^2$ with the current iterate marked]

$f(x) = x^2$

$x^{(0)} = -4$, step size: 0.8

Page 22: CS 4375 Introduction to Machine Learning

Gradient Descent

[Figure: plot of $f(x) = x^2$ showing the first gradient step]

$f(x) = x^2$

$x^{(0)} = -4$, step size: 0.8

$x^{(1)} = -4 - 0.8 \cdot 2 \cdot (-4)$

Page 23: CS 4375 Introduction to Machine Learning

Gradient Descent

[Figure: plot of $f(x) = x^2$ after the first step]

$f(x) = x^2$

$x^{(0)} = -4$, $x^{(1)} = 2.4$, step size: 0.8

Page 24: CS 4375 Introduction to Machine Learning

Gradient Descent

[Figure: plot of $f(x) = x^2$ showing the second gradient step]

$f(x) = x^2$

$x^{(0)} = -4$, $x^{(1)} = 2.4$, step size: 0.8

$x^{(2)} = 2.4 - 0.8 \cdot 2 \cdot 2.4$

Page 25: CS 4375 Introduction to Machine Learning

Gradient Descent

[Figure: plot of $f(x) = x^2$ after the second step]

$f(x) = x^2$

$x^{(0)} = -4$, $x^{(1)} = 2.4$, $x^{(2)} = -1.44$, step size: 0.8

Page 26: CS 4375 Introduction to Machine Learning

Gradient Descent

[Figure: plot of $f(x) = x^2$ with the full sequence of iterates]

$f(x) = x^2$

$x^{(0)} = -4$, $x^{(1)} = 2.4$, $x^{(2)} = -1.44$, $x^{(3)} = 0.864$, $x^{(4)} = -0.5184$, $x^{(5)} = 0.31104$, …

$x^{(30)} \approx -8.84 \times 10^{-7}$

Step size: 0.8
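The whole sequence can be reproduced in a few lines (a sketch; with $\gamma = 0.8$ each step multiplies the iterate by $1 - 2\gamma = -0.6$, so the iterates alternate in sign while shrinking toward 0):

    # Reproduce the f(x) = x^2 demo: x_{t+1} = x_t - 0.8 * 2 * x_t = -0.6 * x_t.
    x = -4.0
    for t in range(1, 31):
        x = x - 0.8 * 2 * x
        if t <= 5 or t == 30:
            print(f"x({t}) = {x}")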

Page 27: CS 4375 Introduction to Machine Learning

Gradient Descent

minπ‘Žπ‘Ž,𝑏𝑏

1π‘€π‘€οΏ½π‘šπ‘š

π‘Žπ‘Žπ‘₯π‘₯(π‘šπ‘š) + 𝑏𝑏 βˆ’ 𝑦𝑦(π‘šπ‘š) 2

β€’ What is the gradient of this function?

β€’ What does a gradient descent iteration look like for this simple regression problem?

(on board)

27
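For reference (the derivation itself is done on the board), differentiating the loss with respect to each parameter gives

$$\frac{\partial L}{\partial a} = \frac{2}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right) x^{(m)}, \qquad \frac{\partial L}{\partial b} = \frac{2}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right)$$

so one iteration of gradient descent updates both parameters at once: $a \leftarrow a - \gamma_t \, \partial L / \partial a$ and $b \leftarrow b - \gamma_t \, \partial L / \partial b$.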

Page 28: CS 4375 Introduction to Machine Learning

Linear Regression

• In higher dimensions, the linear regression problem is essentially the same with $x^{(m)} \in \mathbb{R}^n$

$$\min_{a \in \mathbb{R}^n, b} \frac{1}{M} \sum_m \left( a^T x^{(m)} + b - y^{(m)} \right)^2$$

• Can still use gradient descent to minimize this (see the sketch below)

• Not much more difficult than the $n = 1$ case
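A hedged numpy sketch of gradient descent in this multivariate setting (the step size, iteration count, and toy data are illustrative assumptions):

    import numpy as np

    # X: M x n data matrix, y: length-M target vector.
    def fit_linear_gd(X, y, step=0.01, iters=5000):
        M, n = X.shape
        a, b = np.zeros(n), 0.0
        for _ in range(iters):
            resid = X @ a + b - y                    # length-M residuals
            a -= step * (2.0 / M) * (X.T @ resid)    # gradient w.r.t. a
            b -= step * (2.0 / M) * resid.sum()      # gradient w.r.t. b
        return a, b

    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
    y = X @ np.array([2.0, -1.0]) + 3.0
    print(fit_linear_gd(X, y))   # approximately ([2., -1.], 3.0)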

Page 29: CS 4375 Introduction to Machine Learning

Gradient Descent

• Gradient descent converges under certain technical conditions on the function $f$ and the step size $\gamma_t$

• If $f$ is convex, then any fixed point of gradient descent must correspond to a global minimum of $f$

• In general, for a nonconvex function, it may only converge to a local optimum

Page 30: CS 4375 Introduction to Machine Learning

Regression

• What if we enlarge the hypothesis class?

• Quadratic functions: $ax^2 + bx + c$

• $k$-degree polynomials: $a_k x^k + a_{k-1} x^{k-1} + \cdots + a_1 x + a_0$

$$\min_{a_k, \ldots, a_0} \frac{1}{M} \sum_m \left( a_k (x^{(m)})^k + \cdots + a_1 x^{(m)} + a_0 - y^{(m)} \right)^2$$
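As a sketch of what enlarging the class does to the training loss (numpy's polyfit minimizes exactly this squared loss; the noisy toy data is made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = 2 * x + 1 + rng.normal(scale=0.1, size=10)   # noisy line

    for k in [1, 3, 9]:
        coeffs = np.polyfit(x, y, deg=k)             # least-squares polynomial fit
        train_loss = np.mean((np.polyval(coeffs, x) - y) ** 2)
        print(k, train_loss)   # training loss shrinks as k grows;
                               # degree 9 (nearly) interpolates all 10 points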

Page 31: CS 4375 Introduction to Machine Learning

Regression

• What if we enlarge the hypothesis class?

• Quadratic functions: $ax^2 + bx + c$

• $k$-degree polynomials: $a_k x^k + a_{k-1} x^{k-1} + \cdots + a_1 x + a_0$

• Can we always learn "better" with a larger hypothesis class?


Page 33: CS 4375 Introduction to Machine Learning

Regression

• A larger hypothesis space always decreases the cost function, but this does NOT necessarily mean better predictive performance

• This phenomenon is known as overfitting

• Ideally, we would select the simplest hypothesis consistent with the observed data

• In practice, we cannot simply evaluate our learned hypothesis on the training data; we want it to perform well on unseen data (otherwise, we could just memorize the training data!)

• Report the loss on some held-out test data, i.e., data not used as part of the training process (see the sketch below)
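A minimal sketch of this protocol (the split sizes and toy data are illustrative assumptions): fit on a training set, then report the loss on points the fit never saw.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 30)
    y = 2 * x + 1 + rng.normal(scale=0.1, size=30)

    # Hold out a third of the data: fit on the rest, report loss on both parts.
    x_train, y_train, x_test, y_test = x[:20], y[:20], x[20:], y[20:]

    for k in [1, 9]:
        coeffs = np.polyfit(x_train, y_train, deg=k)
        train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(k, train, test)   # degree 9 fits the training set better, but
                                # typically does worse on the held-out set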

Page 34: CS 4375 Introduction to Machine Learning

Binary Classification

• Regression operates over a continuous set of outcomes

• Suppose that we want to learn a function $f : X \to \{0, 1\}$

• As an example:

      x1  x2  x3 | y
   1   0   0   1 | 0
   2   0   1   0 | 1
   3   1   1   0 | 1
   4   1   1   1 | 0

How do we pick the hypothesis space?

How do we find the best $f$ in this space?

Page 35: CS 4375 Introduction to Machine Learning

Binary Classification

• Regression operates over a continuous set of outcomes

• Suppose that we want to learn a function $f : X \to \{0, 1\}$

• As an example:

      x1  x2  x3 | y
   1   0   0   1 | 0
   2   0   1   0 | 1
   3   1   1   0 | 1
   4   1   1   1 | 0

How many functions with three binary inputs and one binary output are there?

Page 36: CS 4375 Introduction to Machine Learning

Binary Classification

π’™π’™πŸπŸ π’™π’™πŸπŸ π‘₯π‘₯3 𝑦𝑦0 0 0 ?

1 0 0 1 02 0 1 0 1

0 1 1 ?1 0 0 ?1 0 1 ?

3 1 1 0 14 1 1 1 0

28 possible functions

24 are consistent with the observations

How do we choose the best one?

What if the observations are noisy?

36
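A brute-force check of these counts (a sketch; it enumerates every truth table over three binary inputs):

    from itertools import product

    observations = {(0, 0, 1): 0, (0, 1, 0): 1, (1, 1, 0): 1, (1, 1, 1): 0}
    inputs = list(product([0, 1], repeat=3))    # the 8 possible inputs

    # A function is one output choice per input row: 2^8 tables in total.
    functions = list(product([0, 1], repeat=len(inputs)))
    consistent = [f for f in functions
                  if all(f[inputs.index(x)] == y for x, y in observations.items())]
    print(len(functions), len(consistent))      # 256 16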

Page 37: CS 4375 Introduction to Machine Learning

Challenges in ML

• How to choose the right hypothesis space?

• A number of factors influence this decision: the difficulty of learning over the chosen space, how expressive the space is, …

• How to evaluate the quality of our learned hypothesis?

• Prefer "simpler" hypotheses (to prevent overfitting)

• Want the outcome of learning to generalize to unseen data

Page 38: CS 4375 Introduction to Machine Learning

Challenges in ML

• How do we find the best hypothesis?

• This can be an NP-hard problem!

• Need fast, scalable algorithms if they are to be applicable to real-world scenarios

Page 39: CS 4375 Introduction to Machine Learning

Other Types of Learning

• Unsupervised
  • The training data does not include the desired output

• Semi-supervised
  • Some training data comes with the desired output

• Active learning
  • Semi-supervised learning where the algorithm can ask for the correct outputs for specifically chosen data points

• Reinforcement learning
  • The learner interacts with the world via allowable actions, which change the state of the world and result in rewards
  • The learner attempts to maximize rewards through trial and error