TRANSCRIPT
CS 4375: Introduction to Machine Learning
Nicholas Ruozzi, University of Texas at Dallas
Slides adapted from David Sontag and Vibhav Gogate
Course Info
• Instructor: Nicholas Ruozzi
• Office: ECSS 3.409
• Office hours: M 11:30-12:30, W 12:30pm-1:30pm
• TA: Hailiang Dong
• Office hours and location: T 10:30-12:00, R 13:30-15:00
• Course website: www.utdallas.edu/~nicholas.ruozzi/cs4375/2019fa/
• Book: none required
• Piazza (online forum): sign-up link on eLearning
Prerequisites
• CS 3345, Data Structures and Algorithms
• CS 3341, Probability and Statistics in Computer Science
• "Mathematical sophistication"
  • Basic probability
  • Linear algebra: eigenvalues/eigenvectors, matrices, vectors, etc.
  • Multivariate calculus: derivatives, gradients, etc.
• I'll review some concepts as we come to them, but you should brush up on any areas that you aren't as comfortable with
• Take the prerequisite "quiz" on eLearning
Course Topics
• Dimensionality reduction
  • PCA
  • Matrix factorizations
• Learning
  • Supervised, unsupervised, active, reinforcement, …
  • SVMs & kernel methods
  • Decision trees, k-NN, logistic regression, …
  • Parameter estimation: Bayesian methods, MAP estimation, maximum likelihood estimation, expectation maximization, …
  • Clustering: k-means & spectral clustering
• Probabilistic models
  • Bayesian networks
  • Naïve Bayes
• Neural networks
• Evaluation
  • AUC, cross-validation, precision/recall
• Statistical methods
  • Boosting, bagging, bootstrapping
  • Sampling
Grading
• 5-6 problem sets (50%)
  • See collaboration policy on the web
  • Mix of theory and programming (in MATLAB or Python)
  • Available and turned in on eLearning
  • Approximately one assignment every two weeks
• Midterm exam (20%)
• Final exam (30%)
• Attendance policy?
(subject to change)
What is ML?
"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
- Tom Mitchell
Basic Machine Learning Paradigm
• Collect data
• Build a model using "training" data
• Use model to make predictions
Supervised Learning
• Input: (x^(1), y^(1)), …, (x^(M), y^(M))
  • x^(i) is the i-th data item and y^(i) is the i-th label
• Goal: find a function f such that f(x^(i)) is a "good approximation" to y^(i)
• Can use it to predict y values for previously unseen x values
Examples of Supervised Learning
• Spam email detection
• Handwritten digit recognition
• Stock market prediction
• More?
Supervised Learning
• Hypothesis space: set of allowable functions f : X → Y
• Goal: find the "best" element of the hypothesis space
  • How do we measure the quality of f?
Supervised Learning
• Simple linear regression
  • Input: pairs of points (x^(1), y^(1)), …, (x^(M), y^(M)) with x^(i) ∈ ℝ and y^(i) ∈ ℝ
  • Hypothesis space: set of linear functions f(x) = ax + b with a, b ∈ ℝ
  • Error metric: squared difference between the predicted value and the actual value
Regression
[Figure: scatter plot of the training pairs, y versus x, with a candidate line]
Hypothesis class: linear functions f(x) = ax + b
How do we compute the error of a specific hypothesis?
Linear Regression
• For any data point, x, the learning algorithm predicts f(x)
• In typical regression applications, measure the fit using a squared loss function

  L(f) = (1/M) Σ_m ( f(x^(m)) − y^(m) )²

• Want to minimize the average loss on the training data
• The optimal linear hypothesis is then given by

  min_{a,b} (1/M) Σ_m ( a x^(m) + b − y^(m) )²
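As a quick sketch in Python, the average squared loss of a candidate line is a one-liner; the helper name and the arrays x, y standing in for the training pairs are illustrative, not course-provided code:

```python
import numpy as np

def avg_squared_loss(a, b, x, y):
    """L(f) = (1/M) * sum over m of (f(x^(m)) - y^(m))^2 for f(x) = a*x + b."""
    return np.mean((a * x + b - y) ** 2)
```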
Linear Regression

  min_{a,b} (1/M) Σ_m ( a x^(m) + b − y^(m) )²

• How do we find the optimal a and b?
  • Solution 1: take derivatives and solve (there is a closed-form solution!)
  • Solution 2: use gradient descent
    • This approach is much more likely to be useful for general loss functions
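For reference, Solution 1 can be carried out explicitly: setting both partial derivatives to zero and solving gives the standard least-squares estimates, where x̄ and ȳ denote the sample means of the x^(m) and y^(m):

  a = Σ_m (x^(m) − x̄)(y^(m) − ȳ) / Σ_m (x^(m) − x̄)²
  b = ȳ − a x̄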
Gradient Descent
Iterative method to minimize a (convex) differentiable function f
A function f : ℝ^n → ℝ is convex if

  λ f(x) + (1 − λ) f(y) ≥ f(λx + (1 − λ)y)

for all λ ∈ [0,1] and all x, y ∈ ℝ^n
Gradient Descent
Iterative method to minimize a (convex) differentiable function f
• Pick an initial point x_0
• Iterate until convergence

  x_{t+1} = x_t − γ_t ∇f(x_t)

where γ_t is the t-th step size (sometimes called the learning rate)
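A minimal Python sketch of this update rule; the function name, the fixed step size, and the fixed iteration count are illustrative choices rather than anything from the slides:

```python
def gradient_descent(grad, x0, step=0.8, iters=30):
    """Iterate x_{t+1} = x_t - step * grad(x_t), starting from x0."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# e.g., minimize f(x) = x**2, whose gradient is 2x:
x_star = gradient_descent(lambda x: 2 * x, x0=-4.0)  # tends toward 0
```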
Gradient Descent

f(x) = x², x^(0) = −4, step size: 0.8

  x^(1) = −4 − 0.8 ⋅ 2 ⋅ (−4) = 2.4
  x^(2) = 2.4 − 0.8 ⋅ 2 ⋅ 2.4 = −1.44
  x^(3) = 0.864
  x^(4) = −0.5184
  x^(5) = 0.31104
  ⋮
  x^(30) ≈ −8.8e−07
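The trace above can be reproduced in a few lines of Python (the gradient of x² is 2x, so each step multiplies the iterate by −0.6):

```python
x = -4.0
for t in range(1, 31):
    x = x - 0.8 * (2 * x)  # x_{t+1} = x_t - 0.8 * grad = -0.6 * x_t
    print(t, x)            # 2.4, -1.44, 0.864, ..., ~ -8.8e-07 at t = 30
```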
Gradient Descent

  min_{a,b} (1/M) Σ_m ( a x^(m) + b − y^(m) )²

• What is the gradient of this function?
• What does a gradient descent iteration look like for this simple regression problem?
(on board)
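For reference, differentiating the average squared loss gives

  ∂L/∂a = (2/M) Σ_m ( a x^(m) + b − y^(m) ) x^(m)
  ∂L/∂b = (2/M) Σ_m ( a x^(m) + b − y^(m) )

so a single gradient descent iteration updates a ← a − γ_t ∂L/∂a and b ← b − γ_t ∂L/∂b.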
Linear Regression
• In higher dimensions, the linear regression problem is essentially the same, with x^(i) ∈ ℝ^n

  min_{w ∈ ℝ^n, b} (1/M) Σ_m ( wᵀx^(m) + b − y^(m) )²

• Can still use gradient descent to minimize this
• Not much more difficult than the n = 1 case
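A sketch of the n-dimensional case in numpy, using the gradients derived above; the step size and iteration count here are arbitrary assumptions, not tuned values:

```python
import numpy as np

def fit_linear(X, y, step=0.1, iters=1000):
    """Gradient descent on (1/M) * sum over m of (w . x^(m) + b - y^(m))^2."""
    M, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        r = X @ w + b - y                  # residuals, shape (M,)
        w -= step * (2.0 / M) * (X.T @ r)  # dL/dw
        b -= step * (2.0 / M) * r.sum()    # dL/db
    return w, b
```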
Gradient Descent
• Gradient descent converges under certain technical conditions on the function f and the step size γ_t
• If f is convex, then any fixed point of gradient descent must correspond to a global minimum of f
• In general, for a nonconvex function, it may only converge to a local optimum
Regression
• What if we enlarge the hypothesis class?
  • Quadratic functions: ax² + bx + c
  • d-degree polynomials: a_d x^d + a_{d−1} x^{d−1} + ⋯ + a_1 x + a_0

  min_{a_d, …, a_0} (1/M) Σ_m ( a_d (x^(m))^d + ⋯ + a_1 x^(m) + a_0 − y^(m) )²

• Can we always learn "better" with a larger hypothesis class?
Regression
• A larger hypothesis space always decreases the cost function, but this does NOT necessarily mean better predictive performance
  • This phenomenon is known as overfitting
• Ideally, we would select the simplest hypothesis consistent with the observed data
• In practice, we cannot simply evaluate our learned hypothesis on the training data; we want it to perform well on unseen data (otherwise, we could just memorize the training data!)
  • Report the loss on some held-out test data (i.e., data not used as part of the training process)
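To see overfitting concretely, this sketch fits polynomials of increasing degree to a small synthetic dataset (all data here is made up for illustration) and compares training error with error on a held-out split:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = 2 * x + 1 + rng.normal(0, 0.2, 30)  # noisy linear data (synthetic)
x_tr, y_tr = x[:20], y[:20]             # training split
x_te, y_te = x[20:], y[20:]             # held-out test split

for d in (1, 3, 9, 15):
    c = np.polyfit(x_tr, y_tr, d)       # least-squares polynomial fit
    tr = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(c, x_te) - y_te) ** 2)
    print(f"degree {d:2d}: train {tr:.4f}, test {te:.4f}")
# Training loss only shrinks as the degree grows; test loss eventually rises.
```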
Binary Classification
• Regression operates over a continuous set of outcomes
• Suppose that we want to learn a function f : X → {0,1}
• As an example:

  x₁ x₂ x₃ | y
  0  0  1 | 0
  0  1  0 | 1
  1  1  0 | 1
  1  1  1 | 0

How do we pick the hypothesis space?
How do we find the best f in this space?
Binary Classification
• Regression operates over a continuous set of outcomes
• Suppose that we want to learn a function f : X → {0,1}
• As an example:

  x₁ x₂ x₃ | y
  0  0  1 | 0
  0  1  0 | 1
  1  1  0 | 1
  1  1  1 | 0

How many functions with three binary inputs and one binary output are there?
Binary Classification

  x₁ x₂ x₃ | y
  0  0  0 | ?
  0  0  1 | 0
  0  1  0 | 1
  0  1  1 | ?
  1  0  0 | ?
  1  0  1 | ?
  1  1  0 | 1
  1  1  1 | 0

2^8 = 256 possible functions
2^4 = 16 are consistent with the observations
How do we choose the best one?
What if the observations are noisy?
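The counting argument can be checked directly in Python by brute-force enumeration of all 2^8 Boolean functions on three inputs (a throwaway sketch, not course code):

```python
from itertools import product

inputs = list(product((0, 1), repeat=3))  # the 8 possible input triples
observed = {(0, 0, 1): 0, (0, 1, 0): 1, (1, 1, 0): 1, (1, 1, 1): 0}

# Each function assigns one output bit per input row: 2**8 = 256 in total.
consistent = sum(
    all(dict(zip(inputs, outs))[x] == y for x, y in observed.items())
    for outs in product((0, 1), repeat=len(inputs))
)
print(consistent)  # 16 = 2**4: one free choice per unobserved input
```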
Challenges in ML
• How to choose the right hypothesis space?
  • A number of factors influence this decision: the difficulty of learning over the chosen space, how expressive the space is, …
• How to evaluate the quality of our learned hypothesis?
  • Prefer "simpler" hypotheses (to prevent overfitting)
  • Want the outcome of learning to generalize to unseen data
Challenges in ML
• How do we find the best hypothesis?
  • This can be an NP-hard problem!
  • Need fast, scalable algorithms if they are to be applicable to real-world scenarios
Other Types of Learning
• Unsupervised
  • The training data does not include the desired output
• Semi-supervised
  • Some training data comes with the desired output
• Active learning
  • Semi-supervised learning where the algorithm can ask for the correct outputs for specifically chosen data points
• Reinforcement learning
  • The learner interacts with the world via allowable actions which change the state of the world and result in rewards
  • The learner attempts to maximize rewards through trial and error