Machine Learning
Instructor: Pranjal Awasthi
Course Info
• Requested an SPN and emailed me?
– Wait for Carol Difrancesco to give them out.
• Not registered and need an SPN?
– Email me after class.
– No promises.
• It’s a large class, so I won’t allow sitting in without registering. Sorry.
Course Staff
• Instructor: Pranjal Awasthi ([email protected])
• Research Interests:
– Semi-supervised learning
– Clustering
– Online learning
– Learning theory
• Office hours: Monday 2-3pm. Hill 448.
• Course website: www.rci.rutgers.edu/~pa336/ml_s16.html
Course Staff
• TA: Yan Zhu ([email protected])
• Research Interests:
– Large scale machine learning
– Deep learning
– Computer vision
• Office hours: Friday 11am-12pm. CBIM.
Course Info
• No required textbook
• Recommended textbooks: [covers shown on slide]
Course Info
• ~ 5 Homeworks (40%)
• In-class midterm (30%) (March 10 – no makeup exam)
• Final project (30%)
• Zero tolerance for cheating
– Academic integrity policy
• Grading:
– A: 90-100, B+: 85-89, B: 80-84, C+: 75-79, C: 70-74, F: below 70
Homework Policy
• ~2 weeks/HW. Submit via Sakai.
• Should be typeset in LaTeX. See website.
• Late homeworks not accepted.
• No regrading policy.
• TA is the boss.
• Encouraged to discuss
– Write solution in your own words
– Write names of people you discussed with
Start Early!
Homework Policy
• Typically two parts
– Conceptual/Analytical
– Programming
• Conceptual – justify your solution; give rigorous proofs when asked for. Aims to test fundamentals.
• Programming – Matlab for homeworks. Justify your findings. Submit well-documented code and make sure it runs.
• HW0 up on the webpage – no need to submit
A word about the course
• The course is designed to be tough
– More theoretical than previous courses.
– Should be comfortable with basic probability, linear algebra, algorithms.
– If you cannot do HW0, consider dropping the course.
– Come to lectures, ask questions.
– Take notes.
– Play around with data and methods.
How can I do well?
What is Machine Learning?
The study of algorithms that improve performance on a given task with time and experience.
The science of making sound inference and predictions from data.
The part of AI that is actually useful!
[Diagram: ML at the intersection of Statistics and Computer Science]
History of Machine Learning?
• Pre 1950’s – Statistics and probability theory
• 1950-80’s – The AI phase
• Post 90’s – modern machine learning
Pre 1950’s
• Collection and analysis of data has always been around
– Traditionally for governance and politics
– 1500’s – collection of data on deaths, marriages, baptisms in England and France.
– Analyzed by humans.
– Not very scientific.
• 1700’s: Probability theory became a big tool
– Lots of work on studying gambling.
Pre 1950’s
• Pearson: Analyzed crab population near Naples
• Wanted to understand the nature of the population.
• Claimed that there are two underlying species.
Statistical Modeling
Pre 1950’s
Experimental Design
I can taste and tell whether the tea was added first or the milk.
Hmm… how do I verify that? (Fisher's famous "lady tasting tea" experiment)
Pre 1950’s
• Lots of fundamental questions that are still relevant
– How to design an experiment?
– How to collect data and do survey/polls?
– How to choose between different hypotheses?
– How to understand hidden structure in the data?
Post 1950’s
• CS enters
• AI is coined
• Can intelligent machines be built?
• Turing test
Post 1950’s
• 1952: Arthur Samuel's program for checkers
Post 1950’s
• 60’s: ELIZA
Post 1950’s
• 70’s: MYCIN for medical diagnosis
– Knowledge base of ~600 rules
• Most machine learning systems were rule-based or knowledge-based
• Limitations quickly became clear
Post 90’s
• Statistical machine learning
• Data driven algorithm design
Modern ML
[Diagram: data → ML algorithm → predictions]
What you’ll learn in this course
Support vector machines, naïve Bayes, logistic regression, linear regression, decision trees, boosting, graphical models, reinforcement learning, deep learning, model selection, optimization, kernel methods, learning theory, Bayesian methods, semi-supervised learning.
Probability Overview
• Random variable X
– a map from a set Ω to ℝ
• Ω equipped with a probability measure P.
– P(X ∈ A) = P({𝜔 ∈ Ω : X(𝜔) ∈ A})
– X has distribution P, denoted X ∼ P.
Probability Overview
• Cumulative Distribution Function (cdf)
– F_X(x) = P(X ≤ x)
• If X is discrete
– probability mass function (pmf), p(x)
– P(X = x) = p(x)
• If X is continuous
– probability density function (pdf), p(x)
– P(X ∈ A) = ∫_A p(x) dx
Probability Overview
• Expected value of 𝑋
– E[X] = ∫ x p(x) dx (continuous)
– E[X] = ∑_x x p(x) (discrete)
• Variance of X
– Var(X) = E[(X − E[X])²]
– = E[X²] − (E[X])²
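As a quick sanity check of the two variance formulas above, here is a minimal sketch in Python (the homeworks use Matlab; this snippet and its toy distribution are my own illustration, not from the slides):

import numpy as np

# A small made-up discrete distribution: values and their probabilities.
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])

mean = np.sum(x * p)                 # E[X] = sum_x x p(x)
second_moment = np.sum(x**2 * p)     # E[X^2]

var_def = np.sum((x - mean)**2 * p)  # Var(X) = E[(X - E[X])^2]
var_alt = second_moment - mean**2    # Var(X) = E[X^2] - (E[X])^2

print(mean, var_def, var_alt)        # the two variance computations agree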
Probability Overview
• Independence: 𝑋 and 𝑌 are independent iff
– P(X ∈ A, Y ∈ B) = P(X ∈ A) · P(Y ∈ B), ∀A, B
• Covariance between 𝑋 and 𝑌
– Cov(X, Y) = E[(X − E[X]) (Y − E[Y])]
• If 𝑋 and 𝑌 are independent then
– Cov(X, Y) = 0
– Var(X + Y) = Var(X) + Var(Y)
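A quick empirical illustration of these two facts, sketched in Python (the independent normal samples are my own choice, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)   # drawn independently of x

cov = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(X, Y) estimate
print(f"Cov(X,Y) ~ {cov:.4f}")                   # close to 0 for independent samples
print(f"Var(X+Y) ~ {np.var(x + y):.3f} vs Var(X)+Var(Y) ~ {np.var(x) + np.var(y):.3f}")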
Probability Overview
• Conditional distribution
– Distribution of 𝑋 conditioned on 𝑌 = 𝑦
– pdf: p(x | y) = p(x, y) / p(y)
• Joint distribution of 𝑋 and 𝑌
– pdf: 𝑝(𝑥, 𝑦)
– marginal density of x: p(x) = ∫ p(x, y) dy
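For discrete variables the marginal and conditional are just sums and ratios over a joint pmf table; a minimal sketch (the 2×2 joint pmf below is made up):

import numpy as np

# Hypothetical joint pmf p(x, y): rows index x in {0, 1}, columns index y in {0, 1}.
p_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])

p_x = p_xy.sum(axis=1)    # marginal p(x) = sum over y of p(x, y)
p_y = p_xy.sum(axis=0)    # marginal p(y) = sum over x of p(x, y)

p_x_given_y = p_xy / p_y  # conditional p(x | y) = p(x, y) / p(y), column-wise

print(p_x, p_y)
print(p_x_given_y.sum(axis=0))   # each column of p(x | y) sums to 1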
Probability Inequalities
• Markov's inequality
– If X ≥ 0, P(X > t · E[X]) ≤ 1/t
• Chebyshev's inequality
– P(|X − E[X]| ≥ t · √Var(X)) ≤ 1/t²
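Both bounds are easy to check by simulation; here is a sketch for Markov's inequality using an exponential random variable (my choice of distribution, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # nonnegative, E[X] = 2

for t in [2, 5, 10]:
    empirical = np.mean(x > t * x.mean())      # estimate of P(X > t E[X])
    print(f"t={t}: empirical {empirical:.4f} <= Markov bound {1/t:.4f}")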
Probability Inequalities
• Let X₁, X₂, …, Xₙ be independent and identically distributed (i.i.d.), taking values in {0,1}.
– E[Xᵢ] = 𝜇
– X̄ₙ = (1/n) ∑ᵢ Xᵢ
• Chernoff bound: For 𝛿 ∈ [0,1],
– P(X̄ₙ > 𝜇(1 + 𝛿)) ≤ e^(−n𝜇𝛿²/3)
– P(X̄ₙ < 𝜇(1 − 𝛿)) ≤ e^(−n𝜇𝛿²/2)
Probability Inequalities
• Let X₁, X₂, …, Xₙ be independent and identically distributed (i.i.d.), taking values in {0,1}.
– E[Xᵢ] = 𝜇
– X̄ₙ = (1/n) ∑ᵢ Xᵢ
• Hoeffding bound: For 𝛿 ∈ [0,1],
– P(X̄ₙ > 𝜇 + 𝛿) ≤ e^(−2n𝛿²)
– P(X̄ₙ < 𝜇 − 𝛿) ≤ e^(−2n𝛿²)
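These exponential tail bounds can also be verified empirically; a minimal sketch for the Hoeffding upper tail with fair-coin flips (the parameter choices are mine):

import numpy as np

rng = np.random.default_rng(1)
mu, n, delta, trials = 0.5, 100, 0.1, 20_000

# Draw `trials` independent sample means of n Bernoulli(mu) variables each.
means = rng.binomial(n, mu, size=trials) / n

empirical = np.mean(means > mu + delta)   # estimate of P(X̄ₙ > 𝜇 + 𝛿)
bound = np.exp(-2 * n * delta**2)         # Hoeffding bound e^(−2n𝛿²)
print(f"empirical {empirical:.4f} <= bound {bound:.4f}")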
Onto new content
Point Estimation
• Goal: Estimate the bias of a coin.
Why?? I came here to master deep learning!
Point Estimation
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Your idea: toss it a few times and see…..
• What is the estimate?
• How many flips needed?
Point Estimation
• A random variable X distributed according to D(𝜃)
• Given i.i.d. samples from D
• Goal: Estimate 𝜃
• Three methods
– Method of moments(MoM)
– Maximum Likelihood Estimation(MLE)
– Bayesian Estimation
Method of Moments
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Idea: match observed distribution to true distribution
– Moments: an elegant way to achieve this.
Method of Moments
• A random variable X distributed according to D(𝜃)
• Given i.i.d. samples from D
• Goal: Estimate 𝜃
• Moments of X
– k-th moment: E[Xᵏ], k = 1, 2, …
Method of Moments
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Idea: match observed distribution to true distribution
– Moments: an elegant way to achieve this.
– All moments of our distribution are p.
– What about moments of the observed data?
Method of Moments
– All moments of our distribution are p (since Xᵏ = X when X ∈ {0,1}).
– All moments of the observed data are X̄ₙ = (1/n) ∑ᵢ Xᵢ, the fraction of heads.
– Matching moments gives the estimate p̂ = X̄ₙ.
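Putting the estimator to work is essentially one line of code; a minimal sketch in Python (the true bias p below is made up, and in practice it is of course unknown):

import numpy as np

rng = np.random.default_rng(2)
p_true, n = 0.3, 1000   # hypothetical bias and sample size

flips = rng.binomial(1, p_true, size=n)   # n i.i.d. coin flips in {0, 1}
p_hat = flips.mean()    # MoM estimate: match the first moment, E[X] = p

print(f"true p = {p_true}, MoM estimate = {p_hat:.3f}")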
Method of Moments
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• How good is the estimate?
• How many samples (n) do we need?
Method of Moments
• How good is the estimate?
• Need a notion of error
– Mean squared error (MSE): MSE(p̂) = E[(p̂ − p)²]
– For p̂ = X̄ₙ: MSE(p̂) = Var(X̄ₙ) = p(1 − p)/n ≤ 1/(4n)
Method of Moments
• How many samples?
– To guarantee MSE ≤ 𝜖², n = p(1 − p)/𝜖² ≤ 1/(4𝜖²) samples suffice (see the sketch below).
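The 1/n decay of the MSE shows up directly in simulation; a sketch (sample sizes, bias, and trial count are my choices):

import numpy as np

rng = np.random.default_rng(3)
p, trials = 0.3, 50_000

for n in [10, 100, 1000]:
    p_hats = rng.binomial(n, p, size=trials) / n   # many independent estimates
    mse = np.mean((p_hats - p)**2)                 # empirical E[(p̂ − p)²]
    print(f"n={n}: empirical MSE {mse:.5f} vs p(1-p)/n = {p*(1-p)/n:.5f}")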
Point Estimation
• A random variable X distributed according to D(𝜃)
• Given i.i.d. samples from D
• Goal: Estimate 𝜃
• Your estimate: 𝜃̂(X₁, …, Xₙ)
• Is MSE always equal to the variance of the estimator?
– In general, MSE(𝜃̂) = E[(𝜃̂ − 𝜃)²] = Var(𝜃̂) + (E[𝜃̂] − 𝜃)², i.e., variance plus squared bias; the two coincide exactly when 𝜃̂ is unbiased.
Method of Moments
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Idea: match observed distribution to true distribution
– Moments: an elegant way to achieve this.
– A natural approach.
– Matching moments = solving a system of equations.
– Equations get messy pretty soon!
– Limited algorithmic tools, limited theory.
Maximum Likelihood Estimation
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Idea: find the p that is most likely to generate the given data.
Maximum Likelihood Estimation
• A random variable X distributed according to D(𝜃)
• Given i.i.d. samples from D
• Goal: Estimate 𝜃
• Idea: output the 𝜃̂ that is most likely to generate the data.
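For the coin, "most likely to generate the data" means maximizing the Bernoulli likelihood ∏ᵢ p^(xᵢ) (1 − p)^(1 − xᵢ) over p. A minimal sketch comparing a grid search over the log-likelihood with the closed-form answer k/n, the fraction of heads (the data and true bias below are made up):

import numpy as np

rng = np.random.default_rng(4)
flips = rng.binomial(1, 0.3, size=500)   # hypothetical data, true p = 0.3
k, n = flips.sum(), len(flips)

# Log-likelihood of Bernoulli(p): k log p + (n − k) log(1 − p).
grid = np.linspace(0.001, 0.999, 999)
loglik = k * np.log(grid) + (n - k) * np.log(1 - grid)

print("grid-search MLE:", grid[np.argmax(loglik)])
print("closed form k/n:", k / n)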