Machine Learning
Instructor: Pranjal Awasthi
Course Info
• Requested an SPN and emailed me?
– Wait for Carol Difrancesco to give them out.
• Not registered and need an SPN?
– Email me after class.
– No promises.
• It’s a large class, so I won’t allow sitting in without registering. Sorry.
Course Staff
• Instructor: Pranjal Awasthi ([email protected])
• Research Interests:
– Semi-supervised learning
– Clustering
– Online learning
– Learning theory
• Office hours: Monday 2-3pm. Hill 448.
• Course website: www.rci.rutgers.edu/~pa336/ml_s16.html
Course Staff
• TA: Yan Zhu ([email protected])
• Research Interests:
– Large scale machine learning
– Deep learning
– Computer vision
• Office hours: Friday 11am-12pm. CBIM.
Course Info
• No required textbook
• Recommended textbooks: [covers shown on slide]
Course Info
• ~ 5 Homeworks (40%)
• In-class midterm (30%) (March 10 – no makeup exam)
• Final project (30%)
• Zero tolerance for cheating
– Academic integrity policy
• Grading:
– A: 90-100, B+: 85-89, B: 80-84, C+: 75-79, C: 70-74, F: below 70
Homework Policy
• ~2 weeks/HW. Submit via Sakai.
• Should be typeset in LaTeX. See website.
• Late homeworks not accepted.
• No regrading policy.
• TA is the boss.
• Encouraged to discuss
– Write solution in your own words
– Write names of people you discussed with
Start Early!
Homework Policy
• Typically two parts
– Conceptual/Analytical
– Programming
• Conceptual – justify your solution; give rigorous proofs when asked for. Aims to test fundamentals.
• Programming – Matlab for homeworks. Justify your findings. Submit well-documented code and make sure it runs.
• HW0 up on the webpage – no need to submit
A word about the course
• The course is designed to be tough
– More theoretical than previous courses.
– Should be comfortable with basic probability, linear algebra, algorithms.
– If you cannot do HW0, consider dropping the course.
– Come to lectures, ask questions.
– Take notes.
– Play around with data and methods.
How can I do well?
What is Machine Learning?
The study of algorithms that improve performance on a given task with time and experience.
The science of making sound inference and predictions from data.
The part of AI that is actually useful!
[Diagram: ML at the intersection of Statistics and Computer Science]
History of Machine Learning?
• Pre 1950’s – Statistics and probability theory
• 1950-80’s – The AI phase
• Post 90’s – modern machine learning
Pre 1950’s
• Collection and analysis of data has always been around
– Traditionally for governance and politics
– 1500’s – collection of data on deaths, marriages, baptisms in England and France.
– Analyzed by humans.
– Not very scientific.
• 1700’s: Probability theory became a big tool
– Lots of work on studying gambling.
Pre 1950’s
• Pearson: Analyzed crab population near Naples
• Wanted to understand the nature of the population.
• Claimed that there are two underlying species.
Statistical Modeling
Pre 1950’s
Experimental Design
I can taste and tell whether the tea was added first or the milk.
Hmm… how do I verify that? (Fisher's famous "lady tasting tea" experiment)
Pre 1950’s
• Lots of fundamental questions that are still relevant
– How to design an experiment?
– How to collect data and do survey/polls?
– How to choose between different hypotheses?
– How to understand hidden structure in the data?
Post 1950’s
• CS enters
• AI is coined
• Can intelligent machines be built?
• Turing test
Post 1950’s
• 1952: Arthur Samuel's program for checkers
Post 1950’s
• 60’s: ELIZA
Post 1950’s
• 70’s: MYCIN for medical diagnosis
– Knowledge base of ~600 rules
• Most machine learning systems were rule-based or knowledge-based
• Limitations quickly became clear
Post 90’s
• Statistical machine learning
• Data driven algorithm design
Modern ML
[Diagram: data → ML algorithm → predictions]
What you’ll learn in this course
Support vector machines, naïve Bayes, logistic regression, linear regression, decision trees, boosting, graphical models, reinforcement learning, deep learning, model selection, optimization, kernel methods, learning theory, Bayesian methods, semi-supervised learning.
Probability Overview
• Random variable X
– a map from a set Ω to ℝ
• Ω equipped with a probability measure P.
– P(X ∈ A) = P({𝜔 ∈ Ω : X(𝜔) ∈ A})
– X has distribution P, denoted X ∼ P.
Probability Overview
• Cumulative Distribution Function (cdf)
– F_X(x) = P(X ≤ x)
• If X is discrete
– probability mass function (pmf), p(x)
– P(X = x) = p(x)
• If X is continuous
– probability density function (pdf), p(x)
– P(X ∈ A) = ∫_A p(x) dx
Probability Overview
• Expected value of 𝑋
– E[X] = ∫ x p(x) dx (continuous)
– E[X] = ∑_x x p(x) (discrete)
• Variance of X
– Var(X) = E[(X − E[X])²]
– = E[X²] − (E[X])²
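As a quick sanity check of the two variance formulas above, here is a minimal sketch in Python (the homeworks use Matlab; this snippet and its toy distribution are my own illustration, not from the slides):

import numpy as np

# A small made-up discrete distribution: values and their probabilities.
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])

mean = np.sum(x * p)                 # E[X] = sum_x x p(x)
second_moment = np.sum(x**2 * p)     # E[X^2]

var_def = np.sum((x - mean)**2 * p)  # Var(X) = E[(X - E[X])^2]
var_alt = second_moment - mean**2    # Var(X) = E[X^2] - (E[X])^2

print(mean, var_def, var_alt)        # the two variance computations agree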
Probability Overview
• Independence: 𝑋 and 𝑌 are independent iff
– P(X ∈ A, Y ∈ B) = P(X ∈ A) · P(Y ∈ B), ∀A, B
• Covariance between 𝑋 and 𝑌
– Cov(X, Y) = E[(X − E[X]) (Y − E[Y])]
• If 𝑋 and 𝑌 are independent then
– Cov(X, Y) = 0
– Var(X + Y) = Var(X) + Var(Y)
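A quick empirical illustration of these two facts, sketched in Python (the independent normal samples are my own choice, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)   # drawn independently of x

cov = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(X, Y) estimate
print(f"Cov(X,Y) ~ {cov:.4f}")                   # close to 0 for independent samples
print(f"Var(X+Y) ~ {np.var(x + y):.3f} vs Var(X)+Var(Y) ~ {np.var(x) + np.var(y):.3f}")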
Probability Overview
• Conditional distribution
– Distribution of 𝑋 conditioned on 𝑌 = 𝑦
– pdf: p(x | y) = p(x, y) / p(y)
• Joint distribution of 𝑋 and 𝑌
– pdf: 𝑝(𝑥, 𝑦)
– marginal density of x: p(x) = ∫ p(x, y) dy
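For discrete variables the marginal and conditional are just sums and ratios over a joint pmf table; a minimal sketch (the 2×2 joint pmf below is made up):

import numpy as np

# Hypothetical joint pmf p(x, y): rows index x in {0, 1}, columns index y in {0, 1}.
p_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])

p_x = p_xy.sum(axis=1)    # marginal p(x) = sum over y of p(x, y)
p_y = p_xy.sum(axis=0)    # marginal p(y) = sum over x of p(x, y)

p_x_given_y = p_xy / p_y  # conditional p(x | y) = p(x, y) / p(y), column-wise

print(p_x, p_y)
print(p_x_given_y.sum(axis=0))   # each column of p(x | y) sums to 1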
Probability Inequalities
• Markov's inequality
– If X ≥ 0, P(X > t · E[X]) ≤ 1/t
• Chebyshev's inequality
– P(|X − E[X]| ≥ t · √Var(X)) ≤ 1/t²
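Both bounds are easy to check by simulation; here is a sketch for Markov's inequality using an exponential random variable (my choice of distribution, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # nonnegative, E[X] = 2

for t in [2, 5, 10]:
    empirical = np.mean(x > t * x.mean())      # estimate of P(X > t E[X])
    print(f"t={t}: empirical {empirical:.4f} <= Markov bound {1/t:.4f}")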
Probability Inequalities
• Let X₁, X₂, …, Xₙ be independent and identically distributed (i.i.d.), taking values in {0,1}.
– E[Xᵢ] = 𝜇
– X̄ₙ = (1/n) ∑ᵢ Xᵢ
• Chernoff bound: For 𝛿 ∈ [0,1],
– P(X̄ₙ > 𝜇(1 + 𝛿)) ≤ e^(−n𝜇𝛿²/3)
– P(X̄ₙ < 𝜇(1 − 𝛿)) ≤ e^(−n𝜇𝛿²/2)
Probability Inequalities
• Let X₁, X₂, …, Xₙ be independent and identically distributed (i.i.d.), taking values in {0,1}.
– E[Xᵢ] = 𝜇
– X̄ₙ = (1/n) ∑ᵢ Xᵢ
• Hoeffding bound: For 𝛿 ∈ [0,1],
– P(X̄ₙ > 𝜇 + 𝛿) ≤ e^(−2n𝛿²)
– P(X̄ₙ < 𝜇 − 𝛿) ≤ e^(−2n𝛿²)
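These exponential tail bounds can also be verified empirically; a minimal sketch for the Hoeffding upper tail with fair-coin flips (the parameter choices are mine):

import numpy as np

rng = np.random.default_rng(1)
mu, n, delta, trials = 0.5, 100, 0.1, 20_000

# Draw `trials` independent sample means of n Bernoulli(mu) variables each.
means = rng.binomial(n, mu, size=trials) / n

empirical = np.mean(means > mu + delta)   # estimate of P(X̄ₙ > 𝜇 + 𝛿)
bound = np.exp(-2 * n * delta**2)         # Hoeffding bound e^(−2n𝛿²)
print(f"empirical {empirical:.4f} <= bound {bound:.4f}")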
Onto new content
Point Estimation
• Goal: Estimate the bias of a coin.
Why?? I came here to master deep learning!
Point Estimation
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Your idea: toss it a few times and see…..
• What is the estimate?
• How many flips needed?
Point Estimation
• A random variable X distributed according to D(𝜃)
• Given i.i.d. samples from D
• Goal: Estimate 𝜃
• Three methods
– Method of moments(MoM)
– Maximum Likelihood Estimation(MLE)
– Bayesian Estimation
Method of Moments
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Idea: match observed distribution to true distribution
– Moments: an elegant way to achieve this.
Method of Moments
• A random variable X distributed according to D(𝜃)
• Given i.i.d. samples from D
• Goal: Estimate 𝜃
• Moments of X
– k-th moment: E[Xᵏ], k = 1, 2, …
Method of Moments
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Idea: match observed distribution to true distribution
– Moments: an elegant way to achieve this.
– All moments of our distribution are p.
– What about moments of the observed data?
Method of Moments
– All moments of our distribution are p (since Xᵏ = X when X ∈ {0,1}).
– All moments of the observed data are X̄ₙ = (1/n) ∑ᵢ Xᵢ, the fraction of heads.
– Matching moments gives the estimate p̂ = X̄ₙ.
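Putting the estimator to work is essentially one line of code; a minimal sketch in Python (the true bias p below is made up, and in practice it is of course unknown):

import numpy as np

rng = np.random.default_rng(2)
p_true, n = 0.3, 1000   # hypothetical bias and sample size

flips = rng.binomial(1, p_true, size=n)   # n i.i.d. coin flips in {0, 1}
p_hat = flips.mean()    # MoM estimate: match the first moment, E[X] = p

print(f"true p = {p_true}, MoM estimate = {p_hat:.3f}")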
Method of Moments
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• How good is the estimate?
• How many samples (n) do we need?
Method of Moments
• How good is the estimate?
• Need a notion of error
– Mean squared error (MSE): MSE(p̂) = E[(p̂ − p)²]
– For p̂ = X̄ₙ: MSE(p̂) = Var(X̄ₙ) = p(1 − p)/n ≤ 1/(4n)
Method of Moments
• How many samples?
– To guarantee MSE ≤ 𝜖², n = p(1 − p)/𝜖² ≤ 1/(4𝜖²) samples suffice (see the sketch below).
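The 1/n decay of the MSE shows up directly in simulation; a sketch (sample sizes, bias, and trial count are my choices):

import numpy as np

rng = np.random.default_rng(3)
p, trials = 0.3, 50_000

for n in [10, 100, 1000]:
    p_hats = rng.binomial(n, p, size=trials) / n   # many independent estimates
    mse = np.mean((p_hats - p)**2)                 # empirical E[(p̂ − p)²]
    print(f"n={n}: empirical MSE {mse:.5f} vs p(1-p)/n = {p*(1-p)/n:.5f}")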
Point Estimation
• A random variable X distributed according to D(𝜃)
• Given i.i.d. samples from D
• Goal: Estimate 𝜃
• Your estimate: 𝜃̂(X₁, …, Xₙ)
• Is MSE always equal to the variance of the estimator?
– In general, MSE(𝜃̂) = E[(𝜃̂ − 𝜃)²] = Var(𝜃̂) + (E[𝜃̂] − 𝜃)², i.e., variance plus squared bias; the two coincide exactly when 𝜃̂ is unbiased.
Method of Moments
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Idea: match observed distribution to true distribution
– Moments: an elegant way to achieve this.
– A natural approach.
– Matching moments = solving a system of equations.
– Equations get messy pretty soon!
– Limited algorithmic tools, limited theory.
Maximum Likelihood Estimation
• Given a coin
– Comes up heads (1) with probability p.
– Comes up tails (0) with probability 1 − p.
• Estimate p?
• Idea: find the p that is most likely to generate the given data.
Maximum Likelihood Estimation
• A random variable X distributed according to D(𝜃)
• Given i.i.d. samples from D
• Goal: Estimate 𝜃
• Idea: output the 𝜃̂ that is most likely to generate the data.
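For the coin, "most likely to generate the data" means maximizing the Bernoulli likelihood ∏ᵢ p^(xᵢ) (1 − p)^(1 − xᵢ) over p. A minimal sketch comparing a grid search over the log-likelihood with the closed-form answer k/n, the fraction of heads (the data and true bias below are made up):

import numpy as np

rng = np.random.default_rng(4)
flips = rng.binomial(1, 0.3, size=500)   # hypothetical data, true p = 0.3
k, n = flips.sum(), len(flips)

# Log-likelihood of Bernoulli(p): k log p + (n − k) log(1 − p).
grid = np.linspace(0.001, 0.999, 999)
loglik = k * np.log(grid) + (n - k) * np.log(1 - grid)

print("grid-search MLE:", grid[np.argmax(loglik)])
print("closed form k/n:", k / n)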