
Point Estimation
Linear Regression

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

January 12th, 2005

Announcements

Recitations – New Day and Room
Doherty Hall 1212
Thursdays – 5-6:30pm
Starting January 20th

Use mailing list: 701-instructors@boysenberry.srv.cs.cmu.edu

Your first consulting job

A billionaire from the suburbs of Seattle asks you a question:

He says: I have a thumbtack; if I flip it, what’s the probability it will fall with the nail up?
You say: Please flip it a few times:

You say: The probability is:

He says: Why???
You say: Because…

Thumbtack – Binomial Distribution

P(Heads) = θ, P(Tails) = 1-θ

Flips are i.i.d.:
Independent events
Identically distributed according to Binomial distribution

Sequence D of αH Heads and αT Tails
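For concreteness, the likelihood of a particular sequence D with αH heads and αT tails (a standard consequence of the definitions above) is

    P(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}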

Maximum Likelihood Estimation

Data: Observed set D of αH Heads and αT Tails
Hypothesis: Binomial distribution
Learning θ is an optimization problem

What’s the objective function?

MLE: Choose θ that maximizes the probability of observed data:
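In symbols, this is the usual maximum-likelihood objective (the log is taken for convenience, since log is monotonic):

    \hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \ln P(D \mid \theta)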

Your first learning algorithm

Set derivative to zero:
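A sketch of the derivation, using the likelihood above: differentiate the log-likelihood with respect to θ and set it to zero.

    \frac{d}{d\theta} \left[ \alpha_H \ln\theta + \alpha_T \ln(1-\theta) \right]
      = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
      \;\Longrightarrow\;
      \hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}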

How many flips do I need?

Billionaire says: I flipped 3 heads and 2 tails.
You say: θ = 3/5, I can prove it!
He says: What if I flipped 30 heads and 20 tails?
You say: Same answer, I can prove it!

He says: What’s better?
You say: Humm… The more the merrier???
He says: Is this why I am paying you the big bucks???

Simple bound (based on Hoeffding’s inequality)

For N = αH + αT, and θ̂ = αH/N (the MLE estimate above),

Let θ* be the true parameter, for any ε>0:
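In its usual form, Hoeffding’s inequality applied to the MLE of a Bernoulli/Binomial parameter gives

    P\left( \left| \hat{\theta} - \theta^* \right| \ge \epsilon \right) \le 2 e^{-2 N \epsilon^2}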

PAC Learning

PAC: Probably Approximately Correct
Billionaire says: I want to know the thumbtack parameter θ, within ε = 0.1, with probability at least 1-δ = 0.95. How many flips?
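A worked answer under the Hoeffding bound above: require 2 e^{-2Nε²} ≤ δ and solve for N.

    N \ge \frac{\ln(2/\delta)}{2\epsilon^2} = \frac{\ln(2/0.05)}{2 (0.1)^2} = \frac{\ln 40}{0.02} \approx 185 \text{ flips}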

What about prior

Billionaire says: Wait, I know that the thumbtack is “close” to 50-50. What can you do?
You say: I can learn it the Bayesian way…

Rather than estimating a single θ, we obtain a distribution over possible values of θ

Bayesian Learning

Use Bayes rule:

Or equivalently:
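In symbols, this is the standard Bayes rule, with the normalizer P(D) dropped in the proportional form:

    P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
      \qquad\text{equivalently}\qquad
      P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)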

Bayesian Learning for Thumbtack

Likelihood function is simply Binomial:

What about prior?
Represent expert knowledge
Simple posterior form

Conjugate priors:
Closed-form representation of posterior
For Binomial, conjugate prior is Beta distribution

Beta prior distribution – P(θ)

Likelihood function:
Posterior:
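A sketch of the Beta prior, writing βH and βT for its hyperparameters (notation assumed here, not fixed by the text above):

    P(\theta) = \frac{\theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \;\sim\; Beta(\beta_H, \beta_T)

Multiplying this prior by the Binomial likelihood θ^αH (1-θ)^αT yields another expression of the same functional form, which is exactly what conjugacy means.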

Posterior distribution

Prior:
Data: αH heads and αT tails

Posterior distribution:
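With the βH, βT notation assumed above, the conjugate update is

    P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1} (1-\theta)^{\alpha_T + \beta_T - 1} \;\sim\; Beta(\alpha_H + \beta_H,\; \alpha_T + \beta_T)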

Using Bayesian posterior

Posterior distribution:

Bayesian inference:
No longer single parameter:

Integral is often hard to compute
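For example, predicting the next flip means averaging over the posterior; for the Beta posterior above this integral happens to have a closed form (the posterior mean), but in general it does not:

    P(x_{N+1} = H \mid D) = \int_0^1 \theta\, P(\theta \mid D)\, d\theta = \frac{\alpha_H + \beta_H}{\alpha_H + \beta_H + \alpha_T + \beta_T}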

MAP: Maximum a posteriori approximation

As more data is observed, Beta is more certain

MAP: use most likely parameter:

MAP for Beta distribution

MAP: use most likely parameter:

Beta prior equivalent to extra thumbtack flips
As N → ∞, prior is “forgotten”
But, for small sample size, prior is important!
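For the Beta posterior above, the most likely parameter is its mode, which makes the extra-flips reading explicit: the prior behaves like βH - 1 extra heads and βT - 1 extra tails.

    \hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}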

What about continuous variables?

Billionaire says: If I am measuring a continuous variable, what can you do for me?
You say: Let me tell you about Gaussians…

MLE for Gaussian

Prob. of i.i.d. samples x1,…,xN:

Log-likelihood of data:

Your second learning algorithm:
MLE for mean of a Gaussian
What’s MLE for mean?
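A sketch, assuming the usual Gaussian density with mean μ and variance σ²: the log-likelihood of the samples is

    \ln P(x_1,\dots,x_N \mid \mu, \sigma) = -\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2}

and setting its derivative with respect to μ to zero gives the sample mean:

    \hat{\mu}_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i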

MLE for variance

Again, set derivative to zero:
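Setting the derivative of the same log-likelihood with respect to σ² to zero gives

    \hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu}_{MLE})^2

Note the division by N rather than N-1, so this is the (slightly biased) maximum-likelihood estimate of the variance.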

Learning Gaussian parameters

MLE:

Bayesian learning is also possible
Conjugate priors:

Mean: Gaussian prior
Variance: Wishart Distribution

Prediction of continuous variables

Billionaire says: Wait, that’s not what I meant!
You say: Chill out, dude.
He says: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA.
You say: I can regress that…

The regression problem
Instances: <xj, tj>
Learn: Mapping from x to t(x)
Hypothesis space:

Given basis functions
Find coefficients w = {w1,…,wk}

Precisely, minimize the residual error:

Solve with simple matrix operations:
Set derivative to zero
Go to recitation Thursday 1/20
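A minimal numerical sketch of that matrix solution, assuming a simple polynomial basis and NumPy; the function and variable names here are illustrative, not from the slides. The least-squares coefficients satisfy the normal equations (ΦᵀΦ)w = Φᵀt, solved below with a standard least-squares routine.

    import numpy as np

    def design_matrix(x, degree):
        # Basis functions: polynomials 1, x, x^2, ..., x^degree evaluated at each input
        return np.vstack([x**d for d in range(degree + 1)]).T

    def fit_least_squares(x, t, degree=2):
        # Minimize the residual sum of squares sum_j (t_j - w . phi(x_j))^2,
        # equivalent to solving the normal equations (Phi^T Phi) w = Phi^T t
        Phi = design_matrix(x, degree)
        w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # numerically stable solver
        return w

    # Toy example in the spirit of the slides: predict salary-like targets from GPA-like inputs
    x = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
    t = np.array([40.0, 47.0, 60.0, 78.0, 100.0])
    print(fit_least_squares(x, t, degree=2))  # fitted coefficients w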

Billionaire (again) says: Why sum squared error???
You say: Gaussians, Dr. Gateson, Gaussians…

Model:

Learn w using MLE

But, why?

Maximizing log-likelihood

Maximize:
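A sketch of why this recovers sum squared error, writing h_i for the basis functions (notation assumed) and taking t = Σ_i w_i h_i(x) plus zero-mean Gaussian noise with variance σ²:

    \ln P(t_1,\dots,t_N \mid \mathbf{w}, \sigma) = -\sum_{j=1}^{N} \frac{\big(t_j - \sum_i w_i h_i(x_j)\big)^2}{2\sigma^2} + \text{const}

so maximizing the log-likelihood in w is exactly minimizing the residual sum of squares.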

Bias-Variance Tradeoff

Choice of hypothesis class introduces learning bias

More complex class → less bias
More complex class → more variance
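One standard way to make this precise (notation assumed, not from the slides): for squared error, the expected prediction error decomposes as

    E\big[(t - \hat{t}(x))^2\big] = \text{bias}^2 + \text{variance} + \text{noise}

so richer hypothesis classes trade the first term against the second.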

What you need to know

Go to recitation for regression
And, other recitations too

Point estimation:
MLE
Bayesian learning
MAP

Gaussian estimation
Regression:

Basis function = features
Optimizing sum squared error
Relationship between regression and Gaussians

Bias-Variance trade-off
