
Beyond Classification

Rob Schapire

Princeton University [currently visiting Yahoo! Research]

Classification and Beyond

• earlier, studied classification learning
  • goal: learn to classify examples into fixed set of categories
  • want to predict correct class as often as possible

• many applications

• however, often faced with learning problems that don’t fit this paradigm:

• predicting real-valued quantities:

  • how many times will some web page be visited?
  • how much will be bid on a particular advertisement?

• predicting probabilities:

  • what is the probability user will click on some link?
  • how likely is it that some user is a spammer?

This Lecture

• general techniques for:

  • predicting real-valued quantities (“regression”)
  • predicting probabilities

• central, unifying idea: loss minimization

Regression

Example: Weather Prediction

• meteorologists A and B apply for job
• to test which is better:

  • ask each to predict how much it will rain
  • observe actual amount
  • repeat

               predictions      actual
               A      B         outcome
  Monday       1.2    0.5       0.9
  Tuesday      0.1    0.3       0.0
  Wednesday    2.0    1.0       2.1

• how to judge who gave better predictions?

Example (cont.)

• natural idea:
  • measure discrepancy between predictions and outcomes
  • e.g., measure using absolute difference
  • choose forecaster with closest predictions overall

               predictions      actual     difference
               A      B         outcome    A      B
  Monday       1.2    0.5       0.9        0.3    0.4
  Tuesday      0.1    0.3       0.0        0.1    0.3
  Wednesday    2.0    1.0       2.1        0.1    1.1
  total                                    0.5    1.8

• could have measured discrepancy in other ways
  • e.g., difference squared

• which measure to use?
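
As a quick check on the table above, here is a minimal Python sketch (illustrative only; the variable names are not from the lecture) that totals the discrepancy for each forecaster under both the absolute and the squared measure:

```python
# daily predictions for forecasters A and B, and the observed rainfall
preds_A = [1.2, 0.1, 2.0]
preds_B = [0.5, 0.3, 1.0]
actual  = [0.9, 0.0, 2.1]

def total_absolute(preds, outcomes):
    """Sum of |prediction - outcome| over all days."""
    return sum(abs(p - y) for p, y in zip(preds, outcomes))

def total_squared(preds, outcomes):
    """Sum of (prediction - outcome)^2 over all days."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes))

print(total_absolute(preds_A, actual), total_absolute(preds_B, actual))  # ~0.5 vs ~1.8
print(total_squared(preds_A, actual), total_squared(preds_B, actual))    # A still wins, by a different margin
```

Under either measure A's predictions are closer overall here, but the two measures need not agree in general, which is exactly the question the next slides take up.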

Loss

• each forecast scored using loss function:
     x = weather conditions
     f(x) = predicted amount
     y = actual outcome

• loss function L(f(x), y) measures discrepancy between prediction f(x) and outcome y

• e.g.:
  • absolute loss: L(f(x), y) = |f(x) − y|
  • square loss: L(f(x), y) = (f(x) − y)²

• which L to use?

• need to understand properties of loss functions

Square Loss

• square loss often sensible because it encourages predictions close to the true expectation

• fix x
• say y random with µ = E[y]
• predict f = f(x)
• can show:

     E[L(f, y)] = E[(f − y)²] = (f − µ)² + Var(y)

  where Var(y) is the intrinsic randomness in y

• therefore:
  • minimized when f = µ
  • lower square loss ⇒ f closer to µ
  • forecaster with lowest square loss has predictions closest to E[y|x] on average
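
The decomposition on this slide can be checked numerically; the sketch below (an illustration, with a made-up Gaussian distribution for y) estimates the expected square loss of a fixed prediction f and compares it to (f − µ)² + Var(y):

```python
import random

random.seed(0)
mu, sigma = 2.0, 0.7                   # true mean and standard deviation of y
ys = [random.gauss(mu, sigma) for _ in range(200_000)]

f = 2.5                                # a fixed prediction
expected_loss = sum((f - y) ** 2 for y in ys) / len(ys)

var_y = sum((y - mu) ** 2 for y in ys) / len(ys)
decomposition = (f - mu) ** 2 + var_y

print(expected_loss, decomposition)    # agree up to sampling noise; smallest when f = mu
```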

Learning for Regression

• say examples (x, y) generated at random

• expected square loss

     E[L_f] ≡ E[(f(x) − y)²]

  minimized when f(x) = E[y|x] for all x

• how to minimize from training data (x_1, y_1), ..., (x_m, y_m)?

• attempt to find f with minimum empirical loss:

     Ê[L_f] ≡ (1/m) Σ_{i=1}^m (f(x_i) − y_i)²

• if ∀f : Ê[L_f] ≈ E[L_f], then the f that minimizes Ê[L_f] will approximately minimize E[L_f]

• to be possible, need to choose f of restricted form to avoid overfitting
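
A minimal sketch of this recipe (the toy data and the restricted class of candidate slopes are invented for illustration): generate training pairs, then pick from a small family of predictors the one with lowest empirical square loss.

```python
import random

random.seed(1)
# toy training data: y is roughly 3*x plus noise
xs = [random.random() for _ in range(100)]
train = [(x, 3 * x + random.gauss(0, 0.5)) for x in xs]

def empirical_square_loss(f, data):
    """(1/m) * sum over the training set of (f(x_i) - y_i)^2."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)

# restricted form: predictors f(x) = w*x for a handful of candidate slopes w
candidates = {w: (lambda x, w=w: w * x) for w in [0.0, 1.0, 2.0, 3.0, 4.0]}

best_w = min(candidates, key=lambda w: empirical_square_loss(candidates[w], train))
print(best_w)   # the slope closest to 3 should win
```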

Linear Regression

• e.g., if x ∈ Rⁿ, could choose to use linear predictors of form

     f(x) = w · x

• then need to find w to minimize

     (1/m) Σ_{i=1}^m (w · x_i − y_i)²

• can solve in closed form

• can also minimize on-line (e.g. using gradient descent)
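
Both routes can be sketched with NumPy (synthetic data; not the lecture's code): the closed form solves the normal equations, and the on-line route takes stochastic gradient steps on the squared error of one example at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 5
w_true = rng.normal(size=n)
X = rng.normal(size=(m, n))
y = X @ w_true + 0.1 * rng.normal(size=m)      # noisy linear data

# closed form: solve the normal equations (X^T X) w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# on-line: stochastic gradient descent on (w . x_i - y_i)^2
w_sgd = np.zeros(n)
eta = 0.01                                     # step size
for _ in range(10):                            # a few passes over the data
    for i in range(m):
        grad = 2 * (w_sgd @ X[i] - y[i]) * X[i]
        w_sgd -= eta * grad

print(np.max(np.abs(w_closed - w_true)))       # small
print(np.max(np.abs(w_sgd - w_true)))          # also small
```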

Regularization

• to constrain predictor further, common to add regularization term to encourage small weights:

     (1/m) Σ_{i=1}^m (w · x_i − y_i)² + λ‖w‖²

  (in this case, called “ridge regression”)

• can significantly improve performance by limiting overfitting

• requires tuning of λ parameter

• different forms of regularization have different properties
  • e.g., using ‖w‖₁ instead tends to encourage “sparse” solutions
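
A NumPy sketch of ridge regression (illustrative; the data and the grid of λ values are made up): the regularized objective still has a closed form, obtained by adding λ to the diagonal of the normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 20                          # few examples relative to the dimension
w_true = rng.normal(size=n)
X = rng.normal(size=(m, n))
y = X @ w_true + 0.5 * rng.normal(size=m)

def ridge(X, y, lam):
    """Closed-form minimizer of (1/m) * sum (w.x_i - y_i)^2 + lam * ||w||^2."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(n), X.T @ y / m)

for lam in [0.0, 0.01, 0.1, 1.0]:      # lambda must be tuned, e.g. on held-out data
    w_hat = ridge(X, y, lam)
    print(lam, np.linalg.norm(w_hat - w_true))
```

With this few examples relative to the dimension, a moderate λ often brings the estimate closer to the true weights than the unregularized fit, which is the overfitting-limiting effect described above.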

Absolute Loss

• what if instead use L(f(x), y) = |f(x) − y| ?

• can show E[|f(x) − y|] minimized when

     f(x) = median of y’s conditional distribution, given x

• potentially, quite different behavior from square loss

• not used so often
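
A quick numerical illustration (not from the lecture): for a skewed distribution of y, the constant prediction minimizing average absolute error sits near the median, while the one minimizing average squared error sits near the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=200_000)   # skewed: mean 1.0, median ln 2 ≈ 0.69

candidates = np.linspace(0.0, 3.0, 301)
abs_loss = [np.mean(np.abs(f - y)) for f in candidates]
sq_loss  = [np.mean((f - y) ** 2) for f in candidates]

print("best under absolute loss:", candidates[int(np.argmin(abs_loss))])   # near 0.69
print("best under square loss:  ", candidates[int(np.argmin(sq_loss))])    # near 1.0
```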

Summary so far

• can handle prediction of real-valued outcomes by:
  • choosing a loss function
  • computing a prediction rule with minimum loss on training data

• different loss functions have different properties:
  • square loss estimates conditional mean
  • absolute loss estimates conditional median

• what if goal is to estimate entire conditional distribution of y given x?

Estimating Probabilities

Plausible Approaches

• similar to classification
• but goal now is to predict probability of class

• could reduce to regression:

     y = 1 if rain, 0 if no-rain

• minimize square loss to estimate

     E[y|x] = Pr[y = 1|x] = Pr[rain|x]

• reasonable, though somewhat awkward and unnatural (especially when more than two possible outcomes)
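
A sketch of this reduction (synthetic data, invented for illustration): encode the outcome as 0/1, fit a linear predictor by least squares, and read its output as an estimate of Pr[rain|x]. The last line hints at the awkwardness: a linear fit can produce values outside [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5000
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # intercept plus one feature
p_true = 1 / (1 + np.exp(-2 * X[:, 1]))                 # true Pr[rain | x]
y = (rng.random(m) < p_true).astype(float)              # y = 1 for rain, 0 for no-rain

# least-squares fit of the 0/1 labels: estimates E[y|x] = Pr[y = 1|x]
w = np.linalg.solve(X.T @ X, X.T @ y)
estimates = X @ w

print(estimates.min(), estimates.max())   # some estimates fall outside [0, 1]
```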

Different Approach: Maximum Likelihood

• each forecaster predicting distribution over set of outcomes y ∈ {rain, no-rain} for given x

• can compute probability of observed outcomes according to each forecaster (“likelihood”)

               predictions      actual     likelihood
               A      B         outcome    A      B
  Monday       60%    80%       rain       0.6    0.8
  Tuesday      20%    70%       no-rain    0.8    0.3
  Wednesday    90%    50%       no-rain    0.1    0.5

  likelihood(A) = .6 × .8 × .1
  likelihood(B) = .8 × .3 × .5

• intuitively, higher likelihood ⇒ better fit of estimated probabilities to observations

• so: choose maximum-likelihood forecaster
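
The likelihoods in the table can be computed directly; a small Python sketch (illustrative, mirroring the numbers above):

```python
# predicted probability of rain each day, and what actually happened
prob_rain_A = [0.6, 0.2, 0.9]
prob_rain_B = [0.8, 0.7, 0.5]
outcomes    = ["rain", "no-rain", "no-rain"]

def likelihood(prob_rain, outcomes):
    """Product over days of the probability the forecaster assigned to the observed outcome."""
    result = 1.0
    for p, o in zip(prob_rain, outcomes):
        result *= p if o == "rain" else (1 - p)
    return result

print(likelihood(prob_rain_A, outcomes))   # 0.6 * 0.8 * 0.1 ≈ 0.048
print(likelihood(prob_rain_B, outcomes))   # 0.8 * 0.3 * 0.5 = 0.12, so B is the maximum-likelihood choice
```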

Log Loss

• given training data (x_1, y_1), ..., (x_m, y_m)

• f(y|x) = predicted probability of y for given x

• likelihood of f = Π_{i=1}^m f(y_i|x_i)

• maximizing likelihood ≡ minimizing negative log likelihood

     Σ_{i=1}^m (− log f(y_i|x_i))

• L(f(·|x), y) = − log f(y|x) called “log loss”
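
Continuing the forecaster example (illustrative Python): the negative log likelihood is the sum of per-day log losses, and comparing totals gives the same ranking as comparing likelihoods.

```python
import math

prob_rain_A = [0.6, 0.2, 0.9]
prob_rain_B = [0.8, 0.7, 0.5]
outcomes    = ["rain", "no-rain", "no-rain"]

def total_log_loss(prob_rain, outcomes):
    """Sum over days of -log(probability assigned to the observed outcome)."""
    total = 0.0
    for p, o in zip(prob_rain, outcomes):
        p_observed = p if o == "rain" else (1 - p)
        total += -math.log(p_observed)
    return total

# lower total log loss <=> higher likelihood
print(total_log_loss(prob_rain_A, outcomes))   # -log(0.048) ≈ 3.04
print(total_log_loss(prob_rain_B, outcomes))   # -log(0.12)  ≈ 2.12, so B is preferred, as before
```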

Estimating Probabilities

• Pr[y|x] = true probability of y given x

• can prove: E[− log f(y|x)] minimized when f(y|x) = Pr[y|x]

• more generally,

     E[− log f(y|x)] = (average distance between f(·|x) and Pr[·|x]) + (intrinsic uncertainty of Pr[·|x])

• so: minimizing log loss encourages choice of predictor close to true conditional probabilities
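
For a single x with two outcomes, this decomposition can be written with the KL divergence as the “distance” and the entropy as the “intrinsic uncertainty”; the sketch below (illustrative) checks it numerically and shows the expected log loss is smallest when the predicted probability equals the true one.

```python
import math

p = 0.3                                   # true Pr[rain | x]

def expected_log_loss(f):
    """E[-log f(y|x)] when y ~ Bernoulli(p) and f is the predicted probability of rain."""
    return -(p * math.log(f) + (1 - p) * math.log(1 - f))

def kl(p, f):
    """KL divergence between the true and predicted distributions over {rain, no-rain}."""
    return p * math.log(p / f) + (1 - p) * math.log((1 - p) / (1 - f))

entropy = -(p * math.log(p) + (1 - p) * math.log(1 - p))

for f in [0.1, 0.3, 0.5, 0.9]:
    print(f, expected_log_loss(f), kl(p, f) + entropy)   # columns match; minimum at f = 0.3
```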

Log Loss and Square Loss

• e.g.: if x ∈ Rⁿ, y ∈ R, can take f(y|x) to be gaussian with mean w · x and fixed variance

• then minimizing log loss ≡ linear regression

• in general: square loss ≡ log loss with gaussian conditional probability distributions (and fixed variance)
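
A short check of the equivalence (synthetic data, with a fixed variance chosen arbitrarily): with a Gaussian conditional model, the negative log likelihood differs from the sum of squared errors only by a constant factor and an additive constant, so both are minimized by the same w.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=m)
sigma = 1.0                                            # fixed standard deviation

def gaussian_neg_log_lik(w):
    """Sum over examples of -log N(y_i; w.x_i, sigma^2)."""
    resid = y - X @ w
    return 0.5 * np.sum(resid ** 2) / sigma ** 2 + m * np.log(sigma * np.sqrt(2 * np.pi))

def sum_square_loss(w):
    return np.sum((y - X @ w) ** 2)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)               # least-squares solution
for w in [w_ls, w_ls + 0.1, w_ls - 0.2]:
    print(gaussian_neg_log_lik(w), sum_square_loss(w)) # NLL increases exactly when square loss does
```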

Classification and Loss Minimization

• in classification learning, try to minimize 0-1 loss

     L(f(x), y) = 1 if f(x) ≠ y, 0 else

  • expected 0-1 loss = generalization error
  • empirical 0-1 loss = training error

• computationally and numerically difficult loss since discontinuous and not convex

• to handle, both AdaBoost and SVM’s minimize alternative surrogate losses:
  • AdaBoost: “exponential” loss
  • SVM’s: “hinge” loss
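
The three losses can be written as functions of the margin y·f(x) with labels y ∈ {−1, +1} (standard forms for the exponential and hinge losses; the code is an illustration, not from the lecture). Both surrogates are convex upper bounds on the 0-1 loss, which is what makes them tractable to minimize.

```python
import math

def zero_one_loss(margin):
    """1 if the sign of f(x) disagrees with y (margin = y*f(x) <= 0), else 0."""
    return 1.0 if margin <= 0 else 0.0

def exponential_loss(margin):
    """Surrogate minimized by AdaBoost: exp(-y*f(x))."""
    return math.exp(-margin)

def hinge_loss(margin):
    """Surrogate minimized by SVMs: max(0, 1 - y*f(x))."""
    return max(0.0, 1.0 - margin)

for margin in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(margin, zero_one_loss(margin), exponential_loss(margin), hinge_loss(margin))
```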

Summary

• much of learning can be viewed simply as loss minimization

• different losses have different properties and purposes

• regression (real-valued labels):

  • use square loss to estimate conditional mean
  • use absolute loss to estimate conditional median

• estimating conditional probabilities:

  • use log loss (≡ maximum likelihood)

• classification:

  • use 0/1-loss (or a surrogate)

• overall, loss minimization provides a unified and flexible means of algorithm design