Econometrics Master's Course: Methods (Transcript)
Chapter 9: Inferential Statistics of Discrete-Choice Models
Outline:
- 9.1 Maximum-Likelihood Estimation
- 9.2 Estimation Errors: Variance-Covariance Matrix
  - 9.2.1 Example 1: SP Survey in the Audience
  - 9.2.2 Example 2: RP Survey in the Audience
- 9.3 Significance Tests
- 9.4 Goodness-of-Fit Measures
9.1 Maximum-Likelihood Estimation: the likelihood function
- Maximum-likelihood (ML) estimation is applicable to general stochastic models whose probabilities depend on a parameter vector β.
- The goal is to maximize the likelihood function L(β), i.e., the probability that the model predicts all data points (y_n, x_n), n = 1, ..., N:

    L(β) = P(Y_1(β) = y_1, ..., Y_N(β) = y_N),

  where Y_n = Y(x_n) denotes the model estimate for x_n.
- For continuous endogenous variables, the likelihood function is given by the multi-dimensional probability density at the data points:

    L(β) = f_{Y_1(β), ..., Y_N(β)}(y_1, ..., y_N)

? Verify that the density formulation is equivalent to the probability definition by requiring the model estimates to lie in small intervals around the data instead of hitting the data exactly.

! The multi-dimensional probability density f(·) is defined such that dP = f_{Y_1,...,Y_N}(y) d^N y. Keeping d^N y small and constant, dP (and thus P) is maximized if and only if f(·) is maximized.
Maximum-likelihood estimation
- The ML method maximizes the likelihood function:

    β̂ = arg max_β L(β)

- Equivalently, and often numerically better behaved, one maximizes the log-likelihood:

    β̂ = arg max_β L̃(β),    L̃(β) = ln L(β)
? Why does it not matter whether one maximizes the likelihood or the log-likelihood?

! As a probability or probability density, L > 0, and the log function is defined and strictly monotonically increasing on this range. Since (i) for such a function f we have x > y ⇔ f(x) > f(y), and (ii) the arg-max operation is based on exactly this inequality relation, the argument of the maximum remains unchanged.
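The same point can be checked numerically. Here is a minimal sketch with made-up Bernoulli data (all names and numbers are illustrative, not from the lecture): both objectives yield the same argmax, but the raw likelihood underflows for large N, which is why software maximizes the log-likelihood.

```python
# Minimal sketch: argmax of likelihood vs. log-likelihood (they coincide).
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])      # ten Bernoulli draws

def neg_likelihood(p):
    # product of N factors; underflows already for N of a few thousand
    return -np.prod(p**y * (1 - p)**(1 - y))

def neg_log_likelihood(p):
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

p_L  = minimize_scalar(neg_likelihood,     bounds=(1e-6, 1 - 1e-6), method="bounded").x
p_LL = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(p_L, p_LL)   # both approx. 0.7, the sample mean
```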
Application 1: Regression models
Besides OLS, the ML method can also be used to estimate regression models. Does it give the same result, at least if the statistical Gauss-Markov conditions are satisfied?
    L(β) = ∏_{n=1}^{N} f_n(y_n)                                        (ε_n independent)
         = ∏_{n=1}^{N} (2πσ²)^{−1/2} exp[ −(y_n − x_n'β)² / (2σ²) ],   (ε_n ~ i.i.d. N(0, σ²))

    L̃(β) = ∑_{n=1}^{N} ln f_n(y_n)
          = ∑_{n=1}^{N} { −(1/2)(ln 2π + ln σ²) − (y_n − x_n'β)² / (2σ²) }
          = −(N/2)(ln 2π + ln σ²) − (1/(2σ²)) (y − Xβ)'(y − Xβ)
Up to the irrelevant additive and multiplicative constants, the last term is the negative SSE objective of the OLS method; maximizing L̃(β) therefore minimizes the SSE and leads to the same estimator!
? Why is it possible to express L(β) as a product?

! Because the random terms ε_n ~ i.i.d. N(0, σ²); in particular, they are independent of each other.
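A minimal sketch with simulated data (names and numbers are illustrative, not from the lecture) confirming that ML estimation under Gaussian errors reproduces the OLS estimator:

```python
# Minimal sketch: ML with Gaussian errors reproduces OLS (simulated data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.uniform(0, 10, N)])
y = X @ np.array([1.0, 0.5]) + rng.normal(0, 1.0, N)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

def neg_log_lik(theta):                        # theta = (beta_0, beta_1, ln sigma)
    beta, sigma2 = theta[:2], np.exp(2 * theta[2])
    resid = y - X @ beta
    return 0.5 * N * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

beta_ml = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS").x[:2]
print(beta_ols, beta_ml)                       # agree up to optimizer tolerance
```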
Application 2: Discrete-choice models
- Probability to predict the chosen alternative i_n for a single decision n:

    P(Y_n = y_n) = P(Y_{n1} = y_{n1}, ..., Y_{nI} = y_{nI}) = ∏_{i=1}^{I} [P_{ni}(β)]^{y_{ni}} = P_{n i_n}(β)

  (this relies on the exclusivity/completeness of the choice set A_n, not on independent random utilities)
- Probability to predict all the decisions correctly, assuming independent decisions:

    L(β) = P(Y_1(β) = y_1, ..., Y_N(β) = y_N) = ∏_{n=1}^{N} ∏_{i=1}^{I} [P_{ni}(β)]^{y_{ni}}

- ML estimation:

    β̂ = arg max_β L̃(β),    L̃(β) = ∑_{n=1}^{N} ∑_{i=1}^{I} y_{ni} ln P_{ni}(β) = ∑_{n=1}^{N} ln P_{n i_n}(β)
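To make this concrete, here is a minimal sketch (simulated data, all names illustrative) that estimates a multinomial logit by maximizing exactly this log-likelihood with scipy:

```python
# Minimal sketch: ML estimation of a multinomial logit on simulated data.
# V[n,i] = X[n,i,:] @ beta; the log-likelihood is sum_n ln P_{n,i_n}.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, I, M = 500, 3, 2
X = rng.normal(size=(N, I, M))                 # attributes of each alternative
beta_true = np.array([-0.8, 0.5])
U = X @ beta_true + rng.gumbel(size=(N, I))    # Gumbel random utilities -> logit
chosen = U.argmax(axis=1)                      # chosen alternative i_n

def neg_log_lik(beta):
    V = X @ beta
    V = V - V.max(axis=1, keepdims=True)       # numerical stabilization
    logP = V - np.log(np.exp(V).sum(axis=1, keepdims=True))
    return -logP[np.arange(N), chosen].sum()

res = minimize(neg_log_lik, x0=np.zeros(M), method="BFGS")
print(res.x)                                   # close to beta_true for large N
```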
Question
? Show that, in deriving the main ML result L̃ = ∑_n ∑_i y_{ni} ln P_{ni}, the random utilities need not be uncorrelated between alternatives, only between choices.

! Because of the exclusivity/completeness requirement for the alternatives, exactly one alternative is chosen per decision, so it is enough to maximize the corresponding probability (which, of course, depends on possible correlations).
Estimating models with only ACs

If there are no exogenous variables, we are left with just the alternative-specific constants (ACs), reflecting that people prefer certain alternatives over others for unknown reasons:

    V_{ni} = ∑_{m=1}^{I−1} β_m δ_{mi},   i.e.,  V_{ni} = β_i if i ≠ I,  V_{nI} = 0

This AC-only model will be the "reference case" when assessing model quality, e.g., by the likelihood-ratio index.

? Show that the estimated model gives probabilities P_{ni} = P_i that are equal to the observed choice fractions N_i/N. (Hint: use a Lagrange multiplier to satisfy ∑_i P_i = 1.)
! We have L̃(P) = ∑_n ln P_{i_n} = ∑_i N_i ln P_i; maximize under the constraint ∑_i P_i = 1:

    d/dP_i [ L̃(P) − λ(∑_i P_i − 1) ] = 0  ⇒  N_i/P_i = λ  ⇒  P_i ∝ N_i,

and normalization gives P_i = N_i/N.
? Based on this result P_i = N_i/N, give the parameters for the AC-only MNL and for the binary i.i.d. probit model.

! Logit: P_i/P_I = N_i/N_I = exp(β_i), hence β̂_i = ln(N_i/N_I) (note that alternative I is the reference without an AC).
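A minimal sketch verifying this result numerically (the counts are hypothetical):

```python
# Minimal sketch: the AC-only MNL reproduces the observed shares N_i/N, with
# beta_i = ln(N_i/N_I) relative to the reference alternative I.
import numpy as np
from scipy.optimize import minimize

counts = np.array([20, 50, 30])                # hypothetical N_i, here I = 3

def neg_log_lik(beta):
    V = np.append(beta, 0.0)                   # V_I = 0: reference alternative
    P = np.exp(V) / np.exp(V).sum()
    return -(counts * np.log(P)).sum()

beta_hat = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS").x
print(beta_hat)                                # approx. ln(N_i/N_I)
print(np.log(counts[:2] / counts[2]))          # = [ln(2/3), ln(5/3)]
```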
Exercise: simple binomial model with an AC and travel time
    V_{ni} = β_1 δ_{i1} + β_2 T_{ni}

Choice set | T_ped = T_1 [min] | T_bike = T_2 [min] | # chosen 1 | # chosen 2
     1     |        15         |         30         |     3      |     2
     2     |        10         |         15         |     2      |     3
     3     |        20         |         20         |     1      |     4
     4     |        30         |         25         |     1      |     4
     5     |        30         |         20         |     0      |     5
     6     |        60         |         30         |     0      |     5
I: Graphical solution
    V_{ni} = β_1 δ_{i1} + β_2 T_{ni}

[Figure: contour plots of the log-likelihood L̃(β_1, β_2); read-off results:]

Logit:  L̃ = −12,  β̂_1 = −1.3,  β̂_2 = −0.14,  AC in minutes: −β_1/β_2 = −9 min
Probit: L̃ = −12,  β̂_1 = −1.1,  β̂_2 = −0.12,  AC in minutes: −β_1/β_2 = −9 min
II. Numerical solution
- Generally, we have a nonlinear optimization problem.
- For parameter-linear utilities, we know for the MNL that a maximum exists and is unique.
- Standard methods of nonlinear optimization are applicable (see the sketch below):
  - Newton's and quasi-Newton methods: fast but possibly unstable
  - Gradient/steepest-descent methods: slow but reliable
  - Broyden-Fletcher-Goldfarb-Shanno (BFGS) or Levenberg-Marquardt algorithms combining gradient and Newton ideas; such methods are used in many software packages
  - Genetic algorithms if the objective-function landscape is complicated (nonlinear utilities)
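As a concrete illustration, here is a minimal sketch solving the preceding exercise numerically with scipy's BFGS; it should roughly reproduce the graphical read-off above.

```python
# Minimal sketch: BFGS solution of the exercise's binary logit,
# V_n1 = beta1 + beta2*T_n1, V_n2 = beta2*T_n2, with the table data above.
import numpy as np
from scipy.optimize import minimize

T1 = np.array([15, 10, 20, 30, 30, 60])        # ped times per choice set [min]
T2 = np.array([30, 15, 20, 25, 20, 30])        # bike times [min]
n1 = np.array([3, 2, 1, 1, 0, 0])              # counts choosing ped
n2 = np.array([2, 3, 4, 4, 5, 5])              # counts choosing bike

def neg_log_lik(beta):
    b1, b2 = beta
    dV = b1 + b2 * (T1 - T2)                   # utility difference V1 - V2
    P1 = 1.0 / (1.0 + np.exp(-dV))
    return -(n1 * np.log(P1) + n2 * np.log(1.0 - P1)).sum()

res = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
print(res.x, -res.fun)   # near the graphical read-off:
                         # beta1 ~ -1.3, beta2 ~ -0.14, log-likelihood ~ -12
```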
Special case: estimating the MNL
The special structure of the MNL with parameter-linear utilities, Vni =∑
m βmXmni
allows for an intuitive formulation of the estimation problem:
The observed and modeled property sums sums of the factors X for agiven parameter m should be the same
XMNLm = Xdata
m ,∑n,i
xmni Pni(β) =∑n,i
xmni yni =∑n
xmnin
Example: four factors, two alternatives

MNL model, V_{ni} = β_1 T_{ni} + β_2 C_{ni} + β_3 g_n δ_{i1} + β_4 δ_{i1}, with gender dummy g_n = 1 for women and g_n = 0 otherwise:

- X_1 = T: total travel time for the chosen alternatives:

    T^MNL = ∑_{n,i} P_{ni}(β) T_{ni},    T^data = ∑_{n,i} y_{ni} T_{ni} = ∑_n T_{n i_n}

- X_2 = C: total money spent by the decision makers:

    C^MNL = ∑_{n,i} P_{ni}(β) C_{ni},    C^data = ∑_{n,i} y_{ni} C_{ni} = ∑_n C_{n i_n}

- X_3 = N_{1,w}: number of women choosing alternative 1:

    N_{1,w}^MNL = ∑_n P_{n1}(β) g_n,    N_{1,w}^data = ∑_n y_{n1} g_n

- X_4 = N_1: total number of persons choosing alternative 1:

    N_1^MNL = ∑_n P_{n1}(β),    N_1^data = ∑_n y_{n1}
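These factor-sum conditions are just the first-order conditions of the ML problem, which can be checked numerically. A minimal sketch with simulated data (all names illustrative):

```python
# Minimal sketch: at the ML estimate of a parameter-linear MNL, modeled and
# observed factor sums coincide (first-order conditions of max. likelihood).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, I, M = 400, 2, 2
X = rng.normal(size=(N, I, M))                 # factors X_{mni}
U = X @ np.array([0.3, -0.6]) + rng.gumbel(size=(N, I))
y = np.eye(I)[U.argmax(axis=1)]                # one-hot choice indicators y_{ni}

def neg_log_lik(beta):
    V = X @ beta
    logP = V - np.log(np.exp(V).sum(axis=1, keepdims=True))
    return -(y * logP).sum()

beta_hat = minimize(neg_log_lik, x0=np.zeros(M), method="BFGS").x
V = X @ beta_hat
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)
print(np.einsum("ni,nim->m", P, X))            # X_m^MNL, modeled factor sums
print(np.einsum("ni,nim->m", y, X))            # X_m^data, observed (equal)
```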
9.2 Estimation Errors: Variance-Covariance Matrix

Since the log-likelihood is maximized at β̂, we have

    ∂L̃/∂β |_{β̂} = 0   ⇒   L̃(β) ≈ L̃_max + (1/2) Δβ' · H · Δβ,    Δβ = β − β̂,

with the (negative definite) Hessian H_{lm} = ∂²L̃(β)/(∂β_l ∂β_m) |_{β=β̂}.

Compare L(β) near its maximum with the density f(x) of the general multivariate normal distribution with variance-covariance matrix Σ:

    L(β) = L_max exp( (1/2) Δβ' · H · Δβ ),
    f(x) = ( (2π)^M det Σ )^{−1/2} exp( −(1/2) x' Σ^{−1} x )

Identifying Δβ with x and the sought-after variance-covariance matrix V with Σ, and assuming the asymptotic limit (terms of higher than quadratic order in L̃(β) negligible), we obtain

    V = Cov(β̂) = E[ (β̂ − β)(β̂ − β)' ] ≈ −H^{−1}(β̂)
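A minimal sketch computing standard errors this way, reusing neg_log_lik and M from the factor-sums sketch above. Since we minimize −L̃, the inverse Hessian of the objective is directly V = (−H)^{−1}; scipy's BFGS returns an approximation of it as a by-product.

```python
# Minimal sketch: estimation errors from the curvature of the log-likelihood.
import numpy as np
from scipy.optimize import minimize

res = minimize(neg_log_lik, x0=np.zeros(M), method="BFGS")
print(res.x, np.sqrt(np.diag(res.hess_inv)))   # estimates and standard errors

# More accurate: central finite differences for the Hessian of -L~
def hessian(f, x, h=1e-5):
    K = len(x)
    H = np.zeros((K, K))
    for l in range(K):
        for m in range(K):
            el, em = np.eye(K)[l] * h, np.eye(K)[m] * h
            H[l, m] = (f(x + el + em) - f(x + el - em)
                       - f(x - el + em) + f(x - el - em)) / (4 * h * h)
    return H

V_hat = np.linalg.inv(hessian(neg_log_lik, res.x))   # = -H^{-1} of L~
print(np.sqrt(np.diag(V_hat)))
```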
Fisher's information matrix

The variance-covariance matrix is related to Fisher's information matrix I:

    I = V^{−1} = −H,    I_{lm} = −∂²L̃(β)/(∂β_l ∂β_m)

- Roughly speaking, information is missing uncertainty: the larger the main components of I, the smaller the main components of V.
- Cramér-Rao inequality: a lower bound for the variance-covariance matrix is the inverse of Fisher's information matrix ⇒ the ML estimator is asymptotically efficient.
- Comparison with the OLS estimator V_OLS = 2σ² H_SSE^{−1} of regression models:

    I = −H = H_SSE/(2σ²) = X'X/σ²

  The negative Hessian of L̃(β) is proportional to the Hessian of the regression SSE objective S(β).
9.2.1 Example 1 from the past lecture: SP Survey in the Audience, WS 18/19
(rows marked red in the original slides: bad weather, W = 1)

Choice set | Alt. 1: Ped | Alt. 2: Bike | Alt. 3: PT/Car | # Alt 1 | # Alt 2 | # Alt 3
     1     |   30 min    |    20 min    |  20 min + 0 €  |    1    |    3    |    7
     2     |   30 min    |    20 min    |  20 min + 2 €  |    2    |    9    |    2
     3     |   30 min    |    20 min    |  20 min + 1 €  |    1    |    5    |    7
     4     |   30 min    |    20 min    |  30 min + 0 €  |    2    |    9    |    3
     5     |   50 min    |    20 min    |  30 min + 0 €  |    0    |    9    |    4
     6     |   50 min    |    30 min    |  30 min + 0 €  |    0    |    3    |    9
     7     |   50 min    |    40 min    |  30 min + 0 €  |    0    |    2    |   10
     8     |  180 min    |    60 min    |  60 min + 2 €  |    0    |    4    |   11
     9     |  180 min    |    40 min    |  60 min + 2 €  |    0    |    9    |    6
    10     |  180 min    |    40 min    |  60 min + 2 €  |    0    |    1    |   14
    11     |   12 min    |     8 min    |  10 min + 0 €  |    3    |    5    |    6
    12     |   12 min    |     8 min    |  10 min + 1 €  |    5    |    7    |    2
Model specification for Model 1 of the past lecture

    V_i = β_0 δ_{i1} + β_1 δ_{i2} + β_2 K_i + β_3 T_i

Estimates:

    β_0 = −0.95 ± 0.37,  β_1 = −0.28 ± 0.24,  β_2 = +0.17 ± 0.19,  β_3 = −0.04 ± 0.02

Derived quantities:

    β_0/(−β_3) = −22.4 min,   β_1/(−β_3) = −6.6 min,   60 β_3/β_2 = −15 €/h

Fit measures: AIC = 275, BIC = 303, ρ² = 0.200, adjusted ρ̄² = 0.177
Likelihood and log-likelihood function for varying cost (β_2) and time (β_3) sensitivities

    V_i = β_0 δ_{i1} + β_1 δ_{i2} + β_2 K + β_3 T

[Figure: contour plots of the likelihood function L(β_2, β_3 | β_0, β_1) and the log-likelihood function L̃(β_2, β_3 | β_0, β_1)]
Log-likelihood function in parameter space
    V_i = β_0 δ_{i1} + β_1 δ_{i2} + β_2 K + β_3 T + β_4 W δ_{i3}

[Figure: sections of the log-likelihood through the parameter space]
9.2.2 Example 2: RP Survey in the Audience

Distance classes for the trip home to university (cumulated until 2018); weather: good

Distance  | Class center | Alt. 1: ped | Alt. 2: bike | Alt. 3: PT | Alt. 4: car
 0-1 km   |    0.5 km    |     17      |      16      |     10     |      0
 1-2 km   |    1.5 km    |      9      |      23      |     20     |      2
 2-5 km   |    3.5 km    |      2      |      27      |     55     |      4
 5-10 km  |    7.5 km    |      0      |       7      |     42     |      7
 10-20 km |   12.5 km    |      0      |       0      |     18     |      7
Revealed Choice: fit quality
    V_1 = β_1 + β_4 r,  V_2 = β_2 + β_5 r,  V_3 = β_3 + β_6 r,  V_4 = 0

Estimates:

    β_1 = 4.1 ± 0.6,  β_2 = 3.6 ± 0.5,  β_3 = 3.0 ± 0.5,
    β_4 = −1.43 ± 0.26,  β_5 = −0.48 ± 0.08,  β_6 = −0.14 ± 0.05

[Figure: observed vs. modeled choice fractions per distance class]
Revealed Choice: Modal split as a function of distance
(Same model and parameter estimates as on the previous slide.)

[Figure: modal split as a function of distance r]
Likelihood and Log-Likelihood as f(β1, β2)
    V_i = ∑_{m=1}^{3} β_m δ_{mi} + ∑_{m=1}^{3} β_{m+3} r δ_{mi}

[Figure: likelihood function L(β_1, β_2 | β_3, ...) and log-likelihood function L̃(β_1, β_2 | β_3, ...)]
Log-Likelihood: Sections through parameter space
    V_i = ∑_{m=1}^{3} β_m δ_{mi} + ∑_{m=1}^{3} β_{m+3} r δ_{mi}

[Figure: sections of the log-likelihood through the parameter space]
9.3 Significance Tests: Parametric Tests
- The parameter test procedures are exactly the same as those of regression models. Because we only consider the asymptotic limit, the test statistic is always Gaussian:
- Confidence interval of a parameter β_m:

    CI_α(β_m) = [β̂_m − Δ_α, β̂_m + Δ_α],    Δ_α = z_{1−α/2} √(V_mm)

- Test of a parameter β_j for H0: β_j = β_{j0}, β_j ≤ β_{j0}, or β_j ≥ β_{j0}:

    T = (β̂_j − β_{j0}) / √(V_jj)  ~  N(0, 1) | H0*

- p-values for H0: β_j = β_{j0}, β_j ≤ β_{j0}, or β_j ≥ β_{j0}, respectively (see the sketch below):

    p_= = 2(1 − Φ(|t_data|)),    p_≤ = 1 − Φ(t_data),    p_≥ = Φ(t_data)

- As in regression, four times as much data halves the estimation error.
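A minimal sketch of these formulas (the parameter value and variance are purely illustrative):

```python
# Minimal sketch: Wald confidence interval and p-values for a single parameter,
# given beta_hat and its variance V_jj from the variance-covariance matrix.
import numpy as np
from scipy.stats import norm

beta_hat, V_jj, beta_j0, alpha = -0.04, 0.02**2, 0.0, 0.05

delta = norm.ppf(1 - alpha / 2) * np.sqrt(V_jj)
ci = (beta_hat - delta, beta_hat + delta)      # CI_alpha(beta_j)

t = (beta_hat - beta_j0) / np.sqrt(V_jj)       # realized test statistic
p_eq = 2 * (1 - norm.cdf(abs(t)))              # H0: beta_j  = beta_j0
p_le = 1 - norm.cdf(t)                         # H0: beta_j <= beta_j0
p_ge = norm.cdf(t)                             # H0: beta_j >= beta_j0
print(ci, p_eq, p_le, p_ge)
```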
Significance tests II: Likelihood-ratio (LR) test

As in regression (F-test), one sometimes wants to test null hypotheses fixing several parameters simultaneously to given values, i.e., H0 corresponds to a restricted model.

- H0: the restricted model with some fixed parameters and M_r remaining parameters describes the data as well as the full model with M parameters.
- Test statistic:

    λ_LR = 2 ln[ L(β̂) / L_r(β̂_r) ] = 2 [ L̃(β̂) − L̃_r(β̂_r) ]  ~  χ²(M − M_r) if H0

- Data realization: calibrate both the full and the restricted model and evaluate λ_LR^data.
- Result: reject H0 at level α based on the (1 − α) quantile:

    λ_LR^data > χ²_{1−α, M−M_r};    p-value: p = 1 − F_{χ²(M−M_r)}(λ_LR^data)
Example: Mode choice for the route to this lecture
Distance class n | Distance r_n | i = 1 (ped/bike) | i = 2 (PT/car)
n = 1: 0-1 km    |    0.5 km    |        7         |        1
n = 2: 1-2 km    |    1.5 km    |        6         |        4
n = 3: 2-5 km    |    3.5 km    |        6         |       12
n = 4: 5-10 km   |    7.5 km    |        1         |       10
n = 5: 10-20 km  |   15.0 km    |        0         |        5

    V_{n1}(β_1, β_2) = β_1 r_n + β_2,    V_{n2}(β_1, β_2) = 0

- β_1: difference in distance sensitivity (utility/km) for choosing ped/bike over PT/car (expected < 0)
- β_2: utility difference of ped/bike over PT/car at zero distance (expected > 0)

Do the data allow us to distinguish this model from the trivial model V_{ni} = 0?
LR test for the corresponding Logit models
- H0: the trivial model V_{ni} = 0 describes the data as well as the full model V_{ni}(β_1, β_2) = (β_1 r_n + β_2) δ_{i1}.
- Test statistic: λ_LR = 2 [ L̃(β̂_1, β̂_2) − L̃(0, 0) ] ~ χ²(2) | H0
- Data realization (one L̃-unit per contour): λ_LR^data = 2(−26.5 + 35.5) = 18
- Decision: rejection range λ_LR > χ²_{2, 0.95} = 5.99 ⇒ H0 rejected.
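The same decision with scipy, using the log-likelihood values read off from the contour plot (−26.5 for the full model, −35.5 for the trivial one):

```python
# Minimal sketch: the LR test above, computed with scipy.stats.
from scipy.stats import chi2

LL_full, M_full = -26.5, 2                     # full model, 2 parameters
LL_triv, M_triv = -35.5, 0                     # trivial model V = 0

lam = 2 * (LL_full - LL_triv)                  # lambda_LR^data = 18
df = M_full - M_triv
print(lam, chi2.ppf(0.95, df))                 # 18 > 5.99 -> reject H0
print(1 - chi2.cdf(lam, df))                   # p-value approx. 1.2e-4
```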
Fit quality of the full model
? What would be the modeled ped/bike modal split for the null model V_{ni} = 0?
! 50:50.

? Read off from the L̃ contour plot the parameter of the AC-only model V_{ni} = β_2 δ_{i1} and give the modeled modal split.
! β̂_2 = ln(P_1/P_2) ≈ −0.5, consistent with P_1/P_2 = exp(β̂_2) ≈ N_1/N_2 = 20/32.

? Motivate the negative correlation between the parameter errors.
! A too-high value of β_2 tends to be compensated by a too-low value of β_1 (and vice versa). This at least ensures that, in case of correlated errors, about the same overall fraction chooses each alternative as for the calibrated model.
9.4 Goodness-of-Fit Measures
- The parameter tests for equality and the LR test are related to significance: is the more complicated of two nested models significantly better at describing the data?
- This can be used to find the best model using the top-down ansatz: make it as simple as possible, but not simpler!
- Problem: for very big samples, nearly any new parameter becomes significant, and the top-down ansatz fails.
- More importantly: significance/LR tests cannot give evidence for missing but relevant factors.
- A further problem: we cannot compare non-nested models.
- Finally, in reality one is often interested in effect strength (difference in fit and validation quality), not in significance.

⇒ We need measures for absolute fit quality.
Information-based goodness-of-fit (GoF) measures
- Akaike's information criterion (in its small-sample corrected form):

    AIC = −2 L̃ + 2MN / (N − (M + 1))

- Bayesian information criterion:

    BIC = −2 L̃ + M ln N

  N: number of decisions; M: number of parameters.

- Both criteria give the additional information needed to obtain the actual micro-data from the model's prediction, including an over-fitting penalty: the lower, the better.
- Both the AIC and the BIC are equivalent to the corresponding GoF measures of regression.
- The BIC puts more weight on parsimonious models (low M).
- For nested models satisfying the null hypothesis of the LR test and N ≫ M, the expected AIC is the same (verify!). However, since the AIC is an absolute measure, it allows comparing non-nested models (see the sketch below).
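A minimal helper computing these criteria from the maximized log-likelihoods LL (fitted model) and LL0 (AC-only reference); the ρ² measures it also returns are defined on the next slide. The numbers in the call are purely illustrative.

```python
# Minimal sketch: information criteria and LR indices from maximized
# log-likelihoods; M parameters, N decisions.
import numpy as np

def gof(LL, LL0, M, N):
    aic = -2 * LL + 2 * M * N / (N - (M + 1))  # small-sample corrected AIC
    bic = -2 * LL + M * np.log(N)
    rho2 = 1 - LL / LL0                        # McFadden's R^2
    rho2_adj = 1 - (LL - M) / LL0              # adjusted for M parameters
    return aic, bic, rho2, rho2_adj

print(gof(LL=-110.0, LL0=-137.6, M=4, N=144))  # illustrative numbers only
```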
GoF measures corresponding to the coefficient of determination R² of linear models
(L̃_0: log-likelihood of the estimated AC-only or trivial model)

- LR index, also known as McFadden's R²:

    ρ² = 1 − L̃/L̃_0

- Adjusted LR index / adjusted McFadden's R²:

    ρ̄² = 1 − (L̃ − M)/L̃_0

- The LR index ρ² and the adjusted LR index ρ̄² correspond to the coefficient of determination R² and the adjusted coefficient R̄² of regression models, respectively: the higher, the better.
- In contrast to regression models, even the best-fitting model has ρ² and ρ̄² values far from 1. Values as low as 0.3 may characterize a good model (see Example 9.2.1), while R² = 0.3 means a really bad fit for a regression model.
- An over-fitted model with M parameters fitting N = M decisions reaches the "ideal" LR-index value ρ² = 1, while ρ̄² is near zero.
Questions on GoF metrics
? Discuss the model to be tested, the AC-only model, and the trivial model in the context of weather forecasts.
! Full forecast information, information from a climate table, and a 50:50 guess, respectively.

? Give the log-likelihood of the AC-only and trivial models if there are I alternatives and N_i decisions for alternative i (total number of decisions N = ∑_{i=1}^{I} N_i).
! Trivial model: P_{ni} = 1/I, so L̃ = ∑_n ln P_{i_n} = ∑_i N_i ln P_i = −N ln I.
  AC-only model: P_{ni} = N_i/N, so L̃ = ∑_i N_i ln P_i = ∑_i N_i ln N_i − N ln N.

? Consider a binary choice situation where the N/2 persons with short trips chose the pedestrian/bike option with a probability of 3/4 and the PT/car option with 1/4. The other N/2 persons with long trips had the reverse modal split, with a ped/bike usage of only 25%. What would be the LR index for the "perfect" model exactly reproducing the observed 3:1 and 1:3 modal splits for the short and long trips, respectively?
! Just below 0.19; see the worked check below.
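A short worked check of that answer. Since the overall split is 50:50, the AC-only and trivial reference models coincide here, so L̃_0 = −N ln 2:

```latex
\tilde{L} = N\left[\tfrac{3}{4}\ln\tfrac{3}{4} + \tfrac{1}{4}\ln\tfrac{1}{4}\right]
          \approx -0.562\,N, \qquad
\tilde{L}_0 = -N\ln 2 \approx -0.693\,N, \qquad
\rho^2 = 1 - \frac{\tilde{L}}{\tilde{L}_0} \approx 1 - \frac{0.562}{0.693} \approx 0.19
```

So even a model that reproduces the true choice probabilities exactly stays far below ρ² = 1, illustrating the earlier remark that discrete-choice ρ² values are not comparable to regression R² values.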