“Cramming” Before the Exam:
Estimating the Causal Effect of Exam Preparatory
Programs in a Non-randomized Study
Ming-sen Wang
Department of Economics
University of Arizona∗†
May 04, 2012
FIRST DRAFT: January 12, 2012
Abstract
In this empirical paper, I estimate the impact of attending exam preparatory pro-
grams, in particular “cram schools,” on students’ academic performance. I measure
the outcome by admission to a public high school and an “elite” high school. Fo-
cusing on the problem that students are not randomly assigned to “cram schools,”
I approach the issue using propensity score matching and a Bayesian simultaneous-
equations model. Using data from a survey of Taiwanese junior high school students
in the Taiwan Youth Project, I find evidence that there is an insignificantly negative
∗I am indebted to Ronald Oaxaca for his continuous guidance and to Katherine Barnes, Price
Fishback, Keisuke Hirano, and Tiemen Woutersen for helpful comments and suggestions. I have
benefited from discussions with Mario Samano-Sanchez, Sandeep Shetty, and Ju-Chun Yen. All
remaining errors are my own. E-mail: [email protected]; the latest version of the paper can be
found at: http://www.u.arizona.edu/∼mswang.
†Data analyzed in this paper were collected by the research project Taiwan Youth Project, sponsored
by the Academia Sinica (AS-93-TP-C01). This research project was carried out by the Institute of Sociology,
Academia Sinica, and directed by Chin-Chun Yi. The Center for Survey Research of Academia Sinica
is responsible for the data distribution. The authors appreciate the assistance in providing data by the
institutes and individuals aforementioned. The views expressed herein are the authors’ own.
sorting into exam preparatory programs and that attending an exam preparatory program
improves a student’s probability of being admitted to a public high school or an “elite”
high school. Both approaches indicate similar positive treatment effects.
1 Introduction
In many East Asian countries, such as Taiwan and Japan, attendance at so-called “cram
schools” is prevalent. A “cram school” is a type of shadow education that aims to
improve a student’s exam-taking skills. Attending a “cram school” imposes additional
burdens on a student and her family. It puts additional stress on the student because it requires
time and effort. It puts a financial burden on parents because sending a child to a program for
a month can cost more than a semester of tuition fees in a public school.
Given the prevalence and important role of exam preparatory programs in the education
system, it is surprising that there are few rigorous evaluations. One problem is that students
often self-select into these prep-programs (Jackson (2012)[33]). As shown in Figure (1), the
number of “cram schools” in Taiwan has grown steadily. However, there has never been rigorous
evidence that attending an exam prep-program indeed improves students’ high school placement.
In a seminal paper, Stevenson and Baker (1992)[47] point out possible factors that foster
“cram schools”: (1) the use of a centrally administered examination, (2) the use of “con-
test rules” instead of “sponsorship rules”, and (3) tight linkages between the outcomes of
educational allocation in elementary and secondary schooling and future educational oppor-
tunities. Taiwanese society has all these factors. Graduates of an “elite” university in Taiwan
have significant advantages in the labor market (Lin (1983) [36])1. A student’s performance
in the Joint High-school Entrance Exam and the Joint College Entrance Exam is strongly
linked to future opportunities. This linkage makes “cram schools” prevalent in Taiwan and makes
Taiwan an ideal candidate for study.
The paper distinguishes itself from previous work in two ways (see Stevenson and Baker
(1992)[47] and Lin et al. (2006)[37]). First, while other studies define exam performance
as the outcome, I focus on admission to a public high school and to an “elite” high school as the outcomes of
interest, which avoids the selection issue related to taking the Joint Entrance Exam. Because Taiwan has
recently undergone a significant education reform, as discussed in the next section, focusing
on admission circumvents modeling complications and the need for exclusion restrictions.
1 Note that this result can hardly be interpreted as causal, since the research does not control for the selection that graduates of an “elite” university in Taiwan are productive to begin with.
Second, I estimate the effect of “cramming” using a dataset of junior high school students,
while previous work uses samples of high school students. The difference is meaningful
because senior high school is an important stage of educational stratification in
Taiwan. Whether attending prep-programs steers teenagers toward the academic
track or the vocational track is an interesting question per se.
I compare estimates from propensity score matching and a Bayesian simultaneous-equations
model. Identification in the two approaches comes from different untestable assumptions:
propensity score matching relies on the conditional independence assumption (Rosenbaum and
Rubin (1983)[43]), while the Bayesian model relies on exogeneity of the exclusion restrictions.
The two approaches differ slightly in the interpretation of the estimate, but both indicate positive
effects of attending “cram school” on admission to a public high school or an “elite” high school.
[Figure 1: bar chart. x-axis: year (2002 to 2010); y-axis: number of tutoring schools (500 to 2500); series by county: Taipei City, Taipei County, Yilan County.]
Figure 1: Growth in Number of Tutoring Schools in Taiwan (2002 - 2010)
† Data for this bar chart come from http://ap4.kh.edu.tw/. The database is maintained by the Education
Bureau of Kaohsiung City Government. It has county-level statistics for all cram schools and
after-school tutoring in Taiwan. The figure shows that the number of tutoring schools in the three counties under
study increased from 2002 to 2010.
1.1 Institutional Backgrounds
In 1987, Taiwan ended the martial law that had been in effect since 1949. Along with the
freer and more open political atmosphere, many civil groups started to request reforms
of the education system. One of the most significant changes was to replace the old Joint
Exam System with the new Multi-Opportunities System. In the old system, every junior
high graduate had to take the Joint High-school Entrance Exam, which took place in the
summer after graduation. Students were ranked based on their exam grades. The ranking
determined their priority in choosing an academic high school or a vocational high school.
Their performance on the Joint High-school Entrance Exam thus determined their high school;
the Exam decided the educational stratification.
In 2001, the Ministry of Education officially implemented the new Multi-Opportunities System.
The main idea of the new system is to separate admissions from exams. Two joint
exams, the Basic Scholastic Ability Test and the Joint High-school Entrance Exam, are held
in a school year to give students one more chance. Under the new system, students
can be admitted to high schools through multiple channels, such as (1) the Joint Entrance
Exam, (2) the Special Admission Quotas for Recommended Students, and (3) Other Channels
without Entrance Exam Grades. Even though using grades on the Joint Entrance Exam
as the outcome provides a universal measurement, it requires handling selection into
taking the Exam, which complicates the model. Defining admission as the outcome greatly simplifies the modeling.
1.2 Literature Review
Human capital investment has been a research focus ever since the first rigorous
treatment of the topic in Becker (1962)[7]. A large literature is dedicated to estimating the returns to
formal schooling (see Ashenfelter and Krueger (1994)[6]; Card (1995)[11]; Card (2001)[12];
Belzil (2007)[10]). Regan et al. (2007)[41], in contrast, focus on the optimal stopping level of
schooling instead of estimating the rate of return.
On the other hand, if a prep-program does not directly increase human capital and it
only affects a student’s exam performance, the program can be considered as a way to reduce
high school costs. It is of particular interest to investigate whether “cram school” increases
the likelihood of being admitted to public high school. Admission to an “elite” high school
increases the likelihood of being admitted to a better public university2. Again, tuition fees
2 Since the Taiwanese government subsidizes higher education heavily, public universities are in general ranked as better universities.
in a public university are significantly lower than in a private university. Lower tuition fees
affect a student’s decision about when to stop schooling.
As pointed out in Jackson (2010)[32], we can motivate the question in the context of the
Becker–Willis-Rosen life cycle model of human capital investment (see Becker (1993)[9] and
Willis and Rosen (1979)[52]).
Suppose earnings y depend on the years of schooling s through an increasing concave
function g(s), so that g(s) is the log of earnings:
$$y = e^{g(s)}$$
Individuals pay a cost c to attend school, and δ is the discount rate. Then in the Becker-
Rosen framework, a student who considers two levels of schooling chooses T years over no
schooling if:
$$V(T) \ge V(0) \iff \int_T^{\infty} e^{g(T)} e^{-\delta t}\,dt - \int_0^{T} c\,e^{-\delta t}\,dt \ge \int_0^{\infty} e^{g(0)} e^{-\delta t}\,dt$$
If c is lowered by the decision to attend a “cram school”, then a student’s utility when she
acquires more education increases. A student will more likely acquire more education and
postpone termination of schooling.
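Because the integrands are exponential, the comparison between T years of schooling and no schooling has a closed form. The worked step below is only a sketch: it assumes δ > 0 and a constant per-period cost c, as in the condition above, and it is not taken from the paper.

```latex
\begin{align*}
\int_T^{\infty} e^{g(T)} e^{-\delta t}\,dt &= \frac{e^{g(T)-\delta T}}{\delta}, \\
\int_0^{T} c\,e^{-\delta t}\,dt &= \frac{c\,\bigl(1-e^{-\delta T}\bigr)}{\delta}, \\
\int_0^{\infty} e^{g(0)} e^{-\delta t}\,dt &= \frac{e^{g(0)}}{\delta}, \\
V(T) \ge V(0) &\iff e^{g(T)-\delta T} - c\,\bigl(1-e^{-\delta T}\bigr) \ge e^{g(0)}.
\end{align*}
```

The last line follows from multiplying through by δ. Its left-hand side is decreasing in c whenever T > 0, so a reduction in c can only enlarge the set of students who choose T years of schooling.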
If prep-programs have no effect or negative effects on high school placement, then
attending the programs is fundamentally rent-seeking behavior (see Krueger (1974)[34]).
In that case, the motivation to send a teenager to “cram school” may be driven by behavioral factors,
such as parents’ unrealistic concern that their children will be left behind if all other children go to
“cram school.”
Jackson (2010)[32] is the most similar study and uses a U.S. high school dataset. He looks at
the short-term outcome of the Advanced Placement Incentive Program (APIP), which pays
both teachers and students for passing grades on Advanced Placement (AP) examinations.
Using propensity score matching methods, he finds that APIP adoption is associated with a
13 percent increase in the number of students scoring above 1100/24 on the SAT/ACT and a
4.96 percent increase in the number of students matriculating in college. My study finds
similar patterns in Taiwan.
2 Data: Taiwan Youth Project
The Taiwan Youth Project (TYP) was started in the spring of 2000, with junior high students
from Taipei County, Taipei City, and Yilan County as the study population. In order to
examine the effects of Taiwan’s educational reforms on the students, TYP takes two cohorts
as the study subjects: the 1st year junior high students with an average age of 13 (those
under the reformed high school entrance system) and the 3rd year junior high students with
an average age of 15 (those under the old high school entrance system). TYP samples 1000
students in each of the 1st and 3rd years of junior high from each of Taipei City and Taipei County and
800 students in each of the 1st and 3rd years from Yilan County. The total sample
size is 5600 students.
I use the cohort of the first year junior high students since I observe their program
attendance history. After sample attrition, I am left with 2449 observations. In Table (1), I
summarize the key variables in the dataset.
Table 1: Summary Statistics
Variable                        Mean     SD     Variable                      Mean     SD
“Cram School” in Senior Year    0.48    0.50    Male                          0.51    0.50
Sound Family                    0.87    0.33    Number of Siblings            3.56    0.87
Ever Fail a Class               0.35    0.48    Admission to Public HS        0.30    0.46
Admission to Elite HS           0.10    0.31    Intent to Attend HS           0.68    0.47
Minutes to “Cram School”       18.99   11.60    -                                -       -

Cram School History for First 2 Years
00                              0.35    0.48    01                            0.09    0.28
10                              0.12    0.32    11                            0.45    0.50

Counties
Taipei City                     0.39    0.49    Taipei County                 0.39    0.49
Yilan County                    0.22    0.41    -                                -       -

Father’s Educ.                                  Mother’s Educ.
Elementary School               0.13    0.34    Elementary School             0.17    0.37
Junior High School              0.26    0.44    Junior High School            0.26    0.44
High School Graduate            0.25    0.43    High School Graduate          0.26    0.44
Vocational School               0.08    0.27    Vocational School             0.10    0.30
Vocational College              0.06    0.24    Vocational College            0.05    0.23
University                      0.11    0.31    University                    0.08    0.27
Grad School                     0.04    0.18    Grad School                   0.01    0.11
Not Applicable                  0.00    0.05    Not Applicable                0.01    0.09
No Education                    0.06    0.24    No Education                  0.06    0.24

Family Income
less than NTD 30,000            0.18    0.38    NTD 30,000 - NTD 49,999       0.22    0.41
NTD 50,000 - NTD 59,999         0.21    0.40    NTD 60,000 - NTD 69,999       0.07    0.26
NTD 70,000 - NTD 79,999         0.08    0.27    NTD 80,000 - NTD 89,999       0.05    0.22
NTD 90,000 - NTD 99,999         0.04    0.20    NTD 100,000 - NTD 109,999     0.04    0.20
NTD 110,000 - NTD 119,999       0.03    0.17    NTD 120,000 - NTD 129,999     0.02    0.14
NTD 130,000 - NTD 139,999       0.01    0.10    NTD 140,000 - NTD 149,999     0.01    0.10
more than NTD 150,000           0.04    0.20    -                                -       -
3 Propensity Score Matching
I first approach the question by propensity score matching. I define the treatment as
attending an exam prep-program in the senior year because attending “cram school” in that
year has the strongest link to high school placement. Because for the treatment group we
observe only the realized outcome under treatment, the propensity score matching approach
constructs a counterfactual outcome for each treated unit based on the propensity score.
Identification of propensity score matching relies on the conditional independence assumption
(Rosenbaum and Rubin (1983)[43]):
$$T_i \perp \bigl(Y_i(1), Y_i(0)\bigr) \mid X_i$$
where $Y_i(1)$ and $Y_i(0)$ denote the potential outcomes with and without treatment.
Conditional on observable characteristics, potential outcomes are independent of treatment.
In our context, I assume that attending “cram school” is independent of the potential
admission outcomes with and without “cram school” after controlling for the observed
family background and students’ performance in school. This is a strong and empirically
untestable assumption: there must be no unobserved characteristics that
affect both the outcome and exam prep-program attendance. Hence, it is important to select
covariates so that the conditional independence assumption is likely to hold. In addition to
standard covariates in the education literature, I proxy for ability by whether a student has ever
failed a class and for motivation by whether she intends to attend high school.
Given the richness of the covariates, I adopt the propensity score approach. Rosenbaum and
Rubin (1983)[43] show that conditioning on the full set of covariates is equivalent to conditioning
on the propensity score, which is the coarsest balancing score. I estimate
the propensity score non-parametrically by series logit regression. By 10-fold cross-validation, the first-order
series yields the smallest prediction error. I present the estimates of the propensity score model in
Table(2).
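As an illustration of what a first-order series logit amounts to, the following is a minimal sketch that fits a logit with a linear index by gradient ascent and returns fitted propensity scores. The data, the function names, and the optimizer settings are hypothetical, and the sketch omits the 10-fold cross-validation used to choose the series order; it is not the estimation code behind Table (2).

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fit_logit(X, t, steps=500, lr=0.5):
    """Fit a first-order (linear-index) logit by gradient ascent on the mean log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for xi, ti in zip(X, t):
            resid = ti - sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j in range(k):
                grad[j] += resid * xi[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(X, beta):
    """Fitted probability of treatment for every row of X."""
    return [sigmoid(sum(b * x for b, x in zip(beta, xi))) for xi in X]


# Hypothetical data: an intercept plus one dummy covariate
# (for example, prior "cram school" attendance).
rng = random.Random(0)
X = [[1.0, float(rng.random() < 0.5)] for _ in range(200)]
t = [int(rng.random() < sigmoid(-1.0 + 2.0 * xi[1])) for xi in X]

beta_hat = fit_logit(X, t)
scores = propensity_scores(X, beta_hat)
```

Higher-order series terms would enter simply as extra columns of `X` (squares and interactions of the covariates), with cross-validation used to compare the resulting prediction errors.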
3.1 Overlap Condition
An important issue that often hampers the propensity score matching approach is lack of
overlap in the covariate distributions. Figure (2) shows the histogram of the estimated
propensity scores of both the treatment and control groups. Even though the treatment group is
concentrated at higher values of the propensity score and the control group is concentrated
at lower values, the two groups share a common support. An implication of the figure is that
we should use a small number of matches to avoid too much smoothing and extrapolation.

Table 2: Estimated Propensity Score
                                          Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)                                -4.1436       1.1167     -3.71     0.0002
Male                                       -0.0747       0.1129     -0.66     0.5081
Num of Siblings                            -0.1443       0.0694     -2.08     0.0376
Sound Family                                0.4229       0.1806      2.34     0.0192
Attendance Histories
  11                                        3.3385       0.1450     23.02     0.0000
  10                                        0.4082       0.1914      2.13     0.0329
  01                                        2.8512       0.1986     14.36     0.0000
Fail a Class (Proxy for Ability)            0.5430       0.1269      4.28     0.0000
Intention to HS (Proxy for Motivation)      0.3859       0.1249      3.09     0.0020
Father’s Educ.                                 Yes
Mother’s Educ.                                 Yes
Father’s Occ.                                  Yes
Mother’s Occ.                                  Yes
School FE                                      Yes
Family Income Level                            Yes
3.2 Results
The benchmark result of propensity score matching is presented in Table(3). I compare
different matching approaches. In 1-nearest-neighbor matching, the counterfactual outcome
for each treated unit is constructed from the control unit with the closest propensity score.
10-nearest-neighbor matching, instead, uses the closest 10 control units. By using more comparison
units, the precision of the estimate increases at the cost of larger bias. The trade-off
between 1-nearest-neighbor and 10-nearest-neighbor matching is the well-known bias-variance trade-off in
the non-parametric literature. Caliper matching, on the other hand, uses all the control units
within a predefined caliper but drops the treated units that have no matches. The problem
with caliper matching is that the choice of caliper is left to the researcher’s judgment
and that dropping unmatched units alters the interpretation of the estimate. Instead of the
average treatment effect on the treated (ATT), the estimate from caliper matching should be
interpreted as the conditional treatment effect on the treated given the matched subset (CATT).
All the estimates for ATT are positive, ranging from a 15 to 18 percentage point improvement
in the chance of admission to a public high school and from a 3 to 5 percentage point improvement in the chance of
admission to an “elite” high school. The public high school estimates are statistically significant under
every matching method; the “elite” high school estimates are significant only under caliper matching.
In words, a student who attended “cram school” would
have been 15 to 18 percentage points less likely to be admitted to a public high school and 3 to 5 percentage points
less likely to be admitted to an “elite” high school if she had not attended “cram school.”

[Figure 2: two histogram panels, labelled 0 and 1. x-axis: propensity score (0.0 to 1.0); y-axis: density.]
Figure 2: Histograms of Estimated Propensity Scores
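The matching estimators compared in this section can be sketched in a few lines. The code below is a minimal illustration on hypothetical toy data, matching with replacement on an already estimated propensity score; the function names are assumptions, and the Abadie-Imbens standard errors reported in the tables are not computed.

```python
def att_knn(scores, treated, outcomes, k=1):
    """ATT from k-nearest-neighbour matching (with replacement) on the propensity score."""
    controls = [(s, y) for s, t, y in zip(scores, treated, outcomes) if t == 0]
    effects = []
    for s, t, y in zip(scores, treated, outcomes):
        if t == 1:
            nearest = sorted(controls, key=lambda c: abs(c[0] - s))[:k]
            counterfactual = sum(cy for _, cy in nearest) / len(nearest)
            effects.append(y - counterfactual)
    return sum(effects) / len(effects)


def att_caliper(scores, treated, outcomes, caliper):
    """Effect on the matched treated units, using every control within the caliper.

    Treated units with no control inside the caliper are dropped, so the
    second return value reports how many treated units were matched.
    """
    controls = [(s, y) for s, t, y in zip(scores, treated, outcomes) if t == 0]
    effects = []
    for s, t, y in zip(scores, treated, outcomes):
        if t == 1:
            close = [cy for cs, cy in controls if abs(cs - s) <= caliper]
            if close:
                effects.append(y - sum(close) / len(close))
    return sum(effects) / len(effects), len(effects)


# Hypothetical toy data: five students with estimated propensity scores.
scores = [0.20, 0.25, 0.60, 0.65, 0.90]
treated = [0, 1, 0, 1, 1]
outcomes = [0, 1, 1, 1, 0]

att_1nn = att_knn(scores, treated, outcomes, k=1)
att_cal, n_matched = att_caliper(scores, treated, outcomes, caliper=0.1)
```

In the toy data the treated student with score 0.90 has no control within the caliper, so caliper matching averages over two treated students rather than three. This is the sense in which the caliper estimate is a CATT rather than an ATT.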
Since I am interested in estimating the ATT, I can apply the covariate balancing strategy
proposed by Rubin (2006)[45] when overlap in the covariate distributions is a concern. The idea
is to select a more balanced subsample before estimating the ATT. The procedure works as
follows:
1. Order the treated units by their estimated propensity scores.
2. Match without replacement, in decreasing order of the estimated propensity score, to
select the corresponding control units. This leads to a balanced sample with sample size
2×N1.
3. Redo the analysis, say propensity score matching, on the balanced sample.
An advantage of the approach is that the interpretation of the estimate is not affected by
trimming control units as long as we are interested in ATT.
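The subsample selection steps above can be sketched as follows. This is a minimal illustration on hypothetical toy data: the function name is an assumption, and "corresponding control unit" is taken to mean the remaining control with the closest propensity score, which the text does not state explicitly.

```python
def balanced_subsample(scores, treated):
    """Pair each treated unit with a distinct control unit.

    Treated units are processed in decreasing order of propensity score and
    each takes the closest control that is still available (no replacement).
    Returns a list of (treated_index, control_index) pairs.
    """
    treated_idx = sorted(
        (i for i, t in enumerate(treated) if t == 1),
        key=lambda i: scores[i],
        reverse=True,
    )
    available = [i for i, t in enumerate(treated) if t == 0]
    pairs = []
    for i in treated_idx:
        if not available:
            break  # more treated than control units: stop matching
        j = min(available, key=lambda c: abs(scores[c] - scores[i]))
        available.remove(j)
        pairs.append((i, j))
    return pairs


# Hypothetical toy data: three treated and three control units.
scores = [0.90, 0.80, 0.30, 0.85, 0.40, 0.10]
treated = [1, 1, 1, 0, 0, 0]
pairs = balanced_subsample(scores, treated)
```

With N1 treated units and at least as many controls, the pairs cover 2×N1 observations, and any matching estimator can then be rerun on that subsample.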
I report the result in Table(4). Consistent with the previous results, attending an exam
prep-program improves a student’s chance of being admitted to a public high school by a statistically
significant 15% to 18% and to an “elite” high school by 2% to 5%.
Table 3: Propensity Score Matching: Full Sample
Outcome                 Est.        A-I S.E.†   Num. Matched
1-Nearest-Neighbor
  Public High School    0.157∗∗∗    0.035       1199
  Elite High School     0.029       0.022       1199
10-Nearest-Neighbor
  Public High School    0.145∗∗∗    0.030       1199
  Elite High School     0.030       0.020       1199
Caliper δ = 0.001
  Public High School    0.180∗∗∗    0.011       457
  Elite High School     0.045∗∗∗    0.007       457
† The standard errors are calculated based on Abadie and Imbens(2006)[1].
Table 4: Propensity Score Matching: Rubin Subsample
Outcome                 Est.        A-I S.E.    Num. Matched
1-Nearest-Neighbor
  Public High School    0.190∗∗∗    0.049       1199
  Elite High School     0.050       0.036       1199
10-Nearest-Neighbor
  Public High School    0.187∗∗∗    0.041       1199
  Elite High School     0.053∗      0.031       1199
Caliper δ = 0.001
  Public High School    0.172∗∗∗    0.013       376
  Elite High School     0.024∗∗∗    0.009       376
Table (5) shows the estimates of ATT using the subsample of students who intend to
attend high school. The sample excludes students who are interested in professional
training or in ending their schooling. This is a first attempt to deal with the ability sorting
issue. Students who are better at academics would like to attend high school; therefore, they are more
likely to go to “cram school,” and the estimated effect may be exaggerated. On the other hand,
if students who go to “cram school” are those who would like to attend high school but do not
have a comparative advantage in academics, then we would expect the estimate to be biased
downward. Again, the estimator relies on the assumption that conditional independence
holds within the subsample, even though some may doubt its validity in the full
sample.
Since the estimate only exploits a subsample, the interpretation of the estimates again
changes from ATT to CATT: the treatment effect on the treated among students who would like
to go to high school. Most estimates show somewhat larger effects but remain consistent with the
previous estimates.
Table 5: Propensity Score Matching: Intention-to-HS Subsample
Outcome                 Est.        A-I S.E.    Num. Matched
1-Nearest-Neighbor
  Public High School    0.218∗∗     0.061       931
  Elite High School     0.096∗∗     0.046       931
10-Nearest-Neighbor
  Public High School    0.236∗∗∗    0.050       931
  Elite High School     0.102∗∗     0.040       931
Caliper δ = 0.001
  Public High School    0.176∗∗∗    0.013       224
  Elite High School     0.068∗∗∗    0.010       224
4 Bayesian Simultaneous Equations Model
As mentioned briefly in the last section, some may be concerned about the validity of the conditional
independence assumption since students may select into “cram school” based
on their motivation and ability. In this section, I set up a Bayesian simultaneous equations
model that attempts to take possible selection into account.
The model assumes that the latent potential outcomes $Y_i^*(0)$ and $Y_i^*(1)$ are linear functions of
family characteristics, $X_i$, treatment (“cram school” attendance), $T_i$, and an unobserved
random shock $\varepsilon_{1i}$.
In addition, I assume that the treatment effect is constant over the population,
$$Y_i^*(1) - Y_i^*(0) = \tau, \quad \forall i$$
and that the unobservable characteristics for each individual are the same whether she gets
treatment or not. The constant treatment effect assumption is somewhat unrealistic and
restrictive, but it may still be a good approximation. As noted in Angrist (2001)[2], in practice,
more general estimation strategies allowing heterogeneous treatment effects often lead to
similar average treatment effects. The assumption allows me to extrapolate the treatment
effect on those whose decision is affected by the exclusion restriction to the whole population.
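I, in turn, express the latent potential outcomes as:
$$Y_i^*(1) = \tau + X_i\beta_1 + \varepsilon_{1i}$$
$$Y_i^*(0) = X_i\beta_1 + \varepsilon_{1i}$$
The latent outcome under the realized treatment becomes:
$$Y_i^* = Y_i^*(0) + T_i\,[Y_i^*(1) - Y_i^*(0)] \tag{1}$$
$$\phantom{Y_i^*} = \tau T_i + X_i\beta_1 + \varepsilon_{1i} \tag{2}$$
I observe $Y_i = 1$ if $Y_i^* > 0$; $Y_i = 0$ otherwise.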
In order to accommodate the selection problem, I follow the standard strategy of Heckman
(1979)[30] and assume that a household makes an optimal decision on whether to send its children
to an exam preparatory program. A household sends its children to a “cram school” if
the utility is greater than a certain threshold. Therefore, I can interpret $T_i^*$ as the latent
normalized utility:
$$T_i^* = \gamma z_i + X_i\beta_2 + \varepsilon_{2i} \tag{3}$$
I observe $T_i = 1$ if $T_i^* > 0$; $T_i = 0$ otherwise.
The argument implies that, if the latent variables $Y_i^*$ and $T_i^*$ were known and the simultaneous
equations model could be solved, the estimate of τ would be an estimate of the average treatment effect.
Identification of the model boils down to whether I can solve the simultaneous equations. I
discuss the issue in Section (4.2).
4.1 Model Assumptions
In order to estimate the behavioral model specified above, I adopt a parametric approach
for efficiency and simplicity.
Normality Assumption
$$\begin{bmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \end{bmatrix} \Bigg|\; X_i, Z_i \sim \mathcal{N}(0, \Sigma)$$
The assumption specifies how the unobserved characteristics affect the outcome $Y_i^*$ and
the selection rule $T_i^*$. Under the normality assumption, the data augmentation approach
comes into play. From an initial guess of the latent variables $Y_i^*$ and $T_i^*$, we can
sequentially estimate the parameters and update the latent variables based on the
estimates and the normality assumption.
Re-parametrization Assumption
$$\operatorname{var}(\varepsilon_{2i}) = 1 \quad\text{and}\quad \varepsilon_{1i} = \delta\varepsilon_{2i} + \eta_i, \quad\text{where } \eta_i \sim \mathcal{N}(0, \sigma^2) \tag{4}$$
The assumption says the disturbance term of one equation is linear in the disturbance
term of the other with an additive error term. The assumption implies
$$\Sigma = \begin{bmatrix} \sigma^2 + \delta^2 & \delta \\ \delta & 1 \end{bmatrix}$$
Following the assumption, I can naturally re-parametrize the variance-covariance matrix.
It has three advantages. First, it allows the researcher to explicitly estimate the
components in Σ. Second, it normalizes a diagonal term in Σ to 1. Third, numerically,
the re-parametrization speeds up the convergence of the Gibbs sampling described
in detail in Appendix (A).
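The implied form of Σ can be checked by simulation. The sketch below is only an illustration: it draws errors according to equation (4) with hypothetical values of δ and σ², and it assumes that η is independent of ε2, which is what the stated Σ requires. It then compares the sample moments with σ² + δ², δ, and 1.

```python
import random


def simulate_errors(delta, sigma2, n, rng):
    """Draw (eps1, eps2) pairs with eps2 ~ N(0, 1) and eps1 = delta * eps2 + eta."""
    pairs = []
    for _ in range(n):
        eps2 = rng.gauss(0.0, 1.0)
        eta = rng.gauss(0.0, sigma2 ** 0.5)  # assumed independent of eps2
        pairs.append((delta * eps2 + eta, eps2))
    return pairs


def sample_covariance(pairs):
    """Return (var(eps1), cov(eps1, eps2), var(eps2)) using the n-denominator."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    v1 = sum((p[0] - m1) ** 2 for p in pairs) / n
    v2 = sum((p[1] - m2) ** 2 for p in pairs) / n
    c12 = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / n
    return v1, c12, v2


# Hypothetical parameter values, not estimates from the paper.
delta, sigma2 = 0.5, 2.0
v1, c12, v2 = sample_covariance(
    simulate_errors(delta, sigma2, 100_000, random.Random(1))
)
```

With these values the sample moments should be close to 2.25, 0.5, and 1. The sign of δ is the sign of the covariance between the two disturbances, which is why δ is read as the direction of sorting into treatment.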
4.2 Identification
The model is fundamentally a special case of the simultaneous equations models presented
in Heckman (1978) [29]. I briefly summarize his identification arguments. Following
Heckman’s argument, this class of simultaneous equations models is non-parametrically
identified if three conditions are satisfied.
1. Principal Assumption
The principal assumption requires that the endogenous variable xi does not enter both
equations. It is a necessary and sufficient condition for the class of simultaneous equations
models to be well-defined. It guarantees that we can uniquely solve for each parameter
from the equations. My model trivially satisfies the assumption.
2. Normalization of Variance
Given the selection equation has an interpretation as utility, the utility is invariant
to different scaling. The coefficients in the equations are identified up to a constant.
I can normalize a diagonal term of the variance-covariance matrix. I adopt a re-
parametrization approach presented in Section (4.1).
3. Exclusion Restrictions3
Even though I could rely purely on the nonlinearity implied by the normality assumption for identification,
a lack of exclusion restrictions in simultaneous equations models usually hampers the
robustness of the estimates (Manski (1989) [39]). On the other hand, a natural exclusion
restriction is often difficult to find.
Since selection into “cram school” can be interpreted as demand for “cram school,” it
is natural to look for a cost shifter. I follow the insight of Card (1995)[11] and exploit
geographic variation. The idea is to use the distance from a student’s school to the
“cram school” as the source of exogenous variation. The cost of attending a “cram school”
is composed of the time cost of transportation, the tuition fees, and the time
teenagers spend in class. As the traveling time to “cram school” increases, the
cost of attending “cram school” rises. Meanwhile, the traveling time does not affect
students’ performance in the admission procedure. It satisfies the exogeneity condition
for an ideal exclusion restriction.
Since I only observe the commuting time to a “cram school” for the attendants in
the dataset, I estimate a censored regression of the attendants’ commuting time to “cram school”
on their family characteristics. I impute the missing commuting
time of the non-attendants using the linear fitted values from the censored regression
estimates. The imputation is valid because junior high students attend school in their own school
district. Too young to hold a driver’s license, junior high students rely
on public transportation, walking, biking, or their parents for mobility. The exam
prep-programs are localized in the school districts. If teenagers with similar family
3I am indebted to Sandeep Shetty for the idea of the exclusion restriction.
backgrounds live in a similar neighborhood within the school district, the imputation
based on the fitted value of the censored regression will provide a good approximation
for the missing commuting time for the non-attendants.
4.3 Reference Prior
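To complete the Bayesian model, standard normal-gamma conjugate priors are imposed on
the parameters ([24]):
$$\beta_1 \sim \mathrm{MVN}(\beta_1^0, \sigma^2_{\beta_1} I)$$
$$\tau \sim \mathcal{N}(\tau_0, \sigma^2_{\tau})$$
$$\beta_2 \sim \mathrm{MVN}(\beta_2^0, \sigma^2_{\beta_2} I)$$
$$\delta \sim \mathcal{N}(\delta_0, \sigma^2_{\delta})$$
$$\sigma^2 \sim \mathrm{IG}(a, b)$$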
This is a commonly used proper reference prior, which approximates the standard
improper reference prior (see Christensen (2011)[18]). The prior means of the
normal distributions, $\beta_1^0$, $\beta_2^0$, and $\tau_0$, are set to 0. The choice of prior parameters is philosophically
consistent with Zellner (2007) [5] in the sense that all variation is considered random or
nonsystematic unless shown otherwise. I set $\sigma^2_{\beta}$, $\sigma^2_{\alpha}$, and $\sigma^2_{\gamma}$ to $10^6$ and set the shape and
scale parameters of the inverse-gamma distribution to $10^{-3}$. This gives the inverse gamma
distribution an ε-ε form, which has concentrated density at $0^+$ and a long tail. The choice
of these parameter values is standard. Reference priors corresponding to the standard
frequentist MLE or least squares methods are well known in the Bayesian literature; see, for example,
Chib (1992)[15] and Christensen (2011)[18].
4.4 The Results
In Table (6), I present the empirical results estimated by the Bayesian simultaneous equations
model. The first panel shows the results when the outcome is defined as admission to a
public high school. In order to compare the result with the propensity score matching result,
I calculate the average partial effect on the treated, based on P (Y = 1|X, T = 1). Consistent with the
matching estimates, the estimated effect is about a 14% increase in the probability of being
admitted to a public high school. The model also indicates insignificant negative selection into “cram
school.”
The second panel shows the results when the outcome is defined as admission to an “elite”
high school. Conditional on participating in a prep-program, the partial effect of attending
“cram school” increases the chances of being admitted to an “elite” high school by around
6.6%. The estimate is also consistent with matching results. Sorting to “cram school” in
this case is also insignificantly negative.
Table 6: Empirical Results of the Key Variables
                          Public HS                        Elite HS
Regressor                 Post. Mean   Post. SD   APE†     Post. Mean   Post. SD   APE
Cram School               1.128∗∗      0.523      0.107    0.820∗∗∗     0.208      0.084
Minutes to Cram School    -0.050∗∗∗    0.004      -        -0.046∗∗∗    0.004      -
σ²                        8.230        1.776      -        2.377        0.470      -
δ                         0.289        0.327      -        -0.153       0.284      -
† The average partial effect of “cram school” is defined conditional on “cram school” attendance when “cram school” attendance switches from 0 to 1.
$$\Delta \hat{P}(Y = 1 \mid X, T = 1) \approx \frac{1}{N} \sum_i \phi(X\hat{\beta})\,\hat{\beta}_j\,\Delta x_j$$
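To make the average partial effect concrete, the sketch below computes it two ways for a probit-type index: the exact discrete change in the normal CDF when treatment switches from 0 to 1, and the first-order approximation in the footnote formula. It is a simplified illustration with hypothetical inputs. In particular, it assumes the index has already been scaled to unit error variance, whereas in the model above the outcome disturbance has variance σ² + δ².

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def ape_on_treated(index_without_treatment, tau):
    """Average change in P(Y = 1) when treatment switches from 0 to 1.

    index_without_treatment: X_i * beta for each treated unit (hypothetical).
    tau: treatment coefficient on the same (unit-variance) scale.
    Returns (exact discrete change, first-order approximation).
    """
    n = len(index_without_treatment)
    exact = sum(
        STD_NORMAL.cdf(v + tau) - STD_NORMAL.cdf(v)
        for v in index_without_treatment
    ) / n
    approx = sum(STD_NORMAL.pdf(v) * tau for v in index_without_treatment) / n
    return exact, approx


# Hypothetical index values for three treated students.
exact_small, approx_small = ape_on_treated([0.0], 0.01)
exact_large, approx_large = ape_on_treated([-1.0, 0.0, 0.5], 1.0)
```

For a small change in the index the two numbers nearly coincide; for a coefficient as large as 1 they can differ noticeably, so the exact discrete change is the safer quantity to report for a binary treatment.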
Figure (3) shows the Markov chains and the histograms of the posterior distributions
of the treatment effect parameters. The posterior distributions have the standard shape for the
normal-inverse-gamma model. Since both of them are unimodal and symmetric, the 90%
credible sets are simply given by the 5% and 95% posterior quantiles. I present the
full empirical results in the Appendix.
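Reading an interval off posterior quantiles takes only a few lines once the draws are stored. The sketch below uses linear interpolation between order statistics, which is one common quantile convention and an assumption here; the draws are hypothetical placeholders, not output of the sampler.

```python
def credible_interval(draws, lower=0.05, upper=0.95):
    """Equal-tailed interval from posterior draws via interpolated quantiles."""
    ordered = sorted(draws)
    n = len(ordered)

    def quantile(p):
        pos = p * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    return quantile(lower), quantile(upper)


# Hypothetical draws: the integers 0..100 stand in for stored MCMC output.
low, high = credible_interval(list(range(101)))
```

For a unimodal, symmetric posterior this equal-tailed interval coincides with the highest-density interval, which is why the two quantiles suffice.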
4.5 Robustness Check
A possible concern about the empirical results is whether they are robust in the absence of
the Bayesian model. In this section, I implement a standard bivariate Probit model and
compare the results with the Bayesian approach.
Table (7) shows the empirical results using the standard bivariate Probit model. Notice that,
compared with the Bayesian model developed in the previous section, the Probit model imposes
an additional constraint of equal variances. The results indicate that “cram school” raises
students’ chances of being admitted to a public high school by 18.8% while it increases their
chances of being admitted to an “elite” high school by 3.3%. Both specifications also indicate
slightly negative sorting into “cram school.”
[Figure 3: four panels. Histogram: Public High School and Histogram: Elite High School (x-axis: Cram School; y-axis: density); Markov Chain: Public High School and Markov Chain: Elite High School (x-axis: Iterations, 200 to 1000; y-axis: Cram School).]
Figure 3: Posterior Distributions
† The Markov chain plots show every fifth draw of the simulated chains.
Table 7: Robustness Check: Bivariate Probit Model
                          Public HS                        Elite HS
Regressor                 Post. Mean   Post. SD   APE†     Post. Mean   Post. SD   APE
Cram School               0.730∗       0.366      0.220    1.035∗∗      0.316      0.065
Minutes to Cram School    -0.051∗∗     0.005      -        -0.047∗∗     0.005      -
ρ                         -0.083       0.211      -        -0.368∗      0.205      -
4.6 Individual Decision Problem
An advantage of applying Bayesian methods to program evaluation is that it allows the
researcher to think of the problem as a decision problem (Dehejia (2005)[21]). Imagine a
student who wonders whether she should attend “cram school” given her performance in school
and family background. She may be concerned about the uncertainty of the model estimates.
The researcher can help her by exploiting the Bayesian model. A student’s decision
on whether to enroll in an exam prep-program depends
on the outcome. It is therefore important to reflect the uncertainty of the outcomes from the
model by allowing for parameter uncertainty. The posterior predictive distribution of the
Bayesian model constructs a distribution of the outcome based on the posterior distribution of
the parameters.
I simulate the posterior predictive distribution in the following way. Consider each
individual $i$ in a cohort of interest, say students whose family income is less than
NTD 30,000 per month, who live in Taipei City, and who have failed a class. Given the
covariates $X_i$ as observed, I set $\tilde{T}^1_i = 1$ to simulate for the treated and
$\tilde{T}^0_i = 0$ to simulate for the control. Using the stored draws from the posterior
distributions $\{\tau^{(j)}, \alpha^{(j)}, \sigma^{2(j)}\}_{j=1}^{5000}$, I draw
$$Y^{*1}_i \mid \{\tilde{T}_i, X_i\} \sim N\left(\tau^{(j)} + X_i\alpha^{(j)},\ \sigma^{2(j)}\right)
\quad\text{and}\quad
Y^{*0}_i \mid \{\tilde{T}_i, X_i\} \sim N\left(X_i\alpha^{(j)},\ \sigma^{2(j)}\right).$$
Finally, I obtain the predicted outcomes by $Y^1_i = 1(Y^{*1}_i > 0)$ and
$Y^0_i = 1(Y^{*0}_i > 0)$.
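The simulation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `predictive_admission_prob`, the single covariate, and the simulated posterior draws are hypothetical and do not reproduce the paper's estimates.

```python
import random


def predictive_admission_prob(x, draws, treated, rng):
    """Average of 1(Y* > 0) over posterior draws for one student.

    x: list of covariates; draws: list of (tau, alpha, sigma2) tuples;
    treated: 1 to simulate the treated outcome, 0 for the control outcome.
    """
    hits = 0
    for tau, alpha, sigma2 in draws:
        mean = tau * treated + sum(xk * ak for xk, ak in zip(x, alpha))
        y_star = rng.gauss(mean, sigma2 ** 0.5)  # latent index draw
        hits += y_star > 0                       # predicted outcome 1(Y* > 0)
    return hits / len(draws)


rng = random.Random(1)
# Hypothetical stored draws (tau, alpha, sigma2); assumptions for illustration.
draws = [
    (rng.gauss(1.0, 0.2), [rng.gauss(-0.5, 0.1)], 1.0 + rng.random())
    for _ in range(5000)
]
x_i = [1.0]
p_treated = predictive_admission_prob(x_i, draws, 1, rng)
p_control = predictive_admission_prob(x_i, draws, 0, rng)
predicted_diff = p_treated - p_control
```

Averaging `p_treated`, `p_control`, and `predicted_diff` over the students in a cohort would give cohort-level quantities of the kind reported for the predicted probabilities.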
In Table (8), I show the average predicted probability of being admitted to a public high
school for different levels of family income and by whether the student has failed a course.
I define low income as earning less than NTD 30,000 per month, median income as NTD 50,000 -
NTD 59,999 per month, and high income as earning more than NTD 150,000 per month. The
results show that the likelihood of being admitted to a public high school is significantly
higher for students from higher income families. Among students who have failed a class,
that is, those who are less academically able, the predicted improvement in probability from
going to “cram school” is larger for higher income students than for lower income students.
However, we do not observe the same pattern among students who have never failed a class.
Comparing students who have failed a class with those who have never failed one, the
predicted effect is also significantly higher for the students who are more able. The effect
of “cram school” on less motivated or less able children is smaller than on motivated and
able students. This suggests that parents should think twice before sending children who are
not interested in studying to a “cram school” to “force” them onto the academic track. The
effect may not outweigh the costs of time, tuition fees, and unnecessary additional pressure.
Table 8: Mean and Variance of Predicted Probability of being Admitted to Public HS
                             Treated              Control              Treated - Control
Cohorts                      Pred. Prob.  S.D.    Pred. Prob.  S.D.    Pred. Diff.  S.D.    Num. Obs.
Taipei City
  Low Income; Fail           0.170        0.376   0.095        0.293   0.075        0.455   111
  Median Income; Fail        0.176        0.381   0.099        0.298   0.077        0.466   122
  High Income; Fail          0.302        0.459   0.189        0.391   0.114        0.575   25
  Low Income; Never Fail     0.594        0.491   0.448        0.497   0.146        0.663   36
  Median Income; Never Fail  0.683        0.465   0.542        0.498   0.141        0.651   69
  High Income; Never Fail    0.728        0.445   0.593        0.491   0.134        0.633   26
Taipei County
  Low Income; Fail           0.110        0.313   0.056        0.230   0.054        0.375   142
  Median Income; Fail        0.136        0.342   0.070        0.256   0.065        0.415   127
  High Income; Fail          0.119        0.324   0.059        0.236   0.060        0.392   20
  Low Income; Never Fail     0.516        0.500   0.371        0.483   0.145        0.662   46
  Median Income; Never Fail  0.534        0.499   0.389        0.487   0.145        0.664   73
  High Income; Never Fail    0.599        0.490   0.455        0.498   0.145        0.669   9
Yilan County
  Low Income; Fail           0.148        0.355   0.081        0.273   0.067        0.430   90
  Median Income; Fail        0.175        0.380   0.098        0.297   0.077        0.464   81
  High Income; Fail          0.179        0.383   0.103        0.304   0.076        0.467   11
  Low Income; Never Fail     0.579        0.494   0.433        0.495   0.146        0.656   18
  Median Income; Never Fail  0.677        0.468   0.541        0.498   0.136        0.643   36
  High Income; Never Fail    0.608        0.488   0.458        0.498   0.150        0.674   5
5 Discussions
Exam preparatory programs are prevalent in many East Asian countries because centrally
administered exam systems are used to allocate scarce educational resources. However,
because participants in these programs may be highly self-selected, rigorous evaluations of
the programs are lacking. The research question of whether an exam prep-program increases
the likelihood of being admitted to a public high school or an “elite” high school can be
motivated by the Rosen-Willis life cycle model of human capital investment. Investment in
exam preparatory programs can be considered a current investment that decreases future
educational costs. We expect “cram school” to increase the propensity of admission to a
public high school or an “elite” high school.
The paper provides two alternative empirical approaches to evaluating the effectiveness of
“cram schools.” Identification in the propensity score matching approach relies on the
unconfoundedness assumption, which states that, given the observed characteristics,
attending “cram school” is independent of potential placements. If the unconfoundedness
assumption holds, then the average treatment effect on the treated can be obtained by
matching each treated unit with the control unit that is closest in propensity score. The
results suggest that “cram school” increases the chances of being admitted to a public high
school by 16% to 20% and to an “elite” high school by 4% to 7%.
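To make the matching idea concrete, here is a minimal Python sketch of one-to-one nearest-neighbor matching on an already estimated propensity score, with replacement. It is an illustration only: the function `att_nearest_neighbor` and the toy data are hypothetical, and the estimator used in the paper may differ in details such as the number of matches or bias adjustment.

```python
def att_nearest_neighbor(pscore, treated, outcome):
    """ATT from matching each treated unit to the control unit with the
    closest propensity score (one match, with replacement)."""
    controls = [(p, y) for p, t, y in zip(pscore, treated, outcome) if t == 0]
    diffs = []
    for p, t, y in zip(pscore, treated, outcome):
        if t == 1:
            # Control unit at the shortest distance in propensity score.
            _, y_match = min(controls, key=lambda c: abs(c[0] - p))
            diffs.append(y - y_match)
    return sum(diffs) / len(diffs)


# Hypothetical toy data: propensity scores, treatment, and admission outcome.
pscore = [0.8, 0.7, 0.3, 0.75, 0.35, 0.6]
treated = [1, 1, 1, 0, 0, 0]
outcome = [1, 1, 0, 0, 0, 1]
att = att_nearest_neighbor(pscore, treated, outcome)
```

In this toy example two of the three treated students are admitted while their matched controls are not, so `att` equals 2/3.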
Alternatively, I set up a Bayesian simultaneous equations model that specifies the selection
rule. Identification of the model relies on the exogeneity of the exclusion restriction. I
assume that commuting time from school to “cram school” is relevant to students’ attendance
decisions and exogenous to their high school placement. Imposing the constant treatment
effect assumption, I can extrapolate the effect for the “compliers,” who are discouraged
from participating in a program by a longer commuting time, to the population of interest. I
find evidence that the average partial effect given treatment is around 11% in the chances of
being admitted to a public high school and 8.4% for an “elite” high school. The results also
indicate that the correlation of the unobservable characteristics that affect both selection
and outcome is not significantly different from 0.
The paper adds an important policy perspective to the ongoing debate on educational reform
in Taiwan. The findings suggest that “cram schools” pass the test of the market. Attending
“cram school” does improve students’ chances of being admitted to a public high school and
an “elite” high school. Even though Taiwanese policy makers consider “cram schools” an
unnecessary source of pressure, these programs will continue to play a significant role in
Taiwanese teenagers’ lives unless there is a fundamental change in the centrally
administered admission system and in the belief in “elitism.” As long as there is demand for
“elite” high schools, the market will persist.
References
[1] Alberto Abadie and Guido Imbens. Large sample properties of matching estimators for average
treatment effects. Econometrica, 74(1):235–267, 2006.
[2] Joshua D. Angrist. Estimation of limited dependent variable models with dummy endogenous
regressors: Simple strategies for empirical practice. Journal of Business & Economic Statistics,
19(1):2–16, 2001.
[3] Joshua D. Angrist and Alan Krueger. Empirical strategies in labor economics. In Orley Ashenfelter
and David Card, editors, Handbook of Labor Economics, volume 3. Elsevier Science B.V., 1999.
[4] Arnold Zellner. Bayesian analysis in econometrics. Journal of Econometrics, 37(1):27–50, 1988.
doi: 10.1016/0304-4076(88)90072-3.
[5] Arnold Zellner. Philosophy and objectives of econometrics. Journal of Econometrics,
136(2):331–339, 2007. doi: 10.1016/j.jeconom.2005.11.001.
[6] Orley Ashenfelter and Alan Krueger. Estimates of the economic return to schooling from a new
sample of twins. The American Economic Review, 84(5):1157–1173, 1994.
[7] Gary S. Becker. Investment in human capital: A theoretical analysis. Journal of Political Economy,
70(5):9–49, 1962.
[8] Gary S. Becker. A theory of the allocation of time. The Economic Journal, 75(299), 1965.
[9] Gary S. Becker. Human capital: a theoretical and empirical analysis, with special reference to
education. University of Chicago Press, 1993.
[10] Christian Belzil. The return to schooling in structural dynamic models: a survey. European
Economic Review, 51(5):1059–1105, 2007. doi: 10.1016/j.euroecorev.2007.01.008.
[11] David Card. Using geographic variation in college proximity to estimate the return to schooling.
In Aspects of Labor Market Behaviour: Essays in Honour of John Vanderkamp. Toronto: University of
Toronto Press, 1995.
[12] David Card. Estimating the return to schooling: Progress on some persistent econometric problems.
Econometrica, 69(5):1127–1160, 2001.
[13] G. Casella. Empirical Bayes Gibbs sampling. Biostatistics, 2(4):485–500, 2001.
[14] George Casella and Edward I. George. Explaining the Gibbs sampler. The American Statistician,
46(3):167–174, 1992.
[15] Siddhartha Chib. Bayes inference in the Tobit censored regression model. Journal of Econometrics,
51(1-2):79–99, 1992. doi: 10.1016/0304-4076(92)90030-U.
[16] Siddhartha Chib and Edward Greenberg. Markov chain Monte Carlo simulation methods in
econometrics. Econometric Theory, 12(3):409–431, 1996.
[17] Siddhartha Chib and Edward Greenberg. Analysis of multivariate probit models. Biometrika,
85(2):347–361, 1998.
[18] Ronald Christensen. Bayesian ideas and data analysis: an introduction for scientists and
statisticians. CRC Press, Boca Raton, FL, 2011.
[19] Mary Kathryn Cowles and Bradley P. Carlin. Markov chain Monte Carlo convergence diagnostics: A
comparative review. Journal of the American Statistical Association, 91(434):883–904, 1996.
[20] Rajeev H. Dehejia. Was there a Riverside miracle? A hierarchical framework for evaluating
programs with grouped data. Journal of Business & Economic Statistics, 21(1):1–11, 2003.
[21] Rajeev H. Dehejia. Program evaluation as a decision problem. Journal of Econometrics,
125(1):141–173, 2005.
[22] B. Efron and R. Tibshirani. Bootstrap methods for standard errors, confidence intervals, and other
measures of statistical accuracy. Statistical Science, 1(1):54–75, 1986.
[23] Andrew Gelman. A Bayesian formulation of exploratory data analysis and goodness-of-fit testing.
International Statistical Review, 71(2):369–382, 2003.
[24] Andrew Gelman. Bayesian data analysis. Chapman & Hall/CRC, Boca Raton, Fla., 2004.
[25] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences.
Statistical Science, 7(4):457–472, 1992.
[26] John Geweke, Gautam Gowrisankaran, and Robert J. Town. Bayesian inference for hospital quality in
a selection model. Econometrica, 71(4):1215–1238, 2003.
[27] J. Heckman and V. Joseph Hotz. Choosing among nonexperimental methods for estimating the impact
of social programs: The case of manpower training. Journal of the American Statistical Association,
84(408):862–874, 1989.
[28] James Heckman. Varieties of selection bias. The American Economic Review, 80(2):313–318, 1990.
[29] James J. Heckman. Dummy endogenous variables in a simultaneous equation system. Econometrica:
Journal of the Econometric Society, 46(4):931–959, 1978.
[30] James J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153–161,
1979.
[31] Guido W. Imbens and Joshua D. Angrist. Identification and estimation of local average treatment
effects. Econometrica, 62(2):467–475, 1994.
[32] Kirabo Jackson. A little now for a lot later: A look at a Texas advanced placement incentive
program. The Journal of Human Resources, 45(3):591–639, 2010.
[33] Kirabo Jackson. Do college-prep programs improve long-term outcomes? NBER Working Paper No.
15722, 2012.
[34] Anne O. Krueger. The political economy of the rent-seeking society. The American Economic Review,
64(3):291–303, 1974.
[35] Kai Li. Bayesian inference in a simultaneous equation model with limited dependent variables.
Journal of Econometrics, 85(2):387–400, 1998. doi: 10.1016/S0304-4076(97)00106-1.
[36] C. Lin. The Republic of China (Taiwan). In R. M. Thomas and T. W. Postlethwaite, editors,
Schooling in East Asia: Forces of Change, pages 104–35. Pergamon, New York, 1983.
[37] Da-Sen Lin and Yi-Fen Chen. Cram school attendance and college entrance exam scores of senior
high school students in Taiwan. Bulletin of Educational Research, 52(4):35–70, 2006.
[38] D. V. Lindley and A. F. M. Smith. Bayes estimates for the linear model. Journal of the Royal
Statistical Society. Series B (Methodological), 34(1):1–41, 1972.
[39] Charles F. Manski. Anatomy of the selection problem. The Journal of Human Resources,
24(3):343–360, 1989.
[40] Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park. MCMCpack: Markov chain Monte Carlo in
R. Journal of Statistical Software, 42(9):22, 2011.
[41] Tracy L. Regan, Ronald L. Oaxaca, and Galen Burghardt. A human capital model of the effects of
ability and family background on optimal schooling levels. Economic Inquiry, 45(4):712–738, 2007.
[42] Maria L. Rizzo. Statistical computing with R. Chapman & Hall/CRC, Boca Raton, 2008.
[43] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational
studies for causal effects. Biometrika, 70(1):41–55, 1983. doi: 10.1093/biomet/70.1.41.
[44] Peter E. Rossi, Greg M. Allenby, and Robert E. McCulloch. Bayesian statistics and marketing, 2005.
[45] Donald B. Rubin. Matched sampling for causal effects. Cambridge University Press, Cambridge; New
York, 2006.
[46] A. F. M. Smith and G. O. Roberts. Bayesian computation via the Gibbs sampler and related Markov
chain Monte Carlo methods. Journal of the Royal Statistical Society. Series B (Methodological),
55(1):3–23, 1993.
[47] David Lee Stevenson and David P. Baker. Shadow education and allocation in formal schooling:
Transition to university in Japan. American Journal of Sociology, 97(6):1639–1657, 1992.
[48] Martin A. Tanner and Wing Hung Wong. The calculation of posterior distributions by data
augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987.
[49] R Development Core Team. R: A language and environment for statistical computing, 2011.
[50] Francis Vella. Estimating models with sample selection bias: A survey. The Journal of Human
Resources, 33(1):127–169, 1998.
[51] Greg C. G. Wei and Martin A. Tanner. A Monte Carlo implementation of the EM algorithm and the
poor man’s data augmentation algorithms. Journal of the American Statistical Association,
85(411):699–704, 1990.
[52] Robert J. Willis and Sherwin Rosen. Education and self-selection. The Journal of Political Economy,
87(5):S7–S36, 1979.
[53] Arnold Zellner. Bayesian econometrics. Econometrica, 53(2):253–269, 1985.
[54] Arnold Zellner and Tomohiro Ando. A direct Monte Carlo approach for Bayesian analysis of
the seemingly unrelated regression model. Journal of Econometrics, 159(1):33–45, 2010. doi:
10.1016/j.jeconom.2010.04.005.
[55] Arnold Zellner and Peter E. Rossi. Bayesian analysis of dichotomous quantal response models.
Journal of Econometrics, 25(3):365–393, 1984. doi: 10.1016/0304-4076(84)90007-1.
A Gibbs Sampling Algorithm
The posterior distributions of the parameters of the Bayesian selection models are obtained through the
Gibbs sampling procedure. Gibbs sampling is a Markov chain Monte Carlo simulation technique (Dehejia
(2003)[20]). The algorithm allows me to simulate random variables from a distribution indirectly, without
having to calculate its density. A good introductory survey of this method is Casella and George (1992)[14].
The basic idea of Gibbs sampling is that, by sequentially sampling from the conditional distribution of each
parameter given the remaining parameters, the simulated draws converge in distribution, under some
regularity conditions, to a stationary distribution that is the joint distribution of interest. Given the
conjugate priors, all conditional distributions have closed forms. The algorithm therefore reduces to
sequential draws from the following conditionals after I complete the data augmentation steps.
The steps of the Gibbs sampler are as follows⁴:
Step 1
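$$T^*_i \mid T_i = 1 \sim tN_{[0,\infty)}\left(\gamma z_i + X_i\beta_2 + \frac{\delta}{\sigma^2+\delta^2}\left(Y^*_i - \tau - X_i\beta_1\right),\ \frac{\sigma^2}{\sigma^2+\delta^2}\right)$$
$$T^*_i \mid T_i = 0 \sim tN_{(-\infty,0]}\left(\gamma z_i + X_i\beta_2 + \frac{\delta}{\sigma^2+\delta^2}\left(Y^*_i - X_i\beta_1\right),\ \frac{\sigma^2}{\sigma^2+\delta^2}\right)$$
where $tN$ denotes the truncated normal distribution.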
Step 2
$$Y^*_i \mid Y_i = 1 \sim tN_{(0,\infty)}\left(\tau T_i + X_i\beta_1 + \delta\left(T^*_i - \gamma z_i - X_i\beta_2\right),\ \sigma^2\right)$$
$$Y^*_i \mid Y_i = 0 \sim tN_{(-\infty,0)}\left(\tau T_i + X_i\beta_1 + \delta\left(T^*_i - \gamma z_i - X_i\beta_2\right),\ \sigma^2\right)$$
Step 1 and Step 2 are often called “data augmentation” steps, following the seminal work of Tanner
and Wong (1987)[48]. Intuitively, given that I observe $T_i$ and given the normality assumption on the
disturbances, I can “observe” the latent variables $T^*_i$. In addition, given the fixed censoring
point assumption and the normality assumption, I can impute the missing values of $Y^*_i$ by drawing
from the truncated normal distribution.
The above argument leads to an algorithm of successive substitution to solve for a fixed point, for
which the Gibbs sampling algorithm is naturally suited.
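The two data augmentation draws can be sketched in Python as follows. This is an illustration rather than the code behind the paper: the helper names are hypothetical, the numerical inputs are made up, and the truncated normal is drawn by a simple inverse-CDF method that is adequate for moderate means but can be inaccurate far in the tails.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf and inverse cdf


def draw_truncated_normal(mean, var, positive, rng):
    """Draw from N(mean, var) truncated to (0, inf) if positive,
    otherwise to (-inf, 0), by inverting the CDF."""
    sd = var ** 0.5
    p0 = _STD.cdf((0.0 - mean) / sd)          # probability mass below zero
    u = rng.random()
    p = p0 + u * (1.0 - p0) if positive else u * p0
    p = min(max(p, 1e-12), 1.0 - 1e-12)       # keep inv_cdf inside (0, 1)
    return mean + sd * _STD.inv_cdf(p)


def draw_t_star(t_i, index_t, resid_y, delta, sigma2, rng):
    """Step 1. index_t = gamma*z_i + X_i*beta_2;
    resid_y = Y*_i - tau*T_i - X_i*beta_1."""
    mean = index_t + delta / (sigma2 + delta ** 2) * resid_y
    var = sigma2 / (sigma2 + delta ** 2)
    return draw_truncated_normal(mean, var, t_i == 1, rng)


def draw_y_star(y_i, index_y, resid_t, delta, sigma2, rng):
    """Step 2. index_y = tau*T_i + X_i*beta_1;
    resid_t = T*_i - gamma*z_i - X_i*beta_2."""
    mean = index_y + delta * resid_t
    return draw_truncated_normal(mean, sigma2, y_i == 1, rng)


rng = random.Random(2)
# Illustrative inputs only; delta and sigma2 are arbitrary assumed values.
t_star = draw_t_star(1, index_t=0.3, resid_y=-0.4, delta=0.29, sigma2=8.23, rng=rng)
y_star = draw_y_star(0, index_y=1.1, resid_t=t_star - 0.3, delta=0.29, sigma2=8.23, rng=rng)
```

A student observed with $T_i = 1$ receives a positive latent $T^*_i$, and a student observed with $Y_i = 0$ receives a negative latent $Y^*_i$, which is what the truncation regions in Steps 1 and 2 enforce.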
Step 3
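$$\sigma^2 \sim IG\left(a + \frac{N}{2},\ b + \frac{1}{2}\left[\sum_{i=1}^N \varepsilon^2_{1i} - 2\delta\sum_{i=1}^N \varepsilon_{1i}\varepsilon_{2i} + \delta^2\sum_{i=1}^N \varepsilon^2_{2i}\right]\right)$$
where I denote $\varepsilon_{1i} = Y^*_i - \tau T_i - X_i\beta_1$ and $\varepsilon_{2i} = T^*_i - \gamma z_i - X_i\beta_2$.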
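As a rough sketch of Step 3 in Python, the draw below uses the fact that the bracketed term equals the sum of squared values of $\varepsilon_{1i} - \delta\varepsilon_{2i}$. The function name, the prior values `a` and `b`, and the simulated residuals are assumptions for illustration, and the inverse-gamma is assumed to be parameterized by shape and scale.

```python
import random


def draw_sigma2(eps1, eps2, delta, a, b, rng):
    """Draw sigma^2 ~ IG(a + N/2, b + 0.5 * sum((eps1 - delta*eps2)^2))."""
    n = len(eps1)
    # Expanding this sum gives the three-term expression in Step 3.
    ssr = sum((e1 - delta * e2) ** 2 for e1, e2 in zip(eps1, eps2))
    shape = a + n / 2.0
    scale = b + 0.5 * ssr
    # If G ~ Gamma(shape, scale=1/scale) then 1/G ~ InverseGamma(shape, scale).
    precision = rng.gammavariate(shape, 1.0 / scale)
    return 1.0 / precision


rng = random.Random(5)
true_delta, true_sigma2 = 0.3, 2.0   # hypothetical values for the demo
eps2 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
eps1 = [true_delta * e2 + rng.gauss(0.0, true_sigma2 ** 0.5) for e2 in eps2]
sigma2_draw = draw_sigma2(eps1, eps2, true_delta, a=2.0, b=1.0, rng=rng)
```

With 2000 simulated residuals the conditional posterior is tight, so `sigma2_draw` lands close to the value of 2.0 used to generate the data.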
Step 4
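$$\delta \sim N\left(\frac{\delta_0\sigma^2 + \sigma^2_\delta\sum_{i=1}^N \varepsilon_{1i}\varepsilon_{2i}}{\sigma^2 + \sigma^2_\delta\sum_{i=1}^N \varepsilon^2_{2i}},\ \frac{\sigma^2_\delta\,\sigma^2}{\sigma^2 + \sigma^2_\delta\sum_{i=1}^N \varepsilon^2_{2i}}\right)$$
⁴ Notice that I suppress the conditionals for simplicity of notation.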
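The conditional draw of $\delta$ in Step 4 can be sketched in Python as below. The function name and the prior values `delta0` and `var_delta` (standing for $\delta_0$ and $\sigma^2_\delta$) are assumptions chosen for illustration, as are the simulated residuals.

```python
import random


def draw_delta(eps1, eps2, sigma2, delta0, var_delta, rng):
    """Draw delta from its normal conditional given the residuals,
    sigma^2, and a N(delta0, var_delta) prior."""
    s12 = sum(e1 * e2 for e1, e2 in zip(eps1, eps2))
    s22 = sum(e2 * e2 for e2 in eps2)
    denom = sigma2 + var_delta * s22
    mean = (delta0 * sigma2 + var_delta * s12) / denom
    var = var_delta * sigma2 / denom
    return rng.gauss(mean, var ** 0.5)


rng = random.Random(6)
true_delta = 0.3                      # hypothetical value for the demo
eps2 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
eps1 = [true_delta * e2 + rng.gauss(0.0, 1.0) for e2 in eps2]
delta_draw = draw_delta(eps1, eps2, sigma2=1.0, delta0=0.0, var_delta=100.0, rng=rng)
```

With a diffuse prior the conditional mean is close to the least-squares slope of $\varepsilon_{1i}$ on $\varepsilon_{2i}$, so the draw falls near the value of 0.3 used in the simulation.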
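Step 5 The coefficients $(\tau, \beta_1)$ and $(\gamma, \beta_2)$ are simulated by Bayesian regressions:
Notice that I can rearrange Equations (2) and (3):
$$T^*_i - \frac{\delta}{\sigma^2+\delta^2}\left(Y^*_i - \tau T_i - X_i\beta_1\right) = \gamma z_i + X_i\beta_2 + \xi_{2i}, \quad \text{where } \xi_{2i} \sim N\left(0, \frac{\sigma^2}{\sigma^2+\delta^2}\right)$$
and
$$Y^*_i - \delta\left(T^*_i - \gamma z_i - X_i\beta_2\right) = \tau T_i + X_i\beta_1 + \xi_{1i}, \quad \text{where } \xi_{1i} \sim N(0, \sigma^2)$$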
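Simulating the conditionals requires two steps. First, notice that the joint normality assumption
immediately leads to
$$\begin{bmatrix} Y^*_i \\ T^*_i \end{bmatrix} \sim N\left(\begin{bmatrix} \tau T_i + X_i\beta_1 \\ \gamma z_i + X_i\beta_2 \end{bmatrix},\ \begin{bmatrix} \sigma^2+\delta^2 & \delta \\ \delta & 1 \end{bmatrix}\right)$$
By the properties of the joint normal distribution, I have
$$E[Y^*_i \mid T^*_i] = \tau T_i + X_i\beta_1 + \delta\left(T^*_i - \gamma z_i - X_i\beta_2\right)$$
$$\mathrm{var}[Y^*_i \mid T^*_i] = \sigma^2$$
Now, I can estimate $\tau$ and $\beta_1$ using standard Bayesian regression, which is a special case of
Lindley and Smith (1972)[38]:
$$Y^*_i - \delta\left(T^*_i - \gamma z_i - X_i\beta_2\right) = \tau T_i + X_i\beta_1 + \eta_{1i}, \quad \text{where } \eta_{1i} \sim N(0, \sigma^2)$$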
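Second, we observe that
$$E[T^*_i \mid Y^*_i] = \gamma z_i + X_i\beta_2 + \frac{\delta}{\sigma^2+\delta^2}\left(Y^*_i - \tau T_i - X_i\beta_1\right)$$
$$\mathrm{var}[T^*_i \mid Y^*_i] = \frac{\sigma^2}{\sigma^2+\delta^2}$$
Again, I can estimate $\gamma$ and $\beta_2$ by standard Bayesian regression⁵:
$$T^*_i - \frac{\delta}{\sigma^2+\delta^2}\left(Y^*_i - \tau T_i - X_i\beta_1\right) = \gamma z_i + X_i\beta_2 + \eta_{2i}, \quad \text{where } \eta_{2i} \sim N\left(0, \frac{\sigma^2}{\sigma^2+\delta^2}\right)$$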
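Each of the two regressions above is a normal linear regression with a known error variance, so under a normal prior the coefficients have a normal conditional posterior. The following Python sketch, which assumes NumPy is available, shows one such draw. The function name, the prior settings, and the simulated data are hypothetical and are not taken from the paper.

```python
import numpy as np


def draw_coefficients(y_tilde, W, error_var, prior_mean, prior_prec, rng):
    """One draw of the coefficients in y_tilde = W @ coef + e,
    e ~ N(0, error_var), with prior coef ~ N(prior_mean, inv(prior_prec))."""
    post_prec = prior_prec + W.T @ W / error_var
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (prior_prec @ prior_mean + W.T @ y_tilde / error_var)
    return rng.multivariate_normal(post_mean, post_cov)


rng = np.random.default_rng(3)
n = 2000
# Hypothetical design matrix [T_i, X_i] and coefficients (tau, beta_1).
W = np.column_stack([rng.integers(0, 2, n), rng.normal(size=n)])
true_coef = np.array([1.0, -0.5])
y_tilde = W @ true_coef + rng.normal(scale=1.5, size=n)
draw = draw_coefficients(y_tilde, W, 1.5 ** 2, np.zeros(2), np.eye(2) / 100.0, rng)
```

For the outcome equation, `y_tilde` would be the left-hand side $Y^*_i - \delta(T^*_i - \gamma z_i - X_i\beta_2)$ with `error_var` equal to $\sigma^2$. For the selection equation, `W` would hold $z_i$ and $X_i$, and `error_var` would be $\sigma^2/(\sigma^2+\delta^2)$.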
⁵ In the previous literature, such as Li (1998)[35], these two steps are usually completed in one step by
Zellner’s seemingly unrelated regressions (SUR). Even though the two approaches are theoretically
equivalent, the SUR model requires computing the inverse of a sparse, high-dimensional matrix. The
approximation error in the computer routines is likely to slow down the convergence of the chains or even
bias the estimates. Since I do not intend to run a more complex model, such as a multilevel model, the
two-step method can be more suitable.
B Full Empirical Results
In this section, I report the full empirical results. I obtain all the results by simulating 55000 draws and
discarding the first 5000 draws as the burn-in period. This is a standard procedure in Bayesian estimation
to minimize the impact of the choice of initial points on the simulated posterior distribution. Since I use
the standard normal-gamma model, the posterior distributions are all unimodal, so the 90% highest
posterior density credible set can be obtained from the 5% and 95% quantiles. I also report the probability
that a parameter is greater than 0 as a one-sided significance test. In addition, shrinkage factors are
computed using Gelman-Rubin convergence diagnostics (Cowles and Carlin (1996)[19] and Gelman and Rubin
(1992)[25]) by simulating 4 parallel chains with initial points dispersed around the original initial
points. In all specifications, the shrinkage factors stabilize around 1, implying convergence of the
Markov chains to the stationary distribution. I do not report the shrinkage factors in the tables.
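For readers unfamiliar with the diagnostic, a basic version of the shrinkage factor can be sketched in Python as follows. This is a simplified illustration: it omits the degrees-of-freedom correction in Gelman and Rubin (1992)[25], the function name is hypothetical, and the chains are short independent normal draws standing in for real Gibbs output.

```python
import random
import statistics


def shrink_factor(chains):
    """Basic Gelman-Rubin shrinkage factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(4)
burn_in = 500   # much shorter than the 5000 burn-in draws used in the paper
# Four hypothetical parallel chains; real chains would come from the sampler.
raw = [[rng.gauss(0.8, 0.2) for _ in range(2500)] for _ in range(4)]
kept = [chain[burn_in:] for chain in raw]
rhat = shrink_factor(kept)
```

When the chains have mixed, the pooled and within-chain variances agree and `rhat` is close to 1, which is the pattern described above.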
Table 9: Bayesian Model: Public High School
Variable                     Mean      SD      5%     25%     50%     75%     95%  P(x>0)
Equation 1: Public HS
Cram School                 1.128   0.523   0.281   0.784   1.120   1.465   2.018   0.985
Cram before                 0.449   0.322  -0.068   0.230   0.441   0.660   0.984   0.923
Male                       -0.166   0.176  -0.465  -0.281  -0.159  -0.048   0.117   0.167
Num. Siblings              -0.449   0.132  -0.676  -0.532  -0.442  -0.358  -0.246   0.000
Intention to HS             1.515   0.244   1.136   1.349   1.504   1.665   1.927   1.000
Less than NTD 30,000       -1.411   0.337  -1.961  -1.637  -1.407  -1.183  -0.864   0.000
NTD 30,000 - NTD 49,999    -0.893   0.316  -1.405  -1.108  -0.891  -0.678  -0.376   0.002
NTD 50,000 - NTD 59,999    -1.155   0.327  -1.695  -1.372  -1.150  -0.934  -0.627   0.000
NTD 60,000 - NTD 69,999    -0.590   0.390  -1.234  -0.849  -0.585  -0.336   0.042   0.067
NTD 70,000 - NTD 79,999    -1.005   0.390  -1.653  -1.270  -0.999  -0.732  -0.371   0.004
NTD 80,000 - NTD 89,999    -0.080   0.448  -0.818  -0.380  -0.073   0.217   0.650   0.430
NTD 90,000 - NTD 99,999    -0.089   0.438  -0.792  -0.390  -0.097   0.198   0.642   0.411
NTD 100,000 - NTD 109,999  -0.516   0.472  -1.288  -0.836  -0.516  -0.189   0.257   0.141
NTD 110,000 - NTD 119,999  -0.748   0.502  -1.560  -1.088  -0.753  -0.409   0.083   0.068
NTD 120,000 - NTD 129,999  -1.009   0.580  -1.966  -1.395  -1.011  -0.622  -0.048   0.042
NTD 130,000 - NTD 139,999  -0.437   0.682  -1.567  -0.898  -0.440   0.025   0.674   0.262
NTD 140,000 - NTD 149,999   0.630   0.760  -0.628   0.112   0.639   1.147   1.886   0.799
More than NTD 150,000      -0.960   0.468  -1.729  -1.269  -0.960  -0.652  -0.199   0.021
Sound family               -0.055   0.270  -0.506  -0.229  -0.057   0.125   0.388   0.419
Fail a subject              3.509   0.372   2.942   3.254   3.484   3.746   4.146   1.000
School FE                   Yes
Parents’ Educ.              Yes
Parents’ Occ.               Yes
Equation 2: Cram School
Commuting time             -0.050   0.004  -0.057  -0.053  -0.050  -0.047  -0.044   0.000
Cram before                 1.510   0.065   1.406   1.466   1.508   1.552   1.618   1.000
Male                       -0.084   0.061  -0.183  -0.124  -0.084  -0.043   0.017   0.085
Num. Siblings              -0.107   0.037  -0.168  -0.132  -0.107  -0.082  -0.046   0.003
Intention to HS             0.290   0.068   0.178   0.244   0.289   0.336   0.399   1.000
Less than NTD 30,000        0.125   0.220  -0.240  -0.018   0.121   0.269   0.491   0.717
NTD 30,000 - NTD 49,999     0.119   0.218  -0.235  -0.027   0.118   0.261   0.482   0.711
NTD 50,000 - NTD 59,999     0.140   0.221  -0.220  -0.010   0.137   0.287   0.512   0.736
NTD 60,000 - NTD 69,999     0.097   0.237  -0.293  -0.065   0.100   0.253   0.495   0.657
NTD 70,000 - NTD 79,999     0.048   0.235  -0.331  -0.106   0.046   0.204   0.446   0.574
NTD 80,000 - NTD 89,999    -0.042   0.251  -0.459  -0.208  -0.045   0.123   0.378   0.425
NTD 90,000 - NTD 99,999     0.441   0.251   0.026   0.271   0.441   0.611   0.860   0.959
NTD 100,000 - NTD 109,999  -0.109   0.261  -0.534  -0.285  -0.113   0.067   0.321   0.335
NTD 110,000 - NTD 119,999   0.007   0.278  -0.447  -0.178   0.004   0.191   0.475   0.505
NTD 120,000 - NTD 129,999   0.229   0.319  -0.293   0.012   0.222   0.442   0.768   0.762
NTD 130,000 - NTD 139,999   0.005   0.363  -0.596  -0.242   0.005   0.251   0.597   0.506
NTD 140,000 - NTD 149,999  -0.235   0.412  -0.904  -0.512  -0.238   0.042   0.447   0.281
More than NTD 150,000       0.127   0.266  -0.309  -0.051   0.124   0.303   0.572   0.686
Sound family                0.080   0.100  -0.084   0.012   0.079   0.147   0.245   0.791
Fail a subject              0.299   0.071   0.181   0.252   0.300   0.346   0.417   1.000
School FE                   Yes
Parents’ Educ.              Yes
Parents’ Occ.               Yes
σ²                          8.230   1.776   5.794   6.938   8.006   9.264  11.527   1.000
δ                           0.289   0.327  -0.261   0.089   0.290   0.506   0.818   0.819
Table 10: Bayesian Model: “elite” High School
Variable                     Mean      SD      5%     25%     50%     75%     95%  P(x>0)
Equation 1: Elite HS
Cram School                 0.820   0.208   0.485   0.677   0.819   0.955   1.172   1.000
Cram before                -0.094   0.269  -0.525  -0.278  -0.100   0.093   0.349   0.366
Male                       -0.056   0.142  -0.289  -0.149  -0.052   0.040   0.167   0.356
Num. Siblings              -0.320   0.092  -0.480  -0.377  -0.315  -0.258  -0.180   0.000
Intention to HS             0.469   0.189   0.164   0.343   0.464   0.594   0.781   0.993
Less than NTD 30,000       -1.320   0.334  -1.874  -1.545  -1.324  -1.093  -0.777   0.000
NTD 30,000 - NTD 49,999    -0.838   0.300  -1.333  -1.040  -0.837  -0.632  -0.349   0.002
NTD 50,000 - NTD 59,999    -0.824   0.297  -1.317  -1.020  -0.825  -0.628  -0.335   0.004
NTD 60,000 - NTD 69,999    -0.366   0.342  -0.930  -0.588  -0.370  -0.143   0.193   0.139
NTD 70,000 - NTD 79,999    -0.938   0.347  -1.513  -1.169  -0.934  -0.705  -0.358   0.003
NTD 80,000 - NTD 89,999    -0.783   0.391  -1.428  -1.048  -0.776  -0.522  -0.148   0.020
NTD 90,000 - NTD 99,999    -0.274   0.365  -0.871  -0.523  -0.278  -0.030   0.327   0.229
NTD 100,000 - NTD 109,999  -0.380   0.404  -1.049  -0.654  -0.367  -0.108   0.283   0.171
NTD 110,000 - NTD 119,999  -0.980   0.436  -1.704  -1.274  -0.978  -0.679  -0.264   0.012
NTD 120,000 - NTD 129,999  -0.848   0.493  -1.684  -1.170  -0.849  -0.514  -0.042   0.041
NTD 130,000 - NTD 139,999   0.197   0.553  -0.720  -0.171   0.193   0.562   1.118   0.643
NTD 140,000 - NTD 149,999  -0.703   0.640  -1.738  -1.121  -0.700  -0.277   0.326   0.140
More than NTD 150,000      -0.605   0.381  -1.236  -0.859  -0.600  -0.348   0.026   0.056
Sound family               -0.336   0.227  -0.711  -0.491  -0.336  -0.182   0.037   0.070
Fail a subject              1.984   0.222   1.629   1.834   1.981   2.126   2.344   1.000
School FE                   Yes
Parents’ Educ.              Yes
Parents’ Occ.               Yes
Equation 2: Cram School
Commuting time             -0.046   0.004  -0.053  -0.049  -0.046  -0.043  -0.038   0.000
Cram before                 1.509   0.069   1.395   1.462   1.509   1.556   1.623   1.000
Male                       -0.034   0.066  -0.142  -0.078  -0.034   0.010   0.075   0.304
Num. Siblings              -0.079   0.040  -0.146  -0.106  -0.079  -0.051  -0.012   0.027
Intention to HS             0.319   0.074   0.199   0.270   0.320   0.369   0.438   1.000
Less than NTD 30,000        0.192   0.228  -0.178   0.037   0.187   0.346   0.568   0.799
NTD 30,000 - NTD 49,999     0.133   0.228  -0.239  -0.022   0.129   0.287   0.513   0.724
NTD 50,000 - NTD 59,999     0.169   0.229  -0.208   0.012   0.167   0.322   0.553   0.769
NTD 60,000 - NTD 69,999     0.234   0.252  -0.170   0.068   0.231   0.398   0.659   0.826
NTD 70,000 - NTD 79,999     0.031   0.245  -0.367  -0.141   0.034   0.197   0.435   0.553
NTD 80,000 - NTD 89,999    -0.004   0.263  -0.436  -0.177  -0.007   0.177   0.426   0.487
NTD 90,000 - NTD 99,999     0.403   0.268  -0.037   0.218   0.405   0.579   0.848   0.935
NTD 100,000 - NTD 109,999  -0.144   0.270  -0.586  -0.329  -0.143   0.041   0.299   0.302
NTD 110,000 - NTD 119,999   0.245   0.294  -0.236   0.050   0.245   0.441   0.734   0.800
NTD 120,000 - NTD 129,999   0.413   0.343  -0.149   0.177   0.416   0.641   0.978   0.885
NTD 130,000 - NTD 139,999   0.105   0.375  -0.507  -0.151   0.100   0.359   0.717   0.606
NTD 140,000 - NTD 149,999  -0.081   0.424  -0.788  -0.368  -0.071   0.207   0.592   0.430
More than NTD 150,000       0.136   0.268  -0.304  -0.046   0.134   0.315   0.579   0.691
Sound family                0.090   0.107  -0.084   0.018   0.090   0.162   0.270   0.798
Fail a subject              0.327   0.078   0.199   0.277   0.327   0.378   0.453   1.000
School FE                   Yes
Parents’ Educ.              Yes
Parents’ Occ.               Yes
σ²                          2.377   0.470   1.676   2.073   2.331   2.630   3.219   1.000
δ                          -0.153   0.284  -0.628  -0.344  -0.156   0.037   0.323   0.292
Table 11: Bivariate Probit Model: Public High School
                           Equation 1: Public HS       Equation 2: Cram School
Variable                   Coefficient  Std. Err.      Coefficient  Std. Err.
Cram School                 0.730∗      0.366          -            -
Commuting time             -            -              -0.051∗∗     0.005
Cram before                 0.021       0.196           1.494∗∗     0.082
Male                        0.016       0.060          -0.080       0.079
Num. Siblings              -0.043       0.049          -0.103∗∗     0.029
Less than NTD 30,000       -0.761       0.524          -0.072       0.311
NTD 30,000 - NTD 49,999    -0.560       0.538          -0.076       0.289
NTD 50,000 - NTD 59,999    -0.663       0.539          -0.062       0.288
NTD 60,000 - NTD 69,999    -0.503       0.556          -0.106       0.307
NTD 70,000 - NTD 79,999    -0.641       0.535          -0.152       0.332
NTD 80,000 - NTD 89,999    -0.286       0.539          -0.252       0.335
NTD 90,000 - NTD 99,999    -0.313       0.575           0.249       0.307
NTD 100,000 - NTD 109,999  -0.475       0.511          -0.306       0.325
NTD 110,000 - NTD 119,999  -0.585       0.546          -0.174       0.327
NTD 120,000 - NTD 129,999  -0.793       0.572           0.034       0.327
NTD 130,000 - NTD 139,999  -0.514       0.600          -0.201       0.424
NTD 140,000 - NTD 149,999   0.397       0.655          -0.450       0.405
More than NTD 150,000      -0.639       0.549          -0.075       0.345
Sound family                0.128       0.094           0.062       0.094
Intention to HS             0.617∗∗     0.079           0.295∗∗     0.073
Fail a subject              1.288∗∗     0.092           0.303∗∗     0.090
School FE                   Yes
Parents’ Educ.              Yes
Parents’ Occ.               Yes
ρ                          -0.083       0.211
Table 12: Bivariate Probit Model: “elite” High School
                           Equation 1: Elite HS        Equation 2: Cram School
Variable                   Coefficient  Std. Err.      Coefficient  Std. Err.
Cram School                 1.035∗∗     0.316          -            -
Commuting time             -            -              -0.047∗∗     0.005
Cram before                -0.251       0.172           1.486∗∗     0.087
Male                        0.039       0.075          -0.043       0.084
Num. Siblings              -0.077       0.065          -0.080∗∗     0.030
Less than NTD 30,000       -2.339∗∗     0.474           0.466       0.402
NTD 30,000 - NTD 49,999    -2.076∗∗     0.480           0.417       0.378
NTD 50,000 - NTD 59,999    -2.041∗∗     0.473           0.448       0.371
NTD 60,000 - NTD 69,999    -1.774∗∗     0.436           0.501       0.411
NTD 70,000 - NTD 79,999    -2.125∗∗     0.443           0.304       0.413
NTD 80,000 - NTD 89,999    -2.074∗∗     0.477           0.267       0.412
NTD 90,000 - NTD 99,999    -1.670∗∗     0.481           0.677†      0.376
NTD 100,000 - NTD 109,999  -1.789∗∗     0.439           0.122       0.395
NTD 110,000 - NTD 119,999  -2.298∗∗     0.499           0.551       0.432
NTD 120,000 - NTD 129,999  -2.200∗∗     0.535           0.710†      0.422
NTD 130,000 - NTD 139,999  -1.365∗∗     0.506           0.402       0.513
NTD 140,000 - NTD 149,999  -2.253∗∗     0.643           0.187       0.499
More than NTD 150,000      -1.886∗∗     0.490           0.398       0.392
Sound family               -0.104       0.127           0.058       0.110
Intention to HS             0.426∗∗     0.138           0.317∗∗     0.074
Fail a subject              1.391∗∗     0.154           0.319∗∗     0.101
School FE                   Yes
Parents’ Educ.              Yes
Parents’ Occ.               Yes
ρ                          -0.368†      0.205