PIER Summer Course on Bayesian Models in Education Research
Review and Overview
Howard Seltman, July 12 and 15, 2016
I. Some Context and Motivation
a. The Efficacy of the Hedges Correction for Unmodeled Clustering, and Its Generalizations
in Practical Settings by Nathan VanHoudnos, CMU PhD Dissertation, 2014
(http://www.stat.cmu.edu/~nmv/wp-content/uploads/2014/05/vanhoudnos-
dissertation-5-5-2014.pdf).
“The aim of the evidence based education movement is two-fold: (i) to determine the
best practices from scientifically rigorous studies and (ii) to apply those best practices to
educational decision making. To assist the adoption of evidence based practices by US
educators and policy makers, Congress created the What Works Clearinghouse (WWC)
with a mission to evaluate evidence about educational interventions and to disseminate
information about best practices. The WWC synthesizes the results of education
research and publishes these recommendations for use by educators and policy makers.
Throughout its history, however, the evidence based education movement has struggled
with the low quality of education research. For example, a common analytic error made
by education researchers is that an experiment will be designed to randomize entire
schools to treatment and control conditions, but then the experiment will be analyzed
ignoring the grouped nature of the randomization. This error is well known to lead to
invalid conclusions because it overstates the statistical significance of the treatment
effect. The WWC chose to address this common error by attempting to remove the
anti-conservative bias of these misspecified analyses by calculating a correction to the
test statistic.”
b. Revisiting higher education data analysis: A Bayesian perspective by M. Subbiah, M. R.
Srinivasan and S. Shanthi, International Journal of Science and Technology Education
Research, 1(2), 32-38 (2011),
(http://www.academicjournals.org/article/article1379514827_Subbiah%20et%20al.pdf)
Although the English usage is poor, this paper is a nice example of the “Bayesian
Perspective” in practice.
II. Review of Principles of Experimental Design
a. IV vs. DV
b. Variable types
c. Causality, confounding, randomization, and internal validity
d. External Validity / Generalizability (not related to sample size!)
e. Construct validity
f. Power and precision (high with high sample size, low variability, and strong treatment
effects)
g. Means models, including interaction
h. Error models to match DV type and correlation pattern
III. Background Common to Classical and Bayesian Inference
a. Intuitive definitions of probability
i. Long run frequency
ii. Subjective chance of occurrence
b. Set theory
i. A set is an unordered collection of items with no duplicates.
ii. If A and B are sets, their union, 𝐴 ∪ 𝐵 is the set of all elements in A, B or both.
It is at least as big as the larger of the two.
iii. If A and B are sets, their intersection, 𝐴 ∩ 𝐵 is the set of all elements in both A
and B. It is no bigger than the smaller of the two.
iv. A ⊆ B means that A is the same set as B or a proper subset of B.
v. Disjoint sets have no elements in common.
vi. The complement of set A, Ac, is all elements except those in A.
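These definitions map directly onto a programming language's set type; a quick sketch in Python (not part of the original notes):

```python
# Set operations from the definitions above, using Python's built-in set type.
A = {1, 2, 3}
B = {3, 4}

union = A | B              # elements in A, B, or both
intersection = A & B       # elements in both A and B
print(intersection)        # {3} -- no bigger than the smaller set

# Subset: A <= B is True when every element of A is also in B
print({3} <= B)            # True

# Disjoint sets have no elements in common
print({1, 2}.isdisjoint({4, 5}))   # True

# Complement is relative to a universe (sample space) Omega
Omega = {1, 2, 3, 4, 5}
A_complement = Omega - A   # all elements of Omega except those in A
print(A_complement)        # {4, 5}
```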
c. Probability axioms and notation
i. Consider a situation with an unknown outcome.
ii. Let the sample space, Ω, represent all possible outcomes.
iii. Let ω represent each specific possible (discrete) elementary outcome.
iv. Let E, an event, represent a specific collection of possible outcomes.
v. Birthday example
vi. 0 ≤ Pr(E) ≤ 1
vii. Pr(Ω) = 1
viii. If E1 and E2 are disjoint, then Pr(E1 ∪ E2) = Pr(E1) + Pr(E2)
d. Conditional probability
i. The probability that event B will occur given that we know that event A did
occur is called the conditional probability of B given A and is written Pr(B|A).
ii. As a trivial example, if B ⊆ A, it is obvious that Pr(A|B) = 1.
iii. In general, Pr(B|A) = Pr(A ∩ B) / Pr(A). This is best thought of as a reduction of the
sample space from Ω to A.
iv. Useful alternative version: Pr(B|A) Pr(A) = Pr(A ∩ B)
v. Partition rule: If sets A1 through Ak partition Ω then
Pr(B) = Pr(B|A1) Pr(A1) + … + Pr(B|Ak) Pr(Ak)
vi. Bayes’ rule (also true and useful for Classical Statistics):
Pr(A|B) = Pr(A ∩ B) / Pr(B) = Pr(B|A) Pr(A) / [Pr(B|A) Pr(A) + Pr(B|Ac) Pr(Ac)]
For example, with Pr(A) = 0.20, Pr(B|A) = 0.10, and Pr(B|Ac) = 0.0375:
Pr(A|B) = (0.10 ∙ 0.20) / (0.10 ∙ 0.20 + 0.0375 ∙ 0.80) = 0.02 / 0.05 = 0.40
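The Bayes' rule calculation above can be checked numerically; a small Python sketch using the same probabilities:

```python
# Bayes' rule: Pr(A|B) = Pr(B|A)Pr(A) / (Pr(B|A)Pr(A) + Pr(B|Ac)Pr(Ac))
p_A = 0.20            # prior probability of A
p_B_given_A = 0.10
p_B_given_Ac = 0.0375

# Partition rule gives the denominator Pr(B)
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B

print(round(p_B, 3))          # 0.05
print(round(p_A_given_B, 2))  # 0.4
```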
vii. Statistical models are based on the idea of conditional distributions: we model
the distribution of the DV given each specific combination of IVs.
e. A random variable, usually represented by a capital Roman letter such as X or Y, is a
“mapping” from all events to the set of real numbers, ℝ.
i. The population mean or expected value of Y is the weighted average of the possible
Y values, where the weights are the probabilities of each Y value. This is written E(Y),
or often as μ. The corresponding quantity in a sample of Y values is the sample mean, ȳ.
ii. The population variance of Y is the weighted average of the squared distances
of the Y values from the population mean. It is a measure of the spread of the Y
values around the mean Y value. It may be written as var(Y) or σY². The
notation var(Y) can also refer to the sample variance.
iii. Standard deviation is another measure of spread, defined as the square root of
the variance of Y. For any distribution, no more than 1/k² of the distribution is
more than k s.d. from E(Y), according to Chebyshev's inequality.
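These weighted-average definitions can be computed directly; a sketch with a made-up pmf (not from the notes):

```python
# Population mean, variance, and sd of a discrete random variable,
# computed as probability-weighted averages (hypothetical pmf).
values = [0, 1, 2, 3]
probs  = [0.1, 0.4, 0.3, 0.2]   # must sum to 1

mean = sum(v * p for v, p in zip(values, probs))
var  = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
sd   = var ** 0.5

print(mean)           # 1.6
print(round(var, 2))  # 0.84

# Chebyshev: at most 1/k^2 of the probability lies more than k s.d. from the mean
k = 2
tail = sum(p for v, p in zip(values, probs) if abs(v - mean) > k * sd)
assert tail <= 1 / k**2
```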
f. Independence: If knowing that event A happened does not change your estimate of the
chance that event B will happen, then A and B are independent (which is very different
from “disjoint”; disjoint events are always dependent).
i. Notation: A ⫫ B can be read “A and B are independent”
ii. A ⫫ B implies Pr(B|A) = Pr(B) and Pr(A|B) = Pr(A)
iii. A ⫫ B implies Pr(A ∩ B) = Pr(A) Pr(B)
iv. Practically, random variables X and Y are independent if knowing one result tells
us nothing about the other: Pr(Y|X=x)=Pr(Y) and Pr(X|Y=y)=Pr(X).
v. The covariance and correlation between two random variables tells how
information about one provides information about the other.
cov(X, Y) = E[(X − E(X))(Y − E(Y))]
cor(X, Y) = cov(X, Y) / (sd(X) sd(Y))
vi. Correlation is between -1 and 1. Independent variables always have zero
correlation. Correlations of 1 or -1 mean Y can be perfectly predicted from X.
vii. Technical note: The only general case where uncorrelated implies independent
is when X and Y are bivariate Gaussian.
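The covariance and correlation definitions can be computed directly from a joint pmf; a sketch with hypothetical probabilities:

```python
# Covariance and correlation from their definitions, for a small
# hypothetical joint distribution of (X, Y).
# joint[(x, y)] = Pr(X=x, Y=y)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())
cov = sum((x - EX) * (y - EY) * p for (x, y), p in joint.items())

varX = sum((x - EX) ** 2 * p for (x, y), p in joint.items())
varY = sum((y - EY) ** 2 * p for (x, y), p in joint.items())
cor = cov / (varX ** 0.5 * varY ** 0.5)

print(round(cov, 3))  # 0.15
print(round(cor, 2))  # 0.6
```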
g. Properties of combinations of random variables
i. Let X and Y be random variables, and a, b, and c be constants
ii. E(aX + bY + c) = aE(X) + bE(Y) + c
iii. var(aX + bY +c) = a2 var(X) + b2 var(Y) + 2ab cov(X, Y)
iv. E.g., Xi = weight of apple i, Yi = weight of cereal box i. Assume all Xis and Yis are
independent, E(Xi) = 100 g, E(Yi) = 300 g, var(Xi) = 10 g², var(Yi) = 5 g².
Each package contains a = 4 apples and b = 2 cereal boxes, plus c = 100 g of
packaging. The mean weight of a package is 4(100) + 2(300) + 100 = 1100 g. The
variance of the weights is 16∙10 + 4∙5 + 0 = 180 g², and the s.d. is √180 = 13.4 g. If the
distributions are Gaussian, then 95% of the packages weigh in the range 1100 ±
(1.96)(13.4) = [1073.7, 1126.3] g.
v. E.g., Xi is the number of opera tickets student i buys, and Yi is the number of
Pirates tickets. Assume E(Xi) = 0.5, var(Xi) = 0.6, E(Yi) = 1.5, var(Yi) = 0.8, and
cov(Xi, Yi) = −0.45.
Then s.d.(Xi) = √0.6 = 0.77, s.d.(Yi) = √0.8 = 0.89, and cor(Xi, Yi) = −0.45 / (0.77 ∙ 0.89)
= −0.66. Let Zi = the total number of opera + Pirates tickets a student buys. E(Zi)
= 0.5 + 1.5 = 2, var(Zi) = 0.6 + 0.8 + 2(−0.45) = 0.5, and s.d.(Zi) = 0.71.
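Both worked examples can be verified with the combination rules from ii. and iii.; a quick Python sketch:

```python
# Mean and variance of linear combinations:
# E(aX + bY + c) = a E(X) + b E(Y) + c
# var(aX + bY + c) = a^2 var(X) + b^2 var(Y) + 2ab cov(X, Y)

def combo_mean(a, EX, b, EY, c):
    return a * EX + b * EY + c

def combo_var(a, vX, b, vY, covXY):
    return a**2 * vX + b**2 * vY + 2 * a * b * covXY

# Apples-and-cereal-boxes example (independent, so cov = 0):
print(combo_mean(4, 100, 2, 300, 100))   # 1100
print(combo_var(4, 10, 2, 5, 0))         # 180

# Tickets example: Z = X + Y with cov(X, Y) = -0.45
print(combo_mean(1, 0.5, 1, 1.5, 0))     # 2.0
print(round(combo_var(1, 0.6, 1, 0.8, -0.45), 2))  # 0.5
```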
h. Specifying probability distributions
i. For discrete outcomes
a. Distribution is defined with a probability mass function (pmf)
b. All possible values are listed, a probability is given for each value, and
these probabilities add to 1.
c. Bernoulli “coin flipping” distribution: Y ∈ {0, 1}, Pr(Y=1) = π,
Pr(Y=0) = 1 − π, π ∈ [0, 1]
d. The Bernoulli distribution, like most, is actually a family of
distributions, one for each parameter value, π.
e. For Bernoulli(π) we can compute E(Y) = π and var(Y) = π(1 − π).
f. The binomial distribution (independent flips of a given coin): The
outcome, Y, is the number of “heads” for n independent flips of a coin
with chance π of heads for each flip. E(Y) = nπ, var(Y) = nπ(1 − π). The
pmf is Pr(Y = y) = [n! / (y!(n − y)!)] π^y (1 − π)^(n−y), where “!” means
factorial.
g. The Poisson distribution: The outcome Y can be thought of as the
count of independent events in a fixed time period. Y can be 0, 1, 2,
…. The family is indexed by the parameter λ. E(Y) = λ, var(Y) = λ. The
pmf is Pr(Y = y) = λ^y e^(−λ) / y!, where e is Euler’s number, 2.71828….
h. Other discrete distributions include discrete uniform, negative
binomial, hypergeometric, etc.
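These pmfs and their moments can be checked numerically; a sketch with arbitrary parameter choices (n = 10, π = 0.3, λ = 2.5):

```python
import math

# Binomial pmf: Pr(Y=y) = n!/(y!(n-y)!) * pi^y * (1-pi)^(n-y)
def binom_pmf(y, n, pi):
    return math.comb(n, y) * pi**y * (1 - pi) ** (n - y)

# Poisson pmf: Pr(Y=y) = lam^y * exp(-lam) / y!
def pois_pmf(y, lam):
    return lam**y * math.exp(-lam) / math.factorial(y)

n, pi, lam = 10, 0.3, 2.5

# pmfs sum to 1 (Poisson sum truncated at a large value)
print(round(sum(binom_pmf(y, n, pi) for y in range(n + 1)), 6))   # 1.0
print(round(sum(pois_pmf(y, lam) for y in range(100)), 6))        # 1.0

# Means match the formulas E(Y) = n*pi and E(Y) = lam
EB = sum(y * binom_pmf(y, n, pi) for y in range(n + 1))
EP = sum(y * pois_pmf(y, lam) for y in range(100))
print(round(EB, 6), round(EP, 6))   # 3.0 2.5
```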
ii. For continuous outcomes
a. The distribution is defined with a probability density function (pdf).
b. The probability is defined for any subset of the real line, ℝ.
c. A Gaussian distribution is defined by its mean parameter, μ, and its
variance parameter, σ². The pdf is f(y) = [1 / √(2πσ²)] e^(−(y−μ)² / (2σ²)).
d. Given the pdf of any continuous distribution, we can compute the
probability of an event consisting of all Y values between, say, the
numbers a and b: Pr(a < Y < b) = ∫ₐᵇ f(y) dy
e. For a Gaussian distribution, the interval μ ± 1.96σ holds 95% of the values.
f. Other continuous distributions with support (possible values of Y) on
the real line are the t-distribution, the Laplace distribution, etc.
g. Continuous distributions with support on the positive half of the real
line include gamma, chi-square, F, Gompertz, etc.
h. Continuous distributions with support over a defined interval include
beta (0≤Y≤1), continuous uniform (a≤Y≤b), etc.
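The pdf-integral idea in d. can be sketched numerically; a simple trapezoid-rule approximation (an exact answer would use the Normal cdf):

```python
import math

# Gaussian pdf with mean mu and sd sigma
def gaussian_pdf(y, mu, sigma):
    return math.exp(-(y - mu) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

# Pr(a < Y < b) = integral of the pdf from a to b,
# approximated here with a simple trapezoid rule.
def prob_between(a, b, mu, sigma, steps=10_000):
    h = (b - a) / steps
    total = 0.5 * (gaussian_pdf(a, mu, sigma) + gaussian_pdf(b, mu, sigma))
    total += sum(gaussian_pdf(a + i * h, mu, sigma) for i in range(1, steps))
    return total * h

# For Normal(mean=1, sd=1):
p = prob_between(-0.5, 0.5, mu=1, sigma=1)
print(round(p, 3))   # 0.242
```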
[Figure: pdf of a Normal(mean=1, sd=1) distribution for measurement Y; the shaded area shows Pr(−0.5 < Y < 0.5) = 0.242.]
i. A bit on likelihood
i. Both Classical and Bayesian statistics rely on the probability model of the observed DVs
given specific parameter values and the observed IVs. Most generally, the model is a
specification of Pr(Y|x, θ). When this specification is thought of in reverse, i.e., taking
the observed data as fixed and the parameters as the “variable” of the equation, the
term likelihood is used, often in the form ℒ(θ|x, y).
i. For discrete outcomes, the probability model is literally the probability. For
continuous outcomes the probability of exactly observing any specific data
set is technically zero; instead the probability model comes from the pdf and is
not a probability, but it does define the relative chances of different sets of DVs.
ii. E.g., a coin has heads probability π. With n independent flips, the probability
of y = n heads is π^n. In general the probability of y heads is
Pr(y|n, π) = [n! / (y!(n − y)!)] π^y (1 − π)^(n−y), so the likelihood is
ℒ(π|n, y) = [n! / (y!(n − y)!)] π^y (1 − π)^(n−y).
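Viewing the pmf as a likelihood can be sketched numerically: fix hypothetical data (n, y), scan a grid of π values, and the likelihood peaks at the sample proportion y/n:

```python
import math

# Binomial likelihood L(pi | n, y) = n!/(y!(n-y)!) * pi^y * (1-pi)^(n-y),
# viewed as a function of pi with the data (n, y) held fixed.
def likelihood(pi, n, y):
    return math.comb(n, y) * pi**y * (1 - pi) ** (n - y)

n, y = 20, 14   # hypothetical data: 14 heads in 20 flips

# Evaluate on a grid of pi values and find the maximizer
grid = [i / 1000 for i in range(1, 1000)]
pi_hat = max(grid, key=lambda pi: likelihood(pi, n, y))
print(pi_hat)   # 0.7 -- the maximum likelihood estimate is y/n
```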
iii. E.g., 10 patients with arthritis are measured for range of motion after one week
on each of two different medications in counterbalanced order. The differences
for drug B minus drug A are called D1 through D10, and are assumed to be
independent and follow a Gaussian distribution with σ = 2. The parameter of
interest is the mean of the differences, μ. The likelihood is:
ℒ(μ | σ=2, d) = ∏ᵢ₌₁¹⁰ [1 / √(8π)] e^(−(dᵢ − μ)² / 8)
[Figure: the Gaussian likelihood as a function of mu for the data 0.83, 1.21, 0.4, 3.65, 1.04, 3.38, 1.21, 2.34, 1.8, −1.18; the maximum is marked at mu = 1.475.]
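This likelihood can be recomputed directly; a Python sketch using the listed data values (the grid maximizer lands at their sample mean, 1.468, slightly off the plot's labeled 1.475, presumably because the displayed values are rounded):

```python
import math

# Gaussian likelihood for mu with sigma = 2 fixed:
# L(mu) = prod_i (1/sqrt(8*pi)) * exp(-(d_i - mu)^2 / 8)
data = [0.83, 1.21, 0.4, 3.65, 1.04, 3.38, 1.21, 2.34, 1.8, -1.18]

def likelihood(mu):
    L = 1.0
    for d in data:
        L *= math.exp(-(d - mu) ** 2 / 8) / math.sqrt(8 * math.pi)
    return L

# Grid search over mu; a Gaussian likelihood peaks at the sample mean
grid = [i / 1000 for i in range(-2000, 2001)]
mu_hat = max(grid, key=likelihood)
print(mu_hat)                            # 1.468
print(round(sum(data) / len(data), 3))   # 1.468
```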
IV. Review of Linear Regression
a. Formal model: Yi = β0 + β1x1i + … + βkxki + εi, with εi ~ N(0, σ²) and the εi mutually independent
b. Underlying assumptions
i. x values are fixed (not random) or practically, the error in measuring x is small.
ii. Y is linear in x. Practically, curves can be represented by using x², log(x), etc.
Also, if categorical variables are represented as indicator variables for all but a
“baseline” level, no unreasonable ordering and spacing assumptions are made.
The slope for xr is the same at each value of xs; adding a new variable that is the
product of the two allows different slopes.
iii. The variability of Y among a group of subjects with all of the same x variables is
Gaussian in shape (Normality).
iv. The variability of Y among a group of subjects with all of the same x variables is
of the same magnitude (2). This is “homoscedasticity”.
v. The errors, i and j, for any two Ys are independent.
c. Assumptions can and should be checked using EDA and residual plots, and perhaps tests
such as Levene’s test of equal variance and/or the Durbin-Watson test of serial
correlation. The quantile-normal plot checks Normality. The residual vs. fit and residual
vs. x plots check equal variance and linearity.
d. The estimates of the betas and sigma squared, as well as their standard errors, are
closed-form. Either the least squares or the maximum likelihood principle leads to the
same solutions.
e. The general linear model allows modeling of unequal variance and/or dependence
(correlation). In the unlikely case of known values of the variances and correlations (or
covariances), the solution is closed-form. If only the form of the variance-covariance
structure is known, the solution requires an iterative, numeric solution.
f. The generalized linear model allows non-Normal outcomes, including binomial and
Poisson. The solutions require iterative, numeric methods. Logistic regression models
the log of the odds of success (vs. failure) as a linear combination of IVs.
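The log-odds idea can be sketched with hypothetical (not estimated) coefficients; the linear predictor is the log-odds, and the inverse-logit transform recovers a probability:

```python
import math

# Logistic regression models log-odds as a linear function of the IVs:
# log(p / (1 - p)) = b0 + b1 * x. Coefficients below are assumed, not fit.
def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

b0, b1 = -1.0, 0.8   # hypothetical coefficients

for x in [0, 1, 2]:
    eta = b0 + b1 * x            # linear predictor = log-odds
    p = inv_logit(eta)           # back-transform to a probability
    odds = p / (1 - p)
    # log(odds) recovers the linear predictor exactly
    print(x, round(p, 3), round(math.log(odds), 3))
```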
V. Model selection principles
a. Theory, experience, EDA, and residual plots guide model selection. The best principle is
that you should check assumptions by relaxing them (trying a more complex model for
which the current model is a special case) and testing whether the model improves.
b. In most cases, penalized likelihood methods are a good way to compare models to see
if added complexity is justified, i.e., if the better fit of a more complex model is within
the range of randomness or a “real” improvement.
c. Lower AIC is a good criterion if prediction is the goal. Lower BIC is a good criterion if
understanding the roles of the IVs is the goal.
VI. Classical vs. Bayesian Inference
a. Fundamentals of classical inference
i. Parameters are considered to be fixed, unknown values.
ii. Inference is indirect: Consider the possible outcomes for a particular parameter
value and compare these to the observed outcomes.
iii. Real-life data is a subset (ideally random) from a population of outcomes.
Interest is in the population parameters, not the sample.
iv. Definition: a statistic is any quantity calculable from collected data (and known
constants) without using the (unknown) values of the parameters.
v. The fundamental concept is the sampling distribution of the statistic. Each
time you re-run a study, you get a different set of data and a different value of
each statistic of interest. The probability distribution of a statistic of interest
over repeated experiments (or other studies) is the sampling distribution of the
statistic. Practically we compute this from model assumptions and probability
theory, rather than repeat experiments.
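The sampling distribution idea is easy to simulate; a sketch drawing many samples of size n = 15 from a N(mu=2.5, sigma=3) population (theory says the sd of the sample mean is σ/√n):

```python
import random

# Simulate the sampling distribution of the mean: draw many samples of
# size n from a N(mu=2.5, sigma=3) population and look at the spread
# of the resulting sample means.
random.seed(1)
mu, sigma, n, reps = 2.5, 3.0, 15, 20_000

means = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

grand_mean = sum(means) / reps
sd_of_means = (sum((m - grand_mean) ** 2 for m in means) / reps) ** 0.5

print(round(grand_mean, 2))   # close to mu = 2.5
print(round(sd_of_means, 2))  # close to sigma / sqrt(n) = 0.77
```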
[Figure: repeated samples of size n=15 are drawn from a population X ~ N(mu=2.5, sigma=3); a histogram of the resulting sample means illustrates the sampling distribution of the mean.]
vi. The p-value for a given null hypothesis, e.g., H0: μ = 0, is the probability of
obtaining the observed value of a given statistic, or one less supportive of the
null hypothesis, based on the sampling distribution of that statistic under (i.e.,
assuming) that null hypothesis. A “small” p-value (less than α, often using
α = 0.05) indirectly supports rejection of the null hypothesis because it says that
the observed data are unlikely given that the null hypothesis is true and that the
sampling distribution assumptions are true.
vii. Read carefully: Given that the null hypothesis is true and an appropriate
statistic and sampling distribution are used, the probability of the event
“p≤0.05” is 0.05 (5%). This is the type-1 (false positive) error rate. Because it is
not generally true that Pr(A|B)=Pr(B|A), we cannot make any probability
statement about the chance that H0 is true or false given p≤0.05!!!
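The 5% false-positive rate under a true null can be checked by simulation; a sketch using a z-test with σ known (a simplification not in the original notes):

```python
import math
import random

# Under a true null hypothesis, the event p <= 0.05 happens about 5%
# of the time. Simulated with a two-sided z-test (sigma = 1 known)
# on samples drawn from the null distribution N(0, 1).
random.seed(2)
n, reps = 25, 20_000

def z_test_p(sample):
    # z = xbar * sqrt(n) / sigma = sum / sqrt(n) when sigma = 1
    z = sum(sample) / math.sqrt(n)
    # two-sided p-value from the standard Normal cdf (via erf)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rejections = 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    if z_test_p(sample) <= 0.05:
        rejections += 1

print(round(rejections / reps, 3))   # close to 0.05
```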
viii. Power is a key issue because when H0 is false, the observed statistic may not be
unlikely under H0. If the power of an experiment is high, then even under
“small” alternative hypotheses, the overlap between the null sampling
distribution of the test statistic and the alternative sampling distribution is
small, and only then do large p-values correspond to “the null hypothesis is
unlikely” (and the confidence intervals for the parameter are narrow).
ix. A 95% confidence interval (or other %) is a random interval for which we
expect over repeated studies the true parameter value will be contained inside
the interval 95% of the time. Again, this is a statement about the distribution of
the interval given a particular value of the parameter, and we cannot compute
the reverse; in fact, in classical statistics, the distribution of the parameter given
the interval does not make sense because the parameter is a fixed constant.
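The repeated-sampling interpretation of a 95% confidence interval can also be simulated; a sketch with σ known, for simplicity:

```python
import math
import random

# Coverage of a 95% confidence interval: over repeated samples, the
# interval xbar +/- 1.96 * sigma/sqrt(n) contains the true mu about
# 95% of the time (sigma treated as known; values are hypothetical).
random.seed(3)
mu, sigma, n, reps = 10.0, 2.0, 30, 10_000
half_width = 1.96 * sigma / math.sqrt(n)

covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

print(round(covered / reps, 3))   # close to 0.95
```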
[Figure: the null sampling distribution of the test statistic compared with low-power and high-power alternative distributions, with the null α = 0.05 cutoff marked.]
x. The Multiple Comparison Problem: If multiple null hypotheses are true and you
do standard testing, the chance of a false positive (p ≤ 0.05) grows very rapidly
with the number of tests. E.g., for 20 outcomes not related to treatment, the
chance of all p-values being > 0.05 is 0.95^20 = 0.36, so there is a 64% chance
of at least one false positive. You have lost the protection of “only” 5% false
positives. Several good methods can be used to restore the type 1 error rate,
but all reduce power by a substantial degree. There is no free lunch!
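The arithmetic behind this, plus the simplest correction (Bonferroni), sketched in Python:

```python
# Family-wise error rate for k independent true-null tests at level alpha:
# Pr(at least one false positive) = 1 - (1 - alpha)^k
alpha = 0.05

for k in [1, 5, 20]:
    fwer = 1 - (1 - alpha) ** k
    print(k, round(fwer, 2))   # grows rapidly: 0.05, 0.23, 0.64

# Bonferroni correction: test each at alpha/k to restore the overall rate
k = 20
fwer_bonf = 1 - (1 - alpha / k) ** k
print(round(fwer_bonf, 3))   # just under 0.05
```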
b. Fundamentals of Bayesian inference
i. Bayesians consider parameters to have distributions which reflect our (perhaps
subjective) view as to what values are most likely.
ii. Bayesians are required to specify prior parameter distributions which reflect
their beliefs about the parameter values before observing their data.
iii. The main outcomes of a Bayesian analysis are posterior parameter
distributions which reflect what a rational observer should believe about the
parameter values after observing the data based on the standard laws of
probability adhered to by all statisticians and on the prior distributions.
Posterior probability intervals for parameters, the Bayesian parallel to Classical
confidence intervals, do behave like most naïve users incorrectly think
confidence intervals behave: if we compute a 95% posterior interval for θ, there
is a 95% chance that θ is inside the interval.
iv. We can think of Bayesian analysis as “updating” (refining and concentrating) our
beliefs based on the observed data. Here is a typical example:
[Figure: a typical updating example for a treatment effect: a diffuse prior, N(mean=0, sd=10), together with the much more concentrated posterior distribution.]
v. Bayesian inference uses Bayes’ Rule in this way: Call the DV “Y”, the IVs “x”, and
the parameter(s) θ. We must specify the prior distribution of θ, Pr(θ), based on
prior knowledge/beliefs. We also specify the likelihood of the data given the
covariates and parameters, Pr(Y|θ; x), which is essentially the assumed “model”.
Bayes’ rule can then be used to find the posterior distribution of θ, Pr(θ|Y; x).
vi. In the simplest cases, pure statistical calculation can be used to find Pr(θ|Y; x).
E.g., in simple linear regression, if the prior distributions of β0 and β1 are
Gaussian, then we can show that the posterior distributions are Gaussian and
we can find formulas for the posterior means and variances that are simple
combinations of the data and the prior means and variances.
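A sketch of such a conjugate Gaussian update, for a single mean with σ known (the standard precision-weighting formulas; the prior and data values below are hypothetical):

```python
import math

# Conjugate Gaussian updating for a mean (sigma known): with prior
# mu ~ N(m0, s0^2) and n observations averaging ybar from N(mu, sigma^2),
# the posterior is Gaussian with a precision-weighted mean.
def posterior(m0, s0, ybar, sigma, n):
    prior_prec = 1 / s0**2
    data_prec = n / sigma**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * m0 + data_prec * ybar)
    return post_mean, math.sqrt(post_var)

# Vague prior N(0, 10^2); hypothetical data: n = 25, ybar = 3, sigma = 5
m, s = posterior(m0=0, s0=10, ybar=3, sigma=5, n=25)
print(round(m, 3), round(s, 3))  # posterior pulled toward ybar, sd shrinks
```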
vii. Bayesians have several ways to specify prior distributions:
i. Fully informative prior distributions are based on all available evidence
and belief, possibly using expert elicitation.
ii. Weakly informative prior distributions are based on plausibility without
specifying very much information or confidence.
iii. “Objective” and/or uninformative prior distributions try to avoid
subjectivity, specify little to nothing about the parameters, and may be
improper (i.e., they may violate the axioms of probability). They don’t
always lead to valid posterior distributions, particularly in hierarchical
models.
viii. Even Bayesians need to fix some parameters; these are called hyperparameters.
ix. Fully or weakly informative prior distributions for a given likelihood may be
conjugate. This means that the prior and posterior parameter distributions are
members of the same family of distributions. This leads to computational
simplicity, but may be inappropriate if the prior beliefs are not close to a
member of the family.
x. In the case of conjugacy and some other cases, the posteriors are closed-form,
and we only need to compute the parameters of the posterior distribution. In
the remaining majority of cases, computationally intensive Monte Carlo
methods (more specifically Markov Chain Monte Carlo or MCMC), are needed
to obtain the posterior distribution. In fact, these methods produce a sample
from the posterior distribution rather than the distribution itself. The key
results are then obtained indirectly from the sample, e.g., via a sample posterior
interval or a density plot.
xi. One of the beauties of Bayesian analysis is that it is relatively easy to
construct a computer algorithm to produce a sample of all of the parameters
from a complex model, even one that has never before been specified.
i. The Gibbs Sampling method allows one to sample each parameter in
turn from its full conditional distribution rather than sampling all parameters “jointly”.
ii. The Metropolis-Hastings algorithm is a general-purpose method for
sampling from non-conjugate posterior distributions.
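A minimal Metropolis-Hastings sketch (a toy coin-flip posterior, not one of the course models; with a flat prior the exact posterior is Beta(y+1, n−y+1), mean (y+1)/(n+2)):

```python
import random

# Minimal Metropolis-Hastings: sample the posterior of a coin's heads
# probability pi given y = 7 heads in n = 10 flips with a flat prior.
# Target density is proportional to pi^y * (1 - pi)^(n - y).
random.seed(4)
y, n = 7, 10

def target(pi):
    if not 0 < pi < 1:
        return 0.0
    return pi**y * (1 - pi) ** (n - y)

pi_cur = 0.5
draws = []
for _ in range(50_000):
    pi_prop = pi_cur + random.gauss(0, 0.1)   # symmetric random-walk proposal
    # accept with probability min(1, target(prop) / target(current))
    if random.random() < target(pi_prop) / target(pi_cur):
        pi_cur = pi_prop
    draws.append(pi_cur)

burned = draws[5_000:]                        # discard burn-in
post_mean = sum(burned) / len(burned)
print(round(post_mean, 2))   # close to (y+1)/(n+2) = 0.67
```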
xii. Available general-purpose Bayesian software includes WinBUGS and JAGS. Both
may be run from within R. I use my R package, rube, to make the process
easier, safer, and more productive.
xiii. The information needed (by WinBUGS, JAGS, or for writing your own MCMC) is
the prior distributions and parameters and the directed acyclic graph (DAG). A
DAG is an intuitive method for specifying statistical models. Directed arrows
represent dependence relationships.
xiv. E.g., we think that our outcomes, Yi, are normally distributed with population
means β0 + β1∙xi and with variance σ². We specify weakly informative prior
distributions as β0 ~ N(μ=0, σ=10), β1 ~ N(μ=0, σ=10), and σ² ~ Gamma(α=1,
β=0.2). Here is the DAG:
[Figure: the DAG for this regression model.]
VII. Single Level vs. Hierarchical Models
a. In studies with independent errors (deviations of the DVs from their mean given the IVs)
the likelihood for all of the DVs is the likelihood of each DV multiplied together. If the
errors are correlated, more complex models and corresponding likelihoods are needed,
which include correlation parameters rather than assuming zero correlation.
b. The means model states the population mean of the DV values for each
subject/treatment combination given the IVs and the parameter. The error model
specifies how the DV varies around that mean for different subject/treatment
combinations. If knowing the error (deviation from the mean) for one DV value changes
your knowledge about the error for another DV value, then these errors are correlated
and a model ignoring the correlation will result in using the wrong null sampling
distribution. Then everything based on the null sampling distribution will be wrong, e.g.
i. The p-value no longer tells the probability of obtaining the observed test
statistic, or one more extreme in the direction of the alternative, under the null
hypothesis. Specifically, over repeated null studies p ≤ 0.05 happens more
frequently (anti-conservative) or less frequently (conservative) than 5% of the time.
ii. Confidence intervals are too wide or too narrow.
iii. Power calculations are incorrect.
Similar problems occur in Bayesian analysis for other reasons.
c. The main sources of correlated DVs are
i. adjacency effects (temporal or spatial), e.g., stock prices or social networks
ii. hidden grouping where a part of the deviation from the mean for a group is due
to common, effective, unmeasured factors or covariates.
d. Hierarchies often lead to important hidden groupings
i. Students in the same classroom perform better or worse due to teacher (and
other classroom effects).
ii. Students in the same school perform better or worse due to administrative,
fiscal, cultural (and other school effects).
iii. Etc.
e. Appropriate modeling of correlation due to these hidden factors results in correct p-
values, confidence intervals, posterior distributions, etc. The common effects within a
group may be modeled directly as shared mean shifts or indirectly as correlations.
f. When a level of a hierarchy has many groups it often makes sense to estimate only the
variance of the shared group effects rather than estimating each individual effect,
because
i. future effects may be different (e.g., they regress towards the mean)
ii. interest often centers on other groups than those studied, e.g., teachers in
general rather than the specific teachers of our study
iii. more “information” is available to more efficiently estimate other parameters of
interest if it is not “wasted” estimating the individual group effects
This idea of estimating only the variance (and subsequent induced correlation) for
effects whose levels are randomly chosen from a large population of effects is the main
idea behind random effects. In combination with the usual fixed effects, this gives
mixed effects models, a key tool in modeling hierarchical data.
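The cost of ignoring such clustering (the analytic error described in section I) can be seen in a small simulation with hypothetical variance components: the true sd of the study mean exceeds the naive independence-based sd, so naive tests are anti-conservative:

```python
import random

# Simulate students grouped into classrooms with a shared random
# classroom effect, and compare the true sd of the overall mean to the
# naive formula that pretends all observations are independent.
random.seed(5)
n_class, m = 20, 25            # 20 classrooms, 25 students each
sd_class, sd_student = 1.0, 2.0

def study_mean():
    total = 0.0
    for _ in range(n_class):
        u = random.gauss(0, sd_class)          # shared classroom effect
        for _ in range(m):
            total += u + random.gauss(0, sd_student)
    return total / (n_class * m)

means = [study_mean() for _ in range(2_000)]
gm = sum(means) / len(means)
true_sd = (sum((x - gm) ** 2 for x in means) / len(means)) ** 0.5

# Naive sd treats all n_class*m observations as independent
naive_sd = (sd_class**2 + sd_student**2) ** 0.5 / (n_class * m) ** 0.5
print(round(true_sd, 3))    # roughly 0.24
print(round(naive_sd, 3))   # 0.1 -- too small, so tests are anti-conservative
```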