
Page 1 (Slide 1): Unit 6a: Motivating Principal Components Analysis

© Andrew Ho, Harvard Graduate School of Education

http://xkcd.com/388/

Page 2 (Slide 2): Course Roadmap: Unit 6a

• Interitem Correlations
• Reliability… and multilevel modeling, revisited (AHHH!)
• Transition to PCA, by VVV: Visualizing Variables as Vectors

Multiple Regression Analysis (MRA): $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

Do your residuals meet the required assumptions?
• Test for residual normality.
• Use influence statistics to detect atypical data points.
• If your residuals are not independent, replace OLS by GLS regression analysis, use individual growth modeling, or specify a multilevel model; if time is a predictor, you need discrete-time survival analysis…
• If your outcome is categorical, you need to use binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with, create taxonomies of fitted models and compare them, or form composites of the indicators of any common construct: conduct a Principal Components Analysis, use Cluster Analysis, or use Factor Analysis (EFA or CFA?).
• If your outcome vs. predictor relationship is non-linear, use non-linear regression analysis, or transform the outcome or predictor.

Today's Topic Area: forming composites of the indicators of a common construct, en route to Principal Components Analysis.

Page 3 (Slide 3): Multiple Indicators of a Common Construct

Here is a dataset containing teachers' responses to what the investigators believed were multiple indicators of a single underlying construct, Teacher Job Satisfaction. The data are described in TSUCCESS_info.pdf.

Dataset: TSUCCESS.txt

Overview: Responses of a national sample of teachers to six questions about job satisfaction.

Source: Administrator and Teacher Survey of the High School and Beyond (HS&B) dataset, 1984 administration, National Center for Education Statistics (NCES). All NCES datasets are also available free from the EdPubs on-line supermarket.

Sample Size: 5,269 teachers (4,955 with complete data).

More Info: HS&B was established to study the educational, vocational, and personal development of young people, beginning in their elementary or high school years and following them over time as they began to take on adult responsibilities. The HS&B survey included two cohorts: (a) the 1980 senior class, and (b) the 1980 sophomore class. Both cohorts were surveyed every two years through 1986, and the 1980 sophomore class was also surveyed again in 1992.

Page 4 (Slide 4)

Col  Var  Variable Description                                    Labels
1    X1   You have high standards of teacher performance.         1 = strongly disagree, 2 = disagree, 3 = slightly disagree, 4 = slightly agree, 5 = agree, 6 = strongly agree
2    X2   You are continually learning on the job.                1 = strongly disagree, 2 = disagree, 3 = slightly disagree, 4 = slightly agree, 5 = agree, 6 = strongly agree
3    X3   You are successful in educating your students.          1 = not successful, 2 = a little successful, 3 = successful, 4 = very successful
4    X4   It's a waste of time to do your best as a teacher.      1 = strongly agree, 2 = agree, 3 = slightly agree, 4 = slightly disagree, 5 = disagree, 6 = strongly disagree
5    X5   You look forward to working at your school.             1 = strongly disagree, 2 = disagree, 3 = slightly disagree, 4 = slightly agree, 5 = agree, 6 = strongly agree
6    X6   How much of the time are you satisfied with your job?   1 = never, 2 = almost never, 3 = sometimes, 4 = always

As is typical of many datasets, TSUCCESS contains multiple variables – or "indicators" – that record teachers' responses to the survey items. These multiple indicators are intended to provide teachers with replicate opportunities to report their job satisfaction ("teacher job satisfaction" being the focal "construct" in the research).

To incorporate these multiple indicators successfully into subsequent analysis – whether as outcome or predictor – you must deal with several issues:

1. You must decide whether each of the indicators should be treated as a separate variable in subsequent analyses, or whether they should be combined to form a “composite” measure of the underlying construct of teacher job satisfaction.

2. To form such a composite, you must be able to confirm that the multiple indicators actually “belong together” in a single composite.

3. If you can confirm that the multiple indicators do indeed belong together in a composite, you must decide on the “best way” to form that composite.

Always know your items. Read each one. Take the test.

Page 5 (Slide 5)

(The indicator codebook from Slide 4 is repeated here for reference.)

• Different Indicators Have Different Metrics:
  i. Indicators X1, X2, X4, and X5 are measured on 6-point scales.
  ii. Indicators X3 and X6 are measured on 4-point scales.
  iii. Does this matter, and how do we deal with it in the compositing process?
  iv. Is there a "preferred" scale length?

• Some Indicators "Point" In A "Positive" Direction And Some In A "Negative" Direction:
  i. Notice the coding direction of X4, compared to the directions of the rest of the indicators.
  ii. When we composite the indicators, what should we do about this?

• Coding Indicators On The "Same" Scale Does Not Necessarily Mean That They Have The Same "Value" At The Same Scale Points:
  i. Compare scale point "3" for indicators X3 and X6, for instance.
  ii. How do we deal with this, in compositing?

Indicators are not created equally:
• Different scales
• Positive or negative wording/direction/"polarity"
• Different variances on similar scales
• Different means on similar scales (difficulty)
• Different associations with the construct (discrimination)

Always know the scale of your items. Score your test.
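A minimal Stata sketch of two such fixes, not from the original deck (the R-suffixed variable names are illustrative):

* Reorient a negatively worded 6-point item so high = more satisfied.
* (In TSUCCESS, X4 is already recorded in the reversed direction, so this
* line is purely illustrative for an item that arrived negatively oriented.)
gen X4R = 7 - X4

* One non-standardizing fix for mixed scale lengths, mentioned on Slide 13:
* stretch the 4-point items toward the 6-point range by multiplying by 3/2.
gen X3R = X3 * 3/2
gen X6R = X6 * 3/2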

Page 6 (Slide 6)

*-----------------------------------------------------------------------------------
* Input the raw dataset; name and label the variables and selected values.
*-----------------------------------------------------------------------------------
* Input the target dataset:
infile X1-X6 using "C:\My Documents\ … \Datasets\TSUCCESS.txt"

* Label the variables:
label variable X1 "Have high standards of teaching"
label variable X2 "Continually learning on job"
label variable X3 "Successful in educating students"
label variable X4 "Waste of time to do best as teacher"
label variable X5 "Look forward to working at school"
label variable X6 "Time satisfied with job"

* Label the values of the variables:
label define lbl1 1 "Strongly Disagree" 2 "Disagree" 3 "Slightly Disagree" ///
    4 "Slightly Agree" 5 "Agree" 6 "Strongly Agree"
label values X1 X2 X5 lbl1

label define lbl2 1 "Strongly Agree" 2 "Agree" 3 "Slightly Agree" ///
    4 "Slightly Disagree" 5 "Disagree" 6 "Strongly Disagree"
label values X4 lbl2

label define lbl3 1 "Not Successful" 2 "A Little Successful" ///
    3 "Successful" 4 "Very Successful"
label values X3 lbl3

label define lbl4 1 "Never" 2 "Almost Never" ///
    3 "Sometimes" 4 "Always"
label values X6 lbl4

Standard data-input and indicator-naming statements. Label items descriptively, ideally with item stems/prompts.

Make absolutely sure that your item scales are oriented in the same direction: positive should mean something similar across items.

Look at your data…

. list X1-X6 in 1/35, nolabel clean

       X1  X2  X3  X4  X5  X6
  1.    5   5   3   3   4   2
  2.    4   3   2   1   1   2
  3.    4   4   2   2   2   2
  4.    .   6   3   5   3   3
  5.    4   4   3   2   4   3
  6.    .   5   2   4   3   3
  7.    4   4   4   4   5   3
  8.    6   4   4   1   1   2
  9.    6   6   3   6   5   3
 10.    3   5   3   6   3   3
  (rows 11–35 omitted)

Every row is a person. A person-by-item matrix is a standard data representation in psychometrics. Note that we have some missing data.

Page 7 (Slide 7): Exploratory Data Analysis for Item Responses

[Figure: histograms of response percentages for each of the six items – "Have high standards of teaching," "Continually learning on job," "Successful in educating students," "Waste of time to do best as teacher," "Look forward to working at school," and "Time satisfied with job." The y-axis is Percent; the x-axis is the item's 6-point or 4-point response scale.]

. tabstat X1-X6, stats(mean sd n min max) col(statistics)

    variable      mean        sd     N   min   max
          X1  4.331175  1.090758  5097     1     6
          X2  3.873361  1.247791  5109     1     6
          X3  3.152216   .673924  5144     1     4
          X4  4.223199  1.669808  5121     1     6
          X5  4.418882   1.33348  5116     1     6
          X6  2.835902  .5724269  5125     1     4

. table NMISSING

    NMISSING   Freq.
           0   4,955
           1     169
           2      15
           3      13
           4      13
           5      12
           6      93

Are these items on the same "scale"?

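The NMISSING variable tabulated above counts each teacher's missing item responses. A minimal sketch of how it could be constructed (a standard Stata idiom; the construction is not shown in the original deck):

* Count missing responses across the six items, one count per teacher:
egen NMISSING = rowmiss(X1-X6)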

Page 8 (Slide 8): Missing Data and Pairwise Correlations: Pairwise Deletion vs. Casewise/Listwise Deletion

. table NMISSING

    NMISSING   Freq.
           0   4,955
           1     169
           2      15
           3      13
           4      13
           5      12
           6      93

. * Contrast selected univariate statistics for the listwise-deleted sample:

. tabstat X1-X6, stats(mean sd n min max) col(statistics)

    variable      mean        sd     N   min   max
          X1  4.331175  1.090758  5097     1     6
          X2  3.873361  1.247791  5109     1     6
          X3  3.152216   .673924  5144     1     4
          X4  4.223199  1.669808  5121     1     6
          X5  4.418882   1.33348  5116     1     6
          X6  2.835902  .5724269  5125     1     4

. tabstat X1-X6 if NMISSING==0, stats(mean sd n min max) col(statistics)

    variable      mean        sd     N   min   max
          X1  4.329364  1.088205  4955     1     6
          X2  3.873663  1.242735  4955     1     6
          X3  3.154995  .6692635  4955     1     4
          X4  4.227043  1.665968  4955     1     6
          X5   4.42442  1.328885  4955     1     6
          X6  2.836529  .5714207  4955     1     4

. * With pairwise deletion of cases that contain missing values:

. pwcorr X1-X6, obs

            X1      X2      X3      X4      X5      X6
    X1  1.0000
          5097
    X2  0.5520  1.0000
          5058    5109
    X3  0.1611  0.1642  1.0000
          5069    5082    5144
    X4  0.2110  0.2324  0.2961  1.0000
          5071    5079    5094    5121
    X5  0.2533  0.2706  0.3564  0.4465  1.0000
          5069    5070    5088    5091    5116
    X6  0.1931  0.2241  0.4367  0.3954  0.5500  1.0000
          5060    5069    5094    5082    5081    5125

. * With listwise deletion of cases that contain missing values.
. * Note that "casewise" is a synonym for "listwise":

. pwcorr X1-X6, obs casewise

            X1      X2      X3      X4      X5      X6
    X1  1.0000
          4955
    X2  0.5548  1.0000
          4955    4955
    X3  0.1610  0.1663  1.0000
          4955    4955    4955
    X4  0.2127  0.2313  0.2990  1.0000
          4955    4955    4955    4955
    X5  0.2531  0.2697  0.3557  0.4478  1.0000
          4955    4955    4955    4955    4955
    X6  0.1921  0.2225  0.4326  0.3993  0.5529  1.0000
          4955    4955    4955    4955    4955    4955

Diagonals of correlation matrices are always 1 (or left blank). Under pairwise deletion, the n-count beneath the diagonal entry for X1 is the number of teachers who responded to Question 1, and the n-count beneath an off-diagonal entry (say, the X2–X3 correlation) is the number of teachers who responded to BOTH Questions 2 and 3. Note that n-counts differ across pairs.

Complete data vs. casewise/listwise deletion (keep if NMISSING==0; drop if NMISSING>0): note the differing n-counts across variables under pairwise deletion but not in the casewise/listwise-deleted data.

We'll proceed with listwise deletion here, but keep in mind the assumption that data are missing at random from the population. If missing data are few, no worries. Otherwise, explicitly state your assumptions and your approach, and consider advanced techniques like "multiple imputation."
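If you want Stata to make the missing-data patterns explicit before choosing a deletion strategy, a hedged sketch using built-in diagnostics (standard commands in Stata 11 and later; not part of the original deck):

* How many values are missing per variable, and in which combinations?
misstable summarize X1-X6
misstable patterns X1-X6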

Page 9 (Slide 9): Pairwise Correlations and the Argument for a "Construct"

Indicator                                    X1           X2           X3           X4           X5           X6
X1: Have high standards of teaching           .          0.56         0.16         0.21         0.25         0.19
X2: Continually learning on the job      0.55 (5058)      .           0.17         0.23         0.27         0.22
X3: Successful in educating students     0.16 (5069)  0.16 (5082)      .           0.30         0.36         0.43
X4: Waste of time to do best as teacher  0.21 (5071)  0.23 (5079)  0.30 (5094)      .           0.45         0.40
X5: Look forward to working at school    0.25 (5069)  0.27 (5070)  0.36 (5088)  0.45 (5091)      .           0.55
X6: Time satisfied with job              0.19 (5060)  0.22 (5069)  0.44 (5094)  0.40 (5082)  0.55 (5081)      .

Below the diagonal: bivariate correlations estimated under pairwise deletion (pairwise n in parentheses).
Above the diagonal: bivariate correlations estimated under listwise deletion (n = 4955).

To justify forming a single composite, you must argue that all indicators measure the same construct. Here, generally positive inter-correlations support a "uni-dimensional" view. But the small and heterogeneous values of the indicator inter-correlations also suggest:
• Either there is considerable measurement error in each indicator,
• Or some, or all, of the indicators may also measure other unrelated constructs.

This is bad news for the "internal consistency" (reliability) of the ultimate composite.

Sample inter-correlations among the indicators:
• Are all positive (thankfully!),
• Are of small to moderate magnitude, and differ widely (unfortunately!).

Page 10 (Slide 10): Three Ways of Looking at "Reliability"

Three Definitions of Reliability
1. Reliability is the correlation between two sets of observed scores from a replication of a measurement procedure.
2. Reliability is the proportion of "observed score variance" that is accounted for by "true score variance."
3. Reliability is like an average of pairwise interitem correlations, "scaled up" according to the number of items on the test (because averaging over more items decreases error variance).

Three Necessary Intuitions
1. Any observed score is one of many possible replications.
2. Any observed score is the sum of a "true score" (the average of all theoretical replications) and an error term.
3. Averaging over replications gives us better estimates of "true" scores by averaging over error terms.

$\rho_{XX'} = \mathbf{E}\left(\mathrm{Corr}(X, X')\right)$

$\rho_{XX'} = \dfrac{\sigma_T^2}{\sigma_X^2}$

$\hat{\rho}_{\alpha} = \dfrac{n_j \bar{r}}{1 + (n_j - 1)\bar{r}}$

[Path diagram: a single true score $T_i$ underlying observed item scores $X_{i1}$, $X_{i2}$, $X_{i3}$, each with its own error term $E_{i1}$, $E_{i2}$, $E_{i3}$.]

Page 11 (Slide 11): 1) Correlation Between Two Replications of a Measurement Procedure

Robert Brennan, reliability guru, likes to use this aphorism: A person with one watch knows what time it is. A person with two watches is never quite sure.

Operationalizing a "Replication"

Parallel Forms (same students, different everything else)
• Ideal but impractical; confounded by intermediate learning and systematic decline in motivation.

Test-Retest Reliability (same students, same items, different everything else)
• Assumes items are "fixed"; also impractical; confounded by learning, motivation, and remembering.

Split-Half Reliability (take two halves of a test as a replication)
• Dependent on which half you select. Only captures error due to sampling of items.

Internal Consistency (e.g., Cronbach's α)
• An average of all possible split halves. Only captures error due to sampling of items.

Population Reliability, Defined: The reliability parameter is the expected correlation of two sets of observed scores across a replication of a measurement procedure:

$\rho_{XX'} = \mathbf{E}\left(\mathrm{Corr}(X, X')\right)$

where $\rho_{XX'}$ is the reliability parameter, $\mathbf{E}$ is the expected value (long-run average), $X$ is replication 1 of the measurement procedure, and $X'$ is replication 2.

Reliability, Estimated:
1. Obtain a sample of scores, $X$.
2. Replicate the measurement procedure on the same participants to obtain $X'$.
3. Calculate the correlation: $\hat{\rho}_{XX'} = \mathrm{Corr}(X, X')$.
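To make the split-half idea concrete, a minimal Stata sketch, not from the original deck; the odd/even half assignment and the HALF1/HALF2 names are arbitrary assumptions, and STDX1-STDX6 are the standardized items constructed on Slide 13:

* Score two arbitrary half-tests (assumes complete cases; rowtotal() would
* otherwise treat missing values as zeros):
egen HALF1 = rowtotal(STDX1 STDX3 STDX5)
egen HALF2 = rowtotal(STDX2 STDX4 STDX6)

* Correlate the halves, then "prophesy" up to full length (Spearman-Brown):
corr HALF1 HALF2
di "Corrected split-half reliability: " 2*r(rho)/(1 + r(rho))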

Page 12 (Slide 12): 2) Proportion of Observed Score Variance Accounted for by True Score Variance

Classical Test Theory: A person $i$'s observed score, $X_i$, is the sum of their true (unobserved) score, $T_i$, and an error term, $E_i$:

$X_i = T_i + E_i$

Note the difference between Expectation ($\mathbf{E}$) and Error ($E$). The long-run average of error terms is 0. The long-run average of observed scores is the true score.

The Reliability of a measure (or composite) is a population parameter that describes how much of the observed variance in the measure (or composite) is actually true variance:

$\rho_{XX'} = \dfrac{\text{Population variance of True scores}}{\text{Population variance of Observed scores}} = \dfrac{\sigma_T^2}{\sigma_X^2}$

[Path diagram: true score $T_i$ underlying observed scores $X_{i1}$, $X_{i2}$, $X_{i3}$, with errors $E_{i1}$, $E_{i2}$, $E_{i3}$.]

And, because $T$ and $E$ are uncorrelated (following from our Classical Test Theory definitions), we know that $\sigma_X^2 = \sigma_T^2 + \sigma_E^2$, thus:

$\rho_{XX'} = \dfrac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$

Page 13 (Slide 13): Interlude: To Standardize or Not to Standardize

For an additive composite of "standardized" indicators: first, each indicator is standardized to a mean of 0 and a standard deviation of 1:

$X^*_{1i} = \dfrac{X_{1i} - 4.33}{1.09}, \quad X^*_{2i} = \dfrac{X_{2i} - 3.87}{1.24}, \quad X^*_{3i} = \dfrac{X_{3i} - 3.15}{0.67},$
$X^*_{4i} = \dfrac{X_{4i} - 4.23}{1.67}, \quad X^*_{5i} = \dfrac{X_{5i} - 4.42}{1.33}, \quad X^*_{6i} = \dfrac{X_{6i} - 2.84}{0.57}$

Then, the standardized indicator scores are summed:

$X^*_i = X^*_{1i} + X^*_{2i} + X^*_{3i} + X^*_{4i} + X^*_{5i} + X^*_{6i}$

For an additive composite of "raw" indicators: each indicator remains in its original metric, and composite scores are the sum of the scores on the raw indicators, for each person in the sample:

$X_i = X_{1i} + X_{2i} + X_{3i} + X_{4i} + X_{5i} + X_{6i}$

where $X_{1i}$ is the raw score of the $i$th teacher on the 1st indicator, and so on…

This is consequential. Consider: Are the scales interchangeable? Do a "very successful" and an "always" and a "slightly agree" share meaning? How would you score the test? If it's a 4-point item, do you only add a maximum of 4 points? Standardizing assumes that 1) scores are sums of standardized items, 2) scale points do not share meaning across items, and 3) "one standard deviation" has more in common than "one unit," across items. Here, we standardize, but I would probably multiply the 4-point items by 3/2 and not standardize, in practice.
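A minimal Stata sketch of both compositing routes (the composite names STDCOMP and RAWCOMP are illustrative; the STDX1-STDX6 names match the standardized items used on the following slides):

* Standardize each item to mean 0, sd 1 (creates STDX1-STDX6):
foreach v of varlist X1-X6 {
    egen STD`v' = std(`v')
}

* Additive composite of standardized indicators:
egen STDCOMP = rowtotal(STDX1-STDX6)

* Additive composite of raw indicators:
egen RAWCOMP = rowtotal(X1-X6)

* Note: rowtotal() treats missing values as zeros, so restrict to complete
* cases first (e.g., keep if NMISSING==0).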

Page 14 (Slide 14): 3) Average Interitem Covariance, Scaled Up

Reliability, Defined: Internal consistency reliability can be operationalized as a "standardized coefficient alpha":

$\hat{\rho}_{\alpha} = \dfrac{n_j \bar{r}}{1 + (n_j - 1)\bar{r}}$

where $n_j$ is the number of items and $\bar{r}$ is the average inter-item correlation. When $n_j = 1$, $\hat{\rho}_{\alpha} = \bar{r}$.

. alpha STDX1-STDX6

Test scale = mean(unstandardized items)

Average interitem covariance:     .3167178
Number of items in the scale:            6
Scale reliability coefficient:      0.7355

alpha: our straightforward command to obtain Cronbach's alpha, an "internal consistency" estimate of population reliability. Running this on standardized variables, STDX1-STDX6 (or running it on unstandardized variables and using the std option), gives us "standardized coefficient alpha."

Recall that covariance is an "unstandardized" correlation. A covariance on standardized variables is thus a correlation. The average interitem covariance here, .317, is the straight average of our interitem correlations from Slide 9.

Now, we get to "scale up" the average interitem correlation by the number of items in the scale, following the earlier equation: $\hat{\rho}_{\alpha} = \dfrac{n_j \bar{r}}{1 + (n_j - 1)\bar{r}}$. Why? The long-run average of errors is zero. The correlation between averages will rise. The proportion of observed score variance will rise as error variance drops.
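A worked check, plugging the output above into the standardized-alpha formula (added for illustration):

$\hat{\rho}_{\alpha} = \dfrac{6 \times 0.3167}{1 + 5 \times 0.3167} = \dfrac{1.9003}{2.5836} \approx 0.7355$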

Page 15 (Slide 15): The Spearman-Brown "Prophecy" Formula

Provided each indicator in a composite measures the same underlying construct, the more indicators you include in the composite, the higher the reliability of the composite. Because:
• Measurement errors in each indicator are random, and cancel out in the composite.
• Any true variation in each indicator combines and surfaces through the noise.

Number of Items   Composite     Composite     Composite     Composite
in Composite      Reliability   Reliability   Reliability   Reliability
                  (r̄ = .2)      (r̄ = .4)      (r̄ = .6)      (r̄ = .8)
 1                0.2000        0.4000        0.6000        0.8000
 2                0.3333        0.5714        0.7500        0.8889
 3                0.4286        0.6667        0.8182        0.9231
 4                0.5000        0.7273        0.8571        0.9412
 5                0.5556        0.7692        0.8824        0.9524
 6                0.6000        0.8000        0.9000        0.9600
 7                0.6364        0.8235        0.9130        0.9655
 8                0.6667        0.8421        0.9231        0.9697
 9                0.6923        0.8571        0.9310        0.9730
10                0.7143        0.8696        0.9375        0.9756

[Figure: reliability of the composite (y-axis, 0.00 to 1.00) plotted against the number of indicators, or multiplicative factor, K (x-axis, 1 to 10), with one curve for each average inter-item correlation, r̄ = .2, .4, .6, and .8.]

As a consequence, the population reliability of a composite of $K$ indicators, each with separate, known population reliability $\rho$, can be predicted using the Spearman-Brown Prophecy Formula:

$\rho_{\text{composite}} = \dfrac{K \rho}{1 + (K - 1)\rho}$

More generally, this can be interpreted as the estimated reliability if a test's length is multiplied by a factor of $K$.
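A worked check against the table above (added for illustration): with $\bar{r} = .2$ and $K = 6$,

$\rho_{\text{composite}} = \dfrac{6 \times .2}{1 + (6 - 1) \times .2} = \dfrac{1.2}{2.0} = 0.6000$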

Page 16 (Slide 16): Three Ways of Looking at "Reliability" (Recap of Slide 10)

Three Definitions of Reliability
1. Reliability is the correlation between two sets of observed scores from a replication of a measurement procedure: $\rho_{XX'} = \mathbf{E}\left(\mathrm{Corr}(X, X')\right)$.
2. Reliability is the proportion of "observed score variance" that is accounted for by "true score variance": $\rho_{XX'} = \dfrac{\sigma_T^2}{\sigma_X^2}$.
3. Reliability is like an average of pairwise interitem correlations, "scaled up" according to the number of items on the test (because averaging over more items decreases error variance): $\hat{\rho}_{\alpha} = \dfrac{n_j \bar{r}}{1 + (n_j - 1)\bar{r}}$.

Three Necessary Intuitions
1. Any observed score is one of many possible replications.
2. Any observed score is the sum of a "true score" (the average of all theoretical replications) and an error term.
3. Averaging over replications gives us better estimates of "true" scores by averaging over error terms.

Page 17 (Slide 17): A Baseline Reliability Analysis

. alpha X1-X6, label item casewise std

                          item-test  item-rest  avg. interitem
Item         Obs  Sign      corr.      corr.        corr.       alpha   Label
X1          4955   +        0.6029     0.4007       0.3377      0.7183  Have high standards of teaching
X2          4955   +        0.6209     0.4239       0.3306      0.7118  Continually learning on job
X3          4955   +        0.6133     0.4140       0.3336      0.7145  Successful in educating students
X4          4955   +        0.6579     0.4726       0.3161      0.6979  Waste of time to do best as teacher
X5          4955   +        0.7313     0.5733       0.2872      0.6682  Look forward to working at school
X6          4955   +        0.7110     0.5449       0.2951      0.6768  Time satisfied with job
Test scale                                          0.3167      0.7355  mean(standardized items)

Test scale = mean(standardized items)

We use unstandardized items but include the std option, to standardize. Casewise deletion leads to 4955 observations across all items. Positive signage because we already reversed the polarity of X4.

These are diagnostics that explain item functioning and sometimes, with additional analysis, warrant item adaptation or exclusion. However, no item should be altered or excluded on the basis of these statistics alone. Item-Test Correlation is the straight correlation between item responses and total test scores (the higher the better). Item-Rest Correlation is the same, but the total test score excludes the target item (the higher the better). Interitem Correlation shows the average interitem correlation for all items not including the target item (the lower the better). Alpha (excluded-item alpha) shows the would-be estimate if the item were excluded (the lower the better).

.317 is the straight average interitem correlation, $\bar{r}$. .74 is the scaled-up interitem correlation, our "internal consistency" estimate of reliability.

Cronbach's alpha is 0.74 and can be interpreted as the estimated correlation between two sets of teacher scores (calculated as summed standardized item scores) from a replication of this measurement procedure. Equivalently, it can be interpreted as the estimated proportion of observed score variance accounted for by true score variance.

Page 18 (Slide 18): Reliability from a Multilevel Modeling Perspective: Reshaping Data

The data in wide format: every participant is a row; every item is a column.

The data in long format: every item score is a row, with a single column for all score replications.

Think it might be possible to consider teachers as grouping variables for item scores? xtset ID?

Wide format (first 10 teachers):

       STDX1  STDX2  STDX3  STDX4  STDX5  STDX6
  1.    0.62   0.91  -0.23  -0.74  -0.32  -1.46
  2.   -0.30  -0.70  -1.73  -1.94  -2.58  -1.46
  3.   -0.30   0.10  -1.73  -1.34  -1.82  -1.46
  4.      .      .      .      .      .      .
  5.   -0.30   0.10  -0.23  -1.34  -0.32   0.29
  6.      .      .      .      .      .      .
  7.   -0.30   0.10   1.26  -0.14   0.43   0.29
  8.    1.54   0.10   1.26  -1.94  -2.58  -1.46
  9.    1.54   1.71  -0.23   1.06   0.43   0.29
 10.   -1.22   0.91  -0.23   1.06  -1.07   0.29

Long format (the first 4 teachers, 6 rows each):

       ID  ITEM   STDX  NMISSING
  1.    1     1   0.62         0
  2.    1     2   0.91         0
  3.    1     3  -0.23         0
  4.    1     4  -0.74         0
  5.    1     5  -0.32         0
  6.    1     6  -1.46         0
  7.    2     1  -0.30         0
  8.    2     2  -0.70         0
  9.    2     3  -1.73         0
 10.    2     4  -1.94         0
 11.    2     5  -2.58         0
 12.    2     6  -1.46         0
 13.    3     1  -0.30         0
 14.    3     2   0.10         0
 15.    3     3  -1.73         0
 16.    3     4  -1.34         0
 17.    3     5  -1.82         0
 18.    3     6  -1.46         0
 19.    4     1      .         1
 20.    4     2      .         1
 21.    4     3      .         1
 22.    4     4      .         1
 23.    4     5      .         1
 24.    4     6      .         1
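A minimal sketch of the wide-to-long reshape (standard Stata; assumes a teacher identifier ID already exists, and gen ID = _n creates one if not):

* One row per teacher becomes six rows per teacher; j(ITEM) strips the
* numeric suffix from STDX1-STDX6 into a new ITEM variable (1-6):
reshape long STDX, i(ID) j(ITEM)

* Declare teachers as the grouping (panel) variable:
xtset ID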

Page 19 (Slide 19): Reliability from a Multilevel Modeling Perspective: Intraclass Correlation

. xtset ID
       panel variable:  ID (balanced)

. xtreg STDX ITEM

Random-effects GLS regression            Number of obs      =     29730
Group variable: ID                       Number of groups   =      4955

R-sq:  within  = 0.0000                  Obs per group: min =         6
       between = 0.0000                                 avg =       6.0
       overall = 0.0000                                 max =         6

                                         Wald chi2(1)       =      0.00
corr(u_i, X) = 0 (assumed)               Prob > chi2        =    1.0000

        STDX       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ITEM   -1.83e-09   .0028069    -0.00   1.000     -.0055014   .0055014
       _cons    8.20e-09   .0135431     0.00   1.000     -.0265439   .0265439

     sigma_u   .56279317
     sigma_e   .82654215
         rho   .31676527   (fraction of variance due to u_i)

. di e(rho)
.31676527

. di e(sigma_u)^2/(e(sigma_u)^2+e(sigma_e)^2)
.31676527

. di e(sigma_u)^2/(e(sigma_u)^2+e(sigma_e)^2/6)
.7355725

. alpha STDX1-STDX6

Test scale = mean(unstandardized items)

Average interitem covariance:     .3167178
Number of items in the scale:            6
Scale reliability coefficient:      0.7355

The multilevel $\rho$ is the intraclass correlation: the expected value of the correlation between two draws from within the same teacher, which here is the average interitem correlation. Make sense? Or the proportion of total variation attributable to between-group (between-teacher) variation?

Knowing that the error variance will be divided by 6, we can estimate reliability as a scaled-up intraclass correlation coefficient.

$\hat{\rho}_{\alpha} = \dfrac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \dfrac{\hat{\sigma}_e^2}{n_j}} = \dfrac{.563^2}{.563^2 + \dfrac{.827^2}{6}} = .736$