Diagnostic Modeling of Intra-Organizational Mechanisms for Supporting Policy Implementation
Ryan Brock Mutcheson
Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in
partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Educational Research and Evaluation
Gary E. Skaggs, Chair
Sue G. Magliaro
Yasuo Miyazaki
Kusum Singh
April 28, 2016
Blacksburg, Virginia
Keywords: cognitive diagnostic modeling, latent class model, teacher evaluation, effective
teaching, psychometrics, C-RUM model
Diagnostic Modeling of Intra-Organizational Mechanisms for Supporting Policy Implementation
Ryan Brock Mutcheson
ABSTRACT
The Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for
Teachers represented a significant overhaul of conventional teacher evaluation criteria in
Virginia. The policy outlined seven performance standards by which all Virginia teachers would
be evaluated. This study explored the application of cognitive diagnostic modeling to measure
teachers’ perceptions of intra-organizational mechanisms available to support educational
professionals in implementing this policy.
It was found that a coarse-grained, four-attribute compensatory reparameterized unified
model (C-RUM) fit the teacher perception data better and had lower standard errors than the
competing finer-grained models. The Q-matrix accounted for the complex loadings of items to
the four theoretically and empirically driven mechanisms of implementation support, including
characteristics of the policy, teachers, leadership, and the organization. The mechanisms were
positively, significantly, and moderately correlated, which suggested that each mechanism
captured a different, yet related, component of policy implementation support. The diagnostic
profile estimates indicated that the majority of teachers perceived support on items relating to
“characteristics of teachers.” Moreover, almost 60% of teachers were estimated to belong to
profiles with perceived support on “characteristics of the policy.” Finally, multiple group
multinomial log-linear models (Xu & von Davier, 2008) were used to analyze the data across
subjects, grade levels, and career status. STEM teachers reported lower perceived support than
non-STEM teachers with the same profile, suggesting that STEM teachers required different
supports than their non-STEM peers.
The precise diagnostic feedback on the implementation process provided by this
application of diagnostic models will be beneficial to policy makers and educational leaders.
Specifically, they will be better prepared to identify strengths and weaknesses and target
resources for a more efficient, and potentially more effective, policy implementation process. It
is assumed that when equipped with more precise diagnostic feedback, policy makers and
school leaders may be able to more confidently engage in empirical decision making, especially
in regards to targeting resources for short-term and long-term organizational goals subsumed
within the policy implementation initiative.
Table of Contents
CHAPTER 1 ................................................................................................................................................ 1
RESEARCH PROBLEM ........................................................................................................................... 1
Context of the Research Study ............................................................................................................... 1
Statement of the Problem .................................................................................................................... 2
Purpose of the Study ............................................................................................................................. 5
Overview of the Study ............................................................................................................................ 6
Overview of the Theoretical Framework .............................................................................................. 6
Overview of the Sample and Data Collection ....................................................................................... 8
Overview of the Methodology ............................................................................................................ 12
Significance of the Study ...................................................................................................................... 15
Limitations ............................................................................................................................................. 17
Organization of the Study .................................................................................................................... 19
CHAPTER 2 .............................................................................................................................................. 20
REVIEW OF THE LITERATURE......................................................................................................... 20
Introduction ........................................................................................................................................... 20
Teacher Effectiveness ........................................................................................................................... 20
Traditional Teacher Evaluation Systems and the Widget Effect ..................................................... 22
Teacher Evaluation Policies and the Multiple Measures of Teacher Effectiveness ........................ 24
The Policy: Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for
Teachers ................................................................................................................................................. 27
Theoretical Framework: Policy Implementation in K-12 Organizations ........................................ 31
Preliminary Assumptions about the Policy Implementation Process ................................................. 31
Theories of Utility Maximization and Policy Implementation ............................................................ 33
Social Capital and Policy Implementation ........................................................................................... 34
Frameworks for K-12 Policy Implementation ..................................................................................... 35
Review of Cognitive Diagnostic Modeling Concepts .......................................................................... 38
Measurement Theory ......................................................................................................................... 38
Cognitive Diagnostic Modeling Theory and Applications ................................................................... 44
Model Specification: Compensatory vs. Non-Compensatory ............................................................. 47
The Core Compensatory and Non-Compensatory Cognitive Diagnostic Models ............................... 49
The Q-Matrix ....................................................................................................................................... 52
CHAPTER 3 .............................................................................................................................................. 56
METHODOLOGY ................................................................................................................................... 56
The ITES Project, Current Study, and the Role of the Researcher ................................................. 57
Data Collection and Sample ................................................................................................................. 57
Instrumentation..................................................................................................................................... 62
Item Analysis ....................................................................................................................................... 63
Descriptive Statistics ........................................................................................................................... 67
Plan of Analysis ..................................................................................................................................... 69
The Q-Matrix: Defining the Attribute and Skills Space ....................................................................... 70
The Compensatory Reparameterized Unified Model (C-RUM) .......................................................... 72
Cognitive Diagnostic Model Estimation Method ................................................................................ 73
Analyses for Research Question 1 ...................................................................................................... 75
Analysis for Research Question 2 ........................................................................................................ 80
CHAPTER 4 .............................................................................................................................................. 84
RESULTS .................................................................................................................................................. 84
Preliminary Analysis: Q-Matrix Development ................................................................................ 84
Data Dimensionality ............................................................................................................................ 84
Exploratory Factor Analysis ................................................................................................................. 85
The Q-Matrix and Model Specification ............................................................................................... 90
Findings .................................................................................................................................................. 91
Research Question 1: Testing the New Application of Cognitive Diagnostic Models ......................... 91
Research Question 2: Exploring Group Comparisons Using the Diagnostic Model .......................... 107
CHAPTER 5 ............................................................................................................................................ 124
IMPLICATIONS .................................................................................................................................... 124
Discussion on Research Question 1a ................................................................................................. 125
Discussion on Research Question 1b ................................................................................................. 127
Discussion on Research Question 1c.................................................................................................. 127
Discussion for Research Question 2a ................................................................................................ 129
Discussion for Research Question 2b ................................................................................................ 130
Summary of Results ............................................................................................................................ 132
Limitations ........................................................................................................................................... 134
Future Research .................................................................................................................................. 136
References ................................................................................................................................................ 139
APPENDICES ........................................................................................................................... 146
Appendix A: ............................................................................................................................................. 146
ITES Survey Items By Factor ............................................................................................................... 146
Appendix B .......................................................................................................................................... 150
IRB Approval ...................................................................................................................................... 150
Appendix C .......................................................................................................................................... 151
Informal Blueprint of Teacher Survey Item Development ................................................................ 151
Appendix D .......................................................................................................................................... 152
Polytomous EFA Model Dimensionality Summary ............................................................................. 152
Appendix E .......................................................................................................................................... 154
Appendix F .......................................................................................................................................... 157
Description of Model Evaluation Discrimination Index (DI) ............................................................. 157
Appendix G .......................................................................................................................................... 158
List of Figures
Figure 1. Conceptual Framework of Mechanisms that Influence the Implementation ............................... 8
Figure 2. Item Flagged for Substantive Review ........................................................................................... 65
Figure 3. Distribution of ITES Survey Dichotomized Total Scores ............................................................... 68
Figure 4. Exploratory Factor Analysis Scree Plot ......................................................................................... 87
Figure 5. Average Standard Errors of Intercept Parameter Estimates ....................................................... 98
Figure 6. Comparisons of Confidence Intervals for Item 22 Intercept Estimates by Group ..................... 117
Figure 7. Comparisons of Confidence Intervals for Item 23 Intercept Estimates by Group ..................... 118
Figure 8. Comparisons of Confidence Intervals for Item 28 Intercept Estimates by Group ..................... 119
Figure 9. Comparisons of Confidence Intervals for Four Item Slope Estimates by Grade Level .............. 123
Figure 10. Description of Model Evaluation Discrimination Index (DI) .................................................... 157
List of Tables
Table 1. Data Sources and Descriptions ........................................................................................................ 9
Table 2. Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers . 29
Table 3. Example of Q-Matrix ..................................................................................................................... 53
Table 4. Data Description for the Current Study ........................................................................................ 58
Table 5. 2013-2014 School-Year Local Education Agency Information ...................................................... 59
Table 6. Comparison of Teacher Characteristics......................................................................................... 61
Table 7. Crosstabs of Teacher Characteristics ............................................................................................ 61
Table 8. Polytomous Item Descriptive Statistics ......................................................................................... 64
Table 9. Dichotomized Item Descriptive Statistics ...................................................................................... 67
Table 10. Total Scores by Group ................................................................................................................. 69
Table 11. Exploratory Factor Analysis Eigenvalues by Factor ..................................................................... 86
Table 12. Summary Interpretation of Factors ............................................................................................. 89
Table 13. Final Model: Grain Sizes .............................................................................................................. 90
Table 14. An example of the Final Q-Matrix ............................................................................................... 91
Table 15. Fit Results for Models with 7-10 Attributes ................................................................................ 94
Table 16. Fit Results for Unidimensional Model and Models with 4-6 Attributes ..................................... 97
Table 17. Comparisons of 4-Attribute Model Average Parameter Estimate Distributions ...................... 100
Table 18. Parameter Distributions for Final Model .................................................................................. 101
Table 19. Example of Estimated Posterior Probabilities of Attribute Perceived Support By Respondent ... 103
Table 20. Example of Estimated Latent Profiles Based on Posterior Probabilities By Respondent ......... 103
Table 21. Distribution of Diagnostic Categorical Profiles for 4-Attribute Model ...................................... 105
Table 22. Percentage of Teachers in Profiles with Perceived Support by Attribute ................................. 107
Table 23. Estimated Group Proportions by Profile ................................................................................... 111
Table 24. Correlations Between Attributes .............................................................................................. 112
Table 25. Fit Statistic Comparisons Between Models ............................................................................... 113
Table 26. Comparisons of Intercept Estimate Distributions of Single vs. Multi-Group Models ............... 115
Table 27. Comparisons of Slope Estimate Distributions of Single vs. Multi-Group Models ..................... 120
Table 28. Factor Analysis of Polytomous Items ........................................................................ 153
Table 29. 4-Factor Solution for 70-Item Teacher Evaluation Instrument by Applying the Maximum
Likelihood for Continuous Variables ......................................................................................... 154
Table 30. Full-Information Exploratory Factor Analysis Model Fit ........................................................... 158
Table 31. Deviance Test ............................................................................................................................ 159
CHAPTER 1
RESEARCH PROBLEM
Context of the Research Study
State and local education agencies (LEAs) are investing substantial resources, in both
human and financial capital, toward a nationwide initiative that re-conceptualizes teacher
evaluation (Darling-Hammond et al., 2012; Hallinger, Heck, & Murphy, 2014).
Moreover, the Gates Foundation has invested $45 million in the Measures of Effective Teaching
(MET) project that uses multiple measures of teacher effectiveness, including student evaluations
of teachers, student classroom work, and evaluations of classroom practice using multiple rubrics
(e.g., Kane, McCaffrey, Miller, & Staiger, 2013).
According to Sartain, Stoelinga, and Brown (2011), two main factors have motivated this
movement. First, the traditional teacher evaluation process is generally not an effective
mechanism for promoting and supporting teacher development in order to improve student
achievement. Secondly, the traditional teacher evaluation process has not proven to be an
effective mechanism for providing data for making empirically supported personnel decisions.
For example, in a project funded by the New Teacher Project, Weisberg, Sexton, Mulhern, and
Keeling (2009) found that under the traditional teacher evaluation system in Chicago Public
Schools (CPS), 93 percent of teachers were rated as either “superior” or “excellent.” At the same
time, 66 percent of CPS schools were failing to meet state standards, suggesting a major
disconnect between classroom results and classroom evaluations. Moreover, they found that 99
percent of teachers were rated as “satisfactory” when their schools used a binary
satisfactory/unsatisfactory rating system.
In September 2011, with bipartisan support, the United States Department of Education
(USDOE) invited all State Educational Agencies (SEAs) to request flexibility regarding specific
requirements of the Elementary and Secondary Education Act (ESEA). Designing and
implementing teacher performance-based evaluation had been the main focus of the efforts to
implement ESEA flexibility (USDOE, Jan. 2013). Initially, the USDOE granted waivers to 34
states and the District of Columbia, including Virginia. The Virginia Department of Education
(VDOE) efforts produced the Guidelines for Uniform Performance Standards and Evaluation
Criteria for Teachers (GUPSECT), a document that became effective on July 1, 2012. The
guidelines called for 40% of teachers’ evaluations to be based on student academic progress
using multiple measures of learning and achievement. The formal signing of this policy
represented a significant overhaul of conventional evaluation criteria. The 2012-13 school-year
was the first year of the state-wide pilot and implementation.
Statement of the Problem
A better understanding of the impacts of the teacher evaluation policy in Virginia is
necessary to guide educational leaders and policy makers. At the national level, researchers
supported by major government and non-profit funding agencies, including Institute of
Education Science (IES), National Science Foundation (NSF), the Gates Foundation, and
Spencer Foundation, have been exploring the reliability and validity of the effective
performance-based evaluation rubrics and toolkits (Sun & Mutcheson, 2014). Unfortunately,
however, such policy studies have proven difficult to conduct due, in part, to issues regarding the
variance in key aspects of the implementation of educational policies. Even in cases where
statewide policies exist, there is enough ambiguity in the policy to lead to disparities in the
implementation across LEAs. Moreover, some components of the policies are explicitly intended
for LEAs to interpret and implement in ways that suit their particular needs in the best interests
of the students they serve.
Principals and superintendents across the country have highlighted concerns about the
successful implementation of the new systems, including gaining teachers’ buy-in, the cost of
training teachers under the new system, and tying evaluations to strategic compensation plans
(Sun & Mutcheson, 2014). Thus, in order to understand policy impacts, greater insight is needed
into the discrepancies between the intentions of education policy makers and the way that the policy
unfolds in reality (Coburn, 2001; Datnow & Park, 2009; Spillane & Miele, 2007; Spillane,
Reiser, & Reimer, 2002). Since political and legislative efforts have already outlined the new
teacher evaluation criteria, the issue facing practitioners is how to successfully implement this
system and use it in ways that promote teachers’ professional growth (Sun, Mutcheson & Kim,
2014).
Given that teachers are ultimately responsible for implementing teacher evaluation
policies in that they incorporate the policy principles into practice, their voices are potentially
very valuable sources of information in understanding what is and is not working in the policy
implementation process. Although few empirical studies explore how to best support LEAs in
conducting the new teacher evaluations (Rothstein & Mathis, 2013), there are even fewer studies
that attempt to improve the precision with which the key mechanisms for supporting the policy
implementation process are measured. More precise diagnostic feedback on the fidelity of policy
implementation could potentially help practitioners and researchers understand strengths and
weaknesses on organizational support mechanisms, identify remedial pathways, and target
resources toward perceived support, on all teacher evaluation policy components. Most
importantly, more precise diagnostic feedback may help researchers and practitioners to better
understand under what conditions the new teacher evaluation system works and help school
leaders reflect and adjust past practices of supporting teacher development. It is assumed that
more effective measurement practices will promote a stronger culture of continuous
improvement. Diagnostic results can be used to organize and inform teacher training, and
ultimately gain teacher “buy-in” on policies and procedures. All school districts invest
substantial resources on district-wide professional development. Hypothetically, if those
planning professional development had diagnostic information about teachers’ perceptions of
intra-organizational supports, they could use this information to direct teacher training in a much
more efficient, cost-effective manner. Teachers could be put in groups to receive personalized
support on the necessary policy elements. Moreover, patterns among the results can be used to
inform discussions among school leaders, and they can learn effective support strategies from
one another. It should be noted that this methodology is congruent with the increasingly popular
“standards-based” approach in K-12 education. Thus, the results from this type of study should
only be used for formative purposes. More clearly, this study does not support the use of these
results for personnel decisions such as hiring, firing, promotion, or compensation.
This will be the first study to approach the implementation of teacher evaluation policies
using cognitive diagnostic models. As discussed further in the following chapters, cognitive
diagnostic modeling will provide key measurement related advantages. According to Haberman,
von Davier, and Lee (2008), “…analyzing data from assessments that are created with the intent
of scaling respondents on a unidimensional continuum almost always provides extremely poor
results from a diagnostic standpoint.” One recent study (Halpin, 2015) focuses on using latent
class analysis to investigate the actual teacher evaluation scores of teachers when different
instruments were used. That study demonstrates the usefulness in applying a new methodology
to a popular, worthwhile concept. It is anticipated that this study will accomplish a similar
methodological goal. Specifically, this approach will offer useful, formative, diagnostic
information to LEAs, with the key difference being that the diagnostic information will be about
teachers’ perceptions of the strengths and weaknesses of the teacher evaluation policy
implementation process, instead of overall evaluation scores.
The anticipated result of this study will be diagnostic output that provides detailed
empirical information about the teachers’ perceptions of policy support involved in the
response processes, and about the manner in which these components interact (DiBello,
Roussos, & Stout, 2007). In an educational assessment context, it is commonly believed that an
identification of these perceptions, sometimes referred to as “mental components,” may help to
identify remedial pathways toward perceived support, on all components that are relevant and
educationally meaningful to the respondents (DiBello, Roussos, & Stout, 2007).
Purpose of the Study
The purpose of this study is to explore a new way to measure teachers’ perceptions of
school intra-organizational mechanisms for supporting teacher evaluation policy implementation.
Cognitive diagnostic models have not previously been applied to policy implementation support
constructs. The diagnostic output from the analysis in this study will provide detailed empirical
information about teachers’ perceptions of support. It is assumed that more precise diagnostic
feedback will be beneficial to policy makers and school leaders in identifying strengths and
weaknesses and in targeting resources in the policy implementation process. When equipped
with more precise diagnostic feedback, policy makers and school leaders may be able to more
confidently engage in empirical decision making, especially in regards to targeting resources for
short-term and long-term organizational goals subsumed within the policy implementation
initiative. Specifically, the following research questions are addressed:
1. Can cognitive diagnostic models be applied to understanding teachers’ perceptions of
intra-organizational mechanisms for supporting policy implementation?
a. Do diagnostic models specifying finer-grained attributes representing implementation
support mechanisms fit the data better than models specifying coarser grained
attributes?
b. Do diagnostic models specifying finer-grained attributes representing implementation
support mechanisms have more stable and accurate parameter estimates than models
specifying coarser grained attributes?
c. What are the posterior estimated proportions of teachers that are diagnosed to fall
under each latent profile?
2. Are there group differences in the diagnostic model fit based on grade level, subject
taught, and career status?
a. What are the group differences of the estimated latent class profile distributions
based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of parameter
estimates based on grade level, subject taught, and career status?
Overview of the Study
Overview of the Theoretical Framework
The development of the theoretical framework for this project relies on utility
maximization theories that explain individual behaviors in their social settings (Akerlof &
Kranton, 2005; Youngs, Frank, Thum, & Low, 2012), and the importance of capacity building as
schools organize for policy implementation (e.g., Bryk, Sebring, Allensworth, Luppescu, &
Easton, 2010; Spillane, Gomez, & Mesler, 2009). Moreover, the concept for this study is
influenced by organizational practices of applying new assessment strategies to propel
transitioning individuals, teams, and organizations to a desired future state. These ideas are
congruent with the ideas outlined in Rogers’ Diffusion of Innovations (1962) and have been
extended and adapted to inform many other studies that were used in the development of the
framework for this study. These ideas are explained in detail in chapter 2.
In this study, teachers’ perceptions are to be analyzed from a systematic view of school
and district organizational supports for policy implementation intended to permeate schools and
their classrooms (Sun & Mutcheson, 2014). The extent to which teachers perceive that they are
supported on the new teacher evaluation system is assumed to have implications for the
effectiveness of the implementation of the policy. Moreover, the variation in successful
implementation is also assumed to affect the variation in the degree to which the new evaluation
system could achieve its potential to provide more useful feedback to teachers and promote
effective teaching. Finally, the policy implementation is hypothesized to be influenced by
various factors, including the clarity of the guidelines, principals’ and teachers’ acceptance of
various aspects of the new teacher evaluation system, teacher and principal capacity to carry out
the reform, and the coherence of the new system with other school practices.
Relying on the aforementioned literature and assumptions, the mechanisms for
supporting the policy implementation are initially investigated in four components. The first
component is the teachers’ perceived support on policy guidelines. This includes, among other
supports, how clear the policy guidelines are to teachers, how specific and adaptable they are,
and how the policy lends itself to being communicated and monitored. Secondly, teachers’
perceptions of support mechanisms at the teacher level are explored. This includes teacher self-
efficacy, expertise, and capacity to change. It also includes the important support gained from
situated contexts like close colleagues. Next, school leadership characteristics are investigated. These include
leadership expertise and advocacy for the policy. Moreover, it includes teachers’ perceptions of
whether leadership values align with the policy values. Finally, organizational supports, such as
professional development, collaboration, resources, and locus of decision making, are explored.
These areas of the conceptual framework are discussed more thoroughly in
chapter two. The assumptions, theoretical components, and the overlapping relationships are
visually represented by the guiding model presented in Figure 1.
Figure 1. Conceptual Framework of Mechanisms that Influence the Implementation
Overview of the Sample and Data Collection
The data used in this study was obtained from a secondary source. Specific permission
was granted by the Principal Investigator of the project entitled “Exploring the
Implementation of Virginia’s New Teacher Evaluation Policy (ITES)”. The role of the researcher
in the current study was as a research team member of the ITES study. A clear distinction
between the studies and the role of the researcher is outlined in chapter 3.
The ITES study included three participating LEAs for a total of 35 schools: 6 high, 6
middle, and 23 elementary schools. The ITES project used a convenience sample of partnership
LEAs located in Southwest Virginia. A total of 19,315 students are served across all three
districts. All full-time teachers in each of the LEAs were eligible to take the survey. The final
sample included 747 teachers. The overall survey response rate was just over 70%. In the larger
project, both quantitative and qualitative data were collected from multiple sources. However,
for this dissertation study, the primary sources of data included:
Table 1. Data Sources and Descriptions

Data source: Local Education Agency Administrative Data
Description: Teachers’ growth measures; other measures of teachers’ performance, including peer evaluation and student surveys of classroom instruction; teacher background; and school and division documents relevant to the implementation.
Collection method: Provided by administrators in partnership LEAs.

Data source: NCES
Description: Necessary data from the National Center for Education Statistics Common Core of Data.
Collection method: Online source.

Data source: ITES Teacher Surveys
Description: Teachers’ attitudes towards supports for the new teacher evaluation system and their perceptions of major barriers in the adoption.
Collection method: Research team survey administration.
It should be noted that the larger ITES study is currently in the third year of data
collection and has already attracted research interest beyond the initial project. Multiple
educational researchers at various institutions have already begun projects using this data.
Moreover, publications and presentations at conferences and professional forums have already
resulted from this data. Despite the strengths of this data, it does have limitations. One limitation
is that the data was not collected from a probabilistic sample. Rather, the data was collected
from a convenience sample of partnership LEAs in Southwest Virginia. However, there are
multiple reasons why this is the data that will answer the previously presented research
questions. Most importantly, no other dataset currently includes the variables necessary to
answer the research questions in this study. There are a few reasons for this. First, this study is
focused on K-12 organizational policy implementation. Such policy studies have proven difficult
to conduct due, in part, to the fact that education policies are typically implemented at the state
level; thus, there exist issues regarding the variance in key aspects of the implementation
process. With no federally mandated policy, stipulations vary across states. Even in cases where
similar objectives are to be met from two separate statewide policies, there are enough
ambiguities in the policies or the organizational structures to lead to disparities in the
implementation across LEAs. Secondly, a nationally representative dataset would not necessarily
be better since the goal of this study is to investigate a state-level policy implementation process.
Unless a federal policy mandates a uniform teacher evaluation policy, which is highly unlikely,
there will not be a national dataset available for this particular topic. Thus, the instrument was
not developed to collect data from or investigate any policy other than the Virginia Guidelines for
Uniform Performance Standards and Evaluation Criteria for Teachers. The consequence of this
focus is that the statistical and diagnostic results will not generalize beyond the Virginia
education borders. However, this limitation in traditional generalizability should not be taken to
preclude potential implications for this study beyond Virginia borders. In fact, it is anticipated
that there will be interest in this study outside of Virginia because of the unique methodological
approach and comprehensive conceptual model. As previously mentioned, cognitive diagnostic
modeling has not previously been applied to intra-organizational support mechanisms. Although
the statistical results are not applicable outside of Virginia, the methodological approach to
assessing policy implementation may draw interest from both the measurement and educational
policy communities. It must be reiterated and emphasized that the aim of this study is to explore
the Virginia state education policy implementation. Although the convenience sample precludes
generalizability to the national level, the study is designed to make strong local impacts at the
state and district level. Hence, external validity, although important, is not as important as the
potential of the study to help promote a local culture of policy implementation assessment. Since
the population in this study is all Virginia public schools, and the sample in this study is
representative of that population in terms of locality (e.g., suburban, rural, urban), demographic
indicators (e.g., race/ethnicity, gender), and achievement (e.g., SOL), the level of external
validity is adequate.
The second most important reason why this is the best dataset for this study is that no
other datasets allow researchers to explore cognitive diagnostic modeling to analyze supports in
this way. As described in the theoretical framework, teachers’ perceptions of four separate levels
or attributes will be modeled. While some datasets include items about teachers’ perceptions of
workplace conditions, they either ask about organizational factors or leadership factors. This
dataset includes items that capture teachers’ perceptions on all four levels: policy characteristics,
teacher characteristics, leadership characteristics, and organizational characteristics. One
alternative dataset that was considered was the NCES Schools and Staffing Survey. Although
this dataset is nationally representative and collects teachers’ perceptions of working conditions,
none of the items on this survey specifically address supports for policy implementation. Rather,
the items address general working conditions. Another dataset that was considered was from the
Measures of Effective Teaching. Despite the stringent restrictions on this data, it may have
provided the most plausible alternative as it is a nationally representative randomized dataset and
was collected in the context of teacher evaluation policy implementation in multiple LEAs.
However, the teacher survey in this study asks about teacher perceptions of more general
supports for teacher evaluation and does not ask about supports that are directly related to the
Virginia state teacher evaluation policy. Moreover, the data does not provide information about
teachers’ perceptions of supports at all four levels described in the framework for this study.
Lastly, in addition to the diagnostic results about the specific policy, this study will
contribute to the growing literature on cognitive diagnostic models. Many cognitive diagnostic
models exist, but none have been applied to teachers’ perceptions of support mechanisms.
Traditionally, they are applied to K-12 skills or diagnosing medical conditions where the
presence of symptoms is a binary outcome. Thus, although generalizability is important,
demonstrating the application of these models and the potential value of the diagnostic output is
a more important component in regards to informing future studies. Even so, as previously mentioned,
this study does possess the appropriate degree of generalizability as the results will be applicable
to the target population which includes all Virginia schools. Despite the limitation, exploring
ways to support teachers in using performance information to adjust instruction is an important
activity for educational leaders and researchers to engage in. This effort can be a crucial part of
broader initiatives to build school capacity to better serve students in this performance
accountability era (Sun, Mutcheson, & Kim, 2014).
Overview of the Methodology
As previously established, the data includes insights into four levels of implementation.
The levels include the policy, teacher characteristics, leadership characteristics, and
organizational characteristics. However, items were written to capture the overlapping sections
of these levels. More specifically, the survey items were developed such that unidimensional
scores on each dimension were not practical because, in most cases, items could theoretically
be mapped to more than one dimension (see Figure 1
above). The complexity of overlapping policy implementation components was one reason why
it was hypothesized that cognitive diagnostic models would provide more measurement precision
than traditional unidimensional models. Many items that were included in the instrument could
potentially load onto more than one of the hypothesized components. Cognitive diagnostic
models provided the ability to account for this unique loading structure. Similar to the
methodological requirements in the study by Halpin and Kieffer (2015), it was essential to
identify a measurement methodology that captured the item-level diagnostic information that the
instruments were designed to provide. Cognitive diagnostic models offer this advantage. As will
be reviewed in the development of the theoretical framework, and as is clearly evident upon a
review of the items (see Appendix A), most, if not all, items can be attributed to multiple sources
of support. For example, one particular item asked teachers to rate the extent to which the
professional development they received on the policy was useful. A closer look at this item reveals
that it likely requires support from at least two main sources. First, organizational conditions
must have provided adequate resources, such as time and learning tools, for teachers to
indicate that they received adequate support on this item. However, an adequate level of support
for the usefulness of professional development could instead require expertise, judgment, and
capacity from the principal or leader responsible for providing professional development. Thus,
this item will contribute variance to at least two different sources of support. This
is one example of why cognitive diagnostic models have been frequently promoted by
psychometricians as important modeling alternatives for analyzing response data in situations
where multivariate classifications of respondents are made on the basis of multiple postulated
latent skills (Rupp & Templin, 2008).
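To make this complex loading structure concrete, the following minimal sketch shows a toy Q-matrix fragment in Python. The items and entries are purely illustrative and are not taken from the ITES instrument; the third row mirrors the professional development example above, loading on both the leadership and organization attributes.

```python
import numpy as np

# Hypothetical Q-matrix fragment: rows are items, columns are the four
# attributes from the study's framework. A 1 means the item is hypothesized
# to draw on that attribute. Item content here is invented for illustration.
attributes = ["policy", "teacher", "leadership", "organization"]
Q = np.array([
    [1, 0, 0, 0],  # item on clarity of the policy guidelines only
    [0, 1, 0, 0],  # item on teacher self-efficacy only
    [0, 0, 1, 1],  # usefulness of professional development: leadership and organization
])

# The third row encodes a complex loading: a single item contributes
# information about two distinct support mechanisms.
for item, row in enumerate(Q, start=1):
    loaded = [a for a, q in zip(attributes, row) if q == 1]
    print(f"item {item}: loads on {loaded}")
```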
In addition to descriptive statistics, the first research question was addressed by exploring
the dimensionality of the constructs relating to policy implementation. First, responses provided
by teachers in the first year of implementation of the project were explored using exploratory
factor analyses (EFA). Using the factors resulting from this process, a Q-Matrix was developed
in order to map each individual item to one or more attributes. The Q-matrix is the hypothesized
item loading structure. In order to empirically validate the Q-matrix, multiple strategies were relied upon.
A comprehensive review of prior strategies used in Q-matrix development is included in chapter
2. In chapter 3, this discussion is followed by a detailed explanation of the techniques used in the
development and validation of the Q-Matrix in this study. Following the Q-matrix validation,
teachers’ perceptions were modeled using the C-RUM model, which focuses on only the main
effects (Rupp & Templin, 2008).
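As a minimal sketch of the model just described, the C-RUM item response function can be written as a main-effects model on the logit scale; the parameter values below are invented for illustration, and the actual estimates are reported in chapter 4.

```python
import numpy as np

def crum_prob(q_row, alpha, lam0, lam):
    # C-RUM, main effects only, on the logit scale:
    #   logit P(X_j = 1 | alpha) = lam0_j + sum_k lam_jk * q_jk * alpha_k
    # q_row: 0/1 Q-matrix row; alpha: 0/1 attribute profile (1 = perceived
    # support); lam0: item intercept; lam: item main-effect slopes.
    logit = lam0 + np.sum(lam * q_row * alpha)
    return 1.0 / (1.0 + np.exp(-logit))

# Invented parameters for the two-attribute item sketched earlier.
q_row = np.array([0, 0, 1, 1])
lam0, lam = -1.5, np.array([0.0, 0.0, 1.2, 1.8])

for alpha in ([0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1]):
    print(alpha, round(crum_prob(q_row, np.array(alpha), lam0, lam), 3))
```

Because the model is compensatory, perceived support on either attribute raises the endorsement probability on its own, and support on both raises it the most.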
The data cleaning will be completed with Stata software and the analysis will be
completed using the mdltm software (von Davier, 2015). Marginal maximum likelihood estimation
is used in mdltm. After fitting this model to the data, the results will then be interpreted and
compared to the unidimensional model. Results will be evaluated using fit indices and the
standard errors of the item estimates. Using the results from the final model, teachers will be
grouped into latent classes based on the pre-conceived Q-matrix and their responses to the annual
survey distributed in the first year of implementation. The profiles will capture the probability of
each latent class in contrast to the traditional “total score” or item response theory approaches.
More generally, they provided insights into which areas of the implementation process teachers
perceive support, and where support may be lacking.
In summary, diagnostic modeling provides the means to capture the information that
the survey was actually designed to capture. Moreover, the C-RUM model accounted for the
assumed compensatory relationship between the attributes. Thus, it is anticipated that this
approach would offer useful, formative, diagnostic information about teachers’ perceptions of
the strengths and weaknesses of the implementation process.
Significance of the Study
This study may have implications for several constituencies. One group of stakeholders
that may benefit from the results includes Virginia K-12 policy makers. As previously
mentioned, in this study, multi-dimensional models are applied to teachers’ perceptions of
mechanisms for supporting teacher evaluation policy implementation. The results are anticipated
to include measurements that are more precise and useful than those offered by alternative
methodologies. The distinctions between cognitive diagnostic modeling and alternative
methodologies such as multidimensional IRT (mIRT) and confirmatory factor analysis are made
in chapter 3. When equipped with more precise diagnostic feedback, policy makers and school
leaders may be able to more confidently engage in empirical decision making, especially in
regards to targeting resources for short-term and long-term organizational goals subsumed within
the policy implementation initiative.
It follows that if policy makers are equipped with more precise feedback, they will better
understand how to support those agents involved in the actual implementation of the policy. This
includes Virginia K-12 principals and teachers. With more precise feedback, principals and
teachers can maximize the potential of the policy. And, according to Virginia’s Guidelines for
Uniform Performance Standards and Evaluation Criteria for Teachers (GUPSECT), this means
that principals and teachers will use the feedback to enhance the effectiveness of teachers.
Finally, with more effective teachers, all stakeholders can anticipate stronger, better prepared
graduates contributing to Virginia communities, culture, and global workforce. Parents of
students, students, employers, and society as a whole, all stand to benefit from more effective
teachers. The dynamics of the successful implementation of this policy are depicted in Figure 1.
When policies are determined to not be making the impact anticipated when the
investments were committed, diagnostic profiles of educational policy implementation will
inform conversations and debates about strengths and weaknesses regarding the vital
components of implementation. The patterns of associations among the components will provide
valuable information as to where efforts and resources should be targeted. For example, consider
the scenario where a large percentage of teachers indicate that they do not feel supported in
regards to characteristics of the policy guidelines. Furthermore, they indicate that they are
supported in regards to characteristics of teachers, leadership, and the organization. This would
result in a high percentage of teachers in that district with a profile of 0, 1, 1, 1. Policy-makers
and educational leaders could use this data to support the decision to focus on making the policy
clearer or more adaptable and specific. Various strategies will then be targeted to mitigate the
identified deficiency and the limited resources available will not be wasted on areas in which
teachers already feel supported. The profiles will create a broad picture that can be used to
narrow the focus and identify areas that require further inquiry.
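The following short sketch, using invented profiles, illustrates how such a tabulation points resources toward the weakest mechanism.

```python
from collections import Counter

# Hypothetical estimated profiles for a district's teachers, in the order
# policy, teacher, leadership, organization.
profiles = ["0111", "0111", "1111", "0111", "0011", "0111", "1111", "0111"]
n = len(profiles)

for profile, count in Counter(profiles).most_common():
    print(f"profile {profile}: {count / n:.0%} of teachers")

# Share of teachers perceiving support on each attribute.
for k, name in enumerate(["policy", "teacher", "leadership", "organization"]):
    pct = sum(int(p[k]) for p in profiles) / n
    print(f"{name}: {pct:.0%} perceive support")
# A low share on "policy" would direct resources toward clarifying the
# guidelines rather than toward areas where teachers already feel supported.
```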
Although the substantive results will not be statistically generalizable beyond the
population of K-12 public school stakeholders in Southwest Virginia, the research
methodological significance of the study may extend beyond Virginia to broader borders. As
previously mentioned, in educational assessment contexts, it is commonly believed that an
identification and understanding of skill components helps to identify remedial pathways toward
perceived support, on all components that are relevant and educationally meaningful to the
respondents (DiBello, Roussos, & Stout, 2007). These models have not previously been applied
to investigations into individual, team, or organizational support mechanisms, policy
implementation strategies, or organizational change management contexts. If the diagnostic
output does prove to be valuable in this context, it may justify extending the application of these
models to more generalizable follow-up studies in broader organizational contexts. Thus, this
study could ultimately lead to equally significant methodological results.
At a minimum, it is anticipated that this study contributes to promoting a culture of
inquiry surrounding policy and innovation implementation and assessment at all levels of K-12
organizations. Considering the substantial investments made into developing and promoting
these expensive innovations and the increasingly limited resources available for public education
institutions, on-going inquiry into the strengths of and barriers to the implementation process is
necessary to ensure that the policies achieve their potential and that the investments are worth
continuing. In this case, it is posited that the degree to which teachers feel supported on the new
teacher evaluation policy affects the variation in the degree to which the new evaluation system
could achieve its potential to provide more useful feedback to teachers and promote effective
teaching. By legitimizing a new approach to such inquiry, this study will ultimately have a
positive effect on promoting a culture of inquiry.
Limitations
The dataset limits the generalizability of the substantive results to the population of K-12
public schools in Southwest Virginia. The final sample included 747 teachers. This sample is not
small, but it is not overly large either. It should be noted that a high response rate (70%) was
attained. Thus, although the data was not collected from a probabilistic sample, the sample is
very representative of the target population. Moreover, there were many reasons presented as to
why this was the data that will provide the best answers to the previously presented research
questions. Most importantly, no other dataset currently includes the variables necessary to
answer the research questions in this study.
It is acknowledged that often the exact specification of the Q-matrix is unknown a priori.
Mechanisms of policy implementation support in schools are not fully understood, and thus the
exact relationships in such a complex model cannot be known for certain. For this reason, in
addition to traditional approaches to Q-matrix development, empirically based Q-matrix
discovery techniques are pursued in this study. The methods are discussed in full detail in
chapter 3. There is a need for further investigation into the development of empirical techniques
for determining the entries of the Q-matrix.
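As one illustration of such empirical validation, candidate Q-matrix specifications can be compared on information criteria after fitting; the log-likelihoods and parameter counts below are invented, and the procedures actually used are detailed in chapter 3.

```python
import numpy as np

def aic(loglik, n_params):
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n_obs):
    return -2 * loglik + n_params * np.log(n_obs)

# Hypothetical fit results for two candidate Q-matrix specifications.
candidates = {
    "theory-based Q": {"loglik": -15234.7, "n_params": 96},
    "empirically revised Q": {"loglik": -15180.2, "n_params": 104},
}
n_obs = 747  # sample size in the current study

# The specification with the lower criteria is retained; BIC penalizes the
# extra parameters more heavily at this sample size.
for name, fit in candidates.items():
    print(name,
          "AIC:", round(aic(fit["loglik"], fit["n_params"]), 1),
          "BIC:", round(bic(fit["loglik"], fit["n_params"], n_obs), 1))
```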
Further confounding this analysis was the use of data from an instrument that was not
necessarily developed for the purpose of cognitive diagnostic modeling. Fortunately, however,
the four guiding coarse-grained mechanisms used in this study were the same mechanisms used
to develop the original instrument. This preserved some sense of consistency from the prior
study to the current study. Additionally, the current study does not address within-teacher
variation because it relies on a single administration of the instrument. The organizations and the
contexts within which they are situated can be expected to evolve along with the guidelines of
the policy. Hence, the follow-up studies currently in process further inquire into the supports for
implementation and will contribute to the understanding of within-teacher variation over time.
Finally, in the current study, the data for the final specified model was dichotomized.
Although this strategy has precedent, it does present an additional limitation. With a much larger
sample, the data may have remained polytomous. However, the current sample size (n = 747) does
not support the number of parameters that would need to be estimated if the data remained
polytomous. This limitation turned out to not be significant, as a preliminary analysis showed
that there were almost no differences between the structures resulting from the polytomous data
and the dichotomous data. This is discussed in more detail in chapter 3.
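For illustration only, the sketch below collapses hypothetical 4-point Likert responses to binary indicators; the actual response scale and cut point used in the study may differ and are described in chapter 3.

```python
import pandas as pd

# Hypothetical 4-point Likert responses (1 = strongly disagree ...
# 4 = strongly agree) for two items.
likert = pd.DataFrame({
    "item1": [1, 2, 3, 4, 2],
    "item2": [4, 4, 2, 3, 1],
})

# Collapse to 0/1, coding agreement (3 or 4) as perceived support.
dichotomous = (likert >= 3).astype(int)
print(dichotomous)
```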
Organization of the Study
This study is organized around five chapters. Chapter One introduces the topic of the
study, the research questions and the significance of the study. The second chapter reviews the
literature relevant to the study. Chapter Three describes the methodology of the study, including
the sampling techniques and the procedures used to collect and analyze the data. The fourth
chapter describes the results of the study while the final chapter discusses those results and their
implications for future practice, research, theory, and policy.
CHAPTER 2
REVIEW OF THE LITERATURE
Introduction
The framework for this study extends beyond the realm of psychometrics, and borrows
from concepts rooted in economics, organizational research and management, and K-12 policy
implementation. Additionally, the purpose of this study is to examine the supports available for a
specific state policy. Thus, it is imperative to review the origins and key components of that
policy. In the first section of this chapter, the literature on teacher effects and teacher evaluation
policies is reviewed. This includes a brief review of the United States Department of Education
teacher evaluation initiative through Race to the Top and narrows down to focus on Virginia’s
Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers
(GUPSECT). GUPSECT is the particular policy being examined in this study and so the key
policy components are described in detail. A thorough review of the development of the
theoretical framework and the supporting literature for the study is described next. Finally, as
previously mentioned, the crux of this chapter includes a thorough description of the
methodological concepts and applications. The literature on cognitive diagnostic modeling is
reviewed and key definitions are provided. Model fit and parameter estimation techniques are
examined by reviewing prior studies.
Teacher Effectiveness
Teacher effects have been found to be the most significant school-related variable
impacting student learning outcomes (Stronge, 2006). In one notable study, Rockoff (2004)
obtained multiple years of data on elementary-school students and teachers and found that
raising teacher quality was a key instrument in improving student outcomes. The researcher
used panel data collected from New Jersey local education agencies and found that a
one-standard-deviation increase in the teacher fixed-effect distribution raised both reading and
math test scores by approximately 0.1 standard deviations on a nationally standardized scale.
Consistent with this finding, Nye, Konstantopoulos, and Hedges (2004) found
statistically significant teacher effects on achievement gains. They also found larger effects on
mathematics achievement than on reading achievement, and much larger teacher effect variance
in low socioeconomic status (SES) schools than in high-SES schools. This study was
especially notable because of the unique data available to researchers. More specifically,
researchers used data from a four-year experiment called the Tennessee Class Size Experiment,
in which teachers and students were randomly assigned to classes to estimate teacher effects on
student achievement.
There still exists some debate about how to define, identify, measure, and develop
effective teaching (Hallinger, Heck & Murphy, 2014). Without a common understanding of
these concepts, policy makers and practitioners have faced substantial challenges in increasing
the quality of education for all students. This has led to significant issues when policy makers
attempt to make empirical personnel decisions, such as teacher selection, tenure, and
compensation. Decisions in these areas have traditionally been addressed using data on teacher
education, experience, certification, and salary schedules (Rothstein & Mathis, 2013). However,
there has been much evidence against the use of such measures in high stakes contexts. For
example, Hanushek (1986, 1997) conducted educational production function studies and found
that the characteristics that form the basis for teacher compensation (e.g., graduate degrees and
experience) were weak predictors of a teacher’s contribution to student achievement. In a more
recent example, Goldhaber and Brewer (2000) used the National Educational Longitudinal Study
of 1988 (NELS:88) to estimate value-added models using data on 3,786 twelfth-grade public
school mathematics students and 2,524 science students. Results indicated that mathematics and science students who
had teachers with emergency credentials did no worse than students whose teachers had standard
teaching credentials.
Further evidence was provided in a similar study by Rivkin, Hanushek, and Kain (2005).
In this study, researchers used unique matched panel data from the Texas Schools Project to find
that teachers had powerful effects on reading and mathematics achievement, but that little of the
variation in teacher quality was explained by observable characteristics such as education or
experience. Taken together, these studies suggested that alternative measurement mechanisms
were necessary to increase the validity with which teachers were evaluated, and also to
effectively develop teacher capacity and make empirically guided personnel decisions.
Traditional Teacher Evaluation Systems and the Widget Effect
In addition to the traditional approach of collecting information about teacher
education, experience, and certification, school administrators have been collecting data from
observations and teacher evaluations of instruction for over a century (Hallinger, Heck &
Murphy, 2014). The potential of teacher evaluation systems to empirically inform the
development of teacher capacity has been known for some time. In one study, Johnston (1997)
interviewed sixty-three participants from various local education agencies across three states to
explore conceptions of effective teaching held across various roles within the school organization
and discussed the implications of these for teacher-evaluation policy. He found that the process
of teacher evaluation could be valuable in assessing the effectiveness of classroom teachers and
identifying areas in need of improvement. In this same study, he also found that teacher
evaluation systems could be helpful in making professional development more individualized
and improving overall instruction schoolwide. These findings supported the endeavor of
developing and implementing more effective teacher evaluation systems so that State education
agencies (SEAs) and local education agencies (LEAs) could effectively identify, reward,
develop, and retain the strongest teachers in the interest of their students.
Although there has been empirical evidence about the potential usefulness of teacher
evaluation systems, there has been limited evidence to support the notion that inferences made
from traditional teacher evaluation systems actually possess an adequate degree of reliability and
validity (Darling-Hammond et al., 2012). Without empirical evidence, one cannot contend that
traditional teacher evaluation systems would be any more effective than using data on education,
experience, and certification. In fact, substantial evidence existed to support the opposite
contention—that, in fact, there are legitimate concerns with traditional teacher evaluation
systems. In one notable study funded by The New Teacher Project, Weisberg, Sexton, Mulhern,
and Keeling (2009) surveyed over 15,000 teachers and 1,300 principals in 12 LEAs across four
states. Researchers found that almost all teachers were rated as great or at least good.
Furthermore, in schools that used a binary system where teachers were rated as either
satisfactory or unsatisfactory on various standards, 99 percent of teachers were rated as
satisfactory. Finally, they found that when schools used an evaluation scale with more than two
performance rating options (i.e., more than just satisfactory and unsatisfactory), an
overwhelming 94 percent of all teachers received one of the top two ratings. Weisberg et al.
(2009) termed this phenomenon “The Widget Effect,” noting that when schools assumed that
teachers’ effectiveness in the classroom was the same from teacher to teacher, they treated
teachers as interchangeable parts. Several consequences of this were noted. Most notably, evaluation
systems that failed to differentiate performance among teachers led to systems in which excellent
teachers could not be recognized or rewarded. Moreover, chronically low-performing teachers
languished, and the wide majority of teachers performing at moderate levels did not get the
differentiated support and development they needed to improve as professionals.
Although teacher evaluation systems had previously been shown to have potential in
growing teacher and school capacity in terms of targeting professional development and
identifying and promoting talent, validity and reliability issues have precluded such systems
from being heavily relied upon in high-stakes personnel decisions such as tenure, compensation,
and dismissal. Despite this, teacher evaluation systems’ potential as a mechanism to improve
education quality through developing teacher capacity and through recruitment and retention of
talent continued to attract interest. Increasingly, external pressures, including
the federal Race to the Top funding program, have enabled policymakers to experiment with
evaluation and accountability for individual teachers. In order for schools to receive certain
funding, the federal program, Race to the Top, required participating SEAs and LEAs to measure
and reward teachers and school leaders based on multiple measures of teaching. The results of
these experiments, some of which required substantial funding (e.g., the Gates Foundation Measures of
Effective Teaching Project), have been extensively studied over the past five years and will
continue to be explored for some time.
Teacher Evaluation Policies and the Multiple Measures of Teacher Effectiveness
In light of increased demand for greater school accountability and the limitations
documented with traditional teacher evaluation systems, education policy has gradually shifted
from holding schools accountable for policy compliance to accountability for learning outcomes
(Atkinson, 2009). Teacher effectiveness has been increasingly connected to student academic
progress as it has become generally agreed upon that measures of student learning in the
evaluation process provided the “ultimate accountability” for educating students (Tucker, &
Stronge, 2001). However, debate continued over the scope of educational outcomes, and how to
increase the degree of reliability and validity when measuring them.
This debate has extended into wide experimentation and research into the implementation
of new teacher evaluation systems. One strong voice in this debate has been the Gates
Foundation, which invested $45 million in a nationwide project called Measures of Effective
Teaching (MET). The goal of this project was to not only improve teacher evaluation, but also
use this information to make high-stakes decisions about teachers’ careers. Various researchers
from a wide-range of institutions measured teacher effectiveness in many different ways,
including student evaluations of teachers, student classroom work, evaluations of classroom
practice using commonly used rubrics, and student and parent surveys (e.g., Kane, McCaffrey,
Miller, & Staiger, 2013). In one report from this study, Kane et al. (2013) found that a composite
measure of effectiveness could identify teachers who produced higher achievement among their
students. They also found that the actual impacts on student achievement were approximately
equal on average to what the existing measures of effectiveness had predicted. These findings
were especially notable because the impacts were causal, having been estimated with randomly
assigned groups.
In another key finding from the MET project, Mihaly (2013) explored how indicators
(e.g., student evaluations, principal observations, value-added scores) could be combined to
improve inferences about a teacher’s impact on student achievement and about teaching.
Researchers estimated the parameters of an optimal combined measure of teacher effectiveness
and found that for a typical teacher, one year of data on value-added for state tests is highly
correlated with a teacher’s stable impact on student achievement gains on state tests.
Preliminary investigations appear to support the use of multiple measures to evaluate
teachers. Teacher evaluation systems that use multiple measures of teacher performance in
public schools have been shown to have the potential to produce useful information for teachers. For
example, Taylor and Tyler (2012) studied a sample of midcareer elementary and middle school
teachers in the Cincinnati Public Schools, all of whom were evaluated in a yearlong program
based largely on classroom observation between the 2003-04 and 2009-10 school years. Their
analyses showed that teachers were more effective at raising student achievement during the
school year when they were being evaluated than they were previously, and even more effective
in the years after evaluation, particularly in mathematics. Other studies, beyond the MET project,
have also explored the relationships between the multiple measures of teaching. For example,
Harris and Sass (2009) used data from a midsize Florida LEA and found that teacher value-
added and principals’ subjective ratings were positively correlated. They also found that
principals’ evaluations were better predictors of a teacher’s value added than traditional
approaches to teacher compensation focused on experience and formal education. Perhaps most
importantly, they found that in settings where schools were judged on student test scores,
teachers’ ability to raise those scores was important to principals, as reflected in their subjective
teacher ratings.
However, not all studies support the use of the increasingly common teacher evaluation
systems based on multiple measures. In contrast to some of the previously discussed studies,
Hallinger, Heck, and Murphy (2013) examined the new generation of teacher evaluation systems
along three lines of analysis: evidence on the magnitude, consistency, and stability of teacher
effects on student learning; evidence on the impact of teacher evaluation on growth in student
learning; and literature from the sociology of organizations on how schools function. One key
finding in this study was that the policy logic supporting the teacher evaluation reform was
considerably stronger than the empirical evidence supporting the actual reform. This contrasting
finding suggests that more exploration into these systems may be necessary.
In this section, background on teacher effectiveness and teacher evaluation systems was
outlined. It is clear that, increasingly, education policy makers and researchers have turned
towards experimentation with new models of teacher performance evaluation focused on
multiple measures. In the next section, Virginia’s efforts towards a more reliable and valid
teacher evaluation system based on multiple measures are described. Following a review of the
specific policy being examined in this study called GUPSECT, literature on organizational
policy implementation is explored and narrowed down to K-12 teacher evaluation contexts.
The Policy: Virginia Guidelines for Uniform Performance Standards and Evaluation
Criteria for Teachers
In September 2011, in order to move forward with reforms to increase the quality of
instruction for all students, the United States Department of Education (USDOE) invited all State
Educational Agencies (SEAs) and local education agencies (LEAs) to request flexibility regarding
specific requirements of the Elementary and Secondary Education Act (ESEA) of 1965.
Designing and implementing teacher performance-based evaluation had been the main focus of
the efforts to implement ESEA flexibility (USDOE, Jan. 2013). This flexibility was granted
pursuant to section 9401 of ESEA, which allowed the Secretary of
Education, Arne Duncan, to waive statutory or regulatory requirements of the ESEA for an SEA
that receives funds under a program authorized by the ESEA and requests a waiver. Funds also
came from the American Recovery and Reinvestment Act (ARRA). Since this invitation, state
and local education agencies have invested substantial resources, including both human and
financial capital, towards this nationwide initiative that re-conceptualizes teacher evaluation.
Initially, the USDOE granted waivers to 34 states and the District of Columbia, including
Virginia. The current study examines the implementation of the policy that was used in partial
fulfillment of the Virginia SEA’s successful request for ESEA flexibility. In Virginia, GUPSECT
became effective on July 1, 2012. The 2012-13 school-year was the first year of the state-wide
pilot and implementation. Based on the official GUPSECT document (2011), the primary
purposes of a quality teacher evaluation system were to:
- contribute to the successful achievement of the goals and objectives defined in the school division’s educational plan;
- improve the quality of instruction by ensuring accountability for classroom performance and teacher effectiveness;
- implement a performance evaluation system that promotes a positive working environment and continuous communication between the teacher and the evaluator that promotes continuous professional growth and improved student outcomes;
- promote self-growth, instructional effectiveness, and improvement of overall professional performance; and, ultimately
- optimize student learning and growth (p. 14).
With these principles guiding the policy making process, the formal signing of this policy
represented a significant overhaul of conventional evaluation criteria. Most notably, GUPSECT
set forth seven performance standards for all Virginia teachers. Pursuant to state law, teacher
evaluations were required to be consistent with the performance standards included in the official
GUPSECT document (2011). It was recommended that each teacher receive a summative
evaluation rating, and that the rating be determined by weighting the first six standards equally at
10 percent each, and that the seventh standard, student academic progress, account for 40 percent
of the summative evaluation (see Table 2).
In addition to establishing the uniform performance standards for teachers, the official
document also made recommendations on how teacher performance could be effectively
documented and used to provide comprehensive and accurate feedback on teacher performance.
Specifically, GUPSECT recommended that evaluators use five data sources for the evaluation.
The five data sources included formal observations, informal observations, student surveys,
portfolios/document logs, and self-evaluation.
Table 2. Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers

Performance Standard 1: Professional Knowledge. The teacher demonstrates an understanding of the curriculum, subject content, and the developmental needs of students by providing relevant learning experiences.
Performance Standard 2: Instructional Planning. The teacher plans using the Virginia Standards of Learning, the school’s curriculum, effective strategies, resources, and data to meet the needs of all students.
Performance Standard 3: Instructional Delivery. The teacher effectively engages students in learning by using a variety of instructional strategies in order to meet individual learning needs.
Performance Standard 4: Assessment of and for Student Learning. The teacher systematically gathers, analyzes, and uses all relevant data to measure student academic progress, guide instructional content and delivery methods, and provide timely feedback to both students and parents throughout the school year.
Performance Standard 5: Learning Environment. The teacher uses resources, routines, and procedures to provide a respectful, positive, safe, student-centered environment that is conducive to learning.
Performance Standard 6: Professionalism. The teacher maintains a commitment to professional ethics, communicates effectively, and takes responsibility for and participates in professional growth that results in enhanced student learning.
Performance Standard 7: Student Academic Progress. The work of the teacher results in acceptable, measurable, and appropriate student academic progress.
*Adapted from the Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers
Since the use of student learning measures in teacher evaluation was new for both
teachers and principals, perhaps the most difficult component of the policy in terms of gaining
teachers’ buy-in was Performance Standard 7: Student Academic Progress. There are three
potential explanations for the lack of teacher buy-in on this standard. First, traditional evaluation
systems have not included student progress as a measure of teacher performance. Thus, as with
any innovation, especially one implemented in contexts as sensitive and important as job
performance, the novelty factor presents an initial barrier. Secondly, and perhaps most
prominently, the heavy weighting of this new standard marks a cultural shift in how “effective
teaching” is defined. Since student academic progress is weighted at four times any other
standard, the shift focuses on the idea that what students learn from a teacher is as important as
how or what teachers teach. This type of cultural shift understandably takes time to gain buy-in
from all stakeholders, especially those most directly affected by the change. A third factor that may help
explain why teacher buy-in faces difficulty is that although teacher quality is a significant
predictor of student achievement, many other factors are also significant. One example is the
socioeconomic status of students.
As previously mentioned, the guidelines recommended that student academic progress
account for 40 percent of a teacher’s summative evaluation. However, LEAs were also
granted some flexibility under this standard. For example, at least 20 percent of the teacher
evaluation (half of the student academic progress measure) was to consist of student growth
percentiles as provided by the Virginia Department of Education when the data were available
and could be used appropriately. Moreover, another 20 percent of the teacher evaluation (half of
the student academic progress measure) was recommended to be measured using one or more
alternative measures with evidence that the alternative measure is valid. Thus, in choosing
measures of student academic progress, schools and LEAs were encouraged to consider
individual teacher and schoolwide goals, and align performance measures to the goals. In
anticipation of resistance to the use of standardized testing in the evaluation, policymakers
highlighted that fewer than 30 percent of teachers in Virginia’s public schools would have a direct
measure of student academic progress available based on the State assessment results (Standards
of Learning).
The main components of GUPSECT have been highlighted in this section. However, it
should be evident that although some important components, such as the weighting of the
standards, were mandated at the state level, other components were flexible in terms of how they
were implemented by LEAs. One result of providing wider local flexibility in interpretation
could be a larger degree of variance in the fidelity of policy implementation. This study explores
that variance using teachers’ perceptions. In the next section, the development of the guiding
theoretical framework is described.
Theoretical Framework: Policy Implementation in K-12 Organizations
Preliminary Assumptions about the Policy Implementation Process
The development of the theoretical framework for this study was based on key
underlying assumptions. First, the extent to which a sample of Virginia’s teachers perceived that
they were supported on the new teacher evaluation system was assumed to have implications for
the effectiveness of the implementation of the policy. Thus, environments in which teachers
indicated receiving more support were environments in which the policy was being implemented
more effectively. Secondly, the variation in successful implementation was assumed to affect the
variation in the degree to which the new evaluation system could achieve its potential to provide
more useful feedback to teachers and promote effective teaching. Taken together, these first two
assumptions indicated that when teachers perceived more support in implementing the policy,
the implementation was more effective, which in turn leads to increased teacher effectiveness and
The final piece to these assumptions, based on empirical studies described in the previous
section, was that increased teacher effectiveness leads to increased student achievement which
should be the ultimate goal of any educational organization.
Relying on the aforementioned studies and assumptions, the mechanisms for supporting
the policy implementation were initially investigated in four components. These assumptions
and the remainder of this section were summarized in figure 1 on page 6 and are included again
below for convenience. As is evident in figure 1, many of the supports investigated in this study
may fall under multiple categories. How a teacher perceives support in one component
may influence the perceived support in another. This idea will be discussed in detail in
order to provide a more accurate depiction of the development of the framework.
The first component of support mechanisms is the teachers’ perceived support on policy
guidelines. This includes, among other supports, how clear the policy guidelines are to teachers,
how specific and adaptable they are, and how the policy lends itself to being communicated and
monitored. Secondly, teachers’ perceptions of support mechanisms at the teacher level are
explored. This includes teacher self-efficacy, expertise, and capacity to change. It also includes
the important support gained from situated contexts like close colleagues. Next, school
leadership supports are investigated. These include leadership expertise and advocacy for the
policy. Moreover, it includes teachers’ perceptions of whether leadership values align with the
policy values. Finally, supports from organizational conditions, such as professional development,
collaboration, resources, and locus of decision making, are explored. These areas of the
conceptual framework are discussed more thoroughly in chapter three. The assumptions,
theoretical components, and the overlapping relationships are visually represented by the guiding
model presented in Figure 4 above. The next subsections describe the theories that
informed the development of the framework for this study.
Theories of Utility Maximization and Policy Implementation
Utility maximization theory is an economics concept that directly relates to the evaluation
of teacher quality as teachers operate as individuals in their social settings (Akerlof & Kranton,
2005). Specifically, the utility maximization problem refers to individuals attempting to attain
the greatest value possible from expenditure of least amount of resources. In a consumer market,
this unfolds as individuals operating to maximize the total value derived from some currency—
usually money. In this study of a school organization context, teachers operate to maximize their
utility within the school organization. In teacher evaluation contexts, teachers’ utility is derived
from their value within the organization which, as discussed in previous sections, is increasingly
being derived from the value they add to student learning through expenditures of time,
effort, and other available learning resources and supports. Thus, in order to increase utility,
teachers should capitalize on the available supports in implementing this policy. This concept
can also be applied at a broader level, as individual schools and districts operate to increase the
capacity of their teacher workforce within time and budget constraints. The optimal decision in
any given situation maximizes the average utility over all possible outcomes of a decision
(Akerlof & Kranton, 2005).
Akerlof and Kranton (2005) proposed that workers’ self-image as jobholders, coupled
with their ideal as to how their job should be done, can be a major work incentive. They showed
how identities can flatten reward schedules, as they solve the “principal-agent” problem. The
principal-agent problem is based on the assertion that the principal (e.g., the employer) cannot
observe the true efforts of agents (e.g., employees); rather, only agents themselves know their
true effort and performance (Sun, Mutcheson, & Kim, 2015). To address this asymmetric locus
of information, the employer can align the incentives for employees with the organization’s
goals. For example, aligning measures of teachers’ performance closely with students’ learning
is expected to motivate teachers’ efforts and in turn contribute to schools’ organizational values
(Sun, Mutcheson & Kim, 2015).
Social Capital and Policy Implementation
As individuals become more valuable in an organization, they increase their social capital
(Rogers, 2003). Social capital is the expected collective or economic benefits derived from the
preferential treatment and cooperation between individuals and groups (Putnam, 2000). One
theory directly related to the importance of social capital in policy implementation contexts is
Rogers’ (2003) diffusion of innovations. Diffusion of innovations theory seeks to explain how
innovations are communicated through certain channels over time among
the participants in a social system or culture. Under this theory, school systems are organizations,
and as such, the rate at which policies spread is heavily dependent on social capital, the
organizational communication channels, time, and the social system (Rogers, 2003).
In one study, Frank, Zhao, and Borman (2004) applied diffusion of innovation theory to
educational contexts when they characterized informal access to expertise and responses to social
pressure as manifestations of social capital. They used longitudinal and network data in a study
of the implementation of computer technology in six schools and found that the effects of
perceived social pressure and access to expertise through help and talk were at least as important
as the effects of traditional constructs. This suggested that teachers were better able to gain
access to each other’s expertise informally and were more likely to respond to social pressure to
implement an innovation, regardless of their own perceptions of the value of the innovation.
In another study investigating social capital in schools, Youngs, Frank, Thum, and Low
(2012) explained the effects of mentoring and induction activities on new teachers’ commitment,
instructional quality, and effectiveness. They found that “…when beginning teachers’ beliefs and
practices were aligned with those of their mentors, subgroups, and other colleagues, they may
feel little professional tension and may be able to promote student learning and respond to
others’ expectations by exerting effort on a single dimension” (p. 22). Alternatively, when
novice teachers were not aligned with their mentors, subgroups, or administrators, they required
“…exerted effort simply to meet others’ expectations and they may experience significant
professional tensions” (p. 22).
Taken together, these studies provide evidence of the importance of social capital in the
diffusion of innovations in organizations. Social capital can be used to positively influence
teachers. However, it can also negatively impact teachers’ commitment to the implementation
process which could potentially attenuate or even nullify the intended effects of the policy. This
is important to the current study for many reasons. Most obviously, it suggests that when
measuring teachers’ perceptions of policy implementation supports, the influence teachers have
on colleagues’ perceptions is important to consider. This can be accomplished in many ways, one
of which is to include measures asking teachers how their perceptions align with colleagues’
perceptions. That is the strategy taken for the purposes of this study.
Frameworks for K-12 Policy Implementation
There are multiple frameworks for policy implementation that influenced the design of
this study. With their objective of advancing the vocabulary of implementation science, Proctor
et al. (2012) extended the theory of diffusion of innovations and put forth a taxonomy of
implementation outcomes to help organize the key variables and frame research questions
required to advance implementation science. The taxonomy consisted of eight conceptually
distinct implementation outcomes including acceptability, adoption, appropriateness, feasibility,
fidelity, implementation cost, penetration, and sustainability. Researchers defined
implementation outcomes as “…the effects of deliberate and purposive actions to implement new
treatments, practices, and services” (p. 65). Moreover, they identified three important functions of
implementation outcomes. One function was to serve as indicators of the implementation
success. Second, implementation outcomes serve as proximal indicators of implementation
processes. Finally, they were recognized as key intermediate outcomes (Rosen & Proctor, 1981)
in relation to a service system such as school organizations.
Similar to the previous study, Century, Cassata, Rudnick, and Freeman (2012) put forth a
framework intending to move toward common language and shared conceptual understanding.
They focused on two aspects of “implementation”: innovation implementation and the
implementation process. They provided definitions of these aspects that have been relied upon in
the development of the current study. First, innovation implementation refers to the status of the
innovation or extent to which the innovation itself is enacted, in whole or part. This is commonly
referred to as “implementation fidelity.” Secondly, the implementation process includes
innovation implementation as well as all of the contextual factors that contribute to and/or inhibit
the innovation implementation. Most relevant and useful for the current study, the researchers
describe the mechanisms that influence policy implementation in different grain-sizes. For
example, in their framework, mechanisms fall under various categories including: characteristics
of the innovation, characteristics of the individual users, characteristics of the leadership,
characteristics of the organization, and elements of the environment. Each of these grains is then
broken down into smaller grain sizes and easily measurable elements of policy implementation.
As is evident from chapter one, this framework was instrumental in the development of the
hypotheses, design, and framework for the current study.
In an alternative approach more targeted towards understanding policy implementation
science in K-12 contexts, Spillane, Reiser, and Reimer (2002) developed a cognitive framework
to characterize sense-making in the implementation process. This framework was especially
relevant for recent education policy initiatives, such as standards-based reforms that press for
tremendous changes in classroom instruction, such as teacher evaluation. According to Spillane
et al. (2002), “…a key dimension of the implementation process is whether, and in what ways,
implementing agents come to understand their practice, potentially changing their beliefs and
attitudes in the process” (p. 387). They argued that one plausible explanation for the evolution of
policies during implementation is the process of human sense-making. In this approach, they
highlighted the importance of unpacking how and why policy evolves as it does. They noted that
“…this strategy is likely to generate important insights into the implementation process, insights
that can inform the design of state and national standards as well as other education policies.”
In summary, multiple theories and frameworks were used to frame the contextual
elements of statewide K-12 policy implementation. As was evident in the previous discussions,
the measurement of the processes was complex. The complexity of overlapping policy
implementation components is one reason why it was hypothesized that cognitive diagnostic
models would provide more measurement precision than traditional unidimensional models. In
the next section, a conceptual review of cognitive diagnostic models is provided in order to
justify the need for this proof of concept endeavor. Some important studies are reviewed to
provide insight into past applications of cognitive diagnostic models and to highlight the potential
advantages of this methodology.
Review of Cognitive Diagnostic Modeling Concepts
In order to determine whether cognitive diagnostic models will provide more precise
measurements of teachers’ perceptions than traditional measurement frameworks, a review of
the concepts and applications of cognitive diagnostic models is necessary. First, however, it is
necessary to review alternative and closely related psychometric approaches in order to detail the
precise statistical properties of cognitive diagnostic modeling that make it advantageous for the
purposes of this study. Several alternative psychometric approaches are described. In comparing
and contrasting these alternative approaches with cognitive diagnostic modeling, it becomes
clear why the final approach is selected, especially in light of the goals of the current study.
Following this comparison, the concepts and applications of cognitive diagnostic modeling are
reviewed.
Measurement Theory
The most common measurement framework is informed by classical test theory (CTT).
In CTT, each respondent has a true score, defined as the expected score if the respondent were to
respond over an infinite number of independent trials. Since a respondent can never actually do
this, the observed score is assumed to equal the true score plus some magnitude of error. Hence,
the formula:
X = T + E; (1)
where X represents the observed score of a respondent, T represents the true score of a
respondent, and E represents the measurement error. The reliability of the observed test
scores can be shown to be equal to the proportion of the variance in the test scores that we could
explain if we knew the true scores, and is given by:
ρ²_XT = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E); (2)
Looking at the formulas, it is clear that the assumptions of this measurement approach are
unrealistic for the purposes of this study. In this study, the overarching hypothesis is that policy
implementation support is actually multidimensional. In this sense, several related constructs are
actually contributing to the teachers’ overall perceived support. However, classical test theory is
a unidimensional measurement approach in that all measurements are taken on a linear scale for
one single construct. In CTT, examinee characteristics and test characteristics can only be
interpreted in the context of the other and the standard error of measurement is assumed to be the
same for all examinees. Hence, classical test theory is not practical for this study: it is a
test-oriented approach, so it is not particularly useful in terms of predicting respondent item
performance, which is exactly what is investigated in this study.
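The decomposition in equations (1) and (2) can be checked with a small simulation; the sketch below uses arbitrary variance values to show that reliability equals the ratio of true-score variance to observed-score variance.

```python
import numpy as np

# Minimal simulation of the CTT decomposition X = T + E and the
# reliability in equation (2). Variance values are illustrative.
rng = np.random.default_rng(1)
T = rng.normal(50, 8, size=100_000)   # true scores, var(T) = 64
E = rng.normal(0, 4, size=100_000)    # errors,      var(E) = 16
X = T + E                             # observed scores

# rho^2_XT = var(T) / (var(T) + var(E)) = 64 / 80 = 0.8
print(T.var() / X.var())              # approximately 0.8
```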
A more advantageous methodology considered for this study is latent class modeling.
Latent class models are a type of mixture model in which the data are categorical and item responses
are independent given class (Lazarsfeld & Henry, 1968). The analysis involves latent variables
that are indicated by measured items. In contrast to more common measurement models, such as
factor analysis and item response theory, the latent variables in the analysis are categorical rather
than continuous. This aligns with the objectives of the current study because the categorical
variables contain diagnostic properties (Embretson & Yang, 2013). The first use of latent class
models in educational measurement was by Macready and Dayton (1977). Most of the early
psychometric applications were centered on identifying participants who cluster together based
on item scores. Cognitive diagnosis models are very similar to latent class models, except with a
set of equality constraints placed on class probabilities (Templin, 2008). Moreover, in cognitive
diagnostic modeling, the classes, or skill patterns, are specified and defined a priori, whereas in
latent class analysis, the classes are not known prior to the analysis (Halpin and Kieffer, 2015).
Thus, latent class analysis is more of an exploratory procedure for understanding data. In this
study, the literature is used to define the four specific attributes of interest, thus making a more
confirmatory procedure more appropriate for the diagnostic modeling component.
Factor analysis (FA) is another commonly used measurement framework. Exploratory factor analysis (EFA) can be
used to analyze the associations between observed item responses using latent variables, such as
the intra-organizational mechanisms being investigated in this study. EFA is often used to
uncover the underlying structure of the data. Similar to latent class analysis, it is exploratory in
nature. Thus, EFA is used when no a priori hypothesis about factors or patterns of measured
variables exists. All observed variables are allowed to freely load on all latent variables. In the
current study, the coarse-grained mechanisms are defined a priori via literature review. However,
due to a lack of empirical understanding of the finer-grained mechanisms for supporting policy
implementation, an EFA is used in the preliminary analysis. This procedure is discussed in more
detail in chapters 3 and 4.
Also within the FA framework, confirmatory factor analysis (CFA) differs from EFA in
that this procedure requires that the loading structures be specified a priori. The development of
these loadings is typically theory-based. Most commonly, in both EFA and CFA a simple
loading structure is the goal. A simple loading structure occurs when each item loads on only one
latent variable, resulting in an analysis of between-item variance, but it does not always occur. Although
the CFA framework is capable of accommodating a complex structure where each item can load
to multiple latent constructs, the a priori specification of a complex loading structure is not as
common in the literature as a simple structure. This is because items are usually written to
measure one latent construct, as opposed to multiple latent constructs. Complex loading
structures allow one to analyze within item-variance and are more commonly used in
multidimensional IRT models and cognitive diagnostic models. Additionally, in EFA and CFA
estimation routines typically utilize summary statistics of the data such as means, variances,
covariances, and correlations (Rupp, Templin, & Henson, 2008). This is referred to as
limited-information (or partial-information) estimation. These statistics are assumed to contain all relevant information about the unknown
parameters of interest in these models.
An alternative framework for modeling latent constructs is item response theory (IRT).
Typically, in IRT and cognitive diagnostic modeling frameworks, full-information statistics are
used for estimation (Rupp, Templin, and Henson, 2008). The statistical properties subsumed
within an IRT framework provide advantages similar to those of cognitive
diagnostic modeling. The IRT model can be used to account for important parameters such as
item difficulties and discriminations. However, according to Ravand and Robitzsch, (2015),
“…conventional IRT models locate test takers on a broadly defined single latent variable,
whereas diagnostic models provide information about the perceived support status of test takers
of a set of interrelated separable attributes” (p.1). Such conventional models are unidimensional,
and they typically have simple loading structures in that they relate the observed response
variables to a single latent variable. Thus, conventional, unidimensional IRT models are not
useful for this study.
Increasingly, multidimensional IRT (mIRT) models are used to calibrate multiple
dimensions and to estimate correlations among the latent variables. Many of these models are quite
similar to those found in the family of cognitive diagnostic models. For example, a priori
specifications of the relationships between observed and latent variables are made in both
approaches. In cognitive diagnostic modeling, this is referred to as a Q-matrix. Most Q-matrices
in cognitive diagnostic models specify complex loading structures that accommodate within item
multidimensionality where each item can load onto multiple dimensions. In other words, items
can be attributed to multiple sources of supports. For example, one particular item asked teachers
to rate the extent to which the professional development they received on the policy was useful. A
closer look at this item reveals that it likely requires support from at least two main sources.
First, organizational conditions must have provided for adequate resources, such as time and
learning tools for teachers to indicate that they received adequate support on this item. However,
an adequate level of support for the usefulness of professional development could also require
expertise, judgment, and capacity from the principal or leader responsible for providing
professional development. Thus, this item will contribute variance to at least two different
sources of support variables. This is one example of why diagnostic models have been frequently
promoted by psychometricians as important modeling alternatives for analyzing response data in
situations where multivariate classifications of respondents are made on the basis of multiple
postulated latent skills (Rupp and Templin, 2008). Although mIRT can accommodate such
within item multidimensionality, a simple loading structure focused on between-item
multidimensionality is much more common.
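To make the complex loading idea concrete, the hypothetical Q-matrix fragment below assigns the professional development item discussed above to both the leadership and organization attributes. The item assignments are purely illustrative and do not reproduce the study’s actual Q-matrix.

```python
import numpy as np

# Columns are the four coarse-grained attributes used in this study;
# rows and item assignments are hypothetical examples.
attributes = ["policy", "teachers", "leadership", "organization"]
Q = np.array([
    [1, 0, 0, 0],   # item 1: clarity of the policy guidelines
    [0, 1, 0, 0],   # item 2: teacher self-efficacy
    [0, 0, 1, 1],   # item 3: usefulness of professional development
                    #         (complex loading: two attributes)
])
for j, row in enumerate(Q, start=1):
    loaded = [a for a, q in zip(attributes, row) if q == 1]
    print(f"item {j} measures: {loaded}")
```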
Perhaps the most important key distinction between mIRT and cognitive diagnostic
models is that mIRT uses continuous latent variables which allows for multiple real number
scales and norm-referenced interpretations. Conversely, the latent variables in cognitive
diagnostic models are categorical, so they support the type of criterion-referenced
interpretations sought in this study. For example, in this study, teachers’
perceptions will be used to classify them into latent classes in order to inform training and
development on policy implementation. Teachers will be classified as proficient or not proficient
on each skill pertinent to the policy implementation process. The key advantage to cognitive diagnostic
modeling is that the cut scores will be set to maximize the reliable separation of respondents
(Templin, 2009). In comparison, to make the same proficient/non-proficient determination from
the continuous outcome variable in mIRT, one would require a second, external phase (e.g.
standard setting). As previously mentioned, in this study, the ability to make standards-based
(proficient/non-proficient) classifications is highly desirable because of the proposed
application of this methodology, which is discussed in previous sections.
Templin and Henson (2009) identified further advantages of cognitive diagnostic models.
First, they noted that these models hold great potential because of the promise of providing more detailed
information related to the defined attributes. Secondly, they provide a tool that can aid in the
development of tailored action plans which could save leaders and teachers time. Finally, most
current studies do not provide a typical low stakes situation, but instead represent nearly ideal
situations with large sample sizes that provide for the demonstration of a new methodology.
Thus, there is a need for studies using cognitive diagnostic modeling in real-life situations.
In another paper, Gorin (2009) provided further support by highlighting additional
advantages of cognitive diagnostic models. First, their numerically derived cut-scores allow for
criterion-referenced score interpretations in that they generate multidimensional score estimates in
terms of diagnostic classifications regarding student mastery or non-mastery of measured skills.
Secondly, cognitive diagnostic models hold diagnostic power based on multidimensional data.
Third, given current legislative demands on educational assessment for curricular design and
educational accountability, the use of criterion-referenced score interpretations has
overshadowed the historical importance of normative score interpretations.
It is anticipated that the approach taken in this study will provide useful, formative,
diagnostic information about teachers’ perceptions of the strengths and weaknesses of the
implementation process. The result will be diagnostic output that provides detailed empirical
information about teachers’ perceptions of policy support that are involved in the response
processes and the manner in which these components interact were obtained (diBello, Roussos,
& Stout, 2007). In an educational assessment context, it is commonly believed that an
identification of these perceptions, sometimes referred to as “mental components,” may help to
identify remedial pathways toward mastery on all components that are relevant and educationally
meaningful to the respondents (DiBello, Roussos, & Stout, 2007).
Cognitive Diagnostic Modeling Theory and Applications
The statistical purpose of cognitive diagnostic modeling is to “…develop a
multivariate profile of respondents’ traits that is based on classifying them according to their
degree of mastery on each of the traits” (Rupp & Templin, 2008, p. 226). Substantively, they
provide detailed diagnostic profiles that promote assessment for learning through modification in
target areas (Jang, 2009). Cognitive diagnostic models use categorical latent variables that are
indicated by observed measured items. Instead of a continuous ability estimate, a cognitive
diagnostic model will estimate the probability that a respondent has mastered each attribute
(Rupp & Templin, 2008). If that probability is greater than 0.5, the respondent will be classified
as having mastered the attribute. For each respondent, a profile will result in which mastery/non-
mastery on each attribute is estimated. Attributes are the skills one is interested in measuring.
They can also be content knowledge, cognitive skills, mental processes, or, as in this case,
perceptions of support for policy implementation. Typically, cognitive diagnostic analyses
produce output for respondents as a profile on the attributes. The attributes usually have two
levels (mastery/non-mastery), or more than two levels as specified in the model. For the purposes
of this study, two categories are used; however, those categories are more accurately described as
“perceived support” and “lack of perceived support” as opposed to mastery/non-mastery.
Limited research has been done on models with more than two levels for an attribute. As will be
discussed in greater detail in chapter 3, attributes can also come in different grain sizes. Grain
sizes are the level of specificity of attributes. A finer grain, for example, could be adding and
subtracting, whereas a coarser grain may be whole number operations. Thus, a grain-size refers
to the scope or level of specificity.
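The 0.5 classification rule described above can be sketched in a few lines; the posterior probabilities below are fabricated for illustration only.

```python
import numpy as np

# Illustrative posterior probabilities of perceived support for three
# respondents on the four attributes (values are made up).
posterior = np.array([
    [0.92, 0.61, 0.38, 0.75],
    [0.12, 0.55, 0.49, 0.90],
    [0.80, 0.20, 0.95, 0.51],
])
# 1 = perceived support, 0 = lack of perceived support
profile = (posterior > 0.5).astype(int)
print(profile)
```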
There are three traditional categories of applications for cognitive diagnostic models
(Rupp, Templin, & Henson, 2010). First, and perhaps most common, cognitive diagnostic
models are applied to diagnostic assessment in education. Researchers commonly use these
models to measure student achievement in various areas, in order to help teachers identify what
attributes specific students need the most help with. Thus, in these applications, diagnostic
models have a strict formative educational assessment purpose (Roussos, DiBello, & Stout
2007). One example of this category of application came from a study by de la Torre and
Douglas (2004). In this study, researchers identified eight key cognitive attributes that underlie
performance in fraction subtraction. Thus, much time and effort was subsequently and
necessarily invested in the assessment blueprint and the Q-matrix.
A second, recently established, category of cognitive diagnostic modeling applications is
for the clinical diagnosis of psychological and neurological disorders (Rupp, Templin, & Henson, 2010).
Using cognitive diagnostic models, researchers sought to develop a diagnostic assessment that
could be used to screen respondents for a predisposition to be pathological gamblers (Templin &
Henson, 2006). Researchers found that “…the use of a cognitive diagnosis model, the DINO
(see p. 48), allowed for an instrument that was created to investigate the structure of underlying
personality factors in pathological gambling to provide diagnostic information for each criterion”
(p. 301). This is a fairly new category of application, so relatively few studies of this kind exist.
Finally, the most relevant category of applications to the current study is when cognitive
diagnostic models are applied to standards-based assessments in education. Accountability
efforts increasingly require examining the proportion of students performing at “proficient” or
“advanced” levels. In one example, Poggio, Yang, Irwin, Glasnapp, and Poggio (2007) judged
student proficiency on the basis of comparing the raw score a student receives on an assessment
with a set of cut-scores that were set with standard-setting procedures to classify students.
Cognitive diagnostic models provide a clear advantage in this scenario because they provide
more accurate and informative mastery profiles for respondents at a fine diagnostic grain size
(Rupp, Templin, & Henson, 2010). Moreover, according to Leighton and Gierl (2007), more
assessments at a finer grain size are needed to obtain reliable information about broader domains.
Similar to the assessments in the previously discussed study, teacher evaluation systems
based on multiple measures are another example of innovations resulting from the multiple
standards-based measures movement. Instead of students being rated as “proficient” or not,
teachers receive one of the possible ratings on multiple standards. In one recent study that was
relied upon to inform the current study, Halpin and Kieffer (2015) used a secondary analysis of
data from the Measures of Effective Teaching study to outline the application of using latent
classes to learn about classroom observational instruments. They studied the diagnostic
information about teachers’ instructional strengths and weaknesses, along with estimates of
measurement error for individual teachers. Researchers described the advantages of providing
empirically derived profiles of instruction that describe what real teachers are doing in their
classrooms. The diagnostics provided an estimate of the measurement error associated with each
teacher’s profile membership and thereby addressed a major shortcoming of the current practice
of using total scores on multiple measures of effective teaching (Halpin & Kieffer, 2015). An
additional important note about this study is that researchers used latent class analysis which
was previously discussed. One limitation of this study was that they did not address how profile
memberships are related to teachers’ classroom and school contextual factors, such as the
effectiveness of the policy implementation.
In summary, there are relatively few applications of cognitive diagnostic models in the
literature. Three general categories of applications exist. However, it should be noted that
cognitive diagnostic models can be applied whenever statistically-driven classifications of
respondents according to multiple latent traits are sought (Rupp & Templin, 2009). The current
study provides those requisite contextual conditions. It will be the first study to produce
empirically based latent profiles of policy implementation support for teachers. The goal will be to
use these profiles to identify where teachers perceive they are being supported and how this
information can be used to provide feedback to practitioners.
Model Specification: Compensatory vs. Non-Compensatory
Rupp and Templin (2008) identified three defining characteristics for understanding
model taxonomy. First, the scoring rules of the observed response variables can be dichotomous
or polytomous. Dichotomous response variables provide two choices for the participant whereas
polytomous response variables provide more than two. Secondly, the measurement scales of the
attributes they measure can also be either dichotomous or polytomous. Finally, an important
decision regarding model specification is whether the required skills for a specific task interact in
a compensatory manner. The diagnostic model specifies the probability of a positive item
response in terms of examinee skills and item parameters. Many models with varying
simplifying assumptions are available depending on the specified purpose. The ways in which
the attributes are combined can be compensatory or non-compensatory. In
compensatory models, a positive response on any of the attributes measured by an item can
compensate for a negative response on other attributes. Thus, the assumption exists that a deficit
in one attribute can be offset or be compensated by strength in another attribute. This means that
each support attribute that is measured by an item increases the probability of a correct or
positive response on that item. Conversely, in non-compensatory models lack of perceived
support of one attribute cannot be completely compensated by other attributes in terms of the
probability of positive response in the item performance.
In the current study, a compensatory relationship is hypothesized. Hence, if a particular
item is identified under two attributes in the Q-matrix, this observed item is measuring both
attributes. A positive response on one attribute would mean that the teacher feels supported in
this area regardless of whether the other attribute is mastered. The specific model being used is
called the compensatory reparameterized unified model (C-RUM). Since it is a compensatory
model, a non-positive response, or "a lack of perceived support," on a particular measured
attribute can be made up for by the "mastery" or "support" of another measured attribute
(Rupp, Templin, & Henson, 2008). The C-RUM is the most suitable diagnostic model for
this study for several reasons. First, based on the previously discussed literature, support is a
compensatory construct. This means that when a teacher perceives a lack of support on a
particular item, the attribute that item is assigned to can still be diagnosed as "supported,"
depending on the teacher's responses to other items that are also assigned to that attribute.
Moreover, this model is selected because it allows for flexibility. The C-RUM contains unique
parameters for each item and attribute. Thus, a similar approach to defining global and
attribute-specific item discrimination indices can be used. The endorsement probabilities for
respondents in the same latent class are not constrained to be equal across all items with the
same attribute requirements (Rupp et al., 2008). More advantages and the specific
characteristics of the C-RUM are discussed in more detail in chapter 3.
Through their thorough review of cognitive diagnostic models, Rupp and Templin (2008)
defined cognitive diagnostic models as "…probabilistic, confirmatory multidimensional latent-
variable models with a simple or complex loading structure" (p. 226). This definition contains
two elements that require clarification. First, the models are probabilistic because they express a
given respondent's performance level in terms of the probability of mastery, or perceived
support, of each attribute separately, or the probability of each person belonging to each latent
class (Lee & Sawaki, 2009). The number of latent classes in a model depends on the number of
attributes hypothesized. If, for example, one hypothesizes four attributes with two levels each,
then there will be 2^4 = 16 latent classes. The probability of a respondent belonging to each
individual latent class is provided by the diagnostic output. Second, diagnostic models are
confirmatory in the sense that the latent variables are defined a priori through a Q-matrix
(Ravand & Robitzsch, 2015).
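The latent class space implied by a set of dichotomous attributes can be enumerated directly. The following sketch (Python; the attribute count is the four-attribute case hypothesized here, purely for illustration) lists the 2^A profiles:

```python
from itertools import product

A = 4  # four hypothesized attributes, two levels each (0 = not supported, 1 = supported)
profiles = list(product([0, 1], repeat=A))

print(len(profiles))  # 2**4 = 16 latent classes
print(profiles[:3])   # (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)
```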
The Core Compensatory and Non-Compensatory Cognitive Diagnostic Models
There are six core cognitive diagnostic models identified by Rupp, Templin, and Henson
(2008). As previously explained, the core non-compensatory models assume that a deficit in one
attribute cannot be compensated for by a surplus in a different attribute. The three core non-
compensatory models include the deterministic-input, noisy-and-gate (DINA) model, the noisy-
input, deterministic-and-gate (NIDA) model, and the non-compensatory reparameterized unified
(NC-RUM) model. Generally, the DINA model separates respondents into mastery classes for
each item. One class includes the respondents who have mastered all of the measured attributes,
and the other class includes respondents who are non-masters of at least one of the attributes
measured by the item. In this model, no further differentiation between respondents who lack
different attributes is made for any item, which reduces the number of parameters that need to be
estimated. Conversely, the NIDA model accounts for the fact that a respondent lacking only one
of the measured attributes has a higher chance of a positive response than a respondent who has
not mastered any of the measured attributes. The model does this by including a slipping and a
guessing parameter per attribute, which ultimately increases the number of parameters to be
estimated in comparison to the DINA model. These parameters are constrained to be equal
across all items. Similar to the DINA model, the NIDA provides the probability of a positive
response as output. However, the probability of a correct response equals the probability of
correctly applying all measured attributes for an item. The advantage of this is its finer
distinction between respondents who lack different combinations of attributes. The NC-RUM is
similar to the previous two models in that it is non-compensatory. However, this model includes
an interaction term to absorb the effects of incompleteness in an undifferentiated way to improve
the chances that the model will fit the data. The NC-RUM relaxes the parameter constraints
inherent in the DINA and NIDA models and provides parameter estimates at the item level as
well as for each combination of item and attribute.
The core compensatory models maintain the assumption that a deficit in one attribute can
be compensated for by a surplus on another attribute. The core compensatory models include the
deterministic-input, noisy-or-gate (DINO) model, the noisy-input, deterministic-or-gate (NIDO)
model, and the compensatory reparameterized unified model (C-RUM). The DINO model
includes slipping and guessing parameters modeled at the item level. In addition, it includes a
gate component that summarizes the contribution of individual attributes in the latent response
variable. It is limited in that no finer distinction is made between respondents for whom different
sets of attributes are present on items that require multiple attributes. Hence, the NIDO model
provides a finer distinction than the DINO model, similar to how the NIDA addresses the
limitation of the DINA model. Although the NIDO allows for differing contributions of each
attribute in a compensatory way, the parameters are restricted to equality across items. This
constraint may be unrealistic in some cases. The C-RUM allows for a higher degree of
modeling flexibility in that response behavior is modeled at the item-by-attribute level without
equality constraints across items or attributes.
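The conjunctive/disjunctive contrast between the DINA and DINO gates can be made concrete by computing their latent ideal responses, that is, the deterministic response before slipping and guessing noise is applied. The following sketch uses a hypothetical item and attribute profile; the function names are the author's illustration, not from any modeling package:

```python
import numpy as np

def ideal_response_dina(alpha, q):
    """Conjunctive (and-gate): positive only if ALL measured attributes are possessed."""
    return int(np.all(alpha[q == 1] == 1))

def ideal_response_dino(alpha, q):
    """Disjunctive (or-gate): positive if ANY measured attribute is possessed."""
    return int(np.any(alpha[q == 1] == 1))

q = np.array([1, 1, 0])      # hypothetical item measuring attributes 1 and 2
alpha = np.array([1, 0, 0])  # respondent possesses attribute 1 only

print(ideal_response_dina(alpha, q))  # 0: attribute 1 cannot compensate for attribute 2
print(ideal_response_dino(alpha, q))  # 1: possessing either attribute is sufficient
```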
In recent years, there has been a movement toward more general models, including the
generalized deterministic-input, noisy-and-gate (G-DINA) model, the general diagnostic model
(GDM), and the log-linear cognitive diagnostic model (LCDM). The LCDM and GDM are more
flexible models parameterized for both dichotomous and polytomous data and attributes, and
they can be viewed more generally as representing model families consisting of a variety of
compensatory models that arise out of restrictions placed on the parameters in the model
(Rupp & Templin, 2008). The LCDM models the conditional probability that a respondent with
a specific attribute profile provides a positive response to a particular item. The formula for the
LCDM is given by:
\pi_{ic} = P(X_{ic} = 1 \mid \boldsymbol{\alpha}_c) = \frac{\exp\left(\lambda_{i,0} + \boldsymbol{\lambda}_i^{T}\,\mathbf{h}(\boldsymbol{\alpha}_c, \mathbf{q}_i)\right)}{1 + \exp\left(\lambda_{i,0} + \boldsymbol{\lambda}_i^{T}\,\mathbf{h}(\boldsymbol{\alpha}_c, \mathbf{q}_i)\right)}     (3)
where q_i is the set of Q-matrix entries for item i, and λ_{i,0} represents the logit of a correct
response given that a respondent possesses none of the Q-matrix-indicated attributes. The vector
λ_i is of size (2^A − 1) × 1 and contains the main effect and interaction parameters for item i,
and h(α_c, q_i) is a vector of the same size containing linear combinations of α_c and q_i. Thus,
the LCDM can be compared to a multi-way ANOVA. Since most cognitive diagnosis models are
typically parameterized to define the probability of a positive response (i.e., each item is either
positive, X_ij = 1, or not positive, X_ij = 0), the log-linear model is re-expressed in terms of the
log-odds of a correct response for each item as a function of the latent variables (Henson,
Templin, & Willse, 2009). A compensatory model, such as the C-RUM, consists of only main
effects, while non-compensatory models contain only interaction effects.
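To make this distinction concrete, consider a hypothetical item i measuring attributes 1 and 2. Written out, the LCDM logit for that item is:

\text{logit}\, P(X_{i} = 1 \mid \alpha_1, \alpha_2) = \lambda_{i,0} + \lambda_{i,1}\alpha_1 + \lambda_{i,2}\alpha_2 + \lambda_{i,12}\alpha_1\alpha_2

Constraining the interaction term λ_{i,12} to zero yields the main-effects-only form used by the C-RUM, whereas constraining the main effects λ_{i,1} and λ_{i,2} to zero and retaining only the interaction yields a conjunctive, DINA-like item.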
The Q-Matrix
To attain the diagnostic information from a cognitive diagnostic model, one has to
develop a confirmatory hypothesis assigning items to the latent attributes. This is known as a Q-
matrix (Tatsuoka, 1983). A Q-matrix depicts which skills/attributes contribute to the probability
that a participant responds positively to an item. Items can be assigned to multiple skills; thus,
the more skills assigned to an item, the more skills that affect the probability of a positive
response to that item. Q-matrix entries are binary in that a skill either affects the probability
of a correct response or it does not. In table 3, attribute 1 affects the probability of a positive
response to items 3 and 5. Attribute 2 is hypothesized to affect items 1, 2, 3, and 5. Finally,
attribute 3 is hypothesized to affect items 1, 3, and 4. Looking at the items, the probability of a
positive response to item 1 is affected by attributes 2 and 3, but the probability is not
hypothesized to be affected by attribute 1. Items can be hypothesized to be affected by one or
more attributes.
Table 3. Example of Q-Matrix

          Skill/Attribute 1   Skill/Attribute 2   Skill/Attribute 3
item 1            0                   1                   1
item 2            0                   1                   0
item 3            1                   1                   1
item 4            0                   0                   1
item 5            1                   1                   0
…                 …                   …                   …
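For illustration, the Q-matrix in Table 3 can be encoded directly as a binary array. A minimal sketch (Python with NumPy; the variable names are illustrative, not from any modeling package):

```python
import numpy as np

# Q-matrix from Table 3: rows are items, columns are skills/attributes.
Q = np.array([[0, 1, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 1],
              [1, 1, 0]])

for i, row in enumerate(Q, start=1):
    attrs = np.flatnonzero(row) + 1  # 1-indexed attributes measured by this item
    print(f"item {i} measures attribute(s): {attrs.tolist()}")
```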
The Q-matrix method came from Tatsuoka (1983), who developed the Rule Space
Method. In that research, she explored the process of diagnosing student skills in adding
fractions in order to provide remediation. Mapping students' overall knowledge of the skills
required to add fractions onto their test question responses marked the genesis of the Q-matrix.
Several strategies have since been used in the literature to develop Q-matrices. One common
approach relies on the literature to identify an initial set of skills to be measured. In this strategy,
researchers define the attributes or skills to be measured using theoretical support, and then they
develop the items to measure the attributes. Next, the relationships between each item and
attribute are coded, the model is run, and the initial Q-matrix is revised based on various fit
statistics. This strategy is convenient and cost-effective. However, Leighton and Gierl (2007)
found that when this strategy is used in isolation it often results in overly generalized attributes,
and thus they recommended against using it in isolation. A similar approach other researchers
have used is to rely on existing test specifications (Xu & Von Davier, 2008) for the initial
hypothesis and then work toward a final matrix. This strategy can be considered an applicable
extension of relying on the available literature, assuming the test specifications were themselves
developed from the literature.
Using think-aloud protocols to develop the Q-matrix has also been documented in the
literature (Jang, 2009; Wang & Gierl, 2007). Such protocols have been found to have the
potential to increase understanding of cognitive processes as researchers assign items to
attributes, and they have been found to be reliable. Often, expert panels will be relied upon
(Sawaki, Kim, & Gentile, 2009), and expert panels are also commonly used outside the context
of think-alouds. When working with expert panels in developing a Q-matrix, the process may be
an informal conversation. Conversely, it is also common to develop coding rules that are used
for an initial substantive assignment of attributes to tasks so that independent raters will reliably
agree on these assignments (diBello, Roussos, & Stout, 2007).
In one example, Ravand and Robitzsch (2015) developed a Q-matrix to assess reading
skills. They invited multiple subject matter experts to identify the relationships between items
and attributes. Moreover, they found additional empirical support from subsequent correlational
analyses among skills and from the nature of the interactions among the multiple skills required
by individual tasks. Other studies have also relied on correlational analyses to inform the
development of the Q-matrix and improve statistical performance. In one example, Liu, Douglas,
and Henson (2009) used an exploratory factor analysis for Q-matrix development and found that
this approach can give a reasonable solution when the Q-matrix is not too complex. The
researchers used a three-factor exploratory model to identify basic clusters of items that might
measure similar abilities. The empirical results from the factor analysis were then compared to
the content of the items to identify the final Q-matrix.
In a recent study, Close, Davison, and Davenport (2014) relied on binary examinee
responses from simulated and real data to compute item correlations that were analyzed via
principal components analysis (PCA) with promax rotation to identify the skill sets. They found
that the components analysis method for Q-matrix development appeared to be a viable and
useful step in generating a Q-matrix when skill sets were measured by more than one item. They
also recommended that once items have been developed by content specialists, these items
should be pilot tested and a task analysis using components analysis conducted to finalize the
Q-matrix before items are used operationally.
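A minimal sketch of this components-analysis step follows (Python with NumPy; the data are randomly generated stand-ins, the 0.3 loading threshold is an arbitrary illustration, and the promax rotation used by Close et al. is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(500, 20)).astype(float)  # hypothetical binary responses

R = np.corrcoef(X, rowvar=False)        # inter-item correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)    # principal components of the correlations
order = np.argsort(eigvals)[::-1][:3]   # keep the three largest components
loadings = eigvecs[:, order] * np.sqrt(eigvals[order])

q_draft = (np.abs(loadings) > 0.3).astype(int)  # threshold loadings into a draft Q-matrix
print(q_draft[:5])
```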
In summary, chapter two provided a review of the literature on teacher effectiveness and
traditional teacher evaluation systems. This discussion turned to a review of recent efforts to
reform teacher evaluation systems, including the use of multiple measures of effectiveness. The
action taken at the federal level through the Race to the Top initiative was addressed, and the
discussion then filtered down to the specific efforts taken by the Virginia Department of
Education. The Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria
for Teachers was the particular policy explored in this study, so its key components were
reviewed in detail. Next, the theoretical framework for the study was reviewed, with literature
focusing on policy implementation in organizations and K-12 contexts, and the hypotheses for
this study were presented. Finally, a conceptual overview of the methodology was presented and
the advantages of using cognitive diagnostic models were highlighted. As described in the
research questions, the main objective of this study was to test whether this type of modeling
with binary outcomes can be used instead of a unidimensional model. In order for this to happen,
the model must be shown to have significantly better model fit than unidimensional models, its
parameters must be accurately estimated, and the interpretations must make sense. The next
chapter describes the specific actions taken to test these hypotheses.
CHAPTER 3
METHODOLOGY
The purpose of this study is to explore a new way to measure teachers’ perceptions of
school intra-organizational mechanisms for supporting teacher evaluation policy implementation.
Cognitive diagnostic modeling has not previously been applied to policy implementation support
constructs. The diagnostic output from the analysis in this study was intended to provide detailed
empirical information about teachers’ perceptions of support. It is assumed that more precise
diagnostic feedback will be beneficial to policy makers and school leaders in identifying
strengths and weaknesses and in targeting resources in the policy implementation process. When
equipped with more precise diagnostic feedback, policy makers and school leaders may be able
to more confidently engage in empirical decision making, especially in regards to targeting
resources for short-term and long-term organizational goals subsumed within the policy
implementation initiative. Specifically, the following research questions are addressed:
1. Can cognitive diagnostic models be applied to understanding teachers’ perceptions of
intra-organizational mechanisms for supporting policy implementation?
a. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms fit the data better than models specifying
coarser-grained attributes?
b. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms have more accurate parameter estimates than
models specifying coarser-grained attributes?
c. What are the posterior estimated proportions of teachers that are diagnosed to fall
under each latent profile?
2. Are there group differences in the diagnostic model fit based on grade level, subject
taught, and career status?
a. What are the group differences of the estimated latent class profile
distributions based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of parameter
estimates based on grade level, subject taught, and career status?
The ITES Project, Current Study, and the Role of the Researcher
The current study relies on preliminary efforts that were part of a larger grant-funded
study. Researchers with the ITES project team developed the instrument and collected the data
used in the current study. Moreover, Figure 1, which summarizes the conceptual framework used
in the current study, was adapted, with permission, from the ITES project. Several modifications
were made; most notably, there is an added emphasis on both the psychometric components and
the overlapping nature of the intra-organizational support mechanisms in the current study. As is
the procedure in any study in which a secondary data source is used, the data owner and
principal investigator (PI) for the ITES project and the Virginia Tech Institutional Review Board
for the Protection of Human Subjects (IRB) granted approval for the use of the ITES data in this
study (see Appendix B). To be clear, the ITES project included the data collection and the
instrument development. The concept, study design, and methodology of the current study were
separated from the ITES project upon the ITES PI's permission and IRB approval.
As the lead researcher of the current study, it is necessary to clarify my role with the
ITES project. As the lead graduate assistant for the former study, my role involved working with
the data. I have no connection to the teachers who submitted the data, since neither study
involves an intervention. My role as a former ITES researcher did not involve any opportunity to
influence the individual teachers who had already provided the existing data. The data I
requested to use for this project were de-identified in 2013.
Data Collection and Sample
Although the data were obtained from a secondary source, a description of the data
collection efforts from the previous study provides context for the current study. In the larger
project, both quantitative and qualitative data were collected from multiple sources. For this
dissertation study, the primary sources of data included administrative data collected from the
local education agencies, information provided by the National Center for Education Statistics
Common Core of Data, and the ITES teacher survey data collected in the first year of the policy
implementation. A description of each data source is provided in table 4. In total, the study
included three participating LEAs for a total of 35 schools: 6 high, 6 middle, and 23 elementary
schools. This study used a sample of partnership LEAs located in Southwest Virginia and
included one rural, one suburban, and one urban LEA. A total of 19,315 students were served
across the sample.
Table 4. Data Description for the Current Study

Local Education Agency Administrative Data: Teachers' growth measures; other measures of
teachers' performance, including peer evaluation and student surveys of classroom instruction;
teacher background; and school and division documents relevant to the implementation.
Collection method: provided by administrators in partnership LEAs.

School and Local Education Agency Data: Necessary data from the National Center for
Education Statistics Common Core of Data. Collection method: online source.

ITES Teacher Surveys: Teachers' attitudes towards supports for the new teacher evaluation
system and their perceptions of major barriers in the adoption. Collection method: research team
survey administration.
All full-time teachers who had experienced evaluation in 2012-13 in each of the LEAs
were eligible to take the survey. The final sample included 747 teachers, a response rate of 70%.
The high response rate was attributable to the strong partnerships developed with the LEAs. The
surveys were administered externally by the ITES research team, with the LEA administration
providing permission, opportunities to promote the survey, and the allotted time for teachers to
take the survey. An ITES research team member was present at each school for each survey
administration.
In 2013-14, education agency 1 (EA1) had all of its schools accredited by the state based
on the 2012-13 state standardized tests, while 9% of education agency 2 (EA2) schools received
accreditation with warning because these schools did not meet the state benchmark in
mathematics. In EA3, 30% of the schools were accredited with a warning while the other 70%
were fully accredited. It should also be noted that the three education agencies did not serve
similar students with regard to race/ethnicity. However, this is a study about policy
implementation, and no literature was found to support the notion that the race/ethnicity of the
students would impact teachers' perceptions of implementation support. As is evident in
table 5, all three education agencies are comparable to the state statistics on variables that may
impact the level of resources available to the schools, including whether the school was a
Title I school and the proportion of students who qualified for free and reduced-price lunch.
Moreover, all three education agencies were similar with regard to accreditation status, which
may also have had implications for the policy implementation.
Table 5. 2013-2014 School-Year Local Education Agency Information

                                        EA1          EA2           EA3           Virginia
White students                          81.0%        92.0%         92.0%         53%
Black students                          11.0%        3.0%          6.0%          23%
Asian students                          0.0%         0.5%          0.4%          6%
Hispanic students                       3.2%         2.0%          1.0%          11%
Students eligible for free and
reduced-price lunch                     32.0%        32.3%         41%           38%
Title I schools                         33.3%        37.4%         44%           40%
Virginia Accreditation Status           All          All           All           98%***
                                        accredited   accredited*   accredited**

Notes: * indicates that 9.1% of schools accredited with warning; ** indicates that 57% of schools
accredited with warning; *** indicates that 30% of schools accredited with warning.
Using teacher personnel data provided by the partnership school districts and teachers'
responses to survey items, it was possible to analyze information about the survey respondents
who were included in the final sample as well as information about those teachers who were not
included in the final sample, the non-respondents. Non-respondents included those teachers who
were eligible but chose not to respond completely to the survey. A comparison of respondents to
non-respondents is included in table 6. As is clear from the table, there were no significant
differences between the two groups on any of the available variables. Thus, there was no
meaningful systematic explanation for why teachers chose not to respond to the survey.
The final sample included 747 total teachers, of whom 570 were female and 177 were
male (see Table 7). Almost half of the final sample were identified as elementary school
teachers, whereas the other half consisted of almost equal proportions of middle and high school
teachers. Surprisingly, fewer teachers were identified as teachers of science, technology,
engineering, or math (STEM) than were identified as non-STEM. A STEM teacher was defined
as any teacher who teaches a math course (e.g., calculus, trigonometry, applied mathematics) or
a science course (e.g., biology, chemistry, physics, engineering, general science, computer
science). The reason this was surprising is that in a sample with so many elementary school
teachers, one might expect to find more STEM teachers, because it is common for elementary
school teachers to teach all subjects to their assigned class.
Table 6. Comparison of Teacher Characteristics

                                               Respondents    Non-Respondents^a
Female                                         76%            72%
STEM teachers                                  39%            36%
Elementary                                     46%            49%
Middle                                         26%            23%
High                                           32%            31%
Early Career^b                                 34%            35%
Experienced^b                                  66%            65%
Average prior evaluation rating, 2012-13^c,d   1.53           1.46

Note: F-statistics used to indicate whether differences are significant at *p<0.05, **p<0.01,
***p<0.001.
^a Indicates the group of teachers who were not included in this analysis because they did not
respond.
^b Early career teachers are those who reported teaching for 1-5 years; mid-career teachers are
those who had been teaching for 6 to 10 years; experienced teachers are those who reported
teaching over 10 years.
^c Includes data only for EA1 and EA2; this information was not available for EA3.
^d The average score as calculated by the school district.
Finally, a comparison of experienced and early career teachers was made. Early career teachers
are those in the first 5 years of their teaching career; conversely, experienced teachers are those
with more than 5 years of teaching experience. As is evident in table 7, there were 435
experienced teachers and 312 early career teachers in this sample.
Table 7. Crosstabs of Teacher Characteristics

                        Gender                 Level                         STEM
                        Female   Male   N      1      2      3      N        1      2      N
Level
 1: Elementary          314      31     345    -      -      -      -        -      -      -
 2: Middle              137      57     194    -      -      -      -        -      -      -
 3: High                119      89     208    -      -      -      -        -      -      -
 N                      570      177    747    -      -      -      -        -      -      -
STEM
 1: STEM                258      36     294    189    54     51     294      -      -      -
 2: Not-STEM            312      141    453    156    140    157    453      -      -      -
 N                      570      177    747    345    194    208    747      -      -      -
Status
 1: Experienced         358      77     435    214    119    102    435      173    262    435
 2: Early Career        212      100    312    131    76     105    312      121    191    312
 N                      570      177    747    345    195    208    748      294    453    747
Instrumentation
As previously mentioned, the instrument development work was not part of this
dissertation. However, the description of the instrumentation is important for understanding the
retro-fitting process of the diagnostic model (see chapter 4). The teacher survey consisted of
items about teachers' perceptions of the teacher evaluation policy and their perceptions of
supports and barriers to implementing the policy. Teachers shared perceptions of the supports by
indicating the extent to which they agreed about specific aspects of the policy implementation.
The items were originally on 4-point scales where 1 = not at all, 2 = some extent, 3 = moderate
extent, and 4 = great extent (see Appendix A). The development of the items relied upon the four
areas of implementation in the conceptual framework. As previously discussed, the survey was
developed to investigate the policy implementation in Virginia; thus, the GUPSECT standards
were used to ensure items were relevant to this sample of teachers. A visual representation of an
informal blueprint is depicted in Appendix C. Due to the overlapping nature of the areas of
implementation in the conceptual framework, specifying an exact number of items to represent
each cell in this blueprint was not an objective of the development. Rather, the members of the
ITES research team involved in the instrument development process systematically checked that
each area was represented by numerous items.
The final survey included 89 total items about policy implementation support and
additional items about demographic information. After the survey items were written, the
research team solicited feedback from higher-level district administrators, including
superintendents and assessment coordinators. Two separate meetings with two different groups
of administrators from two different LEAs occurred. Upon incorporating feedback from
administrators, the survey was piloted with the district data teams of teachers, a total of about 30
teachers. The average time to complete the survey was about 25 minutes. Further revisions were
made based on the pilot group's written qualitative feedback and time constraints. The final
revision of the teacher survey was developed on Qualtrics software using HTML and then tested
by project team members for adequate functioning. The final survey was administered to
teachers using an official list of teacher e-mail addresses provided by each LEA's central office:
to all teachers in LEAs 1 and 2 in the fall semester of 2013, and to teachers in LEA 3 in the
spring semester of 2014. Following each initial distribution, up to three reminders were given to
teachers every two weeks. The final 89 items are included in Appendix A.
Item Analysis
For the preliminary analysis, it was necessary to investigate any irregularities that may be
present in the data. This was accomplished through an item analysis using Stata software. As
previously described, the instrument was developed to measure teachers’ perceptions of intra-
organizational mechanisms for supporting teacher evaluation. Thus, initially, a unidimensional
latent construct was assumed. The unidimensional construct was teachers’ perceptions of overall
policy implementation support. A higher score on this construct would indicate a greater degree
of perceived support.
From the polytomous item analysis, it was clear that most items performed well based on
their means and variances (see Table 8). The item with the highest mean was item 69
(mean = 3.46), and the item with the lowest mean was item 3 (mean = 1.21). The item with the
largest standard deviation was item 6 (sd = 1.18), and the item with the smallest was item 3
(sd = 0.56).
Table 8. Polytomous Item Descriptive Statistics
item mean sd item mean sd item mean sd
1 2.46 0.85 31 2.05 0.77 61 2.19 1.07
2 2.05 0.93 32 2.30 1.07 62 2.29 1.00
3 1.21 0.56 33 2.35 0.89 63 2.82 0.86
4 1.25 0.61 34 2.47 0.95 64 2.75 0.88
5 1.60 0.73 35 2.12 0.70 65 2.72 1.33
6 2.78 1.18 36 3.10 0.84 66 3.29 0.93
7 1.86 1.16 37 1.46 0.91 67 3.30 0.96
8 2.41 1.12 38 1.65 1.03 68 2.20 1.33
9 3.17 1.01 39 1.82 1.12 69 3.46 0.86
10 3.15 1.02 40 1.48 0.89 70 2.03 1.25
11 2.39 1.02 41 1.46 0.90 71 2.93 1.11
12 3.06 0.96 42 1.33 0.82 72 2.07 1.18
13 3.27 0.88 43 2.22 0.96 73 3.06 1.10
14 3.23 0.89 44 2.89 0.87 74 2.02 0.91
15 2.81 0.98 45 3.05 0.80 75 2.01 0.93
16 2.32 1.02 46 2.72 0.96 76 2.08 0.96
17 2.47 1.01 47 3.30 0.76 77 2.34 0.97
18 2.62 1.00 48 3.36 0.72 78 1.67 0.75
19 3.16 0.94 49 3.24 0.77 79 2.21 0.93
20 3.25 0.92 50 3.32 0.76 80 2.05 0.96
21 3.20 0.95 51 3.40 0.72 81 2.11 0.94
22 2.77 1.00 52 3.30 0.76 82 2.10 0.99
23 2.46 1.00 53 2.94 0.92 83 2.14 0.97
24 1.94 0.94 54 2.91 0.92 84 1.83 0.81
25 1.95 0.92 55 2.61 0.94 85 2.00 0.96
26 1.88 0.92 56 2.84 0.92 86 1.90 0.85
27 3.24 0.97 57 2.47 0.99 87 2.59 0.90
28 1.74 1.02 58 2.66 1.02 88 2.76 0.85
29 2.73 0.94 59 2.14 1.03 89 2.71 0.89
30 2.69 1.11 60 2.38 1.01
The item mean distribution was plotted, and the items with means at the tail ends of the
distribution (<2 or >3) were flagged and further analyzed using graphical and substantive
analyses. The quantiles of each item were plotted against the quantiles of the normal
distribution, and the text of these flagged items was further analyzed for substantive evaluation
purposes. The text of the flagged items was reviewed with two content experts: two teachers,
neither of whom participated in the original survey. In total, twelve of the flagged items were
removed from the analysis. The rationale for deleting these items was that, first, the majority of
the responses to these items were an extreme choice (1 or 4), which resulted in an extreme mean
and/or low item variance. Second, the selected teachers and the research team determined that
cases could be made that the wording of the items was ambiguous, unclear, or not necessarily
important. Some items were reviewed but ended up being included in the analysis based on their
substantive contributions to the measured attributes of interest. Figure 2 displays an example of
the histogram and quantile plot of one of the reviewed items.
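The flagging rule itself is simple to express in code. A minimal sketch (Python with pandas; the response matrix here is randomly generated stand-in data, not the ITES data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical stand-in for the 4-point survey responses (747 teachers x 89 items).
responses = pd.DataFrame(rng.integers(1, 5, size=(747, 89)),
                         columns=[f"item{i}" for i in range(1, 90)])

means = responses.mean()
flagged = means[(means < 2) | (means > 3)].index.tolist()  # tail-end items for review
print(f"{len(flagged)} items flagged for substantive review")
```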
Figure 2. Item Flagged for Substantive Review (histogram and normal quantile plot of a removed item)
A major theme of this study was the application of cognitive diagnostic models to
improve the measurement precision of policy implementation support in order to inform teacher
training and development. Since cognitive diagnostic models have not previously been used for
this type of application, several considerations were necessary. The first consideration was the
current state of the research on cognitive diagnostic models: the majority of the limited research
has focused on dichotomous models, suggesting it may be advantageous to rely on dichotomous
data simply because more support would be available for methodology decisions. Second, most
of the software programs available that actually run
cognitive diagnostic models are restricted to running the dichotomous models. One additional
point worth noting was that the polytomous models required much larger sample sizes to reach
convergence.
As is discussed in more detail in chapter 4 and shown in detail in Appendix D, the
internal structure of the dichotomous data was shown to be very similar to the internal structure
of the polytomous data. In both cases, the resulting structures supported a 10-factor solution of
finer-grained support mechanisms. For these reasons, the application of cognitive diagnostic
models could still be investigated using the dichotomous data, and the diagnostic output could
still be attained from the analysis and used to improve the support provided to teachers. To
reiterate, dichotomizing the data presents a loss of information and may introduce a limitation of
this study; however, it will not have a major impact on the most important goals of this study.
To dichotomize the data, the items were recoded from the initial scale (1 = not at all,
2 = some extent, 3 = moderate extent, 4 = great extent). If a teacher indicated a 3 or a 4 in
response to an item, that entry was recoded to a "1," representing that the teacher felt supported
on this item. Similarly, 1's and 2's were recoded to a "0," indicating that the teacher did not feel
supported on that item. The overall reliability of the survey was calculated after the items were
dichotomized; the coefficient alpha was 0.9, indicating an excellent level of reliability. Similar to
the procedure followed with the polytomous data, an item analysis was completed on the
dichotomous data. The histograms of items at the tail ends of the distributions were further
investigated, as was the text of the potentially problematic items. However, none of the items
were removed based on their dichotomized performance. Thus, based on the initial item analysis
of polytomous and dichotomous items, a total of 77 items were selected. The dichotomized item
scores are summarized in table 9.
Table 9. Dichotomized Item Descriptive Statistics
Item Mean SD Item Mean SD Item Mean SD
1 0.52 0.50 31 0.23 0.42 61 0.34 0.47
2 0.51 0.49 32 0.44 0.44 62 0.34 0.47
3 0.43 0.50 33 0.38 0.49 63 0.62 0.49
4 0.07 0.25 34 0.45 0.45 64 0.57 0.49
5 0.10 0.30 35 0.23 0.42 65 0.56 0.50
6 0.62 0.49 36 0.65 0.48 66 0.79 0.41
7 0.30 0.46 37 0.09 0.29 67 0.79 0.41
8 0.51 0.50 38 0.15 0.36 68 0.38 0.49
9 0.76 0.43 39 0.21 0.41 69 0.84 0.37
10 0.76 0.43 40 0.08 0.27 70 0.32 0.47
11 0.45 0.50 41 0.09 0.29 71 0.65 0.48
12 0.73 0.45 42 0.05 0.22 72 0.32 0.47
13 0.80 0.40 43 0.33 0.47 73 0.70 0.46
14 0.80 0.40 44 0.67 0.47 74 0.24 0.43
15 0.63 0.48 45 0.74 0.44 75 0.24 0.43
16 0.43 0.50 46 0.59 0.49 76 0.27 0.44
17 0.48 0.50 47 0.83 0.38 77 0.38 0.48
18 0.54 0.50 48 0.87 0.33 78 0.12 0.32
19 0.77 0.42 49 0.83 0.38 79 0.33 0.47
20 0.80 0.40 50 0.84 0.37 80 0.26 0.44
21 0.76 0.43 51 0.87 0.34 81 0.28 0.45
22 0.61 0.49 52 0.84 0.36 82 0.29 0.45
23 0.48 0.50 53 0.63 0.48 83 0.30 0.46
24 0.28 0.45 54 0.66 0.47 84 0.17 0.37
25 0.26 0.44 55 0.51 0.50 85 0.23 0.42
26 0.23 0.42 56 0.62 0.49 86 0.21 0.40
27 0.79 0.41 57 0.46 0.50 87 0.58 0.49
28 0.19 0.39 58 0.50 0.50 88 0.65 0.48
29 0.65 0.48 59 0.30 0.46 89 0.63 0.48
30 0.48 0.48 60 0.40 0.49
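The recoding and reliability computation described above can be sketched as follows (Python with pandas; the data are randomly generated stand-ins, and the function implements the standard coefficient alpha formula, not any ITES-specific routine):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
responses = pd.DataFrame(rng.integers(1, 5, size=(747, 89)))  # hypothetical 4-point data

dichotomous = (responses >= 3).astype(int)  # 3 or 4 -> 1 (supported); 1 or 2 -> 0

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Coefficient alpha: (k/(k-1)) * (1 - sum of item variances / total-score variance)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Independent random stand-in data yields alpha near 0; the real survey yielded 0.9.
print(round(cronbach_alpha(dichotomous), 2))
```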
Descriptive Statistics
Following the item analyses, the dichotomous data needed to be further explored. The
number of items on which a teacher indicated feeling supported was summed to create a total
survey score. There were 747 responses, and the average total score was 36.5 with a standard
deviation of 13.5. The maximum number of items on which a teacher perceived support was 86,
and the minimum was 2. The distribution of the total scores was approximately normal, as
depicted in figure 3.
Figure 3. Distribution of ITES Survey Dichotomized Total Scores
With regard to the observed dichotomous total scores, it was clear that there were very few
differences in instrument performance across groups. As seen in table 10, there were no
significant differences in observed total score between experienced and early career teachers or
between STEM and non-STEM teachers. However, a one-way analysis of variance with
Bonferroni comparisons revealed that there were significant differences between elementary
and middle school teachers (F(746) = 13.94, p = 0.02 < α = 0.05). Moreover, there was a
significant difference between elementary school teachers and high school teachers on total
observed score (F(746) = 13.94, p < 0.001 < α = 0.05). The analysis in this study, specifically
research question 2, was focused on the associations between groups on instrument performance.
However, this focus was not on the overall total score of the instrument. Rather, all teachers
were to be grouped into diagnostic profiles based on their performances on specific components
of the instrument. Hence, the overall total score performances and the differences between
groups on overall total score were presented in table 10 purely for descriptive purposes. In later
analyses, these groups will be compared based on diagnostic analyses instead of classical total
scores.
Table 10. Total Scores by Group

              N      Mean Score   SD      Min    Max
Elementary    345    39.05        12.98   2      75
Middle        194    35.85        13.00   5      65
High          208    32.97        14.03   4      67
STEM          294    37.89        13.30   4      75
Not-STEM      453    35.64        13.58   2      67
Female        570    37.95        13.11   2      75
Male          177    31.95        13.81   4      62
Experienced   435    36.35        12.98   2      69
Early         312    36.77        14.23   4      75
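For reference, the group comparison reported above can be sketched as a one-way ANOVA followed by Bonferroni-adjusted pairwise tests (Python with SciPy; the group samples below are randomly generated from Table 10's summary statistics, not the actual scores):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical total-score samples matching Table 10's means, SDs, and group sizes.
elem = rng.normal(39.05, 12.98, 345)
mid = rng.normal(35.85, 13.00, 194)
high = rng.normal(32.97, 14.03, 208)

f_stat, p_val = stats.f_oneway(elem, mid, high)  # one-way ANOVA across grade levels
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

pairs = {"elem vs mid": (elem, mid), "elem vs high": (elem, high), "mid vs high": (mid, high)}
alpha_bonf = 0.05 / len(pairs)  # Bonferroni-adjusted significance level
for name, (a, b) in pairs.items():
    t, p = stats.ttest_ind(a, b)
    print(name, "significant" if p < alpha_bonf else "not significant", f"p = {p:.4f}")
```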
Plan of Analysis
The purpose of applying cognitive diagnostic models to understanding teachers'
perceptions of intra-organizational support mechanisms is to generate profiles of perceived
support on categorical outcome variables key to the implementation process. These categorical
outcomes are especially advantageous when criterion-referenced, standards-based assessments
are desired, as in this study. Another major advantage of cognitive diagnostic modeling is that
"cut scores" are defined internally to the model and a priori. As discussed, a potential alternative
framework is multidimensional IRT (mIRT). Similar to cognitive diagnostic modeling, mIRT
uses full-information estimation and accommodates a multidimensional internal structure;
however, mIRT typically uses continuous latent outcome variables and requires a second,
model-external standard-setting step in order to be useful when criterion-referenced decisions
are desired. Thus, if the application of cognitive diagnostic modeling is found to be successful
here, it is preferable.
It is contended that policy makers and school leaders can use the standards-based
categorical diagnostic information to organize resources for policy implementation based on
where teachers fall in the diagnostic marginal distribution profiles. For example, school leaders
may target certain resources and professional development for teachers who fall under one
profile and plan other remedial steps for teachers in another profile. Before the diagnostic output
is of value, it is imperative to find the best-fitting diagnostic model possible. In the remainder of
this chapter, the methodologies used in the analysis are reviewed and the plan of analysis is
detailed. Following this review, chapter 4 examines the results of the analysis.
The Q-Matrix: Defining the Attribute and Skills Space
It will first be necessary to formally define the attribute and skills space. This process
amounts to establishing, a priori, what skills and attributes are being measured in the policy
implementation process. Based on the literature review, four coarse-grained attributes are
hypothesized (see Figure 1, p. 15): characteristics of the policy, characteristics of teachers,
characteristics of leadership, and characteristics of the organization. Most commonly, theoretical
support is utilized in the development of this space; the previously discussed literature (e.g.,
Century et al., 2012) is relied upon to determine the initial coarse-grained attributes here. The
number of latent classes (skill profiles) in a dichotomous model will be 2^A, where A is the
number of attributes and 2 reflects the two levels of each attribute.
Typically, following the process of defining the attributes, the survey items are developed.
These are referred to as the skill tasks (diBello et al., 2006). Contrary to how this process works
with other methodologies, task developers do not avoid tasks that require combinations of
multiple skills per task. Thus, survey or test items can be developed to measure multiple
attributes. However, since the data in this survey were obtained from a secondary source, the
items were developed prior to this study. With items and attributes defined, the most important
process in this study became mapping the items to the appropriate skills. Since model fit is
subject to the relationships between the individual items and the latent categorical attributes of
interest, the development of the Q-matrix is an important element of this study. A reasonable
portion of the preliminary analysis in this study is dedicated to the empirical procedures taken
beyond the literature review to inform the development of the initial Q-matrix, although this is
not the main focus of the study.
As discussed in chapter 2, a Q-matrix is the numerical specification table that indicates
which attributes are hypothesized to be measured by which items (Tatsuoka, 1983). It represents
a particular hypothesis about which skills are required to successfully answer each item in a test
(Li & Suen, 2013). A Q-matrix traditionally contains the items in the rows and the attributes in
the columns and includes a binary indicator system (e.g., 1 or 0) to reveal whether or not an
attribute is measured by an item. As previously discussed, one item may measure multiple
attributes, and an attribute should be measured by multiple items.
In chapter 2, various approaches to developing a Q-matrix were reviewed. As discussed,
Q-matrices are most commonly developed using a theory-based approach: researchers use the
relevant literature or rely on expert opinion to determine what attributes are important, and then
they organize the relationships between attributes and items. These steps will be used in this
study. However, in order to add more empiricism and improve the Q-matrix development,
additional validation steps will be undertaken to ensure the proper specification of the Q-matrix
in this study. Specifically, exploratory factor analysis (EFA) will be used as a preliminary step in
order to build understanding of the data dimensionality and to understand whether finer-grained
attributes underlie the data. Although the EFA is an important preliminary step for understanding
the relationships in the data, it is only one of the additional measures taken to add empiricism to
this study; to be clear, it is not the main focus of this study. The details and results of the EFA
and the entire Q-matrix development and validation process are described in detail in chapter 4.
The Compensatory Reparameterized Unified Model (C-RUM)
Once the initial Q-matrix is developed, a diagnostic model will be applied to the data.
The particular model to be used in this study is the compensatory reparameterized unified model
(C-RUM) (e.g., Hartz, 2002; Templin, 2006). The C-RUM allows for a higher degree of
modeling flexibility than some other commonly used models. Since the C-RUM is a
compensatory model, a lack of "perceived support" on a particular measured attribute can be
made up for by the "perceived support" of another measured attribute (Rupp, Templin, &
Henson, 2008). Moreover, this model is chosen because it is flexible and contains unique
parameters for each item and attribute. The notation used by Rupp, Templin, and Henson (2009)
is used in this study:
\pi_{ic} = P(X_{ic} = 1 \mid \boldsymbol{\alpha}_c) = \frac{\exp\left(\lambda_{i,0} + \sum_{a=1}^{A} \lambda_{i,1,(a)}\,\alpha_{ca}\,q_{ia}\right)}{1 + \exp\left(\lambda_{i,0} + \sum_{a=1}^{A} \lambda_{i,1,(a)}\,\alpha_{ca}\,q_{ia}\right)}     (4)
where P is the probability of a positive response (i.e., a response of 1), exp(·) is the exponential
function, π_ic is the probability of a positive response to item i in latent class c, X_ic is the
observed response for item i in latent class c, q_ia is the indicator from the Q-matrix of whether
attribute a is measured by item i, α_ca is the attribute "perceived support" indicator for attribute a
in latent class c, λ_i,0 is the intercept parameter for item i, and λ_i,1,(a) is the slope parameter for
item i and attribute a.
According to Rupp, Templin, and Henson (2008), two components are needed to build the
C-RUM. The first component is an intercept parameter, λ_i,0, which is defined at the item level
but not at the attribute level. The second component is a slope parameter, λ_i,1,(a), which is a
main-effect term defined at the attribute level separately for each item. The C-RUM estimates
one intercept parameter for each item and as many slope parameters as there are entries of 1 in
the Q-matrix. The item response function produces a step function, with the λ_i,1,(a) parameters
being the amount of increase in the logit for the presence of each of the attributes needed for
endorsement of an item. Rupp and Templin (2009) explain that items for which the baseline
probability, as defined through the intercept parameter, is relatively high might be problematic.
Moreover, items that poorly measure the required attributes are those for which the probability
increments, as defined through the slope parameters, are small. Finally, in the C-RUM the
respondent receives an item-specific increment through λ_i,1,(a) for each possessed attribute
measured by the item.
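To make the compensatory mechanics of Equation 4 concrete, the following sketch evaluates the C-RUM response probability for every attribute profile of a hypothetical item (Python; the parameter values are illustrative only, not estimates from this study):

```python
import numpy as np
from itertools import product

def crum_probability(lam0, lam1, alpha, q):
    """P(X = 1) for one item under the C-RUM (Equation 4)."""
    logit = lam0 + np.sum(lam1 * alpha * q)  # intercept plus possessed-attribute increments
    return 1.0 / (1.0 + np.exp(-logit))

q = np.array([1, 1, 0, 0])             # hypothetical item measuring attributes 1 and 2
lam0 = -1.5                            # illustrative intercept
lam1 = np.array([1.2, 0.8, 0.0, 0.0])  # illustrative slopes

for alpha in product([0, 1], repeat=4):
    p = crum_probability(lam0, lam1, np.array(alpha), q)
    print(alpha, round(p, 3))  # possessing either measured attribute raises the probability
```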
Cognitive Diagnostic Model Estimation Method
In order for the model to be identifiable, every parameter, including the item parameters
and the person population parameters, should be estimable by a unique value. According to
DiBello, Roussos, and Stout (2006), models can be estimated with all parameters being fixed
without prior distributions (non-Bayesian), with a portion of the parameters having prior
distributions imposed (partially Bayesian), or with all of the parameters assigned a joint prior
distribution (Bayesian). The diagnostic model estimation method used by the software in this
study is marginal maximum likelihood estimation (MMLE) using the EM algorithm. The EM
algorithm is a general iterative algorithm for item parameter estimation by maximum likelihood
when some of the random variables involved are not observed, that is, considered missing or
incomplete (Bock & Aitkin, 1981). It formalizes an intuitive idea for obtaining parameter
estimates when some of the data are missing. In the first step of the EM algorithm (the E-step),
missing values are replaced by estimated values: the data are used to compute the expected
number of examinees and the expected number of positive responses at each quadrature point.
Next, in the M-step, the quadrature results from the E-step are used to carry out MMLE of the
item parameters. The item parameter estimates are then used to re-estimate the quadrature point
distribution, and that revised set of quadrature point frequencies is used to re-estimate the item
parameters. EM cycles are repeated until convergence is nearly achieved. Convergence rates can
vary, with more complex models taking longer amounts of time. Among compensatory models,
models with intercept and main effects without interaction parameters, such as the C-RUM, can
recover their parameters more accurately than models with interaction parameters (Choi et al.,
2010). Moreover, the number of items has a significant effect on parameter estimation and
classification accuracy, as well as fit estimation accuracy (Henson et al., 2009; Tatsuoka, 1990;
Templin et al., 2008).
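The E- and M-steps can be illustrated schematically with an unrestricted latent class model for binary items; a constrained model such as the C-RUM would instead parameterize the class-conditional probabilities through the Q-matrix and the logistic form of Equation 4. This is a minimal didactic sketch, not the estimation routine used by the mdltm software:

```python
import numpy as np

def em_latent_class(X, n_classes, n_iter=200, tol=1e-6, seed=0):
    """Schematic EM for a binary latent class model; X is an (N, I) 0/1 array."""
    rng = np.random.default_rng(seed)
    N, I = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)        # class proportions
    p = rng.uniform(0.3, 0.7, size=(n_classes, I))  # P(X = 1 | class) per item
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each latent class for each respondent.
        log_lik = X @ np.log(p.T) + (1 - X) @ np.log(1 - p.T)  # (N, C)
        log_joint = log_lik + np.log(pi)
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        post = np.exp(log_joint - log_norm)
        # M-step: update class proportions and conditional item probabilities.
        pi = post.mean(axis=0)
        p = np.clip((post.T @ X) / post.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
        ll = log_norm.sum()  # marginal log-likelihood
        if ll - ll_old < tol:
            break
        ll_old = ll
    return pi, p, ll
```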
Finally, the number of latent classes increases exponentially with the number of
attributes. Thus, parameter estimation becomes more difficult to compute as the number of
attributes increases, and models with many attributes or high numbers of skills per item are in
particular danger of being nonidentifiable (DiBello, Roussos, & Stout, 1996). This has practical
implications for the current study: as described below, the number of attributes to be modeled
will range from 4 to 10. It is common that fewer than seven attributes are used per dataset (e.g.,
Hartz, 2002; Templin & Henson, 2006), and Rupp and Templin (2008) found that three, four,
and five attributes are the most common numbers of attributes for log-linear diagnostic models.
According to Galeshi and Skaggs (2015), there is limited empirical evidence examining various
sample sizes, item numbers, and skill varieties for the C-RUM. Tests with as few as 15 items and
four attributes have been examined (Templin et al., 2008), and tests with as many as 50 items
and five attributes have been examined with the LCDM approach (Kunina-Habenicht et al.,
2012). Careful consideration of parameter identifiability, together with systematic model
evaluation, helps achieve effective calibration.
Analyses for Research Question 1
Research question 1 is essentially about finding the best-fitting cognitive diagnostic
model. This section describes the plan of analysis for the overarching research question and the
sub-questions as well. For convenience, the questions are included below:
1. Can cognitive diagnostic models be applied to understanding teachers’ perceptions of
intra-organizational mechanisms for supporting policy implementation?
a. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms fit the data better than models specifying
coarser-grained attributes?
b. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms have more accurate parameter estimates
than models specifying coarser-grained attributes?
c. What are the posterior estimated proportions of teachers that are diagnosed to
fall under each latent profile?
The analysis will begin with the 77 items and the Q-matrix developed from the EFA in the
preliminary analysis. Since this is a new application, the precise number of attributes that should
be modeled is not known, although a four-attribute model is initially hypothesized based on the
literature. An EFA is used to develop the initial Q-matrix in order to build understanding of the
data and the finer-grained mechanisms. Expert opinions will also be used to refine the matrices
through the model specification process. This procedure is detailed in chapter 4. The overarching
goal in research question 1 is to establish the applicability of cognitive diagnostic models. It is
anticipated that a thorough and comprehensive comparison of cognitive diagnostic models to the
unidimensional model will establish their applicability to these particular data. If, for instance, a
unidimensional model were to fit the data better than all diagnostic models specified by this
Q-matrix, any further pursuit of this application may not be reasonable. Thus, the first research
question explores whether cognitive diagnostic models can fit the teacher perception data better
than a model that assumes a single unidimensional "support" ability score. Moreover, in
comparing the various models to the unidimensional model, diagnostic models will be compared
to one another based on the number of attributes defined in each model. It is anticipated that a
better understanding of model fit will emerge based on the number of attributes defined in the
model. Thus, models will be compared on various characteristics, including relative fit,
parameter estimates, and standard errors.
Considering that there are several sources of model misfit in specifying a cognitive
diagnostic model, several factors will be considered in answering research question 1. For
example, misfit can occur from misspecification of the general model type (i.e., compensatory
vs. non-compensatory), misspecification of the Q-matrix, misspecification of the specific model
(i.e., C-RUM vs. G-DINA), or heterogeneity issues within the population (Rupp & Templin,
2008). Due to the multiple sources of misfit, fit evaluation in cognitive diagnosis modeling can
be challenging. Moreover, there is limited empirical research comparing fit statistics and their
criteria in cognitive diagnostic modeling, and even fewer studies have been conducted to
systematically evaluate the extent to which these statistics are sensitive to model-data misfit or
useful for model selection. In this study, model fit will be determined based on comparisons of
the relative fit indices and the stability and accuracy of parameter estimates. Relative fit indices
refer to the process of selecting the best-fitting model among a set of competing models (Chen,
de la Torre, & Zhang, 2013). The comparisons of model fit indices will include the AIC, BIC,
and RMSEA. For all three indices, the model with the lowest value will indicate better overall
fit.
All three relative fit statistics are calculated as functions of the maximum likelihood, and
for all three, the fitted model with the smallest value is selected. The Akaike information
criterion (AIC) provides a relative estimate of the information lost when a given model is used to
represent the process that generates the data (Akaike, 1974). It is calculated as:

AIC = 2k - 2\ln(L)     (8)

where L is the maximized value of the likelihood function for the model and k is the number of
estimated parameters. Models with the lowest AIC are said to be the most parsimonious
(Sakamoto et al., 1986). The Bayesian information criterion (BIC) will also be used. Closely
related to the AIC, the BIC is based, in part, on the likelihood function and includes a larger
penalty term for additional parameters than the AIC. The BIC is found by:

BIC = -2\ln(L) + k\ln(n)     (9)

where n is the number of data points in the observed data (equivalently, the sample size) and k is
the number of free parameters to be estimated. Researchers studying mixture modeling
approaches suggest the use of the BIC as an indicator of fit and as the criterion for comparing
competing models (Hagenaars & McCutcheon, 2002; Magidson & Vermunt, 2004; Sclove,
1987). However, the AIC and BIC have been shown to be sensitive to sample size, test length,
and the number of model parameters (Burnham & Anderson, 2004).
In a study by Chen, de la Torre, and Zhang (2013), it was demonstrated that for relative fit
evaluation, the BIC, and to some extent the AIC, can be useful to detect misspecification of the
model, the Q-matrix, or both. In another study, Kunina-Habenicht et al. (2012) found that the
AIC and BIC were useful in selecting the correctly specified Q-matrix over misspecified
Q-matrices. Moreover, they also found that the AIC was useful in selecting the correct model
against the misspecified model when all interaction effects were omitted.
In another study, Galeshi and Skaggs (2015) examined the performance of the commonly
used relative fit indices in determining model-to-data fit for the C-RUM. The researchers
evaluated the sensitivity of the AIC and BIC in identifying model misfit/selection for six sample
sizes of 10,000, 5,000, 1,000, 500, 100, and 50 with various test lengths and numbers of
attributes under two extreme Q-matrix misspecifications: over-fit and completely reversed
Q-matrices. The results indicated that the BIC and AIC indices performed similarly for larger
datasets (N ≥ 500) but varied for smaller datasets (N < 500), suggesting a superior performance
for the BIC. Since the dataset in this study is large (N > 500), both of the aforementioned fit
statistics will be considered, and in cases where they disagree, the BIC will be used.
Another important fit statistic that will be used in this study is the RMSEA. The model
RMSEA analyzes the discrepancy between the hypothesized model, with optimally chosen
parameter estimates, and the population covariance matrix. One advantage of this statistic is that it is relatively insensitive to sample size. The RMSEA ranges from 0 to 1, with smaller values indicating
better model fit. A value of .06 or less is indicative of acceptable model fit. MacCallum, Browne
and Sugawara (1996) have used 0.01, 0.05, and 0.08 to indicate excellent, good, and mediocre
fit, respectively.
RMSEA = √((χ² − df) / (df(N − 1)))    (10)
According to Cook, Kallen, and Amtmann (2009), the model RMSEA provides an answer to the
question, “How well would the model, with unknown but optimally chosen parameter values, fit
the population covariance matrix if it were available?” It is based on an estimate of the
population discrepancy function, which assesses the error of approximation in the population.
The RMSEA is thus a measure of discrepancy; because it presents discrepancy per degree of
freedom, it is sensitive to model complexity (i.e., number of estimated parameters). Additionally,
the mdltm software will automatically produce RMSEA for individual items.
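As a brief illustration of equation 10, the sketch below computes the RMSEA point estimate from a chi-square statistic; the inputs are hypothetical, since mdltm produces these values internally.

    import math

    def rmsea(chi_square, df, n):
        # Equation 10; the numerator is truncated at zero so that models fitting
        # better than expected by chance yield RMSEA = 0 rather than a complex value.
        return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

    print(round(rmsea(chi_square=180.0, df=90, n=747), 3))  # hypothetical inputs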
The model specification will include the application of fine-grained models and coarse-
grained models to the data. To answer research question 1b, the diagnostic models will also be
evaluated in terms of the stability and accuracy of the parameter estimates. Trends of the
average parameter estimates and standard errors will be compared. For every model, the item
parameters indicate how well the items performed for diagnostic purposes. Models with more
stable and lower standard errors will be determined to be more advantageous. It is anticipated
that items measuring two attributes will more likely have larger standard errors than items
measuring one attribute. The standard errors will provide some insight into the accuracy of item
parameter estimates.
Finally, the ability of the models to discriminate between teachers who perceive support and teachers who do not perceive support will be used as a model fit index. Technically, this is a discrimination index, but similar to other studies (Jang, 2009), it will be used as one way to determine the best-fitting model. The calculation of the discrimination index is described in
Appendix F.
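Although the full calculation is given in Appendix F, the sketch below illustrates the general idea, assuming the index is the average difference, across items, in the proportion of positive responses between respondents classified as perceiving and not perceiving support (consistent with how the index is described in chapter 4); the toy data are hypothetical.

    import numpy as np

    def discrimination_index(responses, perceives_support):
        # Proportion of positive responses per item within each classification.
        p_yes = responses[perceives_support].mean(axis=0)
        p_no = responses[~perceives_support].mean(axis=0)
        # Average the per-item differences; larger values indicate better discrimination.
        return float((p_yes - p_no).mean())

    responses = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1],
                          [0, 0, 1], [0, 0, 0], [1, 0, 0]])  # 6 respondents x 3 items
    perceives_support = np.array([True, True, True, False, False, False])
    print(round(discrimination_index(responses, perceives_support), 2))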
Analysis for Research Question 2
For research question 2, the overarching goal is to explore group differences through the
application of multiple-group multinomial log-linear models. The groups being compared in this
study were selected based on logistical organizational considerations. For example, as previously
discussed, the vision is that the application of these models can be used by school districts to
organize for policy implementation. School leaders can use this type of information to organize teachers into groups based on their latent profiles and target the appropriate resources to those groups. Although the data contains numerous grouping variables, school districts are most inclined to organize resources for teachers based on grade level, subject taught, or career status. This is true for many reasons, but most importantly because of logistical constraints on how schools are organized. Teachers in the same grade levels and who teach the same subjects likely have more
similar teaching schedules, teaching strategies, and teaching philosophies. Thus, research
question 2 explores the application of mixture models defined with these multiple groups. One of
the advantages of the diagnostic model is to concurrently estimate item parameters for several
subpopulations or subgroups. In the case of teachers’ perceptions of policy implementation
support, the population includes several different subgroups, for example, grade level, subject
taught, and career status. Since this is a new application, it is necessary to explore differential
group estimation. Thus, for the second research question, the ability of the diagnostic model to
estimate group differences will be explored.
2. Are there group differences in the diagnostic model fit based on grade level, subject
taught, and career status?
a. What are the group differences of the estimated latent class profile distributions
based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of
parameter estimates based on grade level, subject taught, and career
status?
To be clear, research question 2 will not be a traditional group contingency comparison analysis
of outcomes. Nor will it be akin to controlling for potential confounding variables in a regression
model. Rather, since the goal of this study is to test a new application of the diagnostic model,
this analysis will essentially rely on the advantage of the diagnostic model to concurrently
estimate item parameters for several subgroups. This strategy of latent grouping will provide an
overall better model fit, and thus, the diagnostic outcomes will be more useful. Recall that the
proportions of teachers in each diagnostic profile are actually estimated from the model. So,
comparing groups is actually comparing whether the models are the same. The procedures, outlined in Xu and Von Davier (2008) for comparing groups, will be conducted in the mdltm software. The analysis will commence by running mdltm with all groups in the same model. This is known as a multiple group multinomial log-linear
model. This will permit item and attribute parameter estimates to be on the same scale across
groups. Similar to Xu and Von Davier (2008), the Q-matrix from the best-fitting single group
model from research question 1 will be used. Next, a two-group model will be fit, in which the two groups represent teachers who teach STEM subjects (science, technology, engineering, math) and those who do not. A second two-group model will contrast early-career and experienced teachers. After that, a three-group model defined by elementary, middle, and high school teachers will be fit. Finally, a six-group model will be defined by the full factorial design of STEM status and school level. In addition to comparisons of model fit and parameter estimates to determine whether the model fit differently for different groups, the proportions of teachers in each estimated latent class, based on the diagnostic marginal distributions, will be compared.
Next, a descriptive comparison of estimated profile distributions, marginal distributions
of each attribute, and correlations between attributes will be completed. Since the distributions
are estimated from the models, no hypothesis tests will be necessary. Following the descriptive
statistics, a comparison of single group versus multiple group using fit indices will be completed.
The statistics will include both the AIC and BIC. The likelihood ratio test will also be used since
a single group model is nested within a multiple group model. This test is sensitive to sample
size and number of parameters. Finally, there will be a comparison of item parameter estimates.
The 95% confidence intervals will be formed around each item parameter, separately for each
group. The item parameters that are different enough to lie outside these confidence intervals
will be identified.
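The sketch below illustrates both comparisons under stated assumptions: the likelihood ratio test for the nested single-group model, and a non-overlap check on 95% confidence intervals as one way to operationalize parameters that are “different enough”; all numeric values are hypothetical.

    from scipy import stats

    def lr_test_p(loglik_single, loglik_multi, df_diff):
        # G^2 = -2(lnL_single - lnL_multi), compared to a chi-square distribution.
        g2 = -2.0 * (loglik_single - loglik_multi)
        return 1.0 - stats.chi2.cdf(g2, df_diff)

    def ci_nonoverlap(est1, se1, est2, se2, z=1.96):
        # Flag an item parameter whose 95% CIs for two groups do not overlap.
        lo1, hi1 = est1 - z * se1, est1 + z * se1
        lo2, hi2 = est2 - z * se2, est2 + z * se2
        return hi1 < lo2 or hi2 < lo1

    print(lr_test_p(-30120.4, -30064.9, df_diff=70))   # hypothetical log-likelihoods
    print(ci_nonoverlap(0.95, 0.08, 1.40, 0.09))       # hypothetical slope estimates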
Running a model for each group separately would result in the item parameters being on
different scales. Thus, they would not be directly comparable. Using a multiple-group
multinomial log-linear mixture model puts the groups on the same scale to allow for proper
group comparisons (Xu and Von Davier, 2008). As discussed above, the groups being compared were selected based on logistical organizational considerations.
Specifically, the groups of interest in this study will be the available categorical teacher
level variables including grade level, subject taught, and experience. Grade level will be split
into three categories including elementary, middle, and high school. Subject taught will be a
variable where a teacher either teaches a STEM subject or not. Career status will be early career
and experienced teachers as defined by Sun (2012). As previously discussed, many grouping
variables could have been explored with cognitive diagnostic modeling. However, none of the
available grouping variables aligned with the practical goals of this study as well as grade level, subject taught, and career status did. Recall that the goal of using cognitive diagnostic modeling is to retrieve categorical outcomes that inform standards-based targeting of resources for policy implementation. The group variables in this study allow specific differential supports to be conveniently targeted by group, in the organizational sense, should that prove necessary.
CHAPTER 4
RESULTS
Preliminary Analysis: Q-Matrix Development
Data Dimensionality
As described in chapter 2, the instrument used in this study measures teachers’
perceptions of intra-organizational mechanisms for supporting teacher evaluation. Based on the
literature review, it is clear that the unidimensionality assumption is not realistic
when it comes to supporting teacher evaluation policy implementation. Multiple underlying
factors make up teachers’ perceived support of such mechanisms, although the precise number is
not empirically supported in the available body of literature. A very thorough literature review
has resulted in a four-factor hypothesis. As reviewed in chapter 2, one alternative analytic
approach for testing this hypothesis is to specify the relationships between the observed variables
and the latent variables based on the literature and then use a confirmatory factor analysis to test
the model. However, as previously discussed, a major goal of this study is to test the application
of cognitive diagnostic modeling to this data. Cognitive diagnostic modeling offers several
advantages to a factor analytic framework, such as the accommodation of multidimensionality of
the data, complex loading structures and the use of a categorical outcome variable. These
advantages, and others, are discussed in chapters 2 and 3. Additionally, cognitive diagnostic
models are confirmatory in that the skill and attribute relationships are specified a priori through
the development of a hypothesized Q-matrix.
Using the literature review combined with the existing instrument specifications is one
common approach to developing the Q-matrix for the diagnostic modeling process. Both of
these procedures are economical and convenient. However, Leighton and Gierl (2007) found that relying on these procedures alone may result in a model that is too general for diagnostic purposes and is therefore unwarranted.
investigation into the dimensionality of the data is necessary. This empirical support can be
used, in conjunction with the literature and instrument specifications, to justify the initial Q-
matrix used in this study.
Exploratory Factor Analysis
The goal of the exploratory factor analysis (EFA) is to understand the underlying
relationships of the data in order to provide empirical support for the initial Q-matrix. In an
exploratory factor analysis, the components are referred to as factors. The general factor analysis
(FA) approach is concerned with identifying the underlying factor structure that explains the
relationships between the observed variables. EFA, specifically, is based on the common or
shared variance between variables, which is partitioned from the left-over variance unique to
each variable and any error introduced by measurement. In other words, the purpose of the EFA
is to identify basic clusters of items that might measure similar attributes. These clusters can be
used to establish the initial Q-matrix.
Prior to conducting the EFA, the items were dichotomized. In chapter 3, several reasons
were provided for this decision. The main reason for dichotomizing the data was to stay
consistent with the diagnostic models to be conducted in a subsequent analysis. Moreover, an
eigenvalue and maximum likelihood-based EFA with the polytomous data was conducted and
this procedure produced a very similar result in terms of the number of factors and the factor
structure in regards to item-factor correspondence (see Appendix E).
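As a simple illustration of this preprocessing step, the sketch below dichotomizes hypothetical Likert-type responses; the 5-point scale and the cut point (responses of 4 or 5 coded as 1) are assumptions for illustration, not the exact rule documented in chapter 3.

    import pandas as pd

    likert = pd.DataFrame({"item1": [5, 3, 4, 2, 1],
                           "item2": [4, 4, 2, 5, 3]})  # hypothetical responses
    dichotomized = (likert >= 4).astype(int)           # 1 = agreement, 0 = otherwise
    print(dichotomized)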
A traditional eigenvalue-based EFA with principal axis factor extraction on the
correlation matrix was conducted. Promax rotation, one of the most popular oblique rotations,
was used because it permitted correlations among factors. Considering the context of this study,
it would be expected that factors regarding policy implementation support would correlate. In
order to determine the optimal number of factors to retain, several statistical factor selection
procedures were used. Each of these methods was based on eigenvalues. The eigenvalue of a
factor represents the amount of variance accounted for by that factor. The lower the eigenvalue,
the less that factor contributes to the explanation of variances in the variables (Norris &
Lecavalier, 2009). In addition to the eigenvalues, as recommended by Brown (2015), factor
selection was also guided by substantive considerations. According to Brown (2015), “…the validity of a given factor should be evaluated in part by its interpretability” (p. 21).
The first eigenvalue-based factor selection procedure was the Kaiser–Guttman rule
(1991). This rule states that when an eigenvalue is less than 1.0, the variance explained by a
factor is less than the variance of a single indicator. Although this method is simple and
convenient, when used alone it may lead to over-factoring or under-factoring on factors with
eigenvalues around 1. Based on the traditional eigenvalue-based EFA on the correlation matrix
composed from dichotomously scored items, ten factors had eigenvalues greater than 1 (see table
11).
Table 11. Exploratory Factor Analysis Eigenvalues by Factor
Factors Eigenvalue
Factor1 13.27
Factor2 4.33
Factor3 2.98
Factor4 2.34
Factor5 2.20
Factor6 2.18
Factor7 1.63
Factor8 1.46
Factor9 1.35
Factor10 1.02
Factor11 0.73
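For illustration, the sketch below applies the Kaiser–Guttman retention rule to the eigenvalues of an item correlation matrix; the simulated response matrix is a stand-in for the actual data, and the study itself used principal axis factoring with promax rotation rather than this bare eigen-decomposition.

    import numpy as np

    rng = np.random.default_rng(0)
    items = rng.integers(0, 2, size=(747, 77))       # simulated dichotomous responses
    corr = np.corrcoef(items, rowvar=False)          # 77 x 77 item correlation matrix
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]     # sorted largest-first
    n_retain = int((eigenvalues > 1.0).sum())        # Kaiser-Guttman: eigenvalue > 1
    print(n_retain, np.round(eigenvalues[:5], 2))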
The second factor selection procedure that was used was Cattell's (1966) scree plot. The Cattell
scree test plots the components on the X-axis and the corresponding eigenvalues on the Y-axis. As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, the scree test says to drop all components after the last substantial drop in the magnitude of the eigenvalues. This is referred to as the “elbow rule.” Since this was somewhat subjective, the scree plot was used as a guide to locate an approximate number of factors. One limitation of this procedure is that there may not be a clear drop in eigenvalues. In this method, the eigenvalues were first plotted. Next, the “elbow rule” was used to determine the last substantial drop in the magnitude of eigenvalues. Based on figure 4, there appeared to be a substantial drop in eigenvalue at approximately 10 factors.
Figure 4. Exploratory Factor Analysis Scree Plot
Based on the eigenvalues and the scree plot, the loadings of the 10-factor model were
explored for conceptual interpretability. In each case, factor loadings were used to determine
which item loaded onto which factor. A minimum loading of 0.3 was required in order for an
item to load to a factor. This 0.3 threshold was based on the recommendation by Hair, Tatham,
Anderson and Black (1998) for sample sizes greater than 350. The summary in table 12 describes
each of the 10 resulting factors. These 10 factors explained approximately 92% of the total
variance. The factor with the largest eigenvalue was the factor with items regarding the
legitimacy of the policy (factor 1). A total of 14 items loaded on this factor. After factor analysis,
seven items loaded on one underlying construct that captures policy adaptability; seven items
loaded on the factor regarding teacher confidence; seven items loaded on the factor that captured
teachers’ attitude towards the policy; seven items loaded on the factor that captured leadership
advocacy and communication regarding the policy; eight items loaded on the factor that captured
the quality of professional development provided by leadership; seven items loaded on the factor
capturing the legitimacy of the leadership; eight items loaded on the factor capturing the values
of the organization; six items loaded on the factor capturing organizational locus of decision
making; and four items loaded on the factor capturing organizational resources provided for the
policy implementation. Please see Appendix A for detailed information on item-factor loadings.
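The item-to-factor assignment rule can be sketched as follows, using a small hypothetical loading matrix; the 0.3 threshold is the one recommended by Hair, Tatham, Anderson, and Black (1998) and used in this study.

    import numpy as np

    loadings = np.array([[0.62, 0.10, 0.05],    # hypothetical rotated loadings
                         [0.08, 0.45, 0.31],    # (rows: items, columns: factors)
                         [0.12, 0.09, 0.71]])
    for item, row in enumerate(loadings, start=1):
        factors = [f + 1 for f, value in enumerate(row) if abs(value) >= 0.3]
        print(f"item {item} -> factor(s) {factors}")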
Table 12. Summary Interpretation of Factors
Factor Description
1: Policy Legitimacy
Items address legitimacy of policy in terms of its
intended impact and the sources of evidence used in the
evaluation.
2: Policy Clarity/Adaptability Items address alignment of policy with other institutional
guidelines
3: Teacher Confidence Items address teachers’ confidence on skills relating to
the policy.
4: Attitude towards policy Items address teachers' attitude towards policy and
guidelines.
5: Leadership Advocacy and
Communication
Items address how school leaders advocate for the policy
and communicate regarding the guidelines.
6: Quality of Professional Development Items address the professional development provided by
leadership on policy-related concepts
7: Leader Legitimacy Items address the ability of school leaders to effectively
lead the implementation of the policy.
8: Organizational Resources Items address the resources provided by the
organization.
9: Organizational Locus of Decision-
Making
Items address whether organizational factors allow for
teachers to be involved in decisions related to the policy.
10: Organizational Values Items address the values of the organization in terms of
policy-related concepts.
Upon closer analysis, the results indicated that although the best-fitting model had 10 factors, those factors, once understood in the context of the survey items and the literature, could be interpreted as finer grains of the originally hypothesized four coarse-grained attributes. In table 13, the four coarse-grained attributes are in the first row, and the ten interpreted factors are in the remaining rows under their corresponding attribute. This had implications for the cognitive diagnostic model building process, as it allowed for greater flexibility in specifying the model. More specifically, models with finer-grained attributes (the 10-factor model) could be fit in a way that maintained consistency with the hypothesized model. Then, factors could be combined to fit a 9-attribute model that was also consistent with the original model, as long as the factors that were combined were within the same coarse-grained attribute. As described below in more detail, the correlations between factors were used to determine which factors should be combined. The relationship between the fine-grained 10-factor interpretation and the coarse-grained 4-factor interpretation is summarized in table 13.
Table 13. Final Model: Grain Sizes

Policy:        1: Clarity/Adaptability; 2: Legitimacy
Teachers:      3: Self-Efficacy; 4: Understanding and Attitude towards the Innovation
Leadership:    5: Professional Development Quality; 6: Evaluator Legitimacy; 7: Innovation Advocacy/Communication
Organization:  8: Resources; 9: Values; 10: Locus of Decision Making
The Q-Matrix and Model Specification
Using the results from the factor analysis as a preliminary step, the final 10-attribute Q-
matrix was established. This procedure commenced with the results of the EFA being used to
determine the initial factor that each item loaded on. However, as previously described, the EFA
was only the initial step conducted in order to increase the level of empiricism of the Q-matrix.
Given that one of the advantages of cognitive diagnostic modeling is the ability to account for
inter-item variance, it needed to be determined whether items could, conceptually, be argued to
load on factors other than those they were found to in the EFA. Thus, using the results from the
factor analysis, expert opinion was sought to make additional revisions to the Q-matrix.
Specifically, two experts in educational policy and two public school teachers were consulted to
determine which items potentially load on additional factors. One educational policy expert had a
PhD in Educational Policy and was an assistant professor. The other expert was a PhD candidate
in Educational Policy. The two public school teachers were licensed to teach K-12 in the state of
Virginia. All experts were invited to discuss all 89 of the original items and the attributes and
specify the attributes measured by each item. The result from this consultation was a 10-attribute
Q-matrix with each item loading on either one or two attributes. Moreover, several items could
potentially load on to more than one or two attributes. However, the number of attributes an item
could load on to was limited to either one or two. This would ensure that the complex loading
structure was not overly complex, and that the number of parameters would not grow so large as
to prevent model convergence. This decision was made as a result of the sample size limitation
in this study. It was determined that through the model specification process, any revisions to
the Q-matrix would be based on the expert consultation. Table 14 displays an example of the
initial 10-attribute Q-matrix for the first five items.
Table 14. An Example of the Initial 10-Attribute Q-Matrix

          Policy      Teacher     Leadership       Organization
Items     1    2      3    4      5    6    7      8    9    10
1         0    0      0    0      0    1    0      1    0    0
2         0    0      0    0      0    1    0      1    0    0
3         0    0      0    0      1    0    0      0    0    0
4         0    0      0    0      1    0    0      0    0    0
5         0    0      0    0      1    0    0      0    0    0
…         …    …      …    …      …    …    …      …    …    …
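A minimal sketch of this Q-matrix as a data structure, using the five rows shown in table 14, with a check on the one-or-two-attribute constraint adopted in this study:

    import numpy as np

    q_matrix = np.array([[0, 0, 0, 0, 0, 1, 0, 1, 0, 0],   # rows: items 1-5
                         [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],   # columns: attributes 1-10
                         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])
    loads = q_matrix.sum(axis=1)
    assert ((loads >= 1) & (loads <= 2)).all(), "item violates the 1-2 attribute limit"
    print(loads)  # number of attributes measured by each item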
Findings
Research Question 1: Testing the New Application of Cognitive Diagnostic Models
To answer research question 1, models were compared to the unidimensional model and
to other models. First, a unidimensional model was run in order to establish a baseline
comparison model. Comparisons were made between model fit and parameter estimates. The
best-fitting model was chosen and the diagnostic information from that model was interpreted.
To begin answering research question 1, the 10-attribute model was fit. As anticipated, fitting the
10-attribute cognitive diagnostic model was quite complex because of the number of parameters
that needed to be estimated. The model based on the initial Q-matrix did not converge. Adjustments were made to
the initial Q-matrix in order to fit different interpretations of the Q-matrix for the 10-attribute
model. These adjustments were based entirely on the previously discussed expert consultation.
After several modifications, a 10-attribute model converged. Further revisions were made to the
Q-matrix, again, based on expert consultation. In total, two 10-attribute models converged. Using
the best-fitting 10-attribute model based on the AIC, BIC, and RMSEA, the two highest
correlated attributes were used to determine how the finer-grained attributes would be combined
in order to develop the initial Q-matrix for the 9 attribute model. The Q-matrix entries of the two
attributes with the highest bivariate correlations were then combined in order to develop the 9-
attribute model. Specifically, the two attributes called “organizational values” and
“organizational locus of decision making” were combined. After multiple conceptually adequate
solutions were found for the 9-attribute model, the best-fitting model attribute correlations were
again used to specify the initial Q-matrix for the 8-attribute model. This process was repeated for
all of the initial 4 through 9 attribute models. One guiding rule for this technique was that
attributes were not to be combined unless they were under the same coarse-grained umbrella. For
example, regarding the 10-attribute model in table 13, the highest correlated attributes were
attributes 9 and 10, thus, those attributes were combined to establish the initial 9-attribute model.
However, if the highest correlation had actually been between attributes 7 and 8, followed by 9 and 10, attributes 9 and 10 would still have been the attributes combined for the 9-attribute model, because attributes 7 and 8 fall under different coarse-grained attributes. The same procedure would occur to fit the models with four through eight
attributes. In each instance, the same rules were applied.
In total, there were two 10-attribute models, two 9-attribute models, two 8-attribute models, and three 7-attribute models. Each model was described in terms of the number of 1-attribute items and the number of 2-attribute items. The number of items was not changed
following the initial item analysis in chapter 3. This permitted the use of AIC and BIC fit
comparisons between models. The total number of models that were fit for each category (i.e., 10-attribute, 9-attribute, 8-attribute) was limited by the sample size and substantive interpretation. First, the number of models depended on whether the sample size was sufficient to estimate every parameter and reach model convergence. Second, the number of parameters in each model was limited by conceptual considerations. As previously discussed, any modifications to the Q-matrices were based entirely on expert consultation. Thus, there were a limited number of modifications that could be made to each Q-matrix. For example, there were three separate 7-attribute models displayed. Each of these models began with the 10-attribute Q-
matrix solution from the exploratory factor analysis.
As previously described, this resulted in each item being assigned to either one or two attributes. As is evident in table 15, model 1 had 36 items assigned to one attribute and 41 items assigned to two attributes. Expert judgments were used to modify the relationships between items and attributes. In model 2, 5 of the items that were assigned to one attribute in model 1 were instead assigned to two attributes. Although all potential modifications were exhausted for each category, and all models were run, fewer of the higher-attribute models actually converged than of the lower-attribute models.
Table 15. Fit Results for Models with 7-10 Attributes

Model Fit for Tests, Items (i) = 77, Attributes (K) = 7
Model  One-Attribute  Two-Attribute  AIC       BIC       RMSEA  j/k*   DI**  Misfit (N)***
1      i = 36         i = 41         60258.59  60271.04  0.35   16.86  0.37  26
2      i = 31         i = 46         60194.19  60209.73  0.32   17.57  0.34  26
3      i = 26         i = 51         60141.01  60159.63  0.37   18.28  0.33  26
Mean                                 60197.93  60213.47  0.35   17.22  0.35  26

Model Fit for Tests, Items (i) = 77, Attributes (K) = 8
Model  One-Attribute  Two-Attribute  AIC       BIC       RMSEA  j/k    DI    Misfit (N)
1      i = 55         i = 22         60092.75  60006.73  0.21   15.12  0.11  49
2      i = 45         i = 32         60859.18  60925.49  0.22   15.88  0.36  65
Mean                                 60475.97  60466.11  0.34   15.50  0.23  57

Model Fit for Tests, Items (i) = 77, Attributes (K) = 9
Model  One-Attribute  Two-Attribute  AIC       BIC       RMSEA  j/k    DI    Misfit (N)
1      i = 29         i = 48         60499.43  60593.44  0.34   13.00  0.25  71
2      i = 24         i = 53         60366.38  60483.47  0.32   13.56  0.23  73
Mean                                 60432.91  60538.46  0.33   13.28  0.24  72

Model Fit for Tests, Items (i) = 77, Attributes (K) = 10
Model  One-Attribute  Two-Attribute  AIC       BIC       RMSEA  j/k    DI    Misfit (N)
1      i = 39         i = 38         60461.49  60472.41  0.36   10.76  0.21  70
2      i = 29         i = 48         60182.51  60318.06  0.32   10.50  0.18  73
Mean                                 60322.00  60395.24  0.32   10.50  0.19  73

Notes: *Items per attribute (j/k), calculated by summing the number of items assigned to each attribute and dividing by the number of attributes. **Discrimination index (DI); calculation described in Appendix F. ***Misfit, calculated by summing the number of items with RMSEA > 0.10.
As can be seen in table 15, many of these models performed very poorly in terms of fit
indices. Among the 7-attribute models, model 3 had the lowest AIC and BIC. However, this
model still had an RMSEA well above 0.1. Moreover, not a single model with 7 or more
attributes had a reasonable RMSEA. Most of these models also had many mis-fitting items. Between the 7-10 attribute models, a trend clearly emerged in the number of mis-fitting items. Despite the lack of fit with these models, they still did an adequate job of discriminating.
Among the 7-attribute models, the average difference between the proportion of positive responses for teachers who perceived support and those who did not was 0.35 (see Appendix F). A higher discrimination index (DI) implied a better model. Although the DI increased from the 10-attribute model to the 7-attribute model, the trend was not as clear as it was for the misfit statistic.
Due to the poor fit of all the models described in table 15, none of the models were found to be
useful in terms of diagnosing teachers’ perceptions of intra-organizational support mechanisms.
This was an anticipated result. Had a higher-attribute model fit the data well, problems still
would have existed with the practical application of the diagnostic interpretation. For example, if the 10-attribute model had fit the data best, there would have been 2^10 = 1,024 latent profiles that teachers could potentially be diagnosed to fall under. This would not have been practical for the proposed application of this study. As can be seen in table 15, model 1 with 8 attributes (K = 8)
was the best fitting model with the lowest AIC, BIC, and RMSEA of all the models. As
previously discussed in more detail, the literature recommended that when AIC and BIC
disagree, the BIC is to be relied upon. In addition, the RMSEA was also used as an index to
ensure that any model that was determined to be the “best-fitting” model by the AIC and BIC,
did, in fact, have adequate model fit. As previously described, adequate RMSEA values are those
less than 0.1. For the best-fitting model in table 15, it was shown that 55 of 77 items were
assigned to one attribute and 22 of the 77 items were assigned to two attributes. The higher
number of items assigned to one attribute could have explained the slightly better overall fit.
However, even though both the AIC and BIC suggested that this model fit better than all other 7,
8, 9, and 10-attribute models, the RMSEA values for all sets of models were greater than 0.1,
suggesting that, although convergence was achieved for these models, the models did not fit the
data well enough to be used for interpretation.
As the number of attributes in the model decreased, the model seemed to fit the data better. This trend was captured by the fit statistics in tables 15 and 16. However, the trend was not quite so clear when the models had 4, 5, or 6 attributes, as displayed in table 16. Although the RMSEA was lowest for the sets of 4- and 5-attribute models, the AIC and BIC were only lowest for the 4-attribute models. Each of the 5-attribute models had higher average AIC and BIC values than the models in the other sets. In general, the models seemed to discriminate similarly, although the 4-attribute models had a slightly higher ability to discriminate.
Based on these results, the RMSEA suggested that the best-fitting model was among the 4- or 5-attribute models. In fact, except for model 5 in the set of 6-attribute models (RMSEA = 0.10), all other models with six or more attributes fit the data quite poorly (RMSEA > 0.10). The AIC and BIC suggested that the best-fitting model was among the set of 4-attribute models. Moreover, the 4-attribute models were the only models in which the AIC and BIC were lower than the unidimensional model values.
Table 16. Fit Results for Unidimensional Model and Models with 4-6 Attributes

Model Fit for Tests, Items (i) = 77, Attributes (K) = 1
Model  One-Attribute  Two-Attribute  AIC      BIC      RMSEA  j/k*   DI**  Misfit (N)***
1      i = 77         i = 0          59764.7  60475.6  0      77     0     0

Model Fit for Tests, Items (i) = 77, Attributes (K) = 4
Model  One-Attribute  Two-Attribute  AIC      BIC      RMSEA  j/k    DI    Misfit (N)
1      i = 59         i = 18         58735.8  59557.5  0.01   23.75  0.41  13
2      i = 53         i = 24         58821.8  59675.8  0.01   25.5   0.41  13
3      i = 46         i = 31         58771.8  59563.4  0.01   27     0.39  13
4      i = 42         i = 35         58746.2  59626.4  0.03   28     0.4   13
5      i = 40         i = 37         58777.6  59582.4  0.06   28.75  0.38  12

Model Fit for Tests, Items (i) = 77, Attributes (K) = 5
Model  One-Attribute  Two-Attribute  AIC      BIC      RMSEA  j/k    DI    Misfit (N)
1      i = 55         i = 22         60921.2  61719.8  0.05   17.4   0.35  25
2      i = 49         i = 28         60310.9  61192.5  0.03   21.00  0.31  19
3      i = 48         i = 29         60566.3  61401.8  0.05   19     0.34  20
4      i = 44         i = 33         60070.7  61929.3  0.01   20     0.34  21
5      i = 37         i = 40         60062    61952.9  0.01   21.4   0.32  23

Model Fit for Tests, Items (i) = 77, Attributes (K) = 6
Model  One-Attribute  Two-Attribute  AIC      BIC      RMSEA  j/k    DI    Misfit (N)
1      i = 61         i = 16         59458.6  60308    0.21   15.6   0.39  36
2      i = 57         i = 20         59718.5  60586.3  0.13   16.17  0.37  31
3      i = 52         i = 25         59721.5  60612.4  0.13   17     0.37  26
4      i = 50         i = 27         59655.6  60574.2  0.14   18     0.38  28
5      i = 47         i = 27         59622.1  60563.8  0.1    18.83  0.37  25

Notes: *Items per attribute (j/k), calculated by summing the number of items assigned to each attribute and dividing by the number of attributes. **Discrimination index (DI); calculation described in Appendix F. ***Misfit, calculated by summing the number of items with RMSEA > 0.10.
The parameter estimates and the standard errors for the parameter estimates were averaged and compared across each model. Figure 5 shows the trends in these averages across the sets of models. The line in the figure should not be interpreted to demonstrate any relationship between the models; rather, it was included to show the increase in the average standard errors as more attributes were included in the model. A similar figure was used by Halpin and Kieffer (2015) in their application of latent class models to teacher performance scores. Clearly, the parameters were estimated less accurately as the number of attributes increased, with the 10-attribute model showing the highest standard errors. The unidimensional model had the most accurate estimates, with the 4-attribute model a close second. This finding supported the results from the model-fit comparisons in that the 4-attribute model appeared comparable to, and in many cases a better fit than, the unidimensional model. However, the other models did not fit the data as well.
Figure 5. Average Standard Errors of Intercept Parameter Estimates
In table 17, the average parameter estimates were summarized for the three top-
performing 4-attribute models based on the previously discussed fit statistics. Among these
models, there did not appear to be major differences in the parameter estimate distributions. Lower slopes indicated less of an influence of perceived support on the
probabilities of positive responses on items. The average slope parameters were fairly
reasonable across all three models. However, model 2 and model 3 each had a minimum slope
that turned out to be negative. This indicated either a poor item or a misspecification in the Q-
matrix for the item. Looking at the “best-fitting” model, the maximum intercept parameter on
items that measured more than one attribute was 5.25. This would imply that for this particular
item, perceiving support on none of the attributes would result in a probability of 0.99 of
answering positively to the item. In effect, nearly all respondents agreed that they received support on this item. Thus, the item contributed very little empirically to the model and was marked for review. No other item fit as poorly as this item; however, there were 11 other items flagged with intercepts approximately equal to or greater than 1. With the current model, an intercept parameter with the value of 1 implied a probability of 0.73 of a positive response to an item when none of the attributes had overall perceived support. Although there was no official guideline for an acceptable probability, this high probability meant that almost all respondents agreed, no matter their profile. In other words, it was easy to perceive support on these items, so they did not discriminate well between the different profiles. Moreover, the RMSEA value for each of these items was greater than 0.09, which indicated that these items did not fit. With 77 items and a type-I error rate set at α = 0.05, one would expect approximately 4 items to misfit.
substantive review of the items, there were no obvious reasons as to why the specific items fit the
model so poorly. Moreover, since there were so many quality items available, the model
specification process was further investigated by removing the most egregious items from the
analysis in terms of the values of the parameter estimates.
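The probabilities quoted above follow directly from the model’s logistic link, as the sketch below shows for the two intercepts discussed; this is arithmetic implied by the model, not additional output from the analysis.

    import math

    def baseline_probability(intercept):
        # Probability of a positive response when no attribute is perceived as
        # supported, so only the intercept contributes to the logit.
        return 1.0 / (1.0 + math.exp(-intercept))

    print(round(baseline_probability(1.00), 2))   # 0.73
    print(round(baseline_probability(5.25), 2))   # 0.99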
Table 17. Comparisons of 4-Attribute Model Average Parameter Estimate Distributions

Model 1 (77 items, 4 attributes; i = 59 with K = 1, i = 18 with K = 2)
Parameters                  μ       σ      min     max    N
One-attribute intercepts   -0.311   1.383  -3.340  2.841  59
One-attribute slopes        0.977   0.644   0.007  4.000  59
Two-attribute intercepts   -0.205   1.653  -1.786  5.251  18
Two-attribute slopes        0.711   0.698   0.001  4.000  36

Model 2 (77 items, 4 attributes; i = 46 with K = 1, i = 31 with K = 2)
Parameters                  μ       σ      min     max    N
One-attribute intercepts   -0.390   1.387  -3.245  2.487  46
One-attribute slopes        0.928   0.561   0.020  2.249  46
Two-attribute intercepts   -0.299   1.502  -2.215  5.310  31
Two-attribute slopes        0.703   0.651  -0.087  4.000  62

Model 3 (77 items, 4 attributes; i = 42 with K = 1, i = 35 with K = 2)
Parameters                  μ       σ      min     max    N
One-attribute intercepts   -0.305   1.392  -3.349  2.547  42
One-attribute slopes        0.994   0.576   0.072  2.952  42
Two-attribute intercepts   -0.191   1.672  -3.083  5.008  35
Two-attribute slopes        0.625   0.720  -0.038  4.000  70
Using the best-fitting 4-attribute model, all items with intercept parameters greater than 1 were flagged. Next, the item with the largest intercept (λi,0 = 5.25) was removed from the analysis.
Several different combinations of poorly fitting items were removed and models were compared
in terms of the resulting total number of mis-fitting items and the RMSEA. The final model had 70 items, with AIC = 54232.50 and BIC = 54980.30, both much lower than the values for any of the 77-item models. Moreover, the RMSEA = 0.01, which was equivalent to that of the best-fitting 77-item model.
The number of items per attribute in model 2 decreased, as anticipated. It was also
anticipated that the AIC and BIC would decrease substantially. Although the total number of poorly fitting items decreased only slightly, the items that did fit poorly were not nearly as
extreme in terms of how poorly they fit. This is evident by looking at the distribution of the
parameters in table 18. The maximum value for the intercept of items that loaded on one attribute
was 2.64 as opposed to 2.84 with the 77-item model. An even larger discrepancy occurred
between the maximum intercept value among items that loaded on two attributes. For the 77-
item model, the maximum value when K = 2 was 5.25. Although the maximum value was still
fairly high with the 70-item model (λi,0 = 2.42), it was notably lower.
Table 18. Parameter Distributions for Final Model

Model 1 (70 items, 4 attributes)
Parameters                  μ      σ     min    max   N
One-attribute intercepts   -0.25  1.39  -2.92  2.64  54
One-attribute slopes        1.04  0.72   0.02  3.77  54
Two-attribute intercepts    0.43  1.31  -1.97  2.41  16
Two-attribute slopes        0.69  0.45   0.11  1.59  32
Interestingly, the standard errors of the 70-item model were quite comparable to the
standard errors of the unidimensional model. This meant that the parameters were measured
nearly as accurately as they were with the unidimensional model. Among the items with the
lowest standard errors, the 4-attribute model actually had slightly lower standard errors than the
unidimensional model. Model 1 had the highest average standard error. Model 1 represented the
best-fitting 4-attribute, 77-item model. The average standard error was 0.097. Model 2 was the
revised 4-attribute, 70-item model, and it had an average standard error of 0.093. Finally, model
3 represented the unidimensional model and it had an average standard error of 0.091.
Based on the model fit indices and the parameter estimates, it was clear that the best-
fitting model was the 4-attribute, 70-item model. Having selected the best-fitting model, it could be applied to understanding teachers’ perceptions, and the substantive output from the model could be interpreted. Recall that the statistical purpose of cognitive diagnostic
modeling was to “…develop a multivariate profile of teachers’ perceptions based on classifying
them according to their degree of mastery on each of the traits” (Rupp & Templin, 2008, p. 226).
Thus, substantively, we were interested in attaining the detailed diagnostic profiles that promote
assessment for learning through modification in target areas (Jang, 2009). Based on the best
fitting model, the next step was to use teachers’ observed responses to estimate the person
parameters in order to understand the estimated diagnostic proportions of teachers in each of the
profiles.
To review, the initial procedure commenced with fitting models with different grain-
sizes. Grain sizes referred to the scope or level of specificity of attributes. A finer grain, for
example, could be adding and subtracting, whereas a coarser grain may be whole number
operations. It was found that the best-fitting model was a 4-attribute model with 70 items. Thus,
the diagnostic output from this model was subsequently selected to be used in learning about
teachers’ perceptions of policy implementation. The advantage of using the diagnostic model
was that instead of a continuous ability estimate, the diagnostic model estimated the probability
that a respondent had mastered each attribute (Rupp & Templin, 2008). In this study, instead of
“mastery,” the descriptor was actually “perceived support.” If that probability of perceived
support was greater than 0.5, the respondent was classified as having perceived support on the
attribute. For each respondent, a profile resulted in which perceived support or lack of perceived
support on each attribute was estimated. To clarify, in this application the attributes were the support mechanisms, and mastery indicated perceived support whereas non-mastery indicated a lack of perceived support.
Typically, attributes have been defined by content knowledge, cognitive skills, or mental
processes. In this new application, we were interested in modeling the perceptions of support for
policy implementation. Through this process, diagnostic modeling produced output for
respondents as a profile of perceived support and non-perceived support of the attributes.
Teachers were estimated to fit a specific profile based on the posterior probabilities that the
respondent perceived support on each mechanism in the latent profile results. Since the outcomes
of interest in this study were the latent categorical outcomes, the probabilities were not
presented, but the categorical interpretations of the probabilities were. Moreover, since this study
was a new application of this methodology, table 19 was provided as an example of the posterior probabilities of perceiving support on each attribute for five teachers, for didactic purposes.
Table 19. Example of Estimated Posterior Probabilities of Attribute Perceived Support By
Respondent
ID 1: Policy 2: Teacher 3: Leadership 4: Organization
9001 0.56 0.76 0.26 0.47
9002 0.28 0.29 0.21 0.26
9003 0.76 0.71 0.74 0.68
9004 0.08 0.47 0.21 0.16
9005 0.64 0.82 0.65 0.63
--- --- --- --- ---
Based on the probabilities in table 19, the software assigns a diagnostic, standards-based categorical outcome profile for each teacher. The resulting estimated profiles for each of the respondents, based on the probabilities in table 19, are included in table 20.
Table 20. Example of Estimated Latent Profiles Based on Posterior Probabilities By Respondent
ID 1: Policy 2: Teacher 3: Leadership 4: Organization
9001 1 1 0 0
9002 0 0 0 0
9003 1 1 1 1
9004 0 0 0 0
9005 1 1 1 1
--- --- --- --- ---
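The mapping from table 19 to table 20 is a simple 0.5 threshold on each posterior probability, as the sketch below reproduces:

    import numpy as np

    # Posterior probabilities for teachers 9001-9005 from table 19
    # (columns: Policy, Teacher, Leadership, Organization).
    posteriors = np.array([[0.56, 0.76, 0.26, 0.47],
                           [0.28, 0.29, 0.21, 0.26],
                           [0.76, 0.71, 0.74, 0.68],
                           [0.08, 0.47, 0.21, 0.16],
                           [0.64, 0.82, 0.65, 0.63]])
    profiles = (posteriors > 0.5).astype(int)   # reproduces the profiles in table 20
    print(profiles)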
The estimated posterior probabilities were functions of item response patterns in addition to the
estimated base rates. In the example, the teacher with ID 9001 had a higher model-estimated
probability of perceiving support on support mechanisms regarding the “characteristics of
teachers” than teacher 9002. The latent categorical outcome of teacher 9001 would be 1100
because the probabilities of this teacher perceiving support on attributes 1 (Policy) and 2 (Teacher) were greater than the a priori 0.5 threshold, and the probabilities of this teacher perceiving support on attributes 3 and 4 were less than 0.5. In addition, the responses provided
by teacher 9005 were more representative of a fully supported teacher in regards to policy
implementation.
The distribution of the skill pattern profiles for the best-fitting 4-attribute model was
summarized in table 21. These proportions were estimated using a log-linear structural model.
The proportion of respondents in each profile is a model parameter to be estimated. The log-
linear model reduces the number of parameters. In particular, some profiles have low proportions
and are therefore difficult to estimate. Although it is not clear from the literature whether to use a probability-based or a log-linear model, one of the advantages of using the log-linear model to represent the C-RUM is that it can be used to identify a suitable model by placing parameter restrictions within a very flexible general model (Rupp, Templin, & Henson, 2007). The log-linear representation of the C-RUM models the conditional probability that a respondent with a specific attribute profile (as depicted in table 21) provides a positive response to the item. For a Q-matrix with 4 attributes, the first set of elements in the log-linear equation represents the 4 main-effect parameters: the first element corresponds to the main effect for attribute 1 if attribute 1 is measured by the item, the second element corresponds to the main effect for attribute 2 if attribute 2 is measured by the item, and so on for each attribute in the model. Once all 4 attribute main effects are accounted for, the second set of elements accounts for the two-way interactions (Rupp, Templin, & Henson, 2007).
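A minimal sketch of the main-effects portion of this log-linear representation is shown below; the intercept and slope values are hypothetical, and the two-way interaction terms mentioned above are omitted for brevity.

    import math
    from itertools import product

    def crum_probability(profile, q_row, intercept, main_effects):
        # logit = intercept + sum of main effects for attributes that are both
        # measured by the item (q = 1) and perceived as supported (alpha = 1).
        logit = intercept + sum(b * q * a for b, q, a
                                in zip(main_effects, q_row, profile))
        return 1.0 / (1.0 + math.exp(-logit))

    q_row = (1, 0, 1, 0)   # hypothetical item measuring attributes 1 and 3
    for profile in product((0, 1), repeat=4):
        p = crum_probability(profile, q_row, -0.3, (0.9, 0.5, 0.7, 0.4))
        print(profile, round(p, 2))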
The model-estimated proportions were contrasted with the probability-based proportions in table 21; the larger differences occurred in the less frequent profiles.
Table 21. Distribution of Diagnostic Categorical Profiles for 4-Attribute Model
Profile Log-Linear Model Probability-Based
Policy Teacher Leadership Organization Percent (%) N Percent (%) N
p1: 0 0 0 0 16.20 121 16.73 125
p2: 1 0 0 0 1.20 9 3.61 27
p3: 0 1 0 0 23.43 175 21.02 157
p4: 1 1 0 0 16.29 47 23.29 174
p5: 0 0 1 0 0.80 6 0.13 1
p6: 1 0 1 0 3.35 25 1.61 12
p7: 0 1 1 0 0.54 4 0.00 0
p8: 1 1 1 0 7.50 56 4.28 32
p9: 0 0 0 1 0.00 0 0.00 0
p10: 1 0 0 1 0.27 2 0.13 1
p11: 0 1 0 1 0.00 0 0.00 0
p12: 1 1 0 1 3.48 26 4.28 32
p13: 0 0 1 1 0.00 0 0.00 0
p14: 1 0 1 1 2.54 19 1.47 11
p15: 0 1 1 1 0.00 0 0.00 0
p16: 1 1 1 1 24.40 257 23.43 175
Table 21 showed that the model estimated that 8 profiles accounted for most of the
teachers. Moreover, the perceptions of approximately one quarter of the total number of
teachers (N = 747) placed them in profile 16. Profile 16 represented the group of teachers
who were estimated to have perceived support on each of the four attributes. Profile 1
(16.20%) represented the group of teachers who were estimated to not have perceived
support on any of the attributes based on their responses. The next most common profile
outcome was profile 3 (23.43%), which represented perceived support of the
characteristics of teachers, but no other attribute. There were three profiles that had no
teachers.
The profile distributions and the proportions of teachers perceiving support on each attribute were compared between the log-linear model and the estimates for individuals. More specifically, based on each participant’s probability of perceiving support on an attribute, they were assigned a profile: a participant was assigned a 1 for an attribute if they had a greater than 0.5 probability of perceiving support on that attribute. The proportions of these assignments were compared to the proportions estimated by the log-linear model. Thresholding the probabilities in this way distorted the distribution relative to the log-linear model estimates; for example, respondents with only a 0.51 probability of perceiving support on an attribute were still categorized as having “perceived support” on the attribute.
Based on the single-group total distribution, it appeared that the majority of
teachers perceived support on characteristics of teachers because approximately 75% of
teachers belonged to a profile that indicated this (see table 22). This was found by
summing the total number of teachers that belonged to a profile that indicated perceived
support for this attribute. Mdltm provided the proportion of teachers perceiving support
on each attribute by counting them directly, not from a log-linear model. Recall that this
attribute was measured by items that focused on teachers’ confidence in their own skills
related to the policy implementation. Furthermore, almost 60% of teachers were estimated to belong to profiles with perceived support on characteristics of the policy. This indicated that the model estimated that more teachers perceived support from characteristics of the actual policy than did not. Finally, only
49% of teachers were estimated to be in profiles with perceived support in characteristics
of leadership and 41% of teachers were estimated to be in profiles with support related to
the characteristics of the organizations.
Table 22. Percentage of Teachers in Profiles with Perceived Support by Attribute
          Policy        Teacher       Leadership    Organization
          %      N      %      N      %      N      %      N
          59.03  441    75.64  565    49.13  367    40.69  304
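The N values in table 22 match what one obtains by marginalizing the log-linear profile counts in table 21; the sketch below verifies this arithmetic.

    # Log-linear model counts from table 21; profile keys are
    # (Policy, Teacher, Leadership, Organization).
    profiles = {(0,0,0,0): 121, (1,0,0,0): 9,  (0,1,0,0): 175, (1,1,0,0): 47,
                (0,0,1,0): 6,   (1,0,1,0): 25, (0,1,1,0): 4,   (1,1,1,0): 56,
                (0,0,0,1): 0,   (1,0,0,1): 2,  (0,1,0,1): 0,   (1,1,0,1): 26,
                (0,0,1,1): 0,   (1,0,1,1): 19, (0,1,1,1): 0,   (1,1,1,1): 257}
    for k, name in enumerate(("Policy", "Teacher", "Leadership", "Organization")):
        n = sum(count for profile, count in profiles.items() if profile[k] == 1)
        print(name, n)   # prints 441, 565, 367, 304, matching table 22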
From tables 21 and 22, one can infer that the model estimated that many teachers perceived support related to characteristics of the policy. Moreover, the model estimated that teachers were generally confident in their abilities to meet the guidelines outlined by the policy. However, the lower proportions for characteristics of the leadership and the organization indicated that teachers perceived less support in these areas. Although there were differences among the models with multiple groups in terms of the proportions of teachers belonging to profiles with perceived support on each attribute, these differences were quite minimal.
Research Question 2: Exploring Group Comparisons Using the Diagnostic Model
One of the advantages of the general diagnostic model as implemented in mdltm is to
concurrently estimate item parameters for several subpopulations or subgroups. In the case of
teachers’ perceptions of policy implementation support, the population includes several different
subgroups, for example, grade level, subject taught, and career status. Since this is a new
application, it is necessary to explore differential group estimation. Thus, for the second research
question, the ability of the diagnostic model to estimate group differences was explored.
Specifically, the following research questions were asked:
2. Are there group differences in the diagnostic model fit based on grade level,
subject taught, and career status?
a. What are the group differences of the estimated latent class profile
distributions based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of
parameter estimates based on grade level, subject taught, and career
status?
As previously discussed, running a model for each group separately would result in the item
parameters being on different scales. Thus, they would not be directly comparable. Using a
multiple-group multinomial log-linear mixture model (Xu and Von Davier, 2008) puts the
groups on the same scale to allow for proper group comparisons. The groups being compared in
this study were selected based on logistical organizational considerations. For example, as
previously discussed, the vision was that the application of these models could be used to organize for policy implementation. School leaders could use this type of information to organize teachers into groups based on their latent profiles and target the appropriate resources to those groups. Although the data contained numerous grouping variables, school districts were most inclined to group teachers based on grade level, subject taught, or career status. This was true for many reasons, but most importantly because of logistical constraints on how schools are organized.
Teachers in the same grade levels and who teach the same subjects likely have more similar
teaching schedules, teaching strategies, or teaching philosophies. Thus, research question 2
explored the application of mixture models defined with these multiple groups.
For this analysis, grade level was split into three categories including elementary, middle,
and high school. For subject taught, teachers either taught a STEM subject or did not. STEM subjects included science, technology, or math-related subjects. Those teachers who taught both STEM and non-STEM subjects were classified as STEM teachers. Career status was early career and
experienced teachers. Early career teachers were those teachers within the first five years of their
careers, whereas experienced teachers were those teachers with more than five years of teaching
experience. As previously discussed, many grouping variables could have been explored with
cognitive diagnostic modeling. However, none of the other available grouping variables aligned with the practical goals of this study as well as grade level, subject taught, and career status did. Recall that the goal of using cognitive diagnostic modeling is to retrieve categorical outcomes that inform standards-based targeting of resources for policy implementation. The group variables in this study allow specific differential supports to be conveniently targeted by group, in the organizational sense, should that prove necessary.
Five models, with G = 1, 2, 2, 3, and 6 groups, were used to analyze the data. When G = 1, the model was
called a single-group model. This is the best-fitting model from research question 1 where no
sub-group differences were accounted for. When G = 2, the two groups, S1 = STEM and S2 =
non-STEM were identified by the subject variable. For the second model where G = 2, the two
groups, E1 = early career and E2 = experienced were identified by the career-status variable. For
the case of G = 3, the grade level variable was used as the indicator of the subgroups L1 =
elementary, L2 = middle, and L3 = high school. When G = 6, a complete factorial design of
grade level and subject was defined and used as the indicator variable for the subgroups. These
five models were compared using the following: model fit, item parameter estimates, and the marginal distribution of each latent random variable.
First, the mdltm software was used to run the multiple group multinomial log-linear models for multidimensional skill distributions in the general diagnostic model (Xu and Von Davier, 2008). The distributions of the latent classes from the analysis are shown in table 23. The results showed that the marginal distributions resulting from these models were fairly similar. The most notable discrepancy was between profiles 8 and 16 across the models. For model 1, the largest proportion of teachers was estimated to fall into profile 8; however, for every other model, profile 16 was the most common profile. Profile 8 included perceived
support on all attributes except for characteristics of the organization. Profile 16 included
perceived support of all attributes. Thus, the only difference between the definitions of these
profiles was whether or not teachers were estimated to have perceived support on characteristics
of the organization. It should be noted that the proportions are estimated across all groups and
profiles in the model, and not for each group separately. Thus, the columns in table 23 do not
sum to 1, but the set of columns for each model sum to 1.
Table 23. Estimated Group Proportions by Profile
Profile Single | Model 1: L1 L2 L3 | Model 2: S1 S2 | Model 3: E1 E2 | Model 4: L1_S1 L1_S2 L2_S1 L2_S2 L3_S1 L3_S2
p1: 0000 0.16 0.07 0.04 0.05 0.09 0.06 0.07 0.08 0.03 0.03 0.03 0.01 0.04 0.01
p2: 1000 0.01 0.02 0.01 0.01 0.00 0.05 0.03 0.01 0.00 0.04 0.00 0.01 0.01 0.00
p3: 0100 0.23 0.07 0.07 0.06 0.09 0.05 0.11 0.06 0.05 0.02 0.02 0.01 0.05 0.01
p4: 1100 0.16 0.02 0.05 0.04 0.06 0.08 0.06 0.04 0.02 0.06 0.07 0.00 0.03 0.03
p5: 0010 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p6: 1010 0.03 0.03 0.00 0.01 0.01 0.01 0.05 0.00 0.01 0.01 0.00 0.00 0.01 0.01
p7: 0110 0.01 0.02 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p8: 1110 0.08 0.18 0.02 0.02 0.01 0.02 0.13 0.00 0.08 0.00 0.01 0.01 0.02 0.00
p9: 0001 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p10: 1001 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00
p11: 0101 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p12: 1101 0.03 0.00 0.01 0.00 0.05 0.01 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00
p13: 0011 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p14: 1011 0.03 0.01 0.01 0.01 0.03 0.01 0.01 0.02 0.00 0.01 0.01 0.00 0.00 0.00
p15: 0111 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p16: 1111 0.24 0.03 0.05 0.07 0.23 0.10 0.12 0.16 0.01 0.08 0.04 0.02 0.05 0.01
Notes: Single refers to single-group model;
Model 1: L1 refers to elementary teachers, L2 refers to middle school teachers, L3 refers to high-school teachers;
Model 2: S1 refers to STEM teachers, S2 refers to non-STEM teachers;
Model 3: E1 refers to early-career teachers, E2 refers to experienced teachers;
Model 4: Factorial design uses conventions from previous models
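To make the normalization property noted above concrete, the following is a minimal sketch (Python) using a toy group-by-profile matrix rather than the actual estimates from table 23:

    import numpy as np

    # probs[c, g] = estimated joint proportion of profile c and group g for
    # one multi-group model. Values here are hypothetical placeholders.
    probs = np.array([
        [0.30, 0.20],   # toy profile 1: group 1, group 2
        [0.10, 0.14],   # toy profile 2
        [0.06, 0.20],   # toy profile 3
    ])

    assert np.isclose(probs.sum(), 1.0)  # all cells for one model sum to 1
    print(probs.sum(axis=0))             # per-group columns need not sum to 1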
The attributes were positively and significantly correlated. The moderate size of the correlations in table 24, in conjunction with their statistical significance, suggested that each attribute did, in fact, capture a different component of policy implementation support. The attribute defined by characteristics of leadership was significantly correlated with the attribute defined by characteristics of the organization. Although they were different attributes, the organizational and leadership mechanisms were correlated much more strongly than any other attribute pair.
Table 24. Correlations Between Attributes
1 2 3 4
1: Policy 1.000
2: Teacher 0.286* 1.000
3: Leadership 0.422* 0.292* 1.000
4: Organization 0.391* 0.320* 0.783** 1.000
Notes: * correlation significantly greater than zero at α = 0.05;
** correlation significantly greater than zero at α = 0.001.
Table 25 provided an overview of the AIC and BIC model-fit statistics. As previously discussed, a model can be said to fit better when it results in smaller AIC and BIC fit indices. Compared to the single-group analysis (when G = 1), none of the multiple-group models improved the overall fit, which implied that there were no overall significant differences between the groups examined in this study. Traditionally, model fit indices are used to determine the best-fitting model and only that model is interpreted. However, because this was a new application for cognitive diagnostic models with both psychometric and policy implications, each of the proposed multi-group models was examined in terms of parameter estimates and marginal distributions.
Table 25. Fit Statistic Comparisons Between Models
Groups Description AIC BIC
G = 1 Single Group 53971.85 54098.32
G = 2 Subject 54002.22 55507.06
G = 2 Status 54232.50 55476.68
G = 3 Level 54171.96 56439.10
G = 6 Subject*Level 54291.08 58805.59
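For reference, both indices penalize the maximized marginal log-likelihood by the number of free parameters, which is why the multi-group models, whose item-parameter counts multiply with the number of groups, fare worse on BIC in particular. A minimal sketch (Python) of how such values are computed, with hypothetical inputs rather than the actual log-likelihoods from this analysis:

    import math

    def aic(log_lik: float, n_params: int) -> float:
        # AIC = -2 * log-likelihood + 2 * number of free parameters
        return -2.0 * log_lik + 2.0 * n_params

    def bic(log_lik: float, n_params: int, n_obs: int) -> float:
        # BIC = -2 * log-likelihood + ln(n) * number of free parameters
        return -2.0 * log_lik + math.log(n_obs) * n_params

    # Hypothetical values for illustration only.
    ll, p, n = -26850.0, 136, 747
    print(aic(ll, p), bic(ll, p, n))

Because BIC's per-parameter penalty of ln(n) exceeds AIC's penalty of 2 for any realistic sample size, the gap between the single-group and six-group models is much larger for BIC than for AIC, consistent with table 25.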
Following the model fit comparisons, the distributions of parameter estimates in each model were compared to understand the differences between the models. It was very clear that the standard errors for the intercept-parameter estimates increased as more groups were added to the model. Thus, as more groups were added, the estimates became less precise. In table 26, the intercept-parameter distributions were summarized separately, for each model, for items assigned to one attribute and items assigned to two attributes. There was almost no difference in the estimated average intercept parameter between experienced teachers and early-career teachers. This suggested that, when there was no perceived support on any of the support mechanisms, the probabilities that teachers responded positively to a given item were similar for these groups. Moreover, the average intercept parameter in the teacher subject model (G = 2) was lower for STEM teachers than for non-STEM teachers. This dynamic held true in the factorial design model (G = 6), where the average intercept parameters were lower for STEM teachers than non-STEM teachers at all three grade levels: elementary, middle, and high school. Thus, respondents who had the same profile but belonged to different groups had different probabilities of answering positively to the items. Differences in intercept means are akin to overall group differences; thus, STEM teachers overall perceived less support than non-STEM teachers, especially at the elementary and middle school levels. This could indicate that STEM teachers may require or expect more support than non-STEM teachers, or it could be that they actually received less support. The highest intercept estimate was for the group of elementary, non-STEM teachers, for which λi,0 = -0.03; this corresponded to an average probability of 0.49 for a correct response. Conversely, the lowest estimate was for the group of high-school STEM teachers, for which λi,0 = -0.94, corresponding to an average probability of a correct response of 0.28. Although almost every parameter distribution had a few relatively extreme maximum values, the overall frequency of extreme estimates was not an issue. Item text was reviewed and, in each instance, the items were determined to be of substantive significance to the study.
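These probabilities follow directly from the logistic transform of the intercept: for a respondent with no perceived support on any attribute, the C-RUM logit reduces to the intercept alone. A minimal sketch (Python) reproducing the two conversions above:

    import math

    def intercept_to_prob(lam0: float) -> float:
        # Probability of a correct (positive) response for a respondent with
        # no perceived support on any attribute: logistic(lambda_i0).
        return 1.0 / (1.0 + math.exp(-lam0))

    print(round(intercept_to_prob(-0.03), 2))  # 0.49, elementary non-STEM
    print(round(intercept_to_prob(-0.94), 2))  # 0.28, high-school STEM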
Table 26. Comparisons of Intercept Estimate Distributions of Single vs. Multi-Group Models
Distributions of Intercepts
Model Groups Intercept type Mean S.E. Min Max
G = 1 Single One-attribute intercept -0.25 0.10 -2.92 3.78
Two-attribute intercept -0.43 0.09 -1.97 2.41
G = 2 1: STEM One-attribute intercept -0.80 0.13 -6.21 1.43
Two-attribute intercept -0.20 0.12 -4.24 3.44
2: non-STEM One-attribute intercept -0.10 0.17 -6.21 3.70
Two-attribute intercept -0.12 0.14 -4.24 3.44
G = 2 1: early career One-attribute intercept -0.55 0.13 -6.65 3.92
Two-attribute intercept -0.08 0.14 -1.99 3.83
2: experienced One-attribute intercept -0.58 0.16 -4.15 3.32
Two-attribute intercept -0.14 0.15 -4.50 3.05
G = 3 1: elementary One-attribute intercept -0.35 0.16 -6.13 3.96
Two-attribute intercept -0.45 0.14 -1.97 2.67
2: middle One-attribute intercept -0.22 0.21 -7.01 3.76
Two-attribute intercept -0.22 0.18 -1.15 3.47
3: high One-attribute intercept -0.45 0.19 -6.14 4.19
Two-attribute intercept -0.32 0.18 -1.59 1.10
G = 6 1: elementary/STEM One-attribute intercept -0.57 0.23 -5.87 2.19
Two-attribute intercept -0.40 0.22 -2.38 3.40
2: elementary/Non-STEM One-attribute intercept -0.03 0.21 -3.61 2.82
Two-attribute intercept -0.46 0.21 -2.14 2.66
3: middle/STEM One-attribute intercept -0.50 0.24 -6.69 3.42
Two-attribute intercept -0.37 0.23 -3.30 3.77
4: middle/Non-STEM One-attribute intercept -0.21 0.39 -5.98 1.58
Two-attribute intercept -0.03 0.41 -7.08 3.87
5: high/STEM One-attribute intercept -0.94 0.23 -3.81 1.96
Two-attribute intercept -0.58 0.21 -2.29 3.46
6: high/Non-STEM One-attribute intercept -0.77 0.48 -4.99 1.29
Two-attribute intercept -0.10 0.42 -1.73 3.23
After building a general understanding of the distributions of intercept estimates, the next
step was to explore differences across groups on the actual individual parameter estimates. To do
this, the 95% confidence intervals for all items’ intercept-parameter estimates were plotted using
Stata 13 by group for each model. The intervals were compared to determine whether there were
statistically significant differences between groups on each parameter estimate in each model.
There were no formal hypothesis tests conducted for differences in parameter estimates because the theoretical distributions of the parameters were not known. Thus, examining the overlap of confidence intervals served as an approximation to a test of statistical significance. As previously discussed, when there was no perceived support on any attribute, STEM teachers had lower average estimated intercepts than non-STEM teachers. This was true for items assigned to one attribute and also for items assigned to two attributes. However, figure 6 showed that in the 6-group factorial design model, group 1 (elementary STEM teachers) was significantly higher than group 2 (elementary non-STEM teachers) on item 22; thus, the null hypothesis that this intercept estimate was equal across groups was rejected. This particular item asked teachers to rate the extent of confidence they had regarding abilities related to the policy implementation process. Specifically, it asked them to rate the extent to which they felt confident in understanding the teacher evaluation standards described by GUPSECT. In addition to the significant difference between STEM and non-STEM teachers at the elementary level, group 5 (high-school STEM teachers) perceived a significantly higher degree of support on this item than group 6 (high-school non-STEM teachers). However, there was no difference between group 3 (middle school STEM teachers) and group 4 (middle school non-STEM teachers). Consistent with this finding, the right side of figure 6 also showed that in the 2-group model comparing STEM and non-STEM teachers, STEM teachers were again significantly higher on this item, similar to the finding in the 6-group model. Thus, when no support attributes had perceived support, STEM teachers and non-STEM teachers were estimated to perceive a significantly different level of confidence in understanding the teacher evaluation standards, whether or not grade level was accounted for in the model.
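A minimal sketch (Python) of the interval-overlap screen described above; the estimates and standard errors shown are hypothetical placeholders, not values from this study:

    # Approximate a 95% CI as estimate +/- 1.96 * SE and flag non-overlap
    # between two groups' estimates of the same item parameter.
    def ci(est: float, se: float, z: float = 1.96):
        return est - z * se, est + z * se

    def intervals_overlap(a, b) -> bool:
        return a[0] <= b[1] and b[0] <= a[1]

    g1 = ci(est=-0.05, se=0.18)  # e.g., elementary STEM (hypothetical)
    g2 = ci(est=-0.95, se=0.22)  # e.g., elementary non-STEM (hypothetical)
    print("significant difference:", not intervals_overlap(g1, g2))

Note that non-overlap of two 95% intervals is a conservative screen: it implies a difference at α = 0.05, but overlapping intervals do not necessarily imply the absence of one.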
Legend for G = 6: 1= Elementary/STEM; 2 = Elementary/non-STEM; 3 = Middle/STEM; 4 =
Middle/non-STEM; 5 = High/STEM; 6 = High/non-STEM
Legend for G = 2: 1= STEM; 2 = non-STEM.
Figure 6. Comparisons of Confidence Intervals for Item 22 Intercept Estimates by Group
This trend was found for two other items that were very similar in content. In fact, all three items loaded onto the 4-attribute coarse-grained factor "characteristics of teachers." They also loaded onto the 10-attribute fine-grained factor "teacher confidence in abilities relating to policy." Figure 7 showed the findings from the 6-group model and the 2-group model for the item that asked teachers to rate the extent to which they felt confident in understanding the measures of their teaching defined by GUPSECT. On this item, when G = 6, elementary STEM teachers were significantly higher than all other groups except middle school STEM teachers. Middle school STEM teachers were significantly higher than elementary and middle school non-STEM teachers. Moreover, high school STEM and non-STEM teachers were both significantly higher than elementary non-STEM and middle school non-STEM teachers. When G = 2, STEM teachers were clearly significantly higher than non-STEM teachers on this item. Thus, in this model there were many differences among the various groups, but the same general trend was
found for this item as was found for item 22. The implications of this are discussed in detail in
chapter 5.
Legend for G = 6: 1= Elementary/STEM; 2 = Elementary/non-STEM; 3 = Middle/STEM; 4 =
Middle/non-STEM; 5 = High/STEM; 6 = High/non-STEM
Legend for G = 2: 1= STEM; 2 = non-STEM.
Figure 7. Comparisons of Confidence Intervals for Item 23 Intercept Estimates by Group
Finally, item 28 followed the same trend as the two previously discussed items. This item asked teachers to rate the extent to which they felt confident in using the evaluation to inform their teaching. When G = 6, elementary STEM teachers were estimated to perceive a significantly higher degree of confidence than all other groups of teachers. Moreover, middle school non-STEM teachers were estimated to perceive a significantly lower degree of confidence than all other groups of teachers. When G = 2, it was clear that STEM teachers were estimated to have a significantly higher degree of confidence than non-STEM teachers, which was consistent with the trend found for the other two items. Despite this logical, substantive interpretation, with α = 0.05 and roughly 70 item-level comparisons, one would expect about three significant differences (70 × 0.05 = 3.5) by chance alone, which provides an alternative statistical explanation for this finding. Thus, further investigation into these support mechanisms would be necessary.
Legend for G = 6: 1= Elementary/STEM; 2 = Elementary/non-STEM; 3 = Middle/STEM; 4 =
Middle/non-STEM; 5 = High/STEM; 6 = High/non-STEM
Legend for G = 2: 1= STEM; 2 = non-STEM.
Figure 8. Comparisons of Confidence Intervals for Item 28 Intercept Estimates by Group
Next, the slope-parameter main effect estimates were investigated. The slope parameters captured the influence, in logits, of perceived support on an attribute on the probability of a correct response to an item. The distributions of the slopes were summarized in table 27 by model and by the number of slope parameters assigned to each item (1 or 2). Similar to the intercepts, the average standard errors increased as more groups were included in the models. However, the average slope estimates for items assigned to one attribute were higher than those for items assigned to two attributes, the opposite of what was observed for the intercept estimates in table 26.
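For orientation, the intercepts and slopes discussed here combine on the logit scale. Under the standard log-linear form of the C-RUM, the probability of a correct response to item i for a respondent with attribute profile α = (α1, …, αK) is

    P(X_i = 1 | α) = exp(λ_{i,0} + Σ_k λ_{i,k} q_{ik} α_k) / [1 + exp(λ_{i,0} + Σ_k λ_{i,k} q_{ik} α_k)]

where q_{ik} is the Q-matrix entry assigning item i to attribute k, λ_{i,0} is the item intercept, and λ_{i,k} is the main-effect slope for attribute k. Attributes with perceived support thus contribute additively, and compensatorily, to the logit.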
Table 27. Comparisons of Slope Estimate Distributions of Single vs. Multi-Group Models
Distributions of Slopes
Model Groups Slope type Mean S.E. Min Max
G = 1 Single One-attribute slope 1.04 0.10 0.02 4.00
Two-attribute slope 0.69 0.09 0.11 1.59
G = 2
1: STEM One-attribute slope 1.12 0.12 0.18 4.12
Two-attribute slope 1.05 0.13 0.40 2.92
2: non-STEM One-attribute slope 1.10 0.16 0.28 4.14
Two-attribute slope 1.04 0.17 0.27 3.03
G = 2 1: early career One-attribute slope 1.12 0.13 0.43 4.12
Two-attribute slope 0.79 0.13 0.05 4.08
2: experienced One-attribute slope 1.17 0.16 0.00 4.11
Two-attribute slope 0.71 0.15 0.03 4.11
G = 3
1: elementary One-attribute slope 1.08 0.16 0.02 4.05
Two-attribute slope 0.62 0.14 0.16 2.02
2: middle One-attribute slope 0.93 0.21 0.41 4.05
Two-attribute slope 0.83 0.18 0.02 4.05
3: high One-attribute slope 0.99 0.19 0.32 4.09
Two-attribute slope 0.44 0.18 0.08 1.52
G = 6
1: elementary/STEM One-attribute slope 1.16 0.23 0.01 4.03
Two-attribute slope 0.81 0.22 0.23 2.27
2: elementary/Non-STEM One-attribute slope 1.10 0.21 0.24 4.20
Two-attribute slope 0.75 0.21 0.03 2.54
3: middle/STEM One-attribute slope 1.10 0.24 0.33 4.03
Two-attribute slope 0.78 0.22 0.12 2.52
4: middle/Non-STEM One-attribute slope 1.05 0.39 0.22 4.01
Two-attribute slope 0.69 0.40 0.02 2.19
5: high/STEM One-attribute slope 1.09 0.23 0.21 3.09
Two-attribute slope 0.91 0.21 0.03 1.51
6: high/Non-STEM One-attribute slope 1.00 0.48 0.19 2.64
Two-attribute slope 0.96 0.41 0.09 3.14
Similar to the procedure with the intercept estimates, the 95% confidence intervals of the slope estimates were plotted for each model, group, and item. There were very few significant differences between groups on items. In fact, there were no significant differences between STEM and non-STEM teachers, or between experienced and early-career teachers. This finding indicated that, generally, attributes with perceived support had similar influences on teachers' probability of a correct response to items across grade level, subject, and career status. However, there was an important group of items for which there were significant differences across grade levels. Figure 9 showed the confidence intervals for four different items. Item 55 asked teachers to rate the extent to which they agreed that principals collected adequate evidence to evaluate their teaching. The graph displayed confidence intervals by elementary, middle, and high school groups. Both elementary and middle school teachers were estimated significantly higher than high school teachers. This item loaded onto the coarse-grained "characteristics of leadership" attribute and the finer-grained "principal legitimacy" attribute. Differences in slope means indicated different strengths of relationship between an attribute and the probability of a positive perception. Thus, the significant differences between elementary and middle school teachers' slope parameters and those of high school teachers indicated differences in the strength of the relationship between the leadership attribute and the probability of a correct response on this item. This same trend was observed for item 57, which was assigned to the same attributes as item 55. It asked teachers to rate the extent to which their principals had the knowledge and skills to evaluate their teaching. Elementary and middle school teachers' estimates were both significantly higher than high school teachers'. Thus, when teachers perceived support on the leadership attribute, this support had a significantly weaker effect on high school teachers' probability of indicating that their principal had the required knowledge and skills. Similar to the above inference, this may have indicated that high school teachers valued principal knowledge and skills less than other teachers in terms of support mechanisms regarding policy implementation characteristics of leadership.
Item 45 asked teachers the extent to which they agreed that the professional development they received on teacher evaluation was useful. This item was assigned in the Q-matrix to the attribute "characteristics of leadership." In the 10-attribute model, it was assigned to "quality of professional development." Figure 9 showed the confidence intervals by grade level. Middle school teachers were statistically significantly higher than elementary and high school teachers. Moreover, elementary teachers were significantly higher than high school teachers. This indicated that perceived support on the leadership attribute had more influence on the probability of middle school teachers indicating support on this item than of elementary and high school teachers. Finally, item 51 asked teachers to rate the extent to which evaluation feedback informed their professional development selection. Elementary school teachers were estimated significantly higher than middle and high school teachers. This indicated that perceived support on the leadership attribute had less of an influence on the logit of a correct response to this item for middle and high school teachers. However, with a type 1 error rate of 0.05, the possibility remained that these differences were actually type-1 errors. Thus, future investigation into these support mechanisms is necessary.
Item 55    Item 57
Item 45    Item 51
Legend for G = 3: 1= Elementary School; 2 = Middle School; 3 = High School.
Figure 9. Comparisons of Confidence Intervals for Four Item Slope Estimates by Grade Level
CHAPTER 5
IMPLICATIONS
The purpose of this study was to explore a new way to measure teachers’ perceptions of
school intra-organizational mechanisms for supporting teacher evaluation policy implementation.
Cognitive diagnostic models have not previously been applied to policy implementation support
constructs. The categorical, standards-based diagnostic output from the analysis in this study was
anticipated to provide detailed empirical information about teachers’ perceptions of support. It
was assumed that more precise diagnostic feedback would be beneficial to policy makers and
school leaders in identifying strengths and weaknesses and in targeting resources in the policy
implementation process. When equipped with more precise diagnostic feedback, policy makers
and school leaders may be able to more confidently engage in empirical decision making,
especially in regard to targeting resources for short-term and long-term organizational goals
subsumed within the policy implementation initiative. Specifically, the following research
questions were addressed:
1. Can cognitive diagnostic models be applied to understanding teachers’ perceptions of
intra-organizational mechanisms for supporting policy implementation?
a. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms fit the data better than models specifying
coarser grained attributes?
b. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms have more accurate parameter estimates than
models specifying coarser grained attributes?
c. What are the posterior estimated proportions of teachers that are diagnosed to fall
under each latent profile?
2. Are there group differences in the diagnostic model fit based on grade level, subject
taught, and career status?
a. What are the group differences of the estimated latent class profile
distributions based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of parameter
estimates based on grade level, subject taught, and career status?
Discussion on Research Question 1a
Cognitive diagnostic models can undoubtedly be useful in their application to
understanding teachers’ perceptions of intra-organizational mechanisms for supporting policy
implementation. The first over-arching question was answered by attaining results that were
both statistically advantageous and substantively useful. The entire set of 4-attribute models had
AIC and BIC values lower than the unidimensional model values. This finding solidified the
application of cognitive diagnostic models to understanding intra-organizational mechanisms for
supporting policy implementation. Statistically, since the best-fitting cognitive diagnostic model
fit the data better than the unidimensional model, it was only logical to go with the diagnostic
model since it would, in fact, provide more nuanced substantive insights into the multiple
hypothesized dimensions of policy implementation. Moreover, the diagnostic model
measurements were actually much more precise, since they accounted for complex loading
structures and more nuanced relationships between the items and attributes than a
unidimensional model. Thus, with a more complete, and nuanced understanding of the
relationships of how teachers perceived these underlying dimensions, including characteristics of
the policy, teachers, leadership, and the organization, states and school districts would be much
more prepared to improve the implementation process than with just an overall total score of
support or the results from a simple unidimensional model. For example, using only the single-
126
group, 4-attribute model, district leaders can identify 16 separate latent classes of teachers who
perceive the implementation differently. They can narrow these 16 classes even further, since
some profiles had very low proportions and most teachers were estimated to fall into one of eight
the classes. In planning district-wide professional development, district leaders would be able to
group teachers based on the type of support they indicate they were lacking. Additionally, district
leaders could potentially identify groups of teacher-leaders who indicated they received support
and use follow-up studies to better understand what made teachers perceive this level of support.
Such strategies could be very useful in facilitating the diffusion of support in an organization.
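As an illustration of how the 16 classes arise, each latent profile is simply a binary vector over the four attributes. A minimal sketch (Python) enumerating and labeling the profiles, using the attribute order from table 23 (policy, teacher, leadership, organization):

    from itertools import product

    ATTRIBUTES = ("policy", "teacher", "leadership", "organization")

    # Enumerate all 2^4 = 16 latent profiles. Reversing each tuple makes the
    # first attribute vary fastest, matching the p1..p16 ordering of table 23
    # (e.g., p8 = 1110: support on everything except the organization).
    for idx, bits in enumerate(product((0, 1), repeat=len(ATTRIBUTES)), start=1):
        profile = bits[::-1]
        supported = [a for a, b in zip(ATTRIBUTES, profile) if b] or ["none"]
        print(f"p{idx}: {''.join(map(str, profile))} -> {', '.join(supported)}")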
Most importantly, because this study used teachers' perceptions of the implementation process, and teachers were the professionals who actually implemented the principles of the policy, the nuanced understandings from this application were particularly relevant. In educational and other social contexts, the understanding of complex latent variables, such as perceptions of policy implementation support, commonly requires investigation into the components of the cognitive process. That was precisely the advantage diagnostic models provided in this application.
The literature review resulted in a coarse-grained, 4-attribute model hypothesis. However, several finer-grained models were built as well. It was clear that diagnostic models specifying coarser-grained attributes representing implementation support mechanisms fit the data better than models specifying finer-grained attributes. This was likely because of the number of parameters that needed to be estimated in the finer-grained models relative to the sample size available to estimate them. Regardless, the 4-attribute model provided a useful and logical substantive interpretation.
Discussion on Research Question 1b
In comparing the best-fitting 4-attribute model and the unidimensional model, parameter estimates were very similar in terms of precision. This provided further evidence that the application of diagnostic modeling to intra-organizational support mechanisms may be useful for gaining a more nuanced understanding of teachers' perceptions. More specifically, investigating policy implementation support through its underlying components yields a more nuanced understanding. However, if modeling these components had produced standard errors that were not comparable to, or more precise than, those of the unidimensional model, it would be difficult to justify the application. Although it was an exception to the general trend, the best-fitting four-attribute model actually had standard errors very comparable to those of the unidimensional model. This suggested that researchers do not necessarily have to sacrifice model complexity in order to maintain model precision. However, the general trend of the standard errors showed that as the models became increasingly complex and more attributes were added, the standard errors for the parameters of the finer-grained attributes increased. More generally, as anticipated, the more complex the model, the less precisely its parameters were estimated. Although the current study retro-fitted a diagnostic model to the data, those designing an instrument specifically for a diagnostic model may want to tend toward lower numbers of attributes in order to preserve the accuracy of the parameter estimates.
Discussion on Research Question 1c
The posterior estimated proportions of teachers diagnosed to fall under each of the sixteen latent profiles were obtained. Profiles of organizational support generated from adequately fitting psychometric models provided a rigorous quantitative basis for the policy implementation process. Based on a teacher's profile membership, his or her perceptions of specific organizational strengths and weaknesses were estimated. The estimated marginal diagnostic distributions suggested that teachers generally perceived a higher level of support on characteristics of the policy and characteristics of teachers. They also suggested that teachers perceived a lower level of support on characteristics of leadership and characteristics of the organization. The certainty with which teachers are assigned to a profile, available from each teacher's profile membership distribution, was intended to inform decisions about the appropriateness of potential programs for that teacher. One potential use of the diagnostic information captured by the model is to support investments in the human capital available in the teacher workforce. Such investments would include informing decisions regarding teachers' professional development. The empirically derived profiles of perceptions of implementation support could facilitate the development of interventions that are both targeted at the needs of individual teachers and coordinated across multiple organizational domains of the implementation process (Halpin & Kieffer, 2015). Although exactly how and specifically what professional development resources should be targeted toward improvement of the process is not explored in this study, this new measurement application provides direction toward better understanding the challenges of implementing a policy of this magnitude. The obvious approach would be to focus on the implementation areas in which a teacher, or group of teachers, perceives a deficit in support. However, it is rare that educational policy makers and leaders are equipped with the type of diagnostic information that explicitly permits inferences into how perceptions of the important organizational attributes are interrelated. This is extremely important information to consider when attempting to maximize the utility of the limited resources available for facilitating the professional growth of the teacher workforce.
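For readers unfamiliar with how such membership distributions arise, the following is a minimal sketch (Python) of posterior classification over latent profiles via Bayes' rule, assuming item responses are conditionally independent given the profile; the class priors and item response probabilities are hypothetical placeholders, not estimates from this study:

    import numpy as np

    # Toy model with 4 profiles and 3 items:
    # prior[c] = marginal probability of profile c;
    # p_item[c, i] = P(positive response to item i | profile c).
    prior = np.array([0.40, 0.25, 0.20, 0.15])
    p_item = np.array([
        [0.2, 0.3, 0.1],
        [0.7, 0.4, 0.3],
        [0.5, 0.8, 0.6],
        [0.9, 0.9, 0.8],
    ])

    def posterior(responses: np.ndarray) -> np.ndarray:
        # Likelihood of the response vector under each profile.
        lik = np.prod(np.where(responses, p_item, 1.0 - p_item), axis=1)
        post = prior * lik
        return post / post.sum()  # Bayes' rule: normalize over profiles

    print(posterior(np.array([1, 1, 0])))  # membership distribution

A teacher would typically be assigned to the profile with the highest posterior probability, and the spread of this distribution conveys the classification certainty discussed above.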
Discussion on Research Question 2a
For research question 2, the multiple-group models were compared in terms of model fit, item parameter estimates, and the marginal distributions for the groups. Five different psychometric models were used in this analysis: a single-group analysis, a two-group analysis defined by subject, a two-group analysis defined by career status, a three-group analysis defined by grade level, and a six-group analysis defined by a factorial design of subject and grade level. On the overall model fit indices AIC and BIC, the single-group analysis was the best compared to the group analyses. Among the multiple-group models, the teacher subject analysis fit best.
The results showed that the marginal distributions resulting from these models were fairly similar. The most notable discrepancy was the difference between profiles 8 and 16 across the models. For model 1, the largest proportion of teachers was estimated to fall into profile 8; however, for every other model, profile 16 was the most common profile. Profile 8 included perceived support on all attributes except for characteristics of the organization. Profile 16 included perceived support on all attributes. Thus, the only difference between the definitions of these profiles was whether or not teachers were estimated to have perceived support on characteristics of the organization. The higher correlation between the leadership and organizational attributes (r = 0.783, p < 0.01) may partly explain this slight difference between the multiple-group models. Notwithstanding this exception, the remaining bivariate correlations between the attributes were of fairly moderate size, yet all were statistically significant. This provided empirical evidence that each attribute likely captured a different, but related, component of policy implementation support.
Discussion on Research Question 2b
The multi-group models offered major advantages in terms of interpreting the results. Parameter estimates were explored and several findings were noted. The first notable finding was that there was almost no difference in the estimated average intercept parameter between experienced teachers and early-career teachers. This suggested that, when no support attributes had perceived support, the probabilities that teachers responded positively to a given item were similar for these groups. Secondly, the average intercept parameter in the teacher subject model (G = 2) was lower for STEM teachers than for non-STEM teachers. This dynamic held true in the factorial design model (G = 6), where the average intercept parameters were lower for STEM teachers than non-STEM teachers at all three grade levels: elementary, middle, and high school. This meant that there was less perceived support by STEM teachers than by non-STEM teachers who had the same profile. More specifically, it suggested an overall group difference between STEM and non-STEM teachers. Generally, this finding further suggests that STEM teachers required different support than non-STEM teachers. Surprisingly, STEM teachers also had significantly higher estimates on the intercepts for items pertaining to their extent of confidence in understanding the GUPSECT standards, the measures used to evaluate their teaching, and using their ratings to inform their teaching. This may have indicated that STEM teachers at all levels required less support on these mechanisms than non-STEM teachers in order to have a significantly higher probability of a positive response. This would be important for a school leader to know. If, for example, a professional development session were to be centered on using evaluation ratings to improve teaching, a principal may be able to focus attention on non-STEM teachers for this session. Or, even more strategically, this principal may be able to rely on specific STEM teachers to help build understanding and confidence among less confident teachers participating in this session. Such information is vital in maximizing the limited time allotted for teacher professional development.
The slope parameters captured the influence, in logits, of perceived attribute support on the probability of a correct response to an item. They also captured group mean differences in the strength of the relationship between items and the four support attributes. There were very few significant differences between groups on items. In fact, there were no significant differences between STEM and non-STEM teachers, or between experienced and early-career teachers. This finding indicated that, generally, perceptions of support had similar influences on teachers' probability of an agreement response to items across grade level, subject, and career status. However, when teachers were asked to rate the extent to which they agreed that principals collected adequate evidence to evaluate their teaching, both elementary and middle school teachers were estimated significantly higher than high school teachers. There remained the possibility that the few significant differences in item parameters were type I errors, which suggested that future research should focus on investigating whether these differences replicate. Alternatively, the differences could have something to do with discrepancies in the frequency of achievement testing at different levels. In Virginia, Standards of Learning testing occurs every year for math and science in grades 3 through 8. Moreover, writing and science tests occur in grades 5 and 8, and history is tested once in grades 4 through 7. In comparison, high school students are tested less frequently. If teachers believe student test scores are used as evidence in their evaluations, this may impact their perceptions on this item. In reality, the 40% weighting of student academic progress defined in the GUPSECT standards can include other evidence of student progress in addition to testing. One alternative explanation for this finding was that teachers may believe that the number of years in elementary school provides more opportunities for principals to collect evidence of their teaching.
Summary of Results
In summary, various cognitive diagnostic models were applied to teachers' perceptions of intra-organizational mechanisms for supporting teacher evaluation policy implementation. A preliminary analysis was conducted and various methods were used to construct the Q-matrix. An exploratory factor analysis provided empirical support for a fine-grained, 10-attribute model. It was also determined that these finer-grained attributes could be combined to construct a coarser-grained, four-attribute model. This was the starting point for the diagnostic model-fitting process. Using expert consultation, the Q-matrix was refined until a set of 10-attribute models converged. Based on the correlations and substantive interpretations, attributes were combined to build sets of 9-, 8-, 7-, 6-, 5-, and 4-attribute models.

The set of 4-attribute models was superior in terms of model fit. Moreover, the 4-attribute models had lower standard errors than the competing finer-grained models. The best-fitting 4-attribute model was selected and the 7 most egregiously misfitting items were removed. The resulting 70-item, 4-attribute model was determined to be a better fit. The estimated proportions of teachers in each profile were interpreted for this model in order to address research question 1c. In total, approximately 24% of teachers were estimated to have perceived support on all four attributes. Conversely, approximately 16% of teachers were estimated to have perceived support on none of the support attributes. Approximately 23% of teachers indicated that they were supported by mechanisms related to characteristics of teachers, and another 16% were estimated to have perceived support on mechanisms relating to characteristics of both the policy and teachers.
Finally, diagnostic models were explored in terms of group differences. Specifically,
group differences were compared using four different models. First, a model was run for early
career teachers vs. experienced teachers (G = 2). Next, a model was run for STEM teachers vs.
non-STEM teachers (G = 2). This was followed by a model comparing teachers by grade level (G = 3). Finally, a full-factorial model of subject and grade level was run (G = 6).
Although the multi-group models failed to improve on the overall fit of the single-group model, each of the models was explored in terms of parameter estimates, standard errors, and distributions across profiles. The distributions across profiles were quite comparable across all models. Moreover, these multi-group models offered major advantages in terms of interpreting the results. Parameter estimates were explored and several findings were noted. The first notable finding was that STEM teachers had generally lower intercept estimates than non-STEM teachers when no attributes had perceived support. However, STEM teachers had significantly higher intercept estimates on items pertaining to their extent of confidence in understanding the GUPSECT standards and measures, and in using their ratings to inform their teaching.

Secondly, there were very few significant differences between groups on items. In fact, there were no significant differences between STEM and non-STEM teachers, or between experienced and early-career teachers. Both elementary and middle school teachers were estimated significantly higher than high school teachers on specific items related to "leadership support." These included the items measuring the extent to which teachers agreed that principals collected adequate evidence to evaluate their teaching and the extent to which their principals had the knowledge and skills to evaluate their teaching. Middle school teachers were statistically significantly higher than elementary and high school teachers on the extent to which the professional development they received on teacher evaluation was useful. Finally, elementary school teachers were estimated significantly higher than middle and high school teachers on the item asking whether evaluation feedback informed their professional development selection. This indicated that perceived support on the leadership attribute had less of an influence on the logit of a correct response to this item for middle and high school teachers.
The initial findings were quite consistent with the literature. More specifically, the exploratory factor analysis showed that the 10-attribute model could be interpreted as a finer-grained representation of the coarse-grained, 4-attribute hypothesis constructed from a review of the literature. There is support for the notion that, in the context of state-wide teacher evaluation policy implementation, local organizations can think of implementation support in terms of four coarse-grained components rather than as a single, unidimensional construct. By supporting teachers in this way, districts can target the necessary resources to the appropriate schools and teachers. This, in turn, should maximize the utility of the resources available to school districts.

This study is among the first to look at K-12 policy implementation support from teachers' perspective. Several insights were gained by investigating teachers' perceptions of organizational supports and barriers that would not have been available through other methods. These insights provided additional evidence that school leaders can broaden their understanding of what is and is not working with new policies by collecting data from those responsible for actually implementing the guidelines of the policy.
Limitations
The analyses presented in this study demonstrated the usefulness of cognitive diagnostic models as a method for investigating the implementation of a state-wide teacher evaluation policy. Researchers on the ITES project collected survey data that allowed for a multidimensional analysis of teachers' perceptions of intra-organizational mechanisms for supporting teacher evaluation. This provided an opportunity to retro-fit the C-RUM model to these data in order to make criterion-referenced, standards-based decisions regarding the implementation process. The practice of retro-fitting (de la Torre and Karelitz, 2009) is generally suboptimal and can result in the misclassification of examinees if the instrument was intended for a unidimensional construct. Hence, one limitation of using these data was that the instrument was not initially designed for a diagnostic model. Because of this, the attributes were unevenly covered and items were not specifically written for a diagnostic model. However, although the instrument used in this study was not developed with a diagnostic model in mind, the fact that it was developed around four coarse-grained mechanisms for policy implementation support made it suitable for exploratory research purposes. A future direction would be the development of an instrument designed for a cognitive diagnostic model and developed from a Q-matrix.

Secondly, the dataset limits the generalizability of the substantive results to the population of K-12 public schools in Southwest Virginia. The final sample included 794 teachers. This sample is not small, but it is not overly large either. It should be noted that a high response rate (70%) was attained. Thus, although the data were not collected from a probabilistic sample, the sample is very representative of the target population. Moreover, there were many reasons presented as to why these data would provide the best answers to the previously presented research questions. Most importantly, no other dataset currently includes the variables necessary to answer the research questions in this study. Since teacher evaluation policy is typically developed and implemented at the state or local level, there exists limited funding for collecting data and conducting research on the implementation process, which may explain why there are limited data in this area of research. Additionally, because policies are implemented at the state and local level, no national dataset exists, although it would be very interesting to investigate policies in and across other states.
It was acknowledged that the exact specification of the Q-matrix is often unknown a priori. Mechanisms of policy implementation support in schools are not fully understood, and thus the exact relationships in such a complex model cannot be known for certain. For this reason, in addition to traditional approaches to Q-matrix development, empirically based Q-matrix discovery techniques were pursued in this study. There remains a need for further investigation into the development of empirical techniques for determining the entries of the Q-matrix.
Finally, in the current study, the data for the final specified model were dichotomized. Although this strategy has precedent, it does present an additional limitation. With a much larger sample, the data might have remained polytomous. However, the current sample size (n = 747) does not support the number of parameters that would need to be estimated if the data remained polytomous. This limitation turned out not to be significant, as a preliminary analysis showed that there were almost no differences between the structures resulting from the polytomous data and the dichotomous data.
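A minimal sketch of this kind of dichotomization (Python); the 4-point scale and the cut point placing agreement responses at 1 are illustrative assumptions, since the exact scale and cut are described elsewhere in this study:

    import numpy as np

    def dichotomize(likert: np.ndarray, cut: int = 3) -> np.ndarray:
        # Collapse polytomous Likert responses (assumed here to run 1-4) into
        # binary indicators: 1 if at or above the cut, else 0. The cut of 3
        # (agree / strongly agree -> 1) is an assumption for illustration.
        return (likert >= cut).astype(int)

    responses = np.array([1, 2, 3, 4, 2, 4])  # hypothetical item responses
    print(dichotomize(responses))             # [0 0 1 1 0 1]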
Future Research
Several areas of future research emerge from the findings of this study. The current study illustrated the interplay between the general application of cognitive diagnostic models to an area of research in which they had not previously been used and the actual substantive findings from that model. Future research studies should focus on both. First, there is a great need for research into designing psychometrically sound instruments specifically intended for diagnostic models. Although there was evidence to support the application of diagnostic models to policy implementation support, the model was retro-fitted to data collected using an instrument not specifically designed for this use. By advancing research into diagnostic-modeling instrument development, researchers in all fields seeking new applications for these models will be better prepared. In conjunction with further investigation into instrument development, more research into diagnostic modeling of polytomously scored items is necessary in order to preserve the information attained by instruments with polytomous items.

Furthermore, although several studies exist on Q-matrix development, there is minimal agreement on the most effective way to develop this important hypothesis. This could be because the Q-matrix may depend, to a high degree, on the attributes, items, and relationships being explored. However, since several methodologies focusing on Q-matrix construction have been documented in the literature, future studies should compare and contrast competing ways to construct this important hypothesis in different areas of study.

Since this was a new application of cognitive diagnostic modeling, the substantive findings of this study could be further validated through replication. For example, when teachers were asked to rate the extent to which they agreed that principals collected adequate evidence to evaluate their teaching, both elementary and middle school teachers were estimated significantly higher than high school teachers. However, with a type 1 error rate of 0.05, the possibility remained that these differences were actually type-1 errors. Thus, future investigation into these support mechanisms is necessary. Moreover, as the ITES project continues to grow, a larger sample will help researchers further explore the valid and reliable measurement of policy implementation support. As new researchers get involved in the project, triangulating the results of this study using other methodologies may be helpful. More specifically, because the study was localized, a qualitative inquiry into teachers' perceptions may substantiate the important findings.
Traditionally, cognitive diagnostic models are applied to K-12 skills or to diagnosing medical conditions where the presence of symptoms is a binary outcome. Thus, although generalizability is important, demonstrating the application of these models and the potential value of the diagnostic output was considered a more important initiative in regard to informing future studies. Exploring new ways to more validly and reliably measure and improve the support available to teachers in using performance information to adjust instruction remains an important activity for educational leaders and researchers. This effort can be a crucial part of broader initiatives to build school capacity to better serve students in this performance accountability era (Sun, Mutcheson, & Kim, 2014). This further supports the notion that the initial findings in this study should be substantiated through replication.
References
Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317–332.
Akerlof, G. A., & Kranton, R. E. (2005). Identity and the economics of organizations. Journal of Economic Perspectives, 19(1), 9–32.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters:
Application of an EM algorithm. Psychometrika, 46(4), 443-459.
Brown, T. A. (2015). Confirmatory factor analysis for applied research. Guilford Publications.
Bryk, A. S., Sebring, P. B., Allensworth, E., Easton, J. Q., & Luppescu, S. (2010). Organizing
schools for improvement: Lessons from Chicago. University of Chicago Press.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276.
Century, J., Rudnick, M., & Freeman, C. (2010). A framework for measuring fidelity of
implementation: A foundation for shared language and accumulation of
knowledge. American Journal of Evaluation, 31(2), 199-218.
Century, J., Cassata, A., Rudnick, M., & Freeman, C. (2012). Measuring enactment of innovations and the factors that affect implementation and sustainability: Moving toward common language and shared conceptual understanding. The Journal of Behavioral Health Services & Research, 39(4), 343–361.
Chen, W., & Thissen, D. (1997). Local dependence indexes for item pairs using item response
theory. Journal of Educational and Behavioral Statistics, 22, 265-289.
Chen, J., de la Torre, J. and Zhang, Z. (2013), Relative and Absolute Fit Evaluation in Cognitive
Diagnosis Modeling. Journal of Educational Measurement, 50: 123–140.
doi: 10.1111/j.1745-3984.2012.00185.x
Close, C. N., Davison, M. L., & Davenport, E. C. (2012). An exploratory technique for finding
the Q-matrix in cognitive diagnostic assessment: Combining theory with data. In Annual
Meeting of the National Council on Measurement in Education. Vancouver, British
Columbia, Canada.
Coburn, C. E. (2001). Collective sensemaking about reading: How teachers mediate reading
policy in their professional communities. Educational Evaluation and Policy Analysis,
23(2), 145–170.
Curtin, T. R., Ingels, S. J., Wu, S., & Heuer, R. (2002). National education longitudinal study of 1988: Base-year to fourth follow-up data file user's manual (NCES 2002-323). Washington, DC: US Department of Education, National Center for Education Statistics.
Datnow, A., & Park, V. (2009). Conceptualizing policy implementation: Large-scale reform in
an era of complexity. Handbook of Education Policy Research, 348-361.
Darling-Hammond, L., Amrein-Beardsley, A., Haertel, E., & Rothstein, J. (2012). Evaluating teacher evaluation. Phi Delta Kappan, 93(6), 8–15.
DiBello, L. V., Roussos, L. A., & Stout, W. (2006). A review of cognitively diagnostic assessment and a summary of psychometric models. Handbook of Statistics, 26, 979–1030.
Eisenhardt, K.M. (1989). Agency theory: An assessment and review. The Academy of
Management Review, 14(1). 57-74.
Frank, K. A., Zhao, Y., & Borman, K. (2004). Social capital and the diffusion of innovations within organizations: The case of computer technology in schools. Sociology of Education, 77(2), 148–171.
Galeshi, R., & Skaggs, G. (2014). Traditional fit indices utility in new psychometric model:
cognitive diagnostic model. International Journal of Quantitative Research in
Education, 2(2), 113-132.
Goldhaber, D. D., & Brewer, D. J. (2000). Does teacher certification matter? High school teacher certification status and student achievement. Educational Evaluation and Policy Analysis, 22(2), 129–145.
Gorin, J. S. (2009). Diagnostic classification models: Are they necessary? Commentary on Rupp and Templin (2008).
Haberman, S. J., von Davier, M., & Lee, Y. H. (2008). Comparison of multidimensional item response models: Multivariate normal ability distributions versus multivariate polytomous ability distributions. ETS Research Report Series, 2008(2), i–25.
Hair, J., Tatham R., Anderson R., & Black W (1998). Multivariate data analysis. (Fifth Ed.)
Prentice-Hall: London.
Hagenaars, J. A., & McCutcheon, A. L. (Eds.). (2002). Applied latent class analysis. Cambridge
University Press.
Hallinger, P., Heck, R. H., & Murphy, J. (2014). Teacher evaluation and school improvement:
An analysis of the evidence. Educational Assessment, Evaluation and Accountability,
26(1), 1-24.
Halpin, P. F., & Kieffer, M. J. (2015). Describing profiles of instructional practice: A new approach to analyzing classroom observation data. Educational Researcher. doi:10.3102/0013189X15590804
The New Teacher Project (2007)
Harris, D. N., & Sass, T. R. (2011). Teacher training, teacher quality and student
achievement. Journal of public economics, 95(7), 798-812.
Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive
abilities: Blending theory with practicality (Doctoral dissertation, University of Illinois at
Urbana-Champaign).
Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191–210.
Horn, J. L. (1965). A rationale and test for the number of factors in factor
analysis. Psychometrika, 30(2), 179-185.
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity
to underparameterized model misspecification. Psychological Methods, 3, 424–453.
Jang, E. E. (2009). Cognitive diagnostic assessment of L2 reading comprehension ability:
Validity arguments for Fusion Model application to LanguEdge assessment. Language
Testing, 26(1), 31–73.
Kaiser, H. F. (1991). Coefficient alpha for a principal component and the Kaiser-Guttman
rule. Psychological reports, 68(3), 855-858.
Kane, T. J., McCaffrey, D. F., Miller, T., & Staiger, D. O. (2013). Have We Identified Effective
Teachers? Validating Measures of Effective Teaching Using Random Assignment.
Research Paper. MET Project. Bill & Melinda Gates Foundation.
Kunina‐Habenicht, O., Rupp, A. A., & Wilhelm, O. (2012). The Impact of Model
Misspecification on Parameter Estimation and Item‐Fit Assessment in Log‐Linear
Diagnostic Classification Models. Journal of Educational Measurement, 49(1), 59-81
Kelcey, B., Hill, H. C., & McGinn, D. (2014). Approximate measurement invariance in cross-
classified rater-mediated assessments. Frontiers in Psychology, 5(1469).
Lee, Y. W., & Sawaki, Y. (2009). Cognitive diagnosis approaches to language assessment: An
overview. Language Assessment Quarterly, 6(3), 172-189.
Leighton, J., & Gierl, M. (Eds.). (2007). Cognitive diagnostic assessment for education: Theory
and applications. Cambridge University Press.
Li, H., & Suen, H. K. (2013). Constructing and validating a Q-matrix for cognitive diagnostic analyses of a reading test. Educational Assessment, 18(1), 1–25.
Liu, Y., Douglas, J. A., & Henson, R. A. (2009). Testing person fit in cognitive
diagnosis. Applied psychological measurement, 33(8), 579-598.
Magidson, J., & Vermunt, J. K. (2004). Latent class models. The Sage handbook of quantitative
methodology for the social sciences, 175-198.
Marzano, R. J., Pickering, D. J., & Pollock, J. E. (2001). Classroom instruction that works:
Research-based strategies for increasing student achievement. Alexandria, VA:
Association for Supervision and Curriculum Development.
Meng, X. L., & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80, 267–278.
Mihaly, K., McCaffrey, D. F., Staiger, D. O., & Lockwood, J. R. (2013). A composite estimator
of effective teaching. Seattle, WA: Bill & Melinda Gates Foundation.
Moynihan, D. P. (2008). The dynamics of performance management: Constructing information
and reform. Washington DC: Georgetown University Press.
Murphy, J., Hallinger, P., & Heck, R. H. (2013). Leading via teacher evaluation: The case of the
missing clothes? Educational Researcher, 42(6), 349–354.
Norris, M., & Lecavalier, L. (2009). Evaluating the use of exploratory factor analysis in developmental disability psychological research. Journal of Autism and Developmental Disorders, 40(1), 8–20. doi:10.1007/s10803-009-0816-2
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects?
Educational Evaluation and Policy Analysis, 26(3), 237-257.
Poggio, A. J., Yang, X., Irwin, P. M., Glasnapp, D. R., & Poggio, J. P. (2007). Kansas
Assessments in Reading and Mathematics 2006 Technical Manual for the Kansas
General Assessments, Kansas Assessments of Multiple Measures (KAMM), Kansas
Alternate Assessments (KAA). Retrieved April 20, 2008.
Proctor, E. K., et al. (2011). Outcomes for implementation research: Conceptual distinctions,
measurement challenges, and research agenda. Administration and Policy in Mental Health
and Mental Health Services Research, 38, 65-76. doi:10.1007/s10488-010-0319-7
Putnam, R. (2000). Bowling alone: The collapse and revival of American community. Simon
and Schuster.
Ravand, H., & Robitzsch, A. (2015). Cognitive diagnostic modeling using R. Practical
Assessment, Research & Evaluation, 20(11).
Roberts, M. R., & Gierl, M. J. (2010). Developing score reports for cognitive diagnostic
assessments. Educational Measurement: Issues and Practice, 29(3), 25-38.
Rockoff, J. E. (2004). The impact of individual teachers on student achievement: Evidence from
panel data. The American Economic Review, 94(2), 247-252.
Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic
achievement. Econometrica, 73(2), 417-458.
Rogers, E. M. (2010). Diffusion of innovations. Simon and Schuster.
Rogers, E. M. (2003). Elements of diffusion. Diffusion of innovations, 5, 1-38.
Rosen, A., & Proctor, E. K. (1981). Distinctions between treatment outcomes and their
implications for treatment evaluation. Journal of Consulting and Clinical Psychology,
49(3), 418–425.
Rothstein, J. & Mathis, W.J. (2013). Review of two culminating reports from the MET project.
National Education Policy Center.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods,
and applications. The Guilford Press.
Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification models:
A comprehensive review of the current state-of-the-art. Measurement, 6(4), 219-262.
Sartain, L., Stoelinga, S. R., & Brown, E. R. (2011). Rethinking teacher evaluation in Chicago
(pp. 1-50). Chicago, IL: Consortium on Chicago School Research at the University of
Chicago Urban Education Institute.
Sawaki, Y., Kim, H. J., & Gentile, C. (2009). Q-matrix construction: Defining the link between
constructs and test items in large-scale reading and listening comprehension
assessments. Language Assessment Quarterly, 6(3), 190-209.
Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate
analysis. Psychometrika, 52(3), 333-343.
Spillane, J. P., & Miele, D. B. (2007). Evidence in practice: A framing of the terrain. Yearbook
of the National Society for the Study of Education, 106(1), 46-73.
Spillane, J. P., Reiser, B. J., & Reimer, T. (2002). Policy implementation and cognition:
Reframing and refocusing implementation research. Review of Educational Research,
72(3), 387–431.
Spillane, J. P., Gomez, L., & Mesler, L. (2009). School organization and policy: Implementation,
organizational resources, and school work practice. In D. Plank, G. Sykes, & B. Schneider
(Eds.), Handbook of education policy research (pp. 409-425). Lawrence Erlbaum.
StataCorp LP. (2007). Stata data analysis and statistical software: Special Edition Release 10.
College Station, TX: StataCorp LP.
Stronge, J. H. (2010). Evaluating what good teachers do: Eight research-based standards for
assessing teacher excellence. Larchmont, NY: Eye On Education.
Stronge, J. H., Gareis, C. R., & Little, C. A. (2006). Teacher pay and teacher quality: Attracting,
developing, and retaining the best teachers. Corwin Press.
Sun, M., & Mutcheson, R. B. (2014). Implementation of Virginia new teacher evaluation system:
A report to district B. Virginia Tech: Virginia, U.S.
Sun, M., Mutcheson, B., & Kim, J. (in press). Teachers' use of evaluation for instructional
improvement and school supports. In J. A. Grissom & P. Youngs (Eds.), Making the
most of multiple measures: The impacts and challenges of implementing rigorous teacher
evaluation systems. New York: Teachers College Press.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item
response theory. Journal of Educational Measurement, 20(4), 345-354.
Tatsuoka, K. K. (1990). Toward an integration of item-response theory and cognitive error
diagnosis. Diagnostic monitoring of skill and knowledge acquisition, 453-488.
Taylor, E. S., & Tyler, J. H. (2012). The effect of evaluation on teacher performance. The
American Economic Review, 102(7), 3628-3651.
Templin, J.L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive
diagnosis models. Psychological Methods, 11, 287-305.
Tucker, P. D., & Stronge, J. H. (2001). Measure for Measure: Using Student Test Results in
Teacher Evaluations. American School Board Journal, 188(9), 34-37.
U.S. Department of Education. ESEA flexibility. Retrieved February 15, 2015, from
http://www2.ed.gov/policy/elsec/guid/esea-flexibility/index.html
Virginia Department of Education. (2011). Guidelines for Uniform Performance Standards and
Evaluation Criteria for Teachers. Virginia, U.S.
Wang, C., & Gierl, M. J. (2007, April). Investigating the cognitive attributes underlying student
performance on the SAT® critical reading subtest: An application of the Attribute
Hierarchy Method. Paper presented at the annual meeting of the National Council on
Measurement in Education, Chicago, IL.
Weisberg, D., Sexton, S., Mulhern, J., Keeling, D., Schunck, J., Palcisco, A., & Morgan, K.
(2009). The widget effect: Our national failure to acknowledge and act on differences in
teacher effectiveness. New Teacher Project.
Xu, X., & von Davier, M. (2008). Fitting the structured general diagnostic model to NAEP data.
ETS Research Report Series, 2008(1), i-18.
von Davier, M. (2015). mdltm: GDM software [Computer software]. Retrieved from
http://www.von-davier.com/
von Davier, M. (2005). mdltm: Software for the general diagnostic model and for estimating
mixtures of multidimensional discrete latent traits models [Computer
software]. Princeton, NJ: ETS.
Youngs, P., Frank, K.A., Thum, Y.M., & Low, M. (2012). The motivation of teachers to
produce human capital and conform to their social contexts. In T. Smith, L. Desimone,
& A.C. Porter (Eds.), Yearbook of the National Society for the Study of Education: Vol.
110. Organization and effectiveness of high-intensity induction programs for new
teachers (pp. 248-272). Malden, MA: Blackwell Publishing.
APPENDICES
Appendix A:
ITES Survey Items By Factor
Characteristics of the Policy
1: Policy Legitimacy
1 Extent sources of evidence were used to inform evaluation: formal obs
2 Extent sources of evidence were used to inform evaluation: informal obs
3 Extent sources of evidence were used to inform evaluation: student work
4 Extent sources of evidence were used to inform evaluation: feedback from parents
5 Extent sources of evidence were used to inform evaluation: student surveys
6 Extent sources of evidence were used to inform evaluation: student growth
7 Extent of agreement with statement: precise instruments were used
8 Extent of agreement with statements: policy impacted challenging homework
9 Extent of agreement with statements: policy impacted classroom assessment
10 Extent of agreement with statements: policy impacted feedback to students
11 Extent of agreement with statements: policy impacted reflection
12 Extent of agreement with statements: policy impacted preparing for tests
13 Extent of agreement with statements: policy impacted strategies for underperforming students
14 Extent of agreement with statements: policy impacted collaboration
15 Extent of agreement with statement: my evaluation provided an accurate rating
16 Extent of agreement with statement: policy impacted my use of ratings
2: Policy Clarity/Adaptability
17 Extent of policy importance placed on guiding PD
18 Extent of policy importance placed on improving instruction
19 Extent of policy importance placed on accountability for achievement
20 Extent policy aligns with my job description
21 Extent policy aligns with previous evaluation
22 Extent policy aligns with school values
23 Extent of policy importance placed on guiding compensation/contract renewal
Characteristics of the Teachers
3: Teacher confidence in abilities relating to policy
24 Extent of confidence in understanding of standards
25 Extent of confidence in understanding of measures
26 Extent of confidence in collecting evidence
27 Extent of confidence setting SMART goals
28 Extent of confidence documenting student progress
29 Extent of confidence using data to adjust teaching
30 Extent of confidence using evaluation to inform teaching
31 Extent of confidence communicating teaching and student growth to parents
4: Attitude towards policy
32 Extent the teacher evaluation in your school was focused on aspects of teacher disposition
33 Extent policy aligns with my own views
34 Extent of agreement with statement: the evaluation process was worth it
35 Extent of agreement with statement: the evaluation process was burdensome (reverse)
36 Extent of agreement with statement: evaluation feedback helped my improvement
37 Extent of agreement with statements: policy impacted communicating student progress
Characteristics of the Leadership
5: Innovation Advocacy/Communication
38 Extent to which teacher instruction was observed by principal
39 Extent to which teacher instruction was observed by asst principal
40 Extent of agreement with statement: teachers are encouraged to find effective strategies
41 Extent of agreement with statement: adequate recognition
42 Extent of agreement with statements: principal advocated for policy
43 Extent of agreement with statements: principal advocated tying evaluation to personnel decisions
44 Extent of agreement with statements: policy impacted communication with admin
6: Quality of Professional Development
45 Extent of professional development usefulness regarding content areas
46 Extent of professional development usefulness regarding teacher evaluation
47 Extent of professional development usefulness regarding making sense of data
48 Extent of professional development usefulness regarding overall impression
49 Extent of professional development hours provided regarding content areas
50 Extent of professional development hours provided regarding teacher evaluation
51 Extent of professional development hours provided regarding making sense of data
52 Extent of agreement with statement: evaluation feedback informed my PD selection
7: Leader Legitimacy
53 Extent of agreement with statements: principal encouraged data for decisions
54 Extent of agreement with statement: fair procedures were used
55 Extent of agreement with statements: principal adequate observations
56 Extent of agreement with statements: principal collected adequate evidence
57 Extent of agreement with statements: principal applied same procedures
58 Extent of agreement with statements: principal had knowledge and skills
59 Extent of agreement with statements: principal decisions best interest school
Characteristics of the Organization
8: Resources
60 Extent of agreement with statement: sufficient time for evaluation
61 Extent of agreement with statement: sufficient resources for evaluation
62 Extent of agreement with statements: policy impacted rigorous materials
63 Extent of agreement with statements: policy impacted time interpreting data
9: Org locus of decision
64 Extent teacher involvement in design and modification of evaluation criteria
65 Extent teacher involvement in design and modification of what evidence is used
66 Extent teacher involvement in design and modification of using data
67 Extent teacher involvement in design and modification of how evidence is used
68 Extent teacher involvement in design and modification of using evaluation for personnel decisions
69 Extent teacher involvement in design and modification of professional development selection
10: Org Values
70 Extent the teacher evaluation in your school was focused on aspects of content knowledge
71 Extent the teacher evaluation in your school was focused on aspects of instructional knowledge/skills
72 Extent the teacher evaluation in your school was focused on aspects of class management
73 Extent the teacher evaluation in your school was focused on aspects of relations with parents/students
74 Extent the teacher evaluation in your school was focused on aspects of collegiality
75 Extent the teacher evaluation in your school was focused on aspects of relations with administrators
76 Extent the teacher evaluation in your school was focused on aspects of service to the profession
77 Extent the teacher evaluation in your school was focused on aspects of impact on student growth
Appendix B
IRB Approval
MEMORANDUM
DATE: September 23, 2015
TO: Gary E Skaggs, Ryan Brock Mutcheson
FROM: Virginia Tech Institutional Review Board (FWA00000572, expires July 29, 2020)
PROTOCOL TITLE: Diagnostic Modeling of Intra-Organizational Mechanisms to
Support Policy Implementation
IRB NUMBER: 15-879
Effective September 23, 2015, the Virginia Tech Institutional Review Board (IRB) Chair, David M.
Moore, approved the New Application request for the above-mentioned research protocol. This
approval provides permission to begin the human subject activities outlined in the IRB-approved
protocol and supporting documents. Plans to deviate from the approved protocol and/or
supporting documents must be submitted to the IRB as an amendment request and approved by
the IRB prior to the implementation of any changes, regardless of how minor, except where
necessary to eliminate apparent immediate hazards to the subjects. Report within 5 business days
to the IRB any injuries or other unanticipated or adverse events involving risks or harms to
human research subjects or others. All investigators (listed above) are required to comply with
the researcher requirements outlined at: http://www.irb.vt.edu/pages/responsibilities.htm
(Please review responsibilities before the commencement of your research.)
PROTOCOL INFORMATION:
Approved As: Exempt, under 45 CFR 46.110 category(ies) 4
Protocol Approval Date: September 23, 2015
*Date a Continuing Review application is due to the IRB office if human subject
activities covered under this protocol, including data analysis, are to continue beyond the
Protocol Expiration Date.
FEDERALLY FUNDED RESEARCH REQUIREMENTS:
Per federal regulations, 45 CFR 46.103(f), the IRB is required to compare all federally funded
grant proposals/work statements to the IRB protocol(s) which cover the human research
activities included in the proposal / work statement before funds are released. Note that this
requirement does not apply to Exempt and Interim IRB protocols, or grants for which VT is not
the primary awardee.
Appendix C
Informal Blueprint of Teacher Survey Item Development
Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers (GUPSECT)
Standards: 1 = Professional Knowledge; 2 = Instructional Planning; 3 = Instructional Delivery;
4 = Assessment For/Of Learning; 5 = Learning Environment; 6 = Professionalism;
7 = Student Academic Progress

Areas of Implementation from Conceptual Framework     Std 1  Std 2  Std 3  Std 4  Std 5  Std 6  Std 7
Characteristics of the Policy and Guidelines            X      X      X      X      X      X      X
Characteristics of School Organizational Factors        X      X      X      X      X      X      X
Characteristics of School Leadership                    X      X      X      X      X      X      X
Characteristics of Teachers                             X      X      X      X      X      X      X
Survey Items Developed
Appendix D
Polytomous EFA Model Dimensionality Summary
Several steps were taken to examine the dimensionality of the polytomous and dichotomous
data. First, the polytomous data were explored using exploratory factor analysis with maximum
likelihood estimation and promax rotation. As a rough approximation, this method treats the
polytomous items, which use 5-point scales, as continuous variables, and it assumes a normal
distribution for each item response. Such an approximation is considered acceptable when items
have at least four response categories and the distributions of item responses are not heavily
skewed (Hu & Bentler, 1998). The AIC and BIC fit indices suggested that a 10-factor model fit
the data best, and based on the factor loadings, this model was readily interpretable. The results
from the exploratory factor analyses indicated that the internal structure of the model for the
dichotomized items was comparable to the internal structure of the model for the polytomous
items. In each case, factor loadings were used to determine which item loaded onto which
factor, with a minimum loading of 0.3 required for an item to load on a factor (McDonald,
2000). A qualitative comparison revealed that not only were the interpretable factors of the
dichotomous and polytomous EFAs similar, the loadings themselves were also very similar.
Although not the major focus of the current study, these findings about model dimensionality
were important because they established that, with these data, the dichotomized items could
reasonably be relied upon for the cognitive diagnostic analysis in case the polytomous models
failed to converge with the available sample size. Moreover, most of the available software
programs for cognitive diagnostic models are restricted to dichotomous models, and those that
do estimate polytomous models require substantially larger sample sizes to reach convergence.
In many settings it would be possible to find additional resources to increase the sample size;
however, this study relied on a secondary dataset, so the sample size could not be altered.
Finally, this finding was also important because cognitive diagnostic models have not previously
been used for this type of application, and the limited existing research centers on dichotomous
models.
With reasonable approximations of univariate normality for the polytomous items, the
next step was to conduct the exploratory factor analysis with maximum likelihood estimation
and promax rotation. A promax rotation was used because it allows relationships between
factors and is preferred in most situations unless a strong argument can be made that the factors
should not be correlated (Beavers et al., 2013; Costello & Osborne, 2005; Gaskin & Happell,
2013; Matsunaga, 2010). Based on the AIC and BIC fit indices obtained from maximum
likelihood estimation treating the items as continuous, normally distributed variables, the
10-factor model was determined to be the best fit for the data. The interpretability of the
10-factor model was also favorable, as discussed in detail in the following sections.
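To make the model-selection computation concrete, the following minimal sketch fits maximum likelihood factor models with an increasing number of factors and computes AIC and BIC from the resulting log-likelihoods. It is illustrative only: the analysis reported here was not run with scikit-learn, and the response matrix X, the file name, and the helper name factor_model_ic are hypothetical. Because rotation leaves the likelihood unchanged, a promax rotation can be applied to the retained solution afterward without affecting these indices.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    def factor_model_ic(X, max_factors=10):
        """Fit ML factor models with 1..max_factors factors; return (k, AIC, BIC).

        X is an (n respondents x p items) matrix of item responses treated
        as continuous, normally distributed variables, as described above.
        """
        n, p = X.shape
        results = []
        for k in range(1, max_factors + 1):
            fa = FactorAnalysis(n_components=k).fit(X)
            loglik = fa.score(X) * n  # score() returns the mean log-likelihood
            # Free parameters: p*k loadings + p uniquenesses, minus the
            # k*(k-1)/2 rotational indeterminacy of the loading matrix.
            n_params = p * k + p - k * (k - 1) // 2
            aic = -2.0 * loglik + 2 * n_params
            bic = -2.0 * loglik + np.log(n) * n_params
            results.append((k, aic, bic))
        return results

    # Hypothetical usage: retain the factor count that minimizes BIC (or AIC).
    # X = np.loadtxt("item_responses.csv", delimiter=",")
    # best_k = min(factor_model_ic(X), key=lambda r: r[2])[0]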
Table 28: Factor Analysis of Polytomous Items
Factors    AIC         BIC
1 24400.51 24760.56
4 13009.42 14421.94
5 11516.64 13270.74
6 10387.34 12478.42
7 9547.16 11970.60
8 8878.52 11629.69
9 8311.46 11385.76
10 7781.59 11174.40
Appendix E
Table 29: 4-Factor Solution for the 70-Item Teacher Evaluation Instrument, Estimated by Maximum Likelihood Treating Items as
Continuous Variables
Item    Policy: Slopes (λi,1,(1))    Teacher: Slopes (λi,1,(2))    Leadership: Slopes (λi,1,(3))    Organization: Slopes (λi,1,(4))    Intercepts (λi,0)    RMSEA
1 0.59 (0.08) 0.44 (0.08) 0.05
2 0.09 (0.07) -0.33 (0.07) 0.02
3 0.22 (0.15) -2.63 (0.15) 0.02
4 0.57 (0.14) -2.42 (0.14) 0.03
5 0.35 (0.12) -1.98 (0.12) 0.05
6 0.31 (0.08) 0.48 (0.08) 0.73 (0.08) 0.06
7 0.11 (0.08) 0.36 (0.08) -0.68 (0.08) 0.06
8 0.37 (0.08) 0.52 (0.08) 0.22 (0.08) 0.04
9 0.43 (0.09) 1.59 (0.09) 2.41 (0.09) 0.07
10 0.96 (0.08) -0.58 (0.08) 0.15
11 0.43 (0.09) 0.67 (0.09) 0.02
12 1.41 (0.1) 0.14 (0.1) 1.42 (0.1) 0.16
13 0.35 (0.11) 1.42 (0.11) 2.22 (0.11) 0.04
14 1.00 (0.09) 0.64 (0.09) 0.84 (0.09) 0.08
15 1.14 (0.08) -0.76 (0.08) 0.12
16 0.77 (0.08) 0.32 (0.08) 0.05
17 0.66 (0.08) 0.47 (0.08) 0.29 (0.08) 0.03
18 0.78 (0.1) 0.98 (0.1) 1.87 (0.1) 0.08
19 0.87 (0.08) 0.24 (0.08) 0.11
20 0.78 (0.08) 0.31 (0.08) 0.09
21 0.97 (0.09) -1.44 (0.09) 0.07
22 0.62 (0.09) -0.8 (0.09) 0.06
23 0.63 (0.09) -1.52 (0.09) 0.12
24 0.95 (0.1) 1.96 (0.10) 0.05
25 0.14 (0.09) 0.41 (0.09) -1.35 (0.09) 0.05
26 0.52 (0.08) 0.98 (0.08) 0.05
27 0.15 (0.08) -0.41 (0.08) 0.07
28 0.84 (0.09) -0.92 (0.09) 0.06
29 0.17 (0.08) -0.9 (0.08) 0.08
30 1.03 (0.08) 0.03 (0.08) 0.06
31 0.79 (0.09) -0.9 (0.09) 0.06
32 0.02 (0.08) 0.76 (0.08) 0.03
33 0.49 (0.13) -2.13 (0.13) 0.04
34 0.22 (0.1) 0.32 (0.1) -1.4 (0.1) 0.05
35 0.55 (0.09) -1.11 (0.09) 0.06
36 0.35 (0.14) -2.77 (0.14) 0.03
37 0.69 (0.13) -2.13 (0.13) 0.03
38 1.03 (0.17) -2.92 (0.17) 0.08
39 1.89 (0.09) -2.34 (0.09) 0.05
40 1.66 (0.09) -0.36 (0.09) 0.05
41 1.81 (0.11) 0.11 (0.11) 0.06
42 1.57 (0.09) -0.73 (0.09) 0.04
43 1.8 (0.12) 0.91 (0.12) 0.16
44 0.91 (0.13) 1.68 (0.13) 0
45 0.86 (0.11) 1.32 (0.11) 0.03
46 1.94 (0.13) 1.09 (0.13) 0.02
47 2.34 (0.15) 1.62 (0.15) 0.06
48 1.58 (0.13) 0.93 (0.13) 1.8 (0.13) 0.13
49 2.1 (0.09) 1.92 (0.09) 0.07
50 0.96 (0.1) 1.07 (0.1) 1.12 (0.1) 0.01
51 1.69 (0.1) -0.75 (0.1) 0.09
52 1.53 (0.1) 0.00 (0.1) 0.15
53 1.7 (0.1) 0.64 (0.1) 0.05
54 0.13 (0.07) -0.19 (0.07) 0.06
55 1.2 (0.09) -0.58 (0.09) 0.06
56 0.32 (0.09) 1.39 (0.09) -0.17 (0.09) 0.05
57 0.16 (0.09) 1.3 (0.09) -0.48 (0.09) 0.08
58 1.16 (0.09) -0.33 (0.09) 0.09
59 4.00 (0.09) 3.77 (0.09) 0.15
60 1.46 (0.09) 1.05 (0.09) 0.12
61 1.07 (0.08) 0.73 (0.08) 0.06
62 0.87 (0.08) -0.18 (0.08) 0.06
63 0.95 (0.09) -0.49 (0.09) 0.15
64 2.79 (0.09) 2.64 (0.09) 0.08
65 0.86 (0.09) -0.5 (0.09) 0.01
66 1.04 (0.09) 0.5 (0.09) 0.06
67 1.09 (0.09) -1.7 (0.09) 0.07
68 0.98 (0.09) 0.38 (0.09) -1.97 (0.09) 0.06
69 0.87 (0.09) -1.4 (0.09) 0.01
70 0.53 (0.08) -0.88 (0.08) 0.05
Note 1. The values in ( ) are standard errors.
Note 2. Blanks in the factor loading estimate and its standard error mean that the estimated loading was lower than the pre-specified threshold
(i.e., 0.3) and therefore it was not reported here.
Appendix F
Description of Model Evaluation Discrimination Index (DI)
The DI is calculated using the observed proportion-correct scores for those teachers who
perceived support on an item and those teachers who did not; the method is adapted from Li and
Suen (2013). A larger difference between the proportion-correct scores of these two groups
indicates better model fit, because membership in the "perceived support" or "non-perceived
support" group is based on each examinee's skill classification, which is determined
probabilistically (DiBello, Roussos, & Stout, 2007).
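Because the computation is simple, a minimal sketch is given below; it is not the code used in this study. It assumes a dichotomous response matrix X (teachers by items) and a same-shaped indicator matrix, here called mastery, that flags whether each teacher's estimated attribute profile covers the attributes an item requires according to the Q-matrix; both names are illustrative.

    import numpy as np

    def discrimination_index(X, mastery):
        """Per-item discrimination index, adapted from Li and Suen (2013).

        X       : (n x J) array of 0/1 observed item responses.
        mastery : (n x J) array of 0/1 flags; 1 means the respondent's
                  estimated profile covers the attributes item j requires
                  ("perceived support" for that item).
        Returns the per-item difference in observed proportion endorsing
        the item between the two model-defined groups (assumes both
        groups are non-empty for every item).
        """
        X = np.asarray(X, dtype=float)
        mastery = np.asarray(mastery, dtype=bool)
        di = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            di[j] = X[mastery[:, j], j].mean() - X[~mastery[:, j], j].mean()
        return di

    # Averaging di over items gives the summary DI reported below
    # (about 0.4 for the retained 4-attribute model).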
For the 4-attribute model that was determined to be the best-fitting model, the average
difference in proportion correct between model-predicted "perceived support" teachers and
"non-perceived support" teachers across all items was 0.4: teachers who perceived support had a
mean proportion correct of 0.68, while "non-perceived support" teachers had a mean proportion
correct of 0.26. However, there are no commonly agreed-upon cutoff criteria, so the DI can only
be used as relative model fit evidence.
Figure 10. Description of Model Evaluation Discrimination Index (DI): observed proportion
correct (0 to 1) across items 1-75 for model-classified Masters versus Non-Masters.
Appendix G
In the original analysis, a traditional eigenvalue-based exploratory factor analysis (EFA)
with promax rotation was used to understand the underlying structure of the data. The results
suggested a 10-factor solution, the factor loadings were interpreted, and the resulting model was
conceptually sound. Here, a full-information EFA for dichotomous items was performed using
maximum likelihood and promax rotation with IRTPRO software (Scientific Software
International Incorporated, 2016). The global fit statistics, AIC and BIC, for models with 1
through 13 factors are summarized in Table 30. The AIC was lowest for Model E (8 factors) and
the BIC was lowest for Model B (5 factors); clearly, Model A1 (1 factor) is not empirically
supported as the best model. No RMSEA values were available from the IRTPRO output.
Table 30. Full-Information Exploratory Factor Analysis Model Fit
Model Factors Estimated Parameters AIC BIC
A1 1 140 51677.84 52324.09
A2 2 209 49356.67 50321.43
A3 3 277 48262.33 49540.98
A4 4 344 47236.3 48824.22
B 5 410 46483.42 48376.01
C 6 475 46194.84 48387.47
D 7 539 45966.38 48454.44
E 8 602 45728.64 48507.51
F 9 664 45768.6 48833.67
G 10 725 45809.92 49156.56
H 11 785 45930.19 49553.8
I 12 844 46164.73 50060.69
J 13 902 46876.64 51040.33
The discrepancy between the AIC (8-factor solution) and BIC (5-factor solution) results
demonstrated one of the difficulties in determining the number of factors statistically and/or
objectively. The -2 log-likelihoods were therefore recovered from the AIC values, and deviance
tests were conducted to compare the models. The deviance tests comparing the models with 1
through 13 factors are summarized in Table 31. The results indicated that Model G (10 factors)
was the best-fitting model. Beyond Model G, the -2 log-likelihood values began to increase
rather than decrease. At first, it was thought that this could be because the likelihood function
was becoming too flat and multimodal, causing the algorithm to settle on a local maximum.
However, a closer look at the output revealed that, because of the number of parameters, models
with 11 or more factors were actually not converging. The deviance test between the 9- and
10-factor models indicated that the 10-factor model fit significantly better than the 9-factor
model, although the p-value was 0.047. It is worth noting that, had an 11-factor model actually
converged, it appeared it might not have fit significantly better than the 10-factor model. Based
on the deviance tests, Model G, the 10-factor model, was retained.
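As a sketch of this computation (a hypothetical reconstruction, not the IRTPRO output), the deviance comparison can be reproduced from Table 30 alone by recovering -2logL from each model's AIC and parameter count and referring the drop in deviance to a chi-square distribution.

    from scipy.stats import chi2

    def deviance_test(aic_h0, npar_h0, aic_ha, npar_ha):
        """Likelihood-ratio (deviance) test between nested models.

        -2logL is recovered from AIC as AIC - 2 * (number of parameters);
        the statistic is the drop in -2logL, with degrees of freedom equal
        to the difference in free parameters.
        """
        neg2ll_h0 = aic_h0 - 2 * npar_h0
        neg2ll_ha = aic_ha - 2 * npar_ha
        statistic = neg2ll_h0 - neg2ll_ha
        df = npar_ha - npar_h0
        return statistic, df, chi2.sf(statistic, df)

    # Model F (9 factors) vs. Model G (10 factors), values from Table 30:
    stat, df, p = deviance_test(45768.60, 664, 45809.92, 725)
    print(stat, df, p)  # about 80.68 on 61 d.f., p ~ 0.047, matching Table 31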
Table 31. Deviance Test
Model    Factors    Estimated Parameters    -2logL    H0    Ha    Statistic    d.f.    p-value    Decision at .05 level
A1 1 140 51397.84 A1 A2 2459.17 69 0.000 A2
A2 2 209 48938.67 A2 A3 1230.34 68 0.000 A3
A3 3 277 47708.33 A3 A4 1160.03 67 0.000 A4
A4 4 344 46548.3 A4 B 884.88 66 0.000 B
B 5 410 45663.42 B C 418.58 65 0.000 C
C 6 475 45244.84 C D 356.46 64 0.000 D
D 7 539 44888.38 D E 363.74 63 0.000 E
E 8 602 44524.64 E F 84.04 62 0.033 F
F 9 664 44440.6 F G 80.68 61 0.047 G
G 10 725 44359.92 - - -0.27 60 - G
H 11 785 44360.19 - - -116.54 59 - G