Diagnostic Modeling of Intra-Organizational Mechanisms for Supporting Policy Implementation
Ryan Brock Mutcheson
Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in
partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Educational Research and Evaluation
Gary E. Skaggs, Chair
Sue G. Magliaro
Yasuo Miyazaki
Kusum Singh
April 28, 2016
Blacksburg, Virginia
Keywords: cognitive diagnostic modeling, latent class model, teacher evaluation, effective
teaching, psychometrics, C-RUM model
Diagnostic Modeling of Intra-Organizational Mechanisms for Supporting Policy Implementation
Ryan Brock Mutcheson
ABSTRACT
The Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for
Teachers represented a significant overhaul of conventional teacher evaluation criteria in
Virginia. The policy outlined seven performance standards by which all Virginia teachers would
be evaluated. This study explored the application of cognitive diagnostic modeling to measure
teachers’ perceptions of intra-organizational mechanisms available to support educational
professionals in implementing this policy.
It was found that a coarse-grained, four-attribute compensatory reparameterized unified
model (C-RUM) fit the teacher perception data better and had lower standard errors than the
competing finer-grained models. The Q-matrix accounted for the complex loadings of items to
the four theoretically and empirically driven mechanisms of implementation support, including
characteristics of the policy, teachers, leadership, and the organization. The mechanisms were
positively, significantly, and moderately correlated, which suggested that each mechanism
captured a different, yet related, component of policy implementation support. The diagnostic
profile estimates indicated that the majority of teachers perceived support on items relating to
“characteristics of teachers.” Moreover, almost 60% of teachers were estimated to belong to
profiles with perceived support on “characteristics of the policy.” Finally, multiple group
multinomial log-linear models (Xu & von Davier, 2008) were used to analyze the data across
subjects, grade levels, and career status. STEM teachers reported lower perceived support than
non-STEM teachers with the same profile, suggesting that STEM teachers required different
supports than their non-STEM peers.
The precise diagnostic feedback on the implementation process provided by this
application of diagnostic models will be beneficial to policy makers and educational leaders.
Specifically, they will be better prepared to identify strengths and weaknesses and target
resources for a more efficient, and potentially more effective, policy implementation process. It
is assumed that when equipped with more precise diagnostic feedback, policy makers and
school leaders may be able to more confidently engage in empirical decision making, especially
in regards to targeting resources for short-term and long-term organizational goals subsumed
within the policy implementation initiative.
Table of Contents
CHAPTER 1 ................................................................................................................................................ 1
RESEARCH PROBLEM ........................................................................................................................... 1
Context of the Research Study ............................................................................................................... 1
Statement of the Problem .................................................................................................................... 2
Purpose of the Study ............................................................................................................................. 5
Overview of the Study ............................................................................................................................ 6
Overview of the Theoretical Framework .............................................................................................. 6
Overview of the Sample and Data Collection ....................................................................................... 8
Overview of the Methodology ............................................................................................................ 12
Significance of the Study ...................................................................................................................... 15
Limitations ............................................................................................................................................. 17
Organization of the Study .................................................................................................................... 19
CHAPTER 2 .............................................................................................................................................. 20
REVIEW OF THE LITERATURE......................................................................................................... 20
Introduction ........................................................................................................................................... 20
Teacher Effectiveness ........................................................................................................................... 20
Traditional Teacher Evaluation Systems and the Widget Effect ..................................................... 22
Teacher Evaluation Policies and the Multiple Measures of Teacher Effectiveness ........................ 24
The Policy: Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for
Teachers ................................................................................................................................................. 27
Theoretical Framework: Policy Implementation in K-12 Organizations ........................................ 31
Preliminary Assumptions about the Policy Implementation Process ................................................. 31
Theories of Utility Maximization and Policy Implementation ............................................................ 33
Social Capital and Policy Implementation ........................................................................................... 34
Frameworks for K-12 Policy Implementation ..................................................................................... 35
Review of Cognitive Diagnostic Modeling Concepts .......................................................................... 38
Measurement Theory ......................................................................................................................... 38
Cognitive Diagnostic Modeling Theory and Applications ................................................................... 44
Model Specification: Compensatory vs. Non-Compensatory ............................................................. 47
The Core Compensatory and Non-Compensatory Cognitive Diagnostic Models ............................... 49
The Q-Matrix ....................................................................................................................................... 52
CHAPTER 3 .............................................................................................................................................. 56
METHODOLOGY ................................................................................................................................... 56
The ITES Project, Current Study, and the Role of the Researcher ................................................. 57
Data Collection and Sample ................................................................................................................. 57
Instrumentation..................................................................................................................................... 62
Item Analysis ....................................................................................................................................... 63
Descriptive Statistics ........................................................................................................................... 67
Plan of Analysis ..................................................................................................................................... 69
The Q-Matrix: Defining the Attribute and Skills Space ....................................................................... 70
The Compensatory Reparameterized Unified Model (C-RUM) .......................................................... 72
Cognitive Diagnostic Model Estimation Method ................................................................................ 73
Analyses for Research Question 1 ...................................................................................................... 75
Analysis for Research Question 2 ........................................................................................................ 80
CHAPTER 4 .............................................................................................................................................. 84
RESULTS .................................................................................................................................................. 84
Preliminary Analysis: Q-Matrix Development ................................................................................ 84
Data Dimensionality ............................................................................................................................ 84
Exploratory Factor Analysis ................................................................................................................. 85
The Q-Matrix and Model Specification ............................................................................................... 90
Findings .................................................................................................................................................. 91
Research Question 1: Testing the New Application of Cognitive Diagnostic Models ......................... 91
Research Question 2: Exploring Group Comparisons Using the Diagnostic Model .......................... 107
CHAPTER 5 ............................................................................................................................................ 124
IMPLICATIONS .................................................................................................................................... 124
Discussion on Research Question 1a ................................................................................................. 125
Discussion on Research Question 1b ................................................................................................. 127
Discussion on Research Question 1c.................................................................................................. 127
Discussion for Research Question 2a ................................................................................................ 129
Discussion for Research Question 2b ................................................................................................ 130
Summary of Results ............................................................................................................................ 132
Limitations ........................................................................................................................................... 134
Future Research .................................................................................................................................. 136
References ................................................................................................................................................ 139
APPENDICES ........................................................................................................................... 146
Appendix A: ............................................................................................................................................. 146
ITES Survey Items By Factor ............................................................................................................... 146
Appendix B .......................................................................................................................................... 150
IRB Approval ...................................................................................................................................... 150
Appendix C .......................................................................................................................................... 151
Informal Blueprint of Teacher Survey Item Development ................................................................ 151
Appendix D .......................................................................................................................................... 152
Polytomous EFA Model Dimensionality Summary ............................................................................. 152
Appendix E .......................................................................................................................................... 154
Appendix F .......................................................................................................................................... 157
Description of Model Evaluation Discrimination Index (DI) ............................................................. 157
Appendix G .......................................................................................................................................... 158
List of Figures
Figure 1. Conceptual Framework of Mechanisms that Influence the Implementation ............................... 8
Figure 2. Item Flagged for Substantive Review ........................................................................................... 65
Figure 3. Distribution of ITES Survey Dichotomized Total Scores ............................................................... 68
Figure 4. Exploratory Factor Analysis Scree Plot ......................................................................................... 87
Figure 5. Average Standard Errors of Intercept Parameter Estimates ....................................................... 98
Figure 6. Comparisons of Confidence Intervals for Item 22 Intercept Estimates by Group ..................... 117
Figure 7. Comparisons of Confidence Intervals for Item 23 Intercept Estimates by Group ..................... 118
Figure 8. Comparisons of Confidence Intervals for Item 28 Intercept Estimates by Group ..................... 119
Figure 9. Comparisons of Confidence Intervals for Four Item Slope Estimates by Grade Level .............. 123
Figure 10. Description of Model Evaluation Discrimination Index (DI) .................................................... 157
List of Tables
Table 1. Data Sources and Descriptions ........................................................................................................ 9
Table 2. Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers . 29
Table 3. Example of Q-Matrix ..................................................................................................................... 53
Table 4. Data Description for the Current Study ........................................................................................ 58
Table 5. 2013-2014 School-Year Local Education Agency Information ...................................................... 59
Table 6. Comparison of Teacher Characteristics......................................................................................... 61
Table 7. Crosstabs of Teacher Characteristics ............................................................................................ 61
Table 8. Polytomous Item Descriptive Statistics ......................................................................................... 64
Table 9. Dichotomized Item Descriptive Statistics ...................................................................................... 67
Table 10. Total Scores by Group ................................................................................................................. 69
Table 11. Exploratory Factor Analysis Eigenvalues by Factor ..................................................................... 86
Table 12. Summary Interpretation of Factors ............................................................................................. 89
Table 13. Final Model: Grain Sizes .............................................................................................................. 90
Table 14. An example of the Final Q-Matrix ............................................................................................... 91
Table 15. Fit Results for Models with 7-10 Attributes ................................................................................ 94
Table 16. Fit Results for Unidimensional Model and Models with 4-6 Attributes ..................................... 97
Table 17. Comparisons of 4-Attribute Model Average Parameter Estimate Distributions ...................... 100
Table 18. Parameter Distributions for Final Model .................................................................................. 101
Table 19. Example of Estimated Posterior Probabilities of Attribute Perceived Support By Respondent ... 103
Table 20. Example of Estimated Latent Profiles Based on Posterior Probabilities By Respondent ......... 103
Table 21. Distribution of Diagnostic Categorical Profiles for 4-Attribute Model ...................................... 105
Table 22. Percentage of Teachers in Profiles with Perceived Support by Attribute ................................. 107
Table 23. Estimated Group Proportions by Profile ................................................................................... 111
Table 24. Correlations Between Attributes .............................................................................................. 112
Table 25. Fit Statistic Comparisons Between Models ............................................................................... 113
Table 26. Comparisons of Intercept Estimate Distributions of Single vs. Multi-Group Models ............... 115
Table 27. Comparisons of Slope Estimate Distributions of Single vs. Multi-Group Models ..................... 120
Table 28. Factor Analysis of Polytomous Items ........................................................................ 153
Table 29. 4-Factor Solution for 70-Item Teacher Evaluation Instrument by Applying the Maximum
Likelihood for Continuous Variables ......................................................................................... 154
Table 30. Full-Information Exploratory Factor Analysis Model Fit ........................................................... 158
Table 31. Deviance Test ............................................................................................................................ 159
CHAPTER 1
RESEARCH PROBLEM
Context of the Research Study
State and local education agencies (LEAs) are investing substantial resources, in both
human and financial capital, toward a nationwide initiative that re-conceptualizes teacher
evaluation (Darling-Hammond et al., 2012; Hallinger, Heck, & Murphy, 2014).
Moreover, the Gates Foundation has invested $45 million in the Measures of Effective Teaching
(MET) project that uses multiple measures of teacher effectiveness, including student evaluations
of teachers, student classroom work, and evaluations of classroom practice using multiple rubrics
(e.g., Kane, McCaffrey, Miller, & Staiger, 2013).
According to Sartain, Stoelinga, and Brown (2011), two main factors have motivated this
movement. First, the traditional teacher evaluation process is generally not an effective
mechanism for promoting and supporting teacher development in order to improve student
achievement. Secondly, the traditional teacher evaluation process has not proven to be an
effective mechanism for providing data for making empirically supported personnel decisions.
For example, in a project funded by the New Teacher Project, Weisberg, Sexton, Mulhern, and
Keeling (2009) found that under the traditional teacher evaluation system in Chicago Public
Schools (CPS), 93 percent of teachers were rated as either “superior” or “excellent.” At the same
time, 66 percent of CPS schools were failing to meet state standards, suggesting a major
disconnect between classroom results and classroom evaluations. Moreover, they found that 99
percent of teachers were rated as “satisfactory” when their schools used a binary
satisfactory/unsatisfactory rating system.
In September 2011, with bipartisan support, the United States Department of Education
(USDOE) invited all State Educational Agencies (SEAs) to request flexibility regarding specific
requirements of the Elementary and Secondary Education Act (ESEA). Designing and
implementing teacher performance-based evaluation had been the main focus of the efforts to
implement ESEA flexibility (USDOE, Jan. 2013). Initially, the USDOE granted waivers to 34
states and the District of Columbia, including Virginia. The Virginia Department of Education
(VDOE) efforts produced the Guidelines for Uniform Performance Standards and Evaluation
Criteria for Teachers (GUPSECT), a document that became effective on July 1, 2012. The
guidelines called for 40% of teachers’ evaluations to be based on student academic progress
using multiple measures of learning and achievement. The formal signing of this policy
represented a significant overhaul of conventional evaluation criteria. The 2012-13 school-year
was the first year of the state-wide pilot and implementation.
Statement of the Problem
A better understanding of the impacts of the teacher evaluation policy in Virginia is
necessary to guide educational leaders and policy makers. At the national level, researchers
supported by major government and non-profit funding agencies, including Institute of
Education Science (IES), National Science Foundation (NSF), the Gates Foundation, and
Spencer Foundation, have been exploring the reliability and validity of the effective
performance-based evaluation rubrics and toolkits (Sun & Mutcheson, 2014). Unfortunately,
however, such policy studies have proven difficult to conduct due, in part, to issues regarding the
variance in key aspects of the implementation of educational policies. Even in cases where
statewide policies exist, there is enough ambiguity in the policy to lead to disparities in the
implementation across LEAs. Moreover, some components of the policies are explicitly intended
for LEAs to interpret and implement in ways that suit their particular needs in the best interests
of the students they serve.
Principals and superintendents across the country have highlighted concerns about the
successful implementation of the new systems, including gaining teachers’ buy-in, the cost of
training teachers under the new system, and tying evaluations to strategic compensation plans
(Sun & Mutcheson, 2014). Thus, in order to understand policy impacts, greater insight is needed
into the discrepancies between the intentions of education policy makers and the way that the policy
unfolds in reality (Coburn, 2001; Datnow & Park, 2009; Spillane & Miele, 2007; Spillane,
Reiser, & Reimer, 2002). Since political and legislative efforts have already outlined the new
teacher evaluation criteria, the issue facing practitioners is how to successfully implement this
system and use it in ways that promote teachers’ professional growth (Sun, Mutcheson & Kim,
2014).
Given that teachers are ultimately responsible for implementing teacher evaluation
policies in that they incorporate the policy principles into practice, their voices are potentially
very valuable sources of information in understanding what is and is not working in the policy
implementation process. Although few empirical studies explore how to best support LEAs in
conducting the new teacher evaluations (Rothstein & Mathis, 2013), there are even fewer studies
that attempt to improve the precision with which the key mechanisms for supporting the policy
implementation process are measured. More precise diagnostic feedback on the fidelity of policy
implementation could potentially help practitioners and researchers understand strengths and
weaknesses on organizational support mechanisms, identify remedial pathways, and target
resources toward perceived support, on all teacher evaluation policy components. Most
importantly, more precise diagnostic feedback may help researchers and practitioners to better
understand under what conditions the new teacher evaluation system works and help school
leaders reflect and adjust past practices of supporting teacher development. It is assumed that
more effective measurement practices will promote a stronger culture of continuous
improvement. Diagnostic results can be used to organize and inform teacher training, and
ultimately gain teacher “buy-in” on policies and procedures. All school districts invest
substantial resources on district-wide professional development. Hypothetically, if those
planning professional development had diagnostic information about teachers’ perceptions of
intra-organizational supports, they could use this information to direct teacher training in a much
more efficient, cost-effective manner. Teachers could be put in groups to receive personalized
support on the necessary policy elements. Moreover, patterns among the results can be used to
inform discussions among school leaders, and they can learn effective support strategies from
one another. It should be noted that this methodology is congruent with the increasingly popular
“standards-based” approach in K-12 education. Thus, the results from this type of study should
only be used for formative purposes. More clearly, this study does not support the use of these
results for personnel decisions such as hiring, firing, promotion, or compensation.
This will be the first study to approach the implementation of teacher evaluation policies
using cognitive diagnostic models. As discussed further in the following chapters, cognitive
diagnostic modeling will provide key measurement related advantages. According to Haberman,
von Davier, and Lee (2008), “…analyzing data from assessments that are created with the intent
of scaling respondents on a unidimensional continuum almost always provides extremely poor
results from a diagnostic standpoint.” One recent study (Halpin, 2015) focuses on using latent
class analysis to investigate the actual teacher evaluation scores of teachers when different
instruments were used. That study demonstrates the usefulness in applying a new methodology
to a popular, worthwhile concept. It is anticipated that this study will accomplish a similar
methodological goal. Specifically, this approach will offer useful, formative, diagnostic
information to LEAs, with the key difference being that the diagnostic information will be about
teachers’ perceptions of the strengths and weaknesses of the teacher evaluation policy
implementation process, instead of overall evaluation scores.
The anticipated result of this study will be diagnostic output that provides detailed
empirical information about the teachers’ perceptions of policy support involved in the
response processes, and about the manner in which these components interact (DiBello,
Roussos, & Stout, 2007). In an educational assessment context, it is commonly believed that an
identification of these perceptions, sometimes referred to as “mental components,” may help to
identify remedial pathways toward perceived support, on all components that are relevant and
educationally meaningful to the respondents (DiBello, Roussos, & Stout, 2007).
Purpose of the Study
The purpose of this study is to explore a new way to measure teachers’ perceptions of
school intra-organizational mechanisms for supporting teacher evaluation policy implementation.
Cognitive diagnostic models have not previously been applied to policy implementation support
constructs. The diagnostic output from the analysis in this study will provide detailed empirical
information about teachers’ perceptions of support. It is assumed that more precise diagnostic
feedback will be beneficial to policy makers and school leaders in identifying strengths and
weaknesses and in targeting resources in the policy implementation process. When equipped
with more precise diagnostic feedback, policy makers and school leaders may be able to more
confidently engage in empirical decision making, especially in regards to targeting resources for
short-term and long-term organizational goals subsumed within the policy implementation
initiative. Specifically, the following research questions are addressed:
1. Can cognitive diagnostic models be applied to understanding teachers’ perceptions of
intra-organizational mechanisms for supporting policy implementation?
a. Do diagnostic models specifying finer-grained attributes representing implementation
support mechanisms fit the data better than models specifying coarser grained
attributes?
b. Do diagnostic models specifying finer-grained attributes representing implementation
support mechanisms have more stable and accurate parameter estimates than models
specifying coarser grained attributes?
c. What are the posterior estimated proportions of teachers that are diagnosed to fall
under each latent profile?
2. Are there group differences in the diagnostic model fit based on grade level, subject
taught, and career status?
a. What are the group differences of the estimated latent class profile distributions
based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of parameter
estimates based on grade level, subject taught, and career status?
Overview of the Study
Overview of the Theoretical Framework
The development of the theoretical framework for this project relies on utility
maximization theories that explain individual behaviors in their social settings (Akerlof &
Kranton, 2005; Youngs, Frank, Thum, & Low, 2012), and the importance of capacity building as
schools organize for policy implementation (e.g., Bryk, Sebring, Allensworth, Luppescu, &
Easton, 2010; Spillane, Gomez, & Mesler, 2009). Moreover, the concept for this study is
influenced by organizational practices of applying new assessment strategies to propel
transitioning individuals, teams, and organizations to a desired future state. These ideas are
congruent with the ideas outlined in Rogers’ Diffusion of Innovations (1962) and have been
extended and adapted to inform many other studies that were used in the development of the
framework for this study. These ideas are explained in detail in chapter 2.
In this study, teachers’ perceptions are to be analyzed from a systematic view of school
and district organizational supports for policy implementation intended to permeate schools and
their classrooms (Sun & Mutcheson, 2014). The extent to which teachers perceive that they are
supported on the new teacher evaluation system is assumed to have implications for the
effectiveness of the implementation of the policy. Moreover, the variation in successful
implementation is also assumed to affect the variation in the degree to which the new evaluation
system could achieve its potential to provide more useful feedback to teachers and promote
effective teaching. Finally, the policy implementation is hypothesized to be influenced by
various factors, including the clarity of the guidelines, principals’ and teachers’ acceptance of
various aspects of the new teacher evaluation system, teacher and principal capacity to carry out
the reform, and the coherence of the new system with other school practices.
Relying on the aforementioned literature and assumptions, the mechanisms for
supporting the policy implementation are initially investigated in four components. The first
component is the teachers’ perceived support on policy guidelines. This includes, among other
supports, how clear the policy guidelines are to teachers, how specific and adaptable they are,
and how the policy lends itself to being communicated and monitored. Secondly, teachers’
perceptions of support mechanisms at the teacher level are explored. This includes teacher self-
efficacy, expertise, and capacity to change. It also includes the important support gained from
situated contexts like close colleagues. Next, school leadership characteristics are investigated. These include
leadership expertise and advocacy for the policy. Moreover, it includes teachers’ perceptions of
whether leadership values align with the policy values. Finally, organizational supports, such as
professional development, collaboration, resources, and locus of decision making, are explored.
These areas of the conceptual framework are discussed more thoroughly in
chapter two. The assumptions, theoretical components, and the overlapping relationships are
visually represented by the guiding model presented in Figure 1.
Figure 1. Conceptual Framework of Mechanisms that Influence the Implementation
Overview of the Sample and Data Collection
The data used in this study was obtained from a secondary source. Specific permission
was granted by the Principal Investigator of the project entitled “Exploring the
Implementation of Virginia’s New Teacher Evaluation Policy (ITES)”. The role of the researcher
in the current study was as a research team member of the ITES study. A clear distinction
between the studies and the role of the researcher is outlined in chapter 3.
The ITES study included three participating LEAs for a total of 35 schools: 6 high, 6
middle, and 23 elementary schools. The ITES project used a convenience sample of partnership
LEAs located in Southwest Virginia. A total of 19,315 students are served across all three
districts. All full-time teachers in each of the LEAs were eligible to take the survey. The final
sample included 747 teachers. The overall survey response rate was just over 70%. In the larger
project, both quantitative and qualitative data were collected from multiple sources. However,
for this dissertation study, the primary sources of data included:
Table 1. Data Sources and Descriptions

Data source: Local Education Agency Administrative Data
Description: Teachers’ growth measures; other measures of teachers’ performance, including peer evaluation and student surveys of classroom instruction; teacher background; and school and division documents relevant to the implementation.
Collection method: Provided by administrators in partnership LEAs.

Data source: NCES
Description: Necessary data from the National Center for Education Statistics Common Core of Data.
Collection method: Online source.

Data source: ITES Teacher Surveys
Description: Teachers’ attitudes towards supports for the new teacher evaluation system and their perceptions of major barriers in the adoption.
Collection method: Research team survey administration.
It should be noted that the larger ITES study is currently in the third year of data
collection and has already attracted research interest beyond the initial project. Multiple
educational researchers at various institutions have already begun projects using this data.
Moreover, publications and presentations at conferences and professional forums have already
resulted from this data. Despite the strengths of this data, it does have limitations. One limitation
is that the data was not collected from a probabilistic sample. Rather, the data was collected
from a convenience sample of partnership LEAs in Southwest Virginia. However, there are
multiple reasons why this is the data that will answer the previously presented research
questions. Most importantly, no other dataset currently includes the variables necessary to
answer the research questions in this study. There are a few reasons for this. First, this study is
focused on K-12 organizational policy implementation. Such policy studies have proven difficult
to conduct due, in part, to the fact that education policies are typically implemented at the state
level; thus, there exist issues regarding the variance in key aspects of the implementation
process. With no federally mandated policy, stipulations vary across states. Even in cases where
similar objectives are to be met from two separate statewide policies, there are enough
ambiguities in the policies or the organizational structures to lead to disparities in the
implementation across LEAs. Secondly, a nationally representative dataset would not necessarily
be better since the goal of this study is to investigate a state-level policy implementation process.
Unless a federal policy mandates a uniform teacher evaluation policy, which is highly unlikely,
there will not be a national dataset available for this particular topic. Thus, the instrument was
not developed to collect data from or investigate any policy other than the Virginia Guidelines for
Uniform Performance Standards and Evaluation Criteria for Teachers. The consequence of this
focus is that the statistical and diagnostic results will not generalize beyond the Virginia
education borders. However, this limitation in traditional generalizability should not be taken to
preclude potential implications for this study beyond Virginia borders. In fact, it is anticipated
that there will be interest in this study outside of Virginia because of the unique methodological
approach and comprehensive conceptual model. As previously mentioned, cognitive diagnostic
modeling has not previously been applied to intra-organizational support mechanisms. Although
the statistical results are not applicable outside of Virginia, the methodological approach to
assessing policy implementation may draw interest from both the measurement and educational
policy communities. It must be reiterated and emphasized that the aim of this study is to explore
the Virginia state education policy implementation. Although the convenience sample precludes
generalizability to the national level, the study is designed to make strong local impacts at the
state and district level. Hence, external validity, although important, is not as important as the
potential of the study to help promote a local culture of policy implementation assessment. Since
the population in this study is all Virginia public schools, and the sample in this study is
representative of that population in terms of locality (e.g., suburban, rural, urban), demographic
indicators (e.g., race/ethnicity, gender), and achievement (e.g., SOL), the level of external
validity is adequate.
The second most important reason why this is the best dataset for this study is that no
other datasets allow researchers to explore cognitive diagnostic modeling to analyze supports in
this way. As described in the theoretical framework, teachers’ perceptions of four separate levels
or attributes will be modeled. While some datasets include items about teachers’ perceptions of
workplace conditions, they either ask about organizational factors or leadership factors. This
dataset includes items that capture teachers’ perceptions on all four levels: policy characteristics,
teacher characteristics, leadership characteristics, and organizational characteristics. One
alternative dataset that was considered was the NCES Schools and Staffing Survey. Although
this dataset is nationally representative and collects teachers’ perceptions of working conditions,
none of the items on this survey specifically address supports for policy implementation. Rather,
the items address general working conditions. Another dataset that was considered was from the
Measures of Effective Teaching. Despite the stringent restrictions on this data, it may have
provided the most plausible alternative as it is a nationally representative randomized dataset and
was collected in the context of teacher evaluation policy implementation in multiple LEAs.
However, the teacher survey in this study asks about teacher perceptions of more general
supports for teacher evaluation and does not ask about supports that are directly related to the
Virginia state teacher evaluation policy. Moreover, the data does not provide information about
teachers’ perceptions of supports at all four levels described in the framework for this study.
Lastly, in addition to the diagnostic results about the specific policy, this study will
contribute to the growing literature on cognitive diagnostic models. Many cognitive diagnostic
models exist, but none have been applied to teachers’ perceptions of support mechanisms.
Traditionally, they are applied to K-12 skills or diagnosing medical conditions where the
presence of symptoms is a binary outcome. Thus, although generalizability is important,
demonstrating the application of these models and the potential value of the diagnostic output is
a more important component in regards to informing future studies. Even so, as previously mentioned,
this study does possess the appropriate degree of generalizability as the results will be applicable
to the target population which includes all Virginia schools. Despite the limitation, exploring
ways to support teachers in using performance information to adjust instruction is an important
activity for educational leaders and researchers to engage in. This effort can be a crucial part of
broader initiatives to build school capacity to better serve students in this performance
accountability era (Sun, Mutcheson, & Kim, 2014).
Overview of the Methodology
As previously established, the data includes insights into four levels of implementation.
The levels include the policy, teacher characteristics, leadership characteristics, and
organizational characteristics. However, items were written to capture the overlapping sections
of these levels. More specifically, the survey items were developed such that unidimensional
scores on each dimension were not practical because, in most cases, items could theoretically
be mapped to more than one dimension (see Figure 1
above). The complexity of overlapping policy implementation components was one reason why
it was hypothesized that cognitive diagnostic models would provide more measurement precision
than traditional unidimensional models. Many items that were included in the instrument could
potentially load onto more than one of the hypothesized components. Cognitive diagnostic
models provided the ability to account for this unique loading structure. Similar to the
methodological requirements in the study by Halpin and Kieffer (2015), it was essential to
identify a measurement methodology that captured the item-level diagnostic information that the
instruments were designed to provide. Cognitive diagnostic models offer this advantage. As will
be reviewed in the development of the theoretical framework, and as is clearly evident upon a
review of the items (see Appendix A), most, if not all, items can be attributed to multiple sources
of support. For example, one particular item asked teachers to rate the extent to which the
professional development they received on the policy was useful. A closer look at this item reveals
that it likely requires support from at least two main sources. First, organizational conditions
must have provided adequate resources, such as time and learning tools, for teachers to
indicate that they received adequate support on this item. However, an adequate level of support
for the usefulness of professional development could instead require expertise, judgment, and
capacity from the principal or leader responsible for providing professional development. Thus,
this item will contribute variance to at least two different sources of support. This
is one example of why cognitive diagnostic models have been frequently promoted by
psychometricians as important modeling alternatives for analyzing response data in situations
where multivariate classifications of respondents are made on the basis of multiple postulated
latent skills (Rupp & Templin, 2008).
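To make this complex loading structure concrete, the following minimal sketch shows a toy Q-matrix fragment in Python. The items and entries are purely illustrative and are not taken from the ITES instrument; the third row mirrors the professional development example above, loading on both the leadership and organization attributes.

```python
import numpy as np

# Hypothetical Q-matrix fragment: rows are items, columns are the four
# attributes from the study's framework. A 1 means the item is hypothesized
# to draw on that attribute. Item content here is invented for illustration.
attributes = ["policy", "teacher", "leadership", "organization"]
Q = np.array([
    [1, 0, 0, 0],  # item on clarity of the policy guidelines only
    [0, 1, 0, 0],  # item on teacher self-efficacy only
    [0, 0, 1, 1],  # usefulness of professional development: leadership and organization
])

# The third row encodes a complex loading: a single item contributes
# information about two distinct support mechanisms.
for item, row in enumerate(Q, start=1):
    loaded = [a for a, q in zip(attributes, row) if q == 1]
    print(f"item {item}: loads on {loaded}")
```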
In addition to descriptive statistics, the first research question was addressed by exploring
the dimensionality of the constructs relating to policy implementation. First, responses provided
by teachers in the first year of implementation of the project were explored using exploratory
factor analyses (EFA). Using the factors resulting from this process, a Q-Matrix was developed
in order to map each individual item to one or more attributes. The Q-matrix is the hypothesized
item loading structure. In order to empirically validate the Q-matrix, multiple strategies were relied upon.
A comprehensive review of prior strategies used in Q-matrix development is included in chapter
2. In chapter 3, this discussion is followed by a detailed explanation of the techniques used in the
development and validation of the Q-Matrix in this study. Following the Q-matrix validation,
teachers’ perceptions were modeled using the C-RUM model, which focuses on only the main
effects (Rupp & Templin, 2008).
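As a minimal sketch of the model just described, the C-RUM item response function can be written as a main-effects model on the logit scale; the parameter values below are invented for illustration, and the actual estimates are reported in chapter 4.

```python
import numpy as np

def crum_prob(q_row, alpha, lam0, lam):
    # C-RUM, main effects only, on the logit scale:
    #   logit P(X_j = 1 | alpha) = lam0_j + sum_k lam_jk * q_jk * alpha_k
    # q_row: 0/1 Q-matrix row; alpha: 0/1 attribute profile (1 = perceived
    # support); lam0: item intercept; lam: item main-effect slopes.
    logit = lam0 + np.sum(lam * q_row * alpha)
    return 1.0 / (1.0 + np.exp(-logit))

# Invented parameters for the two-attribute item sketched earlier.
q_row = np.array([0, 0, 1, 1])
lam0, lam = -1.5, np.array([0.0, 0.0, 1.2, 1.8])

for alpha in ([0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1]):
    print(alpha, round(crum_prob(q_row, np.array(alpha), lam0, lam), 3))
```

Because the model is compensatory, perceived support on either attribute raises the endorsement probability on its own, and support on both raises it the most.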
The data cleaning will be completed with Stata software and the analysis will be
completed using the mdltm software (von Davier, 2015). Marginal maximum likelihood estimation
is used in mdltm. After fitting this model to the data, the results will then be interpreted and
compared to the unidimensional model. Results will be evaluated using fit indices and the
standard errors of the item estimates. Using the results from the final model, teachers will be
grouped into latent classes based on the pre-conceived Q-matrix and their responses to the annual
survey distributed in the first year of implementation. The profiles will capture the probability of
each latent class in contrast to the traditional “total score” or item response theory approaches.
More generally, they provided insights into which areas of the implementation process teachers
perceive support, and where support may be lacking.
In summary, diagnostic modeling provides the means to capture the information that
the survey was actually designed to capture. Moreover, the C-RUM model accounted for the
assumed compensatory relationship between the attributes. Thus, it is anticipated that this
approach would offer useful, formative, diagnostic information about teachers’ perceptions of
the strengths and weaknesses of the implementation process.
Significance of the Study
This study may have implications for several constituencies. One group of stakeholders
that may benefit from the results includes Virginia K-12 policy makers. As previously
mentioned, in this study, multi-dimensional models are applied to teachers’ perceptions of
mechanisms for supporting teacher evaluation policy implementation. The results are anticipated
to include measurements that are more precise and useful than those offered by alternative
methodologies. The distinctions between cognitive diagnostic modeling and alternative
methodologies such as multidimensional IRT (mIRT) and confirmatory factor analysis are made
in chapter 3. When equipped with more precise diagnostic feedback, policy makers and school
leaders may be able to more confidently engage in empirical decision making, especially in
regards to targeting resources for short-term and long-term organizational goals subsumed within
the policy implementation initiative.
It follows that if policy makers are equipped with more precise feedback, they will better
understand how to support those agents involved in the actual implementation of the policy. This
includes Virginia K-12 principals and teachers. With more precise feedback, principals and
teachers can maximize the potential of the policy. And, according to Virginia’s Guidelines for
Uniform Performance Standards and Evaluation Criteria for Teachers (GUPSECT), this means
that principals and teachers will use the feedback to enhance the effectiveness of teachers.
Finally, with more effective teachers, all stakeholders can anticipate stronger, better prepared
graduates contributing to Virginia communities, culture, and global workforce. Parents of
students, students, employers, and society as a whole, all stand to benefit from more effective
teachers. The dynamics of the successful implementation of this policy are depicted in Figure 1.
When policies are determined to not be making the impact anticipated when the
investments were committed, diagnostic profiles of educational policy implementation will
inform conversations and debates about strengths and weaknesses regarding the vital
components of implementation. The patterns of associations among the components will provide
valuable information as to where efforts and resources should be targeted. For example, consider
the scenario where a large percentage of teachers indicate that they do not feel supported in
regards to characteristics of the policy guidelines. Furthermore, they indicate that they are
supported in regards to characteristics of teachers, leadership, and the organization. This would
result in a high percentage of teachers in that district with a profile of 0, 1, 1, 1. Policy-makers
and educational leaders could use this data to support the decision to focus on making the policy
clearer or more adaptable and specific. Various strategies will then be targeted to mitigate the
identified deficiency and the limited resources available will not be wasted on areas in which
teachers already feel supported. The profiles will create a broad picture that can be used to
narrow the focus and identify areas that require further inquiry.
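The following short sketch, using invented profiles, illustrates how such a tabulation points resources toward the weakest mechanism.

```python
from collections import Counter

# Hypothetical estimated profiles for a district's teachers, in the order
# policy, teacher, leadership, organization.
profiles = ["0111", "0111", "1111", "0111", "0011", "0111", "1111", "0111"]
n = len(profiles)

for profile, count in Counter(profiles).most_common():
    print(f"profile {profile}: {count / n:.0%} of teachers")

# Share of teachers perceiving support on each attribute.
for k, name in enumerate(["policy", "teacher", "leadership", "organization"]):
    pct = sum(int(p[k]) for p in profiles) / n
    print(f"{name}: {pct:.0%} perceive support")
# A low share on "policy" would direct resources toward clarifying the
# guidelines rather than toward areas where teachers already feel supported.
```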
Although the substantive results will not be statistically generalizable beyond the
population of K-12 public school stakeholders in Southwest Virginia, the research
methodological significance of the study may extend beyond Virginia to broader borders. As
previously mentioned, in educational assessment contexts, it is commonly believed that an
identification and understanding of skill components helps to identify remedial pathways toward
perceived support, on all components that are relevant and educationally meaningful to the
respondents (DiBello, Roussos, & Stout, 2007). These models have not previously been applied
to investigations into individual, team, or organizational support mechanisms, policy
implementation strategies, or organizational change management contexts. If the diagnostic
output does prove to be valuable in this context, it may justify extending the application of these
models to more generalizable follow-up studies in broader organizational contexts. Thus, this
study could ultimately lead to equally significant methodological results.
At a minimum, it is anticipated that this study contributes to promoting a culture of
inquiry surrounding policy and innovation implementation and assessment at all levels of K-12
organizations. Considering the substantial investments made into developing and promoting
these expensive innovations and the increasingly limited resources available for public education
institutions, on-going inquiry into the strengths of and barriers to the implementation process is
necessary to ensure that the policies achieve their potential and that the investments are worth
continuing. In this case, it is posited that the degree to which teachers feel supported on the new
teacher evaluation policy affects the variation in the degree to which the new evaluation system
could achieve its potential to provide more useful feedback to teachers and promote effective
teaching. By legitimizing a new approach to such inquiry, this study will ultimately have a
positive effect on promoting a culture of inquiry.
Limitations
The dataset limits the generalizability of the substantive results to the population of K-12
public schools in Southwest Virginia. The final sample included 747 teachers. This sample is not
small, but it is not overly large either. It should be noted that a high response rate (70%) was
attained. Thus, although the data was not collected from a probabilistic sample, the sample is
very representative of the target population. Moreover, there were many reasons presented as to
why this was the data that will provide the best answers to the previously presented research
questions. Most importantly, no other dataset currently includes the variables necessary to
answer the research questions in this study.
It is acknowledged that often the exact specification of the Q-matrix is unknown a priori.
Mechanisms of policy implementation support in schools are not fully understood, and thus the
exact relationships in such a complex model cannot be known for certain. For this reason, in
addition to traditional approaches to Q-matrix development, empirically based Q-matrix
discovery techniques are pursued in this study. The methods are discussed in full detail in
chapter 3. There is a need for further investigation into the development of empirical techniques
for determining the entries of the Q-matrix.
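As one illustration of such empirical validation, candidate Q-matrix specifications can be compared on information criteria after fitting; the log-likelihoods and parameter counts below are invented, and the procedures actually used are detailed in chapter 3.

```python
import numpy as np

def aic(loglik, n_params):
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n_obs):
    return -2 * loglik + n_params * np.log(n_obs)

# Hypothetical fit results for two candidate Q-matrix specifications.
candidates = {
    "theory-based Q": {"loglik": -15234.7, "n_params": 96},
    "empirically revised Q": {"loglik": -15180.2, "n_params": 104},
}
n_obs = 747  # sample size in the current study

# The specification with the lower criteria is retained; BIC penalizes the
# extra parameters more heavily at this sample size.
for name, fit in candidates.items():
    print(name,
          "AIC:", round(aic(fit["loglik"], fit["n_params"]), 1),
          "BIC:", round(bic(fit["loglik"], fit["n_params"], n_obs), 1))
```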
Further confounding this analysis was the use of data from an instrument that was not
necessarily developed for the purpose of cognitive diagnostic modeling. Fortunately, however,
the four guiding coarse-grained mechanisms used in this study were the same mechanisms used
to develop the original instrument. This preserved some sense of consistency from the prior
study to the current study. Additionally, the current study does not address within-teacher
variation because it relies on a single administration of the instrument. The organizations and the
contexts within which they are situated can be expected to evolve along with the guidelines of
the policy. Hence, the follow-up studies currently in process further inquire into the supports for
implementation and will contribute to the understanding of within-teacher variation over time.
Finally, in the current study, the data for the final specified model was dichotomized.
Although this strategy has precedent, it does present an additional limitation. With a much larger
sample, the data may have remained polytomous. However, the current sample size (n = 747) does
not support the number of parameters that would need to be estimated if the data remained
polytomous. This limitation turned out to not be significant, as a preliminary analysis showed
that there were almost no differences between the structures resulting from the polytomous data
and the dichotomous data. This is discussed in more detail in chapter 3.
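For illustration only, the sketch below collapses hypothetical 4-point Likert responses to binary indicators; the actual response scale and cut point used in the study may differ and are described in chapter 3.

```python
import pandas as pd

# Hypothetical 4-point Likert responses (1 = strongly disagree ...
# 4 = strongly agree) for two items.
likert = pd.DataFrame({
    "item1": [1, 2, 3, 4, 2],
    "item2": [4, 4, 2, 3, 1],
})

# Collapse to 0/1, coding agreement (3 or 4) as perceived support.
dichotomous = (likert >= 3).astype(int)
print(dichotomous)
```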
Organization of the Study
This study is organized around five chapters. Chapter One introduces the topic of the
study, the research questions and the significance of the study. The second chapter reviews the
literature relevant to the study. Chapter Three describes the methodology of the study, including
the sampling techniques and the procedures used to collect and analyze the data. The fourth
chapter describes the results of the study while the final chapter discusses those results and their
implications for future practice, research, theory, and policy.
CHAPTER 2
REVIEW OF THE LITERATURE
Introduction
The framework for this study extends beyond the realm of psychometrics, and borrows
from concepts rooted in economics, organizational research and management, and K-12 policy
implementation. Additionally, the purpose of this study is to examine the supports available for a
specific state policy. Thus, it is imperative to review the origins and key components of that
policy. In the first section of this chapter, the literature on teacher effects and teacher evaluation
policies is reviewed. This includes a brief review of the United States Department of Education
teacher evaluation initiative through Race to the Top and narrows down to focus on Virginia’s
Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers
(GUPSECT). GUPSECT is the particular policy being examined in this study and so the key
policy components are described in detail. A thorough review of the development of the
theoretical framework and the supporting literature for the study is described next. Finally, as
previously mentioned, the crux of this chapter includes a thorough description of the
methodological concepts and applications. The literature on cognitive diagnostic modeling is
reviewed and key definitions are provided. Model fit and parameter estimation techniques are
examined by reviewing prior studies.
Teacher Effectiveness
Teacher effects have been found to be the most significant school-related variable
impacting student learning outcomes (Stronge, 2006). In one notable study, Rockoff (2004)
obtained multiple years of data on elementary-school students and teachers and found that
raising teacher quality was a key instrument in improving student outcomes. The researcher
used panel data collected from New Jersey local education agencies and found that a
one-standard-deviation increase in the teacher fixed-effect distribution raised both reading and
math test scores by approximately 0.1 standard deviations on a nationally standardized scale.
Consistent with this finding, Nye, Konstantopoulos, and Hedges (2004) found
statistically significant teacher effects on achievement gains. They also found larger effects on
mathematics achievement than on reading achievement, and much larger teacher effect variance
in low socioeconomic status (SES) schools than in high-SES schools. This study was
especially notable because of the unique data available to researchers. More specifically,
researchers used data from a four-year experiment called the Tennessee Class Size Experiment,
in which teachers and students were randomly assigned to classes to estimate teacher effects on
student achievement.
There still exists some debate about how to define, identify, measure, and develop
effective teaching (Hallinger, Heck & Murphy, 2014). Without a common understanding of
these concepts, policy makers and practitioners have faced substantial challenges in increasing
the quality of education for all students. This has led to significant issues when policy makers
attempt to make empirical personnel decisions, such as teacher selection, tenure, and
compensation. Decisions in these areas have traditionally been addressed using data on teacher
education, experience, certification, and salary schedules (Rothstein & Mathis, 2013). However,
there has been much evidence against the use of such measures in high stakes contexts. For
example, Hanushek (1986, 1997) conducted educational production function studies and found
that the characteristics that form the basis for teacher compensation (e.g., graduate degrees and
experience) were weak predictors of a teacher’s contribution to student achievement. In a more
recent example, Goldhaber and Brewer (2000) used the National Educational Longitudinal Study
of 1988 (NELS:88) to estimate value-added models using data on 3,786 twelfth-grade public
school mathematics students and 2,524 science students. Results indicated that mathematics and science students who
had teachers with emergency credentials did no worse than students whose teachers had standard
teaching credentials.
Further evidence was provided in a similar study by Rivkin, Hanushek, and Kain (2005).
In this study, researchers used unique matched panel data from the Texas Schools Project to find
that teachers had powerful effects on reading and mathematics achievement, but that little of the
variation in teacher quality was explained by observable characteristics such as education or
experience. Taken together, these studies suggested that alternative measurement mechanisms
were necessary to increase the validity with which teachers were evaluated, and also to
effectively develop teacher capacity and make empirically guided personnel decisions.
Traditional Teacher Evaluation Systems and the Widget Effect
In addition to the traditional approach of collecting information about teacher
education, experience, and certification, school administrators have been collecting data from
observations and teacher evaluations of instruction for over a century (Hallinger, Heck &
Murphy, 2014). The potential of teacher evaluation systems to empirically inform the
development of teacher capacity has been known for some time. In one study, Johnston (1997)
interviewed sixty-three participants from various local education agencies across three states to
explore conceptions of effective teaching held across various roles within the school organization
and discussed the implications of these for teacher-evaluation policy. He found that the process
of teacher evaluation could be valuable in assessing the effectiveness of classroom teachers and
identifying areas in need of improvement. In this same study, he also found that teacher
evaluation systems could be helpful in making professional development more individualized
and improving overall instruction schoolwide. These findings supported the endeavor of
developing and implementing more effective teacher evaluation systems so that State education
agencies (SEAs) and local education agencies (LEAs) could effectively identify, reward,
develop, and retain the strongest teachers in the interest of their students.
Although there has been empirical evidence about the potential usefulness of teacher
evaluation systems, there has been limited evidence to support the notion that inferences made
from traditional teacher evaluation systems actually possess an adequate degree of reliability and
validity (Darling-Hammond et al., 2012). Without empirical evidence, one cannot contend that
traditional teacher evaluation systems would be any more effective than using data on education,
experience, and certification. In fact, substantial evidence existed to support the opposite
contention—that, in fact, there are legitimate concerns with traditional teacher evaluation
systems. In one notable study funded by The New Teacher Project, Weisberg, Sexton, Mulhern,
and Keeling (2009) surveyed over 15,000 teachers and 1,300 principals in 12 LEAs across four
states. Researchers found that almost all teachers were rated as great or at least good.
Furthermore, in schools that used a binary system where teachers were rated as either
satisfactory or unsatisfactory on various standards, 99 percent of teachers were rated as
satisfactory. Finally, they found that when schools used an evaluation scale with more than two
performance rating options (i.e., more than just satisfactory and unsatisfactory), an
overwhelming 94 percent of all teachers received one of the top two ratings. Weisberg et al.
(2009) termed this phenomenon “The Widget Effect,” noting that when schools assumed that
teachers’ effectiveness in the classroom was the same from teacher to teacher, they treated
teachers as interchangeable parts. Several consequences of this were noted. Most notably, evaluation
systems that failed to differentiate performance among teachers led to systems in which excellent
teachers could not be recognized or rewarded. Moreover, chronically low-performing teachers
languished, and the wide majority of teachers performing at moderate levels did not get the
differentiated support and development they needed to improve as professionals.
Although teacher evaluation systems had previously been shown to have potential in
growing teacher and school capacity in terms of targeting professional development and
identifying and promoting talent, validity and reliability issues have precluded such systems
from being heavily relied upon in high-stakes personnel decisions such as tenure, compensation,
and dismissal. Despite this, teacher evaluation systems’ potential as a mechanism to improve
education quality through developing teacher capacity and through recruitment and retention of
talent continued to attract interest. Increasingly, external pressures, including
the federal Race to the Top funding program, have enabled policymakers to experiment with
evaluation and accountability for individual teachers. In order for schools to receive certain
funding, the federal program, Race to the Top, required participating SEAs and LEAs to measure
and reward teachers and school leaders based on multiple measures of teaching. The results of
these experiments, some of which required substantial funding (e.g., the Gates Foundation Measures of
Effective Teaching Project), have been extensively studied over the past five years and will
continue to be explored for some time.
Teacher Evaluation Policies and the Multiple Measures of Teacher Effectiveness
In light of increased demand for greater school accountability and the limitations
documented with traditional teacher evaluation systems, education policy has gradually shifted
from holding schools accountable for policy compliance to accountability for learning outcomes
(Atkinson, 2009). Teacher effectiveness has been increasingly connected to student academic
progress as it has become generally agreed upon that measures of student learning in the
evaluation process provided the “ultimate accountability” for educating students (Tucker, &
Stronge, 2001). However, debate continued over the scope of educational outcomes, and how to
increase the degree of reliability and validity when measuring them.
This debate has extended into wide experimentation and research into the implementation
of new teacher evaluation systems. One strong voice in this debate has been the Gates
Foundation, which invested $45 million in a nationwide project called Measures of Effective
Teaching (MET). The goal of this project was to not only improve teacher evaluation, but also
use this information to make high-stakes decisions about teachers’ careers. Various researchers
from a wide-range of institutions measured teacher effectiveness in many different ways,
including student evaluations of teachers, student classroom work, evaluations of classroom
practice using commonly used rubrics, and student and parent surveys (e.g., Kane, McCaffrey,
Miller, & Staiger, 2013). In one report from this study, Kane et al. (2013) found that a composite
measure of effectiveness could identify teachers who produced higher achievement among their
students. They also found that the actual impacts on student achievement were approximately
equal on average to what the existing measures of effectiveness had predicted. These findings
were especially notable because the impacts were causal, having been estimated with randomly
assigned groups.
In another key finding from the MET project, Mihaly (2013) explored how indicators
(e.g., student evaluations, principal observations, value-added scores) could be combined to
improve inferences about a teacher’s impact on student achievement and about teaching.
Researchers estimated the parameters of an optimal combined measure of teacher effectiveness
and found that for a typical teacher, one year of data on value-added for state tests is highly
correlated with a teacher’s stable impact on student achievement gains on state tests.
Preliminary investigations appear to support the use of multiple measures to evaluate
teachers. Teacher evaluation systems that use multiple measures of teacher performance in
public schools have been shown to have the potential to produce useful information for teachers. For
example, Taylor and Tyler (2012) studied a sample of midcareer elementary and middle school
teachers in the Cincinnati Public Schools, all of whom were evaluated in a yearlong program
based largely on classroom observation between the 2003-04 and 2009-10 school years. Their
analyses showed that teachers were more effective at raising student achievement during the
school year when they were being evaluated than they were previously, and even more effective
in the years after evaluation, particularly in mathematics. Other studies, beyond the MET project,
have also explored the relationships between the multiple measures of teaching. For example,
Harris and Sass (2009) used data from a midsize Florida LEA and found that teacher value-
added and principals’ subjective ratings were positively correlated. They also found that
principals’ evaluations were better predictors of a teacher’s value added than traditional
approaches to teacher compensation focused on experience and formal education. Perhaps most
importantly, they found that in settings where schools were judged on student test scores,
teachers’ ability to raise those scores was important to principals, as reflected in their subjective
teacher ratings.
However, not all studies support the use of the increasingly common teacher evaluation
systems based on multiple measures. In contrast to some of the previously discussed studies,
Hallinger, Heck, and Murphy (2013) examined the new generation of teacher evaluation systems
along three lines of analysis: evidence on the magnitude, consistency, and stability of teacher
effects on student learning; evidence on the impact of teacher evaluation on growth in student
learning; and literature from the sociology of organizations on how schools function. One key
finding in this study was that the policy logic supporting the teacher evaluation reform was
considerably stronger than the empirical evidence supporting the actual reform. This contrasting
finding suggests that more exploration into these systems may be necessary.
In this section, background on teacher effectiveness and teacher evaluation systems was
outlined. It is clear that, increasingly, education policy makers and researchers have turned
towards experimentation with new models of teacher performance evaluation focused on
multiple measures. In the next section, Virginia’s efforts towards a more reliable and valid
teacher evaluation system based on multiple measures are described. Following a review of the
specific policy being examined in this study called GUPSECT, literature on organizational
policy implementation is explored and narrowed down to K-12 teacher evaluation contexts.
The Policy: Virginia Guidelines for Uniform Performance Standards and Evaluation
Criteria for Teachers
In September 2011, in order to move forward with reforms to increase the quality of
instruction for all students, the United States Department of Education (USDOE) invited all State
Educational Agencies (SEAs) and local education agencies (LEAs) to request flexibility regarding
specific requirements of the Elementary and Secondary Education Act (ESEA) of 1965.
Designing and implementing teacher performance-based evaluation had been the main focus of
the efforts to implement ESEA flexibility (USDOE, Jan. 2013). This flexibility was granted
pursuant to section 9401 of ESEA, which allowed the Secretary of
Education, Arne Duncan, to waive statutory or regulatory requirements of the ESEA for an SEA
that receives funds under a program authorized by the ESEA and requests a waiver. Funds also
came from the American Recovery and Reinvestment Act (ARRA). Since this invitation, state
and local education agencies have invested substantial resources, including both human and
financial capital, towards this nationwide initiative that re-conceptualizes teacher evaluation.
Initially, the USDOE granted waivers to 34 states and the District of Columbia, including
Virginia. The current study examines the implementation of the policy that was used in partial
fulfillment of the Virginia SEA’s successful request for ESEA flexibility. In Virginia, GUPSECT
became effective on July 1, 2012. The 2012-13 school-year was the first year of the state-wide
pilot and implementation. Based on the official GUPSECT document (2011), the primary
purposes of a quality teacher evaluation system were to:
- contribute to the successful achievement of the goals and objectives defined in the school division’s educational plan;
- improve the quality of instruction by ensuring accountability for classroom performance and teacher effectiveness;
- implement a performance evaluation system that promotes a positive working environment and continuous communication between the teacher and the evaluator that promotes continuous professional growth and improved student outcomes;
- promote self-growth, instructional effectiveness, and improvement of overall professional performance; and, ultimately
- optimize student learning and growth (p. 14).
With these principles guiding the policy making process, the formal signing of this policy
represented a significant overhaul of conventional evaluation criteria. Most notably, GUPSECT
set forth seven performance standards for all Virginia teachers. Pursuant to state law, teacher
evaluations were required to be consistent with the performance standards included in the official
GUPSECT document (2011). It was recommended that each teacher receive a summative
evaluation rating, and that the rating be determined by weighting the first six standards equally at
10 percent each, and that the seventh standard, student academic progress, account for 40 percent
of the summative evaluation (see Table 2).
In addition to establishing the uniform performance standards for teachers, the official
document also made recommendations on how teacher performance could be effectively
documented and used to provide comprehensive and accurate feedback on teacher performance.
Specifically, GUPSECT recommended that evaluators use five data sources for the evaluation.
The five data sources included formal observations, informal observations, student surveys,
portfolios/document logs, and self-evaluation.
Table 2. Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers

Performance Standard 1: Professional Knowledge. The teacher demonstrates an understanding of the curriculum, subject content, and the developmental needs of students by providing relevant learning experiences.
Performance Standard 2: Instructional Planning. The teacher plans using the Virginia Standards of Learning, the school’s curriculum, effective strategies, resources, and data to meet the needs of all students.
Performance Standard 3: Instructional Delivery. The teacher effectively engages students in learning by using a variety of instructional strategies in order to meet individual learning needs.
Performance Standard 4: Assessment of and for Student Learning. The teacher systematically gathers, analyzes, and uses all relevant data to measure student academic progress, guide instructional content and delivery methods, and provide timely feedback to both students and parents throughout the school year.
Performance Standard 5: Learning Environment. The teacher uses resources, routines, and procedures to provide a respectful, positive, safe, student-centered environment that is conducive to learning.
Performance Standard 6: Professionalism. The teacher maintains a commitment to professional ethics, communicates effectively, and takes responsibility for and participates in professional growth that results in enhanced student learning.
Performance Standard 7: Student Academic Progress. The work of the teacher results in acceptable, measurable, and appropriate student academic progress.
*Adapted from the Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers
Since the use of student learning measures in teacher evaluation was new for both
teachers and principals, perhaps the most difficult component of the policy in terms of gaining
teachers’ buy-in was Performance Standard 7: Student Academic Progress. There are three
potential explanations for the lack of teacher buy-in on this standard. First, traditional evaluation
systems have not included student progress as a measure of teacher performance. Thus, as with
any innovation, especially one implemented in contexts as sensitive and important as job
performance, the novelty factor presents an initial barrier. Secondly, and perhaps most
prominently, the heavy weighting of this new standard marks a cultural shift in how “effective
teaching” is defined. Since student academic progress is weighted at four times any other
standard, the shift focuses on the idea that what students learn from a teacher is as important as
how or what teachers teach. This type of cultural shift understandably takes time to gain buy-in
from all stakeholders, especially those most directly affected by the change. A third factor that may help
explain why teacher buy-in faces difficulty is that although teacher quality is a significant
predictor of student achievement, many other factors are also significant. One example is the
socioeconomic status of students.
As previously mentioned, the guidelines recommended that student academic progress
account for 40 percent of a teacher’s summative evaluation. However, LEAs were also
granted some flexibility under this standard. For example, at least 20 percent of the teacher
evaluation (half of the student academic progress measure) was to consist of student growth
percentiles as provided by the Virginia Department of Education when the data were available
and could be used appropriately. Moreover, another 20 percent of the teacher evaluation (half of
the student academic progress measure) was recommended to be measured using one or more
alternative measures with evidence that the alternative measure is valid. Thus, in choosing
measures of student academic progress, schools and LEAs were encouraged to consider
individual teacher and schoolwide goals, and align performance measures to the goals. In
anticipation of resistance to the use of standardized testing in the evaluation, policymakers
highlighted that fewer than 30 percent of teachers in Virginia’s public schools would have a direct
measure of student academic progress available based on the State assessment results (Standards
of Learning).
The main components of GUPSECT have been highlighted in this section. However, it
should be evident that although some important components, such as the weighting of the
standards, were mandated at the state level, other components were flexible in terms of how they
were implemented by LEAs. One result of providing wider local flexibility in interpretation
could be a larger degree of variance in the fidelity of policy implementation. This study explores
that variance using teachers’ perceptions. In the next section, the development of the guiding
theoretical framework is described.
Theoretical Framework: Policy Implementation in K-12 Organizations
Preliminary Assumptions about the Policy Implementation Process
The development of the theoretical framework for this study was based on key
underlying assumptions. First, the extent to which a sample of Virginia’s teachers perceived that
they were supported on the new teacher evaluation system was assumed to have implications for
the effectiveness of the implementation of the policy. Thus, environments in which teachers
indicated receiving more support were environments in which the policy was being implemented
more effectively. Secondly, the variation in successful implementation was assumed to affect the
variation in the degree to which the new evaluation system could achieve its potential to provide
more useful feedback to teachers and promote effective teaching. Taken together, these first two
assumptions indicated that when teachers perceived more support in implementing the policy,
the implementation was more effective, which in turn leads to increased teacher effectiveness and
The final piece to these assumptions, based on empirical studies described in the previous
section, was that increased teacher effectiveness leads to increased student achievement which
should be the ultimate goal of any educational organization.
Relying on the aforementioned studies and assumptions, the mechanisms for supporting
the policy implementation were initially investigated in four components. These assumptions
and the remainder of this section were summarized in figure 1 on page 6 and are included again
below for convenience. As is evident in figure 1, many of the supports investigated in this study
may fall under multiple categories. How a teacher perceives support in one component
may influence the perceived support in another. This idea will be discussed in detail in
order to provide a more accurate depiction of the development of the framework.
The first component of support mechanisms is the teachers’ perceived support on policy
guidelines. This includes, among other supports, how clear the policy guidelines are to teachers,
how specific and adaptable they are, and how the policy lends itself to being communicated and
monitored. Secondly, teachers’ perceptions of support mechanisms at the teacher level are
explored. This includes teacher self-efficacy, expertise, and capacity to change. It also includes
the important support gained from situated contexts like close colleagues. Next, school
leadership supports are investigated. These include leadership expertise and advocacy for the
policy. Moreover, it includes teachers’ perceptions of whether leadership values align with the
policy values. Finally, supports from organizational conditions, such as professional development,
collaboration, resources, and locus of decision making, are explored. These areas of the
conceptual framework are discussed more thoroughly in chapter three. The assumptions,
theoretical components, and the overlapping relationships are visually represented by the guiding
model presented in Figure 4 above. The next subsections describe the theories that
informed the development of the framework for this study.
Theories of Utility Maximization and Policy Implementation
Utility maximization theory is an economics concept that directly relates to the evaluation
of teacher quality as teachers operate as individuals in their social settings (Akerlof & Kranton,
2005). Specifically, the utility maximization problem refers to individuals attempting to attain
the greatest value possible from expenditure of least amount of resources. In a consumer market,
this unfolds as individuals operating to maximize the total value derived from some currency—
usually money. In this study of a school organization context, teachers operate to maximize their
utility within the school organization. In teacher evaluation contexts, teachers’ utility is derived
from their value within the organization which, as discussed in previous sections, is increasingly
being derived from the value they add to student learning through expenditures of time,
effort, and other available learning resources and supports. Thus, in order to increase utility,
teachers should capitalize on the available supports in implementing this policy. This concept
can also be applied at a broader level, as individual schools and districts operate to increase the
capacity of their teacher workforce within time and budget constraints. The optimal decision in
any given situation maximizes the average utility over all possible outcomes of a decision
(Akerlof & Kranton, 2005).
Akerlof and Kranton (2005) proposed that workers’ self-image as jobholders, coupled
with their ideal as to how their job should be done, can be a major work incentive. They showed
how identities can flatten reward schedules, as they solve the “principal-agent” problem. The
principal-agent problem is based on the assertion that the principal (e.g., the employer) cannot
observe the true efforts of agents (e.g., employees); rather, only agents themselves know their
true effort and performance (Sun, Mutcheson, & Kim, 2015). To address this asymmetric locus
of information, the employer can align the incentives for employees with the organization’s
goals. For example, aligning measures of teachers’ performance closely with students’ learning
is expected to motivate teachers’ efforts and in turn contribute to schools’ organizational values
(Sun, Mutcheson & Kim, 2015).
Social Capital and Policy Implementation
As individuals become more valuable in an organization, they increase their social capital
(Rogers, 2003). Social capital is the expected collective or economic benefits derived from the
preferential treatment and cooperation between individuals and groups (Putnam, 2000). One
theory directly related to the importance of social capital in policy implementation contexts is
Rogers’ (2003) diffusion of innovations. Diffusion of innovations theory seeks to explain how
innovations are communicated through certain channels over time among
the participants in a social system or culture. Under this theory, school systems are organizations,
and as such, the rate at which policies spread is heavily dependent on social capital, the
organizational communication channels, time, and the social system (Rogers, 2003).
In one study, Frank, Zhao, and Borman (2004) applied diffusion of innovation theory to
educational contexts when they characterized informal access to expertise and responses to social
pressure as manifestations of social capital. They used longitudinal and network data in a study
of the implementation of computer technology in six schools and found that the effects of
perceived social pressure and access to expertise through help and talk were at least as important
as the effects of traditional constructs. This suggested that teachers were better able to gain
access to each other’s expertise informally and were more likely to respond to social pressure to
implement an innovation, regardless of their own perceptions of the value of the innovation.
In another study investigating social capital in schools, Youngs, Frank, Thum, and Low
(2012) explained the effects of mentoring and induction activities on new teachers’ commitment,
instructional quality, and effectiveness. They found that “…when beginning teachers’ beliefs and
practices were aligned with those of their mentors, subgroups, and other colleagues, they may
feel little professional tension and may be able to promote student learning and respond to
others’ expectations by exerting effort on a single dimension” (p. 22). Alternatively, when
novice teachers were not aligned with their mentors, subgroups, or administrators, they required
“…exerted effort simply to meet others’ expectations and they may experience significant
professional tensions” (p. 22).
Taken together, these studies provide evidence of the importance of social capital in the
diffusion of innovations in organizations. Social capital can be used to positively influence
teachers. However, it can also negatively impact teachers’ commitment to the implementation
process which could potentially attenuate or even nullify the intended effects of the policy. This
is important to the current study for many reasons. Most obviously, it suggests that when
measuring teachers’ perceptions of policy implementation supports, the influence teachers have
on colleagues’ perceptions is important to consider. This can be accomplished in many ways, one
of which is to include measures asking teachers how their perceptions align with colleagues’
perceptions. That is the strategy taken for the purposes of this study.
Frameworks for K-12 Policy Implementation
There are multiple frameworks for policy implementation that influenced the design of
this study. With their objective of advancing the vocabulary of implementation science, Proctor
et al. (2012) extended the theory of diffusion of innovations and put forth a taxonomy of
implementation outcomes to help organize the key variables and frame research questions
required to advance implementation science. The taxonomy consisted of eight conceptually
distinct implementation outcomes including acceptability, adoption, appropriateness, feasibility,
fidelity, implementation cost, penetration, and sustainability. Researchers defined
implementation outcomes as “…the effects of deliberate and purposive actions to implement new
treatments, practices, and services” (p. 65). Moreover, they identified three important functions of
implementation outcomes. One function was to serve as indicators of the implementation
success. Second, implementation outcomes serve as proximal indicators of implementation
processes. Finally, they were recognized as key intermediate outcomes (Rosen & Proctor, 1981)
in relation to a service system such as school organizations.
Similar to the previous study, Century, Cassata, Rudnick, and Freeman (2012) put forth a
framework intending to move toward common language and shared conceptual understanding.
They focused on two aspects of “implementation”: innovation implementation and the
implementation process. They provided definitions of these aspects that have been relied upon in
the development of the current study. First, innovation implementation refers to the status of the
innovation or extent to which the innovation itself is enacted, in whole or part. This is commonly
referred to as “implementation fidelity.” Secondly, the implementation process includes
innovation implementation as well as all of the contextual factors that contribute to and/or inhibit
the innovation implementation. Most relevant and useful for the current study, the researchers
describe the mechanisms that influence policy implementation in different grain-sizes. For
example, in their framework, mechanisms fall under various categories including: characteristics
of the innovation, characteristics of the individual users, characteristics of the leadership,
characteristics of the organization, and elements of the environment. Each of these grains is then
broken down into smaller grain sizes and easily measurable elements of policy implementation.
As is evident from chapter one, this framework was instrumental in the development of the
hypotheses, design, and framework for the current study.
In an alternative approach more targeted towards understanding policy implementation
science in K-12 contexts, Spillane, Reiser, and Reimer (2002) developed a cognitive framework
to characterize sense-making in the implementation process. This framework was especially
relevant for recent education policy initiatives, such as standards-based reforms that press for
tremendous changes in classroom instruction, such as teacher evaluation. According to Spillane
et al. (2002), “…a key dimension of the implementation process is whether, and in what ways,
implementing agents come to understand their practice, potentially changing their beliefs and
attitudes in the process” (p. 387). They argued that one plausible explanation for the evolution of
policies during implementation is the process of human sense-making. In this approach, they
highlighted the importance of unpacking how and why policy evolves as it does. They noted that
“…this strategy is likely to generate important insights into the implementation process, insights
that can inform the design of state and national standards as well as other education policies.”
In summary, multiple theories and frameworks were used to frame the contextual
elements of statewide K-12 policy implementation. As was evident in the previous discussions,
the measurement of the processes was complex. The complexity of overlapping policy
implementation components is one reason why it was hypothesized that cognitive diagnostic
models would provide more measurement precision than traditional unidimensional models. In
the next section, a conceptual review of cognitive diagnostic models is provided in order to
justify the need for this proof of concept endeavor. Some important studies are reviewed to
provide insight into past applications of cognitive diagnostic models and to highlight the potential
advantages of this methodology.
Review of Cognitive Diagnostic Modeling Concepts
In order to determine whether cognitive diagnostic models will provide more precise
measurements of teachers’ perceptions than traditional measurement frameworks, a review of
the concepts and applications of cognitive diagnostic models is necessary. First, however, it is
necessary to review alternative and closely related psychometric approaches in order to detail the
precise statistical properties of cognitive diagnostic modeling that make it advantageous for the
purposes of this study. Several alternative psychometric approaches are described. In comparing
and contrasting these alternative approaches with cognitive diagnostic modeling, it becomes
clear why the final approach is selected, especially in light of the goals of the current study.
Following this comparison, the concepts and applications of cognitive diagnostic modeling are
reviewed.
Measurement Theory
The most common measurement framework is informed by classical test theory (CTT).
In CTT, each respondent has a true score, defined as the expected score if the respondent were to
respond over an infinite number of independent trials. Since a respondent can never actually do
this, the observed score is assumed to equal the true score plus some magnitude of error. Hence,
the formula:
X = T + E; (1)
where X represents the observed score of a respondent, T represents the true score of a
respondent, and E represents the measurement error. The reliability of the observed test
scores can be shown to be equal to the proportion of the variance in the test scores that we could
explain if we knew the true scores, and is given by:
ρ²_XT = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E); (2)
Looking at the formulas, it is clear that the assumptions of this measurement approach are
unrealistic for the purposes of this study. In this study, the overarching hypothesis is that policy
implementation support is actually multidimensional. In this sense, several related constructs are
actually contributing to the teachers’ overall perceived support. However, classical test theory is
a unidimensional measurement approach in that all measurements are taken on a linear scale for
one single construct. In CTT, examinee characteristics and test characteristics can only be
interpreted in the context of the other and the standard error of measurement is assumed to be the
same for all examinees. Hence, classical test theory is not practical for this study: it is a
test-oriented approach, so it is not particularly useful in terms of predicting respondent item
performance, which is exactly what is investigated in this study.
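The decomposition in equations (1) and (2) can be checked with a small simulation; the sketch below uses arbitrary variance values to show that reliability equals the ratio of true-score variance to observed-score variance.

```python
import numpy as np

# Minimal simulation of the CTT decomposition X = T + E and the
# reliability in equation (2). Variance values are illustrative.
rng = np.random.default_rng(1)
T = rng.normal(50, 8, size=100_000)   # true scores, var(T) = 64
E = rng.normal(0, 4, size=100_000)    # errors,      var(E) = 16
X = T + E                             # observed scores

# rho^2_XT = var(T) / (var(T) + var(E)) = 64 / 80 = 0.8
print(T.var() / X.var())              # approximately 0.8
```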
A more advantageous methodology considered for this study is latent class modeling.
Latent class models are a type of mixture model in which the data are categorical and item responses
are independent given class (Lazarsfeld & Henry, 1968). The analysis involves latent variables
that are indicated by measured items. In contrast to more common measurement models, such as
factor analysis and item response theory, the latent variables in the analysis are categorical rather
than continuous. This aligns with the objectives of the current study because the categorical
variables contain diagnostic properties (Embretson & Yang, 2013). The first use of latent class
models in educational measurement was by Macready and Dayton (1977). Most of the early
psychometric applications were centered on identifying participants who cluster together based
on item scores. Cognitive diagnosis models are very similar to latent class models, except with a
set of equality constraints placed on class probabilities (Templin, 2008). Moreover, in cognitive
diagnostic modeling, the classes, or skill patterns, are specified and defined a priori, whereas in
latent class analysis, the classes are not known prior to the analysis (Halpin and Kieffer, 2015).
Thus, latent class analysis is more of an exploratory procedure for understanding data. In this
study, the literature is used to define the four specific attributes of interest, thus making a more
confirmatory procedure more appropriate for the diagnostic modeling component.
Factor analysis (FA) is another commonly used measurement framework. Exploratory factor analysis (EFA) can be
used to analyze the associations between observed item responses using latent variables, such as
the intra-organizational mechanisms being investigated in this study. EFA is often used to
uncover the underlying structure of the data. Similar to latent class analysis, it is exploratory in
nature. Thus, EFA is used when no a priori hypothesis about factors or patterns of measured
variables exists. All observed variables are allowed to freely load on all latent variables. In the
current study, the coarse-grained mechanisms are defined a priori via literature review. However,
due to a lack of empirical understanding of the finer-grained mechanisms for supporting policy
implementation, an EFA is used in the preliminary analysis. This procedure is discussed in more
detail in chapters 3 and 4.
Also within the FA framework, confirmatory factor analysis (CFA) differs from EFA in
that this procedure requires that the loading structures be specified a priori. The development of
these loadings is typically theory-based. Most commonly, in both EFA and CFA a simple
loading structure is the goal. A simple loading structure occurs when each item loads on only one
latent variable, resulting in an analysis of between-item variance, but it does not always occur. Although
the CFA framework is capable of accommodating a complex structure where each item can load
to multiple latent constructs, the a priori specification of a complex loading structure is not as
common in the literature as a simple structure. This is because items are usually written to
measure one latent construct, as opposed to multiple latent constructs. Complex loading
structures allow one to analyze within item-variance and are more commonly used in
multidimensional IRT models and cognitive diagnostic models. Additionally, in EFA and CFA
estimation routines typically utilize summary statistics of the data such as means, variances,
covariances, and correlations (Rupp, Templin, & Henson, 2008). This is referred to as
limited-information (or partial-information) estimation. These statistics are assumed to contain all relevant information about the unknown
parameters of interest in these models.
An alternative framework for modeling latent constructs is item response theory (IRT).
Typically, in IRT and cognitive diagnostic modeling frameworks, full-information statistics are
used for estimation (Rupp, Templin, and Henson, 2008). The statistical properties subsumed
within an IRT framework provide advantages similar to those of cognitive
diagnostic modeling. The IRT model can be used to account for important parameters such as
item difficulties and discriminations. However, according to Ravand and Robitzsch, (2015),
“…conventional IRT models locate test takers on a broadly defined single latent variable,
whereas diagnostic models provide information about the perceived support status of test takers
of a set of interrelated separable attributes” (p.1). Such conventional models are unidimensional,
and they typically have simple loading structures in that they relate the observed response
variables to a single latent variable. Thus, conventional, unidimensional IRT models are not
useful for this study.
Increasingly, multidimensional IRT (mIRT) models are used to calibrate multiple
dimensions and to estimate correlations among the latent variables. Many of these models are quite
similar to those found in the family of cognitive diagnostic models. For example, a priori
specifications of the relationships between observed and latent variables are made in both
approaches. In cognitive diagnostic modeling, this is referred to as a Q-matrix. Most Q-matrices
in cognitive diagnostic models specify complex loading structures that accommodate within item
multidimensionality where each item can load onto multiple dimensions. In other words, items
can be attributed to multiple sources of supports. For example, one particular item asked teachers
to rate the extent to which the professional development they received on the policy was useful. A
closer look at this item reveals that it likely requires support from at least two main sources.
First, organizational conditions must have provided for adequate resources, such as time and
learning tools for teachers to indicate that they received adequate support on this item. However,
an adequate level of support for the usefulness of professional development could also require
expertise, judgment, and capacity from the principal or leader responsible for providing
professional development. Thus, this item will contribute variance to at least two different
sources of support variables. This is one example of why diagnostic models have been frequently
promoted by psychometricians as important modeling alternatives for analyzing response data in
situations where multivariate classifications of respondents are made on the basis of multiple
postulated latent skills (Rupp and Templin, 2008). Although mIRT can accommodate such
within item multidimensionality, a simple loading structure focused on between-item
multidimensionality is much more common.
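To make the complex loading idea concrete, the hypothetical Q-matrix fragment below assigns the professional development item discussed above to both the leadership and organization attributes. The item assignments are purely illustrative and do not reproduce the study’s actual Q-matrix.

```python
import numpy as np

# Columns are the four coarse-grained attributes used in this study;
# rows and item assignments are hypothetical examples.
attributes = ["policy", "teachers", "leadership", "organization"]
Q = np.array([
    [1, 0, 0, 0],   # item 1: clarity of the policy guidelines
    [0, 1, 0, 0],   # item 2: teacher self-efficacy
    [0, 0, 1, 1],   # item 3: usefulness of professional development
                    #         (complex loading: two attributes)
])
for j, row in enumerate(Q, start=1):
    loaded = [a for a, q in zip(attributes, row) if q == 1]
    print(f"item {j} measures: {loaded}")
```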
Perhaps the most important key distinction between mIRT and cognitive diagnostic
models is that mIRT uses continuous latent variables which allows for multiple real number
scales and norm-referenced interpretations. Conversely, the latent variables in cognitive
diagnostic models are categorical, so they support the type of criterion-referenced
interpretations sought in this study. For example, in this study, teachers’
perceptions will be used to classify them into latent classes in order to inform training and
development on policy implementation. Teachers will be classified as proficient or not proficient
on each skill pertinent to the policy implementation process. The key advantage to cognitive diagnostic
modeling is that the cut scores will be set to maximize the reliable separation of respondents
(Templin, 2009). In comparison, to make the same proficient/non-proficient determination from
the continuous outcome variable in mIRT, one would require a second, external phase (e.g.
standard setting). As previously mentioned, in this study, the ability to make standards-based
(proficient/non-proficient) classifications is highly desirable because of the proposed
application of this methodology, which is discussed in previous sections.
Templin and Henson (2009) identified further advantages of cognitive diagnostic models.
First, they noted that these models hold great potential because of the promise of providing more detailed
information related to the defined attributes. Secondly, they provide a tool that can aid in the
development of tailored action plans which could save leaders and teachers time. Finally, most
current studies do not provide a typical low stakes situation, but instead represent nearly ideal
situations with large sample sizes that provide for the demonstration of a new methodology.
Thus, there is a need for studies using cognitive diagnostic modeling in real-life situations.
In another paper, Gorin (2009) provided further support by highlighting additional
advantages of cognitive diagnostic models. First, their numerically derived cut-scores allow for
criterion-referenced score interpretations in that they generate multidimensional score estimates in
terms of diagnostic classifications regarding student mastery or non-mastery of measured skills.
Secondly, cognitive diagnostic models hold diagnostic power based on multidimensional data.
Third, given current legislative demands on educational assessment for curricular design and
educational accountability, the use of criterion-referenced score interpretations has
overshadowed the historical importance of normative score interpretations.
It is anticipated that the approach taken in this study will provide useful, formative,
diagnostic information about teachers’ perceptions of the strengths and weaknesses of the
implementation process. The result will be diagnostic output that provides detailed empirical
information about teachers’ perceptions of policy support that are involved in the response
processes and the manner in which these components interact were obtained (diBello, Roussos,
& Stout, 2007). In an educational assessment context, it is commonly believed that an
identification of these perceptions, sometimes referred to as “mental components,” may help to
identify remedial pathways toward mastery on all components that are relevant and educationally
meaningful to the respondents (DiBello, Roussos, & Stout, 2007).
Cognitive Diagnostic Modeling Theory and Applications
The statistical purpose of cognitive diagnostic modeling is to “…develop a
multivariate profile of respondents’ traits that is based on classifying them according to their
degree of mastery on each of the traits” (Rupp & Templin, 2008, p. 226). Substantively, they
provide detailed diagnostic profiles that promote assessment for learning through modification in
target areas (Jang, 2009). Cognitive diagnostic models use categorical latent variables that are
indicated by observed measured items. Instead of a continuous ability estimate, a cognitive
diagnostic model will estimate the probability that a respondent has mastered each attribute
(Rupp & Templin, 2008). If that probability is greater than 0.5, the respondent will be classified
as having mastered the attribute. For each respondent, a profile will result in which mastery/non-
mastery on each attribute is estimated. Attributes are the skills one is interested in measuring.
They can also be content knowledge, cognitive skills, mental processes, or, as in this case,
perceptions of support for policy implementation. Typically, cognitive diagnostic analyses
produce output for respondents as a profile on the attributes. The attributes usually have two
levels (mastery/non-mastery), or more than two levels as specified in the model. For the purposes
of this study, two categories are used; however, those categories are more accurately described as
“perceived support” and “lack of perceived support” as opposed to mastery/non-mastery.
Limited research has been done on models with more than two levels for an attribute. As will be
discussed in greater detail in chapter 3, attributes can also come in different grain sizes. Grain
sizes are the level of specificity of attributes. A finer grain, for example, could be adding and
subtracting, whereas a coarser grain may be whole number operations. Thus, a grain-size refers
to the scope or level of specificity.
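The 0.5 classification rule described above can be sketched in a few lines; the posterior probabilities below are fabricated for illustration only.

```python
import numpy as np

# Illustrative posterior probabilities of perceived support for three
# respondents on the four attributes (values are made up).
posterior = np.array([
    [0.92, 0.61, 0.38, 0.75],
    [0.12, 0.55, 0.49, 0.90],
    [0.80, 0.20, 0.95, 0.51],
])
# 1 = perceived support, 0 = lack of perceived support
profile = (posterior > 0.5).astype(int)
print(profile)
```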
There are three traditional categories of applications for cognitive diagnostic models
(Rupp, Templin, & Henson, 2010). First, and perhaps most common, cognitive diagnostic
models are applied to diagnostic assessment in education. Researchers commonly use these
models to measure student achievement in various areas, in order to help teachers identify what
attributes specific students need the most help with. Thus, in these applications, diagnostic
models have a strict formative educational assessment purpose (Roussos, DiBello, & Stout
2007). One example of this category of application came from a study by de la Torre and
Douglas (2004). In this study, researchers identified eight key cognitive attributes that underlie
performance in fraction subtraction. Thus, much time and effort was subsequently and
necessarily invested in the assessment blueprint and the Q-matrix.
A second, recently established, category of cognitive diagnostic modeling applications is
for the clinical diagnosis of psychological and neurological disorders (Rupp, Templin, & Henson, 2010).
Using cognitive diagnostic models, researchers sought to develop a diagnostic assessment that
could be used to screen respondents for a predisposition to be pathological gamblers (Templin &
Henson, 2006). Researchers found that “…the use of a cognitive diagnosis model, the DINO
(see p. 48), allowed for an instrument that was created to investigate the structure of underlying
personality factors in pathological gambling to provide diagnostic information for each criterion”
(p. 301). This is a fairly new category of application, so relatively few studies of this kind exist.
Finally, the most relevant category of applications to the current study is when cognitive
diagnostic models are applied to standards-based assessments in education. Accountability
efforts increasingly require examining the proportion of students performing at “proficient” or
“advanced” levels. In one example, Poggio, Yang, Irwin, Glasnapp, and Poggio (2007) judged
student proficiency on the basis of comparing the raw score a student receives on an assessment
with a set of cut-scores that were set with standard-setting procedures to classify students.
Cognitive diagnostic models provide a clear advantage in this scenario because they provide
more accurate and informative mastery profiles for respondents at a fine diagnostic grain size
(Rupp, Templin, & Henson, 2010). Moreover, according to Leighton and Gierl (2007), more
assessments at a finer grain size are needed to obtain reliable information about broader domains.
Similar to the assessments in the previously discussed study, teacher evaluation systems
based on multiple measures are another example of innovations resulting from the multiple
standards-based measures movement. Instead of students being rated as “proficient” or not,
teachers receive one of the possible ratings on multiple standards. In one recent study that was
relied upon to inform the current study, Halpin and Kieffer (2015) used a secondary analysis of
data from the Measures of Effective Teaching study to outline the application of using latent
classes to learn about classroom observational instruments. They studied the diagnostic
information about teachers’ instructional strengths and weaknesses, along with estimates of
measurement error for individual teachers. Researchers described the advantages of providing
empirically derived profiles of instruction that describe what real teachers are doing in their
classrooms. The diagnostics provided an estimate of the measurement error associated with each
teacher’s profile membership and thereby addressed a major shortcoming of the current practice
of using total scores on multiple measures of effective teaching (Halpin & Kieffer, 2015). An
additional important note about this study is that researchers used latent class analysis which
was previously discussed. One limitation of this study was that they did not address how profile
memberships are related to teachers’ classroom and school contextual factors, such as the
effectiveness of the policy implementation.
In summary, there are relatively few applications of cognitive diagnostic models in the
literature. Three general categories of applications exist. However, it should be noted that
cognitive diagnostic models can be applied whenever statistically-driven classifications of
respondents according to multiple latent traits are sought (Rupp & Templin, 2009). The current
study provides those requisite contextual conditions. It will be the first study to produce
empirically based latent profiles of policy implementation support for teachers. The goal will be to
use these profiles to identify where teachers perceive they are being supported and how this
information can be used to provide feedback to practitioners.
Model Specification: Compensatory vs. Non-Compensatory
Rupp and Templin (2008) identified three defining characteristics for understanding
model taxonomy. First, the scoring rules of the observed response variables can be dichotomous
or polytomous. Dichotomous response variables provide two choices for the participant whereas
polytomous response variables provide more than two. Secondly, the measurement scales of the
attributes they measure can also be either dichotomous or polytomous. Finally, an important
decision regarding model specification is whether the required skills for a specific task interact in
a compensatory manner. The diagnostic model specifies the probability of a positive item
response in terms of examinee skills and item parameters. Many models with varying
simplifying assumptions are available depending on the specified purpose. The ways in which
the attributes are combined can be compensatory or non-compensatory. In
compensatory models, a positive response on any of the attributes measured by an item can
compensate for a negative response on other attributes. Thus, the assumption exists that a deficit
in one attribute can be offset or be compensated by strength in another attribute. This means that
each support attribute that is measured by an item increases the probability of a correct or
positive response on that item. Conversely, in non-compensatory models lack of perceived
support of one attribute cannot be completely compensated by other attributes in terms of the
probability of positive response in the item performance.
In the current study, a compensatory relationship is hypothesized. Hence, if a particular
item is identified under two attributes in the Q-matrix, this observed item is measuring both
attributes. A positive response on one attribute would mean that the teacher feels supported in
this area regardless of whether the other attribute is mastered. The specific model being used is
called the compensatory reparameterized unified model (C-RUM). Since it is a compensatory
model, a non-positive response, or "a lack of perceived support," on a particular measured
attribute can be made up for by the "mastery" or "support" of another measured attribute
(Rupp, Templin, & Henson, 2008). The C-RUM is the most suitable diagnostic model for
this study for several reasons. First, based on the previously discussed literature, support is a
compensatory construct. This means that when a teacher perceives a lack of support on a
particular item, the attribute that item is assigned to can still be diagnosed as "supported,"
depending on the teacher's responses to other items that are also assigned to that attribute.
Moreover, this model is selected because it allows for flexibility. The C-RUM contains unique
parameters for each item and attribute. Thus, a similar approach to defining global and
attribute-specific item discrimination indices can be used. The endorsement probabilities for
respondents in the same latent class are not constrained to be equal across all items with the
same attribute requirements (Rupp et al., 2008). More advantages and the specific
characteristics of the C-RUM are discussed in more detail in chapter 3.
Through their thorough review of cognitive diagnostic models, Rupp and Templin (2008)
defined cognitive diagnostic models as "…probabilistic, confirmatory multidimensional latent-
variable models with a simple or complex loading structure" (p. 226). This definition contains
two elements that require clarification. First, the models are probabilistic because they express a
given respondent's performance level in terms of the probability of mastery, or perceived
support, of each attribute separately, or the probability of each person belonging to each latent
class (Lee & Sawaki, 2009). The number of latent classes in a model depends on the number of
attributes hypothesized. If, for example, one hypothesizes four attributes with two levels each,
then there will be 2^4 = 16 latent classes. The probability of a respondent belonging to each
individual latent class is provided by the diagnostic output. Second, diagnostic models are
confirmatory in the sense that the latent variables are defined a priori through a Q-matrix
(Ravand & Robitzsch, 2015).
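The latent class space implied by a set of dichotomous attributes can be enumerated directly. The following sketch (Python; the attribute count is the four-attribute case hypothesized here, purely for illustration) lists the 2^A profiles:

```python
from itertools import product

A = 4  # four hypothesized attributes, two levels each (0 = not supported, 1 = supported)
profiles = list(product([0, 1], repeat=A))

print(len(profiles))  # 2**4 = 16 latent classes
print(profiles[:3])   # (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)
```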
The Core Compensatory and Non-Compensatory Cognitive Diagnostic Models
There are six core cognitive diagnostic models identified by Rupp, Templin, and Henson
(2008). As previously explained, the core non-compensatory models assume that a deficit in one
attribute cannot be compensated for by a surplus in a different attribute. The three core non-
compensatory models include the deterministic-input, noisy-and-gate (DINA) model, the noisy-
input, deterministic-and-gate (NIDA) model, and the non-compensatory reparameterized unified
(NC-RUM) model. Generally, the DINA model separates respondents into mastery classes for
each item. One class includes the respondents who have mastered all of the measured attributes,
and the other class includes respondents who are non-masters of at least one of the attributes
measured by the item. In this model, no further differentiation between respondents who lack
different attributes is made for any item, which reduces the number of parameters that need to be
estimated. Conversely, the NIDA model accounts for the fact that a respondent lacking only one
of the measured attributes has a higher chance of a positive response than a respondent who has
not mastered any of the measured attributes. The model does this by including a slipping and a
guessing parameter per attribute, which ultimately increases the number of parameters to be
estimated in comparison to the DINA model. These parameters are constrained to be equal
across all items. Similar to the DINA model, the NIDA provides the probability of a positive
response as output. However, the probability of a correct response equals the probability of
correctly applying all measured attributes for an item. The advantage of this is its finer
distinction between respondents who lack different combinations of attributes. The NC-RUM is
similar to the previous two models in that it is non-compensatory. However, this model includes
an interaction term to absorb the effects of incompleteness in an undifferentiated way to improve
the chances that the model will fit the data. The NC-RUM relaxes the parameter constraints
inherent in the DINA and NIDA models and provides parameter estimates at the item level as
well as for each combination of item and attribute.
The core compensatory models maintain the assumption that a deficit in one attribute can
be compensated for by a surplus on another attribute. The core compensatory models include the
deterministic-input, noisy-or-gate (DINO) model, the noisy-input, deterministic-or-gate (NIDO)
model, and the compensatory reparameterized unified model (C-RUM). The DINO model
includes slipping and guessing parameters modeled at the item level. In addition, it includes a
gate component that summarizes the contribution of individual attributes in the latent response
variable. It is limited in that no finer distinction is made between respondents for whom different
sets of attributes are present on items that require multiple attributes. Hence, the NIDO model
provides a finer distinction than the DINO model, similar to how the NIDA addresses the
limitation of the DINA model. Although the NIDO allows for differing contributions of each
attribute in a compensatory way, the parameters are restricted to equality across items. This
constraint may be unrealistic in some cases. The C-RUM allows for a higher degree of
modeling flexibility in that response behavior is modeled at the item-by-attribute level without
equality constraints across items or attributes.
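The conjunctive/disjunctive contrast between the DINA and DINO gates can be made concrete by computing their latent ideal responses, that is, the deterministic response before slipping and guessing noise is applied. The following sketch uses a hypothetical item and attribute profile; the function names are the author's illustration, not from any modeling package:

```python
import numpy as np

def ideal_response_dina(alpha, q):
    """Conjunctive (and-gate): positive only if ALL measured attributes are possessed."""
    return int(np.all(alpha[q == 1] == 1))

def ideal_response_dino(alpha, q):
    """Disjunctive (or-gate): positive if ANY measured attribute is possessed."""
    return int(np.any(alpha[q == 1] == 1))

q = np.array([1, 1, 0])      # hypothetical item measuring attributes 1 and 2
alpha = np.array([1, 0, 0])  # respondent possesses attribute 1 only

print(ideal_response_dina(alpha, q))  # 0: attribute 1 cannot compensate for attribute 2
print(ideal_response_dino(alpha, q))  # 1: possessing either attribute is sufficient
```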
In recent years, there has been a movement toward more general models, including the
generalized deterministic-input, noisy-and-gate (G-DINA) model, the general diagnostic model
(GDM), and the log-linear cognitive diagnostic model (LCDM). The LCDM and GDM are more
flexible models parameterized for both dichotomous and polytomous data and attributes, and
they can be viewed more generally as representing model families consisting of a variety of
compensatory models that arise out of restrictions placed on the parameters in the model
(Rupp & Templin, 2008). The LCDM models the conditional probability that a respondent with
a specific attribute profile provides a positive response to a particular item. The formula for the
LCDM is given by:
\pi_{ic} = P(X_{ic} = 1 \mid \boldsymbol{\alpha}_c) = \frac{\exp\left(\lambda_{i,0} + \boldsymbol{\lambda}_i^{T}\,\mathbf{h}(\boldsymbol{\alpha}_c, \mathbf{q}_i)\right)}{1 + \exp\left(\lambda_{i,0} + \boldsymbol{\lambda}_i^{T}\,\mathbf{h}(\boldsymbol{\alpha}_c, \mathbf{q}_i)\right)}     (3)
where q_i is the set of Q-matrix entries for item i, and λ_{i,0} represents the logit of a correct
response given that a respondent possesses none of the Q-matrix-indicated attributes. The vector
λ_i is of size (2^A − 1) × 1 and contains the main effect and interaction parameters for item i,
and h(α_c, q_i) is a vector of the same size containing linear combinations of α_c and q_i. Thus,
the LCDM can be compared to a multi-way ANOVA. Since most cognitive diagnosis models are
typically parameterized to define the probability of a positive response (i.e., each item is either
positive, X_ij = 1, or not positive, X_ij = 0), the log-linear model is re-expressed in terms of the
log-odds of a correct response for each item as a function of the latent variables (Henson,
Templin, & Willse, 2009). A compensatory model, such as the C-RUM, consists of only main
effects, while non-compensatory models contain only interaction effects.
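To make this distinction concrete, consider a hypothetical item i measuring attributes 1 and 2. Written out, the LCDM logit for that item is:

\text{logit}\, P(X_{i} = 1 \mid \alpha_1, \alpha_2) = \lambda_{i,0} + \lambda_{i,1}\alpha_1 + \lambda_{i,2}\alpha_2 + \lambda_{i,12}\alpha_1\alpha_2

Constraining the interaction term λ_{i,12} to zero yields the main-effects-only form used by the C-RUM, whereas constraining the main effects λ_{i,1} and λ_{i,2} to zero and retaining only the interaction yields a conjunctive, DINA-like item.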
The Q-Matrix
To attain the diagnostic information from a cognitive diagnostic model, one has to
develop a confirmatory hypothesis assigning items to the latent attributes. This is known as a Q-
matrix (Tatsuoka, 1983). A Q-matrix depicts which skills/attributes contribute to the probability
that a participant responds positively to an item. Items can be assigned to multiple skills; thus,
the more skills assigned to an item, the more skills that affect the probability of a positive
response to that item. Q-matrix entries are binary in that a skill either affects the probability
of a correct response or it does not. In table 3, attribute 1 affects the probability of a positive
response to items 3 and 5. Attribute 2 is hypothesized to affect items 1, 2, 3, and 5. Finally,
attribute 3 is hypothesized to affect items 1, 3, and 4. Looking at the items, the probability of a
positive response to item 1 is affected by attributes 2 and 3, but the probability is not
hypothesized to be affected by attribute 1. Items can be hypothesized to be affected by one or
more attributes.
Table 3. Example of Q-Matrix

          Skill/Attribute 1   Skill/Attribute 2   Skill/Attribute 3
item 1            0                   1                   1
item 2            0                   1                   0
item 3            1                   1                   1
item 4            0                   0                   1
item 5            1                   1                   0
…                 …                   …                   …
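For illustration, the Q-matrix in Table 3 can be encoded directly as a binary array. A minimal sketch (Python with NumPy; the variable names are illustrative, not from any modeling package):

```python
import numpy as np

# Q-matrix from Table 3: rows are items, columns are skills/attributes.
Q = np.array([[0, 1, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 1],
              [1, 1, 0]])

for i, row in enumerate(Q, start=1):
    attrs = np.flatnonzero(row) + 1  # 1-indexed attributes measured by this item
    print(f"item {i} measures attribute(s): {attrs.tolist()}")
```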
The Q-matrix method came from Tatsuoka (1983), who developed the Rule Space
Method. In that research, she explored the process of diagnosing student skills in adding
fractions in order to provide remediation. Mapping students' overall knowledge of the skills
required to add fractions onto their test question responses marked the genesis of the Q-matrix.
Several strategies have since been used in the literature to develop Q-matrices. One common
approach relies on the literature to identify an initial set of skills to be measured. In this strategy,
researchers define the attributes or skills to be measured using theoretical support, and then they
develop the items to measure the attributes. Next, the relationships between each item and
attribute are coded, the model is run, and the initial Q-matrix is revised based on various fit
statistics. This strategy is convenient and cost-effective. However, Leighton and Gierl (2007)
found that when this strategy is used in isolation it often results in overly generalized attributes,
and thus they recommended against using it in isolation. A similar approach other researchers
have used is to rely on existing test specifications (Xu & Von Davier, 2008) for the initial
hypothesis and then work toward a final matrix. This strategy can be considered an applicable
extension of relying on the available literature, assuming the test specifications were themselves
developed from the literature.
Using think-aloud protocols to develop the Q-matrix has also been documented in the
literature (Jang, 2009; Wang & Gierl, 2007). Such protocols have been found to have the
potential to increase understanding of cognitive processes as researchers assign items to
attributes, and they have been found to be reliable. Often, expert panels will be relied upon
(Sawaki, Kim, & Gentile, 2009), and expert panels are also commonly used outside the context
of think-alouds. When working with expert panels in developing a Q-matrix, the process may be
an informal conversation. Conversely, it is also common to develop coding rules that are used
for an initial substantive assignment of attributes to tasks so that independent raters will reliably
agree on these assignments (diBello, Roussos, & Stout, 2007).
In one example, Ravand and Robitzsch (2015) developed a Q-matrix to assess reading
skills. They invited multiple subject matter experts to identify the relationships between items
and attributes. Moreover, they found additional empirical support from subsequent correlational
analyses among skills and from the nature of the interactions among the multiple skills required
by individual tasks. Other studies have also relied on correlational analyses to inform the
development of the Q-matrix and improve statistical performance. In one example, Liu, Douglas,
and Henson (2009) used an exploratory factor analysis for Q-matrix development and found that
this approach can give a reasonable solution when the Q-matrix is not too complex. The
researchers used a three-factor exploratory model to identify basic clusters of items that might
measure similar abilities. The empirical results from the factor analysis were then compared to
the content of the items to identify the final Q-matrix.
In a recent study, Close, Davison, and Davenport (2014) relied on binary examinee
responses from simulated and real data to compute item correlations that were analyzed via
principal components analysis (PCA) with promax rotation to identify the skill sets. They found
that the components analysis method for Q-matrix development appeared to be a viable and
useful step in generating a Q-matrix when skill sets were measured by more than one item. They
also recommended that once items have been developed by content specialists, these items
should be pilot tested and a task analysis using components analysis conducted to finalize the
Q-matrix before items are used operationally.
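A minimal sketch of this components-analysis step follows (Python with NumPy; the data are randomly generated stand-ins, the 0.3 loading threshold is an arbitrary illustration, and the promax rotation used by Close et al. is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(500, 20)).astype(float)  # hypothetical binary responses

R = np.corrcoef(X, rowvar=False)        # inter-item correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)    # principal components of the correlations
order = np.argsort(eigvals)[::-1][:3]   # keep the three largest components
loadings = eigvecs[:, order] * np.sqrt(eigvals[order])

q_draft = (np.abs(loadings) > 0.3).astype(int)  # threshold loadings into a draft Q-matrix
print(q_draft[:5])
```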
In summary, chapter two provided a review of the literature on teacher effectiveness and
traditional teacher evaluation systems. This discussion turned to a review of recent efforts to
reform teacher evaluation systems, including the use of multiple measures of effectiveness. The
action taken at the federal level through the Race to the Top initiative was addressed, and the
discussion then filtered down to the specific efforts taken by the Virginia Department of
Education. The Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria
for Teachers was the particular policy explored in this study, so its key components were
reviewed in detail. Next, the theoretical framework for the study was reviewed, with literature
focusing on policy implementation in organizations and K-12 contexts, and the hypotheses for
this study were presented. Finally, a conceptual overview of the methodology was presented and
the advantages of using cognitive diagnostic models were highlighted. As described in the
research questions, the main objective of this study was to test whether this type of modeling
with binary outcomes can be used instead of a unidimensional model. In order for this to happen,
the model must be shown to have significantly better model fit than unidimensional models, its
parameters must be accurately estimated, and the interpretations must make sense. The next
chapter describes the specific actions taken to test these hypotheses.
CHAPTER 3
METHODOLOGY
The purpose of this study is to explore a new way to measure teachers’ perceptions of
school intra-organizational mechanisms for supporting teacher evaluation policy implementation.
Cognitive diagnostic modeling has not previously been applied to policy implementation support
constructs. The diagnostic output from the analysis in this study was intended to provide detailed
empirical information about teachers’ perceptions of support. It is assumed that more precise
diagnostic feedback will be beneficial to policy makers and school leaders in identifying
strengths and weaknesses and in targeting resources in the policy implementation process. When
equipped with more precise diagnostic feedback, policy makers and school leaders may be able
to more confidently engage in empirical decision making, especially in regards to targeting
resources for short-term and long-term organizational goals subsumed within the policy
implementation initiative. Specifically, the following research questions are addressed:
1. Can cognitive diagnostic models be applied to understanding teachers’ perceptions of
intra-organizational mechanisms for supporting policy implementation?
a. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms fit the data better than models specifying
coarser-grained attributes?
b. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms have more accurate parameter estimates than
models specifying coarser-grained attributes?
c. What are the posterior estimated proportions of teachers that are diagnosed to fall
under each latent profile?
2. Are there group differences in the diagnostic model fit based on grade level, subject
taught, and career status?
a. What are the group differences of the estimated latent class profile
distributions based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of parameter
estimates based on grade level, subject taught, and career status?
The ITES Project, Current Study, and the Role of the Researcher
The current study relies on preliminary efforts that were part of a larger grant-funded
study. Researchers with the ITES project team developed the instrument and collected the data
used in the current study. Moreover, Figure 1, which summarizes the conceptual framework used
in the current study, was adapted, with permission, from the ITES project. Several modifications
were made; most notably, there is an added emphasis on both the psychometric components and
the overlapping nature of the intra-organizational support mechanisms in the current study. As is
the procedure in any study in which a secondary data source is used, the data owner and
principal investigator (PI) for the ITES project and the Virginia Tech Institutional Review Board
for the Protection of Human Subjects (IRB) granted approval for the use of the ITES data in this
study (see Appendix B). To be clear, the ITES project included the data collection and the
instrument development. The concept, study design, and methodology of the current study were
separated from the ITES project upon the ITES PI's permission and IRB approval.
As the lead researcher of the current study, it is necessary to clarify my role with the
ITES project. As the lead graduate assistant for the former study, my role involved working with
the data. I have no connection to the teachers who submitted the data, since neither study
involves an intervention. My role as a former ITES researcher did not involve any opportunity to
influence the individual teachers who had already provided the existing data. The data I
requested to use for this project were de-identified in 2013.
Data Collection and Sample
Although the data were obtained from a secondary source, a description of the data
collection efforts from the previous study provides context for the current study. In the larger
project, both quantitative and qualitative data were collected from multiple sources. For this
dissertation study, the primary sources of data included administrative data collected from the
local education agencies, information provided by the National Center for Education Statistics
Common Core of Data, and the ITES teacher survey data collected in the first year of the policy
implementation. A description of each data source is provided in table 4. In total, the study
included three participating LEAs for a total of 35 schools: 6 high, 6 middle, and 23 elementary
schools. This study used a sample of partnership LEAs located in Southwest Virginia and
included one rural, one suburban, and one urban LEA. A total of 19,315 students were served
across the sample.
Table 4. Data Description for the Current Study

Local Education Agency Administrative Data: Teachers' growth measures; other measures of
teachers' performance, including peer evaluation and student surveys of classroom instruction;
teacher background; and school and division documents relevant to the implementation.
Collection method: provided by administrators in partnership LEAs.

School and Local Education Agency Data: Necessary data from the National Center for
Education Statistics Common Core of Data. Collection method: online source.

ITES Teacher Surveys: Teachers' attitudes towards supports for the new teacher evaluation
system and their perceptions of major barriers in the adoption. Collection method: research team
survey administration.
All full-time teachers who had experienced evaluation in 2012-13 in each of the LEAs
were eligible to take the survey. The final sample included 747 teachers, a response rate of 70%.
The high response rate was attributable to the strong partnerships developed with the LEAs. The
surveys were administered externally by the ITES research team, with the LEA administration
providing permission, opportunities to promote the survey, and the allotted time for teachers to
take the survey. An ITES research team member was present at each school for each survey
administration.
In 2013-14, education agency 1 (EA1) had all of its schools accredited by the state based
on the 2012-13 state standardized tests, while 9% of education agency 2 (EA2) schools received
accreditation with warning because these schools did not meet the state benchmark in
mathematics. In EA3, 30% of the schools were accredited with a warning while the other 70%
were fully accredited. It should also be noted that the three education agencies did not serve
similar students with regard to race/ethnicity. However, this is a study about policy
implementation, and no literature was found to support the notion that the race/ethnicity of the
students would impact teachers' perceptions of implementation support. As is evident in
table 5, all three education agencies are comparable to the state statistics on variables that may
impact the level of resources available to the schools, including whether the school was a
Title I school and the proportion of students who qualified for free and reduced-price lunch.
Moreover, all three education agencies were similar with regard to accreditation status, which
may also have had implications for the policy implementation.
Table 5. 2013-2014 School-Year Local Education Agency Information

                                        EA1          EA2           EA3           Virginia
White students                          81.0%        92.0%         92.0%         53%
Black students                          11.0%        3.0%          6.0%          23%
Asian students                          0.0%         0.5%          0.4%          6%
Hispanic students                       3.2%         2.0%          1.0%          11%
Students eligible for free and
reduced-price lunch                     32.0%        32.3%         41%           38%
Title I schools                         33.3%        37.4%         44%           40%
Virginia Accreditation Status           All          All           All           98%***
                                        accredited   accredited*   accredited**

Notes: * indicates that 9.1% of schools accredited with warning; ** indicates that 57% of schools
accredited with warning; *** indicates that 30% of schools accredited with warning.
Using teacher personnel data provided by the partnership school districts and teachers'
responses to survey items, it was possible to analyze information about the survey respondents
who were included in the final sample as well as information about those teachers who were not
included in the final sample, the non-respondents. Non-respondents included those teachers who
were eligible but chose not to respond completely to the survey. A comparison of respondents to
non-respondents is included in table 6. As is clear from the table, there were no significant
differences between the two groups on any of the available variables. Thus, there was no
meaningful systematic explanation for why teachers chose not to respond to the survey.
The final sample included 747 total teachers, of whom 570 were female and 177 were
male (see Table 7). Almost half of the final sample were identified as elementary school
teachers, whereas the other half consisted of almost equal proportions of middle and high school
teachers. Surprisingly, fewer teachers were identified as teachers of science, technology,
engineering, or math (STEM) than were identified as non-STEM. A STEM teacher was defined
as any teacher who teaches a math course (e.g., calculus, trigonometry, applied mathematics) or
a science course (e.g., biology, chemistry, physics, engineering, general science, computer
science). The reason this was surprising is that in a sample with so many elementary school
teachers, one might expect to find more STEM teachers, because it is common for elementary
school teachers to teach all subjects to their assigned class.
Table 6. Comparison of Teacher Characteristics

                                               Respondents    Non-Respondents^a
Female                                         76%            72%
STEM teachers                                  39%            36%
Elementary                                     46%            49%
Middle                                         26%            23%
High                                           32%            31%
Early Career^b                                 34%            35%
Experienced^b                                  66%            65%
Average prior evaluation rating, 2012-13^c,d   1.53           1.46

Note: F-statistics used to indicate whether differences are significant at *p<0.05, **p<0.01,
***p<0.001.
^a Indicates the group of teachers who were not included in this analysis because they did not
respond.
^b Early career teachers are those who reported teaching for 1-5 years; mid-career teachers are
those who had been teaching for 6 to 10 years; experienced teachers are those who reported
teaching over 10 years.
^c Includes data only for EA1 and EA2; this information was not available for EA3.
^d The average score as calculated by the school district.
Finally, a comparison of experienced and early career teachers was made. Early career teachers
are those in the first 5 years of their teaching career; conversely, experienced teachers are those
with more than 5 years of teaching experience. As is evident in table 7, there were 435
experienced teachers and 312 early career teachers in this sample.
Table 7. Crosstabs of Teacher Characteristics

                        Gender                 Level                         STEM
                        Female   Male   N      1      2      3      N        1      2      N
Level
 1: Elementary          314      31     345    -      -      -      -        -      -      -
 2: Middle              137      57     194    -      -      -      -        -      -      -
 3: High                119      89     208    -      -      -      -        -      -      -
 N                      570      177    747    -      -      -      -        -      -      -
STEM
 1: STEM                258      36     294    189    54     51     294      -      -      -
 2: Not-STEM            312      141    453    156    140    157    453      -      -      -
 N                      570      177    747    345    194    208    747      -      -      -
Status
 1: Experienced         358      77     435    214    119    102    435      173    262    435
 2: Early Career        212      100    312    131    76     105    312      121    191    312
 N                      570      177    747    345    195    208    748      294    453    747
Instrumentation
As previously mentioned, the instrument development work was not part of this
dissertation. However, the description of the instrumentation is important for understanding the
retro-fitting process of the diagnostic model (see chapter 4). The teacher survey consisted of
items about teachers' perceptions of the teacher evaluation policy and their perceptions of
supports and barriers to implementing the policy. Teachers shared perceptions of the supports by
indicating the extent to which they agreed about specific aspects of the policy implementation.
The items were originally on 4-point scales where 1 = not at all, 2 = some extent, 3 = moderate
extent, and 4 = great extent (see Appendix A). The development of the items relied upon the four
areas of implementation in the conceptual framework. As previously discussed, the survey was
developed to investigate the policy implementation in Virginia; thus, the GUPSECT standards
were used to ensure items were relevant to this sample of teachers. A visual representation of an
informal blueprint is depicted in Appendix C. Due to the overlapping nature of the areas of
implementation in the conceptual framework, specifying an exact number of items to represent
each cell in this blueprint was not an objective of the development. Rather, the members of the
ITES research team involved in the instrument development process systematically checked that
each area was represented by numerous items.
The final survey included 89 total items about policy implementation support and
additional items about demographic information. After the survey items were written, the
research team solicited feedback from higher-level district administrators, including
superintendents and assessment coordinators. Two separate meetings with two different groups
of administrators from two different LEAs occurred. Upon incorporating feedback from
administrators, the survey was piloted with the district data teams of teachers, a total of about 30
teachers. The average time to complete the survey was about 25 minutes. Further revisions were
made based on the pilot group's written qualitative feedback and time constraints. The final
revision of the teacher survey was developed on Qualtrics software using HTML and then tested
by project team members for adequate functioning. The final survey was administered to
teachers using an official list of teacher e-mail addresses provided by each LEA's central office:
to all teachers in LEAs 1 and 2 in the fall semester of 2013, and to teachers in LEA 3 in the
spring semester of 2014. Following each initial distribution, up to three reminders were given to
teachers every two weeks. The final 89 items are included in Appendix A.
Item Analysis
For the preliminary analysis, it was necessary to investigate any irregularities that may be
present in the data. This was accomplished through an item analysis using Stata software. As
previously described, the instrument was developed to measure teachers’ perceptions of intra-
organizational mechanisms for supporting teacher evaluation. Thus, initially, a unidimensional
latent construct was assumed. The unidimensional construct was teachers’ perceptions of overall
policy implementation support. A higher score on this construct would indicate a greater degree
of perceived support.
From the polytomous item analysis, it was clear that most items performed well based on
their means and variances (see Table 8). The item with the highest mean was item 69
(mean = 3.46), and the item with the lowest mean was item 3 (mean = 1.21). The item with the
largest standard deviation was item 6 (sd = 1.18), and the item with the smallest was item 3
(sd = 0.56).
Table 8. Polytomous Item Descriptive Statistics
item mean sd item mean sd item mean sd
1 2.46 0.85 31 2.05 0.77 61 2.19 1.07
2 2.05 0.93 32 2.30 1.07 62 2.29 1.00
3 1.21 0.56 33 2.35 0.89 63 2.82 0.86
4 1.25 0.61 34 2.47 0.95 64 2.75 0.88
5 1.60 0.73 35 2.12 0.70 65 2.72 1.33
6 2.78 1.18 36 3.10 0.84 66 3.29 0.93
7 1.86 1.16 37 1.46 0.91 67 3.30 0.96
8 2.41 1.12 38 1.65 1.03 68 2.20 1.33
9 3.17 1.01 39 1.82 1.12 69 3.46 0.86
10 3.15 1.02 40 1.48 0.89 70 2.03 1.25
11 2.39 1.02 41 1.46 0.90 71 2.93 1.11
12 3.06 0.96 42 1.33 0.82 72 2.07 1.18
13 3.27 0.88 43 2.22 0.96 73 3.06 1.10
14 3.23 0.89 44 2.89 0.87 74 2.02 0.91
15 2.81 0.98 45 3.05 0.80 75 2.01 0.93
16 2.32 1.02 46 2.72 0.96 76 2.08 0.96
17 2.47 1.01 47 3.30 0.76 77 2.34 0.97
18 2.62 1.00 48 3.36 0.72 78 1.67 0.75
19 3.16 0.94 49 3.24 0.77 79 2.21 0.93
20 3.25 0.92 50 3.32 0.76 80 2.05 0.96
21 3.20 0.95 51 3.40 0.72 81 2.11 0.94
22 2.77 1.00 52 3.30 0.76 82 2.10 0.99
23 2.46 1.00 53 2.94 0.92 83 2.14 0.97
24 1.94 0.94 54 2.91 0.92 84 1.83 0.81
25 1.95 0.92 55 2.61 0.94 85 2.00 0.96
26 1.88 0.92 56 2.84 0.92 86 1.90 0.85
27 3.24 0.97 57 2.47 0.99 87 2.59 0.90
28 1.74 1.02 58 2.66 1.02 88 2.76 0.85
29 2.73 0.94 59 2.14 1.03 89 2.71 0.89
30 2.69 1.11 60 2.38 1.01
The item mean distribution was plotted, and the items with means at the tail ends of the
distribution (<2 or >3) were flagged and further analyzed using graphical and substantive
analyses. The quantiles of each item were plotted against the quantiles of the normal
distribution, and the text of these flagged items was further analyzed for substantive evaluation
purposes. The text of the flagged items was reviewed with two content experts: two teachers,
neither of whom participated in the original survey. In total, twelve of the flagged items were
removed from the analysis. The rationale for deleting these items was that, first, the majority of
the responses to these items were an extreme choice (1 or 4), which resulted in an extreme mean
and/or low item variance. Second, the selected teachers and the research team determined that
cases could be made that the wording of the items was ambiguous, unclear, or not necessarily
important. Some items were reviewed but ended up being included in the analysis based on their
substantive contributions to the measured attributes of interest. Figure 2 displays an example of
the histogram and quantile plot of one of the reviewed items.
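The flagging rule itself is simple to express in code. A minimal sketch (Python with pandas; the response matrix here is randomly generated stand-in data, not the ITES data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical stand-in for the 4-point survey responses (747 teachers x 89 items).
responses = pd.DataFrame(rng.integers(1, 5, size=(747, 89)),
                         columns=[f"item{i}" for i in range(1, 90)])

means = responses.mean()
flagged = means[(means < 2) | (means > 3)].index.tolist()  # tail-end items for review
print(f"{len(flagged)} items flagged for substantive review")
```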
Figure 2. Item Flagged for Substantive Review (histogram and normal quantile plot of a removed item)
A major theme of this study was the application of cognitive diagnostic models to
improve the measurement precision of policy implementation support in order to inform teacher
training and development. Since cognitive diagnostic models have not previously been used for
this type of application, several considerations were necessary. The first consideration was the
current state of the research on cognitive diagnostic models: the majority of the limited research
has focused on dichotomous models, suggesting it may be advantageous to rely on dichotomous
data simply because more support would be available for methodology decisions. Second, most
of the software programs available that actually run
cognitive diagnostic models are restricted to running the dichotomous models. One additional
point worth noting was that the polytomous models required much larger sample sizes to reach
convergence.
As is discussed in more detail in chapter 4 and shown in detail in Appendix D, the
internal structure of the dichotomous data was shown to be very similar to the internal structure
of the polytomous data. In both cases, the resulting structures supported a 10-factor solution of
finer-grained support mechanisms. For these reasons, the application of cognitive diagnostic
models could still be investigated using the dichotomous data, and the diagnostic output could
still be attained from the analysis and used to improve the support provided to teachers. To
reiterate, dichotomizing the data presents a loss of information and may introduce a limitation of
this study; however, it will not have a major impact on the most important goals of this study.
To dichotomize the data, the items were recoded from the initial scale (1 = not at all,
2 = some extent, 3 = moderate extent, 4 = great extent). If a teacher indicated a 3 or a 4 in
response to an item, that entry was recoded to a "1," representing that the teacher felt supported
on this item. Similarly, 1's and 2's were recoded to a "0," indicating that the teacher did not feel
supported on that item. The overall reliability of the survey was calculated after the items were
dichotomized; the coefficient alpha was 0.9, indicating an excellent level of reliability. Similar to
the procedure followed with the polytomous data, an item analysis was completed on the
dichotomous data. The histograms of items at the tail ends of the distributions were further
investigated, as was the text of the potentially problematic items. However, none of the items
were removed based on their dichotomized performance. Thus, based on the initial item analysis
of polytomous and dichotomous items, a total of 77 items were selected. The dichotomized item
scores are summarized in table 9.
Table 9. Dichotomized Item Descriptive Statistics
Item Mean SD Item Mean SD Item Mean SD
1 0.52 0.50 31 0.23 0.42 61 0.34 0.47
2 0.51 0.49 32 0.44 0.44 62 0.34 0.47
3 0.43 0.50 33 0.38 0.49 63 0.62 0.49
4 0.07 0.25 34 0.45 0.45 64 0.57 0.49
5 0.10 0.30 35 0.23 0.42 65 0.56 0.50
6 0.62 0.49 36 0.65 0.48 66 0.79 0.41
7 0.30 0.46 37 0.09 0.29 67 0.79 0.41
8 0.51 0.50 38 0.15 0.36 68 0.38 0.49
9 0.76 0.43 39 0.21 0.41 69 0.84 0.37
10 0.76 0.43 40 0.08 0.27 70 0.32 0.47
11 0.45 0.50 41 0.09 0.29 71 0.65 0.48
12 0.73 0.45 42 0.05 0.22 72 0.32 0.47
13 0.80 0.40 43 0.33 0.47 73 0.70 0.46
14 0.80 0.40 44 0.67 0.47 74 0.24 0.43
15 0.63 0.48 45 0.74 0.44 75 0.24 0.43
16 0.43 0.50 46 0.59 0.49 76 0.27 0.44
17 0.48 0.50 47 0.83 0.38 77 0.38 0.48
18 0.54 0.50 48 0.87 0.33 78 0.12 0.32
19 0.77 0.42 49 0.83 0.38 79 0.33 0.47
20 0.80 0.40 50 0.84 0.37 80 0.26 0.44
21 0.76 0.43 51 0.87 0.34 81 0.28 0.45
22 0.61 0.49 52 0.84 0.36 82 0.29 0.45
23 0.48 0.50 53 0.63 0.48 83 0.30 0.46
24 0.28 0.45 54 0.66 0.47 84 0.17 0.37
25 0.26 0.44 55 0.51 0.50 85 0.23 0.42
26 0.23 0.42 56 0.62 0.49 86 0.21 0.40
27 0.79 0.41 57 0.46 0.50 87 0.58 0.49
28 0.19 0.39 58 0.50 0.50 88 0.65 0.48
29 0.65 0.48 59 0.30 0.46 89 0.63 0.48
30 0.48 0.48 60 0.40 0.49
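The recoding and reliability computation described above can be sketched as follows (Python with pandas; the data are randomly generated stand-ins, and the function implements the standard coefficient alpha formula, not any ITES-specific routine):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
responses = pd.DataFrame(rng.integers(1, 5, size=(747, 89)))  # hypothetical 4-point data

dichotomous = (responses >= 3).astype(int)  # 3 or 4 -> 1 (supported); 1 or 2 -> 0

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Coefficient alpha: (k/(k-1)) * (1 - sum of item variances / total-score variance)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Independent random stand-in data yields alpha near 0; the real survey yielded 0.9.
print(round(cronbach_alpha(dichotomous), 2))
```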
Descriptive Statistics
Following the item analyses, the dichotomous data needed to be further explored. The
number of items on which a teacher indicated feeling supported was summed to create a total
survey score. There were 747 responses, and the average total score was 36.5 with a standard
deviation of 13.5. The maximum number of items on which a teacher perceived support was 86,
and the minimum was 2. The distribution of the total scores was approximately normal, as
depicted in figure 3.
Figure 3. Distribution of ITES Survey Dichotomized Total Scores
With regard to the observed dichotomous total scores, it was clear that there were very few
differences in instrument performance across groups. As seen in table 10, there were no
significant differences in observed total score between experienced and early career teachers or
between STEM and non-STEM teachers. However, a one-way analysis of variance with
Bonferroni comparisons revealed that there were significant differences between elementary
and middle school teachers (F(746) = 13.94, p = 0.02 < α = 0.05). Moreover, there was a
significant difference between elementary school teachers and high school teachers on total
observed score (F(746) = 13.94, p < 0.001 < α = 0.05). The analysis in this study, specifically
research question 2, was focused on the associations between groups on instrument performance.
However, this focus was not on the overall total score of the instrument. Rather, all teachers
were to be grouped into diagnostic profiles based on their performances on specific components
of the instrument. Hence, the overall total score performances and the differences between
groups on overall total score were presented in table 10 purely for descriptive purposes. In later
analyses, these groups will be compared based on diagnostic analyses instead of classical total
scores.
Table 10. Total Scores by Group

              N      Mean Score   SD      Min    Max
Elementary    345    39.05        12.98   2      75
Middle        194    35.85        13.00   5      65
High          208    32.97        14.03   4      67
STEM          294    37.89        13.30   4      75
Not-STEM      453    35.64        13.58   2      67
Female        570    37.95        13.11   2      75
Male          177    31.95        13.81   4      62
Experienced   435    36.35        12.98   2      69
Early         312    36.77        14.23   4      75
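For reference, the group comparison reported above can be sketched as a one-way ANOVA followed by Bonferroni-adjusted pairwise tests (Python with SciPy; the group samples below are randomly generated from Table 10's summary statistics, not the actual scores):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical total-score samples matching Table 10's means, SDs, and group sizes.
elem = rng.normal(39.05, 12.98, 345)
mid = rng.normal(35.85, 13.00, 194)
high = rng.normal(32.97, 14.03, 208)

f_stat, p_val = stats.f_oneway(elem, mid, high)  # one-way ANOVA across grade levels
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

pairs = {"elem vs mid": (elem, mid), "elem vs high": (elem, high), "mid vs high": (mid, high)}
alpha_bonf = 0.05 / len(pairs)  # Bonferroni-adjusted significance level
for name, (a, b) in pairs.items():
    t, p = stats.ttest_ind(a, b)
    print(name, "significant" if p < alpha_bonf else "not significant", f"p = {p:.4f}")
```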
Plan of Analysis
The purpose of applying cognitive diagnostic models to understanding teachers'
perceptions of intra-organizational support mechanisms is to generate profiles of perceived
support on categorical outcome variables key to the implementation process. These categorical
outcomes are especially advantageous when criterion-referenced, standards-based assessments
are desired, as in this study. Another major advantage of cognitive diagnostic modeling is that
"cut scores" are defined internally to the model and a priori. As discussed, a potential alternative
framework is multidimensional IRT (mIRT). Similar to cognitive diagnostic modeling, mIRT
uses full-information estimation and accommodates a multidimensional internal structure;
however, mIRT typically uses continuous latent outcome variables and requires a second,
model-external standard-setting step in order to be useful when criterion-referenced decisions
are desired. Thus, if the application of cognitive diagnostic modeling is found to be successful
here, it is preferable.
It is contended that policy makers and school leaders can use the standards-based
categorical diagnostic information to organize resources for policy implementation based on
where teachers fall in the diagnostic marginal distribution profiles. For example, school leaders
may target certain resources and professional development for teachers who fall under one
profile and plan other remedial steps for teachers in another profile. Before the diagnostic output
is of value, it is imperative to find the best-fitting diagnostic model possible. In the remainder of
this chapter, the methodologies used in the analysis are reviewed and the plan of analysis is
detailed. Following this review, chapter 4 examines the results of the analysis.
The Q-Matrix: Defining the Attribute and Skills Space
It will first be necessary to formally define the attribute and skills space. This process
amounts to establishing, a priori, what skills and attributes are being measured in the policy
implementation process. Based on the literature review, four coarse-grained attributes are
hypothesized (see Figure 1, p. 15): characteristics of the policy, characteristics of teachers,
characteristics of leadership, and characteristics of the organization. Most commonly, theoretical
support is utilized in the development of this space; the previously discussed literature (e.g.,
Century et al., 2012) is relied upon to determine the initial coarse-grained attributes here. The
number of latent classes (skill profiles) in a dichotomous model will be 2^A, where A is the
number of attributes and 2 reflects the two levels of each attribute.
Typically, following the process of defining the attributes, the survey items are developed.
These are referred to as the skill tasks (diBello et al., 2006). Contrary to how this process works
with other methodologies, task developers do not avoid tasks that require combinations of
multiple skills per task. Thus, survey or test items can be developed to measure multiple
attributes. However, since the data in this survey were obtained from a secondary source, the
items were developed prior to this study. With items and attributes defined, the most important
process in this study became mapping the items to the appropriate skills. Since model fit is
subject to the relationships between the individual items and the latent categorical attributes of
interest, the development of the Q-matrix is an important element of this study. A reasonable
portion of the preliminary analysis in this study is dedicated to the empirical procedures taken
beyond the literature review to inform the development of the initial Q-matrix, although this is
not the main focus of the study.
As discussed in chapter 2, a Q-matrix is the numerical specification table that indicates
which attributes are hypothesized to be measured by which items (Tatsuoka, 1983). It represents
a particular hypothesis about which skills are required to successfully answer each item in a test
(Li & Suen, 2013). A Q-matrix traditionally contains the items in the rows and the attributes in
the columns and includes a binary indicator system (e.g., 1 or 0) to reveal whether or not an
attribute is measured by an item. As previously discussed, one item may measure multiple
attributes, and an attribute should be measured by multiple items.
In chapter 2, various approaches to developing a Q-matrix were reviewed. As discussed,
Q-matrices are most commonly developed using a theory-based approach: researchers use the
relevant literature or rely on expert opinion to determine what attributes are important, and then
they organize the relationships between attributes and items. These steps will be used in this
study. However, in order to add more empiricism and improve the Q-matrix development,
additional validation steps will be undertaken to ensure the proper specification of the Q-matrix
in this study. Specifically, exploratory factor analysis (EFA) will be used as a preliminary step in
order to build understanding of the data dimensionality and to understand whether finer-grained
attributes underlie the data. Although the EFA is an important preliminary step for understanding
the relationships in the data, it is only one of the additional measures taken to add empiricism to
this study; to be clear, it is not the main focus of this study. The details and results of the EFA
and the entire Q-matrix development and validation process are described in detail in chapter 4.
The Compensatory Reparameterized Unified Model (C-RUM)
Once the initial Q-matrix is developed, a diagnostic model will be applied to the data.
The particular model to be used in this study is the compensatory reparameterized unified model
(C-RUM) (e.g., Hartz, 2002; Templin, 2006). The C-RUM allows for a higher degree of
modeling flexibility than some other commonly used models. Since the C-RUM is a
compensatory model, a lack of "perceived support" on a particular measured attribute can be
made up for by the "perceived support" of another measured attribute (Rupp, Templin, &
Henson, 2008). Moreover, this model is chosen because it is flexible and contains unique
parameters for each item and attribute. The notation used by Rupp, Templin, and Henson (2009)
is used in this study:
\pi_{ic} = P(X_{ic} = 1 \mid \boldsymbol{\alpha}_c) = \frac{\exp\left(\lambda_{i,0} + \sum_{a=1}^{A} \lambda_{i,1,(a)}\,\alpha_{ca}\,q_{ia}\right)}{1 + \exp\left(\lambda_{i,0} + \sum_{a=1}^{A} \lambda_{i,1,(a)}\,\alpha_{ca}\,q_{ia}\right)}     (4)
where P is the probability of a positive response (i.e., a response of 1), exp(·) is the exponential
function, π_ic is the probability of a positive response to item i in latent class c, X_ic is the
observed response for item i in latent class c, q_ia is the indicator from the Q-matrix of whether
attribute a is measured by item i, α_ca is the attribute "perceived support" indicator for attribute a
in latent class c, λ_i,0 is the intercept parameter for item i, and λ_i,1,(a) is the slope parameter for
item i and attribute a.
According to Rupp, Templin, and Henson (2008), two components are needed to build the
C-RUM. The first component is an intercept parameter, λ_i,0, which is defined at the item level
but not at the attribute level. The second component is a slope parameter, λ_i,1,(a), which is a
main-effect term defined at the attribute level separately for each item. The C-RUM estimates
one intercept parameter for each item and as many slope parameters as there are entries of 1 in
the Q-matrix. The item response function produces a step function, with the λ_i,1,(a) parameters
being the amount of increase in the logit for the presence of each of the attributes needed for
endorsement of an item. Rupp and Templin (2009) explain that items for which the baseline
probability, as defined through the intercept parameter, is relatively high might be problematic.
Moreover, items that poorly measure the required attributes are those for which the probability
increments, as defined through the slope parameters, are small. Finally, in the C-RUM the
respondent receives an item-specific increment through λ_i,1,(a) for each possessed attribute
measured by the item.
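To make the compensatory mechanics of Equation 4 concrete, the following sketch evaluates the C-RUM response probability for every attribute profile of a hypothetical item (Python; the parameter values are illustrative only, not estimates from this study):

```python
import numpy as np
from itertools import product

def crum_probability(lam0, lam1, alpha, q):
    """P(X = 1) for one item under the C-RUM (Equation 4)."""
    logit = lam0 + np.sum(lam1 * alpha * q)  # intercept plus possessed-attribute increments
    return 1.0 / (1.0 + np.exp(-logit))

q = np.array([1, 1, 0, 0])             # hypothetical item measuring attributes 1 and 2
lam0 = -1.5                            # illustrative intercept
lam1 = np.array([1.2, 0.8, 0.0, 0.0])  # illustrative slopes

for alpha in product([0, 1], repeat=4):
    p = crum_probability(lam0, lam1, np.array(alpha), q)
    print(alpha, round(p, 3))  # possessing either measured attribute raises the probability
```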
Cognitive Diagnostic Model Estimation Method
In order for the model to be identifiable, every parameter, including the item parameters
and the person population parameters, should be estimable by a unique value. According to
DiBello, Roussos, and Stout (2006), models can be estimated with all parameters being fixed
without prior distributions (non-Bayesian), with a portion of the parameters having prior
distributions imposed (partially Bayesian), or with all of the parameters assigned a joint prior
distribution (Bayesian). The diagnostic model estimation method used by the software in this
study is marginal maximum likelihood estimation (MMLE) using the EM algorithm. The EM
algorithm is a general iterative algorithm for item parameter estimation by maximum likelihood
when some of the random variables involved are not observed, that is, considered missing or
incomplete (Bock & Aitkin, 1981). It formalizes an intuitive idea for obtaining parameter
estimates when some of the data are missing. In the first step of the EM algorithm (the E-step),
missing values are replaced by estimated values: the data are used to compute the expected
number of examinees and the expected number of positive responses at each quadrature point.
Next, in the M-step, the quadrature results from the E-step are used to carry out MMLE of the
item parameters. The item parameter estimates are then used to re-estimate the quadrature point
distribution, and that revised set of quadrature point frequencies is used to re-estimate the item
parameters. EM cycles are repeated until convergence is nearly achieved. Convergence rates can
vary, with more complex models taking longer amounts of time. Among compensatory models,
models with intercept and main effects without interaction parameters, such as the C-RUM, can
recover their parameters more accurately than models with interaction parameters (Choi et al.,
2010). Moreover, the number of items has a significant effect on parameter estimation and
classification accuracy, as well as fit estimation accuracy (Henson et al., 2009; Tatsuoka, 1990;
Templin et al., 2008).
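The E- and M-steps can be illustrated schematically with an unrestricted latent class model for binary items; a constrained model such as the C-RUM would instead parameterize the class-conditional probabilities through the Q-matrix and the logistic form of Equation 4. This is a minimal didactic sketch, not the estimation routine used by the mdltm software:

```python
import numpy as np

def em_latent_class(X, n_classes, n_iter=200, tol=1e-6, seed=0):
    """Schematic EM for a binary latent class model; X is an (N, I) 0/1 array."""
    rng = np.random.default_rng(seed)
    N, I = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)        # class proportions
    p = rng.uniform(0.3, 0.7, size=(n_classes, I))  # P(X = 1 | class) per item
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each latent class for each respondent.
        log_lik = X @ np.log(p.T) + (1 - X) @ np.log(1 - p.T)  # (N, C)
        log_joint = log_lik + np.log(pi)
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        post = np.exp(log_joint - log_norm)
        # M-step: update class proportions and conditional item probabilities.
        pi = post.mean(axis=0)
        p = np.clip((post.T @ X) / post.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
        ll = log_norm.sum()  # marginal log-likelihood
        if ll - ll_old < tol:
            break
        ll_old = ll
    return pi, p, ll
```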
Finally, the number of latent classes increases exponentially with the number of
attributes. Thus, parameter estimation becomes more difficult to compute as the number of
attributes increases, and models with many attributes or high numbers of skills per item are in
particular danger of being nonidentifiable (DiBello, Roussos, & Stout, 1996). This has practical
implications for the current study: as described below, the number of attributes to be modeled
will range from 4 to 10. It is common that fewer than seven attributes are used per dataset (e.g.,
Hartz, 2002; Templin & Henson, 2006), and Rupp and Templin (2008) found that three, four,
and five attributes are the most common numbers of attributes for log-linear diagnostic models.
According to Galeshi and Skaggs (2015), there is limited empirical evidence examining various
sample sizes, item numbers, and skill varieties for the C-RUM. Tests with as few as 15 items and
four attributes have been examined (Templin et al., 2008), and tests with as many as 50 items
and five attributes have been examined with the LCDM approach (Kunina-Habenicht et al.,
2012). Careful consideration of parameter identifiability, together with systematic model
evaluation, helps achieve effective calibration.
Analyses for Research Question 1
Research question 1 is essentially about finding the best-fitting cognitive diagnostic
model. This section describes the plan of analysis for the overarching research question and the
sub-questions as well. For convenience, the questions are included below:
1. Can cognitive diagnostic models be applied to understanding teachers’ perceptions of
intra-organizational mechanisms for supporting policy implementation?
a. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms fit the data better than models specifying
coarser-grained attributes?
b. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms have more accurate parameter estimates
than models specifying coarser-grained attributes?
c. What are the posterior estimated proportions of teachers that are diagnosed to
fall under each latent profile?
The analysis will begin with the 77 items and the Q-matrix developed from the EFA in the
preliminary analysis. Since this is a new application, the precise number of attributes that should
be modeled is not known, although a four-attribute model is initially hypothesized based on the
literature. An EFA is used to develop the initial Q-matrix in order to build understanding of the
data and the finer-grained mechanisms. Expert opinions will also be used to refine the matrices
through the model specification process. This procedure is detailed in chapter 4. The overarching
goal in research question 1 is to establish the applicability of cognitive diagnostic models. It is
anticipated that a thorough and comprehensive comparison of cognitive diagnostic models to the
unidimensional model will establish their applicability to these particular data. If, for instance, a
unidimensional model were to fit the data better than all diagnostic models specified by this
Q-matrix, any further pursuit of this application may not be reasonable. Thus, the first research
question explores whether cognitive diagnostic models can fit the teacher perception data better
than a model that assumes a single unidimensional "support" ability score. Moreover, in
comparing the various models to the unidimensional model, diagnostic models will be compared
to one another based on the number of attributes defined in each model. It is anticipated that a
better understanding of model fit will emerge based on the number of attributes defined in the
model. Thus, models will be compared on various characteristics, including relative fit,
parameter estimates, and standard errors.
Considering that there are several sources of model misfit in specifying a cognitive
diagnostic model, several factors will be considered in answering research question 1. For
example, misfit can occur from misspecification of the general model type (i.e., compensatory
vs. non-compensatory), misspecification of the Q-matrix, misspecification of the specific model
(i.e., C-RUM vs. G-DINA), or heterogeneity issues within the population (Rupp & Templin,
2008). Due to the multiple sources of misfit, fit evaluation in cognitive diagnosis modeling can
be challenging. Moreover, there is limited empirical research comparing fit statistics and their
criteria in cognitive diagnostic modeling, and even fewer studies have been conducted to
systematically evaluate the extent to which these statistics are sensitive to model-data misfit or
useful for model selection. In this study, model fit will be determined based on comparisons of
the relative fit indices and the stability and accuracy of parameter estimates. Relative fit indices
refer to the process of selecting the best-fitting model among a set of competing models (Chen,
de la Torre, & Zhang, 2013). The comparisons of model fit indices will include the AIC, BIC,
and RMSEA. For all three indices, the model with the lowest value will indicate better overall
fit.
All three relative fit statistics are calculated as functions of the maximum likelihood, and
for all three, the fitted model with the smallest value is selected. The Akaike information
criterion (AIC) provides a relative estimate of the information lost when a given model is used to
represent the process that generates the data (Akaike, 1974). It is calculated as:

AIC = 2k - 2\ln(L)     (8)

where L is the maximized value of the likelihood function for the model and k is the number of
estimated parameters. Models with the lowest AIC are said to be the most parsimonious
(Sakamoto et al., 1986). The Bayesian information criterion (BIC) will also be used. Closely
related to the AIC, the BIC is based, in part, on the likelihood function and includes a larger
penalty term for additional parameters than the AIC. The BIC is found by:

BIC = -2\ln(L) + k\ln(n)     (9)

where n is the number of data points in the observed data (equivalently, the sample size) and k is
the number of free parameters to be estimated. Researchers studying mixture modeling
approaches suggest the use of the BIC as an indicator of fit and as the criterion for comparing
competing models (Hagenaars & McCutcheon, 2002; Magidson & Vermunt, 2004; Sclove,
1987). However, the AIC and BIC have been shown to be sensitive to sample size, test length,
and the number of model parameters (Burnham & Anderson, 2004).
In a study by Chen, de la Torre, and Zhang (2013), it was demonstrated that for relative fit
evaluation, the BIC, and to some extent the AIC, can be useful to detect misspecification of the
model, the Q-matrix, or both. In another study, Kunina-Habenicht et al. (2012) found that the
AIC and BIC were useful in selecting the correctly specified Q-matrix over misspecified
Q-matrices. Moreover, they also found that the AIC was useful in selecting the correct model
against the misspecified model when all interaction effects were omitted.
In another study, Galeshi and Skaggs (2015) examined the performance of the commonly
used relative fit indices in determining model-to-data fit for the C-RUM. The researchers
evaluated the sensitivity of the AIC and BIC in identifying model misfit/selection for six sample
sizes of 10,000, 5,000, 1,000, 500, 100, and 50 with various test lengths and numbers of
attributes under two extreme Q-matrix misspecifications: over-fit and completely reversed
Q-matrices. The results indicated that the BIC and AIC indices performed similarly for larger
datasets (N ≥ 500) but varied for smaller datasets (N < 500), suggesting a superior performance
for the BIC. Since the dataset in this study is large (N > 500), both of the aforementioned fit
statistics will be considered, and in cases where they disagree, the BIC will be used.
Another important fit statistic that will be used in this study is the RMSEA. The model
RMSEA analyzes the discrepancy between the hypothesized model, with optimally chosen
parameter estimates, and the population covariance matrix. One advantage of this statistic is that it is relatively insensitive to sample size. The RMSEA ranges from 0 to 1, with smaller values indicating
better model fit. A value of .06 or less is indicative of acceptable model fit. MacCallum, Browne
and Sugawara (1996) have used 0.01, 0.05, and 0.08 to indicate excellent, good, and mediocre
fit, respectively.
RMSEA = √((χ² − df) / (df(N − 1)))    (10)
According to Cook, Kallen, and Amtmann (2009), the model RMSEA provides an answer to the
question, “How well would the model, with unknown but optimally chosen parameter values, fit
the population covariance matrix if it were available?” It is based on an estimate of the
population discrepancy function, which assesses the error of approximation in the population.
The RMSEA is thus a measure of discrepancy; because it presents discrepancy per degree of
freedom, it is sensitive to model complexity (i.e., number of estimated parameters). Additionally,
the mdltm software will automatically produce RMSEA for individual items.
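As a brief illustration of equation 10, the sketch below computes the RMSEA point estimate from a chi-square statistic; the inputs are hypothetical, since mdltm produces these values internally.

    import math

    def rmsea(chi_square, df, n):
        # Equation 10; the numerator is truncated at zero so that models fitting
        # better than expected by chance yield RMSEA = 0 rather than a complex value.
        return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

    print(round(rmsea(chi_square=180.0, df=90, n=747), 3))  # hypothetical inputs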
The model specification will include the application of fine-grained models and coarse-
grained models to the data. To answer research question 1b, the diagnostic models will also be
evaluated in terms of the stability and accuracy of the parameter estimates. Trends of the
average parameter estimates and standard errors will be compared. For every model, the item
parameters indicate how well the items performed for diagnostic purposes. Models with more
stable and lower standard errors will be determined to be more advantageous. It is anticipated
that items measuring two attributes will more likely have larger standard errors than items
measuring one attribute. The standard errors will provide some insight into the accuracy of item
parameter estimates.
Finally, the ability of the models to discriminate between teachers who perceive support and teachers who do not perceive support will be used as a model fit index. Technically, this is a discrimination index, but similar to other studies (Jang, 2009), it will be used as one way to determine the best-fitting model. The calculation of the discrimination index is described in
Appendix F.
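Although the full calculation is given in Appendix F, the sketch below illustrates the general idea, assuming the index is the average difference, across items, in the proportion of positive responses between respondents classified as perceiving and not perceiving support (consistent with how the index is described in chapter 4); the toy data are hypothetical.

    import numpy as np

    def discrimination_index(responses, perceives_support):
        # Proportion of positive responses per item within each classification.
        p_yes = responses[perceives_support].mean(axis=0)
        p_no = responses[~perceives_support].mean(axis=0)
        # Average the per-item differences; larger values indicate better discrimination.
        return float((p_yes - p_no).mean())

    responses = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1],
                          [0, 0, 1], [0, 0, 0], [1, 0, 0]])  # 6 respondents x 3 items
    perceives_support = np.array([True, True, True, False, False, False])
    print(round(discrimination_index(responses, perceives_support), 2))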
Analysis for Research Question 2
For research question 2, the overarching goal is to explore group differences through the
application of multiple-group multinomial log-linear models. The groups being compared in this
study were selected based on logistical organizational considerations. For example, as previously
discussed, the vision is that the application of these models can be used by school districts to
organize for policy implementation. School leaders can use this type of information to organize teachers into groups based on their latent profiles and target the appropriate resources to those groups. Although the data contains numerous grouping variables, school districts are most inclined to organize resources for teachers based on grade level, subject taught, or career status. This is true for many reasons, but most importantly because of logistical constraints on how schools are organized. Teachers in the same grade levels and who teach the same subjects likely have more
similar teaching schedules, teaching strategies, and teaching philosophies. Thus, research
question 2 explores the application of mixture models defined with these multiple groups. One of
the advantages of the diagnostic model is to concurrently estimate item parameters for several
subpopulations or subgroups. In the case of teachers’ perceptions of policy implementation
support, the population includes several different subgroups, for example, grade level, subject
taught, and career status. Since this is a new application, it is necessary to explore differential
group estimation. Thus, for the second research question, the ability of the diagnostic model to
estimate group differences will be explored.
2. Are there group differences in the diagnostic model fit based on grade level, subject
taught, and career status?
a. What are the group differences of the estimated latent class profile distributions
based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of
parameter estimates based on grade level, subject taught, and career
status?
To be clear, research question 2 will not be a traditional group contingency comparison analysis
of outcomes. Nor will it be akin to controlling for potential confounding variables in a regression
model. Rather, since the goal of this study is to test a new application of the diagnostic model,
this analysis will essentially rely on the advantage of the diagnostic model to concurrently
estimate item parameters for several subgroups. This strategy of latent grouping will provide an
overall better model fit, and thus, the diagnostic outcomes will be more useful. Recall that the
proportions of teachers in each diagnostic profile are actually estimated from the model. So,
comparing groups is actually comparing whether the models are the same. The procedures, outlined in Xu and Von Davier (2008) for comparing groups, will be conducted in the mdltm software. The analysis will commence by running mdltm with all groups in the same model. This is known as a multiple group multinomial log-linear
model. This will permit item and attribute parameter estimates to be on the same scale across
groups. Similar to Xu and Von Davier (2008), the Q-matrix from the best-fitting single group
model from research question 1 will be used. Next, a two-group model will be fit, in which the two groups represent teachers who teach STEM subjects (science, technology, engineering, math) and those who do not. A second two-group model will contrast early-career and experienced teachers. After that, a three-group model defined by elementary, middle, and high school teachers will be fit. Finally, a six-group model will be defined by the full factorial design of STEM status and school level. In addition to comparisons of model fit and parameter estimates to determine whether the model fit differently for different groups, the proportions of teachers in each estimated latent class, based on the diagnostic marginal distributions, will be compared.
Next, a descriptive comparison of estimated profile distributions, marginal distributions
of each attribute, and correlations between attributes will be completed. Since the distributions
are estimated from the models, no hypothesis tests will be necessary. Following the descriptive
statistics, a comparison of single group versus multiple group using fit indices will be completed.
The statistics will include both the AIC and BIC. The likelihood ratio test will also be used since
a single group model is nested within a multiple group model. This test is sensitive to sample
size and number of parameters. Finally, there will be a comparison of item parameter estimates.
The 95% confidence intervals will be formed around each item parameter, separately for each
group. The item parameters that are different enough to lie outside these confidence intervals
will be identified.
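The sketch below illustrates both comparisons under stated assumptions: the likelihood ratio test for the nested single-group model, and a non-overlap check on 95% confidence intervals as one way to operationalize parameters that are “different enough”; all numeric values are hypothetical.

    from scipy import stats

    def lr_test_p(loglik_single, loglik_multi, df_diff):
        # G^2 = -2(lnL_single - lnL_multi), compared to a chi-square distribution.
        g2 = -2.0 * (loglik_single - loglik_multi)
        return 1.0 - stats.chi2.cdf(g2, df_diff)

    def ci_nonoverlap(est1, se1, est2, se2, z=1.96):
        # Flag an item parameter whose 95% CIs for two groups do not overlap.
        lo1, hi1 = est1 - z * se1, est1 + z * se1
        lo2, hi2 = est2 - z * se2, est2 + z * se2
        return hi1 < lo2 or hi2 < lo1

    print(lr_test_p(-30120.4, -30064.9, df_diff=70))   # hypothetical log-likelihoods
    print(ci_nonoverlap(0.95, 0.08, 1.40, 0.09))       # hypothetical slope estimates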
Running a model for each group separately would result in the item parameters being on
different scales. Thus, they would not be directly comparable. Using a multiple-group
multinomial log-linear mixture model puts the groups on the same scale to allow for proper
group comparisons (Xu and Von Davier, 2008). As discussed above, the groups being compared were selected based on logistical organizational considerations.
Specifically, the groups of interest in this study will be the available categorical teacher
level variables including grade level, subject taught, and experience. Grade level will be split
into three categories including elementary, middle, and high school. Subject taught will be a
variable where a teacher either teaches a STEM subject or not. Career status will be early career
and experienced teachers as defined by Sun (2012). As previously discussed, many grouping
variables could have been explored with cognitive diagnostic modeling. However, none of the
available grouping variables aligned with the practical goals of this study as well as grade level, subject taught, and career status did. Recall that the goal of using cognitive diagnostic modeling is to retrieve categorical outcomes that inform standards-based targeting of resources for policy implementation. The group variables in this study allow specific differential supports to be conveniently targeted by group, in the organizational sense, should that prove necessary.
CHAPTER 4
RESULTS
Preliminary Analysis: Q-Matrix Development
Data Dimensionality
As described in chapter 2, the instrument used in this study measures teachers’
perceptions of intra-organizational mechanisms for supporting teacher evaluation. Based on the
literature review, it is clear that the unidimensionality assumption is not realistic
when it comes to supporting teacher evaluation policy implementation. Multiple underlying
factors make up teachers’ perceived support of such mechanisms, although the precise number is
not empirically supported in the available body of literature. A very thorough literature review
has resulted in a four-factor hypothesis. As reviewed in chapter 2, one alternative analytic
approach for testing this hypothesis is to specify the relationships between the observed variables
and the latent variables based on the literature and then use a confirmatory factor analysis to test
the model. However, as previously discussed, a major goal of this study is to test the application
of cognitive diagnostic modeling to this data. Cognitive diagnostic modeling offers several
advantages to a factor analytic framework, such as the accommodation of multidimensionality of
the data, complex loading structures and the use of a categorical outcome variable. These
advantages, and others, are discussed in chapters 2 and 3. Additionally, cognitive diagnostic
models are confirmatory in that the skill and attribute relationships are specified a priori through
the development of a hypothesized Q-matrix.
Using the literature review combined with the existing instrument specifications is one
common approach to developing the Q-matrix for the diagnostic modeling process. Both of
these procedures are economical and convenient. However, Leighton and Gierl (2007) found that relying on these procedures alone may result in a model that is too general for diagnostic purposes and is therefore unwarranted.
investigation into the dimensionality of the data is necessary. This empirical support can be
used, in conjunction with the literature and instrument specifications, to justify the initial Q-
matrix used in this study.
Exploratory Factor Analysis
The goal of the exploratory factor analysis (EFA) is to understand the underlying
relationships of the data in order to provide empirical support for the initial Q-matrix. In an
exploratory factor analysis, the components are referred to as factors. The general factor analysis
(FA) approach is concerned with identifying the underlying factor structure that explains the
relationships between the observed variables. EFA, specifically, is based on the common or
shared variance between variables, which is partitioned from the left-over variance unique to
each variable and any error introduced by measurement. In other words, the purpose of the EFA
is to identify basic clusters of items that might measure similar attributes. These clusters can be
used to establish the initial Q-matrix.
Prior to conducting the EFA, the items were dichotomized. In chapter 3, several reasons
were provided for this decision. The main reason for dichotomizing the data was to stay
consistent with the diagnostic models to be conducted in a subsequent analysis. Moreover, an
eigenvalue and maximum likelihood-based EFA with the polytomous data was conducted and
this procedure produced a very similar result in terms of the number of factors and the factor
structure in regards to item-factor correspondence (see Appendix E).
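As a simple illustration of this preprocessing step, the sketch below dichotomizes hypothetical Likert-type responses; the 5-point scale and the cut point (responses of 4 or 5 coded as 1) are assumptions for illustration, not the exact rule documented in chapter 3.

    import pandas as pd

    likert = pd.DataFrame({"item1": [5, 3, 4, 2, 1],
                           "item2": [4, 4, 2, 5, 3]})  # hypothetical responses
    dichotomized = (likert >= 4).astype(int)           # 1 = agreement, 0 = otherwise
    print(dichotomized)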
A traditional eigenvalue-based EFA with principal axis factor extraction on the
correlation matrix was conducted. Promax rotation, one of the most popular oblique rotations,
was used because it permitted correlations among factors. Considering the context of this study,
it would be expected that factors regarding policy implementation support would correlate. In
order to determine the optimal number of factors to retain, several statistical factor selection
procedures were used. Each of these methods was based on eigenvalues. The eigenvalue of a
factor represents the amount of variance accounted for by that factor. The lower the eigenvalue,
the less that factor contributes to the explanation of variances in the variables (Norris &
Lecavalier, 2009). In addition to the eigenvalues, as recommended by Brown (2015), factor
selection was also guided by substantive considerations. According to Brown (2015), “…the validity of a given factor should be evaluated in part by its interpretability” (p. 21).
The first eigenvalue-based factor selection procedure was the Kaiser–Guttman rule
(1991). This rule states that when an eigenvalue is less than 1.0, the variance explained by a
factor is less than the variance of a single indicator. Although this method is simple and
convenient, when used alone it may lead to over-factoring or under-factoring on factors with
eigenvalues around 1. Based on the traditional eigenvalue-based EFA on the correlation matrix
composed from dichotomously scored items, ten factors had eigenvalues greater than 1 (see table
11).
Table 11. Exploratory Factor Analysis Eigenvalues by Factor
Factors Eigenvalue
Factor1 13.27
Factor2 4.33
Factor3 2.98
Factor4 2.34
Factor5 2.20
Factor6 2.18
Factor7 1.63
Factor8 1.46
Factor9 1.35
Factor10 1.02
Factor11 0.73
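For illustration, the sketch below applies the Kaiser–Guttman retention rule to the eigenvalues of an item correlation matrix; the simulated response matrix is a stand-in for the actual data, and the study itself used principal axis factoring with promax rotation rather than this bare eigen-decomposition.

    import numpy as np

    rng = np.random.default_rng(0)
    items = rng.integers(0, 2, size=(747, 77))       # simulated dichotomous responses
    corr = np.corrcoef(items, rowvar=False)          # 77 x 77 item correlation matrix
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]     # sorted largest-first
    n_retain = int((eigenvalues > 1.0).sum())        # Kaiser-Guttman: eigenvalue > 1
    print(n_retain, np.round(eigenvalues[:5], 2))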
The second factor selection procedure that was used was Cattell's (1966) scree plot. The Cattell
scree test plots the components on the X-axis and the corresponding eigenvalues on the Y-axis. As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward a less steep decline, the scree test says to drop all components after the last substantial drop in the magnitude of the eigenvalues. This is referred to as the “elbow rule.” Since this was somewhat subjective, the scree plot was used as a guide to locate an approximate number of factors. One limitation of this procedure is that there may not be a clear drop in eigenvalues. In this method, the eigenvalues were first plotted. Next, the “elbow rule” was used to determine the last substantial drop in the magnitude of eigenvalues. Based on figure 4, there appeared to be a substantial drop in eigenvalue at approximately 10 factors.
Figure 4. Exploratory Factor Analysis Scree Plot
Based on the eigenvalues and the scree plot, the loadings of the 10-factor model were
explored for conceptual interpretability. In each case, factor loadings were used to determine
which item loaded onto which factor. A minimum loading of 0.3 was required in order for an
item to load to a factor. This 0.3 threshold was based on the recommendation by Hair, Tatham,
Anderson and Black (1998) for sample sizes greater than 350. The summary in table 12 describes
each of the 10 resulting factors. These 10 factors explained approximately 92% of the total
variance. The factor with the largest eigenvalue was the factor with items regarding the
legitimacy of the policy (factor 1). A total of 14 items loaded on this factor. After factor analysis,
seven items loaded on one underlying construct that captures policy adaptability; seven items
loaded on the factor regarding teacher confidence; seven items loaded on the factor that captured
teachers’ attitude towards the policy; seven items loaded on the factor that captured leadership
advocacy and communication regarding the policy; eight items loaded on the factor that captured
the quality of professional development provided by leadership; seven items loaded on the factor
capturing the legitimacy of the leadership; eight items loaded on the factor capturing the values
of the organization; six items loaded on the factor capturing organizational locus of decision
making; and four items loaded on the factor capturing organizational resources provided for the
policy implementation. Please see Appendix A for detailed information on item-factor loadings.
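The item-to-factor assignment rule can be sketched as follows, using a small hypothetical loading matrix; the 0.3 threshold is the one recommended by Hair, Tatham, Anderson, and Black (1998) and used in this study.

    import numpy as np

    loadings = np.array([[0.62, 0.10, 0.05],    # hypothetical rotated loadings
                         [0.08, 0.45, 0.31],    # (rows: items, columns: factors)
                         [0.12, 0.09, 0.71]])
    for item, row in enumerate(loadings, start=1):
        factors = [f + 1 for f, value in enumerate(row) if abs(value) >= 0.3]
        print(f"item {item} -> factor(s) {factors}")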
Table 12. Summary Interpretation of Factors
Factor Description
1: Policy Legitimacy
Items address legitimacy of policy in terms of its
intended impact and the sources of evidence used in the
evaluation.
2: Policy Clarity/Adaptability Items address alignment of policy with other institutional
guidelines
3: Teacher Confidence Items address teachers’ confidence on skills relating to
the policy.
4: Attitude towards policy Items address teachers' attitude towards policy and
guidelines.
5: Leadership Advocacy and
Communication
Items address how school leaders advocate for the policy
and communicate regarding the guidelines.
6: Quality of Professional Development Items address the professional development provided by
leadership on policy-related concepts
7: Leader Legitimacy Items address the ability of school leaders to effectively
lead the implementation of the policy.
8: Organizational Resources Items address the resources provided by the
organization.
9: Organizational Locus of Decision-
Making
Items address whether organizational factors allow for
teachers to be involved in decisions related to the policy.
10: Organizational Values Items address the values of the organization in terms of
policy-related concepts.
Upon closer analysis, the results indicated that although the best-fitting model had 10 factors, those factors, once understood in the context of the survey items and the literature, could be interpreted as finer grains of the originally hypothesized four coarse-grained attributes. In table 13, the four coarse-grained attributes are in the first row, and the ten interpreted factors are in the remaining rows under their corresponding attribute. This had implications for the cognitive diagnostic model building process, as it allowed for greater flexibility in specifying the model. More specifically, models with finer-grained attributes (the 10-factor model) could be fit in a way that maintained consistency with the hypothesized model. Then, factors could be combined to fit a 9-attribute model that was also consistent with the original model, as long as the factors that were combined were within the same coarse-grained attribute. As described below in more detail, the correlations between factors were used to determine which factors should be combined. The relationship between the fine-grained 10-factor interpretation and the coarse-grained 4-factor interpretation is summarized in table 13.
Table 13. Final Model: Grain Sizes

Policy:        1: Clarity/Adaptability; 2: Legitimacy
Teachers:      3: Self-Efficacy; 4: Understanding and Attitude towards the Innovation
Leadership:    5: Professional Development Quality; 6: Evaluator Legitimacy; 7: Innovation Advocacy/Communication
Organization:  8: Resources; 9: Values; 10: Locus of Decision Making
The Q-Matrix and Model Specification
Using the results from the factor analysis as a preliminary step, the final 10-attribute Q-
matrix was established. This procedure commenced with the results of the EFA being used to
determine the initial factor that each item loaded on. However, as previously described, the EFA
was only the initial step conducted in order to increase the level of empiricism of the Q-matrix.
Given that one of the advantages of cognitive diagnostic modeling is the ability to account for
inter-item variance, it needed to be determined whether items could, conceptually, be argued to
load on factors other than those they were found to in the EFA. Thus, using the results from the
factor analysis, expert opinion was sought to make additional revisions to the Q-matrix.
Specifically, two experts in educational policy and two public school teachers were consulted to
determine which items potentially load on additional factors. One educational policy expert had a
PhD in Educational Policy and was an assistant professor. The other expert was a PhD candidate
in Educational Policy. The two public school teachers were licensed to teach K-12 in the state of
Virginia. All experts were invited to discuss all 89 of the original items and the attributes and
specify the attributes measured by each item. The result from this consultation was a 10-attribute
Q-matrix with each item loading on either one or two attributes. Moreover, several items could
potentially load on to more than one or two attributes. However, the number of attributes an item
could load on to was limited to either one or two. This would ensure that the complex loading
structure was not overly complex, and that the number of parameters would not grow so large as
to prevent model convergence. This decision was made as a result of the sample size limitation
in this study. It was determined that through the model specification process, any revisions to
the Q-matrix would be based on the expert consultation. Table 14 displays an example of the
initial 10-attribute Q-matrix for the first five items.
Table 14. An Example of the Initial 10-Attribute Q-Matrix

          Policy      Teacher     Leadership       Organization
Items     1    2      3    4      5    6    7      8    9    10
1         0    0      0    0      0    1    0      1    0    0
2         0    0      0    0      0    1    0      1    0    0
3         0    0      0    0      1    0    0      0    0    0
4         0    0      0    0      1    0    0      0    0    0
5         0    0      0    0      1    0    0      0    0    0
…         …    …      …    …      …    …    …      …    …    …
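A minimal sketch of this Q-matrix as a data structure, using the five rows shown in table 14, with a check on the one-or-two-attribute constraint adopted in this study:

    import numpy as np

    q_matrix = np.array([[0, 0, 0, 0, 0, 1, 0, 1, 0, 0],   # rows: items 1-5
                         [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],   # columns: attributes 1-10
                         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])
    loads = q_matrix.sum(axis=1)
    assert ((loads >= 1) & (loads <= 2)).all(), "item violates the 1-2 attribute limit"
    print(loads)  # number of attributes measured by each item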
Findings
Research Question 1: Testing the New Application of Cognitive Diagnostic Models
To answer research question 1, models were compared to the unidimensional model and
to other models. First, a unidimensional model was run in order to establish a baseline
comparison model. Comparisons were made between model fit and parameter estimates. The
best-fitting model was chosen and the diagnostic information from that model was interpreted.
To begin answering research question 1, the 10-attribute model was fit. As anticipated, fitting the
10-attribute cognitive diagnostic model was quite complex because of the number of parameters
that needed to be estimated. The model based on the initial Q-matrix did not converge. Adjustments were made to
the initial Q-matrix in order to fit different interpretations of the Q-matrix for the 10-attribute
model. These adjustments were based entirely on the previously discussed expert consultation.
After several modifications, a 10-attribute model converged. Further revisions were made to the
Q-matrix, again, based on expert consultation. In total, two 10-attribute models converged. Using
the best-fitting 10-attribute model based on the AIC, BIC, and RMSEA, the two highest
correlated attributes were used to determine how the finer-grained attributes would be combined
in order to develop the initial Q-matrix for the 9 attribute model. The Q-matrix entries of the two
attributes with the highest bivariate correlations were then combined in order to develop the 9-
attribute model. Specifically, the two attributes called “organizational values” and
“organizational locus of decision making” were combined. After multiple conceptually adequate
solutions were found for the 9-attribute model, the best-fitting model attribute correlations were
again used to specify the initial Q-matrix for the 8-attribute model. This process was repeated for
all of the initial 4 through 9 attribute models. One guiding rule for this technique was that
attributes were not to be combined unless they were under the same coarse-grained umbrella. For
example, regarding the 10-attribute model in table 13, the highest correlated attributes were
attributes 9 and 10, thus, those attributes were combined to establish the initial 9-attribute model.
However, if the highest correlation had actually been between attributes 7 and 8, followed by 9 and 10, attributes 9 and 10 would still have been the attributes combined for the 9-attribute model, because attributes 7 and 8 fall under different coarse-grained attributes. The same procedure would occur to fit the models with four through eight
attributes. In each instance, the same rules were applied.
In total, there were two 10-attribute models, two 9-attribute models, two 8-attribute models, and three 7-attribute models. Each model was described in terms of the number of 1-attribute items and the number of 2-attribute items. The number of items was not changed
following the initial item analysis in chapter 3. This permitted the use of AIC and BIC fit
comparisons between models. The total number of models that were fit for each category (i.e., 10-attribute, 9-attribute, 8-attribute) was limited by the sample size and substantive interpretation. First, the number of models depended on whether the sample size was sufficient to estimate every parameter and reach model convergence. Second, the number of parameters in each model was limited by conceptual considerations. As previously discussed, any modifications to the Q-matrices were based entirely on expert consultation. Thus, there were a limited number of modifications that could be made to each Q-matrix. For example, there were three separate 7-attribute models displayed. Each of these models began with the 10-attribute Q-
matrix solution from the exploratory factor analysis.
As previously described, this resulted in each item being assigned to either one or two attributes. As is evident in table 15, model 1 had 36 items assigned to one attribute and 41 items assigned to two attributes. Expert judgments were used to modify the relationships between items and attributes. In model 2, 5 of the items that were assigned to one attribute in model 1 were instead assigned to two attributes. Although all potential modifications were exhausted for each category, and all models were run, fewer of the higher-attribute models actually converged than of the lower-attribute models.
Table 15. Fit Results for Models with 7-10 Attributes

Model Fit for Tests, Items (i) = 77, Attributes (K) = 7
Model  One-Attribute  Two-Attribute  AIC       BIC       RMSEA  j/k*   DI**  Misfit (N)***
1      i = 36         i = 41         60258.59  60271.04  0.35   16.86  0.37  26
2      i = 31         i = 46         60194.19  60209.73  0.32   17.57  0.34  26
3      i = 26         i = 51         60141.01  60159.63  0.37   18.28  0.33  26
Mean                                 60197.93  60213.47  0.35   17.22  0.35  26

Model Fit for Tests, Items (i) = 77, Attributes (K) = 8
Model  One-Attribute  Two-Attribute  AIC       BIC       RMSEA  j/k    DI    Misfit (N)
1      i = 55         i = 22         60092.75  60006.73  0.21   15.12  0.11  49
2      i = 45         i = 32         60859.18  60925.49  0.22   15.88  0.36  65
Mean                                 60475.97  60466.11  0.34   15.50  0.23  57

Model Fit for Tests, Items (i) = 77, Attributes (K) = 9
Model  One-Attribute  Two-Attribute  AIC       BIC       RMSEA  j/k    DI    Misfit (N)
1      i = 29         i = 48         60499.43  60593.44  0.34   13.00  0.25  71
2      i = 24         i = 53         60366.38  60483.47  0.32   13.56  0.23  73
Mean                                 60432.91  60538.46  0.33   13.28  0.24  72

Model Fit for Tests, Items (i) = 77, Attributes (K) = 10
Model  One-Attribute  Two-Attribute  AIC       BIC       RMSEA  j/k    DI    Misfit (N)
1      i = 39         i = 38         60461.49  60472.41  0.36   10.76  0.21  70
2      i = 29         i = 48         60182.51  60318.06  0.32   10.50  0.18  73
Mean                                 60322.00  60395.24  0.32   10.50  0.19  73

Notes: *Items per attribute (j/k), calculated by summing the number of items assigned to each attribute and dividing by the number of attributes. **Discrimination index (DI); calculation described in Appendix F. ***Misfit, calculated by summing the number of items with RMSEA > 0.10.
As can be seen in table 15, many of these models performed very poorly in terms of fit
indices. Among the 7-attribute models, model 3 had the lowest AIC and BIC. However, this
model still had an RMSEA well above 0.1. Moreover, not a single model with 7 or more
attributes had a reasonable RMSEA. Most of these models also had many mis-fitting items. Between the 7-10 attribute models, a trend clearly emerged in the number of mis-fitting items. Despite the lack of fit with these models, they still did an adequate job of discriminating.
Among the 7-attribute models, the average difference between the proportion of positive responses for teachers who perceived support and those who did not was 0.35 (see Appendix F). A higher discrimination index (DI) implied a better model. Although the DI increased from the 10-attribute model to the 7-attribute model, the trend was not as clear as it was for the misfit statistic.
Due to the poor fit of all the models described in table 15, none of the models were found to be
useful in terms of diagnosing teachers’ perceptions of intra-organizational support mechanisms.
This was an anticipated result. Had a higher-attribute model fit the data well, problems still
would have existed with the practical application of the diagnostic interpretation. For example, if the 10-attribute model had fit the data best, there would have been 2^10 = 1,024 latent profiles that teachers could potentially be diagnosed to fall under. This would not have been practical for the proposed application of this study. As can be seen in table 15, model 1 with 8 attributes (K = 8)
was the best fitting model with the lowest AIC, BIC, and RMSEA of all the models. As
previously discussed in more detail, the literature recommended that when AIC and BIC
disagree, the BIC is to be relied upon. In addition, the RMSEA was also used as an index to
ensure that any model that was determined to be the “best-fitting” model by the AIC and BIC,
did, in fact, have adequate model fit. As previously described, adequate RMSEA values are those
less than 0.1. For the best-fitting model in table 15, it was shown that 55 of 77 items were
assigned to one attribute and 22 of the 77 items were assigned to two attributes. The higher
number of items assigned to one attribute could have explained the slightly better overall fit.
However, even though both the AIC and BIC suggested that this model fit better than all other 7,
8, 9, and 10-attribute models, the RMSEA values for all sets of models were greater than 0.1,
suggesting that, although convergence was achieved for these models, the models did not fit the
data well enough to be used for interpretation.
As the number of attributes in the model decreased, the model seemed to fit the data better. This trend was captured by the fit statistics in tables 15 and 16. However, the trend was not quite so clear when the models had 4, 5, or 6 attributes, as displayed in table 16. Although the RMSEA was lowest for the sets of 4- and 5-attribute models, the AIC and BIC were only lowest for the 4-attribute models. Each of the 5-attribute models had higher average AIC and BIC values than the models in the other sets. In general, the models seemed to discriminate similarly, although the 4-attribute models had a slightly higher ability to discriminate.
Based on these results, the RMSEA suggested that the best-fitting model was among the 4- or 5-attribute models. In fact, except for model 5 in the set of 6-attribute models (RMSEA = 0.10), all other models with six or more attributes fit the data quite poorly (RMSEA > 0.10). The AIC and BIC suggested that the best-fitting model was among the set of 4-attribute models. Moreover, the 4-attribute models were the only models in which the AIC and BIC were lower than the unidimensional model values.
Table 16. Fit Results for Unidimensional Model and Models with 4-6 Attributes

Model Fit for Tests, Items (i) = 77, Attributes (K) = 1
Model  One-Attribute  Two-Attribute  AIC      BIC      RMSEA  j/k*   DI**  Misfit (N)***
1      i = 77         i = 0          59764.7  60475.6  0      77     0     0

Model Fit for Tests, Items (i) = 77, Attributes (K) = 4
Model  One-Attribute  Two-Attribute  AIC      BIC      RMSEA  j/k    DI    Misfit (N)
1      i = 59         i = 18         58735.8  59557.5  0.01   23.75  0.41  13
2      i = 53         i = 24         58821.8  59675.8  0.01   25.5   0.41  13
3      i = 46         i = 31         58771.8  59563.4  0.01   27     0.39  13
4      i = 42         i = 35         58746.2  59626.4  0.03   28     0.4   13
5      i = 40         i = 37         58777.6  59582.4  0.06   28.75  0.38  12

Model Fit for Tests, Items (i) = 77, Attributes (K) = 5
Model  One-Attribute  Two-Attribute  AIC      BIC      RMSEA  j/k    DI    Misfit (N)
1      i = 55         i = 22         60921.2  61719.8  0.05   17.4   0.35  25
2      i = 49         i = 28         60310.9  61192.5  0.03   21.00  0.31  19
3      i = 48         i = 29         60566.3  61401.8  0.05   19     0.34  20
4      i = 44         i = 33         60070.7  61929.3  0.01   20     0.34  21
5      i = 37         i = 40         60062    61952.9  0.01   21.4   0.32  23

Model Fit for Tests, Items (i) = 77, Attributes (K) = 6
Model  One-Attribute  Two-Attribute  AIC      BIC      RMSEA  j/k    DI    Misfit (N)
1      i = 61         i = 16         59458.6  60308    0.21   15.6   0.39  36
2      i = 57         i = 20         59718.5  60586.3  0.13   16.17  0.37  31
3      i = 52         i = 25         59721.5  60612.4  0.13   17     0.37  26
4      i = 50         i = 27         59655.6  60574.2  0.14   18     0.38  28
5      i = 47         i = 27         59622.1  60563.8  0.1    18.83  0.37  25

Notes: *Items per attribute (j/k), calculated by summing the number of items assigned to each attribute and dividing by the number of attributes. **Discrimination index (DI); calculation described in Appendix F. ***Misfit, calculated by summing the number of items with RMSEA > 0.10.
The parameter estimates and the standard errors for the parameter estimates were averaged and compared across each model. Figure 5 shows the trends in these averages across the sets of models. The line in the figure should not be interpreted to demonstrate any relationship between the models; rather, it was included to show the increase in the average standard errors as more attributes were included in the model. A similar figure was used by Halpin and Kieffer (2015) in their application of latent class models to teacher performance scores. Clearly, the parameters were estimated less accurately as the number of attributes increased, with the 10-attribute model showing the highest standard errors. The unidimensional model had the most accurate estimates, with the 4-attribute model a close second. This finding supported the results from the model-fit comparisons in that the 4-attribute model appeared comparable to, and in many cases a better fit than, the unidimensional model. However, the other models did not fit the data as well.
Figure 5. Average Standard Errors of Intercept Parameter Estimates
In table 17, the average parameter estimates were summarized for the three top-
performing 4-attribute models based on the previously discussed fit statistics. Among these
models, there did not appear to be major differences in the parameter estimate distributions. Lower slopes indicated less of an influence of perceived support on the
probabilities of positive responses on items. The average slope parameters were fairly
reasonable across all three models. However, model 2 and model 3 each had a minimum slope
that turned out to be negative. This indicated either a poor item or a misspecification in the Q-
matrix for the item. Looking at the “best-fitting” model, the maximum intercept parameter on
items that measured more than one attribute was 5.25. This would imply that for this particular
item, perceiving support on none of the attributes would result in a probability of 0.99 of
answering positively to the item. In effect, nearly all respondents agreed that they received support on this item. Thus, the item contributed very little empirically to the model and was marked for review. No other item fit as poorly as this item; however, there were 11 other items flagged with intercepts approximately equal to or greater than 1. With the current model, an intercept parameter with the value of 1 implied a probability of 0.73 of a positive response to an item when none of the attributes had overall perceived support. Although there was no official guideline for an acceptable probability, this high probability meant that almost all respondents agreed, no matter their profile. In other words, it was easy to perceive support on these items, so they did not discriminate well between the different profiles. Moreover, the RMSEA value for each of these items was greater than 0.09, which indicated that these items did not fit. With 77 items and a type-I error rate set at α = 0.05, one would expect approximately 4 items to misfit.
substantive review of the items, there were no obvious reasons as to why the specific items fit the
model so poorly. Moreover, since there were so many quality items available, the model
specification process was further investigated by removing the most egregious items from the
analysis in terms of the values of the parameter estimates.
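The probabilities quoted above follow directly from the model’s logistic link, as the sketch below shows for the two intercepts discussed; this is arithmetic implied by the model, not additional output from the analysis.

    import math

    def baseline_probability(intercept):
        # Probability of a positive response when no attribute is perceived as
        # supported, so only the intercept contributes to the logit.
        return 1.0 / (1.0 + math.exp(-intercept))

    print(round(baseline_probability(1.00), 2))   # 0.73
    print(round(baseline_probability(5.25), 2))   # 0.99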
Table 17. Comparisons of 4-Attribute Model Average Parameter Estimate Distributions

Model 1 (77 items, 4 attributes; i = 59 with K = 1, i = 18 with K = 2)
Parameters                  μ       σ      min     max    N
One-attribute intercepts   -0.311   1.383  -3.340  2.841  59
One-attribute slopes        0.977   0.644   0.007  4.000  59
Two-attribute intercepts   -0.205   1.653  -1.786  5.251  18
Two-attribute slopes        0.711   0.698   0.001  4.000  36

Model 2 (77 items, 4 attributes; i = 46 with K = 1, i = 31 with K = 2)
Parameters                  μ       σ      min     max    N
One-attribute intercepts   -0.390   1.387  -3.245  2.487  46
One-attribute slopes        0.928   0.561   0.020  2.249  46
Two-attribute intercepts   -0.299   1.502  -2.215  5.310  31
Two-attribute slopes        0.703   0.651  -0.087  4.000  62

Model 3 (77 items, 4 attributes; i = 42 with K = 1, i = 35 with K = 2)
Parameters                  μ       σ      min     max    N
One-attribute intercepts   -0.305   1.392  -3.349  2.547  42
One-attribute slopes        0.994   0.576   0.072  2.952  42
Two-attribute intercepts   -0.191   1.672  -3.083  5.008  35
Two-attribute slopes        0.625   0.720  -0.038  4.000  70
Using the best-fitting 4-attribute model, all items with intercept parameters greater than 1 were flagged. Next, the item with the largest intercept (λi,0 = 5.25) was removed from the analysis.
Several different combinations of poorly fitting items were removed and models were compared
in terms of the resulting total number of mis-fitting items and the RMSEA. The final model had 70 items, with AIC = 54232.50 and BIC = 54980.30, both much lower than the values for any of the 77-item models. Moreover, the RMSEA = 0.01, which was equivalent to that of the best-fitting 77-item model.
The number of items per attribute in model 2 decreased, as anticipated. It was also
anticipated that the AIC and BIC would decrease substantially. Although the total number of poorly fitting items decreased only slightly, the items that did fit poorly were not nearly as
extreme in terms of how poorly they fit. This is evident by looking at the distribution of the
parameters in table 18. The maximum value for the intercept of items that loaded on one attribute
was 2.64 as opposed to 2.84 with the 77-item model. An even larger discrepancy occurred
between the maximum intercept value among items that loaded on two attributes. For the 77-
item model, the maximum value when K = 2 was 5.25. Although the maximum value was still
fairly high with the 70-item model (λi,0 = 2.42), it was notably lower.
Table 18. Parameter Distributions for Final Model

Model 1 (70 items, 4 attributes)
Parameters                  μ      σ     min    max   N
One-attribute intercepts   -0.25  1.39  -2.92  2.64  54
One-attribute slopes        1.04  0.72   0.02  3.77  54
Two-attribute intercepts    0.43  1.31  -1.97  2.41  16
Two-attribute slopes        0.69  0.45   0.11  1.59  32
Interestingly, the standard errors of the 70-item model were quite comparable to the
standard errors of the unidimensional model. This meant that the parameters were measured
nearly as accurately as they were with the unidimensional model. Among the items with the
lowest standard errors, the 4-attribute model actually had slightly lower standard errors than the
unidimensional model. Model 1 had the highest average standard error. Model 1 represented the
best-fitting 4-attribute, 77-item model. The average standard error was 0.097. Model 2 was the
revised 4-attribute, 70-item model, and it had an average standard error of 0.093. Finally, model
3 represented the unidimensional model and it had an average standard error of 0.091.
Based on the model fit indices and the parameter estimates, it was clear that the best-
fitting model was the 4-attribute, 70-item model. Having selected the best-fitting model, it could be applied to understanding teachers’ perceptions, and the substantive output from the model could be interpreted. Recall that the statistical purpose of cognitive diagnostic
modeling was to “…develop a multivariate profile of teachers’ perceptions based on classifying
them according to their degree of mastery on each of the traits” (Rupp & Templin, 2008, p. 226).
Thus, substantively, we were interested in attaining the detailed diagnostic profiles that promote
assessment for learning through modification in target areas (Jang, 2009). Based on the best
fitting model, the next step was to use teachers’ observed responses to estimate the person
parameters in order to understand the estimated diagnostic proportions of teachers in each of the
profiles.
To review, the initial procedure commenced with fitting models with different grain-
sizes. Grain sizes referred to the scope or level of specificity of attributes. A finer grain, for
example, could be adding and subtracting, whereas a coarser grain may be whole number
operations. It was found that the best-fitting model was a 4-attribute model with 70 items. Thus,
the diagnostic output from this model was subsequently selected to be used in learning about
teachers’ perceptions of policy implementation. The advantage of using the diagnostic model
was that instead of a continuous ability estimate, the diagnostic model estimated the probability
that a respondent had mastered each attribute (Rupp & Templin, 2008). In this study, instead of
“mastery,” the descriptor was actually “perceived support.” If that probability of perceived
support was greater than 0.5, the respondent was classified as having perceived support on the
attribute. For each respondent, a profile resulted in which perceived support or lack of perceived
support on each attribute was estimated. To clarify, in this application the attributes were the support mechanisms, and mastery indicated perceived support whereas non-mastery indicated a lack of perceived support.
Typically, attributes have been defined by content knowledge, cognitive skills, or mental
processes. In this new application, we were interested in modeling the perceptions of support for
policy implementation. Through this process, diagnostic modeling produced output for
respondents as a profile of perceived support and non-perceived support of the attributes.
Teachers were estimated to fit a specific profile based on the posterior probabilities that the
respondent perceived support on each mechanism in the latent profile results. Since the outcomes
of interest in this study were the latent categorical outcomes, the probabilities were not
presented, but the categorical interpretations of the probabilities were. Moreover, since this study
was a new application of this methodology, table 19 was provided as an example of the posterior probabilities of perceiving support on each attribute for five teachers, for didactic purposes.
Table 19. Example of Estimated Posterior Probabilities of Attribute Perceived Support By
Respondent
ID 1: Policy 2: Teacher 3: Leadership 4: Organization
9001 0.56 0.76 0.26 0.47
9002 0.28 0.29 0.21 0.26
9003 0.76 0.71 0.74 0.68
9004 0.08 0.47 0.21 0.16
9005 0.64 0.82 0.65 0.63
--- --- --- --- ---
Based on the probabilities in table 19, the software assigns a diagnostic, standards-based categorical outcome profile for each teacher. The resulting estimated profiles for each of the respondents, based on the probabilities in table 19, are included in table 20.
Table 20. Example of Estimated Latent Profiles Based on Posterior Probabilities By Respondent
ID 1: Policy 2: Teacher 3: Leadership 4: Organization
9001 1 1 0 0
9002 0 0 0 0
9003 1 1 1 1
9004 0 0 0 0
9005 1 1 1 1
--- --- --- --- ---
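The mapping from table 19 to table 20 is a simple 0.5 threshold on each posterior probability, as the sketch below reproduces:

    import numpy as np

    # Posterior probabilities for teachers 9001-9005 from table 19
    # (columns: Policy, Teacher, Leadership, Organization).
    posteriors = np.array([[0.56, 0.76, 0.26, 0.47],
                           [0.28, 0.29, 0.21, 0.26],
                           [0.76, 0.71, 0.74, 0.68],
                           [0.08, 0.47, 0.21, 0.16],
                           [0.64, 0.82, 0.65, 0.63]])
    profiles = (posteriors > 0.5).astype(int)   # reproduces the profiles in table 20
    print(profiles)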
The estimated posterior probabilities were functions of item response patterns in addition to the
estimated base rates. In the example, the teacher with ID 9001 had a higher model-estimated
probability of perceiving support on support mechanisms regarding the “characteristics of
teachers” than teacher 9002. The latent categorical outcome of teacher 9001 would be 1100
because the probabilities of this teacher perceiving support on attributes 1 (Policy) and 2 (Teacher) were greater than the a priori 0.5 threshold, and the probabilities of this teacher perceiving support on attributes 3 and 4 were less than 0.5. In addition, the responses provided
by teacher 9005 were more representative of a fully supported teacher in regards to policy
implementation.
The distribution of the skill pattern profiles for the best-fitting 4-attribute model was
summarized in table 21. These proportions were estimated using a log-linear structural model.
The proportion of respondents in each profile is a model parameter to be estimated. The log-
linear model reduces the number of parameters. In particular, some profiles have low proportions
and are therefore difficult to estimate. Although it is not clear from the literature whether to use a probability-based or a log-linear model, one of the advantages of using the log-linear model to represent the C-RUM is that it can be used to identify a suitable model by placing parameter restrictions within a very flexible general model (Rupp, Templin, & Henson, 2007). The log-linear representation of the C-RUM models the conditional probability that a respondent with a specific attribute profile (as depicted in table 21) provides a positive response to the item. For a Q-matrix with 4 attributes, the first set of elements in the log-linear equation represents the 4 main-effect parameters: the first element corresponds to the main effect for attribute 1 if attribute 1 is measured by the item, the second element corresponds to the main effect for attribute 2 if attribute 2 is measured by the item, and so on for each attribute in the model. Once all 4 attribute main effects are accounted for, the second set of elements accounts for the two-way interactions (Rupp, Templin, & Henson, 2007).
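A minimal sketch of the main-effects portion of this log-linear representation is shown below; the intercept and slope values are hypothetical, and the two-way interaction terms mentioned above are omitted for brevity.

    import math
    from itertools import product

    def crum_probability(profile, q_row, intercept, main_effects):
        # logit = intercept + sum of main effects for attributes that are both
        # measured by the item (q = 1) and perceived as supported (alpha = 1).
        logit = intercept + sum(b * q * a for b, q, a
                                in zip(main_effects, q_row, profile))
        return 1.0 / (1.0 + math.exp(-logit))

    q_row = (1, 0, 1, 0)   # hypothetical item measuring attributes 1 and 3
    for profile in product((0, 1), repeat=4):
        p = crum_probability(profile, q_row, -0.3, (0.9, 0.5, 0.7, 0.4))
        print(profile, round(p, 2))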
The model-estimated proportions were contrasted with the probability-based proportions in table 21; the larger differences occurred in the less frequent profiles.
Table 21. Distribution of Diagnostic Categorical Profiles for 4-Attribute Model
Profile Log-Linear Model Probability-Based
Policy Teacher Leadership Organization Percent (%) N Percent (%) N
p1: 0 0 0 0 16.20 121 16.73 125
p2: 1 0 0 0 1.20 9 3.61 27
p3: 0 1 0 0 23.43 175 21.02 157
p4: 1 1 0 0 16.29 47 23.29 174
p5: 0 0 1 0 0.80 6 0.13 1
p6: 1 0 1 0 3.35 25 1.61 12
p7: 0 1 1 0 0.54 4 0.00 0
p8: 1 1 1 0 7.50 56 4.28 32
p9: 0 0 0 1 0.00 0 0.00 0
p10: 1 0 0 1 0.27 2 0.13 1
p11: 0 1 0 1 0.00 0 0.00 0
p12: 1 1 0 1 3.48 26 4.28 32
p13: 0 0 1 1 0.00 0 0.00 0
p14: 1 0 1 1 2.54 19 1.47 11
p15: 0 1 1 1 0.00 0 0.00 0
p16: 1 1 1 1 24.40 257 23.43 175
Table 21 showed that the model estimated that 8 profiles accounted for most of the
teachers. Moreover, the perceptions of approximately one quarter of the total number of
teachers (N = 747) placed them in profile 16. Profile 16 represented the group of teachers
who were estimated to have perceived support on each of the four attributes. Profile 1
(16.20%) represented the group of teachers who were estimated to not have perceived
support on any of the attributes based on their responses. The next most common profile
outcome was profile 3 (23.43%), which represented perceived support of the
characteristics of teachers, but no other attribute. There were three profiles that had no
teachers.
The profile distributions and the proportions of teachers perceiving support on each attribute were compared between the log-linear model and the estimates for individuals. More specifically, based on each participant’s probability of perceiving support on an attribute, they were assigned a profile: a participant was assigned a 1 for an attribute if they had a greater than 0.5 probability of perceiving support on that attribute. The proportions of these assignments were compared to the proportions estimated by the log-linear model. Thresholding the probabilities in this way distorted the distribution relative to the log-linear model estimates; for example, respondents with only a 0.51 probability of perceiving support on an attribute were still categorized as having “perceived support” on the attribute.
Based on the single-group total distribution, it appeared that the majority of
teachers perceived support on characteristics of teachers because approximately 75% of
teachers belonged to a profile that indicated this (see table 22). This was found by
summing the total number of teachers that belonged to a profile that indicated perceived
support for this attribute. Mdltm provided the proportion of teachers perceiving support
on each attribute by counting them directly, not from a log-linear model. Recall that this
attribute was measured by items that focused on teachers’ confidence in their own skills
related to the policy implementation. Furthermore, almost 60% of teachers were estimated to belong to profiles with perceived support on characteristics of the policy. This indicated that the model estimated that more teachers perceived support from characteristics of the actual policy than did not. Finally, only
49% of teachers were estimated to be in profiles with perceived support in characteristics
of leadership and 41% of teachers were estimated to be in profiles with support related to
the characteristics of the organizations.
Table 22. Percentage of Teachers in Profiles with Perceived Support by Attribute
          Policy        Teacher       Leadership    Organization
          %      N      %      N      %      N      %      N
          59.03  441    75.64  565    49.13  367    40.69  304
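The N values in table 22 match what one obtains by marginalizing the log-linear profile counts in table 21; the sketch below verifies this arithmetic.

    # Log-linear model counts from table 21; profile keys are
    # (Policy, Teacher, Leadership, Organization).
    profiles = {(0,0,0,0): 121, (1,0,0,0): 9,  (0,1,0,0): 175, (1,1,0,0): 47,
                (0,0,1,0): 6,   (1,0,1,0): 25, (0,1,1,0): 4,   (1,1,1,0): 56,
                (0,0,0,1): 0,   (1,0,0,1): 2,  (0,1,0,1): 0,   (1,1,0,1): 26,
                (0,0,1,1): 0,   (1,0,1,1): 19, (0,1,1,1): 0,   (1,1,1,1): 257}
    for k, name in enumerate(("Policy", "Teacher", "Leadership", "Organization")):
        n = sum(count for profile, count in profiles.items() if profile[k] == 1)
        print(name, n)   # prints 441, 565, 367, 304, matching table 22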
From tables 21 and 22, one can infer that the model estimated that many teachers perceived support related to characteristics of the policy. Moreover, the model estimated that teachers were generally confident in their abilities to meet the guidelines outlined by the policy. However, the lower proportions for characteristics of the leadership and the organization indicated that teachers perceived less support in these areas. Although there were differences among the models with multiple groups in terms of the proportions of teachers belonging to profiles with perceived support on each attribute, these differences were quite minimal.
Research Question 2: Exploring Group Comparisons Using the Diagnostic Model
One of the advantages of the general diagnostic model as implemented in mdltm is to
concurrently estimate item parameters for several subpopulations or subgroups. In the case of
teachers’ perceptions of policy implementation support, the population includes several different
subgroups, for example, grade level, subject taught, and career status. Since this is a new
application, it is necessary to explore differential group estimation. Thus, for the second research
question, the ability of the diagnostic model to estimate group differences was explored.
Specifically, the following research questions were asked:
2. Are there group differences in the diagnostic model fit based on grade level,
subject taught, and career status?
a. What are the group differences of the estimated latent class profile
distributions based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of
parameter estimates based on grade level, subject taught, and career
status?
As previously discussed, running a model for each group separately would result in the item
parameters being on different scales. Thus, they would not be directly comparable. Using a
multiple-group multinomial log-linear mixture model (Xu and Von Davier, 2008) puts the
groups on the same scale to allow for proper group comparisons. The groups being compared in
this study were selected based on logistical organizational considerations. For example, as
previously discussed, the vision was that the application of these models could be used to organize for policy implementation. School leaders could use this type of information to organize teachers into groups based on their latent profiles and target the appropriate resources to those groups. Although the data contained numerous grouping variables, school districts were most inclined to group teachers based on grade level, subject taught, or career status. This was true for many reasons, but most importantly because of logistical constraints on how schools are organized.
Teachers in the same grade levels and who teach the same subjects likely have more similar
teaching schedules, teaching strategies, or teaching philosophies. Thus, research question 2
explored the application of mixture models defined with these multiple groups.
For this analysis, grade level was split into three categories including elementary, middle,
and high school. For subject taught, teachers either taught a STEM subject or did not. STEM subjects included science, technology, or math-related subjects. Those teachers who taught both STEM and non-STEM subjects were classified as STEM teachers. Career status was early career and
experienced teachers. Early career teachers were those teachers within the first five years of their
careers, whereas experienced teachers were those teachers with more than five years of teaching
experience. As previously discussed, many grouping variables could have been explored with
cognitive diagnostic modeling. However, none of the other available grouping variables aligned with the practical goals of this study as well as grade level, subject taught, and career status did. Recall that the goal of using cognitive diagnostic modeling is to retrieve categorical outcomes that inform standards-based targeting of resources for policy implementation. The group variables in this study allow specific differential supports to be conveniently targeted by group, in the organizational sense, should that prove necessary.
Five models, with G = 1, 2, 2, 3, and 6 groups, were used to analyze the data. When G = 1, the model was
called a single-group model. This is the best-fitting model from research question 1 where no
sub-group differences were accounted for. When G = 2, the two groups, S1 = STEM and S2 =
non-STEM were identified by the subject variable. For the second model where G = 2, the two
groups, E1 = early career and E2 = experienced were identified by the career-status variable. For
the case of G = 3, the grade level variable was used as the indicator of the subgroups L1 =
elementary, L2 = middle, and L3 = high school. When G = 6, a complete factorial design of
grade level and subject was defined and used as the indicator variable for the subgroups. These
five models were compared using the following: model fit, item parameter estimates, and the marginal distribution of each latent random variable.
First, the mdltm software was used to run the multiple group multinomial log-linear models for multidimensional skill distributions in the general diagnostic model (Xu and Von Davier, 2008). The distributions of the latent classes from the analysis are shown in table 23. The results showed that the marginal distributions resulting from these models were fairly similar. The most notable discrepancy was between profiles 8 and 16 across the models. For model 1, the largest proportion of teachers was estimated to fall into profile 8; however, for every other model, profile 16 was the most common profile. Profile 8 included perceived
support on all attributes except for characteristics of the organization. Profile 16 included
perceived support of all attributes. Thus, the only difference between the definitions of these
profiles was whether or not teachers were estimated to have perceived support on characteristics
of the organization. It should be noted that the proportions are estimated across all groups and
profiles in the model, and not for each group separately. Thus, the columns in table 23 do not
sum to 1, but the set of columns for each model sum to 1.
Table 23. Estimated Group Proportions by Profile
Profile Single | Model 1: L1 L2 L3 | Model 2: S1 S2 | Model 3: E1 E2 | Model 4: L1_S1 L1_S2 L2_S1 L2_S2 L3_S1 L3_S2
p1: 0000 0.16 0.07 0.04 0.05 0.09 0.06 0.07 0.08 0.03 0.03 0.03 0.01 0.04 0.01
p2: 1000 0.01 0.02 0.01 0.01 0.00 0.05 0.03 0.01 0.00 0.04 0.00 0.01 0.01 0.00
p3: 0100 0.23 0.07 0.07 0.06 0.09 0.05 0.11 0.06 0.05 0.02 0.02 0.01 0.05 0.01
p4: 1100 0.16 0.02 0.05 0.04 0.06 0.08 0.06 0.04 0.02 0.06 0.07 0.00 0.03 0.03
p5: 0010 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p6: 1010 0.03 0.03 0.00 0.01 0.01 0.01 0.05 0.00 0.01 0.01 0.00 0.00 0.01 0.01
p7: 0110 0.01 0.02 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p8: 1110 0.08 0.18 0.02 0.02 0.01 0.02 0.13 0.00 0.08 0.00 0.01 0.01 0.02 0.00
p9: 0001 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p10: 1001 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00
p11: 0101 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p12: 1101 0.03 0.00 0.01 0.00 0.05 0.01 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00
p13: 0011 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p14: 1011 0.03 0.01 0.01 0.01 0.03 0.01 0.01 0.02 0.00 0.01 0.01 0.00 0.00 0.00
p15: 0111 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
p16: 1111 0.24 0.03 0.05 0.07 0.23 0.10 0.12 0.16 0.01 0.08 0.04 0.02 0.05 0.01
Notes: Single refers to single-group model;
Model 1: L1 refers to elementary teachers, L2 refers to middle school teachers, L3 refers to high-school teachers;
Model 2: S1 refers to STEM teachers, S2 refers to non-STEM teachers;
Model 3: E1 refers to early-career teachers, E2 refers to experienced teachers;
Model 4: Factorial design uses conventions from previous models
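To make the normalization property noted above concrete, the following is a minimal sketch (Python) using a toy group-by-profile matrix rather than the actual estimates from table 23:

    import numpy as np

    # probs[c, g] = estimated joint proportion of profile c and group g for
    # one multi-group model. Values here are hypothetical placeholders.
    probs = np.array([
        [0.30, 0.20],   # toy profile 1: group 1, group 2
        [0.10, 0.14],   # toy profile 2
        [0.06, 0.20],   # toy profile 3
    ])

    assert np.isclose(probs.sum(), 1.0)  # all cells for one model sum to 1
    print(probs.sum(axis=0))             # per-group columns need not sum to 1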
The attributes were positively and significantly correlated. The moderate size of the correlations in table 24, in conjunction with their statistical significance, suggested that each attribute did, in fact, capture a different component of policy implementation support. The attribute defined by characteristics of leadership was significantly correlated with the attribute defined by characteristics of the organization. Although they were different attributes, the organizational and leadership mechanisms were correlated much more strongly than any other attribute pair.
Table 24. Correlations Between Attributes
1 2 3 4
1: Policy 1.000
2: Teacher 0.286* 1.000
3: Leadership 0.422* 0.292* 1.000
4: Organization 0.391* 0.320* 0.783** 1.000
Notes: * correlation significantly greater than zero at α = 0.05;
** correlation significantly greater than zero at α = 0.001.
Table 25 provided an overview of the AIC and BIC model-fit statistics. As previously discussed, a model can be said to fit better when it results in smaller AIC and BIC fit indices. Compared to the single-group analysis (when G = 1), none of the multiple-group models improved the overall fit, which implied that there were no overall significant differences between the groups examined in this study. Traditionally, model fit indices are used to determine the best-fitting model and only that model is interpreted. However, because this was a new application for cognitive diagnostic models with both psychometric and policy implications, each of the proposed multi-group models was examined in terms of parameter estimates and marginal distributions.
Table 25. Fit Statistic Comparisons Between Models
Groups Description AIC BIC
G = 1 Single Group 53971.85 54098.32
G = 2 Subject 54002.22 55507.06
G = 2 Status 54232.50 55476.68
G = 3 Level 54171.96 56439.10
G = 6 Subject*Level 54291.08 58805.59
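For reference, both indices penalize the maximized marginal log-likelihood by the number of free parameters, which is why the multi-group models, whose item-parameter counts multiply with the number of groups, fare worse on BIC in particular. A minimal sketch (Python) of how such values are computed, with hypothetical inputs rather than the actual log-likelihoods from this analysis:

    import math

    def aic(log_lik: float, n_params: int) -> float:
        # AIC = -2 * log-likelihood + 2 * number of free parameters
        return -2.0 * log_lik + 2.0 * n_params

    def bic(log_lik: float, n_params: int, n_obs: int) -> float:
        # BIC = -2 * log-likelihood + ln(n) * number of free parameters
        return -2.0 * log_lik + math.log(n_obs) * n_params

    # Hypothetical values for illustration only.
    ll, p, n = -26850.0, 136, 747
    print(aic(ll, p), bic(ll, p, n))

Because BIC's per-parameter penalty of ln(n) exceeds AIC's penalty of 2 for any realistic sample size, the gap between the single-group and six-group models is much larger for BIC than for AIC, consistent with table 25.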
Following the model fit comparisons, the distributions of parameter estimates in each model were compared to understand the differences between the models. It was very clear that the standard errors for the intercept-parameter estimates increased as more groups were added to the model. Thus, as more groups were added, the estimates became less precise. In table 26, the intercept-parameter distributions were summarized separately, for each model, for items assigned to one attribute and items assigned to two attributes. There was almost no difference in the estimated average intercept parameter between experienced teachers and early-career teachers. This suggested that, when there was no perceived support on any of the support mechanisms, the probabilities that teachers responded positively to a given item were similar for these groups. Moreover, the average intercept parameter in the teacher subject model (G = 2) was lower for STEM teachers than for non-STEM teachers. This dynamic held true in the factorial design model (G = 6), where the average intercept parameters were lower for STEM teachers than non-STEM teachers at all three grade levels: elementary, middle, and high school. Thus, respondents who had the same profile but belonged to different groups had different probabilities of answering positively to the items. Differences in intercept means are akin to overall group differences; thus, STEM teachers overall perceived less support than non-STEM teachers, especially at the elementary and middle school levels. This could indicate that STEM teachers may require or expect more support than non-STEM teachers, or it could be that they actually received less support. The highest intercept estimate was for the group of elementary, non-STEM teachers, for which λi,0 = -0.03; this corresponded to an average probability of 0.49 for a correct response. Conversely, the lowest estimate was for the group of high-school STEM teachers, for which λi,0 = -0.94, corresponding to an average probability of a correct response of 0.28. Although almost every parameter distribution had a few relatively extreme maximum values, the overall frequency of extreme estimates was not an issue. Item text was reviewed and, in each instance, the items were determined to be of substantive significance to the study.
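These probabilities follow directly from the logistic transform of the intercept: for a respondent with no perceived support on any attribute, the C-RUM logit reduces to the intercept alone. A minimal sketch (Python) reproducing the two conversions above:

    import math

    def intercept_to_prob(lam0: float) -> float:
        # Probability of a correct (positive) response for a respondent with
        # no perceived support on any attribute: logistic(lambda_i0).
        return 1.0 / (1.0 + math.exp(-lam0))

    print(round(intercept_to_prob(-0.03), 2))  # 0.49, elementary non-STEM
    print(round(intercept_to_prob(-0.94), 2))  # 0.28, high-school STEM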
Table 26. Comparisons of Intercept Estimate Distributions of Single vs. Multi-Group Models
Distributions of Intercepts
Model Groups Intercept type Mean S.E. Min Max
G = 1 Single One-attribute intercept -0.25 0.10 -2.92 3.78
Two-attribute intercept -0.43 0.09 -1.97 2.41
G = 2 1: STEM One-attribute intercept -0.80 0.13 -6.21 1.43
Two-attribute intercept -0.20 0.12 -4.24 3.44
2: non-STEM One-attribute intercept -0.10 0.17 -6.21 3.70
Two-attribute intercept -0.12 0.14 -4.24 3.44
G = 2 1: early career One-attribute intercept -0.55 0.13 -6.65 3.92
Two-attribute intercept -0.08 0.14 -1.99 3.83
2: experienced One-attribute intercept -0.58 0.16 -4.15 3.32
Two-attribute intercept -0.14 0.15 -4.50 3.05
G = 3 1: elementary One-attribute intercept -0.35 0.16 -6.13 3.96
Two-attribute intercept -0.45 0.14 -1.97 2.67
2: middle One-attribute intercept -0.22 0.21 -7.01 3.76
Two-attribute intercept -0.22 0.18 -1.15 3.47
3: high One-attribute intercept -0.45 0.19 -6.14 4.19
Two-attribute intercept -0.32 0.18 -1.59 1.10
G = 6 1: elementary/STEM One-attribute intercept -0.57 0.23 -5.87 2.19
Two-attribute intercept -0.40 0.22 -2.38 3.40
2: elementary/Non-STEM One-attribute intercept -0.03 0.21 -3.61 2.82
Two-attribute intercept -0.46 0.21 -2.14 2.66
3: middle/STEM One-attribute intercept -0.50 0.24 -6.69 3.42
Two-attribute intercept -0.37 0.23 -3.30 3.77
4: middle/Non-STEM One-attribute intercept -0.21 0.39 -5.98 1.58
Two-attribute intercept -0.03 0.41 -7.08 3.87
5: high/STEM One-attribute intercept -0.94 0.23 -3.81 1.96
Two-attribute intercept -0.58 0.21 -2.29 3.46
6: high/Non-STEM One-attribute intercept -0.77 0.48 -4.99 1.29
Two-attribute intercept -0.10 0.42 -1.73 3.23
After building a general understanding of the distributions of intercept estimates, the next
step was to explore differences across groups on the actual individual parameter estimates. To do
this, the 95% confidence intervals for all items’ intercept-parameter estimates were plotted using
Stata 13 by group for each model. The intervals were compared to determine whether there were
statistically significant differences between groups on each parameter estimate in each model.
There were no formal hypothesis tests conducted for differences in parameter estimates because the theoretical distributions of the parameters were not known. Thus, examining the overlap of confidence intervals served as an approximation to a test of statistical significance. As previously discussed, when there was no perceived support on any attribute, STEM teachers had lower average estimated intercepts than non-STEM teachers. This was true for items assigned to one attribute and also for items assigned to two attributes. However, figure 6 showed that in the 6-group factorial design model, group 1 (elementary STEM teachers) was significantly higher than group 2 (elementary non-STEM teachers) on item 22; thus, the null hypothesis that this intercept estimate was equal across groups was rejected. This particular item asked teachers to rate the extent of confidence they had regarding abilities related to the policy implementation process. Specifically, it asked them to rate the extent to which they felt confident in understanding the teacher evaluation standards described by GUPSECT. In addition to the significant difference between STEM and non-STEM teachers at the elementary level, group 5 (high-school STEM teachers) perceived a significantly higher degree of support on this item than group 6 (high-school non-STEM teachers). However, there was no difference between group 3 (middle school STEM teachers) and group 4 (middle school non-STEM teachers). Consistent with this finding, the right side of figure 6 also showed that in the 2-group model comparing STEM and non-STEM teachers, STEM teachers were again significantly higher on this item, similar to the finding in the 6-group model. Thus, when no support attributes had perceived support, STEM teachers and non-STEM teachers were estimated to perceive a significantly different level of confidence in understanding the teacher evaluation standards, whether or not grade level was accounted for in the model.
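A minimal sketch (Python) of the interval-overlap screen described above; the estimates and standard errors shown are hypothetical placeholders, not values from this study:

    # Approximate a 95% CI as estimate +/- 1.96 * SE and flag non-overlap
    # between two groups' estimates of the same item parameter.
    def ci(est: float, se: float, z: float = 1.96):
        return est - z * se, est + z * se

    def intervals_overlap(a, b) -> bool:
        return a[0] <= b[1] and b[0] <= a[1]

    g1 = ci(est=-0.05, se=0.18)  # e.g., elementary STEM (hypothetical)
    g2 = ci(est=-0.95, se=0.22)  # e.g., elementary non-STEM (hypothetical)
    print("significant difference:", not intervals_overlap(g1, g2))

Note that non-overlap of two 95% intervals is a conservative screen: it implies a difference at α = 0.05, but overlapping intervals do not necessarily imply the absence of one.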
Legend for G = 6: 1= Elementary/STEM; 2 = Elementary/non-STEM; 3 = Middle/STEM; 4 =
Middle/non-STEM; 5 = High/STEM; 6 = High/non-STEM
Legend for G = 2: 1= STEM; 2 = non-STEM.
Figure 6. Comparisons of Confidence Intervals for Item 22 Intercept Estimates by Group
This trend was found for two other items that were very similar in content. In fact, all three items loaded onto the 4-attribute coarse-grained factor "characteristics of teachers." They also loaded onto the 10-attribute fine-grained factor "teacher confidence in abilities relating to policy." Figure 7 showed the findings from the 6-group model and the 2-group model for the item that asked teachers to rate the extent to which they felt confident in understanding the measures of their teaching defined by GUPSECT. On this item, when G = 6, elementary STEM teachers were significantly higher than all other groups except middle school STEM teachers. Middle school STEM teachers were significantly higher than elementary and middle school non-STEM teachers. Moreover, high school STEM and non-STEM teachers were both significantly higher than elementary non-STEM and middle school non-STEM teachers. When G = 2, STEM teachers were clearly significantly higher than non-STEM teachers on this item. Thus, in this model there were many differences among the various groups, but the same general trend was
found for this item as was found for item 22. The implications of this are discussed in detail in
chapter 5.
Legend for G = 6: 1= Elementary/STEM; 2 = Elementary/non-STEM; 3 = Middle/STEM; 4 =
Middle/non-STEM; 5 = High/STEM; 6 = High/non-STEM
Legend for G = 2: 1= STEM; 2 = non-STEM.
Figure 7. Comparisons of Confidence Intervals for Item 23 Intercept Estimates by Group
Finally, item 28 followed the same trend as the two previously discussed items. This item asked teachers to rate the extent to which they felt confident in using the evaluation to inform their teaching. When G = 6, elementary STEM teachers were estimated to perceive a significantly higher degree of confidence than all other groups of teachers. Moreover, middle school non-STEM teachers were estimated to perceive a significantly lower degree of confidence than all other groups of teachers. When G = 2, it was clear that STEM teachers were estimated to have a significantly higher degree of confidence than non-STEM teachers, which was consistent with the trend found for the other two items. Despite this logical, substantive interpretation, with α = 0.05 and roughly 70 item-level comparisons, one would expect about three significant differences (70 × 0.05 = 3.5) by chance alone, which provides an alternative statistical explanation for this finding. Thus, further investigation into these support mechanisms would be necessary.
Legend for G = 6: 1= Elementary/STEM; 2 = Elementary/non-STEM; 3 = Middle/STEM; 4 =
Middle/non-STEM; 5 = High/STEM; 6 = High/non-STEM
Legend for G = 2: 1= STEM; 2 = non-STEM.
Figure 8. Comparisons of Confidence Intervals for Item 28 Intercept Estimates by Group
Next, the slope-parameter main effect estimates were investigated. The slope parameters captured the influence, in logits, of perceived support on an attribute on the probability of a correct response to an item. The distributions of the slopes were summarized in table 27 by model and by the number of slope parameters assigned to each item (1 or 2). Similar to the intercepts, the average standard errors increased as more groups were included in the models. However, the average slope estimates for items assigned to one attribute were higher than those for items assigned to two attributes, the opposite of what was observed for the intercept estimates in table 26.
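For orientation, the intercepts and slopes discussed here combine on the logit scale. Under the standard log-linear form of the C-RUM, the probability of a correct response to item i for a respondent with attribute profile α = (α1, …, αK) is

    P(X_i = 1 | α) = exp(λ_{i,0} + Σ_k λ_{i,k} q_{ik} α_k) / [1 + exp(λ_{i,0} + Σ_k λ_{i,k} q_{ik} α_k)]

where q_{ik} is the Q-matrix entry assigning item i to attribute k, λ_{i,0} is the item intercept, and λ_{i,k} is the main-effect slope for attribute k. Attributes with perceived support thus contribute additively, and compensatorily, to the logit.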
Table 27. Comparisons of Slope Estimate Distributions of Single vs. Multi-Group Models
Distributions of Slopes
Model Groups Slope type Mean S.E. Min Max
G = 1 Single One-attribute slope 1.04 0.10 0.02 4.00
Two-attribute slope 0.69 0.09 0.11 1.59
G = 2
1: STEM One-attribute slope 1.12 0.12 0.18 4.12
Two-attribute slope 1.05 0.13 0.40 2.92
2: non-STEM One-attribute slope 1.10 0.16 0.28 4.14
Two-attribute slope 1.04 0.17 0.27 3.03
G = 2 1: early career One-attribute slope 1.12 0.13 0.43 4.12
Two-attribute slope 0.79 0.13 0.05 4.08
2: experienced One-attribute slope 1.17 0.16 0.00 4.11
Two-attribute slope 0.71 0.15 0.03 4.11
G = 3
1: elementary One-attribute slope 1.08 0.16 0.02 4.05
Two-attribute slope 0.62 0.14 0.16 2.02
2: middle One-attribute slope 0.93 0.21 0.41 4.05
Two-attribute slope 0.83 0.18 0.02 4.05
3: high One-attribute slope 0.99 0.19 0.32 4.09
Two-attribute slope 0.44 0.18 0.08 1.52
G = 6
1: elementary/STEM One-attribute slope 1.16 0.23 0.01 4.03
Two-attribute slope 0.81 0.22 0.23 2.27
2: elementary/Non-STEM One-attribute slope 1.10 0.21 0.24 4.20
Two-attribute slope 0.75 0.21 0.03 2.54
3: middle/STEM One-attribute slope 1.10 0.24 0.33 4.03
Two-attribute slope 0.78 0.22 0.12 2.52
4: middle/Non-STEM One-attribute slope 1.05 0.39 0.22 4.01
Two-attribute slope 0.69 0.40 0.02 2.19
5: high/STEM One-attribute slope 1.09 0.23 0.21 3.09
Two-attribute slope 0.91 0.21 0.03 1.51
6: high/Non-STEM One-attribute slope 1.00 0.48 0.19 2.64
Two-attribute slope 0.96 0.41 0.09 3.14
Similar to the procedure with the intercept estimates, the 95% confidence intervals of the slope estimates were plotted for each model, group, and item. There were very few significant differences between groups on items. In fact, there were no significant differences between STEM and non-STEM teachers, or between experienced and early-career teachers. This finding indicated that, generally, attributes with perceived support had similar influences on teachers' probability of a correct response to items across grade level, subject, and career status. However, there was an important group of items for which there were significant differences across grade levels. Figure 9 showed the confidence intervals for four different items. Item 55 asked teachers to rate the extent to which they agreed that principals collected adequate evidence to evaluate their teaching. The graph displayed confidence intervals by elementary, middle, and high school groups. Both elementary and middle school teachers were estimated significantly higher than high school teachers. This item loaded onto the coarse-grained "characteristics of leadership" attribute and the finer-grained "principal legitimacy" attribute. Differences in slope means indicated different strengths of relationship between an attribute and the probability of a positive perception. Thus, the significant differences between elementary and middle school teachers' slope parameters and those of high school teachers indicated differences in the strength of the relationship between the leadership attribute and the probability of a correct response on this item. This same trend was observed for item 57, which was assigned to the same attributes as item 55. It asked teachers to rate the extent to which their principals had the knowledge and skills to evaluate their teaching. Elementary and middle school teachers' estimates were both significantly higher than high school teachers'. Thus, when teachers perceived support on the leadership attribute, this support had a significantly weaker effect on high school teachers' probability of indicating that their principal had the required knowledge and skills. Similar to the above inference, this may have indicated that high school teachers valued principal knowledge and skills less than other teachers in terms of support mechanisms regarding policy implementation characteristics of leadership.
Item 45 asked teachers the extent to which they agreed that the professional development they received on teacher evaluation was useful. This item was assigned in the Q-matrix to the attribute "characteristics of leadership." In the 10-attribute model, it was assigned to "quality of professional development." Figure 9 showed the confidence intervals by grade level. Middle school teachers were statistically significantly higher than elementary and high school teachers. Moreover, elementary teachers were significantly higher than high school teachers. This indicated that perceived support on the leadership attribute had more influence on the probability of middle school teachers indicating support on this item than of elementary and high school teachers. Finally, item 51 asked teachers to rate the extent to which evaluation feedback informed their professional development selection. Elementary school teachers were estimated significantly higher than middle and high school teachers. This indicated that perceived support on the leadership attribute had less of an influence on the logit of a correct response to this item for middle and high school teachers. However, with a type 1 error rate of 0.05, the possibility remained that these differences were actually type-1 errors. Thus, future investigation into these support mechanisms is necessary.
Item 55    Item 57
Item 45    Item 51
Legend for G = 3: 1= Elementary School; 2 = Middle School; 3 = High School.
Figure 9. Comparisons of Confidence Intervals for Four Item Slope Estimates by Grade Level
CHAPTER 5
IMPLICATIONS
The purpose of this study was to explore a new way to measure teachers’ perceptions of
school intra-organizational mechanisms for supporting teacher evaluation policy implementation.
Cognitive diagnostic models have not previously been applied to policy implementation support
constructs. The categorical, standards-based diagnostic output from the analysis in this study was
anticipated to provide detailed empirical information about teachers’ perceptions of support. It
was assumed that more precise diagnostic feedback would be beneficial to policy makers and
school leaders in identifying strengths and weaknesses and in targeting resources in the policy
implementation process. When equipped with more precise diagnostic feedback, policy makers
and school leaders may be able to more confidently engage in empirical decision making,
especially in regard to targeting resources for short-term and long-term organizational goals
subsumed within the policy implementation initiative. Specifically, the following research
questions were addressed:
1. Can cognitive diagnostic models be applied to understanding teachers’ perceptions of
intra-organizational mechanisms for supporting policy implementation?
a. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms fit the data better than models specifying
coarser grained attributes?
b. Do diagnostic models specifying finer-grained attributes representing
implementation support mechanisms have more accurate parameter estimates than
models specifying coarser grained attributes?
c. What are the posterior estimated proportions of teachers that are diagnosed to fall
under each latent profile?
2. Are there group differences in the diagnostic model fit based on grade level, subject
taught, and career status?
a. What are the group differences of the estimated latent class profile
distributions based on grade level, subject taught, and career status?
b. What are the group differences in diagnostic model estimations of parameter
estimates based on grade level, subject taught, and career status?
Discussion on Research Question 1a
Cognitive diagnostic models can undoubtedly be useful in their application to
understanding teachers’ perceptions of intra-organizational mechanisms for supporting policy
implementation. The first over-arching question was answered by attaining results that were
both statistically advantageous and substantively useful. The entire set of 4-attribute models had
AIC and BIC values lower than the unidimensional model values. This finding solidified the
application of cognitive diagnostic models to understanding intra-organizational mechanisms for
supporting policy implementation. Statistically, since the best-fitting cognitive diagnostic model
fit the data better than the unidimensional model, it was only logical to go with the diagnostic
model since it would, in fact, provide more nuanced substantive insights into the multiple
hypothesized dimensions of policy implementation. Moreover, the diagnostic model
measurements were actually much more precise, since they accounted for complex loading
structures and more nuanced relationships between the items and attributes than a
unidimensional model. Thus, with a more complete, and nuanced understanding of the
relationships of how teachers perceived these underlying dimensions, including characteristics of
the policy, teachers, leadership, and the organization, states and school districts would be much
more prepared to improve the implementation process than with just an overall total score of
support or the results from a simple unidimensional model. For example, using only the single-
126
group, 4-attribute model, district leaders can identify 16 separate latent classes of teachers who
perceive the implementation differently. They can narrow these 16 classes even further, since
some profiles had very low proportions and most teachers were estimated to fall into one of eight
the classes. In planning district-wide professional development, district leaders would be able to
group teachers based on the type of support they indicate they were lacking. Additionally, district
leaders could potentially identify groups of teacher-leaders who indicated they received support
and use follow-up studies to better understand what made teachers perceive this level of support.
Such strategies could be very useful in facilitating the diffusion of support in an organization.
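As an illustration of how the 16 classes arise, each latent profile is simply a binary vector over the four attributes. A minimal sketch (Python) enumerating and labeling the profiles, using the attribute order from table 23 (policy, teacher, leadership, organization):

    from itertools import product

    ATTRIBUTES = ("policy", "teacher", "leadership", "organization")

    # Enumerate all 2^4 = 16 latent profiles. Reversing each tuple makes the
    # first attribute vary fastest, matching the p1..p16 ordering of table 23
    # (e.g., p8 = 1110: support on everything except the organization).
    for idx, bits in enumerate(product((0, 1), repeat=len(ATTRIBUTES)), start=1):
        profile = bits[::-1]
        supported = [a for a, b in zip(ATTRIBUTES, profile) if b] or ["none"]
        print(f"p{idx}: {''.join(map(str, profile))} -> {', '.join(supported)}")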
Most importantly, because this study used teachers' perceptions of the implementation process, and teachers were the professionals who actually implemented the principles of the policy, the nuanced understandings from this application were particularly relevant. In educational and other social contexts, the understanding of complex latent variables, such as perceptions of policy implementation support, commonly requires investigation into the components of the cognitive process. That was precisely the advantage diagnostic models provided in this application.
The literature review resulted in a coarse-grained, 4-attribute model hypothesis. However, several finer-grained models were built as well. It was clear that diagnostic models specifying coarser-grained attributes representing implementation support mechanisms fit the data better than models specifying finer-grained attributes. This was likely because of the number of parameters that needed to be estimated in the finer-grained models relative to the sample size available to estimate them. Regardless, the 4-attribute model provided a useful and logical substantive interpretation.
Discussion on Research Question 1b
In comparing the best-fitting 4-attribute model and the unidimensional model, parameter estimates were very similar in terms of precision. This provided further evidence that the application of diagnostic modeling to intra-organizational support mechanisms may be useful for gaining a more nuanced understanding of teachers' perceptions. More specifically, investigating policy implementation support through its underlying components yields a more nuanced understanding. However, if modeling these components had produced standard errors that were not comparable to, or more precise than, those of the unidimensional model, it would be difficult to justify the application. Although it was an exception to the general trend, the best-fitting four-attribute model actually had standard errors very comparable to those of the unidimensional model. This suggested that researchers do not necessarily have to sacrifice model complexity in order to maintain model precision. However, the general trend of the standard errors showed that as the models became increasingly complex and more attributes were added, the standard errors for the parameters of the finer-grained attributes increased. More generally, as anticipated, the more complex the model, the less precisely its parameters were estimated. Although the current study retro-fitted a diagnostic model to the data, those designing an instrument specifically for a diagnostic model may want to tend toward lower numbers of attributes in order to preserve the accuracy of the parameter estimates.
Discussion on Research Question 1c
The posterior estimated proportions of teachers diagnosed to fall under each of the sixteen latent profiles were obtained. Profiles of organizational support generated from adequately fitting psychometric models provided a rigorous quantitative basis for the policy implementation process. Based on a teacher's profile membership, his or her perceptions of specific organizational strengths and weaknesses were estimated. The estimated marginal diagnostic distributions suggested that teachers generally perceived a higher level of support on characteristics of the policy and characteristics of teachers. They also suggested that teachers perceived a lower level of support on characteristics of leadership and characteristics of the organization. The certainty with which teachers are assigned to a profile, available from each teacher's profile membership distribution, was intended to inform decisions about the appropriateness of potential programs for that teacher. One potential use of the diagnostic information captured by the model is to support investments in the human capital available in the teacher workforce. Such investments would include informing decisions regarding teachers' professional development. The empirically derived profiles of perceptions of implementation support could facilitate the development of interventions that are both targeted at the needs of individual teachers and coordinated across multiple organizational domains of the implementation process (Halpin & Kieffer, 2015). Although exactly how and specifically what professional development resources should be targeted toward improvement of the process is not explored in this study, this new measurement application provides direction toward better understanding the challenges of implementing a policy of this magnitude. The obvious approach would be to focus on the implementation areas in which a teacher, or group of teachers, perceives a deficit in support. However, it is rare that educational policy makers and leaders are equipped with the type of diagnostic information that explicitly permits inferences into how perceptions of the important organizational attributes are interrelated. This is extremely important information to consider when attempting to maximize the utility of the limited resources available for facilitating the professional growth of the teacher workforce.
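For readers unfamiliar with how such membership distributions arise, the following is a minimal sketch (Python) of posterior classification over latent profiles via Bayes' rule, assuming item responses are conditionally independent given the profile; the class priors and item response probabilities are hypothetical placeholders, not estimates from this study:

    import numpy as np

    # Toy model with 4 profiles and 3 items:
    # prior[c] = marginal probability of profile c;
    # p_item[c, i] = P(positive response to item i | profile c).
    prior = np.array([0.40, 0.25, 0.20, 0.15])
    p_item = np.array([
        [0.2, 0.3, 0.1],
        [0.7, 0.4, 0.3],
        [0.5, 0.8, 0.6],
        [0.9, 0.9, 0.8],
    ])

    def posterior(responses: np.ndarray) -> np.ndarray:
        # Likelihood of the response vector under each profile.
        lik = np.prod(np.where(responses, p_item, 1.0 - p_item), axis=1)
        post = prior * lik
        return post / post.sum()  # Bayes' rule: normalize over profiles

    print(posterior(np.array([1, 1, 0])))  # membership distribution

A teacher would typically be assigned to the profile with the highest posterior probability, and the spread of this distribution conveys the classification certainty discussed above.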
Discussion on Research Question 2a
For research question 2, the multiple-group models were compared in terms of model fit, item parameter estimates, and the marginal distributions for the groups. Five different psychometric models were used in this analysis: a single-group analysis, a two-group analysis defined by subject, a two-group analysis defined by career status, a three-group analysis defined by grade level, and a six-group analysis defined by a factorial design of subject and grade level. On the overall model fit indices AIC and BIC, the single-group analysis was the best compared to the group analyses. Among the multiple-group models, the teacher subject analysis fit best.
The results showed that the marginal distributions resulting from these models were fairly similar. The most notable discrepancy was the difference between profiles 8 and 16 across the models. For model 1, the largest proportion of teachers was estimated to fall into profile 8; however, for every other model, profile 16 was the most common profile. Profile 8 included perceived support on all attributes except for characteristics of the organization. Profile 16 included perceived support on all attributes. Thus, the only difference between the definitions of these profiles was whether or not teachers were estimated to have perceived support on characteristics of the organization. The higher correlation between the leadership and organizational attributes (r = 0.783, p < 0.01) may partly explain this slight difference between the multiple-group models. Notwithstanding this exception, the remaining bivariate correlations between the attributes were of fairly moderate size, yet all were statistically significant. This provided empirical evidence that each attribute likely captured a different, but related, component of policy implementation support.
Discussion on Research Question 2b
The multi-group models offered major advantages in terms of interpreting the results. Parameter estimates were explored and several findings were noted. The first notable finding was that there was almost no difference in the estimated average intercept parameter between experienced teachers and early-career teachers. This suggested that, when no support attributes had perceived support, the probabilities that teachers responded positively to a given item were similar for these groups. Secondly, the average intercept parameter in the teacher subject model (G = 2) was lower for STEM teachers than for non-STEM teachers. This dynamic held true in the factorial design model (G = 6), where the average intercept parameters were lower for STEM teachers than non-STEM teachers at all three grade levels: elementary, middle, and high school. This meant that there was less perceived support by STEM teachers than by non-STEM teachers who had the same profile. More specifically, it suggested an overall group difference between STEM and non-STEM teachers. Generally, this finding further suggests that STEM teachers required different support than non-STEM teachers. Surprisingly, STEM teachers also had significantly higher estimates on the intercepts for items pertaining to their extent of confidence in understanding the GUPSECT standards, the measures used to evaluate their teaching, and using their ratings to inform their teaching. This may have indicated that STEM teachers at all levels required less support on these mechanisms than non-STEM teachers in order to have a significantly higher probability of a positive response. This would be important for a school leader to know. If, for example, a professional development session were to be centered on using evaluation ratings to improve teaching, a principal may be able to focus attention on non-STEM teachers for this session. Or, even more strategically, this principal may be able to rely on specific STEM teachers to help build understanding and confidence among less confident teachers participating in this session. Such information is vital in maximizing the limited time allotted for teacher professional development.
The slope parameters captured the influence, in logits, of perceived attribute support on the probability of a correct response to an item. They also captured group mean differences in the strength of the relationship between items and the four support attributes. There were very few significant differences between groups on items. In fact, there were no significant differences between STEM and non-STEM teachers, or between experienced and early-career teachers. This finding indicated that, generally, perceptions of support had similar influences on teachers' probability of an agreement response to items across grade level, subject, and career status. However, when teachers were asked to rate the extent to which they agreed that principals collected adequate evidence to evaluate their teaching, both elementary and middle school teachers were estimated significantly higher than high school teachers. There remained the possibility that the few significant differences in item parameters were type I errors, which suggested that future research should focus on investigating whether these differences replicate. Alternatively, the differences could have something to do with discrepancies in the frequency of achievement testing at different levels. In Virginia, Standards of Learning testing occurs every year for math and science in grades 3 through 8. Moreover, writing and science tests occur in grades 5 and 8, and history is tested once in grades 4 through 7. In comparison, high school students are tested less frequently. If teachers believe student test scores are used as evidence in their evaluations, this may impact their perceptions on this item. In reality, the 40% weighting of student academic progress defined in the GUPSECT standards can include other evidence of student progress in addition to testing. One alternative explanation for this finding was that teachers may believe that the number of years in elementary school provides more opportunities for principals to collect evidence of their teaching.
Summary of Results
In summary, various cognitive diagnostic models were applied to teachers' perceptions of intra-organizational mechanisms for supporting teacher evaluation policy implementation. A preliminary analysis was conducted and various methods were used to construct the Q-matrix. An exploratory factor analysis provided empirical support for a fine-grained, 10-attribute model. It was also determined that these finer-grained attributes could be combined to construct a coarser-grained, four-attribute model. This was the starting point for the diagnostic model-fitting process. Using expert consultation, the Q-matrix was refined until a set of 10-attribute models converged. Based on the correlations and substantive interpretations, attributes were combined to build sets of 9-, 8-, 7-, 6-, 5-, and 4-attribute models.

The set of 4-attribute models was superior in terms of model fit. Moreover, the 4-attribute models had lower standard errors than the competing finer-grained models. The best-fitting 4-attribute model was selected and the 7 most egregiously misfitting items were removed. The resulting 70-item, 4-attribute model was determined to be a better fit. The estimated proportions of teachers in each profile were interpreted for this model in order to address research question 1c. In total, approximately 24% of teachers were estimated to have perceived support on all four attributes. Conversely, approximately 16% of teachers were estimated to have perceived support on none of the support attributes. Approximately 23% of teachers indicated that they were supported by mechanisms related to characteristics of teachers, and another 16% were estimated to have perceived support on mechanisms relating to characteristics of both the policy and teachers.
Finally, diagnostic models were explored in terms of group differences. Specifically,
group differences were compared using four different models. First, a model was run for early
career teachers vs. experienced teachers (G = 2). Next, a model was run for STEM teachers vs.
non-STEM teachers (G = 2). This was followed by a model comparing teachers by grade level (G = 3). Finally, a full-factorial model of subject and grade level was run (G = 6).
Although the multi-group models failed to improve on the overall fit of the single-group model, each of the models was explored in terms of parameter estimates, standard errors, and distributions across profiles. The distributions across profiles were quite comparable across all models. Moreover, these multi-group models offered major advantages in terms of interpreting the results. Parameter estimates were explored and several findings were noted. The first notable finding was that STEM teachers had generally lower intercept estimates than non-STEM teachers when no attributes had perceived support. However, STEM teachers had significantly higher intercept estimates on items pertaining to their extent of confidence in understanding the GUPSECT standards and measures, and in using their ratings to inform their teaching.

Secondly, there were very few significant differences between groups on items. In fact, there were no significant differences between STEM and non-STEM teachers, or between experienced and early-career teachers. Both elementary and middle school teachers were estimated significantly higher than high school teachers on specific items related to "leadership support." These included the items measuring the extent to which teachers agreed that principals collected adequate evidence to evaluate their teaching and the extent to which their principals had the knowledge and skills to evaluate their teaching. Middle school teachers were statistically significantly higher than elementary and high school teachers on the extent to which the professional development they received on teacher evaluation was useful. Finally, elementary school teachers were estimated significantly higher than middle and high school teachers on the item asking whether evaluation feedback informed their professional development selection. This indicated that perceived support on the leadership attribute had less of an influence on the logit of a correct response to this item for middle and high school teachers.
The initial findings were quite consistent with the literature. More specifically, the exploratory factor analysis showed that the 10-attribute model could be interpreted as a finer-grained representation of the coarse-grained, 4-attribute hypothesis constructed from a review of the literature. There is support for the notion that, in the context of state-wide teacher evaluation policy implementation, local organizations can think of implementation support in terms of four coarse-grained components rather than as a single, unidimensional construct. By supporting teachers in this way, districts can target the necessary resources to the appropriate schools and teachers. This, in turn, should maximize the utility of the resources available to school districts.

This study is among the first to look at K-12 policy implementation support from teachers' perspective. Several insights were gained by investigating teachers' perceptions of organizational supports and barriers that would not have been available through other methods. These insights provided additional evidence that school leaders can broaden their understanding of what is and is not working with new policies by collecting data from those responsible for actually implementing the guidelines of the policy.
Limitations
The analyses presented in this study demonstrated the usefulness of cognitive diagnostic models as a method for investigating the implementation of a state-wide teacher evaluation policy. Researchers on the ITES project collected survey data that allowed for a multidimensional analysis of teachers' perceptions of intra-organizational mechanisms for supporting teacher evaluation. This provided an opportunity to retro-fit the C-RUM model to these data in order to make criterion-referenced, standards-based decisions regarding the implementation process. The practice of retro-fitting (de la Torre and Karelitz, 2009) is generally suboptimal and can result in the misclassification of examinees if the instrument was intended for a unidimensional construct. Hence, one limitation of using these data was that the instrument was not initially designed for a diagnostic model. Because of this, the attributes were unevenly covered and items were not specifically written for a diagnostic model. However, although the instrument used in this study was not developed with a diagnostic model in mind, the fact that it was developed around four coarse-grained mechanisms for policy implementation support made it suitable for exploratory research purposes. A future direction would be the development of an instrument designed for a cognitive diagnostic model and developed from a Q-matrix.

Secondly, the dataset limits the generalizability of the substantive results to the population of K-12 public schools in Southwest Virginia. The final sample included 794 teachers. This sample is not small, but it is not overly large either. It should be noted that a high response rate (70%) was attained. Thus, although the data were not collected from a probabilistic sample, the sample is very representative of the target population. Moreover, there were many reasons presented as to why these data would provide the best answers to the previously presented research questions. Most importantly, no other dataset currently includes the variables necessary to answer the research questions in this study. Since teacher evaluation policy is typically developed and implemented at the state or local level, there exists limited funding for collecting data and conducting research on the implementation process, which may explain why there are limited data in this area of research. Additionally, because policies are implemented at the state and local level, no national dataset exists, although it would be very interesting to investigate policies in and across other states.
It was acknowledged that the exact specification of the Q-matrix is often unknown a priori. Mechanisms of policy implementation support in schools are not fully understood, and thus the exact relationships in such a complex model cannot be known for certain. For this reason, in addition to traditional approaches to Q-matrix development, empirically based Q-matrix discovery techniques were pursued in this study. There remains a need for further investigation into the development of empirical techniques for determining the entries of the Q-matrix.
Finally, in the current study, the data for the final specified model were dichotomized. Although this strategy has precedent, it does present an additional limitation. With a much larger sample, the data might have remained polytomous. However, the current sample size (n = 747) does not support the number of parameters that would need to be estimated if the data remained polytomous. This limitation turned out not to be significant, as a preliminary analysis showed that there were almost no differences between the structures resulting from the polytomous data and the dichotomous data.
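A minimal sketch of this kind of dichotomization (Python); the 4-point scale and the cut point placing agreement responses at 1 are illustrative assumptions, since the exact scale and cut are described elsewhere in this study:

    import numpy as np

    def dichotomize(likert: np.ndarray, cut: int = 3) -> np.ndarray:
        # Collapse polytomous Likert responses (assumed here to run 1-4) into
        # binary indicators: 1 if at or above the cut, else 0. The cut of 3
        # (agree / strongly agree -> 1) is an assumption for illustration.
        return (likert >= cut).astype(int)

    responses = np.array([1, 2, 3, 4, 2, 4])  # hypothetical item responses
    print(dichotomize(responses))             # [0 0 1 1 0 1]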
Future Research
Several areas of future research emerge from the findings of this study. The current study illustrated the interplay between the general application of cognitive diagnostic models to an area of research in which they had not previously been used and the actual substantive findings from that model. Future research studies should focus on both. First, there is a great need for research into designing psychometrically sound instruments specifically intended for diagnostic models. Although there was evidence to support the application of diagnostic models to policy implementation support, the model was retro-fitted to data collected using an instrument not specifically designed for this use. By advancing research into diagnostic-modeling instrument development, researchers in all fields seeking new applications for these models will be better prepared. In conjunction with further investigation into instrument development, more research into diagnostic modeling of polytomously scored items is necessary in order to preserve the information attained by instruments with polytomous items.

Furthermore, although several studies exist on Q-matrix development, there is minimal agreement on the most effective way to develop this important hypothesis. This could be because the Q-matrix may depend, to a high degree, on the attributes, items, and relationships being explored. However, since several methodologies focusing on Q-matrix construction have been documented in the literature, future studies should compare and contrast competing ways to construct this important hypothesis in different areas of study.

Since this was a new application of cognitive diagnostic modeling, the substantive findings of this study could be further validated through replication. For example, when teachers were asked to rate the extent to which they agreed that principals collected adequate evidence to evaluate their teaching, both elementary and middle school teachers were estimated significantly higher than high school teachers. However, with a type 1 error rate of 0.05, the possibility remained that these differences were actually type-1 errors. Thus, future investigation into these support mechanisms is necessary. Moreover, as the ITES project continues to grow, a larger sample will help researchers further explore the valid and reliable measurement of policy implementation support. As new researchers get involved in the project, triangulating the results of this study using other methodologies may be helpful. More specifically, because the study was localized, a qualitative inquiry into teachers' perceptions may substantiate the important findings.
Traditionally, cognitive diagnostic models are applied to K-12 skills or to diagnosing medical conditions where the presence of symptoms is a binary outcome. Thus, although generalizability is important, demonstrating the application of these models and the potential value of the diagnostic output was considered a more important initiative in regard to informing future studies. Exploring new ways to more validly and reliably measure and improve the support available to teachers in using performance information to adjust instruction remains an important activity for educational leaders and researchers. This effort can be a crucial part of broader initiatives to build school capacity to better serve students in this performance accountability era (Sun, Mutcheson, & Kim, 2014). This further supports the notion that the initial findings in this study should be substantiated through replication.
References
Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317–332.
Akerlof, G. A., & Kranton, R. E. (2005). Identity and the economics of organizations. Journal of Economic Perspectives, 19(1), 9–32.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters:
Application of an EM algorithm. Psychometrika, 46(4), 443-459.
Brown, T. A. (2015). Confirmatory factor analysis for applied research. Guilford Publications.
Bryk, A. S., Sebring, P. B., Allensworth, E., Easton, J. Q., & Luppescu, S. (2010). Organizing
schools for improvement: Lessons from Chicago. University of Chicago Press.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276.
Century, J., Rudnick, M., & Freeman, C. (2010). A framework for measuring fidelity of
implementation: A foundation for shared language and accumulation of
knowledge. American Journal of Evaluation, 31(2), 199-218.
Century, J., Cassata, A., Rudnick, M., & Freeman, C. (2012). Measuring enactment of innovations and the factors that affect implementation and sustainability: Moving toward common language and shared conceptual understanding. The Journal of Behavioral Health Services & Research, 39(4), 343–361.
Chen, W., & Thissen, D. (1997). Local dependence indexes for item pairs using item response
theory. Journal of Educational and Behavioral Statistics, 22, 265-289.
Chen, J., de la Torre, J. and Zhang, Z. (2013), Relative and Absolute Fit Evaluation in Cognitive
Diagnosis Modeling. Journal of Educational Measurement, 50: 123–140.
doi: 10.1111/j.1745-3984.2012.00185.x
Close, C. N., Davison, M. L., & Davenport, E. C. (2012). An exploratory technique for finding
the Q-matrix in cognitive diagnostic assessment: Combining theory with data. In Annual
Meeting of the National Council on Measurement in Education. Vancouver, British
Columbia, Canada.
Coburn, C. E. (2001). Collective sensemaking about reading: How teachers mediate reading
policy in their professional communities. Educational Evaluation and Policy Analysis,
23(2), 145–170.
Curtin, T. R., Ingels, S. J., Wu, S., & Heuer, R. (2002). National education longitudinal study of 1988: Base-year to fourth follow-up data file user's manual (NCES 2002-323). Washington, DC: US Department of Education, National Center for Education Statistics.
Datnow, A., & Park, V. (2009). Conceptualizing policy implementation: Large-scale reform in
an era of complexity. Handbook of Education Policy Research, 348-361.
Darling-Hammond, L., Amrein-Beardsley, A., Haertel, E., & Rothstein, J. (2012). Evaluating teacher evaluation. Phi Delta Kappan, 93(6), 8–15.
DiBello, L. V., Roussos, L. A., & Stout, W. (2006). A review of cognitively diagnostic assessment and a summary of psychometric models. Handbook of Statistics, 26, 979–1030.
Eisenhardt, K.M. (1989). Agency theory: An assessment and review. The Academy of
Management Review, 14(1). 57-74.
Frank, K. A., Zhao, Y., & Borman, K. (2004). Social capital and the diffusion of innovations within organizations: The case of computer technology in schools. Sociology of Education, 77(2), 148–171.
Galeshi, R., & Skaggs, G. (2014). Traditional fit indices utility in new psychometric model:
cognitive diagnostic model. International Journal of Quantitative Research in
Education, 2(2), 113-132.
Goldhaber, D. D., & Brewer, D. J. (2000). Does teacher certification matter? High school teacher certification status and student achievement. Educational Evaluation and Policy Analysis, 22(2), 129–145.
Gorin, J. S. (2009). Diagnostic classification models: Are they necessary? Commentary on Rupp and Templin (2008).
Haberman, S. J., von Davier, M., & Lee, Y. H. (2008). Comparison of multidimensional item response models: Multivariate normal ability distributions versus multivariate polytomous ability distributions. ETS Research Report Series, 2008(2), i–25.
Hair, J., Tatham R., Anderson R., & Black W (1998). Multivariate data analysis. (Fifth Ed.)
Prentice-Hall: London.
Hagenaars, J. A., & McCutcheon, A. L. (Eds.). (2002). Applied latent class analysis. Cambridge
University Press.
Hallinger, P., Heck, R. H., & Murphy, J. (2014). Teacher evaluation and school improvement:
An analysis of the evidence. Educational Assessment, Evaluation and Accountability,
26(1), 1-24.
Halpin, P. F., & Kieffer, M. J. (2015). Describing profiles of instructional practice: A new approach to analyzing classroom observation data. Educational Researcher. doi:10.3102/0013189X15590804
The New Teacher Project (2007)
Harris, D. N., & Sass, T. R. (2011). Teacher training, teacher quality and student
achievement. Journal of public economics, 95(7), 798-812.
Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive
abilities: Blending theory with practicality (Doctoral dissertation, University of Illinois at
Urbana-Champaign).
Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191–210.
Horn, J. L. (1965). A rationale and test for the number of factors in factor
analysis. Psychometrika, 30(2), 179-185.
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity
to underparameterized model misspecification. Psychological Methods, 3, 424–453.
Jang, E. E. (2009). Cognitive diagnostic assessment of L2 reading comprehension ability:
Validity arguments for Fusion Model application to LanguEdge assessment. Language
Testing, 26(1), 31–73.
Kaiser, H. F. (1991). Coefficient alpha for a principal component and the Kaiser-Guttman
rule. Psychological reports, 68(3), 855-858.
Kane, T. J., McCaffrey, D. F., Miller, T., & Staiger, D. O. (2013). Have We Identified Effective
Teachers? Validating Measures of Effective Teaching Using Random Assignment.
Research Paper. MET Project. Bill & Melinda Gates Foundation.
Kunina‐Habenicht, O., Rupp, A. A., & Wilhelm, O. (2012). The Impact of Model
Misspecification on Parameter Estimation and Item‐Fit Assessment in Log‐Linear
Diagnostic Classification Models. Journal of Educational Measurement, 49(1), 59-81
Kelcey, B., Hill, H. C., & McGinn, D. (2014). Approximate measurement invariance in cross-
classified rater-mediated assessments. Frontiers in Psychology, 5(1469).
Lee, Y. W., & Sawaki, Y. (2009). Cognitive diagnosis approaches to language assessment: An
overview. Language Assessment Quarterly, 6(3), 172-189.
Leighton, J., & Gierl, M. (Eds.). (2007). Cognitive diagnostic assessment for education: Theory
and applications. Cambridge University Press.
Li, H., & Suen, H. K. (2013). Constructing and validating a Q-matrix for cognitive diagnostic analyses of a reading test. Educational Assessment, 18(1), 1–25.
Liu, Y., Douglas, J. A., & Henson, R. A. (2009). Testing person fit in cognitive
diagnosis. Applied psychological measurement, 33(8), 579-598.
Magidson, J., & Vermunt, J. K. (2004). Latent class models. The Sage handbook of quantitative
methodology for the social sciences, 175-198.
Marzano, R. J., Pickering, D. J., & Pollock, J. E. (2001). Classroom instruction that works:
Research-based strategies for increasing student achievement. Alexandria, VA:
Association for Supervision and Curriculum Development.
Meng, X. L., & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80, 267–278.
Mihaly, K., McCaffrey, D. F., Staiger, D. O., & Lockwood, J. R. (2013). A composite estimator
of effective teaching. Seattle, WA: Bill & Melinda Gates Foundation.
Moynihan, D. P. (2008). The dynamics of performance management: Constructing information
and reform. Washington DC: Georgetown University Press.
Murphy, J., Hallinger, P., & Heck, R. H. (2013). Leading via teacher evaluation: The case of the
missing clothes? Educational Researcher, 42(6), 349–354.
Norris, M., & Lecavalier, L. (2009). Evaluating the use of exploratory factor analysis in developmental disability psychological research. Journal of Autism and Developmental Disorders, 40(1), 8–20. doi:10.1007/s10803-009-0816-2
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects?
Educational Evaluation and Policy Analysis, 26(3), 237-257.
Poggio, A. J., Yang, X., Irwin, P. M., Glasnapp, D. R., & Poggio, J. P. (2007). Kansas
Assessments in Reading and Mathematics 2006 Technical Manual for the Kansas
General Assessments, Kansas Assessments of Multiple Measures (KAMM), Kansas
Alternate Assessments (KAA). Retrieved April 20, 2008.
Proctor, E. K., et al. (2011). Outcomes for implementation research: Conceptual distinctions,
measurement challenges, and research agenda. Administration and Policy in Mental Health
and Mental Health Services Research, 38, 65-76. doi:10.1007/s10488-010-0319-7
Putnam, R. (2000). Bowling alone: The collapse and revival of American community. Simon
and Schuster.
Ravand, H., & Robitzsch, A. (2015). Cognitive diagnostic modeling using R. Practical
Assessment, Research & Evaluation, 20(11).
Roberts, M. R., & Gierl, M. J. (2010). Developing score reports for cognitive diagnostic
assessments. Educational Measurement: Issues and Practice, 29(3), 25-38.
Rockoff, J. E. (2004). The impact of individual teachers on student achievement: Evidence from
panel data. The American Economic Review, 94(2), 247-252.
Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic
achievement. Econometrica, 73(2), 417-458.
Rogers, E. M. (2010). Diffusion of innovations. Simon and Schuster.
Rogers, E. M. (2003). Elements of diffusion. Diffusion of innovations, 5, 1-38.
Rosen, A., & Proctor, E. K. (1981). Distinctions between treatment outcomes and their
implications for treatment evaluation. Journal of Consulting and Clinical Psychology,
49(3), 418–425.
Rothstein, J. & Mathis, W.J. (2013). Review of two culminating reports from the MET project.
National Education Policy Center.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods,
and applications. The Guilford Press.
Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification models:
A comprehensive review of the current state-of-the-art. Measurement, 6(4), 219-262.
Sartain, L., Stoelinga, S. R., & Brown, E. R. (2011). Rethinking teacher evaluation in Chicago
(pp. 1-50). Chicago, IL: Consortium on Chicago School Research at the University of
Chicago Urban Education Institute.
Sawaki, Y., Kim, H. J., & Gentile, C. (2009). Q-matrix construction: Defining the link between
constructs and test items in large-scale reading and listening comprehension
assessments. Language Assessment Quarterly, 6(3), 190-209.
Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate
analysis. Psychometrika, 52(3), 333-343.
Spillane, J. P., & Miele, D. B. (2007). Evidence in practice: A framing of the terrain. Yearbook
of the National Society for the Study of Education, 106(1), 46-73.
Spillane, J. P., Reiser, B. J., & Reimer, T. (2002). Policy implementation and cognition:
Reframing and refocusing implementation research. Review of Educational Research,
72(3), 387–431.
Spillane, J. P., Gomez, L., & Mesler, L. (2009). School organization and policy: Implementation,
organizational resources, and school work practice. In D. Plank, G. Sykes, & B. Schneider
(Eds.), Handbook of education policy research (pp. 409-425). Lawrence Erlbaum.
StataCorp LP. (2007). Stata data analysis and statistical software: Special Edition Release 10.
College Station, TX: StataCorp LP.
Stronge, J. H. (2010). Evaluating what good teachers do: Eight research-based standards for
assessing teacher excellence. Larchmont, NY: Eye On Education.
Stronge, J. H., Gareis, C. R., & Little, C. A. (2006). Teacher pay and teacher quality: Attracting,
developing, and retaining the best teachers. Corwin Press.
Sun, M., & Mutcheson, R. B. (2014). Implementation of Virginia new teacher evaluation system:
A report to district B. Virginia Tech: Virginia, U.S.
Sun, M., Mutcheson, B., & Kim, J. (in press). Teachers' use of evaluation for instructional
improvement and school supports. In J. A. Grissom & P. Youngs (Eds.), Making the
most of multiple measures: The impacts and challenges of implementing rigorous teacher
evaluation systems. New York: Teachers College Press.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item
response theory. Journal of Educational Measurement, 20(4), 345-354.
Tatsuoka, K. K. (1990). Toward an integration of item-response theory and cognitive error
diagnosis. Diagnostic monitoring of skill and knowledge acquisition, 453-488.
Taylor, E. S., & Tyler, J. H. (2012). The effect of evaluation on teacher performance. The
American Economic Review, 102(7), 3628-3651.
Templin, J.L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive
diagnosis models. Psychological Methods, 11, 287-305.
Tucker, P. D., & Stronge, J. H. (2001). Measure for Measure: Using Student Test Results in
Teacher Evaluations. American School Board Journal, 188(9), 34-37.
U.S. Department of Education. ESEA flexibility. Retrieved February 15, 2015, from
http://www2.ed.gov/policy/elsec/guid/esea-flexibility/index.html
Virginia Department of Education. (2011). Guidelines for Uniform Performance Standards and
Evaluation Criteria for Teachers. Virginia, U.S.
Wang, C., & Gierl, M. J. (2007, April). Investigating the cognitive attributes underlying student
performance on the SAT® critical reading subtest: An application of the Attribute
Hierarchy Method. Paper presented at the annual meeting of the National Council on
Measurement in Education, Chicago, IL.
Weisberg, D., Sexton, S., Mulhern, J., Keeling, D., Schunck, J., Palcisco, A., & Morgan, K.
(2009). The widget effect: Our national failure to acknowledge and act on differences in
teacher effectiveness. New Teacher Project.
Xu, X., & von Davier, M. (2008). Fitting the structured general diagnostic model to NAEP data.
ETS Research Report Series, 2008(1), i-18.
von Davier, M. (2015). mdltm: GDM software [Computer software]. Retrieved from
http://www.von-davier.com/
von Davier, M. (2005). mdltm: Software for the general diagnostic model and for estimating
mixtures of multidimensional discrete latent traits models [Computer
software]. Princeton, NJ: ETS.
Youngs, P., Frank, K.A., Thum, Y.M., & Low, M. (2012). The motivation of teachers to
produce human capital and conform to their social contexts. In T. Smith, L. Desimone,
& A.C. Porter (Eds.), Yearbook of the National Society for the Study of Education: Vol.
110. Organization and effectiveness of high-intensity induction programs for new
teachers (pp. 248-272). Malden, MA: Blackwell Publishing.
APPENDICES
Appendix A:
ITES Survey Items By Factor
Characteristics of the Policy
1: Policy Legitimacy
1 Extent sources of evidence were used to inform evaluation: formal obs
2 Extent sources of evidence were used to inform evaluation: informal obs
3 Extent sources of evidence were used to inform evaluation: student work
4 Extent sources of evidence were used to inform evaluation: feedback from parents
5 Extent sources of evidence were used to inform evaluation: student surveys
6 Extent sources of evidence were used to inform evaluation: student growth
7 Extent of agreement with statement: precise instruments were used
8 Extent of agreement with statements: policy impacted challenging homework
9 Extent of agreement with statements: policy impacted classroom assessment
10 Extent of agreement with statements: policy impacted feedback to students
11 Extent of agreement with statements: policy impacted reflection
12 Extent of agreement with statements: policy impacted preparing for tests
13 Extent of agreement with statements: policy impacted strategies for underperforming students
14 Extent of agreement with statements: policy impacted collaboration
15 Extent of agreement with statement: my evaluation provided an accurate rating
16 Extent of agreement with statement: policy impacted my use of ratings
2: Policy Clarity/Adaptability
17 Extent of policy importance placed on guiding PD
18 Extent of policy importance placed on improving instruction
19 Extent of policy importance placed on accountability for achievement
20 Extent policy aligns with my job description
21 Extent policy aligns with previous evaluation
22 Extent policy aligns with school values
23 Extent of policy importance placed on guiding compensation/contract renewal
Characteristics of the Teachers
3: Teacher confidence in abilities relating to policy
24 Extent of confidence in understanding of standards
25 Extent of confidence in understanding of measures
26 Extent of confidence in collecting evidence
27 Extent of confidence setting SMART goals
28 Extent of confidence documenting student progress
29 Extent of confidence using data to adjust teaching
30 Extent of confidence using evaluation to inform teaching
31 Extent of confidence communicating teaching and student growth to parents
4: Attitude towards policy
32 Extent the teacher evaluation in your school was focused on aspects of teacher disposition
33 Extent policy aligns with my own views
34 Extent of agreement with statement: the evaluation process was worth it
35 Extent of agreement with statement: the evaluation process was burdensome (reverse)
36 Extent of agreement with statement: evaluation feedback helped my improvement
37 Extent of agreement with statements: policy impacted communicating student progress
Characteristics of the Leadership
5: Innovation Advocacy/Communication
38 Extent to which teacher instruction was observed by principal
39 Extent to which teacher instruction was observed by asst principal
40 Extent of agreement with statement: teachers are encouraged to find effective strategies
41 Extent of agreement with statement: adequate recognition
42 Extent of agreement with statements: principal advocated for policy
43 Extent of agreement with statements: principal advocated tying evaluation to personnel decisions
44 Extent of agreement with statements: policy impacted communication with admin
6: Quality of Professional Development
45 Extent of professional development usefulness regarding content areas
46 Extent of professional development usefulness regarding teacher evaluation
47 Extent of professional development usefulness regarding making sense of data
48 Extent of professional development usefulness regarding overall impression
49 Extent of professional development hours provided regarding content areas
50 Extent of professional development hours provided regarding teacher evaluation
51 Extent of professional development hours provided regarding making sense of data
52 Extent of agreement with statement: evaluation feedback informed my PD selection
7: Leader Legitimacy
53 Extent of agreement with statements: principal encouraged data for decisions
54 Extent of agreement with statement: fair procedures were used
55 Extent of agreement with statements: principal adequate observations
56 Extent of agreement with statements: principal collected adequate evidence
57 Extent of agreement with statements: principal applied same procedures
58 Extent of agreement with statements: principal had knowledge and skills
59 Extent of agreement with statements: principal decisions best interest school
Characteristics of the Organization
8: Resources
60 Extent of agreement with statement: sufficient time for evaluation
61 Extent of agreement with statement: sufficient resources for evaluation
62 Extent of agreement with statements: policy impacted rigorous materials
63 Extent of agreement with statements: policy impacted time interpreting data
9: Org locus of decision
64 Extent teacher involvement in design and modification of evaluation criteria
65 Extent teacher involvement in design and modification of what evidence is used
66 Extent teacher involvement in design and modification of using data
67 Extent teacher involvement in design and modification of how evidence is used
68 Extent teacher involvement in design and modification of using evaluation for personnel decisions
69 Extent teacher involvement in design and modification of professional development selection
10: Org Values
70 Extent the teacher evaluation in your school was focused on aspects of content knowledge
71 Extent the teacher evaluation in your school was focused on aspects of instructional knowledge/skills
72 Extent the teacher evaluation in your school was focused on aspects of class management
73 Extent the teacher evaluation in your school was focused on aspects of relations with parents/students
74 Extent the teacher evaluation in your school was focused on aspects of collegiality
75 Extent the teacher evaluation in your school was focused on aspects of relations with administrators
76 Extent the teacher evaluation in your school was focused on aspects of service to the profession
77 Extent the teacher evaluation in your school was focused on aspects of impact on student growth
Appendix B
IRB Approval
MEMORANDUM
DATE: September 23, 2015
TO: Gary E Skaggs, Ryan Brock Mutcheson
FROM: Virginia Tech Institutional Review Board (FWA00000572, expires July 29, 2020)
PROTOCOL TITLE: Diagnostic Modeling of Intra-Organizational Mechanisms to
Support Policy Implementation
IRB NUMBER: 15-879
Effective September 23, 2015, the Virginia Tech Institutional Review Board (IRB) Chair, David M.
Moore, approved the New Application request for the above-mentioned research protocol. This
approval provides permission to begin the human subject activities outlined in the IRB-approved
protocol and supporting documents. Plans to deviate from the approved protocol and/or
supporting documents must be submitted to the IRB as an amendment request and approved by
the IRB prior to the implementation of any changes, regardless of how minor, except where
necessary to eliminate apparent immediate hazards to the subjects. Report within 5 business days
to the IRB any injuries or other unanticipated or adverse events involving risks or harms to
human research subjects or others. All investigators (listed above) are required to comply with
the researcher requirements outlined at: http://www.irb.vt.edu/pages/responsibilities.htm
(Please review responsibilities before the commencement of your research.)
PROTOCOL INFORMATION:
Approved As: Exempt, under 45 CFR 46.110 category(ies) 4
Protocol Approval Date: September 23, 2015
*Date a Continuing Review application is due to the IRB office if human subject
activities covered under this protocol, including data analysis, are to continue beyond the
Protocol Expiration Date.
FEDERALLY FUNDED RESEARCH REQUIREMENTS:
Per federal regulations, 45 CFR 46.103(f), the IRB is required to compare all federally funded
grant proposals/work statements to the IRB protocol(s) which cover the human research
activities included in the proposal / work statement before funds are released. Note that this
requirement does not apply to Exempt and Interim IRB protocols, or grants for which VT is not
the primary awardee.
Appendix C
Informal Blueprint of Teacher Survey Item Development
Virginia Guidelines for Uniform Performance Standards and Evaluation Criteria for Teachers (GUPSECT)
Standards: 1 = Professional Knowledge; 2 = Instructional Planning; 3 = Instructional Delivery;
4 = Assessment For/Of Learning; 5 = Learning Environment; 6 = Professionalism;
7 = Student Academic Progress

Areas of Implementation from Conceptual Framework     Std 1  Std 2  Std 3  Std 4  Std 5  Std 6  Std 7
Characteristics of the Policy and Guidelines            X      X      X      X      X      X      X
Characteristics of School Organizational Factors        X      X      X      X      X      X      X
Characteristics of School Leadership                    X      X      X      X      X      X      X
Characteristics of Teachers                             X      X      X      X      X      X      X
Survey Items Developed
Appendix D
Polytomous EFA Model Dimensionality Summary
Several steps were taken to examine the dimensionality of the polytomous and dichotomous
data. First, the polytomous data were explored using exploratory factor analysis with maximum
likelihood estimation and promax rotation. As a rough approximation, this method treats the
polytomous items, which use 5-point scales, as continuous variables, and it assumes a normal
distribution for each item response. Such an approximation is considered acceptable when items
have at least four response categories and the distributions of item responses are not heavily
skewed (Hu & Bentler, 1998). The AIC and BIC fit indices suggested that a 10-factor model fit
the data best, and based on the factor loadings, this model was readily interpretable. The results
from the exploratory factor analyses indicated that the internal structure of the model for the
dichotomized items was comparable to the internal structure of the model for the polytomous
items. In each case, factor loadings were used to determine which item loaded onto which
factor, with a minimum loading of 0.3 required for an item to load on a factor (McDonald,
2000). A qualitative comparison revealed that not only were the interpretable factors of the
dichotomous and polytomous EFAs similar, the loadings themselves were also very similar.
Although not the major focus of the current study, these findings about model dimensionality
were important because they established that, with these data, the dichotomized items could
reasonably be relied upon for the cognitive diagnostic analysis in case the polytomous models
failed to converge with the available sample size. Moreover, most of the available software
programs for cognitive diagnostic models are restricted to dichotomous models, and those that
do estimate polytomous models require substantially larger sample sizes to reach convergence.
In many settings it would be possible to find additional resources to increase the sample size;
however, this study relied on a secondary dataset, so the sample size could not be altered.
Finally, this finding was also important because cognitive diagnostic models have not previously
been used for this type of application, and the limited existing research centers on dichotomous
models.
With reasonable approximations of univariate normality for the polytomous items, the
next step was to conduct the exploratory factor analysis with maximum likelihood estimation
and promax rotation. A promax rotation was used because it allows relationships between
factors and is preferred in most situations unless a strong argument can be made that the factors
should not be correlated (Beavers et al., 2013; Costello & Osborne, 2005; Gaskin & Happell,
2013; Matsunaga, 2010). Based on the AIC and BIC fit indices obtained from maximum
likelihood estimation treating the items as continuous, normally distributed variables, the
10-factor model was determined to be the best fit for the data. The interpretability of the
10-factor model was also favorable, as discussed in detail in the following sections.
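To make the model-selection computation concrete, the following minimal sketch fits maximum likelihood factor models with an increasing number of factors and computes AIC and BIC from the resulting log-likelihoods. It is illustrative only: the analysis reported here was not run with scikit-learn, and the response matrix X, the file name, and the helper name factor_model_ic are hypothetical. Because rotation leaves the likelihood unchanged, a promax rotation can be applied to the retained solution afterward without affecting these indices.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    def factor_model_ic(X, max_factors=10):
        """Fit ML factor models with 1..max_factors factors; return (k, AIC, BIC).

        X is an (n respondents x p items) matrix of item responses treated
        as continuous, normally distributed variables, as described above.
        """
        n, p = X.shape
        results = []
        for k in range(1, max_factors + 1):
            fa = FactorAnalysis(n_components=k).fit(X)
            loglik = fa.score(X) * n  # score() returns the mean log-likelihood
            # Free parameters: p*k loadings + p uniquenesses, minus the
            # k*(k-1)/2 rotational indeterminacy of the loading matrix.
            n_params = p * k + p - k * (k - 1) // 2
            aic = -2.0 * loglik + 2 * n_params
            bic = -2.0 * loglik + np.log(n) * n_params
            results.append((k, aic, bic))
        return results

    # Hypothetical usage: retain the factor count that minimizes BIC (or AIC).
    # X = np.loadtxt("item_responses.csv", delimiter=",")
    # best_k = min(factor_model_ic(X), key=lambda r: r[2])[0]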
Table 28: Factor Analysis of Polytomous Items
Factors    AIC         BIC
1 24400.51 24760.56
4 13009.42 14421.94
5 11516.64 13270.74
6 10387.34 12478.42
7 9547.16 11970.60
8 8878.52 11629.69
9 8311.46 11385.76
10 7781.59 11174.40
Appendix E
Table 29: 4-Factor Solution for the 70-Item Teacher Evaluation Instrument, Estimated by Maximum Likelihood Treating Items as
Continuous Variables
Item    Policy: Slopes (λi,1,(1))    Teacher: Slopes (λi,1,(2))    Leadership: Slopes (λi,1,(3))    Organization: Slopes (λi,1,(4))    Intercepts (λi,0)    RMSEA
1 0.59 (0.08) 0.44 (0.08) 0.05
2 0.09 (0.07) -0.33 (0.07) 0.02
3 0.22 (0.15) -2.63 (0.15) 0.02
4 0.57 (0.14) -2.42 (0.14) 0.03
5 0.35 (0.12) -1.98 (0.12) 0.05
6 0.31 (0.08) 0.48 (0.08) 0.73 (0.08) 0.06
7 0.11 (0.08) 0.36 (0.08) -0.68 (0.08) 0.06
8 0.37 (0.08) 0.52 (0.08) 0.22 (0.08) 0.04
9 0.43 (0.09) 1.59 (0.09) 2.41 (0.09) 0.07
10 0.96 (0.08) -0.58 (0.08) 0.15
11 0.43 (0.09) 0.67 (0.09) 0.02
12 1.41 (0.1) 0.14 (0.1) 1.42 (0.1) 0.16
13 0.35 (0.11) 1.42 (0.11) 2.22 (0.11) 0.04
14 1.00 (0.09) 0.64 (0.09) 0.84 (0.09) 0.08
15 1.14 (0.08) -0.76 (0.08) 0.12
16 0.77 (0.08) 0.32 (0.08) 0.05
17 0.66 (0.08) 0.47 (0.08) 0.29 (0.08) 0.03
18 0.78 (0.1) 0.98 (0.1) 1.87 (0.1) 0.08
19 0.87 (0.08) 0.24 (0.08) 0.11
20 0.78 (0.08) 0.31 (0.08) 0.09
21 0.97 (0.09) -1.44 (0.09) 0.07
22 0.62 (0.09) -0.8 (0.09) 0.06
23 0.63 (0.09) -1.52 (0.09) 0.12
24 0.95 (0.1) 1.96 (0.10) 0.05
25 0.14 (0.09) 0.41 (0.09) -1.35 (0.09) 0.05
26 0.52 (0.08) 0.98 (0.08) 0.05
27 0.15 (0.08) -0.41 (0.08) 0.07
28 0.84 (0.09) -0.92 (0.09) 0.06
29 0.17 (0.08) -0.9 (0.08) 0.08
30 1.03 (0.08) 0.03 (0.08) 0.06
31 0.79 (0.09) -0.9 (0.09) 0.06
32 0.02 (0.08) 0.76 (0.08) 0.03
33 0.49 (0.13) -2.13 (0.13) 0.04
34 0.22 (0.1) 0.32 (0.1) -1.4 (0.1) 0.05
35 0.55 (0.09) -1.11 (0.09) 0.06
36 0.35 (0.14) -2.77 (0.14) 0.03
37 0.69 (0.13) -2.13 (0.13) 0.03
38 1.03 (0.17) -2.92 (0.17) 0.08
39 1.89 (0.09) -2.34 (0.09) 0.05
40 1.66 (0.09) -0.36 (0.09) 0.05
41 1.81 (0.11) 0.11 (0.11) 0.06
42 1.57 (0.09) -0.73 (0.09) 0.04
43 1.8 (0.12) 0.91 (0.12) 0.16
44 0.91 (0.13) 1.68 (0.13) 0
45 0.86 (0.11) 1.32 (0.11) 0.03
46 1.94 (0.13) 1.09 (0.13) 0.02
47 2.34 (0.15) 1.62 (0.15) 0.06
48 1.58 (0.13) 0.93 (0.13) 1.8 (0.13) 0.13
49 2.1 (0.09) 1.92 (0.09) 0.07
50 0.96 (0.1) 1.07 (0.1) 1.12 (0.1) 0.01
51 1.69 (0.1) -0.75 (0.1) 0.09
52 1.53 (0.1) 0.00 (0.1) 0.15
53 1.7 (0.1) 0.64 (0.1) 0.05
54 0.13 (0.07) -0.19 (0.07) 0.06
55 1.2 (0.09) -0.58 (0.09) 0.06
56 0.32 (0.09) 1.39 (0.09) -0.17 (0.09) 0.05
57 0.16 (0.09) 1.3 (0.09) -0.48 (0.09) 0.08
58 1.16 (0.09) -0.33 (0.09) 0.09
59 4.00 (0.09) 3.77 (0.09) 0.15
60 1.46 (0.09) 1.05 (0.09) 0.12
61 1.07 (0.08) 0.73 (0.08) 0.06
62 0.87 (0.08) -0.18 (0.08) 0.06
63 0.95 (0.09) -0.49 (0.09) 0.15
64 2.79 (0.09) 2.64 (0.09) 0.08
65 0.86 (0.09) -0.5 (0.09) 0.01
66 1.04 (0.09) 0.5 (0.09) 0.06
67 1.09 (0.09) -1.7 (0.09) 0.07
68 0.98 (0.09) 0.38 (0.09) -1.97 (0.09) 0.06
69 0.87 (0.09) -1.4 (0.09) 0.01
70 0.53 (0.08) -0.88 (0.08) 0.05
Note 1. The values in ( ) are standard errors.
Note 2. Blanks in the factor loading estimate and its standard error mean that the estimated loading was lower than the pre-specified threshold
(i.e., 0.3) and therefore it was not reported here.
Appendix F
Description of Model Evaluation Discrimination Index (DI)
The DI is calculated using the observed proportion-correct scores for those teachers who
perceived support on an item and those teachers who did not; the method is adapted from Li and
Suen (2013). A larger difference between the proportion-correct scores of these two groups
indicates better model fit, because membership in the "perceived support" or "non-perceived
support" group is based on each examinee's skill classification, which is determined
probabilistically (DiBello, Roussos, & Stout, 2007).
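Because the computation is simple, a minimal sketch is given below; it is not the code used in this study. It assumes a dichotomous response matrix X (teachers by items) and a same-shaped indicator matrix, here called mastery, that flags whether each teacher's estimated attribute profile covers the attributes an item requires according to the Q-matrix; both names are illustrative.

    import numpy as np

    def discrimination_index(X, mastery):
        """Per-item discrimination index, adapted from Li and Suen (2013).

        X       : (n x J) array of 0/1 observed item responses.
        mastery : (n x J) array of 0/1 flags; 1 means the respondent's
                  estimated profile covers the attributes item j requires
                  ("perceived support" for that item).
        Returns the per-item difference in observed proportion endorsing
        the item between the two model-defined groups (assumes both
        groups are non-empty for every item).
        """
        X = np.asarray(X, dtype=float)
        mastery = np.asarray(mastery, dtype=bool)
        di = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            di[j] = X[mastery[:, j], j].mean() - X[~mastery[:, j], j].mean()
        return di

    # Averaging di over items gives the summary DI reported below
    # (about 0.4 for the retained 4-attribute model).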
For the 4-attribute model that was determined to be the best-fitting model, the average
difference in proportion correct between model-predicted "perceived support" teachers and
"non-perceived support" teachers across all items was 0.4: teachers who perceived support had a
mean proportion correct of 0.68, while "non-perceived support" teachers had a mean proportion
correct of 0.26. However, there are no commonly agreed-upon cutoff criteria, so the DI can only
be used as relative model fit evidence.
Figure 10. Description of Model Evaluation Discrimination Index (DI): observed proportion
correct (0 to 1) across items 1-75 for model-classified Masters versus Non-Masters.
Appendix G
In the original analysis, a traditional eigenvalue-based exploratory factor analysis (EFA)
with promax rotation was used to understand the underlying structure of the data. The results
suggested a 10-factor solution, the factor loadings were interpreted, and the resulting model was
conceptually sound. Here, a full-information EFA for dichotomous items was performed using
maximum likelihood and promax rotation with IRTPRO software (Scientific Software
International Incorporated, 2016). The global fit statistics, AIC and BIC, for models with 1
through 13 factors are summarized in Table 30. The AIC was lowest for Model E (8 factors) and
the BIC was lowest for Model B (5 factors); clearly, Model A1 (1 factor) is not empirically
supported as the best model. No RMSEA values were available from the IRTPRO output.
Table 30. Full-Information Exploratory Factor Analysis Model Fit
Model Factors Estimated Parameters AIC BIC
A1 1 140 51677.84 52324.09
A2 2 209 49356.67 50321.43
A3 3 277 48262.33 49540.98
A4 4 344 47236.3 48824.22
B 5 410 46483.42 48376.01
C 6 475 46194.84 48387.47
D 7 539 45966.38 48454.44
E 8 602 45728.64 48507.51
F 9 664 45768.6 48833.67
G 10 725 45809.92 49156.56
H 11 785 45930.19 49553.8
I 12 844 46164.73 50060.69
J 13 902 46876.64 51040.33
The discrepancy between the AIC (8-factor solution) and BIC (5-factor solution) results
demonstrated one of the difficulties in determining the number of factors statistically and/or
objectively. The -2 log-likelihoods were therefore recovered from the AIC values, and deviance
tests were conducted to compare the models. The deviance tests comparing the models with 1
through 13 factors are summarized in Table 31. The results indicated that Model G (10 factors)
was the best-fitting model. Beyond Model G, the -2 log-likelihood values began to increase
rather than decrease. At first, it was thought that this could be because the likelihood function
was becoming too flat and multimodal, causing the algorithm to settle on a local maximum.
However, a closer look at the output revealed that, because of the number of parameters, models
with 11 or more factors were actually not converging. The deviance test between the 9- and
10-factor models indicated that the 10-factor model fit significantly better than the 9-factor
model, although the p-value was 0.047. It is worth noting that, had an 11-factor model actually
converged, it appeared it might not have fit significantly better than the 10-factor model. Based
on the deviance tests, Model G, the 10-factor model, was retained.
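As a sketch of this computation (a hypothetical reconstruction, not the IRTPRO output), the deviance comparison can be reproduced from Table 30 alone by recovering -2logL from each model's AIC and parameter count and referring the drop in deviance to a chi-square distribution.

    from scipy.stats import chi2

    def deviance_test(aic_h0, npar_h0, aic_ha, npar_ha):
        """Likelihood-ratio (deviance) test between nested models.

        -2logL is recovered from AIC as AIC - 2 * (number of parameters);
        the statistic is the drop in -2logL, with degrees of freedom equal
        to the difference in free parameters.
        """
        neg2ll_h0 = aic_h0 - 2 * npar_h0
        neg2ll_ha = aic_ha - 2 * npar_ha
        statistic = neg2ll_h0 - neg2ll_ha
        df = npar_ha - npar_h0
        return statistic, df, chi2.sf(statistic, df)

    # Model F (9 factors) vs. Model G (10 factors), values from Table 30:
    stat, df, p = deviance_test(45768.60, 664, 45809.92, 725)
    print(stat, df, p)  # about 80.68 on 61 d.f., p ~ 0.047, matching Table 31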
Table 31. Deviance Test
Model    Factors    Estimated Parameters    -2logL    H0    Ha    Statistic    d.f.    p-value    Decision at .05 level
A1 1 140 51397.84 A1 A2 2459.17 69 0.000 A2
A2 2 209 48938.67 A2 A3 1230.34 68 0.000 A3
A3 3 277 47708.33 A3 A4 1160.03 67 0.000 A4
A4 4 344 46548.3 A4 B 884.88 66 0.000 B
B 5 410 45663.42 B C 418.58 65 0.000 C
C 6 475 45244.84 C D 356.46 64 0.000 D
D 7 539 44888.38 D E 363.74 63 0.000 E
E 8 602 44524.64 E F 84.04 62 0.033 F
F 9 664 44440.6 F G 80.68 61 0.047 G
G 10 725 44359.92 - - -0.27 60 - G
H 11 785 44360.19 - - -116.54 59 - G