Draft
STRUCTURAL VALIDITY AND RELIABILITY OF
TWO OBSERVATION PROTOCOLS IN
COLLEGE MATHEMATICS
by
LAURA ERIN WATLEY
JIM GLEASON, COMMITTEE CHAIR
YUHUI CHEN
DAVID CRUZ-URIBE
KABE MOEN
JEREMY ZELKOWSKI
A DISSERTATION
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
in the Department of Mathematics in the Graduate School of
The University of Alabama
TUSCALOOSA, ALABAMA
2017
Copyright Laura Erin Watley 2017
ALL RIGHTS RESERVED
ABSTRACT
Undergraduate mathematics education is being challenged to improve, with peer evaluation, student evaluations, and portfolio assessment serving as the primary methods of formative and summative assessment used by instructors. Observation protocols such as the Mathematics Classroom Observation Protocol for Practices (MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP) are another alternative. However, before these observation protocols can be used in the classroom with confidence, a study was needed to examine both the aRTOP and the MCOP2. This study was conducted at three large doctorate-granting universities and eight master's and baccalaureate institutions. Both the aRTOP and the MCOP2 were evaluated across 110 classroom observations during the Spring 2016, Fall 2016, and Spring 2017 semesters. The data analysis supported conclusions regarding the internal structure, internal reliability, and relationship between the constructs measured by both observation protocols.
The factor loadings and fit indices produced by a Confirmatory Factor Analysis (CFA) indicated a stronger internal structure for the MCOP2. Cronbach's alpha was also calculated to analyze the internal reliability of each subscale of both protocols. All alphas were in the satisfactory range for the MCOP2, and most were in the satisfactory range for the aRTOP. Linear regression analysis was also conducted to estimate the relationship between the constructs of both protocols. We found a strong, positive correlation between each pair of constructs, with higher correlations between subscales that do not involve Content Propositional Knowledge. This leads us to believe that Content Propositional Knowledge measures something distinct from the other subscales. As noted above and detailed in the body of the work, we find support for the Mathematics Classroom Observation Protocol for Practices (MCOP2) as a useful assessment tool for undergraduate mathematics classrooms when supplemented with the Content Propositional Knowledge subscale of the aRTOP.
DEDICATION
This dissertation is dedicated to my parents and my husband. To my parents, Douglas
and Edith: Thank you for your unconditional love, guidance, and support. You have always
believed in me and encouraged me to strive for my dreams. I would not be who I am today
without you. To my husband, Kyle: Thank you for the unwavering love, support, and
encouragement. You have made my dreams yours and given me the strength to accomplish
them.
ACKNOWLEDGMENTS
The completion of this dissertation would not have been possible without the support and guidance of a few very special people in my life. I would first like to give thanks to our Lord and Savior for leading me on this path. It is only through His grace and mercy, for without Him none of this would be possible.
Next I would like to thank Dr. Jim Gleason for his endless support and encouragement.
You have been a patient and caring mentor during this process. I cannot tell you how much I
value the time and effort you have put into me and my aspirations. I would also like to thank
the other members of my dissertation committee: Dr. Yuhui Chen, Dr. David Cruz-Uribe,
Dr. Kabe Moen, and Dr. Jeremy Zelkowski. I am forever grateful for the invaluable input
that has led to a strong dissertation.
To the Mathematics Department at The University of Alabama, you hold my gratitude
for dedicating your time to sharing your passion for mathematics with students like me.
I would like to thank the Department Chair and Graduate Program Director when I entered, Dr. Zijian Wu and Dr. Vo T. Liem, for accepting me into the program and encouraging me at the beginning of this process. To the current Department Chair and Graduate Program Director, Dr. David Cruz-Uribe and Dr. David Halpern: your encouragement and advisement in these last years have been vital to my success.
To the MTLC instructors at The University of Alabama, it is because of you that I am
the teacher I am today. You have instilled in me a sense of what it is to love mathematics
and to share that love with others. I will never forget all you have taught me and shared
with me over the years.
To my fellow graduate students at The University of Alabama, I cannot imagine this
experience with anyone else. To Bryan Sandor and Anne Duffee, I am so glad we found each
other. You both have been there for me when the challenges of graduate school seemed too
great to overcome. The University of Alabama will always hold a special place in my heart.
To the seventy-two mathematics instructors who selflessly allowed me to observe your classes for this study: you have done more than just open your classrooms to me; you have opened my eyes to new ideas and expanded my love for teaching. To the institutions that allowed me to observe, I will always cherish the time I spent on your campuses.
To the mathematics department at Troy University, you have instilled in me the foundation that led to this dissertation. You not only shared your passion for mathematics, but you also opened my eyes to the limitless possibilities in mathematics. I will never forget your kind words and support. Troy University will always hold a special place in my heart.
I want to also acknowledge my family members who constantly supported me and be-
lieved that I could achieve my goals. To my parents, Douglas and Edith Watley, thank you
for your relentless encouragement, unfailing support, and unconditional love. None of this
would have been possible without you. Finally, I want to thank my husband, Kyle Scarbrough, and our furry friend, Wesley. You both have stood by me throughout this process. You have been patient with me when I needed it, you celebrated with me when even the littlest things went right, and you loved me through it all.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
CHAPTER 1 - INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
CHAPTER 2 - LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Student Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Reliability and Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Peer Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Portfolios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Observation Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Reformed Teaching Observation Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Mathematics Classroom Observation Protocol for Practices . . . . . . . . . . . . . . . . . . 26
CHAPTER 3 - METHODS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Aim of Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
CHAPTER 4 - RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Internal Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Internal Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Relationship between the Constructs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
CHAPTER 5 - DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Study Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Future Direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
APPENDICES
APPENDIX A. OVERVIEW OF OBSERVATION PROTOCOLS . . . . . . . . . . . . . . . . . . . . 71
APPENDIX B. INSTRUCTOR DEMOGRAPHICS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
APPENDIX C. INSTRUMENTS USED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
APPENDIX D. REGRESSION MODELS AND RESIDUAL PLOTS . . . . . . . . . . . . . . . . . 93
APPENDIX E. IRB CERTIFICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
List of Tables
1 Subscales as Predictors of the RTOP Total Score . . . . . . . . . . . . . . . 23
2 Interpretation of the RTOP Factor Pattern . . . . . . . . . . . . . . . . . . 24
3 aRTOP Items and Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Brief Description of MCOP2 items . . . . . . . . . . . . . . . . . . . . . . . . 29
5 Recommendations for Model Evaluation: Some Rules of Thumb . . . . . . . 40
6 Simple Linear Regression Results . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Pearson’s Product-Moment Correlation . . . . . . . . . . . . . . . . . . . . . 49
8 Demographics Characteristics of the Sample . . . . . . . . . . . . . . . . . . 77
List of Figures
1 Theoretical Model of aRTOP . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2 Theoretical Model of MCOP2 . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Confirmatory Factor Analysis Results: aRTOP . . . . . . . . . . . . . . . . 42
4 Confirmatory Factor Analysis Results: MCOP2 . . . . . . . . . . . . . . . . 43
5 Residual Plots of Regression Model 1 . . . . . . . . . . . . . . . . . . . . . . 45
6 Regression Model 1: Student Engagement and Inquiry Orientation . . . . . . 94
7 Regression Model 2: Student Engagement and Inquiry Orientation . . . . . . 95
8 Regression Model 3: Teacher Facilitation and Inquiry Orientation . . . . 96
9 Regression Model 4: Teacher Facilitation and Content Propositional Knowledge 97
10 Regression Model 5: Inquiry Orientation and Content Propositional Knowledge 98
11 Regression Model 6: Student Engagement and Teacher Facilitation . . . . . 99
CHAPTER 1
INTRODUCTION
Colleges and universities in the United States are being challenged to improve Science, Technology, Engineering, and Mathematics (STEM) undergraduate education (Boyer
Commission on Educating Undergraduates in the Research University, 1998; National Re-
search Council, 1996, 1999, 2002, 2012; National Science Foundation, 1996, 1998), with
college and university STEM professors asked to bear the majority of the weight. These
same college and university professors are experts in their area of study, have received mul-
tiple degrees, and make contributions to their field resulting in awards and publications.
However, they have had little or no formal training in teaching and learning, and obtain
most, if not all, of their professional development in education during graduate school as
teaching assistants. Once they finish graduate school, their primary professional development comes from reflection on the formative and summative assessments provided by peer observation, student evaluations, and assessment of their portfolios. Although each of these evaluation methods can provide useful information, it is difficult to compare and analyze the information obtained from them. In general, these methods rely on broad questions that yield subjective information with low concurrence among raters.
A useful tool in the process of improving the quality of STEM education is the development of aggregate methods to quantify the state of teaching and learning in order to compare different teaching and learning strategies.
Observation protocols provide a quantifiable method useful for improving and strengthening
STEM undergraduate education. The two most common uses for observation protocols are to
support professional development and to evaluate teaching quality (Hora & Ferrare, 2013b).
Observation protocols provide a way to collect numerical data representing observed variables
describing the classroom environment and activities. These data can then be systematically analyzed using statistical techniques to create meaningful ways to evaluate the scholarship that professors use in their teaching.
The quantifiable understanding we gain from the use of observation protocols is invaluable. College and university professors can use this information to identify personal strengths and weaknesses. They can easily compare and contrast the information obtained from semester to semester to see growth in their teaching effectiveness. The use of observation protocols also opens the door for professors to assess their teaching effectiveness in different types of classrooms. The information that can be gained from observation protocols is extensive, at both the individual level and collectively for the university.
Although there are a multitude of observation protocols in use (see Appendix A), the
Mathematics Classroom Observation Protocol for Practices (MCOP2) and an abbreviated
Reformed Teaching Observation Protocol (aRTOP) are the most applicable toward the aim
of this study. The Mathematics Classroom Observation Protocol for Practices (MCOP2)
is used to measure the degree to which a mathematics classroom aligns with the Stan-
dards for Mathematical Practice from the Common Core State Standards in Mathematics
(National Governors Association Center for Best Practices, Council of Chief State School
Officers, 2010); recommendations from “Crossroads” and “Beyond Crossroads” of the Ameri-
can Mathematical Association of Two-Year Colleges (American Mathematical Association of
Two-Year Colleges (AMATYC), 1995, 2004); the Committee on the Undergraduate Program
in Mathematics Curriculum Guide from the Mathematical Association of America (Barker
et al., 2004); and the Process Standards of the National Council of Teachers of Mathematics
(National Council of Teachers of Mathematics, 2000). The MCOP2 is a 16-item protocol that measures the two primary constructs of teacher facilitation and student engagement.
The Reformed Teaching Observation Protocol (RTOP) was designed by the Evaluation
Facilitation Group of the Arizona Collaborative for Excellence in the Preparation of Teachers
to measure “reformed” teaching and is said to be standards based, inquiry oriented, and student centered (Piburn & Sawada, 2000). The RTOP is a 25-item classroom observation protocol scored on a 5-point Likert scale that measures the three primary constructs of lesson design and implementation, content, and classroom culture. Although the RTOP has been the most
widely used observation protocol for mathematics classrooms during the past 10 to 15 years,
the review of literature revealed serious issues with the proposed structure and reliability
that led us to select the ten items we call the abbreviated Reformed Teaching Observation
Protocol (aRTOP).
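Both protocols score a lesson as a set of Likert-style item ratings grouped into construct subscales. The aggregation can be sketched as follows; note that the item names and subscale grouping below are invented placeholders for illustration, not the actual MCOP2 or aRTOP item lists.

```python
# Sketch only: hypothetical item names and subscale groupings,
# not the real MCOP2/aRTOP instrument content.
SUBSCALES = {
    "student_engagement": ["item01", "item02", "item03"],
    "teacher_facilitation": ["item04", "item05", "item06"],
}

def subscale_scores(ratings: dict) -> dict:
    """Sum each subscale's item ratings (e.g., 0-4 on a 5-point scale)."""
    return {name: sum(ratings[item] for item in items)
            for name, items in SUBSCALES.items()}

# One hypothetical classroom observation:
obs = {"item01": 3, "item02": 2, "item03": 4,
       "item04": 1, "item05": 2, "item06": 3}
print(subscale_scores(obs))  # {'student_engagement': 9, 'teacher_facilitation': 6}
```

Subscale totals of this form are the quantities whose internal structure and reliability the study examines.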
There are limitations that we must account for in this study. One limitation is the use of convenience sampling to collect the data. Time and travel costs forced us to use this sampling technique; however, our sample was chosen strategically, drawing from a diverse range of institutions based on enrollment demographics and types of degrees offered, so that it reasonably represents the larger population of undergraduate institutions in the United States. The potential for observer bias is another limitation of this project. Such biases could include gender, ethnicity, age, teaching methodology, and course structure. Although it is impossible to remove the human element from this study, we are well aware of the potential for observer bias and will make every effort to avoid it. Being cognizant of potential biases and taking them into account is the key strategy for avoiding researcher bias (Johnson & Christensen, 2014).
The goal of this project is to gain a clear understanding of both the MCOP2 and the aRTOP and of the relationship between these two protocols as they relate to undergraduate mathematics classrooms. Therefore, we pose the following research questions:
1. What are the internal structures of the Mathematics Classroom Observation Protocol
for Practices (MCOP2) and the abbreviated Reformed Teaching Observation Protocol
(aRTOP) for the population of undergraduate mathematics classrooms?
2. What are the internal reliabilities of the subscales of the Mathematics Classroom Ob-
servation Protocol for Practices (MCOP2) and the abbreviated Reformed Teaching Ob-
servation Protocol (aRTOP) with respect to undergraduate mathematics classrooms?
3. What are the relationships between the constructs measured by the Mathematics Class-
room Observation Protocol for Practices (MCOP2) and the abbreviated Reformed
Teaching Observation Protocol (aRTOP)?
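The second research question concerns internal reliability, which is commonly assessed with Cronbach's alpha. As an illustrative sketch (ours, with invented scores, not the study's data), the statistic can be computed as follows:

```python
# Illustration only: Cronbach's alpha on invented ratings.
# Rows = classroom observations, columns = items of one subscale.

def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance of totals)."""
    k = len(scores[0])                      # number of items
    def var(xs):                            # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

ratings = [  # 5 hypothetical observations on a 4-item subscale (0-3 scale)
    [2, 3, 2, 3],
    [1, 1, 0, 1],
    [3, 3, 2, 3],
    [0, 1, 1, 0],
    [2, 2, 2, 3],
]
alpha = cronbach_alpha(ratings)  # closer to 1.0 means more internally consistent
```

Values of alpha near 1 indicate that the items of a subscale vary together, i.e., that they plausibly measure a single construct.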
CHAPTER 2
LITERATURE REVIEW
Increased accountability in higher education has fostered a need to evaluate and develop
the effectiveness of undergraduate teaching in mathematics. The call for accountability
creates a demand in postsecondary institutions to provide quantifiable evidence of the effec-
tiveness of their academic programs (National Research Council, 2002). Unfortunately, there
is no widely accepted definition or agreed-upon criteria of effective teaching at the undergraduate level (Clayson, 2009). Student evaluations, peer evaluations, observation protocols, and portfolios are the most common methods currently used to evaluate teaching effectiveness. This chapter gives a brief summary of each of these methods and reviews the benefits of and barriers to each.
Content Knowledge for Teaching
What makes someone an effective teacher? Is having a strong understanding of teaching
procedures enough? Or is strong subject matter knowledge the key to effective teaching?
Shulman saw a strong disregard for the content being taught in the educational policies of the 1980s. Shulman (1986) did not want to belittle the importance of the pedagogical skills being highlighted in these policies, but rather to bring attention to the importance of content knowledge for teachers by creating a theoretical framework modeling the categories of content knowledge, which he identified as subject matter content knowledge, pedagogical content knowledge, and curricular knowledge.
Content knowledge for teaching refers to the amount and organization of knowledge in
the minds of the teachers. The understanding of facts and concepts is only a part of subject
matter content knowledge. It requires a much deeper understanding of the structure of the
subject matter. It is not enough for a teacher to merely understand something; they must also know why it is so, when it can be applied, and when it weakens or no longer applies (Shulman, 1986).
Subject matter knowledge is necessary, but not a sufficient condition for someone to be
an effective teacher. Shulman’s second category of content knowledge, pedagogical content
knowledge, is a combination of the teacher’s subject matter knowledge and the knowledge
utilized to teach that subject. A few of the examples Shulman provides of pedagogical content knowledge are (a) the knowledge needed to represent and formulate the subject in a way that makes it comprehensible to others, (b) the knowledge of what makes a particular subject difficult or easy to comprehend, and (c) the knowledge of conceptions and misconceptions that students bring with them from previous learning (Shulman, 1986).
Curricular knowledge is the knowledge of what programs are designed to teach a specific
subject to a given student level. It also includes the knowledge of the variety of materials
available to teach a specific subject. Most importantly, it is the knowledge teachers use to
select or reject a particular curriculum in a given circumstance. In addition to the knowledge
of curriculum materials, curricular knowledge includes lateral curriculum knowledge (rela-
tionship of the content to other subjects) and vertical curriculum knowledge (relationship of
the content to previous and future learning of the same subject).
Shulman’s theoretical framework was designed to focus on the nature and type of knowl-
edge needed for teaching a subject. He did not provide us with a list of necessary knowl-
edge for any particular subject areas, rather Shulman’s paper acted as a catalyst for other
researchers to expand on his ideas into their particular subjects. In 2008, Ball and her
colleagues examined and expanded Shulman’s ideas in the context of mathematics. Ball,
Thames, & Phelps (2008) developed in more detail Shulman’s idea of subject matter knowl-
edge for teachers in the context of mathematics.
Ball, Thames, & Phelps (2008) divided subject matter knowledge into three domains:
common content knowledge, specialized content knowledge, and horizon content knowledge.
Common content knowledge (CCK) is the mathematical knowledge that teachers use but that is not specialized to the work of teaching; Ball notes that this knowledge is also used in settings other than teaching.
Ball, Thames, & Phelps (2008) define specialized content knowledge (SCK) as the knowl-
edge and skills unique to teaching mathematics. Ball, Thames, & Phelps (2008) state, “this
work (SCK) involves an uncanny kind of unpacking of mathematics that is not needed – or
even desirable – in settings other than teaching” (p. 400). The distinction between common
content knowledge and specialized content knowledge, while clear in the elementary school
context, becomes more difficult to measure at the undergraduate level.
Horizon content knowledge is the third domain of Ball, Thames, & Phelps (2008) and
corresponds to portions of Shulman’s curricular knowledge. This includes the knowledge
of how to introduce a specific topic with the prior and future understandings of this topic
in mind. There is still some concern over whether this should be solely in the category of
subject matter knowledge or if it should be included in other categories.
Ball also expanded Shulman’s idea of pedagogical content knowledge into three domains.
Ball, Thames, & Phelps (2008) tell us, “two domains - knowledge of content and students (KCS) and knowledge of content and teaching (KCT) - coincide with the two central dimensions of pedagogical content knowledge identified by Shulman” (p. 402). KCS is the combination of teachers' knowledge about their students and about mathematics. Teachers must understand how their students will approach a particular problem and the struggles they will encounter. Alternatively, KCT is the combination of teachers' knowledge about mathematics and about teaching. For example, teachers have to know the order in which to introduce topics and what to spend more time on. Ball also included Shulman’s curriculum knowledge as
a domain of pedagogical content knowledge based on the work of Grossman, Wilson, & Shulman (1989). Although Ball placed curriculum knowledge under pedagogical content knowledge, there is still some concern over whether it belongs only there or in several different categories.
Shulman (1986) poses the question of how expert students become novice instructors.
We must ask ourselves, how do teachers acquire knowledge of teaching? Most college and
university professors are experts in the content they are teaching, but most do not have any
formal background in education. Most professors have completed few, if any, teacher preparation programs and typically have not taken any education courses (Speer & Hald, 2008). The majority of their training comes from the limited supervised experience they obtain as graduate teaching assistants.
Speer & Hald (2008) assert that mathematics education research in K-12 has sought to document the extent to which teachers possess pedagogical content knowledge and the effect it has on student learning and teaching practices. Similar research in higher education is just now emerging and is relatively scarce. The available research on pedagogical content knowledge in higher education focuses on Graduate Teaching Assistants (GTAs) and their
training programs. The dissertation of Ellis (2014) gives us a wealth of information on GTA
professional development programs and GTA beliefs and practices. Another dissertation
focused on the differences in the beliefs and practice of international and U.S. domestic
mathematical teaching assistants (Kim, 2011). Kung & Speer (2007) focus their research on
the need for professional development activities for GTAs and the empirical research needed
to create these activities.
Being knowledgeable in mathematics is necessary, but alone is not a sufficient condition
for an instructor to create good learning opportunities for students (Speer & Hald, 2008). If we could improve mathematics instructors' knowledge of student thinking, Kung & Speer (2007) believe this would foster better learning opportunities for students. The hope is that this will, in turn, lead to improved student achievement.
Student Evaluations
With the increase in accountability within higher education, student evaluations are
becoming even more widely used as a measure of quality in university teaching. Clayson
(2009) brings to our attention that student evaluations of teaching are one of the most
well researched, documented, and long lasting debates in the academic community. In fact,
d’Apollonia and Abrami (1997) stated “most postsecondary institutions have adopted student ratings of instruction as one (often the most influential) measure of instructional effectiveness” (p. 1198). Chen and Hoshower (2003), as well as Benton and Cashin (2012), propose that student evaluations are commonly used to provide formative feedback to faculty for improving teaching, course content, and structure; a summary measure of teaching effectiveness for promotion and tenure decisions; and information to students for the selection of courses and teachers.
Student evaluations of instruction were first introduced in the United States in the mid-1920s (Algozzine et al., 2004). Since then there have been waves of research, including studies that have verified the validity and effectiveness of student ratings. However, student evaluations have not always been met with complete acceptance, so it is important to discuss some of the most common misconceptions.
The literature on student evaluations varies widely. Although some believe these claims are factual, Benton and Cashin (2012), Feldman (2007), and Kulik (2001) identify as myths the beliefs that student evaluations are (a) only a measure of showmanship, (b) indicators of concurrence only at a low level, (c) unreliable and invalid, (d) time and day dependent, (e) student grade dependent, (f) not useful in the improvement of teaching, and (g) affected by leniency in grading resulting in high evaluations. These myths persist even though there is over fifty years of credible research showing the reliability and validity of student evaluations. This research has been ignored for reasons that include personal biases, suspicion, fear, ignorance, and general hostility toward any evaluation process (Feldman, 2007; Benton & Cashin, 2012).
Since teaching comprises many characteristics, Spooren, Brockx, and Mortelmans (2013) believe it is widely accepted that student evaluations are multidimensional. Jackson et al. (1999) warn that there has been a dispute in the research as to the number and nature of these dimensions. As a result, student evaluation instruments vary greatly in item content and number of items.
In the 1990s, researchers including Abrami & d’Apollonia (1990) debated the use of global constructs for the evaluation of teaching effectiveness. Eventually they reached a compromise that both specific dimensions and global measures could be used for an overall rating. More recent research supports the multidimensionality of teaching by reporting that higher-order factors can reflect general teaching effectiveness (Apodaca & Grad, 2005; Burdsal & Harrison, 2008; Cheung, 2000). The research of Burdsal and Harrison (2008) and Spooren et al. (2013) provides evidence that both a multidimensional profile and an overall evaluation are valid indicators of students’ perception of teacher effectiveness.
Reliability and Validity
Reliability refers to consistency, stability, and generalizability of data, and in the context
of student evaluations, most often refers to the consistency of the data (Cashin, 1995). The
consistency of student evaluations is highly influenced by the number of raters: in general, the more raters, the more dependable the ratings. Multiple classes also provide more reliable information than a single class. Benton et al. (2012) suggest using more than one class when there are fewer than 10 raters in order to improve reliability.
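The intuition that aggregate ratings become more dependable as raters are added can be illustrated (this example is ours, not part of the studies cited) with the Spearman-Brown prophecy formula, which predicts the reliability of an average of k parallel ratings from the reliability of a single rating:

```python
# Illustration only: Spearman-Brown prophecy formula for the
# reliability of a mean of k raters, given single-rater reliability r.

def spearman_brown(r_single: float, k: int) -> float:
    """Predicted reliability of the mean of k parallel ratings."""
    return k * r_single / (1 + (k - 1) * r_single)

# With a modest single-rater reliability of 0.30, aggregate reliability
# grows quickly as raters are added:
for k in (1, 5, 10, 25):
    print(k, round(spearman_brown(0.30, k), 2))
# prints: 1 0.3 / 5 0.68 / 10 0.81 / 25 0.91
```

This is why a class with few raters benefits from pooling ratings across multiple classes.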
The validity of student evaluations has been extensively debated over the years, with researchers often disagreeing as to the extent to which student evaluations measure the construct of teaching effectiveness. A primary driver of this debate is the lack of agreement on what defines effective teaching. One method to determine the validity of student evaluations involves their relation to other forms of evaluation. The agreement or disagreement of these other evaluation methods can give us greater insight into the validity of student ratings.
Logically, the best way to measure effective teaching would be to base it on the re-
sulting student learning and understanding. One would assume that a teacher who has
high student evaluations would also have highly successful students. Davis (2009)
states, “Ratings of overall teaching effectiveness are moderately correlated with independent
measures of student learning and achievement. Students of highly rated teachers achieve
higher final exam scores, can better apply course material, and are more inclined to pursue
the subject subsequently” (p. 534).
In a study at Minot State University, Ellis, Burke, Lomire, and McCormack (2003)
found that courses with the highest average grades were taught by teachers who received
the highest ratings from their students. The study comprised 165 undergraduate
courses taught by 24 instructors. Ellis reported a weak but significant positive correlation
(r = 0.35, p < .01) between average ratings of teachers and average grades received
by students. Ellis et al. (2003) warn that this relation may be due to numerous factors, but
that the most likely explanation is that giving higher grades to students results in more
favorable student evaluations. Clayson (2009) affirms that “as statistical sophistication has
increased over time, the reported learning/SET (student evaluation of teaching) relationship
has generally become more negative” (p. 26).
Although comparisons of student evaluations with colleague ratings, expert judges’ ratings,
ratings by graduating seniors and alumni, and measures of student learning provide evidence
of validity, many researchers are still concerned that students can be easily swayed by
superficialities (Socha, 2013). Researchers are also troubled by the question of whether
students are able to be effective evaluators of teaching competency. Algozzine et al. (2004)
warn that student ratings should only be influenced by characteristics that represent effective
teaching and not by sources of bias. Marsh (1984) defines a bias of student ratings as one
“substantially and causally related to the ratings and relatively unrelated to other indicators
of effective teaching” (p. 709).
One of the most controversial and most frequently discussed concerns is that high ratings
can be based solely on the faculty member’s “entertaining” ability. This phenomenon is
known as the Dr. Fox Effect, after a study in which an actor delivered a lecture (Ware Jr &
Williams, 1975). Although “Dr. Fox” did not cover any material, he received a high rating
because of his “entertaining” value. Wachtel (1998) states, “This was thought to demonstrate
that a highly expressive and charismatic lecturer can seduce the audience into giving
undeservedly high ratings” (p. 200). Since the original study, Marsh (1982) has cited several
experts in the field who have raised questions about its validity.
In classrooms where there are incentives to understand the material, earlier studies found
that content covered has a much greater impact on student ratings than expressiveness.
Sojka, Gupta, & Deeter-Schmelz (2002) found that students and teachers perceive differently
how a faculty member’s “entertaining” ability affects student ratings: faculty believed
that the ability to entertain has a great influence on ratings, while students strongly
disagreed. Shevlin, Banyard, Davies, and Griffiths (2000) state that the expressiveness of
teachers is positively correlated with student evaluations regardless of the content taught.
They found that the charisma factor accounted for 69% of the variation in the rating of a
teacher’s ability as determined by student ratings (Shevlin et al., 2000).
The relationship between gender and student evaluations remains undetermined. One
study by Ellis et al. (2003) found that the gender of the instructor was not significantly
correlated with student ratings, while another study by Centra (2009) indicated gender
preferences, mainly in the ratings of female instructors by female students. The research
of Centra and Gaubatz (2000) agreed with this conclusion but warned that even though
these rating differences are statistically significant, they have little practical importance.
Compared to other instructor variables, there is relatively little quantitative data
exploring instructor race; according to Merritt (2008), empirical research examining the
relationship between race and student evaluations is lacking. A study conducted by
Hamermesh & Parker (2005) of 436 classes reported that minority faculty members
received lower teaching evaluations than majority instructors. Non-native English speakers
also received substantially lower ratings than their native-speaking counterparts.
Logically, faculty rank will have an impact on student evaluations. In a study conducted
by Centra (2009) with 1539 teaching assistants, the overall evaluation of the quality
of teaching in a course had a mean score of 3.83 on a 5-point scale, while their higher-ranking
colleagues, assistant professors and above, scored about a third of a standard deviation
higher on the overall evaluation. There is some question as to whether rank or years of
experience is being represented in this study, since the two correlate. However, Ellis et al.
(2003) found no significant correlation between years taught and the ratings of the same
instructor by students.
Like instructor variables, individual student variables can also influence evaluations of
teaching. Variables studied include age, gender, motivation, and personality of the student.
Also, individual academic characteristics of the student have been studied. Some of these
variables include scholastic level of the student, GPA, and reason for taking the course. Age
(Centra, 1993), gender (Feldman, 1977, 1993), and the level of students (McKeachie, 1979)
are not currently being researched, but have been in the past.
Student GPA and college-required classes are two individual academic characteristics
currently being researched. In “Tools for Teaching”, Davis (2009) summarizes the research
on the relationship between student evaluations and student GPA. Citing several authors,
Davis (2009) concludes that there is little to no relationship for this particular variable
(Marsh & Roche, 2000; Abrami, 2001). Conversely, research has found a slight bias against
college-required courses. This is understandable given that students may be required to
take a class in which they have little interest or background. Centra (2009) suggests that
even though the bias is slight, institutions should take it into account when reviewing
student evaluation data.
The expected grade is probably the most researched student variable related to student
evaluations of instruction. Eiszler (2002) found that student evaluations are a small
contributor to grade inflation over time. Centra (2009) reports a correlation of .20 between
expected grades and teacher effectiveness, while Ellis et al. (2003) state, “the magnitude
of the correlation has been in the range of .35 to .50, meaning that roughly 12% to 25%
of the variance in ratings might be accounted for by varying grading standards” (p. 39).
They mention several researchers, including Mehdizadeh (1990) and Krautmann & Sander
(1999), who found a positive correlation between the expected (or received) course grade
and student evaluations.
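The “variance accounted for” figures quoted here follow directly from squaring the correlation coefficient. A brief sketch of this arithmetic (the helper function is illustrative, not part of any cited study):

```python
# Shared variance implied by a correlation coefficient is r squared.
def variance_explained(r: float) -> float:
    """Proportion of variance in one variable accounted for by the other."""
    return r ** 2

# The .35 to .50 range reported by Ellis et al. (2003):
for r in (0.35, 0.50):
    print(f"r = {r:.2f} -> r^2 = {variance_explained(r):.0%}")
# roughly 12% for r = .35 and 25% for r = .50
```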
Courses themselves also have variables that the instructor cannot influence. For instance,
class size, topic difficulty, and the level of the course are all characteristics of the course
beyond the control of the instructor. The time of day a class is taught is another course
variable that has interested researchers in the past (Aleamoni, 1981; Feldman, 1978). The
relationship between student evaluations of teaching and course characteristics has been
researched over the years, with inconsistent results.
Research on course variables varies widely. Student evaluations did not significantly
correlate with the level of the course according to Ellis et al. (2003). However, lower-level
classes generally receive lower ratings than higher-level classes, especially graduate-level
classes, though this difference tends to be small (Benton & Cashin, 2012). Benton et al.
(2012) suggest the development of local comparative data to help control for this difference.
Class size can also have an effect on student evaluations. Most researchers have found that
instructors of larger classes receive lower evaluations. Ellis et al. (2003) report that class
size correlated significantly with ratings, while Hoyt and Lee (2002) found that it was not
always statistically significant.
The academic discipline of the class being taught can also affect student ratings. In a
study by Centra (2009), courses in the natural sciences, mathematics, engineering, and
computer science had a mean rating of 3.87 on a 5-point scale, while humanities courses
(English, history, language) had a mean of 4.04, a difference of about a third of a standard
deviation. Some have attributed this difference to the growth of knowledge in the natural
sciences causing teachers to cover increasing amounts of material. The meta-analysis by
Clayson (2009) supported these differences and stated that academic discipline is an
important variable to consider when reviewing student evaluation data.
Course load and difficulty are correlated with student evaluations, though not strongly.
Surprisingly, the correlation is positive: students tend to give higher ratings to more
difficult courses that call for hard work (Marsh, 2001). Centra (2003), using a large database
of classes, found (not surprisingly) that classes that were either too elementary or too
difficult were rated poorly; classes balanced in the middle were rated highest.
Consideration must also be given to the manner (paper vs. electronic) in which student
evaluations are collected. Ballantyne (2003), Bullock (2003), Spooren et al. (2013),
and Tucker, Jones, Straker, and Cole (2003) offer the following reasons for the move
from paper to electronic student evaluations: timely and accurate feedback, no interruption
of class time, more accurate analysis of data, ease of access to students, greater student
anonymity, decreased faculty influence, more detailed written comments, and lower cost and
time demands for administrators.
One of the major concerns about online student evaluations is the response rate. Online
survey response is much lower than that of traditional paper surveys, with Dommeyer, Baum,
Hanna, and Chapman (2004) reporting an average response rate of 70% for in-class surveys
and 29% for online surveys. Although fewer surveys are returned, studies by Leung and
Kember (2005) show no significant differences between the data obtained from paper and
electronic evaluations. These results lead us to conclude that the manner of administration
(paper vs. electronic) does not affect the validity of student evaluations.
Since the very first reports on student evaluations by Remmers and Brandenburg (1928,
1930; 1927), there have been thousands of reports covering various topics on these evalua-
tions. Student evaluations can provide useful information about the instructor’s knowledge,
organization and preparation, and ability to communicate clearly. According to Chen and
Hoshower (2003), “while the literature supports that students can provide valuable informa-
tion on teaching effectiveness given that the evaluation is properly designed, there is a great
consensus in the literature that students cannot judge all aspects of faculty performance” (p.
73). Despite the controversies, student evaluations are still the most widely used evaluation
method. In general, researchers are in agreement that no single source of evaluation, includ-
ing student evaluations, can provide sufficient information in order to make valid judgments
on effective teaching.
Peer Evaluations
Compared to the extensive research on student evaluations of teaching, few studies exist
on peer evaluations, and those that do are limited in scope. The National Research Council
(2002) found
that direct observation of teachers over an extended period of time by their peers can be
a highly effective means of evaluating an individual instructor. Even though professional
accountability in higher education has grown over the years, peer evaluations are not a
dominant practice in the assessment of teaching at most colleges and universities (Thomas,
Chie, Abraham, Raj, & Beh, 2014).
The scope of peer evaluations is not limited to what can be observed in a classroom; it
can include course outlines, syllabi, and teaching materials. Hatzipanagos and Lygo-Baker
(2006) suggest that peer reviews include observation of lectures and tutorials, monitoring
of online teaching, examination of curriculum design, and the use of student assessments.
Peer evaluations also create ways to improve adherence to the ethical standards set forth
by the university. Based on the above, we note that peer evaluations are more than just
classroom observations and can be instrumental in curriculum and professional development.
There are many benefits of peer review in developing faculty members. Peer reviews further
the development of teachers through expert input drawn from colleagues’ experience and
knowledge (Kohut, Burnap, & Yon, 2007). Peer evaluations are not just about identifying
places that need improvement, but also strengths. The benefits concluded from the literature
by Thomas et al. (2014) include validating teaching practices already being implemented,
inspiring different teaching perspectives, fostering learning about teaching methods, and
developing peer respect. Both the observer and the teacher being observed can use this
evaluation process to reflect on how to improve their teaching methods (Kohut et al.,
2007).
According to Bernstein, Jonson, and Smith (2000), teaching grows to its greatest potential
only through feedback gained from knowledgeable peers. However, Thomas et al. (2014)
warn that peer evaluations are most beneficial to quality teaching development when the
peer review program includes a clear, straightforward, and transparent structure; engagement
in professional discussion and debate among participants; a focus on the development of
teaching and learning to maintain motivation and commitment toward the peer review
process; and a willingness to consider the difficulties that may arise when engaging in
professional development activities.
Unfortunately, there are also many barriers to peer review of teaching unless the observa-
tions are part of a carefully conceived, systematic process (Wachtel, 1998). One of the major
barriers to peer observation is the low level of concurrence among observers, owing to personal
biases about teaching behaviors and to observer inexperience. Although faculty are experts
in their area of study, most have no formal training in education. Another barrier is that
peer evaluations generally are not a part of the culture of teaching and learning. Researchers
seem to agree that peer evaluation must be coupled with other evaluation methods in order
to provide accurate information.
Despite these reservations, peer evaluations are still an effective way to improve teaching.
Peer evaluation can provide the opportunity for faculty to learn how to be more effective
teachers, to get regular feedback on their classroom performance and to receive support from
colleagues. Educators advocate multiple sources for teaching improvement or for teaching
evaluation, and classroom observations provide a source of input that can be balanced against
some of the other more common forms of instructional feedback such as student evaluations
(Wachtel, 1998). Most importantly, peer evaluation can provide a third-party observation of
what is occurring in a college classroom. This outside perspective can foster a renewed
satisfaction in teaching.
It is becoming obvious to increasing numbers of faculty that successful teachers are
not only experts in their fields of study but also knowledgeable about teaching strategies
and learning theories and styles, committed to the personal and intellectual development
of their students, cognizant of the complex contexts in which teaching and learning occur,
and concerned about colleagues’ as well as their own teaching (Keig & Waggoner, 1994).
The use of peer evaluations can provide a wealth of information that can lead to enhanced
teaching. Although there are numerous problems that raise concern about the validity of
peer evaluations, peer evaluation can provide a vast amount of knowledge when coupled
with other evaluation methods.
Portfolios
Unlike other evaluation methods, which can shed light on only a small part of a teacher’s
effectiveness, portfolios have the ability to convey a broad range of a teacher’s skills, attitudes,
philosophies, and achievements. Seldin and Miller (2009) define a portfolio as a reflective,
evidence-based collection of materials that document teaching, research, and service. A
professor’s portfolio usually includes an assertion about their teaching effectiveness along
with supporting documentation (Burns, 2000). This could include sample syllabi, student
work, student ratings, and comments from both students and colleagues.
There are many benefits of portfolios. Portfolios are not simply an exhaustive collection
of all the documents and materials a teacher has, but rather a balanced listing of professional
activities that provide evidence of teacher effectiveness (Seldin & Miller, 2009). They allow
faculty to exhibit their teaching accomplishments to colleagues (Laverie, 2002). Burns
(2000) states that some institutions are beginning to require a portfolio as part of their
post-tenure review. The key benefits of a portfolio, according to Seldin (2000), are that it
encourages faculty to reflect on their teaching and to improve it.
Portfolios also have negative qualities. Although numerous researchers praise the
portfolio’s ability to improve teaching, Burns (2000) affirms that there are no experiments
supporting this claim and even goes on to state, “The only experiment that I could locate
that compared teaching ratings before and after portfolio construction concluded that these
ratings did not improve significantly” (p. 45). When researchers studied the impact of a
mandatory portfolio, the concern was that creating the portfolio became the focus rather
than improving teaching. Other faculty concerns include: Is the time and energy it takes
to prepare a portfolio worth it? Does the administration know how to use the information
collected from the portfolio? For new faculty, would a portfolio not be counterproductive?
Despite all these questions, little research has been conducted to answer them. Although
a portfolio has the potential to be a very useful tool in the assessment of teaching
effectiveness, without its reliability and validity being known, what do portfolios really
represent? Given the research that exists, we have to view portfolios with some reservation.
Like all other evaluation methods, portfolios cannot stand alone but are one more tool that,
when combined with other methods, can be useful in evaluating teacher effectiveness.
Observation Protocols
Classroom observations are direct observations of teaching practices, in which the observer
takes notes and/or codes teaching data either live in the classroom or from a recorded video
lesson. The two most common uses for observation protocols are to support professional
development and to evaluate teaching quality (Hora & Ferrare, 2013b). We note that while
classroom observations are a very common practice in K-12 schools, they are less common
in postsecondary settings, where further theoretical development and testing are needed.
Observation protocol development for K-12 is more advanced, due in part to policies
governing teaching evaluations (Hora, 2013); postsecondary observation protocols are
traditionally less developed in terms of psychometric testing and conceptual development
(Hora & Ferrare, 2013b). Unfortunately, observation protocols in higher education trail far
behind those of K-12 (Pianta & Hamre, 2009). The most recently developed and currently
utilized observation protocols in colleges and universities center on science, technology,
engineering, and mathematics (STEM) teaching (Hora & Ferrare, 2013b).
Developing broad-based methods for improving the quality of STEM education is on the
minds of institutions, disciplines, and national agencies (Seymour, 2002). Smith, Jones,
Gilbert, and Wieman (2013) cite several of these agencies that stress more effective teaching
in STEM courses, such as the President’s Council of Advisors on Science and Technology
Engage to Excel report (2012) and the National Research Council Discipline-Based Education
Research report (2012). The shift in the teaching and learning of science and mathematics
toward student-centered instruction and active learning is growing (Freeman et al., 2014;
Gasiewski, Eagan, Garcia, Hurtado, & Chang, 2012; Michael, 2006).
In The Greenwood Dictionary of Education (Collins & O’Brien, 2003), student-centered
learning (SCL) is defined as an “approach in which students influence the content, activities,
materials, and pace of learning” (p. 338-339). If SCL is applied correctly, it can lead to
growth in student enthusiasm for learning, retention of knowledge, understanding, and
attitudes toward the subject being taught. Michael (2006) defines active learning as engaging
students in activities that require some sort of reflection on the ideas involved. Students
should be actively gathering information, thinking, and problem solving during a class that
uses active learning. The meta-analysis by Freeman et al. (2014) of classrooms using active
learning reported that average examination scores improved by 6% over traditional lecturing.
They also reported that students in traditional lecture classes were 1.5 times more likely to
fail than those in active learning classes.
Sawada et al. (2002) warn that the development and use of an evaluation instrument
supporting these efforts is problematic and controversial, and that higher education
institutions find it difficult to identify alignment of teaching to this construct. Walkington
et al. (2012) believe that classroom observations are one of the best methods to combine
with student achievement to get a measure of teaching effectiveness. However, “generic
observation instruments aimed at all disciplines and employed by observers without
disciplinary knowledge are not sufficient” (p. 3). A protocol generic enough to be useful
in both a mathematics and a history class cannot fully capture the learning and teaching
process of either discipline (Hora & Ferrare, 2013b). Given the obvious differences between
disciplines, it is not reasonable to expect a single protocol to be both useful and generic
enough to work for all types of subject matter.
There are two main types of observation protocols: unstructured (open-ended) and
structured (Hora & Ferrare, 2013b). Unstructured protocols may not even indicate what the
observer should be looking for, and in general they do not have fixed responses. Although
responses to open-ended questions can be very useful to the observer and the instructor,
the data are highly dependent on the observer and cannot easily be standardized (Smith et
al., 2013), making it difficult to compare data across multiple classrooms.
On the other hand, observers respond to a structured protocol with a common set of
statements or codes (Smith et al., 2013). The data produced are easily standardized and can
be used to compare multiple classrooms. The drawback to most structured protocols is the
requirement of some form of multi-day training in order to achieve inter-rater reliability
(Sawada et al., 2002). Observers must also pay close attention to the behavior of the teacher
and/or the students to assess the predetermined classroom dynamics.
It is impossible to include all the observation protocols that are used to evaluate under-
graduate courses, but Appendix A presents a brief summary of some of the existing protocols.
The two protocols used for this study are described in more detail below.
Reformed Teaching Observation Protocol
The Reformed Teaching Observation Protocol (RTOP) is probably the most widely used
STEM-specific observation protocol to date. This instrument was designed by the Evaluation
Facilitation Group of the Arizona Collaborative for Excellence in the Preparation of Teachers
(ACEPT) to measure “reformed” teaching. Sawada et al. (2002) report that during the
development of the RTOP the Evaluation Facilitation Group (EFG) affirmed that “the
instrument would have to be focused on both science and mathematics, standards based,
focused exclusively on reform rather than the generic characteristics of good teaching, easy
to administer, appropriate for classrooms K-20, valid, and reliable” (p. 246).
The RTOP is a 25-item classroom observation protocol on a 5-point Likert scale that is
said to be standards-based, inquiry-oriented, and student-centered. The items are divided
into three subsets: Lesson Design and Implementation (5 items), Content (10 items), and
Classroom Culture (10 items). The first subset, containing items 1-5, is designed to capture
what the reference manual calls the ACEPT model for reformed teaching. The second subset
focuses on content and is divided into two parts: Propositional Pedagogic Knowledge (items
6-10) and Procedural Pedagogic Knowledge (items 11-15). The third subset is likewise
divided into two equal parts that analyze classroom culture, called Communicative
Interactions (items 16-20) and Student/Teacher Relationships (items 21-25).
After the initial development, testing, and redesign, a team of nine trained observers
collected 287 RTOP forms from observations of 141 mathematics and science classrooms.
The team consisted of seven graduate students and two faculty members. The classrooms
observed ranged across middle schools, high schools, community colleges, and universities.
Of the 141 classrooms observed, only 38 (27%) were mathematics classrooms, and of those
only 13 (34%) came from community college and university observations. Since less than
10% of the sample focused on the undergraduate mathematics classroom, and since these
were exclusively mathematics courses designed for pre-service elementary teachers, a more
thorough analysis is necessary to determine the reliability and structure of the instrument
for general undergraduate mathematics classrooms.
Using the data collected by the nine trained observers, inter-rater reliability was estimated
by computing a best-fit linear regression of the ratings of one observer on those of the
other, yielding a correlation coefficient of 0.98 and a shared variance between observers of
95%. Additionally, Cronbach’s alpha for the whole instrument was reported to be a
remarkably high 0.97, implying a high degree of uniformity across items, with the sub-scale
alphas ranging from 0.80 to 0.93 (Piburn & Sawada, 2000; Sawada et al., 2002). This
suggests that the RTOP has extremely strong internal consistency and could likely retain
reasonable reliability with significantly fewer items.
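Cronbach’s alpha, the reliability coefficient cited throughout this section, can be computed directly from item-level scores. The sketch below uses hypothetical ratings, not the ACEPT data:

```python
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)
def cronbach_alpha(scores):
    """scores: one row of item ratings per classroom observation."""
    k = len(scores[0])  # number of items

    def variance(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical example: four observations scored on a three-item subscale
ratings = [[4, 3, 4], [2, 2, 1], [3, 3, 3], [1, 0, 1]]
print(round(cronbach_alpha(ratings), 2))  # 0.95, i.e., high internal consistency
```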
The RTOP is divided into five sub-scales in order to test the hypothesis that “Inquiry-
Orientation” is a major part of the structure of the RTOP (Piburn & Sawada, 2000). The
subscales and their R-squared values are shown in Table 1. Piburn & Sawada note that the
high R-squared values offer very strong support for the construct validity. However, such
high predictability of the total score by four of the sub-scales implies, at most, a two-factor
structure.
Table 1
Subscales as Predictors of the RTOP Total Score

                                                              R-squared as a
Subscale                                                      Predictor of Total
Subscale 1: Lesson Design and Implementation                  0.956
Subscale 2: Content Propositional Pedagogic Knowledge         0.769
Subscale 3: Content Procedural Pedagogic Knowledge            0.971
Subscale 4: Classroom Culture Communicative Interactions      0.967
Subscale 5: Classroom Culture Student/Teacher Relationships   0.941

(Piburn & Sawada, 2000, p. 12)
Piburn & Sawada (2000) also conducted an exploratory factor analysis of the 25 RTOP
items using a database containing 153 classroom observations; an earlier reliability study
had implied that the number of principal components would be very small. Two strong
factors and one weak factor were found to be appropriate and interpretable. Component 1
had an eigenvalue of 14.72, while components 2 and 3 had significantly lower eigenvalues of
2.08 and 1.18, respectively, indicating how weakly components 2 and 3 contribute to the
overall structure. This is further illustrated by the factor pattern reported in the RTOP
reference manual (Table 2).
Table 2
Interpretation of the RTOP Factor Pattern

RTOP Items (asterisks indicate factor loading magnitudes; see key below)
1. The instructional strategies and activities respected students’ prior knowledge and the preconceptions inherent therein. **
2. The lesson was designed to engage students as members of a learning community. ****
3. In this lesson, student exploration preceded formal presentation. ****
4. This lesson encouraged students to seek and value alternative modes of investigation or of problem solving. ****
5. The focus and direction of the lesson was often determined by ideas originating with students. ***
6. The lesson involved fundamental concepts of the subject. ****
7. The lesson promoted strongly coherent conceptual understanding. ***
8. The teacher had a solid grasp of the subject matter content inherent in the lesson. **
9. Elements of abstraction (i.e., symbolic representations, theory building) were encouraged when it was important to do so. *
10. Connections with other content disciplines and/or real world phenomena were explored and valued. **
11. Students used a variety of means (models, drawings, graphs, concrete materials, manipulatives, etc.) to represent phenomena. **
12. Students made predictions, estimations and/or hypotheses and devised means for testing them. ****
13. Students were actively engaged in thought-provoking activity that often involved the critical assessment of procedures. ***
14. Students were reflective about their learning. ***
15. Intellectual rigor, constructive criticism, and the challenging of ideas were valued. ***
16. Students were involved in the communication of their ideas to others using a variety of means and media. ***
17. The teacher’s questions triggered divergent modes of thinking. **
18. There was a high proportion of student talk and a significant amount of it occurred between and among students. ***
19. Student questions and comments often determined the focus and direction of classroom discourse. **
20. There was a climate of respect for what others had to say. * **
21. Active participation of students was encouraged and valued. ** *
22. Students were encouraged to generate conjectures, alternative solution strategies, and ways of interpreting evidence. **
23. In general the teacher was patient with students. ****
24. The teacher acted as a resource person, working to support and enhance student investigations. ****
25. The metaphor “teacher as listener” was very characteristic of this classroom. ***

Key: * (0.50-0.59), ** (0.60-0.69), *** (0.70-0.79), **** (0.80-0.99)
(Piburn & Sawada, 2000, p. 16)
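The proportion of total variance captured by each component is simply its eigenvalue divided by the number of items (25 for a correlation-matrix analysis of the RTOP). A quick check of this arithmetic, using the eigenvalues reported above:

```python
# Variance share of each principal component: eigenvalue / number of items
N_ITEMS = 25
eigenvalues = {1: 14.72, 2: 2.08, 3: 1.18}  # Piburn & Sawada (2000)

for component, ev in eigenvalues.items():
    print(f"Component {component}: {ev / N_ITEMS:.1%} of total variance")
# Component 3 accounts for 1.18/25, under five percent of the variance
```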
Factor 1, named “inquiry orientation”, draws heavily on all five sub-scales with the
exception of sub-scale 2, while factor 2, labeled “content propositional knowledge”, draws
exclusively on sub-scale 2. Factor 3, labeled “student/teacher relationship”, accounts for
less than five percent of the variance and has only three items that load on it. As such, it
is believed that a subset of the items from the RTOP could be used as an abbreviated
protocol measuring the same constructs as the original. Therefore, for the current study we
will use an abbreviated instrument (aRTOP) composed of items with large loadings on the
two primary factors (see Table 3).
For the second factor of the aRTOP, focused on the content knowledge related to the
lesson, we include all 5 items from the original Subscale 2, as these items are likely to measure
something different from the remaining 20 items of the original RTOP. For the first factor
of this abbreviated instrument, focused on the inquiry orientation of the lesson, we chose
items that had significant loadings on the first factor, making sure to get items from each of
the related subscales. We also limited this factor to 5 items to match the content knowledge
factor in size in order to keep this factor from dominating the total scale score.
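The selection rule described above (take the highest-loading Factor 1 items while ensuring coverage of the related sub-scales, capped at five items) can be sketched in code. The loadings and sub-scale labels below are hypothetical placeholders for illustration, not the published RTOP values.

```python
# Sketch of the aRTOP item-selection rule: keep the five items with the
# largest Factor 1 loadings, subject to drawing from each related sub-scale.
# Loadings and sub-scale labels below are hypothetical, for illustration only.

def select_items(loadings, subscales, n_items=5):
    """Pick n_items with the largest loadings, requiring at least one
    item from every sub-scale represented in `subscales`."""
    chosen = []
    # First pass: the best item from each sub-scale.
    for scale in sorted(set(subscales.values())):
        best = max((i for i in loadings if subscales[i] == scale),
                   key=lambda i: loadings[i])
        chosen.append(best)
    # Second pass: fill remaining slots with the highest leftover loadings.
    leftovers = sorted((i for i in loadings if i not in chosen),
                       key=lambda i: loadings[i], reverse=True)
    chosen.extend(leftovers[: n_items - len(chosen)])
    return sorted(chosen[:n_items])

# Hypothetical loadings for six candidate items across three sub-scales.
loadings = {1: 0.82, 2: 0.74, 3: 0.61, 4: 0.79, 5: 0.55, 6: 0.68}
subscales = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C"}
print(select_items(loadings, subscales))  # [1, 2, 3, 4, 6]
```

With these placeholder loadings, item 5 (the weakest item in sub-scale C) is passed over in favor of higher-loading items once every sub-scale is represented.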
Table 3
aRTOP Items and Design
Inquiry Orientation
1. The lesson was designed to engage students as members of a learning community.
2. Intellectual rigor, constructive criticism, and the challenging of ideas were valued.
3. This lesson encouraged students to seek and value alternative modes of investigation or of problem solving.
4. Students made predictions, estimations and/or hypotheses and devised means for testing them.
5. The teacher acted as a resource person, working to support and enhance student investigations.

Content Propositional Knowledge
6. The lesson involved fundamental concepts of the subject.
7. The lesson promoted strongly coherent conceptual understanding.
8. The teacher had a solid grasp of the subject matter content inherent in the lesson.
9. Elements of abstraction (i.e., symbolic representations, theory building) were encouraged when it was important to do so.
10. Connections with other content disciplines and/or real world phenomena were explored and valued.
Mathematics Classroom Observation Protocol for Practices
The science-specific language of the RTOP is a major disadvantage when it is used to observe mathematics classrooms. This, along with the need for an observation protocol grounded in recent standards, led to the design of the Mathematics Classroom Observation
Protocol for Practices (MCOP2). The MCOP2 is designed to be implemented in K-16 math-
ematics classrooms to measure the degree to which a mathematics classroom aligns with the
Standards for Mathematical Practice from the Common Core State Standards in Mathemat-
ics (National Governors Association Center for Best Practices, Council of Chief State School
Officers, 2010); “Crossroads” and “Beyond Crossroads” from the American Mathematical
Association of Two-Year Colleges (American Mathematical Association of Two-Year Colleges
(AMATYC), 1995, 2004); the Committee on the Undergraduate Program in Mathematics
Curriculum Guide from the Mathematical Association of America (Barker et al., 2004) ;
and the Process Standards of the National Council of Teachers of Mathematics (National
Council of Teachers of Mathematics, 2000).
A test of the content was conducted with 164 identified experts in mathematics teacher education. The first survey provided feedback on the initial 18 MCOP2 items and their usefulness in measuring various components of mathematics classrooms (Gleason, Livers, & Zelkowski, 2017). Over 94% of the experts rated the items as either "essential" or "not essential, but useful" rather than "not useful." After adjusting the MCOP2 items based on
the expert feedback, a second survey was conducted with 26 experts that agreed to provide
additional information. This survey provided the experts with more details about each item,
the theoretical constructs, and the intended purpose of the MCOP2. Gleason, Livers, and
Zelkowski (2017) report that all items were retained with minimal revisions, because they all loaded
on at least one of the factors. With the information gained from the experts, the structure
of the MCOP2 instrument was revised.
A pilot study was conducted by a graduate student in mathematics and a mathemat-
ics professor at a large southern university to determine if the data collected aligned with
the theoretical constructs and the verification of the expert survey. Based upon instructor
approval, 36 classrooms with 28 different instructors were observed throughout a semester.
The instructors varied widely from graduate teaching assistants to tenured professors. The
classes they taught also varied from college algebra to upper division mathematics.
The MCOP2 that was used in the pilot study was initially designed to measure three primary components: student engagement, lesson design and implementation, and class culture and discourse. Seventeen of the original eighteen items, with full descriptions, are used to measure these three components. Student Engagement contained Items 1-5, Lesson Content contained Items 6-11, and Classroom Culture and Discourse contained Items 12-17
(Gleason & Cofer, 2014).
After all the data was collected, Gleason and Cofer conducted an exploratory factor analysis (EFA) and classical test theory analysis with some unexpected results. The original
assumption of three components was reexamined after a low eigenvalue was found for the
third factor. Gleason and Cofer report that a factor matrix of a potential 3 Factor Model
indicated Student Engagement and Classroom Culture and Discourse were both loading on
the same factor. These two were combined to create Student Engagement and Classroom
Discourse. The 2 Factor Model explained over 50% of the total variance.
Cronbach’s alpha was also calculated for the entire protocol and both factors. The entire
protocol had a Cronbach’s alpha of 0.898. The sub-scales of “Lesson Content” and “Student
Engagement and Classroom Discourse” had Cronbach’s alpha reliabilities of 0.779 and 0.907,
respectively. Gleason and Cofer (2014) state, “the internal reliabilities are high enough for
both sub-scales and the entire instrument to be used to measure at the group level, either
multiple observations of a single classroom or single observations of multiple classrooms” (p.
99). The overall high alpha coefficient demonstrates that the MCOP2 measures a coherent construct, and the EFA clearly produces a 2 factor model of "Lesson Content" and "Student Engagement and Classroom Discourse". Overall this pilot study was very promising, but the instrument was still in an early stage of development.
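Cronbach's alpha as reported for the pilot can be computed directly from an observations-by-items score matrix. A minimal numpy sketch, using synthetic scores rather than the study's data:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (observations x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Synthetic data: three items that track a common signal -> high alpha.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
scores = np.column_stack([base + rng.normal(scale=0.3, size=100)
                          for _ in range(3)])
print(round(cronbach_alpha(scores), 3))
```

When items are perfectly parallel (identical columns), the formula returns exactly 1; as item-specific noise grows, alpha shrinks toward 0.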
Gleason, Livers, and Zelkowski (2017) also assessed the inter-rater reliability of the instrument to examine the response processes. Five raters were chosen with a variety of
educational and professional backgrounds. Two of the raters have doctorates in mathematics
education, one rater has a doctorate in mathematics and is heavily involved in mathematics
education, one rater is a mathematics specialist that works with secondary teachers and has
taught at both the secondary and introductory college level, and the fifth rater is a graduate
student in mathematics with minimal background in education other than teaching some
introductory college math classes.
Five different classroom videos were scored by the five raters. All were given the detailed descriptors of the items with the rubric prior to viewing the videos. All videos were
watched independently by each rater, and no formal training was conducted. To make sure
there was a good representation of different levels of students and instructors, one video was
Table 4
Brief Description of MCOP2 items
1. Students engaged in exploration/investigation/problem solving.
2. Students used a variety of means (models, drawings, graphs, concrete materials, manipulatives, etc.) to represent concepts.
3. Students were engaged in mathematical activities.
4. Students critically assessed mathematical strategies.
5. Students persevered in problem solving.
6. The lesson involved fundamental concepts of the subject to promote relational/conceptual understanding.
7. The lesson promoted modeling with mathematics.
8. The lesson provided opportunities to examine mathematical structure (symbolic notation, patterns, generalizations, conjectures, etc.).
9. The lesson included tasks that have multiple paths to a solution or multiple solutions.
10. The lesson promoted precision of mathematical language.
11. The teacher's talk encouraged student thinking.
12. There was a high proportion of students talking related to mathematics.
13. There was a climate of respect for what others had to say.
14. In general, the teacher provided wait-time.
15. Students were involved in the communication of their ideas to others (peer-to-peer).
16. The teacher uses student questions/comments to enhance mathematical understanding.
chosen from each of K-2, 3-5, 6-8, 9-12, and undergraduate. Gleason, Livers, and Zelkowski
(2017) used the sub-scale score to calculate the intra-class correlation (ICC) and report that
the inter-rater reliability was within acceptable levels.
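The intra-class correlation used for inter-rater reliability can be illustrated with a one-way random-effects ICC(1). This is a simplified variant chosen for the sketch (the text does not specify which ICC form Gleason et al. used), and the ratings below are hypothetical sub-scale scores:

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects ICC(1) for a (targets x raters) matrix.
    A simplified variant for illustration; the published analysis may
    have used a different ICC form."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    target_means = ratings.mean(axis=1)
    # Between-target and within-target mean squares (one-way ANOVA).
    ms_between = k * ((target_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - target_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Five videos scored by five raters (hypothetical sub-scale scores).
ratings = [[10, 11, 10, 12, 11],
           [20, 19, 21, 20, 20],
           [15, 16, 14, 15, 16],
           [30, 29, 31, 30, 28],
           [25, 26, 24, 25, 27]]
print(round(icc_oneway(ratings), 3))
```

Perfect rater agreement drives the within-target mean square to zero and the ICC to 1; large disagreements relative to the between-video spread pull it toward 0.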
The MCOP2 was revised after the second round of external experts to include only 16
items (Table 4) measuring the two primary constructs of teacher facilitation, focusing on the
interactions that are primarily dependent upon the teacher, and student engagement, focus-
ing on the interactions that are primarily dependent upon the students. The MCOP2 needs to be tested in other mathematics classrooms at multiple higher education institutions, with the types of institutions diversified to include community colleges, liberal arts schools, and other research universities. Such an expanded study, including multiple institutions, will yield more generalizable results for use in future analyses.
CHAPTER 3
METHODS
Aim of Study
The aim of this study was to investigate the structural validity and reliability of the
abbreviated Reformed Teaching Observation Protocol and the Mathematic Classroom Ob-
servation Protocol for Practices in the setting of undergraduate mathematics classrooms, with the goal of answering the following research questions:
1. What are the internal structures of the Mathematics Classroom Observation Protocol
for Practices (MCOP2) and the abbreviated Reformed Teaching Observation Protocol
(aRTOP) for the population of undergraduate mathematics classrooms?
2. What are the internal reliabilities of the subscales of the Mathematics Classroom Ob-
servation Protocol for Practices (MCOP2) and the abbreviated Reformed Teaching Ob-
servation Protocol (aRTOP) with respect to undergraduate mathematics classrooms?
3. What are the relationships between the constructs measured by the Mathematics Class-
room Observation Protocol for Practices (MCOP2) and the abbreviated Reformed
Teaching Observation Protocol (aRTOP)?
Sample Description
The procedure for selecting the population is a crucial step in any study. Since the
study used Structural Equation Modeling (SEM) in the analysis, we aimed for a sample size of 150-200 classroom observations at a variety of institutions (Kline, 2011; Weston & Gore, 2006) and obtained a final sample of 110 classroom observations.
Although this was smaller than the sample size originally intended, the literature supports a smaller sample size when necessary. In a Monte Carlo study, Boomsma (1982) found the widely cited recommendation for sample size to be at least 100, with 200 desirable. In the study conducted by Marsh, Hau, Balla, &
Grayson (1998), it was found that a sample size of 100 was sufficient when there were at least
four items per factor, and that more was better. Ding, Velicer, & Harlow (1995) recommend a minimum of 3 indicators per factor and a minimum sample size of 100. Schumacker & Lomax (2016) suggest a sample size of 100 to 150 for small models with well-behaved data. Other studies suggest 5 or 10 observations per estimated parameter (Bentler & Chou, 1987) or 10 cases per variable (Nunnally, 1978, p. 355). These rules are convenient, but they do not take into account the specifics of the model and may lead to over- or underestimation of the minimum sample size. The flexibility of Structural Equation Modeling (SEM) makes it hard to generalize about the required sample size.
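The per-parameter and per-variable rules of thumb cited above can be made concrete with a rough parameter tally for a two-factor CFA. The counting here is approximate (exact counts depend on how the model is identified), so the numbers are illustrative only:

```python
# Rule-of-thumb minimum sample sizes for a two-factor CFA sketch.
# Parameter count is a rough tally (loadings + residual variances +
# factor variances and one covariance); exact counts depend on how the
# model is identified, so treat these numbers as illustrative only.

def rough_param_count(n_indicators):
    loadings = n_indicators          # one loading per indicator
    residuals = n_indicators         # one residual variance per indicator
    factors = 2 + 1                  # two factor variances + one covariance
    return loadings + residuals + factors

def rule_of_thumb_n(n_indicators, per_parameter=5, per_variable=10):
    params = rough_param_count(n_indicators)
    return {
        "5_per_parameter": per_parameter * params,       # Bentler & Chou (1987)
        "10_per_variable": per_variable * n_indicators,  # Nunnally (1978)
    }

# aRTOP: 10 indicators; MCOP2: 16 indicators.
print(rule_of_thumb_n(10))  # {'5_per_parameter': 115, '10_per_variable': 100}
print(rule_of_thumb_n(16))  # {'5_per_parameter': 175, '10_per_variable': 160}
```

Even under this crude tally, the two rules disagree for both instruments, which echoes the point above that such conveniences can over- or underestimate the required sample size.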
Although there are numerous studies of sample size, the study conducted by Wolf, Harrington, Clark, & Miller (2013) is most analogous to ours. They used Monte Carlo data simulation techniques to evaluate sample size requirements using the maximum likelihood (ML) estimator. The study compares Confirmatory Factor Analyses (CFA) conducted with different numbers of factors, indicators, and loadings to see what minimum sample size
is required to “achieve minimal bias, adequate statistical power, and overall propriety of a
given model”(p. 920). The two-factor model with 6 to 8 indicators is most closely aligned
with our study, because the MCOP2 is a two-factor model with 9 indicators per factor and
the aRTOP is a two-factor model with 5 indicators per factor.
Although increasing the number of latent variables in the model resulted in an increased
minimum sample size, models required a smaller sample size when there were more indicators
per factor and stronger factor loadings. According to Wolf et al. (2013), a two-factor model with 6 indicators required sample sizes of 120 and 100 at factor loadings of .65 and .80, respectively. Similarly, a two-factor model with 8 indicators required sample sizes of 120
and 90 at factor loadings of .65 and .80, respectively. Wolf et al. (2013) conclude that a "one size fits all" approach has problems because of the variability in SEM.
This study used the non-probability sampling method of convenience sampling in or-
der to reduce the relative travel cost and time required to achieve the sample size desired
(Johnson & Christensen, 2014). The chosen sample, to a large degree, represents the general
population of undergraduate mathematics classrooms, because the sample includes a large
number of classroom observations with a wide variety in class type, class size, institution,
demographics, etc.
The investigators observed 110 college mathematics classrooms at the undergraduate level, with the consent of the instructors, representing a wide variety of college and university classrooms. The faculty members of the classrooms observed ranged in age from 22 and up, with a mixture of genders and ethnic backgrounds.
The American Mathematical Society’s Annual Survey of the Mathematical Sciences pro-
vides a way to group colleges and universities into two distinct classifications based upon the
highest mathematics degree offered at the institution: doctorate-granting universities and
master's and baccalaureate colleges and universities. Our study includes three large southern doctorate-granting universities with enrollments of approximately 18,000 to 35,000. The percentage of full-time students ranges from 64 to 85, and the percentage of undergraduate students ranges from 62 to 85. These universities are comprised of approximately 49 to
60 percent female students. All three of these universities have students with a wide variety
of ethnic backgrounds.
We also include eight southern master's and baccalaureate colleges and universities with enrollments between 1,100 and 15,000 students. The percentage of full-time students ranges from 43 to 90, and the percentage of undergraduate students ranges from 45 to 100. These colleges and universities are comprised of approximately 50 to 72 percent female students. All eleven colleges and universities have students with a wide variety of ethnic backgrounds.
We purposefully chose the institutions in this study to avoid atypical demographics.
Any institution with a high representation of one specific demographic was excluded. For
example, student populations composed exclusively or almost exclusively of women were
excluded from this study because they do not represent a typical college population. With
these selections of institutions, approximately 50-70 mathematics classrooms at each category
of institution are included in the study to overcome any potential bias due to the convenience
sampling (Johnson & Christensen, 2014). The actual classrooms chosen for observation were
selected from faculty members at each institution who elected to participate in the study.
The college and universities in this study were chosen to avoid the overrepresentation or
underrepresentation of a specific group, recognizing we have no control over the instructors
who chose to participate. Some instructors did not respond or chose not to participate in this study. The main concern was whether this group of non-responders or non-participants would affect the validity of our study results (Hartman, Fuqua, & Jenkins, 1986). In our study, there could be substantial differences in the personal or professional demographics of those who chose not to respond or participate. The self-selected nature of this sample most likely favors instructors who have an interest in teaching and learning issues (Hora & Ferrare, 2013a). For instance, teachers who run a student-centered classroom were more likely to respond than teachers who only lecture directly.
Seventy-two mathematics faculty members agreed to participate in this study. Since
some instructors teach two or more completely different courses, a total of 110 observations
were conducted in the Spring 2016, Fall 2016, and Spring 2017 semesters. Only 86 of the 110
observations have instructor demographics data, because 15 instructors did not complete
the demographics survey. Of the 110 classroom observations, 50 were taught by a female
instructor and 60 were taught by a male instructor. The instructors self-identified their ages: 2% were 18-24 years old, 48% were 25-34 years old, 17% were 35-44 years old, 12% were 45-54 years old, 16% were 55-64 years old, and 5% were 65 years and
over. For ethnicity, 13% identified as Asian/Pacific Islander, 5% identified as Black or African American, 2% identified as Hispanic American, and 80% identified as White/Caucasian.
Of the 86 classrooms on which we have full demographic data about the instructor, 13
were taught by a Graduate Teaching Assistant, 24 were taught by an Adjunct/Instructor,
24 were taught by an Assistant Professor, 8 were taught by an Associate Professor, and 17
were taught by a Full Professor. The instructors were asked to self-identify their highest level of education: 2% had a Bachelor's degree, 31% had a Master's degree, 63% had a PhD, and 3% had another advanced degree beyond a Master's degree.
They were also asked to identify how many years they had taught at the high school level and at the college level. Over 75% reported teaching at the high school level for less than
one year. The range of years spent teaching at the college level varied, with 1% teaching for
less than one year, 27% teaching for 1-5 years, 31% teaching for 6-10 years, 13% teaching
for 11-15 years, and 28% teaching over 15 years. A complete list of instructor demographics
is included in Appendix B.
The use of convenience sampling is one of the limitations of this study. Convenience sampling can lead to the under-representation or over-representation of a particular group in the sample. Another sampling issue that needs to be accounted for is the presence of outliers, since convenience sampling is particularly influenced by them.
We chose our sample to avoid classroom observations likely to give us unusual data by collecting a large sample from a diverse range of institutions, selected on enrollment demographics and the types of degrees offered, that reasonably represents the larger population of undergraduate institutions in the United States.
Another limitation of this study is observer bias. Unfortunately, researchers are susceptible to obtaining the results they want to find. Observer biases can be positive or negative. These biases can be a product of personal experience, environment, and/or social and cultural conditioning. Reflexivity, self-reflection by the researcher on their biases and predispositions, is the key strategy for avoiding researcher bias (Johnson & Christensen, 2014). Although it
is not possible to remove this potential bias completely, the observer is aware of the influence
that these biases may cause and will make every effort to avoid their influence. Johnson &
Christensen (2014) comment, "Complete objectivity being impossible and pure subjectivity undermining credibility, the researcher's focus is on balance - understanding and depicting
the world authentically in all its complexity while being self-analytical, politically aware,
and reflexive in consciousness” (p. 420).
Instruments
From the review of the literature, we see there are many ways to evaluate college mathematics instruction. It is impossible to include all the observation protocols used to evaluate undergraduate classes, so two protocols were chosen to align with the research questions: an abbreviated form of the Reformed Teaching Observation Protocol, for its widely known use, and the Mathematics Classroom Observation Protocol for Practices, for its mathematics-specific design.
The Reformed Teaching Observation Protocol (RTOP) is a 25-item protocol designed to be
used for both science and mathematics classroom observations. Piburn et al. (2000) divide
the RTOP into 5 sub-scales in order to test the hypothesis that “Inquiry-Orientation” is
a major part of the structure of the RTOP. One of the sub-scales, procedural pedagogic knowledge, is a very strong predictor of the total score, with an R-squared of 0.971, meaning 97.1% of the variance in the total score is accounted for by this predictor. This result, along with an exploratory
factor analysis finding two strong factors, solidified our idea that an abbreviated version of
the RTOP would produce a similar amount of information as the full instrument. This led
to the creation of the abbreviated Reformed Teaching Observation Protocol (aRTOP) to be
used in this study. (See Table 3 and Appendix C.2)
The theoretical structure of the aRTOP is depicted in Figure 1 and is based on the
results of the RTOP. Inquiry Orientation and Content Propositional Knowledge are the two
theoretical constructs that will be measured with the 10 observed variables. A double arrow
indicates that these two constructs are correlated. The model also contains a stochastic
error term accounting for the influence of unobserved factors.
Figure 1: Theoretical Model of aRTOP
The other observation protocol that we will focus on is the Mathematics Classroom
Observation Protocol for Practices (MCOP2). It is designed to be implemented in K-16
mathematics classrooms to measure mathematics classroom interactions. The MCOP2 mea-
sures the two primary constructs: teacher facilitation and student engagement. Sixteen
items with full descriptions are used to measure these two components. The validity and
reliability of the Mathematics Classroom Observation Protocol for Practices (MCOP2) has
been assessed in numerous ways. A survey of 164 experts in mathematics education was
conducted to test the content of the MCOP2. The results from this survey and a second
follow-up survey were used to revise the original 18 MCOP2 items to 16 items. Inter-rater reliability was also calculated with a panel of five raters of various backgrounds without any
formal training. This resulted in the intra-class correlation (ICC) of 0.669 for the Teacher
Facilitation Sub-scale and 0.616 for the Student Engagement Sub-scale (Gleason et al., 2017).
(See Table 4 and Appendix C.3)
The theoretical structure of the MCOP2 is depicted in Figure 2. The two theoretical constructs, student engagement and teacher facilitation, will be measured with the 16 observed variables. The double arrow between the two theoretical constructs represents the correlation of these two factors. The model also includes residual error terms to account for the unmeasured variation in the model.
Figure 2: Theoretical Model of MCOP2
Procedures
After receiving approval from the University of Alabama Institutional Review Board
(IRB), we began the recruitment process of the selected institutions of higher education
through their local Institutional Review Boards. Upon approval of the institution to par-
ticipate, an email was sent to all undergraduate mathematics instructors at that institution
informing them that participation in this study was strictly voluntary, but that we would like to observe a class they teach to understand the current status of mathematics instruction at the undergraduate level. The email also explained that there was no foreseen risk associated with this study and no individual benefit for the participants. For those who agreed to allow us to observe their classrooms, we confirmed a classroom observation at the teacher's discretion.
The instructors of the mathematics courses were only asked to allow the investigators
to observe their classroom in order to complete the observation protocol forms. The time
commitment for the participants was to allow the researchers to observe one class period (usually 50 or 75 minutes) during the Spring 2016, Fall 2016, or Spring 2017 semester, plus about 5 minutes to read and complete the consent form and demographics form. There were no other responsibilities for the participants.
For each classroom observation, the investigator arrived early and sat in a seat near the
back of the classroom. The goal as the observer was to blend in with the surroundings so the
students and instructor were not disturbed. The observer completed the Note Taker Form
(See Appendix C) during the lecture. At the conclusion, the observer used the information
collected on the Note Taker Form to complete both protocols. The observer alternated the
order the observation protocols were completed to avoid any bias that might be created.
Each classroom observation was given a number (1-200) corresponding to the sequence in which it was completed. Classroom observations labeled with an odd number (1, 3, 5, ...) were categorized as A, indicating that the aRTOP protocol was completed first. Classroom
observations labeled with an even number (2, 4, 6, ...) were categorized as B, indicating that the MCOP2 protocol was completed first. This process was repeated until both protocols
were collected for all classroom observations.
Once all data was collected, we tested the theoretical structure for the MCOP2 and
the aRTOP using the statistical language and environment, R. All results are reported in
the aggregate to protect the confidentiality of the teachers. We use the linear regression
coefficients and the fit statistics we gain from the Confirmatory Factor Analysis (CFA) to
test the internal structure of the MCOP2 and the aRTOP with respect to undergraduate
mathematics classrooms. Cronbach's alpha is used to test the internal reliability of the MCOP2 and the aRTOP with respect to undergraduate mathematics classrooms. We use regression
to assess the association between the constructs measured by the MCOP2 and the aRTOP.
Although Hu & Bentler's (1999) "rule of thumb" cutoff criteria for fit indexes are widely used today, Marsh, Hau, & Wen (2004) warn against overgeneralization of Hu and Bentler's findings. Schermelleh-Engel et al. (2003) include a table of recommendations for model evaluation, but suggest that these cutoff criteria should not be taken too rigidly. Table 5 provides an overview of some of these rules of thumb. Hu & Bentler (1998) note that fit indices can be affected by model misspecification, small-sample bias, violations of normality and independence, and estimation methods. It is therefore possible for a model to fit the data well even when one or more fit measures suggest bad fit.
The acceptable level of Cronbach's alpha depends upon whether the instrument is being used in the early stages of research, as a basic research tool, or as a scale for an individual in a clinical situation (Nunnally, 1978). Alpha values of .7 to .8 are regarded as satisfactory in our case, because we are not measuring at the level of the individual; with such values, the instruments should be used for preliminary research to guide further understanding of the constructs (Bland & Altman, 1997; Nunnally, 1978). According to Streiner (2003), Nunnally was correct about acceptable alpha for research tools, but an alpha over .90 most likely indicates unnecessary redundancy.
Table 5
Recommendations for Model Evaluation: Some Rules of Thumb
Fit Measure Good Fit Acceptable Fit
χ2/df 0 ≤ χ2/df ≤ 2 2 < χ2/df ≤ 3
RMSEA 0 ≤ RMSEA ≤ .05 .05 ≤ RMSEA ≤ .08
SRMR 0 ≤ SRMR ≤ .05 .05 < SRMR ≤ .10
CFI .97 ≤ CFI ≤ 1.00 .95 ≤ CFI < .97
GFI .95 ≤ GFI ≤ 1.00 .90 ≤ GFI < .95
Schermelleh-Engel, Moosbrugger, & Muller (2003)
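The rules of thumb in Table 5 can be encoded as a simple screening function. The handling of shared boundary values (e.g., RMSEA = .05) is a choice made here for the sketch, and, as the surrounding discussion stresses, such cutoffs are a screen, not a verdict:

```python
# Screening sketch of the Table 5 rules of thumb (Schermelleh-Engel et al.,
# 2003). Boundary cases (e.g. RMSEA = .05) are resolved here in favor of
# the better category, which is one of several defensible conventions.

# For each index: (good-fit bound, acceptable-fit bound). Indices where
# higher is better (CFI, GFI) are stored negated so one comparison works.
THRESHOLDS = {
    "chisq_df": (2.0, 3.0),
    "rmsea": (0.05, 0.08),
    "srmr": (0.05, 0.10),
    "cfi": (-0.97, -0.95),   # negated: CFI >= .97 good, >= .95 acceptable
    "gfi": (-0.95, -0.90),   # negated: GFI >= .95 good, >= .90 acceptable
}

def classify(index, value):
    good, acceptable = THRESHOLDS[index]
    v = -value if index in ("cfi", "gfi") else value
    if v <= good:
        return "good"
    if v <= acceptable:
        return "acceptable"
    return "poor"

# Illustrative values, not the study's results.
for name, value in [("chisq_df", 1.5), ("rmsea", 0.07), ("srmr", 0.12)]:
    print(name, classify(name, value))  # good, acceptable, poor
```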
CHAPTER 4
RESULTS
Internal Structure
A confirmatory factor analysis (CFA) was conducted on the data gathered to analyze
the internal structure of the Mathematics Classroom Observation Protocol for Practices
(MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP) for the
population of undergraduate mathematics classrooms using R version 3.3.0 (2016) with the
lavaan package (Rosseel, 2012). CFA allows us to examine the relationship between the
observed variables and their underlying latent constructs for both the MCOP2 and the
aRTOP. The analysis of the fit indices for the CFA allows us to inspect the model fit for
both observation protocols.
As mentioned earlier, the aRTOP has two theoretical constructs, Inquiry Orientation and Content Propositional Knowledge, measured with the 10 observed variables. Items x1, x2, x3, x4, and x5 load on Inquiry Orientation, and x6, x7, x8, x9, and x10 load on Content
Propositional Knowledge. The aRTOP model with standardized factor loadings as well as
the standardized variance and covariance are included in Figure 3, with the factor loadings
relatively high for eight of the items. The goodness of fit indices for the aRTOP fall outside the acceptable ranges (χ2/df = 3.478, RMSEA = .150, SRMR = .115, GFI = .831, and CFI = .820).
Although some indicator variables were low and modification indices existed, theory led
to the inclusion of these observed variables. For example, item x10, “connection with other
content disciplines and/or real world phenomena were explored and valued,” in the aRTOP
has a factor loading of .03, meaning the construct explains less than 1% (.03² ≈ .001) of the variance in item x10. Although this loading is low and modification indices suggest removal of
Figure 3: Confirmatory Factor Analysis Results: aRTOP
this item, theory tells us that content being connected to the real world or other disciplines is
important to Content Propositional Knowledge. Schermelleh-Engel, Moosbrugger, & Muller
(2003) support this idea stating, “one should never modify a model solely on the basis of
modification indices, although the program might suggest to do so” (p. 61).
The MCOP2 has two constructs, Student Engagement and Teacher Facilitation mea-
sured by 16 items. Items y1, y2, y3, y4, y5, y12, y13, y14, and y15 load on Student
Engagement, while items y4, y6, y7, y8, y9, y10, y11, y13, and y16 load on Teacher Facil-
itation. The MCOP2 model with standardized factor loadings as well as the standardized
variance and covariance are included in Figure 4, with the standardized loadings relatively
high for most items. The goodness of fit indices for the MCOP2 reveal an acceptable fit
for two indices (χ2/df =1.185 and SRMR=.078), and an almost acceptable fit for the other
indices (RMSEA=.094, GFI=.805, and CFI=.895). See Table 5 for recommendations for
model evaluation.
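The fit-index screening described above can be sketched as a small function. This is an illustrative sketch only: the cutoff values below are assumptions drawn from commonly cited SEM guidelines (e.g., Hu & Bentler, 1999), not necessarily the exact recommendations of Table 5.

```python
# Illustrative sketch: screen CFA fit indices against commonly cited cutoffs.
# The cutoffs are assumptions from general SEM guidelines, not necessarily
# the exact recommendations reproduced in Table 5.
CUTOFFS = {
    "chisq_df": lambda v: v <= 2.0,   # chi-square / degrees of freedom
    "rmsea":    lambda v: v <= .06,
    "srmr":     lambda v: v <= .08,
    "gfi":      lambda v: v >= .95,
    "cfi":      lambda v: v >= .95,
}

def screen_fit(indices):
    """Return {index_name: True/False} for whether each value meets its cutoff."""
    return {name: CUTOFFS[name](value) for name, value in indices.items()}

# Fit indices reported for the MCOP2 model in the text.
mcop2 = {"chisq_df": 1.185, "rmsea": .094, "srmr": .078, "gfi": .805, "cfi": .895}
print(screen_fit(mcop2))
# chisq_df and srmr meet these cutoffs; the remaining indices fall outside them.
```

Under these assumed cutoffs, the function reproduces the pattern described in the text: an acceptable fit for two indices and an almost acceptable fit for the others.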
Figure 4: Confirmatory Factor Analysis Results: MCOP2
Internal Reliability
Cronbach’s alpha (1951) was calculated to analyze the internal reliability of the Math-
ematics Classroom Observation Protocol for Practices (MCOP2) and the abbreviated Re-
formed Teaching Observation Protocol (aRTOP) with respect to undergraduate mathematics
classrooms using R version 3.3.0 (2016) with the Rcmdr package (Fox, 2005, 2017; Fox &
Bouchet-Valat, 2017).
The alpha values for the subscales of the aRTOP were .753 for the Inquiry Orientation Subscale and .605 for the Content Propositional Knowledge Subscale. The Cronbach's alpha
for the first subscale is near the satisfactory range for basic research given by Nunnally (1978,
p. 245-246), while the second subscale is in the range for preliminary research.
Similarly, the Cronbach’s alpha values for the subscales of the MCOP2 were .888 for
the Student Engagement Subscale and .812 for the Teacher Facilitation Subscale. Both of
these subscales are therefore in the satisfactory range for basic research (Nunnally, 1978, p.
245-246) and are near acceptable levels for individual measurement.
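The alpha calculation itself can be illustrated with a minimal sketch. The toy scores below are hypothetical and serve only to show the computation; they are not data from this study.

```python
# Minimal sketch of Cronbach's alpha for a set of item scores.
# `items` is a list of columns: items[j][i] is observation i's score on item j.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

def cronbach_alpha(items):
    k = len(items)                                  # number of items
    n = len(items[0])                               # number of observations
    totals = [sum(col[i] for col in items) for i in range(n)]
    item_var = sum(variance(col) for col in items)  # sum of item variances
    return k / (k - 1) * (1 - item_var / variance(totals))

# Hypothetical toy data (not from the study): 3 items, 5 observations.
items = [[2, 3, 3, 1, 2], [2, 3, 2, 1, 2], [3, 3, 2, 1, 1]]
print(round(cronbach_alpha(items), 3))  # → 0.865
```

The formula is the standard one (Cronbach, 1951): alpha rises as the items covary more strongly relative to their individual variances.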
Relationship between the Constructs
Simple Linear Regression analysis was conducted to estimate the relationship between
the constructs measured by the Mathematics Classroom Observation Protocol for Practices
(MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP). Before conducting linear regression, we first checked the regression assumptions: (a) the model is linear, (b) the errors have constant variance (homoscedasticity), (c) the errors are normally distributed, (d) the independent variables are determined without error, and (e) the errors are independent (Mathews, 2005).
Weisberg (2005) suggests that plots of residuals against other quantities are useful for detecting failures of these assumptions. The residual plots for Regression Model 1 (see Figure 5) are included below to aid the discussion; a complete set of residual plots for each model can be found in Appendix D. The first plot, “Residuals versus Fitted,” and the second plot, “Normal Q-Q,” are the most useful in simple regression for determining whether these assumptions are met. In the “Residuals versus Fitted” plot there is no pattern and the red line is fairly flat, which implies that the assumptions of linearity and homoscedasticity are met. In the “Normal Q-Q” plot, the points lie along the diagonal line, indicating normally distributed errors. The last two assumptions are satisfied by the data collection and study design.
A simple linear regression was calculated to predict Regression Model 1: Student En-
gagement based on Inquiry Orientation (See Figure 6 in Appendix D). A significant regres-
Figure 5: Residual Plots of Regression Model 1
sion equation was found (F(1,108)=271.8, p < .001), with an R2 of .716 and adjusted R2 of .713. Roughly 72% of the variation in Student Engagement can be explained by Inquiry
Orientation. The linear regression equation predicted
(Student Engagement) = 6.671 + 1.11(Inquiry Orientation).
Student Engagement increased 1.11 for each one point increase in Inquiry Orientation.
Cohen (1988) defines a strong correlation as a Pearson's Product-Moment Correlation with |r| > .5. Based on the results of the study, Student Engagement is strongly and positively related to Inquiry Orientation with r = .846, p < .001.
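The quantities reported for each regression model (intercept, slope, Pearson's r, and hence R2) can be computed with a short sketch. The scores below are hypothetical, not the study's observation data.

```python
# Minimal sketch of simple linear regression and Pearson's product-moment
# correlation, as computed for Regression Models 1-6.
def simple_regression(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)            # sum of squares of x
    syy = sum((yi - my) ** 2 for yi in y)            # sum of squares of y
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx                                # least-squares slope
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5                     # Pearson's r; r**2 = R2
    return intercept, slope, r

# Hypothetical scores (not from the study):
inquiry = [4, 7, 10, 13, 16]     # Inquiry Orientation totals
engage  = [11, 14, 17, 21, 24]   # Student Engagement totals
b0, b1, r = simple_regression(inquiry, engage)
# The fitted equation is: (Student Engagement) = b0 + b1 * (Inquiry Orientation),
# and r**2 gives the proportion of variance in y explained by x.
```

Residuals for the diagnostic plots discussed above would then be `y[i] - (b0 + b1 * x[i])` for each observation.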
For Regression Model 2: Student Engagement based on Content Propositional Knowl-
edge (See Figure 7 in Appendix D) a simple linear regression was calculated. A significant
regression equation was found (F(1,108)= 44.8, p < .001), with an R2 of .293 and adjusted R2
of .287. Roughly 29% of the variation in Student Engagement can be explained by Content
Propositional Knowledge. The linear regression equation predicted
(Student Engagement) = 3.93 + 0.957(Content Propositional Knowledge).
Student Engagement increased 0.957 for each one point increase in Content Propositional
Knowledge. Based on the results of the study, Student Engagement is strongly and positively
related to Content Propositional Knowledge with r = .541, p < .001.
For Regression Model 3: Teacher Facilitation based on Inquiry Orientation (See Figure
8 in Appendix D) a simple linear regression was calculated. A significant regression equation
was found (F(1,108)= 213.6, p < .001), with an R2 of .664 and adjusted R2 of .661. Roughly
66% of the variation in Teacher Facilitation can be explained by Inquiry Orientation. The
linear regression equation predicted
(Teacher Facilitation) = 8.73 + .926(Inquiry Orientation).
Teacher Facilitation increased .926 for each one point increase in Inquiry Orientation. Based
on the results of the study, Teacher Facilitation is strongly and positively related to Inquiry
Orientation with r = .815, p < .001.
Also a simple linear regression was calculated to predict Regression Model 4: Teacher
Facilitation based on Content Propositional Knowledge (See Figure 9 in Appendix D). A
significant regression equation was found (F(1,108)= 142.4, p < .001), with an R2 of .569 and
adjusted R2 of .565. Roughly 57% of the variation in Teacher Facilitation can be explained
by Content Propositional Knowledge. The linear regression equation predicted
(Teacher Facilitation) = 1.86 + 1.15(Content Propositional Knowledge).
Teacher Facilitation increased 1.15 for each one point increase in Content Propositional Knowledge. Based on the results of the study, Teacher Facilitation is strongly and positively related to Content Propositional Knowledge with r = .754, p < .001.
To predict Regression Model 5: Inquiry Orientation based on Content Propositional
Knowledge (See Figure 10 in Appendix D) a simple linear regression was calculated. A
significant regression equation was found (F(1,108)= 48.8, p < .001), with an R2 of .311 and
adjusted R2 of .305. Roughly 31% of the variation in Inquiry Orientation can be explained
by Content Propositional Knowledge. The linear regression equation predicted
(Inquiry Orientation) = −1.05 + .751(Content Propositional Knowledge).
Inquiry Orientation increased .751 for each one point increase in Content Propositional
Knowledge. Based on the results of the study, Inquiry Orientation is strongly and positively
related to Content Propositional Knowledge with r = .558, p < .001.
A simple linear regression was calculated to predict Regression Model 6: Student Engage-
ment based on Teacher Facilitation (See Figure 11 in Appendix D). A significant regression
equation was found (F(1,108)= 217.3, p < .001), with an R2 of .668 and adjusted R2 of
.665. Roughly 67% of the variation in Student Engagement can be explained by Teacher
Facilitation. The linear regression equation predicted
(Student Engagement) = .462 + .945(Teacher Facilitation).
Student Engagement increased .945 for each one point increase in Teacher Facilitation. Based on the results of the study, Student Engagement is strongly and positively correlated to Teacher Facilitation with r = .817, p < .001. A complete list of all the simple linear
regression results can be found in Appendix ??.
In summary, we found linear regression models for each pair of constructs based on the
Residual Plots meeting the linear regression assumptions (For a summary of the results,
see Table 6). A larger proportion of the variance was explained in Regression Model 1, Regression Model 3, and Regression Model 6. Content Propositional Knowledge was the common construct among the models with lower explained variance. The F-statistics support these findings, with values greater than 200 for Regression Models 1, 3, and 6. In Table 7 we see that the correlations over .80 correspond to Regression Models 1, 3, and 6.
Table 6
Simple Linear Regression Results

Model                 Predictor      Standardized   t value    R2     F-statistic
                                     Regression     (df=108)
                                     Coefficient
Regression Model 1    (Intercept)    6.67 ***       10.36      .716   271.8
                      inquiry        1.11 ***       16.49
Regression Model 2    (Intercept)    3.93 *         2.07       .293   44.8
                      content        .957 ***       6.69
Regression Model 3    (Intercept)    8.73 ***       14.41      .664   213.6
                      inquiry        .926 ***       14.61
Regression Model 4    (Intercept)    1.86           1.45       .569   142.4
                      content        1.15 ***       11.93
Regression Model 5    (Intercept)    -1.05          -0.73      .311   48.8
                      content        .751 ***       6.99
Regression Model 6    (Intercept)    .462           0.42       .668   217.3
                      facilitation   .945 ***       14.74

* p < .05, ** p < .01, *** p < .001
Regression Model 1: Student Engagement and Inquiry Orientation
Regression Model 2: Student Engagement and Content Propositional Knowledge
Regression Model 3: Teacher Facilitation and Inquiry Orientation
Regression Model 4: Teacher Facilitation and Content Propositional Knowledge
Regression Model 5: Inquiry Orientation and Content Propositional Knowledge
Regression Model 6: Student Engagement and Teacher Facilitation
Table 7
Pearson's Product-Moment Correlation

              Inquiry    Content    Engagement   Facilitation
Inquiry       -
Content       .5578761   -
Engagement    .8459433   .5414396   -
Facilitation  .8149469   .7540780   .8173191     -

p < .001 for all correlations.
CHAPTER 5
DISCUSSION
The improvement of Science, Technology, Engineering, and Mathematics (STEM) un-
dergraduate education is on the minds of faculty and staff at colleges and universities around
the United States. Every day we, as educators, are challenged by our departments and uni-
versities to make advances in the classroom, but how do we know if the changes we make
positively impact our students. Peer evaluation, student evaluations, and assessment of your
portfolio are the primary methods of formative and summative assessment instructors have
to evaluate their classroom. Although each of these methods are useful, they can be rid-
dled with subjective information that can skew the window into what is happening in an
undergraduate classroom.
Observation protocols like the Mathematics Classroom Observation Protocol for Prac-
tices (MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP) are
a more objective way for an instructor to analyze their classroom. Before these observa-
tion protocols could be brought into the classroom with confidence, a study needed to be
conducted to examine both the aRTOP and the MCOP2. Although this study needs to be
repeated and extended to further validate the use of observation protocols in the classroom,
the findings have led to some conclusions on the internal structure, internal reliability, and
the relationship between the constructs measured by the observation protocols.
Study Limitations
While the current study provides useful information, there are several limitations that
must be mentioned. The use of convenience sampling is one limitation of this study. This
sampling technique was unavoidable because of time and financial constraints. One major
concern with the use of a convenience sample is the inclusion of outliers that may skew the
data. Our sample was chosen to avoid including classroom observations likely to give us
unusual data. Every effort was made to include colleges and universities from a diverse range of institutions, based on enrollment demographics and types of degrees offered, that reasonably represent the larger population of undergraduate institutions in the United
States.
Positive or negative observer bias is another limitation of this study. Reflexivity was
used by the observer as outlined by Johnson & Christensen (2014). The observer spent
time reflecting about her own biases and predispositions to include a strategy for avoidance.
Although it is not possible to remove the potential for biases completely, the observer made
a conscious effort to avoid their influence.
Another limitation to this study is the effect of sample size on fit indices. The studies
conducted by Hu and Bentler (1998, 1999) show how different fit indices are affected by sample size under true-population and misspecified models. Because this limitation was unavoidable,
we were careful when using the fit indices to decide if a model was supported by the data.
Conclusion
Confirmatory Factor Analysis (CFA) was conducted on the data gathered to analyze
the internal structure of the Mathematics Classroom Observation Protocol for Practices
(MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP) for the
population of undergraduate mathematics classrooms. Factor loadings for the aRTOP were
relatively high for eight of the items. Although two items of the aRTOP did not have
high factor loadings, we included these items in our final model because of the theoretical
support for what should be happening in an undergraduate mathematics classroom based
upon national recommendations. The aRTOP goodness of fit indices produced from the
CFA reveal an almost acceptable fit. The factor loadings for the MCOP2 were relatively
high for most items, and the items that did not have high loadings were included because of theoretical support from undergraduate mathematics education research on the typical classroom. The goodness of fit indices reveal an almost acceptable fit for the MCOP2. Our
findings point to a more consistent internal structure for the MCOP2 than the aRTOP.
Therefore, the Confirmatory Factor Analysis supports the previous Exploratory Factor
Analysis on the MCOP2 (Gleason & Cofer, 2014). We can clearly see that the MCOP2 is a
two factor model with almost all observed variables having high factor loadings. The almost
acceptable fit indices show the measure of Student Engagement and Teacher Facilitation
are consistent with our theoretical understanding of the model. Although we would have
liked higher factor loadings and fit indices, we can still confirm the theoretical model for the
MCOP2.
The CFA for the abbreviated Reformed Teaching Observation Protocol (aRTOP) did not align with the original design of the Reformed Teaching Observation Protocol (RTOP)
(Piburn & Sawada, 2000). We could see from the original design that a two factor model
with a reduced number of items would produce the same results. The factor loadings of the
current study support a two factor model with most observed variables having high factor
loadings. The almost acceptable fit indices show the measure of Inquiry Orientation and
Content Propositional Knowledge are somewhat consistent with our theoretical model of the
aRTOP. Although we would have liked higher factor loadings and fit indices, we can to some
extent confirm the theoretical model of the aRTOP.
To analyze the internal reliability, or the strength of a protocol's consistency, Cronbach's alpha (1951)
was calculated for each subscale of both the Mathematics Classroom Observation Protocol for
Practices (MCOP2) and the abbreviated Reformed Teaching Observation Protocol (aRTOP)
with respect to undergraduate mathematics classrooms. Using Nunnally’s (1978) acceptable
range for Cronbach’s alpha, we were able to assess the alpha for each subscale. When we
examined the aRTOP, we found Inquiry Orientation to have satisfactory internal reliability,
and Content Propositional Knowledge was just outside the satisfactory range. We found
both Student Engagement and Teacher Facilitation to have satisfactory internal reliability
when we inspected the alpha.
Therefore, for each subscale, the satisfactory internal reliability of the Mathematics Classroom Observation Protocol for Practices (MCOP2) demonstrates that the instrument
is measuring something and producing similar scores. When we look at each factor indi-
vidually, the Student Engagement part of the MCOP2 instrument successfully gauges the
role of the student in an undergraduate mathematics classroom and their engagement in the
classroom environment. The high internal reliability of the Teacher Facilitation part of the
MCOP2 indicates the instrument is also successfully measuring the role of the instructor in
creating the structure and guidance in the classroom.
Although the abbreviated Reformed Teaching Observation Protocol (aRTOP) did not
have as high of an internal reliability, we can still see from the moderately satisfactory
alphas that each subscale is measuring something and is somewhat consistent in its scores.
The satisfactory internal reliability of the first factor, Inquiry Orientation, indicates the
instrument is successfully measuring the role of the instructor to act as a resource person and
help foster a community of learners. The second factor, Content Propositional Knowledge,
is also doing a fairly satisfactory job of measuring the lesson's attention to fundamental
concepts and conceptual understanding. The data analysis indicates that the MCOP2 has greater internal reliability than the aRTOP.
Theoretically Inquiry Orientation, Content Propositional Knowledge, Student Engage-
ment, and Teacher Facilitation are related, but distinct, with respect to undergraduate mathematics classrooms. To validate this theory, a Simple Linear Regression analysis was conducted
to estimate the relationship between the constructs measured by the Mathematics Classroom
Observation Protocol for Practices (MCOP2) and the abbreviated Reformed Teaching Obser-
vation Protocol (aRTOP). The relationship between the MCOP2 and the aRTOP was also found
to be significant. The Pearson’s Product-Moment Correlations for each pair of constructs
were found to be strongly correlated.
Therefore, for the constructs with the highest correlations, we can draw some strong conclusions. Mathematically, we found that student engagement is directly related to the idea of an inquiry oriented classroom. Theoretically, only when the students
are engaged would an inquiry oriented classroom be possible. And conversely, an inquiry
oriented classroom means the students are actively engaged in the learning community. Sim-
ilarly, there is a high correlation between teacher facilitation and an inquiry oriented classroom.
Without instructor facilitation, a classroom could not be a community of learners, and
the converse is also true. Since both student engagement and teacher facilitation are highly
correlated with inquiry orientation, it is not hard to see why mathematically we found that
student engagement and teacher facilitation are also strongly correlated. Theoretically, the
facilitation of the teacher leads to an engaged body of students and the converse also follows.
We noticed the subscale, Content Propositional Knowledge, was the common construct
between the regression models that had lower variance explained. This leads us to believe
that Content Propositional Knowledge is measuring something completely different from
the other subscales. Given that the data show the MCOP2 has a better internal structure and internal reliability, we infer that the content subscale of the aRTOP could be added to the MCOP2, after which the aRTOP would no longer be necessary.
Despite its limitations, the current study produced some important findings. The internal structure of the aRTOP and MCOP2 was measured using the factor loadings and goodness of fit indices. Both protocols had relatively high factor loadings for
most items. The goodness of fit indices for both protocols were found to be almost in the
acceptable range. A decision was made not to modify the theoretical model because the
deletion of items from each protocol would lead to a decrease in the information gained
from the undergraduate mathematics classroom. The internal reliability of the aRTOP has
been found to be fairly satisfactory and the internal reliability of the MCOP2 has been
found to be highly satisfactory. We found a positive and strong correlation between each
pair of constructs with a higher correlation between subscales that do not contain Content
Propositional Knowledge. We found that the MCOP2 had a stronger internal structure and
internal reliability than the aRTOP. We also found that the theoretical relationships we had assumed between the constructs were supported by the linear regression analyses we conducted.
Therefore, the support of the structure of the aRTOP allows us to feel somewhat con-
fident with what the protocol is measuring, but we find higher confidence in the support
for the structure of the MCOP2. The internal reliability was also found to be higher for
the MCOP2, pointing to the protocol's consistency: a high or low observation protocol score does not happen by chance. The high correlation between subscales that do not include Content Propositional Knowledge tells us that it is reasonable to infer
that the two observation protocols are measuring the same classrooms the same way except
for the Content Propositional Knowledge subscale of the aRTOP. This leads us to believe
that Content Propositional Knowledge is measuring something completely different from the
other subscales. With confidence in what we are measuring with the MCOP2, consistency
in the MCOP2, and correlation among the subscales, we find support for the Mathematics
Classroom Observation Protocol for Practices (MCOP2) as a useful assessment tool for undergraduate mathematics classrooms with the addition of the Content Propositional Knowledge
subscale of the aRTOP.
Future Direction
Future research should seek to extend the current study to a broader sampling commu-
nity. Although the current sample size was adequate, a larger sample with more colleges
and universities included from a broader geographic region could lead to a deeper under-
standing of Mathematics Classroom Observation Protocol for Practices (MCOP2) and the
abbreviated Reformed Teaching Observation Protocol (aRTOP). Increasing the sample size
will allow the researcher to answer more comparative questions about the populations and
institutions included. For example, it would be interesting to compare how different types of
institutions perform with both observation protocols. With a larger sample size, one could
also compare how different job titles, highest level of education, genders, age, and years
of teaching relate to the constructs. Although we focused on undergraduate mathematics
education in this study, with a larger sample size one could examine how these observation protocols perform at additional education levels, as both protocols are designed to be used for
K-16. The applications of an extension of this study are limitless and would help contribute
to a better understanding of the undergraduate mathematics classroom.
References
Abrami, P. C. (2001). Improving judgments about teaching effectiveness using teacher rating
forms. New Directions for Institutional Research, 2001 (109), 59-87.
Abrami, P. C., & d’Apollonia, S. (1990). The dimensionality of ratings and their use in
personnel decisions. New Directions for Teaching and Learning , 1990 (43), 97-111.
Aleamoni, L. M. (1981). Student ratings of instruction. In J. Millman (Ed.), Handbook of
Teacher Evaluation (p. 110-145). Beverly Hills, CA: Sage.
Algozzine, B., Gretes, J., Flowers, C., Howley, L., Beattie, J., Spooner, F., . . . Bray, M.
(2004). Student evaluation of college teaching: A practice in search of principles.
College Teaching , 52 (4), 134-141.
Allen, J., Gregory, A., Mikami, A., Lun, J., Hamre, B., & Pianta, R. (2013). Observations
of effective teacher-student interactions in secondary school classrooms: Predicting
student achievement with the classroom assessment scoring system-secondary. School
Psychology Review , 42 (1), 76-98.
American Mathematical Association of Two-Year Colleges (AMATYC). (1995). Cross-
roads in mathematics: Standards for introductory college mathematics before calculus.
(D. Cohen, Ed.). Memphis, TN: American Mathematical Association of Two Year
Colleges.
American Mathematical Association of Two-Year Colleges (AMATYC). (2004). Beyond
Crossroads: Implementing mathematics standards in the first two years of college
(R. Blair, Ed.). Memphis, TN: American Mathematical Association of Two Year
Colleges.
Apodaca, P., & Grad, H. (2005). The dimensionality of student ratings of teaching: In-
tegration of uni-and multidimensional models. Studies in Higher Education, 30 (6),
723-748.
Ball, D. L., Thames, M. H., & Phelps, G. (2008). Content knowledge for teaching: What
makes it special? Journal of Teacher Education, 59 (5), 389-407.
Ballantyne, C. (2003). Online evaluations of teaching: An examination of current practice
and considerations for the future. New Directions for Teaching and Learning , 2003 (96),
103-112.
Barker, W., Bressoud, D., Epp, S., Ganter, S., Haver, B., & Pollatsek, H. (2004). Under-
graduate programs and courses in the mathematical sciences: CUPM curriculum guide,
2004. Washington, D.C.: Mathematical Association of America.
Bentler, P. M., & Chou, C.-P. (1987). Practical issues in structural modeling. Sociological
Methods & Research, 16 (1), 78–117.
Benton, S. L., & Cashin, W. E. (2012). Student ratings of teaching: A summary of research
and literature (IDEA Paper No. 50). The Idea Center. Retrieved 12/07/2015, from
http://ideaedu.org/wp-content/uploads/2014/11/idea-paper_50.pdf
Bernstein, D. J., Jonson, J., & Smith, K. (2000). An examination of the implementation of
peer review of teaching. New Directions for Teaching and Learning , 2000 (83), 73-86.
Bland, J. M., & Altman, D. G. (1997). Statistics notes: Cronbach’s alpha. BMJ , 314 (7080),
572.
Boomsma, A. (1982). The robustness of LISREL against small sample sizes in factor analysis
models. In K. G. Jöreskog & H. Wold (Eds.), Systems under indirect observation:
Causality, structure, prediction (Vol. 1, pp. 149–173). North-Holland.
Bowes, A. S., & Banilower, E. R. (2004). LSC classroom observation study: An analysis of
data collected between 1997 and 2003. Chapel Hill, NC: Horizon Research, Inc.
Boyer Commission on Educating Undergraduates in the Research University. (1998). Rein-
venting undergraduate education: A blueprint for America’s research universities.
(Tech. Rep.). Stony Brook, NY: State University of New York at Stony Brook for
the Carnegie Foundation for the Advancement of Learning.
Bullock, C. D. (2003). Online collection of midterm student feedback. New Directions for
Teaching and Learning , 2003 (96), 95-102.
Burdsal, C. A., & Harrison, P. D. (2008). Further evidence supporting the validity of both a
multidimensional profile and an overall evaluation of teaching effectiveness. Assessment
& Evaluation in Higher Education, 33 (5), 567-576.
Burns, C. W. (2000). Teaching portfolios: Another perspective. Academe, 86 (1), 44-47.
Cashin, W. E. (1995). Student ratings of teaching: The research revisited (IDEA Paper
No. 32). The Idea Center. Retrieved 12/07/2015, from http://www.clemson.edu/
oirweb1/CourseEvalHelp/StudentRatingsResearch1995.pdf
Centra, J. A. (1993). Reflective faculty evaluation: Enhancing teaching and determining
faculty effectiveness. San Francisco: Jossey-Bass.
Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades
and less course work? Research in Higher Education, 44 (5), 495-518.
Centra, J. A. (2009). Differences in responses to the student instructional report: Is it bias?
Princeton, NJ: Educational Testing Service.
Centra, J. A., & Gaubatz, N. B. (2000). Is there gender bias in student evaluations of
teaching? The Journal of Higher Education, 71 (1), 17-33.
Chen, Y., & Hoshower, L. B. (2003). Student evaluation of teaching effectiveness: An
assessment of student perception and motivation. Assessment & Evaluation in Higher
Education, 28 (1), 71-88.
Cheung, D. (2000). Evidence of a single second-order factor in student ratings of teaching
effectiveness. Structural Equation Modeling , 7 (3), 442-460.
Clayson, D. E. (2009). Student evaluations of teaching: Are they related to what students
learn? A meta-analysis and review of the literature. Journal of Marketing Education,
31 (1), 16-30.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale,
N.J. : L. Erlbaum Associates.
Collins, J. W., & O’Brien, N. P. (2003). The Greenwood dictionary of education. Westport,
Connecticut: Greenwood Press.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika,
16 (3), 297–334.
d’Apollonia, S., & Abrami, P. C. (1997). Navigating student ratings of instruction. American
Psychologist , 52 (11), 1198-1208.
Davis, B. G. (2009). Tools for teaching (2nd ed.). San Francisco, CA: Jossy-Bass.
Dayton Regional STEM Center. (2011). Reformed Teaching Observation Protocol
(RTOP) with accompanying Dayton Regional STEM Center rubric. Retrieved
12/7/2015, from http://daytonregionalstemcenter.org/wp-content/uploads/
2012/09/rtop_with_rubric_smp-1.pdf
Ding, L., Velicer, W. F., & Harlow, L. L. (1995). Effects of estimation methods, number
of indicators per factor, and improper solutions on structural equation modeling fit
indices. Structural Equation Modeling: A Multidisciplinary Journal , 2 (2), 119–143.
Dommeyer, C. J., Baum, P., Hanna, R. W., & Chapman, K. S. (2004). Gathering faculty
teaching evaluations by in-class and online surveys: Their effects on response rates and
evaluations. Assessment & Evaluation in Higher Education, 29 (5), 611-623.
Eiszler, C. F. (2002). College students’ evaluations of teaching and grade inflation. Research
in Higher Education, 43 (4), 483-501.
Ellis, J. F. (2014). Preparing future college instructors: The role of Graduate Student Teach-
ing Assistants (GTAs) in successful college calculus programs (Unpublished doctoral
dissertation). University of California, San Diego.
Ellis, L., Burke, D. M., Lomire, P., & McCormack, D. R. (2003). Student grades and average
ratings of instructional quality: The need for adjustment. The Journal of Educational
Research, 97 (1), 35-40.
Feldman, K. A. (1977). Consistency and variability among college students in rating their
teachers and courses: A review and analysis. Research in Higher Education, 6 (3),
223-274.
Feldman, K. A. (1978). Course characteristics and college students’ ratings of their teachers:
What we know and what we don’t. Research in Higher Education, 9 (3), 199-242.
Feldman, K. A. (1993). College students’ views of male and female college teachers: Part II
-Evidence from students’ evaluations of their classroom teachers. Research in Higher
Education, 34 (2), 151-211.
Feldman, K. A. (2007). Identifying exemplary teachers and teaching: Evidence from stu-
dent ratings. In R. P. Perry & J. C. Smart (Eds.), The Scholarship of Teaching and
Learning in Higher Education: An Evidence-Based Perspective (p. 93-143). Springer
Netherlands.
Flick, L. B., Sadri, P., Morrell, P. D., Wainwright, C., & Schepige, A. (2009). A cross
discipline study of reformed teaching by university science and mathematics faculty.
School Science and Mathematics , 109 (4), 197-211.
Fox, J. (2005). The R Commander: A basic statistics graphical user interface to R. Journal
of Statistical Software, 14 (9), 1–42. Retrieved from http://www.jstatsoft.org/v14/
i09
Fox, J. (2017). Using the R Commander: A point-and-click interface for R. Boca Raton
FL: Chapman and Hall/CRC Press. Retrieved from http://socserv.mcmaster.ca/
jfox/Books/RCommander/
Fox, J., & Bouchet-Valat, M. (2017). Rcmdr: R Commander [Computer software man-
ual]. Retrieved from http://socserv.socsci.mcmaster.ca/jfox/Misc/Rcmdr/ (R
package version 2.3-2)
Freeman, S., Eddy, S. L., McDonough, M., Smith, M. K., Okoroafor, N., Jordt, H., & Wen-
deroth, M. P. (2014). Active learning increases student performance in science, engi-
neering, and mathematics. Proceedings of the National Academy of Sciences , 111 (23),
8410–8415.
Gasiewski, J. A., Eagan, M. K., Garcia, G. A., Hurtado, S., & Chang, M. J. (2012).
From gatekeeping to engagement: A multicontextual, mixed method study of student
academic engagement in introductory STEM courses. Research in Higher Education,
53 (2), 229–261.
Gleason, J., & Cofer, L. D. (2014). Mathematics classroom observation protocol for practices
results in undergraduate mathematics classrooms. In T. Fukawa-Connelly, G. Karakok,
K. Keene, & M. Zandieh (Eds.), Proceedings of the 17th Annual Conference on Research
on Undergraduate Mathematics Education, 2014, Denver, CO (p. 93-103).
Gleason, J., Livers, S., & Zelkowski, J. (2015). Mathematics Classroom Observation
Protocol for Practices: Descriptors manual. Retrieved 12/7/2015, from http://
jgleason.people.ua.edu/mcop2.html
Gleason, J., Livers, S., & Zelkowski, J. (2017). Mathematics Classroom Observation Protocol
for Practices (MCOP2): A validation study. Investigations in Mathematics Learning ,
9 .
Grossman, P. L., Wilson, S. M., & Shulman, L. S. (1989). Teachers of substance: Sub-
ject matter knowledge for teaching. In M. Reynolds (Ed.), The Knowledge Base for
Beginning Teachers (p. 23-36). New York: Pergamon.
Hamermesh, D. S., & Parker, A. (2005). Beauty in the classroom: Professors’ pulchritude
and putative pedagogical productivity. Economics of Education Review , 24 (4), 369–
376.
Hartman, B. W., Fuqua, D. R., & Jenkins, S. J. (1986). The problems of and remedies
for nonresponse bias in educational surveys. The Journal of Experimental Education,
54 (2), 85–90.
Hatzipanagos, S., & Lygo-Baker, S. (2006). Teaching observations: Promoting development
through critical reflection. Journal of Further and Higher Education, 30 (4), 421-431.
Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis, J. M., Phelps, G. C., Sleep, L., &
Ball, D. L. (2008). Mathematical knowledge for teaching and the mathematical quality
of instruction: An exploratory study. Cognition and Instruction, 26 (4), 430-511.
Hora, M. T. (2013). Exploring the use of the Teaching Dimensions Observation Protocol to
develop fine-grained measures of interactive teaching in undergraduate science class-
rooms (WCER Working Paper No. 2013-6). Retrieved 12/10/2015, from http://www.wcer.wisc.edu/publications/workingpapers/Working_Paper_No_2013_06.pdf
Hora, M. T., & Ferrare, J. J. (2013a). Instructional systems of practice: A multidimensional
analysis of math and science undergraduate course planning and classroom teaching.
Journal of the Learning Sciences , 22 (2), 212–257.
Hora, M. T., & Ferrare, J. J. (2013b). A review of classroom observation techniques
in postsecondary settings (WCER Working Paper No. 2013-01). Wisconsin Center
for Education Research. Retrieved 12/7/2015, from http://www.wcer.wisc.edu/publications/workingpapers/Working_Paper_No_2013_01.pdf
Hora, M. T., Oleson, A., & Ferrare, J. J. (2013). Teaching Dimensions Observation Pro-
tocol (TDOP) user’s manual. Madison, WI. Retrieved 12/7/2015, from http://
tdop.wceruw.org/
Hoyt, D. P., & Lee, E.-J. (2002). Basic data for the revised IDEA system (IDEA Tech-
nical Report No. 12). Retrieved 12/7/2015, from http://ideaedu.org/wp-content/
uploads/2014/11/techreport-12.pdf
Hu, L.-t., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity
to underparameterized model misspecification. Psychological Methods , 3 (4), 424.
Hu, L.-t., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation Modeling:
A Multidisciplinary Journal , 6 (1), 1–55.
Jackson, D. L., Teal, C. R., Raines, S. J., Nansel, T. R., Force, R. C., & Burdsal, C. A.
(1999). The dimensions of students’ perceptions of teaching effectiveness. Educational
and Psychological Measurement , 59 (4), 580-596.
Johnson, B., & Christensen, L. (2014). Educational research: Quantitative, qualitative, and
mixed approaches (5th ed.). Thousand Oaks, CA: Sage.
Keig, L., & Waggoner, M. D. (1994). Collaborative peer review: The role of faculty in
improving college teaching. Washington, D.C.: The George Washington University
School of Education and Human Development. (ASHE-ERIC Higher Education Report
No. 2.)
Kim, M. (2011). Differences in beliefs and teaching practices between international and US
domestic mathematics teaching assistants (Unpublished doctoral dissertation). The
University of Oklahoma.
Kline, R. (2011). Principles and practice of structural equation modeling (3rd ed.). New
York: Guilford Press.
Kohut, G. F., Burnap, C., & Yon, M. G. (2007). Peer observation of teaching: Perceptions
of the observer and the observed. College Teaching , 55 (1), 19-25.
Krautmann, A. C., & Sander, W. (1999). Grades and student evaluations of teachers.
Economics of Education Review , 18 (1), 59-63.
Kulik, J. A. (2001). Student ratings: Validity, utility, and controversy. New Directions for
Institutional Research, 2001 (109), 9-25.
Kung, D., & Speer, N. (2007). Mathematics teaching assistants learning to teach: Recast-
ing early teaching experiences as rich learning opportunities. In M. Oehrtman (Ed.),
Proceedings of the 10th annual Conference on Research in Undergraduate Mathematics
Education.
Laverie, D. A. (2002). Improving teaching through improving evaluation: A guide to course
portfolios. Journal of Marketing Education, 24 (2), 104-113.
Leung, D. Y., & Kember, D. (2005). Comparability of data gathered from evaluation
questionnaires on paper and through the internet. Research in Higher Education,
46 (5), 571-591.
Marsh, H. W. (1984). Students’ evaluations of university teaching: Dimensionality, relia-
bility, validity, potential biases, and utility. Journal of Educational Psychology , 76 (5),
707-754.
Marsh, H. W. (2001). Distinguishing between good (useful) and bad workloads on students’
evaluations of teaching. American Educational Research Journal , 38 (1), 183-212.
Marsh, H. W., Hau, K.-T., Balla, J. R., & Grayson, D. (1998). Is more ever too much? The
number of indicators per factor in confirmatory factor analysis. Multivariate Behavioral
Research, 33 (2), 181–220.
Marsh, H. W., Hau, K.-T., & Wen, Z. (2004). In search of golden rules: Comment
on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers
in overgeneralizing Hu and Bentler’s (1999) findings. Structural Equation Modeling ,
11 (3), 320–341.
Marsh, H. W., & Roche, L. A. (2000). Effects of grading leniency and low workload on
students’ evaluations of teaching: Popular myth, bias, validity, or innocent bystanders?
Journal of Educational Psychology , 92 (1), 202.
Marsh, H. W., & Ware, J. E. (1982). Effects of expressiveness, content coverage, and
incentive on multidimensional student rating scales: New interpretations of the Dr.
Fox effect. Journal of Educational Psychology , 74 (1), 126.
Marshall, J. C., Smart, J., & Horton, R. M. (2010). The design and validation of EQUIP:
An instrument to assess inquiry-based instruction. International Journal of Science
and Mathematics Education, 8 (2), 299-321.
Mathews, P. G. (2005). Design of experiments with MINITAB. Milwaukee, WI: ASQ Quality
Press.
McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 65 (6), 384-397.
Mehdizadeh, M. (1990). Loglinear models and student course evaluations. The Journal of
Economic Education, 21 (1), 7-21.
Merritt, D. J. (2008). Bias, the brain, and student evaluations of teaching. St. John’s Law
Review , 82 (1), 235–287.
Michael, J. (2006). Where’s the evidence that active learning works? Advances in Physiology
Education, 30 (4), 159–167.
Morrell, P. D., Wainwright, C., & Flick, L. (2004). Reform teaching strategies used by
student teachers. School Science and Mathematics , 104 (5), 199-213.
National Council of Teachers of Mathematics. (2000). Principles and standards for school
mathematics. Reston, VA: National Council of Teachers of Mathematics.
National Governors Association Center for Best Practices, Council of Chief State School
Officers. (2010). Common Core State Standards Mathematics. Washington D.C.: Na-
tional Governors Association Center for Best Practices, Council of Chief State School
Officers. Retrieved 12/7/2015, from http://www.corestandards.org/Math
National Research Council. (1996). From analysis to action: Undergraduate education
in science, mathematics, engineering, and technology. Washington, D.C.: National
Academies Press.
National Research Council. (1999). Transforming undergraduate education in science, mathe-
matics, engineering, and technology. Washington, D.C.: The National Academy Press.
National Research Council. (2002). Evaluating and improving undergraduate teaching in
science, technology, engineering, and mathematics (M. A. Fox & N. Hackerman, Eds.).
Washington, D.C.: National Academies Press.
National Research Council. (2012). Discipline-based education research: Understand-
ing and improving learning in undergraduate science and engineering (S. R. Singer,
N. R. Nielsen, H. A. Schweingruber, et al., Eds.). Washington, D.C.: National
Academies Press.
National Science Foundation. (1996). Shaping the future: New expectations for undergrad-
uate education in science, mathematics, engineering, and technology. Arlington, VA:
Author. (NSF 96-139)
National Science Foundation. (1998). Information technology: Its impact on undergradu-
ate education in science, mathematics, engineering, and technology. Arlington, VA:
Author. (NSF 98-82)
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Pianta, R. C., & Hamre, B. K. (2009). Conceptualization, measurement, and improvement
of classroom processes: Standardized observation can leverage capacity. Educational
Researcher , 38 (2), 109–119.
Piburn, M., & Sawada, D. (2000). Reformed Teaching Observation Protocol (RTOP): Ref-
erence manual. Tempe, Arizona. Retrieved 12/7/2015, from http://files.eric.ed
.gov/fulltext/ED447205.pdf
President’s Council of Advisors on Science and Technology. (2012). Engage to excel: Pro-
ducing one million additional college graduates with degrees in science, technology,
engineering, and mathematics. Report to the President. Washington, D.C.: Executive
Office of the President.
R Core Team. (2016). R: A language and environment for statistical computing [Computer
software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/
Remmers, H. H. (1928). The relationship between students’ marks and student attitude toward instructors. School & Society, 28, 759-760.
Remmers, H. H. (1930). To what extent do grades influence student ratings of instructors? The Journal of Educational Research, 21, 314-316.
Remmers, H. H., & Brandenburg, G. C. (1927). Experimental data on the Purdue rating scale for instructors. Educational Administration and Supervision, 13, 519-527.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of
Statistical Software, 48 (2), 1–36. Retrieved from http://www.jstatsoft.org/v48/
i02/
Sawada, D., Piburn, M. D., Judson, E., Turley, J., Falconer, K., Benford, R., & Bloom,
I. (2002). Measuring reform practices in science and mathematics classrooms: The
Reformed Teaching Observation Protocol. School Science and Mathematics , 102 (6),
245-253.
Schermelleh-Engel, K., Moosbrugger, H., & Muller, H. (2003). Evaluating the fit of struc-
tural equation models: Tests of significance and descriptive goodness-of-fit measures.
Methods of Psychological Research Online, 8 (2), 23–74.
Schumacker, R., & Lomax, R. (2016). A beginner’s guide to structural equation modeling
(4th ed.). Taylor & Francis.
Seldin, P. (2000). Teaching portfolios: A positive appraisal. Academe, 86 (1).
Seldin, P., & Miller, J. E. (2009). The academic portfolio: A practical guide to documenting
teaching, research, and service (Vol. 132). John Wiley & Sons.
Seymour, E. (2002). Tracking the processes of change in US undergraduate education in
science, mathematics, engineering, and technology. Science Education, 86 (1), 79-105.
Shevlin, M., Banyard, P., Davies, M., & Griffiths, M. (2000). The validity of student
evaluation of teaching in higher education: Love me, love my lectures? Assessment &
Evaluation in Higher Education, 25 (4), 397-405.
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher , 15 (2), 4-14.
Smith, M. K., Jones, F. H., Gilbert, S. L., & Wieman, C. E. (2013). The classroom obser-
vation protocol for undergraduate STEM (COPUS): A new instrument to characterize
university STEM classroom practices. CBE-Life Sciences Education, 12 (4), 618-627.
Socha, A. (2013). A hierarchical approach to students’ assessments of instruction. Assess-
ment & Evaluation in Higher Education, 38 (1), 94-113.
Sojka, J., Gupta, A. K., & Deeter-Schmelz, D. R. (2002). Student and faculty perceptions
of student evaluations of teaching: A study of similarities and differences. College
Teaching , 50 (2), 44-49.
Speer, N., & Hald, O. (2008). How do mathematicians learn to teach? Implications from
research on teachers and teaching for graduate student professional development. In
M. P. Carlson & C. Rasmussen (Eds.), Making the connection: Research and practice in
undergraduate mathematics education (p. 305-318). Washington, D.C.: Mathematical
Association of America.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation
of teaching: The state of the art. Review of Educational Research, 83 (4), 598-642.
Retrieved from http://dx.doi.org/10.3102/0034654313496870
Streiner, D. L. (2003). Starting at the beginning: An introduction to coefficient alpha and
internal consistency. Journal of Personality Assessment , 80 (1), 99–103.
Thomas, S., Chie, Q. T., Abraham, M., Raj, S. J., & Beh, L.-S. (2014). A qualitative
review of literature on peer review of teaching in higher education: An application of
the SWOT framework. Review of Educational Research, 84 (1), 112-159.
Tucker, B., Jones, S., Straker, L., & Cole, J. (2003). Course evaluation on the web: Facili-
tating student and teacher reflection to improve learning. New Directions for Teaching
and Learning , 2003 (96), 81-93.
Wachtel, H. K. (1998). Student evaluation of college teaching effectiveness: A brief review.
Assessment & Evaluation in Higher Education, 23 (2), 191-212.
Walkington, C., Arora, P., Ihorn, S., Gordon, J., Walker, M., Abraham, L., & Marder,
M. (2012). Development of the UTeach observation protocol: A classroom observation
instrument to evaluate mathematics and science teachers from the UTeach preparation
program (Tech. Rep.). Retrieved 12/7/2015, from http://uteach.utexas.edu
Ware Jr, J. E., & Williams, R. G. (1975). The Dr. Fox effect: A study of lecturer effectiveness
and ratings of instruction. Academic Medicine, 50 (2), 149-56.
Weisberg, S. (2005). Applied linear regression (3rd ed.). John Wiley & Sons.
Weston, R., & Gore, P. A. (2006). A brief guide to structural equation modeling. The
Counseling Psychologist , 34 (5), 719-751.
Wieman, C., & Gilbert, S. (2014). The teaching practices inventory: A new tool for charac-
terizing college and university teaching in mathematics and science. CBE-Life Sciences
Education, 13 (3), 552-569.
Wolf, E. J., Harrington, K. M., Clark, S. L., & Miller, M. W. (2013). Sample size requirements
for structural equation models: An evaluation of power, bias, and solution propriety.
Educational and Psychological Measurement , 73 (6), 913–934.
APPENDIX A
OVERVIEW OF OBSERVATION PROTOCOLS
Mathematics Classroom Observation Protocol for Practices (MCOP2)
Subject: Mathematics
Sample Size: 127 classroom observations
Validated Grades: K-16
Brief Description: MCOP2 contains 16 items intended to measure two primary constructs: student engagement and teacher facilitation. Each item contains a full description of the item with specific requirements for each rating level.
Documented Drawbacks: Does not produce a fine-grained analysis. The MCOP2 was not designed to evaluate a teacher on a single observation due to the nature and complexity of teaching.
(Gleason & Cofer, 2014; Gleason et al., 2017)
Reformed Teaching Observation Protocol (RTOP)
Subject: Mathematics and Science
Sample Size: 87 observations of 141 classrooms
Validated Grades: Secondary and Postsecondary (2-year and 4-year)
Brief Description: RTOP is a 25-item classroom observation protocol that is standards based, inquiry oriented, and student centered. It requires a trained observer to rate items on a Likert scale.
Documented Drawbacks: “Though a Likert scale may be helpful to a researcher in quantifying an observation, it is difficult for teachers to know what they need to do to improve from a 4 to 5.” (Marshall, Smart, & Horton, 2010) “Exploratory factor analysis showed that some but not all of the individual items within a given construct loaded together.” (Piburn & Sawada, 2000; Sawada et al., 2002) “RTOP places little emphasis on the accuracy and depth of the content being conveyed during a lesson.” (Walkington et al., 2012) “The observers must complete a multiday training program to achieve acceptable interrater reliability.” (Smith et al., 2013)
(Piburn & Sawada, 2000; Sawada et al., 2002)
Oregon Teacher Observation Protocol (OTOP)
Subject: Mathematics and Science
Sample Size: 123 observations of 41 classes and 50 classroom observations
Validated Grades: Postsecondary (public and private) and Secondary
Brief Description: OTOP is a 10-item protocol designed to generate a profile of what is happening across instructional settings rather than assigning a score to a particular lesson. Items are treated as nominal data.
Documented Drawbacks: “Despite its supposed reliability in Faculty Fellows mathematics classes, the OTOP’s scientific nature and lack of recent mathematical standards make it undesirable for use in college mathematics courses.” (Gleason & Cofer, 2014)
(Flick, Sadri, Morrell, Wainwright, & Schepige, 2009; Morrell, Wainwright, & Flick, 2004)
UTeach Observation Protocol (UTOP)
Subject: Mathematics and Science
Sample Size: 83 observations of 36 teachers
Validated Grades: Secondary
Brief Description: “The UTOP includes 32 classroom observation indicators organized into four sections: Classroom Environment, Lesson Structure, Implementation, and Math/Science Content. The indicators are rated by observers on a 7-point scale: 1 to 5 Likert with Don’t Know (DK) and Not Applicable (NA) options (for some items).” (Walkington et al., 2012)
Documented Drawbacks: “Besides the science-specific language, another drawback to the UTOP is it is solely based off of NCTM standards from 1991.” (Gleason & Cofer, 2014)
(Walkington et al., 2012)
Classroom Observation Protocol (COP)
Subject: Mathematics and Science
Sample Size: 1,610 lesson observations
Validated Grades: K-12
Brief Description: The COP contains several sections where observers describe and classify the major activities, materials, and purposes of a math or science lesson, and then it provides four sections where observers rate various aspects of classroom instruction using a Likert (1-5) scale.
Documented Drawbacks: “Due to the large number of evaluators, inter-rater reliability was an issue for classroom observation data. Also, this study was cross-sectional in nature so there are limitations in the design of this study.” (Bowes & Banilower, 2004)
(Bowes & Banilower, 2004)
Classroom Observation Protocol for Undergraduate STEM (COPUS)
Subject: Mathematics and Science (listed as STEM, but no mention of engineering or technology classroom testing)
Sample Size: 30 classroom observations
Validated Grades: Postsecondary
Brief Description: “COPUS documents classroom behaviors in 2-min intervals throughout the duration of the class session. It does not require observers to make judgments of teaching quality, and it produces clear graphical results. COPUS is limited to 25 codes in two categories (“What the students are doing” and “What the instructor is doing”) and can be reliably used by university faculty with only 1.5 hours of training.” (Smith et al., 2013)
Documented Drawbacks: “COPUS observations provided a measurement for only a single class period. From multiple COPUS observations of a single course, we know that it is not unusual to have substantial variations from one class to another.” (Wieman & Gilbert, 2014)
(Smith et al., 2013)
Teaching Dimensions Observation Protocol (TDOP)
Subject: Mathematics and Science
Sample Size: Inter-rater reliability results from TDOP training in the spring of 2012 do not include a sample size.
Validated Grades: Postsecondary (nonlaboratory courses)
Brief Description: Six dimensions of practice comprise the TDOP: Teaching methods, Pedagogical strategies, Cognitive demand, Student-teacher interactions, Student engagement, and Instructional technology. Observers document the classroom behaviors with 46 codes in 2-min intervals throughout the class session.
Documented Drawbacks: “Requires substantial training, as one might expect for a protocol that was designed to be a complex research instrument.” (Smith et al., 2013) “TDOP does not aim to measure latent variables such as instructional quality, and it is not tied to external criterion such as reform-based teaching standards.” (Hora, Oleson, & Ferrare, 2013)
(Hora et al., 2013)
Classroom Assessment Scoring System - Secondary (CLASS-S)
Subject: General
Sample Size: 1,482 lesson observations (video)
Validated Grades: 6-11
Brief Description: CLASS is a tool for observing and assessing the effectiveness of interactions among teachers and students in classrooms. It measures the emotional, organizational, and instructional supports provided by teachers that contribute to children’s social, developmental, and academic achievement.
Documented Drawbacks: “Does not take into account teaching behaviors specific to the disciplines of mathematics and science, such as placing content in the “big picture” of the domain, supporting sense-making about concepts through real world connections, and appropriately and powerfully making use of tools of abstraction.” (Walkington et al., 2012)
(Allen et al., 2013)
Mathematical Quality of Instruction (MQI)
Subject: Mathematics
Sample Size: 10 teacher observations
Validated Grades: 2-6
Brief Description: MQI is designed to provide scores for teachers on important dimensions of classroom mathematics instruction. These dimensions include the richness of the mathematics, student participation in mathematical reasoning and meaning-making, and the clarity and correctness of the mathematics covered in class.
Documented Drawbacks: “Although there is a significant, strong, and positive association between levels of MKT (mathematical knowledge for teaching) and the mathematical quality of instruction, we also find that there are a number of important factors that mediate this relationship, either supporting or hindering teachers’ use of knowledge in practice.” (Hill et al., 2008)
(Hill et al., 2008)
APPENDIX B
DEMOGRAPHIC CHARACTERISTICS OF THE SAMPLE
A total of 110 observations of 72 instructors were conducted. Only 86 of the 110 observations have instructor demographic data, because 15 instructors did not complete the demographics survey. (Gender was recorded by the observer on the background information form, so it is reported for all 110 observations in Table 8.)
Table 8
Demographic Characteristics of the Sample

                                                  Frequency    %
Gender
  Male                                                60      55
  Female                                              50      45
Age Range
  18-24 years old                                      2       2
  25-34 years old                                     41      48
  35-44 years old                                     15      17
  45-54 years old                                     10      12
  55-64 years old                                     14      16
  65 years and over                                    4       5
Race/Ethnicity
  American Indian or Alaskan Native                    0       0
  Asian / Pacific Islander                            11      13
  Black or African American                            4       5
  Hispanic American                                    2       2
  White / Caucasian                                   69      80
  Multiple ethnicity / Other (please specify)          0       0
Level of Education
  Bachelor's degree                                    2       2
  Master's degree                                     27      31
  PhD                                                 54      63
  Other advanced degree beyond a Master's degree       3       3
Job Title
  Graduate Teaching Assistant                         13      15
  Adjunct/Instructor                                  24      28
  Assistant Professor                                 24      28
  Associate Professor                                  8       9
  Full Professor                                      17      20
Number of Years Teaching at High School Level
  Less than one year                                  65      76
  1-5 years                                           10      12
  6-10 years                                           5       6
  11-15 years                                          1       1
  Over 15 years                                        5       6
Number of Years Teaching at College Level
  Less than one year                                   1       1
  1-5 years                                           23      27
  6-10 years                                          27      31
  11-15 years                                         11      13
  Over 15 years                                       24      28
APPENDIX C
INSTRUMENTS USED
Background Information
1. Institution:
2. Description of course (Calculus I, College Algebra, Analysis, etc.):
3. Gender of instructor:
4. Date of observation:
5. Time of observation:
Abbreviated Reformed Teaching Observation Protocol¹
Inquiry Orientation
1. The lesson was designed to engage students as members of a learning community.
Score 4: Lesson is designed to include both extensive teacher-student and student-student interactions.
Score 3: Lesson is designed for continual interaction between teacher and students.
Score 2: Classroom interactions are only teacher-student or student-student.
Score 1: Lesson has limited opportunities to engage students (e.g., rhetorical questions or shout-out opportunities).
Score 0: This lesson is completely teacher-centered, lecture only.
2. Intellectual rigor, constructive criticism, and the challenging of ideas were valued.
Score 4: Students debate ideas through a negotiation of meaning that results in strong use of evidence/arguments to support claims.
Score 3: Students engaged in a teacher-guided but student-driven discussion (“debate”) involving one or more of the following: a variety of ideas, alternative interpretations, or alternative lines of reasoning.
Score 2: Students participate in a teacher-directed whole-class discussion (debate) involving one or more of the following: a variety of ideas, alternative interpretations, or alternative lines of reasoning.
Score 1: At least once the students respond (perhaps by “shout out”) to the teacher’s queries regarding alternate ideas, alternative reasoning, or alternative interpretations.
Score 0: Students were not asked to demonstrate rigor, offer criticisms, or challenge ideas.
3. This lesson encouraged students to seek and value alternative modes of investigation or of problem solving.

Score 4: Lesson was designed for students to engage in alternative modes, and a clear discussion of these alternatives occurs.
Score 3: Lesson was designed for students to engage in alternative modes of investigation, but without subsequent discussion.
Score 2: Lesson was designed for students to ask divergent questions, but not investigate.
Score 1: Lesson was designed for the instructor to ask divergent questions (teacher directed).
Score 0: No alternative modes were explored during the lesson.
¹Adapted from (Dayton Regional STEM Center, 2011; Walkington et al., 2012)
4. Students made predictions, estimations, and/or hypotheses and devised means for testing them.

Score 4: The students explicitly make, write down or depict, and explain their prediction, estimation, and/or hypothesis. Students devise a means for testing their prediction, estimation, and/or hypothesis.
Score 3: Students discuss predictions. Means for testing is highly suggested.
Score 2: Teacher may ask students to predict and wait for input (class as a whole or as pairs, etc.). No means for testing.
Score 1: Teacher may ask the class to predict as a whole, but doesn’t wait for a response. No means for testing.
Score 0: No opportunities for any predictions (students explaining what happened does not mean predicting).
5. The teacher acted as a resource person, working to support and enhance student investigations.

Score 4: Students are actively engaged in the learning process; students determine what and how, and the teacher is available to help. The teacher uses student investigation or questions to direct the inquiry process.
Score 3: Students have freedom, but within the confines of teacher-directed boundaries. Student led. Teacher answers questions instead of directing inquiry.
Score 2: Primarily directed by the teacher, with occasional opportunities for students to guide the direction.
Score 1: Very teacher directed, limited student investigation, very rote.
Score 0: No investigations (activity that engages students to apply content through problem solving). Lecture based.
Content Propositional Knowledge
6. The lesson involved fundamental concepts of the subject.
Score 4: The content covered and/or tasks, examples, or activities chosen by the teacher were clearly and explicitly related to significant concepts to gain a deeper understanding and make worthwhile connections to the mathematical or scientific ideas.
Score 3: The content covered and/or tasks, examples, or activities chosen by the teacher were clearly related to the significant content of the course, and the tasks, examples, or activities that were used allowed for development of worthwhile connections to the mathematical or scientific ideas.
Score 2: The content covered was significant and relevant to the content of the course, but the presentation, tasks, examples, or activities chosen were prescriptive, superficial, or contrived and did not allow the students to make meaningful connections to mathematical or scientific ideas.
Score 1: The content covered and/or tasks, examples, or activities chosen by the teacher were distantly or only sometimes related to the content of the course. This item should also be rated a 1 if the content chosen was developmentally inappropriate: either too low-level or too advanced for the students.
Score 0: The content covered and/or tasks, examples, or activities chosen by the teacher were unrelated to the content of the course.
7. The lesson promoted strongly coherent conceptual understanding.
Score 4: Lesson is presented in a clear and logical manner; the relation of content to concepts is clear throughout, and it flows from beginning to end.
Score 3: Lesson is predominantly presented in a clear and logical fashion, but the relation of content to concepts is not always obvious.
Score 2: Lesson may be clear and/or logical, but the relation of content to concepts is very inconsistent (or vice versa).
Score 1: Lesson is disjointed and not consistently focused on the concepts.
Score 0: Not presented in any logical manner; lacks clarity and no connections between material.
8. The teacher had a solid grasp of the subject matter content inherent in the lesson.
Score 4: The teacher clearly understood the content and how to successfully communicate the content to the class. The teacher was able to present interesting and relevant examples, explain concepts in multiple ways, facilitate discussions, connect it to the big ideas of the discipline, use advanced questioning strategies to guide student learning, and identify and use common misconceptions or alternative ideas as learning tools.
Score 3: The teacher clearly understood the content and how to successfully communicate the content to the class. The teacher used multiple examples and strategies to engage students with the content.
Score 2: There were no issues with the teacher’s understanding of the content and its accuracy, but the teacher was not always fluid or did not try to present the content in multiple ways. When students appeared confused, the teacher was unable to re-teach the content in a completely clear, understandable, and/or transparent way such that most students understood.
Score 1: There were several smaller issues with the teacher’s understanding and/or communication of the content that sometimes had a negative impact on student learning.
Score 0: There was a significant issue with the teacher’s understanding and/or communication of the content that negatively impacted student learning during the class.
9. Elements of abstraction (i.e., symbolic representations, theory building) were encouraged when it was important to do so.

Score 4: Abstraction is being used for a relevant and useful purpose. A variety of representations were used to build the lesson and to support/develop the content. The abstractions are presented in a way such that they are understandable and accessible to the class.
Score 3: Teacher uses a variety of abstractions throughout the lesson and occasionally explains them in a manner that supports/develops the content. Perhaps there was a small missed opportunity with respect to facilitating students’ understanding of abstraction.
Score 2: The teacher’s use of abstraction was adequate. Teacher uses a variety of abstractions throughout the lesson but does not explain them in a manner that supports/develops the content.
Score 1: The teacher neglects important explanation and discussion of abstraction that is being used during the class, and this missed opportunity has a negative impact on student learning.
Score 0: There was a major issue with the teacher’s use of abstraction, or no abstraction was presented. This had a negative impact on student learning during the class.
10. Connections with other content disciplines and/or real-world phenomena were explored and valued.
Score Description
4 Throughout the class, the content was taught in the context of its use in other disciplines, other areas of mathematics/science, or in the real world, and the teacher clearly had deep knowledge about how the content is used in those areas.
3 The teacher included one or more connections between the content and another discipline/the real world, and the teacher engaged the students in an extended discussion or activity relating to these connections.
2 The teacher connected the content being learned to another discipline/the real world, and the teacher explicitly brought this connection to the students' attention.
1 A minor connection was made to another area of mathematics/science, to another discipline, or to real-world contexts, but it was generally abstract or not helpful for content comprehension. (For example, word problems that can be solved without the context of the problem.)
0 No connections were made to other areas of mathematics/science or to other disciplines, or connections were made that were inappropriate or incorrect.
Mathematics Classroom Observation Protocol for Practices Descriptors 2
1. Students engaged in exploration/investigation/problem solving.
Score Description
3 Students regularly engaged in exploration, investigation, or problem solving. Over the course of the lesson, the majority of the students engaged in exploration/investigation/problem solving.
2 Students sometimes engaged in exploration, investigation, or problem solving. Several students engaged in problem solving, but not the majority of the class.
1 Students seldom engaged in exploration, investigation, or problem solving. This tended to be limited to one or a few students engaged in problem solving while other students watched but did not actively participate.
0 Students did not engage in exploration, investigation, or problem solving. There were either no instances of investigation or problem solving, or the instances were carried out by the teacher without active participation by any students.
2. Students used a variety of means (models, drawings, graphs, concrete materials, manipulatives, etc.) to represent concepts.
Score Description
3 The students manipulated or generated two or more representations to represent the same concept, and the connections across the various representations, relationships of the representations to the underlying concept, and applicability or the efficiency of the representations were explicitly discussed by the teacher or students, as appropriate.
2 The students manipulated or generated two or more representations to represent the same concept, but the connections across the various representations, relationships of the representations to the underlying concept, and applicability or the efficiency of the representations were not explicitly discussed by the teacher or students.
1 The students manipulated or generated one representation of a concept.
0 There were either no representations included in the lesson, or representations were included but were exclusively manipulated and used by the teacher. If the students only watched the teacher manipulate the representation and did not interact with a representation themselves, it should be scored a 0.
2Reprinted by permission from (Gleason, Livers, & Zelkowski, 2015)
3. Students were engaged in mathematical activities.
Score Description
3 Most of the students spend two-thirds or more of the lesson engaged in mathematical activity at the appropriate level for the class. It does not matter if it is one prolonged activity or several shorter activities. (Note that listening and taking notes does not qualify as a mathematical activity unless the students are filling in the notes and interacting with the lesson mathematically.)
2 Most of the students spend more than one-quarter but less than two-thirds of the lesson engaged in appropriate-level mathematical activity. It does not matter if it is one prolonged activity or several shorter activities.
1 Most of the students spend less than one-quarter of the lesson engaged in appropriate-level mathematical activity. There is at least one instance of students' mathematical engagement.
0 Most of the students are not engaged in appropriate-level mathematical activity. This could be because they are never asked to engage in any activity and spend the lesson listening to the teacher and/or copying notes, or it could be because the activity they are engaged in is not mathematical, such as a coloring activity.
4. Students critically assessed mathematical strategies.
Score Description
3 More than half of the students critically assessed mathematical strategies. This could have happened in a variety of scenarios, including in the context of partner work, small group work, or a student making a comment during direct instruction or individually to the teacher.
2 At least two but less than half of the students critically assessed mathematical strategies. This could have happened in a variety of scenarios, including in the context of partner work, small group work, or a student making a comment during direct instruction or individually to the teacher.
1 An individual student critically assessed mathematical strategies. This could have happened in a variety of scenarios, including in the context of partner work, small group work, or a student making a comment during direct instruction or individually to the teacher. The critical assessment was limited to one student.
0 Students did not critically assess mathematical strategies. This could happen for one of three reasons: 1) No strategies were used during the lesson; 2) Strategies were used but were not discussed critically. For example, the strategy may have been discussed in terms of how it was used on the specific problem, but its use was not discussed more generally; 3) Strategies were discussed critically by the teacher, but this amounted to the teacher telling the students about the strategy(ies), and students did not actively participate.
5. Students persevered in problem solving.
Score Description
3 Students exhibited strong perseverance in problem solving. The majority of students looked for entry points and solution paths, monitored and evaluated progress, and changed course if necessary. When confronted with an obstacle (such as how to begin or what to do next), the majority of students continued to use resources (physical tools as well as mental reasoning) to continue to work on the problem.
2 Students exhibited some perseverance in problem solving. Half of the students looked for entry points and solution paths, monitored and evaluated progress, and changed course if necessary. When confronted with an obstacle (such as how to begin or what to do next), half of the students continued to use resources (physical tools as well as mental reasoning) to continue to work on the problem.
1 Students exhibited minimal perseverance in problem solving. At least one student but less than half of the students looked for entry points and solution paths, monitored and evaluated progress, and changed course if necessary. When confronted with an obstacle (such as how to begin or what to do next), at least one student but less than half of the students continued to use resources (physical tools as well as mental reasoning) to continue to work on the problem. There must be a roadblock to score above a 0.
0 Students did not persevere in problem solving. This could be because there was no student problem solving in the lesson, or because when presented with a problem-solving situation no students persevered. That is to say, all students either could not figure out how to get started on a problem, or when they confronted an obstacle in their strategy they stopped working.
6. The lesson involved fundamental concepts of the subject to promote relational/conceptual understanding.
Score Description
3 The lesson includes fundamental concepts or critical areas of the course, as described by the appropriate standards, and the teacher/lesson uses these concepts to build relational/conceptual understanding of the students with a focus on the “why” behind any procedures included.
2 The lesson includes fundamental concepts or critical areas of the course, as described by the appropriate standards, but the teacher/lesson misses several opportunities to use these concepts to build relational/conceptual understanding of the students with a focus on the “why” behind any procedures included.
1 The lesson mentions some fundamental concepts of mathematics, but does not use these concepts to develop the relational/conceptual understanding of the students. For example, in a lesson on the slope of a line, the teacher mentions that it is related to ratios, but does not help the students to understand how it is related and how that can help them to better understand the concept of slope.
0 The lesson consists of several mathematical problems with no guidance to make connections with any of the fundamental mathematical concepts. This usually occurs with a teacher focusing on the procedure for solving certain types of problems without the students understanding the “why” behind the procedures.
7. The lesson promoted modeling with mathematics.
Score Description
3 Modeling (using a mathematical model to describe a real-world situation) is an integral component of the lesson, with students engaged in the modeling cycle (as described in the Common Core State Standards).
2 Modeling is a major component, but the modeling has been turned into a procedure (i.e., a group of word problems that all follow the same form, and the teacher has guided the students to find the key pieces of information and how to plug them into a procedure); or modeling is not a major component, but the students engage in a modeling activity that fits within the corresponding standard of mathematical practice.
1 The teacher describes some type of mathematical model to describe real-world situations, but the students do not engage in activities related to using mathematical models.
0 The lesson does not include any modeling with mathematics.
8. The lesson provided opportunities to examine mathematical structure. (Symbolic notation, patterns, generalizations, conjectures, etc.)
Score Description
3 The students have a sufficient amount of time and opportunity to look for and make use of mathematical structure or patterns.
2 Students are given some time to examine mathematical structure, but are not allowed adequate time or are given too much scaffolding, so that they cannot fully understand the generalization.
1 Students are shown generalizations involving mathematical structure, but have little opportunity to discover these generalizations themselves or adequate time to understand the generalization.
0 Students are given no opportunities to explore or understand the mathematical structure of a situation.
9. The lesson included tasks that have multiple paths to a solution or multiple solutions.
Score Description
3 A lesson which includes several tasks throughout, or a single task that takes up a large portion of the lesson, with multiple solutions and/or multiple paths to a solution, and which increases the cognitive level of the task for different students.
2 Multiple solutions and/or multiple paths to a solution are a significant part of the lesson, but are not the primary focus, or are not explicitly encouraged; or more than one task has multiple solutions and/or multiple paths to a solution that are explicitly encouraged.
1 Multiple solutions and/or multiple paths minimally occur, and are not explicitly encouraged; or a single task has multiple solutions and/or multiple paths to a solution that are explicitly encouraged.
0 A lesson which focuses on a single procedure to solve certain types of problems and/or strongly discourages students from trying different techniques.
10. The lesson promoted precision of mathematical language.
Score Description
3 The teacher “attends to precision” with regard to communication during the lesson. The students also “attend to precision” in communication, or the teacher guides students to modify or adapt non-precise communication to improve precision.
2 The teacher “attends to precision” in all communication during the lesson, but the students are not always required to also do so.
1 The teacher makes a few incorrect statements or is sloppy about mathematical language, but generally uses correct mathematical terms.
0 The teacher makes repeated incorrect statements or uses incorrect names for mathematical objects instead of their accepted mathematical names.
11. The teacher’s talk encouraged student thinking.
Score Description
3 The teacher’s talk focused on high levels of mathematical thinking. The teacher may ask lower-level questions within the lesson, but this is not the focus of the practice. There are three possibilities for high levels of thinking: analysis, synthesis, and evaluation. Analysis: examines/interprets the pattern, order, or relationship of the mathematics; parts of the form of thinking. Synthesis: requires original, creative thinking. Evaluation: makes a judgment of good or bad, right or wrong, according to the standards he/she values.
2 The teacher’s talk focused on mid-levels of mathematical thinking. Interpretation: discovers relationships among facts, generalizations, definitions, values, and skills. Application: requires identification, selection, and use of appropriate generalizations and skills.
1 Teacher talk consists of “lower order” knowledge-based questions and responses focusing on recall of facts. Memory: recalls or memorizes information. Translation: changes information into a different symbolic form or situation.
0 Any questions/responses of the teacher related to mathematical ideas were rhetorical, in that there was no expectation of a response from the students.
12. There was a high proportion of students talking related to mathematics.
Score Description
3 More than three-quarters of the students were talking related to the mathematics of the lesson at some point during the lesson.
2 More than half, but less than three-quarters, of the students were talking related to the mathematics of the lesson at some point during the lesson.
1 Less than half of the students were talking related to the mathematics of the lesson.
0 No students talked related to the mathematics of the lesson.
13. There was a climate of respect for what others had to say.
Score Description
3 Many students are sharing, questioning, and commenting during the lesson, including their struggles. Students are also actively listening, clarifying, and recognizing the ideas of others.
2 The environment is such that some students are sharing, questioning, and commenting during the lesson, including their struggles. Most students listen.
1 Only a few students share, as called on by the teacher. The climate supports those who understand or who behave appropriately. Or some students are sharing, questioning, or commenting during the lesson, but most students are not actively listening to the communication.
0 No students shared ideas.
14. In general, the teacher provided wait-time.
Score Description
3 The teacher frequently provided an ample amount of “think time” for the depth and complexity of a task or question posed by either the teacher or a student.
2 The teacher sometimes provided an ample amount of “think time” for the depth and complexity of a task or question posed by either the teacher or a student.
1 The teacher rarely provided an ample amount of “think time” for the depth and complexity of a task or question posed by either the teacher or a student.
0 The teacher never provided an ample amount of “think time” for the depth and complexity of a task or question posed by either the teacher or a student.
15. Students were involved in the communication of their ideas to others (peer-to-peer).
Score Description
3 Considerable time (more than half) was spent in peer-to-peer dialog (pairs, groups, whole class) related to the communication of ideas, strategies, and solutions.
2 Some class time (less than half, but more than just a few minutes) was devoted to peer-to-peer (pairs, groups, whole class) conversations related to the mathematics.
1 A few instances of peer-to-peer (pairs, groups, whole class) conversations developed during the lesson, but each lasted less than 5 minutes.
0 No peer-to-peer (pairs, groups, whole class) conversations occurred during the lesson.
16. The teacher uses student questions/comments to enhance conceptual mathematical understanding.
Score Description
3 The teacher frequently uses student questions/comments to coach students, to facilitate conceptual understanding, and to boost the conversation. The teacher sequences the student responses that will be displayed in an intentional order, and/or connects different students’ responses to key mathematical ideas.
2 The teacher sometimes uses student questions/comments to enhance conceptual understanding.
1 The teacher rarely uses student questions/comments to enhance conceptual mathematical understanding. The focus is more on procedural knowledge of the task versus conceptual knowledge of the content.
0 The teacher never uses student questions/comments to enhance conceptual mathematical understanding.
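Since each of the sixteen descriptors above maps observed behavior to an integer score (0-3), a completed observation can be recorded as a simple item-to-score mapping and aggregated by subscale. The sketch below is purely illustrative: the item keys and the subscale grouping are placeholders, and the actual MCOP2 item-to-subscale assignments should be taken from Gleason, Livers, & Zelkowski (2015), not from this example.

```python
# Hypothetical sketch: recording one MCOP2 observation as item -> score.
# The item keys and the subscale grouping are illustrative placeholders,
# not the published MCOP2 subscale assignments.

VALID_SCORES = range(0, 4)  # each MCOP2 item is scored on a 0-3 scale


def validate(observation: dict) -> None:
    """Raise ValueError if any item score falls outside the 0-3 range."""
    for item, score in observation.items():
        if score not in VALID_SCORES:
            raise ValueError(f"{item}: score {score} outside 0-3")


def subscale_total(observation: dict, items: list) -> int:
    """Sum the scores of the items belonging to one subscale."""
    return sum(observation[item] for item in items)


if __name__ == "__main__":
    obs = {f"item_{i}": 2 for i in range(1, 17)}  # 16 items, all scored 2
    validate(obs)
    # Placeholder grouping: the first eight items treated as one subscale.
    first_half = [f"item_{i}" for i in range(1, 9)]
    print(subscale_total(obs, first_half))  # 8 items x 2 = 16
```

A structure like this makes it straightforward to compute the subscale totals that feed the reliability and regression analyses described elsewhere in the study.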
Note Taking Form
Observation number:
Random order:
Date and Time:
Class name/description:
Number of Students:
1. Are students engaged? How many students are actively participating in the lesson?
(a) Exploring and problem solving
(b) Using a variety of means (abstractions)
(c) Assessing mathematical strategy
(d) Overcoming road blocks
2. What is the interaction between student and teacher? Between student peers?
(a) Talking related to mathematics (How many?)
(b) Respecting others’ ideas (How many sharing and/or listening?)
3. How is the content presented?
(a) Lesson structure (Direct lecture, discussion/debate, student led)
(b) Alternative methods (Multiple paths to a solution or multiple solutions)
(c) Abstractions connected
(d) Wait time provided to reason, make sense, and articulate
4. What is the content covered or task, examples, and activities?
(a) Fundamental (What to do and why?)
(b) Added value and relevant
(c) Examined math structure (generalizations examined)
(d) Connected and flowed smoothly
(e) Connected with other areas of mathematics, other disciplines, or real world
5. Did the instructor have a solid grasp of the material?
(a) Used precision of mathematical language
(b) Enhanced content with student comments
(c) Talk encouraged student thinking (level)
APPENDIX D
REGRESSION MODELS AND RESIDUAL PLOTS
Figure 6: Regression Model 1: Student Engagement and Inquiry Orientation
Figure 7: Regression Model 2: Student Engagement and Inquiry Orientation
Figure 8: Regression Model 3: Teacher Facilitation and Inquiry Orientation
Figure 9: Regression Model 4: Teacher Facilitation and Content Propositional Knowledge
Figure 10: Regression Model 5: Inquiry Orientation and Content Propositional Knowledge
Figure 11: Regression Model 6: Student Engagement and Teacher Facilitation
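Each of the six models above regresses one protocol subscale on another. As a minimal sketch of how such a simple linear regression and its residuals could be computed, consider the following; the subscale scores here are made-up placeholders for illustration, not the study's observation data, and the variable names are hypothetical.

```python
# Illustrative sketch of one subscale-on-subscale regression
# (e.g., Student Engagement regressed on Inquiry Orientation).
# All data below are fabricated placeholders, not the study's observations.
import numpy as np


def fit_line(x: np.ndarray, y: np.ndarray):
    """Ordinary least squares fit of y = a + b*x; returns (a, b, residuals)."""
    b, a = np.polyfit(x, y, 1)  # polyfit returns the highest-degree term first
    residuals = y - (a + b * x)  # residuals feed the residual plots
    return a, b, residuals


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inquiry = rng.uniform(0, 24, size=30)              # hypothetical subscale totals
    engagement = 0.8 * inquiry + rng.normal(0, 2, 30)  # hypothetical response
    a, b, res = fit_line(inquiry, engagement)
    print(f"intercept={a:.2f}, slope={b:.2f}")
```

Plotting `res` against the fitted values would reproduce the kind of residual plot shown in the figures above.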
APPENDIX E
IRB CERTIFICATIONS
See following pages for copies of IRB Certifications.