Demystifying Quantitative Data: Statistical Explorations in Educational Assessment

University Assessment and Testing
Oklahoma State University

September 12, 2011

Authored by: John D. Hathcoat


Introduction

Educational assessment may broadly be conceived as the systematic collection of information in order to understand student progress and/or experiences. All stakeholders, including students, have an interest in ensuring that conclusions about educational programs are data-driven. Data-driven decisions are not easy to make, however, given that they require a basic understanding of research design and data analysis. All too often, I have witnessed individuals collect data haphazardly, with only a vague sense of their research aims and of the kind of data needed to answer pertinent questions. From a pragmatic perspective, collecting data is an intentional act that reflects the questions driving our research. In other words, data aids our ability to form judgments about the vitality of educational programs, and the data we collect is useful to the extent that it is utilized when making informed judgments about those programs. This pragmatic view is not the only approach to conducting research, and it is not the intent of this handbook to advocate any specific philosophical worldview. Instead, my intention is to provide stakeholders with an introductory guide to data analysis and interpretation. Fulfillment of this aim will not only allow stakeholders to maximize the usefulness of their data but also provide insight into the inherent complexity of educational research. In this respect, my purpose has been to demystify the use of quantitative data in educational assessment.

When writing this guide, difficult decisions had to be made about not only what to include but also the depth of information given for each topic. Entire textbooks have been written on each of the many issues addressed in this handbook, and this handbook is in no way meant to replace them. Instead, it may be viewed as a supplement to more thorough treatments of each topic. No existing statistical knowledge is necessary in order to understand the material; in fact, it was explicitly assumed throughout this text that the reader has no background in statistics. Individuals with advanced statistical training are not the target audience, given that such readers would see little need to have quantitative data demystified. The intended audience is therefore


program administrators, staff, faculty, and students with little statistical training who are interested in exploring quantitative data in educational assessment. This guide aims to provide a preliminary framework for exploring quantitative data in a way that is easy to understand.

The reader would benefit most by going through the text in its entirety. However, it is possible to treat each section as relatively independent. It is suggested that you at minimum read the sections on levels of measurement and statistical significance before skipping to other sections, as these are crucial for providing clarity to the rest of the text. The text begins with an example in which rubric data is used to assess critical thinking. Though this example focuses upon rubrics, it is important to keep in mind that the discussion is in no way limited to rubric data. As will be discussed, data exploration is not affected by whether one uses a rubric or a survey, but is instead dependent upon the level of measurement of each variable. Exploring data is then illustrated by demonstrating how descriptive statistics can quickly summarize raw data. Variables may be measured differently, however, and the statistics you report may change depending upon how numbers are assigned to each variable. A basic introduction to the different levels of measurement is therefore given. Page 17 provides a diagram that may be useful when deciding which statistics are appropriate for each level of measurement. Graphs also give a succinct way to summarize data; however, deciding which graph is appropriate is also contingent upon how your variables are measured. Histograms and pie charts are illustrated in the text, and I have also stated some general guidelines for other graphical displays on page 19.

Basic research questions are then addressed by dividing them into two categories. First, some research questions are concerned with whether averages differ across groups or time. Second, some research questions focus upon relationships among study variables, or whether one variable predicts changes in a second variable. These categories are not meant to be exhaustive, but may instead be viewed as a starting point for formulating research questions. The meaning


of statistical significance is elucidated in some detail, given that even those with advanced coursework in statistical techniques often make inappropriate inferences from statistical hypothesis testing. Examples for each research question are illustrated with specific statistical tests, though the primary emphasis is the correct interpretation of the output generated by SPSS/PASW. Limitations of statistical significance tests are also detailed, and alternative ways to interpret the data are provided (e.g., effect sizes). This is followed by some general guidelines for exploring quantitative data in educational assessment. The handbook concludes with concerns about sampling and populations when working with a small number of students. Finally, an Appendix details how you might write a brief assessment report within an institution of higher education.

This text is still a work in progress, and at this stage of development it is meant to be a practical guide only. There are some noted limitations. In terms of statistical tests, the text emphasizes the correct interpretation of SPSS/PASW output. All of these tests, however, are based upon models that carry specific assumptions. To keep the presentation simple the text does not examine these assumptions in detail, though some are briefly addressed in the SPSS/PASW output. Conclusions derived without a full examination of these assumptions should therefore be treated as exploratory. It is also important to recognize that most of these tests are inferential in nature, which basically means that we have a group of people that we believe represents some larger group; we therefore make inferences from our sample back to this larger population. Such tests are generally fine for larger samples, but they can become problematic with fairly small samples (fewer than 30 in many cases). This concern is discussed further in my concluding remarks. Since the text is still being developed, supplemental material will eventually be added to make it more inclusive. I have provided links to basic tutorials that will allow you to perform these statistical tests in SPSS/PASW, and in the near future the text will be supplemented by my own video tutorials.


Acknowledgment

I would like to specifically thank Dr. Jeremy Penn, the Director of University Assessment and Testing at Oklahoma State University. Dr. Penn's support has been invaluable, and it was his recognition of the need for such a text that provided my initial motivation to write this handbook. Conversations with him, as well as his critical feedback, have been essential resources for improving the quality of this handbook, which would not exist without his prompting. I would also like to thank all of the staff at University Assessment and Testing. They are a continual source of inspiration and support.


Table of Contents

Overview of Rubric Data in Educational Assessment .......... 1
    Example .......... 1
    A Quick Analysis of the Rubric .......... 1
    The Data .......... 2
Data Exploration .......... 3
    Question 1: Where do critical thinking scores tend to fall? .......... 3
        Mean .......... 3
        Median .......... 4
        Mode .......... 4
    Question 2: How much do critical thinking scores tend to be different across students? .......... 5
        Range .......... 5
        Variance and Standard Deviation .......... 5
    Why should we care? .......... 6
Level of Measurement and Descriptive Statistics .......... 6
    Nominal .......... 7
    Ordinal .......... 8
    Interval .......... 9
    Ratio .......... 10
    Descriptive Statistics for each Level of Measurement .......... 11
    Level of Measurement and the Amount of Information .......... 11
Using Graphs to Examine Data .......... 12
    Histogram .......... 13
    Pie Chart .......... 14
Research Questions .......... 15
    Group or Mean Differences .......... 15
Statistical Significance .......... 17
Statistical Tests .......... 23
    Independent-Sample T-Test .......... 23
        Confidence Intervals .......... 26
        Interpretation of T-Test .......... 27
Dependent or Paired Samples T-Test .......... 28
    Example .......... 28
    Research Questions .......... 29
    Statistical Analysis .......... 29
One-way ANOVA .......... 31
    Example .......... 31
        Notes about Data .......... 33
    Research Questions .......... 33
    Statistical Analysis .......... 33
    Post Hoc Tests for ANOVA .......... 36
    Interpretation .......... 36
Association/Relationship and Prediction .......... 38
    Correlations .......... 39
        Scatterplots – Positive Correlation .......... 39
        Scatterplots – Negative Correlation .......... 40
        Correlation Sign and Strength .......... 41
        Quick Overview of Correlations .......... 43
        Correlation SPSS/PASW Output .......... 44
        Interpretation .......... 46
    Simple Linear Regression .......... 46
        Regression Equation .......... 47
        Linear Regression Research Questions .......... 49
        Simple Linear Regression SPSS/PASW Output .......... 50
        Building our Regression Equation .......... 54
Choosing a Statistical Test .......... 55
Statistical Significance and Practical Significance: Effect Size .......... 58
    Effect Size Statistics .......... 59
        Cohen's d .......... 60
        Interpretation of Cohen's d .......... 61
        Cohen's d and Percentile Estimates .......... 63
        Omega Squared .......... 63
        Cohen's f2 (or Cohen's f squared) .......... 65
        Interpreting Cohen's f2 .......... 66
General Guidelines in Interpreting Quantitative Data .......... 67
    1. Explore your data thoroughly! .......... 67
    2. In most circumstances it is inappropriate to use causal language. .......... 67
    3. When writing your results always report descriptive statistics. .......... 68
    4. There is a difference between prediction and explanation. .......... 68
    5. Statistical significance does NOT imply practical significance. .......... 69
    6. Testing for statistical significance does not prove your research hypothesis. .......... 70
A Final Note about Samples, Populations, and Sample Size .......... 71
Conclusion .......... 73
Appendix: Sample Assessment Report .......... 75

 


Demystifying Quantitative Data: Statistical Explorations in Educational Assessment

Overview of Rubric Data in Educational Assessment

The use of rubrics is a common technique for evaluating educational outcomes. It is important that faculty, students, and staff are able to get the most out of assessment data. This section provides a detailed example intended to help maximize the use of rubric data in the assessment of student and/or program outcomes. After the example is presented, specific questions are asked that allow one to derive a better understanding of what is happening in the data. Several alternative techniques for analyzing and reporting data are also provided. Finally, this section concludes with some general cautions and guidelines for drawing conclusions from such data.

Example:

The School of Program Assessment Studies at Anonymous University is utilizing rubric data in order to determine the extent to which students can think critically about program evaluation data. Three judges rated the critical thinking of 20 students using the following rubric:

Level of Critical Thinking (1-5 scale)

1 - The student fails to demonstrate critical thinking or displays a minimal level of critical thinking.
3 - The student demonstrates a moderate level of critical thinking.
5 - The student demonstrates an exceptional level of critical thinking.

(Scores of 2 and 4 are not anchored with descriptions.)

A Quick Analysis of the Rubric:

The rubric displayed above may be problematic for a few reasons. Judges were asked to rate the extent to which each student demonstrates critical thinking, yet no clear definition of critical thinking is provided. Each judge may have a very different idea about what it means to think critically.


These ideas may lead judges to give the same student very different scores. In other words, the reliability, or consistency of ratings across judges, may suffer in this rubric. This can be a huge problem that affects our findings in numerous ways.

Some simple solutions to this problem have been found to be very helpful at University Assessment and Testing. First, it is possible to give judges examples of student work that illustrate key scores; judges may, for instance, receive papers that are clearly a "1" and a "5". These papers can serve as benchmarks when they score individual students. It is also important that each judge is given a definition of the outcome you are assessing. Finally, we have found it useful to have two judges rate each student and to use the third judge as a tie-breaker. In other words, if two judges disagree about Susie's critical thinking score, then she would receive the score of the third judge (hopefully the third judge agrees with at least one of the others). It is also possible to require judges to reach agreement: after rating each student independently, judges meet as a group to discuss students who were assigned different critical thinking scores and come to a consensus about the discrepant scores.
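The tie-breaker rule described above can be sketched in a few lines of Python. This is a hypothetical helper for illustration only; the function name and signature are my own, not part of the handbook's procedures:

```python
def resolve_score(judge1: int, judge2: int, judge3: int) -> int:
    """Return a single rubric score from three judges' ratings.

    If the first two judges agree, their shared score stands;
    otherwise the third judge acts as the tie-breaker.
    """
    if judge1 == judge2:
        return judge1
    return judge3

print(resolve_score(4, 4, 3))  # first two judges agree, so the score is 4
print(resolve_score(5, 3, 4))  # they disagree, so the third judge decides: 4
```

Note that when all three judges disagree this rule simply defers to the third judge, which is one reason the consensus-meeting approach can be preferable.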

The Data:

To simplify this presentation we will assume that the judges came to an agreement about every critical thinking score. Three variables are presented in the dataset below: the overall critical thinking score of each student, the student's gender, and the student's grade point average.

Some points to notice about this data:

1. It is extremely difficult to derive substantive meaning from this data in its current form. We need efficient ways to summarize the data.

2. It appears as though females may have higher critical thinking scores than males, but this interpretation is purely speculative.


Critical_Thinking Gender GPA

5.00 Female 3.20

5.00 Female 3.20

5.00 Female 3.30

5.00 Male 3.50

4.00 Female 3.20

4.00 Female 2.30

4.00 Female 2.90

4.00 Male 3.00

3.00 Female 2.30

3.00 Female 2.50

3.00 Male 4.00

3.00 Male 3.00

3.00 Male 3.20

3.00 Male 2.80

2.00 Female 2.40

2.00 Male 2.70

2.00 Male 3.00

2.00 Male 2.10

1.00 Female 2.00

1.00 Male 2.80

Data Exploration

There are two initial questions that are important when exploring your data:

1. Where do critical thinking scores tend to fall?
2. How much do critical thinking scores tend to differ across students?

Question 1: Where do critical thinking scores tend to fall?

The answer to this question may be assessed in three ways:

Mean = average score. This is obtained by adding each score and dividing by the total number of scores. The mean is usually a good way to represent student scores. One disadvantage with the mean is that it is affected by extreme scores. So for


example, if we are looking at the average age of our undergraduate students, having one student who is 65 will make the mean much larger than what is representative of most students. It is always important to check the data for extreme cases.
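A quick illustration of this sensitivity, using made-up ages (these numbers are mine, not from the handbook's dataset):

```python
from statistics import mean

# Hypothetical ages: four traditional undergraduates, then the
# same group plus a single 65-year-old student.
typical_ages = [18, 19, 20, 21]
with_outlier = typical_ages + [65]

print(mean(typical_ages))  # 19.5
print(mean(with_outlier))  # 28.6 -- one extreme case pulls the mean up sharply
```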

Median = the midpoint of the scores. In other words, this is the score at the 50th percentile, the score below which 50% of the scores fall. It can be obtained by listing the scores from low to high and finding the score that falls exactly in the middle. So for example, assume we have five critical thinking scores listed from low to high:

1 2 3 4 5

When we have an even number of scores we have to take the average of the two middle scores. In the six-score list below, the two middle scores are 3 and 4, so the median is (3 + 4)/2 = 3.5.

1 2 3 4 5 6

It is important to note that there are particular circumstances in which the median may be more informative than the mean. If the distribution is positively or negatively skewed, which will be discussed below, the median is a more accurate representation of where most scores will tend to fall.
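The two small examples above can be checked directly with Python's statistics module:

```python
from statistics import median

# Odd number of scores: the middle score is the median.
print(median([1, 2, 3, 4, 5]))     # 3
# Even number of scores: average the two middle scores.
print(median([1, 2, 3, 4, 5, 6]))  # 3.5
```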

Mode = the score that occurs most often. To obtain the mode one simply counts the number of students receiving a "1," "2," "3," "4," and "5". The score with the highest count is the mode. In the example from our table above, we can see that

there are 4 students with a score of 5, 4 students with a score of 4, 6 students with a score of 3, 4 students with a score of 2, and 2 students with a score of 1. Since a score of 3 occurs more often than any other score, 3 is our mode.
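These three measures of central tendency can be verified with Python's standard library, using the 20 critical thinking scores from the table above:

```python
from statistics import mean, median, mode

# The 20 critical thinking scores from the dataset above:
# four 5s, four 4s, six 3s, four 2s, and two 1s.
scores = [5] * 4 + [4] * 4 + [3] * 6 + [2] * 4 + [1] * 2

print(mean(scores))    # 3.2
print(median(scores))  # 3.0
print(mode(scores))    # 3
```

SPSS/PASW's Descriptives and Frequencies procedures report the same values.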

Question 2: How much do critical thinking scores tend to be different across students?

There are typically three ways to describe the spread of scores within a dataset: the range, the variance, and the standard deviation. Hand calculation of the variance and standard deviation can become cumbersome with many scores. Since most statistical programs calculate these values quickly, the present discussion focuses only on their conceptual meaning.

Range: To obtain the range we subtract the lowest score from the highest score. In our rubric data the highest score is a 5 and the lowest score is a 1, which gives us a range of 4. In this data the range does not really give us much information. A second problem is that the range, like the mean, can be affected by extreme scores. Because the range is calculated from the two most extreme scores, a change in a single score can drastically affect it. For this reason the range does not tend to be very stable across different samples.

Variance and Standard Deviation: The variance is the average squared difference from the mean, and the standard deviation is simply the square root of this value. Could you say that again? This sounds very technical, but basically both the variance and the standard deviation are measures of average distance from the mean. In other words, the standard deviation asks: on average, how much do student scores tend to differ from the mean? If we square the standard deviation we have the variance. The variance is not intuitively easy to understand, so its direct interpretation is usually more ambiguous; for this reason, most people tend to interpret the standard deviation. In our data, the mean was 3.2 and the standard deviation is


about 1.3. So on average student scores tend to vary by 1.3 points from the mean of 3.2.
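These spread statistics can be checked with the statistics module. Note that stdev and variance compute the sample versions (dividing by n - 1), which matches the handbook's value of about 1.3:

```python
from statistics import stdev, variance

# The same 20 critical thinking scores as in the dataset above.
scores = [5] * 4 + [4] * 4 + [3] * 6 + [2] * 4 + [1] * 2

print(max(scores) - min(scores))   # range: 4
print(round(variance(scores), 2))  # sample variance: 1.64
print(round(stdev(scores), 2))     # sample standard deviation: 1.28 (about 1.3)
```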

Why should we care?

If the standard deviation is large it indicates that the scores tend to be spread far from the mean, and if the standard deviation is small it indicates that scores tend to be close to the mean. So let's pretend that in our example we found that the mean was still 3.2. We may believe that this indicates a moderate level of critical thinking and judge it to be acceptable. However, if our standard deviation is also fairly large, say around 2.5, it would suggest that student scores tend to be spread well away from the mean. In other words, there are many students who score much higher and much lower than our mean, which may be a concern. If, on the other hand, our standard deviation were only 0.5 points, then most students would be fairly close to the mean, and we could feel more confident in believing that our students are at acceptable levels of critical thinking. Determining whether a standard deviation is "large" or "small" can be fairly subjective, so it is up to you as a researcher to make these decisions.

Level of Measurement and Descriptive Statistics

A variable, by definition, is something that varies. In other words, a variable will take on different values or numbers. So for example, if we are interested in the college achievement of individuals who previously attended a public or private high school, then school type would be considered a variable. What if we were interested in female math performance? In this situation gender would be considered a variable if both males and females are included within the sample. Gender would not be considered a variable if we have a sample of females only, because gender would not vary. The variables in your study will be measured in different ways, and it is important to consider these differences when deciding what statistics are appropriate to calculate.

Measurement, in a very general sense, is concerned with how numbers are assigned to variables. Traditionally speaking there are four different levels of measurement. For


each level of measurement specific statistics may or may not be appropriate. It is therefore important that you have a basic understanding of this within your study so that you make informed choices regarding what statistics are appropriate. The four different levels of measurement are referred to as nominal, ordinal, interval, and ratio. Within this section each level of measurement will be briefly reviewed. Upon reviewing each level of measurement, a table will be given so that you may determine what statistics are appropriate for each level of measurement.

Nominal A nominal level of measurement exists whenever we have assigned numbers to different levels of a variable, but these numbers are arbitrary. This is best illustrated when we consider different groups as a variable. So, for example, gender may be a variable within our study. In this example, let’s assume that we have assigned females a “1” and males a “2”. These numbers are completely arbitrary, since we could have just as easily assigned females a “2” or males a “1”. There is also nothing magical about the numbers “1” and “2”, given that we could have just as easily assigned females a “959” and males a “7”. A nominal level of measurement therefore exists when we have assigned numbers to differentiate specific categories. It is important to recognize that we cannot in this situation say that a “2” is greater than a “1” or that a “1” signifies something less than a “2”.

Given this situation, many calculations should not be performed with this level of measurement. Nominal variables only indicate that differences exist, and consequently nothing can be said about the strength of such differences. We cannot, therefore, calculate a mean or standard deviation. For this level of measurement we must typically use basic frequency counts to summarize the data. In the case of gender, it would therefore be appropriate to report the percentage of males and females within the sample, but it would not be meaningful to calculate an overall mean.
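Summarizing nominal data therefore amounts to counting. A small sketch with made-up gender codes (the codes and counts are hypothetical):

```python
from collections import Counter

# Hypothetical nominal data: 1 = female, 2 = male (the codes are arbitrary)
genders = [1, 2, 1, 1, 2, 1, 2, 1, 1, 2]

counts = Counter(genders)
for code, n in sorted(counts.items()):
    # Report a frequency and percentage for each category
    print(code, n, f"{100 * n / len(genders):.0f}%")
```

Note that averaging these codes would produce a number with no meaning, which is exactly why frequency counts are the appropriate summary here.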


Ordinal An ordinal level of measurement not only signifies that differences exist, but also allows us to look at the numbers in order to determine whether one is higher than another. So let’s pretend that we have 5 students in the classroom and we measure each student’s height by ranking them from low to high. This is illustrated in the table below.

Rank     1     2        3     4      5
Student  Joe   Kaitlin  Alex  Tommy  Jennifer

In this situation Jennifer, who is assigned a “5,” is taller than all of the other students. We can also see that Joe, who is assigned a “1,” is shorter than all of the students. These numbers allow us not only to infer that each student has a different height, but, unlike a nominal variable, an ordinal variable allows us to determine the direction of those differences. We know that Jennifer is taller than Tommy, who is taller than Alex, and so on. The only problem with this level of measurement is that we cannot say that the differences in height between people are equal. It is possible that Jennifer and Tommy are only 1 inch apart in height, but there may be an 8 inch difference in height between Tommy and Alex. This possibility is illustrated in the figure below.

Given that we cannot assume an equal distance between values that have an ordinal level of measurement some mathematical calculations are limited. For example, it would not make sense to multiply the numbers. For this reason, the statistics that we can use to describe our sample are still limited. If we are interested in where most

[Figure: the five students’ heights plotted on a scale from 5 ft to 6 ft — Joe, Kaitlin, Alex, Tommy, Jennifer — showing that the gaps between adjacent ranks are not equal.]


student ranks tend to fall then we can calculate a median. If we are interested in the spread of ranks then we are limited to calculating the range.
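A short sketch of the statistics that remain available for ordinal data, using the five hypothetical ranks from above:

```python
from statistics import median

# Ordinal ranks for the five students (hypothetical ranking from above)
ranks = [1, 2, 3, 4, 5]

print(median(ranks))            # middle rank → 3
print(max(ranks) - min(ranks))  # range of ranks → 4
```

The mean and standard deviation are deliberately left out: without equal intervals between ranks, they would not be meaningful.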

Interval An interval level of measurement indicates differences, direction, and the existence of equal intervals between scores. The most common example of this level of measurement is temperature measured in Celsius or Fahrenheit. The difference in the amount of heat between 80 and 90 degrees Fahrenheit is equal to the amount of heat between 90 and 100 degrees Fahrenheit. Another example may be IQ, where it is assumed that the difference in intelligence between an individual with an IQ of 70 and one with an IQ of 80 is the same as the difference between two individuals with IQs of 100 and 110.

There remains some controversy about whether Likert-type ratings should be treated as ordinal or interval level variables. We will not address this controversy in detail, but it is something that readers should at least recognize as a potential problem. To remain consistent with our rubric example we will once again consider the critical thinking rubric below.

Joe, Kaitlin, and Alex have been measured on critical thinking and their scores are, respectively, 1, 3, and 5. Can we assume that the distance between Joe’s and Kaitlin’s critical thinking is equivalent to the distance in critical thinking between Kaitlin and Alex? It is arguable that this rubric may reflect an ordinal measure, though for pragmatic purposes it is frequently assumed that the distances between each score are equal. For our purposes we will also be making this assumption.

Level of Critical Thinking

Score  Description
1-2    The student fails to demonstrate critical thinking or displays a minimal level of critical thinking.
3      The student demonstrates a moderate level of critical thinking.
4-5    The student demonstrates an exceptional level of critical thinking.


With such assumptions in place we can calculate a meaningful average and standard deviation. The primary limitation of this level of measurement is that it lacks a true “0”. A true “0” would indicate the absence of whatever is being measured. So 0 degrees Fahrenheit is not a true 0 because it doesn’t indicate the absence of heat. Similarly, a “0” on an intelligence test doesn’t indicate the complete absence of intelligence. Given the absence of a true zero we cannot discuss the numbers in terms of ratios. In other words, we cannot say that a person with an IQ of 130 is twice as intelligent as a person with an IQ of 65.

Ratio A ratio level of measurement contains all of the properties of an interval level of measurement, with the additional characteristic of having a true 0. Most educational measures lack this property. One example might be years of formal schooling, where a 0 would indicate the complete absence of formal schooling (e.g., some children in the study may be age 2 or 3). A second example might be yearly income, where it is conceivable that some individuals may truly have no income. In both examples it is apparent that we can discuss the numbers in different terms than at the interval level of measurement. It is possible, for example, to say that 12 years of education is twice as much as 6 years of education, or that an income of $60,000 is three times as much as an income of $20,000. All mathematical operations can be applied to this level of measurement.


Descriptive Statistics for each Level of Measurement

Level of Measurement   Appropriate Descriptive Statistics
Ratio                  Mean, Median, Mode; Range, Standard deviation
Interval               Mean, Median, Mode; Range, Standard deviation
Ordinal                Median, Mode; Range
Nominal                Mode

Level of Measurement and the Amount of Information From the previous discussion we can see a hierarchy in the amount of information available from each level of measurement. Nominal variables only indicate differences, whereas ordinal data indicate differences and direction. Interval and ratio are the most informative, since we can infer difference, direction, and magnitude. If something is measured at an interval level we can always transform the data to ordinal or nominal levels of measurement. For example, we could take IQ scores and create groups of “high” and “low”. If we are measuring age, we could always rank the ages (i.e. ordinal) or once again create categories of interest (20-29, 30-39, etc.). However, if we have already collected the data at the nominal level we cannot transform it to an interval level. So if you ask participants about their age by having them check a pre-existing category, you get limited information: you will never know the exact average age of your sample. Furthermore, answering specific research questions can become unnecessarily complex. If you have the choice, always measure a variable at the highest level of measurement (i.e. interval or ratio) and transform the data later if you are interested in asking other questions.
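The one-way nature of these transformations can be sketched in code. The 10-year bins below are illustrative, not prescribed; the point is that exact ages can always be collapsed into categories, but the categories cannot be turned back into exact ages:

```python
# Collapse ratio-level ages into broader ordered categories.
# Going in this direction is always possible; recovering exact
# ages from the categories afterwards is not.
def age_category(age):
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

print(age_category(34))  # → "30-39"
print(age_category(21))  # → "20-29"
```

If respondents had only checked "30-39" on a form, nothing in the data would tell us whether they were 30 or 39, which is why measuring at the highest level first is the safer choice.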

Using Graphs to Examine Data Graphing data can be very insightful. When choosing a graph you should consider how your variables are measured. So, for example, if you have assigned females a “1” and males a “2”, then a bar or pie chart may be helpful. The rubric data above, where critical thinking is measured on a 1-5 scale, would not typically be illustrated with bar or pie charts. In this situation a histogram would be used, as it allows us to derive a better understanding of the distribution of the data. A table is provided below to help guide your choice of graph given a variable’s level of measurement.


Histogram: We can easily put the rubric data into a histogram to get an overall indication of the distribution of student scores. A histogram of this data is provided below. On the horizontal axis, or x-axis, we have placed each score from the rubric. The bars, which correspond to values on the vertical axis, or y-axis, represent how many students received each score. Looking at the graph we can quickly see that most students scored a 3 (i.e. the mode) and that there tend to be slightly more students at the upper end of the distribution than the lower end. A normal distribution is a distribution where one side “mirrors” the other. In our case we have a slight departure from a perfectly normal distribution, something that will occur in most data.
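Even without graphing software, a rough text histogram conveys the same idea. The scores below are hypothetical rubric ratings chosen to mirror the shape described above (mode at 3, slightly heavier upper end):

```python
from collections import Counter

# Hypothetical rubric scores for 20 students on a 1-5 scale
scores = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5]

counts = Counter(scores)
for score in range(1, 6):
    # One "#" per student receiving this score
    print(score, "#" * counts[score])
```

Reading down the bars makes the mode and the slight asymmetry of the distribution immediately visible.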

Level of Measurement   Appropriate Graphs
Nominal/Categories     Bar chart, Pie chart
Ordinal/Ranks          Bar chart, Pie chart, Boxplot, Stem and leaf
Interval               Stem and leaf, Boxplot, Histogram
Ratio                  Stem and leaf, Boxplot, Histogram


Pie Chart Pie charts are circular charts that are divided into sectors or categories which indicate the relative size of each category. So for example, in our data we have 10 females and 10 males. The pie chart would thus be divided into sectors with males on one side at 50% and females on the other side at 50%. It is also possible for us to change our question to something more interesting. The Pie Chart below indicates the percentage of males and females with a score of a “4” or “5” on the critical thinking rubric. Within our data, we can see that there are 8 students who have a critical thinking score of 4 or 5. Six of these students are female and only 2 are male.
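The percentages behind such a pie chart are simple to compute. This sketch uses the counts quoted in the text (6 females and 2 males among the 8 high scorers):

```python
# Counts of students scoring a "4" or "5", taken from the text above
high_scorers = {"Female": 6, "Male": 2}

total = sum(high_scorers.values())
for group, n in high_scorers.items():
    # Each sector's share of the pie
    print(group, f"{100 * n / total:.0f}%")  # → Female 75%, Male 25%
```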


Research Questions In a very broad sense, basic research questions can generally be divided into two categories. First are research questions pertaining to group differences, or differences between two or more means. Examples of such questions include the following:

1. On average, do males and females have different critical thinking scores?
2. Are average critical thinking scores in 2011 higher than critical thinking scores in 2010?
3. Are average critical thinking scores in 2011 significantly different from a minimal standard of 3.0?

Secondly, research questions may be concerned with the relationship between two or more variables and/or the ability of changes in one variable to predict changes in another. Examples of such questions include the following:

1. Are changes in critical thinking scores associated with changes in GPA?
2. Do critical thinking scores predict GPA?

Group or Mean Differences Let’s begin with an exploration of whether males and females on average have different critical thinking scores. Two histograms were constructed, one for males and one for females.

[Pie chart: critical thinking scores of “4” or “5” — Female 75%, Male 25%]

[Two histograms: the distribution of critical thinking scores for males and for females]

These histograms look very different from the first histogram containing all of the students. First, notice that the histogram for males has slightly more individuals at the lower end of the distribution, whereas the histogram for females shows more scores at the higher end of the distribution. We can also see that the mean critical thinking score for males is 2.8 (standard deviation = 1.1) and that females have an average critical thinking score of 3.6 (standard deviation = 1.3).

This quick inspection suggests that females may indeed have a higher mean than males, and we can see from the standard deviations that the average deviation from the respective means appears to be similar. In other words, though scores for males and females tend to fall at different points in the distribution, both groups appear to have a similar spread of scores around their respective means. We cannot, however, conclude from this visual inspection alone that females have significantly higher means than males. But you may say: the means are different, can’t you see that the female mean is 3.6 and the male mean is 2.8? This is true, but to conclude that these differences are statistically significant or practically meaningful there are other things to consider.

Statistical Significance Statistical significance has a unique meaning, and it is important that we consider this meaning in some detail because it can be confusing. First, let’s do a quick thought experiment in order to get an intuitive feel for what is happening. Let’s pretend that an individual tells you that he has psychic abilities. When a coin is flipped, these abilities allow him to predict whether the coin will land heads up or heads down. You are interested, yet skeptical of this ability, so you decide to put it to a test. You are going to flip a fair coin and measure whether his prediction was accurate. Being skeptical, we will assume that the person lacks this ability until he demonstrates evidence to the contrary. This assumption will be called our null hypothesis, or the hypothesis indicating that nothing is really going on. On the other


hand, our second hypothesis is that this person does indeed have psychic powers, and we will call this hypothesis our alternative hypothesis.

Testing the Psychic Ability of our Friend

Let’s assume that we flip the coin a single time. Our friend indicates before the flip that the coin will land on heads. The coin does indeed land on heads, so our friend proudly proclaims, “See, I told you I was a psychic”. Why is this conclusion not convincing? The reason is that our friend has a 50% chance of getting a correct answer even if our null hypothesis is true. In other words, he will be correct 50% of the time even if he is guessing at random. A second way of stating this is that there is a 50% chance that we will conclude he is a psychic when in fact he is guessing at random. The probability of him getting a trial correct by random guessing will be called our p-value. Right now our p-value is .50, which, as we discussed, is not very convincing. We can get more convincing evidence, however, by increasing the number of trials.

Now we are going to give our friend a series of 10 trials. The p-values for every possible number of correct responses are given in the table below. A quick examination of this table yields interesting information that can be applied to our friend’s claim. First, this table describes a person without psychic ability who is guessing at random. In other words, it indicates the likelihood of certain events if our null hypothesis is true. We can see that a person who is guessing at random will get 0 correct responses only about 0.1% of the time, and will likewise be correct on all 10 trials only about 0.1% of the time. A person who is randomly guessing will correctly choose the side of the coin on 5 out of 10 trials about 25% of the time. From this table, we can therefore see that if our friend is randomly guessing, it is highly unlikely that he will do very badly (i.e. 0 or 1 correct) or extremely well (9 or 10 correct).


Correct Choices   P-Value (Random Guessing)   % of Time Correct from Random Guessing
0                 .000977                     0.10%
1                 .009766                     0.98%
2                 .043945                     4.39%
3                 .117188                     11.72%
4                 .205078                     20.51%
5                 .246094                     24.61%
6                 .205078                     20.51%
7                 .117188                     11.72%
8                 .043945                     4.39%
9                 .009766                     0.98%
10                .000977                     0.10%
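Each entry in the table is a binomial probability, which can be reproduced with Python's standard library:

```python
from math import comb

# Probability of exactly k correct calls in 10 flips of a fair coin
# when guessing at random (binomial distribution with p = 0.5)
def prob_exactly(k, n=10):
    return comb(n, k) * 0.5 ** n

for k in range(11):
    print(f"{k:2d}  {prob_exactly(k):.6f}  {100 * prob_exactly(k):5.2f}%")
```

For example, prob_exactly(5) reproduces the .246094 in the table, and the eleven probabilities sum to 1, as they must.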

This table provides a decent way to evaluate our friend’s ability, but an important question remains unanswered. How well must our friend do in order for us to accept the conclusion that he does indeed have psychic powers? Clearly, if our friend correctly called whether the coin would be heads or tails 10 out of 10 times, this would be an impressive feat. It is impressive because it is extremely unlikely (i.e., about 0.1%) that he would be able to do this if he were just randomly guessing. If our friend gets 10 out of 10 correct it is still possible that our null hypothesis is true; however, if our friend really was just guessing at random, then he was extremely lucky during this set of 10 trials. Though it is possible that he still lacks this ability, it is extremely unlikely that he would perform this well by random guessing alone. Thus, in this situation we could have some confidence when rejecting our null hypothesis and accepting the alternative hypothesis.

What would we conclude if our friend got 6 out of 10 correct? Well, when we assume that he is randomly guessing, or that our null hypothesis is true, he will get 6 out of 10 correct about 20% of the time. Though this outcome is somewhat uncommon, it is not nearly as


impressive as getting 9 or 10 correct. So if we were to conclude that he was psychic, this conclusion would be much more tentative than if he had gotten 9 or 10 correct. Note that conclusions are always tentative, since no matter how many times he answers correctly it never technically “proves” the existence of psychic powers. The most that we can technically say is that it is extremely unlikely that he would perform this well if he were just randomly guessing.

Thankfully there are some general conventions that one can follow when deciding whether or not to reject the null hypothesis. Researchers typically reject a null hypothesis if the p-value is less than .05 or .01. These rules are generally set before one conducts a study. So we would decide, before giving our friend the 10 trials, that the p-value must be less than .05 before rejecting our null hypothesis. In other words, if our friend gets 8 out of 10 correct we know that this will happen less than 5% of the time if he is randomly guessing, so we would reject our null hypothesis and accept the claim that our friend has psychic abilities. With this same standard, we would therefore accept our friend’s claim if he got 8, 9, or 10 out of 10 correct. If we were very skeptical and wanted to be more stringent with our friend, we could set our minimal p-value to less than .01. In this case, our friend must get 9 or 10 out of 10 correct before we accept the claim that he has psychic abilities.

Significance Testing using the Rubric Data

Understanding the logic above provides us with an intuitive feeling for what is happening in significance tests when asking other research questions. The logic is basically the same, though we have different null and alternative hypotheses. So let’s go back to our example of looking at mean differences. We are interested in determining whether males and females have, on average, different levels of critical thinking. Remember that in this example males had an average critical thinking score of 2.8 and females had an average critical thinking score of 3.6. These do indeed look different, but at times appearances can be deceptive. Just as in the assessment of our friend’s psychic ability, it is possible that the difference we see between


male and female means results from some random or chance event. Just as we were skeptical of our friend’s claim, we will also be skeptical of this apparent difference between male and female means. We will therefore assume that the difference between males and females is due to chance factors, which will be our new null hypothesis. To examine this assumption, we must understand the probability of observing specific mean differences if they were completely chance fluctuations. Let’s take a closer look at this line of logic.

In order to understand this, let’s examine the diagram provided below. This diagram provides an example wherein 30 students are measured on a math test. This class will serve as the entire population, and we know that females had an average math score of 88 and males had an average math score of 88. In other words, the mean difference between males and females is 0 points for the entire class. What would happen if we took a random sample of 7 males and 7 females and calculated their mean difference? Would we also find a difference of 0 points? Probably not. We can see from the diagram below that even though the males and females come from the same class (i.e. the same population), when we take a random sample of males and females their mean difference will probably differ from that of the entire population.

 

[Diagram: MATH CLASS — male and female mean difference = 0 points]

Sample 1: Female mean = 85, Male mean = 80
Sample 2: Female mean = 79, Male mean = 82
Sample 3: Female mean = 92, Male mean = 91


In fact, from this diagram we can see that in Samples 1 and 3 females have a higher mean than males, whereas in Sample 2 males have a higher mean than females. Without further tests, each sample could lead us to radically different conclusions about male and female math scores. Our sample of 7 males and 7 females will almost always have different means, and these differences may or may not reflect the class at large. Just by chance alone, the average difference in male and female math scores will fluctuate across each sample. Consequently, we cannot assume that the differences we observe between males and females reflect true differences in the population or class. In order for the difference in male and female sample means to be statistically significant, it must be extremely rare under the assumption that these differences reflect chance fluctuations.

Significance testing is rather abstract and can be understandably difficult to fully comprehend. However, if you keep these two examples in mind you will have an intuitive understanding of what is occurring in every significance test. First, we assume that whatever we observe is due to chance fluctuations (e.g. random guessing, or differences across possible samples). This assumption is our null hypothesis, which states that nothing is happening (e.g. the person is guessing at random, or there is no true difference in male and female critical thinking scores). We then build statistical models of chance that help clarify what we actually see in the data. For example, we understand that a person who is randomly guessing will correctly choose heads or tails on 6 out of 10 trials about 20% of the time. Or, when we assume that males and females have equal critical thinking scores, finding a sample with a 3-point difference will occur only X% of the time. Once we understand the likelihood of particular outcomes under the null hypothesis, we can make judgments about the data we have actually observed. Given that we know a person who is randomly guessing will get 6 out of 10 trials correct about 20% of the time, what should we conclude when our friend actually gets 6 out of 10 trials correct? Is this sufficient to reject the null hypothesis? Once we observe the data, we make decisions about rejecting the null hypothesis and accepting the alternative hypothesis.
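These chance fluctuations can be made concrete with a small simulation. The population scores below are randomly generated stand-ins, not real data: both "groups" are drawn from the same population, so the true mean difference is 0, yet individual samples still differ.

```python
import random

random.seed(1)
# Hypothetical population of math scores centered at 88
population = [random.gauss(88, 5) for _ in range(1000)]

def sample_mean_diff(n=7):
    # Draw two independent samples of size n from the SAME population
    group_a = random.sample(population, n)
    group_b = random.sample(population, n)
    return sum(group_a) / n - sum(group_b) / n

diffs = [sample_mean_diff() for _ in range(10000)]
print(min(diffs), max(diffs))    # single samples can differ by several points
print(sum(diffs) / len(diffs))   # but the differences center near zero
```

Even with no true difference, some samples show gaps of several points, which is exactly why an observed sample difference cannot, by itself, establish a population difference.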


Remember, it is in the nature of probability that the improbable will occur, given enough time. Rejecting the null hypothesis does not prove the alternative hypothesis; it only suggests that our observations are unlikely given that the null hypothesis is true.

Statistical Tests

Independent-Sample T-Test Having this intuitive sense about significance testing will be sufficient for you to effectively utilize much of your assessment data. We will now provide an example of answering the question about male and female differences in critical thinking.

Research question: Are male and female critical thinking scores, on average, different?

Null Hypothesis: There are no differences in male and female critical thinking scores. In other words, the observed differences in our sample reflect random fluctuations.

Alternative Hypothesis: There are differences in male and female critical thinking scores.

In order to answer this question we will use an independent sample t-test. Output from SPSS is presented below and will serve as the basis of our interpretation. You may find a tutorial for conducting independent sample t-tests in SPSS on YouTube: http://www.youtube.com/watch?v=36PldkUpXH0. We will therefore focus upon interpreting the output derived from SPSS.
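For readers without SPSS, the t statistic it reports can be sketched by hand from summary statistics. The means and standard deviations below are taken from the output that follows; the function itself is a minimal stdlib sketch of the equal-variances formula:

```python
from math import sqrt

# Independent-samples t statistic, equal variances assumed
# (the "equal variances assumed" row of the SPSS output)
def t_independent(mean1, sd1, n1, mean2, sd2, n2):
    # Pool the two variances, weighting each by its degrees of freedom
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff

# Females: mean 3.6, SD 1.3499, n = 10; males: mean 2.8, SD 1.1353, n = 10
t = t_independent(3.6, 1.3499, 10, 2.8, 1.1353, 10)
print(round(t, 2))  # → 1.43, matching the SPSS t with df = 18
```

Converting this t into a p-value requires the t distribution, which is where a statistics package such as SPSS takes over.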


Group Statistics

Gender   N    Mean     Std. Deviation   Std. Error Mean
Female   10   3.6000   1.34990          .42687
Male     10   2.8000   1.13529          .35901

Notes on the output:
- The female average critical thinking score is 3.6 and the male average critical thinking score is 2.8. The mean difference is thus 0.8 points.
- On average, female critical thinking scores tend to vary 1.35 points from their mean of 3.6, and male critical thinking scores tend to vary 1.14 points from their mean of 2.8.
- On average, female mean critical thinking scores can be expected to vary .43 points from sample to sample, and male mean critical thinking scores can be expected to vary .36 points from sample to sample.


Independent Samples Test

Critical_Thinking                Levene's Test for Equality of Variances     t-test for Equality of Means
                                 F       Sig.       t        df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed          .559    .464       1.434    18        .169              .80000            .55777                  -.37184        1.97184
Equal variances not assumed                         1.434    17.486    .169              .80000            .55777                  -.37431        1.97431

Notes on the output:
- Levene's test (F = .559, Sig. = .464) examines whether the variances for males and females are equal. If this Sig. value is greater than .05, use the top row of output (equal variances assumed); if it is less than .05, use the bottom row (equal variances not assumed).
- The t of 1.434 is the calculated t-value from our sample data. This value must usually be greater than 1.96 or lower than -1.96 to be statistically significant.
- The Sig. (2-tailed) value of .169 is our p-value, which must be less than .05 or .01 to be statistically significant. It indicates the probability of finding our mean difference under the null hypothesis. In our data, we would observe a mean difference of 0.8 nearly 17% of the time if there were no true difference in male and female critical thinking scores.


Confidence Intervals Confidence intervals are extremely informative for conducting significance tests. Typically researchers construct 95% and 99% confidence intervals. When calculating a 95% confidence interval there is a 95% chance that the interval we construct will contain the true population value. Likewise, with a 99% confidence interval there is a 99% chance that the interval we construct will contain the true population value. Let’s look at this example.

Independent Samples Test

Critical_Thinking                Levene's Test for Equality of Variances     t-test for Equality of Means
                                 F       Sig.       t        df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed          .559    .464       1.434    18        .169              .80000            .55777                  -.37184        1.97184
Equal variances not assumed                         1.434    17.486    .169              .80000            .55777                  -.37431        1.97431

Before calculating this interval there is a 95% chance that it will contain the true difference between male and female mean critical thinking scores. We estimate that the true difference between females and males falls between -.37 and 1.97. It is important to notice whether 0 falls between -.37 and 1.97. Since 0 does fall between these values, the true difference between males and females may well be 0. In other words, this interval supports our significance test.

Two words of caution are necessary. First, there may be times in which 0 falls within the confidence interval, yet the significance test indicates that there is a statistically significant difference in the results. Though this may be a matter of personal preference, the fact that 0 falls within the confidence interval should make one suspicious of the significance test. Second, the following interpretation is technically inappropriate: "We can be 95% confident that the true difference between male and female means falls between -.37 and 1.97." Before constructing the interval we have a 95% chance of creating one that contains the true population value. After the interval is constructed it either contains the true population value or it does not.
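The interval reported in the output can be reproduced by hand from the mean difference, its standard error, and the degrees of freedom. This is a minimal sketch using SciPy (an assumption on my part; SPSS performs the equivalent computation internally):

```python
from scipy import stats

mean_diff = 0.8    # female mean (3.6) minus male mean (2.8)
se_diff = 0.55777  # standard error of the difference, from the output
df = 18            # degrees of freedom, equal variances assumed

# Critical t-value that cuts off 2.5% in each tail of a t-distribution
t_crit = stats.t.ppf(0.975, df)

lower = mean_diff - t_crit * se_diff
upper = mean_diff + t_crit * se_diff
print(round(lower, 2), round(upper, 2))  # -0.37 1.97, matching the output
```

The same recipe (estimate ± critical value × standard error) produces every confidence interval in this chapter; only the estimate, standard error, and degrees of freedom change.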

Interpretation of T-Test

Let's summarize this section with how we could interpret the results of the significance test.

A sample of 20 students was assessed by three judges utilizing a critical thinking rubric. Judges were asked to come to a consensus about their critical thinking scores. Anonymous University is interested in examining whether males and females had, on average, different critical thinking scores. Of the 20 students, 50% were male. Males had an average critical thinking score of 2.8 (standard deviation = 1.14) and females had an average critical thinking score of 3.6 (standard deviation = 1.34). Females thus scored, on average, 0.8 points higher than males, with a 95% confidence interval of -.37 to 1.97. An independent samples t-test indicated that this mean difference failed to be statistically significant (p = .169).
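The test itself can be run outside of SPSS. The sketch below uses SciPy's `ttest_ind`; the two score lists are hypothetical, constructed only to be consistent with the reported group means (2.8 and 3.6) and the t-test output (t = 1.434, df = 18), since the raw scores are not reproduced in this section.

```python
from scipy import stats

# Hypothetical rubric scores, chosen to match the reported summary
# statistics; these are not the chapter's actual raw data.
males = [1, 2, 2, 2, 3, 3, 3, 3, 4, 5]    # mean = 2.8
females = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5]  # mean = 3.6

# Independent samples t-test, equal variances assumed
t_stat, p_value = stats.ttest_ind(females, males)
print(round(t_stat, 3), round(p_value, 3))
```

Because the p-value exceeds .05, we would fail to reject the null hypothesis, just as in the SPSS output.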

Before moving on to correlational/prediction research questions, we will briefly examine the output from two other research questions that examine group differences. The first is what is referred to as a dependent samples t-test, which can examine the same people at two points in time. This will be followed by an illustration of a one-way ANOVA, which can be used to examine mean differences across more than two groups.


Dependent or Paired Samples T-test

As previously indicated, the dependent samples t-test can be used to evaluate the same people across two points in time. This test would not be applied if you are comparing samples of different people across two years. For example, if you want to compare the critical thinking of 2009 graduates to that of 2010 graduates, these are different samples, so we would use an independent samples t-test.

Example

The School of Program Assessment Studies at Anonymous University utilized a rubric to measure the critical thinking of 10 incoming freshmen. These same freshmen were once again evaluated for critical thinking upon graduation. The sample data are provided below. (Note: if we were to compare incoming freshmen for 2011 to graduates of 2011, we would use the independent samples t-test.)

Participant   Critical Thinking as Freshman   Critical Thinking as Graduate
001           2.00                            4.00
002           2.00                            4.00
003           3.00                            3.00
004           1.00                            2.00
005           1.00                            3.00
006           3.00                            5.00
007           3.00                            3.00
008           4.00                            5.00
009           1.00                            2.00
010           2.00                            3.00


Research Questions

Research Question: On average, do students have higher critical thinking scores as graduates than when they arrived as freshmen?

Null Hypothesis: There is no difference between students' average critical thinking scores as graduates and their average critical thinking scores when they arrived as freshmen.

Alternative Hypothesis: There is a difference between students' average critical thinking scores as graduates and their average critical thinking scores when they arrived as freshmen.

Statistical Analysis

A tutorial for conducting a dependent samples t-test can be found at the following link: http://www.youtube.com/watch?v=9ipXE6q6tnU. The output for our sample data is interpreted below.

Paired Samples Statistics

          Mean    N   Std. Deviation  Std. Error Mean
Freshman  2.2000  10  1.03280         .32660
Graduate  3.4000  10  1.07497         .33993

The average critical thinking score of freshmen is 2.2, and the average critical thinking score of graduates is 3.4. The standard deviations indicate that, on average, critical thinking scores as freshmen vary 1.03 points from the mean of 2.2, and critical thinking scores as graduates vary 1.07 points from the mean of 3.4. The standard errors indicate that, across multiple samples, freshman means are expected to vary from 2.2 by .33 points and graduate means are expected to vary from 3.4 by .34 points.



Paired Samples Test

                      Mean      Std. Deviation  Std. Error Mean  95% CI Lower  95% CI Upper  t       df  Sig. (2-tailed)
Freshman - Graduate   -1.20000  .78881          .24944           -1.76428      -.63572       -4.811  9   .001

As freshmen, students' average critical thinking scores were 1.2 points lower than they were as graduates. The mean difference of -1.2 points is statistically significant if the p-value is less than .05 or .01. The p-value of .001 also indicates that, when we assume the null hypothesis is true, a mean difference of -1.2 points will occur less than 1% of the time.

This is the 95% confidence interval around the mean difference of -1.2. Before constructing the interval we have a 95% chance of constructing an interval containing the true population difference. This interval is -1.76 to -.63. Since the interval does not contain 0, it supports the significance test.


Interpretation

The following provides a brief summary of our results: A paired samples t-test was used in order to investigate whether the average critical thinking scores of incoming freshmen changed upon graduation. A sample of 10 students was assessed for critical thinking using rubric data as freshmen and again upon graduation. The sample of incoming freshmen had an average critical thinking score of 2.2 (standard deviation = 1.03) and as graduates had an average critical thinking score of 3.4 (standard deviation = 1.07). Upon graduation, average critical thinking scores were found to increase by 1.2 points, with a 95% confidence interval of .63 to 1.76. This mean difference was statistically significant (p = .001).
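The paired-samples result can be reproduced from the ten freshman/graduate scores in the example data. This sketch uses SciPy's `ttest_rel` (assuming SciPy is available; SPSS computes the same test):

```python
from scipy import stats

# The ten students' rubric scores from the example data
freshman = [2, 2, 3, 1, 1, 3, 3, 4, 1, 2]
graduate = [4, 4, 3, 2, 3, 5, 3, 5, 2, 3]

# Dependent (paired) samples t-test: freshman minus graduate
t_stat, p_value = stats.ttest_rel(freshman, graduate)
print(round(t_stat, 3), round(p_value, 3))  # -4.811 0.001, matching the output
```

Note that the test is computed on the ten within-person difference scores, which is why pairing the same students across time matters.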

One-way ANOVA

This procedure allows us to investigate differences in means among two or more groups. Instead of utilizing a t-distribution, this statistic utilizes an F-distribution. Within this distribution the F-statistic has an average of 1, so calculated values that are close to or less than 1 will often fail to be statistically significant. The procedure is basically an analysis of different sources of variation. Theoretically, it assumes that individuals who are exposed to the same treatment should score the same. The degree to which individuals within the same track differ from one another is considered error (e.g., participants in track A should have similar critical thinking scores); these differences within a single track are referred to as within-group variation. If the program track has an effect, then it is expected that participants in different tracks will have different scores (e.g., track A people differ from track B people). This is often referred to as a treatment effect, or between-group variation. ANOVA compares the amount of between-group variation to the within-group variation in order to estimate the F-statistic, which is then evaluated for statistical significance.

Example:


Researchers at Anonymous University are interested in examining whether graduates in distinct program tracks have different critical thinking scores. The program under evaluation has three tracks: A, B, and C. Ten graduates were randomly sampled from each track and measured for critical thinking using a rubric. The data for the study are provided below.

Participant  Track  Critical Thinking Score
001          1.00   4.00
002          1.00   4.00
003          1.00   3.00
004          1.00   3.00
005          1.00   2.00
006          1.00   4.00
007          1.00   4.00
008          1.00   2.00
009          1.00   3.00
010          1.00   1.00
011          2.00   1.00
012          2.00   2.00
013          2.00   3.00
014          2.00   3.00
015          2.00   4.00
016          2.00   3.00
017          2.00   3.00
018          2.00   2.00
019          2.00   4.00
020          2.00   2.00
021          3.00   2.00
022          3.00   3.00
023          3.00   4.00
024          3.00   1.00
025          3.00   1.00
026          3.00   3.00
027          3.00   4.00
028          3.00   1.00
029          3.00   5.00
030          3.00   1.00

Notes about Data

In the data for this example the participant track is treated as a variable. Participants in Track A were assigned a 1, participants in Track B were assigned a 2, and participants in Track C were assigned a 3. Each participant's critical thinking score is indicated to the right of the track variable.

Research Questions

Research Question: Do students graduating from tracks A, B, and C exhibit different critical thinking scores?

Null Hypothesis: There are no differences in the average critical thinking scores among students graduating from tracks A, B, and C.

Alternative Hypothesis: There are differences in the average critical thinking scores among students graduating from tracks A, B, and C.

Statistical Analysis

Once again, this presentation will be limited to understanding the output provided in SPSS. To view an illustration of this in SPSS you may view the following link: http://www.youtube.com/watch?v=LieKLVzxbuA


Descriptives: Critical_Thinking

         N   Mean    Std. Deviation  Std. Error  95% CI Lower  95% CI Upper  Minimum  Maximum
Track_A  10  3.0000  1.05409         .33333      2.2459        3.7541        1.00     4.00
Track_B  10  2.7000  .94868          .30000      2.0214        3.3786        1.00     4.00
Track_C  10  2.5000  1.50923        .47726       1.4204        3.5796        1.00     5.00
Total    30  2.7333  1.17248        .21406       2.2955        3.1711        1.00     5.00

We see that Track A has the highest mean critical thinking score at 3.0, followed by Track B (2.7) and Track C (2.5). The average critical thinking score for all groups combined is 2.73.

The standard deviations indicate, on average, how much scores vary around their respective group mean. For example, scores in Track A vary, on average, 1.05 points from the mean of 3.0.

The standard errors indicate an expected or average fluctuation in estimated means across multiple samples. For example, across multiple samples it is expected that Track A means will vary by .33 points.

The 95% confidence interval for each group mean has the same interpretation as in previous examples. It is important to consider whether the confidence interval around one group mean overlaps with another group's interval. For example, since Track A's interval of 2.25 to 3.75 overlaps with Track B's interval of 2.02 to 3.38, the critical thinking scores for the Track A and Track B populations may well be similar.


ANOVA: Critical_Thinking

                Sum of Squares  df  Mean Square  F     Sig.
Between Groups  1.267           2   .633         .443  .647
Within Groups   38.600          27  1.430
Total           39.867          29

The Sig. value must be less than .01 or .05 to be statistically significant. It indicates that, when we assume there are no differences across the three tracks, there is nearly a 65% chance of finding the mean differences observed across the three tracks in our study.

The Mean Square column gives the estimated between-group variance, which estimates the effect of being in a specific track (i.e., the treatment effect), and the within-group variance, or the differences we see among people in the same track (i.e., error). The between-group variance is .633 and the within-group variance is 1.43.

The F column gives our calculated F-statistic. We can obtain this value by dividing the between-group variance, or estimated treatment effect (.633), by the within-group variance, or error (1.43).


Post Hoc Tests for ANOVA

In the output listed above, the results indicated that there were no statistically significant differences in the average critical thinking scores for students across the three tracks. What if we had found a significant result, however? How would we know which groups are significantly different from which? A statistically significant F-statistic does not tell us which groups are different; it only suggests that somewhere among the three groups there is a statistically significant difference. To understand which track differs from which we would have to run what is called a post hoc test. There are many post hoc tests available, but the simplest to understand is Tukey's Honestly Significant Difference test. This test compares every possible pair of groups for significant differences.

Interpretation

Anonymous University is interested in determining whether graduates from tracks A, B, and C have different levels of critical thinking. The average critical thinking score among graduates in track A (mean = 3.0, standard deviation = 1.1) was higher than in track B (mean = 2.7, standard deviation = 0.9) and track C (mean = 2.5, standard deviation = 1.5). A one-way ANOVA indicated that these observed differences were not statistically significant (p = .647). There is currently insufficient evidence to conclude that the mean critical thinking scores among graduates from tracks A, B, and C are different.
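The ANOVA table can be reproduced from the track data with SciPy's `f_oneway` (a sketch under the assumption that SciPy is available; SPSS performs the equivalent computation):

```python
from scipy import stats

# Critical thinking scores for the three program tracks, from the example data
track_a = [4, 4, 3, 3, 2, 4, 4, 2, 3, 1]  # mean = 3.0
track_b = [1, 2, 3, 3, 4, 3, 3, 2, 4, 2]  # mean = 2.7
track_c = [2, 3, 4, 1, 1, 3, 4, 1, 5, 1]  # mean = 2.5

# One-way ANOVA: between-group variance divided by within-group variance
f_stat, p_value = stats.f_oneway(track_a, track_b, track_c)
print(round(f_stat, 3), round(p_value, 3))  # 0.443 0.647, matching the output
```

Notice that the F-statistic is below 1, which already hints that the between-group variation is smaller than the within-group (error) variation.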


Multiple Comparisons: Critical_Thinking (Tukey HSD)

(I) Track  (J) Track  Mean Difference (I-J)  Std. Error  Sig.  95% CI Lower  95% CI Upper
Track_A    Track_B    .30000                 .53472      .842  -1.0258       1.6258
Track_A    Track_C    .50000                 .53472      .623  -.8258        1.8258
Track_B    Track_A    -.30000                .53472      .842  -1.6258       1.0258
Track_B    Track_C    .20000                 .53472      .926  -1.1258       1.5258
Track_C    Track_A    -.50000                .53472      .623  -1.8258       .8258
Track_C    Track_B    -.20000                .53472      .926  -1.5258       1.1258

The Mean Difference column indicates the mean difference between two groups. The mean difference between groups A and B is .30 points; the mean difference between groups A and C is .50 points.

The last two columns give the 95% confidence interval around each mean difference. For Tracks A and B this interval is -1.03 to 1.63. Notice that 0 is contained in this interval, so the difference between the Track A and Track B populations may well be 0.

Values in the Sig. column must be below .05 or .01 to be statistically significant; none of these values are statistically significant. Look at the value .842: this can also be interpreted in the following way. When we assume the differences between Track A and Track B critical thinking scores are due to chance, we will observe a mean difference of .30 about 84% of the time.

It is important to note that much of the information in this table is redundant; the same information is given more than once. I have highlighted in yellow the unique information within the table.
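For completeness, SciPy (version 1.8 or later, an assumption about your installation) provides a Tukey HSD implementation that mirrors this table; a sketch with the track data:

```python
from scipy import stats

track_a = [4, 4, 3, 3, 2, 4, 4, 2, 3, 1]
track_b = [1, 2, 3, 3, 4, 3, 3, 2, 4, 2]
track_c = [2, 3, 4, 1, 1, 3, 4, 1, 5, 1]

# Tukey's HSD compares every possible pair of groups, adjusting the
# p-values for the fact that multiple comparisons are being made
res = stats.tukey_hsd(track_a, track_b, track_c)
print(round(res.pvalue[0, 1], 3))  # adjusted p-value for Track A vs. Track B
```

The result's p-value matrix is symmetric, just like the SPSS table: the A-vs-B entry repeats as B-vs-A.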


Association/Relationship and Prediction

At times we may not be interested in group differences or differences in two or more means. We can also utilize rubric data to ask questions about relationships and/or prediction. Questions regarding relationships are concerned with whether changes in one variable are associated with changes in a second variable; these relationships are assessed with correlations. Correlations are a prerequisite to good predictions. There are multiple kinds of correlations, and which correlation we calculate depends upon the way we have measured our variables. Though the definition remains controversial, measurement is typically concerned with how we assign numbers. For example, if we want to measure sex we may give females a 1 and males a 2. Sex is considered a categorical variable because the numbers reflect different categories. The numbers themselves don't mean anything, because we could easily replace them with any other numbers; a higher score on the variable sex is therefore meaningless. Our rubric data is different because the numbers carry more meaning. For our purposes, we are going to assume that the rubric data is measured at the interval level. This means that higher scores reflect more critical thinking and that the distance between scores is equal. In other words, the difference in critical thinking between a person who scored a 1 and a person who scored a 2 is the same as the difference between a person who scored a 4 and a person who scored a 5. For the sake of brevity we are only going to introduce correlations where we have two continuous variables measured at the interval scale. If you are utilizing rubric data it is generally a safe assumption that it is measured on the interval scale. If you want to examine the correlation between your rubric and other variables, it is important that you consider how the other variable is measured so that you pick the right test. If your variable is a scale (e.g., several items to which a person responds from "strongly agree" to "strongly disagree"), then this example will be applicable to you. This example would not work if you wish to correlate a categorical variable, such as sex, with the rubric data.


Correlations

As previously discussed, correlations represent the degree to which changes in one variable are associated with changes in a second variable. What exactly does this mean? The meaning becomes clear whenever we examine a graph referred to as a scatterplot. Scatterplots are useful summaries of how two variables are associated with each other, and they are relatively easy to construct.

Scatterplots – Positive Correlation

To take a simple example, let's examine the relationship between IQ and academic achievement among 15 students. What are some things to notice about the scatterplot below?


Each student in this graph is represented by a circle. For example, student 1 has an IQ of 70 and an achievement score of 65. Each student's achievement and IQ would be plotted in a similar manner. We can also see that as students' IQs get bigger, so do their achievement scores. This is referred to as a positive correlation. Positive correlations are indicated when two variables move in the same direction: increases in one variable are associated with increases in a second variable, or decreases in one variable are associated with decreases in another. Another way of viewing the scatterplot is to say, "as student IQ tends to get lower, so too does their achievement score." As we can see from the graph, the scores are best represented by a straight line. The other thing to notice about this plot is that there are two cases that don't fit the pattern of the other students. One student has a relatively high IQ of around 120 and a low achievement score, while the other has an average IQ of 100 and a nearly perfect achievement score. Once we re-examine the data for possible mistakes, the fact that these two students fail to display typical patterns may warrant further investigation. Perhaps these students display different levels of motivation? Despite this possibility, within this contrived example we can infer that a fairly strong relationship exists between IQ and achievement. This will be examined further when we calculate an actual correlation coefficient, but for now let's examine what a negative correlation would look like on a scatterplot.

Scatterplots – Negative Correlation

In this example, let's consider the relationship between achievement and the number of hours spent partying. An examination of the scatterplot below indicates that as students engage in more partying they tend to have lower achievement scores. Once again, the data appear to be best represented by a straight line. There is one exception, however: a student who engaged in relatively low levels of partying yet had a low achievement score. What we notice in this case is that the variables move in different directions. As partying increases, achievement tends to decrease.


Correlation Sign and Strength

When calculating correlation coefficients, if we obtain a negative value then we have a negative correlation, which indicates that increases in one variable are associated with decreases in the second variable. The correlation coefficient for partying and achievement would therefore have a negative value. If a correlation has a positive value then both variables move in the same direction: increases in one are associated with increases in the other, or decreases in one are associated with decreases in the other. A positive or negative correlation is thus indicated by the sign of our coefficient. For example, when calculated, the correlation between partying and achievement was -.746. The negative sign indicates that we have a negative correlation. The calculated correlation between IQ and achievement was .681; the fact that this is a positive value indicates that the relationship is also positive.


The fact that a correlation is positive or negative, however, indicates nothing about the strength of the relationship. In order to assess strength we must look at the actual numbers that are calculated. Correlations range from -1 to +1. As coefficients get closer to either -1 or +1 the relationship is stronger; in fact, a correlation of -1 or +1 is considered perfect. Coefficients that approximate 0 indicate no relationship. This is illustrated in the scatterplots provided below.

Note: The diagram comes from http://allpsych.com/researchmethods/correlation.html and was pulled from their website on August 22, 2011. Within the figure, r = correlation. Notice in the diagram above that for a perfect correlation (i.e., -1 or +1) every score falls exactly on a straight line. This is extremely rare in educational research. Notice that as a correlation gets smaller in strength the scores tend to spread further away from a straight line. No relationship (e.g., correlation = 0.0) indicates that changes in one variable are not systematically related to changes in a second variable.
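A perfect correlation is easy to demonstrate: when every point falls exactly on a straight line, Pearson's r is exactly +1 for an ascending line and -1 for a descending one. A minimal sketch with SciPy (an assumed tool; any statistics package gives the same values):

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # exactly linear and increasing
y_down = [10, 8, 6, 4, 2]  # exactly linear and decreasing

r_up, _ = stats.pearsonr(x, y_up)
r_down, _ = stats.pearsonr(x, y_down)
print(round(r_up, 2), round(r_down, 2))  # 1.0 -1.0
```

Adding any scatter around either line would pull the coefficient away from ±1 toward 0, which is what the diagram's middle panels illustrate.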


Quick Overview of Correlations

A quick overview of what we have learned about correlations thus far is in order. For this overview we will interpret the inter-correlations among three different variables, provided in the correlation matrix below.

             IQ     Achievement  Partying
IQ           1.00   0.68         -0.81
Achievement  0.68   1.00         -0.75
Partying     -0.81  -0.75        1.00

Reading across the row labeled IQ, we see that the correlation between IQ and IQ is 1.00. This should not be surprising given that IQ should be perfectly related to itself. The correlation between IQ and Achievement is .68 and the correlation between IQ and Partying is -.81. The positive correlation of .68 indicates that as IQ scores increase, so too does achievement. The negative correlation of -.81 indicates that as IQ increases, hours spent partying tend to decrease. We can see that -.81 is closer to -1 than .68 is to +1. Another way of viewing this is to look at how far each correlation is from 0. Because -.81 is further from 0 than .68, the value -.81 is considered stronger than .68. We should therefore notice that the observed relationship between IQ and partying is stronger than the observed relationship between IQ and achievement.

Let's now look at the second row, for Achievement. We see that achievement correlates with IQ at .68, which we already know from reading the IQ row. In other words, this value is redundant because it doesn't provide unique information. (Unique information is highlighted in yellow.) Reading across the row we notice that achievement is perfectly correlated with itself, which is once again not very useful. The final piece of unique information is the correlation of -.75 between achievement and partying. This indicates that as hours spent partying increase, achievement scores tend to decrease.

From this concocted correlation matrix we have learned that individuals with higher IQs tend to party less and achieve more than those with lower IQs. Time spent partying was also negatively related to achievement.
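The redundancy just described is a general property of any correlation matrix: it is symmetric, with 1s on the diagonal. A sketch with NumPy's `corrcoef` using made-up scores (these numbers are illustrative only, not the chapter's data):

```python
import numpy as np

# Hypothetical scores for five students (illustrative only)
iq = [95, 100, 110, 120, 125]
achievement = [70, 78, 80, 90, 92]
partying = [20, 15, 12, 8, 5]

# np.corrcoef returns the full matrix of pairwise Pearson correlations
m = np.corrcoef([iq, achievement, partying])
print(np.round(m, 2))
```

Only the entries above (or below) the diagonal are unique; the rest mirror them, just as in the matrix interpreted above.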


Correlation SPSS/PASW Output

The present output is aligned with our previous examples in that we are correlating two continuous variables (e.g., scale-like responses) that are assumed to be at an interval level of measurement. For this we are estimating what is technically referred to as a Pearson product-moment correlation coefficient. All of the correlations we have examined thus far have been of this kind, so nothing new is being presented. We will focus on the output derived from SPSS/PASW. A current tutorial is found at the following link: http://www.youtube.com/watch?v=loFLqZmvfzU

Research Questions: a) Is there a relationship between IQ and student achievement? b) Is there a relationship between IQ and partying? c) Is there a relationship between partying and student achievement?

Null Hypothesis: All possible correlations among achievement, IQ, and partying equal 0. In other words, there is no relationship among these variables.

Alternative Hypothesis: There is a relationship among all possible combinations of these variables.

Now that we have specified our null and alternative hypotheses, let's take a closer look at the output as presented in SPSS/PASW. Once again, unique information is highlighted in yellow.


Correlations

                                  IQ       Achievement  Partying
IQ           Pearson Correlation  1        .681**       -.814**
             Sig. (2-tailed)               .005         .000
             N                    15       15           15
Achievement  Pearson Correlation  .681**   1            -.746**
             Sig. (2-tailed)      .005                  .001
             N                    15       15           15
Partying     Pearson Correlation  -.814**  -.746**      1
             Sig. (2-tailed)      .000     .001
             N                    15       15           15

**. Correlation is significant at the 0.01 level (2-tailed).

Notice that ** indicates that the correlation is significant at least at the .01 level. This means that a correlation of this magnitude would be observed less than 1% of the time if we assume that there is no relationship among the variables.

The Sig. (2-tailed) row provides the actual significance level, or p-value. For IQ and Partying this value is .000, which would be reported as follows: "The correlation between IQ and Partying was -.81 (p < .001)." This suggests that the observed correlation would occur less than 0.1% of the time if there were no relationship between the two variables.

The N row provides the total number of cases included in the calculation of each correlation. Since there are no missing responses, 15 cases are used to calculate each correlation.


Interpretation

Pearson product-moment correlation coefficients were calculated in order to examine the relationships among achievement, IQ, and partying in a sample of 15 students. Results indicated that higher IQ was related to higher achievement scores (correlation = .681, p = .005) and to fewer hours per week spent partying (correlation = -.814, p < .001). Increases in weekly partying were associated with lower levels of achievement (correlation = -.746, p = .001).
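Correlation-and-p pairs like those reported can be generated outside SPSS as well. A sketch with SciPy's `pearsonr` on hypothetical data (the chapter's raw 15-student dataset is not reproduced in this section, so these ten values are illustrative only):

```python
from scipy import stats

# Hypothetical IQ and achievement scores for illustration;
# not the chapter's 15-student sample
iq = [88, 92, 95, 100, 102, 105, 108, 110, 115, 120]
achievement = [70, 72, 78, 75, 82, 85, 80, 88, 90, 95]

# Pearson product-moment correlation with its two-tailed p-value
r, p = stats.pearsonr(iq, achievement)
print(f"correlation = {r:.3f}, p = {p:.3f}")
```

Running one `pearsonr` call per pair of variables reproduces all of the unique entries in an SPSS correlation matrix.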

Simple Linear Regression

This section will be the most technical aspect of the chapter. To keep matters simple, the mathematical elements of this technique will generally be avoided so that we may focus solely on a conceptual understanding of the procedure. This conceptual understanding will allow you to feel comfortable utilizing the technique, and the principles learned can be generalized (with some caution) to more advanced procedures. When you have data, it is important that you develop an intimate understanding of what the data are telling you. You should therefore always graph the data so that you can identify potentially extreme cases and make sure that there is a tendency for scores to fall on a straight line. If the data look curvilinear or do not seem to fall on a line, then this procedure may not be suitable for your analysis. It is also important to recognize that small correlations (e.g., .20) may be statistically significant but look rather messy when examined with a scatterplot. In these situations you must use your best judgment about the utility of the procedure. A simple linear regression may be used when we want to predict one variable from another. For example, if we have rubric data for critical thinking scores measured among freshmen, we may be interested in determining whether such scores are a significant predictor of GPA in a following semester. If we have critical thinking scores measured at graduation, we may be concerned with whether GPA is a significant predictor of critical thinking scores among a graduating class. As seen from these examples, the specification of which variable is a predictor and which is an outcome is


a conceptual, as opposed to a statistical, problem. Ideally, theory should guide these decisions. In practice, however, theory may not be decisive, so we must use our best judgment as researchers. The variable that we use to make predictions will be referred to as the predictor variable (i.e., independent variable), and the variable being predicted will be called the criterion variable (i.e., dependent variable). An obvious example of something problematic, however, would be the use of a predictor that follows the criterion in time. For example, it would not make sense to predict history of abuse (criterion) from current depression scores (predictor), or to use critical thinking scores at graduation (predictor) to predict motivation when entering a degree program (criterion). A predictor should at least be capable of being conceived as preceding the criterion in time.

Regression Equation

Simple linear regression is built upon a single equation, the same equation we would use to estimate a straight line. For our purposes, we do not need to worry about how each part of this equation is calculated; this section will instead address the conceptual meaning of each component. Unfortunately, textbooks in this area do not agree about which symbols to use in these equations. If you understand the conceptual elements, however, you can always take note of the symbols a particular text prefers. The concept, not the symbol, is what matters for understanding this procedure.

What are we trying to do? Before examining the regression equation in detail, let's consider the goal. Take a second look at the scatterplot given below, where we have predicted achievement from IQ. Notice the line traveling through the data; it is this line that we are going to estimate with this procedure, and we can use it to make predictions.


We estimate this line with the following linear equation:

Y = a + bX + e

Y = the criterion, or what we are predicting.
a = a constant; the point at which our line crosses the Y axis. This is also our estimated level of achievement when an individual has an IQ of 0.
b = the slope, which indicates the steepness of our line. Flatter lines have smaller slopes. A negative value for b indicates that the line travels in the same direction as a negative correlation.
X = our scores on IQ (the predictor).
e = our estimate of error. Error reflects the distance between an actual person's score and what our line would predict.

[Scatterplot of achievement (Y axis) predicted from IQ (X axis), with the fitted regression line, the intercept a, and the slope b marked, along with the predicted score for a circled case. The distance from a dot to the line is an example of error: the dot indicates where an individual actually falls, and the line is used to make predictions. For the circled dot, the person has an IQ of a little less than 120 and an achievement score of about 65, while the line would predict an achievement score of about 88. The difference between their actual and predicted achievement is a source of error.]


Linear Regression Research Questions

Within a simple regression analysis we are generally interested in two basic questions. The first, informed by our previous discussion, is whether a variable is a significant predictor of a criterion. If interested, we could not only determine whether a particular variable is a statistically significant predictor, but also build an equation that allows us to predict individual scores. The second question pertains to the amount of variance in the criterion that can be accounted for by a single predictor or a set of predictors. In the present example, we focus on a single predictor. Just as we can square a correlation coefficient to get an indication of the variance shared between two variables, in regression we may obtain what is referred to as an R-square value. R-square ranges from 0 to 1, with greater values indicating more shared variance. We now turn to an example.

Research Questions:
a) How much variance in achievement can be accounted for by IQ scores?
b) Is IQ a significant predictor of achievement?
c) What is the regression equation predicting achievement from IQ?

Null Hypotheses - Remember, the null hypothesis always assumes that nothing is happening. Given this assumption, we will assume that the variation in achievement that can be accounted for by IQ is 0. We will also assume that IQ is not a statistically significant predictor of achievement. Note that it is possible to generate a regression equation from any predictors, even those that fail to be significantly related to your criterion variable. Just because it is possible to get a regression equation does not mean you should use it. Many factors outside of a study can affect these equations, so they should always be interpreted with caution.


Alternative Hypotheses - This is once again the inverse of our null hypotheses. These hypotheses assert that the variation in achievement accounted for by IQ is greater than 0 and that IQ is a significant predictor of achievement. For a tutorial on conducting a simple linear regression in SPSS/PASW, see the following link: http://www.youtube.com/watch?v=2HqCpuWd4ek

Simple Linear Regression SPSS/PASW Output

This presentation focuses solely on the output that pertains to a simple linear regression.

Variables Entered/Removed(b)

Model | Variables Entered | Variables Removed | Method
1     | IQ(a)             | .                 | Enter

a. All requested variables entered.
b. Dependent Variable: Achievement

The Variables Entered column indicates that IQ was entered into the regression equation as a single predictor. In more complex analyses we may estimate more than one model, but in a simple linear regression this will always refer to your predictor variable.

The footnote identifies your dependent variable, or the criterion; this is what we are predicting. In the present model we are therefore predicting achievement from IQ.


Model Summary

Model | R       | R Square | Adjusted R Square | Std. Error of the Estimate
1     | .681(a) | .463     | .422              | 8.85588

a. Predictors: (Constant), IQ

R reflects the total correlation between IQ and achievement. If we had multiple predictors, it would reflect the strength of the correlation between achievement and the set of predictors.

R Square indicates the amount of variance shared between our predictors and the criterion variable. We may calculate this value by simply squaring the R of .681 in the output. In our sample, R-square is estimated to be .463, which indicates that 46.3% of the variance in achievement scores is shared with IQ.

Remember, we only have a sample, and there is a tendency for R-square to be bigger in a sample than in the population. Adjusted R Square gives an estimate of what this value may be in the population.


ANOVA(b)

Model        | Sum of Squares | df | Mean Square | F      | Sig.
1 Regression | 879.788        | 1  | 879.788     | 11.218 | .005(a)
  Residual   | 1019.546       | 13 | 78.427      |        |
  Total      | 1899.333       | 14 |             |        |

a. Predictors: (Constant), IQ
b. Dependent Variable: Achievment

An F-statistic (an ANOVA) is calculated in order to determine whether our R-squared value is statistically significant. In our example, the null hypothesis assumes that the shared variance between IQ and achievement is 0. What, then, is the likelihood of observing an R-square of .463? The .005 indicates that we would observe this less than 1% of the time if these variables truly shared no variance. Since this value is less than .05 or .01, we can conclude that the R-squared value is statistically significant.


Coefficients(a)

Model        | B      | Std. Error | Beta | t     | Sig. | Tolerance | VIF
1 (Constant) | 39.083 | 12.624     |      | 3.096 | .009 |           |
  IQ         | .415   | .124       | .681 | 3.349 | .005 | 1.000     | 1.000

(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient. Tolerance and VIF are collinearity statistics.)
a. Dependent Variable: Achievment

The constant is the point at which our regression line crosses the Y axis. In other words, this is the predicted achievement score for an individual with an IQ of 0.

The B value for IQ is the slope, or b, of our regression line. One way of interpreting it: for every one-point increase in IQ scores we predict an increase in achievement of .415 points.

The slope is tested for statistical significance with a t-test, and we want the Sig. value to be less than .05 or .01. In this example, the slope of .415 is statistically significant, since .005 is less than .05 or .01.


Building our Regression Equation

Coefficients(a)

Model        | B      | Std. Error | Beta | t     | Sig. | Tolerance | VIF
1 (Constant) | 39.083 | 12.624     |      | 3.096 | .009 |           |
  IQ         | .415   | .124       | .681 | 3.349 | .005 | 1.000     | 1.000

a. Dependent Variable: Achievment

Y' = a + bX

For predictions we drop the error term from our regression equation, making our new equation Y' = a + bX. As indicated above, the output corresponds to specific parts of this equation: Y' is our predicted achievement and X is our IQ variable. We can now insert the estimated values into the equation.

Predicted Achievement = 39.08 + .415 (IQ).

This equation may now be used in order to predict achievement from IQ scores. Let’s say that student 1 has an IQ score of 85. What is their predicted achievement score? To use the equation we would simply substitute 85 for the variable IQ.

Predicted Achievement = 39.08 + .415 (85)

If we work out this equation, we predict that a student with an IQ of 85 would have an achievement score of 74.36. Though we are able to make these predictions, it is important to keep in mind that such predictions will have a degree of error. We can derive prediction equations from any predictor, even one that is completely random; consequently, just because we can construct an equation does not imply that it is good. This equation is constructed on one sample and will work best for the sample on which it is based. If we were to use it on a different sample of students, our predictions would become less accurate.
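The worked prediction above can be wrapped in a small helper. The coefficients a = 39.083 and b = .415 are taken from the chapter's output; the function name is ours:

```python
# Coefficients from the output: a (constant) = 39.083, b (slope) = .415.
def predict_achievement(iq: float) -> float:
    """Apply Y' = a + bX."""
    return 39.083 + 0.415 * iq

print(round(predict_achievement(85), 2))  # 74.36, as in the text
```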


Choosing a Statistical Test

Two questions are paramount when deciding which statistical test to use. First, you must articulate clear research questions. Too often individuals collect data haphazardly without due consideration of the questions pertinent to their research. There are notable exceptions, particularly in some areas of qualitative inquiry, where an individual may intentionally immerse themselves within a particular context or culture while attempting to limit the influence of preconceived ideas. This advice is not an effort to critique such methodologies, which are valuable in their own right. What is suggested, however, is that when conducting a survey or utilizing rubric data in program assessment there should be some a priori consideration of the questions guiding the research. What are you trying to learn? Are you concerned with whether your rubric data is related to other variables of interest? Are you interested in making predictions, or are you more concerned with examining average differences across groups or years? There is a range of possibilities, and you will find that as your knowledge of research methodology advances, so too does your ability to formulate interesting questions. Remember, you are not limited to one question within a single study. Try to get as much as you can with as little as you can; in other words, try to design studies that can answer a range of important questions without overwhelming your participants.

The second main consideration is determining how to collect the data needed to address your research questions. If you are interested in whether scores from a critical thinking rubric are associated with program satisfaction, then it is necessary to think about how you are going to measure each of these variables. Poor measurement can stifle our ability to answer well-formulated research questions. The data that you collect is intentional and should be an effort to address the questions that are critical to your study. There may be times when you include a variable in a study for exploratory purposes. Even such explorations, however, are often interesting because they potentially elucidate specific questions (e.g., Will X be related to Y? Do males and females tend to report different levels of X?). Since the data you


collect is an intentional act, each variable should have some importance to your study. If it is not relevant or even potentially relevant then why measure it?

Once we have formulated well-reasoned questions and decided how our variables should be measured, the next question concerns statistical tests. Two simple charts are provided. The first is for questions pertaining to mean differences. It assumes that we have one independent variable (e.g., group or year) and one dependent variable, and that the dependent variable is continuous and measured at an interval level.

The second table depicts the statistical test that is appropriate when correlating two variables that are measured differently. Levels of measurement were explained in a previous section, but for convenience they will be briefly reviewed here. Nominal variables are categories: if we assign males a "1" and females a "2," these numbers simply demarcate different groups. Notice that the numbers are arbitrary, since we could assign any number to males and females. The second level is referred to as ordinal and reflects a variable that has been rank-ordered. For example, let's say

Research Question: Mean Differences

Sample/Concern     | Statistic
1 sample           | One-sample t-test
2 different groups | Independent samples t-test
Same group twice   | Paired samples t-test
More than 2 groups | ANOVA


that we have a class of thirty students and we rank them according to height. The tallest student is assigned a "1," the next in height a "2," and so on. With this level of measurement, a "1" indicates that the person is taller than the second person, and we know that "2" is taller than "3." However, we cannot say that the difference in height between person "1" and person "2" is the same as the difference in height between person "2" and person "3." With interval-level measurement, the distance between each number is the same (note: Likert-type questions are frequently assumed to be interval). For example, if we measure temperature in Fahrenheit, we know that the difference in heat from 30 to 40 degrees is the same as the difference between any other two readings 10 degrees apart. Fahrenheit, however, does not have a true "0" point; in other words, a "0" does not indicate the absence of heat. Measures that have equal distances between values along with a true 0 are referred to as ratio-level measurement. An example of this level of measurement is temperature measured in Kelvin.
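A tiny sketch (plain Python, invented heights) makes the ordinal point concrete: equal gaps in rank need not mean equal gaps in the underlying measure:

```python
# Hypothetical heights (cm) for five students.
heights = [180, 172, 171, 169, 160]

# Rank them: tallest gets rank 1, next gets rank 2, and so on.
ranks = {h: i + 1 for i, h in enumerate(sorted(heights, reverse=True))}

# One rank apart in both cases, but the height gaps differ (8 cm vs 1 cm):
gap_1_2 = 180 - 172   # between rank 1 and rank 2
gap_2_3 = 172 - 171   # between rank 2 and rank 3
print(ranks[180], ranks[172], ranks[171], gap_1_2, gap_2_3)  # 1 2 3 8 1
```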

A complete review of the statistics below falls well beyond the scope of the present chapter. However, for a basic introduction to conducting such statistics in SPSS/PASW you may visit the following link http://www.statsoft.com/textbook/elementary-concepts-in-statistics/


Statistical Test for Correlating 2 Variables with Different Levels of Measurement

Variable 1     | Variable 2: Nominal | Variable 2: Ordinal | Variable 2: Interval/Ratio
Nominal        | Chi-square          | Chi-square          | Point-biserial
Ordinal        |                     | Spearman's rho      | Point-biserial; Spearman's rho
Interval/Ratio |                     |                     | Pearson's product-moment
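Two pairings from the table can be illustrated in code. This sketch assumes Python with SciPy and invents small datasets: a nominal (0/1) variable with an interval variable calls for the point-biserial correlation, while two ordinal rankings call for Spearman's rho:

```python
from scipy.stats import pointbiserialr, spearmanr

# Invented data. gender is nominal (0 = male, 1 = female); score is an
# interval-level rubric score; rank_a and rank_b are ordinal rankings.
gender = [0, 0, 0, 1, 1, 1, 0, 1]
score  = [2.1, 2.5, 2.8, 3.4, 3.6, 3.9, 2.4, 3.2]
rank_a = [1, 2, 3, 4, 5, 6, 7, 8]
rank_b = [2, 1, 4, 3, 6, 5, 8, 7]

r_pb, p_pb = pointbiserialr(gender, score)   # nominal x interval/ratio
rho, p_s   = spearmanr(rank_a, rank_b)       # ordinal x ordinal
print(f"point-biserial r = {r_pb:.2f}, Spearman's rho = {rho:.2f}")
```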

Statistical Significance and Practical Significance: Effect Size

As emphasized throughout this chapter, statistical significance has a very specific meaning. Many factors aside from actual differences between two groups may contribute to finding a statistically significant result. For example, statistical significance is in part a function of sample size, so as your sample size increases it becomes easier to find a statistically significant result. With extremely large or small samples it is possible for significance tests to be misleading. If you have 1,000 people in your study, a correlation as small as .081 can be statistically significant. In studies of mean differences, very large samples can make seemingly trivial differences statistically significant. The inverse is also true: large differences may fail to reach statistical significance in very small samples.

This effect is not completely unjustified. If you have a very small sample of males and females (say 5 in each group), chance plays a larger role in determining whether, say, a difference of 10 points is likely to be observed. Think back once again to our psychic friend. If our friend successfully predicted whether a coin would land heads on a single trial, this was not convincing


evidence that he had psychic ability. In other words, if we take only one sample of his behavior, it is quite likely that he could guess correctly and still lack this ability. The same reasoning applies to all significance tests. Small samples make large differences more likely by chance alone. Larger samples control for this possibility, but there is a point at which samples become so large that just about everything is statistically significant!

Unfortunately, it remains challenging to give general guidelines about how large a sample should be, since the answer changes depending upon the specific context of each study and the statistical tests you are running. You should, however, remain cautious about findings based on what seem to be overly large or small samples. Generally speaking, if you are comparing two or more groups, then having fewer than 30 in each group may be considered extremely small; thirty individuals, though fairly small, is typically enough to estimate a correlation between two variables. With simple linear regression, remember that equations generated from one sample will be less accurate when applied to a different sample.

Given the many limitations of statistical significance testing, many researchers have advocated the use of effect size statistics along with confidence interval estimates. Since confidence intervals have already been discussed in detail, the rest of this section focuses on three effect size statistics that can be used with the analyses illustrated in this chapter.

Effect Size Statistics

Effect sizes are estimates of the magnitude of a particular effect, such as the difference between two groups or the strength of a relationship between two variables. There are numerous effect size statistics, each with its own interpretation. Many statistical programs remain relatively limited in their estimation of effect size statistics, so some hand calculations are usually necessary. Three commonly employed effect size statistics are detailed below.


Cohen's d

Cohen's d is a measure of the magnitude of the difference between two groups. It is referred to as a standardized measure because it expresses the difference in average scores in terms of the standard deviation; technically, Cohen's d estimates the number of standard deviations that separate the two group means. For example, suppose we calculate a female mean of 100 and a male mean of 110. If the standard deviation is 10, then we get a Cohen's d of 1.0; in other words, the group means are separated by one standard deviation. Cohen's d is commonly used to estimate effect sizes when conducting a t-test. SPSS/PASW does not currently provide this statistic, so some hand calculations on the output may be necessary. There are two different formulas for this statistic, one of which is more intuitive than the other; since the intuitive formula takes more hand calculation, a simplified formula is presented below, along with output from the independent samples t-test that was previously calculated.

Remember, in this example we were comparing the average critical thinking scores of females and males. We found that the average critical thinking score of females (mean = 3.6, standard deviation = 1.3) was not significantly different from that of males (mean = 2.8, standard deviation = 1.1) when conducting an independent samples t-test (p = .169).


d = 2t / √df

t = the calculated t from our SPSS output.
df = the degrees of freedom from our SPSS output.

Independent Samples Test

Critical_Thinking           | Levene's F | Sig. | t     | df     | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI Lower | 95% CI Upper
Equal variances assumed     | .559       | .464 | 1.434 | 18     | .169            | .80000          | .55777                | -.37184      | 1.97184
Equal variances not assumed |            |      | 1.434 | 17.486 | .169            | .80000          | .55777                | -.37431      | 1.97431

We can see that the SPSS/PASW output provides ample information to fill in the necessary elements to calculate Cohen’s d. Hand calculations would thus give the following solution:

d = 2(1.43) / √18 ≈ .67
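The shortcut formula can be checked in code. This sketch uses the t and df from the SPSS output (t = 1.434, df = 18); the function name is ours:

```python
import math

# t and df come from the "Equal variances assumed" row of the output.
def cohens_d_from_t(t: float, df: float) -> float:
    """Shortcut formula: d = 2t / sqrt(df)."""
    return 2 * t / math.sqrt(df)

d = cohens_d_from_t(1.434, 18)
print(round(d, 2))  # 0.68; the hand calculation's .67 reflects rounding t to 1.43
```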

Interpretation of Cohen's d

The following general guidelines have been provided by Cohen (1988): a small effect = .20, a medium effect = .50, and a large effect = .80. These are rough guidelines and were not meant to be hard rules for interpreting the magnitude of effect sizes. No clear guidelines exist which may be used to ascertain whether an effect is meaningful.


Thus what constitutes a large or meaningful effect may change depending upon the context of the study. For example, if you were examining the effectiveness of an intervention, it may be meaningful to know that previous interventions had relatively small effects, say less than .08; an effect of .15 for your intervention would be small but still meaningful. In the analysis outlined above, the means of the two groups are approximately .67 standard deviations apart, indicating that the average score of females is approximately .67 standard deviations higher than that of males.

A more intuitive understanding of Cohen's d is gained by converting d estimates to percentiles (Coe, 2002). The table below indicates the percentage of group 2 (males) that would have a score lower than the group 1 (female) mean under different effect sizes. From this table we can see that with an effect size of .67 (close to .70), a little less than 76% of males would have a critical thinking score lower than that of the average female. If we found a Cohen's d of 0.0, we would conclude that 50% of males have a critical thinking score lower than the average female's; in other words, the distributions of critical thinking among females and males would completely overlap (i.e., there would be no difference in critical thinking scores).

Notice that we failed to find a statistically significant difference between female and male critical thinking scores using an independent samples t-test, yet the calculated Cohen's d indicates that the effect of gender on critical thinking is large. How should we make sense of this inconsistency? People will have different preferences here. Some would give more weight to the significance test and less to the effect size estimate; others, particularly critics of significance testing, may give more credence to the effect size. In this situation you should report both findings and let the reader decide. I would personally attribute the failure to find a statistically significant result to the small sample size and give preference to the effect size statistic.


This is something that you must judge for yourself as a researcher, given your specific aims and the overall context of the study.

Cohen's d and Percentile Estimates

Cohen's d | Percentage in Group 2 scoring below the Group 1 average
0.0       | 50
0.1       | 54
0.2       | 58
0.3       | 62
0.4       | 66
0.5       | 69
0.6       | 73
0.7       | 76
0.8       | 79
0.9       | 82
1.0       | 84
1.2       | 88
1.4       | 92
1.6       | 95
1.8       | 96
2.0       | 98
2.5       | 99
3.0       | 99.9

(Table adapted from McGough & Faraone, 2009, p. 23)
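For the curious, these percentile entries follow from the standard normal distribution: with normally distributed groups of equal spread, the percentage of group 2 below the group 1 mean is the normal CDF evaluated at d. A quick check with Python's standard library:

```python
from statistics import NormalDist

# Percentage of group 2 falling below the group 1 mean, for selected d values.
for d in (0.0, 0.5, 0.7, 1.0):
    pct = NormalDist().cdf(d) * 100
    print(f"d = {d}: {pct:.0f}%")   # prints 50%, 69%, 76%, 84%, matching the table
```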

Omega Squared

Omega squared (ω²) is a measure of the proportion of variance in the dependent variable that can be accounted for by the independent variable. Omega squared is an estimate of this value in the population, not the sample. It is a common effect size statistic within the context of ANOVA, where we are comparing differences in two or more averages. Though this effect size statistic is commonly employed, it does not appear in SPSS/PASW output. Thus we will introduce the hand calculations for


this statistic. First, however, the output from the earlier ANOVA example is presented. In that example we asked whether graduates from tracks A, B, and C had, on average, different critical thinking scores; the "Sig." column in this output indicated that these differences were not statistically significant. Let's now look at the formula for omega squared and note which values from the output are placed in the numerator and which in the denominator.

ANOVA

Critical_Think | Sum of Squares | df | Mean Square | F    | Sig.
Between Groups | 1.267          | 2  | .633        | .443 | .647
Within Groups  | 38.600         | 27 | 1.430       |      |
Total          | 39.867         | 29 |             |      |

ω² = (SS_between − (df_between)(MS_error)) / (SS_total + MS_error)

(SS_between and df_between come from the Between Groups row, MS_error is the Within Groups mean square, and SS_total is the Total sum of squares.)


From this information we would simply plug in the needed values to obtain the following solution:

ω² = (1.267 − (2)(1.43)) / (39.867 + 1.43) = −.038

Notice that our calculated value is less than 0. Omega squared will typically range from 0 to 1, indicating the amount of variance accounted for in the dependent variable. It is not technically possible to account for negative variance, thus


these results are meaningless. This happened because we failed to obtain a significant F ratio; negative values of omega squared are possible whenever the F-statistic is below 1. The value, though without substantive meaning in itself, corroborates the conclusion of no differences among the critical thinking scores that we reached when conducting the one-way ANOVA.
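The omega squared hand calculation can be verified in code, plugging in the ANOVA table values (the function name is ours):

```python
# Values from the ANOVA table: SS_between = 1.267, df_between = 2,
# MS_error (Within Groups mean square) = 1.430, SS_total = 39.867.
def omega_squared(ss_between, df_between, ms_error, ss_total):
    return (ss_between - df_between * ms_error) / (ss_total + ms_error)

w2 = omega_squared(1.267, 2, 1.430, 39.867)
print(round(w2, 3))  # -0.039: negative, as in the hand calculation, because F < 1
```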

Interpreting Omega Squared

Cohen (1988, pp. 284-288) provides the following guidelines:

Small = .01 Medium = .059 Large = .138 or bigger

It is important, once again, to keep in mind that the magnitude of an effect does not necessarily imply that it is meaningful. The guidelines given above are merely rough approximations and should be used as such. Meaning is determined by contextual, as opposed to purely statistical, criteria.

Cohen's f² (or Cohen's f squared)

Cohen also provides an effect size statistic that can be used within the context of regression analyses. Cohen's f² is an index of the variance that can be accounted for relative to the error variance in the population; in other words, this statistic is a ratio of explained to unexplained variation. This becomes more explicit when we examine the equation given below:

f² = R² / (1 − R²)

where R² = explained variance.


From the equation above we can see that R-squared is in the numerator, which expresses the amount of variation in the criterion variable (i.e. dependent variable) that is explained simultaneously by a set of predictors. The denominator (1 – R-squared) would thus indicate the unexplained variance, or that which cannot be accounted for by the set of predictors.

Within our previous simple linear regression analysis we examined IQ as a predictor of achievement. From this output we see that R-Squared is .463, which indicates that IQ accounts for approximately 46.3% of the variance in achievement. We may now plug this into our equation to obtain the following solution:

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .681a   .463       .422                8.85588

a. Predictors: (Constant), IQ

f² = .463 / (1 - .463) = .86

Interpreting Cohen’s f² Cohen provides the following guidelines for the interpretation of this statistic:

Small = .02 Medium = .15 Large = .35

By this standard we can see that our value of .86 would constitute a large effect. It must be kept in mind that all of these guidelines should be used with caution.
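The calculation above is easy to script. A minimal Python sketch (the function names and cutoff labels are illustrative; the cutoffs themselves are Cohen's):

```python
def cohens_f_squared(r_squared):
    """Cohen's f^2: ratio of explained to unexplained variance."""
    return r_squared / (1 - r_squared)

def label_f_squared(f2):
    """Rough size label using Cohen's guidelines (.02, .15, .35)."""
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "trivial"

f2 = cohens_f_squared(0.463)   # R-squared from the IQ/achievement model
print(round(f2, 2), label_f_squared(f2))   # 0.86 large
```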


General Guidelines in Interpreting Quantitative Data This chapter will conclude with an overview of some general guidelines for interpreting quantitative data. Though this chapter has tended to focus on rubric data, these guidelines apply equally to research using most statistical procedures.

1. Explore your data thoroughly!

The importance of exploring your data cannot be overemphasized. Be sure to graph your data in different ways. What do the descriptive statistics tell you? When graphing data it is important to examine potential outliers, or extreme cases. Outliers will reduce the strength of correlation coefficients. This does not mean that outliers should automatically be deleted. Handling outliers can be tricky, but if an outlier is clearly not due to a data entry error, then this person may serve as the basis for subsequent research. Follow-up interviews may be helpful in understanding extreme scores. When calculating a correlation or simple linear regression, make sure that the data appear to fall on a line. Small correlations (e.g., .30 or less) will typically look messy on a scatterplot, so this can be difficult to judge at times. However, if it is drastically apparent that the data are curved or do not fall on a straight line, then more advanced statistical techniques will need to be utilized (e.g., quadratic regression). Exploring your data both graphically and descriptively is an essential step that should be undertaken before you conduct statistical tests.

2. In most circumstances it is inappropriate to use causal language.

Let’s assume that you have 10 students who were assessed for critical thinking upon being accepted to a program. At graduation these students were once again assessed for critical thinking. You calculate a paired samples t-test and find that the average critical thinking score increased by graduation. Is it legitimate to conclude that participating in the program caused an increase in critical thinking? This conclusion is not warranted given the design of the study. All that you know is that scores changed; there is no evidence about what would have happened to their critical thinking had they not participated in the program. It is possible that critical thinking scores would have changed even if the students had not participated. Without this counterfactual information causality cannot be concluded. Very strict requirements must be met in order to infer causality, and even under highly controlled experimental settings it can be difficult to establish. It is generally better to avoid such conclusions (unless you have strong experimental evidence) than to make unwarranted causal claims that will be ridiculed by observant colleagues.

3. When writing your results, always report descriptive statistics.

Descriptive statistics allow people to examine the data in alternative ways. There are times when statistical tests can be misleading. It is possible, for example, for a significance test to indicate that two groups have different means even though this interpretation is not supported by confidence intervals. Statistically significant results may occur even with small effect sizes. Reporting descriptive statistics gives readers the opportunity to calculate confidence intervals or effect sizes if they so desire. Many people may not choose to do this, but it is nice to have the option. Thus in all reports you should, at minimum, include the mean and standard deviation of your variables.
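To illustrate why the mean and standard deviation are enough for readers: a 95% confidence interval for a mean can be reconstructed from M, SD, and n alone. A minimal Python sketch (the function name is illustrative; the critical value should come from a t table for your degrees of freedom):

```python
import math

def ci_for_mean(mean, sd, n, t_crit):
    """95% CI for a mean, built from reported descriptives.

    t_crit is the two-tailed critical t for df = n - 1
    (e.g., 2.093 for df = 19).
    """
    margin = t_crit * sd / math.sqrt(n)
    return (mean - margin, mean + margin)

# Critical thinking descriptives from the example later in this handbook:
# M = 3.2, SD = 1.3, n = 20 papers.
low, high = ci_for_mean(3.2, 1.3, 20, 2.093)
print(round(low, 2), round(high, 2))   # 2.59 3.81
```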

4. There is a difference between prediction and explanation.

Correlations assess how changes in one variable are related to changes in another variable. Increases in critical thinking may be related to increases in GPA. This does not imply that critical thinking causes GPA. Though it remains possible that critical thinking causes GPA, a correlation does not provide evidence that this is the case. As previously indicated, there are strict requirements to meet the standards of causality. Not only does correlational evidence alone fail to meet these standards, the presence of a correlation between X and Y may be interpreted in one of three ways. First, it is possible that X causes Y. Second, Y may be the cause of X. Finally, a correlation between X and Y could mean that both are caused by a third variable, Z. So in our example, critical thinking may cause GPA, GPA may cause critical thinking, or critical thinking and GPA are related because they are both caused by a third variable. To elucidate this last possibility, it may be that both critical thinking and GPA are caused by intelligence. Correlations are important, but they are limited in terms of their explanatory power.

This leads to an examination of the difference between prediction and explanation. Correlations allow us to make predictions, but just because we can predict something does not mean that we have explained it, much less understand it. For example, since the time of Aristotle scholars have been capable of predicting planetary orbits with a high degree of accuracy. Though these predictions were useful, they were based upon theoretical models of the universe that have now been widely rejected. Prediction was not an indication of explanation. Let’s look at an extreme example so that this point becomes more obvious. What if a researcher found a strong correlation (e.g., r = .75) between daily coffee consumption as an undergraduate student and success in graduate school? If we were only interested, say for admission purposes, in predicting success in graduate school, then this finding would allow us to make accurate predictions by simply measuring applicants’ daily coffee consumption. No other information would really be needed. With this extreme example, we can easily see that despite this advantageous finding, an actual explanation of success in graduate school would require much more than asking someone how much coffee they drink. With such extreme examples, these problems become obvious. It would be much more tempting, however, to draw such erroneous conclusions if we were to find similar correlations between critical thinking and success in graduate school.

5. Statistical significance does NOT imply practical significance.

There are many factors that can affect statistical significance. Of these, one important consideration is your sample size. With large samples very small differences will be statistically significant, and with very small samples big differences may fail to be statistically significant. For example, if you have 1,000 people a correlation of .081 will be statistically significant. If we square this correlation (i.e., .081 multiplied by .081) we get .006. In other words, though this correlation is statistically significant, the two variables share only 0.6% of their variance. Is this meaningful? It may or may not be, depending upon the context of your particular study. What this example does indicate is that trivial findings may be statistically significant. Caution should therefore be used in interpreting your findings if you have an extremely large or small sample. In order to account for this problem many researchers have advocated for consistent use of effect sizes when interpreting data. Effect sizes are estimates of the magnitude of observed group differences or relationships. Within numerous contexts, confidence intervals and effect sizes are more telling than significance tests.
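The claim about n = 1,000 can be checked directly. The significance of a correlation can be tested with t = r·sqrt(n - 2) / sqrt(1 - r²); a sketch in Python (standard library only; the 1.96 cutoff approximates the two-tailed .05 critical value at large df):

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.081, 1000
t = t_for_correlation(r, n)
print(round(t, 2))        # exceeds 1.96, so p < .05 two-tailed
print(round(r ** 2, 3))   # yet less than 1% shared variance
```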

6. Testing for statistical significance does not prove your research hypothesis.

This point was illustrated in the discussion pertaining to significance testing. It is a subtle point that may take some time to fully internalize. Statistical significance indicates the likelihood of observing something when we assume that nothing is happening. Thus with our psychic friend we assumed that he was randomly guessing when predicting whether a coin would turn up heads or tails. Even if he were randomly guessing, he would still have a chance, though an extremely small one, of getting 10 out of 10 correct. If he actually predicted 10 out of 10 correctly, it does not prove that he is psychic. It only indicates that such a result would be extremely rare if he were randomly guessing; so rare that we may go ahead and conclude that he has this specific psychic ability. The same reasoning applies to all significance tests. We may observe huge differences in male and female critical thinking scores in a sample of students. If this difference is statistically significant, it only means that the finding would be extremely rare if males and females in fact had similar critical thinking scores. It is in the nature of probability that the improbable happens. Consequently, these differences may merely reflect an extreme sample when in fact there are no differences in the entire population.
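The coin-guessing probability is simple to verify: under random guessing, the chance of 10 correct calls in a row is 0.5 to the 10th power. A quick check in Python:

```python
# Probability of guessing 10 coin flips correctly by chance alone.
p = 0.5 ** 10
print(p)   # 0.0009765625, i.e., less than 1 in 1,000
```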


A Final Note about Samples, Populations, and Sample Size This handbook will conclude with some final remarks about sampling, populations, and sample size requirements. The statistical tests this handbook discusses assume that you have a sample (a subset of individuals) gathered from an identified population (a group that shares at least one thing in common). For example, some programs or universities may ask alumni to participate in follow-up surveys to examine their program satisfaction and job success. So we may have a total of 100 graduates and get a sample of 40 in the follow-up survey. From this sample of 40 we make inferences about the program as a whole. So if males and females in our sample have different levels of satisfaction with the program, we would like to conclude that this is probably true of all of the males and females that participated in the program.

There are numerous problems that can arise in this interpretation. Inferences from a sample back to a population are legitimate to the extent that the sample actually represents the population. The follow-up survey likely consisted of volunteers. Were males or females more likely to volunteer? Is it possible that volunteers may be more or less satisfied than those who chose not to participate? There are no easy answers to this set of problems. One thing to consider is your response rate, or the total number of people who participated relative to those in your entire population. A high response rate may make these problems less likely, but in and of itself it is not a guaranteed solution. The best way to control for this problem is through a simple random sample, in which each individual has an equal chance of being a participant in the study. If we put the participants’ names into a hat and drew 40 people at random, this would constitute a simple random sample. Even if we select people at random, however, we cannot force them to participate in our follow-up survey. Consequently, this form of control is often not feasible in practice. There are some things you can do, however, to investigate this issue. It is possible to compare people who did participate to those who did not on known variables. For example, if 70% of the 100 graduates are male and my sample is 75% female, then this would imply that my sample is not representative of the 100 graduates. Though this is a useful approach, it is still possible that the sample differs from the population on important, yet unknown, variables. For example, maybe volunteers were more likely to be employed than nonvolunteers. Gender differences in other variables may therefore only apply to graduates who happen to be employed. These problems point once again to the tentative nature of data-driven decision making.
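Drawing names from a hat is exactly what `random.sample` does. A minimal sketch in Python (the graduate list is fabricated purely for illustration):

```python
import random

random.seed(42)  # for a reproducible draw

# Hypothetical sampling frame: 100 graduates.
graduates = [f"graduate_{i:03d}" for i in range(1, 101)]

# Simple random sample: every graduate has an equal chance of selection.
sample = random.sample(graduates, 40)

print(len(sample))              # 40
print(len(set(sample)) == 40)   # True: sampling is without replacement
```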

There are other problems that may be common. The statistical tests discussed in this handbook assume that you have a decent sample size. Exactly what constitutes a decent sample size is somewhat controversial. Many texts, however, suggest that a sample size of 30 is sufficient for many statistical tests. Smaller samples can become problematic for reasons that are well beyond the scope of this handbook. Suffice it to say that the statistical tests discussed in this handbook provide the best results under the following conditions:

1. All statistical assumptions are met.
2. Your sample size is at least 30, or you have about 30 individuals in each group.
3. Your sample is representative of the population.

This handbook has not emphasized statistical assumptions and has instead focused upon the interpretation of SPSS/PASW output for each statistical test. These assumptions are discussed in most statistical textbooks. This handbook should therefore be considered a practical supplement to more technical statistical texts.

What should we do, however, if we fail to have a decent sample size or we have measured all of our students? If the problem is a small sample size and you have quantitative data, I would recommend primarily reporting this data with descriptive statistics. In some situations it may be appropriate to use what are referred to as nonparametric statistics (see http://www.okstate.edu/sas/v8/saspdf/stat/chap13.pdf). It is possible in educational assessment that some programs have only a small number of students, thus making it easy to assess all of them. In this situation many of the statistics discussed in this handbook seem a little odd to use. Running statistical significance tests on all of the students would make sense if you think of all of the students in the program as really being a sample. In other words, it is possible to view the entire group of students as a subset of all possible students in the program. If you would like to make inferences to these possible students, then significance tests would appear to make sense. They would not make sense, however, if you are only interested in these particular students (i.e., these students are your population).

This will be illustrated with a brief example because it is rather abstract. Let’s say that all 100 of our students participated in the follow-up survey. What if we were interested in differences in male and female satisfaction? If we assume that the males and females admitted to our program are really just a subset of all possible males and females that we could have admitted, a statistical significance test would be appropriate. A significance test, however, is not appropriate if we only care about these particular males and females. In this situation there is no need to estimate population differences because we can actually observe these differences by simply looking at the average male and female level of satisfaction. In this situation, I would simply report the descriptive statistics for each of these groups.

Conclusion The present handbook has aimed to demystify the basic use of quantitative data in educational assessment. It provides a useful starting point for stakeholders involved in educational assessment who wish to explore their own assessment data in more detail. It should therefore provide clarity on numerous conceptual issues and aid stakeholders’ ability to get the most out of assessment data. As implied throughout the text, this is an introduction, or initial framework, for apprehending the intricacy of research design and data analysis. Demystification, at least as a stated aim of this handbook, remains a lofty enterprise. If one theme persists, it is that all conclusions are at best tentative understandings that should be incessantly subjected to critical scrutiny. Not all questions can be answered with quantitative data; however, when conducting quantitative research, two points are vital to keep in mind: 1) What questions am I trying to answer? and 2) How do these data help me answer those questions? Relevant to the second question is the importance of considering how you are measuring study variables. It matters how you assign numbers to a variable, since some levels of measurement are more informative than others. If you have the choice, use an interval level of measurement rather than ordinal or nominal measures. With these points in mind, develop an intimate understanding of your data. Explore divergent possibilities, graph the data in different ways, and be open, yet skeptical, about what you believe the data are communicating!


Appendix: Sample Assessment Report This appendix provides a detailed example of how you may choose to write an assessment report. To see more examples of assessment reports, visit http://tinyurl.com/osureport. To remain consistent with this handbook, the critical thinking data will be used for this example. Additional variables were added to the dataset. You may modify this format to fit the specific objectives of your educational program or department.

A. Program Name

B. Assessment Coordinator

C. Mission Statement

Describe the mission statement of your specific educational department and program. If you have a formal mission statement then it would be appropriate to insert this mission statement at this point in the document. The mission statement should provide a succinct and clear description of the goals, objectives, and purpose of the specific program. Within the mission statement it is also appropriate to reference the population that is being served.

D. Student Learning Outcomes

In this section you should state each student learning outcome. For each student learning outcome you should be capable of gathering evidence that suggests that the goal is being achieved. For each student learning outcome you should report:

1. Assessment method used to gather evidence of student achievement
   a. The number of students that were assessed and the number that were not
   b. The process by which students were selected to participate
   c. The student product(s) that were evaluated (e.g., course project, oral presentation, performance assessment, etc.)
   d. The process for assessing student products and for summarizing results


2. Summary of assessment evidence / results from the assessment method
   a. Report aggregate scores and scores by rubric content category (if available)
   b. Relative to the student learning outcome, report students' strengths and weaknesses
3. Description of program faculty members' interpretation of the results
   a. Describe how results were shared and discussed
   b. Describe faculty members' response to the results. What do the results suggest about the curriculum, about teaching practices, and about student achievement of the learning outcome?
4. Program improvements implemented or being considered in response to the assessment evidence / results
   a. Describe actions taken resulting from discussion of the assessment evidence or issues needing additional study
   b. Describe actions that are being considered
   c. Describe additional assessment data that may need to be collected

E. Use of University Assessment Funds (if applicable)


Example

Program Assessment Studies, B.A., Anonymous University

Assessment Coordinator Dr. No-Name

Student Learning Outcome I.

Graduates from the School of Program Assessment Studies at Anonymous University will demonstrate the ability to critically evaluate assessment data.

Research Questions I.

1. What is the current level of critical thinking for selected papers?
2. Are there observed demographic differences in critical thinking scores?
3. What is the relationship between current critical thinking scores and variables collected during admission?

Assessment Method I.

Overview of Sampling Method I. This year a total of 20 student papers were randomly selected from course SPAS0001 in order to determine the extent to which this artifact demonstrates critical thinking. Course SPAS0001 was selected since it focuses upon research methodology and is a core requirement for all students within the School of Program Assessment Studies. This assignment required students to critique a published program evaluation conducted on a non-profit organization. Selected papers were rated by two faculty judges using the rubric given below. Each faculty rater assigned scores for each student independently.

Level of Critical Thinking (scores range from 1 to 5)

1 = The student fails to demonstrate critical thinking or displays a minimal level of critical thinking.
3 = The student demonstrates a moderate level of critical thinking.
5 = The student demonstrates an exceptional level of critical thinking.


Student papers that were assigned discrepant scores were reviewed in a subsequent meeting of the faculty raters. During this meeting faculty discussed their individual scores and came to a consensus regarding the critical thinking score of these papers. Within the sample, 20% of the papers (i.e., 4 papers) were initially assigned discrepant scores. High school GPA and a high school achievement test are used to admit students. Student papers were therefore matched to these variables, which were collected prior to student admission into the program.

Description of Participants I. Demographic information was also collected on the author of each sampled paper. This information was used to examine group differences, as well as the relationship of study variables to critical thinking scores. The sample data are provided below (Note: providing raw data may not be necessary for your report, but it is given here so that you may replicate the findings). Note that within our present sample 50% of the papers were written by females. Also, 50% of the papers were written by seniors and the other 50% were written by freshmen.

Raw Data for Study Variables

Critical Thinking   Gender   High School GPA   Current Class   High School Achievement Test
5.00                .00      3.50              1.00            65.00
4.00                .00      3.00              1.00            70.00
3.00                .00      4.00              1.00            80.00
3.00                .00      3.00              1.00            81.00
3.00                .00      3.20              .00             90.00
3.00                .00      2.80              .00             65.00
2.00                .00      2.70              .00             93.00
2.00                .00      3.00              1.00            73.00
2.00                .00      2.10              .00             76.00
1.00                .00      2.80              .00             75.00
5.00                1.00     3.20              1.00            68.00
5.00                1.00     3.20              .00             89.00
5.00                1.00     3.30              1.00            99.00
4.00                1.00     3.20              .00             87.00
4.00                1.00     2.30              1.00            99.00
4.00                1.00     2.90              1.00            90.00
3.00                1.00     2.30              1.00            88.00
3.00                1.00     2.50              .00             70.00
2.00                1.00     2.40              .00             65.00
1.00                1.00     2.00              .00             70.00

Note: For Gender, 1 = female and 0 = male; for Current Class, 1 = senior and 0 = freshman.

Results Method I.

Descriptive Statistics I. Within the entire sample the average cumulative high school GPA was 2.87 (standard deviation = .50), with an average achievement score of 79.7 (standard deviation = 11.45). All students within the School of Program Assessment Studies have an average high school GPA of 3.00. A one-sample t-test indicated that our sample mean of 2.87 was not significantly lower than this value; the difference did not exceed what may be expected from chance fluctuations (p = .259). All students within the School of Program Assessment Studies also had an average high school achievement score of 82, and the sample mean of 79.7 was not significantly lower than this value either (p = .370). Though tentative, this provides some support that the sample of selected papers is representative of the broader population in terms of high school GPA and achievement. Finally, the average critical thinking score for the entire sample of papers is 3.2 (standard deviation = 1.3). The distribution of critical thinking scores is provided in the histogram below. From this distribution we see that most papers (i.e., the mode) were rated by faculty as demonstrating a "moderate level of critical thinking" (quote comes from the rubric).
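The one-sample t statistic for GPA can be reproduced from the raw data table above. A minimal sketch in Python (standard library only; the p-value would still require a t table or software, and the reported p = .259 comes from SPSS):

```python
import math
from statistics import mean, stdev

# High school GPAs from the raw data table (n = 20).
gpas = [3.50, 3.00, 4.00, 3.00, 3.20, 2.80, 2.70, 3.00, 2.10, 2.80,
        3.20, 3.20, 3.30, 3.20, 2.30, 2.90, 2.30, 2.50, 2.40, 2.00]

mu = 3.00  # known population mean GPA
n = len(gpas)
t = (mean(gpas) - mu) / (stdev(gpas) / math.sqrt(n))
print(round(t, 2))   # about -1.16; SPSS reports p = .259 with df = 19
```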


Investigation of Gender Group Differences I. Group differences in critical thinking scores were investigated for both gender and grade classification. Results pertaining to the investigation of gender differences are presented first. Descriptive statistics for gender across all study variables are provided in the table below. From the table we see that females have higher average critical thinking and high school achievement scores than males. Males, however, have a higher average high school GPA than females.

Descriptive Statistics of Males and Females for Study Variables

Variable              Minimum   Maximum   Mean    Standard Deviation
Males
  HS GPA              2.10      4.00      3.01    0.50
  Critical Thinking   1.00      5.00      2.80    1.13
  HS Achievement      65.00     93.00     76.80   9.47
Females
  HS GPA              2.00      3.30      2.73    0.48
  Critical Thinking   1.00      5.00      3.60    1.34
  HS Achievement      65.00     99.00     82.50   13.00

Note: HS = high school.


Though our present aim is primarily to investigate whether males and females have, on average, different levels of critical thinking, a series of independent samples t-tests were conducted in order to investigate gender differences across critical thinking, high school GPA, and high school achievement. These results are presented in the table below.

Results of Independent Samples T-tests for Gender Differences among Study Variables

Variable            t-value   Significance (p-value)   Mean Difference   95% Confidence Interval around Mean Difference
Critical Thinking   -1.43     .169                     -0.80             -1.97 to .37
HS GPA              1.27      .219                     0.28              -0.18 to .74
HS Achievement      -1.12     .277                     -5.70             -16.38 to 4.98

Note: For each analysis the female mean was subtracted from the male mean (male minus female). HS = high school.

Interpretation of Gender Group Mean Differences I. From this table it is apparent that males and females in the sample are not significantly different in critical thinking scores, high school GPA, or high school achievement. In other words, observed differences in critical thinking, high school GPA, and high school achievement did not exceed chance expectations.
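The t-values in the table above can be reproduced from the raw data with a pooled-variance independent samples t-test (equal variances assumed). A sketch in Python (standard library only; the function name is illustrative):

```python
import math
from statistics import mean, stdev

def independent_t(group_a, group_b):
    """Pooled-variance independent samples t statistic."""
    n_a, n_b = len(group_a), len(group_b)
    ss_a = (n_a - 1) * stdev(group_a) ** 2
    ss_b = (n_b - 1) * stdev(group_b) ** 2
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se

# Critical thinking scores from the raw data table.
males = [5, 4, 3, 3, 3, 3, 2, 2, 2, 1]
females = [5, 5, 5, 4, 4, 4, 3, 3, 2, 1]
t = independent_t(males, females)
print(round(t, 2))   # -1.43, matching the table
```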

Investigation of Grade Classification Group Mean Differences I. Group differences were then investigated for grade classification. Descriptive statistics for the study variables across grade classification are provided in the table below.


Descriptive Statistics of Freshmen and Seniors for Study Variables

Variable              Minimum   Maximum   Mean    Standard Deviation
Freshmen
  HS GPA              2.00      3.20      2.69    0.44
  Critical Thinking   1.00      5.00      2.60    1.26
  HS Achievement      65.00     93.00     78.00   10.80
Seniors
  HS GPA              2.30      4.00      3.05    0.51
  Critical Thinking   2.00      5.00      3.80    1.34
  HS Achievement      65.00     99.00     81.30   12.41

Note: HS = high school.

From the table above we can see that the average high school GPA, critical thinking score, and high school achievement score for seniors are greater than those of freshmen. A series of independent samples t-tests were conducted in order to investigate whether such differences exceed what may be expected by chance. These results are presented in the table below.


Results of Independent Samples T-tests for Classification Differences among Study Variables

Variable            t-value   Significance (p-value)   Mean Difference   95% Confidence Interval around Mean Difference
Critical Thinking   -2.32     .032                     -1.20             -2.28 to -.37
HS GPA              -1.69     .109                     -0.36             -.81 to .08
HS Achievement      -0.63     .534                     -3.30             -14.23 to 7.63

Note: For each analysis the senior mean was subtracted from the freshman mean (freshman minus senior). HS = high school.

Interpretation of Classification Group Mean Differences I. From the table above we can see that freshman and senior average high school GPAs and high school achievement scores were not significantly different. Freshmen had an average critical thinking score 1.2 points lower than seniors, and this difference was statistically significant in an independent samples t-test (p = .032). When constructing a 95% confidence interval around this mean difference, we have a 95% chance of constructing an interval that contains the true difference in the respective population. This interval was calculated as -2.28 to -.37. Since the interval fails to contain 0, it supports the conclusion of the independent samples t-test.

This difference was investigated further by examining effect size, as measured by Cohen's d. This value is obtained with the formula below.

d = 2t / √df

For our independent samples t-test we have degrees of freedom df = 18. Taking the absolute value of the t-value from our table, Cohen's d therefore becomes:

d = 2(2.32) / √18 = 1.09

This statistic allows us to infer that a little more than 84% of freshmen have a critical thinking score that is lower than that of the average senior (see page 63 of the handbook). This may therefore be characterized as a large effect.
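The same computation can be sketched directly. The percentage interpretation below uses a normal approximation (often called Cohen's U3, computed from the standard normal CDF), so the exact value it produces may differ slightly from a rounded table entry.

```python
# Cohen's d from the t-value and degrees of freedom, plus the share of the
# lower-scoring group expected to fall below the higher group's mean
# (Cohen's U3), via the standard normal CDF expressed with math.erf.
import math

t_value, df = 2.32, 18          # absolute t-value and df from the table
d = 2 * t_value / math.sqrt(df)
u3 = 0.5 * (1 + math.erf(d / math.sqrt(2)))  # normal CDF evaluated at d
print(f"d = {d:.2f}, U3 = {u3:.1%}")
```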

Investigation of High School GPA, High School Achievement, and Critical Thinking I

The investigation of high school GPA, high school achievement, and critical thinking will be conducted in two steps. First, the scatterplots of the admission variables (i.e., high school GPA and the high school achievement test) against critical thinking scores will be examined. This will be followed by an examination of correlations among critical thinking, high school GPA, and high school achievement.

Correlation Analysis I

Pearson product-moment correlation coefficients were calculated among all of the variables. Before conducting this analysis, however, scatterplots were constructed in order to examine whether the data appear to be linear and to identify potential outliers. The scatterplots depicting high school GPA and critical thinking, as well as high school achievement and critical thinking, are displayed below. For convenience, cases that may not fit the pattern are circled.


From the scatterplots depicted above, it appears that critical thinking and GPA are more strongly related than critical thinking and achievement. One individual has a high school GPA of 4.0 and a critical thinking score of 3.0; this individual does not appear to fit the pattern of the other individuals. Two individuals appear to deviate slightly in the achievement and critical thinking scatterplot. These individuals have relatively low achievement scores, yet high critical thinking scores. Further investigation of these outliers may be warranted.
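One rough, purely illustrative way to follow up on visually circled cases is to fit a least-squares line and flag points with unusually large residuals. The data values and the 1.5-standard-deviation threshold below are arbitrary choices made for this demonstration; they are not part of the original analysis, and visual inspection of the scatterplot remains the primary check.

```python
# Flag points that deviate from the overall linear pattern by fitting a
# least-squares line and listing cases with large residuals.
# NOTE: hypothetical data; the last point is constructed to deviate.
import statistics

hs_gpa = [2.6, 2.9, 3.1, 3.3, 3.5, 3.7, 3.9, 4.0]
crit   = [2.4, 2.7, 2.9, 3.2, 3.3, 3.6, 3.8, 3.0]

mx, my = statistics.mean(hs_gpa), statistics.mean(crit)
slope = (sum((x - mx) * (y - my) for x, y in zip(hs_gpa, crit))
         / sum((x - mx) ** 2 for x in hs_gpa))
intercept = my - slope * mx

residuals = [y - (intercept + slope * x) for x, y in zip(hs_gpa, crit)]
sd = statistics.stdev(residuals)

# Threshold of 1.5 residual standard deviations: an arbitrary screening
# cutoff for illustration, not a formal outlier test.
flagged = [i for i, r in enumerate(residuals) if abs(r) > 1.5 * sd]
print("Cases to inspect:", flagged)
```

Any case flagged this way should be inspected, not automatically deleted, since outliers may exist for legitimate reasons.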

Pearson correlations were calculated on all study variables. These results indicated that GPA was positively related to critical thinking (correlation = .54, p = .015), yet the high school achievement test was not significantly related to critical thinking (correlation = .26, p = .269). Squaring the correlation coefficient between critical thinking and high school GPA indicates that approximately 29% of the variance in critical thinking is shared with high school GPA. Interestingly, high school GPA was not significantly associated with the high school achievement test (correlation = .073, p = .761). [Note: the correlation among these variables would increase if we deleted the circled cases. Extreme caution should be exercised before doing so, however, because outliers may exist for legitimate reasons. If you delete such cases, you should always tell the reader that this has occurred and provide sufficient justification to support your decision.]
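A Pearson correlation can be computed directly from its definition, as in the sketch below. The GPA and critical thinking values are hypothetical, invented for illustration only.

```python
# A minimal sketch of the Pearson product-moment correlation and the
# shared-variance (r squared) interpretation used in the text.
# NOTE: hypothetical data, not the values analyzed in this report.
import math
import statistics

hs_gpa        = [3.1, 3.5, 2.8, 3.9, 3.2, 2.5, 3.7, 3.0, 3.4, 2.9]
crit_thinking = [2.8, 3.4, 2.5, 4.1, 3.0, 2.2, 3.6, 3.1, 3.3, 2.7]

def pearson_r(x, y):
    """Pearson r: covariance scaled by the product of the spreads."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

r = pearson_r(hs_gpa, crit_thinking)
print(f"r = {r:.2f}, shared variance = {r ** 2:.1%}")
```

Squaring r yields the proportion of variance in one variable shared with the other, which is how the 29% figure above was obtained from a correlation of .54.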

Substantive Interpretation of Results I

Overall results indicated that the average critical thinking score was 3.1 (standard deviation = 1.3), and the distribution of critical thinking scores among the selected papers indicated that most students demonstrated a “moderate level of critical thinking.” Seniors on average had higher critical thinking scores than freshmen, with results indicating that approximately 84% of freshmen had lower critical thinking scores than the average senior. Given that the selected seniors did not significantly differ from freshmen in high school GPA or high school achievement, it is unlikely that the observed differences in critical thinking are attributable to differences in admission procedures pertaining to these variables. Of the admission criteria selected for this study, high school GPA was the only variable that was significantly related to current critical thinking. The high school achievement test was not significantly related to critical thinking among these student papers. This may lead to doubts about the utility of high school achievement as an indicator of subsequent performance in this program.

It is necessary to note some strong limitations of the current findings. First, it is difficult to determine the extent to which our sample of student papers is representative of all students in the program. Though the sample did not differ from the population in terms of high school GPA or the high school achievement test, it is possible that the sample differs in important ways that were not measured by the current study. The sample of student papers was also selected not only from one course, but from one assignment within that course. This procedure therefore assumes that a single paper adequately captures a student’s critical thinking. Though both the assignment and the course were relevant to this study, it is possible that this single assignment failed to fully represent each student’s critical thinking ability. The same argument may be applied to sampling a single course. Broader sampling strategies should be considered in subsequent research.

It is important to note that the differences in freshman and senior critical thinking scores provide preliminary evidence that critical thinking scores may increase as students proceed through the program. This study is only suggestive that such changes occur. It is possible that current freshmen and seniors differ in important ways that have affected the results. For example, seniors may have taken previous courses that assisted with this particular assignment; courses that the sampled freshmen have not yet taken. To control for such possibilities it is necessary to track the same students across time. Finally, the observed correlations in this study suggest that high school GPA may be a better indicator of subsequent critical thinking than the high school achievement test. The small sample size of the current study poses limitations when drawing such conclusions. Future research on larger samples should be undertaken to fully investigate the usefulness of these two variables as admission criteria.

Recommendations I

1. Faculty and administrators within the School of Program Assessment Studies should construct a broader sampling strategy to assess critical thinking. This strategy may include sampling student papers from a wider range of courses and/or collecting multiple assignments from individual students.


2. To assess potential changes in critical thinking across time, it is recommended that samples of freshman work be taken on an annual basis.

3. Until changes in critical thinking within a single student cohort can be measured across time, it is recommended that cross-sectional comparisons (i.e., comparing freshmen to other grade classifications) continue to be conducted.

4. Faculty and administrators should discuss the level of critical thinking students are expected to reach by the time they graduate from the program. This would provide a benchmark for evaluating critical thinking scores.

5. The critical thinking rubric should have clear definitions for “minimal,” “moderate,” and “exceptional” levels of critical thinking.

This procedure, or something similar, should be conducted for each measured outcome. Modifications may be necessary, however, given the specific outcomes under investigation and the variables you have measured.