
Introduction to Measurement

Goals of Workshop

• Reviewing assessment concepts
• Reviewing instruments used in the norming process

• Getting an overview of the secondary and elementary normative samples

• Learning how to use the manuals in interpreting students’ scores.

ASSESSMENT

• The process of collecting data for the purpose of making decisions about students

• It’s a process and typically involves multiple sources and methods.

• Assessment is in service of a goal or purpose.

• The data we collect will be used to support some type of decision (e.g., monitoring, intervention, placement)

Major Types of Assessment in Schools

• More frequently used:
– Achievement: How well is the child doing in the curriculum?

– Aptitude: what is this child’s intellectual and other capabilities?

– Behavior: Is the child’s behavior affecting learning?

• Less frequently used:
– Teacher competence: Is the teacher actually imparting knowledge?

– Classroom environment: Are classroom conditions conducive to learning?

– Other concerns: home, community,...

Types of Tests

• Norm-referenced
– Comparison of performance to a specified population/set of individuals

• Individually-referenced
– Comparisons to self

• Criterion-referenced
– Comparison of performance to mastery of a content area; what does the student know?

• The data in the manual will allow you to look at norms and at individual growth.

MAJOR CONCEPTS

• Nomothetic and Idiographic
• Samples
• Norms
• Standardized Administration
• Reliability
• Validity

Nomothetic

• Relating to the abstract, the universal, the general.

• Nomothetic assessment focuses on the group as a unit.

• Refers to finding principles that are applicable on a broad level.

• For example, boys report higher math self-concepts than girls; girls report more depressive symptoms than boys.

Idiographic

• Relating to the concrete, the individual, the unique

• Idiographic assessment focuses on the individual student

• What type of phonemic awareness skills does Joe possess?

Populations and Samples I

• A population consists of all the representatives of a particular domain that you are interested in

• The domain could be people, behavior, curriculum (e.g., reading, math, spelling, ...)

Populations and Samples II

• A sample is a subgroup that you actually draw from the population of interest

• Ideally, you want your sample to represent your population
– People polled or examined, test content, manifestations of behavior

Random Samples

• A sample in which each member of the population had an equal and independent chance of being selected.

• Random samples are important because the idea is to have a sample that represents the population fairly; an unbiased sample.

• A sample can be used to represent the population.
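To make the idea concrete, here is a minimal Python sketch (the population of 1,000 student IDs is made up for illustration) of drawing a simple random sample in which every member has an equal, independent chance of selection:

    import random

    population = list(range(1, 1001))       # hypothetical population of 1,000 student IDs
    random.seed(42)                         # fixed seed so the example is reproducible
    sample = random.sample(population, 50)  # 50 members drawn without replacement
    print(sample[:10])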

Probability Samples I

• Sampling in which elements are drawn according to some known probability structure.

• Random samples are subcases of probability samples.

• Probability samples are typically used in conjunction with subgroups (e.g., ethnicity, socioeconomic status, gender).

Probability Samples II

• Probability samples using subgroups are also referred to as stratified samples.

• Standardization samples are typically probability or stratified samples.

• Standardization samples need to represent the population because the sample’s results will be used to create norms against which all members of the population will be compared.
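As an illustrative sketch (the subgroup counts are invented), a stratified sample can be drawn by allocating the sample to each subgroup in proportion to its share of the population:

    import random

    random.seed(1)
    strata = {"urban": 6000, "suburban": 3000, "rural": 1000}  # hypothetical subgroup sizes
    total = sum(strata.values())
    sample_size = 500

    for group, count in strata.items():
        n = round(sample_size * count / total)                 # proportional allocation
        members = [f"{group}_{i}" for i in range(count)]
        print(group, n, random.sample(members, n)[:3])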

Norms I

• Norms are examples of how the “average” individual performs.

• Many of the tests and rating scales that are used to compare children in the US are norm-referenced.
– An individual child’s performance is compared to the norms established using a representative sample.

Norms II

• For the score on a normed instrument to be valid, the person being assessed must belong to the population for which the test was normed

• If we wish to apply the test to another group of people, we need to establish norms for the new group

Norms III

• To create new norms, we need to do a number of things:
– Get a representative sample of the new population

– Administer the instrument to the sample in a standardized fashion.

– Examine the reliability and validity of the instrument with that new sample

– Determine how we are going to report on scores and create the appropriate tables
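The last two steps can be sketched in Python with made-up norming data: raw scores from the standardization sample are turned into a percentile-rank lookup table that later examinees can be compared against:

    raw_scores = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]   # hypothetical norming sample

    def percentile_rank(score, norm_scores):
        # percent of the norming sample scoring at or below this score
        at_or_below = sum(1 for s in norm_scores if s <= score)
        return 100.0 * at_or_below / len(norm_scores)

    norms_table = {s: percentile_rank(s, raw_scores) for s in sorted(set(raw_scores))}
    print(norms_table)   # e.g., a raw score of 22 falls at the 80th percentile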

Standardized Administration

• All measurement has error.
• Standardized administration is one way to reduce error due to examiner/clinician effects.

• For example, consider these questions with different facial expressions and tone:

• Please define a noun for me :-)
• DEFINE a noun if you can? :-(

Distributions

• Any group of scores can be arranged in a distribution from lowest to highest

• 10, 3, 31, 100, 17, 4

• 3, 4, 10, 17, 31, 100

Normal Curve

• Many distributions of human traits form a normal curve

• Most cases cluster near middle, with fewer individuals at extremes; symmetrical

• We know how the population is distributed based on the normal curve
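Because the normal curve’s shape is known, the proportion of cases falling within a given number of standard deviations can be computed; a quick sketch with SciPy’s standard normal distribution (not tied to any particular instrument):

    from scipy.stats import norm

    for k in (1, 2, 3):
        area = norm.cdf(k) - norm.cdf(-k)        # proportion of cases within +/- k SDs
        print(f"within +/- {k} SD: {area:.2%}")  # about 68.3%, 95.4%, and 99.7%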

Ways of reporting scores

• Mean, standard deviation
• Distribution of scores
– 68.26% within ±1 SD; 95.44% within ±2 SD; 99.72% within ±3 SD
• Stanines (1, 2, 3, 4, 5, 6, 7, 8, 9)

• Standard scores - linear transformations of scores, but easier to interpret

• Percentile ranks
• Box and whisker plots
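As a rough sketch with made-up raw scores, some of these transformations (z-scores, a standard score on a common mean-100/SD-15 scale, and an approximate stanine) can be computed like this:

    import statistics

    raw = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]    # hypothetical raw scores
    mean, sd = statistics.mean(raw), statistics.stdev(raw)

    for x in (12, 20, 30):
        z = (x - mean) / sd                           # linear transformation of the raw score
        standard = 100 + 15 * z                       # deviation-style standard score
        stanine = min(9, max(1, round(2 * z + 5)))    # approximate stanine: mean 5, SD 2, range 1-9
        print(x, round(z, 2), round(standard, 1), stanine)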

Percentiles

• A way of reporting where a person falls on a distribution.

• The percentile rank of a score tells you what percentage of people obtained a score equal to or lower than that score.

• So if we have a score at the 23rd %tile and another at the 69th %tile, which score is higher?
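SciPy can report a percentile rank directly; a small sketch with invented scores (the "weak" option counts scores at or below the target):

    from scipy import stats

    scores = [3, 4, 10, 17, 31, 100, 12, 8, 55, 41]          # hypothetical distribution
    print(stats.percentileofscore(scores, 10, kind="weak"))  # percent scoring <= 10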

Percentiles 2

• Is a high percentile always better than a low percentile?

• It depends on what you are measuring.

• For example…
• Box and whisker plots are visual displays or graphic representations of the shape of a distribution using percentiles.

The box plot is a picture of the distribution of scores on a measure.

Explanation of the Box Plot

[Box plot of scores for Grade 2 students (scale 0–20), showing individual outliers and the 90th, 75th, 50th, 25th, and 10th percentiles of performance.]
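The percentile points a box-and-whisker plot is built from can be computed directly; a sketch with hypothetical grade 2 scores:

    import numpy as np

    scores = np.array([2, 4, 5, 6, 7, 8, 8, 9, 11, 12, 14, 18])  # made-up grade 2 scores
    for p in (10, 25, 50, 75, 90):
        print(f"{p}th percentile: {np.percentile(scores, p):.1f}")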

Correlation

• We need to understand the correlation coefficient to understand the manual

• The correlation coefficient, r, quantifies the relationship between two sets of scores.

• A correlation coefficient can range from -1 to +1.
– Zero means the two sets of scores are not related.

– One means the two sets of scores are identical (a perfect correlation)
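A sketch of computing r with NumPy on two made-up sets of scores:

    import numpy as np

    reading = np.array([3, 5, 7, 9, 11, 13])        # hypothetical reading scores
    math_scores = np.array([4, 4, 8, 7, 12, 11])    # hypothetical math scores
    r = np.corrcoef(reading, math_scores)[0, 1]     # off-diagonal entry of the 2x2 matrix
    print(round(r, 2))                              # positive r: the two sets rise together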

Correlation 2

• Correlations can be positive or negative.
• A positive correlation tells us that as one set of scores increases, the second set of scores also increases. Can you think of examples?

• A negative correlation tells us that as one set of scores increases, the other set decreases. Think of some examples of variables with negative r’s.

• The absolute value of a correlation indicates the strength of the relationship. Thus .55 is equal in strength to -.55.

How would you describe the correlations shown by these charts?

[Three charts, each plotting a single series (Series1) across a handful of points: one with scattered values, one steadily decreasing from 10 to 5, and one constant at 1.2.]

Correlation 4

• .25, .70, -.40, .55, -.87, .58, .05
• Order these from strongest to weakest
• -.87, .70, .58, .55, -.40, .25, .05
• We will meet 3 different types of correlation coefficients today:
• Reliability coefficients - Definitions?
• Validity coefficients
• Pattern coefficients
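Ordering by strength means ordering by absolute value (the sign only shows direction); a one-line check of the list above:

    rs = [.25, .70, -.40, .55, -.87, .58, .05]
    print(sorted(rs, key=abs, reverse=True))   # [-0.87, 0.7, 0.58, 0.55, -0.4, 0.25, 0.05]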

Reliability

• Reliability addresses the stability, consistency, or reproducibility of scores.
– Internal consistency (split half, Cronbach’s alpha)
– Test-retest
– Parallel forms
– Inter-rater

Reliability 2

• Internal consistency
– How do the items on a scale relate to one another? Are respondents responding to them in the same way?

• Test-retest
– How do respondents’ scores at Time 1 relate to their scores at Time 2?
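As a minimal illustration of internal consistency, Cronbach’s alpha can be computed from a small made-up item-response matrix (rows are respondents, columns are items):

    import numpy as np

    responses = np.array([   # hypothetical ratings on a 1-5 scale
        [3, 4, 3, 4],
        [2, 2, 3, 2],
        [4, 5, 4, 5],
        [3, 3, 4, 3],
        [5, 4, 5, 5],
    ])
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)        # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)    # variance of respondents' total scores
    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    print(round(float(alpha), 2))                    # close to 1 = items hang together well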

Reliability 3

• Parallel forms
– Begin by creating at least two versions of the exam. How does respondents’ performance on one version compare to their performance on another version?

• Inter-rater
– Connected to ratings of behavior. How do one rater’s scores compare to another’s?

Validity

• Validity addresses the accuracy or truthfulness of scores. Are they measuring what we want them to?
– Content
– Criterion - Concurrent
– Criterion - Predictive
– Construct
– Face

Content Validity

• Is the assessment tool representative of the domain (behavior, curriculum) being measured?

• An assessment tool is scrutinized for its (a) completeness or representativeness, (b) appropriateness, (c) format, and (d) bias
– E.g., MSPAS

Criterion-related Validity

• What is the correlation between our instrument, scale, or test and another variable that measures the same thing, or something very close to it?

• In concurrent validity, we compare scores on the instrument we are validating to scores on another variable that are obtained at the same time.

• In predictive validity, we compare scores on the instrument we are validating to scores on another variable that are obtained at some future time.

Structural Validity

• Used when an instrument has multiple scales.
• Asks the question, “Which items go together best?”

• For example, how would you group these items from the Self-Description Questionnaire?

• 3. I am hopeless in English classes.
• 5. Overall, I am no good.
• 7. I look forward to mathematics class.
• 15. I feel that my life is not very useful.
• 24. I get good marks in English.
• 28. I hate mathematics.

Structural Validity 2

• We expect the English items (3, 24), Math items (7, 28) and global items (5, 15) to group together.

• The items that group together make up a new composite variable we call a factor.

• We want each item to correlate highly with the factor it clusters on, and less well with other factors.

• Typically, we accept item-factor coefficients of about .30 or higher.

What can we say about the structural validity of the SDQ given these scores?

Item # Verbal Math Global

3 .587 -.044 .624

5 -.016 .024 .561

7 .086 .630 -.059

23 .019 -.015 .625

24 .754 -.006 -.024

28 -.020 .750 .042
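As a rough sketch using the coefficients from the table above, each item can be assigned to the factor it loads on most strongly, keeping only loadings of about .30 or higher:

    loadings = {   # item: (Verbal, Math, Global) pattern coefficients from the table
        3:  (.587, -.044, .624),
        5:  (-.016, .024, .561),
        7:  (.086, .630, -.059),
        23: (.019, -.015, .625),
        24: (.754, -.006, -.024),
        28: (-.020, .750, .042),
    }
    factors = ("Verbal", "Math", "Global")
    for item, row in loadings.items():
        best = max(range(3), key=lambda i: abs(row[i]))        # factor with the largest loading
        flag = "retained" if abs(row[best]) >= .30 else "below .30"
        print(item, factors[best], row[best], flag)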

Construct Validity

• Overarching construct: Is the instrument measuring what it is supposed to?
– Dependent on reliability, content, and criterion-related validity.

• We also look at some other types of validity evidence sometimes:
– Convergent validity: r with similar construct

– Discriminant validity: r with unrelated construct

– Structural validity: What is the structure of the scores on this instrument?

Statistical Significance

• When we examine group differences in science, we want to make objective rather than subjective decisions.

• We use statistics to tell us whether the difference we are observing is likely to have occurred by chance.

• In psychology, we typically set our alpha or error rate at 5% (i.e., .05): if a difference would occur by chance less than 5% of the time, we conclude that the difference is statistically significant.
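A sketch of such a test with SciPy’s independent-samples t-test on made-up group scores:

    from scipy import stats

    group_a = [12, 15, 14, 10, 13, 16, 12, 14]   # hypothetical scores for group A
    group_b = [10, 11, 12, 9, 10, 13, 11, 10]    # hypothetical scores for group B
    t, p = stats.ttest_ind(group_a, group_b)
    print(round(t, 2), round(p, 4), "significant at .05" if p < .05 else "not significant")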

Statistical Significance 2

• Our statistical test tells us whether our difference is statistically significant (i.e., p < .05).

• Statistical significance is affected by a number of variables, including sample size. The larger the sample, the easier it is to achieve statistical significance.

• We also look at the magnitude of the difference (or effect size).

• A difference may be statistically significant, but have a small effect size.

• .10 to .30 = small effect; .40 to .60 = medium effect; > .60 = large effect.
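One common effect-size index, Cohen’s d (the mean difference in pooled-SD units), can be sketched for the same kind of comparison; the group scores are invented:

    import statistics

    group_a = [12, 15, 14, 10, 13, 16, 12, 14]
    group_b = [10, 11, 12, 9, 10, 13, 11, 10]
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    d = (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd
    print(round(d, 2))   # magnitude of the difference, independent of sample size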
