THEORY OF MEASUREMENT: Everything You Wanted To Know About Classroom Assessment But Were Afraid To Ask

Alexander Beaujean
William Shiu
Baylor Psychometric Laboratory
http://homepages.baylor.edu/psychometric_lab

2008 Teaching Colloquy, Department of Religion

TABLE OF CONTENTS
Definitions
Test Design
Test Score Properties: Reliability and Validity
Cognitive Processes
Some Item Types
Developing the Test
Take Home Message

Definitions



DEFINITIONS
Test (noun)

Etymology: Middle English, vessel in which metals were assayed [analyzed], potsherd, from Anglo-French test, tees pot, Latin testum earthen vessel; akin to Latin testa earthen pot, shell

Definition: (1) a procedure, reaction, or reagent used to identify or characterize a substance or constituent; (2) something (as a series of questions or exercises) for measuring the skill, knowledge, intelligence, capacities, or aptitudes of an individual or group

(Test. (2008). In Merriam-Webster Online Dictionary. Retrieved September 26, 2008, from http://www.merriam-webster.com/dictionary/test)


DEFINITIONS
(Achievement) Test:

A collection of items or tasks used to measure an underlying construct of interest, the results (i.e., test scores) of which allow for decisions based on the construct's level


DEFINITIONS
Item:

Genesis is the first book of the Bible. T/F

Item Stem: "Genesis is the first book of the Bible."
Item Response: "T/F"


DEFINITIONS
Construct:

A measure of some trait/attribute/quality that is not “operationally defined.”
A latent entity whose level and relationship with other objects (either latent or manifest) can only be inferred

Latent: Extant, but not perceivable by bodily senses

Cronbach & Meehl (1955)

Test Design



TEST DESIGN
Test Philosophy:

What will/will not your test measure?
About what construct are you hoping to make inferences?

What is required for your test to measure that construct?


TEST DESIGN

[Diagram labels: Person Ability/Trait (Construct); Cognitive Process(es); Item Response; Context]


TEST DESIGN
Test Purpose:

What information do you want to obtain from this test?

...and...

What decision(s) do you need to make from this information?


TEST DESIGN
Examinee Population:

For whom is this test intended?

[Figure: two examinee groups separated by ≠]


TEST DESIGN
Constraints

Time to take test
Platform
  Paper vs. Computer
Location
  Security/standardization
Administration
  Entire Group vs. Subgroups vs. Individual

Test Score Properties: Reliability and Validity



RELIABILITY
Reliability

Do the test scores measure their construct consistently?

Contributors to inconsistency
  Randomness (varies from examinee to examinee)
  Systematic (consistent for all examinees)

Effects can be innocuous or severe, depending on the purpose of the test


RELIABILITY
Estimation

0 < reliability < 1
  Published tests: .80-.95
  Classroom tests: about .50

Methods
  Correlation between 2 administrations (of the same test)
  Correlation among test items
    Internal Consistency (α)

See Frisbie (1988)
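As a rough illustration of the internal consistency (α) estimate mentioned above, the sketch below computes coefficient alpha from a small matrix of item scores using the standard formula α = k/(k−1) × (1 − sum of item variances / variance of total scores). The function name `cronbach_alpha` and the score matrix are hypothetical, made up only for this example.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Coefficient alpha for a list of examinee rows, each a list of item scores."""
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item (column) across examinees
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the examinees' total scores
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have no variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 examinees x 4 items, each scored 0 (wrong) or 1 (right)
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 2))
```

With real classroom data the estimate will usually be noisier than in this tidy example, since it depends on the number of items and on how varied the examinees are.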


RELIABILITY
Influences on Reliability Estimates

Length
Dimensionality
  How many constructs is the test measuring?
Item Difficulty
Item Discrimination
  How likely is a response in examinees “high” on the construct vs. examinees “low” on the construct?
Heterogeneity of the examinees
Student Factors (motivation, “testwiseness”)
Time Allotment
Security
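Test length is the first influence listed above. One standard way to quantify it, offered here only as a supplementary illustration since the slides do not name it, is the Spearman-Brown formula, which predicts the reliability of a test lengthened by a factor n with comparable items. The function names and the example values (.50 and .80, borrowed from the typical classroom and published figures) are assumptions for the sketch.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`
    using items comparable to the existing ones."""
    r, n = reliability, length_factor
    return n * r / (1 + (n - 1) * r)

def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach `target` reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))

# Doubling a test whose scores have reliability .50
doubled = spearman_brown(0.50, 2)
# Lengthening needed to move from .50 to .80
factor = length_factor_needed(0.50, 0.80)
print(round(doubled, 2), round(factor, 1))
```

The prediction assumes the added items behave like the existing ones; adding poorly written items will not deliver the predicted gain.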


VALIDITY
Validity

Are the test scores measuring the intended construct?
An argument, for which you need multiple strands of evidence, e.g.:
  Do they appear to measure their intended construct?
  Do experts think they are measuring their intended construct?
  Do they have relationships with other measures that...
    Measure the same things
    Measure different things
  Do they predict outcomes of interest?
  Do the test’s items have a basis in the curriculum?

See AERA/APA/NCME (1999)

Cognitive Processes



ASSESSMENT PROCESS
Good Classroom Assessments Flow From the Class’s Instructional Objectives/Learning Outcomes...
...And Allow Inferences About the Construct of Interest


ASSESSMENT PROCESS

“Learning Objectives” → Cognitive Processes → Item Responses → Test Scores → Construct Inference


BLOOM’S TAXONOMY

KNOWLEDGE
COMPREHENSION
APPLICATION
ANALYSIS
SYNTHESIS
EVALUATION

[Axis label: Development/Difficulty]

Bloom (1956)


BLOOM’S TAXONOMY
Level 1-Knowledge

Recall information
Some item stems: recall, recite, list, label, define, identify, quote, who, what, when, where, tell, describe, relate, locate, write, find, state, name

Examples:
Define consubstantiation.
Who was Constantine?
When were the first Crusades?
List the five points of Calvinism.


BLOOM’S TAXONOMY
Level 2-Comprehension

Understand information
Some item stems: demonstrate, explain, describe, interpret, summarize, cause-effect, outline, discuss, distinguish, restate, translate

Example: Why did Paul write to the church at Philippi?
(a) To address the issue of rivals and uphold his apostleship
(b) To preserve the view of justification by faith
(c) To emphasize that under salvation by Christ, Jews and Gentiles are brought together


BLOOM’S TAXONOMY
Level 3-Application

Use information
Some item stems: demonstrate, apply, calculate, illustrate, show, construct, interview, solve, use, complete, examine, classify

Example: Translate the following into English:
Αειδε Θεά ούλομένην μήνιν Αχιλήος


BLOOM’S TAXONOMY
Level 4-Analysis

Examine/break apart information
Some item stems: explain, connect, classify, categorize, compare, analyze, distinguish, examine, contrast, investigate, separate

Examples:
Compare Plato’s Republic with Lenin’s April Theses.
Which of the following names of God is most different from the other three?
(a) JEHOVAH (b) ELOHIM (c) KURIOS (d) DESPOTES


BLOOM’S TAXONOMY
Level 5-Synthesis

Create with information
Some item stems: combine, integrate, modify, hypothesize, abstract, create, design, invent, compose, predict, plan, imagine, propose, devise, formulate, conjecture

Example:
Conjecture about Stephen’s response to Paul (né Saul), were they to have met after Paul’s Roman imprisonment.


BLOOM’S TAXONOMY
Level 6-Evaluation

Combine previous information skills to make a judgment
Some item stems: judge, select, choose, decide, justify, debate, verify, argue, recommend, assess, discuss, rate, prioritize, determine

Examples:
Appraise Calvin’s Institutes in light of Oberman’s The Dawn of the Reformation.
Who deserves precedence as the earliest Baptist church in North America: Roger Williams’ Providence church or John Clarke’s Newport church? Support your answer with scholarly sources.

Some Item Types




ITEM TYPE #1 TRUE-FALSE

Example: Augustine wrote The Confessions. T/F?

Pros:
Convenient to write
Easy to score
Allows flexibility in content coverage

Cons:
Limited in cognitive processes covered
Guessing
Student response sets


ITEM TYPE #1 TRUE-FALSE

Best Practice:
Make the statements as short and specific as possible
One idea per statement
Avoid trivial information
Use positive statements instead of negative, and always avoid double negative statements
Do not use opinion statements unless they are attributed to someone
Length should not differ between true and false statements
Approximately equal number of true and false statements


ITEM TYPE #2 MULTIPLE CHOICE RESPONSE

Example: Who is famous for his 95 Theses?
(a) Pope Leo X; (b) Martin Luther; (c) Johann Eck

Pros:
“Best Answer” is more flexible than unequivocal true/false
Allows different cognitive processes in item response
Guessing less of a factor than T/F
Easy to score

Cons:
Large amount of time to write good distracters (wrong response alternatives)
Guessing is possible
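To put a number on "guessing less of a factor than T/F", the sketch below compares blind guessing on true-false items (2 options) with the 3-option multiple choice format used in the example above. The function names, the 10-item test length, and the cutoff of 7 correct are hypothetical choices for illustration; the calculation is the ordinary binomial probability and assumes every option is equally attractive to a guesser.

```python
from math import comb

def expected_chance_score(n_items, n_options):
    """Expected number correct when every item is answered by blind guessing."""
    return n_items / n_options

def prob_at_least(n_items, n_correct, n_options):
    """Probability of getting at least `n_correct` items right by blind guessing."""
    p = 1 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(n_correct, n_items + 1)
    )

# Hypothetical 10-item quiz, 7 or more correct by guessing alone
p_tf = prob_at_least(10, 7, 2)  # true-false
p_mc = prob_at_least(10, 7, 3)  # 3-option multiple choice
print(expected_chance_score(10, 2), round(p_tf, 3), round(p_mc, 3))
```

Plausible distracters matter here: an option nobody would choose effectively reduces the number of options and pushes the multiple choice figure back toward the true-false one.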


ITEM TYPE #2 MULTIPLE CHOICE RESPONSE

Best Practice:
Item stems should: (a) have autonomous meaning, (b) present as much of the item as possible, and (c) have no irrelevant material
Avoid negative item stems
All item responses should be grammatically compatible with their stem and of approximately equal length
There should be only one correct/best answer
Distracters should be plausible
Avoid “clues” in item stem
Avoid “none/all of the above” response options


ITEM TYPE #3 MATCHING

Example: Match the philosopher with their work:

A. Plato
B. Aristotle
C. Socrates
D. Euclid
E. Zeno

_A_ 1. The Socratic Dialogues
_C_ 2. None
_B_ 3. Organon
_D_ 4. The Elements
_E_ 5. Reminiscences of Crates

Pros:
Can cover much material in content domain
Easy to administer

Cons:
Limited in cognitive processes covered
Difficult to find homogeneous material
Difficult to develop good, plausible set of responses


ITEM TYPE #3 MATCHING

Best Practice:
Use homogeneous material
Have unequal numbers of stems and responses
Place responses in numerical or alphabetical order
Explicitly state the basis for finding a match
Place all items/responses on the same page


ITEM TYPE #4 FILL IN THE BLANK

Example: Martin Buber edited the _______, a Zionist periodical. (Die Welt)

Pros:
Very, very minimal guessing
Easy to construct item stems

Cons:
Must score by hand, and possibility of multiple correct responses
Assesses only factual knowledge


ITEM TYPE #4 FILL IN THE BLANK

Best Practice:
Make the item require a short, specific response
Do not take item stems directly from textbooks
Questions are better than incomplete statements
Right or left justify the item response blanks, and make them the same size for all items
Only one blank per item


ITEM TYPE #5A SHORT RESPONSE

Example: List the Beatitudes.

Pros:
Can measure complex learning objectives and cognitive processes
Minimizes cheating

Cons:
Scoring can be subjective
Limited sampling of content


ITEM TYPE #5B ESSAY

Example: Explain how Nietzsche's notion of the will to power is a response to Schopenhauer's will to live. (Your answer should be no longer than 2 pages, and should cite scholarly sources. It will be evaluated on your analysis of cited scholarship and the skill with which the essay is organized.)

Pros:
Can help students connect related ideas
Responding can (possibly) be a learning exercise itself
Can measure complex objectives & processes

Cons:
Relies on both writing skills and content familiarity
Scoring is subjective → less score reliability
Limited sampling of content


ITEM TYPE #5 SHORT ANSWER/ESSAY

Best Practice:
Only use for learning outcomes that require non-objective assessment
Map the questions directly onto learning objectives
Inform respondents of the grading criteria (e.g., content knowledge, thought organization)
Make the examinee’s writing task explicit
Estimate the time needed for an appropriate answer
Give all examinees the same (or equivalent) questions. Avoid optional questions.
Outline the expected answer in advance, and...
...Develop a rubric that allocates points in the desired manner before administering the exam

Developing the Test



TEST SPECIFICATIONS

Content Domain
  How do topics within the content area relate to each other and how does knowledge in the area build?
Cognitive Skills/Process to Answer Item
Distribution of Content Areas and Cognitive Skills Demand throughout Test


TEST SPECIFICATIONS
For Classroom Evaluations, You Want Your Tests to Map onto Your Instructional Objectives/Learning Outcomes

Instructional Objective/Learning Outcome:
  I. Demonstrates Skill in Critical Thinking
    A. Comprehends Relevant Antecedents to Historical Events

Test Item:
  Name Three Precipitating Events to the First Crusade


TABLE OF SPECIFICATIONS

Columns are Instructional Objectives (Objective Weight); rows are Major Content Areas (Content Weight).

Major Content Area                            Knowledge  Comprehension  Application  Analysis  Synthesis  Total Items
Early Christian Writers in the West               2            3             1           2          2          10
Luther and the Beginning of the Reformation       3            2             3           4          3          15
Liberal Protestantism in Modernity                3            2             2           1          2          10
Total Items                                       8            7             6           7          7          35
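As a small worked example of the content weights in the table of specifications (10, 15, and 10 items, 35 in total), the sketch below turns item counts per content area into proportional weights. The dictionary name `items_per_area` and the function `content_weights` are hypothetical helpers introduced only for this illustration.

```python
def content_weights(items_per_area):
    """Convert item counts per content area into proportions of the whole test."""
    total = sum(items_per_area.values())
    return {area: count / total for area, count in items_per_area.items()}

# Item counts per major content area, as given by the content weights
items_per_area = {
    "Early Christian Writers in the West": 10,
    "Luther and the Beginning of the Reformation": 15,
    "Liberal Protestantism in Modernity": 10,
}

weights = content_weights(items_per_area)
for area, weight in weights.items():
    print(f"{area}: {weight:.0%} of the test")
```

Comparing these proportions with the share of class time spent on each area is one quick check that the test maps onto the instruction.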


TEST LENGTH

No “correct” length
Depends on:
  Administration time
  Examinees
  Scores needed
  Content coverage
  Item types used
  Desired reliability


TEST ORGANIZATION

Directions
  Be explicit
  Give time allowed to take test
  Give directions for responding
  Give point allocation (weighting) if different across items

Item Grouping
If there are different item types on the test:
  Only if needed, group items by content area
  Put same item types together
  Within a type, place in order of simpler to more complex


TEST/ITEM SCORING
Points to Consider

Allow for partial credit?
Should content areas be weighted equally?
Should learning objectives be weighted equally?
If a test is made of multiple “subtests”, is each autonomous or graded as a whole?
  e.g., if Jane misses all 10 of the “Liberal Protestantism in Modernity” questions, but gets the other 25 items correct, can she still “pass” the test?
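The Jane question above can be made concrete by comparing two scoring rules: grading the test as a whole versus requiring a passing score on each subtest. In the sketch below, the 70% cut score and the function names are assumptions made up for illustration, and the subtest sizes (10, 15, 10) are taken from the content weights in the table of specifications.

```python
def passes_as_whole(correct, totals, cut=0.70):
    """Graded as a whole: strong subtests can compensate for weak ones."""
    return sum(correct.values()) / sum(totals.values()) >= cut

def passes_each_subtest(correct, totals, cut=0.70):
    """Each subtest is autonomous: every one must meet the cut score."""
    return all(correct[area] / totals[area] >= cut for area in totals)

totals = {"Early Christian Writers": 10, "Luther": 15, "Liberal Protestantism": 10}
# Jane misses all 10 Liberal Protestantism items and gets the other 25 right
jane = {"Early Christian Writers": 10, "Luther": 15, "Liberal Protestantism": 0}

whole = passes_as_whole(jane, totals)        # 25/35 is about 71%
by_subtest = passes_each_subtest(jane, totals)
print(whole, by_subtest)
```

Under these assumed settings Jane passes the test as a whole but fails under the per-subtest rule, which is why the scoring decision should be made before the test is given.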


TEST/ITEM ANALYSIS
A Multiple Item Test Provides Much Information

Item difficulties (e.g., percent who “passed” the item)
  Do they differ by content area?
  Do they differ by instructional objective?

Distracters
  Are “high scorers” endorsing a distracter more than the correct answer?

Discrimination
  How well does an item discriminate “high scorers” from “low scorers”?

Are there omitted items or items not reached?
  Is there a pattern in those items?

Reliability Calculations & Validity Evidence
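Two of the statistics listed above, item difficulty and item discrimination, can be sketched in a few lines. The example uses the proportion correct for difficulty and a simple upper-group minus lower-group index for discrimination; that index is one common choice, not necessarily the one the presenters had in mind, and the half split, function names, and response matrix are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of examinees answering each item correctly (higher = easier)."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(k)]

def discrimination_index(responses, fraction=0.5):
    """Proportion correct in the top-scoring group minus the bottom-scoring group."""
    ranked = sorted(responses, key=sum, reverse=True)  # highest total score first
    g = max(1, int(len(ranked) * fraction))            # size of each group
    upper, lower = ranked[:g], ranked[-g:]
    k = len(responses[0])
    return [
        sum(row[i] for row in upper) / g - sum(row[i] for row in lower) / g
        for i in range(k)
    ]

# Hypothetical data: 6 examinees x 3 items, each scored 0 (wrong) or 1 (right)
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
difficulty = item_difficulty(responses)
discrimination = discrimination_index(responses)
print([round(p, 2) for p in difficulty], [round(d, 2) for d in discrimination])
```

An index near zero or below zero flags an item that high scorers miss about as often as low scorers, which is usually a cue to reread the item and its distracters.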


TEST/ITEM ANALYSIS
For More Information:

EDP 5340. Measurement/Evaluation
Chapter 13 of Hollis-Sawyer, Thornton, Hurd, & Condon (2008)
Chapter 14 of Linn & Miller (2005)
Chapter 6 of Urbina (2004)
LERTAP program [http://www.assess.com/]

Take Home Message



TAKE HOME MESSAGE
Be Mindful in Test Construction
Be Purposeful in Item Selection and Development

Questions?



REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA/APA/NCME]. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Bloom, B. S. (1956). Taxonomy of educational objectives, Handbook I: The cognitive domain. New York: David McKay Co Inc.

Brennan, R. L. (Ed.). (2006). Educational measurement (4th ed.). Westport, CT: Praeger.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Frisbie, D. A. (1988). Reliability of scores from teacher-made tests. Educational Measurement: Issues and Practice, 7, 25-35. [free: http://www.ncme.org/pubs/items/ITEMS_Mod_3.pdf]

Hollis-Sawyer, L., Thornton, G. C., Hurd, B., & Condon, M. E. (2008). Exercises in psychological testing (2nd ed.). Boston: Allyn & Bacon.

Linn, R. L., & Miller, M. D. (2005). Measurement and assessment in teaching (9th ed.). Upper Saddle River, NJ: Pearson.

Urbina, S. (2004). Essentials of psychological testing. Hoboken, NJ: John Wiley & Sons.
