alexander beaujean william shiu · 2008 teaching colloquy, department of religion definitions...
TRANSCRIPT
THEORY OF MEASUREMENT: Everything You Wanted To Know About Classroom Assessment But Were Afraid To Ask
Alexander Beaujean
William Shiu
Baylor Psychometric Laboratory
http://homepages.baylor.edu/psychometric_lab
2008 Teaching Colloquy Department of Religion
TABLE OF CONTENTS
  Definitions
  Test Design
  Test Score Properties: Reliability and Validity
  Cognitive Processes
  Some Item Types
  Developing the Test
  Take Home Message
Definitions

DEFINITIONS
Test (noun)
  Etymology: Middle English, vessel in which metals were assayed [analyzed], potsherd, from Anglo-French test, tees pot, Latin testum earthen vessel; akin to Latin testa earthen pot, shell
  Definition: (1) a procedure, reaction, or reagent used to identify or characterize a substance or constituent; (2) something (as a series of questions or exercises) for measuring the skill, knowledge, intelligence, capacities, or aptitudes of an individual or group
  (Test. (2008). In Merriam-Webster Online Dictionary. Retrieved September 26, 2008, from http://www.merriam-webster.com/dictionary/test)

DEFINITIONS
(Achievement) Test:
  A collection of items or tasks used to measure an underlying construct of interest, the results (i.e., test scores) of which allow for decisions based on the construct's level

DEFINITIONS
Item:
  Genesis is the first book of the Bible. T/F
  Item Stem: "Genesis is the first book of the Bible."
  Item Response: T/F

DEFINITIONS
Construct:
  A measure of some trait/attribute/quality that is not “operationally defined.”
  A latent entity whose level and relationship with other objects (either latent or manifest) can only be inferred
Latent:
  Extant, but not perceivable by bodily senses
Cronbach & Meehl (1955)
Test Design

TEST DESIGN
Test Philosophy:
  What will/will not your test measure?
  About what construct are you hoping to make inferences?
  What is required for your test to measure that construct?

TEST DESIGN
[Diagram: Person Ability/Trait (Construct), Cognitive Process(es), Item Response, Context]

TEST DESIGN
Test Purpose:
  What information do you want to obtain from this test?
  …and…
  What decision(s) do you need to make from this information?

TEST DESIGN
Examinee Population:
  For whom is this test intended?
  ≠

TEST DESIGN
Constraints
  Time to take test
  Platform
    Paper vs. Computer
  Location
    Security/standardization
  Administration
    Entire Group vs. Subgroups vs. Individual
Test Score Properties: Reliability and Validity

RELIABILITY
Reliability
  Do the test scores measure their construct consistently?
  Contributors to inconsistency
    Random (vary from examinee to examinee)
    Systematic (consistent for all examinees)
  Effects can be innocuous or severe, depending on the purpose of the test

RELIABILITY
Estimation
  0 < reliability < 1
    Published: .80-.95
    Classroom: .50
  Methods
    Correlation between 2 administrations (of same test)
    Correlation among test items
      Internal Consistency (α)
  See Frisbie (1988)
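The two estimation methods named on this slide can be sketched in a few lines. This is a minimal illustration, not the presenters' own procedure: internal consistency is computed with the usual coefficient alpha formula, the retest estimate is a plain Pearson correlation, and all of the data below are hypothetical.

```python
from statistics import mean, pvariance


def cronbach_alpha(scores):
    """Internal consistency (alpha) for a list of examinee rows of item scores."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance(col) for col in zip(*scores)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def pearson_r(x, y):
    """Correlation between scores from two administrations of the same test."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Hypothetical data: 5 examinees x 4 items, scored 1 (correct) or 0 (incorrect).
item_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(item_scores)

# Hypothetical total scores for 5 examinees on two administrations.
first_administration = [1, 2, 3, 4, 5]
second_administration = [2, 1, 4, 3, 5]
retest_r = pearson_r(first_administration, second_administration)

print(round(alpha, 2), round(retest_r, 2))  # 0.8 0.8
```

With real classroom data, each row of `item_scores` would hold one student's item-by-item scores.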

RELIABILITY
Influences on Reliability Estimates
  Length
  Dimensionality
    How many constructs is the test measuring?
  Item Difficulty
  Item Discrimination
    How likely is a response in examinees “high” on the construct vs. examinees “low” on the construct?
  Heterogeneity of the examinees
  Student Factors (motivation, “testwiseness”)
  Time Allotment
  Security
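The effect of length on reliability can be illustrated with the Spearman-Brown formula. The slides do not name this formula; it is brought in here as a standard way to project reliability for a lengthened test, and it assumes the added items behave like the existing ones. The starting value of .50 is the classroom figure from the Estimation slide.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor.

    Assumes the added items are comparable (parallel) to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


classroom_reliability = 0.50                         # classroom figure from the slides
doubled = spearman_brown(classroom_reliability, 2)   # twice as many items
tripled = spearman_brown(classroom_reliability, 3)   # three times as many items

print(round(doubled, 2), round(tripled, 2))  # 0.67 0.75
```

Under these assumptions, doubling a test with reliability .50 would raise the projected reliability to about .67, which still falls short of the .80-.95 range quoted for published tests.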

VALIDITY
Validity
  Are the test scores measuring the intended construct?
  An argument, for which you need multiple strands of evidence, e.g.:
    Do they appear to measure their intended construct?
    Do experts think they are measuring their intended construct?
    Do they have relationships with other measures that…
      Measure the same things
      Measure different things
    Do they predict outcomes of interest?
    Do the test’s items have a basis in the curriculum?
  See AERA/APA/NCME (1999)
Cognitive Processes

ASSESSMENT PROCESS
Good Classroom Assessments Flow From the Class’s Instructional Objectives/Learning Outcomes…
…And Allow Inferences About the Construct of Interest

ASSESSMENT PROCESS
[Diagram: “Learning Objectives”, Cognitive Processes, Item Responses, Test Scores, Construct Inference]

BLOOM’S TAXONOMY
[Diagram: KNOWLEDGE, COMPREHENSION, APPLICATION, ANALYSIS, SYNTHESIS, EVALUATION; axis label: Development/Difficulty]
Bloom (1956)

BLOOM’S TAXONOMY
Level 1 - Knowledge
  Recall information
  Some item stems: recall, recite, list, label, define, identify, quote, who, what, when, where, tell, describe, relate, locate, write, find, state, name
  Examples:
    Define consubstantiation.
    Who was Constantine?
    When were the first Crusades?
    List the five points of Calvinism.

BLOOM’S TAXONOMY
Level 2 - Comprehension
  Understand information
  Some item stems: demonstrate, explain, describe, interpret, summarize, cause-effect, outline, discuss, distinguish, restate, translate
  Example: Why did Paul write to the church at Philippi?
    (a) To address the issue of rivals and uphold his apostleship
    (b) To preserve the view of justification by faith
    (c) To emphasize that under salvation by Christ, Jews and Gentiles are brought together

BLOOM’S TAXONOMY
Level 3 - Application
  Use information
  Some item stems: demonstrate, apply, calculate, illustrate, show, construct, interview, solve, use, complete, examine, classify
  Example: Translate the following into English:
    Αειδε Θεά ούλομένην μήνιν Αχιλήος

BLOOM’S TAXONOMY
Level 4 - Analysis
  Examine/break apart information
  Some item stems: explain, connect, classify, categorize, compare, analyze, distinguish, examine, contrast, investigate, separate
  Examples:
    Compare Plato’s Republic with Lenin’s April Theses.
    Which of the following names of God is most different from the other three?
      (a) JEHOVAH (b) ELOHIM (c) KURIOS (d) DESPOTES

BLOOM’S TAXONOMY
Level 5 - Synthesis
  Create with information
  Some item stems: combine, integrate, modify, hypothesize, abstract, create, design, invent, compose, predict, plan, imagine, propose, devise, formulate, conjecture
  Example:
    Conjecture about Stephen’s response to Paul né Saul, were they to have met after Paul’s Roman imprisonment.

BLOOM’S TAXONOMY
Level 6 - Evaluation
  Combine previous information skills to make a judgment
  Some item stems: judge, select, choose, decide, justify, debate, verify, argue, recommend, assess, discuss, rate, prioritize, determine
  Examples:
    Appraise Calvin’s Institutes in light of Oberman’s The Dawn of the Reformation.
    Which deserves precedence as the earliest Baptist church in North America: Roger Williams’ Providence church or John Clarke’s Newport church? Support your answer with scholarly sources.
Some Item Types

ASSESSMENT PROCESS
[Diagram: “Learning Objectives”, Cognitive Processes, Item Responses, Test Scores, Construct Inference]

ITEM TYPE #1
TRUE-FALSE
  Example: Augustine wrote The Confessions. T/F?
  Pros:
    Convenient to write
    Easy to score
    Allows flexibility in content coverage
  Cons:
    Limited in cognitive processes covered
    Guessing
    Student response sets

ITEM TYPE #1
TRUE-FALSE
  Best Practice:
    Make the statements as short and specific as possible
    One idea per statement
    Avoid trivial information
    Use positive statements instead of negative, and always avoid double negative statements
    Do not use opinion statements unless they are attributed to someone
    Length should not differ between true and false statements
    Use approximately equal numbers of true and false statements

ITEM TYPE #2
MULTIPLE CHOICE RESPONSE
  Example: Who is famous for his 95 Theses?
    (a) Pope Leo X; (b) Martin Luther; (c) Johann Eck
  Pros:
    “Best Answer” is more flexible than unequivocal true/false
    Allows different cognitive processes in item response
    Guessing less of a factor than T/F
    Easy to score
  Cons:
    Large amount of time to write good distracters (wrong response alternatives)
    Guessing is possible
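The claim that guessing matters less for multiple choice than for true-false can be checked with a short binomial calculation. This is a sketch under a strong assumption: the examinee guesses blindly and independently on every item. The 10-item test and the cutoff of 7 are hypothetical, and the three-option case mirrors the (a)/(b)/(c) example above.

```python
from math import comb


def expected_chance_score(n_items, n_options):
    """Expected number correct when every item is a blind guess."""
    return n_items / n_options


def prob_score_at_least(n_items, n_options, cutoff):
    """Binomial probability of getting at least `cutoff` items right by guessing."""
    p = 1 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(cutoff, n_items + 1)
    )


n_items, cutoff = 10, 7                                # hypothetical test and cut score
tf_chance = prob_score_at_least(n_items, 2, cutoff)    # true-false: 2 options
mc_chance = prob_score_at_least(n_items, 3, cutoff)    # multiple choice: 3 options

print(expected_chance_score(n_items, 2), round(tf_chance, 3))            # 5.0 0.172
print(round(expected_chance_score(n_items, 3), 2), round(mc_chance, 3))  # 3.33 0.02
```

On these assumptions, a blind guesser reaches 7 of 10 about 17% of the time on true-false items but only about 2% of the time on three-option items.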

ITEM TYPE #2
MULTIPLE CHOICE RESPONSE
  Best Practice:
    Item stems should: (a) have autonomous meaning, (b) present as much of the item as possible, and (c) have no irrelevant material
    Avoid negative item stems
    All item responses should be grammatically compatible with their stem and of approximately equal length
    There should be only one correct/best answer
    Distracters should be plausible
    Avoid “clues” in item stem
    Avoid “none/all of the above” response options

ITEM TYPE #3
MATCHING
  Example: Match the philosopher with their work:
    _A_ 1. The Socratic Dialogues       A. Plato
    _C_ 2. None                         B. Aristotle
    _B_ 3. Organon                      C. Socrates
    _D_ 4. The Elements                 D. Euclid
    _E_ 5. Reminiscences of Crates      E. Zeno
  Pros
    Can cover much material in content domain
    Easy to administer
  Cons
    Limited in cognitive processes covered
    Difficult to find homogeneous material
    Difficult to develop good, plausible set of responses

ITEM TYPE #3
MATCHING
  Best Practice:
    Use homogeneous material
    Have unequal numbers of stems and responses
    Place responses in numerical or alphabetical order
    Explicitly state the basis for finding a match
    Place all items/responses on the same page

ITEM TYPE #4
FILL IN THE BLANK
  Example: Martin Buber edited the _______, a Zionist periodical. (Die Welt)
  Pros:
    Very, very minimal guessing
    Easy to construct item stems
  Cons:
    Must score by hand, with the possibility of multiple correct responses
    Assesses only factual knowledge

ITEM TYPE #4
FILL IN THE BLANK
  Best Practice:
    Make the item require a short, specific response
    Do not take item stems directly from textbooks
    Questions are better than incomplete statements
    Right or left justify the item response blanks, and make them the same size for all items
    Only one blank per item

ITEM TYPE #5A
SHORT RESPONSE
  Example: List the Beatitudes.
  Pros:
    Can measure complex learning objectives and cognitive processes
    Minimizes cheating
  Cons:
    Scoring can be subjective
    Limited sampling of content

ITEM TYPE #5B
ESSAY
  Example: Explain how Nietzsche's notion of the will to power is a response to Schopenhauer's will to live.
    (Your answer should be no longer than 2 pages, and should cite scholarly sources. It will be evaluated on your analysis of cited scholarship and the skill with which the essay is organized.)
  Pros:
    Can help students connect related ideas
    Responding can (possibly) be a learning exercise itself
    Can measure complex objectives & processes
  Cons:
    Relies on both writing skills and content familiarity
    Scoring is subjective, which lowers score reliability
    Limited sampling of content

ITEM TYPE #5
SHORT ANSWER/ESSAY
  Best Practice:
    Only use for learning outcomes that require non-objective assessment
    Map the questions directly onto learning objectives
    Inform respondents of the grading criteria (e.g., content knowledge, thought organization)
    Make the examinee’s writing task explicit
    Estimate the time needed for an appropriate answer
    Give all examinees the same (or equivalent) questions. Avoid optional questions.
    Outline the expected answer in advance, and…
    …develop a rubric that allocates points in the desired manner before administering the exam
Developing the Test

TEST SPECIFICATIONS
  Content Domain
    How do topics within the content area relate to each other and how does knowledge in the area build?
  Cognitive Skills/Processes to Answer Item
  Distribution of Content Areas and Cognitive Skill Demands throughout Test

TEST SPECIFICATIONS
For Classroom Evaluations, You Want Your Tests to Map onto Your Instructional Objectives/Learning Outcomes
  Instructional Objective/Learning Outcomes:
    I. Demonstrates Skill in Critical Thinking
      A. Comprehends Relevant Antecedents to Historical Events
  Test Item:
    Name Three Precipitating Events to the First Crusade

TABLE OF SPECIFICATIONS
                                               Instructional Objectives
Major Content Area                             Knowledge  Comprehension  Application  Analysis  Synthesis  Total Items (Content Weight)
Early Christian Writers in the West                2            3             2           1         2          10
Luther and the Beginning of the Reformation        3            2             4           3         3          15
Liberal Protestantism in Modernity                 3            2             1           2         2          10
Total Items (Objective Weight)                     8            7             7           6         7          35
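A table of specifications is easy to check mechanically: the cells should add up to the row totals (content weights) and column totals (objective weights). The sketch below is an illustration, not part of the original slides. The Knowledge and Comprehension counts and all totals are read directly from the table; the split of the remaining items across Application, Analysis, and Synthesis is an assumed reading of the slide that is consistent with those totals.

```python
LEVELS = ["Knowledge", "Comprehension", "Application", "Analysis", "Synthesis"]

# Items per content area, in the order of LEVELS.
# The last three counts in each row are an assumed reading of the slide.
blueprint = {
    "Early Christian Writers in the West": [2, 3, 2, 1, 2],
    "Luther and the Beginning of the Reformation": [3, 2, 4, 3, 3],
    "Liberal Protestantism in Modernity": [3, 2, 1, 2, 2],
}

# Row totals (content weight) and column totals (objective weight).
content_totals = {area: sum(cells) for area, cells in blueprint.items()}
objective_totals = dict(zip(LEVELS, (sum(col) for col in zip(*blueprint.values()))))
total_items = sum(content_totals.values())

# Weights expressed as a share of the whole test.
content_weights = {a: n / total_items for a, n in content_totals.items()}
objective_weights = {lvl: n / total_items for lvl, n in objective_totals.items()}

for area, n in content_totals.items():
    print(f"{area}: {n} of {total_items} items ({content_weights[area]:.0%})")
```

Expressing the weights as proportions makes it easier to compare the test against the share of class time given to each content area.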

TEST LENGTH
  No “correct” length
  Depends on:
    Administration time
    Examinees
    Scores needed
    Content coverage
    Item types used
    Desired reliability

TEST ORGANIZATION
  Directions
    Be explicit
    Give time allowed to take test
    Give directions for responding
    Give point allocation (weighting) if different across items
  Item Grouping
    If there are different item types on the test:
      Only if needed, group items by content area
      Put same item types together
      Within a type, place in order of simpler to more complex

TEST/ITEM SCORING
Points to Consider
  Allow for partial credit?
  Should content areas be weighted equally?
  Should learning objectives be weighted equally?
  If a test is made of multiple “subtests”, is each autonomous or graded as a whole?
    e.g., if Jane misses all 10 of the “Liberal Protestantism in Modernity” questions, but gets the other 25 items correct, can she still “pass” the test?
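Jane's case comes out differently depending on which scoring rule is adopted. The sketch below contrasts grading the test as a whole with grading each subtest separately. The function names and the passing proportion of .70 are hypothetical, chosen only to make the contrast visible; the subtest sizes follow the 10/15/10 split in the table of specifications.

```python
def passes_as_whole(subtests, cut):
    """Pass if the proportion correct on the whole test reaches the cut score."""
    correct = sum(c for c, _ in subtests.values())
    n_items = sum(n for _, n in subtests.values())
    return correct / n_items >= cut


def passes_each_subtest(subtests, cut):
    """Pass only if every subtest, taken separately, reaches the cut score."""
    return all(c / n >= cut for c, n in subtests.values())


# (items correct, items on subtest) for Jane.
jane = {
    "Early Christian Writers in the West": (10, 10),
    "Luther and the Beginning of the Reformation": (15, 15),
    "Liberal Protestantism in Modernity": (0, 10),
}
CUT = 0.70  # hypothetical passing proportion

print(passes_as_whole(jane, CUT))      # True  (25 of 35 is about .71)
print(passes_each_subtest(jane, CUT))  # False (0 of 10 on one subtest)
```

Neither rule is correct in itself; the choice depends on the decisions the test is meant to support.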

TEST/ITEM ANALYSIS
A Multiple Item Test Provides Much Information
  Item difficulties (e.g., percent who “passed” the item)
    Do they differ by content area?
    Do they differ by instructional objective?
  Distracters
    Are “high scorers” endorsing a distracter more than the correct answer?
  Discrimination
    How well does an item discriminate “high scorers” from “low scorers”?
  Are there omitted items or items not reached?
    Is there a pattern in those items?
  Reliability Calculations & Validity Evidence
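Item difficulty and discrimination can both be computed from a table of scored responses. The sketch below is one simple approach, not the only one: difficulty is the proportion of examinees who answered the item correctly, and discrimination is the difference in that proportion between the top-scoring and bottom-scoring halves of the class (other group sizes and correlation-based indices are also in common use). The data are hypothetical.

```python
def item_difficulty(scores):
    """Proportion of examinees answering each item correctly."""
    n = len(scores)
    return [sum(col) / n for col in zip(*scores)]


def discrimination_index(scores):
    """Upper-half minus lower-half proportion correct, per item."""
    ranked = sorted(scores, key=sum, reverse=True)  # highest total score first
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    return [
        sum(u) / half - sum(l) / half
        for u, l in zip(zip(*upper), zip(*lower))
    ]


# Hypothetical data: 6 examinees x 3 items, scored 1 (correct) or 0 (incorrect).
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

print([round(p, 2) for p in item_difficulty(scores)])       # [0.67, 0.5, 0.33]
print([round(d, 2) for d in discrimination_index(scores)])  # [0.67, 0.33, 0.67]
```

An index near zero, or a negative one, flags an item that “high scorers” answer no better than “low scorers” and that may need revision.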

TEST/ITEM ANALYSIS
For More Information:
  EDP 5340. Measurement/Evaluation
  Chapter 13 of Hollis-Sawyer, Thornton, Hurd, & Condon (2008)
  Chapter 14 of Linn & Miller (2005)
  Chapter 6 of Urbina (2004)
  LERTAP program [http://www.assess.com/]
Take Home Message

TAKE HOME MESSAGE
  Be Mindful in Test Construction
  Be Purposeful in Item Selection and Development
Questions?

REFERENCES
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA/APA/NCME]. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Bloom, B. S. (1956). Taxonomy of educational objectives, Handbook I: The cognitive domain. New York: David McKay Co Inc.
Brennan, R. L. (Ed.). (2006). Educational measurement (4th ed.). Westport, CT: Praeger.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Frisbie, D. A. (1988). Reliability of scores from teacher-made tests. Educational Measurement: Issues and Practice, 7, 25-35. [free: http://www.ncme.org/pubs/items/ITEMS_Mod_3.pdf ]
Hollis-Sawyer, L., Thornton, G. C., Hurd, B., & Condon, M. E. (2008). Exercises in psychological testing (2nd ed.). Boston: Allyn & Bacon.
Linn, R. L., & Miller, M. D. (2005). Measurement and assessment in teaching (9th ed.). Upper Saddle River, NJ: Pearson.
Urbina, S. (2004). Essentials of psychological testing. Hoboken, NJ: John Wiley & Sons.