TRANSCRIPT
EuroCALL 2006
Turning teacher written exams into valid and reliable
computer-assisted testing tools
Desmet, P., Vandewaetere, M., Sercu, L. & Van den Noortgate, W.
KULeuven and KULeuven-Campus Kortrijk, Belgium
Lingu@flex
Overview
1. Introduction
2. Construct specification (what to measure)
3. Test specification (how to measure)
4. Analysis specification (what to do with the measurements)
5. Conclusion and discussion
1. Introduction: the Lingu@flex project
• development of an achievement test:
  - English Grammar
  - French Reading skills
• flexible learning trajectories between 4 educational programmes:
  - BA secondary teaching
  - BA office management
  - MA translation/interpreting
  - MA linguistics/literature
1. Introduction: the Lingu@flex project
• construction of a measurement scale with an indication of the minimal achievement level, e.g.:
  - student x meets the requirements (desired ability level) for programme y
  - students of programme y do not meet the requirements (desired ability level) for programme y
1. Introduction
• GOAL: development of computer-assisted language achievement tests
• 2 methods:
  A. import of a pen-and-paper test into an authoring system (task-oriented or holistic approach)
  B. test development approach (construct-driven approach)
• different methodological stages:
  - What to measure? (construct specification)
  - How to measure? (design specification)
  - What to do with the measurements? (analysis specification)
2. Construct specification
A. Step 1: Pilot study (June 2005)
• "naive method": existing pen-and-paper questions imported into the authoring tool
• no uniformity in design (stratification, randomization, ...)
• no uniformity in procedure (instructions, time, ...)
• possible confounders not measured
• low (construct) validity and reliability

RESULTS of the pilot study gave information about:
- item construction (content and formulation)
- the need for optimisation of design and procedure
- possible confounders (PC knowledge, motivation, etc.)
- increasing the validity, reliability and generalisability of future results

A good starting point for Lingu@flex, but not an endpoint!
2. Construct specification
B. Step 2: Construct-driven approach / test development approach
• the L2 construct is multidimensional (Chalhoub-Deville, 2001)
• test content is determined by specifying the knowledge domain (Messick, 1994)

testing matrix with 2 dimensions:
1. language construct
2. cognitive level
2. Construct specification: testing matrix English Grammar
[Testing matrix figure: morphosyntactic domains × cognitive levels, with item types in the cells]
• 3 morphosyntactic domains and subdomains
• 3 cognitive levels (ordered)
• empty cells as a consequence of inherent characteristics of the construct
• item types in the cells: multiple-choice, correct if necessary, reformulate, fill-in-the-blank, order in columns, translate
2. Construct specification: testing matrix French Reading
• 5 categories (types of text)
• 3 cognitive levels (ordered)
• empty cells as a consequence of the construct specification
• number of items depends on the type and content of the text (e.g. an informative text yields more knowledge-related items)
3. Test specification
1. Tool
• Idioma-tic (www.idiomatic.be) and CogniStreamer (www.cognistreamer.com)
• exercise environment (intelligent feedback) and testing environment (no feedback)
• closed and half-open exercises and tests
• fully online tool (front end and back office)
3. Test specification
2. Available item types: 3 technical types
• multiple choice (multiple answer)
• fill-in-the-blank (≥ 1 field): adjust, reformulate, conjugate, synonyms, ...
• translate: translate, find in text, reformulate, ...
3. Test specification
2. Available item types
Existing types for which the scoring mechanism still needs to be worked out:
• drag and drop text and hot spots
• dictation
• order horizontal, order vertical, order in columns
• crossword
For an overview of the existing types: http://www.kuleuven-kortrijk.be/Idioma-tic/doc.html
• translation/reformulation exercises: comparison algorithm (core group + by-group)
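The "core group + by-group" comparison algorithm for translation/reformulation items is not detailed in the slides. A minimal sketch of one plausible reading, where every core token must appear and by-group (optional) tokens add partial credit, might look like this (the function and weighting are hypothetical, not the Lingu@flex implementation):

```python
def score_translation(answer: str, core: set[str], optional: set[str]) -> float:
    """Illustrative scorer: a translation is accepted only when every
    'core group' token is present; 'by-group' (optional) tokens then
    add partial credit on top of a base score."""
    tokens = set(answer.lower().split())
    if not core <= tokens:          # any missing core element fails the item
        return 0.0
    if not optional:
        return 1.0
    # partial credit: half the score for the core, half proportional
    # to how many optional tokens are covered
    return 0.5 + 0.5 * len(optional & tokens) / len(optional)
```

A real system would also need normalisation (inflection, word order, synonyms), which is presumably why the slides list NLP-based scoring as future work.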
3. Test specification
3. Content of the item database
- goal: achievement test
- strong need for highly discriminative items (IRT)
A. Phase 1: pilot study, holistic approach (June 2005)
- existing course materials
- item ratings by experts: quantitative (difficulty of the item, relevance for each of the 4 educational programmes)
- analysis of tracking and logging data
3. Test specification
3. Content of the item database
B. Phase 2: construct-based approach (March 2006)
1. Operationalisation of the testing matrices
2. Item construction and implementation in the item database (template)
3. Subsamples of the item database (subtests) rated by students
   - random allocation of subtests to groups of students
   - several designs: paper versus electronic reading of texts; number of texts
   - quantitative: rating of item difficulty and item clarity; general rating of subtest diversity, subtest length and tool usability
   - qualitative: personal comments
4. All items rated by experts
   - quantitative: difficulty and clarity, relevance for the educational programme
   - qualitative: personal comments
3. Test specification
4. Optimisation
Based on the results from the holistic and the construct-based approach:

Tool optimisation:
- user interface design and usability
- forward/backward navigation
- answer check (not change)
- tracking and logging
- scoring mechanisms
- changed features are tested first (test group: n = 6)
  - LOGGING: comparison between handwritten answers and electronic answers
  - SCORING: comparison between hand-based scoring and electronic scoring
- work in progress: multiple choice, fill-in-the-blank and translate are fully operational
3. Test specification
4. Optimisation
Content optimisation:
- first selection/adjustment of items and texts
- error analyses, qualitative analyses
- non-IRT-based

Design optimisation:
- test length, subtest construction, instructions

Preparation of the construction of the final test: trial run of the statistical programming code (SQL and SAS 9.1®)
4. Analysis specification
1. Item Response Theory
• the study of item scores
• based on assumptions concerning the (mathematical) relationship between abilities (or other traits) and item responses
• item characteristic curve: the probability of a correct response for students at different ability (theta) levels
  [Figure: item characteristic curve; x-axis: theta = student ability, y-axis: P(θ) = probability of a correct response to one test item]
4. Analysis specification
1. Item Response Theory
• students are located on the ability scale
• each student has a numerical value, a score, that places him/her somewhere on the measurement (ability) scale
• each student can be evaluated in terms of how much underlying ability he/she possesses
• comparisons among students (and programmes) can be made
• each item of the test measures some facet of the ability
4. Analysis specification
2. Our analyses
• latent traits: English Grammar (ability 1) and French Reading (ability 2)
• descriptive statistics: traditional psychometrics
• item response theory (IRT): 2-parameter logistic model
  - discrimination parameter a_i: the degree to which the item discriminates between persons of different ability; an item that students of any ability level can answer equally well is a poor discriminator
  - difficulty/location parameter b_i: the item's difficulty
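Under the 2-parameter logistic model, the probability that a student with ability θ answers item i correctly is P_i(θ) = 1 / (1 + exp(-a_i(θ - b_i))). A minimal sketch in Python (the parameter values below are made up for illustration, not Lingu@flex estimates):

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2-parameter logistic (2PL) IRT model: probability that a student
    with ability theta answers correctly an item with discrimination a
    and difficulty/location b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is exactly 0.5, for any discrimination:
p_correct(0.0, 1.2, 0.0)            # 0.5

# A higher discrimination parameter separates abilities more sharply
# (illustrative values):
low  = p_correct(-1.0, 2.0, 0.0)    # well below the item's difficulty
high = p_correct(+1.0, 2.0, 0.0)    # well above the item's difficulty
```

This is exactly the curve sketched on the previous slide: b shifts the curve along the ability scale, a controls its steepness, which is why the project needs highly discriminative items.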
4. Analysis specification
2. Our analyses
• multilevel IRT (nested design): scores within students within schools
  - scores within a student will be more similar than scores between students
  - scores of students in the same school (within-school) will be more similar than scores of students of different schools (between-school)
  - scores are nested within students [level 1], students are nested within schools [level 2], and schools are nested within educational programmes [level 3]
• where is each educational programme placed on the scale?
• covariates: age, sex, motivation, rating of the test, rating of the tool
• large test group (n > 1000)
• using SAS® PROC GLIMMIX and PROC MIXED
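The nesting above implies that the expected similarity between two scores depends on the deepest level they share. Under a simple variance-components view this is the intraclass correlation; the numbers below are purely illustrative, not estimates from this project:

```python
def icc(shared_variance: float, total_variance: float) -> float:
    """Intraclass correlation: the share of total variance attributable
    to the random effects two scores have in common."""
    return shared_variance / total_variance

# Illustrative variance components for the three-level design:
var_school, var_student, var_score = 0.2, 0.5, 0.3
total = var_school + var_student + var_score

# Two scores from the same student share the school AND student effects,
# so they are more alike than two scores from classmates, who share
# only the school effect:
same_student = icc(var_school + var_student, total)   # about 0.7
same_school  = icc(var_school, total)                 # about 0.2
```

This is the structure PROC GLIMMIX and PROC MIXED estimate from the data, with the covariates listed above entering as fixed effects.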
4. Analysis specification
3. Defining item and test quality (October 2006)
Research questions that can be answered by IRT:
- which items must a student master in order to achieve the ability level desired for a certain programme (required ability level)?
- which items are mastered by a student in a certain programme (obtained ability level)?
- are items equally difficult for equally able students from different educational programmes?
- are item scores related to the sex of the students, the kind of items, the tool rating, the educational programme, ...?
5. Discussion
• Future research in this project:
- French Lexicon and English Speaking
- improved scoring of half-open exercises/tests
- implementation of item review (Revuelta, Ximénez, & Olea, 2003) via a 'check' button
- use of NLP in scoring
• CBT is a delivery method, not a different and better type of test per se (Carr, 2006)
- validation of the methodology
- application of the methodology to other domains
Need for a strong methodological frame
5. Conclusion and discussion
• Six approaches to validity evidence (Chapelle, 1999):
- content analysis: experts' judgments of what they believe a test measures (construct or internal validity)
- empirical item or task analysis: empirical/quantitative analysis of learners' responses
- dimensionality analysis: the internal structure of the test (uni- or multidimensional)
- concurrent validity: the relation of test scores to other test scores or behaviours
- differences in test performance / generalisability: the extent to which performance on one task can be assumed to generalize to other tasks (external validity)
- testing consequences: the implications of the interpretations made from test scores
References
• Carr (2006). Computer-based testing: prospects for innovative assessment. In Ducate & Arnold (Eds.), Calling on CALL: From Theory and Research to New Directions in Foreign Language Teaching.
• Chalhoub-Deville, M. (2001). Language testing and technology: past and future. Language Learning & Technology, 5, 95-98.
• Chapelle, C.A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254-272.
• Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13-23.
• Revuelta, J., Ximénez, M.C., & Olea, J. (2003). Psychometric and psychological effects of item selection and review on computerized testing. Educational and Psychological Measurement, 63, 791-808.
Contact
Prof. Dr. P. Desmet, E. Sabbelaan 53, B-8500 Kortrijk, [email protected], tel. +32 56 246185
Mieke Vandewaetere, E. Sabbelaan 53, B-8500 Kortrijk, [email protected], tel. +32 56 246149