TRANSCRIPT
EuroCALL 2006
Turning teacher written exams into valid and reliable
computer-assisted testing tools
Desmet, P., Vandewaetere, M., Sercu, L. & Van den Noortgate, W.
KULeuven and KULeuven-Campus Kortrijk, Belgium
Lingu@flex
Overview
1. Introduction
2. Construct specification (what to measure)
3. Test specification (how to measure)
4. Analysis specification (what to do with the measurements)
5. Conclusion and discussion
1. Introduction: the Lingu@flex project
• development of an achievement test:
  - English Grammar
  - French Reading skills
• flexible learning trajectories between 4 educational programmes:
  - BA secondary teaching
  - BA office management
  - MA translation/interpreting
  - MA linguistics/literature
1. Introduction: the Lingu@flex project
• construction of a measurement scale with an indication of the minimal achievement level, e.g.:
  - student x meets the requirements (desired ability level) for programme y
  - students of programme y do not meet the requirements (desired ability level) for programme y
1. Introduction
• GOAL: development of computer-assisted language achievement tests
• 2 methods:
  A. import of a pen-and-paper test into an authoring system (task-oriented or holistic approach)
  B. test development approach (construct-driven approach)
• different methodological stages:
  - What to measure? (construct specification)
  - How to measure? (design specification)
  - What to do with the measurements? (analysis specification)
2. Construct specification
A. Step 1: Pilot study (June 2005)
• "naive method": existing pen-and-paper questions imported into the authoring tool
• no uniformity in design (stratification, randomization, ...)
• no uniformity in procedure (instructions, time, ...)
• possible confounders not measured
• low (construct) validity and reliability

RESULTS of the pilot study gave information about:
- item construction (content and formulation)
- the need for optimisation of design and procedure
- possible confounders (PC knowledge, motivation, etc.)
- increasing the validity, reliability and generalisability of future results

A good starting point for Lingu@flex, but not an endpoint!
2. Construct specification
B. Step 2: Construct-driven approach / test development approach
• the L2 construct is multidimensional (Chalhoub-Deville, 2001)
• test content is determined by specifying the knowledge domain (Messick, 1994)

testing matrix with 2 dimensions:
1. language construct
2. cognitive level
2. Construct specification: testing matrix English Grammar
[Testing matrix figure: morphosyntactic domains × cognitive levels, with item types in the cells]
• 3 morphosyntactic domains and subdomains
• 3 cognitive levels (ordered)
• empty cells as a consequence of inherent characteristics of the construct
• item types in the cells: multiple-choice, correct if necessary, reformulate, fill-in-the-blank, order in columns, translate
2. Construct specification: testing matrix French Reading
• 5 categories (types of text)
• 3 cognitive levels (ordered)
• empty cells as a consequence of the construct specification
• number of items depends on the type and content of the text (e.g. an informative text yields more knowledge-related items)
3. Test specification
1. Tool
• Idioma-tic (www.idiomatic.be) and CogniStreamer (www.cognistreamer.com)
• exercise environment (intelligent feedback) and testing environment (no feedback)
• closed and half-open exercises and tests
• fully online tool (front end and back office)
3. Test specification
2. Available item types: 3 technical types
• multiple choice (multiple answer)
• fill-in-the-blank (≥ 1 field): adjust, reformulate, conjugate, synonyms, ...
• translate: translate, find in text, reformulate, ...
3. Test specification
2. Available item types
Existing types for which the scoring mechanism still needs to be worked out:
• drag and drop text and hot spots
• dictation
• order horizontal, order vertical, order in columns
• crossword
For an overview of the existing types: http://www.kuleuven-kortrijk.be/Idioma-tic/doc.html
• translation/reformulation exercises: comparison algorithm (core group + by-group)
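The "core group + by-group" comparison algorithm for translation/reformulation items is not detailed in the slides. A minimal sketch of one plausible reading, where every core token must appear and by-group (optional) tokens add partial credit, might look like this (the function and weighting are hypothetical, not the Lingu@flex implementation):

```python
def score_translation(answer: str, core: set[str], optional: set[str]) -> float:
    """Illustrative scorer: a translation is accepted only when every
    'core group' token is present; 'by-group' (optional) tokens then
    add partial credit on top of a base score."""
    tokens = set(answer.lower().split())
    if not core <= tokens:          # any missing core element fails the item
        return 0.0
    if not optional:
        return 1.0
    # partial credit: half the score for the core, half proportional
    # to how many optional tokens are covered
    return 0.5 + 0.5 * len(optional & tokens) / len(optional)
```

A real system would also need normalisation (inflection, word order, synonyms), which is presumably why the slides list NLP-based scoring as future work.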
3. Test specification
3. Content of the item database
- goal: achievement test
- strong need for highly discriminative items (IRT)
A. Phase 1: pilot study, holistic approach (June 2005)
- existing course materials
- item ratings by experts: quantitative (difficulty of the item, relevance for each of the 4 educational programmes)
- analysis of tracking and logging data
3. Test specification
3. Content of the item database
B. Phase 2: construct-based approach (March 2006)
1. Operationalisation of the testing matrices
2. Item construction and implementation in the item database (template)
3. Subsamples of the item database (subtests) rated by students
   - random allocation of subtests to groups of students
   - several designs: paper versus electronic reading of texts; number of texts
   - quantitative: rating of item difficulty and item clarity; general rating of subtest diversity, subtest length and tool usability
   - qualitative: personal comments
4. All items rated by experts
   - quantitative: difficulty and clarity, relevance for the educational programme
   - qualitative: personal comments
3. Test specification
4. Optimisation
Based on the results from the holistic and the construct-based approach:

Tool optimisation:
- user interface design and usability
- forward/backward navigation
- answer check (not change)
- tracking and logging
- scoring mechanisms
- changed features are tested first (test group: n = 6)
  - LOGGING: comparison between handwritten answers and electronic answers
  - SCORING: comparison between hand-based scoring and electronic scoring
- work in progress: multiple choice, fill-in-the-blank and translate are fully operational
3. Test specification
4. Optimisation
Content optimisation:
- first selection/adjustment of items and texts
- error analyses, qualitative analyses
- non-IRT-based

Design optimisation:
- test length, subtest construction, instructions

Preparation of the construction of the final test: trial run of the statistical programming code (SQL and SAS 9.1®)
4. Analysis specification
1. Item Response Theory
• the study of item scores
• based on assumptions concerning the (mathematical) relationship between abilities (or other traits) and item responses
• item characteristic curve: the probability of a correct response for students at different ability (theta) levels
  [Figure: item characteristic curve; x-axis: theta = student ability, y-axis: P(θ) = probability of a correct response to one test item]
4. Analysis specification
1. Item Response Theory
• students are located on the ability scale
• each student has a numerical value, a score, that places him/her somewhere on the measurement (ability) scale
• each student can be evaluated in terms of how much underlying ability he/she possesses
• comparisons among students (and programmes) can be made
• each item of the test measures some facet of the ability
4. Analysis specification
2. Our analyses
• latent traits: English Grammar (ability 1) and French Reading (ability 2)
• descriptive statistics: traditional psychometrics
• item response theory (IRT): 2-parameter logistic model
  - discrimination parameter a_i: the degree to which the item discriminates between persons of different ability; an item that students of any ability level can answer equally well is a poor discriminator
  - difficulty/location parameter b_i: the item's difficulty
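Under the 2-parameter logistic model, the probability that a student with ability θ answers item i correctly is P_i(θ) = 1 / (1 + exp(-a_i(θ - b_i))). A minimal sketch in Python (the parameter values below are made up for illustration, not Lingu@flex estimates):

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2-parameter logistic (2PL) IRT model: probability that a student
    with ability theta answers correctly an item with discrimination a
    and difficulty/location b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is exactly 0.5, for any discrimination:
p_correct(0.0, 1.2, 0.0)            # 0.5

# A higher discrimination parameter separates abilities more sharply
# (illustrative values):
low  = p_correct(-1.0, 2.0, 0.0)    # well below the item's difficulty
high = p_correct(+1.0, 2.0, 0.0)    # well above the item's difficulty
```

This is exactly the curve sketched on the previous slide: b shifts the curve along the ability scale, a controls its steepness, which is why the project needs highly discriminative items.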
4. Analysis specification
2. Our analyses
• multilevel IRT (nested design): scores within students within schools
  - scores within a student will be more similar than scores between students
  - scores of students in the same school (within-school) will be more similar than scores of students of different schools (between-school)
  - scores are nested within students [level 1], students are nested within schools [level 2], and schools are nested within educational programmes [level 3]
• where is each educational programme placed on the scale?
• covariates: age, sex, motivation, rating of the test, rating of the tool
• large test group (n > 1000)
• using SAS® PROC GLIMMIX and PROC MIXED
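The nesting above implies that the expected similarity between two scores depends on the deepest level they share. Under a simple variance-components view this is the intraclass correlation; the numbers below are purely illustrative, not estimates from this project:

```python
def icc(shared_variance: float, total_variance: float) -> float:
    """Intraclass correlation: the share of total variance attributable
    to the random effects two scores have in common."""
    return shared_variance / total_variance

# Illustrative variance components for the three-level design:
var_school, var_student, var_score = 0.2, 0.5, 0.3
total = var_school + var_student + var_score

# Two scores from the same student share the school AND student effects,
# so they are more alike than two scores from classmates, who share
# only the school effect:
same_student = icc(var_school + var_student, total)   # about 0.7
same_school  = icc(var_school, total)                 # about 0.2
```

This is the structure PROC GLIMMIX and PROC MIXED estimate from the data, with the covariates listed above entering as fixed effects.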
4. Analysis specification
3. Defining item and test quality (October 2006)
Research questions that can be answered by IRT:
- which items must a student master in order to achieve the ability level desired for a certain programme (required ability level)?
- which items are mastered by a student in a certain programme (obtained ability level)?
- are items equally difficult for equally able students from different educational programmes?
- are item scores related to the sex of the students, the kind of items, the tool rating, the educational programme, ...?
5. Discussion
• Future research in this project:
- French Lexicon and English Speaking
- improved scoring of half-open exercises/tests
- implementation of item review (Revuelta, Ximénez, & Olea, 2003) via a 'check' button
- use of NLP in scoring
• CBT is a delivery method, not a different and better type of test per se (Carr, 2006)
- validation of the methodology
- application of the methodology to other domains
Need for a strong methodological frame
5. Conclusion and discussion
• Six approaches to validity evidence (Chapelle, 1999):
- content analysis: experts' judgments of what they believe a test measures (construct or internal validity)
- empirical item or task analysis: empirical/quantitative analysis of learners' responses
- dimensionality analysis: the internal structure of the test (uni- or multidimensional)
- concurrent validity: the relation of test scores to other test scores or behaviours
- differences in test performance / generalisability: the extent to which performance on one task can be assumed to generalize to other tasks (external validity)
- testing consequences: the implications of the interpretations made from test scores
References
• Carr (2006). Computer-based testing: prospects for innovative assessment. In Ducate & Arnold (Eds.), Calling on CALL: From Theory and Research to New Directions in Foreign Language Teaching.
• Chalhoub-Deville, M. (2001). Language testing and technology: past and future. Language Learning & Technology, 5, 95-98.
• Chapelle, C.A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254-272.
• Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13-23.
• Revuelta, J., Ximénez, M.C., & Olea, J. (2003). Psychometric and psychological effects of item selection and review on computerized testing. Educational and Psychological Measurement, 63, 791-808.
Contact
Prof. Dr. P. Desmet, E. Sabbelaan 53, B-8500 Kortrijk, [email protected], tel. +32 56 246185
Mieke Vandewaetere, E. Sabbelaan 53, B-8500 Kortrijk, [email protected], tel. +32 56 246149