7. Assessment and the CEFR
TRANSCRIPT
Assessment: Assessment of the proficiency of the language user
Three key concepts:
• Validity: the information gained is an accurate representation of the proficiency of the candidates
• Reliability: a student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate, real or simulated, administrations of the same assessment)
• Feasibility: the procedure needs to be practical, adapted to the available elements and features
If we want assessment to be valid, reliable, and feasible, we need to specify:
• What is assessed: according to the CEFR, communicative activities (contexts, texts, and tasks). See examples.
• How performance is interpreted: assessment criteria. See examples.
• How to make comparisons between different tests and ways of assessment (for example, between public examinations and teacher assessment). Two main procedures:
  - Social moderation: discussion between experts
  - Benchmarking: comparison of samples in relation to standardized definitions and examples, which become reference points (benchmarks)
• Guidelines for good practice: EALTA
TYPES OF ASSESSMENT
1. Achievement assessment / Proficiency assessment
2. Norm-referencing (NR) / Criterion-referencing (CR)
3. Mastery learning CR / Continuum CR
4. Continuous assessment / Fixed assessment points
5. Formative assessment / Summative assessment
6. Direct assessment / Indirect assessment
7. Performance assessment / Knowledge assessment
8. Subjective assessment / Objective assessment
9. Checklist rating / Performance rating
10. Impression / Guided judgement
11. Holistic assessment / Analytic assessment
12. Series assessment / Category assessment
13. Assessment by others / Self-assessment
Types of tests:
• Proficiency tests
• Achievement tests. Two approaches:
  - To base achievement tests on the textbook/syllabus
  - To base them on course objectives. More beneficial washback.
• Diagnostic tests
• Placement tests
Validity Types:
• Construct validity (very general: the information gained is an accurate representation of the proficiency of the candidate. It checks the validity of the construct, the thing we want to measure)
• Content validity. This checks whether the test's content is a representative sample of the skills or structures that it wants to measure. In order to check this we need a complete specification of all the skills or structures we want to cover. If it covers only 5%, it has less content validity than if it covers 25%.
Validity Types:
• Criterion-related validity: results on the test agree with other dependable results (criterion test)
  - Concurrent validity: we compare the test results with the criterion test.
  - Predictive validity: a placement test is validated by the teachers who teach the selected students.
• Validity in scoring. Not only the items need to be valid, but also the way in which responses are scored (taking into account grammar mistakes in a reading comprehension exam is not valid).
• Face validity: the test has to look as if it measures what it is supposed to measure. A written test to check pronunciation has little face validity.
How to make tests more valid (Hughes):
- Write specifications for the test.
- Include a representative sample of the content of the specifications in the test.
- Whenever feasible, use direct testing.
- Make sure that the scoring relates directly to what is being tested.
- Try to make the test reliable.
Reliability: a student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate, real or simulated, administrations of the same assessment).
- We compare two tests. Methods:
  - Test-Retest: the student takes the same test again
  - Alternate Forms: the students take two alternate forms of the same test
  - Split-Half: you split the test into two equivalent halves and compare them as if they were two different tests.
- Reliability coefficient / Standard Error of Measurement
A High Stakes Test needs a high reliability coefficient (highest is 1), and therefore a very low standard error of measurement (a number obtained by statistical analysis). A Lower Stakes exam does not need those coefficients.
- True Score: the real score that a student would get in a perfectly reliable test. In a very reliable test, the true score is clearly defined (the student will always get a similar result, for example 65-67). In a less reliable test, the range is wider (55-75).
- Scorer reliability (coefficient). You compare the scores given by different scorers (examiners). The more agreement, the higher their scorer reliability coefficient.
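The statistics above (split-half reliability with the Spearman-Brown correction, the standard error of measurement, and correlation between scorers) can be sketched in a few lines of Python. The function names and the candidate scores below are invented for illustration, not part of the course materials.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two lists of scores.

    Also usable as a scorer reliability coefficient when xs and ys
    are the marks two examiners gave the same candidates."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(half_a, half_b):
    """Correlate candidates' scores on the two halves, then apply the
    Spearman-Brown correction to estimate the reliability coefficient
    of the full-length test."""
    r = pearson(half_a, half_b)
    return 2 * r / (1 + r)

def standard_error_of_measurement(scores, reliability):
    """SEM = standard deviation * sqrt(1 - reliability).

    A high reliability coefficient gives a low SEM, so a candidate's
    true score sits in a narrow band around the observed score."""
    return statistics.pstdev(scores) * (1 - reliability) ** 0.5

# Invented scores of six candidates on the two halves of one test
half_a = [30, 27, 25, 22, 18, 14]
half_b = [28, 28, 24, 20, 19, 15]
total = [a + b for a, b in zip(half_a, half_b)]

rel = split_half_reliability(half_a, half_b)
sem = standard_error_of_measurement(total, rel)
print(f"reliability = {rel:.2f}, SEM = {sem:.2f}")
```

With a perfectly reliable test the SEM is zero; as reliability drops, the band around the true score widens, which is exactly the 65-67 versus 55-75 contrast described above.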
Item analysis:
- Facility value
- Discrimination indices: drop some items, improve others
- Analyse distractors
- Item banking
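The two item-analysis statistics can be sketched as follows; the 0/1 response data, the total scores, and the one-third grouping are my own illustrative assumptions (analyses of this kind commonly use the top and bottom thirds, but the course does not fix a fraction).

```python
def facility_value(item_responses):
    """Proportion of candidates who answered the item correctly.

    item_responses: list of 1 (correct) / 0 (incorrect)."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, group_frac=1 / 3):
    """D = facility value in the top-scoring group minus facility value
    in the bottom-scoring group.

    Items with a low or negative D do not discriminate well between
    strong and weak candidates: drop or improve them."""
    n = max(1, round(len(total_scores) * group_frac))
    # Rank candidates by total score, then take the two extreme groups
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    bottom, top = order[:n], order[-n:]
    fv_top = sum(item_responses[i] for i in top) / n
    fv_bottom = sum(item_responses[i] for i in bottom) / n
    return fv_top - fv_bottom

# Invented data: one item's responses and six candidates' total scores
responses = [1, 1, 1, 0, 1, 0]
totals = [90, 80, 70, 30, 20, 10]
print(facility_value(responses))
print(discrimination_index(responses, totals))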
How to make tests more reliable (Hughes):
1. Take enough samples of behaviour.
2. Exclude items which do not discriminate well.
3. Do not allow candidates too much freedom.
4. Write unambiguous items.
5. Provide clear and explicit instructions.
6. Ensure that tests are well laid out and perfectly legible.
7. Make candidates familiar with format and testing techniques.
8. Provide uniform and non-distracting conditions of administration.
9. Use items which permit scoring which is as objective as possible.
10. Make comparisons between candidates as direct as possible.
11. Provide a detailed scoring key.
12. Train scorers.
13. Agree acceptable responses and appropriate scores at the beginning of the scoring process.
14. Identify candidates by number, not by name.
15. Employ multiple, independent scorers.
To be valid, a test must be reliable (provide consistent measurement).
A reliable test may not be valid at all (technically perfect, but globally wrong: it does not test what it is supposed to test).
Washback/Backwash: how to achieve beneficial washback
- Test the abilities/skills you want to encourage.
- Sample widely and unpredictably.
- Use direct testing.
- Make testing criterion-referenced (CEFR).
- Base achievement tests on objectives.
- Ensure that the test is known and understood by students and teachers.
- Count the cost.
Stages of test development:
1. Make a full and clear statement of the testing 'problem'.
2. Write complete specifications for the test.
3. Write and moderate items.
4. Trial the items informally on native speakers and reject or modify problematic ones as necessary.
5. Trial the test on a group of non-native speakers similar to those for whom the test is intended.
6. Analyse the results of the trial and make any necessary changes.
7. Calibrate scales.
8. Validate.
9. Write handbooks for test takers, test users and staff.
10. Train any necessary staff (interviewers, raters, etc.).
Chapters from Hughes’ Testing for Language Teachers
9. Testing Writing
10. Testing Oral Abilities
11. Testing Reading
12. Testing Listening
13. Testing Grammar and Vocabulary
14. Testing Overall Ability
15. Tests for Young Learners
TESTING WRITING
1. Set representative tasks
   1. Specify all possible content
   2. Include a representative sample of the specified content
2. Elicit valid samples of writing ability
   1. Set as many separate tasks as feasible
   2. Test only writing ability and nothing else
   3. Restrict candidates
3. Ensure valid and reliable scoring:
   1. Set as many tasks as possible
   2. Restrict candidates
   3. Give no choice of tasks
   4. Ensure long enough samples
   5. Create appropriate scales for scoring: HOLISTIC/ANALYTIC
   6. Calibrate the scale to be used
   7. Select and train scorers
   8. Follow acceptable scoring procedures
TESTING ORAL ABILITIES
• “The most highly prized language skill”, Lado's Language Testing (1961).
• Challenges: ephemeral, intangible.
• Contrast US/UK: the Certificate of Proficiency in English (1913) already included it; TOEFL only in the 2005 iBT.
• Key notion: not accent, but intelligibility.
• Very different approaches:
  - Indirect
  - Direct (Cambridge, EOIs) or semi-direct (TOEFL iBT, OTE, Aptis). Conflict with the American tradition.
  - The future? Fully automated L2 speaking tests: Versant test, SpeechRater.
• Not only speaking, also interaction.
1. Set representative tasks
   1. Specify all possible content
   2. Include a representative sample of the specified content
2. Elicit valid samples of oral ability
   1. Techniques:
      1. Interview: questions, pictures, role play, interpreting (L1 to L2), prepared monologue, reading aloud
      2. Interaction: discussion, role play
      3. Responses to audio- or video-recordings (semi-direct)
   2. Plan and structure the test carefully:
      1. Make the oral test as long as is feasible
      2. Plan the test carefully
      3. As many tasks (“fresh starts”) as possible
      4. Use a second tester
      5. Set only tasks that candidates could do easily in L1
      6. Quiet room with good acoustics
      7. Put candidates at ease (easy questions at first, not assessed; problem with note-taking?)
      8. Collect enough relevant information
      9. Do not talk too much
      10. Select interviewers carefully and train them
3. Ensure valid and reliable scoring:
   1. Create appropriate scales for scoring: HOLISTIC/ANALYTIC. Calibrate the scale to be used
   2. Select and train scorers (different from interviewers if possible)
   3. Follow acceptable scoring procedures
TESTING READING
PROBLEMS: indirect assessment. We read in very different ways: scanning, skimming, inferring, intensive and extensive reading…
SOME TIPS
- As many texts and operations as possible (Dialang)
- Avoid texts which deal with general knowledge
- Avoid disturbing topics, or texts students might have read
- Use authentic texts
- Techniques: better short answer and gap filling than multiple choice
- Task difficulty can be lower than text difficulty
- Items should follow the order of the text
- Make items independent of each other
- Do not take into account errors of grammar or spelling
TESTING LISTENING
PROBLEMS
- As in reading: indirect assessment and the different ways of listening
- As in speaking: the transient nature of speech
  http://www.usingenglish.com/articles/why-your-students-have-problems-with-listening-comprehension.html
TIPS
- Same as in reading
- If a recording is used, make it as natural as possible
- Items should be far apart in the text
- Give students time to become familiar with the tasks
- Techniques: apart from multiple choice, short answers and gap filling, information transfer, note taking, partial dictation, transcription
Moderation is essential. How many times?
GRAMMAR
- Why? Easy to test; content validity
- Why not? Harmful washback effect. It depends on the type of test.
- Specifications: from the Council of Europe books
- Techniques: gap filling, rephrasings, completion
- Don't penalize mistakes that were not tested (a missing -s if the item is testing relatives, for example)
VOCABULARY
- Why (not)?
- Specifications: use frequency considerations
- Techniques:
  - Recognition: recognise synonyms, recognise definitions, recognise the appropriate word for a context
  - Production: pictures, definitions, gap filling
TESTING OVERALL ABILITY
Useful in particular tests where washback is not important.
- Cloze test (from “closure”). Based on the idea of “reduced redundancy”. Subtypes: selected-deletion cloze, conversational cloze
- C-Tests
- Dictation
Main problem: horrible washback effect.
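A fixed-ratio deletion cloze of this kind can be generated mechanically. The sketch below is my own minimal illustration (function name, parameters, and passage are invented, not from the course): it leaves a short lead-in intact, then replaces every n-th word with a numbered gap and keeps an answer key, relying on the redundancy of the remaining text.

```python
def make_cloze(text, n=7, lead_in=2):
    """Leave the first `lead_in` words intact, then replace every
    n-th word with a numbered gap; return the gapped text and the
    answer key. Illustrates 'reduced redundancy' testing."""
    words = text.split()
    answers = []
    for i in range(lead_in, len(words), n):
        answers.append(words[i])          # record the deleted word
        words[i] = f"({len(answers)})______"
    return " ".join(words), answers

# Toy passage so the deletion pattern is easy to see
passage = "one two three four five six seven eight nine ten"
gapped, key = make_cloze(passage, n=3)
print(gapped)
print(key)
```

A selected-deletion cloze would instead let the test writer choose which words to gap (e.g. only content words), and a C-test would delete the second half of every second word; both are small variations on the same mechanism.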
TESTS FOR YOUNG LEARNERS: TIPS
- Testing-assessment-teaching
- Feedback
- Self-assessment
- Washback
- Short tasks
- Use stories and games
- Use pictures and colour
- Don't forget that children are still developing L1 and cognitive abilities
- Include interaction
- Use colour and drawing
- Use cartoon stories
- Long warm-ups in speaking
- Use cards with pictures