7. Assessment and the CEFR
TRANSCRIPT
Assessment: Assessment of the proficiency of the language user
Three key concepts:
• Validity: the information gained is an accurate representation of the proficiency of the candidates
• Reliability: a student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate, real or simulated, administrations of the same assessment)
• Feasibility: the procedure needs to be practical, adapted to the available elements and features
If we want assessment to be valid, reliable, and feasible, we need to specify:
• What is assessed: according to the CEFR, communicative activities (contexts, texts, and tasks). See examples.
• How performance is interpreted: assessment criteria. See examples.
• How to make comparisons between different tests and ways of assessment (for example, between public examinations and teacher assessment). Two main procedures:
  - Social moderation: discussion between experts
  - Benchmarking: comparison of samples in relation to standardized definitions and examples, which become reference points (benchmarks)
• Guidelines for good practice: EALTA
TYPES OF ASSESSMENT
1. Achievement assessment / Proficiency assessment
2. Norm-referencing (NR) / Criterion-referencing (CR)
3. Mastery learning CR / Continuum CR
4. Continuous assessment / Fixed assessment points
5. Formative assessment / Summative assessment
6. Direct assessment / Indirect assessment
7. Performance assessment / Knowledge assessment
8. Subjective assessment / Objective assessment
9. Checklist rating / Performance rating
10. Impression / Guided judgement
11. Holistic assessment / Analytic assessment
12. Series assessment / Category assessment
13. Assessment by others / Self-assessment
Types of tests:
• Proficiency tests
• Achievement tests. Two approaches:
  - To base achievement tests on the textbook/syllabus
  - To base them on course objectives. More beneficial washback.
• Diagnostic tests
• Placement tests
Validity Types:
• Construct validity (very general: the information gained is an accurate representation of the proficiency of the candidate. It checks the validity of the construct, the thing we want to measure)
• Content validity. This checks whether the test's content is a representative sample of the skills or structures that it wants to measure. In order to check this we need a complete specification of all the skills or structures we want to cover. If it covers only 5%, it has less content validity than if it covers 25%.
Validity Types:
• Criterion-related validity: results on the test agree with other dependable results (criterion test)
  - Concurrent validity: we compare the test results with the criterion test.
  - Predictive validity: a placement test is validated by the teachers who teach the selected students.
• Validity in scoring. Not only the items need to be valid, but also the way in which responses are scored (taking into account grammar mistakes in a reading comprehension exam is not valid).
• Face validity: the test has to look as if it measures what it is supposed to measure. A written test to check pronunciation has little face validity.
How to make tests more valid (Hughes):
- Write specifications for the test.
- Include a representative sample of the content of the specifications in the test.
- Whenever feasible, use direct testing.
- Make sure that the scoring relates directly to what is being tested.
- Try to make the test reliable.
Reliability: a student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate, real or simulated, administrations of the same assessment).
- We compare two tests. Methods:
  - Test-Retest: the student takes the same test again
  - Alternate Forms: the students take two alternate forms of the same test
  - Split-Half: you split the test into two equivalent halves and compare them as if they were two different tests.
- Reliability coefficient / Standard Error of Measurement
A High Stakes Test needs a high reliability coefficient (highest is 1), and therefore a very low standard error of measurement (a number obtained by statistical analysis). A Lower Stakes exam does not need those coefficients.
- True Score: the real score that a student would get in a perfectly reliable test. In a very reliable test, the true score is clearly defined (the student will always get a similar result, for example 65-67). In a less reliable test, the range is wider (55-75).
- Scorer reliability (coefficient). You compare the scores given by different scorers (examiners). The more agreement, the higher their scorer reliability coefficient.
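The statistics above (split-half reliability with the Spearman-Brown correction, the standard error of measurement, and correlation between scorers) can be sketched in a few lines of Python. The function names and the candidate scores below are invented for illustration, not part of the course materials.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two lists of scores.

    Also usable as a scorer reliability coefficient when xs and ys
    are the marks two examiners gave the same candidates."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(half_a, half_b):
    """Correlate candidates' scores on the two halves, then apply the
    Spearman-Brown correction to estimate the reliability coefficient
    of the full-length test."""
    r = pearson(half_a, half_b)
    return 2 * r / (1 + r)

def standard_error_of_measurement(scores, reliability):
    """SEM = standard deviation * sqrt(1 - reliability).

    A high reliability coefficient gives a low SEM, so a candidate's
    true score sits in a narrow band around the observed score."""
    return statistics.pstdev(scores) * (1 - reliability) ** 0.5

# Invented scores of six candidates on the two halves of one test
half_a = [30, 27, 25, 22, 18, 14]
half_b = [28, 28, 24, 20, 19, 15]
total = [a + b for a, b in zip(half_a, half_b)]

rel = split_half_reliability(half_a, half_b)
sem = standard_error_of_measurement(total, rel)
print(f"reliability = {rel:.2f}, SEM = {sem:.2f}")
```

With a perfectly reliable test the SEM is zero; as reliability drops, the band around the true score widens, which is exactly the 65-67 versus 55-75 contrast described above.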
Item analysis:
- Facility value
- Discrimination indices: drop some items, improve others
- Analyse distractors
- Item banking
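The two item-analysis statistics can be sketched as follows; the 0/1 response data, the total scores, and the one-third grouping are my own illustrative assumptions (analyses of this kind commonly use the top and bottom thirds, but the course does not fix a fraction).

```python
def facility_value(item_responses):
    """Proportion of candidates who answered the item correctly.

    item_responses: list of 1 (correct) / 0 (incorrect)."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, group_frac=1 / 3):
    """D = facility value in the top-scoring group minus facility value
    in the bottom-scoring group.

    Items with a low or negative D do not discriminate well between
    strong and weak candidates: drop or improve them."""
    n = max(1, round(len(total_scores) * group_frac))
    # Rank candidates by total score, then take the two extreme groups
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    bottom, top = order[:n], order[-n:]
    fv_top = sum(item_responses[i] for i in top) / n
    fv_bottom = sum(item_responses[i] for i in bottom) / n
    return fv_top - fv_bottom

# Invented data: one item's responses and six candidates' total scores
responses = [1, 1, 1, 0, 1, 0]
totals = [90, 80, 70, 30, 20, 10]
print(facility_value(responses))
print(discrimination_index(responses, totals))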
How to make tests more reliable (Hughes):
1. Take enough samples of behaviour.
2. Exclude items which do not discriminate well.
3. Do not allow candidates too much freedom.
4. Write unambiguous items.
5. Provide clear and explicit instructions.
6. Ensure that tests are well laid out and perfectly legible.
7. Make candidates familiar with format and testing techniques.
8. Provide uniform and non-distracting conditions of administration.
9. Use items which permit scoring which is as objective as possible.
10. Make comparisons between candidates as direct as possible.
11. Provide a detailed scoring key.
12. Train scorers.
13. Agree acceptable responses and appropriate scores at the beginning of the scoring process.
14. Identify candidates by number, not by name.
15. Employ multiple, independent scorers.
To be valid, a test must be reliable (provide consistent measurement).
A reliable test may not be valid at all (technically perfect, but globally wrong: it does not test what it is supposed to test).
Washback/Backwash: how to achieve beneficial washback
- Test the abilities/skills you want to encourage.
- Sample widely and unpredictably.
- Use direct testing.
- Make testing criterion-referenced (CEFR).
- Base achievement tests on objectives.
- Ensure that the test is known and understood by students and teachers.
- Count the cost.
Stages of test development:
1. Make a full and clear statement of the testing 'problem'.
2. Write complete specifications for the test.
3. Write and moderate items.
4. Trial the items informally on native speakers and reject or modify problematic ones as necessary.
5. Trial the test on a group of non-native speakers similar to those for whom the test is intended.
6. Analyse the results of the trial and make any necessary changes.
7. Calibrate scales.
8. Validate.
9. Write handbooks for test takers, test users and staff.
10. Train any necessary staff (interviewers, raters, etc.).
Chapters from Hughes’ Testing for Language Teachers
9. Testing Writing
10. Testing Oral Abilities
11. Testing Reading
12. Testing Listening
13. Testing Grammar and Vocabulary
14. Testing Overall Ability
15. Tests for Young Learners
TESTING WRITING
1. Set representative tasks
   1. Specify all possible content
   2. Include a representative sample of the specified content
2. Elicit valid samples of writing ability
   1. Set as many separate tasks as feasible
   2. Test only writing ability and nothing else
   3. Restrict candidates
3. Ensure valid and reliable scoring:
   1. Set as many tasks as possible
   2. Restrict candidates
   3. Give no choice of tasks
   4. Ensure long enough samples
   5. Create appropriate scales for scoring: HOLISTIC/ANALYTIC
   6. Calibrate the scale to be used
   7. Select and train scorers
   8. Follow acceptable scoring procedures
TESTING ORAL ABILITIES
• “The most highly prized language skill”, Lado's Language Testing (1961).
• Challenges: ephemeral, intangible.
• Contrast US/UK: the Certificate of Proficiency in English (1913) already included it; TOEFL only in the 2005 iBT.
• Key notion: not accent, but intelligibility.
• Very different approaches:
  - Indirect
  - Direct (Cambridge, EOIs) or semi-direct (TOEFL iBT, OTE, Aptis). Conflict with the American tradition.
  - The future? Fully automated L2 speaking tests: Versant test, SpeechRater.
• Not only speaking, also interaction.
1. Set representative tasks
   1. Specify all possible content
   2. Include a representative sample of the specified content
2. Elicit valid samples of oral ability
   1. Techniques:
      1. Interview: questions, pictures, role play, interpreting (L1 to L2), prepared monologue, reading aloud
      2. Interaction: discussion, role play
      3. Responses to audio- or video-recordings (semi-direct)
   2. Plan and structure the test carefully:
      1. Make the oral test as long as is feasible
      2. Plan the test carefully
      3. As many tasks (“fresh starts”) as possible
      4. Use a second tester
      5. Set only tasks that candidates could do easily in L1
      6. Quiet room with good acoustics
      7. Put candidates at ease (easy questions at first, not assessed; problem with note-taking?)
      8. Collect enough relevant information
      9. Do not talk too much
      10. Select interviewers carefully and train them
3. Ensure valid and reliable scoring:
   1. Create appropriate scales for scoring: HOLISTIC/ANALYTIC. Calibrate the scale to be used
   2. Select and train scorers (different from interviewers if possible)
   3. Follow acceptable scoring procedures
TESTING READING
PROBLEMS: indirect assessment. We read in very different ways: scanning, skimming, inferring, intensive and extensive reading…
SOME TIPS
- As many texts and operations as possible (Dialang)
- Avoid texts which deal with general knowledge
- Avoid disturbing topics, or texts students might have read
- Use authentic texts
- Techniques: better short answer and gap filling than multiple choice
- Task difficulty can be lower than text difficulty
- Items should follow the order of the text
- Make items independent of each other
- Do not take into account errors of grammar or spelling
TESTING LISTENING
PROBLEMS
- As in reading: indirect assessment and the different ways of listening
- As in speaking: the transient nature of speech
  http://www.usingenglish.com/articles/why-your-students-have-problems-with-listening-comprehension.html
TIPS
- Same as in reading
- If a recording is used, make it as natural as possible
- Items should be far apart in the text
- Give students time to become familiar with the tasks
- Techniques: apart from multiple choice, short answers and gap filling, information transfer, note taking, partial dictation, transcription
Moderation is essential. How many times?
GRAMMAR
- Why? Easy to test; content validity
- Why not? Harmful washback effect. It depends on the type of test.
- Specifications: from the Council of Europe books
- Techniques: gap filling, rephrasings, completion
- Don't penalize mistakes that were not tested (a missing -s if the item is testing relatives, for example)
VOCABULARY
- Why (not)?
- Specifications: use frequency considerations
- Techniques:
  - Recognition: recognise synonyms, recognise definitions, recognise the appropriate word for a context
  - Production: pictures, definitions, gap filling
TESTING OVERALL ABILITY
Useful in particular tests where washback is not important.
- Cloze test (from “closure”). Based on the idea of “reduced redundancy”. Subtypes: selected-deletion cloze, conversational cloze
- C-Tests
- Dictation
Main problem: horrible washback effect.
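A fixed-ratio deletion cloze of this kind can be generated mechanically. The sketch below is my own minimal illustration (function name, parameters, and passage are invented, not from the course): it leaves a short lead-in intact, then replaces every n-th word with a numbered gap and keeps an answer key, relying on the redundancy of the remaining text.

```python
def make_cloze(text, n=7, lead_in=2):
    """Leave the first `lead_in` words intact, then replace every
    n-th word with a numbered gap; return the gapped text and the
    answer key. Illustrates 'reduced redundancy' testing."""
    words = text.split()
    answers = []
    for i in range(lead_in, len(words), n):
        answers.append(words[i])          # record the deleted word
        words[i] = f"({len(answers)})______"
    return " ".join(words), answers

# Toy passage so the deletion pattern is easy to see
passage = "one two three four five six seven eight nine ten"
gapped, key = make_cloze(passage, n=3)
print(gapped)
print(key)
```

A selected-deletion cloze would instead let the test writer choose which words to gap (e.g. only content words), and a C-test would delete the second half of every second word; both are small variations on the same mechanism.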
TESTS FOR YOUNG LEARNERS: TIPS
- Testing-assessment-teaching
- Feedback
- Self-assessment
- Washback
- Short tasks
- Use stories and games
- Use pictures and colour
- Don't forget that children are still developing L1 and cognitive abilities
- Include interaction
- Use colour and drawing
- Use cartoon stories
- Long warm-ups in speaking
- Use cards with pictures