Assessing Speaking, Part 2 (17.01.14)
DESCRIPTION
A state-of-the-art article for learning about assessing speaking.

TRANSCRIPT
Part 1: Defining Speaking for Assessment Purposes
Part 2: Assessing Speaking: Challenges and Solutions
Dr. Sari Luoma
High quality implementation: two processes
The performance process: interlocutor(s) and examinee(s) work through the task(s) to produce the performance(s)
The assessment process: rater(s) apply the criteria to the performance(s)
Challenges in assessment
Reliability
Validity
Fairness
Impact & Washback
Practicality
Reliability
Consistency and precision of measurement
Measuring length is an idealized analogy
Instrument with international agreement
Instrument used for its intended purpose
A better analogy for language assessment, especially speaking assessment
Measuring your waist
When? Morning/night, before/after a meal, before/after a vacation
How, exactly?
Tape measure, reliable units, perhaps multiple samples
Reliability for assessing speaking
Ensure that we have measurement
Appropriate tasks, evaluation criteria and scales
Appropriate test administration & scoring processes
Consistency of measurement
Examinee performance does not vary by examiner
Tasks provide an appropriate, realistic & useful sample
Examinee’s rating does not vary by rater
Internal consistency (rater agrees with him- or herself)
Inter-rater agreement (rater gives scores that are comparable to other raters doing the same job)
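As a concrete illustration of these two kinds of rater consistency, the following sketch (not part of the original lecture) computes exact agreement and unweighted Cohen's kappa for two raters; the rater data and the CEFR-style band labels are hypothetical.

from collections import Counter

def exact_agreement(scores_a, scores_b):
    """Proportion of examinees given the same band by both raters."""
    same = sum(a == b for a, b in zip(scores_a, scores_b))
    return same / len(scores_a)

def cohens_kappa(scores_a, scores_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(scores_a)
    p_o = exact_agreement(scores_a, scores_b)
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(scores_a) | set(scores_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical bands awarded to ten examinees by two raters
rater_1 = ["A2", "B1", "B1", "B2", "A2", "B1", "B2", "B1", "A2", "B1"]
rater_2 = ["A2", "B1", "B2", "B2", "A2", "B1", "B1", "B1", "A2", "B1"]

print(f"Exact agreement: {exact_agreement(rater_1, rater_2):.2f}")  # 0.80
print(f"Cohen's kappa:   {cohens_kappa(rater_1, rater_2):.2f}")     # 0.68

Internal consistency could be checked the same way, by comparing a single rater's scores for the same performances rated on two occasions.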
What supports reliability? (the test developer perspective)
Clear test specifications to ensure items and test forms are comparable conceptually
Appropriate sampling in pilot/field testing to ensure that all relevant aspects of target test takers are represented in test construction and norming
Clear instructions for the examiners and examinees
Clear rubrics and thorough training of raters
What supports reliability? (examiner and rater perspective)
Consistent examining practices
Asking initial and follow-up questions comparably
Using the same, fair and consistent processes for when to stop asking follow-up questions (within and between examiners)
“Friendly and fair” attitude
Cultural sensitivity
Consistent rating practices
Paying attention to all analytic criteria fairly across all examinees
Rating the communication, not the person
Important for speaking tests: decision consistency
Likelihood that the same (significant) classification decision would be made if the test takers were tested again
e.g., A2 vs. B1; both examiner and rater perspectives
Important for high-stakes tests that are used to make classification decisions
Important particularly at the significant cut score(s), such as pass/fail, proficient/ below proficient, or qualified/not qualified for a job
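To make decision consistency concrete, here is a minimal sketch (not from the lecture) that estimates it at a single cut score as the proportion of test takers classified the same way on two occasions; the scores, the A2/B1 boundary, and the 3.5 cut value are hypothetical.

def classify(score, cut=3.5):
    """Hypothetical rule: scores at or above the cut count as B1, below as A2."""
    return "B1" if score >= cut else "A2"

def decision_consistency(occasion_1, occasion_2, cut=3.5):
    """Proportion of examinees receiving the same classification on both occasions."""
    same = sum(classify(s1, cut) == classify(s2, cut)
               for s1, s2 in zip(occasion_1, occasion_2))
    return same / len(occasion_1)

# Hypothetical scale scores for eight examinees, tested on two occasions
first_occasion  = [3.0, 4.0, 3.5, 2.5, 4.5, 3.5, 3.0, 4.0]
second_occasion = [3.5, 4.0, 3.0, 2.5, 4.5, 4.0, 3.0, 3.5]

print(f"Decision consistency: {decision_consistency(first_occasion, second_occasion):.2f}")  # 0.75

Note that in this example the inconsistent classifications come from examinees scoring near the cut, which is why consistency matters most around the significant cut score(s).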
What does this mean for raters?
Know what the target of measurement is
Take care to provide consistent scores
With yourself, across occasions
With other raters
Keep referring to the rating scale and anchor samples
If you have questions or concerns, talk about them
You are probably not the only one with concerns
You are part of the measurement process, and your contribution can make it better, both when you are working conscientiously and when you raise questions
Validity
Related to values and legitimacy
E.g., valid forms of payment, a valid ticket
In testing, related to meaningfulness and usefulness
Specifically: How well the interpretation and use of scores can be supported by evidence and rationales
Validity is not a property of a test, but of a proposed score interpretation in a particular setting
Need to challenge proposed interpretations and consider alternate interpretations
Threats: construct underrepresentation and construct-irrelevant variance
‘A little history’ on validity (Kane, 2012)
Early 1900s: content considerations, if anything
Mid-1900s: the criterion model
“How well an assessment does the job it is employed to do” (Cureton, 1951)
Second half of the century: the content model
Content-based inferences are legitimate if sampling is representative, evaluation is appropriate and fair, and the sample is large enough to control sampling error (Guion, 1977)
1970s and 80s: the construct model
Correspondence between the test and its scores on the one hand and the theory of ability on the other
1980s: construct validation as the basis for a unified model of validity
Argument-based approach to validation (Kane, 2006, 2012)
Two basic questions:
What is being claimed (about the meaning of scores)?
Are these claims warranted, given all the evidence?
Strong conclusions require more evidence than weaker conclusions
Validation tends to have two stages
In the development stage: develop assessment procedures that support the proposed interpretation and evidence to support the proposed use
In the appraisal stage (after development has been completed and test use begins): evaluate the extent to which the proposed interpretations are plausible and appropriate
Important points about validity
Validation evidence comes from:
test content (content-construct relationship)
response processes (theoretical and empirical analyses)
internal structure (relationships among test items and components)
relations to other variables (criteria that the test is expected to predict, other tests/scores – convergent and discriminant evidence)
consequences of testing (e.g., test bias; whether expected benefits are likely)
Focus on test purpose and score use
AERA, APA, & NCME Standards for Educational and Psychological Testing
What does this mean for examiners and raters?
A lot of the validity analyses are conducted at the system level (test publisher, test users, ministries…)
However, as participants in the testing and rating processes, examiners and raters are part of the construct
Construct: theory of ability/proficiency underlying the test, and its operationalization in testing and scoring
Construct underrepresentation: if the testing and scoring processes omit important elements of the underlying theory
Construct-irrelevant variance: if the scores are affected by factors not related to the intended construct
Fairness
Absence of bias
Equitable treatment of all examinees
Equality of outcomes regardless of subgroup membership (fairness in selection & prediction)
Equity in opportunity to learn
What is bias?
Construct-irrelevant components that result in systematically lower or higher scores for identifiable subgroups of test takers
Inappropriate sampling of test content
Lack of clarity in test instructions for identifiable subgroups
Incomplete scoring criteria, or biased application of criteria
Types of bias
Item bias / Differential Item Functioning (DIF)
Predictive bias (patterns of association for different groups between scores and other variables)
To provide evidence about lack of bias, use fairness review and analyze and report item performance and test use by sub-group
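As one simplified illustration of such a subgroup analysis (not from the lecture), the sketch below compares item success rates for two subgroups after matching examinees on an overall ability band, so that real proficiency differences are not mistaken for item bias; operational DIF analyses typically use more formal procedures (e.g., Mantel-Haenszel or logistic regression), and the group labels, bands, and the 0.10 flag threshold here are invented.

from collections import defaultdict

# Each record: (subgroup, ability_band, item_score) for a single item,
# where item_score is 1 if the examinee succeeded on the item, else 0.
# All values are hypothetical.
responses = [
    ("group_A", "B1", 1), ("group_A", "B1", 1), ("group_A", "B1", 0),
    ("group_B", "B1", 1), ("group_B", "B1", 0), ("group_B", "B1", 0),
    ("group_A", "B2", 1), ("group_A", "B2", 1),
    ("group_B", "B2", 1), ("group_B", "B2", 1),
]

by_band = defaultdict(lambda: defaultdict(list))
for group, band, score in responses:
    by_band[band][group].append(score)

for band, groups in sorted(by_band.items()):
    rates = {g: sum(s) / len(s) for g, s in groups.items()}
    gap = abs(rates["group_A"] - rates["group_B"])
    status = "flag for fairness review" if gap > 0.10 else "ok"
    print(f"band {band}: gap in success rate = {gap:.2f} ({status})")

Predictive bias would be examined differently, by checking whether the relationship between test scores and the criterion measure is comparable across groups.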
What is equitable treatment?
All test takers must be given a comparable opportunity to demonstrate their standing on the construct(s)
Appropriate testing conditions (standardization)
Equal opportunity to become familiar with the test format, practice materials, etc.
Appropriate and fully informative score reporting
Especially relevant for disabled test takers and, in general educational testing, language minorities
Treated through Universal Design, careful wording and design of items, and testing accommodations
What is equality of outcomes?
NOT comparability of passing rates across groups
But persons who would perform equally well on a criterion measure should achieve a similar score on the assessment
Unfortunately, this is hard to determine
Should unequal testing outcomes for subgroups signal a need for further analysis? Maybe, for generating new hypotheses about bias.
Standards: evaluate the fairness of a test relative to that of nontest alternatives that might be used instead
What is opportunity to learn (OTL)?
Connected with educational achievement testing: what individuals know and can do as a result of formal education
When some test takers have not had OTL…
Complex decisions, e.g. in connection with awarding/withholding a certificate, job, citizenship…
Standard 13.5: … there should be evidence that the test adequately covers only the specific or generalized content and skills that students have had an opportunity to learn
What is impact (and washback)?
The influence, positive or negative, that assessments have on individuals, institutions, and society
• Includes washback – effects of assessment on teaching and learning (both good and bad)
• “the extent to which the introduction and use of a test influences language teachers and learners to do things they would not otherwise do that promote or inhibit language learning” (Messick 1996: 241)
Principles
Aim to ensure positive impact and avoid negative consequences
Often impressionistic rather than empirical
“It is generally accepted that ‘high-stakes’ tests … will influence the way that students and teachers behave as well as their perceptions of their own abilities and worth.” (Wall, 1997)
Alderson & Wall (1993) Does washback exist?
To investigate: list intended uses and potential consequences, good and bad, and collect data (Bachman & Palmer, 1996)
What do fairness concerns imply for examiners and raters?
Examiners and raters are an integral part of the fairness of the assessment system
Awareness
Openness
Discussion
Practicality: Return on Investment
Developer view (Bachman & Palmer, 1996)
Test user view
Amount of effort required to implement
Dealing with impracticality: how much trouble is enough to require a revision (vs. other solutions, e.g. more training)
Comparison with alternatives, or not testing at all
Related to sustainability
Important consideration during development
Practicality = Available resources / Required resources
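For illustration only (hypothetical figures, not from the lecture): if a speaking test design requires 120 examiner-hours per administration but only 80 examiner-hours are available, practicality = 80 / 120 ≈ 0.67; a value below 1 signals that the design needs revision, additional resources, or another solution such as more training or more efficient rating.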
[Diagram: Assessment Quality in relation to Purpose and Use]
The Cycle of Speaking Assessment: Formal Assessment (“Testing”)
[Diagram: a cycle running from score need through system development (purpose, design, specs, QA/QC), administration/performance (examiner(s), test taker(s), tasks, instructions), and rating/evaluation (raters, criteria, performances) to scores and score use]
Why this speaking assessment? What speaking skills will be assessed?
Purposeful design
High quality implementation (of both administration and rating)
Responsible score use