Assessing Speaking, Part 2 (17.01.14)
DESCRIPTION
A state-of-the-art article for learning about assessing speaking.

TRANSCRIPT
Part 1: Defining Speaking for Assessment Purposes
Part 2: Assessing Speaking: Challenges and Solutions
Dr. Sari Luoma
High quality implementation: two processes
The performance process: interlocutor(s) and examinee(s) work through the task(s) to produce the performance(s)
The assessment process: rater(s) apply the criteria to the performance(s)
Challenges in assessment
Reliability
Validity
Fairness
Impact & Washback
Practicality
Reliability
Consistency and precision of measurement
Measuring length is an idealized analogy
Instrument with international agreement
Instrument used for its intended purpose
A better analogy for language assessment, especially speaking assessment
Measuring your waist
When? Morning/night, before/after a meal, before/after a vacation
How, exactly?
Tape measure, reliable units, perhaps multiple samples
Reliability for assessing speaking
Ensure that we have measurement
Appropriate tasks, evaluation criteria and scales
Appropriate test administration & scoring processes
Consistency of measurement
Examinee performance does not vary by examiner
Tasks provide an appropriate, realistic & useful sample
Examinee’s rating does not vary by rater
Internal consistency (rater agrees with him- or herself)
Inter-rater agreement (rater gives scores that are comparable to other raters doing the same job)
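As a concrete illustration of these two kinds of rater consistency, the following sketch (not part of the original lecture) computes exact agreement and unweighted Cohen's kappa for two raters; the rater data and the CEFR-style band labels are hypothetical.

from collections import Counter

def exact_agreement(scores_a, scores_b):
    """Proportion of examinees given the same band by both raters."""
    same = sum(a == b for a, b in zip(scores_a, scores_b))
    return same / len(scores_a)

def cohens_kappa(scores_a, scores_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(scores_a)
    p_o = exact_agreement(scores_a, scores_b)
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(scores_a) | set(scores_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical bands awarded to ten examinees by two raters
rater_1 = ["A2", "B1", "B1", "B2", "A2", "B1", "B2", "B1", "A2", "B1"]
rater_2 = ["A2", "B1", "B2", "B2", "A2", "B1", "B1", "B1", "A2", "B1"]

print(f"Exact agreement: {exact_agreement(rater_1, rater_2):.2f}")  # 0.80
print(f"Cohen's kappa:   {cohens_kappa(rater_1, rater_2):.2f}")     # 0.68

Internal consistency could be checked the same way, by comparing a single rater's scores for the same performances rated on two occasions.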
What supports reliability? (the test developer perspective)
Clear test specifications to ensure items and test forms are comparable conceptually
Appropriate sampling in pilot/field testing to ensure that all relevant aspects of target test takers are represented in test construction and norming
Clear instructions for the examiners and examinees
Clear rubrics and thorough training of raters
What supports reliability? (examiner and rater perspective)
Consistent examining practices
Asking initial and follow-up questions comparably
Using the same, fair and consistent processes for when to stop asking follow-up questions (within and between examiners)
“Friendly and fair” attitude
Cultural sensitivity
Consistent rating practices
Paying attention to all analytic criteria fairly across all examinees
Rating the communication, not the person
Important for speaking tests: decision consistency
Likelihood that the same (significant) classification decision would be made if the test takers were tested again
e.g., A2 vs. B1; both examiner and rater perspectives
Important for high-stakes tests that are used to make classification decisions
Important particularly at the significant cut score(s), such as pass/fail, proficient/ below proficient, or qualified/not qualified for a job
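To make decision consistency concrete, here is a minimal sketch (not from the lecture) that estimates it at a single cut score as the proportion of test takers classified the same way on two occasions; the scores, the A2/B1 boundary, and the 3.5 cut value are hypothetical.

def classify(score, cut=3.5):
    """Hypothetical rule: scores at or above the cut count as B1, below as A2."""
    return "B1" if score >= cut else "A2"

def decision_consistency(occasion_1, occasion_2, cut=3.5):
    """Proportion of examinees receiving the same classification on both occasions."""
    same = sum(classify(s1, cut) == classify(s2, cut)
               for s1, s2 in zip(occasion_1, occasion_2))
    return same / len(occasion_1)

# Hypothetical scale scores for eight examinees, tested on two occasions
first_occasion  = [3.0, 4.0, 3.5, 2.5, 4.5, 3.5, 3.0, 4.0]
second_occasion = [3.5, 4.0, 3.0, 2.5, 4.5, 4.0, 3.0, 3.5]

print(f"Decision consistency: {decision_consistency(first_occasion, second_occasion):.2f}")  # 0.75

Note that in this example the inconsistent classifications come from examinees scoring near the cut, which is why consistency matters most around the significant cut score(s).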
What does this mean for raters?
Know what the target of measurement is
Take care to provide consistent scores
With yourself, across occasions
With other raters
Keep referring to the rating scale and anchor samples
If you have questions or concerns, talk about them
You are probably not the only one with concerns
You are part of the measurement process, and your contribution can make it better, both when you are working conscientiously and when you raise questions
Validity
Related to values and legitimacy
E.g., valid forms of payment, a valid ticket
In testing, related to meaningfulness and usefulness
Specifically: How well the interpretation and use of scores can be supported by evidence and rationales
Validity is not a property of a test, but of a proposed score interpretation in a particular setting
Need to challenge proposed interpretations and consider alternate interpretations
Threats: construct underrepresentation and construct-irrelevant variance
‘A little history’ on validity (Kane, 2012)
Early 1900s: content considerations, if anything
Mid-1900s: the criterion model
“How well an assessment does the job it is employed to do” (Cureton, 1951)
Second half of the century: the content model
Content-based inferences are legitimate if sampling is representative, evaluation is appropriate and fair, and the sample is large enough to control sampling error (Guion, 1977)
1970s and 80s: the construct model
Correspondence between the test and its scores on the one hand and the theory of ability on the other
1980s: construct validation as the basis for a unified model of validity
Argument-based approach to validation (Kane, 2006, 2012)
Two basic questions:
What is being claimed (about the meaning of scores)?
Are these claims warranted, given all the evidence?
Strong conclusions require more evidence than weaker conclusions
Validation tends to have two stages
In the development stage: develop assessment procedures that support the proposed interpretation and evidence to support the proposed use
In the appraisal stage (after development has been completed and test use begins): evaluate the extent to which the proposed interpretations are plausible and appropriate
Important points about validity
Validation evidence comes from:
test content (content-construct relationship)
response processes (theoretical and empirical analyses)
internal structure (relationships among test items and components)
relations to other variables (criteria that the test is expected to predict, other tests/scores – convergent and discriminant evidence)
consequences of testing (e.g., test bias; whether expected benefits are likely)
Focus on test purpose and score use
AERA, APA, & NCME Standards for Educational and Psychological Testing
What does this mean for examiners and raters?
A lot of the validity analyses are conducted at the system level (test publisher, test users, ministries…)
However, as participants in the testing and rating processes, examiners and raters are part of the construct
Construct: theory of ability/proficiency underlying the test, and its operationalization in testing and scoring
Construct underrepresentation: if the testing and scoring processes omit important elements of the underlying theory
Construct-irrelevant variance: if the scores are affected by factors not related to the intended construct
Fairness
Absence of bias
Equitable treatment of all examinees
Equality of outcomes regardless of subgroup membership (fairness in selection & prediction)
Equity in opportunity to learn
What is bias?
Construct-irrelevant components that result in systematically lower or higher scores for identifiable subgroups of test takers
Inappropriate sampling of test content
Lack of clarity in test instructions for identifiable subgroups
Incomplete scoring criteria, or biased application of criteria
Types of bias
Item bias / Differential Item Functioning (DIF)
Predictive bias (patterns of association for different groups between scores and other variables)
To provide evidence about lack of bias, use fairness review and analyze and report item performance and test use by sub-group
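As one simplified illustration of such a subgroup analysis (not from the lecture), the sketch below compares item success rates for two subgroups after matching examinees on an overall ability band, so that real proficiency differences are not mistaken for item bias; operational DIF analyses typically use more formal procedures (e.g., Mantel-Haenszel or logistic regression), and the group labels, bands, and the 0.10 flag threshold here are invented.

from collections import defaultdict

# Each record: (subgroup, ability_band, item_score) for a single item,
# where item_score is 1 if the examinee succeeded on the item, else 0.
# All values are hypothetical.
responses = [
    ("group_A", "B1", 1), ("group_A", "B1", 1), ("group_A", "B1", 0),
    ("group_B", "B1", 1), ("group_B", "B1", 0), ("group_B", "B1", 0),
    ("group_A", "B2", 1), ("group_A", "B2", 1),
    ("group_B", "B2", 1), ("group_B", "B2", 1),
]

by_band = defaultdict(lambda: defaultdict(list))
for group, band, score in responses:
    by_band[band][group].append(score)

for band, groups in sorted(by_band.items()):
    rates = {g: sum(s) / len(s) for g, s in groups.items()}
    gap = abs(rates["group_A"] - rates["group_B"])
    status = "flag for fairness review" if gap > 0.10 else "ok"
    print(f"band {band}: gap in success rate = {gap:.2f} ({status})")

Predictive bias would be examined differently, by checking whether the relationship between test scores and the criterion measure is comparable across groups.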
What is equitable treatment?
All test takers must be given a comparable opportunity to demonstrate their standing on the construct(s)
Appropriate testing conditions (standardization)
Equal opportunity to become familiar with the test format, practice materials, etc.
Appropriate and fully informative score reporting
Especially relevant for disabled test takers and, in general educational testing, language minorities
Treated through Universal Design, careful wording and design of items, and testing accommodations
What is equality of outcomes?
NOT comparability of passing rates across groups
But persons who would perform equally well on a criterion measure should achieve a similar score on the assessment
Unfortunately, this is hard to determine
Should unequal testing outcomes for subgroups signal a need for further analysis? Maybe, for generating new hypotheses about bias.
Standards: evaluate the fairness of a test relative to that of nontest alternatives that might be used instead
What is opportunity to learn (OTL)?
Connected with educational achievement testing: what individuals know and can do as a result of formal education
When some test takers have not had OTL…
Complex decisions, e.g. in connection with awarding/withholding a certificate, job, citizenship…
Standard 13.5: … there should be evidence that the test adequately covers only the specific or generalized content and skills that students have had an opportunity to learn
What is impact (and washback)?
The influence, positive or negative, that assessments have on individuals, institutions, and society
• Includes washback – effects of assessment on teaching and learning (both good and bad)
• “the extent to which the introduction and use of a test influences language teachers and learners to do things they would not otherwise do that promote or inhibit language learning” (Messick 1996: 241)
Principles
Aim to ensure positive impact and avoid negative consequences
Often impressionistic rather than empirical
“It is generally accepted that ‘high-stakes’ tests … will influence the way that students and teachers behave as well as their perceptions of their own abilities and worth.” (Wall, 1997)
Alderson & Wall (1993) Does washback exist?
To investigate: list intended uses and potential consequences, good and bad, and collect data (Bachman & Palmer, 1996)
What do fairness concerns imply for examiners and raters?
Examiners and raters are an integral part of the fairness of the assessment system
Awareness
Openness
Discussion
Practicality: Return on Investment
Developer view (Bachman & Palmer, 1996)
Test user view
Amount of effort required to implement
Dealing with impracticality: how much trouble is enough to require a revision (vs. other solutions, e.g. more training)
Comparison with alternatives, or not testing at all
Related to sustainability
Important consideration during development
Practicality = Available resources / Required resources
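For illustration only (hypothetical figures, not from the lecture): if a speaking test design requires 120 examiner-hours per administration but only 80 examiner-hours are available, practicality = 80 / 120 ≈ 0.67; a value below 1 signals that the design needs revision, additional resources, or another solution such as more training or more efficient rating.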
[Diagram: Assessment Quality in relation to Purpose and Use]
The Cycle of Speaking Assessment: Formal Assessment (“Testing”)
[Diagram: a cycle running from score need through system development (purpose, design, specs, QA/QC), administration/performance (examiner(s), test taker(s), tasks, instructions), and rating/evaluation (raters, criteria, performances) to scores and score use]
Why this speaking assessment? What speaking skills will be assessed?
Purposeful design
High quality implementation (of both administration and rating)
Responsible score use