IELTS RESEARCH REPORTS VOLUME 6
PUBLISHING DETAILS

IELTS RESEARCH REPORTS VOLUME 6, 2006

Published by: IELTS Australia and British Council
Project Managers: Jenny Osborne, IELTS Australia; Uyen Tran, British Council
Acknowledgements: Dr Lynda Taylor, University of Cambridge ESOL Examinations; Dr Anthony Green, University of Cambridge ESOL Examinations
Editors: Petronella McGovern; Dr Steve Walsh

British Council                          IELTS Australia Pty Limited
Bridgewater House                        ABN 84 008 664 766 (incorporated in the ACT)
58 Whitworth Street                      GPO Box 2006, Canberra, ACT, 2601, Australia
Tel 44 161 957 7755                      Tel 61 2 6285 8222
Fax 44 161 957 7762                      Fax 61 2 6285 3233
Email [email protected]                  Email [email protected]
Web www.ielts.org                        Web www.ielts.org

© British Council 2006                   © IELTS Australia Pty Limited 2006

This publication is copyright. Apart from any fair dealing for the purposes of private study, research, criticism or review, as permitted under Division 4 of the Copyright Act 1968 and equivalent provisions in the UK Copyright, Designs and Patents Act 1988, no part may be reproduced or copied in any form or by any means (graphic, electronic or mechanical, including recording or information retrieval systems) by any process without the written permission of the publishers. Enquiries should be made to the publisher.

The research and opinions expressed in this volume are those of individual researchers and do not represent the views of IELTS Australia Pty Limited or British Council. The publishers do not accept responsibility for any of the claims made in the research.

National Library of Australia, cataloguing-in-publication data: 2006 edition, IELTS Research Reports 2006 Volume 6. ISBN 0-9775875-1-7

CONTENTS

Foreword

Introduction

Publishing details

1 An investigation of the effectiveness and validity of planning time in Part 2 of the IELTS Speaking Test
Addresses the question of whether the use of planning time for the IELTS Speaking Test assists in candidate performance.
Catherine Elder and Gillian Wigglesworth

2 An examination of the rating process in the revised IELTS Speaking Test
Examines the validity of the analytic rating scales used to assess performance in the revised IELTS Speaking Test, through an analysis of verbal reports produced by IELTS examiners when rating test performances and a subsequent questionnaire.
Annie Brown

3 Candidate discourse in the revised IELTS Speaking Test
Aims to verify the IELTS Speaking Test scale descriptors by providing empirical validity evidence derived from a linguistic analysis of candidate discourse.
Annie Brown

4 The impact on candidate language of examiner deviation from a set interlocutor frame in the IELTS Speaking Test
Shows that the deviations examiners make from the interlocutor frame in the IELTS Speaking Test have little significant impact on the language produced by candidates.
Barry O'Sullivan and Yang Lu

5 Exploring difficulty in Speaking tasks: an intra-task perspective
Looks at how the difficulty of a speaking task is affected by changes to the time offered for planning, the length of response expected and the amount of scaffolding provided.
Cyril Weir, Barry O'Sullivan and Tomoko Horai

6 The interactional organisation of the IELTS Speaking Test
Describes the interactional organisation of the IELTS Speaking Test in terms of turn-taking, sequence and repair.
Paul Seedhouse and Maria Egbert

7 An investigation of the lexical dimension of the IELTS Speaking Test
Investigates vocabulary use by candidates in the IELTS Speaking Test by measuring lexical output, variation and sophistication, and the use of formulaic language.
John Read and Paul Nation

FOREWORD

Welcome to Volume 6 of the IELTS Research Reports. The studies reported in this volume were funded by the IELTS joint-funded research programme, sponsored by British Council and IELTS Australia. The third IELTS partner, Cambridge ESOL, supports the programme by providing assistance to approved researchers.

Since the programme began in 1995, nearly 70 studies and over 120 leading researchers have received grants under the joint programme. The results have made a significant contribution to the monitoring, evaluation and development process of IELTS. It is now one of the world’s most researched English language tests, ensuring that IELTS continues to be the test that sets the standard through its high level of quality, validity, security and overall integrity.

IELTS research activities are co-ordinated as part of a coherent framework for research and validation of the IELTS Test and the research programme is a major component of this framework. A summary of the impact of the research studies reported in this volume can be found in the Introduction by Cambridge ESOL.

The annual call for research proposals is widely publicised and aims to reflect current issues relating to IELTS as a major international English language proficiency test. A Joint Research Committee of the three IELTS partners agrees on research priorities and oversees the tendering process. Committee members, collaborating with experts in applied linguistics and language testing, assess the research proposals according to the following criteria:

- relevance and benefit of outcomes to IELTS
- clarity and coherence of the proposal's rationale, objectives and methodology
- feasibility of outcomes, timelines and budget
- qualifications and experience of proposed project staff
- potential to be published for both IELTS and an international audience.

Volume 6 is the first of two volumes of IELTS research reports to be published jointly by British Council and IELTS Australia and it contains reports of research funded by both partners. The main theme of Volume 6 is the IELTS Speaking Test. Volume 7 will focus on a range of topics including the IELTS Writing Test. Further information about IELTS research and the joint-funded programme is on the IELTS website – www.ielts.org

Martin Davidson                          Anthony Pollock
Deputy Director General                  Chief Executive
British Council                          IELTS Australia

INTRODUCTION

The British Council/IELTS Australia joint-funded research programme makes a significant contribution to the ongoing development of IELTS. External studies funded by these two IELTS partners complement internal validation and research studies conducted or commissioned by the third IELTS partner, Cambridge ESOL. The funded studies form an integral part of the process of IELTS monitoring, validation and evaluation.

This volume brings together a number of important empirical studies focusing on the IELTS Speaking Test. A major review of the IELTS Speaking Test took place in the late 1990s and a formal project to revise the Speaking module was conducted between 1998 and 2001. The revision project concentrated on several key areas with the aim of achieving greater standardisation of test conduct and improving the reliability of assessment; this included:

- developing a clearer specification of tasks, eg in terms of input and expected candidate output, and the revision of the tasks themselves for some phases of the Test

- introducing an examiner frame to guide examiner language and behaviour, and so increase standardisation of test management

- re-developing the assessment criteria and rating scale to ensure that the rating descriptors matched more closely the output from candidates in relation to the specified tasks

- re-training and re-standardising a community of around 1500 IELTS examiners worldwide using a face-to-face approach, and introducing ongoing quality assurance procedures for this global examiner cadre.

The revised IELTS Speaking Test was introduced in July 2001 and since that time the joint-funded programme has invited research proposals for empirical studies which explore various aspects of the revised test module along the dimensions listed above. Such studies are considered essential to confirm that the revised test is functioning as intended, to identify any issues that may need addressing, and to contribute to the body of evidence in support of the validity arguments underpinning use of the test.

The first study reported in this volume, by Gillian Wigglesworth and Catherine Elder, investigated the relationship between three variables in the IELTS Speaking Test – planning, proficiency and task. Their study aimed to increase our understanding of how these variables interact with one another and how they impact on test-taker performance. The specific focus was the role and use of the one minute of planning time afforded to candidates in Part 2 of the Speaking Test. Part 2 is a long turn task with in-built pre-task planning time. The task design reflects the fact that some speech – especially in academic and professional contexts – is more formal in nature and is often planned prior to delivery (though, as the researchers acknowledge, it is clearly difficult to replicate this condition within the limited time-frame of a speaking test). Early Second Language Acquisition (SLA) research into the effect of pre-task planning, including work by Wigglesworth, suggested that planning time impacted positively on both content and quality of L2 oral performance; later research findings, however, proved less conclusive. As this was an innovative feature of the revised IELTS Speaking Test introduced in 2001, the test developers were keen to investigate the effectiveness and validity of the planning time and how test-takers make use of it.

Interestingly, Wigglesworth and Elder's experimental study found no evidence that the availability of planning time advantages or disadvantages candidates' performance, either in terms of the discourse they produce or the scores they receive. Despite this finding, the researchers recommend that one minute of pre-task planning should continue to be included in Part 2 in the interests of fairness and for face validity purposes. An important dimension of this study was that it canvassed the candidates' own perceptions of the planning time available to them; feedback from test-takers suggests they perceive the one minute as adequate and useful. This study therefore offers positive support for the decision by the IELTS test developers to include a small amount of planning time in the revised Speaking Test; it also confirms that there would be no value in increasing it to two minutes, since this would be unlikely to produce any measurable gain. Another useful outcome from this study is the feedback from both researchers and test-takers on possible task factors relating to topic; this type of information is valuable for informing the test writing process.

Volume 6 includes two studies by Annie Brown who has a long association with the IELTS Speaking Test dating back to the early 1990s. Findings from Brown’s studies of the Test as it was then (some of which formed the basis of her doctoral research) were instrumental in shaping the revised Speaking Test introduced in 2001. The first of the two studies in this volume examined the validity of the analytic rating scales used by IELTS examiners to assess test-takers’ performance. When the Speaking Test was revised, a major change was the move from a single global assessment scale to a set of four analytic scales that all IELTS examiners worldwide are trained and standardised to administer. The IELTS partners were therefore keen to investigate how examiners are interpreting and applying the new criteria and scales, partly to confirm that they are functioning as intended, and also to highlight any issues that might need addressing in the future.

Brown’s study used verbal report methodology to analyse examiners’ cognitive processes when applying the scales to performance samples, together with a questionnaire probing the rating process further. The study’s findings provided encouraging evidence that the revised assessment approach is a significant improvement over the pre-revision Speaking Test in which examiners were relatively unconstrained in their language and behaviour and used a single, holistic scale. Firstly, the revised test format has clearly reduced the extent to which an interviewer’s language and behaviour is implicated in a test-taker’s performance. Secondly, examiners in this study generally found the scales easy to interpret and apply, and they adhered closely to the descriptors when rating; they reported a high degree of ‘comfort’ in terms of both managing the interaction and awarding scores. The study was instructive in highlighting several aspects that may need further attention, including some potential overlap between certain analytic scales and some difficulty in differentiating across levels; however, Brown suggests that these can relatively easily be addressed through minor revisions to the descriptors and through examiner training.

Brown’s second study in this volume is a partner to the first. It too sought empirical evidence to validate the new Speaking Test scale descriptors but through a discourse analytic study of test-taker performance rather than a focus on examiner attitudes and behaviour. Overall, the study findings confirmed that all the measures relating to each analytical criterion contribute in some way to the assessment on that scale and that no single measure appears to dominate the rating process. As we would wish, a range of performance features contribute to the overall impression of a candidate’s proficiency and the results of this study are therefore encouraging for the IELTS team who developed the revised scales and band descriptors.

This study also highlights the complexities that are involved in assessing speaking proficiency across a broad ability continuum (as is the case in IELTS). Specific aspects of performance may be more or less relevant at certain levels, and so contribute differentially to the scores awarded. Furthermore, even though two candidates may be assessed at the same level on a scale, their respective performances may display subtle differences on different dimensions of that trait. This reminds us that, at the level of the individual, the nature of spoken language performance and what it indicates about their proficiency level can be highly idiosyncratic in nature.

Barry O’Sullivan and Yang Lu set out to analyse the way in which the examiner script (or Interlocutor Frame) used in the IELTS Speaking Test impacted on the test-taker’s performance, specifically in cases where an examiner deviates from the scripted guide provided. An Interlocutor Frame was introduced in the 2001 revision on grounds of fairness – to increase standardisation of the test and to reduce the risk of rater variability; since then, the functioning of the Interlocutor Frame has been the focus of ongoing research and validation work. The study reported here forms part of that research agenda, and aimed to locate specific sources of deviation, the nature of the deviations and their effect on the language of the candidates. Taking a discourse analytic approach, the researchers analysed transcription extracts from over 60 recordings of live speaking tests to investigate the nature and impact of examiner deviations from the interlocutor frame.

Findings from their study suggest that in Parts 1 and 2 of the Speaking Test, the examiners adhere closely to the Frame; any deviations are relatively rare and they occur at natural interactional boundaries with an essentially negligible effect on the language of candidates. Part 3 shows a different pattern of behaviour, with considerable variation across examiners in the paraphrased questions, though even here little impact on candidate language could be detected. It is important to note, however, that some variation is to be expected in this third part of the Test as it is specifically designed to offer the examiner flexibility in choosing and phrasing their questions, matching them to the level of the test-taker within the context of a test measuring across a broad proficiency continuum. Once again, findings from this study confirm that the revised Speaking Test is functioning largely as the developers originally intended. The study also provides useful insights which will inform procedures for IELTS examiner training and standardisation, and will shape future changes to the Frame.

Cyril Weir, Barry O’Sullivan and Tomoko Horai investigated how the difficulty of speaking tasks can be affected if changes are made to three key task variables: amount of planning time offered; length of response expected; and amount of content scaffolding provided. Their study explored these variables in relation to Part 2 (individual long turn task) of the IELTS Speaking Test and it therefore complements the Wigglesworth and Elder study which focused on the planning time variable in isolation. Using an experimental design, the researchers collected performance and score data for analysis. They supplemented this with an analysis of questionnaire responses related to test-takers’ cognitive processing based upon a socio-cognitive framework for test validation. Once again, the findings are encouraging for the IELTS test developers. There is welcome empirical support for the current design of the Part 2 task used in the operational IELTS, both in terms of the quality of candidate performance and the scores awarded, and also in relation to candidate perceptions of task difficulty. Task equivalence is an important issue for IELTS given the large number of tasks which are needed for the operational test, and this study provides useful insights into some of the variables which can affect task difficulty, especially for test candidates at different ability levels.

Paul Seedhouse and Maria Egbert explored the interactional organisation of the IELTS Speaking Test in terms of turn-taking, sequence and repair, drawing their sample for analysis from the large corpus of audio-recordings held by Cambridge ESOL. Since 2002 several thousand recordings of live IELTS Speaking Tests have been collected and these now form a valuable spoken language corpus used by researchers at Cambridge ESOL to investigate various aspects of the Speaking Test. By applying Conversation Analysis (CA) methodology to 137 complete speaking tests, Seedhouse and Egbert were able to highlight key features of the spoken interaction.

Like O’Sullivan and Yang, they observed that examiners adhere closely to the scripted guide they are given to ensure standardisation of the test event. Although spoken interaction in the IELTS Speaking Test is somewhat different to ordinary conversation due to the institutional nature of the test event, the researchers confirm that it does share similarities with interactions in teaching and academic contexts. In addition, the three parts of the Test allow for a variety of task types and patterns of interaction. Seedhouse and Egbert make a number of useful recommendations which will inform aspects of test design as well as examiner training, particularly in relation to the rounding-off questions at the end of Part 2.

In the final report in Volume 6, John Read and Paul Nation investigated the lexical dimension of the IELTS Speaking Test. Allocation of grant funding for this study once again reflected the IELTS partners’ concern to undertake validation work following introduction of the revised Speaking Test in 2001. When the holistic or global scale for speaking was replaced with four analytic criteria and scales in July 2001, one of these four was Lexical Resource; this requires examiners to attend to the accuracy and range of a candidate’s vocabulary use as one basis for judging their performance. The Read and Nation study therefore set out to measure lexical output, variation and sophistication, as well as the use of formulaic language by candidates. As the researchers point out in their literature review, there was a strong motivation to explore speaking assessment measures from a lexical perspective given the relative lack of previous research on spoken (rather than written) vocabulary and the growing recognition of the importance of lexis in second language learning.

For this study the researchers created a small corpus of texts derived from transcriptions of Speaking Tests recorded under operational conditions at IELTS tests centres worldwide. As for the Seedhouse and Egbert study, they were given access to the corpus of IELTS Speaking Test recordings at Cambridge ESOL from which they selected a subset of 88 performances for transcription and analysis.

The study’s findings are broadly encouraging for the IELTS test developers, confirming that the Lexical Resource scale does indeed differentiate between higher and lower proficiency candidates. At the same time, however, the study highlights the complexity of this aspect of spoken performance and the extent to which candidates who receive the same band score sometimes display markedly different qualities in their individual performance. The study also provides useful insights into how different topics in Parts 2 and 3 influence the nature and extent of lexical variation. Such insights can feed back into the test writing process; they can also inform the training of IELTS examiners to direct their attention to salient distinguishing features of the different bands and so assist them in being able to reliably rate vocabulary performance as a separate component from the other three rating criteria. The researchers suggest that in the longer term, and following additional research into how this scale operates, there may be a case for some further revision of the rating descriptors.

Revision of the IELTS Speaking Test in 2001 made it possible to address a number of issues relating to the quality and fairness of the Test. Each of the research studies reported in Volume 6 offers important empirical evidence to support claims about the usefulness of the current IELTS Speaking Test as a measure of L2 spoken language proficiency. In addition, they all provide valuable insights which can inform the ongoing development process (task design, examiner training, etc) as well as future revision cycles. All the reports from the funded projects in this volume highlight avenues for further research, and researchers wishing to apply for future grants under the British Council and IELTS Australia joint-funded programme may like to take note of some of the suggestions made.

Dr Lynda Taylor
Assistant Director – Research and Validation Group
University of Cambridge ESOL Examinations
September 2006

1. An investigation of the effectiveness and validity of planning time in Part 2 of the IELTS Speaking Test

Authors
Catherine Elder, The University of Melbourne, Australia
Gillian Wigglesworth, The University of Melbourne, Australia

Grant awarded: Round 9, 2003

This study addresses the question of whether the use of planning time for the IELTS Speaking Test assists in candidate performance.

ABSTRACT

This study investigates the relationship between three variables in the oral IELTS test – planning, proficiency and task – and was designed to enhance our understanding of how, or whether, these variables interact. The study examined whether differences in performance resulted from providing no planning time, one minute of planning time or two minutes of planning time. The study also aimed to identify the most effective strategies used by candidates in their planning.

Ninety candidates, in two groups – intermediate and advanced – each undertook three tasks: one with no planning time, one with one minute of planning time and one with two minutes. All tasks were rated by two raters, and the transcripts of the speech samples were subjected to discourse analysis.

Neither the analysis of the scores nor the discourse analysis revealed any significant differences in performance according to the amount of planning time provided. While this suggests that planning time does not positively advantage candidates, we argue that one minute of pre-task planning should continue to be included in Part 2 of the IELTS test in the interests of fairness, and to enhance the face validity of the test. The report concludes with a discussion of possible reasons for the null findings and proposes avenues for further research.

AUTHOR BIODATA

GILLIAN WIGGLESWORTH

Gillian Wigglesworth is Associate Professor and Head of the School of Languages and Linguistics at The University of Melbourne. She has a wide range of research interests which broadly include both first and second language acquisition, language testing and assessment, and bilingualism. Gillian has several edited book publications, and numerous journal articles and book chapters which reflect her research interests.

CATHERINE ELDER

Catherine Elder is Associate Professor of Applied Linguistics and Director of the Language Testing Research Centre in the School of Languages and Linguistics at The University of Melbourne. Previously, and while undertaking this research study, she was with Monash University. Catherine is co-author of the Dictionary of Language Testing (CUP 1999) and co-editor of Experimenting with Uncertainty: Essays in Honour of Alan Davies (CUP 2001) and of the Handbook of Applied Linguistics (Blackwell 2004).

CONTENTS

1 Background to the research
2 The current study
3 Context for the research
  3.1 Research questions
  3.2 Variables
    3.2.1 Proficiency level
    3.2.2 Amount of planning time
    3.2.3 Task
4 Methodology
  4.1 Participants
  4.2 Study design
  4.3 Data collection procedures
    4.3.1 Interviews
    4.3.2 Post-interview questionnaires
    4.3.3 Focus groups
  4.4 Data compilation and analysis
    4.4.1 Transcription and digitisation of tapes
    4.4.2 Post-performance ratings
    4.4.3 Discourse analysis
    4.4.4 Questionnaire responses
    4.4.5 Focus group responses
5 Results
  5.1 Research question 1 (amount of planning time/scores)
  5.2 Research question 2 (amount of planning time/quality)
  5.3 Research question 3 (candidates' perception of planning time)
    5.3.1 Topic as a factor
    5.3.2 Planning time as a factor
    5.3.3 Planning and topic as a factor
  5.4 Research question 4 (how planning time used)
  5.5 Research question 5 (most effective strategies for planning time)
6 Discussion and Conclusion
References
Appendix 1: Task prompts provided for candidates
Appendix 2: Task administration instructions for interviewer
Appendix 3: Marking sheet
Appendix 4: Focus group interview questions
Appendix 5: Student questionnaire

1 BACKGROUND TO THE RESEARCH

The time variable is critical in information processing theories of speech production, and there is now a substantial body of Second Language Acquisition (SLA) research within this cognitive tradition investigating the effects of pre-task planning time on oral performance. This research has yielded fairly convincing evidence that opportunities for planning before a task impact both on the content of learners’ speech and also on the quality of the language they produce. With regard to the latter, planning is seen as important because of the role it can play in helping learners access their L2 knowledge through controlled processing, promoting selective attention to form and monitoring (Skehan, 1988).

A review of the effects of planning time by Ellis (2005) shows that planning generally enhances the fluency and complexity of L2 learners' spoken performance (eg Foster, 1996; Foster and Skehan, 1996; Skehan and Foster, 1997; Wendel, 1997; Mehnert, 1998; Ortega, 1999; Yuan and Ellis, 2003). Results pertaining to accuracy are less consistent, but some studies (eg Ellis, 1997; Mehnert, 1998) show that planning also reduces the incidence of error in learner speech. This inconsistency has been attributed to a number of variables, including the characteristics of the tasks used to elicit learner speech and the conditions under which these tasks are performed. Performance on structured tasks, for example, has been found to be more responsive to planning than is the case with unstructured tasks (Foster and Skehan, 1996; Skehan and Foster, 1997). The type of planning which learners engage in may also be important, as Sangarun (2005) showed. Finally, the amount of time allowed for planning also has an impact, with some aspects of speech improving after only one minute of planning time and others requiring more sustained rehearsal. In Ortega's (1999) study, for example, fluency improvements were evident only after 10 minutes of pre-task planning.

One of the reasons for the intense interest in planning amongst SLA researchers is that it is believed to foster pushed output (Swain, 1993) and hence to aid acquisition, although firm evidence in support of this belief is yet to emerge. Whether or not this is the case, the different qualities of speech produced under planned and unplanned conditions provide insight into the psycholinguistic constraints on L2 production, and lend support to the distinction made by Ellis (2005) and others between implicit (automated) and explicit (analytic) knowledge. These constructs are regarded by many as central to psycholinguistic theories of second language production.

The justification for researching planning time in language testing contexts, such as the one investigated in this study, is somewhat different. Skehan (1998) invokes test validity, claiming that speaking tests need to sample language produced under planned and unplanned conditions if test scores are to be considered representative of a broad range of real world performances. Such a position begs the question of how much planning time will produce the desired variation in the quality of speech. Elder et al (2002) propose that tests like IELTS and TOEFL, which are used to predict language performance in academic settings, should include planning time for authenticity reasons, given that academic speech is more often than not planned prior to delivery. There are, however, obvious constraints on how closely test tasks can mirror the requirements of academia, where students may spend several hours or days preparing for an academic presentation. In a testing context, the amount of planning time must be limited to what is practical given the resources available. It should also be acknowledged that the majority of speaking taking place in academic contexts is entirely spontaneous, so it seems logical to also include some tasks with no planning time. This however raises fairness issues – a further argument for allowing planning in testing contexts. In the highly stressful test situation, planning time may serve to reduce anxiety, a possible source of construct-irrelevant variance on a test. It may thereby give candidates opportunities to produce their best possible performance (see Swain's (1985) arguments about "biasing for best" in the test situation). However, what is not clear is either whether planning does reduce anxiety, or whether planning in fact makes a difference to test performance, as the SLA research would lead us to believe.

The few studies which have been conducted into the effects of planning in language testing contexts have produced less consistent results than is the case with classroom-based SLA research. The first study to be undertaken was that of Wigglesworth (1997), which explored the effects of planning on the oral proficiency component of the access: test (used to screen immigrants for entry to Australia) and found that pre-task planning increased the accuracy of certain grammatical features, such as verb tenses and articles, particularly amongst the higher proficiency candidates when performing cognitively demanding tasks. But while she found significant effects for planning at the discourse level, giving candidates pre-task planning time made no difference to their scores.

Two recent studies have also found that planning time can have a positive impact on performance. The first, by Tavakoli and Skehan (2005), which was conducted in what the authors claim to be a testing environment, found consistent benefits for planning on discourse measures of accuracy, complexity and "breakdown" fluency. The impact of planning on scores, however, is not reported. Proficiency again interacted with planning time, as in the Wigglesworth study, but this time it was the less-proficient learners who gained the most (elementary planners in some cases outperformed the intermediate non-planners). Learners also found task performance easier under the planned condition. The second study, by Xi (2005), which focused on a graph description task from the tape-mediated SPEAK (Speaking Proficiency English Assessment Kit), found that planning time had the effect of increasing holistic scores on some line graph tasks and also served to mitigate the effects of task familiarity on performance. Qualitative analyses revealed that candidates described more line segments and offered more complex information when planning was provided.

However, these findings are at odds with those of other test-based research, namely that of Wigglesworth (2000) and Iwashita et al (2001). In Wigglesworth’s study, which focused only on test scores, planning was found to be counterproductive in the case of unstructured tasks and had little impact on learner performance on other task types. Iwashita et al (2001) found that planning before a monologic story-telling task had no impact on either the quality of test discourse or test scores, or indeed on candidates’ perceptions of task difficulty. Elder and Iwashita (2005) offered a variety of tentative explanations for the discrepancy between the findings of classroom and language testing research, including the nature of the tasks themselves, of the instructions given to candidates and the opportunities for on-line planning during task performance (which they speculate may obscure the effects of pre-task preparation). They also suggest that the use of planning time by test-takers may be ineffective. Although some classroom-based research has investigated how learners use their planning time (Wendel, 1997; Ortega, 1999 & 2005; Sangarun, 2005), this issue is yet to be explored in a language testing context.

2 THE CURRENT STUDY

The current study was motivated by a desire to probe these issues further in the context of a face-to-face oral interview (the previous studies were conducted with tests requiring tape-based performances). Particular attention was paid to the design features of the study to avoid some weaknesses of previous research efforts in this area. As well as investigating the effect of different levels of planning time on learners at different levels of proficiency, we were interested in investigating the nature and effectiveness of test-taker planning processes and also in canvassing test-takers’ perceptions of planning time (ie its adequacy and usefulness).

3 CONTEXT FOR THE RESEARCH

The study (funded from an IELTS Australia grant awarded in 2003) explored the effects of pre-task planning time on performance on Part 2 of the International English Language Testing System (IELTS) oral interview. The interview offers one minute’s preparation time to all candidates and allows them to prepare notes which they can refer to during the actual interview. We will hereafter use Ellis’s term “strategic planning” (2005: 3-5) to make it clear that we are talking about the preparation time given to candidates immediately before performing a test task rather than to pre-task rehearsal (Bygate and Samuda, 2005) in which the candidate actually practises the task prior to performing it.

The following questions were investigated in the study.

3.1 Research questions

1. Does the amount of strategic planning time provided make a difference to the scores awarded to candidates in Part 2 of the oral test?

2. Does the amount of strategic planning time make a difference to the quality of candidate discourse in Part 2 of the oral test?

3. How do candidates perceive the usefulness and validity of strategic planning time?

4. How do candidates use their strategic planning time?

5. What are the most effective strategies for the use of strategic planning time?

3.2 Variables

Three variables were manipulated in the study design:

1. Proficiency level

2. Amount of planning time

3. Task.

3.2.1 Proficiency level

There were two groups of candidates at different levels of proficiency. Group A were intermediate level candidates, as determined by previous scores on IELTS (band 5.0-5.5) and/or institutional estimates derived from in-house measures used for placement purposes. Group B were advanced candidates (ie previous IELTS band scores of 6.0 or more, or an institutionally determined equivalent). Items from Nation's 3,000-5,000 level Academic Word List were also administered to candidates in each group to confirm the validity of these proficiency groupings. The vocabulary test was used as a surrogate for general language proficiency, which was the basis for the institutional groupings, to confirm that the candidates belonged to two distinct proficiency groupings.

3.2.2 Amount of planning time

The instructions for Part 2 of the IELTS oral test indicate that candidates should be given "one to two minutes to prepare". Given previous research indicating that as little as one minute can affect performance on some discourse measures (see Mehnert, 1998; Wigglesworth, 1997), this study set out to investigate whether there were any differences according to whether candidates perform with a) no planning time, b) one minute or c) two minutes of planning time. In each case, 15 seconds was provided for the candidate to read the task.

3.2.3 Task

Three tasks were developed in line with the specifications for the Part 2 task. These were then sent to Cambridge ESOL for feedback, and modifications were made following suggestions from the test developers to ensure that the tasks did indeed correspond very closely to what might be used under operational test conditions. (See Appendix 1 for the tasks and accompanying prompts to candidates.) The design of the study was set up to control for variations in performance that might occur as a result of differences between the tasks, rather than as a result of the planning or proficiency variables (for details see 4.2 Study design below). This builds on previous research which has suggested that the impact of planning time on performance may be sensitive to relatively small differences in tasks (Foster, 1996; Foster and Skehan, 1996; Skehan and Foster, 1997; Mehnert, 1998; Ortega, 1999; Wigglesworth, 2001).

4 METHODOLOGY

4.1 Participants

Participants for the study were recruited from three different Australian tertiary institutions which offered English for Academic Purposes (EAP) and IELTS training. The candidates were given a small payment in compensation for time spent and were also promised feedback on their performance against the various IELTS criteria (although it was explained that the resultant score was only roughly indicative of their IELTS level). The explanatory letter to participants is Appendix 2.

The participants were aged between 19 and 36, and came from a range of language backgrounds. Approximately 60% were Chinese speakers (Mandarin vs Cantonese not specified), and the remainder included Korean, Japanese, Thai, Arabic and Vietnamese speakers. Most participants had taken the IELTS test before and all were university bound, for either undergraduate or postgraduate study. All were intending to take the IELTS test in the near future. The study provided an important opportunity to practise an IELTS-like task and therefore motivation to participate was generally very high. There were 90 candidates in all, equally distributed across the advanced and intermediate levels.

4.2 Study design

Each candidate did all three Part 2 task versions. In one task they were allowed no planning time; in another, one minute of planning time; and in the other, two minutes. Tasks, planning time and order were counterbalanced across candidates using a Latin Square design, as indicated in Tables 1 and 2 below. There were 45 candidates in Group A (intermediate), and 45 candidates in Group B (advanced).

In each group, the candidates were divided into 3 subgroups (i, ii and iii), and within each subgroup, the candidates were divided into groups of five to avoid any practice effect. So, for example, in group Bi, all 15 candidates did Task 1 with no planning time, but five did this task first, five did it second and five did it third. Thus each student did each of the three tasks, and each student experienced one task with no planning time, one task with one minute of planning time and one task with two minutes of planning time.

Planning time    Group Ai    Group Aii    Group Aiii
0 minutes        Task 1      Task 2       Task 3
1 minute         Task 2      Task 3       Task 1
2 minutes        Task 3      Task 1       Task 2
1 minute         Task 2      Task 3       Task 1
2 minutes        Task 3      Task 1       Task 2
0 minutes        Task 1      Task 2       Task 3
2 minutes        Task 3      Task 1       Task 2
0 minutes        Task 1      Task 2       Task 3
1 minute         Task 2      Task 3       Task 1

Table 1: Research design (Group A: intermediate candidates)

Planning time    Group Bi    Group Bii    Group Biii
0 minutes        Task 1      Task 2       Task 3
1 minute         Task 2      Task 3       Task 1
2 minutes        Task 3      Task 1       Task 2
1 minute         Task 2      Task 3       Task 1
2 minutes        Task 3      Task 1       Task 2
0 minutes        Task 1      Task 2       Task 3
2 minutes        Task 3      Task 1       Task 2
0 minutes        Task 1      Task 2       Task 3
1 minute         Task 2      Task 3       Task 1

Table 2: Research design (Group B: advanced candidates)
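
The rotation underlying Tables 1 and 2 can be expressed compactly. The sketch below is our own illustration, not code from the study (the function and variable names are invented): rotating the task list by the subgroup index reproduces the task-to-planning-time assignments shown above. The further rotation of administration order within groups of five candidates is omitted for brevity.

```python
# Illustrative reconstruction of the Latin Square design in Tables 1 and 2
# (not code from the original study; all names are invented).

TASKS = ["Task 1", "Task 2", "Task 3"]
PLANNING = ["0 minutes", "1 minute", "2 minutes"]

def condition_map(subgroup: int) -> dict:
    """Return the planning-time -> task assignment for subgroup 0, 1 or 2.

    Rotating the task list by the subgroup index guarantees that, across
    the three subgroups, every task is performed under every planning
    condition.
    """
    rotated = TASKS[subgroup:] + TASKS[:subgroup]
    return dict(zip(PLANNING, rotated))

for index, label in enumerate(["i", "ii", "iii"]):
    print(f"Subgroup {label}:", condition_map(index))
# Subgroup i:   0 minutes -> Task 1, 1 minute -> Task 2, 2 minutes -> Task 3
# Subgroup ii:  0 minutes -> Task 2, 1 minute -> Task 3, 2 minutes -> Task 1
# Subgroup iii: 0 minutes -> Task 3, 1 minute -> Task 1, 2 minutes -> Task 2
```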

4.3 Data collection procedures

4.3.1 Interviews

A total of eight trained and experienced IELTS interviewers were recruited for the study and were thoroughly briefed on the interview procedures. Candidates within each proficiency grouping (advanced and intermediate) were assigned randomly to interviewers, who were issued with a bundle of pre-prepared student packs for their candidates. These packs contained the task prompts in the order in which they were to be administered, together with instructions for the candidates about the amount of planning time allowed (Appendix 3).

Apart from the differences in planning time, each task was administered under conditions which simulated as closely as possible the operational conditions of the IELTS interview. In the one and two minute planning conditions, candidates were given a sheet of paper and a pen to do their planning and were allowed to refer to their notes during task performance (as is normal during the IELTS interview). On completion of each task they were asked to hand the paper to the interviewer, who wrote the amount of planning time on the same sheet as well as any difficulties or notable features of the candidate's behaviour that had been observed.

All interviews were tape-recorded so that any breach of the planning time instructions by either candidate or interviewer could be detected, and so that additional retrospective ratings of performance could be arranged.

Standard IELTS analytic criteria were used to rate each task performance separately as soon as the candidates completed each task (see Appendix 4 for the rating sheet). Ratings were assigned concurrently by the interviewer for feedback purposes but it was decided not to use these ratings for our research investigation given a) informal feedback indicating that interviewers found it difficult to rate one task at a time (under normal operational conditions ratings are completed once only after the interview is over) and b) our fear that ratings might be contaminated by interviewers’ attitudes to planning time. (Interviewers have been found to compensate candidates for what they perceive as a difficult task or interlocutor and we believed the same might be true for task conditions perceived by raters to pose challenges to candidates.)

4.3.2 Post-interview questionnaires

On completion of the interview, all candidates filled out a questionnaire (see Appendix 6) which canvassed their perceptions of planning time. It asked about any prior strategy training the candidates had experienced (eg in IELTS preparation classes) and asked them to indicate which strategies they used during planning time. The planning strategies adopted for the questionnaire were based on those identified by Rutherford (2001) on the basis of feedback from a focus group of students very similar to the participants in the current study. Both micro level (language-related) and macro level (content-related) strategies were included. The questionnaire was administered on completion of the three-task sequence to avoid the risk of a learning effect (ie candidates using some of the strategies included on the questionnaire in subsequent task performance). Candidates then completed the vocabulary test (described above).

4.3.3 Focus groups

Candidates' perceptions regarding the difficulty/fairness of the task under the three different conditions and the utility of planning time were further probed during two focus group interviews, each involving 8–10 participants from the larger study who volunteered to stay on for a further hour after the questionnaire and vocabulary test. The questions which guided the focus groups are given in Appendix 5. These focus group discussions were recorded on tape. For the purposes of this study, focus group interviews were preferred over individual interviews for two main reasons. Firstly, for entirely practical reasons, the focus group meant that the candidates were not required to wait for a long period of time. Secondly, focus groups allow for a dynamic interaction between the members of the group (Greenbaum, 1998; Bryman, 2001), which was considered to be productive in terms of drawing out candidates' views, particularly given that they were second language learners. We acknowledge, however, that the views expressed by focus group participants are not necessarily representative of the broader sample.

4.4 Data compilation and analysis

4.4.1 Transcription and digitisation of tapes

All 90 tapes were transcribed so that transcripts could be analysed and coded (see further details below). The tapes were then sent to a laboratory for digitisation and a CD-ROM created of all 90 performances. Instructions from the interviewer and silences for planning were removed from the CD so that raters would be unaware of the conditions under which each task was performed.

4.4.2 Post-performance ratings

Two trained IELTS raters were recruited to rate all three tasks on each of the 90 tapes using the IELTS analytic criteria. They were instructed to take a break at least once an hour to avoid fatigue. The ratings of both assessors were then entered into a database. Inter-rater reliability was calculated using Spearman's correlation coefficient. The data were first analysed using the Facets rating scale model (Linacre, 1990) with rater, task, proficiency and planning time entered as separate facets in the file. Univariate F tests (using SPSS) were then calculated with task and planning time entered as independent variables and average (of the two raters) scores on each of the analytic rating criteria as the outcome measures. Due to the Latin Square design, whereby candidates were randomly assigned to different planning conditions within each proficiency grouping rather than across groupings, these analyses were conducted separately for high and low proficiency candidates.
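
The authors ran these analyses with Facets and SPSS. Purely as an illustration, the reliability check and the univariate F test could be set up in Python roughly as follows; the file name and column names are our assumptions, not artefacts of the study.

```python
import pandas as pd
from scipy.stats import spearmanr
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format score file: one row per candidate x task,
# with each rater's score for one analytic criterion.
df = pd.read_csv("ratings.csv")  # columns: candidate, task, planning, rater1, rater2

# Inter-rater reliability (Spearman's correlation between the two raters)
rho, p = spearmanr(df["rater1"], df["rater2"])
print(f"Inter-rater reliability: rho = {rho:.2f} (p = {p:.4f})")

# Average the two raters' scores, then run a univariate F test with task
# and planning time as independent variables. As in the study, this would
# be run separately for the intermediate and advanced groups.
df["score"] = df[["rater1", "rater2"]].mean(axis=1)
model = smf.ols("score ~ C(task) + C(planning)", data=df).fit()
print(anova_lm(model, typ=2))
```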

4.4.3 Discourse analysis

A subset of speech samples was selected for further analysis of the discourse. Two candidates from each of the nine cells in Tables 1 and 2 were randomly selected. Thus 18 advanced and 18 intermediate candidates' speech samples were selected. Transcribed speech samples for each candidate were coded for the following categories.

1. Fluency
   - fluent versus disfluent speech
   - filled and unfilled pauses
   - self repairs

2. Accuracy: global measures in terms of:
   - error-free AS-units
   - error-free clauses

3. Complexity
   - proportion of dependent clauses per AS-unit
   - percentage of subordinate clauses to AS-units

Fluency features were coded on the WAV files using the EMU Speech Data Base System, with the R statistical package used to extract the statistics. The EMU System offers a more accurate means of measuring fluency than does the traditional approach based solely on written transcripts. It allows data to be coded in real time on a variety of different levels chosen by the investigator, and the R package allows these to be read once the features have been labelled. In other words, stretches of fluent speech were marked at beginning and end, as were filled and unfilled pauses and self-repairs. Although much more detailed labelling is available (eg syllables can be marked and thus counted), this process was very time-consuming. It was decided not to do this in the first instance, and only to focus on a more detailed analysis in the event of significant differences between groups on the broader categories.
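
The fluency statistics extracted from such time-aligned labels reduce to sums over labelled intervals. The sketch below is a minimal illustration of that computation, not the EMU/R code used in the study; the label names and timings are invented.

```python
# Hypothetical time-aligned labels of the kind produced in EMU:
# (start_sec, end_sec, label). Values are invented for illustration;
# the study's actual label set may differ.
intervals = [
    (0.0, 3.2, "fluent"),
    (3.2, 3.9, "unfilled_pause"),
    (3.9, 4.4, "filled_pause"),
    (4.4, 5.1, "self_repair"),
    (5.1, 9.0, "fluent"),
]

# Total duration per label category
durations: dict = {}
for start, end, label in intervals:
    durations[label] = durations.get(label, 0.0) + (end - start)

total = sum(durations.values())
percent_fluent = 100 * durations.get("fluent", 0.0) / total
print(f"Fluent speech: {percent_fluent:.1f}% of {total:.1f} s")
for label, secs in durations.items():
    print(f"  {label}: {secs:.2f} s")
```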

For the measures of accuracy and complexity, the transcripts were coded into AS-units (Foster, Tonkyn and Wigglesworth, 2000) and clauses. Following Foster et al (2000), an AS-unit was defined as an utterance consisting of an independent clause together with any subordinate clause associated with it. An independent clause was defined minimally as a clause which included a finite verb, while a subordinate clause was defined as a clause consisting of a finite or non-finite verbal element with at least one other clausal element such as a subject, object, complement or adverbial (p 365). Subordinate clauses were divided into two types, which we labelled subordinate (when introduced by a subordinating discourse marker, eg because, before, after) and dependent, consisting of non-finite and other non-independent clauses.

Twelve speech samples were coded by the two chief investigators and reliability checks were then conducted. Areas of discrepancy were discussed and modifications were made to the coding system where necessary. The remaining speech samples were coded by a single researcher only.
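
Given coded transcripts, the accuracy and complexity measures listed above are simple ratios over the unit and clause counts. The following sketch is our own illustration (the field names and counts are invented, not the study's coding scheme).

```python
from dataclasses import dataclass

@dataclass
class CodedSample:
    # Invented structure for one candidate's coded transcript.
    as_units: int
    error_free_as_units: int
    clauses: int
    error_free_clauses: int
    subordinate_clauses: int   # finite, introduced by a subordinator
    dependent_clauses: int     # non-finite and other non-independent

def measures(s: CodedSample) -> dict:
    """Accuracy and complexity indices as ratios over the coded counts."""
    return {
        "pct_error_free_as_units": 100 * s.error_free_as_units / s.as_units,
        "pct_error_free_clauses": 100 * s.error_free_clauses / s.clauses,
        "dependent_per_as_unit": s.dependent_clauses / s.as_units,
        "pct_subordinate_to_as_units": 100 * s.subordinate_clauses / s.as_units,
    }

# Placeholder counts for one sample (illustrative only)
sample = CodedSample(as_units=40, error_free_as_units=22, clauses=65,
                     error_free_clauses=41, subordinate_clauses=18,
                     dependent_clauses=12)
print(measures(sample))
```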

4.4.4 Questionnaire responses

Questionnaire responses were entered into a database and descriptive statistics (frequencies and percentages) were calculated for the various items. T-tests were used to compare the mean number of strategies used under the one and two minute planning conditions. The relative frequency of micro- and macro-planning strategies at each proficiency level was also calculated and these frequencies were compared using the chi-square statistic. Correlations were also computed to determine whether there was a significant relationship between the number of micro- and macro-planning strategies used and test scores. The questionnaires also yielded qualitative data about candidates' attitudes to planning time. These comments were thematically coded and summarised with reference to findings from the focus group interviews (see below).
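
These are standard analyses; purely as an illustration, with invented placeholder values rather than the study's data, the three computations might be set up as follows.

```python
import numpy as np
from scipy.stats import ttest_rel, chi2_contingency, pearsonr

# Number of strategies each candidate reported under the one- and
# two-minute conditions (paired by candidate; values invented).
one_min = np.array([3, 4, 2, 5, 3, 4])
two_min = np.array([4, 4, 3, 5, 4, 5])
t, p = ttest_rel(one_min, two_min)
print(f"Paired t-test: t = {t:.2f}, p = {p:.3f}")

# Frequencies of micro (language-related) vs macro (content-related)
# strategies by proficiency level (counts invented for illustration).
table = np.array([[120, 80],    # intermediate: micro, macro
                  [95, 110]])   # advanced:     micro, macro
chi2, p, dof, _ = chi2_contingency(table)
print(f"Chi-square: chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")

# Relationship between number of planning strategies and test scores
scores = np.array([23.0, 24.5, 22.0, 25.0, 23.5, 24.0])
r, p = pearsonr(one_min + two_min, scores)
print(f"Correlation with scores: r = {r:.2f}, p = {p:.3f}")
```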

4.4.5 Focus group responses

Focus group interviews were replayed and coded for keywords based on themes emerging from the data. These themes were exemplified with verbatim quotes where appropriate.

5 RESULTS

The results of the vocabulary test, which we used as a surrogate for proficiency, confirmed that the intermediate and advanced students came from different groups, with the intermediate students averaging 46.15 (standard deviation 13.21) and the advanced students averaging 56.50 (sd 9.43). This difference was significant (t = 4.243, df = 87, p < .0001). An inter-rater reliability check on the two trained IELTS raters was calculated for each of the rating categories and yielded coefficients ranging from .51 (for intelligibility) to .73 (for accuracy). However, it should be pointed out that while candidates – the majority of whom were from mainland China – were at different levels of proficiency, their speaking skills were not highly variable. This is likely to be a result of their lack of exposure to spoken English in their previous instructional contexts. In future studies it might be useful to pre-test students’ oral proficiency, rather than their general proficiency, as a means of forming the different groupings.
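As a rough check, the reported group difference can be approximately reproduced from the published summary statistics alone. Per-group sizes are not stated, so equal groups of 45 are assumed in the sketch below, which is presumably why the result (t ≈ 4.28, df = 88) differs slightly from the reported t = 4.243 with df = 87.

```python
# Recomputing the vocabulary-test comparison from summary statistics.
# Equal group sizes (45/45) are an assumption; the report gives df = 87,
# so the actual groups were presumably slightly unequal.
from scipy.stats import ttest_ind_from_stats

res = ttest_ind_from_stats(
    mean1=56.50, std1=9.43, nobs1=45,    # advanced group
    mean2=46.15, std2=13.21, nobs2=45,   # intermediate group
)
print(res)  # statistic close to 4.28, p < .0001 - near the reported values
```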

5.1 Research question 1: Does the amount of strategic planning time provided make a difference to the scores awarded to candidates in Part 2 of the oral test?

The first research question addressed the issue of whether strategic planning time made a difference to the scores awarded to the candidates. Mean IELTS scores and standard deviations for the advanced and intermediate groups are presented in Table 3 below. The univariate analysis revealed no significant effects for either task or planning time at either level of proficiency on the global ratings, a null finding that was confirmed in the FACETS analysis.


Planning time     none           1 minute       2 minutes
                  mean    SD     mean    SD     mean    SD
Intermediate      23.6    2.2    23.6    2.2    23.8    2.1
Advanced          23.9    2.2    24.0    1.9    24.0    2.0

Table 3: Total IELTS score (N=90)

Similarly, descriptive statistics presented in Tables 4 and 5 below show only minimal mean differences according to planning time on each component of the analytic rating scale at each proficiency level. The univariate F test again confirmed that there were no significant effects for either task or planning time.

Planning time     none           1 minute       2 minutes
                  mean    SD     mean    SD     mean    SD
Fluency           5.8     0.9    5.8     0.9    5.8     0.8
Lexis             6.0     0.7    5.9     0.7    6.0     0.7
Grammar           5.8     0.6    5.8     0.7    5.8     0.7
Pronunciation     6.0     0.3    6.1     0.3    6.1     0.3

Table 4: Analytic measures for intermediate candidates (N=45)

Planning time     none           1 minute       2 minutes
                  mean    SD     mean    SD     mean    SD
Fluency           6.1     0.7    6.1     0.7    6.0     0.7
Lexis             6.0     0.7    6.1     0.5    6.1     0.6
Grammar           5.9     0.7    5.9     0.5    5.9     0.6
Pronunciation     5.9     0.6    6.0     0.5    6.0     0.5

Table 5: Analytic measures for advanced candidates (N=45)

5.2 Research question 2: Does the amount of strategic planning time make a difference to the quality of candidate discourse in Part 2 of the oral test?

The discourse analytic measures were used to determine whether planning time made a difference to the quality of the discourse in these tasks. As indicated above, the discourse of a subset of candidates was assessed on measures of fluency, accuracy and complexity. The fluency measures identified the percentage of fluent versus non-fluent speech, filled and unfilled pauses, and the duration of reformulations, repetitions and false starts (self-repairs). The results for the intermediate candidates are given in Table 6, and those for the advanced candidates in Table 7. The univariate analyses yielded no significant differences for either task or planning time across any of these measures.


Planning time                      none            1 minute        2 minutes
                                   mean    SD      mean    SD      mean    SD
% fluent vs non-fluent speech      65.20   10.87   66.83   9.96    65.85   10.73
Unfilled pauses                    25.96   10.86   27.59   12.80   27.76   14.89
Filled pauses                      9.58    4.79    8.07    5.91    8.47    4.19
Reformulations (duration, secs)    2.74    1.94    2.49    1.65    3.48    2.61
Repetitions (duration, secs)       3.64    2.70    3.78    1.60    3.64    2.74
False starts (duration, secs)      3.34    2.14    3.55    1.71    3.09    1.59

Table 6: Fluency measures for intermediate candidates (N=18)

Planning time                      none            1 minute        2 minutes
                                   mean    SD      mean    SD      mean    SD
% fluent vs non-fluent speech      69.49   9.41    68.83   9.26    70.13   7.53
Unfilled pauses                    22.49   9.05    21.62   8.58    21.69   8.61
Filled pauses                      6.79    3.92    7.17    2.58    7.72    3.88
Reformulations (duration, secs)    2.04    1.61    2.09    1.83    2.86    2.37
Repetitions (duration, secs)       3.58    5.42    2.80    2.30    3.47    3.18
False starts (duration, secs)      2.63    2.34    2.51    1.81    3.54    3.33

Table 7: Fluency measures for advanced candidates (N=18)

There were two measures of complexity – proportion of dependent clauses per AS-unit, and percentage of subordinate clauses per AS-unit. Once again, as shown in Tables 8 and 9, the mean scores for each planning condition were fairly close, although there does appear to be an increase in the number of subordinate clauses per AS-unit in the one minute planning condition for both intermediate and advanced candidates. However, this difference was not large enough to reach statistical significance.


Planning time                    none           1 minute       2 minutes
                                 mean    SD     mean    SD     mean    SD
Dependent clauses / AS-unit      1.4     0.4    1.5     0.5    1.5     0.5
Subordinate clauses / AS-unit    16.5    10.2   26.9    21.3   21.8    15.1

Table 8: Complexity measures for intermediate candidates (N=18)

Planning time                    none           1 minute       2 minutes
                                 mean    SD     mean    SD     mean    SD
Dependent clauses / AS-unit      1.7     0.2    1.7     0.2    1.7     0.4
Subordinate clauses / AS-unit    21.9    12.8   27.1    14.5   20.4    20.6

Table 9: Complexity measures for advanced candidates (N=18)

The global measures for accuracy (error-free AS-units and error-free clauses) are presented in Tables 10 and 11. Statistical analyses again indicated that there were no significant differences according to either task or the amount of planning time provided.

Planning time            none           1 minute       2 minutes
                         mean    SD     mean    SD     mean    SD
% error-free AS-units    26.5    21.7   27.3    23.1   24.8    22.7
% error-free clauses     40.4    9.9    40.1    21.4   39.1    12.0

Table 10: Accuracy measures for intermediate candidates (N=18)

Planning time            none           1 minute       2 minutes
                         mean    SD     mean    SD     mean    SD
% error-free AS-units    26.1    16.8   26.5    16.8   30.0    16.7
% error-free clauses     39.1    16.6   42.0    18.7   40.3    15.8

Table 11: Accuracy measures for advanced candidates (N=18)

To summarise, there were no significant differences between groups in any of the score measures or in the discourse measures, whether candidates had access to one minute or two minutes of planning time, or no planning time at all.

The implications of these results for continuing to include planning time in Part 2 of the IELTS test are discussed further below.


5.3 Research question 3: How do candidates perceive the usefulness and validity of strategic planning time?

Candidates were asked in the questionnaire whether they felt that planning time helped them, to which 89% responded positively. This was reiterated in the focus group interviews, where most of the students said that they found it easier when planning time was available: “Planning time is important … you can organise your idea and prepare what you want to say”. One candidate stated that planning time is useful not only for organising ideas but also for providing time in which to calm nerves in the stressful testing situation.

The comment section of the questionnaire provided some interesting insights. The candidates were asked to comment on three aspects of their performance:
a) why planning time had not helped them
b) which task they thought they had performed best on and why
c) which task they had performed worst on and why.
Very generally, the candidate responses can be broken down as follows.

Planning time was used to:         Number    % of candidates
Organise                           21        23.59
Improve ideas/think about topic    18        20.22
Improve speaking                   16        17.98
Structure                          5         5.61
Nervousness                        3         3.37
Other                              16        17.98
Negative                           10        11.24
Total                              89        100%

Table 12: Use of planning time by candidates

Negative responses indicating that planning time was not useful or even counterproductive were few in number, although one candidate at the focus group interview suggested that having to prepare in front of the interviewer made him more anxious than when he spoke without any planning. Typical responses from the major categories are given below.

Organise
planning time helps organise ideas (candidate 11)
planning lead to organise my ideas (cand 23)
helped me know what I have to say and what is first, second… (cand 34)
I can decide on ideas and organise them (cand 40)
had time to organise topic and write down my idea (cand 49)
it can help to organise my ideas (cand 54)
I can prepare and organise my ideas to explain better (cand 63)
I can organise my thinking and ideas before speaking (cand 64)
helped me organise my idea (cand 85)


Improve ideas/think about topic
more time allows you to better use ideas (cand 13)
makes me brainstorm (cand 15)
think about the topic step-by-step (cand 33)
can prepare and think more to say about the topic (cand 44)
helps me think about the content of the topic (cand 45)
I spent time thinking about how to extend my topic (cand 53)
thought about more things to talk about (cand 68)
I can describe more about the topic (cand 80)

Improve speaking
can make speaking more clearly (cand 6)
because I can speak well planning the tasks (cand 27)
improve my speaking in English (cand 43)
I know what I am going to talk, making me more fluent (cand 55)
successful speech – smooth (cand 60)
I didn’t think about the topic, but it helped to speak calmly (cand 70)
helps speak clearly (cand 79)
I wrote the points and then I was able to speak clearly (cand 82)

Structure
organised sentences better (cand 3)
can think about how to make sentences correctly then word form (cand 65)
tried to write down words relating to my topic (cand 66)
better arrangement, grammar structure and fewer awkward sentences (cand 73)

Perception of ‘worst’ task

Interestingly, when asked to identify which of the three tasks they did worst on, many commented that they did worst on a particular task because they did not have time to prepare their response properly. (Task 1 was describing a subject they had studied; Task 2 was describing a book or movie; and Task 3 was describing an important event in their lives.)

The last task. I wasn’t able to take notes, so I had to think immediately (cand 10, task 1, no planning)
Subject. I had little time to prepare (cand 13, task 1, no planning)
The first one. The time was only enough to remember my event (cand 20, task 2, no planning)
The second one. No enough time even to read the topic (cand 28, task 2, no planning)
The first one due to no enough time (cand 32, task 3, no planning)
Event due to no time to plan (cand 33, task 3, no planning)
The last task due to not enough time to get ready (cand 36, task 3, no planning)
The last one. No time to prepare (cand 39, task 3, no planning)
Task 1. I had no time to organise or think (cand 49, task 1, no planning)
Subject. I had no time to think about my ideas (cand 54, task 1, no planning)
Subject. I had no time to think about the topic (cand 60, task 1, no planning)
The first one due to no enough time (cand 65, task 2, no planning)
The last one. Without preparation, I kept repeating the same information (cand 82, task 3, no planning)
Event. I had neither time nor ideas (cand 85, task 3, no planning)

As can be seen from the examples above, this was often when the candidates had no planning time available, but this was not always the case, as the following examples show.

Task 2: time was short and the topic was hard for me (cand 4, task 2, one minute)
Task 3. I didn’t have time to think (cand 5, task 3 (event), two minutes)
The last one. Not enough time (cand 26, task 3, one minute)
Task 1. I had no time and didn’t know what to say (cand 78, task 1, one minute)
Task 1. Not enough time (cand 84, task 1, one minute)

Topic was another important factor which impacted on performance. As can be seen from the responses below, the task candidates found most difficult was often one where they found the topic difficult, and in such cases the presence or absence of planning time was unlikely to make much difference.

2nd. In the middle of that task, I couldn’t talk about anything (cand 5, task 2, one minute)
Event. I don’t have information about it (cand 7, task 3, two minutes)
Task 2, subject. I had no idea (cand 14, task 2, one minute)
The third one. I seldom watch movies (cand 15, task 2, one minute)
The first one. I couldn’t think of anything (cand 16, task 2, no planning)
Book/movie. I had no idea about it (cand 21, task 2, no planning)
Subject. I’ve never thought about this (cand 22, task 1, two minutes)
Book. I was confused (cand 35, task 2, two minutes)
Subject. I have no idea about it, even when I use my own language (cand 4, task 1, one minute)
Subject. I’ve never thought about this task (cand 43, task 1, one minute)
Subject. I have no idea to describe a subject (cand 44, task 1, one minute)
First one. I have never thought of it before (cand 47, task 1, no planning)
Book. I had no idea about the book (cand 51, task 2, one minute)
The last one. I didn’t know about the topic very well (cand 59, task 2, one minute)
Subject. I’ve never done this task before (cand 71, task 1, two minutes)
The third one. I had nothing to say (cand 75, task 3, one minute)
Movie/book. It was hard to describe a book, especially some Chinese book (cand 79, task 2, two minutes)
Event. This topic is too big (cand 80, task 3, no planning)

5.3.1 Topic as a factor

Topic was also identified as a salient factor in responses to the question about which task they felt they had performed best on, with 56 of the candidates (62.9%) mentioning this, compared to 21 (23.6%) who identified the presence of planning time as the major determinant of their performance. However, 19 of the responses in the latter category indicated that it was the two minutes of planning time which they perceived to have made the difference, and of these, five mentioned both planning time and topic as contributing. Some typical responses are below.

Maybe the movie because I am interested in it (cand 3)
Task 1. The topic was easier than others (cand 6)
Event. I have a lot of events in my life, that I can explain very well (cand 11)
Book. I just read the book recently, so I can remember (cand 24)
Task 2. My memorable event in my life that I never forget (cand 41)
Last one. It was a part of my life (cand 47)
Movie. There were many ideas in my mind (cand 64)
Event. It was the most important event in my life (cand 72)
Task 1. I got many points to talk about (cand 81)
Task 3. I was familiar with the topic (cand 83)
Movie. I’m interested in it (cand 86)

5.3.2 Planning time as a factor

Second one. Enough time to think (cand 8, task 3, two minutes)
Event. I had more time to prepare (cand 13, task 3, two minutes)
Task 3 about subject. I could use two mins to plan (cand 16)
The first one. I had time to prepare (cand 28, task 1, two minutes)
Task 3. I had more time to prepare (cand 34, task 3, two minutes)
Event. I had more time to think about my ideas and how to say them (cand 54, task 3, two minutes)
Event. I had more time to prepare. With two mins, you have enough time to think of it (cand 59)
The last one. I had time to prepare it (cand 65, task 1, two minutes)

5.3.3 Planning and topic as a factor

3: I had time to organise my ideas and the topic was familiar with me (cand 4)
Task 3. I had enough time to think and the test name “talking about movies” was interesting (cand 14)
Task 3. I had more time to organise and the topic was easier for me (cand 49)
Subject. Enough time to prepare and familiar subject (cand 30)

It should be pointed out that in the questionnaire almost 50% of the candidates claimed to be familiar with the tasks. In their responses to a subsequent question (“Which task do you think you performed best on? Why?”), almost 15% of the candidates reported that they had previously practised the tasks.

Subject. I am familiar with it (cand 21)
Movie. I have done this topic before. I’m familiar with it (cand 31)
I practised it before (cand 62)
Subject. I prepared this topic before and am familiar with the vocabulary (cand 80)

As discussed below, this may be a factor which contributes to the null findings presented in this study.

Overall, it appears from the candidates’ responses above that although planning time does not seem to affect scores or engender differences in the discourse measures investigated above, the majority of candidates clearly found it useful and identified difficulties when it was not available. Nevertheless, the topic of the task emerged as the most important factor in how candidates perceived their own performance on these tasks.

© IELTS Research Reports Volume 6 18

Page 27: Vol6 Full Report

1. An investigation of the effectiveness and validity of planning time in Part 2, IELTS Speaking – Elder + Wigglesworth

5.4 Research question 4: How do candidates use their planning time?

The analysis of the strategy questionnaires revealed that the candidates used a variety of strategies when they had planning time available. The most common strategies are given in Table 13; the six most popular are marked with an asterisk.

Strategy                                                    1 minute planning   2 minutes planning
                                                            Number    %         Number    %
* I tried to decide what topic I would talk about           72        80.9      68        76.4
* I thought about the content and ideas needed for the task 58        65.2      57        64.0
* I read the task card again                                57        64.0      53        59.5
* I thought about how to organise my ideas                  53        59.5      61        68.5
* I wrote down vocabulary on paper                          42        47.2      51        57.3
* I wrote down useful sentences or phrases on paper         40        44.9      44        49.4
I thought about grammar (eg verb forms) in my head          32        35.9      37        41.6
I made notes about grammar on paper                         11        12.4      17        19.1
I practised useful sentences or phrases in my head          33        37.1      42        47.2
I made a list of vocabulary in my head                      26        29.2      31        34.8
I made a list of useful organising and/or linking
  language in my head                                       32        35.9      43        48.3
I wrote down useful organising and/or linking
  language on paper                                         22        24.7      35        39.3
I practised the task in my head                             27        30.3      35        39.3
I practised pronunciation in my head                        15        16.8      21        23.6
I wrote down ideas in my first language and then
  translated them                                           10        11.2      13        14.6
I thought about nothing                                     12        13.5      9         10.1

Table 13: Use of strategies by candidates with one and two minutes planning time

While the pattern of strategy use is similar for both the one and two minute planning conditions, there was a significant difference in the number of strategies candidates reported using when more planning time was available (t = 2.575, df = 88, p = 0.012).

5.5 Research question 5: What are the most effective strategies for the use of strategic planning time?

Given the results of the previous analyses, it was anticipated that there would be no significant correlations between the number of planning strategies used and either the global or analytic scores awarded under each planning condition. This proved to be the case.

A further analysis was undertaken which involved classifying the strategies as either macro-strategies (those concerned with topic, content and organisation) or micro-strategies (those concerned with language-level issues such as grammar, structure and vocabulary). The last strategy (‘I thought about nothing’), which attracted very few responses, was omitted (see Figure 1).


Macro-strategies:
I read the task card again
I practised the task in my head
I practised pronunciation in my head
I tried to decide what topic I would talk about
I thought about how to organise my ideas
I thought about the content and ideas needed for the task

Micro-strategies:
I thought about grammar (eg verb forms) in my head
I made notes about grammar on paper
I practised useful sentences or phrases in my head
I wrote down useful sentences or phrases on paper
I made a list of vocabulary in my head
I wrote down vocabulary on paper
I wrote down ideas in my first language and then translated them

Figure 1: Macro and micro strategies

Table 14 summarises strategy use by proficiency level and amount of planning time provided. While it appears that macro-strategies were used more frequently than micro-strategies under the one minute planning condition, and that the reverse was true when two minutes of planning was allowed, a Chi Square analysis revealed no significant differences across any of the groupings (a sketch of this check is given after Table 14).

                             Macro-strategies used    Micro-strategies used
Intermediate   1 minute      149                      138
               2 minutes     139                      166
Advanced       1 minute      128                      115
               2 minutes     148                      155

Table 14: Use of micro and macro strategies by group

Finally, the results of a t-test analysis indicated that there was no significant difference in the mean level of performance between micro- and macro-planners (ie candidates who reported using more language-related strategies and those who reported focusing more on content and organisation).

6 DISCUSSION AND CONCLUSION

The null findings in this study mirror those of Iwashita et al (2001) and Wigglesworth (2000). As noted in our earlier literature review, test-based research has produced scant evidence of benefits for strategic planning time on the quality of the subsequent speaking performance. In this study, the lack of any effect for planning time was consistent across all measures used, including the different categories of the IELTS rating scale and the various discourse dimensions. While there was some trend towards greater discourse complexity (as measured by the ratio of subordinate clauses to AS-units) under the one minute planning condition for both intermediate and advanced level candidates, this finding did not prove to be statistically significant. It therefore seems reasonable to conclude that planning time has limited utility for Part 2 of the IELTS oral test, which uses very similar tasks.


Does this mean that the one minute of planning time currently available to prepare performance on Part 2 of the IELTS oral is superfluous? We think not. Candidates’ expressed preference for planning time is worth taking notice of, if only for face validity reasons. Providing opportunities for planning may engender greater confidence in the IELTS Speaking Test on the part of candidates and, accordingly, greater acceptance of the scores obtained. However, while candidates’ questionnaire and interview responses suggest that removing the currently offered one minute of planning time from IELTS task 2 is likely to be unwelcome, there is surely no point in extending the amount of planning time provided, since the longer (two minute) planning condition yielded no additional benefit on any performance measure. Even for complexity, the marginal gains observed under the one minute condition disappeared completely when two minutes of planning were provided.

As far as strategies are concerned, the results of this study (and indeed from most other studies of planning in a test situation) suggest that, while candidates appreciate being given planning time before speaking, they make poor use of it. There was no evidence that either the number of strategies or the particular type of strategy (macro or micro) used by learners made a significant difference to performance. Interviewer feedback after administering the test indicated that many learners appeared lost during the planning period, or were too anxious to make use of what they had prepared. This is supported by comments made by one of the focus group interviewees, who reported that the presence of the interviewer distracted him from his planning efforts. Another commented that she was unable to read the notes she had made.

Another possibility (also reflected in comments from focus group candidates) is that the benefits of planning are constrained by memory, and that improvements in the fluency, accuracy or complexity of the discourse cannot be sustained beyond the first few utterances of candidate speech. It seems likely that raters are also constrained by memory and that it is the final impression which informs their judgement. This would explain the lack of any impact for planning on scores and on the discourse measures which are averaged across the whole stretch of performance.

It is also possible that in an unpressured monologic performance such as this one, candidates are able to monitor their speech as they go, and that this produces benefits even in the zero planning condition (see Yuan and Ellis 2005). The effects of strategic planning may therefore be discernible only under highly-pressured performance conditions where on-line planning is not possible. Further investigation of this may be warranted using the approach adopted by Yuan and Ellis (2005) in which on-line planning is sharply differentiated from pre-task planning and no planning by introducing a time limit for both the pre-task and no planning conditions, but providing unlimited time for the on-line planning condition.

Alternatively, it may be that there is a mismatch between the focus of candidate planning and what is valued by the IELTS rater and captured by our discourse measures. The strategies which candidates reported using most frequently in both the one and two minute planning conditions were those directed to planning the message content, whereas the main focus of the IELTS analytic rating scale categories is on form, or, to be more precise, candidates’ accuracy, fluency, pronunciation and the lexical resources they deploy. It might therefore be instructive to devise some means of measuring the propositional complexity of the discourse, to see if planning makes a difference to this dimension of performance (although it is debatable whether propositional complexity is of interest in a language testing context).

It might also be useful to examine in more detail those individuals who benefit from planning to determine what planning strategies these candidates engage in. However, to do so, we would need to devise a more fine-grained taxonomy of strategy use (see Ortega, 2005) and to gather rich think-aloud data (of the kind elicited by Sangarun, 2005). Such a study would be of interest to those involved in teaching test preparation courses and could form the basis for further research on the role of strategy training in boosting performance.


As pointed out above, many of the candidates reported having practised these or similar tasks before. It may be that planning is to no avail when candidates are already familiar with the task, particularly simple ones (like those used in this study) which require a description or commentary on past experience. Indeed it may be that on a high-stakes test such as IELTS, some candidates have prepared so well that much of what we are really measuring on this test is pre-rehearsed rather than spontaneous unplanned discourse (although this study provides no direct evidence of such a phenomenon, which should be the subject of further research). On the other hand, we saw comments from a number of test-takers indicating they were unprepared for the topics and in these instances, as was suggested earlier, planning time may do little to improve their performance.

The current study adds to the weight of evidence suggesting that planning time is not conducive to producing better performance in a testing environment. However, Xi’s (2005) recent findings in relation to the graph task on the SPEAK exam nevertheless give some grounds for believing that planning time may interact with task type. Before definitive conclusions are drawn, further research needs to be conducted using more complex and cognitively demanding tasks. In this respect, planning may prove more beneficial for integrated tasks, in which candidates are required to incorporate specific features of aural and written input in their oral response, than for other task types. Integrated tasks would also mitigate the situation found in this study, where topic difficulty overrode the availability or absence of planning time. In such tasks, where familiarity (or not) with the task is likely to be less of an issue since input material is given, planning would certainly be warranted, not only for reasons of fairness, but also on authenticity grounds.

In summary, the findings of this study offer positive support for the inclusion of a small amount of planning time on oral proficiency tests. However, the null findings on all measures, both rater evaluations and discourse analyses, suggest that the rationale for this relates more to fairness and face validity than to the ability of candidates to improve their performance as a result of planning time. As already noted, further research into the effects of planning time in testing contexts is warranted if we are to fully understand the impact that the provision of planning time may have on oral proficiency tests, and the ways in which it may impact on the test construct.


REFERENCES

Bryman, A, 2001, Social Research Methods, Oxford University Press, Oxford

Bygate, M and Samuda, V, 2005, ‘Integrative planning through the use of task-repetition’ in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Crookes, G, 1989, ‘Planning and interlanguage variation’, Studies in Second Language Acquisition, vol 11, pp 183-199

Elder, C and Iwashita, N, 2005, ‘Planning for test performance: What difference does it make?’ in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Elder, C, Iwashita, N and McNamara, T, 2002, ‘Estimating the difficulty of oral proficiency tasks: what does the test-taker have to offer?’, Language Testing, vol 19, no 4, pp 347-368

Ellis, R, 1987, ‘Interlanguage variability in narrative discourse: style shifting in the use of the past tense’, Studies in Second Language Acquisition, vol 9, no 1, pp 1-20

Ellis, R, ed, 2005, Planning and task performance in a second language, John Benjamins, Amsterdam and Philadelphia

Ellis, R and Yuan, F, 2005, ‘The effects of careful within-task planning on oral and written task performance’ in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Foster, P, 1996, ‘Doing the task better: how planning time influences students' performance’ in Challenge and change in language teaching, eds J Willis and D Willis, Heinemann, London

Foster, P and Skehan, P, 1996, ‘The influence of planning and task-type on second language performance’, Studies in Second Language Acquisition, vol 18, pp 299-323

Foster, P, Tonkyn, A and Wigglesworth, G, 2000, ‘Measuring spoken language: a unit for all reasons’, Applied Linguistics, vol 21, no 3, pp 354-375

Greenbaum, TL, 1998, The handbook for focus group research, Sage, Thousand Oaks, California

Iwashita, N, McNamara, T and Elder, C, 2001, ‘Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information processing approach to task design’, Language Learning, vol 51, no 3, pp 401-436

Linacre, JM, 1990, FACETS: computer program for many-faceted Rasch measurement, MESA Press, Chicago

Mehnert, U, 1998, ‘The effects of different lengths of time for planning on second language performance’, Studies in Second Language Acquisition, vol 20, no 1, pp 83-108

Ortega, L, 1999, ‘Planning and focus on form in L2 oral performance’, Studies in Second Language Acquisition, vol 21, pp 109-148

Ortega, L, 2005, ‘What do learners plan? Learner-driven attention to form during pre-task planning’ in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Rutherford, K, ‘An investigation into the effects of planning on oral production in a second language’, unpublished masters dissertation, University of Auckland, New Zealand


Sangarun, J, 2005, ‘The effects of focusing on meaning and form in strategic planning’ in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Skehan, P, 1996, ‘A framework for the implementation of task-based instruction’, Applied Linguistics, vol 17, pp 38-62

Skehan, P, 1998, A cognitive approach to language learning, Oxford University Press, Oxford

Skehan, P and Foster, P, 1997, ‘Task type and task processing conditions as influences on foreign language performance’, Language Teaching Research, vol 1, no 3, pp 185-211

Skehan, P and Foster, P, 1999, ‘The influence of task structure and processing conditions on narrative retellings’, Language Learning, vol 49, no 1, pp 93-120

Swain, M, 1985, ‘Large-scale communicative language testing: A case study’ in New Directions in Language Testing, ed YP Lee et al, Pergamon Press, Oxford, pp 35-46

Swain, M, 1993, ‘The output hypothesis: Just speaking and writing aren’t enough’, The Canadian Modern Language Review, vol 50, pp 158-164

Tavakoli, P and Skehan, P, 2005, ‘Strategic planning, task structure and performance testing’ in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Wendel, JN, 1997, ‘Planning and second language narrative production’, unpublished doctoral dissertation, Temple University, Japan

Wigglesworth, G, 1997, ‘An investigation of planning time and proficiency level on oral test discourse’, Language Testing, vol 14, no 1, pp 101-122

Wigglesworth, G, 2000, ‘Issues in the development of oral tasks for competency-based assessments of second language performance’ in Studies in immigrant English language assessment, Vol 1, Research series 11, ed G Brindley, National Centre for English Language Teaching and Research Macquarie University, Sydney, pp 81-124

Wigglesworth, G, 2001, ‘Influences on performance in task-based oral assessments’ in Task based learning, eds M Bygate, P Skehan and M Swain, Addison Wesley Longman, London, pp 186-209

Xi, X, 2005, ‘Do visual chunks and planning impact performance on the graph description task in the SPEAK exam?’, Language Testing, vol 22, no 4, pp 463-508

Yuan, F and Ellis, R, 2003, ‘The effects of pre-task planning and on-line planning on fluency, complexity and accuracy in L2 monologic oral production’, Applied Linguistics, vol 24, no 1, pp 1-27


APPENDIX 1: TASK PROMPTS PROVIDED FOR CANDIDATES

TASK 1 SUBJECT

Describe a subject you have studied which has had a great influence on your life:

You should say:

what the subject was

where you learned the subject

who your teacher was

and explain how it has influenced your life.

TASK 2 BOOK OR MOVIE

Talk about a book or a movie that you found interesting.

You should say:

what the book or movie was about

who the main characters were

what you liked and/or disliked about it

and explain why you found the book or movie interesting.

TASK 3 EVENT

Describe an event in your life (eg holiday or childhood experience) which made a great impression on you.

You should say:

what the event was

where and when it took place

who you were with

and explain why it made a great impression on you.


APPENDIX 2: TASK ADMINISTRATION INSTRUCTIONS FOR INTERVIEWER

When there is NO PLANNING TIME you should say the following:

Now, I’m going to give you a topic and I’d like you to talk about it for one to two minutes. I’d like you to start talking straight away. Do you understand?

Here’s your topic [hand over the relevant task card and give students 15 seconds to read the card]

I’d like you to talk about X (mention the topic of the task)

All right? Remember you have one to two minutes for this so don’t worry if I stop you. I’ll tell you when the time is up. Can you start speaking now please?

When there is ONE MINUTE OF PLANNING TIME you should say the following:

Now, I’m going to give you a topic and I’d like you to talk about it for one to two minutes. Before you talk, you’ll have one minute to think about what you are going to say. You can make some notes if you wish. Do you understand?

Here’s some paper and a pen for making notes [hand over spare paper and a pencil] and here’s your topic [hand over the relevant task card]

I’d like you to talk about X (mention the topic of the task)

Allow up to a minute for preparation, but the candidate can start earlier if he/she wants.

When the time is up or the student signals readiness to begin you should say:

All right? Remember you have one to two minutes for this, so don’t worry if I stop you. I’ll tell you when the time is up. Can you start speaking now please?

When there is TWO MINUTES OF PLANNING TIME you should say the following:

Now, I’m going to give you a topic and I’d like you to talk about it for one to two minutes. Before you talk, you’ll have two minutes to think about what you are going to say. You can make some notes if you wish. Do you understand?

Here’s some paper and a pen for making notes [hand over spare paper and a pencil] and here’s your topic [hand over the relevant task card]

I’d like you to talk about X (mention the topic of the task)

Allow up to two minutes for preparation, but the candidate can start earlier if he/she wants.

When the time is up or the student signals readiness to begin you should say:

All right? Remember you have one to two minutes for this so don’t worry if I stop you. I’ll tell you when the time is up. Can you start speaking now please?

When the student has finished the task you should retrieve the notes he/she has made and attach your own notes (if relevant) to them. Say:

Thank you very much.


APPENDIX 3: MARKING SHEET

Student’s number_____________________________

Interviewer name____________________________

Please give 4 ratings for each task, using the normal IELTS criteria, namely:

FC = Fluency and coherence
LR = Lexical resources
GRA = Grammatical range and accuracy
P = Pronunciation

FC LR GRA P

Task 1

Task 2

Task 3

Tasks are to be rated one at a time in order of performance.

APPENDIX 4: FOCUS GROUP INTERVIEW QUESTIONS

1. Did you think the tasks used for this study were a good measure of your ability to use language in university settings? (Give reasons for your answer)

2. Did you find planning time made the tasks easier? If no, please explain why. If yes, indicate how you see the benefits of planning time (ie how did it help you?)

3. Which planning activities were most helpful in performing the task?

4. Do you think you used the planning time as well as you could have? Say why/why not.

5. If you took notes during the planning session did you use these when performing the task? If yes, did having the notes in front of you help you?

6. Have you ever been given instruction/training on how to use pre-task planning time? If yes, how useful was it? If no, do you think it would help to have this kind of training?


APPENDIX 5: STUDENT QUESTIONNAIRE

A Task Feedback

1. Have you practised any of the three tasks you have just done before? (Tick yes or no)

Talking about a SUBJECT Yes No

Talking about a BOOK/MOVIE Yes No

Talking about an EVENT Yes No

2. Have any of your teachers taught you how to plan before speaking? Yes No

3. For two of the three tasks you have just performed some planning time was given.

Indicate (by ticking all the relevant boxes) which of the following things you did during your planning time before you started speaking.

                                                With 1 minute    With 2 minutes
TASK NAME

I read the task card again

I thought about grammar (eg verb forms) in my head

I made notes about grammar on paper

I practised useful sentences or phrases in my head

I wrote down useful sentences or phrases on paper

I made a list of vocabulary in my head

I wrote down vocabulary on paper

I made a list of useful organising and/or linking language in my head

I wrote down useful organising and/or linking language on paper

I practised the task in my head

I practised pronunciation in my head

I tried to decide what topic I would talk about

I thought about how to organise my ideas

I thought about the content and ideas needed for the task

I wrote down ideas in my first language and then translated them

I thought about nothing

I did other things (please tell us what you did)

Do you think the planning helped you? Yes No

Explain why/why not

Which task do you think you performed best on? Why?

Which of the three tasks do you think you performed worst on? Why?


2. An examination of the rating process in the revised IELTS Speaking Test

Author: Annie Brown
Ministry of Higher Education and Scientific Research
United Arab Emirates

Grant awarded: Round 9, 2003

This study examines the validity of the analytic rating scales used to assess performance in the IELTS Speaking Test, through an analysis of verbal reports produced by IELTS examiners when rating test performances and their responses to a subsequent questionnaire.

ABSTRACT

In 2001 the IELTS interview format and criteria were revised. A major change was the shift from a single global scale to a set of four analytic scales focusing on different aspects of oral proficiency. This study is concerned with the validity of the analytic rating scales. Through a combination of stimulated verbal report data and questionnaire data, this study seeks to analyse how IELTS examiners interpret the scales and how they apply them to samples of candidate performance.

This study addresses the following questions:

How do examiners interpret the scales and what performance features are salient to their judgements?
How easy is it for examiners to differentiate levels of performance in relation to each of the scales?
What problems do examiners identify when attempting to make rating decisions?

Experienced IELTS examiners were asked to provide verbal reports after listening to and rating a set of interviews. Each examiner also completed a detailed questionnaire about their reactions to the approach to assessment. The data were transcribed, coded and analysed according to the research questions guiding the study.

Findings showed that, in contrast with their use of the earlier holistic scale (Brown, 2000), the examiners adhered closely to the descriptors when rating. In general, the examiners found the scales easy to interpret and apply. Problems that they identified related to overlap between the scales, a lack of clear distinction between levels, and the inference-based nature of some criteria. Examiners reported the most difficulty with the Fluency and Coherence scale, and there were concerns that the Pronunciation scale did not adequately differentiate levels of proficiency.


CONTENTS

1 Rationale for the study
2 Rating behaviour in oral interviews
3 Research questions
4 Methodology
  4.1 Data
  4.2 Score data
  4.3 Coding
5 Results
  5.1 Examiners’ interpretation of the scales and levels within the scales
    5.1.1 Fluency and coherence
    5.1.2 Lexical resource
    5.1.3 Grammatical range and accuracy
    5.1.4 Pronunciation
  5.2 The discreteness of the scales
  5.3 Remaining questions
    5.3.1 Additional criteria
    5.3.2 Irrelevant criteria
    5.3.3 Interviewing and rating
6 Discussion
7 Conclusion
References
Appendix 1: Questionnaire

AUTHOR BIODATA:

ANNIE BROWN

Annie Brown is Head of Educational Assessment in the National Admissions and Placement Office (NAPO) of the Ministry of Higher Education and Scientific Research, United Arab Emirates. Previously, and while undertaking this study, she was Senior Research Fellow and Deputy Director of the Language Testing Research Centre at The University of Melbourne. There, she was involved in research and development for a wide range of language tests and assessment procedures, and in language program evaluation. Annie's research interests focus on the assessment of speaking and writing, and the use of Rasch analysis, discourse analysis and verbal protocol analysis. Her books include Interviewer Variability in Oral Proficiency Interviews (Peter Lang, 2005) and the Language Testing Dictionary (CUP, 1999, co-authored with colleagues at the Language Testing Research Centre). She was winner of the 2004 Jacqueline A Ross award for the best PhD in language testing, and winner of the 2003 ILTA (International Language Testing Association) award for the best article on language testing.


1 RATIONALE FOR THE STUDY

The IELTS Speaking Test was re-designed in 2001 with a change in format and assessment procedure. These changes responded to two major concerns: firstly, that a lack of consistency in interviewer behaviour in the earlier unscripted interview could influence candidate performance and hence ratings outcomes (Taylor, 2000); and secondly, that there was a degree of inconsistency in interpreting and applying the holistic band scales which were being used to judge performance on the interview (Taylor and Jones, 2001).

A number of studies of interview discourse informed the decision to move to a more structured format. These included Lazaraton (1996a, 1996b) and Brown and Hill (1998) which found that despite training, examiners had their own unique styles, and they differed in the degree of support they provided to candidates. Brown and Hill’s study, which focused specifically on behaviour in the IELTS interview, indicated that these differences in interviewing technique had the potential to impact on ratings achieved by candidates (see also Brown, 2003, 2004). The revised IELTS interview was designed with a more tightly scripted format (using interlocutor “frames”) to ensure that there would be less individual difference among examiners in terms of interviewing technique. A study by Brown (2004) conducted one year into the operational use of the revised interview found that generally this was the case.

In terms of rating consistency, a study of examiner behaviour on the original IELTS interview (Brown, 2000) revealed that while examiners demonstrated a general overall orientation to features within the band descriptors, they appeared to interpret the criteria differently and included personal criteria not specified in the band scales (in particular interactional aspects of performance, and fluency). In addition, it appeared that different criteria were more or less salient to different raters. Together these led to ratings variability. Taylor and Jones (2001) reported that “it was felt that a clearer specification of performance features at different proficiency levels might enhance standardisation of assessment” (2001: 9).

In the revised interview, the holistic scale was replaced with four analytic scales. This study seeks to validate the new scales through an examination of the examiners’ cognitive processes when applying the scales to samples of test performance, and a questionnaire which probes the rating process further.

2 RATING BEHAVIOUR IN ORAL INTERVIEWS

There has been growing interest over the last decade in examining the cognitive processes employed by examiners of second language production through the analysis of verbal reports produced during, or immediately after, performing the rating activity. Most studies have been concerned with the assessment of writing (Cumming, 1990; Vaughan, 1991; Weigle, 1994; Delaruelle, 1997; Lumley, 2000). But more recently, the question of how examiners interpret and apply scales in assessments of speaking has been addressed (Meiron, 1998; Brown, 2000; Brown, Iwashita and McNamara, 2005). These studies have investigated questions such as: how examiners assign a rating to a performance; what aspects of the performance they privilege; whether experienced or novice examiners rate differently; the status of self-generated criteria; and how examiners deal with problematic performances.

In her examination of the functioning of the now-retired IELTS holistic scale, Brown (2000) found that the holistic scale was problematic for a number of reasons. Different criteria appeared to be more or less salient at different levels; for example, comprehensibility and production received greater attention at the lower levels and were typically commented on only where there was a problem. Brown found that different examiners attended to different aspects of performance, privileging certain features over others in their assessments. Also, some examiners were found to be more performance-oriented, focusing narrowly on the quality of performance in relation to the criteria, while others were reported to be more inference-oriented, drawing conclusions about candidates’ ability to cope in other contexts. The most recently trained examiner focused more exclusively on features referred to in the scales and made fewer inferences about candidates.

In the present study, of course, the question of weighting should not arise, although examiners may have views on the relative importance of the criteria. A survey of examiner reactions to the previous IELTS interview and holistic rating procedure (Merrylees and McDowell, 1999) found that most Australian examiners would prefer a profile scale. Another question then, given the greater detail in the revised, analytic scales, is whether examiners find them easier to apply than the previous one, or whether the additional detail and difficulty distinguishing the scales makes the assessment task more problematic.

Another question of concern when validating proficiency scales is the ease with which examiners are able to distinguish levels. While Merrylees and McDowell (1999) found that around half the examiners felt the earlier holistic scale used in the IELTS interview was able to distinguish clearly between proficiency levels, Taylor and Jones reported concern as to “how well the existing holistic IELTS rating scale and its descriptors were able to articulate key features of performance at different levels or bands” (2001: 9). Again, given the greater detail and narrower focus of the four analytic scales compared with the single holistic one, the question arises of whether this allows examiners to better distinguish levels. A focus in the present study, therefore, is the degree of comfort that examiners report when using the analytic scales to distinguish candidates at different levels of proficiency.

When assessing performance in oral interviews, in addition to a range of linguistic and production related features, examiners have also been found to attend to less narrowly linguistic aspects of the interaction. For example, in a study of Cambridge Assessment of Spoken English (CASE) examiners’ perceptions, Pollitt and Murray (1996) found that in making judgements of candidates’ proficiency, examiners took into account perceived maturity and willingness or reluctance to converse. In a later study of examiners’ orientations when assessing performances on SPEAK (Meiron, 1998), despite it being a non-interactive test, Meiron found that examiners focused on performance features such as creativity and humour, which she described as reflecting a perspective on the candidate as an interactional partner.

Brown’s analysis of the IELTS oral interview (2000) also found that examiners focused on a range of performance features, both specified and self-generated, and these included interactional skills, in addition to the more explicitly defined structural, functional and topical skills. Examiners noted candidates’ use of interactional moves such as challenging the interviewer, deflecting questions and using asides, and their use of communication strategies such as the ability to self-correct, ask for clarification or use circumlocution. They also assessed candidates’ ability to “manage a conversation” and expand on topics. Given the use in the revised IELTS interview of a scripted interview and a set of four linguistically focused analytic scales, rather than the more loosely worded and communicatively-oriented holistic one in the earlier format, the question arises of the extent to which examiners still attend to, and assess communicative or interactional skills, or any other features not included in the scales.

Another factor which has been found to impact on ratings in oral interviews is interviewer behaviour. Brown (2000, 2003, 2004) found that in the earlier unscripted quasi-conversational interviews, examiners took notice of the interviewer and even reported compensating when awarding ratings for what they felt was inappropriate interviewer behaviour or poor technique. This finding supported those of Morton, Wigglesworth and Williams (1997) and McNamara and Lumley (1997), whose analyses of score data combined with examiners’ evaluations of interviewer competence also found that examiners compensated in their ratings for less-than-competent interviewers. Pollitt and Murray (1993) found that examiners made reference to the degree of encouragement interviewers gave candidates. While it is perhaps to be expected that interviewer behaviour might be salient to examiners in interviews which allow interviewers a degree of latitude, the fact that the raters in Morton et al.’s study, which used a scripted interview (the access: oral interview), took the interviewer into account in their ratings raises the question of whether this might also be the case in the current IELTS interview, which is also scripted, in those instances where interviews are rated from tape.

3 RESEARCH QUESTIONS

On the basis of previous research, and in the interests of seeking validity evidence for the current oral assessment process, this study focuses on the interpretability and ease of application of the revised, analytic scale, addressing the following sets of questions:

1. What performance features do examiners explicitly identify as evidence of proficiency in relation to each of the four scales? To what extent do these features reflect the “criteria key indicators” described in the training materials? Do examiners attend to all the features and indicators? Do they attend to features which are not included in the scales? How easy do they find it to apply the scales to samples of candidate performance? How easy do they find it to distinguish between the four scales?

2. What is the nature of oral proficiency at different levels of proficiency in relation to the four assessment categories? How easy is it for examiners to distinguish between adjacent levels of proficiency on each of the four scales? Do they believe certain criteria are more or less important at different levels? What problems do they identify in deciding on ratings for the samples used in the study?

3. Do examiners find it easy to follow the assessment method stipulated in the training materials? What problems do they identify?

4 METHODOLOGY

4.1 Data

The research questions were addressed through the analysis of two complementary sets of data:

verbal reports produced by IELTS examiners as they rated taped interview performances

the same examiners’ responses to a questionnaire which they completed after they had provided the verbal reports.

The verbal reports were collected using the stimulated recall methodology (Gass and Mackey, 2000). In this approach, the reports are produced retrospectively, immediately after the activity, rather than concurrently, as the online nature of speaking assessment makes this more appropriate. The questionnaire was designed to supplement the verbal report data and to follow up any rating issues relating to the research questions which were not likely to be addressed systematically in the verbal reports. Questions focused on the examiners’ interpretations of, application of, and reactions to, the scales. Most questions required descriptive (short answer) responses. The questionnaire is included as Appendix 1.

Twelve IELTS interviews were selected for use in the study: three at each of Bands 5 to 8. (Taped interviews at Band 4 and below were too difficult to follow due to poor intelligibility; hence, only interviews at Band 5 and above were used.) The interviews were drawn from an existing data-set of taped operational IELTS interviews used in two earlier analyses: one of interviewer behaviour (Brown, 2003) and one of candidate performance (Brown, 2004). Most of the interviews were conducted in Australia, New Zealand, Indonesia and Thailand in 2001-2, although the original set was supplemented with additional tapes provided by Cambridge ESOL (test centres unknown). Selection for the present study was based on ratings awarded in Brown’s 2004 study, averaged across three examiners and the four criteria, and rounded to the nearest whole band.
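
To make the selection arithmetic concrete, the sketch below shows how a band level of this kind might be computed. It is a minimal sketch, not the study’s code: the ratings in it are invented, and round-half-up is assumed for “nearest whole band” (the report does not specify a tie-breaking rule).

```python
# A minimal sketch of the selection arithmetic: average one interview's
# ratings across three examiners and four criteria, then round to the
# nearest whole band. All rating values below are illustrative only.
import math

ratings = {
    # examiner -> [F&C, LR, GRA, P] bands (invented for illustration)
    "examiner_1": [7, 6, 6, 7],
    "examiner_2": [6, 6, 5, 7],
    "examiner_3": [7, 5, 6, 6],
}

all_bands = [band for bands in ratings.values() for band in bands]
mean_band = sum(all_bands) / len(all_bands)

# Round half up; the report says only "rounded to the nearest whole band",
# so the tie-breaking rule here is an assumption.
selected_band = math.floor(mean_band + 0.5)
print(f"mean = {mean_band:.2f}, selected band = {selected_band}")  # mean = 6.17, band 6
```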

Of the 12 interviews selected, seven involved male candidates and five female. The candidates were from the following countries: Bangladesh, Belgium, China (3), Germany, India, Indonesia (2), Israel, Korea and Vietnam. Table 1 shows candidate information and ratings.

Interview | Sex | Country | Averaged ratings
1 | M | Belgium | 8
2 | F | Bangladesh | 8
3 | M | Germany | 8
4 | M | India | 7
5 | F | Israel | 7
6 | M | Indonesia | 7
7 | M | Vietnam | 6
8 | M | China | 6
9 | F | China | 6
10 | M | China | 5
11 | F | Indonesia | 5
12 | F | Korea | 5

Table 1: Interview data

Six expert examiners (as identified by the local IELTS administrator) participated in the study. Expertise was defined in terms of having worked with the revised Speaking Test since its inception, and having demonstrated a high level of accuracy in rating.

Each examiner provided verbal reports for five interviews (see Table 2); note that Examiner 4 provided only four reports. Prior to data collection, they were given training and practice in the verbal report methodology.

The verbal reports took the following form. First, the examiners listened to the taped interview and referred to the scales in order to make an assessment. When the interview had finished, they stopped the tape and wrote down the score they had awarded for each of the criteria. They then started recording their explanation of why they had awarded these scores. Next they re-played the interview from the beginning, stopping the tape whenever they could comment on some aspect of the candidate’s performance. Each examiner completed a practice verbal report before commencing the main study. After finishing the verbal reports, all of the examiners completed the questionnaire.


Interview Examiner 1 Examiner 2 Examiner 3 Examiner 4 Examiner 5 Examiner 6
1 X X
2 X X X
3 X X X
4 X X
5 X X
6 X X
7 X X X
8 X X
9 X X
10 X X
11 X X X
12 X X X

Table 2: Distribution of interviews
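
As a consistency check on the design in Table 2, the sketch below (not from the report) verifies that five reports from each of five examiners plus four from Examiner 4 yield the 29 assessments reported in section 4.2, with each interview rated by two or three examiners; the per-interview rater counts are read off the X marks as printed.

```python
# Verify the rating design implied by Table 2: totals computed two ways
# should both equal the 29 assessments reported in section 4.2.
reports_per_examiner = {1: 5, 2: 5, 3: 5, 4: 4, 5: 5, 6: 5}
raters_per_interview = [2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 3, 3]  # interviews 1-12

total_from_examiners = sum(reports_per_examiner.values())
total_from_interviews = sum(raters_per_interview)
assert total_from_examiners == total_from_interviews == 29
print(f"total assessments: {total_from_examiners}")
```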

4.2 Score data

There were a total of 29 assessments for the 12 candidates. The mean score and standard deviation across all of the ratings for each of the four scales are shown in Table 3. The mean score was highest on Pronunciation, followed by Fluency and coherence, Lexical resource and finally Grammatical range and accuracy. The standard deviation was smaller on Pronunciation than on the other three scales, which reflects the narrower range of band levels used by the examiners: there were only three Pronunciation ratings lower than Band 6.

Scale | Mean | Standard deviation
Fluency and coherence | 6.28 | 1.53
Lexical resource | 6.14 | 1.60
Grammatical range and accuracy | 5.97 | 1.52
Pronunciation | 6.45 | 1.30

Table 3: Mean ratings
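
A minimal sketch of the computation behind Table 3 follows, assuming nothing beyond what the text states: for each scale, the mean and standard deviation are taken over all ratings. The score lists are placeholders, not the study’s data, and the report does not say whether a population or sample standard deviation was used.

```python
# Mean and standard deviation per scale over a set of band ratings.
from statistics import mean, pstdev

scores = {
    "Fluency and coherence": [8, 7, 6, 5, 6, 7],          # illustrative only
    "Lexical resource": [7, 7, 6, 5, 6, 6],
    "Grammatical range and accuracy": [7, 6, 6, 5, 5, 7],
    "Pronunciation": [8, 6, 6, 6, 6, 7],
}

for scale, bands in scores.items():
    # pstdev is the population standard deviation; whether the report used a
    # population or sample formula is not stated.
    print(f"{scale}: mean = {mean(bands):.2f}, sd = {pstdev(bands):.2f}")
```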

4.3 Coding

After transcription, the verbal report data were broken up into units, a unit being a turn – a stretch of talk bounded by replays of the interview. Each transcript consisted of several units, the first being the summary of ratings, and the remainder being the talk produced during the stimulated recall. At times, examiners produced an additional turn at the end, where they added information not already covered, or reiterated important points.

Before the data were analysed, the scales and the training materials were reviewed, specifically the key indicators and the commentaries on the student samples included in the examiner training package (UCLES, 2001). A comprehensive description of the aspects of performance that each scale and level addressed was built up from these materials.

Next, the verbal report data were coded in relation to the criteria. Two coders, the researcher and a research assistant, undertook the coding, with a proportion of the data being double-coded to ensure inter-coder reliability (over 90% agreement on all scales). This coding was undertaken in two stages. First, each unit was coded according to which of the four scales the comment addressed: Fluency and coherence, Lexical resource, Grammatical range and accuracy, and Pronunciation. Where more than one was addressed, the unit was double-coded. Additional categories were created, namely Score, where the examiner simply referred to the rating but did not otherwise elaborate on the performance; Other, where the examiner referred to criteria or performance features not included in the scales or other training materials; Aside, where the examiner made a relevant comment but one which did not directly address the criteria; and Uncoded, where the examiner made a comment which was totally irrelevant to the study or was inaudible. Anomalies were addressed through discussion by the two coders.
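
The agreement figure could be computed as simple percent agreement over the double-coded units, as in the sketch below; the unit codings are invented, and the report does not state whether a chance-corrected statistic (eg kappa) was also used.

```python
# Simple percent agreement between two coders on the same set of units.
# Category labels follow the study's coding scheme; codings are invented.
def percent_agreement(coder_a, coder_b):
    assert len(coder_a) == len(coder_b), "both coders must code the same units"
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)

coder_1 = ["F&C", "LR", "GRA", "P", "F&C", "Other", "GRA", "LR", "P", "F&C"]
coder_2 = ["F&C", "LR", "GRA", "P", "LR", "Other", "GRA", "LR", "P", "F&C"]

print(f"agreement = {percent_agreement(coder_1, coder_2):.0f}%")  # 90% here
```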

Once the data had been sorted according to these categories, a second level of coding was carried out for each of the four main assessment categories. Draft sub-coding categories were developed for each scale, based on the analysis of the scale descriptors and examiner training materials. These categories were then applied and refined through a trial and error process, and with frequent discussion of problem cases. Once coded, the data were then sorted in various ways and reviewed in order to answer the research questions guiding the study.

Of the comments that were coded as Fluency and coherence, Lexical resource, Grammatical range and accuracy, and Pronunciation (a total of 837), 28% were coded as Fluency and coherence, 26% as Lexical resource, 29% as Grammatical range and accuracy and 17% as Pronunciation. Examiner 1 produced 18% of the comments; Examiner 2, 17%; Examiner 3, 10%; Examiner 4, 14%; and Examiners 5 and 6, 20% each.
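
The sketch below reconstructs the category tallies implied by these percentages; the exact per-category counts are not given in the report, so the figures are back-calculated from 28/26/29/17% of 837 for illustration.

```python
# Tally coded units per scale and express each as a share of the total.
from collections import Counter

units = Counter({
    "Fluency and coherence": 234,           # ~28% of 837 (back-calculated)
    "Lexical resource": 218,                # ~26%
    "Grammatical range and accuracy": 243,  # ~29%
    "Pronunciation": 142,                   # ~17%
})

total = sum(units.values())  # 837
for scale, n in units.items():
    print(f"{scale}: {n} units ({100 * n / total:.0f}%)")
```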

The questionnaire data were also transcribed and analysed in relation to the research questions guiding the study. Where appropriate, the reporting of results refers to both sets of data.

5 RESULTS

5.1 Examiners’ interpretation of the scales and levels within the scales

In this section, the analysis of the verbal report data and relevant questionnaire data is drawn upon to illustrate, for each scale, the examiners’ interpretations of the criteria and the levels within them. Subsequent sections will focus on the question of the discreteness of the scales and the remaining questions.

5.1.1 Fluency and coherence

5.1.1a Understanding the fluency and coherence scale

The Fluency and coherence scale appeared to be the most complex in that the scale descriptors, and examiners’ comments, covered a larger number of relatively discrete aspects of performance than the other scales: hesitation, topic development, length of turn, and use of discourse markers.

The examiners referred often to the amount of hesitation, repetition and restarts, and (occasionally) the use of fillers. They noted uneven fluency, typically excusing early disfluency as “nerves”. They also frequently attempted to infer the cause of hesitation, at times attributing it to linguistic limitations – a search for words or the right grammar – and at other times to non-linguistic causes – to candidates thinking about the content of their response, to their personality (shyness), to their cultural background, or to a lack of interest in the topic (having “nothing to say”). Often examiners were unsure whether language or content was the cause of disfluency but, because it was relevant to the ratings decision (Extract 1), they struggled to decide. In fact, this struggle appeared to be a major problem as it was commented on several times, both in the verbal reports and in the responses to the questionnaire.

Extract 1 And again with the fluency he’s ready, he’s willing, there’s still some hesitation. And it’s a bit like ‘guess what I’m thinking’. It annoys me between 7 and 8 here, where it says – I think I alluded to it before – is it content related or is it grammar and vocab or whatever? It says here in 7, ‘some hesitation accessing appropriate language’. And I don’t know whether it’s content or language for this bloke. So you know I went down because I think sometimes it is language, but I really don’t know. So I find it difficult to make that call and that’s why I gave it a 7 because I called it that way rather than content related, so being true to the descriptor.

In addition to the amount or frequency of hesitation and possible causes, examiners frequently also considered the impact of too much hesitancy on their understanding of the candidate’s talk. Similarly, they noted the frequency of self-correction, repetition and restarts, and its impact on clarity. Examiners distinguished repair of the content of speech (“clarifying the situation”, “withdrawing her generalisation”), which they saw as native-like, even evidence of sophistication, from repair of grammatical or lexical errors. Moreover, this latter type of repair was at times interpreted as evidence of limitations in language but at other times was viewed positively as a communication strategy or as evidence of self-monitoring or linguistic awareness.

Like repair, repetition could also be interpreted in different ways. Typically it was viewed as unhelpful (for example, one examiner described the candidate’s repetition of the interviewer’s question as “tedious”) or as reducing the clarity of the candidate’s speech, or as indicative of limitations in vocabulary, but occasionally it was evaluated positively, as a stylistic feature (Extract 2).

Extract 2 So here I think she tells us it’s like she’s really got control of how to…not tell a story but her use of repetition is very good. It’s not just simple use; it’s kind of drawing you … ‘I like to do this, I like to do that’ – it’s got a kind of appealing, rhythmic quality to it. It’s not just somebody who’s repeating words because they can’t think of others she knows how to control repetition for effect so I put that down for a feature of fluency.

Another aspect of the Fluency and coherence scale that examiners attended to was the use of discourse markers and connectives. They valued the use of a range of discourse markers and connectives, and evaluated negatively their incorrect use and the overuse or repetitive use of only a few basic ones.

Coherence was addressed in terms of a) the relevance or appropriateness of candidates’ responses and b) topic development and organisation. Examiners referred to candidates being on task or not (“answering the question”), and to the logic of what they were saying. They commented negatively on poor topic organisation or development, particularly the repetition of ideas (“going around in circles”) or introduction of off-topic information (“going off on a tangent”), and on the impact of this on the coherence or comprehensibility of the response. At times examiners struggled to decide whether poor topic development was a content issue or a language issue. It was also noted that topic development favours more mature candidates.

A final aspect of Fluency and coherence that examiners mentioned was candidates’ ability, or willingness, to produce extended turns. They made comments such as “able to keep going” or “truncated”. The use of terms such as “struggling” showed their attention to the amount of effort involved in producing longer turns. They also commented unfavourably on speech which was disjointed or consisted of sentence fragments, and on speech where candidates kept adding phrases to a sentence or ran too many ideas together into one sentence.


5.1.1b Determining levels within the fluency and coherence scale

To determine how examiners coped with the different levels within the Fluency and coherence scale, the verbal report data were analysed for evidence of how the different levels were interpreted. Examiners also commented on problems they had distinguishing levels. In the questionnaire the examiners were asked whether each scale discriminated across the levels effectively and, if not, why.

In general, hesitancy and repetition were key features at all levels, with levels being distinguished by the frequency of hesitation and repetition and its impact on the clarity or coherence of speech. At the higher levels (Bands 7–9), examiners used terms like “relaxed” and “natural” to refer to fluency. Candidates at these levels were referred to as being “in control”.

Examiners appeared uncomfortable about giving the highest score (Band 9), and spent some time trying to justify their decisions. One examiner reported that the fact that Band 9 was “absolute” (that is, required all hesitation to be content-related) was problematic (Extract 3), as was distinguishing what constituted appropriate hesitation, given that native speakers can be disfluent. Examiners also expressed similar difficulties with the differences between Bands 7 and 8, where they reported uncertainty as to the cause of hesitation (whether it was grammar, lexis, or content related, see Extract 4).

Extract 3 Now I find in general, judgements about the borderline between 8 and 9 are about the hardest to give and I find that we’re quite often asked to give them. And the reason they’re so hard to give is that on the one hand, the bands for the 9 are stated in the very absolute sense. Any hesitation is to prepare the content of the next utterance for Fluency and coherence, for example. What’ve we got – all contexts and all times in lexis and GRA. Now as against that, you look at the very bottom and it says, a candidate will be rated on their average performance across all parts of the test. Now balancing those two factors is very hard. You’re being asked to say, well does this person usually never hesitate to find the right word? Now that’s a contradiction and I think that’s a real problem with the way the bands for 9 are written, given the context that we’re talking about average performance.

Extract 4 It annoys me between 7 and 8 here. Where it says – I think I alluded to it before – is it content related or is it grammar and vocab or whatever? It says here in 7: ‘Some hesitation accessing appropriate language’. And I don’t know whether it’s content or language for this bloke. So you know I went down because I think sometimes it is language, but I really don’t know. So I find it difficult to make that call and that’s why I gave it a 7 because I called it that way rather than content related, so being true to the descriptor.

The examiners appeared to have some difficulty distinguishing Bands 8 and 9 in relation to topic development, which was expected to be good in both cases. At Band 7, examiners reported problems starting to appear in the coherence and/or the extendedness of talk.

At Band 6, examiners referred to a lack of directness (Extract 5), poor topic development (Extract 6), candidates “going off on a tangent” or otherwise getting off the topic, and occasional incoherence. They referred to a lack of confidence, and speech was considered “effortful”. Repetition, hesitation and pausing were intrusive at this level (Extract 6). As described in the scales, an ability to “keep going” seemed to distinguish a 6 from a 5 (Extract 7).

Extract 5 And I found that she says a lot but she doesn’t actually say anything; it takes so long to get anywhere with her speech.


Extract 6 6 for Fluency and coherence. She was very slow, very hesitant. I felt that her long searches, her low pauses were searching for right words. And I felt that there was little topic development; that she wasn’t direct.

Extract 7 I ended up giving him a 6 for Fluency and coherence. I wasn’t totally convinced but by-and-large he was able to keep going.

At Band 5, examiners commented on having to work hard to understand the candidate. All of them expressed an inability at times to follow what the candidate was saying. Other comments related to the degree of repetition, hesitation and pausing, the overuse of particular discourse markers, not answering the question, and occasional trouble keeping going, elaborating or taking long turns (Extracts 8-10).

Extract 8 So I’ve given it 5 for fluency and I guess the deciding factor there was the 4 says unable to keep going without noticeable pauses, and she was able to keep going. There were pauses and all that but she did keep going, so I had to give her a 5 there.

Extract 9 Overuse of certain discourse markers, connectives and other cohesive features. He was using the same ones again and again and again.

Extract 10 So she got a 5 for Fluency and coherence because she was usually able to keep going, but there was repetition and there were hesitations mid-sentence while she looked for fairly basic words and grammar, and then she would stop as if she had more to say but she couldn’t think of the words. I think there is a category for that in the descriptor, but anyway…

Extract 11 It’s interesting in this section, the fluency tends to drop away and I don’t know whether it’s just that he doesn’t like the topic of soccer very much, so maybe I’m doing him an injustice but I’m going to end up marking him down to 7 on Fluency whereas before I was tending more to an 8 but I felt that if he was really fluent he’d be able to sustain it a little bit better.

5.1.1c Confidence in using the fluency and coherence scale

When asked to judge their confidence in understanding and interpreting the scales, no examiner selected lower than the mid-point on a scale of 1 (Not at all confident) to 5 (Very confident) for any of the scales (see Table 4). Examiners were marginally the least confident about Fluency and coherence, and the most confident about Pronunciation. When asked to elaborate on why they felt confident or not confident about the Fluency and coherence scale, several commented, as they had also done in their verbal reports, that the focus on hesitation was problematic because it was necessary, but not always possible, to infer its cause – a search for content or language – in order to decide whether a rating of 7 or 8 should be given. One examiner commented that “there can at times be more witchcraft than science involved in discerning why hesitation is used”. It was also noted that fluency can be affected by familiarity with or liking of a topic (Extract 11). Another commented that assessing whether speech is “situationally appropriate” is problematic given the restricted context of the interview, while another said that topic development being mentioned at Band 7 but not Band 8 is problematic. One examiner remarked that the Fluency and coherence descriptors are longer than the others and thus harder to “internalise”.


Only one examiner reported that the Fluency and coherence scale was easy to apply, commenting that the key indicators for her were whether the candidate could or could not keep going and could or could not elaborate.

Scale | Examiner 1 | Examiner 2 | Examiner 3 | Examiner 4 | Examiner 5 | Examiner 6 | Mean
Fluency and coherence | 4 | 5 | 4 | 3 | 4 | 3 | 3.8
Lexical resource | 5 | 4 | 3 | 4 | 4 | 4 | 4.0
Grammatical range and accuracy | 5 | 4 | 4 | 3 | 4 | 4 | 4.0
Pronunciation | 4.5 | 5 | 5 | 5 | 3 | 3 | 4.3

Table 4: Confidence using the scales

When asked in the questionnaire whether the descriptors of the Fluency and coherence scale capture the significant performance qualities at each of the band levels and distinguish adequately between levels, most examiners reported problems. Bands 6 and 7 were considered difficult to distinguish in terms of the frequency or amount of hesitation, repetition and/or self-correction. The terms “some” (Band 7) and “at times” (Band 6) were said to be very similar. One examiner said it was difficult to infer intentions (“willingness” and “readily”) to discriminate between Bands 6 and 7.

Distinguishing Bands 7 and 8 was also considered problematic for two reasons: firstly, because topic development is mentioned at Band 7 but not Band 8, but also because, as noted earlier, examiners found it difficult to infer whether disfluency was caused by a search for language (Band 7) or by the candidate thinking about their response (Band 8).

One examiner felt that Bands 4 and 5 were particularly difficult to distinguish because hesitation and repetition are “the hallmarks of both”; another reported problems distinguishing Band 4 and Band 6 (“even 6 versus 4 can produce problems – ‘coherence may be lost’ at 6 and ‘some breakdowns in coherence’ at 4”). Finally, it was noted that hesitation and pace may be an indication of limitations in language but often reflect individual habits of speech.

5.1.2 Lexical resource

5.1.2a Understanding the lexical resource scale

As was the case for the Fluency and coherence scale, examiners tended to attend to the range of features referred to in the descriptors and the key indicators. These included lexical errors, range of lexical resource (including stylistic choices and adequacy for different topics), the ability to paraphrase, and the use of collocations. One feature included in the key indicators but not referred to was the ability to convey attitude. Although it is not referred to in the scales, the examiners took candidates’ lack of comprehension of interviewer talk as evidence of limitations in Lexical resource.

As expected, there were numerous references to the sophistication and range of the lexis used by candidates, and to inaccuracies or inappropriate word choice. When they referred to inaccuracies or inappropriateness, examiners commented on their frequency (“occasional errors”), their seriousness (“a small slip”), the type of error (“basic”, “simple”, “non-systematic”) and the impact the errors had on comprehensibility. Examiners also commented on the appropriateness or correctness of collocations, and on morphological errors (the use of “dense” instead of “density”). They commented unfavourably on candidates’ inability to “find the right words”, a feature which overlapped with assessments of fluency.


While inaccuracies or inappropriate word choice were typically taken as evidence of lexical limitations, it was also recognised that unusual lexis or use of lexis may in fact be normal in the candidate’s dialect or style. This was particularly the case for candidates from the Indian sub-continent. The evidence suggests, however, that determining whether a particular word or phrase was dialectal or inappropriate was not necessarily straightforward (Extract 12).

Extract 12 That’s her use of “in here” and she does it a lot. I don’t know whether it’s a dialect or whether it’s a systematic error.

As evidence of stylistic control, examiners commented on a) the use of specific, specialist or technical terms, and b) the use of idiomatic or colloquial terms. They also evaluated the adequacy of candidates’ vocabulary for the type of topic (described in terms of familiar, unfamiliar, professional, etc). There was some uncertainty as to whether candidates’ ability to use specialist terms within their own professional, academic or personal fields of interest was indicative of a broad range of lexis or whether, because the topic was ‘familiar’, it was not. Reference was also made to the impact of errors or inappropriate word use on comprehensibility. Finally, although there were not a huge number of references to learned expressions or ‘formulae’, examiners typically viewed their use as evidence of vocabulary limitations (Extract 13), especially if the use of relatively sophisticated learned phrases contrasted with otherwise unsophisticated usage.

Extract 13 Very predictable and formulaic kind of response: “It’s a big problem” and “I’m not sure about the solution” kind of style, which again suggests very limited lexis and probably pre-learnt.

Examiners also attended to candidates’ ability to paraphrase when needed (Extract 14). They drew attention to specific instances of what they considered to be successful or creative circumlocution, such as “my good memory moment” or “the captain of a company”.

Extract 14 He rarely attempts paraphrase, he sort of stops, can’t say it and he doesn’t try and paraphrase it; he sort of repeats the bit that he didn’t say right.

5.1.2b Determining levels within the lexical resource scale

The verbal report and questionnaire data were next analysed for evidence of how examiners distinguished levels within the Lexical resource scale and what problems they had distinguishing them.

Examiners tended to value “sophisticated” or idiomatic lexical use at the higher end (Extract 15), although they tended to avoid Band 9 because of its “absoluteness”. Band 8 was awarded if they viewed non-native usages as “occasional errors” (Extract 16), and Band 9 if they considered them dialectal or “creative”. Precise and specific use of lexical items was also important at the higher levels, as per the descriptors.

Extract 15 … and very sophisticated use of common, idiomatic terms. He was clearly 8 in terms of lexical resources.


Extract 16 Then with Lexical resource, occasionally her choice of word was slightly not perfect and that’s why she didn’t get a 9 but she really does nice things that shows that she’s got a lot of control of the language – like at one stage she says that something “will end” and then she changed it and said it “might end”, and that sort of indicated that she knew about the subtleties of using; the impact of certain words.

At Band 7 examiners noted style and collocation. They still looked for sophisticated use of lexical items, although in contrast with Band 8, performance was considered uneven or patchy (Extract 17). They also noticed occasional difficulty elaborating or finding the words at Band 7.

Extract 17 So unusual vocabulary there; it’s suitable and quite sophisticated to say “eating voluptuously”, so eating for the joy of eating. So this is where my difficulty in assessing her lexical resource. She’ll come out with words like that which are really quite impressive but then she’ll say “the university wasn’t published”, which is quite inappropriate and distracting. So yes, at this stage I’m on a 7 for Lexical resource.

Whereas Band 7 required appropriate use of idiomatic language, at Band 6 examiners reported errors in usage (Extract 18). Performance at Band 6 was also characterised by “adequate” or “safe” use of common lexis.

Extract 18 Lexical resource was very adequate for what she was doing. She used a few somewhat unusual and idiomatic terms and there were points where therefore I was torn between a 6 and a 7. The reason I erred on the side of the 6 rather than the 7 was because those idiomatic and unusual terms were sometimes themselves not used quite correctly and that was a bit of a giveaway, it just wasn’t quite the degree of comfort that I’d have expected with a 7.

A Band 5 was typically described in terms of the range of lexis (“simple”), the degree of struggle involved in accessing it, and the inability to paraphrase. At this level candidates were seen to struggle for words and there was some lack of clarity in meaning (Extract 19).

Extract 19 It’s pretty simple vocabulary and he’s struggling for words, at times for the appropriate words, so I’d say 5 on Lexical resource.

Examiners awarded Band 4 when they felt candidates were unable to elaborate, even on familiar topics (Extract 20) and when they were unable to paraphrase (Extract 21). They also noted repetitive use of vocabulary.

Extract 20 So she can tell us enough to tell us that the government can’t solve this problem but she hasn’t got enough words to be able to tell us why, so it’s like she can make the claims but she can’t work on the meaning to build it up, even when she’s talking about something fairly familiar.

Extract 21 I did come down to a 4 because resource was ‘sufficient for familiar topics’ but really only basic meaning on unfamiliar topics, which is number 4. ‘Attempts paraphrase’ – well she didn’t really, she couldn’t do that. So I felt that she fitted a 4 with the Lexical resource.


5.1.2c Confidence in using the lexical resource scale

The examiners reported being slightly more comfortable with the Lexical resource scale than they were with the Fluency and coherence scale (Table 4). Three of them noted that it was clear, and the bands easily distinguishable. One noted that it was easy to check “depth” or “breadth” of lexical knowledge with a quick replay of the taped interview, focusing on the candidate’s ability to be specific. When asked to elaborate on what they felt the least confident about, examiners commented on:

the lack of interpretability of terms used in the scales (terms such as “sufficient”, “familiar” and “unfamiliar”)

the difficulty they had distinguishing between levels (specifically, the similarity between Band 7 “Resource flexibly used to discuss a variety of topics” and Band 6 “Resource sufficient to discuss at length”), and

the difficulty distinguishing between Fluency and coherence and Lexical resource (discussed in more detail later).

In relation to this last point, one examiner remarked that spoken discourse markers and other idiomatic items such as adverbials (“possibly”, “definitely”), emphatic terms (“you know”, “in a sense”) and intensifiers or diluters (“really”, “somewhat”, “quite”) are relevant to both Lexical resource and Fluency and coherence. One examiner commented that paraphrasing is difficult to assess as not all candidates do it, and that indicators such as repetition and errors are more useful. In contrast, another commented that the paraphrase criterion was useful, particularly across Bands 4 to 7. Another remarked that it is difficult to assess lexical resources in an interview and that the criteria should focus more on the relevance or appropriateness of the lexis to the context.

When asked whether the descriptors of the Lexical resource scale capture the significant performance qualities at each of the band levels and discriminate across the levels effectively, most examiners said that they felt that this was the case. One said the scale developed well – from “basic meaning” at 4 through “sufficient” at 5 to “meaning clear”, and then higher levels of idiom and collocation, etc. One felt that a clearer distinction was needed in relation to paraphrase for Bands 7 and 8, and another that Bands 5 and 6 were difficult to distinguish because the ability to paraphrase, which seemed to be a key ‘cut off’, was difficult to judge. Another felt that deciding what was a familiar or unfamiliar topic was problematic, particularly across Bands 4 and 5. One examiner did not like the use of the term “discuss” at Band 5, as this for her implied dealing in depth with an issue, something she felt was unlikely at that level. She suggested the term “talk about”. Another commented that some candidates have sophisticated vocabulary relating to specific areas of work or study yet lack more general breadth.

5.1.3 Grammatical range and accuracy

5.1.3a Understanding the grammatical range and accuracy scale

In general, the examiners were very true to the descriptors and all aspects of the Grammatical range and accuracy scale were addressed. The main focus was on error frequency and error type on the one hand, and complexity of sentences and structures on the other. Examiners appeared to balance these criteria against each other.

In relation to grammatical errors, examiners referred to density or frequency, including the number of error-free sentences. They also noted the type of error – those viewed as simple, basic, or minor included articles, tense, pronouns, subject-verb agreement, word order, plurals, infinitives and participles – and whether they were systematic or not. They also noted the impact of errors on intelligibility.

The examiners commented on the range of structures used, and the flexibility that candidates demonstrated in their use. There was reference, for example, to the repetitive use of a limited range of structures, and to candidates’ ability to use, and frequency of use of, complex structures such as passive, present perfect, conditional, adverbial constructions, and comparatives. Examiners also noted candidates’ ability to produce complex sentences, the range of complex sentence types they used, and the frequency and success with which they produced them. Conversely, what they referred to as fragmented or list-like speech or the inability to produce complete sentences or connect utterances (a feature which also impacted on assessments of coherence) was taken as evidence of limitations in grammatical resources.

5.1.3b Determining levels within the grammatical range and accuracy scale

To determine how examiners coped with the different levels within the Grammatical range and accuracy scale, the verbal report data were analysed for evidence of how the different levels were interpreted. Again, Band 9 was used little. This seemed to be because of its “absolute” nature; the phrase “at all times” was used to justify not awarding this band (Extract 22). Examiners did have some problems deciding whether non-native usage was dialectal or error. At Band 8, examiners spoke of the complexity of structures and the “flexibility” or “control” the candidates displayed in their use of grammar. At this level errors were expected to be both occasional and non-systematic, and tended to be referred to as “inappropriacies” or “slips”, or as “minor”, “small”, or “unusual” (for the candidate), or as “non-native like” usage.

Extract 22 And again I think I’m stopping often enough for these grammatical slips for it on average, remembering that we are always saying that, for it on average to match the 8 descriptor which allows for these, than the 9 descriptor which doesn’t.

Overall, Band 7 appeared to be a default level: not particularly distinguishable in itself, but a middle ground between 8 and 6, where examiners made a decision based on whether the performance was as good as an 8 or as bad as a 6. Comments tended to be longer at this level, as examiners argued for a 7 and against a 6 and an 8 (Extract 23). At this level inaccuracies were expected but were relatively unobtrusive, and some complex constructions were expected (Extract 24).

Extract 23 I thought that he was a 7 more than a 6. He definitely wasn’t an 8, although as I say, at the beginning I thought he might have been. There was a ‘range of structures flexibly used’. ‘Error free sentences frequent’, although I’m not a hundred per cent sure of that because of pronunciation problems. And he could use simple and complex sentences effectively, certainly with some errors. Now when you compare that to the criteria for 6: ‘Though errors frequently occur in complex structures these rarely impede communication …’

Extract 24 For Grammatical range and accuracy, even though there was [sic] certainly errors, there was certainly still errors, but you’re allowed that to be a 7. What actually impressed me here … he was good on complex verb constructions with infinitives and participles. He had a few really quite nice constructions of that nature which, I mean there we’re talking about sort of true complex sentences with complex verbs in the one clause, not just subordinate clauses, and I thought they were well handled. His errors certainly weren’t that obtrusive even though there were some fairly basic ones, and I think it would be true to say that error-free sentences were frequent there.

At Band 6 the main focus for examiners was the type of errors and whether they impeded communication. While occasional confusion was allowed, if the impact was too great then examiners tended to consider dropping to a 5 (Extract 25). Also, an inability to use complex constructions successfully and confidently kept candidates at 6 rather than a 7 (Extract 26).


Extract 25 A mixture of short sentences, some complex ones, yes variety of structures. Some small errors, but certainly not errors that impede communication. But not an advanced range of sentence structures. I’ll go for a 6 on the grammar.

Extract 26 Grammatical range and accuracy was also pretty strong, relatively few mistakes, especially simple sentences were very well controlled. Complex structures. The question was whether errors were frequent enough for this to be a 6, there certainly were errors. There were also a number of quite correct complex structures. I did have misgivings I suppose about whether this was a 6 or a 7 because she was reasonably correct. I suppose I eventually felt the issue of flexible use told against the 7 rather than the 6. There wasn’t quite enough comfort with what she was doing with the structures at all times for it to be a 7.

At Band 5 examiners noted frequent and basic errors, even in simple structures, and errors were reported as frequently impeding communication. Where attempts were made at more complex structures, these were viewed as limited and tended to lead to errors (Extract 27). Speech was fragmented at times. Problems with the verb “to be” or sentences without verbs were noted.

Extract 27 She had basic sentences, she tended to use a lot of simple sentences but she did also try for some complex sentences, there were some there, and of course the longer her sentences, the more errors there were.

The distinguishing feature of Band 4 appeared to be that basic and systematic errors occurred in most sentences (Extract 28).

Extract 28 Grammatical range and accuracy, I gave her a 4. Even on very familiar phrases like where she came from, she was missing articles and always missed word-ending ‘s’. And the other thing too is that she relied on key words to get meaning across and some short utterances were error-free but it was very hard to find even a basic sentence that was well controlled for accuracy.

5.1.3c Confidence in using the grammatical range and accuracy scale

When asked to comment on the ease of application of the Grammatical range and accuracy scale, one examiner remarked that it is easier to notice specific errors than error-free sentences, and another that errors become less important or noticeable if a candidate is fluent. Three examiners found the scale relatively easy to use.

Most examiners felt that the descriptors of the scale captured the significant performance qualities at each of the Band levels. One examiner said that he distinguished levels primarily in terms of the degree to which errors impeded communication. Another commented that the notion of "error" in speech can be problematic as natural speech flow (ie native) is often not in full sentences and is sometimes grammatically inaccurate.

When asked whether the Grammatical range and accuracy scale discriminates across the levels effectively, three agreed and three disagreed. One said that terms such as error-free, frequently, and well controlled are difficult to interpret (“I ponder on what per cent of utterances were frequently error-free or well controlled”). Another felt that Bands 7 and 8 were difficult to distinguish because he was not sure whether a minor systematic error would drop the candidate to 7, and that Bands 5 and 6 could also be difficult to distinguish. Another felt that the Band 4/5 threshold was problematic because some candidates can produce long turns (Band 5) but are quite inaccurate even in basic sentence forms (Band 4). Finally, one examiner remarked that a candidate who produces lots of structures with a low level of accuracy, even on basic ones, can be hard to place, and suggested that some guidance on “risk takers” is needed.

5.1.4 Pronunciation

5.1.4a Understanding the pronunciation scale

When evaluating candidates’ pronunciation, examiners focused predominantly on the impact of poor pronunciation on intelligibility, in terms of both frequency of unintelligibility and the amount of strain for the examiner (Extract 29).

Extract 29 I really do rely on that ‘occasional strain’, compared to ‘severe strain’. [The levels] are clearly formed I reckon.

When they talked about specific aspects of pronunciation, examiners referred most commonly to the production of sounds, that is, vowels and consonants. They did also, at times, mention stress, intonation and rhythm, and while they again tended to focus on errors there was the occasional reference to the use of such features to enhance the communication (Extract 30).

Extract 30 And he did use phonological features in a positive way to support his message. One that I wrote down for example was ‘well nobody was not interested’. And he got the stress exactly right and to express a notion which was, to express a notion exactly. I mean he could have said ‘everybody was interested’ but he actually got it exactly right, and the reason he got it exactly right among other things had to do with his control of the phonological feature.

5.1.4b Determining levels within the pronunciation scale

Next, the verbal report and questionnaire data were analysed for evidence of how the different levels were interpreted and problems that examiners had distinguishing levels. While they attended to a range of phonological features – vowel and consonant production but also stress and rhythm – intelligibility, or the level of strain involved in understanding candidates, appeared to be the key feature used to determine level (Extract 31). Because only even-numbered bands could be awarded, it seemed that examiners took into account the impact that the Pronunciation score might have on overall scores (Extract 32).

Extract 31 I really do rely on that ‘occasional strain’, compared to ‘severe strain’.

Extract 32 So I don’t know why we can’t give those bands between even numbers. So, just as I wanted to give a 5 to the Indian I want to give a 9 to this guy. Because you see the effect of 9, 9, 8, 8 will be he’ll come down to 8 probably, I’m presuming.

At Band 8 examiners tended to pick out isolated instances of irregular pronunciation, relating the impact of these on intelligibility to the descriptors: minimal impact and accent present but never impedes communication. Although the native speaker was referred to as the model, it was recognised that native speakers make occasional pronunciation errors (Extract 33). Occasional pronunciation errors were generally considered less problematic than incorrect or non-native stress and rhythm (Extract 34). One examiner expressed a liking for variety of tone or stress in delivery and noted that she was reluctant to give an 8 to a candidate she felt sounded bored or disengaged.


Extract 33 Because I suppose the truth is, as native speakers, we sometimes use words incorrectly and we sometimes mispronounce them.

Extract 34 It’s interesting how she makes errors in pronunciation on words. So she’s got “bif roll” and “steek” and “selard” and I don’t think there is much of a problem for a native speaker to understand as if you get the pauses in the wrong place, if you get the rhythm in the wrong place… so that’s why I’ve given her an 8 rather than dropping her down because it says ‘L1 accent may be evident, this has minimal effect on intelligibility’, and it does have minimal effect because it’s always in context that she might get a word mispronounced or pronounced in her way, not my way.

Band 6 appeared to be the ‘default’ level where examiners elected to start. Examiners seemed particularly reluctant to give a 4; of the 29 ratings, only three were below 6. Bands 4 and 6 are essentially determined with reference to listener strain, with severe strain at Band 4 and occasional strain at Band 6 (Extract 35).

Extract 35 Again with Pronunciation I gave her a 6 because I didn’t find patches of speech that caused ‘severe strain’, I mean there was ‘mispronunciation causes temporary confusion’, some ‘occasional strain’.

At Band 4 most comments referred to severe strain, or to the fact that examiners were unable to comprehend what the candidate had said (Extract 36).

Extract 36 I actually did mark this person down to Band 4 on Pronunciation because it did cause me ‘severe strain’, although I don’t know whether that’s because of the person I listened to before, or the time of the day but there were large patches, whole segments of responses that I just couldn’t get through and I had to listen to it a couple of times to try and see if there was any sense in it.

5.1.4c Confidence in using the pronunciation scale

When asked to judge their confidence in understanding and interpreting the scales, the examiners were the most confident about Pronunciation (see Table 4). However, there was a common perception that the scale did not discriminate enough (Extract 37). One examiner remarked that candidates most often came out with a 6, and another that she did not take pronunciation as seriously as the other scales. One examiner felt that experience with specific language groups could bias the assessment of pronunciation (and, in fact, there were a number of comments in the verbal report data where examiners commented on their familiarity with particular accents, or their lack thereof). One was concerned that speakers of other Englishes may be hard to understand and therefore marked down unfairly (Extract 38). Volume and speed were both reported, in the questionnaire data and verbal report data, as having an impact on intelligibility.

Extract 37 And I would prefer to give a 5 on Pronunciation but it doesn’t exist. But to me he’s somewhere between ‘severe strain’, which is the 4, and the 6 is ‘occasional strain’. He caused strain for me nearly 50% of the time, so that’s somewhere between occasional and severe. And this is one of the times where I really wish there was a 5 on Pronunciation because I think 6 is too generous and I think 4 is too harsh.


Extract 38 I think there is an issue judging the pronunciation of candidates who may be very difficult for me to understand, but who are fluent/accurate speakers of recognised second language Englishes, (Indian or Filipino English). A broad, Scottish accent can affect comprehensibility in the Australian context and I’m just not sure therefore, whether an Indian or Filipino accent affecting comprehensibility should be deemed less acceptable.

While pronunciation was generally considered to be the easiest scale on which to distinguish band levels because there are fewer levels, four of the six examiners remarked that there was too much distinction between levels, not too little, so that the scale did not discriminate between candidates enough. One examiner commented that as there is really no Band 2, it is a decision between 4, 6, or 8, and that she sees 4 as “almost unintelligible”. In arguing for more levels they made comments like: “Many candidates are Band 5 in pronunciation – between severe strain for the listener and occasional. Perhaps mild strain quite frequently, or mild strain in sections of the interview.” One examiner felt a Band 9 was needed (Extract 39).

Extract 39 Levels 1,3,5,7 and 9 are necessary. It seems unfair not to give a well-educated native speaker of English Band 9 for pronunciation when there’s nothing wrong with their English, Australian doctors going to UK.

Examiners commented at times on the fact that they were familiar with the pronunciation of candidates of particular nationalities, although they typically claimed to take this into account when awarding a rating (Extract 40).

Extract 40 I found him quite easy to understand but I don’t know that everybody would and there’s a very strong presence of accent or features of pronunciation that are so specifically Vietnamese that they can cause other listeners problems. So I’ll go with a 6.

5.2 The discreteness of the scales

In this section, the questionnaire data and, where relevant, the analysis of the verbal report data are drawn upon to address the question of the ease with which examiners were able to distinguish the four analytic scales: Fluency and coherence (F&C); Grammatical range and accuracy (GRA); Lexical resource (LR); and Pronunciation (P).

The examiners were asked how much overlap there was between the scales on a scale of 1 (Very distinct) to 4 (Almost total overlap); see Table 5. The greatest overlap (mean 2.2) was reported between Fluency and coherence and Grammatical range and accuracy. Overall, Fluency and coherence was considered to be the least distinct and Pronunciation the most distinct scale.


Scale overlap | Examiner responses (1-4) | Mean
F&C and LR | 1, 2, 2, 2, 3, 2 | 2.0
F&C and GRA | 3, 2, 2, 2, 2 | 2.2
F&C and P | 2, 2, 2, 1, 2 | 1.8
LR and GRA | 2, 2, 2, 1, 2, 2 | 1.8
LR and P | 1, 1, 1, 1, 1 | 1.0
GRA and P | 1, 1, 1, 1, 1 | 1.0

Table 5: Overlap between scales
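
The Table 5 means are consistent with averaging over the responses actually given: as printed, four rows contain only five ratings, so some examiners appear not to have answered those items. The sketch below reproduces the means under that assumption; which examiner skipped which item is not recoverable from the data as printed.

```python
# The Table 5 responses as printed, with None marking an apparently missing
# answer; each mean is taken over the responses actually given. The position
# of each None among the six examiners is an assumption.
overlap = {
    "F&C and LR":  [1, 2, 2, 2, 3, 2],
    "F&C and GRA": [3, 2, 2, 2, 2, None],
    "F&C and P":   [2, 2, 2, 1, 2, None],
    "LR and GRA":  [2, 2, 2, 1, 2, 2],
    "LR and P":    [1, 1, 1, 1, 1, None],
    "GRA and P":   [1, 1, 1, 1, 1, None],
}

for pair, responses in overlap.items():
    answered = [r for r in responses if r is not None]
    print(f"{pair}: mean = {sum(answered) / len(answered):.1f}")
```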

When asked to describe the nature of the overlap between scales, the examiners responded as follows. Comments made during the verbal report sessions supported these responses.

Overlap: Fluency and coherence / Lexical resource

Vocabulary was seen as overlapping with fluency because “to be fluent and coherent [candidates] need the lexical resources”, and because good lexical resources allow candidates to elaborate their responses. Two examiners pointed out that discourse markers (and, one could add, connectives), which are included under Fluency and coherence, are also lexical items. Another examiner commented that the use of synonyms and collocation helps fluency.

Overlap: Fluency and coherence / Grammatical range and accuracy

Grammar was viewed as overlapping with fluency because if a candidate has weak grammar but a steady flow of language, coherence is affected negatively. The use of connectives (“so”, “because”) and subordinating conjunctions (“when”, “if”) was said to play a part in both sets of criteria. Length of turn in Grammatical range and accuracy was seen as overlapping with the ability to keep going in Fluency and coherence (Extract 41).

Extract 41 Again I note both with fluency and with grammar the issue of the length of turns kind of cuts across both of them and I’m sometimes not sure whether I should be taking into account both of them or if not which for that, but as far as I can judge it from the descriptors, it’s relevant to both.

One examiner remarked that fluency can dominate the other criteria, especially grammar (Extract 42).

Extract 42 Well I must admit that I reckon if the candidate is fluent, it does tend to influence the other two scores. If they keep talking you think ‘oh well they can speak English’. And you have to be really disciplined as an examiner to look at those other – the lexical and the grammar – to really give them an appropriate score because otherwise you can say ‘well you know they must have enough vocab I could understand them’. But the degree to which you understand them is the important thing. So even as a 4 I said that I think there also needs to be some other sort of general band score. It does make you focus on those descriptors here.


Overlap: Lexical resource / Grammatical range and accuracy

Three examiners wondered whether errors in expressions or phrases (preposition phrases, phrasal verbs, idioms) were lexical or grammatical (“If a candidate says in the moment instead of at the moment, what is s/he penalised under?” and “I’m one of those lucky persons – is it lexical? Is it expression?”). Another examiner saw the scales as overlapping in relation to skill at paraphrasing.

Overlap: Fluency and coherence / Pronunciation

Two examiners pointed out that if the pronunciation is hard to understand, the coherence will be low. Another felt that slow (disfluent) speech was often more clearly pronounced and comprehensible, although another felt that disfluent speech was less comprehensible if there was “a staccato effect”.

One examiner remarked that if pronunciation is unintelligible it is not possible to accurately assess any of the other areas.

5.3 Remaining questions

5.3.1 Additional criteria

As noted earlier, during the verbal report session, examiners rarely made reference to features not included in the scales or key criteria. Those that examiners did refer to were:

- the ability to cope with different functional demands
- confidence in using the language, and
- creative use of language.

In response to a question about the appropriateness of the scale contents, the following additional features were proposed as desirable: voice; engagement; demeanour; and paralinguistic aspects of language use. Three examiners criticised the test for not testing “communicative” language. One examiner felt there was a need for a holistic rating in addition to the analytic ratings because global marking was less accurate than profile marking “owing to the complexity of the variables involved”.

5.3.2 Irrelevant criteria

When asked whether any aspects of the descriptors were inappropriate or irrelevant, one examiner remarked that candidates may not exhibit all aspects of particular band descriptors. Another saw conflict between the “absolute nature of the descriptors for Bands 9 and 1 and requirement to assess on the basis of ‘average’ performance across the interview”.

When asked whether they would prefer the descriptors to be shorter or longer, most examiners said the current length was fine. Three remarked that, since IELTS instructs that a candidate must fully fit all the descriptors at a particular level, longer descriptors would create more difficulties. One examiner said that the Fluency and coherence descriptors could be shorter and should rely less on discerning the cause of disfluency, whereas another remarked that more precise language was needed in Fluency and coherence Bands 6 and 7. Another referred to the need for more precise language in general. One examiner suggested that key ‘cut-off’ statements would be useful, and another that an appendix giving specific examples for the criteria would help.

5.3.3 Interviewing and rating

While they acknowledged that it was challenging to conduct the interview and rate the candidate simultaneously, the examiners did not feel it was inappropriately difficult. In part, this was because they had to pay less attention to managing the interaction and thinking up questions than they did in the previous interview, and in part because they were able to focus on different criteria in different sections of the interview, while the monologue turn gave them ample time to focus exclusively on rating. When asked whether they attended to specific criteria in specific parts of the interview, some said “yes” and some “no”.


They also reported different approaches to arriving at a final rating. The most common approach was to make a tentative assessment in the first part and then confirm this as the interview proceeded (Extract 43). One reported working down from the top level, and another making her assessment after the interview was finished.

Extract 43 By the monologue I have a tentative score and assess if I am very unsure about any of the areas. If I am, I make sure I really focus for that in the monologue. By the end of the monologue, I have a firmer feel for the scores and use the last section to confirm/disconfirm. It is true that the scores do change as a candidate is able to demonstrate the higher level of language in the last section. I do have some difficulties wondering what weight to give to this last section.

When asked if they had other points to make, two examiners remarked that the descriptors could be improved. One wanted a better balance between “specific” and “vague” terms, and the other “more distinct cut off points, as in the writing descriptors”. Two suggested improvements to the training: the use of video rather than audio-recordings of interviews, and the provision of examples attached to the criteria. Another commented that “cultural sophistication” plays a role in constructing candidates as more proficient, and that the test may therefore be biased towards European students (“some European candidates come across as better speakers, even though they may be mainly utilising simple linguistic structures”).

6 DISCUSSION

The study addressed a range of questions pertaining to how trained IELTS examiners interpret and distinguish the scales used to assess performance in the revised IELTS interview, how they distinguish the levels within each scale, and what problems they reported when applying the scales to samples of performance.

In general, the examiners referred closely to the scales when evaluating performances, quoting frequently from the descriptors and using them to guide their attention to specific aspects of performance and to distinguish levels. While there was reference to all aspects of the scales and key criteria, some features were referred to more frequently than others. In general, the more ‘quantifiable’ features, such as amount of hesitation (Fluency and coherence) or error density and type (Lexical resource and Grammatical range and accuracy), were the most frequently mentioned, although it cannot be assumed that this indicates greater weighting of these criteria over the less commonly mentioned ones (such as connectives or paraphrasing). Moreover, because examiners are required to make four assessments, one for each of the criteria, it seems less likely than under the previous single holistic scale that examiners will weight these four main criteria differentially.

There were remarkably few instances of examiners referring to aspects of performance not included in the scales, in marked contrast to the findings of an examination of the functioning of the earlier holistic scale (Brown, 2000). In that study, Brown reported that while some examiners focused narrowly on the criteria, others were “more inference-oriented, drawing more conclusions about the candidates’ ability to cope in other contexts” (2000: 78). She also noted that this was more often the case for more experienced examiners.

The examiners reported finding the scales relatively easy to use, and the criteria and their indicators to be generally appropriate and relevant to test performances, although they noted some overlap between scales and some difficulties distinguishing levels.


It was reported that some features were difficult to notice or interpret. Particularly problematic features included:

- the need to infer the cause of hesitation (Fluency and coherence)
- a lack of certainty about whether inappropriate language was dialectal or an error (Lexical resource and Grammatical range and accuracy)
- a lack of confidence in determining whether particular topics were familiar or not, particularly those relating to professional or academic areas (Lexical resource).

Difficulty was also reported in interpreting the meaning of “relative” terms used in the descriptors, such as sufficient and adequate. There was also some discomfort with the “absoluteness” of the Band 9 descriptors across the scales.

The most problematic scale appeared to be Fluency and coherence. It was the most complex in terms of focus and was also considered to overlap the most with other scales. Overlap resulted from the impact of a lack of lexical or grammatical resources on fluency, and because discourse markers and connectives (referred to in the Fluency and coherence scale) were also lexical items and a feature of complex sentences. Examiners seemed to struggle the most to determine band levels on the Fluency and coherence scale, perhaps because of the broad range of features it covers, and the fact that the cause of hesitancy, a key feature in the scale at the higher levels, is a high-inference criterion.

The Pronunciation scale was considered the easiest to apply; however, the examiners expressed a desire for more levels for Pronunciation. They felt it did not distinguish candidates sufficiently, and that the smaller number of band levels meant each rating decision carried too much weight in the overall (averaged) score.

As was found in earlier studies of examiner behaviour in the previous IELTS interview (Brown, 2000) and in prototype speaking tasks for Next Generation TOEFL (Brown, Iwashita and McNamara, 2005), in addition to ‘observable’ features such as frequency of error, complexity and accuracy, examiners were influenced in all criteria by the impact of particular features on comprehensibility. Thus they referred frequently to the impact of disfluency, lexical and grammatical errors and non-native pronunciation on their ability to follow the candidate or the degree of strain it caused them.

A marked difference between the present study and that of Brown (2000) was the relevance of interviewer behaviour to ratings. Brown found that a considerable number of comments were devoted to the interviewer, and reported that the examiners “were constantly aware of the fact that the interviewer is implicated in a candidate’s performance” (2000: 74). At times, the examiners even compensated for what they perceived to be unsupportive or less-than-competent interviewer behaviour (see also Brown 2003, 2004). While there were one or two comments on interviewer behaviour in the present study, they did not appear to have any impact on rating decisions. In contrast, however, some of the examiners did report a level of concern that the current interview and assessment criteria focused less on “communicative” or interactional skills than previously, a result of the use of interlocutor frames.

Finally, although the examiners in this study were rating taped tests conducted by other interviewers, they reported feeling comfortable (and more comfortable than was the case in the earlier unscripted interview) with simultaneously conducting the interview and assessing it, despite the fact that they were required to focus on four scales rather than one. This seemed to be because they no longer have to manage the interview by developing topics on the fly, and also have the opportunity during Part 2 (the long turn) to sit back and focus entirely on the candidate’s production.


7 CONCLUSION

This study set out to investigate examiners’ behaviour and attitudes to the rating task in the IELTS interview. The study was designed as a follow-up to an earlier study (Brown, 2000), which investigated the same issues in relation to the earlier IELTS interview. Two major changes in the current interview are: the use of interlocutor frames to constrain unwanted variation amongst interviewers; and the use of a set of four analytic scales rather than the previous single holistic scale.

The study aimed to derive evidence for or against the validity – the interpretability and ease of application – of these revised scales within the context of the revised interview. To do this, the study drew on two sets of data, verbal reports and questionnaire responses provided by six experienced IELTS examiners when rating candidate performances.

On the whole, the evidence suggested that the rating procedure works relatively well. Examiners reported a high degree of comfort using the scales. The evidence suggested there was a higher degree of consistency in examiners’ interpretations of the scales than was previously the case; a finding which is perhaps unsurprising given the more detailed guidance that four scales offer in comparison with a single scale. The problems that were identified – perceived overlap amongst scales, and difficulty distinguishing levels – could be addressed in minor revisions to the scales and through examiner training.


REFERENCES

Brown, A, 1993, ‘The role of test-taker feedback in the development of an occupational language proficiency test’ in Language Testing, vol 10 no 3, pp 277-303

Brown, A, 2000, ‘An investigation of the rating process in the IELTS Speaking Module’ in Research Reports 1999, vol 3, ed R Tulloh, ELICOS, Sydney, pp 49-85

Brown, A, 2003a, ‘Interviewer variation and the co-construction of speaking proficiency’ in Language Testing, vol 20 no 1, pp 1-25

Brown, A, 2003b, ‘A cross-sectional and longitudinal study of examiner behaviour in the revised IELTS Speaking Test’, report submitted to IELTS Australia, Canberra

Brown, A, 2004, ‘Candidate discourse in the revised IELTS Speaking Test’, IELTS Research Reports 2006, vol 6 (the following report in this volume), IELTS Australia, Canberra, pp 71-89

Brown, A, 2005, Interviewer variability in oral proficiency interviews, Peter Lang, Frankfurt

Brown, A and Hill, K, 1998, ‘Interviewer style and candidate performance in the IELTS oral interview’ in Research Reports 1997, vol 1, ed S Woods, ELICOS, Sydney, pp 1-19

Brown, A, Iwashita, N and McNamara, T, 2005, An examination of rater orientations and test-taker performance on English for Academic Purposes speaking tasks, TOEFL Monograph series MS-29, Educational Testing Service, Princeton, New Jersey

Cumming, A, 1990, ‘Expertise in evaluating second language compositions’ in Language Testing, vol 7, no 1, pp 31-51

Delaruelle, S, 1997, ‘Text type and rater decision making in the writing module’ in Access: Issues in English language test design and delivery, eds G Brindley and G Wigglesworth, National Centre for English Language Teaching and Research, Macquarie University, Sydney, pp 215-242

Gass, SM and Mackey, A, 2000, Stimulated recall methodology in second language research, Lawrence Erlbaum, Mahwah, NJ

Green, A, 1998, Verbal protocol analysis in language testing research: A handbook, (Studies in language testing 5), Cambridge University Press and University of Cambridge Local Examinations Syndicate, Cambridge

Lazaraton, A, 1996a, ‘A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE)’ in Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium, eds M Milanovic and N Saville, Cambridge University Press, Cambridge, pp 18-33

Lazaraton, A, 1996b, ‘Interlocutor support in oral proficiency interviews: The case of CASE’ in Language Testing, vol 13, pp 151-172

Lewkowicz, J, 2000, ‘Authenticity in language testing: some outstanding questions’ in Language Testing, vol 17 no 1, pp 43-64

Lumley, T and Stoneman, B, 2000, ‘Conflicting perspectives on the role of test preparation in relation to learning’ in Hong Kong Journal of Applied Linguistics, vol 5 no 1, pp 50-80

Lumley, T, 2000, ‘The process of the assessment of writing performance: the rater's perspective’, unpublished doctoral thesis, The University of Melbourne


Lumley, T and Brown, A, 2004, ‘Test-taker response to integrated reading/writing tasks in TOEFL: evidence from writers, texts and raters’, unpublished report, The University of Melbourne

McNamara, TF and Lumley, T, 1997, ‘The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings’ in Language Testing, vol 14, pp 140-156

Meiron, BE, 1998, ‘Rating oral proficiency tests: a triangulated study of rater thought processes’, unpublished Masters thesis, University of California, LA

Merrylees, B and McDowell, C, 1999, ‘An investigation of Speaking Test reliability with particular reference to the Speaking Test format and candidate/examiner discourse produced’ in IELTS Research Reports Vol 2, ed R Tulloh, IELTS Australia, Canberra, pp 1-35

Morton, J, Wigglesworth, G and Williams, D, 1997, ‘Approaches to the evaluation of interviewer performance in oral interaction tests’ in Access: Issues in English language test design and delivery, eds G Brindley and G Wigglesworth, National Centre for English Language Teaching and Research, Macquarie University, Sydney, pp 175-196

Pollitt, A and Murray, NL, 1996, ‘What raters really pay attention to’ in Performance testing, cognition and assessment, (Studies in language testing 3), eds M Milanovic and N Saville, Cambridge University Press, Cambridge, pp 74-91

Taylor, L and Jones, N, 2001, University of Cambridge Local Examinations Syndicate Research Notes 4, University of Cambridge Local Examinations Syndicate, Cambridge, pp 9-11

Taylor, L, 2000, Issues in speaking assessment research, (Research notes 1), University of Cambridge Local Examinations Syndicate, Cambridge, pp 8-9

UCLES, 2001, IELTS examiner training material, University of Cambridge Local Examinations Syndicate, Cambridge

Vaughan, C, 1991, ‘Holistic assessment: What goes on in the rater’s mind?’ in Assessing second language writing in academic contexts, ed L Hamp-Lyons, Ablex, Norwood, New Jersey, pp 111-125

Weigle, SC, 1994, ‘Effects of training on raters of ESL compositions’ in Language Testing, vol 11, no 2, pp 197-223


APPENDIX 1: QUESTIONNAIRE

A Focus of the criteria

1. Do the four criteria cover features of spoken language that can be readily assessed in the testing situation? Yes / No Please elaborate

2. Do the descriptors relate directly to key indicators of spoken language? Is anything left out?

Yes / No Please elaborate

3. Are any aspects of the descriptors inappropriate or irrelevant?

Yes / No Please elaborate

B Interpretability of the criteria

4. Are the descriptors easy to understand and interpret? How would you rate your confidence in using each scale, from 1 (Not at all confident) to 5 (Very confident)?

Fluency and coherence             1   2   3   4   5
Lexical resource                  1   2   3   4   5
Grammatical range and accuracy    1   2   3   4   5
Pronunciation                     1   2   3   4   5

5. Please elaborate on why you felt confident or not confident about each of the scales:

Fluency and coherence

Lexical resource

Grammatical range and accuracy

Pronunciation


6. How much overlap do you find among the scales? (1 = Very distinct; 2 = Some overlap; 3 = A lot of overlap; 4 = Almost total overlap)

F&C and LR     1   2   3   4
F&C and GRA    1   2   3   4
F&C and P      1   2   3   4
LR and GRA     1   2   3   4
LR and P       1   2   3   4
GRA and P      1   2   3   4

7. Could you describe this overlap?

8. Would you prefer the descriptors to be shorter / longer?

Please elaborate

C Level distinctions

9. Do the descriptors of each scale capture the significant performance qualities at each of the band levels?

Fluency and coherence Yes / No Please elaborate

Lexical resource Yes / No Please elaborate

Grammatical range and accuracy Yes / No Please elaborate

Pronunciation Yes / No Please elaborate

10. Do the scales discriminate across the levels effectively? (If not, for each scale, which levels are the most difficult to discriminate, and why?)

Fluency and coherence Yes / No Please elaborate

Lexical resource Yes / No Please elaborate

Grammatical range and accuracy Yes / No Please elaborate

Pronunciation Yes / No Please elaborate


11. Is the allocation of bands for pronunciation appropriate?

Yes / No Please elaborate

12. How often do you award flat profiles?

Please elaborate

D The rating process

13. How difficult is it to interview and rate at the same time?

Please elaborate

14. Do you focus on particular criteria in different parts of the interview?

Yes / No Please elaborate

15. How is your final rating achieved? How do you work towards it? At what point do you finalise your rating?

Please elaborate

Final comment

Is there anything else you think you should have been asked or would like to add?


3. Candidate discourse in the revised IELTS Speaking Test

Author: Annie Brown, Ministry of Higher Education and Scientific Research, United Arab Emirates

Grant awarded: Round 8, 2002

This study aims to verify the IELTS Speaking Test scale descriptors by providing empirical validity evidence derived from a linguistic analysis of candidate discourse.

ABSTRACT

In 2001 the IELTS interview format and criteria were revised. A major change was the shift from a single global scale to a set of four analytic scales focusing on different aspects of oral proficiency. This study is concerned with the validity of the analytic rating scales. It aims to verify the descriptors used to define the score points on the scales by providing empirical evidence for the criteria in terms of their overall focus, and their ability to distinguish levels of performance.

The Speaking Test band descriptors and criteria key indicators were analysed in order to identify relevant analytic categories for each of the four band scales: fluency, grammatical range and accuracy, lexical resource and pronunciation. Twenty interviews drawn from operational IELTS administrations in a range of countries, and representing a range of proficiency levels, were analysed with respect to these categories.

The analysis found that most of the measures displayed increases in the expected direction over the levels, which appears to confirm the validity of the criteria. However, for all measures the standard deviations tended to be large, relative to the differences between levels. This indicates a high level of variation amongst candidates assessed at the same level, and a high degree of overlap between levels, even for those measures which produced significant findings. In addition, for most measures the differences between levels were greater at some boundaries between two bands than at others.

Overall, the findings indicate that while all the measures relating to one scale contribute in some way to the assessment on that scale, no one measure drives the rating; rather a range of performance features contribute to the overall impression of the candidate’s proficiency.


CONTENTS

1 Aim of the study ............................................................ 3
2 Discourse studies of L2 speaking task performance ........................... 3
3 Methodology ................................................................. 4
  3.1 Data .................................................................... 4
  3.2 The IELTS Speaking Test ................................................. 5
  3.3 Analytic categories ..................................................... 5
      3.3.1 Fluency and coherence ............................................. 6
      3.3.2 Lexical resources ................................................. 7
      3.3.3 Grammatical range and accuracy .................................... 8
4 Results ..................................................................... 10
  4.1 Fluency and coherence ................................................... 10
      4.1.1 Repair ............................................................ 10
      4.1.2 Hesitation ........................................................ 10
      4.1.3 Speech rate ....................................................... 10
      4.1.4 Response length ................................................... 10
      4.1.5 Amount of speech .................................................. 11
  4.2 Lexical resources ....................................................... 11
  4.3 Grammatical range and accuracy .......................................... 12
5 Summary of findings ......................................................... 13
References .................................................................... 16
Appendix 1: ANOVAs (Analysis of variance) ..................................... 18

AUTHOR BIODATA:

ANNIE BROWN

Annie Brown is Head of Educational Assessment in the National Admissions and Placement Office (NAPO) of the Ministry of Higher Education and Scientific Research, United Arab Emirates. Previously, and while undertaking this study, she was Senior Research Fellow and Deputy Director of the Language Testing Research Centre at The University of Melbourne. There, she was involved in research and development for a wide range of language tests and assessment procedures, and in language program evaluation. Annie's research interests focus on the assessment of speaking and writing, and the use of Rasch analysis, discourse analysis and verbal protocol analysis. Her books include Interviewer Variability in Oral Proficiency Interviews (Peter Lang, 2005) and the Language Testing Dictionary (CUP, 1999, co-authored with colleagues at the Language Testing Research Centre). She was winner of the 2004 Jacqueline A Ross award for the best PhD in language testing, and winner of the 2003 ILTA (International Language Testing Association) award for the best article on language testing.


1 AIM OF THE STUDY

This study comprises an analysis of candidate discourse on the revised IELTS Speaking Test as part of the program of validation research funded by IELTS Australia. The overall aim of the study is to try to verify the descriptors used to define the score points on the scales by providing empirical validity evidence for the criteria, in terms of:

- their overall focus, and
- their ability to distinguish levels of performance.

The aim will be addressed through an analysis of samples of performance at each of several levels of proficiency using a variety of quantitative and qualitative measures selected to reflect the features of performance relevant to the test construct and defined within the band scales.

2 DISCOURSE STUDIES OF L2 SPEAKING TASK PERFORMANCE

One of the first studies to examine learner discourse in relation to levels of proficiency was that of Mangan (1988). Mangan examined the occurrence of specific grammatical errors in French Oral Proficiency Interviews. He found that while error frequency decreased as the proficiency level increased, the decrease was not linear. Douglas (1994) found similar results on a semi-direct speaking test for a variety of measures, including grammatical errors, fluency, vocabulary, and rhetorical organisation. He speculates that this could be because raters were attending to features not included in the scales, which raises the question of the validity of the scales used in this context. It may also be, as Douglas and Selinker (1992, 1993) and Brown et al (2005) argue, that holistic ratings do not adequately capture jagged profiles, that is, different levels of performance by a candidate across different criteria.

Brown, Iwashita and McNamara (2005) undertook an analysis of candidate performance on speaking tasks to be included in New TOEFL. The tasks had an English for Academic Purposes (EAP) focus and included both independent and integrated tasks (see Lewkowicz, 1997 for a discussion of integrated tasks). As the overall aim of the study was to examine the feasibility of drawing on verbal report data to develop scales, the measures used to examine the actual discourse were selected to reflect the criteria applied by EAP specialists when not provided with specific guidance, rather than those contained within existing scales. The criteria applied by the specialists and used to determine the discourse measures reflected four major categories: linguistic resources (which included grammar and vocabulary), fluency (which included repair phenomena, pausing and speech rate), phonology (which included pronunciation, intonation and rhythm), and content.

Brown et al found that for each category only one or two of the measures they used revealed significant differences between levels. In addition, the effect sizes were generally marginal or small, indicating relatively large variability within each score level. This, they surmise, may have been because the score data which formed the basis of the selection of samples was rated holistically rather than analytically. They argue that it may well have been that samples assessed at the same level would reveal very different profiles across the different ‘criteria’ (the major categories identified by the raters). A similar study carried out by Iwashita and McNamara (2003) using data from the Examination for the Certificate of Competency in English (English Language Institute, 2001) produced similar findings.

Discourse analysis of candidate data has also been used in the empirical development of rating scales. The work of Fulcher (1993, 1996, 2003) on the development of scales for fluency is perhaps the most original and detailed. He drew on data taken from a range of language tests to examine what constituted increasing levels of proficiency in terms of a range of fluency measures. He found strong evidence of progression through the levels on a number of these measures, which led to the development of descriptors reflecting this progression that, he argued, would not only be more user-friendly but, because of their basis in actual performance, would lead to more valid and reliable ratings.

Other studies that have used various discourse measures to examine differences in candidate performance on speaking tasks include those by Skehan and Foster (1999), Foster and Skehan (1996) and Wigglesworth (1997, 2001), which used measures designed to capture differences in grammatical accuracy and fluency. In these studies the measures were applied not to performances assessed as being at different levels of proficiency, but to performances on different tasks (where the cognitive complexity of the task differed) or on the same task completed under varying conditions.

Iwashita, McNamara and Elder (2001) drew on Skehan’s (1998) model of cognitive complexity to examine the feasibility of defining levels of ability according to cognitive demand. They manipulated task conditions on a set of narrative tasks and measured performance using measures of accuracy and fluency. However, they found the differences in performance under the different conditions did not support the development of a continuum of ability based on cognitive demand.

As Brown et al (2005) point out in discussing the difficulty of applying some measures, particularly those pertaining to grammatical analysis, most of the studies cited above do not provide measures of inter-coder agreement; Brown et al’s study is exemplary in this respect. Like Foster, Tonkyn and Wigglesworth (2000), they discuss the difficulty of analysing the syntactic quality of spoken second language data using measures developed originally for the analysis of first language written texts. Foster et al consider the usefulness, for the analysis of spoken data, of several units of analysis commonly used with written data. They conclude by proposing a new unit which they term the AS-unit. However, the article itself contains very little guidance on how to apply the analysis. (The AS-unit was considered for this study but an attempt at its use created too many ambiguities and unexplained issues.)

3 METHODOLOGY

3.1 Data

A set of 30 taped operational IELTS interviews, drawn from testing centres in a range of countries, was rated analytically using the IELTS band descriptors. Ratings were provided for each of the categories:

- fluency and coherence
- lexical resource
- grammatical range and accuracy
- pronunciation.

To select interviews for the study which could be assumed to be soundly at a particular level, each was rated three times. Then, for each criterion, five interviews were selected at each of four levels, 5 to 8, on that specific criterion (totalling 20 interview samples).

(The IELTS scale ranges from 0 to 9, with 6, 6.5 and 7 typically being the required levels for entry to tertiary study. This study had intended to include level 4, but the quality of these candidates’ production, together with the poor quality of the operational test recordings, was such that their interviews proved impossible to transcribe accurately or adequately.)

For example, interviews to be included in the analysis of grammatical accuracy were selected on the basis of the scores awarded in the category grammatical range and accuracy. Similarly, interviews to be included in the analysis of hesitation were selected on the basis of the scores awarded in the category fluency and coherence.


For interviews to be selected to reflect a specific level on a specific criterion, one of the following patterns of agreement on scores was required (the sketch after the list illustrates the rule):

- all three scores were at the specified level (eg 7 – 7 – 7), or
- two scores were at the specified level and one a level above or below (eg 7 – 7 – 8), or
- the three scores reflected different levels but averaged to the level (eg 6 – 7 – 8).
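To make this selection rule concrete, here is a minimal sketch (in Python; an illustration of the rule as stated above, not the study’s actual tooling):

    def qualifies(scores, level):
        # True if three ratings place an interview soundly at `level`:
        # all three at the level; two at the level and one adjacent;
        # or three different levels that average to it.
        assert len(scores) == 3
        if all(s == level for s in scores):
            return True
        others = [s for s in scores if s != level]
        if len(others) == 1 and abs(others[0] - level) == 1:
            return True
        if len(set(scores)) == 3 and sum(scores) / 3 == level:
            return True
        return False

    print(qualifies([7, 7, 8], 7))  # True: two at level, one adjacent
    print(qualifies([6, 7, 8], 7))  # True: averages to 7
    print(qualifies([5, 7, 8], 7))  # False: averages to 6.67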

Prior to analysis the selected tapes were transcribed in full by a research assistant and checked by the researcher.

3.2 The IELTS Speaking Test

The IELTS Speaking Test consists of a face-to-face interview between an examiner and a single candidate. The interview is divided into three main parts (Figure 1). Each part fulfils a specific function in terms of interaction pattern, task input and candidate output. In Part 1, candidates answer general questions about themselves, their homes/families, their jobs/studies, their interests, and a range of similar familiar topic areas. Three different topics are addressed in Part 1, which lasts between four and five minutes. In Part 2, candidates are given a topic and asked to talk for between one and two minutes, with one minute of preparation time. Examiners may ask one or two follow-up questions. In Part 3, the examiner and candidate engage in a discussion of more abstract issues and concepts which are thematically linked to the topic used in Part 2. The discussion lasts between four and five minutes.

Part 1: Introduction and interview (4–5 minutes)
Examiner introduces him/herself and confirms candidate’s identity. Examiner interviews candidate using verbal questions based on familiar topic frames.

Part 2: Individual long turn (3–4 minutes, including 1 minute preparation time)
Examiner asks candidate to speak for 1–2 minutes on a particular topic based on written input, in the form of a general instruction and content-focused prompts. Examiner asks one or two questions at the end of the long turn.

Part 3: Two-way discussion (4–5 minutes)
Examiner invites candidate to participate in discussion of a more abstract nature, based on verbal questions thematically linked to the Part 2 prompt.

Figure 1: Interview structure

3.3 Analytic categories

For each assessment category, the aim was to select or develop specific analyses which:

- addressed each of the individual scales and covered the main features referred to in each
- might be expected to show differences between performances scored at levels 5 to 8
- could be applied reliably and meaningfully.

To address the first two criteria, three pieces of documentation were reviewed:

1. the band descriptors (UCLES, 2001)
2. the Speaking Test criteria key indicators, as described in the Examiner Training Materials (UCLES, 2001)
3. the descriptions of the student samples contained in the Examiner Training Materials (UCLES, 2001).


In order to address the last criterion, the literature on the analysis of learner discourse was reviewed to see what it indicated about the usefulness of particular measures, particularly whether they had sound operational definitions, could be applied reliably, and had sound theoretical justifications. While the measures typically used to measure fluency and vocabulary seemed relatively straightforward, there appeared to be a wide range of measures used for the analysis of syntactic quality but little detailed guidance on how to segment the data or what levels of reliability might realistically be achieved. Phonology proved to be the most problematic; the only reference was that of Brown et al (2005) who analysed the phonological quality of candidate performance in tape-based monologic tasks. However, not only did the phonological analyses used in that study consist of subjective evaluative judgements rather than (relatively) objective measures, but they required the use of specific phonetic software and the involvement of trained phoneticians. Ultimately, it was decided that such analyses were beyond the scope of the present study.

Sections 3.3.1 to 3.3.3 describe the analyses selected for the present study.

3.3.1 Fluency and coherence

Key Fluency and coherence features as described within the IELTS documentation include:

- repetition and self-correction
- hesitation / speech rate
- the use of discourse markers, connectives and cohesive features
- the coherence of topic development
- response length.

Following a review of the literature to ascertain how these aspects of fluency and coherence might be operationalised as measures, the following analyses were adopted.

Firstly, repair was measured in terms of the frequency of self-corrections (restarts and repeats) per 100 words. It was calculated over the Part 2 and Part 3 long responses (not including single-word answers or repair turns). Secondly, hesitation was measured in terms of the ratio of pausing (filled and unfilled pauses) to speech, measured in milliseconds. For this analysis the data were entered into the Cool Edit Pro program (Version 2.1, 2001). Hesitation was also measured in terms of the number of pauses (filled, unfilled and filled/unfilled) relative to the number of words (see Table 1). Both of these measures were carried out on speech produced in response to Part 2, the monologue turn. Thirdly, speech rate was calculated in terms of the number of words per minute. This was also calculated over Part 2, and the analysis was carried out after the data were cleaned (pruned of repairs, repeats, false starts and filled pauses).
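For illustration, these three measures reduce to simple ratios once the counts and timings are available. The sketch below (Python) assumes hypothetical inputs: a pruned word count, a tally of restarts/repeats, and pause/speech durations in milliseconds of the kind produced by the Cool Edit Pro annotation step.

    def repair_rate(repairs, n_words):
        # self-corrections (restarts and repeats) per 100 words
        return 100.0 * repairs / n_words

    def pause_speech_ratio(pause_ms, speech_ms):
        # ratio of pause time (filled and unfilled) to speech time
        return pause_ms / speech_ms

    def speech_rate(n_words, speech_ms):
        # words per 60 seconds of speech, computed on cleaned text
        return n_words / (speech_ms / 60000.0)

    # hypothetical candidate: 250 pruned words with 18 repairs (Parts 2-3);
    # 30 s of pausing against 110 s of speech in the Part 2 monologue
    print(repair_rate(18, 250))               # 7.2 per 100 words
    print(pause_speech_ratio(30000, 110000))  # approx. 0.27
    print(speech_rate(230, 110000))           # approx. 125 words per minute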

Because the interview is divided into three parts, each of which takes a distinct form, response length was measured in a number of ways, as follows (illustrated in the sketch after the list).

1. Average length of response in Part 1. Single word answers and repair turns were excluded. The analysis was carried out after the data were cleaned (pruned of repairs, repeats, false starts and filled pauses).

2. Number of words in Part 2. The analysis was also carried out after the data were cleaned.

3. Average length of response in Part 2 follow-up questions (if presented) and Part 3. Single word answers and repair turns were excluded. Again, the analysis was carried out after the data were cleaned.

4. Average length of response in Part 1, Part 2 (follow-up question only) and Part 3 combined (all the question-answer sections).


Finally, while not explicitly referred to within the assessment documentation, it was anticipated that the total amount of speech produced by candidates might have a strong relationship with assessed level. The total amount of speech was calculated in terms of the number of words produced by the candidate over the whole interview. Again, the analysis was carried out after the data were cleaned. Table 1 summarises the Fluency and coherence analyses.

Assessment feature          Measure                                            Data
1. Repair                   restarts and repeats per 100 words                 Parts 2–3
2. Hesitation               ratio of pause time (filled and unfilled           Part 2 monologue
                            pauses) to speech time
                            ratio of filled and unfilled pauses to words       Part 2 monologue
3. Speech rate              words per 60 secs                                  Part 2 monologue
4. Response length          average length of response                         Part 1
                            total number of words                              Part 2 monologue
                            average length of response                         Part 2 follow-up questions and Part 3
                            average length of response                         Part 1, Part 2 follow-up questions and Part 3
5. Total amount of speech   words per interview                                Parts 1–3

Table 1: Summary of fluency and coherence measures

3.3.2 Lexical resources

Key Lexical resources features as described within the IELTS documentation are:

- breadth of vocabulary
- accuracy / precision / appropriateness
- idiomatic usage
- effectiveness and amount of paraphrase or circumlocution.

After a review of the literature to ascertain how these aspects of lexical resources might be operationalised as measures, the following analyses were adopted.

Vocabulary breadth was examined using the program VocabProfile (Cobb, 2002), which measures the proportions of low and high frequency vocabulary. The program is based on the Vocabulary Profile (Laufer and Nation, 1995), and performs the analysis using the Academic Word List (AWL) (Coxhead, 2000). VocabProfile calculates the percentage of words in each of five categories: the most frequent 500 words of English; the most frequent 1000 words of English (K1); the second most frequent thousand words of English (1001 to 2000) (K2); words found in the Academic Word List (AWL); and any remaining words not included in any of the first four lists (Offlist). The vocabulary breadth analysis was carried out on the Part 2 monologue task using cleaned data (after all filled pauses, repeats/restarts and unclear words were removed). Before the analyses were run the texts were checked for place names and other proper names, and lexical fillers and discourse markers such as okay or yeah. These were re-coded as high frequency as they would otherwise show up as Offlist.
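The frequency-band classification that VocabProfile performs can be sketched as follows (Python). The word lists here are tiny stand-ins, not the actual 500/K1/K2/AWL lists; note that the first 500 words are a subset of K1, so those two percentages overlap while K1, K2, AWL and Offlist partition the tokens.

    FIRST_500 = {"the", "a", "in", "was", "my"}    # placeholder list
    K1 = FIRST_500 | {"family", "work", "year"}    # first 1000 words
    K2 = {"journey", "stranger"}                   # second 1000 words
    AWL = {"environment", "significant"}           # Academic Word List

    def profile(tokens):
        # percentage of tokens falling in each frequency band
        bands = {"500": 0, "K1": 0, "K2": 0, "AWL": 0, "Offlist": 0}
        for t in tokens:
            w = t.lower()
            if w in FIRST_500:
                bands["500"] += 1
            if w in K1:                 # K1 subsumes the first 500 words
                bands["K1"] += 1
            elif w in K2:
                bands["K2"] += 1
            elif w in AWL:
                bands["AWL"] += 1
            else:
                bands["Offlist"] += 1
        n = len(tokens)
        return {k: round(100.0 * v / n, 1) for k, v in bands.items()}

    print(profile("my family work in a significant environment".split()))
    # {'500': 42.9, 'K1': 71.4, 'K2': 0.0, 'AWL': 28.6, 'Offlist': 0.0}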


Another measure of vocabulary sophistication used in earlier studies is average word length (Cumming et al, 2003). The average word length in each Part 2 monologue performance was calculated by dividing the total number of characters by the total number of words, using the cleaned texts. In addition, as VocabProfile calculates the type-token ratio (the lexical density of the spoken text), this is also reported for Part 2. The type-token ratio is the ratio of the number of different lexical words to the total number of lexical words, and has typically been used as a measure of semantic density. Although it has traditionally been used to analyse written texts, it has more recently been applied to spoken texts as well (eg, see O’Loughlin, 1995; Brown et al, 2005).
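Both of these measures are straightforward to compute; a minimal sketch (Python) follows. For simplicity, the type-token ratio here is computed over all tokens, whereas the study counts lexical words only.

    def avg_word_length(tokens):
        # total characters divided by total words, on cleaned text
        return sum(len(t) for t in tokens) / len(tokens)

    def type_token_ratio(tokens):
        # number of different words over total words
        words = [t.lower() for t in tokens]
        return len(set(words)) / len(words)

    tokens = "I visited the museum because the museum was interesting".split()
    print(avg_word_length(tokens))   # approx. 5.22
    print(type_token_ratio(tokens))  # approx. 0.78 ('the' and 'museum' repeat)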

The three remaining key vocabulary features were more problematic. For the first two – contextualised accuracy, precision or appropriateness of vocabulary use, and idiomatic usage – no measure was found in the literature for measuring them objectively. These, it seemed, could only be assessed judgementally, and such judgements would be difficult to define, time-consuming to carry out, and almost certain to have low reliability. These performance features were, therefore, not addressed in the present study because of resource constraints. Perhaps the best way to understand how these evaluative categories are interpreted and applied would be to analyse what raters claim to pay attention to when evaluating these aspects of vocabulary (see Brown et al, 2005).

The last key vocabulary feature – the ability to paraphrase or use circumlocution – is also not objectively measurable, as it is a communication strategy which is not always ‘visible’ in speech. It is only possible to know it has been employed (successfully or unsuccessfully) in those cases where the speaker overtly attempts to repair a word choice. However, even this is problematic to measure, as in many cases it may not be clear whether a repair or restart is an attempt at lexical repair or grammatical repair.

For these reasons, it was decided that the sole measures of vocabulary in this study would be of vocabulary breadth and density. Table 2 summarises the vocabulary measures.

Assessment feature    Measure                                           Data
1. Word type          proportion of words in most frequent 500 words    Part 2 monologue
                      proportion of words in K1                         Part 2 monologue
                      proportion of words in K2                         Part 2 monologue
                      proportion of words in AWL                        Part 2 monologue
                      proportion of words in Offlist                    Part 2 monologue
2. Word length        average no. of characters per word                Part 2 monologue
3. Lexical density    type/token ratio                                  Part 2 monologue

Table 2: Summary of lexical resources measures

3.3.3 Grammatical range and accuracy

Key Grammatical range and accuracy features described within the IELTS documentation are:

- range / variety of structures
- error type (eg basic) and density
- error-free sentences
- impact of errors
- sentence complexity
- length of utterances
- complexity of structures.


Most of the better-known and well-defined measures for the analysis of syntactic complexity and accuracy depend on first dividing the speech into units, typically based on syntax, such as the clause and the t-unit – a t-unit being an independent clause and all attached dependent clauses. However, because of the elliptical nature of speech, and learner speech in particular, it proved very difficult to divide the speech into these units consistently and reliably, in particular to distinguish elliptical or ill-formed clauses from fragments. Other measures which have been proposed for spoken data such as the c-unit and the AS-unit (Foster et al, 2000) are less widely-used and less well-defined in the literature and were, therefore, equally difficult to apply.

Consequently, an approach to segmentation was developed for the present study to be both workable (to achieve high inter-coder agreement) and valid. It rested on the identification of spoken sentences or utterances primarily in terms of syntax, but also took semantic sense into account in identifying unit boundaries. While utterances were defined primarily as t-units, because of the often elliptical syntax produced by many of the learners, the segmentation also took meaning into account in that the semantic unity of utterances overrode syntactic (in)completeness. Fragments and ill-formed clauses which were semantically integrated into utterances were treated as part of that utterance. Abandoned utterances and unattached sentence fragments were identified as discrete units. Segmentation was carried out on the cleaned Part 2 and 3 data; hesitation and fillers were removed and, where speech was repaired, the data included the repaired speech only. Once the approach to segmentation had been finalised, 75% of the data was segmented by two people. Inter-coder agreement was 91.5%. Disagreements were resolved through discussion.
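The reported inter-coder agreement is a simple proportion of matching segmentation decisions. A minimal sketch (Python) follows, assuming each coder's segmentation is represented as a set of boundary positions over the same token stream; this representation and the values are assumptions for illustration, not the study's actual coding format.

    def percent_agreement(boundaries_a, boundaries_b, n_positions):
        # proportion of candidate boundary positions on which the two
        # coders made the same decision (boundary vs no boundary)
        agree = sum(
            (i in boundaries_a) == (i in boundaries_b)
            for i in range(n_positions)
        )
        return 100.0 * agree / n_positions

    a = {3, 9, 14, 22}    # hypothetical coder A boundaries
    b = {3, 9, 15, 22}    # coder B differs at positions 14 and 15
    print(percent_agreement(a, b, 25))  # 92.0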

Once the data had been segmented, each Part 2 utterance was coded for the occurrence of specific basic errors, these being tense, noun-verb agreement, singular/plural, article, preposition, pronoun choice and comparative formation. In addition, each utterance was coded to indicate whether it contained any type of syntactic error at all. Error-free units were those free from any grammatical errors, including the specific errors defined above as well as any others (eg relative clause formation), but excluding word order, as it was extremely difficult to reach agreement on this. In addition, each utterance was coded to indicate the number of clauses it contained.

Once the data had been coded, the following analyses were undertaken (a worked sketch follows Table 3):

Complexity
- mean length of utterance, as measured by the number of words
- number of clauses per utterance

Accuracy
- proportion of error-free utterances
- frequency of basic errors: the ratio of specific basic errors to words.

Assessment feature    Measure                                    Data
1. Complexity # 1     words per utterance                        Parts 2–3
2. Complexity # 2     clauses per utterance                      Parts 2–3
3. Accuracy # 1       proportion of error-free utterances        Part 2 monologue
4. Accuracy # 2       ratio of specific basic errors to words    Part 2 monologue

Table 3: Summary of grammatical range and accuracy measures
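To illustrate the two accuracy measures, the sketch below (Python) assumes a hypothetical coding format in which each utterance carries its word count and a list of coded basic errors (empty if the utterance is error-free). Following the magnitudes reported in Table 6, the error ratio is expressed here as words per basic error; that direction of the ratio is an assumption.

    utterances = [                      # hypothetical coded Part 2 data
        {"words": 14, "errors": []},
        {"words": 9,  "errors": ["article"]},
        {"words": 17, "errors": ["tense", "agreement"]},
        {"words": 11, "errors": []},
    ]

    error_free = sum(1 for u in utterances if not u["errors"])
    total_words = sum(u["words"] for u in utterances)
    total_errors = sum(len(u["errors"]) for u in utterances)

    print(error_free / len(utterances))  # proportion error-free: 0.5
    print(total_words / total_errors)    # words per basic error: 17.0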


4 RESULTS

4.1 Fluency and coherence

The descriptive statistics for the Fluency and coherence analyses are shown in Table 4. The results of the ANOVAs (analyses of variance) are shown in Appendix 1.

4.1.1 Repair

The number of self-corrections (restarts and repeats) was calculated per 100 words over Parts 2 and 3. Column 1 of Table 4 shows a trend for the frequency of self-correction to decrease as the band score for Fluency and coherence increases, although Bands 6 and 7 are very similar and the expected direction is reversed for these two levels. There appears to be a considerable amount of individual variation among students assessed at the same level; the standard deviation for each level is rather large. An ANOVA showed that the differences were not significant (F (3, 16) = .824, p = .499).
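The one-way ANOVAs reported in this section can be reproduced along the following lines (Python with SciPy); the repair rates below are invented values for the five candidates selected at each Fluency and coherence band, not the study’s data.

    from scipy.stats import f_oneway

    band5 = [12.1, 6.3, 10.0, 4.9, 9.9]   # hypothetical repair rates
    band6 = [7.8, 6.1, 6.5, 8.2, 6.4]     # per 100 words, five candidates
    band7 = [11.2, 4.3, 5.6, 9.1, 5.5]    # per band
    band8 = [2.9, 8.4, 3.5, 7.0, 5.6]

    f, p = f_oneway(band5, band6, band7, band8)  # df = (3, 16)
    print(f"F(3, 16) = {f:.3f}, p = {p:.3f}")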

4.1.2 Hesitation

The amount of hesitation was measured in terms of the ratio of pause time (filled and unfilled pauses) to speech time, and the ratio of filled and unfilled pauses to words. Columns 2 and 3 show that the ratio of pausing to speech for each of these measures decreased as the proficiency level increased, with the greatest difference being between levels 5 and 6. However, ANOVAs showed that the differences were not significant (F (3, 16) = 2.314, p = .116 and F (3, 16) = 1.454, p = .264).

             1         2              3         4            5               6         7               8               9
Score        Repair    Speech time:   Words:    P2 words     P1 average      Words     P2/3 average    P1–3 average    Total
                       pause time     pauses    per 60 secs  length of turn  P2        length of turn  length of turn  words

8   Mean     5.49      7.10           15.40     125.3        49.01           250.6     61.23           51.52           1227
    StDev    3.25      2.75           6.28      20.0         18.84           109.3     37.50           23.86           175.6
7   Mean     7.14      7.06           18.31     123.6        39.03           232.0     60.18           44.74           1034
    StDev    3.45      3.61           15.67     26.0         13.84           66.9      14.62           11.09           354.2
6   Mean     7.01      5.99           14.56     103.5        37.60           224.0     54.15           42.24           1007
    StDev    1.09      2.44           8.58      24.1         22.55           46.7      16.36           19.61           113.6
5   Mean     8.64      3.22           6.37      87.2         24.51           154.0     28.62           25.59           657
    StDev    4.07      1.51           1.28      20.3         10.54           44.7      12.57           8.63            80.4

Table 4: Fluency and coherence: descriptive statistics

4.1.3 Speech rate

Speech rate was measured in terms of the number of words per minute, calculated for Part 2, excluding repairs and restarts. Column 4 shows an increase in speech rate as the band score for Fluency and coherence increases, although Bands 7 and 8 are very similar. Again the standard deviations are rather large. An ANOVA indicated that the differences approached significance (F (3, 16) = 3.154, p = .054).

4.1.4 Response length

The interview contained two types of speech: responses to questions (Part 1, Part 2 follow-up questions, and Part 3), which could, in theory, be as long as the candidate wished, and the monologue turn (Part 2), which had a maximum time allowance. Column 5 shows that the average length of response in Part 1 increased as the band score for Fluency and coherence increased, with Band 8 responses being, on average, twice as long as Band 5 responses. The biggest increases were from Band 5 to Band 6, and Band 7 to Band 8. The average length of response in Bands 6 and 7 was very similar. Again, the standard deviations for each level were high, and an ANOVA showed that the differences were not significant (F (3, 16) = 1.736, p = .200).

In the monologue turn, Part 2, there was an increase in the number of words over the levels with the biggest increase from Band 5 to 6 (Column 6). The standard deviations for each level were high. Again, an ANOVA showed that the differences were not significant (F (3, 16) = 1.733, p = .200).

As was the case for the responses to questions in Part 1, the average length of response to Part 2 follow-up questions and Part 3 questions increased as the band score for Fluency and coherence increased (Column 7). Again Band 8 responses were, on average, twice as long as Band 5 responses. The biggest increase was from Band 5 to 6, but this time Bands 7 and 8 were very similar. Again, the standard deviations for each level were high and again an ANOVA showed that the differences were not significant (F (3, 16) = 2.281, p = .118).

When the average length of response for all question responses was calculated, we again found an increase over the levels, with Band 8 being twice as long as Band 5, and with the most marked increase being from Band 5 to 6 (Column 8). Again, an ANOVA showed that the differences were not significant (F (3, 16) = 2.074, p = .144).

4.1.5 Amount of speech

Column 9 shows that as the band score for Fluency and coherence increases, the total number of words over the whole interview increases. The most marked increase is from Band 5 to 6; Bands 6 and 7 are very similar. An ANOVA confirmed significant differences (F (3, 16) = 6.412, p = .005).

4.2 Lexical resources

The descriptive statistics for the Lexical resources analyses are shown in Table 5.

             1        2       3       4        5        6             7
Score        500 %    K1 %    K2 %    AWL %    OWL %    Word length   T/T ratio

8   Mean     83       91      4       1        3        4.02          0.47
    StDev    5        5       3       1        3        4.44          0.03
7   Mean     83       90      5       3        4        4.06          0.44
    StDev    4        3       1       2        3        3.72          0.06
6   Mean     86       93      3       2        2        3.86          0.49
    StDev    4        2       2       2        1        3.59          0.09
5   Mean     90       94      4       1        2        4.02          0.44
    StDev    2        2       2       1        1        4.05          0.06

Table 5: Lexical resources: descriptive statistics

The word frequency analysis calculated the percentage of words in each of five categories:

1. the first 500 words – 500
2. the first 1000 words – K1
3. the second 1000 words – K2
4. the Academic Word List – AWL
5. Offlist – OWL.


Columns 1 and 2 in Table 5 show that although there is a slight decrease in the proportion of words from the first 500 words and the first 1000 words lists as the Lexical resources band score increases, a large proportion of words are in the first 1000 words list for all levels (91%–94%). The average proportion of words from the remaining categories (K2, AWL and OWL) is relatively low for all levels and there is no linear increase in the proportion of K2 and AWL (Columns 3 and 4) across the levels.

While the percentage of Offlist words increases across the levels (Column 5) this is, in fact, uninterpretable as Offlist words were found to include mis-formed words on the one hand, and low frequency words on the other. The ANOVAs showed that none of the measures exhibited significant differences. (The results of the ANOVAs are shown in Appendix 1.)

The analysis of average word length (Column 6) indicated that the measure was relatively stable across the levels. This is probably due to the high incidence of high frequency words at all levels, something that is typical of spoken language in general. Column 7 indicates that there is no linear increase across the band levels in the average type-token ratio.
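The two measures in Columns 6 and 7 are straightforward to compute from a transcript; the following minimal Python sketch, using an invented utterance, illustrates both.

```python
# Average word length (characters per token) and type-token ratio
# (distinct words divided by total words). The sample text is invented.
def word_length_and_ttr(transcript: str) -> tuple[float, float]:
    tokens = transcript.lower().split()
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    ttr = len(set(tokens)) / len(tokens)
    return avg_len, ttr

avg_len, ttr = word_length_and_ttr(
    "well i think the city is quite big and the people are quite friendly"
)
print(f"Word length: {avg_len:.2f}, T/T ratio: {ttr:.2f}")
```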

4.3 Grammatical range and accuracy

The descriptive statistics for the Grammatical range and accuracy analyses are shown in Table 6. The results of the ANOVAs are shown in Appendix 1.

              1                  2                       3                       4
Score         Utterance length   Clauses per utterance   Proportion of error-    Ratio of specific
                                                         free utterances         errors to words

8   Mean      12.33              1.57                    6.41                    72.96
    StDev      2.47               .36                    3.76                    38.98
7   Mean      12.32              1.64                    3.00                    35.86
    StDev      2.24               .46                    1.29                    15.30
6   Mean      12.33              1.51                    1.44                    17.97
    StDev      3.22               .17                     .27                     5.36
5   Mean      11.07              1.31                    1.35                    14.15
    StDev      2.54               .20                     .40                     3.91

Table 6: Grammatical range and accuracy: descriptive statistics

The two measures of complexity (utterance length in terms of mean number of words, and mean number of clauses per utterance) showed very little variation across the levels (Columns 1 and 2). For utterance length, Band 5 utterances were shorter than those of higher levels, while those of Bands 6–8 were almost identical. The ANOVA showed that the differences were not significant (F (3, 15) = .270, p = .846). For the second measure of complexity, the number of clauses per utterance, there was little difference between levels and the progression was not linear. Band 8 utterances were on average less complex than those of Band 7. Again the ANOVA revealed no significant differences (F (3, 15) = 1.030, p = .407).

In terms of accuracy, both measures behaved as expected. The proportion of error-free utterances increased as the level increased (Column 3) and the frequency of basic errors decreased (Column 4). Both ANOVAs revealed significant differences (F (3, 15) = 6.721, p = .004 and F (3, 15) = 7.784, p = .002).
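As an illustration of the accuracy measures, the following Python sketch computes a simple proportion of error-free utterances and a words-per-error ratio from hand-annotated utterances. The annotations are invented, and reading Column 4 as words per error is an assumption (consistent with the higher values at higher bands), since the scaling of the published figures is not spelled out.

```python
# A sketch of the two accuracy measures, assuming each utterance has been
# manually annotated with a count of the specific (basic) errors it
# contains. All example utterances and error counts are invented.
utterances = [
    # (utterance text, number of specific errors found by the annotator)
    ("i have lived in london for three years", 0),
    ("my brother he work in a bank", 2),
    ("we went to the beach last summer", 0),
]

error_free = sum(1 for _, e in utterances if e == 0)
total_words = sum(len(text.split()) for text, _ in utterances)
total_errors = sum(e for _, e in utterances)

print(f"Proportion of error-free utterances: {error_free / len(utterances):.2f}")
if total_errors:
    # Assumed interpretation of Column 4: how many words per specific error
    print(f"Words per specific error: {total_words / total_errors:.1f}")
```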


5 SUMMARY OF FINDINGS

Overall, the analyses revealed evidence that features of test-takers’ discourse varied according to the assessed proficiency level. While all measures broadly exhibited changes in the expected direction across the levels, for some, the differences between two adjacent levels were not always as expected. In addition, for most measures the differences between levels were greater at some boundaries than others, for example between Band 5 on the one hand, and Bands 6 to 8 on the other, or between Band 8 on the one hand and Bands 5 to 7 on the other. This indicates, perhaps, that, rather than contributing equally at all levels, specific aspects of performance are relevant at particular levels only. This finding supports the argument of Pollitt and Murray who, on the basis of analyses of raters’ orientations rather than analyses of candidate performance, argued that the trait of proficiency is “understood in different terms at different levels” and that, as a consequence, proficiency should not be assessed as a “rectangular set of components” (1996:89).

Figure 2 shows where the greatest differences lie for each of the measures. On all fluency measures, there was a clear difference between Bands 5 and 6 but the size of the differences between the other bands varied across the measures. For the grammar complexity measures, the greatest difference lay between Band 5 on the one hand, and Bands 6 to 8 on the other. For the accuracy measures, however, the greatest difference lay between Bands 7 and 8, with Bands 5 and 6 being very similar. For the lexical resource measures there was little difference between means for any of the measures.

Fluency and coherence
  Repair/restart          5 // 6=7 // 8
  Pause to speak time     5 // 6 / 7=8
  Frequency of pauses     5 // 6=7=8
  Words per minute        5 // 6 // 7=8
  P1 length of turn       5 // 6=7 // 8
  P2 words                5 // 6=7 / 8
  P2/3 length of turn     5 // 6=7=8
  P1-2 length of turn     5 // 6=7 / 8
  Total words             5 // 6=7 // 8

Grammatical range and accuracy
  Utterance length        5 // 6=7=8
  Clauses per utterance   5 // 6=7=8
  Error-free utterances   5=6 / 7 // 8
  Specific errors         5=6 / 7 // 8

Lexical resource
  Little difference between means for all measures

KEY
  =    indicates little difference between means
  /    indicates some difference between means
  //   indicates substantial difference between means

Figure 2: Differences across bands within measures


For all measures the standard deviations tended to be large, relative to the differences between levels, indicating a high level of variation amongst candidates assessed at the same level and a high degree of overlap between levels, even for those measures which produced significant findings. This would appear to indicate that while all the measures contribute in some way, none is an overriding driver of the rating awarded; candidates assessed at one particular level on one scale display subtle differences in performance on the different dimensions of that trait. This is perhaps inevitable where different and potentially conflicting features (such as accuracy and complexity) are combined into the one scale. Brown et al (2005) acknowledge this possibility when they discuss the tension, referred to by raters, between dimensions on all traits – grammar, vocabulary, fluency and pronunciation – such as accuracy (or nativeness), complexity (or sophistication) and impact. This tension is also acknowledged in the IELTS band scales themselves, with the following statement about grammar: “Complex structures are attempted but these are limited in range, nearly always contain errors and may lead to the need for reformulation”. Impact, of course, is listener-related and is therefore not something that can be measured objectively, unlike the other measures addressed in this study.

The findings are very interesting for a number of reasons. First, they reveal that, for each assessment category, a range of performance features appear to contribute to the overall impression of the candidate. The relatively low number of measures which revealed significant differences amongst the levels may be attributed to the relatively small number of samples at each level, which resulted in large measurement error.

While a number of the measures approached significance, the only one to exhibit significant differences across levels was the total amount of speech. This is in many ways surprising, because amount of speech is not specifically referred to in the scales. In addition, it is not closely related to the length of response measures, which showed trends in the expected direction but were not significant. It may be, then, that interviewers close down or otherwise cut short the phases of the interview if they feel that candidates are struggling, which would explain the significance of this finding. It may also be that while the extended responses produced by weaker candidates were not substantially shorter than those of stronger candidates, weaker candidates produced many more single-word responses and clarification requests which resulted in the interviewer dominating the talk more.

Second, the conduct of the analysis and review of the results allow us to draw conclusions about the methodology used in the study. Not all of the measures proved to be useful. For example, the relatively high proportion of high frequency vocabulary in all performances meant that the lexical frequency measures proved to be unhelpful in distinguishing the levels. It would appear that a more fine-grained analysis is required here, something that lay outside the scope of the present study. In addition, for some aspects of performance it was not possible to find previously-used valid and reliable measures – for example, to measure syntactic sophistication. Brown et al (2005), who tried to address this dimension through the identification of specific structures such as passives and conditionals, found so few examples in the spoken texts that the measure failed to reveal differences amongst levels. It may be that raters’ impressions about sophistication are driven by one or two particularly salient syntactic (or lexical) features in any one candidate’s performance, but that these differ for different candidates. In short, it may prove to be impossible to get at some of the key drivers of assessments through quantification of discourse features.

Other measures appear to be somewhat ambiguous. For example, self-repair might, on the one hand, be taken as evidence of monitoring strategies and therefore a positive feature of performance. On the other, it might draw attention to the fact that errors had been made, or be viewed as affecting the fluency of the candidate’s speech, both of which might lead it to be evaluated negatively. Given this, the feature on its own is unlikely to have a strong relationship with assessed levels of proficiency.


Despite the problems outlined above, and although the study had some limitations in terms of size, scope and choice of analyses, the results are, in general, encouraging for the validity of the IELTS band descriptors. The overall tendency for most of the measures to display increases in the expected direction over the levels appears to confirm the relevance of the criteria they address to the assessment of proficiency in the IELTS interview.


REFERENCES

Brown, A, Iwashita, N and McNamara, T, 2005, An Examination of Rater Orientations and Test-Taker Performance on English-for-Academic-Purposes Speaking Tasks, TOEFL Monograph series, MS-29, Educational Testing Service, Princeton, NJ

Cobb, T, 2002, The Web Vocabulary Profiler, ver 1.0, computer program, University of Québec, Montréal, retrieved from <http://www.er.uqam.ca/nobel/r21270/textools/web_vp.html>

Coxhead, A, 2000, ‘A new academic word list’, TESOL Quarterly, vol 34, no 2, pp 213-238

Cumming, A, Kantor, R, Baba, K, Eouanzaoui, E, Erdosy, U and James, M, 2003, ‘Analysis of discourse features and verification of scoring levels for independent and integrated prototype written tasks for New TOEFL’, draft project report, Educational Testing Service, Princeton, New Jersey

Douglas, D, 1994, ‘Quantity and quality in speaking test performance’, Language Testing, vol 11, no 2, pp 125-144

Douglas, D, and Selinker, L, 1992, ‘Analysing oral proficiency test performance in general and specific-purpose contexts’, System, vol 20, no 3, pp 317-328

Douglas, D, and Selinker, L, 1993, ‘Performance on a general versus a field-specific test of speaking proficiency by international teaching assistants’ in A new decade of language testing research, eds D Douglas and C Chapelle, TESOL Publications, Alexandria, VA, pp 235-256

English Language Institute, 2001, Examination for the Certificate of Competency in English, English Language Institute, University of Michigan, Ann Arbor

Foster, P, and Skehan, P, 1996, ‘The influence of planning on performance in task-based learning’, Studies in Second Language Acquisition, vol 18, no 3, pp 299-324

Foster, P, Tonkyn, A and Wigglesworth G, 2000, ‘A unit for all measures: Analysing spoken discourse’, Applied Linguistics, vol 21, no 3, pp 354-375

Fulcher, G, 1993, ‘The construction and validation of rating scales for oral tests in English as a foreign language’, unpublished doctoral dissertation, University of Lancaster, UK

Fulcher, G, 1996, ‘Does thick description lead to smart tests? A data-based approach to rating scale construction’, Language Testing, vol 13, no 2, pp 208-238

Fulcher, G, 2003, Testing second language speaking, Pearson Education Limited, London

Iwashita, N and McNamara, T, 2003, ‘Task and interviewer factors in assessments of spoken interaction in a second language’, unpublished report, Language Testing Research Centre, The University of Melbourne

Iwashita, N, McNamara, T and Elder, C, 2001, ‘Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information processing approach to task design’, Language Learning, vol 51, no 3, pp 401-436

Laufer, B, and Nation, P, 1995, ‘Vocabulary size and use: Lexical richness in L2 written production’, Applied Linguistics, vol 16, no 3, pp 307-322

Lewkowicz, J, 1997, ‘The integrated testing of a second language’ in Encyclopedia of Language and Education, Vol 7: Language Testing and Assessment, eds C Clapham and D Corson, Kluwer, Dordrecht, The Netherlands, pp 121-130


Magnan, SS, 1988, ‘Grammar and the ACTFL oral proficiency interview: discussion and data’, Modern Language Journal, vol 72, pp 266-276

O’Loughlin, K, 1995, ‘Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test’, Language Testing, vol 12, no 2, pp 217-237

Paltridge, B, 2000, Making sense of discourse analysis, Antipodean Educational Enterprises, Gold Coast, Queensland

Pollitt, A, and Murray, N, 1996, ‘What raters really pay attention to’ in Performance testing, cognition and assessment (Studies in Language Testing 3), eds M Milanovic and N Saville, Cambridge University Press, Cambridge, pp 74-91

Schiffrin, D, 1987, Discourse markers, Cambridge University Press, Cambridge

Skehan, P, and Foster, P, 1999, ‘The influence of task structure and processing conditions on narrative retellings’, Language Learning, vol 49, pp 93-120

Skehan, P, 1998, A cognitive approach to language learning, Oxford University Press, Oxford

Syntrillium Software Corporation, 2001, Cool Edit Pro, ver 2.1, computer program, Phoenix, Arizona

UCLES, 2001, IELTS Examiner Training Materials, University of Cambridge Local Examinations Syndicate, Cambridge

Wigglesworth, G, 1997, ‘An investigation of planning time and proficiency level on oral test discourse’, Language Testing, vol 14, pp 85-106

Wigglesworth, G, 2001, ‘Influences on performance in task-based oral assessments’ in Task based learning, eds M Bygate, P Skehan, and M Swain, Addison Wesley Longman, pp 186-209


APPENDIX 1: ANOVAS – ANALYSIS OF VARIANCE

Fluency and coherence ANOVAs

Measure                                  Source            Sum of squares    df   Mean square    F       Sig.
Abandoned words and repeats              Between groups         24.849        3        8.283      .824    .499
per 100 words * Score                    Within groups         160.753       16       10.047
Ratio of pause time to speak             Between groups         49.844        3       16.615     2.304    .116
time * Score                             Within groups         115.384       16        7.212
Ratio of pauses to words * Score         Between groups        392.857        3      130.952     1.454    .264
                                         Within groups       1,440.910       16       90.057
P2 words per 60 secs * Score             Between groups      4,896.861        3    1,632.287     3.154    .054
                                         Within groups       8,280.294       16      517.518
Words P2 only * Score                    Between groups     26,791.350        3    8,930.450     1.733    .200
                                         Within groups      82,433.200       16    5,152.075
P1 av length of turn * Score             Between groups      1,518.518        3      506.173     1.736    .200
                                         Within groups       4,664.718       16      291.545
P2/3 av length of turn * Score           Between groups      3,499.907        3    1,166.636     2.281    .118
                                         Within groups       8,182.661       16      511.416
P1-3 av length of turn * Score           Between groups      1,790.619        3      596.873     2.074    .144
                                         Within groups       4,605.400       16      287.837
Total words * Score                      Between groups    844,710.550        3  281,570.183     6.412    .005
                                         Within groups     702,596.400       16   43,912.275


Lexical resources ANOVAs

Measure                                  Source            Sum of squares    df   Mean square    F       Sig.
500% (first 500 words) * Score           Between groups        147.497        3       49.166     2.636    .085
                                         Within groups         298.453       16       18.653
K1% (first 1000 words) * Score           Between groups         55.125        3       18.375     1.564    .237
                                         Within groups         187.984       16       11.749
K2% (second 1000 words) * Score          Between groups          8.524        3        2.841      .709    .561
                                         Within groups          64.144       16        4.009
AWL% (academic word list) * Score        Between groups          7.873        3        2.624     1.416    .275
                                         Within groups          29.659       16        1.854
OWL% (offlist) * Score                   Between groups          6.026        3        2.009      .480    .701
                                         Within groups          67.011       16        4.188
Word length * Score                      Between groups           .102        3         .034      .587    .632
                                         Within groups            .926       16         .058
T/T ratio * Score                        Between groups           .010        3         .003      .817    .503
                                         Within groups            .067       16         .004

Grammatical range and accuracy ANOVAs

Measure                                  Source            Sum of squares    df   Mean square    F       Sig.
Utterance length * Score                 Between groups          5.768        3        1.923      .270    .846
                                         Within groups         106.973       15        7.132
Clauses per utterance * Score            Between groups           .296        3         .099     1.030    .407
                                         Within groups           1.436       15         .096
Proportion of error-free                 Between groups         84.112        3       28.037     6.721    .004
utterances * Score                       Within groups          62.574       15        4.172
Ratio of specific errors to              Between groups     10,830.11         3    3,610.04      7.784    .002
words * Score                            Within groups       6,956.58        15      463.77


4. The impact on candidate language of examiner deviation from a set interlocutor frame in the IELTS Speaking Test

Authors
Barry O’Sullivan, University of Roehampton, UK
Yang Lu, University of Reading, UK

Grant awarded: Round 8, 2002

This paper shows that the deviations examiners make from the interlocutor frame in the IELTS Speaking Test have little significant impact on the language produced by candidates.

ABSTRACT

The Interlocutor Frame (IF) was introduced by Cambridge ESOL in the early 1990s to ensure that all test events conform to the original test design so that all test-takers participate in essentially the same event. While the IF has been essentially successful in this respect, Lazaraton (1992, 2002) demonstrated that examiners sometimes deviate from it under test conditions. This study of the IELTS Speaking Test set out to locate specific sources of deviation, the nature of these deviations and their effect on the language of the candidates.

Sixty recordings of test events were analysed. The methodology involved the identification of deviations from the IF, and then the transcription of the candidates’ pre- and post-deviation output. The deviations were classified and the test-takers’ pre- and post-deviation oral production compared in terms of elaborating and expanding in discourse, linguistic accuracy and complexity as well as fluency.

Results indicate that the first two parts of the Speaking Test are quite stable in terms of deviations, with relatively few noted, and the impact of these deviations on the language of the candidates was essentially negligible in practical terms. However, in the final part of the Test, there appears to have been a somewhat different pattern of behaviour, particularly in relation to the number of paraphrased questions used by the examiners. The impact on candidate language again appears to have been minimal.

One implication of these findings is that it may be possible to allow for some flexibility in the Interlocutor Frame, though this should be limited to allowing for examiner paraphrasing of questions.


AUTHOR BIODATA:

BARRY O’SULLIVAN

Barry O’Sullivan has a PhD in language testing, and is particularly interested in issues related to performance testing, test validation and test-data management and analysis. He has lectured for many years on various aspects of language testing, and is currently Director of the Centre for Language Assessment Research (CLARe) at Roehampton University, London.

Barry’s publications have appeared in a number of international journals and he has presented his work at international conferences around the world. His book Issues in Business English Testing: the BEC revision project was published in 2006 by Cambridge University Press in the Studies in Language Testing series; and his next book is due to appear later this year. Barry is very active in language testing around the world and currently works with government ministries, universities and test developers in Europe, Asia, the Middle East and Central America. In addition to his work in the area of language testing, Barry taught in Ireland, England, Peru and Japan before taking up his current post.

YANG LU

Dr Yang Lu has a BA in English and English Literature from Jilin University, China. She obtained both her MA and doctorate degrees from the University of Reading. Her PhD investigates the nature of EFL test-takers’ spoken discourse competence. Dr Yang Lu has 18 years’ experience of language teaching and testing. She worked first as a classroom teacher and later as Director of the ESP Faculty and Deputy Coordinator of a British Council project based at Qingdao University, where she also worked as Associate Professor of English. Her academic interests are spoken discourse analysis and its applications in classroom and oral assessment contexts.

Dr Yang Lu’s publications include papers on: EFL learners’ interlanguage pragmatics; application of the Birmingham School approach; the roles of fuzziness in English language oral communication; and task-based grammar teaching. She has presented different aspects of her work at a number of international conferences. Dr Yang Lu was a Spaan Fellow for a validation study on the impact of examiners’ conversational styles.


CONTENTS

1 Introduction
2 The Interlocutor Frame
3 Methodology
   3.1 The IELTS Speaking Test
   3.2 Test-takers
   3.3 The examiners
4 The study
   4.1 The coding process
   4.2 Locating deviations
   4.3 Transcribing
5 Analysis
6 Results
   6.1 Overall
      6.1.1 Paraphrasing
      6.1.2 Interrupting
      6.1.3 Improvising
      6.1.4 Commenting
   6.2 Impact on test-takers’ language of each deviation type
   6.3 Location of deviations
      6.3.1 Deviations by test part
      6.3.2 Details of the deviations
7 Conclusions
Acknowledgement
8 References
Appendix 1: Profiles of the test-takers included in the study


1 INTRODUCTION

While research into various aspects of speaking tests has become more common and more varied over the past decade, there is still great scope for researchers in the area, as the fractured nature of research to date betrays the lack of a systematic research agenda in the field.

O’Sullivan (2000) called for a focus on a more clearly defined socio-cognitive perspective on speaking, and this is reflected in the framework for validating speaking tests outlined by Weir (2005). This is of particular relevance in tests of speaking where candidates are asked to interact either with other candidates and an examiner or, in the case of IELTS, with an examiner only. The co-constructive nature of spoken language means that the role played by the examiner-as-interlocutor in the test event is central to that event. One source of construct-irrelevant variance in face-to-face speaking tests lies in the potential for examiners to misrepresent the developer’s construct by consciously or subconsciously changing the way in which individual candidates are examined. There is considerable anecdotal evidence to suggest that examiners have a tendency to deviate from planned patterns of discourse during face-to-face speaking tests, and to some extent we might want this to happen, for example to allow the interaction to develop in an authentic way. However, examining speaking by using what is sometimes called a conversational interview (Brown 2003:1) is far more likely to result in test events that are essentially unique, though this is something that can be said of any truly free conversation – see also van Lier’s (1989) criticism of this type of test, in which he convincingly argues that true conversation is not necessarily reflected in interactions performed under test conditions. The dangers of such interviews, which include unpredictability in terms of topic, linguistic input and expected output, all of which can have an impact on test-taker performance, have long been noted in the language testing literature (see Wilds 1975; Shohamy 1983; Bachman 1988, 1990; Stansfield 1991; Stansfield & Kenyon 1992; McNamara 1996; Lazaraton 1996a).

There have been a number of studies in which rater linguistic behaviour has been explored in terms of its impact on candidate performance (see Brown & Hill 1998; Brown & Lumley 1997; Young & Milanovic 1992), and others in which the focus was on linguistic behaviour without an overt focus on the impact on candidate performance (Lazaraton 1996a; Lazaraton 1996b; Ross 1992; Ross & Berwick 1992). Other studies have looked at the broader context of examiner behaviour (Brown 1995; Chalhoub-Deville 1995; Halleck 1996; Hasselgren 1997; Lumley 1998; Lumley & O’Sullivan 2000; Thompson 1995; Upshur & Turner 1999). The results of these studies suggest that there is likely to be systematic variation in how examiners behave during speaking test events, in relation both to their language and to their rating.

These studies have tended to look either at the scores achieved by candidates or at the identification of specific variations in rater behaviour and have not focused so much on how the language of the candidates might be affected as a result of particular examiner linguistic behaviour (with the exception perhaps of Brown & Hill 1998). Another limitation of these studies (at least in terms of the study reported here) is the fact that they were almost all conducted on so-called conversational interviews (with the exception of the work of Lazaraton 2002). Since the 1990s, many tests have moved away from this format, to a more tightly controlled model of spoken test using an Interlocutor Frame.

2 THE INTERLOCUTOR FRAME

An Interlocutor Frame (IF) is essentially a script. The idea of using such a device is to ensure that all test events conform to the original test design so that all test-takers participate in essentially the same event. Of course, the very nature of live interaction means that no two are ever likely to be exactly


the same, but some measure of standardisation is essential if test-takers are to be treated fairly and equitably. Such frames were first introduced by Cambridge ESOL in the early 1990s (Saville & Hargreaves 1999) to increase standardisation of examiner behaviour in the test event – though it was demonstrated by Lazaraton (1992) that there might still be deviations from the Interlocutor Frame even after examiner training. This may have been at least partly a response by the examiners to the extreme rigidity of the early frames, where all responses (verbal, paraverbal and non-verbal) were scripted. Later work by Lazaraton (2002) provided evidence of the effect of examiner language and behaviour on ratings, and contributed to the development of the less rigid Interlocutor Frames used in subsequent speaking tests.

As we have pointed out above, the IF was originally introduced to give the test developer more control of the test event. However, Lazaraton has demonstrated that, when it comes to the actual event itself, examiners still have the potential to deviate from any frame.

The questions that emerge from this are:

1. Are there identifiable positions in the IELTS Speaking Test in which examiners tend to deviate from the Interlocutor Frame?

2. Where a deviation occurs, what is the nature of the deviation?

3. Where a deviation occurs, what is the effect on the linguistic performance of the candidate?

To investigate these questions, it was decided to revisit the IELTS Speaking Test following earlier work. Brown & Hill (1998) and Brown (2003) reported a study based on a version of the IELTS Speaking Test which was operational between 1989 and 2001. Findings from this work, together with outcomes from other studies on the IELTS Speaking Test, informed a major revision of the test in the late 1990s; from July 2001 the revised test incorporated an Interlocutor Frame for the first time to reduce rater variability (see Taylor, in press). (The structure of the current test is described briefly below in 3.1.) Since its introduction, the functioning of the Interlocutor Frame in the IELTS Speaking Test has been the focus of ongoing research and validation work; the study reported here forms part of that agenda and is intended to help shape future changes to the IF and to inform procedures for IELTS examiner training and standardisation.

3 METHODOLOGY

Previous studies of examiners’ use of Interlocutor Frames employed time-consuming, and therefore extremely expensive, research methodologies, particularly conversation analysis (see the work of Lazaraton 1992, 1996a, 1996b, 2002). Here, an alternative methodology is applied: audio-recorded examination events were first studied for deviations from the specified IF; these deviations were then coded, and the discourse around them transcribed and analysed.

The methodology involved the identification of deviations from the existing IF (in ‘real time’). The deviations identified were then transcribed to identify the test-takers’ pre- and post-deviation oral output. A total of approximately 60 recorded live IELTS Speaking Tests undertaken by a range of different examiners were analysed. The deviations were classified and the test-takers’ pre- and post-deviation oral production compared in terms of elaborating and expanding in discourse, linguistic accuracy and complexity as well as fluency.
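To make the design concrete, the following Python sketch shows one way the pre- and post-deviation chunks might be organised for comparison. The record structure and all example data are hypothetical; the deviation codes are those listed later in Table 3.

```python
# A minimal sketch of how the pre-/post-deviation comparison might be
# organised, assuming each coded deviation carries up to 30 seconds of
# transcribed candidate speech on either side. All example data are invented.
from dataclasses import dataclass

@dataclass
class Deviation:
    part: int         # test part in which the deviation occurred (1, 2 or 3)
    code: str         # deviation code, eg 'para', 'itr', 'imp', 'com'
    pre_text: str     # candidate speech transcribed before the deviation
    post_text: str    # candidate speech transcribed after the deviation
    pre_secs: float   # duration of the pre-deviation chunk (max 30 s)
    post_secs: float  # duration of the post-deviation chunk (max 30 s)

def words_per_second(text: str, seconds: float) -> float:
    # Simplified: the study excluded repetitions, self-corrections and
    # filled pauses before counting words
    return len(text.split()) / seconds

d = Deviation(part=3, code="para",
              pre_text="er i think the government should er build more parks",
              post_text="yes because green spaces er help people relax",
              pre_secs=30.0, post_secs=30.0)
print(words_per_second(d.pre_text, d.pre_secs))    # pre-deviation fluency
print(words_per_second(d.post_text, d.post_secs))  # post-deviation fluency
```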


3.1 The IELTS Speaking Test

The Speaking Test is one of four skills-focused components which make up the IELTS examination administered by the IELTS partners – Cambridge ESOL, British Council and IELTS Australia.

The Test consists of a one-to-one, face-to-face oral interview with a single examiner and candidate. All IELTS interviews are audio-taped for purposes of quality assurance and monitoring. The test has three parts (see Figure 1), each of which is designed to elicit different profiles of a candidate’s language. This has been shown to be the case in speaking tests for the Cambridge ESOL Main Suite examinations by O’Sullivan, Weir & Saville (2002) and O’Sullivan & Saville (2000) through use of an observation checklist. Brooks (2003) reports how a similar methodology was developed for and applied to IELTS; an internal Cambridge ESOL study (Brooks 2002) demonstrated that the different IELTS test parts were capable of fulfilling a specific function in terms of interaction pattern, task input and candidate output.

Part                      Nature of interaction                                         Timing

Part 1                    Examiner introduces him/herself and confirms candidate’s      4-5 minutes
Introduction and          identity. Examiner interviews candidate using verbal
interview                 questions selected from familiar topic frames.

Part 2                    Examiner asks candidate to speak for 1-2 minutes on a         3-4 minutes
Individual long turn      particular topic based on written input in the form of        (incl. 1 minute
                          a candidate task card and content-focused prompts.            preparation time)
                          Examiner asks one or two questions to round off the
                          long turn.

Part 3                    Examiner invites candidate to participate in discussion       4-5 minutes
Two-way discussion        of a more abstract nature, based on verbal questions
                          thematically linked to Part 2 topic.

Figure 1: IELTS Speaking Test format

The examiner interacts with the candidate and awards scores on four analytical criteria which contribute to an overall band score for speaking on a nine-point scale (further details of test format and scoring are available on the IELTS website: www.ielts.org). Since this study is concerned with the language of the test event as opposed to the outcome (ie score awarded) no further discussion of the scoring will be entered into at this point except to say that the band scores were used to assist the researchers in selecting a range of test events in which candidates of different levels were represented.

The test version selected for use in this study is Version 88, a version that was in use after July 2001, but that was later retired.

3.2 Test-takers

A total of 85 audio-taped live IELTS Speaking Test events using Test Version 88 were selected from administrations of the test conducted during 2002. Of these, 70 were selected for the study after consideration of test-takers’ nationality and first language. This was done to reflect the composition of the general IELTS candidature worldwide. Band scores awarded to candidates were also looked at to avoid a situation where one nationality might be over-represented at the different overall score levels. However, this was not always successful as it is clear from the overall patterns of IELTS scores that there are differences in performance levels across the many different nationalities represented in the test-taking population.


After an initial listening, a further eight performances were excluded because of poor quality of recording (previous experience has shown that this makes accurate transcription almost impossible), leaving 62 speaking performances for inclusion in the analysis. There were 21 female test-takers and 41 males. The language and nationality profile is shown in Table 1. From this table we can see that the population represents a wide range of first languages (17) and nationalities (18). This sample allows for some level of generalisation to the main IELTS population. More detailed information about the test-takers can be found in Appendix 1.

Language     Nationality    Number        Language     Nationality   Number

Arabic       Iraq            1            Portuguese   Brazil         1
Arabic       Oman            5            Portuguese   Portugal       1
Arabic       UAE             3            Punjabi      India          3
Bengali      Bangladesh      3            Pushtu       Pakistan       1
Chinese      China          17            Spanish      Colombia       1
Chinese      Taiwan          1            Spanish      Mexico         1
Farsi        Iran            1            Swedish      Sweden         5
German       Switzerland     1            Telugu       India          1
Hindi        India           5            Urdu         Pakistan       4
Japanese     Japan           1            Other        India          1
Korean       S Korea         1            Other        Malawi         1

Table 1: Language and nationality profile

3.3 The examiners

A total of 52 examiners conducted the 62 tests included in the matrix. The intention was to include as large a number of examiners as possible in order to minimise any impact on the data of non-standard behaviour by particular judges. For this reason, care was also taken to ensure that no one examiner would conduct the test on more than three occasions.

As all of the test events used in this study were ‘live’ (ie recordings of actual examinations), the conditions under which the tests were administered were controlled. This meant that all of the examiners were fully trained and standardised and had experience working with this test.

4 THE STUDY

4.1 The coding process

The first listening was undertaken to identify the nature and location of the obvious and recurring deviations from the Interlocutor Frame by examiners. The more frequent deviations were first identified, then categorised, and finally coded. Efforts were made to apply the coding consistently, according to a set of definitions of these deviations which was generated gradually during the listening. As is usual with this kind of work, the definitions were very sketchy at the outset but became more clearly defined by the time the first careful listening was finished. Table 2 presents the findings of this first listening.


Type of deviation                     Coding   Definition

interrupting question                 itr      question asked that stops the test-taker’s answer

hesitated question                    hes      question asked hesitatingly – possibly because of unfamiliarity with the interlocutor frame

paraphrased question                  para     question that is rephrased without the test-taker’s request – appears to be based on the examiner’s judgement of the candidate’s listening comprehension ability

paraphrased and explained question    parax    question that is both paraphrased and explained with an example, with or without the test-taker’s request

comments after replies                com      comment made after the test-taker’s reply that is more than the acknowledgement or acceptance the examiner is supposed to give; it tends to make the discourse more interactive

improvised question                   imp      question that is not part of the interlocutor frame but is asked based on the test-taker’s reply – very often about their personal interests or background

informal chatting                     chat     informal discussion mainly held by an examiner who is interested in the test-taker’s experience or background

loud laughing                         la       examiner’s loud laughing caused by the test-taker’s reply or answer

offer of clues                        cl       examiner’s utterance made to offer a hint and/or to facilitate candidate reply

Table 2: Development of coding for deviations (Listening 1)

A second careful listening was undertaken to confirm the identification of deviations, to check the coding for each case and to decide on a final list of the deviations to be examined. As can be seen from Table 2, there were two distinct types of deviation related to paraphrasing. While this appeared at first to be a useful distinction, it proved quite difficult to operationalise, as the study was based on audio tapes, a medium which does not allow the researcher to observe the body language and facial expressions of the parties involved. This made it practically impossible to know whether paraphrasing was performed in response to test-takers’ requests (verbal or non-verbal) or volunteered by the examiner. Therefore, the decision was made to collapse the two ‘paraphrasing’ categories and to report only the single category ‘paraphrase’.

The resulting occurrences of the deviations are shown in Table 3:

Type of deviation          Coding   Occurrences

interrupting question      itr      34
hesitated question         hes       7
paraphrased question       para     47
comments after reply       com      12
improvised question        imp      28
informal chatting          chat      9
laughing                   la        5
clues                      cl        2

Table 3: Occurrences of deviations
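As an illustration of this tallying step, the following Python sketch counts deviation codes across coded test events; the event data here are invented.

```python
# A small sketch of tallying coded deviations across listened test events,
# assuming each event yields a list of deviation codes (see Table 2).
from collections import Counter

coded_events = [
    ["para", "itr", "para"],          # codes noted in one test event
    ["imp", "para", "com"],
    ["itr", "para", "chat", "la"],
]

occurrences = Counter(code for event in coded_events for code in event)
print(occurrences.most_common())      # eg [('para', 4), ('itr', 2), ...]
```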


Two decisions were made after the second listening:

1. The four types of deviation found to be most frequent in the tests were selected for investigation: interrupting question, paraphrased question, comment after replies and improvised question. We also believe that these four types can properly be regarded as deviations because the Instructions to IELTS Examiners (Cambridge ESOL 2001) makes very clear to examiners that:

The Interlocutor Frame is used for the purpose of standardisation in order that all candidates are treated fairly and equally. Deviations from the script may introduce easier or more difficult language or change the focus of a task.

In Part 1 the exact words in the Frame should be used. Reformulating and explaining the questions in the examiner’s own words are not allowed.

In Part 2 examiners must use the words provided in the Frame to introduce the long turn task.

In Part 3 the Frame is less controlled so that the examiner’s language can be accommodated to the level of the candidate being examined.

In all parts of the test, examiners should refrain from making unscripted comments or asides.

Explanation needs to be given at this point about the rationale for including the interrupting questions and paraphrased questions in Part 3 as deviation types. Although, understandably, examiners sometimes cannot help stopping test-takers whose replies in Parts 1 and 3 are lengthy and slow the progression of the Speaking Test, this should be done in a more subtle way with body language, as suggested in the IELTS Speaking Test – FAQs and Feedback document (Cambridge ESOL 2001), or by using more tentative verbal hints. These strategies are suggested so as to limit any potential impact on subsequent candidate linguistic performance. The interrupting questions we have coded as deviations neither occur after lengthy replies by test-takers nor are made in a non-threatening (ie tentative) manner.

In Part 1, as the Instructions to IELTS Examiners states, ‘examiners should not explain any vocabulary in the frame’. Therefore, any reformulating of the questions is regarded here as a deviation and coded as such. However, in Part 3 examiners have more independence and flexibility within the Frame and are even encouraged ‘to develop the topic in a variety of directions according to the responses from the candidates’ (Cambridge ESOL 2001). The examiners’ decisions to reformulate, rephrase, exemplify or paraphrase the questions in Part 3 were noticed in the first listening of the tapes. In most cases this was done without a specific request from the test-takers and appears to have been based on examiner judgements of the individual test-taker’s level of proficiency and ability to discuss the comparatively more abstract topics contained in this section of the Test. However, it should be noted that this part differs from Parts 1 and 2 in that the prompts are just that – indicative prompts designed for the examiner to articulate in a way that is appropriate to the level of the candidate, not fully scripted questions to be ‘read off the page’ as in Parts 1 and 2.

2. The second decision concerned the amount of speech to be transcribed on either side of the deviation. Since it was believed that we needed a significant amount of language for transcription so that realistic observations could be made, and that all transcribed language chunks should be of similar length, we decided that 30 seconds of pre- and post-deviation speech should be transcribed and analysed to provide reliable data for investigation. Details of the transcription conventions used are given below. Pre-deviation chunks that were found to overlap with the post-deviation chunk of a previous question could not be transcribed. As a result, the number of pre- and post-deviation sections from the oral production by the candidates in each category was reduced, the final numbers being:

33 paraphrased questions
26 interrupting questions
17 improvised questions
9 comments after replies.

4.2 Locating deviations

The reason for looking at the points of deviation was to identify places in the Interlocutor Frame that might be prone to lead to unintended breakdowns or deviations. It was thought that locating these ‘weak’ points in the Frame would offer valuable insights into why the breakdown occurred and lead to a series of practical recommendations for the improvement of the IF as well as guidance for examiner training. Two procedures were undertaken for this purpose:

1. Occurrences of each deviation in the three test parts were identified to highlight where they were most likely to occur.

2. Occurrences of the questions where examiners deviated most were counted in order to discover where certain deviations would be most likely to occur within each test part.

4.3 Transcribing

Transcribing was conducted after the second, more detailed listening. The maximum amount of time for each pre- or post-deviation chunk was 30 seconds.

Conventions for transcriptions are as below:

er    ----   filled pauses
x     ----   one syllable of a non-transcribed word
……    ----   non-transcribed pre- or post-deviation oral production.

A total of over 10,000 words were transcribed in the pre- and post-deviation data. This dataset was then divided into nine files:

Part 1. com (comments after replies in Part 1)
Part 2. com (comments after replies in Part 2)
Part 3. com (comments after replies in Part 3)
Part 1. itr (interrupting questions in Part 1)
Part 3. itr (interrupting questions in Part 3)
Part 1. imp (improvised questions in Part 1)
Part 3. imp (improvised questions in Part 3)
Part 1. para (paraphrased questions in Part 1)
Part 3. para (paraphrased questions in Part 3)


5 ANALYSIS

To realise the aim of the study – to compare the quality of the candidates’ oral production in the pre- and post-deviation sections – four categories of measure were used; these are presented in Table 4 along with their sub-categories.

Category of measures       Sub-categories of measures

Fluency                    1. filled pauses per AS-unit
                           2. words per second (excluding repetitions, self-corrections and filled pauses)

Grammatical Accuracy       1. number of errors of plural or singular forms per word
                           2. number of errors of subject and verb agreement per word

Linguistic Complexity      average number of clauses per AS-unit

Discoursal Performance     1. number of expanding moves per T-unit
                           2. number of elaborating moves per T-unit
                           3. number of enhancing moves per T-unit

Table 4: Categories of measures used in transcription analysis

The Analysis of Speech Unit, or AS-unit (Foster, Tonkyn & Wigglesworth 2000) was used for calculating filled pauses and investigating linguistic complexity; for comparing the discoursal performance before and after deviations, the T-unit (Hunt 1970) was chosen as the unit in which changes were examined (a brief computational sketch of these per-unit measures follows the numbered list below). The rationale for this approach is:

1. According to Foster et al (2000: 365), the AS-unit is ‘a mainly syntactic unit…consisting of an independent clause, or sub-clausal unit, together with any subordinate clause(s) associated with either’. This allows us to analyse speech at different clausal units such as the non-finite clauses, so that the complexity of linguistic features can be measured.

2. Since studies of pausing in native-speaker speech have shown that pauses often occur at syntactic unit boundaries, especially at clausal boundaries (Raupach 1980; Garman 1990), the AS-unit was selected as the most appropriate unit for calculating filled pauses.

3. The T-unit is the ‘shortest unit into which a piece of discourse can be cut without leaving any sentence fragments as residue’ (Hunt 1970:189). The T-unit enables us to include in the analysis all acts, some of which can be coordinate clauses or fragments of clauses. This is beyond the scope of the AS-unit which regards these structures as separate units.
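As a concrete illustration of the per-unit calculations in Table 4, the following Python sketch computes three of the measures from hand-segmented units. The unit counts are invented, and the segmentation into AS-units and T-units is a manual analytic step not shown here.

```python
# A sketch of the per-unit calculations, assuming transcripts have been
# hand-segmented into AS-units (for filled pauses and clause counts) and
# T-units (for discourse moves). All counts below are invented.
as_units = [
    # (number of clauses in the AS-unit, number of filled pauses in it)
    (2, 1),
    (1, 0),
    (3, 2),
]
expanding_moves_per_t_unit = [1, 0, 0, 2]  # expanding moves noted per T-unit

clauses_per_as_unit = sum(c for c, _ in as_units) / len(as_units)
filled_pauses_per_as_unit = sum(p for _, p in as_units) / len(as_units)
expanding_per_t_unit = sum(expanding_moves_per_t_unit) / len(expanding_moves_per_t_unit)

print(f"Clauses per AS-unit: {clauses_per_as_unit:.2f}")
print(f"Filled pauses per AS-unit: {filled_pauses_per_as_unit:.2f}")
print(f"Expanding moves per T-unit: {expanding_per_t_unit:.2f}")
```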


6 RESULTS

6.1 Overall

The results are presented in relation to the three research questions posed earlier. We will look at the overall evidence of deviation and at any apparent impact of these deviations on test-taker language. In addition, we will look at the location of the deviations for evidence of systematicity which may point to inherent weaknesses in the interlocutor frame method. The overall results are presented so as to reflect the four deviation types identified above as most common.

6.1.1 Paraphrasing

The results suggest that there is a very limited impact on fluency, while in the other areas there are mixed results. There appears to be a reduction in accuracy immediately following the deviation in terms of plural/singular errors, though this is counteracted by the post-deviation increase in subject/verb agreement accuracy. It is in the area of complexity that the most obvious change occurs, with both the number of AS-units and the number of clauses per AS-unit appearing to significantly drop following the deviation. The discourse indicators also appear to show a mixed reaction. The results are grouped together as Table 5.

Fluency              Filled pauses per T-unit    Words per second
                     pre         post            pre        post
  Average            1.021       1.346           1.77       1.67
  Total              31.993      36.933          58.33      55.26

Accuracy             Plural/singular errors      Subject/verb agreement
                     per word                    errors per word
                     pre         post            pre        post
  Average            0.01        0.01            0.02       0.03
  Total              0.47        0.17            0.64       0.92

Complexity           Clauses per AS-unit
                     pre         post
  Average            0.01        0.01
  Total              0.47        0.17

Discourse            Expanding per T-unit   Elaborating per T-unit   Enhancing per T-unit
                     pre       post         pre        post          pre        post
  Average            0.43      0.31         0.16       0.22          0.23       0.17
  Total              14.28     10.28        5.41       7.12          7.75       5.57

Table 5: The impact of paraphrasing questions on candidate language


6.1.2 Interrupting

In Table 6 we can see that there is quite a large reduction in filled pauses per T-unit, though there is little change as regards the number of words spoken per second. Like the results from the paraphrasing analysis, there seems to be a reduction in accuracy immediately following the deviation in terms of plural/singular errors, though this is again reversed with the post-deviation increase in subject/verb agreement accuracy. The pattern found for complexity is not repeated here, and is instead seen to be much more inconsistent. The discourse indicators are the most consistent, with a slight drop in the post-deviation position, though this does not appear to be great enough to suggest a significant reaction.

Fluency              Filled pauses per T-unit    Words per second
                     pre         post            pre        post
  Average            1.035       0.558           1.832      1.857
  Total              26.919      14.500          47.63      48.28

Accuracy             Plural/singular errors      Subject/verb agreement
                     per word                    errors per word
                     pre         post            pre        post
  Average            0.009       0.005           0.008      0.016
  Total              0.222       0.142           0.207      0.428

Complexity           Clauses per AS-unit
                     pre         post
  Average            0.89        1.01
  Total              23.05       26.13

Discourse            Expanding per T-unit   Elaborating per T-unit   Enhancing per T-unit
                     pre       post         pre        post          pre        post
  Average            0.356     0.340        0.118      0.058         0.147      0.125
  Total              9.255     8.833        3.060      1.500         3.833      3.250

Table 6: The impact of interrupting questions on candidate language

6.1.3 Improvising

As far as the results for fluency are concerned (Table 7), there seems to be a significant reduction in the number of filled pauses following the deviation, though a corresponding reduction in the number of words spoken per second does not appear great. As for accuracy, there seems to be a very slight increase in the measures over the two sections, though the numbers are probably too small to draw any definite conclusions. With complexity, the picture is once again mixed, while the discourse indicators also appear to show little reaction apart from the amount of expanding carried out.


Fluency              Filled pauses per T-unit    Words per second
                     pre         post            pre        post
  Average            0.666       0.373           2.159      2.023
  Total              11.328      6.333           36.710     34.390

Accuracy             Plural/singular errors      Subject/verb agreement
                     per word                    errors per word
                     pre         post            pre        post
  Average            0.005       0.008           0.012      0.026
  Total              0.093       0.137           0.212      0.449

Complexity           Clauses per AS-unit
                     pre         post
  Average            1.217       1.431
  Total              20.692      24.333

Discourse            Expanding per T-unit   Elaborating per T-unit   Enhancing per T-unit
                     pre       post         pre        post          pre        post
  Average            0.340     0.152        0.156      0.153         0.198      0.229
  Total              5.787     2.583        2.660      2.600         3.368      3.892

Table 7: The impact of improvising questions on candidate language

6.1.4 Commenting

In the results from the analysis of the language bordering the deviations which were identified as being related to unscripted comments made by the examiners, we can see that there is a drop in the number of filled pauses, while there is little significant change in the number of words spoken per second (Table 8). The figures for accuracy are so small that there seems little point in attempting to make any meaningful comment on them, while for complexity there is quite a large increase in the number of clauses per AS-unit. Finally, the discourse indicators seem to indicate a systematic decrease right across the board.


Fluency              Filled pauses per T-unit    Words per second
                     pre         post            pre        post
  Average            0.666       0.473           2.137      2.353
  Total              4.983       4.386           19.230     21.180

Accuracy             Plural/singular errors      Subject/verb agreement
                     per word                    errors per word
                     pre         post            pre        post
  Average            0.000       0.002           0.008      0.015
  Total              0.000       0.017           0.069      0.137

Complexity           Clauses per AS-unit
                     pre         post
  Average            0.609       0.816
  Total              5.483       7.343

Discourse            Expanding per T-unit   Elaborating per T-unit   Enhancing per T-unit
                     pre       post         pre        post          pre        post
  Average            0.372     0.257        0.206      0.083         0.307      0.254
  Total              3.345     2.317        1.852      0.750         2.760      2.283

Table 8: The impact of commenting on responses on candidate language

6.2 Impact on test-takers’ language of each deviation type

If we then review these results in terms of each of the four language areas, we can see that of the four deviation types, paraphrasing seems to result in relatively little change to the language performance of the candidates, while all other deviation types seem to have a negative impact on fluency (see Table 9). However, the rate of speech does not appear to be affected to any great extent by the deviations.

The negative direction of ‘interrupting’, ‘improvising’ and ‘commenting’ suggested by Table 9 could imply that examiners should avoid all of these, while the positive direction of the impact of ‘paraphrasing’ suggests that examiners need not be so concerned about this deviation, which may even have a positive impact.

Fluency             Filled pauses per T-unit    Words per second
                    pre          post           pre          post

Paraphrasing        1.021        1.346          1.77         1.67
Interrupting        1.035        0.558          1.832        1.857
Improvising         0.666        0.373          2.159        2.023
Commenting          0.554        0.487          2.137        2.353

Table 9: The impact on fluency of each deviation type


In terms of the accuracy of the output, we can see that there does not appear to be any significant impact as a result of the deviations recorded here – though the numbers recorded may in any case be too small to make any meaningful difference (see Table 10).

Accuracy        Plural/Single error           Subject/Verb agreement
                per word                      error per word
                pre        post               pre        post
Paraphrasing    0.01       0.01               0.02       0.03
Interrupting    0.009      0.005              0.008      0.016
Improvising     0.005      0.008              0.012      0.026
Commenting      0.000      0.002              0.008      0.015

Table 10: The impact on accuracy of each deviation type

The complexity of the language is affected in different ways (Table 11). If anything, there is a slight increase in the complexity of the language used following each of the deviations, with the exception of paraphrasing.

Complexity      Clauses per AS-unit
                pre        post
Paraphrasing    0.01       0.01
Interrupting    0.89       1.01
Improvising     1.217      1.431
Commenting      0.609      0.816

Table 11: The impact on complexity of each deviation type

Finally, we can see from Table 12 that the amount of expanding undertaken by candidates is systematically reduced following all four deviation types, though the picture for elaborating and enhancing is quite mixed.

Discourse       Expanding per T-unit      Elaborating per T-unit     Enhancing per T-unit
                pre        post           pre        post            pre        post
Paraphrasing    0.43       0.31           0.16       0.22            0.23       0.17
Interrupting    0.356      0.340          0.118      0.058           0.147      0.125
Improvising     0.340      0.152          0.156      0.153           0.198      0.229
Commenting      0.372      0.257          0.206      0.083           0.307      0.254

Table 12: The impact on discourse of each deviation type


6.3 Location of deviations

The other aim of the research was to investigate where the deviations occur, in order to identify any pattern in the situations or conditions under which they arise. Two kinds of deviation location were studied: deviations across the three test parts and deviations within each test part.

6.3.1 Deviations by test part

Table 13 shows the numbers of occurrences of both the transcribed and non-transcribed (ie where the amount of language on either side of the deviation was too small to make meaningful inferences from the analyses) deviations in the tasks used in the three parts of the test. The non-transcribed deviations are added here to give a more complete picture of the amount of deviation from the IF that actually took place during these test events.

                        Paraphrased     Improvised      Comments after   Interrupting
Deviation type          Questions       Questions       Replies          Questions
                        P1  P2  P3      P1  P2  P3      P1  P2  P3       P1  P2  P3
Deviations analysed
for this study          4   0   29      8   0   9       2   4   4        14  0   12
Total number of
deviations              4   0   43      10  0   18      2   4   6        19  0   15

Table 13: Number of deviations by test part
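Since the discussion that follows reasons in proportions (eg the share of interrupting questions that produced no extended turn), it may help to see how such figures fall out of Table 13. A minimal sketch, using only the counts above summed across parts (the shortened labels are a convenience, not the report's terminology):

```python
# The counts below are read directly off Table 13 (summed across parts).
analysed = {"Paraphrased": 4 + 0 + 29, "Improvised": 8 + 0 + 9,
            "Comments": 2 + 4 + 4, "Interrupting": 14 + 0 + 12}
total = {"Paraphrased": 4 + 0 + 43, "Improvised": 10 + 0 + 18,
         "Comments": 2 + 4 + 6, "Interrupting": 19 + 0 + 15}

for dev_type, n in analysed.items():
    not_transcribed = total[dev_type] - n
    share = not_transcribed / total[dev_type]
    print(f"{dev_type}: {not_transcribed}/{total[dev_type]} "
          f"({share:.0%}) yielded no extended (30-second) turn")
```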

There are a number of clear tendencies implied by Table 13:

Interrupting questions are spread more or less evenly across Part 1 and Part 3. This is possibly due to the two-way nature of these parts, both of which involve questions and answers. When the test-taker gives a longer reply than the examiner considers necessary, the examiner may ask the next question to cut off the reply to the previous question in the middle of a sentence or even a word. The table also suggests that about 30% of interrupting questions do not result in an extended turn (at least 30 seconds) from the candidate. This may be because the questions are rhetorical (and do not require a response); because they are yes/no questions or questions that elicit only very short responses; or because the questions are either not clearly heard or understood by the candidates (and are either ignored or poorly answered). Since these possibilities can have potentially different impacts on candidate performance, this aspect of examiner behaviour clearly deserves more detailed examination.

There are more improvising questions in Part 3 than in Part 1, though the discourse patterns are the same. It is possible that the improvising questions in Part 3 result from the more abstract nature of the questions; this is most likely related to the way Part 3 is designed from the examiner’s perspective – see the above discussion. However, the conditions under which examiners tend to ask questions that are not in the Frame, but are raised spontaneously in response to information given by test-takers, can only be revealed by examining the location of deviations within tasks. We can also see that in only half of the instances was there enough language resulting from the improvised question to merit inclusion in this study. This implies that this question type did not tend to elicit a meaningful response (in terms of length of utterance) and as such may not always impact on candidate performance – though any lack of response may result in a lowering of the examiner’s opinion of the proficiency level of the candidate. Again, more detailed study of this phenomenon is required.

The only type of deviation observed in Part 2 (the individual long turn) was where the examiners made comments following the candidate responses. This is not really surprising when we consider that the nature of the task reduces the potential for paraphrasing and improvising questions. Also, since the candidates are told before they start the task that they will be stopped when time is up, interruptions are not expected to occur.

Comments after test-takers’ replies seem to occur most often in the Individual long turn task, especially if we bear in mind that in this part of the test examiners are only required to ask one or two rounding-off questions. Where and when these commenting deviations happen is of particular interest and will be discussed in the next part of this study.

91% of the paraphrasing questions occurred in Part 3, the two-way discussion task, where examiners invite the candidates to discuss the abstract aspects of the topic linked to Part 2, using unscripted questions. There is a suggestion here that in this part of the test the test-takers may have more difficulty answering the questions. Because of this, in most cases the examiners offered to rephrase or explain the questions without being asked, presumably basing this decision on their assessment of the candidates’ proficiency and ability to answer abstract questions. The nature of the questions seems to be the cause, as there are far fewer paraphrasing questions in Part 1, where the purpose of the questions is to elicit factual information. When we compare the overall number of paraphrased questions with those analysed here, we can see that there is no difference for Part 1, suggesting that the paraphrasing was successful – in that it always resulted in a long response (at least 30 seconds). The picture in Part 3 is different; here one in three of the paraphrased questions failed to elicit a long enough turn to be included in this analysis. This suggests that the paraphrases failed to enlighten the candidates – perhaps not surprisingly, since the concepts in Part 3 tend to be more abstract, and therefore more difficult to paraphrase, than in Part 1. The implication here is that examiner training, in this particular examination and in other tests in which this approach is used, should focus specifically on developing noticing, questioning and paraphrasing skills. It is also clear that this element of the test should be closely monitored in future administrations to ensure that candidate performances are not significantly affected by features of examiner behaviour that are not relevant to the skill being tested.

6.3.2 Details of the deviations

We will now examine each part of the test separately in order to identify which of the scripted questions were most likely to result in deviations from the Interlocutor Frame.

In Part 1 we can see that there is an even spread of deviations across the various questions (see Table 14). All of these questions are scripted for the examiner, who decides which ones to ask during the course of the test. It should be mentioned that the Frame contains more questions than are listed in the table; the others are not included here either because they were not asked by the examiners or because no deviations were associated with them.


PART 1                      Paraphrased   Improvised   Comments        Interrupting   Total
                            Questions     Questions    after Replies   Questions      Deviations
Introductory                Not analysed as this section is not assessed
Place of origin             0             0            0               3              3
Work/study                  0             0            1               2              3
Accommodation in UK         0             0            0               1              1
Everyday habits             0             1            0               0              1
Likes and personality       0             1            0               1              2
Favourite clothing          0             1            0               1              2
Language & other learning   0             1            0               0              1
Mode of learning            0             1            1               0              2
Cooking                     0             0            0               1              1
New experiences             0             0            0               1              1
Museums & galleries         1             0            0               1              2
Most loved festivals        1             0            0               2              3
Festival games              1             0            0               0              1
Festival general            0             0            0               1              1
Sports                      0             1            0               0              1
Sporting addictions         0             1            0               0              1
Most loved sports           1             1            0               0              2
Total                       4             8            2               14             28

Table 14: Spread of deviations in Part 1

There are a number of observations that can be made at this juncture:

1. One examiner was responsible for five of the interrupting questions, suggesting that this is more of a test monitoring issue than a training issue (if it were a training issue we would expect to find a greater spread of occurrences).

2. The majority of the interrupting questions served to bring a candidate turn to an end, and as such do not appear to impact on candidate performance on the task.

3. We might need to think further about improvised questions. These are unscripted, and represent a real threat to the integrity of the test. It may well be that this type of question can be eliminated to a great extent by training and by the inclusion of a statement on the Frame specifically referring to the problem.

4. There does not appear to be a systematic pattern of deviation in relation to specific questions or question types (direct or slightly more abstract).


PART 2                   Paraphrased   Improvised   Comments        Interrupting   Total
                         Questions     Questions    after Replies   Questions      Deviations
Instructions             0             0            0               0              0
During long turn         0             0            0               0              0
Anyone with job?         0             0            2               0              2
Will you have the job?   0             0            2               0              2
Total                    0             0            4               0              4

Table 15: Spread of deviations in Part 2

Table 15 shows that in Part 2, the Individual long turn, the examiners stayed very close to the Frame both during the introductory section of the task (when they were giving instructions) and while the candidate was engaged in the long turn itself. Four of the 10 commenting deviations analysed in this study occurred in Part 2. Further probing of the data shows that they all happened when the examiners were rounding off this part by asking one or two questions. It also seems that at this point they tend to make comments about the candidates’ answers to the questions, thus giving more acknowledgement and/or acceptance than the IF requires. This is an interesting finding, in that it suggests that examiners sense some need to ‘backchannel’: although the original purpose of the rounding-off questions appears to have been to help examiners form a bridge from Part 2 to Part 3, they still seem to need to say something else. This is yet another area in which further exploration is likely to add significantly to our understanding of the Speaking Test event in general and examiner behaviour in particular.

In Part 3 (Table 16) we can see that the stable patterns observed in the first two parts are not repeated. Instead, there are a far greater number of deviations from the IF, though this is not unexpected: examiners are offered a choice of prompts from which to select and fashion their questions, depending on how the interaction evolves, and are likely to make unscripted contributions in this final part of the test. As we have seen above, Parts 1 and 3 are somewhat similar in design, with both intended to result in interactive communication, so we would expect to see similar patterns of behaviour from the examiners in the two parts. In fact, the patterns are strikingly similar in most areas – there are similar levels of occurrence of improvised questions, comments and interruptions. However, there are far more instances of paraphrasing in this last part than in any of the others (in fact there are almost as many paraphrased questions in Part 3 as there are deviations in total for the other two parts). This may well be due to the less rigid nature of this final part, with the examiner offered a broad range of prompts to choose from when continuing the interaction, but is more likely due to the nature of the questions asked. Even if we take a less rigid view of paraphrasing (where scripted questions are asked using alternative wording or emphasis) and view this final part as being more loosely controlled, there is an issue with the degree of variation here. Examiners must regularly make ‘real-time’ decisions as to the value or relevance of questions. The fact that they are inclined to change the alternatives offered in this part of the test implies that they may not be totally comfortable with those alternatives, at least in terms of language.


PART 3                                   Paraphrased   Improvised   Comments        Interrupting   Total
                                         Questions     Questions    after Replies   Questions      Deviations
Factors for choice of career             3             2            1               3              9
Different factors for men/women          1             1            0               0              2
More important factors                   5             2            0               3              10
Career structure important?              7             1            1               0              9
(±) of job for life and change of jobs   2             1            2               2              7
Future working patterns?                 6             0            0               1              7
Being a boss (±)                         1             1            0               2              4
Qualities of a good employer?            4             0            0               1              5
Future boss/employee relationship?       0             1            0               0              1
Total                                    29            9            4               12             45

Table 16: Spread of deviations in Part 3

We can see from Table 16 that some of the prompts appear to be more likely to result in paraphrasing than others (though the number of times each question was asked varied); it is possible that these prompts place a greater demand on the resources of the candidate in terms of background knowledge and understanding or awareness of European/Western working habits. The inability of candidates to respond to the questions may well account for the greater resort to paraphrasing seen in this part of the test. As with the other findings here, this raises as many questions as it answers, particularly in relation to examiner decision making and to the impact of these deviations, appearing so late in the test event, on the overall score awarded.

7 CONCLUSIONS

In this study, we set out to explore the way in which IELTS examiners deviated from the relatively new Interlocutor Frame in the revised IELTS Speaking Test introduced in July 2001. We were interested to identify the nature and location of any deviations and to establish evidence of their impact on the language of the candidates who participated in the test events.

Our analyses appear to show that the first two parts of the Speaking Test are quite stable in terms of deviations, with relatively few noted; where these were found they were either associated with a single examiner or were unsystematically spread across the tasks. It was also clear that the examiners seemed to adhere very closely to the IF, and that the deviations that did occur came at natural interactional boundaries, such as at the end of medium or long turns from candidates. The impact of these deviations on the language of the candidates was essentially negligible in practical terms.

In the final part of the Test, there appears to have been a somewhat different pattern of behaviour, particularly in relation to the number of paraphrased questions used by the examiners. While Part 3 mirrors the other interactive task in terms of the number of improvised questions, comments on candidate responses and interrupting questions, there are seven times more paraphrased questions in the final task. The reason for this difference appears to be the alternative format of the task, which offers the examiner greater flexibility than in Parts 1 or 2: while the candidate was basically asked information-based questions in the first part (typically of a personal nature), in the final part the questions asked the candidate to conjecture, offer opinions and reflect on often abstract topics. The other possible explanation is that the question types may have been beyond the typical candidate in terms of cognitive load or of their cultural or background knowledge. Whatever the cause of the deviations, the impact on candidate language appears to have been minimal, though it remains unclear whether there was any impact on the final score awarded to candidates.

The use of an Interlocutor Frame is based on the rationale that without a scripted guide, examiners are likely to treat each test event as unique and that candidates risk being unfairly advantaged or disadvantaged as a result. Anecdotal evidence from some stakeholders, principally teachers and examiners, suggests that there is some concern that very tight Interlocutor Frames might cause examiners to become too stilted and unnatural in their language during a test event and that this has a negative impact on the face validity of the test. Test developers therefore have to balance the need to standardise the test event as much as possible (to ensure that all test-takers are examined under the same conditions and that an appropriate sample of language is elicited) against the need to give examiners some degree of flexibility so that they (and the more directly affected stakeholders) feel that the language of the event is natural and free flowing.

The results of our analyses suggest that examiners in the revised IELTS Speaking Test essentially adhere to the Interlocutor Frame they are given. The absence of systematicity in the location of deviations implies that the Frames are working as the test developers intended, and that there are no obvious points in the test at which deviation is likely to occur, particularly in the first two tasks. There is some slight cause for concern with the final part. It may well be that it is not possible to create a Frame that can adequately cope with the requirements of less controlled interaction, though the evidence from this study suggests that the extensive paraphrasing seen in the less controlled final section did not seriously impact on candidate performance; indeed, if anything it resulted in slightly improved performance. However, the evidence from this study implies that greater care in the creation of question options may result in a more successful implementation of the Frame. The most relevant implication of the findings of this study is that it may be possible to allow for some flexibility in the Interlocutor Frame, though this flexibility might be best confined to allowing for examiner paraphrasing of questions. That this might be achieved without negatively impacting on the language of the candidate is of particular interest.

ACKNOWLEDGEMENT

The authors would like to acknowledge the valuable input provided by Dr Lynda Taylor in preparing the report of the study that appears here.


REFERENCES

Bachman, LF, 1988, ‘Problems in examining the validity of the ACTFL oral proficiency interview’, Studies in Second Language Acquisition, vol 10, pp 149-64

Bachman, LF, 1990, Fundamental considerations in language testing, Oxford University Press, Oxford

Brooks, L, 2002, Report on functions observed in the old IELTS Speaking Test versus those in the revised Speaking Test, Internal Cambridge ESOL Report, Cambridge

Brooks, L, 2003, ‘Converting an observation checklist for use with the IELTS Speaking Test’, Research Notes Issue 11, University of Cambridge ESOL Examinations, Cambridge, pp 20-21

Brown, A, 1995, ‘The effect of rater variables in the development of an occupation specific language performance test’, Language Testing, vol 12, pp 1-15

Brown, A and Hill, K, 1998, ‘Interviewer style and candidate performance in the IELTS oral interview’, IELTS Research Reports, vol 1, IELTS Australia, Canberra, pp 1-19

Brown, A, 2003, ‘Interviewer variation and the co-construction of speaking proficiency’, Language Testing, vol 20, pp 1-25

Brown, A, and Lumley, T, 1997, ‘Interviewer variability in specific-purpose language performance tests’ in Current Developments and Alternatives in Language Assessment, eds A Huhta, V Kohonen, L Kurki-Suonio and S Luoma, University of Jyväskylä and University of Tampere, Jyväskylä, pp 137-150

Cambridge ESOL, 2001, IELTS Speaking Test-FAQs and feedback, Cambridge ESOL, Cambridge

Chalhoub-Deville, M, 1995, ‘A contextualized approach to describing oral language proficiency’, Language Learning, vol 45, pp 251-281

Foster, P, Tonkyn, A, and Wigglesworth, G, 2000, ‘Measuring spoken language: a unit for all reasons’, Applied Linguistics, vol 21, pp 354-375

Garman, M, 1990, Psycholinguistics, Cambridge University Press, Cambridge

Halleck, G, 1996, ‘Interrater reliability of the OPI: using academic trainee raters’, Foreign Language Annals, vol 29, pp 223-238

Hasselgren, A, 1997, ‘Oral test subskill scores: what they tell us about raters and pupils’ in Current Developments and Alternatives in Language Assessment, eds A Huhta, V Kohonen, L Kurki-Suonio and S Luoma, University of Jyväskylä and University of Tampere, Jyväskylä, pp 241-256

Hunt, K, 1970, Syntactic maturity in school-children and adults, Monograph of the Society for Research into Child Development

Lazaraton, A, 1992, ‘The structural organisation of a language interview: a conversational analytic perspective’, System, vol 20, pp 373-386

Lazaraton, A, 1996a, ‘Interlocutor support in oral proficiency interviews: the case of CASE’, Language Testing, vol 13, pp 151-172


Lazaraton, A, 1996b, ‘A qualitative approach to monitoring examiner conduct in the Cambridge Assessment of Spoken English (CASE)’, Performance, Testing and Cognition: Selected Papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem, eds M Milanovic and N Saville, UCLES/Cambridge University Press, Cambridge, pp 18-33

Lazaraton, A, 2002, A qualitative approach to the validation of oral language tests, Cambridge University Press, Cambridge

Lumley, T, 1998, ‘Perceptions of language-trained raters and occupational experts in a test of occupational English language proficiency’, English for Specific Purposes, vol 17, pp 347-367

Lumley, T and O’Sullivan, B, 2000, ‘The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking’, Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University, January 2000

McNamara, T, 1996, Measuring second language performance, Addison Wesley Longman, Harlow

O’Sullivan, B, 2000, ‘Towards a model of performance in oral language testing’, unpublished PhD dissertation, The University of Reading

O’Sullivan, B and Saville, N, 2000, ‘Developing observation checklists for speaking tests’, Research Notes, vol 3, pp 6-10

O’Sullivan, B, Weir, C and Saville, N, 2002, ‘Using observation checklists to validate speaking-test tasks’, Language Testing, vol 19, pp 33-56

Raupach, M, 1980, ‘Temporal variables in first and second language production’ in Temporal Variables in Speech: Studies in Honor of Frieda Goldman-Eisler, eds HW Dechert and M Raupach, Mouton, The Hague

Ross, S, 1992, ‘Accommodative questions in oral proficiency interviews’, Language Testing, vol 9, pp 173-186

Ross, S and Berwick, R, 1992, ‘The discourse of accommodation in oral proficiency interviews’, Studies in Second Language Acquisition, vol 14, pp 159-176

Saville, N and Hargreaves, P, 1999, ‘Assessing speaking in the revised FCE’, ELT Journal, vol 53, pp 42-51

Shohamy, E, 1983, ‘The stability of oral proficiency assessment on the oral interview testing procedures’, Language Learning, vol 33, pp 527-40

Stansfield, CW, 1991, ‘A comparative analysis of simulated oral proficiency interviews’ in Current Developments in Language Testing, ed S Anivan, SEAMEO Regional Language Centre, Singapore, pp 199-209

Stansfield, CW and Kenyon, DM, 1992, ‘Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview’, System, vol 20, pp 347-64

Taylor, L (in press), ‘Introduction’ in IELTS Collected Papers: Research in Speaking and Writing Assessment. Studies in Language Testing Volume 19, eds L Taylor and P Falvey, Cambridge ESOL/Cambridge University Press, Cambridge

Thompson, I, 1995, ‘A study of interrater reliability of the ACTFL oral proficiency interview in five European Languages: Data from ESL, French, German, Russian, and Spanish’, Foreign Language Annals, vol 28, pp 407-422


Upshur, JA and Turner, C, 1999, ‘Systematic effects in the rating of second-language speaking ability: test method and learner discourse’, Language Testing, vol 16, pp 82-111

Van Lier, L, 1989, ‘Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversations’, TESOL Quarterly, vol 23, pp 480-508

Weir, C, 2005, Language testing and validation: an evidence-based approach, Palgrave, Oxford

Wilds, C, 1975, ‘The oral interview test’ in Testing Language Proficiency, eds RL Jones and B Spolsky, Center for Applied Linguistics, Arlington, VA, pp 29-44

Young, R and Milanovic, M, 1992, ‘Discourse variation in oral proficiency interviews’, Studies in Second Language Acquisition, vol 14, pp 403-424


APPENDIX 1: PROFILES OF THE TEST-TAKERS INCLUDED IN THE STUDY

Cand. No. Gender Score (speaking) Nationality L1 Examiner

1188 M 6 UAE Arabic 9

0214 M 7 Jordan Arabic 23

0105 F 6 UAE Arabic 28

0397 M 7 Iraq Arabic 22

0385 M 6 UAE Arabic 22

0801 M 6 Oman Arabic 12

0803 F 9 Oman Arabic 48

0810 M 8 Oman Arabic 48

0890 M 6 Oman Arabic 53

0971 F 4 Oman Arabic 50

0190 M 6 Bangladesh Bengali 1

0403 M 6 Bangladesh Bengali 22

0386 F 8 Bangladesh Bengali 38

0931 M 5 China Chinese 26

1089 M 6 China Chinese 41

1119 M 5 China Chinese 35

1383 F 6 China Chinese 43

1427 M 5 China Chinese 34

1436 F 6 China Chinese 41

1487 F 4 Taiwan Chinese 27

0437 F 6 China Chinese 40

0466 F 4 China Chinese 31

0478 M 6 China Chinese 40

0439 M 5 China Chinese 20

0515 M 7 China Chinese 21

0549 M 6 China Chinese 17

0702 M 6 China Chinese 24

0717 M 5 China Chinese 15

0727 M 5 China Chinese 51

0752 F 5 China Chinese 29

0168 M 6 China Chinese 36

1396 M 6 Iran Farsi 41


0767 M 7 Switzerland German 18

3526 M 9 India Hindi 37

3527 M 6 India Hindi 37

5372 F 8 India Hindi 39

5375 M 7 India Hindi 39

6060 M 7 India Hindi 11

0941 M 8 Japan Japanese 32

1015 F 5 Japan Japanese 6

0078 F 6 Japan Japanese 45

0466 F 4 S Korea Korean 30

1002 M 8 Malawi Other 44

5371 M 6 India Other 39

1423 F 7 Brazil Portuguese 9

1494 M 7 Portugal Portuguese 34

3880 M 8 India Punjabi 33

4292 M 6 India Punjabi 3

5415 M 6 India Punjabi 4

1235 M 8 Pakistan Pushtu 49

1236 F 7 Colombia Spanish 32

0354 F 8 Mexico Spanish 31

0996 M 8 Sweden Swedish 9

0381 F 9 Sweden Swedish 31

0128 M 8 Sweden Swedish 10

0137 M 8 Sweden Swedish 13

0152 F 7 Sweden Swedish 14

6351 F 7 India Telugu 25

0229 M 7 Pakistan Urdu 8

0420 M 6 Pakistan Urdu 52

0371 F 8 Pakistan Urdu 42

0449 M 5 Pakistan Urdu 42


5. Exploring difficulty in Speaking tasks: an intra-task perspective

Authors
Cyril Weir, University of Bedfordshire, UK
Barry O’Sullivan, Roehampton University, UK
Tomoko Horai, Roehampton University, UK

Grant awarded: Round 9, 2003

This study looks at how the difficulty of a speaking task is affected by changes to the time offered for planning, the length of response expected and the amount of scaffolding provided (eg suggestions for content).

ABSTRACT

The oral presentation task has become an established format in high stakes oral testing as examining boards have come to routinely employ them in spoken language tests. This study explores how the difficulty of the Part 2 task (Individual Long Turn) in the IELTS Speaking Test can be manipulated using a framework based on the work of Skehan (1998), while working within the socio-cognitive perspective of test validation. The identification of a set of four equivalent tasks was undertaken in three phases. One of these tasks was left unaltered; the other three were manipulated along three variables: planning time, response time and scaffolded support. In the final phase of the study, 74 language students, at a range of ability levels, performed all four versions of the tasks and completed a brief cognitive processing questionnaire after each performance. The resulting audio files were then rated by two IELTS trained examiners working independently of each other using the current IELTS Speaking criteria. The questionnaire data were analysed in order to establish any differences in cognitive processing when performing the different task versions.

Results from the score data suggest that while the original un-manipulated version tends to result in the highest scores, there are significant differences to be found in the responses of three ability groups to the four tasks, indicating that task difficulty may well be affected differently for test candidates of different ability. These differences were reflected in the findings from the questionnaire analysis. The implications of these findings for teachers, test developers, test validators and researchers are discussed.


AUTHOR BIODATA

CYRIL WEIR

Cyril Weir has a PhD in language testing and has published widely in the fields of testing and evaluation. He is the author of Communicative Language Testing, Understanding and Developing Language Tests and Language Testing and Validation: an evidence based approach. He is the co-author of Evaluation in ELT, An Empirical Investigation of the Componentiality of L2 Reading in English for Academic Purposes, Empirical Bases for Construct Validation: the College English Test – a case study, and Reading in a Second Language and co-editor of Continuity and Innovation: Revising the Cambridge Proficiency in English Examination 1913-2002. Cyril Weir has taught short courses, lectured and carried out consultancies in language testing, evaluation and curriculum renewal in over 50 countries worldwide. With Mike Milanovic of UCLES he is the series editor of the Studies in Language Testing series published by CUP and on the editorial board of Language Assessment Quarterly and Reading in a Foreign Language. Cyril Weir is currently Powdrill Professor in English Language Acquisition at the University of Bedfordshire, where he is also the Director of the Centre for Research in English Language Learning and Assessment (CRELLA) which was set up on his arrival in 2005.

BARRY O’SULLIVAN

Barry O’Sullivan has a PhD in language testing, and is particularly interested in issues related to performance testing, test validation and test-data management and analysis. He has lectured for many years on various aspects of language testing, and is currently Director of the Centre for Language Assessment Research (CLARe) at Roehampton University, London. Barry’s publications have appeared in a number of international journals and he has presented his work at international conferences around the world. His book Issues in Business English Testing: the BEC revision project was published in 2006 by Cambridge University Press in the Studies in Language Testing series; and his next book is due to appear later this year. Barry is very active in language testing around the world and currently works with government ministries, universities and test developers in Europe, Asia, the Middle East and Central America. In addition to his work in the area of language testing, Barry taught in Ireland, England, Peru and Japan before taking up his current post.

TOMOKO HORAI

Tomoko Horai is a PhD student at Roehampton University, UK. She has an MA in Applied Linguistics and an MA in English Language Teaching, in addition to a MEd in TESOL/Applied Linguistics. She also has a number of years of teaching experience in a secondary school in Tokyo. Her current research interests are intra-task comparison and task difficulty in the testing of speaking. Her work has been presented at a number of international conferences including Language Testing Research Colloquium 2006, British Association of Applied Linguistics (BAAL) 2006, International Association of Teaching English as a Foreign Language (IATEFL) 2005 and 2006, Language Testing Forum 2005, and Japan Association of Language Teachers (JALT) 2004 and 2005.


CONTENTS

1 Introduction .... 4
2 The oral presentation .... 4
3 Task difficulty .... 5
4 The study .... 6
  4.1 Aims of the study .... 7
  4.2 Methodology .... 7
    4.2.1 Quantitative analysis .... 8
    4.2.2 Qualitative analysis .... 10
5 Results .... 13
  5.1 Rater agreement .... 14
  5.2 Score data analysis .... 15
  5.3 Questionnaire data analysis (from the perspective of the task) .... 18
6 Conclusions .... 24
  6.1 Implications .... 25
    6.1.1 Teachers .... 26
    6.1.2 Test developers .... 26
    6.1.3 Test validators .... 26
    6.1.4 Researchers .... 26
References .... 28
Appendix 1: Task difficulty checklist .... 33
Appendix 2: Readability statistics for 9 tasks .... 32
Appendix 3: The original set of tasks .... 34
Appendix 4: The final set of tasks .... 35
Appendix 5: SPSS one-way ANOVA output .... 36
Appendix 6: Questionnaire about Task 1 .... 37
Appendix 7: Questionnaire – unchanged and reduced time versions .... 38
Appendix 8: Questionnaire – no planning version .... 40
Appendix 9: Questionnaire – unscaffolded version .... 41


1 INTRODUCTION

In recent years, a number of studies have looked at variability in performance on spoken tasks from the perspective of language testing. Empirical evidence has been found to suggest significant effects resulting from test-taker-related variables (Berry 1994, 2004; Kunnan 1995; Purpura 1998), interlocutor-related variables (O'Sullivan 1995, 2000a, 2000b; Porter 1991; Porter & Shen 1991) and rater- and examiner-related variables (Brown 1995, 1998; Brown & Lumley 1997; Chalhoub-Deville 1995; Halleck 1996; Hasselgren 1997; Lazaraton 1996a, 1996b; Lumley 1998; Lumley & O’Sullivan 2000, 2001; Ross 1992; Ross & Berwick 1992; Thompson 1995; Upshur & Turner 1999; Young & Milanovic 1992).

Skehan and Foster (1997) have suggested that foreign language performance is affected by task processing conditions (see also Ortega 1999; Shohamy 1983; Skehan 1998). They have attempted to manipulate processing conditions in order to modify or predict difficulty. In line with this, Skehan (1998) and Norris et al (1998) have made serious attempts to identify task qualities which impinge upon task difficulty in spoken language. They proposed that difficulty is a function of code complexity, cognitive complexity, and communicative demand. A number of empirical findings have revealed that task difficulty has an effect on performance, as measured in the three areas of accuracy, fluency, and complexity (Skehan 1998; Mehnert 1998; Wigglesworth 1997; Skehan & Foster 1997, 1999; Ortega 1999; O'Sullivan, Weir & ffrench 2001).

2 THE ORAL PRESENTATION

‘Oral presentation’ is advocated as a valuable elicitation task for assessing speaking ability by a number of prominent authorities in the field (Clark & Swinton 1979; Bygate 1987; Underhill 1987; Weir 1993, 2005; Hughes 1989, 2003; Butler et al 2000; Fulcher 2003; Luoma 2004). Its practical advantages are obvious, not least that it can be delivered in a variety of modes. The telling advantage of this method is that one speaker produces a long turn alone, without interacting with other speakers. As such, it does not suffer from the ‘contaminating’ effect of the co-construction of discourse in interactive tasks, where one participant’s performance will affect the other’s, and so it is also more suitable for the investigation of intra-task variation, the subject of this study (Iwashita 1997; Luoma 2004; McNamara 1996; Ross & Berwick 1992; Weir 1993, 2005).

Over the past three decades, oral presentation tasks (also known as ‘individual long turn’ or ‘monologic’ tasks) have become an established format in high stakes oral testing as examining boards have come to routinely employ them in spoken language tests. The Test of Spoken English (TSE) from Educational Testing Service (ETS) in the USA, the International English Language Testing System (IELTS), the Cambridge ESOL Main Suite examinations, and the College English Test in China (the world’s biggest EFL examination) all include an ‘oral presentation’ task in their tests of speaking. In ETS’s TOEFL Academic Speaking Test (TAST) only monologues are used. In the context of the New Generation TOEFL speaking component, Butler et al (2000) advocate testing ‘extended discourse’, arguing that this is most relevant to the academic use of language at the university level. Earlier, Clark and Swinton (1979) found that the ‘picture sequence’ task was one of the most effective techniques in experimental tests which investigated suitable techniques for a speaking component for TOEFL.

Given its importance, it is surprising that over the last 20 years no research articles dedicated to oral presentation speaking tasks per se can be found in the most prominent journal in the field, Language Testing. Similarly, there has been little published research on the long turn elsewhere, even in the non-language testing literature (see Abdul Raof 2002). Certainly, very little empirical investigation has been conducted to find out what contributes to the degree of task difficulty within oral presentation tasks in a speaking test, even though such tasks play an important function in high stakes tests around the world.

3 TASK DIFFICULTY

In recent years, a number of studies have looked at variability in spoken performance from the perspective of task difficulty in language testing. Empirical evidence suggests significant effects of interlocutor-related variables on difficulty in interaction-based tasks (Porter 1991; Porter & Shen 1991; O'Sullivan 2000a, 2000b, 2002; Berry 1997, 2004; Buckingham 1997; Iwashita 1997).

In terms of the study of test task related variables, a number of studies concerning inter-task comparison have been undertaken. These have adopted both quantitative perspectives (Chalhoub-Deville 1995; Fulcher 1996; Henning 1983; Lumley & O’Sullivan 2000, 2001; O’Loughlin 1995; Norris et al 1998; Robinson 1995; Shohamy 1983; Shohamy, Reves & Bejarano 1986; Skehan 1996; Stansfield & Kenyon 1992; Upshur and Turner 1999; Wigglesworth & O’Loughlin 1993) and qualitative perspectives (Bygate 1999; Kormos 1999; O’Sullivan, Weir & Saville 2002; Shohamy 1994; Young 1995). These studies were conducted to investigate the impact on scores awarded for speakers’ performances across the different tasks. O’Sullivan and Weir (2002) report that on the whole, the results of these investigations are mixed, perhaps in part due to the crude nature of such investigations where many variables are uncontrolled, and tasks and test populations tend to vary with each study.

There is less research available on intra-task comparison, where internal aspects of one task are systematically manipulated. This is perhaps surprising, as this type of study enables the researcher to more closely control and manipulate the variables involved. Skehan and Foster (1997) suggest that foreign language performance is affected by task processing conditions. They propose that difficulty is a function of code complexity, cognitive complexity, and communicative stress. This view is largely supported by the literature (see, for example, Foster & Skehan 1996, 1999; Mehnert 1998; Ortega 1999; Skehan 1996, 1998; Skehan and Foster 2001; Wigglesworth 1997; Brown & Yule 1983; Crookes 1989). The most likely sources of intra-task variability appear to lie in the three broad areas outlined by Skehan and Foster (1997) and are most clearly observed when the following specific performance conditions are manipulated:

1. Planning time

2. Planning condition

3. Audience

4. Type and amount of input

5. Response time

6. Topic familiarity

Empirical findings have revealed that intra-task variation in terms of these conditions has an effect on performance as measured in the four areas of accuracy, fluency, complexity and lexical range (Ellis 1987; Crookes 1989; Williams 1992; Skehan 1996; Mehnert 1998; Wigglesworth 1997; Foster & Skehan 1996; Skehan & Foster 1997, 1999; Ortega 1999; O'Sullivan, Weir & ffrench 2001).

Weir (2005) argues that it is critical that examination boards are able to furnish validity evidence on their tests and that this should include research-based evidence on intra-task variation, ie how the conditions under which a single task is performed affect candidate performance. Research into intra-task variation is critical for high stakes tests because if we are able to manipulate the difficulty level of tasks we can create parallel forms of tasks at the same level and offer a principled way of establishing versions of tasks across the ability range (elementary to advanced). This is clearly of relevance to examination bodies that offer a suite of examinations, as is the case with Cambridge ESOL.

4 THE STUDY

This study is primarily designed to explore how the difficulty of the IELTS Speaking paper Part 2 task (Individual Long Turn) can be deliberately manipulated using a framework based on the work of Skehan (1998), while working within the socio-cognitive perspective of test validation suggested by O’Sullivan (2000a) and discussed in detail by Weir (2005).

In this research project, the conditions under which tasks are performed are treated as independent variables. We omitted the variables ‘type and amount of input’ and ‘topic familiarity’ in order to limit the scope of the study; these were felt to be adequately controlled for in the task selection process (described in detail below), in which an analysis of the language and topic of each task was undertaken (by considering student responses from the pilot study questionnaire and the responses of an ‘expert’ panel who applied the difficulty checklist to all tasks). The variable ‘audience’ was also controlled for by identifying the same audience for each task variant. The remaining variables are operationalised for the purpose of this study in the following way:

Variable             Unaltered                       Altered
Planning Time        1 minute                        No planning time
Planning Condition   Guided (3 scaffolding points)   No scaffolding
Response Time        2 minutes                       1 minute

Table 1: Task manipulation

The first of the three manipulations is in response to the findings of researchers such as Skehan and Foster (1997, 1999, 2001), Wigglesworth (1997) and Mehnert (1998), who suggest that there is a significant difference in performance where as little as one minute of planning is allowed. Since the findings have shown that this improvement is manifested in increased accuracy, we expect that the scores awarded by raters for this criterion will be most significantly affected. The second area of manipulation is related to the suggestion (by Foster & Skehan, among others) that the nature of the planning can contribute to its effect. For that reason, students will be given an opportunity to engage in guided planning (by using the scaffolded points) or unguided planning (where these points are removed). Finally, the notion of response time is addressed. Anecdotal evidence from examiners and researchers who have listened to recordings of timed responses suggests that test-takers (particularly at a low level of proficiency) tend to run out of things to say and either struggle to add to their performance, engage in repetition of points already made, or simply dry up. Any of these situations can lead to a lowering of the scores candidates are awarded by examiners. Since the original version of this task asks test-takers to respond for 1 to 2 minutes, it was felt to be important to investigate what the consequences of allowing this wide variation in performance time might be.


The hypotheses are formulated as follows:

1. Planning time will impact on task performance in terms of the test scores achieved by candidates.

2. Planning condition will impact on task performance in terms of the test scores achieved by candidates.

3. Response time will impact on task performance in terms of the test scores achieved by candidates.

4. Differences in performance in respect of the variables in hypotheses 1 to 3 will vary according to the level of proficiency of test-takers.

5. The manipulations to each task, as represented in hypotheses 1-3, will result in significant changes in the internal processing of the participants (i.e. the theory-based validity of the task will be affected by manipulating elements of the task setting or demands).

4.1 Aims of the study

The first aim of the study is to establish any differences in candidate linguistic behaviour, as reflected in test scores, arising from language elicitation tasks that have been manipulated along a number of socio-cognitive dimensions.

Since all students complete a theory-based validity questionnaire on completion of each of the four tasks they perform (see Appendix 7), analysis of these responses will allow us to make statements regarding the second of our research questions:

To establish any differences in candidate behaviour (cognitive processing) arising from language elicitation tasks that have been manipulated along a number of socio-cognitive dimensions

4.2 Methodology

As mentioned above, this study employs a mixture of quantitative and qualitative methods as appropriate. The study is divided into a number of phases, described below.

Phase 1: In this phase, a number of retired IELTS oral presentation tasks were analysed by the researchers using a checklist based on Skehan (1996). This analysis led to the selection of a series of nine tasks from which it was hoped to identify at least four that were truly equivalent (see Appendix 1 for the checklist). Readability statistics were generated for each of the tasks (see Appendix 2) in order to ascertain that each task was similar in terms of level of input. In addition to these analyses, a qualitative perspective on the task topics was undertaken. The nine tasks are contained in Appendix 3.
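The report does not state which tool produced the readability statistics in Appendix 2. As one hedged illustration of how comparable statistics could be generated for each task rubric (the textstat package and the truncated rubric texts below are assumptions, not the authors' method):

```python
# Hedged illustration: the report does not say how the Appendix 2
# readability statistics were produced; textstat is an assumed stand-in,
# and the rubric texts are truncated placeholders.
import textstat  # pip install textstat

rubrics = {
    "B": "Describe a part-time/holiday job that you have done. ...",
    "E": "Describe a teacher who has influenced you in your education. ...",
}

for task, text in rubrics.items():
    print(task,
          "| Flesch Reading Ease:", textstat.flesch_reading_ease(text),
          "| Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(text))
```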

Phase 2: A series of pilot administrations was conducted involving overseas university students at a UK institution. These students were on or above the language threshold level for entry into a UK university (ie approximately 6.5 on the IELTS overall band scale). The students were asked to perform a number of tasks and to report verbally to one of the researchers on their experience. From these pilot studies it was noted that the topics of two of the tasks (‘visiting a museum or art gallery’ and ‘entering a contest’) were considered by many students to be outside their experience and as such too difficult to talk about for two minutes. For this reason, the former was changed to a ‘sports event’ and the scaffolding or prompts rewritten, while the latter was dropped from the study. It was decided at this stage that the eight tasks that remained were suitable, and that these should form the basis of the next phase (these are in Appendix 4).

Phase 3: In this phase of the project, a formal trial of the eight selected tasks (A to H) was undertaken.


4.2.1 Quantitative analysis

A group of 54 students was asked to participate in the trial. Each student was asked to complete four tasks, and to fill in a short questionnaire immediately on completing each task. To ensure that an approximately equal number of students responded to each task, the following matrix was devised: each student was given, at random, a pack marked with one of Version 1 to Version 8. These packs contained the rubric for each of the tasks in the pack as well as four questionnaires.

            Version 1  Version 2  Version 3  Version 4  Version 5  Version 6  Version 7  Version 8
1st task    A          H          G          F          E          D          C          B
2nd task    B          A          H          G          F          E          D          C
3rd task    C          B          A          H          G          F          E          D
4th task    D          C          B          A          H          G          F          E

Table 2: Make-up of task batches for the trial
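The rotation in Table 2 can be expressed compactly: each pack starts one task earlier in the cycle A–H than the one before it, so every task occupies every serial position across the eight packs. A sketch (not the authors' procedure) that regenerates the matrix:

```python
# Sketch (not the authors' procedure): regenerate the Table 2 rotation.
# Each pack starts one task earlier in the cycle A-H than the previous
# pack, so every task occupies every serial position across the packs.
TASKS = "ABCDEFGH"

def batch(version: int) -> list[str]:
    """Tasks in the pack marked 'Version <version>' (1-8), in order."""
    return [TASKS[(row - (version - 1)) % 8] for row in range(4)]

for v in range(1, 9):
    print(f"Version {v}: {' '.join(batch(v))}")
```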

The above design resulted in the following numbers of students responding to each task.

Task   Number of students
A      27
B      26
C      27
D      28
E      26
F      26
G      26
H      26

Table 3: Number of students responding to each task

The students performed the tasks in a multimedia laboratory, speaking directly to a computer. Each student’s four responses were recorded and saved on the computer as a single file. These files were later edited to remove unwanted elements (such as long breaks following the end of a task performance or unwanted noise that occurred outside of the performance but was inadvertently recorded). The volume of each file was edited to ensure maximum audibility throughout. The performances of each student were then split up into the four constituent tasks and further edited (ie an indicator of student number and task was inserted at the beginning of the task and a bleep inserted to signal to the future rater that the task was now complete). The order of the files was randomised using a random numbers list generated using Microsoft Excel. Finally, eight CDs were created, each of which contained all of the performances for each task.
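The study randomised the file order using a random-number list generated in Microsoft Excel; a seeded shuffle, sketched below with hypothetical file names, achieves the same effect reproducibly:

```python
# Hedged sketch: a seeded shuffle reproduces the effect of the Excel
# random-number list used in the study.  File names are hypothetical.
import random

files = [f"student{s:02d}_task{t}.wav" for s in range(1, 55) for t in "ABCD"]
# (each student actually performed 4 of the 8 tasks; "ABCD" is illustrative)

random.seed(42)        # fixed seed so the order can be recreated if needed
random.shuffle(files)  # in-place random permutation of the playback order
print(files[:5])
```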


These eight CDs were then duplicated and a set was given to each of two trained and experienced IELTS raters who rated all tasks over a one-week period. The resulting score data were subjected to multi-faceted Rasch (MFR) analysis using the FACETS program (Linacre 2003) in order to identify a set of at least four tasks where any differences in difficulty could be shown to be statistically insignificant. (For recent examples of this statistical procedure in the language testing literature see Lumley & O’Sullivan 2005, Bonk & Ockey 2004).

The task measurement report from the FACETS output (Table 4) suggests that Task A is potentially significantly easier than the other seven. In addition, the infit mean square statistic (which indicates that all tasks are within the accepted range) suggests that all of the tasks are working in a predictable way.

| Fair-M  | Model         | Infit       | Outfit      |         |
| Average | Measure  S.E. | MnSq  ZStd  | MnSq  ZStd  | N Tasks |
| 5.86    |  -.71    .11  | 1.1    0    | 1.1    0    | 1 A     |
| 5.74    |  -.27    .11  | 1.1    0    | 1.1    1    | 2 B     |
| 5.69    |  -.11    .11  | 1.0    0    | 1.0    0    | 3 C     |
| 5.66    |  -.02    .11  |  .8   -2    |  .8   -2    | 4 D     |
| 5.63    |   .08    .12  |  .9   -1    |  .9   -1    | 5 E     |
| 5.51    |   .45    .12  | 1.2    1    | 1.1    1    | 6 F     |
| 5.56    |   .29    .11  | 1.0    0    |  .9    0    | 7 G     |
| 5.57    |   .28    .11  | 1.0    0    | 1.0    0    | 8 H     |

Table 4: Task measurement report (summary of FACETS output)


Follow-up analysis of the scores awarded by the raters indicates that this difference appears to be of statistical significance only in the case of Tasks G and H (see Appendix 5), which appear to differ significantly from Tasks A and C. The boxplots generated from the SPSS output (Figure 1) suggest that there is a broader spread of scores for Tasks A and C, though in general the mean scores do not appear to be widely spread.

Figure 1: Boxplots comparing task means from SPSS output

The results of these analyses suggest that Tasks A, C, G and H should not be considered for inclusion in the main study, though all of the others are acceptable.
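Appendix 5 contains the SPSS one-way ANOVA output referred to above. For readers working outside SPSS, the same kind of comparison can be sketched as follows (the scores here are invented for illustration, not the study's data):

```python
# Illustrative only: a one-way ANOVA comparing scores across tasks,
# analogous to the SPSS output in Appendix 5 (the scores are invented).
from scipy.stats import f_oneway

scores = {
    "A": [6.5, 6.0, 5.5, 6.0, 6.5],
    "C": [6.0, 5.5, 6.0, 5.0, 6.5],
    "G": [5.5, 5.0, 5.5, 5.0, 6.0],
    "H": [5.0, 5.5, 5.5, 6.0, 5.0],
}

f_stat, p_value = f_oneway(*scores.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# A significant result would warrant post-hoc pairwise comparisons
# (eg Tukey HSD via statsmodels) to see which tasks differ.
```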

4.2.2 Qualitative analysis

In addition to the quantitative analysis described above, we analysed the responses of all students to a short questionnaire (see Appendix 6) about their perceptions of the tasks. For this phase of the study, we focused primarily on their responses to the items related to topic familiarity and degree of abstractness of the tasks. The data from these questionnaires (each student completed a questionnaire for each task) were entered into SPSS and analysed for instances of extreme views, as it was thought that we should only accept tasks in which the students felt a degree of comfort that the topic was familiar and that the information given was of a concrete nature. From this analysis, we made a preliminary decision to eliminate two of the eight tasks: Tasks G and H (Table 5). It was decided to monitor Task C, as students perceived it as being somewhat difficult in terms of vocabulary and grammar – though the language of the task (see Appendix 4) does not appear to be significantly different from that of the other tasks.


TASK     Topic             Information       Vocabulary        Grammar
         1  2  3  4  5     1  2  3  4  5     1  2  3  4  5     1  2  3  4  5
A        9  8  7  3  0     9  8  8  1  0    12  8  6  1  0    11 10  4  1  0
B        8  8  6  2  1     9  6 10  1  0    14  8  4  0  0    11  7  6  1  1
C        2 13  5  2  3     6  9  7  3  1    12  6  4  4  1     8  9  8  2  0
D        9  9  7  1  2     5 12  6  3  1    11 13  3  1  0    11 13  4  0  0
E        7  8  8  2  0     6 10 10  0  0    15  8  2  1  0    14  8  3  1  0
F        4 10  8  3  1     6  9 11  0  0    11  7  6  0  1    10 11  4  0  0
G        3  8 11  3  1     7  2 12  4  1    14  5  4  3  0    11  6  8  1  0
H        7  3 11  3  2     7  3 10  3  3    15  6  5  0  0    11  5  9  1  0

KEY: Topic: 1 = Familiar, 5 = Unfamiliar; Information: 1 = Very Concrete, 5 = Very Abstract;
Vocabulary & Grammar: 1 = Easy, 5 = Difficult

Table 5: Qualitative analysis of the tasks (suggesting that G & H be eliminated)

Based on the two types of analyses, the researchers identified four tasks as being equivalent from the qualitative and quantitative perspectives. These were:

Task B
B. Describe a part-time/holiday job that you have done.
You should say:
- How you got the job
- What the job involved
- How long the job lasted
And explain why you think you did the job well or badly.

Task E
E. Describe a teacher who has influenced you in your education.
You should say:
- Where you met them
- What subject they taught
- What was special about them
And explain why this person influenced you so much.

Task D
D. Describe an enjoyable event that you experienced when you were at school.
You should say:
- What the event was
- When it happened
- What was good about it
And explain why you particularly remember this event.

Task F
F. Describe a film or a TV programme which made a strong impression on you.
You should say:
- What kind of film or TV programme it was (eg comedy)
- When you saw it
- What it was about
And explain why it made such an impression on you.

Figure 2: Four tasks selected for the main study (Phase 5)

In addition to identifying four tasks that can be considered ‘equivalent’ from as broad a number of perspectives as possible, the early phases of the project also saw the development of a series of theory-based validity questionnaires based on ongoing research at the Centre for Research in Testing, Evaluation and Curriculum (CRTEC) at Roehampton University, London (reported by Akmar Zainal Abidin at the Language Testing Forum, Cambridge, 2003). These questionnaires, which are designed to offer insights into the cognitive processing of the participants before and during test task performance, are based on Weir (2005) and were piloted during Phase 3 (see Appendix 7 for the four versions developed for use in this project).

During this piloting, a number of minor amendments were made to the original drafts based on qualitative feedback from participants – primarily for reasons of clarity and where the language proved to be beyond the level of participating learners.

Phase 4: The phases described above enabled us to identify a set of four oral presentation tasks for which we could claim equivalence from both qualitative and quantitative perspectives; to the best of our knowledge, this had not been attempted before in either language testing or SLA research.

In this phase, the resulting tasks were manipulated according to the variables identified in Section IV above. Table 6 shows the manipulation applied to each of the four tasks: Task B remained unchanged, Task D had no planning time, Task E had no scaffolding and Task F required a response time of one minute (instead of two minutes).

Task   No Change   No Planning time   No Scaffolding   1-minute response
B      ●           x                  x                x
D      x           ●                  x                x
E      x           x                  ●                x
F      x           x                  x                ●

Table 6: Manipulation of each task

To ensure that there was no order effect, the following matrix was designed (see Table 7). As described above, in this phase of the study, students were asked to perform four tasks, one of which remained unchanged from the original, with the others manipulated as described in Table 6. In the matrix in Table 7, each version appears an equal number of times and at each level (ie to be performed first, second, etc).

Version 1:   B   D   E   F
Version 2:   D   B   F   E
Version 3:   E   F   B   D
Version 4:   F   E   D   B

Table 7: Setup for test versions for the main study
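The matrix is a 4 x 4 Latin square. As an illustration, the following minimal Python sketch (ours, not part of the original study) verifies the balance property described above: each task version appears exactly once in each position.

# Ours, not from the report: check the balance property of the Table 7 matrix.
versions = [
    ["B", "D", "E", "F"],   # Version 1
    ["D", "B", "F", "E"],   # Version 2
    ["E", "F", "B", "D"],   # Version 3
    ["F", "E", "D", "B"],   # Version 4
]

# Each task must appear exactly once in each position (a 4 x 4 Latin square),
# so no task is systematically advantaged by being performed first or last.
for position in range(4):
    assert {row[position] for row in versions} == {"B", "D", "E", "F"}
print("each task appears once in every position: no order effect by design")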


The tasks used in the study can be seen in Figure 3 below.

Task B [UNCHANGED]
You will have to talk about the topic for two minutes. You have one minute to think about what you are going to say.
B. Describe a part-time/holiday job that you have done.
You should say:
- How you got the job
- What the job involved
- How long the job lasted
And explain why you think you did the job well or badly.

Task E [NO SCAFFOLDING]
You will have to talk about the topic for two minutes. You have one minute to think about what you are going to say.
E. Describe a teacher who has influenced you in your education.
And explain why this person influenced you so much.

Task D [NO PLANNING]
You will have to talk about the topic for two minutes. You should start speaking now, without taking time to think about what you are going to say.
D. Describe an enjoyable event that you experienced when you were at school.
You should say:
- What the event was
- When it happened
- What was good about it
And explain why you particularly remember this event.

Task F [REDUCED OUTPUT]
You will have to talk about the topic for one minute. You have one minute to think about what you are going to say.
F. Describe a film or a TV programme which made a strong impression on you.
You should say:
- What kind of film or TV programme it was (eg comedy)
- When you saw it
- What it was about
And explain why it made such an impression on you.

Figure 3: Manipulation of the tasks in the main study

Phase 5: In the main part of the study, a total of 74 language students at a range of ability levels performed all four versions of the tasks according to the schedule defined by the matrix in Table 7. The resulting audio files were then edited and saved as individual MP3 files, with the four tasks performed by any individual separated so that raters would not be overly affected by performance on an early task when rating the later tasks (a halo effect). Four CDs were created, each containing a randomised set of performances for one task (B, D, E and F). These were rated by two trained IELTS examiners working independently of each other, using the current rating criteria and scales for the operational IELTS Speaking Test.

5 RESULTS

The scores from these ratings were then analysed using MFR, and the resulting data were used for ANOVA and correlational analysis in SPSS, Version 12. The model used in this MFR analysis takes into account the ability of the candidates, the relative harshness of the raters and the difficulty of the tasks to produce a score called the Fair Average; Fair Average scores have the additional advantage of being truly interval in nature.
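The report does not print the model itself; for reference, the many-facet Rasch rating scale model on which FACETS (Linacre 2003) is based is conventionally written as (our notation, a standard formulation rather than one taken from the report):

\log\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = B_n - C_j - D_i - F_k

where P_{nijk} is the probability of candidate n receiving category k from rater j on task i, B_n is the candidate’s ability, C_j is the rater’s severity, D_i is the task’s difficulty, and F_k is the difficulty of the step from category k-1 to category k. The Fair Average re-expresses each candidate’s measure as an expected raw-scale score with the other facets held at their average severity and difficulty.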

This will allow us to make statements regarding the first aim of the study:

To establish any differences in candidate linguistic behaviour, as reflected in test scores, to language elicitation tasks that have been manipulated along a number of socio-cognitive dimensions


Since all students completed a theory-based validity questionnaire on completion of each of the four tasks they performed (see Appendix 7), analysis of these responses will allow us to make statements regarding the second of our aims:

To establish any differences in candidate behaviour (cognitive processing) to language elicitation tasks that have been manipulated along a number of socio-cognitive dimensions

The existence (or not) of observable systematic differences across the four tasks will be interpreted in light of our third aim:

To create a framework for the systematic manipulation of speaking tasks

5.1 Rater agreement

Before analysing the candidate performance data, it is first necessary to explore inter-rater reliability. In this project, a number of measures will be considered in order to gain a broad picture of the extent to which the two raters behaved in a consistent and predictable way.

First, a correlation analysis was undertaken to explore the degree to which the two raters placed the candidates in a similar order. The results of this analysis (Table 8) indicate a significant correlation for all comparisons (the more meaningful correlations were highlighted in the original table). The overall correlation, based on the raw data, is 0.75: certainly acceptable, though not as high as we would expect to find in an operational test event (where it is usual to expect correlations above 0.8). It is possible that the unnatural nature of the rating process, where each rater was given a set of four CDs, each containing the performances of all candidates on a particular task, may have affected rating.

                                  Fluency &     Lexical      Grammatical range   Pronunciation 2   Overall 2
                                  coherence 2   resource 2   & accuracy 2
Fluency & coherence 1             .700          .696         .685                .629              .738
Lexical resource 1                .677          .662         .662                .592              .694
Grammatical range & accuracy 1    .656          .631         .668                .588              .679
Pronunciation 1                   .583          .604         .651                .589              .640
Overall 1                         .720          .703         .715                .643              .750

All correlations significant at the 0.01 level (2-tailed).

Table 8: Correlations between the raters
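A minimal sketch of this kind of inter-rater correlation check follows, using synthetic scores; the column names, variances and seed are illustrative assumptions, not the study’s data.

import numpy as np
import pandas as pd

# A sketch of the inter-rater correlation check behind Table 8, on synthetic
# scores (the variances, seed and column names are invented).
rng = np.random.default_rng(8)
ability = rng.normal(6.0, 1.0, 296)        # 74 candidates x 4 tasks
scores = pd.DataFrame({
    "overall_rater1": ability + rng.normal(0, 0.6, 296),
    "overall_rater2": ability + rng.normal(0, 0.6, 296),
})

# Pearson r between the two raters' overall band scores
r = scores["overall_rater1"].corr(scores["overall_rater2"])
print(f"inter-rater correlation: {r:.2f}")  # compare with the 0.75 reported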

Another estimate of inter-rater agreement is the degree to which the raters agree on scores around the critical boundary. A widely recognised threshold for IELTS is an overall band score of 6.5 (ie the level demanded by most universities for entrance, computed from scores on the four skills modules); although operational scores for IELTS Speaking are reported only at the whole band level, it was decided to use 6.5 in the following analysis. Table 9 shows the level of agreement and disagreement between the two raters, with agreement lying on the diagonal of the table. The raters agreed for a total of 78% of cases and disagreed on the remaining 22%. The table also suggests that Rater 1 is somewhat harsher than Rater 2.

From these two analyses, we can see that the raters were in broad agreement. As both the correlation between the overall scores and the critical boundary agreement indices are acceptable, we can accept that the scores awarded can be used for additional analysis.


                Rater 2 Pass   Rater 2 Fail
Rater 1 Pass    48             45
Rater 1 Fail    20             183

Table 9: Critical boundary agreement (boundary = 6.5)
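The agreement computation itself is a simple cross-tabulation, sketched below on the same kind of synthetic ratings as before (all values invented for illustration).

import numpy as np
import pandas as pd

# Sketch of the critical-boundary agreement check in Table 9: cross-tabulate
# pass/fail at the 6.5 boundary for two raters (synthetic data).
rng = np.random.default_rng(9)
ability = rng.normal(6.0, 1.0, 296)
rater1 = ability + rng.normal(0, 0.6, 296)
rater2 = ability + rng.normal(0, 0.6, 296)

table = pd.crosstab(rater1 >= 6.5, rater2 >= 6.5,
                    rownames=["rater1 pass"], colnames=["rater2 pass"])
agreement = np.mean((rater1 >= 6.5) == (rater2 >= 6.5))
print(table)
print(f"boundary agreement: {agreement:.0%}")  # 78% in the study's data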

5.2 Score data analysis

Following the tests of rater agreement, the first analysis conducted on the task performance score data involved estimating the correlations between the four tasks. Table 10 shows that the correlations were very high and all significant at the 0.01 level. It is particularly interesting that Task B is most highly correlated with Tasks D and F, suggesting that the existence of planning time may not significantly affect task performance: Task D was the same as Task B with the single exception that in Task D no planning time was available to candidates. The other interesting suggestion is that the amount of output expected of the candidate does not appear to have had a significant impact on the score achieved: Task F is the same as Task B except that candidates were expected to talk for just one minute rather than the two minutes allowed in Task B.

          Task B   Task D   Task E   Task F
Task B    1        .900     .871     .901
Task D    .900     1        .862     .858
Task E    .871     .862     1        .862
Task F    .901     .858     .862     1

All correlations are significant at the 0.01 level (2-tailed).

Table 10: Correlations between the four tasks

To explore the data more fully from the perspective of variation in performance across the four tasks, it was decided to classify each candidate into one of three groups: those of High ability (at or above the critical boundary of 6.5); Borderline cases (from 6.0 up to, but not including, 6.5); and Low ability candidates (scoring below 6.0). All three categorisations were based on performance over the four tasks.
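This classification is a straightforward banding of each candidate’s mean score over the four tasks; a minimal sketch follows (the candidate values are hypothetical).

import pandas as pd

# Sketch of the three-way classification described above, applied to each
# candidate's mean score over the four tasks (the values are hypothetical).
mean_scores = pd.Series([6.8, 6.2, 5.4, 6.5, 5.9])
groups = pd.cut(mean_scores,
                bins=[0.0, 6.0, 6.5, 9.0],           # Low | Borderline | High
                labels=["Low", "Borderline", "High"],
                right=False)                          # 6.0 <= Borderline < 6.5
print(groups.tolist())  # ['High', 'Borderline', 'Low', 'High', 'Low']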

Ability level        N          Task                N
Pass                 19         Original            74
Borderline           27         No Planning         74
Fail                 28         No Support          74
                                Reduced Response    74

Table 11: Descriptive statistics of the main study data


The descriptive statistics (see Table 11) show that the relative ability level of the population was quite low, with 28 of the 74 candidates (38%) in the ‘fail’ category and only 19 (26%) clearly achieving 6.5 or above. The results of the ANOVA (Table 12) show that there are significant differences between the four task types and between the three ability groups (the latter as we would expect, since the groups were defined by overall score averages over the four tasks). There does not appear to be any significant interaction between ability group and task type, suggesting the stability of these tasks across ability levels; however, significant differences emerge for task and ability as separate variables.

Source            Type III Sum of Squares   df    Mean Square   F           Sig.
Corrected Model   158.490(a)                11    14.408        58.714      .000
Intercept         9891.754                  1     9891.754      40309.670   .000
task              4.287                     3     1.429         5.823       .001
ability           151.483                   2     75.742        308.653     .000
task * ability    2.570                     6     .428          1.745       .110
Error             69.692                    284   .245
Total             10066.500                 296
Corrected Total   228.182                   295

a. R Squared = .695 (Adjusted R Squared = .683)

Table 12: ANOVA results from the main study

The post hoc (Bonferroni) analysis (Table 13) suggests that there are differences in the responses, and that these are significant for the comparisons between the original version of the task and the versions with no planning time and reduced response time. The actual differences in the scores achieved are approximately one third and one quarter of a band respectively, with the original task proving easier in both cases.

Comparison                           Mean Difference   Sig.    95% CI Lower   95% CI Upper
Original vs No Planning               .32*              .001    .10            .54
Original vs No Support                .15               .378   -.06            .37
Original vs Reduced Response          .26*              .008    .05            .48
No Planning vs No Support            -.17               .234   -.39            .05
No Planning vs Reduced Response      -.06              1.000   -.27            .16
No Support vs Reduced Response        .11              1.000   -.10            .33

Based on observed means. * The mean difference is significant at the .05 level.

Table 13: Multiple post hoc analysis (Bonferroni)
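A sketch of this kind of two-way ANOVA with Bonferroni-corrected pairwise comparisons follows, on synthetic long-format data. The study used SPSS; here we use the Python statsmodels/scipy stack instead (an assumption), and the task effect sizes are set merely to echo the reported differences.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests
from scipy import stats

# Sketch of the task-by-ability ANOVA (Table 12) and Bonferroni-corrected
# pairwise task comparisons (Table 13) on synthetic data; the seed, variances
# and effect sizes are invented for illustration.
rng = np.random.default_rng(12)
tasks = ["Original", "No Planning", "No Support", "Reduced Response"]
effect = {"Original": 0.0, "No Planning": -0.32,
          "No Support": -0.15, "Reduced Response": -0.26}

rows = []
for cand in range(74):
    base = rng.normal(6.0, 0.8)
    ability = "High" if base >= 6.5 else ("Borderline" if base >= 6.0 else "Low")
    for task in tasks:
        rows.append({"candidate": cand, "task": task, "ability": ability,
                     "score": base + effect[task] + rng.normal(0, 0.4)})
df = pd.DataFrame(rows)

# two-way ANOVA: task, ability and their interaction
model = smf.ols("score ~ C(task) * C(ability)", data=df).fit()
print(anova_lm(model, typ=2))

# pairwise paired t-tests between tasks, with Bonferroni-adjusted p-values
wide = df.pivot(index="candidate", columns="task", values="score")
pairs = [(a, b) for i, a in enumerate(tasks) for b in tasks[i + 1:]]
pvals = [stats.ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]
for (a, b), p in zip(pairs, multipletests(pvals, method="bonferroni")[1]):
    print(f"{a} vs {b}: adjusted p = {p:.3f}")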

Having completed the main analyses, a set of charts was then generated. These consisted of a set of clustered boxplots and a line diagram, both of which were based on averaged scores for each task but with ability group also taken into account.


In the first of these charts (Figure 4), we can see that there is relatively little difference in the range of mean scores achieved by each group for the four tasks. While there is a clear difference between the three ability groups in terms of the mean scores achieved for the different tasks, there is also an apparent difference in the pattern of scores across the four tasks among the High ability (‘pass’) group, the Borderline group and the Low ability (‘fail’) group.

Figure 4: Boxplots comparing task mean score by ability group

In the final chart (Figure 5 – see following page) we can now see that the pattern of scoring is relatively similar for the Low and Borderline groups but quite different for the High scoring group. Taken with the significant results found in the ANOVA reported above, this suggests that manipulating tasks may result in more complex effects on difficulty than initially thought. The standard version of the task appears to result in optimum performance for all groups; by contrast, the no-planning version appears to result in systematically lower scores across the three ability groups. The lack of support (or scaffolding) appears to have a greater negative impact on test scores achieved by the High and Borderline groups while at the same time having only a very slight (and certainly non-significant) impact on the Low group who may be at a level of language ability where any changes have little impact on performance. Finally, the reduction in response time appears to have had little impact on the performances of the High and Borderline groups, though it clearly has had a different impact on the Low group, with their mean score at its lowest point.


[Figure 5, a line diagram, plots estimated marginal mean scores (approximately 4.5 to 7.0) for the Original, No Planning, No Support and Reduced Response task versions, with separate lines for the Low, Borderline and High ability groups.]

Figure 5: Line diagram comparing task mean score by ability group

5.3 Questionnaire data analysis (from the perspective of the task)

For reasons of clarity of analysis and presentation, we present the results from the three parts of the questionnaires separately. In the first part of the questionnaire, all participants were asked to respond to items related to how they dealt with their initial response to each task version. The results are shown in Table 15 below. These results are based on a series of univariate ANOVAs carried out on the data after the questionnaires had been shown to be working as predicted through factor analysis.

The factor analysis of the data was carried out to find evidence that the questionnaires were producing consistent results. Since the three parts of the instrument had been designed to elicit information on specific aspects of the candidates’ behaviour, it was expected that a factor analysis of the responses would identify background factors that matched the intended design. The results of the analysis of Part 1 indicated a very clear two-factor solution, with the first four items loading on Factor 1 (which we suggest indicates a more general background knowledge of speaking test response), while the latter four items load on a second factor (which appears to reflect more task-specific knowledge).
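The procedure behind Table 14 (principal component extraction with varimax rotation) was run in SPSS; as a sketch of the same analysis, the following Python snippet uses the third-party factor_analyzer package (an assumption) on synthetic responses built to mimic the two-factor design. All values are invented.

import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer   # third-party package, assumed

# Sketch of the analysis behind Table 14: principal component extraction
# with varimax rotation, on synthetic questionnaire responses where
# items 1-4 share one latent trait and items 5-8 another.
rng = np.random.default_rng(14)
n = 296                                  # 74 candidates x 4 questionnaires
goal = rng.normal(size=n)                # latent trait behind items 1-4
ideas = rng.normal(size=n)               # latent trait behind items 5-8
data = {f"item{i}": goal + rng.normal(0, 0.7, n) for i in range(1, 5)}
data.update({f"item{i}": ideas + rng.normal(0, 0.7, n) for i in range(5, 9)})
df = pd.DataFrame(data)

fa = FactorAnalyzer(n_factors=2, rotation="varimax", method="principal")
fa.fit(df)
print(pd.DataFrame(fa.loadings_, index=df.columns,
                   columns=["Factor 1", "Factor 2"]).round(2))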


                                                                           Factor 1   Factor 2
Goal setting
1. I read the task very carefully to understand what was required.          .104       .702
2. I thought of HOW to deliver my speech in order to respond well
   to the topic.                                                             .114       .748
3. I thought of HOW to satisfy the audiences and examiners.                  .273       .643
4. I understood the instructions for this speaking test completely.         .182       .657

Generating Ideas
5. I had ENOUGH ideas to speak about this topic.                             .750       .236
6. I felt it was easy to produce enough ideas for the speech from memory.   .813       .185
7. I know A LOT about this type of speech, ie I know how to make a
   speech on this type of topic.                                             .823       .180
8. I know A LOT about other types of speaking test, eg interview,
   discussion.                                                               .745       .126

Extraction method: Principal Component Analysis. Rotation method: Varimax with
Kaiser Normalisation. Rotation converged in 3 iterations.

Table 14: Factor analysis of Questionnaire Part 1 (before speaking)

When this is taken into account, the analysis of the responses to individual items should reflect this two-factor solution.

In the first section, which explores candidates’ awareness of how they might go about responding to the task when in the initial stages of reading and considering their response, we can see that there are a number of significant differences between the tasks and the ability groups (though as with all responses to the questionnaire items there is no interaction between the two variables).

Item                                                           Ave.   Task Type                       Ability Group
1. I read the task very carefully to understand what           4.2    Less likely for No Planning     Less likely for BORDERLINE group
   was required.
2. I thought of HOW to deliver my speech in order to           3.7    Less likely for No Planning     No meaningful differences
   respond well to the topic.
3. I thought of HOW to satisfy the audiences and examiners.    3.3    No meaningful differences       No meaningful differences
4. I understood the instructions for this speaking test        4.0    Less likely for No Planning     More likely for HIGH group
   completely.
5. I had ENOUGH ideas to speak about this topic.               3.1    More likely in Original; least  Less likely for LOW group
                                                                      for No Planning & No Support
6. I felt it was easy to produce enough ideas for the          3.1    More likely in Original; least  Less likely for BORDERLINE group
   speech from memory.                                                for No Planning & No Support
7. I know A LOT about this type of speech, ie I know how       2.9    No meaningful differences       No meaningful differences
   to make a speech on this type of topic.
8. I know A LOT about other types of speaking test,            3.0    No meaningful differences       No meaningful differences
   eg interview, discussion.

Note: the Likert scale on which the average (column 2) is calculated runs from 1 to 5.
Shading in the original table marked significant differences.

Table 15: Univariate ANOVA results for Questionnaire Part 1 (before speaking)


The mean response levels (in the Ave. column) indicate that the candidates were likely to read the instructions carefully, and that they tended to have no problem understanding the task. However, they were less likely to consider the audience (Item 3) or to give much thought to the generation of ideas prior to speaking (Items 5-8).

It is interesting to note that candidates responding to the No Planning version of the tasks were less likely either to read the rubric as carefully as for the other versions or to think about how to respond in the same way as they might do for the other versions. However, it should be noted that the low mean response to the first item appears to have been heavily influenced by the Borderline group. Review of the data indicates that no errors in data entry could have led to this and, in the absence of post-test interview data, the reason for the very low response cannot easily be explained.

We can also see that the No Planning task appears to have resulted in candidates failing to fully understand the instructions (not surprising in light of the earlier responses which indicated they may not have read them carefully), though this was not a problem for the High ability group.

In the second part of the section, which focused on generating ideas in the pre-planning stage, candidates indicated that the manipulation of the task appears to have had a significant impact on their ability to produce ideas from their background knowledge. Where the task has been altered in terms of planning time or support offered, the candidates report significantly more difficulty in generating ideas – this is most significant for the Low and Borderline groups. For Items 5 and 6 the pattern of response for the Low group was similar across the four tasks, while both the High and Borderline groups indicated a high likelihood for both the Original task and the Reduced Response version and a low likelihood for the other two versions. Perhaps not surprisingly, in the final pair of items, which link the generating of ideas to what is essentially background knowledge, there are no meaningful differences between the tasks or between the three ability levels.

As with the factor analysis of the first section of the questionnaire, the analysis of the second section suggests that this part of the instrument is also working well (Table 16); note that the No Planning task was not included in this analysis, as those candidates had not been given any planning time and so were not asked to complete this part of the questionnaire. The single exception seems to be Item 7, which loads on two factors, so it has been removed from the analysis that follows. The five-factor solution broadly reflects the original design.


                                                                 Factor
                                                                 1      2      3      4      5
Time Element
1. I thought of MOST of my ideas for the speech BEFORE          -.071  -.070   .222   .084   .635
   planning an outline.
2. During the period allowed for planning, I was conscious       .114   .171  -.067  -.059   .805
   of the time.

Task Specific Planning
3. I followed the 3 short prompts provided in the task when     -.035   .771   .167  -.061  -.107
   I was planning.
4. The information in the short prompts provided was            -.118   .731  -.001   .042   .156
   necessary for me to complete the task.
5. I wrote down the points I wanted to make based on the        -.111   .602   .050   .443   .118
   3 short prompts provided in the task.

Linguistic Planning
6. I wrote down the words and expressions I needed to fulfil    -.110   .002   .152   .730   .050
   the task.
7. I wrote down the structures I need to fulfil the task.        .439   .000   .162   .512   .310

Language used when Planning
8. I made notes only in ENGLISH.                                -.758   .114  -.078   .209   .022
9. I took notes only in my own language.                         .785  -.056   .084   .157  -.001
10. I took notes in both ENGLISH and own language.               .862  -.092  -.016  -.039   .044

Organisation
11. I planned an outline on paper BEFORE starting to speak.     -.057  -.082   .014  -.652   .045
12. I planned an outline in my mind BEFORE starting to speak.   -.232  -.004  -.431   .410  -.200

Generating & Practicing
13. Ideas occurring to me at the beginning tended to be         -.111   .265   .726   .059   .016
    COMPLETE.
14. I was able to put my ideas or content in good order.         .040   .257   .661   .243  -.066
15. I practiced the speech in my mind WHILE I was planning.      .192  -.396   .584  -.015   .246
16. After finishing my planning, I practiced what I was going    .369  -.309   .543   .000   .241
    to say in my mind until it was time to start.

Extraction method: Principal Component Analysis. Rotation method: Varimax with
Kaiser Normalisation. Rotation converged in 7 iterations.

Table 16: Factor analysis of Questionnaire Part 2 (during planning – excludes the No Planning version)

The mean responses in Table 17 show an interesting pattern. The high levels for Items 3, 4 and 5 indicate that candidates tended to rely to a great extent on the bullet-pointed prompts, and the high mean for Item 8 (combined with the low means for Items 9 and 10) indicates that planning tends to be done in the target language (though the Low ability group were more likely to use the L1). The low means for Items 11 and 12 suggest that little concern is given to planning an outline before speaking. This appears to contradict Item 5, where candidates say they wrote down the points they wanted to make before speaking; it is possible that they interpreted ‘planning an outline’ as making a full plan or script of what to say, though not necessarily on paper. This needs to be clarified before any future administration of the instrument.

In the first part of the section (labelled ‘Time Element’) there is little difference across ability levels, though there appears to be a significant effect of the Reduced Response version of the task for the item referring to awareness of time. Since there are only two significant task effects across all items related to planning, we can deduce that manipulating tasks in the ways adopted here may have a limited impact on the planning phase. The findings can be summarised as follows:

- With reduced response time, candidates may feel they are under less pressure and so are less conscious of time when responding.
- Removing support from a task appears to make it more difficult for students to plan their response.
- High level candidates are more likely to rely on the supporting points in a task rubric.
- Low level candidates are more likely to use either their own language only, or a combination of the target language and their own language, when planning.
- Low level students are more likely to practise what they are about to say both during and after planning.

Item                                                           Ave.   Task Type                     Ability Level
1. I thought of MOST of my ideas for the speech BEFORE         3.64   No meaningful differences     No meaningful differences
   planning an outline.
2. During the period allowed for planning, I was conscious     3.31   Least likely for Reduced      No meaningful differences
   of the time.                                                       Response
3. I followed the 3 short prompts provided in the task when    3.99   No meaningful differences     No meaningful differences
   I was planning.
4. The information in the short prompts provided was           3.78   No meaningful differences     HIGH group more likely to
   necessary for me to complete the task.                                                           respond positively
5. I wrote down the points I wanted to make based on the       3.84   No meaningful differences     No meaningful differences
   3 short prompts provided in the task.
6. I wrote down the words and expressions I needed to          3.35   No meaningful differences     No meaningful differences
   fulfil the task.
7. I wrote down the structures I need to fulfil the task.      2.40   No meaningful differences     LOW group more likely to
                                                                                                    respond positively
8. I took notes only in ENGLISH.                               4.05   No meaningful differences     No meaningful differences
9. I took notes only in my own language.                       1.90   No meaningful differences     LOW group more likely to respond
                                                                                                    positively (but low means)
10. I took notes in both ENGLISH and own language.             2.14   No meaningful differences     Lower level more likely to
                                                                                                    respond positively
11. I planned an outline on paper BEFORE starting to speak.    1.25   No meaningful differences     No meaningful differences
12. I planned an outline in my mind BEFORE starting to         1.38   No meaningful differences     No meaningful differences
    speak.
13. Ideas occurring to me at the beginning tended to be        3.12   No meaningful differences     No meaningful differences
    COMPLETE.
14. I was able to put my ideas or content in good order.       2.88   Less likely for No Support    No meaningful differences
15. I practiced the speech in my mind WHILE I was planning.    2.89   No meaningful differences     LOW group more likely to respond
                                                                                                    positively (but low means)
16. After finishing my planning, I practiced what I was        2.72   No meaningful differences     HIGH group less likely to
    going to say in my mind until it was time to start.                                             respond positively

Note: Items 3, 4 and 5 were not included in the No Support version (as they refer to supporting points).
Shading in the original table marked significant differences.

Table 17: Univariate ANOVA results for Questionnaire Part 2 (during planning)


In the final section of the questionnaire, candidates were asked to respond to items related to what they did as they were speaking. The factor analysis reflected the original design, and so the section was considered to have worked as predicted.

                                                                 Factor
                                                                 1      2      3      4
Idea Development (ability)
1. I felt it was easy to put ideas in good order.                .819   .083   .079  -.028
2. I was able to express my ideas using appropriate words.       .705   .203   .134   .015
3. I was able to express my ideas using correct grammar.         .695   .194   .133   .088
6. I was able to put sentences in logical order.                 .736   .226   .086   .040
7. I was able to CONNECT my ideas smoothly in the whole          .602   .264   .073  -.136
   speech.
14. I felt it was easy to complete the task.                     .748   .125   .158   .094

Idea Development (temporal)
4. I thought of MOST of my ideas for the speech WHILE I was     -.048   .205   .330   .714
   actually speaking.
5. Some ideas had to be omitted while I was speaking.            .103  -.132  -.326   .759

Time Awareness
8. I was conscious of the time WHILE I was making this           .194   .009   .819  -.025
   speech.
9. I tried NOT to speak more than the required length of         .239   .278   .629   .012
   time in the instructions.

Monitoring
10. I was listening and checking the correctness of the          .251   .754   .030  -.017
    contents and their order WHILE I was making this speech.
11. I was listening and checking whether the contents and        .195   .786   .049  -.020
    their order fit the topic WHILE I was making this speech.
12. I was listening and checking the correctness of sentences    .215   .783   .090   .016
    WHILE I was making this speech.
13. I was listening and checking whether the words fit the       .170   .744   .221   .107
    topic WHILE I was making this speech.

Extraction method: Principal Component Analysis. Rotation method: Varimax with
Kaiser Normalisation. Rotation converged in 5 iterations.

Table 18: Factor analysis of Questionnaire Part 3 (during speaking)

The most interesting aspect of the mean responses in this section is the lack of variation across the items. In the first part, a ‘no view’ perspective predominates, suggesting that the candidates were not overly challenged by the tasks. In support of the findings for the previous section, there appears to have been a tendency for candidates to plan while speaking (Item 4), and a slight tendency for them to monitor the contents and language of their responses (though the latter was most evident in the High ability group).

In the first part of the section, which related to ease and ability to develop ideas, the suggestion appears to be that the candidates found the Original version of the task the easiest to respond to (though for Item 1 this was shared with the Reduced Response version). Not surprisingly, the High level candidates indicated that they found it easy to “express [their] ideas using correct grammar,” while the Borderline candidates seemed to struggle with cohesion and coherence.

Low level candidates were more likely to omit ideas as they were speaking, though this was reported as being less likely with the No Support task version, possibly because the candidates associated their ‘ideas’ primarily with the three bullet-pointed supporting points; when these were removed, they struggled.


Item                                                           Ave.   Task Type                      Ability Level
1. I felt it was easy to put ideas in good order.              2.9    Easier for Original and        No meaningful differences
                                                                      Reduced Response
2. I was able to express my ideas using appropriate words.     3.0    No meaningful differences      No meaningful differences
3. I was able to express my ideas using correct grammar.       2.8    No meaningful differences      More likely with HIGH group
6. I was able to put sentences in logical order.               3.0    No meaningful differences      Less likely with BORDERLINE group
7. I was able to CONNECT my ideas smoothly in the whole        2.8    More likely with Original,     Less likely with BORDERLINE group
   speech.                                                            especially compared to
                                                                      No Planning
14. I felt it was easy to complete the task.                   2.9    No meaningful differences      No meaningful differences
4. I thought of MOST of my ideas for the speech WHILE I        3.4    No meaningful differences      No meaningful differences
   was actually speaking.
5. Some ideas had to be omitted while I was speaking.          3.0    Less likely with No Support    Most likely for LOW group
                                                                      version
8. I was conscious of the time WHILE I was making this         3.3    No meaningful differences      No meaningful differences
   speech.
9. I tried NOT to speak more than the required length of       3.4    No meaningful differences      No meaningful differences
   time in the instructions.
10. I was listening and checking the correctness of the        3.3    No meaningful differences      No meaningful differences
    contents and their order WHILE I was making this speech.
11. I was listening and checking whether the contents and      3.3    No meaningful differences      Less likely with BORDERLINE group
    their order fit the topic WHILE I was making this speech.
12. I was listening and checking the correctness of            3.3    No meaningful differences      More likely with HIGH group
    sentences WHILE I was making this speech.
13. I was listening and checking whether the words fit the     3.3    No meaningful differences      More likely with HIGH group
    topic WHILE I was making this speech.

Shading in the original table marked significant differences.

Table 19: Univariate ANOVA results for Questionnaire Part 3 (during speaking)

Time did not seem to be particularly important to candidates, and though there was a slight tendency for them to be conscious of time, this does not appear to have varied across ability level or task type attempted. Similarly, though candidates tended to monitor their responses for content, organisation and language, this was not a very strong trend, with the exception of the High ability group who were significantly more likely to monitor their language (but not content or organisation) than the other groups.

6 CONCLUSIONS

In this research project we set out to establish whether the difficulty of a task could be varied by systematic manipulation along a number of dimensions. In doing this, we were interested in whether the scores achieved by a group of test candidates would vary, along with the cognitive processing associated with performance on the various tasks. It was hoped that this would provide the basis for a framework that could be used to manipulate tasks in order to alter their difficulty systematically.


The project called for a set of four equivalent tasks to be identified so that all participants would respond to an unaltered version as well as three versions in which systematic variations had been made (removal of planning time; removal of support; and reduction of expected response time). In order to identify four equivalent tasks, a complex procedure was designed, in which a set of nine tasks was analysed both quantitatively (based on the performances of a group of 54 participants) and qualitatively (using the responses of these same participants to a series of short questionnaires).

At this stage, a set of four tasks was identified and manipulated as planned. A group of 74 participants then recorded their responses to the tasks, which were presented to different people in different orders. All respondents also completed questionnaires (one per task, so a total of four per participant) based on Weir’s (2005) socio-cognitive framework for the validation of speaking tests. The resulting score and questionnaire datasets were then analysed.

Results of the analysis of the score data suggest that there are significant differences to be found in the responses of three ability groups to the four tasks, indicating that task difficulty may well be affected differently for test candidates of different ability. In other words, simply altering a task along a particular dimension may not result in a version that is equally more or less difficult for all test candidates. Instead, there is likely to be a variety of effects as a result of the alteration. For instance, here, mid-level and higher-level participants were not significantly affected by the reduction in response time, while this same alteration to the task resulted in the most serious negative effect for the lower level participants.

The analysis of the questionnaire data further complicates the picture. We can briefly summarise the findings as follows:

- The most significant effects of task manipulation on candidates appear to be at the pre-speaking phase, particularly where no planning time is offered. However, these effects appear to differ depending on the ability level of the candidate.

- The effects on planning are far less obvious: the candidates report essentially the same approach to planning regardless of the task. While there are far more significant differences in the ways in which candidates of different ability levels approach task planning, there appears to be a clear tendency for them not to outline their response before speaking; so even though they take the time to plan, they seem to do much of their planning ‘on-line’, ie as they are speaking (though lower level candidates report practising what they plan to say before speaking).

- When speaking, the candidates seemed to feel that the original version of the task offered them the greatest opportunity to perform at their best, though not surprisingly this depended on their ability level (lower levels did not find any particular version easier than the others). There was a significant difference in approach to monitoring their own output, with the higher level students more likely to monitor language, though not content or organisation.

6.1 Implications

We believe the study has implications for teachers who prepare students for examinations containing speaking tasks which involve individual long-turn responses, for the test developers who design these tasks, for test validators, and for first and second language acquisition researchers.


6.1.1 Teachers

The differences in approach to task performance highlighted here suggest that teachers might deal more explicitly with pre-speaking strategies, such as attending closely to any bulleted prompts and using the target language for any planning. The lack of impact of task manipulation on approach to planning suggests that the students (certainly those involved in this study) had already formed strategies for task performance. However, to improve their understanding of a task, students should be encouraged to read task rubrics more carefully, focus on the language used in the instructions and, perhaps, ask for assistance where things are not clear.

6.1.2 Test developers

The notion of task equivalence is not as straightforward as it seems. The nine tasks initially used here were presumed by their developers to be equivalent; the methodology used to establish equivalence demonstrated how difficult it can be to create truly equivalent versions of a task. The main study also demonstrates how task difficulty can be affected by decisions to include or exclude support (eg in the form of bulleted prompts) or by altering the planning time afforded to candidates. This suggests that any substantive changes to these conditions of task performance need to be empirically tested before they are considered in any test revision (or as alternative choices within a test). This is particularly relevant for the planning variable: the scores achieved were significantly lower under the ‘no planning’ condition than for the original version of the task (which allows one minute of planning time).

The situation regarding the amount of response time is less conclusive. Apart from a reduced awareness of time in the planning phase (possibly due to the perception that less speaking time meant there was less to worry about), there appears to have been no difference in the approach taken to the task. However, the scores achieved were significantly lower for this version than for the original version of the task (candidates spoke for two minutes in the original version, as opposed to one minute in the reduced response version).

The rubric appears to be especially important in this type of task. It is clear that a number of candidates (typically at the lower level) had some difficulty understanding what to do. While this is possibly unavoidable in a test which is designed to be used across a broad range of abilities, it is clearly very important for the test developer to ensure measures are in place to avoid poor reading or listening skills affecting student spoken performance. In ‘live’ tests this is not so difficult (examiners can be trained to deal systematically with comprehension problems), though it is a potentially serious limitation of any computer-delivered test of this sort.

6.1.3 Test validators

In the same way that test developers need to focus on task equivalence, test validators should also consider it when establishing evidence of the context validity (see Weir 2005) of their tests. Consideration should be given to using the methodology developed here to establish true equivalence in test tasks, as well as to investigating how tasks are affected when variations are suggested by stakeholders.

6.1.4 Researchers

SLA researchers have argued since the mid-1980s that performing language elicitation tasks in a learning environment supports learning. While O’Sullivan (2000a: 298) argues that ‘[the] notion of an interlocutor effect on performance does not appear to have been sufficiently addressed in the [SLA] literature’, he also argues that the ‘conditions under which tasks are performed should be more rigorously described’ (O’Sullivan, 2000a: 297). While there has been a recognition in the task-based learning literature that task performance conditions can affect performance (Larsen-Freeman & Long, 1991: 30-33), there is little evidence that this awareness has found its way into SLA or Applied Linguistics research.


The evidence presented in this project suggests that researchers need to understand more clearly the implications of the decisions they make when designing tasks for use as elicitation devices in their studies. Research studies should contain more detail of task design and equivalence, and should show an awareness on the part of the researcher of the rationale for task selection and manipulation. In other words, tasks for both testing and research purposes should be specified in an equally systematic and comprehensive fashion, using a model of validation such as that of Weir (2005), to ensure that the results obtained are credible in terms of the validity evidence available.


REFERENCES

Abdul Raof, AH, 2002, ‘The production of a performance rating scale: an alternative methodology’, unpublished PhD dissertation, The University of Reading, UK

Berry, V, 1994, ‘Personality characteristics and the assessment of spoken language in an academic context’, paper presented at the 16th Language Testing Research Colloquium, Washington, DC

Berry, V, 1997, ‘Gender and personality as factors of interlocutor variability in oral performance tests’, paper presented at the 19th Language Testing Research Colloquium, Orlando, Florida

Berry, V, 2004, ‘A study of the interaction between individual personality differences and oral performance test facets’, unpublished PhD dissertation, King’s College, The University of London

Bonk, WJ and Ockey, GJ, 2003, ‘A many-facet Rasch analysis of the second language group oral discussion task’, Language Testing, vol 20, no 1, pp 89-110

Brown, A, 1995, ‘The effect of rater variables in the development of an occupation specific language performance test’, Language Testing, vol 12, no 1, pp 1-15

Brown, A, 1998, ‘Interviewer style and candidate performance in the IELTS oral interview’, paper presented at the 20th Language Testing Research Colloquium, Monterey, CA

Brown, A, and Lumley, T, 1997, ‘Interviewer variability in specific-purpose language performance tests’ in Current Developments and Alternatives in Language Assessment, eds A Huhta, V Kohonen, L Kurki-Suonio and S Luoma, University of Jyväskylä and University of Tampere, Jyväskylä, pp137-150

Brown, G, and Yule, G, 1983, Teaching the spoken language, Cambridge University Press, Cambridge

Buckingham, A, 1997, ‘Oral language testing: do the age, status and gender of the interlocutor make a difference?’, unpublished MA dissertation, University of Reading

Butler, FA, Eignor, D, Jones, S, McNamara, T, and Suomi, BK, 2000, TOEFL (2000) Speaking Framework: A Working Paper, TOEFL Monograph Series 20, Educational Testing Service, Princeton, NJ

Bygate, M, 1987, Speaking, Oxford University Press, Oxford

Bygate, M, 1999, ‘Quality of language and purpose of task: patterns of learners’ language on two oral communication tasks’, Language Teaching Research, vol 3, no 3, pp 185-214

Chalhoub-Deville, M, 1995, ‘Deriving oral assessment scales across different tests and rater groups’, Language Testing, vol 12, pp16-33

Clark, JLD and Swinton, SS, 1979, ‘An exploration of speaking proficiency measures in the TOEFL context’, TOEFL Research Report, Educational Testing Service, Princeton, NJ

Crookes, G, 1989, ‘Planning and interlanguage variation’, Studies in Second Language Acquisition, vol 11, pp 367-383

Ellis, R, 1987, ‘Interlanguage variability in narrative discourse: style shifting in the use of the past tense’, Studies in Second Language Acquisition, vol 9, pp 1-20

Foster, P and Skehan, P, 1996, ‘The influence of planning and task type on second language performance’, Studies in Second Language Acquisition, vol 18, pp 299-323


Foster, P and Skehan, P, 1999, ‘The influence of source of planning and focus of planning on task-based performance’, Language Teaching Research, vol 3, no 3, pp 215-247

Fulcher, G, 1996, ‘Testing tasks: issues in task design and the group oral’, Language Testing, vol 13, no 1, pp 23-51

Fulcher, G, 2003, Testing second language speaking, Longman/Pearson, London

Halleck, G, 1996, ‘Interrater reliability of the OPI: using academic trainee raters’, Foreign Language Annals, vol 29, no 2, pp 223-238

Hasselgren, A, 1997, ‘Oral test subskill scores: what they tell us about raters and pupils’, in Current Developments and Alternatives in Language Assessment, eds A Huhta, V Kohonen, L Kurki-Suonio and S Luoma, University of Jyväskylä and University of Tampere, Jyväskylä, pp 241-256

Henning, G, 1983, ‘Oral proficiency testing: comparative validities of interview, imitation, and completion methods’, Language Learning, vol 33, no 3, pp 315-332

Hughes, A, 1989, Testing for language teachers, Cambridge University Press, Cambridge

Hughes, A, 2003, Testing for language teachers: Second Edition, Cambridge University Press, Cambridge

Iwashita, N, 1997, ‘The validity of the paired interview format in oral performance testing’, paper presented at the 19th Language Testing Research Colloquium, Orlando, Florida

Kormos, J, 1999, ‘Simulation conversations in oral proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams’, Language Testing, vol 16, no 2, pp 163-188

Kunnan, AJ, 1995, Test-taker characteristics and test performance: a structural modeling approach, UCLES/Cambridge University Press, Cambridge

Larsen-Freeman, D, and Long, MH, 1991, An introduction to second language acquisition research, Longman, London

Lazaraton, A, 1996a, ‘Interlocutor support in oral proficiency interviews: the case of CASE’, Language Testing, vol 13, no 2, pp 151-172

Lazaraton, A, 1996b, ‘A qualitative approach to monitoring examiner conduct in the Cambridge Assessment of Spoken English (CASE)’, in Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem, eds M Milanovic and N Saville, UCLES/Cambridge University Press, Cambridge, pp 18-33

Linacre, JM, 2003, FACETS 3.45 computer program, MESA Press, Chicago, IL

Lumley, T, 1998, ‘Perceptions of language-trained raters and occupational experts in a test of occupational English language proficiency’, English for Specific Purposes, vol 17, no 4, pp 347-367

Lumley, T and O’Sullivan, B, 2000, ‘The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking’, paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University

Lumley, T and O’Sullivan, B, 2001, ‘The effect of test-taker sex, audience and topic on task performance in tape-mediated assessment of speaking’, Melbourne Papers in Language Testing, vol 9, no 1, pp 34-55


Lumley, T and O’Sullivan, B, 2005, ‘The effect of test-taker gender, audience and topic on task performance in tape-mediated assessment of speaking’, Language Testing, vol 23, no 4, pp 415-437

Luoma, S, 2004, Assessing Speaking, Cambridge University Press, Cambridge

McNamara, T, 1997, ‘“Interaction” in second language performance assessment: whose performance?’, Applied Linguistics, vol 18, pp 446-466

Mehnert, U, 1998, ‘The effects of different lengths of time for planning on second language performance’, Studies in Second Language Acquisition, vol 20, pp 83-108

Norris, J, Brown, JD, Hudson, T and Yoshioka, J, 1998, Designing second language performance assessment, Technical Report #18, University of Hawai’i Press, Hawai’i

O’Loughlin, K, 1995, ‘Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test’, Language Testing, vol 12, no 2, pp 217-237

O’Sullivan, B, 1995, ‘Oral language testing: does the age of the interlocutor make a difference?’ unpublished MA dissertation, University of Reading

O’Sullivan, B, 2000a, ‘Towards a model of performance in oral language testing’, unpublished PhD dissertation, University of Reading

O’Sullivan, B, 2000b, ‘Exploring gender and oral proficiency interview performance’, System, vol 28, no 3, pp 373-386

O’Sullivan, B, 2002, ‘Learner acquaintanceship and oral proficiency test pair-task performance’, Language Testing, vol 19, no 3, pp 277-295

O’Sullivan, B, and Weir, C, 2002, Research issues in testing spoken language, mimeo: internal research report commissioned by Cambridge ESOL

O’Sullivan, B, Weir, C and ffrench, A, 2001, ‘Task difficulty in testing spoken language: a socio-cognitive perspective’, paper presented at the 23rd Language Testing Research Colloquium, St Louis, Miss

O’Sullivan, B, Weir, CJ and Saville, N, 2002, ‘Using observation checklists to validate speaking-test tasks’, Language Testing, vol 19, no 1, pp 33-56

Ortega, L, 1999, ‘Planning and focus on form in L2 oral performance’, Studies in Second Language Acquisition, vol 20, pp 109-148

Porter, D, 1991, ‘Affective factors in language testing’ in Language Testing in the 1990s, eds JC Alderson and B North, Modern English Publications in association with British Council, Macmillan, London, pp 32-40

Porter, D and Shen SH, 1991, ‘Gender, status and style in the interview’, The Dolphin 21, Aarhus University Press, pp 117-128

Purpura, J, 1998, ‘Investigating the effects of strategy use and second language test performance with high- and low-ability test-takers: a structural equation modeling approach’, Language Testing, vol 15, no 3, pp 333-379

Robinson, P, 1995, ‘Task complexity and second language narrative discourse’, Language Learning, vol 45, no 1, pp 99-140


Ross, S, 1992, ‘Accommodative questions in oral proficiency interviews’, Language Testing, vol 9, pp 173-186

Ross, S and Berwick, R, 1992, ‘The discourse of accommodation in oral proficiency interviews’, Studies in Second Language Acquisition, vol 14, pp 159-176

Shohamy, E, 1983, ‘The stability of oral language proficiency assessment on the oral interview testing procedure’, Language Learning, vol 33, pp 527-540

Shohamy, E, 1994, ‘The validity of direct versus semi-direct oral tests’, Language Testing, vol 11, pp 99-123

Shohamy, E, Reves, T and Bejarano, Y, 1986, ‘Introducing a new comprehensive test of oral proficiency’, ELT Journal, vol 40, no 3, pp 212-220

Skehan, P, 1996, ‘A framework for the implementation of task based instruction’, Applied Linguistics, vol 17, pp 38-62

Skehan, P, 1998, A cognitive approach to language learning, Oxford University Press, Oxford

Skehan, P and Foster, P, 1997, ‘The influence of planning and post-task activities on accuracy and complexity in task-based learning’, Language Teaching Research, vol 1, no 3, pp 185-211

Skehan, P and Foster, P, 1999, ‘The influence of task structure and processing conditions on narrative retellings’, Language Learning, vol 49, no 1, pp 93-120

Skehan, P and Foster, P, 2001, ‘Cognition and tasks’ in Cognition and second language instruction, ed P Robinson, Cambridge University Press, Cambridge, pp 183-205

Stansfield, CW and Kenyon, DM, 1992, ‘Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview’, System, vol 20, pp 347-364

Thompson, I, 1995, ‘A study of interrater reliability of the ACTFL oral proficiency interview in five European languages: data from ESL, French, German, Russian, and Spanish’, Foreign Language Annals, vol 28, no 3, pp 407-422

Underhill, N, 1987, Testing spoken language: a handbook of oral testing techniques, Cambridge University Press, Cambridge

Upshur, JA and Turner, C, 1999, ‘Systematic effects in the rating of second-language speaking ability: test method and learner discourse’, Language Testing, vol 16, no 1, pp 82-111

Weir, CJ, 1990, Communicative language testing, Prentice Hall International

Weir, CJ, 1993, Understanding and developing language tests, Prentice Hall, London

Weir, CJ, 2005, Language testing and validation: an evidence-based approach, Palgrave, Oxford

Wigglesworth, G, 1997, ‘An investigation of planning time and proficiency level on oral test discourse’, Language Testing, vol 14, no 1, pp 85-106

Wigglesworth, G, and O’Loughlin, K, 1993, ‘An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English’, Melbourne Papers in Language Testing, vol 2, no 1, pp 56-67


Williams, J, 1992, ‘Planning, discourse marking, and the comprehensibility of international teaching assistants’, TESOL Quarterly, vol 26, pp 693-711

Young, R, 1995, ‘Conversational styles in language proficiency interviews’, Language Learning, vol 45, no 1, pp 3-42

Young, R, and Milanovic, M, 1992, ‘Discourse variation in oral proficiency interviews’, Studies in Second Language Acquisition, vol 14, pp 403-424


APPENDIX 1: TASK DIFFICULTY CHECKLIST (BASED ON SKEHAN, 1998)

For each moderator variable below, the condition gloss explains the rating scale (the more difficult the condition, the higher the number). For each variable, circle one difficulty rating from 1 to 6.

CODE COMPLEXITY

Range of linguistic input. Vocabulary and structure as appropriate to ALTE levels 1-5 (beginner to advanced). Difficulty: 1 2 3 4 5 6

Sources of input. Number and types of written and spoken input: 1 = one single written or spoken source; 5 = multiple written and spoken sources. Difficulty: 1 2 3 4 5 6

Amount of linguistic input to be processed. Quantity of input: 1 = sentence level (single question, prompts); 5 = long text (extended instructions and/or texts). Difficulty: 1 2 3 4 5 6

COGNITIVE COMPLEXITY

Availability of input. Extent to which information necessary for task completion is readily available to the candidate: 1 = all information provided; 5 = student attempts an open-ended task (student provides all information). Difficulty: 1 2 3 4 5 6

Familiarity of information. 1 = the information given and/or required is likely to be within the candidates’ experience; 5 = the information given and/or required is likely to be outside the candidates’ experience. Difficulty: 1 2 3 4 5 6

Organisation of information required. 1 = almost no organisation required (a simple answer to a question); 5 = extensive organisation required (a complex response). Difficulty: 1 2 3 4 5 6

Abstractness of information. 1 = concrete; 5 = abstract. Difficulty: 1 2 3 4 5 6

COMMUNICATIVE DEMAND

Time pressure. 1 = no constraints on the time available to complete the task (a candidate who does not complete the task in the time given is not penalised); 5 = serious constraints on the time available to complete the task (a candidate who does not complete the task in the time given is penalised). Difficulty: 1 2 3 4 5 6

Response level. 1 = more than sufficient time to plan or formulate a response; 5 = no planning time available. Difficulty: 1 2 3 4 5 6

Scale. Number of participants in a task, number of relationships involved: 1 = one person; 5 = five or more people. Difficulty: 1 2 3 4 5 6

Complexity of task outcome. 1 = simple, unequivocal outcome; 5 = complex, unpredictable outcome. Difficulty: 1 2 3 4 5 6

Referential complexity. 1 = reference to objects and activities which are visible; 5 = reference to external/displaced (not in the here and now) objects and events. Difficulty: 1 2 3 4 5 6

Stakes. 1 = a measure of attainment which is of value only to the candidate; 5 = a measure of attainment which has a high external value. Difficulty: 1 2 3 4 5 6

Degree of reciprocity required. 1 = no requirement on the candidate to initiate, continue or terminate interaction; 5 = task requires each candidate to participate fully in the interaction. Difficulty: 1 2 3 4 5 6

Structure. 1 = task is highly structured/scaffolded; 5 = task is totally unstructured/unscaffolded. Difficulty: 1 2 3 4 5 6

Opportunity for control. 1 = complete autonomy; 5 = no opportunity for control. Difficulty: 1 2 3 4 5 6
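As a concrete illustration of how ratings from this checklist might be recorded, the following Python sketch represents one rater's judgements and aggregates them into an overall difficulty index. The variable names mirror Appendix 1, but the particular values and the simple summation are illustrative assumptions only; the report does not prescribe a scoring or aggregation method.

    # Hypothetical record of one rater's judgements on the Appendix 1 checklist.
    # The circled value for each moderator variable is an integer from 1 to 6.
    ratings = {
        "range_of_linguistic_input": 3,
        "sources_of_input": 1,
        "amount_of_input_to_be_processed": 2,
        "availability_of_input": 1,
        "familiarity_of_information": 2,
        "organisation_of_information_required": 3,
        "abstractness_of_information": 2,
        "time_pressure": 4,
        "response_level": 3,
        "scale": 1,
        "complexity_of_task_outcome": 2,
        "referential_complexity": 3,
        "stakes": 5,
        "degree_of_reciprocity_required": 1,
        "structure": 2,
        "opportunity_for_control": 3,
    }

    assert all(1 <= r <= 6 for r in ratings.values())  # each rating is on a 1-6 scale
    # Simple summation is an illustrative assumption, not a method from the report.
    total = sum(ratings.values())
    print(f"Overall difficulty index: {total} (possible range "
          f"{len(ratings)}-{6 * len(ratings)})")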


APPENDIX 2: READABILITY STATISTICS FOR 9 TASKS

                              Task 1  Task 2  Task 3  Task 4  Task 5  Task 6  Task 7  Task 8  Task 9
Counts
  Words                         35      33      36      43      34      35      46      31      38
  Characters                   153     142     150     162     169     169     185     146     151
  Paragraphs                     1       1       1       1       1       1       1       1       1
  Sentences                      6       6       6       6       6       6       6       6       6
Averages
  Sentences per paragraph      6.0     6.0     6.0     6.0     6.0     6.0     6.0     6.0     6.0
  Words per sentence           5.8     5.5     6.0     7.1     5.6     5.8     7.6     5.1     6.3
  Characters per word          4.2     4.0     3.9     3.6     4.7     4.6     3.8     4.5     3.8
Readability
  Passive sentences            0%      0%      0%      0%      0%      0%      0%      0%      0%
  Flesch Reading Ease         70.3    80.7    85.5    91.3    59.2    75.2    85.0    65.0    84.6
  Flesch-Kincaid Grade Level   4.8     3.3     2.8     2.2     6.4     4.2     3.3     5.4     3.0
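The two readability indices above follow the standard formulas: Flesch Reading Ease = 206.835 - 1.015 x (words per sentence) - 84.6 x (syllables per word), and Flesch-Kincaid Grade Level = 0.39 x (words per sentence) + 11.8 x (syllables per word) - 15.59. As a rough illustration of how such figures are derived, the Python sketch below computes both indices for a prompt-like text; its naive vowel-group syllable counter is only an approximation, so its output will not exactly match the word-processor statistics in the table.

    import re

    def count_syllables(word):
        # Naive approximation: count runs of consecutive vowels (including y).
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        wps = len(words) / len(sentences)   # average words per sentence
        spw = syllables / len(words)        # average syllables per word
        flesch = 206.835 - 1.015 * wps - 84.6 * spw
        fk_grade = 0.39 * wps + 11.8 * spw - 15.59
        return flesch, fk_grade

    # Illustrative input only: a prompt in the style of Appendix 3.
    prompt = ("Describe a city you have visited which has impressed you. "
              "Say where it is situated. Say why you visited it. "
              "Say what you liked about it. "
              "Explain why you prefer it to other cities.")
    ease, grade = readability(prompt)
    print(f"Flesch Reading Ease: {ease:.1f}, Flesch-Kincaid Grade Level: {grade:.1f}")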

APPENDIX 3: THE ORIGINAL SET OF TASKS

You will have to talk about the topic for 2 minutes. You have 1 minute to think about what you are going to say.

1. Describe a city you have visited which has impressed you. You should say:
   Where it is situated
   Why you visited it
   What you liked about it
   And explain why you prefer it to other cities.

2. Describe a competition (or contest) that you have entered. You should say:
   When the competition took place
   What you had to do
   How well you did it
   And explain why you entered the competition (or contest).

3. Describe a part-time/holiday job that you have done. You should say:
   How you got the job
   What the job involved
   How long the job lasted
   And explain why you think you did the job well or badly.

4. Describe a museum, exhibition or art gallery that you have visited. You should say:
   Where it is
   What made you decide to go there
   What you particularly remember about the place
   And explain why you would or would not recommend it to your friend.

5. Describe an enjoyable event that you experienced when you were at school. You should say:
   What the event was
   When it happened
   What was good about it
   And explain why you particularly remember this event.

6. Describe a teacher who has influenced you in your education. You should say:
   Where you met them
   What subject they taught
   What was special about them
   And explain why this person influenced you so much.

7. Describe a film or a TV programme which has made a strong impression on you. You should say:
   What kind of film or TV programme it was, eg comedy
   When you saw the film or TV programme
   What the film or TV programme was about
   And explain why this film or TV programme made such an impression on you.

8. Describe a memorable event in your life. You should say:
   When the event took place
   Where the event took place
   What happened exactly
   And why this event was memorable for you.

9. Describe something you own which is very important to you. You should say:
   Where you got it from
   How long you have had it
   What you use it for
   And explain why it is so important to you.


APPENDIX 4: THE FINAL SET OF TASKS

You will have to talk about the topic for 2 minutes. You have 1 minute to think about what you are going to say.

A. Describe a city you have visited which has impressed you. You should say:
   Where it is situated
   Why you visited it
   What you liked about it
   And explain why you prefer it to other cities.

B. Describe a part-time/holiday job that you have done. You should say:
   How you got the job
   What the job involved
   How long the job lasted
   And explain why you think you did the job well or badly.

C. Describe a sports event that you have been to or seen on TV. You should say:
   What it was
   Why you wanted to see it
   What was the most exciting or boring part
   And explain why it was good or bad.

D. Describe an enjoyable event that you experienced when you were at school. You should say:
   What the event was
   When it happened
   What was good about it
   And explain why you particularly remember this event.

E. Describe a teacher who has influenced you in your education. You should say:
   Where you met them
   What subject they taught
   What was special about them
   And explain why this person influenced you so much.

F. Describe a film or a TV programme which made a strong impression on you. You should say:
   What kind of film or TV programme it was (eg comedy)
   When you saw it
   What it was about
   And explain why it made such an impression on you.

G. Describe a memorable event in your life. You should say:
   When the event took place
   Where the event took place
   What happened exactly
   And why this event was memorable for you.

H. Describe something you own which is very important to you. You should say:
   Where you got it from
   How long you have had it
   What you use it for
   And explain why it is so important to you.


APPENDIX 5: SPSS ONE-WAY ANOVA OUTPUT

Multiple Comparisons
Dependent Variable: TOTAL
Bonferroni

(I) TASK  (J) TASK  Mean Diff (I-J)  Std. Error   Sig.   95% CI Lower  95% CI Upper
Task A    Task B      .3622           .22786      1.000     -.3591        1.0835
Task A    Task C     -.0185           .22570      1.000     -.7330         .6959
Task A    Task D      .3824           .22368      1.000     -.3256        1.0905
Task A    Task E      .4487           .22786      1.000     -.2726        1.1700
Task A    Task F      .6891           .22786       .079     -.0322        1.4104
Task A    Task G      .9103*          .22786       .003      .1890        1.6315
Task A    Task H      .7853*          .22786       .019      .0640        1.5065
Task B    Task A     -.3622           .22786      1.000    -1.0835         .3591
Task B    Task C     -.3807           .22786      1.000    -1.1020         .3406
Task B    Task D      .0203           .22586      1.000     -.6947         .7352
Task B    Task E      .0865           .23000      1.000     -.6415         .8146
Task B    Task F      .3269           .23000      1.000     -.4011        1.0550
Task B    Task G      .5481           .23000       .507     -.1800        1.2761
Task B    Task H      .4231           .23000      1.000     -.3050        1.1511
Task C    Task A      .0185           .22570      1.000     -.6959         .7330
Task C    Task B      .3807           .22786      1.000     -.3406        1.1020
Task C    Task D      .4010           .22368      1.000     -.3071        1.1090
Task C    Task E      .4672           .22786      1.000     -.2540        1.1885
Task C    Task F      .7076           .22786       .061     -.0137        1.4289
Task C    Task G      .9288*          .22786       .002      .2075        1.6501
Task C    Task H      .8038*          .22786       .015      .0825        1.5251
Task D    Task A     -.3824           .22368      1.000    -1.0905         .3256
Task D    Task B     -.0203           .22586      1.000     -.7352         .6947
Task D    Task C     -.4010           .22368      1.000    -1.1090         .3071
Task D    Task E      .0663           .22586      1.000     -.6487         .7812
Task D    Task F      .3067           .22586      1.000     -.4083        1.0216
Task D    Task G      .5278           .22586       .572     -.1871        1.2428
Task D    Task H      .4028           .22586      1.000     -.3121        1.1178
Task E    Task A     -.4487           .22786      1.000    -1.1700         .2726
Task E    Task B     -.0865           .23000      1.000     -.8146         .6415
Task E    Task C     -.4672           .22786      1.000    -1.1885         .2540
Task E    Task D     -.0663           .22586      1.000     -.7812         .6487
Task E    Task F      .2404           .23000      1.000     -.4877         .9684
Task E    Task G      .4615           .23000      1.000     -.2665        1.1896
Task E    Task H      .3365           .23000      1.000     -.3915        1.0646
Task F    Task A     -.6891           .22786       .079    -1.4104         .0322
Task F    Task B     -.3269           .23000      1.000    -1.0550         .4011
Task F    Task C     -.7076           .22786       .061    -1.4289         .0137
Task F    Task D     -.3067           .22586      1.000    -1.0216         .4083
Task F    Task E     -.2404           .23000      1.000     -.9684         .4877
Task F    Task G      .2212           .23000      1.000     -.5069         .9492
Task F    Task H      .0962           .23000      1.000     -.6319         .8242
Task G    Task A     -.9103*          .22786       .003    -1.6315        -.1890
Task G    Task B     -.5481           .23000       .507    -1.2761         .1800
Task G    Task C     -.9288*          .22786       .002    -1.6501        -.2075
Task G    Task D     -.5278           .22586       .572    -1.2428         .1871
Task G    Task E     -.4615           .23000      1.000    -1.1896         .2665
Task G    Task F     -.2212           .23000      1.000     -.9492         .5069
Task G    Task H     -.1250           .23000      1.000     -.8531         .6031
Task H    Task A     -.7853*          .22786       .019    -1.5065        -.0640
Task H    Task B     -.4231           .23000      1.000    -1.1511         .3050
Task H    Task C     -.8038*          .22786       .015    -1.5251        -.0825
Task H    Task D     -.4028           .22586      1.000    -1.1178         .3121
Task H    Task E     -.3365           .23000      1.000    -1.0646         .3915
Task H    Task F     -.0962           .23000      1.000     -.8242         .6319
Task H    Task G      .1250           .23000      1.000     -.6031         .8531

* The mean difference is significant at the .05 level.
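For readers who want to see how a table of this kind is produced, the following is a minimal Python sketch of Bonferroni-corrected pairwise comparisons using scipy, with randomly generated score vectors standing in for the study's raw task scores (which are not reproduced in the report). Note that SPSS computes its Bonferroni confidence intervals from the pooled ANOVA error term, so this simplified per-pair version will not reproduce the values above exactly.

    from itertools import combinations
    import numpy as np
    from scipy import stats

    # Hypothetical score vectors for Tasks A-H, for illustration only.
    rng = np.random.default_rng(0)
    groups = {task: rng.normal(5 + i * 0.1, 1.0, 30)
              for i, task in enumerate("ABCDEFGH")}

    pairs = list(combinations(groups, 2))   # 28 unordered pairs of tasks
    for a, b in pairs:
        diff = groups[a].mean() - groups[b].mean()
        _, p = stats.ttest_ind(groups[a], groups[b])
        p_bonf = min(1.0, p * len(pairs))   # Bonferroni: adjusted p capped at 1.000
        flag = "*" if p_bonf < 0.05 else ""
        print(f"Task {a} vs Task {b}: mean difference {diff:+.4f}, sig. {p_bonf:.3f}{flag}")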


APPENDIX 6: QUESTIONNAIRE ABOUT TASK 1

For each of the items below, circle the number that REFLECTS YOUR VIEWPOINT on a five-point scale.

1. The vocabulary in the task prompts was: Very easy 1 2 3 4 5 Very difficult

2. The grammatical structures in the task prompts were: Very easy 1 2 3 4 5 Very difficult

3. The topic of the task was: Very familiar 1 2 3 4 5 Very unfamiliar

4. The information given in the task was: Very concrete 1 2 3 4 5 Very abstract

5. The planning time to complete (prepare for) the task was: Too long 1 2 3 4 5 Too short (3 = appropriate)

6. The time to complete the task was: Too long 1 2 3 4 5 Too short (3 = appropriate)

7. How much information did you use from the 4 short prompts provided in the task?
   1 = I used 100% of the information provided in the task
   2 = I used 75% of the information provided in the task
   3 = I used 50% of the information provided in the task
   4 = I used 25% of the information provided in the task
   5 = I did not use any information in the task at all

8. How did you use notes while you were speaking?
   1 = I read my notes aloud.
   2 = I referred to my notes line by line and looked up to speak.
   3 = I referred to my notes when I needed to.
   4 = I prepared notes, but I did not use them.
   5 = I did not take notes.

Thank you very much for your cooperation.


APPENDIX 7: QUESTIONNAIRE – UNCHANGED AND REDUCED TIME VERSIONS

For students responding to the unchanged versions and to the reduced response time versions.
For each of the items below, circle the number that reflects your viewpoint on the five-point scale
(1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree).

What I thought of or did before I started

1. I read the task very carefully to understand what was required. 1 2 3 4 5
2. I thought of HOW to deliver my speech in order to respond well to the topic. 1 2 3 4 5
3. I thought of HOW to satisfy the audience and examiners. 1 2 3 4 5
4. I understood the instructions for this speaking test completely. 1 2 3 4 5
5. I had ENOUGH ideas to speak about this topic. 1 2 3 4 5
6. I felt it was easy to produce enough ideas for the speech from memory. 1 2 3 4 5
7. I know A LOT about this type of speech, i.e., I know how to make a speech on this type of topic. 1 2 3 4 5
8. I know A LOT about other types of speaking test, e.g., interview, discussion. 1 2 3 4 5

What I thought of or did in planning stage

1. I thought of MOST of my ideas for the speech BEFORE planning an outline. 1 2 3 4 5
2. During the period allowed for planning, I was conscious of the time. 1 2 3 4 5
3. I followed the 3 short prompts provided in the task when I was planning. 1 2 3 4 5
4. The information in the short prompts provided was necessary for me to complete the task. 1 2 3 4 5
5. I wrote down the points I wanted to make based on the 3 short prompts provided in the task. 1 2 3 4 5
6. I wrote down the words and expressions I needed to fulfil the task. 1 2 3 4 5
7. I wrote down the structures I needed to fulfil the task. 1 2 3 4 5
8. I took notes only in ENGLISH. 1 2 3 4 5
9. I took notes only in my own language. 1 2 3 4 5
10. I took notes in both ENGLISH and my own language. 1 2 3 4 5
11. I planned an outline on paper BEFORE starting to speak. 1. Yes 2. No
12. I planned an outline in my mind BEFORE starting to speak. 1. Yes 2. No
13. Ideas occurring to me at the beginning tended to be COMPLETE. 1 2 3 4 5
14. I was able to put my ideas or content in good order. 1 2 3 4 5
15. I practised the speech in my mind WHILE I was planning. 1 2 3 4 5
16. After finishing my planning, I practised what I was going to say in my mind until it was time to start. 1 2 3 4 5


What I thought of or did while I was speaking
(1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree)

1. I felt it was easy to put ideas in good order. 1 2 3 4 5
2. I was able to express my ideas using suitable words. 1 2 3 4 5
3. I was able to express my ideas using correct grammar. 1 2 3 4 5
4. I thought of MOST of my ideas for the speech WHILE I was speaking. 1 2 3 4 5
5. WHILE I was speaking, I did not use some ideas that I had planned. 1 2 3 4 5
6. I was able to put sentences in logical order. 1 2 3 4 5
7. I was able to CONNECT my ideas smoothly in the whole speech. 1 2 3 4 5
8. I was conscious of the time WHILE I was making this speech. 1 2 3 4 5
9. I tried to finish speaking within the time. 1 2 3 4 5
10. I was listening and checking the correctness of the contents and their order WHILE I was making this speech. 1 2 3 4 5
11. I was listening and checking whether the contents and their order fit the topic WHILE I was making this speech. 1 2 3 4 5
12. I was listening and checking the correctness of sentences WHILE I was making this speech. 1 2 3 4 5
13. I was listening and checking whether the words fit the topic WHILE I was making this speech. 1 2 3 4 5
14. I felt it was easy to complete the task. 1 2 3 4 5
15. Comments on the above items:

Thank you for completing this questionnaire


APPENDIX 8: QUESTIONNAIRE – NO PLANNING VERSION

For students responding to the no planning versions.
For each of the items below, circle the number that reflects your viewpoint on the five-point scale
(1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree).

What I thought of or did before I started

1. I read the task very carefully to understand what was required. 1 2 3 4 5
2. I thought of HOW to deliver my speech in order to respond well to the topic. 1 2 3 4 5
3. I thought of HOW to satisfy the audience and examiners. 1 2 3 4 5
4. I understood the instructions for this speaking test completely. 1 2 3 4 5
5. I had ENOUGH ideas to speak about this topic. 1 2 3 4 5
6. I felt it was easy to produce enough ideas for the speech from memory. 1 2 3 4 5
7. I know A LOT about this type of speech, i.e., I know how to make a speech on this type of topic. 1 2 3 4 5
8. I know A LOT about other types of speaking test, e.g., interview, discussion. 1 2 3 4 5

What I thought of or did while I was speaking

1. I felt it was easy to put ideas in good order. 1 2 3 4 5
2. I was able to express my ideas using suitable words. 1 2 3 4 5
3. I was able to express my ideas using correct grammar. 1 2 3 4 5
4. I thought of MOST of my ideas for the speech WHILE I was speaking. 1 2 3 4 5
5. WHILE I was speaking, I did not use some ideas that I had planned. 1 2 3 4 5
6. I was able to put sentences in logical order. 1 2 3 4 5
7. I was able to CONNECT my ideas smoothly in the whole speech. 1 2 3 4 5
8. I was conscious of the time WHILE I was making this speech. 1 2 3 4 5
9. I tried to finish speaking within the time. 1 2 3 4 5
10. I was listening and checking the correctness of the contents and their order WHILE I was making this speech. 1 2 3 4 5
11. I was listening and checking whether the contents and their order fit the topic WHILE I was making this speech. 1 2 3 4 5
12. I was listening and checking the correctness of sentences WHILE I was making this speech. 1 2 3 4 5
13. I was listening and checking whether the words fit the topic WHILE I was making this speech. 1 2 3 4 5
14. I felt it was easy to complete the task. 1 2 3 4 5
15. Comments on the above items:

Thank you for completing this questionnaire

APPENDIX 9: QUESTIONNAIRE – UNSCAFFOLDED VERSIONS

For students responding to the unscaffolded versions.
For each of the items below, circle the number that reflects your viewpoint on the five-point scale
(1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree).


What I thought of or did before I started

1. I read the task very carefully to understand what was required. 1 2 3 4 5
2. I thought of HOW to deliver my speech in order to respond well to the topic. 1 2 3 4 5
3. I thought of HOW to satisfy the audience and examiners. 1 2 3 4 5
4. I understood the instructions for this speaking test completely. 1 2 3 4 5
5. I had ENOUGH ideas to speak about this topic. 1 2 3 4 5
6. I felt it was easy to produce enough ideas for the speech from memory. 1 2 3 4 5
7. I know A LOT about this type of speech, i.e., I know how to make a speech on this type of topic. 1 2 3 4 5
8. I know A LOT about other types of speaking test, e.g., interview, discussion. 1 2 3 4 5

What I thought of or did in planning stage

1. I thought of MOST of my ideas for the speech BEFORE planning an outline. 1 2 3 4 5
2. During the period allowed for planning, I was conscious of the time. 1 2 3 4 5
3. I followed the 3 short prompts provided in the task when I was planning. 1 2 3 4 5
4. The information in the short prompts provided was necessary for me to complete the task. 1 2 3 4 5
5. I wrote down the points I wanted to make based on the 3 short prompts provided in the task. 1 2 3 4 5
6. I wrote down the words and expressions I needed to fulfil the task. 1 2 3 4 5
7. I wrote down the structures I needed to fulfil the task. 1 2 3 4 5
8. I took notes only in ENGLISH. 1 2 3 4 5
9. I took notes only in my own language. 1 2 3 4 5
10. I took notes in both ENGLISH and my own language. 1 2 3 4 5
11. I planned an outline on paper BEFORE starting to speak. 1. Yes 2. No
12. I planned an outline in my mind BEFORE starting to speak. 1. Yes 2. No
13. Ideas occurring to me at the beginning tended to be COMPLETE. 1 2 3 4 5
14. I was able to put my ideas or content in good order. 1 2 3 4 5
15. I practised the speech in my mind WHILE I was planning. 1 2 3 4 5
16. After finishing my planning, I practised what I was going to say in my mind until it was time to start. 1 2 3 4 5


What I thought of or did while I was speaking
(1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree)

1. I felt it was easy to put ideas in good order. 1 2 3 4 5
2. I was able to express my ideas using suitable words. 1 2 3 4 5
3. I was able to express my ideas using correct grammar. 1 2 3 4 5
4. I thought of MOST of my ideas for the speech WHILE I was speaking. 1 2 3 4 5
5. WHILE I was speaking, I did not use some ideas that I had planned. 1 2 3 4 5
6. I was able to put sentences in logical order. 1 2 3 4 5
7. I was able to CONNECT my ideas smoothly in the whole speech. 1 2 3 4 5
8. I was conscious of the time WHILE I was making this speech. 1 2 3 4 5
9. I tried to finish speaking within the time. 1 2 3 4 5
10. I was listening and checking the correctness of the contents and their order WHILE I was making this speech. 1 2 3 4 5
11. I was listening and checking whether the contents and their order fit the topic WHILE I was making this speech. 1 2 3 4 5
12. I was listening and checking the correctness of sentences WHILE I was making this speech. 1 2 3 4 5
13. I was listening and checking whether the words fit the topic WHILE I was making this speech. 1 2 3 4 5
14. I felt it was easy to complete the task. 1 2 3 4 5
15. Comments on the above items:

Thank you for completing this questionnaire


6. The interactional organisation of the IELTS Speaking Test

Authors
Paul Seedhouse, University of Newcastle upon Tyne, UK
Maria Egbert, University of Southern Denmark, Denmark

Grant awarded: Round 10, 2004

This report describes the interactional organisation of the IELTS Speaking Test in terms of turn-taking, sequence and repair.

ABSTRACT

This study is based on the analysis of transcripts of 137 audio-recorded tests using a Conversation Analysis (CA) methodology. The institutional aim of standardisation in relation to assessment is shown to be the key principle underlying the organisation of the interaction. Overall, the vast majority of examiners conform to the instructions; in cases where they do not do so, they often give an advantage to some candidates. The overall organisation of the interaction is highly constrained, although there are some differences in the different parts of the test. The organisation of repair has a number of distinctive characteristics in that it is conducted according to strictly specified rules, in which the examiners have been briefed and trained.

Speaking test interaction is an institutional variety of interaction with three sub-varieties. It is very different to ordinary conversation, but has some similarities with certain sub-varieties of L2 classroom interaction and with interaction in universities.

A number of recommendations are made in relation to examiner training, instructions and test design.


AUTHOR BIODATA

PAUL SEEDHOUSE

Dr Paul Seedhouse is Reader in Educational and Applied Linguistics in the School of Education, Communication and Language Sciences at the University of Newcastle upon Tyne, UK, where he is also Postgraduate Research Director. Following a teaching career in which he taught ESOL, German and French in five different countries, he has published widely in journals of applied linguistics, language teaching and pragmatics. His monograph, The Interactional Architecture of the Language Classroom: A CA Perspective, was published by Blackwell in 2004 and won the 25th annual Kenneth W Mildenberger Prize of the Modern Language Association of America in 2005. He has also edited (with Keith Richards) the collection, Applying Conversation Analysis, published by Palgrave Macmillan in 2005.

MARIA EGBERT

Maria Egbert, PhD (University of California Los Angeles), is Associate Professor at the Institute of Business Communication and Information Science at the University of Southern Denmark. She has taught conversation analysis, applied linguistics and German at the University of Texas at Austin, the University of Oldenburg, the University of Jyväskylä and most recently at the University of Southern Denmark. Her research focuses on conversational repair, interculturality, and affiliation.


CONTENTS

1 Introduction
2 Research design
  2.1 Background information on the IELTS Speaking Test
  2.2 The study
  2.3 Methodology
  2.4 Data
  2.5 Sampling
  2.6 Relationship to existing research literature
3 Data analysis
  3.1 Trouble and repair
    3.1.1 Repair initiation
    3.1.2 Repetition of questions
    3.1.3 Lack of uptake to the prompt
    3.1.4 Vocabulary
  3.2 Turn-taking and sequence
    3.2.1 The introduction section
    3.2.2 Transition between parts of the test and questions
    3.2.3 Evaluation
  3.3 Topic
    3.3.1 Topic disjunction
    3.3.2 Recipient design and rounding-off questions
4 Answers to research questions
5 Conclusion
  5.1 Implications and recommendations: test design / examiner training
  5.2 Suggestions for further research
References
Appendix 1: Transcript conventions
Appendix 2: Low test score of Band 3.0
Appendix 3: High test score of Band 9.0


1 INTRODUCTION

This report presents the results of a qualitative study of the IELTS Speaking Test, which is the most widely used English proficiency test for overseas applicants to British universities. The Speaking Test is designed to assess how effectively candidates can communicate in English. About 4,000 certified examiners administer well over 500,000 IELTS tests annually at over 300 centres in around 120 countries.

Based on a selection of 137 transcribed oral proficiency interviews, this study analyses the internal organisation of this institutional variety of interaction in terms of examiner-candidate talk. In particular, the interactional structures are investigated in the areas of trouble and repair, turn-taking and sequence, and topic development. The analysis also focuses on how examiners put instructions from the training documents into practice, and how institutional constraints may shape learners’ speech behaviour. Since the Speaking Test is taken to predict how well candidates will communicate in a university setting, it is important to understand what kind of interaction is generated in the test and its relationship to interaction in the target setting.

In the next section of this report (Part 2), a background description of the Speaking Test is provided, together with a presentation of the research design. The detailed qualitative data analysis, with displays of exemplary transcript excerpts, follows in Part 3, and brief answers to the research questions are given in Part 4. The conclusion (Part 5) raises applied issues for test design and examiner training, and develops implications for future research.

2 RESEARCH DESIGN

2.1 Background information on the IELTS Speaking Test

IELTS Speaking Tests are encounters between one candidate and one examiner and are designed to take between 11 and 14 minutes. There are three main parts. Each part fulfils a specific function in terms of interaction pattern, task input and candidate output. These are now described as a backdrop for the analysis.

In Part 1 (Introduction) candidates answer general questions about themselves, their homes/families, their jobs/studies, their interests, and a range of familiar topic areas. Examiners introduce themselves and confirm the candidate’s identity. Examiners interview candidates using verbal questions selected from familiar topic frames. This part lasts between four and five minutes. In Part 2 (Individual long turn) the candidate is given a verbal prompt on a card and is asked to talk on a particular topic. The candidate has one minute to prepare before speaking at length, for between one and two minutes. The examiner then asks one or two rounding-off questions. In Part 3 (Two-way discussion) the examiner and candidate engage in a discussion of more abstract issues and concepts which are thematically linked to the topic prompt in Part 2.

Examiners receive detailed directives in order to maximise test reliability and validity. The most relevant and important instructions to examiners are as follows: “Standardisation plays a crucial role in the successful management of the IELTS Speaking Test.” (Instructions to IELTS Examiners, p 11). “The IELTS Speaking Test involves the use of an examiner frame which is a script that must be followed (original emphasis)…Stick to the rubrics – do not deviate in any way…If asked to repeat rubrics, do not rephrase in any way…Do not make any unsolicited comments or offer comments on performance.” (IELTS Examiner Training Material 2001, p 5). The degree of control over the phrasing differs in the three parts of the test as follows: “The wording of the frame is carefully controlled in Parts 1 and 2 of the Speaking Test to ensure that all candidates receive similar input delivered in the same manner. In Part 3, the frame is less controlled so that the examiner’s


language can be accommodated to the level of the candidate being examined. In all parts of the test, examiners are asked to follow the frame in delivering the script…Examiners should refrain from making unscripted comments or asides.” (Instructions to IELTS Examiners, p 5). Research has shown that the speech functions which occur regularly in a candidate’s output during the Speaking Test are: providing personal information; expressing a preference; providing non-personal information; comparing; expressing opinions; summarising; explaining; conversation repair; suggesting; contrasting; justifying opinions; narrating and paraphrasing; speculating; analysing. Other speech functions may emerge during the test, but they are not forced by the test structure (Taylor, 2001a).

Detailed performance descriptors have been developed which describe spoken performance at the nine IELTS bands, based on the following criteria. Scores are reported as whole bands only.

Fluency and coherence refers to the ability to talk with normal levels of continuity, rate and effort and to link ideas and language together to form coherent, connected speech. The key indicators of fluency are speech rate and speech continuity. The key indicators of coherence are logical sequencing of sentences, clear marking of stages in a discussion, narration or argument, and the use of cohesive devices (eg connectors, pronouns and conjunctions) within and between sentences.

Lexical resource refers to the range of vocabulary the candidate can use and the precision with which meanings and attitudes can be expressed. The key indicators are the variety of words used, the adequacy and appropriacy of the words used and the ability to circumlocute (get round a vocabulary gap by using other words) with or without noticeable hesitation.

Grammatical range and accuracy refers to the range and the accurate and appropriate use of the candidate’s grammatical resource. The key indicators of grammatical range are the length and complexity of the spoken sentences, the appropriate use of subordinate clauses, the variety of sentence structures, and the ability to move elements around for information focus. The key indicators of grammatical accuracy are the number of grammatical errors in a given amount of speech and the communicative effect of error.

Pronunciation refers to the capacity to produce comprehensible speech in fulfilling the Speaking Test requirements. The key indicators will be the amount of strain caused to the listener, the amount of unintelligible speech and the noticeability of L1 influence. (IELTS Handbook 2005, p 11)

2.2 The study

The overall aim is to uncover the interactional organisation of the IELTS Speaking Test as it is collaboratively produced in its three parts. In this section, we present the research questions, methodology, data, sampling and the relation to existing literature.

Sub-questions are as follows:

1. How and why does interactional trouble arise and how is it repaired by the interactants? What types of repair initiation are used by examiners and examinees and how are these responded to? What role does repetition play?

2. What is the organisation of turn-taking and sequence?

3. What is the relationship between Speaking Test interaction and other speech exchange systems such as ordinary conversation, L2 classroom interaction, and interaction in universities?

4. What is the relationship between examiner interaction and candidate performance?

5. To what extent do examiners follow the briefs they have been given?


6. In cases where examiners diverge from briefs, what impact does this have on the interaction?

7. How are tasks implemented? What is the relationship between the intended tasks and the implemented tasks, between the task-as-workplan and task-in-process?

8. How is the organisation of the interaction related to the institutional goal and participants’ orientations?

9. How are the roles of examiner and examinee, the participation framework and the focus of the interaction established?

10. How long do tests last in practice and how much time is given for preparation in Part 2?

Language proficiency interviews in general are intended to assess the language proficiency of non-native speakers and to predict their ability to communicate in future encounters. IELTS “is designed to assess the language ability of candidates who need to study or work where English is used as the language of communication” (www.ielts.org.handbook.htm). The Speaking Test aims to evaluate how well a language learner might function in a target context, often an academic one. The IELTS Speaking Test is predominantly used to assess and predict whether a candidate has the ability to communicate effectively on programmes in English-speaking universities. Hypothetically, interaction in oral proficiency interviews could be characterised in a number of ways, including similarities and differences with other speech exchange systems such as ordinary conversation, L2 classroom interaction, task-based interaction, academic interaction, interviews and tests.

This project aims to determine the endogenous organisation of the Speaking Test and its relationship to some of these other systems. Because the Speaking Test (with its own interactional organisation) evaluates learners’ ability to function in future in other speech exchange systems, each with their own interactional organisation, the proposed research should be of interest to the following parties: fellow researchers in language testing; designers of the IELTS Speaking Test and other similar tests; IELTS examiners; teachers preparing students for the Speaking Test. It is argued that making the interactional organisation of the Speaking Test explicit may help to ensure comparability of challenge to candidates from different cultural backgrounds.

The question of how and why interactional trouble arises and how it is repaired by the interactants should be of interest to all those taking part and designers of test items would be interested in how the items are actually implemented in practice. Seedhouse (2004) suggests that the organisation of repair in L2 classrooms is reflexively related to the pedagogical focus. This study will investigate when repair occurs, how it is organised in the Speaking Test and what the relationship is between the organisation of repair and the institutional goal. The research, then, intends to provide empirical insights and raise awareness which can then feed into all areas of test development and training.

2.3 Methodology

The methodology employed is Conversation Analysis (CA) (Drew & Heritage, 1992a; Lazaraton, 2002; Sacks, Schegloff & Jefferson, 1974; Seedhouse, 2004). Studies of institutional interaction have focussed on how the organisation of the interaction is related to the institutional aim and on the ways in which this organisation differs from the benchmark of free conversation. Heritage (1997) proposes six basic places to probe the institutionality of interaction, namely:

- turn-taking organisation
- overall structural organisation of the interaction
- sequence organisation
- turn design
- lexical choice
- epistemological and other forms of asymmetry.


He also proposes four different kinds of asymmetry in institutional talk:

- asymmetries of participation, eg the professional asking questions to the lay client
- asymmetries of interactional and institutional know-how, eg professionals being used to the type of interaction, agenda and typical course of an interview, in contrast to the lay client
- epistemological caution and asymmetries of knowledge, eg professionals often avoiding taking a firm position
- rights of access to knowledge, particularly professional knowledge.

Interactional asymmetry and roles in LPIs are controversial issues (Taylor, 2001c) and Speaking Test data are examined with the above issues in mind. Perhaps the most important analytical consideration is that institutional talk displays goal orientation and rational organisation. In contrast to conversation, participants in institutional interaction orient to some “core goal, task or identity (or set of them) conventionally associated with the institution in question” (Drew & Heritage, 1992b, p 22). CA institutional discourse methodology attempts to relate not only the overall organisation of the interaction but also individual interactional devices to the core institutional goal. CA attempts, then, to understand the organisation of the interaction as being rationally derived from the core institutional goal. Levinson sees the structural elements of institutional talk as:

Rationally and functionally adapted to the point or goal of the activity in question, that is the function or functions that members of the society see the activity as having. By taking this perspective it seems that in most cases apparently ad hoc and elaborate arrangements and constraints of very various sorts can be seen to follow from a few basic principles, in particular rational organisation around a dominant goal. (Levinson, 1992, p 71)

Seedhouse (2004) describes the overall interactional organisation of the L2 classroom, identifying the institutional goal as well as the interactional properties which derive directly from the goal. He also identifies the basic sequence organisation of L2 classroom interaction and exemplifies how the institution of the L2 classroom is talked in and out of being by participants. Seedhouse demonstrates that, although L2 classroom interaction is extremely diverse, heterogeneous, fluid and complex, it is nonetheless possible to describe its interactional architecture. In the case of Speaking Test interaction, we will see that there is considerably less diversity and heterogeneity than in L2 classrooms because of the restrictions of the test format and the use of similar tasks for all participants. Language proficiency interviews (LPIs) differ from other types of institutional interaction in one respect. Normally, the institutional business is achieved via the content of the talk, whereas in the LPI the content of the talk is not central. The responses are required to be accurate and relevant to the questions, but the examiner does not have to employ the responses to further the institutional business; language is used for display rather than communication. (The authors are grateful to G Thompson for this and other comments.)

In this study, we employ Richards and Seedhouse’s (2005) model of “description leading to informed action” in relation to applications of CA. We link the description of the interaction to the institutional goals and provide proposals for informed action based on our analysis of the data.

2.4 Data

The analysis of naturalistic data, one of the basic premises of CA research, allows a direct and authentic examination of the interactants’ conduct. Therefore, the primary raw data consist of audio recordings in cassette format of operational IELTS Speaking Tests. All IELTS Speaking Tests are routinely recorded for monitoring and quality assurance purposes; in addition, a selection of these is entered into an IELTS Speaking Test Corpus which is used for research purposes and currently contains several thousand test performances. The data set for this study was drawn from recordings of live tests conducted during 2003. Secondary data included paper materials relevant to the Speaking Tests recorded on cassette, including examiners’ briefs, marking criteria, examiner


induction, training, standardisation and certification packs (Taylor, 2001b). These data were helpful in establishing the institutional goal of the interaction and the institutional orientations of the examiners. The primary raw data (137 Speaking Tests) were transcribed using CA transcription conventions (Appendix 1) by postgraduate research students at the University of Newcastle, using the existing transcription equipment in the School of Education, Communication and Language Sciences. The resultant transcripts were produced in paper and electronic format and are copyright of Cambridge ESOL, one of the IELTS partners. All personal references have been anonymised.

2.5 Sampling

The IELTS Speaking Test Corpus contains over 2,500 recordings of tests conducted during 2003; the researchers selected an initial sample of 300 cassettes and then transcribed 137 of these. The aim of the sampling was to ensure variety in the transcripts in terms of gender, region of the world, task/topic number and Speaking Test band score. The test centre countries covered by the transcribed tests are: Albania, Brazil, Cameroon, United Kingdom, Greece, Indonesia, India, Iran, Jamaica, Lebanon, Mozambique, Netherlands, Norway, New Zealand, Oman, Pakistan, Syria, Vietnam and Zimbabwe. However, we do not have data on individual candidate nationality and ethnicity and it should be borne in mind that in, for example, the data from the UK, a wide range of nationalities and ethnic backgrounds are covered. We do not have any data on the first languages of candidates. Overall test scores covered by the transcribed sample range from band 9.0 to band 3.0 on the IELTS Speaking Module. Two tasks among the many used for the test were selected for transcription. This enabled easy location of audio cassettes whilst at the same time ensuring diversity of task.

The way in which sampling was conducted is as follows: Cambridge ESOL has written information on the above variables in relation to their corpus of IELTS Speaking Tests. The researchers first examined the information available in consultation with Cambridge ESOL and then requested a set of 300 cassettes which covered the range of variables, namely gender, region of the world, task/topic number and Speaking Test band score. A certain number of these cassettes were not usable due to poor sound quality or inadequate labelling. From the researchers’ perspective, the aim was to produce a description of the interactional architecture of the Speaking Test which was able to account for all of the data, regardless of variables relating to particular candidates. The description will tend to have more credibility if the data sampled cover a wide range of variables.

2.6 Relationship to existing research literature

The research builds on two areas of existing work. Firstly, it builds on research done specifically on the IELTS Speaking Test and on language proficiency interviews in general. Secondly, it builds on CA research into language proficiency interviews in particular, into institutional talk (Drew & Heritage, 1992a) and into applications of CA (Richards & Seedhouse, 2005).

Taylor (2000) identifies the nature of the candidate’s spoken discourse and the language and behaviour of the oral examiner as issues of current research interest. Wigglesworth (2001:206) suggests that “In oral assessments, close attention needs to be paid, not only to possible variables which can be incorporated or not into the task, but also to the role of the interlocutor… in ensuring that learners obtain similar input across similar tasks.” Brown & Hill (1998) examine the relationship between the interactional style of the interviewer and candidate performance, with easier interviewers shifting topics frequently and asking simpler questions, and more difficult interviewers using interruption, disagreement and challenging questions. This study builds on this work by examining through a sizeable dataset the relationship between the interactional style of the interviewer and candidate performance.

Previous CA-informed work on oral proficiency interviews by Young and He (1998) and Lazaraton (1997) examined the American Language Proficiency Interview (LPI). Egbert points out that “LPIs are implemented in imitation of natural conversation in order to evaluate a learner’s


conversational proficiency” (Egbert, 1998:147). Young and He’s collection demonstrates, however, a number of clear differences between LPIs and ordinary conversation. Firstly, the systems of turn-taking and repair differ from ordinary conversation. Secondly, LPIs are examples of goal-oriented institutional discourse, in contrast to ordinary conversation. Thirdly, LPIs constitute cross-cultural communication in which the participants may have very different understandings of the nature and purpose of the interaction. Egbert’s (1998) study demonstrates that interviewers explain to students not only the organisation of repair they should use, but also the forms they should use to do so; the suggested forms are cumbersome and differ from those found in ordinary conversation. He’s (1998) microanalysis reveals how a student’s failure in an LPI is due to interactional as well as linguistic problems. Kasper and Ross (2001:10) point out that their CA analysis of LPIs portrays candidates as “eminently skilful interlocutors”, which contrasts with the general SLA view that clarification and confirmation checks are indices of NNS incompetence, while their (2003) paper analyses how repetition can be a source of miscommunication in LPIs.

In the context of course placement interviews, Lazaraton (1997) notes that students initiated a particular sequence, namely self-deprecations of their English language ability. She further suggests that a student providing a demonstration of poor English language ability constitutes grounds for acceptance onto courses. Interactional sequences are therefore linked to participant orientations and goals. Lazaraton (2002) presents a CA approach to the validation of LPIs and her framework should enable findings from this research to feed into future decision-making in relation to the Speaking Test.

3 DATA ANALYSIS

We now examine in more detail a number of themes which emerged from our qualitative analysis of the data; summary answers to the research questions follow in Part 4. In particular, we show the interview-specific structures of (1) trouble and repair, including repair initiation and repetition as the repair operation, (2) turn-taking and sequence, with a special focus on the (lack of) transitions between test parts and question sequences, (3) topic development, with disjunction being related to abrupt sequencing, and (4) other issues arising in the data, addressed in terms of vocabulary, evaluation, answering the question, and introducing the interview. Two themes which arise frequently are interactional problems caused by examiners deviating from instructions and problems issuing from the design of the test itself. In this part of the report, excerpts from transcripts serve to exemplify the findings. Please note that two complete transcripts are available in Appendices 2 and 3 for further review.

3.1 Trouble and repair

Repair is the mechanism by which interactants address and resolve trouble in speaking, hearing and understanding (Schegloff, Jefferson & Sacks, 1977). Trouble is anything which the participants display as impeding speech production or intersubjectivity; a repairable item is one which constitutes such trouble for the participants. Any element of talk may in principle be the focus of repair, even an element which is well-formed, propositionally correct and appropriate. Schegloff, Jefferson & Sacks (1977:363) point out that “nothing is, in principle, excludable from the class ‘repairable’”. Repair, trouble and repairable items are participants’ constructs, for use how and when participants find appropriate. Their use may be related to institutional constraints, however. In courtroom cross-examination of a witness by an opposing lawyer, for example, a failure by the witness to answer questions with yes or no may constitute trouble within that institutional setting (Drew, 1992). Such a failure is therefore repairable (for example by the lawyer and/or judge insisting on a yes/no answer) and even sanctionable. So within a particular institutional sub-variety, the constitution of trouble and what is repairable may be related to the particular institutional focus.

We now focus on the connection between repair and test design. By examining how and why interactional problems arise, it may be possible to fine-tune test design and procedures to minimise trouble. As mentioned above, there does appear to be an inverse relationship between test score and the occurrence of trouble and repair: in interviews with high test scores, fewer examples of repair are observable.

To illustrate this observation, two complete transcripts are reproduced in the Appendices: one with a high score of band 9.0 (Appendix 3) and no occurrence of trouble in hearing or understanding, and one with a low score of band 3.0 (Appendix 2), which gives the impression of great strain in both the candidate’s and the examiner’s conduct. The low-scoring candidate’s performance is characterised by three instances of other-initiated repair in the first half of Part 1 of the interview. Although she does not initiate any further repair, long delays in uptake, combined with answers which display partial, incorrect or no understanding, occur throughout the interview. While there are indications that high scores and low occurrence of trouble co-occur, our study is also interested in uncovering any instances of trouble which may have been created by the test format or procedures themselves and which may therefore have an impact on test validity and reliability.
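
This observed pattern lends itself, in principle, to a simple corpus check. The sketch below (in Python) is illustrative only and is not part of our method, which rests on qualitative CA coding of repair sequences by hand; it assumes, hypothetically, plain-text transcripts formatted like the extracts in this report, stored one per interview in files named with their band scores, and it uses a crude list of open-class repair-initiation markers.

import re
from pathlib import Path

# Hypothetical repair-initiation markers, for illustration only; a real
# analysis would identify repair sequences through manual CA coding.
REPAIR_INITIATORS = re.compile(r"\b(sorry|pardon|can you repeat)\b", re.IGNORECASE)

def count_candidate_repair_initiations(transcript: str) -> int:
    # Count candidate turns (label 'C:') containing a repair-initiation marker.
    count = 0
    for line in transcript.splitlines():
        # Strip the transcript line number and any analyst's arrow,
        # eg "64→ C: sorry? (0.4)" becomes "C: sorry? (0.4)".
        turn = re.sub(r"^\d+\s*→?\s*", "", line.strip())
        if turn.startswith("C:") and REPAIR_INITIATORS.search(turn):
            count += 1
    return count

# Hypothetical file layout: one transcript per interview, eg "0125_band3.0.txt".
for path in sorted(Path("transcripts").glob("*_band*.txt")):
    band = float(path.stem.split("_band")[1])
    n = count_candidate_repair_initiations(path.read_text(encoding="utf-8"))
    print(f"{path.name}: band {band}, {n} candidate repair initiations")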

3.1.1 Repair initiation

Repair policy and practice vary in the different parts of the test. Examiners have training and written instructions on how to respond to repair initiations by candidates. “When interaction has clearly broken down, or fails to develop initially, the examiner will need to intervene. This…may involve: repetition of all or part of the rubric (Part 1 or 2); the examiner asking: ‘Can you tell me anything more about that?’ (Part 2); re-wording a question/prompt or asking a different question (Part 3)” (IELTS Examiner Training Material 2001, pp6).

Candidates initiate repair in relation to examiner questions in a variety of ways. Examiner instructions are to repeat the question once only, but not to paraphrase or alter it. In Part 1, “The exact words in the frame should be used. If a candidate misunderstands the question, it can be repeated once but the examiner cannot reformulate the question in his or her own words. If misunderstanding persists, the examiner should move on to another question in the frame. The examiner should not explain any vocabulary in the frame.” (Instructions to IELTS Examiners, pp5). The vast majority of examiners in the data conform to this guidance; however, they frequently make prosodic adjustments, as in the example below. (For transcription conventions, the reader is referred to Appendix 1.)

Extract 1
70→ E: do people (0.6) <generally prefer watching films at home> (0.2)
71  C: yeah (0.5)
72  E: <or in a cinema> (0.2)
73  C: yeah (2.7) so (1.2)
74→ E: do people generally prefer watching films (.) at home (0.3)
75  C: mm hm (0.6)
76  E: or in a (0.3) cinema (0.2)
77  C: I think a cinema (0.4)
78  E: °why°? (0.6)
79  C: because I think cinema (0.9) is too big (0.2) and (1.2) you can (0.3) you
80     can join in the:: the film (0.7)
(Part 1)

In this case the examiner repeats the question once. Sometimes examiners do not follow the guidelines and modify the question, as in the extract below:

Extract 2
17 E: can we talk about your country (.) which part of China (0.2) do most
18    people live in (0.4)
19 C: uhm in I think in the south of China most people living (0.4)
20 E: yeah (0.3) tell me about the main industries in China (0.8)
21 C: sorry? (0.3)
22 E: the main industries (0.3)
23 C: industries?=
24 E: → =like ca:r industry:=
25 C: =o::h (0.4)
26 E: → factories where they make th[in]gs (0.3)
27 C: [oh]
28 E: → what what things does China (.) m[ake ]
29 C: [oh ye]s (0.4) uhm I think mm China:
30    (0.5) the heavy industry (0.2) is uh most uh °important° in (0.5) China
31    (0.3)
32 E: mm hm (1.2) how easy is it to travel ↑round China (0.9)
(Part 1)

In Part 3, by contrast, “The scripted frame is looser and the examiner uses language appropriate to the level of the candidate being examined. The examiner should use the topic content provided and formulate prompts to which the candidate responds in order to develop the dialogue.”

Extract 3
117 E: can you suggest some of the ways life has improved because of
118     technology?
119 C: → (0.4) can you repeat that?
120 E: are there some ways that our life has improved because of this technology
121 C: → mm (3.0)
122 E: have our lives become easier and more convenient because of new
123     technology
124 C: erm yes (0.3) I think technology helps us a lot
(Part 3)

Here the examiner reformulates the question in line 122 in response to a repair initiation in line 119, which is followed by a hesitation and a three-second pause in line 121; this is within the guidelines for Part 3 of the test. Another example of an examiner modifying a question in Part 3 was found in 0125, lines 290-295.

On rare occasions, candidates ask for a question to be explained. Sometimes examiners follow the guidelines and simply repeat the question, and sometimes this enables the interaction to proceed on track, as in the extract below.

Extract 4
73  E: and what do you think (0.7) erm (2.1) what do you think the role of
74     public transport will be in the future here in Albania? (2.6)
75→ C: what do you mean? (0.3)
76  E: what kind of role does it (1.0) will it have in the future? (1.1)
77  C: oh (.) what role? (1.2) well (2.0) the same as now I think (.) the (.) greater
78     part the people mostly erm (0.7) travel or (.) be with the public transport
79  E: (2.1) you don’t think that will change? (2.8)
80  C: no (0.4)
(Part 3)

Note that in this case the candidate’s repair initiation at line 75 does not claim trouble in understanding the words of the utterance, but rather the intended meaning of the prompt. The examiner does not produce a verbatim repetition in response; instead, she repeats only the key words from the original prompt. In his ensuing uptake (line 77) the candidate displays that his trouble had been with “what role”, indicating that the trouble was not just one of meaning, but that understanding of the meaning was impeded by a lack of word recognition.

Sometimes the examiner’s repetition of the question does not enable the interaction to proceed, as shown in the data segment below. When a repair initiation (line 64) and the ensuing repetition (line 65) do not resolve the candidate’s trouble in understanding, his request for reformulation (lines 66-67) is declined implicitly, as per examiner instructions, and the sequence is aborted (line 68).

Extract 5
63  E: what qualifications or certificates do you hope to get? (0.4)
64→ C: sorry? (0.4)
65  E: what qualifications or (.) certificates (0.3) do you hope to get (2.2)
66  C: could you ask me in another way (.) I’m not quite sure (.) quite sure about
67     this (1.3)
68  E: it’s alright (0.3) thank you (0.5) uh:: can we talk about your childhood?
69     (0.7)
(Part 1)

In the above extract we can see that there is no requirement for the examiner to achieve intersubjectivity or mutual understanding. The institutional aim is for the examiner to assess the candidate in terms of a specific band, and the candidate’s inability to answer even after repetition provides the examiner with data for this task; the examiner simply moves on to the next prompt. We should note, however, that the lack of requirement to achieve intersubjectivity produced by the test design creates a major difference between Speaking Test interaction and interaction in university seminars, tutorials and workshops, in which the achievement of intersubjectivity is a major institutional goal.

Sometimes examiners oblige the candidate and explain the question, contrary to instructions. Note that, in a similar way to the previous example, there are two successive repair sequences, each consisting of a repair initiation and a repair operation. In both cases, the examiner first repeats the prompt. In this case, however, the second repair operation consists of examples. It is noteworthy that once intersubjectivity is re-established, the candidate heavily recycles words from the helpful repair operation. The examiner’s deviation from the training manual thus seems to provide an advantage to the candidate.

Extract 6
50 E: what kind of shops do you prefer?
51 C: (1.0) shop? (.) er (0.3) do you explain perhaps for me please?
52 E: erm (2.4) what kind of shops do you like?
53 C: kind of shop?
54 E: → big shop? small shop?
55 C: ah ah yeah I understand (0.2) I like er big shop (0.2) I prefer big shop
(Part 1)

Sometimes candidates ask for clarification of a question. According to the guidelines, examiners may provide such clarification in Part 3, as in the example below.

Extract 7
79 E: first of all (0.3) er if we could (.) look a little at public and private
80    transport (.) em could you evaluate for me (0.6) please (0.7) the
81    advantages of private and public transport?
82 C: when you mean private and public transport you mean like er (.) private
83    for example a family goes alone or you mean like like private owned?
84 E: no with a car or something like that (1.4)
(Part 3)

Examiners are briefed not to help candidates who are struggling: “Examiners should not prompt candidates who are struggling to find language.” (Instructions to IELTS Examiners, pp6). However, there are exceptions to these instructions: in Part 2, “When interaction has clearly broken down, or fails to develop initially, the examiner will need to intervene. This…may involve: repetition of all or part of the rubric (Part 1 or 2); the examiner asking: ‘Can you tell me anything more about that?’ (Part 2)” (Examiner Training Material 2001, pp6). In an exceptional case (Extract 8) with a very weak candidate, we can see an example of the examiner trying to help the candidate in Part 2.

Extract 8
78 C: yes (0.8) er I travelled in er (.) in er (inyana) (eryana’s) very (.) very (like)
79    and er (.) I went (.) I went er to: (1.4) I went there for er the job (0.3) and
80    er (5.4) and er (0.8)
81 E: did you enjoy your trip or not (.) how did you go there? (.) you went to
82    (Indiana)
83 C: yes
84 E: → how did you travel there? (12.7) did you go by train did you go by plane?
85 C: er (0.3) I went er (1.0) to the bus and erm (.) I went erm to my parents and
86    erm (2.1)
87 E: did you enjoy the trip? (1.0) ((name omitted))
88 C: er yes I (1.8) I enjoy the er (7.1)
(Part 2)

Above we see the examiner rephrasing the question in line 84 and then simplifying it by offering ‘train’ and ‘plane’ as alternatives.

Examiners are instructed not to correct candidate utterances, and instances of correction are indeed very rare. In Extract 9, with a weak candidate, we see an example of correction.

Extract 9
1 E: can you tell me where you come from? (name omitted)
2 C: ahh I come from I I I come from Korea.
3 E: alright um where where in Korea?
4 C: er Korea is um
5 E: → no no not where is Korea where in Korea which city?
6 C: it’s Asia Asia
7 E: (1.0) I know that
8 C: → it’s Seoul Seoul
(Part 1)

In the above extract, the examiner initiates repair of the candidate’s answer in line 5 before s/he has completed it. Eventually in line 8 the candidate is able to self-repair successfully. In line 5, the examiner initiates repair on her own prior utterance in light of the fact that the candidate’s answer in line 4 displays a wrong understanding of what the examiner said in line 3. Note that the trouble source is “in”, which the candidate mistakes for “is”. In her third position repair, the examiner places
emphasis on “in”, yet this is not reflected in the candidate’s next response (line 6). After the examiner has rejected that answer (line 7), the candidate finally responds adequately to the prompt.

In this section we have seen that the examiner instructions relating to repair differ slightly across the different parts of the test. The vast majority of examiners adhere rigidly to these instructions; those examiners who do not follow the rules provide a clear advantage to their candidates.

3.1.2 Repetition of questions

The repetition of questions plays a key role in the Speaking Test and is therefore examined in detail here. The instruction manual states for Part 1 of the interview that examiners are to repeat the question only once (in case of trouble) and then to move on: “The exact words in the frame should be used. If a candidate misunderstands the question, it can be repeated once but the examiner cannot reformulate the question in his or her own words. If misunderstanding persists, the examiner should move on to another question in the frame.” (Instructions to IELTS Examiners, pp5). In the vast majority of cases, examiners adhere to this policy. Occasionally, however, some examiners do not, and we examine some instances of this below; the consequences of repeated repetition vary.

Extract 10
53 E: yes (0.3) was it a good place for children (1.1)
54 C: s- (0.3) beg your pardon ma’am (0.5)
55 E: → was it a good place (.) for children (0.3)
56 C: for children (1.2) eh well that’s definitely my whole ((°inaudible°)) (0.5)
57 E: → was it a good place (.) for children (0.7)
58 C: good place for children. (0.4) I’m sorry I’m not can you please be a bit
59    more specific I hope if you don’t mind so ma’am (0.6)
60 E: mm=
61 C: =I mean like I’m not getting you (0.4)
62 E: okay (0.3)
63 C: yeah exactly (0.4)
64 E: → was it a good (0.3)
65 C: oh [was ]
66 E: → [place]
67 C: it a good place I see [see I thought]
68 E: → [for children ]
69 C: that you were saying (.) what it s a good place like wa- (0.3) yeah
70    definitely it was (0.4)
71 E: mm hm (.)
(Part 1)

In the above case the question is repeated three times and the talk becomes a long repair sequence. When comprehension is finally achieved in line 69, only a very simple answer is provided and the candidate does not engage with the topic. In this case, then, repeated repetition does not help the candidate display a high level of proficiency in his/her answer.

Extract 11
102 E: and (0.6) eh where (0.4) did you usually play. (1.7)
103 C: play (0.9)
104 E: play (.)
105 C: yes=
106 E: → =where (0.4)
107 C: like eh (0.7) cricket ((inaudible))=
108 E: → no where did you usually play (1.6)
109 C: so[rry ]
110 E: → [where] (0.6) where did you usually play (1.6)
111 C: sorry I can’t get that (0.4)
112 E: → where (0.7)
113 C: where (1.1) I usually play? (1.3) mm s- eh (.)
114 E: → when you were a child (0.5)
115 C: yeah (0.5)
116 E: → where <did you play> (0.5)
117 C: eh well (.) eh as I told you that as this is divided into portions and
118     particularly the first portion which I (.) don’t like,
(Part 1)

In Extract 11, there is a repair initiation in the form of a partial repeat in line 103. The examiner repeats the question no fewer than five times without obtaining an answer by the end of the cycle. Again, repetition as a repair operation does not always work well.

Extract 12
116 E: I see (.) alright (1.9) do you generally enjoy (.) travelling (1.5)
117 C: sorry (0.6)
118 E: do you generally (.) enjoy (.) travelling (1.3)
119 C: eh (1.5) I think eh I want to eh eh (0.3) drive home in the car (1.0)
120     because eh (.) all the facilities and eh (0.3) mm save time car (0.5) car
121     save you time (0.3) and it give you [much]
122 E: [but ] do you enjoy (.) travelling
123     (2.3)
124 C: eh travelling? (0.5)
125 E: mm hm (0.7)
126 C: yeah I (0.6) eh (0.3) get travelling (0.4) eh (.) in (0.3) trip (0.9)
127 E: → do you enjoy (0.4) travelling (1.6)
128 C: yeah I have eh (0.4) fond of travelling eh somewhere (0.7) so because
129     eh (.) travelling it give you some time (0.3) to fresh your mind (0.5) and
130     eh (0.3) eh because eh life (0.3) is now (.) very eh h- eh (.) quick (0.3)
131     indeed eh (.) and we have not much time to travel (0.4) so it
132     give [you some] freshness
(Part 2)

In the above case the examiner repeats the question three times, and this enables the candidate finally to provide an appropriate answer after the examiner has stressed the key word in line 127. In this particular case, the examiner has ignored the instructions and this has given a distinct advantage to this particular candidate. Other examples of excessive repetition in the data are: 0127, lines 60 onwards (three repetitions); 1106, lines 23 onwards (two repetitions) and lines 97 onwards (two repetitions); 0272, lines 38 onwards (four repetitions); and 0836, lines 23 onwards (three repetitions).
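
The near-verbatim character of such repetitions means they could also be located mechanically. The following sketch is an illustration under our own assumptions (pause notation removed beforehand; a similarity threshold of 0.75 chosen to tolerate small transcription differences), not the procedure used in this study; it uses Python’s standard difflib to flag examiner turns which closely repeat an earlier turn.

import difflib

def is_repetition(turn_a: str, turn_b: str, threshold: float = 0.75) -> bool:
    # Treat two turns as a repetition if they are near-verbatim matches;
    # a ratio of 1.0 means the turns are identical.
    ratio = difflib.SequenceMatcher(None, turn_a.lower(), turn_b.lower()).ratio()
    return ratio >= threshold

def count_question_repetitions(examiner_turns: list[str]) -> int:
    # Count how many turns near-verbatim repeat some earlier turn.
    return sum(
        1
        for i, turn in enumerate(examiner_turns)
        if any(is_repetition(turn, earlier) for earlier in examiner_turns[:i])
    )

# Simplified examiner turns from Extract 12, pause notation removed:
turns = [
    "do you generally enjoy travelling",
    "do you generally enjoy travelling",
    "but do you enjoy travelling",
    "do you enjoy travelling",
]
print(count_question_repetitions(turns))  # -> 3: three turns repeat the first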

3.1.3 Lack of uptake to the prompt

We now consider what happens if the candidate does not answer the question directly. The instructions for examiners in this area are different for the three parts of the Test, as follows.

“How is the rating affected if the candidate answers the prompt without actually answering the question? This may be an indicator of inadequate lexis, and therefore that the candidate can only deal with familiar topics. It may also indicate a prepared answer.” (IELTS Examiner Training Material 2001, pp69, Part 1).

“The candidate misunderstands the task and talks off the topic. Let them go ahead; assessment is still on their ability to talk at length and unassisted for the required time…” (IELTS Examiner Training Material 2001, pp70, Part 2)

“The candidate does not seem to answer the questions directly. This may indicate inadequate lexis, or simply be a roundabout way of dealing with a difficult topic. It is a judgement call for the examiner.” (IELTS Examiner Training Material 2001, pp73, Part 3)

The vast majority of examiners follow these instructions and do not take action if candidates do not answer the question directly. However, in some exceptional cases, examiners do treat failure to provide a direct answer as trouble. The extracts below demonstrate a variety of behaviours by examiners.

Extract 13
36 E: alright (0.7) now let’s move on to talk about some of the activities
37    you enjoy in your freetime (.) right? (1.5)
38 C: alright. Yeah (1.0)
39 E: when do you have free time? (1.5)
40 C: well I love to play on the computer (0.7) I love to travel with my
41    family to my farm (.) because I have a farm (.) and next to (5.6) and here
42    in (Bella renoche) I can say that I have a little stressed life? (1.0) because
43    I don’t have time to do my stuff (0.7) well I (0.3) I (0.7) I like to be with
44    my friends (0.8) I like to go out with my friends (.) I like to go to the
45    movies (1.4) I like to be with my girlfriend (1.0) yes (1.2)
46 E: what free activities are most popular among your friends? (1.3)
47 C: most popular? well (0.7) study (0.4) at weekends (0.8) we have to study
48    because our course is=
49 E: → =so would you call it free time activities? =
50 C: =no (1.2) not free time activities free time activities we go to parties (0.7)
51    we go to the movies (1.6) and we travel together (1.9)
52 E: alright and how important is free time in people’s lives? (0.7)
(Part 1)

In line 49 the examiner asks a supplementary unscripted question which implies that the question has not been answered and which provides the candidate with an opportunity to self-repair. In this case s/he is able to do so and provides a direct answer.

Extract 14
40 E: okay (0.6) let’s talk about public transport (0.5) what kinds of public
41    transport are there (0.3) where you live (2.0)
42 C: it’s eh (0.5) I (0.4) as eh (0.4) a (0.3) person of eh (0.4) ka- Karachi, I
43    (1.1) we have many (0.8) public transport problems and (0.7) many eh
44    we use eh (0.4) eh buses (0.4) there are private cars and eh (.) there are
45    some (0.3) eh (0.4) children (0.4) buses (0.8) and eh (1.9) abou- (0.2)
46    about the main problems in is the (0.4) the number one is the over eh
47    speeding (0.5) they are the oh eh (0.5) the roads (0.8) and eh (.) they are
48    [on]
49 E: → [I ] didn’t ask you about the problems (0.6) my question was (0.6) what
50    kinds of public transport are there (.) where you live (0.7)
51 C: oh s- (.) sorry (0.5) eh I there (.) I live in (0.5) ((inaudible)) (0.4) so I
52    have eh (0.3) eh (0.4) t- we have there eh (0.4) private cars (0.5) and
53    some read
54    about the taxis and eh (0.3) local buses (0.5)
(Part 1)

In line 49 above the examiner explicitly treats the candidate’s answer as trouble in that it did not provide a direct answer to his/her question, even though it was on the general topic of public transport. In this instance, the candidate is able to provide a direct answer.

In the data, the vast majority of examiners follow the instructions in relation to indirect candidate answers. In some cases, examiners do initiate repair of indirect answers, and this generally results in candidates supplying direct answers.

3.1.4 Vocabulary

“The examiner should not explain any vocabulary in the frame.” (Instructions to IELTS Examiners, pp5.)

Extract 15
63 E: what qualifications or certificates do you hope to get? (0.4)
64 C: sorry? (0.4)
65 E: what qualifications or (.) certificates (0.3) do you hope to get (2.2)
66 C: → could you ask me another way (.) I’m not quite sure (.) quite sure about
67    this (1.3)
68 E: it’s alright (0.3) thank you (0.5) uh:: can we talk about your childhood?
69    (0.7)
(Part 1)

In the above extract the examiner follows the instructions perfectly, declines the request for clarification and moves on to the next question. In the data the vast majority of examiners follow the instructions in this way.

Extract 16
40 E: uh so (0.5) ↑how would you improve (0.4) the city you live in (1.8)
41 C: I:: (0.8) how do I pro::ve? (0.2)
42 E: how would you impro:ve (.) the city (0.3)
43 C: sorry I don’t know (.)
44 E: improve? (0.3)
45 C: yeah (.)
46 E: → how would you make the city better? (0.3)
47 C: → o::h yes (0.5)
(Part 1)

In Extract 16 the examiner does not follow the brief, explains the vocabulary item by providing a synonym and thus gives an advantage to the student, who indicates comprehension in line 47.

Extract 17
71 E: okay (0.3) uh:m what d’you think is the most important (0.6) household
72    task? (1.4)
73 C: household task? (0.4)
74 E: mm=
75 C: =uh:m sorry I [can’t ]
76 E: → [most importa]nt job (.) in the house (0.8)
77 C: → in the house (1.5) uh:m (0.7) I think (0.4) the: most important job is (.)
78    cleaning hh (0.5) because my house is quite big (0.3)
(Part 1)

In a similar instance above, the examiner does not follow the brief, explains the vocabulary item by providing a synonym in line 76 and thus gives an advantage to the student, who is able to provide an answer in line 77.

Extract 18
236 C: hh because you know uh:: uh (0.3) we don’t have uh:::m materials here
237     we don’t have uh:: fuel or (.) petrol we don’t have uh:::: (0.2) ·hh
238 E: → resources (.) nature[al resour- ]
239 C: → [that’s right w]e don’t have natural resources
240     ((inaudible)) (0.8) uh: (0.3) we we have to work on on tourism
(Part 3)

In Extract 18 the examiner helps by supplying vocabulary to the candidate in line 238, which s/he subsequently employs in line 239. Although examiners have more flexibility in Part 3, they do not have a brief to supply vocabulary. In this sub-section, we have seen that the vast majority of examiners follow the instructions not to explain vocabulary. In some rare cases, examiners do not follow the instructions and provide an advantage to these candidates, who are generally able to exploit this help.

We can summarise the section on repair as follows. The organisation of repair in the Speaking Test is highly constrained and inflexible, and this is intended to ensure standardisation. In Part 1, the candidate may only initiate repair by requesting a single repetition of the question – no reformulation is permitted. The examiner rarely initiates repair. If a candidate turn is incomprehensible, error-ridden or irrelevant, there is no brief for the examiner to initiate repair in order to achieve intersubjectivity, except for the single repetition as and when requested. This is because candidate turns are produced for evaluation by the examiner. The design of repair in Part 1, then, has been tightly constrained in relation to the institutional goal of standardisation and fairness.

How does repair in the IELTS Speaking Test compare to that in other settings? In general, the organisation of repair in the IELTS Speaking Test differs very significantly from that described for ordinary conversation (Schegloff, Jefferson & Sacks, 1977), for L2 classroom interaction (Seedhouse, 2004) and for university interaction (Benwell, 1996; Benwell & Stokoe, 2002; Stokoe, 2000), the last of these being the target form of interaction for most candidates. The literature on the latter two settings shows that many different forms and trajectories of repair are used in them. The lack of any requirement (in Part 1) to achieve intersubjectivity produced by the test design creates a major difference between Speaking Test interaction and interaction in university seminars, tutorials and workshops, in which the achievement of intersubjectivity is a major institutional goal.

The organisation of repair is rationally designed in relation to the institutional attempt to standardise the interaction and thus assure reliability. However, given that the organisation of repair is unusual and cannot be anticipated by candidates, the worry is that some candidates may become confused and their test performance lowered. The evidence for this is that (as we have seen) some candidates request explanations of questions and multiple repetitions. In the IELTS Handbook and the website available to students, and in most IELTS preparation books we examined, there was no statement on the organisation of repair; it was detailed, however, in ‘IELTS On Track’ (Slater, Millen & Tyrie, 2003). It is unclear to what extent candidates are aware of these repair rules. A mock Speaking Test may prepare candidates for them, but it is unclear how many candidates will have taken one. We would recommend that a very brief statement be included in written documentation for students, eg: “If you don’t understand a question, you may ask the examiner to repeat it. The examiner will repeat the question only once. No explanations or rephrasing of questions will be provided.” A further recommendation is that examiners state the rules for repair towards the end of the opening sequence. An example of this practice is described in Egbert (1998).

Overall, the organisation of repair in the Speaking Test has a number of distinctive characteristics. Firstly, it is conducted according to strictly specified rules, in which examiners have been briefed and trained. Secondly, the vast majority of examiners adhere rigidly to these rules, which are rationally designed to ensure standardisation and reliability; the few examiners who do not follow the rules provide a clear advantage to their candidates. Thirdly, the nature and scope of repair is extremely restricted because of this rational design; in particular, exact repetition of the question is the dominant means by which examiners respond to repair initiations by candidates. Fourthly, there is no requirement to achieve intersubjectivity in Part 1 of the Test.

3.2 Turn-taking and sequence

The overall organisation of turn-taking and sequence in the Speaking Test closely follows the examiner instructions. Part 1 is a succession of question-answer adjacency pairs. Part 2 is a long turn by the candidate, initiated by a prompt from the examiner and sometimes rounded off with questions. Part 3 is another succession of question-answer adjacency pairs, with slightly less rigid organisation than Part 1. This tight organisation of turn-taking and sequence is achieved in two ways. Firstly, the examiner script specifies this organisation, for example: “Now, in this first part, I’d like to ask you some questions about yourself.” (Examiner script, January 2003). Secondly, many candidates have undertaken training for the Test, which in some cases will have included a mock Speaking Test.

3.2.1 The introduction section

“One of the key features of the IELTS Speaking Test is the importance placed on making the candidate feel as relaxed and as much at ease as possible within the confines of an examination.” (Instructions to IELTS Examiners, pp3).

However, the administrative business in the introduction section sometimes works against this and has the potential to create interactional trouble at the very start. In the introduction section, the examiner must create a relaxed atmosphere while also performing introductions and verifying the candidate’s identity. Because this administrative business has to take place before the Test proper begins, a switch of identity is involved for both participants, which may work against the intention to create a relaxed atmosphere. When verifying ID, the professional adopts a gatekeeping or administrative identity and a quasi-policing function, while the candidate has the identity of person-being-identified. When this business is concluded, the identities switch to examiner and candidate. The ‘policing’ function is evident in the extract below.

Extract 19
1 E: could you (0.4) tell me your full name please (0.6)
2 C: ((name omitted)) (0.7)
3 E: thank you and (0.4) do you have your identification with you please
4    [that’s]
5 C: [yes ] exactly I sure do have the passport! (1.1) and I do have the
6    national I.D. card (0.8)
7 E: I think it’s your (0.4) oh (5.0) passport that I need
8 C: (4.0) yes please
9 E: → (6.3) is this you? (0.3)
10 C: exactly ma’am I didn’t have my moustaches so that’s why (0.4) I went for
11    a clean shave (0.7) so that’s why I’ve got a chin (0.4) I’m s- (0.5)
12 E: → you look older on that one=
13 C: =yeah exactly (0.6) that’s my mummy told me the same thing
14 E: → (27.2) right (1.2) thank you (0.5)
(Part 1)

In Extract 19, then, the administrative business works in opposition to the aim of creating a relaxed atmosphere. The examiner challenges the candidate twice in relation to his identity, and the 27.2-second pause before the examiner finally accepts the candidate’s identity is by far the longest pause in the data.
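
As an aside, timed pauses of this kind are mechanically recoverable from Jefferson-style transcripts, so a claim such as “the longest pause in the data” can be double-checked automatically. A minimal sketch, assuming (as in the extracts in this report) that timed pauses are always notated as digits within single parentheses:

import re

# Jefferson-style timed pauses, eg (0.4) or (27.2). Untimed micro-pauses "(.)"
# and double-parenthesis comments like ((name omitted)) are not matched.
PAUSE = re.compile(r"\((\d+(?:\.\d+)?)\)")

def longest_pause(transcript: str) -> float:
    # Return the longest timed pause, in seconds, notated in a transcript.
    return max((float(m.group(1)) for m in PAUSE.finditer(transcript)), default=0.0)

line = "14 E: (27.2) right (1.2) thank you (0.5)"
print(longest_pause(line))  # -> 27.2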

Extract 20
1 E: .hh well good evening=my name is ((first name)) ((last name))=
2    can you tell me your full name please.=
3 C: =yes ((first name,)) ((last name.))
4 E: .hh ah: a:n[d,
4b C: [ghm=
4c E: → =can you tell me er, what shall I ca:ll you.
4d (1.5)
5 C: e:r (1.0) can you repeat the: er the question[(s),?
6 E: [( ) what do you, (0.2)
7 E: your first name? do you use [((last name))
7b C: [( ) ((first name))
8 E: ((first name)). [((first name)) (you want me to call you) ((first na[me))
8b C: [yes ((first name)) [yes
9 E: °right.° ((with forced sound release))
10 E: .hh and can I see your identifi<cation: card please.>
10b (0.5)
10c C: .h[hm-
11 E: → [an ID. .hh er: not a student card=do you have an I [D °card? °
12 C: [↑e::::m
13 C: no::=in, (0.2) tch! no.
13b (0.5)
13c C: tch! er I don’t er (0.2) .h I don’t have (1.3) the: (1.0)
13d administration,=er: the day.
14 E: → m:: I understa:nd but you erm .h need to ha:ve, a: tch! (0.2) your official,
15 C: yes
16 E: → ID card.
17 C: ye:s.
17b (1.5)
18 E: .hh thank yer .hh erm in this first part↓ I’d like to s=ask some questions
19    about your↑self. .hh em >well first of all can you tell me where you’re<
19b  fro↑m↓
(Part 1)

The above introduction sequence creates considerable interactional problems, and a full analysis of the test (Appendix 2) suggests that the candidate is thrown by this initial sequence and never recovers.

The question “What shall I call you?” created significant problems for the candidate above and, very occasionally, for candidates in other tests. Sometimes the question-and-answer sequence is negotiated smoothly, as in Extract 21.

Extract 21
1 E: could you tell me your full name please (.)
2 C: yes (.) I’m ((name omitted)) (0.6)
3 E: thank you (0.6) and (.) what shall I call you (.) ((name omitted))? or
4    (0.9)
5 C: ((name omitted)) (0.6)
6 E: right (0.7) my name’s ((name omitted)) (0.8) em (0.4) can I see your
7    identification please (0.4)
(Part 1)

The examiner asks the question and the candidate provides a nickname without trouble arising. However, the examiner does not actually use the nickname later on during the course of the interview, so it is unclear what the purpose of asking the question is. See also 0394, lines 5-9, for another example of this. We should also note that, in cases where candidates do have a nickname or pet name which is different from their ID name, they sometimes volunteer this (see 0126, line 1 for another example):

Extract 22
1 E: good afternoon my name is ((name omitted))
2 C: my name is ((name omitted)) oh well you can call me ((name omitted))
3    because I was studying university everybummy (0.3) everybody call me
4    ((name omitted)) so (0.5) everybody (0.7) because this ((name omitted))
5    is quite close to my given name at first ((name omitted)) ((spelling out
6    name)) (.) and ((spelling out nickname)) so (0.7) s-=
7 E: =o[kay]
(Part 1)

As this question can cause problems for candidates, and as candidates sometimes volunteer a nickname if they have one, it is recommended that the question be deleted.

3.2.2 Transition between parts of the test and between question sequences

Transitions between sequences are marked more or less explicitly by examiners in accordance with their written script. An example of the change from Part 2 to Part 3 of the test can be observed in lines 217-220 of the following segment.

Extract 23
216 E: mm h[m ]
217 C: [and] (2.0) and and and most people I know (1.2)
218 E: alright we’ve been talking this piece of equipment which you find useful
219     (0.6) and I’d like to discuss with you one or more general questions
220     related to this (0.6) okay? (0.2) comes to the first of all (0.3) attitudes to
221     technology (1.2) can you describe the attitude of all the people (.) in
222     modern technology (0.7)

Although the examiner above does not specify that s/he is moving from Part 2 to Part 3 of the test, the wording implies a transition from a previous focus to a new but related focus.

We now consider what examiners say on receiving an answer from the candidate and to mark the transition to the next question within Part 1 of the test.

Extract 24
25 E: → okay so what do you like most (0.3) about your studies (1.7)
26 C: uh the variety (0.4) I think in: medicine especially because no: two
27    patients will present the same way (0.4) and i- it’s always a challenge to
28    figure out what the diagnosis is (0.3) and uh ways in which you can (.)
29    confirm the diagnosis °basically° (0.2)
30 E: → okay (0.4) are there any things you ↑don’t like about your studies? (2.7)
31 C: well personally the fact tha:t (.) if I read something I have to read it again
32    you know to remember it (.) it’s just a lot (.) the volume of work is very
33    very large so it’s just (0.2) time management (0.2) and learning to deal
34    with the: (0.2) °(volume of work)° (0.3)
35 E: → okay (0.7) so uh:: what qualifications or certificates (0.8) do you hope to
36    get (1.3)
(Part 1)

Page 176: Vol6 Full Report

6. The interactional organisation of the IELTS Speaking Test – Paul Seedhouse + Maria Egbert

In the Test from which the above extract is taken, the examiner says ‘okay’ 21 times at the start of the receipt slot (the point directly after the candidate’s answer); seven of those instances are a double ‘okay’, and the end of the test is marked with a triple ‘okay’.
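
Tallies of this kind can likewise be produced mechanically. The sketch below is illustrative only, under the assumption (ours, matching the extracts in this report) that each examiner turn begins on a new line with a line number, an optional analyst’s arrow and the label ‘E:’; it counts which marker opens each examiner turn.

import re
from collections import Counter

# Match examiner turn openings such as "26 E: → mm hm, (0.4)" and capture
# the turn text after the line number, label and optional arrow.
EXAMINER_TURN = re.compile(r"^\d+\s*→?\s*E:\s*(?:→\s*)?(.*)$")

def receipt_markers(transcript: str) -> Counter:
    # Tally whether 'okay' or 'mm hm' opens each examiner turn.
    counts = Counter()
    for line in transcript.splitlines():
        m = EXAMINER_TURN.match(line.strip())
        if not m:
            continue
        turn = m.group(1).lower()
        for marker in ("okay", "mm hm"):
            if turn.startswith(marker):
                counts[marker] += 1
    return counts

Applied to a complete test transcript, this would allow per-test tallies of the kind reported above to be checked consistently, subject to those formatting assumptions.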

We now consider how examiners signal to the candidate that they want to listen further. The Training Manual lists items which in the CA literature have been termed “continuers” (eg Goodwin, 1986). These display understanding to the current speaker and indicate that the listener passes up the opportunity to take the next turn. “Examiners should keep non-verbal interjections to a minimum. (Eg ‘um’, ‘right’, ‘uh uh’.)” (IELTS Examiner Training Material 2001, pp6). “How do examiners acknowledge something candidate has said? By adopting a listening pose and maintaining eye contact. NOT by commenting or giving too much audible acknowledgement.” (IELTS Examiner Training Material 2001, pp69, Part 1). While the audio tapes do not allow us to examine the non-vocal aspects of the interaction, the transcripts indicate that examiners make frequent use of continuers.

Extract 25
15 E: you some questions about yourself (0.7) em (0.3) let’s talk about what
16    you do (.) do you work or are you a student (1.0)
17 C: actually: (1.1) I- no (.) I am not a student right now (0.3)
18 E: → mm hm (.)
19 C: I did my (.) engineering some (0.3) three years back (0.4)
20 E: → mm hm (0.6)
21 C: and then I started working for my father (0.6) and (0.6) family for (0.3)
22 E: → mm [hm]
23 C: [it’s] construction business I’m in (.)
24 E: → mm hm, (0.7) okay so tell me about your job (1.5)
25 C: right now (0.5) we don’t have a job at all (0.5)
26 E: → mm hm, (0.4)
(Part 1)

The examiner in the above extract uses ‘mm hm’ to pass up taking the turn, and ‘okay’ to mark that the answer turn is finished and that the examiner will produce another question. Generally in the data, ‘mm hm’ provides a non-committal, non-evaluative display of attention, while ‘okay’ marks receipt of a complete turn and transition to the next question. In neither case does the candidate know the degree of the examiner’s understanding.

The issue of examiners’ use of continuers is of particular importance in relation to Part 2 of the Test. In many transcripts there is no verbalised feedback from the examiner at all during Part 2, for example in transcript 0415.

Extract 26
244 C: so this is a need of (.) this thing (0.7) so (1.1) some people use (.) eh
245     are using (.) these things (.) eh this thing but (0.3) not most of the
246     people (.)
247 E: mm hm=
248 C: =so in my view it is (.) eh (0.9) eh it should be (1.2) the: necessity (.) of
249     our >home town< ↑not my home towns (0.5) all the countryside a-
250     actually all seventy per- eh percent of population is living in the (.)
251     countryside (.)
252 E: mm hm (.)
(Part 2)

In Extract 26, by contrast, the examiner uses ‘mm hm’ more frequently, a total of five times throughout the test. There are arguments for consistent conduct by examiners in the use of markers in the receipt slot and at transition-relevance places (points at which turn change can occur). The use of ‘okay’ and ‘mm hm’ appears suitable: it is designed by examiners, and understood by candidates, to be non-evaluative, and it does not generate any instances of trouble in the data.

We would therefore recommend, in the interests of consistency and standardisation, that examiner instructions specify that ‘okay’ be used in the receipt slot to mark transition to the next question, and that ‘mm hm’ be used as a continuer, ie as a signal that the candidate is encouraged to continue talking. This would be particularly useful in Part 2. A more systematic video analysis would be necessary to shed light on the use of body posture, eye contact, head movements, handling of the written materials and similar behaviours in connection with turn transition, signals of understanding and displays of section closings.

3.2.3 Evaluation

The Instructions for Examiners tell examiners to avoid expressing evaluations of candidate responses: “Do not make any unsolicited comments or offer comments on performance.” (IELTS Examiner Training Material, 2001, pp5). It is very noticeable in the data that examiners do not verbalise positive or negative evaluations of candidate talk, with some very rare exceptions. In this respect the interaction is rather different from interaction in classrooms of all kinds, in which an evaluation move by the teacher in relation to learner talk is an extremely common finding, in L1 classrooms (eg Mehan, 1979) as well as in L2 classrooms (Westgate et al, 1985). It is also different from interaction in university settings (Benwell, 1996; Benwell & Stokoe, 2002; Stokoe, 2000).

Examiners follow these instructions, and we found only a very few aberrant cases. In the following two data excerpts, examiners produce evaluations of candidate talk.

Extract 27
16 C: eh (1.3) actually eh (0.3) it’s very interesting job eh (.) it is (0.3)
17    especially in my eh (0.8) department (0.4) that is specialised (.)
18    department that is eh (0.3) microbiology (0.8) in eh eh interesting
19    ((inaudible)) (0.5) [I enjoy it]
20 E: → [yes yes ] mm yes (0.3) good (0.5) are there any
21    things you don’t like about your work (1.1)
22 C: lot of things I like to do (0.5) as a pharmacist because eh (.) pharmacist
23    are (1.1) eh complicated persons in pharmaceuticals so=
24 E: =yes (0.9)
25 C: eh (0.3) but the whole department is (0.4) very interesting for me (0.9)
26    [mm hm]
27 E: → [good! ] (1.0) eh (.) do you have any plans to change your job in the
28    future (1.1)
(Part 1)

Extract 28
108 E: ((inaudible)) and have you any plans to change your job? (1.7)
109 C: na::h (0.4) I don’t think I will change my job? After I come back to
110     Vietnam (.) because when I came here (0.4) to New Zealand (0.3) I quit
111     my job (.) but my ex boss said that I could return to my office (.) if I
112     wish to (0.4) but I think that it’s time for me to set up my own business=
113 E: → =very good!=
114 C: =yeah (0.4) I plan to (.) set up my business to ((inaudible)) educational
115     (1.1) I set up my business (0.6)
116 E: → very good (1.7)
(Part 1)

L2 teachers often provide positive or negative evaluations of learner talk when teaching in class. However, when the same teachers assume the examiner role in a Speaking Test, they generally do not verbalise evaluations of candidate talk. The explanation appears to lie in the rational design of these two different varieties of institutional talk. In the L2 classroom the institutional goal is that “the teacher will teach the learners the L2” (Seedhouse, 2004:183). In this institutional setting, positive or negative evaluations of learner talk are formative and designed to help the learners learn; the instructor’s main aim, at least in many teaching methods, is to teach and to evaluate learner talk. In the IELTS test, by contrast, the institutional goal is to “…assess the language ability of candidates…” (IELTS Handbook, pp2). The Speaking Test is not part of an ongoing programme of study. Moreover, a summative evaluation of language ability is provided formally and in writing after the Speaking Test has taken place: the examiner’s aim is to provide an assessment, but the result is not given to the candidate immediately.

It may be that one way in which examiners talk a formal examination into being is precisely by avoiding the positive or negative evaluations of learner talk typical of the classroom. Examiner behaviour here is a striking example of professional caution and of asymmetry of access to knowledge, ie to the evaluation and scoring of learner talk. This lack of positive or negative evaluations of candidate talk appears to be related to the rational design of the institutional setting and is therefore appropriate. However, we should note that it creates a striking difference between Speaking Test talk on the one hand and L2 classroom interaction and interaction in universities, the target destination for most candidates, on the other. We therefore recommend that candidates be informed about this aspect of examiners’ conduct beforehand.

3.3 Topic

In the Speaking Test, the topic of the talk is pre-determined by the central administration, is written out in advance and is introduced by the examiner. Candidates are evaluated on (among other things) their ability to develop a nominated topic (see IELTS band descriptors). Topic is intended to be developed differently in the different parts of the test: “Can examiners ask a follow-up question from something candidate has said? No.” (IELTS Examiner Training Material 2001, pp69, Part 1). “Can the examiner ask an unscripted follow-up question in Part 3? Yes.” (IELTS Examiner Training Material 2001, pp71).

Usually, candidates follow the examiner’s topic nomination wherever possible; there are, however, some very rare cases in which the candidate attempts to determine the topic. Note that in the example below (lines 125 ff), the candidate asks whether she can talk about a specific aspect of the prompted topic. This is denied: even in Part 3, the examiner does not allow the candidate to shift topic.

Extract 29
118 E: let’s talk about public and private transport (0.6) can you describe (.) the
119     public transport systems in your country (1.0)
120 C: I used to have eh the main eh (2.0) public transport and th- (0.3) the main
121     transport which are (0.3) which is used by the public are the (0.5) buses
122     (0.7) secondly if eh (0.3) there are some eh urgent eh they use the taxis
123     investment plans the banks are (.) given (0.5) and eh (0.3) the main eh is
124     the (0.5) the (1.8) transport is eh (1.5) is eh bad (0.7) today have eh (0.9)
125→ can’t I talk about the (0.3) problems (1.1)
126 E: → no=
127 C: =no (1.1)
128 E: just describe (.) the public transport systems [in you country ]
129 C: [eh describe the main]
130     transport system which [I ]
131 E: [okay] (0.5) now (0.5) I would like you to
132     evaluate (0.8) the advantages of private (.) and public transport (1.5)
133 C: okay (1.7) first eh (.) talking about (0.4) the eh (0.5) private transport eh
(Part 3)

In Extract 30 below, the issues of topic, interpretation of topic, question repetition and direct answers to questions converge. The examiner appears to engage with the topics the candidate talks about in lines 40, 42, 44, 46, 48, 50 and 52. The candidate’s answers are not direct answers to the question, but they are clearly related to the overall topic of childhood. Of special interest is the candidate’s response in lines 50 and 52, where he provides a completely logical answer which, however, is treated as a misunderstanding since, apparently, the examiner expected a different interpretation.

Extract 30
39 E: ·hh where did you grow up, (0.8)
40 C: eh (.) in my childhood I was eh very naughty! (.)
41 E: yes, (0.7)
42 C: I p- eh (.) I played with my er friends, (.)
43 E: yes (0.7) where did you play (0.7)
44 C: eh (0.7) t- cricket, (0.6)
45 E: ah yes (1.1)
46 C: eh (.) fly kiting, (0.6)
47 E: yes (0.9)
48 C: and eh othe::r (0.8) things eh (0.6)
49 E: and where (.) where did you grow up ((name omitted)) (1.6)
50→ C: em (.) grow eh with my (0.6) parents (.)
51 E: yes=
52→ C: =eh (.) m- my (.) especially my dad (0.6) very good eh (.)
53 E: I see (.) and where did you grow up (1.4)
54 C: ((inaudible)) (0.3)
55 E: where (0.5) did you grow up (1.3)
56 C: ((inaudi[ble))]
57 E: [yeah ] ·hh (.) okay (.) do you think childhood is different today
58    from when you were a child, (1.4)
(Part 1)

The examiner says ‘yes’ five times in response to the candidate’s turns, which appears to be positive evaluation. At the same time, however, the examiner repeats the question ‘where did you grow up?’ three times, showing that he is treating the candidate’s failure to provide an adequate response as trouble. The examiner’s ‘yes’ receipts and his question repetitions thus appear mutually contradictory, with one signalling approval and the other signalling trouble. The examiner is deviating from instructions in two regards: by expressing evaluations and by repeating the question multiple times.

3.3.1 Topic disjunction

In this section we examine instances in which scripted questions generate trouble and topic disjunction (in which the flow of topic is disturbed). We examine firstly the question “Would you like to be in a film?” (Part 1 of the Test), which causes trouble for a striking number of candidates. In the examiner script this follows the questions: “Do you enjoy watching films? How often do you watch films? Do people generally prefer watching films at home or in a cinema?” The interesting point is that in the script there is no indication that the question might be topic disjunctive, as it clearly continues the topic of films. However, in the flow of interaction, eight of the 32 candidates who were asked this question in the data found it difficult to understand, even in cases where the candidate had no problems understanding all of the other questions in the test. In the following two examples we see how the trouble and repair sequences typically unfold after this question.

Extract 31
57 C: .hh err I (0.4) watch most films (0.8) usually after work (1.5) er
58    sometimes sometimes I see two (.) film in a week (.) only
59 E: mm hm (1.8) would you like to be in a film (.) yourself?
60 C: (2.0) pardon. (1.1)
61 E: would you like to be in a film. (1.0)
62 C: err:: if I was:: an actor? (.)
63 E: hmm (1.0)
64 C: no I don’t. I don’t like it. (2.1)
(Part 1)

Extract 32
66 E: alright=
67 C: =yeah (0.2)
68 E: do- do uhm would you like to be (0.3) in a film (0.3)
69 C: oh I like going to the cinema (0.2)
70 E: but would you like to be in (0.3) a film (0.6)
71 C: uh::m (2.3)
72 E: → actress (0.8)
73 C: actress (.) actre::ss (0.9)
74 E: → would you like to be? (0.3)
75 C: yeah (0.9) I like=
76 E: =why would you like? (0.6)
77 C: uh::m (0.5) because (0.9) I I saw a film (0.4) include uh hero (0.3) and a
78    heroine (0.3) I think the heroine is very very beautiful (0.8) I really like it
(Part 1)

In Extract 32, we see that the examiner deviates from instructions by modifying the question in lines 72 and 74. This may be due to the ambiguity of the prompt. Other examples of trouble with this question can be found in 0099, lines 73 onwards; 0127, lines 83 onwards; 0394, lines 161 onwards; and 0144, lines 72 onwards.

We cannot know for certain why the question created so much interactional trouble for so many candidates. However, the explanation appears to involve a shift in perspectives. The previous questions about films involved the candidates in continuing their normal perspective as visitors to cinemas and viewers of films. The problem question involves an unmarked and unmotivated shift in perspective to a fantasy question in which candidates have to imagine they had the opportunity to be a film star. As we can see in the following extracts, some candidates say they have never thought about this and have difficulty with the shift in perspective:

Extract 33
78 (0.4) if I watch a film by video (0.7) it is cheaper than theatre (.) but if
79    I have a family (0.4) I choose (0.6) watching in my home (1.3)
80 E: right (.) right (0.5) would you like to be in a film? (1.1)
81 C: pardon (0.9)
82 E: would would you like to be in a film (0.8) like be an actress=
83 C: → =ahhh (0.4) I never think about that! hhh (0.6) of course if I have a chance
84    (.) of course haha huh huh (1.4)
85 E: ha ha huh of course (0.7) right
(Part 1)

Extract 34
66 cinema maybe you: just uh uhm can uh can see once (0.9)
67 E: would you like to be in a film ((name omitted))? (0.9)
68 C: sorry? (0.2)
69 E: would you like to be in a film (2.1)
70 C: → I: (0.3)
71 E: yes you (0.5)
72 C: no:: hh heh (0.7)
73 E: okay let’s talk about shopping now (.)
(Part 1)

Further examples of questions which cause trouble are now provided. The question below is “Could you speculate on how much of today’s technology will still be in use in 50 years’ time?”

Extract 35
148 E: thank you (0.6) and could you speculate (.) on how much of today’s
149     technology (0.7) w- may still be in use (.) in fifty years’ time (3.9)
150 C: °sorry° (0.8)
151 E: could you speculate on how much of <today’s technology> (0.9) will
152     still be in use (.) in fifty year’s time (0.3)
153 C: in fifty years time eh (0.5) there will be more advance ((inaudible)) (0.9)
154     to ((inaudible)) (0.7) more things will be in the market (.) available (0.6)
155     and more easy life (0.3) there will be (0.8)
(Part 3)

For a similar example, see 0338, lines 188-201.

It is unclear whether the trouble is lexical in nature (‘speculate’) or whether the change in perspective to the imaginary is problematic.

Extract 36
175 E: could you speculate on (.) future developments in the transport system
176     (4.6)
177 C: eh (.) in what sense (0.6)
178 E: well what do you think we’re likely to see in the future (.) how will
179     people travel (1.1)
180 C: eh (0.8) no (.)
181 E: any (0.6) further developments (1.0)
182 C: normally eh (.) the development could be made in the (0.7) in cars side of
183     the (0.3) transport (0.6) that eh (0.3) cars in more (.) fuel economised
184     (0.3) and eh (.) pollution aspect can be (0.3)
185 E: mm=
(Part 3)

The question in Extract 36 is slightly different from the preceding one. However, it contains the same lexical item and the same imaginary perspective.

In Extract 37, the scripted question is “Can we talk about your childhood? Are you happy to do that?”

Extract 37
63 E: → mm hm (0.9) now can we talk about your childhood (0.6) are you happy
64    to do that? (0.8)
65 C: eh (.) happy to repeat that? (.)
66 E: ah [eh]
67 C: [ha]ppy to remember that=
68 E: =are you happy to talk about your childhood (.)
69 C: eh (0.6) [ee ]
70 E: [now] where did you grow up (0.4)
71 C: → yes (.) not too quite happy (0.4) because it was (0.4) eh actually divided
72    into: eh multiple different portions (0.7) eh like I was born somewhere
73    else (.) not where (0.3) where I am living now=
74 E: → =mm so would you prefer to talk about some (0.3) something else? (0.8)
75 C: eh like (0.6) eh no no (.) I-I I mean to say (.) that I [don’t ]
76 E: [you’re] happy to talk
77    a[bout ]
78 C: [yeah]
79 E: it (0.3) so where did you grow up (1.4)
(Part 1)

In the above extract, considerable trouble arises due to confusion as to what exactly ‘happy’ is referencing: the candidate takes it to be referencing the topic of childhood and starts explaining that some parts of his childhood were happy and others not. In line 74 we see that the examiner takes this reply to mean that the candidate is not happy to discuss his childhood. This appears to be the only frame in which candidates are asked for their consent to discuss the topic; elsewhere they clearly have no choice. Candidates may find this a source of confusion.

In this section we have seen that a sequence of questions on a particular topic may appear unproblematic in advance of implementation, yet may nonetheless cause unforeseen trouble for candidates, especially if an unmotivated and unprepared shift in perspective of any kind is involved. Piloting of questions (if not already undertaken) is therefore recommended.

3.3.2 Recipient design and rounding-off questions

In a number of instances in the data, trouble arises in relation to specific rounding-off questions in Part 2. Their purpose is stated as follows: “The rounding-off questions at the end of Part 2 … provide a short response to the candidate’s long turn and closure for Part 2 before moving on to Part 3. However, there may be occasions when these questions are inappropriate or have already been covered by the candidate, in which case they do not have to be used.” (Instructions to IELTS Examiners, p 6).

These types of questions are sometimes topically disjunctive in practice as they may not fit into the flow of interaction and topic which has developed. “Does everyone you know use this piece of equipment?” is a rounding-off question to be used after a Part 2 talk on “a piece of equipment which you find very useful”. In a number of cases the question is experienced as disjunctive and problematic by candidates. In the extract below the candidate has described a computer.

Extract 38
202 E: okay (0.3)
203 C: indispensable (0.4)
204 E: okay (0.4) does everyone you know use this piece of equipment (1.0)
205 C: pardon? (0.5)
206 E: does does everyone you know↑ use this piece of equipment (0.6)
207 C: you mean my particular one? (0.7)
208 E: uh: not your I- but=
209 C: =a computer
210 E: right (1.0)
211 C: most people I know nowadays
212 E: mm hm
213 C: have access to a computer some use it more than others
(Part 2)


The above candidate has spoken fluently throughout the interview without repair, but encounters difficulty with this question, even after repetition. This may well be due to the scripted nature of the question: it is unusual for an object already referred to as “a computer” to be referred to later as “this piece of equipment”. A shift in perspective is also evident in the question; previously the talk had concerned the equipment which the candidate uses, and the shift is to whether other people s/he knows use it.

Extract 39
92 (0.2) or er (0.2) funny story (.) can make me er (0.3) erm er er (0.4) can
93 make me to relax
94 E: OK thanks (.) alright er does everyone you know er use the computer?
95 C: (3.0) actually er can you repeat please?
96 E: yeah (0.2) does every one (.) you know use the computer
97 C: (6.3) I think er computer is very useful for me (0.8) erm tend to
98 computer (0.2) I can er (2.3) er (2.3) I can er I can improve my language
99 E: uh hum, ok (.) so er do you enjoy using the computer?
100 C: yes I enjoy it very much
(Part 2)

In Extract 39, even after repetition, the candidate still does not understand the question. The examiner then switches to the other additional question, which is successfully answered.

In the extract below the candidate (a doctor) has described a stethoscope.

Extract 40
257 C: =so that really convinced me that (.) this is a key instrument for us (0.6)
258 and [I ]
259 E: [yes]
260 C: think it’s really helpful in diagnosing the diseases (0.3)
261 E: right (0.3) thank you (0.7) em (.) eh does everyone you know use this
262 piece of equipment (0.3)
263 C: eh sorry? (0.8)
264 E: does everyone you know (0.5) use this piece of equi[pment]
265 C: → [ah ] yes as I told
266 you that eh we (.) even in dramas and every person have eh
267 supposed to face a doctor som- eh (0.3) at one or the other time (0.6) so I
268 don’t think so (.) that this is an instrument eh (0.3) which is not well
269 known by the other people (0.5)
(Part 2)

As noted, the candidate is a medical consultant and the piece of equipment he described is a stethoscope. The question is topically disjunctive, and the candidate’s answer (lines 265 ff) shows a degree of confusion about the function of the question. Clearly, a stethoscope is a specialised piece of medical equipment, and it is not possible that everyone he knows uses it.

In Extract 41, the candidate is also a medical consultant and the piece of equipment he described is a colonoscope.

Extract 41
223 (0.3) so (.) em (0.3) we had (0.4) scope then (0.5) so it is used °to help us°
224 ((°inaudible°)) (0.2)
225 E: okay (.) thank you (0.3) and eh (0.3) does eh (0.6) anyone else you know
226 use this piece of equipment (0.9)
227 C: → em (0.6) in eh (0.3) well (.) every eh (0.3) I think all the specialists the
228 (0.3) mm in eh (.) in EST as (0.3) they use them (.) and em (0.9) in our
229 hospital (.) I’m in charge of this (0.6) equipment because I’m the senior
230 doctor (0.4) I teach them to my junior doctors (0.2)


231 E: mm=
232 C: =and the doctors the medical people also use it (0.3) gastro-enterologists
233 (0.3)
(Part 2)

In terms of recipient design, then, the examiner’s follow-up question (lines 225-226) seems very odd and disconnected from the previous flow of interaction. The candidate (who obtained a score of 8.0 and speaks extremely fluently elsewhere) shows definite signs of confusion in line 227. Clearly, a colonoscope is a highly specialised piece of equipment, and any question about whether other people use it is likely to sound strange. In this case we should perhaps just be grateful that the examiner did not ask the alternative rounding-off question “Do you enjoy using this piece of equipment?” Other instances of trouble in relation to rounding-off questions may be found in 0304, lines 117-121; 0589, lines 133-138; and 0099, lines 120-126.

We have seen that these rounding-off questions can appear disjunctive and actually create trouble when they are worded in such a way that they ignore the local context in which they are produced. We now examine three instances of examiners modifying the rounding-off question to provide good recipient design, which maintains the flow of the topic and interaction and avoids interactional trouble. In the extract below the candidate has described a mobile phone.

Extract 42
121 people can contact you. (0.5) anytime (0.7) because you use (.) your own
122 cell phone (0.5) and this is the big (.) advantage of mobile phone (0.4)
123 and that’s why (.) I use to prefer it ((°inaudible°)) (0.8)
124 E: → so (0.5) um (1.7) does everyone you know carry a mobile phone now?
125 (2.4)
126 C: just not (.) not much (1.2) mm lot of people (0.3) lot of people are not
127 carrying the mobile phone (0.4) but (0.9) eh what eh (0.3) in now (.) it’s
128 eh (0.4) thirty or forty percent (0.8) mm of people who work in offices (.)
129 and who are working in a marketing and (0.3) other places (.) they use
(Part 2)

In Extract 42 the examiner adapts the rounding-off question to the sequential environment and flow of topic. This proves effective in smoothly continuing and rounding off the topic, as well as enabling the candidate to understand the question and provide an appropriate answer.

Extract 43
160 writing skills (0.7) and it also helps you i::n improving your intelligence
161 and doing other things (0.6)
162 E: → °mm hm° okay thank you (2.7) does everyone you know (.) in your
163 family enjoy (.) writing? (0.9)
164 C: yes I do my elder sister is: uh: working (0.2) for a newspaper which is
165 called Times of India (0.3)
166 E: mm hm (0.2)
(Part 2)

In Extract 43 the candidate has described a pen. Again the question is adapted to the flow of interaction and to the candidate’s circumstances. Hence, the candidate is able to develop the topic very smoothly.

Extract 44
118 the plough is used to (.) it’s not very simple (.) it’s not very sophisticated
119 (.) but we call it appropriate technology (.) so it can be used (.) i’m sure
120 it’s very widely used in Botswana (.) because it’s always pulled by oxen
121 (.) they are pulled by oxen (.) needed to (.)
122 E: → does everyone you know use a plough like that to (.) in the village


123 where you live?
124 C: er (.) I could say sixty percent of people use the plough (.) because they
125 can not afford to pay for tractor
(Part 2)

In Extract 44 the candidate has described a plough. The question is adapted to include the specific item of equipment and a specific location, and the candidate is able to provide an answer without trouble.

In each of the three examples above, the examiners have used the name of the equipment rather than “piece of equipment” to refer to it, and in two cases the examiners have adapted the question to what they have learnt during the test about the candidate’s personal and local circumstances. Thus there is a case for training examiners in how to adapt the rounding-off questions slightly to fit seamlessly into the previous flow of the interaction. The training could include some of the examples given above, explain the topic disjunction problems which can arise with unmodified rounding-off questions and provide examples of questions which have been successfully adapted to topic flow. Training should also stress that the questions are optional and that in some instances it might not be possible at all to adapt them to the flow of the interaction.

4 ANSWERS TO RESEARCH QUESTIONS

The main research question is: How is interaction organised in the three parts of the Speaking Test?

Turn-taking, sequence and repair are tightly and rationally organised in relation to the institutional goal of ensuring valid and reliable assessment of English speaking proficiency. In general, the interaction is organised according to the instructions for examiners. In Part 1, candidates answer general questions about a range of familiar topic areas. In Part 2 (Individual long turn), the candidate is given a verbal prompt on a card and is asked to talk on a particular topic; the examiner may ask one or two rounding-off questions. In Part 3, the examiner and candidate engage in a discussion of more abstract issues and concepts which are thematically linked to the topic prompt in Part 2. The overwhelming majority of tests adhere very closely to examiner instructions. The test is intended to provide variety in terms of task type and patterns of interaction, and in general this is achieved. However, the interaction is very restricted in ways detailed below.

How and why does interactional trouble arise and how is it repaired by the interactants?

There are two basic ways in which interactional trouble may arise. Either a speaker has trouble in speaking (self-initiated repair) or something the other co-participant uttered is not heard or understood properly (other-initiated repair). In the interviews analysed, trouble generally arises for candidates when they do not understand questions posed by examiners. In these cases, candidates usually initiate repair by requesting question repetition. Occasionally, they ask for a re-formulation or explanation of the question. Sometimes interactional trouble can be created (even for the best candidates) by questions which are topically disjunctive, and a number of examples of this are provided.
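To illustrate how such other-initiations might be identified at scale across transcripts, the minimal Python sketch below counts candidate turns containing common repair-initiation markers. This is our own illustration rather than a tool used in the study; the marker list and the plain-text transcript format are assumptions based on the extracts above.

    import re

    # Typical other-initiated repair markers observed in the extracts above;
    # an illustrative assumption, not an exhaustive list from the study.
    REPAIR_MARKERS = re.compile(
        r"\b(?:sorry|pardon|can you repeat|in what sense)\b|\bwhat\?",
        re.IGNORECASE)

    def count_candidate_repair_initiations(transcript: str) -> int:
        """Count candidate turns containing an other-initiation marker."""
        count = 0
        for line in transcript.splitlines():
            turn = re.sub(r"^\s*\d+\w*\s+", "", line)  # strip CA line numbers
            if turn.startswith("C:") and REPAIR_MARKERS.search(turn):
                count += 1
        return count

    example = ("204 E: okay (0.4) does everyone you know use this piece of equipment (1.0)\n"
               "205 C: pardon? (0.5)")
    print(count_candidate_repair_initiations(example))  # -> 1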

Examiners very rarely initiate repair in relation to candidate utterances, even when these contain linguistic errors or appear to be incomprehensible. This is because the institutional brief is not to achieve intersubjectivity, nor to offer formative feedback; it is to assess the candidate’s utterances in terms of IELTS bands. Therefore, a poorly-formed, incomprehensible utterance can be assessed and banded in the same fashion as a perfectly-formed, comprehensible utterance. Repair initiation by examiners is not rationally necessary from the institutional perspective in either case. In this way, Speaking Test interaction differs significantly from interaction in classrooms and university settings, in which the achievement of intersubjectivity is highly valued and assumed to be relevant at all times. In those institutional settings, the transmission of knowledge or skills from teacher to learner is one goal, with repair being a mechanism used to ensure that this transmission has taken place.



What types of repair initiation are used by examiners and examinees and how are these responded to?

Repair policy and practice vary in the different parts of the test. Examiners have training and written instructions on how to respond to repair initiations by candidates. The examiner rarely initiates repair. Candidates initiate repair in relation to examiner questions in a variety of ways. In response to a candidate’s repair initiation, examiner instructions are to repeat the test question once only but not to paraphrase or alter the question. The vast majority of examiners follow the instructions, but there are exceptions. The organisation of repair in the Speaking Test is highly constrained and inflexible; it is rationally designed in relation to the institutional attempt to standardise the interaction and thus assure reliability. This results in a much narrower range of repair options than is available in ordinary conversation.

In general, then, the organisation of repair in the IELTS Speaking Test differs very significantly from that described for ordinary conversation (Schegloff, Jefferson & Sacks, 1977), L2 classroom interaction (Seedhouse, 2004) and university interaction (Benwell, 1996; Benwell & Stokoe, 2002; Stokoe, 2000), the latter being the target form of interaction for most candidates. In the data, the organisation of repair overwhelmingly follows the instructions for IELTS examiners in Part 1, which specify that the question can only be repeated once and may not be explained or reformulated.

What role does repetition play?

In Part 1, examiners are instructed to repeat the question once and then move on. In the vast majority of cases, examiners adhere to this policy. Occasionally, however, examiners do not follow these instructions, and the consequences of such repeated repetition vary.

What is the organisation of turn-taking and sequence?

The overall organisation of turn-taking and sequence in the Speaking Test closely follows the examiner instructions. Part 1 is a succession of question-answer adjacency pairs. Part 2 is a long turn by the candidate, started off by a prompt from the examiner and sometimes rounded off with questions. Part 3 is another succession of question-answer adjacency pairs. This tight organisation of turn-taking and sequence is achieved in two ways. Firstly, the examiner script specifies this organisation, eg “Now, in this first part, I’d like to ask you some questions about yourself.” (Examiner script, January 2003). Secondly, many candidates have undertaken training for the Test, and in some cases this will have included a mock Speaking Test.

What is the relationship between Speaking Test interaction and other speech exchange systems such as ordinary conversation, L2 classroom interaction and interaction in universities?

Speaking test interaction is a very clear example of goal-oriented institutional interaction and is very different to ordinary conversation; it should be noted here that the IELTS test developers’ primary aim was not to develop a Speaking Test in which the interaction mirrors ordinary conversation. Sacks, Schegloff & Jefferson (1974) speak of a “linear array” of speech-exchange systems. Ordinary conversation is one polar type and involves total local management of turn-taking. At the other extreme (which they exemplify by debate and ceremony) there is pre-allocation of all turns. Clearly, Speaking Test interaction demonstrates an extremely high degree of pre-allocation of turns by comparison with other institutional contexts (cf Drew & Heritage, 1992). Not only are the pre-allocated turns given in the format of prompts, but the examiner also reads out scripted prompts (with some flexibility allowed in Part 3). So, not only the type of turn but the precise linguistic formatting of the examiner’s turn is pre-allocated for the majority of the test.


The repair mechanism is pre-specified in the examiner instructions; the organisation of turn-taking and sequence are implicit in these. There are also constraints on the extent to which topic can be developed. The interaction also exhibits considerable asymmetry. Only the examiner has the right to ask questions and allocate turns; the candidate has the right to initiate repair, but only in the prescribed format. Access to knowledge is also highly asymmetrical. The examiner knows in advance what the questions are, but the candidate may not know this. The examiner has to evaluate the candidate’s performance and allocate a score, but must not inform the candidate of his/her evaluation. Overall, the examiner performs a gate-keeping role in relation to the candidate’s performance. Restrictions and regulations are institutionally implemented with the intention of maximising fairness and comparability.

There are certain similarities with L2 classroom interaction, in that the tasks in all three parts of the test are ones which could potentially be employed in L2 classrooms. Indeed, task-based assessment and task-based teaching have the potential to be very closely related (Ellis, 2003). There are sequences which occur in some L2 classrooms, for example when teachers have to read out prepared prompts and learners have to produce responses. However, there are many interactional characteristics in the Speaking Test which are very different to L2 classroom interaction. In general, tasks tend to be used in L2 classrooms for learner-learner interaction in pairs or groups, with the teacher acting as a facilitator, rather than for teacher-learner interaction. Another difference between Speaking Test interaction and L2 classroom interaction is that the teacher evaluation moves common in L2 classrooms are generally absent in the Speaking Test. Also, the options for examiners to conduct repair, explain vocabulary, help struggling students or engage with learner topics are very restricted by comparison to those used by teachers in L2 classroom interaction (Seedhouse, 2004).

As far as university contexts (Benwell, 1996; Benwell & Stokoe, 2002; Stokoe 2000) are concerned, interaction in seminars, workshops and tutorials appears to be considerably less restricted and more unpredictable than that in the Speaking Test. Seminars, tutorials and workshops are intended to allow the exploration of subject matter, topics and ideas and to encourage self-expression. In the Speaking Test, intersubjectivity does not need to be achieved and language is produced for the purpose of assessment. However, there are some similarities. It is very likely that students will be asked questions about their home countries or towns and about their interests when they start tutorials in their universities.

To summarise, Speaking Test interaction is an institutional variety of interaction with three sub-varieties, namely the three parts of the Test. It is very different to ordinary conversation, has some similarities with some sub-varieties of L2 classroom interaction and some similarities with interaction in universities. Speaking test interaction has some unique interactional features; these may, however, occur in other language proficiency interviews.

What is the relationship between examiner interaction and candidate performance?

The overall impression is that the overwhelming majority of examiners treat candidates fairly and equally. Where there are exceptions, examiners who do not follow instructions may give an advantage to some candidates. The data also suggest a negative relationship between test score and the occurrence of other-initiated repair, ie trouble in hearing or understanding on the part of the candidate: in interviews with high test scores, candidates initiate few or no repairs on the talk of the examiner.

To what extent do examiners follow the briefs they have been given?

The vast majority of examiners follow the briefs and instructions very closely.


In cases where examiners diverge from briefs, what impact does this have on the interaction?

Where examiners do not follow instructions, they often give an advantage to some candidates in terms of their ability to produce an answer. Some examples of examiners aiding candidates in this way are provided above.

How are tasks implemented? What is the relationship between the intended tasks and the implemented tasks, between the task-as-workplan and task-in-process?

There is an extremely close correspondence between intended and implemented tasks. This is in contrast to the common finding in language teaching that there is often a major difference between task-as-workplan and task-in-process (Seedhouse, 2005). One key difference, however, is that L2 classroom tasks generally involve learner-learner interaction.

How is the organisation of the interaction related to the institutional goal and participants’ orientations?

Turn-taking, sequence and repair are logically organised in relation to the institutional goal of ensuring valid and reliable assessment of English speaking proficiency, with standardisation being the key concept in relation to the instructions for examiners. CA work was influential in the design of the revised IELTS Speaking Test, introduced in 2001, and specifically in the standardisation of examiner talk: “Lazaraton’s studies have made use of conversation analytic techniques to highlight the problems of variation in examiner talk across different candidates and the extent to which this can affect the opportunity candidates are given to perform, the language sample they produce and the score they receive. The results of these studies have confirmed the value of using a highly specified interlocutor frame in Speaking Tests which acts as a guide to assessors and provides candidates with the same amount of input and support.” (Taylor, 2000, pp 8-9).

How are the roles of examiner and examinee, the participation framework and the focus of the interaction established?

These are established in the introduction section to the test. The examiner has a script to follow, which includes verifying the candidate’s identity, performing introductions and stating the participation framework and focus of the interaction. Once established, the participation framework is sustained throughout the interview and oriented to by both interactants. The examiner is also the one who closes the encounter.

How long do tests last in practice and how much time is given for preparation in Part 2?

The documentation states that tests will last between 11 and 14 minutes. In the sample data, the shortest test lasted 12 minutes 16 seconds (0176) and the longest 17 minutes 1 second (0199), which exceeds the specified maximum. These times include the approximately 1 minute of preparation time for the long turn. The actual length of long-turn preparation time varied from 41.1 seconds (0678) to 98.2 seconds (0505).
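Checks of this kind are easily automated once durations have been logged. The Python sketch below is a minimal illustration using the two test durations reported above; it flags recordings outside the documented 11-14 minute window, and the variable names are our own.

    # Durations in seconds for the shortest and longest tests in the sample
    # (0176 and 0199); the 11-14 minute window comes from the documentation.
    DOCUMENTED_RANGE = (11 * 60, 14 * 60)
    tests = {"0176": 12 * 60 + 16, "0199": 17 * 60 + 1}

    for test_id, secs in tests.items():
        mins, rem = divmod(secs, 60)
        lo, hi = DOCUMENTED_RANGE
        flag = "within range" if lo <= secs <= hi else "OUT OF RANGE"
        print(f"{test_id}: {mins}m {rem}s ({flag})")
    # 0176: 12m 16s (within range); 0199: 17m 1s (OUT OF RANGE)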

5 CONCLUSION

5.1 Implications and recommendations: test design and examiner training

In this final section, we conclude with implications and recommendations in relation to test design and examiner training, followed by suggestions for further research.

We employed Richards and Seedhouse’s (2005) model of “description leading to informed action” in relation to applications of CA. Here we summarise the recommendations for test design and examiner training which have emerged from analysis of the data. The logic of the Speaking Test is to ensure validity by standardisation of examiner talk. Therefore, most of these recommendations serve to increase standardisation of examiner conduct and, concomitantly, equality of opportunity for candidates. Other suggestions aim to make the interview more similar to everyday conversation where appropriate.



We would recommend that a statement on repair rules be included in documentation for students, eg “When you don’t understand a question, you may ask the examiner to repeat it. The examiner will repeat this question only once. No explanations or rephrasing of questions will be provided.” Examiners might also state these rules during the opening sequence. It may also be helpful for candidates to know that examiners will not express any evaluations of their utterances.

We recommend, in the interests of consistency and standardisation, that examiner instructions specify that “okay” be used in the receipt slot to mark transition to the next question and that “mm hm” be used for back-channelling, particularly in Part 2.

A sequence of questions on a particular topic may appear unproblematic in advance of implementation. However, this may nonetheless be a cause of unforeseen trouble for candidates, especially if an unmotivated and unprepared shift in perspective of any kind is involved. Piloting of questions (if not already undertaken) to check for this is therefore recommended.

There is a case for training examiners in how to adapt the rounding-off questions slightly to fit seamlessly into the previous flow of the interaction. The training could include some of the examples given above, explain the topic disjunction problems which can arise with unmodified rounding-off questions and provide examples of questions which have been successfully adapted to topic flow. Training should also stress that the questions are optional and that in some instances it might not be possible at all to adapt them to the flow of the interaction.

Although the vast majority of examiners follow instructions, some do not, as we have seen above. Examiner training could include examples from the data of examiners failing to follow instructions regarding repair, repetition, explaining vocabulary, assisting candidates and evaluation. These examples would demonstrate how such failures may compromise test validity.

The question “What shall I call you?” created significant problems, and we recommend that it be deleted. The issue of how candidates and examiners address each other is a cultural one and can be adapted to local conventions.

We recommend that the IELTS test developers consider what kind of variation in test and preparation duration is acceptable, since candidates may in some cases derive benefit from disproportionate preparation time. “Examiners must stick to the correct timing of the test both for standardisation and fairness to candidates and also for the efficient running of tests in centres.” (IELTS Examiner Training Material, 2001, p 6)

5.2 Suggestions for further research

This study has not correlated candidate categories in the database (gender, test centre, test score) systematically with patterns of interaction. For the test developers it may be helpful to establish whether particular patterns of communication and evidence of interactional trouble are related to any of these categories. For example, it may be found that candidates from particular regions of the world repeatedly run into trouble with a particular interactional sequence, topic or question in the Speaking Test. Or comparisons of the interactional patterns associated with low-scoring and high-scoring candidates may be revealing. Furthermore, such research could build on existing IELTS research such as O’Loughlin’s (2000) study of the variable of gender in relation to the oral interview. Relationships between these categories and patterns of communication may form the basis of further research studies.


We tentatively suggest that there appears to be a correlation between test score and incidence of interactional trouble and repair sequences. This could be researched further. Current repair policy is that only verbatim repetitions of the question are allowed in Part 1. Further research could examine the consequences of allowing the examiner a greater variety of repair activities.
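By way of illustration, once repair sequences have been counted per interview, the suggested correlation could be computed as in the Python sketch below; the band scores and repair counts shown are invented placeholders, not data from this study.

    from statistics import correlation  # Python 3.10+

    # Hypothetical per-interview data: speaking band and number of
    # candidate-initiated repair sequences (placeholder values only).
    bands = [4.0, 5.0, 6.0, 6.5, 7.0, 8.0, 9.0]
    repairs = [9, 7, 5, 4, 3, 1, 0]

    # Pearson's r; a clearly negative value would be consistent with the
    # tentative pattern noted above (higher band, fewer repair initiations).
    print(f"Pearson r = {correlation(bands, repairs):.2f}")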

The Speaking Test is predominantly used to assess and predict whether a candidate has the ability to communicate effectively on programmes in English-speaking universities. A vital area of research is therefore the relationship between the IELTS Speaking Test as a variety of institutional discourse and the varieties to which candidates will be exposed when they commence their university studies.

Our study has shown the interactional organisation of the Speaking Test to have certain idiosyncrasies, particularly in the organisation of repair. These idiosyncrasies derive rationally from the principle of ensuring standardisation. The key question arising from this study is how the organisation of interaction in the Speaking Test might be modified to make it more similar to interaction in the university environment while not compromising the principle of standardisation.


REFERENCES

Atkinson, JM and Heritage, JC, eds, 1984, Structures of Social Action: Studies in Conversation Analysis, Cambridge University Press, Cambridge

Benwell, B, 1996, ‘The discourse of university tutorials’, unpublished PhD dissertation, University of Nottingham, UK

Benwell, B and Stokoe, EH, 2002, ‘Constructing discussion tasks in university tutorials: shifting dynamics and identities’, Discourse Studies, vol 4, pp 429-453

Brown, A and Hill, K, 1998, ‘Interviewer style and candidate performance in the IELTS Oral Interview’, International English Language Testing System Research Reports, vol 1, pp 1-19

Drew, P, 1992, ‘Contested evidence in courtroom cross-examination: the case of a trial for rape’, in P Drew and J Heritage (eds) Talk at work: interaction in institutional settings, Cambridge University Press, Cambridge, pp 470-520

Drew, P and Heritage, J, eds, 1992a, Talk at Work: Interaction in Institutional Settings, Cambridge University Press, Cambridge

Drew, P and Heritage, J, 1992b, ‘Analyzing talk at work: an introduction’ in Talk at Work: Interaction in Institutional Settings, eds P Drew and J Heritage, Cambridge University Press, Cambridge, pp 3-65

Egbert, M, 1998, ‘Miscommunication in language proficiency interviews of first-year German students: a comparison with natural conversation’ in Talking and testing: discourse approaches to the assessment of oral proficiency, eds R Young and A He, Benjamins, Amsterdam, pp 147-169

Ellis, R, 2003, Task-based language learning and teaching, Oxford University Press, Oxford

Goodwin, C, 1986, ‘Between and within: alternative sequential treatments of continuers and assessments’, Human Studies, vol 9, pp 205-218

He, A, 1998, ‘Answering questions in language proficiency interviews: a case study’, in Talking and Testing: Discourse Approaches to the Assessment of Oral Proficiency, eds R Young and A He, Benjamins, Amsterdam, pp 147-169

Heritage, J, 1997, ‘Conversation analysis and institutional talk: analysing data’, in Qualitative Research: Theory, Method and Practice, ed D Silverman, Sage, London, pp 161-182

Instructions to IELTS Examiners, 2001, Cambridge ESOL

IELTS Examiner Training Material, 2001, Cambridge ESOL

IELTS Handbook, 2005, Cambridge ESOL

IELTS Speaking Test: Examiner script to accompany tasks, 2003, Cambridge ESOL

Kasper, G and Ross, S, 2001, ‘Is drinking a hobby, I wonder: other-initiated repair in language proficiency interviews’, paper at American Association of Applied Linguistics, St. Louis, MO

Kasper, G and Ross, S, 2003, ‘Repetition as a source of miscommunication in oral proficiency interviews’ in Misunderstanding in Social Life. Discourse Approaches to Problematic Talk, eds J House, G Kasper and S Ross, Longman/Pearson Education, Harlow, UK, pp 82-106


Lazaraton, A, 1997, ‘Preference organisation in oral proficiency interviews: the case of language ability assessments’, Research on Language and Social Interaction, vol 30, pp 53-72

Lazaraton, A, 2002, A qualitative approach to the validation of oral language tests, UCLES/Cambridge University Press, Cambridge

Levinson, S, 1992, ‘Activity types and language’ in Talk at Work: Interaction in Institutional Settings, eds P Drew and J Heritage, Cambridge University Press, Cambridge, pp 66-100

Mehan, H, 1979, Learning lessons: social organisation in the classroom, Harvard University Press, Cambridge, Mass

Merrylees, B, 1999, ‘An investigation of speaking test reliability’ International English Language Testing System Research Reports, vol 2, pp 1-35

O’Loughlin, K, 2000, ‘The impact of gender in the IELTS Oral Interview’, International English Language Testing System Research Reports, vol 3, pp 1-28

Richards, K and Seedhouse, P, 2005, Applying conversation analysis, Palgrave Macmillan, Basingstoke

Sacks, H, Schegloff, E and Jefferson, G, 1974, ‘A simplest systematics for the organisation of turn-taking in conversation’, Language, vol 50, pp 696-735

Schegloff, EA, Jefferson, G and Sacks, H, 1977, ‘The preference for self-correction in the organisation of repair in conversation’, Language, vol 53, pp 361-382

Seedhouse, P, 2004, The interactional architecture of the language classroom: a conversation analysis perspective, Blackwell, Malden, MA

Seedhouse, P, 2005, ‘Task as research construct’, Language Learning, vol 55, no 3, pp 533-570

Slater, P, Millen, R and Tyrie, L, 2003, IELTS on track, Language Australia, Sydney

Stokoe, EH, 2000, ‘Constructing topicality in university students’ small-group discussion: a conversation analytic approach’, Language and Education, vol 14, pp 184-203

Taylor, L, 2000, ‘Issues in speaking assessment research’, Research Notes, vol 1, pp 8-9

Taylor, L, 2001a, ‘Revising the IELTS Speaking Test: developments in test format and task design’, Research Notes, vol 5, pp 3-5

Taylor, L, 2001b, ‘Revising the IELTS Speaking Test: retraining IELTS examiners worldwide’, Research Notes, vol 6, pp 9-11

Taylor, L, 2001c, ‘The paired speaking test format: recent studies’, Research Notes, vol 6, pp 15-17

Westgate, D, Batey, J, Brownlee, J, and Butler, M, 1985, ‘Some characteristics of interaction in foreign language classrooms’, British Educational Research Journal, vol 11, pp 271-281

Wigglesworth, G, 2001, ‘Influences on performance in task-based oral assessments’ in Researching Pedagogic Tasks: Second Language Learning, Teaching and Testing, eds M Bygate, P Skehan and M Swain, Pearson, Harlow, pp 186-209

Young, RF and He, A, eds, 1998, Talking and testing: discourse approaches to the assessment of oral proficiency, Benjamins, Amsterdam


APPENDIX 1: TRANSCRIPTION CONVENTIONS

A full discussion of CA transcription notation is available in Atkinson and Heritage (1984). Punctuation marks are used to capture characteristics of speech delivery, not to mark grammatical units.

[ indicates the point of overlap onset
] indicates the point of overlap termination
= a) turn continues below, at the next identical symbol
  b) if inserted at the end of one speaker’s turn and at the beginning of the next speaker’s adjacent turn, it indicates that there is no gap at all between the two turns
(3.2) an interval between utterances (3 seconds and 2 tenths in this case)
(.) a very short untimed pause
Word underlining indicates speaker emphasis
e:r the::: indicates lengthening of the preceding sound
- a single dash indicates an abrupt cut-off
? rising intonation, not necessarily a question
! an animated or emphatic tone
, a comma indicates low-rising intonation, suggesting continuation
. a full stop (period) indicates falling (final) intonation
CAPITALS especially loud sounds relative to surrounding talk
° ° utterances between degree signs are noticeably quieter than surrounding talk
↑ ↓ indicate marked shifts into higher or lower pitch in the utterance following the arrow
> < indicate that the talk they surround is produced more quickly than neighbouring talk
( ) a stretch of unclear or unintelligible speech
((inaudible 3.2)) a timed stretch of unintelligible speech
(guess) indicates transcriber doubt about a word
.hh speaker in-breath
hh speaker out-breath
hhHA HA heh heh laughter transcribed as it sounds
→ arrows in the left margin pick out features of especial interest

Additional symbols
ja ((tr: yes)) non-English words are italicised, and are followed by an English translation in double brackets
[gibee] in the case of inaccurate pronunciation of an English word, an approximation of the sound is given in square brackets
[æ] phonetic transcriptions of sounds are given in square brackets
< > indicate that the talk they surround is produced slowly and deliberately (typical of teachers modelling forms)
C: Candidate
E: Examiner
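Quantitative work on such transcripts (for example, counting words or repair markers) first requires stripping the notation above. The short Python sketch below is our own illustrative helper covering only the most common symbols; it is not a tool used in the study.

    import re

    def strip_ca_notation(line: str) -> str:
        """Reduce a CA transcript line to plain words, per the conventions above."""
        line = re.sub(r"^\s*-?\d+\w*\s+", "", line)            # transcript line numbers
        line = re.sub(r"\(\(.*?\)\)", " ", line)               # ((comments)), ((inaudible))
        line = re.sub(r"\(\d+(?:\.\d+)?\)|\(\.\)", " ", line)  # timed and untimed pauses
        line = re.sub(r"[\[\]°↑↓><→=!?-]", " ", line)          # overlap, pitch, pace, cut-offs
        line = re.sub(r"(\w):+", r"\1", line)                  # lengthening, e.g. the:::
        return re.sub(r"\s+", " ", line).strip()

    print(strip_ca_notation("206 E: does does everyone you know↑ use this piece of equipment (0.6)"))
    # -> E does does everyone you know use this piece of equipment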


APPENDIX 2: A LOW SCORE OF BAND 3.0 ON THE IELTS SPEAKING MODULE

Part 1

-6 E: ehm (.) this is the speaking module, for the international English
-5 language testing system, .h conducted on the twenty eighth of
-4 january, ehm two thousand an three,? .h thee ca:ndidate is
-3 ((first name,)) ((last name,)) candidate number ((number))=
-2 ((number))=((number))=((number.)) .hh a:nd the interviewer is
-1 ((first name))= ((last name:.))
0 (1.0)/((clicking sound probably from tape being switched on and off))
1 E: .hh well good evening=my name is ((first name)) ((last name))=
2 can you tell me your full name please.=
3 C: =yes ((first name,)) ((last name.))
4 E: .hh ah: a:n[d,
4b C: [ghm=
4c E: =can you tell me er, what shall I ca:ll you.
4d (1.5)
5 → C: e:r (1.0) can you repeat the: er the question[(s),?
6 E: [( ) what do you, (0.2)
7 E: your first name? do you use [((last name))
7b C: [( ) ((first name))
8 E: ((first name)). [((first name)) (you want me to call you) ((first na[me))
8b C: [yes ((first name)) [yes
9 E: °right.° ((with forced sound release))
10 E: .hh and can I see your identifi<cation: card please.>
10b (0.5)
10c → C: .h[hm-
11 E: [an ID. .hh er: not a student card=do you have an I [D °card? °
12 C: [↑e::::m
13 C: no::=in, (0.2) tch! no.
13b (0.5)
13c C: tch! er I don’t er (0.2) .h I don’t have (1.3) the: (1.0)
13d administration,=er: the day.
14 E: m:: I understa:nd but you erm .h need to ha:ve, a: tch! (0.2) your official,
15 C: yes
16 E: ID card.
17 C: ye:s.
17b (1.5)
18 E: .hh thank yer .hh erm in this first part↓ I’d like to s=ask some questions
19 about your↑self. .hh em >well first of all can you tell me where you’re<
19b fro↑m↓
20 C: .h yes er: hh I go: eh: hh e:r I live er to:, .h to Kosa:ni,?
20b (0.2)
21 E: [°(I’m) from Kosani.
22 C: [I am from Kosani.
23 E: o↑okay↓ tch! now! .hh uhm ↑can we talk about erm where you live.
((Note: While we did the final check on the transcription, the tape got damaged at this stretch.))
23b (0.5)
24 could you describe the city or the town that you live in ↑↑now↓.
25 C: .h er yes I’d li- I would like eh .hh I (0.3) eh I very much eh in
25b Thesaloniki,? (0.5)
26 E: you live in <The[saloniki> eh=ok .hh could you describe where you ↑live?
27 C: [yes.
29 C: er yes er (0.5) I would like er (0.2)
30 E: whe:re you live. >can you describe it please.< ((pitch lowered gradually))
31 C: erm (1.2)
32 E: >where do you live in Thesaloniki:. < ((pitch lowered more))
32b (0.2)
32c E: where?
33 C: erm: tch! in the centre.
33b (0.2)
34 E: tell me: eh describe where you live.=uh hum,?


34b (1.0)
35 C: erm (1.0) I would live in erm the centre, (.) erm, (0.5) I’m: er:, h (0.4)
36 one years er, (.) one years in Thesaloniki,
37 E: I see .hh what do you li:ke:, about living he:re
38 C: tch! erm (3.0)
38b E: °°°( ) °°°
39 C: I would like Thesaloniki:, (0.4) er because erm (2.0) because it have eh
39b it has eh (.) er very much er eh people, (0.5) and: eh: and clu:bbing,
39c and er [(1.0)
?? [((sound of paper shuffling))
40 E: m ↑hm .h eh is, are there things you don’t like about it?
40b (.)
40c E: ((first name))
41→ C: wh:at?
42 E: <are there things you don’t like about it?>
43 C: yes.
43b (0.5)
43c C: er (1.8) er I guess I I do:: (0.5) I do like er: Thesaloniki,?
44 E: <uh h↑um .hh how could you improve (.) the city? >
44b (0.5)
45 C: hm: (2.5) I improving Thesaloniki: (3.0) umhh (1.0)
46 E: tch! all right. lets move on to (.) the topic of food and restaurants=
47 what kind of food do you like to eat.
48 C: er: (0.8) I’(wou) like er: (0.8) own food in the restaurant, because er (.)
48b eh
49 E: what kind of food do you like to eat.
50 C: er (2.0) tch! erm (.) yes er: I would like eh to eata the restaurant, (0.3)
50b and er:, (0.5)
51 E: .hh <<is there any food you don’t like?>> ((flat intonation))
51b (0.5)
52 C: er: (0.8) no there isn’t er a (.) a restaurant er (0.6) erm (3.0)
53 E: m ↑hm (.) um (0.3) <what are some of the advantages and disadvantages
54 of eating in a restaurant.>
54b (0.5)
55 C: erm (5.0) tch! er the advantage er (0.2) eh of eating er in a restaurant,?
56 (0.4) er because: er : (0.7)
57 E: uh hm (0.5)
57b C: uhmhh
57c (0.8)
58 E: what’s the good thing about eating in a restaurant. ((soft voice))
58b (5.0)
59 E: ↑n:t. ^lets talk about fi:lms. (.) do you enjoy wa:tching films?=
59b =((first name))?
59c C: er yes. eh I’m like er: (.) er watching film,?
59d E: °eh hm,? °
60 C: er: because eh: (.) erm (.) because watching er to=er=in Thesaloniki,
60b and er, and (0.2) and er, like in Thesaloniki:
61 E: okay .h how often <do you watch (.) films.>
61b (0.5)
62 C: umhh (0.2)
63 E: °how often° ((whisper voice))
64 C: °how often° (0.2) .hh erm (6.0) I often er (2.0) I often watch er (.)
65 er film, in er (6.0)
66 E: ((sing song voice all turn)) °uh hum all right° (0.5)
66b now lets move on to the next part (1.0) a:nd I::, I am going to give
67 you a topic, (0.2) and I’d like you to talk about it for one or two
68 minutes (.) before you talk you will have one minute to think:
69 about what you are going to say:. .hh and you can make some notes
69b if you wish.
70 (.)
70b E: °all right?=do you understand? ° ((high pitch))
71 C: yes.
72 E: okay so here’s some paper, and here’s a pencil, (.) eh to


73 make your notes, (0.5) and er here’s your ↑↑topic: (0.5)
74 here we are (.) I’d like you to describe a trip: (.) that you once went on:.
74b ((sound of tape recording being switched off))

Part 2 (Counter 119)

75 E: a:ll right now remember you have >one or two minutes for this so don’t
76 worry if I stop you .h and I’ll tell you when the time is up,<
77 can you start speaking (now) please,?=
78 C: =yes (0.2) .hh er I travelled in er (0.5) I travelled in er inyana,
78b .hh iryana is very:, (.) very like,
79 and er (.) I went (.) and I went er to: (0.8) I went there for er the job, (.)
((Note: While we did the final check on the transcription, the tape got damaged at this stretch))
80 and er (3.0) and er (5.0)
81 E: did you enjoy your trip? or not? how did you go there? you went to
82 Indiana.=
83 C: =yes.=
84 E: =how did you travel there?
84b (6.0)
84c E: did you go by train?=did you go by plane?=how [did you
85 C: [er,
86 C: I went er tch! (.) to: the bus (0.2) and erm (.) I: went erm to my
86b parents, (0.2) em (0.2)
86c E: °°mhm°° did you enjoy the trip?
86d (0.2)
87 E: ((first name?))
88 (3.0)
88b C: er yes I:: (2.0) I enjoy the: (5.0)

Part 3 (Counter 143)

89 E: °°uh hum°° .h okay: .hh ↑can I have the task (.) card (.) back=then:
89b °uh hum° (.) wha:t did you like (.) mo:s[t
90 C: [grh ((clears throat))
90b E: about it.
90c (0.7)
91 E: what was the be:st: thing:.
91b (1.0)
92 C: um (3.0)
93 E: °m:h::: °
93b (0.3)
93c E: tch! o:kay (0.5) .hh ^erm^ (0.8) tch! we’ve been talking about a trip
93d that you went on=and I’d like to discuss with you: one or
94 two ↑more ↑↑general questions related to this.=
95 =.hhh ↑lets think about erm (.) travel and transport
96 .hhh ↑whats the most popular way to travel .hh a long distance: (.)
97 <in your country?>
97b (0.7)
98 C: um: (5.0) er the travel ermh, (3.0) .hh er I would like er to:: (0.5)
99 to transport er (1.0) tch! (1.0) [eh for the: (6.0)
99b E: [°mhm°
100 E: °uh hum° .hhh how do you think, we’re going to travel in the future? .hh hh
100b (0.2)
101 C: erm: yes erm: (0.2) I believe that er: (0.5) er the travel er in the future,
102 E: °mhm,? °
103 C: er but er: but I don’t er (1.0) erm (0.8) but I don’t eh know: to: .hh to
104 the to:wn:, the [city:,
103 E: [a:ll right.:: (.) o:kay:: .hh thank ((tape cut off here))
(0836)


APPENDIX 3: A HIGH SCORE OF BAND 9.0 ON THE IELTS SPEAKING MODULE

Part 1

1 E: good afternoon (1.3) uh:m (3.4) can you tell me your full name please?
2 (0.4)
3 C: °((name omitted))° (0.2)
4 E: thanks and uh:: (1.5) can you tell me where you’re from? (0.3)
5 C: °I’m from Trinidad ((inaudible))° (0.3)
6 E: okay (0.4) can I see your I.D please (4.7) thanks (3.3) that’s fine thank
7 you (1.7) now in this first part I’d like to ask you some questions about
8 yourself (0.6) uh:: let’s talk about (0.4) uh:m (3.1) what you do: (0.3) do
9 you work or are you a student (0.3)
10 C: I’m medical student hh (0.2) I’m ((inaudible)) graduate in Ma:y of this
11 year (.) s(hh)o (0.3)
12 E: o[kay]
13 C: [not]too long (0.5)
14 E: okay (0.7) so: uh (1.1) tell me about your studies (1.0)
15 C: well (.) I originally started in Grenada (0.3) we do: two years of basic
16 sciences (.) anatomy physiology etc (0.4) .hh then we do two years of
17 clinical studies either in England or the States or a combination of both
18 (0.5) also <the students are American so they tend to do most of the
19 studies in the States< (0.7) uh::m I chose I originally: (.) scheduled to
20 start in New York but that didn’t work out so I actually came to England
21 (0.4) but I’m actually glad I did because (.) medical system is a lot dif- it
22 it American system is much different in ((inaudible)) (0.5) whereas
23 English system is more compatible so: I: consider it’s a good move to
24 come to Engla(hh)nd (0.3)
25 E: okay so what do you like most (0.3) about your studies (1.7)
26 C: uh the variety (0.4) I think in: medicine especially because no: two
27 patients will present the same way (0.4) and i- it’s always a challenge to
28 think about what the diagnosis is (0.3) and uh ways in which you can (.)
29 confirm the diagnosis °basically° (0.2)
30 E: okay (0.4) are there any things you ↑don’t like about your studies? (2.7)
31 C: well personally the fact tha:t (.) if I read something I have to read it again
32 you know to remember it (.) it’s just a lot (.) the volume of work is very
33 very large so it’s just (0.2) time management (0.2) and learning to deal
34 with the: (0.2) °(volume of work)° (0.3)
35 E: okay (0.7) so uh:: what qualifications or certificates (0.8) do you hope to
36 get (1.3)
37 C: well (1.1) after I: (0.5) get my degree in May I’m hoping to:: (1.3) uh:m
38 >probably work in England for a while and in order to do that I have to
39 do further exams< hh (0.5) unfortunately bu:t uh:m (1.1) .hh then I just
40 hope to: (0.6) progress further i- in my field ((inaudible)) (0.2)
41 E: okay okay (0.7) ↑let’s uh move on to talk about some of the activities
42 you (0.6) enjoy in your free time (0.7) when do you have free time? (1.3)
43 C: rarely hh heh (0.3) .hh uh::m (0.5) I try to pace myself generally (.) in
44 terms of: getting a lot of work done during the week so I ca:n at least
45 relax a bit at the weekends (0.5) I like to:: look at movies go shopping:
46 hh heh (0.5) uhm have a chat with friends and (0.6)
47 E: okay and uh::m (1.5) what free time activities are most popular where
48 you live? (1.6)
49 C: probably going to the beach definitely ‘cause it’s always warm hh (.)
50 E: mm [hm]
51 C: [th ]at’s what I miss most actually (0.4) uh::m (2.3) I would say
52 that’s probably the most po[pular]
53 E: [ o:]kay (0.4) so how important is free time
54 in people’s lives? (0.6)
55 C: very ve(hh)ry important (0.7) .hh I can (1.8) well personally uh:m (0.7)
56 because I always have so much work to do so much studying to do it’s
57 always so important for me (.) to be able to relax a bit and then come
58 back refreshed so I can study (.) some more ((inaudible)) (0.2) °I think


59 it’s very very important to have free time° (0.3)
60 E: okay okay (0.7) uh::m (0.8) okay can we talk about (.) your childhood
61 are you happy to do that? (0.3)
62 C: yes? °(no worries)°=
63 E: =okay (0.3) so where did you grow up (0.3)
64 C: I grew up in Tobago (0.2)
65 E: o:kay (0.5) uh was it a good place for children? (0.4)
66 C: yes I think so hh HA (.)
67 E: why? (0.2)
68 C: uh::m (1.7) I think becau:se the society at ho:me (.) tends to stress a lot
69 of family value (.) >I think that’s very very important and looking back
70 at my childhood now I realise just how important that was< (0.8) .hh
71 uh::m (0.6) .hh I can’t say I can’t really compare myself to to:: (1.1)
72 children growing up in other parts of the world just because I didn’t
73 experience it first hand (.) but I would definitely advocate (0.2) growing
74 up in the West Indies (.) a (great dea(hh)l) (0.4)
75 E: where did you usually play? (0.7)
76 C: uh::m (1.7) well if you were at school then you would play at the (.)
77 playground at school o:r (0.7) at home there’s always space to run
78 around the yard and things like that (.) or you could play on the beach:
79 (0.3)
80 E: oh okay okay (0.6) .hh uh do you think childhood is different today (.)
81 from when you were a child? (2.2)
82 C: I think there’s uh: uhm many differences yes because (1.6) uh:m (0.4)
83 children nowadays are exposed a lot mo:re (0.8) uh:m (1.5) different
84 influences basically because of the television internet things like that
85 (0.5) so I think that tends to have a bigger impact on a child (0.7) in
86 recent years compared to when I grew up (0.6)

Part 2

87 E: okay (0.2) all right (cough) (4.3) okay now I’m gonna give you a topic
88 (.) and I’d like you to talk about it (0.6) for one to two minutes and
89 before you talk (0.3) you’ll have one minute to think about what you’re
90 gonna say (0.8) and you can make some notes if you wish (0.3) d’you
91 understand? (0.2)
92 C: yes (0.3)
93 E: o:kay (0.4) so here’s paper (.) and pencil (0.8) for making some notes
94 (0.6) an:d (0.7) I’d like you to describe a trip (0.6) that you once went on
95 (33.8) okay? (0.2)
96 C: yep (.)
97 E: a:ll right (0.6) remember you have one to two minutes for this so don’t
98 worry if I stop you I’ll tell you when the time is u[p]
99 C: [o]kay (0.2)
100 E: can you start speaking now please=
101 C: =yeah (0.6) I remembe:r at the beginning of my medical school career
102 (0.4) we were taken on uhm a boat trip to one of the smaller islands
103 around Grenada (0.6) .hh it was basically: a (0.5) ((inaudible)) trip (0.4)
104 .hh uh:m it was part of the orientatio:n (0.4) into: medical school life and
105 into: life in Grenada: obviously (0.4) .hh uh::m (1.1) uh I: think it took
106 abou:t half an hour to get the:re (0.3) if I remember correctly hh (.) uh:m
107 (0.5) there were lots of us there lots of the students (0.2) uh::m both (0.5)
108 >students who were just starting medical school as well as those who
109 were further into their medical school career< (0.7) an:d there was uhm
110 lot of ↑foo:d lots of drinks (.) we spent (.) most of the day on the beach
111 (.) in the sun in the water (0.6) .hh uh::m (1.5) it was: (.) but I can’t say it
112 was: (0.3) a big (0.8) culture change for me because coming from
113 Tobago which is half an hour flying is (.) very very similar (0.3) but I
114 just enjoyed the day out and (0.7) .hh uh:m it always brings back good
115 memories because then you remember all the free time that you had (.)
116 hh heh (0.5) uh::m (1.4) I actually: (0.3) repeated the trip about: a year
117 later (.) because that was my (0.4) last opportunity: (0.6) to uhm (0.2) see
118 a bit of Grenada before leaving: (0.3) to start my °((inaudible))° (0.7)


119 uh::m (1.9)
120 E: okay (0.2) okay=
121 C: =hh heh (.)
122 E: do you generally enjoy travelling? (0.5)
123 C: yes I do although I haven’t actually travelled as much as I would like to
124 yet (0.3) but hopefully after I start working (0.3)

Part 3

125 E: o:kay okay (0.4) so we’ve been talking about a trip you (0.4) went on
126 (0.3)
127 C: °y[eah]°
128 E: [an ]d I’d like to discuss with you one or two more general questions
129 related to this (0.7) let’s consider first of all uhm (.) public and private
130 transport can you describe (0.6) the public transport systems (0.3) in your
131 country (1.4)
132 C: well there are public buses (0.4) which tend to be cheaper than taxis (0.2)
133 there is (0.6) uh:m a: taxi association which operates: basically from the
134 airport to anywhere around the island (0.3) but in general if you (.) need
135 a taxi >basically you just stand up on the side of the road and put out
136 your hand< hh heh (0.2) for uhm (0.3) a taxi that comes along (0.5) .hh
137 uh::m (1.1) the bus system there’s a central depot in the middle of town
138 (0.2) so you can go there to purchase tickets and so (0.5) and there are
139 various othe:r (0.7) stations around the island where you can also
140 purchase tickets (0.4)
141 E: °okay (0.2) okay° (0.6) uh:m (0.5) can you uh:: (1.4) speculate on (1.6)
142 future developments (.) in transport systems? (3.8)
143 C: currently there a:re talks to:: (.) expand (0.2) the public >transport
144 systems to< inclu:de (0.5) what we call maxi taxis not (0.6) .hh uh::m
145 (0.7) not exactly public buses they tend to be smalle:r (0.4) they tend to
146 charge more than the public buses (0.4) but uh >they can also hold more
147 people than a taxi obviously so it’s it’s more economical in that way<
148 (0.6) .hh uh:m (0.8) I’m not sure when exactly it will the whole system be
149 put into place but (.) actually (0.7) the: (0.7) the:: (0.4) the plans for the:
150 development of the system in Tobago: (.) can be modelled on
151 ((inaudible)) in Trinidad because there are a lot more maxi taxis in
152 Trinidad (0.2) °((inaudible))° (.)
153 E: okay okay (0.5) can you uh ((cough)) (1.1) speculate on an- any measures
154 that will be taken to reduce pollution (0.7) in the future? (2.0)
155 C: there’s a lot of debate no:w about pollution especially the waters (0.3)
156 around Trinidad and Tobago (0.2) becau:se there’s (0.5) gro- growth in
157 the tourism industry especially (0.5) there’s a lot of concern about the
158 hotels (.) .hh disposing of their waste properly (0.5) and in recent years in
159 the (0.2) probably about the last ten years or so there have been (0.7)
160 uh::m (0.4) there has been an increase in the amount of pollutio:n (0.3) in
161 the water and the:re’s (0.6) several (1.3) uh:m societies for example
162 ((inaudible)) Tobago that have been set up to try and combat the
163 problems throu:gh education (0.2) uh:m (1.1) uh:m (0.3) there is other
164 mea(hh)sures ((inaudible)) (.)
165 E: okay okay (.) okay (0.6) ↑right thank you very much=
166 C: =yes (.)
167 E: that’s the end of the speaking test=
168 C: =okay thank you
(0389)


7. An investigation of the lexical dimension of the IELTS Speaking Test

Authors
John Read
University of Auckland

Paul Nation
Victoria University of Wellington

Grant awarded Round 8, 2002

This study investigates vocabulary use by candidates in the IELTS Speaking Test by measuring lexical output, variation and sophistication, as well as the use of formulaic language.

ABSTRACT

This is a report of a research project to investigate vocabulary use by candidates in the current (since 2001) version of the IELTS Speaking Test, in which Lexical resource is one of the four criteria applied by examiners to rate candidate performance. For this purpose, a small corpus of texts was created from transcriptions of 88 IELTS Speaking Tests recorded under operational conditions at 21 test centres around the world. The candidates represented a range of proficiency levels from Band 8 down to Band 4 on the nine-band IELTS reporting scale. The data analysis involved two phases: the calculation of various lexical statistics based on the candidates’ speech, followed by a more qualitative analysis of the full transcripts to explore, in particular, the use of formulaic language. In the first phase, there were measures of lexical output, lexical variation and lexical sophistication, as well as an analysis of the vocabulary associated with particular topics in Parts 2 and 3 of the test.

The results showed that, while the mean values of the statistics displayed a pattern of decline from Band 8 to Band 4, there was considerable variance within bands, meaning that the lexical statistics did not offer a reliable basis for distinguishing oral proficiency levels. The second phase of the analysis focused on candidates at Bands 8, 6 and 4. It showed that the sophisticated vocabulary use of high-proficiency candidates was characterised more by the fluent use of a variety of formulaic expressions, often composed of high-frequency words, than by any noticeable concentration of low-frequency words in their speech. Conversely, there was little obvious use of formulaic language among Band 4 candidates. The report concludes with a discussion of the implications of the findings, along with suggestions for further research.


AUTHOR BIODATA

JOHN READ

John Read is an Associate Professor in the Department of Applied Language Studies and Linguistics, University of Auckland, New Zealand. In 2005, while undertaking this research study, he was at Victoria University of Wellington. His research interests are in second language vocabulary assessment and the testing of English for academic and professional purposes. He is the author of Assessing Vocabulary (Cambridge, 2000) and is co-editor of the journal Language Testing.

PAUL NATION

Paul Nation is Professor of Applied Linguistics in the School of Linguistics and Applied Language Studies, Victoria University of Wellington, New Zealand. His research interests are in second language vocabulary teaching and learning, as well as language teaching methodology. He is the author of Learning Vocabulary in Another Language (Cambridge, 2001) and also the author or co-author of widely used research tools such as the Vocabulary Levels Test, the Academic Word List and the Range program.


CONTENTS

1  Introduction ............................................................ 4
2  Literature review ....................................................... 4
3  Research questions ...................................................... 7
4  Method .................................................................. 7
   4.1  The format of the IELTS Speaking Test .............................. 7
   4.2  Selection of texts ................................................. 7
   4.3  Preparation of texts for analysis .................................. 9
5  Statistical analyses .................................................... 9
   5.1  Analytical procedures .............................................. 9
6  Statistical results ..................................................... 10
   6.1  Lexical output ..................................................... 10
   6.2  Lexical variation .................................................. 10
   6.3  Lexical sophistication ............................................. 11
   6.4  Key words in the four tasks ........................................ 14
7  Qualitative analyses .................................................... 16
   7.1  Procedures ......................................................... 16
8  Qualitative results ..................................................... 17
   8.1  Band 8 ............................................................. 17
   8.2  Band 6 ............................................................. 19
   8.3  Band 4 ............................................................. 20
9  Discussion .............................................................. 21
10 Conclusion .............................................................. 22
References ................................................................. 24


1 INTRODUCTION

The revised Speaking Test for the International English Language Testing System (IELTS), introduced in 2001, involved various changes both in the way that a sample of speech is elicited from the candidates and in the criteria used to rate their performance. From our perspective as vocabulary researchers, a number of issues stimulated our interest in investigating the test from a lexical perspective. An obvious one is that, whereas examiners previously assessed each candidate on a single global scale incorporating various descriptors, the rating is now done more analytically with four separate scales, one of which is Lexical resource. Examiners are required to attend to the accuracy and range of the candidate’s vocabulary use as one basis for judging his or her performance. A preliminary study conducted by Cambridge ESOL with a pilot version of the revised test showed that ratings on the lexical scale correlated very highly with those on the grammar scale, and indeed with the fluency scale as well (Taylor and Jones, 2001), suggesting the existence of a halo effect, and perhaps a lack of salience for the examiners of lexical features of the candidates’ speech. Thus, there is scope to investigate characteristics of vocabulary use in the Speaking Test, with the possible outcome of guiding examiners in what to consider when rating the lexical resource of candidates at different proficiency levels.

A second innovation in the revised test was the introduction of the Examiner Frame, which largely controls how an examiner conducts the Speaking Test, by specifying the structure of the interaction and the wording of the questions. This means that the examiner’s speech in the test is quite formulaic in nature. We were interested to determine if this might influence what the candidates said. Another possible influence on the formulaic characteristics of the candidates’ speech is the growing number of IELTS preparation courses and materials, including at least one book (Catt, 2001) devoted just to the Speaking Test. The occurrence of formulaic language in the test would not in itself be a problem. One needs to distinguish here between purposeful memorising of lexical phrases specifically to improve test performance – which one might associate with less proficient candidates – and the skilful use of a whole range of formulaic sequences which authors like Pawley and Syder (1983) see as the basis of fluent native-like oral proficiency.

More generally, the study offered an opportunity to analyse spoken vocabulary use. As Read noted (2000: 235-239), research on vocabulary has predominantly focused on the written language because – among other reasons – written texts are easier to obtain and analyse. Although the speaking test interview is rather different from a normal conversation (cf van Lier, 1989), it represents a particular kind of speech event which is routinely audiotaped, in keeping with the operational requirements of the testing program. As a result, a large corpus of learner speech from test centres all around the world is available for lexical and other analyses once a selection of the tapes has been transcribed and edited. Thus, a study of this kind had the potential to shed new light on the use of spoken vocabulary by second language learners at different levels of proficiency.

2 LITERATURE REVIEW

Both first and second language vocabulary research have predominantly been conducted in relation to reading comprehension ability and the written language in general. This reflects the practical difficulties of obtaining and transcribing spoken language data, especially if it is to be “natural”, ie, unscripted and not elicited. The relative proportions of spoken and written texts in major computer corpora such as the Bank of English and British National Corpus maintain the bias towards the written language, although a number of specialised spoken corpora like the CANCODE (Cambridge and Nottingham Corpus of Discourse in English) and MICASE (Michigan Corpus of Academic Spoken English) are now helping to redress the balance.


To analyse the lexical qualities of texts, scholars have long used a range of lexical statistics. Here again, for practical reasons, the statistics have, until recently, been applied mostly to written rather than spoken texts. Nevertheless, they potentially have great value in allowing us to describe key features of spoken vocabulary in a quantitative manner that may provide useful comparisons between test-takers at different proficiency levels. Read (2000: 197-213), in an overview of the statistical procedures, identifies the main qualities which the statistics are designed to measure: lexical density; lexical variation; and lexical sophistication.

Lexical density is operationalised as the proportion of content words in a text. It has been used to distinguish the relative denseness of written texts from that of oral ones, which tend to have lower percentages of nouns, verbs and adjectives. In a language testing context, O’Loughlin (1995) showed that candidates in a “direct” speaking test, in which they interacted with an interviewer, produced speech with a lower lexical density than those who took a “semi-direct” version of the test, which required test-takers to respond on audiotape to pre-recorded stimulus material with no interviewer present.
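To make the measure concrete, here is a minimal sketch of the calculation in Python. It is ours, purely for illustration: the FUNCTION_WORDS set is a toy stand-in, not the inventory of function words used in O’Loughlin’s study or any other cited here.

    # Minimal sketch of lexical density: the proportion of content
    # words among all tokens. The function-word set is a toy stand-in.
    FUNCTION_WORDS = {
        "the", "a", "an", "and", "or", "but", "of", "to", "in", "on",
        "at", "is", "are", "was", "were", "be", "i", "you", "he", "she",
        "it", "we", "they", "that", "this", "with", "for", "so", "not",
    }

    def lexical_density(tokens):
        """Share of tokens that are not function words."""
        if not tokens:
            return 0.0
        content = [t for t in tokens if t.lower() not in FUNCTION_WORDS]
        return len(content) / len(tokens)

    speech = "well there are public buses which tend to be cheaper than taxis".split()
    print(f"lexical density = {lexical_density(speech):.2f}")   # 0.75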

Lexical variation, which has traditionally been calculated as the type-token ratio (TTR), is simply the proportion of different words used in the text. It provides a means of measuring what is often referred to as “range of vocabulary”. However, a significant weakness of the TTR when it is used to compare texts is the sensitivity of the measure to the variable length of the texts. Various unsatisfactory attempts have been made over the years to correct the problem through algebraic transformations of the ratio. Malvern and Richards (Durán, Malvern, Richards and Chipere, 2004) argue they have found a solution with their measure, D, which involves drawing multiple word samples from the text and plotting the resulting TTRs on a curve that allows the relative lexical diversity of even quite short texts to be determined. In a study which is of some relevance to our research, Malvern and Richards (2002) used D to investigate the extent to which teachers, acting as examiners in a secondary school French oral examination, accommodated their vocabulary use to the ability level of the candidates.
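The length problem, and the sampling idea behind D, can both be illustrated with a minimal Python sketch. This is our own illustration, not the D_Tools implementation: the plain TTR falls as more of the text is included, whereas the mean TTR of many fixed-size random samples is unaffected by text length. D_Tools goes further and fits a theoretical curve to such points.

    import random

    # Minimal sketch: plain TTR versus the mean TTR of fixed-size
    # random samples (the sampling idea underlying D).
    def ttr(tokens):
        return len(set(tokens)) / len(tokens)

    def mean_sampled_ttr(tokens, size=20, trials=200):
        samples = (random.sample(tokens, size) for _ in range(trials))
        return sum(ttr(s) for s in samples) / trials

    tokens = ("well there are public buses which tend to be cheaper than "
              "taxis and there is a taxi association which operates from "
              "the airport to anywhere around the island but in general "
              "you just stand on the side of the road and put out your "
              "hand for a taxi that comes along").split()

    print(f"TTR of first 20 tokens: {ttr(tokens[:20]):.2f}")
    print(f"TTR of all {len(tokens)} tokens: {ttr(tokens):.2f}")
    print(f"mean sampled TTR (size 20): {mean_sampled_ttr(tokens):.2f}")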

Lexical sophistication can be defined operationally as the percentage of low-frequency, or “rare”, words used in a text. One such measure is Laufer and Nation’s (1995) Lexical Frequency Profile (LFP), which Laufer (1995) later simplified to a “Beyond 2000” measure – the percentage of words in a text that are not among the most frequent 2000 in the language. Based on the same principle, Meara and Bell (2001) developed their program called P_Lex to obtain reliable measures of lexical sophistication in short texts. It calculates the value lambda by segmenting the text into 10-word clusters and identifying the number of low-frequency words in each cluster. As yet, there is no published study which has used P_Lex with spoken texts.
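For illustration, the following minimal Python sketch computes both a Beyond-2000-style percentage and a P_Lex-style lambda for a short stretch of speech. The HIGH_FREQ set is a toy stand-in for a real 2000-word frequency list, and taking the mean count of hard words per 10-word segment is only a first approximation: it is the maximum-likelihood estimate of the Poisson parameter that the actual P_Lex program fits.

    # Minimal sketch (not the published tools): a Beyond-2000-style
    # percentage and a P_Lex-style lambda. HIGH_FREQ is a toy stand-in
    # for a real 2000-word frequency list.
    HIGH_FREQ = {"i", "we", "the", "a", "to", "on", "of", "was", "it",
                 "and", "be", "go", "there", "into", "part", "life",
                 "school", "trip", "boat", "were", "taken"}

    def beyond_list_percent(tokens):
        rare = sum(t.lower() not in HIGH_FREQ for t in tokens)
        return 100 * rare / len(tokens)

    def plex_lambda(tokens, seg_len=10):
        segs = [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]
        segs = [s for s in segs if len(s) == seg_len]        # whole segments only
        counts = [sum(t.lower() not in HIGH_FREQ for t in s) for s in segs]
        return sum(counts) / len(counts) if counts else 0.0  # Poisson MLE

    text = ("we were taken on a boat trip it was part of the orientation "
            "into medical school life and into life in grenada obviously").split()
    print(f"beyond-list: {beyond_list_percent(text):.1f}%")
    print(f"lambda: {plex_lambda(text):.2f}")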

Apart from the limited number of studies using lexical statistics, recent work on spoken vocabulary has highlighted a number of its distinctive features, as compared to words in written form. One assumption that has been widely accepted is that the number of different words used in informal speech is substantially lower than in written language, especially of the more formal kind. That is to say, a language user can communicate effectively through speaking with a rather smaller vocabulary than that required for written expression. There has been very little empirical evidence for this until recently. In their study of the CANCODE corpus, Adolphs and Schmitt (2003) found a vocabulary of 2000 word families could account for 95% of the running words in oral texts, which indicates that learners with this size of vocabulary may still encounter quite a few words they do not know. The authors suggest that the target vocabulary size for second language learners to have a good foundation for speaking English proficiently should be around 3000 word families, which is somewhat larger than previously proposed.


But perhaps the most important area in the investigation of spoken vocabulary is the use of multi-word lexical items. This represents a move away from the primary focus on individual word forms and word families in vocabulary research until now. Both in manual and computer analysis, it is simpler to count individual forms than any larger lexical units, although corpus linguists are now developing sophisticated statistical procedures to identify collocational patterns in text.

The phenomenon of collocation has long been recognised by linguists and language teaching specialists, going back at least to Harold Palmer (1933, cited in Nation, 2001: 317). What is more recent is the recognition of its psycholinguistic implications. The fact that particular sequences of words occur with much greater than chance probability is not simply an interesting characteristic of written and spoken texts, but also a reflection of the way that humans process natural language. Sinclair (1991) distinguishes two approaches to text construction: the open-choice principle, by which language structures are generated creatively on the basis of rules; and the idiom principle, which involves the building of text from prefabricated lexical phrases. Mainstream linguistics has tended to overlook or undervalue the significance of the latter approach.

Another seminal contribution came from Pawley and Syder (1983), who argued that being able to draw on a large memorised store of lexical phrases was what gave native speakers both their ability to process language fluently and their knack of expressing ideas or speech functions in the appropriate manner. Conversely, learners reveal their non-nativeness in both ways. According to Wray (2002: 206), first language learners focus on large strings of words and decompose them only as much as they need to, for communicative purposes, whereas adult second language learners typically store individual words and draw on them, not very successfully, to compose longer expressions as the need arises. This suggests one interesting basis for distinguishing candidates at different levels in a speaking test, by investigating the extent to which they are able to respond fluently and appropriately to the interviewer’s questions.

Applied linguists are showing increasing interest in the lexical dimension of language acquisition and use. In their research on task-based language learning, Skehan and his associates (Skehan, 1998; Mehnert, 1998; Foster, 2001) have used lexical measures as one means of interpreting the effects of different task variables on learners’ oral production. As part of his more theoretical discussion of the research, Skehan (1998) proposes that the objective of good task design is to achieve the optimum balance between promoting acquisition of the rule system (which he calls syntacticisation) and encouraging the fluent use of lexical phrases (or lexicalisation).

Wray’s (2002) recent book on formulaic language brings together for the first time a broad range of work in various fields and will undoubtedly stimulate further research on multi-word lexical items. In addition, Norbert Schmitt, Zoltan Dornyei and their associates at the University of Nottingham have just completed a series of studies on factors influencing the acquisition of multi-word lexical structures by international students at the university (Schmitt, 2004).

Another line of research relevant to the proposed study is work on the discourse structure of oral interviews. Studies in this area in the 1990s included Ross and Berwick (1992), Young and Milanovic (1992) and Young and He (1998). Lazaraton (2001), in particular, has carried out such research on an ongoing basis in conjunction with UCLES, including her recent analysis of the new IELTS Speaking Test (Lazaraton, 2000, cited in Taylor, 2001).

In one sense, a lexical investigation gives only a limited view of the candidates’ performance in the speaking test. It focuses on specific features of the spoken text rather than the kind of broad discourse analysis undertaken by Lazaraton and appears to relate to just one of the four rating scales employed by examiners in assessing candidates’ performance. Nevertheless, the literature cited above gives ample justification to explore the Speaking Test from a lexical perspective, given the lack of previous research on spoken vocabulary and the growing recognition of the importance of vocabulary in second language learning.


3 RESEARCH QUESTIONS

Based on our reading of the literature, we set out to address the following questions:

1. What can lexical statistics reveal about the vocabulary of a corpus of IELTS Speaking Tests?
2. What are the distinctive characteristics of candidates’ vocabulary use at different band score levels?
3. What kinds of formulaic language are used by candidates in the Speaking Test?
4. Does the use of formulaic language vary according to the candidate’s band score level?

Formulaic language is used here as a cover term for multi-word lexical items, following Wray (2002: 9), who defines a formulaic sequence as:

a sequence, continuous or discontinuous, of words or other elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar.

4 METHOD

4.1 The format of the IELTS Speaking Test

As indicated in the introduction, the IELTS Speaking Test is an individually administered test conducted by a single examiner and is routinely audiotaped. It takes 11–14 minutes and consists of three parts:

Part 1: Interview (4–5 minutes) The candidate answers questions about himself/herself and other familiar topic areas.

Part 2: Long Turn (3–4 minutes) After some preparation time, the candidate speaks for 1–2 minutes on a topic given by the examiner.

Part 3: Discussion (4–5 minutes) The examiner and candidate discuss more abstract issues and concepts related to the Part 2 topic.

The examiner rates the candidate’s performance on four nine-band scales: Fluency and coherence; Lexical resource; Grammatical range and accuracy; and Pronunciation. The four criteria have equal weighting and the final score for speaking is the average of the individual ratings, rounded to a whole band score.
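As a worked example of this scoring rule, in Python: the averaging is as stated above, but how an exact half-band average would be resolved is our assumption, since the report does not spell it out.

    # Worked example: average the four criterion ratings, round to a
    # whole band. Treatment of exact half-band averages is assumed
    # (rounded up), not stated in the report.
    ratings = [7, 6, 6, 6]              # fluency, lexis, grammar, pronunciation
    average = sum(ratings) / len(ratings)   # 6.25
    band = int(average + 0.5)               # round half up (assumed)
    print(average, band)                    # 6.25 6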

4.2 Selection of texts

The corpus of spoken texts for this project was compiled from audiotapes of actual IELTS tests conducted at various test centres around the world in 2002. The tapes had been sent to Cambridge ESOL as part of the routine monitoring process to ensure that adequate standards of reliability are being maintained. The Research and Validation Group of Cambridge ESOL then made a large inventory of nearly 2000 tapes available to approved outside researchers. The list included the following data on each candidate: centre number; candidate number; gender; module (Academic or General Training); Part 2 task number; and band score for Speaking.

The original plan was to select the tapes of 100 candidates for the IELTS Academic Module according to a quota sample. The first sampling criterion was the task (or topic) for Part 2 of the test. We wanted to restrict the number of tasks included in the sample because the topic was likely to have quite an influence on the candidates’ choice of vocabulary, and working with just a few tasks would allow us to reveal its effect. Thus, the sample was limited to candidates who had been given one of four Part 2 tasks: Tasks 70, 78, 79 and 80. The choice of these specific tasks was influenced by the second criterion, which was that the band scores from 4.0 to 8.0 should be evenly represented, to allow for meaningful comparisons of the lexical characteristics of candidate speech at different proficiency levels, and in particular at Bands 4.0, 6.0 and 8.0. Since there are relatively few IELTS candidates who score at Band 4.0 or Band 8.0, compared to the scores in between, it was important to select tasks for which there was an adequate number of tapes across the band score range in the inventory. The four tasks chosen offered the best coverage in this sense.

The score that we used for the selection of candidates was the overall band level for Speaking, rather than the specific rating for Lexical resource (which was also available to us). We decided that, for the purpose of our analyses, it was preferable to classify the candidates according to their speaking proficiency, which was arguably a more reliable and independent measure than the Lexical resource score. In practice, though, the two scores were either the same or no more than one point different for the vast majority of candidates.

Where there were more candidates available than we required, especially at Bands 5.0, 6.0 and 7.0, an effort was made to preserve a gender balance and to include as many test centres in different countries as possible.

However, it was not possible to achieve our ideal selection. Ours was not the first request that Cambridge ESOL had received from outside researchers for the speaking tapes, and a number of our selected tapes were no longer available or could not be located. Thus, the final sample consisted of 88 recorded Speaking Tests, as set out in Table 1.

The sample included 34 female and 54 male candidates. The tests had been administered in a wide range of countries: Australia, Cambodia, China, Colombia, Fiji, Hong Kong, India, Ireland, Libya, New Zealand, Peru, Pakistan, Sudan and the United Kingdom. Although the original intention was to select only Academic Module candidates, the sample included eight who were taking the General Training Module. This was not really a problem for the research because candidates for both modules take the same Speaking Test.

          Task 70   Task 78   Task 79   Task 80   Totals

Band 8       4         4         4*        3        15
Band 7       5         4         6         4        19
Band 6       5         5         5         4        19
Band 5       5         5         5         6        21
Band 4       4         4         2         4        14

Totals      23        22        22        21        88

* One of these tapes turned out to have a different Part 2 task. It was thus excluded from the analyses by task.

Table 1: The final sample of IELTS Speaking Test tapes by band score and Part 2 task


4.3 Preparation of texts for analysis

The transcription of the tapes was undertaken by transcribers employed by the Language in the Workplace Project at Victoria University of Wellington. They had been trained to follow the conventions of the Wellington Archive of New Zealand English transcription system (Vine, Johnson, O’Brien and Robertson, 2002), which is primarily designed for the analysis of workplace discourse. Since the transcribers were mainly Linguistics students employed part-time, the transcribing took nearly nine months to complete.

For the qualitative analyses, the full transcripts were used. To produce text files for the calculation of lexical statistics for the candidates’ speech, the transcripts were electronically edited to remove all of the interviewer utterances, as well as other extraneous elements such as pause markings and notes on speech quality which had been inserted into the transcripts in square brackets. The resulting files were saved as plain text files and then manually edited to delete the hesitations um, er and mm; back-channelling utterances such as mm, mhm, yeah, okay and oh; and false starts represented by incompletely articulated words and by short phrases repeated verbatim. In addition, contracted forms were separated (it’ll → it ’ll, don’t → do n’t) and multi-word proper nouns were linked as single lexical items (Margaret_Thatcher, Lord_of_the_Rings).
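A minimal Python sketch of the kind of editing just described follows. It is an illustration of the steps, not the project’s actual editing pipeline, and the input line simplifies the transcript conventions (brackets for notes, parenthesised timings for pauses).

    import re

    FILLERS = {"um", "er", "mm", "mhm", "oh"}

    def clean_turn(turn):
        turn = re.sub(r"\[.*?\]", " ", turn)            # bracketed transcriber notes
        turn = re.sub(r"\(\d+(\.\d+)?\)", " ", turn)    # timed pause markings
        turn = re.sub(r"n't\b", " n't", turn)           # don't -> do n't
        turn = re.sub(r"'(ll|ve|re|d|m|s)\b", r" '\1", turn)  # it'll -> it 'll
        tokens = [t for t in turn.split() if t.lower() not in FILLERS]
        return " ".join(tokens)

    raw = "um (0.4) we were taken on a boat trip [laughs] er it'll be fun"
    print(clean_turn(raw))
    # -> "we were taken on a boat trip it 'll be fun"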

5 STATISTICAL ANALYSES

5.1 Analytical procedures

To investigate the words used by the candidates, a variety of lexical statistics were calculated, using four different computer programs.

1. WordSmith Tools (Scott, 1998). This is a widely used program for analysing vocabulary in computer corpora. The Wordlist tool was used to identify the most frequently occurring content words, both in the whole corpus and in the texts for each of the four Part 2 tasks. It also provided descriptive statistics on the lexical output of candidates at the five band score levels. A second WordSmith tool, Keyword, allowed us to identify words that were distinctively associated with each of the tasks and with the whole corpus.

2. Range (Nation and Heatley, 1996). This program produces a profile of the vocabulary in a text according to frequency level. It includes three default English vocabulary lists – the first 1000 words, the second 1000 words (both from West, 1953) and the Academic Word List (Coxhead, 2000). The output provides a separate inventory of words from each list, plus words that are not in any of the lists. There are also descriptive statistics which give a summary profile and indicate the relative proportion of high and lower frequency words in the text. The Range program was used to produce profiles not for individual candidates but for each of the five band score levels represented in the corpus. (A small illustrative sketch of this profiling step appears after this list.)

3. P_Lex (Meara and Bell, 2001). Whereas Range creates a frequency profile, P_Lex yields a single summary measure, lambda, calculated by determining how many non-high frequency words occur in every 10-word segment throughout the text. A low lambda shows that the text contains predominantly high-frequency words, whereas a higher value indicates the use of more lower-frequency vocabulary.

4. D_Tools (Meara and Miralpeix, 2004). The purpose of this pair of programs is to calculate the value of D, the measure of lexical diversity devised by Malvern and Richards. D values range from a maximum of 90 down to 0, reflecting the number of different words used in a text.
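To make the frequency-profiling step in item 2 concrete, here is a minimal Python sketch. The three small word sets are toy stand-ins for the GSL and AWL files that the real Range program loads; each type is assigned to the first list that contains it.

    # Minimal sketch of a Range-style frequency profile (toy word
    # lists, not the GSL/AWL files the real program ships with).
    LISTS = [
        ("first 1000",  {"the", "go", "work", "family", "friend", "my",
                         "and", "i", "to", "time"}),
        ("second 1000", {"restaurant", "traditional", "travel"}),
        ("AWL",         {"culture", "media", "transport"}),
    ]

    def profile(tokens):
        counts = {name: 0 for name, _ in LISTS}
        counts["not in lists"] = 0
        for t in {t.lower() for t in tokens}:     # profile the types
            for name, words in LISTS:
                if t in words:
                    counts[name] += 1
                    break
            else:
                counts["not in lists"] += 1
        return counts

    print(profile("the traditional restaurant my family and i go to".split()))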


6 STATISTICAL RESULTS

6.1 Lexical output

Let us first review some characteristics of the overall production of vocabulary by candidates in the test. In Table 2, candidates have been classified according to their band score level and the figures show descriptively how many word forms were produced at each level.

                  TOTALS                 MEANS (standard deviations)
                  Tokens     Types       Tokens             Types

BAND 8 (n=15)     22,366     2374        1491.0 (565.9)     408.1 (106.0)
BAND 7 (n=19)     21,865     2191        1150.7 (186.7)     334.6 (46.0)
BAND 6 (n=19)     18,493     1795         937.3 (261.4)     276.7 (48.2)
BAND 5 (n=21)     15,989     1553         761.4 (146.7)     234.2 (35.5)
BAND 4 (n=14)      6931       996         475.8 (216.9)     166.6 (48.6)

Table 2: Lexical output of IELTS candidates by band score level (WordSmith analysis)

Since there were different numbers of candidates in the five bands, the mean scores in the third and fourth columns of the table give a more accurate indication of the band score distinctions than the raw totals. There is a clear pattern of declining output from top to bottom, with candidates at the higher band score levels producing a much larger amount of vocabulary on average than those at the lower levels, both in terms of tokens and types. It is reasonable to expect that more proficient candidates would have the lexical resources to speak at greater length than those who were less proficient. However, it should also be noted that all the standard deviations were quite large. That is to say, there was great variation within band score levels in lexical production, which means that the number of words used is not in itself a very reliable index of the quality of a candidate’s speech. For example, the range in length of the edited texts for Band 8 candidates was from 728 to 2741 words. Thus, high proficiency learners varied in how talkative they were and in the extent to which the examiner allowed them to speak at length in response to the test questions.

It would be possible to calculate type-token ratios (TTRs) from the figures in Table 2 – and in fact, the WordSmith output includes a standardised TTR. However, as noted above, the TTR is a problematic measure of lexical variation, particularly in a situation like the present one where candidate texts vary widely in length.
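A standardised TTR of the kind WordSmith reports is, in essence, the mean TTR of successive fixed-size chunks of text, which largely removes the length sensitivity of the plain ratio. The following minimal Python sketch illustrates the idea; the chunk size of 100 is chosen here to suit short spoken texts and is not the tool’s own default.

    # Minimal sketch of a standardised TTR: mean TTR over successive
    # whole chunks of fixed size (chunk size chosen for illustration).
    def standardised_ttr(tokens, chunk=100):
        whole = [tokens[i:i + chunk]
                 for i in range(0, len(tokens) - chunk + 1, chunk)]
        if not whole:
            return None                  # text shorter than one chunk
        return sum(len(set(c)) / chunk for c in whole) / len(whole)

    tokens = ("this is a simple made up stretch of speech " * 30).split()
    print(round(standardised_ttr(tokens), 2))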

6.2 Lexical variation

To deal with the TTR problem, Malvern and Richards’ D was calculated by means of D_Tools. The D values for the texts in our corpus are presented in Table 3. As noted in the table, there may be a small bug in the program, because seven texts yielded a value above 90, which is not supposed to happen. An inspection of the seven texts suggested the possibility that the use of rare or unusually diverse vocabulary by some more proficient candidates may tend to distort the calculation, but this will require further investigation. Leaving aside those anomalous cases, the pattern of the findings for lexical variation is somewhat similar to those for lexical output. The mean values for D decline as we go down the band score scale, but again the standard deviations show a large dispersion in the values at each band level, and particularly at Bands 7 and 6.


As a general principle, more proficient candidates use a wider range of vocabulary than less proficient ones, but D by itself cannot reliably distinguish candidates by band score.

D (LEXICAL DIVERSITY)

                  Mean    SD     Maximum   Minimum

BAND 8 (n=11)*    79.0    4.9    87.5      72.0
BAND 7 (n=17)*    71.8    18.2   89.5      61.2
BAND 6 (n=18)*    67.2    16.0   81.4      57.0
BAND 5 (n=21)     63.4    11.3   86.7      39.5
BAND 4 (n=14)     60.7    11.4   76.1      37.5

* Seven candidates with abnormal D values were excluded

Table 3: Summary output from the D_Tools program, by band score level

6.3 Lexical sophistication

The third kind of quantitative analysis used the Range program to classify the words (in this case, the types) into four categories, as set out in Table 4. Essentially, the figures in the table provide Laufer and Nation’s (1995) Lexical Frequency Profile for candidates at the five band score levels represented in our corpus.

If we look at the List 1 column, we see that overall at least half of the words used by the candidates were from the 1000 most frequent words in the language, but the percentage rises with decreasing proficiency, so that the high-frequency words accounted for two-thirds of the types in the speech of Band 4 candidates. Conversely, the figures in the fourth column (“Not in Lists”) show the reverse pattern. Words that are not in the three lists represent less frequent and more specific vocabulary, and it was to be expected that the percentage of such words would be higher among candidates at Bands 8 and 7. In fact, there is an overall decline in the percentage of words outside the lists, from 21% at Band 8 to about 12% at Band 4.


TYPES

                List 1         List 2        List 3        Not in Lists   Total

BAND 8 (n=15)   1270 (53.7%)   347 (14.7%)   243 (10.3%)   504 (21.3%)    2364 (100%)
BAND 7 (n=19)   1190 (54.6%)   329 (15.1%)   205 (9.4%)    455 (20.9%)    2179 (100%)
BAND 6 (n=19)   1060 (59.5%)   266 (14.9%)   179 (10.0%)   277 (15.5%)    1782 (100%)
BAND 5 (n=21)    958 (62.1%)   222 (14.4%)   119 (7.7%)    243 (15.8%)    1542 (100%)
BAND 4 (n=14)    677 (68.5%)   132 (13.3%)    58 (5.9%)    122 (12.3%)     989 (100%)

KEY
List 1         First 1000 words of the GSL (West, 1953)
List 2         Second 1000 words of the GSL
List 3         Academic Word List (Coxhead, 2000)
Not in Lists   Not occurring in any of the above lists

Table 4: Analysis by the Range program of the relative frequency of words (lemmas) used by candidates at different band score levels

The patterns for the two intermediate columns are less clear-cut. Candidates at the various band levels used a variable proportion of words from the second 1000 list, around an overall figure of 13–15%. In the case of the academic vocabulary in List 3, the speech of candidates at Bands 6–8 contained around 9–10% of these words, with the percentage declining to about 6% for Band 4 candidates. If we take the percentages in the third and fourth columns as representing the use of more “sophisticated” vocabulary, we can say that higher proficiency candidates used substantially more of those words.

Another perspective on the lexical sophistication of the speaking texts is provided by Meara and Bell’s (2001) P-Lex program, which produces a summary measure – lambda – based on this same distinction between high and low-frequency vocabulary use in individual texts. As noted above, a low value of lambda shows that the text contains mostly high-frequency words, whereas a higher value is intended to indicate more sophisticated vocabulary use.

In Table 5, the mean values of lambda show the expected decline from Band 8 to 4, confirming the pattern in Table 4 that higher proficiency candidates used a greater proportion of lower-frequency vocabulary in their speech. However, the standard deviations and the range figures also demonstrate what was seen in Tables 2 and 3; except to some degree at Band 6, there was a great deal of variation within band score levels.


LAMBDA

                Mean   SD     Maximum   Minimum

BAND 8 (n=15)   1.10   0.22   1.50      0.77
BAND 7 (n=19)   1.05   0.26   1.49      0.60
BAND 6 (n=19)   0.89   0.17   1.17      0.55
BAND 5 (n=21)   0.88   0.24   1.38      0.33
BAND 4 (n=14)   0.83   0.33   1.48      0.40

Table 5: Summary output from the P_Lex program, by band score level

To get some indication of why such variation might occur, it is interesting to look at candidates for whom there is a big mismatch between the band score level and the value of lambda. There were four cases of Band 8 candidates with lambdas between 0.77 and 0.86. An inspection of their transcripts suggests the following tentative explanations:

Candidate 62 may have been overrated as Band 8, based on the simple language used and the apparent difficulty in understanding some of the examiner’s questions in Part 3 of the test.

Candidate 19 spoke fluently in idiomatic English composed largely of high-frequency words.

Three of them used relatively few technical terms in discussing their employment, their study and the Part 2 task.

Candidate 76 used quite a lot of technical terminology in talking about his employment history but switched to a much greater proportion of high-frequency vocabulary in the rest of the test.

On the other hand, four Band 4 candidates had lambdas between 1.16 and 1.48. There is an interesting contrast between two Band 4 candidates who said relatively little in the test (their edited texts are both around 300 words) but who had markedly different lambdas. Candidate 78 responded in simple high-frequency vocabulary, which produced a value of 0.44, whereas Candidate 77 used quite a few somewhat lower frequency words, often ones that were repeated from the examiner’s questions (available, transport, celebrating, information, encourage), and thus obtained a lambda of 1.48. The other Band 4 candidates with high lambdas also appeared to produce a good proportion of words outside the high-frequency vocabulary range, relative to their small lexical output. Another factor with some Band 4 candidates was that poor pronunciation reduced their intelligibility on tape, with the result that it was difficult for the transcriber to make a full record of what they said – and this may have affected high-frequency function words more than phonologically salient lexical items.

Some of these lexical characteristics of performance at the different band score levels are considered further below in the qualitative analysis of the transcripts.


6.4 Key words in the four tasks

To investigate the vocabulary associated with particular topics, the texts were classified according to the four Part 2 tasks represented in the corpus. There were about 21–23 texts for each task. Table 6 lists the most frequently occurring content word forms in descending order, according to the WordSmith word lists. The lists have been lemmatised, in the sense that a stem word and its inflected forms (cook, cooking, cooked; book, books) were counted as a single unit, or lemma.
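A minimal Python sketch of this folding step follows, using a crude suffix-stripping rule purely for illustration; the report does not say how its lists were actually lemmatised, so the rule below should not be read as the study’s procedure.

    from collections import Counter

    # Minimal sketch of lemmatisation by crude suffix stripping.
    def crude_lemma(word):
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    tokens = "cook cooking cooked book books read reading".split()
    print(Counter(crude_lemma(t) for t in tokens))
    # Counter({'cook': 3, 'book': 2, 'read': 2})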

TASK 70 Eating out (n=23): food 269, think 190, people 187, like (vb) 177, restaurant 151, time 125, good 117, friend 104, eat 96, fast 86, place 79, home 60, work 58, cook 54, know 54, country 50, travel 49, family 47, year 41, nice 39, city 38, spend 38, name 38, traditional 36, talk 31, different 30, find 30, study 28, course 27, prefer 26, enjoy 26, dish 25, prepare 25

TASK 78 Reading a book (n=22): book 333, read 224, think 195, people 130, like (vb) 126, time 82, friend 81, name 80, good 69, work 61, different 49, child 48, study 46, story 44, life 43, television 43, problem 42, write 38, family 37, find 34, important 32, interesting 32, country 29, help 28, learn 28, city 26, love 25

TASK 79 Language learning (n=21): English 315, think 226, learn 175, language 130, like (vb) 129, people 105, know 79, start 79, speak 72, school 67, friend 62, different 60, time 57, country 55, study 54, important 53, good 51, year 50, difficult 48, start 47, class 44, music 42, work 42, listen 41, name 39, word 38, teach 37, teacher 35, place 34, write 34, grammar 31, interesting 31, mean 31, travel 31, family 28, talk 27, new 26, university 26, foreign 25

TASK 80 Describing a person (n=21): people 340, think 229, know 157, famous 148, good 88, name 87, like (vb) 85, person 79, friend 63, work 59, country 55, time 55, year 45, day 43, life 40, family 39, help 39, study 37, city 36, important 34, public 33, different 32, example 30, way 29, problem 29, history 25, transport 25

Note: Some high-frequency verbs which occurred fairly uniformly across the four tasks have been excluded: get, go, make, say, see, use/used to and want.

Table 6: The most frequent content words used by candidates according to their Part 2 topic (WordSmith Wordlist analysis)

The lists represent, in a sense, the default vocabulary for each topic – the mostly high-frequency words one would expect learners to use in talking about the topic. As such, these words will almost certainly not be salient for the examiners in rating the learners’ lexical resource, except perhaps in the case of low-proficiency candidates who exhibit uncertain mastery of even this basic vocabulary.

It should be remembered that these lists come from the full test for each candidate, not just Parts 2 and 3, where the designated topic was being discussed. This helps to explain why words such as friend, people, family, study and country tend to occur on all four lists, because of the frequency of these words in Part 1 of the test, where candidates talked about themselves and their background, including, in particular, a question about whether they preferred to socialise with family members or friends.

Words were selected for the four lists down to a frequency of 25. It is interesting to note some variation between topics in the number of words above that minimum level. The longest list was generated by Task 79, on language learning. This indicates that, from a lexical point of view, the candidates discussed this topic in similar terms, so that a relatively small number of words, including English, learn, language, study, listen, word and talk, recurred quite frequently. That is to say, their experience of language learning had much in common from a vocabulary perspective. By contrast, for Task 78 the list of frequently repeated words is noticeably shorter, presumably because the books that the candidates chose to discuss had quite varied characteristics. The same would apply to the people that candidates who were assigned Task 80 chose to talk about.

TASK 70 Eating out (n=23): food 463.1, restaurant 327.8, fast 184.0, eat 104.8, foods 90.0, eating 86.7, go 76.1, cook 74.3, like 58.8, home 57.7, traditional 52.0, restaurants 47.0, dishes 45.3, cooking 45.3, nice 42.2, out 40.0, McDonalds 32.0, meal 31.3, delicious 29.3, shop 26.6, healthy 24.0

TASK 78 Reading a book (n=22): read 342.8, books 309.2, book 358.9, reading 102.2, story 66.4, children 57.2, internet 38.4, television 38.4, girl 36.8, men 36.8, writer 35.1, boy 29.7, this 28.6, hear 28.5, women 27.4, fiction 24.3

TASK 79 Language learning (n=21): English 713.1, language 233.6, learn 251.1, speak 99.4, learning 76.8, languages 74.7, school 72.4, class 69.7, grammar 62.2, communicate 56.2, foreign 52.1, started 40.5, words 37.7, speaking 34.9, teacher 33.8, difficult 32.4, communication 29.3, listening 27.5

TASK 80 Describing a person (n=21): he 346.5, famous 270.4, people 115.2, him 110.6, person 76.0, his 60.2, public 53.0, admire 51.5, who 50.6, known 48.5, media 45.7, become 42.0, she 39.0, chairman 24.2, president 24.2

Table 7: Results of the WordSmith Keyword analysis for the four Part 2 tasks

Another facility offered by WordSmith is a Keyword analysis, which identifies words occurring with high frequency in a particular text – or set of texts – as compared with their occurrence in a reference corpus. For this purpose, the texts associated with each of the four Part 2 tasks were collectively analysed by reference to the corpus formed by the texts on the other three tasks. The results can be seen in Table 7, which lists the keywords for each of the four tasks, accompanied by a keyness statistic, representing the extent of the mismatch in frequency between the words in the texts for a particular task and in the rest of the corpus.
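Keyness statistics of this general kind are typically log-likelihood ratios comparing a word’s observed frequency in the study texts against a reference corpus. Here is a minimal Python sketch of such a calculation; whether Table 7 rests on exactly this formula is our assumption, and the reference frequencies in the example are invented for illustration.

    import math

    # Minimal sketch of a log-likelihood keyness statistic comparing a
    # word's frequency in a study corpus against a reference corpus.
    def log_likelihood(a, b, corpus_a, corpus_b):
        """a, b: observed frequencies in corpora of the given sizes."""
        e_a = corpus_a * (a + b) / (corpus_a + corpus_b)   # expected counts
        e_b = corpus_b * (a + b) / (corpus_a + corpus_b)
        ll = 0.0
        for obs, exp in ((a, e_a), (b, e_b)):
            if obs > 0:
                ll += obs * math.log(obs / exp)
        return 2 * ll

    # e.g. a word occurring 151 times in ~20,000 words of one task's
    # texts versus 20 times in ~60,000 words of the other three tasks
    # (these corpus sizes and reference counts are invented)
    print(round(log_likelihood(151, 20, 20_000, 60_000), 1))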

The keyword results show more clearly than the previous analysis the semantically salient words associated with each task. From a lexical point of view, it is the vocabulary needed for the Part 2 long turn and the Part 3 discussion which dominates each candidate’s Speaking Test.


7 QUALITATIVE ANALYSES

To complement the statistical analyses, a subset of the test transcripts was selected for a more qualitative examination. There were two aims in this part of the study:

1. to identify lexical features of the candidate speech which might help to distinguish performance at different band score levels

2. to seek evidence of the role that formulaic language might play in the Speaking Test.

7.1 Procedures

The approach to this phase of the analysis was exploratory and inherently subjective in nature. As we and others have previously noted (Wray, 2002; Schmitt and Carter, 2004; Read and Nation, 2004), there is a great deal of uncertainty about both how to define formulaic language in general, and how to identify particular sequences of words as formulaic.

Our initial expectations were that formulaic language could potentially take a number of different forms in the IELTS Speaking Test:

1. The examiner’s speech in the test is constrained by a “frame”, which is essentially a script specifying the questions that should be asked, with only limited options to tailor them for an individual candidate. This might give the examiner’s speech a formulaic character which would in turn be reflected in the way that the candidate responded to the questions.

2. In the case of high-proficiency candidates who were fluent speakers of the language, one kind of evidence for their fluency could be the use of a wide range of idiomatic expressions, ie, sequences of words appropriately conveying a meaning which might not be predictable from knowledge of the individual words. This would make their speech seem more native-like than that of candidates at lower band score levels.

3. Conversely, lower-proficiency candidates might attempt such expressions but produce ones that were inaccurate or inappropriate.

4. At a low level, candidates might show evidence of using (or perhaps overusing) a number of fixed expressions that they had consciously memorised in an effort to improve their performance in the test. It could be argued that the widespread availability of IELTS preparation courses and materials might encourage this tendency.

In order to highlight contrasts between score levels, the transcripts at Bands 8, 6 and 4 for each of the four tasks were selected for analysis. Our strategy was to read each of the selected transcripts carefully, marking words, phrases and longer sequences that seemed to be lexically distinctive in the following ways:

- individual words that we judged to be of low frequency, whether or not they were accurately or appropriately used
- words or phrases which had a pragmatic or discourse function within the text
- sequences of words which could in some sense be regarded as formulaic.

At this point, it is useful to make a distinction between formulaic sequences which could be recognised as such on the basis of native speaker intuition, and sequences that were formulaic for the individual learner as a result of being stored and retrieved as whole lexical units, regardless of how idiomatic they might be judged as being by native speakers. One indication that a sequence was formulaic in the latter sense was that it was produced by the candidate with little if any pauses, hesitation or false start. Another was that the same sequence – or a similar one – was used by the candidate more than once during the test.
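The second of these indications, recurrence, is straightforward to check mechanically. Here is a minimal Python sketch that counts word sequences a candidate uses more than once; the sample sentence paraphrases a Band 8 turn quoted in section 8.1 below, and pause or hesitation evidence would additionally require the timing information in the full transcripts.

    from collections import Counter

    # Minimal sketch: recurring n-grams in a candidate's edited speech.
    def repeated_ngrams(tokens, n=3, min_count=2):
        grams = Counter(tuple(tokens[i:i + n])
                        for i in range(len(tokens) - n + 1))
        return {" ".join(g): c for g, c in grams.items() if c >= min_count}

    speech = ("i prefer to have one or two very close friends but i have so "
              "many friends i prefer to have just one or two friends who are "
              "very close to me").split()
    print(repeated_ngrams(speech))
    # {'i prefer to': 2, 'prefer to have': 2, 'one or two': 2}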


8 QUALITATIVE RESULTS

8.1 Band 8

As noted in the results of the statistical analyses, the candidates at Band 8 produced substantially more words as a group than did those at lower proficiency levels. However, the quality of their vocabulary use was also distinctive. This was reflected partly in their confident use of low frequency vocabulary items, particularly those associated with their employment or their leisure interests. Several of the Band 8 candidates in the sample were medical practitioners and here for example is Candidate 01 recounting the daily routine at his hospital:

..and after that um I should er go back to the ward to check patients and check if there’s any complication from receiving the drugs er usually after er er giving the drugs er some drugs may cause side effects which need to my intervention …

The underlined words are obviously more or less technical terms in medicine and one would expect a doctor to have command of them.

Similarly, Candidate 48 described her favourite movie actor in this way:

… he is a very, very versatile actor like he’s er he has got his own styles and mannerisms in a very short span er in two decades or two and a half decades he has established himself as a very good actor in the (cine field) …

The use of styles and span here may not be entirely “native-like”, but the candidate was able to give a convincing description of the actor.

Thus, high-proficiency candidates have available to them a wide range of low frequency words that allow them to express more specific meanings than can be conveyed with more general vocabulary.

However, it is important to emphasise that such lower-frequency vocabulary does not necessarily occur with high density in the speech of Band 8 candidates. The sophistication of their vocabulary ability may also be reflected in their use of formulaic sequences – made up largely or entirely of high-frequency words – which give their speech a native-like quality. Here are some excerpts from the transcripts of Band 8 candidates, with some of the sequences that we consider to be formulaic underlined:

one of the main reasons [why he became a doctor] was both my parents are doctors so naturally I got into that line but I was also interested in this medicine, as such, of course the money factors come into play (Candidate 54)

it’s quite nice [describing a restaurant] it’s er its er Japanese er all type of food but basically what I like there is the sushi, I love sushi so I just enjoy going there and when you go in they start shouting and stuff, very Japanese culture type of restaurant which is very good (Candidate 19)

[after visiting a new place] …I like to remember everything later on and er I don’t know it’s a habit I just keep picking up these small things like er + um if I go to the northern areas that is if I go to [place name] or some place like that, I’ll be picking up these small pieces and them um on the way back when I look at them I was like God, I cannot explain why I got this, there’s just this weird stuff that I’ve picked up … (Candidate 72)

A related feature of the speech of many Band 8 candidates was the use of short words or phrases functioning as pragmatic devices, or what Hasselgren (2002) has termed “smallwords”. These include you know, I mean, I guess, actually, basically, obviously, like and okay. These tend to be semantically “empty” and as such might be considered outside the scope of a lexical analysis, but nevertheless they need to be included in any account of formulaic language use. Here are some examples from Candidate 47, who possibly overdoes the use of such devices:

I’m a marine engineer actually so er I work on the ship and er basically … we have to go wherever the ship you know goes and so obviously we are on the ship so basically I am taking care of the machinery and that’s it so + er well I’ve travelled quite a lot you know I mean all around the world …

Another distinctive user was Candidate 38:

[I prefer America] er because um to be frank like er um people were nice I mean they were not biased or you know they didn’t show any coloured preference or whatever yeah they were more friendly …

In most cases, these pragmatic devices did not occur as frequently as in these two excerpts, but they were still a noticeable feature of the speech in many of the Band 8 transcripts that we examined.

Another kind of device was the use of discourse markers to enumerate two or three points that the candidate wanted to make:

if you compare er my language with er English … it’s completely different … because .. er firstly we write from right to left and in English you write from right to left … um another thing the grammar our grammar it’s not like English grammar … (Candidate 62)

my name has two meanings there’s one um it’s actually a Muslim name so there’s two meanings to that one is that it means a guardian from heaven and the second meaning it’s er second name it was given to a tribe of people that were lofty and known for their arrogance (Candidate 71)

These discourse markers were not so common in candidate speech, which is perhaps a little surprising, particularly in relation to the Part 2 long turn, when the candidates were given a minute or so to prepare what they were going to say.

It is important not to overstate the extent to which the features identified so far can be found in the speech of all the candidates at Band 8. In fact, they varied in the extent to which their speech appeared to be formulaic, in the sense of containing a lot of idiomatic expressions, pragmatic devices and so on. Here is a candidate who expresses her opinion about the importance of English in a relatively plain style:

er I think the English language is very important now + at first it didn’t used to be, actually it has been strong for the last er fifty years but importance was not given to it + now in every organisation in every school in every college, er basically at the university level everything is taught in English basically so you need to understand the language, I think we students are better off because we are studying from a younger age we understand the language but a big problem we have here is that + people don’t communicate but now teachers encourage the students to speak in English and um + it is very important (Candidate 83)

Apart from the words actually and basically, plus a phrase such as are better off, there is not much in this excerpt which could be considered formulaic in any overt way.

Another example is this candidate talking about the kind of friends he prefers:

er normally I prefer one or two very close friends so that I can discuss with them if I have any problems or things like that, I can have more contact close contact with them instead of having so many friends, but I have so many friends I make friends as soon as I see I see people for the first time it’s like when I came here today I talk to a number of people here … but I have I prefer to have just one or two friends who are very close to me (Candidate 64)

On the face of it, these opinions are expressed in very simple vocabulary without any idiomatic expression. It should be noted that the phrase prefer to have one or two very close friends was part of the preceding question asked by the examiner and thus the opening statement is formulaic in the sense that it echoes what the examiner said. On closer inspection, there are other phrases that could be formulaic, such as or things like that, close contact with them, I make friends and it’s like when.

8.2 Band 6

Compared with Band 8 candidates, those at Band 6 showed some similar features but overall displayed a more limited range of vocabulary and less idiomatic expression.

One tendency among Band 6 candidates was either to use an incorrect form of a word or to reveal some uncertainty about what the correct form was. For instance, Candidate 09 said … if I go by myself maybe some dangerous or something and it’s more e- economy if I travel with other peoples. Similarly, Candidate 69 made statements such as when I was third year old, the differences between health and dirty and, in perhaps an extreme case of uncertainty, … then my parents brang bring bringed me branged me here ….

One noticeable characteristic of many candidates at this level was the occurrence of a mixture of appropriate and inappropriate expression, both in individual word choice and in the longer word sequences which they produced. Here are some examples:

I think adventurous books are really good for um pleasure time where you can sit and you can think and read those books and really come into real world … (Candidate 36)

No people rarely do [change their names] especially because er first of they’re proud of their names and proud of their tribes if you ever ever er go through the history of those people … they would think themselves like a very proudy person and most of the people don’t change their name (Candidate 84)

Mm I think train is better because it’s fast and convenient but sometimes when in the weekend there’s many people who are travel by train to somewhere else + so I think that time is very busy (Candidate 10)

These examples illustrate how Band 6 candidates were able to communicate their meaning effectively enough, even though they made some errors and did not express themselves in the more idiomatic fashion that Band 8 candidates were capable of. Here is one further example, which includes low-frequency vocabulary such as relaxation, dwelling, cassette recorder and distract, as well as the formulaic expression (it’s) (just) a matter of…, but in other respects it is not very idiomatic:

I usually listen to music as a relaxation time after duties at my dwelling it’s just a matter of relaxation ( ) cassette recorder certain cassettes I have picked ( ) I am travelling from town to town in the recorder of my car I used to put it on just a matter of you can distract ( ) going by your thoughts ( ) cannot sleep ( ) so it’s a matter of relaxation ( ) something to distract me ( ) also I enjoy it very much. (Candidate 87)

Candidates at Band 6 did not generally use pragmatic devices such as actually, you know and I mean with any frequency. Candidate 69 is a clear exception, but the other transcripts contained few, if any, examples of such devices.
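Counts of this kind are straightforward to automate. Below is a minimal sketch, in Python, of how occurrences of such pragmatic devices might be tallied in a plain-text transcript; the device inventory, the tokenisation and the sample input are illustrative assumptions for the sketch, not the procedure followed in this study.

```python
import re
from collections import Counter

# Illustrative inventory of pragmatic devices; the study did not work
# from a fixed list, so this set is an assumption of the sketch.
DEVICES = ["actually", "basically", "you know", "i mean"]

def count_devices(transcript: str) -> Counter:
    """Count case-insensitive, whole-phrase occurrences of each device."""
    text = transcript.lower()
    counts = Counter()
    for device in DEVICES:
        # \b word boundaries stop 'actually' matching inside 'factually'
        counts[device] = len(re.findall(r"\b" + re.escape(device) + r"\b", text))
    return counts

sample = ("er because um to be frank like er um people were nice "
          "I mean they were not biased or you know they didn't show "
          "any coloured preference")
print(count_devices(sample))  # i mean: 1, you know: 1, actually: 0, ...
```

Per-transcript counts of this sort, normalised for the length of each candidate’s speech, would allow the impressionistic comparisons made here to be checked against the full set of transcripts.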


8.3 Band 4

First, it should be noted that the transcribers faced some practical difficulties in accurately recording what candidates at this level said, both because their accents could reduce intelligibility and because their answers to questions were not always coherent, particularly when the candidate had not properly understood the question.

Although candidates at this level used predominantly high-frequency vocabulary, they often knew some key lower-frequency words related to familiar topics, which they would use without necessarily being able to incorporate them into well-formulated statements, as in this response by Candidate 77 about transport in his city:

Transport problems locally there is a problem of these ( ) and er rickshaws motor rickshaws a lot of problems of making pollution and er problems

Here is another example from Candidate 18:

Er in my case I have a working holiday visa yes (before) I I worked as salesperson in convenience shop

A third example is a description of a local festival by Candidate 73:

Er is er our locality is very famous we’re celebrating [name] festivals and we er too celebrate with our er relatives and there’s a big gathering there and we always er make chitchat and we negotiate and deal of our personal characters in such kind of + festi- festivals

There was not a great deal of evidence of formulaic language among the candidates at the Band 4 level. In some respects, the most formulaic section of the test was at the very beginning, as in this exchange:

IN: Can you tell me your full name please?
CM: My full name is [full name]
IN: Okay and um and what shall I call you?
CM: Um you can call me [name]
IN: [Name] okay [name] er can you tell me where you’re from [name]?
CM: Er I’m from [place] in [country] (Candidate 65)

Of course, this introductory part was formulaic to varying degrees for candidates at all levels of proficiency because examiners are required to go through the routine at the beginning of every test.

There was one Band 4 candidate who gave an unusually well-formulated response, which seems formulaic in the sense that it was perhaps a rehearsed explanation of her decision to study medicine:

I want to be a doctor because I think this is a meaningful job to use my knowledge to help others and also to contribute to the society (Candidate 45)

More typically, the responses by Band 4 candidates to questions that they had understood were not nearly as well-formed as this. For example, Candidate 78 responded thus to a question about English teaching in her country:

Er in my school is very good I can er I’ll er + read there er two years last + nine ten matric than I’ll leave the school go to college + and there’s no good English in colleges

For the most part, there were only certain limited sequences which we could identify as in any way formulaic in the speech of these low-proficiency candidates. For instance, Candidate 80 used the formula Yes of course six times. Other phrases such as most of the time, in my opinion, first of all, I don’t know, I’m not sure and I like music very much occurred only sporadically in the transcripts we examined.

Particularly in Part 3, which is designed to be the most challenging section of the test, the Band 4 candidates had difficulty in understanding the examiner’s questions, let alone composing an adequate answer. However, even here they mostly lacked formulaic expressions with which to signal their difficulty and request a repetition of the question. Some used pardon, please or (I’m) sorry, or else just struggled to respond as best they could. Exceptions were I do not understand (Candidate 80) and sorry I don’t exactly understand what you’re ( ) can you repeat please (Candidate 45).

9 DISCUSSION

In this study we used a variety of statistical tools, as well as our own judgement, to explore the lexical characteristics of oral texts produced by IELTS candidates in the Speaking Test.

We decided to conduct most of the analyses using the band scores for speaking which had been assigned to the candidates’ performance by the examiners in the operational situation. For research purposes, it might have been desirable to check the reliability of the original scores by having the tapes re-rated by two certificated examiners. On the other hand, the fact that the recording quality of the audiotapes was quite variable, and that rating of tapes is a different experience from assessing candidates live, meant that the re-ratings would not necessarily have produced more valid measures of the candidates’ speaking ability.

Classifying the candidates by band score, then, we found that the lexical statistics revealed broad patterns in the use of individual word forms which accorded with general expectations:

- Higher proficiency candidates gave more extended responses to the questions and thus produced more vocabulary than lower proficiency candidates.

- Candidates with higher band scores also used a wider range of vocabulary than those on lower band scores.

- The speech of less proficient candidates contained a higher proportion of high-frequency words, particularly the first 1000 most frequent words in the language, reflecting the limitations of their vocabulary knowledge.

- Conversely, higher proficiency candidates used greater percentages of lower frequency words, demonstrating their larger vocabulary size and their ability to use more specific and technical terms as appropriate.

It is important, though, that all of these findings should be seen as tendencies of varying strengths rather than defining characteristics of a particular band score level, because in all cases there was substantial variation within levels. Thus, for instance, some Band 8 candidates gave relatively short responses and used predominantly high-frequency word forms, whereas those at Band 4 often produced quite a few low-frequency words, which could form a substantial proportion of their lexical output. Another point worth reiterating here is that, following Nation (2001, pp 13-16), we are defining “high-frequency” as occurring among the 2000 most frequent words in English – and, in the case of the P_Lex analysis, even more narrowly as the first 1000 words. As Nation (2001, p 19) also notes, the distinction between high and low frequency is a somewhat arbitrary one, and many very familiar words are classified as low-frequency by this criterion. However, the division still seems to provide a useful basis for evaluating the lexical quality of these oral texts.
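To make the frequency-band measure concrete, the sketch below shows one way of computing the proportion of a transcript’s tokens that falls within the first 1000 and first 2000 most frequent words. It is a simplified Python illustration under stated assumptions: the wordlist file names are hypothetical, the tokenisation is crude, and the actual analyses reported here used the Range program (Nation and Heatley, 1996) and P_Lex (Meara and Bell, 2001) rather than this code.

```python
import re

def load_wordlist(path: str) -> set:
    """Load a frequency band stored as one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def frequency_profile(text: str, first_1000: set, first_2000: set) -> dict:
    """Proportion of tokens in the 1k band, the 2k band, and beyond."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"first_1k": 0.0, "first_2k": 0.0, "beyond_2k": 0.0}
    n = len(tokens)
    in_1k = sum(t in first_1000 for t in tokens)
    in_2k = sum(t in first_2000 for t in tokens)  # first_2000 contains first_1000
    return {"first_1k": in_1k / n,
            "first_2k": in_2k / n,
            "beyond_2k": (n - in_2k) / n}

# Hypothetical file names for the two base lists:
# first_1000 = load_wordlist("basewrd1.txt")
# first_2000 = first_1000 | load_wordlist("basewrd2.txt")
# print(frequency_profile(candidate_text, first_1000, first_2000))
```

On a profile of this kind, the general expectation described above is that lower-band candidates show a higher first_1k proportion and higher-band candidates a larger beyond_2k proportion, with the caveat about within-band variation applying throughout.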


No particular analysis was conducted of technical terms used by these IELTS candidates. The test questions are not really intended to elicit much discussion of the candidate’s field of study or employment, particularly since the same test material is used with both Academic and General Training candidates. Within the short time-span of the test, the examiner cannot afford to let the candidate speak at length on any one topic. Even the Part 2 “long turn” is supposed to be restricted to 1–2 minutes. Nevertheless, some more proficient candidates who were well-established professionals in medicine, finance or engineering did give relatively technical accounts of their professional experience and interests in Parts 1 and 2 of the test.

The WordSmith analyses of the four Part 2 tasks clearly showed the influence of the topic that was the focus of Parts 2 and 3 of each candidate’s test. The distinctive, frequently occurring content words were mostly those associated with the Part 2 task, which then led to the more demanding follow-up questions in Part 3. One interesting point to emerge from the analysis of the four topics was that they varied in terms of the range of content vocabulary that they elicited. Task 79, which concerned the candidates’ experience of learning English, was the most narrowly focused in this regard. In other words, the candidates who talked on this topic tended to draw on the same lexical set related to formal study of the language in a classroom. On the other hand, Tasks 78 (a book) and 80 (a person) required some generic terms, but also more specific vocabulary to talk about the particular characteristics of the book or person.
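The comparison underlying this observation can be illustrated with a simple keyness calculation: identifying the content words that occur disproportionately often in the transcripts for one task relative to the transcripts for the other tasks. The sketch below uses the familiar log-likelihood statistic; it illustrates the general technique only, is not WordSmith’s own implementation, and again uses a deliberately crude tokenisation as an assumption.

```python
import math
import re
from collections import Counter

def tokenise(text: str) -> list:
    # deliberately crude tokenisation, for illustration only
    return re.findall(r"[a-z']+", text.lower())

def keywords(task_text: str, reference_text: str, top_n: int = 10) -> list:
    """Rank words over-used in task_text by the log-likelihood statistic."""
    t = Counter(tokenise(task_text))
    r = Counter(tokenise(reference_text))
    nt, nr = sum(t.values()), sum(r.values())
    scores = {}
    for word, a in t.items():
        b = r.get(word, 0)
        # expected counts if the word occurred at the same rate in both corpora
        expected_t = nt * (a + b) / (nt + nr)
        expected_r = nr * (a + b) / (nt + nr)
        ll = 2 * a * math.log(a / expected_t)
        if b:
            ll += 2 * b * math.log(b / expected_r)
        if nr == 0 or a / nt > b / nr:  # keep only words over-used in the task
            scores[word] = ll
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

# keywords(task_79_transcripts, all_other_transcripts) would, on the
# account given above, surface a narrow lexical set for Task 79.
```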

The qualitative analysis was exploratory in nature and the findings must be regarded as suggestive rather than in any way conclusive. As noted in the literature review, there are no well-established procedures for identifying formulaic language, which indeed can be defined in several different ways. We found it no easier than previous researchers to confidently identify multi-word units as formulaic in nature on the basis of a careful reading of the transcripts. The comparison of transcripts within and across Bands 4, 6 and 8 produced some interesting patterns of lexical distinction between candidates at these widely separated proficiency levels. However, we were also conscious of the amount of individual variation within levels, which of course was one of the findings of the quantitative analysis as well. It should also be pointed out that the candidates whose tapes we were working with comprised a relatively small, non-probabilistic sample of the IELTS candidates worldwide – another reason for caution in drawing any firm conclusions.

The simple fact of working with the transcripts obliged us to shift from focusing on the individual word forms that were the primary units of analysis for the statistical procedures to a consideration of how the forms combined into multi-word lexical units in the candidates’ speech. This gave another perspective on the concept of lexical sophistication. In the statistical analyses, sophistication is conceived in terms of the occurrence of low frequency words in the language user’s production. The qualitative analysis, particularly of Band 8 texts, highlighted the point that the lexical superiority of these candidates was shown not only by their use of individual words but also their mastery of colloquial or idiomatic expressions which were often composed of relatively high-frequency words.
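One exploratory way of surfacing candidate multi-word units is simply to list recurrent word sequences for manual inspection. The sketch below extracts repeated n-grams from a transcript; recurrence alone does not make a sequence formulaic, and given the absence of well-established identification procedures noted above, any such output is no more than a shortlist for a human judge. The sample excerpt is adapted from Candidate 64, quoted earlier.

```python
import re
from collections import Counter

def recurrent_ngrams(text: str, n: int = 3, min_count: int = 2) -> list:
    """List n-grams occurring at least min_count times, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return [(" ".join(g), c) for g, c in grams.most_common() if c >= min_count]

# Adapted from Candidate 64's response quoted earlier:
excerpt = ("normally I prefer one or two very close friends "
           "I prefer to have just one or two friends who are very close to me")
print(recurrent_ngrams(excerpt, n=2))
# e.g. [('i prefer', 2), ('one or', 2), ('or two', 2), ('very close', 2)]
```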

10 CONCLUSION

In the first instance, this study can be seen as a useful contribution to the analysis of spoken vocabulary in English, an area which is receiving more attention now after a long period of neglect. Within a somewhat specialised context – non-native speakers performing in a high-stakes proficiency test – the research offers interesting insights into oral vocabulary use, both at the level of individual words and through multi-word formulaic units. The texts are incomplete in one sense, in that the examiner’s speech has been deleted, but of course the primary focus of the assessment is on what the candidate says (and discourse analytic procedures such as those used by Lazaraton (2002) are more appropriate for investigating the interactive nature of the Speaking Test). Although oral texts like these are certainly not as tidy as written ones, it appears that lexical statistics can provide an informative summary of some key aspects of the vocabulary they contain.

From the perspective of IELTS itself, it is important to investigate vocabulary use in the Speaking Test as part of the ongoing validation of the IELTS test, particularly as Lexical resource is one of the criteria on which the candidates’ performance is assessed. Our findings suggest that it is not surprising if examiners have some difficulty in reliably rating vocabulary performance as a separate component from the other three rating criteria. Whereas broad distinctions can be identified across band score levels, we found considerable variation in vocabulary use by candidates within levels. Ideally, research of this kind will, in the longer term, inform a revision of the rating descriptors for the Lexical resource scale, so that they direct the examiners’ attention to salient distinguishing features of the different bands. However, it would be premature to attempt to identify such features on the basis of the present study.

One fruitful area of further research would be to ask a group of IELTS examiners to listen to a sample of the Speaking Test tapes and discuss the features of each candidate’s vocabulary use that were noticeable to them. Their comments could then be compared with the results of the present study to see to what extent there was a match between their subjective perceptions and the various quantitative measures. However, it should also be remembered that, in the operational setting, examiners need to be monitoring all four rateable components of the candidate’s performance, thus restricting the amount of attention they can pay to Lexical resource or any one of the others. It may well be that it is unrealistic to expect them to reliably separate the components. Moreover, the formulaic nature of oral language, as we observed it in our data particularly among Band 8 candidates, calls into question the whole notion of a clear distinction between vocabulary and grammar. Thus, while as vocabulary researchers we emphasise the importance of the lexical dimension of second language performance, we also recognise that it represents one perspective among several on what determines how effectively a candidate can perform in the IELTS Speaking Test.


REFERENCES

Adolphs, S and Schmitt, N, 2003, ‘Lexical coverage of spoken discourse’, Applied Linguistics, vol 24, pp 425-438

Ball, F, 2001, ‘Using corpora in language testing’ in Research Notes 6, EFL Division, University of Cambridge Local Examinations Syndicate, Cambridge, pp 6-8

Catt, C, 2001, IELTS Speaking – preparation and practice, Catt Publishing, Christchurch

Durán, P, Malvern, D, Richards, B and Chipere, N, 2004, ‘Developmental trends in lexical diversity’, Applied Linguistics, vol 25, pp 220-242

Foster, P, 2001, ‘Rules and routines: A consideration of their role in the task-based language production of native and non-native speakers’ in Researching pedagogic tasks: Second language learning, teaching and testing, eds M Bygate, P Skehan and M Swain, Longman, Harlow, pp 75-93

Hasselgren, A, 2002, ‘Learner corpora and language testing: Smallwords as markers of oral fluency’ in Computer learner corpora, second language acquisition and foreign language teaching, eds S Granger, J Hung and S Petch-Tyson, John Benjamins, Amsterdam, pp 143-173

Hawkey, R, 2001, ‘Towards a common scale to describe L2 writing performance’ in Research Notes 5, EFL Division, University of Cambridge Local Examinations Syndicate, Cambridge, pp 9-13

Laufer, B, 1995, ‘Beyond 2000: A measure of productive lexicon in a second language’ in The current state of interlanguage, eds L Eubank, L Selinker and M Sharwood Smith, John Benjamins, Amsterdam, pp 265-272

Laufer, B and Nation, P, 1995, ‘Vocabulary size and use: Lexical richness in L2 written production’, Applied Linguistics, vol 16, pp 307-322

Lazaraton, A, 2002, A qualitative approach to the validation of oral language tests, Cambridge University Press, Cambridge

Malvern, D and Richards, B, 2002, ‘Investigating accommodation in language proficiency interviews using a new measure of lexical diversity’, Language Testing, vol 19, pp 85-104

McCarthy, M, 1990, Vocabulary, Oxford University Press, Oxford

McCarthy, M, 1998, Spoken language and applied linguistics, Cambridge University Press, Cambridge

Meara, P and Bell, H, 2001, ‘P_Lex: A simple and effective way of describing the lexical characteristics of short L2 texts’, Prospect, vol 16, pp 5-24

Meara, P and Miralpeix, I, 2004, D_Tools computer software, Lognostics (Centre for Applied Language Studies, University of Wales Swansea), Swansea

Mehnert, U, 1998, ‘The effects of different lengths of time for planning on second language performance’, Studies in Second Language Acquisition, vol 20, pp 83-108

Nation, ISP, 2001, Learning vocabulary in another language, Cambridge University Press, Cambridge

Nation, P and Heatley, A, 1996, Range computer program, English Language Institute, Victoria University of Wellington, Wellington


Pawley, A and Syder, FH, 1983, ‘Two puzzles for linguistic theory: Native-like selection and native-like fluency’ in Language and communication, eds JC Richards and RW Schmidt, Longman, London, pp 191-226

Read, J, 2000, Assessing vocabulary, Cambridge University Press, Cambridge

Read, J and Nation, P, 2004, ‘Measurement of formulaic sequences’ in Formulaic sequences: Acquisition, processing and use, ed N Schmitt, John Benjamins, Amsterdam, pp 23-35

Ross, S and Berwick, R, 1992, ‘The discourse of accommodation in oral proficiency interviews’, Studies in Second Language Acquisition, vol 14, pp 159-176

Schmitt, N (ed), 2004, Formulaic sequences: Acquisition, processing and use, John Benjamins, Amsterdam

Sinclair, J, 1991, Corpus, concordance, collocation, Oxford University Press, Oxford

Skehan, P, 1998, A cognitive approach to language learning, Oxford University Press, Oxford

Scott, M, 1998, WordSmith Tools, version 3.0, computer software, Oxford University Press, Oxford

Taylor, L, 2001, ‘Revising the IELTS Speaking Test: Developments in test format and task design’ in Research Notes, 5, EFL Division, University of Cambridge Local Examinations Syndicate, Cambridge, pp 2-5

Taylor, L and Jones, N, 2001, ‘Revising the IELTS Speaking Test’ in Research Notes, 4, EFL Division, University of Cambridge Local Examinations Syndicate, Cambridge, pp 9-12

van Lier, L, 1989, ‘Reeling, writhing, drawling, stretching and fainting in coils: Oral proficiency interviews as conversation’, TESOL Quarterly, vol 23, pp 489-508

Wray, A, 2002, Formulaic language and the lexicon, Cambridge University Press, Cambridge

Young, R and He, AW (eds), 1998, Talking and testing: Discourse approaches to the assessment of oral proficiency, John Benjamins, Amsterdam

Young, R and Milanovic, M, 1992, ‘Discourse variation in oral proficiency interviews’, Studies in Second Language Acquisition, vol 14, pp 403-424
