Annotating the HKCSE Pragmatically

Martin Weisser
Visiting Professor

School of English and Education
Guangdong University of Foreign Studies

mail: weissermar@gmail.com
web: martinweisser.org

Outline

• The Conversion Process
• Pre-processing Requirements
• Annotation & Post-processing
• Searching & Exploring the Corpus
• Conclusion

The Conversion Process I – Issues

• how to convert to DART XML format?
  – identify original conventions
    • some documented in Cheng et al. (2008)
    • some undocumented
  – use tone unit marking?
    • unfortunately, tone units in Brazil’s system for ‘discourse intonation’ ≠ C-units
    • → no ‘sentence’ intonation inferable directly
  – remove prosodic information, apart from stress and tone movements, to ensure readability
  – handle overlap
    • exact extent not marked or inferable
    • → better to delete
  – etc.
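The clean-up decisions listed above (delete overlap markers, drop most prosodic bracketing, keep stress and tone movements) can be sketched in a few lines of Python. The marker symbols used here are purely illustrative assumptions and are not the documented HKCSE conventions; the function name `simplify` is likewise hypothetical.

```python
import re

# Hypothetical markers, NOT the actual HKCSE transcription scheme:
#   *  or ** : start of overlapping speech (deleted)
#   { } [ ]  : tone unit and other prosodic brackets (deleted)
#   \ / =    : tone movements (kept)
#   CAPS     : stressed syllables or words (kept)
OVERLAP = re.compile(r"\*+\s*")
BRACKETS = re.compile(r"[\[\]{}]")

def simplify(line: str) -> str:
    """Remove overlap markers and brackets, keep stress and tone symbols."""
    line = OVERLAP.sub("", line)
    line = BRACKETS.sub(" ", line)
    # Collapse the whitespace left behind by the deletions.
    return " ".join(line.split())

example = "* { \\ [ YES ] } { = i [ THINK ] so }"
print(simplify(example))
```

Because the exact extent of an overlap cannot be recovered, the sketch simply deletes the marker rather than trying to align the overlapping turns.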

The Conversion Process II – Original Format

The Conversion Process III – the Conversion Editor

[Screenshot of the conversion editor, with labelled areas: original input file, conversion result view, conversion script editor, save output]

The Conversion Process IV – Conversion Results

• converted to DART XML format
• retained stress marking
• converted & moved tone marking
• converted ‘non-speech’ to comments
• added gender attribute
• added speaker type attribute
• moved pauses to next turn
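Two of the steps above, adding gender and speaker type attributes and moving a turn-final pause to the start of the next turn, might look roughly like the following sketch. The element name `turn`, the attribute names, the attribute values and the pause symbol `(.)` are assumptions made for illustration; the slides do not spell out the actual DART XML schema.

```python
import xml.etree.ElementTree as ET

PAUSE = "(.)"  # hypothetical pause symbol

def build_turns(raw_turns):
    """raw_turns: list of (speaker, gender, speaker_type, tokens) tuples."""
    root = ET.Element("dialogue")
    carried = []   # pauses removed from the end of the previous turn
    last = None
    for n, (speaker, gender, sp_type, tokens) in enumerate(raw_turns, start=1):
        tokens = carried + list(tokens)
        carried = []
        # Strip turn-final pauses so they can open the next turn instead.
        while tokens and tokens[-1] == PAUSE:
            carried.append(tokens.pop())
        last = ET.SubElement(root, "turn", n=str(n), speaker=speaker,
                             gender=gender, type=sp_type)
        last.text = " ".join(tokens)
    # Pauses after the final turn have nowhere to go, so put them back.
    if carried and last is not None:
        last.text = (last.text + " " + " ".join(carried)).strip()
    return root

raw = [("a", "f", "HKC", ["OKAY", "(.)"]),
       ("B", "m", "NES", ["yes", "i", "SEE"])]
for turn in build_turns(raw):
    print(ET.tostring(turn, encoding="unicode"))
```

Conversion of non-speech events to comments and the relocation of tone marks are left out of the sketch.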

Pre-processing Requirements

• creating new resources in/for DART
  – adapt DART modules to handle mixed case
  – ‘synthesise’ domain-specific lexicon
  – create domain-specific topic ‘thesaurus’
• pre-processing
  – fix conversion errors
  – identify/mark incomplete words
  – split turns
  – add punctuation, partly based on original prosodic features
  – etc.
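As a rough illustration of two of the pre-processing steps, the sketch below marks incomplete words and adds punctuation from a unit-final tone. Everything specific in it is an assumption: that incomplete words end in a hyphen, that they are wrapped in a hypothetical `<trunc>` element, and that a fall maps to a full stop and a rise to a question mark. The slides only say that punctuation is "partly based on" prosody, so the real rules are likely more nuanced.

```python
import re

# Assumption: an incomplete word is written with a trailing hyphen, e.g. "th-".
INCOMPLETE = re.compile(r"\b(\w+)-(?=\s|$)")

# Naive, illustrative mapping from final tone movement to punctuation.
TONE_TO_PUNCT = {"fall": ".", "rise": "?"}

def mark_incomplete(text: str) -> str:
    """Wrap hyphen-final fragments in a hypothetical <trunc> element."""
    return INCOMPLETE.sub(r"<trunc>\1</trunc>", text)

def punctuate(unit: str, final_tone: str) -> str:
    """Append punctuation for known tones, leave the unit unchanged otherwise."""
    return unit + TONE_TO_PUNCT.get(final_tone, "")

print(mark_incomplete("i th- think so"))
print(punctuate("you know that", "rise"))
```

Hyphens inside a word, as in "C-units", are not affected, because the pattern requires the hyphen to be followed by whitespace or the end of the string.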

Annotation & Post-processing I – Steps

• annotation in DART
  – fully automated
  – less than 80 sec for
    • 24 files
    • ~72,100 words
    • ~10,300 C-Units
• post-processing to fix potential errors on the levels of
  – syntax: potentially missing syntax rules
  – pragmatics: missing inferencing rules or modes (‘IFIDs’)
  – semantics: incorrectly identified topics
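The figures above imply some simple rates, worked out here for convenience. Since the run took less than 80 seconds and the counts are approximate, the per-second values should be read as rough lower bounds; the variable names are only illustrative.

```python
files, words, c_units, seconds = 24, 72_100, 10_300, 80

words_per_sec = words / seconds       # at least this many words per second
units_per_sec = c_units / seconds     # at least this many C-units per second
words_per_unit = words / c_units      # average C-unit length in words
words_per_file = words / files        # average file length in words

print(f"{words_per_sec:.1f} words/sec, {units_per_sec:.1f} C-units/sec")
print(f"{words_per_unit:.1f} words per C-unit, {words_per_file:.0f} words per file")
```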

Annotation & Post-processing II – Annotation Result

• identified syntactic category
• automatically split off DM
• annotated identifiable speech acts

Searching the Corpus

• easily searchable via DART
  – speech act stats hyperlinked to concordancer
  – formulaic patterns or disfluencies via n-grams
  – manual searches in concordancer for specific
    • speech acts
    • syntactic categories + speech acts
    • speech acts + speaker types
    • speech acts + gender
    • responses to questions
    • tone features
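DART provides these searches through its own interface. Purely to illustrate what a combined query such as "speech acts + gender" or an n-gram count amounts to, here is a sketch over a tiny invented sample. The element names, the `sp-act`, `gender` and `type` attributes, their values, and the sample dialogue itself are all assumptions, not actual corpus data or the confirmed DART schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample in a hypothetical annotation format.
SAMPLE = """<dialogue>
  <turn n="1" speaker="a" gender="f" type="hkc">
    <dm sp-act="init">okay</dm>
    <q-yn sp-act="reqInfo">do you have the report</q-yn>
  </turn>
  <turn n="2" speaker="B" gender="m" type="nes">
    <yes sp-act="answer">yes</yes>
    <decl sp-act="state">i have the report here</decl>
  </turn>
</dialogue>"""

def search(root, sp_act=None, gender=None):
    """Yield (speaker, syntactic category, text) for matching units."""
    for turn in root.iter("turn"):
        if gender is not None and turn.get("gender") != gender:
            continue
        for unit in turn:
            if sp_act is None or unit.get("sp-act") == sp_act:
                yield turn.get("speaker"), unit.tag, (unit.text or "").strip()

def ngrams(root, n):
    """Count word n-grams within each unit (never across unit boundaries)."""
    counts = Counter()
    for turn in root.iter("turn"):
        for unit in turn:
            words = (unit.text or "").split()
            counts.update(zip(*(words[i:] for i in range(n))))
    return counts

root = ET.fromstring(SAMPLE)
print(list(search(root, sp_act="state", gender="m")))
print(ngrams(root, 2).most_common(2))
```

Counting n-grams inside units rather than across them keeps patterns from spanning a functional boundary, which fits the C-unit-based view of the data.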

Conclusion

• DART annotation enriches the HKCSE through
  – adding syntactic and pragmatic annotation
  – ability to analyse features based on (functional) C-units, rather than intonation units
  – new search options based on the above features

References

• Cheng, W., Greaves, C. and Warren, M. 2008. A Corpus-driven Study of Discourse Intonation: the Hong Kong Corpus of Spoken English (prosodic). Amsterdam/Philadelphia: John Benjamins.

• Weisser, M. 2010. Annotating Dialogue Corpora Semi-Automatically: a Corpus-Linguistic Approach to Pragmatics. Unpublished Habilitation (professorial) thesis, University of Bayreuth.

• Weisser, M. 2012; forthcoming 2014. Pragmatic annotation. In: Aijmer, K. & Rühlemann, C. (Eds.). Corpus Pragmatics: a Handbook. Cambridge: CUP.

• Weisser, M. 2014. The DART Manual.

• Weisser, M. (in progress). DART – the Dialogue Annotation and Research Tool.