Using TEI mark-up and pragmatic classification in the construction and analysis of the British Telecom Correspondence Corpus. Ralph Morton, Coventry University [email protected] Twitter @RalphMortonCov

Post on 01-Apr-2015


Page 1

Using TEI mark-up and pragmatic classification in the construction and analysis of the British Telecom Correspondence Corpus

Ralph Morton, Coventry University [email protected]

Twitter @RalphMortonCov

Page 2

Outline

• Project background – British Telecom, New Connections, original aims, initial outcomes

• Working with Text Encoding Initiative (TEI) compliant XML in corpus mark-up

• Pragmatic classification

• British Telecom Correspondence Corpus – uses and future research

Page 3

BT (British Telecom)

• The main telephone network in the United Kingdom between 1912 and 1984 (as a government department, then a public corporation).

• Traces its history back to the founding of the Electric Telegraph Company in 1846

• In 1984 British Telecom was privatised

• One condition of the privatisation was the preservation of its public records

Page 4

BT Archives

• Public archive

• Located in Holborn

• Established in 1986

• ‘Preserves the records of BT and its predecessors and promotes access to the records and their content internally as a corporate resource, and externally to national and international communities’

http://www.btplc.com/Thegroup/BTsHistory/BTgrouparchives/OurHeritagePolicy/BTA_policies_2010_06.pdf

Page 5

New Connections Project

• JISC-funded collaboration between Coventry University, BT Heritage and The National Archives

• Project aim ‘to catalogue, digitise and develop a searchable online archive of almost half a million photographs, images, documents and correspondence assembled by BT over 165 years.’

Page 6

• Promoting easier access to the archives - not just two days a week, not limited to Holborn High Street.

• Engaging with material in new ways

• Three research projects attached to New Connections, one of which is the British Telecom Correspondence Corpus (BTCC)

Page 7

British Telecom Correspondence Corpus

• Original aims

- identify and transcribe around 500 letters

- collect contextual information for those letters and encode it using TEI compliant XML

- use corpus analyses to gain new insights into how English business correspondence changed from the mid-nineteenth to the late-twentieth century

Page 8

Corpus vs. Archive

• Leech (1991:11) - ‘ultimately, the difference between an archive and a corpus must be that the latter is designed or required for a particular 'representative' function’.

• Hunston (2002: 28) ‘‘being representative’ inevitably involves knowing what the character of the whole is’.

Page 9

• The character of the whole of the BT Archive is unknown at the item level

• In cataloguing, BT archivists ‘describe the context and function of the folder in relation to the history of BT and its predecessors and not the individual authors of letters’ (Sian Wynn-Jones, personal communication 05/03/2013).

• ‘archives are organised not classified’ (David Hay, personal communication, 2013)

• We were provided with ‘sufficient’ Category C files to fulfil our initial request for 500 letters.

• Letters were extracted through manual examination of digitised folders

Page 10

• One of the advantages of letters is that they are rich in contextual data

• Took an approach normally reserved for spoken language which ‘exists in unknowable quantities and in an unknowable range of varieties’ (Hunston, 2002: 29)

• Selected a number of factors to control and sampled accordingly: a purposive approach.

Page 11

Sampling

• Balance between decades as far as possible, incl. every letter from underrepresented decades – to be ‘internally contrastive’ (Sinclair 2005)

• Variety of authors - including historically interesting letters BUT also day-to-day letters, so as not to preserve only the histories of prominent individuals (Nurmi, 1999:54) (Prescott 2012).

• Inclusion of handwritten and typed letters

• Where available, inclusion of chains of correspondence (Dossena 2004)
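The decade-balancing criterion can be sketched in Python. This is only an illustration under stated assumptions: the `Letter` record, the per-decade `cap` and the seeded random choice are hypothetical, and the random choice merely stands in for the purposive judgement (author variety, chains, format) described on the slide.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Letter:
    """Hypothetical minimal record; the real corpus metadata is richer."""
    letter_id: str
    year: int


def decade_of(year: int) -> str:
    """Label a year with its decade, e.g. 1917 -> '1910s'."""
    return f"{year // 10 * 10}s"


def sample_by_decade(letters, cap, seed=0):
    """Keep every letter from decades at or under the cap; trim the rest to the cap."""
    rng = random.Random(seed)
    by_decade = defaultdict(list)
    for letter in letters:
        by_decade[decade_of(letter.year)].append(letter)

    chosen = []
    for decade in sorted(by_decade):
        group = by_decade[decade]
        if len(group) <= cap:
            chosen.extend(group)  # underrepresented decade: take everything
        else:
            chosen.extend(rng.sample(group, cap))
    return chosen
```

With three letters from the 1850s and twenty from the 1920s, a cap of five keeps all three early letters and five of the later ones.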

Page 12

Basic metadata extraction

• Time constraints meant that the initial sampling metadata was very basic:

- Date

- Author

- Recipient

- General topic of the letter

- Whether the letter was part of a chain

- Format (handwritten… etc)

Page 13

Initial Findings

- 612 letters (150 OCR, 462 manual transcription)

- 386 authors

- 132,917 words

[Chart by decade, 1850s to 1980s; vertical axis 0 to 80]

Page 14

Detailed Metadata

Begin to get an idea of the kind of document that the archive contains.

Occupation

• Secretary is the most frequently listed profession (27), with a further 34 letters from variations on this role (assistant secretary, honourable secretary, under secretary, deputy secretary...).

• Need for caution when generalising about jobs, e.g. the case of Secretary: 1934 -> ‘Director General’; 1966 -> ‘Deputy Chairman of the Post Office Board’. Area for investigation?

• We see a similar variety in the role of the next most common occupation, Director, where there are 17 letters from ‘Director’ and 13 letters from variations on this (e.g. deputy managing director).

Page 15

Companies

• The correspondence originates from 137 separate companies.

• Aside from The Post Office/BT the letters predominantly come from communications companies and press organisations.

• There are also letters from government departments, law firms, charities, universities, district councils and miscellaneous one-off letters from organisations like the National Rifle Association and the Belgian Citizen Band Association

Page 16

Gender

• 257 male, 16 female authors (106 where it’s not clear from the letter or outside sources)

• Surprising?

• ‘[women] were eventually deemed capable of replacing men in labour intensive activities like sorting. But more intellectually demanding positions such as clerical posts in the Chief Engineer’s Department remained firmly closed to them’ (Duncan Campbell-Smith, 2012:246)

Page 17

Working with TEI

• “The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.”

• Text internal <body>

• Header information <teiHeader>

Page 18

Correspondence SIG TEI Proposal

• Focus on header information

<profileDesc>
  <correspDesc>
    <correspAction type="sending">
      <persName>The sender</persName>
      <placeName>The place of sending</placeName>
      <date>The date of sending</date>
    </correspAction>
    <correspAction type="receiving">
      <persName>The recipient</persName>
      <placeName>The place of receiving</placeName>
      <date>The date of receiving</date>
    </correspAction>
  </correspDesc>
</profileDesc>

• Context?

Page 19

CorrespDesc in the BTCC

<ct:correspAction type="sending">
  <persName>
    <forename>Henry</forename>
    <surname>Schütz-Wilson</surname>
    <roleName>Assistant Secretary</roleName>
  </persName>
  <placeName>
    <orgName type="Company">The Electric &amp; International Telegraph Company</orgName>
    <address>
      <street>Telegraph Street</street>
      <settlement>London</settlement>
      <postCode>EC</postCode>
    </address>
  </placeName>
</ct:correspAction>
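A sender description of this kind can be read back out with Python's standard `xml.etree.ElementTree`. The sketch below is an assumption-laden illustration rather than the project's tooling: the `ct:` prefix is dropped because its namespace URI is not shown on the slide, the ampersand is escaped so the fragment parses, and `sender_summary` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the slide. The "ct:" prefix is omitted because its
# namespace URI is unknown, and "&" is written as "&amp;" so the XML parses.
SAMPLE = """
<correspAction type="sending">
  <persName>
    <forename>Henry</forename>
    <surname>Schütz-Wilson</surname>
    <roleName>Assistant Secretary</roleName>
  </persName>
  <placeName>
    <orgName type="Company">The Electric &amp; International Telegraph Company</orgName>
    <address>
      <street>Telegraph Street</street>
      <settlement>London</settlement>
      <postCode>EC</postCode>
    </address>
  </placeName>
</correspAction>
"""


def sender_summary(xml_text):
    """Return selected sender fields from one correspAction fragment."""
    root = ET.fromstring(xml_text.strip())

    def text_of(path):
        # findtext returns None when the element is missing.
        value = root.findtext(path)
        return value.strip() if value else None

    return {
        "action": root.get("type"),
        "surname": text_of("persName/surname"),
        "role": text_of("persName/roleName"),
        "organisation": text_of("placeName/orgName"),
        "settlement": text_of("placeName/address/settlement"),
    }
```

Fields gathered this way across all letters are what make counts by occupation, company or location possible.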

Page 20

Fields recorded in our metadata

Resource creation

• BT/Post Office file references – maintaining links

• Transcription information

• <encodingDesc> - project description

Letter metadata

• <persName> - <sender> and <addressee>

• <roleName> - occupation

• <orgName> - company, department

• <placeName> - <sender> and <addressee> - location info

• <date> yyyy-mm-dd, n="decade"

• <keywords> - topic, function

• <sex>

• <scriptDesc> - format (handwritten, typed… etc)

Page 21

Text Internal

<text>
  <body>
    <opener>
      <salute>Dear Sir,</salute>
    </opener>
    <p>In reply to your favor of yesterday I hasten to forward you the full particulars connected with the matter between us and the International Commissioners and Copy of the Correspondence which took place on the subject.</p>
    <closer>
      <salute>In haste, Yours faithfully</salute>
      <signed>L. Walter Courtenay</signed>
    </closer>
  </body>
</text>

• Allows us to look at individual textual features of letters, e.g. use of <postScript>, <quote> and <soCalled>, use of letter titles and references, and letter <opener>s and <closer>s, including salutations.
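Because salutations sit in their own elements, they can be pulled out without touching the body text. The following is a minimal sketch using the standard library. It assumes un-namespaced markup like the slide's example (real TEI files normally declare the TEI namespace), and `salutations` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the slide's letter; the paragraph text is omitted.
LETTER = """<text>
  <body>
    <opener><salute>Dear Sir,</salute></opener>
    <p>(body text omitted)</p>
    <closer>
      <salute>In haste, Yours faithfully</salute>
      <signed>L. Walter Courtenay</signed>
    </closer>
  </body>
</text>"""


def salutations(xml_text):
    """Return (opening salute, closing salute, signature) for one letter."""
    root = ET.fromstring(xml_text)
    return (
        root.findtext("body/opener/salute"),
        root.findtext("body/closer/salute"),
        root.findtext("body/closer/signed"),
    )
```

Run over every letter, this yields the opener and closer strings that can then be grouped and counted by decade.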

Page 22

<openers>

[Chart by decade, 1850s to 1980s; vertical axis 0% to 90%. Series: Sir; Dear Sir; Dear [first name]; Dear [surname]; Dear [title] [surname]]
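Percentages of opener types per decade can be derived from (decade, category) pairs along the following lines. This is a hedged sketch: `opener_shares` is a hypothetical function, and it assumes each opener has already been assigned to one of the categories in the chart legend.

```python
from collections import Counter, defaultdict


def opener_shares(records):
    """Map each decade to the percentage share of each opener category.

    `records` is an iterable of (decade, category) pairs, one per letter.
    """
    counts = defaultdict(Counter)
    for decade, category in records:
        counts[decade][category] += 1

    shares = {}
    for decade, counter in counts.items():
        total = sum(counter.values())
        shares[decade] = {cat: 100 * n / total for cat, n in counter.items()}
    return shares


# Invented toy data, for illustration only (not figures from the corpus).
TOY = [
    ("1850s", "Sir"),
    ("1850s", "Sir"),
    ("1850s", "Dear Sir"),
    ("1980s", "Dear [first name]"),
]
```

Using shares rather than raw counts keeps decades with few letters comparable to decades with many, although a single letter still moves a small decade's percentages a long way.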

Page 23

http://www.tableausoftware.com/public/

Page 24

Most frequent <closers> by decade

[Chart by decade, 1850s to 1980s; vertical axis 0 to 50. Series: Yours sincerely; Obedient; Yours faithfully]

Page 25

Formal links between openings and closings

[Chart by decade, 1850s to 1980s; vertical axis 0 to 50. Series: Named recipients; Yours sincerely]

Page 26

[Chart by decade, 1850s to 1980s; vertical axis 0 to 60. Series: Unnamed recipients; "polite" sign offs]

• “yours faithfully”, “obedient servant”

Page 27

Marconi

• Uses a relatively familiar opening - Dear Mr Preece

• In combination with a variety of familiar and formal closing salutations:

- ‘believe me dear sir yours very truly’

- ‘with best regards for you and all your family I remain dear sir yours very truly’

- ‘I remain dear sir yours very sincerely’

- ‘I remain dear sir yours very truly and sincerely’

Page 28

• To be used as a starting point. There are many other factors to consider

• Context - relationship between sender and addressee - population of authors

• Content and function of letter need to be taken into account.

• Salutations may even provide clues to nature of correspondence, e.g. Nesfield’s “demi-official” correspondence (1917:191)

• One way into the analysis

Page 29

Letter function

• One of the challenges in approaching the analysis of the BT corpus is how to make meaningful comparisons across so many different years, authors and subject matters.

• To try to address this we categorised the letters pragmatically, by function.

• Looking at how these functions are realised and how they have changed or remained stable over time

Page 30

• Definitions were generated through a close examination of the letters.

Primary Functions – Advice, Suggestion, (Instruction?), Request, Application, Offer, Confirmation, Agreement, Acceptance, Rejection, Outlining, Detailing, Setting Out, Report, Notification, Expressive, Query, Clarification, Reiteration, Correction, Explanation, Complaints, Reminder, Thanking, Enclosing, Forwarding, Copying, Acknowledging, Arguing, Disputing, Arranging, Planning, Instructing, Personal Update, Proposal, Expenditure Review, Commissive, Promise

Secondary Functions – Thanking, Apology, Acknowledgement, Expressing, Query, Request, Offer, Advice, Suggestion, Direction, Instruction, Recommendation, Discussion, Informing, Stating, Agreeing, Conceding, Noting Change, Restating, Explanation, Invitation, Report, Notification, Enclosure, Approval

• Long list was narrowed down to a more manageable list of 19 functions

• This list was tested at a workshop at Coventry University with six participants asked to identify the main function (+component functions) of a sample of letters

Page 31

Frequent problems

• Lack of context

• What’s the “main” function? E.g. complaints

• Form/function conflict

- Overlap – offer vs. application

Page 32

Problem cases – author identified functions

Dear Sir Robert:

<informative>When you spoke to me on the telephone on July 18th, I had already received a telegram from Pattison summarizing his conversation with you a few days ago … particularly in the matter of possible reference to technical detail, and that the objectives to be stressed should relate to a greater degree to exploration of general policy considerations.</informative>

<thanking>However, I need not amplify on these matters in this letter since we are doing our best to get the formal reply to the British despatch out as soon as we can. I am sending this personal note to you, however, to express my appreciation for your own letter …</thanking>

Page 33

Problem cases - multifunctionality.

There are some examples of letters where the authors address multiple independent concerns which carry equal claim to being the primary function of the letter. E.g. 1917_05_19_WM_## reproduced in full below

‘Sir,

<informative> I am directed to refer to your communication dated 16th May, 1917, No. 96951/17 relating to Mr.T.Gilbert, a Telegraph Operator employed at the Ware Post Office and to inform you that instructions have this day been given for this man to be posted to the Royal Engineers, Signal Service, Bletchley through Area Headquarters. </informative>

<directive> I am also to ask you to forward to this Department for countersignature on behalf of the Director of Recruiting, your stock of enlistment Forms 27/Gen.No./6112 (D.R.l.c.) a copy of which was attached by you.

It is hoped that this will obviate any further difficulty of the nature referred to in your communication. </directive>

I am, Sir, Your obedient Servant, W. MacDonald’

The informative part of the letter does not contextualise the directive; the two sections address independent concerns.
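Inline function tags like these can be separated mechanically, which makes multifunctional letters easy to flag for closer reading. The sketch below uses a regular expression rather than an XML parser because the tagged excerpts are not complete XML documents. `function_segments`, `is_multifunctional` and the abbreviated example text are illustrative assumptions.

```python
import re

# Matches <label>...</label> where the label is a lowercase function name.
FUNCTION_SPAN = re.compile(r"<([a-z]+)>(.*?)</\1>", re.DOTALL)


def function_segments(letter_text):
    """Return (function label, whitespace-normalised text) pairs in order."""
    return [
        (label, " ".join(body.split()))
        for label, body in FUNCTION_SPAN.findall(letter_text)
    ]


def is_multifunctional(letter_text):
    """True when a letter carries more than one distinct function label."""
    labels = {label for label, _ in function_segments(letter_text)}
    return len(labels) > 1


# Abbreviated wording based on the 1917 letter, shortened for illustration.
EXAMPLE = (
    "Sir, <informative> I am directed to refer to your communication "
    "dated 16th May, 1917. </informative> "
    "<directive> I am also to ask you to forward to this Department "
    "your stock of enlistment Forms. </directive> "
    "I am, Sir, Your obedient Servant"
)
```

A flag like this only identifies candidates. Deciding whether one segment merely contextualises another still needs a human reader.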

Page 34

• Three more rounds of inter-rater testing: one with two participants, and two further rounds with three participants

• Improved pre-discussion agreement: c. 80% (two raters) and 60% (three raters) -> 82% (two raters) and 70% (three raters)

• Clarifications
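Agreement figures of this kind can be computed as simple percent agreement. The slides do not say how agreement was calculated, so the definition below (the share of letters on which every rater chose the same main function) is an assumption, and `percent_agreement` is a hypothetical helper.

```python
def percent_agreement(ratings):
    """Percentage of letters on which all raters chose the same label.

    `ratings` is a list with one entry per letter; each entry is the list
    of labels assigned to that letter by the individual raters.
    """
    if not ratings:
        raise ValueError("no ratings supplied")
    agreed = sum(1 for labels in ratings if len(set(labels)) == 1)
    return 100 * agreed / len(ratings)


# Invented toy ratings for two raters over five letters.
TOY_RATINGS = [
    ["Query", "Query"],
    ["Offer", "Application"],
    ["Directive", "Directive"],
    ["Informative", "Informative"],
    ["Thanking", "Thanking"],
]
```

Percent agreement is not corrected for chance, and requiring unanimity makes it harder to reach with three raters than with two, which is one reason the two sets of figures are not directly comparable.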

Page 35

Final categories

1. Application

2. Commissive

3. Complaint

4. Declination

5. Directive

6. Informative

7. Notification

8. Offer

9. Query

10. Thanking

Page 36

Decade  Application  Commissive  Complaint  Declination  Directive  Informative  Notification  Offer  Query  Thanking
1850s             0           0          0            0          1            2             1      3      0         0
1860s             0           0          1            0          1            1             5      0      2         0
1870s             3           5          1            2          9           14             0      3      3         0
1880s             0           4          1            8          2            1             0      0      3         0
1890s             0           2          2            1          8           16             5      2      5         1
1900s             1          10          3            3          4            6             2      3      2         0
1910s             0           2          0            8         11           11             6      3      3         0
1920s            10           5          2            3         14           14             6      8      4         1
1930s             0           3          4            7          7           11             3      5      4         3
1940s             2           6          3            2         10           18             1      3      4         0
1950s             2           2          4            2         10           13             5      2      4         3
1960s             0          11          0            0          3           21             7      5      1         3
1970s             0           1          4            2          8           14             5      4      2         0
1980s             0           1          3            3          1           20             4      2      2         0
Total            18          52         28           41         89          162            50     43     39        11
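The counts in this table can be checked and normalised with a few lines of Python. The figures are copied from the slide. The helper names `column_totals` and `decade_shares` are hypothetical, and expressing each decade as percentages is a suggestion for comparison rather than something the slides report.

```python
FUNCTIONS = [
    "Application", "Commissive", "Complaint", "Declination", "Directive",
    "Informative", "Notification", "Offer", "Query", "Thanking",
]

# Letters per function and decade, as given in the table.
COUNTS = {
    "1850s": [0, 0, 0, 0, 1, 2, 1, 3, 0, 0],
    "1860s": [0, 0, 1, 0, 1, 1, 5, 0, 2, 0],
    "1870s": [3, 5, 1, 2, 9, 14, 0, 3, 3, 0],
    "1880s": [0, 4, 1, 8, 2, 1, 0, 0, 3, 0],
    "1890s": [0, 2, 2, 1, 8, 16, 5, 2, 5, 1],
    "1900s": [1, 10, 3, 3, 4, 6, 2, 3, 2, 0],
    "1910s": [0, 2, 0, 8, 11, 11, 6, 3, 3, 0],
    "1920s": [10, 5, 2, 3, 14, 14, 6, 8, 4, 1],
    "1930s": [0, 3, 4, 7, 7, 11, 3, 5, 4, 3],
    "1940s": [2, 6, 3, 2, 10, 18, 1, 3, 4, 0],
    "1950s": [2, 2, 4, 2, 10, 13, 5, 2, 4, 3],
    "1960s": [0, 11, 0, 0, 3, 21, 7, 5, 1, 3],
    "1970s": [0, 1, 4, 2, 8, 14, 5, 4, 2, 0],
    "1980s": [0, 1, 3, 3, 1, 20, 4, 2, 2, 0],
}

# The "Total" row as printed on the slide.
REPORTED_TOTALS = [18, 52, 28, 41, 89, 162, 50, 43, 39, 11]


def column_totals(counts):
    """Sum each function column across all decades."""
    return [sum(column) for column in zip(*counts.values())]


def decade_shares(counts, decade):
    """Percentage of a decade's classified letters falling under each function."""
    row = counts[decade]
    total = sum(row)
    return {name: 100 * n / total for name, n in zip(FUNCTIONS, row)}
```

The recomputed column sums match the printed Total row, and the ten totals add up to 533 classified letters. Shares make sparse decades visible: the 1850s row contains only seven letters, so each one shifts its percentages by about fourteen points.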

Page 37

Analysis

• Data-driven analysis of the <text> <body>

– Using frequency lists, keywords and clusters as starting points

- By decade (diachronic)

- By function

Page 38

Application

Rank  Freq  Keyness  Keyword
   1    19   47.908  york
   2    25   30.214  new
   3    35   25.294  my
   4    14   22.262  call
   5     7   20.102  times
   6    10   19.432  application
   7     5   18.747  salary
   8    13   18.447  experiments
   9    10   18.035  beg
  10    10   16.529  years
  11     4   16.410  grove
  12     5   16.368  opening
  13     6   15.793  hoping
  14     5   15.704  convenience
  15     9   15.044  request

Page 39

Directive

Rank  Freq  Keyness  Keyword
   1    80   33.780  committee
   2    15   19.098  landlord
   3   148   17.766  if
   4    12   15.278  tenant
   5    11   14.005  bbc
   6    20   12.847  calls
   7   444   12.836  be
   8    14   12.619  ireland
   9    12   12.477  rayner
  10    12   12.477  television
  11    11   12.082  northern
  12    31   11.971  majesty
  13   126   11.924  should
  14     9   11.459  ita
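Keyword scores like these are commonly obtained by comparing a word's frequency in one sub-corpus (here, letters with a given function) against the rest of the corpus. The slides do not name the statistic or the tool, so the two-term log-likelihood measure below is an assumption, shown only to illustrate the general technique. The slide's values cannot be reproduced here because the sub-corpus sizes are not given.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-term log-likelihood keyness of a word.

    freq_target / freq_ref: occurrences of the word in each sub-corpus.
    size_target / size_ref: total word counts of each sub-corpus.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the word were spread evenly across both parts.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    score = 0.0
    for observed, expected in (
        (freq_target, expected_target),
        (freq_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score
```

A word with the same relative frequency in both parts scores zero. The score grows as the distribution becomes more lopsided, which is why proper nouns confined to a handful of letters can rise to the top of a small sub-corpus.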

Page 41

• Far from perfect inter-rater reliability

• Issues with multi-functional letters, and component functions

• Not much data so individual letters can skew results

BUT

• Some promising preliminary results. Need to examine patterns in corpus as a whole and back any claims up with close reading

• Could be supplemented by qualitative approaches such as analysis of rhetorical moves (see Biber et al. 2007)

Page 42

BTCC: Current use and future directions

• Who uses it?

- (currently) me

- Corpus will be available by request/through e.g. Oxford Text Archive

- BT Archives have the data and are looking to incorporate it in their Digital Archive

Page 43

For What?

• Linguistic analysis - Somewhat exploratory. Cannot generalise but planning to expand corpus to examine findings further

• Historical study - transcriptions, metadata

Page 44

Ways in which historical research is being transformed by Digital Methods

• opportunity to read historical documents in new ways

• making material that is publicly available but practically restricted more accessible

• providing historical archives with transcriptions and detailed item level metadata

• potentially bringing related but separate physical archives’ material together in the form of a digital resource (Post Office)

Page 45

Thank you!

• [email protected]

References

Biber, D., Connor, U. and Upton, T. A. (eds.) (2007) Discourse on the Move: Using Corpus Analysis to Describe Discourse Structure. Amsterdam: John Benjamins Publishing Company

Dossena, M. (2004) ‘Towards a corpus of nineteenth-century Scottish correspondence’ Linguistica e Filologia 18, 195-214

Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press

Leech, G. (1991) ‘The state of the art in corpus linguistics’ In Aijmer, K. & Altenberg, B. (eds.) English Corpus Linguistics: Studies in honour of Jan Svartvik, 8-29

Nesfield, J.C. (1917) Junior Course of English Composition. London: MacMillan and Co. Ltd

Prescott, A. (2012) ‘Making the Digital Human: Anxieties, Possibilities, Challenges’ delivered at Digital Humanities Summer School, Merton College Oxford [online] available from http://digitalriffs.blogspot.co.uk/2012/07/making-digital-human-anxieties.html [20th March 2013]