Terttu Nevalainen
Research Unit for Variation, Contacts and Change in English
Department of English, University of Helsinki
Historical sociolinguistics as corpus linguistics
Challenge: the ‘bad-data problem’
Historical linguistics can then be thought of as the art
of making the best use of bad data. The art is a highly
developed one, but there are some limitations of the data
that cannot be compensated for. Except for very recent
times, no phonetic records are available for instrumental
measurements. We usually know very little about the social
position of the writers and not much more about the social
structure of the community. Though we know what was
written, we know nothing about what was understood, and
we are in no position to perform controlled experiments on
crossdialectal comprehension. (Labov 1994: 11)
… and how to deal with it
by systematic corpus compilation
collecting metadata
reconstructing earlier communities
building up baseline evidence
Topics addressed
Kinds of corpora
What is a historical sociolinguistic corpus?
HC, CEEC, OBC, CED
What can historical sociolinguistic corpora tell us
about language change?
a case study
Kinds of corpora
synchronic vs. diachronic
single-genre vs. multigenre
special purpose vs. general
small and tidy vs. big and messy
flat vs. annotated
> the first category more frequent than the
second in sociolinguistic corpora
What is a sociolinguistic corpus?
sampling unit: person
sampling frame: regional variation, variation in
socio-economic status, gender, age, ethnicity etc.
e.g. Sali Tagliamonte’s Roots Corpora:
Northwest England, Lowland Scotland, Northern Ireland
110 speakers, c. 1 million words
Historical ‘proto-corpora’
diachronic
multigenre
general-purpose
small and tidy
increasingly grammatically annotated
include basic metadata
Sub-period             Words        %
OLD ENGLISH
I    -850              2 190      0.5
II   850-950          92 050     22.3
III  950-1050        251 630     60.9
IV   1050-1150        67 380     16.3
Total                413 250    100.0
MIDDLE ENGLISH
I    1150-1250       113 010     18.6
II   1250-1350        97 480     16.0
III  1350-1420       184 230     30.3
IV   1420-1500       213 850     35.1
Total                608 570    100.0
EModE, BRITISH
I    1500-1570       190 160     34.5
II   1570-1640       189 800     34.5
III  1640-1710       171 040     31.0
Total                551 000    100.0
The Helsinki Corpus of English Texts.
(http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/index.html)
Resampling general-purpose, multigenre diachronic corpora for sociolinguistic studies
Helsinki Corpus ( >850-1710)
letters
diaries
trials
Helsinki Corpus of Older Scots (1450-1700)
letters
diaries
trials
Representative Corpus of Historical English
Registers (ARCHER; 1650-1990)
letters
Letters, diaries and trials in the Helsinki Corpus.
Sub-period           Words (total)         %
OLD ENGLISH
I    -850
II   850-950
III  950-1050
IV   1050-1150
Total                - (413 250)           - (100.0)
MIDDLE ENGLISH
I    1150-1250
II   1250-1350
III  1350-1420         5 010               0.8
IV   1420-1500        19 090               3.1
Total                 24 100 (608 570)     3.9 (100.0)
EModE, BRITISH
I    1500-1570        45 970               8.3
II   1570-1640        44 000               8.0
III  1640-1710        43 980               8.0
Total                133 950 (551 000)    24.3 (100.0)
Lady Hoby’s diary (Margaret Hoby, 1571-1633, http://www.oxforddnb.com/view/article/37555)
(Munday the 17)
After priuat praier I saw a mans Legg dressed, took order for #
thinges in the house, and wrough tell dinner time : after dinner I #
went about the house, and read of the arball : then I tooke my
Cocth and Came to Linton, wher, after I had talked a whill with
my mother, examened my selfe and praied, I went to supper, and
then praied publeckly, and so to bed :
(E2 NN DIARY HOBY 72)
Header code Explanation
<QE2 NN DIARY HOBY> (text identifier)
<N DIARY HOBY> (name of text)
<A HOBY MARGARET> (author)
<C E2> (corpus period)
<O 1570-1640> (period of original)
<D ENGLISH> (dialect)
<V PROSE> (verse/prose)
<T DIARY PRIV> (text type)
<W WRITTEN> (relationship to spoken language)
<X FEMALE> (sex of author)
<Y 20-40> (age of author)
<H HIGH> (social rank of author)
<I INFORMAL> (setting)
<Z NARR NON-IMAG> (prototypical text category)
HC reference codes (http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/generalintro.html)
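The reference codes above lend themselves to simple machine reading. The following is a minimal sketch, not the Helsinki Corpus's own tooling: the function name `parse_hc_header` and the dictionary `HC_FIELDS` are illustrative assumptions, and the field meanings are copied from the table above.

```python
import re

# Field letters and their explanations, copied from the reference-code table.
HC_FIELDS = {
    "Q": "text identifier",
    "N": "name of text",
    "A": "author",
    "C": "corpus period",
    "O": "period of original",
    "D": "dialect",
    "V": "verse/prose",
    "T": "text type",
    "W": "relationship to spoken language",
    "X": "sex of author",
    "Y": "age of author",
    "H": "social rank of author",
    "I": "setting",
    "Z": "prototypical text category",
}

# One code looks like <X FEMALE>; the Q code is shown without a space (<QE2 ...>),
# so the whitespace after the field letter is treated as optional.
HEADER_CODE = re.compile(r"<([A-Z])\s*([^>]*)>")


def parse_hc_header(text):
    """Return {explanation: value} for every known <LETTER VALUE> code in text."""
    meta = {}
    for letter, value in HEADER_CODE.findall(text):
        if letter in HC_FIELDS:
            meta[HC_FIELDS[letter]] = value.strip()
    return meta


hoby = """<QE2 NN DIARY HOBY>
<A HOBY MARGARET>
<O 1570-1640>
<X FEMALE>
<Y 20-40>
<H HIGH>"""

meta = parse_hc_header(hoby)
print(meta["author"], meta["sex of author"], meta["social rank of author"])
```

Once the codes are in a dictionary, texts can be filtered by the social variables of interest (sex, age, rank) before any linguistic counts are made.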
Historical sociolinguistic corpora
diachronic
single-genre, typically letters and trials, which share a number of
linguistic characteristics with face-to-face conversations
provide data by known individuals, similar to interview
data used in present-day sociolinguistic research
purpose-built
small to medium size
increasingly grammatically annotated
include metadata
The CEEC family of corpora. (http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/index.html)
              CEEC 1998      CEEC Extension   CEEC Supplement   TOTALS
words         2,597,795      2,219,422        442,484           5,259,701
collections   96             77               19                192
letters       5,961          4,923            829               11,713
writers       778            308              94                1,180
time span     c. 1410-1681   1653-1800        1402-1663         1402-1800
Published versions of the CEEC corpora. (http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/index.html)
              CEEC Sampler (1999)   Parsed CEEC (2006)*
words         450,085               2,159,132
collections   23                    84
letters       1,123                 4,970
writers       194                   666
time span     1418-1681             1410-1681
* Historical Sociolinguistics Team’s collaborative project with the University of York: Prof. Anthony Warner, Dr. Susan Pintzuk and Dr. Ann Taylor. Tagging by Arja Nurmi and parsing by Ann Taylor.
The Corpus of Early English Correspondence (1998 version: c. 2.6 million words; 778 writers; c. 6,000 letters).
TIME SPAN: 1410–1681
WRITERS BY SOCIAL RANK:      WRITERS BY DOMICILE:
Royalty: 3%                  Court: 8%
Nobility: 15%                London: 14%
Gentry: 39%                  East Anglia: 17%
Clergy: 14%                  North: 12%
Professionals: 11%           Other: 49%
Merchants: 8%
Other non-gentry: 10%
WRITERS BY GENDER:
Female: 26%
Male: 74%
Compilers: Terttu Nevalainen, Helena Raumolin-Brunberg; Jukka Keränen, Minna
Nevala, Arja Nurmi, Minna Palander-Collin
Gregory King’s estimate of population and wealth in England and Wales,
1688: cumulative percentages of population (Nevalainen 2010: 8).
Social status                        Average annual income (£)   Number of families
Temporal lords                       6,060                       200
Baronets                             1,500                       800
Bishops                              1,300                       26
Knights                              800                         600
Esquires                             562.5                       3,000
Merchants, greater                   800                         5,264
Gentlemen                            280                         15,000
Persons in greater offices           240                         5,000
    1% of population
Merchants, lesser                    400                         21,057
Artisans and handicrafts             200                         6,745
Law                                  154                         8,062
Persons in lesser offices            120                         5,000
Freeholders, greater                 91                          27,568
    5% of population
Naval officers                       80                          5,000
Clergymen, greater                   72                          2,000
Military officers                    60                          4,000
Science, liberal arts                60                          12,898
Freeholders, lesser                  55                          96,490
Clergymen, lesser                    50                          10,000
    10% of population
Shopkeepers and tradesmen            45                          101,704
    20% of population
Farmers                              42.5                        103,382
Manufacturing trades                 38                          162,863
Building trades                      25                          73,018
Common seamen                        20                          50,000
Miners                               15                          14,240
    40% of population
Labouring people and outservants     15                          284,997
    60% of population
Common soldiers                      14                          35,000
Cottagers and paupers                6.5                         313,183
Vagrants                             2                           23,489
I am sure, howsoever I measurd by the cold clime
Aprill for a late May, or missed to signe my name, I
omitted it not for want of grace, but for hast; which
shall be at layzure mended. The hand as I take it
was, as this, my owne, and therefore my owne, and
not my secretarie’s fault; and I confesse I love to
write no dobles of letters, but will affirm my hand and
it whansoever your Grace shall nede to call uppon it.
(CEEC, Peregrine Bertie 1598 (HUTTON) 131)
Holograph vs. autograph?
If you could read my lres [letters] your self I would
have written largelie of your owne buisenes, And
because I will have none acquainted wth them but
who you thinke fitt besides your self, I have taken the
paines to write it in Romaine hand in this inclosed
paper, wch I thinke your self can read.
(CEEC, Anthony Antony 1615 (Stockwell) I,37)
Secretary vs. italic hand?
GEORGE CELY AT ANTWERP TO RICHARD CELY THE
YOUNGER AT CALAIS, 27 SEPTEMBER 1476
Ryght whellbelovyd brothyr, I recomeavnde me vnto as lowyngly as I
con or may. Fordyrmor, plesythe yt yow to vndyrstonde I hawe
resseywyd an letter ffrom yow, the wheche I hawe rede and <P 5> do
whell vndyrstonde/ I hawe wrytt owr ffathyr an answer therof, etc. Owr
ffathyr wold that I showd hy me vnto Calles. Ytt ys so I resseywyd of
Thomas an byll of an Cxxj li. vj s. vj d., wherof I con resseywe but lxx li.
Fl., that hawe I resseywyd and Thomas Kesten hathe promyssyd me to
delyuyr me the rest, and mor to. Allso ther ys an veryavns bytuyxt
Kesten and John Vandyrhay ffor ix sarplers woll. Thys ys an shrowd
[{matter{] : I whas at Mekyllyn and saw yt yll woll. Yt ys thys ys
bytuyxt Kesten and hym.
Cely letter (first half): flat text
Ryght_ADV whellbelovyd_ADV+VAN brothyr_N ,_, I_PRO
recomeavnde_VBP me_PRO vnto_P as_ADVR lowyngly_ADV as_P
I_PRO con_MD or_CONJ may_MD ._. Fordyrmor_ADVR+QR ,_,
plesythe_VBP yt_PRO yow_PRO to_TO vndyrstonde_VB I_PRO
hawe_HVP resseywyd_VBN an_D letter_N ffrom_P yow_PRO ,_,
the_D wheche_WPRO I_PRO hawe_HVP rede_VBN and_CONJ
<P_5> do_DOP whell_ADV vndyrstonde_VB __. I_PRO hawe_HVP
wrytt_VBN owr_PRO$ ffathyr_N an_D answer_N therof_ADV+P ,_,
etc_FW ._. Owr_PRO$ ffathyr_N wold_VBP that_C I_PRO
showd_MD hy_VB me_PRO vnto_P Calles_NPR ._. Ytt_PRO
ys_BEP so_ADV I_PRO resseywyd_VBD of_P Thomas_NPR an_D
byll_N of_P an_D Cxxj_NUM li._NS vj_NUM s._NS vj_NUM d._NS ,_,
wherof_WADV+P I_PRO con_MD resseywe_VB but_FP lxx_NUM
li_NS ._.
Cely letter (beginning): tagged text
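Tagged text of this kind pairs each word with its tag by an underscore. A minimal sketch of how such a line might be read, assuming whitespace-separated word_TAG tokens and page markers in angle brackets; the function name `split_tagged` is illustrative and not part of any CEEC tool:

```python
from collections import Counter


def split_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs.

    Tokens in angle brackets (page markers such as <P_5>) and tokens
    without an underscore are skipped.
    """
    pairs = []
    for token in text.split():
        if token.startswith("<"):
            continue  # page marker, not a word
        word, sep, tag = token.rpartition("_")
        if sep and tag:
            pairs.append((word, tag))
    return pairs


# A stretch of the tagged Cely letter shown above.
sample = "I_PRO hawe_HVP rede_VBN and_CONJ <P_5> do_DOP whell_ADV vndyrstonde_VB"

pairs = split_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts)
```

Splitting on the last underscore keeps punctuation tokens such as `,_,` intact, so a tag-based search (for example, all pronouns) does not depend on the variable spelling of the words themselves.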
( (IP-MAT (NP-VOC (ADJP (ADV Ryght) (ADV+VAN whellbelovyd))
(N brothyr))
(, ,)
(NP-SBJ (PRO I))
(VBP recomeavnde)
(NP-OB1 (PRO me))
(PP (P vnto)
(NP *))
(ADVP (ADVR as) (ADV lowyngly)
(PP (P as)
(CP-CMP (WADVP-1 0)
(C 0)
(IP-SUB (ADVP *T*-1)
(NP-SBJ (PRO I))
(MD (MD con) (CONJ or) (MD may))
(VB *)))))
Cely letter (very beginning): parsed text
Trials: The Proceedings of the Old Bailey, 1674-1913
A fully searchable edition of the largest body of texts
detailing the lives of non-elite people ever published,
containing 197,745 criminal trials held at London's
central criminal court.
http://www.oldbaileyonline.org/index.jsp
"probably the best accounts we shall ever have of what
transpired in ordinary English criminal courts before the
later eighteenth century".
the material reported was neither invented nor
significantly distorted.
at the same time, the Proceedings are far from
comprehensive transcripts of what was said in court.
(see Huber 2007)
"The Old Bailey, Known Also as the Central Criminal Court" (1808)
(http://en.wikipedia.org/wiki/File:Old_Bailey_Microcosm_edited.jpg)
Old Bailey Proceedings: number of words and
proportion of direct speech per decade, 1734-1834
(Huber 2007).
Corpus of English Dialogues (compiled under the supervision of Merja Kytö and Jonathan Culpeper). Word counts by text type.
Degree of narratorial intervention      Authentic dialogue             Constructed dialogue
Minimum narratorial intervention        Trial Proceedings 285,660      Drama Comedy 238,590
                                                                       Didactic Works A. Other 162,250
                                                                       Didactic Works B. Language Teaching 74,390
                                                                       Miscellaneous 25,970
Considerable narratorial intervention   Witness Depositions 172,940    Prose Fiction 223,890
Total word count                        458,600                        725,090
http://www.engelska.uu.se/corpus.html
Period word counts for direct speech in the Corpus of English Dialogues (compiled under the supervision of Merja Kytö and Jonathan Culpeper).
Period          Period totals
1  1560-1599    140,410
2  1600-1639    145,880
3  1640-1679    192,150
4  1680-1719    237,030
5  1720-1760    178,630
Total           894,100
(Source: http://www.engelska.uu.se/corpus.html)
Replacement of THOU by YOU: HC1
Helsinki Corpus: the use of THOU c. 500 instances in
1500-1570 (sermons, the Bible; but also handbooks,
educational treatises, fiction, comedy, and trials):
This whete and rye that thou shalt sowe ought to be
very clene of wede, and therfore er thou thresshe thy
corne open thy sheues and pyke oute all maner of
wedes, and than thresshe it and wynowe it clene, & so
shalt thou haue good clene corne an other yere. (John
Fitzherbert, The Boke of Husbandry 1534: 41).
Replacement of THOU by YOU: HC2
Helsinki Corpus: the use of THOU c. 350 instances in
1570-1640 (sermons, the Bible; comedy, fiction, trials)
sociodialectal narrowing during the seventeenth
century:
- in comedies and fiction, for example, thou is found
in the mouths of servants and country people.
- to some extent, thou continues to be used by
social superiors addressing their inferiors.
rare in letters
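Raw counts from sub-periods of different sizes are easier to compare once normalised. A minimal sketch, assuming that the approximate THOU counts given on these slides (c. 500 and c. 350) cover the whole of the respective Helsinki Corpus sub-periods, whose word totals (190 160 and 189 800) are taken from the corpus table earlier; the helper name `per_10k` is illustrative:

```python
def per_10k(hits, corpus_words):
    """Normalised frequency: occurrences per 10,000 running words."""
    return hits / corpus_words * 10_000


# (approximate THOU count, sub-period word total); pairing the two is an assumption.
thou = {
    "1500-1570": (500, 190_160),
    "1570-1640": (350, 189_800),
}

rates = {period: per_10k(hits, words) for period, (hits, words) in thou.items()}
for period, rate in rates.items():
    print(f"{period}: {rate:.1f} per 10,000 words")
```

On these figures THOU falls from roughly 26 to roughly 18 instances per 10,000 words. Because the counts are approximations, the result indicates no more than the direction of change.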
Users and non-users of THOU in the CEEC (Nevala 2004: 165).
Writer/recipient relation Users of THOU Non-users of THOU
Family members
15th century 2 (4%) 50 (96%)
16th century 6 (8%) 97 (92%)
17th century 21 (12%) 158 (88%)
18th century 1 (5%) 20 (95%)
Close friends
15th century 0 (0%) 8 (100%)
16th century 0 (0%) 2 (100%)
17th century 7 (21%) 27 (79%)
18th century 1 (6%) 16 (94%)
Findings on CEEC (Nevala 2004)
all THOU users also use YOU
15th - 16th c: mostly from London & Court
17th c.: mostly from other parts of England
18th c.: in poetical & biblical contexts
17th c.: writer/recipient: male writers to their wives,
female writers to their husbands and children
some typical users: the Kentish gentleman Henry
Oxinden writing to his wife
Lady Katherine Paston, an East Anglian gentlewoman
writing to her son
‘Heavy’ THOU users (1): Henry Oxinden
Deare Heart
How glad I was to heare from thee I cannott well expresse: I will
assure thee, leaving all manner of expressions out which are not as
reall as God is true, I do exceedingly love and honour thee.
And the more because of thy industrie in advancing her who if this
businesse in hand faile, cannot expect anie thing of consequence.
Prethee if the rub be onlie in her, remove itt by all meanes possible,
and I shall thinke nothing too much for thee that I may be able to
give thee. I would thou didst but know one halfe of my ardent
affections towards thee and then I dare say thou wouldst run
through fire and water to effect my desires.
(Henry Oxinden to Katherine Oxinden, 1647)
‘Heavy’ THOU users (2): Katherine Paston
My good will: Christ Iesus blese the ever: I did take thy wrightinge to
me in very kinde parte, seinge that at that time thow mightest haue
pretended wearines withe travill yett woldest not make that any lett
to hinder me of thy most louinge and respectiue lines, the which
wear and ever shall be most well com to me, I was glad to heer of
your prosperous Iorny, and of the kind wellcom which you fownd
from that worthy master./whom, I wold by any means thou sholdest
haue a very reverend respect ofe:/ and beware good child that thou
be not too talketiue befor him, but only to learne what is fittinge
behauiour for you to vse before him and that observe and doe:
(Lady Katherine Paston to William Paston, April 1624)
Pursuing Region: some cross-corpus comparisons
THOU vs. YOU
corpora: trials
Old Bailey Proceedings, 1674-1913
English Witness Depositions 1560–1760
Q: the disappearance of THOU?
The use of THOU vs. YOU in Old Bailey trials (London).
Period      THOU   % THOU    YOU                % YOU
1700-1709   7      28%       19 (9 items)       72%
1710-1719   3      1%        240 (75 items)     99%
1720-1729   2      < 0.5%    (514 items)        > 99.5%
1730-1739   20     < 0.5%    (1,539 items)      > 99.5%
1740-1749   12     < 0.5%    (2,837 items)      > 99.5%
William Wilson … one silver watch, val. 3 l. one metal watch, val. 4 l. one three pound twelve shill. piece, two thirty six shill. pieces, and three moidores, the property of Joseph Millikin … did steal, take and carry away, 18 June, 1750
We prevailed upon the countryman to change his
dress, by pulling his great coat off, and I put my hat
and wig on his head, and put on the countryman's
wig, and walked up after him. We gave him charge if
that was the man to give us notice and we would
assist him; he went and took a survey of the man,
went past him a few yards, I planted myself by the
prisoner, the countryman turn'd upon him, and said, ''
mon thou '' hast not altered thy heed if thou hast thy
dress, '' thou art the mon that robbed me.''
> report of a ‘countryman’s’ words
Old Bailey: William Davis was indicted for stealing one grey gelding, value 3 l. the property of John Southal, 16 January, 1751
Said I to the prisoner, how came you by this horse?
said he, he had been in a pound, and was brought to
me; I took him to a justice near there; the justice ask'd
him where he was going; said he, to service: said the
justice, how much money hast thou in thy pocket?
said he, but six-pence; said the justice, thou settest
out very empty; the saddle was mine.
> report of a justice talking to the prisoner
The use of THOU vs. YOU in English Witness Depositions (Kytö et al. 2007).
Region/Period             THOU   % THOU   YOU   % YOU
North-east (1696–1760)    30     39%      47    61%
North-west (1724–1758)    30     35%      56    65%
East (1700–1754)          0      -        55    100%
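The percentages in this table are each form's share of all second-person singular pronouns in the region. A minimal sketch that recomputes them from the raw counts above; the helper name `thou_share` is illustrative:

```python
# (THOU count, YOU count) per region, copied from the table above.
counts = {
    "North-east (1696-1760)": (30, 47),
    "North-west (1724-1758)": (30, 56),
    "East (1700-1754)": (0, 55),
}


def thou_share(thou, you):
    """Percentage of THOU among all THOU + YOU forms (NaN if there are none)."""
    total = thou + you
    return 100 * thou / total if total else float("nan")


shares = {region: round(thou_share(t, y)) for region, (t, y) in counts.items()}
for region, share in shares.items():
    print(f"{region}: THOU {share}%, YOU {100 - share}%")
```

The same calculation can be run on any other pair of variant counts, which makes the regional figures directly comparable with the London trial data.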
English Witness Depositions examples (Kytö et al. 2007)
said he, Thou knows y=t= y=u= and thy Daughter Murthered a man, and conveyed him away.
(North-west: National Archives, London. Palatinate of Lancaster, Crown Court Depositions. MS PL 27/2, the information of Thomas Airton, 1697)
[...] the said Bassett said Damm ye for a whore youhave pict my Pockett
(East: Norfolk Record Office, Norwich. Norwich Quarter Sessions files, interrogatories and depositions. MS NCR Case 12b(2), the information of Ellen Wakefeild, 1714)
Conclusions
historical sociolinguistic corpora usefully complement
each other
chronologically
regionally
socially
building up baseline evidence is necessary for a
comprehensive picture of historical developments
new corpora always needed to make our ‘bad data’
better!
The work goes on …
Find out more at: http://www.helsinki.fi/varieng/CoRD/index.html
CoRD is an open-access online resource on which academic corpus compilers can make available basic information about their corpora. It is part of the eVARIENG online services, offered and maintained by the Research Unit for Variation, Contacts and Change in English (VARIENG).
References
Beal, J.C., K.P. Corrigan & H.L. Moisl (eds) (2007). Creating and Digitizing Language Corpora. Vol. 2: Diachronic databases. Houndmills: Palgrave-Macmillan.
Huber, M. (2007). The Old Bailey Proceedings, 1674-1834. Evaluating and annotating a corpus of 18th- and 19th-century spoken English. Studies in Variation, Contacts and Change in English, Volume 1, ed. by A. Meurman-Solin & A. Nurmi. http://www.helsinki.fi/varieng/journal/volumes/01/huber/
Kytö, M., P. Grund & T. Walker (2007). Regional variation and the language of English witness depositions 1560-1760: constructing a 'linguistic' edition in electronic form. Studies in Variation, Contacts and Change in English, Volume 2, ed. by P. Pahta et al. http://www.helsinki.fi/varieng/journal/volumes/02/kyto_et_al/
Labov, W. (1994). Principles of Linguistic Change. Vol. 1: Internal factors. Oxford: Blackwell.
Nevala, Minna (2004). Address in Early English Correspondence: Its Forms and Socio-pragmatic Functions. Mémoires de la Société Néophilologique de Helsinki 64. Helsinki: Société Néophilologique.
Nevalainen, T. (2010). Theory and practice in English historical sociolinguistics. Studies in Modern English 26: 1–24.
Nevalainen, T. & H. Raumolin-Brunberg (2003). Historical Sociolinguistics: Language Change in Tudor and Stuart England. London: Longman.
Romaine, S. (1982). Socio-historical Linguistics: Its Status and Methodology. Cambridge: CUP.
Tagliamonte, S. (2008). Conversations from the speech community: Exploring language variation in synchronic dialect corpora. The Dynamics of Linguistic Variation, ed. by T. Nevalainen, I. Taavitsainen, P. Pahta & M. Korhonen, 107-128. Amsterdam/Philadelphia: Benjamins.