Forecasting the beginnings of newspaper texts
Some corpus & experimental findings
Michael Hoey, Matthew Brook O’Donnell, Michaela Mahlberg and Mike Scott
BAAL 11-13 September 2008, Swansea University
The Lexical Priming claim
Whenever we encounter a word (or syllable or combination of words), we note subconsciously
the words it occurs with (its collocations),
the meanings with which it is associated (its semantic associations),
the pragmatics it is associated with (its pragmatic associations),
word collocates with against and a
a word against has a semantic association with sending & receiving communication
(e.g. hear a word against)
send/receive a word against has a pragmatic association with denial
(e.g. wouldn’t hear a word against)
denial + send/receive a word against has a pragmatic association with hypotheticality
(e.g. wasn’t prepared to say a word against)
The Lexical Priming claim
Whenever we encounter a word (or syllable or combination of words), we also note subconsciously
the grammatical patterns it is associated with (its colligations),
the genre and/or style and/or social situation it is used in,
whether it is used in a context we are likely to want to emulate or not
denial + send/receive a word against colligates with modal verbs
(e.g. wouldn’t hear a word against)
denial + send/receive a word against also colligates with human subjects and human prepositional objects
The Lexical Priming claim
All the features we notice prime us so that when we come to use the word ourselves, we are likely (in speech, particularly) to use it in the same lexical context, with the same grammar, in the same semantic context, as part of the same genre/style, in the same kind of social and physical context, with a similar pragmatics and in similar textual ways.
The Lexical Priming claim
Our ability to do this is what it means to know a word.
We are ALL learners, since we never stop being primed.
The only difference between the native speaker and the non-native speaker is the way that they are typically primed.
Creativity is the result of overriding some of one’s primings.
A footnote
Whenever we encounter a word (or syllable or combination of words), we note subconsciously …
the words it occurs with (its collocations),
the grammatical patterns it occurs in (its colligations),
the meanings with which it is associated (its semantic associations),
The Lexical Priming claim
Whenever we encounter a word (or syllable or combination of words), we also note subconsciously
the positions in a text that it occurs in, e.g. does it like to begin sentences? Does it like to start paragraphs? (its textual colligations),
the genre and/or style and/or social situation it is used in
Research Question
• Do certain words and groups of words exhibit preferences for particular textual positions, such as the beginnings of texts and paragraphs? (Once upon a time is the canonical example)
• If they do, how can these items be discovered in a corpus?
AHRC Textual Priming Project
Using a corpus of Home News articles from the Guardian/Observer newspaper 1998-2004
◦ Approx. 54 million words
◦ 113,288 articles
Each sentence in body of each article is classified according to its position
◦ TISC (Text-Initial Sentence Corpus) – first sentence of first paragraph
◦ PISC (Paragraph-Initial Sentence Corpus) – first sentence of any subsequent paragraph
◦ NISC (Non-Initial Sentence Corpus) – any non-initial sentence
Thanks to AHRC
Method: Sentence classification
[TISC sentence] More wet weather was predicted across Britain today as experts warned many areas were already saturated with rain.
…
[PISC sentence] On Wednesday and Thursday a brief respite should see most of the country becoming fine, with heavy rain only expected across parts of Northern Ireland. [NISC sentence] But by Friday, much of England and Wales will again be hit by storms and further downpours.
…
[PISC sentence] So far, Britain's recent storms have already claimed the lives of six people. [NISC sentence] Yesterday, insurers said the cost of the cleanup could run into tens of millions of pounds.
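The positional labelling described on this slide can be sketched in a few lines of Python. This is only an illustration, not the project's own tooling: the function name `classify_sentences` is hypothetical, the input is assumed to be already split into paragraphs and sentences, and the example sentences are abbreviated from the weather article.

```python
from typing import List, Tuple


def classify_sentences(paragraphs: List[List[str]]) -> List[Tuple[str, str]]:
    """Label each sentence TISC, PISC or NISC by its position in the article body.

    `paragraphs` is a list of paragraphs, each a list of sentence strings
    (sentence splitting is assumed to have been done beforehand).
    """
    labelled = []
    for p_idx, sentences in enumerate(paragraphs):
        for s_idx, sentence in enumerate(sentences):
            if s_idx > 0:
                label = "NISC"   # any non-initial sentence
            elif p_idx == 0:
                label = "TISC"   # first sentence of first paragraph
            else:
                label = "PISC"   # first sentence of a later paragraph
            labelled.append((label, sentence))
    return labelled


# Abbreviated version of the example article (three of its paragraphs).
article = [
    ["More wet weather was predicted across Britain today ..."],
    ["On Wednesday and Thursday a brief respite ...", "But by Friday, ..."],
    ["So far, Britain's recent storms ...", "Yesterday, insurers said ..."],
]
labels = [label for label, _ in classify_sentences(article)]
print(labels)
```

Keeping the paragraph structure in the input means the position test needs only the two indices, so no text inspection is required at labelling time.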
Guardian Home News 1998-2004
                                        TISC          PISC          NISC
tokens                             3,122,037    12,521,902    19,338,590
types                                 58,432       127,038       141,793
token/type ratio                       53.43         98.57        136.39
sentences                            113,288       607,125     1,064,493
mean sentence length (in words)           28            21            18
std.dev.                               11.11          9.68          9.88
Summary of positional subcorpora
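As a check on how the rows of this table relate to each other, the ratio row equals tokens divided by types, and the mean row equals tokens divided by sentences. The sketch below recomputes both from the figures in the table; the function names are hypothetical and chosen only for this illustration.

```python
# Figures copied from the summary table above.
subcorpora = {
    "TISC": {"tokens": 3_122_037, "types": 58_432, "sentences": 113_288},
    "PISC": {"tokens": 12_521_902, "types": 127_038, "sentences": 607_125},
    "NISC": {"tokens": 19_338_590, "types": 141_793, "sentences": 1_064_493},
}


def tokens_per_type(tokens: int, types: int) -> float:
    """Average number of running words per distinct word form."""
    return tokens / types


def mean_sentence_length(tokens: int, sentences: int) -> float:
    """Average number of words per sentence."""
    return tokens / sentences


for name, row in subcorpora.items():
    ratio = tokens_per_type(row["tokens"], row["types"])
    mean_len = mean_sentence_length(row["tokens"], row["sentences"])
    print(f"{name}: ratio={ratio:.2f}, mean sentence length={mean_len:.0f}")
```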
Method: Intra-textual Key Word Analysis
• Compare the frequency of words and clusters in one section of text with their frequency in another
• For example, fresh occurs significantly more frequently in text-initial sentences (TISC) than in non-initial sentences (NISC)
• fresh is a text-initial key word
• It also exhibits distinctive patterns in TISC contexts in terms of collocates:
• fresh {row, controversy, embarrassment}
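The slides do not say which statistic decides that a word occurs "significantly more frequently" in one subcorpus. A common choice in key word analysis is the log-likelihood statistic, and the sketch below assumes it. The function names, the 3.84 threshold (the chi-square critical value for p < 0.05 with one degree of freedom) and the two counts for fresh are assumptions made for illustration; only the subcorpus sizes come from the summary table.

```python
import math


def log_likelihood(freq_study: int, size_study: int,
                   freq_ref: int, size_ref: int) -> float:
    """Log-likelihood keyness of one item in a study corpus vs. a reference corpus."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2.0 * ll


def is_positive_key(freq_study: int, size_study: int,
                    freq_ref: int, size_ref: int,
                    threshold: float = 3.84) -> bool:
    """True if the item is relatively more frequent in the study corpus
    and its log-likelihood score exceeds the threshold."""
    more_frequent = freq_study / size_study > freq_ref / size_ref
    score = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    return more_frequent and score > threshold


# Hypothetical counts for "fresh" (invented for illustration, not project data);
# the corpus sizes are the TISC and NISC token totals.
print(is_positive_key(900, 3_122_037, 1_200, 19_338_590))
```

Real key word tools usually apply a much stricter threshold and a minimum frequency, so the default here should be read as a placeholder.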
Method: Comparative KW lists
Take the pair-wise comparisons for TISC, PISC and NISC and create Key Word and Key Cluster lists:
TISC_NISC
TISC_PISC
PISC_NISC
PISC_TISC
NISC_TISC
NISC_PISC
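The six list names are the ordered pairs of the three subcorpora. A minimal sketch, assuming the first name in each pair is the subcorpus under study and the second is the reference it is compared against (the helper name `comparison_names` is hypothetical):

```python
from itertools import permutations

SUBCORPORA = ["TISC", "PISC", "NISC"]


def comparison_names(subcorpora):
    """Return one 'STUDY_REFERENCE' name for every ordered pair of subcorpora."""
    return [f"{study}_{reference}" for study, reference in permutations(subcorpora, 2)]


names = comparison_names(SUBCORPORA)
print(names)
```

Ordered pairs are needed rather than unordered ones because each comparison is directional: a word can be key in TISC relative to NISC without the reverse holding.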
Method: Key Word/Cluster Matrix
Each word/cluster scored according to whether (Y) or not (N) it is found on each of the six lists:
                    TISC_NISC  TISC_PISC  PISC_NISC  PISC_TISC  NISC_TISC  NISC_PISC
yesterday           Y          Y          Y          N          N          N
said                N          N          Y          Y          Y          N
also                N          N          N          Y          Y          N
recall              N          N          N          N          Y          N
it was announced    Y          Y          N          N          N          N
revealed that       Y          N          Y          N          N          N
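One way to derive such a matrix is to test each item for membership in each of the six lists, in a fixed column order. The sketch below assumes each list is held as a set; `build_matrix` is a hypothetical helper, and the sets contain only the six example items from the slide.

```python
LIST_ORDER = ["TISC_NISC", "TISC_PISC", "PISC_NISC",
              "PISC_TISC", "NISC_TISC", "NISC_PISC"]


def build_matrix(key_lists):
    """Map every item to a six-letter Y/N pattern, one letter per list in LIST_ORDER."""
    items = set().union(*key_lists.values())
    return {
        item: "".join("Y" if item in key_lists.get(name, set()) else "N"
                      for name in LIST_ORDER)
        for item in sorted(items)
    }


# Membership reconstructed from the example rows on the slide.
key_lists = {
    "TISC_NISC": {"yesterday", "it was announced", "revealed that"},
    "TISC_PISC": {"yesterday", "it was announced"},
    "PISC_NISC": {"yesterday", "said", "revealed that"},
    "PISC_TISC": {"said", "also"},
    "NISC_TISC": {"said", "also", "recall"},
    "NISC_PISC": set(),
}
matrix = build_matrix(key_lists)
for item, pattern in matrix.items():
    print(f"{item:18s}{pattern}")
```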
Categories from patterns
From our corpus there are 18 resulting patterns, covering:
◦ 4467 words
◦ 50861 clusters
Here we focus on four patterns:
1. Text-Initial (YYNNNN & YNNNNN)
2. Paragraph-Initial (NNYYNN & NNYNNN)
3. TI and PI (YNYNNN)
4. Non-initial (NNNNYY)
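Assigning an item to one of these four categories is then a lookup on its six-letter pattern. A minimal sketch, using only the patterns listed on this slide (the name `categorise` is hypothetical, and any of the 18 patterns not listed here is simply returned as uncategorised):

```python
from typing import Optional

# Patterns as listed for the four categories in focus.
CATEGORY_PATTERNS = {
    "Text-Initial": {"YYNNNN", "YNNNNN"},
    "Paragraph-Initial": {"NNYYNN", "NNYNNN"},
    "TI and PI": {"YNYNNN"},
    "Non-initial": {"NNNNYY"},
}


def categorise(pattern: str) -> Optional[str]:
    """Return the category name for a Y/N pattern, or None if it is not one of the four."""
    for category, patterns in CATEGORY_PATTERNS.items():
        if pattern in patterns:
            return category
    return None


print(categorise("YYNNNN"))  # pattern of "it was announced" in the matrix
print(categorise("YNYNNN"))  # pattern of "revealed that" in the matrix
```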
Category 1: Text-initial (YYNNNN, YNNNNN & YYNNNY)
                                TISC      PISC      NISC
ONE OF BRITAIN’S               132.0      13.4       9.0
A REPORT BY THE                 16.0       6.8       3.1
ARE TO BE                      271.9      23.8      24.7
THAT COULD                     106.7      41.4      61.7
AFTER BEING                    334.4      67.8      54.5
Frequencies normalized to occurrences per million words
• 1,600 (36%) of our key words and 29,303 (58%) of our key clusters belong to this category
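The per-million normalization makes frequencies comparable across subcorpora of very different sizes. The sketch below shows the conversion in both directions; the function names are hypothetical, and the back-calculated count is only an estimate derived from the rounded rate in the table, not a figure reported in the slides.

```python
def per_million(count: float, corpus_tokens: int) -> float:
    """Convert a raw count to occurrences per million words."""
    return count / corpus_tokens * 1_000_000


def approx_raw_count(rate_per_million: float, corpus_tokens: int) -> float:
    """Estimate the raw count behind a per-million rate."""
    return rate_per_million * corpus_tokens / 1_000_000


TISC_TOKENS = 3_122_037  # from the summary of positional subcorpora

# ONE OF BRITAIN'S is listed at 132.0 per million words in TISC.
estimate = approx_raw_count(132.0, TISC_TOKENS)
print(f"roughly {estimate:.0f} occurrences in TISC")
```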
Category 2: Paragraph-initial (NNYYNN, NNYNNN & NNNYNN)
                                TISC      PISC      NISC
THE FINDINGS                    13.1      43.5      16.7
CAME AS                          5.8      47.9       9.8
IS THE LATEST                   10.6      17.2       3.5
GENERAL SECRETARY OF THE        15.4      58.5      11.6
CONFIRMED THAT                  43.2      66.5      32.3
Frequencies normalized to occurrences per million words
• 732 (16%) of our key words and 5,755 (11%) of our key clusters belong to this category
Category 3: Text- & Paragraph-initial (YNYNNN & YNYYNN)
                                TISC      PISC      NISC
THE CONTROVERSY                 28.5      17.2       6.6
HEAD OF THE                     93.5      89.4      46.8
DECISION TO                    151.8     130.8      73.9
SAID YESTERDAY THAT             80.1      67.5      26.6
ISSUED A                        51.9      35.9      20.3
Frequencies normalized to occurrences per million words
• 253 (6%) of our key words and 913 (2%) of our key clusters belong to this category
Category 4: Non-initial (NNNNYY, NNNYYY & NYNNYY)
                                TISC      PISC      NISC
HAVE TO                        174.2     352.2     616.1
WHILE                          530.1     589.3     701.0
BUT                           1262.0    4164.3    6068.6
BE ABLE TO                      81.0     108.9     165.5
GOING TO                        78.2     268.8     494.9
Frequencies normalized to occurrences per million words
• 486 (11%) of our key words and 3,105 (6%) of our key clusters belong to this category
Method: Sentence classification
[TISC sentence] More wet weather was predicted across Britain today as experts warned many areas were already saturated with rain.
TISC: 79 per million sentences
PISC: 5 per million sentences
NISC: 22 per million sentences
Method: Sentence classification
[TISC sentence] More wet weather was predicted across Britain today as experts warned many areas were already saturated with rain.
TISC: 117 per 100,000 sentences
PISC: 25 per 100,000 sentences
NISC: 12 per 100,000 sentences
Method: Sentence classification
[TISC sentence] More wet weather was predicted across Britain today as experts warned many areas were already saturated with rain.
TISC: 48 per thousand sentences
PISC: 7 per thousand sentences
NISC: 6 per thousand sentences
Method: Sentence classification
[TISC sentence] More wet weather was predicted across Britain today as experts warned many areas were already saturated with rain.
TISC: 17 per thousand sentences
PISC: 4 per thousand sentences
NISC: 3 per thousand sentences
Theoretical Implications
Confirmation of prediction made by lexical priming theory
Knowing a word includes knowing where it will be used in a text
Clusters are more important than single words in textual positioning (cf. Wray)
Applied Linguistic Implications
Translation
Academic writing
Authentic data
Death (or redefinition) of the topic sentence
Applied Linguistic Implications
Learning a word or phrase includes learning its characteristic textual positioning, or else a learner’s text will read awkwardly
Fabricated texts are unlikely to preserve the natural textual colligations of the language if the intention of these texts is to illustrate other features
Textual colligation is where discourse analysis and dictionaries meet.