English Corpora and Words Defined in Learner's Dictionaries
Yang Shouxun
Corpus Development Section, FLTRP
English Corpora
A corpus is a collection of written texts and/or transcripts of spoken language.
  The Brown Corpus
  The British National Corpus
Special software is used to access and analyze the corpus.
FLTRP English/Chinese Parallel Corpus
Frequency
A small number of high-frequency words cover a large proportion of the corpus.
A large number of low-frequency words cover a disproportionately small part.
Share of corpus covered by the top n words:
top n    BNC      Brown    FlecPara
10       22.51%   24.43%   23.19%
20       29.52%   31.35%   29.91%
50       40.03%   40.88%   40.03%
100      47.20%   47.71%   47.58%
200      53.63%   53.94%   54.16%
500      62.01%   62.40%   63.11%
1000     69.11%   69.37%   70.50%
10000    90.95%   92.44%   92.86%
40000    97.07%   99.40%   98.55%
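A coverage table like the one above can be sketched in a few lines. This is only an illustration on a toy sentence, not the procedure used for the BNC, Brown, or FlecPara figures; the function name `top_n_coverage` and the sample text are assumptions made for the example.

```python
from collections import Counter

def top_n_coverage(tokens, ns):
    """Return {n: percentage of all tokens covered by the n most frequent word types}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Token counts of the word types, from most to least frequent.
    ranked = sorted(counts.values(), reverse=True)
    return {n: 100.0 * sum(ranked[:n]) / total for n in ns}

# Toy corpus (hypothetical data, not taken from any of the three corpora).
tokens = "the cat sat on the mat and the dog sat on the rug".split()
coverage = top_n_coverage(tokens, [1, 3, 5])

for n, pct in coverage.items():
    print(f"top {n}: {pct:.2f}%")
```

Even in this tiny sample the most frequent word type alone accounts for 4 of the 13 tokens, which is the same effect the table shows at scale.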
Frequency
High-frequency words are most useful to learners of the language.
High-frequency words should be defined in a learner's dictionary.
More frequently used senses should come before less frequently used senses.
Objectives
Are the words defined in a learner's dictionary really high-frequency words?
Which high-frequency words are not defined in a learner's dictionary?
Which low-frequency words are included in a learner's dictionary (and which are not)?
Research methods
Six learner's dictionaries by well-known international publishers
Three corpora for word frequency extraction
  Brown Corpus
  British National Corpus
  FLTRP English/Chinese Parallel Corpus
Research methods
A frequency table is computed for each corpus, and the frequencies are normalized to occurrences per million words.
Lists of defined words in the 6 dictionaries are extracted.
Multi-word entries are excluded, e.g. “a priori”, “according to”.
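The two preparation steps above might look roughly like this. The helper names `per_million` and `single_word_entries`, the counts, and the headword list are all hypothetical; treating any entry that contains whitespace as a multi-word entry is an assumption about how the exclusion was done.

```python
from collections import Counter

def per_million(counts):
    """Convert raw counts to occurrences per million running words."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}

def single_word_entries(headwords):
    """Keep only headwords that contain no internal whitespace (assumed rule)."""
    return [h for h in headwords if len(h.split()) == 1]

# Toy counts for a 100-token corpus (hypothetical data).
counts = Counter({"the": 60, "of": 30, "corpus": 10})
freq = per_million(counts)

entries = single_word_entries(["abandon", "a priori", "according to", "zebra"])
print(freq["corpus"], entries)
```

Normalizing to a per-million scale is what makes a threshold such as "frequency >= 8" comparable across corpora of different sizes.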
Research methods
Words from corpora are reduced to their base forms.
  Dictionaries basically contain words in base forms.
  Corpora contain words in all possible forms, including cases and capitalization.
Some issues with this method:
  thought/think (“thought” is also a noun in its own right)
  Case distinction cannot be kept: A/a, China/china
  Corpora contain lots of entries with numbers, but only a few numbers are entries in dictionaries.
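The reduction step and its side effects can be illustrated with a small sketch. The slides do not say which tool was used for base-form reduction, so the lookup table `BASE_FORMS` below is a hypothetical stand-in for a real lemmatizer, and skipping purely numeric tokens is an assumption, not the documented method.

```python
import re
from collections import Counter

# Hypothetical, tiny lookup of inflected form -> base form.
BASE_FORMS = {"thought": "think", "went": "go", "words": "word"}

# Tokens made only of digits, optionally with . or , separators.
NUMBER = re.compile(r"^\d+([.,]\d+)*$")

def to_base_form(token):
    """Lowercase the token, then map it to a base form if one is known."""
    token = token.lower()
    return BASE_FORMS.get(token, token)

def base_form_counts(tokens):
    """Count base forms, skipping purely numeric tokens (assumed policy)."""
    counts = Counter()
    for token in tokens:
        if NUMBER.match(token):
            continue
        counts[to_base_form(token)] += 1
    return counts

tokens = ["China", "china", "A", "a", "thought", "think", "1984", "Words"]
counts = base_form_counts(tokens)
print(counts)
```

The sample shows the issues listed above: "China" and "china" collapse into one entry, and every "thought" is counted as "think" even where it is a noun.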
Computation
Distribution of word frequency in a dictionary
What percentage of high-frequency words are defined in a dictionary?
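The two computations differ only in their denominator: the first divides by the number of words defined in the dictionary, the second by the number of corpus words at or above the threshold. A minimal sketch, with hypothetical function names and toy data:

```python
def pct_headwords_at_or_above(headwords, freq, threshold):
    """Percentage of a dictionary's headwords whose corpus frequency is >= threshold.
    Denominator: number of headwords in the dictionary."""
    hits = sum(1 for w in headwords if freq.get(w, 0.0) >= threshold)
    return 100.0 * hits / len(headwords)

def pct_frequent_words_defined(headwords, freq, threshold):
    """Percentage of corpus words with frequency >= threshold that are headwords.
    Denominator: number of such corpus words (the same for every dictionary)."""
    frequent = [w for w, f in freq.items() if f >= threshold]
    defined = sum(1 for w in frequent if w in headwords)
    return 100.0 * defined / len(frequent)

# Toy per-million frequencies and headword set (hypothetical data).
freq = {"the": 500.0, "go": 120.0, "legislative": 19.0, "zebra": 0.5}
headwords = {"the", "go", "zebra", "aardvark"}

distribution = pct_headwords_at_or_above(headwords, freq, 1.0)
coverage = pct_frequent_words_defined(headwords, freq, 1.0)
print(distribution, round(coverage, 2))
```

Both functions assume non-empty inputs; a real script would need to guard against an empty headword list or a threshold that no corpus word reaches.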
Distribution of word frequency in dictionaries
Brown Corpus (values in %)
Frequency >=   A      B      C      D      E      F
128            4.6    4.2    2.3    2.2    2.3    2.2
64             9.2    8.3    4.7    4.4    4.5    4.3
32             15.6   14.3   8      7.5    7.7    7.4
16             24.3   22.3   12.6   11.9   12.1   11.7
8              35.2   32.3   18.6   17.5   17.8   17.3
4              52     48.2   29.5   27.5   28.1   27.3
2              64.6   61.8   41.4   38.5   39.4   38.5
1              73.9   72.3   54.5   51     51.7   50.9
Distribution of word frequency in dictionaries
More than 45% of words defined in dictionaries C, D, E, and F are not found in Brown Corpus.
More than 25% of words defined in dictionaries A and B are not found in Brown Corpus.
Still good dictionaries
Even learner's dictionaries include far more words than a learner possibly needs.
Distribution of word frequency in dictionaries
[Figure: Frequency Distribution of Words in Dictionaries (Brown Corpus). Lines: A, B, C, D, E, F. X-axis: Frequency >= (128 down to 1). Y-axis: Percentage in Dictionaries.]
Distribution of word frequency in dictionaries
The figure clearly shows that the dictionaries can be clustered into two categories:
  A, B: for learners
  C, D, E, F: for advanced learners
The denominator is the number of words defined in each dictionary.
Distribution of word frequency in dictionaries
BNC (values in %)
Frequency >=   A      B      C      D      E      F
64             8.3    7.6    4.2    4      4      3.9
32             14.2   13     7.2    6.8    7      6.7
16             22.5   20.6   11.6   10.9   11.1   10.7
8              33.4   30.6   17.3   16.2   16.6   16
4              47.8   44     25.5   23.7   24.4   23.5
2              64.2   59.2   36.2   33.3   34.5   33.3
1              77     73.7   49.4   44.7   46.7   45.5
0.5            82.9   82.6   62.5   56.4   58.7   58
Distribution of word frequency in dictionaries
[Figure: Frequency Distribution of Words in Dictionaries (BNC). Lines: A, B, C, D, E, F. X-axis: Frequency >= (64 down to 0.5). Y-axis: Percentage in Dictionary.]
Distribution of word frequency in dictionaries
FlecPara (values in %)
Frequency >=   A      B      C      D      E      F
128            4.7    4.2    2.4    2.2    2.3    2.2
64             8.7    7.9    4.4    4.2    4.2    4.1
32             14.8   13.5   7.5    7.1    7.2    7
16             23     20.9   11.8   11.1   11.3   10.9
8              34.1   31     17.8   16.6   17     16.4
4              47.1   43     25.4   23.5   24.2   23.4
2              61.6   57.3   35.7   32.6   33.9   32.9
1              72.2   68.8   46.3   42     43.7   42.7
0.5            80.4   78.8   59.1   53.7   55.4   54.9
Distribution of word frequency in dictionaries
[Figure: Frequency Distribution in Dictionaries (FlecPara). Lines: A, B, C, D, E, F. X-axis: Frequency >= (128 down to 0.5). Y-axis: Percentage in Dictionaries.]
How many high-frequency words are defined?
Brown Corpus (values in %)
Frequency >=   A      B      C      D      E      F
128            94.5   96     95.3   96.1   96.5   97.1
64             92.2   93.8   93.3   94.4   94.8   95.3
32             89.5   91.6   90.9   91.9   92.3   92.6
16             85.1   87     87.6   88.2   88.6   89
8              80.1   82.1   84.4   84.5   85     85.6
4              79.5   82.4   89.9   89.1   90.3   91.1
2              70     74.7   89.3   88.3   89.6   90.8
1              60.6   66.3   89     88.6   89.1   91
How many high-frequency words are defined?
The denominator is constant across dictionaries.
Advanced dictionaries are rated higher, but the margin is very small.
The curves for frequencies below 8 are surprising and require an explanation.
How many high-frequency words are defined?
[Figure: High-frequency Words in Dictionaries (Brown Corpus). Lines: A, B, C, D, E, F. X-axis: Frequency >= (128 down to 1). Y-axis: Percentage.]
How many high-frequency words are defined?
BNC (values in %)
Frequency >=   A      B      C      D      E      F
64             92.7   94.5   93.4   94.1   95     94.8
32             90.6   92.6   91.9   92.4   93.2   92.8
16             88.3   90.3   90.1   90.2   91     91.1
8              83.6   85.6   86.3   85.9   87.1   87.1
4              77.7   79.8   82.6   81.5   83.2   83.1
2              68.4   70.6   76.9   75.3   77.4   77.5
1              55.5   59.4   70.8   68.4   70.6   71.5
0.5            48.8   54.3   73.2   70.3   72.5   74.4
How many high-frequency words are defined?
[Figure: High-frequency Words in Dictionaries (BNC). Lines: A, B, C, D, E, F. X-axis: Frequency >= (64 down to 0.5). Y-axis: Percentage.]
How many high-frequency words are defined?
FlecPara (values in %)
Frequency >=   A      B      C      D      E      F
128            95     96     96     96.4   96.6   97
64             92.8   94     93.8   94.1   94.7   94.7
32             90.6   92.1   91.9   92.5   93     93
16             87.7   89.1   89.8   89.8   90.5   90.8
8              83.7   85     87     86.1   87.3   87.8
4              77.4   78.9   83.2   81.7   83.5   83.9
2              69.1   71.9   79.8   77.6   79.9   80.4
1              56.4   60.1   72.1   69.6   71.8   72.7
0.5            53     58     77.5   75     76.7   78.8
How many high-frequency words are defined?
[Figure: High-frequency Words in Dictionaries (FlecPara). Lines: A, B, C, D, E, F. X-axis: Frequency >= (128 down to 0.5). Y-axis: Percentage.]
High-frequency words not defined in dictionaries
As the frequency threshold gets lower, an increasing number of the words above it are not defined.
  Place names, such as “Asia”, “Europe”
  Personal names, such as “John”, “David”
  Other cases, such as “ii”, “na”, “ca”
  Words that probably should be included: “legislative” (>19), not defined in B
Some dictionaries extensively use words in definitions that are not defined.
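A list of undefined high-frequency words, like the examples above, could be produced by a simple set difference. The function name `undefined_frequent_words` is hypothetical, and the frequencies and headwords below are invented for illustration; they are not figures from the study.

```python
def undefined_frequent_words(headwords, freq, threshold):
    """Return (word, frequency) pairs at or above threshold that are not headwords,
    sorted from most to least frequent."""
    missing = [(w, f) for w, f in freq.items()
               if f >= threshold and w not in headwords]
    return sorted(missing, key=lambda item: item[1], reverse=True)

# Toy per-million frequencies after lowercasing (hypothetical data).
freq = {"the": 500.0, "john": 150.0, "asia": 40.0, "legislative": 19.5, "zebra": 0.5}
headwords = {"the", "zebra"}

missing = undefined_frequent_words(headwords, freq, 19.0)
for word, f in missing:
    print(f"{word}\t{f}")
```

Reading such a list by hand is what separates expected gaps (names, abbreviations) from words that probably should have been included.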
Why are not all high-frequency words defined?
Computational methods are good enough but not perfect:
  how to reduce words to the base forms
  spelling variations in the corpus
  numbers
They are considered unimportant or are just left out by accident: “Soviet” (>119), “Unix” (>42)
Some vulgar words are probably avoided intentionally for elementary learners.
Low-frequency words defined in dictionaries
Some lower-frequency words have to be chosen if the dictionary is a big one.
Words and expressions that come into wider use after the corpus is built may find their way into new dictionaries or updated versions: "ISP", "spammer", "MP3", and "e-commerce"
Affixes, e.g. “post-”, “-proof”
Why some low-frequency words are chosen and others are not is not so clear.
Concluding remarks
Take the numbers with a grain of salt.
The frequency principle is well observed in modern English dictionaries.
There may be occasional bugs.
The corpus should be kept up to date, or new words and expressions should be added from other sources if the dictionary is targeted at advanced learners.
Concluding remarks
A learner's dictionary does not really need to cover so many low-frequency words.
A better metric for evaluating learner's dictionaries would be the coverage of high-frequency words in the dictionary texts; this is a topic for further study.
It would be interesting to include some dictionaries compiled without corpora in the study.
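One possible reading of the proposed metric is the share of running words in a dictionary's definition texts that are high-frequency words. The sketch below follows that reading only; the slides do not define the metric, and the function name `definition_text_coverage`, the tokenizing pattern, and the sample definitions are all assumptions.

```python
import re

def definition_text_coverage(definitions, frequent_words):
    """Percentage of tokens in the definition texts that are high-frequency words."""
    tokens = []
    for text in definitions:
        # Very rough tokenizer: lowercase alphabetic runs, optional apostrophe part.
        tokens.extend(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    if not tokens:
        return 0.0
    covered = sum(1 for t in tokens if t in frequent_words)
    return 100.0 * covered / len(tokens)

# Toy definitions and high-frequency word set (hypothetical data).
definitions = ["a large animal with four legs", "to move quickly on foot"]
frequent_words = {"a", "large", "animal", "with", "four", "to", "move", "on", "foot"}

coverage = definition_text_coverage(definitions, frequent_words)
print(f"{coverage:.1f}%")
```

Under this reading, a dictionary scores higher when its definitions stay within vocabulary a learner is likely to know already.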