
LEAD Graduate School

12/28/15, LEAD Colloquium

Word Frequency and Readability: Lexical Characterization of Text Complexity

Xiaobin Chen, Detmar Meurers


Contents

1. Introduction

i. The Importance of Reading Ability and the Reality

ii. Readability Assessment: Why and How

2. Literature Review

i. The Reading Process

ii. Vocabulary and Reading

iii. Word Frequency and Readability

3. Research Design

i. Research Questions

ii. Methodology

4. The Three Studies and Their Results

5. Conclusions


The Importance of Reading

● It is considered the most basic subject of school education and the major source of knowledge development for students.

● A person’s prose/document literacy is positively related to his/her education attainment, income, and occupational prestige. (Kutner et al., 2007, U. S. Department of Education)


Literacy Level and Education Attainment


Literacy Level and Job Opportunities


Literacy Level and Employment


Literacy Level and Gross Earnings


Literacy Level and Occupational Prestige


The Reality

● A significant number of high school graduates still cannot meet the college or career readiness benchmarks (ACT, 2015).


How to Improve Reading Proficiency

● One way to enhance reading outcomes is to engage students “with texts of appropriate complexity throughout schooling” (Nelson, Perfetti, Liben, & Liben, 2012).

● Students usually gain a sense of success and are motivated to read more when they are given texts that enable them to practice being competent readers (Milone & Biemiller, 2014).

● Reading materials that meet the “i + 1” criterion (Krashen, 1985) are optimal for promoting language abilities.


Readability Assessment

● Definition of readability: the sum of all elements of a text that affect a reader's understanding, reading speed, and level of interest in the text (Dale & Chall, 1949).

Key aspects of text readability, adapted from Collins-Thompson (2014)


Readability Assessment

● Qualitative methods (see review by Pearson & Hiebert, 2014):

– Text leveling (TL)

– Rubrics plus exemplars (R + E)

– Text maps (TM)

● Quantitative methods (see reviews by Collins-Thompson, 2014; Benjamin, 2012; Zakaluk & Samuels, 1988):

– Traditionally: multiple regression on surface features

– Modern methods: natural language processing (NLP) and machine learning (ML)

– Features: morphological, lexical, semantic, syntactic, structural, psycholinguistic, genre, etc.


The Present Study

● Extends readability research from the lexical perspective, an important but not yet deeply explored area, by making use of the latest developments in NLP/ML and in linguistic theory and practice.

● Our interest was in the use of word frequency lists for readability assessment, an issue that has attracted attention since the very beginning of readability research but is not yet settled.


Why Frequency?

● Why is word frequency interesting?

● How is it related to readability?


The Reading Process

● Reading is a coordinated execution of a series of processes (Just & Carpenter, 1980), including:

– word encoding

– lexical access

– assigning semantic roles

– and relating the information contained in a sentence to earlier sentences in the same text and the reader's prior knowledge.

● Successful comprehension of texts depends heavily on the reader's:

– syntactical competence and semantic decoding abilities (Marks, Doctorow, & Wittrock, 1974) and

– vocabulary knowledge of the language (Laufer & Ravenhorst-Kalovski, 2010; Nation, 2006).


Vocabulary and Reading

● Lexical coverage and vocabulary knowledge are good predictors of reading comprehension, an idea shared by a number of researchers (e.g., Bernhardt & Kamil, 1995; Laufer, 1992; Nation, 2001, 2006; Qian, 1999, 2002; Ulijn & Strother, 1990).


The Frequency Effects

● A reader's vocabulary knowledge is related to the amount of exposure the reader has had to words.

– Word frequency is predictive of word difficulty (Ryder & Slater, 1988).

– Word frequency is strongly associated with both actual difficulty (how well people can choose the correct definition of a word) and perceived difficulty (how difficult a word looks) (Leroy & Kauchak, 2013).

– High-frequency words are more easily perceived (Bricker & Chapanis, 1953) and more readily retrieved by the reader (Haseley, 1957).

– High-frequency words are perceived and produced more quickly and more efficiently than low-frequency ones (Balota & Chumbley, 1984; Howes & Solomon, 1951; Jescheniak & Levelt, 1994; Monsell, Doyle, & Haggard, 1989; Rayner & Duffy, 1986), resulting in more efficient comprehension of the text (Klare, 1968).

– Frequency of word occurrence affects not only the ease of reading, but also its acceptability (Klare, 1968).


Frequency and Reading Comprehension

[Diagram relating the following components: syntactical competence, semantic decoding abilities, vocabulary knowledge, frequency effects, relating to prior knowledge, reading comprehension]


Related Literature

● Researchers have consistently used semantic and syntactic features of a text to predict its difficulty level (e.g., Dale & Chall, 1948; Flesch, 1948; Gray & Leary, 1935; Kincaid et al., 1975; Kintsch et al., 1993; Kintsch & Vipond, 1979; Lexile, 2007; Vajjala & Meurers, 2012).

● The semantic variable of word difficulty usually accounts for the greatest percentage of readability variance (Marks et al., 1974).

● Consequently, textual difficulty of a reading passage is assessable by investigating the frequency of the words chosen for the writing.


Traditional Readability Research

● Lively and Pressey (1923)

– “zero-index words” and median of the index numbers of words from Thorndike's lists of 10,000 most frequent words in English (Thorndike, 1921).

– “The median index number was the best indicator of the vocabulary burden of reading materials.”

● Patty and Painter (1931)

– Average word weighted value: the average of products of index value from Thorndike's list and the frequency of words in the text sample.

– “an apparent improvement in technique for readability judgment”

● Ojemann (1934)

– words from the text that are among the first 1,000 and first 2,000 most frequent words of the Thorndike list.

– “highly correlated with difficulty”


Modern Readability Research

● Lexile (Lexile, 2007)

– word frequencies from the Carroll-Davies-Richman corpus (Carroll, Davies, & Richman, 1971)

● ATOS (Milone & Biemiller, 2014)

– Graded Vocabulary list

● Commercially successful and effective (Nelson, Perfetti, Liben, & Liben, 2012).


Problems with Previous Research (I)

● Frequency list

– Problem: did not take into consideration spoken language exposure.

– Why not optimal: not a faithful representation of the reader's actual language experience, hence unable to predict the ease of retrieval and perception accurately.

– Solution: a frequency list that represents actual language experience.

● Frequency measures

– Problem: only count actual occurrence of words.

– Why not optimal: did not consider the number of contexts in which a word may occur.

– Solution: include Contextual Diversity (CD) measures, which were found to be better predictors of the word frequency effect in lexical decision tasks (Adelman, Brown, & Quesada, 2006).
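
The difference between a plain occurrence count and a contextual diversity count can be sketched as follows. This is a minimal illustration, assuming that CD is taken as the number of documents (for a subtitle corpus, films or programmes) in which a word appears at least once; the function name `frequency_and_cd` and the toy documents are hypothetical and not taken from the study.

```python
from collections import Counter

def frequency_and_cd(documents):
    """Return (occurrence counts, document counts) for tokenized documents."""
    freq = Counter()  # every occurrence of a word counts
    cd = Counter()    # each document counts at most once per word
    for doc in documents:
        freq.update(doc)
        cd.update(set(doc))
    return freq, cd

# Toy corpus of three tiny "documents" (illustrative only).
docs = [
    ["the", "dog", "barks"],
    ["the", "dog", "the", "dog"],
    ["the", "cat"],
]
freq, cd = frequency_and_cd(docs)
print(freq["dog"], cd["dog"])
```

In this toy corpus "dog" occurs three times but in only two documents, which is the kind of distinction a CD measure is meant to capture.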


Problems with Previous Research (II)

● Methodology

– Methods used: simple average frequency count, percentage of words from the top frequency bands of the list.

– Problem: unable to capture the full picture of text readability.

– Why not optimal: 1) averaging is easily affected by extreme values and loses detail; 2) the contribution of less frequent words is neglected.

– Solution: develop an understanding of how a frequency list can be used as a “ruler” of the text's difficulty level.


Research Questions

● What is the relationship between word frequency and text readability?

● Which frequency measures are better predictors of textual complexity?

● How can word frequency lists be better used to characterize text readability?


Methods: Frequency Lists

● SUBTLEXus (Brysbaert & New, 2009):

– 74,286 word forms

– calculated from a 51-million-word corpus of subtitles from 8,388 American films and television series between the years 1900 and 2007.

● SUBTLEXuk (van Heuven, Mandera, Keuleers, & Brysbaert, 2014)

– 160,022 word forms

– calculated from a 201.7-million-word corpus of subtitles from nine British TV channels broadcast from January 2010 to December 2012.


Frequency Measures
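
Among the measures offered by the SUBTLEX lists are Zipf values (the ZIPF_VALUE and LOGFREQCBEEBIES_ZIPF measures used in Studies 2 and 3). A minimal sketch of the basic Zipf scale, log10 of the frequency per million words plus 3, is given below. The function name `zipf_value` is hypothetical, and the published lists may apply additional smoothing that is not reproduced here.

```python
import math

def zipf_value(count, corpus_tokens):
    """Basic Zipf value: log10(frequency per million words) + 3."""
    per_million = count / corpus_tokens * 1_000_000
    return math.log10(per_million) + 3

# A word seen once per million words sits at about 3 on the scale;
# a word seen a thousand times per million sits at about 6.
print(zipf_value(51, 51_000_000))
print(zipf_value(51_000, 51_000_000))
```

The scale is logarithmic, so each step of 1 corresponds to a tenfold change in frequency, which keeps very frequent function words from dominating averages.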


Methods: Corpora

● Training corpus: WeeBit (Vajjala & Meurers, 2012)

– Sources: educational magazine Weekly Reader and BBC-Bitesize website

– 789,926 words, 616 texts in each level, 5 levels

● Testing corpus: Appendix B of the Common Core State Standards (CommonCore, 2010), 168 texts


Experimental Procedure

● Tokenize corpus texts

– CoreNLP Tokenizer (Manning et al., 2014)

● Calculate various frequency values as features of texts

● Train an ML classification model with the training corpus

– The “class” package of R

– Algorithm: K-nearest neighbors

● Apply the trained model to the test corpus

● Report results

– Within-corpus statistics: 10-fold CV accuracy, 10-fold CV Spearman's ρ

– Cross-corpus statistics: Spearman's ρ
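
The evaluation steps above can be sketched in a few lines. This is a pure-Python stand-in, not the R “class” code used in the study: the value k = 3, the two-dimensional toy features, and all function names are assumptions made for illustration, and the data are randomly generated rather than drawn from WeeBit.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Majority label among the k training points nearest to x (Euclidean)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_validate(xs, ys, folds=10, k=3, seed=0):
    """Out-of-fold predictions from `folds`-fold cross-validation."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    preds = [None] * len(xs)
    for f in range(folds):
        test_idx = order[f::folds]          # every folds-th shuffled item
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        tx = [xs[i] for i in train_idx]
        ty = [ys[i] for i in train_idx]
        for i in test_idx:
            preds[i] = knn_predict(tx, ty, xs[i], k)
    return preds

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for m in range(i, j + 1):
            out[order[m]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman_rho(a, b):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)

# Toy data (hypothetical): 5 levels, 20 texts each, two frequency features
# whose values drift with the level plus some random noise.
rng = random.Random(1)
ys = [level for level in range(1, 6) for _ in range(20)]
xs = [(6.0 - 0.4 * lv + rng.gauss(0, 0.1), 0.5 + 0.1 * lv + rng.gauss(0, 0.05))
      for lv in ys]

preds = cross_validate(xs, ys)
accuracy = sum(p == y for p, y in zip(preds, ys)) / len(ys)
rho = spearman_rho(preds, ys)
print(accuracy, rho)
```

Accuracy treats every misclassification alike, whereas Spearman's ρ rewards predictions that preserve the ordering of levels, which is why both statistics are reported within the corpus and only ρ across corpora with different level schemes.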


Study 1: Frequency Means and SD as Features

● Purposes:

– Testing the use of frequency lists for predicting readability

– Testing if frequency lists from different corpora have different effects in predicting readability

– Testing if different frequency measures make a difference

● Features:

– Average frequency of word forms with/without standard deviation

– Average frequency of word types with/without standard deviation

– Tested all the frequency measures provided by SUBTLEXus and SUBTLEXuk
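
The Study 1 features can be sketched as follows. This is a minimal illustration under stated assumptions: words missing from the list are skipped, the population standard deviation is used, and the dictionary `toy_zipf` with its values is invented for the example. None of these choices is confirmed by the slides.

```python
from statistics import mean, pstdev

def mean_sd_features(tokens, freq):
    """Mean and SD of a frequency measure over tokens and over types."""
    # Token values: one entry per running word found in the list.
    token_vals = [freq[t] for t in tokens if t in freq]
    # Type values: one entry per distinct word found in the list.
    type_vals = [freq[t] for t in set(tokens) if t in freq]
    return {
        "token_mean": mean(token_vals),
        "token_sd": pstdev(token_vals),
        "type_mean": mean(type_vals),
        "type_sd": pstdev(type_vals),
    }

# Hypothetical frequency values for three words.
toy_zipf = {"the": 7.0, "cat": 5.0, "purrs": 3.0}
feats = mean_sd_features(["the", "cat", "the", "purrs"], toy_zipf)
print(feats)
```

Because "the" is repeated, the token mean is pulled towards its high value while the type mean is not, which illustrates why token and type models can behave differently.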


Study 1: Results


Study 1: Findings

● Models trained with both the mean and SD features performed consistently better than those with only mean frequencies, be it type or token averages.

● Type models had uniformly better accuracy and validation performance than token models (see illustration).

● The corpus from which a frequency list was constructed matters when the list is used to characterize text readability.


Token and Type Differences for Common Core


Study 2: Proportions of Words from Frequency Bands of Increasing Fine-grainedness

● Purposes:

– Testing the effectiveness of using frequency lists as a ruler of readability.

● Hypothesis:

– The more of a text's words come from the less frequent bands, the higher the perceptual demand of these words, and hence the higher the text's difficulty and the lower its readability.

● Features:

– Percentage of words from each frequency band

– Gradual increase of band fine-grainedness, or the number of bands the frequency list is cut into

– Band stratification with different frequency measures: LOGFREQCBEEBIES_ZIPF and CD_CBBC from SUBTLEXuk; ZIPF_VALUE and SUBTLCD from SUBTLEXus.
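
The band features can be sketched as below. The slides do not say how the list is cut, so this sketch assumes equal-width slices of the measure's range across the whole list, with band 0 holding the most frequent words; the function name `band_proportions` and the toy values are hypothetical.

```python
def band_proportions(tokens, freq, n_bands):
    """Share of a text's in-list tokens falling into each of n_bands bands.

    Bands are equal-width slices of the measure's range over the list
    (an assumption of this sketch); band 0 is the most frequent band.
    The list must contain at least two distinct values.
    """
    lo, hi = min(freq.values()), max(freq.values())
    width = (hi - lo) / n_bands
    known = [t for t in tokens if t in freq]
    counts = [0] * n_bands
    for t in known:
        # The least frequent word would land on index n_bands, so clamp it.
        band = min(int((hi - freq[t]) / width), n_bands - 1)
        counts[band] += 1
    return [c / len(known) for c in counts]

# Hypothetical Zipf-like values for a four-word list.
toy_zipf = {"the": 7.0, "cat": 5.0, "quietly": 4.0, "purrs": 3.0}
text = ["the", "cat", "purrs", "the"]
two_bands = band_proportions(text, toy_zipf, 2)
four_bands = band_proportions(text, toy_zipf, 4)
print(two_bands, four_bands)
```

Each text yields one feature per band, so a finer stratification gives the classifier more detail but also more, and sparser, features.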


Study 2: Illustration

[Figure: the frequency list cut into 2 bands, then 3, 4, and 5 bands, and so on up to 100 bands]


Study 2: Results with SUBTLEXuk Measures


Summary of Results

● Best-performing token model: 20 bands on LOGFREQCBEEBIES_ZIPF

– 10-fold CV Acc. = 48% (5% improvement over the corresponding model in Study 1)

– within-corpus ρ = .65, p < .001

– cross-corpus ρ = .54, p < .001

● The LOGFREQCBEEBIES_ZIPF type models were not generalizable.

● Neither the CD_CBBC type models nor the token models were generalizable.


Study 2: Results with SUBTLEXus Measures


Study 2: Summary of Results: SUBTLEXus features

● ZIPF_VALUE had better training performance, while the SUBTLCD measure had more stable testing performance.

● Finer-grained frequency bands did not improve testing results beyond 10 bands.

[Figure: Spearman's rho (y-axis) against number of bands (x-axis), in panels for SUBTLCD and ZIPF_VALUE by type model and token model; lines show within-corpus rho and cross-corpus rho]


Study 2: Findings

● It is effective to use the frequency list as a “ruler” of language use to measure readability.

● Although the training performance improves with finer stratification schemes, the testing performance does not improve beyond 10 bands.

● The US list has better performance when the trained models are carried over to a test corpus. Models trained with the UK list do not generalize.

● The type models involving contextual diversity (i.e., SUBTLCD) have more stable performance than the other models.


Study 3: Word Frequencies Cluster Means as Features

● Purposes:

– Approaching readability from an “internal” perspective, namely the frequency of words chosen for the text.

● Hypotheses:

– Difficult texts usually use more less-frequent words, while easier texts use fewer.

– Groupings of word frequencies and the group values are revealing of the text's readability.

● Features:

– Cluster means

– Cluster Zipf values from both SUBTLEXus and SUBTLEXuk

– Up to 100 clusters tested
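
The cluster-mean features can be sketched as follows. The slides do not name the clustering algorithm, so this sketch swaps in a plain one-dimensional k-means with a deterministic start; the function name `cluster_means`, the initialisation scheme, and the toy values are all assumptions, not the authors' method.

```python
def cluster_means(values, k, iterations=50):
    """Sorted means of k clusters of 1-D values (plain k-means)."""
    data = sorted(values)
    # Start the centres at evenly spaced positions in the sorted data.
    centres = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous centre.
        new = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
        if new == centres:
            break
        centres = new
    return sorted(centres)

# Hypothetical Zipf-like values for the words of one short text.
word_values = [7.0, 6.8, 7.2, 5.0, 5.2, 4.8, 3.0, 3.2, 2.8]
features = cluster_means(word_values, 3)
print(features)
```

Unlike the band approach, the cluster boundaries here come from the words of the text itself rather than from fixed cuts of the list, which matches the “internal” perspective described above.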


Study 3: A Simplified Illustration

[Figure: clustering groups the 15 words of a text (Word 1 to Word 15) into Cluster 1, Cluster 2, and Cluster 3]


Study 3: Results


Study 3: Results

[Figure: panels for LOGFREQCBEEBIES_ZIPF and ZIPF_VALUE by type model and token model; lines show within-corpus rho and cross-corpus rho]


Study 3: Findings

● The type and token models had similar performance in terms of accuracy estimates, within- and cross-corpus ρs.

● No significant differences were found between the performance of models trained on measures from different lists.

● Performance improved as the number of clusters increased, with cross-corpus ρs peaking at around 70 clusters.

● The ZIPF_VALUE measure from the US list performed marginally better than its counterpart from the UK list.

● The trained classifiers were generalizable to the test corpus—a finding that suggests the existence of frequency effects on readability.


Conclusions

● The lexical measure of word frequency is effective in characterizing text difficulty.

● Frequency lists: should faithfully represent language usage and exposure.

● Frequency measures: should be normalized measures that accurately estimate the cognitive load involved in vocabulary perception and retrieval.

● The methods:

– Simple overall mean and SD: easy and effective, given that the list and the measure meet the previous two criteria.

– Stratification: improved performance, requires fine-tuning the number of bands, less generalizable

– Clustering: best performance, least sensitive to list and measure, most expensive


References

● ACT. (2015). The Condition of College & Career Readiness 2015 (National). ACT.

● Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word naming and lexical decision times. Psychological Science, 17, 814–823.

● Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of lexical access? The role of word frequency in the neglected decision stage. Journal of Experimental Psychology: Human Perception & Performance, 10, 340–357.

● Benjamin, R. G. (2012). Reconstructing readability: Recent developments and recommendations in the analysis of text difficulty. Educational Psychology Review, 24(1), 63–88.

● Bernhardt, E. B., & Kamil, M. L. (1995). Interpreting relationships between L1 and L2 reading: Consolidating the linguistic threshold and the linguistic interdependence hypotheses. Applied Linguistics, 16, 15–34.

● Bricker, P. D., & Chapanis, A. (1953). Do incorrectly perceived stimuli convey some information? Psychological Review, 60, 181–188.

● Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.

● Collins-Thompson, K. (2014). Computational assessment of text readability: A survey of past, present, and future research. In T. François & D. Bernhard (Eds.), Recent Advances in Automatic Readability Assessment and Text Simplification. Special issue of International Journal of Applied Linguistics (pp. 97–135).

● CommonCore. (2010). Common Core State Standards for English Language Arts and Literacy in History/Social Studies, Science, and Technical Subjects. Common Core State Standards Initiative. Retrieved from http://www.corestandards.org

● Dale, E., & Chall, J. (1948). A formula for predicting readability. Educational Research Bulletin, 27(Jan. 21 and Feb. 17), 1–20, 37–54.

● Dale, E., & Chall, J. (1949). The concept of readability. Elementary English, 26(3).

● Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32, 221–233.

● Gray, W. S., & Leary, B. E. (1935). What makes a book readable. Chicago: University of Chicago Press.

● Haseley, L. (1957). The relationship between cue-value of words and their frequency of prior occurrence. Ohio University.

● van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.


● Howes, D. H., & Solomon, R. L. (1951). Visual duration threshold as a function of word-probability. Journal of Experimental Psychology, 41, 401–410.

● Jescheniak, J. D., & Levelt, W. J. M. (1994). Word frequency effects in speech production: Retrieval of syntactic information and of phonological form. Journal of Experimental Psychology: Learning, Memory, & Cognition, 20, 824–843.

● Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87(4), 329–354.

● Kintsch, W., Britton, B. K., Fletcher, C. R., Kintsch, E., Mannes, S. M., & Nathan, M. J. (1993). A comprehension-based approach to learning and understanding. In D. L. Medin (Ed.), Psychology of Learning and Motivation (Vol. 30, pp. 165–214). New York: Academic Press.

● Kincaid, J. P., Rogers, R. L., Fishburne, R. P., & Chissom, B. S. (1975). Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel. Millington, Tennessee.

● Kintsch, W., & Vipond, D. (1979). Reading comprehension and readability in educational practice and psychological theory. In L. G. Nilsson (Ed.), Perspectives on memory research (pp. 24–62). Hillsdale, NJ: Erlbaum.

● Klare, G. R. (1968). The role of word frequency in readability. Elementary English, 45(1), 12–22.

● Krashen, S. (1985). The Input Hypothesis: Issues and Implications. New York: Longman.

● Kutner, M., Greenberg, E., Jin, Y., Boyle, B., Hsu, Y., Dunleavy, E., & White, S. (2007). Literacy in everyday life: Results from the 2003 national assessment (NCES 2007-480). U.S. Department of Education. Washington, DC: National Center for Education Statistics.

● Laufer, B. (1992). How much lexis is necessary for reading comprehension? In H. Bejoint & P. Arnaud (Eds.), Vocabulary and applied linguistics (pp. 126–132). Basingstoke & London: Macmillan.

● Laufer, B., & Ravenhorst-Kalovski, G. (2010). Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension. Reading in a Foreign Language, 22(1), 15–30.

● Leroy, G., & Kauchak, D. (2013). The effect of word familiarity on actual and perceived text difficulty. Journal of the American Medical Informatics Association, (0), 1–4.

● Lexile. (2007). The Lexile Framework® for reading: Theoretical framework and development. Durham, NC.

● Lively, B. A., & Pressey, S. L. (1923). A method for measuring the “vocabulary burden” of textbooks. Educational Administration and Supervision, 9, 389–398.


● Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55–60).

● Marks, C. B., Doctorow, M. J., & Wittrock, M. C. (1974). Word frequency and reading comprehension. The Journal of Educational Research, 67(6), 259–262.

● Milone, M., & Biemiller, A. (2014). The development of ATOS: The Renaissance readability formula. Wisconsin Rapids.

● Monsell, S., Doyle, M. C., & Haggard, P. N. (1989). Effects of frequency on visual word recognition tasks: Where are they? Journal of Experimental Psychology: General, 118, 43–71.

● Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.

● Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63(1), 59–82.

● Nelson, J., Perfetti, C., Liben, D., & Liben, M. (2012). Measures of text difficulty: Testing their predictive value for grade levels and student performance.

● Ojemann, R. J. (1934). The reading ability of parents and factors associated with reading difficulty of parent education materials. University of Iowa Studies in Child Welfare, 8, 11–32.

● Patty, W. W., & Painter, W. I. (1931). A technique for measuring the vocabulary burden of textbooks. Journal of Educational Research, 24, 127–134.

● Pearson, P. D., & Hiebert, E. H. (2014). The state of the field: Qualitative analyses of text complexity. The Elementary School Journal, 115(2), 161–183.

● Qian, D. D. (1999). Assessing the roles of depth and breadth of vocabulary knowledge in reading comprehension. The Canadian Modern Language Review, 56, 282–308.

● Qian, D. D. (2002). Investigating the relationship between vocabulary knowledge and academic reading performance: An assessment perspective. Language Learning, 52, 513–536.

● Rayner, K., & Duffy, S. A. (1986). Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity. Memory & Cognition, 14, 191–201.

● Ryder, R. J., & Slater, W. H. (1988). The relationship between word frequency and word knowledge. The Journal of Educational Research, 81(5), 312–317.


● Ulijn, J. M., & Strother, J. B. (1990). The effect of syntactic simplification on reading EST texts as L1 and L2. Journal of Research in Reading, 13, 38–54.

● Vajjala, S., & Meurers, D. (2012). On improving the accuracy of readability classification using insights from second language acquisition. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP.

● Zakaluk, B. L., & Samuels, S. J. (Eds.). (1988). Readability: Its Past, Present, and Future. Newark, Del.: International Reading Association.


Thank you!

Contact:

Xiaobin Chen: [email protected]

Detmar Meurers: [email protected]

LEAD Graduate School, Eberhard Karls Universität Tübingen

www.lead.uni-tuebingen.de
