

Complex Word Identification for Swedish

Greta Smolenska

Uppsala University
Department of Linguistics and Philology
Master's Programme in Language Technology
Master's Thesis in Language Technology
June 4, 2018

Supervisors: Joakim Nivre, Magnus Sahlgren


Abstract

Complex Word Identification (CWI) is the task of identifying complex words in text data and is often viewed as a subtask of Automatic Text Simplification (ATS), where the main task is making a complex text simpler. The ways in which a text should be simplified depend on the target readers, such as second language learners or people with reading disabilities. In this thesis, we focus on Complex Word Identification for Swedish. First, in addition to exploring existing resources, we collect a new dataset for Swedish CWI. We continue by building several classifiers of Swedish simple and complex words. We then use the findings to analyze the characteristics of lexical complexity in Swedish and English. Our method for collecting training data based on second language learning material has shown positive evaluation scores and resulted in a new dataset for Swedish CWI. Additionally, the built complex word classifiers have an accuracy at least as good as similar systems for English. Finally, the analysis of the selected features confirms the findings of previous studies and reveals some interesting characteristics of lexical complexity.


Contents

List of Tables
Preface
1 Introduction
  1.1 Purpose
  1.2 Outline
2 Background
  2.1 Complexity
    2.1.1 The Concept of Complexity
    2.1.2 Linguistic Complexity
    2.1.3 Lexical Complexity
  2.2 Automatic Text Simplification
  2.3 Previous Work
    2.3.1 CWI Systems
    2.3.2 Features of Simple and Complex Words
3 Dataset for Swedish Complex Word Identification
  3.1 Complexity Levels
  3.2 Collection of the Datasets
    3.2.1 Sources
    3.2.2 Gold Standard Datasets
    3.2.3 Evaluation of the Datasets
4 Features for Swedish Complex Word Identification
  4.1 Sources and Tools
  4.2 Extracted Features
    4.2.1 Morpho-Syntactic Features
    4.2.2 Contextual Features
    4.2.3 Syntactic Features
    4.2.4 Conceptual Features
    4.2.5 Frequency Features
  4.3 Feature Analysis of Simple and Complex Words
5 Experiments
  5.1 Experimental Setup
    5.1.1 Model Selection
    5.1.2 Lexical Complexity and Cognition
  5.2 Evaluation
  5.3 Experiments and Results
    5.3.1 Selection of the Classifier
    5.3.2 Selection of the Optimal Feature Set
  5.4 Simple and Complex Words in WordNet
6 Discussion
  6.1 Gold Standard Dataset
  6.2 Complex Word Classifier
  6.3 Lexical Complexity
7 Conclusion
  7.1 Summary
  7.2 Future Work
Bibliography
A Machine Learning Algorithms
B Syntactic Relations


List of Tables

3.1 CEFR Global Scale
3.2 Kelly Gold Standard Dataset
3.3 Sources of simple, standard and complex Swedish that were used to collect the Gold Standard Dataset
3.4 Manually Collected Gold Standard Dataset
3.5 Dataset Evaluation Results
4.1 Roget's Conceptual Classes
4.2 The Swedish Culturomics Gigaword Corpus
4.3 Scrambled corpora that were used to retrieve word frequencies
4.4 Morpho-syntactic Features
4.5 Contextual Features
4.6 Syntactic Features
4.7 Conceptual Features
4.8 Frequency Features
5.1 Evaluation Metrics
5.2 Binary classification results using different classifiers
5.3 Multi-classification results using different classifiers
5.4 Binary classification results using different feature blocks
5.5 Multi-classification results using different feature blocks
5.6 Feature Informativeness: multi-classification results
5.7 Feature Removal Results I
5.8 Conceptual Feature Removal Results
5.9 Morpho-syntactic Feature Removal Results
5.10 Contextual Feature Removal Results
5.11 Syntactic Feature Removal Results
5.12 Frequency Feature Removal Results
5.13 Feature Removal Results II


Preface

First of all, I am very grateful to my supervisors Joakim Nivre and Magnus Sahlgren for their ideas, advice and help throughout the entire work.

A special thanks must also go to Evelina Rennes for providing me with necessary data.

Finally, I am very grateful to the teachers of the Department of Linguistics and Philology at Uppsala University who helped me with my studies during the master's program.


1 Introduction

Based on data from Statistics Sweden (Statistiska centralbyrån) [1], in 2017 more people moved to Sweden than were born in the country. The reason for this is not only a high number of asylum seekers but also the fact that Sweden is one of the most attractive destinations for expats. According to the Expat Explorer Survey [2], Sweden is voted the 8th best country for expats out of 45 countries. It is voted to have the best childcare quality in the world, the second best job security and work/life balance, and is third in economic confidence. However, the survey also reveals areas that are more problematic, such as making friends (45th place), social life (43rd place) and integration (39th place).

[1] http://www.scb.se/hitta-statistik/statistik-efter-amne/befolkning/befolkningens-sammansattning/befolkningsstatistik/pong/tabell-och-diagram/helarsstatistik–riket/befolkningsutveckling-fodda-doda-in–och-utvandring-gifta-skilda/
[2] https://www.expatexplorer.hsbc.com/survey/country/sweden

It is not surprising that integration of such vast numbers raises many challenges in the country. One of these challenges is language acquisition. Probably the main measure for solving this issue is the Swedish for Immigrants (SFI) courses that are organized and financed by the state and are free of charge for learners of Swedish. However, the demand for resources is huge, and problems such as the lack of qualified personnel arise often in public debates. Therefore, during recent years the topic of second language acquisition has become more popular in the Swedish NLP community as well, where we can see an increase of relevant research. In this thesis, we focus on one of the NLP research areas that can be beneficial in second language acquisition - Complex Word Identification (CWI).

CWI is usually described as a subtask of Automatic Text Simplification (ATS), which solves the problem of automatically rewriting a text in order to make it simpler. ATS is a rapidly growing field that still faces many challenges, such as the need for better evaluation metrics, text simplification systems built for particular target groups, and the lack of training data, especially in languages other than English. In this thesis, we focus on Swedish CWI by exploring existing resources and introducing a new method for collecting the data that can be applied to other languages as well. In addition to that, we build a Swedish CWI system and analyze the characteristics of lexical complexity. We then apply the findings to English in order to analyze the distribution of simple and complex words in a semantic web. By making a new ATS resource for Swedish and building a CWI system that targets second language learners specifically, we tackle two of the named challenges of ATS - the lack of resources and the prevalence of general ATS systems - as well as contribute to the solution of the challenges that Swedish society faces today.

First, we propose a new method for collecting training data for CWI tasks. Existing methods usually depend on human evaluators and are therefore expensive, which is one of the reasons for the lack of good-quality training data in ATS. The proposed method is based on second language teaching materials, which are usually freely available and cheaper compared to human evaluators. We then use the collected dataset to train several complex word classifiers. In this step, we focus on feature engineering and try to achieve the best results by modifying the extracted feature set instead of the algorithms. Additionally, we use these results to analyze the characteristics of lexical complexity in Swedish. Finally, we apply the findings about lexical complexity to English and describe what patterns simple and complex English words reveal in WordNet (Miller, 1995). The main goals and the outline of the thesis are described in more detail in the following sections.

1.1 Purpose

The purpose of the thesis is threefold:

1. Collecting and evaluating a new dataset for Swedish CWI using a new method, in this way making a contribution to ATS.

2. Building a CWI system for Swedish in order to see whether this task can be done automatically.

3. Analyzing the features of simple and complex words in Swedish and English in order to better understand the characteristics of lexical complexity.

1.2 Outline

The thesis consists of three main parts: data collection, CWI system building and analysis of lexical complexity.

In chapter 2, we review the most important background for the study. We start with the theoretical background, where the concept of complexity in general as well as in linguistics and ATS is presented, followed by a summary of previous work done in the field. In chapter 3, we focus on the first goal of the thesis. We describe two training datasets for CWI, one existing and one built for the purpose of this study, and evaluate them. In chapter 4, we present all features that were used to train the classifiers. We describe the sources of these features as well as the chosen normalization methods. In addition to that, we use the extracted features to analyze the collected dataset in more detail. In chapter 5, we focus on the second and the third goals of the thesis. We train and evaluate several complex word classification systems in order to find the optimal feature set for Swedish CWI. Additionally, we use the findings to analyze the lexical complexity of Swedish words and the distribution of simple and complex English words in a semantic web. In chapter 6, we discuss the results regarding the collected dataset, the Swedish complex word classifiers and lexical complexity. We conclude our work in chapter 7.


2 Background

Identifying complex words in a text is often described as a subtask of Lexical Simplification (LS), where the goal is to replace complex words and phrases in a text with simpler alternatives (Shardlow, 2013a). In this chapter, we will (1) give a short theoretical overview of the concept of complexity, (2) introduce Complex Word Identification in the context of Automatic Text Simplification and (3) give a broader overview of the previous work done in the field.

2.1 Complexity

2.1.1 The Concept of Complexity

The concept of complexity is not well defined. Some researchers even argue that complexity is entirely subjective and depends on the observer and their perception of reality (Edmonds, 1999, p. 50). However, in general it is characterized as the number and variety of elements and the elaborateness of their interrelational structure (Simon, 1996; Rescher, 1998).

In the book Complexity: A Philosophical Overview, Rescher further describes complexity by pointing out three things:

• A system's complexity is a matter of the quantity and variety of its constituent elements and of the interrelational elaborateness of their organizational and operational makeup.

• Ontological complexity accordingly has three main aspects: the compositional, the structural, and the functional.

• As an item's complexity increases, so do the cognitive requisites for its adequate comprehension <...> our best practical index of an item's complexity is the effort that has to be expended in coming to cognitive terms with it in matters of description and explanation. And this means that complexity can in principle make itself felt in any domain whatsoever. (Rescher, 1998, p. 1)

In other words, the main characteristic of complexity is the cognitive effort that is needed in order to understand, describe and explain things. That means that complexity is present in any domain or field and can describe those fields in terms of their structure, function or other aspects.

In this study we focus on complexity in the linguistic domain and refer to the same definition of the concept, where simplicity relates to less and complexity to more cognitive effort needed in order to understand, describe or explain words.


2.1.2 Linguistic Complexity

In linguistics, more structural units, rules and representations usually mean greater complexity (Hawkins, 2009, p. 10). In their dissertation, Sinnemäki et al. (2011) show that linguistic complexity is often related to difficulty and rarity. Sinnemäki argues that there is a clear distinction between complexity and difficulty and that these two terms should not be used synonymously. He explains that difficulty relates to relative complexity, something that is subjective to the observer, while absolute complexity is a matter of the number of parts and interrelations in a system and therefore can be defined in more general terms. In other words, while difficulty is an entirely subjective matter, complexity can be defined in a more objective way. Sinnemäki, however, describes two possible links between complexity and rarity. First, according to Harris (2008), rarity is a result of the coincidence of common historical processes that have a low probability of happening. Second, according to Dahl (2004), rare patterns require more evolutionary steps to develop. This means that historical linguistic processes that have a low probability of happening or have many evolutionary steps are more complex. To illustrate this, Sinnemäki uses the development of complexity in grammar as an example, because grammatical complexity is the product of a non-trivial historical process, which takes time to develop.

However, there is no unified definition or measurement of linguistic complexity. This leads to a crucial distinction between local complexity, which describes parts of an entity, and global complexity, where the goal is to describe the overall complexity of said entity (Sinnemäki et al., 2011, p. 17).

If one looked at a language as such an entity, describing its complexity would refer to its global complexity and would be a more difficult task than focusing on the local complexity of one aspect of a language. In this study, we explore one such aspect - lexical complexity.

2.1.3 Lexical Complexity

Lexical complexity relates to the complexity of words and the vocabulary of a language. Alghizzi (2017) describes it by saying that lexical richness, sophistication, proficiency and competence are all synonyms of lexical complexity and that these terms are often used interchangeably. Wolfe-Quintero et al. (1998) combined these components and described lexical complexity as a multidimensional feature of language use encompassing lexical density, sophistication and variation. Here, lexical density refers to the proportion of content words (e.g. nouns, verbs, adjectives), lexical sophistication is the usage of rare, more complex vocabulary and longer words with more elaborate syllable structures (Vera et al., 2016), and variation describes the number of unique words in a vocabulary. However, there are no agreed upon definitions of lexical complexity and there are various ways of defining it (Alghizzi, 2017).
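Two of these components are directly computable from a tokenized text. A minimal sketch, with our own function names and an assumed coarse part-of-speech tag set, of lexical density (the proportion of content words) and lexical variation (approximated here by the type-token ratio):

```python
def lexical_density(tagged_tokens):
    # Proportion of content words; the coarse universal-style
    # tag set used here is an illustrative assumption.
    content_tags = {"NOUN", "VERB", "ADJ", "ADV"}
    content = sum(1 for _, tag in tagged_tokens if tag in content_tags)
    return content / len(tagged_tokens) if tagged_tokens else 0.0

def lexical_variation(tokens):
    # Type-token ratio: unique words over total words.
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```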

Cutler (1983) illustrates lexical complexity by giving an example of two English words: wombat and bark. Cutler explains that in the mental lexicon of a speaker, wombat is related to the specific meaning of a "small Australian mammal" and therefore is fairly simple. Bark, however, is more ambiguous and could be a noun or a verb, and thus is more lexically complex than wombat. In addition to semantic complexity, words can also be complex from other perspectives, for example syntactically or morphologically.

Rayner and Duffy (1986) mention such characteristics of lexical complexity as low word frequency, word meaning representations and lexical ambiguity. For example, kill is explained as "cause to die", die means "to die" and convince is "cause to believe". This means that kill is more lexically complex than die and equally complex as convince. Cutler (1983) argues that such lexical information is in fact stored in the form of these definitions, and that therefore a more complex definition or multiple definitions of one word result in a higher complexity of a lexical unit. Rayner and Duffy (1986) test these observations by tracking eye fixation times on words, with the premise that if a person spends more time fixating on a word while reading, they need more cognitive effort to understand that word. The results of the experiments showed that longer, ambiguous, low-frequency words require longer fixation times.

There are, however, more ways of describing lexical complexity. For example, from the second language acquisition point of view, lexical complexity is often related to the fluency of a speaker, since the vocabulary of more fluent speakers tends to have more lexical variety and complexity (Housen and Kuiken, 2009).

All in all, it is difficult to describe lexical complexity with a single definition. Therefore, in this work, we take a more intuitive approach and rely on human intuition about lexical complexity to evaluate a dataset of simple and complex words.

2.2 Automatic Text Simplification

Siddharthan describes Text Simplification (TS) as "the process of reducing linguistic complexity of a text, while still retaining the original information content and meaning" (Siddharthan, 2014, p. 2). The goal of such a task is to improve the readability and understandability of a text. Usually a text has to be simplified on two levels, lexical and syntactic, where simplification operations include rewording, reordering, insertion and deletion (Saggion and Hirst, 2017). For example, the following sentence was simplified on the lexical and syntactic levels by applying rewording to support and fell asleep and splitting one long sentence into three shorter ones:

Original: Because he had to work at night to support his family, Paco often fell asleep in class.

Simplified: Paco had to make money for his family. Paco worked at night. He often went to sleep in class. (Siddharthan, 2014)

The ways in which a text should be changed are defined as simplification rules. However, what is perceived as a complex text, and what its simplification rules are, can differ depending on the target reader group, for example second language learners or people with language disabilities, such as aphasia or dyslexia.

Automatic Text Simplification (ATS), in its turn, is the task of simplifying text data automatically. ATS formed in the nineties, and since then researchers have approached the issue from different angles and applied varying methods. Now ATS is used in different tasks, such as text summarization, machine translation and paraphrase generation (Shardlow, 2014). However, most of the work focuses on reducing the complexity of a text and increasing its readability for target readers.

Some of the approaches to ATS include machine translation methods and semantic, syntactic or lexical simplification (Shardlow, 2013a). The latter is the task of replacing complex words or phrases in a text with simpler alternatives. However, the first step in Lexical Simplification (LS) is CWI, that is, identifying which words should be replaced in the first place (Paetzold and Specia, 2016). Failing at this step could lead to inadequate simplifications where, for example, too many substitutions change the original meaning of the text or too few changes do not improve the readability of the output text (Shardlow, 2013a). In this thesis, we focus on this first step of ATS - Complex Word Identification.

2.3 Previous Work

The CWI task organized by SemEval in 2016 is the main inspiration for this thesis (Paetzold and Specia, 2016). However, CWI is not a novel problem in this field, since it is a necessary step for one of the four previously mentioned simplification operations - rewording.

2.3.1 CWI Systems

Paetzold and Specia (2013) showed that not including CWI in an ATS pipeline negatively affects the output of such systems. In the LS approach of Horn et al. (2014), only the words that can be substituted are deemed complex. However, the results showed that such a system fails to identify two thirds of the complex words.

Shardlow (2013b) was the first to focus specifically on CWI. The author points out that even though CWI is very important for lexical simplification, it is far from perfect. For instance, CWI experiments usually were not evaluated, since prior to Shardlow's work there were no evaluation techniques or datasets available. And even though Shardlow (2013b) and Paetzold and Specia (2013) show that good CWI approaches can increase the quality of LS systems, some of the work suggests that simplifying all words that can be simplified is the best approach (Shardlow, 2013a).

The CWI task organized by SemEval in 2016 was the first initiative focusing specifically on CWI. The main goals of the task were to provide new resources and insights for CWI and to establish the state-of-the-art performance in English CWI, in this way bringing more visibility to ATS (Paetzold and Specia, 2016). For this task, a new CWI dataset was collected and annotated.

In addition to 11 baseline systems, 42 CWI systems were submitted. The systems were ranked based on the G-1 score (the harmonic mean between accuracy and recall) on binary classification of complex and non-complex words. The best performing baseline system (number 16 in the overall ranking) used a threshold approach that was trained on the words' language model probabilities obtained from Wikipedia, one of the most popular data sources for English ATS. The winners of the task, group SV000gg, combined different lexicon, threshold and machine learning voter systems into two systems, one of which uses Hard Voting and the other Soft Voting. In total, the system uses 69 morphological, lexical, collocational and semantic features and reaches a G-1 score of 0.774.
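As a minimal sketch of this ranking metric, assuming gold and predicted labels are encoded as 1 (complex) and 0 (simple), the G-1 score can be computed as follows (the function name is ours):

```python
def g1_score(gold, pred):
    # G-1: harmonic mean of accuracy and recall, as used to
    # rank systems in the SemEval 2016 CWI task.
    correct = sum(g == p for g, p in zip(gold, pred))
    accuracy = correct / len(gold)
    true_pos = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    positives = sum(gold)
    recall = true_pos / positives if positives else 0.0
    denom = accuracy + recall
    return 2 * accuracy * recall / denom if denom else 0.0
```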

However, further analysis of the collected dataset showed that most of the submitted systems performed poorly on it and that the main reason for this was the way the data was annotated (Zampieri et al., 2017). The data was annotated by 400 non-native English speakers, where annotators had to choose the words that they did not understand. A word was annotated as complex if at least one of twenty evaluators did not understand it. The experiments showed that this annotation method can be improved by instead labeling as complex only the words that are annotated as complex by most annotators.
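The two labeling rules differ only in the annotator threshold. A sketch, assuming each word comes with its twenty binary judgments (1 meaning the annotator did not understand the word); the threshold parameter is our generalization of the two rules:

```python
def label_complex(judgments, threshold=1):
    # threshold=1 reproduces the original SemEval 2016 rule
    # (complex if at least one of twenty annotators flagged it);
    # a majority threshold such as 11 corresponds to the
    # improvement suggested by Zampieri et al. (2017).
    return int(sum(judgments) >= threshold)
```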

Yimam et al. (2017) propose a new cross-lingual CWI method which shows that it is possible to build CWI systems with language-independent feature sets. Additionally, new CWI datasets were collected for English, Spanish and German. The data collection method was once again based on human annotators. However, this time the annotators were both native and non-native speakers.

To the best of our knowledge, there are no studies that focus specifically on Swedish CWI. However, the topic arises in some Swedish ATS studies. In her thesis, Decker (2003) defines multiple text simplification rules. Here, simplification rules are divided into two groups. One of these groups concerns lexical simplifications, which are defined as substitutions of one lexeme by another. However, the author does not go into detail about lexical simplifications and mentions that semantic webs such as WordNet (Miller, 1995) would be needed to substitute words and phrases with simpler alternatives. Rybing et al. (2010) formalized these rules and used them to build CogFLUX, a Swedish ATS system. Keskisärkkä (2012) investigated whether synonym replacements improve the overall readability of a text. Here, the best alternatives for replacement were chosen based on word length, frequency and the level of synonymy between the two words, where shorter and more frequent words were considered to be simpler. For synonym substitution, the author used SynLex (Kann, 2004), a list consisting of 80,000 synonym pairs and their level of synonymy. Synonym replacement with simpler alternatives was also the main topic of the work of Abrahamsson et al. (2014). In this study, the authors tried to improve the readability of Swedish medical texts by replacing difficult terms with simpler synonyms. The complexity of a word was defined by two factors: the frequency of the word in a general corpus and the frequency of its substrings. Finally, Rennes and Jönsson (2015) developed the CogFLUX tool further by including more simplification operations. In their work, the authors again used the SynLex list together with a frequency list.

2.3.2 Features of Simple and Complex Words

Since one of the goals of this study is to better understand which linguistic features correlate with word complexity, it is necessary to take a closer look at previous findings regarding the most informative features for CWI.

Elhadad (2006) analyzed how to improve access to medical literature for health consumers, focusing on terminology that can be difficult to understand for a person without a medical background. The chosen method for identifying terms that are not understandable was based on finding out how common a given term is in texts familiar to a reader. In other words, here linguistic complexity is directly related to the understandability and familiarity of a word. The main method of measuring such complexity in the study was word frequencies in specific texts. In fact, other studies show that, for example in English, knowing only the 5,000 most frequent words provides almost 96% coverage of spoken discourse, which strongly supports the familiarity hypothesis (Adolphs and Schmitt, 2003). Biran et al. (2011), however, define word complexity as the product of corpus complexity and lexical complexity, where the former is the word's frequency in a corpus and the latter the length of the word. Similarly, Shardlow (2013b) works with data based on word frequencies in different sources but additionally includes tf-idf values in Simple English Wikipedia articles.

When the training data was collected for the SemEval task of 2016, more features were taken into consideration, such as language model probabilities, senses, synonyms, hypernyms, hyponyms and others. For the training, participants also used frequencies, word embeddings, various semantic, syntactic, lexical, psycholinguistic, morphological and collocational features, part-of-speech tags, named entity tags and others. Generally, the systems used a combination of multiple features. Some of them, however, were based on a simpler approach. For example, the Pomona system classified all words with Wikipedia frequencies lower than 147 as complex. The eleven baselines usually focused on just one feature, such as language model probabilities in different sources, word length or the number of word senses, or classified words as simple or complex depending on whether they appear in certain sources. Based on the results of the winning team that used 69 various features, however, word frequency in a good-quality corpus remained the most informative feature.
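A frequency-threshold baseline of this kind amounts to a one-line classifier. A sketch in the spirit of the Pomona system, where the frequency table is an assumed input:

```python
def frequency_baseline(word, frequencies, threshold=147):
    # Words below the frequency threshold are deemed complex;
    # unseen words count as frequency 0 and are thus complex.
    return "complex" if frequencies.get(word, 0) < threshold else "simple"
```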


3 Dataset for Swedish Complex Word Identification

In this chapter, we focus on the first goal of the thesis - collecting and evaluating training data for Swedish CWI. Since such training data has to consist of Swedish words with their corresponding complexity levels, we begin by defining the chosen levels of simple and complex words. We then collect two possible datasets for training. One of the datasets is based on an existing frequency list for Swedish with already annotated complexity levels. The other dataset is collected manually using second language learning materials for Swedish as the main source. Finally, we evaluate both datasets and choose our gold standard dataset for training based on the evaluation results.

3.1 Complexity Levels

We have chosen to split the data into four complexity levels, where levels 1 and 2 correspond to simple words and levels 3 and 4 to complex words. In this way, the collected data can be used to train not only binary but also multi-class classifiers of simple and complex words.

The difficulty levels of simple words were chosen based on the Common European Framework of Reference for Languages (CEFR). As stated on the website of the Council of Europe, CEFR is the outcome of more than twenty years of research and was created to provide a transparent, coherent and comprehensive basis for the elaboration of language syllabuses and curriculum guidelines, the design of teaching and learning materials, and the assessment of foreign language proficiency [1]. In other words, CEFR provides a global scale of reference levels of language proficiency. It consists of 3 categories that are further divided into 2 subcategories each [2], as shown below:

[1] https://www.coe.int/en/web/common-european-framework-reference-languages/
[2] https://www.coe.int/en/web/common-european-framework-reference-languages/table-1-cefr-3.3-common-reference-levels-global-scale

Basic User      Independent User      Proficient User
A1, A2          B1, B2                C1, C2

Table 3.1: CEFR Global Scale.

The guidelines further explain that a basic user is someone who is able to use frequent, routine expressions and basic phrases, as well as interact in a simple way. In addition to that, independent users are able to interact in a more spontaneous way, reason and talk about a wider range of topics. The concept of complexity first appears in the description of a proficient user, who should be able to understand and produce complex language in a clear and fluent way. In other words, levels A and B correspond to simple language, while level C describes more complex language.

In this study, we follow a similar approach and choose 4 difficulty levels, where the first two levels represent simple words and the last two complex words. Levels 1, 2 and 3 are closely related to the CEFR language levels. In addition to that, we add a fourth level representing the highest degree of complexity - words that are complex even for native speakers of Swedish. The main reason for adding this fourth level is that there is no evidence that even words of the most difficult CEFR level would also be perceived as complex by native speakers. Since the aim of the study is to analyze the features of complex words, we have decided to form this fourth group of words and in this way ensure that our data contains a variety of words ranging from most simple to most complex:

• 1: corresponds to proficiency of a basic user.

• 2: corresponds to proficiency of an independent user.

• 3: corresponds to proficiency of a proficient user.

• 4: corresponds to proficiency of a native speaker and higher.
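Under this scheme, the binary labels needed for binary classification follow directly from the four levels. A minimal sketch of the mapping (the function name is ours):

```python
def to_binary_label(level):
    # Levels 1 and 2 are simple; levels 3 and 4 are complex.
    return "simple" if level in (1, 2) else "complex"
```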

3.2 Collection of the Datasets

Two possible training datasets were considered and evaluated. In one case, we use an already existing lexical list for Swedish that contains CEFR levels. In addition to that, we build a new dataset using another method of word collection and complexity level assignment. In this section, we present all used sources and the two resulting datasets, describe the procedure for evaluating them, and choose the best dataset for the training of the CWI classifiers.

3.2.1 Sources

Kelly Swedish List

The Kelly Swedish List [3] (Kelly) is a list of 8,425 Swedish lemmas (Volodina and Kokkinakis, 2012). It is a frequency-based list that contains additional information about the headwords, including their corresponding CEFR levels. Kelly was generated from the web-based SweWAC corpus of 114 million words from the 2010s. The 8,425 entries in the list cover 80% of the original corpus (Kokkinakis and Volodina, 2011). Some of the corresponding CEFR levels were annotated manually, but most of them are frequency-based. These annotations were made by ordering the collected words from most to least frequent and then splitting the list into six equal parts, where the first set of most frequent words was assigned CEFR level A1, the second A2, and so on. The Kelly list is described as a reliable resource for creating and evaluating CEFR-based learning material in Swedish and can be used by language learners and teachers.

[3] https://spraakbanken.gu.se/eng/kelly


The main disadvantage of the source is that the Kelly CEFR levels are assigned uniformly according to frequency range (Volodina and Kokkinakis, 2012). On the one hand, multiple previous studies showed that frequency is the main feature used to describe word complexity (Shardlow, 2013a; Paetzold and Specia, 2016). On the other hand, such a distribution only suggests, but does not ensure, that all headwords were assigned the correct level, especially since the lemmas were assigned CEFR levels in equal shares, approximately 1,404 headwords per level (Volodina and Kokkinakis, 2012). Such a distribution could mean that too many or too few lemmas were assigned specific levels.
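A sketch of the frequency-band assignment described above, assuming the lemmas are already sorted from most to least frequent (the function name is ours):

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def assign_kelly_levels(lemmas_by_frequency):
    # Split the frequency-ordered list into six equal bands:
    # the most frequent sixth is assigned A1, the next A2, etc.
    band = max(1, len(lemmas_by_frequency) // len(CEFR_LEVELS))  # ~1,404 for Kelly
    return {lemma: CEFR_LEVELS[min(i // band, 5)]
            for i, lemma in enumerate(lemmas_by_frequency)}
```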

Rivstart

Rivstart is one of the textbooks most commonly used by second language learners of Swedish [4]. Like most such materials, it has several books, each of which corresponds to different CEFR levels. For this study, we have used the freely available dictionaries for the Rivstart A1+A2 and Rivstart B1+B2 textbooks [5]. Since these textbooks were prepared by professionals following the CEFR guidelines, we make the assumption that the material used to write the books, including the vocabulary, corresponds to the CEFR requirements for basic user and independent user proficiency and therefore provides a reasonable gold standard of simple words.

[4] https://www.nok.se/rivstart
[5] https://www.nok.se/Laromedel/-Laromedelswebb-/-B23-/-Lararwebb-/Rivstart/-Flikar-/A1A2/Textbok/Ordlista/

There are some advantages and disadvantages to such an approach. To begin with, it is prescriptive in the sense that such dictionaries only suggest which words should be simple to the learners; there is no proof that second language learners would agree. On the other hand, this choice increases consistency, the lack of which has been said to be one of the disadvantages of manually annotated data. In addition, it solves other issues of manually annotated data, such as dividing people into groups based on their proficiency or dealing with words that were assigned different complexity levels by several annotators. For the purpose of this study, we make the assumption that all words that appear in these dictionaries are learned by most second language learners of Swedish. To the best of our knowledge, such an approach to collecting CWI training and testing data has not been applied before.

8 sidor

8 sidor is an online newspaper that provides "readable news" [6]. It is explained that in addition to having shorter sentences and bigger letters, readable also means the absence of complex and unusual words. The target readers of such texts are meant to be children, people with language disabilities or immigrants who want to learn Swedish [7].

[6] http://8sidor.se/om-8-sidor/
[7] http://8sidor.se/8-sidors-historia/

Språkbanken [8], a source of various Swedish language resources, contains a corpus of the newspaper's articles, as well as a freely available scrambled version of the corpus including words and their lemmas [9]. This data was used to extract word lemmas as an additional source of simple words.

[8] https://spraakbanken.gu.se/swe/om
[9] https://spraakbanken.gu.se/swe/resurs/attasidor

LäSBarT

LäSBarT (Mühlenbock, 2009) is a corpus of easy-to-read Swedish texts from children's books. As with the 8 sidor data, we extracted simple word lemmas from a freely available file with annotated words from the corpus [10].

[10] https://spraakbanken.gu.se/resource/lasbart

Aligned Corpus

This corpus consists of aligned Swedish simple and complex sentences (Rennes and Jönsson, 2016). The data in the corpus covers texts from the websites of Swedish authorities and was created to complement the previously mentioned LäSBarT corpus. Sentences from the dataset were used as additional sources of simple and standard Swedish.

Stockholm Umeå Corpus

The Stockholm-Umeå Corpus (SUC) consists of Swedish texts from the 1990s and covers one million words in total. The corpus is balanced, meaning that it contains various text types and stylistic levels. Just as in the previous cases, we use the scrambled version of the corpus [11] to extract word lemmas.

[11] https://spraakbanken.gu.se/eng/resource/suc2

Ordtestet

Ordtestet [12] is a website that provides exercises for the vocabulary part of the Swedish Scholastic Assessment Test (SweSAT). It is a vocabulary understanding task that tests students' knowledge of complex words. It is important to note that this test is aimed at native Swedish speakers, and the words chosen for this part are therefore meant to be complex even for natives, not only for second language learners of Swedish. Ordtestet, like several other websites, provides exercises for testing one's knowledge of such words in order to prepare for SweSAT. The complexity level 4 words were collected from these exercises.

[12] https://ord.relaynode.info/

More information about the used sources is presented in Table 3.3.

3.2.2 Gold Standard Datasets

The described sources were used to build two possible datasets for training. The first dataset is based on the Swedish Kelly List, and the second one was built manually by extracting words from the rest of the sources.


Kelly Gold Standard Dataset

The gold standard dataset was made by collecting Kelly list entries with CEFR levels from A1 to C2. As mentioned before, instead of partitioning the collected data into six complexity levels, we have chosen to divide it into three levels - 1 (A1 and A2), 2 (B1 and B2) and 3 (C1 and C2). In addition to that, we have used some practice words retrieved from Ordtestet to form the fourth group of the most complex words. The resulting dataset includes 4,305 words in total (Table 3.2).

Level    1      2      3      4    Total
Words    1,380  1,219  1,174  532  4,305

Table 3.2: Kelly Gold Standard Dataset.

Manually Collected Dataset

All other described sources of simple, standard and complex Swedish were used to build another possible gold standard dataset (Table 3.3).

The Rivstart dictionaries were used as the main source of simple words (complexity levels 1 and 2). Level 1 words are those that occur in the A1+A2 Rivstart dictionary. Similarly, level 2 words are retrieved from the B1+B2 level textbook; in this case, we also remove the words that appear in the previous book. In order to increase the likelihood that the chosen words are indeed simple, we make sure that they also appear in all other sources of simple Swedish, that is, in the 8 sidor, LäSBarT and Aligned Simple corpora.

For words with complexity level 3, the basis was standard Swedish words that do not occur in any of the simple Swedish datasets, and level 4 contains words from Ordtestet, the same as in the Kelly Gold Standard Dataset.

The final dataset is described in more detail in Table 3.4. As can be seen, in order to be assigned complexity level 1, the words had to appear in all sources of simple words. Level 2 words also had to appear in the same sources, except in the Rivstart A1+A2 dictionary. When it comes to the complex words, they were extracted by removing all words that appear in any of the simple word datasets. For example, if a complex word candidate appears in the LäSBarT corpus but not in any other source of simple Swedish, it is still not considered to be complex. We also check whether these words appear in both datasets of extracted standard words; the main reason for this is to minimize the risk of including unnormalized words.

Source              #words    #tokens

Simple Swedish
Rivstart A1+A2       2,030     17,040
Rivstart B1+B2       1,705     17,656
8 sidor             16,695    188,546
LäSBarT             24,881    272,127
Aligned Simple       7,963     88,261

Standard Swedish
SUC                 72,920    916,401
Aligned Standard    87,230  1,249,270

Complex Swedish
Ordtestet              532      4,954

Table 3.3: Sources of simple, standard and complex Swedish that were used to collect the Gold Standard Dataset.


Complexity level:    1       2       3       4
# words              1,699   1,056   978     505
# tokens             13,948  10,195  12,086  4,954

Appears in:
Rivstart A1+A2       +       −       −       −
Rivstart B1+B2       +       +       −       −
8 sidor              +       +       −       −
LäSBarT              +       +       −       −
Aligned Easy         +       +       −       −
Aligned Standard                     +
SUC                                  +
Ordtestet                                    +

Table 3.4: Manually Collected Gold Standard Dataset.

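As a minimal sketch of the selection rules summarized in Table 3.4, assuming each source has been reduced to a set of lemmas (all names are ours):

```python
def build_gold_standard(rivstart_a, rivstart_b, attasidor, lasbart,
                        aligned_simple, aligned_standard, suc, ordtestet):
    # Level 1: words appearing in every source of simple Swedish.
    other_simple = attasidor & lasbart & aligned_simple
    level1 = rivstart_a & rivstart_b & other_simple
    # Level 2: in all simple sources except the A1+A2 dictionary.
    level2 = (rivstart_b & other_simple) - rivstart_a
    # Level 3: standard Swedish words (required to occur in both
    # standard corpora, to filter out unnormalized forms) that do
    # not appear in any source of simple Swedish.
    all_simple = rivstart_a | rivstart_b | attasidor | lasbart | aligned_simple
    level3 = (suc & aligned_standard) - all_simple
    # Level 4: the Ordtestet words, as in the Kelly dataset.
    return level1, level2, level3, ordtestet
```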

3.2.3 Evaluation of the Datasets

Both datasets were manually evaluated in order to see whether the second language learning material or the frequency-based method correlates better with human judgment.

One native speaker and one learner of Swedish were given a set of 50 samples from the two datasets. Each of these samples contained 4 lemmas - one lemma from each of the 4 complexity levels. For example, in the sample below, "de" (eng. they) is a level 1 word, "våning" (eng. floor) level 2, "insättning" (eng. deposit) level 3 and "obduktion" (eng. autopsy) level 4.

• obduktion, insättning, de, våning

The evaluators were aware of this distribution and were asked to order the words from most simple to most complex based on their opinion, in this way assigning complexity levels to 200 words. The answers were evaluated in two ways:

• Exact Match: An assigned complexity level is correct if it matches the complexity level of the word in the gold standard dataset.

• Fuzzy Match: An assigned complexity level is correct if it matches the complexity level of the word in the gold standard dataset or if the difference between the two levels is not bigger than 1.

For example, if the human evaluator ordered the previous example in the following way:


• våning, de, insättning, obduktion

then based on the first evaluation method there is a match of 50%, since two out of four words were assigned the same complexity levels as in the gold standard dataset. Based on the second method, however, the score would be 100%, since the first two words are just one position away from their correct rank. The second method was included because the evaluators mentioned that one of the difficulties of completing this task was that it was often hard to choose between two adjacent complexity levels, and in these cases they picked the order of the two words more or less randomly.
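A sketch of the two scoring rules over the gold and assigned levels of the evaluated words (function names are ours):

```python
def exact_match(gold_levels, assigned_levels):
    # Share of words assigned exactly the gold complexity level.
    hits = sum(g == a for g, a in zip(gold_levels, assigned_levels))
    return hits / len(gold_levels)

def fuzzy_match(gold_levels, assigned_levels):
    # Also accepts levels that are off by at most one.
    hits = sum(abs(g - a) <= 1 for g, a in zip(gold_levels, assigned_levels))
    return hits / len(gold_levels)
```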

The evaluation scores are presented in Table 3.5, where EM stands for the Exact Match and FM for the Fuzzy Match evaluation method. First of all, there is a strong correlation between the evaluation results of the native speaker and the learner of Swedish. Also, the manually collected dataset received much higher scores under both evaluation methods. This allows us to conclude that human judgment correlates much more strongly with the collected dataset than with the Kelly dataset. It suggests either that the frequency-based method is not suitable for assigning lexical complexity levels or, more likely, that the shifting points between these levels should be chosen more carefully instead of being assigned uniformly based on the range of word frequencies.

                  Kelly            Collected
                  EM      FM       EM      FM
Learner           0.555   0.925    0.745   0.99
Native Speaker    0.51    0.91     0.755   0.985

Table 3.5: Dataset Evaluation Results.

All in all, the evaluation results show that the manually collected dataset is the more suitable dataset for training and testing in CWI tasks because it correlates better with human judgment. Therefore, this dataset will be used as the Gold Standard dataset in the following experiments. Additionally, the results validate the textbook-based method for collecting such data. Since the lack of data is one of the challenges that ATS faces today, especially in languages other than English, this method could allow us to collect training data from the available learning materials in multiple languages instead of using human evaluators. Our dataset is freely available at https://github.com/gresmol/CWIdataset.


4 Features for Swedish Complex Word Identification

The second goal of the thesis is to build a CWI system. In the previous chapter, we described how we collected the dataset for the training, which consists of Swedish words and their annotated complexity levels. However, in addition to the classes that are to be predicted by the classifier, the training data has to contain more information. In our case, additional linguistic information, or features, about the collected words is needed in order to train classifiers that can predict the output class - word complexity levels. As described in the background chapter, previous studies included a variety of linguistic features for CWI system training, such as morphological, syntactic, psycholinguistic and others. In this chapter, we present all features that we selected for the study and describe how they were prepared for the training. First, we present the sources and tools that were used to preprocess the data and retrieve the features. Then, we describe all features and show how they were converted to a format supported by machine learning algorithms.

4.1 Sources and Tools

In addition to the words themselves, some of the selected features were based on linguistic information that required additional sources and tools to extract. Below, we present these sources and the main tool used.

Bring

The Bring wordlist [1] is based on the well-known Roget's Thesaurus (Roget, 1982). Roget describes his work as a collection of the words ... arranged, not in alphabetical order as they are in a dictionary, but according to the ideas which they express (p. 3). Although semantic webs such as Princeton's WordNet (Miller, 1995) remain the most widely used sources of this kind within NLP, outside of it Roget's collection is the best known lexical-semantic resource for English (Borin et al., 2015) and is widely used in fields such as Social Sciences, Psychology or Cognitive Science (Mohammad, 2015).

[1] https://spraakbanken.gu.se/eng/resource/bring

Roget's Thesaurus is known for its taxonomic structure, since the words are listed based on their conceptual classes. In total, there are 6 classes that are partitioned into 39 sections. These sections are further divided into 1000 categories, each of which contains examples from the English vocabulary (Table 4.1).


Bring is a Swedish adaptation of Roget's Thesaurus. The collection was first published in the book Svenskt Ordförråd ordnat i begreppsklasser (Bring, 1930) and now exists as a digital version with 148,815 entries in total (Borin et al., 2015).

Classes                      #Sections   #Categories
Abstract Relations           8           179
Space                        4           136
Matter                       3           134
Intellectual Faculties       10          150
Voluntary Powers             9           220
Sentient and Moral Powers    5           181
Total                        39          1000

Table 4.1: Roget's Conceptual Classes.

Saldo

Saldo [2] is a freely available electronic lexicon resource for modern Swedish. The lexicon consists of three main parts: a semantic lexicon, which is also described as a semantic-lexical network, a morphological lexicon, and a computational morphological description (Borin et al., 2013).

[2] https://spraakbanken.gu.se/eng/resource/saldo

Saldo is often compared to the previously mentioned WordNet, since it also represents semantic relations between words, but at the same time it is structured according to different principles (Borin and Forsberg, 2009). Namely, the words are organized based on two primitive semantic relations; the first of these relations is obligatory, and the second one may be omitted. The authors describe the first relation as a mother (or main descriptor), which must be a semantically closely related word which is more "central", often meaning "semantically and/or morphologically less complex, probably more frequent, stylistically more unmarked and acquired earlier in first and second language acquisition" (Borin and Forsberg, 2009, p. 7). The second relation, a father (determinative descriptor), is optional and is used to differentiate lexemes with the same mother. For example, the word "Bulgakov" in Saldo is described by the mother "författare" (eng. author) and the father "rysk" (eng. Russian).
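A sketch of how such an entry could be represented, with the mother relation obligatory and the father optional; this record layout is our assumption, not Saldo's actual file format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SaldoEntry:
    lemma: str
    mother: str                   # main descriptor, obligatory
    father: Optional[str] = None  # determinative descriptor, optional

# The example discussed above:
bulgakov = SaldoEntry(lemma="Bulgakov", mother="författare", father="rysk")
```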

The Swedish Culturomics Gigaword Corpus

The Swedish Culturomics Gigaword Corpus [3] (Giga) is a dataset of contemporary Swedish (Eide et al., 2016). The corpus contains sentences from 1950 to 2015 and covers multiple genres, such as fiction, government, news, science and social media, totaling 59,736,642 unique sentences. The dataset is split into 7 parts based on the decade (Table 4.2).

[3] https://spraakbanken.gu.se/eng/resource/gigaword

In this study, a part of this corpus (1970-79, 1990-99, 2000-09 and 2010-15) was used to retrieve sentences containing the selected simple and complex words in order to see in what context they appear. The parts were selected in such a way that the collected sentences would span all genres and would mainly include the most recent texts.

Decade     Fiction   Government   News   Science   Social media
1950-59    −         +            −      −         −
1960-69    −         +            +      −         −
1970-79    +         +            +      −         −
1980-89    +         +            +      −         −
1990-99    +         +            +      +         +
2000-09    −         +            +      +         +
2010-15    −         −            +      +         +

Table 4.2: The Swedish Culturomics Gigaword Corpus.

The extracted part of the corpus consists of 19,616,876 sentences. Each word from the Gold Standard dataset was mapped to at least one and at most 5,000 such sentences.

Swedish MedEval Test Collection

The Swedish MedEval test collection contains different types of articles within the medical domain (Friberg Heppin, 2011). The collected texts are targeted at medical professionals as well as patients. The dataset was used to calculate inverse document frequencies (IDF) to be used as a feature in the training and testing data.
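A minimal sketch of the IDF computation over such a collection, treating each article as a set of terms; the thesis does not specify the exact smoothing used, so the +1 in the denominator is our assumption:

```python
import math

def inverse_document_frequency(term, documents):
    # documents: one set of terms per MedEval article.
    # Rare terms (low document frequency) receive high IDF.
    doc_freq = sum(term in doc for doc in documents)
    return math.log(len(documents) / (1 + doc_freq))
```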

Frequency Lists

As mentioned before, frequency remains the most informative feature when it comes to CWI. Therefore, multiple word frequency features were retrieved from scrambled versions of various corpora in Språkbanken, as seen in Table 4.3. All these sources are freely available at https://spraakbanken.gu.se/eng/resources/corpus.

Corpus         Genre         Sources             Time span
Blog mix       social media  Swedish blogs       2010-17
Webb news      media         Swedish newspapers  2009-13
SUC v2 and v3  various       various             1990's
SUC Novels     fiction       Swedish novels      1990's
Twitter mix    social media  Swedish Twitter     2006-17

Table 4.3: Scrambled corpora that were used to retrieve word frequencies.

Efselab

In order to retrieve additional linguistic information, such as part-of-speech tags and syntactic relations, the data was annotated using the Efselab compiler of sequence labeling tools (Östling, 2013). It supports various languages, including Swedish, and can be used as, for example, a parser or a Named Entity Recognition tagger. The accuracy for Swedish reaches 96.3%.

In this study, Efselab was used to parse the sentences and extract syntactic features. For dependency parsing, the compiler uses MaltParser, a language-independent system for data-driven dependency parsing (Nivre et al., 2007).

4.2 Extracted Features

The main strategy for feature selection was to include as many different linguistic features as possible and only then remove the unnecessary features.

All selected features can be divided into 5 larger categories: frequency, morpho-syntactic, contextual, syntactic and conceptual features. In this section, we present each of the selected feature groups, as well as describe how they were extracted and normalized.

4.2.1 Morpho-Syntactic Features

Morphological features represent information about words, such as their form and structure. In addition to 3 morphological features, a part-of-speech tag feature is also included in this group (Table 4.4). In other words, this group contains features whose values do not change depending on the context of a word.

Feature         Description                    Sources, Tools  Extraction, Normalization
1. LenChar      number of characters           lemma           raw
2. NumSyl       number of syllables            lemma           approximated as the number of vowels
3. VowConRatio  ratio of vowels to consonants  lemma           raw
4. PoS          part-of-speech tag             Efselab         each PoS tag mapped to a unique numerical value

Table 4.4: Morpho-syntactic Features.
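As a small illustration of how these features could be derived, the sketch below computes the three morphological features for a lemma, with syllables approximated by the vowel count as in Table 4.4; the Swedish vowel inventory used here is our own assumption.

    # A minimal sketch of the morphological features in Table 4.4; the vowel
    # set for Swedish is an assumption of this illustration.
    VOWELS = set("aeiouyåäö")

    def morphological_features(lemma):
        lemma = lemma.lower()
        n_vowels = sum(ch in VOWELS for ch in lemma)
        n_consonants = sum(ch.isalpha() and ch not in VOWELS for ch in lemma)
        return {
            "LenChar": len(lemma),                           # number of characters
            "NumSyl": n_vowels,                              # syllables ~ vowels
            "VowConRatio": n_vowels / max(n_consonants, 1),  # guard division by zero
        }

    print(morphological_features("författare"))
    # -> {'LenChar': 10, 'NumSyl': 4, 'VowConRatio': ~0.67}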

4.2.2 Contextual Features

Since it is known that the context of a word can give valuable information in a number of linguistic tasks, we select several contextual features as well. These features describe the context of a word. It is important to point out, however, that some of them, such as the mean distance from the word to the root of the sentence or the mean number of punctuation marks, represent syntactic information too. The chosen contextual window is one sentence, and each feature was extracted by averaging information over up to 5,000 sentences from the Giga corpus containing the lemma or any of its forms (Table 4.5).
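Under the assumption that each feature is an average over the retrieved context sentences, the parse-independent features could be sketched as follows (MRD, MPC and RootVar additionally need the Efselab parses and are omitted here):

    # A sketch of the parse-independent contextual features in Table 4.5,
    # averaged over the context sentences retrieved from Giga; `sentences`
    # is a list of tokenized sentences (lists of token strings).
    from statistics import mean, pvariance

    def contextual_features(sentences):
        word_lens = [len(tok) for sent in sentences for tok in sent]
        sent_lens = [len(sent) for sent in sentences]
        return {
            "MWL": mean(word_lens) / 10,            # mean word length, raw/10
            "MSL": mean(sent_lens) / 10,            # mean sentence length, raw/10
            "WordVar": pvariance(word_lens),        # word length variance, raw
            "SentVar": pvariance(sent_lens) / 100,  # sentence length variance, raw/100
        }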


Feature     Description                              Sources, Tools  Extraction, Normalization
1. MWL      mean word length in a sentence           Giga            raw/10
2. MSL      mean sentence length                     Giga            raw/10
3. MRD      mean distance from the word to the root  Giga, Efselab   raw
4. MPC      mean number of punctuation marks         Giga, Efselab   raw
5. WordVar  mean word length variance                Giga            raw
6. SentVar  mean sentence length variance            Giga            raw/100
7. RootVar  mean distance to the root variance       Giga            raw/10

Table 4.5: Contextual Features.

The values were left unnormalized or were divided by 10 or 100. The features were normalized differently in order to keep all values in the range 0-10.

4.2.3 Syntactic Features

The selected syntactic features are the universal syntactic relations presented in the Universal Dependencies (UD) project, as described in De Marneffe et al. (2014). In this thesis, we use version 2 of UD. The list contains 37 relations that are also grouped into structural categories of the dependent (nominals, clauses, modifiers and function words) and categories that are not dependency relations in the narrow sense (coordination, MWE, loose, special, other). In the experiments, we use these categories instead of the dependency labels.

In addition, instead of using a binary representation of syntactic relations, we use frequency counts: the value of each feature is the normalized number of times a given word appears in that syntactic category in the corpus. Therefore, the values of the extracted syntactic features are closely related to word frequencies.
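A sketch of this extraction under our reading of the procedure: the dependency labels observed for a word in the parsed Giga sentences are mapped to their UD structural category, and the per-category counts are log-normalized. The category map is abbreviated here; the full grouping follows Table 4.6.

    # Abbreviated label-to-category map; unlisted labels fall back to "other"
    # in this sketch, which only approximates the grouping in Table 4.6.
    import math
    from collections import Counter

    CATEGORY = {
        "nsubj": "nominals", "obj": "nominals", "nmod": "nominals",
        "ccomp": "clauses", "advcl": "clauses",
        "advmod": "modifiers", "amod": "modifiers",
        "conj": "coordination", "cc": "coordination",
    }

    def syntactic_features(dep_labels):
        """dep_labels: all dependency relations observed for one word in the corpus."""
        counts = Counter(CATEGORY.get(label, "other") for label in dep_labels)
        return {cat: math.log10(n) for cat, n in counts.items()}  # log base 10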

The syntactic relation category features are listed in Table 4.6. The description of each category contains the syntactic relation feature titles. The titles are mapped to their full syntactic relation labels in Appendix B.

4.2.4 Conceptual Features

The final group of features represents semantic or conceptual information about the words. Roget's Conceptual Classes and the Saldo lexicon were used as the main sources. In other words, most of these features (Table 4.7) represent word senses.

4.2.5 Frequency Features

Since frequencies are so far known to be the most informative features when it comes to lexical complexity, we have selected multiple word frequency features from different sources and genres (Table 4.8).


Feature          Description
1. nominals      frequency of the word appearing as nsubj, obj, iobj, obl, vocative, expl, dislocated, nmod, appos or nummod
2. clauses       frequency of the word appearing as csubj, ccomp, xcomp, advcl or acl
3. modifiers     frequency of the word appearing as advmod, discourse or amod
4. funcWords     frequency of the word appearing as aux, cop, mark, det, clf or case
5. coordination  frequency of the word appearing as conj or cc
6. NWE           frequency of the word appearing as fixed, flat or compound
7. loose         frequency of the word appearing as list or parataxis
8. special       frequency of the word appearing as orphan, goeswith or reparandum
9. other         frequency of the word appearing as punct, root or dep

All nine features were extracted from Giga with Efselab and normalized with log base 10.

Table 4.6: Syntactic Features.

Each frequency count was normalized by computing its logarithm to base 10, which roughly means that the extracted values show how many times more frequent one word is than another. For example, if the frequency of word X is 1 and the frequency of word Y is 2, Y appears in the given corpus 10 times more often than X. Other considered normalization methods include the Zipf scale (Van Heuven et al., 2014) and per-million word counts. We ran several preliminary classification experiments to choose the best normalization method; however, only logarithmic frequencies yielded better results than raw counts.
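A minimal sketch of this normalization (the zero fallback for unseen words is our assumption):

    import math

    def log_frequency(count):
        # Raw corpus counts are replaced by their base-10 logarithms, so a
        # difference of 1 between values means a tenfold frequency difference.
        return math.log10(count) if count > 0 else 0.0  # unseen words get 0

    assert log_frequency(1000) - log_frequency(100) == 1.0  # 10 times more frequent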


Feature            Description                                          Sources, Tools     Extraction, Normalization
1. NumSaldoComp    number of primitive semantic relations               Saldo              raw
2. NumSaldoSenses  number of senses                                     Saldo              raw
3. NumBringTags    frequency of a word in Bring's thesaurus             Bring's Thesaurus  raw
4. Class1          frequency of a word in class 1 of Bring's thesaurus  Bring's Thesaurus  raw
5. Class2          frequency of a word in class 2 of Bring's thesaurus  Bring's Thesaurus  raw
6. Class3          frequency of a word in class 3 of Bring's thesaurus  Bring's Thesaurus  raw
7. Class4          frequency of a word in class 4 of Bring's thesaurus  Bring's Thesaurus  raw
8. Class5          frequency of a word in class 5 of Bring's thesaurus  Bring's Thesaurus  raw
9. Class6          frequency of a word in class 6 of Bring's thesaurus  Bring's Thesaurus  raw

Table 4.7: Conceptual Features.

Feature          Description                           Sources, Tools   Extraction, Normalization
1. FreqStandard  frequency of a word in a corpus       SUC2             log base 10
2. FreqGiga      frequency of a word in a corpus       Giga             log base 10
3. FreqTwitter   frequency of a word in a corpus       Twitter          log base 10
4. FreqSUC       frequency of a word in a corpus       SUC2 and SUC3    log base 10
5. FreqSUCLit    frequency of a word in a corpus       SUC Novels       log base 10
6. FreqBlog      frequency of a word in a corpus       Swedish blogs    log base 10
7. FreqRiks      frequency of a word in a corpus       Riksdag's texts  log base 10
8. FreqWeb       frequency of a word in a corpus       Web              log base 10
9. FreqSubs      frequency of a word in a corpus       Movie subtitles  log base 10
10. IDF          inverse document frequency of a word  Medical texts    log base 10

Table 4.8: Frequency Features.


4.3 Feature Analysis of Simple and Complex Words

We have analyzed the gold standard dataset in more detail to see how the extracted features correlate with word simplicity and complexity. We present the findings in Figures 4.1-4.5, where green columns show results for simple words and blue columns for complex words. The numbers represent normalized feature values.

As can be seen in Figure 4.1, based on our training data, complex words usually have fewer senses than simple words.

Figure 4.1: Word Complexity Correlation with Conceptual Features.

Figure 4.2: Word Complexity Correlation with Morphological Features.

Complex words also tend to be longer (Figure 4.2) and to appear in longer sentences, where the distance from the target word to the root is bigger (Figure 4.3). In addition, the word length, sentence length and distance-to-root variances are smaller for complex words, which indicates that the contexts of complex words tend to differ less from each other than the contexts of simple words.

When it comes to the syntactic features (Figure 4.4), the most obvious pattern is again that simple words appear in the various syntactic groups more often than complex words.


Figure 4.3: Word Complexity Correlation with Contextual Features.

Figure 4.4: Word Complexity Correlation with Syntactic Features.

Finally, and not surprisingly, the analysis of word complexity correlation with one of the extracted frequency features (Figure 4.5) revealed that the simpler the word, the more frequent it is.

Figure 4.5: Word Complexity Correlation with FreqBlog Feature.


5 Experiments

In this chapter, we focus on the second and third goals of the thesis: building a CWI system and analyzing lexical complexity. The gold standard dataset described in Chapter 3 and the features presented in Chapter 4 are used as training and testing data to carry out a number of experiments in order to (I) select the best performing classifier and the optimal feature set for Swedish CWI and (II) better understand the linguistic properties of lexical complexity. We start with an overview of the experimental setup, followed by a short presentation of the selected evaluation metrics. Finally, we continue with a detailed description of the experiments and their results.

5.1 Experimental Setup

The main goals and expected outcomes of the experiments are as follows:

• Model Selection:
  a) to choose the highest scoring classifier given our data.
  b) to select the optimal feature combination for Swedish CWI.

• Lexical Complexity and Cognition:
  a) to analyze the linguistic properties of Swedish lexical complexity.
  b) to apply these findings to English and analyze the distribution of simple and complex words in a semantic web.

5.1.1 Model Selection

We use the Python module scikit-learn (Pedregosa et al., 2011) to train, tune and test six Machine Learning algorithms: Support Vector Machines (SVM), Naive Bayes (NB), Random Forest (RF), Gradient Boosting (GB), Logistic Regression (LG) and Stochastic Gradient Descent (SGD). The algorithms are described in more detail in Appendix A. Since our goal is to find out which features are most informative when it comes to CWI, we do not modify the algorithms themselves.

The collected dataset is used to train and test the systems. Every system is tested in terms of binary and multi-classification, where binary classifiers predict whether a word is simple (complexity level 1 or 2) or complex (complexity level 3 or 4) and multi-classifiers predict one of the four complexity levels. The data is randomly shuffled and split into three parts: 80% training set, 10% validation set and 10% test set.
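A sketch of this split with scikit-learn, using placeholder data in place of our feature matrix and labels:

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    X = rng.random((1000, 39))         # placeholder for the 39 feature values per word
    y = rng.integers(1, 5, size=1000)  # placeholder complexity levels 1-4

    # 80% train, then the remaining 20% halved into validation and test.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, shuffle=True, random_state=42)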

Hyperparameters are chosen using the sklearn.model_selection.GridSearchCV object, which evaluates all possible combinations of parameters over the validation set and retains the best one. The selected hyperparameters are presented below (values not mentioned are set to their defaults as per the scikit-learn documentation).

• SGD: alpha = 0.001, l1_ratio = 0.1, loss = squared hinge, penalty = l1

• LG: C = 1.0, penalty = l2

• SVM: C = 10, gamma = 1e

• GB: max_features = None, n_estimators = 100, learning_rate = 0.2, min_samples_leaf = 3

• RF: max_features = None, n_estimators = 250, min_samples_split = 3, min_samples_leaf = 2

We use these parameters for all conducted experiments described in the following sections.
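A sketch of the search for the Random Forest case is shown below. Note that GridSearchCV performs internal cross-validation rather than scoring on one fixed validation split, so this is an approximation of the procedure described above; the candidate grid is abbreviated.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {  # abbreviated candidate values
        "max_features": [None, "sqrt"],
        "n_estimators": [100, 250],
        "min_samples_split": [2, 3],
        "min_samples_leaf": [1, 2],
    }
    search = GridSearchCV(RandomForestClassifier(), param_grid, scoring="accuracy")
    search.fit(X_val, y_val)  # X_val, y_val from the split sketched earlier
    print(search.best_params_)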

Once the best classifier is selected and tuned, we use it to conduct a number of experiments in order to find the optimal feature combination for Swedish CWI. This is done by evaluating the systems over the test set again; therefore, these scores can be slightly overestimated. Our starting point is the full model trained on all 39 features. We then proceed by training the classifier with various combinations of features, with the goal of finding a feature set that contains as few features as possible while maintaining the classification score.

5.1.2 Lexical Complexity and Cognition

In addition to selecting the features for Swedish CWI, we analyze the properties of lexical complexity. First, we analyze the linguistic properties of simple and complex Swedish words. Since lexical complexity is defined by the cognitive effort needed to understand and describe words, it is also interesting to see whether simple and complex words reveal any patterns in a large semantic web. However, such resources are not available for Swedish. Therefore, we apply the findings about Swedish lexical complexity to English and analyze the distribution of simple and complex English words in WordNet (Miller, 1995), the most popular semantic web for English.

5.2 Evaluation

All systems are evaluated by applying the same approach as in the CWI task organized by SemEval (Paetzold and Specia, 2016), where five metrics are used: precision (P), accuracy (A), recall (R), F-1 score (F-1) and G-1 score (G-1). All five metrics are calculated in terms of four values:

• True Negative (TN): sample was negative and predicted negative

• True Positive (TP): sample was positive and predicted positive

• False Negative (FN): sample was positive but predicted negative

• False Positive (FP): sample was negative but predicted positive


In our case, positive and negative refer to simple and complex words. For example, if the predicted label of a complex word is complex, this is a case of a True Negative. All evaluation metrics are defined in terms of these four values and their relation to each other in Table 5.1.

Accuracy = (TP + TN) / total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-1 = 2 · (P · R) / (P + R)
G-1 = 2 · (A · R) / (A + R)

Table 5.1: Evaluation Metrics.

We rank our systems based on the G-1 score, the harmonic mean between accuracy and recall.
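Expressed in code, the metrics of Table 5.1 amount to the following (the function name is ours):

    def cwi_metrics(tp, tn, fp, fn):
        # Direct transcription of the formulas in Table 5.1.
        total = tp + tn + fp + fn
        accuracy = (tp + tn) / total
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        g1 = 2 * accuracy * recall / (accuracy + recall)  # ranking criterion
        return {"A": accuracy, "P": precision, "R": recall, "F-1": f1, "G-1": g1}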

5.3 Experiments and Results

We start our experiments by training six classifiers in order to find out which model fits our data best. We then continue by training the best performing classifier using different sets of features in order to find the optimal feature combination for CWI.

The main goal is to find a feature set that contains as few features as possible while maintaining or improving the original classification score that is achieved when the classifier is trained and tested on data with all extracted features.

5.3.1 Selection of the Classifier

In this experiment, we train six Machine Learning algorithms: Support Vector Machines (SVM), Naive Bayes (NB), Random Forest (RF), Gradient Boosting (GB), Logistic Regression (LG) and Stochastic Gradient Descent (SGD). The classifiers are then evaluated on the same test set, consisting of 10% of the dataset. Each system is tested in terms of binary and multi-classification as previously defined. Binary classification results are presented in Table 5.2 and multi-classification results in Table 5.3. Since the results of the Gradient Boosting and Random Forest classifiers can vary slightly between runs, the presented results for these classifiers show the average over 300 runs.

       Accuracy  Precision  Recall  F-1    G-1
NB     0.805     0.788      0.814   0.794  0.809
SGD    0.873     0.86       0.855   0.858  0.864
LG     0.894     0.881      0.885   0.883  0.889
SVM    0.901     0.889      0.892   0.89   0.896
GB     0.922     0.914      0.914   0.914  0.918
RF     0.925     0.918      0.914   0.916  0.92

Table 5.2: Binary classification results using different classifiers.

Rankings of the classifiers are the same for both binary and multi-classification. Random Forest is the best performing classifier of simple and complex words, very closely followed by the Gradient Boosting classifier.

       Accuracy  Precision  Recall  F-1    G-1
NB     0.628     0.639      0.64    0.632  0.634
SGD    0.685     0.732      0.698   0.706  0.691
LG     0.725     0.743      0.743   0.721  0.734
SVM    0.732     0.759      0.755   0.749  0.744
GB     0.745     0.767      0.777   0.769  0.761
RF     0.747     0.77       0.777   0.769  0.762

Table 5.3: Multi-classification results using different classifiers.

These are promising results, since the winners of a similar SemEval task for English achieved a G-1 score of 0.774 for binary classification (Paetzold and Specia, 2016). However, some of the features the winning team used for training were based on information directly related to word complexity (such as a binary feature showing whether a word is present in a source of simple English), while we do not include any such information about the data. In addition, the winning team used more features (69 compared to our 39) and trained a more complex classifier. All this suggests that our classification results should be much lower, and even though SemEval's task and ours are not entirely comparable, such a difference between the results indicates either that Swedish CWI is a less complex task than English CWI or, more likely, that the main reason lies in the training data. On the one hand, it could mean that our textbook-based method of collecting training data is a better choice for CWI than data annotated by human evaluators. On the other hand, it could also mean that the differences between our selected simple and complex words are more obvious than in SemEval's dataset.

5.3.2 Selection of the Optimal Feature Set

Our next step is to train the Random Forest classifier again using fewer features and different feature blocks. The main goal of these experiments is to select the optimal feature set, with as few features as possible, while maintaining the previously achieved results, and to better understand which linguistic characteristics have the highest correlation with lexical complexity.

This step consists of two parts. First, we carry out a number of experiments to find out which features are most important. We focus on (I) different feature blocks and (II) feature informativeness in order to define the best strategy for (III) feature selection. Then, we continue with the main part of the experiments, where we follow the defined strategy to select the optimal feature set. In all these experiments, the classifiers are tested on the test set.

I. Classifier Training on Different Feature Blocks

In this step, the classifier is trained using only one feature block at a time: conceptual, morpho-syntactic, contextual, syntactic or frequency. The main goals of this experiment are as follows:

• To find out which of the feature blocks (if any) correlate with lexical complexity.


• If any of the feature blocks correlate with lexical complexity, to find out which specific features from these blocks have the highest correlation with lexical complexity.

The results of binary and multi-classification are presented in Tables 5.4 and 5.5, respectively.

                  Accuracy  Precision  Recall  F-1    G-1
Conceptual        0.666     0.608      0.569   0.562  0.617
Morpho-syntactic  0.772     0.762      0.707   0.721  0.739
Contextual        0.793     0.77       0.768   0.769  0.78
Syntactic         0.847     0.83       0.829   0.83   0.838
Frequency         0.933     0.928      0.921   0.923  0.927

Table 5.4: Binary classification results using different feature blocks.

                  Accuracy  Precision  Recall  F-1    G-1
Conceptual        0.409     0.328      0.336   0.322  0.373
Morpho-syntactic  0.496     0.404      0.401   0.367  0.449
Contextual        0.52      0.462      0.462   0.448  0.491
Syntactic         0.574     0.531      0.515   0.514  0.545
Frequency         0.743     0.769      0.779   0.773  0.761

Table 5.5: Multi-classification results using different feature blocks.

The results of binary and multi-classification once more reflect the same patterns.

First of all, it is important to note that the classifier trained on each of the five feature blocks achieves higher results than a random guess, which would be 50% for binary and 25% for multi-classification. This indicates that the conceptual, morphological, contextual and syntactic characteristics, as well as the frequency of the words, have at least some correlation with lexical complexity.

Secondly, it is clear that some linguistic characteristics have a stronger correlation with lexical complexity than others. The results suggest that this correlation is very small when it comes to the conceptual features.

Not surprisingly, the highest scores were achieved when the classifier was trained on the frequency feature block, with the results of the syntactic features coming in second. The most likely reason for this is that our syntactic feature vector is not binary and is closely related to frequency.

What is more surprising, however, is that the results achieved using the frequency block alone are as high as the ones achieved using all feature blocks. To see whether the frequency features alone are the reason for the high classification results, we additionally trained the classifier on the conceptual, morpho-syntactic, contextual and syntactic features together. In this case, the classifier scored 0.843 in binary and 0.602 in multi-classification, which is only a slight increase in binary classification but has a bigger impact on multi-classification. This suggests that additional linguistic information is more useful in the multi-classification of complex words than in binary classification.


All in all, the results support the findings of the previous studies and indicate that frequency features are sufficient for maintaining the originally achieved classification scores. This also indicates that the problem with the Kelly dataset is not the frequency-based method that was used to collect the data but the fact that each CEFR proficiency level was assigned to an equal number of words in the dataset.

II. Feature Informativeness

A benefit of ensemble methods such as Random Forest is that they allow us to compute feature importance (also called relevance or informativeness), which is a valuable tool in feature selection (Kira and Rendell, 1992). Feature or attribute importance returns a score for every feature that indicates how useful and informative the feature is for the construction of the decision trees. The more informative a feature is, the higher its importance score.

In scikit-learn, this can be achieved with the feature_importances_ attribute, which ranks the features based on their informativeness or relevance. Feature relevance is measured by keeping track of how often a feature is used in the split points of a decision tree. The more often a feature is used to make these decisions, the higher its relative importance. By averaging the feature importance over all trees, every feature gets a relative rank that can be used for feature selection.
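A minimal sketch of reading off the ranking, assuming a feature matrix X_train, labels y_train and a feature_names list with the 39 feature titles in training-column order:

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=250).fit(X_train, y_train)
    ranking = sorted(zip(forest.feature_importances_, feature_names), reverse=True)
    for importance, name in ranking[:5]:  # the five most informative features
        print(f"{name}: {importance:.3f}")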

We use this attribute as the main tool for the selection of the optimal feature set and analyze feature informativeness when the Random Forest classifier is trained on 1) all features and 2) separate feature blocks. The results again show scores over the test set and are presented in charts, where the different colors refer to the five feature groups as follows:

• Red: frequency features

• Purple: syntactic features

• Blue: contextual features

• Green: morpho-syntactic features

• Yellow: conceptual features

Feature Informativeness Results: All Features

First of all, we analyze feature informativeness when the Random Forest classifier is trained using all 39 features. Binary classification results are presented in Figure 5.1 and multi-classification results in Figure 5.2. Even though there are some differences, the overall pattern is clear. First, the most informative feature groups in descending order are: frequency, syntactic, contextual, morpho-syntactic and conceptual. This correlates with the classification results obtained when the classifier was trained on separate feature blocks. Secondly, features within one feature block can have considerably different importance ranks. For example, the relevance value of FreqBlog is much higher than that of IDF. This means that even though frequency features are more informative in general, not every frequency feature is more relevant than all other features. Finally, the most informative feature, FreqBlog, has a much higher relevance score in binary classification than in multi-classification. Additionally, the difference between it and the second most important feature, FreqStandard, is also significantly bigger in binary classification. One possible reason for this is the fact that FreqStandard represents frequencies in the SUC corpus, which was the main source for level 3 words. It is therefore expected that frequencies from this source are more helpful when level 3 words need to be distinguished from level 4 words, as in multi-classification.

Figure 5.1: Feature Informativeness: binary classification results.

Figure 5.2: Feature Informativeness: multi-classification results.

Feature Informativeness Results: Feature Blocks

We continue by evaluating feature informativeness within the different feature groups when the classifiers are trained on separate feature blocks. As can be seen in Figure 5.3, the conceptual features have almost the same ranks, with NumBringTags being the most informative feature, followed by Class5 and Class1. Surprisingly, Bring's conceptual classes seem to be a better word sense source for Swedish CWI than the more popular and well-known Saldo lexicon.

Figure 5.3: Conceptual Feature Informativeness (panels: binary classification, multi-classification).

The differences in morpho-syntactic feature informativeness (Figure 5.4) are a bit clearer. It seems that the length of the word is most relevant to word complexity. However, in binary classification this information is better reflected by the number of syllables, while in multi-classification it is better reflected by the number of characters.

When it comes to contextual feature informativeness (Figure 5.6), sentence length variance (SentVar), mean sentence length (MSL) and mean word length (MWL) are the most relevant features, while the mean number of punctuation marks (MPC) is the least important.

Among the syntactic relation features (Figure 5.7), coordination and nominals are the most informative features, while modifiers, loose, funcWords and special are the least informative.

Finally, frequency feature informativeness (Figure 5.8) also reveals similar results for the binary and multi-classification tasks. In this case, frequencies from Swedish blogs and Twitter, as well as from the SUC corpus, are more important than frequencies retrieved from movie subtitles and medical data.

Figure 5.4: Morpho-syntactic Feature Informativeness (panels: binary classification, multi-classification).

Figure 5.6: Contextual Feature Informativeness (panels: binary classification, multi-classification).

Figure 5.7: Syntactic Feature Informativeness (panels: binary classification, multi-classification).

Figure 5.8: Frequency Feature Informativeness (panels: binary classification, multi-classification).

Conclusion of Feature Informativeness Results

All in all, several important observations can be made:

• Feature rankings are mostly the same in binary and multi-classification.

• Features related to word frequencies are most informative.

• In most cases, there is a significant difference between the most and the least informative features.

Based on these findings, it is easier to define the strategy for feature selection:

• Selected features can be the same for binary and multi-classification tasks.

• Classification scores can be maintained by using only the frequency feature group.

• Potentially, the results could be improved by removing the least informative features.

In the next section we test these options and report the results.

III. Selection of the Optimal Feature Set

We follow the defined strategy to find the optimal feature set and remove the least informative features while maintaining (or improving) the classification scores. Since feature informativeness is similar in binary and multi-classification, we use the multi-classification feature informativeness as the main reference.

First, we train and test the classifier by gradually removing the least informative features from (I) the whole feature set and (II) separate feature blocks. Then, we combine the results and use them to (III) remove the least informative features in each feature block from the whole feature set. The results are reported in Tables 5.7, 5.8, 5.9, 5.10, 5.11 and 5.12, where #RemFeat is the number of least informative features removed and the G-1 scores show the average over 10 runs for binary and multi-classification.
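The removal loop itself can be sketched as follows, where order is the list of feature column indices sorted from least to most informative (as produced by the ranking above); the helper names are ours.

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    def g1_score(y_true, y_pred):
        # Assumed helper: G-1 as in Section 5.2, the harmonic mean of accuracy
        # and recall (macro-averaged recall to cover the multi-class case).
        a = accuracy_score(y_true, y_pred)
        r = recall_score(y_true, y_pred, average="macro")
        return 2 * a * r / (a + r)

    def removal_curve(model, X_train, y_train, X_test, y_test, order, step=2):
        # Drop the least informative features `step` at a time, retrain, rescore.
        scores = []
        for n_removed in range(0, len(order), step):
            keep = np.array(order[n_removed:])  # indices of the retained features
            model.fit(X_train[:, keep], y_train)
            predictions = model.predict(X_test[:, keep])
            scores.append((n_removed, g1_score(y_test, predictions)))
        return scores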

Removal of all least informative features

In this experiment, we remove the two least informative features at a time. The informativeness of all features, from least to most important, is reported in Table 5.6.

Overall, feature removal results in lower scores (Table 5.7). However, the results also show that binary classification is less sensitive to feature removal than multi-classification. This implies that the smaller the difference between the complexity of words, the more linguistic information is needed to classify these words. Finally, it is necessary to point out that none of the systems managed to outperform the previously described classifier trained on 9 frequency features.


1. PoS             11. Class1        21. FreqSubs    31. WordVar
2. Class3          12. NumBringTags  22. clauses     32. MPC
3. NumSaldoComp    13. special       23. NWE         33. MSL
4. Class2          14. IDF           24. FreqSUCLit  34. FreqTwitter
5. Class4          15. funcWords     25. MRD         35. FreqRiks
6. NumSyl          16. VowConRatio   26. RootVar     36. nominals
7. NumSaldoSenses  17. loose         27. FreqSUC     37. coordination
8. Class5          18. modifiers     28. MWL         38. FreqStandard
9. Class6          19. other         29. FreqWeb     39. FreqBlog
10. LenChar        20. FreqGiga      30. SentVar

Table 5.6: Feature Informativeness: multi-classification results.

#RemFeat  G-1 (binary)  G-1 (multi)    #RemFeat  G-1 (binary)  G-1 (multi)
0         0.922         0.762          20        0.921         0.753
2         0.921         0.762          22        0.919         0.749
4         0.917         0.758          24        0.919         0.747
6         0.919         0.758          26        0.91          0.738
8         0.922         0.756          28        0.91          0.746
10        0.92          0.753          30        0.904         0.747
12        0.922         0.761          32        0.914         0.749
14        0.923         0.759          34        0.914         0.744
16        0.921         0.758          36        0.896         0.708
18        0.923         0.755          38        0.863         0.496

Table 5.7: Feature Removal Results I.

Removal of least informative features in different feature blocks

We train the classifier using the separate feature groups again. This time, we additionally remove the least informative features (one feature at a time). The main goal of the experiment is to find the smallest feature set in each feature block that does not significantly affect the scores achieved when all features in a block are used.

Conceptual

We start with the conceptual feature block. Conceptual feature informativeness, from least to most important, is as follows:

1. Class3

2. NumSaldoSenses

3. Class6

4. NumSaldoComp

5. Class4

6. Class2

7. Class1

8. Class5

9. NumBringTags

The overall score dropped by 0.3 percent in binary and by 0.01 in multi-classification when only the most informative feature, NumBringTags, was used in training (Table 5.8). Therefore, it is possible to conclude that it is sufficient to use only this feature in future experiments and discard the other extracted conceptual features.

#RemFeat  G-1 (binary)  G-1 (multi)
0         0.609         0.363
1         0.599         0.332
2         0.593         0.358
3         0.584         0.345
4         0.429         0.351
5         0.537         0.336
6         0.548         0.337
7         0.568         0.356
8         0.569         0.362

Table 5.8: Conceptual Feature Removal Results.

Morpho-syntactic

Morpho-syntactic feature informativeness, from least to most important, is as follows:

1. NumSyl

2. VowConRatio

3. PoS

4. LenChar

#RemFeat  G-1 (binary)  G-1 (multi)
0         0.736         0.438
1         0.736         0.439
2         0.716         0.423
3         0.714         0.445

Table 5.9: Morpho-syntactic Feature Removal Results.

The results are even clearer for the morpho-syntactic feature block (Table 5.9), since for multi-classification, including only the most informative feature, LenChar, results in an increase of 0.07 percent in the overall score. This does not hold for binary classification, however, since in this case the score drops by 2.2 percent.


Contextual

Contextual feature informativeness, from least to most important, is as follows:

1. MPC

2. RootVar

3. MRD

4. WordVar

5. MSL

6. SentVar

7. MWL

#RemFeat  G-1 (binary)  G-1 (multi)
0         0.768         0.488
1         0.772         0.496
2         0.764         0.477
3         0.759         0.478
4         0.75          0.486
5         0.716         0.44
6         0.659         0.43

Table 5.10: Contextual Feature Removal Results.

In this case, MWL, SentVar and MSL, the three most informative features, could be picked as the optimal feature set in this block, since the score drops only by 0.02 percent in multi-classification when only these features are used (Table 5.10).

Syntactic

Syntactic feature informativeness, from least to most important, is as follows:

1. special

2. funcWords

3. modifiers

4. loose

5. NWE

6. other

7. clauses

8. nominals

9. coordination

A gradual decrease in the scores can be observed when the least informative syntactic features are removed (Table 5.11). However, the removal of the first two features (special and funcWords) did not affect the scores significantly, and the removal of the second most informative feature, nominals, improved the performance. Therefore, we tested the classifier again by removing these three features, which resulted in a G-1 score of 0.812 for binary and 0.504 for multi-classification. Hence, we conclude that the optimal syntactic feature set is modifiers, loose, NWE, other, clauses and coordination.


#RemFeat  G-1 (binary)  G-1 (multi)
0         0.814         0.507
1         0.807         0.508
2         0.796         0.503
3         0.796         0.492
4         0.796         0.488
5         0.772         0.473
6         0.758         0.43
7         0.737         0.403
8         0.761         0.406

Table 5.11: Syntactic Feature Removal Results.

Frequency

Frequency feature informativeness, from least to most important, is as follows:

1. IDF

2. FreqSubs

3. FreqSUC

4. FreqSUCLit

5. FreqGiga

6. FreqWeb

7. FreqRiks

8. FreqTwitter

9. FreqStandard

10. FreqBlog

#RemFeat  G-1 (binary)  G-1 (multi)
0         0.923         0.759
1         0.931         0.772
2         0.921         0.761
3         0.909         0.762
4         0.911         0.741
5         0.906         0.734
6         0.901         0.733
7         0.888         0.706
8         0.883         0.687
9         0.865         0.491

Table 5.12: Frequency Feature Removal Results.

Some interesting observations can be made about the results (Table 5.12). First, the removal of the least informative feature, IDF, improved the performance of the system and resulted in the highest classification scores so far: 0.931 for binary and 0.772 for multi-classification. After that, however, the removal of more features did not improve the scores, just as in the previous experiments with feature blocks. Another interesting observation concerns the removal of the feature FreqStandard (#8 in the table), which significantly lowers the overall score for multi-classification but only slightly affects binary classification. This supports our previously mentioned hypothesis that FreqStandard helps distinguish words with complexity level 3 from words with complexity level 4, since the SUC corpus is the main source for level 3 words.

All in all, even though the results once more confirm the findings of previous studies and show that word frequencies are sufficient for CWI, the most important conclusion is that multiple frequency sources are needed in order to build an optimal CWI system, especially when it comes to multi-classification.

Removal of least informative features in each feature block from the whole feature set

The last experiment we carry out is based on the findings of the two previous experiments regarding the overall feature informativeness and the feature informativeness within separate feature groups. We use the whole feature set to train and test the classifier again, but instead of removing the least informative features overall, we remove the least informative features from each block first. After that, since we know that the frequency feature block alone is sufficient to maintain the scores, we continue by additionally removing the remaining features in each of the blocks, starting with the worst performing blocks.

First, we summarize the optimal feature sets for each feature block (ordered from worst to best performing feature block):

1. Conceptual: NumBringTags.

2. Morpho-syntactic: LenChar.

3. Contextual: MWL, SentVar, and MSL.

4. Syntactic: modifiers, loose, NWE, other, clauses and coordination.

5. Frequency: FreqSubs, FreqSUC, FreqSUCLit, FreqGiga, FreqWeb, FreqRiks, FreqTwitter, FreqStandard and FreqBlog.

Based on this information, we gradually remove features, starting with the least informative features in each feature block and ending with the removal of whole blocks (Table 5.13). The numbers in the RemFeatID column correspond to the feature block IDs from the list above.

The experiment resulted in the best classification scores yet, 0.933 for binary and 0.775 for multi-classification, with the following optimal feature set:

• Syntactic: modifiers, loose, NWE, other, clauses, coordination

• Frequency: FreqSubs, FreqSUC, FreqSUCLit, FreqGiga, FreqWeb, FreqRiks, FreqTwitter, FreqStandard and FreqBlog

RemFeatID       G-1 (binary)  G-1 (multi)
Least informative features from:
1               0.922         0.761
1, 2            0.923         0.76
1, 2, 3         0.922         0.764
1, 2, 3, 4      0.928         0.765
1, 2, 3, 4, 5   0.93          0.764
All remaining features from:
1               0.931         0.762
1, 2            0.93          0.76
1, 2, 3         0.933         0.775
1, 2, 3, 4      0.931         0.771

Table 5.13: Feature Removal Results II.

Even though the selected features are from the frequency and syntactic feature blocks, the syntactic features have a strong correlation with frequencies as well, which can also be seen in Figure 5.9, where we show how all features correlate with the most informative frequency feature, FreqBlog. These results mostly confirm the findings of the previous studies, but at the same time they illustrate that multiple frequency sources, as well as their combination with other linguistic characteristics (for example syntactic ones), improve the performance of the classifiers compared to classifiers trained on only one frequency feature.

Finally, it is important to note that the feature selection was done by evaluating the system on the test set, which means that the final results can be slightly overestimated.

5.4 Simple and Complex Words in WordNet

The findings of the previous studies and the conducted experiments allow us to conclude that word frequencies are very good indicators of lexical complexity. In this last set of experiments, we apply this knowledge to English in order to analyze lexical complexity from a more cognitive perspective. The main goal of these experiments is to see whether simple and complex words reveal any patterns in WordNet, a large semantic web for English (Miller, 1995).

WordNet is a large lexical database for English. It is inspired by psycholinguistic theories about human semantic memory. These theories suggest that people store and organize knowledge in their minds in a hierarchical manner and that the time and cognitive effort needed to retrieve this information are directly related to the size of those hierarchical structures. Since the concept of complexity is also defined by the cognitive effort needed to understand things, in this section we analyze the distribution of simple and complex English words in WordNet.

In WordNet, lexical units are grouped into sets of cognitive synonyms, called synsets.


Figure 5.9: Feature Correlation with FreqBlog.

Synsets are related to each other by conceptual-semantic and lexical relations. In addition to synonymy, the most popular relations in the web are hypernymy and hyponymy. The first relation links words to more general synsets (e.g. bed to furniture), while the second links them to more specific words (e.g. bed to bunk bed). In this way, WordNet forms hierarchical structures, where the root node represents the most general concepts and the leaf nodes are specific instances (Miller et al., 1990).

For the purpose of this experiment, we assume that frequency is the best indicator of lexical complexity in English and that spoken language data is the best source for these frequencies. We have collected 12,041 English words and their frequencies from Worldlex, a frequency resource based on English Twitter, blog posts and newspapers (Gimenes and New, 2016). Then, we compared the hierarchical distribution of less and more frequent words in WordNet in terms of the two semantic relations, hyponymy and hypernymy. We additionally check the average number of synonyms, since synonymy is the core relation in WordNet.

As in the previous experiments, we normalize the frequencies by taking their logarithm to base 10. In this way, we build five frequency groups (0, 1, 2, 3 and greater than or equal to 4), where we assume that words belonging to group 0 are the most complex and words belonging to group 4 or greater the simplest. The results are presented in Figures 5.10, 5.11 and 5.12.
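A sketch of this analysis using NLTK's WordNet interface; word_counts, the mapping from each collected word to its Worldlex frequency, is assumed to have been loaded beforehand.

    import math
    from collections import defaultdict
    from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

    def wordnet_stats(word):
        synsets = wn.synsets(word)
        if not synsets:
            return None
        synonyms = {l.name() for s in synsets for l in s.lemmas()} - {word}
        hypernyms = sum(len(s.hypernyms()) for s in synsets)
        hyponyms = sum(len(s.hyponyms()) for s in synsets)
        return len(synonyms), hypernyms, hyponyms

    def frequency_group(count):
        # log base-10 bucket capped at 4; group 0 = most complex, 4 = simplest
        return min(int(math.log10(count)), 4) if count > 0 else 0

    groups = defaultdict(list)
    for word, count in word_counts.items():  # word_counts: word -> Worldlex count
        stats = wordnet_stats(word)
        if stats:
            groups[frequency_group(count)].append(stats)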

As we can see, all results reflect very clear patterns and indicate that the simpler the word, the more synonyms, more hyponyms and fewer hypernyms it has. In other words, the results suggest that simple words are easier to replace and that they are usually higher up in the described hierarchical structures. This means that simple words are more general (closer to the root node) and complex words more specific (closer to the leaf nodes). If we assume that the described psycholinguistic theories are correct, this could mean that people traverse such knowledge structures stored in the brain in a top-down manner, and that the "shorter" this traversal, the simpler the retrieved concept.

Figure 5.10: Correlation between mean number of synonyms and word frequencies in WordNet.

Figure 5.11: Correlation between mean number of hypernyms and word frequencies in WordNet.

Figure 5.12: Correlation between mean number of hyponyms and word frequencies in WordNet.


6 Discussion

In this chapter, we conclude and discuss the results and findings of our work. We focus on the three main outcomes, as outlined in the purpose of the thesis:

• Gold standard dataset

• Complex word classifier

• Lexical complexity

6.1 Gold Standard Dataset

Two datasets for Swedish CWI were collected and evaluated. The results imply that our proposed method for data collection, which is based on second language learning and teaching materials, has a high correlation with human judgment. Moreover, the performance of the classifier trained on our data compares favorably to similar classifiers for English that were trained on manually annotated data, which additionally supports our findings. All in all, the results indicate that in addition to being faster and cheaper, our method yields data that correlates well with human judgment.

6.2 Complex Word Classifier

We used our collected data to train and test six Machine Learning classifiers. The Random Forest classifier significantly outperformed the other classifiers, with the Gradient Boosting classifier being a close second. No modifications were made to the algorithms, but our system reached a binary classification G-1 score of 0.933 and a multi-classification score of 0.775 when trained on an optimal set of 15 selected features. The winners of a similar CWI task for English organized by SemEval in 2016 (Paetzold and Specia, 2016) managed to reach a G-1 score of 0.774 in binary CWI. In this case, however, more complex classifiers were used and some of the included features were directly related to word complexity, unlike in our work. Even though these tasks are not entirely comparable, this makes our results more significant, since achieving them with a simple classification system and a feature set that does not include any information directly related to lexical complexity again indicates that our training and test data might be of better quality. Another possible, though unlikely, explanation could be that Swedish CWI is a much simpler task than English CWI. However, we leave this question for future studies. The most important outcome of the experiments is the proof that it is possible to build CWI systems for Swedish that reach satisfactory classification results.

6.3 Lexical Complexity

Our work confirms the findings of the previous studies and shows that word frequency is the best indicator of how complex a lexical unit is. However, some interesting observations and findings were made. First of all, the sources of said frequencies are very important. Our results imply that corpora of spoken language, such as blogs or Twitter, are better suited for this task. The worst sources of frequencies turned out to be domain-specific texts such as laws or medical data. Secondly, the more frequency attributes are used, the higher the classification score.

However, an important observation to make here is that even though frequency has the highest correlation with lexical complexity, it does not explain what makes a word complex. In fact, our results show that contextual, morphological, syntactic and conceptual features also correlate with lexical complexity to some extent, and that it is most likely the combination of them all that defines how lexically complex a word is. For example, the analysis of our dataset showed that complex words tend to be longer, have fewer senses and appear in more complex contexts. Additionally, the distribution of English words in WordNet indicates that simple concepts are more general while complex words are more specific.

Frequency, on the other hand, describes only the usage of a language: the results imply that people tend to use simpler language in their daily lives. A question then arises whether we use a simple lexicon because it is simple, or whether it is simple because we use it often. The findings of our analysis of WordNet indicate that it could be the former: we possibly prefer to use the concepts that are easier to retrieve from the hierarchical structures in which our knowledge might be stored. However, the latter cannot be excluded either. As mentioned in the background chapter, some linguistic theories relate complexity to familiarity, where familiar and known things, including words, require much less cognitive effort to understand. One could then argue that frequency is such a good indicator of word complexity because it best represents how familiar the vocabulary is to us.

This information is sufficient to build CWI systems, but it raises other questions, such as whether lexical complexity is first and foremost related to the usage of the language or to specific linguistic characteristics. We leave these questions for future studies.


7 Conclusion

We conclude our work with a short summary and by outlining possible extensions and improvements.

7.1 Summary

Complex Word Identification (CWI), a sub-task of Automatic Text Simplification, is still a growing field, where most of the work has been done on English. In this study, we experimented with Swedish CWI.

The contribution of our work is threefold. First, it resulted in a new dataset containing words with four lexical complexity levels for Swedish. However, the most important outcome of our work with the data is the validation of a possible new method for data collection that can be used for languages other than Swedish. The results imply that in addition to being fast and cheap, a method for data collection based on second language learning material has a high correlation with human judgment. Secondly, our simple Random Forest classifier reached a 0.933 G-1 score in binary and a 0.775 score in multi-classification, which shows that Swedish CWI is a task that can be done automatically. The results also confirmed the previous findings that word frequencies have the highest correlation with lexical complexity, but at the same time they showed that the sources of these frequencies are no less important. Based on our findings, the optimal frequency sources are corpora of spoken language. However, other linguistic features can also help to improve the performance of complex word classifiers, especially in multi-classification.

7.2 Future Work

We believe that the complex word classifier we have built could be further improved by extending the work done with the frequency features, since the results show that adding, removing and normalizing these features using different methods can significantly affect the classification.

Additionally, the work can be extended by implementing an automatic lexical simplification system for Swedish. This can be done by collecting a lexicon of Swedish synonyms, where each word is mapped to multiple synonyms that are ranked based on their frequency in a corpus. In this case, our classifier could be used to identify complex words in Swedish data. If possible, these words would then be substituted with more frequent synonyms from the collected lexicon.


Bibliography

Emil Abrahamsson, Timothy Forni, Maria Skeppstedt, and Maria Kvist. Medical text simplification using synonym replacement: Adapting assessment of word difficulty to a compounding language. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), pages 57–65, 2014.

Svenja Adolphs and Norbert Schmitt. Lexical coverage of spoken discourse. Applied Linguistics, 24(4):425–438, 2003.

Talal Musaed Alghizzi. Complexity, accuracy, and fluency (CAF) development in L2 writing: the effects of proficiency level, learning environment, text type, and time among Saudi EFL learners. PhD thesis, 2017.

Or Biran, Samuel Brody, and Noémie Elhadad. Putting it simply: a context-aware approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 496–501. Association for Computational Linguistics, 2011.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Lars Borin and Markus Forsberg. All in the family: A comparison of SALDO and WordNet. In Proceedings of the Nodalida 2009 Workshop on WordNets and other Lexical Semantic Resources - between Lexical Semantics, Lexicography, Terminology and Formal Ontologies, NEALT Proceedings Series, volume 7, 2009.

Lars Borin, Markus Forsberg, and Lennart Lönngren. SALDO: a touch of yin to WordNet's yang. Language Resources and Evaluation, 47(4):1191–1211, 2013.

Lars Borin, Richard Johansson, and Luis Nieto Piña. Here be dragons? The perils and promises of inter-resource lexical-semantic mapping. In Proceedings of the Workshop on Semantic Resources and Semantic Annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015, Vilnius, 11th May, 2015, number 112, pages 1–11. Linköping University Electronic Press, 2015.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.


Sven Casper Bring. Svenskt ordförråd ordnat i begreppsklasser. Natur och Kultur, Stockholm, 1930.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27:1–27:27, 2011.

Anne Cutler. Lexical complexity and sentence processing. In The Process of Language Understanding, pages 43–79. Wiley, 1983.

Östen Dahl. The growth and maintenance of linguistic complexity, volume 71. John Benjamins Publishing, 2004.

Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. Universal Stanford dependencies: A cross-linguistic typology. In Conference on Language Resources and Evaluation, volume 14, pages 4585–4592, 2014.

Anna Decker. Towards automatic grammatical simplification of Swedish text. Master's thesis, Institutionen för Lingvistik, Stockholms universitet, Stockholm, 2003.

Bruce M. Edmonds. Syntactic measures of complexity. PhD thesis, 1999.

Stian Rødven Eide, Nina Tahmasebi, and Lars Borin. The Swedish Culturomics Gigaword Corpus: A one billion word Swedish reference dataset for NLP. In Digital Humanities 2016. From Digitization to Knowledge 2016: Resources and Methods for Semantic Processing of Digital Works/Texts, Proceedings of the Workshop, July 11, 2016, Krakow, Poland, number 126, pages 8–12. Linköping University Electronic Press, 2016.

Noemie Elhadad. Comprehending technical texts: Predicting and defining unfamiliar terms. In AMIA Annual Symposium Proceedings, volume 2006, pages 239–243. American Medical Informatics Association, 2006.

Karin Friberg Heppin. MedEval – a Swedish medical test collection with doctors and patients user groups. Journal of Biomedical Semantics, 2:1–15, 2011.

Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

Manuel Gimenes and Boris New. Worldlex: Twitter and blog word frequencies for 66 languages. Behavior Research Methods, 48(3):963–972, 2016.

Alice Carmichael Harris. On the explanation of typologically unusual structures. Linguistic Universals and Language Change, pages 54–76, 2008.

John A. Hawkins. An efficiency theory of complexity and related phenomena. Oxford University Press, Oxford, 2009.


Colby Horn, Cathryn Manduca, and David Kauchak. Learning a lexical simplifier using Wikipedia. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 458–463, 2014.

Alex Housen and Folkert Kuiken. Complexity, accuracy, and fluency in second language acquisition. Applied Linguistics, 30(4):461–473, 2009.

Viggo Kann. Folkets användning av Lexin – en resurs. KTH Nada, 2004.

Robin Keskisärkkä. Automatic text simplification via synonym replacement. In Swedish Language Technology Conference, 2012.

Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In Machine Learning Proceedings 1992, pages 249–256. Elsevier, 1992.

Sofie Johansson Kokkinakis and Elena Volodina. Corpus-based approaches for the creation of a frequency based vocabulary list in the EU project Kelly – issues on reliability, validity and coverage. Proceedings of Electronic Lexicography, pages 129–139, 2011.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–244, 1990.

Saif Mohammad. Imagisaurus: An interactive visualizer of valence and emotion in the Roget's thesaurus. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 85–91, 2015.

Katarina Mühlenbock. Legible, readable or plain words – presentation of an easy-to-read Swedish corpus. In Multilingualism, Proceedings of the 23rd Scandinavian Conference of Linguistics, volume Studia Linguistica Upsaliensia 8, pages 325–327, 2009.

Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülsen Eryigit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135, 2007.

Robert Östling. Stagger: An open-source part of speech tagger for Swedish. Volume 3, pages 1–18. Linköping University Electronic Press, 2013.

Gustavo Paetzold and Lucia Specia. SemEval 2016 Task 11: Complex word identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 560–569, 2016.


Gustavo H. Paetzold and Lucia Specia. Text simplification as tree transduction. In Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology, pages 116–125, 2013.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Keith Rayner and Susan A. Duffy. Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity. Memory & Cognition, 14(3):191–201, 1986.

Evelina Rennes and Arne Jönsson. A tool for automatic simplification of Swedish texts. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 317–320, 2015.

Evelina Rennes and Arne Jönsson. Towards a corpus of easy to read authority web texts. In Proceedings of the Sixth Swedish Language Technology Conference (SLTC-16), Umeå, Sweden, 2016.

Nicholas Rescher. Complexity: A philosophical overview. Transaction Publishers, 1998.

Peter Mark Roget. Roget's Thesaurus. World Heritage Encyclopedia, 1982.

Jonas Rybing, Christian Smith, and Annika Silvervarg. Towards a rule based system for automatic simplification of texts. In Proceedings of SLTC, 2010.

Horacio Saggion and Graeme Hirst. Automatic Text Simplification. Morgan & Claypool Publishers, 2017.

Matthew Shardlow. A comparison of techniques to automatically identify complex words. In 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop, pages 103–109, 2013a.

Matthew Shardlow. The CW corpus: A new resource for evaluating the identification of complex words. In Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 69–77, 2013b.

Matthew Shardlow. A survey of automated text simplification. International Journal of Advanced Computer Science and Applications, 2014.

Advaith Siddharthan. A survey of research on text simplification. International Journal of Applied Linguistics, 165(2):259–298, 2014.

Herbert A. Simon. The sciences of the artificial. MIT Press, 1996.


Kaius Sinnemäki. Language universals and linguistic complexity: Three case studies in core argument marking. PhD thesis, 2011.

Walter J. B. Van Heuven, Pawel Mandera, Emmanuel Keuleers, and Marc Brysbaert. SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67(6):1176–1190, 2014.

Gabriela Gómez Vera, Carmen Sotomayor, Percy Bedwell, Ana María Domínguez, and Elvira Jéldrez. Analysis of lexical quality and its relation to writing quality for 4th grade, primary school students in Chile. Reading and Writing, 29(7):1317–1336, 2016.

Elena Volodina and Sofie Johansson Kokkinakis. Introducing the Swedish Kelly-list, a new lexical e-resource for Swedish. In The International Conference on Language Resources and Evaluation, pages 1040–1046, 2012.

Kate Wolfe-Quintero, Shunji Inagaki, and Hae-Young Kim. Second language development in writing: Measures of fluency, accuracy, & complexity. Number 17. University of Hawaii Press, 1998.

Seid Muhie Yimam, Sanja Štajner, Martin Riedl, and Chris Biemann. Multilingual and cross-lingual complex word identification. In Proceedings of Recent Advances in Natural Language Processing, pages 813–822, 2017.

Marcos Zampieri, Shervin Malmasi, Gustavo Paetzold, and Lucia Specia. Complex word identification: Challenges in data annotation and system performance. arXiv preprint arXiv:1710.04989, 2017.

Harry Zhang. The optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004). AAAI Press, 2004a.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 116–123. ACM, 2004b.


A Machine Learning Algorithms

In this study, we used Scikit-learn, a Python module that integrates multiple state-of-the-art Machine Learning algorithms (Pedregosa et al., 2011). Six of these algorithms were chosen to train complex word classifiers: Support Vector Machine, Naive Bayes, Random Forest, Gradient Boosting, Logistic Regression and Stochastic Gradient Descent. In this appendix, we give a short overview of these classifiers.
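To make the setup concrete, the sketch below shows how the six classifiers can be instantiated and compared in Scikit-learn. The feature matrix is synthetic and all parameters are library defaults; this is an illustration only and does not reproduce the features or settings used in the thesis.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Synthetic stand-in for a word feature matrix X and complexity labels y.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X = X - X.min(axis=0)  # Multinomial Naive Bayes needs non-negative features

classifiers = {
    "SVM": SVC(),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Logistic Regression": LogisticRegression(),
    "SGD": SGDClassifier(),
}

for name, clf in classifiers.items():
    # 10-fold cross-validated accuracy for each classifier.
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f}")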

Support Vector Machines

Scikit-learn's implementation of Support Vector Machines (SVM) is based on LIBSVM, a library for Support Vector Machines (Chang and Lin, 2011). SVMs are based on supervised Machine Learning methods and can be used for regression, classification or outlier detection. LIBSVM supports Support Vector Classification (SVC), Support Vector Regression (SVR) and one-class SVM. In this study, we use SVC for binary and multi-class classification.

Support vectors are training data points belonging to different classes in a multi-dimensional space. The main idea behind the algorithm is to find a hyperplane that separates the data points from different classes with the biggest possible margin. Therefore, SVM is defined as a constrained optimization problem:

\min_{w,b} \frac{1}{\gamma(w,b)} \quad \text{subj. to} \quad y_n (w \cdot x_n + b) \geq 1

In this problem, the classifier tries to find parameters w, b that maximize the margin \gamma in such a way that all data points are classified correctly.
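As an illustration of the margin formulation, here is a minimal Scikit-learn sketch on synthetic data; the kernel and C value are illustrative defaults, not settings from the thesis.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
# A linear kernel corresponds to the hyperplane above; a smaller C
# permits a wider, softer margin.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
# The support vectors are the training points that define the margin.
print(clf.support_vectors_.shape)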

Naive Bayes

Naive Bayes is a set of supervised Machine Learning algorithms that are all based on the principle that features are independent of each other given the value of the class variable (Zhang, 2004a).

In this study we use a probabilistic learning model, Multinomial Naive Bayes, where the probability of example d belonging to class c is computed as follows:

P(c \mid d) \propto P(c) \prod_{1 \leq k \leq n_d} P(t_k \mid c)

Here, P(t_k \mid c) is the conditional probability of example t_k being a part of class c (Manning and Schütze, 2008).
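A minimal sketch of this model in Scikit-learn; the toy non-negative count features below are invented for illustration and are not data from the thesis.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy non-negative count features, e.g. character n-gram counts.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 1], [0, 3, 2]])
y = np.array([0, 0, 1, 1])
clf = MultinomialNB().fit(X, y)
# Class posteriors P(c | d), proportional to P(c) * prod_k P(t_k | c).
print(clf.predict_proba([[1, 2, 1]]))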

Logistic Regression

Logistic Regression is a linear model used for classification. The main component of the algorithm is a logistic function that tries to build a model that fits the data best, and the output of the algorithm in this case is the probability of an example belonging to a certain class (Bishop, 2006). These probabilities are computed using the odds ratio, which is the ratio between the probability of the example belonging to one class and the probability of it not belonging to that class:

\text{Odds} = \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}

Given this odds ratio, the final equation that computes the probability is:

P(y = 1 \mid x) = \frac{e^{a + bx}}{1 + e^{a + bx}} = \frac{1}{1 + e^{-(a + bx)}}

Here, a and b are the coefficients of the logistic function, x is the data feature and y is the output.
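A minimal sketch on synthetic data with default settings, showing the probabilistic output described above.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = LogisticRegression().fit(X, y)
# predict_proba evaluates the logistic function and returns
# P(y = c | x) for each class; each row sums to 1.
print(clf.predict_proba(X[:3]))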

Stochastic Gradient Descent

Stochastic Gradient Descent is another linear classification algorithm, where the quality of the predictor p(x) is measured by a loss function. It solves the problem of finding a predictor p(x) such that the expected true loss of p is as small as possible:

Q(p(\cdot)) = E_{X,Y} \Phi(p(X), Y)

Here, E_{X,Y} is the expectation with respect to the true distribution D, an unknown underlying distribution that the training data X, Y is drawn from (Zhang, 2004b).
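A minimal sketch with synthetic data; the hinge loss is Scikit-learn's default, so this trains a linear SVM by stochastic gradient descent rather than reproducing a thesis configuration.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
# Each update step estimates the gradient of the loss from a single
# example, which is what makes the descent "stochastic".
clf = SGDClassifier(loss="hinge", max_iter=1000).fit(X, y)
print(clf.score(X, y))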

Random Forest

The Random Forest algorithm is based on randomized decision trees that are closely related to the "divide and conquer" notion. Breiman (2001) defines a random forest as a classifier consisting of a collection of tree-structured classifiers

\{h(x, \Theta_k)\},

where the \Theta_k are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x (Breiman, 2001). Random forest then solves the problem of defining the margin function, which measures how much the mean number of votes for the correct class at X, Y exceeds the mean number of votes for the other classes:

mg(X, Y) = \mathrm{av}_k\, I(h_k(X) = Y) - \max_{j \neq Y} \mathrm{av}_k\, I(h_k(X) = j),

where I is an indicator function (Breiman, 2001).
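A minimal sketch on synthetic data; 100 trees is a common library default, not a setting taken from the thesis.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
# Each tree is grown on a bootstrap sample with random feature subsets;
# the forest predicts by majority vote over the individual trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(len(clf.estimators_), clf.predict(X[:3]))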

61

Page 62: ComplexWord IdentificationforSwedishuu.diva-portal.org/smash/get/diva2:1212982/FULLTEXT01.pdf · about lexical complexity to English and describe what patterns simple and complex

Gradient Boosting

Similarly to Random Forest, the Gradient Boosting algorithm is based on building an ensemble of tree structures and can be used for both regression and classification (Friedman, 2001). Gradient Boosting consists of three parts:

• a loss function that is optimized

• a weak learner that makes predictions

• an additive model that adds weak learners to minimize the loss function,

where the loss function is optimized by minimizing its value and the weak learners are decision trees. After an ensemble of trees is constructed, the additive model sums the predictions of the individual trees:

D(x) = d_{tree_1}(x) + d_{tree_2}(x) + \dots

Each following tree is constructed in a way that minimizes the loss function.
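A minimal sketch on synthetic data with default settings; staged_predict exposes the additive model by yielding the prediction of D(x) after each successive tree is added to the sum.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
# Training accuracy typically improves as more weak learners are summed.
for i, y_pred in enumerate(clf.staged_predict(X)):
    if (i + 1) % 10 == 0:
        print(i + 1, (y_pred == y).mean())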


B Syntactic Relations

List of syntactic relations and corresponding feature labels (http://universaldependencies.org/u/dep/).

Feature Label    Syntactic Relation

Nominals:
nsubj            nominal subject
obj              object
iobj             indirect object
obl              oblique nominal
vocative         vocative
expl             expletive
dislocated       dislocated elements
nmod             nominal modifier
appos            appositional modifier
nummod           numeric modifier

Clauses:
csubj            clausal subject
ccomp            clausal complement
xcomp            open clausal complement
advcl            adverbial clause modifier
acl              clausal modifier of noun (adjectival clause)

Modifier words:
advmod           adverbial modifier
discourse        discourse element
amod             adjectival modifier

Function words:
aux              auxiliary
cop              copula
mark             marker
det              determiner
clf              classifier
case             case marking

Coordination:
conj             conjunct
cc               coordinating conjunction

MWE:
fixed            fixed multiword expression
flat             flat multiword expression
compound         compound

Loose:
list             list
parataxis        parataxis

Special:
orphan           orphan
goeswith         goes with
reparandum       overridden disfluency

Other:
punct            punctuation
root             root
dep              unspecified dependency
