
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1066–1071, Austin, Texas, November 1-5, 2016. © 2016 Association for Computational Linguistics

SimpleScience: Lexical Simplification of Scientific Terminology

Yea-Seul Kim and Jessica Hullman
University of Washington, Information School
yeaseul1, [email protected]

Matthew Burgess
University of Michigan, Computer Science Department
[email protected]

Eytan Adar
University of Michigan, School of Information
[email protected]

Abstract

Lexical simplification of scientific terms represents a unique challenge due to the lack of a standard parallel corpus and the fast rate at which vocabulary shifts along with research. We introduce SimpleScience, a lexical simplification approach for scientific terminology. We use word embeddings to extract simplification rules from a parallel corpus containing scientific publications and Wikipedia. To evaluate our system we construct SimpleSciGold, a novel gold standard set for science-related simplifications. We find that our approach outperforms prior context-aware approaches at generating simplifications for scientific terms.

1 Introduction

Lexical simplification, the process of reducing the complexity of words by replacing them with simpler substitutes (e.g., sodium in place of Na; insects in place of lepidopterans) can make scientific texts more accessible to general audiences. Human-in-the-loop interfaces present multiple possible simplifications to a reader (on demand) in place of jargon and give the reader familiar access points to understanding jargon (Kim et al., 2015). Unfortunately, simplification techniques are not yet of high enough quality for fully automated scenarios.

Currently, lexical simplification pipelines for scientific texts are rare. The vast majority of prior methods assume a domain-independent context, and rely on Wikipedia and Simple English Wikipedia, a subset of Wikipedia using simplified grammar and terminology, to learn simplifications (Biran et al., 2011; Paetzold and Specia, 2015), with translation-based approaches using an aligned version (Coster and Kauchak, 2011; Horn et al., 2014; Yatskar et al., 2010). However, learning simplifications from Wikipedia is not well suited to lexical simplification of scientific terms. Though generic or established terms may appear in Wikipedia, novel terms associated with new advances may not be reflected. Wikipedia's editing rules also favor generality over specificity and eliminate redundancy, both of which are problematic in providing a rich training set that matches simple and complex terms. Further, some approaches work by detecting all pairs of words in a corpus and filtering to isolate synonym or hypernym-relationship pairs using WordNet (Biran et al., 2011). Like Wikipedia, WordNet is a general-purpose semantic database (Miller, 1995), and does not cover all branches of science nor integrate new terminology quickly.

Word embeddings do not require the use of pre-built ontologies to identify associated terms like simplifications. Recent work indicates that they may improve results for simplification selection: determining which simplifications for a given complex word can be used without altering the meaning of the text (Paetzold and Specia, 2015). Embeddings have also been explored to extract hypernym relations from general corpora (Rei and Briscoe, 2014). However, word embeddings have not been used for generating lexical simplifications. We provide a novel demonstration of how using embeddings on a scientific corpus is better suited to learning scientific term simplifications than prior approaches that use WordNet as a filter and Wikipedia as a corpus.


INPUT: Finally we show that the transient immune activation that renders mosquitoes resistant to the human malaria parasite has little to no effect on mosquito fitness as a measure of survival or fecundity under laboratory conditions.
CANDIDATE RULES: {fecundity → fertility} {fecundity → productivity}
OUTPUT: Finally we show that the transient immune activation that renders mosquitoes resistant to the human malaria parasite has little to no effect on mosquito fitness as a measure of survival or (fertility; productivity) under laboratory conditions.

Table 1: Input sentence, candidate rules and output sentence. (Further examples provided as supplementary material.)

We introduce SimpleScience, a novel lexical simplification pipeline for scientific terms, which we apply to a scientific corpus of nearly 500k publications in the Public Library of Science (PLOS) and PubMed Central (PMC) paired with a general corpus from Wikipedia. We validate our approach using SimpleSciGold, a gold standard set that we create using crowdsourcing that contains 293 sentences containing scientific terms with an average of 21 simplifications per term. We show how the SimpleScience pipeline achieves better performance (F-measure: 0.285) than the prior approach to simplification generation applied to our corpus (F-measure: 0.136). We further find that the simplification selection techniques used in prior work to determine which simplifications are a good fit for a sentence do not improve performance when our generation pipeline is used.¹

2 Parallel corpora: Scientific and General

We assembled a scientific corpus of papers from the entire catalog of PLOS articles and the National Library of Medicine's PubMed Central (PMC) archive (359,324 full-text articles). The PLOS corpus of 125,378 articles includes articles from PLOS One and each specialty PLOS journal (e.g., Pathogens, Computational Biology). Our general corpus includes all 4,776,093 articles from the Feb. 2015 English Wikipedia snapshot. We chose Wikipedia as it covers many scientific concepts and usually contains descriptions of those concepts using simpler language than the research literature. We obtained all datasets from DeepDive (Re and Zhang, 2015).

¹ Data and source code are available at: https://github.com/yeaseulkim/SimpleScience

3 SimpleScience Design

3.1 Generating Simplifications

Our goal is to learn simplification rules of the form complex word → simple word. One approach identifies all pairwise permutations of 'content' terms and then applies semantic (i.e., WordNet) and simplicity filters to eliminate pairs that are not simplifications (Biran et al., 2011). We adopt a similar pipeline but leverage distance metrics on word embeddings and a simpler frequency filter in place of WordNet. Embeddings identify words that share context in an unsupervised, scalable way and are more efficient than constructing co-occurrence matrices (Biran et al., 2011). As our experiments demonstrate, our approach improves performance on a scientific test set over prior work.

3.1.1 Step 1: Generating Word Embeddings

We used the Word2Vec system (Mikolov et al., 2013) to learn word vectors for each content word in the union of the vocabularies of the scientific and general corpora. While other approaches exist (Pennington et al., 2014; Levy and Goldberg, 2014), Word2Vec has been shown to produce both fast and accurate results (Mikolov et al., 2013). We set the embedding dimension to 300 and the context window to 10, and use the skip-gram architecture with negative sampling, which is known to produce quality results for rare entities (Mikolov et al., 2013).
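For concreteness, a minimal sketch of this step using gensim's Word2Vec implementation, with the hyperparameters stated above; the paper does not specify its tooling beyond Word2Vec itself, so the corpus file name and the negative-sampling count are assumptions.

```python
# Sketch: skip-gram embeddings with dim=300, window=10, negative sampling.
# Assumes gensim; the tokenized corpus path is a placeholder.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One pre-tokenized sentence per line; union of scientific + general corpora.
sentences = LineSentence("plos_pmc_plus_wikipedia_tokenized.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimension used in the paper
    window=10,        # context window used in the paper
    sg=1,             # skip-gram architecture
    negative=5,       # negative sampling (count is our assumption)
    min_count=1,      # keep rare scientific terms
)
model.save("simplescience.w2v")
```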

3.1.2 Step 2: Filtering Pairs

Given the set of all pairwise permutations of words, we retain a simplification rule of two words w1, w2 if the cosine similarity cos(w1, w2) between the word vectors is greater than a threshold a. We use grid search, described below, to parameterize a.
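A minimal sketch of this threshold test, assuming the Word2Vec model from Step 1 and the a = 0.4 value reported under Parameterization below:

```python
from itertools import permutations

A = 0.4  # cosine threshold a, set by grid search (see Parameterization)

def candidate_pairs(model, vocabulary):
    """Yield ordered (w1, w2) pairs whose embedding similarity clears
    the threshold, i.e. candidates for a rule w1 -> w2."""
    for w1, w2 in permutations(vocabulary, 2):
        if model.wv.similarity(w1, w2) > A:
            yield (w1, w2)
```

Enumerating all permutations is quadratic in vocabulary size; in practice one would likely query `model.wv.most_similar(w, topn=...)` per word instead, which retrieves only the nearest neighbors.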

We then apply additional filtering rules. To avoid rules comprised of words with the same stem (e.g., permutable, permutation), we stem all words (using the Porter stemmer in the Python NLTK library (Bird et al., 2009)). The POS of each word is determined by MorphAdorner (Burns, 2013), and pairs that differ in POS are omitted (e.g., permutation (noun), change(d) (verb)). Finally, we omit rules where one word is a prefix of the other and the suffix is one of s, es, ed, ly, er, or ing.
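These three morphological filters can be expressed compactly; a sketch follows, where the `pos` mapping is a stand-in (the paper uses MorphAdorner for tagging, which is not reproduced here):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
SUFFIXES = {"s", "es", "ed", "ly", "er", "ing"}

def passes_morphological_filters(w1, w2, pos):
    """Drop pairs that are morphological variants of one another.
    `pos` maps word -> POS tag; any tagger wired in here is an
    assumption, as the paper's MorphAdorner setup is not shown."""
    if stemmer.stem(w1) == stemmer.stem(w2):      # same stem
        return False
    if pos[w1] != pos[w2]:                        # POS mismatch
        return False
    for a, b in ((w1, w2), (w2, w1)):             # prefix + listed suffix
        if b.startswith(a) and b[len(a):] in SUFFIXES:
            return False
    return True
```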

To retain only rules of the form complex word → simple word, we calculate the corpus complexity, C (Biran et al., 2011), of each word w as the ratio between its frequency (f) in the scientific versus the general corpus: Cw = fw,scientific / fw,general. The lexical complexity, L, of a word is calculated as the word's character length, and the final complexity of the word as Cw × Lw. We require that the final complexity score of the first word in the rule be greater than that of the second.

While this simplicity filter has been shown to work well in general corpora (Biran et al., 2011), it is sensitive to very small differences in the frequencies with which both words appear in the corpora. This is problematic given the distribution of terms in our corpora, where many rarer scientific terms may appear in small numbers in both corpora.

We introduce an additional constraint that requires that the second (simple) word in the rule occur in the general corpus at least k times. This helps ensure that we do not label words that are at a similar complexity level as simplifications. We note that this filter aligns with prior work that suggests that features of the hypernym in hypernym-hyponym relations influence performance more than features of the hyponym (Rei and Briscoe, 2014).
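Putting the complexity comparison and the frequency constraint together, a sketch of the direction filter (the frequency dictionaries `f_sci` and `f_gen` are assumed to be precomputed word counts; their construction is not shown):

```python
K = 3000  # minimum general-corpus frequency of the simple word

def final_complexity(word, f_sci, f_gen):
    """C_w * L_w: the scientific/general frequency ratio times
    character length. Assumes the word occurs at least once in each
    corpus so the ratio is defined."""
    return (f_sci[word] / f_gen[word]) * len(word)

def keep_rule(w1, w2, f_sci, f_gen):
    """Keep w1 -> w2 only if w1 scores as more complex than w2 and
    w2 is frequent enough in the general corpus."""
    return (final_complexity(w1, f_sci, f_gen)
            > final_complexity(w2, f_sci, f_gen)
            and f_gen[w2] >= K)
```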

Parameterization: We use a grid search analysis to identify which measures of the set including cos(w1, w2), fw1,scientific, fw2,scientific, fw1,general, and fw2,general most impact the F-measure when we evaluate our generation approach against our scientific gold standard set (Sec. 4), and to set the specific parameter values. Using this method we identify a = 0.4 for cosine similarity and k = 3,000 for the frequency of the simple term in the general corpus. Full results are available in supplementary material.
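The grid search itself reduces to scoring each parameter combination against the gold standard; a sketch, where `evaluate_f` and the candidate grids are assumptions (the paper reports only the winning values):

```python
from itertools import product

def grid_search(evaluate_f, a_values, k_values):
    """Return the (a, k) pair maximizing F-measure on the gold set.
    `evaluate_f(a, k)` should run the generation pipeline with those
    parameters and score it against SimpleSciGold (not shown)."""
    return max(product(a_values, k_values),
               key=lambda ak: evaluate_f(*ak))

# Illustrative grids only; the paper reports a = 0.4, k = 3,000 as best.
# best_a, best_k = grid_search(evaluate_f,
#                              [0.3, 0.4, 0.5, 0.6],
#                              [0, 1000, 3000, 5000])
```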

3.2 Applying Simplifications

In prior context-aware simplification systems, the decision of whether to apply a simplification rule in an input sentence is complex, involving several similarity operations on word co-occurrence matrices (Biran et al., 2011) or using embeddings to incorporate co-occurrence context for pairs generated using other means (Paetzold and Specia, 2015). However, the SimpleScience pipeline already considers the context of appearance for each word in deriving simplifications via word embeddings learned from a large corpus. We see no additional improvements in F-measure when we apply two variants of context similarity thresholds to decide whether to apply a rule to an input sentence. The first is the cosine similarity between the distributed representation of the simple word and the sum of the distributed representations of all words within a window l surrounding the complex word in the input sentence (Paetzold and Specia, 2015). The second is the cosine similarity of a minimum shared frequency co-occurrence matrix for the words in the pair and the co-occurrence matrix for the input sentence (Biran et al., 2011).
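The first variant can be sketched directly from the description above; the window size l and its default here are assumptions, as is the handling of out-of-vocabulary tokens:

```python
import numpy as np

def context_similarity(model, simple_word, tokens, target_idx, l=5):
    """Cosine similarity between the simple word's vector and the sum
    of the vectors of all words within a window of size l around the
    complex word, following Paetzold and Specia (2015)."""
    window = [t for i, t in enumerate(tokens)
              if abs(i - target_idx) <= l and i != target_idx
              and t in model.wv]
    if not window or simple_word not in model.wv:
        return 0.0  # assumption: treat missing context/word as no match
    ctx = np.sum([model.wv[t] for t in window], axis=0)
    vec = model.wv[simple_word]
    return float(np.dot(vec, ctx) /
                 (np.linalg.norm(vec) * np.linalg.norm(ctx)))
```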

In fully automated applications, the top rule from the ranked candidate rules is used. We find that ranking by the cosine similarity between the word embeddings for the complex and simple word in the rule leads to the best performance at the top slot (full results in supplementary material).

4 Evaluation

4.1 SimpleSciGold Test Set

To evaluate our pipeline, we develop SimpleSciGold, a scientific gold standard set of sentences containing complex scientific terms which is modeled after the general-purpose gold standard set created by Horn et al. (2014).

To create SimpleSciGold, we start with scientific terms from two sources: we utilized all 304 complex terms from unigram rules by Vydiswaran et al. (2014), and added another 34,987 child terms from rules found by mining direct parent-child relations for unigrams in the Medical Subject Headings (MeSH) ontology (United States National Library of Medicine, 2015). We chose complex terms with pre-existing simplifications as it provided a means by which we could informally check the crowd-generated simplifications for consistency.

To obtain words in context, we extracted 293 sentences containing unique words in this set from PLOS abstracts from PLOS Biology, Pathology, Genetics, and several other journals. We present 10 MTurk crowdworkers with a task ("HIT") showing one of these sentences with the complex word bolded. Workers are told to read the sentence, consult online materials (we provide direct links to a Wikipedia search, a standard Google search, and a Google "define" search on the target term), and add their simplification suggestions. Crowdworkers first passed a multiple-choice practice qualification in which they were presented with sentences containing three complex words in need of simplification along with screenshots of Wikipedia and dictionary pages for the terms. The workers were asked to identify which of 5 possible simplifications listed for each complex word would preserve the meaning while simplifying. 108 workers took part in the gold standard set creation task, completing an average of 27 HITs each. The resulting SimpleSciGold standard set consists of an average of 20.7 simplifications for each of the 293 complex words in corresponding sentences.

| Method | Corpus (Complex, Simple) | Number of Simplifications | Pot. | Prec. | F |
|---|---|---|---|---|---|
| Biran et al. 2011 | Wikipedia, SEW | 17 | 0.059 | 0.036 | 0.044 |
| Biran et al. 2011 | PLOS/PMC, Wikip. | 588 | 0.352 | 0.084 | 0.136 |
| SimpleScience (cos ≥ .4, fw,simple ≥ 3000) | PLOS/PMC, Wikip. | 2,322 | 0.526 | 0.196 | 0.285 |
| SimpleScience (cos ≥ .4, fw,simple ≥ 0) | PLOS/PMC, Wikip. | 10,910,536 | 0.720 | 0.032 | 0.061 |

Table 2: Simplification generation results on SimpleSciGold. SimpleScience achieves the highest F-measure with a cosine threshold of 0.4 and a frequency of the simple word in the general corpus of 3000.

4.2 Simplification Generation

We compare our word embedding generation process (applied to our corpora) to Biran et al.'s (2011) approach (applied to the Wikipedia and Simple English Wikipedia corpus as well as our scientific corpora). Following the evaluation method used in Paetzold and Specia (2015), we calculate potential as the proportion of instances for which at least one of the substitutions generated is present in the gold standard set, precision as the proportion of generated instances which are present in the gold standard set, and F-measure as their harmonic mean.
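These three metrics are straightforward to compute; a sketch under the assumption that predictions and gold annotations are keyed by test instance:

```python
def evaluate(generated, gold):
    """Potential, precision, and F-measure as in Paetzold and Specia
    (2015). `generated` and `gold` map each test instance (sentence,
    complex word) to a set of candidate substitutions."""
    n = len(gold)
    potential = sum(bool(generated.get(i, set()) & gold[i])
                    for i in gold) / n
    produced = sum(len(generated.get(i, set())) for i in gold)
    correct = sum(len(generated.get(i, set()) & gold[i]) for i in gold)
    precision = correct / produced if produced else 0.0
    f = (2 * potential * precision / (potential + precision)
         if potential + precision else 0.0)
    return potential, precision, f
```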

Our SimpleScience approach outperforms the original approach by Biran et al. (2011) applied to the Wikipedia and SEW corpus as well as to the scientific corpus (Table 2).

4.3 Applying Simplifications

We find that neither prior selection approach yields performance improvements over our generation process. We evaluate the performance of ranking candidate rules by cosine similarity (to find the top rule for a fully automated application), and achieve precision of 0.389 at the top slot. In our supplementary materials, we provide additional results for potential, precision, and F-measure at varying numbers of slots (up to 5), where we test ranking by cosine similarity of embeddings as well as by the second filter used in our pair generation step: the frequency of the simple word in the general corpus.

4.4 Antonym Prevalence Analysis

A risk of using Word2Vec in place of WordNet is that the simpler terms generated by our approach may represent terms with opposite meanings (antonyms). While a detailed analysis is beyond the scope of this paper, we compared the likelihood of seeing antonyms in our results using a gold standard set of antonyms for biology, chemistry, and physics terms from WordNik (Wordnik, 2009). Specifically, we created an antonym set consisting of the 304 terms from the biology, chemistry, and physics categories in Wiktionary for which at least one antonym is listed in WordNik. We compared antonym pairs with rules produced by the SimpleScience pipeline (Fig. 1). We observed that 14.5% of the time (44 out of 304 instances), an antonym appeared at the top slot among results. 51.3% of the time (156 out of 304 instances), no antonyms in the list appeared within the top 100 ranked results. These results suggest that further filters are necessary to ensure high enough quality results for fully automated applications of scientific term simplification.
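The per-term check reduces to asking where, if anywhere, a known antonym ranks among a term's candidates. A sketch, with the simplification that candidates are taken as raw nearest neighbors by cosine similarity (a stand-in for the full ranked rule list):

```python
def antonym_rank(model, complex_word, antonyms, topn=100):
    """Rank (1-based) of the highest-ranked known antonym of
    `complex_word` among its topn nearest neighbors, or None if no
    antonym appears. `antonyms` is assumed to come from the
    WordNik-derived gold set."""
    neighbors = [w for w, _ in model.wv.most_similar(complex_word,
                                                     topn=topn)]
    ranks = [neighbors.index(a) + 1 for a in antonyms if a in neighbors]
    return min(ranks) if ranks else None
```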


Figure 1: Probability of an antonym in our test set occurring as a suggested simpler term in the top 100 slots in the SimpleScience pipeline.

5 Limitations and Future Work

A risk of using Word2Vec to find related terms, rather than querying a lexical database like WordNet, is that generated rules may include antonyms. Adding techniques to filter antonym rules, such as using co-reference chains (Adel and Schutze, 2014), is important in future work.

We achieve a precision of 0.389 at the top slot on our SimpleSciGold standard set when we apply our generation method and rank candidates by cosine similarity. This level of precision is higher than that achieved by various prior ranking methods used in LEXenstein (Paetzold and Specia, 2015), with the exception of machine learning techniques like SVM (Paetzold and Specia, 2015). Future work should explore how much the precision of our SimpleScience pipeline can be improved by adopting more sophisticated ranking methods. However, we suspect that even the highest precision obtained on general corpora and gold standard sets in prior work is not sufficient for fully automated simplification. An exciting area for future work is in applying the SimpleScience pipeline in interactive simplification suggestion interfaces for those reading or writing about science (Kim et al., 2015).

6 Conclusion

In this work, we introduce SimpleScience, a lexical simplification approach to address the unique challenges of simplifying scientific terminology, including a lack of parallel corpora, shifting vocabulary, and the mismatch of using general-purpose databases for filtering. We use word embeddings to extract simplification rules from a novel parallel corpus that contains scientific publications and Wikipedia.

Using SimpleSciGold, a gold standard set that we created using crowdsourcing, we show that using embeddings and simple frequency filters on a scientific corpus outperforms prior approaches to simplification generation, and renders the best prior approach to simplification selection unnecessary.

References

[Adel and Schutze, 2014] Heike Adel and Hinrich Schutze. 2014. Using mined coreference chains as a resource for a semantic task. In EMNLP, pages 1447–1452.

[Biran et al., 2011] Or Biran, Samuel Brody, and Noemie Elhadad. 2011. Putting it simply: A context-aware approach to lexical simplification. In ACL '11. Association for Computational Linguistics.

[Bird et al., 2009] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc.

[Burns, 2013] Philip R. Burns. 2013. MorphAdorner v2: A Java library for the morphological adornment of English language texts. Northwestern University, Evanston, IL.

[Coster and Kauchak, 2011] William Coster and David Kauchak. 2011. Learning to simplify sentences using Wikipedia. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, pages 1–9. Association for Computational Linguistics.

[Horn et al., 2014] Colby Horn, Cathryn Manduca, and David Kauchak. 2014. Learning a lexical simplifier using Wikipedia. In ACL (2), pages 458–463.

[Kim et al., 2015] Yea-Seul Kim, Jessica Hullman, and Eytan Adar. 2015. DeScipher: A text simplification tool for science journalism. In Computation+Journalism Symposium.

[Levy and Goldberg, 2014] Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185. Curran Associates, Inc.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

[Miller, 1995] George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, November.

[Paetzold and Specia, 2015] Gustavo Henrique Paetzold and Lucia Specia. 2015. LEXenstein: A framework for lexical simplification. ACL-IJCNLP 2015, 1(1):85.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

[Re and Zhang, 2015] Christopher Re and Ce Zhang. 2015. DeepDive open datasets. http://deepdive.stanford.edu/opendata.

[Rei and Briscoe, 2014] Marek Rei and Ted Briscoe. 2014. Looking for hyponyms in vector space. In CoNLL, pages 68–77.

[United States National Library of Medicine, 2015] United States National Library of Medicine. 2015. Medical Subject Headings.

[Vydiswaran et al., 2014] V.G.Vinod Vydiswaran, Qiaozhu Mei, David A. Hanauer, and Kai Zheng. 2014. Mining consumer health vocabulary from community-generated text. In AMIA '14.

[Wordnik, 2009] Wordnik. 2009. Wordnik online English dictionary. https://www.wordnik.com/.

[Yatskar et al., 2010] Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2010. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In NAACL '10. Association for Computational Linguistics.