
Embedding Open-domain Common-sense Knowledge from Text

Travis Goodwin and Sanda Harabagiu
Human Language Technology Research Institute
University of Texas at Dallas
Richardson, TX 75083-0688, USA
{travis,sanda}@hlt.utdallas.edu

Abstract
Our ability to understand language often relies on common-sense knowledge – background information the speaker can assume is known by the reader. Similarly, our comprehension of the language used in complex domains relies on access to domain-specific knowledge. Capturing common-sense and domain-specific knowledge can be achieved by taking advantage of recent advances in open information extraction (IE) techniques and, more importantly, of knowledge embeddings, which are multi-dimensional representations of concepts and relations. Building a knowledge graph for representing common-sense knowledge, in which concepts discerned from noun phrases are cast as vertices and lexicalized relations are cast as edges, leads to learning embeddings of common-sense knowledge that account for semantic compositionality as well as implied knowledge. Common-sense knowledge is acquired from a vast collection of blogs and books as well as from WordNet. Similarly, medical knowledge is learned from two large sets of electronic health records. The evaluation results for these two forms of knowledge are promising: the same knowledge acquisition methodology based on learning knowledge embeddings works well both for common-sense knowledge and for medical knowledge. Interestingly, the common-sense knowledge that we acquired was evaluated as being less neutral than the medical knowledge, as it often reflected the opinion of the knowledge utterer. In addition, the acquired medical knowledge was evaluated as more plausible than the common-sense knowledge, reflecting the complexity of acquiring common-sense knowledge due to the pragmatics and economy of language.

Keywords: Common-sense knowledge, medical domain knowledge, knowledge embedding

1. Introduction

Our ability to understand language often relies on common-sense knowledge – background information the speaker can assume is known by the reader. When language is used to communicate, common-sense knowledge is not articulated in most cases. Hence, the availability of resources capturing or approximating common-sense knowledge is crucial. Multiple attempts have been made to capture general common-sense knowledge, such as the CYC project (Lenat and Guha, 1989) and ConceptNet (Speer and Havasi, 2013). Moreover, WordNet (Miller, 1995) captures lexico-semantic knowledge in English, while DBPedia (Auer et al., 2007) provides knowledge that has been used for common-sense reasoning. These resources encode a large number of concepts without representing all their possible attributes. Another important limitation of these existing resources for common-sense knowledge stems from the limited number of relation types spanning concepts. Some of the existing relations are taxonomic, e.g. IS-A, PART-OF, while others capture causality, e.g. ENABLE, CAUSE-OF, but they hardly represent all relation types we use when accessing common-sense knowledge. To address these two limitations, we took advantage of recent advances in open information extraction (IE) techniques and, more importantly, of the advent of knowledge embeddings, which are multi-dimensional representations of concepts and relations. The use of knowledge embeddings enabled us to consider lexicalized relations between concepts, allowing relational similarity to be easily identified, even when relations are not grouped into the same relation type. Although this is an approximation of the relation types available from existing common-sense knowledge resources, it captures a much larger set of possible relations spanning concepts.

Because concepts are largely expressed as noun phrases, our formulation based on knowledge embeddings makes use of semantic compositionality to represent complex concepts that comprise their attributes and their nominal interpretation (e.g. “plantain slices” are interpreted as slices of plantain, while a “plantain dish” can be interpreted as a dish made with plantains). When capturing common-sense knowledge from a variety of sources, including books and weblogs, we rely on open-domain information extraction (IE) to identify concepts as noun phrases and the lexicalized relations they share. For example, consider the following sentence obtained from a personal weblog post:

[plantain slices]_c1 [are cooked in]_r [hot oil]_c2

This sentence encodes the lexical relation r = are cooked in, indicating the relationship between two concepts c1 = plantain slices and c2 = hot oil. Representing the knowledge encoded in this sentence requires a level of granularity not supported by existing knowledge sources: we must restrict ourselves to slices of plantains (not, for example, slices of pie) and to hot oil (not cold oil, and certainly not crude oil). Additionally, the lexical relation r between these two concepts is not present in the fixed set of relation types used by any of these knowledge sources, and cannot be approximated by that fixed set without adjusting the semantics of the original sentence. Capturing common-sense knowledge is enabled by organizing the extracted concepts and their lexical relations into a knowledge graph which we embed in a multi-dimensional vector space.


Because our approach for capturing knowledge is able to identify relational knowledge which is not currently available in existing resources, we explored the possibility of using the same knowledge acquisition methodology for capturing domain-specific knowledge that is not currently available, despite domain-specific texts being abundant. For this purpose, we considered the domain of medicine. In the medical domain, multiple efforts have been made to capture knowledge, including the Unified Medical Language System (UMLS) (Bodenreider, 2004) and the Systematized Nomenclature of Medicine (SNOMED) (Stearns et al., 2001). These medical knowledge resources have similar limitations as the common-sense knowledge resources, namely (1) too few types of relations between concepts and (2) descriptions of concepts that do not capture all of their semantic attributes. However, when processing medical language, a large variety of relations needs to be considered. For example, given a sentence stating:

[abnormal EEG]_c1 [due to]_r [rhythmic background slowing]_c2

the causality between the “abnormal” attribute of the electroencephalogram (EEG) and the rhythmic slowing of the background signal needs to be recognized. Using knowledge embeddings to encode medical knowledge provides access to medical information not available in current ontologies but expressed in medical texts. An important aspect of the common-sense and domain-specific knowledge acquisition framework described in this paper is provided by our probabilistic treatment of relations between concepts. By encoding the plausibility of a lexicalized relation between a pair of concepts, e.g. a triple q = ⟨c1, r, c2⟩, the knowledge encoding framework that we describe is able to capture salient information as knowledge embeddings without discarding the long-tail aspects of this knowledge. Moreover, these knowledge embeddings are able to generalize the knowledge discerned from individual sentences by incorporating two notions of semantic smoothness: (1) functional similarity (Turney, 2012), the idea that two phrases are similar if they engage in similar lexical relations, and (2) behavioral similarity (Nakov and Hearst, 2008), the idea that two lexical relationships are similar if they operate on similar concepts. In this paper, we present a novel knowledge embedding framework that generates promising results both on common-sense knowledge and domain-specific knowledge by learning an optimal embedding for each concept and lexical relation according to (1) the functional similarity between pairs of concepts, (2) the behavioral similarity between pairs of lexical relations, and (3) the role of the semantic composition of each word occurring in each concept and lexical relation. The remainder of this paper is organized as follows. Section 2 reviews related work and Section 3 describes the datasets we considered as well as the natural language processing techniques performed on these datasets. Section 4 details our knowledge embedding framework, while Section 5 discusses our experiments, and Section 6 summarizes the conclusions.

2. Related Work
One of the most popular sources of lexico-semantic knowledge and an approximation of common-sense knowledge is WordNet (Miller, 1995), which encodes 117,000 concepts organized as synonym sets, or synsets, spanned by a small number of relation types. The relations between concepts capture hypernymy, meronymy (for nouns), entailment, causality (between verbs), as well as antonymy. In addition, each WordNet concept is associated with a gloss defining the concept.
While the WordNet semantic graph was carefully generated by expert lexicographers, the JeuxDeMots project (Lafourcade, 2007) provides a crowd-sourced lexico-semantic graph which captures some of the relation types encoded in WordNet (e.g. synonymy, antonymy) for French. These relations were populated by allowing people to play word association games.
ConceptNet (Speer and Havasi, 2013) was also obtained through crowd-sourcing. It encodes 21 relation types which attempt to capture common-sense knowledge. However, these relations are sometimes under-specified: e.g., for the CAPABLE-OF relation, the sentence provided by ConceptNet, “Mary broke a vase”, was encoded as ⟨Mary, CAPABLE-OF, vase⟩. When interpreting this relation, what does it mean to be capable of a “vase”? Can Mary eat a vase? Can she buy a vase? Can she cook a vase? What if we consider, instead, the relation ⟨Mary, CAPABLE-OF, break⟩? Although perhaps closer to the original semantics of the sentence, it is still too general: What is Mary capable of breaking? Can she break a brick wall? Can she break a table? It would be preferable to encode only the fact that Mary broke a vase. However, because the relation CAPABLE-OF cannot account for the semantics of the verb “break”, it generates an underspecified relation between Mary and the vase.
DBPedia (Auer et al., 2007) curates a variety of knowledge mined from Wikipedia. It considers article titles as concepts and relies on the relations between them discerned from meta-data in Wikipedia. However, the number of concepts and relation types is limited and dependent on the availability of metadata.
Medical knowledge acquisition was targeted in multiple projects. Perhaps the most popular of them is the Unified Medical Language System (UMLS) (Bodenreider, 2004), which was designed by the National Library of Medicine in order to provide a standard set of concepts allowing commonalities amongst various medical terminologies to be interlinked, for example, by mapping ICD-9 diagnostic codes (Cimino et al., 1993) to Medical Subject Headings (MeSH) (Lowe and Barnett, 1994).
A component of UMLS, the Systematized Nomenclature of Medicine (SNOMED) (Stearns et al., 2001), provides a small number of relationship types between a subset of UMLS concepts, such as hypernymy and synonymy, as well as medical relations (e.g. treatment targets disease).
In addition, the Open Biomedical Ontologies (OBO) Foundry facilitates a collaborative experiment involving scientific (e.g. medical) ontologies (Smith et al., 2007).
Neither the UMLS nor the ontologies linked in the OBO Foundry represent medical knowledge probabilistically, nor do they capture many of the relations between medical concepts.


Attempts to address these limitations were made in our previous work (Goodwin and Harabagiu, 2013), where a qualified medical knowledge graph (QMKG) was presented. However, the QMKG does not capture the compositional semantics of medical concepts, nor the lexical relations between them. Both of these aspects are accounted for in the knowledge framework presented in this paper.

3. The Textual Data
To acquire common-sense knowledge as well as medical knowledge, we have relied on several sources of textual data.

3.1. Textual Data Expressing Common-sense Knowledge

We considered two collections of narrative texts which we believed would be useful for common-sense reasoning: (1) narrative personal stories from weblog articles, and (2) Google’s syntactic n-gram dataset obtained from stories sourced from Google Books. Narratives of personal stories have held interest within the artificial intelligence community for decades due to the rich semantic information they contain. For example, as conjectured by Schank (1983), personal stories can be used to pin-point gaps in a knowledge base. Moreover, as argued in (Schank and Abelson, 1995), the most common source of common-sense knowledge for humans is that of narrative communication, or stories. To provide additional common-sense knowledge, we also considered WordNet glosses.

3.1.1. Narrative Blog Posts
We used the data provided by Spinn3r, Inc. for the 2009 International Conference on Weblogs and Social Media (ICWSM), sponsored by the Association for the Advancement of Artificial Intelligence (AAAI) (Burton et al., 2009), which consists of 44 million blog posts authored between August 1st and October 1st of 2008. The dataset covers many major news events, such as the 2008 Olympics, both United States presidential nominating conventions, the United States financial crisis, etc. Gordon and Swanson identified nearly one million English-language narrative blog posts discussing personal stories in this dataset, using a supervised neural network based on lexical features (Gordon and Swanson, 2009). A total of 937,994 blog posts were classified as containing narrative stories.

3.1.2. Google Books
Inspired by (Akbik and Michael, 2014), we considered a secondary source of narrative information: the Syntactic N-Grams dataset (Goldberg and Orwant, 2013), which was extracted from 3.5 million digitized English books available through Google Books (Michel et al., 2011). The dataset contains over 10 billion syntactic n-grams, which are rooted syntactic dependency tree fragments (noun phrases and verb phrases), and is the largest publicly available corpus of its kind. Each tree fragment is annotated with its dependency information, its head word, and the frequency with which it occurred. An example tree fragment is (n0) “when/WRB the/DT small/JJ boy/NN ate/VBD cookies/NN”. We converted each tree fragment in this dataset into a sentence by removing the leading relative pronouns, conjunctives, and prepositions. In this way, for the tree fragment n0 we obtain the sentence s0, “the small boy ate cookies.”
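To make this conversion concrete, here is a minimal Python sketch (our illustration, not the authors' code); the exact set of Penn Treebank tags treated as leading function words is an assumption:

# Minimal sketch of the tree-fragment-to-sentence conversion described above.
# Assumption: fragments arrive as "word/POS" tokens, and leading relative
# pronouns, conjunctions, and prepositions are identified by Penn Treebank tags.
LEADING_FUNCTION_TAGS = {"WRB", "WDT", "WP", "IN", "CC"}

def fragment_to_sentence(fragment: str) -> str:
    tokens = [tok.rsplit("/", 1) for tok in fragment.split()]
    # Drop leading tokens for as long as their POS tag marks a function word.
    while tokens and tokens[0][1] in LEADING_FUNCTION_TAGS:
        tokens.pop(0)
    return " ".join(word for word, _pos in tokens)

# The example fragment n0 from the text:
print(fragment_to_sentence("when/WRB the/DT small/JJ boy/NN ate/VBD cookies/NN"))
# prints: the small boy ate cookies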

Figure 1: The distribution of electronic health record (EHR) types in the MIMIC-III clinical database. Counts by type: Radiology 870,504; Other 822,497; Nursing 223,556; ECG 209,051; Physician 141,624; Discharge summary 55,177; Echo 45,794; Nutrition 9,418; General 8,301; Rehab services 5,431; Social Work 2,670; Case Management 967; Pharmacy 103; Consult 98.

3.1.3. WordNet Glosses
Because each concept in WordNet is (a) represented by a synset and (b) defined by a gloss, we generate a set of sentences expressing these definitions. For example, the synset y1 = {“male child”, “boy”} is defined as “a youthful male person.” In addition, the glosses also provide sentence examples of the concepts represented by the synset. For example, y1 is exemplified by (s1) “the baby was a boy,” (s2) “she made the boy brush his teeth every night,” and (s3) “most soldiers are only boys in uniform.” We also created sentences based on the glosses by using each member of the synset as a genus and the gloss as the differentia. In this way, we obtain from the gloss of y1 the sentences (s4) “a male child is a youthful male person” and (s5) “a boy is a youthful male person.”
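As an illustration of how such sentences can be generated, the following sketch uses NLTK's WordNet interface (a tooling choice of ours; the paper does not name an API, and the synset identifier below is an assumption):

# Minimal sketch: generating genus/differentia sentences and usage examples
# from WordNet glosses, using NLTK (requires nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def gloss_sentences(synset):
    # One "genus is differentia" sentence per synset member, e.g.
    # "a boy is a youthful male person".
    gloss = synset.definition()
    for lemma in synset.lemma_names():
        yield "a %s is %s" % (lemma.replace("_", " "), gloss)
    # The glosses also provide example sentences for the synset.
    yield from synset.examples()

for sentence in gloss_sentences(wn.synset("male_child.n.01")):
    print(sentence)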

3.2. Medical Textual Data
With the advent of electronic health records (EHRs), a growing set of medical narratives is becoming available. These narratives reflect domain-specific medical knowledge communicated by physicians for the interpretation of other medical professionals. Thus, they reflect highly-specialized medical knowledge, in both explicit and implicit ways. We were interested in acquiring medical knowledge from two types of textual data: (1) medical narratives from a variety of medical record types, as well as (2) EHRs which belong to a single record type. Both types of textual data needed to be available from massive medical archives.

3.2.1. MIMIC-III Clinical Database
MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising de-identified clinical data for over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 (Saeed et al., 2011). The database includes narratives of a variety of EHR types, such as discharge summaries, nursing notes, and radiology reports. In this work, we considered all EHR types, encompassing 2,426,930 medical records. The distribution of EHR types in this dataset is shown in Figure 1.

3.2.2. EEG Reports
We considered the Temple University Hospital (TUH) corpus of electroencephalogram (EEG) reports (Harati et al., 2014).


Generated by the Neural Engineering Data Consortium (NEDC) at Temple University, the corpus contains nearly 20,000 EEG reports conducted from 2002 to 2013. The reports include the patient’s clinical history and medications, a description of the EEG setting and configuration, as well as the physician’s impressions and clinical findings. The descriptions document any notable epileptic activity observed by the physician when interpreting the EEG signal data, such as:

The EEG is diffusely slow but the patient transitions in and out of a drowsy state with varying amounts of beta. There is small, shifting asymmetries noted. Stimulation of the patient produces eye opening.

The neurologist also records her impressions of the previously described observations and epileptiform activities, e.g.:

Abnormal EEG due to:
Generalized background slowing.
Shifting slowing.
Shifting beta asymmetries.

The EEG impression is further analyzed for possible clinical correlations or clinical findings, in which the physician interprets the findings and describes a diagnosis or general conclusion, e.g.:

No epileptiform features were observed.
This EEG is more supportive of a bihemispheric process.

4. Knowledge Acquisition Framework
To generate knowledge embeddings as representations of (1) common-sense knowledge as well as (2) medical domain knowledge, we used the following steps:
STEP 1: Extract concepts and the lexicalized relations between them from texts;
STEP 2: Generate a knowledge graph representing the concepts as vertices and the lexicalized relations as edges between them; and
STEP 3: Learn an optimal, semantically-smooth embedding of this knowledge graph.
This knowledge acquisition framework has the advantage of capturing all the lexicalized relations in which a concept is involved. In addition, it learns representations which involve compositional semantics both at the concept and relation levels. This representation is further enhanced through smoothing, which resolves the data sparsity problem. Finally, the knowledge embedding which is learned informs the plausibility estimation of both existing and new relations between concepts.

4.1. Information Extraction of Concepts and Relations

Stanford’s automatic open-domain information extraction system (Angeli et al., 2015) enabled us to discover structured relation triples (SRTs) from natural language sentences. SRTs were discovered without relying on a predefined set of concepts, set of relation types, or vocabulary. Figure 2 illustrates examples of SRTs discovered from personal weblogs, Google Books, as well as WordNet glosses. Figure 3 illustrates examples of SRTs extracted from the MIMIC-III database as well as the TUH EEG corpus.
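One way to obtain such triples is to query a Stanford CoreNLP server, which bundles the Open IE system of Angeli et al. (2015). The sketch below is our illustration, not the authors' pipeline, and assumes a CoreNLP server is already running locally on port 9000:

# Minimal sketch: extracting structured relation triples (SRTs) via a local
# Stanford CoreNLP server (assumption: started separately, listening on :9000).
import json
import requests

def extract_srts(sentence, url="http://localhost:9000"):
    props = {"annotators": "tokenize,ssplit,pos,lemma,depparse,natlog,openie",
             "outputFormat": "json"}
    resp = requests.post(url, params={"properties": json.dumps(props)},
                         data=sentence.encode("utf-8"))
    resp.raise_for_status()
    # Each sentence carries an "openie" list of subject/relation/object triples.
    return [(t["subject"], t["relation"], t["object"])
            for sent in resp.json()["sentences"] for t in sent["openie"]]

print(extract_srts("Plantain slices are cooked in hot oil."))
# e.g. [('Plantain slices', 'are cooked in', 'hot oil')]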

Figure 2: Examples of structured relation triples expressing common-sense knowledge.
Personal Story Weblogs:
⟨white phosphorus smoke, may cause, burns⟩
⟨smoke, is, mist⟩
⟨plantain slices, are cooked in, hot oil⟩
⟨plantains, are immediately placed in, salt water⟩
⟨scattered showers, caused, flooding in parts of the state⟩
⟨urban areas, were hit by, the deluge⟩
Google N-Grams:
⟨bananas, hung above, assortment of fruit⟩
⟨smoke, rose above, the trees⟩
WordNet:
⟨plantains, are, starchy banana-like fruit⟩
⟨the fire, produced, tower of black smoke⟩
⟨the plains, are fertilized by, annual floods⟩

Figure 3: Examples of structured relation triples expressing medical domain knowledge.
MIMIC-III Notes:
⟨large hematoma, caused, outlet obstruction⟩
⟨aneurysm, was coiled with, 4-mm 360 coils⟩
⟨aneurysm, obliterated by, successive GDC coils⟩
EEG Reports:
⟨drowsiness, is characterized by, increase in background beta⟩
⟨abnormal EEG, due to, left anterior temporal sharp wave⟩
⟨excess theta, may be due to, PRES syndrome⟩
⟨generalized spike & wave, consistent with, generalized epilepsy⟩
⟨mildly abnormal EEG, due to, mild background slowing⟩
⟨abnormal EEG, due to, rhythmic background slowing⟩

4.2. Generating the Knowledge Graph
In order to compactly represent the semantic information obtained from each SRT extracted from each sentence in our dataset, we generated a knowledge graph in which each concept was represented by a vertex and each lexical relation between two concepts by a directed edge. Because we considered lexical relations between concepts, rather than individual words, the resulting knowledge graph was quite large, containing 32,522,807 vertices and 52,209,411 edges. Thus, we relied on the Apache Spark (Zaharia et al., 2010) web-scale data processing engine to generate a distributed knowledge graph. This allowed us to leverage Spark’s GraphX library (Xin et al., 2013) for graph-parallel computations. However, processing such a large knowledge graph is challenging, even in a distributed architecture. Hence, we decided to reduce the number of edges without information loss. This was achieved by replacing all edges corresponding to each extraction of the same SRT with a single edge which also encodes the frequency of the SRT in the dataset. We implemented this by a map-reduce counting operation, sketched below.
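A minimal PySpark sketch of this counting operation (ours; the inlined triples stand in for the real extraction output) might look as follows:

# Minimal sketch of the map-reduce counting step: collapse repeated
# extractions of the same SRT into a single edge annotated with its frequency.
from pyspark import SparkContext

sc = SparkContext(appName="srt-edge-counts")

# Assumption: in practice the triples are read from the extraction output;
# a few are inlined here for illustration.
srts = sc.parallelize([
    ("plantain slices", "are cooked in", "hot oil"),
    ("plantain slices", "are cooked in", "hot oil"),
    ("smoke", "rose above", "the trees"),
])

edges = (srts.map(lambda t: (t, 1))             # map: one count per extraction
             .reduceByKey(lambda a, b: a + b))  # reduce: sum counts per unique SRT

for (c1, r, c2), freq in edges.collect():
    print("%s --[%s, freq=%d]--> %s" % (c1, r, freq, c2))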


However, this did not account for another problem which arose during knowledge acquisition: knowledge sparsity. Because concepts are represented along with their attributes, and lexicalized relations provide no means of capturing similarity of meaning, we cannot recognize in the data, no matter how large, important relational information that accounts for common-sense or domain-specific knowledge. For example, in the knowledge graph it is difficult to recognize that the concepts hot oil, cooking oil, and olive oil may be similar, despite each corresponding to a separate node. Likewise, each lexical relation is represented as a separate edge, such that it is difficult to identify similar lexical relations: e.g., are cooked in, are cooked with, and were cooked in each constitute different edges in the knowledge graph. In order to address these problems, we learned an optimal embedding of our knowledge graph into a semantically-smooth continuous vector space. Moreover, this embedding allowed us to infer non-explicit relational knowledge.

4.3. Learning the Knowledge Embedding
Recently, new methods for knowledge graph completion have produced a promising paradigm for embedding which is able to infer new knowledge from a knowledge graph (Bordes et al., 2013). In this paradigm, each concept is represented as an N-dimensional vector and each relation is represented by an operation in the R^N space, such that new knowledge can be asserted by simple vector operations. The embeddings are learned by minimizing a global loss function over all the concepts and their relations in the knowledge graph. However, new knowledge often contains concepts which are not present in the knowledge graph, a problem which we address through the semantic compositionality of concept and relation embeddings. Moreover, smoothing these enhanced embeddings produces additional new knowledge. Given these observations, learning the knowledge embedding is performed in five steps:
LEARNING STEP 1: Multi-dimensional representation of concepts and relations;
LEARNING STEP 2: Accounting for semantic composition;
LEARNING STEP 3: Evaluating the plausibility of knowledge;
LEARNING STEP 4: Optimizing knowledge plausibility; and
LEARNING STEP 5: Smoothing the knowledge embeddings.

4.3.1. Multi-dimensional Representation of Concepts and Relations

As in TransE (Bordes et al., 2013), each vertex representing a concept in the knowledge graph was cast as a point in the continuous space R^N, where N is a parameter indicating the cardinality of our vector representation (in our case, N = 200). Likewise, each edge representing a lexical relation is interpreted as the translation vector which connects the points representing its arguments. Unlike TransE, however, we also learned a vector for each word used to express every concept as well as each word used to express the lexical relation. This allowed our embedding space to account for the role of semantic composition on (1) the types of lexical relations shared by concepts, and (2) the types of concepts spanned by each lexical relation. Because concepts and relations are often expressed by more than one word, the embedding needs to represent the semantics of the entire expression. Single words are cast as points in R^N. Noun phrases (which account for concepts) and verb phrases (which account for relations) have a syntactic head which corresponds to a single word. Thus, the entire concept or lexical relation is viewed as a modification of the head word, provided by the role of the modifiers in the noun phrase or verb phrase, respectively.

Figure 4: Examples of lexical relations encoded in the knowledge graphs and their representations in the embedding space for common-sense knowledge (top: ⟨plantains, are cooked in, hot oil⟩) and medical domain knowledge (bottom: ⟨abnormal EEG, due to, background slowing⟩ and ⟨mildly abnormal EEG, due to, mild background slowing⟩).

Figure 4 illustrates an example of common-sense knowledge and an example of medical domain knowledge in the knowledge graph (on the left) as well as in the embedded space (on the right). Each word is represented as a point in the embedded space, shown as a square. We capture the role of semantic composition and lexical relations as linear transformations in the embedded space, which are represented as arrows showing the point obtained after applying the transformation. For example, the top of Figure 4 shows how the relation ⟨plantains, are cooked in, hot oil⟩ is represented in the embedding space: a blue square for the word “plantains”, a green square for the word “oil”, and a yellow square for the phrase “hot oil”. Figure 4 shows that applying the lexical relation “are cooked in” to the point “plantains” produces the same point as the phrase “hot oil.” Likewise, it shows that the point for the phrase “hot oil” is obtained by applying the “hot” semantic composition vector to the word “oil”. This example highlights the ability of the knowledge embeddings to infer new generalized relationships, because the vector generated for the word “oil” in the phrase “hot oil” is generated such that it can be obtained by applying the “are cooked in” relationship to the word “plantains”. Figure 4 also illustrates semantic composition for the domain of medicine: applying the transformation associated with the word “mild” (or “mildly”) adjusts the “due to” relationship from “abnormal EEG” to “background slowing” such that when “due to” is applied to “mild background slowing”, the phrase “mildly abnormal EEG” is produced instead of the more general concept “abnormal EEG”. This demonstrates that the embedding procedure is able to account for the effects of individual words on the semantics of the discovered lexical relationships.

4.3.2. Accounting for Semantic Composition
Each vertex and each edge in our knowledge graph corresponds to a concept or relation which may be expressed with a multi-word sequence w1 . . . wL. To account for the contribution of each word to the semantics of the entire expression of the concept or relation, we represented each word by a bag-of-words vector on which two composition functions are applied. The first composition function, denoted as g(•), uses the bag-of-words vector representations of each word expressing the concept to produce a new vector representing the entire concept.


Figure 5: Neural architecture for computing the plausibility of a lexical relation. For the sentence “plantain slices are cooked in hot oil”, the words w1 . . . w7 are composed by g(•) into the concept vectors c1 and c2 and by h(•) into the relation vector r, and the plausibility is computed as f(•) = ‖c1 + r − c2‖ℓ2.

The second composition function, h(•), uses the bag-of-words vector representations expressing the lexicalized relation to generate a vector representation of the relation. Each of these composition functions was defined as a linear recurrence relation, which first maps the head word to a vector and then recursively modifies that vector for each remaining word in the sequence. Formally, for each composition function we learned a composition matrix (G or H) and a bias vector (bg or bh) which captures the effect of each additional word on the vector for the word sequence. We thus define two semantic composition functions:

g(w1 . . . wL) = wL^T G + bg, if L = 1; otherwise wL^T G + bg + g(w1 . . . wL−1). (1)

and

h(w1 . . . wL) = wL^T H + bh, if L = 1; otherwise wL^T H + bh + h(w1 . . . wL−1). (2)

where g(•) recursively learns a vector for each concept by repeatedly applying a linear transformation for each word in the concept using the learned semantic composition matrix G and the concept bias vector bg; and h(•) recursively learns a vector for each open-domain relation by repeatedly applying a linear transformation for each word in the relation using the learned semantic relation composition matrix H and the relation bias vector bh.

4.3.3. Evaluating the Plausibility of Knowledge
By representing concepts and lexical relations as vectors, we can measure the plausibility of any arbitrary lexical relation r between any possible pair of concepts, c1 and c2. Recall that we have represented each lexical relation as a geometric translation vector. This means that for each concept c1, the most likely concept to be related to it by lexical relation r should be embedded at the point c1 + r. This allows us to measure the plausibility of any triple ⟨c1, r, c2⟩ as the distance between the point in the embedded space most likely to be related to c1 by r and the point corresponding to c2, defined by:

f(⟨c1, r, c2⟩) = ‖g(c1) + h(r) − g(c2)‖ℓ2 (3)

Thus, Equation 3 ensures the geometric property that the point associated with concept c2 is obtained by applying the translation vector associated with r to the point for concept c1, i.e., g(c2) ≈ g(c1) + h(r). Figure 5 illustrates the relationship between Equations 1, 2, and 3: the plausibility of a lexical relation depends on the semantic composition of each concept as well as the lexical relation.
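To make Equations 1-3 concrete, the following numpy sketch (ours; the dimensionality matches the paper's, but the random parameters and vocabulary are illustrative) composes word vectors into concept and relation vectors and scores a triple:

# Minimal numpy sketch of Equations (1)-(3): recursive linear composition of
# word vectors, and the translation-based plausibility score (lower = better).
import numpy as np

rng = np.random.default_rng(0)
N = 200                                        # embedding dimensionality
vocab = {w: rng.normal(size=N) for w in
         "plantain slices are cooked in hot oil".split()}
G, H = rng.normal(size=(N, N)), rng.normal(size=(N, N))  # composition matrices
bg, bh = rng.normal(size=N), rng.normal(size=N)          # bias vectors

def compose(words, M, b):
    # Equations (1)/(2): w_L^T M + b, plus the composition of the prefix.
    vec = vocab[words[-1]] @ M + b
    if len(words) > 1:
        vec = vec + compose(words[:-1], M, b)
    return vec

def plausibility(c1, r, c2):
    # Equation (3): || g(c1) + h(r) - g(c2) ||_l2
    return np.linalg.norm(compose(c1, G, bg) + compose(r, H, bh)
                          - compose(c2, G, bg))

print(plausibility(["plantain", "slices"], ["are", "cooked", "in"], ["hot", "oil"]))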

4.3.4. Optimizing Knowledge Plausibility
Having defined the plausibility of any structured relation triple ⟨c1, r, c2⟩, we can learn the optimal values of the latent composition matrices G and H, as well as the latent bias vectors bg and bh. To do this, we try to find the values of these latent variables which maximize the plausibility of all positive edges in the knowledge graph and minimize the plausibility of negative edges which are not consistent with the knowledge graph. We construct so-called negative edges by randomly sampling edges from the knowledge graph and replacing the lexical relation with another, random lexical relation such that the new edge is not in the network. This allows optimal latent variables to be learned by comparing the margin (gap) in the geometric space between edges in the knowledge graph and the negative edges which do not occur in the knowledge graph. Formally, we define the margin loss (Guo et al., 2015):

L = Σ_{t+ ∈ SN} Σ_{t− ∈ SN−} max(0, γ + f(t+) − f(t−)) (4)

where t+ = ⟨c1, r, c2⟩ ∈ SN refers to each triple in the knowledge graph, t− ∈ SN− refers to artificially created negative triples, and γ indicates how much of a margin (or distance) should exist between positive triples encoded in the knowledge graph and the negative triples we randomly generated. As described in (Bordes et al., 2013), we applied stochastic gradient descent to solve this minimization problem.
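A minimal sketch of this objective (ours; the stand-in scorer and triples are illustrative, and the gradient update itself is left to any SGD implementation) is:

# Minimal sketch of Equation (4): margin loss over positive triples and
# randomly corrupted negative triples.
import random

def negative_sample(triple, relations, graph):
    # Corrupt a positive edge by swapping in a random relation
    # such that the corrupted edge is not in the knowledge graph.
    c1, _, c2 = triple
    while True:
        r_neg = random.choice(relations)
        if (c1, r_neg, c2) not in graph:
            return (c1, r_neg, c2)

def margin_loss(positives, relations, graph, f, gamma=1.0):
    # Hinge term: penalize whenever the positive triple's distance f(t+) is
    # not at least gamma smaller than the negative triple's distance f(t-).
    loss = 0.0
    for t_pos in positives:
        t_neg = negative_sample(t_pos, relations, graph)
        loss += max(0.0, gamma + f(*t_pos) - f(*t_neg))
    return loss

graph = {("plantains", "are cooked in", "hot oil")}
relations = ["are cooked in", "rose above", "due to"]
f = lambda c1, r, c2: float(len(r))   # stand-in scorer for illustration only
print(margin_loss(list(graph), relations, graph, f))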


4.3.5. Smoothing the Knowledge Embeddings
The embedded knowledge graph obtained by Equation 3 relies on distributional information in order to create an optimal embedding for each concept and each lexical relation in the data. In order to ensure that the embeddings learned for the knowledge graphs are able to account for lexical variation, we introduce two regularization terms based on notions of lexical semantics. These terms are based on two assumptions: (1) vertices in the knowledge graph which have a high functional similarity (Turney, 2012) – that is, concepts which participate in many of the same relationships – should be located close to each other in the embedded space; and (2) edges in the knowledge graph which have a high relational similarity (Nakov and Hearst, 2008) – that is, lexical relations which often describe the same participants – should also be located close to each other in the embedded space. These assumptions allow the plausibility computed from our embedding space to account for lexical variation in concepts as well as lexical relations by ensuring that semantically similar concepts occupy geometrically close points (Guo et al., 2015). To do this, we define the functional similarity matrix SF(i, j), which returns the number of lexical relations in which both concepts i and j participate. This allows us to measure the smoothness of similar vertices in the embedding space:

R1 = (1/2) Σ_{i=1}^{V} Σ_{j=1}^{V} ‖vi − vj‖²ℓ2 SF(i, j) (5)

where V is the number of vertices in the knowledge graph. Equation 5 ensures that the distance between two concepts which participate in similar relations is small. In order to ensure that lexical relations with high relational similarity have similar translation vectors, we construct a relational similarity matrix SR(i, j) which returns the total number of concepts for which lexical relations i and j are both edges in the knowledge graph. This allows us to account for the smoothness of similar edges in the embedding space:

R2 = (1/2) Σ_{i=1}^{E} Σ_{j=1}^{E} ‖ei − ej‖²ℓ2 SR(i, j) (6)

where E is the number of edges in the knowledge graph. Equation 6 ensures that similar relations are embedded as similar transformations in the embedded space.
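Both regularizers share the same form, as the following numpy sketch (ours; toy sizes) shows:

# Minimal numpy sketch of Equations (5)-(6): a smoothness term that pulls
# embeddings of similar items (concepts or relations) toward each other.
import numpy as np

def smoothness(embeddings, similarity):
    # R = 1/2 * sum_ij ||x_i - x_j||^2 * S(i, j), over rows x_i of `embeddings`.
    total = 0.0
    n = embeddings.shape[0]
    for i in range(n):
        for j in range(n):
            diff = embeddings[i] - embeddings[j]
            total += similarity[i, j] * float(diff @ diff)
    return 0.5 * total

rng = np.random.default_rng(1)
V = rng.normal(size=(4, 8))     # toy vertex embeddings (4 concepts, 8 dims)
S_F = np.eye(4) + 0.5           # toy functional-similarity counts S_F(i, j)
print(smoothness(V, S_F))       # R1; R2 uses edge embeddings and S_R instead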

5. Experimental Results
Evaluating the quality of any common-sense knowledge base is known to be difficult (Singh et al., 2002). Consequently, we measured two aspects of our knowledge embedding: (1) the quality of the relations in the knowledge graph, and (2) the quality of any new relations inferred in the embedded space. When evaluating the quality of common-sense knowledge, we were inspired by previous efforts made to evaluate the OpenMind common-sense acquisition dataset (Singh et al., 2002). For each knowledge graph, we sampled 200 relations, and 200 inferred relations, obtained by sampling random vertices v and relations r and finding the vertices closest to the point obtained by v + r. Each of these relations was evaluated on a shifted 5-point Likert scale for the following desirable properties: (1) generality (such that -2 indicates a specific fact and +2 indicates a general statement about the world), (2) plausibility (such that -2 indicates a relationship which does not make sense in the world, and +2 indicates a relationship which makes complete sense), and (3) neutrality (such that -2 indicates highly-biased opinions and +2 indicates neutral sentiment). A total of three annotators were used, obtaining an inter-annotator agreement of 84.3% (according to Cohen’s kappa coefficient). We evaluated both domains of knowledge: (1) general world knowledge and (2) medical knowledge. The average rating for generality was 0.8 and 1.2, respectively, indicating that while the knowledge from both sources was general, the medical knowledge was more so. Likewise, the average rating for plausibility was 0.42 and 1.02, reflecting the fact that most medical relations stood alone as plausible facts while common-sense facts often rely on more context. This suggests that future work could benefit by incorporating coreference resolution. Finally, the average rating for neutrality was 0.05 and 1.85, suggesting that many of the relations encoded in the general dataset were biased by the opinions of the author, but that medical relationships were highly neutral.
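For reference, sampling an inferred relation amounts to a nearest-neighbor query around the point v + r; a minimal sketch (ours; toy embeddings) is:

# Minimal sketch of sampling inferred relations: pick a vertex v and a
# relation r, then return the vertices nearest to the point v + r.
import numpy as np

rng = np.random.default_rng(2)
concepts = ["plantains", "hot oil", "salt water", "smoke"]
C = rng.normal(size=(len(concepts), 8))     # toy concept embeddings
R = {"are cooked in": rng.normal(size=8)}   # toy relation (translation) vectors

def infer(concept_idx, relation, k=2):
    target = C[concept_idx] + R[relation]                # the point v + r
    dists = np.linalg.norm(C - target, axis=1)           # L2 distance to all vertices
    return [concepts[i] for i in np.argsort(dists)[:k]]  # k nearest concepts

print(infer(0, "are cooked in"))  # candidate tails for <plantains, are cooked in, ?>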

6. Conclusions
In this paper, a framework for learning knowledge embeddings as representations of common-sense knowledge and domain-specific knowledge for medicine is presented. The novelty of the knowledge embeddings stems from the incorporation of semantic compositionality and of two regularization factors that accomplish the smoothing of the acquired knowledge, representing the new, un-articulated information discerned from vast text collections. To learn common-sense knowledge embeddings we have relied on a large set of blogs and books as well as the defining glosses from WordNet. To learn the knowledge embeddings for medicine, we have relied on two large sets of electronic health records. We evaluated the quality of the common-sense knowledge obtained by the framework presented in this paper through manual review of randomly sampled relations encoded in the knowledge graph, as well as implicit relations inferred in the embedding space. The evaluation results for the two forms of knowledge that we have acquired are promising: the same knowledge acquisition methodology based on learning knowledge embeddings works well both for common-sense knowledge and for medical knowledge. Interestingly, the common-sense knowledge that we have acquired was evaluated as being less neutral than the medical knowledge, because it often reflected the opinion of the knowledge writer. In addition, the medical knowledge that was acquired was evaluated as more plausible than the common-sense knowledge.

7. Acknowledgements
This work was supported by the National Human Genome Research Institute of the National Institutes of Health under award number 1U01HG008468. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


8. Bibliographical References
Akbik, A. and Michael, T. (2014). The Weltmodell: A data-driven commonsense knowledge base. In LREC, pages 3272–3276.
Angeli, G., Premkumar, M. J., and Manning, C. D. (2015). Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL, pages 26–31.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007). DBpedia: A nucleus for a web of open data. Springer.
Bodenreider, O. (2004). The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267.
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.
Burton, K., Java, A., and Soboroff, I. (2009). The ICWSM 2009 Spinn3r dataset. In Proceedings of the Third Annual Conference on Weblogs and Social Media (ICWSM 2009), San Jose, CA.
Cimino, J. J., Johnson, S. B., Peng, P., and Aguirre, A. (1993). From ICD9-CM to MeSH using the UMLS: a how-to guide. In Proceedings of the Annual Symposium on Computer Application in Medical Care, page 730. American Medical Informatics Association.
Goldberg, Y. and Orwant, J. (2013). A dataset of syntactic-ngrams over time from a very large corpus of English books. In Second Joint Conference on Lexical and Computational Semantics (*SEM), volume 1, pages 241–247.
Goodwin, T. and Harabagiu, S. M. (2013). Automatic generation of a qualified medical knowledge graph and its usage for retrieving patient cohorts from electronic medical records. In Semantic Computing (ICSC), 2013 IEEE Seventh International Conference on, pages 363–370. IEEE.
Gordon, A. and Swanson, R. (2009). Identifying personal stories in millions of weblog entries. In Third International Conference on Weblogs and Social Media, Data Challenge Workshop, San Jose, CA.
Guo, S., Wang, Q., Wang, B., Wang, L., and Guo, L. (2015). Semantically smooth knowledge graph embedding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 84–94.
Harati, A., Lopez, S., Obeid, I., Picone, J., Jacobson, M., and Tobochnik, S. (2014). The TUH EEG corpus: A big data resource for automated EEG interpretation. In Signal Processing in Medicine and Biology Symposium (SPMB), 2014 IEEE, pages 1–5. IEEE.
Lafourcade, M. (2007). Making people play for lexical acquisition with the JeuxDeMots prototype. In SNLP'07: 7th International Symposium on Natural Language Processing, page 7.
Lenat, D. B. and Guha, R. V. (1989). Building Large Knowledge-Based Systems; Representation and Inference in the Cyc Project. Addison-Wesley Longman Publishing Co., Inc.
Lowe, H. J. and Barnett, G. O. (1994). Understanding and using the Medical Subject Headings (MeSH) vocabulary to perform literature searches. JAMA, 271(14):1103–1108.
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182.
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.
Nakov, P. and Hearst, M. A. (2008). Solving relational similarity problems using the web as a corpus. In ACL, pages 452–460.
Saeed, M., Villarroel, M., Reisner, A. T., Clifford, G., Lehman, L.-W., Moody, G., Heldt, T., Kyaw, T. H., Moody, B., and Mark, R. G. (2011). Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public-access intensive care unit database. Critical Care Medicine, 39(5):952.
Schank, R. C. and Abelson, R. P. (1995). Knowledge and memory: The real story. Advances in Social Cognition, 8:1–85.
Schank, R. C. (1983). Dynamic Memory: A Theory of Reminding and Learning in Computers and People. Cambridge University Press.
Singh, P., Lin, T., Mueller, E. T., Lim, G., Perkins, T., and Zhu, W. L. (2002). Open Mind Common Sense: Knowledge acquisition from the general public. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 1223–1237. Springer.
Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., et al. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25(11):1251–1255.
Speer, R. and Havasi, C. (2013). ConceptNet 5: A large semantic network for relational knowledge. In The People's Web Meets NLP, pages 161–176. Springer.
Stearns, M., Price, C., Spackman, K., and Wang, A. (2001). SNOMED Clinical Terms: overview of the development process and project status. In Proceedings of the AMIA Symposium, page 662. American Medical Informatics Association.
Turney, P. D. (2012). Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, pages 533–585.
Xin, R. S., Gonzalez, J. E., Franklin, M. J., and Stoica, I. (2013). GraphX: A resilient distributed graph system on Spark. In First International Workshop on Graph Data Management Experiences and Systems, page 2. ACM.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. (2010). Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, volume 10, page 10.
