arXiv:1908.11536v1 [cs.CL] 30 Aug 2019

Learning to Infer Entities, Properties and their Relations from Clinical Conversations

Nan Du, Mingqiu Wang, Linh Tran, Gang Li, and Izhak Shafran

{dunan, mingqiuwang, tranlm, leebird, izhak}@google.com

Google Inc.

Abstract

Recently we proposed the Span Attribute Tagging (SAT) Model (Du et al., 2019) to infer clinical entities (e.g., symptoms) and their properties (e.g., duration). It tackles the challenge of a large label space and limited training data using a hierarchical two-stage approach that identifies the span of interest in a tagging step and assigns labels to the span in a classification step.

We extend the SAT model to jointly infer not only entities and their properties but also relations between them. Most relation extraction models restrict inferring relations to tokens within a few neighboring sentences, mainly to avoid high computational complexity. In contrast, our proposed Relation-SAT (R-SAT) model is computationally efficient and can infer relations over the entire conversation, spanning an average duration of 10 minutes.

We evaluate our model on a corpus of clinical conversations. When the entities are given, R-SAT outperforms baselines in identifying relations between symptoms and their properties by about 32% (0.82 vs 0.62 F-score) and by about 50% (0.60 vs 0.41 F-score) on medications and their properties. On the more difficult task of jointly inferring entities and relations, the R-SAT model achieves F-scores of 0.34 and 0.45 for symptoms and medications respectively, significantly better than the 0.18 and 0.35 of the baseline model. The contributions of the different components of the model are quantified using ablation analysis.

1 Introduction

The widespread adoption of Electronic Health Records by clinics across the United States has placed a disproportionately heavy burden on clinical providers, contributing to burnout among them (Wachter and Goldsmith, 2018; Xu, 2018; Arndt et al., 2017). There has been considerable interest, both in academia and industry, in automating aspects of documentation so that providers can spend more time with their patients. One such approach aims to generate clinical notes directly from doctor-patient conversations (Patel et al., 2018; Finley et al., 2018a,b). The success of such an approach hinges on extracting relevant information reliably and accurately from clinical conversations.

In this paper, we investigate the tasks of jointly inferring entities, specifically symptoms (Sx) and medications (Rx), their properties, and the relations between them from clinical conversations. These tasks are defined in Section 2. The key contributions of the work reported here include: (i) a novel model architecture for jointly inferring entities and their relations, whose parameters are learned using the multi-task learning paradigm (Section 4), (ii) a comprehensive empirical evaluation of our model on a corpus of clinical conversations (Section 6), and (iii) an understanding of the model's performance through an ablation study and human error analysis (Section 6.7). Since clinical conversations involve domain-specific knowledge, we also investigate the benefit of augmenting the input feature representation with knowledge graph embeddings. Finally, we summarize our conclusions and contributions in Section 7.

2 Task Definitions

For the purpose of defining the tasks, consider the snippet of a clinical conversation in Table 1.

2.1 The Symptom Task (Sx)

This task consists of extracting the tuples (symType, propType, propContent).

The “pain” in the example in Table 1 is annotated as symType (sym/msk/pain), where msk stands for the musculoskeletal system.

arX

iv:1

908.

1153

6v1

[cs

.CL

] 3

0 A

ug 2

019


DR: How often do you have pain in your arms?
PT: It hurts every morning.
DR: Are you taking anything for it?
PT: I’ve been taking Ibuprofen. Twice a day.

Table 1: An example to illustrate entities, properties and their relations. Entities – (sym/msk/pain: pain) & (meds/name: Ibuprofen); Properties – (symprop/freq: every morning) & (medsprop/freq: twice a day); Relations – (sym/msk/pain, symprop/freq, every morning), (Ibuprofen, medsprop/freq, twice a day).

We have pre-defined 186 categories for symptom types, curated by a team of practising physicians and scribes, based on how they appear in clinical notes. We deliberately abstained from more exhaustive symptom labels such as UMLS and ICD codes (Bodenreider, 2004) in favor of this smaller set, since our training data is limited.

The properties associated with the symptoms, propType, fall into four categories: symprop/severity, symprop/duration, symprop/location, and symprop/frequency. The propContent denotes the content associated with the property. In the running example, “every morning” is the content associated with the property type symprop/frequency.

Not all the symptoms mentioned in the course of clinical conversations are experienced by the patients. We explicitly infer the status of a symptom as experienced or not. This secondary task extracts the pair (symType, symStatus).

2.2 The Medication Task (Rx)

This task consists of extracting tuples of the form (medContent, propType, propContent).

While symptoms can be categorized into a closed set, the set of medications is very large and continually updated. Moreover, in conversations, we would like to extract indirect references such as “pain medications” as medContent. We define three types of properties: medsprop/dosage, medsprop/duration and medsprop/frequency. In the running example, “twice a day” is the propContent of the type medsprop/frequency associated with the medContent “Ibuprofen”.

3 Previous Work

Relation extraction is a long-studied problem in NLP and includes tasks such as ACE (Doddington et al., 2004), SemEval (Hendrickx et al., 2010), the i2b2/VA Task (Uzuner et al., 2011a), and the BioNLP Shared Task (Kim et al., 2013). Many early algorithms, such as the DIPRE algorithm by Brin (1998) and the SNOWBALL algorithm by Agichtein and Gravano (2000), relied on regular expressions and rules (Fundel et al., 2007; Peng et al., 2014). Subsequent work exploited syntactic dependencies of the input sentences. Features from the dependency parse tree were used in maximum entropy models (Kambhatla, 2004) and neural network models (Snow et al., 2005). Kernels were defined over tree structures (Zelenko et al., 2003; Culotta and Sorensen, 2004; Qian et al., 2008). More efficient methods were investigated, including shortest dependency path (Bunescu and Mooney, 2005) and subsequence kernels (Mooney and Bunescu, 2006). Recent work on deep learning models investigated convolutional neural networks (Liu et al., 2013), graph convolutional neural networks over pruned trees (Zhang et al., 2018), recursive matrix-vector projections (Socher et al., 2012) and Long Short-Term Memory (LSTM) networks (Miwa and Bansal, 2016). Other more recent approaches include two-level reinforcement learning models (Takanobu et al., 2019), two layers of attention-based capsule network models (Zhang et al., 2019), and self-attention with transformers (Verga et al., 2018). In particular, Miwa and Sasaki (2014); Katiyar and Cardie (2016); Zhang et al. (2017); Zheng et al. (2017); Verga et al. (2018); Takanobu et al. (2019) also seek to jointly learn the entities and the relations among them. A large fraction of past work focused on relations within a single sentence. The dependency-tree-based approaches have been extended across sentences by linking the root nodes of adjacent sentences (Gupta et al., 2019). Coreference resolution is a similar task, which requires finding all mentions of the same entity in the text (Martschat and Strube, 2015; Clark and Manning, 2016; Lee et al., 2017).

In the medical domain, the BioNLP shared task deals with gene interactions and is very different from our domain (Kim et al., 2013). The i2b2/VA challenge is closer to our domain of clinical notes; however, that task is defined on a small corpus of written discharge summaries (Uzuner et al., 2011b). The written domain benefits from cues such as section headings, which are unavailable in clinical conversations. For a wider survey of extracting clinical information from written clinical documents, see Liu et al. (2012).


[Figure 1: Variants of the R-SAT model architecture, with panel (a) for the symptoms task and panel (b) for the medications task. Each panel stacks a Word & Calligraphic Embedding layer, a Bi-LSTM Contextual Representation, a Feedforward Layer, an Entity Extraction Layer, an Attribute Tagging Layer, and a Memory Buffer Layer. The entity spans (“some pain”, “blood thinner”) are identified in a tagging step and pushed into a memory buffer along with their latent representation; in a subsequent step, the property spans (“my back”, “three months”) select the most related entity from the buffer.]

4 Model

Our application requires performing multiple inferences simultaneously: identifying symptoms, medications, their properties, and the relations between them. For this purpose, we adopt the well-suited multitask learning framework and develop a model architecture, illustrated in Figure 1, that utilizes our limited annotated corpus efficiently.

4.1 Input Encoder Layer

Let $x$ be an input sequence. We compute the contextual representation at the $k$-th step using a bidirectional LSTM, $h'_k = [\overrightarrow{h}(x_{\le k}|\overrightarrow{\Theta}_{LSTM}), \overleftarrow{h}(x_{\ge k}|\overleftarrow{\Theta}_{LSTM})]$, which is fed into a two-layer fully connected feed-forward network. For simplicity, we drop the index $k$ in the rest of the paper. The final features are represented as $h'' = \mathrm{MLP}(h'|\Theta_{FF})$. In our task, we found that the LSTM-based encoder performs better than the transformer-based encoder (Vaswani et al., 2017; Chen et al., 2018).
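For concreteness, here is a minimal PyTorch sketch of such an encoder (our own illustration, not the authors' released code; the class name `InputEncoder` is invented, and the layer sizes follow Table 2):

```python
import torch
import torch.nn as nn

class InputEncoder(nn.Module):
    """BiLSTM contextual encoder followed by a two-layer feed-forward net:
    h'_k = [fwd(x<=k), bwd(x>=k)],  h'' = MLP(h')."""

    def __init__(self, vocab_size, emb_dim=256, hidden_dim=1024,
                 ff_dims=(512, 256)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True concatenates forward and backward states
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, ff_dims[0]), nn.ReLU(),
            nn.Linear(ff_dims[0], ff_dims[1]), nn.ReLU())

    def forward(self, token_ids):             # (batch, seq_len)
        h_prime, _ = self.lstm(self.embed(token_ids))
        return self.mlp(h_prime)              # (batch, seq_len, ff_dims[1])
```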

Extending a Standard Tagging Model In a typical tagging model, the contextual representation of the encoder $h''$ is fed into a conditional random field (CRF) layer to predict BIO-style tags (Collobert et al., 2011; Huang et al., 2015; Ma and Hovy, 2016; Chiu and Nichols, 2016; Lample et al., 2016; Peters et al., 2017; Yang et al., 2017; Changpinyo et al., 2018). Such a model can be extended to predict relations. For example, in the utterance “I feel some pain in my back”, we could set up the tagger to predict the association between the symptom (sym/msk) and its property (symprop/loc) using a cross-product space, where “my back” is tagged with sym/msk+symprop/loc so that the relation prediction problem is reformulated as a standard sequence labeling task. Although this would be a viable option for tasks where the tag set is small (e.g., place, organization, etc.), the cross-product space in our Sx task is unfortunately large (e.g., 186 Sx labels × 3 Sx property types, and 186 Sx labels × 3 Sx status types).

4.2 Span Extraction Layer

We propose an alternative formulation that tackles the problem in a hierarchical manner. We first identify the span of interest using generic tags in BIO notation, namely (sym_B, sym_I) for symptoms and (symprop_B, symprop_I) for their properties, as in Figure 1(a). Likewise, (med_B, med_I) for medications and (medsprop_B, medsprop_I) for their properties, as shown in Figure 1(b). This corresponds to highlighting, for example, “some pain” and “my back” as spans of interest.

Given the latent representations $h = (h''_1, \cdots, h''_N)$ and the target tag sequence $y^e = (y_1, \cdots, y_N)$ (e.g., sym_B, sym_I, O, symprop_B, symprop_I), we use the negative log-likelihood $-\log P(y^e|h)$ under a CRF as the loss for identifying spans of interest: $-S(y^e,h) + \log \sum_{y'} \exp(S(y',h))$, where $S(y,h) = \sum_{i=0}^{N} A_{y_i,y_{i+1}} + \sum_{i=0}^{N} P(h_i, y_i)$ measures the compatibility between a sequence $y$ and $h$. The first component estimates the accumulated cost of transitions between neighboring tags using a learned transition matrix $A$. $P(h_i, y_i)$ is computed via the inner product $h_i^\top y_i$, where $y_i$ belongs to any sequence of tags $y$ that can be decoded from $h$. During training, $\log P(y^e|h)$ is estimated using the forward-backward algorithm; during inference, the most probable sequence is computed using the Viterbi algorithm.
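A minimal sketch of the compatibility score $S(y,h)$ above (our illustration; tensor shapes are assumptions, and the partition term and Viterbi decoding are left to the forward-backward and Viterbi algorithms as described):

```python
import torch

def crf_score(emissions, tags, transitions):
    """S(y, h) = sum_i A[y_i, y_{i+1}] + sum_i P(h_i, y_i).

    emissions: (seq_len, num_tags) inner products h_i^T y for every tag;
    tags: (seq_len,) LongTensor tag sequence;
    transitions: (num_tags, num_tags) learned matrix A.
    """
    emit = emissions[torch.arange(tags.size(0)), tags].sum()
    trans = transitions[tags[:-1], tags[1:]].sum()
    return emit + trans
```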

4.3 Attribute Tagging Layer

Using the latent representation of a highlighted span, we can predict one or more attributes of the span. In Figure 1(a), we can predict two attributes associated with “some pain”: sym/msk as the symptom label and symStatus/experienced as its status. Similarly, in Figure 1(b), the property span “three months” has the predicted property type medsprop/duration. Therefore, by forming semantic abstractions for each highlighted text span, we decompose a single complex tagging task in a large label space into correlated but simpler sub-tasks, which are likely to generalize better when the training data is limited.

Given the spans, either inferred or from the ground truth sequence $y^*$, a richer representation of the contexts can be used to predict attributes than otherwise possible. A contextual representation is computed from the starting index $i$ and ending index $j$ of each span:

$h^s_{ij} = \mathrm{Aggregate}(h_k \mid h_k \in h, i \le k < j)$   (1)

where $\mathrm{Aggregate}(\cdot)$ is the pooling function, implemented as the mean, sum, or attention-weighted sum of the latent states of the input encoder. The $k$-th attribute associated with the span is modeled using $P(y^k_{attr}|h^s_{ij})$. For example, when predicting symptom labels $sx$ and their associated status $st$, the target attributes are $y^0_{attr} := y^{sx}$ and $y^1_{attr} := y^{st}$. For predicting medication entities $rx$ and their properties $pr$, each span has only one attribute. Since each attribute comes from a pre-defined ontology, the multinomial distribution $P(y^k_{attr}|h^s_{ij})$ can be modeled as $\mathrm{Softmax}(h^s_{ij}|\Theta_k)$ for each attribute.
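A sketch of Eq. (1) with mean pooling and one softmax head per attribute (names such as `AttributeHeads` are ours, not from the paper):

```python
import torch
import torch.nn as nn

def aggregate(h, i, j):
    """Eq. (1) with Aggregate(.) = mean over the span indices [i, j)."""
    return h[i:j].mean(dim=0)

class AttributeHeads(nn.Module):
    """One softmax classifier per attribute of a highlighted span, e.g. a
    symptom-label head and a status head for the Sx task."""

    def __init__(self, span_dim, num_classes_per_attr):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(span_dim, n) for n in num_classes_per_attr)

    def forward(self, span_repr):
        # P(y^k_attr | h^s_ij) = Softmax(h^s_ij | Theta_k) for each head k
        return [head(span_repr).softmax(dim=-1) for head in self.heads]
```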

4.4 Memory Buffer Layer

One of the critical components of our model is the memory buffer. Most previous models for joint inference of entities and relations consider all spans of entities and properties. This has a computational complexity of $O(n^4)$ in the length of the input $n$, which makes it infeasible for applications such as ours where the input can often be 1k words or more. We circumvent this problem by using a memory buffer to cache all inferred candidate entity spans and test their relationship with inferred property spans. Note that, unlike methods that cascade two such stages, our model is trained end-to-end jointly with multi-task learning.

The memory buffer saves different entries for the symptom and medication tasks, as illustrated in Figure 1. At each occurrence of a symptom (medication) entity span, we push $m_k = \mathrm{Aggregate}(\{h^s_{ij}, e_s\})$ into the $k$-th position of the memory buffer. For the symptom task, $e_s$ is the learned word embedding of one of the labels in the closed label set. In the medication case, $e_s$ is the Aggregate of the learned word embeddings of the verbatim sub-sequence corresponding to the medication entity.
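A minimal sketch of such a buffer, assuming Aggregate({h, e}) is a mean over the two stacked vectors (which requires the span representation and entity embedding to share one dimensionality; the paper does not pin this down):

```python
import torch

class MemoryBuffer:
    """Caches one entry per inferred entity span."""

    def __init__(self):
        self.entries = []

    def push(self, span_repr, entity_emb):
        # m_k = Aggregate({h^s_ij, e_s}); for Sx, e_s is a label embedding,
        # for Rx the pooled embedding of the verbatim medication span
        self.entries.append(torch.stack([span_repr, entity_emb]).mean(0))

    def as_matrix(self):
        return torch.stack(self.entries)   # M = (m_1, ..., m_K)
```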

4.5 Relation Inference Layer

Each span of inferred property in the conversation is compared against each entry in the buffer. A property span is represented as $y^p = \mathrm{Aggregate}(\{h^p_{ij}, e_p\})$, where $e_p$ is the Aggregate of the word embeddings corresponding to the span. The multinomial likelihood is computed using a bilinear weight matrix $W$, and the most likely entry $k$ is picked from the memory buffer $M = (m_1, \ldots, m_K)$ by maximizing the likelihood:

$k = \arg\max_k P(k|y^p) = \arg\max_k \mathrm{Softmax}({y^p}^\top W M)$   (2)
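Eq. (2) reduces to a single bilinear scoring step; a sketch (shapes are our assumptions):

```python
import torch

def select_entity(y_p, W, M):
    """Score a property span against every buffer entry, Eq. (2).

    y_p: (d_p,) property representation; W: (d_p, d_m) bilinear weights;
    M: (K, d_m) stacked buffer entries. Returns the argmax slot and P(k|y_p).
    """
    probs = torch.softmax(y_p @ W @ M.T, dim=-1)   # y_p^T W m_k for all k
    return int(probs.argmax()), probs
```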

Remarks The computational cost of inferring relations between a property span and all the entities in the input is proportional to the memory buffer size. On our corpus, the mean and standard deviation of the buffer size per conversation were 22 and 15 for the Sx task, and 32 and 23 for the Rx task. Hence, the set of candidate entities considered is substantially smaller than the $O(n^2)$ potential entities in the input sequence.

The small size of the memory buffer also has an impact on the rate of learning. In each training step, rather than updating all embeddings, we only update the smaller number of embeddings associated with the entries in the memory buffer. This makes the learning fast and efficient.

Page 5: arXiv:1908.11536v1 [cs.CL] 30 Aug 2019 · toms, propType, fall into four categories: symprop/severity, symprop/duration, sym-prop/location, and symprop/frequency. The propContent

4.6 An End-to-end Learning Paradigm

We train the model end-to-end by minimizing the following loss function for each conversation:

$L = -\alpha \log P(y^e|h) - \sum_{y^k_{attr} \in S} \log P(y^k_{attr}|h) - \sum_{y^j_{pos} \in P} \log P(y^j_{pos}|h)$   (3)

where $y^e$ is the target sequence (sym_B, sym_I, prop_B, prop_I), $\{y^k_{attr}\}$ is the set of attribute labels for each highlighted span, $\{y^j_{pos}\}$ is the list of buffer slot indices, and $\alpha$ is a relative weight.
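In code, Eq. (3) is just a weighted sum of the three negative log-likelihood terms; a sketch, with α = 0.01 taken from Table 2:

```python
def joint_loss(span_nll, attr_nlls, relation_nlls, alpha=0.01):
    """Eq. (3): alpha * CRF span loss plus the attribute and buffer-slot
    cross-entropy terms; alpha = 0.01 follows the tuned value in Table 2."""
    return alpha * span_nll + sum(attr_nlls) + sum(relation_nlls)
```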

During training, we simultaneously attempt to detect the locations of tags and to classify them. Initially, our model for locating the tags is unlikely to be reliable, so we adopt a curriculum learning paradigm. Specifically, we provide the classification stage the reference location of the tag from the training data with probability $p$, and the inferred location of the tag with probability $1-p$. We start the joint multi-task training by setting this probability to 1 and decrease it as training progresses (Bengio et al., 2015).
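A sketch of this scheduled-sampling choice (the decay schedule for $p$ is not specified in the paper, so it is left to the caller):

```python
import random

def pick_span(reference_span, inferred_span, p):
    """Feed the classification stage the gold span with probability p and
    the model's own inferred span with probability 1 - p; p starts at 1
    and is decayed as training progresses (Bengio et al., 2015)."""
    return reference_span if random.random() < p else inferred_span
```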

Since our model consists of span extraction and attribute tagging layers followed by relation extraction, we refer to it as the Relational Span-Attribute Tagging Model (R-SAT). One advantage of our model is that the computational complexity of joint inference is $O(n)$, linear in the length of the conversation $n$. This is substantially cheaper than previous work on joint relation prediction models, where the computational complexity is $O(n^4)$ (Lee et al., 2017).

5 Knowledge Graph Features

Medical domain knowledge could be helpful in increasing the likelihood of symptoms when related medications are mentioned in a conversation, and vice versa. One such source is a knowledge graph (KG), whose embeddings represent a low-dimensional projection that captures structural and semantic information about its nodes. Previous work has demonstrated that KG embeddings can improve relation extraction in the written domain (Han et al., 2018). We utilize an internally developed KG that contains about 14k medical nodes of 87 different types (e.g., medications, symptoms, treatments, etc.). The nodes are represented by 256-dimensional embedding vectors, which were trained to minimize the word2vec loss function on web documents (Mikolov et al., 2013). A given node may belong to multiple types, and this is encoded as a sum of one-hot vectors. The input word sequences were mapped to KG nodes using an internal tool (Brown, 2013). For words that do not map to KG nodes, we use a learnable UNK vector of the same dimension as the KG embeddings. In addition, we also represented linguistic information using part-of-speech (POS) tags as one-hot vectors. The POS tags were inferred from the input sequence using an internal tool with 47 distinct tags (Andor et al., 2016). In our experiments, we found it most effective to concatenate the word embeddings with the KG entities, and the encoder output with the embeddings of the POS tags and the KG entity types.

[Figure 2: Illustration of how POS (p) features and knowledge graph (KG) features are incorporated into our encoder. Dashed lines represent mappings from words (w) to KG nodes, which contain the embedding (e) and the type (t) information.]
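A sketch of this feature arrangement (the helper name and dimensions are illustrative assumptions; `encoder` is any callable over per-token embeddings):

```python
import torch

def augment_features(word_emb, kg_emb, pos_onehot, kg_type_vec, encoder):
    """KG entity embeddings are concatenated at the encoder input; POS
    one-hot vectors and KG type vectors are concatenated to the encoder
    output, as described above."""
    enc_in = torch.cat([word_emb, kg_emb], dim=-1)        # per token
    enc_out = encoder(enc_in)
    return torch.cat([enc_out, pos_onehot, kg_type_vec], dim=-1)
```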

6 Experiments

We describe our corpus, evaluation metrics, the experimental setup, the evaluations of the proposed model, and comparisons with different baselines on both the symptom and medication tasks.

6.1 Corpus

Given the privacy-sensitive nature of clinical conversations, there aren’t any publicly available corpora for this domain. Therefore, we utilize a private corpus consisting of 92k de-identified and manually transcribed audio recordings of clinical conversations, typically about 10 minutes long [IQR: 5–12 minutes] with mostly 2 participants (72.7%). Other participants, when present, included, for example, nurses and caregivers. The corresponding manual transcripts contained sequences that were on average 208 utterances or 1,459 words in length. We note that, due to the casual conversational style of speech, an entity mentioned at the beginning can be related to a property mentioned at the end of the conversation. This makes the problem of modeling relations much harder than in previous work on extracting relations.


A subset of about 2,950 clinical conversations, related to primary care, were annotated by professional medical scribes. The ontology for labeling medications consisted of the type of medication (e.g., medications, supplements) and their properties (e.g., dosage, frequency, duration), and that for symptoms consisted of 186 symptom names and their properties. This resulted in 77K and 99K tags for the medication and symptom tasks, respectively. In all, there were 23k and 16k relationships between medications and symptoms and their properties, respectively. The conversations were divided into training (1,950), development (500) and test (500) sets.

In the case of medications, about 70% of the labels were about medications and the rest about their properties, of which 51% were dosage or quantity and 40% were frequency. In the case of symptoms, 41% of the labels were about symptom names, another 41% about status, and the rest about properties, of which 39% were about frequency and 37% about body locations.

6.2 Pretraining

Since our labeled data is small, only about 3k conversations, the input encoder of the model was pre-trained over the entire 92k conversations. For pre-training, given a short snippet of conversation, the model was tasked with predicting the next turn, similar to skip-thought (Kiros et al., 2015). Our models were trained using the Adam optimizer (Kingma and Ba, 2015), and the hyperparameters are described in the supplementary material.

6.3 Evaluation Metrics

As described in Section 2, our tasks consist of extracting tuples – (symType, propType, propContent) for the symptom task and (medContent, propType, propContent) for the medication task. Precision, recall and F1-scores are computed jointly over all three elements, and the content is treated as a list of tokens for evaluation purposes. To allow for partial content matches, we generalize the calculation of precision and recall such that

$\mathrm{Precision} = \frac{1}{|S_{\hat{Y}}|} \sum_{i \in S_{\hat{Y}}} \prod_{j=1}^{3} I_{\hat{y}^j_i}(y^j_i, \hat{y}^j_i)$

$\mathrm{Recall} = \frac{1}{|S_Y|} \sum_{i \in S_Y} \prod_{j=1}^{3} I_{y^j_i}(y^j_i, \hat{y}^j_i)$

where $S_{\hat{Y}}$ denotes the set of predictions, $S_Y$ denotes the set of ground truths, and $I_{z^j_i}(x^j_i, \hat{x}^j_i) = |x^j_i \cap \hat{x}^j_i| / |z^j_i|$. We note that, as symType and propType are simply target classes, $I$ reduces to a simple indicator function for them. In the scenario where the content consists of single elements, the entire calculation simplifies to the exact-match-based calculation of precision and recall over the sets of predictions and ground truths. For the symptom task, we additionally evaluate the performance of predicting symType and symStatus using the exact-match-based calculation.

We illustrate this evaluation metric with the example below:

Prediction: [(sym/sob, prop/severity, [bad])]
Reference: [(sym/unk, prop/location, [arm]), (sym/sob, prop/severity, [really, bad])]

There are two symptoms in the reference and the model extracted one of them. In the extracted symptom, the model correctly identified one out of the two content words. So, we score the precision as $\frac{1}{1}(1 \cdot 1 \cdot \frac{1}{1}) = 1$ and the recall as $\frac{1}{2}((0 \cdot 0 \cdot 0) + (1 \cdot 1 \cdot \frac{1}{2})) = 0.25$.
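A small Python sketch of this generalized metric that reproduces the example's scores; the pairing of predicted and reference tuples is not spelled out in the paper, so the greedy best-match below is our assumption:

```python
from collections import Counter

def _overlap(a, b, denom):
    """I_z(x, x_hat) = |x ∩ x_hat| / |z|, with contents as token bags."""
    inter = sum((Counter(a) & Counter(b)).values())
    return inter / len(denom) if denom else 0.0

def _score(pred, ref, denom):
    # product over the three tuple elements; the two type fields reduce
    # to exact-match indicators
    return (float(pred[0] == ref[0]) * float(pred[1] == ref[1])
            * _overlap(pred[2], ref[2], denom))

def precision_recall(preds, refs):
    p = sum(max((_score(y, r, y[2]) for r in refs), default=0.0)
            for y in preds)
    r = sum(max((_score(y, g, g[2]) for y in preds), default=0.0)
            for g in refs)
    return (p / len(preds) if preds else 0.0,
            r / len(refs) if refs else 0.0)

preds = [("sym/sob", "prop/severity", ["bad"])]
refs = [("sym/unk", "prop/location", ["arm"]),
        ("sym/sob", "prop/severity", ["really", "bad"])]
print(precision_recall(preds, refs))   # (1.0, 0.25)
```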

6.4 Baselines

Symptom Task As a baseline for this task, we train an extension of the standard tagging model described in Section 4.1. The label space for extracting the relations between symptoms and their properties is 186 symptoms × 3 properties, and for extracting symptoms and their status it is 186 symptoms × 3 statuses. Using the BIO scheme, that adds up to 2,233 labels in total (see the arithmetic below). The baseline consists of a bidirectional LSTM encoder followed by two feed-forward layers [512, 256] and then a 2,233-dimensional softmax layer. The label space is too large to include a CRF layer. The encoder was pre-trained in the same way as described in Section 6.2, the hyperparameters were selected according to Table 2, and the model parameters were trained using a cross-entropy loss.
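The label-space arithmetic behind the 2,233 figure, spelled out:

```python
# (186 symptoms x 3 properties) + (186 symptoms x 3 statuses) labels,
# each split into B-/I- variants under the BIO scheme, plus one O tag.
num_labels = (186 * 3 + 186 * 3) * 2 + 1
print(num_labels)   # 2233
```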


Medication Task For this task, we adopt a different baseline, since the generic medication entity types (e.g., drug name, supplement name) do not provide any useful information, unlike the 186 symptom entity labels (e.g., sym/msk/pain). Instead, we adopt the neural coreference resolution approach, which is better suited to this task (Lee et al., 2017). The encoder is the same as in the baseline for the symptom task and is pre-trained in the same manner. Since the BIO labels contain only 9 elements in this case, the encoder output is fed into a CRF layer. Each candidate relation is represented by concatenating the latent states of the head tokens of the medication entity and the property. This representation is augmented with an embedding of the token distance, which is fed to a softmax layer whose binary output encodes whether they are related or not. Note that our R-SAT model does not take advantage of this distance embedding.

6.5 Parameter Tuning

Table 2 shows the parameters that were selected after evaluating over a range on a development set. In all experiments, the Aggregate(·) function is implemented as the mean function for its simplicity.

Parameter       Used   Range
Word emb        256    [128 – 512]
LSTM cell       1024   [256 – 1024]
Enc/dec layers  1      [1 – 3]
Dropout         0.4    [0.0 – 0.5]
L2              1e-4   [1e-5 – 1e-2]
Std of VN       1e-3   [1e-4 – 0.2]
α               0.01   [1e-4 – 0.1]
Learning rate   1e-2   [1e-4 – 1e-1]

Table 2: Hyperparameters of our models, for reproducibility.

6.6 Results & Ablation Analysis

The performance of the proposed R-SAT model was compared with the baseline models, and the results are reported in Table 3.

Symptom Task The model was trained using multi-task learning for both tasks: (symType, propType, propContent) as well as (symType, symStatus). The performance was evaluated using all the elements of the tuple, as described in Section 6.3. The baseline performs better on (symType, symStatus) than on (symType, propType, propContent), possibly because there are more instances of the former in the training data than of the latter. The R-SAT model performs significantly better than the baselines on both tasks.

To understand the contributions of the different components of the model, we performed a series of ablation analyses, removing components one at a time. In extracting relations in Sx + Property, the KG embeddings along with the POS tags contribute a relative gain of 13%, while the memory buffer brings a relative gain of 8%. Neither of them impacts Sx + Status, which is expected for the memory buffer since the status is tagged on the same span as the contents of the memory buffer. Multi-task learning brings a relative improvement of 4% on Sx + Property; this may be because there are fewer instances of this relation in the training data, and jointly learning with Sx + Status helps to learn better representations. Note that we have not checked other orders of removing model components (e.g., removing multi-tasking earlier or KG later).

Medication Task In the Rx case, we only have one task (Rx + Property), that is, predicting the relations between medications and their properties, e.g., ([ibuprofen], prop/dosage, [10 mg]). The baseline gives reasonable performance. Ablation analysis reveals that the KG and POS features contribute about 4.6% relative improvement, while the contextual span in the memory buffer adds a substantial 43% relative improvement. Since the medications come from an open set, we cannot run experiments without the buffer. Compared to the symptom task, the model performs better on the medication task, and this may be due to lower variability in dosage.

Relation Only Prediction To tease apart the strengths and weaknesses of the model, we evaluated its performance when the entities and their properties were given and the model was only required to decide whether a relation exists or not.

As a baseline, we compare our model with a recently proposed model for document-level joint entity and relation extraction, BRAN, which achieved state-of-the-art performance on chemical-disease relations (Verga et al., 2018). When this model was originally used to test relations between all pairs of entities and properties in the entire conversation, it performed relatively poorly. Using the implementation released by the authors, the performance of BRAN was then optimized by restricting the distance between the pairs and by fine-tuning the threshold. The best results are reported in Table 4.


Model                                   Sx + Property   Sx + Status   Rx + Property
Baseline                                0.18            0.44          0.35
R-SAT                                   0.34            0.57          0.45
w/o [KG]                                0.30            0.56          0.43
w/o [KG, Context]                       0.26            0.55          0.30
w/o [KG, Context, Buffer]               0.24            0.55          n/a
w/o [KG, Context, Buffer, Multi-task]   0.23            n/a           n/a
Human                                   0.51            0.78          0.52

Table 3: Comparison of the performance of the proposed R-SAT model with baselines, and ablation analysis of the different components (KG, Context, Buffer, Multi-task), where ‘Context’ is the latent representation of the span $h^s_{ij}$ in the memory buffer.

Model   Sx + Property   Rx + Property
BRAN    0.62            0.41
R-SAT   0.82            0.60

Table 4: Performance of the models when the entities and properties are given and they are only required to predict the existence of relations.

Our proposed R-SAT model, without any such constraints, performs better than BRAN on both tasks, by an absolute F1-score gain of about 0.20.

Interestingly, the performance of our model on Sx + Property jumps from 0.34 in the joint prediction task to 0.82 in the relation-only prediction task. This reveals that the primary weakness of the Sx model lies in tagging the entities and the properties accurately. In contrast, the F1-score for Rx + Property is impacted less, moving up only from 0.45 to 0.60.

The task of inferring whether a relation is present between a medication and its properties is more challenging than in the case of the symptom task. This is not entirely surprising, since there is a higher correlation between symptom type and location (e.g., a respiratory symptom being associated with the nose) and a relatively low correlation between dosage and medications (e.g., 400mg could be the dosage for several different medications).

6.7 Analysis

To understand the inherent difficulty of extracting symptoms, medications and their properties from clinical conversations, we estimated human performance. A set of 500 conversations was annotated by 3 different scribes. We created a “voted” reference and compared the annotations from each of the 3 scribes against it.

The F1-scores of the scribes were surprisingly low: 0.51 for Sx + Property and 0.78 for Sx + Status. The model likewise finds extracting relations in Sx + Property more difficult than the Sx + Status task. In summary, the model reaches 67% of human performance for Sx + Property and 73% for Sx + Status. The F1-score of the scribes for Rx + Property is similar to that for Sx + Property; in this case, the model achieves about 85% of human performance. The human errors or inconsistencies in the Sx and Rx annotations appear to be largely due to missed labels, not to inconsistent spans for the same tags or inconsistent tags for the same span.

While the majority of the relations in the reference annotations occurred within the same sentence, approximately 11.1% of relations spanned 3 or more sentences. This typically occurred when the symptoms or medications were discussed over multiple dialog turns, as illustrated in Table 1. Among the relations correctly identified by the model, 10.6% were also across 3 or more sentences, which is very similar to the prior on the reference and suggests no bias. We notice that in certain cases the model is able to link a property to an entity that is far away (100+ sentences) when a nearby mention of the same entity was missed by the model. Models that only examine relations in nearby sentences (2–3 sentences) would have missed the relation in such a scenario.

The majority of the errors result from our model missing the property span. Specifically, we see that 35% and 81% of the errors are due to the model not detecting medication and symptom properties, respectively. For example, “when I really have to” and “every three, three months” are rare mentions in informal language.

Our reference links each property to only one entity.


In certain cases, we notice that the model links the property to an alternative mention or entity that is equally valid (Advil vs. pain killer). So, our performance measure underestimates the actual model performance.

7 Conclusions

We propose a novel model to jointly infer entities and relations. The key components of the model are: a mechanism to highlight the spans of interest, classify them into entities, store the entities of interest in a memory buffer along with the latent representation of the context, and then infer relations between candidate property spans and the entities in the buffer. The components of the model are not tied to any domain, and we have demonstrated applications to two different tasks. In the case of symptoms, the entities are categorized into 186 classes, while in the case of medications, the entities form an open set. The model is tailored for tasks where the training data is limited and the label space is large but can be partitioned into subsets. The two-stage processing, where the candidates are stored in a memory buffer, allows us to perform the joint inference at a computational cost of $O(n)$ in the length of the input $n$, compared to methods that explore all spans of entities and properties at a computational cost of $O(n^4)$. The model is trained end-to-end. We evaluate the performance on three related tasks, namely extracting symptoms and their status, relations between symptoms and their properties, and relations between medications and their properties. Our model outperforms the baselines substantially, by about 32–50%. Through ablation analysis, we observe that the memory buffer and the KG features contribute significantly to this performance gain. By comparing human scribes against the “voted” reference, we see that the task is inherently difficult; the models achieve about 67–85% of human performance.

Acknowledgments

This work would not have been possible without the help of a number of colleagues, including Laurent El Shafey, Hagen Soltau, Yuhui Chen, Ashley Robson Domin, Lauren Keyes, Rayman Huang, Justin Stuart Paul, Mark Knichel, Jeff Carlson, Zoe Kendall, Mayank Mohta, Roberto Santana, Katherine Chou, Chris Co, Claire Cui, and Kyle Scholz.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the ACM Conference on Digital Libraries, pages 85–94. ACM.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Brian G. Arndt, John W. Beasley, Michelle D. Watkinson, Jonathan L. Temte, Wen-Jan Tuan, Christine A. Sinsky, and Valerie J. Gilchrist. 2017. Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations. Annals of Family Medicine, 15(5):419–26.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1171–1179, Cambridge, MA, USA. MIT Press.

Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res., 32(Database issue):D267–70.

Sergey Brin. 1998. Extracting patterns and relations from the world wide web. In WebDB Workshop at the 6th International Conference on Extending Database Technology, EDBT'98, pages 172–183.

Aaron Brown. 2013. Semantics and structured knowledge in practice at Google. In IEEE ICSC. IEEE International Conference on Semantic Computing.

Razvan C. Bunescu and Raymond J. Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 724–731, Stroudsburg, PA, USA. Association for Computational Linguistics.

Soravit Changpinyo, Hexiang Hu, and Fei Sha. 2018. Multi-task learning for sequence tagging: An empirical study. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2965–2977. Association for Computational Linguistics.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL.

Jason Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.

Kevin Clark and Christopher D. Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2256–2262, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537.

Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, ACL '04.

George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The automatic content extraction (ACE) program tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04).

Nan Du, Kai Chen, Anjuli Kannan, Linh Tran, Yuhui Chen, and Izhak Shafran. 2019. Extracting symptoms and their status from clinical conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 915–925, Florence, Italy. Association for Computational Linguistics.

Greg P. Finley, Erik Edwards, Amanda Robinson, Najmeh Sadoughi, Mark Miller, David Suendermann-Oeft, Michael Brenndoerfer, and Nico Axtmann. 2018a. An automated medical scribe for documenting clinical encounters. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Gregory Finley, Wael Salloum, Najmeh Sadoughi, Erik Edwards, Amanda Robinson, Nico Axtmann, Michael Brenndoerfer, Mark Miller, and David Suendermann-Oeft. 2018b. From dictations to clinical reports using machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 121–128. Association for Computational Linguistics.

Katrin Fundel, Robert Kuffner, and Ralf Zimmer. 2007. RelEx–relation extraction using dependency parse trees. Bioinformatics, 23(3):365–371.

Pankaj Gupta, Subburam Rajaram, Hinrich Schutze, Bernt Andrassy, and Thomas A. Runkler. 2019. Neural relation extraction within and across sentence boundaries. In Proceedings of the AAAI Conference on Artificial Intelligence.

Xu Han, Zhiyuan Liu, and Maosong Sun. 2018. Neural knowledge acquisition via mutual attention between knowledge graph and text. In Thirty-Second AAAI Conference on Artificial Intelligence.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O Seaghdha, Sebastian Pado, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 33–38, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, ACLdemo '04, Stroudsburg, PA, USA. Association for Computational Linguistics.

Arzoo Katiyar and Claire Cardie. 2016. Investigating LSTMs for joint extraction of opinion entities and relations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 919–929, Berlin, Germany. Association for Computational Linguistics.

J. D. Kim, Y. Wang, and Y. Yasunori. 2013. The Genia event extraction shared task. In Proceedings of BioNLP.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28, pages 3294–3302.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Computational Linguistics.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In ACL.

ChunYang Liu, WenBo Sun, WenHan Chao, and Wanxiang Che. 2013. Convolution neural network for relation extraction. In International Conference on Advanced Data Mining and Applications, pages 231–242. Springer.

Feifan Liu, Chunhua Weng, and Hong Yu. 2012. Natural Language Processing, Electronic Health Records, and Clinical Research, pages 293–310. Springer Science & Business Media.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics.

Sebastian Martschat and Michael Strube. 2015. Latent structures for coreference resolution. Transactions of the Association for Computational Linguistics, 3(0):405–418.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, USA. Curran Associates Inc.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116. Association for Computational Linguistics.

Makoto Miwa and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1858–1869, Doha, Qatar. Association for Computational Linguistics.

Raymond J. Mooney and Razvan C. Bunescu. 2006. Subsequence kernels for relation extraction. In Y. Weiss, B. Scholkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 171–178. MIT Press.

Pinal Patel, Disha Davey, Vishal Panchal, and Parth Pathak. 2018. Annotation of a large clinical entity corpus. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2033–2042. Association for Computational Linguistics.

Yifan Peng, Manabu Torii, Cathy H. Wu, and K. Vijay-Shanker. 2014. A generalizable NLP framework for fast development of pattern-based biomedical relation extraction systems. BMC Bioinformatics, 15(1):285.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756–1765. Association for Computational Linguistics.

Longhua Qian, Guodong Zhou, Fang Kong, Qiaoming Zhu, and Peide Qian. 2008. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING '08, pages 697–704, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Advances in Neural Information Processing Systems, pages 1297–1304.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211.

Ryuichi Takanobu, Tianyang Zhang, Jiexi Liu, and Minlie Huang. 2019. A hierarchical framework for relation extraction with reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.

Ozlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011a. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J. Am. Med. Inform. Assoc., 18(5):552–556.

Ozlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011b. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–6.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 872–884. Association for Computational Linguistics.

Robert Wachter and Jeff Goldsmith. 2018. To combat physician burnout and improve care, fix the electronic health record. Harvard Business Review.

Rena Xu. 2018. The burnout crisis in American medicine. The Atlantic.

Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. ICLR.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. J. Mach. Learn. Res., 3:1083–1106.

Meishan Zhang, Yue Zhang, and Guohong Fu. 2017. End-to-end neural relation extraction with global optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1730–1740, Copenhagen, Denmark. Association for Computational Linguistics.

Xinsong Zhang, Pengshuai Li, Weijia Jia, and Hai Zhao. 2019. Multi-labeled relation extraction with attentive capsule network. In Proceedings of the AAAI Conference on Artificial Intelligence.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215.

Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1227–1236, Vancouver, Canada. Association for Computational Linguistics.