Toward Complete Structured Information Extraction from Radiology Reports Using Machine Learning
Jackson M. Steinkamp 1,2 · Charles Chambers 1 · Darco Lalevic 1 · Hanna M. Zafar 1 · Tessa S. Cook 1
Published online: 19 June 2019
Abstract
Unstructured and semi-structured radiology reports represent an underutilized trove of information for machine learning (ML)-based clinical informatics applications, including abnormality tracking systems, research cohort identification, point-of-care summarization, semi-automated report writing, and as a source of weak data labels for training image processing systems. Clinical ML systems must be interpretable to ensure user trust. To create interpretable models applicable to all of these tasks, we can build general-purpose systems which extract all relevant human-level assertions or "facts" documented in reports; identifying these facts is an information extraction (IE) task. Previous IE work in radiology has focused on a limited set of information, and extracts isolated entities (i.e., single words such as "lesion" or "cyst") rather than complete facts, which require the linking of multiple entities and modifiers. Here, we develop a prototype system to extract all useful information in abdominopelvic radiology reports (findings, recommendations, clinical history, procedures, imaging indications and limitations, etc.), in the form of complete, contextualized facts. We construct an information schema to capture the bulk of information in reports, develop real-time ML models to extract this information, and demonstrate the feasibility and performance of the system.
Keywords Machine learning · Radiology reports · Natural language processing · Structured reporting
Introduction
Unstructured and semi-structured radiology reports represent an underutilized trove of information for clinical informatics applications, including follow-up tracking systems, intelligent chart search, point-of-care summarization, semi-automated report writing, research cohort identification, and as a source of weak data labels for training image processing systems. The growing importance of structured data is reflected in radiologists' increasing embrace of structured reporting, standardized coding systems, ontologies, and common data elements [1, 2]. It is an open question as to whether all potentially useful structured information can be captured feasibly a priori at the time of reporting. Even if this undertaking proves feasible, large amounts of historically useful data will remain in their unstructured form.
A parallel approach to procuring structured information involves the development of automated systems capable of extracting structured information from unstructured free text. This approach is frequently used for individual downstream applications (e.g., text mining for research cohort identification). However, individual applications often recreate old pipelines to extract the information from reports, duplicating large amounts of work and leaving the general information extraction problem unsolved. Recent advances in natural language processing (NLP) and modern machine learning (ML) technologies have made it feasible to extract complete sets of structured information from unstructured free text within limited domains, essentially fully converting unstructured information to structured information. Tasks of this form fall under the general umbrella of information extraction (IE) [3]. Table 1 provides a breakdown of some (but by no means all) common subtasks and task formulations of information extraction, while Table 2 contrasts rule-based and machine learning approaches for performing these tasks.
Previous IE work within radiology has focused on named entity recognition (NER)—identifying particular spans of text corresponding to relevant entities in the document [4–7]. Some definitions of named entity recognition are limited to proper nouns (e.g., person, organization, and location), but in
* Jackson M. [email protected]
1 Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA 19104, USA
2 Boston University School of Medicine, Boston, MA 02119, USA
Journal of Digital Imaging (2019) 32:554–564
https://doi.org/10.1007/s10278-019-00234-y
The Author(s) 2019
Table 1  Some of the most common information extraction subtasks

Named entity recognition
  Description: System provides labels for spans of text within the document, using a pre-specified set of labels. Commonly used for nouns (e.g., radiologic findings and anatomic organs).
  Example: Input: "Patient has cyst in the kidney." Output: Cyst → radiologic finding; Kidney → anatomic location.
  Notes and limitations:
  - Does not inherently handle "implied" entities which are not directly mentioned in the text.
  - By itself, does not link entities to each other, limiting the scope of answerable questions to "which entities appear in this document."
  - Traditionally, output is limited to pre-specified entity types.

Relation extraction
  Description: In addition to identifying entities, system identifies relations or predicates between entities. For instance, a FindingIsLocatedIn relation may connect a radiologic finding entity with an anatomic location entity.
  Example: Input: "Patient has cyst in kidney." Valid outputs: "FindingIsLocatedIn(cyst, kidney)" / cyst; is in; kidney.
  Notes and limitations:
  - Can be either "open" (able to identify arbitrary relations between entities) or "closed" (limited to a set of pre-specified relations).
  - Systems may be limited to binary relations (between two entities) or may handle relations with arbitrary numbers of entities.

Natural language question answering
  Description: System provides an arbitrary natural language answer to an arbitrary natural language question about a document. System may generate new natural language and/or "point" to spans of text within the document.
  Example: Input: "Does the patient have any kidney findings?" Valid output: "A cyst." Input: "In what organ is the cyst?" Valid output: "In the kidney."
  Notes and limitations:
  - Natural language questions can be seen as a generalization of other natural language tasks (e.g., named entity recognition and relation extraction can both be framed as question-answering tasks).
  - Natural language answers allow for maximum flexibility compared with a set of pre-specified labels and can ideally generalize to questions outside of the training set.
  - Requires very large amounts of training data, as the task is inherently more complicated.
  - Providing a complete set of labels for a given training example is difficult, as there may be many correct ways to answer a question.
this paper, we use "named entity recognition" to refer to recognition of any text span which corresponds to a relevant piece of information (e.g., including modifiers such as "spiculated" or "5 mm"). Such NER systems represent a step up from systems which merely perform a raw text search over a pre-specified set of terms, as there are many cases in which there is not a one-to-one mapping between raw text and entity (e.g., homonyms, multiple context-dependent meanings of a word, and pronouns). Furthermore, modern machine learning systems such as neural networks are capable of generalizing beyond set lists of terms and are therefore not limited by incomplete ontologies.
However, systems which strictly perform named entity recognition-level tasks are insufficient for answering even basic clinical queries. Perhaps the most commonly cited example is negation: in the sentence "No lesion observed," an NER-only system could (correctly) identify "lesion" as an entity, but cannot correctly answer the intended question "does this patient have a lesion" without additional information about the negation and how it relates to the "lesion" entity. Many other limitations exist, including conditional expressions (e.g., "if a lesion is present, consider …"), uncertainty (e.g., "possibly representing a lesion"), temporal modifiers (e.g., "patient with history of lesion"), and even person-specific modifiers (e.g., "patient's father with reported history of lesion"), all of which add crucial context to the entity. Despite the relatively narrow vocabulary and informational scope of radiology reports compared with general-purpose English text, these linguistic phenomena are very common in radiology reports, rendering pure NER systems incapable of answering basic clinical queries without additional post-processing.
The task known as relation extraction [8] is a step up from NER (see Table 1), as it involves identifying both entities and the relations between them. In the most general form, a set of possible relations between entity types is defined, such as "lesion X has size Y" or "imaging modality X was used on body region Y", and a successful system is capable of identifying these relations from raw text. Such systems, therefore, can answer questions such as "does this patient have a lesion," "are any of the lesions larger than 5 mm," or "has this lesion changed since it was last observed" by using information relating multiple named entities. Some systems exist for relation extraction in general-purpose clinical text, but the radiology literature is limited.
In addition, most systems only focus on small subsets of information within reports. By contrast, we aim to create a system capable of general-purpose complete information extraction from radiology reports, for a wide variety of downstream uses, as listed above. In this study, we develop an information schema capable of capturing the majority of information in radiologic reports, demonstrate the feasibility of a neural network system to extract this information, and motivate possible downstream uses.

Table 2  Two major approaches for handling information extraction tasks

Rule-based string matching
  Description: System uses a human-provided list of rules with particular text strings (or regular expressions) to identify entities and relations within the text.
  Pros and cons:
  - Explainable and interpretable.
  - Simple to implement; may be sufficient for certain limited tasks.
  - Many natural language words have multiple meanings or classifications based on surrounding context, so it may be impossible to create a true one-to-one mapping between text and labels.
  - For more complex tasks such as relation extraction and question answering, it is very difficult or impossible to anticipate all rules necessary for the task.
  - No ability to generalize beyond provided rules.

Machine learning systems
  Description: Many available models, e.g., support vector machines and neural networks. System learns from a training data set of input/output pairs and can generalize to unseen examples which are similar. Deep learning models utilizing neural networks are currently state-of-the-art for the majority of complex tasks.
  Pros and cons:
  - Many models incorporate information from the entire surrounding sequence of text to produce an answer.
  - Able to generalize beyond provided training examples.
  - May be difficult to understand why the model produces its output.
  - Require large amounts of labeled training data.
Methods
Fact Schema
We iteratively developed a schema of all relevant information documented in radiology reports by qualitatively examining a body of existing reports. For the purposes of this feasibility study, we used only abdominal and pelvic radiology reports to constrain the space of possible information. However, most of the schema covers information (e.g., findings, follow-up recommendations, and indications) which applies to all body parts, and a similar process could easily be used to modify the informational schema to include missing domain-specific information.
The basic unit of information in our schema is a clinical assertion, predicate, or "fact", such as "a radiologic finding was observed," "the patient has a follow-up recommendation," or "the patient has a known diagnosis." Each fact has one "anchor" text span—e.g., the finding, the recommendation, or the diagnosis—as well as a set of informational "modifier" text spans which contextualize or modify the fact. These modifier spans include common linguistic elements such as negation, uncertainty, and conditionality, as well as fact-specific information (e.g., a radiologic finding has "location," "size," "description," and "change over time" whereas a follow-up recommendation has "desired timing") designed to be able to answer most reasonable queries a clinician might have regarding a report. These modifier spans are roughly analogous to "slots" in a slot filling task [9], in which systems populate predefined relation types for a given entity. Figure 1 shows an example "radiologic finding was observed" fact.
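The anchor-plus-modifiers structure described above can be sketched as a small data model. This is an illustrative reconstruction, not the authors' actual code; the class names, token offsets, and modifier names below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    start: int   # inclusive word-token index
    end: int     # exclusive word-token index
    text: str

@dataclass
class Fact:
    fact_type: str                 # e.g., "radiologic finding was observed"
    full_span: Span                # minimal contiguous span covering the fact
    anchor: Span                   # exactly one anchor entity per fact
    modifiers: dict = field(default_factory=dict)  # modifier name -> list of Spans

# A toy finding in the style of the paper's examples
tokens = "Stable septated cystic lesion in the left kidney measuring 7 x 7 mm".split()
fact = Fact(
    fact_type="radiologic finding was observed",
    full_span=Span(0, len(tokens), " ".join(tokens)),
    anchor=Span(3, 4, "lesion"),
    modifiers={
        "description": [Span(0, 3, "Stable septated cystic")],
        "location": [Span(4, 8, "in the left kidney")],
        "size": [Span(9, 13, "7 x 7 mm")],
    },
)
```

Note that, per the schema, only the anchor must be contiguous, and a fact may carry zero modifiers.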
It is important to note that multiple facts can occur within the same span of text—for instance, a sentence stating "this exam is compared to a previous abdominal CT" contains at least two separate facts: (1) that an imaging study was done in the past and (2) that this imaging study was used for comparison with the current study.
Rather than trying to build up an ontology from simple entities, as past work has done, we designed each anchor and modifier span to answer a specific question and trained our system to produce all information necessary to answer the question implied by the span type. For instance, in a "patient has radiologic finding" fact, the entire phrase "within the left kidney" would be tagged as the "location" span, as all of this information, including the organ, laterality, and preposition "within," is necessary to answer the common question "where is the finding?"
We iteratively developed the information schema by using it to label radiology reports with their complete factual content. When information was found that was not captured by our schema, we updated our schema to capture this information. The schema is extensible to any piece of information which can be formatted as assertions or predicates relating groups of entities which map to spans of radiologic text.
Data
For our data set, we used a convenience sample of 120 abdominal and pelvic radiology reports from imaging studies obtained at our institution from 2013 to 2018. These reports included a variety of indications as well as modalities, including CT (n = 50), MRI (n = 48), and ultrasound (n = 22). The reports included adult patients of all ages and sexes. The reports were written using a wide variety of personal reporting templates and language, and most were written in prose. Some reports, but not all, used some form of anatomical section headings, although the form of these headings was highly variable. The corpus included reports written by 26 different attending radiologists and 51 resident radiologists. Our study was approved by our Institutional Review Board.
We used custom-developed labeling software to manually annotate 120 reports with their complete factual content. Each piece of extracted information was required to be linked to a specific span of text in the document, in order for the system to
Fig. 1 Example fact with anchor and modifier text spans
be interpretable and evaluable. Each fact was labeled with the following three pieces of information: (1) the minimal contiguous text span from the report necessary to capture the complete fact, (2) the text span corresponding to the anchor span which defines that fact, and (3) text spans corresponding to surrounding pre-defined modifier spans (e.g., negation and size of lesion). Labels (2) and (3) were required to fall within the span of label (1), but only the anchor entity (2) was required to be contiguous. A fact could have only one anchor entity, but could have any number of non-anchor modifier spans, including zero. In many cases, the same text span was labeled with multiple different fact instances. Text spans for all three span labels were constrained to the word token level rather than the individual character level (e.g., a fact might begin at token 230 and end at token 266). Tokenization of documents was performed using spaCy, a freely available Python package.
Models
For continuous word token embeddings, we used custom fastText vectors trained on a corpus of 100,000 abdominopelvic radiology reports at our institution. We chose fastText because it is relatively quick to train and is capable of using subword information such as a word's spelling, enabling out-of-vocabulary tokens (including typos) to be reasonably well-represented. We used the fastText implementation from the gensim Python package with the skip-gram training procedure, 300-dimensional embeddings, and all other parameters set to gensim's default values. Qualitative inspection of the resulting vectors showed reasonable similarity between nearby words both in spelling and semantics. We concatenated these embeddings with GloVe embeddings trained on large volumes of general-purpose English text to form our complete word embeddings, believing that the combination of large-volume training data (from GloVe) and domain-specific embeddings (from fastText) may outperform each embedding type individually.
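The concatenation scheme can be illustrated with a toy sketch. The lookup tables below are random stand-ins for a trained gensim fastText model and pretrained GloVe vectors, and the GloVe dimension (300) is an assumption, as the paper does not state it.

```python
import numpy as np

# Toy stand-ins for trained fastText and GloVe models; real vectors would be
# loaded from the trained/pretrained embedding files.
rng = np.random.default_rng(0)
fasttext_vecs = {w: rng.standard_normal(300) for w in ("cyst", "kidney")}
glove_vecs = {w: rng.standard_normal(300) for w in ("cyst", "kidney")}

def embed(token: str) -> np.ndarray:
    # Concatenate domain-specific (fastText) and general-purpose (GloVe)
    # vectors. Out-of-vocabulary tokens fall back to zeros in this sketch;
    # real fastText would compose them from character n-grams instead.
    ft = fasttext_vecs.get(token, np.zeros(300))
    gv = glove_vecs.get(token, np.zeros(300))
    return np.concatenate([ft, gv])

# A document becomes a (n_tokens, 600) matrix fed to the networks below
doc_matrix = np.stack([embed(t) for t in ["cyst", "in", "kidney"]])
```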
We used a two-part neural network architecture for our models. The first neural network takes a full document as input and outputs (1) token-level predictions of anchor entities and (2) token-level predictions of complete fact spans. The second takes as input an anchor entity, fact type, and candidate fact span and outputs (1) token-level predictions for context modifier spans and (2) a refined prediction of the beginning and end of the complete fact. This refinement is necessary because of the small amount of training data for the full-document model (120 example documents) compared with the large amount of training data for the second network (> 5000 manually labeled fact spans within these reports). It is unsurprising that the fact spans predicted by the fine-tuning module in the second neural network are significantly better than those predicted by the first neural network, given the significantly larger volume of training examples it had to work with.
The first neural network works as follows: embedded full documents are processed sequentially by a 2-layer bidirectional gated recurrent unit (GRU), a type of recurrent neural network module used frequently in text processing systems to encode surrounding context. The outputs of the two directions of the GRU are concatenated and used as shared features for two separate dense layer "heads," which calculate predictions for each word token in the document. One head predicts which fact types each word token is part of and the other predicts which anchor entity types (including none) a word token is part of. We treated the problem as a multi-class, multi-label prediction problem, allowing each word token to be part of multiple different facts or anchor types. A dropout of 0.3 was applied to the GRU outputs for regularization. We opted to keep the fact span head and anchor span head independent, and use the joint output of both to definitively identify facts, as described below.
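A minimal PyTorch sketch of this first network, assuming the paper's 2-layer bidirectional GRU and 0.3 dropout; the hidden size, embedding width, and numbers of fact and anchor types are illustrative placeholders, not the authors' hyperparameters.

```python
import torch
import torch.nn as nn

class DocumentTagger(nn.Module):
    def __init__(self, embed_dim=600, hidden=128, n_fact_types=15, n_anchor_types=15):
        super().__init__()
        # 2-layer bidirectional GRU over embedded tokens
        self.gru = nn.GRU(embed_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        # Concatenated forward/backward states give 2*hidden shared features,
        # consumed by two independent dense "heads"
        self.fact_head = nn.Linear(2 * hidden, n_fact_types)
        self.anchor_head = nn.Linear(2 * hidden, n_anchor_types)

    def forward(self, embeddings):            # (batch, n_tokens, embed_dim)
        feats, _ = self.gru(embeddings)
        feats = self.dropout(feats)
        # Sigmoid (not softmax) because each token may belong to several
        # facts/anchor types: a multi-class, multi-label setup
        return (torch.sigmoid(self.fact_head(feats)),
                torch.sigmoid(self.anchor_head(feats)))

model = DocumentTagger()
fact_probs, anchor_probs = model(torch.randn(1, 40, 600))  # one 40-token document
```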
For every predicted anchor span, if it is inside of a predicted fact span, we create a fact candidate. The second neural network takes as input each fact candidate, consisting of the predicted anchor span, the predicted fact type, and the embeddings of the words within the predicted fact span. The predicted anchor span is provided in the form of a mask vector of zeros and ones, where 1 corresponds to a predicted anchor token and 0 corresponds to a predicted non-anchor token. This "mask" vector is concatenated to the word embeddings. To enable the second network to expand the predicted span of the first network if necessary, we also include the word embeddings of the 20 tokens which come before and after the predicted fact span. This network has separate weight parameters stored for each separate fact type, as each fact may require different learned information—the fact type predicted by the first network determines which stored parameters are used by the second network.
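The input construction for a fact candidate can be sketched as follows; the span length and embedding width are illustrative, and the 20-token context windows described above are omitted for brevity.

```python
import numpy as np

# Word embeddings for a 12-token candidate fact span (toy values)
span_embeddings = np.random.randn(12, 600)

# 0/1 anchor mask: 1 marks tokens predicted as the anchor entity
anchor_mask = np.zeros((12, 1))
anchor_mask[4] = 1.0                # token 4 was predicted as the anchor

# The mask is concatenated to the embeddings as one extra input channel,
# telling the second network which fact candidate it is refining
second_net_input = np.concatenate([span_embeddings, anchor_mask], axis=1)
```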
The second network outputs word token-level predictions for all modifier spans (including possible refinement of the anchor entity prediction), and error is computed using a sigmoid function followed by binary cross-entropy loss. Last, this network predicts the refined beginning and end of the complete fact span as follows: a dense layer processes the GRU-encoded tokens and outputs "beginning" and "end" scores for each token. A maximum is taken over all token positions to produce the final predictions for the beginning and end of the fact span, and loss is computed using a softmax followed by a cross-entropy loss function. The overall loss function for the second network is the sum of the beginning and end position losses and the token-level prediction losses.
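The boundary-refinement head can be sketched in PyTorch as follows; the token count, feature size, and gold positions are illustrative, and the modifier-span head's binary cross-entropy term (which the paper adds to the total loss) is omitted.

```python
import torch
import torch.nn as nn

# GRU-encoded tokens for one fact candidate plus context (toy sizes)
n_tokens, feat_dim = 52, 256
encoded = torch.randn(1, n_tokens, feat_dim)

# Dense layer emits per-token "beginning" and "end" scores
boundary_head = nn.Linear(feat_dim, 2)
scores = boundary_head(encoded)            # (1, n_tokens, 2)

# Prediction: the maximum over all token positions for each boundary
begin = scores[..., 0].argmax(dim=1)
end = scores[..., 1].argmax(dim=1)

# Training: softmax over positions via cross-entropy against gold boundaries,
# summed for the beginning and end positions
loss_fn = nn.CrossEntropyLoss()
gold_begin, gold_end = torch.tensor([3]), torch.tensor([40])
span_loss = loss_fn(scores[..., 0], gold_begin) + loss_fn(scores[..., 1], gold_end)
```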
A complete diagram of the neural network architecture isgiven in Fig. 2.
Training and Validation
Data was divided into training, validation, and test sets (80%/10%/10%), and model design decisions and hyperparameters were tuned on the validation set. Models were trained until validation set performance ceased increasing, with a patience of three epochs. Model performance was evaluated using word-level micro-averaged recall, precision, and F1 score for (1) full-fact spans, (2) anchor entities, and (3) modifier spans on the unseen test data.
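Micro-averaging of this kind pools true positives, false positives, and false negatives over all tokens and labels before computing the metrics; a minimal stdlib sketch on toy multi-label token predictions (the label names are invented):

```python
def micro_prf(gold, pred):
    # gold/pred: one set of labels per token; counts are pooled across all
    # tokens before computing metrics (micro-averaging)
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"finding"}, {"finding", "location"}, set()]
pred = [{"finding"}, {"location"}, {"size"}]
p, r, f1 = micro_prf(gold, pred)   # 2 TP, 1 FP, 1 FN -> all three equal 2/3
```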
For further information about model hyperparameters andother specific training decisions, see Appendix.
Results
Information Schema
Our full information schema is given in Table 3, along with parsed examples and relative frequency in the labeled document corpus. In total, 5294 facts were labeled (an average of 44.1 facts per report) with 15,299 total labeled pieces of information (anchor entities and modifier spans), for an average of 2.88 labeled text spans per fact. Our fact schema was comprehensive and covered the vast majority of information present in abdominopelvic radiology reports, with 86.3% of the raw text (53,862 tokens) covered by at least one fact. Almost all of the text not covered by facts consisted of document section headers and other metadata. We did not have to add any new facts after
approximately 50 reports were fully labeled, suggesting that we had achieved some degree of content saturation within the limited domain of abdominopelvic reports.
Fact Extraction
For prediction of fact spans, our model achieves token-level recall of 88.6%, precision of 93.5%, and F1 of 91.0%, micro-averaged over all fact types. For anchor prediction, we achieve 77.1% recall, 88.4% precision, and 82.4% F1. Micro-averaged across all pieces of information, including modifier spans, the model achieves 74.5% recall, 75.1% precision, and 74.8% F1. Performance varies significantly across class types, from 0% for rare facts (i.e., < 10 training examples) to > 95% for the most common facts. This is unsurprising given that some fact types have extremely limited training data in our corpus; e.g., there were only 9 "patient has lab result" facts vs. approximately 2500 "radiologist asserts imaging finding" facts, so the model does not have enough data to learn generalizable properties for the rarest fact types. Additional data would likely improve the model's performance across all fact types, but most particularly on rare fact types. However, our model was able to achieve > 75% F1 score at extracting "patient had procedure" and "patient carries diagnosis" facts, each of which have fewer than 200 labeled training examples, suggesting that the volume of training data required for this model to perform well in limited domains is not prohibitively large. Future labeling can be directed toward the rare facts to improve the system's performance.
Fig. 2 Architecture for (a) first and (b) second neural network
Table 3  Complete information schema. Examples are color-coded using the colors in the "anchor entity" and "modifier spans" columns. Modifier spans which are in standard color do not appear in the parsed example, while example texts in standard color are not part of any information spans. Note that multiple modifier and/or anchor spans may overlap, but for ease of visualization, no examples with overlapping spans were selected.
Fact: Patient carries diagnosis
  Anchor entity: Diagnosis
  Modifier spans: Timing of diagnosis, body location specifics, current status/severity, previous workup, previous therapeutic measure
  Parsed example: "62-year-old male with history of SMA dissection status post repair."
  Frequency in labeled dataset: 161

Fact: Patient has/had procedure
  Anchor entity: Procedure
  Modifier spans: Timing, body location specifics, outcome/consequence, indication, where/by whom procedure performed
  Parsed example: "Stable post surgical changes from right upper pole partial nephrectomy."
  Frequency in labeled dataset: 195

Fact: Imaging study occurred
  Anchor entity: Modality
  Modifier spans: Time of occurrence, anatomic region, contrast-related information, imaging sequences, other protocolling information, summary of findings, indication
  Parsed example: "Findings: please refer to CT chest from today's date for chest findings."
  Frequency in labeled dataset: 341

Fact: Study was used for comparison with this study
  Anchor entity: Modality (of compared study)
  Modifier spans: Anatomic region, timing
  Parsed example: "Comparison: Complete abdominal ultrasound __DATE__."
  Frequency in labeled dataset: 142

Fact: Radiologic finding was observed
  Anchor entity: Finding
  Modifier spans: Location, finding descriptors, timing, image citation, quantitative size measurement, description of change over time, related diagnostic interpretation
  Parsed example: "Stable septated inferior interpolar cystic lesion is seen on image 1/4, measuring 7 x 7 mm."
  Frequency in labeled dataset: 2651

Fact: Anatomic region has property
  Anchor entity: Region
  Modifier spans: Property, image citation, quantitative size measurements, statement relative to past imaging, related diagnostic reasoning
  Parsed example: "Soft tissues appear unremarkable."
  Frequency in labeled dataset: 1221

Fact: Patient has lab results
  Anchor entity: Lab
  Modifier spans: Result, timing, indication, other setting information (e.g., at what facility)
  Parsed example: "AFP was 1 on __DATE__."
  Frequency in labeled dataset: 9

Fact: Patient has pathology results
  Anchor entity: Tissue
  Modifier spans: Findings/result, indication, timing, other setting information
  Parsed example: "Status post left robotic partial nephrectomy on __DATE__. Kidney pathology demonstrated clear cell carcinoma, Fuhrman grade 1."
  Frequency in labeled dataset: 18

Fact: Patient has/had symptom
  Anchor entity: Symptom
  Modifier spans: None
  Parsed example: "History: 64-yr-old male with abdominal pain."
  Frequency in labeled dataset: 39

Fact: Follow-up recommendation made
  Anchor entity: Recommendation
  Modifier spans: Timing/condition, indication
  Parsed example: "9 x 15 mm nonaggressive appearing right adnexal cyst for which one-year follow up pelvic ultrasound is recommended if the patient is postmenopausal."
  Frequency in labeled dataset: 162

Fact: Attending attestation/comment exists
  Anchor entity: Commenter
  Modifier spans: Attestation or comment
  Parsed example: "Attending radiologist agreement: I have personally reviewed the images and agree with this report."
  Frequency in labeled dataset: 119

Fact: This study has a limitation
  Anchor entity: Limitation
  Modifier spans: Reason
  Parsed example: "Portions of the head and tail are obscured by bowel gas."
  Frequency in labeled dataset: 82

Fact: Information is elsewhere (e.g., in another document)
  Anchor entity: Information
  Modifier spans: Location
  Parsed example: "Findings: Please refer to CT chest from today's date for chest findings."
  Frequency in labeled dataset: 24

Fact: Radiologist communicated with someone
  Anchor entity: Whom
  Modifier spans: Timing, modality, information conveyed
  Parsed example: "These observations were discussed with and acknowledged by __NAME__ at __TIME__ on __DATE__."
  Frequency in labeled dataset: 8

Fact: This study has an indication
  Anchor entity: Indication
  Modifier spans: N/A
  Parsed example: "He presents for active surveillance of a left kidney mass."
  Frequency in labeled dataset: 27

Fact: Patient has stated age and gender
  Anchor entity: Age
  Modifier spans: Gender
  Parsed example: "63-year-old male"
  Frequency in labeled dataset: 87

Modifiers associated with all facts: Negation (of entire fact), statement of uncertainty, information source
Examples
Inspection of predictions on unseen reports showed that the network was able to extract varied information successfully from complex sentences. For instance, in the test set fact "9 mm nonaggressive appearing cystic lesion in the pancreatic tail on image 16 series 2 is unchanged from prior exam when measured in similar fashion, likely a sidebranch IPMN," the system was able to produce correct labels for all 7 informational spans, including diagnostic reasoning. Furthermore, the system was able to generalize to fact instances it had not seen before (for instance, it identified "Gamna-Gandy bodies" as a finding in the test set), indicating that it had learned general representations of how facts are recorded in radiology reports. In this way, it has a significant advantage over rule-based systems and ontologies which are limited to pre-specified vocabularies. The fastText embeddings, which utilize subword information in the form of character n-grams to embed unseen words, allowed the system to handle typographical errors effectively.
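This robustness to typos follows from the subword representation: a misspelling shares most of its character n-grams with the correct spelling, so their vectors land close together. A stdlib sketch of the n-gram overlap (n = 3, with fastText-style word-boundary markers):

```python
def char_ngrams(word: str, n: int = 3) -> set:
    # fastText-style: pad with boundary markers, then take all character n-grams
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

correct = char_ngrams("kidney")   # {"<ki", "kid", "idn", "dne", "ney", "ey>"}
typo = char_ngrams("kidny")       # {"<ki", "kid", "idn", "dny", "ny>"}

# Jaccard overlap between the two n-gram sets: 3 shared of 8 total
overlap = len(correct & typo) / len(correct | typo)
```

Because fastText builds a word's vector by summing its n-gram vectors, this shared-subword overlap is what pulls the typo's embedding toward the correct word's.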
Discussion
Our study demonstrates the feasibility of near-complete information extraction from radiologic texts using only a small corpus of 120 abdominopelvic cross-sectional imaging reports. It introduces a novel information schema which is capable of handling the vast majority of information in radiologic text and a neural network model for extracting that information. The system is capable of extracting information from reports written in a wide variety of template and prose styles by > 50 different radiologists. Although we have not yet integrated it into the clinical workflow, the system is capable of operating in real time (i.e., while a user types), and the information schema is easily extensible to additional facts with no changes to the neural network other than the number of output units. It is likely that our schema and models would generalize effectively to other radiologic and clinical domains.
Our representational system has several strengths. First, it is capable of handling multiple facts of the same type contained within the same text span—for instance, in the sentence "hepatic artery, portal vein, and hepatic veins are patent," our system extracts three separate "anatomic region has property" facts, one for each anatomic structure. This disentanglement of separate facts from raw text is particularly crucial for downstream applications that involve tracking or reasoning over discrete facts. Second, by representing our information schema using complete predicates, e.g., "patient has diagnosis" and "imaging study was used for comparison with this study," we are able to represent relations involving some implied entities such as "this patient" or "this study" and differentiate them from predicates involving other people or
studies. Third, by treating our modifier spans as questions to be answered rather than entities to be identified, we include all the information necessary to answer a single clinical query in each modifier span. To see the advantage of this, consider the sentence "Lesion observed within the left kidney." Many NER-based systems would attempt to identify "left" as a laterality and "kidney" as a body organ, but our system identifies the complete text span "within the left kidney" as the location of the lesion, which directly answers the clinical query and is more readily applicable to downstream tasks; e.g., an automated tracking system could disambiguate this lesion from other kidney lesions or lesions on the left side of the body, and display it separately.
There are, however, some limitations to our representation of the IE task. Although our system is able to represent the vast majority of information found in abdominopelvic radiology reports, no matter how comprehensive a structured information schema is, there are rare types of information that will be missed. Furthermore, our system is unable to handle some types of "implied" information which are not directly referenced in the text—e.g., in the sentence "The gallbladder is surgically absent," an ideal system might infer a cholecystectomy procedure fact, but there is no span of text corresponding to the procedure name in this sentence, and therefore our system cannot extract a procedure fact. For this study, this was a deliberate design decision, as we wanted to provide interpretable and verifiable output where each prediction is linked to a particular span of text [13]. However, this trade-off results in our system being unable to extract some types of implied information. In the limited domain of radiological text at our institution, the assumption of direct textual evidence for each entity was valid in the vast majority of cases, but not all.
Exceptions are rare in the structured domain of radiologic text, and we were able to find sensible workarounds for this study. However, moving to a more general model, where systems are capable of answering arbitrary natural language questions about a report or span of text, is likely necessary to solve some of these difficulties and expand the capabilities of information extraction systems. Figure 1 of a study from the Allen Institute [10] effectively highlights the difficulty in capturing every piece of information using structured ontologies and relations between entities and suggests a simpler alternative to complex ontologies in the form of question-answer meaning representations (QAMRs), which do not grow more complex and brittle with larger information schemata. Development of large question-answer data sets for radiologic and other clinical domains is an important step for progressing the field and developing models which can effectively comprehend and reason about radiologic texts. A parallel approach involves transfer learning from general-purpose question answering tasks such as the Stanford Question Answering Dataset (SQuAD) [11].
562 J Digit Imaging (2019) 32:554–564
Another limitation of this study is the small amount of labeled data available for the full document-level neural network (120 reports); as a general rule, deep learning models perform better with larger training data sets, and many specific vocabulary terms are not represented within this corpus. Labeling each report with its complete informational content requires hundreds of text spans to be manually identified (each report takes 30–45 min to label). However, our models were able to perform reasonably well even with this small training set and are able to generalize to unseen vocabulary terms, suggesting that the effort required to create effective training sets is not prohibitive even for smaller groups of clinicians or researchers. We plan to continue increasing the size of the training set, including additional reports from other department sections, imaging modalities, and institutions, to improve generalization performance. Alternatively, using rule-based labeling algorithms to produce weak labels for a much larger corpus of reports may be effective for training, although manually labeled data is still required to evaluate the performance of such systems.
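A rule-based weak labeler of the kind described above can be sketched as follows; the patterns and label names are illustrative examples, not the rules used in this work.

```python
import re

# Hypothetical rule-based weak labeler: regex patterns map matched spans
# to coarse fact types. The output is noisy ("weak") but cheap, so a
# much larger corpus can be labeled than is feasible by hand.
WEAK_RULES = [
    (re.compile(r"\b\d+(\.\d+)?\s?(mm|cm)\b"), "measurement"),
    (re.compile(r"\b(recommend(ed|s)?|follow-?up)\b", re.I), "recommendation"),
    (re.compile(r"\b(cyst|lesion|nodule|mass)\b", re.I), "finding_anchor"),
]

def weak_label(report: str):
    """Return (start, end, label) triples for every rule match."""
    labels = []
    for pattern, label in WEAK_RULES:
        for m in pattern.finditer(report):
            labels.append((m.start(), m.end(), label))
    return sorted(labels)

labels = weak_label("A 12 mm cyst is seen. Follow-up MRI is recommended.")
```

Such labels would serve only as pre-training signal; evaluation would still rely on manually labeled spans.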
From the modeling perspective, the use of more sophisticated word token representations such as language models, which tend to outperform static word embeddings on most NLP tasks [12], may provide further improvement. Using attentional models such as the transformer architecture [13] may also enable the system to handle long-range text dependencies more effectively than recurrent neural network models.
Our current study is limited to abdominopelvic radiology reports, and we did not conduct any formal experiments on other report types. It is likely that the similar vocabulary describing findings, recommendations, imaging modalities, and measurements, as well as the similar document-level form of most reports (findings ordered by anatomic region), would provide a baseline level of generalizability across radiology subdisciplines. However, exposure to subdiscipline-specific vocabulary (anatomic terms, specific lesion descriptions) would almost certainly be necessary for high-accuracy performance. Future work will aim to formally test these assumptions (e.g., training a model on one type of report and testing on another).
Conclusions
We develop a comprehensive information schema for complete information extraction from abdominopelvic radiologic reports, develop neural network models to extract this information, and demonstrate their feasibility. Our system uses no pre-specified rules, runs in real time, generalizes to arbitrarily large information schemata, and has representational advantages over systems which only perform named entity recognition. The system has many downstream applications, including follow-up tracking systems, research cohort identification, intelligent chart search, automatic summarization, and creation of high-accuracy weak labels for large imaging data sets. Future work includes using more sophisticated language models, generalizing to all radiologic subdisciplines, and building downstream applications. Complete conversion of unstructured data to structured data represents a feasible complementary approach to structured reporting toward the goal of creating fully machine-readable radiology reports.
Appendix. Hyperparameters and Model Specifics
All models were trained in PyTorch with the adaptive moment estimation (Adam) optimizer and a learning rate of 1e-3.
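A minimal sketch of this optimizer setup (the two-layer model is a placeholder, not our span-extraction networks):

```python
import torch

# Placeholder model: the real networks are recurrent span taggers;
# only the optimizer configuration below reflects the training setup.
model = torch.nn.Sequential(
    torch.nn.Linear(300, 128),  # e.g., 300-d word embeddings in
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),    # per-token tag logits out
)
# Adam with learning rate 1e-3, as described above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```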
Since the second network receives as input the predicted fact spans and anchors from the first network, we stochastically augment the dataset to simulate incorrect output predictions by the first network. A random number of additional tokens between 0 and 20 is appended to either side of the true fact span in order to generate a noisy training example for the second network. This enables the second network to learn to correct the first network's incorrect predictions.
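This augmentation step can be sketched as follows (an illustrative function, not our exact implementation):

```python
import random

def augment_span(tokens, start, end, max_pad=20, rng=random):
    """Widen the true fact span [start, end) by a random number of
    tokens (0-20 by default) on each side, simulating an imperfect
    span prediction from the first network. Offsets are clipped to
    the document boundaries."""
    new_start = max(0, start - rng.randint(0, max_pad))
    new_end = min(len(tokens), end + rng.randint(0, max_pad))
    return new_start, new_end

tokens = "There is a small cyst in the liver .".split()
# True fact span is tokens[2:5] == ["a", "small", "cyst"].
noisy = augment_span(tokens, 2, 5, rng=random.Random(0))
```

Because padding is only ever added outward, the noisy span always contains the true span, so the second network sees the correct tokens plus distracting context.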
Training was completed in 2 h on a machine with one NVIDIA GTX 1070 GPU.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Rubin DL, Kahn Jr CE: Common Data Elements in Radiology. Radiology 283:837–844, 2017
2. European Society of Radiology (ESR): ESR paper on structured reporting in radiology. Insights Imaging 9(1):1–7, 2018
3. Singh S: Natural Language Processing for Information Extraction. arXiv [cs.CL], 2018
4. Hassanpour S, Langlotz CP: Information extraction from multi-institutional radiology reports. Artif Intell Med 66:29–39, 2016
5. Nandhakumar N, Sherkat E, Milios EE, Gu H, Butler M: Clinically Significant Information Extraction from Radiology Reports. In: Proceedings of the 2017 ACM Symposium on Document Engineering - DocEng ’17. ACM Press, 2017, pp 153–162
6. Esuli A, Marcheggiani D, Sebastiani F: An enhanced CRFs-based system for information extraction from radiology reports. J Biomed Inform 46:425–435, 2013
7. Cornegruta S, Bakewell R, Withey S, Montana G: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks. arXiv [cs.CL], 2016
8. Pawar S, Palshikar GK, Bhattacharyya P: Relation Extraction: A Survey. arXiv [cs.CL], 2017
9. Huang L, Sil A, Ji H, Florian R: Improving Slot Filling Performance with Attentive Neural Networks on Dependency Structures. arXiv [cs.CL], 2017
10. Michael J, Stanovsky G, He L, Dagan I, Zettlemoyer L: Crowdsourcing Question-Answer Meaning Representations. arXiv [cs.CL], 2017
11. Rajpurkar P, Zhang J, Lopyrev K, Liang P: SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv [cs.CL], 2016
12. Devlin J, Chang M-W, Lee K, Toutanova K: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv [cs.CL], 2018
13. Vaswani A, et al: Attention Is All You Need. arXiv [cs.CL], 2017
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.