Toward Complete Structured Information Extraction from Radiology Reports Using Machine Learning
Jackson M. Steinkamp 1,2 · Charles Chambers 1 · Darco Lalevic 1 · Hanna M. Zafar 1 · Tessa S. Cook 1
Published online: 19 June 2019
Abstract
Unstructured and semi-structured radiology reports represent an underutilized trove of information for machine learning (ML)-based clinical informatics applications, including abnormality tracking systems, research cohort identification, point-of-care summarization, semi-automated report writing, and as a source of weak data labels for training image processing systems. Clinical ML systems must be interpretable to ensure user trust. To create interpretable models applicable to all of these tasks, we can build general-purpose systems which extract all relevant human-level assertions or "facts" documented in reports; identifying these facts is an information extraction (IE) task. Previous IE work in radiology has focused on a limited set of information, and extracts isolated entities (i.e., single words such as "lesion" or "cyst") rather than complete facts, which require the linking of multiple entities and modifiers. Here, we develop a prototype system to extract all useful information in abdominopelvic radiology reports (findings, recommendations, clinical history, procedures, imaging indications and limitations, etc.), in the form of complete, contextualized facts. We construct an information schema to capture the bulk of information in reports, develop real-time ML models to extract this information, and demonstrate the feasibility and performance of the system.
Keywords Machine learning · Radiology reports · Natural language processing · Structured reporting
Introduction
Unstructured and semi-structured radiology reports represent an underutilized trove of information for clinical informatics applications, including follow-up tracking systems, intelligent chart search, point-of-care summarization, semi-automated report writing, research cohort identification, and as a source of weak data labels for training image processing systems. The growing importance of structured data is reflected in radiologists' increasing embrace of structured reporting, standardized coding systems, ontologies, and common data elements [1, 2]. It is an open question as to whether all potentially useful structured information can be captured feasibly a priori at the time of reporting. Even if this undertaking proves feasible, large amounts of historically useful data will remain in their unstructured form.
A parallel approach to procuring structured information involves the development of automated systems capable of extracting structured information from unstructured free text. This approach is frequently used for individual downstream applications (e.g., text mining for research cohort identification). However, individual applications often recreate old pipelines to extract the information from reports, duplicating large amounts of work and leaving the general information extraction problem unsolved. Recent advances in natural language processing (NLP) and modern machine learning (ML) technologies have made it feasible to extract complete sets of structured information from unstructured free text within limited domains, essentially fully converting unstructured information to structured information. Tasks of this form fall under the general umbrella of information extraction (IE) [3]. Table 1 provides a breakdown of some (but by no means all) common subtasks and task formulations of information extraction, while Table 2 contrasts rule-based and machine learning approaches for performing these tasks.
Previous IE work within radiology has focused on named entity recognition (NER)—identifying particular spans of text corresponding to relevant entities in the document [4–7]. Some definitions of named entity recognition are limited to proper nouns (e.g., person, organization, and location), but in
* Jackson M. [email protected]
1 Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA 19104, USA
2 Boston University School of Medicine, Boston, MA 02119, USA
Journal of Digital Imaging (2019) 32:554–564
https://doi.org/10.1007/s10278-019-00234-y
The Author(s) 2019
Table 1  Some of the most common information extraction subtasks

Named entity recognition
  Description: System provides labels for spans of text within the document, using a pre-specified set of labels. Commonly used for nouns (e.g., radiologic findings and anatomic organs).
  Example: Input: "Patient has cyst in the kidney." Output: Cyst → radiologic finding; Kidney → anatomic location.
  Notes and limitations:
  - Does not inherently handle "implied" entities which are not directly mentioned in the text.
  - By itself, does not link entities to each other, limiting the scope of answerable questions to "which entities appear in this document."
  - Traditionally, output is limited to pre-specified entity types.

Relation extraction
  Description: In addition to identifying entities, system identifies relations or predicates between entities. For instance, a FindingIsLocatedIn relation may connect a radiologic finding entity with an anatomic location entity.
  Example: Input: "Patient has cyst in kidney." Valid outputs: "FindingIsLocatedIn(cyst, kidney)" / cyst; is in; kidney.
  Notes and limitations:
  - Can be either "open" (able to identify arbitrary relations between entities) or "closed" (limited to a set of pre-specified relations).
  - Systems may be limited to binary relations (between two entities) or may handle relations with arbitrary numbers of entities.

Natural language question answering
  Description: System provides an arbitrary natural language answer to an arbitrary natural language question about a document. System may generate new natural language and/or "point" to spans of text within the document.
  Example: Input: "Does the patient have any kidney findings?" Valid output: "A cyst." Input: "In what organ is the cyst?" Valid output: "In the kidney."
  Notes and limitations:
  - Natural language questions can be seen as a generalization of other natural language tasks (e.g., named entity recognition and relation extraction can both be framed as question-answering tasks).
  - Natural language answers allow for maximum flexibility compared with a set of pre-specified labels and can ideally generalize to questions outside of the training set.
  - Requires very large amounts of training data, as the task is inherently more complicated.
  - Providing a complete set of labels for a given training example is difficult, as there may be many correct ways to answer a question.
this paper, we use "named entity recognition" to refer to recognition of any text span which corresponds to a relevant piece of information (e.g., including modifiers such as "spiculated" or "5 mm"). Such NER systems represent a step up from systems which merely perform a raw text search over a pre-specified set of terms, as there are many cases in which there is not a one-to-one mapping between raw text and entity (e.g., homonyms, multiple context-dependent meanings of a word, and pronouns). Furthermore, modern machine learning systems such as neural networks are capable of generalizing beyond set lists of terms and are therefore not limited by incomplete ontologies.
However, systems which strictly perform named entity recognition-level tasks are insufficient for answering even basic clinical queries. Perhaps the most commonly cited example is negation: in the sentence "No lesion observed," an NER-only system could (correctly) identify "lesion" as an entity, but cannot correctly answer the intended question "does this patient have a lesion" without additional information about the negation and how it relates to the "lesion" entity. Many other limitations exist, including conditional expressions (e.g., "if a lesion is present, consider …"), uncertainty (e.g., "possibly representing a lesion"), temporal modifiers (e.g., "patient with history of lesion"), and even person-specific modifiers (e.g., "patient's father with reported history of lesion"), all of which add crucial context to the entity. Despite the relatively narrow vocabulary and informational scope of radiology reports compared with general-purpose English text, these linguistic phenomena are very common in radiology reports, rendering pure NER systems incapable of answering basic clinical queries without additional post-processing.
The task known as relation extraction [8] is a step up from NER (see Table 1), as it involves identifying both entities and the relations between them. In the most general form, a set of possible relations between entity types is defined, such as "lesion X has size Y" or "imaging modality X was used on body region Y", and a successful system is capable of identifying these relations from raw text. Such systems, therefore, can answer questions such as "does this patient have a lesion," "are any of the lesions larger than 5 mm," or "has this lesion changed since it was last observed" by using information relating multiple named entities. Some systems exist for relation extraction in general-purpose clinical text, but the radiology literature is limited.
In addition, most systems only focus on small subsets of information within reports. By contrast, we aim to create a system capable of general-purpose complete information extraction from radiology reports, for a wide variety of downstream uses, as listed above. In this study, we develop an information schema capable of capturing the majority of information in radiologic reports, demonstrate the feasibility of a neural network system to extract this information, and motivate possible downstream uses.

Table 2  Two major approaches for handling information extraction tasks

Rule-based string matching
  Description: System uses a human-provided list of rules with particular text strings (or regular expressions) to identify entities and relations within the text.
  Pros and cons:
  - Explainable and interpretable.
  - Simple to implement; may be sufficient for certain limited tasks.
  - Many natural language words have multiple meanings or classifications based on surrounding context, so it may be impossible to create a true one-to-one mapping between text and labels.
  - For more complex tasks such as relation extraction and question answering, it is very difficult or impossible to anticipate all rules necessary for the task.
  - No ability to generalize beyond provided rules.

Machine learning systems
  Description: Many available models, e.g., support vector machines and neural networks. System learns from a training data set of input/output pairs and can generalize to unseen examples which are similar. Deep learning models utilizing neural networks are currently state-of-the-art for the majority of complex tasks.
  Pros and cons:
  - Many models incorporate information from the entire surrounding sequence of text to produce an answer.
  - Able to generalize beyond provided training examples.
  - May be difficult to understand why the model produces its output.
  - Require large amounts of labeled training data.
Methods
Fact Schema
We iteratively developed a schema of all relevant information documented in radiology reports by qualitatively examining a body of existing reports. For the purposes of this feasibility study, we used only abdominal and pelvic radiology reports to constrain the space of possible information. However, most of the schema covers information (e.g., findings, follow-up recommendations, and indications) which applies to all body parts, and a similar process could easily be used to modify the informational schema to include missing domain-specific information.
The basic unit of information in our schema is a clinical assertion, predicate, or "fact", such as "a radiologic finding was observed," "the patient has a follow-up recommendation," or "the patient has a known diagnosis." Each fact has one "anchor" text span—e.g., the finding, the recommendation, or the diagnosis—as well as a set of informational "modifier" text spans which contextualize or modify the fact. These modifier spans include common linguistic elements such as negation, uncertainty, and conditionality, as well as fact-specific information (e.g., a radiologic finding has "location," "size," "description," and "change over time" whereas a follow-up recommendation has "desired timing") designed to be able to answer most reasonable queries a clinician might have regarding a report. These modifier spans are roughly analogous to "slots" in a slot filling task [9], in which systems populate predefined relation types for a given entity. Figure 1 shows an example "radiologic finding was observed" fact.
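The anchor-plus-modifiers structure described above can be sketched as a small data model. This is an illustrative reconstruction, not the authors' actual code; the class names, token offsets, and modifier names below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    start: int   # inclusive word-token index
    end: int     # exclusive word-token index
    text: str

@dataclass
class Fact:
    fact_type: str                 # e.g., "radiologic finding was observed"
    full_span: Span                # minimal contiguous span covering the fact
    anchor: Span                   # exactly one anchor entity per fact
    modifiers: dict = field(default_factory=dict)  # modifier name -> list of Spans

# A toy finding in the style of the paper's examples
tokens = "Stable septated cystic lesion in the left kidney measuring 7 x 7 mm".split()
fact = Fact(
    fact_type="radiologic finding was observed",
    full_span=Span(0, len(tokens), " ".join(tokens)),
    anchor=Span(3, 4, "lesion"),
    modifiers={
        "description": [Span(0, 3, "Stable septated cystic")],
        "location": [Span(4, 8, "in the left kidney")],
        "size": [Span(9, 13, "7 x 7 mm")],
    },
)
```

Note that, per the schema, only the anchor must be contiguous, and a fact may carry zero modifiers.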
It is important to note that multiple facts can occur within the same span of text—for instance, a sentence stating "this exam is compared to a previous abdominal CT" contains at least two separate facts: (1) that an imaging study was done in the past and (2) that this imaging study was used for comparison with the current study.
Rather than trying to build up an ontology from simple entities, as past work has done, we designed each anchor and modifier span to answer a specific question and trained our system to produce all information necessary to answer the question implied by the span type. For instance, in a "patient has radiologic finding" fact, the entire phrase "within the left kidney" would be tagged as the "location" span, as all of this information, including the organ, laterality, and preposition "within," is necessary to answer the common question "where is the finding?"
We iteratively developed the information schema by using it to label radiology reports with their complete factual content. When information was found that was not captured by our schema, we updated our schema to capture this information. The schema is extensible to any piece of information which can be formatted as assertions or predicates relating groups of entities which map to spans of radiologic text.
Data
For our data set, we used a convenience sample of 120 abdominal and pelvic radiology reports from imaging studies obtained at our institution from 2013 to 2018. These reports included a variety of indications as well as modalities, including CT (n = 50), MRI (n = 48), and ultrasound (n = 22). The reports included adult patients of all ages and sexes. The reports were written using a wide variety of personal reporting templates and language, and most were written in prose. Some reports, but not all, used some form of anatomical section headings, although the form of these headings was highly variable. The corpus included reports written by 26 different attending radiologists and 51 resident radiologists. Our study was approved by our Institutional Review Board.
We used custom-developed labeling software to manually annotate 120 reports with their complete factual content. Each piece of extracted information was required to be linked to a specific span of text in the document, in order for the system to
Fig. 1 Example fact with anchor and modifier text spans
be interpretable and evaluable. Each fact was labeled with the following three pieces of information: (1) the minimal contiguous text span from the report necessary to capture the complete fact, (2) the text span corresponding to the anchor span which defines that fact, and (3) text spans corresponding to surrounding pre-defined modifier spans (e.g., negation and size of lesion). Labels (2) and (3) were required to fall within the span of label (1), but only the anchor entity (2) was required to be contiguous. A fact could have only one anchor entity, but could have any number of non-anchor modifier spans, including zero. In many cases, the same text span was labeled with multiple different fact instances. Text spans for all three span labels were constrained to the word token level rather than the individual character level (e.g., a fact might begin at token 230 and end at token 266). Tokenization of documents was performed using spaCy, a freely available Python package.
Models
For continuous word token embeddings, we used custom fastText vectors trained on a corpus of 100,000 abdominopelvic radiology reports at our institution. We chose fastText because it is relatively quick to train and is capable of using subword information such as a word's spelling, enabling out-of-vocabulary tokens (including typos) to be reasonably well-represented. We used the fastText implementation from the gensim Python package with the skip-gram training procedure, 300-dimensional embeddings, and all other parameters set to gensim's default values. Qualitative inspection of the resulting vectors showed reasonable similarity between nearby words both in spelling and semantics. We concatenated these embeddings with GloVe embeddings trained on large volumes of general-purpose English text to form our complete word embeddings, believing that the combination of large-volume training data (from GloVe) and domain-specific embeddings (from fastText) may outperform each embedding type individually.
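The concatenation scheme can be illustrated with a toy sketch. The lookup tables below are random stand-ins for a trained gensim fastText model and pretrained GloVe vectors, and the GloVe dimension (300) is an assumption, as the paper does not state it.

```python
import numpy as np

# Toy stand-ins for trained fastText and GloVe models; real vectors would be
# loaded from the trained/pretrained embedding files.
rng = np.random.default_rng(0)
fasttext_vecs = {w: rng.standard_normal(300) for w in ("cyst", "kidney")}
glove_vecs = {w: rng.standard_normal(300) for w in ("cyst", "kidney")}

def embed(token: str) -> np.ndarray:
    # Concatenate domain-specific (fastText) and general-purpose (GloVe)
    # vectors. Out-of-vocabulary tokens fall back to zeros in this sketch;
    # real fastText would compose them from character n-grams instead.
    ft = fasttext_vecs.get(token, np.zeros(300))
    gv = glove_vecs.get(token, np.zeros(300))
    return np.concatenate([ft, gv])

# A document becomes a (n_tokens, 600) matrix fed to the networks below
doc_matrix = np.stack([embed(t) for t in ["cyst", "in", "kidney"]])
```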
We used a two-part neural network architecture for our models. The first neural network takes a full document as input and outputs (1) token-level predictions of anchor entities and (2) token-level predictions of complete fact spans. The second takes as input an anchor entity, fact type, and candidate fact span and outputs (1) token-level predictions for context modifier spans and (2) a refined prediction of the beginning and end of the complete fact. This refinement is necessary because of the small amount of training data for the full-document model (120 example documents) compared with the large amount of training data for the second network (> 5000 manually labeled fact spans within these reports). It is unsurprising that the fact spans predicted by the fine-tuning module in the second neural network are significantly better than those predicted by the first neural network, given the significantly larger volume of training examples it had to work with.
The first neural network works as follows: embedded full documents are processed sequentially by a 2-layer bidirectional gated recurrent unit (GRU), a type of recurrent neural network module used frequently in text processing systems to encode surrounding context. The outputs of the two directions of the GRU are concatenated and used as shared features for two separate dense layer "heads," which calculate predictions for each word token in the document. One head predicts which fact types each word token is part of and the other predicts which anchor entity types (including none) a word token is part of. We treated the problem as a multi-class, multi-label prediction problem, allowing each word token to be part of multiple different facts or anchor types. A dropout of 0.3 was applied to the GRU outputs for regularization. We opted to keep the fact span head and anchor span head independent, and use the joint output of both to definitively identify facts, as described below.
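A minimal PyTorch sketch of this first network, assuming the paper's 2-layer bidirectional GRU and 0.3 dropout; the hidden size, embedding width, and numbers of fact and anchor types are illustrative placeholders, not the authors' hyperparameters.

```python
import torch
import torch.nn as nn

class DocumentTagger(nn.Module):
    def __init__(self, embed_dim=600, hidden=128, n_fact_types=15, n_anchor_types=15):
        super().__init__()
        # 2-layer bidirectional GRU over embedded tokens
        self.gru = nn.GRU(embed_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        # Concatenated forward/backward states give 2*hidden shared features,
        # consumed by two independent dense "heads"
        self.fact_head = nn.Linear(2 * hidden, n_fact_types)
        self.anchor_head = nn.Linear(2 * hidden, n_anchor_types)

    def forward(self, embeddings):            # (batch, n_tokens, embed_dim)
        feats, _ = self.gru(embeddings)
        feats = self.dropout(feats)
        # Sigmoid (not softmax) because each token may belong to several
        # facts/anchor types: a multi-class, multi-label setup
        return (torch.sigmoid(self.fact_head(feats)),
                torch.sigmoid(self.anchor_head(feats)))

model = DocumentTagger()
fact_probs, anchor_probs = model(torch.randn(1, 40, 600))  # one 40-token document
```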
For every predicted anchor span, if it is inside of a predicted fact span, we create a fact candidate. The second neural network takes as input each fact candidate, consisting of the predicted anchor span, the predicted fact type, and the embeddings of the words within the predicted fact span. The predicted anchor span is provided in the form of a mask vector of zeros and ones, where 1 corresponds to a predicted anchor token and 0 corresponds to a predicted non-anchor token. This "mask" vector is concatenated to the word embeddings. To enable the second network to expand the predicted span of the first network if necessary, we also include the word embeddings of the 20 tokens which come before and after the predicted fact span. This network has separate weight parameters stored for each separate fact type, as each fact may require different learned information—the fact type predicted by the first network determines which stored parameters are used by the second network.
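The input construction for a fact candidate can be sketched as follows; the span length and embedding width are illustrative, and the 20-token context windows described above are omitted for brevity.

```python
import numpy as np

# Word embeddings for a 12-token candidate fact span (toy values)
span_embeddings = np.random.randn(12, 600)

# 0/1 anchor mask: 1 marks tokens predicted as the anchor entity
anchor_mask = np.zeros((12, 1))
anchor_mask[4] = 1.0                # token 4 was predicted as the anchor

# The mask is concatenated to the embeddings as one extra input channel,
# telling the second network which fact candidate it is refining
second_net_input = np.concatenate([span_embeddings, anchor_mask], axis=1)
```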
The second network outputs word token-level predictions for all modifier spans (including possible refinement of the anchor entity prediction), and error is computed using a sigmoid function followed by binary cross-entropy loss. Last, this network predicts the refined beginning and end of the complete fact span as follows: a dense layer processes the GRU-encoded tokens and outputs "beginning" and "end" scores for each token. A maximum is taken over all token positions to produce the final predictions for the beginning and end of the fact span, and loss is computed using a softmax followed by a cross-entropy loss function. The overall loss function for the second network is the sum of the beginning and end position losses and the token-level prediction losses.
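The boundary-refinement head can be sketched in PyTorch as follows; the token count, feature size, and gold positions are illustrative, and the modifier-span head's binary cross-entropy term (which the paper adds to the total loss) is omitted.

```python
import torch
import torch.nn as nn

# GRU-encoded tokens for one fact candidate plus context (toy sizes)
n_tokens, feat_dim = 52, 256
encoded = torch.randn(1, n_tokens, feat_dim)

# Dense layer emits per-token "beginning" and "end" scores
boundary_head = nn.Linear(feat_dim, 2)
scores = boundary_head(encoded)            # (1, n_tokens, 2)

# Prediction: the maximum over all token positions for each boundary
begin = scores[..., 0].argmax(dim=1)
end = scores[..., 1].argmax(dim=1)

# Training: softmax over positions via cross-entropy against gold boundaries,
# summed for the beginning and end positions
loss_fn = nn.CrossEntropyLoss()
gold_begin, gold_end = torch.tensor([3]), torch.tensor([40])
span_loss = loss_fn(scores[..., 0], gold_begin) + loss_fn(scores[..., 1], gold_end)
```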
A complete diagram of the neural network architecture isgiven in Fig. 2.
Training and Validation
Data was divided into training, validation, and test sets (80%/10%/10%), and model design decisions and hyperparameters were tuned on the validation set. Models were trained until validation set performance ceased increasing, with a patience of three epochs. Model performance was evaluated using word-level micro-averaged recall, precision, and F1 score for (1) full-fact spans, (2) anchor entities, and (3) modifier spans on the unseen test data.
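Micro-averaging of this kind pools true positives, false positives, and false negatives over all tokens and labels before computing the metrics; a minimal stdlib sketch on toy multi-label token predictions (the label names are invented):

```python
def micro_prf(gold, pred):
    # gold/pred: one set of labels per token; counts are pooled across all
    # tokens before computing metrics (micro-averaging)
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"finding"}, {"finding", "location"}, set()]
pred = [{"finding"}, {"location"}, {"size"}]
p, r, f1 = micro_prf(gold, pred)   # 2 TP, 1 FP, 1 FN -> all three equal 2/3
```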
For further information about model hyperparameters andother specific training decisions, see Appendix.
Results
Information Schema
Our full information schema is given in Table 3, along with parsed examples and relative frequency in the labeled document corpus. In total, 5294 facts were labeled (an average of 44.1 facts per report) with 15,299 total labeled pieces of information (anchor entities and modifier spans), for an average of 2.88 labeled text spans per fact. Our fact schema was comprehensive and covered the vast majority of information present in abdominopelvic radiology reports, with 86.3% of the raw text (53,862 tokens) covered by at least one fact. Almost all of the text not covered by facts consisted of document section headers and other metadata. We did not have to add any new facts after
approximately 50 reports were fully labeled, suggesting that we had achieved some degree of content saturation within the limited domain of abdominopelvic reports.
Fact Extraction
For prediction of fact spans, our model achieves token-level recall of 88.6%, precision of 93.5%, and F1 of 91.0%, micro-averaged over all fact types. For anchor prediction, we achieve 77.1% recall, 88.4% precision, and 82.4% F1. Micro-averaged across all pieces of information, including modifier spans, the model achieves 74.5% recall, 75.1% precision, and 74.8% F1. Performance varies significantly across class types, from 0% for rare facts (i.e., < 10 training examples) to > 95% for the most common facts. This is unsurprising given that some fact types have extremely limited training data in our corpus; e.g., there were only 9 "patient has lab result" facts vs. approximately 2500 "radiologist asserts imaging finding" facts, so the model does not have enough data to learn generalizable properties for the rarest fact types. Additional data would likely improve the model's performance across all fact types, but most particularly on rare fact types. However, our model was able to achieve > 75% F1 score at extracting "patient had procedure" and "patient carries diagnosis" facts, each of which have fewer than 200 labeled training examples, suggesting that the volume of training data required for this model to perform well in limited domains is not prohibitively large. Future labeling can be directed toward the rare facts to improve the system's performance.
Fig. 2 Architecture for (a) first and (b) second neural network
Table 3  Complete information schema. Examples are color-coded using the colors in the "anchor entity" and "modifier spans" columns. Modifier spans which are in standard color do not appear in the parsed example, while example texts in standard color are not part of any information spans. Note that multiple modifier and/or anchor spans may overlap, but for ease of visualization, no examples with overlapping spans were selected.
Fact: Patient carries diagnosis
  Anchor entity: Diagnosis
  Modifier spans: Timing of diagnosis, body location specifics, current status/severity, previous workup, previous therapeutic measure
  Parsed example: "62-year-old male with history of SMA dissection status post repair."
  Frequency in labeled dataset: 161

Fact: Patient has/had procedure
  Anchor entity: Procedure
  Modifier spans: Timing, body location specifics, outcome/consequence, indication, where/by whom procedure performed
  Parsed example: "Stable post surgical changes from right upper pole partial nephrectomy."
  Frequency in labeled dataset: 195

Fact: Imaging study occurred
  Anchor entity: Modality
  Modifier spans: Time of occurrence, anatomic region, contrast-related information, imaging sequences, other protocolling information, summary of findings, indication
  Parsed example: "Findings: please refer to CT chest from today's date for chest findings."
  Frequency in labeled dataset: 341

Fact: Study was used for comparison with this study
  Anchor entity: Modality (of compared study)
  Modifier spans: Anatomic region, timing
  Parsed example: "Comparison: Complete abdominal ultrasound __DATE__."
  Frequency in labeled dataset: 142

Fact: Radiologic finding was observed
  Anchor entity: Finding
  Modifier spans: Location, finding descriptors, timing, image citation, quantitative size measurement, description of change over time, related diagnostic interpretation
  Parsed example: "Stable septated inferior interpolar cystic lesion is seen on image 1/4, measuring 7 x 7 mm."
  Frequency in labeled dataset: 2651

Fact: Anatomic region has property
  Anchor entity: Region
  Modifier spans: Property, image citation, quantitative size measurements, statement relative to past imaging, related diagnostic reasoning
  Parsed example: "Soft tissues appear unremarkable."
  Frequency in labeled dataset: 1221

Fact: Patient has lab results
  Anchor entity: Lab
  Modifier spans: Result, timing, indication, other setting information (e.g., at what facility)
  Parsed example: "AFP was 1 on __DATE__."
  Frequency in labeled dataset: 9

Fact: Patient has pathology results
  Anchor entity: Tissue
  Modifier spans: Findings/result, indication, timing, other setting information
  Parsed example: "Status post left robotic partial nephrectomy on __DATE__. Kidney pathology demonstrated clear cell carcinoma, Fuhrman grade 1."
  Frequency in labeled dataset: 18

Fact: Patient has/had symptom
  Anchor entity: Symptom
  Modifier spans: None
  Parsed example: "History: 64-yr-old male with abdominal pain."
  Frequency in labeled dataset: 39

Fact: Follow-up recommendation made
  Anchor entity: Recommendation
  Modifier spans: Timing/condition, indication
  Parsed example: "9 x 15 mm nonaggressive appearing right adnexal cyst for which one-year follow up pelvic ultrasound is recommended if the patient is postmenopausal."
  Frequency in labeled dataset: 162

Fact: Attending attestation/comment exists
  Anchor entity: Commenter
  Modifier spans: Attestation or comment
  Parsed example: "Attending radiologist agreement: I have personally reviewed the images and agree with this report."
  Frequency in labeled dataset: 119

Fact: This study has a limitation
  Anchor entity: Limitation
  Modifier spans: Reason
  Parsed example: "Portions of the head and tail are obscured by bowel gas."
  Frequency in labeled dataset: 82

Fact: Information is elsewhere (e.g., in another document)
  Anchor entity: Information
  Modifier spans: Location
  Parsed example: "Findings: Please refer to CT chest from today's date for chest findings."
  Frequency in labeled dataset: 24

Fact: Radiologist communicated with someone
  Anchor entity: Whom
  Modifier spans: Timing, modality, information conveyed
  Parsed example: "These observations were discussed with and acknowledged by __NAME__ at __TIME__ on __DATE__."
  Frequency in labeled dataset: 8

Fact: This study has an indication
  Anchor entity: Indication
  Modifier spans: N/A
  Parsed example: "He presents for active surveillance of a left kidney mass."
  Frequency in labeled dataset: 27

Fact: Patient has stated age and gender
  Anchor entity: Age
  Modifier spans: Gender
  Parsed example: "63-year-old male"
  Frequency in labeled dataset: 87

Modifiers associated with all facts: Negation (of entire fact), statement of uncertainty, information source
Examples
Inspection of predictions on unseen reports showed that the network was able to extract varied information successfully from complex sentences. For instance, in the test set fact "9 mm nonaggressive appearing cystic lesion in the pancreatic tail on image 16 series 2 is unchanged from prior exam when measured in similar fashion, likely a sidebranch IPMN," the system was able to produce correct labels for all 7 informational spans, including diagnostic reasoning. Furthermore, the system was able to generalize to fact instances it had not seen before (for instance, it identified "Gamna-Gandy bodies" as a finding in the test set), indicating that it had learned general representations of how facts are recorded in radiology reports. In this way, it has a significant advantage over rule-based systems and ontologies which are limited to pre-specified vocabularies. The fastText embeddings, which utilize subword information in the form of character n-grams to embed unseen words, allowed the system to handle typographical errors effectively.
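This robustness to typos follows from the subword representation: a misspelling shares most of its character n-grams with the correct spelling, so their vectors land close together. A stdlib sketch of the n-gram overlap (n = 3, with fastText-style word-boundary markers):

```python
def char_ngrams(word: str, n: int = 3) -> set:
    # fastText-style: pad with boundary markers, then take all character n-grams
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

correct = char_ngrams("kidney")   # {"<ki", "kid", "idn", "dne", "ney", "ey>"}
typo = char_ngrams("kidny")       # {"<ki", "kid", "idn", "dny", "ny>"}

# Jaccard overlap between the two n-gram sets: 3 shared of 8 total
overlap = len(correct & typo) / len(correct | typo)
```

Because fastText builds a word's vector by summing its n-gram vectors, this shared-subword overlap is what pulls the typo's embedding toward the correct word's.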
Discussion
Our study demonstrates the feasibility of near-complete information extraction from radiologic texts using only a small corpus of 120 abdominopelvic cross-sectional imaging reports. It introduces a novel information schema which is capable of handling the vast majority of information in radiologic text and a neural network model for extracting that information. The system is capable of extracting information from reports written in a wide variety of template and prose styles by > 50 different radiologists. Although we have not yet integrated it into the clinical workflow, the system is capable of operating in real time (i.e., while a user types), and the information schema is easily extensible to additional facts with no changes to the neural network other than the number of output units. It is likely that our schema and models would generalize effectively to other radiologic and clinical domains.
Our representational system has several strengths. First, it is capable of handling multiple facts of the same type contained within the same text span—for instance, in the sentence "hepatic artery, portal vein, and hepatic veins are patent," our system extracts three separate "anatomic region has property" facts, one for each anatomic structure. This disentanglement of separate facts from raw text is particularly crucial for downstream applications that involve tracking or reasoning over discrete facts. Second, by representing our information schema using complete predicates, e.g., "patient has diagnosis" and "imaging study was used for comparison with this study," we are able to represent relations involving some implied entities such as "this patient" or "this study" and differentiate them from predicates involving other people or
studies. Third, by treating our modifier spans as questions to be answered rather than entities to be identified, we include all the information necessary to answer a single clinical query in each modifier span. To see the advantage of this, consider the sentence "Lesion observed within the left kidney." Many NER-based systems would attempt to identify "left" as a laterality and "kidney" as a body organ, but our system identifies the complete text span "within the left kidney" as the location of the lesion, which directly answers the clinical query and is more readily applicable to downstream tasks; e.g., an automated tracking system could disambiguate this lesion from other kidney lesions or lesions on the left side of the body, and display it separately.
There are, however, some limitations to our representation of the IE task. Although our system is able to represent the vast majority of information found in abdominopelvic radiology reports, no matter how comprehensive a structured information schema is, there are rare types of information that will be missed. Furthermore, our system is unable to handle some types of "implied" information which are not directly referenced in the text—e.g., in the sentence "The gallbladder is surgically absent," an ideal system might infer a cholecystectomy procedure fact, but there is no span of text corresponding to the procedure name in this sentence, and therefore our system cannot extract a procedure fact. For this study, this was a deliberate design decision, as we wanted to provide interpretable and verifiable output where each prediction is linked to a particular span of text [13]. However, this trade-off results in our system being unable to extract some types of implied information. In the limited domain of radiological text at our institution, the assumption of direct textual evidence for each entity was valid in the vast majority of cases, but not all.
Exceptions are rare in the structured domain of radiologic text, and we were able to find sensible workarounds for this study. However, moving to a more general model, where systems are capable of answering arbitrary natural language questions about a report or span of text, is likely necessary to solve some of these difficulties and expand the capabilities of information extraction systems. Figure 1 of a study from the Allen Institute [10] effectively highlights the difficulty in capturing every piece of information using structured ontologies and relations between entities and suggests a simpler alternative to complex ontologies in the form of question-answer meaning representations (QAMRs), which do not grow more complex and brittle with larger information schemata. Development of large question-answer data sets for radiologic and other clinical domains is an important step for progressing the field and developing models which can effectively comprehend and reason about radiologic texts. A parallel approach involves transfer learning from general-purpose question answering tasks such as the Stanford Question Answering Dataset (SQuAD) [11].
562 J Digit Imaging (2019) 32:554–564
Another limitation of this study is the small amount of labeled data available for the full document-level neural network (120 reports); as a general rule, deep learning models perform better with larger training data sets, and many specific vocabulary terms are not represented within this corpus. Labeling each report with its complete informational content requires hundreds of text spans to be manually identified (each report takes 30–45 min to label). However, our models were able to perform reasonably well even with this small training set and are able to generalize to unseen vocabulary terms, suggesting that the effort required to create effective training sets is not prohibitive even for smaller groups of clinicians or researchers. We plan to continue increasing the size of the training set, including additional reports from other department sections, imaging modalities, and institutions, to improve generalization performance. Alternatively, using rule-based labeling algorithms to produce weak labels for a much larger corpus of reports may be effective for training, although manually labeled data is still required to evaluate the performance of such systems.
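A rule-based weak labeler of the kind described above can be sketched as follows; the patterns and label names are illustrative examples, not the rules used in this work.

```python
import re

# Hypothetical rule-based weak labeler: regex patterns map matched spans
# to coarse fact types. The output is noisy ("weak") but cheap, so a
# much larger corpus can be labeled than is feasible by hand.
WEAK_RULES = [
    (re.compile(r"\b\d+(\.\d+)?\s?(mm|cm)\b"), "measurement"),
    (re.compile(r"\b(recommend(ed|s)?|follow-?up)\b", re.I), "recommendation"),
    (re.compile(r"\b(cyst|lesion|nodule|mass)\b", re.I), "finding_anchor"),
]

def weak_label(report: str):
    """Return (start, end, label) triples for every rule match."""
    labels = []
    for pattern, label in WEAK_RULES:
        for m in pattern.finditer(report):
            labels.append((m.start(), m.end(), label))
    return sorted(labels)

labels = weak_label("A 12 mm cyst is seen. Follow-up MRI is recommended.")
```

Such labels would serve only as pre-training signal; evaluation would still rely on manually labeled spans.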
From the modeling perspective, the use of more sophisticated word token representations such as language models, which tend to outperform static word embeddings on most NLP tasks [12], may provide further improvement. Using attentional models such as the transformer architecture [13] may also enable the system to handle long-range text dependencies more effectively than recurrent neural network models.
Our current study is limited to abdominopelvic radiology reports, and we did not conduct any formal experiments on other report types. It is likely that the similar vocabulary describing findings, recommendations, imaging modalities, and measurements, as well as the similar document-level form of most reports (findings ordered by anatomic region), would provide a baseline level of generalizability across radiology subdisciplines. However, exposure to subdiscipline-specific vocabulary (anatomic terms, specific lesion descriptions) would almost certainly be necessary for high-accuracy performance. Future work will aim to formally test these assumptions (e.g., training a model on one type of report and testing on another).
Conclusions
We develop a comprehensive information schema for complete information extraction from abdominopelvic radiologic reports, develop neural network models to extract this information, and demonstrate their feasibility. Our system uses no pre-specified rules, runs in real time, generalizes to arbitrarily large information schemata, and has representational advantages over systems which only perform named entity recognition. The system has many downstream applications, including follow-up tracking systems, research cohort identification, intelligent chart search, automatic summarization, and creation of high-accuracy weak labels for large imaging data sets. Future work includes using more sophisticated language models, generalizing to all radiologic subdisciplines, and building downstream applications. Complete conversion of unstructured data to structured data represents a feasible complementary approach to structured reporting toward the goal of creating fully machine-readable radiology reports.
Appendix. Hyperparameters and Model Specifics
All models were trained in PyTorch with the adaptive moment estimation (Adam) optimizer and a learning rate of 1e-3.
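A minimal sketch of this optimizer setup (the two-layer model is a placeholder, not our span-extraction networks):

```python
import torch

# Placeholder model: the real networks are recurrent span taggers;
# only the optimizer configuration below reflects the training setup.
model = torch.nn.Sequential(
    torch.nn.Linear(300, 128),  # e.g., 300-d word embeddings in
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),    # per-token tag logits out
)
# Adam with learning rate 1e-3, as described above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```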
Since the second network receives as input the predicted fact spans and anchors from the first network, we stochastically augment the dataset to simulate incorrect output predictions by the first network. A random number of additional tokens between 0 and 20 is appended to either side of the true fact span in order to generate a noisy training example for the second network. This enables the second network to learn to correct the first network's incorrect predictions.
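This augmentation step can be sketched as follows (an illustrative function, not our exact implementation):

```python
import random

def augment_span(tokens, start, end, max_pad=20, rng=random):
    """Widen the true fact span [start, end) by a random number of
    tokens (0-20 by default) on each side, simulating an imperfect
    span prediction from the first network. Offsets are clipped to
    the document boundaries."""
    new_start = max(0, start - rng.randint(0, max_pad))
    new_end = min(len(tokens), end + rng.randint(0, max_pad))
    return new_start, new_end

tokens = "There is a small cyst in the liver .".split()
# True fact span is tokens[2:5] == ["a", "small", "cyst"].
noisy = augment_span(tokens, 2, 5, rng=random.Random(0))
```

Because padding is only ever added outward, the noisy span always contains the true span, so the second network sees the correct tokens plus distracting context.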
Training was completed in 2 h on a machine with one NVIDIA GTX 1070 GPU.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Rubin DL, Kahn Jr CE: Common Data Elements in Radiology. Radiology 283:837–844, 2017
2. European Society of Radiology (ESR): ESR paper on structured reporting in radiology. Insights Imaging 9(1):1–7, 2018
3. Singh S: Natural Language Processing for Information Extraction. arXiv [cs.CL], 2018
4. Hassanpour S, Langlotz CP: Information extraction from multi-institutional radiology reports. Artif Intell Med 66:29–39, 2016
5. Nandhakumar N, Sherkat E, Milios EE, Gu H, Butler M: Clinically Significant Information Extraction from Radiology Reports. In: Proceedings of the 2017 ACM Symposium on Document Engineering - DocEng ’17. ACM Press, 2017, pp 153–162
6. Esuli A, Marcheggiani D, Sebastiani F: An enhanced CRFs-based system for information extraction from radiology reports. J Biomed Inform 46:425–435, 2013
7. Cornegruta S, Bakewell R, Withey S, Montana G: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks. arXiv [cs.CL], 2016
8. Pawar S, Palshikar GK, Bhattacharyya P: Relation Extraction: A Survey. arXiv [cs.CL], 2017
9. Huang L, Sil A, Ji H, Florian R: Improving Slot Filling Performance with Attentive Neural Networks on Dependency Structures. arXiv [cs.CL], 2017
10. Michael J, Stanovsky G, He L, Dagan I, Zettlemoyer L: Crowdsourcing Question-Answer Meaning Representations. arXiv [cs.CL], 2017
11. Rajpurkar P, Zhang J, Lopyrev K, Liang P: SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv [cs.CL], 2016
12. Devlin J, Chang M-W, Lee K, Toutanova K: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv [cs.CL], 2018
13. Vaswani A, et al: Attention Is All You Need. arXiv [cs.CL], 2017
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.