semantic modeling for exposomics with exploratory evaluation in...
TRANSCRIPT
Research ArticleSemantic Modeling for Exposomics with ExploratoryEvaluation in Clinical Context
Jung-wei Fan,1,2,3 Jianrong Li,1,2,3 and Yves A. Lussier1,2,3,4
1Department of Medicine, The University of Arizona, Tucson, AZ, USA2BIO5 Institute, The University of Arizona, Tucson, AZ, USA3Center for Biomedical Informatics & Biostatistics, The University of Arizona, Tucson, AZ, USA4Cancer Center, The University of Arizona, Tucson, AZ, USA
Correspondence should be addressed to Jung-wei Fan; [email protected] and Yves A. Lussier; [email protected]
Received 24 April 2017; Revised 26 June 2017; Accepted 30 July 2017; Published 30 August 2017
Academic Editor: Cui Tao
Copyright © 2017 Jung-wei Fan et al. This is an open access article distributed under the Creative Commons Attribution License,which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Exposome is a critical dimension in the precision medicine paradigm. Effective representation of exposomics knowledge isinstrumental to melding nongenetic factors into data analytics for clinical research. There is still limited work in (1) modelingexposome entities and relations with proper integration to mainstream ontologies and (2) systematically studying their presencein clinical context. Through selected ontological relations, we developed a template-driven approach to identifying exposomeconcepts from the Unified Medical Language System (UMLS). The derived concepts were evaluated in terms of literaturecoverage and the ability to assist in annotating clinical text. The generated semantic model represents rich domain knowledgeabout exposure events (454 pairs of relations between exposure and outcome). Additionally, a list of 5667 disorder conceptswith microbial etiology was created for inferred pathogen exposures. The model consistently covered about 90% of PubMedliterature on exposure-induced iatrogenic diseases over 10 years (2001–2010). The model contributed to the efficiency ofexposome annotation in clinical text by filtering out 78% of irrelevant machine annotations. Analysis into 50 annotateddischarge summaries helped advance our understanding of the exposome information in clinical text. This pilot studydemonstrated feasibility of semiautomatically developing a useful semantic resource for exposomics.
1. Introduction
Precision medicine represents a paradigm of conductingbiomedical research and practice by considering individualvariation—genes, environment, and lifestyle [1]. Environ-mental and lifestyle factors play a critical role in our healthand are known to interact with the genetic componentsthrough an epigenetic process [2]. The concept of exposome[3], which first came about in 2005, stands for all nongeneticfactors that a person is exposed to throughout a lifetime.Common examples are pollutants, tobacco/alcohol use,occupational hazards, and even psychosocial stress such asbeing a victim of abuse. Exposome science has receivedincreasing attention in precision medicine and has evolvedinto an interdisciplinary field across biology, genomics,public health, statistics, and informatics [4]. In order tocapture the full picture of various health-related contexts,
it is essential to incorporate exposome parameters intoknowledge engineering and data analytics [5].
Whenever exposome is mentioned, common under-standing refers to environmental conditions and self-induced exposures in social history. In fact, there has beena growing interest in using social history information (e.g.,substance use and occupation) from clinical text [6–8]. Onthe other hand, often overlooked are exposures in clinicalsettings, many of which are the consequence of healthcareactivities. Given the simplified equation “phenotype=genotype× environment,” extensive research has focused onphenotyping [9] from electronic health records (EHRs).Likewise, we propose that it is equally important to investi-gate all exposome information available in EHRs, a taskdubbed as “expotyping” [10]. Dr. Vertosick Jr’s quote, “Youain’t never the same when the air hits your brain” [11] easilyexplains the implication that any procedure, especially a
HindawiJournal of Healthcare EngineeringVolume 2017, Article ID 3818302, 10 pageshttps://doi.org/10.1155/2017/3818302
major one like craniotomy (plus the complication of anes-thetics), may have a lingering effect on one’s health (e.g.,postoperative cognitive dysfunction [12]). In addition,research has revealed that even life-saving procedures couldparadoxically result in an adverse outcome (e.g., ventilator-induced lung injury), depending on the subtle interplay witha patient’s genetic predisposition [13]. Fortunately, modernEHRs can actually serve as a primary data source for tracingall types of clinical exposome on a patient. Another notablecategory is pathogen exposures, which often occur in disguiseas diseases that can be deterministically attributed to a spe-cific microbe. However, the challenge remains on how wecan model and extract such abundant exposome informationfrom EHRs.
Despite the diverse exposome data available in EHRs,there is still a dearth of systematic work on modelingand extraction. Possible explanations are (1) many health-care activities are not perceived as exposures, (2) mostresearch concentrate only on a narrow set of study-specific exposures, or (3) there is a lack of semanticresources available for assisting in the identification ofexposome information in the EHRs. The most pertinentmodeling framework we found was the Exposure Ontology(ExO) [14], which maps out key exposome concepts suchas exposure event, stressor, and the relations among them.However, ExO is a bare skeleton that has not been filledwith cross-references to concepts in other major terminol-ogies and currently has limited coverage over the diverseclinical exposome of interest. To bridge these gaps with afocus on systematically investigating exposures in clinicalcontexts, we aimed to semiautomatically derive a semanticframework as well as explored its usability in facilitatingexposome annotation of narrative EHRs. Our semanticmodeling consisted of a subset of the Unified MedicalLanguage System (UMLS) and, therefore, was interopera-ble with major biomedical terminologies. The interest innarrative EHRs was based on empirical knowledge thatthe texts not only serve as a good source of environmen-tal/lifestyle factors but also document extensive clinicalexposures (e.g., procedures). The annotation was meantto obtain general assessments on what exposome informa-tion is described in the clinical text, which would then laythe foundation for developing extraction tools and fordiscovering disease associations down the road.
The objectives of this study were (1) to create anexposome-oriented semantic network from existing ontologyentities and relations, (2) to perform exploratory evaluationon the semantic adequacy and usability in clinical contexts,and (3) to summarize the properties of exposome conceptsfound in EHRs. In summary, our exposome subnetwork rep-resented rich domain knowledge, including 454 pairs ofexposure and outcome semantic types. The subnetwork con-cepts consistently covered about 90% of PubMed literatureon exposure-induced iatrogenic diseases over a 10-yearperiod. The subnetwork filtered out 78% of irrelevantmachine outputs in the exposome annotation of 50 dischargesummaries. Analysis of the annotations offers insights intothe exposome presence in clinical contexts and feedback forexisting ontology resources.
2. Materials and Methods
2.1. The UMLS and Semantic Network. The UMLS [15] is theworld’s largest integrated biomedical terminology frameworkmaintained by the U.S. National Library of Medicine (NLM).The UMLS has three major components: the Metathesaurus(META), the Semantic Network (SN), and the SpecialistLEXICON. TheMETA unifies concepts frommultiple sourceterminologies and ontologies into individual ConceptUnique Identifiers (CUIs). As of the end of 2016, it containedmore than three million concepts from 199 sources. TheMETA includes files that host rich semantic information.The MRCONSO.RRF is the main file that collects compre-hensive synonymous entities of the sources under eachCUI. The MRSTY.RRF file links each CUI to their corre-sponding SN semantic type(s). The MRREL.RRF filepreserves fine-grained semantic relations from source ontol-ogies such as SNOMED-CT [16]. The SN currently contains127 semantic types (e.g., T037 Injury or Poisoning) and 54distinct semantic relations. The SN relations can be tracedback to their origin from the NLM’s Medical Subject Head-ings (MeSH) and are categorized as either hierarchical (is-a)or nonhierarchical (e.g., type A causes type B) [17]. For usecases that do not require fine granularity, there is also amapping that aggregates the SN types into coarser semanticgroups [18].
2.2. The i2b2 NLP Challenge Corpus. For studying the distri-bution and properties of exposome concepts in clinical texts,we used de-identified notes from the i2b2 NLP (NaturalLanguage Processing) Research Data Sets that consist ofmedical records from institutions such as Partners Health-Care. With roughly a decade of history, the i2b2 challenges[19] facilitated clinical NLP research that varied from funda-mental (e.g., coreference resolution) to end applications (e.g.,identifying obesity). The 2006 corpus [20] for tasks of de-identification and smoking status classification was chosenfor the current study—specifically a subset of Partners 889raw discharge summaries without any annotation. The2006 subset was chosen because the section headers occurredmostly with explicit uppercase patterns, making automateddetection easier for computing section-wise distribution ofthe concept annotations. Further, the corpus had sentenceboundaries predetected and therefore reduced the NLPeffort. In a generalizability evaluation, we also used 73 inde-pendent discharge summaries from Beth Israel DeaconessMedical Center (or simply “Beth” hereafter), which was partof the i2b2 2010 corpus.
2.3. Extract Exposure-Related Semantic Types. To identifyexposure-related semantic types, we started with disorder-related semantic types. This strategy was based on theassumption that any exposure event would generally resultin a certain negative health effect. Figure 1(a) illustrates ourinference process using an event-driven template that hasbeen framed into the question: What nongenetic factorscould contribute to a disorder? The template was used tosearch the SRSTRE2 file of SN, which contained fullyinherited relations among the semantic types. The disorder
2 Journal of Healthcare Engineering
semantic types were obtained via the Disorder (DISO) of theSN semantic groups [18]. All the DISO semantic types wereincluded except for T050 Experimental Model of Disease,which indicates an artificial setting. Wemanually determinedfour semantic relations that would involve a DISO type in anexposure (abbreviated as EXPO) event:
(1) DISO is result_of EXPO.
(2) DISO is associated_with EXPO.
(3) EXPO causes DISO.
(4) EXPO complicates DISO.
Many of the candidate EXPO semantic types matchedthe template; however, not all of them also fit into thecontext of an exposure event. For example, T019 Congen-ital Abnormality associated_with T040 Organism Functiondoes not fit, as the latter is not a qualified EXPO type.Manual curation was performed over all candidate EXPOtypes suggested by the template, and only those that reallymeant an EXPO→DISO event were kept.
2.4. Extract Disease Concepts of Microbial Etiology. Toidentify microbe exposure where only the disease is men-tioned, we used fine-grained semantic relations in theMETA MRREL.RRF file. The file contains semantic rela-tions among UMLS concepts that were populated fromthe source ontologies. We reviewed all relations inMRREL.RRF and chose the causative_agent_of fromSNOMED-CT to form a template. Figure 1(b) illustratesour inference process of using the template. We manuallydetermined three specific semantic types that represent themost common microbial exposures: T004 Fungus, T005Virus, and T007 Bacterium. Searching with the templateis equivalent to asking “the microbe concept is a
causative_agent_of what?” The outcome slot is simplyfilled by selecting any concept that belongs to the DISOsemantic group as described in the previous section. As aresult, each identified disorder actually stands for “everbeing exposed to” a responsible microbe, based on thedomain knowledge modeled in SNOMED-CT.
2.5. Evaluate the Semantic Adequacy for Covering ClinicalExposome Literature. To evaluate if the semantic modelingreasonably covered domain knowledge on clinical expo-some, we used PubMed literature about exposure-inducediatrogenic diseases during 2001–2010. The range to 2010was used because PubMed records of the most recent yearsmight be still being indexed and not reflect a stabilized viewfor the bibliometric purpose. The benchmark set consistedof 1248 PubMed IDs (PMIDs) from the query: “adverseeffects”[sh] AND iatrogenic[ti] AND (“2001/01/01”[PDAT]:“2010/12/31”[PDAT]). The MeSH terms of the PMIDs wereautomatically extracted and mapped to CUIs using theMRCONSO.RRF source and synonym information. Giventhe CUIs of the MeSH terms, we were able to compute whichMeSH terms belonged to the exposure or disease semantictypes of our derived exposome subnetwork. Note that for aMeSH to qualify as an exposure, we also required it to befollowed by the subheading “adverse effects” or any of itschildren “toxicity” and “poisoning.” For a PMID to becounted as being covered by the subnetwork, at least oneMeSH had to belong to an exposure semantic type (plusproper subheading) and at least one MeSH belonging toa disease semantic type as in one of the covered EXPO-DISO relations (refer to Supplement 1 available online athttps://doi.org/10.1155/2017/3818302). The chronologicaltrends were summarized for the number of covered PMIDsand exposure-disease pairs, as well as the percentage of cov-ered PMIDs and the average number of exposure-diseasepairs per PMID.
Disorder semantic typesSemantic relations
of interest
Result_ofAssociated_with
CausesComplicates
T019: Congenital abnormality
T020: Acquired abnormality
T037: Injury or poisoning
Candidate exposuresemantic types
T040: Organism function
T054: Social behavior
T069: Environmentale�ect of humans
... ...
(a) Identifying exposure semantic types via relations with disorders
Microbe semantic types
Causative_agent_of
SNOMED-CTrelation of interest
Automatically selecteddisorder concepts
C0009186: Coccidioidomycosis
C0282687: Hemorrhagicfever, Ebola
...
T004: Fungus
T005: Virus
T007: Bacterium
(b) Identifying disease concepts in relation with microbe exposures
Figure 1: Ontology-assisted selection of exposome concepts.
3Journal of Healthcare Engineering
2.6. Perform Exposome Annotation and Analysis in ClinicalText. For batch preannotation, the MetaMap [21] program(2016 version) was used to identify UMLS concepts in the889 Partners discharge summaries. From the MetaMap-identified concepts, we filtered by keeping only those eitherwith an exposure semantic type or of a microbe-caused disor-der. In addition, regular expression (uppercase phrasefollowed by colon) was used to mark the section headersfor calculation of section-wise distribution of the concepts.To obtain a more reliable summary of the exposome annota-tions, the first author (Ph.D. in biomedical informatics) man-ually curated themachine-identified candidate concepts in 50random discharge summaries that had an explicit socialhistory section. In terms of size, this subcorpus had an averageof 115 sentences (or 1124 tokens) per document, and theaverage sentence length was 9.5 tokens. The curation involvedverifying exposures that were present to the patient and cor-recting any mislabeled sections. The process was facilitatedby using the brat annotation tool [22]. Based on the curatedannotations, we calculated descriptive statistics of the conceptdistribution over different sections and semantic types. As arough assessment of reproducibility, we used the curatedexposure annotations to (case-insensitively) exact-matchinto 73 independent Beth discharge summaries and com-pared the distribution of sections that contained thoseexposures. Lastly, qualitative analysis was performed tounderstand characteristics of the exposome concepts andlimitations of the existing ontology resources.
3. Results and Discussion
3.1. Modeling of Exposome Semantic Types and Concepts. Themethods of Figure 1(a) obtained 454 EXPO-DISO pairs thatwere linked via the relations causes, complicates, associated_with, and results_of. There were 41 distinct exposure seman-tic types involved in these relations. The full list of thecurated semantic type pairs is available in Supplement 1.The EXPO-DISO pairs represent a comprehensive networkof semantic types involved in exposure events. To provide apeek into the model contents, Figure 2 visualizes a partialnetwork showing only exposure semantic types that aredirectly related to T020 Acquired Abnormality. Pointing out-ward are the relations result_of and associated_with, forexample, Acquired Abnormality as result_of Therapeutic orPreventive Procedure. Inversely, the exposure nodes pointto the center with the relations causes and complicates, forexample, Pharmacologic Substance causes Acquired Abnor-mality. We color-coded the exposure nodes according totheir broader UMLS semantic groups. It can be seen thatChemicals & Drugs (in green) dominate about half of thetypes, followed by Procedures (in purple), and then Activities& Behaviors (in indigo). The methods described inFigure 1(b) identified 5667 microbe-induced disorderconcepts. Examples are shown in Table 1, and the full list isavailable in Supplement 2.
The results demonstrate rich domain knowledge that canbe extracted from existing ontology. The derived subnetworkmodels comprehensive interactions involved in exposureevents. One advantage of using the UMLS is that it allows
linking the semantic framework to individual biomedicalconcepts and their source terminologies for versatile integra-tion. The application of our methods offers great potential forenriching existing resources such as the ExO, which providesa core skeleton but lacks the integration with concepts ofmajor terminologies. Our deliverable of the microbial disor-der concepts can serve as a useful resource itself, especiallyfor use cases dealing with infectious diseases. In the list, wealso observed an interesting disorder, C0014522 Epidermo-dysplasia verruciformis, which is an autosomal recessiveinherited skin disease that features wart-like lesion due toinfection with the human papillomavirus (HPV). It echoesthe importance of investigating subtle interplay betweenour genome and exposome in order to fully understandcertain health conditions. In terms of methodology, oursemantic filtering based on relational template demonstratesa useful ontological approach for selecting task-specificentities of interest.
3.2. Semantic Adequacy of the Derived Exposome Subnetwork.For assessing whether our semantic framework adequatelyaccommodates evolving evidence-based medicine, Figure 3shows the trends of covered literature on exposure-inducediatrogenic diseases—which represent the primary area ofinterest in clinical exposome. In Figure 3(a), the red line indi-cates a steady increase of PMIDs that had at least oneexposure-disease MeSH pair covered by our subnetwork. Interms of the distinct exposure-disease pairs, the blue lineindicates corresponding growth over the 10 years. Table 2provides actual examples of the covered MeSH pairs andtheir host PMIDs. More importantly, not only do the countsindicate the coverage scales with the knowledge growth butthe orange line in Figure 3(b) indicates that the percentage ofPMIDs with covered exposure-disease pairs remained consis-tently high (mean= 90.59%, standard deviation=0.02%). Thegreen line shows that the average number of coveredexposure-disease pairs per PMID climbed mildly over theyears. As for the peak in the green line, it is unclearwhy in 2004 there was a sudden surge of research find-ings. Despite marginal errors in the bibliometrics-oriented evaluation, we believe it reasonably corroboratesthe reliability and scalability of our semantic modeling.In addition, since our subnetwork is derived from theUMLS, any update in the SN (though infrequent) can beincorporated on a regular basis.
3.3. Exposome Annotations in the Clinical Text. The filter ofexposome semantic types and concepts consistently reducedthe number of entities required for manual review. Specifi-cally for the 50 random discharge summaries, an average of77.72% presumably irrelevant (i.e., nonexposome) MetaMapannotations was filtered out. After the manual curation, themedian number of annotations per note was 36 (min: 9,max: 118). According to the data use agreement, we willcontribute our annotations back to the i2b2 clinical NLPrepository. By aggregating counts across the notes, the top20 sections with the curated exposome annotations areshown in Table 3. Due to the inpatient context of dischargesummaries, the top section turned out to be HOSPITAL
4 Journal of Healthcare Engineering
COURSE, which essentially covers all procedures and medi-cations administered during a patient’s hospitalization.Aligning with expectation, SOCIAL HISTORY rankedmoderately high (the 8th) in the list. There were sixmedication-related sections in the top 20, which reflected
drugs being a preeminent exposure in a clinical setting.The sections HISTORY OF PRESENT ILLNESS (the 2nd)and PAST MEDICAL HISTORY (the 7th) also hosted adecent amount of exposome annotations. As the assessmentof generalizability across independent institution/dataset,
Enzyme
Hazardous orpoisonoussubstance
Element, ion, orisotope
Clinical drug
Chemical drug
Acquiredabnormality
Biomedical ordental material
Biologically activesubstance
Antibiotic
Amino acid,peptide, or
protein
Vitamin
Pharmacologicsubstance
Organic chemical
Occupationalactivity
Individual behaviorSocial behaviorBehavior�erapeutic or
preventiveprocedure
Molocular biologyresearch
technique
Laboratoryprocedure
Health care activity
Diagnosticprocedure
Naturalphenomenon or
process
Human-causedphenomenon or
process
Environmentale�ect of humans
Manufacturedobject
Food
Substance
Geographic area
Medical device
Drug delivery
Research device
Nucleic acid,nucleoside, or
nucleotide
Inorganic chemicalIndicator,reagent, or
diagnostic aidImmunologic factor
Hormone
Result_ofPhenomenaDisorders
Activities & behaviors
Procedures
Geographic areas
Devices
Chemicals & drugs
Objects
Associated_with
Causes
Complicates
Chemical viewedfunctionally
device
Figure 2: A partial network of exposome semantic types and their relations to health outcome. ∗The colors of the exposure nodes stand fordifferent broad semantic groups that the semantic types belong to.
5Journal of Healthcare Engineering
Table 4 shows that we not only found many of the identicalexposure terms in the Beth corpus (the column “# annota-tions” was based on exact-string search) but the sectionshosting those terms also exhibit high similarity in terms
of name and rank (comparing to Table 3). For example,the top two sections, HOSPITAL COURSE and HISTORYOF PRESENT ILLNESS, are identical. In Table 5, we listthe top 20 semantic types of the curated concepts. The
Table 1: Example disease concepts of microbial etiology.
Disease CUI Disease name Disease semantic type Microbe CUI Microbe name Microbe semantic type
C0006057 Botulism T037 Injury or Poisoning C0009055Clostridiumbotulinum
T007 Bacterium
C0275677 Miscarriage due to Leptospira T046 Pathologic Function T0023358 Leptospira T007 Bacterium
C0376618 Endotoxemia T033 Finding C0018150Gram-negative
bacteriaT007 Bacterium
C0266024 Moon’s molar teethT019 CongenitalAbnormality
C0040840 Treponema pallidum T007 Bacterium
∗C0032768 Postherpetic neuralgia T047 Disease or Syndrome C0042338 Human herpesvirus 3 T005 Virus∗C0032371 Poliomyelitis T047 Disease or Syndrome C0206435 Human poliovirus T005 Virus
∗C1535939Pneumocystis jiroveci
pneumoniaT047 Disease or Syndrome C0320385 Pneumocystis jiroveci T004 Fungus
∗C0029307 Oroya Fever T047 Disease or Syndrome C0318324Bartonellabacilliformis
T007 Bacterium
∗Concepts that did occur in the study corpus.
800
700
600
500
400
300
200
100
2001Expo-diso pairsCovered PMIDs
19873 72 83 84 100 117 133 133 164 172
236 274 376 370 413 482 512 627 6802002 2003 2004 2005 2006 2007 2008 2009 20100
(a) Number of exposure-disease MeSH pairs and number of PMIDs that had at least one such pair covered by the
derived exposome subnetwork
% of covered PMIDsAvg # Expo-diso pairs per PMID
2001
5.00
4.00
3.00
2.00
1.00
0.00
2.7194.81 88.89 89.25 89.36 88.50 91.41 90.48 91.10 91.11 91.01
3.28 3.30 4.48 3.70 3.53 3.62 3.85 3.82 3.952002 2003 2004 2005 2006 2007 2008 2009 2010
100.00
97.00
94.00
91.00
88.00
85.00
(b) Percentage of PMIDs that had at least one covered exposure-disease MeSH pair and the average number of
such MeSH pairs per PMID
Figure 3: Semantic coverage of exposure-induced iatrogenic diseases in PubMed from 2001 to 2010.
6 Journal of Healthcare Engineering
top two types, T121 Pharmacologic Substance and T109Organic Chemical, are from drugs, consistent with the highprevalence of medication-related sections as in Table 3. Thesecond cluster consists of T061 Therapeutic or PreventiveProcedure and T060 Diagnostic Procedure, with most ofthem mentioned in HOSPITAL COURSE (not shown inthe table). The type T055 Individual Behavior (e.g., use ofalcohol or tobacco) mostly resides in SOCIAL HISTORY.For further inspection, Supplement 3 includes the detailsof the annotation counts ranked in two levels: first bysection and then by semantic type.
The results show that our exposome semantic modelinghelped prefilter about 78% of irrelevant concepts before themanual curation. Our attention to microbe-based exposureswas justified: Although minor in proportion, the two types,T047 Disease or Syndrome (those of microbial etiology)and T007 Bacterium, made it into the second half ofTable 5 (the 11th and 15th, resp.). Medications were foundto form the majority of the clinical exposome, which echoes
Table 2: Example exposure-disease MeSH pairs (and their semantic types) covered by our exposome subnetwork.
PMID Exposure MeSH Exposure semantic type Disease MeSH Disease semantic type
20736205Anti-Arrhythmia
AgentsT121 Pharmacologic Substance Hallucinations
T048 Mental or BehavioralDysfunction
11337626 OsteotomyT061 Therapeutic or Preventive
ProcedureKyphosis T190 Anatomical Abnormality
11387778 Colonoscopy T060 Diagnostic Procedure Intestinal Perforation T047 Disease or Syndrome
11581058Prostheses and
ImplantsT074 Medical Device Lacrimal Duct Obstruction T190 Anatomical Abnormality
11747288 HIV Protease Inhibitors T121 Pharmacologic Substance Lipodystrophy T047 Disease or Syndrome
11984961Dental Restoration,
PermanentT122 Biomedical or Dental
MaterialSystemic Inflammatory Response
SyndromeT047 Disease or Syndrome
12043843 HaloperidolT109 Organic Chemical
T121 Pharmacologic SubstanceBasal Ganglia Diseases T047 Disease or Syndrome
12106934 Cardiac Catheterization T058 Health Care Activity Arteriovenous Fistula T190 Anatomical Abnormality
Table 3: Top 20 sections with exposure annotations in the 50Partners discharge summaries.
Rank Section name # annotations
1 HOSPITAL COURSE 617
2 HISTORY OF PRESENT ILLNESS 418
3 DISCHARGE MEDICATIONS 344
4 MEDICATIONS ON ADMISSION 321
5 MEDICATIONS ON DISCHARGE 309
6 MEDICATIONS 178
7 PAST MEDICAL HISTORY 89
8 SOCIAL HISTORY 87
9 ALLERGIES 47
10 MEDICATIONS ON TRANSFER 45
11 HOSPITAL COURSE BY SYSTEM 40
12 ADMISSION MEDICATIONS 40
13 RELEVANT LABORATORY DATA 38
14 LABORATORY DATA 36
15 HOSPITAL COURSE AND TREATMENT 35
16 HOSPITALIZATION COURSE 29
17 PHYSICAL EXAMINATION 22
18 PAST SURGICAL HISTORY 22
19 IDENTIFYING DATA 19
20 DISCHARGE INSTRUCTIONS 18
Table 4: Top 20 sections with exposure annotations in the 73 Bethdischarge summaries.
Rank Section name # annotations
1 HOSPITAL COURSE 492
2 HISTORY OF PRESENT ILLNESS 386
3 BRIEF HOSPITAL COURSE 353
4 DISCHARGE MEDICATIONS 267
5 MEDICATIONS ON ADMISSION 157
6 PERTINENT RESULTS 148
7 DISCHARGE INSTRUCTIONS 114
8 DISP 103
9 SOCIAL HISTORY 93
10 PAST MEDICAL HISTORY 92
11 SIGNED ELECTRONICALLY BY 69
12 INDICATION 45
13THE FOLLOWING ISSUES WERE
ADDRESSED DURING THIS HOSPITALADMISSION
44
14 GASTROINTESTINAL 44
15 SERVICE 43
16 IMPRESSION 41
17 COURSE IN HOSPITAL 40
18 ALLERGIES 40
19 LABORATORY DATA 38
20MAJOR SURGICAL OR INVASIVE
PROCEDURE35
7Journal of Healthcare Engineering
a previous work that also presented drugs as an “environ-mental” factor in EHRs [23]. Given the exposures predomi-nantly being medications and procedures, one could arguethat using structured EHR data would save the redundanteffort of extracting them from text. However, it requiresfurther investigation to understand how much data wouldbe missed if using only structured data. For example, aprevious study showed that provider notes are comple-mentary to structured data for recording of medicationintensification [24] and that we should take into accountany over-the-counter products documented in history nar-ratives. It is noteworthy that HISTORY OF PRESENTILLNESS and PAST MEDICAL HISTORY ranked amongthe top 10 sections in Table 3, suggesting that the history sec-tions that are generally included in most clinical notes (notjust discharge summary) can host abundant exposome infor-mation. Among the exposome-containing sections, SOCIALHISTORY should be considered as different by nature, for itdistinctly covers most of the “nonclinical” exposomesencountered in a patient’s daily life. In one specific note,the social history even documented exposure to a nuclearaccident in the patient’s past residence.
A pertinent question that emerged in the study was “towhat extent could we generalize the definition of exposome?”We annotated concepts such as retirement and widowhood(both T033 Finding) as exposures, given that studies haveshown their association with health [25, 26]. However, wouldit make sense to also treat any significant biological eventsuch as pregnancy as an exposure? This remains an open
subject of future research. Interestingly, an infrequentcategory of exposures revealed one blind spot of oursemantic modeling, which could partially be attributed tolimitations in the UMLS semantic classification. We cameacross several exposure concepts such as C3496069Cocaine Use and C0206073 Domestic Violence, whichwere classified to type T048 Mental or Behavioral Dys-function. However, the filter criteria missed them becausewe did not consider T048 to be a causal role (the exposureslot) in our event template, and none of these health issueshad microbial etiology. It definitely takes more considerationfor reclassifying and modeling these abusive behaviors(affecting self or others) as exposures.
3.4. Limitations. Although we believe the high-level trends tobe reliable, the PubMed-based coverage evaluation mightinvolve some marginal errors inevitably. Only dischargesummary was utilized in this study, which does not repre-sent all types of clinical text. As an exploratory evaluation,the exposome annotations were curated by only one anno-tator (JF) and therefore not free from bias. With the aimof just assessing what exposome concepts exist, we didnot engage in annotating any advanced attributes such asintensity (dosage), duration, and frequency. Relatedly, wedid not attempt to optimize the NLP methods for extractingexposome concepts; instead, we concentrated on under-standing the high-level distribution and characteristics ofthe final annotations.
3.5. Future Work. We will consider seeking collaborationwith the ExO development team to unify the solutions andmake it a sustainable resource for the exposome sciencecommunity. Beyond serving the coverage evaluation, we willbuild a knowledge base of clinical exposome by refining ourPubMed query and postprocessing pipeline. For rigorousannotation, multiple annotators will be used and with agree-ment metrics computed. The annotation criteria are to beexpanded by including pertinent attributes (e.g., exposureintensity) and documented in a formal guideline. Theannotations can be used to train NLP systems for cohortidentification in clinical exposome studies. Ultimately, amore thorough expotyping approach should be pursued tointegrate both structured and unstructured EHR data.
4. Conclusions
We developed a semiautomated approach for modelingexposome semantics and assessing the usability in clinicalcontexts. The modeling leveraged event templates to iden-tify exposure concepts that had ontological relations withdisease outcomes. A subnetwork was derived from theUMLS, representing 454 pairs of relations between expo-sure and outcome semantic types. The subnetwork wasable to cover 90% of PubMed literature on exposure-induced iatrogenic diseases from 2001 to 2010. For iden-tifying exposome concepts in 50 discharge summaries, thesubnetwork improved the efficiency by filtering out 78%of irrelevant machine annotations. The exposome con-cepts exhibited diverse presence of semantic types and
Table 5: Top 20 semantic types of the exposure annotations in the50 Partners discharge summaries.
Rank Semantic type # annotations
1 T121: Pharmacologic Substance 927
2 T109: Organic Chemical 874
3 T061: Therapeutic or Preventive Procedure 281
4 T060: Diagnostic Procedure 278
5 T195: Antibiotic 105
6 T116: Amino Acid, Peptide, or Protein 72
7 T074: Medical Device 64
8 T033: Finding 47
9 T197: Inorganic Chemical 38
10 T125: Hormone 36
11 T047: Disease or Syndrome 30
12 T200: Clinical Drug 29
13 T127: Vitamin 27
14 T055: Individual Behavior 24
15 T007: Bacterium 15
16 T097: Professional or Occupational Group 14
17 T129: Immunologic Factor 14
18T114: Nucleic Acid, Nucleoside, or
Nucleotide13
19 T058: Health Care Activity 12
20 T037: Injury or Poisoning 12
8 Journal of Healthcare Engineering
sections in the annotated corpus. Analysis into the resultsexpanded our understanding of the clinical narratives andontology resources. This work demonstrated the value ofsemantics-powered methods for advancing exposome sci-ence and precision medicine.
Conflicts of Interest
The authors declare that there is no conflict of interestregarding the publication of this paper.
Acknowledgments
This work was supported in part by The University ofArizona Health Sciences CB2, the BIO5 Institute, and NIH(U01AI122275, HL132532, CA023074, 1UG3OD023171,1R01AG053589-01A1, and 1S10RR029030). The authorsthank Dr. Colleen Kenost for the help with the Englishediting. De-identified clinical records used in this researchwere provided by the i2b2 National Center for Biomedi-cal Computing funded by U54LM008748 and were origi-nally prepared for the Shared Tasks for Challenges in NLPfor Clinical Data organized by Dr. Ozlem Uzuner, i2b2and SUNY.
References
[1] “The precision medicine initiative cohort program - build-ing a research foundation for 21st century medicine,September 2016,” 2015, https://www.nih.gov/sites/default/files/research-training/initiatives/pmi/pmi-working-group-report-20150917-2.pdf.
[2] S. M. Langevin and K. T. Kelsey, “The fate is not always writtenin the genes: epigenomics in epidemiologic studies,” Environ-mental Molecular Mutagenesis, vol. 54, no. 7, pp. 533–541,2013.
[3] C. P. Wild, “Complementing the genome with an ‘exposome’:the outstanding challenge of environmental exposure mea-surement in molecular epidemiology,” Cancer Epidemiology,Biomarkers & Prevention, vol. 14, no. 8, pp. 1847–1850, 2005.
[4] A. K. Manrai, Y. Cui, P. R. Bushel et al., “Informatics and dataanalytics to support exposome-based discovery for publichealth,” Annual Review Public Health Published Online First,vol. 38, pp. 279–294, 2017.
[5] F. Martin Sanchez, K. Gray, R. Bellazzi, and G. Lopez-Campos,“Exposome informatics: considerations for the design of futurebiomedical research information systems,” Journal of theAmerican Medical Informatics Association, vol. 21, no. 3,pp. 386–390, 2014.
[6] E. S. Chen, S. Manaktala, I. N. Sarkar, and G. B. Melton, “Amulti-site content analysis of social history information inclinical notes,” in AMIA Annual Symposium Proceedings,pp. 227–236, Washington DC, USA, 2011.
[7] Y. Wang, E. S. Chen, S. Pakhomov et al., “Automated extrac-tion of substance use information from clinical texts,” inAMIA Annual Symposium Proceedings, pp. 2121–2130, SanFrancisco, CA, USA, 2015.
[8] R. Aldekhyyel, E. S. Chen, S. Rajamani, Y. Wang, andG. B. Melton, “Content and quality of free-text occupationdocumentation in the electronic health record,” in AMIA
Annual Symposium Proceedings, pp. 1708–1716, Chicago,IL, USA, 2016.
[9] K. M. Newton, P. L. Peissig, A. N. Kho et al., “Validation ofelectronic medical record-based phenotyping algorithms:results and lessons learned from the eMERGE network,”Journal of the American Medical Informatics Association,vol. 20, no. e1, pp. e147–e154, 2013.
[10] F. J. Martin-Sanchez and G. Lopez-Campos, “The new role ofbiomedical informatics in the age of digital medicine,”Methods of Information in Medicine, vol. 55, no. 5, pp. 392–402, 2016.
[11] F. T. Vertosick Jr., When the air Hits your Brain: Tales fromNeurosurgery, W. W Norton & Company, 2008.
[12] S. Grape, P. Ravussin, A. Rossi, C. Kern, and L. A. Steiner,“Postoperative cognitive dysfunction,” Trends in Anaesthesiaand Critical Care, vol. 2, no. 3, pp. 98–103, 2012.
[13] T. Wang, C. Gross, A. A. Desai et al., “Endothelial cellsignaling and ventilator-induced lung injury: molecular mech-anisms, genomic analyses, and therapeutic targets,” AmericanJournal of Physiology-Lung Cellular and Molecular Physiology,vol. 312, no. 4, pp. L452–L476, 2017.
[14] C. J. Mattingly, T. E. McKone, M. A. Callahan, J. A. Blake, andE. A. Hubal, “Providing the missing link: the exposure scienceontology ExO,” Environmental Science & Technology, vol. 46,no. 6, pp. 3046–3053, 2012.
[15] D. A. B. Lindberg, B. L. Humphreys, and A. T. McCray, “TheUnified Medical Language System,” Methods of Informationin Medicine, vol. 32, pp. 281–291, 1993.
[16] “IHTSDO. SNOMED CT. SNOMED CT, March 2017,” 2015,http://www.ihtsdo.org/snomed-ct/what-is-snomed-ct.
[17] A. McCray, “The UMLS Semantic Network,” 13th AnnualSymposium on Computer Application in Medical Care,pp. 503–507, 1989.
[18] A. T. McCray, A. Burgun, and O. Bodenreider, “AggregatingUMLS semantic types for reducing conceptual complexity,”Studies in Health Technology and Informatics, pp. 216–220,2001.
[19] W. W. Chapman, P. M. Nadkarni, L. Hirschman, L. W.D’Avolio, G. K. Savova, and O. Uzuner, “Overcoming barriersto NLP for clinical text: the role of shared tasks and theneed for additional creative solutions,” Journal of theAmerican Medical Informatics Association, vol. 18, no. 5,pp. 540–543, 2011.
[20] Ö. Uzuner, Y. Luo, and P. Szolovits, “Evaluating the state-of-the-art in automatic de-identification,” Journal of theAmerican Medical Informatics Association, vol. 14, no. 5,pp. 550–563, 2007.
[21] A. R. Aronson and F.-M. Lang, “An overview of MetaMap:historical perspective and recent advances,” Journal of theAmerican Medical Informatics Association, vol. 17, no. 3,pp. 229–236, 2010.
[22] “The brat rapid annotation tool: online environment forcollaborative text annotation, March 2017,” 2012, http://brat.nlplab.org.
[23] X. Wang, G. Hripcsak, and C. Friedman, “Characterizingenvironmental and phenotypic associations using informa-tion theory and electronic health records,” BMC Bioinfor-matics, vol. 10, Supplement 9, p. S13, 2009.
[24] A. Turchin, M. Shubina, E. Breydo, M. L. Pendergrass, andJ. S. Einbinder, “Comparison of information content ofstructured and narrative text data sources on the example
9Journal of Healthcare Engineering
of medication intensification,” Journal of the AmericanMedical Informatics Association, vol. 16, no. 3, pp. 362–370, 2009.
[25] J. R. Moon, M. M. Glymour, S. V. Subramanian, M. Avendaño,and I. Kawachi, “Transition to retirement and risk of cardio-vascular disease: prospective analysis of the US health andretirement study,” Social Science & Medicine, vol. 75, no. 3,pp. 526–530, 2012.
[26] A. Daoulah, M. N. Alama, O. E. Elkhateeb et al., “Widowhoodand severity of coronary artery disease: a multicenter study,”Coronary Artery Dis Published Online First, vol. 28, no. 2,pp. 98–103, 2016.
10 Journal of Healthcare Engineering
RoboticsJournal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Active and Passive Electronic Components
Control Scienceand Engineering
Journal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
International Journal of
RotatingMachinery
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporation http://www.hindawi.com
Journal of
Volume 201
Submit your manuscripts athttps://www.hindawi.com
VLSI Design
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 201
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Shock and Vibration
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Civil EngineeringAdvances in
Acoustics and VibrationAdvances in
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Electrical and Computer Engineering
Journal of
Advances inOptoElectronics
Hindawi Publishing Corporation http://www.hindawi.com
Volume 2014
The Scientific World JournalHindawi Publishing Corporation http://www.hindawi.com Volume 2014
SensorsJournal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Modelling & Simulation in EngineeringHindawi Publishing Corporation http://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Navigation and Observation
International Journal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
DistributedSensor Networks
International Journal of