Expert Review of Precision Medicine and Drug Development
Personalized medicine in drug development and clinical practice
ISSN: (Print) 2380-8993 (Online) Journal homepage: http://www.tandfonline.com/loi/tepm20
Big data, artificial intelligence, and cardiovascular precision medicine
Chayakrit Krittanawong, Kipp W. Johnson, Steven G. Hershman & W.H. Wilson Tang
To cite this article: Chayakrit Krittanawong, Kipp W. Johnson, Steven G. Hershman & W.H. Wilson Tang (2018) Big data, artificial intelligence, and cardiovascular precision medicine, Expert Review of Precision Medicine and Drug Development, 3:5, 305-317, DOI: 10.1080/23808993.2018.1528871
To link to this article: https://doi.org/10.1080/23808993.2018.1528871
Accepted author version posted online: 26 Sep 2018. Published online: 10 Oct 2018.
REVIEW
Big data, artificial intelligence, and cardiovascular precision medicine
Chayakrit Krittanawong (a), Kipp W. Johnson (b), Steven G. Hershman (c,d) and W.H. Wilson Tang (e,f,g)
(a) Department of Internal Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; (b) Institute for Next Generation Healthcare, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; (c) Department of Medicine, Stanford University, Stanford, CA, USA; (d) Division of Cardiovascular Medicine, Department of Medicine, Stanford University, Stanford, CA, USA; (e) Department of Cardiovascular Medicine, Heart and Vascular Institute, Cleveland Clinic, Cleveland, OH, USA; (f) Department of Cellular and Molecular Medicine, Lerner Research Institute, Cleveland, OH, USA; (g) Center for Clinical Genomics, Cleveland Clinic, Cleveland, OH, USA
ABSTRACT
Introduction: Cardiovascular diseases (CVDs) are chronic, heterogeneous diseases which are generally classified according to clinical presentation. However, the arrival of big data and analytical methods presents an opportunity to better understand these disease entities.
Areas covered: This review article highlights: (1) the potential of big data approaches with emerging technology to explore the heterogeneity of CVDs; (2) current challenges of a big data approach; and (3) the future of precision cardiovascular medicine.
Expert commentary: Overall, most of the current data utilizing big data techniques remain largely descriptive and retrospective. Precision medicine, or N-of-1, approaches have not yet allowed for consistent interpretation since there is no ‘standard’ of how to best apply treatment approaches in a field where evidence-based medicine is based largely on randomized controlled trials. The risk score and biomarker-based approaches have been utilized with some ‘validation’ studies, but more in-depth biomarkers (e.g. pharmacogenomic biomarkers) have failed to demonstrate incremental benefits. Exploring novel CVD phenotypes by integrating existing medical variables, multi-omics, lifestyle, and environmental data using artificial intelligence is vitally important and may allow us to digitize future clinical trials, potentially leading to novel therapies.
ARTICLE HISTORY
Received 29 July 2018
Accepted 24 September 2018
KEYWORDS
Big data; cardiovascular precision medicine; precision medicine; big data approach; omics
1. Heterogeneous cardiovascular diseases
Cardiovascular diseases (CVDs) are chronic, heterogeneous diseases that have generally been identified and categorized into phenotypes according to their clinical presentation. However, due to the complexity of chronic CVDs, it is likely that multiple independent etiologies manifest similarly in the clinic. This ultimately results in differing responses to standardized treatment regimens, which are derived from broad disease characterizations. Understanding the reasons for these differences presents an avenue through which to improve patient care. Although the heterogeneous pathophysiology of CVDs has been extensively studied, the emergence of new analytical methods drawn from the statistical and computer science communities presents a powerful tool for better understanding. CVDs are associated with multiple phenotypes that result from genetic, metabolomic, environmental, and behavioral or lifestyle perturbations [1,2]. Hypertension, atrial fibrillation (AF), heart failure with preserved ejection fraction (HFpEF), Takotsubo syndrome, cardiorenal syndrome, and spontaneous coronary artery dissection are known to be heterogeneous in their etiology and pathophysiology, and different phenotypes may respond to treatment in different ways [3–7]. Most clinical research studies rely on current clinical diagnoses and known validated parameters to investigate endpoints or outcomes. However, many parameters are not well-validated, and there are some emerging variables or combinations of variables that could potentially be used as guiding parameters for prognosis and treatment in order to replace older metrics [8–10]. The diagnostic criteria of diastolic dysfunction or HFpEF, for example, are not well-defined, and the guidelines have varied over time [8,11]. Recent studies have demonstrated that an artificial intelligence (AI) method involving high-dimensional unsupervised clustering may have the potential to classify heterogeneous clinical CV conditions more accurately than current diagnostic criteria [6,12].
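To make the idea of unsupervised clustering concrete, the following is a minimal sketch, not the method used in the cited studies [6,12]. It applies plain k-means (a deliberately simple stand-in for high-dimensional clustering) to a handful of synthetic patients. The three features are hypothetical standardized measurements invented for illustration, and the farthest-first initialization is an assumption chosen only to keep the example deterministic.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def farthest_first(points, k):
    """Pick k starting centroids: the first point, then repeatedly the
    point farthest from every centroid chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain k-means; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each patient joins the nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, labels


# Synthetic, hypothetical data: each tuple is one patient described by three
# standardized (z-scored) measurements. No real patient data is used.
patients = [
    (-2.1, -1.9, -2.0), (-1.8, -2.2, -2.1), (-2.0, -2.0, -1.7), (-2.3, -1.8, -2.2),
    (2.0, 2.1, 1.9), (1.9, 1.8, 2.2), (2.2, 2.0, 2.0), (1.7, 2.3, 2.1),
]

centroids, labels = kmeans(patients, k=2)
print(labels)  # one cluster label per patient
```

The cluster labels carry no clinical meaning by themselves; in a phenotyping study they would be compared against outcomes or treatment response before being interpreted as candidate phenotypes.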
2. Big data and precision medicine: where we are
The zeitgeist of the information age may be the use of so-called ‘big data’ to analyze, interpret, and alter the human condition. Biomedical science, and cardiovascular medicine in particular, is at the forefront of this movement. Central to the use of big data are effective strategies for the challenges of storing, managing, and analyzing a multitude of large datasets. The term ‘big data,’ used in modern-day scientific communities, medical literature, and at scientific conferences, is frequently characterized by the 5 Vs (volume, velocity, variety, veracity, and valorization) and refers to data that cannot be analyzed or interpreted using traditional data processing methods [13]. However, the definition of big data is still
CONTACT Chayakrit Krittanawong [email protected] Department of Internal Medicine, Icahn School of Medicine at Mount Sinai, 1000 10th Ave, New York, NY 10019
© 2018 Informa UK Limited, trading as Taylor & Francis Group
tenuous and not well-established. Datasets do not necessarily need to contain a large number of observations; they may be considered ‘big data’ because of their potential for innovation, how meaningful they are, whether they are multidimensional, and how their value will increase over time [14]. Examples of big data include datasets combining human gut microbiome sequencing, genomics, metabolomics, proteomics, transcriptomics, social media data, and data from standardized electronic health records (EHRs) or precision medicine platforms (e.g. the AHA Precision Medicine Platform or the UCSF Precision Medicine Platform) [15,16]. Findings from several decades of translational, epidemiological, and clinical multiethnic studies of CVDs have been largely inconsistent. With emerging analytic technology, a big data approach would attempt to classify heterogeneous CVDs in a way that could facilitate precision CV medicine [17]. To date, many curated and uncurated medical and environmental databases are freely available to the public and could be used for data analysis. Tables 1–3 list both known variables (e.g. clinical variables, genetics, or multi-omics variables) and potential latent variables, including environmental factors (e.g. media consumption, transportation use, restaurant selection, or illicit drug use) and epidemiological factors (e.g. Google Flu Trends), that may be explored in CVDs.

Some particularly exciting resources for precision medicine are the so-called ‘biobanks.’ These are mass collections of biomedical specimens which may be linked to retrospective EHRs in order to facilitate a wide variety of retrospective analyses [18]. Well-curated biobanks like Mount Sinai’s BioMe, Vanderbilt’s BioVU, Northwestern’s NUgene, Penn Medicine’s BioBank, Stanford Cardiovascular Institute’s Biobank (SCVI) and GenePool, or more recently the massive UK BioBank (n = 500,000 patients) are exciting opportunities for biomedical discovery in precision medicine, and they can be accessed by various innovative actors, public and private, throughout the world. However, a drawback for this research is the often limiting data usage agreement policies for these resources, which in some cases (e.g. Mount Sinai’s BioMe) only allow use by faculty members from the participating institutions. As such, much of the research potential of these important biobanks is siloed away and remains unfulfilled.

A novel method of collecting big data is using mobile health apps. Studies like MyHeart Counts [19], Health eHeart [20], MyGene Rank [21], and the Apple Heart Study [22] have used the app store as a recruitment tool and iOS applications for data collection; using such an approach, it is not uncommon to recruit as many as ~10^5 participants. Many such studies are designed to have an open data portal accessible to qualified researchers [23–25]. Other study apps, like VascTrac, are used with patients in a clinical setting [26]. In contrast, resources containing uncurated or unprocessed big data are much harder to use, but the application of big data to clinical decision-making using emerging techniques drawn from the fields of AI, machine learning (ML), or deep learning (DL) has the potential to transform the current practice of cardiovascular health (CVH) into precision medicine [17,27,28]. Big data analysis using AI may allow us to classify heterogeneous CVDs into more precise phenotypes, leading to personalized, targeted therapy [29].

To date, big data holds great promise for solutions in CV research in various respects. First, big data can be used to integrate EHRs, multi-omic data, gut microbiome sequencing, diet consumption diaries, physical activity information, sleep habit information from wearable technology, and emotional sentiments from social media posts to determine the multidimensional associations between these factors [30,31]. Second, the relationships between variables in big data tend to be nonlinear, which requires an advanced tool like AI for sophisticated analysis. However, the main limitation of a big data approach is the heterogeneity of multiple databases (e.g. different ICD code versions, different diagnostic criteria, different laboratories, and different software vendors) [32,33]. Therefore, the harmonization of data, particularly from different databases, is needed before performing an analysis and creating an automated prediction model for CVH recommendations for individuals. In conclusion, a big data approach to the study of heterogeneous CVD is currently challenging but appears promising. Thus, future AHA/ACC/ESC guidelines may need to take a big data approach into account.
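As a small illustration of the harmonization problem raised in this section (different ICD code versions across databases), the sketch below maps diagnosis codes from two coding systems onto one shared label before any analysis. It is a toy under stated assumptions: the crosswalk is a hand-written handful of codes, the label names, record fields, and source names are hypothetical, and a real project would rely on an official, complete mapping rather than this table.

```python
# Illustrative crosswalk only: a few diagnosis codes mapped by hand to
# hypothetical shared labels. A real analysis would need a complete,
# officially maintained mapping between coding systems.
CROSSWALK = {
    ("ICD-9-CM", "428.0"): "heart_failure",
    ("ICD-10-CM", "I50.9"): "heart_failure",
    ("ICD-9-CM", "427.31"): "atrial_fibrillation",
    ("ICD-10-CM", "I48.91"): "atrial_fibrillation",
    ("ICD-9-CM", "401.9"): "hypertension",
    ("ICD-10-CM", "I10"): "hypertension",
}


def harmonize(records):
    """Attach a shared condition label to each record.

    Returns (harmonized, unmapped) so that codes missing from the
    crosswalk are surfaced for review instead of silently dropped.
    """
    harmonized, unmapped = [], []
    for rec in records:
        # Normalize whitespace and letter case before looking the code up.
        key = (rec["coding_system"], rec["code"].strip().upper())
        label = CROSSWALK.get(key)
        if label is None:
            unmapped.append(rec)
        else:
            harmonized.append({**rec, "condition": label})
    return harmonized, unmapped


# Hypothetical records from two databases that use different ICD versions.
records = [
    {"source": "database_a", "coding_system": "ICD-9-CM", "code": "428.0"},
    {"source": "database_b", "coding_system": "ICD-10-CM", "code": " i50.9"},
    {"source": "database_b", "coding_system": "ICD-10-CM", "code": "I48.91"},
    {"source": "database_a", "coding_system": "ICD-9-CM", "code": "000.00"},  # made-up code
]

harmonized, unmapped = harmonize(records)
for rec in harmonized:
    print(rec["source"], rec["code"].strip(), "->", rec["condition"])
print("needs manual review:", [rec["code"] for rec in unmapped])
```

Returning the unmapped records separately is a design choice: in a multi-database study, the codes that fail to map are often where differing diagnostic criteria or vendor conventions show up.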
3. Data processing step
In general, there are several steps required to apply big data to cardiovascular medicine (Figure 1). First, and most importantly, the discovery of datasets pertinent to the task at hand is required. This may include searching the wide variety of databases that are already available (Tables 1–3). De-identification is a crucial step for data privacy to protect patient information according to the HIPAA Privacy Rule, although this should generally be performed before the data is released [34]. Nonetheless, researchers re-using data have an obligation to maintain the confidentiality of any patient records they may analyze and to take appropriate steps to safeguard their data. Second, synchronization between different databases can generate new insights into disease pathogenesis, particularly for heterogeneous diseases [35]. There are many data warehouse management tools that can be used to assist with database integration, such as Google’s visualizer [36], Galaxy [37], Spark SQL [38], Amazon Redshift [39], BIME Analytics [40], and Google BigQuery [41]. However, there are certain limitations. First, the integration of different databases, particularly those including clinical variables and lifestyle variables, remains limited by the heterogeneity of the variables that may be shared among those databases. For example, participant IDs (or even participants) are usually not shared across different freely available resources; in many cases, this makes patient-level analyses impossible. Second, these datasets have generally not been designed to work well together in terms of file format, columns/rows, transformation, or distribution. Third, some databases, such as those for toxicology or metagenomics, are designed primarily for experts in those fields and use specific terminology, which may make them hard to explore or combine without publicly available resources such as wiki-style websites.
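The first limitation above (participant IDs that are not shared across resources) can be illustrated with a minimal sketch. It links a clinical table to a lifestyle table on a common identifier and reports which participants cannot be matched. The field names, the identifier format, and both tiny datasets are hypothetical; none of the tools cited above is used here.

```python
def link_datasets(clinical, lifestyle, key="participant_id"):
    """Inner-join two lists of row dictionaries on a shared key.

    Returns (linked, clinical_only, lifestyle_only); the last two list the
    identifiers that appear in one dataset but not in the other.
    """
    lifestyle_by_id = {row[key]: row for row in lifestyle}
    linked, clinical_only = [], []
    for row in clinical:
        match = lifestyle_by_id.get(row[key])
        if match is None:
            clinical_only.append(row[key])
        else:
            # Merge the two rows; both carry the same value for `key`.
            linked.append({**row, **match})
    clinical_ids = {row[key] for row in clinical}
    lifestyle_only = [pid for pid in lifestyle_by_id if pid not in clinical_ids]
    return linked, clinical_only, lifestyle_only


# Hypothetical rows: a clinical database and a lifestyle database that only
# partly overlap in the participants they cover.
clinical = [
    {"participant_id": "P001", "systolic_bp": 142},
    {"participant_id": "P002", "systolic_bp": 118},
    {"participant_id": "P003", "systolic_bp": 131},
]
lifestyle = [
    {"participant_id": "P002", "daily_steps": 9400},
    {"participant_id": "P003", "daily_steps": 3100},
    {"participant_id": "P009", "daily_steps": 7200},
]

linked, clinical_only, lifestyle_only = link_datasets(clinical, lifestyle)
print(len(linked), "participants linked")
print("only in clinical data:", clinical_only)
print("only in lifestyle data:", lifestyle_only)
```

When two resources share no identifier at all, the linked list is empty and only aggregate (not patient-level) comparisons remain possible, which is the situation the paragraph describes.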
Fourth, data imputation is a quality control step that can be applied to improve data quality and accuracy after analysis [35,42,43]. Fifth, data modeling, a common term in ML, refers to generating a model from the data [44]. In general, the implementation of
Table 1. Examples of omics databases.

| Omics database | Type of data | Details | Number of samples | Link |
| --- | --- | --- | --- | --- |
| Global Biobank Engine | Phenotypes, variants, genetics, HLA alleles | A web-based tool that enables the exploration of the relationship between genotype and phenotype | 500,000 individuals | biobankengine.stanford.edu |
| Trans-Omics for Precision Medicine (TOPMed) | Omics data – RNA, gene, and metabolite | RNA, gene, and metabolite profiles from individuals who participated in the NHLBI-funded Multi-Ethnic Study of Atherosclerosis (MESA) | Over 90,000 genome sequences and over 30,000 whole genome sequences in dbGaP | https://www.nhlbi.nih.gov/news/2016/toward-precision-medicine-first-whole-genomes-topmed-now-available-study |
| BioMe | EHR-linked bio and data repository in New York City | Epidemiologic, molecular, genomic, environment, and lifestyle | 32,000 participants | http://icahn.mssm.edu/research/ipm/programs/biome-biobank |
| Merck Molecular Activity Challenge | The training and test datasets for machine learning practice | Molecule ID, molecular descriptors and features | 15 biological activity data sets | https://github.com/RuwanT/merck |
| The Human Metabolome Database (HMDB) | Metabolite and protein sequences | (1) Chemical data, (2) clinical data, and (3) molecular biology/biochemistry data | 114,099 metabolite entries and 5702 protein sequences | http://www.hmdb.ca/ |
| UK biobanks | Whole genome sequencing, exome sequencing, and genotyping | Genome, exome, online questionnaires (diet, cognitive function, work history and digestive health), EHR, images | 500,000 people aged between 40 and 69 years in 2006–2010 | http://www.ukbiobank.ac.uk/ |
| Genomics England | Genome sequencing | Genome sequence data, obtained from samples of blood, tissue, and saliva | 100,000 genomes and 70,000 patients and family | https://www.genomicsengland.co.uk/the-100000-genomes-project/data/current-research/ |
| UK10K | DNA sequencing | DNA sequence at an order of magnitude deeper than the 1000 Genomes Project for Europe by carrying out genome-wide sequencing of 4000 samples from the TwinsUK and ALSPAC cohorts | Whole genome cohorts (4000), neurodevelopment sample sets (up to 3000 whole exomes), obesity sample sets (2000 whole exomes), and rare diseases sample sets (1000 whole exomes) | http://www.uk10k.org/ |
| PubChem | Chemistry | Chemical structures, identifiers, chemical, physical properties, biological activities, patents, health, safety and toxicity data | 95,414,874 compounds, 250,188,056 substances, 1,252,883 bioassays, and 236,181,958 bioActivities | pubchem.ncbi.nlm.nih.gov |
| MetaCyc | Metabolism | Both primary and secondary metabolism, associated metabolites, reactions, enzymes, and genes | 2642 pathways from 2941 different organisms | metacyc.org |
| Molecular Transducers of Physical Activity (MoTrPAC) | Omics during exercise | $170M NIH Consortium on impact of activity on molecular health | TBD (There is no public data yet) | https://www.motrpac.org/ |
| Chemical Entities of Biological Interest (ChEBI) | Chemistry | ‘Small’ chemical compounds; IntEnz, KEGG COMPOUND, PDBeChem, ChEMBL | 46,477 fully curated entries, each of which is classified within the ontology and assigned multiple annotations | www.ebi.ac.uk/chebi/ |
| Protein Data Bank (PDB) | Protein | 3D shapes of proteins, nucleic acids, and complex assemblies | 44,165 distinct protein sequences, 38,467 structures of human sequences, and 10,027 nucleic acid containing structures | www.rcsb.org |
| The Universal Protein Resource (UniProt) | Proteome and proteins | Functional information on proteins and proteome | Peptide sequences from 172,997 human with 557,713 reviewed and 116,030,110 unreviewed proteins | http://www.uniprot.org/ |
| GenBank | CoreNucleotide (the main collection), dbEST (expressed sequence tags), and dbGSS (genome survey sequences) | DNA sequences | DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI | www.ncbi.nlm.nih.gov/genbank/ |
| The Toxin and Toxin Target Database (T3DB) | Toxin | Mechanisms of toxicity and target proteins for each toxin, detailed toxin data, pollutants, pesticides, drugs, and food toxins | 3670 common toxins and environmental pollutants | http://www.t3db.ca/ |
| SMPDB (The Small Molecule Pathway Database) | Small molecule | Small molecule pathways | 30,000 human metabolic and disease pathways | http://smpdb.ca/ |
Table 1. (Continued).

| Omics database | Type of data | Details | Number of samples | Link |
| --- | --- | --- | --- | --- |
| The Golm Metabolome Database (GMD) | Metabolomics | A repository of sum formula with source tagged annotations for properties such as InChI strings, CAS numbers, IUPAC names, synonyms, cross references or KEGG Pathway names | 2.1 million unique sum formula from more than 150 public available databases | http://gmd.mpimp-golm.mpg.de/ |
| BRENDA | Enzymes, organism, pathway, reaction | Comprehensive enzyme database | 7341 different enzymes | www.brenda-enzymes.org |
| MassBank | Mass spectra of metabolites | High-resolution mass spectra of metabolites | 605 electron-ionization mass spectrometry (EI-MS), 137 fast atom bombardment MS, and 9276 electrospray ionization (ESI)-MS(n) data of 2337 authentic compounds of metabolites, 11,545 EI-MS and 834 other-MS data of 10,286 volatile natural and synthetic compounds, and 3045 ESI-MS[2] data of 679 synthetic drugs | massbank.eu/MassBank/ |
| BioCyc | Metabolic pathways | Metabolic pathways and operons | 13,075 Pathway/genome databases | biocyc.org |
| NHLBI Exome Sequencing Project | Exome sequencing data | Gene name (HUGO, upper or lower case), gene ID (from NCBI Entrez Gene), chromosomal location, dbSNP rsID to study genetic contributions to the risk of several heart, lung, and blood phenotypes | >7000 individuals | http://evs.gs.washington.edu/EVS/ |
| Ensembl Genomes | Genomic data | Bacteria, protists, fungi, plants, and invertebrate metazoan genome-scale data | 44,048 bacteria, 189 protists, 811 fungi, 45 plants, and 68 Metazoa | http://ensemblgenomes.org/info/data |
| UCSC Genome Browser | Genomic data | CRISPR/Cas9 track, gene interactions, RefSeq Genes track and GTEx Gene track | 180 assemblies and over 100 species | genome.ucsc.edu/cgi-bin/hgGateway |
| Human Microbiome Project | Microbiome data | The collection of all the microorganisms living in association with the human body. These communities consist of a variety of microorganisms including eukaryotes, archaea, bacteria and viruses. | 86,843 files, 30,688 samples (the microbial communities from 300 healthy individuals, across several different sites on the human body: nasal passages, oral cavity, skin, gastrointestinal tract, and urogenital tract) | hmpdacc.org |
| MicrobiomeDB | Microbiome data | Geographic environmental features, 16S rRNA genes, and antibiotic exposures | 13,565 samples | http://microbiomedb.org/mbio/ |
| EBI Metagenomics | Metagenomics | All genomes present in any given environment without the need for prior individual identification | 129,051 data sets, 17,545 metagenomes and 1727 metatranscriptomes | www.ebi.ac.uk/metagenomics/ |
| Phytozome | Genomic data | All gene sets in Phytozome have been annotated with KOG, KEGG, ENZYME, Pathway and the InterPro family of protein | Phytozome hosts 93 assembled and annotated genomes, from 82 Viridiplantae species | phytozome.jgi.doe.gov/pz/portal.html |
| UniProt Metagenomic and Environmental Sequences (UniMES) | Metagenomic and environmental data | Metagenomic and environmental data (the amino acid sequence, protein name or description, taxonomic data and citation information) | 171,510 human, 83,587 mouse, and 59,676 zebrafish | www.uniprot.org/help/unimes |
| The HBT (Human Brain Transcriptome) | Genome-wide, exon-level transcriptome | A total of 16 brain regions were sampled: the cerebellar cortex, mediodorsal nucleus of the thalamus, striatum, amygdala, hippocampus, and 11 areas of the neocortex. Genome-wide genotyping data for 2.5 million markers | Over 1340 tissue samples sampled from both hemispheres of postmortem human brains | http://hbatlas.org/ |
| 1000 Genomes Project | Whole-genome sequencing | A comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations | 84.4 million variants from 2504 individuals | http://www.internationalgenome.org/ |
Table 1. (Continued).

| Omics database | Type of data | Details | Number of samples | Link |
| --- | --- | --- | --- | --- |
| Greengenes | Small-subunit rRNA gene (SSU) | Archaeal and bacterial 16S SSU rDNA sequences; online full-length small-subunit rRNA gene (SSU) database | 90,000 public 16S small-subunit rRNA gene sequences | http://greengenes.lbl.gov |
| H-Invitational Database (H-InvDB) | Human genes and transcripts | Curated annotations of human genes and transcripts that include gene structures, alternative splicing variants, non-coding functional RNAs, protein functions, functional domains, sub-cellular localizations, metabolic pathways, protein 3D structure, genetic polymorphisms (SNPs, indels, and microsatellite repeats), relation with diseases, gene expression profiling, and molecular evolutionary features, protein–protein interactions (PPIs) and gene families/groups. | 120,558 human mRNAs extracted from the International Nucleotide Sequence Databases (INSD), in addition to 54978 human FLcDNAs | http://www.h-invitational.jp/ |
Table 2. Examples of clinical and environmental/lifestyle databases.

| Database | Type of data | Details | Numbers of samples | Link |
| --- | --- | --- | --- | --- |
| Nationwide Inpatient Sample (NIS) | Clinical | ICD-9-CM, demographic, expected payment source, total charges, discharge status, length of stay, severity and comorbidity | NIS collects annual data on 7–8 million hospital stays, reflecting all discharges from around 1000 hospitals | https://www.hcup-us.ahrq.gov/db/nation/nis/nisdbdocumentation.jsp |
| Nationwide Readmissions Database (NRD) | Clinical | Diagnosis, procedure, patient demographics, expected payment source, costs associated with readmissions, reasons for readmissions, impact of health policy changes | Discharge data from 27 geographically dispersed States | https://www.hcup-us.ahrq.gov/db/nation/nrd/nrddbdocumentation.jsp |
| Nationwide Emergency Department Sample (NEDS) | Clinical | ICD-9-CM, demographics, expected payment source, total ED charges, total hospital charges, hospital characteristics | Discharge data for ED visits from 953 hospitals located in 34 States and the District of Columbia | https://www.hcup-us.ahrq.gov/db/nation/neds/nedsdbdocumentation.jsp |
| Women’s Health Initiative | Clinical | 2 major parts: a Clinical Trial and an Observational Study from heart disease, breast and colorectal cancer, and osteoporosis in postmenopausal women | Clinical trial (68,132 women) and observational study (93,676 women) from women aged 50–79 between 1993 and 1998 | https://www.whi.org/researchers/SitePages/Get%20Involved.aspx |
| Multi-Ethnic Study of Atherosclerosis (MESA) | Clinical | Multi-Ethnic Study from Columbia University, Johns Hopkins University, Northwestern University, UCLA, University of Minnesota, and Wake Forest University | 6814 men and women | www.mesa-nhlbi.org |
| Atherosclerosis Risk in Communities (ARIC) | Clinical | Cardiovascular risk factors, medical care, and disease by race, gender, location, and date | 470,000 men and women (aged 35–84 years) | http://www2.cscc.unc.edu/aric/opportunities_for_new_investigators |
| Sleep Heart Health Study (SHHS) | EEG, EKG, and polysomnograms | Multi-cohort study focused on sleep-disordered breathing and cardiovascular outcome | 5804 adults (aged 40 and older) | sleepdata.org/datasets/shhs |
| Coronary Artery Risk Development in Young Adults (CARDIA) | Clinical | From 4 centers: Birmingham, AL; Chicago, IL; Minneapolis, MN; and Oakland, CA | 5115 black and white men and women (aged 18–30 years) | www.cardia.dopm.uab.edu |
Table 2. (Continued).

| Database | Type of data | Details | Numbers of samples | Link |
| --- | --- | --- | --- | --- |
| Jackson Heart Study (JHS) | Clinical | Clinical variables, labs, imaging, interview, and physical activity | 5306 African-American residents living in the Jackson, MS, metropolitan area of Hinds, Madison, and Rankin Counties | www.jacksonheartstudy.org |
| Cardiovascular Health Study (CHS) | Clinical | Extensive initial physical and laboratory evaluations to identify cardiovascular risk factors, such as high blood pressure, high cholesterol, and pre-diabetes; subclinical disease (e.g. carotid artery atherosclerosis, left ventricular enlargement, and transient ischemia) | 5888 men and women aged 65 or older in four U.S. communities – Sacramento, CA; Hagerstown, MD; Winston-Salem, NC; and Pittsburgh, PA | chs-nhlbi.org |
| Social media | Curate Tweets and 3 Twitter API platforms | Standard and premium (free) but enterprise (paid) | Over 900 million existing Twitter accounts | https://developer.twitter.com/en/products/products-overview |
| IBM Watson (blog, facebook pages, Twitter, news) | Millions of data and social media sources | Several analytical packages (regular, plus, and professional), and several types of data (blog, facebook pages, Twitter, news) | Up to 10,000,000 rows per dataset and up to 500 columns per dataset | www.ibm.com/us-en/marketplace/watson-analytics |
| PhysioBank | Digital recordings of physiologic signals and related data | Clinical, waveforms, EKGs, RR interval, oxygen saturation variability, gait and balance data | Over 75 databases, 100,000 samples | www.physionet.org |
| MIMIC, MIMIC-II, MIMIC-III | Clinical | Demographics, vital sign measurements, laboratory test results, procedures, medications, nurse and physician notes, imaging reports, and out-of-hospital mortality | 30,000–60,000 admissions of patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 | mimic.physionet.org |
| National Health and Nutrition Examination Survey (NHANES) | Nutrition | Demographic, dietary, questionnaire | 39,695 persons for NHANES-III, 27,801 persons for NHANES-II, and 32,000 persons for NHANES-I | wwwn.cdc.gov/nchs/nhanes/Default.aspx |
| The NHANES National Youth Fitness Survey (NNYFS) | Physical activity and fitness levels | Demographic, dietary, questionnaire, physical activity monitor, aerobic fitness – maximal and submaximal exercise test, and muscle strength | 1640 children and adolescents aged 3–15 | www.cdc.gov/nchs/nnyfs/index.htm |
| YouTube-8M | Video | Lifestyle | 6.1 Million Video IDs, 2.6 billion audio/visual features, and 3,862 Classes | research.google.com/youtube8m/ |
| UCF101 dataset | Video | Lifestyle | 13,320 videos | http://crcv.ucf.edu/data/UCF101.php |
| UCF-Sports | Video | Lifestyle collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN | 150 sequences with the resolution of 720 × 480 | http://crcv.ucf.edu/data/UCF_Sports_Action.php |
| J-HMDB | Video | Collected from movies or the Internet | 5100 clips of 51 different human actions | http://jhmdb.is.tue.mpg.de/ |
| THUMOS 2015 dataset | Video | Lifestyle | 430 h of video data and 45 million frames | http://www.thumos.info/home.html |
| DAVIS 16 and 17 | Video | Lifestyle | 50 sequences, 3455 annotated frames | davischallenge.org |
| Sports-1M | Video | Lifestyle | 1,133,158 video URLs which have been annotated automatically with 487 labels | github.com/gtoderici/sports-1m-dataset/blob/wiki/ProjectHome.md |
| TRECVID MED dataset | Several types of video datasets (i.e. IACC.1.A-C, YFCC100M, HAVIC) | Data from a small number of known professional sources – broadcast news organizations, TV program producers, and surveillance systems | Several categories of video dataset (depends on year) | www-nlpir.nist.gov/projects/trecvid/trecvid.data.html |
| Uber 2B trip data | Text, Lifestyle | Lifestyle | North America, Central & South America, Europe, Africa, South Asia, Australia & New Zealand | movement.uber.com |
| Yelp Open Dataset | Text, Lifestyle | JSON and SQL datasets | 5,200,000 reviews, 174,000 businesses, 200,000 pictures, 11 metropolitan areas | www.yelp.com/dataset |
| Quora Question Pairs | Text, Lifestyle | Questions in Quora competition is to predict which of the provided pairs of questions contain two questions with the same meaning | N/A | www.kaggle.com/c/quora-question-pairs/data |
| Google Audioset | Audio | A hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds | 632 audio event classes and a collection of 2,084,320 human-labeled 10-s sound clips drawn from YouTube videos | research.google.com/audioset/dataset/index.html |
Table 2. (Continued).

| Database | Type of data | Details | Numbers of samples | Link |
| --- | --- | --- | --- | --- |
| NYC Taxi dataset | Taxi in New York City | Data containing information on our various indicators, trip counts, crash history, etc., and also raw trip data from a variety of sources | Millions of trip records from both yellow medallion taxis and green street hail livery | http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml |
| OpenFDA | Date, drugs, events | Drugs, devices, and foods and subcategories (i.e. adverse events, enforcement reports, classification, registration, labelling) | 8,733,422 drug adverse event reports, 65,523 food adverse event reports, and 7,353,142 device adverse event reports | open.fda.gov/tools/downloads/ |
| SEER Research Data | Epidemiologic | Cancer incidence data from population-based cancer registries | 10,050,814 cases (9,099,524 malignant cases and 9,776,139 cases) | https://seer.cancer.gov/seertrack/data/request/ |
| UNSD Environmental Indicators | Environment | NOx emissions, SO2 emissions, CO2 emissions, CH4 and N2O emissions, Climatological disasters, Hydrological disasters, and Inland Water Resources | Environmental data (air pollution, climate changes, greenhouse gases) from 183 countries | unstats.un.org/unsd/envstats/qindicators.cshtml |
| DrugBank | Drug | More than 200 data fields with half of the information being devoted to drug data and the other half devoted to drug target or protein data | 11,203 drug entries including 2,562 approved small molecule drugs, 966 approved biotech (protein/peptide) drugs, 121 nutraceuticals and over 5183 experimental drugs | www.drugbank.ca |
| The Toxin and Toxin Target Database (T3DB) | Toxin | Mechanisms of toxicity and target proteins for each toxin; detailed toxin data with comprehensive toxin target information; pollutants, pesticides, drugs, and food toxins | 3670 common toxins and environmental pollutants | http://www.t3db.ca/ |
| FooDB | Food, nutrients | Food, compounds, nutrients, contents; detailed compositional, biochemical and physiological information: structure, chemical class, its physico-chemical data, its food source(s), its color, its aroma, its taste, its physiological effect, presumptive health effects (from published studies), and concentrations in various foods | 28,000 food components and food additives | http://foodb.ca/ |
| PhysioNet | Electroencephalography (EEG), electrooculography (EOG), electromyography (EMG), electrocardiology (EKG), and oxygen saturation (SaO2) | Large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit) | 1985 subjects from MGH sleep laboratory for the diagnosis of sleep disorders (from PhysioNet Cardiology Challenge 2018) | www.physionet.org |
| UCI Machine Learning Repository | Machine learning dataset | Machine learning | 436 data sets | archive.ics.uci.edu/ml/index.php |
| Open Images Dataset V4 | Machine learning dataset | Machine learning (a validation set (41,620 images) and a test set (125,436 images)) | 15,440,132 boxes and 30,113,078 image-level labels | github.com/openimages/dataset |
| The National Survey on Drug Use and Health (NSDUH) | Survey data | Tobacco, alcohol, and drug use, mental health and other health-related issues in the United States | 70,000 people | nsduhweb.rti.org |
| Google Flu Trends and Google Dengue Trends | Text | Flu Trends since 2008 | 50 million of the most common search queries in the United States | https://www.google.org/flutrends/about/ |
| Cardiac MRI dataset | Images | Cardiac MRI | 33 subjects and 7980 images (20 frames and 8–15 slices along the long axis) | http://www.cse.yorku.ca/~mridataset/ |
| The CardioVascular Research Grid (CVRG) | Clinical, gene, and protein expression | Multiscale data sets (SNP, mRNA expression, protein expression, imaging, ECG, clinical data) from Canine Heart Atlas, Mouse Hearts, In-Vivo Human Heart CT Image Data | Multiple variables (clinical, gene, and protein expression) from 15 canine hearts | http://cvrgrid.org/ |
| Influenza Research Database (IRD) | Epidemiology | Strain, segment and protein sequence data, surveillance sample information | 5621 structural and functional sequence features in influenza proteins | www.fludb.org |
| Risk-Adjusted Inpatient Mortality Rates and Hospital Ratings for California Hospitals, 2012 | Clinical | Risk-adjusted mortality rates, quality ratings, and number of deaths and cases for 6 medical conditions treated (acute stroke, acute myocardial infarction, heart failure, gastrointestinal hemorrhage, hip fracture and pneumonia) in California hospitals for 2012 | Depends on conditions and procedures (from 300 to 64,000) | data.chhs.ca.gov/dataset/california-hospital-inpatient-mortality-rates-and-quality-ratings |
| Community Health Status Indicators (CHSI) | Clinical | Community health (e.g. obesity, heart disease, cancer)
Over200measuresforeach
ofthe3,141
UnitedStates
coun
ties
www.cdc.gov/oph
ss/csels/dph
id/
CHSI.htm
l
Table 3. Examples of public data search.
Omics data search | Type of data | Details | Numbers of samples | Link
GWAS Catalog | Articles summarizing GWAS and SNP-trait associations | Human genome wide association studies (GWAS) and association results | 3395 publications and 62,174 unique SNP-trait associations | https://www.ebi.ac.uk/gwas/
GWAS Central | Articles | Comprises all known SNPs and other variants, allele and genotype frequency data, plus genetic association significance findings from public databases such as dbSNP and the DBGV | 1605 studies (2,935,163 unique dbSNP markers) | www.gwascentral.org
KEGG pathway | Metabolism, molecular interactions, reactions and relations, environmental information processing, and cellular processes | Gene/protein (KEGG GENES), Reaction (KEGG REACTION), Drug (KEGG DRUG) | 2706 entries for pathway diagrams; 110,018 entries in 24 complete genomes and 12 partial genomes; 5,645 entries in the COMPOUND section | https://www.genome.jp/kegg/pathway.html
ExAC Browser (Beta): Exome Aggregation Consortium | Exome sequencing data | Harmonize exome sequencing data from a variety of large-scale sequencing projects | 60,706 unrelated individuals sequenced as part of various disease-specific and population genetic studies | http://exac.broadinstitute.org/
Global Biobank Engine | Genotypes and phenotypes | Biobank explorer is currently seeded with data from UKBB allowing exploration between genotypes and phenotypes | 392,292 participants from the UKBB | http://gbe.stanford.edu
gnomAD browser beta: genome Aggregation Database | Exome and genome sequencing data | Exome and genome sequencing data from a variety of large-scale sequencing projects | 123,136 exome sequences and 15,496 whole-genome sequences from unrelated individuals sequenced | http://gnomad.broadinstitute.org
Gene Expression Omnibus (GEO) | Gene and functional genomics data | Freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data | 4,348 DataSet | www.ncbi.nlm.nih.gov/geo/
Sequence Read Archive (SRA) | Nucleotide Sequence | High-throughput sequencing data and is part of the International Nucleotide Sequence Database Collaboration (INSDC) that includes at the NCBI Sequence Read Archive (SRA), the European Bioinformatics Institute (EBI), and the DNA Database of Japan (DDBJ) | >500 billion reads consisting of 60 trillion base pairs | www.ncbi.nlm.nih.gov/sra
The database of Genotypes and Phenotypes (dbGaP) | Genotype and phenotype | Phenotype data, association (GWAS) data, summary level analysis data, SRA (Short Read Archive) data, reference alignment (BAM) data, VCF (Variant Call Format) data, expression data, imputed genotype data, image data | Over 100,000 individuals | www.ncbi.nlm.nih.gov/gap
The Phenotype-Genotype Integrator (PheGenI) | Genome-wide association study (GWAS) catalog data with several databases | Merges NHGRI genome-wide association study (GWAS) catalog data with several databases housed at the National Center for Biotechnology Information (NCBI), including Gene, dbGaP, OMIM, eQTL, and dbSNP | 66,063 association records (54,282 from dbGaP and 11,781 from the NHGRI GWAS catalog) | www.ncbi.nlm.nih.gov/gap/phegeni
Healthmap.org | News, twitter | Infectious Disease Outbreaks | An automated process, updating real time, the system monitors, organizes, integrates, filters, visualizes and disseminates online information about emerging diseases | http://www.healthmap.org
CDC WONDER | Epidemiologic | Mortality (deaths), cancer incidence, HIV and AIDS, tuberculosis, vaccinations, natality (births), census data | 20 collections of public-use data for U.S. births, deaths, cancer diagnoses, tuberculosis cases, vaccinations, environmental exposures, and population estimates | wonder.cdc.gov
SEER*Explorer | Cancer statistics | Gender, race, calendar year, age, and for a selected number of cancer sites, by stage and histology | 308,745,538 patients | seer.cancer.gov/explorer/
USDA National Nutrient Database | Nutrition | Different types of foods and nutrients; % fat and % lean and types of serving methods | 7793 different foods and nutrients | ndb.nal.usda.gov/ndb/
Nutrition, Physical Activity, and Obesity: Data, Trends and Maps | Graph, tables | Obesity, breastfeeding, physical activity, other health behaviors and related environmental and policy data | Either nationally or by state in the US | https://www.cdc.gov/nccdphp/dnpao/data-trends-maps/index.html
TOXMAP | Graph, tables | NCI SEER cancer and disease mortality data, Canadian National Pollutant Release Inventory (NPRI) data, U.S. commercial nuclear power plants, and Coal power plant data from the EPA Clean Air Markets Program | Either nationally or by state in the US | https://toxmap.nlm.nih.gov/toxmap/news/2018/06/new-version-of-toxmap-now-available.html
existing models (algorithms) is commonly used, as it is much easier and sufficient algorithms already exist that may be applied to important problems. Finally, an exploratory analysis is based on data-driven hypotheses rather than investigator-driven hypotheses [45]. For example, some papers cluster phenotypes (phenomapping) [6], others use systems biology methods to examine distinct endophenotypes [46], and others dissect out response predictors from patterns [47].
4. Current challenges
It is important to delineate some of the challenges of implementing a big data approach in cardiovascular medicine. First, integrating big data into clinical trials is challenging because clinical trials are usually designed under ideal conditions, among select patients, and monitored by highly qualified physicians [48]. Analyzing big data with traditional statistical methods could therefore be difficult. Smart clinical trials that are guided by AI to recruit patients (e.g. Deep 6 AI), to perform dynamic matching (e.g. SYNERGY-AI; NCT03452774), or to direct targeted therapy are also promising [49]. Second, the heterogeneity and disparities of different datasets can make them challenging to utilize. Third, latent variables might have been ignored in previous studies of these heterogeneous diseases. Briefly, latent or unknown variables can be categorized into hidden medical variables and lifestyle variables. Hidden medical variables could include new parameters that characterize myocardial function more accurately, novel serum metabolites, or new parameters for subclinical arteriosclerosis [9,10]. HFpEF, for example, could potentially be subcategorized into more mechanistically and molecularly homogeneous, discrete genotypes, phenotypes, and etiologies [6,11]. Lifestyle variables are often quite novel because most studies have not included high-definition lifestyle variables in their analyses [50]. However, integrating deeply phenotyped lifestyle factors into medical records can be difficult because of data privacy and the lack of publicly available application programming interfaces for consumer devices to interact with EHRs [51].
Lifestyle variables may include dietary intake [52], physical activity [30], sleep hygiene [53], air pollution [54], ergonomics [55], income [56], domestic violence [57], working hours [58], and workplace wellness [59]. To date, most research on lifestyle variables has collected data mainly by questionnaires or interviews, leading to recall or social desirability biases [60]. Advances in wearable technology could be used to track real-time activity and integrate those hidden variables into a person’s medical history. For example, the etiologies of HF readmission are heterogeneous and perhaps related to medication compliance and dietary habits [61]. Integrating lifestyle variables could potentially track the main problems with real-world variables rather than tracking them inside a hospital, and could prevent recall biases from patient histories [60,62]. However, there remains a need to collect better and more consistent data from wearable devices; most consumer devices are not approved by the FDA for clinical monitoring of patients, and this may be a limitation in some cases. In addition, wearable devices have a number of validation issues, and it is unclear if they motivate long-term behavioral change [63,64]. For example, in the BEAT-HF trial, a combination of remote patient monitoring with care transition management did not reduce 180-day all-cause readmission after hospitalization for HF [65]. Fourth, data quality, data inconsistency, data instability, and validation of big data are also barriers, and therefore the imputation of big data is critical [66]. More data, more entropy, and more heterogeneity result in lower-quality databases [67]. Therefore, in the pre-analytic process, big data need to be assessed and missing values imputed systematically. For example, though the methodology of reducing heterogeneity in meta-analysis is not yet perfect, it can reduce significant biases [68].
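The imputation step mentioned above can be illustrated with a minimal sketch. It assumes a small, made-up table of clinical variables (the field names `sbp` and `ldl` and all values are hypothetical) and fills each missing value with the median of the observed values for that field. Median imputation is only one simple choice among many, and is used here for illustration rather than as a recommended method.

```python
from statistics import median

def impute_median(records, fields):
    """Return copies of `records` in which None values in `fields`
    are replaced by the median of the observed values for that field."""
    medians = {}
    for f in fields:
        observed = [r[f] for r in records if r[f] is not None]
        medians[f] = median(observed)  # raises StatisticsError if nothing was observed
    # Build new dicts so the input records are left unchanged.
    return [
        {**r, **{f: medians[f] for f in fields if r[f] is None}}
        for r in records
    ]

# Hypothetical patient records with missing systolic blood pressure / LDL values.
patients = [
    {"id": 1, "sbp": 120, "ldl": 100},
    {"id": 2, "sbp": None, "ldl": 130},
    {"id": 3, "sbp": 140, "ldl": None},
    {"id": 4, "sbp": 160, "ldl": 160},
]

completed = impute_median(patients, ["sbp", "ldl"])
for row in completed:
    print(row)
```

In practice, a single-value fill like this understates uncertainty; it is shown only to make the pre-analytic step concrete.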
Figure 1. Big data process flow for cardiovascular medicine.
Fifth, some other limitations of a big data approach are heterogeneity of multiple databases (i.e., different ICD code versions, different diagnostic criteria, different laboratories, and different software vendors) [13,14]. Hence, synchronizing existing data to generate meaningful analysis can be very challenging. Sixth, although de-identification seems to be a solution in big data research, studies have shown that re-identification can be done in various ways. For example, anonymous genetic data stores could be unmasked by matching their data to a sample of their DNA [69] or matching social networks for information that might yield insights into the genetic basis for complex human traits [70]. Seventh, to date, the evidence suggests that DNA testing has little or no impact in motivating behavior change [71]. Therefore, genomic information, such as GWAS results, intended to influence long-term behavior change may still need hand curation [72]. In addition, distinguishing signals from noise in Omics data and software validation are required [73]. For example, using different types of software (e.g. PLINK, QCTOOL, VCFtools, BOLTs, or EPACTS) may yield different results. Lastly, another important challenge in the use of big data in cardiovascular medicine is the ascertainment of causality from observational and retrospective studies. Most AI and ML methods do not explicitly utilize a framework to model causality. Consider the humorous case of age-related gray hair and CVD. The presence of gray hair, wrinkles, and baldness is highly correlated with CVD [74–76]. However, if we were to pursue this strong association in an attempt to design therapies (e.g. hair dyes or wrinkle cream), we would be wholly unsuccessful in preventing CVDs. This is an important limitation that all big-data analyses must account for. Nevertheless, emerging methods exist to perform causal inference from observational datasets, such as the parametric G formula [77]. We recently completed one application of the parametric G formula, in which we used retrospective EHR data to demonstrate the relative correctness of a clinical trial for hypertension that had been called into question [78]. However, EHR data also have some limitations, such as the accuracy of ICD-9 codes [79–81].
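The gray hair example can be made concrete with a small sketch. The counts below are invented so that age drives both gray hair and CVD while gray hair itself has no effect. The code compares the crude risk difference with one obtained by direct standardization over age strata, which is the simplest nonparametric form of the g-formula for a single exposure. It is not the parametric G formula applied in [77,78], only an illustration of the underlying idea.

```python
# (age_group, gray_hair) -> (number of people, number of CVD events)
# All counts are made up for illustration.
counts = {
    ("young", True): (100, 5),
    ("young", False): (900, 45),
    ("old", True): (800, 240),
    ("old", False): (200, 60),
}

def crude_risk(gray):
    """CVD risk among people with (or without) gray hair, ignoring age."""
    people = sum(n for (_, g), (n, _) in counts.items() if g == gray)
    events = sum(e for (_, g), (_, e) in counts.items() if g == gray)
    return events / people

def standardized_risk(gray):
    """Risk if everyone had the given gray-hair status, obtained by
    averaging stratum-specific risks over the whole population's age mix."""
    total = sum(n for n, _ in counts.values())
    risk = 0.0
    for age in ("young", "old"):
        stratum_size = sum(n for (a, _), (n, _) in counts.items() if a == age)
        n, e = counts[(age, gray)]
        risk += (stratum_size / total) * (e / n)
    return risk

crude_diff = crude_risk(True) - crude_risk(False)
adjusted_diff = standardized_risk(True) - standardized_risk(False)
print(f"crude risk difference:        {crude_diff:.3f}")
print(f"age-standardized difference:  {adjusted_diff:.3f}")
```

With these invented counts the crude comparison shows a large excess risk among people with gray hair, whereas the age-standardized difference is zero, because within each age group the risk is the same regardless of hair color.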
5. Implementation of big data in clinical practice
Several resources are still the main starting points for any big data search in cardiovascular medicine. The utilization of these datasets could facilitate precision CV medicine. The integration of the Internet of Things, social media, Omics and big data technologies, and AI could create a new concept of smart health, integrating real-world variables into hospital-related variables, and leading to improved quality of patient care and hospital workflow [82–85]. Today, with the help of the Internet, there are many types of websites providing either datasets for public use or data search (Tables 1–3). The implementation of big data analytics that links these databases together is crucial. However, there may be some barriers or restrictions. Academic institutions usually have many resources and can provide their own biobank (e.g. the Mayo Clinic Biobank, Cleveland Clinic’s Biorepository, SCVI Biobank, Mount Sinai’s BioMe, Vanderbilt’s BioVU, or Northwestern’s NUgene). Most biobanks are designed so they can be accessed by various innovative actors, public and private, throughout the world. Integration of these biobanks in ongoing research is worth exploring. Training in bioinformatics or coordinating with data scientists is also important [86]. In addition, using online community support for data analysis such as GitHub, Stack Overflow, Kaggle, and Biostars is increasingly recognized and utilized in the medical community. Previous research has acknowledged many confounders in clinical research; however, none of these studies has mentioned real-world lifestyle factors such as seafood/cereal/coffee consumption, watching movies, playing video games, or personal hygiene. These real-world factors could potentially be confounders in CVD burdens, for example, HF readmission, recurrent AF, labile INR, statin sensitivity, or stent thrombosis. Integrating these factors can add new dimensions to translational research by including real-world environmental factors.
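Linking databases, as discussed in this section, first requires a shared vocabulary, and the different ICD code versions noted earlier are one obstacle. The following sketch maps a few ICD-9-CM and ICD-10 code prefixes to a common phenotype label before records from two hypothetical hospitals are combined. The prefix table is illustrative only and is not a validated crosswalk; the record layout and source names are assumptions.

```python
# Illustrative prefix map only; a real project would use a curated crosswalk.
PHENOTYPE_BY_PREFIX = {
    ("ICD9", "410"): "acute_mi",        # ICD-9-CM 410.x: acute myocardial infarction
    ("ICD10", "I21"): "acute_mi",       # ICD-10 I21.x: acute myocardial infarction
    ("ICD9", "428"): "heart_failure",   # ICD-9-CM 428.x: heart failure
    ("ICD10", "I50"): "heart_failure",  # ICD-10 I50.x: heart failure
}

def harmonize(version, code):
    """Map a versioned diagnosis code to a shared phenotype label, or None."""
    code = code.strip().upper()
    for (v, prefix), label in PHENOTYPE_BY_PREFIX.items():
        if v == version and code.startswith(prefix):
            return label
    return None  # code is outside the illustrative map

# Hypothetical records from two sources that use different ICD versions.
records = [
    {"source": "hospital_a", "version": "ICD9", "code": "410.71"},
    {"source": "hospital_b", "version": "ICD10", "code": "I21.4"},
    {"source": "hospital_b", "version": "ICD10", "code": "I50.9"},
    {"source": "hospital_a", "version": "ICD9", "code": "250.00"},
]

labels = [harmonize(r["version"], r["code"]) for r in records]
print(labels)
```

Returning `None` for unmapped codes, rather than guessing, keeps unharmonized records visible so they can be reviewed instead of silently merged.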
6. Expert commentary
Though many of the technical issues for a big data approach remain to be solved, the potential for big data analysis to improve cardiovascular quality of care and patient outcome is tremendous. To date, the key findings from previous studies in this field are inconclusive. For example, strong evidence that wearables or genomic information can change behavior is lacking. The ultimate goal of big data analysis is to unify heterogeneous databases into homogeneous databases using advanced computational power, such as AI. In addition, we believe that big data analysis using AI will advance clinical trials in the context of recruiting patients, distributing drugs randomly and fairly between two arms, assisting drug delivery, and predicting outcomes of trials in advance. However, the biggest challenge is to combine heterogeneous variables from various datasets and implement these into clinical practice. In addition, there are candidate genes, novel biomarkers, and parameters emerging every day, which makes it almost impossible for current guidelines to remain current. Moreover, decision-making using these novel profiles without guidelines can be challenging and may face ethical dilemmas. Future studies should integrate big data analysis to better explore the robustness of novel CVD phenotypes and smart clinical trial design for targeted therapy. Targeting components of the CVD phenotypes such as specific genes, specific metabolites, and the specific gut microbiome in CVD may prove to be valuable. This phenotype-based classification system could be helpful for the identification of new biomarkers and potential targeted therapies, and it may lead to the development of tailored/customized future clinical trials.
7. Five-year view
In the big data era, genetic polymorphisms, plasma metabolomics, and proteomics may help to identify new biomarkers and potential novel therapeutic targets for CVH. We hope and believe that these tools will soon emerge as best practices in day-to-day clinical medicine. The next step is to create on-demand predictive analytics in clinical practice using the results of a big data approach, which shows great promise in cardiovascular medicine. In clinical practice, the implementation of sophisticated analytics tools with ‘omic’ data, the human microbiome, physical activity, environmental factors, and lifestyle factors might help identify novel phenotypes of CVD patients. Today, genetic risk scores are starting to stratify patients based on risk before the disease presents [87,88]. A big data approach could potentially transform medicine into a more personalized approach using sophisticated algorithms generated from a combination of real-world factors and medical variables to calculate the risk and benefits of CVH-related behaviors in individuals. For example, taking into account a person’s patterns of dietary intake, medication compliance, and daily life activities using wearable technology, storing this data in a secure system (e.g. cloud or blockchain), and transferring it to an EHR could generate a predictive analysis with prompt recommendations regarding maximum fruit intake and minimal carbohydrate intake for individuals in their discharge summary. The results of this type of analysis would be transferred to primary care physicians, collected in wearable technology with warning messages, and could appear in a patient’s history in the EHR system. This proposed model could potentially be a modifiable factor to weigh CVD risk and benefit based on individuals.
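As a rough illustration of how a genetic risk score of the kind mentioned above is commonly computed, the sketch below forms a weighted sum of risk-allele counts. The variant names and weights are hypothetical placeholders, not values from any published score, and treating an untyped variant as zero risk alleles is a simplifying assumption.

```python
def genetic_risk_score(dosages, weights):
    """Weighted sum of risk-allele counts (0, 1, or 2 per variant).

    `weights` maps variant -> effect size (e.g. a log odds ratio);
    variants missing from `dosages` are counted as 0 (an assumption)."""
    return sum(weights[v] * dosages.get(v, 0) for v in weights)

# Hypothetical effect sizes for three variants.
weights = {"variant_a": 0.10, "variant_b": 0.25, "variant_c": 0.05}

# Hypothetical genotype of one patient: number of risk alleles carried.
patient = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = genetic_risk_score(patient, weights)
print(f"genetic risk score: {score:.2f}")
```

A score like this is only meaningful relative to a reference population, for example as a percentile, which is how patients would be stratified before disease presents.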
Key issues
● A phenotype-based classification using multi-omics, lifestyle, and environmental data with new analytical methods and high computational power could potentially transform future clinical trials.
● Data cleaning and data imputation are keys to unlocking big data analysis.
● So far, the data on wearables and genomic information evoking long-term behavior change are negative or, at best, neutral.
● Biobanks and curated public databases may play an important role in big data analysis.
● Although there are many limitations to the proposed approach that have already been clearly tested, there is tremendous potential for big data analysis to improve cardiovascular quality of care and patient outcome.
Funding
This paper was not funded.
Declaration of interest
The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.
Reviewer disclosures
Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.
References
Papers of special note have been highlighted as either of interest (•) or of considerable interest (••) to readers.
1. Gaye B, Tafflet M, Arveiler D, et al. Ideal cardiovascular health and incident cardiovascular disease: heterogeneity across event subtypes and mediating effect of blood biomarkers: the PRIME study. J Am Heart Assoc. 2017 Oct 17;6(10).
2. Jose PO, Frank AT, Kapphahn KI, et al. Cardiovascular disease mortality in Asian Americans. J Am Coll Cardiol. 2014;64:2486–2494.
3. Gordon RD. Heterogeneous hypertension. Nat Genet. 1995;11:6–9.
4. Darbar D, Herron KJ, Ballew JD, et al. Familial atrial fibrillation is a genetically heterogeneous disorder. J Am Coll Cardiol. 2003;41:2185–2192.
5. Inohara T, Shrader P, Pieper K, et al. Association of atrial fibrillation clinical phenotypes with treatment patterns and outcomes: a multicenter registry study. JAMA Cardiology. 2018;3:54–63.
6. Shah SJ, Katz DH, Selvaraj S, et al. Phenomapping for novel classification of heart failure with preserved ejection fraction. Circulation. 2015;131:269–279.
7. Krittanawong C, Bomback AS, Baber U, et al. Future direction for using artificial intelligence to predict and manage hypertension. Curr Hypertens Rep. 2018;20:75.
8. Balaney B, Medvedofsky D, Mediratta A, et al. Invasive validation of the echocardiographic assessment of left ventricular filling pressures using the 2016 diastolic guidelines: head-to-head comparison with the 2009 guidelines. J Am Soc Echocardiogr. 2018;31:79–88.
9. Pislaru C, Alashry MM, Thaden JJ, et al. Intrinsic wave propagation of myocardial stretch, a new tool to evaluate myocardial stiffness: a pilot study in patients with aortic stenosis and mitral regurgitation. J Am Soc Echocardiogr. 2017;30:1070–1080.
10. Laaksonen R, Ekroos K, Sysi-Aho M, et al. Plasma ceramides predict cardiovascular death in patients with stable coronary artery disease and acute coronary syndromes beyond LDL-cholesterol. Eur Heart J. 2016;37:1967–1976.
11. Krittanawong C, Kukin ML. Current management and future directions of heart failure with preserved ejection fraction: a contemporary review. Curr Treat Options Cardiovasc Med. 2018;20:28.
12. Guo Q, Lu X, Gao Y, et al. Cluster analysis: a new approach for identification of underlying risk factors for coronary artery disease in essential hypertensive patients. Sci Rep. 2017;7:43965.
13. Bellazzi R. Big data and biomedical informatics: a challenging opportunity. Yearb Med Inform. 2014;9:8–13.
14. Scruggs SB, Watson K, Su AI, et al. Harnessing the heart of big data. Circ Res. 2015;116:1115–1119.
15. Kass-Hout TA, Stevens LM, Hall JL. American Heart Association precision medicine platform. Circulation. 2018;137:647–649.
16. Gourraud P-A, Henry R, Cree BAC, et al. Precision medicine in chronic disease management: the MS bioscreen. Ann Neurol. 2014;76:633–642.
17. Krittanawong C, Zhang H, Wang Z, et al. Artificial intelligence in precision cardiovascular medicine. J Am Coll Cardiol. 2017;69:2657–2664.
•• This is a useful review about artificial intelligence in cardiovascular medicine.
18. Glicksberg BS, Johnson KW, Dudley JT. The next generation of precision medicine: observational studies, electronic health records, biobanks and continuous monitoring. Hum Mol Genet. 2018;27:R56–R62.
19. McConnell MV, Shcherbina A, Pavlovic A, et al. Feasibility of obtaining measures of lifestyle from a smartphone app: the MyHeart Counts cardiovascular health study. JAMA Cardiology. 2017;2:67–76.
• This study provides an example of a potential smartphone application study in cardiovascular health.
20. Guo X, Vittinghoff E, Olgin JE, et al. Volunteer participation in the Health eHeart study: a comparison with the US population. Sci Rep. 2017;7:1956.
21. Muse ED, Wineinger NE, Schrader B, et al. Moving beyond clinical risk scores with a mobile app for the genomic risk of coronary artery disease. bioRxiv. 2017.
22. [cited 2018 Oct 6]. Access online at https://med.stanford.edu/appleheartstudy.html.
23. Bot BM, Suver C, Neto EC, et al. The mPower study, Parkinson disease mobile data collected using ResearchKit. Sci Data. 2016;3:160011.
24. Chan Y-FY, Bot BM, Zweig M, et al. The asthma mobile health study, smartphone data collected using ResearchKit. Sci Data. 2018;5:180096.
25. Webster DE, Suver C, Doerr M, et al. The Mole Mapper study, mobile phone skin imaging and melanoma risk data collected using ResearchKit. Sci Data. 2017;4:170005.
26. Ata R, Gandhi N, Rasmussen H, et al. IP225 VascTrac: a study of peripheral artery disease via smartphones to improve remote disease monitoring and postoperative surveillance. J Vasc Surg. 2017;65:115S–116S.
27. Johnson KW, Torres Soto J, Glicksberg BS, et al. Artificial intelligence in cardiology. J Am Coll Cardiol. 2018;71:2668–2679.
28. Krittanawong C, Tunhasiriwet A, Zhang H, et al. Deep learning with unsupervised feature in echocardiographic imaging. J Am Coll Cardiol. 2017;69:2100–2101.
29. Shameer K, Johnson KW, Glicksberg BS, et al. Machine learning in cardiovascular medicine: are we there yet? Heart. 2018;104:1156–1164.
30. Krittanawong C, Aydar M, Kitai T. Pokémon Go: digital health interventions to reduce cardiovascular risk. Cardiol Young. 2017;27:1625–1626.
31. Ding MQ, Chen L, Cooper GF, et al. Precision oncology beyond targeted therapy: combining omics data with machine learning matches the majority of cancer cells to effective therapeutics. Mol Cancer Res. 2017.
32. Anwar S, Negishi K, Borowszki A, et al. Comparison of two-dimensional strain analysis using vendor-independent and vendor-specific software in adult and pediatric patients. JRSM Cardiovasc Dis. 2017;6:2048004017712862.
33. O’Malley KJ, Cook KF, Price MD, et al. Measuring diagnoses: ICD code accuracy. Health Serv Res. 2005;40:1620–1639.
34. Standards for privacy of individually identifiable health information. Office of the Assistant Secretary for Planning and Evaluation, DHHS. Proposed rule. Federal Register. 1999;64:59918–60065.
35. Verma SS, de Andrade M, Tromp G, et al. Imputation and quality control steps for combining multiple genome-wide datasets. Front Genet. 2014;5:370.
36. Hendler J. Data integration for heterogenous datasets. Big Data. 2014;2:205–215.
37. Blankenberg D, Coraor N, Von Kuster G, et al. Integrating diverse databases into an unified analysis framework: a galaxy approach. Database: J Biol Databases Curation. 2011;2011:bar011.
38. Shkapsky A, Yang M, Interlandi M, et al. Big data analytics with datalog queries on spark. Proceedings of the ACM SIGMOD International Conference on Management of Data. San Francisco, CA, USA. 2016;2016:1135–1149.
39. [cited 2018 Oct 6]. Amazon AWS http://aws.amazon.com/.
40. Forbes A. The future of BIME. 2018.
41. Pan C, McInnes G, Deflaux N, et al. Cloud-based interactive analytics for terabytes of genomic variants data. Bioinformatics. 2017;33:3709–3715.
42. Coleman JR, Euesden J, Patel H, et al. Quality control, imputation and analysis of genome-wide genotyping data from the Illumina HumanCoreExome microarray. Brief Funct Genomics. 2016;15:298–304.
43. Das S, Forer L, Schönherr S, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48:1284.
44. Luo G, Stone BL. Automating construction of machine learning models with clinical big data: proposal rationale and methods. JMIR Res Protoc. 2017 Aug 29;6(8):e175.
45. Naik AW, Kangas JD, Sullivan DP, et al. Active machine learning-driven experimentation to determine compound effects on protein patterns. Elife. 2016;5:e10047.
46. Eppinga RN, Hagemeijer Y, Burgess S. Identification of genomic loci associated with resting heart rate and shared genetic predictors with all-cause mortality. Nat Genet. 2016 Dec;48(12):1557–1563. doi:10.1038/ng.3708.
47. Masetic Z, Subasi A. Congestive heart failure detection using random forest classifier. Comput Meth Prog Bio. 2016;130:54–64.
48. Mayo CS, Matuszak MM, Schipper MJ, Jolly S, Hayman JA, Ten Haken RK. Big data in designing clinical trials: opportunities and challenges. Front Oncol. 2017;7:187.
49. Say LEAFC Goodbye to clinical trials that don’t teach. 2018.
50. Assi N, Thomas DC, Leitzmann M, et al. Are metabolic signatures mediating the relationship between lifestyle factors and hepatocellular carcinoma risk? Results from a nested case-control study in EPIC. Cancer Epidemiol Biomarkers Prev. 2018;27:531–540.
51. Filkins BL, Kim JY, Roberts B, et al. Privacy and security in the era of digital health: what should translational researchers know and do about it? Am J Transl Res. 2016;8:1560–1580.
52. Krittanawong C, Tunhasiriwet A, Zhang H, et al. Is white rice consumption a risk for metabolic and cardiovascular outcomes? A systematic review and meta-analysis. Heart Asia. 2017;9:e010909.
53. Krittanawong C, Tunhasiriwet A, Wang Z, et al. Association between short and long sleep durations and cardiovascular outcomes: a systematic review and meta-analysis. Eur Heart J Acute Cardiovasc Care. 2017;2048872617741733.
54. Hartiala J, Breton CV, Tang WH, et al. Ambient air pollution is associated with the severity of coronary atherosclerosis and incident myocardial infarction in patients undergoing elective cardiac evaluation. J Am Heart Assoc. 2016 Jul 28;5(8).
55. Djindjic N, Jovanovic J, Djindjic B, et al. Associations between the occupational stress index and hypertension, type 2 diabetes mellitus, and lipid disorders in middle-aged men and women. Ann Occup Hyg. 2012;56:1051–1062.
56. Orth-Gomer K, Deter HC, Grun AS, et al. Socioeconomic factors in coronary artery disease - results from the SPIRR-CAD study. J Psychosom Res. 2018;105:125–131.
57. Mason SM, Wright RJ, Hibert EN, et al. Intimate partner violence and incidence of hypertension in women. Ann Epidemiol. 2012;22:562–567.
58. Kivimaki M, Jokela M, Nyberg ST, et al. Long working hours and risk of coronary heart disease and stroke: a systematic review and meta-analysis of published and unpublished data for 603,838 individuals. Lancet. 2015;386:1739–1746.
59. Ryu H, Jung J, Cho J, et al. Program development and effectiveness of workplace health promotion program for preventing metabolic syndrome among office workers. Int J Environ Res Public Health. 2017 Aug 4;14(8).
60. Althubaiti A. Information bias in health research: definition, pitfalls, and adjustment methods. J Multidiscip Healthc. 2016;9:211–217.
61. Retrum JH, Boggs J, Hersh A, et al. Patient-identified factors related to heart failure readmissions. Circ Cardiovasc Qual Outcomes. 2013;6:171–177.
62. Larsson SC, Tektonidis TG, Gigante B, et al. Healthy lifestyle and risk of heart failure: results from 2 prospective cohort studies. Circ Heart Fail. 2016;9:e002855.
63. Murakami H, Kawakami R, Nakae S, et al. Accuracy of wearable devices for estimating total energy expenditure: comparison with metabolic chamber and doubly labeled water method. JAMA Intern Med. 2016;176:702–703.
64. Jakicic JM, Davis KK, Rogers RJ, et al. Effect of wearable technology combined with a lifestyle intervention on long-term weight loss: the IDEA randomized clinical trial. JAMA. 2016;316:1161–1171.
65. Ong MK, Romano PS, Edgington S, et al. Effectiveness of remote patient monitoring after discharge of hospitalized patients with heart failure: the Better Effectiveness After Transition–Heart Failure (BEAT-HF) randomized clinical trial. JAMA Intern Med. 2016;176:310–318.
• This study provides evidence of the association between wearable devices and long-term behavioral change.
66. Dinov ID. Methodological challenges and analytic opportunities for modeling and interpreting big healthcare data. Gigascience. 2016;5:12.
67. Coakley MF, Leerkes MR, Barnett J, et al. Unlocking the power of big data at the National Institutes of Health. Big Data. 2013;1:183–186.
68. Egger M, Smith GD, Schneider M, et al. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315:629.
69. Gymrek M, McGuire AL, Golan D, et al. Identifying personal genomes by surname inference. Science. 2013;339:321–324.
70. Hayden EC. The genome hacker. Nature. 2013;497:172.
71. Hollands GJ, French DP, Griffin SJ, et al. The impact of communicating genetic risks of disease on risk-reducing health behaviour: systematic review with meta-analysis. BMJ. 2016 Mar 15;352:i1102.
72. Presley CJ, Tang D, Soulos PR, et al. Association of broad-based genomic sequencing with survival among patients with advanced non–small cell lung cancer in the community oncology setting. JAMA. 2018;320:469–477.
73. Saracci R. Epidemiology in wonderland: big data and precision medicine. Eur J Epidemiol. 2018;33:245–257.
74. Schnohr P, Lange P, Nyboe J, et al. Gray hair, baldness, and wrinkles in relation to myocardial infarction: the Copenhagen City Heart Study. Am Heart J. 1995;130:1003–1010.
75. Lesko SM, Rosenberg L, Shapiro S. A case-control study of baldness in relation to myocardial infarction in men. JAMA. 1993;269:998–1003.
76. Ford ES, Freedman DS, Byers T. Baldness and ischemic heart disease in a national sample of men. Am J Epidemiol. 1996;143:651–657.
77. Lin SH, Young J, Logan R, et al. Parametric mediational g-formula approach to mediation analysis with time-varying exposures, mediators, and confounders. Epidemiology. 2017;28:266–274.
78. Johnson KW, Glicksberg BS, Hodos RA, et al. Causal inference on electronic health records to assess blood pressure treatment targets: an application of the parametric g formula. Pacific Symposium on Biocomputing. Fairmont Orchid, Hawaii, Puako, HI. 2018;23:180–191.
79. Ahmad FS, Chan C, Rosenman MB, et al. Validity of cardiovascular data from electronic sources: the multi-ethnic study of atherosclerosis and HealthLNK. Circulation. 2017;136:1207–1216.
80. Krittanawong C, Kumar A, Virk HUH, et al. Trends in incidence, characteristics, and in-hospital outcomes of patients presenting with spontaneous coronary artery dissection (from a national population-based cohort study between 2004 and 2015). Am J Cardiol. In press.
81. [cited 2018 Oct 6]. https://www.federalregister.gov/d/2018-15390Aoa.
82. Talboom JS, Huentelman MJ. Big data collision: the internet of things, wearable devices and genomics in the study of neurological traits and disease. Hum Mol Genet. 2018;27:R35–R39.
83. Kang M, Park E, Cho BH, et al. Recent patient health monitoring platforms incorporating internet of things-enabled smart devices. Int Neurourol J. 2018;22:S76–82.
84. Ozdemir V, Hekim N. Birth of industry 5.0: making sense of big data with artificial intelligence, “The internet of things” and next-generation technology policy. Omics: J Integr Biol. 2018;22:65–76.
85. Dey N, Ashour AS. Medical cyber-physical systems: a survey. 2018;42. p. 74.
86. Krittanawong C. Future physicians in the era of precision cardiovascular medicine. Circulation. 2017;136:1572–1574.
87. Muse ED, Wineinger NE, Spencer EG, et al. Validation of a genetic risk score for atrial fibrillation: a prospective multicenter cohort study. PLoS Med. 2018;15:e1002525.
88. Knowles JW, Ashley EA. Cardiovascular disease: the rise of the genetic risk score. PLoS Med. 2018;15:e1002546.