
Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=tepm20

Expert Review of Precision Medicine and Drug Development
Personalized medicine in drug development and clinical practice

ISSN: (Print) 2380-8993 (Online) Journal homepage: http://www.tandfonline.com/loi/tepm20

Big data, artificial intelligence, and cardiovascular precision medicine

Chayakrit Krittanawong, Kipp W. Johnson, Steven G. Hershman & W.H. Wilson Tang

To cite this article: Chayakrit Krittanawong, Kipp W. Johnson, Steven G. Hershman & W.H. Wilson Tang (2018) Big data, artificial intelligence, and cardiovascular precision medicine, Expert Review of Precision Medicine and Drug Development, 3:5, 305-317, DOI: 10.1080/23808993.2018.1528871

To link to this article: https://doi.org/10.1080/23808993.2018.1528871

Accepted author version posted online: 26 Sep 2018. Published online: 10 Oct 2018.


REVIEW

Big data, artificial intelligence, and cardiovascular precision medicine

Chayakrit Krittanawong (a), Kipp W. Johnson (b), Steven G. Hershman (c,d) and W.H. Wilson Tang (e,f,g)

(a) Department of Internal Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; (b) Institute for Next Generation Healthcare, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; (c) Department of Medicine, Stanford University, Stanford, CA, USA; (d) Division of Cardiovascular Medicine, Department of Medicine, Stanford University, Stanford, CA, USA; (e) Department of Cardiovascular Medicine, Heart and Vascular Institute, Cleveland Clinic, Cleveland, OH, USA; (f) Department of Cellular and Molecular Medicine, Lerner Research Institute, Cleveland, OH, USA; (g) Center for Clinical Genomics, Cleveland Clinic, Cleveland, OH, USA

ABSTRACT
Introduction: Cardiovascular diseases (CVDs) are chronic, heterogeneous diseases which are generally classified according to clinical presentation. However, the arrival of big data and analytical methods presents an opportunity to better understand these disease entities.
Areas covered: This review article highlights: (1) the potential of big data approaches with emerging technology to explore the heterogeneity of CVDs; (2) current challenges of a big data approach; and (3) the future of precision cardiovascular medicine.
Expert commentary: Overall, most of the current studies utilizing big data techniques remain largely descriptive and retrospective. Precision medicine, or N-of-1, approaches have not yet allowed for consistent interpretation since there is no ‘standard’ of how to best apply treatment approaches in a field where evidence-based medicine is based largely on randomized controlled trials. Risk score and biomarker-based approaches have been utilized with some ‘validation’ studies, but more in-depth biomarkers (e.g. pharmacogenomic biomarkers) have failed to demonstrate incremental benefits. Exploring novel CVD phenotypes by integrating existing medical variables, multi-omics, lifestyle, and environmental data using artificial intelligence is vitally important and may allow us to digitize future clinical trials, potentially leading to novel therapies.

ARTICLE HISTORY
Received 29 July 2018
Accepted 24 September 2018

KEYWORDS
Big data; cardiovascular precision medicine; precision medicine; big data approach; omics

1. Heterogeneous cardiovascular diseases

Cardiovascular diseases (CVDs) are chronic, heterogeneous diseases that have generally been identified and categorized into phenotypes according to their clinical presentation. However, due to the complexity of chronic CVDs, it is likely that multiple independent etiologies manifest similarly in the clinic. This ultimately results in differing responses to standardized treatment regimens, which are derived from broad disease characterizations. Understanding the reasons for these differences presents an avenue through which to improve patient care. Although the heterogeneous pathophysiology of CVDs has been extensively studied, the emergence of new analytical methods drawn from the statistical and computer science communities presents a powerful tool for better understanding. CVDs are associated with multiple phenotypes that result from genetic, metabolomic, environmental, and behavioral or lifestyle perturbations [1,2]. Hypertension, atrial fibrillation (AF), heart failure with preserved ejection fraction (HFpEF), Takotsubo syndrome, cardiorenal syndrome, and spontaneous coronary artery dissection are known to be heterogeneous in their etiology and pathophysiology, and different phenotypes may respond to treatment in different ways [3–7]. Most clinical research studies rely on current clinical diagnoses and known validated parameters to investigate endpoints or outcomes. However, many parameters are not well-validated, and there are some emerging variables or combinations of variables that could potentially replace older metrics as guiding parameters for prognosis and treatment [8–10]. The diagnostic criteria of diastolic dysfunction or HFpEF, for example, are not well-defined, and the guidelines have varied over time [8,11]. Recent studies have demonstrated that an artificial intelligence (AI) method involving high-dimensional unsupervised clustering may have the potential to classify heterogeneous clinical CV conditions more accurately than current diagnostic criteria [6,12].
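
To make the idea of unsupervised clustering concrete, the following is a minimal sketch of plain k-means in Python. It is not the method used in the cited studies, which work with far more variables and more elaborate models; the function name `kmeans`, the `patients` list, and the two standardized features are all invented for illustration.

```python
import random
from math import dist


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means on a list of numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical, already standardized values of two clinical features
# for six patients who share one clinical diagnosis (invented numbers).
patients = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (10.0, 10.0), (10.2, 9.9), (9.9, 10.1)]

centroids, labels = kmeans(patients, k=2)
print(labels)  # the first three patients share one label, the last three the other
```

In practice the number of clusters, the choice of features, and their scaling all shape the resulting phenotypes, so any grouping obtained this way would still need clinical validation.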

2. Big data and precision medicine: where we are

The zeitgeist of the information age may be the use of so-called ‘big data’ to analyze, interpret, and alter the human condition. Biomedical science, and cardiovascular medicine in particular, is at the forefront of this movement. Central to the use of big data are effective strategies for storing, managing, and analyzing a multitude of large datasets. The term ‘big data,’ used in modern-day scientific communities, medical literature, and at scientific conferences, is frequently characterized by the 5 Vs (volume, velocity, variety, veracity, and valorization) and refers to data that cannot be analyzed or interpreted using traditional data processing methods [13]. However, the definition of big data is still

CONTACT Chayakrit Krittanawong [email protected] Department of Internal Medicine, Icahn School of Medicine at Mount Sinai, 1000 10th Ave, New York, NY 10019

EXPERT REVIEW OF PRECISION MEDICINE AND DRUG DEVELOPMENT
2018, VOL. 3, NO. 5, 305–317
https://doi.org/10.1080/23808993.2018.1528871

© 2018 Informa UK Limited, trading as Taylor & Francis Group


tenuous and not well-established. Datasets need not contain a large number of observations; they may be considered ‘big data’ because of their potential in the context of innovation, how meaningful they are, whether they are multidimensional, and how their value will increase over time [14]. Examples of big data include datasets combining human gut microbiome sequencing, genomics, metabolomics, proteomics, transcriptomics, social media data, and data from standardized electronic health records (EHRs) or precision medicine platforms (e.g. AHA Precision Medicine Platforms or the UCSF Precision Medicine Platform) [15,16]. Findings from several decades of translational, epidemiological, and clinical multiethnic studies of CVDs have been largely inconsistent. With emerging analytic technology, a big data approach would attempt to classify heterogeneous CVDs in ways that could facilitate precision CV medicine [17]. To date, many curated and uncurated medical and environmental databases are freely available to the public and could be used for data analysis. Tables 1–3 list both known variables (e.g. clinical variables, genetics or multi-omics variables) and potential latent variables, including environmental factors (e.g. media consumption, transportation use, restaurant selection, or illicit drug use) and epidemiological factors (e.g. Google Flu Trends), that may be explored in CVDs. Some particularly exciting resources for precision medicine are the so-called ‘biobanks.’ These are mass collections of biomedical specimens which may be linked to retrospective EHRs in order to facilitate a wide variety of retrospective analyses [18]. Well-curated biobanks like Mount Sinai’s BioMe, Vanderbilt’s BioVU, Northwestern’s NUgene, Penn Medicine’s BioBank, Stanford Cardiovascular Institute’s Biobank (SCVI) and GenePool, or more recently the massive UK BioBank (n = 500,000 patients) are exciting opportunities for biomedical discovery in precision medicine, and they can be accessed by various innovative actors, public and private, throughout the world. However, a drawback for this research is the often limiting data usage agreement policies for these resources, which in some cases (e.g. Mount Sinai’s BioMe) only allow use by faculty members from the participating institutions. As such, much of the research potential of these important biobanks is siloed away and goes unrealized. A novel method of collecting big data is using mobile health apps. Studies like MyHeart Counts [19], Health eHeart [20], MyGene Rank [21], and the Apple Heart Study [22] have used the app store as a recruitment tool and iOS applications for data collection; using such an approach, it is not uncommon to recruit as many as ~10^5 participants. Many such studies are designed to have an open data portal accessible to qualified researchers [23–25]. Other study apps, like VascTrac, are applied to patient populations in a clinical setting [26]. In contrast, resources containing uncurated or unprocessed big data are much harder to use, but the application of big data to clinical decision-making using emerging techniques drawn from the fields of AI, machine learning (ML), or deep learning (DL) has the potential to transform the current practice of cardiovascular health (CVH) into precision medicine [17,27,28]. Big data analysis using AI may allow us to classify heterogeneous CVDs into more precise phenotypes, leading to personalized, targeted therapy [29]. To date, big data holds great promise for solutions in CV research in various aspects. First, big data can be used to integrate EHR data, multi-omic data, gut microbiome sequencing, diet consumption diaries, physical activity information, sleep habit information from wearable technology, and emotional sentiments from social media posts to determine the multidimensional associations between these factors [30,31]. Second, the relationships between variables in big data tend to be nonlinear, which requires an advanced tool like AI for sophisticated analysis. However, the main limitation of a big data approach is the heterogeneity of multiple databases (e.g. different ICD code versions, different diagnostic criteria, different laboratories, and different software vendors) [32,33]. Therefore, the harmonization of data, particularly from different databases, is needed before performing an analysis and creating an automated prediction model for CVH recommendations for individuals. In conclusion, a big data approach to the study of heterogeneous CVD is currently challenging but appears promising. Thus, future AHA/ACC/ESC guidelines may need to take a big data approach into account.
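
As an illustration of what harmonization across databases can involve, the sketch below maps records from two hypothetical sources onto one common schema: one ICD version for diagnoses and one unit for total cholesterol. The record layout, the field names, and the three-entry code crosswalk are assumptions made for this example; a real project would rely on a complete, validated crosswalk rather than a hand-written dictionary.

```python
# Illustrative three-entry crosswalk (assumption: not a validated mapping table).
ICD9_TO_ICD10 = {
    "428.0": "I50.9",    # heart failure, unspecified
    "427.31": "I48.91",  # atrial fibrillation, unspecified
    "401.9": "I10",      # essential hypertension
}

# Conventional conversion factor for cholesterol: 1 mmol/L is about 38.67 mg/dL.
MG_DL_PER_MMOL_L = 38.67


def harmonize(record):
    """Map one source record onto a common schema (ICD-10 code, mmol/L)."""
    code = record["dx_code"]
    if record["coding_system"] == "ICD-9-CM":
        code = ICD9_TO_ICD10.get(code)  # None marks a code that needs manual review
    cholesterol = record["total_chol"]
    if record["chol_unit"] == "mg/dL":
        cholesterol = cholesterol / MG_DL_PER_MMOL_L
    return {
        "source": record["source"],
        "dx_icd10": code,
        "total_chol_mmol_l": round(cholesterol, 2),
    }


# Two hypothetical records describing the same diagnosis in different conventions.
record_a = {"source": "registry_A", "coding_system": "ICD-9-CM", "dx_code": "428.0",
            "total_chol": 193.35, "chol_unit": "mg/dL"}
record_b = {"source": "registry_B", "coding_system": "ICD-10-CM", "dx_code": "I50.9",
            "total_chol": 5.0, "chol_unit": "mmol/L"}

harmonized = [harmonize(r) for r in (record_a, record_b)]
print(harmonized)
```

Returning `None` for an unmapped code, rather than guessing, keeps unresolved records visible so that they can be reviewed before any prediction model is trained on them.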

3. Data processing step

In general, there are several steps required to apply big data to cardiovascular medicine (Figure 1). First, and most importantly, the discovery of datasets pertinent to the task at hand is required. This may include searching the wide variety of databases that are already available (Tables 1–3). De-identification is a crucial step for data privacy to protect patient information according to the HIPAA Privacy Rule, although this should generally be performed before the data is released [34]. Nonetheless, researchers re-using data have an obligation to maintain the confidentiality of any patient records they may analyze and to take appropriate steps to safeguard their data. Second, synchronization between different databases can generate new insights into disease pathogenesis, particularly for heterogeneous diseases [35]. There are many data warehouse management tools that can be used to assist with database integration such as Google’s visualizer [36], Galaxy [37], Spark SQL [38], Amazon Redshift [39], BIME Analytics [40], and Google BigQuery [41]. However, there are certain limitations. First, the integration between different databases, particularly those including clinical variables and lifestyle variables, is still a limitation because of the heterogeneity in any number of variables which may be shared among those databases. For example, participant IDs (or even participants) are usually not shared across different freely available resources; in many cases, this makes patient-level analyses impossible. Second, these datasets have generally not been designed to work well together in the context of file format, columns/rows, transformation, or distribution. Third, some databases such as toxicology or metagenomics are designed primarily for the experts in those fields using specific terminology which may be hard to explore or combine without publicly available resources such as wiki-style websites.
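
The participant-ID limitation described above can be sketched in a few lines of Python. The example links a hypothetical clinical table to a hypothetical lifestyle table on a shared key and reports which participants cannot be matched; the function name `link_records`, the field names, and all values are invented for illustration.

```python
def link_records(clinical, lifestyle, key="participant_id"):
    """Join two lists of dict records on a shared key.

    Returns (linked, unmatched): merged records for participants present in
    both sources, and the keys of clinical records with no lifestyle match.
    """
    lifestyle_by_id = {row[key]: row for row in lifestyle}
    linked, unmatched = [], []
    for row in clinical:
        match = lifestyle_by_id.get(row[key])
        if match is None:
            unmatched.append(row[key])
        else:
            linked.append({**row, **match})  # merge the two records
    return linked, unmatched


# Hypothetical de-identified records from two separate resources.
clinical = [
    {"participant_id": "P1", "systolic_bp": 142},
    {"participant_id": "P2", "systolic_bp": 118},
    {"participant_id": "P3", "systolic_bp": 131},
]
lifestyle = [
    {"participant_id": "P2", "daily_steps": 8400},
    {"participant_id": "P4", "daily_steps": 3100},
]

linked, unmatched = link_records(clinical, lifestyle)
print(linked)     # only P2 appears in both sources
print(unmatched)  # P1 and P3 have no lifestyle record
```

When two resources share no identifiers at all, `linked` is empty, and only aggregate-level comparisons (for example by region or age band) remain possible.
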
Fourth, data imputation is a quality control step that can be applied to improve data quality and accuracy after analysis [35,42,43]. Fifth, data modeling, a common term in ML, refers to the generation of a model from the data [44]. In general, the implementation of


Table 1. Examples of omics databases.

Omics database | Type of data | Details | Number of samples | Link
Global Biobank Engine | Phenotypes, variants, genetics, HLA alleles | A web-based tool that enables the exploration of the relationship between genotype and phenotype | 500,000 individuals | biobankengine.stanford.edu
Trans-Omics for Precision Medicine (TOPMed) | Omics data – RNA, gene, and metabolite | RNA, gene, and metabolite profiles from individuals who participated in the NHLBI-funded Multi-Ethnic Study of Atherosclerosis (MESA) | Over 90,000 genome sequences and over 30,000 whole genome sequences in dbGaP | https://www.nhlbi.nih.gov/news/2016/toward-precision-medicine-first-whole-genomes-topmed-now-available-study
BioMe | EHR-linked bio and data repository in New York City | Epidemiologic, molecular, genomic, environment, and lifestyle | 32,000 participants | http://icahn.mssm.edu/research/ipm/programs/biome-biobank
Merck Molecular Activity Challenge | The training and test datasets for machine learning practice | Molecule ID, molecular descriptors and features | 15 biological activity data sets | https://github.com/RuwanT/merck
The Human Metabolome Database (HMDB) | Metabolite and protein sequences | (1) Chemical data, (2) clinical data, and (3) molecular biology/biochemistry data | 114,099 metabolite entries and 5702 protein sequences | http://www.hmdb.ca/
UK Biobank | Whole genome sequencing, exome sequencing, and genotyping | Genome, exome, online questionnaires (diet, cognitive function, work history and digestive health), EHR, images | 500,000 people aged between 40 and 69 years in 2006–2010 | http://www.ukbiobank.ac.uk/
Genomics England | Genome sequencing | Genome sequence data, obtained from samples of blood, tissue, and saliva | 100,000 genomes and 70,000 patients and family | https://www.genomicsengland.co.uk/the-100000-genomes-project/data/current-research/
UK10K | DNA sequencing | DNA sequence at an order of magnitude deeper than the 1000 Genomes Project for Europe by carrying out genome-wide sequencing of 4000 samples from the TwinsUK and ALSPAC cohorts | Whole genome cohorts (4000), neurodevelopment sample sets (up to 3000 whole exomes), obesity sample sets (2000 whole exomes), and rare diseases sample sets (1000 whole exomes) | http://www.uk10k.org/
PubChem | Chemistry | Chemical structures, identifiers, chemical, physical properties, biological activities, patents, health, safety and toxicity data | 95,414,874 compounds, 250,188,056 substances, 1,252,883 bioassays, and 236,181,958 bioActivities | pubchem.ncbi.nlm.nih.gov
MetaCyc | Metabolism | Both primary and secondary metabolism, associated metabolites, reactions, enzymes, and genes | 2642 pathways from 2941 different organisms | metacyc.org
Molecular Transducers of Physical Activity (MoTrPAC) | Omics during exercise | $170M NIH Consortium on impact of activity on molecular health | TBD (There is no public data yet) | https://www.motrpac.org/
Chemical Entities of Biological Interest (ChEBI) | Chemistry | ‘Small’ chemical compounds; IntEnz, KEGG COMPOUND, PDBeChem, ChEMBL | 46,477 fully curated entries, each of which is classified within the ontology and assigned multiple annotations | www.ebi.ac.uk/chebi/
Protein Data Bank (PDB) | Protein | 3D shapes of proteins, nucleic acids, and complex assemblies | 44,165 distinct protein sequences, 38,467 structures of human sequences, and 10,027 nucleic acid containing structures | www.rcsb.org
The Universal Protein Resource (UniProt) | Proteome and proteins | Functional information on proteins and proteome | Peptide sequences from 172,997 human with 557,713 reviewed and 116,030,110 unreviewed proteins | http://www.uniprot.org/
GenBank | CoreNucleotide (the main collection), dbEST (expressed sequence tags), and dbGSS (genome survey sequences) | DNA sequences | DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI | www.ncbi.nlm.nih.gov/genbank/
The Toxin and Toxin Target Database (T3DB) | Toxin | Mechanisms of toxicity and target proteins for each toxin, detailed toxin data, pollutants, pesticides, drugs, and food toxins | 3670 common toxins and environmental pollutants | http://www.t3db.ca/
SMPDB (The Small Molecule Pathway Database) | Small molecule | Small molecule pathways | 30,000 human metabolic and disease pathways | http://smpdb.ca/

(Continued)


Table 1. (Continued).

Omics database | Type of data | Details | Number of samples | Link
The Golm Metabolome Database (GMD) | Metabolomics | A repository of sum formula with source tagged annotations for properties such as InChI strings, CAS numbers, IUPAC names, synonyms, cross references or KEGG Pathway names | 2.1 million unique sum formula from more than 150 public available databases | http://gmd.mpimp-golm.mpg.de/
BRENDA | Enzymes, organism, pathway, reaction | Comprehensive enzyme database | 7341 different enzymes | www.brenda-enzymes.org
MassBank | Mass spectra of metabolites | High-resolution mass spectra of metabolites | 605 electron-ionization mass spectrometry (EI-MS), 137 fast atom bombardment MS, and 9276 electrospray ionization (ESI)-MS(n) data of 2337 authentic compounds of metabolites, 11,545 EI-MS and 834 other-MS data of 10,286 volatile natural and synthetic compounds, and 3045 ESI-MS[2] data of 679 synthetic drugs | massbank.eu/MassBank/
BioCyc | Metabolic pathways | Metabolic pathways and operons | 13,075 Pathway/genome databases | biocyc.org
NHLBI Exome Sequencing Project | Exome sequencing data | Gene name (HUGO, upper or lower case), gene ID (from NCBI Entrez Gene), chromosomal location, dbSNP rsID to study genetic contributions to the risk of several heart, lung, and blood phenotypes | >7000 individuals | http://evs.gs.washington.edu/EVS/
Ensembl Genomes | Genomic data | Bacteria, protists, fungi, plants, and invertebrate metazoan genome-scale data | 44,048 bacteria, 189 protists, 811 fungi, 45 plants, and 68 Metazoa | http://ensemblgenomes.org/info/data
UCSC Genome Browser | Genomic data | CRISPR/Cas9 track, gene Interactions, refSeq Genes track and GTEx Gene Track | 180 assemblies and over 100 species | genome.ucsc.edu/cgi-bin/hgGateway
Human Microbiome Project | Microbiome data | The collection of all the microorganisms living in association with the human body. These communities consist of a variety of microorganisms including eukaryotes, archaea, bacteria and viruses. | 86,843 files, 30,688 samples (the microbial communities from 300 healthy individuals, across several different sites on the human body: nasal passages, oral cavity, skin, gastrointestinal tract, and urogenital tract) | hmpdacc.org
MicrobiomeDB | Microbiome data | Geographic environmental features, 16S rRNA genes, and antibiotic exposures | 13,565 samples | http://microbiomedb.org/mbio/
EBI Metagenomics | Metagenomics | All genomes present in any given environment without the need for prior individual identification | 129,051 data sets, 17,545 metagenomes and 1727 metatranscriptomes | www.ebi.ac.uk/metagenomics/
Phytozome | Genomic data | All gene sets in Phytozome have been annotated with KOG, KEGG, ENZYME, Pathway and the InterPro family of protein | Phytozome hosts 93 assembled and annotated genomes, from 82 Viridiplantae species | phytozome.jgi.doe.gov/pz/portal.html
UniProt Metagenomic and Environmental Sequences (UniMES) | Metagenomic and environmental data | Metagenomic and environmental data (the amino acid sequence, protein name or description, taxonomic data and citation information) | 171,510 human, 83,587 mouse, and 59,676 zebrafish | www.uniprot.org/help/unimes
The HBT (Human Brain Transcriptome) | Genome-wide, exon-level transcriptome | A total of 16 brain regions were sampled: the cerebellar cortex, mediodorsal nucleus of the thalamus, striatum, amygdala, hippocampus, and 11 areas of the neocortex. Genome-wide genotyping data for 2.5 million markers | Over 1340 tissue samples sampled from both hemispheres of postmortem human brains | http://hbatlas.org/
1000 Genomes Project | Whole-genome sequencing | A comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations | 84.4 million variants from 2504 individuals | http://www.internationalgenome.org/

(Continued)



Table 1. (Continued).

Omics database | Type of data | Details | Number of samples | Link
Greengenes | Small-subunit rRNA gene (SSU) | Archaeal and bacterial 16S SSU rDNA sequences; online full-length small-subunit rRNA gene (SSU) database | 90,000 public 16S small-subunit rRNA gene sequences | http://greengenes.lbl.gov
H-Invitational Database (H-InvDB) | Human genes and transcripts | Curated annotations of human genes and transcripts that include gene structures, alternative splicing variants, non-coding functional RNAs, protein functions, functional domains, sub-cellular localizations, metabolic pathways, protein 3D structure, genetic polymorphisms (SNPs, indels, and microsatellite repeats), relation with diseases, gene expression profiling, and molecular evolutionary features, protein–protein interactions (PPIs) and gene families/groups. | 120,558 human mRNAs extracted from the International Nucleotide Sequence Databases (INSD), in addition to 54978 human FLcDNAs | http://www.h-invitational.jp/

Table 2. Examples of clinical and environmental/lifestyle databases.

Database | Type of data | Details | Numbers of samples | Link
Nationwide Inpatient Sample (NIS) | Clinical | ICD-9-CM, demographic, expected payment source, total charges, discharge status, length of stay, severity and comorbidity | NIS collects annual data on 7–8 million hospital stays, reflecting all discharges from around 1000 hospitals | https://www.hcup-us.ahrq.gov/db/nation/nis/nisdbdocumentation.jsp
Nationwide Readmissions Database (NRD) | Clinical | Diagnosis, procedure, patient demographics, expected payment source, costs associated with readmissions, reasons for readmissions, impact of health policy changes | Discharge data from 27 geographically dispersed States | https://www.hcup-us.ahrq.gov/db/nation/nrd/nrddbdocumentation.jsp
Nationwide Emergency Department Sample (NEDS) | Clinical | ICD-9-CM, demographics, expected payment source, total ED charges, total hospital charges, hospital characteristics | Discharge data for ED visits from 953 hospitals located in 34 States and the District of Columbia | https://www.hcup-us.ahrq.gov/db/nation/neds/nedsdbdocumentation.jsp
Women’s Health Initiative | Clinical | 2 major parts: a Clinical Trial and an Observational Study from heart disease, breast and colorectal cancer, and osteoporosis in postmenopausal women | Clinical trial (68,132 women) and observational study (93,676 women) from women aged 50–79 between 1993 and 1998 | https://www.whi.org/researchers/SitePages/Get%20Involved.aspx
Multi-Ethnic Study of Atherosclerosis (MESA) | Clinical | Multi-Ethnic Study from Columbia University, Johns Hopkins University, Northwestern University, UCLA, University of Minnesota, and Wake Forest University | 6814 men and women | www.mesa-nhlbi.org
Atherosclerosis Risk in Communities (ARIC) | Clinical | Cardiovascular risk factors, medical care, and disease by race, gender, location, and date | 470,000 men and women (aged 35–84 years) | http://www2.cscc.unc.edu/aric/opportunities_for_new_investigators
Sleep Heart Health Study (SHHS) | EEG, EKG, and polysomnograms | Multi-cohort study focused on sleep-disordered breathing and cardiovascular outcome | 5804 adults (aged 40 and older) | sleepdata.org/datasets/shhs
Coronary Artery Risk Development in Young Adults (CARDIA) | Clinical | From 4 centers: Birmingham, AL; Chicago, IL; Minneapolis, MN; and Oakland, CA | 5115 black and white men and women (aged 18–30 years) | www.cardia.dopm.uab.edu

(Continued)



Table 2. (Continued).

Database | Type of data | Details | Numbers of samples | Link
Jackson Heart Study (JHS) | Clinical | Clinical variables, labs, imaging, interview, and physical activity | 5306 African-American residents living in the Jackson, MS, metropolitan area of Hinds, Madison, and Rankin Counties | www.jacksonheartstudy.org
Cardiovascular Health Study (CHS) | Clinical | Extensive initial physical and laboratory evaluations to identify cardiovascular risk factors, such as high blood pressure, high cholesterol, and pre-diabetes; subclinical disease (e.g. carotid artery atherosclerosis, left ventricular enlargement, and transient ischemia) | 5888 men and women aged 65 or older in four U.S. communities – Sacramento, CA; Hagerstown, MD; Winston-Salem, NC; and Pittsburgh, PA | chs-nhlbi.org
Twitter | Social media | Curate Tweets and 3 Twitter API platforms standard and premium (free) but enterprise (paid) | Over 900 million existing Twitter accounts | https://developer.twitter.com/en/products/products-overview
IBM Watson (blog, facebook pages, Twitter, news) | Millions of data and social media sources | Several analytical packages (regular, plus, and professional), and several types of data (blog, facebook pages, Twitter, news) | Up to 10,000,000 rows per dataset and up to 500 columns per dataset | www.ibm.com/us-en/marketplace/watson-analytics
PhysioBank | Digital recordings of physiologic signals and related data | Clinical, waveforms, EKGs, RR interval, oxygen saturation variability, gait and balance data | Over 75 databases, 100,000 samples | www.physionet.org
MIMIC, MIMIC-II, MIMIC-III | Clinical | Demographics, vital sign measurements, laboratory test results, procedures, medications, nurse and physician notes, imaging reports, and out-of-hospital mortality | 30,000–60,000 admissions of patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 | mimic.physionet.org
National Health and Nutrition Examination Survey (NHANES) | Nutrition | Demographic, dietary, questionnaire | 39,695 persons for NHANES-III, 27,801 persons for NHANES-II, and 32,000 persons for NHANES-I | wwwn.cdc.gov/nchs/nhanes/Default.aspx
The NHANES National Youth Fitness Survey (NNYFS) | Physical activity and fitness levels | Demographic, dietary, questionnaire, physical activity monitor, aerobic fitness – maximal and submaximal exercise test, and muscle strength | 1640 children and adolescents aged 3–15 | www.cdc.gov/nchs/nnyfs/index.htm
YouTube-8M | Video | Lifestyle | 6.1 Million Video IDs, 2.6 billion audio/visual features, and 3,862 Classes | research.google.com/youtube8m/
UCF101 dataset | Video | Lifestyle | 13,320 videos | http://crcv.ucf.edu/data/UCF101.php
UCF-Sports | Video | Lifestyle collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN | 150 sequences with the resolution of 720 × 480 | http://crcv.ucf.edu/data/UCF_Sports_Action.php
J-HMDB | Video | Collected from movies or the Internet | 5100 clips of 51 different human actions | http://jhmdb.is.tue.mpg.de/
THUMOS 2015 dataset | Video | Lifestyle | 430 h of video data and 45 million frames | http://www.thumos.info/home.html
DAVIS 16 and 17 | Video | Lifestyle | 50 sequences, 3455 annotated frames | davischallenge.org
Sports-1M | Video | Lifestyle | 1,133,158 video URLs which have been annotated automatically with 487 labels | github.com/gtoderici/sports-1m-dataset/blob/wiki/ProjectHome.md
TRECVID MED dataset | Several types of video datasets (i.e. IACC.1.A-C, YFCC100M, HAVIC) | Data from a small number of known professional sources – broadcast news organizations, TV program producers, and surveillance systems | Several categories of video dataset (depends on year) | www-nlpir.nist.gov/projects/trecvid/trecvid.data.html
Uber 2B trip data | Text, Lifestyle | Lifestyle | North America, Central & South America, Europe, Africa, South Asia, Australia & New Zealand | movement.uber.com
Yelp Open Dataset | Text, Lifestyle | JSON and SQL datasets | 5,200,000 reviews, 174,000 businesses, 200,000 pictures, 11 metropolitan areas | www.yelp.com/dataset
Quora Question Pairs | Text, Lifestyle | Questions in Quora competition is to predict which of the provided pairs of questions contain two questions with the same meaning | N/A | www.kaggle.com/c/quora-question-pairs/data
Google Audioset | Audio | A hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds | 632 audio event classes and a collection of 2,084,320 human-labeled 10-s sound clips drawn from YouTube videos | research.google.com/audioset/dataset/index.html

(Continued)



Table 2. (Continued).

Database | Type of data | Details | Numbers of samples | Link

NYC Taxi dataset | Taxi in New York City | Data containing information on our various indicators, trip counts, crash history, etc., and also raw trip data from a variety of sources | Millions of trip records from both yellow medallion taxis and green street hail livery | http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

OpenFDA | Date, drugs, events | Drugs, devices, and foods and subcategories (i.e. adverse events, enforcement reports, classification, registration, labelling) | 8,733,422 drug adverse event reports, 65,523 food adverse event reports, and 7,353,142 device adverse event reports | open.fda.gov/tools/downloads/

SEER Research Data | Epidemiologic | Cancer incidence data from population-based cancer registries | 10,050,814 cases (9,099,524 malignant cases and 9,776,139 cases) | https://seer.cancer.gov/seertrack/data/request/

UNSD Environmental Indicators | Environment | NOx emissions, SO2 emissions, CO2 emissions, CH4 and N2O emissions, Climatological disasters, Hydrological disasters, and Inland Water Resources | Environmental data (air pollution, climate changes, greenhouse gases) from 183 countries | unstats.un.org/unsd/envstats/qindicators.cshtml

DrugBank | Drug | More than 200 data fields with half of the information being devoted to drug data and the other half devoted to drug target or protein data | 11,203 drug entries including 2,562 approved small molecule drugs, 966 approved biotech (protein/peptide) drugs, 121 nutraceuticals and over 5183 experimental drugs | www.drugbank.ca

The Toxin and Toxin Target Database (T3DB) | Toxin | Mechanisms of toxicity and target proteins for each toxin; detailed toxin data with comprehensive toxin target information; pollutants, pesticides, drugs, and food toxins | 3670 common toxins and environmental pollutants | http://www.t3db.ca/

FooDB | Food, nutrients | Food, compounds, nutrients, contents; detailed compositional, biochemical and physiological information; structure, chemical class, its physico-chemical data, its food source(s), its color, its aroma, its taste, its physiological effect, presumptive health effects (from published studies), and concentrations in various foods | 28,000 food components and food additives | http://foodb.ca/

PhysioNet | Electroencephalography (EEG), electrooculography (EOG), electromyography (EMG), electrocardiology (EKG), and oxygen saturation (SaO2) | Large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit) | 1985 subjects from MGH sleep laboratory for the diagnosis of sleep disorders (from PhysioNet Cardiology Challenge 2018) | www.physionet.org

UCI Machine Learning Repository | Machine learning dataset | Machine learning | 436 data sets | archive.ics.uci.edu/ml/index.php

Open Images Dataset V4 | Machine learning dataset | Machine learning; a validation set (41,620 images) and a test set (125,436 images) | 15,440,132 boxes and 30,113,078 image-level labels | github.com/openimages/dataset

The National Survey on Drug Use and Health (NSDUH) | Survey data | Tobacco, alcohol, and drug use, mental health and other health-related issues in the United States | 70,000 people | nsduhweb.rti.org

Google Flu Trends and Google Dengue Trends | Text | Flu Trends since 2008 | 50 million of the most common search queries in the United States | https://www.google.org/flutrends/about/

Cardiac MRI dataset | Images | Cardiac MRI | 33 subjects and 7980 images (20 frames and 8–15 slices along the long axis) | http://www.cse.yorku.ca/~mridataset/

The CardioVascular Research Grid (CVRG) | Clinical, gene, and protein expression | Multiscale data sets (SNP, mRNA expression, protein expression, imaging, ECG, clinical data) from Canine Heart Atlas, Mouse Hearts, In-Vivo Human Heart CT Image Data | Multiple variables (clinical, gene, and protein expression) from 15 canine hearts | http://cvrgrid.org/

Influenza Research Database (IRD) | Epidemiology | Strain, segment and protein sequence data; surveillance sample information | 5621 structural and functional sequence features in influenza proteins | www.fludb.org

Risk-Adjusted Inpatient Mortality Rates and Hospital Ratings for California Hospitals, 2012 | Clinical | Risk-adjusted mortality rates, quality ratings, and number of deaths and cases for 6 medical conditions treated (acute stroke, acute myocardial infarction, heart failure, gastrointestinal hemorrhage, hip fracture and pneumonia) in California hospitals for 2012 | Depends on conditions and procedures (from 300 to 64,000) | data.chhs.ca.gov/dataset/california-hospital-inpatient-mortality-rates-and-quality-ratings

Community Health Status Indicators (CHSI) | Clinical | Community health (e.g. obesity, heart disease, cancer) | Over 200 measures for each of the 3,141 United States counties | www.cdc.gov/ophss/csels/dphid/CHSI.html



Table 3. Examples of public data search.

Omics data search | Type of data | Details | Numbers of samples | Link

GWAS Catalog | Articles summarizing GWAS and SNP-trait associations | Human genome wide association studies (GWAS) and association results | 3395 publications and 62,174 unique SNP-trait associations | https://www.ebi.ac.uk/gwas/

GWAS Central | Articles | Comprises all known SNPs and other variants, allele and genotype frequency data, plus genetic association significance findings from public databases such as dbSNP and the DBGV | 1605 studies (2,935,163 unique dbSNP markers) | www.gwascentral.org

KEGG pathway | Metabolism, molecular interactions, reactions and relations, environmental information processing, and cellular processes | Gene/protein (KEGG GENES), Reaction (KEGG REACTION), Drug (KEGG DRUG) | 2706 entries for pathway diagrams, 110,018 entries in 24 complete genomes and 12 partial genomes, 5,645 entries in the COMPOUND section | https://www.genome.jp/kegg/pathway.html

ExAC Browser (Beta) / Exome Aggregation Consortium | Exome sequencing data | Harmonize exome sequencing data from a variety of large-scale sequencing projects | 60,706 unrelated individuals sequenced as part of various disease-specific and population genetic studies | http://exac.broadinstitute.org/

Global Biobank Engine | Genotypes and phenotypes | Biobank explorer is currently seeded with data from UKBB allowing exploration between genotypes and phenotypes | 392,292 participants from the UKBB | http://gbe.stanford.edu

gnomAD browser beta / genome Aggregation Database | Exome and genome sequencing data | Exome and genome sequencing data from a variety of large-scale sequencing projects | 123,136 exome sequences and 15,496 whole-genome sequences from unrelated individuals sequenced | http://gnomad.broadinstitute.org

Gene Expression Omnibus (GEO) | Gene and functional genomics data | Freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data | 4,348 DataSets | www.ncbi.nlm.nih.gov/geo/

Sequence Read Archive (SRA) | Nucleotide Sequence | High-throughput sequencing data and is part of the International Nucleotide Sequence Database Collaboration (INSDC) that includes the NCBI Sequence Read Archive (SRA), the European Bioinformatics Institute (EBI), and the DNA Database of Japan (DDBJ) | >500 billion reads consisting of 60 trillion base pairs | www.ncbi.nlm.nih.gov/sra

The database of Genotypes and Phenotypes (dbGaP) | Genotype and phenotype | Phenotype data, association (GWAS) data, summary level analysis data, SRA (Short Read Archive) data, reference alignment (BAM) data, VCF (Variant Call Format) data, expression data, imputed genotype data, image data | Over 100,000 individuals | www.ncbi.nlm.nih.gov/gap

The Phenotype-Genotype Integrator (PheGenI) | Genome-wide association study (GWAS) catalog data with several databases | Merges NHGRI genome-wide association study (GWAS) catalog data with several databases housed at the National Center for Biotechnology Information (NCBI), including Gene, dbGaP, OMIM, eQTL, and dbSNP | 66,063 association records (54,282 from dbGaP and 11,781 from the NHGRI GWAS catalog) | www.ncbi.nlm.nih.gov/gap/phegeni

Healthmap.org | News, twitter | Infectious Disease Outbreaks | An automated process, updating real time, the system monitors, organizes, integrates, filters, visualizes and disseminates online information about emerging diseases | http://www.healthmap.org

CDC WONDER | Epidemiologic | Mortality (deaths), cancer incidence, HIV and AIDS, tuberculosis, vaccinations, natality (births), census data | 20 collections of public-use data for U.S. births, deaths, cancer diagnoses, tuberculosis cases, vaccinations, environmental exposures, and population estimates | wonder.cdc.gov

SEER*Explorer | Cancer statistics | Gender, race, calendar year, age, and for a selected number of cancer sites, by stage and histology | 308,745,538 patients | seer.cancer.gov/explorer/

USDA National Nutrient Database | Nutrition | Different types of foods and nutrients, % fat and % lean and types of serving methods | 7793 different foods and nutrients | ndb.nal.usda.gov/ndb/

Nutrition, Physical Activity, and Obesity: Data, Trends and Maps | Graph, tables | Obesity, breastfeeding, physical activity, other health behaviors and related environmental and policy data | Either nationally or by state in the US | https://www.cdc.gov/nccdphp/dnpao/data-trends-maps/index.html

TOXMAP | Graph, tables | NCI SEER cancer and disease mortality data, Canadian National Pollutant Release Inventory (NPRI) data, U.S. commercial nuclear power plants, and Coal power plant data from the EPA Clean Air Markets Program | Either nationally or by state in the US | https://toxmap.nlm.nih.gov/toxmap/news/2018/06/new-version-of-toxmap-now-available.html






existing models (algorithms) is commonly used, as it is much easier and sufficient algorithms already exist which may be applied to important problems. Finally, an exploratory analysis is based on data-driven hypotheses rather than investigator-driven hypotheses [45]. For example, studies have clustered phenotypes (phenomapping) [6], used systems biology methods to examine distinct endophenotypes [46], and dissected response predictors from patterns [47].

4. Current challenges

It is important to delineate some of the challenges of implementing a big data approach in cardiovascular medicine. First, integrating big data into clinical trials is challenging because clinical trials are usually designed under ideal conditions, among select patients, and monitored by highly qualified physicians [48]. Analyzing big data with traditional statistical methods can therefore be difficult. Smart clinical trials that are guided by AI to recruit patients (e.g. Deep 6 AI), perform dynamic matching (e.g. SYNERGY-AI; NCT03452774), or deliver directly targeted therapy are also promising [49]. Second, the heterogeneity and disparities of different datasets make them challenging to utilize. Third, latent variables might have been ignored in previous studies of heterogeneous diseases. Briefly, latent or unknown variables can be categorized into hidden medical variables and lifestyle variables. Hidden medical variables could include new parameters that accurately characterize myocardial function, novel serum metabolites, or new parameters for subclinical arteriosclerosis [9,10]. HFpEF, for example, could potentially be subcategorized into more mechanistically and molecularly homogeneous, discrete genotypes, phenotypes, and etiologies [6,11]. Lifestyle variables are often quite novel because most studies have not included high-definition lifestyle variables in their analyses [50]. However, integrating deeply phenotyped lifestyle factors into medical records can be difficult because of data privacy and the lack of publicly available application programming interfaces for consumer devices to interact with EHRs [51].

Lifestyle variables may include dietary intake [52], physical activity [30], sleep hygiene [53], air pollution [54], ergonomics [55], income [56], domestic violence [57], working hours [58], and workplace wellness [59]. To date, most research has collected lifestyle variables mainly through questionnaires or interviews, leading to recall or social desirability biases [60]. Advances in wearable technology could be used to track real-time activity and integrate those hidden variables into a person’s medical history. For example, the etiologies of HF readmission are heterogeneous and perhaps related to medication compliance and dietary habits [61]. Integrating lifestyle variables could potentially allow the main problems to be tracked with real-world variables rather than only inside a hospital, and could prevent recall biases from patient histories [60,62]. However, there remains a need to collect better and more consistent data from wearable devices: most consumer devices are not approved by the FDA for clinical monitoring of patients, and this may be a limitation in some cases. In addition, wearable devices have a number of validation issues, and it is unclear if they motivate long-term behavioral change [63,64]. For example, in the BEAT-HF trial, a combination of remote patient monitoring with care transition management did not reduce 180-day all-cause readmission after hospitalization for HF [65]. Fourth, data quality, data inconsistency, data instability, and validation of big data are also barriers, and therefore the imputation of big data is critical [66]. More data, more entropy, and more heterogeneity result in lower-quality databases [67]. Therefore, the pre-analytic process of big data needs to be assessed and imputed systematically. For example, though the methodology of reducing heterogeneity in meta-analysis is not yet perfect, it can reduce significant biases [68].
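
As a minimal sketch of what a systematic pre-analytic imputation step might look like, the following Python example fills missing values with the observed mean and flags every value that was filled in. The record layout, the field name `daily_steps`, and the values are hypothetical, and mean imputation is used only because it is the simplest option; it is not the method of any study cited here.

```python
from statistics import mean


def impute_mean(records, field):
    """Return copies of `records` with missing `field` values replaced by the observed mean.

    Each returned record gains a boolean `<field>_imputed` flag so that
    downstream analyses can tell measured values from filled-in ones.
    """
    observed = [r[field] for r in records if r.get(field) is not None]
    if not observed:
        raise ValueError(f"no observed values for {field!r}")
    fill = mean(observed)
    cleaned = []
    for r in records:
        missing = r.get(field) is None
        new = dict(r)  # copy so the raw data stay untouched
        new[field] = fill if missing else r[field]
        new[f"{field}_imputed"] = missing
        cleaned.append(new)
    return cleaned


# Hypothetical wearable-derived records; None marks a missing reading.
patients = [
    {"id": 1, "daily_steps": 4000},
    {"id": 2, "daily_steps": None},
    {"id": 3, "daily_steps": 8000},
]
cleaned = impute_mean(patients, "daily_steps")
print(cleaned[1])
```

Keeping an explicit imputation flag is one way to make the pre-analytic process auditable, since sensitivity analyses can then be rerun with the imputed values excluded.
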

Figure 1. Big data process flow for cardiovascular medicine.

Fifth, some other limitations of a big data approach stem from the heterogeneity of multiple databases (i.e., different ICD code versions, different diagnostic criteria, different laboratories, and different software vendors) [13,14]. Hence, synchronizing existing data to generate meaningful analysis can be very challenging. Sixth, although de-identification seems to be a solution in big data research, studies have shown that re-identification can be done in various ways. For example, anonymous genetic data stores could be unmasked by matching their data to a sample of their DNA [69] or matching social networks for information that might yield insights into the genetic basis for complex human traits [70]. Seventh, to date the evidence suggests that DNA testing has little or no impact in motivating behavior change [71]. Therefore, the genomic information, or GWAS, impacting long-term behavior change may still need hand curation [72]. In addition, distinguishing signals from noise in Omics data and software validation are required [73]. For example, using different types of software (e.g. PLINK, QCTOOL, VCFtools, BOLTs, or EPACTS) may yield different results. Lastly, another important challenge in the use of big data in cardiovascular medicine is the ascertainment of causality from observational and retrospective studies. Most AI and ML methods do not explicitly utilize a framework to model causality. Consider the humorous case of age-related gray hair and CVD. The presence of gray hair, wrinkles, and baldness is highly correlated with CVD [74–76]. However, if we were to pursue this strong association in an attempt to design therapies (e.g. hair dyes or wrinkle cream), we would be wholly unsuccessful in preventing CVDs. This is an important limitation that all big-data analyses must account for; however, there do exist emerging methods to perform causal inference from observational datasets, such as the parametric G formula [77]. We recently completed one application of the parametric G formula, in which we used retrospective EHR data to demonstrate the relative correctness of a clinical trial for hypertension that had been called into question [78]. However, EHR data also has some limitations, such as the accuracy of ICD 9 codes [79–81].
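
The gray hair example can be made concrete with a small sketch. The Python code below uses a hypothetical aggregated cohort in which CVD risk is constructed to depend on age alone, and compares the crude association between gray hair and CVD with an age-standardized one. Standardization is the simple non-parametric relative of the parametric G formula mentioned above, not the parametric method itself; all counts and names are invented for illustration.

```python
# Hypothetical aggregated cohort: (age_group, gray_hair) -> (n_people, n_cvd_events).
# By construction, CVD risk depends on age only (5% young, 30% old), not on hair colour.
COHORT = {
    ("young", 0): (800, 40),
    ("young", 1): (200, 10),
    ("old", 0): (200, 60),
    ("old", 1): (800, 240),
}


def crude_risk(cohort, gray):
    """CVD risk among everyone with the given gray-hair status, ignoring age."""
    people = sum(n for (_age, g), (n, _events) in cohort.items() if g == gray)
    events = sum(e for (_age, g), (_n, e) in cohort.items() if g == gray)
    return events / people


def standardized_risk(cohort, gray):
    """Age-standardized CVD risk: stratum risks weighted by the overall age distribution."""
    total = sum(n for n, _events in cohort.values())
    risk = 0.0
    for age in sorted({age for age, _gray in cohort}):
        n_age = sum(n for (a, _g), (n, _e) in cohort.items() if a == age)
        n_stratum, events = cohort[(age, gray)]
        risk += (events / n_stratum) * (n_age / total)
    return risk


crude_diff = crude_risk(COHORT, 1) - crude_risk(COHORT, 0)
adjusted_diff = standardized_risk(COHORT, 1) - standardized_risk(COHORT, 0)
print(f"crude risk difference:        {crude_diff:.3f}")
print(f"standardized risk difference: {adjusted_diff:.3f}")
```

In this constructed cohort the crude comparison suggests that gray hair raises CVD risk by about 15 percentage points, while the age-standardized difference is zero. The sketch works only because the confounder (age) is measured; it does not address unmeasured confounding.
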

5. Implementation of big data in clinical practice

Several resources are still the main starting points for any big data search in cardiovascular medicine. The utilization of these datasets could facilitate precision CV medicine. The integration of the Internet of Things, social media, Omics and big data technologies, and AI could create a new concept of smart health, integrating real-world variables into hospital-related variables, and leading to improved quality of patient care and hospital workflow [82–85]. Today, with the help of the Internet, there are many types of websites providing either datasets for public use or data search (Tables 1–3). The implementation of big data analytics that links these databases together is crucial. However, there may be some barriers or restrictions. Academic institutions usually have many resources and can provide their own biobank (e.g. the Mayo Clinic Biobank, Cleveland Clinic’s Biorepository, SCVI Biobank, Mount Sinai’s BioMe, Vanderbilt’s BioVU, or Northwestern’s NUgene). Most biobanks are designed so they can be accessed by various innovative actors, public and private, throughout the world. Integration of these biobanks in ongoing research is worth exploring. Training in bioinformatics or coordinating with data scientists is also important [86]. In addition, using online community support for data analysis such as GitHub, Stack Overflow, Kaggle, and Biostars is increasingly recognized and utilized in the medical community. Previous research has acknowledged many confounders in clinical research; however, none of these studies has mentioned real-world lifestyle factors such as seafood/cereal/coffee consumption, watching movies, playing video games, or personal hygiene. These real-world factors could potentially be confounders in CVD burdens, for example, HF readmission, recurrent AF, labile INR, statin sensitivity, or stent thrombosis. Such integration could add new dimensions to translational research by including real-world environmental factors.
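
Linking databases usually starts with harmonizing how the same condition is coded, for instance across the different ICD code versions noted in Section 4. The Python sketch below maps records from two hypothetical hospitals onto a shared condition label before they are pooled. The lookup table is a tiny illustrative sample rather than an official crosswalk, and the record layout, patient identifiers, and the placeholder code `UNKNOWN` are assumptions.

```python
# Illustrative lookup only; a real project would rely on an official, versioned crosswalk.
CODE_TO_CONDITION = {
    ("ICD9", "428.0"): "heart_failure",
    ("ICD10", "I50.9"): "heart_failure",
    ("ICD9", "427.31"): "atrial_fibrillation",
    ("ICD10", "I48.91"): "atrial_fibrillation",
}


def harmonize(rows):
    """Split rows into harmonized records and rows whose code could not be mapped."""
    mapped, unmapped = [], []
    for row in rows:
        condition = CODE_TO_CONDITION.get((row["system"], row["code"]))
        if condition is None:
            unmapped.append(row)  # keep for manual review instead of dropping silently
        else:
            mapped.append({"patient_id": row["patient_id"], "condition": condition})
    return mapped, unmapped


# Hypothetical extracts from two sites that use different ICD versions.
hospital_a = [{"patient_id": "A1", "system": "ICD9", "code": "428.0"}]
hospital_b = [
    {"patient_id": "B7", "system": "ICD10", "code": "I50.9"},
    {"patient_id": "B8", "system": "ICD10", "code": "UNKNOWN"},
]
mapped, unmapped = harmonize(hospital_a + hospital_b)
print(mapped)
print(unmapped)
```

Returning the unmapped rows separately is a deliberate choice in this sketch: silently discarding records with unrecognized codes would introduce exactly the kind of hidden data-quality problem described above.
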

6. Expert commentary

Though many of the technical issues for a big data approach remain to be solved, the potential for big data analysis to improve cardiovascular quality of care and patient outcome is tremendous. To date, the key findings from previous studies in this field are inconclusive. For example, strong evidence that wearables or genomic information can change behavior is lacking. The ultimate goal of big data analysis is to unify heterogeneous databases into homogeneous databases using advanced computational power, such as AI. In addition, we believe that big data analysis using AI will advance clinical trials in the context of recruiting patients, distributing drugs randomly and fairly between two arms, assisting drug delivery, and predicting outcomes of trials in advance. However, the biggest challenge is to combine heterogeneous variables from various datasets and implement these into clinical practice. In addition, there are candidate genes, novel biomarkers, and parameters emerging every day, which makes it almost impossible for current guidelines to remain current. Moreover, decision-making using these novel profiles without guidelines can be challenging and may face ethical dilemmas. Future studies should integrate big data analysis to better explore the robustness of novel CVD phenotypes and smart clinical trial design for targeted therapy. Targeting components of the CVD phenotypes such as specific genes, specific metabolites, and the specific gut microbiome in CVD may prove to be valuable. This phenotype-based classification system could be helpful for the identification of new biomarkers and potential targeted therapies, and it may lead to the development of tailored/customized future clinical trials.

7. Five-year view

In the big data era, genetic polymorphisms, plasma metabolomics, and proteomics may help to identify new biomarkers and potential novel therapeutic targets for CVH. We hope and believe that these tools will soon emerge as best practices in day-to-day clinical medicine. The next step is to create on-demand predictive analytics in clinical practice using the results of a big data approach, which shows great promise in cardiovascular medicine. In clinical practice, the implementation of sophisticated analytics tools with ‘omic’ data, the human microbiome, physical activity, environmental factors, and lifestyle factors might help identify novel phenotypes of CVD patients. Today, genetic risk scores are starting to stratify patients based on risk before the disease presents [87,88]. A big data approach could potentially transform medicine into a more personalized approach using sophisticated algorithms generated from a combination of real-world factors and medical variables to calculate the risk and benefits of CVH-related behaviors in individuals. For example, taking into account a person’s patterns of dietary intake, medication compliance, and daily life activities using wearable technology, storing this data in a secure system (e.g. cloud or blockchain), and transferring it to an EHR could generate a predictive analysis with prompt recommendations in regard to maximum fruit intake and minimal carbohydrate intake for individuals in their discharge summary. The results of this type of analysis would be transferred to primary care physicians, collected in wearable technology with warning messages, and could appear in a patient’s history in the EHR system. This proposed model could potentially be a modifiable factor to weigh CVD risk and benefit based on individuals.
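
To illustrate the genetic risk scores mentioned above, the following Python sketch computes a score as a weighted sum of risk-allele counts, which is one common way such scores are constructed. The variant names, weights, and genotype are hypothetical and are not taken from the cited studies; real scores use effect sizes estimated in GWAS and require careful quality control.

```python
def genetic_risk_score(genotype, weights):
    """Weighted sum of risk-allele counts (0, 1, or 2) over the scored variants."""
    score = 0.0
    for variant, weight in weights.items():
        if variant not in genotype:
            raise KeyError(f"missing genotype for {variant}")
        dosage = genotype[variant]
        if dosage not in (0, 1, 2):
            raise ValueError(f"allele count for {variant} must be 0, 1, or 2")
        score += weight * dosage
    return score


# Hypothetical per-variant weights and one hypothetical patient's risk-allele counts.
weights = {"variant_a": 0.5, "variant_b": 0.25, "variant_c": 0.125}
patient = {"variant_a": 2, "variant_b": 0, "variant_c": 1}
score = genetic_risk_score(patient, weights)
print(score)
```

Stratifying patients then amounts to comparing each score with thresholds derived from a reference population, a step omitted here because it depends on the cohort used.
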

Key issues

● A phenotype-based classification using multi-omics, lifestyle, and environmental data with new analytical methods and high computational power could potentially transform future clinical trials.

● Data cleaning and data imputation are keys to unlocking big data analysis.

● The data, so far, on both wearables and genomic information evoking long-term behavior change is negative or, at best, neutral.

● Biobanks and curated public databases may play an important role in big data analysis.

● Although there are many limitations to the proposed approach that have already been clearly tested, there is tremendous potential for big data analysis to improve cardiovascular quality of care and patient outcome.

Funding

This paper was not funded.

Declaration of interest

The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

Reviewer disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

References

Papers of special note have been highlighted as either of interest (•) or of considerable interest (••) to readers.

1. Gaye B, Tafflet M, Arveiler D, et al. Ideal cardiovascular health and incident cardiovascular disease: heterogeneity across event subtypes and mediating effect of blood biomarkers: the PRIME study. J Am Heart Assoc. 2017 Oct 17;6(10).

2. Jose PO, Frank AT, Kapphahn KI, et al. Cardiovascular disease mortality in Asian Americans. J Am Coll Cardiol. 2014;64:2486–2494.

3. Gordon RD. Heterogeneous hypertension. Nat Genet. 1995;11:6–9.

4. Darbar D, Herron KJ, Ballew JD, et al. Familial atrial fibrillation is a genetically heterogeneous disorder. J Am Coll Cardiol. 2003;41:2185–2192.

5. Inohara T, Shrader P, Pieper K, et al. Association of atrial fibrillation clinical phenotypes with treatment patterns and outcomes: a multicenter registry study. JAMA Cardiol. 2018;3:54–63.

6. Shah SJ, Katz DH, Selvaraj S, et al. Phenomapping for novel classification of heart failure with preserved ejection fraction. Circulation. 2015;131:269–279.

7. Krittanawong C, Bomback AS, Baber U, et al. Future direction for using artificial intelligence to predict and manage hypertension. Curr Hypertens Rep. 2018;20:75.

8. Balaney B, Medvedofsky D, Mediratta A, et al. Invasive validation of the echocardiographic assessment of left ventricular filling pressures using the 2016 diastolic guidelines: head-to-head comparison with the 2009 guidelines. J Am Soc Echocardiogr. 2018;31:79–88.

9. Pislaru C, Alashry MM, Thaden JJ, et al. Intrinsic wave propagation of myocardial stretch, a new tool to evaluate myocardial stiffness: a pilot study in patients with aortic stenosis and mitral regurgitation. J Am Soc Echocardiogr. 2017;30:1070–1080.

10. Laaksonen R, Ekroos K, Sysi-Aho M, et al. Plasma ceramides predict cardiovascular death in patients with stable coronary artery disease and acute coronary syndromes beyond LDL-cholesterol. Eur Heart J. 2016;37:1967–1976.

11. Krittanawong C, Kukin ML. Current management and future directions of heart failure with preserved ejection fraction: a contemporary review. Curr Treat Options Cardiovasc Med. 2018;20:28.

12. Guo Q, Lu X, Gao Y, et al. Cluster analysis: a new approach for identification of underlying risk factors for coronary artery disease in essential hypertensive patients. Sci Rep. 2017;7:43965.

13. Bellazzi R. Big data and biomedical informatics: a challenging opportunity. Yearb Med Inform. 2014;9:8–13.

14. Scruggs SB, Watson K, Su AI, et al. Harnessing the heart of big data. Circ Res. 2015;116:1115–1119.

15. Kass-Hout TA, Stevens LM, Hall JL. American Heart Association precision medicine platform. Circulation. 2018;137:647–649.

16. Gourraud P-A, Henry R, Cree BAC, et al. Precision medicine in chronic disease management: the MS bioscreen. Ann Neurol. 2014;76:633–642.

17. Krittanawong C, Zhang H, Wang Z, et al. Artificial intelligence in precision cardiovascular medicine. J Am Coll Cardiol. 2017;69:2657–2664.
•• This is a useful review about artificial intelligence in cardiovascular medicine.

18. Glicksberg BS, Johnson KW, Dudley JT. The next generation of precision medicine: observational studies, electronic health records, biobanks and continuous monitoring. Hum Mol Genet. 2018;27:R56–R62.

19. McConnell MV, Shcherbina A, Pavlovic A, et al. Feasibility of obtaining measures of lifestyle from a smartphone app: the MyHeart Counts cardiovascular health study. JAMA Cardiol. 2017;2:67–76.
• This study provides an example of a potential smartphone application study in cardiovascular health.

20. Guo X, Vittinghoff E, Olgin JE, et al. Volunteer participation in the Health eHeart Study: a comparison with the US population. Sci Rep. 2017;7:1956.

21. Muse ED, Wineinger NE, Schrader B, et al. Moving beyond clinical risk scores with a mobile app for the genomic risk of coronary artery disease. bioRxiv. 2017.

22. [cited 2018 Oct 6]. Access online at https://med.stanford.edu/appleheartstudy.html.

23. Bot BM, Suver C, Neto EC, et al. The mPower study, Parkinson disease mobile data collected using ResearchKit. Sci Data. 2016;3:160011.



24. Chan Y-FY, Bot BM, Zweig M, et al. The asthma mobile health study,smartphone data collected using ResearchKit. Sci Data. 2018;5:180096.

25. Webster DE, Suver C, Doerr M, et al. The Mole Mapper study,mobile phone skin imaging and melanoma risk data collectedusing ResearchKit. Sci Data. 2017;4:170005.

26. Ata R, Gandhi N, Rasmussen H, et al. IP225 VascTrac: a study ofperipheral artery disease via smartphones to improve remote dis-ease monitoring and postoperative surveillance. J Vasc Surg.2017;65:115S–116s.

27. Johnson KW, Torres Soto J, Glicksberg BS, et al. Artificial intelli-gence in cardiology. J Am Coll Cardiol. 2018;71:2668–2679.

28. Krittanawong C, Tunhasiriwet A, Zhang H, et al. Deep learning withunsupervised feature in echocardiographic imaging. J Am CollCardiol. 2017;69:2100–2101.

29. Shameer K, Johnson KW, Glicksberg BS, et al. Machine learning incardiovascular medicine: are we there yet? Heart. 2018;104:1156–1164.

30. Krittanawong C, Aydar M, Kitai T. Pokémon Go: digital healthinterventions to reduce cardiovascular risk. Cardiol Young.2017;27:1625–1626.

31. Ding MQ, Chen L, Cooper GF, et al. Precision oncology beyond targeted therapy: combining omics data with machine learning matches the majority of cancer cells to effective therapeutics. Mol Cancer Res. 2017.

32. Anwar S, Negishi K, Borowszki A, et al. Comparison of two-dimensional strain analysis using vendor-independent and vendor-specific software in adult and pediatric patients. JRSM Cardiovasc Dis. 2017;6:2048004017712862.

33. O’Malley KJ, Cook KF, Price MD, et al. Measuring diagnoses: ICD code accuracy. Health Serv Res. 2005;40:1620–1639.

34. Standards for privacy of individually identifiable health information. Office of the Assistant Secretary for Planning and Evaluation, DHHS. Proposed rule. Federal Register. 1999;64:59918–60065.

35. Verma SS, de Andrade M, Tromp G, et al. Imputation and quality control steps for combining multiple genome-wide datasets. Front Genet. 2014;5:370.

36. Hendler J. Data integration for heterogenous datasets. Big Data. 2014;2:205–215.

37. Blankenberg D, Coraor N, Von Kuster G, et al. Integrating diverse databases into an unified analysis framework: a Galaxy approach. J Biol Databases Curation. 2011;2011:bar011.

38. Shkapsky A, Yang M, Interlandi M, et al. Big data analytics with datalog queries on spark. Proceedings ACM-Sigmod International Conference on Management of Data. San Francisco, CA, USA. 2016;2016:1135–1149.

39. [cited 2018 Oct 6]. Amazon AWS. http://aws.amazon.com/.

40. Forbes A The future of BIME. 2018.

41. Pan C, McInnes G, Deflaux N, et al. Cloud-based interactive analytics for terabytes of genomic variants data. Bioinformatics. 2017;33:3709–3715.

42. Coleman JR, Euesden J, Patel H, et al. Quality control, imputation and analysis of genome-wide genotyping data from the Illumina HumanCoreExome microarray. Brief Funct Genomics. 2016;15:298–304.

43. Das S, Forer L, Schönherr S, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48:1284.

44. Luo G, Stone BL. Automating construction of machine learning models with clinical big data: proposal rationale and methods. JMIR Res Protoc. 2017 Aug 29;6(8):e175.

45. Naik AW, Kangas JD, Sullivan DP, et al. Active machine learning-driven experimentation to determine compound effects on protein patterns. Elife. 2016;5:e10047.

46. Eppinga RN, Hagemeijer Y, Burgess S, et al. Identification of genomic loci associated with resting heart rate and shared genetic predictors with all-cause mortality. Nat Genet. 2016 Dec;48(12):1557–1563. doi:10.1038/ng.3708.

47. Masetic Z, Subasi A. Congestive heart failure detection using random forest classifier. Comput Meth Prog Bio. 2016;130:54–64.

48. Mayo CS, Matuszak MM, Schipper MJ, Jolly S, Hayman JA, Ten Haken RK. Big data in designing clinical trials: opportunities and challenges. Front Oncol. 2017;7:187.

49. Say LEAFC Goodbye to clinical trials that don’t teach. 2018.

50. Assi N, Thomas DC, Leitzmann M, et al. Are metabolic signatures mediating the relationship between lifestyle factors and hepatocellular carcinoma risk? Results from a nested case-control study in EPIC. Cancer Epidemiol Biomarkers Prev. 2018;27:531–540.

51. Filkins BL, Kim JY, Roberts B, et al. Privacy and security in the era of digital health: what should translational researchers know and do about it? Am J Transl Res. 2016;8:1560–1580.

52. Krittanawong C, Tunhasiriwet A, Zhang H, et al. Is white rice consumption a risk for metabolic and cardiovascular outcomes? A systematic review and meta-analysis. Heart Asia. 2017;9:e010909.

53. Krittanawong C, Tunhasiriwet A, Wang Z, et al. Association between short and long sleep durations and cardiovascular outcomes: a systematic review and meta-analysis. Eur Heart J Acute Cardiovasc Care. 2017;2048872617741733.

54. Hartiala J, Breton CV, Tang WH, et al. Ambient air pollution is associated with the severity of coronary atherosclerosis and incident myocardial infarction in patients undergoing elective cardiac evaluation. J Am Heart Assoc. 2016 Jul 28;5(8).

55. Djindjic N, Jovanovic J, Djindjic B, et al. Associations between the occupational stress index and hypertension, type 2 diabetes mellitus, and lipid disorders in middle-aged men and women. Ann Occup Hyg. 2012;56:1051–1062.

56. Orth-Gomer K, Deter HC, Grun AS, et al. Socioeconomic factors in coronary artery disease - results from the SPIRR-CAD study. J Psychosom Res. 2018;105:125–131.

57. Mason SM, Wright RJ, Hibert EN, et al. Intimate partner violence and incidence of hypertension in women. Ann Epidemiol. 2012;22:562–567.

58. Kivimaki M, Jokela M, Nyberg ST, et al. Long working hours and risk of coronary heart disease and stroke: a systematic review and meta-analysis of published and unpublished data for 603,838 individuals. Lancet. 2015;386:1739–1746.

59. Ryu H, Jung J, Cho J, et al. Program development and effectiveness of workplace health promotion program for preventing metabolic syndrome among office workers. Int J Environ Res Public Health. 2017 Aug 4;14(8).

60. Althubaiti A. Information bias in health research: definition, pitfalls, and adjustment methods. J Multidiscip Healthc. 2016;9:211–217.

61. Retrum JH, Boggs J, Hersh A, et al. Patient-identified factors related to heart failure readmissions. Circ Cardiovasc Qual Outcomes. 2013;6:171–177.

62. Larsson SC, Tektonidis TG, Gigante B, et al. Healthy lifestyle and risk of heart failure: results from 2 prospective cohort studies. Circ Heart Fail. 2016;9:e002855.

63. Murakami H, Kawakami R, Nakae S, et al. Accuracy of wearable devices for estimating total energy expenditure: comparison with metabolic chamber and doubly labeled water method. JAMA Intern Med. 2016;176:702–703.

64. Jakicic JM, Davis KK, Rogers RJ, et al. Effect of wearable technology combined with a lifestyle intervention on long-term weight loss: the IDEA randomized clinical trial. JAMA. 2016;316:1161–1171.

65. Ong MK, Romano PS, Edgington S, et al. Effectiveness of remote patient monitoring after discharge of hospitalized patients with heart failure: the better effectiveness after transition–heart failure (BEAT-HF) randomized clinical trial. JAMA Intern Med. 2016;176:310–318.
• This study provides evidence of the association between wearable devices and long-term behavioral change.

66. Dinov ID. Methodological challenges and analytic opportunities for modeling and interpreting big healthcare data. Gigascience. 2016;5:12.

67. Coakley MF, Leerkes MR, Barnett J, et al. Unlocking the power of big data at the National Institutes of Health. Big Data. 2013;1:183–186.

68. Egger M, Smith GD, Schneider M, et al. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315:629.

69. Gymrek M, McGuire AL, Golan D, et al. Identifying personal genomes by surname inference. Science. 2013;339:321–324.

70. Hayden EC. The genome hacker. Nature. 2013;497:172.



71. Hollands GJ, French DP, Griffin SJ, et al. The impact of communicating genetic risks of disease on risk-reducing health behaviour: systematic review with meta-analysis. BMJ. 2016 Mar 15;352:i1102.

72. Presley CJ, Tang D, Soulos PR, et al. Association of broad-based genomic sequencing with survival among patients with advanced non–small cell lung cancer in the community oncology setting. JAMA. 2018;320:469–477.

73. Saracci R. Epidemiology in wonderland: big data and precision medicine. Eur J Epidemiol. 2018;33:245–257.

74. Schnohr P, Lange P, Nyboe J, et al. Gray hair, baldness, and wrinkles in relation to myocardial infarction: the Copenhagen City Heart Study. Am Heart J. 1995;130:1003–1010.

75. Lesko SM, Rosenberg L, Shapiro S. A case-control study of baldness in relation to myocardial infarction in men. JAMA. 1993;269:998–1003.

76. Ford ES, Freedman DS, Byers T. Baldness and ischemic heart disease in a national sample of men. Am J Epidemiol. 1996;143:651–657.

77. Lin SH, Young J, Logan R, et al. Parametric mediational g-formula approach to mediation analysis with time-varying exposures, mediators, and confounders. Epidemiology. 2017;28:266–274.

78. Johnson KW, Glicksberg BS, Hodos RA, et al. Causal inference on electronic health records to assess blood pressure treatment targets: an application of the parametric g formula. Pacific Symposium on Biocomputing. Fairmont Orchid, Hawaii, Puako, HI. 2018;23:180–191.

79. Ahmad FS, Chan C, Rosenman MB, et al. Validity of cardiovascular data from electronic sources: the multi-ethnic study of atherosclerosis and HealthLNK. Circulation. 2017;136:1207–1216.

80. Krittanawong C, Kumar A, Virk HUH, et al. Trends in incidence, characteristics, and in-hospital outcomes of patients presenting with spontaneous coronary artery dissection (from a national population-based cohort study between 2004 and 2015). Am J Cardiol. In press.

81. [cited 2018 Oct 6]. Access online at https://www.federalregister.gov/d/2018-15390.

82. Talboom JS, Huentelman MJ. Big data collision: the internet of things, wearable devices and genomics in the study of neurological traits and disease. Hum Mol Genet. 2018;27:R35–R39.

83. Kang M, Park E, Cho BH, et al. Recent patient health monitoring platforms incorporating internet of things-enabled smart devices. Int Neurourol J. 2018;22:S76–82.

84. Ozdemir V, Hekim N. Birth of industry 5.0: making sense of big data with artificial intelligence, “The internet of things” and next-generation technology policy. Omics: J Integr Biol. 2018;22:65–76.

85. Dey N, Ashour AS. Medical cyber-physical systems: a survey. 2018;42. p. 74.

86. Krittanawong C. Future physicians in the era of precision cardiovascular medicine. Circulation. 2017;136:1572–1574.

87. Muse ED, Wineinger NE, Spencer EG, et al. Validation of a genetic risk score for atrial fibrillation: a prospective multicenter cohort study. PLoS Med. 2018;15:e1002525.

88. Knowles JW, Ashley EA. Cardiovascular disease: the rise of the genetic risk score. PLoS Med. 2018;15:e1002546.

