data lakes and their implication on the global health ... · a data lake is a storage repository...

14
Data Lakes and Their Implication on the Global Health Sector v2.0 Research | Blogs | Expert Opinions Pekka Neittaanmäki Dean of the Faculty of Information Technology Professor, Department of Mathematical Information Technology University of Jyväskylä Miikael Lehto Anthony Ogbechie Project Researcher Service Innovation Management University of Jyväskylä University of Jyväskylä JYVÄSKYLÄN YLIOPISTO INFORMAATIOTEKNOLOGIAN TIEDEKUNTA 2017 TEKES HANKE: VALUE FROM PUBLIC HEALTH WITH COGNITIVE COMPUTING

Upload: others

Post on 20-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

Data Lakes and Their Implication on the Global Health Sector v2.0

Research | Blogs | Expert Opinions

Pekka Neittaanmäki

Dean of the Faculty of Information Technology

Professor, Department of Mathematical Information Technology

University of Jyväskylä

Miikael Lehto Anthony Ogbechie

Project Researcher Service Innovation Management

University of Jyväskylä University of Jyväskylä

JYVÄSKYLÄN YLIOPISTO

INFORMAATIOTEKNOLOGIAN TIEDEKUNTA 2017

TEKES HANKE: VALUE FROM PUBLIC HEALTH WITH COGNITIVE COMPUTING

Page 2: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

TABLEOFCONTENTS

WhatisaDataLake?..............................................................................3

DataLakes101:AnOverview.................................................................4

DataLake..............................................................................................4

DataLakesandthePromiseofUnsiloedData........................................6

InvestinginaDataLake?ShoreUptheBigDataGateway.....................7

WhyDoINeedaDataLake?...................................................................8

DataLakes:TheBiggestBigDataChallenges.........................................9

Your‘ResolutionList’for2017:5BestPracticesforUnleashingthePowerofYourDataLakes.................................................................................9

ChartingtheCourseTowardValue-BasedCareWithaHealthcareDataLake.....................................................................................................11

DivinginWithaHealthcareDataLakeforPredictiveCare...................11

WhyHealthcareOrganisationsShouldTakethePlungeWithDataLakes.............................................................................................................12

PartnersDataLakeOffersHealthcareAnalyticsasaService................13

MedicalInsightSettoFlowFromSemanticDataLakes........................13

Page 3: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

3

WhatisaDataLake?A data lake is a shared data environment that comprises multiple repositories andcapitalizesonbigdatatechnologies.Itprovidesdatatoanorganizationforavarietyofanalyticsprocessingincluding:

• discoveryandexplorationofdata• simpleadhocanalytics• complexanalysisforbusinessdecisions• reporting• real-timeanalytics

Organizationsareincreasinglyexploringthedatalakeapproachtoaddressdemandsforanagileyetsecureandwell-governeddataenvironmentthatsupportsbothstructuredandunstructureddata.

Adatalakeis:

• Anenvironmentwhereuserscanaccessvastamountsofrawdata• Anenvironmentfordevelopingandprovingananalyticsmodel,andthen

movingitintoproduction• Ananalyticssandboxforexploringdatatogaininsight• Anenterprise-widecatalogthathelpsusersfinddataandlinkbusinessterms

withtechnicalmetadata• Anenvironmentforenablingreuseofdatatransformationsandqueries

Adata lakeisastoragerepositorythatholdsanenormousamountofraworrefineddata in native format until it is accessed. The term data lake is usually associatedwithHadoop-orientedobjectstorageinwhichanorganization’sdataisloadedintotheHadoopplatformandthenbusinessanalyticsanddata-miningtoolsareappliedtothedatawhereitresidesontheHadoopcluster.

However, data lakes can also be used effectively without incorporating Hadoopdependingontheneedsandgoalsoftheorganization.Thetermdatalakeisincreasinglybeingusedtodescribeanylargedatapoolinwhichtheschemaanddatarequirementsarenotdefineduntilthedataisqueried.https://www.ibm.com/analytics/uk-en/data-lake/

Page 4: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

4

DataLakes101:AnOverviewADataLakeisapoolofunstructuredandstructureddata,storedas-is,withoutaspecificpurposeinmind,thatcanbe“builtonmultipletechnologiessuchasHadoop,NoSQL,AmazonSimpleStorageService,arelationaldatabase,orvariouscombinationsthereof,”accordingtoawhitepapercalledWhatisaDataLakeandWhyHasitBecomePopular?

“ADataLakeischaracterizedbythreekeyattributes:

1. Collecteverything.ADataLakecontainsalldata,bothrawsourcesover

extendedperiodsoftimeaswellasanyprocesseddata.2. Diveinanywhere.ADataLakeenablesusersacrossmultiplebusinessunitsto

refine,exploreandenrichdataontheirterms.3. Flexibleaccess.ADataLakeenablesmultipledataaccesspatternsacrossa

sharedinfrastructure:batch,interactive,online,search,in-memoryandotherprocessingengines.”

http://www.dataversity.net/data-lakes-101-overview/

DataLakeThereisavitaldistinctionbetweenthedatalakeandthedatawarehouse.Thedatalakestoresrawdata,inwhateverformthedatasourceprovides.Thereisnoassumptionsabouttheschemaofthedata,eachdatasourcecanusewhateverschemaitlikes.It'suptotheconsumersofthatdatatomakesenseofthatdatafortheirownpurposes.

Page 5: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

5Thecomplexityofthisrawdatameansthatthereisroomforsomethingthatcuratesthedataintoamoremanageablestructure(aswellasreducingtheconsiderablevolumeofdata.)Thedatalakeshouldn'tbeaccesseddirectlyverymuch.Becausethedataisraw,youneedalotofskilltomakeanysenseofit.Youhaverelativelyfewpeoplewhoworkin thedata lake,as theyuncovergenerallyusefulviewsofdata in the lake, theycancreateanumberofdatamartseachofwhichhasaspecificmodelforasingleboundedcontext.Alargernumberofdownstreamuserscanthentreattheselakeshoremartsasanauthoritativesourceforthatcontext.

https://martinfowler.com/bliki/DataLake.html

Page 6: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

6

DataLakesandthePromiseofUnsiloedData

Hadoopallowsthehospital’sdisparaterecordstobestoredintheirnativeformatsforlater parsing, rather than forcing all-or-nothing integration up front as in a datawarehousing scenario. Preserving the native format also helps maintain dataprovenance and fidelity, so different analyses can be performed using differentcontexts.

Previous approaches to broad-based data integration have forced all users into acommonpredeterminedschema,ordatamodel.Unlikethismonolithicviewofasingleenterprise-widedatamodel,thedatalakerelaxesstandardizationanddefersmodeling,resultinginanearlyunlimitedpotentialforoperationalinsightanddatadiscovery.Asdatavolumes,datavariety,andmetadatarichnessgrow,sodoesthebenefit.

Page 7: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

7

http://usblogs.pwc.com/emerging-technology/data-lakes-and-the-promise-of-unsiloed-data/

InvestinginaDataLake?ShoreUptheBigDataGatewayBycapturinglargelyunstructureddataforalowcostandstoringvarioustypesofdatainthesameplace,adatalake:

• Breaksdownsilosandroutesinformationintoonenavigablestructure.Data

poursintothelakelivesthereuntilitisneeded,whenitflowsbackoutagain.

• Gathersallinformationintooneplacewithoutfirstqualifyingwhetherapieceofdataisrelevantornot.

• Enablesanalyststoeasilyexplorenewdatarelationships,unlockinglatent

value.Datadistillationcanbeperformedondemandbasedonbusinessneeds,allowingforidentifyingnewpatternsandrelationshipsinexistingdata.

Page 8: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

8• Helpsdeliverresultsfasterthanatraditionaldataapproach.Datalakesprovide

aplatformtoutiliseheapsofinformationforbusinessbenefitsinnearreal-time.

• Healthcare:Healthsystemsmaintainandanalysemillionsofrecordsfor

millionsofpeopletoimproveambulatorycareandpatientoutcomes.Quickinsightandactiononsuchrecordsalsocanimproveoperationalefficiencyandenableanaccountablecareorganisation.

http://www.itproportal.com/2015/09/25/investing-data-lake-shore-up-big-data-gateway-2/

WhyDoINeedaDataLake?

Thedatalakeisapowerfuldataarchitecturethatleveragestheeconomicsofbigdata(where it is 20x to 50x cheaper to store,manage and analyze data as compared totraditionaldatawarehouse technologies).Andnewbigdataprocessingandanalyticscapabilitieshelporganizationsaddressbusinessandoperationalchallengesthatweredifficult to address using conventional Business Intelligence and data warehousingtechnologies.

https://www.linkedin.com/pulse/why-do-i-need-data-lake-bill-schmarzo

Page 9: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

9

DataLakes:TheBiggestBigDataChallenges

Datalakesneedtohavefourprimarycomponents–dataingestion,datamanagement,querymanagementanddatalakemanagement.DataIngestion

• Modeldriven• Semantictagging• On-demandquery• Streaming• Scheduledbatchload• Selfservice

DataManagement• Datamovement• Dataprovenance• Types(Inmemory,NoSQL,MR,columnar,graph,Semantic,HDFS)• Dataflow(governeddata,lightlygoverneddata,ungoverneddata)

QueryManagement• Semanticsearch• Datadiscovery• Analyticsdirectedtobestqueryengine• Captureandshareanalyticsexpertise• Querydata,metadataandprovenance

DataLakeManagement• Models(Bizunitdataoptimizedtoassistanalytics)• Dataassetscatalog(ontologies,taxonomies)• Workflow(processes,schedules,provenancecapture)• Accessmanagement(AAA,group/role/rule/user-basedauthorization)• Metadata

http://analytics-magazine.org/data-lakes-biggest-big-data-challenges/

Your‘ResolutionList’for2017:5BestPracticesforUnleashingthePowerofYourDataLakes

Inthepast,companiesthoughtthey’dgainfull360-degreevisibilityintotheirenterpriseinformationwith a datawarehouse. However, the advent of big data has put thesesystemsunderdistress,pushingthemtocapacity,anddrivingupthecostsofstorage.Asa result, some companies have startedmoving some of their data (often times less

Page 10: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

10utilizeddata)offtoanewsetofsystemslikethoseruninHadoop,NoSQLdatabasesortheCloud.

Asaresultofthismigration,companiesalsocametorealizethattheycanactuallydomorewithHadoop,NoSQLandCloudvs.usingenterprisedatawarehouses.Thus,theystarted addingnew sourcesof data like sensor,mobile, social andbigdata to thesesystems, ultimately transforming their Hadoop, NoSQL and Cloud systems into datalakes.

AccordingtoNickHuedeckeratGartner,“Datalakesaremarketedasenterprise-widedatamanagementplatformsforanalyzingdisparatesourcesofdatainitsnativeformat.Theideaissimple:insteadofplacingdatainapurpose-builtdatastore,youmoveitintoadatalakeinitsoriginalformat.Thiseliminatestheupfrontcostsofdataingestion,liketransformation.Oncedataisplacedintothelake,it’savailableforanalysisbyeveryoneintheorganization.”

Thedatalakemetaphoremergedbecause‘lakes’areagreatconcepttoexplainoneofthebasicprinciplesofbigdata.Thatis,theneedtocollectalldataanddetectexceptions,trendsandpatternsusinganalyticsandmachine learning.This isbecauseoneof thebasicprinciplesofdatascienceisthemoredatayouget,thebetteryourdatamodelwillultimatelybe.Withaccesstoallthedata,youcanmodelusingtheentiresetofdataversusjustasampleset,whichreducesthenumberoffalsepositivesyoumightget.

Thelackofdatagovernanceispreventingmanyorganizationsfromfullyopeningupdatalakesforallemployeestouse,becausemoreoftenthannot,datalakescontainsensitivedatalikesocialsecuritynumbers,dateofbirth,creditcardnumbers,etc.thatneedtobeprotected.Hencetheseorganizationswillnotreapthefullbenefitsandgettheirfullreturn on data lake investment without having a thorough information governancestrategy.https://www.talend.com/blog/2016/12/21/your-resolution-list-for-2017-5-best-practices-for-unleashing-the-power-of-your/

Page 11: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

11

ChartingtheCourseTowardValue-BasedCareWithaHealthcareDataLake

Thelargevolumesofunstructuredandsemi-structureddatathatarecurrentlysiloedinEHRs,PACS,and labsystemswillalsoneedtobe integratedtoguide informeddata-drivendecisions.Thisenterprise-wideapproachhelpsrevealactionableinsightsaboutanorganization’sperformanceindicatorsandimpactsofpatientcareinterventionstobettermanagerisksanddeliveraffordable,higherqualitycare.

A data lake offers healthcare organizations a powerful data architecture with one,unified location for all healthcare data required for mining and analysis by clinicaldepartments,businessanalysts,anddatascienceteams.Theendgoal—reducetimetoinsights—isattainedthroughtheabilitytoanalyzemultiplevariablestoquicklyidentifytrends,patterns,andcorrelations.

Complementingexistingbusinessintelligenceanddatawarehouseinvestments,adatalake enables healthcare providers to execute analytics across disparate systems —runningdatabases,datawarehouses,andstructuredorunstructureddatasetswithoutimpactingday-to-dayoperationsoraccesstodata.

https://www.healthitoutcomes.com/doc/charting-the-course-toward-value-based-care-with-a-healthcare-data-lake-0001

DivinginWithaHealthcareDataLakeforPredictiveCare

Notonlyisdatacomingintothehealthcaresystematarapidrate,butitisalsocominginmanyformsincludingstructured,semi-structured,andunstructuredformats-suchastheEMR,patientimages,labreports,pathology,genomics,clinicalnotes,andsocialmediaactivity.Instrumentation,sensor,andtelemetrydata.Yet,alloftheinformationsourcesresidingacrossthehealthsystemrelevanttoaparticularpatientcareepisodeneedstobeavailabletotherightcaregiverattherighttimefora360-degreeviewofthepatienttoprovideasafeandappropriatediagnosis.

Movingtowardpredictiveanalyticscreatesfuture-focusedinsights-creatinganewrealmof data science for uncovering trends, patterns, relationships, correlations, anddiscoveries that impact integrated patient care. Data can also be consolidated fromoutside resources, including payers, genomic research centers, biobanks, and socialmediafeeds.

Yourorganizationcanincorporatecurrentbusinessintelligencereporting,buildinternaldatasciencecapabilities,andthen,deployadatalakeforhealthcaretomoveforwardpredictiveanalyticsanddatamining.

Page 12: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

12

Simplyput,adatalakeprovidesanITenvironmentthatincorporatesstructured,semi-structured, and unstructured data from trusted external and internal sources, andultimatelyimproveseffectivenessandqualityofcriticalbusinessandclinicalpractices.In addition, a data lake applies advanced analytics to produce actionable insights,enablingtimelyinterventionstopreventadversehealtheventsandultimately,elevateoverallpopulationwellness.

https://www.linkedin.com/pulse/diving-healthcare-data-lake-predictive-care-eric-newman

WhyHealthcareOrganisationsShouldTakethePlungeWithDataLakesHereinliesthesolutioninadatalake.Datalakescanbenefithealthcareorganisationsbyrevealingactionableinsightsaboutanorganisation’sperformanceindicatorsandimpactsonpatientcareinterventionstobettermanagerisksanddeliveraffordable,higherqualitycare.Thistypeoftechnologydrawsalldataintoasinglelocation,reducingoreliminatingsiloesacrossthehealthcareenterprise,whilecollectingdatafromtrustedexternalsources,suchasresearchcentresandpublichealthdatabases.Thedatalakemakesiteasyforhealthcareprofessionalstomineandanalyseallsourcesofdataforinsightsintopatientcare.Theycanalsoanalysedatatounderstandhowtheycanmaketheirorganisationsmoreefficientand inawaythatwon’tadverselyaffectpatient care. Analytics can provide insight to optimize staffing schedules and bettermanageITresources,whilsthelpingtoidentifytrendsinservices.

Andwithpatientconcernsovertheprivacyofinformationgathered,thevalueofactivelymanagingthisdataalsosupportsanumberofotherbenefitsincludingcompliancewithregulatory requirements, eliminating backup concerns and ensuring patient data issecure.Asdataentersthe“lake,”eachpiecesofinformationisidentifiedwitharangeofsecurityinformationthatembedssecuritycapabilitywithinthedataitself.Thistypeofapproachcouldreducebarrierstoinformationsharingbetweentheindustryandthepublic.

https://www.linkedin.com/pulse/why-healthcare-organisations-should-take-plunge-data-lakes-norman

Page 13: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

13

PartnersDataLakeOffersHealthcareAnalyticsasaService“We’reputtingpublicdomaindatainthere,aswellasinstitutionaldata,”hesaid.“Andthen investigators can bring their own data, too. That’s what’s unique about theplatform.Itprovidestools,storage,andtechnology,andthesecuritywrappedaroundit,soinvestigatorsdon’tnecessarilyhavetothinkabouttheinfrastructure.Buttheycanbringtheirideasandbegintoleveragethosetoolstodeveloptheirareaofinterest.”

In the future, the platform will be able to ingest imaging data from radiology andpathologysources,aswell,whichwillhelptocultivatearichandrobustcommondatasetthatresearchersdrawuponfortheirwork.

Data lakesarebecomingapopularwaytostoremassivevolumesofhealthcaredatathatmayormaynothaveaclearusecaseatthemoment.Unliketraditionalrelationaldatabases,datalakesdonotneedtolockdataintoanyparticularformat,category,orframework.Instead,agraphdatabaseordatalakecanapplyastandardizedtagtoeachdataelement,andthetagscanbecombinedandrecombinedinendlesswaystoproducenearlyunlimitedinsights.

https://healthitanalytics.com/news/partners-data-lake-offers-healthcare-analytics-as-a-service

MedicalInsightSettoFlowFromSemanticDataLakesAccordingtoFranzCEOJansAasman,asemanticdatalakeemploysacombinationoftechnologies,includingHadoop,graphanalytics,asemantic“triplestore,”theSPARQLquerylanguage,andSpark-basedmachinelearning,toallowdoctorstoconnectthedotsbetweenpatientconditionsandaworldofknowledgecontainedinstructuredinternalsystems,aswellasunstructureddatasourcesoutsideoftheorganization.

Insteadofhammeringthedataintoarelationalschemaandstoringitinadatamart,theywanttopullallthepertinentdata—includingany“linkedopendata”suchasdruginteractiondatabases,genetictestresults,orademographicsdatabasesortedbyZIPcode–intoHadoopandHDFS.

ThesemanticmagicstartsoncethedataisinHadoopandHBase.MontefioretransformsallthedataintosemantictriplesusingaspecialETLtoolcalledasemanticannotationengine.ThentheFranzAllegroGraphgraphdatabasesindexesallthetriplesandpowerstheSPARQLqueriesacrossthejoinedcorpus,inadditiontomachinelearningpoweredbyApacheSparkMLorR.

Page 14: Data Lakes and Their Implication on the Global Health ... · A data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is

14Asaresultofthissemanticapproach,userscanrunqueriesthattraversedifferentdatasources,whichwasthestumblingblockoftraditionalrelationalapproaches.

https://www.datanami.com/2015/08/26/medical-insight-set-to-flow-from-semantic-data-lakes/