data lakes and their implication on the global health ... · a data lake is a storage repository...
TRANSCRIPT
Data Lakes and Their Implication on the Global Health Sector v2.0
Research | Blogs | Expert Opinions
Pekka Neittaanmäki
Dean of the Faculty of Information Technology
Professor, Department of Mathematical Information Technology
University of Jyväskylä
Miikael Lehto Anthony Ogbechie
Project Researcher Service Innovation Management
University of Jyväskylä University of Jyväskylä
JYVÄSKYLÄN YLIOPISTO
INFORMAATIOTEKNOLOGIAN TIEDEKUNTA 2017
TEKES HANKE: VALUE FROM PUBLIC HEALTH WITH COGNITIVE COMPUTING
TABLEOFCONTENTS
WhatisaDataLake?..............................................................................3
DataLakes101:AnOverview.................................................................4
DataLake..............................................................................................4
DataLakesandthePromiseofUnsiloedData........................................6
InvestinginaDataLake?ShoreUptheBigDataGateway.....................7
WhyDoINeedaDataLake?...................................................................8
DataLakes:TheBiggestBigDataChallenges.........................................9
Your‘ResolutionList’for2017:5BestPracticesforUnleashingthePowerofYourDataLakes.................................................................................9
ChartingtheCourseTowardValue-BasedCareWithaHealthcareDataLake.....................................................................................................11
DivinginWithaHealthcareDataLakeforPredictiveCare...................11
WhyHealthcareOrganisationsShouldTakethePlungeWithDataLakes.............................................................................................................12
PartnersDataLakeOffersHealthcareAnalyticsasaService................13
MedicalInsightSettoFlowFromSemanticDataLakes........................13
3
WhatisaDataLake?A data lake is a shared data environment that comprises multiple repositories andcapitalizesonbigdatatechnologies.Itprovidesdatatoanorganizationforavarietyofanalyticsprocessingincluding:
• discoveryandexplorationofdata• simpleadhocanalytics• complexanalysisforbusinessdecisions• reporting• real-timeanalytics
Organizationsareincreasinglyexploringthedatalakeapproachtoaddressdemandsforanagileyetsecureandwell-governeddataenvironmentthatsupportsbothstructuredandunstructureddata.
Adatalakeis:
• Anenvironmentwhereuserscanaccessvastamountsofrawdata• Anenvironmentfordevelopingandprovingananalyticsmodel,andthen
movingitintoproduction• Ananalyticssandboxforexploringdatatogaininsight• Anenterprise-widecatalogthathelpsusersfinddataandlinkbusinessterms
withtechnicalmetadata• Anenvironmentforenablingreuseofdatatransformationsandqueries
Adata lakeisastoragerepositorythatholdsanenormousamountofraworrefineddata in native format until it is accessed. The term data lake is usually associatedwithHadoop-orientedobjectstorageinwhichanorganization’sdataisloadedintotheHadoopplatformandthenbusinessanalyticsanddata-miningtoolsareappliedtothedatawhereitresidesontheHadoopcluster.
However, data lakes can also be used effectively without incorporating Hadoopdependingontheneedsandgoalsoftheorganization.Thetermdatalakeisincreasinglybeingusedtodescribeanylargedatapoolinwhichtheschemaanddatarequirementsarenotdefineduntilthedataisqueried.https://www.ibm.com/analytics/uk-en/data-lake/
4
DataLakes101:AnOverviewADataLakeisapoolofunstructuredandstructureddata,storedas-is,withoutaspecificpurposeinmind,thatcanbe“builtonmultipletechnologiessuchasHadoop,NoSQL,AmazonSimpleStorageService,arelationaldatabase,orvariouscombinationsthereof,”accordingtoawhitepapercalledWhatisaDataLakeandWhyHasitBecomePopular?
“ADataLakeischaracterizedbythreekeyattributes:
1. Collecteverything.ADataLakecontainsalldata,bothrawsourcesover
extendedperiodsoftimeaswellasanyprocesseddata.2. Diveinanywhere.ADataLakeenablesusersacrossmultiplebusinessunitsto
refine,exploreandenrichdataontheirterms.3. Flexibleaccess.ADataLakeenablesmultipledataaccesspatternsacrossa
sharedinfrastructure:batch,interactive,online,search,in-memoryandotherprocessingengines.”
http://www.dataversity.net/data-lakes-101-overview/
DataLakeThereisavitaldistinctionbetweenthedatalakeandthedatawarehouse.Thedatalakestoresrawdata,inwhateverformthedatasourceprovides.Thereisnoassumptionsabouttheschemaofthedata,eachdatasourcecanusewhateverschemaitlikes.It'suptotheconsumersofthatdatatomakesenseofthatdatafortheirownpurposes.
5Thecomplexityofthisrawdatameansthatthereisroomforsomethingthatcuratesthedataintoamoremanageablestructure(aswellasreducingtheconsiderablevolumeofdata.)Thedatalakeshouldn'tbeaccesseddirectlyverymuch.Becausethedataisraw,youneedalotofskilltomakeanysenseofit.Youhaverelativelyfewpeoplewhoworkin thedata lake,as theyuncovergenerallyusefulviewsofdata in the lake, theycancreateanumberofdatamartseachofwhichhasaspecificmodelforasingleboundedcontext.Alargernumberofdownstreamuserscanthentreattheselakeshoremartsasanauthoritativesourceforthatcontext.
https://martinfowler.com/bliki/DataLake.html
6
DataLakesandthePromiseofUnsiloedData
Hadoopallowsthehospital’sdisparaterecordstobestoredintheirnativeformatsforlater parsing, rather than forcing all-or-nothing integration up front as in a datawarehousing scenario. Preserving the native format also helps maintain dataprovenance and fidelity, so different analyses can be performed using differentcontexts.
Previous approaches to broad-based data integration have forced all users into acommonpredeterminedschema,ordatamodel.Unlikethismonolithicviewofasingleenterprise-widedatamodel,thedatalakerelaxesstandardizationanddefersmodeling,resultinginanearlyunlimitedpotentialforoperationalinsightanddatadiscovery.Asdatavolumes,datavariety,andmetadatarichnessgrow,sodoesthebenefit.
7
http://usblogs.pwc.com/emerging-technology/data-lakes-and-the-promise-of-unsiloed-data/
InvestinginaDataLake?ShoreUptheBigDataGatewayBycapturinglargelyunstructureddataforalowcostandstoringvarioustypesofdatainthesameplace,adatalake:
• Breaksdownsilosandroutesinformationintoonenavigablestructure.Data
poursintothelakelivesthereuntilitisneeded,whenitflowsbackoutagain.
• Gathersallinformationintooneplacewithoutfirstqualifyingwhetherapieceofdataisrelevantornot.
• Enablesanalyststoeasilyexplorenewdatarelationships,unlockinglatent
value.Datadistillationcanbeperformedondemandbasedonbusinessneeds,allowingforidentifyingnewpatternsandrelationshipsinexistingdata.
8• Helpsdeliverresultsfasterthanatraditionaldataapproach.Datalakesprovide
aplatformtoutiliseheapsofinformationforbusinessbenefitsinnearreal-time.
• Healthcare:Healthsystemsmaintainandanalysemillionsofrecordsfor
millionsofpeopletoimproveambulatorycareandpatientoutcomes.Quickinsightandactiononsuchrecordsalsocanimproveoperationalefficiencyandenableanaccountablecareorganisation.
http://www.itproportal.com/2015/09/25/investing-data-lake-shore-up-big-data-gateway-2/
WhyDoINeedaDataLake?
Thedatalakeisapowerfuldataarchitecturethatleveragestheeconomicsofbigdata(where it is 20x to 50x cheaper to store,manage and analyze data as compared totraditionaldatawarehouse technologies).Andnewbigdataprocessingandanalyticscapabilitieshelporganizationsaddressbusinessandoperationalchallengesthatweredifficult to address using conventional Business Intelligence and data warehousingtechnologies.
https://www.linkedin.com/pulse/why-do-i-need-data-lake-bill-schmarzo
9
DataLakes:TheBiggestBigDataChallenges
Datalakesneedtohavefourprimarycomponents–dataingestion,datamanagement,querymanagementanddatalakemanagement.DataIngestion
• Modeldriven• Semantictagging• On-demandquery• Streaming• Scheduledbatchload• Selfservice
DataManagement• Datamovement• Dataprovenance• Types(Inmemory,NoSQL,MR,columnar,graph,Semantic,HDFS)• Dataflow(governeddata,lightlygoverneddata,ungoverneddata)
QueryManagement• Semanticsearch• Datadiscovery• Analyticsdirectedtobestqueryengine• Captureandshareanalyticsexpertise• Querydata,metadataandprovenance
DataLakeManagement• Models(Bizunitdataoptimizedtoassistanalytics)• Dataassetscatalog(ontologies,taxonomies)• Workflow(processes,schedules,provenancecapture)• Accessmanagement(AAA,group/role/rule/user-basedauthorization)• Metadata
http://analytics-magazine.org/data-lakes-biggest-big-data-challenges/
Your‘ResolutionList’for2017:5BestPracticesforUnleashingthePowerofYourDataLakes
Inthepast,companiesthoughtthey’dgainfull360-degreevisibilityintotheirenterpriseinformationwith a datawarehouse. However, the advent of big data has put thesesystemsunderdistress,pushingthemtocapacity,anddrivingupthecostsofstorage.Asa result, some companies have startedmoving some of their data (often times less
10utilizeddata)offtoanewsetofsystemslikethoseruninHadoop,NoSQLdatabasesortheCloud.
Asaresultofthismigration,companiesalsocametorealizethattheycanactuallydomorewithHadoop,NoSQLandCloudvs.usingenterprisedatawarehouses.Thus,theystarted addingnew sourcesof data like sensor,mobile, social andbigdata to thesesystems, ultimately transforming their Hadoop, NoSQL and Cloud systems into datalakes.
AccordingtoNickHuedeckeratGartner,“Datalakesaremarketedasenterprise-widedatamanagementplatformsforanalyzingdisparatesourcesofdatainitsnativeformat.Theideaissimple:insteadofplacingdatainapurpose-builtdatastore,youmoveitintoadatalakeinitsoriginalformat.Thiseliminatestheupfrontcostsofdataingestion,liketransformation.Oncedataisplacedintothelake,it’savailableforanalysisbyeveryoneintheorganization.”
Thedatalakemetaphoremergedbecause‘lakes’areagreatconcepttoexplainoneofthebasicprinciplesofbigdata.Thatis,theneedtocollectalldataanddetectexceptions,trendsandpatternsusinganalyticsandmachine learning.This isbecauseoneof thebasicprinciplesofdatascienceisthemoredatayouget,thebetteryourdatamodelwillultimatelybe.Withaccesstoallthedata,youcanmodelusingtheentiresetofdataversusjustasampleset,whichreducesthenumberoffalsepositivesyoumightget.
Thelackofdatagovernanceispreventingmanyorganizationsfromfullyopeningupdatalakesforallemployeestouse,becausemoreoftenthannot,datalakescontainsensitivedatalikesocialsecuritynumbers,dateofbirth,creditcardnumbers,etc.thatneedtobeprotected.Hencetheseorganizationswillnotreapthefullbenefitsandgettheirfullreturn on data lake investment without having a thorough information governancestrategy.https://www.talend.com/blog/2016/12/21/your-resolution-list-for-2017-5-best-practices-for-unleashing-the-power-of-your/
11
ChartingtheCourseTowardValue-BasedCareWithaHealthcareDataLake
Thelargevolumesofunstructuredandsemi-structureddatathatarecurrentlysiloedinEHRs,PACS,and labsystemswillalsoneedtobe integratedtoguide informeddata-drivendecisions.Thisenterprise-wideapproachhelpsrevealactionableinsightsaboutanorganization’sperformanceindicatorsandimpactsofpatientcareinterventionstobettermanagerisksanddeliveraffordable,higherqualitycare.
A data lake offers healthcare organizations a powerful data architecture with one,unified location for all healthcare data required for mining and analysis by clinicaldepartments,businessanalysts,anddatascienceteams.Theendgoal—reducetimetoinsights—isattainedthroughtheabilitytoanalyzemultiplevariablestoquicklyidentifytrends,patterns,andcorrelations.
Complementingexistingbusinessintelligenceanddatawarehouseinvestments,adatalake enables healthcare providers to execute analytics across disparate systems —runningdatabases,datawarehouses,andstructuredorunstructureddatasetswithoutimpactingday-to-dayoperationsoraccesstodata.
https://www.healthitoutcomes.com/doc/charting-the-course-toward-value-based-care-with-a-healthcare-data-lake-0001
DivinginWithaHealthcareDataLakeforPredictiveCare
Notonlyisdatacomingintothehealthcaresystematarapidrate,butitisalsocominginmanyformsincludingstructured,semi-structured,andunstructuredformats-suchastheEMR,patientimages,labreports,pathology,genomics,clinicalnotes,andsocialmediaactivity.Instrumentation,sensor,andtelemetrydata.Yet,alloftheinformationsourcesresidingacrossthehealthsystemrelevanttoaparticularpatientcareepisodeneedstobeavailabletotherightcaregiverattherighttimefora360-degreeviewofthepatienttoprovideasafeandappropriatediagnosis.
Movingtowardpredictiveanalyticscreatesfuture-focusedinsights-creatinganewrealmof data science for uncovering trends, patterns, relationships, correlations, anddiscoveries that impact integrated patient care. Data can also be consolidated fromoutside resources, including payers, genomic research centers, biobanks, and socialmediafeeds.
Yourorganizationcanincorporatecurrentbusinessintelligencereporting,buildinternaldatasciencecapabilities,andthen,deployadatalakeforhealthcaretomoveforwardpredictiveanalyticsanddatamining.
12
Simplyput,adatalakeprovidesanITenvironmentthatincorporatesstructured,semi-structured, and unstructured data from trusted external and internal sources, andultimatelyimproveseffectivenessandqualityofcriticalbusinessandclinicalpractices.In addition, a data lake applies advanced analytics to produce actionable insights,enablingtimelyinterventionstopreventadversehealtheventsandultimately,elevateoverallpopulationwellness.
https://www.linkedin.com/pulse/diving-healthcare-data-lake-predictive-care-eric-newman
WhyHealthcareOrganisationsShouldTakethePlungeWithDataLakesHereinliesthesolutioninadatalake.Datalakescanbenefithealthcareorganisationsbyrevealingactionableinsightsaboutanorganisation’sperformanceindicatorsandimpactsonpatientcareinterventionstobettermanagerisksanddeliveraffordable,higherqualitycare.Thistypeoftechnologydrawsalldataintoasinglelocation,reducingoreliminatingsiloesacrossthehealthcareenterprise,whilecollectingdatafromtrustedexternalsources,suchasresearchcentresandpublichealthdatabases.Thedatalakemakesiteasyforhealthcareprofessionalstomineandanalyseallsourcesofdataforinsightsintopatientcare.Theycanalsoanalysedatatounderstandhowtheycanmaketheirorganisationsmoreefficientand inawaythatwon’tadverselyaffectpatient care. Analytics can provide insight to optimize staffing schedules and bettermanageITresources,whilsthelpingtoidentifytrendsinservices.
Andwithpatientconcernsovertheprivacyofinformationgathered,thevalueofactivelymanagingthisdataalsosupportsanumberofotherbenefitsincludingcompliancewithregulatory requirements, eliminating backup concerns and ensuring patient data issecure.Asdataentersthe“lake,”eachpiecesofinformationisidentifiedwitharangeofsecurityinformationthatembedssecuritycapabilitywithinthedataitself.Thistypeofapproachcouldreducebarrierstoinformationsharingbetweentheindustryandthepublic.
https://www.linkedin.com/pulse/why-healthcare-organisations-should-take-plunge-data-lakes-norman
13
PartnersDataLakeOffersHealthcareAnalyticsasaService“We’reputtingpublicdomaindatainthere,aswellasinstitutionaldata,”hesaid.“Andthen investigators can bring their own data, too. That’s what’s unique about theplatform.Itprovidestools,storage,andtechnology,andthesecuritywrappedaroundit,soinvestigatorsdon’tnecessarilyhavetothinkabouttheinfrastructure.Buttheycanbringtheirideasandbegintoleveragethosetoolstodeveloptheirareaofinterest.”
In the future, the platform will be able to ingest imaging data from radiology andpathologysources,aswell,whichwillhelptocultivatearichandrobustcommondatasetthatresearchersdrawuponfortheirwork.
Data lakesarebecomingapopularwaytostoremassivevolumesofhealthcaredatathatmayormaynothaveaclearusecaseatthemoment.Unliketraditionalrelationaldatabases,datalakesdonotneedtolockdataintoanyparticularformat,category,orframework.Instead,agraphdatabaseordatalakecanapplyastandardizedtagtoeachdataelement,andthetagscanbecombinedandrecombinedinendlesswaystoproducenearlyunlimitedinsights.
https://healthitanalytics.com/news/partners-data-lake-offers-healthcare-analytics-as-a-service
MedicalInsightSettoFlowFromSemanticDataLakesAccordingtoFranzCEOJansAasman,asemanticdatalakeemploysacombinationoftechnologies,includingHadoop,graphanalytics,asemantic“triplestore,”theSPARQLquerylanguage,andSpark-basedmachinelearning,toallowdoctorstoconnectthedotsbetweenpatientconditionsandaworldofknowledgecontainedinstructuredinternalsystems,aswellasunstructureddatasourcesoutsideoftheorganization.
Insteadofhammeringthedataintoarelationalschemaandstoringitinadatamart,theywanttopullallthepertinentdata—includingany“linkedopendata”suchasdruginteractiondatabases,genetictestresults,orademographicsdatabasesortedbyZIPcode–intoHadoopandHDFS.
ThesemanticmagicstartsoncethedataisinHadoopandHBase.MontefioretransformsallthedataintosemantictriplesusingaspecialETLtoolcalledasemanticannotationengine.ThentheFranzAllegroGraphgraphdatabasesindexesallthetriplesandpowerstheSPARQLqueriesacrossthejoinedcorpus,inadditiontomachinelearningpoweredbyApacheSparkMLorR.
14Asaresultofthissemanticapproach,userscanrunqueriesthattraversedifferentdatasources,whichwasthestumblingblockoftraditionalrelationalapproaches.
https://www.datanami.com/2015/08/26/medical-insight-set-to-flow-from-semantic-data-lakes/