big data technologies for storage and querying of large...

HADOOP ECOSYSTEM

APPLICATIONS

ACKNOWLEDGEMENT

Special thanks to Shona Matthew, the NIHR Global Health Research Unit Manager, and Howard Rogers from the HIC center, UoD. The research was commissioned by the National Institute for Health Research using Official Development Assistance (ODA) funding [INSPIRED 16/136/102]. Disclaimer: The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.

SCOTLAND DATA:

•  Clinical Data: 250K people

o  GoDARTS

o  GoSHARE

o  NHS Tayside and NHS Fife

•  Genotype Data: 13,000 people x 40 million SNPs

INDIAN DATA:

•  Clinical Data: 400K people

•  Genotype Data: 25K people

UK BIOBANK DATA:

•  Clinical Data: 490K people

•  Genotype Data: 490K people and 93 million SNPs
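To give a feel for the scale these figures imply, the short sketch below multiplies people by SNPs for the two cohorts where both numbers are stated. The one-byte-per-genotype-call storage figure is purely an illustrative assumption, not something the poster reports, and the Indian cohort is left out because no SNP count is given for it.

```python
# Cohort sizes as quoted in the dataset summary: (people, SNPs).
COHORTS = {
    "Scotland": (13_000, 40_000_000),
    "UK Biobank": (490_000, 93_000_000),
}


def genotype_calls(people: int, snps: int) -> int:
    """Number of cells in a dense people-by-SNP genotype matrix."""
    return people * snps


def dense_size_tb(calls: int, bytes_per_call: int = 1) -> float:
    """Size in terabytes, ASSUMING one byte per call (illustrative only)."""
    return calls * bytes_per_call / 1e12


for name, (people, snps) in COHORTS.items():
    calls = genotype_calls(people, snps)
    print(f"{name}: {calls:.3e} genotype calls, "
          f"about {dense_size_tb(calls):.2f} TB at 1 byte per call")
```

Even under this optimistic encoding, the larger matrix is far beyond what a single workstation can hold in memory, which is the motivation for distributed storage.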

INTRODUCTION

CHALLENGES

Big Data Technologies for Storage and Querying of Large-Scale Medical Data in Clinical Genetic Research

Gittu George1, Philip Appleby1, Alex Doney1, V. Mohan2, Radha V2

1 School of Medicine, University of Dundee, Dundee, United Kingdom

2 Madras Diabetic Research Foundation (MDRF), Chennai, India

SOLUTION

•  Work in a big data ecosystem

•  Move to algorithms that can run in parallel

•  Advanced file formats that are indexed and ready for parallelism

•  Centralized storage for both phenotype and genotype data

•  Provide the best performance for the amount of hardware available
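As a rough illustration of moving to algorithms that can run in parallel, the sketch below splits variant records by chromosome and summarizes each partition independently. The records, the field layout, and the helper names are hypothetical, and a local thread pool stands in for a cluster; this is not the poster's actual pipeline.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical variant records: (chromosome, position, minor allele frequency).
VARIANTS = [
    ("1", 10_500, 0.12),
    ("1", 20_900, 0.31),
    ("2", 5_100, 0.05),
    ("2", 7_700, 0.44),
    ("X", 1_200, 0.20),
]


def partition_by_chromosome(variants):
    """Group records so that each chromosome can be processed on its own."""
    parts = defaultdict(list)
    for chrom, pos, maf in variants:
        parts[chrom].append((pos, maf))
    return dict(parts)


def summarize(item):
    """Per-partition work: count variants and average their frequency."""
    chrom, rows = item
    mafs = [maf for _, maf in rows]
    return chrom, {"n_variants": len(rows), "mean_maf": sum(mafs) / len(mafs)}


def run_parallel(variants, workers=3):
    """Process every partition concurrently and collect the results."""
    parts = partition_by_chromosome(variants)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(summarize, parts.items()))


summary = run_parallel(VARIANTS)
print(summary)
```

Because no partition depends on another, the same per-partition function could in principle be handed to a cluster scheduler instead of a local pool.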

FUTURE WORK

•  Identify different questions to ask the data

•  Expand the Hadoop cluster to become more powerful

•  Show genomic significance for each chromosome in FIGIWAS

•  Dynamic UI for clinicians/geneticists to interact with the data

•  Flat text files, which are hard to split and are not naturally indexed, make it strenuous to separate and parallelize the data manually

•  Lack of centralized storage for genotype and phenotype data

•  Single-node tools and programming methods that don't scale

•  Difficulties in learning across multiple and deeper phenotypes & genotypes

WHY HADOOP?

•  Open-source framework for storing data and running applications in clusters

•  Horizontal scalability

•  Failure is normal and expected

•  Data locality: compute should move to the data

HADOOP DISTRIBUTED FILE SYSTEM (HDFS):

•  File system for the Hadoop framework

•  Uses commodity hardware, so cost is low

•  Optimized for MapReduce workloads: delivers data into the compute infrastructure at a very high data rate

•  Supports highly efficient file formats such as Parquet and ORC

MAPREDUCE:

•  Programming paradigm for Hadoop

•  Consists of a Mapper and a Reducer
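The Mapper and Reducer roles can be sketched in plain Python without a cluster. The example below counts how often each genotype occurs per SNP; the records and the function names (mapper, shuffle, reducer, map_reduce) are hypothetical and only mimic the paradigm, they are not Hadoop API calls.

```python
from collections import defaultdict

# Hypothetical genotype records: (SNP id, participant id, genotype call).
RECORDS = [
    ("rs1", "P001", "AA"),
    ("rs1", "P002", "AG"),
    ("rs1", "P003", "AA"),
    ("rs2", "P001", "CC"),
    ("rs2", "P002", "CC"),
]


def mapper(record):
    """Map step: emit one ((snp, genotype), 1) pair per input record."""
    snp_id, _participant, genotype = record
    yield (snp_id, genotype), 1


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def map_reduce(records):
    mapped = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


counts = map_reduce(RECORDS)
print(counts)
```

On a real cluster the map and reduce calls would run on many nodes at once, with the grouping step handled by the framework.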

Pig: An engine for executing data flows in parallel on Hadoop.

Hive: A data warehouse infrastructure for Hadoop.

Oozie: Workflow scheduler system for MapReduce jobs.

Spark: Hadoop on steroids. Runs faster in memory, and even faster on disk, than Hadoop.

HBase: A column-oriented database management system that runs on top of HDFS.

Kafka: Building real-time data pipelines and streaming apps on Hadoop.

•  Pipeline to load the data to HDFS

•  Enrich the data in HDFS

•  Merge the phenotype and genotype data into a single file for further analysis

•  Create a visualization UI to bridge the gap between medical researchers and growing big data
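The merge step can be pictured as a join on a shared participant identifier. The sketch below is a minimal single-machine version using only the standard library; the column names (participant_id, hba1c, rs1) and the records are hypothetical, and the actual pipeline on HDFS may work quite differently.

```python
import csv
import io

# Hypothetical phenotype and genotype rows keyed by participant_id.
PHENOTYPES = [
    {"participant_id": "P001", "hba1c": "6.1"},
    {"participant_id": "P002", "hba1c": "7.4"},
    {"participant_id": "P003", "hba1c": "5.6"},
]
GENOTYPES = [
    {"participant_id": "P001", "rs1": "AA"},
    {"participant_id": "P002", "rs1": "AG"},
]


def merge_by_participant(phenotypes, genotypes, key="participant_id"):
    """Inner join: keep participants that have both phenotype and genotype."""
    geno_index = {row[key]: row for row in genotypes}
    merged = []
    for pheno in phenotypes:
        geno = geno_index.get(pheno[key])
        if geno is not None:
            merged.append({**pheno, **geno})
    return merged


def to_csv(rows):
    """Serialize the merged rows as one CSV document (the 'single file')."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


merged = merge_by_participant(PHENOTYPES, GENOTYPES)
merged_csv = to_csv(merged)
print(merged_csv)
```

An inner join is used here so that every merged row is complete; keeping unmatched participants would be an equally reasonable choice depending on the analysis.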

FIGIWAS (Ongoing Work)

•  Landscape visualization of many gene variants and many diseases

•  Heat map to visualize the variations among different phenotypes on each chromosome

•  Ability to zoom, pan, and use orbital and turntable rotation

•  Provide a cut of the landscape for users to visualize the significant regions
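A heat map of phenotypes against chromosomes needs one score per cell. The sketch below shows one plausible way to build such a grid from association p-values, keeping the strongest signal per cell as -log10(p). It is not the FIGIWAS implementation: the records, the phenotype names, and the choice of the conventional 5e-8 genome-wide threshold are all assumptions made for illustration.

```python
import math

# Hypothetical association results: (phenotype, chromosome, p-value).
ASSOCIATIONS = [
    ("type2_diabetes", "1", 3e-9),
    ("type2_diabetes", "1", 0.02),
    ("type2_diabetes", "2", 0.4),
    ("hypertension", "1", 1e-3),
    ("hypertension", "2", 2e-10),
]

# Conventional genome-wide significance threshold (an assumption here).
GENOME_WIDE_P = 5e-8


def heatmap_grid(associations):
    """Return {phenotype: {chromosome: strongest -log10(p) in that cell}}."""
    grid = {}
    for phenotype, chrom, p_value in associations:
        score = -math.log10(p_value)
        row = grid.setdefault(phenotype, {})
        row[chrom] = max(score, row.get(chrom, 0.0))
    return grid


def significant_cells(grid, threshold=GENOME_WIDE_P):
    """List the (phenotype, chromosome) cells that reach the threshold."""
    cutoff = -math.log10(threshold)
    return sorted(
        (phenotype, chrom)
        for phenotype, row in grid.items()
        for chrom, score in row.items()
        if score >= cutoff
    )


grid = heatmap_grid(ASSOCIATIONS)
print(significant_cells(grid))
```

The list of significant cells is the kind of information a "cut of the landscape" view could highlight for the user.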
