an introduction to big data - dis.uniroma1.itrosati/dmds-1819/introduction-to-big-data.pdf · data...

37
Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione, Informatica e Statistica Sapienza Università di Roma AA 2018/2019 Domenico Lembo Dipartimento di Ingegneria Informatica, Automatica e Gestionale A. Ruberti An Introduction to Big Data

Upload: others

Post on 30-May-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Data Management for Data Science

Master of Science in Data Science

Facoltà di Ing. dell'Informazione, Informatica e Statistica Sapienza Università di Roma

AA 2018/2019

Domenico Lembo Dipartimento di Ingegneria Informatica,

Automatica e Gestionale A. Ruberti

An Introduction to Big Data

Page 2: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

AvailabilityofMassiveData

•  Digitaldataarenowadayscollectedatanunprecedentscaleandinverymanyformatsinavarietyofdomains(e-commerce,socialnetworks,sensornetworks,astronomy,genomics,medicalrecords,etc.)

•  Thisishasbeenmadepossiblebytheincrediblegrowthinrecentyearsofthecapacityofdatastoragetoolsandofthecomputingpowerofelectronicdevices,aswellasbytheadventofmobileandpervasivecomputing,cloudcomputing,andcloudstorage.

Page 3: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

ExploitabilityofMassiveData

•  Howtotransformavailabledataintoinformation,andhowtomakeorganizations’businesstotakeadvantagesofsuchinformationarelong-standingproblemsinIT,andinparticularininformationmanagementandanalysis.

•  Theseissueshavebecomemoreandmorechallengingandcomplexinthe“BigData”era

•  Atthesametime,facingthechallengecanbeevenmoreworthythaninthepast,sincethemassiveamountofdatathatisnowavailablemayallowforanalyticalresultsneverachievedbefore

Page 4: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Becareful!•  “Bigdataisavaguetermforamassive

phenomenonthathasrapidlybecomeanobsessionwithentrepreneurs,scientists,governmentsandthemedia”(TimHarford,journalistandeconomist,March,2014)*

*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu

Moore'sLawfor#BigData:Theamountofnonsensepackedintotheterm"BigData"doublesapproximatelyeverytwoyears(MikePluta,DataArchitect,onTwitterAugust2014).https://twitter.com/mikepluta/status/502878691740090369

Page 5: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

TheGoogleFluTrends*

•  2008:Googlepeoplepublish“Detectinginfluenzaepidemicsusingsearchenginequerydata”onnature(https://www.nature.com/articles/nature07634).

•  TheywereabletotrackthespreadofinfluenzaacrosstheUSmorequicklythantheUSCentersforDiseaseControlandPrevention(CDC).

•  Thetrackingwasessentiallybasedoncorrelationbetweenwhatpeoplesearchedforonlineandwhethertheyhadflusymptoms.

•  Fouryearslater,withasimilarexperimentsGooglepeopleoverstimatedthespreadofinfluenzabyalmostafactoroftwo!

•  “theory-freeanalysisofmerecorrelationsisinevitablyfragileifyouhavenoideawhatisbehindacorrelation”.

*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu

Page 6: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

CorrelationvsCausality!

Page 7: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

UnderstandingBigDataisinfactdifficult!

“Therearealotofsmalldataproblemsthatoccurinbigdata.Theydon’tdisappearbecauseyou’vegotlotsofthestuff.Theygetworse!”(DavidSpiegelhalter,CambridgeUniversity)

Page 8: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

ThinkingBigData*

"BigData"hasleaptrapidlyintooneofthemosthypedtermsinourindustry,yetthehypeshouldnotblindpeopletothefactthatthisisagenuinelyimportantshiftabouttheroleofdataintheworld.Theamount,speed,andvalueofdatasourcesisrapidlyincreasing.Datamanagementhastochangeinfivebroadareas:extractionofdatafromawiderrangeofsources,changestothelogisticsofdatamanagementwithnewdatabaseandintegrationapproaches,theuseofagileprinciplesinrunninganalyticsprojects,anemphasisontechniquesfordatainterpretationtoseparatesignalfromnoise,andtheimportanceofwell-designedvisualizationtomakethatsignalmorecomprehensible.Summingupthismeanswedon'tneedbiganalyticsprojects,insteadwewantthenewdatathinkingtopermeateourregularwork.”

MartinFowler

*http://martinfowler.com/articles/bigData/

Page 9: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

ThinkingBigData

•  Thus,roughly,BigDataisdatathatexceedstheprocessingcapacityofconventionaldatabasesystems

•  ButalsoBigDataisunderstoodasacapabilitythatallowscompaniestoextractvaluefromlargevolumesofdata

•  but,notice,thisdoesnotmeanonlyextremelylarge,massivedatabases

•  Besidesdatadimension,whatcharacterizesBigDataarealsotheheterogeneityinthewayinwhichinformationisstructured,thedynamicitywithwhichdatachanges,andtheabilityofquicklyprocessingit

•  Thiscallsfornewcomputingparadigmsorframeworks,notonlyadvanceddatastoragemechanisms

Page 10: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

TheThreeVs

TocharacterizeBigData,threeVsareused,whicharetheVsof

–  Volume

–  Velocity–  Variety

Page 11: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Volume•  Bigdataapplicationsarecharacterizedofcoursebybigamountsofdata,

wherebigmeansextremelylarge,e.g.,morethanaterabyte(TB)orpetabyte(PB),ormore.

•  Therearevariouscontextsinwhichthesedimensionscanbeeasilyreached:chattersfromsocialnetworks,webserverlogs,trafficflowsensors,satelliteimagery,broadcastaudiostreams,bankingtransactions,GPStrails,financialmarketdata,biologicaldata,etc.

•  Somemoreconcreteexamples:–  DespitesomeYoutubestatisticsareavailable1thetotalstoragecapacity

ofYoutubeit’snotknown,butrealisticallyitshouldbenolessthan1EB(2016)

–  NSAdatacenter:estimatedstoragecapacityofatleast2,000PBs(2013)2

–  Facebook:300PBdatawarehouse(2014)31http://web.archive.org/web/20150217015601/http://www.youtube.com/yt/press/statistics.html2http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/#1b66cc7c3cd23https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/

Page 12: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Volume•  Howmanydataintheworld?

AccordingtoIDC(InternationalDataCorporation):–  800Terabytes,2000–  160Exabytes,2006(1EB=1018B)–  500Exabytes,2009–  1.8Zettabytes,2011(1ZB=1021B)1–  2.8Zettabytes,20121–  4.4Zettabytes,2013–  175Zettabytesby20252(estimate)

1http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html2https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf3https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm

Around90%ofworld’sdatageneratedinthelast4years.

Thedigitaluniverseisdoublinginsize

everytwoyears3

Multipleofbitsorbytes

Symbol Name Decimalvalue

BinaryValue

k kilo 1000 1024

M mega 10002 10242

G giga 10003 10243

T tera 10004 10244

P peta 10005 10245

E exa 10006 10246

Z zetta 10007 10247

Y yotta 10008 10248

Page 13: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Volume

•  Thesheervolumeofdataisenoughtodefeatmanylong-followedapproachestodatamanagement

•  Traditionalcentralizeddatabasesystemscannothandlemanyofthedatavolumes,forcingtheuseofclusters

•  Datahavetobenecessarilydistributed,andthenumberofsourcesprovidinginformationcanbehuge,muchhigherthanthenumberconsideredintraditionaldataintegrationandvirtualizationsystems

Page 14: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Velocity

•  Data’svelocity(i.e.,therateatwhichdataiscollectedandmadeavailableintoanorganization)hasfollowedasimilarpatterntothatofvolume

•  Manydatasourcesaccessedbyorganizationsfortheirbusinessareextremelydynamic

•  Mobiledevicesincreasetherateofdatainflow:data“everywhere”,collectedandconsumedcontinuously

Page 15: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Velocity•  Someexamples:

–  Walmart:1milliontransactionperhour(2010)1–  eBay:datathroughputreaches100PBsperday(2013)2–  Googleprocesses100PBsperday(2013-14)3–  Facebook:600TBaddedtothewarehouseeveryday(2014)4–  6000-8000tweetspersecondeveryday(in2019)5

•  In2013,ithasbeenestimatedthateveryminuteofeverydaywecreated6:-  Morethan204millionemailmessages-  571newWebsitesand347blogpostscreated-  72hoursofnewYouTubevideos-  1.8millionsoflikeonFacebook-  216.000newphotosoninstagram-  $83.000spentonAmazon

1http://martinfowler.com/articles/bigData/2http://www.v3.co.uk/v3-uk/news/2302017/ebay-using-big-data-analytics-to-drive-up-price-listings3http://www.slideshare.net/kmstechnology/big-data-overview-2013-20144https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/5http://www.internetlivestats.com/twitter-statistics/6http://www.dailymail.co.uk/sciencetech/article-2381188/Revealed-happens-just-ONE-minute-internet-216-000-photos-posted-278-000-Tweets-1-8m-Facebook-likes.html

Page 16: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Velocity•  Processinginformationassoonasitisavailable,thusspeedingthe

“feedbackloop”,canprovidecompetitiveadvantages•  SomeexamplesofFastDataProcessing:

–  CustomerExperience/Retail:onlineretailsthatareabletosuggestadditionalproductstoacustomerateverynewinformationinsertedduringanon-linepurchase(Click-streamanalysis)

–  FinancialServicesIndustry:Algorithmictradingusingeventprocesstechnologybutalsoreal-timedataintegrationandanalytics

–  Telecommunication:understandallocationofnetworkresourcesbasedontrafficandapplicationrequirements,networkusagepatterns

–  Energy:real-timeprocessofhighvolumeofeventstomakeimportantdecisionsinordertoeffectivelyandefficientlymanagepossiblefaultsonthedistributionnetwork

–  Manufacturing:analyzereal-timemetricstotakecorrectiveactionbeforeafailureoccurs

Page 17: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Velocity

•  Streamprocessingisanewchallengingcomputingparadigm,whereinformationisnotstoredforlaterbatchprocessing,butisconsumedonthefly

•  Thisisparticularlyusefulwhendataaretoofasttostorethementirely(forexamplebecausetheyneedsomeprocessingtobestoredproperly),asinscientificapplications,orwhentheapplicationrequiresanimmediateanswer

Page 18: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Variety

•  Dataisextremelyheterogeneous:e.g.,intheformatinwhicharerepresented,butalsoandinthewaytheyrepresentinformation,bothattheintensionalandextensionallevel

•  E.g.,textfromsocialnetworks,sensordata,logsfromwebapplications,databases,XMLdocuments,RDFdata,etc.

•  Dataformatrangesthereforefromstructured(e.g,relationaldatabases)tosemistructured(e.g.,XMLdocuments),tounstructured(e.g.,textdocuments)

Page 19: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Variety

•  Asforunstructureddata,forexample,thechallengeistoextractmeaningforconsumptionbothbyhumansormachines

•  Entityresolution,whichistheprocessthatresolves,i.e.,identifies,entitiesanddetectsrelationships,thenplaysanimportantrole

•  Infact,thesearewell-knownissuesstudiedsinceseveralyearsinthefieldsofdataintegration,dataexchange,anddataquality.IntheBigDatascenario,however,theybecomeevenmorechallenging

Page 20: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

AfourthV:Veracity*

•  Dataareofwidelydifferentquality

•  Traditionallydataisthoughtofascomingfromwellorganizeddatabaseswithcontrolledschemas

•  Instead,in“BigData”thereisoftenlittleornoschematocontroltheirstructure

•  Theresultisthatthereareseriousproblemswiththequalityofthedata

*TheliteratureoftenmentionsonlythreeVsanddoesnotincludeveracity.

HoweversomeauthorstendtoincludeveracityasacorecharacteristcofBigData(alternatively,veracityisconsideredanaspectofvariety)

Page 21: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

BigData:V3+Value

BigDatacangeneratehugecompetitiveadvantages!

Page 22: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

ThevalueofDatafororganizations

•  Althoughit'sdifficulttogethardfiguresonthevalueofmakingfulluseofyourdata,muchofthesuccessofcompaniessuchasAmazonandGoogleiscreditedtotheireffectiveuseofdata1

•  Thuscompaniesspendlargeamountsofmoneytoreachthiseffectiveuse:AccordingtoIDC,in2017bigdataandanalyticssoftwaremarketreached$54.1billionwordlwide,anditisexpectedtogrowatafive-yearCAGR(compoundannualgrowthrate)of11.2%.(analysis2018-2022)2

•  ThusvariousBigDatasolutionsarenowpromotedbyallmajorvendorsindatamanagementsystems

1http://martinfowler.com/articles/bigData/2https://www.idc.com/getdoc.jsp?containerId=US44243318

Page 23: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Potentialvalue

Page 24: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Demandfornewdatamanagementsolutions*

•  Inthescenarioswedepicteditisnotsurprisingthatnewdatamangementsolutionsaredemanded

•  Indeed,despitethepopularityandwellunderstoodnatureofrelationaldatabases,itisnotthecasethattheyshouldalwaysbethedestinationfordata

•  Dependingonthecharacteristicofdata,certainclassesofdatabasesaremoresuitedthanothersfortheirmanagement

•  XMLdocumentsaremoreversatilewhenstoredindedicatedXMLstoragesystems(e.g.,MarkLogic)

•  SocialnetworkrelationsaregraphbynatureandgraphdatabasessuchasNeo4Jcanmakeoperationsonthemsimplerandmoreefficient

*From:EddDumbill.WhatisBigdata.InPlanningforBigData.O’ReillyRadarTeam

Page 25: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Demandfornewdatamanagementsolutions*

•  Adisadvantageoftherelationaldatabaseisthestaticnatureofitsschema

•  Inanagileenvironment,theresultsofcomputationwillevolvewiththedetectionandextractionofnewinformation

•  Semi-structuredNoSQLdatabasesmeetthisneedforflexibility:theyprovidesomestructuretoorganizedata(enoughforcertainapplications),butdonotrequiretheexactschemaofthedatabeforestoringit

*From:EddDumbill.WhatisBigdata.InPlanningforBigData.O’ReillyRadarTeam

Page 26: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

NoSQLdatabases*

Orbetter…notonlySQL•  Theterm"NoSQL"isveryill-defined.It'sgenerallyappliedtoa

numberofnon-relationaldatabasessuchasCassandra,Mongo,Dynamo,Neo4J,Riak,andmanyothers

•  Theyembraceschemalessdata,runonclusters,andhavetheabilitytotradeofftraditionalconsistencyforotherusefulproperties

•  AdvocatesofNoSQLdatabasesclaimthattheycanbuildsystemsthataremoreperformant,scalemuchbetter,andareeasiertoprogramwith

*From:MartinFowler.NoSQLDistilled.Preface.(http://martinfowler.com/books/nosql.html)

Page 27: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Graphdatabases

Page 28: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Key-valuesdatabases

Page 29: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Documentdatabases

Page 30: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

ColumnFamilyDatabases

Page 31: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

NoSQLdatabases*

•  Isthisthefirstrattleofthedeathknellforrelationaldatabases,oryetanotherpretendertothethrone?Ouranswertothatis"neither"

•  Relationaldatabasesareapowerfultoolthatweexpecttobeusingformanymoredecades,butwedoseeaprofoundchangeinthatrelationaldatabaseswon'tbetheonlydatabasesinuse

•  OurviewisthatweareenteringaworldofPolyglotPersistencewhereenterprises,andevenindividualapplications,usemultipletechnologiesfordatamanagement

*From:MartinFowler.NoSQLDistilled.Preface.(http://martinfowler.com/books/nosql.html)

Page 32: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

Multipletechnologiesfordatamanagement

Asanexercise,letusaskgooglewhichisthedatabaseengineusedbyFacebook.Wegetthefollowingtools1:•  MySQLascoredatabaseengine(infactacustomizedversion

ofMySQL,highlyoptimizedanddistributed)2•  Cassandra(anApacheopensourcefaulttolerantdistributed

NoSQLDBMS,originallydevelopedatFacebookitself)asdatabasefortheInobxmailsearch

•  Memcached,amemorycachingsystemtospeedupdynamicdatabasedrivenwebsites

•  HayStack,forstorageandmanagementofphotos•  Hive,anopensource,peta-bytescaledatawarehousing

frameworkbasedonHadoop,foranalytics,andalsoPresto,anexabytescaledatawarehouse3

1https://www.techworm.net/2013/05/what-database-actually-facebook-uses.html

2http://www.datacenterknowledge.com/data-center-faqs/facebook-data-center-faq-page-23http://prestodb.io/

Page 33: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

DataWarehouse•  Adatawarehouseisadatabaseusedforreportinganddata

analysis.Itisacentralrepositoryofdatawhichiscreatedbyintegratingdatafromoneormoredisparatesources

•  AccordingtoInmon*,adatawarehouseis:–  Subject-oriented:Thedatainthedatawarehouseisorganizedsothat

allthedataelementsrelatingtothesamereal-worldeventorobjectarelinkedtogether

–  Non-volatile:Datainthedatawarehouseareneverover-writtenordeletedoncecommitted,thedataarestatic,read-only,andretainedforfuturereporting

–  Integrated:Thedatawarehousecontainsdatafrommostorallofanorganization'soperationalsystemsandthesedataaremadeconsistent

–  Time-variant:Foranoperationalsystem,thestoreddatacontainsthecurrentvalue.Thedatawarehouse,however,containsthehistoryofdatavalues

*Inmon,Bill(1992).BuildingtheDataWarehouse.Wiley

Page 34: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

DataWarehousevs.BigData•  AreDataWarehouses(DWs)underthehatofBigData?

•  Thenotionofdatawarehousingdatesbacktotheendof80s,andverymanydatawarehouseandbusinessintelligencesolutionshavebeenproposedsincethen

•  BTW,BigDataandDWshavemanypointsincommon,atleastw.r.t.–  Volume:datawarehousesstorelargeamountsofdata,–  Variety:atleastinprinciple,datawarehousesintegrate

heterogeneousinformation–  Veracity:datawarehosesusuallyareequippedwithdatacleaning

solutions,appliedintheso-calledextract-transformation-load(ETL)phase

Page 35: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

DataWarehousevs.BigData

•  Existingenterprisedatawarehousesandrelationaldatabasesexcelatprocessingstructureddata,andcanstoremassiveamountsofdata,thoughatcost

•  However,thisrequirementforstructureimposesaninertiathatmakesdatawarehousesunsuitedforagileexplorationofmassiveheterogenousdata

•  Theamountofeffortrequiredtowarehousedataoftenmeansthatvaluabledatasourcesinorganizationsarenevermined

•  Therefore,newcomputingmodelsandframeworksareneededtomakenewDWsolutionscompliantwiththeBigDataecosystem.

Page 36: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

MapReduce

•  MapReduceisaprogrammingframeworkforparallelizingcomputation

•  OriginallydefinedatGoogle

•  Next,therehavebeenvariousimplementations

•  Awell-knownopensourcedistributionisApacheHadoop

Page 37: An Introduction to Big Data - dis.uniroma1.itrosati/dmds-1819/Introduction-to-big-data.pdf · Data Management for Data Science Master of Science in Data Science Facoltà di Ing. dell'Informazione,

MapReduce

AMapReduceprogramisconstitutedbytwocomponents•  Map()procedure(themapper)thatperformsfilteringand

sorting(itdecomposestheproblemintoparallelizablesubproblems)

•  Reduce()procedure(thereducer)devotedtosolvesubproblems

TheMapReduceFrameworkmanagesdistributedservers,whichexecutethevarioussubtasksinparallel,andcontrolscommunicationanddatatransfersbetweenthevariousservers,aswellasguaranteesfaulttoleranceanddisasterrecovery.