
Data Management for Data Science

Master of Science in Data Science

Facoltà di Ing. dell'Informazione, Informatica e Statistica Sapienza Università di Roma

AA 2018/2019

Domenico Lembo

Dipartimento di Ingegneria Informatica, Automatica e Gestionale A. Ruberti

An Introduction to Big Data

Availability of Massive Data

•  Digital data are nowadays collected at an unprecedented scale and in very many formats in a variety of domains (e-commerce, social networks, sensor networks, astronomy, genomics, medical records, etc.)

•  This has been made possible by the incredible growth in recent years of the capacity of data storage tools and of the computing power of electronic devices, as well as by the advent of mobile and pervasive computing, cloud computing, and cloud storage.

Exploitability of Massive Data

•  How to transform available data into information, and how to make organizations’ business take advantage of such information, are long-standing problems in IT, and in particular in information management and analysis.

•  These issues have become more and more challenging and complex in the “Big Data” era

•  At the same time, facing the challenge can be even more worthwhile than in the past, since the massive amount of data that is now available may allow for analytical results never achieved before

Be careful!

•  “Big data is a vague term for a massive phenomenon that has rapidly become an obsession with entrepreneurs, scientists, governments and the media” (Tim Harford, journalist and economist, March 2014)*

*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu

Moore's Law for #BigData: The amount of nonsense packed into the term "BigData" doubles approximately every two years (Mike Pluta, Data Architect, on Twitter, August 2014). https://twitter.com/mikepluta/status/502878691740090369

The Google Flu Trends*

•  2008: Google researchers publish “Detecting influenza epidemics using search engine query data” in Nature (https://www.nature.com/articles/nature07634).

•  They were able to track the spread of influenza across the US more quickly than the US Centers for Disease Control and Prevention (CDC).

•  The tracking was essentially based on the correlation between what people searched for online and whether they had flu symptoms.

•  Four years later, with a similar experiment, Google researchers overestimated the spread of influenza by almost a factor of two!

•  “theory-free analysis of mere correlations is inevitably fragile if you have no idea what is behind a correlation”.

*http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3EvSLWwbu

Correlation vs Causality!

Understanding Big Data is in fact difficult!

“There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse!” (David Spiegelhalter, Cambridge University)

Thinking Big Data*

“"Big Data" has leapt rapidly into one of the most hyped terms in our industry, yet the hype should not blind people to the fact that this is a genuinely important shift about the role of data in the world. The amount, speed, and value of data sources is rapidly increasing. Data management has to change in five broad areas: extraction of data from a wider range of sources, changes to the logistics of data management with new database and integration approaches, the use of agile principles in running analytics projects, an emphasis on techniques for data interpretation to separate signal from noise, and the importance of well-designed visualization to make that signal more comprehensible. Summing up this means we don't need big analytics projects, instead we want the new data thinking to permeate our regular work.”

Martin Fowler

*http://martinfowler.com/articles/bigData/

Thinking Big Data

•  Thus, roughly, Big Data is data that exceeds the processing capacity of conventional database systems

•  But Big Data is also understood as a capability that allows companies to extract value from large volumes of data

•  but, notice, this does not mean only extremely large, massive databases

•  Besides data size, what characterizes Big Data is also the heterogeneity in the way in which information is structured, the dynamicity with which data changes, and the ability to process it quickly

•  This calls for new computing paradigms or frameworks, not only advanced data storage mechanisms

The Three Vs

To characterize Big Data, three Vs are used, which are the Vs of

–  Volume
–  Velocity
–  Variety

Volume

•  Big data applications are characterized of course by big amounts of data, where big means extremely large, e.g., more than a terabyte (TB) or petabyte (PB), or more.

•  There are various contexts in which these dimensions can be easily reached: chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, GPS trails, financial market data, biological data, etc.

•  Some more concrete examples:
–  Although some YouTube statistics are available¹, the total storage capacity of YouTube is not known, but realistically it should be no less than 1 EB (2016)
–  NSA data center: estimated storage capacity of at least 2,000 PB (2013)²
–  Facebook: 300 PB data warehouse (2014)³

¹ http://web.archive.org/web/20150217015601/http://www.youtube.com/yt/press/statistics.html
² http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/#1b66cc7c3cd2
³ https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/

Volume

•  How much data is there in the world?

According to IDC (International Data Corporation):
–  800 Terabytes, 2000
–  160 Exabytes, 2006 (1 EB = 10¹⁸ B)
–  500 Exabytes, 2009
–  1.8 Zettabytes, 2011 (1 ZB = 10²¹ B)¹
–  2.8 Zettabytes, 2012¹
–  4.4 Zettabytes, 2013
–  175 Zettabytes by 2025² (estimate)

Around 90% of the world’s data was generated in the last 4 years.

The digital universe is doubling in size every two years³

¹ http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html
² https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf
³ https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
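As a rough sanity check of the "doubling every two years" claim, the IDC figures listed above (4.4 ZB in 2013, 175 ZB estimated for 2025) can be turned into an implied doubling time. The sketch below assumes steady exponential growth between the two data points, which is a simplification; the function name is illustrative and not part of the slides.

```python
import math

def doubling_time_years(start_size, end_size, years):
    """Doubling time implied by steady exponential growth between two sizes."""
    doublings = math.log2(end_size / start_size)
    return years / doublings

# IDC figures quoted in the slide: 4.4 ZB (2013) and 175 ZB (2025 estimate).
implied = doubling_time_years(4.4, 175.0, 2025 - 2013)
print(f"Implied doubling time: {implied:.2f} years")
```

Under this assumption the result comes out a little above two years, broadly in line with the quoted rule of thumb.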

Multiples of bits or bytes

Symbol  Name   Decimal value  Binary value
k       kilo   1000           1024
M       mega   1000²          1024²
G       giga   1000³          1024³
T       tera   1000⁴          1024⁴
P       peta   1000⁵          1024⁵
E       exa    1000⁶          1024⁶
Z       zetta  1000⁷          1024⁷
Y       yotta  1000⁸          1024⁸
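The gap between the decimal and binary readings of a prefix grows with each row of the table. A small sketch (the dictionary and function names are illustrative, not from the slides) computes both values and their ratio:

```python
# Exponent n of each prefix: the value is 1000**n (decimal) or 1024**n (binary).
PREFIX_EXPONENT = {"k": 1, "M": 2, "G": 3, "T": 4, "P": 5, "E": 6, "Z": 7, "Y": 8}

def prefix_values(symbol):
    """Return (decimal value, binary value) for a prefix symbol."""
    n = PREFIX_EXPONENT[symbol]
    return 1000 ** n, 1024 ** n

for symbol, n in PREFIX_EXPONENT.items():
    decimal, binary = prefix_values(symbol)
    print(f"{symbol}: decimal 10^{3 * n}, binary 2^{10 * n}, "
          f"ratio {binary / decimal:.3f}")
```

For kilo the two values differ by 2.4%; for tera the binary value is already about 10% larger, which is why the chosen convention matters when quoting storage sizes.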

Volume

•  The sheer volume of data is enough to defeat many long-followed approaches to data management

•  Traditional centralized database systems cannot handle many of the data volumes, forcing the use of clusters

•  Data must necessarily be distributed, and the number of sources providing information can be huge, much higher than the number considered in traditional data integration and virtualization systems

Velocity

•  Data’s velocity (i.e., the rate at which data is collected and made available to an organization) has followed a similar pattern to that of volume

•  Many data sources accessed by organizations for their business are extremely dynamic

•  Mobile devices increase the rate of data inflow: data “everywhere”, collected and consumed continuously

Velocity

•  Some examples:
–  Walmart: 1 million transactions per hour (2010)¹
–  eBay: data throughput reaches 100 PB per day (2013)²
–  Google processes 100 PB per day (2013-14)³
–  Facebook: 600 TB added to the warehouse every day (2014)⁴
–  6,000-8,000 tweets per second every day (in 2019)⁵

•  In 2013, it was estimated that every minute of every day we created⁶:
–  More than 204 million email messages
–  571 new websites and 347 blog posts
–  72 hours of new YouTube videos
–  1.8 million likes on Facebook
–  216,000 new photos on Instagram
–  $83,000 spent on Amazon

¹ http://martinfowler.com/articles/bigData/
² http://www.v3.co.uk/v3-uk/news/2302017/ebay-using-big-data-analytics-to-drive-up-price-listings
³ http://www.slideshare.net/kmstechnology/big-data-overview-2013-2014
⁴ https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
⁵ http://www.internetlivestats.com/twitter-statistics/
⁶ http://www.dailymail.co.uk/sciencetech/article-2381188/Revealed-happens-just-ONE-minute-internet-216-000-photos-posted-278-000-Tweets-1-8m-Facebook-likes.html

Velocity

•  Processing information as soon as it is available, thus speeding up the “feedback loop”, can provide competitive advantages

•  Some examples of Fast Data Processing:
–  Customer Experience/Retail: online retailers that are able to suggest additional products to a customer at every new piece of information entered during an online purchase (click-stream analysis)
–  Financial Services Industry: algorithmic trading using event processing technology, but also real-time data integration and analytics
–  Telecommunication: understanding the allocation of network resources based on traffic and application requirements, and network usage patterns
–  Energy: real-time processing of high volumes of events to make important decisions in order to effectively and efficiently manage possible faults on the distribution network
–  Manufacturing: analyzing real-time metrics to take corrective action before a failure occurs

Velocity

•  Stream processing is a new, challenging computing paradigm, where information is not stored for later batch processing, but is consumed on the fly

•  This is particularly useful when data arrive too fast to be stored in their entirety (for example, because they need some processing to be stored properly), as in scientific applications, or when the application requires an immediate answer
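The contrast with batch processing can be illustrated with a minimal sketch in plain Python (not tied to any streaming framework): a generator consumes readings one at a time and emits a running average, keeping only a count and a sum in memory rather than the whole stream. The sensor readings are made-up values.

```python
from typing import Iterable, Iterator

def running_average(stream: Iterable[float]) -> Iterator[float]:
    """Yield the average of all values seen so far, one result per input.

    Only two numbers (count and total) are kept in memory, so the
    input never has to be stored in full.
    """
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

# Hypothetical sensor readings arriving one at a time.
readings = iter([20.0, 22.0, 21.0, 25.0])
for average in running_average(readings):
    print(f"running average: {average:.2f}")
```

Each answer is available as soon as the corresponding reading arrives, which is the "immediate answer" requirement mentioned above; a batch version would have to wait for the full list.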

Variety

•  Data is extremely heterogeneous: e.g., in the format in which it is represented, but also in the way it represents information, both at the intensional and extensional level

•  E.g., text from social networks, sensor data, logs from web applications, databases, XML documents, RDF data, etc.

•  Data format ranges therefore from structured (e.g., relational databases) to semistructured (e.g., XML documents), to unstructured (e.g., text documents)

Variety

•  As for unstructured data, for example, the challenge is to extract meaning for consumption by both humans and machines

•  Entity resolution, which is the process that resolves, i.e., identifies, entities and detects relationships, then plays an important role

•  In fact, these are well-known issues that have been studied for several years in the fields of data integration, data exchange, and data quality. In the Big Data scenario, however, they become even more challenging
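A deliberately naive sketch can make the idea of entity resolution concrete. It assumes that two records denote the same entity whenever their names contain the same tokens, regardless of case, punctuation, and order. Real systems use far more robust similarity measures; the records and source names below are invented.

```python
from collections import defaultdict

def normalize(name):
    """Crude matching key: lowercase, drop punctuation, sort the tokens."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))

def resolve(records):
    """Group the sources of records whose names share the same normalized key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[normalize(record["name"])].append(record["source"])
    return dict(clusters)

# Invented records from three hypothetical sources.
records = [
    {"source": "crm", "name": "Rossi, Mario"},
    {"source": "web", "name": "mario rossi"},
    {"source": "log", "name": "Maria Rossi"},
]
print(resolve(records))
```

The first two records collapse into one entity while the third stays separate. With heterogeneous, low-quality sources, exact-key matching of this kind quickly breaks down, which is one reason the problem becomes harder in the Big Data scenario.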

A fourth V: Veracity*

•  Data are of widely different quality

•  Traditionally, data is thought of as coming from well-organized databases with controlled schemas

•  Instead, in “Big Data” there is often little or no schema to control its structure

•  The result is that there are serious problems with the quality of the data

*The literature often mentions only three Vs and does not include veracity. However, some authors tend to include veracity as a core characteristic of Big Data (alternatively, veracity is considered an aspect of variety)

Big Data: V3 + Value

Big Data can generate huge competitive advantages!

The value of Data for organizations

•  Although it's difficult to get hard figures on the value of making full use of your data, much of the success of companies such as Amazon and Google is credited to their effective use of data¹

•  Thus companies spend large amounts of money to reach this effective use: according to IDC, in 2017 the big data and analytics software market reached $54.1 billion worldwide, and it is expected to grow at a five-year CAGR (compound annual growth rate) of 11.2% (analysis 2018-2022)²

•  Thus various Big Data solutions are now promoted by all major vendors in data management systems

¹ http://martinfowler.com/articles/bigData/
² https://www.idc.com/getdoc.jsp?containerId=US44243318
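To make the CAGR figure concrete, the following sketch projects the market size under the assumption that the 11.2% rate compounds annually for five years starting from the 2017 value of $54.1 billion. The base year and horizon are one reading of the slide, not figures taken from the IDC report, and the function names are illustrative.

```python
def project_with_cagr(initial_value, cagr, years):
    """Value after compounding `cagr` (as a fraction) for `years` years."""
    return initial_value * (1.0 + cagr) ** years

def cagr_between(initial_value, final_value, years):
    """Compound annual growth rate that links two values `years` apart."""
    return (final_value / initial_value) ** (1.0 / years) - 1.0

# Assumed reading of the slide: $54.1 billion in 2017, 11.2% CAGR, five years.
projected = project_with_cagr(54.1, 0.112, 5)
print(f"Projected market size: ${projected:.1f} billion")
```

Under these assumptions the market grows by roughly 70% over the five years; `cagr_between` performs the inverse computation and recovers the rate from the two end values.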

Potential value

Demand for new data management solutions*

•  In the scenarios we depicted, it is not surprising that new data management solutions are demanded

•  Indeed, despite the popularity and well-understood nature of relational databases, it is not the case that they should always be the destination for data

•  Depending on the characteristics of the data, certain classes of databases are better suited than others for their management

•  XML documents are more versatile when stored in dedicated XML storage systems (e.g., MarkLogic)

•  Social network relations are graphs by nature, and graph databases such as Neo4J can make operations on them simpler and more efficient

*From: Edd Dumbill. What is Big data. In Planning for Big Data. O’Reilly Radar Team

Demand for new data management solutions*

•  A disadvantage of the relational database is the static nature of its schema

•  In an agile environment, the results of computation will evolve with the detection and extraction of new information

•  Semi-structured NoSQL databases meet this need for flexibility: they provide some structure to organize data (enough for certain applications), but do not require the exact schema of the data before storing it

*From: Edd Dumbill. What is Big data. In Planning for Big Data. O’Reilly Radar Team
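The flexibility described above can be illustrated without any particular NoSQL product. In the sketch below a plain Python list stands in for a document collection (the records, field names, and helper functions are invented): documents with different fields are stored side by side, and a query simply skips documents that lack the requested field instead of relying on a fixed schema.

```python
# A list of dicts stands in for a document collection; no schema is declared.
collection = []

def insert(document):
    """Store a document as-is, whatever fields it has."""
    collection.append(document)

def find(field, value):
    """Return the documents that have `field` set to `value`."""
    return [doc for doc in collection if field in doc and doc[field] == value]

# Invented records: the second carries a field the first never declared.
insert({"name": "Alice", "city": "Rome"})
insert({"name": "Bob", "city": "Rome", "followers": ["Alice"]})
insert({"name": "Carol"})  # no city at all

print([doc["name"] for doc in find("city", "Rome")])
```

In a relational table the `followers` field would typically require a schema change before the second record could be stored; here the structure is carried by each document.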

NoSQL databases*

Or better… not only SQL

•  The term "NoSQL" is very ill-defined. It's generally applied to a number of non-relational databases such as Cassandra, Mongo, Dynamo, Neo4J, Riak, and many others

•  They embrace schemaless data, run on clusters, and have the ability to trade off traditional consistency for other useful properties

•  Advocates of NoSQL databases claim that they can build systems that are more performant, scale much better, and are easier to program with

*From: Martin Fowler. NoSQL Distilled. Preface. (http://martinfowler.com/books/nosql.html)

Graph databases

Key-value databases

Document databases

Column family databases

NoSQL databases*

•  Is this the first rattle of the death knell for relational databases, or yet another pretender to the throne? Our answer to that is "neither"

•  Relational databases are a powerful tool that we expect to be using for many more decades, but we do see a profound change in that relational databases won't be the only databases in use

•  Our view is that we are entering a world of Polyglot Persistence where enterprises, and even individual applications, use multiple technologies for data management

*From: Martin Fowler. NoSQL Distilled. Preface. (http://martinfowler.com/books/nosql.html)

Multiple technologies for data management

As an exercise, let us ask Google which database engine is used by Facebook. We get the following tools¹:

•  MySQL as core database engine (in fact a customized version of MySQL, highly optimized and distributed)²

•  Cassandra (an Apache open source fault-tolerant distributed NoSQL DBMS, originally developed at Facebook itself) as database for the Inbox mail search

•  Memcached, a memory caching system to speed up dynamic database-driven websites

•  Haystack, for storage and management of photos

•  Hive, an open source, petabyte-scale data warehousing framework based on Hadoop, for analytics, and also Presto, an exabyte-scale data warehouse³

¹ https://www.techworm.net/2013/05/what-database-actually-facebook-uses.html
² http://www.datacenterknowledge.com/data-center-faqs/facebook-data-center-faq-page-2
³ http://prestodb.io/

Data Warehouse

•  A data warehouse is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources

•  According to Inmon*, a data warehouse is:
–  Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together
–  Non-volatile: Data in the data warehouse are never overwritten or deleted; once committed, the data are static, read-only, and retained for future reporting
–  Integrated: The data warehouse contains data from most or all of an organization's operational systems and these data are made consistent
–  Time-variant: For an operational system, the stored data contains the current value. The data warehouse, however, contains the history of data values

*Inmon, Bill (1992). Building the Data Warehouse. Wiley

Data Warehouse vs. Big Data

•  Are Data Warehouses (DWs) under the umbrella of Big Data?

•  The notion of data warehousing dates back to the end of the 80s, and very many data warehouse and business intelligence solutions have been proposed since then

•  BTW, Big Data and DWs have many points in common, at least w.r.t.
–  Volume: data warehouses store large amounts of data
–  Variety: at least in principle, data warehouses integrate heterogeneous information
–  Veracity: data warehouses are usually equipped with data cleaning solutions, applied in the so-called extract-transform-load (ETL) phase

Data Warehouse vs. Big Data

•  Existing enterprise data warehouses and relational databases excel at processing structured data, and can store massive amounts of data, though at a cost

•  However, this requirement for structure imposes an inertia that makes data warehouses unsuited for agile exploration of massive heterogeneous data

•  The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined

•  Therefore, new computing models and frameworks are needed to make new DW solutions compliant with the Big Data ecosystem.

MapReduce

•  MapReduce is a programming framework for parallelizing computation

•  Originally defined at Google

•  Since then, there have been various implementations

•  A well-known open source distribution is Apache Hadoop

MapReduce

A MapReduce program consists of two components

•  Map() procedure (the mapper) that performs filtering and sorting (it decomposes the problem into parallelizable subproblems)

•  Reduce() procedure (the reducer) devoted to solving the subproblems (it combines the intermediate results produced by the mappers)

The MapReduce Framework manages distributed servers, which execute the various subtasks in parallel, and controls communication and data transfers between the various servers, as well as guaranteeing fault tolerance and disaster recovery.
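The two components can be sketched with the classic word-count example. The code below is a single-process simulation in plain Python, not Hadoop's API: the shuffle step that a real framework performs between the map and reduce phases (grouping intermediate pairs by key across servers) is imitated with a dictionary, and all function names are illustrative.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document could be mapped on a different server; here we just loop.
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

print(word_count(["big data is big", "data is everywhere"]))
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on the values of its own key, both phases can be distributed across servers; scheduling, data transfer, and recovery from failures are what the framework adds on top.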
