Big Data Analytics: A Practical Guide

Upload: kannanthegreat

Post on 18-Dec-2015

DESCRIPTION

It gives an overview of Big Data applications

TRANSCRIPT

Information Technology / Database

BIG DATA ANALYTICS
A Practical Guide for Managers

Kim H. Pries
Robert Dunnigan

With this book, managers and decision makers are given the tools to make more informed decisions about big data purchasing initiatives. Big Data Analytics: A Practical Guide for Managers not only supplies descriptions of common tools, but also surveys the various products and vendors that supply the big data market.

Comparing and contrasting the different types of analysis commonly conducted with big data, this accessible reference presents clear-cut explanations of the general workings of big data tools. Instead of spending time on HOW to install specific packages, it focuses on the reasons WHY readers would install a given package.

The book provides authoritative guidance on a range of tools, including open source and proprietary systems. It details the strengths and weaknesses of incorporating big data analysis into decision-making and explains how to leverage the strengths while mitigating the weaknesses.

Describes the benefits of distributed computing in simple terms
Includes substantial vendor/tool material, especially for open source decisions
Covers prominent software packages, including Hadoop and Oracle Endeca
Examines GIS and machine learning applications
Considers privacy and surveillance issues

The book further explores basic statistical concepts that, when misapplied, can be the source of errors. Time and again, big data is treated as an oracle that discovers results nobody would have imagined. While big data can serve this valuable function, all too often these results are incorrect yet are still reported unquestioningly. The probability of having erroneous results increases as a larger number of variables are compared unless preventative measures are taken.

The approach taken by the authors is to explain these concepts so managers can ask better questions of their analysts and vendors about the appropriateness of the methods used to arrive at a conclusion. Because the world of science and medicine has been grappling with similar issues in the publication of studies, the authors draw on their efforts and apply them to big data.

K23000
ISBN: 978-1-4822-3451-0

an informa business
www.crcpress.com
www.auerbach-publications.com

6000 Broken Sound Parkway, NW, Suite 300, Boca Raton, FL 33487
711 Third Avenue, New York, NY 10017
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN, UK

MATLAB and Simulink are trademarks of The MathWorks, Inc. and are used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB and Simulink software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB and Simulink software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Version Date: 20141024

International Standard Book Number-13: 978-1-4822-3452-7 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com

Contents

Preface
Acknowledgments
Authors

Chapter 1 Introduction
  So What Is Big Data?
  Growing Interest in Decision Making
  What This Book Addresses
  The Conversation about Big Data
  Technological Change as a Driver of Big Data
  The Central Question: So What?
  Our Goals as Authors
  References

Chapter 2 The Mother of Invention's Triplets: Moore's Law, the Proliferation of Data, and Data Storage Technology
  Moore's Law
  Parallel Computing, between and within Machines
  Quantum Computing
  Recap of Growth in Computing Power
  Storage, Storage Everywhere
  Grist for the Mill: Data Used and Unused
  Agriculture
  Automotive
  Marketing in the Physical World
  Online Marketing
  Asset Reliability and Efficiency
  Process Tracking and Automation
  Toward a Definition of Big Data
  Putting Big Data in Context
  Key Concepts of Big Data and Their Consequences
  Summary
  References

Chapter 3 Hadoop
  Power through Distribution
  Cost Effectiveness of Hadoop
  Not Every Problem Is a Nail
  Some Technical Aspects
  Troubleshooting Hadoop
  Running Hadoop
  Hadoop File System
  MapReduce
  Pig and Hive
  Installation
  Current Hadoop Ecosystem
  Hadoop Vendors
  Cloudera
  Amazon Web Services (AWS)
  Hortonworks
  IBM
  Intel
  MapR
  Microsoft
  Running Pig Latin Using Powershell
  Pivotal
  References

Chapter 4 HBase and Other Big Data Databases
  Evolution from Flat File to the Three V's
  Flat File
  Hierarchical Database
  Network Database
  Relational Database
  Object-Oriented Databases
  Relational-Object Databases
  Transition to Big Data Databases
  What Is Different about HBase?
  What Is Bigtable?
  What Is MapReduce?
  What Are the Various Modalities for Big Data Databases?
  Graph Databases
  How Does a Graph Database Work?
  What Is the Performance of a Graph Database?
  Document Databases
  Key-Value Databases
  Column-Oriented Databases
  HBase
  Apache Accumulo
  References

Chapter 5 Machine Learning
  Machine Learning Basics
  Classifying with Nearest Neighbors
  Naive Bayes
  Support Vector Machines
  Improving Classification with Adaptive Boosting
  Regression
  Logistic Regression
  Tree-Based Regression
  K-Means Clustering
  Apriori Algorithm
  Frequent Pattern-Growth
  Principal Component Analysis (PCA)
  Singular Value Decomposition
  Neural Networks
  Big Data and MapReduce
  Data Exploration
  Spam Filtering
  Ranking
  Predictive Regression
  Text Regression
  Multidimensional Scaling
  Social Graphing
  References

Chapter 6 Statistics
  Statistics, Statistics Everywhere
  Digging into the Data
  Standard Deviation: The Standard Measure of Dispersion
  The Power of Shapes: Distributions
  Distributions: Gaussian Curve
  Distributions: Why Be Normal?
  Distributions: The Long Arm of the Power Law
  The Upshot? Statistics Are Not Bloodless
  Fooling Ourselves: Seeing What We Want to See in the Data
  We Can Learn Much from an Octopus
  Hypothesis Testing: Seeking a Verdict
  Two-Tailed Testing
  Hypothesis Testing: A Broad Field
  Moving On to Specific Hypothesis Tests
  Regression and Correlation
  p Value in Hypothesis Testing: A Successful Gatekeeper?
  Specious Correlations and Overfitting the Data
  A Sample of Common Statistical Software Packages
  Minitab
  SPSS
  R
  SAS
  Big Data Analytics
  Hadoop Integration
  Angoss
  Statistica
  Capabilities
  Summary
  References

Chapter 7 Google
  Big Data Giants
  Google
  Go
  Android
  Google Product Offerings
  Google Analytics
  Advertising and Campaign Performance
  Analysis and Testing
  Facebook
  Ning
  Non-United States Social Media
  Tencent
  Line
  Sina Weibo
  Odnoklassniki
  Vkontakte
  Nimbuzz
  Ranking Network Sites
  Negative Issues with Social Networks
  Amazon
  Some Final Words
  References

Chapter 8 Geographic Information Systems (GIS)
  GIS Implementations
  A GIS Example
  GIS Tools
  GIS Databases
  References

Chapter 9 Discovery
  Faceted Search versus Strict Taxonomy
  First Key Ability: Breaking Down Barriers
  Second Key Ability: Flexible Search and Navigation
  Underlying Technology
  The Upshot
  Summary
  References

Chapter 10 Data Quality
  Know Thy Data and Thyself
  Structured, Unstructured, and Semistructured Data
  Data Inconsistency: An Example from This Book
  The Black Swan and Incomplete Data
  How Data Can Fool Us
  Ambiguous Data
  Aging of Data or Variables
  Missing Variables May Change the Meaning
  Inconsistent Use of Units and Terminology
  Biases
  Sampling Bias
  Publication Bias
  Survivorship Bias
  Data as a Video, Not a Snapshot: Different Viewpoints as a Noise Filter
  What Is My Toolkit for Improving My Data?
  Ishikawa Diagram
  Interrelationship Digraph
  Force Field Analysis
  Data-Centric Methods
  Troubleshooting Queries from Source Data
  Troubleshooting Data Quality beyond the Source System
  Using Our Hidden Resources
  Summary
  References

Chapter 11 Benefits
  Data Serendipity
  Converting Data Dreck to Usefulness
  Sales
  Returned Merchandise
  Security
  Medical
  Travel
  Lodging
  Vehicle
  Meals
  Geographical Information Systems
  New York City
  Chicago CLEARMAP
  Baltimore
  San Francisco
  Los Angeles
  Tucson, Arizona, University of Arizona, and COPLINK
  Social Networking
  Education
  General Educational Data
  Legacy Data
  Grades and Other Indicators
  Testing Results
  Addresses, Phone Numbers, and More
  Concluding Comments
  References

Chapter 12 Concerns
  Logical Fallacies
  Affirming the Consequent
  Denying the Antecedent
  Ludic Fallacy
  Cognitive Biases
  Confirmation Bias
  Notational Bias
  Selection/Sample Bias
  Halo Effect
  Consistency and Hindsight Biases
  Congruence Bias
  Von Restorff Effect
  Data Serendipity
  Converting Data Dreck to Usefulness
  Sales
  Merchandise Returns
  Security
  CompStat
  Medical
  Travel
  Lodging
  Vehicle
  Meals
  Social Networking
  Education
  Making Yourself Harder to Track
  Misinformation
  Disinformation
  Reducing/Eliminating Profiles
  Social Media
  Self Redefinition
  Identity Theft
  Facebook
  Concluding Comments
  References

Chapter 13 Epilogue
  Michael Porter's Five Forces Model
  Bargaining Power of Customers
  Bargaining Power of Suppliers
  Threat of New Entrants
  Others
  The OODA Loop
  Implementing Big Data
  Nonlinear, Qualitative Thinking
  Closing
  References

Preface

When we started this book, big data had not quite become a business buzzword. As we did our research, we realized the books we perused were either of the "Gee, whiz! Can you believe this?" class or incredibly abstruse. We felt the market needed explanation oriented toward managers who had to make potentially expensive decisions.

We would like managers and implementors to know where to start when they decide to pursue the big data option. As we indicate, the marketplace for big data is much like that for personal computing in the early 1980s: full of consultants, products with bizarre names, and tons of hyperbole. Luckily, in the 2010s, much of the software is open source and extremely powerful. Big data consultancies exist to translate this free software into useful tools for the enterprise. Hence, nothing is really free.

We also ensure our readers can understand both the benefits and the costs of big data in the marketplace, especially the dark side of data. By now, we think it is obvious that the US National Security Agency is an archetype for big data problem solving. Large-city police departments have their own statistical data tools and some of them ponder the usefulness of cell phone confiscation and investigation as well as the use of social media, which are public.

As we researched, we found ourselves surprised at the size of well-known marketers such as Google and Amazon. Both of these enterprises have purchased companies and have grown themselves organically. Facebook continues to purchase companies (e.g., Oculus, the supplier of a potentially game-changing virtual reality system) and has over 1 billion users. Algorithmic analysis of colossal volumes of data yields information; information allows vendors to tickle our buying reflexes before we even know our own patterns.

Previously, we thought Esri owned the geographical information systems market, but we found a variety of geographical information systems solutions, although the Esri product line is relatively mature and they serve large-city police departments across the United States. Database creators explore new ways of looking at and storing/retrieving data: methods going beyond the relational paradigm. New and old algorithmic methods called machine learning allow computers to sort and separate the useful data from the useless.

We have grown to appreciate the open-source statistical language R over the years. R has become the statistical lingua franca for big data. Some of the major statistical vendors advertise their functional partnerships with R. We use the tool ourselves to generate many of our figures. We suspect R is now the most powerful generally available statistical tool on the planet.

Let's move on and see what we can learn about big data!

MATLAB is a registered trademark of The MathWorks, Inc. For product information, please contact:

The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098 USA
Tel: 508 647 7000
Fax: 508-647-7001
E-mail: [email protected]
Web: www.mathworks.com

Acknowledgments

Kim H. Pries would like to acknowledge Janise Pries, the love of his life, for her support and editing skills. In addition, Robert Dunnigan supplied verbiage, chapters, Six Sigma expertise, and big data professionalism. As always, John Wyzalek and the Taylor & Francis team are key players in the production and publication of technical works such as this one.

Robert Dunnigan thanks his wife, Flabia Dunnigan, and his son Robert III for their love and patience during the composition of this book. He would also like to thank Kim H. Pries for his depth of expertise in a broad array of technical subjects as well as his experience as an author. He skillfully navigated the process of proposing, developing, and finalizing what is a unique and practical offering in the field of big data literature. Robert would also like to thank his employer, The Kratos Group, for their interest and moral support during the writing of this book. Kratos is a remarkable company of which Robert is proud to be a part. Finally, thanks are due to Taylor & Francis for bringing this new perspective on big data to market.

Authors

Kim H. Pries has four college degrees: a bachelor of arts in history from the University of Texas at El Paso (UTEP), a bachelor of science in metallurgical engineering from UTEP, a master of science in engineering from UTEP, and a master of science in metallurgical engineering and materials science from Carnegie-Mellon University. In addition, he holds the following certifications:

APICS
  Certified Production and Inventory Manager (CPIM)
American Society for Quality (ASQ)
  Certified Reliability Engineer (CRE)
  Certified Quality Engineer (CQE)
  Certified Software Quality Engineer (CSQE)
  Certified Six Sigma Black Belt (CSSBB)
  Certified Manager of Quality/Operational Excellence (CMQ/OE)
  Certified Quality Auditor (CQA)

Pries worked as a computer systems manager, a software engineer for an electrical utility, and a scientific programmer under a defense contract; for Stoneridge, Incorporated (SRI), he has worked as the following:

Software manager
Engineering services manager
Reliability section manager
Product integrity and reliability director

In addition to his other responsibilities, Pries has provided Six Sigma training for both UTEP and SRI, and cost reduction initiatives for SRI. Pries is also a founding faculty member of Practical Project Management. Additionally, in concert with Jon Quigley, Pries was a cofounder and principal with Value Transformation, LLC, a training, testing, cost improvement, and product development consultancy. Pries also holds Texas teacher certifications in:

Mathematics (8–12)
Mathematics (4–8)
Technology education (6–12)
Technology applications (EC–12)
Physics (8–12)
Generalist (4–8)
English Language Arts and Reading (8–12)
History (8–12)
Computer Science (8–12)
Science (8–12)
Special education (EC–12)

He trained for Introduction to Engineering Design and Computer Science and Software Engineering with Project Lead the Way. He currently teaches biotechnology, computer science and software engineering, and introduction to engineering design at the beautiful Parkland High School in the Ysleta Independent School District of El Paso, Texas.

Pries authored or coauthored the following books:

Six Sigma for the Next Millennium: A CSSBB Guidebook (Quality Press, 2005)
Six Sigma for the New Millennium: A CSSBB Guidebook, Second Edition (Quality Press, 2009)
Project Management of Complex and Embedded Systems: Ensuring Product Integrity and Program Quality (CRC Press, 2008), with Jon M. Quigley
Scrum Project Management (CRC Press, 2010), with Jon M. Quigley
Testing Complex and Embedded Systems (CRC Press, 2010), with Jon M. Quigley
Total Quality Management for Project Management (CRC Press, 2012), with Jon M. Quigley
Reducing Process Costs with Lean, Six Sigma, and Value Engineering Techniques (CRC Press, 2012), with Jon M. Quigley
A School Counselor's Guide to Ethics (Counselor Connection Press, 2012), with Janise G. Pries
A School Counselor's Guide to Techniques (Counselor Connection Press, 2012), with Janise G. Pries
A School Counselor's Guide to Group Counseling (Counselor Connection Press, 2012), with Janise G. Pries
A School Counselor's Guide to Practicum (Counselor Connection Press, 2013), with Janise G. Pries
A School Counselor's Guide to Counseling Theories (Counselor Connection Press, 2013), with Janise G. Pries
A School Counselor's Guide to Assessment, Appraisal, Statistics, and Research (Counselor Connection Press, 2013), with Janise G. Pries

Robert Dunnigan is a manager with The Kratos Group and is based in Dallas, Texas. He holds a bachelor of science in psychology and in sociology with an anthropology emphasis from North Dakota State University. He also holds a master of business administration from INSEAD, "the business school for the world," where he attended the Singapore campus.

As a Peace Corps volunteer, Robert served over 3 years in Honduras developing agribusiness opportunities. As a consultant, he later worked on the Afghanistan Small and Medium Enterprise Development project in Afghanistan, where he traveled the country with his Afghan colleagues and friends seeking opportunities to develop a manufacturing sector in the country.

Robert is an American Society for Quality certified Six Sigma Black Belt and a Scrum Alliance certified Scrum Master.

Chapter 1 Introduction

SO WHAT IS BIG DATA?

As a manager, you are expected to operate as a factotum. You need to be an industrial/organizational psychologist, a logician, a bean counter, and a representative of your company to the outside world. In other words, you are somewhat of a generalist who can dive into specifics. The specific technologies you encounter are becoming more complex, yet the differences between them and their predecessors are becoming more nuanced.

You may have already guided your firm's transition to other new technologies. Think of the Internet. In the decade and a half before this book was written, Internet presence went from being optional to being mandatory for most businesses. In the past decade, Internet presence went from being unidirectional to conversational. Once, your firm could hang out its online shingle with either information about its physical location, hours, and offerings if it were a brick-and-mortar business or else its offerings and an automated payment system if it were an online business. Firms ranging from Barnes & Noble to your corner pizza chain bridged these worlds.

A new buzzword arrived: Web 2.0. Despite much hyperbolic rhetoric, this designation described the real phenomenon of a reciprocal online world. A disgruntled representative of your company responding through the archetypical Web 2.0 technology called social media could cause real damage to your firm. Two news stories involving Twitter broke as this introduction was in its final stages of refinement.

First, Brendan Eich, the new CEO of the software organization Mozilla (creator of the Firefox browser), stepped down after news surfaced indicating he had donated money in support of Proposition 8, an anti-gay marriage initiative in California, some 6 years before (in 2008). An uproar erupted, largely on Twitter, which led Mr. Eich to resign. Voices in Mr. Eich's defense from across the political spectrum, including Andrew Sullivan, the respected conservative columnist who is himself gay and a proponent for gay marriage rights, and Conor Friedersdorf of The Atlantic, who was also an outspoken opponent of Proposition 8, did not save Mr. Eich's job. He was ousted.

The second Twitter story began with a tweeted complaint from a customer [email protected] the typical reaction of a company facing such a complaint in the public forum of Twitter. They invited @ElleRafter to provide more information, along with a link. Unlike the typical Twitter response, however, the US Airways tweet included a pornographic photo involving the use of a toy US Airways aircraft. This does not appear to have been a premeditated act by the US Airways representative involved, but it caused substantial humiliating press coverage for the company.

As the Internet spread and matured, it became a necessary forum for communication, as well as a dangerous tool whose potential for good or bad can pull in others by surprise or cause self-inflicted harm. Just as World War I generals were left to figure out how technology changed the field of battle, shifting the advantage from the offense to the defense, Internet technology left managers trying to cope with a new landscape filled with both promise and threats. Now, there is another new buzzword: big data.

So, what is big data? Is it a fad? Is it empty jargon? Is it just a new name for growing capacity of the same databases that have been a part of our lives for decades? Or, is it something qualitatively different? What are the promises of big data? From which direction should a manager anticipate threats?

The tendency of the media to hype new and barely understood phenomena makes it difficult to evaluate new technologies, along with the nature and extent of their significance. This book argues that big data is new and possesses strategic significance. The authors argue that big data builds on understandable developments in technology and is itself comprehensible. Although it is comprehensible, it is not easy to use and it can deliver misleading or incorrect results. However, these erroneous results are not often random. They result from certain statistical and data-related phenomena. Knowing these phenomena are real and understanding how they function enable you as a manager to become a better user of your big data system.

Like cell phones and e-mail, big data is a recent phenomenon that has emerged as a part of the panorama of our daily lives. When you shop online, catch up with friends on Facebook, conduct web searches, read articles referencing database searches, and receive unsolicited coupons, you interact with big data. Many readers, as participants in a store's loyalty program, possess a key fob featuring a bar code on one side and the logo of a favorite store on the other. One of the primary rationales of these programs, aside from decreasing your incentive to shop elsewhere, is to gather data on the company's most important customers. Every time you swipe your key fob or enter your phone number into the keypad of the credit card machine while you are checking out at the cash register, you are tying a piece of identifying data (who you are) with which items you purchased, how many items you purchased, what time of day you were shopping, and other data. From these, analysts can determine whether you shop by brand or buy whatever is on sale, whether you are purchasing different items from before (suggesting a life change), and whether you have stopped making your large purchases in the store and now only drop in for quick items such as milk or sugar. In the latter case, that is a sign you switched to another retailer for the bulk of your shopping, and coupons or some other intervention may be in order. Stores have long collected customer data, long before the age of big data, but they now possess the ability to pull in a greater variety of data and conduct more powerful analyses of the data.

Big data also influences us less obviously: it informs the obscure underpinnings of our society, such as manufacturing, transportation, and energy. Any industry developing enormous quantities of diverse data is ready for big data. In fact, these industries probably use big data already. The technological revolution occurring in data analytics enables more precise allocation of resources in our evolving economy, much as the revolution in navigational technology, from the superseded sextant to modern GPS devices, enabled ships to navigate open seas.

Big data is much like the Internet: it has drawbacks, but its net value is positive. The debate on big data, like political debate, tends toward misleading absolutes and false dichotomies. The truth, as in the case of political debates, almost never lies in those absolutes. As with a car, you do not start up a big data solution and let it motor along unguided; you drive it, you guide it, and you extract value from it.

Data itself is now an asset, one for companies to secure and hoard, much as the Federal Reserve Bank of New York stockpiles gold (though, for the sake of accuracy, the Federal Reserve only stores gold for countries other than the United States). Companies invest in systems to organize and extract value from their data, just as they would a piece of land or reserve of raw materials. Data are bought and sold. Some companies, including IHS, Experian, and DataLogix, build entire businesses to collect, refine, and sell data. Companies in the business of data are diverse. IHS provides information about specific industries such as energy, whereas Experian and DataLogix provide personal information about individual consumers. These companies would not exist if the exchange of data were not lucrative. They would enjoy no profit motive if they could not use data to make more money than the cost of its generation, storage, and analysis.

One of your authors was a devotee of Borders, the book retailer (and still keeps his loyalty program card on display as a memorial to the company). After the liquidation of Borders, he received an e-mail message from William Lynch, the chief executive of Barnes & Noble (another favorite store), stating in part, "As part of Borders ceasing operations, we acquired some of its assets including Borders brand trademarks and their customer list. The subject matter of your DVD and other video purchases will be part of the transferred information ... If you would like to opt-out, we will ensure all your data we receive from Borders is disposed of in a secure and confidential manner." The data that Borders accumulated were a real asset sold off after its bankruptcy.

Data analysis has even entered popular culture in the form of Michael Lewis's book Moneyball, as well as the eponymous movie. The story centers on Billy Beane, who used data to supplant intuition and turned the Oakland Athletics into a winning team. The relationship between data and decision making is, in fact, the key theme of this book.

    GROWINGINTERESTINDECISIONMAKING

    Anybusinessbookofvaluemustanswerasimple,two-wordquestion:So

  • what?So,whydoesbigdatamatter?Theansweristheconfluenceoftwo

    factors.Thefirstisthatawarenessofthelimitationsofhumanintuition,alsoknownasgutfeel,hasbecomeobvious.Thesecondisthatbigdata

    technologieshavereachedthelevelofmaturitynecessarytomakestun-

    ningcomputationalfeatsaffordable.Moreover,thiscomputationalability

    isnowvisibletothegeneralpublic.Facebook,Amazon.com,andsearch

    enginessuchasBing,Yahoo!,andGoogleareprimeexamples.Eventradi-

    tionalbrick-and-mortarstoresmatchpowerfulwebsiteswithanalytics

    thatwouldhavebeenunimaginable20yearsago.Barnes&Noble,Wal-

    Mart,andHomeDepotareexcellentexamples.

    Introduction5

    Manyprominentactorsinpsychology,marketing,andbehavioral

    financehavepointedouttheflawsinhumandecisionmaking.Psychologist

    DanielKahnemanwontheNobelMemorialPrizeinEconomicSciences

    in2002forhisworkonthesystematicflawsinthewaypeopleweighrisk

    andrewardinarrivingatdecisions.BuildingonKahnemanswork,avari-

    etyofscholars,includingDanAriely,ZivCarmon,andCassSunstein,

    demonstratedhowhiddeninfluencersandmentalheuristicsinfluence

    decisionmaking.Oneoftheauthorshadthepleasureofstudyingunder

    Mr.CarmonatINSEADand,duringaclassexercise,pointedouthow

    muchhepreferredoneketchupsampletoanotheronlytodiscoverthey

    camefromthesamebottleandweremerelypresentedasbeingdifferent.

    Thedifferencebetweenthetwosampleswasnonexistent,butthediffer-

    encewithtasteperceptionswasquitereal.

    Infact,Mr.Ariely,Mr.Carmon,andtheircoauthorswonthefollowing

    2008IgNobelaward:

    MEDICINEPRIZE.DanArielyofDukeUniversity(USA),RebeccaL.

    WaberofMIT(USA),BabaShivofStanfordUniversity(USA),andZiv

    CarmonofINSEAD(Singapore)fordemonstratingthathigh-pricedfake

    medicineismoreeffectivethanlow-pricedfakemedicine.1

    Thewebsitestates,TheIgNobelPrizeshonorachievementsthatfirst

    makepeoplelaugh,andthenmakesthemthink.Theprizesareintended

    tocelebratetheunusual,honortheimaginativeandspurpeoplesinter-

  • estinscience,medicine,andtechnology.2Itmaybeeasytolaughabout

    thisresearch,butjustconsiderhowpowerfulitis.Yourperceptionofthemedicaleffectivenessofwhatisinfactauselessplaceboisinfluencedbyhowmuchyoubelieveitcosts.

    TheAtlanticrananarticleinitsDecember2013issuedescribinghowbigdatachangeshiringdecisions.Althoughthisphenomenonisnotaltogetherunderstood,wehavepilotstudies,andyes,computerscanoften

    doabetterjobthanpeople.3Hiringmanagersbasetheirwillingnessto

    hireonarangeofirrelevantfactorsininterviews.Considersomeofthesefactors:firmnessofhandshake,physicalappearance,projectionofconfidence,name,andsimilaritiesofhobbieswiththepersonconductingthe

    interviewallinfluenceemploymentdecisions.Often,theseextraneous

    factorshaveminimalrelevancetotheabilityofsomeonetoexecutetheir

    job.Itislittlewonderthatcomputersanddatascientistshavebeenabletoimprovecompanieshiringpracticesbybringinginbigdata.

    In 1960, a cognitive scientist by the name of Peter Cathcart Wason published a study in which participants were asked to hypothesize the pattern underlying a series of numbers: 2, 4, and 6. They then needed to test it by asking if another series of numbers fit the pattern. What is your hypothesis and how would you test it? What Wason uncovered is a tendency to seek confirmatory information. Participants tended to propose series that already fit the pattern of their assumptions, such as 8, 10, and 12. This is not a helpful approach to the problem, though. A more productive approach would be 12, 10, and 8 (descending order, separated by two), or 2, 3, and 4 (ascending order, separated by one). Irregular series such as 3, , and 4, or 0, 1, and 4 would also be useful, as would anything that directly violates the pattern of the original set of numbers provided. The pattern sought in the study was any series of numbers in ascending order. Participants did a poor job of eliminating potential hypotheses: rather than seeking out options that directly contradicted their original hunches, they tended to confirm what they already believed. The title of this seminal study, "On the Failure to Eliminate Hypotheses in a Conceptual Task," highlights this intellectual bias.⁴
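
    The logic of the 2, 4, 6 task can be sketched in a few lines of Python. This is only an illustration: the function names and the participant's guess (each number is two more than the last) are assumptions made here, not details taken from Wason's paper. It shows which test series could ever expose a wrong guess.

```python
from typing import Tuple

Triple = Tuple[float, float, float]


def true_rule(series: Triple) -> bool:
    """The rule sought in the study: any series in ascending order."""
    a, b, c = series
    return a < b < c


def participant_hypothesis(series: Triple) -> bool:
    """A hypothetical participant's guess: each number is two more than the last."""
    a, b, c = series
    return b - a == 2 and c - b == 2


def refutes_hypothesis(series: Triple) -> bool:
    """A test exposes the guess as wrong only when guess and true rule disagree."""
    return participant_hypothesis(series) != true_rule(series)


# Series discussed in the text: one confirmatory test, three that break the guess.
for series in [(8, 10, 12), (12, 10, 8), (2, 3, 4), (0, 1, 4)]:
    print(series,
          "| guess says fit:", participant_hypothesis(series),
          "| rule says fit:", true_rule(series),
          "| refutes guess:", refutes_hypothesis(series))
```

    Any series that fits the guess is necessarily ascending, so a confirmatory test such as 8, 10, and 12 can never refute it. Only series that break the guess, such as 2, 3, and 4, give the true rule a chance to disagree.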

    Wason's findings were pioneering work in this field, and in many ways Daniel Kahneman's work is a fruitful and ingenious offshoot thereof. As this is an introduction, we will not continue listing examples of cognitive biases, but they have been demonstrated many times in how we evaluate others, how we judge our own satisfaction, and how we estimate numbers.

    Big data not only addresses the arcane relationships between technical variables, but it also has a pragmatic role in saving costs, controlling risks, and preventing headaches for managers in a variety of roles. It does this in part by finding patterns where they exist rather than where our fallible reckoning finds the mere mirages of patterns.

    WHAT THIS BOOK ADDRESSES

    This book addresses a serious gap in the big data literature. During our research, we found popular books and articles that describe what big data is for a general audience. We also found technical books and articles for programmers, administrators, and other specialized roles. There is little discussion, however, that helps the intelligent and inquisitive but nontechnical reader understand the nuances of big data.

    Our goal is to enable you, the reader, to discuss big data at a profound level with your information technology (IT) department, the salespeople with whom you will interact in implementing a big data system, and the analysts who will develop and report results drawn from the myriad of data points in your organization. We want you to be able to ask intelligent and probing questions and to be able to make analysts defend their positions before you invest in projects by acting on their conclusions. After reading this book, you should be able to read the footnotes of a position paper and judge the soundness of the methods used. When your IT department discusses a new project, you should be able to guide the discussions.

    The discussion in this book ranges well beyond big data itself. The authors include examples from science, medicine, Six Sigma, statistics, and probability with good reason. All of these disciplines are wrestling with similar issues. Big data involves the processing of a large number of variables to pull out nuggets of wisdom. This is using the conclusion to guide the formation of a hypothesis rather than testing the hypothesis to arrive at a conclusion. Some may consider this approach sloppy when applied to any particular scientific study, but the sheer number of studies, combined with a bias toward publishing only positive results, means that a statistically similar phenomenon is occurring in scientific journals. As science is a self-critical discipline, the lessons gleaned from its internal struggle to ensure meaningful results are applicable to your organizations, which need to pull accurate results from big data systems. The current discussion in the popular and business press on big data ignores nonbusiness fields and does so to the detriment of organizations trying to make effective use of big data tools.

    The discussion in this book will provide you with an understanding of these conversations happening outside the world of big data. Louis Pasteur said, "In the fields of observation, chance favors only the prepared mind."⁵ Some of the most profound conversations on topics of direct relevance to big data practitioners are happening outside of big data. Understanding these conversations will be of direct benefit to you as a manager.

    THE CONVERSATION ABOUT BIG DATA

    We mentioned the discussions around big data and how unhelpful they are. Some of the discussion is optimistic; some is pessimistic. We will start on the optimistic side.

    Perhaps the most famous story about the capabilities of predictive analytics was a 2012 article in The New York Times Magazine about Target.⁶ Target sells nearly any category of product someone could need, but is not always first in customers' minds for all of those categories. Target sells clothing, groceries, toys, and myriad other items. However, someone may purchase clothing from Target, but go to Kroger for groceries and Toys "R" Us for toys. Any well-managed store will want to increase sales to its customers, and Target is no exception. It wants you to think of Target first for most categories of items.

    When life changes, habits change. Target realized that people's purchasing habits change as families grow with the birth of children and are therefore malleable. Target wanted to discover which customers were pregnant around the time of the second trimester so as to initiate marketing to parents-to-be before their babies were born.

    A birth is public record and therefore results in a blizzard of advertising. From a marketing aspect, a company is wise to beat that blizzard. Target saw a way to do so by using the data it accumulated.

    As a Target statistician told the author of the article, "If you use a credit card or a coupon, or fill out a survey, or mail in a refund, or call the customer help line, or open an e-mail we've sent you or visit our Web site, we'll record it and link it to your Guest ID." The guest ID is the unique identifier used by Target. The statistician continued, "We want to know everything we can." The guest ID is not only linked to what you do within Target's walls, but also to a large volume of demographic and economic information about you.⁶

    Target looked at how women's purchasing habits changed around the time they opened a baby registry, then generalized these purchasing habits back to women who may not have opened a baby registry. Purchases of unscented lotion, large quantities of cotton balls, and certain mineral supplements correlated well with second-trimester pregnancy. By matching this knowledge to promotions that had a high likelihood of effectiveness (again gleaned from Target's customer-specific data), the company could try to change these women's shopping habits at a time when their lives were in flux, during pregnancy.⁶ The article propelled Target's data analytics prowess to fame and also generated uneasiness.

    The article also did not communicate how tricky and resource-intensive such an analysis is. This may be an unfair criticism, as the article was directed at a general readership rather than at businesspeople who are considering the use of big data. However, a business reader of such stories should understand how nuanced, messy, convoluted, and maddening big data can be. The data used by a big data system to reach its conclusions often come with built-in biases and flaws. The statistics used do not provide a precise yes or no answer, but rather describe a level of confidence on a spectrum of likelihood. This does not make for exciting press, and it is therefore all but invisible in big data articles, except those in specialist sources.

    There are many articles about big data and health, big data and marketing, big data and hiring, and so forth. These rarely cover the risks and rewards of data. The reality is that health data can be messy and inaccurate. Moreover, it is protected by a strict legal regime, the Health Insurance Portability and Accountability Act of 1996 (HIPAA), which restricts its flow. Marketing data are likewise difficult to link up. Data analytics in general, and now big data, have improved marketing efforts but are not a magic bullet. Some stores seldom track what their customers purchase, and those that do so do not trust each other with their databases. In any big data system, the nature of who can see what data needs to be considered, as well as how the data will be secured. It is very likely that your firm will own data only some employees or contractors can see. Making it easier to access this data is not always a good idea.

    Later in this introduction, we will discuss data analytics applied to hiring and how poorly this can be reported. As a news consumer, your skepticism should kick in whenever you read about some amazing discovery uncovered by big data methods about how two dissimilar attributes are in fact linked. The reality is at best much more nuanced and at worst is a false relationship. These false relationships are pretty much inevitable, and we dedicate many pages to showing how data and statistics can lead the unwary user astray. Once you embrace this condition, you will probably never read news stories about big data without automatically critiquing them.

    On the other side of the argument, perhaps the most astute critic of big data is Nassim Nicholas Taleb. In an opinion piece he wrote for the website of Wired magazine (drawn from his book, Antifragile), he states, "Modernity provides too many variables, but too little data per variable. So the spurious relationships grow much, much faster than real information ... In other words: Big data may mean more information, but it also means more false information."⁷

    Mr. Taleb may be pessimistic, but he raises valuable points. As a former trader with a formidable quantitative background, Taleb has made a name for himself with his astute critiques of faulty decision making. Taleb is a rarity, a public intellectual who is also an intellectual heavyweight. He is not partisan, developing devastating takedowns of sloppy argumentation with equal-opportunity fervor. Taleb argues:

    The incentive to draw a conclusion may not align with what the data really show. With this, Taleb discusses the existence of medical studies that cannot be replicated. There are funding incentives to find significant relationships in studies and disincentives to publish studies that show no significant findings. The hallmark of a truly significant finding is that others can replicate the results in their own studies.

    There is not an absence of meaningful information in large data sets; it is simply that the information within is hidden within a larger quantity of noise. Noise is generally considered to be an unwelcome randomness that obscures a signal. As Taleb states, "I am not saying here that there is no information in big data. There is plenty of information. The problem, the central issue, is that the needle comes in an increasingly larger haystack."⁷

    One difficulty in drawing conclusions from big data is that although it is good for debunking false conclusions, it is not as strong in drawing valid conclusions. Stated differently, "If such studies cannot be used to confirm, they can be effectively used to debunk, to tell us what's wrong with a theory, not whether a theory is right."⁷ If we are using the scientific method, it may take only one valid counterexample to topple a vulnerable theory.

    This is an important article. In fact, the book that you now hold in your hands was conceived as a response. Taleb points out real flaws in how we use big data, but your authors argue we need not use big data this way. A manager who understands the promise and limits of big data can obtain improved results just by knowing the limits of data and statistics and then ensuring that any analysis includes measures to separate wheat from chaff.

    Paradoxically, the flaws of big data originate from the unique strengths of big data systems. The first among these strengths is the ability to pull together large numbers of diverse variables and seek out relationships between them. This enables an organization to find relationships within its data that would have otherwise remained undiscovered. However, more variables and more tests must mean an increased chance for error. This book is intended to guide the user in understanding this.
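
    The growth in the chance of error can be made concrete with a short calculation. The sketch below rests on simplifying assumptions that are not part of the text: every pair of variables is tested once, the tests are independent, no real relationship exists, and each test uses a 5% significance threshold. The last line applies the Bonferroni correction, named here only as one standard preventative measure.

```python
def pairwise_comparisons(num_variables: int) -> int:
    """Number of distinct pairs that can be formed from the variables."""
    return num_variables * (num_variables - 1) // 2


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests
    when no real relationship exists in the data."""
    return 1.0 - (1.0 - alpha) ** num_tests


for num_variables in (5, 10, 50):
    tests = pairwise_comparisons(num_variables)
    print(num_variables, "variables ->", tests, "tests ->",
          round(familywise_error_rate(tests), 3), "chance of a spurious finding")

# Bonferroni correction: divide the threshold by the number of tests.
tests = pairwise_comparisons(10)
print("corrected:", round(familywise_error_rate(tests, 0.05 / tests), 3))
```

    Under these assumptions the risk of at least one spurious finding rises quickly as variables are added, while the corrected threshold holds the overall risk below 5%.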

    A more widely recognized concept made famous by Taleb, not directed at big data but applicable just the same, is his concept of the "black swan." The term, which he uses to describe an unforeseeable event, as opposed to just unforeseen, derives from the idea that if one conceives of the color white as being an intrinsic aspect of a swan, then finding a black swan is an unforeseeable experience that renders that expectation untenable. The Black Swan is therefore a shock. The 1987 stock market crash and the terrorist attacks of September 11 are large-scale Black Swans, but smaller Black Swans happen to us in our personal lives and with our businesses. Taleb is talented at bringing concepts into focus through the skillful use of examples. In this case, his example is of the comfortable turkey raised on a farm. He is fed, gets fat, projects ahead, and feels good about his life until Thanksgiving.⁸

    The field of predictive analytics is related to, and often very much a part of, big data. It has been quite powerful in boosting efficiency and controlling risk, and it is without doubt an indispensable technology for many firms. Even so, there is an uncomfortable truth. With little experience using data to understand a particular phenomenon (or perhaps without collection of the needed data), you will not be able to foresee it. Big data is both art and science, but it is not an all-seeing wellspring of wisdom and knowledge. It will not enable you to eliminate black swan events. It is up to the user of big data systems to understand the risks and limited data that act as a constraint on calculating probabilities for the phenomena being analyzed and to respect chance.

    While Taleb's argument is among the most substantive critiques of big data, the general form of his criticism is familiar. Big data is not altogether dismissed, so the criticism is balanced. The flaw in most criticism of big data is not that it is polemic, or dishonest, or uninformed. It is none of these. It is that it is fatalistic: big data has flaws and is thus overrated. What much of the critical big data literature fails to do is look at this technology as an enabling technology. A skilled user who understands the data itself, the tools analyzing it, and the statistical methods being used can extract tremendous value. The user who blindly expects big data systems to spit out meaningful data runs a very high risk of delivering potential disaster to his or her organization.

    A further example is an article from the KDnuggets website entitled "Viewpoint: Why Your Company Should NOT Use Big Data."⁹ The article describes the difficulty of using data well and argues that the most gains can be obtained by using, with wit, the data one's firm already possesses. It also punctures some balloons involving the misuse of language, such as referring to Nate Silver's brilliant work as big data when in fact it is straightforward analysis. Your authors can attest to the importance of using a firm's data more effectively; we are experienced Six Sigma practitioners. We are accustomed to using data to enhance efficiency and quality and to reduce risk.

    This article is still flawed. A more productive approach would be to look at where your organization is now, where it wants to go, and how big data may help it get there. Your organization may not be ready to implement big data now. It may need to focus on better using its existing data. To prepare for the future, it may need to take a more strategic approach to ensuring that the data it now generates is properly linked, so that a user's shopping history can be tied to the particular user. If your analysis leads you to conclude that big data is not a productive effort for your company, then you should heed that advice. Many firms do not need big data, and to attempt to implement this approach just to keep up with the pack would be wasteful. If your firm does see a realistic need for big data and has the resources and commitment to see it through, then the lack of an existing competence is not a valid reason to avoid developing one.

    TECHNOLOGICAL CHANGE AS A DRIVER OF BIG DATA

    We also discuss technology, including its evolution. The data sets generated every day by online retailers, search engines, investment firms, oil and gas companies, governments, and other organizations are so massive and convoluted that they require special handling. A standard database management system (DBMS) may not be robust enough to manage the sheer enormity of the data. Consider processing a petabyte of data (1000 TB, or about a thousand of the roughly 1 TB hard drives found on a medium- to high-end laptop). Physically storing, processing, and locating all of this data presents significant obstacles. Amazon, Facebook, and other high-profile websites measure their storage in petabytes.

    Some companies, like Google, developed their own tools, such as MapReduce, the Google File System, and BigTable, to manage colossal volumes of information. The open-source Apache Software Foundation oversees Hadoop (a data-intensive software framework), Hive (data warehouse on top of Hadoop), and HBase (nonrelational, distributed database) in order to provide the programming community with access to tools that can manipulate big data. Papers published by Google about its own techniques inspired the open-source distributed processing manager, Hadoop.
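
    For readers curious about what the MapReduce model looks like, the following is a toy, single-process sketch in Python. It is not Hadoop's or Google's API, and the function names are invented for illustration. It only shows the three conceptual phases (map, shuffle, reduce) that such frameworks spread across many machines, using the customary word-count example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on a collection of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


print(word_count(["big data", "Big tools"]))
```

    In a real cluster, each document (or block of a file) would be mapped on a different machine and the grouped keys reduced in parallel; the sketch keeps everything in one process purely for readability.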

    Another area for big data analysis is the use of geographical information systems (GIS). A typical example of GIS software would be the commercial product, ArcGIS, or the open-source product, Quantum GIS. Due to the complexity of map data, even an assessment at the municipality level would constitute a big data situation. When we are looking at the entire planet, we are analyzing big data. GIS is interesting not only because it involves raw numbers, but also because it involves data representation and visualization, which must then relate to a map with a clear interpretation. Google Earth adds the extra complexity of zooming, decluttering, and overlaying, as well as choosing between political maps and satellite images. We now add the extra complexities of color, line, contrast, shape, and so on.

    The need for low latency (a short lag time between a request and the delivery of the results) drives the growth of another area of big data: in-memory database systems such as Oracle Endeca Information Discovery and SAP HANA. Though two very different beasts, both demonstrate the ability to use large-capacity random access memory (RAM) to find relationships within sizable and diverse sets of data.

    THE CENTRAL QUESTION: SO WHAT?

    As has been stated in this introduction, and as we will argue, big data is one of the most powerful tools created by man. It draws together information recorded in different source systems and different formats, then runs analyses at speeds and capacities the human mind cannot match. Big data is a true breakthrough, but being a breakthrough does not confer infallibility. Like any system, big data's limits cluster around particular themes. These themes are not straightforward weaknesses such as those found in poor engineering, but they are inseparable from big data's strengths. By understanding these limits, we can minimize and control them.

    The specific examples of big data used so far are such that when we draw faulty conclusions, we suffer minor consequences. One of the authors had a bafflingly off-base category of movies recommended to him by Netflix, has received membership cards in the mail from the AARP despite being decades away from retirement, and has a spouse who was twice bombarded with baby formula coupons in the mail. The first time was soon before his son was born; the second time was brief and was triggered by erroneous conclusions drawn by some algorithm in an unknown computer.

    False conclusions do not always come with small consequences, though. Big data is moving into fraud detection, crime prevention, medicine, business strategy, forensic data, and numerous other areas of life where erroneous conclusions are more serious than unanticipated junk mail or strange recommendations from online retailers.

    For example, big data is moving into the field of hiring and firing. The previously referenced article from The Atlantic discusses this in detail. Citing myriad findings about how poorly job interviews function in evaluating applicants, the article discusses different means by which data are used to evaluate potential candidates and current employees.

    One company discussed by the article is Evolv. On its website, Evolv, whose slogan is "Big Data for Workforce Optimization," states its value proposition:

    Faster, more accurate selection tools: Evolv's platform enables recruiters to quickly identify the best hires from volumes of candidates based on your unique roles.

    Higher Quality candidates: Better candidate selection results in longer-tenured employees and lower attrition.

    Post-hire engagement tools: Easy to deploy employee engagement surveys keep tabs on what workplace practices are working for you, and which ones are not.¹⁰

    To attain this, Evolv administers questionnaires to online applicants and then matches the results to those obtained from its data set of 347,000 hires that passed through the process. Who are the best-performing candidates? Who is most likely to stick around? The Atlantic states:

    The sheer number of observations that this approach makes possible allows Evolv to say with precision which attributes matter more to the success of retail-sales workers (decisiveness, spatial orientation, persuasiveness) or customer-service personnel at call centers (rapport-building). And the company can continually tweak its questions, or add new variables to its model, to seek out ever-stronger correlates of success in any given job.³

    Big data has in many ways made hiring decisions more fair and effective, but it is still prudent to maintain skepticism. One of the most noted findings by Evolv is that the browser an applicant uses while filling in the job application helps predict the success of the employee on the job. According to Evolv, applicants who use aftermarket browsers such as Firefox and Chrome tend to be more successful than those applicants who use the browser that came with the operating system, such as Internet Explorer.

    The article from The Atlantic adds some precision in describing Evolv's findings linking an applicant's web browser to job performance, stating, "the browser that applicants use to take the online test turns out to matter, especially for technical roles: some browsers are more functional than others, but it takes a measure of savvy and initiative to download them."³ Other articles have made it sound like an applicant's browser was a silver bullet to determining how effective an employee would be:

    One of the most surprising findings is just how easy it can be to tell a good applicant from a bad one with Internet-based job applications. Evolv contends that the simple distinction of which Web browser an applicant is using when he or she sends in a job application can show who's going to be a star employee and who may not be.¹¹

    This finding raises two key points in using big data to draw conclusions. First, is this a meaningful result, a spurious correlation, or the misreading of data? Without digging into the data and the statistics, it is impossible to say. An online article in The Economist states, "This may simply be a coincidence, but Evolv's analysts reckon an applicant's willingness to go to the trouble of installing a new browser shows decisiveness, a valuable trait in a potential employee."¹² The relationship found by Evolv may be real and groundbreaking. It may also just be a statistical artifact of the kind we will be discussing in this book. Even if it is a real and statistically significant finding that stands up to experimental replication, it may be so minor as to be quasi-meaningless. Without knowing about the data sampled, the statistics used, and the strength of the relationship between the variables, the conclusion must be taken with a grain of salt. We will discuss in a later chapter how a statistically significant finding need not be practically significant. We must remember that statistical significance is a mathematical abstraction much like the mean, and it may not have profound human meaning.
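
    The gap between statistical and practical significance can be illustrated with a small calculation. The retention counts below are invented purely for illustration and have nothing to do with Evolv's actual data. The test used, a standard two-proportion z-test written with Python's math module, is an assumption chosen for simplicity, not a method attributed to any firm mentioned here.

```python
import math
from typing import Tuple


def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> Tuple[float, float, float]:
    """Return (difference in proportions, z statistic, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p_a - p_b, z, p_value


# Hypothetical retention counts for two groups of hires (illustration only).
large = two_proportion_z_test(60_700, 100_000, 60_000, 100_000)
small = two_proportion_z_test(607, 1_000, 600, 1_000)
print("large sample: diff=%.3f z=%.2f p=%.4f" % large)
print("small sample: diff=%.3f z=%.2f p=%.4f" % small)
```

    Both comparisons show the same difference of 0.7 percentage points. With 100,000 hires per group, the difference clears conventional significance thresholds; with 1,000 per group, it does not. Whether a gap that small should change a hiring decision is a business judgment that the p-value cannot make.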

    The second issue raised by the finding relates to interpretation. The Economist was very responsible in pointing out the possibility of a coincidence, or what we are referring to in this book as a statistical artifact. The Atlantic deserves credit for pointing out that this finding (assuming it is legitimate) relates more to technical jobs.

    However, remember a quoted passage in one of the articles, "Evolv contends that the simple distinction of which Web browser an applicant is using when he or she sends in a job application can show who's going to be a star employee and who may not be." Such statements should never be used in discussing big data results within your organization. What does "who's going to be a star employee" really mean? It grants too much certainty to a result that will at best be a tendency in the data rather than a set rule. The statement "and who may not be" is likewise meaningless, but in the other direction. It asserts nothing. In real life, many of those who use Firefox or Chrome will be poor hires. Even if there were a real relationship in the data, it would frankly be irresponsible for a hiring manager to place overriding importance on this attribute when there are many other attributes to consider. Language matters.

    The points raised by the web browser example are not academic. The consequences for a competent and diligent job seeker who is just fine with Internet Explorer, or a firm that needs that job seeker, are not difficult to figure out and are certainly not minor. One of your authors keeps both Firefox and Internet Explorer open at the same time, as some pages work better on one or the other.

    Another firm mentioned in The Atlantic is Gild. Gild evaluates programmers by analyzing their online profiles, including code they have written and its level of adoption, the way that they use language on LinkedIn and Twitter, their contributions to forums, and one rather odd criterion: whether they are fans of a particular Japanese manga site. The Gild representative interviewed in the article herself stated that there is no causal relationship between manga fandom and coding ability, just a correlation.

    Firms such as Evolv and Gild, however, work for employers and not applicants. Their analyses should lead to improved performance. It is the rule, and not the exceptions, that drives the adoption of big data in hiring decisions. One success story Evolv points out is the reduction of one firm's 3-month attrition rate by 30% through the application of big data. It is now helping this client monitor the growth of employees within the firm, based not only on the characteristics of the employees themselves but also on the environment in which they operate, such as who their trainers and managers were.

    The case of Evolv is a good illustration of the nature of big data. Proper application of the technology increases efficiency, but a complex set of issues surrounds this application. Many of these issues relate to the potential of incorrect conclusions drawn from the data and the need to mitigate their effect. Yes, judgments can be baseless or unfair. What is the alternative? Think back to our discussion of the faultiness of human judgment.

    When a big data system reveals a correlation, it is incumbent on the operator to explore that correlation in great detail rather than to take it superficially. When a correlation is discovered, it is tempting to create a post hoc explanation of why the variables in question are correlated. We glean a mathematically neat and seemingly coherent nugget. However, a false correlation dressed up nicely is nothing but fool's gold. It can change how the recipients of that nugget respond to reality, but it cannot change the underlying reality. As big data spreads its influence into more areas of our lives, the consequences of misinterpretation grow. This is why scientific investigation into the data is important.

    Big data raises other issues your organization should consider. Maintaining data raises legal issues if it is compromised. Medical data is the most prominent of these, but any data with trade secrets or personal information such as credit card numbers fits in this category. Incorrect usage creates a risk to corporate reputations. Google's aggressive collection of customer data, sometimes intrusively, has tarnished that firm's reputation. Even worse, the data held can harm others. The New Yorker reports the case of Michael Seay, the father of a young lady whose life tragically ended at the age of 17, who received an OfficeMax flier in the mail addressed to "Mike Seay/Daughter Killed in Car Crash/Or Current Business."¹³ This obviously created much pain for Mr. Seay, as it would for any parent.

    Google Maps Street View has likewise been a curse for many, including a man urinating in his own backyard whose moment of imprudence coincided with Google's car driving past his house. That will be on the Internet forever. The Wall Street Journal carried an in-depth article describing databases of scanned license plates in both the public and private sector. The operators of these databases photograph and log license plates, using automatic readers, so a car can be tied to the location where it was photographed. Two private sector companies are listed: Digital Recognition Network, Inc. and MVTrac. A repossession firm mentioned in the article has vehicles that drive hundreds of miles each night logging license plates of parked cars. The majority of cars still driven in the United States are probably logged in these systems, one of which had 700 million scans.¹⁴

    These developments may not impact your business directly, but as we will see in our later discussions of the advantages and disadvantages of big data, other technologies interacting with big data have the power to undermine your trade secrets or create a competitive environment where you can obtain useful analysis only at the expense of turning over your own data. It would be naïve to assume that those who see opportunity in gobbling up your company's information will not do so. In using big data, data ownership will be an issue. The question of who has a right to whose data still needs to be settled through legislation and in the courts. Not only will you need to know how to protect your own firm's data from external parties, you will need to understand how to responsibly and ethically protect the data you hold that belong to others.

    The dangers should not scare users away from big data. Just as much of modern technology carries risk (think of the space program, aviation, and energy exploration), such risk delivers rich rewards when well used. Big data is one of the most valuable innovations of the twenty-first century. When properly used in a spirit of cooperative automation, where the operator guides the use and results, the promise of big data is immense.

    OUR GOALS AS AUTHORS

    An author should undertake the task of writing a book because he or she has something compelling to say. We know of many good books on big data, analytics, and decision making. What we have not seen is a book for the perplexed that partitions the phenomenon of big data into usable chunks.

    In this introduction, we alluded to the discussion in the press about big data. For a businessperson, project manager, or quality professional who is faced with big data, it is difficult to jump into this discussion and understand what is being said and why. The world of business, like history, is regularly burned by business fads that appear, notch up prominent successful case studies, then fade out to leave a trail of less-publicized wreckage in their wake. We want to help you understand the fundamentals and set realistic expectations so that your experience is that of being a successful case study.

    We want you as the reader to understand certain key points:

    Big data is comprehensible. It springs from well-known trends that you experience every day. These include the growth in computing power, data storage, and data creation, as well as new ideas for organizing information.

    You should become aware of key big data packages, which we list and discuss in detail. Each has its characteristics that are easy to remember. Once you understand these, you can ask better questions of external salespeople and your internal IT departments.

    Big data technologies enable the integration of capabilities previously not included in most business analytics. These include GIS and predictive analytics. New kinds of analysis are evolving.

    Data is not an oracle. It reflects the conditions under which it was created. There are biases and errors that creep into data. Even the best data cannot predict developments for which there is no precedent.

    Big data will open legal, logistical, and strategic challenges for your organization, even if you decide that big data is not right for your firm. Not only must a firm be aware of the value and security measures surrounding data that it holds, it must be aware of data that it gives up voluntarily and involuntarily to other parties. There are no black-and-white answers to guide you, as this is a developing field.

    Data analytics in big data still rely on established statistical tools. Some of these may be arcane, but there are common statistical tools that can apply a reasonability check to your results. Understanding analytics enables you to ask better questions of your data analysts and monitor the assumptions underlying the results upon which you take action.

    Your organization may already have the knowledge workers necessary to conduct analysis or even just sanity-check results to ensure that they are accurate and useful. Do you have a Six Sigma unit? Do you have actuaries? Do you have statisticians? If you do, then you have the knowledge base in-house to use your big data solution more effectively.

    Now,ontoourjourneythroughthisremarkabletechnology.

    REFERENCES

    1. Improbable Research. Winners of the Ig Nobel Prize. Improbable Research. http://www.improbable.com/ig/winners/. Accessed April 16, 2014.

    2. Improbable Research. About the Ig Nobel Prizes. Improbable Research. http://www.improbable.com/ig/. Accessed April 16, 2014.

    3. Peck, D. They're watching you at work. The Atlantic. December 2013.

    4. Wason, P. On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 1960, 12(3): 129–140.

    5. Louis Pasteur. Wikiquote. http://en.wikiquote.org/wiki/Louis_Pasteur. Accessed April 18, 2014.

    6. Duhigg, C. How companies learn your secrets. The New York Times. February 16, 2012. http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0&pagewanted=al. Accessed April 19, 2014.

    7. Taleb, N. (guest editorial, credited to Ogi Ogas in the byline) Beware the big errors of big data. Wired. February 8, 2013. http://www.wired.com/2013/02/big-data-means-big-errors-people/. Accessed April 19, 2014.

    8. Taleb, N. Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets. New York: Thompson TEXERE, 2004.

    9. Nevraumont, E. Viewpoint: Why your company should NOT use Big Data. KD Nuggets. January 2014. http://www.kdnuggets.com/2014/01/viewpoint-why-your-company-should-not-use-big-data.html. Accessed April 19, 2014.

    10. Evolv. Our expertise. Evolv company website. http://www.evolv.net/expertise/. Accessed April 19, 2014.

    11. Javers, E. Inside the wacky world of weird data: What's getting crunched. CNBC. February 12, 2014. http://www.cnbc.com/id/101410448. Accessed April 18, 2014.

    12. E.H. How might your choice of browser affect your job prospects? The Economist. April 10, 2013. http://www.economist.com/blogs/economist-explains/2013/04/economist-explains-how-browser-affects-job-prospects#sthash.iNblvZ6J.dpuf. Accessed April 19, 2014.

    13. Merrick, A. A death in the database. The New Yorker. January 23, 2014. http://www.newyorker.com/online/blogs/currency/2014/01/ashley-seay-officemax-car-crash-death-in-the-database.html. Accessed April 17, 2014.

    14. Angwin, J. and Valentino-DeVries, J. New tracking frontier: Your license plates. The Wall Street Journal. September 29, 2012. http://online.wsj.com/news/articles/SB10000872396390443995604578004723603576296. Accessed April 19, 2014.

    2

    The Mother of Invention's Triplets: Moore's Law, the Proliferation of Data, and Data Storage Technology

    Is big data just hype? Is it really something new? If it is different, how is it different? If it brings a change, is it evolutionary or revolutionary change? While we wish we could present you with a clear-cut answer, we cannot. That argument has not been resolved, and it will likely remain a matter of opinion.

    Instead of presenting you with lofty scenarios of what big data may someday be able to do, we will show you how big data arose due to technological developments and the needs arising from more and more data. Seeing the changes that made big data a possibility (almost an inevitability, really) will help you to sort out the knowledge from the hype. By using this bottom-up approach to explaining big data, we hope you will begin to see potential ways to use technology and vendor relationships that you already have to better use data that you already possess but do not use. The odds are that you will not want your first big data project to turn your facilities into something from a science fiction movie. Work is being done to make that a reality, but a lower-risk approach with a faster return is to simply pull together and use the data that you already have. In other words, we do not want to dazzle you. We want to help you make decisions now.

    Big data is new in that its capabilities for processing data are unprecedented. By unprecedented, we do not refer merely to the quantity of data but also the variety of data. Big data technologies for crunching data in search of relationships between variables, both obvious and obscure, have developed alongside explosive growth in data storage capabilities. As data processing and storage capabilities demonstrate seemingly boundless growth, two other developments provide the data that fill that storage and provide the grist for the mills of modern processors.

    Life used to be analog. We inhabited a world of records, letters, and copper phone lines. That world is disappearing. Sensors are becoming ubiquitous, from car engine computers to home burglar alarms to radio-frequency ID (RFID) tags. Computers metamorphosed into intermediaries for increasing quantities of transactions and interactions. LinkedIn and Facebook enable users to create public or semipublic personas; the Internet went from being an obscure medium for techies (informal term for profoundly involved, technologically aware users and creators) to being a global marketplace, and texting is now a quick way for friends to share tidbits of information. The background noise of modern life is data. Our data accumulate, they live, they are recorded and stored, and they are valuable in their own right.

    In a sense, big data technologies are mature because we can comprehend them in terms of the technologies from which they developed, most of which have established histories. Computers became a possibility once Charles Babbage proposed the difference engine, an unrealized but logically fully developed mechanical computer, in 1822. Electronic computers came into their own in the twentieth century, with the code-breaking bombes (a bombe was a quasi-computer devoted to decryption solely) at Bletchley Park in Britain during World War II being concrete examples of how computers can shake the foundations of modern warfare. For all the awesome power of tanks, planes, and bombs, the great minds that cracked Axis codes, most famously the towering and tragic figure of Alan Turing, did something just as powerful. Deciphering those codes allowed them to penetrate the nervous system of the enemy's intelligence apparatus and know what it would do quickly enough to anticipate enemy actions and neutralize them.

    Our current digital world can trace its lineage back to this pioneering technology. To explore this development, let us start with the growth of processing power.

    MOORE'S LAW

    In 1965, Gordon Moore, who would go on 3 years later to cofound the company that would become Intel, published a paper titled "Cramming More Components onto Integrated Circuits" in the journal Electronics. Though a mere four pages in length, the paper laid out the case for the now famous Moore's law. It is an intriguing read: after discussing advances in the manufacture of integrated circuits, Moore covers their advantages in terms of cost and reliability, the latter demonstrated by the inclusion of integrated circuits in NASA's Apollo space missions (it was Apollo 11 that landed Neil Armstrong and Buzz Aldrin on the moon). In his paper, Moore correctly foresees the use of this technology in successively increasing numbers of devices. Moore's law and the spread of the integrated circuit are a story of accelerated technological augmentation built on top of what had already been a whirlwind pace in the development of computer technology.¹

    This history of technological development leading up to Gordon Moore's paper is a compelling story on its own merits. The integrated circuit is a cluster of transistors manufactured together as a single unit. In fact, they are not assembled in any meaningful sense. The process of photolithography, or the use of light to print over a stencil of the circuit laid over a silicon wafer, means that transistors are etched together, emerging in a useful design as a single unit. The process is not entirely unlike using a stencil to paint writing on a wall, although the technology is clearly more demanding and precise. It means that there are no joins (e.g., solder joints) that can crack, and there are no moving parts to wear out. Moore was writing only 7 years after Jack Kilby of Texas Instruments had built the first working model of the integrated circuit while most other employees of his firm were on vacation!² Texas Instruments is still one of the leading manufacturers of integrated circuits, along with Intel.

    Before the introduction of the integrated circuit, the transistor was the standard for data processing. The first patent for a working transistor (undemonstrated designs had received earlier patents) was patent number 2,524,035, awarded to John Bardeen and Walter Brattain of Bell Labs in 1950,³ with patent number 2,569,347 being awarded to their colleague, William Shockley, the following year.⁴ The transistor offered many improvements over its predecessor, the vacuum tube. It was easier to manufacture, more energy efficient, and more reliable. It did not generate as much heat and thus enjoyed a longer life. Still, it was a discrete device. Unlike the integrated circuit, in which millions of transistors can be a single array, transistors needed to be assembled before Jack Kilby's seemingly innocuous but world-altering insight. Individual transistor assembly generally meant the use of the older through-hole technology, where the leads to the discrete transistor went through the printed circuit board and were often wave-soldered.


    Moore's argument in his paper, this artifact from the dawn of the integrated circuit, is nuanced and carefully argued. It is easy to forget this, decades later, when few commentators actually read it and the popular press reduces the concept to pithy sound bites about the increase in processing power versus time. What Moore composed was neither a lucky guess nor a bald assertion. It was an exquisite argument incorporating technology, economics, and perhaps most importantly, manufacturing ability. He famously argued that as more components are added to an integrated circuit of a given size, the cost per component decreases. In his words:

    For simple circuits, the cost per component is nearly inversely proportional to the number of components, the result of the equivalent piece of semiconductor in the equivalent package containing more components. But as components are added, decreased yields more than compensate for the increased complexity, tending to raise the cost per component.¹

    As of 1965, the number of components that could be included on an integrated circuit at the lowest price per component was 50. Moore foresaw the optimal number of components per circuit, from a cost per component point of view, being 1000 by 1970, with a cost per component that was 10% of the 1965 cost. By 1975, he saw the optimal number of components reaching 65,000. In other words, he perceived the cost of production declining as the technology secured itself in our culture.¹

    Before moving on to a technical discussion of the circuits, Moore stated, "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... there is no reason to believe [this rate of change] will not remain nearly constant for at least 10 years."¹ In fact, nearly 50 years after its formulation, Moore's law abides. Figure 2.1 is a powerful illustration.
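Moore's projection can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `projected_components` and the choice of 50 components in 1965 as the starting point are assumptions made here, and the steady doubling is a simplification of Moore's argument. Under those assumptions, a one-year doubling period lands in the same range as Moore's own estimates of 1000 components by 1970 and 65,000 by 1975, while a two-year doubling period grows far more slowly.

```python
def projected_components(base_count, base_year, target_year, doubling_years=1.0):
    """Project a component count assuming it doubles every `doubling_years` years.

    This is a simplified model used for illustration only; it is not
    a formula taken from Moore's paper.
    """
    elapsed_years = target_year - base_year
    return base_count * 2 ** (elapsed_years / doubling_years)


# Assumed starting point: 50 components per circuit in 1965.
for year in (1970, 1975):
    yearly = projected_components(50, 1965, year, doubling_years=1.0)
    two_yearly = projected_components(50, 1965, year, doubling_years=2.0)
    print(f"{year}: doubling yearly -> {yearly:,.0f}; "
          f"doubling every two years -> {two_yearly:,.0f}")
```

The projection for 1975 with yearly doubling is 51,200 components, the same order of magnitude as the 65,000 that Moore foresaw.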

    Respected physicist Michio Kaku predicts the end of Moore's law, pointing out that the photolithography process used to manufacture integrated circuits relies on ultraviolet light with a wavelength that can be as small as 10 nm, or approximately 30 atoms across. Current manufacturing methods cannot be used to build transistors smaller than this. There is a more fundamental barrier lurking, however. Dr. Kaku lays out his argument thus:

    Transistors will be so small that quantum theory or atomic physics takes over and electrons leak out of the wires. For example, the thinnest layer inside your computer will be about five atoms across. At that point, according to the laws of physics, the quantum theory takes over. The Heisenberg uncertainty principle states that you cannot know both the position and velocity of any particle. This may sound counterintuitive, but at the atomic level you simply cannot know where the electron is, so it can never be confined precisely in an ultrathin wire or layer and it necessarily leaks out, causing the circuit to short-circuit.⁵

    FIGURE 2.1 Moore's law: transistor count (logarithmic scale) by year, 1960 to 2020, doubling every two years.

    As pessimistic as Dr. Kaku's argument sounds, there is cause for optimism regarding sustained improvements in future computing power. Notice that Dr. Kaku's argument is pointing out that the constraint on what can be accomplished is quantum theory, which is an elegant argument for the power of human ingenuity. Individual transistors within an integrated circuit are now so small that it is the physics of the individual atoms that make up the transistor that has become the constraining factor. The same ingenuity that brought us to this point will inevitably turn toward innovating in other forms: new approaches where the physics involved has not yet become a constraint.

    The end of Moore's law, in other words, simply means the closing of one door to ever more powerful computers. It does not necessarily spell the end of other methods. In fact, one method is already well established, that being parallel computing.

    PARALLEL COMPUTING, BETWEEN AND WITHIN MACHINES

    The number of circuits running in a coordinated manner can be increased. This can be within a machine using multiple processors, multiple cores within a single processor (known as a multicore processor), or a combination of these two approaches. These processors and cores simply divide the task of processing for the sake of speed and overall capacity, much as two people can make a cake faster if one prepares the frosting while the other prepares the cake. Another way to conduct parallel computing is between devices or components within a device, such as occurs with a mainframe computer. While parallel computing does nothing to promote the further miniaturization of individual components, an intelligently designed architecture will allow the continued miniaturization of the devices. These devices can shrink because the bulk of processing moves to one location while the output goes to another location. This sounds bizarre and confusing in the abstract, but you are familiar with it in the concrete.

    When you run a web search on your cell phone, your phone is not querying its own indexes of web pages, and it is not running the algorithms that underpin the search. The phone and associated software conduct basic processing of your search, or query, and then pass these data to a server cluster located elsewhere. They then receive the output of that cluster's data processing and translate it back into a convenient layout that can be represented on your display. In this way, your phone can know how many copies of this book Amazon has in stock. This processing capability is how that small and unpretentious phone knows what song is playing in your favorite bar, the performance of your stock portfolio, how long your flight is delayed, and the driving directions to the Afghan restaurant in the northern part of Dallas, Texas, about which you have heard such enthusiastic reviews (most likely on the same phone!). One of the authors of this book has access to the following on his smartphone:

    • Multiple e-mail services
    • Up-to-the-minute readings from Doppler radar belonging to the National Oceanic and Atmospheric Administration
    • A photographic representation of every outdoor spot on Earth
    • Maps of all of North America and other places around the world
    • Updates of the activities of most of his friends via social media
    • Multiple ways of accessing music from online sources, both on a subscription and an ownership model
    • Near instantaneous sharing of photos taken between his wife and him
    • Multiple video streaming services
    • The ability to identify a song by name and artist by holding the phone up to a speaker
    • Images from traffic cameras lining the roads and intersections near where he lives
    • The ability to purchase and immediately access books and music

    All of these abilities are dependent on computing power located in servers of unknown location, accessed by his phone using an Internet connection. All of these services require huge amounts of computing power, relying on parallel computing. Parallel computing is ubiquitous both within computers and phones, and in the remote services that these devices access. It is this remote access that bypasses the limitations on what individual microprocessors inside individual computers and phones can accomplish.

    When Fred, a Facebook user, visits his friend Anna's page, the heavy lifting (the processing necessary to deliver that page to Fred's computer) occurs in a data center, which is where the system stores Anna's profile, and where the computing power exists to select only the correct information, subsequently returning it to Fred's computer. Likewise, when Anna sees Fred's new patio furniture in a Facebook post and decides to look for something similar on Amazon, the processing for the heavy lifting that delivers the results of her Amazon search back to her, and runs the payment transactions, takes place in a data center. A similar process is at play for web searches. None of this indexing of web pages or the search for specific terms buried in all of those indexed pages occurs on Fred's or Anna's computers. This is the heavy-duty work our devices outsource to data centers (Figure 2.2).
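The division of labor between a device and a data center can be sketched in a few lines. Everything in this example is hypothetical: the class names `DataCenter` and `PhoneClient`, the sample pages, and the in-process method call that stands in for a network request are assumptions chosen for illustration, not a description of how any real search service is built. The point is only that the index and the matching logic live on one side, while the device does light preparation of the query and formatting of the answer.

```python
class DataCenter:
    """Hypothetical stand-in for a remote server cluster.

    It owns the pages and the index; the client never sees either.
    """

    def __init__(self, pages):
        self.index = {}
        # Build an inverted index: word -> set of page ids containing it.
        for page_id, text in pages.items():
            for word in text.lower().split():
                self.index.setdefault(word, set()).add(page_id)

    def search(self, terms):
        """Return the ids of pages that contain every term (the heavy lifting)."""
        if not terms:
            return []
        matches = set.intersection(*(self.index.get(t, set()) for t in terms))
        return sorted(matches)


class PhoneClient:
    """Hypothetical device: prepares the query and lays out the results."""

    def __init__(self, data_center):
        self.data_center = data_center

    def search(self, raw_query):
        terms = raw_query.lower().split()           # basic local processing
        page_ids = self.data_center.search(terms)   # stands in for a network call
        return "\n".join(f"{n}. {pid}" for n, pid in enumerate(page_ids, start=1))


pages = {
    "patio-furniture": "teak patio furniture sale",
    "patio-heaters": "patio heaters sale",
    "afghan-restaurant": "afghan restaurant dallas reviews",
}
phone = PhoneClient(DataCenter(pages))
print(phone.search("Patio SALE"))
```

In a real deployment the call to `search` would travel over the Internet and the index would be spread across many servers; the sketch keeps both in one process so that it stays runnable.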

    A single data center may very well be responsible for this processing for users all around the world at any given time, or it may be one of several data centers that each cater to a region. For this reason, data centers are large, warehouse-style buildings located near major sources of electrical power. They are often (but by no means exclusively) found in locations where natural cooling helps release the heat generated by all of the servers within. This is one reason so many data centers are located in the Pacific Northwest of the United States.

    FIGURE 2.2 Data center intermediary to users (Fred and Anna each connect to the data center).

    Data centers are not limited to Internet firms, however. To handle the deluge of data, the shipping giant UPS has two data centers, a 470,600 ft² facility in Mahwah, New Jersey, and a 172,000 ft² facility near Atlanta, Georgia, that hosts what UPS states is the largest IBM DB2 relational database in the world. A football field (American football) is 57,600 ft², including end zones. These data centers could, between them, fully contain more than 11 such football fields. Between the two data centers, there is also enough air conditioning capacity to cool 3500 homes, along with 60 miles of underground conduit, 7000 backup batteries to maintain power until the generators can kick in during an outage, and 70,000 gallons of fuel to run their generators.⁶ The UPS delivery person comes to your door and hands you a small computer with a touch screen upon which you sign using a stylus; these data centers are where those data go, with delivery date and time, along with your signature. In fact, if you order from Amazon you may even sign up to receive a text message when they deliver your package, yet another example of how a data point makes it back to your phone.

    Google is famously opaque in providing information about its data centers, although it is opening up with photos of their interiors (it still remains close-mouthed on the size and statistics for its facilities). The Data Center Knowledge website lists what it believes to be 20 Google data centers in the United States and 17 overseas data centers.⁷ Google lists only six US data centers and seven overseas data centers on its website. Its page lists data centers that did not make it to the Data Center Knowledge website, such as those in Finland (first phase completed in 2011, with a second phase estimated completion date of 2014), Singapore, and Chile.⁸ What we see revealed, regardless of whose estimates we use, is that Google has a truly immense and global data center footprint. It must. It is the dominant Internet search firm and has other operations, among which are a web-based e-mail offering (Gmail), an online media store (Google Play), its own social media site (Google+), a video hosting service (YouTube), global satellite images (Google Earth), comprehensive mapping abilities (Google Maps), and the firm's bread and butter (Google AdSense). These services have worldwide reach and involve a tremendous amount of processing that the end user never sees.

    Facebook is more transparent about its data centers, and they are truly gigantic. We would expect large data centers with a firm of 1.23 billion monthly active users as of December 31, 2013.⁹ As this book is being written, the company currently operates a 333,400 ft² data center in Prineville, Oregon (and is building an identical facility next to it),¹⁰ and a data center of approximately 300,000 ft² near Forest City, North Carolina.¹¹ The firm plans to construct a behemoth 1.4 million ft² facility, estimated to cost $1.5 billion, near Des Moines, Iowa.¹² There is also a 290,000 ft² facility in Sweden that is being finished.¹³ The total square footage of all of Facebook's data centers, both constructed and planned, will be sufficient to fully contain over 46 football fields.
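The football-field comparisons in this section are simple division, and they can be reproduced as follows. The floor areas are the figures quoted in the text; the variable and function names are chosen here for illustration, and the second Prineville building is assumed to match the first exactly because the text calls it identical.

```python
# An American football field with end zones: 360 ft long by 160 ft wide.
FOOTBALL_FIELD_FT2 = 360 * 160  # 57,600 square feet

# Floor areas in square feet, as quoted in the text.
ups_ft2 = [470_600, 172_000]            # Mahwah, NJ; near Atlanta, GA
facebook_ft2 = [
    333_400,    # Prineville, OR (operating)
    333_400,    # Prineville, OR (identical facility under construction; assumed equal)
    300_000,    # near Forest City, NC (approximate)
    1_400_000,  # near Des Moines, IA (planned)
    290_000,    # Sweden
]


def football_fields(areas_ft2):
    """Return how many football fields fit into the combined floor area."""
    return sum(areas_ft2) / FOOTBALL_FIELD_FT2


print(f"UPS: {sum(ups_ft2):,} ft2 = {football_fields(ups_ft2):.1f} fields")
print(f"Facebook: {sum(facebook_ft2):,} ft2 = {football_fields(facebook_ft2):.1f} fields")
```

Both results agree with the text: a little more than 11 fields for UPS and a little more than 46 for Facebook.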

    The spread of data centers has also given rise to one of the fast-rising buzzwords of the early twenty-first century: the cloud. The cloud, cloud storage, and cloud computing all relate to movement of data storage and manipulation from one's own device (be it desktop, laptop, or mobile device) to a data center somewhere. Businesspeople are wise to be wary of buzzwords, but the cloud is a buzzword with substance behind it.

    Processing is also establishing itself in the cloud as part of the software-as-a-service (SaaS) model. Salesforce.com provides serious, market-respected customer relationship management (CRM) to firms who access it through a website. Accounting packages, such as NetSuite and QuickBooks, have moved into the cloud. Though far from being a market changer, Google Docs has established a niche as a method for sharing and collaborating on documents in the cloud. These documents include spreadsheets, word processing, and presentations in a format similar to those of Microsoft Office. Microsoft is also offering cloud-based file sharing and collaboration through its SkyDrive service.

    Moving from the public cloud (commercial applications that are used on a subscriber basis) to the private cloud (hosted solutions that are unique to a single customer), more and more companies are developing custom solutions that are hosted remotely in data centers. These solutions are created specifically for the company that initiated the project and are usually not shared with any other company.

    We expect cloud computing to proliferate further. First of all, it provides companies a way to add capabilities while transferring the expenditures to operating expenses (OPEX) instead of capital expenses (CAPEX). Second, data centers generally offer a degree of protection and redundancy to data that is not possible when they are stored on a hard drive in the office. Third, data security in the cloud can be superb when it is in the right hands. With proper security, even the employees of the data center are physically unable to access any of the clients' data. Private cloud solutions with no presence on the Internet are, as we discussed, only accessible to the intended client using a secure connection and can be very safe solutions that are, for all intents and purposes, part of the internal information technology (IT) solution. Finally, the cost of administering data on the cloud can be low since the staff administering the hardware are shared among customers of the hosting facility. If you need only a few minutes a week of work on your system, plus an occasional hardware upgrade, your firm does not need to pay full-time staff to handle that. You pay a fee for this benefit that, along with the fees paid by other customers, covers salaries and physical infrastructure.

    Along with the growth of data centers is the continued growth in the computational capabilities achieved through the use of parallel computing within a single device. The history of parallel computing is inextricably intertwined with the history of computation itself. It is simply the strategy of breaking a problem up into smaller problems that are then distributed and processed concurrently. This concept will reappear as we delve into greater detail of how big data solutions function.
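The strategy of breaking a problem into smaller problems that are processed concurrently can be shown with a short sketch. The helper names `split_into_chunks` and `parallel_sum` are hypothetical, and summing a list of numbers is a stand-in for real work. The example uses Python's standard `concurrent.futures.ThreadPoolExecutor` so that it runs anywhere; in standard Python, threads illustrate the pattern but do not speed up pure arithmetic, and `ProcessPoolExecutor` from the same module offers the same interface for spreading work across cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks each take one extra item.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_sum(items, n_workers=4):
    """Sum a list by giving each worker one chunk, then combining the results."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # one smaller problem per worker
    return sum(partial_sums)                        # combine the partial answers


numbers = list(range(1, 1001))
total = parallel_sum(numbers)
print(total)
```

The same three steps (split, process the pieces at the same time, combine) are the ones the cake analogy describes: each worker handles one part, and the parts are assembled at the end.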

    Multiple computers running side by side or the different components of a mainframe may handle parallel computing, but there is a simpler example of parallel computing that is running in most home and office computers and even on smartphones. This is the multicore processor.

    A multicore processor is an integrated circuit that contains more than one central processing unit (CPU or core); it splits up processing tasks among these. In home use, the most common multicore processors are currently the dual-core and quad-core processors (the current Apple Macintosh Pro can run dual hexacores!), though some specialized processors may have more than 100 cores. The spread of multicore processors will be discussed in greater detail later in this chapter.

    As we write this book, the Tianhe-2 supercomputer was unveiled in China, attaining 33.86 petaflops of processing power with a theoretical peak performance of 54.9 petaflops (a petaflop is 10¹⁵ flops, or FLoating-point Operations Per Second; the performance of a standard desktop computer is measured in gigaflops, one of which is equal to 0.000001 petaflop), toppling the Titan computer at Oak Rid