Big Data Analytics: A Practical Guide for Managers
Description: It gives an overview of big data applications.
Information Technology / Database
With this book, managers and decision makers are given the tools to make more informed decisions about big data purchasing initiatives. Big Data Analytics: A Practical Guide for Managers not only supplies descriptions of common tools, but also surveys the various products and vendors that supply the big data market.
Comparing and contrasting the different types of analysis commonly conducted with big data, this accessible reference presents clear-cut explanations of the general workings of big data tools. Instead of spending time on HOW to install specific packages, it focuses on the reasons WHY readers would install a given package.
The book provides authoritative guidance on a range of tools, including open source and proprietary systems. It details the strengths and weaknesses of incorporating big data analysis into decision-making and explains how to leverage the strengths while mitigating the weaknesses.
Describes the benefits of distributed computing in simple terms
Includes substantial vendor/tool material, especially for open source decisions
Covers prominent software packages, including Hadoop and Oracle Endeca
Examines GIS and machine learning applications
Considers privacy and surveillance issues
The book further explores basic statistical concepts that, when misapplied, can be the source of errors. Time and again, big data is treated as an oracle that discovers results nobody would have imagined. While big data can serve this valuable function, all too often these results are incorrect yet are still reported unquestioningly. The probability of having erroneous results increases as a larger number of variables are compared unless preventative measures are taken.
The approach taken by the authors is to explain these concepts so managers can ask better questions of their analysts and vendors about the appropriateness of the methods used to arrive at a conclusion. Because the world of science and medicine has been grappling with similar issues in the publication of studies, the authors draw on their efforts and apply them to big data.
Kim H. Pries
Robert Dunnigan
K23000
ISBN: 978-1-4822-3451-0
6000 Broken Sound Parkway, NW, Suite 300, Boca Raton, FL 33487
711 Third Avenue, New York, NY 10017
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN, UK
an informa business
www.crcpress.com
www.auerbach-publications.com
BIG DATA ANALYTICS
A Practical Guide for Managers
Kim H. Pries
Robert Dunnigan
MATLAB and Simulink are trademarks of The MathWorks, Inc. and are used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB and Simulink software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB and Simulink software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20141024
International Standard Book Number-13: 978-1-4822-3452-7 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface
Acknowledgments
Authors
Chapter 1 Introduction
    So What Is Big Data?
    Growing Interest in Decision Making
    What This Book Addresses
    The Conversation about Big Data
    Technological Change as a Driver of Big Data
    The Central Question: So What?
    Our Goals as Authors
    References
Chapter 2 The Mother of Invention's Triplets: Moore's Law, the Proliferation of Data, and Data Storage Technology
    Moore's Law
    Parallel Computing, between and within Machines
    Quantum Computing
    Recap of Growth in Computing Power
    Storage, Storage Everywhere
    Grist for the Mill: Data Used and Unused
    Agriculture
    Automotive
    Marketing in the Physical World
    Online Marketing
    Asset Reliability and Efficiency
    Process Tracking and Automation
    Toward a Definition of Big Data
    Putting Big Data in Context
    Key Concepts of Big Data and Their Consequences
    Summary
    References
Chapter 3 Hadoop
    Power through Distribution
    Cost Effectiveness of Hadoop
    Not Every Problem Is a Nail
    Some Technical Aspects
    Troubleshooting Hadoop
    Running Hadoop
    Hadoop File System
    MapReduce
    Pig and Hive
    Installation
    Current Hadoop Ecosystem
    Hadoop Vendors
    Cloudera
    Amazon Web Services (AWS)
    Hortonworks
    IBM
    Intel
    MapR
    Microsoft
    Running Pig Latin Using Powershell
    Pivotal
    References
Chapter 4 HBase and Other Big Data Databases
    Evolution from Flat File to the Three V's
    Flat File
    Hierarchical Database
    Network Database
    Relational Database
    Object-Oriented Databases
    Relational-Object Databases
    Transition to Big Data Databases
    What Is Different about HBase?
    What Is Bigtable?
    What Is MapReduce?
    What Are the Various Modalities for Big Data Databases?
    Graph Databases
    How Does a Graph Database Work?
    What Is the Performance of a Graph Database?
    Document Databases
    Key-Value Databases
    Column-Oriented Databases
    HBase
    Apache Accumulo
    References
Chapter 5 Machine Learning
    Machine Learning Basics
    Classifying with Nearest Neighbors
    Naive Bayes
    Support Vector Machines
    Improving Classification with Adaptive Boosting
    Regression
    Logistic Regression
    Tree-Based Regression
    K-Means Clustering
    Apriori Algorithm
    Frequent Pattern-Growth
    Principal Component Analysis (PCA)
    Singular Value Decomposition
    Neural Networks
    Big Data and MapReduce
    Data Exploration
    Spam Filtering
    Ranking
    Predictive Regression
    Text Regression
    Multidimensional Scaling
    Social Graphing
    References
Chapter 6 Statistics
    Statistics, Statistics Everywhere
    Digging into the Data
    Standard Deviation: The Standard Measure of Dispersion
    The Power of Shapes: Distributions
    Distributions: Gaussian Curve
    Distributions: Why Be Normal?
    Distributions: The Long Arm of the Power Law
    The Upshot? Statistics Are Not Bloodless
    Fooling Ourselves: Seeing What We Want to See in the Data
    We Can Learn Much from an Octopus
    Hypothesis Testing: Seeking a Verdict
    Two-Tailed Testing
    Hypothesis Testing: A Broad Field
    Moving On to Specific Hypothesis Tests
    Regression and Correlation
    p Value in Hypothesis Testing: A Successful Gatekeeper?
    Specious Correlations and Overfitting the Data
    A Sample of Common Statistical Software Packages
    Minitab
    SPSS
    R
    SAS
    Big Data Analytics
    Hadoop Integration
    Angoss
    Statistica
    Capabilities
    Summary
    References
Chapter 7 Google
    Big Data Giants
    Google
    Go
    Android
    Google Product Offerings
    Google Analytics
    Advertising and Campaign Performance
    Analysis and Testing
    Facebook
    Ning
    Non-United States Social Media
    Tencent
    Line
    Sina Weibo
    Odnoklassniki
    Vkontakte
    Nimbuzz
    Ranking Network Sites
    Negative Issues with Social Networks
    Amazon
    Some Final Words
    References
Chapter 8 Geographic Information Systems (GIS)
    GIS Implementations
    A GIS Example
    GIS Tools
    GIS Databases
    References
Chapter 9 Discovery
    Faceted Search versus Strict Taxonomy
    First Key Ability: Breaking Down Barriers
    Second Key Ability: Flexible Search and Navigation
    Underlying Technology
    The Upshot
    Summary
    References
Chapter 10 Data Quality
    Know Thy Data and Thyself
    Structured, Unstructured, and Semistructured Data
    Data Inconsistency: An Example from This Book
    The Black Swan and Incomplete Data
    How Data Can Fool Us
    Ambiguous Data
    Aging of Data or Variables
    Missing Variables May Change the Meaning
    Inconsistent Use of Units and Terminology
    Biases
    Sampling Bias
    Publication Bias
    Survivorship Bias
    Data as a Video, Not a Snapshot: Different Viewpoints as a Noise Filter
    What Is My Toolkit for Improving My Data?
    Ishikawa Diagram
    Interrelationship Digraph
    Force Field Analysis
    Data-Centric Methods
    Troubleshooting Queries from Source Data
    Troubleshooting Data Quality beyond the Source System
    Using Our Hidden Resources
    Summary
    References
Chapter 11 Benefits
    Data Serendipity
    Converting Data Dreck to Usefulness
    Sales
    Returned Merchandise
    Security
    Medical
    Travel
    Lodging
    Vehicle
    Meals
    Geographical Information Systems
    New York City
    Chicago CLEARMAP
    Baltimore
    San Francisco
    Los Angeles
    Tucson, Arizona, University of Arizona, and COPLINK
    Social Networking
    Education
    General Educational Data
    Legacy Data
    Grades and Other Indicators
    Testing Results
    Addresses, Phone Numbers, and More
    Concluding Comments
    References
Chapter 12 Concerns
    Logical Fallacies
    Affirming the Consequent
    Denying the Antecedent
    Ludic Fallacy
    Cognitive Biases
    Confirmation Bias
    Notational Bias
    Selection/Sample Bias
    Halo Effect
    Consistency and Hindsight Biases
    Congruence Bias
    Von Restorff Effect
    Data Serendipity
    Converting Data Dreck to Usefulness
    Sales
    Merchandise Returns
    Security
    CompStat
    Medical
    Travel
    Lodging
    Vehicle
    Meals
    Social Networking
    Education
    Making Yourself Harder to Track
    Misinformation
    Disinformation
    Reducing/Eliminating Profiles
    Social Media
    Self Redefinition
    Identity Theft
    Facebook
    Concluding Comments
    References
Chapter 13 Epilogue
    Michael Porter's Five Forces Model
    Bargaining Power of Customers
    Bargaining Power of Suppliers
    Threat of New Entrants
    Others
    The OODA Loop
    Implementing Big Data
    Nonlinear, Qualitative Thinking
    Closing
    References
Preface
When we started this book, big data had not quite become a business buzzword. As we did our research, we realized the books we perused were either of the "Gee, whiz! Can you believe this?" class or incredibly abstruse. We felt the market needed explanation oriented toward managers who had to make potentially expensive decisions.
We would like managers and implementors to know where to start when they decide to pursue the big data option. As we indicate, the marketplace for big data is much like that for personal computing in the early 1980s: full of consultants, products with bizarre names, and tons of hyperbole. Luckily, in the 2010s, much of the software is open source and extremely powerful. Big data consultancies exist to translate this free software into useful tools for the enterprise. Hence, nothing is really free.
We also ensure our readers can understand both the benefits and the costs of big data in the marketplace, especially the dark side of data. By now, we think it is obvious that the US National Security Agency is an archetype for big data problem solving. Large-city police departments have their own statistical data tools and some of them ponder the usefulness of cell phone confiscation and investigation as well as the use of social media, which are public.
As we researched, we found ourselves surprised at the size of well-known marketers such as Google and Amazon. Both of these enterprises have purchased companies and have grown themselves organically. Facebook continues to purchase companies (e.g., Oculus, the supplier of a potentially game-changing virtual reality system) and has over 1 billion users. Algorithmic analysis of colossal volumes of data yields information; information allows vendors to tickle our buying reflexes before we even know our own patterns.
Previously, we thought Esri owned the geographical information systems market, but we found a variety of geographical information systems solutions, although the Esri product line is relatively mature and they serve large-city police departments across the United States. Database creators explore new ways of looking at and storing/retrieving data: methods going beyond the relational paradigm. New and old algorithmic methods called machine learning allow computers to sort and separate the useful data from the useless.
We have grown to appreciate the open-source statistical language R over the years. R has become the statistical lingua franca for big data. Some of the major statistical vendors advertise their functional partnerships with R. We use the tool ourselves to generate many of our figures. We suspect R is now the most powerful generally available statistical tool on the planet.
Let's move on and see what we can learn about big data!
MATLAB is a registered trademark of The MathWorks, Inc. For product information, please contact:
The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7001
E-mail: [email protected]
Web: www.mathworks.com
Acknowledgments
Kim H. Pries would like to acknowledge Janise Pries, the love of his life, for her support and editing skills. In addition, Robert Dunnigan supplied verbiage, chapters, Six Sigma expertise, and big data professionalism. As always, John Wyzalek and the Taylor & Francis team are key players in the production and publication of technical works such as this one.
Robert Dunnigan thanks his wife, Flabia Dunnigan, and his son Robert III for their love and patience during the composition of this book. He would also like to thank Kim H. Pries for his depth of expertise in a broad array of technical subjects as well as his experience as an author. He skillfully navigated the process of proposing, developing, and finalizing what is a unique and practical offering in the field of big data literature. Robert would also like to thank his employer, The Kratos Group, for their interest and moral support during the writing of this book. Kratos is a remarkable company of which Robert is proud to be a part. Finally, thanks are due to Taylor & Francis for bringing this new perspective on big data to market.
Authors
Kim H. Pries has four college degrees: a bachelor of arts in history from the University of Texas at El Paso (UTEP), a bachelor of science in metallurgical engineering from UTEP, a master of science in engineering from UTEP, and a master of science in metallurgical engineering and materials science from Carnegie-Mellon University. In addition, he holds the following certifications:
APICS
    Certified Production and Inventory Manager (CPIM)
American Society for Quality (ASQ)
    Certified Reliability Engineer (CRE)
    Certified Quality Engineer (CQE)
    Certified Software Quality Engineer (CSQE)
    Certified Six Sigma Black Belt (CSSBB)
    Certified Manager of Quality/Operational Excellence (CMQ/OE)
    Certified Quality Auditor (CQA)
Pries worked as a computer systems manager, a software engineer for an electrical utility, and a scientific programmer under a defense contract; for Stoneridge, Incorporated (SRI), he has worked as the following:
    Software manager
    Engineering services manager
    Reliability section manager
    Product integrity and reliability director
In addition to his other responsibilities, Pries has provided Six Sigma training for both UTEP and SRI, and cost reduction initiatives for SRI. Pries is also a founding faculty member of Practical Project Management. Additionally, in concert with Jon Quigley, Pries was a cofounder and principal with Value Transformation, LLC, a training, testing, cost improvement, and product development consultancy. Pries also holds Texas teacher certifications in:
    Mathematics (8-12)
    Mathematics (4-8)
    Technology education (6-12)
    Technology applications (EC-12)
    Physics (8-12)
    Generalist (4-8)
    English Language Arts and Reading (8-12)
    History (8-12)
    Computer Science (8-12)
    Science (8-12)
    Special education (EC-12)
He trained for Introduction to Engineering Design and Computer Science and Software Engineering with Project Lead the Way. He currently teaches biotechnology, computer science and software engineering, and introduction to engineering design at the beautiful Parkland High School in the Ysleta Independent School District of El Paso, Texas.
Pries authored or coauthored the following books:
    Six Sigma for the Next Millennium: A CSSBB Guidebook (Quality Press, 2005)
    Six Sigma for the New Millennium: A CSSBB Guidebook, Second Edition (Quality Press, 2009)
    Project Management of Complex and Embedded Systems: Ensuring Product Integrity and Program Quality (CRC Press, 2008), with Jon M. Quigley
    Scrum Project Management (CRC Press, 2010), with Jon M. Quigley
    Testing Complex and Embedded Systems (CRC Press, 2010), with Jon M. Quigley
    Total Quality Management for Project Management (CRC Press, 2012), with Jon M. Quigley
    Reducing Process Costs with Lean, Six Sigma, and Value Engineering Techniques (CRC Press, 2012), with Jon M. Quigley
    A School Counselor's Guide to Ethics (Counselor Connection Press, 2012), with Janise G. Pries
    A School Counselor's Guide to Techniques (Counselor Connection Press, 2012), with Janise G. Pries
    A School Counselor's Guide to Group Counseling (Counselor Connection Press, 2012), with Janise G. Pries
    A School Counselor's Guide to Practicum (Counselor Connection Press, 2013), with Janise G. Pries
    A School Counselor's Guide to Counseling Theories (Counselor Connection Press, 2013), with Janise G. Pries
    A School Counselor's Guide to Assessment, Appraisal, Statistics, and Research (Counselor Connection Press, 2013), with Janise G. Pries
Robert Dunnigan is a manager with The Kratos Group and is based in Dallas, Texas. He holds a bachelor of science in psychology and in sociology with an anthropology emphasis from North Dakota State University. He also holds a master of business administration from INSEAD, the business school for the world, where he attended the Singapore campus.
As a Peace Corps volunteer, Robert served over 3 years in Honduras developing agribusiness opportunities. As a consultant, he later worked on the Afghanistan Small and Medium Enterprise Development project in Afghanistan, where he traveled the country with his Afghan colleagues and friends seeking opportunities to develop a manufacturing sector in the country.
Robert is an American Society for Quality certified Six Sigma Black Belt and a Scrum Alliance certified Scrum Master.
1
Introduction
SO WHAT IS BIG DATA?
As a manager, you are expected to operate as a factotum. You need to be an industrial/organizational psychologist, a logician, a bean counter, and a representative of your company to the outside world. In other words, you are somewhat of a generalist who can dive into specifics. The specific technologies you encounter are becoming more complex, yet the differences between them and their predecessors are becoming more nuanced.
You may have already guided your firm's transition to other new technologies. Think of the Internet. In the decade and a half before this book was written, Internet presence went from being optional to being mandatory for most businesses. In the past decade, Internet presence went from being unidirectional to conversational. Once, your firm could hang out its online shingle with either information about its physical location, hours, and offerings if it were a brick-and-mortar business, or else its offerings and an automated payment system if it were an online business. Firms ranging from Barnes & Noble to your corner pizza chain bridged these worlds.
A new buzzword arrived: Web 2.0. Despite much hyperbolic rhetoric, this designation described the real phenomenon of a reciprocal online world. A disgruntled representative of your company responding through the archetypical Web 2.0 technology called social media could cause real damage to your firm. Two news stories involving Twitter broke as this introduction was in its final stages of refinement.
First, Brendan Eich, the new CEO of the software organization Mozilla (creator of the Firefox browser), stepped down after news surfaced indicating he had donated money in support of Proposition 8, an antigay marriage initiative in California, some 6 years before (in 2008). An uproar erupted, largely on Twitter, which led Mr. Eich to resign. Voices in Mr. Eich's defense from across the political spectrum (including Andrew Sullivan, the respected conservative columnist who is himself gay and a proponent for gay marriage rights, and Conor Friedersdorf of The Atlantic, who was also an outspoken opponent of Proposition 8) did not save Mr. Eich's job. He was ousted.
The second Twitter story began with a tweeted complaint from a customer [...] the typical reaction of a company facing such a complaint in the public forum of Twitter. They invited @ElleRafter to provide more information, along with a link. Unlike the typical Twitter response, however, the US Airways tweet included a pornographic photo involving the use of a toy US Airways aircraft. This does not appear to have been a premeditated act by the US Airways representative involved, but it caused substantial humiliating press coverage for the company.
As the Internet spread and matured, it became a necessary forum for communication, as well as a dangerous tool whose potential for good or bad can pull in others by surprise or cause self-inflicted harm. Just as World War I generals were left to figure out how technology changed the field of battle, shifting the advantage from the offense to the defense, Internet technology left managers trying to cope with a new landscape filled with both promise and threats. Now, there is another new buzzword: big data.
So, what is big data? Is it a fad? Is it empty jargon? Is it just a new name for growing capacity of the same databases that have been a part of our lives for decades? Or, is it something qualitatively different? What are the promises of big data? From which direction should a manager anticipate threats?
The tendency of the media to hype new and barely understood phenomena makes it difficult to evaluate new technologies, along with the nature and extent of their significance. This book argues that big data is new and possesses strategic significance. The authors' argument is that big data builds on understandable developments in technology and is itself comprehensible. Although it is comprehensible, it is not easy to use, and it can deliver misleading or incorrect results. However, these erroneous results are not often random. They result from certain statistical and data-related phenomena. Knowing these phenomena are real and understanding how they function enables you as a manager to become a better user of your big data system.
Like cell phones and e-mail, big data is a recent phenomenon that has emerged as a part of the panorama of our daily lives. When you shop online, catch up with friends on Facebook, conduct web searches, read articles referencing database searches, and receive unsolicited coupons, you interact with big data. Many readers, as participants in a store's loyalty program, possess a key fob featuring a bar code on one side and the logo of a favorite store on the other. One of the primary rationales of these programs, aside from decreasing your incentive to shop elsewhere, is to gather data on the company's most important customers. Every time you swipe your key fob or enter your phone number into the keypad of the credit card machine while you are checking out at the cash register, you are tying a piece of identifying data (who you are) with which items you purchased, how many items you purchased, what time of day you were shopping, and other data. From these, analysts can determine whether you shop by brand or buy whatever is on sale, whether you are purchasing different items from before (suggesting a life change), and whether you have stopped making your large purchases in the store and now only drop in for quick items such as milk or sugar. In the latter case, that is a sign you switched to another retailer for the bulk of your shopping and coupons or some other intervention may be in order. Stores have long collected customer data, long before the age of big data, but they now possess the ability to pull in a greater variety of data and conduct more powerful analyses of the data.
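The inference described in this paragraph can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any retailer's actual system: the purchase records, the customer identifiers, the function name flag_shrinking_baskets, and the 50% threshold are all hypothetical. They are chosen only to show how tying an identifier to each purchase lets an analyst notice that a shopper has switched from large trips to quick stops.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical loyalty-card records: (customer_id, month, items_in_basket).
PURCHASES = [
    ("C001", 1, 42), ("C001", 2, 38), ("C001", 3, 40),
    ("C001", 4, 3), ("C001", 5, 2), ("C001", 6, 4),
    ("C002", 1, 25), ("C002", 2, 27), ("C002", 3, 24),
    ("C002", 4, 26), ("C002", 5, 23), ("C002", 6, 28),
]


def flag_shrinking_baskets(purchases, split_month=3, drop_ratio=0.5):
    """Return customers whose average basket after split_month fell below
    drop_ratio times their average basket up to and including split_month."""
    early, late = defaultdict(list), defaultdict(list)
    for customer_id, month, items in purchases:
        bucket = early if month <= split_month else late
        bucket[customer_id].append(items)

    flagged = []
    for customer_id in sorted(early):
        if customer_id not in late:
            continue  # no recent visits to compare against
        if mean(late[customer_id]) < drop_ratio * mean(early[customer_id]):
            flagged.append(customer_id)
    return flagged


if __name__ == "__main__":
    # Customers flagged here might be candidates for coupons or another intervention.
    print(flag_shrinking_baskets(PURCHASES))
```

The design choice worth noting is that nothing here requires sophisticated mathematics. The value comes from the identifier that links separate visits into one history.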
Big data influences us less obviously: it informs the obscure underpinnings of our society, such as manufacturing, transportation, and energy. Any industry developing enormous quantities of diverse data is ready for big data. In fact, these industries probably use big data already. The technological revolution occurring in data analytics enables more precise allocation of resources in our evolving economy, much as the revolution in navigational technology, from the superseded sextant to modern GPS devices, enabled ships to navigate open seas.
Big data is much like the Internet: it has drawbacks, but its net value is positive. The debate on big data, like political debate, tends toward misleading absolutes and false dichotomies. The truth, as in the case of political debates, almost never lies in those absolutes. As with a car, you do not start up a big data solution and let it motor along unguided; you drive it, you guide it, and you extract value from it.
Data itself is now an asset, one for companies to secure and hoard, much as the Federal Reserve Bank of New York stockpiles gold (though, for the sake of accuracy, the Federal Reserve only stores gold for countries other than the United States). Companies invest in systems to organize and extract value from their data, just as they would a piece of land or reserve of raw materials. Data are bought and sold. Some companies, including IHS, Experian, and DataLogix, build entire businesses to collect, refine, and sell data. Companies in the business of data are diverse. IHS provides information about specific industries such as energy, whereas Experian and DataLogix provide personal information about individual consumers. These companies would not exist if the exchange of data were not lucrative. They would enjoy no profit motive if they could not use data to make more money than the cost of its generation, storage, and analysis.
One of your authors was a devotee of Borders, the book retailer (and still keeps his loyalty program card on display as a memorial to the company). After the liquidation of Borders, he received an e-mail message from William Lynch, the chief executive of Barnes & Noble (another favorite store), stating in part, "As part of Borders ceasing operations, we acquired some of its assets including Borders brand trademarks and their customer list. The subject matter of your DVD and other video purchases will be part of the transferred information... If you would like to opt-out, we will ensure all your data we receive from Borders is disposed of in a secure and confidential manner." The data that Borders accumulated were a real asset sold off after its bankruptcy.
Data analysis has even entered popular culture in the form of Michael Lewis's book Moneyball, as well as the eponymous movie. The story centers on Billy Beane, who used data to supplant intuition and turned the Oakland Athletics into a winning team. The relationship between data and decision making is, in fact, the key theme of this book.
GROWING INTEREST IN DECISION MAKING
Any business book of value must answer a simple, two-word question: "So what?" So, why does big data matter? The answer is the confluence of two factors. The first is that awareness of the limitations of human intuition, also known as "gut feel," has become widespread. The second is that big data technologies have reached the level of maturity necessary to make stunning computational feats affordable. Moreover, this computational ability is now visible to the general public. Facebook, Amazon.com, and search engines such as Bing, Yahoo!, and Google are prime examples. Even traditional brick-and-mortar stores match powerful websites with analytics that would have been unimaginable 20 years ago. Barnes & Noble, Wal-Mart, and Home Depot are excellent examples.
Many prominent actors in psychology, marketing, and behavioral finance have pointed out the flaws in human decision making. Psychologist Daniel Kahneman won the Nobel Memorial Prize in Economic Sciences in 2002 for his work on the systematic flaws in the way people weigh risk and reward in arriving at decisions. Building on Kahneman's work, a variety of scholars, including Dan Ariely, Ziv Carmon, and Cass Sunstein, demonstrated how hidden influencers and mental heuristics influence decision making. One of the authors had the pleasure of studying under Mr. Carmon at INSEAD and, during a class exercise, pointed out how much he preferred one ketchup sample to another, only to discover they came from the same bottle and were merely presented as being different. The difference between the two samples was nonexistent, but the difference in taste perception was quite real.
In fact, Mr. Ariely, Mr. Carmon, and their coauthors won the following 2008 Ig Nobel award:
MEDICINE PRIZE. Dan Ariely of Duke University (USA), Rebecca L. Waber of MIT (USA), Baba Shiv of Stanford University (USA), and Ziv Carmon of INSEAD (Singapore) for demonstrating that high-priced fake medicine is more effective than low-priced fake medicine.[1]
The website states, "The Ig Nobel Prizes honor achievements that first make people laugh, and then make them think. The prizes are intended to celebrate the unusual, honor the imaginative and spur people's interest in science, medicine, and technology."[2] It may be easy to laugh about this research, but just consider how powerful it is. Your perception of the medical effectiveness of what is in fact a useless placebo is influenced by how much you believe it costs.
The Atlantic ran an article in its December 2013 issue describing how big data changes hiring decisions. Although this phenomenon is not altogether understood, we have pilot studies, and yes, computers can often do a better job than people.[3] Hiring managers base their willingness to hire on a range of irrelevant factors in interviews. Consider some of these factors: firmness of handshake, physical appearance, projection of confidence, name, and similarities of hobbies with the person conducting the interview all influence employment decisions. Often, these extraneous factors have minimal relevance to the ability of someone to execute their job. It is little wonder that computers and data scientists have been able to improve companies' hiring practices by bringing in big data.
In 1960, a cognitive scientist by the name of Peter Cathcart Wason published a study in which participants were asked to hypothesize the pattern underlying a series of numbers: 2, 4, and 6. They then needed to test it by asking if another series of numbers fit the pattern. What is your hypothesis and how would you test it? What Wason uncovered is a tendency to seek confirmatory information. Participants tended to propose series that already fit the pattern of their assumptions, such as 8, 10, and 12. This is not a helpful approach to the problem, though. A more productive approach would be 12, 10, and 8 (descending order, separated by two), or 2, 3, and 4 (ascending order, separated by one). Irregular series such as 3, [...], and 4, or 0, 1, and 4 would also be useful, as would anything that directly violates the pattern of the original set of numbers provided. The pattern sought in the study was any series of numbers in ascending order. Participants did a poor job of eliminating potential hypotheses by seeking out options that directly contradicted their original hunches, tending instead to confirm what they already believed. The title of this seminal study, "On the Failure to Eliminate Hypotheses in a Conceptual Task," highlights this intellectual bias.[4]
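To see why confirming tests teach so little, the task can be sketched in Python. This is an illustration under stated assumptions rather than a reconstruction of Wason's procedure: the function names and the particular hunch ("each number is two more than the last") are hypothetical, and the true rule is the one given above, any series in ascending order. A test series is treated as informative only when the hunch and the true rule give different answers for it.

```python
def true_rule(series):
    """The rule sought in the study: any strictly ascending series."""
    return all(a < b for a, b in zip(series, series[1:]))


def hunch(series):
    """A hypothetical participant's guess: each number is two more than the last."""
    return all(b - a == 2 for a, b in zip(series, series[1:]))


def informative(series):
    """A test can expose a wrong hunch only if hunch and rule disagree on it."""
    return hunch(series) != true_rule(series)


# Series that already fit the hunch (the confirmatory strategy).
confirming_tests = [(8, 10, 12), (20, 22, 24)]
# Series from the text that deliberately break the hunch.
probing_tests = [(2, 3, 4), (0, 1, 4)]

if __name__ == "__main__":
    for series in confirming_tests + probing_tests:
        print(series,
              "hunch says:", hunch(series),
              "rule says:", true_rule(series),
              "informative:", informative(series))
```

Every confirming series receives a "yes" from both the hunch and the true rule, so the participant learns nothing new. The series chosen to break the hunch still receive a "yes" from the true rule, which is exactly the surprise that forces the hypothesis to be revised.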
Wason's findings were pioneering work in this field, and in many ways Daniel Kahneman's work is a fruitful and ingenious offshoot thereof. As this is an introduction, we will not continue listing examples of cognitive biases, but they have been demonstrated many times in how we evaluate others, how we judge our own satisfaction, and how we estimate numbers.
Big data not only addresses the arcane relationships between technical variables, but it also has a pragmatic role in saving costs, controlling risks, and preventing headaches for managers in a variety of roles. It does this in part by finding patterns where they exist rather than where our fallible reckoning finds the mere mirages of patterns.
WHAT THIS BOOK ADDRESSES
This book addresses a serious gap in the big data literature. During our research, we found popular books and articles that describe what big data is for a general audience. We also found technical books and articles for programmers, administrators, and other specialized roles. There is little discussion, however, that helps the intelligent and inquisitive but nontechnical reader understand the nuances of big data.
Our goal is to enable you, the reader, to discuss big data at a profound level with your information technology (IT) department, the salespeople with whom you will interact in implementing a big data system, and the analysts who will develop and report results drawn from the myriad of data points in your organization. We want you to be able to ask intelligent and probing questions and to be able to make analysts defend their positions before you invest in projects by acting on their conclusions. After reading this book, you should be able to read the footnotes of a position paper and judge the soundness of the methods used. When your IT department discusses a new project, you should be able to guide the discussions.
The discussion in this book ranges well beyond big data itself. The authors include examples from science, medicine, Six Sigma, statistics, and probability with good reason. All of these disciplines are wrestling with similar issues. Big data involves the processing of a large number of variables to pull out nuggets of wisdom. This is using the conclusion to guide the formation of a hypothesis rather than testing the hypothesis to arrive at a conclusion. Some may consider this approach sloppy when applied to any particular scientific study, but the sheer number of studies, combined with a bias toward publishing only positive results, means that a statistically similar phenomenon is occurring in scientific journals. As science is a self-critical discipline, the lessons gleaned from its internal struggle to ensure meaningful results are applicable to your organizations, which need to pull accurate results from big data systems. The current discussion in the popular and business press on big data ignores nonbusiness fields and does so to the detriment of organizations trying to make effective use of big data tools.
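The effect of publishing only positive results can be made concrete with a little arithmetic. The following is a minimal sketch, not a model of any real journal: the number of studies, the share of hypotheses that are actually true, the statistical power, and the significance threshold are all hypothetical values chosen for illustration, and the function name is our own.

```python
def share_of_false_positives(n_studies, share_true, power, alpha):
    """Fraction of 'positive' results that are actually false alarms.

    n_studies  -- total number of studies run (hypothetical)
    share_true -- fraction of tested hypotheses that are really true
    power      -- chance a study detects an effect that is really there
    alpha      -- chance a study reports an effect that is not there
    """
    true_effects = n_studies * share_true
    null_effects = n_studies - true_effects
    true_positives = true_effects * power    # real effects correctly found
    false_positives = null_effects * alpha   # flukes that look like findings
    return false_positives / (true_positives + false_positives)


# Hypothetical scenario: 1,000 studies, 10% of hypotheses true,
# 80% power, and the conventional 5% significance threshold.
rate = share_of_false_positives(n_studies=1000, share_true=0.10,
                                power=0.80, alpha=0.05)
print(f"{rate:.0%} of the positive findings are false")
```

Under these assumed numbers, 80 real effects and 45 flukes come out positive, so roughly a third of what gets published would be wrong even though every individual study followed the rules. A big data system that tests many relationships and reports only the "hits" is in the same position.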
The discussion in this book will provide you with an understanding of these conversations happening outside the world of big data. Louis Pasteur said, "In the fields of observation, chance favors only the prepared mind."5 Some of the most profound conversations on topics of direct relevance to big data practitioners are happening outside of big data. Understanding these conversations will be of direct benefit to you as a manager.
THE CONVERSATION ABOUT BIG DATA
We mentioned the discussions around big data and how unhelpful they are. Some of the discussion is optimistic; some is pessimistic. We will start on the optimistic side.
Perhaps the most famous story about the capabilities of predictive analytics was a 2012 article in The New York Times Magazine about Target.6 Target sells nearly any category of product someone could need, but is not always first in customers' minds for all of those categories. Target sells clothing, groceries, toys, and myriad other items. However, someone may purchase clothing from Target, but go to Kroger for groceries and Toys "R" Us for toys. Any well-managed store will want to increase sales to its customers, and Target is no exception. It wants you to think of Target first for most categories of items.
When life changes, habits change. Target realized that people's purchasing habits change as families grow with the birth of children and are therefore malleable. Target wanted to discover which customers were pregnant around the time of the second trimester so as to initiate marketing to parents-to-be before their babies were born.
A birth is public record and therefore results in a blizzard of advertising. From a marketing aspect, a company is wise to beat that blizzard. Target saw a way to do so by using the data it accumulated.
As a Target statistician told the author of the article, "If you use a credit card or a coupon, or fill out a survey, or mail in a refund, or call the customer help line, or open an e-mail we've sent you or visit our Web site, we'll record it and link it to your Guest ID." The guest ID is the unique identifier used by Target. The statistician continued, "We want to know everything we can." The guest ID is not only linked to what you do within Target's walls, but also to a large volume of demographic and economic information about you.6
Target looked at how women's purchasing habits changed around the time they opened a baby registry, then generalized these purchasing habits back to women who may not have opened a baby registry. Purchases of unscented lotion, large quantities of cotton balls, and certain mineral supplements correlated well with second-trimester pregnancy. By matching this knowledge to promotions that had a high likelihood of effectiveness (again gleaned from Target's customer-specific data), the company could try to change these women's shopping habits at a time when their lives were in flux, during pregnancy.6 The article propelled Target's data analytics prowess to fame and also generated uneasiness.
Target also did not communicate how tricky and resource-intensive such an analysis is. This may be an unfair criticism, as the article was directed at a general readership rather than at businesspeople who are considering the use of big data. However, a business reader of such stories should understand how nuanced, messy, convoluted, and maddening big data can be. The data used by a big data system to reach its conclusions often come with built-in biases and flaws. The statistics used do not provide a precise "yes" or "no" answer, but rather describe a level of confidence on a spectrum of likelihood. This does not make for exciting press, and it is therefore all but invisible in big data articles, except those in specialist sources.
There are many articles about big data and health, big data and marketing, big data and hiring, and so forth. These rarely cover the risks and rewards of data. The reality is that health data can be messy and inaccurate. Moreover, it is protected by a strict legal regimen, the Health Insurance Portability and Accountability Act of 1996 (HIPAA), which restricts its flow. Marketing data are likewise difficult to link up. Data analytics in general, and now big data, have improved marketing efforts but are not a magic bullet. Some stores seldom track what their customers purchase, and those that do so do not trust each other with their databases. In any big data system, the nature of who can see what data needs to be considered, as well as how the data will be secured. It is very likely that your firm will own data only some employees or contractors can see. Making it easier to access this data is not always a good idea.
Later in this introduction, we will discuss data analytics applied to hiring and how poorly this can be reported. As a news consumer, your skepticism should kick in whenever you read about some amazing discovery uncovered by big data methods about how two dissimilar attributes are in fact linked. The reality is at best much more nuanced and at worst is a false relationship. These false relationships are pretty much inevitable, and we dedicate many pages to showing how data and statistics can lead the unwary user astray. Once you embrace this condition, you will probably never read news stories about big data without automatically critiquing them.
On the other side of the argument, perhaps the most astute critic of big data is Nassim Nicholas Taleb. In an opinion piece he wrote for the website of Wired magazine (drawn from his book, Antifragile), he states, "Modernity provides too many variables, but too little data per variable. So the spurious relationships grow much, much faster than real information ... In other words: Big data may mean more information, but it also means more false information."7
Mr. Taleb may be pessimistic, but he raises valuable points. As a former trader with a formidable quantitative background, Taleb has made a name for himself with his astute critiques of faulty decision making. Taleb is a rarity, a public intellectual who is also an intellectual heavyweight. He is not partisan, developing devastating takedowns of sloppy argumentation with equal opportunity fervor. Taleb argues:
• The incentive to draw a conclusion may not align with what the data really show. With this, Taleb discusses the existence of medical studies that cannot be replicated. There are funding incentives to find significant relationships in studies and disincentives to publish studies that show no significant findings. The hallmark of a truly significant finding is that others can replicate the results in their own studies.
• There is not an absence of meaningful information in large data sets; it is simply that the information within is hidden within a larger quantity of noise. Noise is generally considered to be an unwelcome randomness that obscures a signal. As Taleb states, "I am not saying here that there is no information in big data. There is plenty of information. The problem, the central issue, is that the needle comes in an increasingly larger haystack."7
• One difficulty in drawing conclusions from big data is that although it is good for debunking false conclusions, it is not as strong in drawing valid conclusions. Stated differently, "If such studies cannot be used to confirm, they can be effectively used to debunk, to tell us what's wrong with a theory, not whether a theory is right."7 If we are using the scientific method, it may take only one valid counterexample to topple a vulnerable theory.
This is an important article. In fact, the book that you now hold in your hands was conceived as a response. Taleb points out real flaws in how we use big data, but your authors argue we need not use big data this way. A manager who understands the promise and limits of big data can obtain improved results just by knowing the limits of data and statistics and then ensuring that any analysis includes measures to separate wheat from chaff.
Paradoxically, the flaws of big data originate from the unique strengths of big data systems. The first among these strengths is the ability to pull together large numbers of diverse variables and seek out relationships between them. This enables an organization to find relationships within its data that would have otherwise remained undiscovered. However, more variables and more tests must mean an increased chance for error. This book is intended to guide the user in understanding this.
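How quickly does the chance of error grow with the number of variables? The sketch below offers a rough answer under simplifying assumptions that are ours, not a description of any particular big data product: every pair of variables is tested once, no real relationship exists anywhere in the data, each test uses a 5% significance threshold, and the tests are treated as independent (in practice, pairwise tests on the same variables are not fully independent). The function names are hypothetical, and the Bonferroni adjustment shown is one common preventative measure among several.

```python
def pairwise_tests(n_variables):
    """Number of distinct variable pairs that could be compared."""
    return n_variables * (n_variables - 1) // 2


def chance_of_false_alarm(n_tests, alpha=0.05):
    """Probability that at least one of n_tests purely random
    comparisons looks 'significant' (assumes independent tests)."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Stricter per-test threshold that keeps the overall chance of
    any false alarm at or below alpha."""
    return alpha / n_tests


for n in (5, 20, 100):
    tests = pairwise_tests(n)
    print(f"{n:>3} variables -> {tests:>4} pairs, "
          f"expected flukes: {tests * 0.05:.1f}, "
          f"chance of at least one: {chance_of_false_alarm(tests):.1%}, "
          f"adjusted threshold: {bonferroni_threshold(tests):.6f}")
```

The number of pairs grows roughly with the square of the number of variables, so 100 variables already yield 4,950 comparisons. With that many tests, some "relationships" are expected from pure noise unless the threshold for calling something a finding is tightened accordingly.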
A more widely recognized concept made famous by Taleb, not directed at big data but applicable just the same, is his concept of the black swan. The term, which he uses to describe an unforeseeable event, as opposed to just unforeseen, derives from the idea that if one conceives of the color white as being an intrinsic aspect of a swan, then finding a black swan is an unforeseeable experience that renders that expectation untenable. The Black Swan is therefore a shock. The 1987 stock market crash and the terrorist attacks of September 11 are large-scale Black Swans, but smaller Black Swans happen to us in our personal lives and with our businesses. Taleb is talented at bringing concepts into focus through the skillful use of examples. In this case, his example is of the comfortable turkey raised on a farm. He is fed, gets fat, projects ahead, and feels good about his life, until Thanksgiving.8
The field of predictive analytics is related to, and often very much a part of, big data. It has been quite powerful in boosting efficiency and controlling risk, and it is without doubt an indispensable technology for many firms. Even so, there is an uncomfortable truth. With little experience using data to understand a particular phenomenon (or perhaps without collection of the needed data), you will not be able to foresee it. Big data is both art and science, but it is not an all-seeing wellspring of wisdom and knowledge. It will not enable you to eliminate black swan events. It is up to the user of big data systems to understand the risks and limited data that act as a constraint on calculating probabilities for the phenomena being analyzed and to respect chance.
While Taleb's argument is among the most substantive critiques of big data, the general form of his criticism is familiar. Big data is not altogether dismissed, so the criticism is balanced. The flaw in most criticism of big data is not that it is polemic, or dishonest, or uninformed. It is none of these. It is that it is fatalistic. The argument runs: big data has flaws and is thus overrated. What much of the critical big data literature fails to do is look at this technology as an enabling technology. A skilled user who understands the data itself, the tools analyzing it, and the statistical methods being used can extract tremendous value. The user who blindly expects big data systems to spit out meaningful data runs a very high risk of delivering potential disaster to his or her organization.
A further example is an article from the KDnuggets website entitled "Viewpoint: Why Your Company Should NOT Use Big Data."9 The article describes the difficulty of using data well and argues that the most gains can be obtained by using the data one's firm already possesses with wit. It also punctures some balloons involving the misuse of language, such as referring to Nate Silver's brilliant work as big data when in fact it is straightforward analysis. Your authors can attest to the importance of using a firm's data more effectively; we are experienced Six Sigma practitioners. We are accustomed to using data to enhance efficiency and quality and to reduce risk.
This article is still flawed. A more productive approach would be to look at where your organization is now, where it wants to go, and how big data may help it get there. Your organization may not be ready to implement big data now. It may need to focus on better using its existing data. To prepare for the future, it may need to take a more strategic approach to ensuring that the data it now generates is properly linked, so that a user's shopping history can be tied to the particular user. If your analysis leads you to conclude that big data is not a productive effort for your company, then you should heed that advice. Many firms do not need big data, and to attempt to implement this approach just to keep up with the pack would be wasteful. If your firm does see a realistic need for big data and has the resources and commitment to see it through, then the lack of an existing competence is not a valid reason to avoid developing one.
TECHNOLOGICAL CHANGE AS A DRIVER OF BIG DATA
We also discuss technology, including its evolution. The data sets generated every day by online retailers, search engines, investment firms, oil and gas companies, governments, and other organizations are so massive and convoluted that they require special handling. A standard database management system (DBMS) may not be robust enough to manage the sheer enormity of the data. Consider processing a petabyte (1000 TB, or the equivalent of about 1,000 hard drives of 1 TB each, a size found on a medium- to high-end laptop) of data. Physically storing, processing, and locating all of this data presents significant obstacles. Amazon, Facebook, and other high-profile websites measure their storage in petabytes.
Some companies, like Google, developed their own tools, such as MapReduce, the Google File System, and BigTable, to manage colossal volumes of information. The open-source Apache Software Foundation oversees Hadoop (a data-intensive software framework), Hive (data warehouse on top of Hadoop), and HBase (nonrelational, distributed database) in order to provide the programming community with access to tools that can manipulate big data. Papers published by Google about its own techniques inspired the open-source distributed processing manager, Hadoop.
Another area for big data analysis is the use of geographical information systems (GIS). A typical example of GIS software would be the commercial product, ArcGIS, or the open-source product, Quantum GIS. Due to the complexity of map data, even an assessment at the municipality level would constitute a big data situation. When we are looking at the entire planet, we are analyzing big data. GIS is interesting not only because it involves raw numbers but also because it involves data representation and visualization, which must then relate to a map with a clear interpretation. Google Earth adds the extra complexity of zooming, decluttering, and overlaying, as well as choosing between political maps and satellite images. We now add the extra complexities of color, line, contrast, shape, and so on.
The need for low latency, another way of saying short lag time between a request and the delivery of the results, drives the growth of another area of big data: in-memory database systems such as Oracle Endeca Information Discovery and SAP HANA. Though two very different beasts, both demonstrate the ability to use large-capacity random access memory (RAM) to find relationships within sizable and diverse sets of data.
THE CENTRAL QUESTION: SO WHAT?
As has been stated in this introduction, and as we will argue, big data is one of the most powerful tools created by man. It draws together information recorded in different source systems and different formats, then runs analyses at speeds and capacities the human mind cannot match. Big data is a true breakthrough, but being a breakthrough does not confer infallibility. Like any system, big data's limits cluster around particular themes. These themes are not straightforward weaknesses such as those found in poor engineering, but they are inseparable from big data's strengths. By understanding these limits, we can minimize and control them.
The specific examples of big data used so far are such that when we draw faulty conclusions, we suffer minor consequences. One of the authors had a bafflingly off-base category of movies recommended to him by Netflix, has received membership cards in the mail from the AARP despite being decades away from retirement, and has a spouse who was twice bombarded with baby formula coupons in the mail. The first time was shortly before his son was born; the second time was brief and was triggered by erroneous conclusions drawn by some algorithm in an unknown computer.
False conclusions do not always come with small consequences though. Big data is moving into fraud detection, crime prevention, medicine, business strategy, forensic data, and numerous other areas of life where erroneous conclusions are more serious than unanticipated junk mail or strange recommendations from online retailers.
For example, big data is moving into the field of hiring and firing. The previously referenced article from The Atlantic discusses this in detail. Citing myriad findings about how poorly job interviews function in evaluating potential hires, the article discusses different means by which data are used to evaluate potential candidates and current employees.
One company discussed by the article is Evolv. On its website, Evolv, whose slogan is "Big Data for Workforce Optimization," states its value proposition:
• Faster, more accurate selection tools: Evolv's platform enables recruiters to quickly identify the best hires from volumes of candidates based on your unique roles.
• Higher Quality candidates: Better candidate selection results in longer-tenured employees and lower attrition.
• Post-hire engagement tools: Easy to deploy employee engagement surveys keep tabs on what workplace practices are working for you, and which ones are not.10
To attain this, Evolv administers questionnaires to online applicants and then matches the results to those obtained from its data set of 347,000 hires that passed through the process. Who are the best-performing candidates? Who is most likely to stick around? The Atlantic states:
The sheer number of observations that this approach makes possible allows Evolv to say with precision which attributes matter more to the success of retail-sales workers (decisiveness, spatial orientation, persuasiveness) or customer-service personnel at call centers (rapport-building). And the company can continually tweak its questions, or add new variables to its model, to seek out ever-stronger correlates of success in any given job.3
Big data has in many ways made hiring decisions more fair and effective, but it is still prudent to maintain skepticism. One of the most noted findings by Evolv is the role of an applicant's browser while filling in the job application in determining the success of the employee on the job. According to Evolv, applicants who use aftermarket browsers such as Firefox and Chrome tend to be more successful than those applicants who use the browser that came with the operating system, such as Internet Explorer.
The article from The Atlantic adds some precision in describing Evolv's findings linking an applicant's web browser to job performance, stating, "the browser that applicants use to take the online test turns out to matter, especially for technical roles: some browsers are more functional than others, but it takes a measure of savvy and initiative to download them."3 Other articles have made it sound like an applicant's browser was a silver bullet to determining how effective an employee would be:
One of the most surprising findings is just how easy it can be to tell a good applicant from a bad one with Internet-based job applications. Evolv contends that the simple distinction of which Web browser an applicant is using when he or she sends in a job application can show who's going to be a star employee and who may not be.11
This finding raises two key points in using big data to draw conclusions. First, is this a meaningful result, a spurious correlation, or the misreading of data? Without digging into the data and the statistics, it is impossible to say. An online article in The Economist states, "This may simply be a coincidence, but Evolv's analysts reckon an applicant's willingness to go to the trouble of installing a new browser shows decisiveness, a valuable trait in a potential employee."12 The relationship found by Evolv may be real and groundbreaking. It may also just be a statistical artifact of the kind we will be discussing in this book. Even if it is a real and statistically significant finding that stands up to experimental replication, it may be so minor as to be quasi-meaningless. Without knowing about the data sampled, the statistics used, and the strength of the relationship between the variables, the conclusion must be taken with a grain of salt. We will discuss in a later chapter how a statistically significant finding need not be practically significant. We must remember that statistical significance is a mathematical abstraction much like the mean, and it may not have profound human meaning.
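The gap between statistical and practical significance is easy to demonstrate with large samples. The following is a minimal sketch using a standard two-proportion z-test; the figures are entirely hypothetical (they are not Evolv's data), and the function name and the "still employed after 90 days" framing are our own illustration.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (difference in rates, z statistic, p-value).
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a - rate_b, z, p_value


# Hypothetical: 100,000 hires in each group; 50.7% of group A and
# 50.0% of group B are still employed after 90 days.
difference, z, p_value = two_proportion_z_test(50_700, 100_000,
                                               50_000, 100_000)
print(f"difference: {difference:.1%}, z = {z:.2f}, p = {p_value:.4f}")
```

With groups this large, a difference of less than one percentage point clears the conventional 5% significance threshold comfortably. Whether a gap that small should influence a hiring decision is a business judgment that the p-value cannot make for you.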
The second issue raised by the finding relates to interpretation. The Economist was very responsible in pointing out the possibility of a coincidence, or what we are referring to in this book as a statistical artifact. The Atlantic deserves credit for pointing out that this finding (assuming it is legitimate) relates more to technical jobs.
However, remember a quoted passage in one of the articles, "Evolv contends that the simple distinction of which Web browser an applicant is using when he or she sends in a job application can show who's going to be a star employee and who may not be." Such statements should never be used in discussing big data results within your organization. What does "who's going to be a star employee" really mean? It grants too much certainty to a result that will at best be a tendency in the data rather than a set rule. The statement "and who may not be" is likewise meaningless, but in the other direction. It asserts nothing. In real life, many of those who use Firefox or Chrome will be poor hires. Even if there were a real relationship in the data, it would frankly be irresponsible for a hiring manager to place overriding importance on this attribute when there are many other attributes to consider. Language matters.
The points raised by the web browser example are not academic. The consequences for a competent and diligent job seeker who is just fine with Internet Explorer, or a firm that needs that job seeker, are not difficult to figure out and are certainly not minor. One of your authors keeps both Firefox and Internet Explorer open at the same time, as some pages work better on one or the other.
Another firm mentioned in The Atlantic is Gild. Gild evaluates programmers by analyzing their online profiles, including code they have written and its level of adoption, the way that they use language on LinkedIn and Twitter, their contributions to forums, and one rather odd criterion: whether they are fans of a particular Japanese manga site. The Gild representative interviewed in the article herself stated that there is no causal relationship between manga fandom and coding ability, just a correlation.
Firms such as Evolv and Gild, however, work for employers and not applicants. The results from their analyses should result in improved performance. It is the rule, and not the exceptions, that drives the adoption of big data in hiring decisions. One success story Evolv points out is the reduction of one firm's 3-month attrition rate by 30% through the application of big data. It is now helping this client monitor the growth of employees within the firm, based not only on the characteristics of the employees themselves but also on the environment in which they operate, such as who their trainers and managers were.
The case of Evolv is a good illustration of the nature of big data. Proper application of the technology increases efficiency, but a complex set of issues surrounds this application. Many of these issues relate to the potential of incorrect conclusions drawn from the data and the need to mitigate their effect. Yes, judgments can be baseless or unfair. What is the alternative? Think back to our discussion of the faultiness of human judgment.
When a big data system reveals a correlation, it is incumbent on the operator to explore that correlation in great detail rather than to take it superficially. When a correlation is discovered, it is tempting to create a post hoc explanation of why the variables in question are correlated. We glean a mathematically neat and seemingly coherent nugget. However, a false correlation dressed up nicely is nothing but fool's gold. It can change how the recipients of that nugget respond to reality, but it cannot change the underlying reality. As big data spreads its influence into more areas of our lives, the consequences of misinterpretation grow. This is why scientific investigation into the data is important.
Big data raises other issues your organization should consider. Maintaining data raises legal issues if it is compromised. Medical data is the most prominent of these, but any data with trade secrets or personal information such as credit card numbers fits in this category. Incorrect usage creates a risk to corporate reputations. Google's aggressive collection of customer data, sometimes intrusively, has tarnished that firm's reputation. Even worse, the data held can harm others. The New Yorker reports the case of Michael Seay, the father of a young lady whose life tragically ended at the age of 17, who received an OfficeMax flier in the mail addressed to "Mike Seay/Daughter Killed in Car Crash/Or Current Business."13 This obviously created much pain for Mr. Seay, as it would for any parent.
Google Maps Street View has likewise been a curse for many, including a man urinating in his own backyard whose moment of imprudence coincided with Google's car driving past his house. That will be on the Internet forever. The Wall Street Journal carried an in-depth article describing databases of scanned license plates in both the public and private sector. These companies photograph and log license plates, using automatic readers, so a car can be tied to the location where it was photographed. Two private sector companies are listed: Digital Recognition Network, Inc. and MVTrac. A repossession firm mentioned in the article has vehicles that drive hundreds of miles each night logging license plates of parked cars. The majority of cars still driven in the United States are probably logged in these systems, one of which had 700 million scans.14
These developments may not impact your business directly, but as we will see in our later discussions of the advantages and disadvantages of big data, other technologies interacting with big data have the power to undermine your trade secrets or create a competitive environment where you can obtain useful analysis only at the expense of turning over your own data. It would be naive to assume that those who see opportunity in gobbling up your company's information will not do so. In using big data, data ownership will be an issue. The question of who has a right to whose data still needs to be settled through legislation and in the courts. Not only will you need to know how to protect your own firm's data from external parties, you will need to understand how to responsibly and ethically protect the data you hold that belong to others.
The dangers should not scare users away from big data. Just as much of modern technology carries risk (think of the space program, aviation, and energy exploration), such risk delivers rich rewards when well used. Big data is one of the most valuable innovations of the twenty-first century. When properly used in a spirit of cooperative automation, where the operator guides the use and results, the promise of big data is immense.
OUR GOALS AS AUTHORS
An author should undertake the task of writing a book because he or she has something compelling to say. We know of many good books on big data, analytics, and decision making. What we have not seen is a book for the perplexed that partitions the phenomenon of big data into usable chunks. In this introduction, we alluded to the discussion in the press about big data. For a businessperson, project manager, or quality professional who is faced with big data, it is difficult to jump into this discussion and understand what is being said and why. The world of business, like history, is regularly burned by business fads that appear, notch up prominent successful case studies, then fade out to leave a trail of less-publicized wreckage in their wake. We want to help you understand the fundamentals and set realistic expectations so that your experience is that of being a successful case study.
We want you as the reader to understand certain key points:
• Big data is comprehensible. It springs from well-known trends that you experience every day. These include the growth in computing power, data storage, and data creation, as well as new ideas for organizing information.
• You should become aware of key big data packages, which we list and discuss in detail. Each has its characteristics that are easy to remember. Once you understand these, you can ask better questions of external salespeople and your internal IT departments.
• Big data technologies enable the integration of capabilities previously not included in most business analytics. These include GIS and predictive analytics. New kinds of analysis are evolving.
• Data is not an oracle. It reflects the conditions under which it was created. There are biases and errors that creep into data. Even the best data cannot predict developments for which there is no precedent.
• Big data will open legal, logistical, and strategic challenges for your organization, even if you decide that big data is not right for your firm. Not only must a firm be aware of the value and security measures surrounding data that it holds, it must be aware of data that it gives up voluntarily and involuntarily to other parties. There are no black-and-white answers to guide you, as this is a developing field.
• Data analytics in big data still rely on established statistical tools. Some of these may be arcane, but there are common statistical tools that can apply a reasonability check to your results. Understanding analytics enables you to ask better questions of your data analysts and monitor the assumptions underlying the results upon which you take action.
• Your organization may already have the knowledge workers necessary to conduct analysis or even just sanity check results to ensure that they are accurate and yield results. Do you have a Six Sigma unit? Do you have actuaries? Do you have statisticians? If you do, then you have the knowledge base in-house to use your big data solution more effectively.
Now, on to our journey through this remarkable technology.
REFERENCES
1. Improbable Research. Winners of the Ig Nobel Prize. Improbable Research. http://www.improbable.com/ig/winners/. Accessed April 16, 2014.
2. Improbable Research. About the Ig Nobel Prizes. Improbable Research. http://www.improbable.com/ig/. Accessed April 16, 2014.
3. Peck, D. They're watching you at work. The Atlantic. December 2013.
4. Wason, P. On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 1960, 12(3): 129-140.
5. Louis Pasteur. Wikiquote. http://en.wikiquote.org/wiki/Louis_Pasteur. Accessed April 18, 2014.
6. Duhigg, C. How companies learn your secrets. The New York Times. February 16, 2012. http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0&pagewanted=all. Accessed April 19, 2014.
7. Taleb, N. (guest editorial, credited to Ogi Ogas in the byline) Beware the big errors of big data. Wired. February 8, 2013. http://www.wired.com/2013/02/big-data-means-big-errors-people/. Accessed April 19, 2014.
8. Taleb, N. Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets. New York: Thompson TEXERE, 2004.
9. Nevraumont, E. Viewpoint: Why your company should NOT use Big Data. KD Nuggets. January 2014. http://www.kdnuggets.com/2014/01/viewpoint-why-your-company-should-not-use-big-data.html. Accessed April 19, 2014.
10. Evolv. Our expertise. Evolv company website. http://www.evolv.net/expertise/. Accessed April 19, 2014.
11. Javers, E. Inside the wacky world of weird data: What's getting crunched. CNBC. February 12, 2014. http://www.cnbc.com/id/101410448. Accessed April 18, 2014.
12. E.H. How might your choice of browser affect your job prospects? The Economist. April 10, 2013. http://www.economist.com/blogs/economist-explains/2013/04/economist-explains-how-browser-affects-job-prospects#sthash.iNblvZ6J.dpuf. Accessed April 19, 2014.
13. Merrick, A. A death in the database. The New Yorker. January 23, 2014. http://www.newyorker.com/online/blogs/currency/2014/01/ashley-seay-officemax-car-crash-death-in-the-database.html. Accessed April 17, 2014.
14. Angwin, J. and Valentino-DeVries, J. New tracking frontier: Your license plates. The Wall Street Journal. September 29, 2012. http://online.wsj.com/news/articles/SB10000872396390443995604578004723603576296. Accessed April 19, 2014.
2
The Mother of Invention's Triplets: Moore's Law, the Proliferation of Data, and Data Storage Technology
Is big data just hype? Is it really something new? If it is different, how is it different? If it brings a change, is it evolutionary or revolutionary change? While we wish we could present you with a clear-cut answer, we cannot. That argument has not been resolved, and it will remain a matter of opinion.
Instead of presenting you with lofty scenarios of what big data may someday be able to do, we will show you how big data arose due to technological developments and the needs arising from more and more data. Seeing the changes that made big data a possibility (almost an inevitability, really) will help you to sort out the knowledge from the hype. By using this bottom-up approach to explaining big data, we hope you will begin to see potential ways to use technology and vendor relationships that you already have to better use data that you already possess but do not use. The odds are that you will not want your first big data project to turn your facilities into something from a science fiction movie. Work is being done to make that a reality, but a lower-risk approach with a faster return is to simply pull together and use the data that you already have. In other words, we do not want to dazzle you. We want to help you make decisions now.
Big data is new in that its capabilities for processing data are unprecedented. By unprecedented, we do not refer merely to the quantity of data but also the variety of data. Big data technologies for crunching data in search of relationships between variables, both obvious and obscure, have developed alongside explosive growth in data storage capabilities. As data processing and storage capabilities demonstrate seemingly boundless growth, two other developments provide the data that fill that storage and provide the grist for the mills of modern processors.
Life used to be analog. We inhabited a world of records, letters, and copper phone lines. That world is disappearing. Sensors are becoming ubiquitous, from car engine computers to home burglar alarms to radio-frequency ID (RFID) tags. Computers metamorphosed into intermediaries for increasing quantities of transactions and interactions. LinkedIn and Facebook enable users to create public or semipublic personas; the Internet went from being an obscure medium for techies (informal term for profoundly involved, technologically aware users and creators) to being a global marketplace, and texting is now a quick way for friends to share tidbits of information. The background noise of modern life is data. Our data accumulate, they live, they are recorded and stored, and they are valuable in their own right.
In a sense, big data technologies are mature because we can comprehend them in terms of the technologies from which they developed, most of which have established histories. Computers became a possibility once Charles Babbage proposed the difference engine, an unrealized but logically fully developed mechanical computer, in 1822. Electronic computers came into their own in the twentieth century, with the code-breaking "bombes" (a bombe was a quasi-computer devoted to decryption solely) at Bletchley Park in Britain during World War II being concrete examples of how computers can shake the foundations of modern warfare. For all the awesome power of tanks, planes, and bombs, the great minds that cracked Axis codes (most famously the towering and tragic figure of Alan Turing) did something just as powerful. Deciphering those codes allowed them to penetrate the nervous system of the enemy's intelligence apparatus and know what it would do quickly enough to anticipate enemy actions and neutralize them.
Our current digital world can trace lineage back to this pioneering technology. To explore this development, let us start with the growth of processing power.
MOORE'S LAW
In 1965, Gordon Moore, who would go on 3 years later to cofound the company that would become Intel, published a paper titled "Cramming More Components onto Integrated Circuits" in the journal Electronics. Though a mere four pages in length, the paper laid out the case for the now famous Moore's law. It is an intriguing read: after discussing advances in the manufacture of integrated circuits, Moore covers their advantages in terms of cost and reliability, the latter demonstrated by the inclusion of integrated circuits in NASA's Apollo space missions (it was Apollo 11 that landed Neil Armstrong and Buzz Aldrin on the moon). In his paper, Moore correctly foresees the use of this technology in successively increasing numbers of devices. Moore's law and the spread of the integrated circuit are a story of accelerated technological augmentation built on top of what had already been a whirlwind pace in the development of computer technology.1
This history of technological development leading up to Gordon Moore's paper is a compelling story on its own merits. The integrated circuit is a cluster of transistors manufactured together as a single unit. In fact, they are not assembled in any meaningful sense. The process of photolithography, or the use of light to print over a stencil of the circuit laid over a silicon wafer, means that transistors are etched together, emerging in a useful design as a single unit. The process is not entirely unlike using a stencil to paint writing on a wall, although the technology is clearly more demanding and precise. It means that there are no joins (e.g., solder joints) that can crack, and there are no moving parts to wear out. Moore was writing only 7 years after Jack Kilby of Texas Instruments had built the first working model of the integrated circuit while most other employees of his firm were on vacation![2] Texas Instruments is still one of the leading manufacturers of integrated circuits, along with Intel.
Before the introduction of the integrated circuit, the transistor was the standard for data processing. The first patent for a working transistor (undemonstrated designs had received earlier patents) was patent number 2,524,035, awarded to John Bardeen and Walter Brattain of Bell Labs in 1950,[3] with patent number 2,569,347 being awarded to their colleague, William Shockley, the following year.[4] The transistor offered many improvements over its predecessor, the vacuum tube. It was easier to manufacture, more energy efficient, and more reliable. It did not generate as much heat and thus enjoyed a longer life. Still, it was a discrete device. Unlike the integrated circuit, in which millions of transistors can be a single array, transistors needed to be assembled before Jack Kilby's seemingly innocuous but world-altering insight. Individual transistor assembly generally meant the use of the older through-hole technology, where the leads to the discrete transistor went through the printed circuit board and were often wave-soldered.
Moore's argument in his paper, this artifact from the dawn of the integrated circuit, is nuanced and carefully argued. It is easy to forget this, decades later, when few commentators actually read it and the popular press reduces the concept to pithy sound bites about the increase in processing power versus time. What Moore composed was neither a lucky guess nor a bald assertion. It was an exquisite argument incorporating technology, economics, and perhaps most importantly, manufacturing ability. He famously argued that as more components are added to an integrated circuit of a given size, the cost per component decreases. In his words:
For simple circuits, the cost per component is nearly inversely proportional to the number of components, the result of the equivalent piece of semiconductor in the equivalent package containing more components. But as components are added, decreased yields more than compensate for the increased complexity, tending to raise the cost per component.[1]
As of 1965, the number of components that could be included on an integrated circuit at the lowest price per component was 50. Moore foresaw the optimal number of components per circuit, from a cost per component point of view, being 1000 by 1970, with a cost per component that was 10% of the 1965 cost. By 1975, he saw the optimal number of components reaching 65,000. In other words, he perceived the cost of production declining as the technology secured itself in our culture.[1]
Before moving on to a technical discussion of the circuits, Moore stated, "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... there is no reason to believe [this rate of change] will not remain nearly constant for at least 10 years."[1] In fact, nearly 50 years after its formulation, Moore's law abides. Figure 2.1 is a powerful illustration.
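The doubling arithmetic can be checked with a few lines of Python. This is only a rough sketch: the function name and its parameters are ours, not Moore's, and the starting point (about 50 components in 1965) is the figure quoted above. Doubling every year from that base gives about 51,200 components by 1975, the same order of magnitude as the 65,000 Moore foresaw. It also gives 1,600 for 1970 rather than the 1,000 cited, so a constant doubling period only approximates Moore's own figures.

```python
def projected_components(start_count, start_year, target_year,
                         doubling_period_years=1.0):
    """Project a component count forward, assuming a fixed doubling period.

    Illustrative helper only; the names are hypothetical and the model
    is a simple exponential, not Moore's full cost argument.
    """
    doublings = (target_year - start_year) / doubling_period_years
    return start_count * 2 ** doublings


# Starting point taken from the text: about 50 components in 1965.
for year in (1965, 1970, 1975):
    yearly = projected_components(50, 1965, year, doubling_period_years=1.0)
    biennial = projected_components(50, 1965, year, doubling_period_years=2.0)
    print(f"{year}: doubling yearly ~{yearly:,.0f}; "
          f"doubling every two years ~{biennial:,.0f}")
```

The second column uses the slower two-year doubling period shown in Figure 2.1. Over the same decade it reaches only about 1,600 components, which shows how sensitive long-range projections are to the assumed period.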
Respected physicist Michio Kaku predicts the end of Moore's law, pointing out that the photolithography process used to manufacture integrated circuits relies on ultraviolet light with a wavelength that can be as small as 10 nm, or approximately 30 atoms across. Current manufacturing methods cannot be used to build transistors smaller than this. There is a more fundamental barrier lurking, however. Dr. Kaku lays out his argument thus:
Transistors will be so small that quantum theory or atomic physics takes over and electrons leak out of the wires. For example, the thinnest layer inside your computer will be about five atoms across. At that point, according to the laws of physics, the quantum theory takes over. The Heisenberg uncertainty principle states that you cannot know both the position and velocity of any particle. This may sound counterintuitive, but at the atomic level you simply cannot know where the electron is, so it can never be confined precisely in an ultrathin wire or layer and it necessarily leaks out, causing the circuit to short-circuit.[5]
FIGURE 2.1
Moore's law. Transistor count by year, 1960 to 2020, doubling every two years.
As pessimistic as Dr. Kaku's argument sounds, there is cause for optimism regarding sustained improvements in future computing power. Notice that Dr. Kaku's argument is pointing out that the constraint on what can be accomplished is quantum theory, which is an elegant argument for the power of human ingenuity. Individual transistors within an integrated circuit are now so small that it is the physics of the individual atoms that make up the transistor that has become the constraining factor. The same ingenuity that brought us to this point will inevitably turn toward innovating in other forms, new approaches where the physics involved has not yet become a constraint.
The end of Moore's law, in other words, simply means the closing of one door to ever more powerful computers. It does not necessarily spell the end of other methods. In fact, one method is already well established, that being parallel computing.
PARALLEL COMPUTING, BETWEEN
AND WITHIN MACHINES
The number of circuits running in a coordinated manner can be increased. This can be within a machine using multiple processors, multiple processing cores within a single processor (known as a multicore processor), or a combination of these two approaches. These processors and cores simply divide the task of processing for the sake of speed and overall capacity, much as two people can make a cake faster if one prepares the frosting while the other prepares the cake. Another way to conduct parallel computing is between devices or components within a device, such as occurs with a mainframe computer. While parallel computing does nothing to promote the further miniaturization of individual components, an intelligently designed architecture will allow the continued miniaturization of the devices. These components shrink by moving the bulk of processing to one location while the output goes to another location. This sounds bizarre and confusing in the abstract, but you are familiar with it in the concrete.
When you run a web search on your cell phone, your phone is not querying its own indexes of web pages, and it is not running the algorithms that underpin the search. The phone and associated software conduct basic processing of your search, or query, and then pass these data to a server cluster located elsewhere. They then receive the output of that cluster's data processing and translate it back into a convenient layout that can be represented on your display. In this way, your phone can know how many copies of this book Amazon has in stock. This processing capability is how that small and unpretentious phone knows what song is playing in your favorite bar, the performance of your stock portfolio, how long your flight is delayed, and the driving directions to the Afghan restaurant in the northern part of Dallas, Texas, about which you have heard such enthusiastic reviews (most likely on the same phone!). One of the authors of this book has access to the following on his smartphone:
• Multiple e-mail services
• Up-to-the-minute readings from Doppler radar belonging to the National Oceanic and Atmospheric Administration
• A photographic representation of every outdoor spot on Earth
• Maps of all of North America and other places around the world
• Updates of the activities of most of his friends via social media
• Multiple ways of accessing music from online sources, both on a subscription and an ownership model
• Near instantaneous sharing of photos taken between his wife and him
• Multiple video streaming services
• The ability to identify a song by name and artist by holding the phone up to a speaker
• Images from traffic cameras lining the roads and intersections near where he lives
• The ability to purchase and immediately access books and music
All of these abilities are dependent on computing power located in servers of unknown location, accessed by his phone using an Internet connection. All of these services require huge amounts of computing power, relying on parallel computing. Parallel computing is ubiquitous both within computers and phones, and in the remote services that these devices access. It is this remote access that bypasses the limitations on what individual microprocessors inside individual computers and phones can accomplish.
When Fred, a Facebook user, visits his friend Anna's page, the heavy-lifting processing necessary to deliver that page to Fred's computer occurs in a data center, which is where the system stores Anna's profile, and where the computing power exists to select only the correct information, subsequently returning it to Fred's computer. Likewise, when Anna sees Fred's new patio furniture in a Facebook post and decides to look for something similar on Amazon, the processing for the heavy lifting that delivers the results of her Amazon search back to her, and runs the payment transactions, takes place in a data center. A similar process is at play for web searches. None of this indexing of web pages or the search for specific terms buried in all of those indexed pages occurs on Fred's or Anna's computers. This is the heavy-duty work our devices outsource to data centers (Figure 2.2).
A single data center may very well be responsible for this processing for users all around the world at any given time, or it may be one of several data centers that each cater to a region. For this reason, data centers are large, warehouse-style buildings located near major sources of electrical power. They are often (but by no means exclusively) found in locations where natural cooling helps release the heat generated by all of the servers within. This is one reason so many data centers are located in the Pacific Northwest of the United States.
Data centers are not limited to Internet firms, however. To handle the deluge of data, the shipping giant UPS has two data centers, a 470,600 ft² facility in Mahwah, New Jersey, and a 172,000 ft² facility near Atlanta, Georgia, that hosts what UPS states is the largest IBM DB2 relational database in the world. A football field (American football) is 57,600 ft², including end zones. These data centers could, between them, fully contain more than 11 such football fields. Between the two data centers, there is also enough air conditioning capacity to cool 3500 homes, along with 60 miles of underground conduit, 7000 backup batteries to maintain power until the generators can kick in during an outage, and 70,000 gallons of fuel to run their generators.[6] The UPS delivery person comes to your door and hands you a small computer with a touch screen upon which you sign using a stylus; these data centers are where those data go, with delivery date and time, along with your signature. In fact, if you order from Amazon you may even sign up to receive a text message when they deliver your package, yet another example of how a data point makes it back to your phone.
FIGURE 2.2
Data center intermediary to users (Fred and Anna each connect through the data center).
Google is famously opaque in providing information about its data centers, although it is opening up with photos of their interiors (it still remains close-mouthed on the size and statistics for its facilities). The Data Center Knowledge website lists what it believes to be 20 Google data centers in the United States and 17 overseas data centers.[7] Google lists only six US data centers and seven overseas data centers on its website. Its page lists data centers that did not make it to the Data Center Knowledge website, such as those in Finland (first phase completed in 2011, with a second phase estimated for completion in 2014), Singapore, and Chile.[8] What we see revealed, regardless of whose estimates we use, is that Google has a truly immense and global data center footprint. It must: it is the dominant Internet search firm, with other operations that include a web-based e-mail offering (Gmail), an online media store (Google Play), its own social media site (Google+), a video hosting service (YouTube), global satellite images (Google Earth), comprehensive mapping abilities (Google Maps), and the firm's bread and butter (Google AdSense). These services have worldwide reach and involve a tremendous amount of processing that the end user never sees.
Facebook is more transparent about its data centers, and they are truly gigantic. We would expect large data centers from a firm with 1.23 billion monthly active users as of December 31, 2013.[9] As this book is being written, the company operates a 333,400 ft² data center in Prineville, Oregon (and is building an identical facility next to it),[10] and a data center of approximately 300,000 ft² near Forest City, North Carolina.[11] The firm plans to construct a behemoth 1.4 million ft² facility, estimated to cost $1.5 billion, near Des Moines, Iowa.[12] There is also a 290,000 ft² facility in Sweden that is being finished.[13] The total square footage of all of Facebook's data centers, both constructed and planned, will be sufficient to fully contain over 46 football fields.
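The football field comparisons for UPS and Facebook are simple division, and a short Python sketch makes the arithmetic easy to re-run. All areas are the figures quoted in the text; the dictionary labels and the helper name are ours. We assume the identical Prineville facility under construction has the same 333,400 ft² area as the first.

```python
# American football field including end zones, per the text (square feet).
FOOTBALL_FIELD_SQFT = 57_600

# Facility areas in square feet, as quoted in the text.
ups_sqft = {
    "Mahwah, New Jersey": 470_600,
    "near Atlanta, Georgia": 172_000,
}
facebook_sqft = {
    "Prineville, Oregon (operating)": 333_400,
    "Prineville, Oregon (identical facility, assumed same area)": 333_400,
    "near Forest City, North Carolina (approximate)": 300_000,
    "near Des Moines, Iowa (planned)": 1_400_000,
    "Sweden (being finished)": 290_000,
}


def football_fields(areas_sqft):
    """Return how many football fields fit into the combined floor area."""
    return sum(areas_sqft.values()) / FOOTBALL_FIELD_SQFT


print(f"UPS: {football_fields(ups_sqft):.1f} football fields")
print(f"Facebook: {football_fields(facebook_sqft):.1f} football fields")
```

Both results agree with the text: a little over 11 fields for UPS and a little over 46 for Facebook.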
The spread of data centers has also given rise to one of the fast-rising buzzwords of the early twenty-first century: the cloud. The cloud, cloud storage, and cloud computing all relate to movement of data storage and manipulation from one's own device, be it desktop, laptop, or mobile device, to a data center somewhere. Businesspeople are wise to be wary of buzzwords, but the cloud is a buzzword with substance behind it.
Processing is also establishing itself in the cloud as part of the software-as-a-service (SaaS) model. Salesforce.com provides serious, market-respected customer relationship management (CRM) to firms who access it through a website. Accounting packages, such as NetSuite and QuickBooks, have moved into the cloud. Though far from being a market changer, Google Docs has established a niche as a method for sharing and collaborating on documents in the cloud. These documents include spreadsheets, word processing, and presentations in a format similar to those of Microsoft Office. Microsoft is also offering cloud-based file sharing and collaboration through its SkyDrive service.
Moving from the public cloud (commercial applications that are used on a subscriber basis) to the private cloud (hosted solutions that are unique to a single customer), more and more companies are developing custom solutions that are hosted remotely in data centers. These solutions are created specifically for the company that initiated the project and are usually not shared with any other company.
We expect cloud computing to proliferate further. First of all, it provides companies a way to add capabilities while transferring the expenditures to operating expenses (OPEX) instead of capital expenses (CAPEX). Second, data centers generally offer a degree of protection and redundancy to data that is not possible when they are stored on a hard drive in the office. Third, data security in the cloud can be superb when it is in the right hands. With proper security, even the employees of the data center are physically unable to access any of the clients' data. Private cloud solutions with no presence on the Internet are, as we discussed, only accessible to the intended client using a secure connection and can be very safe solutions that are, for all intents and purposes, part of the internal information technology (IT) solution. Finally, the cost of administering data on the cloud can be low since the staff administering the hardware are shared among customers of the hosting facility. If you need only a few minutes a week of work on your system, plus an occasional hardware upgrade, your firm does not need to pay full-time staff to handle that. You pay a fee for this benefit, along with the fees paid by other customers, to cover salaries and physical infrastructure.
Along with the growth of data centers is the continued growth in the computational capabilities achieved through the use of parallel computing within a single device. The history of parallel computing is inextricably intertwined with the history of computation itself. It is simply the strategy of breaking a problem up into smaller problems that are then distributed and processed concurrently. This concept will reappear as we delve into greater detail of how big data solutions function.
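For readers who want to see the strategy in miniature, the sketch below breaks one problem (summing the squares of a list of numbers) into chunks, hands each chunk to a separate worker, and combines the partial answers. It is a minimal illustration, not a description of any particular big data product, and the function names are ours. It uses threads from Python's standard library to keep the example self-contained. In standard Python, threads show the structure rather than a real speedup for arithmetic like this, and a process pool (`ProcessPoolExecutor`) would be the usual substitute.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Break one large list into at most n_chunks roughly equal pieces."""
    if not items:
        return []
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The smaller problem each worker solves on its own piece."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=4):
    """Distribute the chunks to workers, then combine the partial results."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


numbers = list(range(1, 1001))
total = parallel_sum_of_squares(numbers)
# The combined answer matches the one-worker, sequential calculation.
print(total == sum_of_squares(numbers))
```

The split, process, and combine steps are deliberately separate functions. The same three-step shape applies whether the workers are cores in one chip or servers in a data center.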
Multiple computers running side by side or the different components of a mainframe may handle parallel computing, but there is a simpler example of parallel computing that is running in most home and office computers and even on smartphones. This is the multicore processor.
A multicore processor is an integrated circuit that contains more than one central processing unit (CPU or core); it splits up processing tasks among these. In home use, the most common multicore processors are currently the dual-core and quad-core processors (the current Apple Macintosh Pro can run dual hexacores!), though some specialized processors may have more than 100 cores. The spread of multicore processors will be discussed in greater detail later in this chapter.
As we write this book, the Tianhe-2 supercomputer was unveiled in China, attaining 33.86 petaflops of processing power with a theoretical peak performance of 54.9 petaflops (a petaflop is 10^15 flops, or FLoating-point Operations Per Second; the performance of a standard desktop computer is measured in gigaflops, one of which is equal to 0.000001 petaflop), toppling the Titan computer at Oak Rid