Democratizing Big Data with Microsoft Azure HDInsight
Saptak Sen, Solution Engineering Manager, Hortonworks (@saptak)
Nishant Thacker, Technical Product Manager – Big Data, Microsoft (@nishantthacker)
Hortonworks + Microsoft: Together Since 2012
"At Hortonworks we have seen more and more Hadoop related workloads and applications move to the cloud. Starting in HDP 2.6, we are adopting a “Cloud First” strategy in which our platform will be available on our cloud platforms – Azure HDInsight at the same time or even before it is available on traditional on-premises settings. With this in mind, we are very excited that Microsoft and Hortonworks will empower Azure HDInsight customers to be the first to benefit from our HDP 2.6 innovation in the near future." - Arun Murthy, co-founder, Hortonworks (February, 2017)
“Operating a fully managed cloud service like Azure HDInsight, which is backed by an enterprise grade SLA, requires that we can deploy the latest bits of Hadoop & Apache Spark on demand. To that end, we are excited that the latest Hortonworks Data Platform 2.6 will be continuously available to Azure HDInsight even before its on-premise release. Hortonworks’ commitment to being cloud first is especially significant given the growing importance of cloud with Hadoop and Spark workloads.” - Dharma Shukla, Distinguished Engineer and General Manager at Microsoft. (February, 2017)
Big Data in the Cloud
Traditional Clusters
Challenges with implementing clusters
Hadoop Clusters in the Cloud
Why Hadoop in the cloud?
Hadoop/Spark Clusters
Distributed Storage
• Files split across storage
• Files replicated
• Nearest node responds
• Abstracted Administration
Extensible
• APIs to extend functionality
• Add new capabilities
• Allow for inclusion in custom environments
Automated Failover
• Unmonitored failover to replicated data
• Built for resiliency
• Metadata stored for later retrieval
Hyper-Scale
• Add resources as desired
• Built to include commodity configs
• Direct correlation of performance and resources
Distributed Compute
• Distributed processing
• Resource Utilization
• Cost-Efficient method calls
Cloud
Same five properties as listed for Hadoop/Spark Clusters: Distributed Storage, Extensible, Automated Failover, Hyper-Scale, Distributed Compute.
Big Data in the Cloud
The same five properties again: Distributed Storage, Extensible, Automated Failover, Hyper-Scale, Distributed Compute.
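The Distributed Storage and Automated Failover bullets (files split across storage, files replicated, nearest node responds, unmonitored failover to replicated data) can be illustrated with a toy model. This is a minimal Python sketch, not HDFS code. The function names, the round-robin placement, and the numeric "distance" table are assumptions made for illustration; real HDFS placement is rack-aware and uses far larger blocks.

```python
from __future__ import annotations


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes.

    Assumption: simple round-robin placement, chosen only to keep the
    sketch short. It is not how a real cluster places replicas.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def pick_replica(block: int, placement: dict[int, list[str]],
                 distance: dict[str, int],
                 failed: frozenset[str] | set[str] = frozenset()) -> str:
    """Return the closest node that holds the block and is still alive.

    Skipping failed nodes is the 'failover to replicated data' step.
    """
    live = [node for node in placement[block] if node not in failed]
    if not live:
        raise RuntimeError(f"all replicas of block {block} are unavailable")
    return min(live, key=lambda node: distance[node])


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]
    distance = {"node-a": 1, "node-b": 2, "node-c": 3, "node-d": 4}  # hypothetical

    blocks = split_into_blocks(b"0123456789", block_size=4)
    placement = place_replicas(len(blocks), nodes)

    print(pick_replica(1, placement, distance))                     # nearest replica
    print(pick_replica(1, placement, distance, failed={"node-b"}))  # after a failure
```

The reader never names a node: it asks for a block and the nearest live replica answers. That is the sense in which administration is abstracted and failover needs no operator.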
HDInsight Provides Purpose-built Cluster Types

Cluster Type       Components
Hadoop             HDFS, MapReduce2, YARN, Tez, Hive, Pig, Sqoop, Oozie, Zookeeper, Ambari Metrics, Slider
HBase              HDFS, MapReduce2, YARN, Tez, Hive, HBase, Phoenix Query Server, Pig, Sqoop, Oozie, Zookeeper, Ambari Metrics
Storm              HDFS, MapReduce2, YARN, Tez, Hive, Pig, Sqoop, Oozie, Zookeeper, Storm, Ambari Metrics, Kafka
Spark              HDFS, MapReduce2, YARN, Tez, Hive, Pig, Sqoop, Oozie, Zookeeper, Ambari Metrics, Spark, Zeppelin, Livy
Interactive Hive   HDFS, MapReduce2, YARN, Tez, Hive2 LLAP, Pig, Sqoop, Oozie, Zookeeper, Ambari Metrics, Slider
R Server           HDFS, MapReduce2, YARN, Tez, Hive, Pig, Sqoop, Oozie, Zookeeper, Ambari Metrics, Spark, Livy
Kafka              HDFS, MapReduce2, YARN, Tez, Hive, Pig, Sqoop, Oozie, Zookeeper, Ambari Metrics, Kafka

• Components marked in RED are the components that drive the cluster type use case
• Spark clusters also have Jupyter installed
• All clusters come HA enabled by default
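As a rough aid for reading the table, the component lists can be held in a small lookup structure. The sketch below is Python written for this transcript, not an HDInsight API. Splitting each row into a shared base plus extras is an assumption made here for brevity, and the names `BASE_COMPONENTS`, `CLUSTER_EXTRAS`, `components_of`, and `clusters_with` are hypothetical.

```python
# Components that appear in every row of the table above.
BASE_COMPONENTS = [
    "HDFS", "MapReduce2", "YARN", "Tez", "Pig",
    "Sqoop", "Oozie", "Zookeeper", "Ambari Metrics",
]

# Remaining components per cluster type, copied from the table rows.
CLUSTER_EXTRAS = {
    "Hadoop": ["Hive", "Slider"],
    "HBase": ["Hive", "HBase", "Phoenix Query Server"],
    "Storm": ["Hive", "Storm", "Kafka"],
    "Spark": ["Hive", "Spark", "Zeppelin", "Livy"],
    "Interactive Hive": ["Hive2 LLAP", "Slider"],
    "R Server": ["Hive", "Spark", "Livy"],
    "Kafka": ["Hive", "Kafka"],
}


def components_of(cluster_type: str) -> list[str]:
    """Full component list for one cluster type (base plus extras)."""
    return BASE_COMPONENTS + CLUSTER_EXTRAS[cluster_type]


def clusters_with(component: str) -> list[str]:
    """Cluster types whose component list includes `component`."""
    return [name for name in CLUSTER_EXTRAS if component in components_of(name)]


if __name__ == "__main__":
    print(clusters_with("Kafka"))  # cluster types that ship Kafka
    print(clusters_with("Livy"))   # cluster types that ship Livy
```

A lookup like this answers "which cluster type do I need for component X" without scanning the table. It does not recover the red highlighting, which the transcript has lost.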
Big Data in the Cloud - Options
Scenarios for deploying as hybrid
Traditional Clusters – On Prem
Hadoop Cluster
Diagram labels: Client, Job (jar) file, Master Node, Worker Node, TaskTracker, Tasks, HDFS.
Clusters in the Cloud
Azure HDInsight
Hadoop and Spark as a Service on Azure
Fully managed Hadoop and Spark for the cloud
100% Open Source Hortonworks Data Platform
Clusters up and running in minutes
Managed, monitored and supported by Microsoft with the industry’s best enterprise SLA
Use familiar BI tools for analysis, or open source notebooks for interactive data science
63% lower total cost of ownership than deploying your own Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
HDInsight Cluster
Diagram labels: HDInsight cluster, Head node, Data node, Back-up, Domain credentials, Azure Data Lake Storage, Azure Storage Blob.
HDInsight Cluster Security
Diagram labels: AAD tenant, Azure VNET to VNET peering, HDInsight Cluster, Head node, Data node, Back-up, Domain credentials, Azure Data Lake Storage, Azure Storage Blob.
Decoupling - Benefits
What’s New in HDInsight 3.6
• HDInsight 3.6 GA announced during DataWorks Summit Munich
• “HDInsight 3.6 has the latest Hortonworks Data Platform (HDP) 2.6 platform, a collaborative effort between Microsoft and Hortonworks to bring HDP to market cloud-first.”
• https://azure.microsoft.com/en-us/blog/announcing-general-availability-of-azure-hdinsight-3-6/
• Interactive Hive improvements
• Spark 2.1 GA*
• Zeppelin added to Spark Cluster Type
• Improved cluster creation time
*GA means clusters are backed by Azure SLA
Big Data Application Architecture
The Azure Architecture
Diagram stages: Ingestion, Backend, Frontend.
Diagram labels: Source A, Source B, Source C, Source D, Data Factory, PowerShell, Stream Analytics, Push Stream, Azure Data Lake Store, HDInsight, Azure Data Lake Analytics, Azure SQL Data Warehouse, Azure Analysis Services, DAX, T-SQL, HiveQL, Analyst.
The Azure Architecture - Detailed
Example: Big Data in Telco
Telarix uses big data to help maintain call quality
“Carriers are going to create new wireless applications and offerings (voice, video, MMS, or whatever the next great application is) and our customers’ networks need to be able to support this.”
Vic Bozzo, Senior VP of Worldwide Sales and Marketing
Scenario
Telarix helps telecommunications carriers worldwide maintain call quality, manage costs, and streamline traffic. Telarix’s suite handles traffic and quality management, trading, routing, billing, and settlement for more than 300 billion voice, SMS, content, and data minutes each year.
Solution
Telarix used SQL Server and Azure HDInsight with the ability to analyze large volumes of structured and unstructured data in real time.
Result
• Keep up with carriers who are creating new wireless applications and offerings, such as voice, video, and MMS. Telarix will provide these carriers the same business process to trade, route, settle, manage, invoice, bill, and collect across all of their services.
Example: Big Data used for targeted customer advertisement
Linkury uses big data to make online content discovery profitable for search and social engines, publishers, and marketers
Scenario
Linkury is a tool that helps monetize the online advertising market. They needed to analyze hundreds of millions of web traffic events each day to help build targeted advertising based on customer behavior.
Solution
Azure HDInsight (Hadoop-as-a-service) with Storm for HDInsight to analyze real-time data in Hadoop.
Result
• Linkury now captures hundreds of millions of web traffic events in real time, including how users browse, their actions, and how they interact with the device, products, etc., to display targeted online advertisements.
• Can now show advertising effectiveness through third-party BI tools that show key metrics
“We had gained a lot of traffic, but we couldn’t really manage and analyze the data in real time. Now we have regained control, which means, for example, that we can spend more time analyzing fraud or ad campaigns that are performing poorly”
Kobi Eldar, CTO
Example: Big Data used for connected cars
Delphi Automotive uses big data for car owners to keep tabs on their cars
“With Delphi Connect, car owners can find out how close to home their spouse is so they can put the finishing touches on dinner. They can keep tabs on teenage drivers by setting up geo-fences. If the car goes outside of a geo-fence or drives faster than a specified speed limit, mom or dad receives an email or text message.”
Victor Canseco, Managing Director
Scenario
Delphi, a leading global supplier of technologies for the automotive industry, introduced Delphi Connect, an after-market connected-car product that lets drivers digitally interact with their cars through smartphones, tablets, and PCs.
Solution
Azure HDInsight and SQL Server in an Internet of Things (IoT) scenario for capturing and analyzing data from cars (vehicle diagnostics, geo-fencing, geo-location, mileage tracking, Bluetooth). Also uses Azure Service Bus and SQL Database to understand geo-fencing around a map.
Result
• Drivers can now understand information on their cars, like how they were driven, where they parked, the route they took, duration, and mileage. They also get real-time information on what other drivers are doing with their car.
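The geo-fence and speed-limit rule described in the quote can be sketched in a few lines. This is an illustrative Python example only, not Delphi's implementation. It assumes a circular fence defined by a center and a radius, uses the haversine great-circle formula for distance, and every name in it (`GeoFence`, `check_reading`, the alert strings) is hypothetical.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


@dataclass(frozen=True)
class GeoFence:
    """Assumption: a circular fence; real products may use polygons."""
    center_lat: float
    center_lon: float
    radius_km: float


def check_reading(fence: GeoFence, lat: float, lon: float,
                  speed_kmh: float, speed_limit_kmh: float) -> list[str]:
    """Return the alerts one telemetry reading would trigger (possibly none)."""
    alerts = []
    distance = haversine_km(fence.center_lat, fence.center_lon, lat, lon)
    if distance > fence.radius_km:
        alerts.append("outside geo-fence")
    if speed_kmh > speed_limit_kmh:
        alerts.append("over speed limit")
    return alerts


if __name__ == "__main__":
    home = GeoFence(center_lat=0.0, center_lon=0.0, radius_km=10.0)
    # One hypothetical reading: about 111 km east of the center at 120 km/h.
    print(check_reading(home, 0.0, 1.0, speed_kmh=120, speed_limit_kmh=100))
```

In a deployment like the one described, each non-empty result would be handed to a notification step (email or text message). That step is outside this sketch.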
Summary
Call to Action
Points to remember
CONNECT
• Contacts:
  • [email protected]
• Docs and Forums:
  • https://docs.microsoft.com/en-us/azure/hdinsight/
  • https://azure.microsoft.com/en-us/support/forums/
Connect and voice your customers’ opinion
Ramp up on our new services NOW!!
EVOLVE
• Know more
  • http://www.microsoft.com/hdinsight
• Leverage free trial on Azure
  • https://azure.microsoft.com/en-us/free/
• Try Hortonworks Sandbox on Azure
  • http://hortonworks.com/sandbox
LEARN
• http://learnanalytics.microsoft.com/
• Trainings on
  • Spark in Azure HDInsight
  • Azure HDInsight Administration and Security
  • R Server on Azure HDInsight
© 2016 Microsoft Corporation. All rights reserved.