sqt03 bigdataandhadoop brustredmondevents.com/virtual/vslive/2015/live360or/pdf/sqt03... · sqt03...

17
Live! 360 Orlando 2015 SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and Evangelism Big Data blogger for ZDNet Microsoft Regional Director, MVP Co-chair Visual Studio Live! Redmond Reviewcolumnist for Visual Studio Magazine Twitter: @andrewbrust

Upload: others

Post on 09-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Big Data and Hadoop with Azure HDInsight

Andrew BrustSenior Director,

Technical Product Marketing and EvangelismDatameer

Level: Intermediate

Meet Andrew

• Senior Director,Technical Product Marketing and Evangelism

• Big Data blogger for ZDNet• Microsoft Regional Director, MVP• Co-chair Visual Studio Live!• “Redmond Review” columnist for Visual Studio

Magazine• Twitter: @andrewbrust

Page 2: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Andrew’s New/Old Blog (bit.ly/bigondata)

Read all about it!

Page 3: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

What is Big Data?

• 100s of TB into PB and higher• Involving data from: financial data, sensors, web

logs, social media, etc.• Parallel processing often involved

– Hadoop is emblematic, but other technologies are Big Data too

• Processing of data sets too large for transactional databases– Analyzing interactions, rather than transactions– The three V’s: Volume, Velocity, Variety

• Big Data tech sometimes imposed on small data problems

What’s MapReduce?

• “Big” data input accepted in file form

• Data is partitioned and sent to mappers (nodes in cluster)

• Mappers pre-process data into KV pairs, then all output for (a) given key(s) goes to a reducer

• Reducers aggregate; one line of output per unique key, with one value

• Map and Reduce code natively written as Java functions

Page 4: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

MapReduce, in a Diagram

mapper

mapper

mapper

mapper

mapper

mapper

Input

reducer

reducer

reducer

Input

Input

Input

Input

Input

Input

Output

Output

Output

Output

Output

Output

Output

Input

Input

Input

K1

K2

K3

Output

Output

Output

HDFS

• File system whose data gets distributed over commodity drives on commodity servers

• Data is replicated• If one box goes down, no data lost

– “Shared Nothing”– Except the name node

• BUT: Immutable– Files can only be written to once– So updates require drop + re-write (slow)– You can append though– Like a DVD/CD-ROM

Page 5: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

HBase

• A Wide-Column Store, NoSQL database

• Modeled after Google BigTable

• HBase tables are HDFS files– Therefore, Hadoop-compatible

• Hadoop often used with HBase– But you can use either without the other

• HBase now available on HDInsight– Implemented as different cluster type

The Hadoop Stack

MapReduce, HDFS

Database

RDBMS Import/Export

Query: HiveQL and Pig Latin

Machine Learning/Data Mining

Log file integration

Page 6: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

• Cloudera (CDH)

• MapR– Network File System replaces HDFS

• Hortonworks (HDP)

• Open Data Platform (ODP)– Pivotal HD

• Greenplum IP; full dev stack

– IBM InfoSphere BigInsights• HDFS<->DB2 integration

• And Microsoft…

Hadoop Distributions

Microsoft HDInsight

• Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows

• Windows Azure HDInsight and Microsoft HDInsight Server– Single node preview runs on Windows client– Also Hortonworks HDP for Windows– Also HDInsight with Analytics Platform System

• Includes ODBC Drivers for Hive• All contributed back to open source Apache

project

Page 7: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

HDInsight Recent Changes

• YARN and Tez now available– MapReduce no longer mandatory

• Access via PowerShell and HDInsightcmdlets– Need to install PowerShell for Microsoft Azure

and HDInsight

• Web GUI– For Hive queries and job monitoring

Azure HDInsight Provisioning

• Go to Windows Azure portal• Select HDInsight from left navbar• Click “+ NEW” button @ lower-left• Specify cluster name, number of nodes,

admin password, storage account– Credentials used for ODBC– Optionally, enable RDP access to head node,

with credentials• Click “CREATE HDINSIGHT CLUSTER”• Wait for provisioning to complete• Use PowerShell or

RDP into clustername.azurehdinsight.net

Page 8: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Azure HDInsight Provisioning

Working with HDInsight

• Web GUI

– For Hive queries and job monitoring

• Access via PowerShell and HDInsight cmdlets

– Need to install PowerShell for Microsoft Azure and HDInsight

• RDP into head node

– To clustername.azurehdinsight.net

– Work from (remote) Windows command prompt

Page 9: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Submitting, Running and Monitoring Jobs• Upload a JAR• Use Streaming

– Use other languages (i.e. other than Java) to write MapReduce code

– Python is popular option– Any executable works, even C# console apps– On HDInsight, JavaScript works too– Still uses a JAR file: streaming.jar

• Run at command line (PowerShell or Command window via RDP) passing JAR name and params

Amenities for Visual Studio/.NET

HortonworksData Platform for Windows

.NET SDK for Hadoop

LINQ to Hive

OdbcClient + Hive ODBC Driver

HDInsightEmulator

HDInsightPowerShell Cmdlets

Visual Studio Hadoop Tools for HDInsight

Page 10: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Running MapReduce Jobs

Hive

• Used by most BI products which connect to Hadoop

• Provides a SQL-like abstraction over Hadoop– Officially HiveQL, or HQL

• Works on own tables, but also on HBase• Query generates MapReduce job, output of which

becomes result set• Microsoft has Hive ODBC driver

– Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)

• New! SET hive.execution.engine = Tez

Page 11: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Hive

The “Data-Refinery” Idea

• Use Hadoop to “on-board” unstructured data, then extract manageable subsets

• Load the subsets into conventional DW/BI servers and use familiar analytics tool to examine

• This is the current rationalization of Hadoop + BI tools’ coexistence

• Will it stay this way?

Page 12: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

HDInsight on Linux

• Currently in preview

• Allows interaction with full ecosystem of Linux-based Hadoop tools

• Connect via Apache Ambari or SSH/PuTTY

• Still maps HDFS onto Azure storage

HDInsight Script Steps

• Allows automated customization of an HDInsight cluster

• Uses PowerShell to install additional components

• Pre-built scripts exist for components like Apache Spark

• But you can write your own• AWS Elastic MapReduce has a very similar

feature

Page 13: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Just-in-time Schema• When looking at unstructured data, schema

is imposed at query time• Schema is context specific

– If scanning a book, are the values words, lines, or pages?

– Are notes a single field, or is each word a value?

– Are date and time two fields or one?– Are street, city, state, zip separate or one

value?– Pig and Hive let you determine this at query

time– So does the Map function in MapReduce code

How Does MS BI Fit In?• Excel, PowerPivot: can query via Hive ODBC driver• Analysis Services (SSAS) Tabular Mode

– Also compatible with Hive ODBC Driver• Multidimensional mode is not

• Power Query– Connectivity to HDInsight and standard HDFS

• Power View– Works against PowerPivot and SSAS Tabular

• RDBMS + Parallel Data Warehouse (PDW)– Sqoop connectors– Columnstore Indexes

• Enterprise Edition and PDW only

• APS/PDW: PolyBase and HDInsight Region

Page 14: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Excel, PowerPivot

• Excel and PowerPivot use the BI Semantic Model (BISM), which can query Hadoop via Hive and its ODBC driver

• Excel also features Power Query, which can query HDFS directly and insert the results into a BISM repository

• Excel BISM accommodates millions of rows through compression. Not petabyte scale, but sufficient to store and analyze output of Hadoop queries.

PowerPivot, SSAS Tabular

• SQL Server Analysis Services Tabular mode is the enterprise server implementation of BISM

• Features partitioning and role-based security

• Can store billions of rows. So even better for Hadoop output analysis.

• Excel-based BISM repositories can be upsized to SSAS Tabular

Page 15: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Querying Hadoop from Microsoft BI

Sqoop

• Acronym for “SQL to Hadoop”• Essentially a technology for moving data

between data warehouses and Hadoop• Command line utility; allows specification of

source/target HDFS file and relational server, database and table

• Sqoop connectors available for SQL Server and PDW

• Sqoop generates MapReduce job to extract data from, or insert data into, HDFS

Page 16: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

APS/PDW, PolyBase

• Microsoft Analytics Platform System includes SQL Server Parallel Data Warehouse (PDW)– Massively Parallel Processing (MPP) data warehouse

appliance version of SQL Server– MPP manages a grid of relational database servers

for divide-and-conquer processing of large data sets.• PDW and SQL 2016 include “PolyBase,” a

component which allows PDW to query data in Hadoop directly.– Bypasses MapReduce; addresses data nodes directly

and orchestrates parallelism itself• APS available with or without a built-in, PolyBase-

configured HDInsight “region”

PolyBase Versus Hive, Sqoop

• Hive and Sqoop generate MapReducejobs, and work in batch mode

• PolyBase addresses HDFS data itself

• This is true SQL over Hadoop.

• Competitors:– Cloudera Impala

– Teradata QueryGrid

– Pivotal HAWQ

Page 17: SQT03 BigDataandHadoop Brustredmondevents.com/virtual/vslive/2015/live360or/pdf/SQT03... · SQT03 ‐Big Data and Hadoop with Azure HDInsight ‐Andrew Brust What is Big Data? •

Live! 360 Orlando 2015

SQT03 ‐ Big Data and Hadoop with Azure HDInsight ‐ Andrew Brust

Resources

• Apache Hadoop home page– http://hadoop.apache.org/

• Hive home page– http://hive.apache.org/

• Azure HDInsight• http://bit.ly/azurebigdata

• Microsoft Big Data– http://bit.ly/sql2012bigdata

• Analytics Platform System– http://bit.ly/msapspdw

Thank You!

• Email• [email protected]

• Twitter• @andrewbrust on twitter