
Introduction to PolyBase

James Serra
Big Data Evangelist
[email protected]

About Me

• Microsoft, Big Data Evangelist
• In IT for 30 years, worked on many BI and DW projects
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm employee, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
• Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”

Agenda

• What is PolyBase?
• Scale-out Groups
• Why use PolyBase?
• How to use PolyBase?
• Demo

What is PolyBase?

Big Picture

Two universes: RDBMS and Hadoop

PolyBase provides a scalable, T-SQL compatible query processing framework for combining data from both universes.

A little PolyBase trivia

Timeline, 2012 to 2016:

• PolyBase shipped in SQL Server PDW V2 (now APS)
• PolyBase in SQL Server 16 (CTP3)
• PolyBase in SQL DW
• PolyBase in SQL Server 2016

PolyBase: Query relational and non-relational data with T-SQL

Capability

T-SQL for querying relational and non-relational data across SQL Server (APS, SQL Server 2016, SQL DW), Hadoop, and Azure blob storage (soon ADLS)

Benefits

• New business insights across your data lake
• Leverage existing skillsets and BI tools
• Faster time to insights and simplified ETL process

PolyBase Use Cases

Load Data

Use Hadoop as an ETL tool to cleanse data before loading to data warehouse with PolyBase

Interactively Query

Analyze relational data with semi-structured data using split-based query processing

Age-out Data

Age-out data to HDFS and use it as ‘cold’ but query-able storage

Disaster recovery: Several customers use an APS > Blob Storage > SQL DW pipeline (all via PolyBase) for DR, relying on the cloud service.
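Once an external table exists, the three main use cases above reduce to ordinary T-SQL. The following is a minimal sketch, not the presenter's demo: dbo.SensorData_Ext and dbo.SensorData_Archive_Ext are hypothetical external tables, dbo.SensorData and dbo.CustomerDim are hypothetical local tables, and all column names are assumptions.

```sql
-- Load data: copy rows that were cleansed in Hadoop into a local table.
-- dbo.SensorData_Ext is a hypothetical external table over HDFS files.
INSERT INTO dbo.SensorData (SensorKey, CustomerKey, Speed, ReadingDate)
SELECT SensorKey, CustomerKey, Speed, ReadingDate
FROM dbo.SensorData_Ext;

-- Interactively query: join local relational data with external data.
SELECT TOP 10 c.CustomerName, e.SensorKey, e.Speed
FROM dbo.CustomerDim AS c
JOIN dbo.SensorData_Ext AS e
    ON c.CustomerKey = e.CustomerKey
WHERE e.Speed > 65;

-- Age-out data: export old rows to external storage, then delete them locally.
-- Writing to an external table needs the 'allow polybase export' option enabled.
INSERT INTO dbo.SensorData_Archive_Ext
SELECT SensorKey, CustomerKey, Speed, ReadingDate
FROM dbo.SensorData
WHERE ReadingDate < '2014-01-01';

DELETE FROM dbo.SensorData
WHERE ReadingDate < '2014-01-01';
```

After the export, the aged-out rows stay query-able through dbo.SensorData_Archive_Ext, which is what makes HDFS usable as 'cold' storage.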

In SQL Server 16

SQL Server w/ PolyBase sends one T-SQL query that joins a local table with data in Hadoop, in Azure Blob Storage, or in both, and returns the query results to the client.

Query (Hadoop):
SELECT TOP 10 * FROM SQLServer S JOIN Hadoop H ON S.[Key] = H.[Key]

Query (Azure Blob Storage):
SELECT TOP 10 * FROM SQLServer S JOIN Blob B ON S.[Key] = B.[Key]

Query (Hadoop and Azure Blob Storage):
SELECT TOP 10 * FROM SQLServer S JOIN Hadoop H ON S.[Key] = H.[Key] JOIN Blob B ON S.[Key] = B.[Key]

Supported data sources for SQL Server 2016

Hadoop connectivity options, set with sp_configure (https://msdn.microsoft.com/en-us/library/mt143174.aspx):

• Option 0: Disable Hadoop connectivity
• Option 1: Hortonworks HDP 1.3 on Windows Server
• Option 1: Azure blob storage (WASB[S])
• Option 2: Hortonworks HDP 1.3 on Linux
• Option 3: Cloudera CDH 4.3 on Linux
• Option 4: Hortonworks HDP 2.0 on Windows Server
• Option 4: Azure blob storage (WASB[S])
• Option 5: Hortonworks HDP 2.0 on Linux
• Option 6: Cloudera 5.1, 5.2, 5.3, 5.4, and 5.5 on Linux
• Option 7: Hortonworks 2.1, 2.2, and 2.3 on Linux
• Option 7: Hortonworks 2.1, 2.2, and 2.3 on Windows Server
• Option 7: Azure blob storage (WASB[S]), HDInsight

Azure Data Lake Store and HDInsight push down are not supported yet. Work is under way on Metanautix integration to add support for additional data sources.
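The option number is applied through the 'hadoop connectivity' setting of sp_configure. A short sketch follows, assuming the target is a Hortonworks 2.x cluster or Azure blob storage (option 7); substitute the option that matches your distribution.

```sql
-- Show the current setting; run_value is what the instance is using now.
EXEC sp_configure @configname = 'hadoop connectivity';

-- Assumption: option 7 (Hortonworks 2.1-2.3, Azure blob storage, HDInsight).
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;
```

The new value only takes effect after SQL Server and the PolyBase services are restarted.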

PolyBase Support

PolyBase (works with)      | Azure Blob Store | Push Down | HDInsight | Push Down           | Cloudera | Push Down | Hortonworks | Push Down | Azure Data Lake Store | Push Down
SQL 2016 (Now)             | Yes              | N/A       | Yes       | No                  | Yes      | Yes       | Yes         | Yes       | No                    | N/A
SQL 2016 (Near future)     | Yes              | N/A       | Yes       | No                  | Yes      | Yes       | Yes         | Yes       | No                    | N/A
Azure SQL DW (Now)         | Yes              | N/A       | Yes       | No                  | No       | No        | No          | No        | Yes!                  | N/A
Azure SQL DW (Near future) | Yes              | N/A       | Yes       | No                  | No       | No        | No          | No        | Yes                   | N/A
APS (Now)                  | Yes              | N/A       | Yes       | Yes (int), No (ext) | Yes      | Yes       | Yes         | Yes       | No                    | N/A
APS (Near future)          | Yes              | N/A       | Yes       | Yes/No              | Yes      | Yes       | Yes         | Yes       | No                    | N/A

Push down creates a MapReduce job that queries the file on the Hadoop cluster and returns just the results to SQL Server.
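Push down can also be requested or suppressed per query with a query hint. The sketch below assumes the [dbo].[Customer] external table from the deck's own "Creating External Tables" example and an external data source that specifies RESOURCE_MANAGER_LOCATION; the Speed threshold is an arbitrary illustration.

```sql
-- Ask PolyBase to push the filter to Hadoop as a MapReduce job,
-- so only matching rows travel back to SQL Server.
SELECT SensorKey, CustomerKey, Speed
FROM [dbo].[Customer]
WHERE Speed > 65
OPTION (FORCE EXTERNALPUSHDOWN);

-- Same query, but stream the raw rows to SQL Server and filter there.
SELECT SensorKey, CustomerKey, Speed
FROM [dbo].[Customer]
WHERE Speed > 65
OPTION (DISABLE EXTERNALPUSHDOWN);
```

Without a hint, the optimizer decides whether push down is worthwhile, which is one reason statistics on external tables matter.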

PolyBase offers the ability to create statistics on external tables (but they are not auto-created or auto-updated).
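Because statistics are neither auto-created nor auto-updated, they have to be created by hand. A minimal sketch, again assuming the [dbo].[Customer] external table from the deck's example; the statistics name is hypothetical.

```sql
-- Create statistics on a column used in joins or filters.
CREATE STATISTICS Stats_Customer_CustomerKey
ON [dbo].[Customer] (CustomerKey)
WITH FULLSCAN;

-- To refresh after the external data changes, drop and re-create.
DROP STATISTICS [dbo].[Customer].Stats_Customer_CustomerKey;

CREATE STATISTICS Stats_Customer_CustomerKey
ON [dbo].[Customer] (CustomerKey)
WITH FULLSCAN;
```

Creating statistics reads the external data, so on large files it may be worth limiting them to the columns that drive joins and predicates.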

PolyBase supported file formats

• Delimited text (UTF-8)
• Hive RCFile
• Hive ORC
• Parquet
• gzip, zlib, Snappy compressed files

Does not support:

• UTF-16
• Extended ASCII
• Fixed-file format
• WinZip
• JSON
• XML
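Each supported format is declared once with CREATE EXTERNAL FILE FORMAT. The sketch below covers the non-text formats from the list; the format names are hypothetical and the compression codecs are one possible choice, not a recommendation from the deck.

```sql
-- Parquet files compressed with Snappy.
CREATE EXTERNAL FILE FORMAT ParquetSnappyFormat
WITH (FORMAT_TYPE = PARQUET,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec');

-- Hive ORC files compressed with Snappy.
CREATE EXTERNAL FILE FORMAT OrcSnappyFormat
WITH (FORMAT_TYPE = ORC,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec');

-- Hive RCFile; a SerDe method must be named for this format.
CREATE EXTERNAL FILE FORMAT RcFileFormat
WITH (FORMAT_TYPE = RCFILE,
      SERDE_METHOD = 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe',
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec');
```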

PolyBase Scale-Out Groups in SQL Server 2016


Allows you to create a cluster of SQL Server instances to process large data sets from external data sources in a scale-out fashion for better query performance
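A scale-out group is formed by joining additional SQL Server instances (compute nodes) to a head node. The following is a sketch under assumptions: HEADNODE01 is a hypothetical machine name, 16450 is the default data movement control channel port, and MSSQLSERVER is the default instance name.

```sql
-- Run on each instance that should become a compute node.
-- Arguments: head node machine name, head node control channel port,
-- head node SQL Server instance name.
EXEC sp_polybase_join_group N'HEADNODE01', 16450, N'MSSQLSERVER';

-- Restart the PolyBase services on the compute node, then run this
-- on the head node to list the members of the group.
SELECT * FROM sys.dm_exec_compute_nodes;
```

A node can be removed again with sp_polybase_leave_group, followed by another restart of its PolyBase services.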


Why use PolyBase?

• Ability to integrate SQL Server with data stored in HDFS or Windows Azure Storage BLOB
• Commodity hardware and storage are cheap and easily distributed on HDFS, which increases data reliability at a low cost
• Increasing number of different types of data: structured, unstructured, semi-structured. Each type can be stored on the most suitable system and queried in one place
• Increasing size of data and strong aversion to data deletion due to company culture or restrictions

How to use PolyBase?

• Must install PolyBase engine and prerequisites (Java)
• Enable “Hadoop connectivity”
• Create Credential to connect to Azure BLOB storage
• Create External Data Source
• Create External File Format
• Create External Table/View
• Enable “Allow PolyBase Export” (to write to Hadoop/WASB)
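For Azure blob storage, the credential, data source, and export steps could look like the sketch below. The object names AzureStorageCredential and AzureStorage are hypothetical, and every value in angle brackets is a placeholder to fill in.

```sql
-- Check that the PolyBase feature is installed (returns 1 when it is).
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

-- A database master key is needed before a scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- For blob storage the SECRET is the storage account access key;
-- the IDENTITY value is not used for authentication.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account access key>';

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
      CREDENTIAL = AzureStorageCredential);

-- Allow INSERT into external tables (writing to Hadoop/WASB).
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
```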

Creating External Tables (secure Hadoop)

-- Once per Hadoop user
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
WITH IDENTITY = 'hadoopUserName', SECRET = 'hadoopPassword';

-- Once per Hadoop cluster per user
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP,
      LOCATION = 'hdfs://10.193.26.177:8020',
      RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050',
      CREDENTIAL = HadoopCredential);

-- Once per file format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));

-- LOCATION is the HDFS file path
CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [Speed] float NOT NULL)
WITH (LOCATION = '//Sensor_Data//May2014/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TextFile);
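After these statements run, the new objects can be checked through the catalog views and the table queried like any other. This is a small verification sketch, not part of the original deck.

```sql
-- Confirm the external objects exist and point where expected.
SELECT name, location, resource_manager_location
FROM sys.external_data_sources;

SELECT name, format_type, field_terminator, data_compression
FROM sys.external_file_formats;

SELECT name, location
FROM sys.external_tables;

-- Read a few rows from HDFS through the external table.
SELECT TOP 10 SensorKey, CustomerKey, Speed
FROM [dbo].[Customer];
```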

Demos

• Using PolyBase to connect to Windows Azure Storage BLOB
• Reading data using PolyBase
• Writing data using PolyBase
• Creating JSON formatted output
• Reading JSON data from WASB

Resources

• PolyBase guide: https://msdn.microsoft.com/en-us/library/mt143171.aspx
• Azure SQL Data Warehouse loading patterns and strategies: http://bit.ly/1XskZL2

Q & A ?

James Serra, Big Data Evangelist

• Email me at: [email protected]
• Follow me at: @JamesSerra
• Link to me at: www.linkedin.com/in/JamesSerra
• Visit my blog at: JamesSerra.com (where this slide deck is posted via the “Presentations” link on the top menu)
