Introduction to PolyBase

James Serra, Big Data Evangelist, Microsoft, [email protected]

Upload: james-serra

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Introduction to PolyBase

Introduction to PolyBase

James Serra
Big Data Evangelist, Microsoft
[email protected]

Page 2: Introduction to PolyBase

About Me
• Microsoft, Big Data Evangelist
• In IT for 30 years, worked on many BI and DW projects
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm employee, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
• Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”

Page 3: Introduction to PolyBase

Agenda
• What is PolyBase?
• Scale-out Groups
• Why use PolyBase?
• How to use PolyBase?
• Demo

Page 4: Introduction to PolyBase

What is PolyBase?

Page 5: Introduction to PolyBase

Big Picture

RDBMS | Hadoop

PolyBase: provides a scalable, T-SQL-compatible query processing framework for combining data from both universes

Page 6: Introduction to PolyBase

A little PolyBase trivia

Timeline: 2012 … 2013 … 2014 … 2015 … 2016

Milestones on the timeline:
• PolyBase shipped in SQL Server PDW V2 (now APS)
• PolyBase in SQL Server 16 (CTP3)
• PolyBase in SQL DW
• PolyBase in SQL Server 2016

Page 7: Introduction to PolyBase

PolyBase: Query relational and non-relational data with T-SQL

Capability
T-SQL for querying relational and non-relational data across SQL Server (APS, SQL Server 2016, SQL DW), Hadoop, and Azure blob storage (soon ADLS)

Benefits
• New business insights across your data lake
• Leverage existing skillsets and BI tools
• Faster time to insights and simplified ETL process

Page 8: Introduction to PolyBase

PolyBase Use Cases

Load Data

Use Hadoop as an ETL tool to cleanse data before loading to data warehouse with PolyBase

Interactively Query

Analyze relational data with semi-structured data using split-based query processing

Age-out Data

Age-out data to HDFS and use it as ‘cold’ but query-able storage

Disaster recovery: Several customers use the pattern APS > Blob Storage > SQL DW (all via PolyBase) for DR, using the cloud service
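The load and age-out use cases might look roughly like this in T-SQL on SQL Server 2016. This is only a sketch: every object name (dbo.SensorData_External, dbo.SensorData, dbo.SensorData_Archive, ReadingDate) is hypothetical and not from the deck, and both external tables are assumed to exist already.

```sql
-- Load: copy data that Hadoop has already cleansed into a new local table.
-- dbo.SensorData_External is a hypothetical external table over HDFS files.
SELECT SensorKey, CustomerKey, Speed, ReadingDate
INTO dbo.SensorData
FROM dbo.SensorData_External;

-- Age-out: write cold rows to HDFS through a second hypothetical external table.
-- Writing requires the 'allow polybase export' server option to be enabled.
INSERT INTO dbo.SensorData_Archive
SELECT SensorKey, CustomerKey, Speed, ReadingDate
FROM dbo.SensorData
WHERE ReadingDate < '2014-01-01';

-- Remove the aged-out rows locally once the export has succeeded.
DELETE FROM dbo.SensorData
WHERE ReadingDate < '2014-01-01';
```

After the age-out step the archived rows stay queryable through dbo.SensorData_Archive, which is what makes HDFS usable as 'cold' storage.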

Page 9: Introduction to PolyBase

In SQL Server 16

SQL Server w/ PolyBase

Query: SELECT TOP 10 * FROM SQLServer S JOIN Hadoop H ON S.Key = H.Key

Page 10: Introduction to PolyBase

In SQL Server 16

SQL Server w/ PolyBase

Hadoop

Query: SELECT TOP 10 * FROM SQLServer S JOIN Hadoop H ON S.Key = H.Key

Page 11: Introduction to PolyBase

In SQL Server 16

SQL Server w/ PolyBase

Hadoop

Query and Results: SELECT TOP 10 * FROM SQLServer S JOIN Hadoop H ON S.Key = H.Key

Page 12: Introduction to PolyBase

In SQL Server 16

SQL Server w/ PolyBase

Query: SELECT TOP 10 * FROM SQLServer S JOIN Blob B ON S.Key = B.Key

Page 13: Introduction to PolyBase

In SQL Server 16

SQL Server w/ PolyBase

Azure Blob Storage

Query and Results: SELECT TOP 10 * FROM SQLServer S JOIN Blob B ON S.Key = B.Key

Page 14: Introduction to PolyBase

In SQL Server 16

SQL Server w/ PolyBase

Hadoop | Azure Blob Storage

Query and Results: SELECT TOP 10 * FROM SQLServer S JOIN Hadoop H ON S.Key = H.Key JOIN Blob B ON S.Key = B.Key
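The query on these slides uses placeholder table names. A slightly fuller sketch of the same three-way join is shown here, assuming one local table and two external tables; dbo.DimCustomer, dbo.SensorData_Hadoop, dbo.SensorData_Blob and their columns are all hypothetical names chosen for illustration.

```sql
-- dbo.DimCustomer is a regular local table (hypothetical).
-- dbo.SensorData_Hadoop and dbo.SensorData_Blob are external tables
-- over HDFS and Azure blob storage respectively (both hypothetical).
SELECT TOP 10
    c.CustomerKey,
    h.Speed AS HadoopSpeed,
    b.Speed AS BlobSpeed
FROM dbo.DimCustomer AS c
JOIN dbo.SensorData_Hadoop AS h
    ON c.CustomerKey = h.CustomerKey
JOIN dbo.SensorData_Blob AS b
    ON c.CustomerKey = b.CustomerKey;
```

Once the external tables exist, nothing in the query text distinguishes them from local tables, so existing T-SQL skills and BI tools keep working.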

Page 15: Introduction to PolyBase

Supported data sources for SQL Server 2016

• Option 0: Disable Hadoop connectivity
• Option 1: Hortonworks HDP 1.3 on Windows Server
• Option 1: Azure blob storage (WASB[S])
• Option 2: Hortonworks HDP 1.3 on Linux
• Option 3: Cloudera CDH 4.3 on Linux
• Option 4: Hortonworks HDP 2.0 on Windows Server
• Option 4: Azure blob storage (WASB[S])
• Option 5: Hortonworks HDP 2.0 on Linux
• Option 6: Cloudera 5.1, 5.2, 5.3, 5.4, and 5.5 on Linux
• Option 7: Hortonworks 2.1, 2.2, and 2.3 on Linux
• Option 7: Hortonworks 2.1, 2.2, and 2.3 on Windows Server
• Option 7: Azure blob storage (WASB[S]), HDInsight

sp_configure: https://msdn.microsoft.com/en-us/library/mt143174.aspx

Azure Data Lake Store and HDInsight push down not supported yet. Working on Metanautix integration to add support for additional data sources
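The option numbers above are the values passed to sp_configure for the 'hadoop connectivity' setting. A minimal sketch, assuming option 7 is the one that matches your environment:

```sql
-- Show the current 'hadoop connectivity' value.
EXEC sp_configure @configname = 'hadoop connectivity';

-- Option 7: Hortonworks 2.1-2.3, Azure blob storage (WASB[S]), HDInsight.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;

-- The new value only takes effect after SQL Server and the
-- PolyBase services have been restarted.
```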

Page 16: Introduction to PolyBase

PolyBase Support

PolyBase (works with)      | Azure Blob Store | Push Down | HDInsight | Push Down           | Cloudera | Push Down | HortonWorks | Push Down | Azure Data Lake Store | Push Down
SQL 2016 (Now)             | Yes              | N/A       | Yes       | No                  | Yes      | Yes       | Yes         | Yes       | No                    | N/A
SQL 2016 (Near future)     | Yes              | N/A       | Yes       | No                  | Yes      | Yes       | Yes         | Yes       | No                    | N/A
Azure SQL DW (Now)         | Yes              | N/A       | Yes       | No                  | No       | No        | No          | No        | Yes!                  | N/A
Azure SQL DW (Near future) | Yes              | N/A       | Yes       | No                  | No       | No        | No          | No        | Yes                   | N/A
APS (Now)                  | Yes              | N/A       | Yes       | Yes (int). No (ext) | Yes      | Yes       | Yes         | Yes       | No                    | N/A
APS (Near future)          | Yes              | N/A       | Yes       | Yes/No              | Yes      | Yes       | Yes         | Yes       | No                    | N/A


Push Down creates MapReduce job to query file and returns just the results.

PolyBase offers ability to create statistics on tables (but they are not auto-created or auto-updated).
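Both points can be sketched in T-SQL. The example assumes the external table dbo.Customer defined on the "Creating External Tables" slide later in this deck; the statistics name and the filter value are made up for illustration.

```sql
-- Statistics on an external table must be created (and refreshed) by hand.
CREATE STATISTICS Stat_Customer_CustomerKey
ON dbo.Customer (CustomerKey) WITH FULLSCAN;

-- Ask for the computation to be pushed down to Hadoop as a MapReduce job,
-- so only the aggregated results travel back to SQL Server.
SELECT CustomerKey, AVG(Speed) AS AvgSpeed
FROM dbo.Customer
WHERE Speed > 65
GROUP BY CustomerKey
OPTION (FORCE EXTERNALPUSHDOWN);

-- The opposite hint streams the raw rows and processes them in SQL Server.
SELECT CustomerKey, AVG(Speed) AS AvgSpeed
FROM dbo.Customer
WHERE Speed > 65
GROUP BY CustomerKey
OPTION (DISABLE EXTERNALPUSHDOWN);
```

Without either hint the optimizer decides for itself, which is one reason manually created statistics matter.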

Page 17: Introduction to PolyBase

PolyBase supported file formats

• Delimited text (UTF-8)
• Hive RCFile
• Hive ORC
• Parquet
• gzip, zlib, Snappy compressed files

Does not support:
• UTF-16
• Extended ASCII
• Fixed-file format
• WinZip
• JSON
• XML
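Each supported format is declared once with CREATE EXTERNAL FILE FORMAT. The sketch below shows one definition per non-text format; the format names (RcFileFormat, OrcSnappy, ParquetGzip) are hypothetical, and the codec choices are just examples.

```sql
-- Hive RCFile: a SerDe method must be named.
CREATE EXTERNAL FILE FORMAT RcFileFormat
WITH (
    FORMAT_TYPE = RCFILE,
    SERDE_METHOD = 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
);

-- Hive ORC with Snappy compression.
CREATE EXTERNAL FILE FORMAT OrcSnappy
WITH (
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- Parquet with gzip compression.
CREATE EXTERNAL FILE FORMAT ParquetGzip
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);
```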

Page 18: Introduction to PolyBase

PolyBase Scale-Out Groups in SQL Server 2016

https://msdn.microsoft.com/en-us/library/mt607030.aspx

Allows you to create a cluster of SQL Server instances to process large data sets from external data sources in a scale-out fashion for better query performance
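Joining an instance to a scale-out group is done with a stored procedure run on the instance that will become a compute node. A rough sketch follows; the machine name HEADNODE01 is hypothetical, and the port and instance name are assumed to be the defaults.

```sql
-- Run on each instance that should become a compute node.
-- Arguments: head node machine name, head node DMS control channel port
-- (16450 is assumed here as the default), head node SQL Server instance name.
EXEC sp_polybase_join_group 'HEADNODE01', 16450, 'MSSQLSERVER';

-- The PolyBase services on the compute node then need to be cycled;
-- see the MSDN page linked above for the exact steps.

-- On the head node: list the nodes that are part of the group.
SELECT * FROM sys.dm_exec_compute_nodes;
```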

Page 19: Introduction to PolyBase

Why use PolyBase?

• Ability to integrate SQL Server with data stored in HDFS or Windows Azure Storage BLOB
• Commodity hardware and storage are cheap, easily distributed on HDFS; increases data reliability at a low cost
• Increasing number of different types of data: structured, unstructured, semi-structured. Can have them stored on the best-suited system and queried in one place
• Increasing size of data and strong aversion to data deletion due to company culture or restrictions

Page 20: Introduction to PolyBase

How to use PolyBase?

• Must install PolyBase engine and prerequisites (Java)
• Enable “Hadoop connectivity”
• Create Credential to connect to Azure BLOB storage
• Create External Data Source
• Create External File Format
• Create External Table/View
• Enable “Allow PolyBase Export” (to write to Hadoop/WASB)
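For Azure blob storage, the first few steps might look like the sketch below. The credential and data source names are hypothetical, and the password, storage account key, container, and account values are placeholders to fill in.

```sql
-- Confirm the PolyBase feature is installed (returns 1 if it is).
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

-- A database master key must exist before a database scoped credential
-- can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- Credential for Azure blob storage: IDENTITY is not used for authentication,
-- SECRET is the storage account key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

-- External data source pointing at one container in the storage account.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- Allow INSERT into external tables (writing to Hadoop/WASB).
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
```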

Page 21: Introduction to PolyBase

Creating External Tables (secure Hadoop)

-- Once per Hadoop User
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
WITH IDENTITY = 'hadoopUserName', SECRET = 'hadoopPassword';

-- Once per Hadoop Cluster per user
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.193.26.177:8020',
    RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050',
    CREDENTIAL = HadoopCredential
);

-- Once per File Format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

-- LOCATION is the HDFS File Path
CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [Speed] float NOT NULL
)
WITH (
    LOCATION = '//Sensor_Data//May2014/',
    DATA_SOURCE = HadoopCluster,
    FILE_FORMAT = TextFile
);

Page 22: Introduction to PolyBase

Demos
• Using PolyBase to connect to Windows Azure Storage BLOB
• Reading data using PolyBase
• Writing data using PolyBase
• Creating JSON formatted output
• Reading JSON data from WASB

Page 23: Introduction to PolyBase

Resources
• PolyBase guide: https://msdn.microsoft.com/en-us/library/mt143171.aspx
• Azure SQL Data Warehouse loading patterns and strategies: http://bit.ly/1XskZL2

Page 24: Introduction to PolyBase

Q & A ?
James Serra, Big Data Evangelist
Email me at: [email protected]
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted via the “Presentations” link on the top menu)