Introduction to PolyBase
James Serra, Big Data Evangelist
[email protected]
About Me
• Microsoft, Big Data Evangelist
• In IT for 30 years, worked on many BI and DW projects
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm employee, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
• Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
• What is PolyBase?
• Scale-out Groups
• Why use PolyBase?
• How to use PolyBase?
• Demo
What is PolyBase?
Big Picture
RDBMS | Hadoop
PolyBase: Provides a scalable, T-SQL compatible query processing framework for combining data from both universes
A little PolyBase trivia
Timeline, 2012 through 2016 (milestones as listed on the slide):
• PolyBase shipped in SQL Server PDW V2 (now APS)
• PolyBase in SQL Server 2016 (CTP3)
• PolyBase in SQL DW
• PolyBase in SQL Server 2016
PolyBase
Query relational and non-relational data with T-SQL
Capability
• T-SQL for querying relational and non-relational data across SQL Server (APS, SQL Server 2016, SQL DW), Hadoop, and Azure blob storage (soon ADLS)
Benefits
• New business insights across your data lake
• Leverage existing skillsets and BI tools
• Faster time to insights and simplified ETL process
PolyBase Use Cases
• Load data: Use Hadoop as an ETL tool to cleanse data before loading it into the data warehouse with PolyBase
• Interactively query: Analyze relational data together with semi-structured data using split-based query processing
• Age-out data: Age out data to HDFS and use it as ‘cold’ but queryable storage
• Disaster recovery: Several of our customers use APS > Blob Storage > SQL DW (all via PolyBase) as a DR pattern (using the cloud service)
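The load and age-out use cases can be sketched in T-SQL roughly as follows. This is a minimal illustration, not the presenter's code: every table name here (dbo.SensorReadings_ext, dbo.SensorReadings, dbo.Orders, dbo.Orders_archive_ext) is hypothetical, and the external tables are assumed to exist already with matching columns.

```sql
-- Hypothetical names throughout. dbo.SensorReadings_ext is assumed to be an
-- external table over cleansed files in Hadoop; dbo.Orders is a local table;
-- dbo.Orders_archive_ext is an external table with the same columns as dbo.Orders.

-- Load data: copy the cleansed external data into a new local table.
SELECT SensorKey, CustomerKey, Speed
INTO dbo.SensorReadings
FROM dbo.SensorReadings_ext;

-- Age-out data: write old rows to external storage, then remove them locally.
-- Writing through an external table requires the 'allow polybase export' setting.
INSERT INTO dbo.Orders_archive_ext
SELECT * FROM dbo.Orders WHERE OrderDate < '20140101';

DELETE FROM dbo.Orders WHERE OrderDate < '20140101';
```

The aged-out rows remain queryable through dbo.Orders_archive_ext. On APS and SQL DW the same load is usually written with CREATE TABLE AS SELECT instead of SELECT INTO.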
In SQL Server 2016: querying Hadoop
SQL Server w/ PolyBase <-> Hadoop
Query: SELECT TOP 10 * FROM SQLServer S JOIN Hadoop H ON S.[Key] = H.[Key]
The query is issued to SQL Server, which reads from Hadoop and returns the query results.
In SQL Server 2016: querying Azure Blob Storage
SQL Server w/ PolyBase <-> Azure Blob Storage
Query: SELECT TOP 10 * FROM SQLServer S JOIN Blob B ON S.[Key] = B.[Key]
The query is issued to SQL Server, which reads from Azure Blob Storage and returns the query results.
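In SQL Server 2016: querying Hadoop and Azure Blob Storage together
SQL Server w/ PolyBase <-> Hadoop, Azure Blob Storage
Query: SELECT TOP 10 * FROM SQLServer S JOIN Hadoop H ON S.[Key] = H.[Key] JOIN Blob B ON S.[Key] = B.[Key]
The query is issued to SQL Server, which reads from both sources and returns the query results.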
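The slide queries use placeholder table names. A slightly more concrete sketch of the same three-way join might look like the following; the table and column names are assumptions made for illustration, with dbo.Customers a local table and the other two external tables already defined over Hadoop and Azure blob storage.

```sql
-- Hypothetical names: dbo.Customers is a local SQL Server table;
-- dbo.WebClicks_hdfs and dbo.Returns_blob are assumed to be external tables
-- over Hadoop and Azure blob storage respectively.
SELECT TOP 10
       c.CustomerKey,
       c.Name,
       w.Url,
       r.ReturnDate
FROM dbo.Customers      AS c
JOIN dbo.WebClicks_hdfs AS w ON w.CustomerKey = c.CustomerKey
JOIN dbo.Returns_blob   AS r ON r.CustomerKey = c.CustomerKey;
```

From the query author's point of view the external tables behave like ordinary tables; where the data lives is decided by each table's data source.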
Supported data sources for SQL Server 2016
Values for the “hadoop connectivity” setting (sp_configure: https://msdn.microsoft.com/en-us/library/mt143174.aspx):
• Option 0: Disable Hadoop connectivity
• Option 1: Hortonworks HDP 1.3 on Windows Server
• Option 1: Azure blob storage (WASB[S])
• Option 2: Hortonworks HDP 1.3 on Linux
• Option 3: Cloudera CDH 4.3 on Linux
• Option 4: Hortonworks HDP 2.0 on Windows Server
• Option 4: Azure blob storage (WASB[S])
• Option 5: Hortonworks HDP 2.0 on Linux
• Option 6: Cloudera 5.1, 5.2, 5.3, 5.4, and 5.5 on Linux
• Option 7: Hortonworks 2.1, 2.2, and 2.3 on Linux
• Option 7: Hortonworks 2.1, 2.2, and 2.3 on Windows Server
• Option 7: Azure blob storage (WASB[S]), HDInsight
Azure Data Lake Store and HDInsight push down are not supported yet. Work is under way on Metanautix integration to add support for additional data sources.
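Selecting one of the options above is done through sp_configure. A minimal sketch, assuming option 7 is the one that matches the target environment:

```sql
-- Pick the option number that matches the target distribution; 7 is used
-- here only as an example (Hortonworks 2.1 to 2.3, or Azure blob storage).
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;

-- Show the configured and running values for the setting.
EXEC sp_configure @configname = 'hadoop connectivity';
```

The SQL Server service and the PolyBase services need to be restarted before the new value takes effect.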
PolyBase Support
PolyBase works with | Azure Blob Store | Blob push down | HDInsight | HDInsight push down | Cloudera | Cloudera push down | Hortonworks | Hortonworks push down | Azure Data Lake Store | ADLS push down
SQL 2016 (Now) | Yes | N/A | Yes | No | Yes | Yes | Yes | Yes | No | N/A
SQL 2016 (Near future) | Yes | N/A | Yes | No | Yes | Yes | Yes | Yes | No | N/A
Azure SQL DW (Now) | Yes | N/A | Yes | No | No | No | No | No | Yes! | N/A
Azure SQL DW (Near future) | Yes | N/A | Yes | No | No | No | No | No | Yes | N/A
APS (Now) | Yes | N/A | Yes | Yes (int). No (ext) | Yes | Yes | Yes | Yes | No | N/A
APS (Near future) | Yes | N/A | Yes | Yes/No | Yes | Yes | Yes | Yes | No | N/A
Push down creates a MapReduce job on the Hadoop cluster to query the file and returns just the results.
PolyBase offers the ability to create statistics on external tables (but they are not auto-created or auto-updated).
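As a rough illustration of both notes, the sketch below creates statistics manually and then forces push down with a query hint. It assumes an external table named dbo.Customer with CustomerKey and Speed columns, defined on a data source that has RESOURCE_MANAGER_LOCATION set; the statistics name is hypothetical.

```sql
-- Statistics on an external table must be created explicitly.
-- To refresh them later, drop and re-create the statistics object.
CREATE STATISTICS stat_Customer_CustomerKey
ON dbo.Customer (CustomerKey) WITH FULLSCAN;

-- Ask for the filter and aggregation to run as a MapReduce job on the
-- Hadoop cluster so that only the results come back to SQL Server.
SELECT CustomerKey, AVG(Speed) AS AvgSpeed
FROM dbo.Customer
WHERE Speed > 65
GROUP BY CustomerKey
OPTION (FORCE EXTERNALPUSHDOWN);
```

OPTION (DISABLE EXTERNALPUSHDOWN) does the opposite and streams the raw rows into SQL Server for processing. Without either hint the optimizer decides, which is where the statistics matter.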
PolyBase supported file formats
Supports:
• Delimited text (UTF-8)
• Hive RCFile
• Hive ORC
• Parquet
• gzip, zlib, Snappy compressed files
Does not support:
• UTF-16
• Extended ASCII
• Fixed-file format
• WinZip
• JSON
• XML
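Each supported format maps to a FORMAT_TYPE in CREATE EXTERNAL FILE FORMAT. The definitions below are a sketch with hypothetical format names; the compression codecs are examples, not the only valid choices.

```sql
-- Hive ORC with Snappy compression.
CREATE EXTERNAL FILE FORMAT OrcSnappy
WITH (FORMAT_TYPE = ORC,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec');

-- Parquet with gzip compression.
CREATE EXTERNAL FILE FORMAT ParquetGzip
WITH (FORMAT_TYPE = PARQUET,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec');

-- Hive RCFile, which also needs a SerDe method.
CREATE EXTERNAL FILE FORMAT RcFileBinary
WITH (FORMAT_TYPE = RCFILE,
      SERDE_METHOD = 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe');
```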
PolyBase Scale-Out Groups in SQL Server 2016
https://msdn.microsoft.com/en-us/library/mt607030.aspx
Allows you to create a cluster of SQL Server instances to process large data sets from external data sources in a scale-out fashion for better query performance
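Joining an instance to a scale-out group is done with a system stored procedure. A minimal sketch, assuming a head node whose machine name is HEADNODE01 (hypothetical) and default port and instance names:

```sql
-- Run on each compute node. 'HEADNODE01' is a hypothetical machine name;
-- 16450 is the default DMS control channel port and MSSQLSERVER the
-- default instance name on the head node.
EXEC sp_polybase_join_group 'HEADNODE01', 16450, 'MSSQLSERVER';
-- Restart the PolyBase Data Movement service on the node afterwards.

-- Run on the head node to list the nodes currently in the group.
SELECT * FROM sys.dm_exec_compute_nodes;
```

A node can be removed again with sp_polybase_leave_group.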
Why use PolyBase?
• Ability to integrate SQL Server with data stored in HDFS or Windows Azure Storage BLOB
• Commodity hardware and storage are cheap and easily distributed on HDFS, which increases data reliability at a low cost
• Increasing number of different types of data: structured, unstructured, semi-structured. Each can be stored on the most suitable system and queried in one place
• Increasing size of data and strong aversion to data deletion due to company culture or restrictions
How to use PolyBase?
• Install the PolyBase engine and prerequisites (Java)
• Enable “Hadoop connectivity”
• Create a credential to connect to Azure BLOB storage
• Create External Data Source
• Create External File Format
• Create External Table/View
• Enable “Allow PolyBase Export” to write to Hadoop/WASB
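For Azure blob storage, the steps above can be sketched as follows. All object names are hypothetical, and the values in angle brackets are placeholders that must be replaced; this is an outline under those assumptions rather than a tested script.

```sql
-- A database master key is needed before a database scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- For blob storage the IDENTITY value is not used for authentication;
-- SECRET is the storage account key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
      CREDENTIAL = AzureStorageCredential);

CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

CREATE EXTERNAL TABLE dbo.SensorReadings_blob (
    SensorKey   int   NOT NULL,
    CustomerKey int   NOT NULL,
    Speed       float NOT NULL)
WITH (LOCATION = '/sensor-data/',
      DATA_SOURCE = AzureStorage,
      FILE_FORMAT = PipeDelimitedText);

-- Only needed when writing to Hadoop/WASB through an external table.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
```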
Creating External Tables (secure Hadoop)

-- Once per Hadoop user
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
WITH IDENTITY = 'hadoopUserName', SECRET = 'hadoopPassword';

-- Once per Hadoop cluster per user
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP,
      LOCATION = 'hdfs://10.193.26.177:8020',
      RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050',
      CREDENTIAL = HadoopCredential);

-- Once per file format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));

-- LOCATION is the HDFS file path
CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [Speed] float NOT NULL)
WITH (LOCATION = '//Sensor_Data//May2014/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TextFile);
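Once the objects exist, the catalog views give a quick way to check what was created before running a first query. A small sketch, assuming the [dbo].[Customer] external table from this slide:

```sql
-- Inspect the PolyBase objects defined in the current database.
SELECT name, location, type_desc FROM sys.external_data_sources;
SELECT name, format_type, field_terminator FROM sys.external_file_formats;
SELECT name, location FROM sys.external_tables;

-- Smoke test: read a few rows through the external table.
SELECT TOP 10 SensorKey, CustomerKey, Speed
FROM dbo.Customer;
```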
Demos
• Using PolyBase to connect to Windows Azure Storage BLOB
• Reading data using PolyBase
• Writing data using PolyBase
• Creating JSON formatted output
• Reading JSON data from WASB
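The demo scripts are not part of this transcript. As one possible approach to the two JSON items, and not necessarily what the demo showed, SQL Server 2016's built-in JSON functions can be combined with PolyBase. The table names below are hypothetical: dbo.SensorReadings is a local table, and dbo.SensorJson_blob is assumed to be an external table with a single nvarchar column (RawJson) holding one JSON document per line, since JSON is not a supported PolyBase file format.

```sql
-- Creating JSON formatted output from a relational query.
SELECT TOP 10 SensorKey, CustomerKey, Speed
FROM dbo.SensorReadings
FOR JSON PATH;

-- Reading JSON stored in WASB: each line arrives as plain text and is
-- parsed in SQL Server. The property names are assumptions.
SELECT JSON_VALUE(RawJson, '$.sensorKey') AS SensorKey,
       JSON_VALUE(RawJson, '$.speed')     AS Speed
FROM dbo.SensorJson_blob
WHERE ISJSON(RawJson) = 1;
```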
Resources
• PolyBase guide: https://msdn.microsoft.com/en-us/library/mt143171.aspx
• Azure SQL Data Warehouse loading patterns and strategies: http://bit.ly/1XskZL2
Q & A?
James Serra, Big Data Evangelist
• Email me at: [email protected]
• Follow me at: @JamesSerra
• Link to me at: www.linkedin.com/in/JamesSerra
• Visit my blog at: JamesSerra.com (where this slide deck is posted via the “Presentations” link on the top menu)